# Textbooks Are All You Need - Implementation with Alternative Datasets

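This notebook follows the approach of the paper *Textbooks Are All You Need* but substitutes an alternative source for each of its three datasets. The token counts in parentheses are the sizes the paper reports for its own datasets: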

1. **Main Code Dataset** (~6B tokens)
   - CodeParrot dataset instead of The Stack/StackOverflow
   - Filtered for high-quality examples with a classifier, following the paper's approach

2. **Textbook Dataset** (<1B tokens)
   - Python official documentation instead of GPT-3.5 generated textbooks
   - Includes tutorials, language reference, and library docs

3. **Exercises Dataset** (~180M tokens)
   - LeetCode problems instead of the paper's synthetic exercises
   - Includes problem descriptions and solutions

In [1]:
# Install required packages
!pip install datasets transformers torch beautifulsoup4 requests


## 1. Main Code Dataset - CodeParrot

In [None]:
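from datasets import load_dataset
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

print("Loading CodeParrot dataset...")
code_dataset = load_dataset("codeparrot/codeparrot-clean", split="train")

# Load the quality classifier (similar to the paper's approach).
# NOTE: codebert-base ships without a classification head, so the head is
# randomly initialized here and must be fine-tuned on quality labels before
# its scores mean anything.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
classifier = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2
)
classifier.eval()  # Disable dropout for inference

def filter_quality_code(code: str, threshold: float = 0.8) -> bool:
    """Return True if the classifier's P(high quality) exceeds the threshold"""
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():  # No gradients needed when only scoring
        logits = classifier(**inputs).logits
    # Convert logits to probabilities; label 1 is taken to mean "high quality"
    prob_high_quality = torch.softmax(logits, dim=-1)[0, 1].item()
    return prob_high_quality > threshold

# Filter dataset (scores one example at a time, so this is slow on the full split)
filtered_code = [x for x in code_dataset if filter_quality_code(x['content'])]
print(f"Filtered dataset size: {len(filtered_code)} examples")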

Loading CodeParrot dataset...

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
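
The threshold in `filter_quality_code` is easier to reason about as a probability than as a raw logit. The sketch below uses only the standard library to show the conversion. The logit values are made up for illustration, and treating index 1 as the "high quality" class is an assumption carried over from the cell above. As the warning above indicates, the scores only become meaningful once the classification head has been fine-tuned.

```python
import math

def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def is_high_quality(logits, threshold=0.8):
    """Threshold P(class 1), assumed here to be the "high quality" class."""
    return softmax(logits)[1] > threshold

# Illustrative logits only (not real classifier output)
example_logits = [0.2, 0.9]
probs = softmax(example_logits)
print(probs)
print(is_high_quality(example_logits))
```

In this example the raw logit for class 1 (0.9) is above 0.8, but the corresponding probability is not, so comparing the logit directly would accept a sample the probability threshold rejects.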



## 2. Textbook Dataset - Python Documentation

In [None]:
import requests
from bs4 import BeautifulSoup

# Index pages for the tutorial, language reference, and library docs
DOC_URLS = [
    "https://docs.python.org/3/tutorial/",
    "https://docs.python.org/3/reference/",
    "https://docs.python.org/3/library/",
]

def get_python_docs():
    """Get Python documentation content as plain text.

    Only the index page of each section is fetched; the chapter pages it
    links to are not followed.
    """
    pages = []
    for url in DOC_URLS:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # Fail early on HTTP errors
        soup = BeautifulSoup(response.text, 'html.parser')
        pages.append(soup.get_text())
    return "\n".join(pages)

textbook_content = get_python_docs()
print(f"Textbook content size: {len(textbook_content)} characters")
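
`get_python_docs` downloads only the index page of each section, which is mostly a table of contents. One way to reach the actual chapters is to collect the links from each index page and fetch them in turn. The sketch below shows only the link-collection step, using the standard library's `HTMLParser` on a small inline HTML sample written for illustration. `LinkCollector` and `collect_links` are hypothetical names introduced here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def collect_links(base_url, html):
    """Return unique absolute .html links under base_url, in page order."""
    parser = LinkCollector()
    parser.feed(html)
    seen, links = set(), []
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url.startswith(base_url) and url.endswith(".html") and url not in seen:
            seen.add(url)
            links.append(url)
    return links

sample_html = """
<ul>
  <li><a href="appetite.html">1. Whetting Your Appetite</a></li>
  <li><a href="interpreter.html#invoking">2.1 Invoking the Interpreter</a></li>
  <li><a href="interpreter.html">2. Using the Python Interpreter</a></li>
  <li><a href="https://www.python.org/">Python home</a></li>
</ul>
"""
chapter_links = collect_links("https://docs.python.org/3/tutorial/", sample_html)
print(chapter_links)
```

Each collected URL could then be passed through the same `requests.get` and `BeautifulSoup` steps used for the index pages.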

## 3. Exercises Dataset - LeetCode Problems

In [None]:
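def get_problem_detail(title_slug):
    """Placeholder: this notebook does not define the detail lookup.

    Records only the slug and the public problem URL; fetching the
    description and a solution still has to be implemented.
    """
    return {
        'slug': title_slug,
        'url': f"https://leetcode.com/problems/{title_slug}/",
    }

def get_leetcode_problems():
    """Get the free LeetCode problems (details come from get_problem_detail)"""
    # Using the LeetCode API to get the problem list
    api_url = "https://leetcode.com/api/problems/all/"
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors
    problems = response.json()

    exercises = []
    for problem in problems['stat_status_pairs']:
        if not problem['paid_only']:  # Skip premium-only problems
            problem_detail = get_problem_detail(problem['stat']['question__title_slug'])
            exercises.append(problem_detail)

    return exercises

exercises = get_leetcode_problems()
print(f"Exercise dataset size: {len(exercises)} problems")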
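
The paper's exercises are short functions whose docstring states the task and whose body is the solution. A LeetCode problem can be rendered into a similar shape before training. The sketch below is one possible formatting; `format_exercise` and its parameters are hypothetical and do not reflect the structure returned by the LeetCode API.

```python
import textwrap

def format_exercise(func_signature: str, description: str, solution_body: str) -> str:
    """Render a problem as a function with a docstring followed by its solution."""
    docstring = textwrap.indent(f'"""\n{description.strip()}\n"""', "    ")
    body = textwrap.indent(textwrap.dedent(solution_body).strip(), "    ")
    return f"{func_signature}\n{docstring}\n{body}\n"

example = format_exercise(
    "def two_sum(nums, target):",
    "Return the indices of the two numbers in nums that add up to target.",
    """
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    """,
)
print(example)
```

Keeping the description inside the docstring means every training example is itself valid Python.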

## Combine Datasets

In [None]:
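def prepare_combined_dataset():
    """Combine and prepare all datasets for training"""
    # Store every source as one string so all stages share the same encoding
    all_data = {
        'code': '\n'.join(x['content'] for x in filtered_code),
        'textbook': textbook_content,
        # Each exercise is assumed to be a dict; its values are joined as text
        'exercises': '\n'.join(
            '\n'.join(str(v) for v in ex.values()) for ex in exercises
        ),
    }

    # Create a character-level vocabulary from the text itself
    # (joining the dict directly would only join its three keys)
    chars = sorted(set(''.join(all_data.values())))
    vocab_size = len(chars)

    # Create encoding maps
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}

    return all_data, vocab_size, stoi, itos

data, vocab_size, stoi, itos = prepare_combined_dataset()
print(f"Total vocabulary size: {vocab_size}")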
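
The `stoi` and `itos` maps define a character-level tokenizer: encoding looks up each character's integer id, and decoding reverses the lookup. The sketch below rebuilds the maps on a tiny made-up corpus so that it runs on its own. The `demo_` names, `encode`, and `decode` are introduced here for illustration.

```python
# Tiny stand-in for the combined training text
demo_corpus = "def add(a, b):\n    return a + b\n"

demo_chars = sorted(set(demo_corpus))
demo_stoi = {ch: i for i, ch in enumerate(demo_chars)}
demo_itos = {i: ch for i, ch in enumerate(demo_chars)}

def encode(text):
    """Map each character to its integer id (KeyError on unseen characters)."""
    return [demo_stoi[ch] for ch in text]

def decode(ids):
    """Map integer ids back to the original string."""
    return "".join(demo_itos[i] for i in ids)

ids = encode("add(a, b)")
print(ids)
print(decode(ids))
```

Because the vocabulary is built from the corpus, any character that never appears in the training text cannot be encoded later, which is worth keeping in mind for evaluation prompts.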

## Training Process

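The paper pretrains on the filtered code and textbook data together and then finetunes on the exercises. This notebook splits that into three sequential stages: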
1. Pretrain on filtered code dataset
2. Train on textbook dataset
3. Finetune on exercises

In [None]:
# Training configuration
config = {
    'batch_size': 64,
    'block_size': 256,
    'max_iters': 5000,
    'learning_rate': 1e-4,
    'n_embd': 512,
    'n_head': 8,
    'n_layer': 8,
    'dropout': 0.2
}

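# NOTE: train_on_dataset(dataset, config, model=None) is not defined in this
# notebook; it is expected to create a model when none is passed, train it on
# the given dataset, and return it.
def train_model(data, config):
    """Train the model in three sequential stages, reusing it between stages"""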
    # 1. Pretrain on filtered code
    print("Stage 1: Pretraining on filtered code...")
    model = train_on_dataset(data['code'], config)

    # 2. Train on textbook content
    print("\nStage 2: Training on textbook content...")
    model = train_on_dataset(data['textbook'], config, model)

    # 3. Finetune on exercises
    print("\nStage 3: Finetuning on exercises...")
    model = train_on_dataset(data['exercises'], config, model)

    return model

# Start training
model = train_model(data, config)
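
It helps to check what the configuration implies before launching a run. The sketch below estimates the number of tokens processed per stage and the rough size of the transformer blocks. The `12 * n_layer * n_embd**2` figure is a common rule of thumb for a GPT-style decoder that ignores embeddings, biases, and layer norms; whether it applies depends on how `train_on_dataset` builds the model, which this notebook does not show. `example_config` copies the relevant values from the configuration above.

```python
def tokens_seen(cfg):
    """Tokens processed in one stage: batch_size * block_size * max_iters."""
    return cfg['batch_size'] * cfg['block_size'] * cfg['max_iters']

def approx_block_params(cfg):
    """Rule-of-thumb parameter count for the transformer blocks only."""
    return 12 * cfg['n_layer'] * cfg['n_embd'] ** 2

example_config = {
    'batch_size': 64,
    'block_size': 256,
    'max_iters': 5000,
    'n_embd': 512,
    'n_layer': 8,
}

print(f"Tokens per stage: {tokens_seen(example_config):,}")
print(f"Approx. block parameters: {approx_block_params(example_config):,}")
```

With these values a stage processes about 82 million tokens, a small fraction of a multi-billion-token code dataset, so `max_iters` is the main setting to scale if the full data is to be covered.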

## Evaluation

Test the model on coding tasks similar to HumanEval, counting a problem as solved only if the generated code passes all of its tests.

In [None]:
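def evaluate_model(model, test_problems):
    """Return the fraction of problems whose generated solution passes its tests.

    generate_solution, test_solution, and test_problems are not defined in
    this notebook and must be provided.
    """
    total = len(test_problems)
    if total == 0:
        return 0.0  # Avoid dividing by zero on an empty problem set

    correct = 0
    for problem in test_problems:
        generated_code = generate_solution(model, problem['prompt'])
        if test_solution(generated_code, problem['tests']):
            correct += 1

    return correct / total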

# Run evaluation
accuracy = evaluate_model(model, test_problems)
print(f"Model accuracy: {accuracy:.2%}")
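
`evaluate_model` relies on `test_solution`, which the notebook does not define. A minimal sketch of such a checker is given below, assuming each problem's `tests` field is a list of assertion strings (an assumption; HumanEval itself ships a `check` function per task). `passes_tests` is a hypothetical name that could be bound to `test_solution`. It runs code with `exec` in the current process, so it is only suitable for trusted code; model-generated code should be run in a sandbox with a timeout.

```python
def passes_tests(generated_code: str, tests: list) -> bool:
    """Return True if the code runs and every assertion string passes."""
    namespace = {}
    try:
        exec(generated_code, namespace)   # define the candidate function(s)
        for test in tests:
            exec(test, namespace)         # each test is an assert statement
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as failures
        return False
    return True

good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
example_tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]

print(passes_tests(good, example_tests))
print(passes_tests(bad, example_tests))
```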