<a href="https://colab.research.google.com/github/khnhenriette/ProjectADL/blob/math-medium/medium_fine_tune_math.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Fine-tune gpt2-medium for basic math tasks

In [1]:
!pip install transformers datasets
!pip install torch


Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (

The dataset math_dataset.json contains 10,000 simple math examples of the form "input": "89 minus 84 equals", "output": "5", covering addition, subtraction, multiplication, and division. Upload the file to Google Colab before running the next cell.
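Before training, it can help to confirm that the file really has the shape described above. The check below is a minimal sketch: the helper name `validate_records` is hypothetical, and it assumes only what the description states, namely that the file holds a list of objects with string "input" and "output" fields.

```python
import json

def validate_records(records):
    """Return the indices of records that do not match the expected shape."""
    bad = []
    for i, rec in enumerate(records):
        ok = (
            isinstance(rec, dict)
            and isinstance(rec.get("input"), str)
            and isinstance(rec.get("output"), str)
            and rec["input"].strip() != ""
        )
        if not ok:
            bad.append(i)
    return bad

# Tiny in-memory sample in the same format as math_dataset.json;
# the second record is deliberately missing its "output" field
sample = json.loads(
    '[{"input": "89 minus 84 equals", "output": "5"},'
    ' {"input": "29 minus 70 equals"}]'
)
print(validate_records(sample))  # [1]
```

On the real file, pass the list returned by `json.load`; an empty result means every record has both fields.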

In [2]:
from datasets import Dataset
import json

# Load the dataset
with open('math_dataset.json', 'r') as file:
    data = json.load(file)  # This is a list of dictionaries

# Convert the list of dictionaries into a Hugging Face Dataset
dataset = Dataset.from_list(data)  # Use from_list for list input

# Split dataset into training and validation sets
split_dataset = dataset.train_test_split(test_size=0.1)
train_dataset = split_dataset['train']
eval_dataset = split_dataset['test']

print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(eval_dataset)}")


Training examples: 9000
Validation examples: 1000


In [4]:
print(train_dataset[0])

{'input': '29 minus 70 equals', 'output': '-41'}


Use the Hugging Face Trainer to fine-tune the pretrained gpt2-medium model so that it performs better on these simple math tasks

In [7]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments

# Load GPT-2 Medium tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

# Set the EOS token as the padding token
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the dataset
def tokenize_function(examples):
    # Tokenize the prompts, padding every sequence to a fixed length
    model_inputs = tokenizer(
        examples['input'],
        truncation=True,
        padding="max_length",  # Ensures uniform input length
        max_length=128,        # Adjust max length as needed
    )
    # Tokenize the targets the same way. GPT-2 uses one tokenizer for inputs
    # and targets, so the deprecated as_target_tokenizer() context is not needed.
    labels = tokenizer(
        examples['output'],
        truncation=True,
        padding="max_length",  # Ensures uniform output length
        max_length=128,        # Adjust max length as needed
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply the tokenizer
train_dataset = train_dataset.map(tokenize_function, batched=True)
eval_dataset = eval_dataset.map(tokenize_function, batched=True)


# Define training arguments
training_args = TrainingArguments(
    output_dir="./gpt2_math_finetuned",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_steps=500,
    save_total_limit=2,
    push_to_hub=False,
    logging_dir="./logs",  # Directory for storing logs
    logging_steps=50,      # Log progress every 50 steps
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)

# Start training
trainer.train()



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.0289        | 0.029103        |
| 2     | 0.0276        | 0.028629        |
| 3     | 0.0268        | 0.028536        |
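When reading these loss values, note that the tokenization step builds the labels from the output string alone and pads them to 128 tokens, so most label positions hold the padding (EOS) token and the labels are not position-aligned with the prompt tokens. A common alternative for causal language models is to concatenate prompt and answer into one sequence and set the label to -100 (the index the loss ignores) on prompt and padding positions. The sketch below illustrates that layout on plain lists of token ids; `build_causal_example` and the toy ids are hypothetical, and this is not what the run above used.

```python
IGNORE_INDEX = -100   # label value skipped by the cross-entropy loss in PyTorch / transformers
GPT2_EOS_ID = 50256   # GPT-2's end-of-text token id, also used as padding in this notebook

def build_causal_example(prompt_ids, answer_ids, max_length, eos_id=GPT2_EOS_ID):
    """Concatenate prompt + answer + EOS, right-pad, and mask non-answer labels."""
    tokens = list(prompt_ids) + list(answer_ids) + [eos_id]
    # Loss is only wanted on the answer tokens and the closing EOS
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids) + [eos_id]
    tokens, labels = tokens[:max_length], labels[:max_length]
    attention_mask = [1] * len(tokens)
    pad = max_length - len(tokens)
    return {
        "input_ids": tokens + [eos_id] * pad,
        "attention_mask": attention_mask + [0] * pad,
        "labels": labels + [IGNORE_INDEX] * pad,
    }

# Toy ids standing in for a tokenized prompt and answer
example = build_causal_example(prompt_ids=[11, 12, 13], answer_ids=[21], max_length=8)
print(example["labels"])  # [-100, -100, -100, 21, 50256, -100, -100, -100]
```

In real use, `prompt_ids` and `answer_ids` would come from the tokenizer's `input_ids` for the "input" and "output" fields, and the returned dictionary has the same keys the Trainer expects.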


TrainOutput(global_step=3375, training_loss=0.03526891844360917, metrics={'train_runtime': 3118.282, 'train_samples_per_second': 8.659, 'train_steps_per_second': 1.082, 'total_flos': 6268729688064000.0, 'train_loss': 0.03526891844360917, 'epoch': 3.0})
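The reported global_step can be cross-checked against the training settings: 9000 training examples at a per-device batch size of 8 give 1125 optimizer steps per epoch, and three epochs give 3375. The short calculation below assumes a single device and no gradient accumulation, which matches the library defaults but is not stated in the notebook.

```python
import math

train_examples = 9000      # size of the training split printed earlier
batch_size = 8             # per_device_train_batch_size
epochs = 3                 # num_train_epochs
train_runtime = 3118.282   # seconds, from the TrainOutput metrics

# Assumes one device and gradient_accumulation_steps=1
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 1125 3375

# Throughput implied by the reported runtime
steps_per_second = round(total_steps / train_runtime, 3)
samples_per_second = round(train_examples * epochs / train_runtime, 3)
print(steps_per_second, samples_per_second)  # 1.082 8.659
```

Both throughput figures agree with the `train_steps_per_second` and `train_samples_per_second` values in the output.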

Save the fine-tuned model for future use

In [8]:
# Save the model and tokenizer
model.save_pretrained("./gpt2_math_finetuned")
tokenizer.save_pretrained("./gpt2_math_finetuned")


('./gpt2_math_finetuned/tokenizer_config.json',
 './gpt2_math_finetuned/special_tokens_map.json',
 './gpt2_math_finetuned/vocab.json',
 './gpt2_math_finetuned/merges.txt',
 './gpt2_math_finetuned/added_tokens.json')
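To check the saved model later, it can be reloaded from the same directory and asked to complete a prompt. The following is a sketch rather than part of the original notebook: `extract_answer` and `predict` are hypothetical helper names, the generation settings (greedy decoding, 8 new tokens) are assumptions, and `predict` needs the saved `./gpt2_math_finetuned` directory to exist.

```python
def extract_answer(prompt, generated_text):
    """Return the first whitespace-separated token produced after the prompt."""
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    parts = continuation.split()
    return parts[0] if parts else ""

def predict(prompt, model_dir="./gpt2_math_finetuned", max_new_tokens=8):
    """Reload the saved model and greedily complete a prompt."""
    # Imported here so extract_answer can be used without transformers installed
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained(model_dir)
    model = GPT2LMHeadModel.from_pretrained(model_dir)
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,                      # greedy decoding
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return extract_answer(prompt, text)

# The parsing step can be tried on its own with a made-up completion string
parsed = extract_answer("89 minus 84 equals", "89 minus 84 equals 5 and")
print(parsed)  # 5
```

Calling `predict("89 minus 84 equals")` in the Colab session would then return whatever the fine-tuned model generates first after the prompt.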

In [9]:
!zip -r gpt2_math_finetuned.zip ./gpt2_math_finetuned


  adding: gpt2_math_finetuned/ (stored 0%)
  adding: gpt2_math_finetuned/generation_config.json (deflated 24%)
  adding: gpt2_math_finetuned/tokenizer_config.json (deflated 55%)
  adding: gpt2_math_finetuned/merges.txt (deflated 53%)
  adding: gpt2_math_finetuned/config.json (deflated 52%)
  adding: gpt2_math_finetuned/checkpoint-3375/ (stored 0%)
  adding: gpt2_math_finetuned/checkpoint-3375/generation_config.json (deflated 24%)
  adding: gpt2_math_finetuned/checkpoint-3375/tokenizer_config.json (deflated 55%)
  adding: gpt2_math_finetuned/checkpoint-3375/training_args.bin (deflated 51%)
  adding: gpt2_math_finetuned/checkpoint-3375/trainer_state.json (deflated 82%)
  adding: gpt2_math_finetuned/checkpoint-3375/merges.txt (deflated 53%)
  adding: gpt2_math_finetuned/checkpoint-3375/rng_state.pth (deflated 25%)
  adding: gpt2_math_finetuned/checkpoint-3375/config.json (deflated 52%)
  adding: gpt2_math_finetuned/checkpoint-3375/vocab.json (deflated 68%)
  adding: gpt2_math_finetuned/ch

In [10]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [11]:
!mv gpt2_math_finetuned.zip /content/drive/MyDrive/
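The zip-and-move steps can also be done from Python with the standard library, which is handy when restoring the model in a later session. This is a minimal sketch using `shutil`; the function names and the temporary paths in the demo are hypothetical, and on Colab the destination would be the mounted Drive folder.

```python
import os
import shutil
import tempfile

def archive_dir(src_dir, archive_base):
    """Zip the contents of src_dir and return the path of the created .zip file."""
    # Unlike `zip -r`, this stores the files without the top-level folder name
    return shutil.make_archive(archive_base, "zip", root_dir=src_dir)

def restore_dir(archive_path, dest_dir):
    """Unpack a .zip archive into dest_dir."""
    shutil.unpack_archive(archive_path, dest_dir, "zip")

# Demo on a throwaway directory instead of the real model folder
work = tempfile.mkdtemp()
src = os.path.join(work, "model_dir")
os.makedirs(src)
with open(os.path.join(src, "config.json"), "w") as f:
    f.write("{}")

zip_path = archive_dir(src, os.path.join(work, "model_backup"))
restore_dir(zip_path, os.path.join(work, "restored"))
restored_files = os.listdir(os.path.join(work, "restored"))
print(restored_files)  # ['config.json']
```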
