# Math Question Answer Verification Competition

## Starter Code

Borrowed from the [official Unsloth implementation](https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing#scrollTo=MKX_XKs_BNZR).

In [1]:
# %%capture
# This cell will take time
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Found existing installation: unsloth 2024.11.7
Uninstalling unsloth-2024.11.7:
  Successfully uninstalled unsloth-2024.11.7
Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-c7acanpx/unsloth_3e758a97f2b04ef3bc338324b4f40e53
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-c7acanpx/unsloth_3e758a97f2b04ef3bc338324b4f40e53
  Resolved https://github.com/unslothai/unsloth.git to commit f26d4e739ed507de7a9088da53d10fd02f58d160
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: unsloth
  Building wheel for unsloth (pyproject.toml) ... done
  Created wheel for unsloth: filename=unsloth-2024.11.7-py3-none-a

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [3]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2024.11.7: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu124. CUDA = 8.0. CUDA Toolkit = 12.4.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## Load model and wrap with LoRA adapters

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.11.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
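
The adapter settings above determine how many parameters are actually trained. As a rough check, the sketch below counts the LoRA parameters for rank 32 over the seven target modules. The layer dimensions (hidden size 4096, MLP size 14336, 1024-wide key/value projections, 32 layers) are assumptions taken from the published Llama 3.1 8B configuration, not from this notebook. Under those assumptions the total agrees with the 83,886,080 trainable parameters reported in the training banner in the SFT section.

```python
# Assumed Llama 3.1 8B dimensions (from the public model config, not this notebook).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024  # 8 key/value heads x 128 head dim (grouped-query attention)
NUM_LAYERS = 32

# (in_features, out_features) of each linear layer listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(r):
    """Each adapted layer adds two matrices: A (r x in) and B (out x r)."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS

r, lora_alpha = 32, 16
print(lora_param_count(r))    # 83886080
print(lora_alpha / r)         # 0.5, the standard LoRA scaling factor
print(lora_alpha / r ** 0.5)  # the scaling rank-stabilized LoRA would use instead
```

With `lora_alpha = 16` and `r = 32`, the adapter update is scaled by 0.5; raising `r` without raising `lora_alpha` shrinks that factor further.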


## Competition dataset

In [5]:
# download and load competition dataset

from datasets import load_dataset
dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp")
# print and see dataset
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'is_correct', 'answer', 'solution'],
        num_rows: 1000000
    })
    test: Dataset({
        features: ['question', 'is_correct', 'answer', 'solution'],
        num_rows: 10000
    })
})

In [6]:
# Define Prompt Templates for Different Types of Problems and Add Code Solution Guidance

prompt_template_arithmetic = """You are a great mathematician and you are tasked with finding if an answer to a given maths question is correct or not. Your response should be 'True' if correct, otherwise 'False'. Below is Question, Answer and Explanation. To solve this question accurately, use code if necessary.

### Validation Process:
1. Focus on the QUESTION and STUDENT ANSWER:
   - Understand what the question is asking for
   - Evaluate if the student's answer makes mathematical sense
   - Check if the answer format matches the question requirements

2. Mathematical Correctness:
   - Verify if the answer satisfies the question conditions
   - Check if the mathematical expression is valid
   - Ensure the answer is in the correct domain

3. Format Considerations:
   - Ensure LaTeX expressions are properly formatted
   - Verify the correctness of any dimensional constraints

### Question:
{}

### Answer:
{}

### Explanation:
{}

### Output:
{}"""

prompt_template_algebra = """You are a great mathematician specializing in algebra and you are tasked with finding if an answer to a given maths question is correct or not. Your response should be 'True' if correct, otherwise 'False'. Below is Question, Answer and Explanation. You can use code if needed.

### Validation Process:
1. Focus on the QUESTION and STUDENT ANSWER:
   - Understand what the question is asking for
   - Evaluate if the student's answer makes mathematical sense
   - Check if the answer format matches the question requirements

2. Mathematical Correctness:
   - Verify if the answer satisfies the question conditions
   - Check if the mathematical expression is valid
   - Ensure the answer is in the correct domain

3. Format Considerations:
   - Ensure LaTeX expressions are properly formatted
   - Verify the correctness of any dimensional constraints

### Question:
{}

### Answer:
{}

### Explanation:
{}

### Output:
{}"""

prompt_template_geometry = """You are a great mathematician with expertise in geometry and you are tasked with finding if an answer to a given maths question is correct or not. Your response should be 'True' if correct, otherwise 'False'. Below is Question, Answer and Explanation. You can use code if it can assist with calculations.

### Validation Process:
1. Focus on the QUESTION and STUDENT ANSWER:
   - Understand what the question is asking for
   - Evaluate if the student's answer makes mathematical sense
   - Check if the answer format matches the question requirements

2. Mathematical Correctness:
   - Verify if the answer satisfies the question conditions
   - Check if the mathematical expression is valid
   - Ensure the answer is in the correct domain

3. Format Considerations:
   - Ensure LaTeX expressions are properly formatted
   - Verify the correctness of any dimensional constraints

### Question:
{}

### Answer:
{}

### Explanation:
{}

### Output:
{}"""

prompt_template_word_problem = """You are a great mathematician with experience in real-life applications and you are tasked with finding if an answer to a given maths question is correct or not. Your response should be 'True' if correct, otherwise 'False'. Below is Question, Answer and Explanation. You can use code if it helps verify the answer.

### Validation Process:
1. Focus on the QUESTION and STUDENT ANSWER:
   - Understand what the question is asking for
   - Evaluate if the student's answer makes mathematical sense
   - Check if the answer format matches the question requirements

2. Mathematical Correctness:
   - Verify if the answer satisfies the question conditions
   - Check if the mathematical expression is valid
   - Ensure the answer is in the correct domain

3. Format Considerations:
   - Ensure LaTeX expressions are properly formatted
   - Verify the correctness of any dimensional constraints

### Question:
{}

### Answer:
{}

### Explanation:
{}

### Output:
{}"""

# Choose the appropriate prompt template based on the problem content
def select_prompt_template(question):
    if "rationalize" in question or "solve for" in question:
        return prompt_template_algebra
    elif "vector" in question or "area" in question or "geometry" in question:
        return prompt_template_geometry
    elif "calculate" in question or "total" in question or "time" in question:
        return prompt_template_arithmetic
    else:
        return prompt_template_word_problem

# Add the EOS_TOKEN and generate a text-format prompt.
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    questions = examples["question"]
    answers = examples["answer"]
    solutions = examples["solution"]  # Add a new solution column.
    outputs = examples["is_correct"]

    texts = []
    for question, answer, solution, output in zip(questions, answers, solutions, outputs):
        # Select the appropriate template.
        prompt_template = select_prompt_template(question)

        # Format the prompt and add solution to the Explanation section.
        text = prompt_template.format(question, answer, solution, output) + EOS_TOKEN
        texts.append(text)

    return {"text": texts}
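
The keyword routing in `select_prompt_template` is case-sensitive and matches substrings, so a question that starts with "Solve for" falls through to the word-problem template, and "time" also matches "times". The sketch below mirrors the routing with plain labels in place of the templates to show what lowercasing the question first would change. The `route` helper and its `case_sensitive` flag are illustrative names, not part of the notebook.

```python
def route(question, case_sensitive=True):
    """Mirror the keyword checks of select_prompt_template, returning a label."""
    q = question if case_sensitive else question.lower()
    if "rationalize" in q or "solve for" in q:
        return "algebra"
    elif "vector" in q or "area" in q or "geometry" in q:
        return "geometry"
    elif "calculate" in q or "total" in q or "time" in q:
        return "arithmetic"
    return "word_problem"

question = "Solve for x: 2x + 3 = 7"
print(route(question))                        # word_problem (capital "S" is not matched)
print(route(question, case_sensitive=False))  # algebra
print(route("What is 3 times 4?"))            # arithmetic ("time" matches inside "times")
```

Whichever variant is used, the same routing should be applied when building training prompts and inference prompts, so the model sees a consistent template for a given question.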



In [7]:
# Process the training dataset and generate prompt for each datapoint

train_dataset = dataset['train'].map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/1000000 [00:00<?, ? examples/s]

In [None]:
# Print a sample training example
train_dataset['text'][0]

## SFT

In [8]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from transformers import TrainerCallback

training_args = TrainingArguments(
  per_device_train_batch_size = 8,
  gradient_accumulation_steps = 8,
  warmup_ratio = 0.1, # Change to warmup ratio instead of warmup steps
  num_train_epochs = 3, # Train for 3 full epochs.
  # max_steps = 50,
  learning_rate = 2e-4,
  fp16 = not is_bfloat16_supported(),
  bf16 = is_bfloat16_supported(),
  logging_steps = 1,
  optim = "adamw_8bit",
  weight_decay = 0.05,
  lr_scheduler_type = "cosine", # Change from linear to cosine
  seed = 3407,
  output_dir = "outputs",
  eval_strategy = "steps", # Evaluate on a step schedule
  eval_steps = 50, # Evaluate every 50 steps
  save_strategy = "steps", # Save checkpoints on the same schedule (required by load_best_model_at_end)
  save_steps = 50, # Save every 50 steps
  load_best_model_at_end = True, # Reload the best checkpoint after training
  metric_for_best_model = "eval_loss",
  max_grad_norm = 0.3,
  report_to = "none", # "none" disables logging integrations; set to "wandb" etc. to enable one
  )

import random

# Set a random seed to ensure reproducibility.
random.seed(42)

# Randomly sample 30,000 data entries.
train_dataset = train_dataset.shuffle(seed=42).select(range(30000))


# Train test split
from sklearn.model_selection import train_test_split
# From the 30,000 sampled training entries, hold out 10% for evaluation.
train_indices, val_indices = train_test_split(
  range(len(train_dataset)),
  test_size=0.1,
  random_state=42,
  stratify=train_dataset['is_correct'] # Stratification
  )

trainer = SFTTrainer(
  model = model,
  tokenizer = tokenizer,
  train_dataset = train_dataset.select(train_indices), # training dataset
  eval_dataset = train_dataset.select(val_indices), # evaluation dataset
  dataset_text_field = "text",
  max_seq_length = max_seq_length,
  dataset_num_proc = 4,
  packing = False, # Can make training 5x faster for short sequences.
  args = training_args
  )

trainer_stats = trainer.train()

Map (num_proc=4):   0%|          | 0/27000 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/3000 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 27,000 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 8
\        /    Total batch size = 64 | Total steps = 1,263
 "-____-"     Number of trainable parameters = 83,886,080


Step,Training Loss,Validation Loss
50,0.5794,0.55426
100,0.532,0.479675
150,0.4965,0.466845
200,0.4791,0.458901
250,0.4634,0.452358
300,0.4591,0.445931
350,0.4396,0.439732
400,0.4456,0.433047
450,0.3821,0.429368
500,0.3949,0.423562
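
The figures in the training banner can be reproduced with a little arithmetic. The sketch below assumes the trainer floors the number of optimizer steps per epoch and rounds the warmup step count up. That assumption matches the reported total of 1,263 steps, but the warmup figure is an estimate, since it does not appear in the log.

```python
import math

num_examples = 27_000   # training split after the 90/10 split of 30,000 samples
per_device_batch = 8
grad_accum = 8
num_epochs = 3
warmup_ratio = 0.1

effective_batch = per_device_batch * grad_accum                 # 64
batches_per_epoch = math.ceil(num_examples / per_device_batch)  # 3375
steps_per_epoch = batches_per_epoch // grad_accum               # 421 (assumed floor)
total_steps = steps_per_epoch * num_epochs                      # 1263
warmup_steps = math.ceil(total_steps * warmup_ratio)            # 127 (assumed ceil)

print(effective_batch, total_steps, warmup_steps)
# How far the logged step 500 is into training, in epochs.
print(round(500 / steps_per_epoch, 2))  # 1.19
```

By this estimate, the last logged step (500) is a little over one epoch into the three-epoch schedule.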


## Inference

In [None]:
# Sample inference data point
test_dataset = dataset['test']

sample_ques = test_dataset['question'][0]
sample_ans = test_dataset['answer'][0]


In [None]:
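# Running inference on a single test example
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
# Pick the template the same way the training prompts were built.
input_prompt = select_prompt_template(sample_ques).format(
        sample_ques, # question
        sample_ans, # given answer
        test_dataset['solution'][0], # explanation, as in the training prompts
        "", # output - leave this blank for generation! The LLM will generate whether it is True or False
    )

print("Input Prompt:\n", input_prompt)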
inputs = tokenizer(
[
    input_prompt
], return_tensors = "pt").to("cuda")

input_shape = inputs['input_ids'].shape
input_token_len = input_shape[1] # 1 because of batch
outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
# You can get the whole generated text by uncommenting the line below
# text_generated = tokenizer.batch_decode(outputs, skip_special_tokens=True)

response = tokenizer.batch_decode([outputs[0][input_token_len:]], skip_special_tokens=True)
response

## Saving model

In [10]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

In [11]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference


==((====))==  Unsloth 2024.11.7: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu124. CUDA = 8.0. CUDA Toolkit = 12.4.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


## Generate the CSV of test data.

In [13]:
import csv
from tqdm import tqdm

def evaluate_model_and_save_results(model, tokenizer, dataset, output_file="model_results.csv"):
    """
    Run model predictions on the test set and save the results to a CSV file.

    Parameters:
    - model: The trained model.
    - tokenizer: The tokenizer used by the model.
    - dataset: The test dataset, containing 'question', 'answer' and 'solution' fields.
    - output_file: The path to the output CSV file.
    """
    FastLanguageModel.for_inference(model)  # Enable faster inference mode.
    total_samples = len(dataset)  # Total number of samples in the test set.

    results = []

    for i in tqdm(range(total_samples), desc="Running inference"):
        # Retrieve the question, answer and solution of the current sample.
        # Indexing the row first avoids materializing a whole column on every iteration.
        sample = dataset[i]
        question = sample['question']
        answer = sample['answer']
        solution = sample['solution']

        # Dynamically select the prompt template.
        prompt_template = select_prompt_template(question)

        # Format the input prompt using prompt_template.
        input_prompt = prompt_template.format(
            question,
            answer,
            solution,
            ""
        )

        inputs = tokenizer([input_prompt], return_tensors="pt").to("cuda")
        input_token_len = inputs['input_ids'].shape[1]

        outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
        response = tokenizer.decode(outputs[0][input_token_len:], skip_special_tokens=True).strip()

        # Parse the model's generated result to determine whether it is 'True' or 'False'.
        is_correct = 'True' if response.lower() == 'true' else 'False'

        # Store the result in the results list.
        results.append({"ID": i, "is_correct": is_correct})

    # Write the results to the CSV file.
    with open(output_file, mode="w", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=["ID", "is_correct"])
        writer.writeheader()
        writer.writerows(results)

    print(f"The model's prediction results have been saved to {output_file}")

# Call the function to run the evaluation and save the results.
evaluate_model_and_save_results(model, tokenizer, dataset['test'])


Running inference:   0%|          | 19/10000 [00:04<42:43,  3.89it/s]


KeyboardInterrupt: 
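
The parsing step in `evaluate_model_and_save_results` counts a response as correct only when it is exactly "true" after lowercasing, so an output such as "True." or "True" followed by extra generated text is recorded as False. A more lenient alternative, sketched below, takes the first standalone True/False token in the response. `parse_verdict` and its `default` argument are hypothetical names, and falling back to False on unparseable output is an assumption rather than a competition requirement.

```python
import re

# Matches the first standalone "true" or "false", ignoring case.
_VERDICT = re.compile(r"\b(true|false)\b", re.IGNORECASE)

def parse_verdict(response, default=False):
    """Return the first True/False verdict found in the generated text."""
    match = _VERDICT.search(response)
    if match is None:
        return default  # assumed fallback when no verdict is present
    return match.group(1).lower() == "true"

print(parse_verdict("True"))                     # True
print(parse_verdict(" true.\n### Question:"))    # True
print(parse_verdict("False, because 2 + 2 = 4")) # False
print(parse_verdict("unclear"))                  # False (default)
```

Inside the loop, `is_correct = parse_verdict(response)` would replace the equality check; lowering `max_new_tokens` is another option, since the verdict is expected in the first few tokens.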