# Math Question Answer Verification Competition

## Starter Code

Borrowed from the [official Unsloth implementation](https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing#scrollTo=MKX_XKs_BNZR).

In [None]:
%%capture
# This cell will take time
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2024.11.7: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu124. CUDA = 8.0. CUDA Toolkit = 12.4.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



## Load model and wrap with LoRA adapters

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 8,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # We support rank stabilized LoRA
    loftq_config = {
        "loftq_bits": 4,
        "loftq_iter": 2
    }, # And LoftQ
)

Unsloth 2024.11.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
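
As a sanity check on the adapter size, the number of LoRA parameters can be worked out by hand: each adapted linear layer adds two low-rank matrices, so it contributes `r * (in_features + out_features)` weights. The sketch below assumes the layer shapes of the public Llama 3.1 8B configuration (hidden size 4096, MLP width 14336, 8 key/value heads of dimension 128, 32 layers); these numbers are not taken from this notebook. It also shows the scaling factor that `use_rslora = True` selects. The helper names are hypothetical.

```python
import math

# Assumed Llama 3.1 8B dimensions (public model config, not printed in this notebook).
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size (MLP width)
KV_DIM = 1024          # num_key_value_heads (8) * head_dim (128)
NUM_LAYERS = 32

# (in_features, out_features) of every target module in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


def lora_scale(alpha, rank, use_rslora):
    """Standard LoRA scales by alpha / r; rank-stabilized LoRA by alpha / sqrt(r)."""
    return alpha / math.sqrt(rank) if use_rslora else alpha / rank


print(lora_param_count(rank=8))              # trainable adapter weights
print(lora_scale(8, 8, use_rslora=True))     # scaling used with rsLoRA
print(lora_scale(8, 8, use_rslora=False))    # scaling used with plain LoRA
```

Under these assumptions, `r = 8` gives 20,971,520 adapter weights, which agrees with the trainable-parameter count printed by the trainer in the SFT section.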


## Competition dataset

In [None]:
# download and load competition dataset

from datasets import load_dataset
dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp")
# print and see dataset
dataset


DatasetDict({
    train: Dataset({
        features: ['question', 'is_correct', 'answer', 'solution'],
        num_rows: 1000000
    })
    test: Dataset({
        features: ['question', 'is_correct', 'answer', 'solution'],
        num_rows: 10000
    })
})

In [None]:
from datasets import Dataset

# Randomly sample 20,000 examples from the training set
sampled_indices = torch.randperm(len(dataset['train']))[:20000].tolist()
train_dataset_small = Dataset.from_dict(dataset['train'][sampled_indices])
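
The `torch.randperm` call above is not seeded, so every run draws a different 20,000-example subset. If a reproducible subset is wanted, one option is to draw the indices from a seeded generator. The sketch below uses only the standard library; the helper name `sample_indices` is hypothetical and the seed 3407 is simply reused from the notebook's `random_state`.

```python
import random


def sample_indices(num_rows, k, seed):
    """Return k distinct row indices; the result is identical for a given seed."""
    rng = random.Random(seed)
    return rng.sample(range(num_rows), k)


indices = sample_indices(num_rows=1_000_000, k=20_000, seed=3407)
print(len(indices), len(set(indices)))  # both counts should equal k
```

The resulting list could then be passed to `dataset['train'].select(indices)`, which avoids materializing the rows through `Dataset.from_dict`.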

In [None]:
prompt = """
You are a meticulous evaluator tasked with verifying the correctness of mathematical solutions. Your job is to analyze the given question, check the student's reasoning and the provided answer to determine if the student is correct or not.

TASK DESCRIPTION:
- Evaluate the mathematical problem and verify whether the student's final answer is correct based on the reasoning provided in their solution process.

QUESTION TYPES:
- Arithmetic: Basic operations like addition, subtraction, multiplication, and division.
- Algebra: Equations, expressions, variables, and simplification.
- Geometry: Problems involving shapes, areas, volumes, and angles.
- Calculus: Derivatives, integrals, and functions.
- Probability and Statistics: Mean, median, probabilities, and distributions.
- Logic and Word Problems: Real-world scenarios requiring logical reasoning.

SOLUTION TYPES:
- Step-by-Step Explanation: Solutions broken down into sequential steps.
- Direct Calculation: Straightforward numerical evaluation.
- Formula/Application: Usage of mathematical formulas or equations.
- Proof or Derivation: Logical proofs or derivations of results.
- Code-Based Solution: Computational solutions using programming or scripts.

1. Question: {question}
   - Analyze the question so that you have your own understanding of it.
2. Student's Answer: {answer}
3. Student's Solution Process: {solution}
   - Assess whether the solution process reflects a clear understanding of the question.
   - Check if the student has addressed all components of the question.
   - Verify that each step of the solution flows logically from the previous one.
   - Ensure no steps are skipped or assumptions are made without justification.
   - Examine calculations for precision and correctness.
   - Confirm adherence to mathematical rules and methods relevant to the problem type.
   - Evaluate if the solution process fully resolves the question, considering all requirements stated in the problem.

Output "True" if student is correct;
Output "False" if student is incorrect.

OUTPUT FORMAT:
{output}
"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    texts = []
    for question, answer, solution, output in zip(
        examples["question"],
        examples["answer"],
        examples["solution"],
        examples["is_correct"]
    ):
        text = prompt.format(
            question=question,
            answer=answer,
            solution=solution,
            output=output
        ) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}
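
To see what the model is actually trained on, the sketch below fills a shortened stand-in template (hypothetical, much shorter than the real `prompt`) the same way `formatting_prompts_func` does. It shows two details: the boolean `is_correct` is rendered as the text `True` or `False`, and the inference prompt built with `output=""` ends with an empty line where the label would be. The `EOS` string is a placeholder for `tokenizer.eos_token`.

```python
# Shortened stand-in for the notebook's prompt; only the structure matters here.
TEMPLATE = """Question: {question}
Student's Answer: {answer}
Student's Solution Process: {solution}

OUTPUT FORMAT:
{output}
"""
EOS = "<eos>"  # placeholder; the notebook appends tokenizer.eos_token


def build_training_text(row):
    """Training text: full prompt, the label as text, then the EOS marker."""
    return TEMPLATE.format(
        question=row["question"],
        answer=row["answer"],
        solution=row["solution"],
        output=row["is_correct"],  # bool is formatted as "True" / "False"
    ) + EOS


def build_inference_prompt(row):
    """Inference prompt: same template with the label left empty."""
    return TEMPLATE.format(
        question=row["question"],
        answer=row["answer"],
        solution=row["solution"],
        output="",
    )


row = {"question": "What is 2 + 3?", "answer": "5",
       "solution": "2 + 3 = 5", "is_correct": True}
print(repr(build_training_text(row)[-30:]))
print(repr(build_inference_prompt(row)[-20:]))
```

Because the template has a newline after `{output}`, the training text ends in `True\n<eos>` while the inference prompt ends in `OUTPUT FORMAT:\n\n`. If this mismatch turns out to matter, one option would be to cut the inference prompt right after `OUTPUT FORMAT:\n`.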

In [None]:
# Process the training dataset and generate prompt for each datapoint

# Use only the 20,000 sampled examples for training
train_dataset = train_dataset_small.map(formatting_prompts_func, batched=True)


In [None]:
# Print a sample training example
train_dataset['text'][0]

'\nYou are a meticulous evaluator tasked with verifying the correctness of mathematical solutions. Your job is to analyze the given question, check the student\'s reasoning and the provided answer to determine if the student is correct or not.\n\nTASK DESCRIPTION:\n- Evaluate the mathematical problem and verify whether the student\'s final answer is correct based on the reasoning provided in their solution process.\n\nQUESTION TYPES:\n- Arithmetic: Basic operations like addition, subtraction, multiplication, and division.\n- Algebra: Equations, expressions, variables, and simplification.\n- Geometry: Problems involving shapes, areas, volumes, and angles.\n- Calculus: Derivatives, integrals, and functions.\n- Probability and Statistics: Mean, median, probabilities, and distributions.\n- Logic and Word Problems: Real-world scenarios requiring logical reasoning.\n\nSOLUTION TYPES:\n- Step-by-Step Explanation: Solutions broken down into sequential steps.\n- Direct Calculation: Straightforw

## SFT

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from transformers import TrainerCallback

training_args = TrainingArguments(
        per_device_train_batch_size = 16,
        gradient_accumulation_steps = 8,
        warmup_ratio = 0.1, # Warm up over the first 10% of steps instead of a fixed number of warmup steps
        num_train_epochs = 3, # 3 full passes over the training data
        learning_rate = 1e-3, # Changed from 2e-4 to 1e-3
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_torch_fused",
        weight_decay = 0.05,
        lr_scheduler_type = "cosine_with_restarts", # Change from linear to cosine w/ restarts
        seed = 3407,
        output_dir = "outputs",
        eval_strategy = "steps", # Evaluate on a step schedule
        eval_steps = 50, # Evaluate every 50 steps
        save_strategy = "steps", # Save checkpoints on the same schedule as evaluation
        save_steps = 50, # Save every 50 steps
        load_best_model_at_end = True, # Reload the best checkpoint when training finishes
        metric_for_best_model = "eval_loss", # "Best" means lowest validation loss
        max_grad_norm = 0.3,
        report_to = "none", # Set to "wandb" etc. to enable experiment tracking
    )

# Train/validation split (90% / 10%)
from sklearn.model_selection import train_test_split

train_indices, val_indices = train_test_split(
    range(len(train_dataset)),
    test_size=0.1,
    random_state=42,
    stratify=train_dataset['is_correct']  # Keep the True/False ratio the same in both splits
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset.select(train_indices), # training dataset
    eval_dataset = train_dataset.select(val_indices), # evaluation dataset
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 4,
    packing = False, # Can make training 5x faster for short sequences.
    args = training_args
)

trainer_stats = trainer.train()


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 18,000 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 16 | Gradient Accumulation steps = 8
\        /    Total batch size = 128 | Total steps = 420
 "-____-"     Number of trainable parameters = 20,971,520


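| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 0.34          | 0.335591        |
| 100  | 0.3154        | 0.318461        |
| 150  | 0.2837        | 0.307107        |
| 200  | 0.2815        | 0.296194        |
| 250  | 0.2647        | 0.283892        |
| 300  | 0.217         | 0.278912        |
| 350  | 0.2183        | 0.273583        |
| 400  | 0.2089        | 0.271328        |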
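
The step counts in the log above follow from the training arguments. The sketch below reproduces the arithmetic for a single GPU: effective batch size, optimizer steps per epoch, total steps, warmup steps, and the steps at which evaluation runs. It is a simplified reading of how the trainer counts steps (it assumes a trailing partial accumulation window is not counted), and `schedule_summary` is a hypothetical helper, not part of any library.

```python
import math


def schedule_summary(num_examples, per_device_batch, grad_accum, epochs,
                     warmup_ratio, eval_every):
    """Approximate single-GPU step bookkeeping for a training run."""
    effective_batch = per_device_batch * grad_accum
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    # Assumption: a trailing partial accumulation window is not counted as a step.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = updates_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    eval_points = list(range(eval_every, total_steps + 1, eval_every))
    return {
        "effective_batch": effective_batch,
        "updates_per_epoch": updates_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
        "eval_points": eval_points,
    }


summary = schedule_summary(
    num_examples=18_000, per_device_batch=16, grad_accum=8,
    epochs=3, warmup_ratio=0.1, eval_every=50,
)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With the values used here this gives an effective batch of 128 and 420 total steps, matching the banner, and evaluation at steps 50 through 400, which is why the loss table has eight rows.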


## Inference

In [None]:
# Sample inference data point
test_dataset = dataset['test']

sample_ques = test_dataset['question'][0]
sample_ans = test_dataset['answer'][0]
sample_sol = test_dataset['solution'][0]


In [None]:
# Running inference on single test
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
input_prompt = prompt.format(
        question=sample_ques,
        answer=sample_ans,
        solution=sample_sol,
        output=""
    )

print("Input Prompt:\n", input_prompt)
inputs = tokenizer(
[
    input_prompt
], return_tensors = "pt").to("cuda")

input_shape = inputs['input_ids'].shape
input_token_len = input_shape[1] # dim 0 is the batch, dim 1 is the prompt length in tokens
# Greedy decoding: with do_sample = False the temperature is ignored, so it is not set here
outputs = model.generate(**inputs,
    max_new_tokens = 64,
    min_new_tokens = 1,
    do_sample = False,
    num_beams = 1,
    use_cache = True,
    pad_token_id = tokenizer.pad_token_id,
    eos_token_id = tokenizer.eos_token_id)
# To get the whole generated text (prompt included), uncomment the line below
# text_generated = tokenizer.batch_decode(outputs, skip_special_tokens=True)

response = tokenizer.batch_decode([outputs[0][input_token_len:]], skip_special_tokens=True)
response

Input Prompt:
 
You are a meticulous evaluator tasked with verifying the correctness of mathematical solutions. Your job is to analyze the given question, check the student's reasoning and the provided answer to determine if the student is correct or not.

TASK DESCRIPTION:
- Evaluate the mathematical problem and verify whether the student's final answer is correct based on the reasoning provided in their solution process.

QUESTION TYPES:
- Arithmetic: Basic operations like addition, subtraction, multiplication, and division.
- Algebra: Equations, expressions, variables, and simplification.
- Geometry: Problems involving shapes, areas, volumes, and angles.
- Calculus: Derivatives, integrals, and functions.
- Probability and Statistics: Mean, median, probabilities, and distributions.
- Logic and Word Problems: Real-world scenarios requiring logical reasoning.

SOLUTION TYPES:
- Step-by-Step Explanation: Solutions broken down into sequential steps.
- Direct Calculation: Straightforward 

['False\n']
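
The raw response is a string such as `'False\n'`, so it has to be normalized before it can be used as a label. A minimal sketch of that step, assuming the conservative rule used later in this notebook (anything that is not clearly `true` counts as `False`); `parse_label` is a hypothetical helper name.

```python
def parse_label(response):
    """Map raw generated text to a bool; anything not clearly 'true' becomes False."""
    cleaned = response.strip().strip("'\"").strip().lower()
    return cleaned == "true"


for raw in ["False\n", " True", '"true"', "", "Truth"]:
    print(repr(raw), "->", parse_label(raw))
```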

## Generate CSV File for Evaluation

In [None]:
from tqdm import tqdm

def generate_submission_file(model, tokenizer, test_dataset, prompt_template, output_file="submission.csv"):
    """
    Generate a submission file for the competition.
    """
    FastLanguageModel.for_inference(model)
    predictions = []

    print("Generating .csv file...")
    for i in tqdm(range(len(test_dataset))):
        try:
            row = test_dataset[i]  # Fetch one row instead of loading a full column per field
            input_prompt = prompt_template.format(
                question=row['question'],
                answer=row['answer'],
                solution=row['solution'],
                output=""
            )

            inputs = tokenizer([input_prompt], return_tensors="pt").to("cuda")
            input_token_len = inputs['input_ids'].shape[1]

            with torch.no_grad():
                # Greedy decoding of a single token; sampling-only and beam-only
                # options are ignored in this mode, so they are omitted.
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=1,
                    min_new_tokens=1,
                    do_sample=False,
                    use_cache=True,
                    num_beams=1
                )

            response = tokenizer.decode(outputs[0][input_token_len:], skip_special_tokens=True).strip().lower()
            response = response.replace("output only", "").replace("'", "").replace('"', "").strip()

            # Default to False unless the response is exactly "true"
            predicted_label = response == 'true'
            predictions.append(predicted_label)

        except Exception as e:
            print(f"Error assessing {i}: {str(e)}")
            # Default to False
            predictions.append(False)

    # Create dataframe
    import pandas as pd
    submission_df = pd.DataFrame({
        'ID': range(len(predictions)),
        'is_correct': predictions
    })

    # Save to .csv
    submission_df.to_csv(output_file, index=False)
    print(f"\nFile saved as {output_file}")

    # Display distribution
    value_counts = submission_df['is_correct'].value_counts()
    print("\nPrediction Distribution:")
    print(f"True: {value_counts.get(True, 0)}")
    print(f"False: {value_counts.get(False, 0)}")

    return submission_df

# Generate .csv file
submission = generate_submission_file(
    model,
    tokenizer,
    dataset['test'],
    prompt,
    "submission.csv"
)

Generating .csv file...


100%|██████████| 10000/10000 [26:48<00:00,  6.22it/s]


File saved as submission.csv

Prediction Distribution:
True: 3361
False: 6639
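
Before uploading, it can help to check the file layout and the label balance. The sketch below writes the same two columns (`ID`, `is_correct`) with the standard-library `csv` module and computes the share of each label. The column layout is taken from the function above, while the helper names are hypothetical. The counts are the ones printed in the distribution above.

```python
import csv
import io
from collections import Counter


def write_submission(predictions, handle):
    """Write ID,is_correct rows in the same layout as the DataFrame above."""
    writer = csv.writer(handle)
    writer.writerow(["ID", "is_correct"])
    for i, label in enumerate(predictions):
        writer.writerow([i, bool(label)])


def label_share(predictions):
    """Fraction of True and False labels among the predictions."""
    counts = Counter(bool(p) for p in predictions)
    total = sum(counts.values())
    return {label: counts.get(label, 0) / total for label in (True, False)}


# Toy predictions, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_submission([True, False, False], buffer)
lines = buffer.getvalue().splitlines()
print(lines)

# Label balance for the counts reported above (3361 True, 6639 False).
shares = label_share([True] * 3361 + [False] * 6639)
print(shares)
```

About a third of the test predictions are `True` in this run. Comparing that share with the label balance of the training split is a cheap way to spot a model that has collapsed to one class.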





In [None]:
from google.colab import files

files.download('/content/submission.csv')

## Saving the model

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

In [None]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # Directory where the LoRA adapter was saved above
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference


==((====))==  Unsloth 2024.11.6: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu124. CUDA = 8.0. CUDA Toolkit = 12.4.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [None]:
!zip -r /content/lora_model.zip /content/lora_model/

  adding: content/lora_model/ (stored 0%)
  adding: content/lora_model/tokenizer.json (deflated 85%)
  adding: content/lora_model/README.md (deflated 66%)
  adding: content/lora_model/tokenizer_config.json (deflated 96%)
  adding: content/lora_model/adapter_config.json (deflated 54%)
  adding: content/lora_model/special_tokens_map.json (deflated 71%)
  adding: content/lora_model/adapter_model.safetensors (deflated 7%)
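
Outside Colab, where the `!zip` shell command may not be available, the same archive can be built from Python. A minimal sketch using `shutil.make_archive` from the standard library; `zip_directory` is a hypothetical helper, and the demo runs on a throwaway directory that stands in for `/content/lora_model`.

```python
import os
import shutil
import tempfile
import zipfile


def zip_directory(src_dir, archive_base):
    """Zip src_dir, keeping its folder name inside the archive; returns the .zip path."""
    src_dir = os.path.abspath(src_dir)
    parent = os.path.dirname(src_dir)
    folder = os.path.basename(src_dir)
    return shutil.make_archive(archive_base, "zip", root_dir=parent, base_dir=folder)


# Small demo on a temporary directory (stand-in for /content/lora_model).
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "lora_model")
    os.makedirs(demo_dir)
    with open(os.path.join(demo_dir, "adapter_config.json"), "w") as fh:
        fh.write("{}")
    archive = zip_directory(demo_dir, os.path.join(tmp, "lora_model"))
    with zipfile.ZipFile(archive) as zf:
        archived_names = zf.namelist()

print(archived_names)
```

In the notebook this would be called as `zip_directory("/content/lora_model", "/content/lora_model")`, which writes `/content/lora_model.zip`.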


In [None]:
!zip -r /content/outputs.zip /content/outputs/

  adding: content/outputs/ (stored 0%)
  adding: content/outputs/checkpoint-700/ (stored 0%)
  adding: content/outputs/checkpoint-700/optimizer.pt (deflated 13%)
  adding: content/outputs/checkpoint-700/scheduler.pt (deflated 55%)
  adding: content/outputs/checkpoint-700/tokenizer.json (deflated 85%)
  adding: content/outputs/checkpoint-700/README.md (deflated 66%)
  adding: content/outputs/checkpoint-700/trainer_state.json (deflated 81%)
  adding: content/outputs/checkpoint-700/tokenizer_config.json (deflated 96%)
  adding: content/outputs/checkpoint-700/training_args.bin (deflated 51%)
  adding: content/outputs/checkpoint-700/rng_state.pth (deflated 25%)
  adding: content/outputs/checkpoint-700/adapter_config.json (deflated 54%)
  adding: content/outputs/checkpoint-700/special_tokens_map.json (deflated 71%)
  adding: content/outputs/checkpoint-700/adapter_model.safetensors (deflated 7%)
  adding: content/outputs/checkpoint-100/ (stored 0%)
  adding: content/outputs/checkpoint-100/opt