# Enhancing Mathematical Reasoning in LLMs: Fine-Tuning for Math Word Problem Solving

## 1: Introduction: 
The main goal is to fine-tune the transformer models on the MathQA dataset to improve its performance in answering mathematical word problems. Math questions require an understanding and application of logic and reasoning. 

**Impact:** We fine-tuned a pre-trained model on a specific domain (mathematics) to enhance its ability to understand and accurately answer domain-specific questions.

In [1]:
from huggingface_hub import login
import wandb
from dotenv import load_dotenv
import os

load_dotenv()

# Access the environment variables from the .env file
hf_token = os.environ.get('HF_TOKEN')
wandb_token = os.environ.get('WANDB_TOKEN')

wandb.login(key=wandb_token)
login(token=hf_token)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mnickrwu[0m ([33mnick-wu[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /home/nrw9167/.netrc


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/nrw9167/.cache/huggingface/token
Login successful


## 2: Load Dataset & Pre-Trained Model
[**MathQA**](https://huggingface.co/datasets/math_qa) is a challenging dataset that includes diverse mathematical multiple-choice questions that require understanding and reasoning. We chose this dataset for its vast structured data and features including rationale and annotated formulas.

In [2]:
from datasets import load_dataset

# Testing with a smaller subset of the data
# mathqa = load_dataset("math_qa", split="train[:5000]")
# mathqa = mathqa.train_test_split(test_size=0.2)

# Initialize dataset and available models
mathqa = load_dataset("math_qa")
model_name = "LIAMF-USP/roberta-large-finetuned-race"

model_names = ["LIAMF-USP/roberta-large-finetuned-race", "microsoft/deberta-v3-large", "google/bigbird-roberta-large", "xlnet/xlnet-base-cased", "FacebookAI/xlm-roberta-large", "distilbert/distilbert-base-uncased"]

In [3]:
# Print training sample
mathqa['train'][0]

{'Problem': "the banker ' s gain of a certain sum due 3 years hence at 10 % per annum is rs . 36 . what is the present worth ?",
 'Rationale': '"explanation : t = 3 years r = 10 % td = ( bg × 100 ) / tr = ( 36 × 100 ) / ( 3 × 10 ) = 12 × 10 = rs . 120 td = ( pw × tr ) / 100 ⇒ 120 = ( pw × 3 × 10 ) / 100 ⇒ 1200 = pw × 3 pw = 1200 / 3 = rs . 400 answer : option a"',
 'options': 'a ) rs . 400 , b ) rs . 300 , c ) rs . 500 , d ) rs . 350 , e ) none of these',
 'correct': 'a',
 'annotated_formula': 'divide(multiply(const_100, divide(multiply(36, const_100), multiply(3, 10))), multiply(3, 10))',
 'linear_formula': 'multiply(n2,const_100)|multiply(n0,n1)|divide(#0,#1)|multiply(#2,const_100)|divide(#3,#1)|',
 'category': 'gain'}

## 3: Cleaning and Pre-Processing
* **Data Cleaning:** Handling missing values, filtering answers, splitting options
* **Data Pre-Processing:** Convert mathematical questions and answers into token sequences that the model can process.
* **Data Splitting:** Training [80%], Development [12%], Test [8%]

In [4]:
print(mathqa)

DatasetDict({
    train: Dataset({
        features: ['Problem', 'Rationale', 'options', 'correct', 'annotated_formula', 'linear_formula', 'category'],
        num_rows: 29837
    })
    test: Dataset({
        features: ['Problem', 'Rationale', 'options', 'correct', 'annotated_formula', 'linear_formula', 'category'],
        num_rows: 2985
    })
    validation: Dataset({
        features: ['Problem', 'Rationale', 'options', 'correct', 'annotated_formula', 'linear_formula', 'category'],
        num_rows: 4475
    })
})


In [5]:
# Transform and split `options` string into list of answer choices
def split_options(example):
    example["options"] = example['options'].split(", ")
    return example

# Filter out any data with more or less than 5 possible answer choices
def filter_by_length(example):
    return len(example['options']) == 5

mathqa = mathqa.map(split_options)
mathqa = mathqa.filter(filter_by_length)

In [6]:
import re

# Remove any answer indicators from `Rationale` field in the dataset
def remove_answer_from_rationale(example):
    # More complex patterns to catch various ways answers are indicated
    patterns = [
        r'\banswer\s*[:.]\s*[a-e]\b',           # "answer: a" or "answer. a"
        r'\banswer\s*is\s*[a-e]\b',             # "answer is a"
        r'\banswer\s*[a-e]\b',                  # "answer a"
        r'\bcorrect\s*option\s*[:.]\s*[a-e]\b', # "correct option: a"
        r'\bans\s*[:.]\s*[a-e]\b',              # "ans: a"
        r'\bimo\s*[a-e]\b',                     # "imo a"
        r'\b[a-e]\)\b',                         # "a)"
        r'\b[a-e]\.\b',                         # "a."
        r'\b[a-e]\b\s*is\s*correct\b',          # "a is correct"
        r'\b[a-e]\b\s*is\s*the\s*answer\b',     # "a is the answer"
        r'\b[a-e]\b\s*-\s*',                    # "a -"
        r'\boption\s*[a-e]\b',                  # "option a"
        r'\bnone of these\b',                   # "none of these"
        r'\b[a-e]\b\s*is\s*right\b',            # "a is right"
        r'([a-eA-E])(?!.*[a-eA-E])',
    ]

    # Replace identified patterns with empty string
    for pattern in patterns:
        example["Rationale"] = re.sub(pattern, '', example["Rationale"], flags=re.IGNORECASE)

    # Clean up multiple spaces and newlines
    example["Rationale"] = re.sub(r'(.*=).*', r'\1', example["Rationale"])
    example["Rationale"] = re.sub(r'\s{2,}', ' ', example["Rationale"])
    example["Rationale"] = re.sub(r'\n+', '\n', example["Rationale"])
    
    example["Rationale"] = example["Rationale"].strip()

    return example

mathqa = mathqa.map(remove_answer_from_rationale)

In [7]:
# Training sample has answer removed in rationale and 5 individual answer choice strings
mathqa['train'][0]

{'Problem': "the banker ' s gain of a certain sum due 3 years hence at 10 % per annum is rs . 36 . what is the present worth ?",
 'Rationale': '"explanation : t = 3 years r = 10 % td = ( bg × 100 ) / tr = ( 36 × 100 ) / ( 3 × 10 ) = 12 × 10 = rs . 120 td = ( pw × tr ) / 100 ⇒ 120 = ( pw × 3 × 10 ) / 100 ⇒ 1200 = pw × 3 pw = 1200 / 3 =',
 'options': ['a ) rs . 400 ',
  'b ) rs . 300 ',
  'c ) rs . 500 ',
  'd ) rs . 350 ',
  'e ) none of these'],
 'correct': 'a',
 'annotated_formula': 'divide(multiply(const_100, divide(multiply(36, const_100), multiply(3, 10))), multiply(3, 10))',
 'linear_formula': 'multiply(n2,const_100)|multiply(n0,n1)|divide(#0,#1)|multiply(#2,const_100)|divide(#3,#1)|',
 'category': 'gain'}

In [8]:
from transformers import AutoTokenizer, AutoModelForMultipleChoice, TrainingArguments, Trainer
import torch
from accelerate import Accelerator

# Initialize Accelerator
accelerator = Accelerator()

In [9]:
import evaluate
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

accuracy = evaluate.load("accuracy")

# Evaluate models by accuracy, f1, precision, and recall
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {
        'accuracy': accuracy_score(p.label_ids, preds),
        'f1': precision_recall_fscore_support(p.label_ids, preds, average='macro')[2],
        'precision': precision_recall_fscore_support(p.label_ids, preds, average='macro')[0],
        'recall': precision_recall_fscore_support(p.label_ids, preds, average='macro')[1]
}

In [10]:
def base_preprocess_function(examples, tokenizer):
    MAX_SEQ_LENGTH = tokenizer.model_max_length if tokenizer.model_max_length < 512 else 256
    
    labels_map = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
    questions = examples["Problem"]
    options_list = examples["options"]
    categories = examples["category"]
    labels = [labels_map[ans] for ans in examples["correct"]]

    batch_input_ids = []
    batch_attention_masks = []
    batch_labels = []
    batch_categories = []
    
    # Iterate over each example in the batch
    for question, category, options, label in zip(questions, categories, options_list, labels):
        choices_inputs = []

        for option in options:
            # [0] Category; Problem; Option
            input_question = f'[CATEGORY] {category} [PROBLEM] {question}' 
            input_option = f'[OPTION] {option}'

            # Tokenize the context and the question-option pair
            inputs = tokenizer(
                input_question,
                input_option,
                add_special_tokens=True,
                max_length=MAX_SEQ_LENGTH,
                padding="max_length",
                truncation=True,
                return_overflowing_tokens=False
            )
            
            choices_inputs.append(inputs)

        # Extract input ids and attention masks for all options
        input_ids = [x['input_ids'] for x in choices_inputs]
        attention_masks = [x['attention_mask'] for x in choices_inputs]
        
        batch_input_ids.append(input_ids)
        batch_attention_masks.append(attention_masks)
        batch_labels.append(label)

    # Return processed batch data as a dictionary
    return {
        "input_ids": batch_input_ids,
        "attention_mask": batch_attention_masks,
        "labels": torch.tensor(batch_labels, dtype=torch.long),
    }

## 4: Evaluating Base Models
We evaluate the performance of various base pre-trained models on the MathQA dataset to select the best suitable base and establish a benchmark for comparison with the fine-tuned model.

The following models are evaluated on the MathQA test set:
* **`LIAMF-USP/roberta-large-finetuned-race`**
* `microsoft/deberta-v3-large`
* `google/bigbird-roberta-large`
* `xlnet/xlnet-base-cased`
* `FacebookAI/xlm-roberta-large`
* `distilbert/distilbert-base-uncased`

In [None]:
# Initialize base models and tokenizers
models = { name: AutoModelForMultipleChoice.from_pretrained(name) for name in model_names }
tokenizers = { name: AutoTokenizer.from_pretrained(name) for name in model_names }

In [None]:
tokenized_datasets = {name: mathqa['test'].map(base_preprocess_function, fn_kwargs={'tokenizer': tkn}, batched=True) for name, tkn in tokenizers.items()}

In [None]:
# Print decoded sample text input
for name in model_names:
    accepted_keys = ["input_ids", "attention_mask", "labels"]
    features = [{k: v for k, v in tokenized_datasets[name][i].items() if k in accepted_keys} for i in range(10)]
    batch = DataCollatorForMultipleChoice(tokenizers[name])(features)
    
    idx = 5
    print([tokenizers[name].decode(batch["input_ids"][idx][i].tolist()) for i in range(5)],"\n")

### 4.1: First Iteration
Testing and comparing base model performance with question, option, and no pre-processed rationale

In [11]:
# 1st Iteration
results = {}
for name, model in models.items():
    trainer = Trainer(
        model=model,
        eval_dataset=tokenized_datasets[name],
        compute_metrics=compute_metrics
    )
    results[name] = trainer.evaluate()

print(results)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mnickrwu[0m ([33mnick-wu[0m). Use [1m`wandb login --relogin`[0m to force relogin


Attention type 'block_sparse' is not possible if sequence_length: 256 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...


{'LIAMF-USP/roberta-large-finetuned-race': {'eval_loss': 1.5375503301620483, 'eval_accuracy': 0.49210084033613444, 'eval_f1': 0.48690145392783163, 'eval_precision': 0.4895015998986854, 'eval_recall': 0.4858255651820508, 'eval_runtime': 200.4258, 'eval_samples_per_second': 14.843, 'eval_steps_per_second': 1.856}, 'microsoft/deberta-v3-large': {'eval_loss': 1.6093789339065552, 'eval_accuracy': 0.24, 'eval_f1': 0.23916774512522507, 'eval_precision': 0.254085226353791, 'eval_recall': 0.24635359483100033, 'eval_runtime': 258.2878, 'eval_samples_per_second': 11.518, 'eval_steps_per_second': 1.44}, 'google/bigbird-roberta-large': {'eval_loss': 1.606080412864685, 'eval_accuracy': 0.293109243697479, 'eval_f1': 0.2926189170937211, 'eval_precision': 0.29562990901530106, 'eval_recall': 0.2948272429784039, 'eval_runtime': 231.6373, 'eval_samples_per_second': 12.843, 'eval_steps_per_second': 1.606}, 'xlnet/xlnet-base-cased': {'eval_loss': 1.6289170980453491, 'eval_accuracy': 0.16, 'eval_f1': 0.15297

In [12]:
# 1st Iteration
for key in results.keys():
    print(f"{key}: {results[key]}\n")

LIAMF-USP/roberta-large-finetuned-race: {'eval_loss': 1.5375503301620483, 'eval_accuracy': 0.49210084033613444, 'eval_f1': 0.48690145392783163, 'eval_precision': 0.4895015998986854, 'eval_recall': 0.4858255651820508, 'eval_runtime': 200.4258, 'eval_samples_per_second': 14.843, 'eval_steps_per_second': 1.856}

microsoft/deberta-v3-large: {'eval_loss': 1.6093789339065552, 'eval_accuracy': 0.24, 'eval_f1': 0.23916774512522507, 'eval_precision': 0.254085226353791, 'eval_recall': 0.24635359483100033, 'eval_runtime': 258.2878, 'eval_samples_per_second': 11.518, 'eval_steps_per_second': 1.44}

google/bigbird-roberta-large: {'eval_loss': 1.606080412864685, 'eval_accuracy': 0.293109243697479, 'eval_f1': 0.2926189170937211, 'eval_precision': 0.29562990901530106, 'eval_recall': 0.2948272429784039, 'eval_runtime': 231.6373, 'eval_samples_per_second': 12.843, 'eval_steps_per_second': 1.606}

xlnet/xlnet-base-cased: {'eval_loss': 1.6289170980453491, 'eval_accuracy': 0.16, 'eval_f1': 0.15297957666690

### 4.2: Second Iteration
Testing and comparing base model performance with question, option, and pre-processed rationale

In [13]:
# 2nd Iteration
results = {}
for name, model in models.items():
    trainer = Trainer(
        model=model,
        eval_dataset=tokenized_datasets[name],
        compute_metrics=compute_metrics
    )
    results[name] = trainer.evaluate()

print(results)

Attention type 'block_sparse' is not possible if sequence_length: 256 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...


{'LIAMF-USP/roberta-large-finetuned-race': {'eval_loss': 1.7576335668563843, 'eval_accuracy': 0.22453781512605042, 'eval_f1': 0.21967586975974313, 'eval_precision': 0.22253264398415445, 'eval_recall': 0.2213882836360634, 'eval_runtime': 212.6478, 'eval_samples_per_second': 13.99, 'eval_steps_per_second': 1.749}, 'microsoft/deberta-v3-large': {'eval_loss': 1.6093631982803345, 'eval_accuracy': 0.24873949579831933, 'eval_f1': 0.24674655202020634, 'eval_precision': 0.25300166132945645, 'eval_recall': 0.2481927322246705, 'eval_runtime': 271.6509, 'eval_samples_per_second': 10.952, 'eval_steps_per_second': 1.369}, 'google/bigbird-roberta-large': {'eval_loss': 1.6111717224121094, 'eval_accuracy': 0.20941176470588235, 'eval_f1': 0.20773049763904478, 'eval_precision': 0.21088324137587286, 'eval_recall': 0.20831432813874046, 'eval_runtime': 238.9903, 'eval_samples_per_second': 12.448, 'eval_steps_per_second': 1.557}, 'xlnet/xlnet-base-cased': {'eval_loss': 1.6096572875976562, 'eval_accuracy': 0.

In [14]:
# 2nd Iteration
for key in results.keys():
    print(f"{key}: {results[key]}\n")

LIAMF-USP/roberta-large-finetuned-race: {'eval_loss': 1.7576335668563843, 'eval_accuracy': 0.22453781512605042, 'eval_f1': 0.21967586975974313, 'eval_precision': 0.22253264398415445, 'eval_recall': 0.2213882836360634, 'eval_runtime': 212.6478, 'eval_samples_per_second': 13.99, 'eval_steps_per_second': 1.749}

microsoft/deberta-v3-large: {'eval_loss': 1.6093631982803345, 'eval_accuracy': 0.24873949579831933, 'eval_f1': 0.24674655202020634, 'eval_precision': 0.25300166132945645, 'eval_recall': 0.2481927322246705, 'eval_runtime': 271.6509, 'eval_samples_per_second': 10.952, 'eval_steps_per_second': 1.369}

google/bigbird-roberta-large: {'eval_loss': 1.6111717224121094, 'eval_accuracy': 0.20941176470588235, 'eval_f1': 0.20773049763904478, 'eval_precision': 0.21088324137587286, 'eval_recall': 0.20831432813874046, 'eval_runtime': 238.9903, 'eval_samples_per_second': 12.448, 'eval_steps_per_second': 1.557}

xlnet/xlnet-base-cased: {'eval_loss': 1.6096572875976562, 'eval_accuracy': 0.210084033

### 4.3: Third Iteration
Testing and comparing base model performance with question, option, pre-processed rationale, and formula

In [38]:
# 3rd Iteration
results = {}
for name, model in models.items():
    trainer = Trainer(
        model=model,
        eval_dataset=tokenized_datasets[name],
        compute_metrics=compute_metrics
    )
    results[name] = trainer.evaluate()

print(results)

{'LIAMF-USP/roberta-large-finetuned-race': {'eval_loss': 1.5200344324111938, 'eval_accuracy': 0.3196638655462185, 'eval_f1': 0.3096483124936282, 'eval_precision': 0.3274146764658731, 'eval_recall': 0.3105565282966045, 'eval_runtime': 209.5813, 'eval_samples_per_second': 14.195, 'eval_steps_per_second': 1.775}, 'microsoft/deberta-v3-large': {'eval_loss': 1.6094452142715454, 'eval_accuracy': 0.19563025210084034, 'eval_f1': 0.19225121541062654, 'eval_precision': 0.20058589059213178, 'eval_recall': 0.1984869488193308, 'eval_runtime': 270.748, 'eval_samples_per_second': 10.988, 'eval_steps_per_second': 1.374}, 'google/bigbird-roberta-large': {'eval_loss': 1.6094380617141724, 'eval_accuracy': 0.200672268907563, 'eval_f1': 0.1995640649359511, 'eval_precision': 0.20037087713054674, 'eval_recall': 0.20043076565012208, 'eval_runtime': 208.9374, 'eval_samples_per_second': 14.239, 'eval_steps_per_second': 1.78}, 'xlnet/xlnet-base-cased': {'eval_loss': 1.6104209423065186, 'eval_accuracy': 0.2026890

In [39]:
# 3rd Iteration
for key in results.keys():
    print(f"{key}: {results[key]}\n")

LIAMF-USP/roberta-large-finetuned-race: {'eval_loss': 1.5200344324111938, 'eval_accuracy': 0.3196638655462185, 'eval_f1': 0.3096483124936282, 'eval_precision': 0.3274146764658731, 'eval_recall': 0.3105565282966045, 'eval_runtime': 209.5813, 'eval_samples_per_second': 14.195, 'eval_steps_per_second': 1.775}

microsoft/deberta-v3-large: {'eval_loss': 1.6094452142715454, 'eval_accuracy': 0.19563025210084034, 'eval_f1': 0.19225121541062654, 'eval_precision': 0.20058589059213178, 'eval_recall': 0.1984869488193308, 'eval_runtime': 270.748, 'eval_samples_per_second': 10.988, 'eval_steps_per_second': 1.374}

google/bigbird-roberta-large: {'eval_loss': 1.6094380617141724, 'eval_accuracy': 0.200672268907563, 'eval_f1': 0.1995640649359511, 'eval_precision': 0.20037087713054674, 'eval_recall': 0.20043076565012208, 'eval_runtime': 208.9374, 'eval_samples_per_second': 14.239, 'eval_steps_per_second': 1.78}

xlnet/xlnet-base-cased: {'eval_loss': 1.6104209423065186, 'eval_accuracy': 0.2026890756302521

## 5: Fine-Tuned Model Preparation
Prepare the preprocessing function for the dataset tokenization and a custom data collator for multiple choice questions for model fine-tuning

In [11]:
def preprocess_function(examples, tokenizer, mode=0):
    MAX_SEQ_LENGTH = tokenizer.model_max_length if tokenizer.model_max_length < 512 else 256
    
    if mode == 0:
        MAX_SEQ_LENGTH = 128
    elif mode == 1:
        MAX_SEQ_LENGTH = 320
    elif mode == 2:
        MAX_SEQ_LENGTH = 384
    
    labels_map = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
    questions = examples["Problem"]
    contexts = examples["Rationale"]
    formulas = examples['annotated_formula']
    options_list = examples["options"]
    categories = examples["category"]
    labels = [labels_map[ans] for ans in examples["correct"]]

    batch_input_ids = []
    batch_attention_masks = []
    batch_labels = []
    batch_categories = []
    
    # Iterate over each example in the batch
    for question, category, context, options, formula, label in zip(questions, categories, contexts, options_list, formulas, labels):
        choices_inputs = []

        for option in options:
            # [0] Category; Problem; Option
            if mode == 0:
                input_string = f'<s> [CATEGORY] {category} </s> </s>  [PROBLEM] {question} </s> </s> [OPTION] {option} </s>'

            # [1] Category; Problem; Rationale; Option; 
            elif mode == 1:
                input_string = f'<s> [CATEGORY] {category} </s> </s> [PROBLEM] {question} </s> </s> [CONTEXT] {context} </s> </s> [OPTION] {option} </s>'

            # [2] Category; Problem; Formula; Rationale; Option; 
            elif mode == 2:
                input_string = f'<s> [CATEGORY] {category} </s> </s> [PROBLEM] {question} </s> </s> [CONTEXT] {context} </s> </s> [OPTION] {option} </s> </s> [FORMULA] {formula} </s>'            

            # Tokenize the context and the question-option pair
            inputs = tokenizer(
                input_string,
                add_special_tokens=False,
                max_length=MAX_SEQ_LENGTH,
                padding="max_length",
                truncation=True,
                return_overflowing_tokens=False
            )
            
            choices_inputs.append(inputs)

        # Extract input ids and attention masks for all options
        input_ids = [x['input_ids'] for x in choices_inputs]
        attention_masks = [x['attention_mask'] for x in choices_inputs]
        
        batch_input_ids.append(input_ids)
        batch_attention_masks.append(attention_masks)
        batch_labels.append(label)

    # Return processed batch data as a dictionary
    return {
        "input_ids": batch_input_ids,
        "attention_mask": batch_attention_masks,
        "labels": torch.tensor(batch_labels, dtype=torch.long),
    }

In [12]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union

@dataclass
class DataCollatorForMultipleChoice:
    """
    Custom data collator that will dynamically pad the inputs for multiple choice received.
    """
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        # Determine the label key in the features
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)

        # Find the maximum number of choices across all samples (to handle variable numbers safely)
        max_num_choices = max(len(feature["input_ids"]) for feature in features)

        # Flatten the features for padding, ensuring all have the same number of choices
        flattened_features = []
        for feature in features:
            feature_choices = []
            for i in range(max_num_choices):
                try:
                    # Extract each choice as a separate feature
                    choice_features = {k: v[i] for k, v in feature.items() if k != label_name and isinstance(v, list)}
                    feature_choices.append(choice_features)
                except IndexError:
                    # If some choices are missing, pad manually
                    # Use the structure of the first choice to create empty padding
                    empty_choice = {k: [] * len(v[0]) if isinstance(v[0], list) else v for k, v in feature.items() if k != label_name and isinstance(v, list)}
                    feature_choices.append(empty_choice)
            flattened_features.extend(feature_choices)

        # Pad the flattened features
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        # Reshape the padded features back into their original shape [batch_size, num_choices, sequence_length]
        batch = {k: v.view(batch_size, max_num_choices, -1) for k, v in batch.items() if v.dim() > 1}

        # Add back the labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)

        return batch


In [13]:
# Initialize, tokenize, and preprocess mathqa dataset
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenized_mathqa = mathqa.map(preprocess_function, fn_kwargs={'tokenizer': tokenizer, 'mode': 1}, batched=True, remove_columns=mathqa["train"].column_names)

In [14]:
# Print decoded sample text input
accepted_keys = ["input_ids", "attention_mask", "labels"]
features = [{k: v for k, v in tokenized_mathqa["validation"][i].items() if k in accepted_keys} for i in range(10)]
batch = DataCollatorForMultipleChoice(tokenizer)(features)

idx = 5
[tokenizer.decode(batch["input_ids"][idx][i].tolist()) for i in range(5)]

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


['<s> [CATEGORY] gain </s> </s> [PROBLEM] in a school of 650 boys, 44 % of muslims, 28 % hindus, 10 % sikhs and the remaining of other communities. how many belonged to the other communities? </s> </s> [CONTEXT] 44 + 28 + 10 = 82 % 100 – 82 = 18 % 650 * 18 / 100 = </s> </s> [OPTION] a ) 173  </s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pa

## 6: Fine-tuning Model
Fine-tune the `LIAMF-USP/roberta-large-finetuned-race` pre-trained model on 3 epochs per model.

In [27]:
import gc

# Delete tensors
gc.collect()  # Garbage collect to free memory

torch.cuda.empty_cache()

### 6.1: First Iteration
Overfit fine-tuning on Rationale with answers

In [18]:
# Iteration 1 RoBERTA
training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-mathqa",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=10, # Adjust batch size depending on the available GPU memory
    per_device_eval_batch_size=16,  # Evaluation batch size can be larger if evaluation is less frequent
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    fp16=True
)

# Initialize Trainer
trainer = Trainer(
    model=finetuned_model,
    args=training_args,
    train_dataset=tokenized_mathqa["train"],
    eval_dataset=tokenized_mathqa["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

# Train the Model
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.2057,0.174363,0.956303,0.956123,0.955748,0.956546
2,0.1484,0.164226,0.958319,0.958263,0.958815,0.957779
3,0.1309,0.166103,0.964034,0.963642,0.963798,0.96351


TrainOutput(global_step=8910, training_loss=0.18729583658821522, metrics={'train_runtime': 5964.2442, 'train_samples_per_second': 14.937, 'train_steps_per_second': 1.494, 'total_flos': 2.075520624780672e+17, 'train_loss': 0.18729583658821522, 'epoch': 3.0})

In [19]:
trainer.push_to_hub()

model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

training_args.bin:   0%|          | 0.00/4.98k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/nickrwu/roberta-large-finetuned-race-finetuned-mathqa/commit/75e0925b987f6537b6cda1f391b778ad2806aeaf', commit_message='End of training', commit_description='', oid='75e0925b987f6537b6cda1f391b778ad2806aeaf', pr_url=None, pr_revision=None, pr_num=None)

In [35]:
finetuned_eval_result = trainer.evaluate(tokenized_mathqa["test"])

print(f"{model_name}-finetuned-mathqa: {finetuned_eval_result}")

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


### 6.2: Second Iteration
Fine-tuned on Rationale without answers

In [25]:
# Iteration 2: RoBERTa
training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-mathqa",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8, # Adjust batch size depending on the available GPU memory
    per_device_eval_batch_size=16,  # Evaluation batch size can be larger if evaluation is less frequent
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    fp16=True
)

# Initialize Trainer
trainer = Trainer(
    model=finetuned_model,
    args=training_args,
    train_dataset=tokenized_mathqa["train"],
    eval_dataset=tokenized_mathqa["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

# Train the Model
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,1.3497,1.288583,0.465882,0.464044,0.476122,0.460934
2,1.2074,1.168439,0.518655,0.518212,0.52534,0.515323
3,1.0072,1.128198,0.547563,0.547022,0.552756,0.544449


TrainOutput(global_step=11136, training_loss=1.252042861848042, metrics={'train_runtime': 6197.7167, 'train_samples_per_second': 14.374, 'train_steps_per_second': 1.797, 'total_flos': 2.075520624780672e+17, 'train_loss': 1.252042861848042, 'epoch': 3.0})

In [26]:
trainer.push_to_hub()

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

training_args.bin:   0%|          | 0.00/4.98k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/nickrwu/roberta-large-finetuned-race-finetuned-mathqa/commit/1b254161977e4aa443c91774a59af0d484e650e4', commit_message='End of training', commit_description='', oid='1b254161977e4aa443c91774a59af0d484e650e4', pr_url=None, pr_revision=None, pr_num=None)

In [27]:
finetuned_eval_result = trainer.evaluate(tokenized_mathqa["test"])

print(f"{model_name}-finetuned-mathqa: {finetuned_eval_result}")

LIAMF-USP/roberta-large-finetuned-race-finetuned-mathqa: {'eval_loss': 1.128198266029358, 'eval_accuracy': 0.547563025210084, 'eval_f1': 0.5470219441640726, 'eval_precision': 0.5527563562833936, 'eval_recall': 0.5444486622799508, 'eval_runtime': 62.4166, 'eval_samples_per_second': 47.664, 'eval_steps_per_second': 2.98, 'epoch': 3.0}


### 6.3: Third Iteration
Fine-tune three models and compare performance:
1. **RoBERTa-MQA:** We use the MathQA questions and options to fine-tune and predict subsequent correct answers.
2. **RoBERTa-MQA-RAT:** We use MathQA questions, options, and rationales as context for the RoBERTa model.
3. **RoBERTa-MQA-FORMRAT:** We use MathQA questions, options, annotated formulas, and rationales as contextual input knowledge for the model.

In [15]:
models = {}
tokenizers = {}
datasets = {}
finetuned_results = {}

model_names = ['roberta-mqa', 'roberta-mqa-rat', 'roberta-mqa-formrat']

for i, model in enumerate(model_names):
    models[model]= AutoModelForMultipleChoice.from_pretrained(model_name)
    tokenizers[model] = AutoTokenizer.from_pretrained(model_name)
    datasets[model] = mathqa.map(preprocess_function, fn_kwargs={'tokenizer': tokenizers[model], 'mode': i}, batched=True, remove_columns=mathqa["train"].column_names)

#### 6.3.1: RoBERTa-MQA
`INPUT: [Category] [Question] [Option]`

In [32]:
# Iteration 3: RoBERTA
# [0] QA

training_args = TrainingArguments(
    output_dir=f"{model_names[0]}",
    evaluation_strategy="steps",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=28, # Adjust batch size depending on the available GPU memory
    per_device_eval_batch_size=28,  # Evaluation batch size can be larger if evaluation is less frequent
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    eval_steps=1200,
    fp16=True
)

trainer = Trainer(
    model=finetuned_model,
    args=training_args,
    train_dataset=tokenized_mathqa["train"],
    eval_dataset=tokenized_mathqa["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

# Train the Model
trainer.train()

Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1200,1.5519,1.567401,0.299798,0.296425,0.301775,0.295354
2400,1.5254,1.524914,0.309433,0.301113,0.329206,0.30112


TrainOutput(global_step=3183, training_loss=1.5484931722887274, metrics={'train_runtime': 2577.7112, 'train_samples_per_second': 34.56, 'train_steps_per_second': 1.235, 'total_flos': 1.037760312390336e+17, 'train_loss': 1.5484931722887274, 'epoch': 3.0})

In [24]:
# [0] QA

idx = 0

training_args = TrainingArguments(
    output_dir=f"{model_names[idx]}",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=28, # Adjust batch size depending on the available GPU memory
    per_device_eval_batch_size=28,  # Evaluation batch size can be larger if evaluation is less frequent
    num_train_epochs=3, # This could be raised to more than 5 epochs
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    fp16=True
)

mqa_trainer = Trainer(
    model=models[model_names[idx]],
    args=training_args,
    train_dataset=datasets[model_names[idx]]["train"],
    eval_dataset=datasets[model_names[idx]]["validation"],
    tokenizer=tokenizers[model_names[idx]],
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizers[model_names[idx]]),
    compute_metrics=compute_metrics,
)

# Train the Model
mqa_trainer.train()

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,1.5076,1.490123,0.337217,0.332791,0.336564,0.332084
2,1.4244,1.45836,0.3594,0.355998,0.361534,0.354532
3,1.3553,1.46309,0.379341,0.377391,0.381881,0.375967


TrainOutput(global_step=3183, training_loss=1.4295901467206154, metrics={'train_runtime': 2578.2236, 'train_samples_per_second': 34.553, 'train_steps_per_second': 1.235, 'total_flos': 1.037760312390336e+17, 'train_loss': 1.4295901467206154, 'epoch': 3.0})

In [25]:
mqa_trainer.push_to_hub()

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

training_args.bin:   0%|          | 0.00/4.92k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/nickrwu/roberta-mqa/commit/aafccb99903aacb389e6eac63ffb2b95e8e63761', commit_message='End of training', commit_description='', oid='aafccb99903aacb389e6eac63ffb2b95e8e63761', pr_url=None, pr_revision=None, pr_num=None)

In [26]:
finetuned_results[model_names[idx]] = mqa_trainer.evaluate(datasets[model_names[idx]]['test'])

print(finetuned_results[model_names[idx]])

{'eval_loss': 1.4505012035369873, 'eval_accuracy': 0.3791596638655462, 'eval_f1': 0.37570327300169265, 'eval_precision': 0.3797368954737583, 'eval_recall': 0.3745757529255687, 'eval_runtime': 27.8536, 'eval_samples_per_second': 106.808, 'eval_steps_per_second': 3.842, 'epoch': 3.0}


#### 6.3.2: RoBERTa-MQA-RAT
`INPUT: [Category] [Question] [Context] [Option]`

In [16]:
# [1] QA, Rationale

idx = 1

training_args = TrainingArguments(
    output_dir=f"{model_names[idx]}",
    evaluation_strategy="steps",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8, # Adjust batch size depending on the available GPU memory
    per_device_eval_batch_size=16,  # Evaluation batch size can be larger if evaluation is less frequent
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    eval_steps=1200,
    fp16=True
)

mqa_rat_trainer = Trainer(
    model=models[model_names[idx]],
    args=training_args,
    train_dataset=datasets[model_names[idx]]["train"],
    eval_dataset=datasets[model_names[idx]]["validation"],
    tokenizer=tokenizers[model_names[idx]],
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizers[model_names[idx]]),
    compute_metrics=compute_metrics,
)

# Train the Model
mqa_rat_trainer.train()

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1200,1.4516,1.404326,0.404212,0.40137,0.411096,0.400788
2400,1.3834,1.341952,0.443424,0.441747,0.444677,0.441811
3600,1.3342,1.330794,0.451266,0.448872,0.454032,0.447026
4800,1.263,1.24129,0.490701,0.489656,0.494118,0.488115
6000,1.2209,1.209765,0.509523,0.507899,0.513351,0.505909
7200,1.1856,1.180378,0.517365,0.515859,0.519975,0.513916
8400,1.1134,1.152672,0.533722,0.531562,0.537305,0.529426
9600,1.0924,1.130727,0.545597,0.544,0.547505,0.542483
10800,1.0556,1.116104,0.551199,0.549191,0.552186,0.547789


TrainOutput(global_step=11136, training_loss=1.2432966461811943, metrics={'train_runtime': 8945.165, 'train_samples_per_second': 9.959, 'train_steps_per_second': 1.245, 'total_flos': 2.59440078097584e+17, 'train_loss': 1.2432966461811943, 'epoch': 3.0})

In [17]:
mqa_rat_trainer.push_to_hub()

Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/4.92k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/nickrwu/roberta-mqa-rat/commit/803865390a227030cab0e8806e18baca9d4209be', commit_message='End of training', commit_description='', oid='803865390a227030cab0e8806e18baca9d4209be', pr_url=None, pr_revision=None, pr_num=None)

In [19]:
finetuned_results[model_names[idx]] = mqa_rat_trainer.evaluate(datasets[model_names[idx]]['test'])

print(f"{model_names[idx]}: {finetuned_results[model_names[idx]]}")

roberta-mqa-rat: {'eval_loss': 1.1457306146621704, 'eval_accuracy': 0.5388235294117647, 'eval_f1': 0.5376322726900289, 'eval_precision': 0.541013500652071, 'eval_recall': 0.535899919441021, 'eval_runtime': 84.424, 'eval_samples_per_second': 35.239, 'eval_steps_per_second': 2.203, 'epoch': 3.0}


#### 6.3.3: RoBERTa-MQA-FORMRAT
`INPUT: [Category] [Question] [Context] [Option] [Formula]`

In [73]:
# [2] QA, Rationale, Formula

idx = 2

training_args = TrainingArguments(
    output_dir=f"{model_names[idx]}",
    evaluation_strategy="steps",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4, # Adjust batch size depending on the available GPU memory
    per_device_eval_batch_size=16,  # Evaluation batch size can be larger if evaluation is less frequent
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    gradient_accumulation_steps=2,  # Use if increasing batch size is constrained by memory
    eval_steps=1200,
    fp16=True
)
mqa_formrat_trainer = Trainer(
    model=models[model_names[idx]],
    args=training_args,
    train_dataset=datasets[model_names[idx]]["train"],
    eval_dataset=datasets[model_names[idx]]["validation"],
    tokenizer=tokenizers[model_names[idx]],
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizers[model_names[idx]]),
    compute_metrics=compute_metrics,
)
# Train the Model
mqa_formrat_trainer.train()

Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1200,1.451,1.41247,0.410486,0.409301,0.41512,0.410704
2400,1.416,1.348209,0.441183,0.439376,0.443776,0.438534
3600,1.3157,1.293263,0.478826,0.477217,0.47759,0.477343
4800,1.2616,1.238934,0.503249,0.502178,0.505315,0.501149
6000,1.221,1.204925,0.505266,0.503874,0.505966,0.50287
7200,1.1556,1.17924,0.528792,0.52762,0.529538,0.526548
8400,1.082,1.15925,0.545149,0.543384,0.548717,0.541466
9600,1.0692,1.115328,0.561282,0.560626,0.564084,0.559374
10800,1.0066,1.113491,0.567107,0.565861,0.568285,0.56498


TrainOutput(global_step=11136, training_loss=1.229307026356116, metrics={'train_runtime': 9335.8873, 'train_samples_per_second': 9.542, 'train_steps_per_second': 1.193, 'total_flos': 2.59440078097584e+17, 'train_loss': 1.229307026356116, 'epoch': 3.0})

In [None]:
mqa_formrat_trainer.push_to_hub()

In [None]:
finetuned_results[model_names[idx]] = mqa_formrat_trainer.evaluate(datasets[model_names[idx]]['test'])

print(finetuned_results[model_names[idx]])

In [None]:
for key in finetuned_results.keys():
    print(f"{key}: {finetuned_results[key]}\n")

In [39]:
trainer.evaluate(mathqa['test'].map(preprocess_function, fn_kwargs={'tokenizer': tokenizer, 'mode': 0}, batched=True))

{'eval_loss': 1.609375,
 'eval_accuracy': 0.21411764705882352,
 'eval_f1': 0.185497663707947,
 'eval_precision': 0.21243028638281397,
 'eval_recall': 0.20865303907478397,
 'eval_runtime': 62.5859,
 'eval_samples_per_second': 47.535,
 'eval_steps_per_second': 2.972,
 'epoch': 3.0}

## 7: Evaluation
This section focuses on evaluating the performance of our fine-tuned Roberta model. We will use various metrics to assess the accuracy, precision, recall, and F1 score of the model on the MathQA dataset.

**Performance Metrics**
* **Accuracy:** Measure the proportion of correctly predicted answers in the test set for both models.
* **Precision:** Assess the positive predictive value to see how many of the predicted positives are actually correct.
* **Recall:** Determine the true positive rate, which indicates how many actual positives were identified correctly.
* **F1 Score:** Calculate the harmonic mean of precision and recall to provide a balance between the two metrics.

In [32]:
model_names = ['roberta-mqa', 'roberta-mqa-rat', 'roberta-mqa-formrat']

model_dict = {}
tokenizer_dict = {}
test_datasets = {}
finetuned_results = {}

for i, m in enumerate(model_names):
    model_dict[m] = AutoModelForMultipleChoice.from_pretrained(f'nickrwu/{m}')
    tokenizer_dict[m] = AutoTokenizer.from_pretrained(f'nickrwu/{m}')

    test_datasets[m] = mathqa['test'].map(preprocess_function, fn_kwargs={'tokenizer': tokenizer_dict[m], 'mode': i}, batched=True, remove_columns=mathqa["train"].column_names)



In [33]:
# Initialize Trainer
trainer_dict = {}

for i, m in enumerate(model_names):
    trainer_dict[m] = Trainer(
        model=model_dict[m],
        eval_dataset=test_datasets[m],
        tokenizer=tokenizer_dict[m],
        data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer_dict[m]),
        compute_metrics=compute_metrics
    )

In [34]:
for m in model_names:
    finetuned_results[m] = trainer_dict[m].evaluate(test_datasets[m])

for key in finetuned_results.keys():
    print(f"{key}: {finetuned_results[key]}\n")

You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


roberta-mqa: {'eval_loss': 1.4505120515823364, 'eval_accuracy': 0.3791596638655462, 'eval_f1': 0.37570327300169265, 'eval_precision': 0.3797368954737583, 'eval_recall': 0.3745757529255687, 'eval_runtime': 30.2649, 'eval_samples_per_second': 98.299, 'eval_steps_per_second': 12.291}

roberta-mqa-rat: {'eval_loss': 1.1457091569900513, 'eval_accuracy': 0.5388235294117647, 'eval_f1': 0.5376322726900289, 'eval_precision': 0.541013500652071, 'eval_recall': 0.535899919441021, 'eval_runtime': 85.4527, 'eval_samples_per_second': 34.815, 'eval_steps_per_second': 4.353}

roberta-mqa-formrat: {'eval_loss': 1.1326119899749756, 'eval_accuracy': 0.5596638655462185, 'eval_f1': 0.5576580467591747, 'eval_precision': 0.5593347827070027, 'eval_recall': 0.556618048931913, 'eval_runtime': 107.0704, 'eval_samples_per_second': 27.785, 'eval_steps_per_second': 3.474}



In [35]:
from sklearn.metrics import confusion_matrix, classification_report

idx = 2

def generate_report(trainer, test_dataset):
    # Predictions
    raw_pred, _, _ = trainer.predict(test_dataset)
    predicted_labels = np.argmax(raw_pred, axis=1)
    
    # Evaluate predictions
    true_labels = test_dataset['labels']
    accuracy = accuracy_score(true_labels, predicted_labels)
    conf_matrix = confusion_matrix(true_labels, predicted_labels)
    report = classification_report(true_labels, predicted_labels)
    
    print("Accuracy:", accuracy)
    print("Confusion Matrix:\n", conf_matrix)
    print("Classification Report:\n", report)

    return true_labels, predicted_labels

# true_labels, predicted_labels = generate_report(saved_trainer, saved_tokenized_mathqa["test"])
true_labels, predicted_labels = generate_report(trainer_dict[model_names[idx]], test_datasets[model_names[idx]])


Accuracy: 0.5596638655462185
Confusion Matrix:
 [[351  77  80  61  42]
 [ 86 322  84  68  45]
 [ 75  75 396  79  49]
 [ 73  84  58 360  49]
 [ 62  53  60  50 236]]
Classification Report:
               precision    recall  f1-score   support

           0       0.54      0.57      0.56       611
           1       0.53      0.53      0.53       605
           2       0.58      0.59      0.59       674
           3       0.58      0.58      0.58       624
           4       0.56      0.51      0.54       461

    accuracy                           0.56      2975
   macro avg       0.56      0.56      0.56      2975
weighted avg       0.56      0.56      0.56      2975



In [44]:
def print_incorrect(true_labels, predicted_labels, test_dataset, n, category_filter):
    incorrect_indices = np.where(np.array(true_labels) != predicted_labels)[0]
    incorrect_samples = tokenized_mathqa["test"].select(incorrect_indices)
    counter = 0
    for i, example in enumerate(incorrect_samples):
        true_label = example['labels']
        # Extract the input text
        input_text = tokenizer.decode(example['input_ids'][true_label], skip_special_tokens=True)
        
        # Check if the category matches the filter
        category_start = input_text.find("[CATEGORY]") + len("[CATEGORY] ")
        category_end = input_text.find(" ", category_start)
        category = input_text[category_start:category_end]

        if category == category_filter:
            predicted_label = predicted_labels[incorrect_indices[i]]
            answer_map = {0: "a", 1: "b", 2: "c", 3: "d", 4: "e"}

            print(f"\n[True ({answer_map[true_label]})] \n{tokenizer.decode(example['input_ids'][true_label], skip_special_tokens=True)}")
            print(f"\n[Predicted ({answer_map[predicted_label]})] \n{tokenizer.decode(example['input_ids'][predicted_label], skip_special_tokens=True)}")

            print("---------")
            counter += 1
            if counter >= n:
                break

categories = ['general', 'gain', 'physics', 'geometry', 'other', 'probability']

for c in categories:
    print_incorrect(true_labels, predicted_labels, test_datasets[model_names[idx]], 0, c)
    
                
# print_incorrect(true_labels, predicted_labels, test_datasets[model_names[idx]], 5)


[True (b)] 
 [CATEGORY] general   [PROBLEM] each week a restaurant serving mexican food uses the same volume of chili paste, which comes in either 35 - ounce cans or 25 - ounce cans of chili paste. if the restaurant must order 20 more of the smaller cans than the larger cans to fulfill its weekly needs, then how manysmallercans are required to fulfill its weekly needs?   [CONTEXT] "let x be the number of 35 ounce cans. therefore ( x + 20 ) is the number of 25 ounce cans. total volume is same, therefore 35 x = 25 ( x + 20 ) 10 x = 500 x = 50 therefore, number of 15 ounce cans = 50 + 20 =   [OPTION] b ) 70  

[Predicted (d)] 
 [CATEGORY] general   [PROBLEM] each week a restaurant serving mexican food uses the same volume of chili paste, which comes in either 35 - ounce cans or 25 - ounce cans of chili paste. if the restaurant must order 20 more of the smaller cans than the larger cans to fulfill its weekly needs, then how manysmallercans are required to fulfill its weekly needs?   [CONT