## PEFT model evaluation using [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness)

In this notebook, we evaluate a LoRA fine-tuned model on the HellaSwag task using the lm-eval-harness toolkit.

In [8]:
# Install LM-Eval
!pip install -q datasets evaluate lm_eval




### First, check the accuracy on the HellaSwag task for the base BERT model without fine-tuning

In [4]:
import lm_eval


output = lm_eval.simple_evaluate(
    model="hf",
    model_args={
        "pretrained": "bert-base-cased",
        "dtype": "bfloat16",
    },
    tasks="hellaswag",
    device="cuda:0",
    batch_size=128,
    log_samples=False,
)
output["results"]

2024-11-01:20:45:03,210 INFO     [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2024-11-01:20:45:03,211 INFO     [evaluator.py:188] Initializing hf model, with arguments: {'pretrained': 'bert-base-cased', 'dtype': 'bfloat16'}
2024-11-01:20:45:03,213 INFO     [huggingface.py:129] Using device 'cuda:0'
2024-11-01:20:45:03,450 INFO     [huggingface.py:481] Using model type 'default'
2024-11-01:20:45:03,741 INFO     [huggingface.py:365] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
2024-11-01:20:45:15,862 INFO     [task.py:415] Building contexts for hellaswag on rank 0...
100%|██████████| 10042/10042 [00:02<00:00, 4477.77it/s]
2024-11-01:20:45:18,875 INFO     [evaluator.py:489] Running loglikelihood requests
Running loglikelihood requests: 100%|██████████| 40168/40

{'hellaswag': {'alias': 'hellaswag',
  'acc,none': 0.24905397331208923,
  'acc_stderr,none': 0.004315812968431576,
  'acc_norm,none': 0.2439753037243577,
  'acc_norm_stderr,none': 0.004286002710084076}}
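
To put the baseline score in context, it helps to compare it with random guessing. The sketch below is a rough check, not part of lm-eval-harness: it assumes every HellaSwag item has the same number of candidate endings, which it infers from the request and document counts in the log above (40168 requests for 10042 documents), and it uses the reported standard error to see how far the accuracy is from chance.

```python
# Values copied from the results printed above.
acc = 0.24905397331208923
acc_stderr = 0.004315812968431576

# Assumption: every item has the same number of candidate endings, so the
# count per item is requests / documents as reported in the log.
num_requests = 40168
num_docs = 10042
num_choices = num_requests // num_docs
chance = 1.0 / num_choices

# Distance from chance, measured in standard errors.
z = (acc - chance) / acc_stderr
within_two_stderr = abs(z) < 2.0

print(f"choices per item: {num_choices}, chance accuracy: {chance:.2f}")
print(f"z-score vs. chance: {z:.2f}, within 2 stderr: {within_two_stderr}")
```

Under these assumptions the base model's accuracy is statistically indistinguishable from guessing among the endings.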

### Now let's fine-tune BERT on the IMDB dataset (this is only a demonstration; fine-tuning on IMDB may not improve the scores on the HellaSwag task)

In [9]:
# Import necessary libraries
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, BertForSequenceClassification, Trainer, TrainingArguments

from peft import LoraConfig, TaskType, get_peft_model

In [10]:
# Configure LoRA for Sequence Classification
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,        # Set task type to sequence classification
    target_modules=["query", "key"]    # Specify target modules for LoRA tuning
)

# Initialize the BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained(
    'bert-base-cased',
    num_labels = 2
)

# Wrap the model with LoRA configuration
model = get_peft_model(model, lora_config)

model.print_trainable_parameters()

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 296,450 || all params: 108,608,260 || trainable%: 0.2730
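
The trainable parameter count printed above can be reproduced by hand. The following is a back-of-the-envelope sketch under assumptions the notebook does not state explicitly: `bert-base-cased` has 12 layers with hidden size 768, `LoraConfig` is left at its default rank of 8, and the newly initialized classification head is trained alongside the adapters.

```python
# Assumptions (not stated in the notebook): BERT-base dimensions and the
# default LoRA rank.
hidden_size = 768
num_layers = 12
rank = 8
targets_per_layer = 2  # the "query" and "key" projections

# Each adapted linear layer gains two low-rank matrices:
# A with shape (rank, in_features) and B with shape (out_features, rank).
lora_per_module = rank * hidden_size + hidden_size * rank
lora_total = lora_per_module * targets_per_layer * num_layers

# The classification head (hidden_size -> num_labels, plus bias) is trained too.
num_labels = 2
classifier = hidden_size * num_labels + num_labels

trainable = lora_total + classifier
all_params = 108_608_260  # taken from the printed output above

print(f"LoRA params: {lora_total:,}")
print(f"classifier params: {classifier:,}")
print(f"trainable params: {trainable:,} ({100 * trainable / all_params:.4f}%)")
```

The total agrees with the `trainable params: 296,450` line, which suggests these assumptions match the configuration used here.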


In [11]:
# Load the IMDB dataset
dataset = load_dataset("imdb")

def tokenize_function(batch):
    # Pad every review to the tokenizer's maximum length and truncate longer ones
    return tokenizer(batch["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["test"]

Map: 100%|██████████| 25000/25000 [00:06<00:00, 3799.73 examples/s]


In [12]:
# Define a function to compute evaluation metrics

# Load the metric once instead of on every evaluation call
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

In [13]:
# Configure training arguments
training_args = TrainingArguments("bert-lora-imdb",
    eval_strategy="epoch",
    per_device_train_batch_size=32,  # reduce this if you hit an out-of-memory error
    per_device_eval_batch_size=64,
    save_strategy="epoch",
    learning_rate=2e-3,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    do_eval=True,
    do_predict=True,
    metric_for_best_model="accuracy",
    report_to="none")

# Initialize the Trainer for the model training loop
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

# Start training
trainer.train()

{'loss': 0.3296, 'grad_norm': 1.0583174228668213, 'learning_rate': 0.0017442455242966753, 'epoch': 0.64}
{'eval_loss': 0.2824118435382843, 'eval_accuracy': 0.8816, 'eval_runtime': 308.5134, 'eval_samples_per_second': 81.034, 'eval_steps_per_second': 1.267, 'epoch': 1.0}
{'loss': 0.2756, 'grad_norm': 1.9588807821273804, 'learning_rate': 0.0014884910485933505, 'epoch': 1.28}
{'loss': 0.2461, 'grad_norm': 0.7111016511917114, 'learning_rate': 0.0012327365728900255, 'epoch': 1.92}
{'eval_loss': 0.2278510481119156, 'eval_accuracy': 0.9102, 'eval_runtime': 304.8594, 'eval_samples_per_second': 82.005, 'eval_steps_per_second': 1.283, 'epoch': 2.0}
{'loss': 0.2154, 'grad_norm': 1.8200898170471191, 'learning_rate': 0.000976982097186701, 'epoch': 2.56}
{'eval_loss': 0.20936396718025208, 'eval_accuracy': 0.91996, 'eval_runtime': 299.637, 'eval_samples_per_second': 83.434, 'eval_steps_per_second': 1.305, 'epoch': 3.0}
{'loss': 0.1988, 'grad_norm': 2.2769970893859863, 'learning_rate': 0.000721227621483376, 'epoch': 3.2}
{'loss': 0.1635, 'grad_norm': 1.54856538772583, 'learning_rate': 0.00046547314578005116, 'epoch': 3.84}
{'eval_loss': 0.21944810450077057, 'eval_accuracy': 0.92164, 'eval_runtime': 303.6132, 'eval_samples_per_second': 82.342, 'eval_steps_per_second': 1.288, 'epoch': 4.0}
{'loss': 0.1355, 'grad_norm': 1.006062626838684, 'learning_rate': 0.00020971867007672635, 'epoch': 4.48}
{'eval_loss': 0.2205527275800705, 'eval_accuracy': 0.92556, 'eval_runtime': 302.417, 'eval_samples_per_second': 82.667, 'eval_steps_per_second': 1.293, 'epoch': 5.0}
{'train_runtime': 5237.7573, 'train_samples_per_second': 23.865, 'train_steps_per_second': 0.747, 'train_loss': 0.213202116983321, 'epoch': 5.0}

TrainOutput(global_step=3910, training_loss=0.213202116983321, metrics={'train_runtime': 5237.7573, 'train_samples_per_second': 23.865, 'train_steps_per_second': 0.747, 'total_flos': 3.300271872e+16, 'train_loss': 0.213202116983321, 'epoch': 5.0})
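
The `global_step=3910` in the output above follows from the dataset size and the training arguments. Here is a small sketch of that arithmetic, assuming a single device and no gradient accumulation (neither is stated in the notebook, but both are consistent with the reported step count):

```python
import math

train_examples = 25_000      # size of the IMDB train split, as shown by the Map progress bar
per_device_batch_size = 32   # per_device_train_batch_size in TrainingArguments
num_epochs = 5               # num_train_epochs in TrainingArguments
num_devices = 1              # assumption: single-GPU run, no gradient accumulation

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(train_examples / (per_device_batch_size * num_devices))
total_steps = steps_per_epoch * num_epochs

print(f"steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
```

Because `save_strategy="epoch"` saves a checkpoint at the end of each epoch, the final one lands at step 3910, which is where the `checkpoint-3910` directory name used in the next section comes from.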

### Now take the fine-tuned LoRA checkpoint and check its accuracy on the HellaSwag task
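
If you are unsure which checkpoint directory to point at, a helper along these lines can locate the most recent one. This is a hypothetical convenience function, not part of PEFT or lm-eval-harness; it only assumes the `Trainer` naming scheme `checkpoint-<step>` inside the output directory.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None

    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for path in root.iterdir():
        match = pattern.match(path.name)
        if path.is_dir() and match:
            candidates.append((int(match.group(1)), path))

    if not candidates:
        return None
    # Pick the directory with the largest step number.
    return max(candidates, key=lambda item: item[0])[1]


# Prints None if the training output directory does not exist yet.
print(latest_checkpoint("./bert-lora-imdb"))
```

Note that the latest checkpoint is not necessarily the best one. In this run the highest `eval_accuracy` was reached at the final epoch, so the two coincide; in general, `trainer.state.best_model_checkpoint` records the path chosen by `load_best_model_at_end`.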

In [16]:
# Use the path of your checkpoint here
output = lm_eval.simple_evaluate(
    model="hf",
    model_args={
        "pretrained": "bert-base-cased",
        "peft": "./bert-lora-imdb/checkpoint-3910",
        "dtype": "bfloat16",
    },
    tasks="hellaswag",
    device="cuda:0",
    batch_size=128,
    log_samples=False,
)

output["results"]

2024-11-01:23:37:57,640 INFO     [evaluator.py:164] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2024-11-01:23:37:57,641 INFO     [evaluator.py:188] Initializing hf model, with arguments: {'pretrained': 'bert-base-cased', 'peft': './bert-lora-imdb/checkpoint-3910', 'dtype': 'bfloat16'}
2024-11-01:23:37:57,643 INFO     [huggingface.py:129] Using device 'cuda:0'
2024-11-01:23:37:57,891 INFO     [huggingface.py:481] Using model type 'default'
2024-11-01:23:37:58,161 INFO     [huggingface.py:365] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda:0'}
If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
2024-11-01:23:38:10,295 INFO     [task.py:415] Building contexts for hellaswag on rank 0...
100%|██████████| 10042/10042 [00:02<00:00, 4453.89it/s]
2024-11-01:23:38:13,313 INFO     [evaluator.py:489] Running loglikelihood requests
Running logli

{'hellaswag': {'alias': 'hellaswag',
  'acc,none': 0.2535351523600876,
  'acc_stderr,none': 0.00434145484189232,
  'acc_norm,none': 0.24875522804222266,
  'acc_norm_stderr,none': 0.0043140816086246455}}
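
To judge whether the small change between the two runs is meaningful, the two accuracies can be compared using their reported standard errors. The sketch below is a rough check under a simplifying assumption: it treats the two runs as independent samples, although both were scored on the same HellaSwag items, so a paired test would be more precise.

```python
import math

# Values copied from the two result dictionaries in this notebook.
base_acc, base_stderr = 0.24905397331208923, 0.004315812968431576
lora_acc, lora_stderr = 0.2535351523600876, 0.00434145484189232

diff = lora_acc - base_acc

# Assumption: independent runs, so the variances add.
combined_stderr = math.sqrt(base_stderr**2 + lora_stderr**2)
z = diff / combined_stderr
significant = abs(z) >= 2.0

print(f"difference: {diff:.4f}, combined stderr: {combined_stderr:.4f}")
print(f"z-score: {z:.2f}, significant at ~2 stderr: {significant}")
```

Under this approximation the difference stays well within two standard errors, which is consistent with the earlier caveat that fine-tuning on IMDB may not improve HellaSwag scores.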