## PEFT model evaluation using lm-eval-harness

In this notebook, we evaluate a LoRA fine-tuned model on the hellaswag task using the lm-eval-harness toolkit, and compare it against the base model.

In [1]:
# Install LM-Eval
!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git

Collecting git+https://github.com/EleutherAI/lm-evaluation-harness.git
  Cloning https://github.com/EleutherAI/lm-evaluation-harness.git to /tmp/pip-req-build-njrcgnmf
  Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/lm-evaluation-harness.git /tmp/pip-req-build-njrcgnmf
  Resolved https://github.com/EleutherAI/lm-evaluation-harness.git to commit 838a3e03c17efea79a0019153ed0544ee3630554
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting evaluate (from lm_eval==0.4.5)
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting jsonlines (from lm_eval==0.4.5)
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Collecting peft>=0.2.0 (from lm_eval==0.4.5)
  Downloading peft-0.13.2-py3-none-any.whl.metadata (13 kB)
Collecting pytablewriter (from lm_eval==0.4.5)
  Downloading pytablewriter-1.2.0-py3-none-a

### First, check the accuracy on the hellaswag task for the base BERT model without fine-tuning

In [2]:
!lm_eval \
--model hf \
--model_args pretrained=bert-base-cased,dtype=bfloat16 \
--tasks hellaswag \
--device cuda:0 \
--batch_size 128 \
--output_path ./results --log_samples

config.json: 100%|█████████████████████████████| 570/570 [00:00<00:00, 3.92MB/s]
tokenizer_config.json: 100%|██████████████████| 49.0/49.0 [00:00<00:00, 356kB/s]
vocab.txt: 100%|█████████████████████████████| 213k/213k [00:00<00:00, 3.89MB/s]
tokenizer.json: 100%|████████████████████████| 436k/436k [00:00<00:00, 19.8MB/s]
model.safetensors: 100%|██████████████████████| 436M/436M [00:02<00:00, 163MB/s]
If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
hellaswag.py: 100%|████████████████████████| 4.36k/4.36k [00:00<00:00, 29.3MB/s]
dataset_infos.json: 100%|██████████████████| 2.53k/2.53k [00:00<00:00, 22.2MB/s]
README.md: 100%|███████████████████████████| 6.84k/6.84k [00:00<00:00, 22.6MB/s]
Downloading data: 100%|█████████████████████| 47.5M/47.5M [00:00<00:00, 143MB/s]
Downloading data: 100%|█████████████████████| 11.8M/11.8M [00:00<00:00, 214MB/s]
Downloading data: 100%|█████████████████████| 12.2M/12.2M [00:00<00:00, 201MB/s]
Generating train split: 100%|███
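
HellaSwag is a four-way multiple-choice task, so the harness scores each candidate ending separately and picks the best one. The sketch below illustrates that idea under stated assumptions: `score_multiple_choice` is a hypothetical helper (not part of lm-eval), the log-likelihood values are made up, and the length normalisation is a simplified reading of how `acc_norm` differs from `acc`. The document count of 10042 is taken from the evaluation log later in this notebook.

```python
def score_multiple_choice(loglikelihoods, endings, gold_index):
    """Return (acc, acc_norm) for a single multiple-choice document.

    loglikelihoods: one summed log-probability per candidate ending.
    endings: the candidate ending strings, in the same order.
    gold_index: index of the correct ending.
    """
    indices = range(len(endings))
    # acc: choose the ending with the highest raw log-likelihood.
    pred = max(indices, key=lambda i: loglikelihoods[i])
    # acc_norm: divide by ending length so long endings are not penalised
    # simply for containing more tokens (simplified assumption).
    pred_norm = max(indices, key=lambda i: loglikelihoods[i] / len(endings[i]))
    return float(pred == gold_index), float(pred_norm == gold_index)


# Toy values, invented for illustration only (not model output).
endings = ["sits.", "walks slowly across the room to the door.", "flies.", "eats."]
toy_loglikelihoods = [-6.0, -16.0, -7.0, -8.0]
acc, acc_norm = score_multiple_choice(toy_loglikelihoods, endings, gold_index=1)

# One log-likelihood request per (document, ending) pair.
num_docs = 10042
num_choices = 4
num_requests = num_docs * num_choices
chance_accuracy = 1 / num_choices  # random-guess baseline for four choices
```

With four endings per document, a model that guesses at random lands near 25% accuracy, which is a useful yardstick when reading the scores for a model such as BERT that was not trained for left-to-right scoring.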

### Now let's fine-tune BERT on the IMDb dataset

In [3]:
# Import necessary libraries
from datasets import load_dataset
from transformers import AutoTokenizer, BertForSequenceClassification, TrainingArguments, Trainer
from peft import LoraConfig, TaskType, get_peft_model
import numpy as np
import evaluate

In [4]:
# Configure LoRA for Sequence Classification
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,        # Set task type to sequence classification
    target_modules=["query", "key"]    # Specify target modules for LoRA tuning
)

# Initialize the BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained(
    'bert-base-cased',
    num_labels = 2
)

# Wrap the model with LoRA configuration
model = get_peft_model(model, lora_config)

model.print_trainable_parameters()

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 296,450 || all params: 108,608,260 || trainable%: 0.2730
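
The trainable-parameter count printed above can be reproduced by hand. The sketch below is a back-of-the-envelope check, assuming the `LoraConfig` default rank of 8 (the cell does not set `r`), the standard bert-base dimensions, and that the classification head stays trainable for `TaskType.SEQ_CLS`.

```python
hidden_size = 768       # bert-base hidden width
num_layers = 12         # bert-base encoder layers
rank = 8                # assumption: LoraConfig default r, not set in the cell
num_labels = 2
targets_per_layer = 2   # "query" and "key"

# Each adapted Linear(768, 768) gains two low-rank matrices:
# A with shape (rank, in_features) and B with shape (out_features, rank).
lora_per_module = rank * hidden_size + hidden_size * rank
lora_total = lora_per_module * targets_per_layer * num_layers

# Classification head: weight (num_labels x hidden_size) plus bias.
classifier_params = hidden_size * num_labels + num_labels

trainable = lora_total + classifier_params
all_params = 108_608_260  # taken from the printed output above
trainable_pct = 100 * trainable / all_params

print(f"trainable params: {trainable:,} || trainable%: {trainable_pct:.4f}")
```

The LoRA matrices account for 294,912 parameters and the classifier head for the remaining 1,538, which together match the 296,450 reported by `print_trainable_parameters()`.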




In [5]:
# Load the dataset
dataset = load_dataset("imdb")

def tokenize_function(row):
    # Pad every review to the model's maximum length and truncate longer ones
    return tokenizer(row["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["test"]

README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [6]:
# Define a function to compute evaluation metrics

# Load the metric once instead of on every evaluation call
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # predicted class per example
    return metric.compute(predictions=predictions, references=labels)
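
As a quick sanity check of what `compute_metrics` returns, the same calculation can be sketched in plain Python on a toy batch. `accuracy_from_logits` is a hypothetical stand-in that avoids NumPy and `evaluate`, and the logits and labels are made up for illustration.

```python
def accuracy_from_logits(logits, labels):
    """Argmax each row of logits and compare against the reference labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy batch: four examples, two classes (negative = 0, positive = 1).
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.2], [1.5, 1.0]]
toy_labels = [0, 1, 0, 0]
result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```

Three of the four toy predictions match their labels, so the returned dictionary carries an accuracy of 0.75, in the same `{"accuracy": ...}` shape the `Trainer` expects.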

In [7]:
# Configure training arguments
training_args = TrainingArguments("bert-lora-imdb",
    eval_strategy="epoch",
    per_device_train_batch_size=32, # decrease this for OOM error
    per_device_eval_batch_size=64,
    save_strategy="epoch",
    learning_rate=2e-3,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    do_eval=True,
    do_predict=True,
    metric_for_best_model="accuracy",
    report_to="none")

# Initialize the Trainer for the model training loop
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

# Start training
trainer.train()

  with torch.cuda.device(device), torch.cuda.stream(stream), autocast(enabled=autocast_enabled):


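| Epoch | Training Loss | Validation Loss | Accuracy |
|------:|--------------:|----------------:|---------:|
| 1 | No log | 0.244326 | 0.90088 |
| 2 | 0.289700 | 0.261619 | 0.89772 |
| 3 | 0.211900 | 0.228944 | 0.91248 |
| 4 | 0.168500 | 0.200274 | 0.92484 |
| 5 | 0.168500 | 0.215479 | 0.92524 |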


Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]



TrainOutput(global_step=1955, training_loss=0.19924672899953544, metrics={'train_runtime': 7220.4597, 'train_samples_per_second': 17.312, 'train_steps_per_second': 0.271, 'total_flos': 3.300271872e+16, 'train_loss': 0.19924672899953544, 'epoch': 5.0})
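
The `global_step=1955` value determines the name of the final checkpoint directory, so it helps to see where it comes from. The sketch below is an arithmetic check using the hypothetical helper `total_optimizer_steps`. It assumes no gradient accumulation, and that the run used two GPUs: the notebook does not state this, but it is one reading consistent with the reported step count.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, num_devices, epochs):
    """Steps = ceil(examples / effective batch) per epoch, times epochs."""
    effective_batch = per_device_batch * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs


# IMDb train split has 25,000 examples; the cell uses batch size 32 and 5 epochs.
one_gpu = total_optimizer_steps(25_000, per_device_batch=32, num_devices=1, epochs=5)
two_gpus = total_optimizer_steps(25_000, per_device_batch=32, num_devices=2, epochs=5)
print(one_gpu, two_gpus)
```

On a single device the same settings would end at a different step, so the checkpoint directory name on your machine may not be `checkpoint-1955`.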

### Now take the fine-tuned LoRA checkpoint and check the accuracy on the hellaswag task
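
Because the checkpoint step depends on your hardware and batch size, it can be easier to look up the directory than to hard-code it. The following is a minimal sketch: `latest_checkpoint` and `build_model_args` are hypothetical helpers, and they assume the `Trainer` wrote `checkpoint-<step>` subdirectories under the output directory `bert-lora-imdb`.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for child in root.iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare on the numeric step so that 1955 sorts after 391.
    return max(candidates, key=lambda item: item[0])[1]


def build_model_args(base_model, adapter_dir, dtype="bfloat16"):
    """Compose the comma-separated string passed to `--model_args`."""
    return f"pretrained={base_model},dtype={dtype},peft={adapter_dir}"
```

For example, `build_model_args("bert-base-cased", latest_checkpoint("bert-lora-imdb"))` yields the string used in the command below. The latest checkpoint is not always the best one. With `load_best_model_at_end=True`, `trainer.state.best_model_checkpoint` holds the best path, which in this run coincides with the last epoch.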

In [None]:
# use the path of your checkpoint here
!lm_eval \
    --model hf \
    --model_args pretrained=bert-base-cased,dtype=bfloat16,peft=bert-lora-imdb/checkpoint-1955 \
    --tasks hellaswag \
    --device cuda:0 \
    --batch_size 128 \
    --output_path ./results \
    --log_samples

  pid, fd = os.forkpty()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`
100%|███████████████████████████████████| 10042/10042 [00:04<00:00, 2011.52it/s]
Running loglikelihood requests:   0%|                 | 0/40168 [00:00<?, ?it/s]
We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.
Running loglikelihood requests: 100%|████| 40168/40168 [05:12<00:00, 128.34it/s]
fatal: not a git repository (or any parent up to mount point /kaggle)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
hf (pretrained=bert-base-cased,dtype=bfloat16,peft=/kaggle/working/bert-lora-imdb/checkpoint-1955), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 128
|  Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag|      1|none  |     0|acc 

### You can find the detailed metrics for both models under the results directory
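
To compare the two runs side by side, the saved JSON files can be read back in Python. This is a minimal sketch that rests on assumptions about the output layout: that lm-eval writes files named `results_*.json` somewhere under `--output_path`, and that each file has a top-level `"results"` mapping keyed by task name. `collect_task_results` is a hypothetical helper, so adjust the glob pattern and keys if your version of the harness differs.

```python
import json
from pathlib import Path


def collect_task_results(results_dir, task="hellaswag"):
    """Return (file path, metrics dict) pairs for every results file found."""
    rows = []
    # Search recursively, since the files may sit in per-model subdirectories.
    for path in sorted(Path(results_dir).rglob("results_*.json")):
        with open(path, encoding="utf-8") as handle:
            payload = json.load(handle)
        metrics = payload.get("results", {}).get(task)
        if metrics is not None:
            rows.append((str(path), metrics))
    return rows


def print_task_results(results_dir, task="hellaswag"):
    """Print each results file followed by its metrics for the given task."""
    for path, metrics in collect_task_results(results_dir, task):
        print(path)
        for name, value in metrics.items():
            print(f"  {name}: {value}")
```

Calling `print_task_results("./results")` after both evaluations lists the metrics from the base model run and the LoRA run, one file per run.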