In [1]:
import torch
from rich.markdown import Markdown

# Use the GPU when one is available, otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

cuda


In [2]:
%cd ..

/home/oscarn/flan-gpt2


In [3]:
from models.model_utils import GPT2Model

wrapped_model = GPT2Model("output/gpt2_med_lora-distill_2000-flan_t5_xl_qntz", device, peft=True)
peft_model = wrapped_model.get_model(peft=True)
peft_tokenizer = wrapped_model.get_tokenizer()

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


In [4]:
from eval.eval_utils import Evaluation
from data.data_utils import create_instruct_dataset


# Note: this name shadows Python's built-in eval(); the cells below rely on it.
eval = Evaluation(peft_model, peft_tokenizer, device)

2025-09-09 04:59:59,807 - datasets - INFO - PyTorch version 2.5.1 available.


In [5]:
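# Number of examples to draw per task.
num_samples = 100
# GPT-2 is a causal (decoder-only) LM; the cells below pass return_full_text=not CAUSAL_LM.
CAUSAL_LM = True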
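
Every task cell in this notebook repeats the same three steps: take one example, generate a prediction for its prompt, and fill a Markdown template. As a minimal sketch, those steps could be folded into one helper. The names `render_example`, `fake_generate`, and `TASK_KWARGS` are hypothetical and not part of this repository, and the stub generator stands in for `eval.generate`, whose exact behavior is assumed here to be "list of prompts in, list of strings out".

```python
from typing import Callable, Mapping, Sequence

# Same layout as the notebook's `template`, repeated so the sketch is self-contained.
TEMPLATE = """
## INPUT PROMPT
{prompt}
## GROUND_TRUTH
{completion}
## MODEL OUTPUT
{prediction}
"""

# Per-task overrides copied from the task cells (CoQA and Python Code use longer limits).
TASK_KWARGS = {
    "coqa": {"max_tokens": 150},
    "python_code": {"max_tokens": 1000},
}


def render_example(
    example: Mapping[str, str],
    generate: Callable[..., Sequence[str]],
    template: str = TEMPLATE,
    **generate_kwargs,
) -> str:
    """Generate a prediction for one example and fill the Markdown template."""
    # `generate` is assumed to accept a list of prompts and return a list of strings.
    prediction = generate([example["prompt"]], **generate_kwargs)[0]
    return template.format(
        prompt=example["prompt"],
        completion=example["completion"],
        prediction=prediction,
    )


def fake_generate(prompts: Sequence[str], **kwargs) -> list:
    """Hypothetical stand-in for the real model call; returns one string per prompt."""
    return [f"(stub output for: {p})" for p in prompts]


demo = render_example(
    {"prompt": "Is the sky blue?", "completion": "yes"},
    fake_generate,
    return_full_text=False,
    **TASK_KWARGS.get("coqa", {}),
)
print(demo)
```

With the real objects, a task cell would reduce to something like `Markdown(render_example(dataset[0], eval.generate, return_full_text=not CAUSAL_LM, **TASK_KWARGS.get(task, {})))`, where `task` is the dataset name passed to `create_instruct_dataset`.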

In [6]:
template = """
## INPUT PROMPT
{prompt}
## GROUND_TRUTH
{completion}
## MODEL OUTPUT
{prediction}
"""

## ANLI

In [7]:
dataset = create_instruct_dataset(num_samples, ["anli"])
example = dataset[0]
prediction = eval.generate([example["prompt"]], return_full_text=not CAUSAL_LM)[0]
Markdown(template.format(prompt=example["prompt"], 
                         completion=example["completion"], 
                         prediction=prediction))

Map:   0%|          | 0/16946 [00:00<?, ? examples/s]

Filter:   0%|          | 0/16946 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating responses... : 100%|██████████| 1/1 [00:01<00:00,  1.02s/it]


## Common Gen

In [8]:
dataset = create_instruct_dataset(num_samples, ["common_gen"])
example = dataset[0]
prediction = eval.generate([example["prompt"]], return_full_text=not CAUSAL_LM)[0]
Markdown(template.format(prompt=example["prompt"], 
                         completion=example["completion"], 
                         prediction=prediction))

Generating responses... : 100%|██████████| 1/1 [00:00<00:00,  1.35it/s]


## SQUAD

In [9]:
dataset = create_instruct_dataset(num_samples, ["squad"])
example = dataset[0]
prediction = eval.generate([example["prompt"]], return_full_text=not CAUSAL_LM)[0]
Markdown(template.format(prompt=example["prompt"], 
                         completion=example["completion"], 
                         prediction=prediction))

Using the latest cached version of the dataset since rajpurkar/squad couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'plain_text' at /home/oscarn/.cache/huggingface/datasets/rajpurkar___squad/plain_text/0.0.0/7b6d24c440a36b6815f21b70d25016731768db1f (last modified on Fri Nov 22 00:36:36 2024).


Filter:   0%|          | 0/87599 [00:00<?, ? examples/s]

Map:   0%|          | 0/87241 [00:00<?, ? examples/s]

Filter:   0%|          | 0/87241 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating responses... : 100%|██████████| 1/1 [00:00<00:00,  1.46it/s]


## Cosmos QA

In [12]:
dataset = create_instruct_dataset(num_samples, ["cosmos_qa"])
example = dataset[0]
prediction = eval.generate([example["prompt"]], return_full_text=not CAUSAL_LM)[0]
Markdown(template.format(prompt=example["prompt"], 
                         completion=example["completion"], 
                         prediction=prediction))

Using the latest cached version of the module from /home/oscarn/.cache/huggingface/modules/datasets_modules/datasets/allenai--cosmos_qa/3e18538cbfdb2c04189b16642715f0f6da3e97ed5df0aadcec3641245b2cf157 (last modified on Fri Nov 22 00:37:03 2024) since it couldn't be found locally at allenai/cosmos_qa, or remotely on the Hugging Face Hub.


Map:   0%|          | 0/25262 [00:00<?, ? examples/s]

Filter:   0%|          | 0/25262 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating responses... : 100%|██████████| 1/1 [00:00<00:00,  1.73it/s]


## CoQA

In [13]:
dataset = create_instruct_dataset(num_samples, ["coqa"])
example = dataset[0]
prediction = eval.generate([example["prompt"]], max_tokens=150, return_full_text=not CAUSAL_LM)[0]
Markdown(template.format(prompt=example["prompt"], 
                         completion=example["completion"], 
                         prediction=prediction))

Filter:   0%|          | 0/7199 [00:00<?, ? examples/s]

Map:   0%|          | 0/4063 [00:00<?, ? examples/s]

Filter:   0%|          | 0/4063 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating responses... : 100%|██████████| 1/1 [00:07<00:00,  7.74s/it]


## XSum

In [15]:
dataset = create_instruct_dataset(num_samples, ["xsum"])
example = dataset[0]
prediction = eval.generate([example["prompt"]], return_full_text=not CAUSAL_LM)[0]
Markdown(template.format(prompt=example["prompt"], 
                         completion=example["completion"], 
                         prediction=prediction))

Using the latest cached version of the dataset since EdinburghNLP/xsum couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /home/oscarn/.cache/huggingface/datasets/EdinburghNLP___xsum/default/1.2.0/40db7604fedb616a9d2b0673d11838fa5be8451c (last modified on Tue Sep  9 05:02:59 2025).


Filter:   0%|          | 0/204045 [00:00<?, ? examples/s]

Generating responses... : 100%|██████████| 1/1 [00:02<00:00,  2.75s/it]


## BoolQ

In [16]:
dataset = create_instruct_dataset(num_samples, ["bool_q"])
example = dataset[0]
prediction = eval.generate([example["prompt"]], return_full_text=not CAUSAL_LM)[0]
Markdown(template.format(prompt=example["prompt"], 
                         completion=example["completion"], 
                         prediction=prediction))

Filter:   0%|          | 0/9427 [00:00<?, ? examples/s]

Map:   0%|          | 0/9399 [00:00<?, ? examples/s]

Filter:   0%|          | 0/9399 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating responses... : 100%|██████████| 1/1 [00:01<00:00,  1.29s/it]


## Python Code

In [17]:
dataset = create_instruct_dataset(num_samples, ["python_code"])
example = dataset[0]
prediction = eval.generate([example["prompt"]], max_tokens=1000, return_full_text=not CAUSAL_LM)[0]
Markdown(template.format(prompt=example["prompt"], 
                         completion=example["completion"], 
                         prediction=prediction))

Filter:   0%|          | 0/16750 [00:00<?, ? examples/s]

Map:   0%|          | 0/16237 [00:00<?, ? examples/s]

Filter:   0%|          | 0/16237 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating responses... : 100%|██████████| 1/1 [00:03<00:00,  3.25s/it]


## Eng-Spa

In [18]:
dataset = create_instruct_dataset(num_samples, ["eng_spa"])
example = dataset[0]
prediction = eval.generate([example["prompt"]], return_full_text=not CAUSAL_LM)[0]
Markdown(template.format(prompt=example["prompt"], 
                         completion=example["completion"], 
                         prediction=prediction))

Generating responses... : 100%|██████████| 1/1 [00:00<00:00,  2.13it/s]


## PAWS

In [19]:
dataset = create_instruct_dataset(num_samples, ["paws"])
example = dataset[0]
prediction = eval.generate([example["prompt"]], return_full_text=not CAUSAL_LM)[0]
Markdown(template.format(prompt=example["prompt"], 
                         completion=example["completion"], 
                         prediction=prediction))

Map:   0%|          | 0/49401 [00:00<?, ? examples/s]

Filter:   0%|          | 0/49401 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating responses... : 100%|██████████| 1/1 [00:10<00:00, 10.12s/it]


## Quora

In [24]:
dataset = create_instruct_dataset(num_samples, ["quora"])
example = dataset[0]
prediction = eval.generate([example["prompt"]], return_full_text=not CAUSAL_LM)[0]
Markdown(template.format(prompt=example["prompt"], 
                         completion=example["completion"], 
                         prediction=prediction))

Filter:   0%|          | 0/56402 [00:00<?, ? examples/s]

Generating responses... : 100%|██████████| 1/1 [00:01<00:00,  1.13s/it]


## Alpaca

In [25]:
dataset = create_instruct_dataset(num_samples, ["alpaca"])
example = dataset[0]
prediction = eval.generate([example["prompt"]], return_full_text=not CAUSAL_LM)[0]
Markdown(template.format(prompt=example["prompt"], 
                         completion=example["completion"], 
                         prediction=prediction))

Filter:   0%|          | 0/52002 [00:00<?, ? examples/s]

Map:   0%|          | 0/20100 [00:00<?, ? examples/s]

Filter:   0%|          | 0/20100 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating responses... : 100%|██████████| 1/1 [00:01<00:00,  1.13s/it]
