#### Lab 13: Autoregressive Model Quality

Goal: apply quality metrics to an autoregressive (AR) model and interpret the results
* Load an AR model (such as BLOOM)
* Choose an input prompt (e.g. "AI is important because...")
* Run a forward pass to generate output
* Select some metrics and interpret the model quality
* Hint: you might need to use a reference text

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from bert_score import score as bert_score
from rouge_score import rouge_scorer

# Step 1: Load model & tokenizer
model_name = "bigscience/bloomz-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Step 2: Set prompt and get output
prompt = "The deepest part of the ocean is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Define the reference text (BERTScore and ROUGE both compare against it)
reference_text = "The deepest part of the ocean is Challenger Deep"

# Step 3: Print model output
print("Prompt:", prompt)
print("Generated:", generated_text)
print("Reference:", reference_text)

# Step 4: BERTScore evaluation
P, R, F1 = bert_score([generated_text], [reference_text], lang="en", verbose=False)
print(f"\nBERTScore - Precision: {P[0]:.4f}, Recall: {R[0]:.4f}, F1: {F1[0]:.4f}")

# Step 5: ROUGE evaluation
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
rouge = scorer.score(reference_text, generated_text)
print(f"ROUGE-L: {rouge['rougeL'].fmeasure:.4f}")




Prompt: The deepest part of the ocean is
Generated: The deepest part of the ocean is the deepest part of the sea
Reference: The deepest part of the ocean is Challenger Deep


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



BERTScore - Precision: 0.9338, Recall: 0.9233, F1: 0.9285
ROUGE-L: 0.6364
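
To help interpret the ROUGE-L number above, here is a minimal sketch of how an LCS-based ROUGE-L F-measure can be computed by hand. The helpers `lcs_length` and `rouge_l_f1` are illustrative names, not part of the `rouge_score` package, and the sketch assumes plain lowercase whitespace tokenization with no stemming, so it may differ from the library on other inputs. It also scores the continuations alone, on the assumption that the prompt shared by both texts is what drives most of the overlap.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F-measure from LCS precision and recall (no stemming)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)  # share of generated tokens in the LCS
    recall = lcs / len(ref_tokens)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


prompt = "The deepest part of the ocean is"
generated = "The deepest part of the ocean is the deepest part of the sea"
reference = "The deepest part of the ocean is Challenger Deep"

# Full strings: the 7-token prompt is the entire common subsequence.
print(f"ROUGE-L (with prompt):    {rouge_l_f1(reference, generated):.4f}")

# Continuations only: strip the shared prompt from both sides.
gen_tail = generated[len(prompt):].strip()
ref_tail = reference[len(prompt):].strip()
print(f"ROUGE-L (continuation):   {rouge_l_f1(ref_tail, gen_tail):.4f}")
```

Under these assumptions the first value matches the 0.6364 reported above (LCS of 7 tokens, precision 7/13, recall 7/9), while the continuation-only score from this sketch is 0.0000. That suggests the overlap comes from the prompt that both texts repeat rather than from the model naming Challenger Deep, so scoring only the generated continuation gives a more honest picture of quality.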
