# Model Evaluation

In this notebook, we will evaluate our fine-tuned text simplification model.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


save_path = r"E:\simplification_model"

model = AutoModelForSeq2SeqLM.from_pretrained(save_path)
tokenizer = AutoTokenizer.from_pretrained(save_path)


## Generate Predictions on Validation Set
We take a sample of the validation data, pass it through the model,  
and decode the generated output into human-readable text.

In [None]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


def generate_predictions(dataset, num_samples=20):
    """Generate simplifications for a random sample of `dataset`."""
    # Fixed seed so the same examples are drawn on every run
    examples = dataset.shuffle(seed=42).select(range(num_samples))
    inputs_text = examples['source_text']
    targets_text = examples['target_text']

    preds = []
    for text in inputs_text:
        # Encode one sentence at a time, truncating long inputs to 128 tokens
        inputs_enc = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
        inputs_enc = {k: v.to(device) for k, v in inputs_enc.items()}  # move all tensors to the selected device
        outputs = model.generate(**inputs_enc, max_length=128, num_beams=4)
        decoded = tokenizer.decode(outputs[0].cpu(), skip_special_tokens=True)
        preds.append(decoded)

    return inputs_text, preds, targets_text

# val_dataset must already be loaded, with 'source_text' and 'target_text' columns
inputs, preds, targets = generate_predictions(val_dataset, num_samples=20)

## Evaluate with Metrics (BLEU, ROUGE)
These metrics score the generated predictions by their word and n-gram overlap with the reference target text.  
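
To make the comparison concrete, here is a minimal sketch of the clipped n-gram overlap that both metrics build on. It assumes naive whitespace tokenization and uses made-up sentences; the function name `ngram_overlap` is illustrative only. The `evaluate` implementations tokenize and aggregate differently, so these numbers will not match library output.

```python
from collections import Counter


def ngram_overlap(prediction, reference, n=1):
    """Return (precision, recall) of clipped n-gram overlap."""
    def ngrams(text):
        tokens = text.lower().split()  # naive whitespace tokenization (assumption)
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred_counts, ref_counts = ngrams(prediction), ngrams(reference)
    # Counter intersection keeps the minimum count per n-gram ("clipping")
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    return precision, recall


# Hypothetical sentences, not taken from the validation set
pred = "the cat sat on the mat"
ref = "the cat is on the mat"
print(ngram_overlap(pred, ref, n=1))  # unigram precision and recall
print(ngram_overlap(pred, ref, n=2))  # bigram precision and recall
```

Precision is the share of predicted n-grams that also appear in the reference, which is the quantity BLEU aggregates. Recall is the share of reference n-grams recovered by the prediction, which ROUGE also takes into account.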


In [None]:
!pip install evaluate
!pip install nltk absl-py rouge_score

import evaluate
# Load metrics
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# Drop examples that have no reference text, keeping the three lists aligned
filtered_data = [(i, p, t) for i, p, t in zip(inputs, preds, targets) if t is not None]
inputs_filtered, preds_filtered, targets_filtered = zip(*filtered_data) if filtered_data else ([], [], [])

# Compute BLEU
bleu_score = bleu.compute(predictions=list(preds_filtered), references=[[t] for t in targets_filtered])

# Compute ROUGE
rouge_score = rouge.compute(predictions=list(preds_filtered), references=list(targets_filtered))

# Print results
print("BLEU Score:", bleu_score)
print("ROUGE Score:", rouge_score)

BLEU Score: {'bleu': 0.536825251888985, 'precisions': [0.672093023255814, 0.5560975609756098, 0.49230769230769234, 0.45135135135135135], 'brevity_penalty': 1.0, 'length_ratio': 1.2078651685393258, 'translation_length': 430, 'reference_length': 356}
ROUGE Score: {'rouge1': np.float64(0.7305138188820598), 'rouge2': np.float64(0.6054983573466652), 'rougeL': np.float64(0.7025116620416674), 'rougeLsum': np.float64(0.7031786447654161)}


## Model Evaluation Insights

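- **BLEU Score:** 0.54 → Indicates substantial n-gram overlap between model predictions and reference simplifications on the 20 sampled validation examples. The brevity penalty is 1.0 and the length ratio is 1.21 (430 vs. 356 tokens), so the predictions are on average about 21% longer than the references.  
- **ROUGE Scores:**  
  - ROUGE-1: 0.73  
  - ROUGE-2: 0.61  
  - ROUGE-L: 0.70  
  These scores show that the predictions share most of their words and phrases, and much of their word order, with the references.  

**Conclusion:** On this 20-example sample, the model produces outputs that are close to the human references in content and wording. BLEU and ROUGE measure overlap only, so readability is not directly assessed by these scores.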
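
As a sanity check, the BLEU value can be reproduced from the other fields in the printed output: BLEU is the brevity penalty multiplied by the geometric mean of the 1- to 4-gram precisions. The sketch below hard-codes the numbers from the output above; the variable names are illustrative.

```python
import math

# Values copied from the printed BLEU output above
precisions = [0.672093023255814, 0.5560975609756098,
              0.49230769230769234, 0.45135135135135135]
brevity_penalty = 1.0
translation_length = 430
reference_length = 356

# BLEU = brevity penalty * geometric mean of the n-gram precisions
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
bleu_check = brevity_penalty * geo_mean

# Ratio above 1 means predictions are longer than references overall
length_ratio = translation_length / reference_length

print(f"BLEU (recomputed): {bleu_check:.4f}")
print(f"Length ratio: {length_ratio:.4f}")
```

Because the brevity penalty only penalizes predictions that are shorter than the references, it stays at 1.0 here and the BLEU score equals the geometric mean of the precisions.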


## Demo

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

model_path = r"E:\simplification_model"
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def simplify_text(sentence):
    """Return the model's simplification of a single sentence."""
    inputs_enc = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True).to(device)
    # Pass the full encoding so the attention mask is used alongside input_ids
    outputs = model.generate(
        **inputs_enc,
        max_length=60,
        num_beams=4,
        early_stopping=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Enter a sentence to simplify (or type 'quit' to exit):\n")
while True:
    text = input("Original: ")
    if text.strip().lower() in ["quit", "exit", "q"]:
        break
    simplified = simplify_text(text)
    print(f"Simplified: {simplified}")
    print("------")


Enter a sentence to simplify (or type 'quit' to exit):

Original:  the yowa era, marked by famine, ends in japan.
Simplified: the yowa era ends in japan.
------
Original:  She now has an album and a huge hit single, which topped the charts and attracted millions of views.
Simplified: she now has an album and a huge hit single.
------
Original:  it was here that he composed messiah, zadok the priest and music for the royal fireworks.
Simplified: he composed messiah, zadok the priest and music for the royal fireworks.
------
Original:  the slide rule, also known colloquially as a slipstick, is a mechanical analog computer.
Simplified: the slide rule, also known as a slipstick, is a mechanical analog computer.
------
Original:  quit
