## Natural Language Models

**Step I** is a text summarization inference and evaluation task. We use two models from Hugging Face (HF):

- "google-t5/t5-small" (a Google T5 model, pretrained on general text), and
- "ubikpt/t5-small-finetuned-cnn" (a model from the HF hub; a T5 model fine-tuned on the `CNN-DailyMail` dataset).

In [2]:
!pip install transformers datasets evaluate accelerate torch rouge_score bert_score



In [3]:
import torch
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

from rouge_score import rouge_scorer
import bert_score
from transformers import pipeline

from evaluate import load
import re

from transformers import logging

logging.set_verbosity_error()

import warnings

warnings.filterwarnings("ignore")



In [4]:
# Select the device: CUDA GPU if available, then Apple MPS, otherwise CPU
device = (
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)
device

'mps'

In [5]:
from datasets import load_dataset

# dataset details to pull
dataset_name = "cnn_dailymail"
dataset_version = "1.0.0"

# Load the first 3% of the test split for inference
dataset = load_dataset(dataset_name, dataset_version, split='test[:3%]')


In [6]:
ds = pd.DataFrame(dataset)
ds.head()

|   | article | highlights | id |
|---|---|---|---|
| 0 | (CNN)The Palestinian Authority officially beca... | Membership gives the ICC jurisdiction over all... | f001ec5c4704938247d27a44948eebb37ae98d01 |
| 1 | (CNN)Never mind cats having nine lives. A stra... | Theia, a bully breed mix, was apparently hit b... | 230c522854991d053fe98a718b1defa077a8efef |
| 2 | (CNN)If you've been following the news lately,... | Mohammad Javad Zarif has spent more time with ... | 4495ba8f3a340d97a9df1476f8a35502bcce1f69 |
| 3 | (CNN)Five Americans who were monitored for thr... | 17 Americans were exposed to the Ebola virus w... | a38e72fed88684ec8d60dd5856282e999dc8c0ca |
| 4 | (CNN)A Duke student has admitted to hanging a ... | Student is no longer on Duke University campus... | c27cf1b136cc270023de959e7ab24638021bc43f |


In [None]:
# apply the same logic as above to the dataset object, removing the CNN word from the beginning
#dataset = dataset.map(lambda x: {'article': x['article'][5:]})


In [8]:
# ds = pd.DataFrame(dataset)
# ds.head()

### google-t5-model

In [7]:
# tokenize the sentences using the tokenizer
#prefix = "summarize: "

def preprocess_function(examples, tokenizer, prefix="summarize: "):
  '''
  Preprocess the data with tokenizer
  '''

  # remove the CNN word from the article
  #examples = examples.map(lambda x: {'article': x['article'][5:]})

  # # remove all punctuation from the article with re.sub(r'[^\w\s]', '', text)
  # examples = examples.map(lambda x: {'article': re.sub(r'[^\w\s]', '', x['article'])})

  inputs = [prefix + doc for doc in examples["article"]]
  model_inputs = tokenizer(inputs, max_length=1024, padding=True, truncation=True, return_tensors="pt").to(device)

  #print("model passed to device")
  labels = tokenizer(text_target=examples["highlights"], max_length=128, padding=True, truncation=True)
  #print("label passed to device")

  model_inputs["labels"] = labels["input_ids"]
  return model_inputs
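As a rough illustration of what `preprocess_function` does to each article, the sketch below prepends the task prefix and truncates to a fixed length. It is a toy under loud assumptions: `toy_preprocess` is a hypothetical helper, whitespace splitting stands in for the T5 subword tokenizer, and the sample sentence is made up rather than taken from the dataset.

```python
def toy_preprocess(articles, prefix="summarize: ", max_length=8):
    """Prepend the task prefix, then truncate each input to max_length tokens.

    Assumption: str.split() stands in for the real subword tokenizer,
    so the token counts here are only illustrative.
    """
    inputs = [prefix + doc for doc in articles]
    return [text.split()[:max_length] for text in inputs]


# Made-up sample text, not an article from the dataset.
articles = ["The quick brown fox jumps over the lazy dog near the river bank"]
tokens = toy_preprocess(articles)
print(tokens[0])
```

In this toy the prefix counts toward the length budget, so a `max_length` of 8 leaves room for only seven words of the article.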


In [32]:
# # check the function
# # tokenize the dataset
# tokenized_dataset = dataset.map(preprocess_function , batched=True)

# # convert to pytorch format
# tokenized_dataset.set_format("torch")

# pd.DataFrame(tokenized_dataset).head()

In [8]:
def generate(modelName, data, max_tokens):

  # get the tokenizer
  checkpoint = modelName
  tokenizer = AutoTokenizer.from_pretrained(checkpoint)
  #tokenizer.pad_token = tokenizer.eos_token

  model_inputs = preprocess_function(data, tokenizer, 'summarize: ')

  #print(model_inputs.keys())

  inputs = model_inputs["input_ids"]
  labels = model_inputs["labels"]

  # get the model
  model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
  model.to(device)

  decoded_preds = []
  batch_size = 10

  # get the predictions, one batch at a time
  for i in range(0, len(inputs), batch_size):
    # inputs is already a tensor on `device`, so a slice can be passed straight to generate
    input_batch = inputs[i:i+batch_size]

    outputs = model.generate(input_batch, max_new_tokens=max_tokens, do_sample=False) # , no_repeat_ngram_size=2, early_stopping=True
    batch_preds = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
    decoded_preds.extend(batch_preds)

  # get the labels
  decoded_labels = [tokenizer.decode(label, skip_special_tokens=True) for label in labels]

  return decoded_preds, decoded_labels, model_inputs

def compute_metrics(modelName, max_tokens, decoded_preds, decoded_labels):

  # ROUGE: n-gram overlap between predictions and references
  rouge = load('rouge')
  results_rg = rouge.compute(predictions=decoded_preds, references=decoded_labels)

  # perplexity of the predictions, scored by GPT-2
  perplexity = load("perplexity", module_type="metric")
  results_pp = perplexity.compute(model_id='gpt2', predictions=decoded_preds)

  # BERTScore: embedding-based similarity between predictions and references
  bertscore = load("bertscore")
  results_br = bertscore.compute(predictions=decoded_preds, references=decoded_labels, lang="en")

  print(f"{modelName} - (Max New Tokens = {max_tokens})")
  print(f"rouge1: {results_rg['rouge1']}")
  print(f"rouge2: {results_rg['rouge2']}")
  print(f"perplexity: {results_pp['mean_perplexity']} (mean)")
  print(f"precision: {np.mean(results_br['precision'])} (mean)")
  print(f"recall: {np.mean(results_br['recall'])} (mean)")
  print(f"f1: {np.mean(results_br['f1'])} (mean)")
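To give a feel for what the `rouge1` number measures, here is a minimal sketch of a unigram-overlap F1 score. It is a simplification and not the `evaluate` implementation: `rouge1_f1` is a hypothetical helper, it tokenizes on whitespace only, and the two sentences are made up. The real metric uses its own tokenizer and aggregation, so its scores will not match this toy exactly.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a prediction and a reference (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token is matched at most as often as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up sentences: five of the six words overlap.
score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))
```

ROUGE-2 follows the same idea with pairs of adjacent words instead of single words, which is why it is always the lower of the two numbers reported here.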


### "google-t5/t5-small" with 20 max_tokens

In [57]:
modelName = "google-t5/t5-small"
max_tokens = 20
decoded_preds, decoded_labels, model_inputs = generate(modelName, dataset, max_tokens)

In [58]:
compute_metrics(modelName, max_tokens, decoded_preds, decoded_labels)

100%|██████████| 22/22 [00:03<00:00,  6.53it/s]


google-t5/t5-small - (Max New Tokens = 20)
rouge1: 0.23079182401060494
rouge2: 0.08035300172513223
perplexity: 124.46651547749838 (mean)
precision: 0.875558959055638 (mean)
recall: 0.8485299664994944 (mean)
f1: 0.8616953319397526 (mean)
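The mean F1 printed above is not simply the harmonic mean of the mean precision and mean recall: BERTScore returns one precision, recall and F1 per prediction, and `compute_metrics` averages each list separately. A small sketch of the difference, using the reported means plus two made-up (precision, recall) pairs that are assumptions for illustration only:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Means reported for google-t5/t5-small with max_new_tokens=20.
mean_p, mean_r, reported_f1 = 0.875558959055638, 0.8485299664994944, 0.8616953319397526
f1_of_means = f1(mean_p, mean_r)
print(f"F1 of the means: {f1_of_means:.4f} vs reported mean F1: {reported_f1:.4f}")

# Hypothetical per-summary scores (not from the notebook) showing why the two differ.
pairs = [(0.9, 0.5), (0.5, 0.9)]
mean_of_f1 = sum(f1(p, r) for p, r in pairs) / len(pairs)
f1_of_pair_means = f1(
    sum(p for p, _ in pairs) / len(pairs),
    sum(r for _, r in pairs) / len(pairs),
)
print(f"mean of per-pair F1: {mean_of_f1:.4f}, F1 of mean P/R: {f1_of_pair_means:.4f}")
```

The gap is small for these results, but it is worth remembering when comparing against numbers computed the other way round.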


In [12]:
decoded_preds[0] , decoded_labels[0]

('the ICC officially became the third member of the international criminal court. the court is based',
 'Membership gives the ICC jurisdiction over alleged crimes committed in Palestinian territories since last June. Israel and the United States opposed the move, which could open the door to war crimes investigations against Israelis.')

### "google-t5/t5-small" with 100 max_tokens

In [59]:
modelName = "google-t5/t5-small"
max_tokens = 100
decoded_preds, decoded_labels, model_inputs = generate(modelName, dataset, max_tokens)

In [60]:
compute_metrics(modelName, max_tokens, decoded_preds, decoded_labels)

100%|██████████| 22/22 [00:16<00:00,  1.30it/s]


google-t5/t5-small - (Max New Tokens = 100)
rouge1: 0.2805152378831797
rouge2: 0.09438852351692872
perplexity: 43.222421516197315 (mean)
precision: 0.8685887136321137 (mean)
recall: 0.8670880452446316 (mean)
f1: 0.8677291852840479 (mean)
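Perplexity drops from about 124 at 20 new tokens to about 43 at 100. Perplexity is the exponential of the mean negative log-likelihood per token, so a few very unlikely tokens can raise it sharply, for example where a summary is cut off mid-sentence, as in "the court is based". Whether that is the cause here is an assumption, not something this notebook verifies. The sketch below uses made-up token probabilities rather than GPT-2 outputs:

```python
import math


def perplexity(token_probs):
    """Exponential of the mean negative log-likelihood of the tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# Hypothetical per-token probabilities, not taken from any real model.
fluent = [0.5, 0.5, 0.5, 0.5]
choppy = [0.5, 0.5, 0.5, 0.01]  # one very unlikely token at the end

print(f"fluent: {perplexity(fluent):.2f}")
print(f"choppy: {perplexity(choppy):.2f}")
```

A single low-probability token is enough to more than double the score in this toy, which is why short, truncated outputs can look much worse on this metric.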


In [61]:
print(decoded_preds[0])
print(decoded_labels[0])

the ICC officially became the 123rd member of the international criminal court. the court is based in the Netherlands. the ICC has accepted its jurisdiction over alleged crimes committed in the occupied territories.
Membership gives the ICC jurisdiction over alleged crimes committed in Palestinian territories since last June. Israel and the United States opposed the move, which could open the door to war crimes investigations against Israelis.


### "ubikpt/t5-small-finetuned-cnn" with 100 max_tokens

In [9]:
modelName = "ubikpt/t5-small-finetuned-cnn"
max_tokens = 100
decoded_preds, decoded_labels, model_inputs = generate(modelName, dataset, max_tokens)

In [10]:
compute_metrics(modelName, max_tokens, decoded_preds, decoded_labels)

100%|██████████| 22/22 [00:14<00:00,  1.57it/s]


ubikpt/t5-small-finetuned-cnn - (Max New Tokens = 100)
rouge1: 0.2710688164892585
rouge2: 0.10014986138804857
perplexity: 79.34572821078093 (mean)
precision: 0.875644208210102 (mean)
recall: 0.8613906905271005 (mean)
f1: 0.8683256829994312 (mean)
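To compare the three runs side by side, the sketch below collects the numbers printed above and picks the best run per metric. The run labels and the `best_by_metric` dictionary are names introduced here for illustration, and treating lower perplexity as better is an assumption based on the usual convention.

```python
# Metrics copied from the three runs reported above (bertscore_f1 is the mean F1).
runs = {
    "google-t5/t5-small (20)": {
        "rouge1": 0.23079182401060494, "rouge2": 0.08035300172513223,
        "perplexity": 124.46651547749838, "bertscore_f1": 0.8616953319397526,
    },
    "google-t5/t5-small (100)": {
        "rouge1": 0.2805152378831797, "rouge2": 0.09438852351692872,
        "perplexity": 43.222421516197315, "bertscore_f1": 0.8677291852840479,
    },
    "ubikpt/t5-small-finetuned-cnn (100)": {
        "rouge1": 0.2710688164892585, "rouge2": 0.10014986138804857,
        "perplexity": 79.34572821078093, "bertscore_f1": 0.8683256829994312,
    },
}

# Assumption: lower perplexity is better; every other metric is higher-is-better.
lower_is_better = {"perplexity"}

best_by_metric = {}
for metric in ["rouge1", "rouge2", "perplexity", "bertscore_f1"]:
    pick = min if metric in lower_is_better else max
    best_by_metric[metric] = pick(runs, key=lambda name: runs[name][metric])

for metric, name in best_by_metric.items():
    print(f"{metric:>12}: {name}")
```

On these numbers no run wins every metric: the fine-tuned model leads on ROUGE-2 and BERTScore F1, while t5-small at 100 tokens leads on ROUGE-1 and perplexity. With only 3% of the test split, the gaps are small and may not be significant.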


### Inference sample text with the models

**Select two articles from the dataset and generate a summary of each with both models.**

In [40]:
article1 = ds['article'][10]
article2 = ds['article'][11]

In [41]:
# infer with the google-t5 model
summarizer_t5 = pipeline("summarization", model="google-t5/t5-small")
article1_sum = summarizer_t5(article1)
article2_sum = summarizer_t5(article2)

In [42]:
import textwrap

arc1_wrap = textwrap.fill(article1, width=200)
arc2_wrap = textwrap.fill(article2, width=200)

sum_arc1 = textwrap.fill(article1_sum[0]['summary_text'], width=200)
sum_arc2 = textwrap.fill(article2_sum[0]['summary_text'], width=200)

print(f"Model :::google-t5/t5-small\n")
print(f"Article ::: {arc1_wrap}\n")
print(f"Summarization ::: {sum_arc1}")
print("\n\n")
print(f"Article ::: {arc2_wrap}\n")
print(f"Summarization ::: {sum_arc2}")

Model :::google-t5/t5-small

Article ::: London (CNN)A 19-year-old man was charged Wednesday with terror offenses after he was arrested as he returned to Britain from Turkey, London's Metropolitan Police said. Yahya Rashid, a UK national
from northwest London, was detained at Luton airport on Tuesday after he arrived on a flight from Istanbul, police said. He's been charged with engaging in conduct in preparation of acts of terrorism,
and with engaging in conduct with the intention of assisting others to commit acts of terrorism. Both charges relate to the period between November 1 and March 31. Rashid is due to appear in
Westminster Magistrates' Court on Wednesday, police said. CNN's Lindsay Isaac contributed to this report.

Summarization ::: 19-year-old was arrested at Luton airport on tuesday . he's been charged with engaging in conduct in preparation of acts of terrorism .



Article ::: (CNN)Paul Walker is hardly the first actor to die during a production. But Walker's death in N

In [43]:
# infer with the ubikpt model
summarizer_ubi = pipeline("summarization", model="ubikpt/t5-small-finetuned-cnn")
article1_sum_ubi = summarizer_ubi(article1)
article2_sum_ubi = summarizer_ubi(article2)

In [44]:

sum_arc1_ubi = textwrap.fill(article1_sum_ubi[0]['summary_text'], width=200)
sum_arc2_ubi = textwrap.fill(article2_sum_ubi[0]['summary_text'], width=200)

print(f"Model :::ubikpt/t5-small-finetuned-cnn\n")
print(f"Article ::: {arc1_wrap}\n")
print(f"Summarization ::: {sum_arc1_ubi}")
print("\n\n")
print(f"Article ::: {arc2_wrap}\n")
print(f"Summarization ::: {sum_arc2_ubi}")

Model :::ubikpt/t5-small-finetuned-cnn

Article ::: London (CNN)A 19-year-old man was charged Wednesday with terror offenses after he was arrested as he returned to Britain from Turkey, London's Metropolitan Police said. Yahya Rashid, a UK national
from northwest London, was detained at Luton airport on Tuesday after he arrived on a flight from Istanbul, police said. He's been charged with engaging in conduct in preparation of acts of terrorism,
and with engaging in conduct with the intention of assisting others to commit acts of terrorism. Both charges relate to the period between November 1 and March 31. Rashid is due to appear in
Westminster Magistrates' Court on Wednesday, police said. CNN's Lindsay Isaac contributed to this report.

Summarization ::: Yahya Rashid, 19, arrested at Luton airport on Tuesday . he's charged with engaging in conduct in preparation of acts of terrorism .



Article ::: (CNN)Paul Walker is hardly the first actor to die during a production. But Walker's de

**Observation: comparing the two models' summaries of the two articles**
- google-t5 keeps names out of the summaries and refers to people by description (for example "19-year-old" or "Fast and Furious actor"), while ubikpt names the person in the summary.
- The first model is somewhat more descriptive on the second article, whereas the second model's summary is shorter.
