<a href="https://colab.research.google.com/github/nowshinJahan17/Text-Summarization/blob/Nowshin_Jahan/Copy_of_gitcommand.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [64]:
# Install datasets
!pip install datasets
!pip install evaluate
!pip install -U sacrebleu
!pip install rouge_score

# Import required libraries
import pandas as pd
from transformers import pipeline, set_seed
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

# Import datasets and transformers
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")

# Print the dataset and some sample data
print(dataset)

print(f"Freatures in cnn_dailymail :{dataset['train'].column_names}")
print(dataset['train'][0])
print(dataset['validation'][0])
print(dataset['test'][1])

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=6f9e46e77a79b0568f2ab52fdfe8459b72e3cd6174b80c0905fb6ca514723c2d
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})
Freatures in cnn_dailymail :['article', 'highlights', 'id']
{'article': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something simi

# Prepare text for summarization

In [31]:
sample_text = dataset["train"][0]["article"][:1000]
summaries = {}


# Baseline summarization function

In [32]:
def baseline_summary_three_sent(text):
    return "\n".join(sent_tokenize(text)[:3])


# Generate baseline summary

In [33]:
summaries['baseline'] = baseline_summary_three_sent(sample_text)
summaries['baseline']


'Editor\'s note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events.\nHere, Soledad O\'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial.\nMIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor."'

# Model implementation: GPT2-Medium

In [34]:
from transformers import pipeline, set_seed
set_seed(42)
pipe = pipeline('text-generation', model='gpt2-medium')
gpt2_query = sample_text + "\nTL;DR:\n"
pipe_out = pipe(gpt2_query, max_length=512, clean_up_tokenization_spaces=True)


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


## View Generated Text

In [35]:
pipe_out

[{'generated_text': 'Editor\'s note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events. Here, Soledad O\'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial. MIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor." Here, inmates with the most severe mental illnesses are incarcerated until they\'re ready to appear in court. Most often, they face drug charges or charges of assaulting an officer --charges that Judge Steven Leifman says are usually "avoidable felonies." He says the arrests often result from confrontations with police. Mentally ill people often won\'t do what they\'re told when police arrive on the scene -- confrontation seems to exacerbate their illness and they become more paranoid, delusional, and

In [36]:
pipe_out[0]['generated_text'][len(gpt2_query):]


'To get to the jail that holds the mentally ill, visit our Behind the Scenes blog -- Click here The story doesn\'t end there:\xa0 In 2014, a judge ordered the jail to provide treatment for 40 mental-health detainees in the mental-health unit, as part of a $22,000 settlement.\nInmates in the mental-health unit at Miami-Dade County\'s jail are often locked in a cell that\'s usually just like any other in the facility and can become chaotic. Mental health unit employees are often required to leave the jail and drive to the facility, instead of coming in to work with the inmates themselves. Most mental-health detainees spend hours a day sleeping in front of the wall by the pool, waiting for treatment to show up.\nThere are two more stories from the Inside the Tincup Jail series:\n\xa0\xa0\xa0 The "no contact" policy:\xa0 At an overcrowded state mental health facility in Miami-Dade County, the rules allow the police to try to convince the jail staff to allow a person they believe is mentall

In [37]:
summaries['gpt2'] = "\n".join(sent_tokenize(pipe_out[0]['generated_text'][len(gpt2_query):]))

# **BART**

In [38]:
pipe = pipeline("summarization", model="facebook/bart-large-cnn")
pipe_out = pipe(sample_text)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [39]:
pipe_out

[{'summary_text': 'Miami-Dade pretrial detention facility is dubbed the "forgotten floor" Here, inmates with the most severe mental illnesses are incarcerated. Most often, they face drug charges or charges of assaulting an officer. Judge Steven Leifman says the arrests often result from confrontations with police.'}]

In [40]:
summaries['bart'] = "\n".join(sent_tokenize(pipe_out[0]['summary_text'][len(gpt2_query):]))

# PEGASUS

In [41]:
pipe = pipeline('summarization', model="google/pegasus-cnn_dailymail")
pipe_out = pipe(sample_text)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [42]:
pipe_out

[{'summary_text': 'Mentally ill inmates are housed on the "forgotten floor" of a Miami jail .<n>Judge Steven Leifman says the charges are usually "avoidable felonies"<n>He says the arrests often result from confrontations with police .<n>Mentally ill people often won\'t do what they\'re told when police arrive on the scene .'}]

In [43]:
summaries["pegasus"] = pipe_out[0]["summary_text"].replace(" .<n>", ".\n").replace("<n>", "\n")


In [44]:
summaries["pegasus"]

'Mentally ill inmates are housed on the "forgotten floor" of a Miami jail.\nJudge Steven Leifman says the charges are usually "avoidable felonies"\nHe says the arrests often result from confrontations with police.\nMentally ill people often won\'t do what they\'re told when police arrive on the scene .'

# T5

In [45]:
pipe = pipeline('summarization', model="t5-small")
pipe_out = pipe(sample_text)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [58]:
pipe_out

[{'summary_text': "inmates with the most severe mental illnesses are incarcerated until they're ready to appear in court . most often, they face drug charges or charges of assaulting an officer . mentally ill people become more paranoid, delusional, and less likely to follow dir ."}]

In [47]:
summaries['t5'] = 'n'.join(sent_tokenize(pipe_out[0]['summary_text']))

## comparing different summaries

In [48]:
print ("GROUND TRUTH")

print (dataset['train'][0]['highlights'])

for model_name in summaries:
  print(model_name.upper())
  print (summaries[model_name])

GROUND TRUTH
Mentally ill inmates in Miami are housed on the "forgotten floor"
Judge Steven Leifman says most are there as a result of "avoidable felonies"
While CNN tours facility, patient shouts: "I am the son of the president"
Leifman says the system is unjust and he's fighting for change .
BASELINE
Editor's note: In our Behind the Scenes series, CNN correspondents share their experiences in covering news and analyze the stories behind the events.
Here, Soledad O'Brien takes users inside a jail where many of the inmates are mentally ill. An inmate housed on the "forgotten floor," where many mentally ill inmates are housed in Miami before trial.
MIAMI, Florida (CNN) -- The ninth floor of the Miami-Dade pretrial detention facility is dubbed the "forgotten floor."
GPT2
To get to the jail that holds the mentally ill, visit our Behind the Scenes blog -- Click here The story doesn't end there:  In 2014, a judge ordered the jail to provide treatment for 40 mental-health detainees in the me

In [52]:


from evaluate import load

bleu_metric = load("sacrebleu")




In [66]:
bleu_metric.add(prediction = [summaries['t5']], reference =[dataset['train'][0]['highlights']])

results = bleu_metric.compute(predictions = predictions, references = references)

results['precision'] =[np.round(p,2) for p in results['precision']]

pd.DataFrame. from_dict(results, orient='index', colums = ['value'])

ValueError: Predictions and/or references don't match the expected format.
Expected format:
Feature option 0: {'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}
Feature option 1: {'predictions': Value(dtype='string', id='sequence'), 'references': Value(dtype='string', id='sequence')},
Input predictions: ["inmates with the most severe mental illnesses are incarcerated until they're ready to appear in court .nmost often, they face drug charges or charges of assaulting an officer .nmentally ill people become more paranoid, delusional, and less likely to follow dir ."],
Input references: ["Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .\nYoung actor says he has no plans to fritter his cash away .\nRadcliffe's earnings from first five Potter films have been held in trust fund ."]

In [65]:


from evaluate import load


rouge_metric = load("rouge")


In [75]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
reference = dataset['train'][0]['highlights']
records = []
for model_name in summaries:
  rouge_metric.add(prediction= summaries[model_name], reference = reference)
  score = rouge_metric. compute()
  rouge_dict = {rn: score[rn]  for rn in rouge_names}
  print(score)
  records.append(rouge_dict)
  print (len(records))
  print(len(summaries.keys()))
  pd.DataFrame.from_records(records, index= list(summaries.keys())[:len(records)])
  df= pd.DataFrame.from_records(records, index= list(summaries.keys())[:len(records)])
  print(df)

{'rouge1': 0.03448275862068966, 'rouge2': 0.0, 'rougeL': 0.03448275862068966, 'rougeLsum': 0.03448275862068966}
1
5
            rouge1  rouge2    rougeL  rougeLsum
baseline  0.034483     0.0  0.034483   0.034483
{'rouge1': 0.04761904761904762, 'rouge2': 0.0, 'rougeL': 0.034013605442176874, 'rougeLsum': 0.04761904761904762}
2
5
            rouge1  rouge2    rougeL  rougeLsum
baseline  0.034483     0.0  0.034483   0.034483
gpt2      0.047619     0.0  0.034014   0.047619
{'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0, 'rougeLsum': 0.0}
3
5
            rouge1  rouge2    rougeL  rougeLsum
baseline  0.034483     0.0  0.034483   0.034483
gpt2      0.047619     0.0  0.034014   0.047619
bart      0.000000     0.0  0.000000   0.000000
{'rouge1': 0.06741573033707866, 'rouge2': 0.0, 'rougeL': 0.06741573033707866, 'rougeLsum': 0.06741573033707866}
4
5
            rouge1  rouge2    rougeL  rougeLsum
baseline  0.034483     0.0  0.034483   0.034483
gpt2      0.047619     0.0  0.034014   0.047619
bart   

Evaluation on the test set of the CNN/DAILYMAIL DATASET


In [76]:
def calculate_metric_on_baseline_test_ds(dataset, metric, column_test = 'article',column_summary ='highlights'):
  summaries = [baseline_summary_three_sent(text)for text in dataset[column_text]]
  metric.add_batch(predictions = summaries, references = dataset[column_summary])

  score = metric.compute()
  return score

In [81]:
test_sampled = dataset['train'].shuffle(seed =42).select(range(1000))
score = calculate_metric_on_baseline_test_ds(
    dataset = test_sampled,
    metric = rouge_metric,
    column_test= 'article',
    column_summary='highlights'
)
rouge_dict= {rn: score[rn].mid.fmeasure for rn in rouge_names}
pd.dataFrame.from_dict(rouge_dict, oriet = 'index', columns = ['baseline']).T

NameError: name 'calculate_metric_on_baseline_test_ds' is not defined