### Runtime

- Choose the GPU runtime for this model
- Show runtime


### evaluate

Overview: The evaluate library is a general-purpose library used to compute a wide range of evaluation metrics for NLP models. It's part of the Hugging Face ecosystem and provides a standardized interface to evaluate models on different tasks.

Functionality:

Wide Range of Metrics: It supports various metrics like accuracy, precision, recall, F1 score, BLEU, ROUGE, and more.

### rouge_score

Overview: The rouge_score library is a specialized library for calculating the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric, which is commonly used to evaluate the quality of text summaries. The `rouge` metric in evaluate relies on this package, which is why both are installed below.


In [1]:
!pip install evaluate

!pip install rouge_score



In [2]:
!pip install pyarrow==15.0.2



In [1]:
from transformers import pipeline

In [3]:
content = """
  We are on the brink of a technological revolution that could jumpstart productivity, boost global growth and raise incomes around the world.
  Yet it could also replace jobs and deepen inequality. The rapid advance of artificial intelligence has captivated the world, causing both excitement and alarm, and
  raising important questions about its potential impact on the global economy. The net effect is difficult to foresee, as AI will ripple through economies in complex ways.
  What we can say with some confidence is that we will need to come up with a set of policies to safely leverage the vast potential of AI for the benefit of humanity.

  In a new analysis, IMF staff examine the potential impact of AI on the global labor market. Many studies have predicted the likelihood that jobs will be replaced by AI.
  Yet we know that in many cases AI is likely to complement human work. The IMF analysis captures both these forces.

  The findings are striking: almost 40 percent of global employment is exposed to AI. Historically, automation and information technology have tended to affect routine tasks,
  but one of the things that sets AI apart is its ability to impact high-skilled jobs. As a result, advanced economies face greater risks from AI—but also more opportunities to
  leverage its benefits—compared with emerging market and developing economies.

  In advanced economies, about 60 percent of jobs may be impacted by AI. Roughly half the exposed jobs may benefit from AI integration, enhancing productivity. For the other
  half, AI applications may execute key tasks currently performed by humans, which could lower labor demand, leading to lower wages and reduced hiring. In the most extreme
  cases, some of these jobs may disappear.

  In emerging markets and low-income countries, by contrast, AI exposure is expected to be 40 percent and 26 percent, respectively. These findings suggest emerging market and
  developing economies face fewer immediate disruptions from AI. At the same time, many of these countries don’t have the infrastructure or skilled workforces to harness the
  benefits of AI, raising the risk that over time the technology could worsen inequality among nations.
"""

print(content)


  We are on the brink of a technological revolution that could jumpstart productivity, boost global growth and raise incomes around the world. 
  Yet it could also replace jobs and deepen inequality. The rapid advance of artificial intelligence has captivated the world, causing both excitement and alarm, and 
  raising important questions about its potential impact on the global economy. The net effect is difficult to foresee, as AI will ripple through economies in complex ways. 
  What we can say with some confidence is that we will need to come up with a set of policies to safely leverage the vast potential of AI for the benefit of humanity.

  In a new analysis, IMF staff examine the potential impact of AI on the global labor market. Many studies have predicted the likelihood that jobs will be replaced by AI.
  Yet we know that in many cases AI is likely to complement human work. The IMF analysis captures both these forces.

  The findings are striking: almost 40 percent of global 

- Go to https://huggingface.co/
- Click on Models -> Select "Summarization" from the left
- Show that the facebook/bart-large-cnn is the first model

https://huggingface.co/facebook/bart-large-cnn

In [9]:
summarization_pipeline = pipeline("summarization", model="facebook/bart-large-cnn", device=0)

summarization_pipeline

<transformers.pipelines.text2text_generation.SummarizationPipeline at 0x7bf5ab4bf760>

In [10]:
summarization_pipeline.device

device(type='cuda', index=0)
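The cells above pass `device=0`, which assumes a CUDA GPU is attached to the runtime. A minimal sketch of a fallback is shown here. `pick_device` is a hypothetical helper (not part of transformers), and it relies on the pipeline convention that `-1` means CPU.

```python
def pick_device():
    """Return 0 (first CUDA GPU) when one is usable, otherwise -1 (CPU)."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print(device)

# Usage, with `pipeline` imported as in the cells above:
# summarization_pipeline = pipeline("summarization", model="facebook/bart-large-cnn", device=device)
```

On a CPU-only runtime the same notebook then still runs, only more slowly.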

In [11]:
summarization_pipeline(content)

[{'summary_text': 'The rapid advance of artificial intelligence has captivated the world. IMF staff examine the potential impact of AI on the global labor market. In advanced economies, about 60 percent of jobs may be impacted by AI. In emerging markets and low-income countries, by contrast, AI exposure is expected to be 40 percent and 26 percent.'}]

In [12]:
results = summarization_pipeline(content, max_length=80, min_length=20)

print(results[0]['summary_text'])

The rapid advance of artificial intelligence has captivated the world. IMF staff examine the potential impact of AI on the global labor market. In advanced economies, about 60 percent of jobs may be impacted by AI.


https://huggingface.co/Falconsai/text_summarization

In [14]:
summarization_pipeline = pipeline("summarization", model="Falconsai/text_summarization", device=0)

summarization_pipeline

<transformers.pipelines.text2text_generation.SummarizationPipeline at 0x7bf5b229b970>

In [15]:
results = summarization_pipeline(content, max_length=80, min_length=20)

print(results[0]['summary_text'])

We are on the brink of a technological revolution that could jumpstart productivity, boost global growth and raise incomes around the world . The rapid advance of artificial intelligence has captivated the world, causing both excitement and alarm, and raising important questions about its potential impact on the global economy . In a new analysis, IMF staff examine the potential impact of AI on



- Behind the scenes, create a folder in Google Drive and upload the news_articles.csv file there (the read_csv call below expects the folder to be named oreilly_news_articles)
- Go to that folder on https://drive.google.com/
- https://drive.google.com/drive/u/0/folders/11AKRIT_VEkg_f3XsIVNik6w-YB6FZj8Z
- Run the command below
- Enable all permissions
- Open up the folder on the left and show the Drive mounted


In [2]:
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import pandas as pd

news_df = pd.read_csv('/content/drive/MyDrive/oreilly_news_articles/news_articles.csv')

news_df.sample(20)

Unnamed: 0,Content,Summary,Dataset
2634,By . Ellie Zolfagharifard . We hardly ever thi...,Michael Stevens outlines the devastating scena...,CNN/Daily Mail
2349,Media playback is not supported on this device...,Taekwondo world champion Bianca Walkden says s...,XSum
3129,"The permit will allow those living in Reading,...",A permit will be needed for some Berkshire res...,XSum
3962,"The ""Kon-Tiki2"" expedition began in Peru in No...",Authorities in Chile say they have rescued 14 ...,XSum
3020,(CNN) -- During a family trip to the beach las...,Growing black bump turns out to be a sea snail...,CNN/Daily Mail
3943,Ched Evans was seen smiling in public today as...,Ched Evans visited the parents of his girlfrie...,CNN/Daily Mail
170,"ASUNCION, Paraguay (CNN) -- A second woman st...",Paraguayan President Fernando Lugo fathered so...,CNN/Daily Mail
2190,"Claire Blackman, the wife of Royal Marine Serg...",MPs and senior military figures among those ca...,CNN/Daily Mail
3710,"Mr Mbabazi made the announcement on You Tube, ...",Uganda's former Prime Minister Amama Mbabazi i...,XSum
1670,"(CNN) -- Dharun Ravi, the former Rutgers Unive...",NEW: Ravi's court appearance formalized his de...,CNN/Daily Mail


In [5]:
news_df.shape

(5000, 3)

Null value check

In [6]:
news_df.isnull().sum()

Unnamed: 0,0
Content,0
Summary,0
Dataset,0


In [4]:
import re

def clean_txt(txt):
    txt = txt.lower()

    # Remove backslashes, @ symbols
    txt = txt.replace("\\", " ")
    txt = txt.replace("@", " ")

    # Replace non-breaking space (used to help not split words if text is wrapped)
    txt = txt.replace("\xa0", " ")

    txt = txt.replace("/", " ")
    txt = txt.replace("\n", " ")
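    # Strip possessive 's. This also matches an opening quote followed by a
    # word that starts with "s" (for example 'shady' becomes hady').
    txt = txt.replace("'s", " ")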
    txt = txt.replace('"', ' ')

    # Remove extra whitespace
    txt = re.sub(r'\s+', ' ', txt).strip()

    return txt
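The plain `"'s"` replacement in `clean_txt` also hits quoted words that start with "s", which is why `'shady'` shows up as `hady'` in the cleaned example later in the notebook. A possible stricter variant is sketched below. `strip_possessive` is a hypothetical helper that is not used elsewhere in this notebook, so the recorded outputs still reflect the original function.

```python
import re


def strip_possessive(txt):
    """Replace possessive 's with a space, leaving quoted words such as 'shady' intact."""
    # (?<=\w) requires a word character right before the apostrophe, and \b
    # requires the "s" to end the word, so an opening quote is never matched.
    return re.sub(r"(?<=\w)'s\b", " ", txt)


sample = "kutcher's critic was called 'shady'"
print(strip_possessive(sample))
```

Like the original replacement, this also turns contractions such as "it's" into "it ", and any doubled spaces are collapsed by the whitespace step at the end of `clean_txt`.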


In [5]:
cleaned_df = news_df.copy()

cleaned_df['Content'] = cleaned_df['Content'].map(clean_txt)
cleaned_df['Summary'] = cleaned_df['Summary'].map(clean_txt)

cleaned_df.sample(10)

Unnamed: 0,Content,Summary,Dataset
3324,saido berahino has insisted strike partner bro...,brown ideye joined west brom in £10million mov...,CNN/Daily Mail
3529,the 28-year-old will sign his contract on tues...,ian cathro will be confirmed as steve mcclaren...,XSum
3623,by . damien gayle and daily mail reporter . pu...,new york yankees star alex rodriguez among tho...,CNN/Daily Mail
852,"the body of india chipchase, 20, a bar worker,...",an investigation into the disappearance of a w...,XSum
1217,champion trainer paul nicholls has said he acc...,paul nicholls understands sam twiston-davies n...,CNN/Daily Mail
1784,"the ecuadorian international, 25, limped off d...",swansea city winger jefferson montero has been...,XSum
3588,a church has had its application to retune its...,church barred from re-tuning its bells so thei...,CNN/Daily Mail
4140,rhys murphy second-half double confirmed victo...,forest green secured a fourth successive natio...,XSum
2080,scotland richie ramsay made a flying start to ...,dp world tour championship is taking place thi...,CNN/Daily Mail
2380,"the clip, said to have been filmed in the city...",a video has appeared online apparently showing...,XSum


Checking an instance of Content and Summary

In [9]:
news_df['Content'][15]

"Ashton Kutcher has entered the firestorm surrounding under-fire taxi-hiring app Uber and defended controversial comments made by an executive who\xa0suggested spending $1 million to dig up dirt on journalists who criticize the company. Kutcher, an investor in the app, took to Twitter on Wednesday to show his support for beleaguered VP Emil Michael and described Sarah Lacy, a female journalist who has been highly critical of the company, as 'shady'. 'What is so wrong about digging up dirt on shady journalist?' tweeted the celebrity tech entrepreneur who has invested in tech firms including Skype, Foursquare, Airbnb and Spotify through his\xa0venture capital firm A-Grade. Under-fire: Ashton Kutcher has entered the firestorm surrounding taxi-hiring app Uber and defended Senior VP Emil Michael who suggested spending $1 million to dig up dirt on journalists who criticize the company . Outburst: Kutcher, an investor in the app, took to Twitter on Wednesday to attack Sarah Lacy, the journali

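Non-breaking spaces (\xa0) and @ symbols are replaced with spaces, and the text is lowercased. Note that 'shady' became hady': the "'s" replacement in clean_txt also matches an opening quote followed by a word that starts with "s".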

In [10]:
cleaned_df['Content'][15]

"ashton kutcher has entered the firestorm surrounding under-fire taxi-hiring app uber and defended controversial comments made by an executive who suggested spending $1 million to dig up dirt on journalists who criticize the company. kutcher, an investor in the app, took to twitter on wednesday to show his support for beleaguered vp emil michael and described sarah lacy, a female journalist who has been highly critical of the company, as hady'. 'what is so wrong about digging up dirt on shady journalist?' tweeted the celebrity tech entrepreneur who has invested in tech firms including skype, foursquare, airbnb and spotify through his venture capital firm a-grade. under-fire: ashton kutcher has entered the firestorm surrounding taxi-hiring app uber and defended senior vp emil michael who suggested spending $1 million to dig up dirt on journalists who criticize the company . outburst: kutcher, an investor in the app, took to twitter on wednesday to attack sarah lacy, the journalist who h

In [11]:
news_df['Summary'][15]

"Uber investor Kutcher has tweeted his support for the under-fire app and accused critic Sarah Lacy of being 'shady'\nKutcher's comments follow the firestorm that\xa0erupted\xa0after senior VP Emil\xa0Michael\xa0suggested Uber should hire a $1 million team of researchers .\nThey 'would dig up dirt on journalists' personal lives and their families'\nMichael was reportedly speaking with specific reference to Lacy, an\xa0outspoken critic of the online taxi service .\nActor quickly apologized but was strongly criticized by Twitter users - many of whom accused him of only getting involved because he is an investor .\nIn 2011 he was forced into an embarrassing climbdown after tweeting that the sacking of Joe Paterno as head coach of Penn State showed 'no class'"

Newline characters are replaced with spaces

In [12]:
cleaned_df['Summary'][15]

"uber investor kutcher has tweeted his support for the under-fire app and accused critic sarah lacy of being hady' kutcher comments follow the firestorm that erupted after senior vp emil michael suggested uber should hire a $1 million team of researchers . they 'would dig up dirt on journalists' personal lives and their families' michael was reportedly speaking with specific reference to lacy, an outspoken critic of the online taxi service . actor quickly apologized but was strongly criticized by twitter users - many of whom accused him of only getting involved because he is an investor . in 2011 he was forced into an embarrassing climbdown after tweeting that the sacking of joe paterno as head coach of penn state showed 'no class'"

Go to https://huggingface.co/sshleifer/distilbart-cnn-12-6

### BART (Bidirectional and Auto-Regressive Transformers)
Overview: BART is a transformer model developed by Facebook AI that combines bidirectional and auto-regressive elements. It is designed for tasks that involve sequence generation, such as text summarization, machine translation, and text generation. BART can be thought of as a generalization of both BERT (which is bidirectional) and GPT (which is auto-regressive).

Architecture: BART is an encoder-decoder architecture, where the encoder is similar to BERT and the decoder is similar to GPT. The encoder processes the input text bidirectionally, and the decoder then generates the output text auto-regressively, predicting one token at a time conditioned on the tokens generated so far.

### DistilBART

Overview: DistilBART is a distilled version of BART. Distillation is a process where a smaller model (the "student") is trained to mimic the behavior of a larger model (the "teacher"), resulting in a model that is faster and lighter while retaining much of the performance of the original model.

Efficiency: The primary goal of DistilBART is to provide a more efficient alternative to BART. It is smaller and faster and requires fewer computational resources, which makes it better suited to resource-constrained or latency-sensitive deployments such as real-time applications. In the checkpoint name used below, `12-6` denotes 12 encoder layers and 6 decoder layers.
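The student-mimics-teacher idea can be illustrated with a toy soft-target loss. This is a generic sketch of distillation, not necessarily how the DistilBART checkpoints were produced. The logits, the 3-word vocabulary, and the temperature of 2.0 are all made up for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same items."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token scores over a 3-word vocabulary.
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.5, 1.5, 0.2]

teacher_probs = softmax(teacher_logits, temperature=2.0)
student_probs = softmax(student_logits, temperature=2.0)

# The loss shrinks toward 0 as the student's distribution approaches the teacher's.
loss = kl_divergence(teacher_probs, student_probs)
print(round(loss, 4))
```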

In [6]:
model = "sshleifer/distilbart-cnn-12-6"

model

'sshleifer/distilbart-cnn-12-6'

https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.SummarizationPipeline

In [7]:
summarizer = pipeline("summarization", model=model, truncation=True, device=0)

summarizer

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


<transformers.pipelines.text2text_generation.SummarizationPipeline at 0x7961fc96f340>

In [8]:
example_text = cleaned_df['Content'][8]

example_text

"many people might associate roses with romance novels, but in fact the smell of chocolate tempts readers to buy loved-up fiction. researchers at antwerp university in belgium found that while a chocolatey smell makes people hungry to buy romance novels, it does not necessarily tempt people to buy grittier genres such as crime or business books. the scientists, who permeated a bookshop with the perfume of chocolate, also found that the scent encouraged customers to browse through titles. the government funded study found that bookshop visitors were almost six times more likely to buy a romance novel if they smelt chocolate - and the same results applied to cookery books too . they studied the behaviour of 201 customers in a popular chain bookshop over ten days to find that customers were 3.5 times more likely to pick up romance novels when they smelt chocolate, the guardian reported. the government funded study also found that bookshop visitors were almost six times more likely to buy 

In [9]:
summary_txt = summarizer(example_text)

summary_txt

[{'summary_text': ' Researchers at antwerp university in belgium, Belgium, permeated a bookshop with the perfume of chocolate . The government funded study found that bookshop visitors were almost six times more likely to buy a romance novel if they smelt chocolate - and the same results applied to cookery books too . The researchers concluded romance, food and drink books were strongly linked with the smell of chocolate, but other categories of literature were not .'}]

In [10]:
clean_txt(summary_txt[0]['summary_text'])

'researchers at antwerp university in belgium, belgium, permeated a bookshop with the perfume of chocolate . the government funded study found that bookshop visitors were almost six times more likely to buy a romance novel if they smelt chocolate - and the same results applied to cookery books too . the researchers concluded romance, food and drink books were strongly linked with the smell of chocolate, but other categories of literature were not .'

In [11]:
ref_txt = cleaned_df['Summary'][8]

ref_txt

'belgium scientists flooded a bookshop with a chocolatey smell to find customers were almost six times more likely to buy a romance novel . the government-funded study said the scent does not have a big effect on making people want to purchase travel, crime, business or comic books . researchers at antwerp university said the sweet smell encouraged customers to browse and . sales of all books rose during the experiment .'

##### Load the ROUGE metric from Hugging Face

In [13]:
import evaluate

rouge = evaluate.load("rouge")

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate the quality of text summarization and machine-generated text by comparing it to reference summaries or texts. The different ROUGE scores capture various aspects of the overlap between the generated text and the reference text. Here’s an overview of the most commonly used ROUGE metrics:

1. ROUGE-N

- Definition: ROUGE-N measures the overlap of n-grams (contiguous sequences of n items, typically words) between the generated summary and the reference summary.
- Variants:
  - ROUGE-1: Measures the overlap of unigrams (single words). It evaluates the ability of the model to capture the content in individual words.
  - ROUGE-2: Measures the overlap of bigrams (two consecutive words). This captures the ability of the model to maintain the sequence of words.
  - ROUGE-3, ROUGE-4, etc.: Measure overlap of trigrams, four-grams, and so on. Higher n-grams capture more context but can be harder to match exactly.
- Interpretation: Higher ROUGE-N scores indicate greater n-gram overlap with the reference summary, which suggests that more of its content and phrasing is preserved in the generated text.
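The n-gram overlap described above can be sketched in a few lines of plain Python. This is a simplified illustration that splits on whitespace and does no stemming, so its numbers will not exactly match the rouge_score package. `ngrams` and `rouge_n` are hypothetical helper names, and the two sentences are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """Precision, recall and F1 of n-gram overlap between two strings."""
    pred = ngrams(prediction.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"

print(rouge_n(prediction, reference, n=1))  # 5 of 6 unigrams are shared
print(rouge_n(prediction, reference, n=2))  # 3 of 5 bigrams are shared
```

One changed word costs a single unigram but two bigrams, which is why ROUGE-2 is usually lower than ROUGE-1 in the results further down.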

2. ROUGE-L
- Definition: ROUGE-L measures the longest common subsequence (LCS) between the generated and reference summaries. The LCS is a sequence that appears in both texts in the same order but not necessarily consecutively.
- Why LCS?: Unlike n-gram overlap, LCS takes into account sentence-level word order and is less strict about consecutive matching, making it more adaptable to different writing styles.
- ROUGE-Lsum is a variant of ROUGE-L intended for multi-sentence summaries. In the rouge_score implementation, each text is split into sentences on newline characters, the LCS is computed between sentences, and the results are combined into a summary-level score. When the texts contain no newlines, the per-example score matches ROUGE-L.
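A minimal sketch of the LCS-based score, again using whitespace tokens and no stemming, so it only approximates what rouge_score computes. `lcs_length` and `rouge_l` are hypothetical helper names, and the sentences are a small made-up example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match on equal tokens, otherwise keep the best so far.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """Precision, recall and F1 based on the LCS of two strings."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    precision = lcs / max(len(pred), 1)
    recall = lcs / max(len(ref), 1)
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"lcs": lcs, "precision": precision, "recall": recall, "f1": f1}


reference = "police killed the gunman"

# Same word order as the reference: "police ... the gunman" is a common subsequence.
print(rouge_l("police kill the gunman", reference))
# Same words in a different order: only "the gunman" survives.
print(rouge_l("the gunman kill police", reference))
```

Both predictions share the same three unigrams with the reference, so ROUGE-1 cannot tell them apart, while the LCS-based score drops for the reordered one.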

In [14]:
summary_result = rouge.compute(
    predictions=[summary_txt[0]["summary_text"]],
    references=[ref_txt],
    use_stemmer=True
)

summary_result

{'rouge1': 0.47482014388489213,
 'rouge2': 0.2627737226277372,
 'rougeL': 0.3165467625899281,
 'rougeLsum': 0.3165467625899281}

In [15]:
articles_txt = cleaned_df['Content']

articles_txt

Unnamed: 0,Content
0,it seems to be customary for manchester united...
1,during cnn going green: green light for busine...
2,london (cnn) -- police are to investigate clai...
3,(cnn) -- four people were killed and one serio...
4,family believes kristi and benjamin strack kil...
...,...
4995,"by . dan bloom . published: . 08:46 est, 14 no..."
4996,staff at the children ward at southampton gene...
4997,"by . mia de graaf . published: . 14:38 est, 12..."
4998,stores of timber were destroyed in the fire at...


##### Use tqdm to show a progress bar while looping over the first 50 elements of articles_txt. Inside the loop, the summarizer generates a summary for each article and appends it to the candidate_summaries list

In [16]:
from tqdm import tqdm

candidate_summaries = []

for text in tqdm(articles_txt[:50]):
    candidate = summarizer(text)

    candidate_summaries.append(candidate[0]["summary_text"])

 18%|█▊        | 9/50 [00:07<00:35,  1.15it/s]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
 26%|██▌       | 13/50 [00:10<00:26,  1.40it/s]Your max_length is set to 142, but your input_length is only 126. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=63)
 76%|███████▌  | 38/50 [00:32<00:11,  1.03it/s]Your max_length is set to 142, but your input_length is only 116. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=58)
100%|██████████| 50/50 [00:40<00:00,  1.23it/s]
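The "using the pipelines sequentially on GPU" warning above appears because the loop sends one article at a time. One option is to pass the whole list in a single call with a `batch_size`, sketched below. `summarize_all` is a hypothetical helper, the batch size of 8 is an arbitrary choice, and `fake_summarizer` is only a stand-in so the sketch runs without downloading a model.

```python
def summarize_all(summarizer, texts, batch_size=8):
    """Summarize every text in one pipeline call and return plain strings."""
    outputs = summarizer(list(texts), batch_size=batch_size)
    summaries = []
    for out in outputs:
        # Depending on the version, each item may be a dict or a one-element list.
        item = out[0] if isinstance(out, list) else out
        summaries.append(item["summary_text"])
    return summaries


# Stand-in for a real pipeline: it just returns the first 20 characters.
def fake_summarizer(texts, batch_size=1):
    return [{"summary_text": t[:20]} for t in texts]


print(summarize_all(fake_summarizer, ["first long article text here", "second article"]))

# With the real objects from this notebook:
# candidate_summaries = summarize_all(summarizer, articles_txt[:50])
```

Whether batching is actually faster depends on the GPU and on how much the article lengths vary, so it is worth timing both versions.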


In [17]:
summaries_txt = cleaned_df['Summary']

summaries_txt

Unnamed: 0,Summary
0,victor valdes posed with a fan outside san car...
1,jet republic has teamed up with climatecare to...
2,new: the uk government stands firmly against t...
3,an explosion occurrs at a storage tank that wa...
4,– police suspect foul play and poisoning in th...
...,...
4995,"olivia adams, 13, was told she had broken a st..."
4996,a thief stole a games console from a hospital ...
4997,nitzan benhaim tackled 90ft-tall waves in naza...
4998,a luxury gazebo firm escaped major damage by t...


In [18]:
result_unagg = rouge.compute(
    predictions=candidate_summaries,
    references=summaries_txt[:50],
    use_stemmer=True,
    use_aggregator=False
)

result_unagg

{'rouge1': [0.3829787234042554,
  0.40740740740740744,
  0.5084745762711864,
  0.5,
  0.27309236947791166,
  0.4852941176470589,
  0.3076923076923077,
  0.5420560747663552,
  0.47482014388489213,
  0.3777777777777778,
  0.31683168316831684,
  0.5168539325842696,
  0.4356435643564357,
  0.09677419354838708,
  0.3188405797101449,
  0.3850267379679144,
  0.17391304347826086,
  0.3783783783783784,
  0.7580645161290323,
  0.23809523809523805,
  0.35443037974683544,
  0.22222222222222224,
  0.3055555555555556,
  0.19718309859154928,
  0.42696629213483145,
  0.3134328358208955,
  0.22222222222222224,
  0.3373493975903615,
  0.288135593220339,
  0.3576158940397351,
  0.42276422764227645,
  0.4403669724770642,
  0.2380952380952381,
  0.4409448818897638,
  0.35999999999999993,
  0.2527075812274368,
  0.1739130434782609,
  0.3958333333333333,
  0.25,
  0.1518987341772152,
  0.5192307692307692,
  0.20338983050847456,
  0.628099173553719,
  0.5663716814159292,
  0.7714285714285715,
  0.1,
  0.41428

In [19]:
result_agg = rouge.compute(predictions=candidate_summaries,
                           references=summaries_txt[:50],
                           use_stemmer=True)
result_agg

{'rouge1': 0.3711112180118469,
 'rouge2': 0.1735745443175322,
 'rougeL': 0.2748008430248823,
 'rougeLsum': 0.275912754084674}
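A quick sanity check on the aggregated numbers is to average the per-example scores directly. As an assumption about the installed version, the aggregated call goes through the bootstrap aggregator in rouge_score, so its values can differ slightly from a plain mean. `mean_scores` is a hypothetical helper, and the sample uses only the first four rouge1 values from the unaggregated output above.

```python
from statistics import mean


def mean_scores(per_example):
    """Average each metric's list of per-example scores."""
    return {name: mean(values) for name, values in per_example.items()}


# First four rouge1 values from result_unagg above, used as a small sample.
sample = {"rouge1": [0.3829787234042554, 0.40740740740740744, 0.5084745762711864, 0.5]}
print(mean_scores(sample))

# With the full results from this notebook:
# mean_scores(result_unagg)
```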

##### Find the indices of the minimum and maximum ROUGE-Lsum scores to inspect the worst and best summaries

In [20]:
import numpy as np

result_unagg_rsum = np.array(result_unagg["rougeLsum"])

max_arg = np.argmax(result_unagg_rsum)
min_arg = np.argmin(result_unagg_rsum)

max_arg, min_arg

(44, 13)
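Looking at a single best and worst example can be misleading, so it may help to inspect a few from each end. A small sketch follows. `rank_examples` is a hypothetical helper, and the scores are made up for illustration.

```python
def rank_examples(scores, k=3):
    """Return the indices of the k lowest and the k highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    worst = order[:k]
    best = order[-k:][::-1]  # highest score first
    return worst, best


# Hypothetical per-example scores, for illustration only.
scores = [0.31, 0.08, 0.44, 0.72, 0.27]
worst, best = rank_examples(scores, k=2)
print(worst, best)

# With the real scores from this notebook:
# worst, best = rank_examples(result_unagg["rougeLsum"], k=5)
```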

In [21]:
candidate_summaries[min_arg]

' The 27-year-old is unattached after leaving the club in july . The former udinese man initially joined the hornets on loan in 2012, before signing on a free transfer in 2013 . He has two senior international caps for sweden, making 100 appearances in total .'

In [22]:
summaries_txt[min_arg]

'former watford defender joel ekstrand is currently training with bristol city, head coach lee johnson has said.'

In [23]:
candidate_summaries[max_arg]

' janelle duncan-bailey, 25, went missing in south london in the early hours of wed Wednesday . Her body was found yesterday afternoon in mayfield crescent, thornton heath . Jerome mcdonald, 30, has been charged with her murder and will appear in court tomorrow .'

In [24]:
summaries_txt[max_arg]

'janelle duncan-bailey went missing in south london early on wednesday . her body was found yesterday in thornton heath . jerome mcdonald, 30, has been charged with her murder .'

In [25]:
act_vs_pred_summaries_df = pd.DataFrame(
    list(zip(candidate_summaries, summaries_txt[:50])),
    columns = ["Predicted_Summaries", "Reference_summaries"]
)

act_vs_pred_summaries_df.head(10)

Unnamed: 0,Predicted_Summaries,Reference_summaries
0,victor valdes enjoyed a meal out at san carlo...,victor valdes posed with a fan outside san car...
1,Aviation industry is often seen as one of the...,jet republic has teamed up with climatecare to...
2,London police to investigate claims that brit...,new: the uk government stands firmly against t...
3,The explosion occurred at a storage tank that...,an explosion occurrs at a storage tank that wa...
4,Family believes kristi and benjamin strack ki...,– police suspect foul play and poisoning in th...
5,An executive at apple said that the company h...,apple says the company has no obligation to he...
6,Cell phone novels are written entirely on han...,hugely popular cell phone novels have created ...
7,The facebook group plus size modeling shared ...,the facebook group plus size modeling . shared...
8,"Researchers at antwerp university in belgium,...",belgium scientists flooded a bookshop with a c...
9,NEW: egypt prime minister appeals for calm an...,new: three arrested american students identifi...
