# **Text Summarization**

In [None]:
import pandas as pd

In [None]:
df=pd.read_csv("/content/text_summarization.csv", sep=',', on_bad_lines='skip')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2906 entries, 0 to 2905
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   text                          2906 non-null   object 
 1   labels                        2906 non-null   object 
 2   no_sentences                  2905 non-null   float64
 3   Flesch Reading Ease Score     2905 non-null   float64
 4   Dale-Chall Readability Score  2905 non-null   float64
 5   text_rank_summary             2905 non-null   object 
 6   lsa_summary                   2905 non-null   object 
dtypes: float64(3), object(4)
memory usage: 159.1+ KB


In [None]:
df.head()

Unnamed: 0,text,labels,no_sentences,Flesch Reading Ease Score,Dale-Chall Readability Score,text_rank_summary,lsa_summary
0,Ad sales boost Time Warner profit\n\nQuarterly...,business,26.0,62.17,9.72,It hopes to increase subscribers by offering t...,Its profits were buoyed by one-off gains which...
1,Dollar gains on Greenspan speech\n\nThe dollar...,business,17.0,65.56,9.09,The dollar has hit its highest level against t...,"""I think the chairman's taking a much more san..."
2,Yukos unit buyer faces loan claim\n\nThe owner...,business,14.0,69.21,9.66,The owners of embattled Russian oil giant Yuko...,Yukos' owner Menatep Group says it will ask Ro...
3,High fuel prices hit BA's profits\n\nBritish A...,business,24.0,62.98,9.86,Looking ahead to its full year results to Marc...,"Rod Eddington, BA's chief executive, said the ..."
4,Pernod takeover talk lifts Domecq\n\nShares in...,business,17.0,70.63,10.23,Reports in the Wall Street Journal and the Fin...,Shares in UK drinks and food firm Allied Domec...


In [None]:
df.tail()

Unnamed: 0,text,labels,no_sentences,Flesch Reading Ease Score,Dale-Chall Readability Score,text_rank_summary,lsa_summary
2901,New consoles promise big problems\n\nMaking ga...,tech,52.0,60.85,9.2,Instead of employing lots of artists to create...,Mr Wright said that enabling players to devise...
2902,BT program to beat dialler scams\n\nBT is intr...,tech,17.0,56.29,9.23,If a bill rises substantially above its usual ...,BT is introducing two initiatives to help beat...
2903,Be careful how you code\n\nA new European dire...,tech,45.0,56.29,8.73,"If it gets its way, the Dutch government will ...",A new European directive could put software wr...
2904,US cyber security chief resigns\n\nThe man mak...,tech,16.0,47.42,9.14,Amit Yoran was director of the National Cyber ...,The man making sure US computer networks are s...
2905,Losing yourself in online gaming\n\nOnline rol...,tech,149.0,68.91,7.15,One unnamed correspondent - all are anonymous ...,"Shame they are getting more popular, as you kn..."


## **Checking for missing values**

In [None]:
print(df.isnull().sum())

text                            0
labels                          0
no_sentences                    1
Flesch Reading Ease Score       1
Dale-Chall Readability Score    1
text_rank_summary               1
lsa_summary                     1
dtype: int64


## **Checking and removing duplicate records**

In [None]:
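print("Duplicate rows :",df.duplicated().sum())
# .copy() returns an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
df=df.drop_duplicates().copy()
print("Duplicate rows after cleaning :",df.duplicated().sum())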

Duplicate rows : 779
Duplicate rows after cleaning : 0


In [None]:
import re
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

## 1. **Text Cleaning (Lowercasing, Removing URLs, Special Characters, and Numbers)**

In [None]:
import re

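def clean_text(text):
    # Treat missing values as empty strings rather than the literal string "nan"
    text = '' if pd.isna(text) else str(text)

    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs, as the section title promises
    text = re.sub(r'\n', ' ', text)                     # newlines -> spaces
    text = re.sub(r'\s+', ' ', text).strip()            # collapse repeated whitespace
    text = re.sub(r'[^\w\s%:]', '', text)               # drop punctuation except % and :
    text = re.sub(r'\b(?!19|20)\d{2,}\b', '', text)     # drop multi-digit numbers that do not start with 19 or 20
    return text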

df['clean_text'] = df['text'].fillna('').apply(clean_text)
print(df[['text', 'clean_text']].head())

                                                text  \
0  Ad sales boost Time Warner profit\n\nQuarterly...   
1  Dollar gains on Greenspan speech\n\nThe dollar...   
2  Yukos unit buyer faces loan claim\n\nThe owner...   
3  High fuel prices hit BA's profits\n\nBritish A...   
4  Pernod takeover talk lifts Domecq\n\nShares in...   

                                          clean_text  
0  ad sales boost time warner profit quarterly pr...  
1  dollar gains on greenspan speech the dollar ha...  
2  yukos unit buyer faces loan claim the owners o...  
3  high fuel prices hit bas profits british airwa...  
4  pernod takeover talk lifts domecq shares in uk...  
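
The effect of the cleaning regexes is easier to see on a single short string. The sketch below is a standalone illustration, not part of the pipeline: `clean_text_demo` and `sample` are hypothetical names, and the function applies the same lowercase, whitespace, punctuation and number substitutions as the cell above without the pandas dependency. It also shows a side effect visible later in the printed article text: the number filter removes `76` but leaves the `%` behind.

```python
import re

def clean_text_demo(text):
    # Same substitutions as the notebook's cleaning step, applied to a plain string.
    text = text.lower()
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    text = re.sub(r'[^\w\s%:]', '', text)            # keeps letters, digits, spaces, % and :
    text = re.sub(r'\b(?!19|20)\d{2,}\b', '', text)  # removes standalone numbers such as 76, keeps 2004 and 2
    return text

sample = "Profits jumped 76% in 2004.\n\nSales rose 2%."
print(clean_text_demo(sample))
```

If percentages matter for the summaries, one option would be to relax the number filter so that digits directly followed by `%` are kept; that is a design choice rather than something the original notebook does.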




## 2. **Tokenization**

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize, sent_tokenize

def word_tokenization(text):
    tokens = word_tokenize(text)
    return tokens

def sentence_tokenization(text):
    tokens = sent_tokenize(text)
    return tokens

df['word_tokens'] = df['clean_text'].apply(word_tokenization)
df['sentence_tokens'] = df['clean_text'].apply(sentence_tokenization)

print(df[['clean_text', 'word_tokens', 'sentence_tokens']].head())

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


                                          clean_text  \
0  ad sales boost time warner profit quarterly pr...   
1  dollar gains on greenspan speech the dollar ha...   
2  yukos unit buyer faces loan claim the owners o...   
3  high fuel prices hit bas profits british airwa...   
4  pernod takeover talk lifts domecq shares in uk...   

                                         word_tokens  \
0  [ad, sales, boost, time, warner, profit, quart...   
1  [dollar, gains, on, greenspan, speech, the, do...   
2  [yukos, unit, buyer, faces, loan, claim, the, ...   
3  [high, fuel, prices, hit, bas, profits, britis...   
4  [pernod, takeover, talk, lifts, domecq, shares...   

                                     sentence_tokens  
0  [ad sales boost time warner profit quarterly p...  
1  [dollar gains on greenspan speech the dollar h...  
2  [yukos unit buyer faces loan claim the owners ...  
3  [high fuel prices hit bas profits british airw...  
4  [pernod takeover talk lifts domecq shares in u...  

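
In the output above, each `sentence_tokens` list appears to contain the whole article as a single element. A likely cause is that `clean_text` has already had its full stops removed, so a sentence tokenizer has no boundaries left to find. The sketch below illustrates the ordering issue with the standard library only. `naive_sentence_split` is a hypothetical, much simpler stand-in for `sent_tokenize`, and `strip_punctuation` reproduces only the punctuation step of the cleaning function.

```python
import re

def naive_sentence_split(text):
    # Hypothetical stand-in for nltk's sent_tokenize:
    # split wherever ., ! or ? is followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def strip_punctuation(text):
    # Only the lowercase + punctuation-removal part of the cleaning step.
    return re.sub(r'[^\w\s%:]', '', text.lower())

raw = "The dollar rose. Traders were upbeat. Analysts expect more gains."

# Split first, then clean each sentence: boundaries survive.
split_then_clean = [strip_punctuation(s) for s in naive_sentence_split(raw)]

# Clean first, then split: the boundaries are already gone.
clean_then_split = naive_sentence_split(strip_punctuation(raw))

print(len(split_then_clean), split_then_clean)
print(len(clean_then_split), clean_then_split)
```

In the notebook, the equivalent change would be to build `sentence_tokens` from the raw `text` column and clean each sentence afterwards, so that the sentence scoring step further down has more than one candidate per article.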


## 3. **Stopword Removal**

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    tokens = word_tokenize(text)
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    return " ".join(filtered_tokens)

df['clean_text_no_stopwords'] = df['clean_text'].apply(remove_stopwords)

print(df[['clean_text', 'clean_text_no_stopwords']].head())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                          clean_text  \
0  ad sales boost time warner profit quarterly pr...   
1  dollar gains on greenspan speech the dollar ha...   
2  yukos unit buyer faces loan claim the owners o...   
3  high fuel prices hit bas profits british airwa...   
4  pernod takeover talk lifts domecq shares in uk...   

                             clean_text_no_stopwords  
0  ad sales boost time warner profit quarterly pr...  
1  dollar gains greenspan speech dollar hit highe...  
2  yukos unit buyer faces loan claim owners embat...  
3  high fuel prices hit bas profits british airwa...  
4  pernod takeover talk lifts domecq shares uk dr...  




## 4. **Lemmatization**

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

def lemmatization(text):
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return " ".join(lemmatized_tokens)

df['lemmatized_text'] = df['clean_text_no_stopwords'].apply(lemmatization)

print(df[['clean_text_no_stopwords', 'lemmatized_text']].head())


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


                             clean_text_no_stopwords  \
0  ad sales boost time warner profit quarterly pr...   
1  dollar gains greenspan speech dollar hit highe...   
2  yukos unit buyer faces loan claim owners embat...   
3  high fuel prices hit bas profits british airwa...   
4  pernod takeover talk lifts domecq shares uk dr...   

                                     lemmatized_text  
0  ad sale boost time warner profit quarterly pro...  
1  dollar gain greenspan speech dollar hit highes...  
2  yukos unit buyer face loan claim owner embattl...  
3  high fuel price hit ba profit british airway b...  
4  pernod takeover talk lift domecq share uk drin...  




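## 5. **Convert Processed Tokens Back to Text**

The stopword-removal and lemmatization functions above already join their tokens back into a single string with `" ".join(...)`, so the `lemmatized_text` column can be passed directly to the vectorizer in the next section.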

## **FEATURE EXTRACTION USING TF-IDF**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd


In [None]:
tfidf_vectorizer= TfidfVectorizer(max_features=2000, ngram_range=(1,2), stop_words='english')

In [None]:
tfidf_matrix= tfidf_vectorizer.fit_transform(df['lemmatized_text'])
tfidf_df= pd.DataFrame(tfidf_matrix.toarray(), columns= tfidf_vectorizer.get_feature_names_out())
tfidf_df.head()

Unnamed: 0,100m,19,1980s,1994,1997,1998,1999,1bn,20,200,...,year said,yen,york,young,young people,younger,youre,yugansk,yukos,zealand
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.105137,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.472036,0.417721,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
tfidf_df.to_csv('tfidf_matrix.csv', index=False)
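
To make the numbers in the matrix above less opaque, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings as documented by scikit-learn: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The function `tfidf_rows` and the three toy documents are hypothetical, and the sketch ignores n-grams, stop words and `max_features`.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    # Whitespace tokenisation only; real vectorizers do more.
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])  # L2-normalised row
    return vocab, rows

docs = ["dollar gain dollar", "profit gain", "yukos profit claim"]
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 3) for t, w in zip(vocab, row) if w > 0})
```

In the first toy document, `dollar` occurs twice and appears in no other document, so it receives a higher weight than `gain`, which is shared with the second document. The same logic explains why `yukos` and `yugansk` stand out in row 2 of the real matrix.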

## **SENTENCE SCORING AND SELECTION**

In [None]:
import numpy as np

def score_sentence(sentence, tfidf_df, feature_names):
    words = sentence.split()
    word_scores = [tfidf_df[word].mean() for word in words if word in feature_names]
    return np.mean(word_scores) if word_scores else 0

feature_names = set(tfidf_df.columns)

df['sentence_scores'] = df['sentence_tokens'].apply(lambda sentences: [score_sentence(sent, tfidf_df, feature_names) for sent in sentences])

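N = 3  # number of top-scoring sentences to keep per document
# Pair each document's sentences with that document's own scores (row-wise),
# instead of zipping them against the entire 'sentence_scores' column.
df['selected_sentences'] = df.apply(
    lambda row: sorted(zip(row['sentence_tokens'], row['sentence_scores']), key=lambda x: x[1], reverse=True)[:N],
    axis=1
)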

df[['sentence_tokens', 'selected_sentences']].head()




Unnamed: 0,sentence_tokens,selected_sentences
0,[ad sales boost time warner profit quarterly p...,[(ad sales boost time warner profit quarterly ...
1,[dollar gains on greenspan speech the dollar h...,[(dollar gains on greenspan speech the dollar ...
2,[yukos unit buyer faces loan claim the owners ...,[(yukos unit buyer faces loan claim the owners...
3,[high fuel prices hit bas profits british airw...,[(high fuel prices hit bas profits british air...
4,[pernod takeover talk lifts domecq shares in u...,[(pernod takeover talk lifts domecq shares in ...
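
The selection step can be sketched in isolation. The helper below is a hypothetical illustration (`top_n_sentences` is not defined anywhere in the notebook): it ranks one document's sentences by score, keeps the best `n`, and then returns them in their original order, which usually reads better than a list sorted by score.

```python
def top_n_sentences(sentences, scores, n=3):
    # Rank sentence positions by score, highest first, and keep the best n.
    top_idx = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    # Restore document order so the extract reads naturally.
    return [sentences[i] for i in sorted(top_idx)]

sentences = [
    "the dollar has hit its highest level in three months",
    "traders were upbeat",
    "analysts expect more gains after the speech",
    "markets closed early",
]
scores = [0.40, 0.05, 0.30, 0.10]

print(top_n_sentences(sentences, scores, n=2))
```

Working with positions rather than `(sentence, score)` tuples also avoids ambiguity when the same sentence text appears twice in a document.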


In [None]:
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


## **ABSTRACTIVE SUMMARIZATION**



In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").to(device)

def generate_summary(text, max_length=150, min_length=50):
    if len(text.split()) < 10:
        return text

    inputs = tokenizer("summarize: " + text, return_tensors="pt", truncation=True, max_length=512)
    inputs = inputs.to(device)

    summary_ids = model.generate(
        inputs.input_ids,
        max_length=max_length,
        min_length=min_length,
        length_penalty=2.0,
        num_beams=4,
        early_stopping=True
    )

    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)



In [None]:
df_sample = df.head(5).copy()
df_sample.loc[:, "summary"] = df_sample["clean_text"].apply(generate_summary)
print(df_sample[["clean_text", "summary"]])

                                          clean_text  \
0  ad sales boost time warner profit quarterly pr...   
1  dollar gains on greenspan speech the dollar ha...   
2  yukos unit buyer faces loan claim the owners o...   
3  high fuel prices hit bas profits british airwa...   
4  pernod takeover talk lifts domecq shares in uk...   

                                             summary  
0  timewarner said fourth quarter sales rose 2% t...  
1  the dollar has hit its highest level against t...  
2  yukos unit buyer faces loan claim from owners ...  
3  british airways blamed high fuel prices for a ...  
4  pernod takeover talk lifts domecq shares in uk...  


In [None]:
for idx, row in df_sample.iterrows():
    print(f"Original Text ({idx}):\n{row['clean_text']}\n")
    print(f"Summary ({idx}):\n{row['summary']}\n")
    print("="*80)


Original Text (0):
ad sales boost time warner profit quarterly profits at us media giant timewarner jumped % to 113bn 600m for the three months to december from 639m yearearlier the firm which is now one of the biggest investors in google benefited from sales of highspeed internet connections and higher advert sales timewarner said fourth quarter sales rose 2% to 111bn from 109bn its profits were buoyed by oneoff gains which offset a profit dip at warner bros and less users for aol time warner said on friday that it now owns 8% of searchengine google but its own internet business aol had has mixed fortunes it lost  subscribers in the fourth quarter profits were lower than in the preceding three quarters however the company said aols underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues it hopes to increase subscribers by offering the online service free to timewarner internet customers and will try to sign up aols existing customers fo

## GUI

In [None]:
!pip install gradio transformers --quiet

import gradio as gr
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").to(device)

def generate_summary(text):
    if len(text.strip()) == 0:
        return "Please enter some text."
    if len(text.split()) < 10:
        return text

    inputs = tokenizer("summarize: " + text, return_tensors="pt", max_length=512, truncation=True).to(device)

    summary_ids = model.generate(
        inputs.input_ids,
        max_length=200,
        min_length=40,
        num_beams=4,
        length_penalty=2.0,
        early_stopping=True
    )
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

interface = gr.Interface(
    fn=generate_summary,
    inputs=gr.Textbox(lines=15, placeholder="Enter article text here...", label="Input Text"),
    outputs=gr.Textbox(label="Generated Summary"),
    title="Text Summarization using T5",
    description="Enter a news article or paragraph. This app generates a summary using a pre-trained T5 model."
)
interface.launch()


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://20e24f70dc7c4e7ccc.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




## EVALUATION USING ROUGE

In [None]:
!pip install evaluate rouge_score




In [None]:
df_sample["reference_summary"] = [
    "Time Warner's profits rose due to strong ad and internet sales.",
    "The dollar rose following a speech by Greenspan.",
    "The new owner of Yukos unit may face legal claims from the former owner.",
    "British Airways reported lower profits due to high fuel costs.",
    "Pernod's possible takeover of Domecq boosts share prices."
]


In [None]:
import evaluate
rouge = evaluate.load("rouge")
predictions = df_sample['summary'].tolist()
references = df_sample['reference_summary'].tolist()
results = rouge.compute(predictions=predictions, references=references)
print("ROUGE scores:")
for key, value in results.items():
    print(f"{key}: {value:.4f}")



ROUGE scores:
rouge1: 0.1363
rouge2: 0.0223
rougeL: 0.1022
rougeLsum: 0.1022
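
For intuition about what these scores measure, ROUGE-1 can be approximated by hand as unigram overlap between a prediction and a reference. The sketch below is a simplified illustration under stated assumptions: `rouge1` is a hypothetical helper that lowercases and splits on whitespace, whereas the `rouge_score` package applies its own tokenisation, so its numbers will not match exactly.

```python
from collections import Counter

def rouge1(prediction, reference):
    # Count unigrams on each side; the overlap is the clipped count of shared words.
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = rouge1("the dollar has hit its highest level", "the dollar rose")
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Because the generated summaries are at least 50 tokens long while the hand-written references are a single short sentence, precision is bound to be low, which is one plausible reason for the modest ROUGE values reported above.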


## EVALUATION USING BERTSCORE

In [None]:
!pip install bert-score



In [None]:
from bert_score import score  # needed by the function below when this cell is run on its own

def evaluate_with_bertscore(reference_summary, generated_summary):
    P, R, F1 = score([generated_summary], [reference_summary], lang="en", verbose=False)
    return {
        "BERTScore Precision": round(P.mean().item(), 4),
        "BERTScore Recall": round(R.mean().item(), 4),
        "BERTScore F1": round(F1.mean().item(), 4)
    }


In [None]:
from bert_score import score

def evaluate_with_bertscore(reference, prediction):
    P, R, F1 = score([prediction], [reference], lang="en")
    return {
        "BERTScore Precision": P.item(),
        "BERTScore Recall": R.item(),
        "BERTScore F1": F1.item()
    }

bert_scores = []
for ref, pred in zip(references, predictions):
    bert_scores.append(evaluate_with_bertscore(ref, pred))
avg_bert_score = {
    "BERTScore Precision": round(sum([s["BERTScore Precision"] for s in bert_scores]) / len(bert_scores), 4),
    "BERTScore Recall": round(sum([s["BERTScore Recall"] for s in bert_scores]) / len(bert_scores), 4),
    "BERTScore F1": round(sum([s["BERTScore F1"] for s in bert_scores]) / len(bert_scores), 4)
}

print("Average BERTScore:")
print(avg_bert_score)


Average BERTScore:
{'BERTScore Precision': 0.8114, 'BERTScore Recall': 0.8881, 'BERTScore F1': 0.848}
