# **Importing Required Libraries**

In [None]:
!pip install transformers[sentencepiece] datasets rouge_score py7zr -q

In [None]:
from transformers import pipeline, set_seed
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize
from datasets import load_metric

# Download the Punkt models that sent_tokenize relies on
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# Data loading
df = pd.read_parquet('preprocessed_data.parquet')

In [None]:
# Set the seed for reproducibility of random operations
set_seed(42)

# Extract the first 1000 characters from the 'article' column of the second row in the DataFrame
sample_text = df.iloc[1]['article'][:1000]
summary = {}

In [None]:
# Define a function to generate a baseline summary consisting of the first three sentences
def baseline_sum(text):
  return "\n".join(sent_tokenize(text)[:3])

In [None]:
summary['baseline'] = baseline_sum(sample_text)
summary['baseline']

'reactive oxygen species cytokines considered important factors pathogenesis pancreatic cancer one two source ros nicotinamide adenine dinucleotide phosphate oxidase involved pancreatic cancer development three ros activate signaling pathways mediated mitogen activated protein kinases nf janus kinase signal transducer activator transcription forty eight inhibits cancer cell apoptosis induces cytokine expression epithelial mesenchymal transition ten eleven high levels fibronectin laminin ten eleven cytokines fourteen observed pancreatic cancer growth factors fourteen extracellular matrix proteins ten cytokines one thousand four hundred seventeen shown activate nox pathogenesis pancreatic cancer development bioactive compounds curcumin genistein resveratrol antioxidant antitumor activities pancreatic cancer briefly review role ros cytokines pathogenesis pancreatic cancer addition bioactive compounds may prevent development pancreatic cancer also discussed since ros pro inflammatory cytok
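The baseline output above is the whole 1,000-character sample rather than three sentences, because the preprocessed article has no sentence punctuation for the tokenizer to split on. A minimal sketch of that effect follows. `split_sentences` is a hypothetical regex stand-in for `sent_tokenize` (so the example runs without NLTK data), and both input strings are made up for illustration.

```python
import re

def split_sentences(text):
    """Hypothetical stand-in for nltk's sent_tokenize:
    split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def lead_n(text, n=3):
    """Lead-n baseline: keep the first n sentences, one per line."""
    return "\n".join(split_sentences(text)[:n])

# Made-up inputs: the same content with and without sentence punctuation
punctuated = "ROS drive signaling. Cytokines respond. NOX is activated. Antioxidants may help."
stripped = "ros drive signaling cytokines respond nox is activated antioxidants may help"

lead_punct = lead_n(punctuated)
lead_stripped = lead_n(stripped)

print(lead_punct)     # three lines, one sentence each
print(lead_stripped)  # the whole input on one line: nothing to split on
```

On text like this, a lead-3 baseline degenerates into "return the input", which is worth keeping in mind when reading its ROUGE scores.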

# **GPT-2**

In [None]:
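# Load GPT-2 (medium) as a text-generation pipeline
out = pipeline("text-generation", model="gpt2-medium")
# Appending "TL;DR:" prompts the model to continue with a summary
gpt2_text = sample_text + "\nTL;DR:\n"
# max_length counts the prompt tokens plus the generated tokens
output = out(gpt2_text, max_length=512, clean_up_tokenization_spaces=True)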

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
output

[{'generated_text': "reactive oxygen species cytokines considered important factors pathogenesis pancreatic cancer one two source ros nicotinamide adenine dinucleotide phosphate oxidase involved pancreatic cancer development three ros activate signaling pathways mediated mitogen activated protein kinases nf janus kinase signal transducer activator transcription forty eight inhibits cancer cell apoptosis induces cytokine expression epithelial mesenchymal transition ten eleven high levels fibronectin laminin ten eleven cytokines fourteen observed pancreatic cancer growth factors fourteen extracellular matrix proteins ten cytokines one thousand four hundred seventeen shown activate nox pathogenesis pancreatic cancer development bioactive compounds curcumin genistein resveratrol antioxidant antitumor activities pancreatic cancer briefly review role ros cytokines pathogenesis pancreatic cancer addition bioactive compounds may prevent development pancreatic cancer also discussed since ros pr

In [None]:
# Drop the prompt from the generated text, then put each remaining sentence on its own line
summary['gpt2'] = "\n".join(sent_tokenize(output[0]['generated_text'][len(gpt2_text):]))
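The slice `[len(gpt2_text):]` works because the pipeline output shown earlier begins with the prompt itself. A minimal sketch of that step, using made-up strings in place of the real prompt and model output, alongside a marker-based alternative that does not depend on character offsets:

```python
# Made-up strings standing in for the real prompt and pipeline output
prompt = "some article text\nTL;DR:\n"
generated_text = prompt + "A short summary. With two sentences."

# Approach used in the notebook: skip exactly len(prompt) characters
by_offset = generated_text[len(prompt):]

# Alternative (an assumption, not the notebook's method): split on the marker,
# which still works if tokenization cleanup shifts character offsets
_, marker, after_marker = generated_text.partition("TL;DR:")
by_marker = after_marker.strip() if marker else ""

print(by_offset)
print(by_marker)
```

The offset approach assumes the returned text reproduces the prompt character for character. If `clean_up_tokenization_spaces=True` alters spacing in the prompt, the marker-based variant may be the safer choice.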

# **T5**

In [None]:
out = pipeline("summarization", model="t5-small")
output = out(sample_text)

In [None]:
output

[{'summary_text': 'ros nicotinamide adenine dinucleotide phosphate oxidase involved pancreatic cancer development three ros activate signaling pathways mediated mitogen activated protein kinases nf janus kine transducer activator transcription forty eight inhibits cancer cell apoptosis induces cytokine expression epithelial mesenchymal transition ten 11 high levels fibronectin laminin ten eleven cytokines'}]

In [None]:
summary['t5'] = "\n".join(sent_tokenize(output[0]['summary_text']))

# **BART**

In [None]:
out = pipeline("summarization", model="facebook/bart-large-cnn")
output = out(sample_text)

In [None]:
output

[{'summary_text': 'Pancreatic cancer briefly review role ros cytokines pathogenesis. bioactive compounds curcumin genistein resveratrol antioxidant antitumor activities. reactive oxygen species cytokines considered important factors pathogenesis pancreatic cancer one two source ros nicotinamide adenine dinucleotide phosphate oxidase involved pancreatic Cancer development.'}]

In [None]:
summary['bart'] = "\n".join(sent_tokenize(output[0]['summary_text']))

In [None]:
print('Ground Truth')
print(df.iloc[1]['abstract'])

for model_name in summary:
  print(model_name.upper())
  print(summary[model_name])

Ground Truth
pancreatic cancer is one of the most aggressive drug resistant and lethal types of cancer with poor prognosis various factors including reactive oxygen species cytokines growth factors and extracellular matrix proteins are reported to be involved in the development of pancreatic cancer however the pathogenesis of pancreatic cancer has not been completely elucidated oxidative stress has been shown to contribute to the development of pancreatic cancer evidences supporting the role of reactive oxygen species and cytokines as risk for pancreatic cancer and the concept of antioxidant supplementation as preventive approach for pancreatic cancer have been proposed here we review the literature on oxidative stress cytokine expression inflammatory signaling and natural antioxidant supplementation in relation to pancreatic cancer
BASELINE
reactive oxygen species cytokines considered important factors pathogenesis pancreatic cancer one two source ros nicotinamide adenine dinucleotide

# **Measuring ROUGE Metrics**

In [None]:
rouge_score = load_metric('rouge')



In [None]:
rouge_type = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
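To make the first two metric names concrete, here is a minimal sketch of the ROUGE-N F-measure: clipped n-gram overlap between prediction and reference, turned into precision, recall, and their harmonic mean. It assumes plain lowercase whitespace tokenization. The `rouge_score` package has its own tokenizer and options, so its numbers can differ from this sketch. The function names and example sentences are illustrative only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f(prediction, reference, n=1):
    """ROUGE-N F-measure with whitespace tokenization (simplifying assumption)."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection clips each n-gram's count to the smaller of the two sides
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

r1 = rouge_n_f("the cat sat", "the cat sat on the mat", n=1)
r2 = rouge_n_f("the cat sat", "the cat sat on the mat", n=2)
print(round(r1, 4), round(r2, 4))
```

In the example, every unigram of the prediction appears in the reference (precision 1.0), but it covers only 3 of the 6 reference tokens (recall 0.5). A short summary can therefore be precise and still score modestly.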

In [None]:
reference = df.iloc[1]['abstract']

result = []

for model_name in summary:
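  # Score one summary at a time: add() queues the pair, compute() scores what was queued
  rouge_score.add(prediction=summary[model_name], reference=reference)
  score = rouge_score.compute()
  # Keep the F-measure of the mid (central) estimate for each ROUGE variant
  rouge_dict = {rn: score[rn].mid.fmeasure for rn in rouge_type}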
  print('rouge_dict', rouge_dict)
  result.append(rouge_dict)

pd.DataFrame.from_records(result, index=summary.keys())

rouge_dict {'rouge1': 0.3319148936170213, 'rouge2': 0.12017167381974249, 'rougeL': 0.19574468085106383, 'rougeLsum': 0.19574468085106383}
rouge_dict {'rouge1': 0.3206997084548105, 'rouge2': 0.05865102639296187, 'rougeL': 0.14577259475218662, 'rougeLsum': 0.22157434402332363}
rouge_dict {'rouge1': 0.10975609756097561, 'rouge2': 0.024691358024691357, 'rougeL': 0.08536585365853658, 'rougeLsum': 0.08536585365853658}
rouge_dict {'rouge1': 0.24358974358974358, 'rouge2': 0.07792207792207792, 'rougeL': 0.15384615384615385, 'rougeLsum': 0.1794871794871795}


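| model    | rouge1   | rouge2   | rougeL   | rougeLsum |
|----------|----------|----------|----------|-----------|
| baseline | 0.331915 | 0.120172 | 0.195745 | 0.195745  |
| gpt2     | 0.320700 | 0.058651 | 0.145773 | 0.221574  |
| t5       | 0.109756 | 0.024691 | 0.085366 | 0.085366  |
| bart     | 0.243590 | 0.077922 | 0.153846 | 0.179487  |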
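To read the table one metric at a time, the following sketch picks the top scorer per column, first across all four entries and then among the three models only. The numbers are copied from the table above. The variable names are illustrative.

```python
# Scores copied from the results table above
scores = {
    "baseline": {"rouge1": 0.331915, "rouge2": 0.120172, "rougeL": 0.195745, "rougeLsum": 0.195745},
    "gpt2":     {"rouge1": 0.320700, "rouge2": 0.058651, "rougeL": 0.145773, "rougeLsum": 0.221574},
    "t5":       {"rouge1": 0.109756, "rouge2": 0.024691, "rougeL": 0.085366, "rougeLsum": 0.085366},
    "bart":     {"rouge1": 0.243590, "rouge2": 0.077922, "rougeL": 0.153846, "rougeLsum": 0.179487},
}
metrics = ["rouge1", "rouge2", "rougeL", "rougeLsum"]

def best_per_metric(candidates):
    """Return {metric: name of the highest-scoring candidate}."""
    return {m: max(candidates, key=lambda name: scores[name][m]) for m in metrics}

best_overall = best_per_metric(list(scores))
best_model = best_per_metric([name for name in scores if name != "baseline"])

print(best_overall)
print(best_model)
```

Separating the baseline from the models makes it easier to see where a model beats the other models and where it also beats the simple lead-sentence heuristic.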


## Conclusion

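Among the three models, GPT-2 performs best overall on this sample: it has the highest ROUGE-1 (0.321) and ROUGE-Lsum (0.222) scores, and its ROUGE-Lsum score is the highest in the table. ROUGE-Lsum measures longest-common-subsequence overlap computed over the newline-separated sentences of the summary, so it reflects summary-level agreement with the reference rather than coherence as such. The baseline sets a high standard for content overlap, leading on ROUGE-1 (0.332), ROUGE-2 (0.120), and ROUGE-L (0.196). BART trails GPT-2 on ROUGE-1 and ROUGE-Lsum but edges ahead on ROUGE-2 (0.078 vs. 0.059) and ROUGE-L (0.154 vs. 0.146), so the two perform comparably. T5 has the lowest score on every metric. All scores come from a single article, so the ranking is indicative only.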