# Pegasus_Media_Sum_Model_3a_Extract_Abstract
    # November 6, 2022

This notebook builds an abstractive summarizer based on the Pegasus Transformer model from Hugging Face.

Two abstractions are performed: (1) abstraction from the full original article and (2) abstraction from the summary produced by the extractive model.

The abstractions are evaluated for ROUGE Scores against the gold labels (original highlights).
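
To make the metric concrete, here is a minimal pure-Python sketch of ROUGE-N and ROUGE-L F1. It is a simplified illustration, not the implementation used in this notebook: the notebook relies on the `rouge` metric from `evaluate`, which applies its own tokenization. The function names `rouge_n_f1` and `rouge_l_f1`, and the lowercased whitespace tokenization, are assumptions made for this example.

```python
from collections import Counter

def _f1(overlap, n_candidate, n_reference):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if overlap == 0:
        return 0.0
    precision = overlap / n_candidate
    recall = overlap / n_reference
    return 2 * precision * recall / (precision + recall)

def rouge_n_f1(candidate, reference, n=1):
    """ROUGE-N F1 from clipped n-gram overlap (lowercased whitespace tokens)."""
    def ngram_counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngram_counts(candidate), ngram_counts(reference)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((cand & ref).values())
    return _f1(overlap, sum(cand.values()), sum(ref.values()))

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 from the longest common subsequence (LCS) of tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    # Dynamic-programming table: table[i][j] is the LCS length of c[:i] and r[:j]
    table = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, c_tok in enumerate(c, start=1):
        for j, r_tok in enumerate(r, start=1):
            if c_tok == r_tok:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return _f1(table[len(c)][len(r)], len(c), len(r))
```

ROUGE-1 and ROUGE-2 reward shared words and word pairs, while ROUGE-L rewards words that appear in the same order, even when they are not adjacent.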

Dataset Summary

This large-scale media interview dataset contains 463.6K transcripts with abstractive summaries, collected from interview transcripts and overview / topic descriptions from NPR and CNN.

Data Fields

    id: paper id
    document: a string/list containing the body of a set of documents
    summary: a string containing the abstract of the set

Extractive Model Details

    Sentence Transformer: Sentence Transformer
    Pre-Training: all-MiniLM-L6-v2
    **Supervised**: Supervised using Abstractive Summaries
    Classification: KMeans Clustering and Neighbors
    Trigram Blocking: Yes
    Fine Tuning: None
    Evaluation Metrics: RougeL and Cosine Similarity
    
    NOTE: Abstractive summaries are used as a gold label to compute RougeL Scores
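
Trigram blocking, listed above, is a common way to reduce redundancy in extractive summaries: a candidate sentence is skipped if it shares any word trigram with the sentences already selected. The sketch below is a generic illustration under stated assumptions, not the extractive model's actual code. The names `trigrams` and `trigram_block`, the whitespace tokenization, and the `max_sentences` parameter are hypothetical.

```python
def trigrams(sentence):
    """Set of lowercased word trigrams in a sentence (whitespace tokenization)."""
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def trigram_block(ranked_sentences, max_sentences=3):
    """Greedily keep sentences (best first), skipping any that repeats a trigram already selected."""
    selected, seen = [], set()
    for sentence in ranked_sentences:
        if len(selected) >= max_sentences:
            break
        tri = trigrams(sentence)
        if tri & seen:
            continue  # shares a trigram with an earlier pick: treat as redundant
        selected.append(sentence)
        seen |= tri
    return selected
```

The input is assumed to be ordered from most to least relevant, so that when two sentences overlap the higher-ranked one is kept.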
    


In [None]:
!pip install -q sentencepiece


In [None]:

!pip install -q transformers



In [None]:

!pip install -q datasets



In [None]:

!pip install -q evaluate
import evaluate



In [None]:

!pip install -q rouge_score




In [None]:
!pip install nltk --quiet

In [None]:
# NLTK
import re # regular expressions
import nltk # natural language toolkit for sentence tokenization and display
import string
import heapq
nltk.download('punkt')
nltk.download('stopwords')
from nltk import word_tokenize
from nltk.util import ngrams

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# load_metric is not needed: ROUGE is loaded through the evaluate library below
from datasets import load_dataset

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Mount drive for saving model checkpoints, loading Task 2 data below

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from pprint import pprint

In [None]:
filepath = 'drive/My Drive/Colab_Notebooks_1/model_3a_extracted_mediasum1000.csv'
dataset = load_dataset('csv', data_files = filepath, split='train' )
dataset



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-2426e0b62a4f60ee/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-2426e0b62a4f60ee/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.


Dataset({
    features: ['orig_article', 'orig_summary', 'extracted_summary'],
    num_rows: 1000
})

In [None]:
from transformers import PegasusTokenizer, TFPegasusForConditionalGeneration

check_point = 'google/pegasus-cnn_dailymail'

model = TFPegasusForConditionalGeneration.from_pretrained(check_point, from_pt=True)
tokenizer = PegasusTokenizer.from_pretrained(check_point)



All PyTorch model weights were used when initializing TFPegasusForConditionalGeneration.

Some weights or buffers of the TF 2.0 model TFPegasusForConditionalGeneration were not initialized from the PyTorch model and are newly initialized: ['model.encoder.embed_positions.weight', 'model.decoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
model.summary()

Model: "tf_pegasus_for_conditional_generation"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 model (TFPegasusMainLayer)  multiple                  570797056 
                                                                 
 final_logits_bias (BiasLaye  multiple                 96103     
 r)                                                              
                                                                 
Total params: 570,893,159
Trainable params: 570,797,056
Non-trainable params: 96,103
_________________________________________________________________


In [None]:
rouge = evaluate.load('rouge')


In [None]:
# Iterate through the dataset, generate an abstractive summary from (1) the extracted summary
# and (2) the full original article, and compute ROUGE-1, ROUGE-2 and ROUGE-L for each

size_of_dataset = 100 # *****Enter size of dataset for evaluation*****

# get original article, original summary and extracted summary from dataset
orig_article_list = dataset['orig_article'][0:size_of_dataset]
orig_summary_list = dataset['orig_summary'][0:size_of_dataset]
extracted_summary_list = dataset['extracted_summary'][0:size_of_dataset]

# zip original article, original summary and extracted summary
zipped_input = zip(orig_article_list, orig_summary_list, extracted_summary_list)

# Empty lists to store scores
rouge_1_list_with_no_ext = []
rouge_2_list_with_no_ext = []
rouge_L_list_with_no_ext = []

rouge_1_list_with_ext = []
rouge_2_list_with_ext = []
rouge_L_list_with_ext = []

def abstract_and_score(text, reference, max_l, label):
  # Generate an abstractive summary of `text`, print it under `label`,
  # and return its ROUGE scores against `reference`
  inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="tf")
  summary_ids = model.generate(inputs["input_ids"],
                              num_beams=3,
                              no_repeat_ngram_size=2,
                              max_length=max_l)
  # Decode once and reuse the text for both display and scoring
  candidate = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
  print(label)
  pprint(candidate, compact=True)
  print("")
  return rouge.compute(predictions=[candidate], references=[reference])

# Iterate through the dataset; count tracks the example number
for count, (orig_art, orig_high, ext_summary) in enumerate(zipped_input, start=1):
  print('Example', count)

  print("Original Article")
  print(orig_art)
  print("")

  print("Original Highlights")
  print(orig_high)
  print("")

  # Cap the generated length with the length of the reference summary.
  # NOTE: len(orig_high) counts characters while max_length counts tokens, so this is a loose upper bound
  max_l = len(orig_high)

  # Abstractive summary from the extracted summary (sentences re-joined with spaces)
  article_to_summarize = ' '.join(nltk.sent_tokenize(ext_summary))
  rouge_results = abstract_and_score(article_to_summarize, orig_high, max_l, "Abstracted Summary from Extraction")
  rouge_1_list_with_ext.append(rouge_results['rouge1'])
  rouge_2_list_with_ext.append(rouge_results['rouge2'])
  rouge_L_list_with_ext.append(rouge_results['rougeL'])

  # Abstractive summary from the original article
  rouge_results = abstract_and_score(orig_art, orig_high, max_l, "Abstracted Summary from Original Article")
  rouge_1_list_with_no_ext.append(rouge_results['rouge1'])
  rouge_2_list_with_no_ext.append(rouge_results['rouge2'])
  rouge_L_list_with_no_ext.append(rouge_results['rougeL'])

Example 1
Original Article
FARAI CHIDEYA, host: Now, moving on, Forest Whitaker as Moses, Tisha Campbell Martin as Mary Magdalene - well, that's all in "The Bible Experience. " A New Testament edition was released in 2006.  This edition is billed as "The Complete Bible. " It doesn't have one person reading the gospels.  It features nearly 400 African-American artists, actors and ministers, plus sound effects. FARAI CHIDEYA, host: Just listen to Blair Underwood's rendition of Jesus on the cross. Mr.  BLAIR UNDERWOOD (Actor): (As Jesus) My God, my God, why have you forsaken me?.  FARAI CHIDEYA, host: Now, we've got two people affiliated with the project with us today.  Kyle Bowser, he co-produced "The Bible Experience" and actress Wendy Raquel Robinson, one of the actors in "The Bible Experience," and she also stars in the CW series, "The Game. "FARAI CHIDEYA, host: Hi folks, how are you doing?.  Ms.  WENDY RAQUEL ROBINSON (Actress): Great. Mr.  KYLE BOWSER (Co-producer, "The Bible Exper

In [None]:
# Calculate Mean Rouge for Dataset

print("Rouge Scores with NO Extractive Summarization")
print("********************************************")

# Print the raw scores, their count, and the mean for each ROUGE variant
for name, scores in [("Rouge 1", rouge_1_list_with_no_ext),
                     ("Rouge 2", rouge_2_list_with_no_ext),
                     ("Rouge L", rouge_L_list_with_no_ext)]:
  print("")
  print(name + " Scores")
  print("raw scores")
  print(scores)
  print(len(scores))
  print("")
  print("mean scores")
  print(np.mean(np.asarray(scores)))


Rouge Scores with NO Extractive Summarization
********************************************

Rouge 1 Scores
raw scores
[0.5054945054945056, 0.36000000000000004, 0.23529411764705882, 0.3783783783783784, 0.2608695652173913, 0.35714285714285715, 0.14634146341463414, 0.4318181818181818, 0.23655913978494625, 0.2972972972972973, 0.2682926829268293, 0.2471910112359551, 0.4042553191489362, 0.22222222222222224, 0.25263157894736843, 0.1794871794871795, 0.17204301075268816, 0.3711340206185567, 0.15625, 0.24999999999999994, 0.21782178217821782, 0.27397260273972607, 0.3283582089552239, 0.2941176470588235, 0.2434782608695652, 0.2795698924731182, 0.19999999999999998, 0.4948453608247423, 0.21333333333333332, 0.3209876543209877, 0.27999999999999997, 0.27586206896551724, 0.15730337078651688, 0.2626262626262626, 0.592, 0.43137254901960786, 0.2926829268292683, 0.24074074074074076, 0.25000000000000006, 0.40816326530612246, 0.3768115942028986, 0.32380952380952377, 0.29473684210526313, 0.4827586206896552, 0.4

In [None]:
# Calculate Mean Rouge for Dataset

print("Rouge Scores with Extractive Summarization")
print("********************************************")

# Print the raw scores, their count, and the mean for each ROUGE variant
for name, scores in [("Rouge 1", rouge_1_list_with_ext),
                     ("Rouge 2", rouge_2_list_with_ext),
                     ("Rouge L", rouge_L_list_with_ext)]:
  print("")
  print(name + " Scores")
  print("raw scores")
  print(scores)
  print(len(scores))
  print("")
  print("mean scores")
  print(np.mean(np.asarray(scores)))

Rouge Scores with Extractive Summarization
********************************************

Rouge 1 Scores
raw scores
[0.47500000000000003, 0.20512820512820515, 0.3829787234042554, 0.4888888888888889, 0.25, 0.3333333333333333, 0.18749999999999997, 0.3404255319148936, 0.29629629629629634, 0.3714285714285714, 0.4313725490196078, 0.3888888888888889, 0.3953488372093023, 0.24657534246575347, 0.3333333333333333, 0.4313725490196078, 0.3516483516483516, 0.3877551020408163, 0.3296703296703297, 0.2758620689655172, 0.1869158878504673, 0.5151515151515151, 0.34343434343434337, 0.2588235294117647, 0.2641509433962264, 0.47058823529411764, 0.27586206896551724, 0.5052631578947369, 0.2564102564102564, 0.3218390804597701, 0.3614457831325301, 0.3058823529411765, 0.33333333333333337, 0.2929936305732484, 0.48818897637795283, 0.5233644859813085, 0.2439024390243902, 0.28169014084507044, 0.21978021978021975, 0.46808510638297873, 0.41025641025641024, 0.45283018867924524, 0.3047619047619048, 0.52991452991453, 0.455
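
Because both score lists are built in the same loop, entry `i` of each list refers to the same example, so the two settings can also be compared example by example rather than only through their means. The helper below is a minimal sketch of such a paired comparison. The name `compare_paired_scores` and the returned keys are hypothetical and not part of the original notebook.

```python
from statistics import mean

def compare_paired_scores(with_ext, no_ext):
    """Summarize per-example differences between two equally long score lists."""
    if len(with_ext) != len(no_ext):
        raise ValueError("score lists must be paired example by example")
    # Positive difference: the extract-then-abstract summary scored higher
    diffs = [a - b for a, b in zip(with_ext, no_ext)]
    return {
        "mean_with_ext": mean(with_ext),
        "mean_no_ext": mean(no_ext),
        "mean_diff": mean(diffs),
        "wins_with_ext": sum(d > 0 for d in diffs),
        "wins_no_ext": sum(d < 0 for d in diffs),
        "ties": sum(d == 0 for d in diffs),
    }
```

In this notebook it could be called as `compare_paired_scores(rouge_1_list_with_ext, rouge_1_list_with_no_ext)`, and likewise for the ROUGE-2 and ROUGE-L lists.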