# Final_Model_3a_Sbert_Transformer_Clustering_MediaSum_Extractive_Summarization
    Oct 31, 2022




This notebook builds an extractive summarizer based on the Sentence-BERT (Sentence Transformer) model from Hugging Face.

1. **SentenceTransformers** is a Python framework for state-of-the-art sentence, text and image embeddings. The initial work is described in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. This framework can be used to compute sentence and text embeddings for more than 100 languages. These embeddings can then be compared, e.g. with cosine similarity, to find sentences with a similar meaning. This can be useful for semantic textual similarity, semantic search, or paraphrase mining.
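
As a minimal illustration of the comparison step, the sketch below computes cosine similarity in plain Python. The three-dimensional vectors are made up for the example; real sentence embeddings have hundreds of dimensions. Note that `scipy.spatial.distance.cosine`, which this notebook imports later, returns the cosine *distance*, i.e. one minus this value.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (hypothetical values, for illustration only)
query = [1.0, 0.0, 1.0]
close = [2.0, 0.0, 2.0]      # same direction as the query
unrelated = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, close))      # close to 1: similar meaning
print(cosine_similarity(query, unrelated))  # 0: nothing in common
```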

2. In this notebook, **a ranked retrieval or extraction** of sentences from the input document is made using the **abstractive summary as the query vector**. The steps are:


    Step 1: KMeans Clustering of Input Document
    Step 2: Derive a query embedding vector of the abstractive summary
    Step 3: Compute Distance of Query to Centroid of Each Cluster
    Step 4: Select the Top N Clusters
    Step 5: Apply K-Nearest Neighbors to the Sentences in the Top N Clusters
    Step 6: Select the Nearest Sentences as the Extractive Summary
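
Steps 3 to 6 can be sketched in plain Python on toy data. This is an illustration under stated assumptions, not the notebook's implementation: the two-dimensional embeddings, the cluster labels (standing in for the output of Step 1) and the helper names `cosine_distance`, `centroid` and `extract` are all hypothetical.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def centroid(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def extract(sentences, embeddings, labels, query, top_n_clusters, n_sentences):
    # Step 1 is assumed done: labels[i] is the cluster id of sentence i
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    # Step 3: cosine distance from the query to each cluster centroid
    dists = {c: cosine_distance(query, centroid([embeddings[i] for i in ids]))
             for c, ids in clusters.items()}

    # Step 4: keep the clusters whose centroids are closest to the query
    top = sorted(dists, key=dists.get)[:top_n_clusters]
    candidates = [i for c in top for i in clusters[c]]

    # Step 5: rank the candidate sentences by Euclidean distance to the query
    candidates.sort(key=lambda i: math.dist(query, embeddings[i]))

    # Step 6: the nearest sentences form the extractive summary
    return [sentences[i] for i in candidates[:n_sentences]]

sentences = ["The song topped the charts.", "It stayed on the radio for months.",
             "Call us with your questions.", "You can also email the show."]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = [0, 0, 1, 1]
query = [1.0, 0.0]  # stands in for the embedded abstractive summary (Step 2)

summary = extract(sentences, embeddings, labels, query,
                  top_n_clusters=1, n_sentences=1)
print(summary)
```

Restricting the neighbor search to the top clusters trades recall for speed; setting `top_n_clusters` to the total number of clusters searches every sentence.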

Dataset Summary

This large-scale media interview dataset contains 463.6K transcripts with abstractive summaries, collected from interview transcripts and overview / topic descriptions from NPR and CNN.

Data Fields

    document: a string containing the interview transcript
    summary: a string containing the abstractive summary of the transcript
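
In the sample record inspected in section 4.0, speaker turns in `document` are separated by `</s>` and each turn starts with a `SPEAKER: ` prefix. The sketch below shows one way to split a transcript into turns under that assumption. The helper name `split_turns` is hypothetical, and the notebook itself does not do this; it strips the separators during preprocessing.

```python
def split_turns(document):
    """Split a transcript into (speaker, utterance) pairs on the </s> marker."""
    turns = []
    for chunk in document.split("</s>"):
        chunk = chunk.strip()
        if not chunk:
            continue
        speaker, sep, utterance = chunk.partition(": ")
        if not sep:
            # No "speaker: " prefix found; keep the text with an empty speaker
            speaker, utterance = "", chunk
        turns.append((speaker, utterance))
    return turns

# Two turns taken from the first MediaSum example
sample = ("FARAI CHIDEYA, host: Hi folks, how are you doing?</s>"
          "Ms. WENDY RAQUEL ROBINSON (Actress): Great.</s>")

for speaker, utterance in split_turns(sample):
    print(speaker, "->", utterance)
```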

Model Details

    Sentence Transformer: Sentence Transformer
    Pre-Training: all-MiniLM-L6-v2
    **Supervised**: Supervised using Abstractive Summaries
    Classification: KMeans Clustering and Neighbors
    Trigram Blocking: Yes
    Fine Tuning: None
    Evaluation Metrics: RougeL and Cosine Similarity
    
    NOTE: Abstractive summaries are used as a gold label to compute RougeL Scores
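
For intuition about the RougeL metric, here is a simplified sketch based on the longest common subsequence (LCS) of tokens. It assumes whitespace tokenization and no stemming, so its numbers will not match the `rouge_score` package used later exactly; the function names and the two sentences are made up for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, prediction):
    """Return (precision, recall, f-measure) of the LCS overlap."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    lcs = lcs_length(ref_tokens, pred_tokens)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(pred_tokens)  # share of the prediction that matches
    recall = lcs / len(ref_tokens)      # share of the reference that is covered
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure

reference = "the cat sat on the mat"
prediction = "the cat lay on the mat today"
precision, recall, fmeasure = rouge_l(reference, prediction)
print(precision, recall, fmeasure)
```

An extractive summary is usually much longer than the abstractive reference, so recall tends to be high while precision and F-measure stay low.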

# 1. Setup

#### This section installs key libraries

In [None]:
!pip install bert-extractive-summarizer --quiet

In [None]:
!pip install datasets --quiet
!pip install nltk --quiet

In [None]:
!pip install -q rouge_score

In [None]:
!pip install -q evaluate


In [None]:
!pip install -U spacy
!pip install -U spacy-lookups-data



In [None]:
!pip install rouge-score


In [None]:
!python -m spacy download en_core_web_lg

2022-11-01 02:52:36.593137: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.1/en_core_web_lg-3.4.1-py3-none-any.whl (587.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


# 2.0 Import Libraries

In [None]:
# NLTK
import re # regular expressions
import nltk # natural language toolkit for sentence tokenization and display
import string
import heapq
nltk.download('punkt')
nltk.download('stopwords')
from nltk import word_tokenize
from nltk.util import ngrams
import evaluate

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
import os
import logging

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"


In [None]:
import spacy
import pandas as pd

print(spacy.__version__)
print(pd.__version__)

3.4.2
1.3.5


In [None]:
import pickle
import subprocess
import sys
import nltk
from nltk import Nonterminal, nonterminals, Production, CFG, PCFG

In [None]:
#shift reduce parser example
from nltk.grammar import Nonterminal
from nltk.parse.api import ParserI
from nltk.tree import Tree

In [None]:
nlp = spacy.load("en_core_web_lg")

In [None]:
from rouge_score import rouge_scorer

In [None]:
import time

In [None]:
# Mount drive for saving model checkpoints, loading Task 2 data below

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# 3.0 Load the dataset



In [None]:
from datasets import load_dataset


In [None]:
dataset_id = "ccdv/mediasum"
dataset = load_dataset(dataset_id, split="train")



In [None]:
# Inspect data structure
print(dataset)

Dataset({
    features: ['document', 'summary'],
    num_rows: 443596
})


In [None]:
# Inspect Shape
print(dataset.shape)



(443596, 2)


# 4.0 Inspect MediaSum

In [None]:
# inspect first example
dataset[0]

{'document': 'FARAI CHIDEYA, host: Now, moving on, Forest Whitaker as Moses, Tisha Campbell Martin as Mary Magdalene - well, that\'s all in "The Bible Experience." A New Testament edition was released in 2006. This edition is billed as "The Complete Bible." It doesn\'t have one person reading the gospels. It features nearly 400 African-American artists, actors and ministers, plus sound effects.</s>FARAI CHIDEYA, host: Just listen to Blair Underwood\'s rendition of Jesus on the cross.</s>Mr. BLAIR UNDERWOOD (Actor): (As Jesus) My God, my God, why have you forsaken me?</s>FARAI CHIDEYA, host: Now, we\'ve got two people affiliated with the project with us today. Kyle Bowser, he co-produced "The Bible Experience" and actress Wendy Raquel Robinson, one of the actors in "The Bible Experience," and she also stars in the CW series, "The Game."</s>FARAI CHIDEYA, host: Hi folks, how are you doing?</s>Ms. WENDY RAQUEL ROBINSON (Actress): Great.</s>Mr. KYLE BOWSER (Co-producer, "The Bible Experien

In [None]:
print(f"- The {dataset_id} dataset has {dataset.num_rows} examples.")
print(f"- Each example is a {type(dataset[0])} with a {type(dataset[0]['document'])} as value.")
print(f"- Examples look like this: {dataset[0]}")

- The ccdv/mediasum dataset has 443596 examples.
- Each example is a <class 'dict'> with a <class 'str'> as value.
- Examples look like this: {'document': 'FARAI CHIDEYA, host: Now, moving on, Forest Whitaker as Moses, Tisha Campbell Martin as Mary Magdalene - well, that\'s all in "The Bible Experience." A New Testament edition was released in 2006. This edition is billed as "The Complete Bible." It doesn\'t have one person reading the gospels. It features nearly 400 African-American artists, actors and ministers, plus sound effects.</s>FARAI CHIDEYA, host: Just listen to Blair Underwood\'s rendition of Jesus on the cross.</s>Mr. BLAIR UNDERWOOD (Actor): (As Jesus) My God, my God, why have you forsaken me?</s>FARAI CHIDEYA, host: Now, we\'ve got two people affiliated with the project with us today. Kyle Bowser, he co-produced "The Bible Experience" and actress Wendy Raquel Robinson, one of the actors in "The Bible Experience," and she also stars in the CW series, "The Game."</s>FARAI C

# 4.0 Helper Functions

In [None]:
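def preprocess(sentence):
    """Normalize a transcript or summary string before sentence tokenization."""
    sentence=str(sentence)
    sentence = sentence.replace('\n', ' ')
    # Remove the </s> turn separators; keep a full stop after questions
    sentence = sentence.replace('?</s>', '?.')
    sentence = sentence.replace('</s>', '')
    sentence = sentence.replace("\'", "'")
    # Strip punctuation that confuses the sentence tokenizer
    sentence = sentence.replace('--', '')
    sentence = sentence.replace('|', '')
    sentence = sentence.replace('/', '')
    sentence = sentence.replace('Dr.', 'Dr')
    # Add a space after sentence-ending punctuation so adjacent turns split
    sentence = sentence.replace('?.', '?. ')
    sentence = sentence.replace('.', '. ')
    sentence = sentence.replace('!', '!. ')
    return sentence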

In [None]:
def calculate_rouge(references, predictions):
  """Compute and print ROUGE for parallel lists of references and predictions."""
  rouge = evaluate.load('rouge')
  results = rouge.compute(predictions=predictions,
                        references=references)
  print(results)
  return results

In [None]:
"""
LexRank implementation
Source: https://github.com/crabcamp/lexrank/tree/dev
"""

import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.special import softmax
import logging

logger = logging.getLogger(__name__)

def degree_centrality_scores(
    similarity_matrix,
    threshold=None,
    increase_power=True,
):
    if not (
        threshold is None
        or isinstance(threshold, float)
        and 0 <= threshold < 1
    ):
        raise ValueError(
            '\'threshold\' should be a floating-point number '
            'from the interval [0, 1) or None',
        )

    if threshold is None:
        markov_matrix = create_markov_matrix(similarity_matrix)

    else:
        markov_matrix = create_markov_matrix_discrete(
            similarity_matrix,
            threshold,
        )

    scores = stationary_distribution(
        markov_matrix,
        increase_power=increase_power,
        normalized=False,
    )

    return scores


def _power_method(transition_matrix, increase_power=True, max_iter=10000):
    eigenvector = np.ones(len(transition_matrix))

    if len(eigenvector) == 1:
        return eigenvector

    transition = transition_matrix.transpose()

    for _ in range(max_iter):
        eigenvector_next = np.dot(transition, eigenvector)

        if np.allclose(eigenvector_next, eigenvector):
            return eigenvector_next

        eigenvector = eigenvector_next

        if increase_power:
            transition = np.dot(transition, transition)

    logger.warning("Maximum number of iterations for power method exceeded without convergence!")
    return eigenvector_next


def connected_nodes(matrix):
    _, labels = connected_components(matrix)

    groups = []

    for tag in np.unique(labels):
        group = np.where(labels == tag)[0]
        groups.append(group)

    return groups


def create_markov_matrix(weights_matrix):
    n_1, n_2 = weights_matrix.shape
    if n_1 != n_2:
        raise ValueError('\'weights_matrix\' should be square')

    row_sum = weights_matrix.sum(axis=1, keepdims=True)

    # normalize probability distribution differently if we have negative transition values
    if np.min(weights_matrix) <= 0:
        return softmax(weights_matrix, axis=1)

    return weights_matrix / row_sum


def create_markov_matrix_discrete(weights_matrix, threshold):
    discrete_weights_matrix = np.zeros(weights_matrix.shape)
    ixs = np.where(weights_matrix >= threshold)
    discrete_weights_matrix[ixs] = 1

    return create_markov_matrix(discrete_weights_matrix)


def stationary_distribution(
    transition_matrix,
    increase_power=True,
    normalized=True,
):
    n_1, n_2 = transition_matrix.shape
    if n_1 != n_2:
        raise ValueError('\'transition_matrix\' should be square')

    distribution = np.zeros(n_1)

    grouped_indices = connected_nodes(transition_matrix)

    for group in grouped_indices:
        t_matrix = transition_matrix[np.ix_(group, group)]
        eigenvector = _power_method(t_matrix, increase_power=increase_power)
        distribution[group] = eigenvector

    if normalized:
        distribution /= n_1

    return distribution

In [None]:
# Trigram Blocking for Extractive Summarization

def trigram_blocking(input_text):
  """Drop sentences that add no new trigram.

  input_text is a list of sentences. The first sentence is always kept.
  A later sentence is kept only if at least one of its lowercased word
  trigrams has not been seen in an earlier sentence.
  """
  seen_trigrams = set()
  clean_list = []

  for idx, sentence in enumerate(input_text):
    # tokenize the sentence in lowercase and collect its trigrams
    tokens = nltk.word_tokenize(sentence.lower())
    trigrams = set(ngrams(tokens, 3))

    # keep the sentence if it is the first one or contributes a new trigram
    if idx == 0 or trigrams - seen_trigrams:
      clean_list.append(sentence)
      seen_trigrams.update(trigrams)

  return clean_list
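
`trigram_blocking` keeps a sentence as long as at least one of its trigrams is new. A stricter and more common variant drops a sentence as soon as it shares any trigram with the sentences already selected. The sketch below shows that variant for comparison; it is not what this notebook uses, it tokenizes on whitespace instead of NLTK, and the name `strict_trigram_blocking` and the sample sentences are made up.

```python
def trigrams(sentence):
    """Set of lowercased word trigrams in a sentence."""
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def strict_trigram_blocking(sentences):
    """Keep a sentence only if it shares no trigram with those already kept."""
    seen = set()
    kept = []
    for sentence in sentences:
        grams = trigrams(sentence)
        if grams & seen:
            continue  # overlaps with an earlier sentence, so skip it
        kept.append(sentence)
        seen |= grams
    return kept

candidates = [
    "the senate passed the bill on friday",
    "the senate passed the bill after a long debate",
    "markets rallied on the news",
]
print(strict_trigram_blocking(candidates))
```

Here the second sentence is dropped because it repeats "the senate passed"; the notebook's function would keep it, since "the bill after" is a new trigram.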

# 5.0 Create a Dataset for Model Building and Evaluation
 

In [None]:
# Generate a dataset of 1000 examples for model prototyping

# DO NOT CHANGE THE CODE IN THIS CELL

size_of_dataset = 1000 # DO NOT CHANGE THIS VALUE
raw_dataset = dataset[0:size_of_dataset]
document_list =  raw_dataset['document']
summary_list =  raw_dataset['summary']

In [None]:
# Pre-process text

def preprocess_text(sentence):
    sentence=str(sentence)
    sentence = sentence.replace('\n', ' ')
    sentence = sentence.replace('?</s>', '?.')
    sentence = sentence.replace('</s>', '')
    sentence = sentence.replace("\'", "'")
    sentence = sentence.replace('--', '')
    sentence = sentence.replace('|', '')
    sentence = sentence.replace('/', '')
    sentence = sentence.replace('Dr.', 'Dr')
    sentence = sentence.replace('?.', '?. ')
    sentence = sentence.replace('.', '. ')
    sentence = sentence.replace('!', '!. ')

    return sentence

In [None]:
# Store in a List
clean_document_list = list(((map(preprocess_text, document_list))))
clean_summary_list = list(((map(preprocess_text, summary_list))))

In [None]:
# Inspect Cleaned Text
clean_document_list[0]

'FARAI CHIDEYA, host: Now, moving on, Forest Whitaker as Moses, Tisha Campbell Martin as Mary Magdalene - well, that\'s all in "The Bible Experience. " A New Testament edition was released in 2006.  This edition is billed as "The Complete Bible. " It doesn\'t have one person reading the gospels.  It features nearly 400 African-American artists, actors and ministers, plus sound effects. FARAI CHIDEYA, host: Just listen to Blair Underwood\'s rendition of Jesus on the cross. Mr.  BLAIR UNDERWOOD (Actor): (As Jesus) My God, my God, why have you forsaken me?.  FARAI CHIDEYA, host: Now, we\'ve got two people affiliated with the project with us today.  Kyle Bowser, he co-produced "The Bible Experience" and actress Wendy Raquel Robinson, one of the actors in "The Bible Experience," and she also stars in the CW series, "The Game. "FARAI CHIDEYA, host: Hi folks, how are you doing?.  Ms.  WENDY RAQUEL ROBINSON (Actress): Great. Mr.  KYLE BOWSER (Co-producer, "The Bible Experience: The Complete Bi

In [None]:
# Create a pandas dataframe to hold cleansed data
df = pd.DataFrame(list(zip(clean_document_list, clean_summary_list)),
               columns =['document', 'summary'])

In [None]:
# write to a csv file
df.to_csv("cleaned_mediasum1000", index = False)

In [None]:
# load the cleaned mediasum1000 into a Hugging Face Data Dictionary
dataset = load_dataset('csv', data_files = 'cleaned_mediasum1000', split='train' )





In [None]:
# Inspect the data dictionary
dataset

Dataset({
    features: ['document', 'summary'],
    num_rows: 1000
})

# 6.0 Create a Test Dataset for Model Evaluation

In [None]:
# Generate a dataset of "x" examples for model evaluation

size_of_dataset = 1000 # change the value as you develop
small_dataset = dataset[0:size_of_dataset]


# 7.0 Model: Use Pre-Trained Sentence Transformer for Extractive Summarization

    (NO FINE TUNING)

In [None]:
# Install and import Sentence Transformer
!pip install -U sentence-transformers 
import nltk
from sentence_transformers import SentenceTransformer, util
import numpy as np
encoder_model = SentenceTransformer('all-MiniLM-L6-v2')
#model
#encoder_model = SentenceTransformer('johngiorgi/declutr-base')



In [None]:
from scipy.spatial.distance import cosine
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans


# 8.0 Evaluate Rouge Across Eval Data Set

In [None]:
# RougeL Scorer
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
rouge = evaluate.load('rouge')


In [None]:
test = small_dataset['document'][649:650]
test

['NEAL CONAN, HOST: Forty years and a few days ago, an eight-and-a-half-minute song broke on to the record charts, soon drenched the radio and claimed a permanent place in the lives of millions. NEAL CONAN, HOST: Singer-songwriter Don McLean, of course.  All these years later, "American Pie" continues to haunt the imagination and to inspire folklore, including the claim that the song was first written and performed in Saratoga Springs, New York.  The facts from the horse\'s mouth in just a moment. NEAL CONAN, HOST: For many, "American Pie" recalls a specific moment in time.  Where does the Chevy from the levee take you? Give us a call: 800-989-8255.  Email us: talk@npr. org.  You can join the conversation at our website.  Go to npr. org, click on TALK OF THE NATION. NEAL CONAN, HOST: Don McLean joins us now from his home in Maine.  His latest album is "Addicted to Black. " Next year, he\'ll be on the road in the United Kingdom for his 40th anniversary of "American Pie" tour.  Thanks ve

In [None]:
# Iterate through the dataset, extract the summary and compute RougeL, Cosine Similarity Scores

n_clusters = 5 # KMeans Hyper Parameter
n_nearest_neighbors = 10
top_n_clusters = n_clusters
compression_ratio = 0.3

start = time.time()

# get article from dataset
input_article = small_dataset['document'][0:size_of_dataset]
#print("input document")
#print("***************")
#print(input_article)
#print(len(input_article))
#print(type(input_article))
#print("")


# get summary from dataset
input_highlights = small_dataset['summary'][0:size_of_dataset]
#print("input highlights")
#print("*****************")
#print(input_highlights)
#print(len(input_highlights))
#print(type(input_highlights))
#print("")


# zip article and summary
zipped_input = zip(input_article, input_highlights)

# Empty List to Store Scores
rougeL_precision = []
rougeL_recall = []
rougeL_fmeasure = []

rouge_1_list = []
rouge_2_list = []
rouge_L_list = []

cosine_similarity_results = []

# Counter for Tracking Results
count = 1

# Extracted Highlights
output_highlights_list = []

for input_article, input_highlights in zipped_input:

  print('Example:', count)
  
# Tokenize the string texts  
  source_article_list = nltk.sent_tokenize(input_article)
  source_highlight_list = nltk.sent_tokenize(input_highlights)
  
# Join the list into a string
  formatted_source_article = " ".join(source_article_list)
  formatted_source_highlight = " ".join(source_highlight_list)
  
# get embeddings from model
  news_embeddings = encoder_model.encode(source_article_list)
  #print("embeddings")
  #print('**********')
  #print((news_embeddings.shape))
  #print(news_embeddings[54])
  #print

  output_list=[]

# KMeans clustering of the source document

  # Use a per-document cluster count so that a short document does not
  # shrink the n_clusters hyperparameter for every document that follows
  k = min(n_clusters, len(source_article_list))

  cluster_model = KMeans(n_clusters=k, random_state=0)
  news_clusters = cluster_model.fit_predict(news_embeddings)

# Bucket each sentence index under its cluster id
  cluster_news_ids = {i: [] for i in range(k)}

  for i, c in enumerate(news_clusters):
    cluster_news_ids[c].append(i)

# Embed the query (the abstractive summary)
  query = formatted_source_highlight
  query_embedding = encoder_model.encode([query]) # pass a list so the result is 2-D

# Compute the cosine distance from the query embedding to each cluster centroid
  query_cluster_dists = [cosine(query_embedding[0], cluster_model.cluster_centers_[c])
                       for c in range(k)]

# inspect query_cluster_dists
  #print(len(query_cluster_dists)) # distance of query to each cluster centroids
  #print(query_cluster_dists) # distance from each cluster

# Rank the clusters by distance to the query and keep the top_n_clusters nearest.
# (You can try different values of top_n_clusters to see the trade-off between
# speed and recall. With top_n_clusters = n_clusters, every cluster is kept.)

  top_clusters = np.argsort(query_cluster_dists)[:top_n_clusters]

  candidate_news_ids = [i for c in top_clusters for i in cluster_news_ids[c]]
  #print(len(candidate_news_ids))
  #print(query_cluster_dists[2])
  #print(np.amin(query_cluster_dists))
  #print(top_clusters)

# Now use Nearest Neighbors only on the top cluster candidates
  candidate_news_embeds = [news_embeddings[i] for i in candidate_news_ids]
  #print(len(candidate_news_embeds))
  #print(len(candidate_news_embeds[0]))

  # Keep roughly compression_ratio of the sentences, but always at least one
  n_nearest_neighbors = max(1, int(compression_ratio * len(source_article_list)))

  knn_model = NearestNeighbors(n_neighbors=n_nearest_neighbors)
  knn_model.fit(candidate_news_embeds)

  dists, topk_idx = knn_model.kneighbors(query_embedding)
  for d, i in zip(dists[0], topk_idx[0]):
    orig_i = candidate_news_ids[i]
    #print(d, source_article_list[orig_i]) #source_article_list = nltk.sent_tokenize(test_doc)
    output_list.append(source_article_list[orig_i])
  
  result = output_list

# result is a list of sentences; join with spaces so sentence boundaries survive
  summary = " ".join(result)

# Split the joined summary back into a list with one string per sentence
  extracted_sentence_list = nltk.sent_tokenize(summary)

# Apply Trigram Blocking
  result_list = trigram_blocking(extracted_sentence_list)
  summary = " ".join(result_list)
  output_highlights_list.append(summary)

  predictions = " ".join(result_list)
  references = formatted_source_highlight

  # RougeScorer.score takes (target, prediction), in that order
  rougeL_scores = scorer.score(references,
                      predictions)
  
  pred = [predictions]
  ref = [references]

  rouge_results = rouge.compute(predictions=pred, references=ref)
  print(rouge_results)

  rouge_1_score = (rouge_results['rouge1'])
  rouge_2_score = (rouge_results['rouge2'])
  rouge_L_score = (rouge_results['rougeL'])

  rouge_1_list.append(rouge_1_score)
  rouge_2_list.append(rouge_2_score)
  rouge_L_list.append(rouge_L_score)

  precision = (rougeL_scores['rougeL'].precision)
  recall = (rougeL_scores['rougeL'].recall)
  fmeasure = (rougeL_scores['rougeL'].fmeasure)

  rougeL_precision.append(precision)
  rougeL_recall.append(recall)
  rougeL_fmeasure.append(fmeasure)

  # calculate cosine similarity between the spaCy document vectors
  # (predictions and references are already strings, so no join is needed)
  doc1 = nlp(predictions)
  doc2 = nlp(references)
  cosine_similarity = doc1.similarity(doc2)
  cosine_similarity_results.append(cosine_similarity)

  count = count + 1

Example: 1
{'rouge1': 0.09536423841059602, 'rouge2': 0.04780876494023904, 'rougeL': 0.07417218543046357, 'rougeLsum': 0.07417218543046357}
Example: 2
{'rouge1': 0.03977900552486188, 'rouge2': 0.022148394241417495, 'rougeL': 0.026519337016574586, 'rougeLsum': 0.026519337016574586}
Example: 3
{'rouge1': 0.10625000000000001, 'rouge2': 0.04402515723270441, 'rougeL': 0.06875, 'rougeLsum': 0.06875}
Example: 4
{'rouge1': 0.15311004784688995, 'rouge2': 0.0673076923076923, 'rougeL': 0.11004784688995213, 'rougeLsum': 0.11004784688995213}
Example: 5
{'rouge1': 0.1092896174863388, 'rouge2': 0.03296703296703297, 'rougeL': 0.060109289617486336, 'rougeLsum': 0.060109289617486336}
Example: 6
{'rouge1': 0.07675438596491227, 'rouge2': 0.03736263736263736, 'rougeL': 0.05701754385964912, 'rougeLsum': 0.05701754385964912}
Example: 7
{'rouge1': 0.1543026706231454, 'rouge2': 0.035820895522388055, 'rougeL': 0.10089020771513353, 'rougeLsum': 0.10089020771513353}
Example: 8
{'rouge1': 0.1337386018237082, 'rouge

In [None]:
# Print Compute Time
print('\nTime:', time.time() - start)


Time: 2503.2569892406464


In [None]:
# Export extractive summary to a CSV

# get article from dataset
input_article_list = small_dataset['document'][0:size_of_dataset]

# get summary from dataset
input_highlights_list = small_dataset['summary'][0:size_of_dataset]

df = pd.DataFrame(list(zip(input_article_list, input_highlights_list, output_highlights_list)),
                  columns = ['orig_article', 'orig_summary', 'extracted_summary'])


# Edit this filepath to wherever you saved the data in your Drive
filepath = 'drive/My Drive/Colab_Notebooks_1/model_3a_extracted_mediasum1000.csv'

df.to_csv(filepath,index = False)

In [None]:
# read back the csv file
data_import = pd.read_csv(filepath)        
#data_import.rename(columns = {'0':'orig_article', '1':'orig_summary', '2':'extracted_summary'}, inplace = True)

col1 = data_import.orig_article.values.tolist()
col2 = data_import.orig_summary.values.tolist()
col3 = data_import.extracted_summary.values.tolist()

In [None]:
# Validate data_import
print(col1[0])
print("")

print(col2[0])
print("")

print(col3[0])


FARAI CHIDEYA, host: Now, moving on, Forest Whitaker as Moses, Tisha Campbell Martin as Mary Magdalene - well, that's all in "The Bible Experience. " A New Testament edition was released in 2006.  This edition is billed as "The Complete Bible. " It doesn't have one person reading the gospels.  It features nearly 400 African-American artists, actors and ministers, plus sound effects. FARAI CHIDEYA, host: Just listen to Blair Underwood's rendition of Jesus on the cross. Mr.  BLAIR UNDERWOOD (Actor): (As Jesus) My God, my God, why have you forsaken me?.  FARAI CHIDEYA, host: Now, we've got two people affiliated with the project with us today.  Kyle Bowser, he co-produced "The Bible Experience" and actress Wendy Raquel Robinson, one of the actors in "The Bible Experience," and she also stars in the CW series, "The Game. "FARAI CHIDEYA, host: Hi folks, how are you doing?.  Ms.  WENDY RAQUEL ROBINSON (Actress): Great. Mr.  KYLE BOWSER (Co-producer, "The Bible Experience: The Complete Bible")

In [None]:
# Calculate Mean Rouge for Dataset

print("RougeL Precision Scores")
print(rougeL_precision)
print(len(rougeL_precision))
print(np.mean(np.asarray(rougeL_precision)))
print("")

print("RougeL Recall Scores")
print(rougeL_recall)
print(len(rougeL_recall))
print(np.mean(np.asarray(rougeL_recall)))
print("")

print("RougeL Fmeasure Scores")
print(rougeL_fmeasure)
print(len(rougeL_fmeasure))
print(np.mean(np.asarray(rougeL_fmeasure)))
print("")

print("Rouge 1 Scores")
print(rouge_1_list)
print(len(rouge_1_list))
print(np.mean(np.asarray(rouge_1_list)))
print("")

print("Rouge 2 Scores")
print(rouge_2_list)
print(len(rouge_2_list))
print(np.mean(np.asarray(rouge_2_list)))
print("")

print("Rouge L Scores")
print(rouge_L_list)
print(len(rouge_L_list))
print(np.mean(np.asarray(rouge_L_list)))
print("")



RougeL Precision Scores
[0.6511627906976745, 0.5909090909090909, 0.4230769230769231, 0.5348837209302325, 0.4, 0.6190476190476191, 0.4473684210526316, 0.5, 0.575, 0.5882352941176471, 0.6, 0.5555555555555556, 0.575, 0.23255813953488372, 0.5128205128205128, 0.4782608695652174, 0.5217391304347826, 0.4782608695652174, 0.5172413793103449, 0.30303030303030304, 0.32558139534883723, 0.6060606060606061, 0.4878048780487805, 0.45161290322580644, 0.4, 0.7333333333333333, 0.5757575757575758, 0.4642857142857143, 0.4857142857142857, 0.7073170731707317, 0.40816326530612246, 0.3902439024390244, 0.38636363636363635, 0.43478260869565216, 0.7307692307692307, 0.6440677966101694, 0.3614457831325301, 0.4634146341463415, 0.3870967741935484, 0.6470588235294118, 0.4838709677419355, 0.6363636363636364, 0.6170212765957447, 0.7, 0.4897959183673469, 0.5, 0.45454545454545453, 0.2962962962962963, 0.5348837209302325, 0.5652173913043478, 0.42424242424242425, 0.7209302325581395, 0.5294117647058824, 0.7307692307692307, 0.

In [None]:
# Calculate Mean Cosine Similarity

print("Cosine Similarity")
print(np.mean(np.asarray(cosine_similarity_results)))
print("")



Cosine Similarity
0.9903292558527907



In [None]:
dataset = load_dataset('csv', data_files = filepath, split='train' )





In [None]:
dataset

Dataset({
    features: ['orig_article', 'orig_summary', 'extracted_summary'],
    num_rows: 1000
})