# CNN_Daily_Mail_Final_Model_3a_Sbert_Transformer_Clustering_Extractive_Summarization
    November 6, 2022




This notebook builds an extractive summarizer based on the Sentence-BERT (SBERT) Sentence Transformer model from Hugging Face.

1. **SentenceTransformers** is a Python framework for state-of-the-art sentence, text and image embeddings. The initial work is described in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. This framework can be used to compute sentence / text embeddings for more than 100 languages. **These** embeddings can then be compared, e.g. with cosine similarity, to find sentences with a similar meaning. This can be useful for semantic textual similarity, semantic search, or paraphrase mining.

2. In this notebook, **a ranked retrieval or extraction** of sentences from the input document is made using the **abstractive summary as the query vector**. The steps are:


    Step 1: KMeans Clustering of Input Document
    Step 2: Derive a query embedding vector of the abstractive summary
    Step 3: Compute Distance of Query to Centroid of Each Cluster
    Step 4: Select the Top N Clusters
    Step 5: Apply Nearest Neighbors Search to the Sentences in the Top N Clusters
    Step 6: Select the Retrieved Sentences as the Extractive Summary
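
Steps 3 to 6 can be sketched without any external library. The following is a minimal illustration, not the notebook's implementation: the 2-D "embeddings", the cluster labels (standing in for the KMeans output of Step 1), the query vector (standing in for Step 2) and the function name `extract` are all made up for the example. Nearest neighbors are ranked by Euclidean distance, which is the default metric of scikit-learn's `NearestNeighbors`.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def euclidean(u, v):
    """Return the Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(vectors):
    """Return the component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def extract(sentences, embeddings, labels, query, top_n_clusters, n_neighbors):
    # Step 1 is assumed done: labels[i] is the cluster id of sentence i
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    # Step 3: cosine distance from the query to each cluster centroid
    dists = {c: cosine_distance(query, centroid([embeddings[i] for i in ids]))
             for c, ids in clusters.items()}

    # Step 4: keep the clusters whose centroids are closest to the query
    top = sorted(dists, key=dists.get)[:top_n_clusters]
    candidates = [i for c in top for i in clusters[c]]

    # Step 5: rank the candidate sentences by distance to the query
    ranked = sorted(candidates, key=lambda i: euclidean(query, embeddings[i]))

    # Step 6: the closest sentences form the extractive summary
    return [sentences[i] for i in ranked[:n_neighbors]]

# Toy data (hypothetical): four sentences in two obvious clusters
sentences = ["Markets rose on Monday.", "Stocks closed higher.",
             "The team won the final.", "Fans celebrated the win."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
query = [0.8, 0.2]  # stands in for the embedded abstractive summary

summary = extract(sentences, embeddings, labels, query,
                  top_n_clusters=1, n_neighbors=2)
print(summary)
```

With `top_n_clusters` equal to the total number of clusters, Step 4 filters nothing and the result is the same as a nearest-neighbor search over the whole document.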

Dataset Summary

The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering. 

Data Fields

    id: a string containing the hexadecimal-formatted SHA1 hash of the URL where the story was retrieved from
    article: a string containing the body of the news article
    highlights: a string containing the highlight of the article as written by the article author

Model Details

    Sentence Transformer: Sentence Transformer
    Pre-Training: all-MiniLM-L6-v2
    **Supervised**: Supervised using Abstractive Summaries
    Classification: KMeans Clustering and Nearest Neighbors
    Trigram Blocking: Yes
    Fine Tuning: None
    Evaluation Metrics: RougeL and Cosine Similarity
    
    NOTE: Abstractive summaries are used as a gold label to compute RougeL Scores
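
To make the RougeL metric concrete, here is a simplified sketch of how ROUGE-L precision, recall and F-measure follow from the longest common subsequence (LCS) of reference and candidate tokens. It is an illustration only: it tokenizes on whitespace and does no stemming, so its numbers will not match the `rouge_score` package exactly, and the two example sentences are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, candidate):
    """Return (precision, recall, f-measure) of candidate against reference."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    lcs = lcs_length(ref_tokens, cand_tokens)
    # Precision: share of candidate tokens that lie on the LCS
    precision = lcs / len(cand_tokens) if cand_tokens else 0.0
    # Recall: share of reference tokens that lie on the LCS
    recall = lcs / len(ref_tokens) if ref_tokens else 0.0
    f = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f

reference = "the cat sat on the mat"         # gold (abstractive) summary
candidate = "the cat lay on the mat today"   # extracted summary
p, r, f = rouge_l(reference, candidate)
print(p, r, f)  # LCS is "the cat on the mat" (5 tokens): 5/7, 5/6, 10/13
```

Because an extractive summary is usually longer than the highlights it is scored against, recall tends to be higher than precision under this definition.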

# 1. Setup

#### This section installs key libraries

In [None]:
!pip install bert-extractive-summarizer --quiet

In [None]:
!pip install datasets --quiet
!pip install nltk --quiet

In [None]:
!pip install -q rouge_score

  Building wheel for rouge-score (setup.py) ... done


In [None]:
!pip install -q evaluate

In [None]:
!pip install -U spacy
!pip install -U spacy-lookups-data

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy-lookups-data
  Downloading spacy_lookups_data-1.0.3-py2.py3-none-any.whl (98.5 MB)
Installing collected packages: spacy-lookups-data
Successfully installed spacy-lookups-data-1.0.3


In [None]:
!pip install rouge-score


In [None]:
!python -m spacy download en_core_web_lg

2022-11-06 16:01:08.972105: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.1/en_core_web_lg-3.4.1-py3-none-any.whl (587.7 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.4.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


# 2.0 Import Libraries

In [None]:
# NLTK
import re # regular expressions
import nltk # natural language toolkit for sentence tokenization and display
import string
import heapq
nltk.download('punkt')
nltk.download('stopwords')
from nltk import word_tokenize
from nltk.util import ngrams
import evaluate

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
import os
import logging

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"


In [None]:
import evaluate

In [None]:
import spacy
import pandas as pd

print(spacy.__version__)
print(pd.__version__)

3.4.2
1.3.5


In [None]:
import pickle
import subprocess
import sys
import nltk
from nltk import Nonterminal, nonterminals, Production, CFG, PCFG

In [None]:
#shift reduce parser example
from nltk.grammar import Nonterminal
from nltk.parse.api import ParserI
from nltk.tree import Tree

In [None]:
nlp = spacy.load("en_core_web_lg")

In [None]:
from rouge_score import rouge_scorer

In [None]:
import time

In [None]:
# Mount drive for saving model checkpoints, loading Task 2 data below

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 3.0 Load the dataset



In [None]:
from datasets import load_dataset, load_metric


In [None]:
# Load 1000 examples of CNN Dailymail from a CSV File

filepath = 'drive/My Drive/Colab_Notebooks_1/cnn_dailymail_1000.csv'
dataset = load_dataset('csv', data_files = filepath, split='train' )
dataset



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-d8ef26a734f1246a/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-d8ef26a734f1246a/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.


Dataset({
    features: ['document', 'summary'],
    num_rows: 1000
})

# 4.0 Helper Functions

In [None]:
def preprocess(sentence):
    sentence=str(sentence)
    sentence = sentence.replace('\n', ' ')
    sentence = sentence.replace('?</s>', '?.')
    sentence = sentence.replace('</s>', '')
    sentence = sentence.replace("\'", "'")
    sentence = sentence.replace('--', '')
    sentence = sentence.replace('|', '')
    sentence = sentence.replace('/', '')
    sentence = sentence.replace('Dr.', 'Dr')
    sentence = sentence.replace('?.', '?. ')
    #sentence = sentence.replace('.', '. ')
    sentence = sentence.replace('!', '!. ')
    #sentence = sentence.replace('"', '')
    sentence = sentence.replace("This material may not be published, broadcast, rewritten, or redistributed.",'')
    #sentence = sentence.replace(' "', '')
    #sentence = sentence.replace('" ', '')
    return sentence

In [None]:
def calculate_rouge(references, predictions):
  # Compute ROUGE for a list of predicted summaries against a list of reference summaries
  rouge = evaluate.load('rouge')
  results = rouge.compute(predictions=predictions,
                        references=references)
  print(results)
  return results

In [None]:
"""
LexRank implementation
Source: https://github.com/crabcamp/lexrank/tree/dev
"""

import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.special import softmax
import logging

logger = logging.getLogger(__name__)

def degree_centrality_scores(
    similarity_matrix,
    threshold=None,
    increase_power=True,
):
    if not (
        threshold is None
        or isinstance(threshold, float)
        and 0 <= threshold < 1
    ):
        raise ValueError(
            '\'threshold\' should be a floating-point number '
            'from the interval [0, 1) or None',
        )

    if threshold is None:
        markov_matrix = create_markov_matrix(similarity_matrix)

    else:
        markov_matrix = create_markov_matrix_discrete(
            similarity_matrix,
            threshold,
        )

    scores = stationary_distribution(
        markov_matrix,
        increase_power=increase_power,
        normalized=False,
    )

    return scores


def _power_method(transition_matrix, increase_power=True, max_iter=10000):
    eigenvector = np.ones(len(transition_matrix))

    if len(eigenvector) == 1:
        return eigenvector

    transition = transition_matrix.transpose()

    for _ in range(max_iter):
        eigenvector_next = np.dot(transition, eigenvector)

        if np.allclose(eigenvector_next, eigenvector):
            return eigenvector_next

        eigenvector = eigenvector_next

        if increase_power:
            transition = np.dot(transition, transition)

    logger.warning("Maximum number of iterations for power method exceeded without convergence!")
    return eigenvector_next


def connected_nodes(matrix):
    _, labels = connected_components(matrix)

    groups = []

    for tag in np.unique(labels):
        group = np.where(labels == tag)[0]
        groups.append(group)

    return groups


def create_markov_matrix(weights_matrix):
    n_1, n_2 = weights_matrix.shape
    if n_1 != n_2:
        raise ValueError('\'weights_matrix\' should be square')

    row_sum = weights_matrix.sum(axis=1, keepdims=True)

    # normalize probability distribution differently if we have negative transition values
    if np.min(weights_matrix) <= 0:
        return softmax(weights_matrix, axis=1)

    return weights_matrix / row_sum


def create_markov_matrix_discrete(weights_matrix, threshold):
    discrete_weights_matrix = np.zeros(weights_matrix.shape)
    ixs = np.where(weights_matrix >= threshold)
    discrete_weights_matrix[ixs] = 1

    return create_markov_matrix(discrete_weights_matrix)


def stationary_distribution(
    transition_matrix,
    increase_power=True,
    normalized=True,
):
    n_1, n_2 = transition_matrix.shape
    if n_1 != n_2:
        raise ValueError('\'transition_matrix\' should be square')

    distribution = np.zeros(n_1)

    grouped_indices = connected_nodes(transition_matrix)

    for group in grouped_indices:
        t_matrix = transition_matrix[np.ix_(group, group)]
        eigenvector = _power_method(t_matrix, increase_power=increase_power)
        distribution[group] = eigenvector

    if normalized:
        distribution /= n_1

    return distribution
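
The power iteration inside `_power_method` can be illustrated on a tiny transition matrix. This is a plain-Python sketch rather than the code above: the 2x2 matrix `P` and the function name `power_iteration` are made up, and the sketch skips the `increase_power` shortcut that squares the transition matrix on every pass. Like `_power_method`, it starts from a vector of ones and does not renormalize, so for a row-stochastic matrix the scores sum to the number of nodes.

```python
def power_iteration(P, tol=1e-10, max_iter=1000):
    """Repeatedly multiply a vector of ones by the transpose of P until it stops changing."""
    n = len(P)
    vec = [1.0] * n
    for _ in range(max_iter):
        # next[j] = sum_i P[i][j] * vec[i], i.e. multiplication by P transposed
        nxt = [sum(P[i][j] * vec[i] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, vec)) < tol:
            return nxt
        vec = nxt
    return vec

# Hypothetical row-stochastic matrix: each row sums to 1
P = [[0.5, 0.5],
     [0.25, 0.75]]

scores = power_iteration(P)
print(scores)  # approaches [2/3, 4/3]; node 1 is twice as central as node 0
```

The fixed point satisfies `scores[0] = 0.5 * scores[0] + 0.25 * scores[1]`, which gives `scores[1] = 2 * scores[0]`.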

In [None]:
# Trigram Blocking for Extractive Summarization

def trigram_blocking(input_text):
  """Filter a list of sentences, keeping a sentence only if it adds an unseen trigram.

  The first sentence is always kept. A later sentence is dropped when every one
  of its trigrams has already been seen (this includes sentences with fewer
  than three tokens, which have no trigrams at all).
  """
  seen_trigrams = set()
  clean_list = []

  #input_text = ['oh my god', 'oh my god' ,'lovely day today', 'This year is good', 'Oh my god' ]

  for idx, sentence in enumerate(input_text):
    # tokenize the text and make it lowercase in one step
    tokens = nltk.word_tokenize(sentence.lower())
    sentence_trigrams = set(ngrams(tokens, 3))

    if idx == 0 or not sentence_trigrams.issubset(seen_trigrams):
      clean_list.append(sentence)
      seen_trigrams.update(sentence_trigrams)

  return clean_list
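
The filtering rule of `trigram_blocking` can be tried on the sample list from its commented-out line. The sketch below is a stand-alone approximation, not the function itself: it splits on whitespace instead of calling `nltk.word_tokenize`, and the helper names `trigrams` and `keep_if_new_trigram` are made up for the example.

```python
def trigrams(sentence):
    """Return the set of lowercase word trigrams in a sentence."""
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def keep_if_new_trigram(sentences):
    """Keep the first sentence, then any sentence with at least one unseen trigram."""
    seen = set()
    kept = []
    for idx, sentence in enumerate(sentences):
        grams = trigrams(sentence)
        # grams <= seen is True when every trigram was already seen
        if idx == 0 or not grams <= seen:
            kept.append(sentence)
            seen |= grams
    return kept

examples = ['oh my god', 'oh my god', 'lovely day today',
            'This year is good', 'Oh my god']
filtered = keep_if_new_trigram(examples)
print(filtered)  # ['oh my god', 'lovely day today', 'This year is good']
```

Both repeats of "oh my god" are dropped, the second one because trigrams are compared in lowercase. A stricter variant would drop a sentence as soon as it shares any trigram with the sentences already kept; that is not what the function above does.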

# 5.0 Create a Dataset for Model Building and Evaluation
 

In [None]:
# Generate a dataset of "x" examples for model evaluation
size_of_dataset = 1000 # Do Not Change the Value Here
small_dataset = dataset[0:size_of_dataset]

# 6.0 Model: Use Pre-Trained Sentence Transformer for Extractive Summarization

    (NO FINE TUNING)

In [None]:
# Install and import Sentence Transformer
!pip install -U sentence-transformers 
import nltk
from sentence_transformers import SentenceTransformer, util
import numpy as np
encoder_model = SentenceTransformer('all-MiniLM-L6-v2')
#model
#encoder_model = SentenceTransformer('johngiorgi/declutr-base')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
[K     |████████████████████████████████| 85 kB 3.1 MB/s 
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[K     |████████████████████████████████| 1.3 MB 29.6 MB/s 
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... [?25l[?25hdone
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125938 sha256=3ed2e1836b62f66217eb258f911a09609e4d9186d6e94f8859907989917005dc
  Stored in directory: /root/.cache/pip/wheels/bf/06/fb/d59c1e5bd1dac7f6cf61ec0036cc3a10ab8fecaa6b2c3d3ee9
Successfully built sentence-transformers
Installing collected packages: sentencepiece, sentence-transformers
Successfully installed sentenc


In [None]:
from scipy.spatial.distance import cosine
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans


# 8.0 Evaluate Rouge Across Eval Data Set

In [None]:
# RougeL Scorer
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
rouge = evaluate.load('rouge')


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [None]:
# Iterate through the dataset, extract the summary and compute RougeL, Cosine Similarity Scores

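n_clusters = 5 # KMeans hyperparameter: number of clusters per article
n_nearest_neighbors = 10 # initial value only; recomputed per article from compression_ratio
top_n_clusters = n_clusters # number of nearest clusters to search (here: all of them)
compression_ratio = 0.3 # fraction of the article's sentences to extract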

start = time.time()

# get article from dataset
input_article = small_dataset['document'][0:size_of_dataset]
#print("input document")
#print("***************")
#print(input_article)
#print(len(input_article))
#print(type(input_article))
#print("")


# get summary from dataset
input_highlights = small_dataset['summary'][0:size_of_dataset]
#print("input highlights")
#print("*****************")
#print(input_highlights)
#print(len(input_highlights))
#print(type(input_highlights))
#print("")


# zip article and summary
zipped_input = zip(input_article, input_highlights)

# Empty List to Store Scores
rougeL_precision = []
rougeL_recall = []
rougeL_fmeasure = []

rouge_1_list = []
rouge_2_list = []
rouge_L_list = []

cosine_similarity_results = []

# Counter for Tracking Results
count = 1

# Extracted Highlights
output_highlights_list = []

for input_article, input_highlights in zipped_input:

  print('Example:', count)
  
# Tokenize the string texts  
  source_article_list = nltk.sent_tokenize(input_article)
  source_highlight_list = nltk.sent_tokenize(input_highlights)
  
# Join the list into a string
  formatted_source_article = " ".join(source_article_list)
  formatted_source_highlight = " ".join(source_highlight_list)
  
# get embeddings from model
  news_embeddings = encoder_model.encode(source_article_list)
  #print("embeddings")
  #print('**********')
  #print((news_embeddings.shape))
  #print(news_embeddings[54])
  #print

  output_list=[]

# Kmeans Clustering of Source Document

  # Cap the cluster count per article without overwriting n_clusters,
  # so one short article does not reduce it for all later articles
  k = min(n_clusters, len(source_article_list))

  cluster_model = KMeans(n_clusters=k, random_state=0) # Hyper Param
  news_clusters = cluster_model.fit_predict(news_embeddings)
  #print("news clusters")
  #print("*************")
  #print(news_clusters.shape)
  #print("")
  #print(news_clusters[0:55])

# Bucket each sentence to its cluster index
  cluster_news_ids = {i: [] for i in range(k)} # one bucket per cluster
  #print(cluster_news_ids) # empty dict for each of the clusters

  for i, c in enumerate(news_clusters):
    cluster_news_ids[c].append(i)

  # let us see how many sentences are in each cluster
  #for i in range (0, 5):
    #print("cluster id:", i, "cluster count:", len(cluster_news_ids[i]))

# Embed the query
  query = formatted_source_highlight
  query_embedding = encoder_model.encode([query]) # MUST BE A LIST

# Compute the distance from the query embedding to each cluster centroid
  query_cluster_dists = [cosine(query_embedding[0], cluster_model.cluster_centers_[c])
                       for c in range(k)] # one distance per cluster

# inspect query_cluster_dists
  #print(len(query_cluster_dists)) # distance of query to each cluster centroids
  #print(query_cluster_dists) # distance from each cluster

# Get the top k nearest clusters and retrieve their document ids
# (You can try different numbers of top clusters, to see the trade-off between
# speed and recall of all the best articles we found above.)

  top_clusters = np.argsort(query_cluster_dists)[:top_n_clusters] # indices of the top_n_clusters nearest clusters
  #print(top_clusters) # 48 candidates in first cluster and 19 in the second one

  candidate_news_ids = [i for c in top_clusters for i in cluster_news_ids[c]]
  #print(len(candidate_news_ids))
  #print(query_cluster_dists[2])
  #print(np.amin(query_cluster_dists))
  #print(top_clusters)

# Now use Nearest Neighbors only on the top cluster candidates
  candidate_news_embeds = [news_embeddings[i] for i in candidate_news_ids]
  #print(len(candidate_news_embeds))
  #print(len(candidate_news_embeds[0]))

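  n_nearest_neighbors = int(compression_ratio * len(source_article_list))
  if n_nearest_neighbors == 0:
    n_nearest_neighbors = 1
  # kneighbors cannot return more neighbors than there are candidate sentences
  n_nearest_neighbors = min(n_nearest_neighbors, len(candidate_news_ids))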

  knn_model = NearestNeighbors(n_neighbors=n_nearest_neighbors)
  knn_model.fit(candidate_news_embeds)

  dists, topk_idx = knn_model.kneighbors(query_embedding)
  for d, i in zip(dists[0], topk_idx[0]):
    orig_i = candidate_news_ids[i]
    #print(d, source_article_list[orig_i]) #source_article_list = nltk.sent_tokenize(test_doc)
    output_list.append(source_article_list[orig_i])
  
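  result = output_list 

# result is a list of extracted sentences; join them into one string.
# NOTE: joining with "" leaves no space between sentences, so sent_tokenize below
# may not split them apart again; " ".join(result) would keep the boundaries.
  summary = "".join(result)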

# Take the string from summary and convert to list of strings for each sentence
  extracted_sentence_list = nltk.sent_tokenize(summary)

# Apply Trigram Blocking
  result_list = trigram_blocking(extracted_sentence_list)
  summary = " ".join(result_list)
  output_highlights_list.append(summary)

  predictions = " ".join(result_list)
  references = formatted_source_highlight

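  # NOTE: RougeScorer.score expects (target, prediction). Called in this order,
  # the extracted summary is treated as the target, so the precision and recall
  # read from the result are swapped relative to the usual definition.
  rougeL_scores = scorer.score(predictions,
                      references)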
  
  pred = [predictions]
  ref = [references]

  rouge_results = rouge.compute(predictions=pred, references=ref)
  print(rouge_results)

  rouge_1_score = (rouge_results['rouge1'])
  rouge_2_score = (rouge_results['rouge2'])
  rouge_L_score = (rouge_results['rougeL'])

  rouge_1_list.append(rouge_1_score)
  rouge_2_list.append(rouge_2_score)
  rouge_L_list.append(rouge_L_score)

  precision = (rougeL_scores['rougeL'].precision)
  recall = (rougeL_scores['rougeL'].recall)
  fmeasure = (rougeL_scores['rougeL'].fmeasure)

  rougeL_precision.append(precision)
  rougeL_recall.append(recall)
  rougeL_fmeasure.append(fmeasure)

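  # calculate cosine similarity
  # NOTE: predictions and references are strings here, so " ".join(...) puts a
  # space between every character and spaCy sees single-character tokens;
  # nlp(predictions) and nlp(references) would compare the actual words.
  doc1 = nlp((" ".join(predictions)))
  doc2 = nlp((" ".join(references)))
  cosine_similarity = doc1.similarity(doc2)
  cosine_similarity_results.append(cosine_similarity)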

  count = count + 1

Example: 1
{'rouge1': 0.28461538461538455, 'rouge2': 0.23255813953488372, 'rougeL': 0.28461538461538455, 'rougeLsum': 0.28461538461538455}
Example: 2
{'rouge1': 0.2054380664652568, 'rouge2': 0.07902735562310029, 'rougeL': 0.13293051359516614, 'rougeLsum': 0.17522658610271902}
Example: 3
{'rouge1': 0.16822429906542055, 'rouge2': 0.09404388714733541, 'rougeL': 0.11838006230529594, 'rougeLsum': 0.13084112149532712}
Example: 4
{'rouge1': 0.19607843137254904, 'rouge2': 0.03973509933774834, 'rougeL': 0.1176470588235294, 'rougeLsum': 0.1176470588235294}
Example: 5
{'rouge1': 0.18360655737704917, 'rouge2': 0.052805280528052806, 'rougeL': 0.09836065573770492, 'rougeLsum': 0.09836065573770492}
Example: 6
{'rouge1': 0.22, 'rouge2': 0.09395973154362415, 'rougeL': 0.14666666666666667, 'rougeLsum': 0.14666666666666667}
Example: 7
{'rouge1': 0.1691542288557214, 'rouge2': 0.075, 'rougeL': 0.1044776119402985, 'rougeLsum': 0.1044776119402985}
Example: 8
{'rouge1': 0.44871794871794873, 'rouge2': 0.285714

In [None]:
# Print Compute Time
print('\nTime:', time.time() - start)


Time: 1082.089067697525


In [None]:
# Clean up the extracted list
output_highlights_list = list(map(preprocess, output_highlights_list))

In [None]:
# Export extractive summary to a CSV

# get article from dataset
input_article_list = small_dataset['document'][0:size_of_dataset]

# get summary from dataset
input_highlights_list = small_dataset['summary'][0:size_of_dataset]

df = pd.DataFrame(list(zip(input_article_list, input_highlights_list, output_highlights_list)),
                  columns = ['orig_article', 'orig_summary', 'extracted_summary'])


# Edit this filepath to wherever you saved the data in your Drive
filepath = 'drive/My Drive/Colab_Notebooks_1/model_3a_extracted_CNNDaily1000.csv'

df.to_csv(filepath,index = False)

In [None]:
# read back the csv file
data_import = pd.read_csv(filepath)        
#data_import.rename(columns = {'0':'orig_article', '1':'orig_summary', '2':'extracted_summary'}, inplace = True)

col1 = data_import.orig_article.values.tolist()
col2 = data_import.orig_summary.values.tolist()
col3 = data_import.extracted_summary.values.tolist()

In [None]:
# Validate data_import
print(col1[0])
print("")

print(col2[0])
print("")

print(col3[0])


LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details of how

In [None]:
# Calculate Mean Rouge for Dataset

print("RougeL Precision Scores")
print(rougeL_precision)
print(len(rougeL_precision))
print(np.mean(np.asarray(rougeL_precision)))
print("")

print("RougeL Recall Scores")
print(rougeL_recall)
print(len(rougeL_recall))
print(np.mean(np.asarray(rougeL_recall)))
print("")

print("RougeL Fmeasure Scores")
print(rougeL_fmeasure)
print(len(rougeL_fmeasure))
print(np.mean(np.asarray(rougeL_fmeasure)))
print("")

print("Rouge 1 Scores")
print(rouge_1_list)
print(len(rouge_1_list))
print(np.mean(np.asarray(rouge_1_list)))
print("")

print("Rouge 2 Scores")
print(rouge_2_list)
print(len(rouge_2_list))
print(np.mean(np.asarray(rouge_2_list)))
print("")

print("Rouge L Scores")
print(rouge_L_list)
print(len(rouge_L_list))
print(np.mean(np.asarray(rouge_L_list)))
print("")



RougeL Precision Scores
[0.9487179487179487, 0.4897959183673469, 0.4634146341463415, 0.375, 0.36585365853658536, 0.5116279069767442, 0.44680851063829785, 0.5714285714285714, 0.42424242424242425, 0.4642857142857143, 0.6216216216216216, 0.4444444444444444, 0.6666666666666666, 0.25806451612903225, 0.3695652173913043, 0.30612244897959184, 0.46153846153846156, 0.4523809523809524, 0.4666666666666667, 0.6285714285714286, 0.46511627906976744, 0.48148148148148145, 0.3695652173913043, 0.2702702702702703, 0.4727272727272727, 0.46875, 0.5277777777777778, 0.3, 0.6857142857142857, 0.5, 0.6666666666666666, 0.5531914893617021, 0.6129032258064516, 0.32432432432432434, 0.6981132075471698, 0.7567567567567568, 0.4444444444444444, 0.47058823529411764, 0.6976744186046512, 0.40425531914893614, 0.4318181818181818, 0.4838709677419355, 0.5882352941176471, 0.7346938775510204, 0.5405405405405406, 0.6976744186046512, 0.4166666666666667, 0.3170731707317073, 0.4523809523809524, 0.4090909090909091, 0.5476190476190477

In [None]:
# Calculate Mean Cosine Similarity

print("Cosine Similarity")
print(np.mean(np.asarray(cosine_similarity_results)))
print("")



Cosine Similarity
0.9939102274002851
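
One detail is worth checking before reading much into this number. In the evaluation loop, `predictions` and `references` are already plain strings when they are passed through `" ".join(...)`, and `str.join` over a string iterates its characters. The snippet below shows the effect on a made-up sentence. That this explains the near-1.0 mean is an inference from the code, not something re-run here.

```python
predictions = "The cat sat."

# Joining a string (not a list of strings) inserts a space between every character
spaced = " ".join(predictions)
print(spaced)

tokens_spaced = spaced.split()      # what a tokenizer would roughly see
tokens_plain = predictions.split()  # the words of the original sentence

print(tokens_spaced)  # ['T', 'h', 'e', 'c', 'a', 't', 's', 'a', 't', '.']
print(tokens_plain)   # ['The', 'cat', 'sat.']
```

Two texts reduced to single characters share most of their "tokens" whatever they say, so a document-level similarity computed on them would be expected to sit close to 1.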



In [None]:
dataset = load_dataset('csv', data_files = filepath, split='train' )



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-2435f418515587ba/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-2435f418515587ba/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.


In [None]:
dataset

Dataset({
    features: ['orig_article', 'orig_summary', 'extracted_summary'],
    num_rows: 1000
})

In [None]:
dataset['extracted_summary'][0]

'LONDON, England (Reuters)  Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him.Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties.Radcliffe\'s earnings from the first five Potter films have been held in a trust fund which he has not been able to touch.At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart.His latest outing as the boy wizard in "Harry Potter and the Order of the Phoenix" is breaking records on both sides of the Atlantic and he will reprise the role in the last two films.There is life beyond Potter, however. "I don\'t plan t