# Final_Model_5a_TD_IDF_MediaSum_Extractive_Summarization
Oct 31, 2022 (1:08 PM)




## This notebook builds an extractive summarizer based on the TF-IDF methodology.

### Methodology

Input document -> Find the most important words in the document -> Score each sentence on the basis of those words -> Choose the most important sentences on the basis of their scores.

### What is the TF-IDF Approach?

TF-IDF, short for term frequency–inverse document frequency, is a numeric measure used to score the importance of a word in a document based on how often it appears in that document and across a given collection of documents. The intuition behind this measure is: if a word appears frequently in a document, it is probably important, so it should receive a high score. But if a word also appears in many other documents, it is probably not a distinguishing term, so it should receive a lower score.

Formulas for calculating TF and IDF:

    TF(w) = (Number of times term w appears in a document) / (Total number of terms in the document)
    IDF(w) = log_e(Total number of documents / Number of documents with term w in it)

Hence tfidf for a word can be calculated as:

    TFIDF(w) = TF(w) * IDF(w)
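
The formulas above can be sketched on a toy corpus as follows. This is a minimal illustration, not the notebook's implementation: the function names `tf`, `idf`, and `tfidf` and the three toy documents are assumptions made up for the example, and the natural log is used as in the formula.

```python
import math

def tf(term, doc_tokens):
    # Share of the document's tokens that are `term`
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # Natural log of (number of documents / number of documents containing `term`).
    # Assumes `term` occurs in at least one document of the corpus.
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Toy corpus (illustrative only): each document is a list of lowercase tokens
toy_corpus = [
    ["the", "bible", "experience", "features", "actors"],
    ["the", "actors", "read", "the", "gospels"],
    ["the", "show", "airs", "tonight"],
]

for term in ("the", "actors", "bible"):
    print(term, round(tfidf(term, toy_corpus[0], toy_corpus), 4))
```

In this toy corpus, "the" occurs in every document, so its IDF is log(1) = 0 and its TF-IDF score is 0. "bible" occurs in only one document and therefore scores higher than "actors", which occurs in two.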

### Dataset Summary

This large-scale media interview dataset contains 463.6K transcripts with abstractive summaries, collected from interview transcripts and overview / topic descriptions from NPR and CNN.

Data Fields

    id: paper id
    document: a string containing the interview transcript, with speaker turns separated by </s>
    summary: a string containing the abstractive summary of the transcript



# 1.0 Setup

#### This section installs key libraries

In [None]:
!pip install datasets --quiet
!pip install nltk --quiet

In [None]:
!pip install -q rouge_score

In [None]:
!pip install -q evaluate


In [None]:
!pip install -U spacy
!pip install -U spacy-lookups-data

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# rouge-score and rouge_score are the same PyPI package, so a single install suffices
!pip install rouge_score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!python -m spacy download en_core_web_lg

2022-11-01 02:32:34.872003: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.1/en_core_web_lg-3.4.1-py3-none-any.whl (587.7 MB)
     |█████████████████▎              | 318.2 MB 1.1 MB/s eta 0:04:06
ERROR: Exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/cli/base_command.py", line 180, in _main
    status = self.run(options, args)
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/cli/req_command.py", line 199, in wrapper
    return func(self, options, args)
  File "/usr/local/lib/python3.7/dist-packages/pip/_internal/commands/install.py", line 319, in run
    reqs, check_supported_wheels=not options.target_

# 2.0 Import Libraries

In [None]:
# NLTK
import re # regular expressions
import nltk # natural language toolkit for sentence tokenization and display
import string
import heapq
nltk.download('punkt')
nltk.download('stopwords')
from nltk import word_tokenize
from nltk.util import ngrams
import evaluate

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
import os
import logging

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"


In [None]:
import spacy
import pandas as pd

print(spacy.__version__)
print(pd.__version__)

3.4.2
1.3.5


In [None]:
import pickle
import subprocess
import sys
import nltk
from nltk import Nonterminal, nonterminals, Production, CFG, PCFG

In [None]:
#shift reduce parser example
from nltk.grammar import Nonterminal
from nltk.parse.api import ParserI
from nltk.tree import Tree

In [None]:
nlp = spacy.load("en_core_web_lg")

In [None]:
from rouge_score import rouge_scorer

In [None]:
import math

from nltk import sent_tokenize, word_tokenize, PorterStemmer
from nltk.corpus import stopwords    

In [None]:
import time

In [None]:
# Mount drive for saving model checkpoints, loading Task 2 data below

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# 3.0 Load the dataset



In [None]:
from datasets import load_dataset  # load_metric is unused here; metrics come from the evaluate library


In [None]:
dataset_id = "ccdv/mediasum"
dataset = load_dataset(dataset_id, split="train")



In [None]:
# inspect data structure
print(dataset)

Dataset({
    features: ['document', 'summary'],
    num_rows: 443596
})


In [None]:
# inspect shape
print(dataset.shape)



(443596, 2)


# 4.0 Inspect MediaSum Data

In [None]:
# inspect first example
dataset[0]

{'document': 'FARAI CHIDEYA, host: Now, moving on, Forest Whitaker as Moses, Tisha Campbell Martin as Mary Magdalene - well, that\'s all in "The Bible Experience." A New Testament edition was released in 2006. This edition is billed as "The Complete Bible." It doesn\'t have one person reading the gospels. It features nearly 400 African-American artists, actors and ministers, plus sound effects.</s>FARAI CHIDEYA, host: Just listen to Blair Underwood\'s rendition of Jesus on the cross.</s>Mr. BLAIR UNDERWOOD (Actor): (As Jesus) My God, my God, why have you forsaken me?</s>FARAI CHIDEYA, host: Now, we\'ve got two people affiliated with the project with us today. Kyle Bowser, he co-produced "The Bible Experience" and actress Wendy Raquel Robinson, one of the actors in "The Bible Experience," and she also stars in the CW series, "The Game."</s>FARAI CHIDEYA, host: Hi folks, how are you doing?</s>Ms. WENDY RAQUEL ROBINSON (Actress): Great.</s>Mr. KYLE BOWSER (Co-producer, "The Bible Experien

In [None]:
print(f"- The {dataset_id} dataset has {dataset.num_rows} examples.")
print(f"- Each example is a {type(dataset[0])} with a {type(dataset[0]['document'])} as value.")
print(f"- Examples look like this: {dataset[0]}")

- The ccdv/mediasum dataset has 443596 examples.
- Each example is a <class 'dict'> with a <class 'str'> as value.
- Examples look like this: {'document': 'FARAI CHIDEYA, host: Now, moving on, Forest Whitaker as Moses, Tisha Campbell Martin as Mary Magdalene - well, that\'s all in "The Bible Experience." A New Testament edition was released in 2006. This edition is billed as "The Complete Bible." It doesn\'t have one person reading the gospels. It features nearly 400 African-American artists, actors and ministers, plus sound effects.</s>FARAI CHIDEYA, host: Just listen to Blair Underwood\'s rendition of Jesus on the cross.</s>Mr. BLAIR UNDERWOOD (Actor): (As Jesus) My God, my God, why have you forsaken me?</s>FARAI CHIDEYA, host: Now, we\'ve got two people affiliated with the project with us today. Kyle Bowser, he co-produced "The Bible Experience" and actress Wendy Raquel Robinson, one of the actors in "The Bible Experience," and she also stars in the CW series, "The Game."</s>FARAI C
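
The example above shows that speaker turns in `document` are separated by `</s>`, with each turn starting with a speaker tag. The sketch below shows one way to split such a string into (speaker, utterance) pairs. It is an illustration only: the notebook itself simply strips the `</s>` markers, `split_turns` is a hypothetical helper, and treating the first `": "` as the end of the speaker tag is an assumption.

```python
def split_turns(document):
    """Split a MediaSum-style transcript into (speaker, utterance) pairs."""
    turns = []
    for raw_turn in document.split("</s>"):
        raw_turn = raw_turn.strip()
        if not raw_turn:
            continue
        # Assumption: the first ": " marks the end of the speaker tag
        speaker, sep, utterance = raw_turn.partition(": ")
        if not sep:
            # No speaker tag found; keep the whole turn as the utterance
            speaker, utterance = "", raw_turn
        turns.append((speaker, utterance))
    return turns

# Fragment taken from the first example in the dataset
sample = (
    "FARAI CHIDEYA, host: Hi folks, how are you doing?</s>"
    "Ms. WENDY RAQUEL ROBINSON (Actress): Great.</s>"
)

for speaker, utterance in split_turns(sample):
    print(f"{speaker!r} -> {utterance!r}")
```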

# 5.0 Create a Dataset for Model Building and Evaluation
 

In [None]:
# Generate a dataset of 1000 examples for model building

# DO NOT CHANGE THE CODE HERE IN THIS CELL

size_of_dataset = 1000 # DO NOT CHANGE THIS VALUE
raw_dataset = dataset[0:size_of_dataset]
document_list =  raw_dataset['document']
summary_list =  raw_dataset['summary']

In [None]:
# Pre-process text

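def preprocess_text(sentence):
    """Normalize a raw transcript or summary string before sentence tokenization."""
    sentence = str(sentence)
    sentence = sentence.replace('\n', ' ')      # flatten line breaks
    sentence = sentence.replace('?</s>', '?.')  # keep a sentence boundary after a question that ends a turn
    sentence = sentence.replace('</s>', '')     # drop the remaining speaker-turn separators
    sentence = sentence.replace("\'", "'")      # no-op as written: both literals are a plain apostrophe
    sentence = sentence.replace('--', '')
    sentence = sentence.replace('|', '')
    sentence = sentence.replace('/', '')
    sentence = sentence.replace('Dr.', 'Dr')    # avoid a false sentence break after "Dr."
    sentence = sentence.replace('?.', '?. ')
    sentence = sentence.replace('.', '. ')      # pad every period with a space (also affects "Mr." and decimals)
    sentence = sentence.replace('!', '!. ')

    return sentence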

In [None]:
# Store in a list
clean_document_list = list(((map(preprocess_text, document_list))))
clean_summary_list = list(((map(preprocess_text, summary_list))))

In [None]:
# Inspect cleaned text
clean_document_list[0]

'FARAI CHIDEYA, host: Now, moving on, Forest Whitaker as Moses, Tisha Campbell Martin as Mary Magdalene - well, that\'s all in "The Bible Experience. " A New Testament edition was released in 2006.  This edition is billed as "The Complete Bible. " It doesn\'t have one person reading the gospels.  It features nearly 400 African-American artists, actors and ministers, plus sound effects. FARAI CHIDEYA, host: Just listen to Blair Underwood\'s rendition of Jesus on the cross. Mr.  BLAIR UNDERWOOD (Actor): (As Jesus) My God, my God, why have you forsaken me?.  FARAI CHIDEYA, host: Now, we\'ve got two people affiliated with the project with us today.  Kyle Bowser, he co-produced "The Bible Experience" and actress Wendy Raquel Robinson, one of the actors in "The Bible Experience," and she also stars in the CW series, "The Game. "FARAI CHIDEYA, host: Hi folks, how are you doing?.  Ms.  WENDY RAQUEL ROBINSON (Actress): Great. Mr.  KYLE BOWSER (Co-producer, "The Bible Experience: The Complete Bi

In [None]:
# Create a pandas dataframe to hold cleansed data
df = pd.DataFrame(list(zip(clean_document_list, clean_summary_list)),
               columns =['document', 'summary'])

In [None]:
# write to a csv file
df.to_csv("cleaned_mediasum1000", index = False)

In [None]:
# load the cleaned mediasum1000 CSV into a Hugging Face Dataset
dataset = load_dataset('csv', data_files = 'cleaned_mediasum1000', split='train' )



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-4c2d137a1379d8b8/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-4c2d137a1379d8b8/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.


In [None]:
# Inspect the data dictionary
dataset

Dataset({
    features: ['document', 'summary'],
    num_rows: 1000
})

# 6.0 Create a Test Dataset for Model Evaluation

In [None]:
# Generate a dataset of "x" examples for model evaluation

size_of_dataset = 1000 # change the value as you develop the model
small_dataset = dataset[0:size_of_dataset]
#print(small_dataset)
#print(len(small_dataset))


# 7.0 Build Model using TF-IDF

In [None]:
# TF-IDF Functions

import math

from nltk import sent_tokenize, word_tokenize, PorterStemmer
from nltk.corpus import stopwords

def _create_frequency_table(text_string) -> dict:
    """
    Build a word-frequency dictionary for a whole text.
    Each token is stemmed (reduced to its root form) and skipped if the stem is in the stop-word set.
    Not called by run_summarization(); kept as a standalone helper.
    :rtype: dict
    """
    stopWords = set(stopwords.words("english"))
    words = word_tokenize(text_string)
    ps = PorterStemmer()

    freqTable = dict()
    for word in words:
        word = ps.stem(word)
        if word in stopWords:
            continue
        if word in freqTable:
            freqTable[word] += 1
        else:
            freqTable[word] = 1

    return freqTable


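def _create_frequency_matrix(sentences):
    """
    Build one word-frequency table per sentence.
    Tokens are lowercased and stemmed before the stop-word check, so a stop word
    whose stem differs from its surface form is not filtered out.
    """
    frequency_matrix = {}
    stopWords = set(stopwords.words("english"))
    ps = PorterStemmer()

    for sent in sentences:
        freq_table = {}
        words = word_tokenize(sent)
        for word in words:
            word = word.lower()
            word = ps.stem(word)
            if word in stopWords:
                continue

            if word in freq_table:
                freq_table[word] += 1
            else:
                freq_table[word] = 1

        # Sentences are keyed by their first 15 characters, so sentences that share
        # a prefix (for example the same speaker tag) overwrite one another.
        frequency_matrix[sent[:15]] = freq_table

    return frequency_matrix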


def _create_tf_matrix(freq_matrix):
    tf_matrix = {}

    for sent, f_table in freq_matrix.items():
        tf_table = {}

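        # len(f_table) is the number of distinct terms kept for the sentence,
        # not the total token count used in the TF formula
        count_words_in_sentence = len(f_table)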
        for word, count in f_table.items():
            tf_table[word] = count / count_words_in_sentence

        tf_matrix[sent] = tf_table

    return tf_matrix


def _create_documents_per_words(freq_matrix):
    word_per_doc_table = {}

    for sent, f_table in freq_matrix.items():
        for word, count in f_table.items():
            if word in word_per_doc_table:
                word_per_doc_table[word] += 1
            else:
                word_per_doc_table[word] = 1

    return word_per_doc_table


def _create_idf_matrix(freq_matrix, count_doc_per_words, total_documents):
    idf_matrix = {}

    for sent, f_table in freq_matrix.items():
        idf_table = {}

        for word in f_table.keys():
            # Base-10 log; a different base only rescales every score by the same constant
            idf_table[word] = math.log10(total_documents / float(count_doc_per_words[word]))

        idf_matrix[sent] = idf_table

    return idf_matrix


def _create_tf_idf_matrix(tf_matrix, idf_matrix):
    tf_idf_matrix = {}

    for (sent1, f_table1), (sent2, f_table2) in zip(tf_matrix.items(), idf_matrix.items()):

        tf_idf_table = {}

        for (word1, value1), (word2, value2) in zip(f_table1.items(),
                                                    f_table2.items()):  # here, keys are the same in both the table
            tf_idf_table[word1] = float(value1 * value2)

        tf_idf_matrix[sent1] = tf_idf_table

    return tf_idf_matrix


def _score_sentences(tf_idf_matrix) -> dict:
    """
    Score each sentence by the TF-IDF values of its words.
    Basic algorithm: sum the TF-IDF scores of the non-stop words in a sentence and divide by the number of distinct words kept for that sentence.
    :rtype: dict
    """

    sentenceValue = {}

    for sent, f_table in tf_idf_matrix.items():
        total_score_per_sentence = 0

        count_words_in_sentence = len(f_table)
        for word, score in f_table.items():
            total_score_per_sentence += score

        if count_words_in_sentence == 0:
          count_words_in_sentence = 1
        sentenceValue[sent] = total_score_per_sentence / count_words_in_sentence

    return sentenceValue


def _find_average_score(sentenceValue) -> float:
    """
    Find the average score from the sentence value dictionary
    :rtype: float
    """
    # Guard against empty input (no sentences), which would otherwise divide by zero
    if not sentenceValue:
        return 0.0

    sumValues = 0
    for entry in sentenceValue:
        sumValues += sentenceValue[entry]

    # Average sentence score across the original text
    average = (sumValues / len(sentenceValue))

    return average


def _generate_summary(sentences, sentenceValue, threshold):
    sentence_count = 0
    summary = ''

    for sentence in sentences:
        if sentence[:15] in sentenceValue and sentenceValue[sentence[:15]] >= (threshold):
            summary += " " + sentence
            sentence_count += 1

    return summary


def run_summarization(text):
    """
    :param text: plain text of a long article
    :return: extractive summary as a single string
    """

    '''
    NLTK already provides a sentence tokenizer, so we just need
    to call sent_tokenize() to create the list of sentences.
    '''
    # 1 Sentence Tokenize
    sentences = sent_tokenize(text)
    total_documents = len(sentences)
    #print(sentences)

    # 2 Create the Frequency matrix of the words in each sentence.
    freq_matrix = _create_frequency_matrix(sentences)
    #print(freq_matrix)

    '''
    Term frequency (TF) is how often a word appears in a document, divided by the total number of words in that document.
    Here each sentence is treated as a document.
    '''
    # 3 Calculate TermFrequency and generate a matrix
    tf_matrix = _create_tf_matrix(freq_matrix)
    #print(tf_matrix)

    # 4 creating table for documents per words
    count_doc_per_words = _create_documents_per_words(freq_matrix)
    #print(count_doc_per_words)

    '''
    Inverse document frequency (IDF) is how unique or rare a word is.
    '''
    # 5 Calculate IDF and generate a matrix
    idf_matrix = _create_idf_matrix(freq_matrix, count_doc_per_words, total_documents)
    #print(idf_matrix)

    # 6 Calculate TF-IDF and generate a matrix
    tf_idf_matrix = _create_tf_idf_matrix(tf_matrix, idf_matrix)
    #print(tf_idf_matrix)

    # 7 Important Algorithm: score the sentences
    sentence_scores = _score_sentences(tf_idf_matrix)
    #print(sentence_scores)

    # 8 Find the threshold
    threshold = _find_average_score(sentence_scores)
    #print(threshold)

    # 9 Important Algorithm: Generate the summary
    summary = _generate_summary(sentences, sentence_scores, 1.3 * threshold) # orig 1.3
    return summary
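
Because `_create_frequency_matrix` keys each sentence by `sent[:15]`, sentences that begin with the same 15 characters (such as a repeated speaker tag) share a single entry. The sketch below shows one possible variant that keys sentences by position instead. It is an assumption-laden illustration, not the model evaluated in this notebook: `tokenize`, `score_sentences`, `summarize`, and the tiny `STOP_WORDS` set are hypothetical, tokenization uses a regular expression rather than NLTK, no stemming is applied, and TF divides by the total number of kept tokens as in the formula at the top of the notebook.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); the notebook uses NLTK's English list
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "on"}

def tokenize(sentence):
    # Lowercase alphabetic tokens with stop words removed
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]

def score_sentences(sentences):
    """Return one mean TF-IDF score per sentence, in sentence order."""
    tables = [Counter(tokenize(s)) for s in sentences]
    n = len(tables)
    # Number of sentences that contain each term
    doc_freq = Counter(term for table in tables for term in table)
    scores = []
    for table in tables:
        total_terms = sum(table.values())
        if total_terms == 0:
            scores.append(0.0)
            continue
        tfidf_sum = sum(
            (count / total_terms) * math.log10(n / doc_freq[term])
            for term, count in table.items()
        )
        # Average over the distinct terms kept for the sentence
        scores.append(tfidf_sum / len(table))
    return scores

def summarize(sentences, factor=1.3):
    """Keep sentences whose score is at least `factor` times the mean score."""
    scores = score_sentences(sentences)
    if not scores:
        return []
    threshold = factor * sum(scores) / len(scores)
    return [s for s, score in zip(sentences, scores) if score >= threshold]

# All three sentences share the same first 15 characters
sentences = [
    "FARAI CHIDEYA, host: The project features many actors.",
    "FARAI CHIDEYA, host: The actors read the gospels.",
    "FARAI CHIDEYA, host: Thanks.",
]
print(score_sentences(sentences))
```

With position-based keys, each of the three sentences receives its own score even though they all start with the same speaker tag.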

# 8.0 Evaluate Rouge Across Eval Dataset

In [None]:
# RougeL Scorer
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
rouge = evaluate.load('rouge')

In [None]:
# Iterate through the dataset, extract the summary and compute RougeL, Cosine Similarity Scores

start = time.time()

# get article from dataset
input_article = small_dataset['document'][0:size_of_dataset]
#len(input_article)

# get summary from dataset
input_highlights = small_dataset['summary'][0:size_of_dataset]
#len(input_highlights)

# zip article and summary
zipped_input = zip(input_article, input_highlights)

# Empty List to Store Scores
rougeL_precision = []
rougeL_recall = []
rougeL_fmeasure = []

rouge_1_list = []
rouge_2_list = []
rouge_L_list = []

cosine_similarity_results = []

# Counter for Tracking Results
count = 1

# Extracted Highlights
output_highlights_list = []

# iterate Through the Eval Dataset
for input_article, input_highlights in zipped_input:

  print('Example:', count)

  # Tokenize the string texts
  source_article_list = nltk.sent_tokenize(input_article)
  source_highlight_list = nltk.sent_tokenize(input_highlights)

  # Join the list into a string
  formatted_source_article = " ".join(source_article_list)
  formatted_source_highlight = " ".join(source_highlight_list)

  # Run TF-IDF model
  result = run_summarization(formatted_source_article)

  # Tokenize the string result into a list of sentences
  result = nltk.sent_tokenize(result)

  # Determine the number of sentences in the final extraction
  num_sentences = int(len(source_article_list) * 0.3) # 30% compression ratio

  # Truncate the TF-IDF results to the desired number of output sentences
  result = result[0:num_sentences]

  # Join into a string.
  summary = " ".join(result)
  output_highlights_list.append(summary)

  # Take the string from summary and convert to list of strings for each sentence
  extracted_sentence_list = nltk.sent_tokenize(summary)

  # Define predictions and references for score calculation
  predictions = " ".join(extracted_sentence_list)
  references = formatted_source_highlight

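  # Calculate Scores
  # Note: RougeScorer.score() expects (target, prediction). With the arguments in this
  # order, the precision and recall recorded below are swapped relative to the usual
  # reference-vs-prediction convention; the F-measure is unaffected.
  rougeL_scores = scorer.score(predictions,
                      references)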
  
  pred = [predictions]
  ref = [references]

  rouge_results = rouge.compute(predictions=pred, references=ref)
  print(rouge_results)

  rouge_1_score = (rouge_results['rouge1'])
  rouge_2_score = (rouge_results['rouge2'])
  rouge_L_score = (rouge_results['rougeL'])

  rouge_1_list.append(rouge_1_score)
  rouge_2_list.append(rouge_2_score)
  rouge_L_list.append(rouge_L_score)

  #print(rouge_1_score, rouge_2_score, rouge_L_score)

  precision = (rougeL_scores['rougeL'].precision)
  recall = (rougeL_scores['rougeL'].recall)
  fmeasure = (rougeL_scores['rougeL'].fmeasure)

  rougeL_precision.append(precision)
  rougeL_recall.append(recall)
  rougeL_fmeasure.append(fmeasure)

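  # calculate cosine similarity
  # Note: predictions and references are already strings, so " ".join(...) inserts a space
  # between every character before the text is passed to spaCy
  doc1 = nlp((" ".join(predictions)))
  doc2 = nlp((" ".join(references)))
  cosine_similarity = doc1.similarity(doc2)
  cosine_similarity_results.append(cosine_similarity)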

  count = count + 1

Example: 1
{'rouge1': 0.1415929203539823, 'rouge2': 0.036036036036036036, 'rougeL': 0.10619469026548672, 'rougeLsum': 0.10619469026548672}
Example: 2
{'rouge1': 0.1183431952662722, 'rouge2': 0.0, 'rougeL': 0.0710059171597633, 'rougeLsum': 0.0710059171597633}
Example: 3
{'rouge1': 0.1643835616438356, 'rouge2': 0.0, 'rougeL': 0.0821917808219178, 'rougeLsum': 0.0821917808219178}
Example: 4
{'rouge1': 0.21739130434782608, 'rouge2': 0.022222222222222223, 'rougeL': 0.08695652173913043, 'rougeLsum': 0.08695652173913043}
Example: 5
{'rouge1': 0.10416666666666666, 'rouge2': 0.0, 'rougeL': 0.06250000000000001, 'rougeLsum': 0.06250000000000001}
Example: 6
{'rouge1': 0.10869565217391305, 'rouge2': 0.010989010989010988, 'rougeL': 0.06521739130434782, 'rougeLsum': 0.06521739130434782}
Example: 7
{'rouge1': 0.1587301587301587, 'rouge2': 0.03278688524590164, 'rougeL': 0.12698412698412698, 'rougeLsum': 0.12698412698412698}
Example: 8
{'rouge1': 0.11111111111111112, 'rouge2': 0.0, 'rougeL': 0.0740740740
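
To make the ROUGE-L numbers above easier to interpret, the sketch below computes ROUGE-L precision, recall, and F-measure by hand from the longest common subsequence (LCS) of two token lists. It is a simplified illustration under stated assumptions: tokenization is a plain lowercase whitespace split with no stemming, so its values will not match `rouge_score` exactly, and `lcs_length` and `rouge_l` are hypothetical helper names. The two example strings are made up for the illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, prediction):
    """Return (precision, recall, f1) for lowercased, whitespace-tokenized text."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    lcs = lcs_length(ref_tokens, pred_tokens) if ref_tokens and pred_tokens else 0
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(pred_tokens)  # share of the prediction covered by the LCS
    recall = lcs / len(ref_tokens)      # share of the reference covered by the LCS
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "the bible experience features nearly 400 artists"
prediction = "it features nearly 400 african-american artists actors and ministers"

p, r, f = rouge_l(reference, prediction)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Swapping the two arguments swaps precision and recall but leaves the F-measure unchanged, which is why argument order matters when reporting precision and recall separately.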

In [None]:
# Print Compute Time
print('\nTime:', time.time() - start)


Time: 292.54293298721313


In [None]:
# Export extractive summary to a CSV

# get article from dataset
input_article_list = small_dataset['document'][0:size_of_dataset]

# get summary from dataset
input_highlights_list = small_dataset['summary'][0:size_of_dataset]

df = pd.DataFrame(list(zip(input_article_list, input_highlights_list, output_highlights_list)),
                  columns = ['orig_article', 'orig_summary', 'extracted_summary'])

# Edit this filepath to wherever you saved the data in your Drive
filepath = 'drive/My Drive/Colab_Notebooks_1/model_5a_extracted_mediasum1000.csv'

df.to_csv(filepath,index = False)

In [None]:
# read back the csv file
data_import = pd.read_csv(filepath)        
#data_import.rename(columns = {'0':'orig_article', '1':'orig_summary', '2':'extracted_summary'}, inplace = True)

col1 = data_import.orig_article.values.tolist()
col2 = data_import.orig_summary.values.tolist()
col3 = data_import.extracted_summary.values.tolist()

In [None]:
# Validate data_import
print(col1[0])
print("")

print(col2[0])
print("")

print(col3[0])

FARAI CHIDEYA, host: Now, moving on, Forest Whitaker as Moses, Tisha Campbell Martin as Mary Magdalene - well, that's all in "The Bible Experience. " A New Testament edition was released in 2006.  This edition is billed as "The Complete Bible. " It doesn't have one person reading the gospels.  It features nearly 400 African-American artists, actors and ministers, plus sound effects. FARAI CHIDEYA, host: Just listen to Blair Underwood's rendition of Jesus on the cross. Mr.  BLAIR UNDERWOOD (Actor): (As Jesus) My God, my God, why have you forsaken me?.  FARAI CHIDEYA, host: Now, we've got two people affiliated with the project with us today.  Kyle Bowser, he co-produced "The Bible Experience" and actress Wendy Raquel Robinson, one of the actors in "The Bible Experience," and she also stars in the CW series, "The Game. "FARAI CHIDEYA, host: Hi folks, how are you doing?.  Ms.  WENDY RAQUEL ROBINSON (Actress): Great. Mr.  KYLE BOWSER (Co-producer, "The Bible Experience: The Complete Bible")

In [None]:
# Calculate Mean Rouge for Dataset

print("RougeL Precision Scores")
print(rougeL_precision)
print(len(rougeL_precision))
print(np.mean(np.asarray(rougeL_precision)))
print("")

print("RougeL Recall Scores")
print(rougeL_recall)
print(len(rougeL_recall))
print(np.mean(np.asarray(rougeL_recall)))
print("")

print("RougeL Fmeasure Scores")
print(rougeL_fmeasure)
print(len(rougeL_fmeasure))
print(np.mean(np.asarray(rougeL_fmeasure)))
print("")

print("Rouge 1 Scores")
print(rouge_1_list)
print(len(rouge_1_list))
print(np.mean(np.asarray(rouge_1_list)))
print("")

print("Rouge 2 Scores")
print(rouge_2_list)
print(len(rouge_2_list))
print(np.mean(np.asarray(rouge_2_list)))
print("")

print("Rouge L Scores")
print(rouge_L_list)
print(len(rouge_L_list))
print(np.mean(np.asarray(rouge_L_list)))
print("")




RougeL Precision Scores
[0.13953488372093023, 0.2727272727272727, 0.11538461538461539, 0.09302325581395349, 0.1, 0.14285714285714285, 0.10526315789473684, 0.13157894736842105, 0.25, 0.14705882352941177, 0.16, 0.05555555555555555, 0.075, 0.09302325581395349, 0.10256410256410256, 0.2608695652173913, 0.10869565217391304, 0.13043478260869565, 0.1724137931034483, 0.06060606060606061, 0.023255813953488372, 0.06060606060606061, 0.14634146341463414, 0.06451612903225806, 0.10909090909090909, 0.1111111111111111, 0.3333333333333333, 0.32142857142857145, 0.17142857142857143, 0.14634146341463414, 0.14285714285714285, 0.2682926829268293, 0.09090909090909091, 0.06521739130434782, 0.24358974358974358, 0.06779661016949153, 0.18072289156626506, 0.17073170731707318, 0.25806451612903225, 0.11764705882352941, 0.25806451612903225, 0.10909090909090909, 0.1702127659574468, 0.08, 0.10204081632653061, 0.26666666666666666, 0.030303030303030304, 0.037037037037037035, 0.06976744186046512, 0.10869565217391304, 0.15

In [None]:
# Calculate Mean Cosine Similarity

print("Cosine Similarity")
print(cosine_similarity_results)
print(len(cosine_similarity_results))
print(np.mean(np.asarray(cosine_similarity_results)))
print("")


Cosine Similarity
[0.9840430305703868, 0.9909455310278662, 0.9889409491279539, 0.984788779465725, 0.9872122777832908, 0.9782290751906546, 0.9808392333333794, 0.9908355286133735, 0.9944758123175022, 0.9816287600634722, 0.9897187318505773, 0.9886171993069411, 0.9863993640926353, 0.9813325355753797, 0.9928760853733353, 0.9763323246316347, 0.988016448252086, 0.9942582117332623, 0.9897788889971425, 0.9878004849516893, 0.9675971132454149, 0.9789580129741826, 0.9920296480986706, 0.9696922846365241, 0.9896898116445304, 0.9907471657536945, 0.99382310449765, 0.9859831089510986, 0.984187535676079, 0.9908353131838448, 0.9909748706387853, 0.9809685821397556, 0.9818965986739537, 0.9848786923882843, 0.9826253148908425, 0.9828233092417001, 0.9952420671022569, 0.9914127202349503, 0.9898060488764937, 0.9618149859941615, 0.9935216593532935, 0.9896864974550268, 0.9914694127404945, 0.978196911620361, 0.9932569220390225, 0.9936073443905414, 0.9859339877784685, 0.9811363869846645, 0.9643259816986294, 0.97961
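
Cosine similarity depends heavily on what the vectors represent. The sketch below is a simplified stand-in, not spaCy's method: spaCy's `Doc.similarity` compares averaged word vectors, whereas this example uses plain token-count vectors, and `cosine_similarity` here is a hypothetical helper. It also shows that calling `" ".join(...)` on a string separates it into individual characters, which changes what is being compared.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between token-count vectors of two strings."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

text_a = "actors read the gospels"
text_b = "the show airs tonight"

# Word-level comparison: the two texts share only the word "the"
word_level = cosine_similarity(text_a, text_b)

# " ".join() on a string yields space-separated characters, so this
# comparison is over single letters rather than words
char_level = cosine_similarity(" ".join(text_a), " ".join(text_b))

print(" ".join("abc"))
print(f"word-level={word_level:.3f} char-level={char_level:.3f}")
```

In this toy case the character-level similarity is much higher than the word-level one, because most texts share the same common letters even when they share few words.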

In [None]:
dataset = load_dataset('csv', data_files = filepath, split='train' )



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-073b314f9fc5a023/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-073b314f9fc5a023/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.


In [None]:
dataset

Dataset({
    features: ['orig_article', 'orig_summary', 'extracted_summary'],
    num_rows: 1000
})