# CNN_Daily_Mail_Final_Model_5a_TD-IDF_Extractive_Summarization
Nov 6, 2022




## This notebook has the following model built for an extractive summarizer based on TF-IDF Methodology.

### Methodology

Input document -> Finding most important words from the document -> Finding sentence scores on the basis of important words ->Choosing the most important sentences on the basis of scores obtained.

### What is TFIDF Approach ?

TFIDF, short for term frequency–inverse document frequency, is a numeric measure that is use to score the importance of a word in a document based on how often did it appear in that document and a given collection of documents. The intuition behind this measure is : If a word appears frequently in a document, then it should be important and we should give that word a high score. But if a word appears in too many other documents, it’s probably not a unique identifier, therefore we should assign a lower score to that word.

Formula for calculating tf and idf:

    TF(w) = (Number of times term w appears in a document) / (Total number of terms in the document)
    IDF(w) = log_e(Total number of documents / Number of documents with term w in it)

Hence tfidf for a word can be calculated as:

    TFIDF(w) = TF(w) * IDF(w)

### Dataset Summary

This large-scale media interview dataset contains 463.6K transcripts with abstractive summaries, collected from interview transcripts and overview / topic descriptions from NPR and CNN.

Data Fields

    id: paper id
    document: a string/list containing the body of a set of documents
    summary: a string containing the abstract of the set



# 1. Setup

#### This section install key libraries

In [1]:
!pip install datasets --quiet
!pip install nltk --quiet

[K     |████████████████████████████████| 441 kB 5.1 MB/s 
[K     |████████████████████████████████| 163 kB 58.1 MB/s 
[K     |████████████████████████████████| 95 kB 3.0 MB/s 
[K     |████████████████████████████████| 115 kB 56.1 MB/s 
[K     |████████████████████████████████| 212 kB 50.9 MB/s 
[K     |████████████████████████████████| 127 kB 46.4 MB/s 
[K     |████████████████████████████████| 115 kB 51.8 MB/s 
[?25h

In [2]:
!pip install -q rouge_score

  Building wheel for rouge-score (setup.py) ... [?25l[?25hdone


In [3]:
!pip install -q evaluate


[?25l[K     |████▌                           | 10 kB 21.1 MB/s eta 0:00:01[K     |█████████                       | 20 kB 6.5 MB/s eta 0:00:01[K     |█████████████▌                  | 30 kB 8.9 MB/s eta 0:00:01[K     |██████████████████              | 40 kB 4.5 MB/s eta 0:00:01[K     |██████████████████████▌         | 51 kB 4.5 MB/s eta 0:00:01[K     |███████████████████████████     | 61 kB 5.3 MB/s eta 0:00:01[K     |███████████████████████████████▌| 71 kB 5.9 MB/s eta 0:00:01[K     |████████████████████████████████| 72 kB 1.0 MB/s 
[?25h

In [4]:
!pip install -U spacy
!pip install -U spacy-lookups-data

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy-lookups-data
  Downloading spacy_lookups_data-1.0.3-py2.py3-none-any.whl (98.5 MB)
[K     |████████████████████████████████| 98.5 MB 1.2 MB/s 
Installing collected packages: spacy-lookups-data
Successfully installed spacy-lookups-data-1.0.3


In [5]:
!pip install -r rouge/requirements.txt
!pip install rouge-score
!pip install rouge_score

[31mERROR: Could not open requirements file: [Errno 2] No such file or directory: 'rouge/requirements.txt'[0m
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [6]:
!python -m spacy download en_core_web_lg

2022-11-06 16:01:53.124067: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.1/en_core_web_lg-3.4.1-py3-none-any.whl (587.7 MB)
[K     |████████████████████████████████| 587.7 MB 15 kB/s 
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.4.1
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')


# 2.0 Import Libraries

In [7]:
# NLTK
import re # relugar expression
import nltk # natural language toolkit for sentence tokenization and display
import string
import heapq
nltk.download('punkt')
nltk.download('stopwords')
from nltk import word_tokenize
from nltk.util import ngrams
import evaluate

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [8]:
import os
import logging

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"


In [9]:
import evaluate

In [10]:
import spacy
import pandas as pd

print(spacy.__version__)
print(pd.__version__)

3.4.2
1.3.5


In [11]:
import pickle
import subprocess
import sys
import nltk
from nltk import Nonterminal, nonterminals, Production, CFG, PCFG

In [12]:
#shift reduce parser example
from nltk.grammar import Nonterminal
from nltk.parse.api import ParserI
from nltk.tree import Tree

In [13]:
nlp = spacy.load("en_core_web_lg")

In [14]:
from rouge_score import rouge_scorer

In [15]:
import math

from nltk import sent_tokenize, word_tokenize, PorterStemmer
from nltk.corpus import stopwords    

In [16]:
import time

In [17]:
# Mount drive for saving model checkpoints, loading Task 2 data below

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 3.0 Load the dataset



In [18]:
from datasets import load_dataset, load_metric


In [19]:
# Load 1000 examples of CNN Dailymail from a CSV File

filepath = 'drive/My Drive/Colab_Notebooks_1/cnn_dailymail_1000.csv'
dataset = load_dataset('csv', data_files = filepath, split='train' )
dataset



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-d8ef26a734f1246a/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-d8ef26a734f1246a/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.


Dataset({
    features: ['document', 'summary'],
    num_rows: 1000
})

# 5.0 Create a Dataset for Model Building and Evaluation
 

In [20]:
 # Generate a dataset of "x" examples for model evaulation
size_of_dataset = 1000 # change the value as you develop
small_dataset = dataset[0:size_of_dataset]

#6.0 Build Model using TD-IDF

In [21]:
# TD-IDF Functions

import math

from nltk import sent_tokenize, word_tokenize, PorterStemmer
from nltk.corpus import stopwords

def _create_frequency_table(text_string) -> dict:
    """
    we create a dictionary for the word frequency table.
    For this, we should only use the words that are not part of the stopWords array.
    Removing stop words and making frequency table
    Stemmer - an algorithm to bring words to its root word.
    :rtype: dict
    """
    stopWords = set(stopwords.words("english"))
    words = word_tokenize(text_string)
    ps = PorterStemmer()

    freqTable = dict()
    for word in words:
        word = ps.stem(word)
        if word in stopWords:
            continue
        if word in freqTable:
            freqTable[word] += 1
        else:
            freqTable[word] = 1

    return freqTable


def _create_frequency_matrix(sentences):
    frequency_matrix = {}
    stopWords = set(stopwords.words("english"))
    ps = PorterStemmer()

    for sent in sentences:
        freq_table = {}
        words = word_tokenize(sent)
        for word in words:
            word = word.lower()
            word = ps.stem(word)
            if word in stopWords:
                continue

            if word in freq_table:
                freq_table[word] += 1
            else:
                freq_table[word] = 1

        frequency_matrix[sent[:15]] = freq_table

    return frequency_matrix


def _create_tf_matrix(freq_matrix):
    tf_matrix = {}

    for sent, f_table in freq_matrix.items():
        tf_table = {}

        count_words_in_sentence = len(f_table)
        for word, count in f_table.items():
            tf_table[word] = count / count_words_in_sentence

        tf_matrix[sent] = tf_table

    return tf_matrix


def _create_documents_per_words(freq_matrix):
    word_per_doc_table = {}

    for sent, f_table in freq_matrix.items():
        for word, count in f_table.items():
            if word in word_per_doc_table:
                word_per_doc_table[word] += 1
            else:
                word_per_doc_table[word] = 1

    return word_per_doc_table


def _create_idf_matrix(freq_matrix, count_doc_per_words, total_documents):
    idf_matrix = {}

    for sent, f_table in freq_matrix.items():
        idf_table = {}

        for word in f_table.keys():
            idf_table[word] = math.log10(total_documents / float(count_doc_per_words[word]))

        idf_matrix[sent] = idf_table

    return idf_matrix


def _create_tf_idf_matrix(tf_matrix, idf_matrix):
    tf_idf_matrix = {}

    for (sent1, f_table1), (sent2, f_table2) in zip(tf_matrix.items(), idf_matrix.items()):

        tf_idf_table = {}

        for (word1, value1), (word2, value2) in zip(f_table1.items(),
                                                    f_table2.items()):  # here, keys are the same in both the table
            tf_idf_table[word1] = float(value1 * value2)

        tf_idf_matrix[sent1] = tf_idf_table

    return tf_idf_matrix


def _score_sentences(tf_idf_matrix) -> dict:
    """
    score a sentence by its word's TF
    Basic algorithm: adding the TF frequency of every non-stop word in a sentence divided by total no of words in a sentence.
    :rtype: dict
    """

    sentenceValue = {}

    for sent, f_table in tf_idf_matrix.items():
        total_score_per_sentence = 0

        count_words_in_sentence = len(f_table)
        for word, score in f_table.items():
            total_score_per_sentence += score

        if count_words_in_sentence == 0:
          count_words_in_sentence = 1
        sentenceValue[sent] = total_score_per_sentence / count_words_in_sentence

    return sentenceValue


def _find_average_score(sentenceValue) -> int:
    """
    Find the average score from the sentence value dictionary
    :rtype: int
    """
    sumValues = 0
    for entry in sentenceValue:
        sumValues += sentenceValue[entry]

    # Average value of a sentence from original summary_text
    average = (sumValues / len(sentenceValue))

    return average


def _generate_summary(sentences, sentenceValue, threshold):
    sentence_count = 0
    summary = ''

    for sentence in sentences:
        if sentence[:15] in sentenceValue and sentenceValue[sentence[:15]] >= (threshold):
            summary += " " + sentence
            sentence_count += 1

    return summary


def run_summarization(text):
    """
    :param text: Plain summary_text of long article
    :return: summarized summary_text
    """

    '''
    We already have a sentence tokenizer, so we just need 
    to run the sent_tokenize() method to create the array of sentences.
    '''
    # 1 Sentence Tokenize
    sentences = sent_tokenize(text)
    total_documents = len(sentences)
    #print(sentences)

    # 2 Create the Frequency matrix of the words in each sentence.
    freq_matrix = _create_frequency_matrix(sentences)
    #print(freq_matrix)

    '''
    Term frequency (TF) is how often a word appears in a document, divided by how many words are there in a document.
    '''
    # 3 Calculate TermFrequency and generate a matrix
    tf_matrix = _create_tf_matrix(freq_matrix)
    #print(tf_matrix)

    # 4 creating table for documents per words
    count_doc_per_words = _create_documents_per_words(freq_matrix)
    #print(count_doc_per_words)

    '''
    Inverse document frequency (IDF) is how unique or rare a word is.
    '''
    # 5 Calculate IDF and generate a matrix
    idf_matrix = _create_idf_matrix(freq_matrix, count_doc_per_words, total_documents)
    #print(idf_matrix)

    # 6 Calculate TF-IDF and generate a matrix
    tf_idf_matrix = _create_tf_idf_matrix(tf_matrix, idf_matrix)
    #print(tf_idf_matrix)

    # 7 Important Algorithm: score the sentences
    sentence_scores = _score_sentences(tf_idf_matrix)
    #print(sentence_scores)

    # 8 Find the threshold
    threshold = _find_average_score(sentence_scores)
    #print(threshold)

    # 9 Important Algorithm: Generate the summary
    summary = _generate_summary(sentences, sentence_scores, 1.3 * threshold) # orig 1.3
    return summary

In [22]:
def preprocess(sentence):
    sentence=str(sentence)
    sentence = sentence.replace('\n', ' ')
    sentence = sentence.replace('?</s>', '?.')
    sentence = sentence.replace('</s>', '')
    sentence = sentence.replace("\'", "'")
    sentence = sentence.replace('--', '')
    sentence = sentence.replace('|', '')
    sentence = sentence.replace('/', '')
    sentence = sentence.replace('Dr.', 'Dr')
    sentence = sentence.replace('?.', '?. ')
    #sentence = sentence.replace('.', '. ')
    sentence = sentence.replace('!', '!. ')
    #sentence = sentence.replace('"', '')
    sentence = sentence.replace("This material may not be published, broadcast, rewritten, or redistributed.",'')
    #sentence = sentence.replace(' "', '')
    #sentence = sentence.replace('" ', '')
    return sentence

#7.0 Evaluate Rouge Across Eval Dataset

In [23]:
# RougueL Scorer
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
rouge = evaluate.load('rouge')

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

In [24]:
# Iterate through the dataset, extract the summary and compute RougeL, Cosine Similarity Scores

start = time.time()

# get article from dataset
input_article = small_dataset['document'][0:size_of_dataset]
#len(input_article)

# get summary from dataset
input_highlights = small_dataset['summary'][0:size_of_dataset]
#len(input_highlights)

# zip article and summary
zipped_input = zip(input_article, input_highlights)

# Empty List to Store Scores
rougeL_precision = []
rougeL_recall = []
rougeL_fmeasure = []

rouge_1_list = []
rouge_2_list = []
rouge_L_list = []

cosine_similarity_results = []

# Counter for Tracking Results
count = 1

# Extracted Highlights
output_highlights_list = []

# iterate Through the Eval Dataset
for input_article, input_highlights in zipped_input:

  print('Example:', count)

# Tokenize the string texts
  
  source_article_list = nltk.sent_tokenize(input_article)
  source_highlight_list = nltk.sent_tokenize(input_highlights)
  
# Join the list into a string
  formatted_source_article = " ".join(source_article_list)
  formatted_source_highlight = " ".join(source_highlight_list)
  
# Run td-idf model
  result = run_summarization(formatted_source_article)
  
# Tokenize the string result in a list of sentences
  result = nltk.sent_tokenize(result)
  
# Determine the number of sentences in the final extraction  
  num_sentences = int(len(source_article_list) * 0.3) # 30% compression ration
  
# compress the TD-IDF results into desired number of output sentences 
  result = result[0:num_sentences]
  
# Join into a string.
  summary = " ".join(result)
  output_highlights_list.append(summary)

# Take the string from summary and convert to list of strings for each sentence
  extracted_sentence_list = nltk.sent_tokenize(summary)

# Define predictions and references for score calculation
  predictions = " ".join(extracted_sentence_list)
  references = formatted_source_highlight

# Calculate Scores
  rougeL_scores = scorer.score(predictions,
                      references)
  
  pred = [predictions]
  ref = [references]

  rouge_results = rouge.compute(predictions=pred, references=ref)
  print(rouge_results)

  rouge_1_score = (rouge_results['rouge1'])
  rouge_2_score = (rouge_results['rouge2'])
  rouge_L_score = (rouge_results['rougeL'])

  rouge_1_list.append(rouge_1_score)
  rouge_2_list.append(rouge_2_score)
  rouge_L_list.append(rouge_L_score)

  #print(rouge_1_score, rouge_2_score, rouge_L_score)

  precision = (rougeL_scores['rougeL'].precision)
  recall = (rougeL_scores['rougeL'].recall)
  fmeasure = (rougeL_scores['rougeL'].fmeasure)

  rougeL_precision.append(precision)
  rougeL_recall.append(recall)
  rougeL_fmeasure.append(fmeasure)

  # calculate cosine similarity
  doc1 = nlp((" ".join(predictions)))
  doc2 = nlp((" ".join(references)))
  cosine_similarity = doc1.similarity(doc2)
  cosine_similarity_results.append(cosine_similarity)

  count = count + 1

Example: 1
{'rouge1': 0.14285714285714288, 'rouge2': 0.0, 'rougeL': 0.09523809523809523, 'rougeLsum': 0.09523809523809523}
Example: 2
{'rouge1': 0.2772277227722772, 'rouge2': 0.12121212121212122, 'rougeL': 0.19801980198019803, 'rougeLsum': 0.23762376237623764}
Example: 3
{'rouge1': 0.2222222222222222, 'rouge2': 0.052173913043478265, 'rougeL': 0.10256410256410256, 'rougeLsum': 0.10256410256410256}
Example: 4
{'rouge1': 0.05405405405405406, 'rouge2': 0.0, 'rougeL': 0.05405405405405406, 'rougeLsum': 0.05405405405405406}
Example: 5
{'rouge1': 0.12080536912751677, 'rouge2': 0.013605442176870746, 'rougeL': 0.08053691275167785, 'rougeLsum': 0.08053691275167785}
Example: 6
{'rouge1': 0.14814814814814817, 'rouge2': 0.0, 'rougeL': 0.09876543209876543, 'rougeLsum': 0.09876543209876543}
Example: 7
{'rouge1': 0.14388489208633093, 'rouge2': 0.058394160583941604, 'rougeL': 0.11510791366906475, 'rougeLsum': 0.11510791366906475}
Example: 8
{'rouge1': 0.2318840579710145, 'rouge2': 0.08955223880597016, '



{'rouge1': 0.2096774193548387, 'rouge2': 0.049180327868852465, 'rougeL': 0.16129032258064518, 'rougeLsum': 0.16129032258064518}
Example: 296
{'rouge1': 0.07272727272727272, 'rouge2': 0.0, 'rougeL': 0.03636363636363636, 'rougeLsum': 0.03636363636363636}
Example: 297
{'rouge1': 0.13114754098360654, 'rouge2': 0.016666666666666666, 'rougeL': 0.09836065573770492, 'rougeLsum': 0.09836065573770492}
Example: 298
{'rouge1': 0.1616161616161616, 'rouge2': 0.06185567010309278, 'rougeL': 0.10101010101010101, 'rougeLsum': 0.10101010101010101}
Example: 299
{'rouge1': 0.15503875968992245, 'rouge2': 0.0, 'rougeL': 0.07751937984496123, 'rougeLsum': 0.07751937984496123}
Example: 300
{'rouge1': 0.10810810810810811, 'rouge2': 0.0, 'rougeL': 0.08108108108108109, 'rougeLsum': 0.08108108108108109}
Example: 301
{'rouge1': 0.05714285714285715, 'rouge2': 0.0, 'rougeL': 0.05714285714285715, 'rougeLsum': 0.05714285714285715}
Example: 302
{'rouge1': 0.14814814814814814, 'rouge2': 0.0, 'rougeL': 0.11111111111111112,

In [25]:
# Print Compute Time
print('\nTime:', time.time() - start)


Time: 297.50478625297546


In [26]:
# Clean up the extracted list
output_highlights_list = list(((map(preprocess, output_highlights_list))))

In [27]:
# Export extractive summary to a CSV

# get article from dataset
input_article_list = small_dataset['document'][0:size_of_dataset]

# get summary from dataset
input_highlights_list = small_dataset['summary'][0:size_of_dataset]

df = pd.DataFrame(list(zip(input_article_list, input_highlights_list, output_highlights_list)),
                  columns = ['orig_article', 'orig_summary', 'extracted_summary'])

# Edit this filepath to wherever you saved the data in your Drive
filepath = 'drive/My Drive/Colab_Notebooks_1/model_5a_extracted_CNNDaily1000.csv'

df.to_csv(filepath,index = False)

In [28]:
# read back the csv file
data_import = pd.read_csv(filepath)        
#data_import.rename(columns = {'0':'orig_article', '1':'orig_summary', '2':'extracted_summary'}, inplace = True)

col1 = data_import.orig_article.values.tolist()
col2 = data_import.orig_summary.values.tolist()
col3 = data_import.extracted_summary.values.tolist()

In [29]:
# Validate data_import
print(col1[0])
print("")

print(col2[0])
print("")

print(col3[0])

LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details of how

In [30]:
# Calculate Mean Rouge for Dataset

print("RougeL Precision Scores")
print(rougeL_precision)
print(len(rougeL_precision))
print(np.mean(np.asarray(rougeL_precision)))
print("")

print("RougeL Recall Scores")
print(rougeL_recall)
print(len(rougeL_precision))
print(np.mean(np.asarray(rougeL_recall)))
print("")

print("RougeL Fmeasure Scores")
print(rougeL_fmeasure)
print(len(rougeL_fmeasure))
print(np.mean(np.asarray(rougeL_fmeasure)))
print("")

print("Rouge 1 Scores")
print(rouge_1_list)
print(len(rouge_1_list))
print(np.mean(np.asarray(rouge_1_list)))
print("")

print("Rouge 2 Scores")
print(rouge_2_list)
print(len(rouge_2_list))
print(np.mean(np.asarray(rouge_2_list)))
print("")

print("Rouge L Scores")
print(rouge_L_list)
print(len(rouge_L_list))
print(np.mean(np.asarray(rouge_L_list)))
print("")




RougeL Precision Scores
[0.10256410256410256, 0.20408163265306123, 0.14634146341463414, 0.041666666666666664, 0.17073170731707318, 0.09302325581395349, 0.1702127659574468, 0.14285714285714285, 0.12121212121212122, 0.03571428571428571, 0.10810810810810811, 0.1388888888888889, 0.05555555555555555, 0.06451612903225806, 0.0, 0.08163265306122448, 0.038461538461538464, 0.07142857142857142, 0.13333333333333333, 0.11428571428571428, 0.13953488372093023, 0.0, 0.13043478260869565, 0.02702702702702703, 0.14545454545454545, 0.15625, 0.027777777777777776, 0.075, 0.14285714285714285, 0.05263157894736842, 0.2, 0.1276595744680851, 0.12903225806451613, 0.10810810810810811, 0.09433962264150944, 0.1891891891891892, 0.06666666666666667, 0.1568627450980392, 0.11627906976744186, 0.06382978723404255, 0.09090909090909091, 0.12903225806451613, 0.058823529411764705, 0.14285714285714285, 0.08108108108108109, 0.09302325581395349, 0.125, 0.07317073170731707, 0.023809523809523808, 0.06818181818181818, 0.11904761904

In [31]:
# Calculate Mean Cosine Similarity

print("Cosine Similarity")
print(cosine_similarity_results)
print(len(cosine_similarity_results))
print(np.mean(np.asarray(cosine_similarity_results)))
print("")


Cosine Similarity
[0.9909555757477774, 0.9826497799958789, 0.9841789293346889, 0.9713596657052305, 0.9802258490262686, 0.9864947508295535, 0.9880333040179164, 0.9813212173118575, 0.9818567022036533, 0.9659505003502292, 0.9715173698656306, 0.9886698345503844, 0.9778476152166365, 0.9719674386835401, 0.9816544224505239, 0.9943581315373976, 0.987240473112351, 0.9932177519992903, 0.992108293276123, 0.9914316017221015, 0.9838841498083023, 0.9778925487075557, 0.9917221317399944, 0.9588371227244108, 0.9868214432775414, 0.9893681912483281, 0.9088143132882072, 0.9864539581338497, 0.9873663308423717, 0.9655371453407633, 0.9957889353757139, 0.9895955814088517, 0.9581997154194364, 0.9950347393093335, 0.9923927337266644, 0.9820177526609946, 0.9849442841219332, 0.9890243979287908, 0.9909746029928146, 0.9856163011839493, 0.9938641734718601, 0.9778660492752088, 0.9827693786921291, 0.9831471389339183, 0.9690484295288839, 0.9935882935646851, 0.9892389578265208, 0.9688615035332204, 0.9442849126080847, 0.9

In [32]:
dataset = load_dataset('csv', data_files = filepath, split='train' )



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-62065a366888c47c/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-62065a366888c47c/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.


In [33]:
dataset

Dataset({
    features: ['orig_article', 'orig_summary', 'extracted_summary'],
    num_rows: 1000
})

In [34]:
dataset['extracted_summary'][0]

' Details of how he\'ll mark his landmark birthday are under wraps. His agent and publicist had no comment on his plans. "Hopefully none of you will be reading about it." There is life beyond Potter, however. E-mail to a friend . Copyright 2007 Reuters.'