# Final_Model_4a.1_Bert_Based_Transformer_Cosine_MedialSum_Extractive_Summary
    Oct 27, 2022


This notebook builds an extractive summarizer based on a BERT sentence-transformer model from Hugging Face.

1. A **BERT-based sentence transformer** is assembled from the pre-trained `bert-base-uncased` model, a mean-pooling layer, and a dense layer with a non-linear (ReLU) activation.
2. SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. The initial work is described in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. The framework can compute sentence and text embeddings for more than 100 languages. These embeddings can then be compared, e.g. with cosine similarity, to find sentences with a similar meaning. This is useful for semantic textual similarity, semantic search, and paraphrase mining.
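
The comparison step described in item 2 can be sketched with plain NumPy. This is a minimal illustration, not the library's implementation: `cosine_similarity_matrix` is a hypothetical helper and the three toy vectors stand in for real sentence embeddings.

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return the pairwise cosine similarity of the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T

# Toy 3-dimensional "sentence embeddings" (made-up values for illustration).
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as row 0, different length
    [0.0, 1.0, 0.0],  # orthogonal to rows 0 and 1
])

similarity = cosine_similarity_matrix(toy_embeddings)
print(similarity.round(2))
```

Rows 0 and 1 point in the same direction, so their similarity is 1 regardless of vector length, while row 2 is orthogonal to both and scores 0.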

Dataset Summary

This large-scale media interview dataset contains 463.6K transcripts with abstractive summaries, collected from interview transcripts and overview / topic descriptions from NPR and CNN.

Data Fields

    document: a string containing the interview transcript, with speaker turns separated by </s>
    summary: a string containing the abstractive summary of the transcript

Model Details

    Sentence Transformer: Sentence Transformer
    Pre-Training: bert-base-uncased
    Supervision: Unsupervised (no labels are used)
    Classification: Degree Centrality of Sentences in a Document
    Trigram Blocking: Yes
    Fine Tuning: None
    Evaluation Metrics: RougeL and Cosine Similarity

    ## Step 1: use an existing language model
    word_embedding_model = models.Transformer('bert-base-uncased')

    ## Step 2: use a pool function over the token embeddings
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),pooling_mode_mean_tokens=True)

    ## Join steps 1 and 2 using the modules argument
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
    dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=pooling_model.get_sentence_embedding_dimension(), activation_function=nn.ReLU())
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
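
To make the pooling and dense steps above more concrete, here is a rough NumPy sketch of what mean pooling followed by a ReLU dense layer computes. It is an illustration under stated assumptions, not the sentence-transformers code: `mean_pool` and `dense_relu` are hypothetical helpers, the token embeddings are random, and the hidden size is 8 instead of BERT's 768.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padding (mask == 0)."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

def dense_relu(x: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Linear layer followed by ReLU, keeping the embedding dimension."""
    return np.maximum(x @ weight.T + bias, 0.0)

rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 4, 8
token_embeddings = rng.normal(size=(batch, seq_len, hidden))
attention_mask = np.array([[1, 1, 1, 0],   # sentence 0 has 3 real tokens
                           [1, 1, 0, 0]])  # sentence 1 has 2 real tokens

pooled = mean_pool(token_embeddings, attention_mask)
weight = rng.normal(size=(hidden, hidden))  # randomly initialized, like an untrained dense layer
bias = np.zeros(hidden)
sentence_embeddings = dense_relu(pooled, weight, bias)
print(pooled.shape, sentence_embeddings.shape)
```

Because of the ReLU, every component of the final embedding is non-negative, which tends to push cosine similarities between sentences toward the positive range.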
    


# 1.0 Setup

#### This section installs key libraries

In [None]:
!pip install bert-extractive-summarizer --quiet


In [None]:
!pip install datasets --quiet
!pip install nltk --quiet


In [None]:
!pip install -q rouge_score

  Building wheel for rouge-score (setup.py) ... done


In [None]:
!pip install -q evaluate


[?25l[K     |████▌                           | 10 kB 22.3 MB/s eta 0:00:01[K     |█████████                       | 20 kB 7.8 MB/s eta 0:00:01[K     |█████████████▌                  | 30 kB 10.8 MB/s eta 0:00:01[K     |██████████████████              | 40 kB 4.5 MB/s eta 0:00:01[K     |██████████████████████▌         | 51 kB 4.6 MB/s eta 0:00:01[K     |███████████████████████████     | 61 kB 5.5 MB/s eta 0:00:01[K     |███████████████████████████████▌| 71 kB 6.0 MB/s eta 0:00:01[K     |████████████████████████████████| 72 kB 1.2 MB/s 
[?25h

In [None]:
!pip install -U spacy
!pip install -U spacy-lookups-data

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy-lookups-data
  Downloading spacy_lookups_data-1.0.3-py2.py3-none-any.whl (98.5 MB)
Installing collected packages: spacy-lookups-data
Successfully installed spacy-lookups-data-1.0.3


In [None]:
# Install rouge-score (the PyPI name; `rouge_score` resolves to the same package)
!pip install -q rouge-score


In [None]:
!python -m spacy download en_core_web_lg

2022-11-01 02:58:30.355418: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.1/en_core_web_lg-3.4.1-py3-none-any.whl (587.7 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.4.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


# 2.0 Import Libraries

In [None]:
# NLTK
import re # regular expressions
import nltk # natural language toolkit for sentence tokenization and display
import string
import heapq
nltk.download('punkt')
nltk.download('stopwords')
from nltk import word_tokenize
from nltk.util import ngrams
import evaluate

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
import os
import logging

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Only log error messages
tf.get_logger().setLevel(logging.ERROR)

os.environ["TOKENIZERS_PARALLELISM"] = "false"


In [None]:
import spacy
import pandas as pd

print(spacy.__version__)
print(pd.__version__)

3.4.2
1.3.5


In [None]:
import pickle
import subprocess
import sys
import nltk
from nltk import Nonterminal, nonterminals, Production, CFG, PCFG

In [None]:
#shift reduce parser example
from nltk.grammar import Nonterminal
from nltk.parse.api import ParserI
from nltk.tree import Tree

In [None]:
nlp = spacy.load("en_core_web_lg")


In [None]:
from rouge_score import rouge_scorer

In [None]:
import time

In [None]:
# Mount drive for saving model checkpoints, loading Task 2 data below

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 3.0 Load the dataset



In [None]:
from datasets import load_dataset, load_metric


In [None]:
dataset_id = "ccdv/mediasum"
dataset = load_dataset(dataset_id, split="train")


Downloading and preparing dataset mediasum/roberta_prepended to /root/.cache/huggingface/datasets/ccdv___mediasum/roberta_prepended/1.0.0/46142d4b8658fecd0c985e9f5f070b3e37b42ed7feabd04682dce334bd5e526c...


Dataset mediasum downloaded and prepared to /root/.cache/huggingface/datasets/ccdv___mediasum/roberta_prepended/1.0.0/46142d4b8658fecd0c985e9f5f070b3e37b42ed7feabd04682dce334bd5e526c. Subsequent calls will reuse this data.


In [None]:
# Inspect data structure
print(dataset)

Dataset({
    features: ['document', 'summary'],
    num_rows: 443596
})


In [None]:
# Inspect shape
print(dataset.shape)



(443596, 2)


# 4.0 Inspect MediaSum

In [None]:
# inspect first example
dataset[0]

{'document': 'FARAI CHIDEYA, host: Now, moving on, Forest Whitaker as Moses, Tisha Campbell Martin as Mary Magdalene - well, that\'s all in "The Bible Experience." A New Testament edition was released in 2006. This edition is billed as "The Complete Bible." It doesn\'t have one person reading the gospels. It features nearly 400 African-American artists, actors and ministers, plus sound effects.</s>FARAI CHIDEYA, host: Just listen to Blair Underwood\'s rendition of Jesus on the cross.</s>Mr. BLAIR UNDERWOOD (Actor): (As Jesus) My God, my God, why have you forsaken me?</s>FARAI CHIDEYA, host: Now, we\'ve got two people affiliated with the project with us today. Kyle Bowser, he co-produced "The Bible Experience" and actress Wendy Raquel Robinson, one of the actors in "The Bible Experience," and she also stars in the CW series, "The Game."</s>FARAI CHIDEYA, host: Hi folks, how are you doing?</s>Ms. WENDY RAQUEL ROBINSON (Actress): Great.</s>Mr. KYLE BOWSER (Co-producer, "The Bible Experien

In [None]:
print(f"- The {dataset_id} dataset has {dataset.num_rows} examples.")
print(f"- Each example is a {type(dataset[0])} with a {type(dataset[0]['document'])} as value.")
print(f"- Examples look like this: {dataset[0]}")

- The ccdv/mediasum dataset has 443596 examples.
- Each example is a <class 'dict'> with a <class 'str'> as value.
- Examples look like this: {'document': 'FARAI CHIDEYA, host: Now, moving on, Forest Whitaker as Moses, Tisha Campbell Martin as Mary Magdalene - well, that\'s all in "The Bible Experience." A New Testament edition was released in 2006. This edition is billed as "The Complete Bible." It doesn\'t have one person reading the gospels. It features nearly 400 African-American artists, actors and ministers, plus sound effects.</s>FARAI CHIDEYA, host: Just listen to Blair Underwood\'s rendition of Jesus on the cross.</s>Mr. BLAIR UNDERWOOD (Actor): (As Jesus) My God, my God, why have you forsaken me?</s>FARAI CHIDEYA, host: Now, we\'ve got two people affiliated with the project with us today. Kyle Bowser, he co-produced "The Bible Experience" and actress Wendy Raquel Robinson, one of the actors in "The Bible Experience," and she also stars in the CW series, "The Game."</s>FARAI C

# 4.1 Helper Functions

In [None]:
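def preprocess_text(sentence):
    """Clean one MediaSum document or summary string before sentence tokenization."""
    sentence = str(sentence)
    sentence = sentence.replace('\n', ' ')
    # Turn boundaries: keep a sentence break after a question, drop the marker elsewhere
    sentence = sentence.replace('?</s>', '?.')
    sentence = sentence.replace('</s>', '')
    sentence = sentence.replace("\'", "'")  # no-op in Python: "\'" and "'" are the same string
    sentence = sentence.replace('--', '')
    sentence = sentence.replace('|', '')
    sentence = sentence.replace('/', '')
    sentence = sentence.replace('Dr.', 'Dr')
    # Add a space after sentence-ending punctuation so the tokenizer can split on it
    sentence = sentence.replace('?.', '?. ')
    sentence = sentence.replace('.', '. ')
    sentence = sentence.replace('!', '!. ')

    return sentence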
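
The cleaning function above discards the `</s>` markers, which in the MediaSum examples shown earlier separate speaker turns. If turn boundaries are needed later, one possible alternative is to split on the marker first. The helper below is a hypothetical sketch and is not used elsewhere in this notebook; the sample transcript is made up.

```python
from typing import List

def split_turns(document: str) -> List[str]:
    """Split a transcript into speaker turns on the `</s>` marker."""
    return [turn.strip() for turn in document.split("</s>") if turn.strip()]

# Made-up transcript in the same "SPEAKER: text</s>" shape as the dataset.
sample = "HOST: Hi folks, how are you doing?</s>GUEST: Great.</s>HOST: Welcome."
turns = split_turns(sample)
for turn in turns:
    print(turn)
```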

In [None]:
def calculate_rouge(references, predictions):
  """Compute, print, and return ROUGE for parallel lists of references and predictions."""
  rouge = evaluate.load('rouge')
  results = rouge.compute(predictions=predictions,
                        references=references)
  print(results)
  return results

In [None]:
"""
LexRank implementation
Source: https://github.com/crabcamp/lexrank/tree/dev
"""

import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.special import softmax
import logging

logger = logging.getLogger(__name__)

def degree_centrality_scores(
    similarity_matrix,
    threshold=None,
    increase_power=True,
):
    if not (
        threshold is None
        or isinstance(threshold, float)
        and 0 <= threshold < 1
    ):
        raise ValueError(
            '\'threshold\' should be a floating-point number '
            'from the interval [0, 1) or None',
        )

    if threshold is None:
        markov_matrix = create_markov_matrix(similarity_matrix)

    else:
        markov_matrix = create_markov_matrix_discrete(
            similarity_matrix,
            threshold,
        )

    scores = stationary_distribution(
        markov_matrix,
        increase_power=increase_power,
        normalized=False,
    )

    return scores


def _power_method(transition_matrix, increase_power=True, max_iter=10000):
    eigenvector = np.ones(len(transition_matrix))

    if len(eigenvector) == 1:
        return eigenvector

    transition = transition_matrix.transpose()

    for _ in range(max_iter):
        eigenvector_next = np.dot(transition, eigenvector)

        if np.allclose(eigenvector_next, eigenvector):
            return eigenvector_next

        eigenvector = eigenvector_next

        if increase_power:
            transition = np.dot(transition, transition)

    logger.warning("Maximum number of iterations for power method exceeded without convergence!")
    return eigenvector_next


def connected_nodes(matrix):
    _, labels = connected_components(matrix)

    groups = []

    for tag in np.unique(labels):
        group = np.where(labels == tag)[0]
        groups.append(group)

    return groups


def create_markov_matrix(weights_matrix):
    n_1, n_2 = weights_matrix.shape
    if n_1 != n_2:
        raise ValueError('\'weights_matrix\' should be square')

    row_sum = weights_matrix.sum(axis=1, keepdims=True)

    # normalize probability distribution differently if we have negative transition values
    if np.min(weights_matrix) <= 0:
        return softmax(weights_matrix, axis=1)

    return weights_matrix / row_sum


def create_markov_matrix_discrete(weights_matrix, threshold):
    discrete_weights_matrix = np.zeros(weights_matrix.shape)
    ixs = np.where(weights_matrix >= threshold)
    discrete_weights_matrix[ixs] = 1

    return create_markov_matrix(discrete_weights_matrix)


def stationary_distribution(
    transition_matrix,
    increase_power=True,
    normalized=True,
):
    n_1, n_2 = transition_matrix.shape
    if n_1 != n_2:
        raise ValueError('\'transition_matrix\' should be square')

    distribution = np.zeros(n_1)

    grouped_indices = connected_nodes(transition_matrix)

    for group in grouped_indices:
        t_matrix = transition_matrix[np.ix_(group, group)]
        eigenvector = _power_method(t_matrix, increase_power=increase_power)
        distribution[group] = eigenvector

    if normalized:
        distribution /= n_1

    return distribution
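
As a small worked example of the idea behind these helpers, the sketch below ranks four "sentences" from a toy similarity matrix. It is a simplified stand-in, not the code above: `centrality_by_power_iteration` is a hypothetical function that row-normalizes the matrix and runs a plain power iteration, and its scores are normalized to sum to 1 (the notebook's version starts from a vector of ones and is called with `normalized=False`, so its scores are on a different scale).

```python
import numpy as np

def centrality_by_power_iteration(similarity, max_iter=1000, tol=1e-8):
    """Stationary distribution of the row-normalized similarity matrix."""
    transition = similarity / similarity.sum(axis=1, keepdims=True)
    scores = np.full(len(similarity), 1.0 / len(similarity))
    for _ in range(max_iter):
        updated = transition.T @ scores
        if np.abs(updated - scores).max() < tol:
            return updated
        scores = updated
    return scores

# Sentences 0-2 are similar to each other; sentence 3 is an outlier.
toy_similarity = np.array([
    [1.0, 0.8, 0.7, 0.1],
    [0.8, 1.0, 0.6, 0.1],
    [0.7, 0.6, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
])

scores = centrality_by_power_iteration(toy_similarity)
ranking = np.argsort(-scores)  # most central sentence first
print(scores.round(3), ranking)
```

For a symmetric similarity matrix with positive entries, the stationary distribution is proportional to the row sums, so the sentence that is most similar to everything else ranks first and the outlier ranks last.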

In [None]:
# Trigram Blocking for Extractive Summarization

def trigram_blocking(input_text):
  """Drop sentences that contribute no new trigram.

  The first sentence is always kept. Each later sentence is kept only if at
  least one of its lowercased trigrams has not appeared in an earlier
  sentence, so later sentences with fewer than three tokens are dropped.
  Standard trigram blocking is stricter: it drops a sentence as soon as it
  shares any trigram with the summary built so far.
  """
  seen_trigrams = set()
  clean_list = []

  for idx, sentence in enumerate(input_text):
    tokens = nltk.word_tokenize(sentence.lower())  # tokenize and lowercase in one step
    trigrams = list(ngrams(tokens, 3))

    if idx == 0 or any(trigram not in seen_trigrams for trigram in trigrams):
      clean_list.append(sentence)
      seen_trigrams.update(trigrams)

  return clean_list
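
For comparison, the stricter form of trigram blocking commonly used in extractive summarization skips a candidate sentence if it shares any trigram with the sentences already kept. The sketch below is an assumption-laden illustration, not this notebook's method: `sentence_trigrams` and `strict_trigram_blocking` are hypothetical names, and whitespace splitting replaces `nltk.word_tokenize` to keep the example dependency-free.

```python
def sentence_trigrams(sentence):
    """Set of lowercase word trigrams, using simple whitespace tokenization."""
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def strict_trigram_blocking(sentences):
    """Keep a sentence only if it shares no trigram with already-kept sentences."""
    seen = set()
    kept = []
    for sentence in sentences:
        current = sentence_trigrams(sentence)
        if current & seen:
            continue  # overlaps with the summary so far: block it
        kept.append(sentence)
        seen |= current
    return kept

candidates = [
    "the senate passed the bill today",
    "critics say the senate passed the bill too quickly",
    "the weather was sunny in ohio",
]
kept = strict_trigram_blocking(candidates)
print(kept)
```

The second candidate repeats "the senate passed" and is blocked here, whereas a rule that only requires one unseen trigram would keep it because "critics say the" is new.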

# 5.0 Create a Dataset for Model Building and Evaluation
 

In [None]:
# Generate a dataset of 1000 examples for model prototyping

# DO NOT CHANGE THE CODE IN THIS CELL

size_of_dataset = 1000 # DO NOT CHANGE THIS
raw_dataset = dataset[0:size_of_dataset]
document_list =  raw_dataset['document']
summary_list =  raw_dataset['summary']

In [None]:
# Store in a list
clean_document_list = list(map(preprocess_text, document_list))
clean_summary_list = list(map(preprocess_text, summary_list))

In [None]:
# Inspect cleaned text
clean_document_list[0]

'FARAI CHIDEYA, host: Now, moving on, Forest Whitaker as Moses, Tisha Campbell Martin as Mary Magdalene - well, that\'s all in "The Bible Experience. " A New Testament edition was released in 2006.  This edition is billed as "The Complete Bible. " It doesn\'t have one person reading the gospels.  It features nearly 400 African-American artists, actors and ministers, plus sound effects. FARAI CHIDEYA, host: Just listen to Blair Underwood\'s rendition of Jesus on the cross. Mr.  BLAIR UNDERWOOD (Actor): (As Jesus) My God, my God, why have you forsaken me?.  FARAI CHIDEYA, host: Now, we\'ve got two people affiliated with the project with us today.  Kyle Bowser, he co-produced "The Bible Experience" and actress Wendy Raquel Robinson, one of the actors in "The Bible Experience," and she also stars in the CW series, "The Game. "FARAI CHIDEYA, host: Hi folks, how are you doing?.  Ms.  WENDY RAQUEL ROBINSON (Actress): Great. Mr.  KYLE BOWSER (Co-producer, "The Bible Experience: The Complete Bi

In [None]:
# Create a pandas dataframe to hold cleansed data
df = pd.DataFrame(list(zip(clean_document_list, clean_summary_list)),
               columns =['document', 'summary'])

In [None]:
 # write to a csv file
df.to_csv("cleaned_mediasum1000", index = False)

In [None]:
# Load the cleaned mediasum1000 CSV as a Hugging Face Dataset
dataset = load_dataset('csv', data_files = 'cleaned_mediasum1000', split='train' )



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-8bcbd714b81c3b26/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-8bcbd714b81c3b26/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.


In [None]:
# Inspect the dataset
dataset

Dataset({
    features: ['document', 'summary'],
    num_rows: 1000
})

# 6.0 Create a Test Dataset for Model Evaluation

In [None]:
# Generate a dataset of "x" examples for model evaluation

size_of_dataset = 1000 # change the value as you develop
small_dataset = dataset[0:size_of_dataset]



# Model: Use Pre-Trained BERT for Extractive Summarization

In [None]:
# Build the Model

!pip install -U sentence-transformers

import nltk
import numpy as np
from sentence_transformers import SentenceTransformer, models, util
from torch import nn

## Step 1: use an existing language model
word_embedding_model = models.Transformer('bert-base-uncased')
#word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=256)

## Step 2: use a pool function over the token embeddings
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode_mean_tokens=True)

## Step 3: add a dense layer with a non-linear (ReLU) activation on top of the pooled embedding
## Note: this layer is randomly initialized and is not trained anywhere in this notebook
#dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=256, activation_function=nn.Tanh())
dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=pooling_model.get_sentence_embedding_dimension(), activation_function=nn.ReLU())

## Join steps 1 to 3 using the modules argument
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125938 sha256=196a7a69cc23fb06c0492e816c6b78ebe59257e1e775c1f2e514c4a7bcd5703b
  Stored in directory: /root/.cache/pip/wheels/bf/06/fb/d59c1e5bd1dac7f6cf61ec0036cc3a10ab8fecaa6b2c3d3ee9
Successfully built sentence-transformers
Installing collected packages: sentencepiece, sentence-transformers
Successfully installed sentenc

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# 7.0 Evaluate ROUGE Across the Eval Dataset

In [None]:
# ROUGE-L scorer (rouge_score) and ROUGE metric (evaluate)
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
rouge = evaluate.load('rouge')

In [None]:
# Iterate through the dataset, extract the summary and compute RougeL, Cosine Similarity Scores

start = time.time()

# get article from dataset
input_article = small_dataset['document'][0:size_of_dataset]
#len(input_article)

# get summary from dataset
input_highlights = small_dataset['summary'][0:size_of_dataset]
#len(input_highlights)

# zip article and summary
zipped_input = zip(input_article, input_highlights)

# Empty List to Store Scores
rougeL_precision = []
rougeL_recall = []
rougeL_fmeasure = []

rouge_1_list = []
rouge_2_list = []
rouge_L_list = []

cosine_similarity_results = []

# Counter for Tracking Results
count = 1

# Extracted Highlights
output_highlights_list = []

# iterate Through the Eval Dataset
for input_article, input_highlights in zipped_input:
  
  print('Example:', count)

  # Tokenize the string texts
  
  source_article_list = nltk.sent_tokenize(input_article)
  source_highlight_list = nltk.sent_tokenize(input_highlights)
  
# Join the list into a string
  formatted_source_article = " ".join(source_article_list)
  formatted_source_highlight = " ".join(source_highlight_list)
  
# get embeddings from model
  embeddings = model.encode(source_article_list, convert_to_tensor=True)

  output_list=[]

#Compute the pair-wise cosine similarities
  cos_scores = util.cos_sim(embeddings, embeddings).numpy()

#Compute the centrality for each sentence
  centrality_scores = degree_centrality_scores(cos_scores, threshold=None)

#We argsort so that the first element is the sentence with the highest score
  most_central_sentence_indices = np.argsort(-centrality_scores)

  num_sentences = int(len(source_article_list) * 0.30) # keep 30% of the sentences (compression ratio)

  for idx in most_central_sentence_indices[0:num_sentences]:
    #print(source_article_list[idx].strip())
    output_list.append(source_article_list[idx])

  # Join the selected sentences with a space so that sentence boundaries survive re-tokenization
  summary = " ".join(output_list)

# Take the string from summary and convert to list of strings for each sentence
  extracted_sentence_list = nltk.sent_tokenize(summary)

# Apply Trigram Blocking
  result_list = trigram_blocking(extracted_sentence_list)
  summary = " ".join(result_list)
  output_highlights_list.append(summary)

  predictions = " ".join(result_list)
  references = formatted_source_highlight

  # RougeScorer.score expects (target, prediction), i.e. the reference comes first
  rougeL_scores = scorer.score(references,
                      predictions)
  
  pred = [predictions]
  ref = [references]

  rouge_results = rouge.compute(predictions=pred, references=ref)
  print(rouge_results)

  rouge_1_score = (rouge_results['rouge1'])
  rouge_2_score = (rouge_results['rouge2'])
  rouge_L_score = (rouge_results['rougeL'])

  rouge_1_list.append(rouge_1_score)
  rouge_2_list.append(rouge_2_score)
  rouge_L_list.append(rouge_L_score)

  precision = (rougeL_scores['rougeL'].precision)
  recall = (rougeL_scores['rougeL'].recall)
  fmeasure = (rougeL_scores['rougeL'].fmeasure)

  rougeL_precision.append(precision)
  rougeL_recall.append(recall)
  rougeL_fmeasure.append(fmeasure)

  # calculate cosine similarity between the spaCy document vectors
  # (predictions and references are already strings, so they are passed directly)
  doc1 = nlp(predictions)
  doc2 = nlp(references)
  cosine_similarity = doc1.similarity(doc2)
  cosine_similarity_results.append(cosine_similarity)

  count = count + 1

print('\nTime:', time.time() - start)


Example: 1
{'rouge1': 0.0846824408468244, 'rouge2': 0.039950062421972535, 'rougeL': 0.0697384806973848, 'rougeLsum': 0.0697384806973848}
Example: 2
{'rouge1': 0.03316062176165803, 'rouge2': 0.012461059190031154, 'rougeL': 0.026943005181347155, 'rougeLsum': 0.026943005181347155}
Example: 3
{'rouge1': 0.07672634271099744, 'rouge2': 0.020565552699228794, 'rougeL': 0.06138107416879795, 'rougeLsum': 0.06138107416879795}
Example: 4
{'rouge1': 0.09574468085106383, 'rouge2': 0.039145907473309614, 'rougeL': 0.06382978723404255, 'rougeLsum': 0.06382978723404255}
Example: 5
{'rouge1': 0.10989010989010989, 'rouge2': 0.022099447513812154, 'rougeL': 0.08241758241758242, 'rougeLsum': 0.08241758241758242}
Example: 6
{'rouge1': 0.04852941176470588, 'rouge2': 0.010309278350515462, 'rougeL': 0.030882352941176472, 'rougeLsum': 0.030882352941176472}
Example: 7
{'rouge1': 0.1696113074204947, 'rouge2': 0.021352313167259784, 'rougeL': 0.09187279151943463, 'rougeLsum': 0.09187279151943463}
Example: 8
{'rouge1'

In [None]:
# Print Compute Time
print('\nTime:', time.time() - start)


Time: 7440.9349772930145


In [None]:
# Export extractive summary to a CSV

# get article from dataset
input_article_list = small_dataset['document'][0:size_of_dataset]

# get summary from dataset
input_highlights_list = small_dataset['summary'][0:size_of_dataset]

df = pd.DataFrame(list(zip(input_article_list, input_highlights_list, output_highlights_list)),
                  columns = ['orig_article', 'orig_summary', 'extracted_summary'])

# Edit this filepath to wherever you saved the data in your Drive
filepath = 'drive/My Drive/Colab_Notebooks_1/model_4a1_extracted_mediasum1000.csv'

df.to_csv(filepath,index = False)

In [None]:
# read back the csv file
data_import = pd.read_csv(filepath)        
#data_import.rename(columns = {'0':'orig_article', '1':'orig_summary', '2':'extracted_summary'}, inplace = True)

col1 = data_import.orig_article.values.tolist()
col2 = data_import.orig_summary.values.tolist()
col3 = data_import.extracted_summary.values.tolist()

In [None]:
# Validate data_import
print(col1[0])
print("")

print(col2[0])
print("")

print(col3[0])

FARAI CHIDEYA, host: Now, moving on, Forest Whitaker as Moses, Tisha Campbell Martin as Mary Magdalene - well, that's all in "The Bible Experience. " A New Testament edition was released in 2006.  This edition is billed as "The Complete Bible. " It doesn't have one person reading the gospels.  It features nearly 400 African-American artists, actors and ministers, plus sound effects. FARAI CHIDEYA, host: Just listen to Blair Underwood's rendition of Jesus on the cross. Mr.  BLAIR UNDERWOOD (Actor): (As Jesus) My God, my God, why have you forsaken me?.  FARAI CHIDEYA, host: Now, we've got two people affiliated with the project with us today.  Kyle Bowser, he co-produced "The Bible Experience" and actress Wendy Raquel Robinson, one of the actors in "The Bible Experience," and she also stars in the CW series, "The Game. "FARAI CHIDEYA, host: Hi folks, how are you doing?.  Ms.  WENDY RAQUEL ROBINSON (Actress): Great. Mr.  KYLE BOWSER (Co-producer, "The Bible Experience: The Complete Bible")

In [None]:
# Calculate Mean Rouge for Dataset

print("RougeL Precision Scores")
print(rougeL_precision)
print(len(rougeL_precision))
print(np.mean(np.asarray(rougeL_precision)))
print("")

print("RougeL Recall Scores")
print(rougeL_recall)
print(len(rougeL_recall))
print(np.mean(np.asarray(rougeL_recall)))
print("")

print("RougeL Fmeasure Scores")
print(rougeL_fmeasure)
print(len(rougeL_fmeasure))
print(np.mean(np.asarray(rougeL_fmeasure)))
print("")

print("Rouge 1 Scores")
print(rouge_1_list)
print(len(rouge_1_list))
print(np.mean(np.asarray(rouge_1_list)))
print("")

print("Rouge 2 Scores")
print(rouge_2_list)
print(len(rouge_2_list))
print(np.mean(np.asarray(rouge_2_list)))
print("")

print("Rouge L Scores")
print(rouge_L_list)
print(len(rouge_L_list))
print(np.mean(np.asarray(rouge_L_list)))
print("")


RougeL Precision Scores
[0.6511627906976745, 0.5909090909090909, 0.46153846153846156, 0.4418604651162791, 0.5, 0.5238095238095238, 0.34210526315789475, 0.34210526315789475, 0.6, 0.5294117647058824, 0.48, 0.5, 0.575, 0.27906976744186046, 0.4358974358974359, 0.391304347826087, 0.43478260869565216, 0.43478260869565216, 0.5517241379310345, 0.3939393939393939, 0.37209302325581395, 0.3939393939393939, 0.4146341463414634, 0.3548387096774194, 0.41818181818181815, 0.7111111111111111, 0.5151515151515151, 0.42857142857142855, 0.45714285714285713, 0.4634146341463415, 0.3673469387755102, 0.36585365853658536, 0.5, 0.2608695652173913, 0.6282051282051282, 0.6779661016949152, 0.40963855421686746, 0.36585365853658536, 0.2903225806451613, 0.5, 0.3870967741935484, 0.509090909090909, 0.6595744680851063, 0.66, 0.5510204081632653, 0.5666666666666667, 0.3939393939393939, 0.18518518518518517, 0.46511627906976744, 0.5217391304347826, 0.45454545454545453, 0.6511627906976745, 0.47058823529411764, 0.84615384615384
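
To show what the ROUGE-L precision, recall, and F-measure above represent, here is a from-scratch sketch based on the longest common subsequence (LCS). It is a simplified illustration, not the `rouge_score` implementation: `lcs_length` and `rouge_l` are hypothetical helpers, tokenization is plain whitespace splitting, and no stemming is applied. The two sentences are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

def rouge_l(reference, prediction):
    """Return (precision, recall, fmeasure) of ROUGE-L for two strings."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    lcs = lcs_length(ref_tokens, pred_tokens)
    precision = lcs / len(pred_tokens) if pred_tokens else 0.0
    recall = lcs / len(ref_tokens) if ref_tokens else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure

reference = "the cat sat on the mat"
prediction = "the cat lay on the mat today"
precision, recall, fmeasure = rouge_l(reference, prediction)
print(round(precision, 3), round(recall, 3), round(fmeasure, 3))
```

Precision divides the LCS length by the prediction length and recall divides it by the reference length, so swapping the two arguments swaps precision and recall while leaving the F-measure unchanged.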

In [None]:
# Calculate Mean Cosine Similarity

print("Cosine Similarity")
print(np.mean(np.asarray(cosine_similarity_results)))
print("")


Cosine Similarity
0.9890104177004241



In [None]:
dataset = load_dataset('csv', data_files = filepath, split='train' )



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-c01011be978bf828/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-c01011be978bf828/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.


In [None]:
dataset

Dataset({
    features: ['orig_article', 'orig_summary', 'extracted_summary'],
    num_rows: 1000
})