In [1]:
!git clone https://github.com/lekshmi-j/automatic-text-summarization.git



Cloning into 'automatic-text-summarization'...
remote: Enumerating objects: 55, done.
remote: Counting objects: 100% (55/55), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 55 (delta 21), reused 30 (delta 8), pack-reused 0 (from 0)
Receiving objects: 100% (55/55), 168.90 KiB | 1.28 MiB/s, done.
Resolving deltas: 100% (21/21), done.


In [2]:
%cd automatic-text-summarization

/content/automatic-text-summarization


In [3]:
import nltk

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("punkt_tab")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [4]:
!pip install rouge-score -q


  Preparing metadata (setup.py) ... done
  Building wheel for rouge-score (setup.py) ... done


In [5]:
from datasets import load_dataset
from rouge_score import rouge_scorer
import pandas as pd


Prepare summaries for comparison

In [6]:
from src.preprocess import preprocess_article
from src.extractive import (
    tfidf_sentence_scores,
    build_similarity_matrix,
    textrank_scores,
    get_top_sentences
)
from src.abstractive import summarize_text, chunk_text, summarize_article


Device set to use cpu


In [7]:
dataset = load_dataset("cnn_dailymail", "3.0.0")

sample = dataset["test"][50]   # fixed index for reproducibility
article = sample["article"]
reference_summary = sample["highlights"]



In [9]:
print("REFERENCE SUMMARY:\n")
print(reference_summary)


REFERENCE SUMMARY:

An outside review found that a Rolling Stone article about campus rape was "deeply flawed"
Danny Cevallos says that there are obstacles to a successful libel case, should one be filed .


Generate Extractive Summary (TF-IDF)

In [10]:
original, cleaned = preprocess_article(article)

tfidf_scores = tfidf_sentence_scores(cleaned)
extractive_summary = get_top_sentences(
    original, tfidf_scores, k=5
)
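
The internals of `tfidf_sentence_scores` and `get_top_sentences` are not shown in this notebook. As a rough sketch of the general technique, assuming the cleaned sentences are already whitespace-tokenizable: each sentence is treated as a document, each term is weighted by TF-IDF (term frequency normalized by sentence length), and a sentence's score is the sum of its term weights. The helper names `score_sentences_tfidf` and `pick_top_k` are hypothetical and are not the repository's functions.

```python
import math
from collections import Counter


def score_sentences_tfidf(cleaned_sentences):
    """Score each sentence by the sum of its length-normalized TF-IDF weights."""
    tokenized = [s.split() for s in cleaned_sentences]
    n = len(tokenized)
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        tf = Counter(tokens)
        # Smoothed IDF, so a term present in every sentence keeps a positive weight.
        total = sum(
            (count / len(tokens)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        )
        scores.append(total)
    return scores


def pick_top_k(original_sentences, scores, k=5):
    """Return the k highest-scoring sentences, kept in their original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return " ".join(original_sentences[i] for i in sorted(ranked))


# Toy input (made up for illustration, not taken from the dataset).
toy_original = [
    "The cat sat on the mat.",
    "A rare comet crossed the northern sky.",
    "The cat sat.",
]
toy_cleaned = ["cat sat mat", "rare comet crossed northern sky", "cat sat"]
toy_scores = score_sentences_tfidf(toy_cleaned)
toy_summary = pick_top_k(toy_original, toy_scores, k=1)
print(toy_summary)
```

Sentences built from terms that are rare across the article score highest here, which is why scoring uses the cleaned text while the summary is assembled from the original sentences.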


Generate Graph-Based Summary (TextRank)

In [11]:
sim_matrix = build_similarity_matrix(cleaned)
textrank_scores_vec = textrank_scores(sim_matrix)

graph_summary = get_top_sentences(
    original, textrank_scores_vec, k=5
)
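
How `build_similarity_matrix` and `textrank_scores` work internally is not shown here. A minimal sketch of the usual TextRank recipe, assuming word-overlap (Jaccard) similarity between cleaned sentences and a PageRank-style power iteration with damping 0.85; the repository may well use a different similarity, such as cosine over TF-IDF vectors. The names `jaccard_matrix` and `rank_sentences` are hypothetical.

```python
def jaccard_matrix(cleaned_sentences):
    """Pairwise Jaccard similarity between sentences; the diagonal stays 0."""
    token_sets = [set(s.split()) for s in cleaned_sentences]
    n = len(token_sets)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            union = token_sets[i] | token_sets[j]
            if union:
                sim[i][j] = len(token_sets[i] & token_sets[j]) / len(union)
    return sim


def rank_sentences(sim, damping=0.85, iterations=50):
    """PageRank-style power iteration over the sentence similarity graph."""
    n = len(sim)
    if n == 0:
        return []
    # Each sentence spreads its score over its neighbours in proportion to similarity.
    row_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            incoming = sum(
                scores[i] * sim[i][j] / row_sums[i]
                for i in range(n)
                if row_sums[i] > 0  # sentences with no overlap pass on nothing
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


# Toy input (made up for illustration): the third sentence shares no words with the rest.
toy_cleaned = ["cat sat mat", "cat sat rug", "comet crossed sky", "cat mat rug"]
toy_sim = jaccard_matrix(toy_cleaned)
toy_ranks = rank_sentences(toy_sim)
print(toy_ranks)
```

A sentence that overlaps with many others accumulates score from them, while an isolated sentence keeps only the baseline `(1 - damping) / n`.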


Generate Abstractive Summary (Transformer)
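
The imported names `chunk_text` and `summarize_article` suggest that long articles are split into pieces before being passed to the model, since BART-style models accept only a limited number of input tokens. A minimal sketch of that chunk-then-summarize pattern, under stated assumptions: chunk size is counted in words rather than model tokens, and `first_sentence` is only a stand-in for the real model call. `chunk_by_words`, `summarize_long_text`, and `first_sentence` are hypothetical names, not the repository's functions.

```python
def chunk_by_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


def first_sentence(chunk):
    """Stand-in summarizer: returns the chunk's first sentence (not a real model)."""
    head, sep, _ = chunk.partition(". ")
    return head + "." if sep else chunk


def summarize_long_text(text, summarize_chunk=first_sentence, max_words=400):
    """Summarize each chunk independently and join the partial summaries."""
    chunks = chunk_by_words(text, max_words=max_words)
    return " ".join(summarize_chunk(chunk) for chunk in chunks)


# Toy input (made up for illustration): 15 words, split into three 5-word chunks.
toy_text = "Alpha beta gamma. Delta epsilon. " * 3
toy_result = summarize_long_text(toy_text, max_words=5)
print(toy_result)
```

In a real pipeline, `summarize_chunk` would wrap the transformer call, and the chunk size would be chosen from the model's token limit rather than a word count.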

In [14]:
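# Assumption: summarize_article accepts the raw article text and returns the
# summary as a string; adjust the call if the function's signature differs.
abstractive_summary = summarize_article(article)

print("ORIGINAL ARTICLE (TRUNCATED):\n")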
print(article[:800], "\n")

print("HUMAN SUMMARY:\n")
print(reference_summary, "\n")

print("EXTRACTIVE (TF-IDF) SUMMARY:\n")
print(extractive_summary, "\n")

print("GRAPH-BASED (TextRank) SUMMARY:\n")
print(graph_summary, "\n")

print("ABSTRACTIVE (BART) SUMMARY:\n")
print(abstractive_summary)


ORIGINAL ARTICLE (TRUNCATED):

(CNN)According to an outside review by Columbia Journalism School professors, "(a)n institutional failure at Rolling Stone resulted in a deeply flawed article about a purported gang rape at the University of Virginia." The Columbia team concluded that "The failure encompassed reporting, editing, editorial supervision and fact-checking." Hardly a ringing endorsement of the editorial process at the publication. The magazine's managing editor, Will Dana, wrote, "We would like to apologize to our readers and to all of those who were damaged by our story and the ensuing fallout, including members of the Phi Kappa Psi fraternity and UVA administrators and students." Brian Stelter: Fraternity to 'pursue all available legal action' The next question is: . Can UVA, Phi Kappa Psi or any of the other 

HUMAN SUMMARY:

An outside review found that a Rolling Stone article about campus rape was "deeply flawed"
Danny Cevallos says that there are obstacles to a successfu

In [15]:
scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL"],
    use_stemmer=True
)

summaries = {
    "Extractive_TFIDF": extractive_summary,
    "Graph_TextRank": graph_summary,
    "Abstractive_BART": abstractive_summary
}

results = {}

for method, summary in summaries.items():
    score = scorer.score(reference_summary, summary)
    results[method] = {
        "ROUGE-1": score["rouge1"].fmeasure,
        "ROUGE-2": score["rouge2"].fmeasure,
        "ROUGE-L": score["rougeL"].fmeasure
    }
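
To make the scores easier to interpret: ROUGE-1 counts the unigrams shared by the reference and the candidate, and the F-measure is the harmonic mean of precision (overlap / candidate length) and recall (overlap / reference length). A simplified sketch in plain Python, assuming lowercase whitespace tokenization and no stemming, so its values will not exactly match `rouge_score` with `use_stemmer=True`. `rouge1_f` is a hypothetical helper, not part of the library.

```python
from collections import Counter


def rouge1_f(reference, candidate):
    """Unigram-overlap F-measure between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts only as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Toy pair (made up for illustration): 3 shared tokens, precision 3/4, recall 3/6.
toy_f = rouge1_f("the review found the article flawed", "the article was flawed")
print(round(toy_f, 3))
```

Because the extractive summaries use `k=5` full sentences while the reference is two short highlights, precision is low by construction, which pulls the F-measure down for every method. ROUGE-2 applies the same idea to bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.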


In [16]:
df = pd.DataFrame(results).T
df


                   ROUGE-1   ROUGE-2   ROUGE-L
Extractive_TFIDF  0.157025  0.041667  0.115702
Graph_TextRank    0.136986  0.000000  0.082192
Abstractive_BART  0.179310  0.027972  0.124138
