
# Part 1: Extractive Summarization

## Step 1.1 – Imports & Example Sentences

In [1]:
# Import necessary libraries for extractive summarization
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Example sentences
sentences = [
    "Artificial intelligence is transforming industries.",
    "Applications of AI include healthcare, finance, and education.",
    "AI improves efficiency but raises ethical concerns like privacy.",
    "Healthcare benefits from AI in diagnostics and patient care."
]

print(f"Number of sentences: {len(sentences)}")
for i, s in enumerate(sentences, start=1):
    print(f"S{i}: {s}")


Number of sentences: 4
S1: Artificial intelligence is transforming industries.
S2: Applications of AI include healthcare, finance, and education.
S3: AI improves efficiency but raises ethical concerns like privacy.
S4: Healthcare benefits from AI in diagnostics and patient care.


## Step 1.2 – Build Cosine Similarity Matrix

In [2]:
# Function to calculate cosine similarity matrix
def build_similarity_matrix(sentences):
    """
    Build a cosine similarity matrix for a list of sentences using TF-IDF.
    """
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences)
    similarity_matrix = cosine_similarity(tfidf_matrix)
    return similarity_matrix

# Build similarity matrix for our example sentences
similarity_matrix = build_similarity_matrix(sentences)
similarity_matrix


array([[1.        , 0.        , 0.        , 0.        ],
       [0.        , 1.        , 0.05448421, 0.23139971],
       [0.        , 0.05448421, 1.        , 0.05079878],
       [0.        , 0.23139971, 0.05079878, 1.        ]])
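The zeros in the first row and column follow from the vocabulary: S1 spells out "artificial intelligence" while the other sentences use "AI", so after tokenization S1 shares no token with any of them. The sketch below is an illustrative check rather than part of the original lab. The names `example_sentences` and `shared_tokens` are introduced here, and the tokenization is assumed to be the default `TfidfVectorizer()` analyzer used above.

```python
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer

example_sentences = [
    "Artificial intelligence is transforming industries.",
    "Applications of AI include healthcare, finance, and education.",
    "AI improves efficiency but raises ethical concerns like privacy.",
    "Healthcare benefits from AI in diagnostics and patient care.",
]

# Default analyzer: lowercases text and keeps tokens of two or more characters
analyzer = TfidfVectorizer().build_analyzer()
token_sets = [set(analyzer(s)) for s in example_sentences]

# Tokens shared by each pair of sentences (1-based labels, as in S1..S4)
shared_tokens = {
    (i + 1, j + 1): sorted(token_sets[i] & token_sets[j])
    for i, j in combinations(range(len(example_sentences)), 2)
}

for (a, b), tokens in shared_tokens.items():
    print(f"S{a} & S{b}: {tokens}")
```

Pairs with no shared tokens have cosine similarity 0, and the S2/S4 pair, which shares the most tokens, has the largest off-diagonal value in the matrix.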

## Step 1.3 – Implement TextRank

In [3]:
# Function for TextRank Algorithm
def textrank(sentences, similarity_matrix, damping=0.85, max_iter=100, tol=1e-4):
    """
    Run TextRank on a set of sentences using a similarity matrix.
    Returns a list of (score, sentence) sorted in descending score order.
    """
    n = len(sentences)
    G = nx.Graph()

    # Build graph: nodes are sentence indices, edges weighted by similarity
    for i in range(n):
        for j in range(n):
            if i != j:
                G.add_edge(i, j, weight=similarity_matrix[i][j])

    # PageRank on sentence graph
    scores = nx.pagerank(G, alpha=damping, max_iter=max_iter, tol=tol)

    # Sort sentences by score (high → low)
    ranked_sentences = sorted(
        ((scores[i], s) for i, s in enumerate(sentences)),
        reverse=True
    )
    return ranked_sentences

# Quick sanity check
textrank_results = textrank(sentences, similarity_matrix)
textrank_results


[(0.3934179368766816,
  'Applications of AI include healthcare, finance, and education.'),
 (0.3882139749244412,
  'Healthcare benefits from AI in diagnostics and patient care.'),
 (0.17074903250528736,
  'AI improves efficiency but raises ethical concerns like privacy.'),
 (0.047619055693589915, 'Artificial intelligence is transforming industries.')]
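To make the PageRank step less of a black box, here is a minimal power-iteration sketch in NumPy. It is an approximation of what `nx.pagerank` computes on a weighted graph, not networkx's actual code. The function name `pagerank_power_iteration` is hypothetical, and the handling of sentences with no outgoing weight (their score is spread uniformly over all nodes) is an assumption chosen to mirror networkx's default treatment of dangling nodes.

```python
import numpy as np

def pagerank_power_iteration(weights, damping=0.85, max_iter=200, tol=1e-8):
    """Weighted PageRank by power iteration on a similarity matrix."""
    W = np.array(weights, dtype=float)
    np.fill_diagonal(W, 0.0)          # ignore self-similarity
    n = W.shape[0]

    # Row-normalize into transition probabilities.
    # Rows with no outgoing weight fall back to a uniform distribution.
    row_sums = W.sum(axis=1)
    P = np.full((n, n), 1.0 / n)
    connected = row_sums > 0
    P[connected] = W[connected] / row_sums[connected][:, None]

    scores = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_scores = (1 - damping) / n + damping * (P.T @ scores)
        if np.abs(new_scores - scores).sum() < tol:
            return new_scores
        scores = new_scores
    return scores

# Similarity matrix copied from the Step 1.2 output
example_matrix = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.05448421, 0.23139971],
    [0.0, 0.05448421, 1.0, 0.05079878],
    [0.0, 0.23139971, 0.05079878, 1.0],
]
pr_scores = pagerank_power_iteration(example_matrix)
print(pr_scores)
```

On this matrix the sketch should land close to the scores printed above, with the second sentence ranked first and the first sentence receiving only the baseline score that every node gets.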

## Step 1.4 & 1.5 – Apply TextRank and Display Sentence Scores

In [4]:
# Rebuild similarity matrix (for clarity; already built above)
similarity_matrix = build_similarity_matrix(sentences)

# Apply TextRank
textrank_results = textrank(sentences, similarity_matrix)

# Print TextRank results
print("TextRank Sentence Scores:")
for score, sentence in textrank_results:
    print(f"Score: {score:.4f} | Sentence: {sentence}")


TextRank Sentence Scores:
Score: 0.3934 | Sentence: Applications of AI include healthcare, finance, and education.
Score: 0.3882 | Sentence: Healthcare benefits from AI in diagnostics and patient care.
Score: 0.1707 | Sentence: AI improves efficiency but raises ethical concerns like privacy.
Score: 0.0476 | Sentence: Artificial intelligence is transforming industries.


## Step 1.6 – Generate Extractive Summary with TextRank

In [5]:
def generate_textrank_summary(ranked_sentences, top_k=2):
    """
    Take ranked (score, sentence) pairs and return an ordered summary
    using the top_k sentences in their original order.
    """
    # Extract top_k sentences
    top_sentences = [sent for _, sent in ranked_sentences[:top_k]]

    # Preserve original order (relies on the global 'sentences' list defined earlier)
    ordered_summary = [s for s in sentences if s in top_sentences]
    return " ".join(ordered_summary)

textrank_summary = generate_textrank_summary(textrank_results, top_k=2)
print("TextRank Summary:")
print(textrank_summary)


TextRank Summary:
Applications of AI include healthcare, finance, and education. Healthcare benefits from AI in diagnostics and patient care.
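The summary function above restores document order by matching sentence text against the global `sentences` list. If the same sentence appeared twice in the input, that membership test would include every copy. A possible alternative, sketched below under the assumption that scores are available as a sequence aligned with the sentences, selects by index instead. The name `summarize_by_index` is hypothetical and not part of the lab.

```python
def summarize_by_index(sentences, scores, top_k=2):
    """Join the top_k highest-scoring sentences in their original order.

    `scores` is assumed to be a sequence with one score per sentence.
    """
    # Indices of the top_k scores, highest first
    top_indices = sorted(
        range(len(sentences)), key=lambda i: scores[i], reverse=True
    )[:top_k]
    # Re-sort the chosen indices so the summary follows document order
    return " ".join(sentences[i] for i in sorted(top_indices))

demo_sentences = ["a.", "b.", "a.", "c."]
demo_scores = [0.1, 0.4, 0.3, 0.2]
demo_summary = summarize_by_index(demo_sentences, demo_scores, top_k=2)
print(demo_summary)
```

Here indices 1 and 2 are selected, so exactly two sentences are returned even though the text "a." occurs twice in the input.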


## Step 1.7 – Implement LexRank


In [6]:
# Function for LexRank Algorithm
def lexrank(sentences, similarity_matrix, threshold=0.01):
    """
    Run LexRank on a set of sentences using a similarity matrix.
    Returns:
      - scores: dict[node_index -> score]
      - ranked_sentences: list of (score, sentence) sorted high → low
    """
    n = len(sentences)
    G = nx.DiGraph()

    # Build the graph with thresholding and a simple per-row scaling
    for i in range(n):
        # Count entries in row i above the threshold. This count includes the
        # diagonal (self-similarity is normally 1.0), so it is usually >= 1.
        row_sum = sum(similarity_matrix[i][j] > threshold for j in range(n))
        for j in range(n):
            if i != j and similarity_matrix[i][j] > threshold and row_sum > 0:
                # Scale by the count above (simple version). A sentence with no
                # edge above the threshold is never added to the graph.
                weight = similarity_matrix[i][j] / row_sum
                G.add_edge(i, j, weight=weight)

    # Compute PageRank on this directed graph
    scores = nx.pagerank(G, max_iter=100, tol=1e-6)

    # Build ranked sentence list
    ranked_sentences = sorted(
        ((scores[node], sentences[node]) for node in scores),
        reverse=True
    )
    return scores, ranked_sentences

# Build similarity matrix (again for clarity)
similarity_matrix = build_similarity_matrix(sentences)

# Apply LexRank
lexrank_scores, lexrank_results = lexrank(sentences, similarity_matrix)

# Print all LexRank scores
print("LexRank Sentence Scores (All Sentences):")
for node, score in lexrank_scores.items():
    print(f"Sentence {node + 1}: Score: {score:.4f}")

print("\nLexRank Ranked Sentences:")
for score, sentence in lexrank_results:
    print(f"Score: {score:.4f} | Sentence: {sentence}")


LexRank Sentence Scores (All Sentences):
Sentence 2: Score: 0.4130
Sentence 3: Score: 0.1793
Sentence 4: Score: 0.4077

LexRank Ranked Sentences:
Score: 0.4130 | Sentence: Applications of AI include healthcare, finance, and education.
Score: 0.4077 | Sentence: Healthcare benefits from AI in diagnostics and patient care.
Score: 0.1793 | Sentence: AI improves efficiency but raises ethical concerns like privacy.


## Step 1.7 (continued) – Generate LexRank Summary

In [7]:
def generate_lexrank_summary(ranked_sentences, top_k=2):
    """
    Take ranked (score, sentence) pairs and return an ordered summary
    using the top_k sentences in their original order.
    """
    top_sentences = [sent for _, sent in ranked_sentences[:top_k]]
    ordered_summary = [s for s in sentences if s in top_sentences]
    return " ".join(ordered_summary)

lexrank_summary = generate_lexrank_summary(lexrank_results, top_k=2)
print("LexRank Summary:")
print(lexrank_summary)

print("\nTextRank Summary (for comparison):")
print(textrank_summary)


LexRank Summary:
Applications of AI include healthcare, finance, and education. Healthcare benefits from AI in diagnostics and patient care.

TextRank Summary (for comparison):
Applications of AI include healthcare, finance, and education. Healthcare benefits from AI in diagnostics and patient care.


# Part 2: Abstractive Summarization

## Step 2.1 – Import Libraries and Define Source & Target Text


In [8]:
import torch
import torch.nn as nn
import torch.optim as optim

# Sample data: input sentence and target summary
source_text = ["artificial intelligence is transforming industries"]
target_text = ["ai transforms industries"]

# Vocabulary
vocab = ["<pad>", "<sos>", "<eos>", "artificial", "intelligence", "is",
         "transforming", "industries", "ai", "transforms"]

word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}

# Convert sentences to indices
def tokenize(text, word2idx):
    return [
        [word2idx["<sos>"]] +
        [word2idx[word] for word in sentence.split()] +
        [word2idx["<eos>"]]
        for sentence in text
    ]
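As a quick illustration of what `tokenize` returns, the block below restates the vocabulary and function from the cell above so it runs on its own, then encodes the target sentence. The `detokenize` helper is hypothetical and is added here only to show the round trip back to words.

```python
# Restated from the cell above so this block is self-contained
vocab = ["<pad>", "<sos>", "<eos>", "artificial", "intelligence", "is",
         "transforming", "industries", "ai", "transforms"]
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}

def tokenize(text, word2idx):
    return [
        [word2idx["<sos>"]] +
        [word2idx[word] for word in sentence.split()] +
        [word2idx["<eos>"]]
        for sentence in text
    ]

def detokenize(indices, idx2word):
    """Hypothetical helper: map indices back to words, dropping markers."""
    return " ".join(
        idx2word[i] for i in indices
        if idx2word[i] not in ("<pad>", "<sos>", "<eos>")
    )

encoded = tokenize(["ai transforms industries"], word2idx)
print(encoded)                           # one list of indices per sentence
print(detokenize(encoded[0], idx2word))

# Words outside the vocabulary are not handled: the dict lookup raises KeyError
try:
    tokenize(["ai transforms healthcare"], word2idx)
except KeyError as missing:
    print("Unknown word:", missing)
```

Each encoded sentence starts with the `<sos>` index (1) and ends with the `<eos>` index (2). Because the lookup is a plain dictionary access, every input word must already be in `vocab`.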


## Step 2.2 – Converting to PyTorch Tensors


In [9]:
source_indices = tokenize(source_text, word2idx)
target_indices = tokenize(target_text, word2idx)

# Convert to tensors
source_tensor = torch.tensor(source_indices, dtype=torch.long)
target_tensor = torch.tensor(target_indices, dtype=torch.long)


## Step 2.3 – Define Hyperparameters


In [10]:
# Hyperparameters
embedding_dim = 16
hidden_dim = 32
vocab_size = len(vocab)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


## Step 2.4 – Define Encoder


In [11]:
# Encoder
class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        embedded = self.embedding(x)
        outputs, (hidden, cell) = self.lstm(embedded)
        return hidden, cell

## Step 2.5 – Define Decoder


In [12]:
# Decoder
class Decoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden, cell):
        x = x.unsqueeze(1)  # Add sequence-length dimension: (batch,) -> (batch, 1)
        embedded = self.embedding(x)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        prediction = self.fc(output.squeeze(1))
        return prediction, hidden, cell
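The decoder processes one token per call, so it helps to trace the tensor shapes through a single step. The sketch below rebuilds the same three layers outside the class, using the hyperparameter values from Step 2.3, and uses zero tensors as stand-ins for the encoder state. The `shapes` dictionary is introduced here purely for inspection.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim, hidden_dim = 10, 16, 32
embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
fc = nn.Linear(hidden_dim, vocab_size)

x = torch.tensor([1])                    # (batch,): one token id per batch item
step = x.unsqueeze(1)                    # (batch, 1): sequence length of 1
embedded = embedding(step)               # (batch, 1, embedding_dim)

# Stand-ins for the encoder's final state: (num_layers, batch, hidden_dim)
hidden = torch.zeros(1, 1, hidden_dim)
cell = torch.zeros(1, 1, hidden_dim)

output, (hidden, cell) = lstm(embedded, (hidden, cell))  # output: (batch, 1, hidden_dim)
prediction = fc(output.squeeze(1))       # (batch, vocab_size) logits

shapes = {
    "step": tuple(step.shape),
    "embedded": tuple(embedded.shape),
    "output": tuple(output.shape),
    "hidden": tuple(hidden.shape),
    "prediction": tuple(prediction.shape),
}
print(shapes)
```

The `unsqueeze(1)` call is what turns a batch of token ids into a batch of length-1 sequences, which is the layout an LSTM with `batch_first=True` expects.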


## Step 2.6 – Define Seq2Seq Model


In [13]:
# Seq2Seq Model
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, source, target, teacher_forcing_ratio=0.5):
        batch_size = target.shape[0]
        target_len = target.shape[1]
        vocab_size = self.decoder.fc.out_features

        outputs = torch.zeros(batch_size, target_len, vocab_size).to(device)

        hidden, cell = self.encoder(source)

        x = target[:, 0]  # <sos> token

        for t in range(1, target_len):
            output, hidden, cell = self.decoder(x, hidden, cell)
            outputs[:, t, :] = output

            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            x = target[:, t] if teacher_force else output.argmax(1)

        return outputs


## Step 2.7 – Instantiate the Encoder and Decoder


In [14]:
# Instantiate the model
encoder = Encoder(vocab_size, embedding_dim, hidden_dim).to(device)
decoder = Decoder(vocab_size, embedding_dim, hidden_dim).to(device)
model = Seq2Seq(encoder, decoder).to(device)


## Step 2.7 (continued) – Define the Loss Function and Optimizer


In [15]:
# Loss and optimizer
criterion = nn.CrossEntropyLoss(ignore_index=word2idx["<pad>"])
optimizer = optim.Adam(model.parameters())


## Step 2.7 (continued) – Training Loop


In [16]:
# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()

    source = source_tensor.to(device)
    target = target_tensor.to(device)

    output = model(source, target)

    output = output[:, 1:].reshape(-1, vocab_size)
    target = target[:, 1:].reshape(-1)

    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")


Epoch [10/100], Loss: 2.2575
Epoch [20/100], Loss: 2.1125
Epoch [30/100], Loss: 1.8664
Epoch [40/100], Loss: 1.4928
Epoch [50/100], Loss: 1.1304
Epoch [60/100], Loss: 0.8545
Epoch [70/100], Loss: 0.6474
Epoch [80/100], Loss: 0.4986
Epoch [90/100], Loss: 0.3868
Epoch [100/100], Loss: 0.3021
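A useful reference point for these numbers is the loss of a model that assigns equal probability to all 10 vocabulary entries, which is ln(10) ≈ 2.3026. The loss of 2.2575 at epoch 10 is therefore only slightly better than uniform guessing. The sketch below checks that baseline with all-zero logits. The names `uniform_logits` and `uniform_loss` are introduced here for illustration.

```python
import math
import torch
import torch.nn as nn

vocab_size = 10
# Target indices after <sos>: "ai transforms industries <eos>"
target = torch.tensor([8, 9, 7, 2])

# All-zero logits give a uniform softmax over the vocabulary
uniform_logits = torch.zeros(len(target), vocab_size)

criterion = nn.CrossEntropyLoss(ignore_index=0)  # index 0 is <pad>
uniform_loss = criterion(uniform_logits, target).item()

print(f"Uniform-guess loss: {uniform_loss:.4f}")
print(f"ln(vocab_size):     {math.log(vocab_size):.4f}")
```

Comparing the training loss against this baseline is a quick way to see how much the model has learned at each checkpoint.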


## Step 2.7 (continued) – Generate Summary


In [18]:
# Generate a summary
model.eval()
with torch.no_grad():
    source = source_tensor.to(device)
    hidden, cell = encoder(source)

    x = torch.tensor([word2idx["<sos>"]]).to(device)
    summary = []

    for _ in range(10):
        output, hidden, cell = decoder(x, hidden, cell)
        x = output.argmax(1)
        word = idx2word[x.item()]

        if word == "<eos>":
            break

        summary.append(word)

    print("Generated Summary:", " ".join(summary))


Generated Summary: ai transforms industries


# Part 3: Laboratory Task

## Step 3.1 – Load Brown Corpus


In [19]:
# Import necessary libraries
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.corpus import brown

nltk.download('brown')

# Load built-in dataset (Brown Corpus)
sentences = [" ".join(sentence) for sentence in brown.sents(categories='news')[:100]]


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


## Step 3.1 (continued) – Build Similarity Matrix & Apply TextRank


In [20]:
# Build similarity matrix for Brown Corpus sentences
similarity_matrix = build_similarity_matrix(sentences)

# Apply TextRank
textrank_results = textrank(sentences, similarity_matrix)

# Display TextRank results (sentence scores)
print("TextRank Sentence Scores:")
for score, sentence in textrank_results[:5]:  # showing only top 5 for readability
    print(f"Score: {score:.4f}, Sentence: {sentence}")


TextRank Sentence Scores:
Score: 0.0205, Sentence: The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .
Score: 0.0204, Sentence: `` This is one of the major items in the Fulton County general assistance program '' , the jury said , but the State Welfare Department `` has seen fit to distribute these funds through the welfare departments of all the counties in the state with the exception of Fulton County , which receives none of this money .
Score: 0.0181, Sentence: `` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .
Score: 0.0170, Sentence: The jury also commented on the Fulton ordinary's court which has been under fire for its practices in the appointment of appraisers

## Step 3.1 (continued) – Apply LexRank


In [21]:
# Build similarity matrix (can reuse, but we rebuild here for clarity)
similarity_matrix = build_similarity_matrix(sentences)

# Apply LexRank
lexrank_scores, lexrank_results = lexrank(sentences, similarity_matrix)

# Print all LexRank Scores (All sentences with their scores)
print("LexRank Sentence Scores (All Sentences):")
for node, score in lexrank_scores.items():
    print(f"Sentence {node + 1}: Score: {score:.4f}")

# Print LexRank Summary (Ranked Sentences)
print("\nLexRank Summary (Top Ranked Sentences):")
for score, sentence in lexrank_results[:5]:  # showing top 5 to keep output readable
    print(f"Score: {score:.4f}, Sentence: {sentence}")


LexRank Sentence Scores (All Sentences):
Sentence 1: Score: 0.0113
Sentence 2: Score: 0.0207
Sentence 3: Score: 0.0120
Sentence 4: Score: 0.0183
Sentence 5: Score: 0.0119
Sentence 6: Score: 0.0133
Sentence 7: Score: 0.0150
Sentence 9: Score: 0.0124
Sentence 10: Score: 0.0110
Sentence 11: Score: 0.0095
Sentence 12: Score: 0.0095
Sentence 13: Score: 0.0126
Sentence 14: Score: 0.0106
Sentence 15: Score: 0.0206
Sentence 16: Score: 0.0080
Sentence 17: Score: 0.0127
Sentence 18: Score: 0.0076
Sentence 19: Score: 0.0172
Sentence 21: Score: 0.0131
Sentence 22: Score: 0.0133
Sentence 23: Score: 0.0108
Sentence 24: Score: 0.0162
Sentence 26: Score: 0.0090
Sentence 27: Score: 0.0113
Sentence 29: Score: 0.0109
Sentence 30: Score: 0.0147
Sentence 31: Score: 0.0083
Sentence 34: Score: 0.0054
Sentence 35: Score: 0.0091
Sentence 36: Score: 0.0120
Sentence 37: Score: 0.0090
Sentence 38: Score: 0.0129
Sentence 40: Score: 0.0114
Sentence 41: Score: 0.0082
Sentence 42: Score: 0.0093
Sentence 43: Score: 0.

## Step 3.2 – ROUGE Evaluation


In [23]:
!pip install rouge-score

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=8b7098fc731dda76a8401d6ee9807de97b6a86fb621d95e6db9f875e8a00eb24
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


In [24]:
from rouge_score import rouge_scorer

# Extract top-ranked sentences as summaries
textrank_summary = textrank_results[0][1]
lexrank_summary = lexrank_results[0][1]

# Choose a reference summary (using first sentence from Brown corpus)
reference_summary = sentences[0]

# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rouge3'], use_stemmer=True)

# Compute ROUGE scores (stored under new names so the LexRank score dict
# returned by lexrank() in the earlier cell is not overwritten)
lexrank_rouge = scorer.score(reference_summary, lexrank_summary)
textrank_rouge = scorer.score(reference_summary, textrank_summary)

# Print ROUGE comparison
print("\nROUGE Scores Comparison:")

print("\nLexRank ROUGE Scores:")
for metric, score in lexrank_rouge.items():
    print(f"{metric.upper()} - Precision: {score.precision:.4f}, Recall: {score.recall:.4f}, F1: {score.fmeasure:.4f}")

print("\nTextRank ROUGE Scores:")
for metric, score in textrank_rouge.items():
    print(f"{metric.upper()} - Precision: {score.precision:.4f}, Recall: {score.recall:.4f}, F1: {score.fmeasure:.4f}")



ROUGE Scores Comparison:

LexRank ROUGE Scores:
ROUGE1 - Precision: 0.1750, Recall: 0.3043, F1: 0.2222
ROUGE2 - Precision: 0.0256, Recall: 0.0455, F1: 0.0328
ROUGE3 - Precision: 0.0000, Recall: 0.0000, F1: 0.0000

TextRank ROUGE Scores:
ROUGE1 - Precision: 0.1750, Recall: 0.3043, F1: 0.2222
ROUGE2 - Precision: 0.0256, Recall: 0.0455, F1: 0.0328
ROUGE3 - Precision: 0.0000, Recall: 0.0000, F1: 0.0000


## Analysis of TextRank and LexRank Based on ROUGE Scores

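TextRank and LexRank produced identical ROUGE scores because both selected the same top-ranked sentence from the Brown corpus sample. That agreement is not surprising: both implementations run PageRank on a graph built from the same TF-IDF cosine similarity matrix. They differ only in graph construction (LexRank here uses a directed graph and drops edges at or below the 0.01 threshold), which shifted the top sentence's score slightly (0.0205 for TextRank versus 0.0207 for LexRank) but did not change which sentence ranked first.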

The ROUGE-1 scores (unigram overlap) show modest overlap with the reference (F1 = 0.2222), while ROUGE-2 is close to zero (F1 = 0.0328) and ROUGE-3 is zero. This is expected because bigram and trigram matches require exact consecutive word overlap, which is unlikely between two different single sentences. The low scores say little about summarization quality: the reference here is simply the first sentence of the corpus rather than a human-written summary, and ROUGE is strict when applied to very short texts.
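To see why the higher-order scores drop so quickly, the sketch below computes a simplified ROUGE-N by hand. It is not the `rouge_score` implementation: it assumes plain lowercase whitespace tokenization with no stemming, and the function names `ngram_counts` and `rouge_n` are introduced here for illustration.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n):
    """Return (precision, recall, f1) from clipped n-gram overlap."""
    ref = ngram_counts(reference.lower().split(), n)
    cand = ngram_counts(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

demo_reference = "the cat sat on the mat"
demo_candidate = "the cat lay on the mat"
for n in (1, 2, 3):
    p, r, f = rouge_n(demo_reference, demo_candidate, n)
    print(f"ROUGE-{n}: precision={p:.4f}, recall={r:.4f}, f1={f:.4f}")
```

Changing a single word leaves five of six unigrams matching, but only three of five bigrams and one of four trigrams, since every n-gram that spans the changed word is lost.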

Overall, neither algorithm performed better: both produced the same summary and therefore the same evaluation results. For this dataset, TextRank and LexRank behave alike, which is consistent with both ranking sentences from the same TF-IDF similarity matrix.
