
This notebook takes the following approach:


* **Loads two documents:** one to summarize, one as style reference.

* **Estimates token length** from the whitespace word count, a rough proxy for checking against the 4000-token limit.

* **Performs chunk-based extractive summarization**, scoring each sentence by its summed TF-IDF cosine similarity to the other sentences (a TextRank-style centrality measure).

* **Repeats the summarization** if the result is still over the limit, until it fits.

* **Saves the summaries.**

* **Prints a query prompt** to generate a style-following summary.

# Data Collection

This is where I load my input and style reference documents.


In [1]:
import os
import re
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
# === Data Collection ===
def load_documents(input_path, style_path):
    with open(input_path, "r", encoding="utf-8") as f:
        input_text = f.read()

    with open(style_path, "r", encoding="utf-8") as f:
        style_reference_text = f.read()

    return input_text, style_reference_text

# Preprocessing

Normalize whitespace, split the text into sentences, and estimate the token count.

In [3]:
# === Preprocessing ===
def clean_text(text):
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def tokenize_sentences(text):
    return sent_tokenize(text)

def estimate_token_count(text):
    return len(text.split())
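Word count usually undercounts model tokens, because subword tokenizers often split one word into several tokens. The sketch below scales the word count by a ratio to get a more conservative estimate. The function name `estimate_tokens_from_words` and the default of roughly 75 words per 100 tokens are assumptions for illustration (a common rule of thumb for English text), not values taken from this notebook or from any specific tokenizer.

```python
def estimate_tokens_from_words(text: str, words_per_100_tokens: int = 75) -> int:
    """Rough token estimate from the whitespace word count.

    ASSUMPTION: about 75 English words correspond to 100 tokens. The real
    ratio depends on the tokenizer and on the text itself.
    """
    word_count = len(text.split())
    # Integer ceiling division, so the estimate errs toward overcounting.
    return (word_count * 100 + words_per_100_tokens - 1) // words_per_100_tokens


if __name__ == "__main__":
    sample = "one two three"
    print(len(sample.split()), "words ->", estimate_tokens_from_words(sample), "estimated tokens")
```

When an exact count matters, the tokenizer of the target model is the only reliable source; this ratio is just a cheap safety margin.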

# Feature Extraction

Use TF-IDF and cosine similarity to rank sentences for summarization.


In [4]:
# === Feature Extraction ===
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

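def summarize_text(text, num_sentences=5):
    sentences = tokenize_sentences(text)
    # Nothing to cut if the text is already short enough.
    if len(sentences) <= num_sentences:
        return text

    vectorizer = TfidfVectorizer(stop_words='english')
    try:
        X = vectorizer.fit_transform(sentences)
    except ValueError:
        # Raised when no terms survive stop-word removal (empty vocabulary);
        # fall back to the leading sentences.
        return ' '.join(sentences[:num_sentences])
    sim_matrix = cosine_similarity(X)

    # Centrality score: each sentence's total similarity to all sentences.
    scores = sim_matrix.sum(axis=1)
    # Keep the top-scoring sentences in document order. Sorting the indices
    # (not the strings) stays correct when a sentence appears more than once.
    top_indices = sorted(np.argsort(scores)[-num_sentences:])
    return ' '.join(sentences[i] for i in top_indices)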
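To make the scoring step concrete, here is a dependency-free sketch of the same idea: build a TF-IDF vector per sentence, compute pairwise cosine similarity, and score each sentence by the sum of its similarities. The helper names (`tfidf_vectors`, `cosine`, `centrality_scores`, `top_sentences`) are hypothetical. The sketch uses a simple regex tokenizer and no stop-word list, so its numbers will not match `TfidfVectorizer(stop_words='english')` exactly; only the smoothed IDF formula follows scikit-learn's default.

```python
import math
import re
from collections import Counter


def tfidf_vectors(sentences):
    """Return one sparse {term: weight} TF-IDF vector per sentence."""
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    n = len(tokenized)
    # Document frequency: number of sentences containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centrality_scores(sentences):
    """Score each sentence by its summed similarity to every sentence."""
    vectors = tfidf_vectors(sentences)
    return [sum(cosine(v, other) for other in vectors) for v in vectors]


def top_sentences(sentences, k):
    """Pick the k highest-scoring sentences, returned in document order."""
    scores = centrality_scores(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]


if __name__ == "__main__":
    docs = ["Cats chase mice.", "Cats chase birds.", "Stocks fell sharply."]
    for sentence, score in zip(docs, centrality_scores(docs)):
        print(f"{score:.3f}  {sentence}")
```

Sentences that share vocabulary with many others score highest, while a sentence that shares no terms with the rest scores only its self-similarity of 1. This is degree centrality on the similarity graph, a simpler relative of the PageRank iteration used in TextRank proper.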


# Model Training / Hierarchical Summarization

Iteratively summarize large texts to fit the context window (e.g., 4000 tokens).

In [5]:
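# === Model Training / Hierarchical Summarization ===
TOKEN_LIMIT = 4000  # budget in estimated tokens (whitespace-separated words)

def hierarchical_summarize(text, target_token_limit=TOKEN_LIMIT):
    # Work on the cleaned text so the returned summary is always normalized.
    text = clean_text(text)
    sentences = tokenize_sentences(text)
    token_estimate = estimate_token_count(text)

    while token_estimate > target_token_limit:
        # Greedily pack consecutive sentences into chunks within the budget.
        chunks = []
        chunk = []
        chunk_tokens = 0

        for sent in sentences:
            sent_tokens = estimate_token_count(sent)
            # Only flush a non-empty chunk, so an oversized first sentence
            # does not produce an empty chunk.
            if chunk and chunk_tokens + sent_tokens > target_token_limit:
                chunks.append(' '.join(chunk))
                chunk = [sent]
                chunk_tokens = sent_tokens
            else:
                chunk.append(sent)
                chunk_tokens += sent_tokens

        if chunk:
            chunks.append(' '.join(chunk))

        # Summarize each chunk, then re-split the summaries for the next pass.
        summaries = [summarize_text(chunk, num_sentences=5) for chunk in chunks]
        sentences = []
        for s in summaries:
            sentences.extend(tokenize_sentences(s))
        text = ' '.join(sentences)
        new_estimate = estimate_token_count(text)

        # Stop if a pass made no progress (e.g. every chunk already has five
        # or fewer sentences); otherwise the loop would never terminate.
        if new_estimate >= token_estimate:
            break
        token_estimate = new_estimate

    return text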
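The chunking step inside the loop can be looked at in isolation. The sketch below packs consecutive sentences greedily into chunks whose estimated size stays within a budget. `pack_sentences` is a hypothetical helper name, and token cost is again approximated by whitespace word count, as in the rest of this notebook.

```python
def pack_sentences(sentences, budget):
    """Greedily group consecutive sentences into chunks of at most `budget` words.

    A single sentence longer than the budget still gets a chunk of its own,
    so no sentence is dropped and no empty chunk is emitted.
    """
    chunks = []
    current = []
    used = 0
    for sent in sentences:
        cost = len(sent.split())  # word count as a token proxy
        # Start a new chunk only if the current one already holds something.
        if current and used + cost > budget:
            chunks.append(current)
            current = []
            used = 0
        current.append(sent)
        used += cost
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    demo = ["a b c", "d e", "f g h i", "j"]
    for chunk in pack_sentences(demo, budget=5):
        print(chunk)
```

Because chunks are built from consecutive sentences, each chunk summary stays locally coherent. The trade-off is that sentence importance is judged only within a chunk, not across the whole document.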


# Evaluation

Save both summaries to disk and print an example style-following prompt.

In [6]:
# === Evaluation ===
def generate_styled_summary(input_text, style_reference_text):
    print("→ Measuring document lengths...")
    len_input = estimate_token_count(input_text)
    len_style = estimate_token_count(style_reference_text)

    print(f"→ Input tokens: {len_input}, Style reference tokens: {len_style}")

    print("→ Summarizing input text hierarchically...")
    summarized_input = hierarchical_summarize(input_text, TOKEN_LIMIT)

    print("→ Summarizing style reference text hierarchically...")
    summarized_style = hierarchical_summarize(style_reference_text, TOKEN_LIMIT)

    with open("summarized_input.txt", "w", encoding="utf-8") as f:
        f.write(summarized_input)

    with open("summarized_style.txt", "w", encoding="utf-8") as f:
        f.write(summarized_style)

    print("\n--- STYLE-FOLLOWING QUERY EXAMPLE ---")
    print("Please summarize the following text using the style of this reference text.")
    print("\nSTYLE SAMPLE:\n", summarized_style[:300] + "...")
    print("\nTEXT TO SUMMARIZE:\n", summarized_input[:300] + "...")
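The function above only prints a truncated preview of the prompt. As a sketch of how the same query could be assembled as a single string and passed to a language model, the hypothetical helper `build_style_prompt` below returns the prompt instead of printing it. Including the full summaries rather than the first 300 characters is an assumption about how the prompt would be used, not something this notebook does.

```python
def build_style_prompt(style_sample: str, text_to_summarize: str) -> str:
    """Assemble a style-following summarization prompt as one string.

    ASSUMPTION: both inputs have already been reduced to fit the context
    window, so they are included in full.
    """
    return (
        "Please summarize the following text using the style of this reference text.\n\n"
        f"STYLE SAMPLE:\n{style_sample}\n\n"
        f"TEXT TO SUMMARIZE:\n{text_to_summarize}"
    )


if __name__ == "__main__":
    print(build_style_prompt("Short, punchy sentences.", "A long report about rainfall."))
```

Returning a string keeps prompt construction separate from file I/O and printing, which makes it easier to test and to reuse with whichever model client is chosen.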


# Main Function

To tie everything together:


In [7]:
# === Entry Point ===
if __name__ == "__main__":
    input_text, style_reference_text = load_documents("input_document.txt", "style_document.txt")
    generate_styled_summary(input_text, style_reference_text)


FileNotFoundError: [Errno 2] No such file or directory: 'input_document.txt'
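The error above means `input_document.txt` was not found in the working directory. A minimal sketch of checking for the expected files before running the pipeline is shown below; `find_missing_files` is a hypothetical helper, and the two file names are the ones used in the entry point.

```python
from pathlib import Path


def find_missing_files(*paths):
    """Return the subset of `paths` that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


if __name__ == "__main__":
    missing = find_missing_files("input_document.txt", "style_document.txt")
    if missing:
        print("Missing files:", ", ".join(missing))
        print("Working directory:", Path.cwd())
    else:
        print("Both documents found.")
```

In a hosted notebook the working directory is often not where the documents were saved, so printing `Path.cwd()` alongside the missing names helps locate the mismatch.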