<a href="https://colab.research.google.com/github/hegame1998/NLP-Course-Assignments/blob/main/NLP%20Assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook takes the following approach:


* **Loads two documents:** one to summarize, one as style reference.

* **Estimates token length** using word count (proxy for 4000-token limit).

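* **Performs chunk-based summarization** by scoring each sentence on its content words, with a bonus for sentences close to the style reference's average sentence length (TextRank-style ranking with TF-IDF cosine similarity is a possible alternative).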

* **If the summary is too large**, it recursively shrinks it.

* **Saves the summaries.**

* **Prints a query prompt** to generate a style-following summary.

# Data Collection

This is where I load my input and style reference documents.<br> Read two input text files (T1: style source, T2: text to summarize).




In [1]:
import os
import re
import string
from collections import Counter

import nltk
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Colab-only helper; files.upload() can be used to upload the two input files.
from google.colab import files

# Tokenizer, POS tagger, and stop-word data used by the cells below.
# Newer NLTK releases may also need 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

In [None]:
# === Data Collection ===
def load_documents(style_path, target_path):
    with open(style_path, 'r', encoding='utf-8') as f:
        style_text = f.read()
    with open(target_path, 'r', encoding='utf-8') as f:
        target_text = f.read()
    return style_text, target_text

# Preprocessing

Clean and tokenize the text.

* Tokenize T1 and T2.

* Count token lengths.

* Define target lengths proportionally.

In [None]:
# === Preprocessing ===
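def preprocess(text):
    # Split the text into sentences and word-level tokens.
    sentences = sent_tokenize(text)
    words = word_tokenize(text)
    return sentences, words

def get_token_count(words):
    # Word count serves as a proxy for the model's token count.
    return len(words)

def split_into_chunks(sentences, max_tokens):
    # Greedily pack whole sentences into chunks of at most max_tokens words.
    chunks = []
    current_chunk = []
    token_count = 0
    for sent in sentences:
        n_tokens = len(word_tokenize(sent))
        # Close the chunk only if it already holds a sentence; this avoids an
        # empty chunk when a single sentence is longer than max_tokens.
        if current_chunk and token_count + n_tokens > max_tokens:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sent]
            token_count = n_tokens
        else:
            current_chunk.append(sent)
            token_count += n_tokens
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks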
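
The "define target lengths proportionally" step from the list above has no code in this cell. A minimal sketch of one way to do it, assuming the goal is to split a total sentence budget across chunks in proportion to their word counts. `allocate_sentence_budget` and its parameters are hypothetical names, not part of the notebook:

```python
def allocate_sentence_budget(chunk_word_counts, total_sentences, minimum=1):
    """Split total_sentences across chunks in proportion to their word counts.

    Hypothetical helper: every chunk gets at least `minimum` sentences, so the
    result can exceed total_sentences when there are many small chunks.
    """
    total_words = sum(chunk_word_counts)
    if total_words == 0:
        return [minimum for _ in chunk_word_counts]
    return [
        max(minimum, round(total_sentences * count / total_words))
        for count in chunk_word_counts
    ]


# A 3000-word chunk and a 1000-word chunk sharing 8 summary sentences.
budget = allocate_sentence_budget([3000, 1000], total_sentences=8)
print(budget)
```

Each value could then be passed as `target_sentences` when summarizing the corresponding chunk, so that longer chunks contribute more sentences.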


# Feature Extraction

Use TF-IDF and cosine similarity to rank sentences for summarization.

* Extract stylistic features from T1 (sentence length, punctuation usage, common POS tags).

* Optionally, compute average sentence length and POS tag distribution.


In [None]:
# === Feature Extraction ===
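def extract_style_features(text):
    # POS tag distribution over all tokens of the style reference.
    words = word_tokenize(text)
    tagged = nltk.pos_tag(words)
    pos_counts = Counter(tag for _, tag in tagged)
    # Average sentence length in tokens; 0.0 for empty input avoids dividing by zero.
    sentences = sent_tokenize(text)
    if sentences:
        avg_sentence_length = sum(len(word_tokenize(s)) for s in sentences) / len(sentences)
    else:
        avg_sentence_length = 0.0
    return {'pos_distribution': pos_counts, 'avg_sentence_length': avg_sentence_length}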
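
The list above also names punctuation usage as a stylistic feature, which `extract_style_features` does not compute. A minimal sketch of how it might be measured, using only the standard library. `punctuation_profile`, the set of marks, and the per-100-words normalization are assumptions for illustration:

```python
import re
from collections import Counter


def punctuation_profile(text):
    """Return punctuation marks per 100 words, keyed by mark (hypothetical helper)."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Only a small, assumed set of marks is counted; sentence-final periods are skipped.
    marks = Counter(ch for ch in text if ch in ",;:!?-()\"")
    if not words:
        return {}
    return {mark: 100 * count / len(words) for mark, count in marks.items()}


profile = punctuation_profile("Well, yes; but why? Because, frankly, it works.")
print(profile)
```

The resulting dictionary could be stored next to `pos_distribution` and `avg_sentence_length` and compared between T1 and a candidate summary.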



# Model Training (Extractive Summarizer + Style Bias)

Iteratively summarize large texts to fit the context window (e.g., 4000 tokens).

* We’ll implement an extractive summarization method using scoring (TF-IDF or frequency-based).

* No external model training is required.

* For style adaptation, re-rank or rewrite based on stylistic features from T1.

In [None]:
# === Model Training / Hierarchical Summarization ===
def score_sentences(sentences, style_features):
    # Returns one score per sentence, in the same order as `sentences`.
    stop_words = set(stopwords.words('english'))
    scores = []
    for sent in sentences:
        tokens = word_tokenize(sent.lower())
        content_words = [w for w in tokens if w not in stop_words and w not in string.punctuation]
        score = len(content_words)  # number of content words in the sentence
        # Stylistic bonus: the sentence length, counted in tokens like
        # avg_sentence_length, is within 5 tokens of the style average.
        if abs(len(tokens) - style_features['avg_sentence_length']) < 5:
            score += 2
        scores.append(score)
    return scores

def summarize_chunk(chunk, style_features, target_sentences=5):
    sentences = sent_tokenize(chunk)
    scores = score_sentences(sentences, style_features)
    # Pick the top-scoring sentence indices, then restore document order
    # so the summary reads in the same sequence as the source.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:target_sentences]
    return ' '.join(sentences[i] for i in sorted(top))
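
The overview mentions TextRank-style ranking with TF-IDF cosine similarity, while the scorer in this cell counts content words. A minimal sketch of the TextRank-style alternative follows, written with the standard library only so it runs without extra data downloads. The function names, the regex tokenizer, the damping factor, and the iteration count are all assumptions; the hand-rolled TF-IDF and cosine steps could be swapped for scikit-learn's `TfidfVectorizer` and `cosine_similarity`, which the notebook already imports.

```python
import math
import re
from collections import Counter


def tfidf_vectors(sentences):
    """Build one sparse TF-IDF vector (dict of term -> weight) per sentence."""
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    n = len(sentences)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF, so a term found in every sentence keeps a positive weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Score sentences by PageRank-style power iteration over the similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    vectors = tfidf_vectors(sentences)
    # Edge weight between two different sentences is their cosine similarity.
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(
                scores[j] * sim[j][i] / out_weight[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


sentences = [
    "The cat sat on the mat.",
    "A cat sat on a mat near the door.",
    "Stock prices fell sharply on Monday.",
]
scores = textrank_scores(sentences)
ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
print([sentences[i] for i in ranked])
```

In this toy input the two cat sentences reinforce each other, so the unrelated third sentence ends up ranked last. The style bonus from `score_sentences` could be added on top of these scores before selecting sentences.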



# Evaluation

Output and save summaries, simulate a style-following prompt.

* Manual or ROUGE-based metrics (if available).

* Summary length check vs context window.

In [None]:
# === Evaluation ===
def check_summary_length(summary, token_limit=4000):
    words = word_tokenize(summary)
    return len(words) <= token_limit
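
The cell above covers only the length check. For the ROUGE-based option, a minimal sketch of a simplified ROUGE-1 (unigram overlap) is shown here. It assumes a human-written reference summary is available, which this notebook does not provide, and `rouge_1` is a hypothetical helper rather than a call into an existing ROUGE package:

```python
import re
from collections import Counter


def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall, and F1 (a simplified ROUGE-1)."""
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


result = rouge_1("the cat sat", "the cat sat on the mat")
print(result)
```

Recall shows how much of the reference the summary covers, and precision shows how much of the summary is supported by the reference. This sketch does no stemming or stop-word handling, so its numbers will differ from those of a full ROUGE implementation.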

# Main Function

The main function ties everything together.

* Orchestrates all components.

* Repeats summarization until the result fits within the token limit.


In [None]:
# === Entry Point ===
def hierarchical_summarize(style_text, target_text, token_limit=4000):
    style_features = extract_style_features(style_text)

    target_sentences, target_words = preprocess(target_text)
    if len(target_words) <= token_limit:
        return summarize_chunk(target_text, style_features)

    # Summarize each chunk separately, then join the partial summaries.
    chunks = split_into_chunks(target_sentences, token_limit)
    summaries = [summarize_chunk(chunk, style_features) for chunk in chunks]

    final_summary = ' '.join(summaries)

    # Re-summarize until within the limit; stop if a pass no longer shrinks
    # the text, which would otherwise loop forever.
    while not check_summary_length(final_summary, token_limit):
        shorter = summarize_chunk(final_summary, style_features)
        if len(shorter) >= len(final_summary):
            break
        final_summary = shorter

    return final_summary



# Usage Example

In [None]:
if __name__ == "__main__":
    style_text, target_text = load_documents('style.txt', 'to_summarize.txt')
    final_summary = hierarchical_summarize(style_text, target_text)

    with open("summary.txt", "w", encoding="utf-8") as f:
        f.write(final_summary)
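
The overview also lists printing a query prompt for a style-following summary, which the usage cell does not do. A minimal sketch of that step, assuming the prompt is handed to a language model outside this notebook. `build_style_prompt`, its `max_style_chars` parameter, and the prompt wording are hypothetical and illustrative only:

```python
def build_style_prompt(style_sample, summary, max_style_chars=1500):
    """Assemble a prompt asking a language model to restyle the summary.

    Hypothetical helper; only the first max_style_chars characters of the
    style reference are included to keep the prompt short.
    """
    excerpt = style_sample[:max_style_chars]
    return (
        "Rewrite the summary below so that it follows the writing style of the "
        "reference text. Keep the facts unchanged.\n\n"
        f"Reference text (style):\n{excerpt}\n\n"
        f"Summary to rewrite:\n{summary}\n"
    )


prompt = build_style_prompt("It was a dark and stormy night.", "Rain fell all night.")
print(prompt)
```

In the notebook, `style_text` and `final_summary` would take the place of the two example strings.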


* This is a basic extractive summarization method with stylistic guidance.

* It can be extended with `nltk.Text` for deeper stylistic mimicry, or switched to BERT embeddings for sentence similarity (if allowed).

* A further extension would be a transformer-based abstractive model from Hugging Face, with style controlled via prompt engineering.