<a href="https://colab.research.google.com/github/hegame1998/NLP-Assignment/blob/main/NLP_Assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Collection

I started by importing the **requests** library and then loaded my input texts directly from GitHub using their raw file URLs. These are:

* **source_text.txt →** the document I want to summarize.

* **style_text.txt →** the document that gives the style I want to follow.


In [None]:
# Data Collection & Import Library
import requests

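# Raw GitHub URLs of the two input files (replace these to use other texts)
source_url = "https://raw.githubusercontent.com/hegame1998/NLP-Assignment/main/source_text.txt"
style_url = "https://raw.githubusercontent.com/hegame1998/NLP-Assignment/main/style_text.txt"

def fetch_text(url):
    # Raise on HTTP errors so an error page is never treated as input text
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

# Load text from GitHub
source_text = fetch_text(source_url)
style_text = fetch_text(style_url)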

To verify and preserve these files locally in the Colab environment, I wrote them out again:

In [None]:
# Save the input texts (optional for reference or reuse)
with open("source_text.txt", "w", encoding="utf-8") as f:
    f.write(source_text)

with open("style_text.txt", "w", encoding="utf-8") as f:
    f.write(style_text)

This decouples data from code and makes my summarizer reusable for any new pair of texts.

# Preprocessing (without NLTK)

Because I didn’t want to depend on NLTK's **punkt** tokenizer data (the download can fail), I wrote a simple sentence tokenizer using a regular expression. It splits the text at whitespace that follows sentence-ending punctuation and precedes a capital letter, which is a basic but effective approach.

In [None]:
# Preprocessing (no punkt)
import re

def naive_sentence_tokenize(text):
    # Split on whitespace that follows ., ? or ! and precedes a capital letter
    return re.split(r'(?<=[.?!])\s+(?=[A-Z])', text)

def preprocess_text(text):
    # Tokenize, trim surrounding whitespace, and drop empty strings
    sentences = naive_sentence_tokenize(text)
    return [s.strip() for s in sentences if s.strip()]

* It splits after sentence-ending punctuation (a period, question mark, or exclamation mark) when the whitespace that follows is itself followed by a capital letter (A-Z).

* It filters out empty results.

* It works without any external dependencies.

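This is simple but effective for clean, structured English paragraphs. It will split after abbreviations such as "Dr." and will not split before a sentence that starts with a lowercase letter, a digit, or a quotation mark.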
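To make the tokenizer's behavior and its limits concrete, here is a small sketch that runs the same regular expression on a made-up string. The sample text is my own illustration, not taken from the input files.

```python
import re

def naive_sentence_tokenize(text):
    # Same pattern as the tokenizer above: split on whitespace that follows
    # ., ? or ! and precedes an uppercase ASCII letter.
    return re.split(r'(?<=[.?!])\s+(?=[A-Z])', text)

# Illustrative sample (an assumption, not part of the assignment data)
sample = "NLP is useful. Dr. Smith agrees! does it scale? Yes."
sentences = naive_sentence_tokenize(sample)
print(sentences)
# ['NLP is useful.', 'Dr.', 'Smith agrees! does it scale?', 'Yes.']
```

The abbreviation "Dr." is split off as its own sentence, and the lowercase "does" keeps two sentences glued together, so the approach is best suited to clean, well-edited text.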

# Feature Extraction

To keep the summaries within a limited context window (e.g., 4000 tokens or fewer), I split a fixed budget between the two texts in proportion to their lengths. This way, each document contributes summary content in proportion to its size. Note that the pipeline measures length in sentences, so despite its name, **max_token** acts as a sentence budget here.

In [None]:
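def compute_target_lengths(len1, len2, max_token=4000):
    # Split the budget in proportion to the two input lengths
    total = len1 + len2
    if total == 0:
        return 0, 0  # both inputs empty: avoid division by zero
    proportion1 = len1 / total
    proportion2 = len2 / total
    # int() truncates, so the two shares can sum to slightly less than max_token
    return int(max_token * proportion1), int(max_token * proportion2)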

For example:<br>
If the source text has 90 sentences, the style text has 10, and max_token=100, then:

* Source gets 90 (90% of the budget)

* Style gets 10 (10% of the budget)

This ensures **fair representation** in the final combined prompt.
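When the split does not come out even, truncation matters. The sketch below repeats the proportional formula on the sentence counts reported later in this notebook (60 and 6) with a budget of 50. The variable names `source_len`, `style_len`, and `leftover` are only for illustration.

```python
def compute_target_lengths(len1, len2, max_token=4000):
    # Same proportional split as above; int() truncates toward zero
    total = len1 + len2
    proportion1 = len1 / total
    proportion2 = len2 / total
    return int(max_token * proportion1), int(max_token * proportion2)

# 60 source sentences, 6 style sentences, budget of 50 sentences
source_len, style_len = compute_target_lengths(60, 6, max_token=50)
leftover = 50 - (source_len + style_len)
print(source_len, style_len, leftover)  # prints: 45 4 1
```

The exact shares are about 45.45 and 4.55, so both are rounded down and one unit of the budget goes unused.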

# Model Training (Summarization Logic)

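I built a **hierarchical summarization pipeline**. First, I divided each document into chunks of **slice_size** sentences (20 by default). Then, I extracted the leading sentences from each chunk, moving through the chunks in order until I reached the desired summary length.

My summarization method is extractive, so I simply picked the first N sentences from each chunk. Because N is the full target length rather than a per-chunk share, the current version returns the first **target_len** sentences of the document.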

In [None]:
# Summarization Logic
def hierarchical_summarize(sentences, target_len, slice_size=20):
    summary = []
    # Walk through the document in chunks of slice_size sentences
    for i in range(0, len(sentences), slice_size):
        chunk = sentences[i:i + slice_size]
        # Take up to target_len leading sentences from this chunk
        chunk_summary = simple_extract_summary(chunk, target_len)
        summary.extend(chunk_summary)
        # Stop as soon as the sentence budget is filled
        if len(summary) >= target_len:
            break
    # Trim any overshoot from the last chunk
    return summary[:target_len]

def simple_extract_summary(sentences, max_sentences):
    # Simple extractive summarization: pick first N sentences
    return sentences[:max_sentences]

Properties of this approach:

* Efficient, and the hard cap on sentence count keeps the summary within the budget.

* Keeps the leading sentences, which often introduce the main ideas in expository text.

* Modular: easy to replace with more advanced summarizers later.
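As one possible replacement, here is a minimal sketch of a variant that gives every chunk its own share of the budget, so that later parts of the document are also represented. This is not the method used in this notebook: the names `take_leading` and `chunked_lead_summary` are hypothetical, and the even per-chunk split is an assumption.

```python
import math

def take_leading(sentences, max_sentences):
    # Lead-N extraction: keep the first max_sentences sentences
    return sentences[:max_sentences]

def chunked_lead_summary(sentences, target_len, slice_size=20):
    # Hypothetical variant: spread target_len evenly across all chunks
    if target_len <= 0 or not sentences:
        return []
    chunks = [sentences[i:i + slice_size]
              for i in range(0, len(sentences), slice_size)]
    # Round the per-chunk share up, then trim the total at the end
    per_chunk = math.ceil(target_len / len(chunks))
    summary = []
    for chunk in chunks:
        summary.extend(take_leading(chunk, per_chunk))
    return summary[:target_len]

# Placeholder sentences s0..s59 stand in for a 60-sentence document
demo = [f"s{i}" for i in range(60)]
result = chunked_lead_summary(demo, target_len=9)
print(result)
# ['s0', 's1', 's2', 's20', 's21', 's22', 's40', 's41', 's42']
```

With three chunks and a target of 9, each chunk contributes its first three sentences, whereas taking only leading sentences from the start of the document would return s0 through s8.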

# Evaluation

To evaluate my summaries, I printed the number of sentences in both the original and the summarized versions. I also displayed a few lines from the generated summary so I could visually assess the quality.

In [None]:
# Evaluation
def evaluate_summary(original, summary, label):
    print(f"=== {label} Summary Evaluation ===")
    print(f"Original sentences: {len(original)}")
    print(f"Summary sentences: {len(summary)}")
    print("Sample summary:")
    print("\n".join(summary[:5]))
    print("\n" + "-"*50 + "\n")

This gave me quick insight into how well the summary compressed the original content.
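The two sentence counts can also be condensed into a single number. The sketch below computes the fraction of sentences kept. The helper `compression_ratio` is a hypothetical addition, not part of the pipeline, and the placeholder sentences only mimic the 60-to-45 reduction.

```python
def compression_ratio(original, summary):
    # Fraction of original sentences kept (0.0 if the original is empty)
    if not original:
        return 0.0
    return len(summary) / len(original)

# Placeholder data with the same sizes as a 60-sentence text cut to 45
original = [f"sentence {i}" for i in range(60)]
summary = original[:45]
ratio = compression_ratio(original, summary)
print(f"Kept {ratio:.0%} of the sentences")  # prints: Kept 75% of the sentences
```

A lower ratio means stronger compression. It says nothing about quality, so the printed sample sentences still need to be read.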

# Main Pipeline

I built a **main_pipeline()** function to:

* Preprocess both texts

* Compute proportional summary lengths

* Perform hierarchical summarization

* Run the evaluation

In [None]:
# Main Function

def main_pipeline(source_text, style_text):
    # Preprocessing
    source_sentences = preprocess_text(source_text)
    style_sentences = preprocess_text(style_text)

    # Proportional length calculation
    source_target_len, style_target_len = compute_target_lengths(
        len(source_sentences), len(style_sentences), max_token=50
    )

    # Hierarchical summarization
    source_summary = hierarchical_summarize(source_sentences, source_target_len)
    style_summary = hierarchical_summarize(style_sentences, style_target_len)

    # Evaluation
    evaluate_summary(source_sentences, source_summary, "Source")
    evaluate_summary(style_sentences, style_summary, "Style")

    return source_summary, style_summary

# Run the full pipeline
source_summary, style_summary = main_pipeline(source_text, style_text)

=== Source Summary Evaluation ===
Original sentences: 60
Summary sentences: 45
Sample summary:
Natural Language Processing (NLP) is a sub-field of artificial intelligence that focuses on the interaction between computers and humans through natural language.
The ultimate objective of NLP is to read, decipher, understand, and make sense of the human languages in a manner that is valuable.
Most NLP techniques rely on machine learning to derive meaning from human languages.
Applications of NLP include speech recognition, text summarization, machine translation, sentiment analysis, and more.
The field of NLP combines computational linguistics with statistical, machine learning, and deep learning models.

--------------------------------------------------

=== Style Summary Evaluation ===
Original sentences: 6
Summary sentences: 4
Sample summary:
In the beginning, language was simple.
It served only to convey the most basic of messages—danger, food, shelter.
As human societies grew more comp

After running my text summarization pipeline, I evaluated both the source and the style summaries. Here's how to interpret the output:

## Source Summary Evaluation

**What the output means:** <br>
My original ***source text*** had 60 sentences (technical and informative, similar to a Wikipedia article on NLP). My summarizer reduced this to **45 sentences**, based on the source's share of the combined sentence count.

I chose a maximum budget of 50 sentences in total (**max_token=50**). The source accounts for 60 of the 66 sentences (about 91%), so its share is 50 × 60/66 ≈ 45.45, which is truncated to 45 sentences.<br> **Interpretation:**

* These sentences are **clear, technical, and fact-heavy**, which is appropriate for summarizing a technical topic.

* The summary captures **core definitions, objectives, methods**, and **applications** of NLP.

* It retains **coherence** even though it’s extractive (i.e., no paraphrasing or abstraction), because the selected sentences stay in their original order.

## Style Summary Evaluation

**What the output means:** <br>
The ***style text*** was much shorter, only 6 sentences long, and poetic or philosophical in tone. Its share of the budget was 50 × 6/66 ≈ 4.55, truncated to 4, so the summarizer kept 4 of those sentences to preserve the core tone and content. <br> **Interpretation:**

* This summary captures the **evolution of language** in a more **narrative and emotional** tone.

* It matches the **style** of the original text: reflective, abstract, and humanistic.

* Even though it’s extractive, it maintains the **voice** and **style** well.

# Final Result

When I ran the summarizer on my two texts, the result was:

* The source text was long and technical (about NLP). It was reduced to a shorter, digestible summary that still captured key facts.

* The style text was philosophical and emotional. It got summarized while preserving its poetic tone.

* Each summary was proportionally scaled.

* The code ran without NLP dependencies like nltk (only **requests** is needed to fetch the files), making it lightweight and portable.

<br> My summarizer doesn't just cut down text; it preserves the **function** of each passage:

* **Informative tone** for source text.

* **Philosophical tone** for style text.

By keeping summaries proportional, I ensure **balanced input** if I later use these summaries in prompts for a downstream LLM task (e.g., style transfer, paraphrasing).