
# AI Engineering – LLM Assignment: Text Summarization

**Objective:**  
Build, analyze, and evaluate a text summarization pipeline using LLMs.

This notebook covers:
- Data exploration
- Data preprocessing
- Summarization strategy
- Top 5 reviews extraction
- Model architecture & hyperparameters
- Evaluation approach
- Future improvements & research directions



## 1. Dataset Loading & Exploration

The dataset consists of user reviews collected from a mobile application.
The text contains:
- Mixed sentiment (positive & negative)
- Multilingual content (English + Hinglish/Hindi)
- Typos and informal language


In [16]:
import pandas as pd
import PyPDF2

file_path = "/content/reviews.pdf"

text = ""

with open(file_path, "rb") as f:
    reader = PyPDF2.PdfReader(f)
    for page in reader.pages:
        text += page.extract_text()

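# Split on line breaks and keep lines longer than 20 characters.
# PDF text wraps mid-review, so a row may hold a fragment of a review
# rather than one complete review.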
reviews = [r.strip() for r in text.split("\n") if len(r.strip()) > 20]

df = pd.DataFrame(reviews, columns=["review_text"])

df.head(), df.shape


(                                         review_text
 0  I've started using infloso app for a while now...
 1  to have a social media impact and gives a lot ...
 2  and improve their social media presence. Worst...
 3  Instagram neither my YouTube. Nothing works. I...
 4  followers still your can't find me there? How ...,
 (151, 1))
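The traits listed above (informal language, mixed scripts) can be checked with a few corpus statistics. The following is a minimal sketch in plain Python; the helper names `has_devanagari` and `explore` and the sample strings are hypothetical and not part of the original notebook. Note that Hinglish written in Latin script will not be flagged by the Devanagari check.

```python
import statistics

def has_devanagari(text):
    """Return True if any character lies in the Devanagari block (U+0900 to U+097F)."""
    return any("\u0900" <= ch <= "\u097f" for ch in text)

def explore(reviews):
    """Compute simple statistics for a non-empty list of review strings."""
    lengths = [len(r.split()) for r in reviews]
    return {
        "n_reviews": len(reviews),
        "median_words": statistics.median(lengths),
        "max_words": max(lengths),
        "n_devanagari": sum(has_devanagari(r) for r in reviews),
    }

# Hypothetical sample; in the notebook this would be df["review_text"].tolist()
sample = [
    "Great app, very helpful for new influencers",
    "1 campaign apply kiya hai but abhi tak pending",
    "यह ऐप बहुत अच्छा है",
]
stats = explore(sample)
print(stats)
```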

In [6]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1



## 2. Data Preprocessing

Steps:
- Lowercasing
- Removing special characters
- Normalizing whitespace
- Removing extremely short or noisy samples


In [17]:

import re

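def clean_text(text):
    """Lowercase, strip special characters, and normalize whitespace."""
    text = text.lower()
    # Replace everything except ASCII letters, digits, and whitespace with a space.
    # Note: this also removes non-Latin scripts such as Devanagari.
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    # Collapse runs of whitespace into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text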

df["clean_review"] = df["review_text"].apply(clean_text)
df.head()


| | review_text | clean_review |
|---|---|---|
| 0 | I've started using infloso app for a while now... | i ve started using infloso app for a while now... |
| 1 | to have a social media impact and gives a lot ... | to have a social media impact and gives a lot ... |
| 2 | and improve their social media presence. Worst... | and improve their social media presence worst ... |
| 3 | Instagram neither my YouTube. Nothing works. I... | instagram neither my youtube nothing works i h... |
| 4 | followers still your can't find me there? How ... | followers still your can t find me there how d... |
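The last preprocessing step in the list, removing extremely short or noisy samples, is not part of `clean_text`. One way to express it is sketched below. The function name `is_usable` and both thresholds are assumptions chosen for illustration, not values from the original notebook.

```python
def is_usable(text, min_words=3, min_alpha_ratio=0.5):
    """Heuristic filter: keep text with enough words and a mostly alphabetic character mix."""
    words = text.split()
    if len(words) < min_words:
        return False
    non_space = [ch for ch in text if not ch.isspace()]
    alpha = sum(ch.isalpha() for ch in non_space)
    return alpha / len(non_space) >= min_alpha_ratio

# Hypothetical samples covering a short row, a normal row, and two noisy rows
samples = [
    "nice",
    "infloso is a very good and helpful app",
    "10 10 10 10 10",
    "???? !!!! ....",
]
kept = [s for s in samples if is_usable(s)]
print(kept)
```

In the notebook this could be applied as `df = df[df["clean_review"].apply(is_usable)]`, after tuning the thresholds against a sample of rows.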







## 3. Summarization Strategy

We use a **pre-trained transformer-based summarization model**.

Model choice:
- `facebook/bart-large-cnn`

Why?
- Strong performance on abstractive summarization
- Accepts inputs of up to 1,024 tokens, so longer text must be split into chunks
- Fine-tuned on the CNN/DailyMail news summarization dataset


In [18]:

from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    tokenizer="facebook/bart-large-cnn"
)


Device set to use cuda:0



## 4. Generate Global Summary

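We concatenate reviews into chunks of at most 700 words and summarize each chunk separately, which is intended to keep every input within the model's token limit. The result is one summary per chunk rather than a single global summary.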


In [19]:

def summarize_reviews(
    reviews,
    max_words_per_chunk=700,
    min_words_per_chunk=80,
    max_summary_len=130,
    min_summary_len=40
):
    """Summarize reviews in word-bounded chunks and return one summary per chunk."""
    summaries = []
    chunk = []  # words accumulated for the current chunk

    def summarize_chunk(words):
        # Chunks below the minimum size are dropped rather than summarized
        if len(words) < min_words_per_chunk:
            return
        summary = summarizer(
            " ".join(words),
            max_length=max_summary_len,
            min_length=min_summary_len,
            do_sample=False,
            truncation=True  # guard against inputs beyond the model's token limit
        )
        summaries.append(summary[0]["summary_text"])

    for review in reviews:
        review_words = review.split()

        # Skip extremely short reviews
        if len(review_words) < 10:
            continue

        # If adding this review would exceed the chunk size, summarize the current chunk first
        if len(chunk) + len(review_words) > max_words_per_chunk:
            summarize_chunk(chunk)
            chunk = list(review_words)
        else:
            chunk.extend(review_words)

    # Summarize the remaining chunk
    summarize_chunk(chunk)

    return summaries


global_summaries = summarize_reviews(df["clean_review"].tolist())
global_summaries


[' infloso is a mind blowing app i have used this app for few months and i am very satisfied with it this app shows only for influencer so can a brand find influencer according to there business is not what i was expecting my requests for collaboration are left pending for more than a few months this app is such an amazing one i think every influencer should have this app 10 10 best app i ever seen in my life and they provide best campaigns to influencer as i m influencer.',
 ' infloso is a very good and helpful app for new influencers it s helps us with new brands for collaboration a perfect platform for all the micro influencers its so helpfull although its a good experience with infloso satisfied with this app thanks infloso.',
 ' infloso is an exceptional paid barter collaboration app that has revolutionized my content creation journey as a content creator. The platform s innovative features and user friendly interface have empowered me to effortlessly connect and collaborate with 
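The cell above returns one summary per chunk. To arrive at a single global summary, the chunk summaries can themselves be summarized in a second pass. The sketch below shows that reduce step under stated assumptions: `reduce_summaries` and its `summarize_fn` parameter are hypothetical names, and the stand-in summarizer used here simply keeps the first five words so the example runs without a model.

```python
def reduce_summaries(summaries, summarize_fn, max_words=700):
    """Collapse chunk summaries into one by summarizing groups that fit the word budget."""
    summaries = list(summaries)
    while len(summaries) > 1:
        groups, current, count = [], [], 0
        for s in summaries:
            n = len(s.split())
            # Start a new group when adding this summary would exceed the budget
            if current and count + n > max_words:
                groups.append(" ".join(current))
                current, count = [], 0
            current.append(s)
            count += n
        groups.append(" ".join(current))
        if len(groups) == len(summaries):
            # No grouping progress is possible; fall back to one pass over everything
            groups = [" ".join(summaries)]
        summaries = [summarize_fn(g) for g in groups]
    return summaries[0] if summaries else ""

# Stand-in summarizer for illustration only: keeps the first five words
def fake_summarize(text):
    return " ".join(text.split()[:5])

result = reduce_summaries(
    ["a b c d e f", "g h i j k l", "m n o p q r"],
    fake_summarize,
    max_words=12,
)
print(result)
```

With the real model, `summarize_fn` could be a small wrapper such as `lambda t: summarizer(t, max_length=130, min_length=40, do_sample=False, truncation=True)[0]["summary_text"]`.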


## 5. Top 5 Reviews Extraction

We rank reviews by word count, using length as a simple heuristic for content richness.


In [20]:

df["length"] = df["clean_review"].apply(lambda x: len(x.split()))
top_5_reviews = df.sort_values("length", ascending=False).head(5)
top_5_reviews[["review_text"]]


| | review_text |
|---|---|
| 117 | not able to connect YouTube on your app. .nice... |
| 27 | because I'm not able to login A big platform t... |
| 72 | love it It's been months that I've this app on... |
| 58 | 1 campaign apply Kiya hai...but abhi tak under... |
| 0 | I've started using infloso app for a while now... |
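Sorting by raw length rewards repetition as much as content. A slightly stricter heuristic is to count distinct words instead. The sketch below is one possible variant, not the notebook's method; `richness_score` and `top_k` are hypothetical names.

```python
def richness_score(text):
    """Number of distinct words: rewards length but discounts repeated words."""
    return len(set(text.lower().split()))

def top_k(reviews, k=5):
    """Return the k reviews with the highest richness score, highest first."""
    return sorted(reviews, key=richness_score, reverse=True)[:k]

# Hypothetical samples; in the notebook this would be df["clean_review"].tolist()
reviews = [
    "good good good good good good",
    "the app fails to connect youtube",
    "nice app",
]
best = top_k(reviews, k=2)
print(best)
```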






## 6. Architecture & Hyperparameters

This project uses **BART**, a transformer-based encoder–decoder model. The bidirectional encoder reads the full input in context, and the autoregressive decoder generates the summary token by token. The pre-trained **facebook/bart-large-cnn** checkpoint was selected because it is fine-tuned for abstractive summarization on CNN/DailyMail news articles and can be used without further training. User-generated reviews are noisier than news text, so output quality on this dataset has to be checked rather than assumed.

**Hyperparameters used during inference:**
- `max_length = 130` (tokens) to keep summaries concise  
- `min_length = 40` (tokens) to prevent incomplete outputs  
- `do_sample = False` to ensure deterministic and reproducible results  
- Beam search decoding, taken from the model's default generation settings, to improve generation quality  

The implementation was carried out using the **HuggingFace Transformers** library.
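Collecting the inference settings in one dictionary makes them easier to review and reuse. The sketch below is illustrative: `GENERATION_KWARGS` and `check_generation_kwargs` are hypothetical names, and the explicit `num_beams` value is an assumption written out for clarity rather than a setting taken from the notebook.

```python
# Inference settings from the list above, collected in one place
GENERATION_KWARGS = {
    "max_length": 130,   # upper bound on summary length, counted in tokens
    "min_length": 40,    # lower bound on summary length, counted in tokens
    "do_sample": False,  # deterministic decoding
    "num_beams": 4,      # assumption: beam width stated explicitly for illustration
}

def check_generation_kwargs(kwargs):
    """Raise ValueError if the length bounds or beam width are inconsistent."""
    if kwargs["min_length"] >= kwargs["max_length"]:
        raise ValueError("min_length must be smaller than max_length")
    if kwargs.get("num_beams", 1) < 1:
        raise ValueError("num_beams must be at least 1")
    return kwargs

check_generation_kwargs(GENERATION_KWARGS)

# Usage in the notebook (requires the `summarizer` pipeline defined earlier):
# summary = summarizer(chunk_text, truncation=True, **GENERATION_KWARGS)
```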

---

## 7. Evaluation Strategy

The summarization results were primarily evaluated through **qualitative human assessment**, focusing on clarity, coherence, and coverage of dominant themes. **ROUGE metrics** were considered as an optional quantitative method; they require reference summaries to compare against. The generated summaries were also checked for factual consistency and logical flow.
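To make the ROUGE option concrete, the sketch below computes a simplified ROUGE-1 score (unigram overlap) in plain Python. It assumes whitespace tokenization and no stemming, so its numbers will differ from those of standard ROUGE packages, and the reference summary in the example is hypothetical because the notebook does not include any.

```python
from collections import Counter

def rouge1(candidate, reference):
    """Simplified ROUGE-1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per word (clipped overlap)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical candidate and reference summaries
scores = rouge1("the app is good", "the app is very good")
print(scores)
```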

---

## 8. Future Improvements

Potential enhancements include fine-tuning the model on domain-specific review datasets, adopting multilingual summarization models such as **mBART** or **mT5**, incorporating sentiment-aware summarization, and clustering reviews before summarization to improve structure and focus.
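As a rough illustration of sentiment-aware summarization, reviews can be split by polarity first and each group summarized separately, so that complaints are not drowned out by praise. The sketch below uses a tiny hand-written word list, which is purely hypothetical; a real pipeline would use a trained sentiment classifier.

```python
# Toy lexicons for illustration only; not derived from the dataset
POSITIVE = {"good", "great", "amazing", "best", "helpful", "love"}
NEGATIVE = {"worst", "bad", "pending", "fails", "nothing"}

def sentiment_bucket(text):
    """Label text by the balance of positive and negative lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def group_by_sentiment(reviews):
    """Group reviews into positive, negative, and neutral lists."""
    groups = {"positive": [], "negative": [], "neutral": []}
    for review in reviews:
        groups[sentiment_bucket(review)].append(review)
    return groups

groups = group_by_sentiment([
    "best app very helpful",
    "nothing works worst app",
    "i use this app",
])
print(groups)
```

Each group could then be passed to `summarize_reviews` on its own to produce separate summaries of praise and complaints.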

---

## 9. Further Research Directions

Future research may explore **aspect-based summarization**, **reinforcement learning from human feedback (RLHF)**, and techniques for **hallucination detection and mitigation** in abstractive summarization models.
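A very crude starting point for hallucination detection is to list summary words that never occur in the source text. The sketch below does only that; `unsupported_words` is a hypothetical helper, and because abstractive summaries legitimately paraphrase, it will over-flag and should be read as a prompt for manual review rather than a detector.

```python
def unsupported_words(summary, source):
    """Return summary words (lowercased, in order) that do not appear in the source."""
    source_vocab = set(source.lower().split())
    return [w for w in summary.lower().split() if w not in source_vocab]

# Hypothetical source text and summary
flagged = unsupported_words("the app crashed daily", "the app is slow")
print(flagged)
```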