<a href="https://colab.research.google.com/github/nehakerung/nlp-text-summarisation/blob/main/NLP_neha_submit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Summarisation
Train a summarisation model that can automatically generate short summaries of news articles.
You’ll be using the provided dataset, where each article is paired with a human-written summary.


## 1. Retrieve document information

In [None]:
## On Google Colab
## Upload the files to temporary session storage
filename = "/content/dataset/articles/001.txt"

## Or mount a Google Drive folder instead
# from google.colab import drive
# drive.mount('/content/drive')
# filename = "path to dataset"


In [None]:
# read file
with open(filename, 'r') as f:
  article = f.read()

## 2. Pre-process text: Prepare dataset for training
1. Text cleaning
2. Tokenisation
3. Lowercasing
4. Stop word removal
5. Stemming / Lemmatisation
6. Removing duplicates or nulls
7. Spelling correction or normalisation
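
The spaCy functions in this section cover tokenisation, lowercasing, stop word removal and lemmatisation. Steps 6 and 7 are not implemented in this notebook, so here is a minimal sketch of what they might look like using only the standard library. The helper names `normalise_text` and `drop_empty_and_duplicates` are hypothetical, and real spelling correction would need a dedicated library that is not shown here.

```python
import re
import unicodedata

# Map curly quotes to plain ASCII quotes (NFKC alone leaves them unchanged)
_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalise_text(text):
    """Normalise Unicode forms, straighten quotes and collapse whitespace."""
    # NFKC folds compatibility characters such as non-breaking spaces
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_QUOTES)
    return re.sub(r"\s+", " ", text).strip()


def drop_empty_and_duplicates(texts):
    """Remove None/blank entries and exact duplicates, keeping first-seen order."""
    seen = set()
    kept = []
    for text in texts:
        if text is None:
            continue
        cleaned = normalise_text(text)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            kept.append(cleaned)
    return kept
```

Running `drop_empty_and_duplicates` over the raw articles before tokenisation would keep repeated or empty files out of the training set.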

In [None]:
# using spacy to clean text
import spacy

In [None]:
nlp = spacy.load("en_core_web_sm")

# pre processes text and returns a list of sentences
def pre_process_sent(txt):
    doc = nlp(txt)
    tokenised_sentences = []

    for sent in doc.sents:  # sentence-by-sentence
        tokens = []
        for token in sent:
            # Skip whitespace and stop words (punctuation tokens are kept)
            if token.is_space or token.is_stop:
                continue
            tokens.append(token.lemma_.lower())

        # Add sentence only if it's not empty
        if tokens:
            tokenised_sentences.append(" ".join(tokens))
    return tokenised_sentences

# preprocesses text and returns a full string
def pre_process_txt(txt):
    doc = nlp(txt)

    # cleaned_tokens = pre_process_sent(txt)

    # cleaned_text = " ".join(cleaned_tokens)
    tokenised_sentences = []
    for sent in doc.sents:  # sentence-by-sentence
        tokens = []
        for token in sent:
            # Skip punctuation, whitespace and stop words
            if token.is_punct or token.is_space or token.is_stop:
                continue
            tokens.append(token.lemma_.lower())

        # Add sentence only if it's not empty
        if tokens:
            tokenised_sentences.append(" ".join(tokens))
    cleaned_text = " ".join(tokenised_sentences)
    return cleaned_text

# clean_txt = pre_process_txt(article)

print(pre_process_sent(article))
print(pre_process_txt(article))

['claxton hunt major medal british hurdler sarah claxton confident win major medal month european indoor championships madrid .', '25 - year - old smash british record 60 m hurdle twice season , set new mark 7.96 second win aaa title .', '" confident , " say claxton .', '" race come .', '" long training think chance medal . "', 'claxton win national 60 m hurdle title past year struggle translate domestic success international stage .', ', scotland - bear athlete own equal fifth - fast time world year .', 'week birmingham grand prix , claxton leave european medal favourite russian irina shevchenko trail sixth spot .', 'time , claxton prepare campaign hurdle - explain leap form .', 'previous season , 25 - year - old contest long jump move colchester london - focused attention .', 'claxton new training regime pay dividend european indoors place 5 - 6 march .']
claxton hunt major medal british hurdler sarah claxton confident win major medal month european indoor championships madrid 25 yea

## Feature extraction

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# cleaned_sentences = pre_process_sent(clean_txt)
def get_similarity_matrix(sentences):
  # Needs at least two sentences: tfidf_matrix[1] raises IndexError otherwise

  # Step 1: Create the TF-IDF vectorizer and fit it to the sentences
  vectorizer = TfidfVectorizer()
  tfidf_matrix = vectorizer.fit_transform(sentences)

  # Step 2: Print the vocabulary (the terms, in column order)
  print("Vocabulary:")
  print(vectorizer.get_feature_names_out())

  # Step 3: Print the TF-IDF matrix (one row per sentence)
  print("\nTF-IDF Matrix:")
  print(tfidf_matrix.toarray())

  # Step 4: Compute cosine similarity between sentences 1 and 2
  similarity = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])
  print("\nCosine Similarity between Doc 1 and Doc 2:")
  print(similarity[0][0])
  return similarity[0][0]
# similarity

In [None]:
from sentence_transformers import SentenceTransformer

def get_embeddings_similarities(sentences):

  # 1. Load a pretrained Sentence Transformer model
  model = SentenceTransformer("all-MiniLM-L6-v2")

  # 2. Calculate embeddings by calling model.encode()
  embeddings = model.encode(sentences)
  print(embeddings.shape)
  # (number of sentences, 384)

  # 3. Calculate the pairwise embedding similarities
  similarities = model.similarity(embeddings, embeddings)
  print(similarities)
  # Example output for three sentences:
  # tensor([[1.0000, 0.6660, 0.1046],
  #         [0.6660, 1.0000, 0.1411],
  #         [0.1046, 0.1411, 1.0000]])
  return similarities

## Model

Pre-process the texts used for training, validation and testing.

In [None]:
# Prep files

# from google.colab import drive
# drive.mount('/content/drive')

articles = "/content/dataset/articles/001.txt"
summary = "/content/dataset/summary/001.txt"

article_3 = "/content/dataset/articles/003.txt"
summary_3 = "/content/dataset/summary/003.txt"

article_2 = "/content/dataset/articles/004.txt"
summary_2 = "/content/dataset/summary/004.txt"
# read file
with open(articles, 'r') as f:
  article = f.read()
with open(summary, 'r') as f:
  summary = f.read()


with open(article_3, 'r') as f:
  article_3 = f.read()
# Read the reference summary that matches article 003
with open(summary_3, 'r') as f:
  summary_3 = f.read()

with open(article_2, 'r') as f:
  article_2 = f.read()
with open(summary_2, 'r') as f:
  summary_2 = f.read()
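
The cell above loads three article/summary pairs by hand. As a sketch of how the whole dataset folder might be loaded instead, the helpers below pair files by name and split them into training and validation lists. It assumes that every article in `/content/dataset/articles` has a summary with the same filename in `/content/dataset/summary`. The function names `load_pairs` and `split_pairs` are hypothetical.

```python
from pathlib import Path


def load_pairs(articles_dir, summaries_dir):
    """Return (article, summary) text pairs for files present in both folders."""
    articles_dir, summaries_dir = Path(articles_dir), Path(summaries_dir)
    pairs = []
    for article_path in sorted(articles_dir.glob("*.txt")):
        summary_path = summaries_dir / article_path.name
        if not summary_path.exists():
            continue  # skip articles without a matching summary
        pairs.append((
            article_path.read_text(encoding="utf-8", errors="replace"),
            summary_path.read_text(encoding="utf-8", errors="replace"),
        ))
    return pairs


def split_pairs(pairs, val_fraction=0.2):
    """Split into (train, validation); validation gets at least one pair if possible."""
    n_val = max(1, int(len(pairs) * val_fraction)) if len(pairs) > 1 else 0
    cut = len(pairs) - n_val
    return pairs[:cut], pairs[cut:]
```

For example, `train, val = split_pairs(load_pairs("/content/dataset/articles", "/content/dataset/summary"))` would give two lists that can be turned into the `article` and `summary` columns of a `Dataset`.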


In [None]:
# Pre process data
print(pre_process_sent(article))
print(pre_process_txt(article))

# Split into sentences while the raw text is still available.
# pre_process_txt strips punctuation, so re-splitting its output can collapse
# to a single sentence, and tfidf_matrix[1] then raises an IndexError.
sent_1 = pre_process_sent(article)
sent_2 = pre_process_sent(article_2)
sent_3 = pre_process_sent(article_3)

article = pre_process_txt(article)
summary = pre_process_txt(summary)

article_2 = pre_process_txt(article_2)
summary_2 = pre_process_txt(summary_2)

article_3 = pre_process_txt(article_3)
summary_3 = pre_process_txt(summary_3)


get_similarity_matrix(sent_1)
get_embeddings_similarities(sent_1)

['claxton hunt major medal british hurdler sarah claxton confident win major medal month european indoor championships madrid .', '25 - year - old smash british record 60 m hurdle twice season , set new mark 7.96 second win aaa title .', '" confident , " say claxton .', '" race come .', '" long training think chance medal . "', 'claxton win national 60 m hurdle title past year struggle translate domestic success international stage .', ', scotland - bear athlete own equal fifth - fast time world year .', 'week birmingham grand prix , claxton leave european medal favourite russian irina shevchenko trail sixth spot .', 'time , claxton prepare campaign hurdle - explain leap form .', 'previous season , 25 - year - old contest long jump move colchester london - focused attention .', 'claxton new training regime pay dividend european indoors place 5 - 6 march .']
claxton hunt major medal british hurdler sarah claxton confident win major medal month european indoor championships madrid 25 yea

IndexError: index (1) out of range

## Assess similarities

Step 2: Feature Extraction

Feature extraction identifies linguistic and semantic features such as word frequency, entities, syntax, and topic relevance, which indicate the parts of the document that carry the most meaning. This notebook uses two of them: TF-IDF word weights and sentence embeddings.
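
To make the word-frequency idea concrete, here is a minimal pure-Python sketch of TF-IDF weighting and cosine similarity. It is an illustration only, not the `TfidfVectorizer` implementation: it splits on whitespace and applies no length normalisation to the stored vectors, so its numbers need not match scikit-learn's. The names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return one {term: weight} dict per sentence using tf * smoothed idf."""
    tokenised = [s.lower().split() for s in sentences]
    n = len(tokenised)
    # Document frequency: number of sentences each term appears in
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed idf keeps the weight positive even for terms found everywhere
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Two sentences that share no terms score 0.0, identical sentences score 1.0, and terms that are rare across the document contribute more to the score than common ones.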

Vector semantics and embeddings

## Model Application

Depending on the chosen approach (extractive, abstractive, or hybrid), the NLP model either selects sentences from the text or generates new text that condenses it. Transformer-based architectures like GPT, BERT, and T5 perform well here because they capture relationships and context better than earlier models. This notebook fine-tunes DistilBART (`sshleifer/distilbart-cnn-12-6`), an abstractive model.
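
For comparison with the abstractive model, an extractive baseline can be sketched from a sentence similarity matrix alone. The function below is a hypothetical illustration, not part of the notebook's pipeline. It assumes `similarity` is a square nested list with one row per sentence, for example a pairwise cosine similarity matrix converted with `.tolist()`. It scores each sentence by its total similarity to the others and keeps the top `k`.

```python
def extractive_summary(sentences, similarity, k=3):
    """Pick the k sentences most similar to the rest, returned in document order."""
    if len(similarity) != len(sentences):
        raise ValueError("similarity must have one row per sentence")
    scores = []
    for i, row in enumerate(similarity):
        # Centrality: total similarity to every other sentence (self excluded)
        scores.append(sum(value for j, value in enumerate(row) if j != i))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore the original order so the summary reads like the source
    return [sentences[i] for i in sorted(top)]
```

A baseline like this gives a reference point: if the fine-tuned model does not beat it on ROUGE, the training setup probably needs more data or different settings.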

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments
from datasets import Dataset, DatasetDict

train_data = {
    "article": [article],
    "summary": [summary],
}
print(article)
val_data = {
    "article": [article_2],
    "summary": [summary_2],
}

train_dataset = Dataset.from_dict(train_data)
val_dataset = Dataset.from_dict(val_data)
dataset = DatasetDict({"train": train_dataset, "validation":val_dataset})

claxton hunt major medal british hurdler sarah claxton confident win major medal month european indoor championships madrid 25 year old smash british record 60 m hurdle twice season set new mark 7.96 second win aaa title confident say claxton race come long training think chance medal claxton win national 60 m hurdle title past year struggle translate domestic success international stage scotland bear athlete own equal fifth fast time world year week birmingham grand prix claxton leave european medal favourite russian irina shevchenko trail sixth spot time claxton prepare campaign hurdle explain leap form previous season 25 year old contest long jump move colchester london focused attention claxton new training regime pay dividend european indoors place 5 6 march


In [None]:
model_name = "sshleifer/distilbart-cnn-12-6"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

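def preprocess_func(examples):
  inputs = examples["article"]
  targets = examples["summary"]
  model_inputs = tokenizer(inputs, max_length=1024, truncation=True, padding="max_length")

  # Tokenise the targets via text_target (as_target_tokenizer() is deprecated)
  labels = tokenizer(text_target=targets, max_length=64, truncation=True, padding="max_length")

  # Replace padding ids with -100 so the loss ignores padded label positions
  model_inputs["labels"] = [
      [(tok if tok != tokenizer.pad_token_id else -100) for tok in seq]
      for seq in labels["input_ids"]
  ]
  return model_inputs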

tokenized_dataset = dataset.map(preprocess_func, batched=True)


In [None]:
training_args = TrainingArguments(
    output_dir="./simple-distilbart-summarizer",
    learning_rate=3e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    logging_dir='./logs',
    save_total_limit=1,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"]
)

In [None]:
trainer.train()

trainer.save_model("./trained_simple_distilbart")

# Tokenise the held-out article and move the tensors to the model's device
inputs = tokenizer(article_3, return_tensors="pt", max_length=256, truncation=True).to(model.device)
summary_ids = model.generate(**inputs, min_length=5, max_length=50, length_penalty=2.0)
gen_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:", gen_summary)




Summary: greene set sight world title maurice greene aim wipe pain lose olympic 100 m title athens win fourth world championship crown summer settle bronze greece fellow american justin gatlin francis obikw


### Design a summarisation model
Step 4: Post-Processing

Finally, the generated summary is refined for grammar, coherence, and tone, so that the text flows naturally and stays consistent with the source content.

A good summariser does more than shorten text: it keeps the key points and presents them clearly.
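
The notebook does not implement this step. As a minimal sketch of what light post-processing might look like, the hypothetical helper `tidy_summary` below fixes spacing, drops a final sentence that was cut off by the generation length limit, and capitalises sentence starts. It assumes the summary still contains punctuation; text that went through `pre_process_txt` has none, so only the first letter would change.

```python
import re


def tidy_summary(text):
    """Fix spacing, drop a cut-off final sentence, capitalise sentence starts."""
    text = re.sub(r"\s+", " ", text).strip()
    # Remove spaces that tokenisation may leave before punctuation
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    # Sentence ends are ., ! or ? followed by a space or the end of the text,
    # so the decimal point in a number such as 7.96 is not treated as one
    ends = [m.end() for m in re.finditer(r"[.!?](?=\s|$)", text)]
    if ends:
        text = text[: ends[-1]]  # keep only the complete sentences
    return re.sub(r"(^|[.!?]\s+)([a-z])", lambda m: m.group(1) + m.group(2).upper(), text)
```

Grammar and tone correction go beyond rules like these and would need a separate model or tool.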


## Evaluate
Evaluate the quality of the generated summaries using ROUGE/BLEU scores (only ROUGE is computed here).
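
To show what the ROUGE-1 numbers mean, here is a minimal pure-Python sketch of unigram overlap. It is an illustration rather than the `rouge-score` implementation: it splits on whitespace and applies no stemming, so its values can differ from the library's when `use_stemmer=True`. The function name `rouge1` is hypothetical.

```python
from collections import Counter


def rouge1(reference, candidate):
    """Unigram-overlap precision, recall and F1 on whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts at most as often as it occurs in both texts
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1
```

Precision is the share of generated words that appear in the reference, and recall is the share of reference words that the generated summary recovers. A score of 0.0 on all three means the two texts share no words at all.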

In [None]:
pip install rouge-score

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=fb705c8697d9b342f1ce6a2b456d7d7bffcc2df38ab56762b344a271bd996579
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


In [None]:
from rouge_score import rouge_scorer

reference_summary = summary_3
generated_summary = gen_summary

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

scores = scorer.score(reference_summary, generated_summary)

print("ROUGE-1", scores['rouge1'])
print("ROUGE-2", scores['rouge2'])
print("ROUGE-L", scores['rougeL'])

ROUGE-1 Score(precision=0.0, recall=0.0, fmeasure=0.0)
ROUGE-2 Score(precision=0.0, recall=0.0, fmeasure=0.0)
ROUGE-L Score(precision=0.0, recall=0.0, fmeasure=0.0)
