# Text Summarisation Model

This will explore the use of BERT, T5 and pegasus to summarise news articles.

## 1. Upload dataset file
If cloned to your own IDE, it is included on the repo.
If you would like to use your own dataset, a compatible folder layout would look like the following; where the number of articles coresponds with the same number summary.
```
dataset
├── Articles
│   └── 001.txt
|   ├── 002.txt
|   ...
|   └── 511.txt
└── Summary
    └── 001.txt
    ├── 002.txt
    ...
    └── 511.txt

```
Please bear in mind, in colab you will have to take several extra steps:
* Save the dataset folder as a zip file
* Uploade to the current colab workspace as a zip file
* Unzip via following

In [7]:
# IF IN COLAB RENAME FILE TO DATASET THEN UNCOMMENT AND RUN THE FOLLOWING:

# !unzip /content/dataset.zip
# %ls

## 2. Text Pre-processing

Use spacy because it is appropriate for nlp, lemma func available.

In [8]:
import spacy

nlp = spacy.load("en_core_web_sm")

### Pre-process function for article data

In [9]:
def pre_process_txt(txt):
    doc = nlp(txt)
    tokenised_sentences = []
    for sent in doc.sents:  # sentence-by-sentence

        tokens = []
        for token in sent:
            # Skip short tokens
            # Remove unwanted tokens
            if token.is_punct or token.is_space or token.is_stop:
                continue
            if token.length_ < 3:
                continue
            tokens.append(token.lemma_.lower())

        # Add sentence only if it's not empty
        if tokens:
            tokenised_sentences.append(" ".join(tokens))
    cleaned_text = " ".join(tokenised_sentences)

    return cleaned_text


### Pre-process function for summary

In [10]:

def pre_process_summary_txt(txt):
    doc = nlp(txt)
    tokenised_sentences = []
    for sent in doc.sents:  # sentence-by-sentence
        
        tokens = []
        for token in sent:
 
            if token.is_space:
                continue
            tokens.append(token.lower())

        # Add sentence only if it's not empty
        if tokens:
            tokenised_sentences.append(" ".join(tokens))
            
    cleaned_text = " ".join(tokenised_sentences)
    return cleaned_text

## 2. Open file

Open the files uploaded to be read. Also applying processing function.
Try and except in case it failes

In [11]:
from pathlib import Path

# def load_dataset(base_path="/content/dataset"):
def load_dataset(base_path):
    articles_dir = Path(base_path) / "Articles"
    summary_dir = Path(base_path) / "Summary"

    if not articles_dir.exists() or not summary_dir.exists():
        raise FileNotFoundError("Articles or Summary directory not found.")

    # Get all article files and sort them
    article_files = sorted(articles_dir.glob("*.txt"))

    dataset = []
    
    for article_path in article_files:
        try:
            # Get corresponding summary file
            file_id = article_path.stem  # e.g., "001" from "001.txt"
            summary_path = summary_dir / f"{file_id}.txt"

            # Skip if summary doesn't exist
            if not summary_path.exists():
                print(f"Warning: No summary found for {file_id}")
                continue

            # Read both files
            with open(article_path, 'r', encoding='utf-8') as f:
                article = f.read()
                article = pre_process_txt(article)
            with open(summary_path, 'r', encoding='utf-8') as f:
                summary = f.read()
                summary = pre_process_txt(summary)

            dataset.append({
                'id': file_id,
                'article': article,
                'summary': summary
            })
        except Exception as e:
            print(f"Error processing {file_id}: {e}")

    print(f"Loaded {len(dataset)} article-summary pairs")
    return dataset


In [12]:

# ADD YOUR BASE PATH HERE {BASE_PATH}/Articles or {BASE_PATH}/Summary
BASE_PATH="dataset"

data = load_dataset(BASE_PATH)


Error processing 001: 'spacy.tokens.token.Token' object has no attribute 'length_'
Error processing 002: 'spacy.tokens.token.Token' object has no attribute 'length_'
Error processing 003: 'spacy.tokens.token.Token' object has no attribute 'length_'
Error processing 004: 'spacy.tokens.token.Token' object has no attribute 'length_'
Error processing 005: 'spacy.tokens.token.Token' object has no attribute 'length_'
Error processing 006: 'spacy.tokens.token.Token' object has no attribute 'length_'
Error processing 007: 'spacy.tokens.token.Token' object has no attribute 'length_'
Error processing 008: 'spacy.tokens.token.Token' object has no attribute 'length_'
Error processing 009: 'spacy.tokens.token.Token' object has no attribute 'length_'
Error processing 010: 'spacy.tokens.token.Token' object has no attribute 'length_'
Error processing 011: 'spacy.tokens.token.Token' object has no attribute 'length_'
Error processing 012: 'spacy.tokens.token.Token' object has no attribute 'length_'
Erro

## 4: Train data

In [13]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments
from datasets import Dataset, DatasetDict
import random

# Shuffle for randomness
random.seed(42)  # For reproducibility
random.shuffle(data)

# Calculate split indices
total = len(data)
train_end = int(total * 0.8)  # 80% train
test_end = train_end + int(total * 0.1)  # 10% test
# Remaining 10% for validation

# Split the data
train_items = data[:train_end]
val_items = data[train_end:test_end]
test_items = data[test_end:]

# Create datasets
train_dataset = Dataset.from_dict({
    "article": [item['article'] for item in train_items],
    "summary": [item['summary'] for item in train_items],
})

val_dataset = Dataset.from_dict({
    "article": [item['article'] for item in val_items],
    "summary": [item['summary'] for item in val_items],
})

test_dataset = Dataset.from_dict({
    "article": [item['article'] for item in test_items],
    "summary": [item['summary'] for item in test_items],
})

dataset_dict = DatasetDict({
    "train": train_dataset,
    "validation": val_dataset,
    "test": test_dataset,
})

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Test samples: {len(val_dataset)}")

  from .autonotebook import tqdm as notebook_tqdm


Training samples: 0
Validation samples: 0
Test samples: 0


In [14]:
model_name = "sshleifer/distilbart-cnn-12-6"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess_func(examples):
  inputs = examples["article"]
  targets = examples["summary"]
  model_inputs = tokenizer(inputs, max_length=1024, truncation=True, padding="max_length")

  with tokenizer.as_target_tokenizer():
    labels = tokenizer(targets, max_length=64, truncation=True, padding="max_length")

  model_inputs["labels"] = labels["input_ids"]
  return model_inputs

tokenized_dataset = dataset_dict.map(preprocess_func, batched=True)

In [15]:
training_args = TrainingArguments(
    output_dir="./simple-distilbart-summarizer",
    learning_rate=3e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    logging_dir='./logs',
    save_total_limit=1,
    report_to="none"
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"]
)

# BART


In [35]:
trainer.train()
trainer.save_model("./trained_simple_distilbart")



Step,Training Loss




In [39]:
print(dataset_dict["test"])

Dataset({
    features: ['article', 'summary'],
    num_rows: 21
})


In [None]:
generated_summary = []
reference_summary = []


for item in dataset_dict["test"]:
    article = item["article"]
    summary = item["summary"]

    inputs = tokenizer(
        article,
        return_tensors="pt",
        max_length=1024,
        truncation=True
    ).to(model.device)

    summary_ids = model.generate(**inputs, max_length=80, num_beams=4,length_penalty=2.0,
        early_stopping=True,
        no_repeat_ngram_size=2,  # Avoid repetition
        temperature=0.8,  # Add some randomness
        do_sample=False,)  # Use greedy/beam search)
    gen_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    generated_summary.append(gen_summary)
    reference_summary.append(summary)
    # print("Actual Summary:", summary)
    # print("Generated Summary:", gen_summary)
    


Actual Summary: robbie agreement larne negotiate interested club great make sunderland ''"i hear sunday join sunderland lad say ''he trial number club
Generated Summary:  robbie weir poise join sunderland turn stoke city 17 year old irish league midfielder chase rangers fulham mick mccarthy appear win race larne boss jimmy mcgeough confirm weirs way inver park hear sunday join .
Actual Summary: james nolan 3:46.04 take second man 1500 m neil speaight 3:45.86 offaly man outside european indoor standard lisburn kelly mcneice reid 4:31.34 seventh woman 1500 m gary murray 8:11.22 11th man 3000m gillick half second clear take gold 46.45 .02 outside personal good set saturday semi final woman 60 m final ailis mcsweeney break michelle carroll long stand irish record clock 7.37 leave place deirdre ryan second woman high jump clearance 1.87 m aoife byrne take silver 800 m personal good 2:06.73.colin costello seventh 1500 m final 3:48.82).derval o'rourke break irish 60 m hurdle record clock 8.06

In [47]:
from evaluate import load

rouge = load("rouge")
results = rouge.compute(
    predictions=generated_summary,
    references=reference_summary
)

print(results)


{'rouge1': np.float64(0.35611881813181834), 'rouge2': np.float64(0.2286644241575414), 'rougeL': np.float64(0.25650589793161893), 'rougeLsum': np.float64(0.2547639736749432)}


{'rouge1': np.float64(0.35611881813181834), 'rouge2': np.float64(0.2286644241575414), 'rougeL': np.float64(0.25650589793161893), 'rougeLsum': np.float64(0.2547639736749432)}
Rouge score! btw