# NLP Tasks (Part 1)

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 18/12/2025   | Martin | Create  | Notebook created for various NLP tasks using HF | 
| 22/12/2025   | Martin | Update  | Continued translation. Up to before training model | 
| 23/12/2025   | Martin | Update  | Completed translation. Started on summarisation task | 

# Content

* [Introduction](#introduction)
* [1. Translation](#1-translation)
* [2. Summarisation](#2-summarisation)

# Introduction

Tackle common NLP problems by fine-tuning pretrained language models with the Hugging Face (HF) libraries:

1. Translation
2. Summarisation

# 1. Translation

- Seq-2-Seq task
- Fine-tune an existing pretrained model (e.g. mT5, mBART or Marian; Marian is used here)

<u>Components</u>

- Marian: Pretrained English-to-Chinese translation model (`Helsinki-NLP/opus-mt-en-zh`)
- KDE4 dataset: Localised files for KDE (Apps for Linux desktops)

In [23]:
import evaluate
import numpy as np
import torch
from datasets import load_dataset
from transformers import (
  pipeline,
  get_scheduler,
  AutoTokenizer,
  AutoModelForSeq2SeqLM,
  DataCollatorForSeq2Seq,
  Seq2SeqTrainingArguments,
  Seq2SeqTrainer,
)
from torch.optim import AdamW
from torch.utils.data import DataLoader
from accelerate import Accelerator
from tqdm.auto import tqdm

SEED = 20
MAXLEN = 128

In [2]:
raw_datasets = load_dataset("kde4", lang1="en", lang2="zh_CN")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 139666
    })
})

In [3]:
split_datasets = raw_datasets['train'].train_test_split(train_size=0.9, seed=SEED)
split_datasets['validation'] = split_datasets.pop('test')
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 125699
    })
    validation: Dataset({
        features: ['id', 'translation'],
        num_rows: 13967
    })
})

Each example contains a pair of sentences, one per language. The KDE4 dataset translates many technical terms into the target language, whereas the pretrained model often does not.

In [4]:
# KDE4 dataset
split_datasets['train'][1]['translation']

{'en': 'Installation prefix for Qt', 'zh_CN': 'Qt 的安装前缀'}

In [5]:
# Pretrained model
model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"
translator = pipeline("translation", model=model_checkpoint)
print(translator("Pastes the clipboard contents at the current cursor position into the edit field."))

Device set to use cuda:0


[{'translation_text': '将当前光标位置上的剪贴板内容粘贴到编辑字段中。'}]


In [6]:
# Define pretrained components
# return_tensors belongs to the tokenizer call, not from_pretrained, so it is omitted here
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [7]:
split_datasets['train'][1]['translation']

{'en': 'Installation prefix for Qt', 'zh_CN': 'Qt 的安装前缀'}

In [8]:
# Example of splitting dataset and passing through tokenizer
en_sentence = split_datasets['train'][1]['translation']['en']
cn_sentence = split_datasets['train'][1]['translation']['zh_CN']

inputs = tokenizer(en_sentence, text_target=cn_sentence)
inputs

{'input_ids': [54596, 2765, 594, 10110, 15, 8, 632, 60, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [8, 632, 60, 8, 12, 9613, 637, 56891, 0]}

In [9]:
# Preprocessing the data
def preprocess(examples):
  inputs = [ex['en'] for ex in examples['translation']]
  targets = [ex['zh_CN'] for ex in examples['translation']]
  model_inputs = tokenizer(inputs, text_target=targets, max_length=MAXLEN, truncation=True)

  return model_inputs

tokenized_dataset = split_datasets.map(
  preprocess,
  batched=True,
  remove_columns=split_datasets['train'].column_names
)

`-100` is the padding value used for the labels. The loss function ignores positions with this value, so padded label tokens do not contribute to training.

In [10]:
# Define the data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [11]:
batch = data_collator([tokenized_dataset['train'][i] for i in range(1, 3)])
batch.keys()

KeysView({'input_ids': tensor([[54596,  2765,   594, 10110,    15,     8,   632,    60,     0],
        [  457,     0, 65000, 65000, 65000, 65000, 65000, 65000, 65000]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 0, 0, 0, 0, 0, 0, 0]]), 'labels': tensor([[    8,   632,    60,     8,    12,  9613,   637, 56891,     0],
        [    8, 46315,     0,  -100,  -100,  -100,  -100,  -100,  -100]]), 'decoder_input_ids': tensor([[65000,     8,   632,    60,     8,    12,  9613,   637, 56891],
        [65000,     8, 46315,     0, 65000, 65000, 65000, 65000, 65000]])})
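The batch above can be reproduced by hand. The following is a minimal pure-Python sketch of what the collator appears to do for this batch, not the actual `DataCollatorForSeq2Seq` implementation. The helper names `pad_batch` and `shift_right` are hypothetical, and the pad ID (65000) and the assumption that decoding starts from the pad token are read off the output above.

```python
PAD_ID = 65000        # pad token ID seen in the batch above (assumed to also start decoding)
IGNORE_INDEX = -100   # label padding value that the loss ignores


def pad_batch(features, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Pad inputs with pad_id and labels with ignore_index to the longest item."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_in - len(f["input_ids"])
        n_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_in)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_in)
        batch["labels"].append(f["labels"] + [ignore_index] * n_lab)
    return batch


def shift_right(labels, start_id=PAD_ID, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Build decoder inputs: prepend the start token, drop the last label."""
    shifted = [start_id] + labels[:-1]
    # -100 is not a valid token ID, so it is swapped back to the pad token
    return [pad_id if tok == ignore_index else tok for tok in shifted]


# Token IDs copied from the two examples shown in the output above
features = [
    {"input_ids": [54596, 2765, 594, 10110, 15, 8, 632, 60, 0],
     "labels": [8, 632, 60, 8, 12, 9613, 637, 56891, 0]},
    {"input_ids": [457, 0], "labels": [8, 46315, 0]},
]
batch = pad_batch(features)
decoder_input_ids = [shift_right(labels) for labels in batch["labels"]]
print(batch["labels"][1])
print(decoder_input_ids[1])
```

Padding the labels with `-100` rather than the pad token is what lets the loss skip those positions without needing a separate mask.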

<u>Training Details</u>

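- The decoder receives `decoder_input_ids` (the labels shifted right by one position) together with a causal attention mask, so that when predicting a token it cannot attend to the tokens that come after it
- During evaluation, `generate()` produces the output tokens one at a time
  - Set `predict_with_generate=True` so that the `Seq2SeqTrainer` uses `generate()` during evaluation
- _BLEU score:_ Measures how closely the generated text matches the reference translations, based on n-gram overlap; repeated words and outputs shorter than the reference are penalised
- For translation tasks, a sentence can have several acceptable translations, so the metric expects a list of reference sentences for each prediction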
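The masking in the first bullet can be pictured as a lower-triangular matrix. This is an illustrative sketch only (the function name `causal_mask` is hypothetical, and real models build the mask internally as a tensor): row `i` marks which decoder positions may be attended to when predicting position `i`.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len mask: 1 = may attend, 0 = hidden (later token)."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


for row in causal_mask(4):
    print(row)
```

Each position sees itself and everything before it, which is why the whole target sentence can be fed in at once during training without leaking the answer.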

In [12]:
metric = evaluate.load('sacrebleu')

In [13]:
# An example of good translation
predictions = [
  "This plugin lets you translate web pages between several languages automatically."
]
references = [
  [
    "This plugin allows you to automatically translate web pages between several languages."
  ]
]
metric.compute(predictions=predictions, references=references)

{'score': 46.750469682990165,
 'counts': [11, 6, 4, 3],
 'totals': [12, 11, 10, 9],
 'precisions': [91.66666666666667,
  54.54545454545455,
  40.0,
  33.333333333333336],
 'bp': 0.9200444146293233,
 'sys_len': 12,
 'ref_len': 13}

In [14]:
# An example of poor translation
predictions = ["This This This This"]
references = [
  [
    "This plugin allows you to automatically translate web pages between several languages."
  ]
]
metric.compute(predictions=predictions, references=references)

{'score': 1.683602693167689,
 'counts': [1, 0, 0, 0],
 'totals': [4, 3, 2, 1],
 'precisions': [25.0, 16.666666666666668, 12.5, 12.5],
 'bp': 0.10539922456186433,
 'sys_len': 4,
 'ref_len': 13}
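To see where the `counts`, `totals` and `bp` fields come from, here is a simplified pure-Python sketch of the BLEU calculation for a single prediction and a single reference. It is not the sacreBLEU implementation: the regex tokenizer is a rough stand-in for sacreBLEU's default tokenizer, no smoothing is applied (so any zero n-gram count gives a score of 0, unlike the 1.68 reported above), and the name `simple_bleu` is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Rough stand-in tokenizer: words and punctuation marks become separate tokens
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_n=4):
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred:
        return {"score": 0.0, "counts": [0] * max_n, "totals": [0] * max_n, "bp": 0.0}
    counts, totals = [], []
    for n in range(1, max_n + 1):
        pred_ngrams, ref_ngrams = ngrams(pred, n), ngrams(ref, n)
        # Counter & Counter keeps the minimum count, i.e. clipped matches,
        # which is what stops repeated words from being rewarded
        counts.append(sum((pred_ngrams & ref_ngrams).values()))
        totals.append(sum(pred_ngrams.values()))
    # Brevity penalty: predictions shorter than the reference are scaled down
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    if min(counts) == 0:
        score = 0.0  # no smoothing in this sketch
    else:
        mean_log_precision = sum(math.log(c / t) for c, t in zip(counts, totals)) / max_n
        score = 100 * bp * math.exp(mean_log_precision)
    return {"score": score, "counts": counts, "totals": totals, "bp": bp}


reference = "This plugin allows you to automatically translate web pages between several languages."
good = simple_bleu(
    "This plugin lets you translate web pages between several languages automatically.",
    reference,
)
poor = simple_bleu("This This This This", reference)
print(good)
print(poor["counts"], poor["totals"])
```

For the good translation, the counts, totals and brevity penalty line up with the sacreBLEU output shown above; the poor translation only matches a single clipped unigram.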

In [15]:
def compute_metrics(eval_preds):
  preds, labels = eval_preds

  # If the model returns more than the prediction logits
  if isinstance(preds, tuple):
    preds = preds[0]
  
  decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

  # Replace -100s in labels
  labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
  decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

  # Simple post-processing
  decoded_preds = [pred.strip() for pred in decoded_preds]
  decoded_labels = [[label.strip()] for label in decoded_labels]

  result = metric.compute(predictions=decoded_preds, references=decoded_labels)
  return {'bleu': result['score']}

In [18]:
# Define training arguments
args = Seq2SeqTrainingArguments(
  "marian-finetuned-kde4-en-to-zh_CN",
  eval_strategy="no",
  save_strategy="epoch",
  learning_rate=2e-5,
  per_device_train_batch_size=32,
  per_device_eval_batch_size=64,
  weight_decay=0.01,
  save_total_limit=3,
  num_train_epochs=3,
  predict_with_generate=True,
  fp16=True,
)

trainer = Seq2SeqTrainer(
  model,
  args,
  train_dataset=tokenized_dataset['train'],
  eval_dataset=tokenized_dataset['validation'],
  data_collator=data_collator,
  tokenizer=tokenizer,
  compute_metrics=compute_metrics
)

trainer.evaluate(max_length=MAXLEN)


{'eval_loss': 2.4871249198913574,
 'eval_model_preparation_time': 0.0017,
 'eval_bleu': 28.04402379160332,
 'eval_runtime': 638.9095,
 'eval_samples_per_second': 21.861,
 'eval_steps_per_second': 0.343}

In [19]:
# Run the training loop
trainer.train()

trainer.evaluate(max_length=MAXLEN)

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.


| Step | Training Loss |
| ---- | ------------- |
| 500 | 1.5587 |
| 1000 | 1.3472 |
| 1500 | 1.2883 |
| 2000 | 1.2221 |
| 2500 | 1.1742 |
| 3000 | 1.1834 |
| 3500 | 1.1431 |
| 4000 | 1.1066 |
| 4500 | 1.0324 |
| 5000 | 1.0037 |




{'eval_loss': 0.9346717596054077,
 'eval_model_preparation_time': 0.0017,
 'eval_bleu': 41.547050185618154,
 'eval_runtime': 264.7145,
 'eval_samples_per_second': 52.763,
 'eval_steps_per_second': 0.827,
 'epoch': 3.0}

Fine-tuning raised the validation BLEU score from 28.04 to 41.55, a gain of about 13.5 points, which is a solid improvement.

## Custom PyTorch training loop

In [24]:
tokenized_dataset.set_format("torch")
train_dataloader = DataLoader(
  tokenized_dataset['train'],
  shuffle=True,
  collate_fn=data_collator,
  batch_size=8
)
eval_dataloader = DataLoader(
  tokenized_dataset['validation'],
  collate_fn=data_collator,
  batch_size=8
)

In [28]:
# Training loop
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
optimizer = AdamW(model.parameters(), lr=2e-5)

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
  model, optimizer, train_dataloader, eval_dataloader
)

# Learning rate scheduler
num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
  "linear",
  optimizer=optimizer,
  num_warmup_steps=0,
  num_training_steps=num_training_steps
)
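The `"linear"` schedule with no warmup decays the learning rate in a straight line from its initial value to zero over `num_training_steps`. The sketch below shows that arithmetic in plain Python; the function name `linear_lr` is hypothetical, and the step count assumes a single process, the 125,699 training rows from the split above and a batch size of 8.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step for a linear warmup + linear decay schedule."""
    if step < num_warmup_steps:
        # Ramp up from 0 to base_lr during warmup
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Assumed sizes: 125,699 training examples, batch size 8, 3 epochs, one process
steps_per_epoch = math.ceil(125699 / 8)
total_steps = 3 * steps_per_epoch

print(steps_per_epoch, total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, 2e-5, total_steps))
```

With several processes, `len(train_dataloader)` after `accelerator.prepare` is the per-process length, so the total number of steps would be smaller than in this single-process estimate.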

In [29]:
def postprocess(predictions, labels):
  predictions = predictions.cpu().numpy()
  labels = labels.cpu().numpy()

  decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

  # Replace -100 in the labels as we can't decode them.
  labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
  decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

  # Some simple post-processing
  decoded_preds = [pred.strip() for pred in decoded_preds]
  decoded_labels = [[label.strip()] for label in decoded_labels]
  return decoded_preds, decoded_labels

In [None]:
progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
  model.train()
  for batch in train_dataloader:
    optimizer.zero_grad()
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)

    optimizer.step()
    lr_scheduler.step()
    progress_bar.update(1)
  
  # Evaluation at the end of each epoch
  model.eval()
  for batch in tqdm(eval_dataloader):
    with torch.no_grad():
      generated_tokens = accelerator.unwrap_model(model).generate(
        batch['input_ids'],
        attention_mask=batch['attention_mask'],
        max_length=MAXLEN
      )
    labels = batch['labels']

    # Pad predictions and labels to a common length before gathering across processes
    generated_tokens = accelerator.pad_across_processes(
      generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
    )
    labels = accelerator.pad_across_processes(
      labels, dim=1, pad_index=-100
    )

    # Gather predictions and labels from all processes
    predictions_gathered = accelerator.gather(generated_tokens)
    labels_gathered = accelerator.gather(labels)

    decoded_preds, decoded_labels = postprocess(predictions_gathered, labels_gathered)
    metric.add_batch(predictions=decoded_preds, references=decoded_labels)

  # Compute BLEU over the whole validation set once per epoch
  results = metric.compute()
  print(f"epoch {epoch}, BLEU score: {results['score']:.2f}")

In [None]:
# # Saving the model
# accelerator.wait_for_everyone()
# unwrapped_model = accelerator.unwrap_model(model)
# unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
# if accelerator.is_main_process:
#   tokenizer.save_pretrained(output_dir)
#   repo.push_to_hub(
#     commit_message=f"Training in progress epoch {epoch}", blocking=False
#   )

In [32]:
tokenizer_test = AutoTokenizer.from_pretrained(model_checkpoint)
translator_test = pipeline('translation', model=model, tokenizer=tokenizer_test)

Device set to use cuda:0


In [33]:
translator_test("Default to expanded threads")

[{'translation_text': '默认到扩展线条'}]

---

# 2. Summarisation

Summarisation is a challenging NLP task: the model must understand long passages and generate coherent text that captures the main topic.

- 🤖 Output: Bilingual English & Spanish model that summarises customer reviews
- 💾 Dataset: Multilingual Amazon Reviews Corpus
  - Use the review titles as the target summaries

✒️ NOTE: The dataset is not available on HF, so download it from: https://www.kaggle.com/datasets/mexwell/amazon-reviews-multi/data?select=train.csv

In [47]:
import datasets
from datasets import load_dataset, concatenate_datasets, DatasetDict

In [None]:
data_files = {"train": "train.csv", "validation": "validation.csv", "test": "test.csv"}
total_dataset = load_dataset("./data/amazon_review_multi", data_files=data_files)

eng_dataset = total_dataset.filter(lambda x: x['language'] == 'en')
spa_dataset = total_dataset.filter(lambda x: x['language'] == 'es')
eng_dataset

Filter:   0%|          | 0/1200000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/30000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/30000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1200000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/30000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/30000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['Unnamed: 0', 'review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
})

## Preprocessing

- Target: Title (summary) | Data: Review body
- Selecting only book and ebook reviews
- Remove examples with short titles

In [44]:
def show_samples(
  dataset: datasets.dataset_dict.DatasetDict, 
  num_samples: int=3,
  seed: int=42
):
  sample = dataset['train'].shuffle(seed=seed).select(range(num_samples))
  for example in sample:
    print(f"\n>> Title: {example['review_title']}")
    print(f">> Review: {example['review_body']}")

show_samples(eng_dataset)


>> Title: Worked in front position, not rear
>> Review: 3 stars because these are not rear brakes as stated in the item description. At least the mount adapter only worked on the front fork of the bike that I got it for.

>> Title: meh
>> Review: Does it’s job and it’s gorgeous but mine is falling apart, I had to basically put it together again with hot glue

>> Title: Can't beat these for the money
>> Review: Bought this for handling miscellaneous aircraft parts and hanger "stuff" that I needed to organize; it really fit the bill. The unit arrived quickly, was well packaged and arrived intact (always a good sign). There are five wall mounts-- three on the top and two on the bottom. I wanted to mount it on the wall, so all I had to do was to remove the top two layers of plastic drawers, as well as the bottom corner drawers, place it when I wanted and mark it; I then used some of the new plastic screw in wall anchors (the 50 pound variety) and it easily mounted to the wall. Some ha

In [46]:
# Additional filtering on books and ebooks to reduce dataset size
def filter_books(example: dict):
  return (
    example['product_category'] == "book" or
    example['product_category'] == "digital_ebook_purchase"
  )

eng_books = eng_dataset.filter(filter_books)
spa_books = spa_dataset.filter(filter_books)
show_samples(eng_books)

Filter:   0%|          | 0/200000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/5000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/5000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/200000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/5000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/5000 [00:00<?, ? examples/s]


>> Title: I'm dissapointed.
>> Review: I guess I had higher expectations for this book from the reviews. I really thought I'd at least like it. The plot idea was great. I loved Ash but, it just didnt go anywhere. Most of the book was about their radio show and talking to callers. I wanted the author to dig deeper so we could really get to know the characters. All we know about Grace is that she is attractive looking, Latino and is kind of a brat. I'm dissapointed.

>> Title: Good art, good price, poor design
>> Review: I had gotten the DC Vintage calendar the past two years, but it was on backorder forever this year and I saw they had shrunk the dimensions for no good reason. This one has good art choices but the design has the fold going through the picture, so it's less aesthetically pleasing, especially if you want to keep a picture to hang. For the price, a good calendar

>> Title: Helpful
>> Review: Nearly all the tips useful and. I consider myself an intermediate to advanced use

In [51]:
# Concatenate the english and spanish datasets
books_dataset = DatasetDict()

for split in eng_books.keys():
  books_dataset[split] = concatenate_datasets(
    [eng_books[split], spa_books[split]]
  )
  books_dataset[split] = books_dataset[split].shuffle(seed=42)

show_samples(books_dataset)


>> Title: Easy to follow!!!!
>> Review: I loved The dash diet weight loss Solution. Never hungry. I would recommend this diet. Also the menus are well rounded. Try it. Has lots of the information need thanks.

>> Title: PARCIALMENTE DAÑADO
>> Review: Me llegó el día que tocaba, junto a otros libros que pedí, pero la caja llegó en mal estado lo cual dañó las esquinas de los libros porque venían sin protección (forro).

>> Title: no lo he podido descargar
>> Review: igual que el anterior


In [52]:
# Remove titles with 1-2 words (heuristic: split on whitespace)
books_dataset = books_dataset.filter(lambda x: len(x['review_title'].split(" ")) > 2)
books_dataset

Filter:   0%|          | 0/17612 [00:00<?, ? examples/s]

Filter:   0%|          | 0/424 [00:00<?, ? examples/s]

Filter:   0%|          | 0/442 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 9672
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 238
    })
    test: Dataset({
        features: ['Unnamed: 0', 'review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 245
    })
})
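One caveat with the title filter: `split(" ")` splits on every single space, so doubled or trailing spaces produce empty strings that are counted as words, whereas `split()` with no argument splits on any run of whitespace. The small comparison below illustrates the difference; the title `"Good  book"` (with two spaces) is a made-up example, and the other titles are taken from the samples shown earlier.

```python
def keep_single_space(title):
    # Same rule as the notebook filter: split on each individual space
    return len(title.split(" ")) > 2


def keep_any_whitespace(title):
    # Alternative: split on runs of whitespace, ignoring leading/trailing spaces
    return len(title.split()) > 2


titles = ["Helpful", "Good  book", "Easy to follow!!!!", "meh "]

kept_single = [t for t in titles if keep_single_space(t)]
kept_whitespace = [t for t in titles if keep_any_whitespace(t)]
print(kept_single)
print(kept_whitespace)
```

Switching the notebook filter to `split()` would be the stricter reading of "1-2 words", though it may change the row counts reported above.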

In [2]:
%watermark

Last updated: 2025-06-18T19:03:45.452311+08:00

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.31.0

Compiler    : MSC v.1938 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
CPU cores   : 20
Architecture: 64bit

