# NLP Tasks (Part 1)

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 18/12/2025   | Martin | Create  | Notebook created for various NLP tasks using HF | 
| 22/12/2025   | Martin | Update  | Continued the translation section, up to just before model training | 

# Content

* [Introduction](#introduction)
* [1. Translation](#1-translation)

# Introduction

Tackle common NLP problems by fine-tuning pretrained language models with the Hugging Face (HF) libraries:

1. Translation
2. Summarisation

# 1. Translation

- Sequence-to-sequence (seq2seq) task
- Fine-tune an existing language model (mT5, mBART or Marian; Marian is used here)

<u>Components</u>

- Marian: English-to-Chinese translation model (`Helsinki-NLP/opus-mt-en-zh`)
- KDE4 dataset: Localised files for KDE applications (apps for Linux desktops)

In [32]:
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import (
  pipeline,
  AutoTokenizer,
  AutoModelForSeq2SeqLM,
  DataCollatorForSeq2Seq,
  Seq2SeqTrainingArguments,
  Seq2SeqTrainer
)

SEED = 20
MAXLEN = 128

In [2]:
raw_datasets = load_dataset("kde4", lang1="en", lang2="zh_CN")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 139666
    })
})

In [3]:
split_datasets = raw_datasets['train'].train_test_split(train_size=0.9, seed=SEED)
split_datasets['validation'] = split_datasets.pop('test')
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 125699
    })
    validation: Dataset({
        features: ['id', 'translation'],
        num_rows: 13967
    })
})

Each example in the dataset contains two sentences, one per language. The KDE4 dataset translates many technical terms into the target language, whereas the pretrained model does not.

In [4]:
# KDE4 dataset
split_datasets['train'][1]['translation']

{'en': 'Installation prefix for Qt', 'zh_CN': 'Qt 的安装前缀'}

In [5]:
# Pretrained model
model_checkpoint = "Helsinki-NLP/opus-mt-en-zh"
translator = pipeline("translation", model=model_checkpoint)
print(translator("Pastes the clipboard contents at the current cursor position into the edit field."))

Device set to use cuda:0


[{'translation_text': '将当前光标位置上的剪贴板内容粘贴到编辑字段中。'}]


In [7]:
# Define pretrained components
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)



In [9]:
# Example of splitting dataset and passing through tokenizer
en_sentence = split_datasets['train'][1]['translation']['en']
cn_sentence = split_datasets['train'][1]['translation']['zh_CN']

inputs = tokenizer(en_sentence, text_target=cn_sentence)
inputs

{'input_ids': [54596, 2765, 594, 10110, 15, 8, 632, 60, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [8, 632, 60, 8, 12, 9613, 637, 56891, 0]}

In [10]:
# Preprocessing the data
def preprocess(examples):
  inputs = [ex['en'] for ex in examples['translation']]
  targets = [ex['zh_CN'] for ex in examples['translation']]
  model_inputs = tokenizer(inputs, text_target=targets, max_length=MAXLEN, truncation=True)

  return model_inputs

tokenized_dataset = split_datasets.map(
  preprocess,
  batched=True,
  remove_columns=split_datasets['train'].column_names
)

The data collator pads each batch dynamically. Labels are padded with `-100`, a value the loss function ignores, so padded positions do not contribute to training.

In [11]:
# Define the data collator
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [12]:
batch = data_collator([tokenized_dataset['train'][i] for i in range(1, 3)])
batch.keys()

KeysView({'input_ids': tensor([[54596,  2765,   594, 10110,    15,     8,   632,    60,     0],
        [  457,     0, 65000, 65000, 65000, 65000, 65000, 65000, 65000]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 0, 0, 0, 0, 0, 0, 0]]), 'labels': tensor([[    8,   632,    60,     8,    12,  9613,   637, 56891,     0],
        [    8, 46315,     0,  -100,  -100,  -100,  -100,  -100,  -100]]), 'decoder_input_ids': tensor([[65000,     8,   632,    60,     8,    12,  9613,   637, 56891],
        [65000,     8, 46315,     0, 65000, 65000, 65000, 65000, 65000]])})
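The relationship between `labels` and `decoder_input_ids` in the batch above can be reproduced by hand. The following is a minimal pure-Python sketch, not the collator's actual implementation: `pad_labels` and `shift_right` are hypothetical helper names, and it assumes that the pad token id (65000) also serves as the decoder start token, which is what the output above suggests.

```python
PAD_ID = 65000        # pad id seen in the batch above; assumed to double as the decoder start token
IGNORE_INDEX = -100   # label padding value that the loss ignores

def pad_labels(label_rows, pad_value=IGNORE_INDEX):
    """Right-pad every label sequence to the longest one in the batch."""
    width = max(len(row) for row in label_rows)
    return [row + [pad_value] * (width - len(row)) for row in label_rows]

def shift_right(label_rows, start_id=PAD_ID, pad_id=PAD_ID):
    """Build decoder inputs: prepend the start token, drop the last label, swap -100 for pad."""
    shifted = [[start_id] + row[:-1] for row in label_rows]
    return [[pad_id if tok == IGNORE_INDEX else tok for tok in row] for row in shifted]

# Label sequences of the two examples shown in the batch above
labels = pad_labels([
    [8, 632, 60, 8, 12, 9613, 637, 56891, 0],
    [8, 46315, 0],
])
decoder_input_ids = shift_right(labels)

for row in decoder_input_ids:
    print(row)
```

Each decoder input row is the matching label row moved one position to the right, so at every step the decoder sees only the tokens that precede the one it has to predict.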

<u>Training Details</u>

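- During training, the decoder receives `decoder_input_ids` (the labels shifted right by one position) together with a causal attention mask, so no later tokens are visible when predicting the current token
- During evaluation, `generate()` is used to generate tokens one by one
  - Need to set `predict_with_generate=True` in `Seq2SeqTrainingArguments`
- _BLEU score:_ Evaluates how close generations are to the reference translations, penalising repeated words and outputs that are shorter than the reference
- For translation tasks: A sentence can have several acceptable translations, so the references for each prediction are passed as a list of sentences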
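The first two points can be illustrated without a real model. The sketch below is a toy illustration under stated assumptions: `causal_mask`, `greedy_decode` and `toy_next_token` are hypothetical names, and the "model" simply replays a fixed sequence instead of running a forward pass. It is not how `generate()` is implemented, only the idea behind it.

```python
def causal_mask(length):
    """Row i marks which decoder positions step i may attend to (1 = visible)."""
    return [[1 if j <= i else 0 for j in range(length)] for i in range(length)]

def greedy_decode(next_token, start_id, eos_id, max_new_tokens=10):
    """Generate one token at a time, feeding each prediction back in as input."""
    tokens = [start_id]
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # hypothetical stand-in for a model forward pass
        tokens.append(token)
        if token == eos_id:
            break
    return tokens

# Toy "model": replays a fixed target sequence, one token per step
target = [8, 632, 60, 0]

def toy_next_token(tokens):
    return target[len(tokens) - 1]

mask = causal_mask(4)
decoded = greedy_decode(toy_next_token, start_id=65000, eos_id=0)

for row in mask:
    print(row)
print(decoded)  # [65000, 8, 632, 60, 0]
```

The lower-triangular mask is what lets training process the whole target in one pass, while the decoding loop shows why evaluation is slower: every new token needs the previous ones to exist first.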

In [18]:
metric = evaluate.load('sacrebleu')

In [None]:
# An example of a good translation
predictions = [
  "This plugin lets you translate web pages between several languages automatically."
]
references = [
  [
    "This plugin allows you to automatically translate web pages between several languages."
  ]
]
metric.compute(predictions=predictions, references=references)

{'score': 46.750469682990165,
 'counts': [11, 6, 4, 3],
 'totals': [12, 11, 10, 9],
 'precisions': [91.66666666666667,
  54.54545454545455,
  40.0,
  33.333333333333336],
 'bp': 0.9200444146293233,
 'sys_len': 12,
 'ref_len': 13}

In [20]:
# An example of a poor translation
predictions = ["This This This This"]
references = [
  [
    "This plugin allows you to automatically translate web pages between several languages."
  ]
]
metric.compute(predictions=predictions, references=references)

{'score': 1.683602693167689,
 'counts': [1, 0, 0, 0],
 'totals': [4, 3, 2, 1],
 'precisions': [25.0, 16.666666666666668, 12.5, 12.5],
 'bp': 0.10539922456186433,
 'sys_len': 4,
 'ref_len': 13}
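To see how the fields in the two outputs above combine into `score`, the sketch below recomputes BLEU from `counts`, `totals`, `sys_len` and `ref_len`. It is a hand-rolled approximation, not sacreBLEU's code: `bleu_from_counts` is a hypothetical helper, and the handling of zero counts assumes exponential smoothing (each successive zero count is replaced by a pseudo-count that halves), which is consistent with the precisions reported for the poor translation. It also assumes every entry in `totals` is non-zero.

```python
import math

def bleu_from_counts(counts, totals, sys_len, ref_len):
    """Recombine n-gram statistics into a BLEU score on the 0-100 scale."""
    precisions = []
    smooth = 1.0
    for matched, total in zip(counts, totals):
        if matched == 0:
            # Assumed 'exp' smoothing: every zero count gets a halved pseudo-count
            smooth *= 2
            precisions.append(1.0 / (smooth * total))
        else:
            precisions.append(matched / total)
    # Brevity penalty: shrinks the score when the output is shorter than the reference
    bp = 1.0 if sys_len >= ref_len else math.exp(1 - ref_len / sys_len)
    # Geometric mean of the 1- to 4-gram precisions
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return 100 * bp * geo_mean

good = bleu_from_counts([11, 6, 4, 3], [12, 11, 10, 9], sys_len=12, ref_len=13)
poor = bleu_from_counts([1, 0, 0, 0], [4, 3, 2, 1], sys_len=4, ref_len=13)

print(round(good, 2))  # 46.75
print(round(poor, 2))  # 1.68
```

The poor translation is hit twice: repeated words match the reference only once, which lowers the precisions, and the brevity penalty (`bp` of about 0.105) scales the result down further because the output is much shorter than the reference.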

In [30]:
def compute_metrics(eval_preds):
  preds, labels = eval_preds

  # If the model returns more than the prediction logits
  if isinstance(preds, tuple):
    preds = preds[0]
  
  decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

  # Replace -100s in labels
  labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
  decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

  # Simple post-processing
  decoded_preds = [pred.strip() for pred in decoded_preds]
  decoded_labels = [[label.strip()] for label in decoded_labels]

  result = metric.compute(predictions=decoded_preds, references=decoded_labels)
  return {'bleu': result['score']}

In [None]:
# Define training arguments
args = Seq2SeqTrainingArguments(

)

In [2]:
%watermark

Last updated: 2025-06-18T19:03:45.452311+08:00

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.31.0

Compiler    : MSC v.1938 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
CPU cores   : 20
Architecture: 64bit

