# LLM assignment
In this assignment I fine-tune a pre-trained, medium-sized BERT-type language model on a downstream task of my choosing. I've decided to fine-tune DistilBERT on a movie reviews dataset from Hugging Face.

In [None]:
from huggingface_hub import login
login()

In [7]:
!pip install datasets



## Library imports and function definitions

In [5]:
from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer

## Downloading the model and the dataset

In [6]:
# Downloading distilbert
model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [3]:
# Checking the number of parameters
distilbert_num_parameters = model.num_parameters() / 1_000_000
print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")

'>>> DistilBERT number of parameters: 67M'


Let's first check how the model will do before fine-tuning.

In [7]:
# getting tokenizer to produce inputs for the model
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [27]:
import torch
# My review
text1 = "I loved the [MASK] it was easily one of my top 3 faves from the year"
# Move the inputs to the same device as the model to avoid a CPU/GPU mismatch
inputs = tokenizer(text1, return_tensors="pt").to(model.device)
with torch.no_grad():
    token_logits = model(**inputs).logits
# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]
# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text1.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

Running this cell with the inputs left on the CPU raised `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)`, since the model was already on `cuda:0`. The inputs therefore have to be moved to `model.device` first.

I've decided to use the Rotten Tomatoes dataset, which contains 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews.<br> Pang, B., & Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the ACL.

In [8]:
from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes")

In [9]:
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

## Pre-processing

In [10]:
# Function to tokenize
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result


# Use batched=True to activate fast multithreading!
tokenized_datasets = ds.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 1066
    })
})

In [11]:
# Checking maximum context size to decide chunk size
tokenizer.model_max_length

512

In [12]:
# Function to concatenate the texts and split them into chunks
chunk_size = 128
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size
    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }
    # Create a new labels column
    result["labels"] = result["input_ids"].copy()
    return result
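
A toy run makes the concatenate-then-chunk behaviour easier to see. The sketch below repeats the same logic on hand-made lists with a chunk size of 4 instead of 128. `toy_group_texts` and the sample ids are illustrative stand-ins, not part of the notebook.

```python
def toy_group_texts(examples, chunk_size):
    """Same logic as group_texts, with chunk_size passed in explicitly."""
    # Flatten each column (a list of lists) into one long list
    concatenated = {k: sum(v, []) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole chunk
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated.items()
    }
    # For masked language modeling the labels start as a copy of the inputs
    result["labels"] = result["input_ids"].copy()
    return result


# Three "reviews" of 3, 5 and 2 tokens: 10 tokens in total
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
chunks = toy_group_texts(batch, chunk_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]; tokens 9 and 10 are dropped
```

Note that chunk boundaries ignore review boundaries: the second toy review is split across both chunks. This is also why the number of rows shrinks after mapping, since many short reviews are packed into fewer fixed-length chunks.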


In [13]:
# Mapping the group_texts function over our tokenized dataset
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1820
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 226
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 230
    })
})

## Fine-tuning with the trainer API

In [14]:
# Collator to randomly mask tokens
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
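
To make the role of `mlm_probability=0.15` more concrete, the sketch below mimics the BERT-style masking rule in plain Python. Each token is selected with probability 0.15. Of the selected tokens, roughly 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. Unselected positions get the label `-100`, which the loss ignores. This is an illustration under stated assumptions, not the library's implementation: `toy_mlm_mask`, the toy vocabulary size, and the hard-coded mask id are all hypothetical.

```python
import random

MASK_ID = 103        # assumed id of [MASK]; the real value comes from tokenizer.mask_token_id
VOCAB_SIZE = 1000    # toy vocabulary size, for illustration only
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def toy_mlm_mask(input_ids, mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) following the 80/10/10 masking rule."""
    rng = rng or random.Random()
    masked_ids = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        if rng.random() >= mlm_probability:
            continue  # token not selected: left as-is and ignored by the loss
        labels[i] = token_id  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked_ids[i] = MASK_ID                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked_ids[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked_ids, labels


ids = list(range(200, 1200))  # 1,000 fake token ids
masked, labels = toy_mlm_mask(ids, rng=random.Random(0))
selected = sum(label != IGNORE_INDEX for label in labels)
print(f"selected {selected} of {len(ids)} tokens ({selected / len(ids):.1%})")
```

Because the selection is random, the share of selected tokens only approximates 15% on any single batch.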

In [15]:
# function to create another collator that will mask words as a whole
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)
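
The word-to-token mapping is the core of the collator above, so here is a small sketch of it in isolation. The `word_ids` list and token ids are made up for illustration (a three-piece word followed by a one-piece word, wrapped in special tokens), and `build_word_to_tokens` is a hypothetical helper name that simply extracts the mapping loop.

```python
import collections


def build_word_to_tokens(word_ids):
    """Group token positions by word; special tokens have word_id None and are skipped."""
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is not None:
            if word_id != current_word:
                current_word = word_id
                current_word_index += 1
            mapping[current_word_index].append(idx)
    return dict(mapping)


# Hypothetical word_ids for "[CLS] <3-piece word> <1-piece word> [SEP]"
word_ids = [None, 0, 0, 0, 1, None]
mapping = build_word_to_tokens(word_ids)
print(mapping)  # {0: [1, 2, 3], 1: [4]}

MASK_ID = 103  # assumed [MASK] id, for illustration only
input_ids = [101, 11, 12, 13, 14, 102]  # made-up token ids
labels = [-100] * len(input_ids)
for idx in mapping[0]:  # mask word 0 as a whole
    labels[idx] = input_ids[idx]
    input_ids[idx] = MASK_ID
print(input_ids)  # [101, 103, 103, 103, 14, 102]
print(labels)     # [-100, 11, 12, 13, -100, -100]
```

All three pieces of the first word are masked together, which is the difference from token-level masking, where a single sub-word piece could be hidden while its neighbours remain visible.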

In [16]:
# Downsampling the training set because of GPU restrictions
train_size = 1500
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1500
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 150
    })
})

In [17]:
from transformers import TrainingArguments

# Specifying trainer parameters
batch_size = 64
# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-rottentomatoesdataset",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
    remove_unused_columns=False
)



In [18]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=whole_word_masking_data_collator,
    tokenizer=tokenizer,
)

In [19]:
import math
# Using perplexity as an evaluation metric
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 148.69


In [20]:
# Training
trainer.train()

Epoch,Training Loss,Validation Loss
1,4.5121,4.122392
2,4.1703,4.109545
3,4.112,4.006074


TrainOutput(global_step=72, training_loss=4.260394652684529, metrics={'train_runtime': 31.4708, 'train_samples_per_second': 142.99, 'train_steps_per_second': 2.288, 'total_flos': 149131300608000.0, 'train_loss': 4.260394652684529, 'epoch': 3.0})
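
The `global_step=72` in the output can be reproduced with a little arithmetic. This sketch assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept. The variable names are illustrative.

```python
import math

train_examples = 1500   # size of the downsampled training split
batch_size = 64
num_epochs = 3          # the three epochs shown in the training log

# The last, partial batch of each epoch still counts as one optimizer step
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
# Value passed to TrainingArguments(logging_steps=...)
logging_steps = train_examples // batch_size

print(steps_per_epoch, total_steps, logging_steps)  # 24 72 23
```

Since `logging_steps` uses floor division (23) while an epoch takes 24 steps, the logged training loss is recorded slightly before each epoch boundary rather than exactly on it.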

In [21]:
# Checking perplexity again to see if the domain adaptation improved the results
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 56.98
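
Perplexity here is just the exponential of the mean cross-entropy loss, so the numbers above can be cross-checked by hand. The sketch below uses only values already reported in this notebook. `perplexity` is a hypothetical helper name.

```python
import math


def perplexity(loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(loss)


# Validation loss logged after epoch 3 in the training table
print(f"{perplexity(4.006074):.2f}")  # about 54.93

# Going the other way: losses implied by the two reported perplexities
loss_before = math.log(148.69)  # roughly 5.00
loss_after = math.log(56.98)    # roughly 4.04
print(f"{loss_before:.2f} -> {loss_after:.2f}")
```

The epoch-3 validation loss corresponds to a perplexity of about 54.93, slightly below the 56.98 measured afterwards. A plausible reason is that the whole-word masking collator draws fresh random masks on every evaluation pass, so the evaluation loss fluctuates from run to run.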


In [22]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/gobucbabu/distilbert-base-uncased-finetuned-rottentomatoesdataset/commit/ffe4c9584a4a76df344e1f4cc946d68057204e36', commit_message='End of training', commit_description='', oid='ffe4c9584a4a76df344e1f4cc946d68057204e36', pr_url=None, pr_revision=None, pr_num=None)

Checking whether the fine-tuning changed anything.

In [24]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", model="distilbert-base-uncased-finetuned-rottentomatoesdataset"
)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [29]:
# Checking again with our text
text1 = "The [MASK] of Dune: Part Two was INCREDIBLE. Every single frame was a painting. Hands down Greig Fraser's best work till date."
text2 = "I loved the [MASK] "

preds = mask_filler(text2)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> i loved the.
>>> i loved the ;
>>> i loved the!
>>> i loved the world
>>> i loved the sea
