# Lightweight Fine-Tuning Project

In this project, we'll build a language detection model using parameter-efficient fine-tuning (PEFT). The model builds on the pre-trained **DistilBERT** model.

The choices made for this project are:

* PEFT technique: LoRA
* Model: distilbert-base-uncased
* Evaluation approach: Accuracy
* Fine-tuning dataset: papluca/language-identification

## Loading and Evaluating a Foundation Model

In the cells below, we load the chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

### Download Language Detection dataset
We download the Language Detection dataset with its three splits (train, validation and test) and keep 5,000 shuffled examples from each split.

In [1]:
# Load the dataset with the Hugging Face datasets library

from datasets import load_dataset
ds = load_dataset("papluca/language-identification")

# Thin out the dataset to make it run faster for this example
splits = ["train", "validation", "test"]

for split in splits:
    ds[split] = ds[split].shuffle(seed=42).select(range(5000))

ds


DatasetDict({
    train: Dataset({
        features: ['labels', 'text'],
        num_rows: 5000
    })
    validation: Dataset({
        features: ['labels', 'text'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['labels', 'text'],
        num_rows: 5000
    })
})

Let's take a peek at the first 5 records of the test split.

In [2]:
dataset = ds['test'][:5]
dataset

{'labels': ['el', 'de', 'bg', 'de', 'vi'],
 'text': ['Και, φυσικά, τα σπουδαία μνημεία της κατανόησης της ελευθερίας από τον δέκατο όγδοο αιώνα είναι το Σύνταγμα και ο Νόμος των Δικαιωμάτων.',
  'Wie schon jemand bemängelte, hat die gelieferte Sitzauflage KEINE Ecken- und keine seitliche Randabdeckung, wie auf dem Bild. Auch ist die Sitzauflage entweder mit nur sehr wenig oder gar keiner Bambus-Holzkohle gefüllt, verglichen mit den Auflagen, die ich früher bei einem anderen Anbieter bestellt habe.',
  'е като колекция от кибрити',
  'Sehr schöne Kugel. Am 10. Oktober gekauft und am 20. Oktober geliefert 😳 Als nette Zugabe befand sich noch ein Deichmann Prospekt in meinem Päckchen 😳 Primavera und Deichmann ? Muss ich das verstehen? Für die Lieferzeit und den konventionellen Prospekt jeweils 1 1/2 Punktabzüge . Werde dort sicher nicht mehr bestellen',
  'Một vài tầng đang cháy có lẽ nằm ngoài khả năng dập tắt của đội cứu hỏa mà chúng ta có thể xử lý.']}
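
The labels are stored as language-code strings, so they need integer class ids before training. As a minimal sketch (not the notebook's own cell), the mapping can be derived from the sorted list of codes rather than typed by hand. The names `LANGUAGE_CODES` and `sample_labels` are introduced here for illustration only.

```python
# Language codes covered by the dataset, as listed in this notebook's label mappings
LANGUAGE_CODES = [
    "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja",
    "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh",
]

# Assign ids in alphabetical order so the mapping is reproducible
label2id = {code: idx for idx, code in enumerate(sorted(LANGUAGE_CODES))}
# Invert the mapping to decode predictions back to language codes
id2label = {idx: code for code, idx in label2id.items()}

sample_labels = ["el", "de", "bg", "de", "vi"]  # labels of the five records shown above
print([label2id[code] for code in sample_labels])  # [3, 2, 1, 2, 18]
```

Building both dictionaries from one list keeps them consistent wherever they are reused.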

### Tokenizer
We import a tokenizer to tokenize the text data provided as input. We also take a sample record to verify that the tokenized data is correct.

In [3]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=False)

# Make sure a pad token is defined (this tokenizer already ships with [PAD], so this is only a safeguard)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# Map each language code to an integer class id
str_to_int = {"ar": 0, "bg": 1, "de": 2, "el": 3, "en": 4, "es": 5, "fr": 6, "hi": 7, "it": 8, "ja": 9, "nl": 10,
              "pl": 11, "pt": 12, "ru": 13, "sw": 14, "th": 15, "tr": 16, "ur": 17, "vi": 18, "zh": 19}

def preprocess_function(batch):
    """Preprocess the dataset by returning tokenized examples."""
    tokenized_batch = tokenizer(batch['text'], padding=True, truncation=True, return_tensors="pt")
    tokenized_batch["labels"] = [str_to_int[label] for label in batch["labels"]]
    return tokenized_batch

tokenized_ds = {}
for split in splits:
    tokenized_ds[split] = ds[split].map(preprocess_function, batched=True)
    tokenized_ds[split] = tokenized_ds[split].rename_column("labels", "label")

# Convert the datasets to PyTorch tensors and keep only the columns the model expects. This is needed to train the LoRA model.
tokenized_ds['train'].set_format('torch', columns=['label', 'input_ids', 'attention_mask'])
tokenized_ds['test'].set_format('torch', columns=['label', 'input_ids', 'attention_mask'])

# Show the first example of the tokenized training set
print(tokenized_ds["train"][0])


{'label': tensor(1), 'input_ids': tensor([  101,  1181, 15290, 29744,  1184, 10260,  1196, 29746, 14150, 29745,
        15290, 19865, 25529, 10260, 29745,  1197, 16856, 10325, 29746,  1188,
        29436, 10325,  1194, 15290, 18947, 22919, 10260, 29741, 14150, 19865,
         1010,  1193, 29746, 10325, 29747, 10260, 29750,  1195, 10260, 29740,
        14150, 22919, 10260, 22919, 10260,  1188,  1197, 10260, 29745,  1010,
         1192, 10260,  1186, 15290, 19865, 22919, 10260,  1182,  1189, 15290,
        29436, 10325,  1012,   102,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,  
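
The trailing zeros in the output above are padding. As a rough pure-Python sketch of what `padding=True` does conceptually, each sequence is extended to the longest one in the batch, and an attention mask marks which positions hold real tokens. The helper `pad_batch` and the toy ids are hypothetical; only the ids 101, 102 and 0 are taken from the output above.

```python
PAD_ID = 0  # id used for padding positions in the tokenized output above

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# 101 and 102 open and close each sequence, as in the output above; the other ids are made up
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"][0])       # [101, 7592, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```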




### Transformer
We import the transformer model. In our case, we decided to use the **DistilBERT** model.

In [4]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=20,  # because we have 20 languages
    id2label={0: "ar", 1: "bg", 2: "de", 3: "el", 4: "en", 5: "es", 6: "fr", 7: "hi", 8: "it", 9: "ja", 10: "nl",
              11: "pl", 12: "pt", 13: "ru", 14: "sw", 15: "th", 16: "tr", 17: "ur", 18: "vi", 19: "zh"},
    label2id={"ar": 0, "bg": 1, "de": 2, "el": 3, "en": 4, "es": 5, "fr": 6, "hi": 7, "it": 8, "ja": 9, "nl": 10,
              "pl": 11, "pt": 12, "ru": 13, "sw": 14, "th": 15, "tr": 16, "ur": 17, "vi": 18, "zh": 19}
)

model

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

### Let's assess the pre-trained model using the accuracy metric

In [5]:
# evaluate base model
import numpy as np
from transformers import Trainer, DataCollatorWithPadding, TrainingArguments

# Accuracy: fraction of examples whose highest-scoring class matches the label
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

training_args = TrainingArguments(
    output_dir="./regular",
    learning_rate=1e-3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# eval loop
trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=tokenized_ds['test'],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics
    )

In [6]:
trainer.evaluate()

{'eval_loss': 2.9903082847595215,
 'eval_accuracy': 0.0624,
 'eval_runtime': 52.7149,
 'eval_samples_per_second': 94.85,
 'eval_steps_per_second': 11.856}

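The accuracy of the pre-trained model is low (**6.24%**), close to the 5% expected from random guessing over 20 languages. This is not surprising, since the classification head is newly initialized, as the warning printed when loading the model notes. Let's fine-tune the model using the LoRA approach.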
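
For reference, here is a small pure-Python sketch of the accuracy computation: take the highest-scoring class per example, then measure the fraction that matches the labels. The function name `accuracy_from_logits` and the toy logits are hypothetical and only illustrate the idea behind `compute_metrics`.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare it with the true label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical logits for four examples and three classes
toy_logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.9, 0.8, 0.7], [0.1, 0.2, 3.0]]
toy_labels = [0, 1, 2, 2]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75

# Expected accuracy of uniform random guessing over the 20 languages
chance_level = 1 / 20
print(chance_level)  # 0.05
```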

## Performing Parameter-Efficient Fine-Tuning

In the cells below, we create a PEFT model from the loaded model, run a training loop, and save the PEFT model weights.

In [7]:
from peft import LoraConfig, get_peft_model, TaskType

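config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.1,
    # Linear layers that receive LoRA adapters. Note: the checkpoint warning above names the
    # layer `pre_classifier`, so the hyphenated 'pre-classifier' entry likely matches nothing.
    target_modules=['q_lin', 'k_lin','v_lin', 'lin1', 'lin2', 'classifier', 'pre-classifier'],
    bias="lora_only",  # also train the bias terms of the adapted layers
    # 'decode_head' does not appear to be a DistilBERT module name, so this entry probably has no effect.
    modules_to_save=["decode_head"],
    task_type=TaskType.SEQ_CLS
)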
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()

trainable params: 3,658,536 || all params: 69,984,552 || trainable%: 5.227633664069179


Only **5.23%** of the parameters will be trained with this LoRA configuration. This is much cheaper than re-training all the weights. Below we can see which parameters will be updated.

In [8]:
# list every parameter that remains trainable after applying LoRA
for name, param in lora_model.named_parameters():
    if param.requires_grad:
        print(name, param.shape)

base_model.model.distilbert.transformer.layer.0.attention.q_lin.bias torch.Size([768])
base_model.model.distilbert.transformer.layer.0.attention.q_lin.lora_A.default.weight torch.Size([32, 768])
base_model.model.distilbert.transformer.layer.0.attention.q_lin.lora_B.default.weight torch.Size([768, 32])
base_model.model.distilbert.transformer.layer.0.attention.k_lin.bias torch.Size([768])
base_model.model.distilbert.transformer.layer.0.attention.k_lin.lora_A.default.weight torch.Size([32, 768])
base_model.model.distilbert.transformer.layer.0.attention.k_lin.lora_B.default.weight torch.Size([768, 32])
base_model.model.distilbert.transformer.layer.0.attention.v_lin.bias torch.Size([768])
base_model.model.distilbert.transformer.layer.0.attention.v_lin.lora_A.default.weight torch.Size([32, 768])
base_model.model.distilbert.transformer.layer.0.attention.v_lin.lora_B.default.weight torch.Size([768, 32])
base_model.model.distilbert.transformer.layer.0.ffn.lin1.bias torch.Size([3072])
base_model
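
The shapes in the listing above show where the savings come from. Each adapted layer gets two small matrices, `lora_A` of shape (r, in_features) and `lora_B` of shape (out_features, r), instead of a full weight update. The following sketch reproduces the arithmetic. The helper name `lora_param_count` is hypothetical, and the 768-to-3072 shape of `lin1` is inferred from its 3072-entry bias in the listing.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters in one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank

rank = 32
attention_adapter = lora_param_count(768, 768, rank)  # q_lin, k_lin or v_lin
ffn_adapter = lora_param_count(768, 3072, rank)       # lin1 (assumed 768 -> 3072)
full_attention_weight = 768 * 768                     # size of the frozen weight matrix

print(attention_adapter)      # 49152
print(ffn_adapter)            # 122880
print(full_attention_weight)  # 589824

# Share of trainable parameters reported by print_trainable_parameters()
trainable, total = 3_658_536, 69_984_552
print(f"{100 * trainable / total:.2f}%")  # 5.23%
```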

In [9]:
# training the lora model

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./data/distillbert-languages-lora",
    learning_rate=1e-3,
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    save_total_limit=3,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_steps=5,
    remove_unused_columns=False,
)

trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=tokenized_ds['train'],
    eval_dataset=tokenized_ds['test'],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.2517 | 0.099619 | 0.9808 |
| 2 | 0.0012 | 0.092524 | 0.9862 |


TrainOutput(global_step=1250, training_loss=0.12442059442400932, metrics={'train_runtime': 568.6428, 'train_samples_per_second': 17.586, 'train_steps_per_second': 2.198, 'total_flos': 1414126275932160.0, 'train_loss': 0.12442059442400932, 'epoch': 2.0})
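
The `global_step` and throughput figures in the training output can be checked with simple arithmetic. This sketch assumes a single device and no gradient accumulation, neither of which is stated explicitly in the notebook.

```python
import math

train_examples = 5000  # size of the subsampled training split
batch_size = 8         # per_device_train_batch_size
epochs = 2             # num_train_epochs

# Assumes a single device and no gradient accumulation
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 625 1250

train_runtime = 568.6428  # seconds, from the TrainOutput above
print(round(train_examples * epochs / train_runtime, 3))  # 17.586
```

Both values agree with `global_step=1250` and `train_samples_per_second` in the output.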

In [10]:
lora_model.save_pretrained("distillbert-lora")

## Performing Inference with a PEFT Model

In the cells below, we load the saved PEFT model weights and evaluate the performance of the trained PEFT model, comparing the results to those from prior to fine-tuning.

Let's load the model:

In [11]:
import torch
from peft import AutoPeftModelForSequenceClassification
# Select the device to use (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
lora_model = AutoPeftModelForSequenceClassification.from_pretrained("distillbert-lora", num_labels=20).to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Let's create a new trainer with the loaded model:

In [14]:
trainer_loaded_model = Trainer(
    model=lora_model,
    tokenizer=tokenizer,
    eval_dataset=tokenized_ds['test'],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics
)

Let's evaluate the model:

In [15]:
trainer_loaded_model.evaluate()

{'eval_loss': 0.09252355992794037,
 'eval_accuracy': 0.9862,
 'eval_runtime': 67.6795,
 'eval_samples_per_second': 73.878,
 'eval_steps_per_second': 9.235}

The PEFT model performs far better than the pre-trained model: accuracy rises from **6.24%** to **98.62%** on the same test examples. The comparison is not entirely fair, because the baseline was evaluated as is, with a newly initialized classification head and no training. Still, LoRA let us reach this accuracy while training only about **5%** of the parameters, so the improvement comes at low cost.
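
To make the comparison concrete, the reported accuracies can be converted into counts of correctly classified test examples. This is plain arithmetic on the two evaluation outputs, assuming both runs used the same 5,000-example test split.

```python
test_examples = 5000        # size of the subsampled test split
baseline_accuracy = 0.0624  # eval_accuracy before fine-tuning
lora_accuracy = 0.9862      # eval_accuracy after LoRA fine-tuning

baseline_correct = round(test_examples * baseline_accuracy)
lora_correct = round(test_examples * lora_accuracy)
lora_errors = test_examples - lora_correct

print(baseline_correct)  # 312
print(lora_correct)      # 4931
print(lora_errors)       # 69
```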

### Inference
Let's apply the PEFT model on a couple of random examples:
* "It's a beautiful day in the park today." (English)
* "Je pense, donc je suis." (French)
* "Vamos a la playa!" (Spanish)
* "Goede morgen, I ben moe" (Dutch)

In [16]:
test_strings = ["'It's a beautiful day in the park today.",
               "Je pense, donc je suis",
               "Vamos a la playa!",
               "Goede morgen, I ben moe"]

In [25]:
from transformers import AutoTokenizer
id2label={0: "ar", 1: "bg", 2: "de", 3:"el", 4:"en", 5:"es", 6:"fr", 7:"hi", 8:"it", 9:"ja", 10:"nl", \
              11: "pl", 12:"pt", 13:"ru", 14:"sw", 15:"th", 16:"tr", 17:"ur", 18:"vi", 19:"zh"}

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=False)

for test_string in test_strings:
    input_ids = tokenizer(test_string, return_tensors="pt").input_ids.to(device)
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = lora_model(input_ids=input_ids)
    # Take the highest-scoring class and map it back to a language code
    predicted_id = outputs.logits.argmax(-1).item()
    print(id2label[predicted_id])


en
fr
es
nl


Awesome! We can see that the loaded model provides the right predictions for the 4 strings that we provided.
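
The inference loop prints only the top label. If a confidence score is also wanted, the logits can be passed through a softmax. The sketch below shows the idea in pure Python on made-up logits over a three-language subset. The names `softmax`, `toy_id2label` and `toy_logits` are hypothetical and not part of the notebook.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits over a three-language subset, for illustration only
toy_id2label = {0: "en", 1: "fr", 2: "nl"}
toy_logits = [0.5, 0.2, 4.0]

probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(toy_id2label[best], round(probs[best], 2))
```

With the real model, the same step could be applied to `outputs.logits` for each test string.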