# Deep Learning and Applied AI (DLAI) Project: **Fine-Tuning a Large Language Model (LLM) for Italian-to-Neapolitan Dialect Translation**

# Part II: Fine-Tuning

### Author: Aur Marina Iuliana, 1809715

# 1. Import Libraries

In [2]:
import torch
import wandb
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from trl import SFTConfig, SFTTrainer, DataCollatorForCompletionOnlyLM
import numpy as np
import pandas as pd
from datasets import Dataset, load_dataset
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from peft import get_peft_model, LoraConfig, TaskType
import random

In [3]:
# Check CUDA availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(torch.cuda.is_available())
print(torch.version.cuda)

# Seed for reproducibility
seed = 1234
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)
random.seed(seed)

True
12.4


In [4]:
# Initialize wandb for experiment tracking
wandb.init(
    project = "nap-dialect-finetuning",
    name = "nap-dialect-gemma-2-2b-it_finetuning",
    reinit = True)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


[34m[1mwandb[0m: Currently logged in as: [33mmarinaaur[0m ([33mmarinaaur-sapienza[0m). Use [1m`wandb login --relogin`[0m to force relogin


# 2. Modelling

## 2.1 Gemma-2-2B-it


**Gemma-2-2B-it** is an instruction-tuned large language model (LLM) designed for multilingual dialogue scenarios, making it a suitable candidate for our project, where we aim to fine-tune the model to improve its understanding and generation of the Neapolitan dialect.

In [5]:
model_id = "google/gemma-2-2b-it"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map = "auto",
    torch_dtype = torch.bfloat16,
    trust_remote_code = True,
)

model.gradient_checkpointing_enable()
model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  2.35it/s]


In [6]:
model

Gemma2ForCausalLM(
  (model): Gemma2Model(
    (embed_tokens): Embedding(256000, 2304, padding_idx=0)
    (layers): ModuleList(
      (0-25): 26 x Gemma2DecoderLayer(
        (self_attn): Gemma2Attention(
          (q_proj): Linear(in_features=2304, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2304, out_features=1024, bias=False)
          (v_proj): Linear(in_features=2304, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2304, bias=False)
          (rotary_emb): Gemma2RotaryEmbedding()
        )
        (mlp): Gemma2MLP(
          (gate_proj): Linear(in_features=2304, out_features=9216, bias=False)
          (up_proj): Linear(in_features=2304, out_features=9216, bias=False)
          (down_proj): Linear(in_features=9216, out_features=2304, bias=False)
          (act_fn): PytorchGELUTanh()
        )
        (input_layernorm): Gemma2RMSNorm((2304,), eps=1e-06)
        (pre_feedforward_layernorm): Gemma2RMSNorm((2304,), eps

## 2.2 Low-Rank Adaptation (LoRA)

Large Language Models (LLMs) are often *high-dimensional* and *over-parametrized*, while the information encoded in them can often be well approximated in a much lower dimension (an observation related to the [lottery ticket hypothesis](https://arxiv.org/abs/1803.03635)). Low-Rank Adaptation (LoRA) is a **Parameter-Efficient Fine-Tuning (PEFT)** technique based on the idea that the updates to an LLM's weights during fine-tuning have a low "intrinsic rank". It freezes the pretrained weights of the model and trains a pair of **low-rank matrices** whose product is added to them:

$$
W_0 + \Delta W = W_0 + BA
$$

where:
- $W_0$ represents the weight matrix of the original dense layer.
- $A$ and $B$ are the low-rank matrices, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$.
- $r$ denotes the rank of these matrices, with $r > 0$ and $r \ll \min(d, k)$ (the LoRA paper reports that values as small as 1 to 4 can already work well on some tasks; this project uses $r = 16$).

When training updates are approximately low-rank, matrices $A$ and $B$ can closely approximate the true update.

**Low-rank approximation** decomposes a large matrix into a product of lower-dimensional matrices, reducing the number of parameters that need fine-tuning (*proxy parameters*). Fewer trainable parameters mean lower GPU memory usage and typically shorter training time, while largely preserving the quality of the outputs.
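To make the decomposition concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. It uses the `q_proj` shapes from the model printout above ($d = 2048$, $k = 2304$) and $r = 16$. The $\alpha / r$ scaling and the zero initialisation of $B$ follow the usual LoRA recipe and are assumptions here, since the equation above omits the scaling factor. The helper name `lora_forward` and the random weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 2048, 2304, 16   # output dim, input dim, rank
alpha = 32                 # scaling numerator (plays the role of lora_alpha)

W0 = rng.standard_normal((d, k)).astype(np.float32)          # frozen pretrained weight
A = (rng.standard_normal((r, k)) * 0.01).astype(np.float32)  # trainable, small random init
B = np.zeros((d, r), dtype=np.float32)                       # trainable, zero init


def lora_forward(x, W0, A, B, alpha, r):
    """Apply (W0 + (alpha / r) * B @ A) to x without building the d x k update."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))


x = rng.standard_normal(k).astype(np.float32)

# With B = 0 the update BA is zero, so the adapted layer starts out
# identical to the frozen base layer.
base_out = W0 @ x
lora_out = lora_forward(x, W0, A, B, alpha, r)
starts_at_base = bool(np.allclose(base_out, lora_out))

full_params = d * k        # parameters in a dense update of W0
lora_params = r * (d + k)  # parameters in B and A together

print(starts_at_base, full_params, lora_params)
```

For this one projection, the low-rank pair holds $r(d + k)$ values instead of the $d \cdot k$ values a dense update would need.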

In [7]:
# Define LoRA model
lora_config = LoraConfig(
    task_type = TaskType.CAUSAL_LM,
    r = 16,  # rank
    lora_alpha = 32,  # scaling factor
    lora_dropout = 0.2,  # dropout probability
    bias = "none",  # bias configuration
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",]
)

lora_model = get_peft_model(model, lora_config)

In [8]:
lora_model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Gemma2ForCausalLM(
      (model): Gemma2Model(
        (embed_tokens): Embedding(256000, 2304, padding_idx=0)
        (layers): ModuleList(
          (0-25): 26 x Gemma2DecoderLayer(
            (self_attn): Gemma2Attention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=2304, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.2, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2304, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_

In [9]:
lora_model.print_trainable_parameters()

trainable params: 20,766,720 || all params: 2,635,108,608 || trainable%: 0.7881
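The trainable-parameter count can be reproduced by hand. Each adapted projection adds $r \cdot (\text{in\_features} + \text{out\_features})$ weights, and the same seven projections are adapted in every decoder layer. The sketch below assumes the layer shapes and the 26-layer depth shown in the model printout, and takes the total parameter count directly from the output line above.

```python
# (in_features, out_features) of each targeted projection, read from the model printout
target_shapes = {
    "q_proj": (2304, 2048),
    "k_proj": (2304, 1024),
    "v_proj": (2304, 1024),
    "o_proj": (2048, 2304),
    "gate_proj": (2304, 9216),
    "up_proj": (2304, 9216),
    "down_proj": (9216, 2304),
}
rank = 16
num_layers = 26
all_params = 2_635_108_608  # copied from print_trainable_parameters()

# lora_A has shape (rank, in_features) and lora_B has shape (out_features, rank)
per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in target_shapes.values())
trainable = per_layer * num_layers
trainable_pct = 100 * trainable / all_params

print(f"trainable params: {trainable:,} || trainable%: {trainable_pct:.4f}")
```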


# 3. Import and Prepare Data


In [10]:
dataset = load_dataset("efederici/mt_nap_it", split = ['train[:80%]', 'train[80%:90%]', 'train[90%:]'])

dataset = {
    "train": dataset[0],
    "eval": dataset[1],
    "test": dataset[2]
}

dataset

{'train': Dataset({
     features: ['url', 'napoletano', 'italiano'],
     num_rows: 11320
 }),
 'eval': Dataset({
     features: ['url', 'napoletano', 'italiano'],
     num_rows: 1415
 }),
 'test': Dataset({
     features: ['url', 'napoletano', 'italiano'],
     num_rows: 1415
 })}

In [11]:
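def generate_prompts(entry, tokenizer):
  # Build one chat-formatted training example: the Italian source sentence is the
  # user turn, the Neapolitan reference translation is the assistant (model) turn.
  user_prompt = '''Translate the provided text from Italian language to Neapolitan dialect. Return only the text translated in Neapolitan, without any additional details.\nItalian Text: {it_text}
  '''

  chat = [
      {"role": "user", "content": user_prompt.format(it_text = entry['italiano'])},
      {"role": "assistant", "content": entry['napoletano']}
  ]

  # tokenize = False returns the formatted string; no generation prompt is added
  # because the model turn is already part of the example.
  prompt = tokenizer.apply_chat_template(chat, tokenize = False, add_generation_prompt = False)

  return prompt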

In [12]:
train_prompts = [generate_prompts(entry, tokenizer) for entry in dataset['train']]
eval_prompts = [generate_prompts(entry, tokenizer) for entry in dataset['eval']]

In [13]:
print(train_prompts[0])

<bos><start_of_turn>user
Translate the provided text from Italian language to Neapolitan dialect. Return only the text translated in Neapolitan, without any additional details.
Italian Text: Ma al tramonto giù a Posillipo<end_of_turn>
<start_of_turn>model
Ma 'int'ô tramonto 'nterra Pusilleco<end_of_turn>
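The layout printed above can be reproduced by hand, which helps when checking what the model actually sees. The helper below, `format_gemma_turns`, is a hypothetical hand-written approximation of that layout and not the tokenizer's own template. It assumes each turn's content is stripped of surrounding whitespace, which is consistent with the trailing newline of the prompt template not appearing in the printed example.

```python
def format_gemma_turns(messages, add_generation_prompt=False):
    """Approximate the Gemma chat layout for a list of {"role", "content"} dicts."""
    role_names = {"user": "user", "assistant": "model"}  # the assistant turn is labelled "model"
    text = "<bos>"
    for message in messages:
        role = role_names[message["role"]]
        text += f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        text += "<start_of_turn>model\n"  # leave the model turn open for generation
    return text


instruction = (
    "Translate the provided text from Italian language to Neapolitan dialect. "
    "Return only the text translated in Neapolitan, without any additional details.\n"
    "Italian Text: {it_text}\n  "
)
user_turn = {"role": "user", "content": instruction.format(it_text="Ma al tramonto giù a Posillipo")}
model_turn = {"role": "assistant", "content": "Ma 'int'ô tramonto 'nterra Pusilleco"}

# Training example: both turns are present and closed.
training_text = format_gemma_turns([user_turn, model_turn])

# Inference prompt: only the user turn, with the model turn left open.
inference_text = format_gemma_turns([user_turn], add_generation_prompt=True)

print(training_text)
print(inference_text)
```

At inference time the same instruction would be sent with only the user turn, so that the model continues from the open `<start_of_turn>model` marker.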



In [14]:
train_dataset = [{"text": prompt} for prompt in train_prompts]
train_dataset = Dataset.from_list(train_dataset)

eval_dataset = [{"text": prompt} for prompt in eval_prompts]
eval_dataset = Dataset.from_list(eval_dataset)

In [15]:
train_dataset

Dataset({
    features: ['text'],
    num_rows: 11320
})

In [16]:
eval_dataset

Dataset({
    features: ['text'],
    num_rows: 1415
})

# 4. Fine-tuning the Model

In [17]:
# Define hyperparameters for training
learning_rate = 1e-4
weight_decay = 0.1
lr_scheduler_type = "linear"
optimizer = "adamw_torch_fused"
random_state = 3407
HAS_BFLOAT16 = torch.cuda.is_bf16_supported()
batch_size = 8
gradient_accumulation_steps = 16

In [18]:
training_args = SFTConfig(
        max_seq_length = 2048,
        output_dir = "nap-dialect-gemma-2-2b-it-finetuning",
        num_train_epochs = 2,
        per_device_train_batch_size = batch_size,
        per_device_eval_batch_size = batch_size,
        gradient_accumulation_steps = gradient_accumulation_steps,
        gradient_checkpointing = True,
        eval_strategy = "steps",
        eval_steps = (len(train_dataset) // batch_size) // (gradient_accumulation_steps * 2), # 2 evaluations per epoch
        save_strategy = "epoch",
        logging_steps = 1,
        warmup_steps = 100,
        learning_rate = learning_rate,
        fp16 = not HAS_BFLOAT16,
        bf16 = HAS_BFLOAT16,
        optim = optimizer,
        weight_decay = weight_decay,
        lr_scheduler_type = lr_scheduler_type,
        seed = random_state,
        report_to = "wandb",
        run_name = "nap-dialect-gemma-2-2b-it_finetuning"
    )
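The step counts implied by these settings can be checked with a few lines of arithmetic. The sketch assumes a single GPU and that the number of optimizer steps per epoch is the number of micro-batches divided by the accumulation steps, rounded down, which is consistent with the 176 global steps reported after training.

```python
num_examples = 11320   # rows in the training split
per_device_batch = 8
grad_accum = 16
epochs = 2
warmup_steps = 100

effective_batch = per_device_batch * grad_accum                 # examples per optimizer step
micro_batches_per_epoch = -(-num_examples // per_device_batch)  # ceiling division
steps_per_epoch = micro_batches_per_epoch // grad_accum         # optimizer steps per epoch
total_steps = steps_per_epoch * epochs

# Same expression as in the SFTConfig above
eval_steps = (num_examples // per_device_batch) // (grad_accum * 2)

warmup_fraction = warmup_steps / total_steps

print(effective_batch, steps_per_epoch, total_steps, eval_steps, round(warmup_fraction, 2))
```

Under these assumptions the warmup covers more than half of all optimizer steps, so the learning rate spends most of the run ramping up. A shorter warmup may be worth trying if training is repeated.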

In [19]:
response_template = "<start_of_turn>model"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer = tokenizer)
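Conceptually, a completion-only collator keeps the loss on the model's answer and ignores the prompt. It does this by setting the labels of every token up to and including the response template to `-100`, the index that PyTorch's cross-entropy loss skips. The function below is a simplified, hypothetical sketch of that idea on plain Python lists. The token ids are made up, and the real collator works on padded tensor batches.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss


def mask_prompt_labels(input_ids, response_ids, ignore_index=IGNORE_INDEX):
    """Return labels where everything up to and including the response template is masked."""
    labels = list(input_ids)
    n = len(response_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_ids:
            end = start + n
            labels[:end] = [ignore_index] * end
            return labels
    # Template not found: this sketch masks the whole example so it adds no loss.
    return [ignore_index] * len(input_ids)


# Toy ids: 2 = <bos>, 10..12 = prompt tokens, [90, 91] = response template,
# 20, 21 = translation tokens, 99 = end of turn (all values are invented).
input_ids = [2, 10, 11, 12, 90, 91, 20, 21, 99]
response_ids = [90, 91]

labels = mask_prompt_labels(input_ids, response_ids)
print(labels)
```

Only the unmasked positions contribute to the loss, so the reported training and validation losses reflect the Neapolitan output rather than the repeated instruction text.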

In [20]:
# Initialize the SFTTrainer for supervised fine-tuning of the LoRA-wrapped Gemma model
trainer = SFTTrainer(
    model = lora_model,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    tokenizer = tokenizer,
    args = training_args,
    data_collator = collator
    )

Map: 100%|██████████| 11320/11320 [00:00<00:00, 43602.16 examples/s]
Map: 100%|██████████| 1415/1415 [00:00<00:00, 41966.47 examples/s]
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [21]:
# Fine-tune the Gemma model on the Neapolitan dialect
trainer.train()

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 44   | 1.9937        | 2.036337        |
| 88   | 1.3242        | 1.337307        |
| 132  | 1.0025        | 1.118906        |
| 176  | 0.907         | 1.038473        |


TrainOutput(global_step=176, training_loss=1.7658500085500153, metrics={'train_runtime': 1175.0953, 'train_samples_per_second': 19.267, 'train_steps_per_second': 0.15, 'total_flos': 2.690223385326797e+16, 'train_loss': 1.7658500085500153, 'epoch': 1.990106007067138})
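A validation loss is easier to interpret as a perplexity. Assuming the reported loss is the mean per-token negative log-likelihood in nats (the usual convention for a cross-entropy loss), perplexity is simply its exponential. The sketch below converts the four validation losses logged during training.

```python
import math

# Validation loss at each evaluation step, copied from the training log
eval_losses = {44: 2.036337, 88: 1.337307, 132: 1.118906, 176: 1.038473}

# Perplexity = exp(mean negative log-likelihood per token)
perplexities = {step: math.exp(loss) for step, loss in eval_losses.items()}

for step, ppl in perplexities.items():
    print(f"step {step:>3}: loss = {eval_losses[step]:.4f}, perplexity = {ppl:.2f}")
```

Perplexity is computed only over the unmasked completion tokens, so it describes how uncertain the model is about the Neapolitan output, not the prompt.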

In [22]:
wandb.finish()

| Metric | Trend over the run |
|--------|--------------------|
| eval/loss | █▃▂▁ |
| eval/runtime | █▁▄▆ |
| eval/samples_per_second | ▁█▅▃ |
| eval/steps_per_second | ▁█▅▃ |
| train/epoch | ▁▁▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▇▇▇▇▇▇▇█ |
| train/global_step | ▁▁▁▁▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇█ |
| train/grad_norm | █▇▆▂▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ |
| train/learning_rate | ▁▁▂▂▂▂▃▃▃▃▄▄▅▅▅▆▆▆▆▇████▇▆▅▅▅▅▄▄▄▄▄▂▂▂▁▁ |
| train/loss | █▇▇▄▄▄▄▃▃▃▃▃▃▃▂▃▂▂▂▂▂▂▂▁▁▂▁▂▁▁▁▁▂▁▁▁▁▁▁▁ |

| Metric | Final value |
|--------|-------------|
| eval/loss | 1.03847 |
| eval/runtime | 22.0368 |
| eval/samples_per_second | 64.211 |
| eval/steps_per_second | 8.032 |
| total_flos | 2.690223385326797e+16 |
| train/epoch | 1.99011 |
| train/global_step | 176.0 |
| train/grad_norm | 2.67537 |
| train/learning_rate | 0.0 |
| train/loss | 0.907 |


In [23]:
# Show the peak GPU memory reserved by PyTorch so far (in GiB)
peak_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
print(f"{peak_gpu_memory} GB of memory reserved.")

23.061 GB of memory reserved.
