# Overview

**Note: All the images are from the Reference lists section or the internet**

We are going to fine-tuning a t5-small quantized model for English to French translation-related tasks using the LoRA method. We use the peft and transformers libraries for training.


# About quantization

Please check the list below:

* [Quantization Technologies](https://www.kaggle.com/code/aisuko/quantization-technologies)
* [Zero Degradation matrix multiplication](https://www.kaggle.com/code/aisuko/zero-degradation-matrix-multiplication)
* [Lighter models on GPU for inference](https://www.kaggle.com/code/aisuko/lighter-models-on-gpu-for-inference)


# About LoRA(Low Rank Adaptation)

> A technique that accelerates the fine-tuning of large models while consuming less memory.


**The idea is to freeze the original pre-trained weights(Matrices) and introduce new updata matrices**. These new matrics are trained on new data while keeping the overall number of changes low. The original weights matrix doesn't receive any adjustments. And finally, both the original and the adapted weights are combined.

![](https://files.mastodon.social/media_attachments/files/111/702/004/494/881/797/original/a26697e010f0096b.webp)

LoRA makes fine-tuning more efficient by drastically reducing the number of **trainable parameters**. In principle, LoRA can be applied to any subset of weight matrices in a neural network to reduce the number of trainable parameters. However, for simplicity and further parameter efficiency, **in Transformer models LoRA is typically applied to attention blocks only**. **The resulting number of trainable parameters in a LoRA model depends on the size of the low-rank update matrices, which is determined manily by the rank `r` and the shape of the original weight matrix**.


The differences between QLoRA and LoRA in real word case see notebook [fine-tuning llama2 with QLoRA](https://www.kaggle.com/code/aisuko/fine-tuning-llama2-with-qlora?scriptVersionId=158763163&cellId=1).

In [1]:
%%capture
!pip install transformers==4.36.2
!pip install bitsandbytes==0.41.3
!pip install accelerate==0.25.0
!pip install datasets==2.15.0
!pip install evaluate==0.4.1
!pip install peft==0.7.1

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token="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")

os.environ["WANDB_API_KEY"]="b47babba2becb2d7866813aeb59313346b518185"
os.environ["WANDB_PROJECT"] = "Fine-tuning t5-small-on-opus100"
os.environ["WANDB_NAME"] = "ft-t5-small-on-opus100"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


# Loading Dataset

We are going to use the optus 100 dataset for training which gives us access to more than 100 different languages.

In [3]:
from datasets import get_dataset_config_names

configs=get_dataset_config_names("wmt14")
print(configs)

Downloading readme:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

Resolving data files:   0%|          | 0/30 [00:00<?, ?it/s]

['cs-en', 'de-en', 'fr-en', 'hi-en', 'ru-en']


We will use the "en-fr" data for language translation. Let's download and load the dataset though the `load_dataset`.

In [4]:
from datasets import load_dataset

dataset=load_dataset("wmt14", "de-en")
dataset

Resolving data files:   0%|          | 0/30 [00:00<?, ?it/s]



Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/280M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/265M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/273M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/474k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/509k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/4508785 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3003 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 4508785
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 3000
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 3003
    })
})

# Data Tokenization

In [5]:
from transformers import AutoTokenizer

model_name="google/mt5-large"
prompt="My name is Kaggle, nice to see you."

tokenizer=AutoTokenizer.from_pretrained(model_name, load_in_8bit=True, device_map="auto")

tokenizer_config.json:   0%|          | 0.00/376 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/642 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/4.31M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [47]:
# use a sample of around 2000 instead of the complete dataset as training dataset
train_dataset=dataset['train'].shuffle(seed=42).select(range(5000))

# as evaluation dataset
eval_dataset=dataset['validation']

prefix = "translate English to German: "
def preprocess_func(data):
    inputs = [prefix + ex['en'] for ex in data['translation']]
    targets = [ex['de'] for ex in data['translation']]
    
    # Tokenize each row of inputs and outputs with padding
    model_inputs = tokenizer(inputs, truncation=True, padding="max_length", max_length=128)
    labels = tokenizer(targets, truncation=True, padding="max_length", max_length=128)
    
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs



# We tokenize the entire dataset
train_dataset=train_dataset.map(preprocess_func, batched=True)
eval_dataset=eval_dataset.map(preprocess_func, batched=True)

# Preparing the model

First, we will load the model with 8 bit(quantization the model). And then, we are using the LoRA. Here are the description of the LoraConfig:

* **r**: the rank of the update matrices, expressed in **int**. Lower rank results in smaller update matrices with fewer trainable parameters.
* **target_modules**: **The modules(for example, attention blocks)** to apply the LoRA update matrices.
* **alpha**: LoRA scaling factor
* **bias**: Specifies if the bias parameters should be trained. Can be 'none','all' or 'lora_only'.
* **module_to_save**: List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint. These typically include model's custome head that is randomly initialized for the fine-tuning task.
* **layer_to_transform**: List of layers to be transformed by LoRA. If not specified, all layers in `target_modules` are transformed.
* **layers_pattern**: Pattern to match layer names in `target_modules`, if `layer_to_transform` is specified. By default `PeftModel` will look at common layer pattern(`layers`,`h`, `blocks`, etc.), use it for exotic and custom models.
* **rank_pattern**: The mapping from layer names or regexp expression to ranks which are different from the default tank specified by `r`.
* **alpha_pattern**: The mapping from layer names or regexp expression to alphas which are different from the default alpha specified by `lora_alpha`.

In [18]:
from peft import PeftModel, prepare_model_for_kbit_training, PeftConfig, get_peft_model, LoraConfig, TaskType
from transformers import BitsAndBytesConfig
from transformers import AutoModelForSeq2SeqLM

bnb_config=BitsAndBytesConfig(
    load_in_8bit=True
)

model=AutoModelForSeq2SeqLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")

ValueError: 
                        Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
                        the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
                        these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom
                        `device_map` to `from_pretrained`. Check
                        https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
                        for more details.
                        

In [None]:
# Freeze the original parameters
model=prepare_model_for_kbit_training(model)

peft_config=LoraConfig(
    # the task to train for (sequence-to-sequence language modeling in this case)
    task_type=TaskType.SEQ_2_SEQ_LM,
    # the dimension of the low-rank matrices
    r=4,
    # the scaling factor for the low-rank matrices
    lora_alpha=16,
    # the dropout probability of the LoRA layers
    lora_dropout=0.01,
    target_modules=["k","q","v","o"],
)

peft_model=get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()

In [13]:
import evaluate
import numpy as np

accuracy=evaluate.load("accuracy")

def compute_metrics(p):
    predictions, labels=p
    predictions=np.argmax(predictions, axis=1)
    
    return accuracy.compute(predictions=predictions, references=labels)

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

# Ttraining

In [19]:
from transformers import DataCollatorForSeq2Seq

# ignore tokenizer pad token in the loss
label_pad_token_id=-100

# padding the sentence of the entire datasets
data_collator=DataCollatorForSeq2Seq(
    tokenizer=tokenizer, 
    model=peft_model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)

In [20]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, Trainer
import torch

training_args=Seq2SeqTrainingArguments(
    output_dir=os.getenv("WANDB_NAME"),
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=1,
    logging_dir=os.getenv("WANDB_NAME")+"/logs",
    logging_strategy="epoch",
    logging_steps=500,
    save_strategy="epoch",
    push_to_hub=False,
    report_to="wandb",
    run_name=os.getenv("WANDB_NAME"),
)


# Create Trainer instance
trainer=Seq2SeqTrainer(
    model=peft_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

peft_model.config.use_cache=False
trainer.train()



Step,Training Loss
313,2.1869


TrainOutput(global_step=313, training_loss=2.186900958466454, metrics={'train_runtime': 911.2216, 'train_samples_per_second': 5.487, 'train_steps_per_second': 0.343, 'total_flos': 3747167600640000.0, 'train_loss': 2.186900958466454, 'epoch': 1.0})

In [25]:
from datasets import load_metric
bleu = load_metric("bleu")

  bleu = load_metric("bleu")


Downloading builder script:   0%|          | 0.00/2.48k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.55k [00:00<?, ?B/s]

In [21]:
from torch.utils.data import DataLoader

In [48]:
del train_dataset


In [49]:
import gc
torch.cuda.empty_cache()
gc.collect()

559

In [50]:

def collate_fn(batch):
    input_ids = torch.stack([torch.tensor(example['input_ids']) for example in batch])
    attention_mask = torch.stack([torch.tensor(example['attention_mask']) for example in batch])
    labels = torch.stack([torch.tensor(example['labels']) for example in batch])
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels}


batch_size = 8
dataloader = DataLoader(eval_dataset, batch_size=batch_size, collate_fn=collate_fn)

In [54]:
[[i] for i in translations]

NameError: name 'translations' is not defined

In [56]:
from tqdm import tqdm
def evaluate_model(model, dataloader, metric):
    model.eval()
    translations = []
    references = []
    i = 0
    for batch in tqdm(dataloader):
        if i==1:break
        input_ids = batch['input_ids'].to(model.device)
        attention_mask = batch['attention_mask'].to(model.device)
        labels = batch['labels'].to(model.device)
        
        # Generate translation
        with torch.no_grad():
            output_ids = model.generate(input_ids, attention_mask=attention_mask, max_length=128, num_beams=4, early_stopping=True)
        
        # Decode the generated text
        generated_texts = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        reference_texts = tokenizer.batch_decode(labels, skip_special_tokens=True)
        
        translations.extend(generated_texts)
        references.extend([[ref] for ref in reference_texts])
        i += 1
    # Compute the BLEU score
    print([[i] for i in translations])
    print(references)
    results = metric.compute(predictions=[[i] for i in translations], references=references)
    return results

# Evaluate the model
results = evaluate_model(model, dataloader, bleu)
print("BLEU score:", results["score"])

  0%|          | 1/375 [00:15<1:34:57, 15.24s/it]

[["A Republican Strategie to counter Obama's Reelection."], ['Die konservativen Führer begründeten ihre Politik mit der Begründung, dass sie ihre Politik mit dem Ziel der Bekämpfung von Electoralfraud unterstützen.'], ['Auch der Brennan Centre hält dieses als Mythos an, dass electoral fraud ist seltener in den USA als in den USA.'], ['Die republikanischen Advokaten identifizierten nur 300 Fälle von politischer Verfälschung in den USA in einem Jahr.'], ['Eine Sache ist sicher: Diese neuen Richtlinien haben einen negativen Einfluss auf die Wahlentscheidungen.'], ['Ich denke, dass die Maßnahmen Teile des Amerikanischen Demokratiesystems zerstören.'], ['Die Amerikanische Staaten sind verantwortlich für die Organisation der Bundeswahlen in den Vereinigten Staaten.'], ['Das ist in diesem Sinne, dass die meisten amerikanischen Regierungen seit 2009 neue Gesetze verabschiedet haben, die das Registrieren oder Voten erschweren.']]
[['Eine republikanische Strategie, um der Wiederwahl von Obama en




ValueError: Got a string but expected a list instead: 'Eine republikanische Strategie, um der Wiederwahl von Obama entgegenzutreten'

In [15]:
# trainer.evaluate() # out of GPU memory

OutOfMemoryError: CUDA out of memory. Tried to allocate 2.82 GiB (GPU 0; 15.89 GiB total capacity; 10.94 GiB already allocated; 2.83 GiB free; 12.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

In [None]:
kwargs={
    'model_name': f'{os.getenv("WANDB_NAME")}',
    'finetuned_from': model_name,
    'tasks': 'Translation',
#     'dataset_tags':'',
    'dataset':'opus100'
}

tokenizer.push_to_hub(os.getenv("WANDB_NAME"))
trainer.push_to_hub(**kwargs)

# Inference

In [34]:
peft_model.config.use_cache=True
context=tokenizer(["Hello"], return_tensors="pt")
output=peft_model.generate(**context)

tokenizer.decode(output[0], skip_special_tokens=True)

'- Vous ne êtes pas en dehors de la ville '

In [None]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model = PeftModel.from_pretrained(model, "aisuko/ft-t5-small-on-opus100")

output=model.generate(**context)
tokenizer.decode(output[0], skip_special_tokens=True)

# References List

* https://towardsdev.com/fine-tune-quantized-language-model-using-lora-with-peft-transformers-on-t4-gpu-287da2d5d7f1
* https://huggingface.co/docs/peft/quicktour