# NLP. Lesson 13. LLMs. Fine-tuning with LoRA


## LLM

A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation and perform various NLP tasks.

### Generation with LLMs

LLMs, or Large Language Models, are the key component behind text generation. In a nutshell, they consist of large pretrained transformer models trained to predict the next word (or, more precisely, token) given some input text. Since they predict one token at a time, you need to do something more elaborate to generate new sentences other than just calling the model — you need to do autoregressive generation.

**Autoregressive generation** is the inference-time procedure of iteratively calling a model with its own generated outputs, given a few initial inputs.

### Autoregressive generation

> Causal language modeling - predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.

A language model trained for causal language modeling takes a sequence of text tokens as input and returns the probability distribution for the next token.

A critical aspect of autoregressive generation with LLMs is how to select the next token from this probability distribution. Anything goes in this step as long as you end up with a token for the next iteration. This means it can be as simple as selecting the most likely token from the probability distribution or as complex as applying a dozen transformations before sampling from the resulting distribution. The process of autoregressive prediction is repeated iteratively until some stopping condition is reached. Ideally, the stopping condition is dictated by the model, which should learn when to output an end-of-sequence (EOS) token. If this is not the case, generation stops when some predefined maximum length is reached.
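
The loop described above can be sketched with a toy stand-in for the model. Everything here is hypothetical and only for illustration: `toy_next_token_probs` replaces a real causal LM with a hand-written transition table over a five-token vocabulary, and `generate` shows greedy selection (temperature 0) next to temperature sampling, with both stopping conditions (EOS token or maximum length).

```python
import numpy as np

# Hypothetical toy vocabulary; a real LLM has tens of thousands of tokens.
VOCAB = ["<eos>", "the", "cat", "sat", "down"]
EOS_ID = 0


def toy_next_token_probs(token_ids):
    """Stand-in for a causal LM: the distribution depends only on the last token."""
    transitions = np.array([
        [1.0, 0.0, 0.0, 0.0, 0.0],  # after <eos>
        [0.0, 0.0, 0.9, 0.1, 0.0],  # after "the"
        [0.0, 0.0, 0.0, 0.9, 0.1],  # after "cat"
        [0.1, 0.0, 0.0, 0.0, 0.9],  # after "sat"
        [0.9, 0.1, 0.0, 0.0, 0.0],  # after "down"
    ])
    return transitions[token_ids[-1]]


def generate(prompt_ids, max_new_tokens=10, temperature=0.0, seed=0):
    """Autoregressive loop: feed the model its own outputs until EOS or max length."""
    rng = np.random.default_rng(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(ids)
        if temperature == 0.0:
            next_id = int(np.argmax(probs))  # greedy: pick the most likely token
        else:
            # Rescale log-probabilities by the temperature, renormalize, then sample.
            logits = np.log(probs + 1e-12) / temperature
            scaled = np.exp(logits - logits.max())
            scaled /= scaled.sum()
            next_id = int(rng.choice(len(probs), p=scaled))
        ids.append(next_id)
        if next_id == EOS_ID:  # the model decided to stop
            break
    return ids


greedy_ids = generate([1])
print([VOCAB[i] for i in greedy_ids])
```

With the greedy setting the loop follows the most likely transition at every step and stops as soon as the EOS token is produced; lowering `max_new_tokens` cuts generation short instead.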

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab13/AutoRegressiveGen.png" alt="Autoregressive generation" width="800"/>


Some cool LLM links: [HF Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), [HF coding assistant](https://huggingface.co/spaces/HuggingFaceH4/starchat2-playground)

LLMs: [BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom), [Gemma](https://huggingface.co/docs/transformers/model_doc/gemma#gemma), [LLaMa](https://huggingface.co/docs/transformers/model_doc/llama), [GPT-2](https://huggingface.co/openai-community/gpt2)


## Parameter Efficient Fine Tuning (PEFT)

The traditional fine-tuning method, in which all parameters of a pre-trained model are tuned, becomes impractical and computationally expensive when working with modern LLMs.

PEFT is a family of techniques designed to fine-tune models while minimizing resource requirements and cost. It is a great choice for domain-specific tasks that require model adaptation. With PEFT, we can strike a balance between retaining valuable knowledge from the pre-trained model and adapting it effectively to the target task with fewer trainable parameters. There are various ways of achieving parameter-efficient fine-tuning; Low-Rank Adaptation (LoRA) and QLoRA are among the most widely used and effective.

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab13/LLMtraining.png" alt="LLM training and tuning" width="800"/>


#### Why PEFT?
- **Accelerated training time:** PEFT allows you to reduce the amount of time spent on training by fine-tuning a small number of parameters rather than the entire model.
- **Reduced Compute and Storage Costs:** PEFT fine-tunes only a small subset of parameters, significantly reducing compute and storage costs and reducing hardware requirements.
- **Less risk of overfitting:** By freezing most of the parameters of the pre-trained model, we can avoid overfitting on new data.
- **Overcoming catastrophic forgetting:** With PEFT, the model can adapt to new tasks while retaining previously learned knowledge by freezing most parameters.

#### Based on their operations, PEFT algorithms can be categorized into:
1. **Additive finetuning** - the parameters of the pre-trained model are supplemented with new ones, and only the new parameters are trained while the original parameters are frozen. Prefix finetuning is based on this approach (check the picture).
    
    - Adapter-based fine-tuning: Small adapter layers are inserted within the Transformer blocks. During fine-tuning, only this small number of trainable parameters, strategically positioned within the model architecture, is updated, which reduces storage, memory, and compute requirements.
    - Soft prompt-based fine-tuning: Adjustable vectors known as soft prompts are prepended to the input sequence, and the model's performance is improved by tuning these vectors.

Additive PEFT:
<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab13/AdditivePEFT.png" alt="Additive finetuning" width="300"/>

2. **Selective PEFT** - Instead of adding more parameters as done in additive PEFT, this approach finetunes a subset of the existing parameters to enhance model performance over downstream tasks.
    - Structured masking: Structured masks organize parameter masking in regular patterns, unlike unstructured ones that apply it randomly, and can therefore improve computational and hardware efficiency during training.
    - Unstructured masking: Individual parameters are selected without any regular pattern.
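
The difference between the two masking styles can be illustrated on a single weight matrix. This is a minimal NumPy sketch, not a specific published method: the matrix, the gradient, the 25% selection rate, and the choice of rows 0 and 2 are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))      # a "pretrained" weight matrix (toy size)
grad = rng.normal(size=W.shape)  # hypothetical gradient from a downstream task
lr = 0.1

# Unstructured mask: individual entries are picked at random (about 25% here).
unstructured_mask = rng.random(W.shape) < 0.25

# Structured mask: whole rows are either trainable or frozen.
structured_mask = np.zeros(W.shape, dtype=bool)
structured_mask[[0, 2], :] = True


def masked_sgd_step(weights, gradient, mask, learning_rate):
    """Update only the entries where mask is True; all other entries stay frozen."""
    return weights - learning_rate * gradient * mask


W_unstructured = masked_sgd_step(W, grad, unstructured_mask, lr)
W_structured = masked_sgd_step(W, grad, structured_mask, lr)

print("trainable (unstructured):", int(unstructured_mask.sum()), "of", W.size)
print("trainable (structured):  ", int(structured_mask.sum()), "of", W.size)
```

The structured variant touches contiguous rows, which is the kind of regular pattern that hardware can exploit; the unstructured variant scatters updates across the matrix.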


#### Some PEFT techniques:

- LoRA - Low-Rank Adaptation
- Prefix tuning - an additive method. A sequence of trainable continuous task-specific vectors, called a prefix, is added to each transformer block, and only the prefix is trained while the rest of the model's parameters stay untouched. Prefix parameters are inserted in all of the model layers, whereas prompt tuning only adds the prompt parameters to the model input embeddings. Based on experiments with GPT-2, the authors of the method concluded that prefix tuning, while training only 0.1% of the parameters, performs comparably to full fine-tuning, in which all model parameters are adjusted, and outperforms it when fine-tuning on a small amount of data.

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab13/PrefixTuning.png" alt="Prefix tuning" width="800"/>

- Prompt tuning - an additive, soft-prompt-based method and a simplified version of prefix tuning. Prompt tokens have their own parameters that are updated independently. This means you can keep the pretrained model's parameters frozen and only update the prompt token embeddings.

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab13/SoftPrompt.png" alt="Prompt finetuning" width="800"/>

- Adapter - an additive method, similar to prefix tuning (but adapters are added instead of prefixes). It adds extra trainable parameters after the attention and fully-connected layers of a frozen pretrained model to reduce memory usage and speed up training.
Structure: Within the adapter, the original d-dimensional features are first projected into a smaller dimension m, a nonlinearity is applied, and the result is projected back into d dimensions. There is also a skip connection around the adapter.

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab13/Adapters.png" alt="Adapter Architecture" width="800"/>

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab13/AdapterInTransformer.png" alt="Adapter" width="800"/>

How is this parameter efficient? For example, assume the first fully connected layer projects a 1024-dimensional input down to 24 dimensions, and the second fully connected layer projects it back into 1024 dimensions. This means we introduced 1,024 x 24 + 24 x 1,024 = 49,152 weight parameters. In contrast, a single fully connected layer that reprojects a 1024-dimensional input into a 1,024-dimensional space would have 1,024 x 1024 = 1,048,576 parameters.
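
The count above can be reproduced with a small NumPy sketch of a bottleneck adapter. The names `W_down`, `W_up`, and `adapter` are hypothetical, biases are omitted to match the arithmetic in the text, and the ReLU nonlinearity and zero initialization of the up-projection are assumptions made for illustration.

```python
import numpy as np

d, m = 1024, 24  # feature size and bottleneck size from the text

rng = np.random.default_rng(0)
W_down = rng.normal(scale=0.01, size=(d, m))  # projects d -> m
W_up = np.zeros((m, d))                       # projects m -> d (zero init is an assumption)


def adapter(x):
    """Bottleneck adapter: down-project, nonlinearity, up-project, skip connection."""
    h = np.maximum(x @ W_down, 0.0)  # ReLU, chosen here for illustration
    return x + h @ W_up              # skip connection adds the input back


x = rng.normal(size=(2, d))
y = adapter(x)

adapter_params = W_down.size + W_up.size
full_layer_params = d * d
print(adapter_params, full_layer_params, adapter_params / full_layer_params)
```

With a zero-initialized up-projection the adapter starts out as the identity function, so inserting it does not change the pretrained model's outputs before training begins.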

Follow [this link](https://magazine.sebastianraschka.com/p/finetuning-llms-with-adapters#:~:text=The%20idea%20of%20parameter%2Defficient,the%20pretrained%20LLM%20remain%20frozen.) for a better understanding and code examples.


- P-tuning - adds a trainable embedding tensor that can be optimized to find better prompts, and it uses a prompt encoder (a bidirectional long short-term memory network, or LSTM) to optimize the prompt parameters.
- IA3 - rescales inner activations with learned vectors. These learned vectors are injected in the attention and feedforward modules in a typical transformer-based architecture.

Some useful links: [Adapters](https://huggingface.co/docs/peft/en/conceptual_guides/adapter), [Soft prompts](https://huggingface.co/docs/peft/en/conceptual_guides/prompting), [IA3](https://huggingface.co/docs/peft/en/conceptual_guides/ia3)

More theory about Efficient Tuning is [here](https://vinija.ai/nlp/parameter-efficient-fine-tuning/)

## LoRA

[Paper](https://arxiv.org/pdf/2106.09685.pdf)

Low-Rank Adaptation (LoRA) is a reparametrization method that reduces the number of trainable parameters by using low-rank representations. Instead of updating a full weight matrix, LoRA represents the weight update as the product of two small low-rank matrices, and only these matrices are trained. All the pretrained model parameters remain frozen. After training, the product of the low-rank matrices can be added back to the original weights. Because there are significantly fewer trainable parameters, a LoRA model is more efficient to train and to store.

In other words, it simplifies the fine-tuning of large models by **decomposing** the high-dimensional weight update into lower-dimensional factors. Like PCA and SVD, this technique relies on low-rank structure to retain critical information while significantly reducing the number of values that must be learned, which makes fine-tuning more efficient in resource-constrained settings.

Benefits: This approach offers considerable time and memory efficiency, as a large portion of the model’s parameters are kept frozen, reducing both training time and GPU memory requirements. It also avoids additional inference latency and facilitates easy task-switching during deployment, requiring changes only in a small subset of weights.
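
The two claims above (few trainable parameters, no extra inference latency after merging) can be checked with a NumPy sketch. The sizes, `alpha`, and the names `W0`, `A`, `B` are illustrative assumptions; the LoRA paper initializes `B` to zero, whereas here it is random so that the update is visible.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 48, 4, 8  # toy sizes, chosen for illustration

W0 = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable low-rank factor
B = rng.normal(scale=0.01, size=(d_out, r))  # trainable low-rank factor (random here, zero in the paper)
scaling = alpha / r


def lora_forward(x):
    """Frozen path plus the scaled low-rank update path."""
    return x @ W0.T + scaling * (x @ A.T) @ B.T


# After training, the update can be folded into the original weight matrix.
W_merged = W0 + scaling * (B @ A)

x = rng.normal(size=(3, d_in))
assert np.allclose(lora_forward(x), x @ W_merged.T)

print("full matrix parameters:", W0.size)
print("LoRA trainable parameters:", A.size + B.size)
```

Because the merged matrix has the same shape as the original one, the fine-tuned model runs exactly like the base model at inference time; switching tasks only requires swapping `A` and `B`.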

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab13/FineTuning.png" alt="LoRA" width="507"/>


<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab13/LoRAandFinetuning.png" alt="LoRA" width="600"/>


### LoRA configuration

With LoRA, the pretrained attention weights stay frozen. Only the weight update `Weight_delta` is trainable, and it is learned through its two low-rank factors, `W_a` and `W_b`.

<img src="https://files.training.databricks.com/images/llm/lora.png" width=500>

You can treat `r` (rank) as a hyperparameter. According to [Hu et al. (2021)](https://arxiv.org/abs/2106.09685), LoRA can perform well with very small ranks: GPT-3's validation accuracies across tasks are quite similar for ranks from 1 to 64.

From [PyTorch Lightning's documentation](https://lightning.ai/pages/community/article/lora-llm/):

> A smaller `r` leads to a simpler low-rank matrix, which results in fewer parameters to learn during adaptation. This can lead to faster training and potentially reduced computational requirements. However, with a smaller `r`, the capacity of the low-rank matrix to capture task-specific information decreases. This may result in lower adaptation quality, and the model might not perform as well on the new task compared to a higher `r`.
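
To get a feel for how `r` drives the number of trainable parameters, here is a small calculation for a single square weight matrix. The helper `lora_trainable_params` and the hidden size of 4096 are assumptions for illustration, not values taken from the notebook.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters in the two low-rank factors: (r x d_in) plus (d_out x r)."""
    return r * (d_in + d_out)


d = 4096  # hypothetical hidden size of one attention projection
full = d * d

for r in (1, 2, 4, 8, 16, 64):
    n = lora_trainable_params(d, d, r)
    print(f"r={r:>2}: {n:>7,} trainable ({100 * n / full:.3f}% of {full:,})")
```

The count grows linearly with `r`, so even the largest rank in this list trains only a few percent of what a full update of the same matrix would need.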

Other arguments:

- `lora_dropout`:
  - Dropout is a regularization method that reduces overfitting by randomly and temporarily removing nodes during training.
  - It works like this:
    - It can be applied to most types of layers (e.g. fully connected, convolutional, recurrent) and to larger networks
    - Nodes and their connections are temporarily and randomly removed during each training cycle
- `target_modules`:
  - Specifies the names of the modules that LoRA is applied to
  - This depends on how the foundation model names its attention weight matrices.
  - Typically, this can be:
    - `query`, `q`, `q_proj`
    - `key`, `k`, `k_proj`
    - `value`, `v`, `v_proj`
    - `query_key_value`
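
Since the right `target_modules` values depend on the model's naming scheme, it helps to look at the module names first (for a PyTorch model, `model.named_modules()` lists them). The sketch below shows the idea of suffix matching on a hypothetical list of names; `select_target_modules` is an illustrative helper, and PEFT's own matching logic may differ in details.

```python
def select_target_modules(module_names, targets):
    """Return the names that equal a target or end with '.<target>'."""
    return [
        name
        for name in module_names
        if any(name == t or name.endswith("." + t) for t in targets)
    ]


# Hypothetical module names in the style of a LLaMA-like model.
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.up_proj",
]

selected = select_target_modules(module_names, ["q_proj", "v_proj"])
print(selected)
```

A model that fuses the three projections into one matrix would instead expose a single name such as `query_key_value`, which is why that entry appears in the list above.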


### QLoRA

[Paper](https://arxiv.org/pdf/2305.14314.pdf), [QLoRA repository](https://github.com/artidoro/qlora), [bitsandbytes](https://github.com/TimDettmers/bitsandbytes)

QLoRA (Quantized Low-Rank Adaptation) extends LoRA by quantizing the weights of the original network from high-resolution data types, such as Float32, to lower-resolution 4-bit data types. This greatly reduces the memory needed to hold the model during fine-tuning.

### Key optimizations of QLoRA

1. **4-bit NF4 Quantization**. 4-bit NormalFloat4 is an optimized data type that can be used to store weights, which brings down the memory footprint considerably.
2. **Normalization & Quantization**. As part of the normalization and quantization steps, the weights are adjusted to zero mean and constant unit variance. A 4-bit data type can only represent 16 distinct values. The normalized weights are therefore mapped onto 16 zero-centered levels, and instead of the weight itself, the position of the nearest level is stored.
3. **Double quantization**. Dequantization requires the quantization constants, so they must be stored alongside the weights. With blockwise quantization there are n such constants, one per block, kept in their original data type. Large LLMs have a substantial number of these constants, which adds memory overhead. Double quantization quantizes the quantization constants themselves to reduce this overhead further.

<img src="https://raw.githubusercontent.com/Dnau15/LabImages/main/images/lab13/LoraQLora.png" alt="LoRA and QLoRA" width="800"/>

In the QLoRA approach, it is the original model’s weights that are quantized to 4-bit precision. The newly added Low-rank Adapter (LoRA) weights are not quantized; they remain at a higher precision and are fine-tuned during the training process. This strategy allows for efficient memory use while maintaining the performance of large language models during finetuning.
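
The idea of storing the position of the nearest level, together with one constant per block, can be sketched in NumPy. This is a simplified illustration, not the bitsandbytes implementation: the codebook below is 16 evenly spaced values and is explicitly NOT the NF4 codebook, the function names are hypothetical, and blocks are scaled by their absolute maximum.

```python
import numpy as np


def quantize_blockwise(weights, codebook, block_size=64):
    """Store, per weight, the index of the nearest codebook level; per block, one constant."""
    blocks = weights.reshape(-1, block_size)
    absmax = np.abs(blocks).max(axis=1, keepdims=True)  # quantization constant per block
    normalized = blocks / absmax                        # values now lie in [-1, 1]
    distances = np.abs(normalized[..., None] - codebook)
    indices = distances.argmin(axis=-1).astype(np.uint8)  # values 0..15 fit in 4 bits
    return indices, absmax


def dequantize_blockwise(indices, absmax, codebook, shape):
    """Look up each level and rescale it with the block's constant."""
    return (codebook[indices] * absmax).reshape(shape)


codebook = np.linspace(-1.0, 1.0, 16)  # illustrative levels, NOT the NF4 values
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(8, 64)).astype(np.float32)

idx, consts = quantize_blockwise(W, codebook)
W_hat = dequantize_blockwise(idx, consts, codebook, W.shape)
print("max abs error:", float(np.abs(W - W_hat).max()))
```

In this sketch each index occupies a full byte for readability; a real 4-bit format would pack two indices per byte. The per-block constants in `consts` are exactly what double quantization compresses further.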


## Code

Let's fine-tune LLaMa2 for sentiment analysis on Financial News


In [1]:
!pip install -q peft transformers datasets evaluate seqeval accelerate bitsandbytes trl

In [2]:
import warnings

warnings.filterwarnings("ignore")

In [3]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer
from trl import setup_chat_format
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split



In [4]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

### Data preparation

Read the initial data and preprocess it: split it into train, evaluation, and test sets, and insert the news text into prompts.


In [5]:
!wget https://raw.githubusercontent.com/Dnau15/LabImages/main/data/txt_data/financial_news.csv




In [6]:
df = pd.read_csv(
    "financial_news.csv",
    names=["sentiment", "text"],
    encoding="utf-8",
    encoding_errors="replace",
)
df.head()

| | sentiment | text |
|---|---|---|
| 0 | neutral | According to Gran , the company has no plans t... |
| 1 | neutral | Technopolis plans to develop in stages an area... |
| 2 | negative | The international electronic industry company ... |
| 3 | positive | With the new production plant the company woul... |
| 4 | positive | According to the company 's updated strategy f... |


In [7]:
X_train = list()
X_test = list()
for sentiment in ["positive", "neutral", "negative"]:
    train, test = train_test_split(
        df[df.sentiment == sentiment], train_size=300, test_size=300, random_state=42
    )
    X_train.append(train)
    X_test.append(test)

X_train = pd.concat(X_train).sample(frac=1, random_state=10)
X_test = pd.concat(X_test)

eval_idx = [
    idx for idx in df.index if idx not in list(X_train.index) + list(X_test.index)
]
X_eval = df[df.index.isin(eval_idx)]
X_eval = X_eval.groupby("sentiment", group_keys=False).apply(
    lambda x: x.sample(n=50, random_state=10, replace=True)
)
X_train = X_train.reset_index(drop=True)


def generate_prompt(data_point):
    return f"""
            Analyze the sentiment of the news headline enclosed in square brackets,
            determine if it is positive, neutral, or negative, and return the answer as
            the corresponding sentiment label "positive" or "neutral" or "negative".

            [{data_point["text"]}] = {data_point["sentiment"]}
            """.strip()


def generate_test_prompt(data_point):
    return f"""
            Analyze the sentiment of the news headline enclosed in square brackets,
            determine if it is positive, neutral, or negative, and return the answer as
            the corresponding sentiment label "positive" or "neutral" or "negative".

            [{data_point["text"]}] = """.strip()


X_train = pd.DataFrame(X_train.apply(generate_prompt, axis=1), columns=["text"])
X_eval = pd.DataFrame(X_eval.apply(generate_prompt, axis=1), columns=["text"])

y_true = X_test.sentiment
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])

train_data = Dataset.from_pandas(X_train)
eval_data = Dataset.from_pandas(X_eval)
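
To make the relation between the training and test prompts concrete, here is a minimal sketch that rebuilds both for a made-up headline. `build_prompt`, `INSTRUCTION`, and the headline are hypothetical, and the template is a simplified single-paragraph variant of the one above (the notebook's version also keeps the indentation of the triple-quoted string inside the prompt).

```python
INSTRUCTION = (
    "Analyze the sentiment of the news headline enclosed in square brackets, "
    "determine if it is positive, neutral, or negative, and return the answer as "
    'the corresponding sentiment label "positive" or "neutral" or "negative".'
)


def build_prompt(text, sentiment=None):
    """Return a training prompt when a label is given, otherwise a test prompt."""
    prompt = f"{INSTRUCTION}\n\n[{text}] ="
    if sentiment is not None:
        prompt += f" {sentiment}"  # the label the model should learn to produce
    return prompt


headline = "Company X reports higher quarterly profit"  # made-up example
train_prompt = build_prompt(headline, "positive")
test_prompt = build_prompt(headline)

print(test_prompt)
print("label part:", train_prompt.split("=")[-1].strip())
```

The test prompt is a strict prefix of the training prompt, so at inference time the model only has to continue the text after the final `=` with the label.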

### Evaluation

Evaluation function for model


In [8]:
def evaluate(y_true, y_pred):
    labels = ["positive", "neutral", "negative"]
    mapping = {"positive": 2, "neutral": 1, "none": 1, "negative": 0}

    def map_func(x):
        return mapping.get(x, 1)

    y_true = np.vectorize(map_func)(y_true)
    y_pred = np.vectorize(map_func)(y_pred)

    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f"Accuracy: {accuracy:.3f}")

    # Generate accuracy report
    unique_labels = set(y_true)  # Get unique labels

    for label in unique_labels:
        label_indices = [i for i in range(len(y_true)) if y_true[i] == label]
        label_y_true = [y_true[i] for i in label_indices]
        label_y_pred = [y_pred[i] for i in label_indices]
        accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f"Accuracy for label {label}: {accuracy:.3f}")

    # Generate classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred)
    print("\nClassification Report:")
    print(class_report)

    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1, 2])
    print("\nConfusion Matrix:")
    print(conf_matrix)

### Load model

Let's create a BitsAndBytesConfig object with the following settings and load LLM with quantization config:

- `load_in_4bit`: Load the model weights in 4-bit format.
- `bnb_4bit_quant_type`: Use the "nf4" quantization type. 4-bit NormalFloat (NF4), is a new data type that is information theoretically optimal for normally distributed weights.
- `bnb_4bit_compute_dtype`: Use the float16 data type for computations.
- `bnb_4bit_use_double_quant`: Use double quantization (reduces the average memory footprint by also quantizing the quantization constants, saving an additional 0.4 bits per parameter).
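
The figure of roughly 0.4 bits per parameter can be reproduced with a short calculation. The block sizes and bit widths below are the ones reported in the QLoRA paper (64 weights per block, 32-bit constants, quantized to 8 bits in groups of 256); treat them as assumptions if your bitsandbytes configuration differs.

```python
block_size = 64     # weights that share one quantization constant
const_bits = 32     # each constant is stored as a 32-bit float

# Without double quantization: one 32-bit constant per 64 weights.
overhead_plain = const_bits / block_size

# With double quantization: constants are stored in 8 bits, and every
# 256 of them share one additional 32-bit second-level constant.
second_block = 256
quantized_const_bits = 8
overhead_double = (
    quantized_const_bits / block_size
    + const_bits / (block_size * second_block)
)

saved = overhead_plain - overhead_double
print(f"{overhead_plain:.3f} -> {overhead_double:.3f} bits per parameter (saves {saved:.3f})")
```

For a 7-billion-parameter model this overhead difference adds up to a few hundred megabytes, which matters when the whole model has to fit on a single GPU.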


In [9]:
model_name = "/kaggle/input/llama2-7b-hf/Llama2-7b-hf"
compute_dtype = getattr(torch, "float16")

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device,
    torch_dtype=compute_dtype,
    quantization_config=bnb_config,
)

model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model, tokenizer = setup_chat_format(model, tokenizer)


### Predict

Perform inference with the model


In [10]:
def predict(test, model, tokenizer):
    # NOTE: prompts are read from the global X_test; the `test` argument is not used.
    y_pred = []
    # Build the generation pipeline once instead of on every iteration.
    pipe = pipeline(
        task="text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=1,
        do_sample=False,  # greedy decoding
    )
    for i in tqdm(range(len(X_test))):
        prompt = X_test.iloc[i]["text"]
        result = pipe(prompt)
        # The generated text repeats the prompt, so the answer follows the last "=".
        answer = result[0]["generated_text"].split("=")[-1]
        if "positive" in answer:
            y_pred.append("positive")
        elif "negative" in answer:
            y_pred.append("negative")
        elif "neutral" in answer:
            y_pred.append("neutral")
        else:
            y_pred.append("none")
    return y_pred

### Model predictions without fine-tuning


In [None]:
y_pred = predict(X_test, model, tokenizer)


In [None]:
evaluate(y_true, y_pred)

### Fine-tuning


In [None]:
output_dir = "trained_weigths"

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

training_arguments = TrainingArguments(
    output_dir=output_dir,  # directory to save and repository id
    num_train_epochs=3,  # number of training epochs
    per_device_train_batch_size=1,  # batch size per device during training
    gradient_accumulation_steps=8,  # number of batches to accumulate gradients over before each optimizer update
    gradient_checkpointing=True,  # use gradient checkpointing to save memory
    optim="paged_adamw_32bit",
    save_steps=0,
    logging_steps=25,  # log every 25 steps
    learning_rate=2e-4,  # learning rate, based on QLoRA paper
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,  # max gradient norm based on QLoRA paper
    max_steps=-1,
    warmup_ratio=0.03,  # warmup ratio based on QLoRA paper
    group_by_length=True,
    lr_scheduler_type="cosine",  # use cosine learning rate scheduler
    report_to="tensorboard",  # report metrics to tensorboard
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    max_seq_length=1024,
    packing=False,
    dataset_kwargs={
        "add_special_tokens": False,
        "append_concat_token": False,
    },
)

In [None]:
trainer.train()

In [None]:
trainer.save_model()
tokenizer.save_pretrained(output_dir)

### Prediction of fine-tuned model


In [None]:
# delete and call garbage collector to free memory
import gc

del [
    model,
    tokenizer,
    peft_config,
    trainer,
    train_data,
    eval_data,
    bnb_config,
    training_arguments,
]
del [df, X_train, X_eval]
del [TrainingArguments, SFTTrainer, LoraConfig, BitsAndBytesConfig]

In [None]:
# empty cuda cache several times
for _ in range(100):
    torch.cuda.empty_cache()
    gc.collect()

### Load trained model and merge with pretrained

`merge_and_unload` - merges LoRA adapters into base model


In [None]:
from peft import AutoPeftModelForCausalLM

finetuned_model = "./trained_weigths/"
compute_dtype = getattr(torch, "float16")
tokenizer = AutoTokenizer.from_pretrained("/kaggle/input/llama2-7b-hf/Llama2-7b-hf")

model = AutoPeftModelForCausalLM.from_pretrained(
    finetuned_model,
    torch_dtype=compute_dtype,
    return_dict=False,
    low_cpu_mem_usage=True,
    device_map=device,
)

merged_model = model.merge_and_unload()
merged_model.save_pretrained(
    "./merged_model", safe_serialization=True, max_shard_size="2GB"
)
tokenizer.save_pretrained("./merged_model")

In [None]:
y_pred = predict(X_test, merged_model, tokenizer)
evaluate(y_true, y_pred)

In [None]:
del [merged_model, tokenizer, y_pred, model, y_true, test]

for _ in range(100):
    torch.cuda.empty_cache()
    gc.collect()

In [None]:
!nvidia-smi

In [48]:
!wget https://raw.githubusercontent.com/Dnau15/LabImages/main/data/txt_data/train.csv
!wget https://raw.githubusercontent.com/Dnau15/LabImages/main/data/txt_data/test.csv




## Task 1
Read the CSV files with pandas, split each sentence into tokens, and convert the tags to integers.

In [49]:
from datasets import Dataset
import pandas as pd

df = pd.read_csv("/kaggle/input/nlp-week-13-fine-tuning-with-lora/train.csv")
df["tokens"] = # YOUR CODE
df["tags"] = # YOUR CODE
dataset = Dataset.from_pandas(df)
splitted_dataset = dataset.train_test_split(test_size=0.2)

assert all(isinstance(tokens, list) for tokens in df["tokens"]), "Not all entries in 'tokens' are lists."
assert all(isinstance(tag_list, list) and all(isinstance(tag, int) for tag in tag_list) for tag_list in df["tags"]), "Not all entries in 'tags' are lists of integers."

In [50]:
label2id = {
    "O": 0,
    "B-DNA": 1,
    "I-DNA": 2,
    "B-protein": 3,
    "I-protein": 4,
    "B-cell_type": 5,
    "I-cell_type": 6,
    "B-cell_line": 7,
    "I-cell_line": 8,
    "B-RNA": 9,
    "I-RNA": 10,
}

In [51]:
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)
from peft import (
    get_peft_config,
    PeftModel,
    PeftConfig,
    get_peft_model,
    LoraConfig,
    TaskType,
)
import evaluate
import torch
import numpy as np

model_checkpoint = "roberta-base"
lr = 1e-3
batch_size = 32
num_epochs = 10

In [52]:
from tokenizers.pre_tokenizers import WhitespaceSplit

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)
tokenizer.pre_tokenizer = WhitespaceSplit()
tokenizer

RobertaTokenizerFast(name_or_path='roberta-base', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	50264: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}

## Task 2
Check how the tokenizer works. Don't forget about padding.

In [53]:
texts = ["Hi guys!", "This is a test one sentence."]

# Tokenize the texts
tokenized_outputs = # YOUR CODE
assert (tokenized_outputs['input_ids'] == torch.Tensor([[0,12289,1669,328,2,1,1,1,1],[0,152,16,10,1296,65,3645,4,2]])).all()


## Task 3
Fill in the gaps and check the code \
Hint: if `word_idx` is `None`, append `-100`.

In [54]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )

    labels = []
    for i, label in enumerate(examples["tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # YOUR CODE
            elif word_idx != previous_word_idx:
                # YOUR CODE
            else:
                # YOUR CODE
            # YOUR CODE
        # YOUR CODE

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [56]:
examples = {
    "tokens": [["Hello", "world"], ["This", "is", "a", "test"]],
    "tags": [[1, 0], [1, 0, 0, 0]]
}

# Applying the function
tokenized_data = # YOUR CODE

assert len(tokenized_data["input_ids"]) == len(examples["tokens"]), "The number of tokenized inputs should match the number of examples."

assert tokenized_data['input_ids'] == [[0, 20920, 232, 2], [0, 152, 16, 10, 1296, 2]]
assert tokenized_data['attention_mask'] == [[1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]

In [57]:
tokenized_splitted_dataset = splitted_dataset.map(
    tokenize_and_align_labels, batched=True
)


In [58]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

## Task 4
Complete the function that computes the metrics.

In [59]:
seqeval = evaluate.load("seqeval")
label_list = list(label2id.keys())


def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        # YOUR CODE
        for prediction, label in zip(predictions, labels)
    ]

    results = # YOUR CODE
    return {
        "precision": # YOUR CODE
        "recall": # YOUR CODE
        "f1": # YOUR CODE
        "accuracy": # YOUR CODE
    }
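
`seqeval` scores precision, recall and F1 at the entity level rather than per token: a predicted entity counts as correct only if both its type and its boundaries match the gold entity. The sketch below is a simplified illustration of that idea under the BIO scheme. `extract_entities` and `entity_f1` are hypothetical helpers, a stray `I-` tag without a preceding `B-` is simply treated as outside, and the real `seqeval` handles more edge cases.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO tag sequence."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" flushes the last span
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if not continues:
            if ent_type is not None:
                entities.add((ent_type, start, i))
            if tag.startswith("B-"):
                start, ent_type = i, tag[2:]
            else:
                start, ent_type = None, None
    return entities


def entity_f1(true_tags, pred_tags):
    """Entity-level precision, recall and F1 for one sequence."""
    gold = extract_entities(true_tags)
    pred = extract_entities(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["B-DNA", "I-DNA", "O", "B-protein"]
pred = ["B-DNA", "O", "O", "B-protein"]

precision, recall, f1 = entity_f1(gold, pred)
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(precision, recall, f1, token_accuracy)
```

In this toy case three of four tokens are right, yet only one of two entities matches exactly. This is one reason token accuracy tends to be noticeably higher than entity-level F1.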

## Task 5
Complete the mapping from label ids to label names.

In [60]:
id2label = # YOUR CODE
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint, num_labels=11, id2label=id2label, label2id=label2id
)

id_sequence = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

expected_labels = [
    "O",        # ID 0
    "B-DNA",    # ID 1
    "I-DNA",    # ID 2
    "B-protein", # ID 3
    "I-protein", # ID 4
    "B-cell_type", # ID 5
    "I-cell_type", # ID 6
    "B-cell_line", # ID 7
    "I-cell_line", # ID 8
    "B-RNA",    # ID 9
    "I-RNA"     # ID 10
]

# Convert IDs to labels using id2label mapping
converted_labels = [id2label[id_] for id_ in id_sequence]

# Assertions to verify the correctness of the conversion
assert converted_labels == expected_labels, "The converted labels do not match the expected labels."

# Additional assertions to verify individual conversions
for id_, expected_label in zip(id_sequence, expected_labels):
    assert id2label[id_] == expected_label, f"ID {id_} should map to label '{expected_label}' but maps to '{id2label[id_]}'."

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Task 6
Fill in the gaps and train the model. Try to get the best scores.

Baseline 
- F1 score: 0.77
- Accuracy: 0.94

Hint: you can change rank, alpha, dropout, etc.
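
To get a feel for what the rank controls: LoRA keeps a `d_out x d_in` weight matrix frozen and learns its update as the product of two small matrices of shapes `d_out x r` and `r x d_in`, scaled by `alpha / r`. The sketch below counts the trainable parameters this adds for a single 768 x 768 projection (768 is the hidden size of `roberta-base`). `lora_param_count` is a hypothetical helper, and the `alpha` value is only an example.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds for one d_out x d_in weight matrix."""
    return d_out * r + r * d_in


d = 768          # hidden size of roberta-base
full = d * d     # parameters in the frozen weight matrix
alpha = 16       # example value, not a recommendation

for r in (4, 8, 16, 64):
    added = lora_param_count(d, d, r)
    scaling = alpha / r  # factor applied to the low-rank update
    print(f"r={r:>2}: {added:>6} trainable vs {full} frozen "
          f"({added / full:.2%}), scaling={scaling}")
```

Doubling the rank doubles the number of trainable parameters, while `lora_alpha` only changes the scale of the update, not its size.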

In [61]:
peft_config = LoraConfig(
    task_type=# YOUR CODE
    # YOUR CODE ...
)

In [62]:
model = get_peft_model(model, peft_config)

In [63]:
training_args = TrainingArguments(
    output_dir="roberta-base-lora-ner",
    learning_rate=# YOUR CODE,
    per_device_train_batch_size=# YOUR CODE,
    per_device_eval_batch_size=# YOUR CODE,
    num_train_epochs=# YOUR CODE,
    weight_decay=# YOUR CODE,
    evaluation_strategy=# YOUR CODE,
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to=# YOUR CODE
)

In [None]:
trainer = Trainer(
    model=# YOUR CODE,
    args=# YOUR CODE,
    train_dataset=# YOUR CODE,
    eval_dataset=# YOUR CODE,
    tokenizer=# YOUR CODE,
    data_collator=# YOUR CODE,
    compute_metrics=# YOUR CODE,
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.22763 | 0.603687 | 0.753119 | 0.670174 | 0.920834 |
| 2 | No log | 0.184079 | 0.697482 | 0.778284 | 0.735671 | 0.935957 |
| 3 | 0.265900 | 0.177199 | 0.726578 | 0.773945 | 0.749514 | 0.940291 |
| 4 | 0.265900 | 0.170859 | 0.741316 | 0.78707 | 0.763508 | 0.940575 |
| 5 | 0.165100 | 0.168991 | 0.75328 | 0.797158 | 0.774598 | 0.943184 |
| 6 | 0.165100 | 0.166087 | 0.741031 | 0.811042 | 0.774458 | 0.94442 |
| 7 | 0.165100 | 0.170062 | 0.75339 | 0.807571 | 0.77954 | 0.944466 |
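
One detail worth keeping in mind when reading this log: with `load_best_model_at_end=True` and no `metric_for_best_model` set, `Trainer` selects the checkpoint with the lowest validation loss, which need not be the one with the highest F1. The sketch below uses values copied from the log above and shows that the two criteria disagree here.

```python
# (epoch, validation_loss, f1), copied from the training log above
log = [
    (1, 0.227630, 0.670174),
    (2, 0.184079, 0.735671),
    (3, 0.177199, 0.749514),
    (4, 0.170859, 0.763508),
    (5, 0.168991, 0.774598),
    (6, 0.166087, 0.774458),
    (7, 0.170062, 0.779540),
]

best_by_loss = min(log, key=lambda row: row[1])[0]  # lowest validation loss
best_by_f1 = max(log, key=lambda row: row[2])[0]    # highest F1

print("best epoch by validation loss:", best_by_loss)
print("best epoch by F1:", best_by_f1)
```

If F1 is the score you care about, passing `metric_for_best_model="f1"` to `TrainingArguments` is one way to make the selection follow it.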


In [None]:
test_df = pd.read_csv("/kaggle/input/nlp-week-13-fine-tuning-with-lora/test.csv")
test_df.head()

## Task 7
Complete code for model inference

In [None]:
from tqdm import tqdm
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
predictions = []
for tokens in tqdm(test_df["tokens"].values):
    inputs = # YOUR CODE
    with torch.no_grad():
        logits = # YOUR CODE
    preds = # YOUR CODE
    prediction = [
        pred
        for token, pred in zip(inputs.tokens(), preds[0].cpu().numpy())
        if "Ġ" in token
    ]
    # YOUR CODE
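
The filter `"Ġ" in token` in the loop above keeps one prediction per word. In RoBERTa's byte-level BPE vocabulary, `Ġ` marks a token that begins with a space, so when a prefix space is added to every word (assumed here, via `add_prefix_space=True`), only the first piece of each word carries it. The tokens and predictions below are hand-written for illustration and were not produced by the real tokenizer.

```python
# Hypothetical tokenizer output for the words ["IL-2", "gene", "expression"]
tokens = ["<s>", "ĠIL", "-", "2", "Ġgene", "Ġexpression", "</s>"]
preds = [0, 1, 2, 2, 2, 0, 0]  # one predicted label id per token

# Keep only the prediction of the first sub-token of each word.
# Special tokens and continuation pieces carry no "Ġ" and are dropped.
word_level = [pred for token, pred in zip(tokens, preds) if "Ġ" in token]
print(word_level)
```

The result has exactly one label per input word, which is the shape a word-level submission needs.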

# Conclusion

In this lesson, we took a close look at LLMs. Let's summarize what we have learned:
- LLMs and autoregressive generation: Large Language Models (LLMs) generate coherent text autoregressively, predicting the next token from the preceding context. This is the key mechanism behind tasks that require context-aware text generation and understanding.
- PEFT and its algorithms: Additive and selective methods adapt pre-trained models by training only a small number of parameters. Prefix Tuning, Prompt Tuning, Adapters, P-Tuning, and IA3 are PEFT techniques that enable effective fine-tuning with limited data and computational resources.
- LoRA (Low-Rank Adaptation): Architecture: uses low-rank matrices to adapt a model efficiently. Configuration: the rank and scaling (alpha) parameters balance performance against resource usage.
- QLoRA (Quantized Low-Rank Adaptation): Differences from LoRA: adds quantization to reduce memory and computational needs. Architecture: builds on LoRA by keeping the frozen base model in quantized form while training the low-rank adapters.
- Code examples: Llama: demonstrated the end-to-end process, including data preparation, model loading, fine-tuning, and prediction. Trained model management: techniques for loading trained adapters and merging them with the pre-trained base model.
- Practical Task: Applied concepts in real-world scenarios to solidify understanding and implementation skills in NLP tasks.

This summary encapsulates the key points of the lesson, reinforcing the concepts and techniques covered to facilitate a deeper understanding of modern NLP methodologies.