# LMSYS Chatbot Arena - Training Notebook
 
| **Competition** | [LMSYS Chatbot Arena on Kaggle](https://www.kaggle.com/competitions/lmsys-chatbot-arena) |
|-------|-------|
| **Author** | Guillaume Raille ([grll](https://github.com/grll)) |
| **Date** | 2024-09-17 |

This notebook demonstrates how to efficiently fine-tune the Gemma-2-9b model for sequence classification using QLoRA with the Unsloth library in the context of the LMSYS Chatbot Arena competition. QLoRA and Unsloth make it possible to fine-tune LLMs with much lower GPU compute and memory requirements, at a minimal cost in performance.

The objective is to classify sequences composed of a prompt, a response from model A, and a response from model B into one of three classes based on human preference: model A wins, model B wins, or tie.

We will cover data preparation, model configuration and training.

This notebook was executed during the competition on a machine with a single RTX4090 GPU from [vast.ai](https://vast.ai/). It took ~5 hours to run at a cost of ~0.3 USD/hour, for a total of ~1.5 USD spent on this experiment.

Without additional data or tricks, apart from quantization with autoAWQ for efficient inference, the model obtained in this notebook ranked 115th out of 1849 in the competition.

The training dataset to reproduce the experiment can be found on [Kaggle](https://www.kaggle.com/competitions/lmsys-chatbot-arena).

## Table of Contents

1. [1. Setup](#1-setup)
2. [2. Data Transformations](#2-data-transformations)
    1. [2.1 RawParser](#21-rawparser)
    2. [2.2 ParsedTokenizer](#22-parsedtokenizer)
    3. [2.3 SampleCreator](#23-samplecreator)
    4. [2.4 LabelCreator](#24-labelcreator)
3. [3. Configuration](#3-configuration)
4. [4. Data Preparation](#4-data-preparation)
    1. [4.0 Load the model and tokenizer](#40-load-the-model-and-tokenizer)
    2. [4.1 Load Raw Data](#41-load-raw-data)
    3. [4.2 Instantiate data transformations](#42-instantiate-data-transformations)
    4. [4.3 Create a Hugging Face Dataset](#43-create-a-hugging-face-dataset)
    5. [4.4 Implement Data Collator](#44-implement-data-collator)
5. [5. Model Setup](#5-model-setup)
    1. [5.1 Apply LoRA](#51-apply-lora)
    2. [5.2 Sequence-to-Sequence to Sequence-to-Score fine-tuning](#52-sequence-to-sequence-to-sequence-to-score-fine-tuning)
        1. [5.2.1 Create a custom score head](#521-create-a-custom-score-head)
        2. [5.2.2 Create a callback to automatically save the score head during training](#522-create-a-callback-to-automatically-save-the-score-head-during-training)
        3. [5.2.3 Mock the forward method to replace the last layer with our score head](#523-mock-the-forward-method-to-replace-the-last-layer-with-our-score-head)
6. [6. Training](#6-training)
    1. [6.1 Define the training arguments and instantiate the trainer](#61-define-the-training-arguments-and-instantiate-the-trainer)
    2. [6.2 Run the training loop](#62-run-the-training-loop)
7. [7. Model Saving](#7-model-saving)
8. [8. Inference](#8-inference)
    1. [8.1 Load Model and Custom Score Module](#81-load-model-and-custom-score-module)
    2. [8.2 Prepare Test Dataset](#82-prepare-test-dataset)
    3. [8.3 Define the training arguments and instantiate the inference trainer](#83-define-the-training-arguments-and-instantiate-the-inference-trainer)
    4. [8.4 Perform Inference](#84-perform-inference)
9. [9. Conclusion](#9-conclusion)

## 1. Setup

Install the required packages and set up the environment.

In [None]:
# Install Unsloth library (with CUDA 12.1 and Torch 2.3.0 support)
# in vast.ai make sure that your template comes with torch 2.3.0 devel and CUDA 12.1
# in vast.ai make sure you have enough disk storage (50gb+ for package installation + model download...)
!pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"

# Install additional dependencies
!pip install scikit-learn
!pip install wandb # optional for logging to wandb

Optional: set the WANDB_PROJECT environment variable for Weights & Biases tracking.

In [None]:
%env WANDB_PROJECT=unsloth_lmsys

## 2. Data Transformations

Define utility classes for processing and tokenizing data.

### 2.1 RawParser

Transforms raw data from the competition datasets into "samples" (better suited for further processing and training).

```mermaid
graph LR
    Input1[prompt: str] --> RawParser
    Input2[response_a: str] --> RawParser
    Input3[response_b: str] --> RawParser
    RawParser --> Output1[prompt_ls: list[str]]
    RawParser --> Output2[response_a_ls: list[str]]
    RawParser --> Output3[response_b_ls: list[str]]
```

In [4]:
import ast, re, dataclasses as dc

class RawParser:
    """transform a raw data point into a dataset sample"""
    Out = dc.make_dataclass("Out", [("prompt_ls", list[str]), ("response_a_ls", list[str]), ("response_b_ls", list[str])])

    @staticmethod
    def recursive_match(s: str, pattern: str, matches: list["re.Match"] = (), min_length: int = 1, offset: int = 0) -> list["re.Match"]:
        if len(s) - offset < min_length:
            return matches

        match = re.compile(pattern).search(s, offset)
        if not match:
            return matches
    
        offset = match.start() + 1
        return RawParser.recursive_match(s, pattern, (*matches, match), min_length, offset)

    @staticmethod
    def null_to_none(s: str) -> str:
        pattern = r"((?:\[|,)\s*)(null)(\s*(?:,|\]))"
        for match in RawParser.recursive_match(s, pattern):
            s = (
                s[: match.start()]
                + match.group().replace("null", "None")
                + s[match.end() :]
            )
        return s

    def __call__(self, prompt: str, response_a: str, response_b: str) -> Out:
        prompt_ls: list[str | None] = ast.literal_eval(self.null_to_none(prompt))
        response_a_ls: list[str | None] = ast.literal_eval(self.null_to_none(response_a))
        response_b_ls: list[str | None] = ast.literal_eval(self.null_to_none(response_b))
         
        def _to_str(ls: list[str | None]) -> list[str]:
            t = []
            for s in ls:
                if not isinstance(s, str):
                    t.append(str(s).replace("None", ""))
                else:
                    # fix an issue with pyarrow not handling utf-16 surrogate pairs
                    if bool(re.search(r"[\ud800-\udfff]", s)):
                        s = s.encode("utf-16", "surrogatepass").decode("utf-16", "surrogatepass")
                    t.append(s.strip())
            return t

        return self.Out(_to_str(prompt_ls), _to_str(response_a_ls), _to_str(response_b_ls))
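
The parsing step above can be illustrated with a minimal, standalone sketch. It is not the `RawParser` class itself: the helper name `parse_turns` and the sample string are assumptions made for illustration, and a single lookaround-based `re.sub` stands in for the recursive matching, since lookarounds do not consume the separators and so also handle adjacent `null` items.

```python
import ast
import re

# A bare `null` list item: preceded by "[" or "," and followed by "," or "]".
# Lookarounds leave the separators unconsumed, so "[null,null]" is fully handled.
NULL_ITEM = re.compile(r"(?<=[\[,])(\s*)null(?=\s*[,\]])")

def parse_turns(raw: str) -> list[str]:
    """Parse a JSON-like list string into a list of stripped strings."""
    items = ast.literal_eval(NULL_ITEM.sub(r"\1None", raw))
    # Missing turns (None) become empty strings, as in RawParser.
    return ["" if item is None else str(item).strip() for item in items]

turns = parse_turns('["Hello ", null, "What does null mean?"]')
print(turns)  # ['Hello', '', 'What does null mean?']
```

Note that the word `null` inside a quoted turn is left untouched because it is not directly preceded by `[` or `,`.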

### 2.2 ParsedTokenizer

Tokenizes each part of a sample into a list of input_ids.

```mermaid
graph LR
    Input1[prompt_ls: list[str]] --> ParsedTokenizer
    Input2[response_a_ls: list[str]] --> ParsedTokenizer
    Input3[response_b_ls: list[str]] --> ParsedTokenizer
    ParsedTokenizer --> Output1[prompt_iid: list[list[int]]]
    ParsedTokenizer --> Output2[response_a_iid: list[list[int]]]
    ParsedTokenizer --> Output3[response_b_iid: list[list[int]]]
```

In [5]:
import dataclasses as dc

class ParsedTokenizer:
    """tokenize a sample into a list of input_ids"""
    Out = dc.make_dataclass("Out", [("prompt_iid", list[list[int]]), ("response_a_iid", list[list[int]]), ("response_b_iid", list[list[int]])])

    def __init__(self, tokenizer: "transformers.PreTrainedTokenizerFast"):
        self.tokenizer = tokenizer
        self.tokenizer.padding_side = "right"

    def __call__(self, prompt_ls: list[str], response_a_ls: list[str], response_b_ls: list[str]) -> Out:
        prompt_encodings = self.tokenizer(prompt_ls, add_special_tokens=False)
        response_a_encodings = self.tokenizer(response_a_ls, add_special_tokens=False)
        response_b_encodings = self.tokenizer(response_b_ls, add_special_tokens=False)

        return self.Out(prompt_encodings.input_ids, response_a_encodings.input_ids, response_b_encodings.input_ids)

### 2.3 SampleCreator

Combines input_ids from the prompt(s), response_a(s) and response_b(s) into a single sample input_ids that **never exceeds** the desired size.

```mermaid
graph LR
    Input1[prompt_iid: list[list[int]]] --> SampleCreator
    Input2[response_a_iid: list[list[int]]] --> SampleCreator
    Input3[response_b_iid: list[list[int]]] --> SampleCreator
    SampleCreator --> Output1[input_ids: list[int]]
```

In [6]:
import dataclasses as dc
import textwrap

class SampleCreator:
    """combines input_ids from the prompt(s), response_a(s) and response_b(s) into input_ids of the desired size"""
    Out = dc.make_dataclass("Out", [("input_ids", list[int])])

    # IMPORTANT: adapt special tokens and prompt format for each model's tokenizer.
    SAMPLE_BEFORE_TURNS: str = textwrap.dedent("""
    <bos><start_of_turn>user
    Answer "tie" if both model's answers are similar. Answer "A" if model A's answers are better, or "B" if model B's answers are better.<end_of_turn>
    """).strip("\n")
    TURN_BEFORE_PROMPT: str =  "<start_of_turn>user\n"
    TURN_AFTER_PROMPT: str = "<end_of_turn><start_of_turn>model A\n"
    TURN_AFTER_RESPONSE_A: str = "<end_of_turn><start_of_turn>model B\n"
    TURN_AFTER_RESPONSE_B: str = "<end_of_turn>"
    SAMPLE_AFTER_TURNS: str = "<start_of_turn>model\n"
    

    def __init__(self, tokenizer, max_len: int = 1024, min_turn_content_len: int = 128):
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.min_turn_content_len = min_turn_content_len

        self.sample_before_turns_ids = self.tokenizer(self.SAMPLE_BEFORE_TURNS, add_special_tokens=False).input_ids
        self.turn_before_prompt_ids = self.tokenizer(self.TURN_BEFORE_PROMPT, add_special_tokens=False).input_ids
        self.turn_after_prompt_ids = self.tokenizer(self.TURN_AFTER_PROMPT, add_special_tokens=False).input_ids
        self.turn_after_response_a_ids = self.tokenizer(self.TURN_AFTER_RESPONSE_A, add_special_tokens=False).input_ids
        self.turn_after_response_b_ids = self.tokenizer(self.TURN_AFTER_RESPONSE_B, add_special_tokens=False).input_ids
        self.sample_after_turns_ids = self.tokenizer(self.SAMPLE_AFTER_TURNS, add_special_tokens=False).input_ids

    @staticmethod
    def _greedy_reduce_to_threshold(values: list[int], threshold: int) -> list[int]:
        while sum(values) > threshold:
            max_index = values.index(max(values))
            values[max_index] -= 1
        return values
    
    @staticmethod
    def _truncate_sequence(seq: list[int], max_len: int, ellipsis_id: int) -> list[int]:
        if len(seq) <= max_len:
            return seq
        return seq[:max_len-1] + [ellipsis_id]
    
    def __call__(self, prompt_iid: list[list[int]], response_a_iid: list[list[int]], response_b_iid: list[list[int]]) -> Out:
        special_tokens_per_conversation = len(self.sample_before_turns_ids) + len(self.sample_after_turns_ids)
        special_tokens_per_turn = len(self.turn_before_prompt_ids) + len(self.turn_after_prompt_ids) + len(self.turn_after_response_a_ids) + len(self.turn_after_response_b_ids)
        ellipsis_id = self.tokenizer("...", add_special_tokens=False).input_ids[0]
        content_max_len_left = self.max_len - special_tokens_per_conversation

        turns = []
        for turn in range(len(prompt_iid)):
            content_max_len_left -= special_tokens_per_turn

            if content_max_len_left < self.min_turn_content_len:
                break
    
            prompt_len, response_a_len, response_b_len = map(len, [
                prompt_iid[turn],
                response_a_iid[turn],
                response_b_iid[turn]
            ])
            
            sizes = self._greedy_reduce_to_threshold([prompt_len, response_a_len, response_b_len], content_max_len_left)
            
            turn_data = {
                "prompt": self._truncate_sequence(prompt_iid[turn], sizes[0], ellipsis_id),
                "response_a": self._truncate_sequence(response_a_iid[turn], sizes[1], ellipsis_id),
                "response_b": self._truncate_sequence(response_b_iid[turn], sizes[2], ellipsis_id)
            }

            
            content_max_len_left -= sum(sizes)
            turns.append(turn_data)

        input_ids = []
        input_ids.extend(self.sample_before_turns_ids)
        for turn in turns:
            input_ids.extend(self.turn_before_prompt_ids)
            input_ids.extend(turn["prompt"])
            input_ids.extend(self.turn_after_prompt_ids)
            input_ids.extend(turn["response_a"])
            input_ids.extend(self.turn_after_response_a_ids)
            input_ids.extend(turn["response_b"])
            input_ids.extend(self.turn_after_response_b_ids)
        input_ids.extend(self.sample_after_turns_ids)

        return self.Out(input_ids)
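
To make the token budgeting concrete, here is a small standalone sketch of the two helpers used by `SampleCreator`, applied to made-up lengths. The function names without the leading underscore, the example lengths, and the `-1` ellipsis id are assumptions for illustration only.

```python
def greedy_reduce_to_threshold(values: list[int], threshold: int) -> list[int]:
    """Shrink the largest value one token at a time until the total fits."""
    sizes = list(values)  # work on a copy so the caller's list is untouched
    while sum(sizes) > threshold:
        longest = sizes.index(max(sizes))
        sizes[longest] -= 1
    return sizes

def truncate_sequence(seq: list[int], max_len: int, ellipsis_id: int) -> list[int]:
    """Cut a sequence to max_len tokens, marking the cut with an ellipsis token."""
    if len(seq) <= max_len:
        return seq
    return seq[: max_len - 1] + [ellipsis_id]

# A short prompt and two long responses competing for a 600-token budget.
budget = greedy_reduce_to_threshold([50, 400, 900], 600)
print(budget)  # [50, 275, 275]

cut = truncate_sequence(list(range(10)), 4, ellipsis_id=-1)
print(cut)  # [0, 1, 2, -1]
```

The short prompt is kept whole while the two long responses end up with equal shares, which is the point of always trimming the longest part first.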

### 2.4 LabelCreator

Creates labels from raw labels (label input_ids).

```mermaid
graph LR
    Input1[winner_model_a: int] --> LabelCreator
    Input2[winner_model_b: int] --> LabelCreator
    Input3[winner_tie: int] --> LabelCreator
    LabelCreator --> Output1[label_ids: int]
```

In [7]:
import dataclasses as dc

class LabelCreator:
    """create LLM labels from raw labels (label input_ids)"""
    def __init__(self, tokenizer, win_a_word: str = "A", win_b_word: str = "B", win_tie_word: str = "tie"):
        self.tokenizer = tokenizer

        # use the constructor arguments instead of hard-coded words
        self.win_a_word = win_a_word
        self.win_b_word = win_b_word
        self.win_tie_word = win_tie_word

        self.label2id = {k: self.tokenizer(k, add_special_tokens=False).input_ids[0] for k in [win_a_word, win_b_word, win_tie_word]}
        self.id2label = {v: k for k, v in self.label2id.items()}

    def __call__(self, winner_model_a: int, winner_model_b: int, winner_tie: int) -> int:
        assert sum([winner_model_a, winner_model_b, winner_tie]) == 1
        if winner_model_a == 1:
            return self.label2id[self.win_a_word]

        if winner_model_b == 1:
            return self.label2id[self.win_b_word]

        if winner_tie == 1:
            return self.label2id[self.win_tie_word]

        raise AttributeError("label should be winner a, winner b or tie but none were set to 1...")
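
The label flow can be sketched without a tokenizer. In the snippet below the token ids (101, 202, 303) are placeholders, not real Gemma-2 ids, and the helper name `to_token_label` is hypothetical. It shows how a one-hot row becomes a token id, and how that token id maps to a class index 0, 1 or 2 based on the insertion order of `id2label` (the same ordering the custom forward pass relies on later).

```python
# Placeholder token ids: real values come from the tokenizer and differ per model.
label2id = {"A": 101, "B": 202, "tie": 303}
id2label = {token_id: word for word, token_id in label2id.items()}

def to_token_label(winner_model_a: int, winner_model_b: int, winner_tie: int) -> int:
    """Map the one-hot competition columns to the token id of the label word."""
    flags = {"A": winner_model_a, "B": winner_model_b, "tie": winner_tie}
    winners = [word for word, flag in flags.items() if flag == 1]
    if len(winners) != 1:
        raise ValueError("exactly one of the three winner columns must be 1")
    return label2id[winners[0]]

# Class index for a 3-way head: position of the token id in id2label.
token_to_class = {token_id: i for i, token_id in enumerate(id2label)}

token_label = to_token_label(0, 0, 1)
class_index = token_to_class[token_label]
print(token_label, class_index)  # 303 2
```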

## 3. Configuration

Set the configuration parameters for the model, training, and data processing.

In [22]:
# CONFIG
MAX_SEQ_LEN = 2048 # sample maximum seq_len to use for training
NUM_PROC = 12 # number of processes to use for data processing
SEED = 42 # seed for reproducibility

# MODEL
MODEL_PATH = "unsloth/gemma-2-9b-it-bnb-4bit" # hugging face model id to use for training
LORA_R = 16 # LoRA rank
LORA_ALPHA = 2 * LORA_R # LoRA alpha
LORA_DROPOUT = 0 # LoRA dropout
LORA_BIAS = "none" # LoRA bias ("none", "all", or "lora_only")
LORA_TGT_MOD = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"] # LoRA target modules
LORA_SEED = SEED # LoRA seed 
LORA_GRAD_CHKPOINT = "unsloth" # LoRA gradient checkpointing    
LORA_RSLORA = False # LoRA rank stabilized
LORA_LOFTQ_CFG = None # LoRA LoftQ config

# TRAINING
TR_OUTPUT_DIR = "36_unsloth_gemma2_9b_2048_1epochs_1e-4" # Training output directory
TR_SAVE_STEPS = 200 # Save steps
TR_WARMUP_STEPS = 0 # Warmup steps
TR_OPTIM = "adamw_8bit" # Optimizer
TR_LR = 1e-4 # Learning rate
TR_BSZ = 16 # Training batch size
TR_EVAL_BSZ = 6 # Evaluation batch size
TR_GRAD_ACC = 1 # Gradient accumulation steps
TR_EPOCHS = 1 # Number of epochs

# DATA
TRAIN_CSV_PATH = "data/train.csv" # Training data path (this is the csv from kaggle)
TEST_SIZE=0.2 # Test size
DS_SEED=SEED # Dataset seed

## 4. Data Preparation

Load the competition dataset, use the data transformations defined above to prepare it for training.

### 4.0 Load the model and tokenizer

The tokenizer is necessary to instantiate our data transformations.

In [10]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL_PATH,
    max_seq_length = MAX_SEQ_LEN,
    load_in_4bit = True,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Gemma2 patching. Transformers = 4.43.3.
   \\   /|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.643 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = True]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unsloth 2024.8 patched 42 layers with 42 QKV layers, 42 O layers and 42 MLP layers.


### 4.1 Load Raw Data

Load the raw competition training dataset as a pandas dataframe.

In [11]:
import pandas as pd

df = pd.read_csv(TRAIN_CSV_PATH)
print(df.shape)
df.head()

(57477, 9)


| | id | model_a | model_b | prompt | response_a | response_b | winner_model_a | winner_model_b | winner_tie |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| 0 | 30192 | gpt-4-1106-preview | gpt-4-0613 | ["Is it morally right to try to have a certain... | ["The question of whether it is morally right ... | ["As an AI, I don't have personal beliefs or o... | 1 | 0 | 0 |
| 1 | 53567 | koala-13b | gpt-4-0613 | ["What is the difference between marriage lice... | ["A marriage license is a legal document that ... | ["A marriage license and a marriage certificat... | 0 | 1 | 0 |
| 2 | 65089 | gpt-3.5-turbo-0613 | mistral-medium | ["explain function calling. how would you call... | ["Function calling is the process of invoking ... | ["Function calling is the process of invoking ... | 0 | 0 | 1 |
| 3 | 96401 | llama-2-13b-chat | mistral-7b-instruct | ["How can I create a test set for a very rare ... | ["Creating a test set for a very rare category... | ["When building a classifier for a very rare c... | 1 | 0 | 0 |
| 4 | 198779 | koala-13b | gpt-3.5-turbo-0314 | ["What is the best way to travel from Tel-Aviv... | ["The best way to travel from Tel Aviv to Jeru... | ["The best way to travel from Tel-Aviv to Jeru... | 0 | 1 | 0 |


### 4.2 Instantiate data transformations

Create instances of the data processing classes defined earlier with the tokenizer.

In [12]:
# Instantiate data processing classes
raw_parser = RawParser()
parsed_tokenizer = ParsedTokenizer(tokenizer)
sample_creator = SampleCreator(tokenizer, MAX_SEQ_LEN)
label_creator = LabelCreator(tokenizer)

### 4.3 Create a Hugging Face Dataset

We use a Hugging Face dataset for processing because it makes it easy to parallelize the job across all available CPU threads while keeping storage requirements low thanks to pyarrow. An added benefit is that the dataset can be used directly in the Hugging Face trainer later on.

In [13]:
from datasets import Dataset

def map_fn(samples: dict[str, list[any]]) -> dict[str, list[any]]:
    return_dict = {"input_ids": [], "label_ids": [], "seq_len": []}

    for prompt, response_a, response_b, winner_model_a, winner_model_b, winner_tie in zip(
            samples["prompt"], samples["response_a"], samples["response_b"], samples["winner_model_a"], samples["winner_model_b"], samples["winner_tie"]):
    
            out = raw_parser(prompt, response_a, response_b)
            out = parsed_tokenizer(out.prompt_ls, out.response_a_ls, out.response_b_ls)
            out = sample_creator(out.prompt_iid, out.response_a_iid, out.response_b_iid)
    
            label_ids = label_creator(winner_model_a, winner_model_b, winner_tie)
            
            return_dict["input_ids"].append(out.input_ids)
            return_dict["label_ids"].append(label_ids)
            return_dict["seq_len"].append(len(out.input_ids)) # enable groupbylength sampling

    return return_dict

dataset = (
    Dataset.from_pandas(df)
    .select_columns(["id", "prompt", "response_a", "response_b", "winner_model_a", "winner_model_b", "winner_tie"])
    .map(map_fn, batched=True, num_proc=NUM_PROC)
    # .train_test_split(test_size=TEST_SIZE, seed=DS_SEED)
    .select_columns(["id", "input_ids", "label_ids", "seq_len"])
)

Map (num_proc=12):   0%|          | 0/57477 [00:00<?, ? examples/s]

### 4.4 Implement Data Collator

Define a data collator for batching samples during training. The main reason to use a custom collator is to delay padding until the samples are batched, so that each batch is padded only to the length of its longest sequence.

In [14]:
import torch

class Collator:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.tokenizer.padding_side = "right"

    def __call__(self, samples: list[dict[str, any]]) -> dict[str, any]:
        # flatten into a single dict as expected by tokenizer.pad
        d = {key: [] for key in samples[0].keys()}
        for sample in samples:
            for k, v in sample.items():
                d[k].append(v)

        # use the collator's own tokenizer rather than the global one
        encodings = self.tokenizer.pad({"input_ids": d["input_ids"]}, return_tensors="pt")

        return {
            "id": d["id"],
            "input_ids": encodings.input_ids,
            "attention_mask": encodings.attention_mask,
            "labels": torch.tensor(d["label_ids"])
        }

collator_fn = Collator(tokenizer)
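
As a rough illustration of what padding at batch time means, the sketch below pads plain Python lists to the longest sequence in the batch and builds the matching attention mask. It is a simplified stand-in for `tokenizer.pad`, not its implementation; the function name `pad_batch` and the pad id `0` are assumptions.

```python
def pad_batch(batch: list[list[int]], pad_id: int = 0) -> tuple[list[list[int]], list[list[int]]]:
    """Right-pad every sequence to the longest one in this batch only."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return input_ids, attention_mask

batch = [[5, 6, 7], [8, 9], [10]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)       # [[5, 6, 7], [8, 9, 0], [10, 0, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0], [1, 0, 0]]
```

Because the target length is computed per batch, a batch of short samples never pays for the longest sample in the whole dataset.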

## 5. Model Setup

Load the pre-trained model and prepare it for training with QLoRA.

### 5.1 Apply LoRA

Unsloth patches the "peft" model to make it run faster during fine-tuning, for example with optimized kernels for matrix multiplication and reordered operations.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = LORA_R,
    target_modules = LORA_TGT_MOD,
    lora_alpha = LORA_ALPHA,
    lora_dropout = LORA_DROPOUT, # Supports any, but = 0 is optimized
    bias = LORA_BIAS,    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = LORA_GRAD_CHKPOINT, # True or "unsloth" for very long context
    random_state = LORA_SEED,
    use_rslora = LORA_RSLORA,  # We support rank stabilized LoRA
    loftq_config = LORA_LOFTQ_CFG, # And LoftQ
)

### 5.2 Sequence-to-Sequence to Sequence-to-Score fine-tuning

By default, Unsloth only supports sequence-to-sequence model fine-tuning. For this problem, it is more efficient to replace the last layer of the seq-to-seq model (a classifier over the full vocabulary) with a simple classifier over the 3 possible labels (A, B or tie).

To that effect, we need to perform 3 steps:
1. create a custom score head in full precision (for better performance, it is not trained with QLoRA)
2. save it alongside each checkpoint with a callback, so it can be loaded during inference
3. mock the forward method to replace the last layer with our score head

#### 5.2.1 Create a custom score head

We make sure to create the score head in full precision to avoid any accuracy loss during training. Also this layer will be trained normally (without QLoRA).

Note that it might be optimal to use 2 optimizers in this case to have more control over the learning rate of the score head.

In [15]:
# create our own head in full precision
import torch

score = torch.nn.Linear(model.lm_head.in_features, 3, bias=False, dtype=torch.float32, device="cuda")

#### 5.2.2 Create a callback to automatically save the score head during training

In [18]:
from transformers import TrainerCallback
import torch
import os

class ScoreSaverCallback(TrainerCallback):
    def __init__(self, score_module):
        self.score_module = score_module

    def on_save(self, args, state, control, **kwargs):
        checkpoint_folder = f"checkpoint-{state.global_step}"
        output_dir = os.path.join(args.output_dir, checkpoint_folder)
        score_path = os.path.join(output_dir, "score.pth")
        
        torch.save(self.score_module.state_dict(), score_path)
        print(f"Saved score module to {score_path}")

#### 5.2.3 Mock the forward method to replace the last layer with our score head

In [None]:
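# names used inside mock_forward (same import as in the inference section)
from unsloth.models.llama import xformers, CausalLMOutputWithPast, fast_cross_entropy_loss

def mock_forward(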
    self,
    input_ids = None,
    causal_mask= None,
    attention_mask = None,
    position_ids = None,
    past_key_values= None,
    inputs_embeds = None,
    labels = None,
    use_cache = None,
    output_attentions = None,
    output_hidden_states = None,
    return_dict = None,
    *args, **kwargs,
):
    causal_mask = xformers.attn_bias.LowerTriangularMask()

    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    self.model._has_no_labels = labels is None

    outputs = self.model(
        input_ids=input_ids,
        causal_mask=causal_mask,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    hidden_states = outputs[0]
    bsz, q_len, hd = hidden_states.shape

    # pool the hidden_states by getting only last for each seq
    seq_idx = attention_mask.sum(-1) - 1 # (bsz,)
    hidden_states = hidden_states[torch.arange(bsz), seq_idx, :].unsqueeze(1) # (bsz, 1, hd)
    
    # lm_head = self.lm_head.weight
    # logits = self.lm_head(hidden_states.to(lm_head.dtype)) # (bsz, 1, vocab)
    # logits = logits.to(self.config.torch_dtype) 

    loss = None
    logit_softcapping = getattr(self.config, "final_logit_softcapping", 0)

    # shift_logits = logits[..., list(label_creator.id2label.keys())] # (bsz, 1, 3)
    shift_logits = score(hidden_states.to(score.weight.dtype)) # (bsz, 1, 3)
    shift_labels = labels.unsqueeze(-1) # (bsz, 1)
    for i, label_id in enumerate(label_creator.id2label.keys()):
        shift_labels[shift_labels == label_id] = i

    loss = fast_cross_entropy_loss(
        logits = shift_logits,
        labels = shift_labels,
        logit_softcapping = logit_softcapping,
    )

    if not return_dict:
        output = (shift_logits,) + outputs[1:]
        return (loss,) + output if loss is not None else output

    return CausalLMOutputWithPast(
        loss=loss,
        logits=shift_logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )
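
The pooling step in the forward pass (taking the hidden state of the last real token of each sequence) can be sketched with plain Python lists in place of tensors. This is only an illustration of the indexing logic under right padding; the function name `last_token_states` and the toy values are assumptions.

```python
def last_token_states(hidden_states, attention_mask):
    """Pick, for each sequence, the hidden state of its last non-padded token."""
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        # With right padding, real tokens occupy positions 0 .. sum(mask) - 1.
        last_idx = sum(mask) - 1
        pooled.append(states[last_idx])
    return pooled

# Two sequences of length 3 with hidden size 2; the second one has one pad position.
hidden_states = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.0, 1.1], [1.2, 1.3], [9.9, 9.9]],  # last position is padding
]
attention_mask = [[1, 1, 1], [1, 1, 0]]

pooled = last_token_states(hidden_states, attention_mask)
print(pooled)  # [[0.5, 0.6], [1.2, 1.3]]
```

This indexing is only valid with right padding, which is why `padding_side = "right"` is set on the tokenizer throughout the notebook.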

## 6. Training

Set up the training arguments, instantiate the trainer and train the model.

### 6.1 Define the training arguments and instantiate the trainer

**Important Notes**
- Grouping by length, combined with our custom collator, makes training much faster: batches contain samples of similar length and are padded only to the longest sequence in the batch.

In [23]:
# DEFINE TRAINER
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir=TR_OUTPUT_DIR,
    overwrite_output_dir = True,
    save_strategy = "steps",
    save_steps=TR_SAVE_STEPS,
    logging_strategy="steps",
    logging_steps=10,
    warmup_steps=TR_WARMUP_STEPS,
    optim=TR_OPTIM,
    learning_rate=TR_LR,
    per_device_train_batch_size=TR_BSZ,
    per_device_eval_batch_size=TR_EVAL_BSZ,
    gradient_accumulation_steps=TR_GRAD_ACC,
    num_train_epochs=TR_EPOCHS,
    bf16=True,
    report_to="wandb",
    run_name=TR_OUTPUT_DIR,
    remove_unused_columns=False, # don't remove id column...
    group_by_length=True, # group by length for faster training
    length_column_name="seq_len",
)

trainer = Trainer(
    args=args,
    model=model,
    train_dataset=dataset,
    data_collator=collator_fn,
    callbacks=[ScoreSaverCallback(score_module=score)]
)
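
As a quick sanity check on the configuration, the number of optimizer steps can be estimated from the dataset size and batch settings. The sketch below uses the values from this notebook (57,477 training rows, batch size 16, no gradient accumulation, one GPU, one epoch); the exact rounding performed by the trainer may differ between library versions, so treat it as an approximation.

```python
import math

num_examples = 57_477        # rows in train.csv
per_device_batch_size = 16   # TR_BSZ
grad_accumulation = 1        # TR_GRAD_ACC
num_gpus = 1
epochs = 1                   # TR_EPOCHS

# Batches per epoch (the last, smaller batch is kept), then optimizer steps.
batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_gpus))
steps_per_epoch = max(batches_per_epoch // grad_accumulation, 1)
total_steps = steps_per_epoch * epochs
print(total_steps)  # 3593
```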

### 6.2 Run the training loop

In [24]:
from unittest import mock

with mock.patch.object(model.base_model.model,'forward', new=mock_forward.__get__(model.base_model.model, type(model.base_model.model))):
    trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 57,477 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 16 | Gradient Accumulation steps = 1
\        /    Total batch size = 16 | Total steps = 3,593
 "-____-"     Number of trainable parameters = 54,018,048
[34m[1mwandb[0m: Currently logged in as: [33mguillaume-raille[0m. Use [1m`wandb login --relogin`[0m to force relogin


| Step | Training Loss |
|-------|-------|
| 10 | 1.0751 |
| 20 | 1.2367 |
| 30 | 1.1569 |
| 40 | 1.0769 |
| 50 | 1.16 |
| 60 | 1.1081 |
| 70 | 1.0953 |
| 80 | 1.1331 |
| 90 | 1.1876 |
| 100 | 1.1491 |


Saved score module to 36_unsloth_gemma2_9b_2048_1epochs_1e-4/checkpoint-200/score.pth
Saved score module to 36_unsloth_gemma2_9b_2048_1epochs_1e-4/checkpoint-400/score.pth
Saved score module to 36_unsloth_gemma2_9b_2048_1epochs_1e-4/checkpoint-600/score.pth
Saved score module to 36_unsloth_gemma2_9b_2048_1epochs_1e-4/checkpoint-800/score.pth
Saved score module to 36_unsloth_gemma2_9b_2048_1epochs_1e-4/checkpoint-1000/score.pth
Saved score module to 36_unsloth_gemma2_9b_2048_1epochs_1e-4/checkpoint-1200/score.pth
Saved score module to 36_unsloth_gemma2_9b_2048_1epochs_1e-4/checkpoint-1400/score.pth
Saved score module to 36_unsloth_gemma2_9b_2048_1epochs_1e-4/checkpoint-1600/score.pth
Saved score module to 36_unsloth_gemma2_9b_2048_1epochs_1e-4/checkpoint-1800/score.pth
Saved score module to 36_unsloth_gemma2_9b_2048_1epochs_1e-4/checkpoint-2000/score.pth
Saved score module to 36_unsloth_gemma2_9b_2048_1epochs_1e-4/checkpoint-2200/score.pth
Saved score module to 36_unsloth_gemma2_9b_2048
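
The `mock.patch.object` call used in this section temporarily swaps the model's `forward` for a plain function bound to the instance with `__get__`. The toy example below reproduces that mechanism with the standard library only; the `Model` class and `replacement_forward` function are stand-ins invented for illustration.

```python
from unittest import mock

class Model:
    def forward(self, x):
        return x + 1

def replacement_forward(self, x):
    # `self` is the patched instance, exactly as in a regular method
    return x * 10

model = Model()

# Bind the plain function to the instance so it receives `self` when called.
bound = replacement_forward.__get__(model, type(model))

with mock.patch.object(model, "forward", new=bound):
    inside = model.forward(3)   # uses the replacement

outside = model.forward(3)      # original method is restored on exit
print(inside, outside)  # 30 4
```

The patch only lasts for the duration of the `with` block, so the original forward method is available again once training finishes.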

## 7. Model Saving

Save the trained model, tokenizer and custom score module.

In [25]:
# SAVE MODEL
model.save_pretrained(f"{TR_OUTPUT_DIR}/final") # Local saving
tokenizer.save_pretrained(f"{TR_OUTPUT_DIR}/final")
torch.save(score, f"{TR_OUTPUT_DIR}/final/score.pth")

Finish the Weights & Biases run.

In [26]:
import wandb
wandb.finish()


Run history:

| Metric | History |
|-------|-------|
| train/epoch | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| train/global_step | ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███ |
| train/grad_norm | █▅▃▃▃▁▂▃▂▂▂▃▂▃▂▁▃▂▃▂▃▃▃▄▄▄▃▄▁▄▃▅▃▂▂▃▃▃▄▂ |
| train/learning_rate | ████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁ |
| train/loss | ▇██▅▄▆█▃▇▄▅▃▅▂▆▇▂▄▃▆▄▅▄▂▁▄▅▂▆▃▅▁▅▁▂▄▁▆▄▅ |

Run summary:

| Metric | Value |
|-------|-------|
| total_flos | 2.29905159755469e+18 |
| train/epoch | 1.0 |
| train/global_step | 3593.0 |
| train/grad_norm | 2.81213 |
| train/learning_rate | 0.0 |
| train/loss | 0.7639 |
| train_loss | 0.94177 |
| train_runtime | 18974.4169 |
| train_samples_per_second | 3.029 |
| train_steps_per_second | 0.189 |


## 8. Inference

Note that this is not the inference script used in the competition, but it is still useful to quickly test whether the fine-tuned model works as expected.

### 8.1 Load Model and Custom Score Module

In [6]:
# LOAD MODEL & TOKENIZER
from unsloth import FastLanguageModel
import torch

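# NOTE: this path differs from the output directory used in the training section;
# to load the model saved in section 7, point it to f"{TR_OUTPUT_DIR}/final" instead
model, tokenizer = FastLanguageModel.from_pretrained("32_unsloth_gemma2_9b_2048")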
FastLanguageModel.for_inference(model)
tokenizer.padding_side = "right"

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Gemma2 patching. Transformers = 4.43.3.
   \\   /|    GPU: NVIDIA GeForce RTX 4090. Max memory: 23.643 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0. CUDA = 8.9. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.26.post1. FA2 = True]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unsloth 2024.8 patched 42 layers with 42 QKV layers, 42 O layers and 42 MLP layers.


### 8.2 Prepare Test Dataset

This time we split the dataset into training and test sets, whereas before we used the whole dataset for training.

In [9]:
# CREATE HF DATASET for TRAINING / TESTING
from datasets import Dataset

dataset = (
    Dataset.from_pandas(df)
    .select_columns(["id", "prompt", "response_a", "response_b", "winner_model_a", "winner_model_b", "winner_tie"])
    .map(map_fn, batched=True, num_proc=NUM_PROC)
    .train_test_split(test_size=TEST_SIZE, seed=DS_SEED)
    .select_columns(["id", "input_ids", "label_ids", "seq_len"])
)

Map (num_proc=12):   0%|          | 0/57477 [00:00<?, ? examples/s]

### 8.3 Define the training arguments and instantiate the inference trainer

Our inference trainer is essentially the same as our training trainer, except that we call the `predict` method instead of the `train` method. Note that because the test set was part of the data the model was trained on, we expect a much better score here than on the competition's private dataset.

In [15]:
# DEFINE TRAINER
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir=TR_OUTPUT_DIR,
    overwrite_output_dir = True,
    save_strategy = "steps",
    save_steps=TR_SAVE_STEPS,
    logging_strategy="steps",
    logging_steps=10,
    warmup_steps=TR_WARMUP_STEPS,
    optim=TR_OPTIM,
    learning_rate=TR_LR,
    per_device_train_batch_size=TR_BSZ,
    per_device_eval_batch_size=TR_EVAL_BSZ,
    gradient_accumulation_steps=TR_GRAD_ACC,
    num_train_epochs=TR_EPOCHS,
    bf16=True,
    report_to="none",
    remove_unused_columns=False, # don't remove id column...
    group_by_length=True, # group by length for faster training
    length_column_name="seq_len",
)

trainer = Trainer(
    args=args,
    model=model,
    train_dataset=dataset["train"],
    data_collator=collator_fn,
)
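
Grouping sequences of similar length reduces the number of padding tokens per batch, which is the motivation for `group_by_length=True`. The sketch below illustrates the effect with hypothetical sequence lengths and a hypothetical helper `padded_tokens`; it is not the Trainer's sampler, which keeps some randomness, so the full sort shown here is only the limiting case.

```python
import random

def padded_tokens(lengths, batch_size):
    """Total tokens processed when every batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total

rng = random.Random(0)
# Hypothetical sequence lengths, in arbitrary order
lengths = [rng.randint(50, 2048) for _ in range(1024)]

real_tokens = sum(lengths)
unsorted_cost = padded_tokens(lengths, batch_size=8)
sorted_cost = padded_tokens(sorted(lengths, reverse=True), batch_size=8)

print(f"real tokens:            {real_tokens}")
print(f"padded, arbitrary order: {unsorted_cost}")
print(f"padded, sorted by length: {sorted_cost}")
```

Sorting by length never increases the padded total for equal-sized batches, so the sorted figure sits between the real token count and the unsorted figure.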

### 8.4 Perform Inference

We load the score module and patch the model's `forward` method so that our score head is applied in place of the language-modeling head.
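
The patching pattern itself can be sketched on a toy class. `ToyModel` and `patched_forward` are hypothetical names used only for illustration: the function is bound to the instance with `__get__` so that it receives `self`, and `mock.patch.object` swaps it in only for the duration of the `with` block.

```python
from unittest import mock

class ToyModel:  # hypothetical stand-in for the wrapped model
    def __init__(self, name):
        self.name = name

    def forward(self, x):
        return f"{self.name}: original({x})"

def patched_forward(self, x):
    # `self` is available because the function is bound to the instance below
    return f"{self.name}: patched({x})"

toy = ToyModel("toy")

# __get__ turns the plain function into a method bound to `toy`
bound = patched_forward.__get__(toy, type(toy))

with mock.patch.object(toy, "forward", new=bound):
    inside = toy.forward(1)   # the replacement is used inside the block

outside = toy.forward(1)      # the original method is restored on exit

print(inside)
print(outside)
```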

In [18]:
# ACTUAL INFERENCE LOOP
from unittest import mock

from unsloth.models.llama import xformers, CausalLMOutputWithPast, fast_cross_entropy_loss
import torch

# load my own head in full precision
score = torch.load("32_unsloth_gemma2_9b_2048/score.pth")

# unfortunately we have to redefine mock_forward here because we load the score module
# in this cell and the function uses it; we could refactor the code to avoid this.
def mock_forward(
    self,
    input_ids = None,
    causal_mask= None,
    attention_mask = None,
    position_ids = None,
    past_key_values= None,
    inputs_embeds = None,
    labels = None,
    use_cache = None,
    output_attentions = None,
    output_hidden_states = None,
    return_dict = None,
    *args, **kwargs,
):
    causal_mask = xformers.attn_bias.LowerTriangularMask()

    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    self.model._has_no_labels = labels is None

    outputs = self.model(
        input_ids=input_ids,
        causal_mask=causal_mask,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    hidden_states = outputs[0]
    bsz, q_len, hd = hidden_states.shape

    # pool the hidden_states by getting only last for each seq
    seq_idx = attention_mask.sum(-1) - 1 # (bsz,)
    hidden_states = hidden_states[torch.arange(bsz), seq_idx, :].unsqueeze(1) # (bsz, 1, hd)

    loss = None
    logit_softcapping = getattr(self.config, "final_logit_softcapping", 0)

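    # apply the score head to the pooled hidden state to get one logit per class
    shift_logits = score(hidden_states.to(score.weight.dtype)) # (bsz, 1, 3)
    shift_labels = labels.unsqueeze(-1) # (bsz, 1)
    # remap label token ids to class indices 0..2 (in place, following the order of id2label)
    for i, label_id in enumerate(label_creator.id2label.keys()):
        shift_labels[shift_labels == label_id] = i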

    loss = fast_cross_entropy_loss(
        logits = shift_logits,
        labels = shift_labels,
        logit_softcapping = logit_softcapping,
    )

    if not return_dict:
        output = (shift_logits,) + outputs[1:]
        return (loss,) + output if loss is not None else output

    return CausalLMOutputWithPast(
        loss=loss,
        logits=shift_logits,
    )

with mock.patch.object(model.base_model.model,'forward', new=mock_forward.__get__(model.base_model.model, type(model.base_model.model))):
    prediction_outputs = trainer.predict(dataset["test"].sort("seq_len", reverse=True))
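
The pooling step in `mock_forward` keeps only the hidden state of the last real token of each sequence. A minimal list-based sketch of that indexing follows, assuming right padding as set in 8.1; the helper names and toy values are hypothetical, and plain Python lists stand in for tensors.

```python
def last_token_index(attention_mask):
    """Index of the last real token per row, assuming right padding (1 = token, 0 = pad)."""
    return [sum(row) - 1 for row in attention_mask]

def pool_last_token(hidden_states, attention_mask):
    """Keep, for each sequence, the hidden vector at its last real token."""
    indices = last_token_index(attention_mask)
    return [seq[i] for seq, i in zip(hidden_states, indices)]

# Two sequences of length 4 with hidden size 2; the first has one padding position
attention_mask = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
]
hidden_states = [
    [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [9.0, 9.9]],
    [[3.0, 3.1], [4.0, 4.1], [5.0, 5.1], [6.0, 6.1]],
]

pooled = pool_last_token(hidden_states, attention_mask)
print(pooled)
```

With left padding, `sum(row) - 1` would no longer point at the last real token, which is why `tokenizer.padding_side = "right"` matters here.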

In [17]:
prediction_outputs

{'eval_loss': 0.8944498896598816,
 'eval_model_preparation_time': 0.0143,
 'eval_runtime': 1205.585,
 'eval_samples_per_second': 9.536,
 'eval_steps_per_second': 1.589}

Note that in this case we obtained a negative log-likelihood of 0.894, which is extremely good on this problem. With 3 classes, a uniform random guess would score $-\ln(\frac{1}{3}) \approx 1.0986$. Again, this is because the test set was part of the training dataset, but it at least gives us a sense that the fine-tuning worked.
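
The baseline can be checked with a few lines of arithmetic. The sketch below assumes the reported `eval_loss` is a mean cross-entropy, in which case `exp(-loss)` is the geometric-mean probability the model assigns to the true class; the variable names are illustrative.

```python
import math

num_classes = 3

# Log loss of a predictor that always outputs 1/3 for every class
uniform_log_loss = -math.log(1 / num_classes)

# Value reported in the prediction output above
eval_loss = 0.8944498896598816

# Geometric-mean probability assigned to the true class
implied_prob = math.exp(-eval_loss)

print(f"uniform baseline log loss: {uniform_log_loss:.4f}")
print(f"eval log loss:             {eval_loss:.4f}")
print(f"implied true-class prob:   {implied_prob:.3f} (uniform: {1 / num_classes:.3f})")
```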

## 9. Conclusion

This concludes our notebook on fine-tuning Gemma 2 9B for sequence classification to predict human preference. We have seen how to use Unsloth to quickly and efficiently fine-tune a model with QLoRA, and how to implement a custom score head that turns the causal language model into a sequence classification model. We have also seen how to use the fast tokenizer and the data processing pipeline to quickly create a custom dataset and dataloader.

In the next notebook we will see how to quantize the final model with AutoAWQ while retaining good accuracy. We will then use the quantized model for inference in the competition by customizing vLLM for sequence classification.