This notebook is adapted from one of the best notebooks of the previous competition, by [@emiz6413](https://www.kaggle.com/emiz6413).

The original notebook can be found [here](https://www.kaggle.com/emiz6413).

Major changes:
* The previous competition used prompts in **English** only. This competition has **multilingual prompts**. I didn't add any translation mechanism here; adding one might improve performance.
* For the previous competition, we had to submit a probability for each outcome:
 > winner_model_[a/b/tie]

In this edition, we only have to submit the winning model (**model_a** or **model_b**). I just added a simple function to choose the winner based on the probabilities.
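
As a minimal sketch of that selection step (the helper name `pick_winner` and the example probabilities are illustrative assumptions, not taken from the notebook), the winner can be chosen by taking the argmax over the two class probabilities:

```python
import numpy as np

# Hypothetical helper: map per-class probabilities to the label strings
# used in the submission. Column 0 is model_a, column 1 is model_b.
CLASS_NAMES = ("model_a", "model_b")


def pick_winner(probs: np.ndarray) -> list:
    """Return the most probable class name for each row of an (n, 2) array."""
    return [CLASS_NAMES[i] for i in probs.argmax(axis=1)]


# Made-up probabilities for two test rows.
example_probs = np.array([[0.7, 0.3], [0.2, 0.8]])
winners = pick_winner(example_probs)
print(winners)  # ['model_a', 'model_b']
```

The class order matches the label mapping used during training (`model_a` is 0, `model_b` is 1).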

## What this notebook is

This is an inference notebook using 4-bit quantized [Gemma-2 9b Instruct](https://blog.google/technology/developers/google-gemma-2/) and a LoRA adapter trained using the script uploaded [here](https://www.kaggle.com/code/emiz6413/gemma-2-9b-4-bit-qlora-finetune).
Although we can choose to merge the LoRA adapter into the base model for faster inference, naively doing so could introduce non-negligible quantization error. Therefore, the LoRA adapter is kept unmerged.
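
The effect can be illustrated with a toy example. This is only a sketch under loud assumptions: `fake_quantize` is a simple uniform 16-level quantizer standing in for 4-bit quantization (the real 4-bit format used by bitsandbytes differs), and the matrix sizes and scales are made up.

```python
import numpy as np


def fake_quantize(w: np.ndarray, n_levels: int = 16) -> np.ndarray:
    """Round every entry to one of n_levels evenly spaced values (toy quantizer)."""
    lo, hi = w.min(), w.max()
    step = (hi - lo) / (n_levels - 1)
    return np.round((w - lo) / step) * step + lo


rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))               # stand-in for a base weight matrix
lora_a = rng.normal(scale=0.01, size=(8, 64))
lora_b = rng.normal(scale=0.01, size=(64, 8))
delta = lora_b @ lora_a                     # low-rank update learned by the adapter

w_q = fake_quantize(w)
unmerged = w_q + delta                      # quantized base + full-precision adapter
merged = fake_quantize(w_q + delta)         # naive merge, then quantize again

# Re-quantizing snaps the merged weights back onto a coarse grid,
# so most of the small adapter update is lost.
merge_error = np.abs(merged - unmerged).max()
print(f"max deviation introduced by naive merging: {merge_error:.6f}")
```

Keeping the adapter unmerged corresponds to the `unmerged` path: the base stays quantized and the update is applied in higher precision at run time.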


In [1]:
!pip install /kaggle/input/lmsys-wheel-files/*.whl

Looking in links: /kaggle/input/pm-73558185-at-01-08-2025-09-22-49/
Processing /kaggle/input/lmsys-wheel-files/MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Processing /kaggle/input/lmsys-wheel-files/PyYAML-6.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Processing /kaggle/input/lmsys-wheel-files/accelerate-0.32.1-py3-none-any.whl
Processing /kaggle/input/lmsys-wheel-files/bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl
Processing /kaggle/input/lmsys-wheel-files/certifi-2024.7.4-py3-none-any.whl
Processing /kaggle/input/lmsys-wheel-files/charset_normalizer-3.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Processing /kaggle/input/lmsys-wheel-files/filelock-3.15.4-py3-none-any.whl
Processing /kaggle/input/lmsys-wheel-files/fsspec-2024.6.1-py3-none-any.whl
Processing /kaggle/input/lmsys-wheel-files/huggingface_hub-0.23.4-py3-none-any.whl
Processing /kaggle/input/lmsys-wheel-files/idna-3.7-py3-none-any.whl
Pro

In [2]:
import os
import copy
from dataclasses import dataclass

import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from transformers import (
    BitsAndBytesConfig,
    Gemma2ForSequenceClassification,
    GemmaTokenizerFast,
    Gemma2Config,
    PreTrainedTokenizerBase, 
    EvalPrediction,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from sklearn.metrics import log_loss, accuracy_score

2025-01-08 09:27:33.067017: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-01-08 09:27:33.067139: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-01-08 09:27:33.259844: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## Configurations

In [3]:
@dataclass
class Config:
    output_dir: str = "output"
    # checkpoint: str = "unsloth/gemma-2-9b-it-bnb-4bit"  # 4-bit quantized gemma-2-9b-instruct
    checkpoint: str = '/kaggle/input/gemma-2/transformers/gemma-2-9b-it-4bit/1/gemma-2-9b-it-4bit'
    lora_dir: str = '/kaggle/working/wsdm2025-multilingual-chatbot-arena/lora_param'
    max_length: int = 1024
    n_splits: int = 5
    fold_idx: int = 0
    optim_type: str = "adamw_8bit"
    per_device_train_batch_size: int = 2
    gradient_accumulation_steps: int = 2  # effective batch size = 2 * 2 * number of devices (4 on a single device)
    per_device_eval_batch_size: int = 8
    n_epochs: int = 1
    freeze_layers: int = 16  # there're 42 layers in total, we don't add adapters to the first 16 layers
    lr: float = 2e-4
    warmup_steps: int = 20
    lora_r: int = 8
    lora_alpha: float = lora_r * 2
    lora_dropout: float = 0.05
    lora_bias: str = "none"
    
config = Config()

In [4]:
training_args = TrainingArguments(
    output_dir=config.output_dir,
    overwrite_output_dir=True,
    report_to="none",
    num_train_epochs=config.n_epochs,
    per_device_train_batch_size=config.per_device_train_batch_size,
    gradient_accumulation_steps=config.gradient_accumulation_steps,
    per_device_eval_batch_size=config.per_device_eval_batch_size,
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="steps",
    save_steps=200,
    optim=config.optim_type,
    fp16=True,
    learning_rate=config.lr,
    warmup_steps=config.warmup_steps,
    ddp_find_unused_parameters=False,
    gradient_checkpointing=True,
    logging_dir="./logs",
)
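
The batch settings above determine how many optimizer steps one epoch takes. A rough sketch of the arithmetic, assuming a single training process (`n_devices=1` is an assumption) and using the 3,000-row subset and 5-fold split from later cells; the helper `optimizer_steps` is hypothetical and only approximates how `Trainer` counts steps:

```python
import math


def optimizer_steps(n_samples: int, per_device_batch: int, grad_accum: int,
                    n_devices: int = 1, epochs: int = 1) -> int:
    """Approximate number of optimizer updates for a training run."""
    effective_batch = per_device_batch * grad_accum * n_devices
    return math.ceil(n_samples / effective_batch) * epochs


n_train = 3000 * 4 // 5  # 4 of 5 folds are used for training
steps = optimizer_steps(n_train, per_device_batch=2, grad_accum=2)
print(steps)  # 600
```

This agrees with the `global_step=600` reported in the training output of this notebook.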

In [5]:
lora_config = LoraConfig(
    r=config.lora_r,
    lora_alpha=config.lora_alpha,
    # only target self-attention
    target_modules=["q_proj", "k_proj", "v_proj"],
    layers_to_transform=[i for i in range(42) if i >= config.freeze_layers],
    lora_dropout=config.lora_dropout,
    bias=config.lora_bias,
    task_type=TaskType.SEQ_CLS,
)
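
For a sense of scale, the number of trainable LoRA weights implied by this configuration can be worked out by hand. The sketch below assumes the projection shapes shown in the model summary printed in this notebook (3584 to 4096 for `q_proj`, 3584 to 2048 for `k_proj` and `v_proj`) apply to every layer, and it ignores the classification head; the helper `lora_params` is hypothetical.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """A LoRA pair is A (r x in_features) plus B (out_features x r)."""
    return r * (in_features + out_features)


R = 8
N_LAYERS, FREEZE_LAYERS = 42, 16
PROJ_SHAPES = {
    "q_proj": (3584, 4096),
    "k_proj": (3584, 2048),
    "v_proj": (3584, 2048),
}

# Same selection rule as layers_to_transform in the LoraConfig above.
adapted_layers = [i for i in range(N_LAYERS) if i >= FREEZE_LAYERS]
per_layer = sum(lora_params(i, o, R) for i, o in PROJ_SHAPES.values())
total = per_layer * len(adapted_layers)
print(len(adapted_layers), per_layer, total)  # 26 151552 3940352
```

Roughly 3.9M adapter weights are trained, spread over the last 26 decoder layers.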

In [6]:
tokenizer = GemmaTokenizerFast.from_pretrained(config.checkpoint)
tokenizer.add_eos_token = True  # We'll add <eos> at the end
tokenizer.padding_side = "right"

In [7]:
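model = Gemma2ForSequenceClassification.from_pretrained(
    config.checkpoint,
    num_labels=2,  # two classes: model_a wins / model_b wins
    torch_dtype=torch.float16,
    device_map="auto",
)
# Gemma2ForSequenceClassification keeps its classification head in `model.score`,
# so num_labels=2 sizes the head; assigning `model.classifier` would only add an unused layer.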
model.config.use_cache = False
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model

Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): Gemma2ForSequenceClassification(
      (model): Gemma2Model(
        (embed_tokens): Embedding(256000, 3584, padding_idx=0)
        (layers): ModuleList(
          (0-15): 16 x Gemma2DecoderLayer(
            (self_attn): Gemma2SdpaAttention(
              (q_proj): Linear4bit(in_features=3584, out_features=4096, bias=False)
              (k_proj): Linear4bit(in_features=3584, out_features=2048, bias=False)
              (v_proj): Linear4bit(in_features=3584, out_features=2048, bias=False)
              (o_proj): Linear4bit(in_features=4096, out_features=3584, bias=False)
              (rotary_emb): Gemma2RotaryEmbedding()
            )
            (mlp): Gemma2MLP(
              (gate_proj): Linear4bit(in_features=3584, out_features=14336, bias=False)
              (up_proj): Linear4bit(in_features=3584, out_features=14336, bias=False)
              (down_proj): Linear4bit(in_features=14336, out_features=3584,

# Load & pre-process Data 

In [8]:
INPUT_DIR = "/kaggle/input/wsdm-cup-multilingual-chatbot-arena"

train = pd.read_parquet(f"{INPUT_DIR}/train.parquet")
# test = pd.read_parquet(f"{INPUT_DIR}/test.parquet")

In [9]:
ds = Dataset.from_pandas(train)
ds = ds.select(range(3000))  # keep only the first 3,000 rows

In [10]:
class CustomTokenizer:
    def __init__(self, tokenizer: PreTrainedTokenizerBase, max_length: int) -> None:
        self.tokenizer = tokenizer
        self.max_length = max_length
        
    def __call__(self, batch: dict) -> dict:
        # Ensure that the keys exist in the batch before processing
        prompt = ["<prompt>: " + self.process_text(t) for t in batch.get("prompt", [])]
        response_a = ["\n\n<response_a>: " + self.process_text(t) for t in batch.get("response_a", [])]
        response_b = ["\n\n<response_b>: " + self.process_text(t) for t in batch.get("response_b", [])]
        
        # Concatenate all parts into one text field for tokenization
        texts = [p + r_a + r_b for p, r_a, r_b in zip(prompt, response_a, response_b)]
        
        # Tokenize the texts
        tokenized = self.tokenizer(texts, max_length=self.max_length, truncation=True, padding=True)
        
        # Map the winner labels to class ids: 'model_a' -> 0, 'model_b' -> 1
        label_map = {"model_a": 0, "model_b": 1}
        winners = batch.get("winner", [])
        unknown = set(winners) - set(label_map)
        if unknown:
            # Skipping rows here would leave fewer labels than tokenized texts,
            # so fail loudly instead.
            raise ValueError(f"Unexpected winner values: {unknown}")
        labels = [label_map[w] for w in winners]
        
        # Return tokenized output with labels
        return {**tokenized, "labels": labels}

    @staticmethod
    def process_text(text: str) -> str:
        return text.replace("null", "").strip()
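
To make the input format concrete, here is a standalone sketch of the string that is built for one row before tokenization. `build_text` is a hypothetical helper that mirrors the concatenation above, and the example row is made up. It also shows a side effect of `process_text`: every occurrence of the substring "null" is removed, including inside ordinary words.

```python
def process_text(text: str) -> str:
    # Same cleaning rule as CustomTokenizer.process_text.
    return text.replace("null", "").strip()


def build_text(prompt: str, response_a: str, response_b: str) -> str:
    """Concatenate one row into the single text field fed to the tokenizer."""
    return (
        "<prompt>: " + process_text(prompt)
        + "\n\n<response_a>: " + process_text(response_a)
        + "\n\n<response_b>: " + process_text(response_b)
    )


text = build_text("Is this field nullable?", " Yes. ", "No.")
print(text)
# <prompt>: Is this field able?
#
# <response_a>: Yes.
#
# <response_b>: No.
```

Because the prompt comes first and truncation cuts from the end, very long prompts can push part or all of `response_b` past `max_length`.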

In [11]:
encode = CustomTokenizer(tokenizer, max_length=config.max_length)
ds = ds.map(encode, batched=True)

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

In [12]:
def compute_metrics(eval_preds: EvalPrediction) -> dict:
    preds = eval_preds.predictions
    labels = eval_preds.label_ids
    probs = torch.from_numpy(preds).float().softmax(-1).numpy()
    loss = log_loss(y_true=labels, y_pred=probs)
    acc = accuracy_score(y_true=labels, y_pred=preds.argmax(-1))
    return {"acc": acc, "log_loss": loss}

In [13]:
def compute_metrics(eval_pred: EvalPrediction) -> dict:
    # Note: this definition replaces the compute_metrics from the previous cell,
    # so only accuracy is reported during evaluation.
    # Extract predictions and labels from the EvalPrediction object
    logits, labels = eval_pred.predictions, eval_pred.label_ids

    # Convert logits to predicted class ids by taking the argmax over the class axis
    pred_labels = logits.argmax(axis=-1)

    # Calculate accuracy and other metrics
    accuracy = accuracy_score(labels, pred_labels)
    # precision = precision_score(labels, pred_labels, average='binary')
    # recall = recall_score(labels, pred_labels, average='binary')
    # f1 = f1_score(labels, pred_labels, average='binary')

    # Return the metrics as a dictionary
    return {
        "accuracy": accuracy,
        # "precision": precision,
        # "recall": recall,
        # "f1": f1,
    }

In [14]:
folds = [
    (
        [i for i in range(len(ds)) if i % config.n_splits != fold_idx],
        [i for i in range(len(ds)) if i % config.n_splits == fold_idx]
    ) 
    for fold_idx in range(config.n_splits)
]
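
The split rule is easier to see on a small case. The sketch below applies the same modulo logic to a toy dataset of 10 rows; `make_folds` is a hypothetical wrapper around the list comprehension above.

```python
def make_folds(n_rows: int, n_splits: int) -> list:
    """Row i goes to the eval set of fold (i % n_splits) and to the train set of all others."""
    return [
        (
            [i for i in range(n_rows) if i % n_splits != k],
            [i for i in range(n_rows) if i % n_splits == k],
        )
        for k in range(n_splits)
    ]


toy_folds = make_folds(n_rows=10, n_splits=5)
toy_train, toy_eval = toy_folds[0]
print(toy_train)  # [1, 2, 3, 4, 6, 7, 8, 9]
print(toy_eval)   # [0, 5]
```

The split is deterministic and interleaved: every fifth row goes to the evaluation set, with no shuffling or stratification by label or language.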

In [15]:
train_idx, eval_idx = folds[config.fold_idx]

trainer = Trainer(
    args=training_args, 
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds.select(train_idx),
    eval_dataset=ds.select(eval_idx),
    compute_metrics=compute_metrics,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.6842 | 0.702079 | 0.546667 |




TrainOutput(global_step=600, training_loss=0.9227921708424887, metrics={'train_runtime': 22297.4885, 'train_samples_per_second': 0.108, 'train_steps_per_second': 0.027, 'total_flos': 1.22803984171008e+17, 'train_loss': 0.9227921708424887, 'epoch': 1.0})

In [16]:
model.save_pretrained(config.lora_dir)



# Inference

In [17]:
# def process_text(text):
#     return text.replace("null", "").strip()

In [18]:
# def tokenize_test(row,tokenizer):
#     prompt = ["<prompt>: " + process_text(t) for t in row["prompt"]]
#     response_a = ["\n\n<response_a>: " + process_text(t) for t in row["response_a"]]
#     response_b = ["\n\n<response_b>: " + process_text(t) for t in row["response_b"]]

#     # Concatenate all parts into one text field for tokenization
#     texts = [p + r_a + r_b for p, r_a, r_b in zip(prompt, response_a, response_b)]
        
#     # Tokenize the texts
#     tokenized = tokenizer(texts, max_length=1024, truncation=True, padding=True)

#     return tokenized

In [19]:
# preds_valid = trainer.predict(ds.select(eval_idx)).predictions

# preds_oof[idx_valid] = torch.from_numpy(preds).float().softmax(dim=-1).numpy()[:, -1]

# ds_test = Dataset.from_pandas(test)
# ds_test = ds_test.map(
#     tokenize_test,
#     fn_kwargs={'tokenizer': tokenizer},
#     batched=True,
# )

# # Drop the original text columns to save memory (optional)
# ds_test = ds_test.remove_columns(['prompt', 'response_a', 'response_b'])

# preds_test = trainer.predict(ds_test).predictions

In [20]:
# prob = torch.from_numpy(preds_test).float().softmax(dim=-1).numpy()
# prob

In [21]:
# class_mapping = {0: 'model_a', 1: 'model_b'} 
# test["winner"] = np.argmax(prob, axis=1)

In [22]:
# test['winner'] = test['winner'].map(class_mapping)

In [23]:
# sub=test[["id","winner"]]
# sub.head()

In [24]:
# sub.to_csv("submission.csv",index=False)