This notebook trains my second model: a fine-tuned mDeBERTa (chosen for its multilingual support) combined with some extra engineered features (built in the EDA_FE notebook).

Here we tokenize the prompt together with a single response (a or b), so each row yields two sequences. The resulting embeddings are then combined with the engineered features and fed to a fully connected (FC) classification layer.

The model roughly looks like this:
> Tokenize and embed [CLS] prompt [SEP] response_a [SEP]  
> Tokenize and embed [CLS] prompt [SEP] response_b [SEP]  
> Concatenate [Embed A] + [Embed B]  
> Concatenate [Transformer Output Embedding] + [Feature Vector], then classify with the FC layer  
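
To make the shapes in the diagram above concrete, here is a minimal pure-Python sketch of the fusion step. It is not the code in `ModelsUtils` (which is not shown in this notebook): the random vectors stand in for the encoder outputs, the feature values are made up, and the a / b / tie meaning of the three classes is an assumption. Only the 768 hidden size of `mdeberta-v3-base`, `feature_dim=4` and `num_classes=3` come from the model and the cells below.

```python
import math
import random

HIDDEN = 768        # hidden size of microsoft/mdeberta-v3-base
FEATURE_DIM = 4     # one float per engineered feature
NUM_CLASSES = 3     # assumed to mean: a wins / b wins / tie


def linear(x, weight, bias):
    """Plain fully connected layer: y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embed_a, embed_b, features, weight, bias):
    # Cat [Embed A] + [Embed B], then append the feature vector
    fused = embed_a + embed_b + features
    return softmax(linear(fused, weight, bias))


rng = random.Random(0)
# Stand-ins for the encoder outputs of (prompt, response_a) and (prompt, response_b)
embed_a = [rng.gauss(0, 1) for _ in range(HIDDEN)]
embed_b = [rng.gauss(0, 1) for _ in range(HIDDEN)]
# Made-up example values for the four "a minus b" features
features = [0.12, -35.0, 0.0, 0.03]

in_dim = 2 * HIDDEN + FEATURE_DIM  # size of the fused vector
weight = [[rng.gauss(0, 0.01) for _ in range(in_dim)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = classify(embed_a, embed_b, features, weight, bias)
print(in_dim, probs)
```

The real model may pool the encoder output differently or pass the features through their own layer first (the optimizer below refers to a `feature_fc` module), so treat this only as a picture of the data flow.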

The engineered features are pretty simple and include:
- Prompt-response similarity
- Response length
- N-grams/keywords
- Lexical diversity

Each feature is reduced to a single float: the difference between the value computed for response a and the value computed for response b.
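
As a rough illustration of this "difference between a and b" idea, here is a small pure-Python sketch. It is not the EDA_FE implementation: Jaccard token overlap is swapped in for the embedding-based similarity, prompt bigram coverage is swapped in for the n-gram/keyword feature, and the `a - b` sign convention and every function name here are assumptions.

```python
import re


def tokens(text):
    """Lowercased word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    """Token-set overlap, used here as a cheap stand-in for semantic similarity."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def lexical_diversity(toks):
    """Type-token ratio: unique tokens divided by total tokens."""
    return len(set(toks)) / len(toks) if toks else 0.0


def bigrams(toks):
    return set(zip(toks, toks[1:]))


def bigram_overlap(prompt_toks, resp_toks):
    """Share of prompt bigrams that reappear in the response."""
    p = bigrams(prompt_toks)
    return len(p & bigrams(resp_toks)) / len(p) if p else 0.0


def pair_features(prompt, response_a, response_b):
    """One float per feature: value for response a minus value for response b."""
    p, a, b = tokens(prompt), tokens(response_a), tokens(response_b)
    return [
        jaccard(p, a) - jaccard(p, b),                # prompt-response similarity
        float(len(a) - len(b)),                       # response length (in tokens)
        bigram_overlap(p, a) - bigram_overlap(p, b),  # n-gram overlap
        lexical_diversity(a) - lexical_diversity(b),  # lexical diversity
    ]


feats = pair_features(
    "What is the capital of France?",
    "The capital of France is Paris.",
    "Paris.",
)
print(feats)
```

Taking differences keeps the feature vector at four floats per row, which matches the `feature_dim=4` passed to the model further down, and swapping the two responses simply flips the sign of every feature.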


Regarding the learning rate, I manually tried several setups. The best I got so far was to lower the learning rate for the fine-tuned transformer, giving more learning impact to the FC layer, coupled with a linear warm-up/decay scheduler (warm-up over the first 5% of steps). I should have run some kind of grid search for better hyperparameters (I will for future competitions). I also saw people getting pretty good results with Gemma coupled with LoRA, which I should look into for the future.
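
To show what such a schedule does to the two learning rates, here is a small sketch of the linear warm-up/decay multiplier, written in plain Python so it runs without the training stack. The function name and the 1000-step example are illustrative assumptions; the base learning rates and the 5% warm-up are the ones used later in this notebook.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to each base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 up to 1 during warm-up
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


num_training_steps = 1000                            # illustrative value
num_warmup_steps = int(0.05 * num_training_steps)    # 5% warm-up -> 50 steps
base_lrs = {"transformer": 2e-6, "feature_fc": 1e-3}

for step in (0, 25, 50, 525, 1000):
    factor = linear_warmup_decay(step, num_warmup_steps, num_training_steps)
    lrs = {name: lr * factor for name, lr in base_lrs.items()}
    print(step, factor, lrs)
```

Because the multiplier reaches zero on the last step, a learning rate of 0.0 printed at the very end of training is expected with this kind of schedule.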

This solution will be the building block for the next competition: https://www.kaggle.com/competitions/wsdm-cup-multilingual-chatbot-arena/overview


-------------------------------------------------------------

In [1]:
# For Kaggle
# Install as dependencies:
#!pip install -U KeyBERT
#import sys 
#sys.path.append("/kaggle/input/sentence-transformers-2-4-0/sentence_transformers-2.4.0-py3-none-any.whl") 
#import sentence_transformers

In [2]:
import sklearn

import numpy as np 
import pandas as pd
from tqdm import tqdm
import json

import matplotlib.pyplot as plt
import matplotlib as mpl
import plotly.express as px

from sklearn.model_selection import train_test_split

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

import ModelsUtils as Utils



In [3]:
import torch

from torch.utils.data import Dataset, DataLoader

import torch.optim as optim
from torch.nn.functional import cross_entropy

from transformers import AutoModel, AutoTokenizer

print('Torch version:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'Torch device : {device}')
print('------------------------------')

Torch version: 2.5.1+cu118
CUDA available: True
Torch device : cuda
------------------------------


In [4]:
#sequence_length = 64
sequence_length = 256
#sequence_length = 512
BATCH_SIZE = 1
sample_size = 0.01      # Only used when both MINI_RUN and BUILD_DATASET are True
EPOCHS = 1
BUILD_DATASET = False   # If False, load already-preprocessed data from file
MINI_RUN = True         # Test run with very little data

model_name = "microsoft/mdeberta-v3-base" # For multilingual purpose


In [5]:
BASE_PATH = './kaggle/input/llm-classification-finetuning'
CUSTOM_BASE_PATH = '../Data'

## Files

In [6]:
# Load train data
if BUILD_DATASET:
    df = pd.read_csv(f'{BASE_PATH}/train.csv')
elif MINI_RUN:
    df = pd.read_csv(f'{CUSTOM_BASE_PATH}/train_preprocessed_mini.csv')
else:
    df = pd.read_csv(f'{CUSTOM_BASE_PATH}/train_preprocessed_full.csv')

# For kaggle
#/kaggle/input/preprocessed-dataset-mini/train_preprocessed_mini.csv
#/kaggle/input/train-preprocessed-full/train_preprocessed_full.csv

In [7]:
df['prompt'] = df['prompt'].astype(str)
df['response_a'] = df['response_a'].astype(str)
df['response_b'] = df['response_b'].astype(str)

## Tokenizer

In [8]:
# Local only
# Temporary code to save the model for upload to Kaggle (it is not in Kaggle's list of offline pretrained models)
if False:
    model = AutoModel.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    save_path = 'deberta_v3_small_pretrained_model_pytorch_CPU'
    model.save_pretrained(save_path)
    tokenizer.save_pretrained(save_path)

In [9]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenization function: encodes (prompt, response) as one sequence pair per response
def tokenize_data(prompt, response1, response2, max_length=sequence_length):
    tokens_resp1 = tokenizer(
        prompt,
        response1,  # Second segment: response a
        padding="max_length",
        truncation=True,
        max_length=max_length,
        return_tensors="pt"
    )

    tokens_resp2 = tokenizer(
        prompt,
        response2,  # Second segment: response b
        padding="max_length",
        truncation=True,
        max_length=max_length,
        return_tensors="pt"
    )

    # Each tensor has shape (1, max_length)
    return {
        'input_ids_resp1': tokens_resp1['input_ids'],
        'attention_mask_resp1': tokens_resp1['attention_mask'],
        'input_ids_resp2': tokens_resp2['input_ids'],
        'attention_mask_resp2': tokens_resp2['attention_mask']
    }



## Call

### Split

In [10]:
df_train, df_valid = train_test_split(df, test_size=0.1, random_state=42)

In [11]:
# Prepare datasets and dataloaders
dataset_train = Utils.ChatbotArenaDataset(df_train, tokenizer)
dataloader_train = DataLoader(dataset_train, batch_size=BATCH_SIZE, shuffle=True)

dataset_valid = Utils.ChatbotArenaDataset(df_valid, tokenizer)
dataloader_valid = DataLoader(dataset_valid, batch_size=BATCH_SIZE, shuffle=False)  # No need to shuffle for evaluation

In [12]:
# Initialize model, optimizer
model = Utils.PreferencePredictionModel(transformer_name=model_name, feature_dim=4, num_classes=3)

optimizer = optim.AdamW([
    {'params': model.transformer.parameters(), 'lr': 2e-6},     # Lower learning rate for transformer layers
    {'params': model.feature_fc.parameters(), 'lr': 1e-3},      # Higher learning rate for custom layers
], weight_decay=0.01)


In [13]:
from transformers import get_scheduler

num_training_steps = len(dataloader_train) * EPOCHS
num_warmup_steps = int(0.05 * num_training_steps)  # Warm up for 5% of total steps

lr_scheduler = get_scheduler(
    name="linear",  # Linear warm-up and decay
    optimizer=optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps
)

In [14]:
# Train and evaluate
Utils.train_model(model, dataloader_train, dataloader_valid, optimizer, lr_scheduler, device=device, num_epochs=EPOCHS)

100%|██████████| 517/517 [01:44<00:00,  4.95row/s]


9.914715599682628 val loss is better than previous 9.914715599682628, saving checkpoint epoch:  0
Trainning Epoch 1, Accumulated Train Loss: 10.380057711937878
Eval : Valid Loss: 9.914715599682628, Valid Accuracy : 0.41379310344827586
Current learning rate: 0.0
Current learning rate: 0.0


## Future Directions

In this notebook, we've achieved a good score with a small model and a modest token length. Because of the complexity of the task and the data, it is hard to iterate quickly and test different ideas. The 30 hours of free GPU time from Kaggle are very nice, but other resources such as Colab (expensive) might be a solution for faster iteration.

There's plenty of room to improve. Here's how:

- Longer token length (1024?)
- Try bigger models like Gemma. I see a lot of good public scores achieved with this model, so it is worth experimenting
- Better data handling: maybe filter some data, or augment it with data from similar competitions?
- Some kind of grid search to find better hyperparameters?
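
The grid search idea from the list above could start from something as small as the sketch below. The `grid_search` helper, the candidate values and the parameter names are all hypothetical, and `fake_train_and_eval` is only a placeholder objective so the example runs on its own; in practice it would be replaced by a short training run that returns the validation loss.

```python
from itertools import product


def grid_search(train_and_eval, grid):
    """Try every combination in `grid` and return (best_loss, best_config)."""
    keys = list(grid)
    best = None
    for values in product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        loss = train_and_eval(**config)
        if best is None or loss < best[0]:
            best = (loss, config)
    return best


# Hypothetical candidate values around the settings used in this notebook
grid = {
    "transformer_lr": [1e-6, 2e-6, 5e-6],
    "head_lr": [1e-4, 1e-3],
    "warmup_ratio": [0.05, 0.1],
}


def fake_train_and_eval(transformer_lr, head_lr, warmup_ratio):
    # Placeholder objective, NOT a real result: swap in a short training run
    # (for example on the mini dataset) that returns the validation loss.
    return (abs(transformer_lr - 2e-6) * 1e5
            + abs(head_lr - 1e-3) * 10
            + abs(warmup_ratio - 0.05))


best_loss, best_config = grid_search(fake_train_and_eval, grid)
print(best_loss, best_config)
```

With 3 x 2 x 2 = 12 combinations, each run has to stay short to fit in a limited GPU budget, which is one reason to evaluate on the mini dataset first.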

I stopped trying to improve this result as soon as I found out there was a timed competition of the same type with almost the same setup. I simply kept evolving this code for that competition:

https://www.kaggle.com/competitions/wsdm-cup-multilingual-chatbot-arena/overview