This notebook is the trainning of my second model, focusing on using a finetuned mdeberta (for multiligual purpose) with some extra feature engineering (made in the EDA_FE notebook).

Here we are tokenizing promt and unique response (a/b) together for each row. The resulting embbeding is then coupled with some feature engineering and fed to a classification fc layer.

The modelisation roughly looks like that :
> tokenise and embbed [CLS] prompt [SEP] response_a [SEP]  
> tokenise and embbed [CLS] prompt [SEP] response_b [SEP]  
> Cat [Embbed A] + [Embbed B]  
> [Transformer Output Embedding] + [Feature Vector]  

Feature engineering are pretty simple and include:
- Prompt-Response Similarity
- Response Length
- N-grams/Keywords
- lexical diversity

Each time creating a single float by substracting the result of a and b.


Reguarding the learning rate, I manually tried different setup, the best i got so far was to lower the learning rate for the finetuning part, giving more learning impact for the FC layer, coupled with a warm-up/decay scheduler. I should have done some kind of gridsearch for better hyperparameters (will do for future competition). Also i found some ppl using Gemma couple with LoRA having pretty good results. I should take a look on this for the future.

This solution will be the building block for the next competition : https://www.kaggle.com/competitions/wsdm-cup-multilingual-chatbot-arena/overview


-------------------------------------------------------------

In [None]:
# For Kaggle
# install in dependencies :
#!pip install -U KeyBERT
#import sys 
#sys.path.append("/kaggle/input/sentence-transformers-2-4-0/sentence_transformers-2.4.0-py3-none-any.whl") 
#import sentence_transformers

In [1]:
import sklearn

import numpy as np 
import pandas as pd
from tqdm import tqdm
import json

import matplotlib.pyplot as plt
import matplotlib as mpl
import plotly.express as px

from sklearn.model_selection import train_test_split

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

import ModelsUtils as Utils

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
import torch

from torch.utils.data import Dataset, DataLoader

import torch.optim as optim
from torch.nn.functional import cross_entropy

from transformers import AutoModel, AutoTokenizer

print('Torch version:', torch.__version__)
print('Torch is build with CUDA:', torch.cuda.is_available())
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'Torch device : {device}')
print('------------------------------')


device = "cpu"

Torch version: 2.5.1+cu118
Torch is build with CUDA: True
Torch device : cuda
------------------------------


In [3]:
#sequence_length = 64
sequence_length = 256
#sequence_length = 512
BATCH_SIZE = 1
sample_size = 0.01      # Will only be taken if [MINI_RUN & BUILD_DATASET] are True
EPOCHS = 1
BUILD_DATASET = False#True # Will load from file pre-preprocessed data if False
MINI_RUN = True         # Test run with very little data

model_name = "microsoft/mdeberta-v3-base" # For multilingual purpose


In [4]:
BASE_PATH = './kaggle/input/llm-classification-finetuning'
CUSTOM_BASE_PATH = '../Data'

## Files

`train.csv`
- `id`
- `model_[a/b]`: Model identity, present in train.csv but not in test.csv.
- `prompt`: Input prompt given to both models.
- `response_[a/b]`: Model_[a/b]'s response to the prompt.
- `winner_model_[a/b/tie]`: Binary columns indicating the judge's selection (ground truth target).

`test.csv`
- `id`: Unique identifier for each row.
- `prompt`: Input prompt given to both models.
- `response_[a/b]`: Model_[a/b]'s response to the prompt.

> !!!! Note that each interaction may have multiple prompts and responses, but this notebook will use only one prompt per interaction. You can choose to use all prompts and responses. Additionally, prompts and responses in the dataframe are provided as string-formatted lists, so they need to be converted to literal lists using eval().

> !!! TODO : use all prompts

In [8]:
# Load Train Data
df = pd.DataFrame()

if BUILD_DATASET:
    df = pd.read_csv(f'{BASE_PATH}/train.csv')
else:
    if MINI_RUN:
        df = pd.read_csv(f'{CUSTOM_BASE_PATH}/train_preprocessed_mini.csv')
    else:
        df = pd.read_csv(f'{CUSTOM_BASE_PATH}/train_preprocessed_full.csv')

# For kaggle
#/kaggle/input/preprocessed-dataset-mini/train_preprocessed_mini.csv
#/kaggle/input/train-preprocessed-full/train_preprocessed_full.csv

# Load Test Data
test_df = pd.read_csv(f'{BASE_PATH}/test.csv')

## Train Data

In [10]:
# Sample data
if BUILD_DATASET & MINI_RUN:
    df = df.sample(frac=sample_size)

if BUILD_DATASET :
    # Take the first prompt and its associated response
    #df["prompt"] = df.prompt.map(lambda x: eval(x)[0])
    df["prompt"] = df.prompt.map(lambda x: ' '.join(eval(x.replace("null","''"))))
    df["response_a"] = df.response_a.map(lambda x: ' '.join(eval(x.replace("null","''"))))
    df["response_b"] = df.response_b.map(lambda x: ' '.join(eval(x.replace("null", "''"))))

    # Label conversion
    df["class_name"] = df[["winner_model_a", "winner_model_b", "winner_tie"]].idxmax(axis=1)
    df["class_label"] = df.class_name.map(CFG.name2label)

# Show Sample
df.head()

Unnamed: 0,id,model_a,model_b,prompt,response_a,response_b,winner_model_a,winner_model_b,winner_tie,class_name,...,length_diff,resp1_lexical_div,resp2_lexical_div,lexical_div_diff,resp1_similarity,resp2_similarity,similarity_diff,resp1_keyword_overlap,resp2_keyword_overlap,keyword_overlap_diff
0,1082567922,gpt-4-0314,koala-13b,What is a herring gull,A herring gull (Larus argentatus) is a large s...,A herring gull is a large bird in the gull gen...,0,1,0,winner_model_b,...,-218,0.715596,0.684932,0.030665,0.910796,0.831494,0.079302,1.0,1.0,0.0
1,2045704262,codellama-34b-instruct,gpt-4-0613,Write a program to fry device cpu,"I cannot fulfill your request. I'm just an AI,...","Sorry, but I can't assist with that.",0,0,1,winner_tie,...,512,0.766667,1.0,-0.233333,0.445107,0.104861,0.340246,0.4,0.0,0.4
2,1510303905,gemini-pro-dev-api,tulu-2-dpo-70b,what does ((this)) indicate,`((this))` is a special JavaScript keyword tha...,"In JavaScript, `this` is a keyword used to ref...",0,0,1,winner_tie,...,590,0.454545,0.582192,-0.127646,0.585441,0.624612,-0.039172,0.0,0.0,0.0
3,1892761574,gpt-3.5-turbo-1106,mistral-medium,Please write a prompt in order to build an ont...,Prompt: Building an Ontology for the Semantic ...,Building an Ontology for a Semantic Entity: A ...,0,0,1,winner_tie,...,-1849,0.64257,0.431985,0.210585,0.580195,0.900922,-0.320727,0.4,0.6,-0.2
4,3254661415,palm-2,vicuna-13b,"using pixar openusd python api, how do i creat...",To create an asset resolver system that tracks...,To create an asset resolver system that tracks...,0,0,1,winner_tie,...,-29,0.434426,0.446352,-0.011926,0.690365,0.639343,0.051022,0.2,0.4,-0.2


In [11]:
df['prompt'] = df['prompt'].astype(str)
df['response_a'] = df['response_a'].astype(str)
df['response_b'] = df['response_b'].astype(str)

## Test Data

In [12]:
# Take the first prompt and response
test_df["prompt"] = test_df.prompt.map(lambda x: ' '.join(eval(x.replace("null","''"))))
test_df["response_a"] = test_df.response_a.map(lambda x: ' '.join(eval(x.replace("null","''"))))
test_df["response_b"] = test_df.response_b.map(lambda x: ' '.join(eval(x.replace("null", "''"))))

# Show Sample
test_df.head()

Unnamed: 0,id,prompt,response_a,response_b
0,136060,"I have three oranges today, I ate an orange ye...",You have two oranges today.,You still have three oranges. Eating an orange...
1,211333,You are a mediator in a heated political debat...,Thank you for sharing the details of the situa...,Mr Reddy and Ms Blue both have valid points in...
2,1233961,How to initialize the classification head when...,When you want to initialize the classification...,To initialize the classification head when per...


## Encoding

In [13]:
# Codec stuff

# Define a function to create options based on the prompt and choices
def reencode(row):
    row["encode_fail"] = False
    try:
        row["prompt"] = row.prompt.encode("utf-8").decode("utf-8")
    except:
        row["prompt"] = ""
        row["encode_fail"] = True

    try:
        row["response_a"] = row.response_a.encode("utf-8").decode("utf-8")
    except:
        row["response_a"] = ""
        row["encode_fail"] = True

    try:
        row["response_b"] = row.response_b.encode("utf-8").decode("utf-8")
    except:
        row["response_b"] = ""
        row["encode_fail"] = True
        
    return row

if BUILD_DATASET :
    df = df.apply(reencode, axis=1)  # Apply the make_pairs function to each row in df
    display(df.head(2))  # Display the first 2 rows of df

test_df = test_df.apply(reencode, axis=1)  # Apply the make_pairs function to each row in df
display(test_df.head(2))  # Display the first 2 rows of df

Unnamed: 0,id,prompt,response_a,response_b,encode_fail
0,136060,"I have three oranges today, I ate an orange ye...",You have two oranges today.,You still have three oranges. Eating an orange...,False
1,211333,You are a mediator in a heated political debat...,Thank you for sharing the details of the situa...,Mr Reddy and Ms Blue both have valid points in...,False


In [14]:
df.encode_fail.value_counts(normalize=False)

encode_fail
False    569
True       6
Name: count, dtype: int64

## EDA

In [15]:
model_df = pd.concat([df.model_a, df.model_b])
counts = model_df.value_counts().reset_index()
counts.columns = ['LLM', 'Count']

# Create a bar plot with custom styling using Plotly
fig = px.bar(counts, x='LLM', y='Count',
                title='Distribution of LLMs',
                color='Count', color_continuous_scale='viridis', width=1000)

fig.update_layout(xaxis_tickangle=-45)  # Rotate x-axis labels for better readability

fig.show()

### Winning distribution

In [16]:
counts = df['class_name'].value_counts().reset_index()
counts.columns = ['Winner', 'Win Count']

fig = px.bar(counts, x='Winner', y='Win Count',
                title='Winner distribution for Train Data',
                labels={'Winner': 'Winner', 'Win Count': 'Win Count'},
                color='Winner', color_continuous_scale='viridis', width=1000)

fig.update_layout(xaxis_title="Winner", yaxis_title="Win Count")

fig.show()

### Winning distribution ratio per model

In [17]:
models_a = df.query('winner_model_a == 1').groupby(['model_a'])['winner_model_a'].count().reset_index() 
models_a.columns = ['model', 'wins']
models_a['losses'] = df.query('winner_model_a == 0').groupby(['model_a'])['winner_model_a'].count().reset_index()['winner_model_a']

models_b = df.query('winner_model_b == 1').groupby(['model_b'])['winner_model_b'].count().reset_index() 
models_b.columns = ['model', 'wins']
models_b['losses'] = df.query('winner_model_b == 0').groupby(['model_b'])['winner_model_b'].count().reset_index()['winner_model_b']

models = models_a
models[['wins', 'losses']] = models_a[['wins', 'losses']] + models_b[['wins', 'losses']]

In [18]:
models['winsRatio'] = (models['wins'] / (models['wins'] + models['losses']))
models.sort_values(by='winsRatio', ascending=False, inplace=True)

In [19]:
#models.sort_values(by='wins', ascending=False, inplace=True)

fig = px.bar(
    data_frame = models,
    x = "model",
    y = ["winsRatio"],
    opacity = 0.9,
    #orientation = "v",
    #barmode = 'stack',
    title='Wins ratio per model',
)

fig.update_layout(xaxis_tickangle=-45)  # Rotate x-axis labels for better readability

fig.show()

> Data we can create from this dataset:
> - ratio length response/prompt
> - embeddings cosine similarity
> - check embbeding vector difference between prompt/response (create a 'mean' difference vector from all best response) and check cosine similarity distribution.

In [20]:
df.head()

Unnamed: 0,id,model_a,model_b,prompt,response_a,response_b,winner_model_a,winner_model_b,winner_tie,class_name,...,length_diff,resp1_lexical_div,resp2_lexical_div,lexical_div_diff,resp1_similarity,resp2_similarity,similarity_diff,resp1_keyword_overlap,resp2_keyword_overlap,keyword_overlap_diff
0,1082567922,gpt-4-0314,koala-13b,What is a herring gull,A herring gull (Larus argentatus) is a large s...,A herring gull is a large bird in the gull gen...,0,1,0,winner_model_b,...,-218,0.715596,0.684932,0.030665,0.910796,0.831494,0.079302,1.0,1.0,0.0
1,2045704262,codellama-34b-instruct,gpt-4-0613,Write a program to fry device cpu,"I cannot fulfill your request. I'm just an AI,...","Sorry, but I can't assist with that.",0,0,1,winner_tie,...,512,0.766667,1.0,-0.233333,0.445107,0.104861,0.340246,0.4,0.0,0.4
2,1510303905,gemini-pro-dev-api,tulu-2-dpo-70b,what does ((this)) indicate,`((this))` is a special JavaScript keyword tha...,"In JavaScript, `this` is a keyword used to ref...",0,0,1,winner_tie,...,590,0.454545,0.582192,-0.127646,0.585441,0.624612,-0.039172,0.0,0.0,0.0
3,1892761574,gpt-3.5-turbo-1106,mistral-medium,Please write a prompt in order to build an ont...,Prompt: Building an Ontology for the Semantic ...,Building an Ontology for a Semantic Entity: A ...,0,0,1,winner_tie,...,-1849,0.64257,0.431985,0.210585,0.580195,0.900922,-0.320727,0.4,0.6,-0.2
4,3254661415,palm-2,vicuna-13b,"using pixar openusd python api, how do i creat...",To create an asset resolver system that tracks...,To create an asset resolver system that tracks...,0,0,1,winner_tie,...,-29,0.434426,0.446352,-0.011926,0.690365,0.639343,0.051022,0.2,0.4,-0.2


## Feature engineering

#### 1. Response Length

In [21]:
def add_length_features(df):
    df['resp1_length'] = df['response_a'].apply(len)
    df['resp2_length'] = df['response_b'].apply(len)
    df['length_diff'] = df['resp1_length'] - df['resp2_length']  # Difference in lengths
    return df

#### 2. Lexical Diversity

In [22]:
def lexical_diversity(text):
    tokens = text.split()  # Tokenize by whitespace
    return len(set(tokens)) / len(tokens) if len(tokens) > 0 else 0

def add_lexical_features(df):
    df['resp1_lexical_div'] = df['response_a'].apply(lexical_diversity)
    df['resp2_lexical_div'] = df['response_b'].apply(lexical_diversity)
    df['lexical_div_diff'] = df['resp1_lexical_div'] - df['resp2_lexical_div']
    return df

#### 3. Sentiment analysis

In [23]:
from transformers import pipeline

# Load sentiment analysis pipeline (ensure it's multilingual)
sentiment_analyzer = pipeline("sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment", device=0)

def get_sentiment(text):
    result = sentiment_analyzer(text[:512])  # Truncate to 512 tokens for BERT-based models
    return result[0]['label']

def add_sentiment_features(df):
    df['resp1_sentiment'] = df['response_a'].apply(get_sentiment)
    df['resp2_sentiment'] = df['response_b'].apply(get_sentiment)
    # Convert sentiments to numeric scale (e.g., positive=1, neutral=0, negative=-1)
    sentiment_map = {'positive': 1, 'neutral': 0, 'negative': -1}
    df['resp1_sentiment_num'] = df['resp1_sentiment'].map(sentiment_map)
    df['resp2_sentiment_num'] = df['resp2_sentiment'].map(sentiment_map)
    df['sentiment_diff'] = df['resp1_sentiment_num'] - df['resp2_sentiment_num']
    return df


#### 4. Semantic Similarity

In [51]:
# Load a multilingual sentence transformer model
embedder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2__')

#embedder.save('paraphrase-multilingual-MiniLM-L12-v2__')

def calculate_similarity(prompt, response):
    embeddings = embedder.encode([prompt, response])
    return cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]

def add_similarity_features(df):
    df['resp1_similarity'] = df.apply(lambda x: calculate_similarity(x['prompt'], x['response_a']), axis=1)
    df['resp2_similarity'] = df.apply(lambda x: calculate_similarity(x['prompt'], x['response_b']), axis=1)
    df['similarity_diff'] = df['resp1_similarity'] - df['resp2_similarity']
    return df

#### 5. Question-Answer Alignment

In [57]:
from keybert import KeyBERT

# Use KeyBERT for keyword extraction
#kw_model = KeyBERT()

kw_model = KeyBERT(model='paraphrase-multilingual-MiniLM-L12-v2__')

def get_keyword_overlap(prompt, response):
    prompt_keywords = set([kw[0] for kw in kw_model.extract_keywords(prompt)])
    response_keywords = set([kw[0] for kw in kw_model.extract_keywords(response)])
    overlap = len(prompt_keywords & response_keywords)
    return overlap / len(prompt_keywords) if len(prompt_keywords) > 0 else 0

def add_keyword_overlap_features(df):
    df['resp1_keyword_overlap'] = df.apply(lambda x: get_keyword_overlap(x['prompt'], x['response_a']), axis=1)
    df['resp2_keyword_overlap'] = df.apply(lambda x: get_keyword_overlap(x['prompt'], x['response_b']), axis=1)
    df['keyword_overlap_diff'] = df['resp1_keyword_overlap'] - df['resp2_keyword_overlap']
    return df

#### 6. Language-Specific Formality or Tone (TODO for each language)

In [26]:
# # Example with spaCy and third-party plugins
# import spacy

# # Load spaCy models for specific languages
# nlp_en = spacy.load("en_core_web_sm")  # English example

# def detect_formality(text):
#     doc = nlp_en(text)
#     formality_score = sum(1 for token in doc if token.pos_ in ["VERB", "ADV"]) / len(doc)
#     return formality_score if len(doc) > 0 else 0

# def add_formality_features(df):
#     df['resp1_formality'] = df['response_a'].apply(detect_formality)
#     df['resp2_formality'] = df['response_b'].apply(detect_formality)
#     df['formality_diff'] = df['resp1_formality'] - df['resp2_formality']
#     return df

# # wont add until all languages supported

#### 7. Named Entity Recognition (NER)

In [53]:
def count_entities(text, nlp_model):
    doc = nlp_model(text)
    return len(doc.ents)

def add_ner_features(df, nlp_model):
    df['resp1_entities'] = df['response_a'].apply(lambda x: count_entities(x, nlp_model))
    df['resp2_entities'] = df['response_b'].apply(lambda x: count_entities(x, nlp_model))
    df['entity_diff'] = df['resp1_entities'] - df['resp2_entities']
    return df

### Extract all features

In [58]:
def extract_all_features(df):
    df = add_length_features(df)
    df = add_lexical_features(df)
    #df = add_sentiment_features(df)
    df = add_similarity_features(df)
    df = add_keyword_overlap_features(df)
    #df = add_formality_features(df)
    #df = add_ner_features(df)
    return df

# test will need preprocess no matter what, can't pre load them from kaggle
test_df = extract_all_features(test_df)

if BUILD_DATASET:
    df = extract_all_features(df)
    if MINI_RUN :
        df.to_csv(f'{CUSTOM_BASE_PATH}/train_preprocessed_mini.csv', index = False)
    else:
        df.to_csv(f'{CUSTOM_BASE_PATH}/train_preprocessed_full.csv', index = False)
    

In [29]:
if BUILD_DATASET:
    raise SystemExit("Stop")

## Tokenizer

In [48]:
# temp code to upload model on Kaggle (not on the Kaggle's pretrainned offline model list)

model_name = "microsoft/mdeberta-v3-base"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

save_path = 'deberta_v3_small_pretrained_model_pytorch_CPU'

model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)


The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.



('deberta_v3_small_pretrained_model_pytorch_CPU\\tokenizer_config.json',
 'deberta_v3_small_pretrained_model_pytorch_CPU\\special_tokens_map.json',
 'deberta_v3_small_pretrained_model_pytorch_CPU\\spm.model',
 'deberta_v3_small_pretrained_model_pytorch_CPU\\added_tokens.json',
 'deberta_v3_small_pretrained_model_pytorch_CPU\\tokenizer.json')

In [30]:
# Load a multilingual tokenizer
#model_name = "xlm-roberta-base"
model_name = "microsoft/mdeberta-v3-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenization function
def tokenize_data(prompt, response1, response2, max_length=sequence_length):
    tokens_resp1 = tokenizer(
        prompt,
        response1,  # Pair of responses
        #[response1, response2],  # Pair of responses
        padding="max_length",
        truncation=True,
        max_length=max_length,
        return_tensors="pt"
    )
    
    tokens_resp2 = tokenizer(
        prompt,
        response2,  # Pair of responses
        #[response1, response2],  # Pair of responses
        padding="max_length",
        truncation=True,
        max_length=max_length,
        return_tensors="pt"
    )
    
    return {
        'input_ids_resp1': tokens_resp1['input_ids'],
        'attention_mask_resp1': tokens_resp1['attention_mask'],
        'input_ids_resp2': tokens_resp2['input_ids'],
        'attention_mask_resp2': tokens_resp2['attention_mask']
    }


The sentencepiece tokenizer that you are converting to a fast tokenizer uses the byte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these unknown tokens into a sequence of byte tokens matching the original piece of text.



## Data Loader

In [31]:
class ChatbotArenaDataset(Dataset):
    def __init__(self, dataframe, tokenizer, test=False, max_length=sequence_length):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.num_classes = 3
        self.test = test

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]

        # Tokenize the text
        tokens = tokenize_data(row['prompt'], row['response_a'], row['response_b'], self.max_length)
        
        # Extract engineered features
        features = torch.tensor([
            #row['resp1_length'],
            #row['resp2_length'],
            row['length_diff'],
            #row['resp1_lexical_div'],
            #row['resp2_lexical_div'],
            row['lexical_div_diff'],
            #row['resp1_similarity'],
            #row['resp2_similarity'],
            row['similarity_diff'],
            #row['resp1_keyword_overlap'],
            #row['resp2_keyword_overlap'],
            row['keyword_overlap_diff'],
        ], dtype=torch.float)

        if not self.test:
            # Label
            label = torch.nn.functional.one_hot(torch.tensor(row['class_label']), num_classes=self.num_classes).float()

            return {
                'input_ids_resp1': tokens['input_ids_resp1'].squeeze(0),
                'attention_mask_resp1': tokens['attention_mask_resp1'].squeeze(0),
                'input_ids_resp2': tokens['input_ids_resp2'].squeeze(0),
                'attention_mask_resp2': tokens['attention_mask_resp2'].squeeze(0),
                'features': features,
                'label': label
            }
        else:
            return {
                'input_ids_resp1': tokens['input_ids_resp1'].squeeze(0),
                'attention_mask_resp1': tokens['attention_mask_resp1'].squeeze(0),
                'input_ids_resp2': tokens['input_ids_resp2'].squeeze(0),
                'attention_mask_resp2': tokens['attention_mask_resp2'].squeeze(0),
                'features': features
            }

## Modelisation

In [32]:
import torch.nn as nn

class PreferencePredictionModel(nn.Module):
    def __init__(self, transformer_name, feature_dim, num_classes=3):
        super(PreferencePredictionModel, self).__init__()
        
        # Load transformer model
        self.transformer = AutoModel.from_pretrained(transformer_name)
        transformer_hidden_size = self.transformer.config.hidden_size  # e.g., 768 for XLM-RoBERTa
        
        # Fully connected layers for features
        self.feature_fc = nn.Linear(feature_dim, 64)
        
        # Final classification layer
        self.classifier = nn.Sequential(
            nn.Linear(2 * transformer_hidden_size + 64, 128),  # Combine response1, response2, and features
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, num_classes)
        )
    
    def forward(self, input_ids_resp1, attention_mask_resp1, input_ids_resp2, attention_mask_resp2, features):
        # Process response1
        output_resp1 = self.transformer(input_ids=input_ids_resp1, attention_mask=attention_mask_resp1)
        cls_embedding_resp1 = output_resp1.last_hidden_state[:, 0, :]  # CLS token
        
        # Process response2
        output_resp2 = self.transformer(input_ids=input_ids_resp2, attention_mask=attention_mask_resp2)
        cls_embedding_resp2 = output_resp2.last_hidden_state[:, 0, :]  # CLS token
        
        # Feature processing
        feature_output = self.feature_fc(features)
        
        # Concatenate and classify
        combined = torch.cat((cls_embedding_resp1, cls_embedding_resp2, feature_output), dim=1)
        logits = self.classifier(combined)
        
        return logits

## Evaluation function

In [33]:
# Evaluation (use for trainning)
def evaluate_model(model, dataloader, device="cuda"):
    model = model.to(device)
    model.eval()
    total_loss = 0.0
    correct = 0
    total_samples = 0

    # Use BCEWithLogitsLoss for one-hot encoded labels
    loss_fn = nn.CrossEntropyLoss()
    #loss_fn = nn.BCEWithLogitsLoss()
    #loss_fn = nn.BCELoss()

    with torch.no_grad():
        for batch in dataloader:
            # Move data to device
            input_ids_resp1 = batch['input_ids_resp1'].to(device)
            attention_mask_resp1 = batch['attention_mask_resp1'].to(device)
            input_ids_resp2 = batch['input_ids_resp2'].to(device)
            attention_mask_resp2 = batch['attention_mask_resp2'].to(device)
            features = batch['features'].to(device)
            labels = batch['label'].to(device)  # One-hot encoded labels

            # Forward pass
            logits = model(
                input_ids_resp1=input_ids_resp1,
                attention_mask_resp1=attention_mask_resp1,
                input_ids_resp2=input_ids_resp2,
                attention_mask_resp2=attention_mask_resp2,
                features=features
            )

            # Compute loss
            loss = loss_fn(logits, labels)
            total_loss += loss.item()

            # Compute predictions and accuracy
            predictions = torch.argmax(logits, dim=1)  # Class with highest score
            true_labels = torch.argmax(labels, dim=1)  # Convert one-hot to class indices
            
            correct += (predictions == true_labels).sum().item()
            total_samples += labels.size(0)

    # Calculate average loss and accuracy
    avg_loss = total_loss / len(dataloader)
    accuracy = correct / total_samples

    return {
        'loss': avg_loss,
        'accuracy': accuracy
    }

## Trainning function

In [34]:
# Training loop
def train_model(model, dataloader, valid_dataloader, optimizer, scheduler = None, num_epochs=5, device="cuda"):
    model = model.to(device)
    model.train()
    min_val_loss = float('inf') #checkpoint

    for epoch in range(num_epochs):
        total_loss = 0
        model.train()
        
        for batch in tqdm(dataloader, total=len(dataloader), unit='row'):
            optimizer.zero_grad()
            
            logits = model(
                input_ids_resp1=batch['input_ids_resp1'].to(device),
                attention_mask_resp1=batch['attention_mask_resp1'].to(device),
                input_ids_resp2=batch['input_ids_resp2'].to(device),
                attention_mask_resp2=batch['attention_mask_resp2'].to(device),
                features=batch['features'].to(device)
            )
            
            # One-hot labels
            labels = batch['label'].to(device)
            
        
            #loss = nn.BCEWithLogitsLoss()(logits, labels)
            loss = nn.CrossEntropyLoss()(logits, labels)
        
            # Use BCELoss for one-hot encoded labels
            #loss = nn.BCELoss()(logits, labels) #more stable, It combines a sigmoid activation and binary cross-entropy loss.
            loss.backward()
            optimizer.step()
            if scheduler is not None:
                scheduler.step()
            
            total_loss += loss.item()
            
        
        metrics = evaluate_model(model, valid_dataloader, device=device)
        
        if min_val_loss > metrics['loss']:
            min_val_loss = metrics['loss']
            torch.save({
                        'epoch': epoch,
                        'model_state_dict': model.state_dict(),
                        'optimizer_state_dict': optimizer.state_dict(),
                        }, f'PreferencePredictionModel.pt')
            print(f"{metrics['loss']} val loss is better than previous {min_val_loss}, saving checkpoint epoch: ", epoch)
            

        print(f"Trainning Epoch {epoch + 1}, Accumulated Train Loss: {total_loss / len(dataloader)}")
        print(f"Eval : Valid Loss: {metrics['loss']}, Valid Accuracy : {metrics['accuracy']}")
        for param_group in optimizer.param_groups:
            print(f"Current learning rate: {param_group['lr']}")


## Call

### Split

In [35]:
df_train, df_valid = train_test_split(df, test_size=0.1, random_state=42)

In [36]:
# Prepare dataset and dataloader
dataset_train = ChatbotArenaDataset(df_train, tokenizer)
dataloader_train = DataLoader(dataset_train, batch_size=BATCH_SIZE, shuffle=True)

dataset_valid = ChatbotArenaDataset(df_valid, tokenizer)
dataloader_valid = DataLoader(dataset_valid, batch_size=BATCH_SIZE, shuffle=True)

dataset_test = ChatbotArenaDataset(test_df, tokenizer, test=True)
dataloader_test = DataLoader(dataset_test, batch_size=BATCH_SIZE, shuffle=True)

In [37]:
# Initialize model, optimizer
model = PreferencePredictionModel(transformer_name=model_name, feature_dim=4, num_classes=3)

optimizer = optim.AdamW([
    {'params': model.transformer.parameters(), 'lr': 2e-6},     # Lower learning rate for transformer layers
    {'params': model.feature_fc.parameters(), 'lr': 1e-3},      # Higher learning rate for custom layers
], weight_decay=0.01)


In [38]:
from transformers import get_scheduler

num_training_steps = len(dataloader_train) * EPOCHS
num_warmup_steps = int(0.05 * num_training_steps)  # Warm up for 5% of total steps

lr_scheduler = get_scheduler(
    name="linear",  # Linear warm-up and decay
    optimizer=optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps
)

In [39]:
# Train and evaluate
train_model(model, dataloader_train, dataloader_valid, optimizer, lr_scheduler, device=device, num_epochs=EPOCHS)

100%|██████████| 517/517 [27:36<00:00,  3.20s/row]


5.098310983900366 val loss is better than previous 5.098310983900366, saving checkpoint epoch:  0
Trainning Epoch 1, Accumulated Train Loss: 7.886439109258734
Eval : Valid Loss: 5.098310983900366, Valid Accuracy : 0.27586206896551724
Current learning rate: 0.0
Current learning rate: 0.0


In [40]:
# load best epoch
checkpoint = torch.load(f'PreferencePredictionModel.pt')
model.load_state_dict(checkpoint['model_state_dict'])

<All keys matched successfully>

In [41]:
def predict(model, dataloader, device="cuda"):
    """
    Predict outcomes using a DataLoader for the test dataset.

    Args:
        model: Trained PyTorch model.
        dataloader: DataLoader for the test dataset.
        device: Device to perform inference ('cpu' or 'cuda').

    Returns:
        A list of predicted class labels for the entire test dataset.
    """
    model = model.to(device)
    model.eval()  # Set model to evaluation mode
    predictions = []

    with torch.no_grad():
        for batch in dataloader:
            # Move data to device
            input_ids_resp1 = batch['input_ids_resp1'].to(device)
            attention_mask_resp1 = batch['attention_mask_resp1'].to(device)
            input_ids_resp2 = batch['input_ids_resp2'].to(device)
            attention_mask_resp2 = batch['attention_mask_resp2'].to(device)
            features = batch['features'].to(device)

            # Forward pass through the model
            logits = model(
                input_ids_resp1=input_ids_resp1,
                attention_mask_resp1=attention_mask_resp1,
                input_ids_resp2=input_ids_resp2,
                attention_mask_resp2=attention_mask_resp2,
                features=features
            )

            # Convert logits to predicted class
            #batch_predictions = torch.argmax(logits, dim=1).cpu().tolist()
            #batch_predictions = logits.cpu().tolist()
            batch_probs = torch.softmax(logits, dim=1).cpu().tolist()
            predictions.extend(batch_probs)

    return predictions



In [42]:
prediction = predict(model, dataloader_test)

In [43]:
prediction

[[0.0004107772547286004, 0.9995891451835632, 5.099884958781331e-08],
 [0.22635045647621155, 0.42408058047294617, 0.3495689630508423],
 [0.272579163312912, 0.7171930074691772, 0.01022784411907196]]

In [44]:
sub_df = test_df[["id"]].copy()
sub_df[CFG.class_names] = prediction
sub_df.head()

Unnamed: 0,id,winner_model_a,winner_model_b,winner_tie
0,136060,0.000411,0.999589,5.099885e-08
1,211333,0.22635,0.424081,0.349569
2,1233961,0.272579,0.717193,0.01022784


In [45]:
sub_df.to_csv("submission.csv", index=False)

todo:
- check with previous pytorch project (monitoring)
- check diff between BCEWithLogitsLoss and BCELoss

## Future Directions

In this notebook, we've achieved a good score with a small model and modest token length. Because of the complexity of the task and data, it hard to rapidly iterate and test different stuff. Also 30h free GPU from kaggle is very nice, but other ressources like Collab (expensive) might be a solution for faster iteration.

There's plenty of room to improve. Here's how:

- Higher token length (1024 ?)
- Try bigger models like Gemma. I see a lot of good public score made with this model -> let's experiment
- Better data handling, maybe filter some data, augment from other similar competition ?
- some kind of grid search to find better parameters ?

I stopped trying to improve this result as soon as i found out there was a timed competition of the same type with almost the same parameters. I simply continued to make this code evolve for another competition:

https://www.kaggle.com/competitions/wsdm-cup-multilingual-chatbot-arena/overview