This notebook tests how current open-source, state-of-the-art LLMs perform on the multilingual-chatbot-arena dataset.

The experiment's constraints are as follows:

1. Inference runs on an NVIDIA GeForce RTX 4060, so GPU compute is limited. Keep in mind that the current pretrained models will most likely not perform very well on this challenge's data. The benchmarked LLMs must have around 7-9B parameters for the hardware to handle inference and fine-tuning.
2. For fine-tuning, the best course of action is QLoRA, due to the hardware constraints.
3. There are many capable open-source LLMs. In this demo we benchmark three model families: Qwen2.5, Llama 3.X and Gemini.
4. Dataset for the experiment: the training set.
5. Performance metric: accuracy (the proportion of prompts in the whole dataset whose answer is predicted correctly).
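
To make the hardware constraint in point 1 concrete, the following back-of-the-envelope sketch estimates how much memory the weights alone need at different precisions. The 8 GiB VRAM figure is an assumption (it is not stated in this notebook), the helper name is hypothetical, and the estimate ignores activations, the KV cache and quantization overhead.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the model weights, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

ASSUMED_VRAM_GIB = 8  # assumption: memory available on the GPU

for n_params in (7e9, 9e9):
    for label, bits in (("fp16", 16), ("int4", 4)):
        gib = weight_memory_gib(n_params, bits)
        verdict = "fits" if gib < ASSUMED_VRAM_GIB else "does not fit"
        print(f"{n_params / 1e9:.0f}B @ {label}: {gib:.1f} GiB ({verdict} in {ASSUMED_VRAM_GIB} GiB)")
```

Under these assumptions, only 4-bit weights leave room on the card for a 7-9B model, which is why point 2 settles on a quantized fine-tuning method.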

# Batch processing workloads

## Loading challenge's training data from Comet ML

In [1]:
import sys
import pathlib
root_repo_directory = str(pathlib.Path().resolve().parent)  #parent of the notebook's working directory
sys.path.append(root_repo_directory)
from multilingual_chatbot_arena import initialize
import datasets_creator.src.constants as c
import datasets_creator.src.utils as utils
import pandas as pd
from fire import Fire
from pydantic import BaseModel
from typing import List,Optional,Dict,Union
import pathlib
import numpy as np
import pickle
from dataclasses import dataclass
import re

import os
import opik
from loguru import logger
initialize()

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from transformers.tokenization_utils import PreTrainedTokenizer
from transformers.tokenization_utils_fast import PreTrainedTokenizerFast

from sklearn.metrics import accuracy_score
from tqdm import tqdm

2025-01-14 10:39:00.703 | INFO     | multilingual_chatbot_arena:initialize:13 - Initializing env vars...
2025-01-14 10:39:00.704 | INFO     | multilingual_chatbot_arena:initialize:18 - Loading environment variables from: /home/kevinmg96/Kaggle competitions/WSDM Cup/multilingual-chatbot-arena/.env


In [2]:
#getting challenge's train dataset
client = opik.Opik(workspace=os.environ['COMET_WORKSPACE'],api_key=os.environ['COMET_API_KEY'])
dataset_comet = client.get_or_create_dataset("multilingual-chatbot-arena-train-1")
data = dataset_comet.to_pandas()

In [3]:
data.head()

Unnamed: 0,answer,language,prompt,id
0,Best model is model_b based on its human prefe...,German,\nYou are an expert in assesing LLM's model re...,01945b29-7e5e-7b26-a627-7f8d4db7f396
1,Best model is model_b based on its human prefe...,Russian,\nYou are an expert in assesing LLM's model re...,01945b29-7e5d-772c-998b-5f8a8413153c
2,Best model is model_b based on its human prefe...,Czech,\nYou are an expert in assesing LLM's model re...,01945b29-7e5c-71e1-9d4e-a9a503529dcd
3,Best model is model_b based on its human prefe...,English,\nYou are an expert in assesing LLM's model re...,01945b29-7e5b-7b26-9eee-98c6dfc31878
4,Best model is model_a based on its human prefe...,English,\nYou are an expert in assesing LLM's model re...,01945b29-7e5a-75b4-8ec8-7c026e3e0afe


Remove the system message from the prompt column

In [3]:
system_substring = """\nYou are an expert in assesing LLM's model response based on a prompt. I will give you an input prompt (**prompt**) with two different responses coming from fellow LLM models; the first model's response is called **response_a** and second model's response is **response_b**. You can find the previous information after the double slashes (//), respecting the correct title based on the proper input.Your task is to assess the content of each response based on its quality and human's language similarity, then choose the model's response which adheres best to the given guidelines.\nYour response must obey the following format: "Best model is model_[] based on its human preferability response for the input prompt.". You will substitute "[]" with either "a" if you think **response_a** is better than **response_b**, or "b" otherwise."""

In [4]:
def del_system_sentence(x):
    prompt = x.prompt

    return prompt.split(system_substring)[-1]

data["prompt"] = data.apply(del_system_sentence,axis=1)
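
The `split(...)[-1]` idiom keeps whatever follows the last occurrence of the system message and leaves the prompt untouched when the message is absent. A minimal sketch of that behavior, using a made-up, shortened system message (`SYSTEM`) and a hypothetical helper name instead of the notebook's `system_substring`:

```python
# Hypothetical, shortened system message; the notebook uses the full `system_substring`.
SYSTEM = "You are an expert judge."

def strip_system(prompt: str, system: str = SYSTEM) -> str:
    """Return the text after the last occurrence of `system` (the whole prompt if absent)."""
    return prompt.split(system)[-1]

with_system = SYSTEM + "\n\n//\n**prompt**:\nHello"
without_system = "\n\n//\n**prompt**:\nHello"

print(repr(strip_system(with_system)))
print(repr(strip_system(without_system)))
```

One consequence worth knowing: if a user prompt happens to quote the system message again, everything before that second occurrence is dropped as well.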

In [6]:
data.head()

Unnamed: 0,answer,language,prompt,id
0,Best model is model_b based on its human prefe...,German,\n\n//\n**prompt**:\nSchreibe bitte ein Anschr...,01945b29-7e5e-7b26-a627-7f8d4db7f396
1,Best model is model_b based on its human prefe...,Russian,\n\n//\n**prompt**:\nСделай рерайт предложения...,01945b29-7e5d-772c-998b-5f8a8413153c
2,Best model is model_b based on its human prefe...,Czech,\n\n//\n**prompt**:\nVytvoř 4x2 tabulku. V prv...,01945b29-7e5c-71e1-9d4e-a9a503529dcd
3,Best model is model_b based on its human prefe...,English,\n\n//\n**prompt**:\nПереведи текст на русския...,01945b29-7e5b-7b26-9eee-98c6dfc31878
4,Best model is model_a based on its human prefe...,English,\n\n//\n**prompt**:\n# Setting\n\nAll eyes are...,01945b29-7e5a-75b4-8ec8-7c026e3e0afe


## Creating custom dataset for the imported data.

In [5]:
class ChatbotDataset(Dataset):
    def __init__(self,data : pd.DataFrame):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        #get either a single data point or a pandas Dataframe window of data points
        data_window = self.data.iloc[idx]    

        return data_window.to_dict()
    
train_dataset = ChatbotDataset(data)

        

Testing **ChatbotDataset**

In [None]:
train_dataset[1]

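Find the dataset's maximum token length (deprecated: this cell expects a tokenized `inputs` entry in every item, which `ChatbotDataset` no longer returns)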

In [54]:
data_point_max_len = 0
for dic in train_dataset:
    if dic["inputs"]["input_ids"].squeeze().shape[0] > data_point_max_len:
        data_point_max_len =  dic["inputs"]["input_ids"].squeeze().shape[0]

print(f"Data point's maximum length after tokenization : {data_point_max_len}")

## Setting Custom Dataloader

In [6]:


class ChatbotDataloader(DataLoader):
    def __init__(self, tokenizer :  PreTrainedTokenizer | PreTrainedTokenizerFast, **kwargs):
        self.tokenizer = tokenizer
        
        kwargs["collate_fn"] = self.chatbot_collate
        super().__init__(**kwargs)

    
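    def chatbot_collate(self,batch):
        """Custom collate function that teaches the DataLoader how to turn a batch into an LLM-friendly format.

        Args:
            batch : List of batch elements with len -> batch_size. Each element strictly follows
            the format returned by __getitem__ in ChatbotDataset (keys used here: "prompt", "answer", "language").

        Returns:
            Dict with the tokenized prompts ("inputs"), the raw answers ("labels"), the languages
            ("languages") and the padded prompt length ("longest_seq").
        """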
        prompts = []
        answers = []
        languages = []
        for dic in batch:
            prompt_messages = [
                {"role": "system", "content": system_substring},
                {"role" : "user", "content" : dic["prompt"]}
            ]

            prompt_text  = self.tokenizer.apply_chat_template(
                prompt_messages,
                tokenize=False,
                add_generation_prompt=True,
            )

            """ answer_messages = [
                {"role" : "user", "content" : dic["answer"]}
            ]

            answer_text  = self.tokenizer.apply_chat_template(
                answer_messages,
                tokenize=False,
                add_generation_prompt=True,
            ) """


            
            prompts.append(prompt_text)
            #answers.append(answer_text)
            answers.append(dic["answer"])
            languages.append(dic["language"])

        #tokenize batch of prompts and answers
        prompt_tokenize = self.tokenizer(prompts,
                padding='longest',truncation=True,return_tensors="pt")

        """ answer_tokenize = self.tokenizer(answers,
                padding='longest',truncation=True,return_tensors="pt") """

        return {
            "inputs" : prompt_tokenize, #Dict[str,torch.Tensor]
            "labels" : answers, #List[str]  ##answer_tokenize, #Dict[str,torch.Tensor],
            "languages" : languages, #List[str]
            "longest_seq" : prompt_tokenize["input_ids"].shape[1] #int
        }
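
For intuition about what `apply_chat_template` returns with `tokenize=False` and `add_generation_prompt=True`, here is a hand-written sketch of a ChatML-style rendering, the layout that Qwen2.5 tokenizers use. The function `render_chatml` is hypothetical and only approximates the real template, which ships with each tokenizer and differs between model families.

```python
from typing import Dict, List

def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Hypothetical stand-in for a ChatML-style chat template (not the library's code)."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so that generation continues from here.
        text += "<|im_start|>assistant\n"
    return text

messages = [
    {"role": "system", "content": "You are a judge."},
    {"role": "user", "content": "Which response is better?"},
]
prompt_text = render_chatml(messages)
print(prompt_text)
```

Because every prompt ends with the same assistant-turn opener, left-padded batches share an identical suffix. This is consistent with the dataloader test output further down, where all rows end with the same three token ids.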

# Model Inference Setup

This section sets up the inference pipeline for each of the benchmarked models.

In [7]:
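def custom_accurracy_metric(predictions,labels):
    """
    Computes the fraction of predictions that are equal to their label.

    Returns:
        Tuple with the accuracy in [0, 1] and the idxs of the incorrectly predicted records.
    """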
    unmatched_idxs = [] #incorrectly predicted records idxs
    accurracy = 0
    for i,(pred,lab) in enumerate(zip(predictions,labels)):
        if pred == lab:
            accurracy += 1
        else:
            unmatched_idxs.append(i)

    return accurracy / len(predictions), unmatched_idxs


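def get_model_winner(matches) -> str:
    """
    Extract which model's response is better from the regex captures in matches.
    Returns 'a' or 'b', or 'c' when no capture names a winner. Any 'a' character in a
    capture counts as a vote for model a, so the check relies on the expected answer format.
    """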
    for match in matches:
        if 'a' in match:
            return 'a'
        elif 'b' in match:
            return 'b'
    return 'c'

def parse_output_llm(responses) -> List[str]:
    """
    Retrieves a list specifying, for each response, which of the two paired models ('a' or 'b')
    was judged closest to human preference; 'c' marks a response without a recognizable verdict.

    Args:
        responses List[str]: Batch of LLM's responses.
    """
    pattern = r'Best model(.*?)based on its human preferability response'

    model_winner_in_responses = []
    for response in responses:
        #Extract pattern from response
        matches = re.findall(pattern, response,re.DOTALL)        
        model_winner_in_responses.append(get_model_winner(matches))
    
    return model_winner_in_responses
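
The parsing and scoring steps above can be exercised end to end without a model. The sketch below condenses them into two illustrative helpers (`extract_winner` and `accuracy` are hypothetical names, not the notebook's functions) and runs them on made-up responses and labels.

```python
import re
from typing import List, Tuple

PATTERN = r"Best model(.*?)based on its human preferability response"

def extract_winner(response: str) -> str:
    """Return 'a' or 'b' from the verdict sentence, or 'c' if no verdict is found."""
    for capture in re.findall(PATTERN, response, re.DOTALL):
        if "a" in capture:
            return "a"
        if "b" in capture:
            return "b"
    return "c"

def accuracy(predictions: List[str], labels: List[str]) -> Tuple[float, List[int]]:
    """Fraction of exact matches plus the indices of the mismatches."""
    wrong = [i for i, (pred, lab) in enumerate(zip(predictions, labels)) if pred != lab]
    return (len(predictions) - len(wrong)) / len(predictions), wrong

# Made-up model outputs: two well-formed verdicts and one off-format answer.
responses = [
    "Best model is model_a based on its human preferability response for the input prompt.",
    "Best model is model_b based on its human preferability response for the input prompt.",
    "I cannot decide.",
]
labels = ["a", "a", "b"]  # made-up ground truth

predictions = [extract_winner(r) for r in responses]
score, wrong = accuracy(predictions, labels)
print(predictions, score, wrong)
```

An off-format answer is mapped to `'c'`, which never equals a label, so responses that ignore the requested format are counted as errors rather than skipped.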




In [8]:
@torch.no_grad()
def model_inference(model,train_dataloader) -> tuple[List[str], List[str]]:
    """
    Retrieves two lists: the first holds the LLM's decision for each record (which response it judged
    more human-preferable), the second holds the challenge's ground truth.

    Args:
        model : HuggingFace Pretrained LLM.
        train_dataloader : ChatbotDataloader yielding tokenized prompts and raw answers.
    """
    global_output_winners = []
    global_label_winners = []
    for batch in tqdm(train_dataloader,desc="Training set - Model Inference"):
        # Let's send current batch into device: 'auto'
        inputs,labels = batch["inputs"].to(model.device),batch["labels"]
        
        #forward batch of input tokens into the model, get output token ids
        output_token_ids  = model.generate(
            **inputs,
            max_new_tokens=512,
        )

        output_token_ids = output_token_ids.detach().cpu()

        #Remove prompt from generated response
        output_token_ids = [output_token_ids[i,batch["longest_seq"]:]  for i in range(
            output_token_ids.shape[0])]

        #Decode batch's output
        batch_output_decoded = train_dataloader.tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)

        #Parse batch's decoded responses to extract, for each model-pair in input, which one is predicted to 
        #have better response
        batch_output_winners = parse_output_llm(batch_output_decoded)

        #Decode batch's labels
        #batch_label_decoded = train_dataloader.tokenizer.batch_decode(labels["input_ids"], skip_special_tokens=True)
        batch_label_winners = parse_output_llm(labels)#parse_output_llm(batch_label_decoded)

        #store batches into dataset
        global_output_winners.extend(batch_output_winners)
        global_label_winners.extend(batch_label_winners)

        #clear GPU cache
        torch.cuda.empty_cache()

    return global_output_winners,global_label_winners
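
The prompt-removal step in `model_inference` relies on left padding. A toy sketch with made-up token ids and a hypothetical `left_pad` helper (no tokenizer or model involved) shows why slicing at `longest_seq` leaves only the newly generated tokens:

```python
PAD = 0  # hypothetical pad token id

def left_pad(sequences, pad_id=PAD):
    """Left-pad token-id lists to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    padded = [[pad_id] * (longest - len(seq)) + seq for seq in sequences]
    return padded, longest

prompts = [[11, 12, 13], [21]]  # made-up token ids
padded, longest_seq = left_pad(prompts)

# Pretend generation: each output row is the padded prompt followed by new tokens.
new_tokens = [[101, 102], [201, 202]]
generated = [row + new for row, new in zip(padded, new_tokens)]

# Dropping the first `longest_seq` positions leaves only the new tokens.
completions = [row[longest_seq:] for row in generated]
print(completions)
```

A single slice index works for the whole batch only because left padding makes every prompt end at the same position, which is what `padding_side="left"` provides when the tokenizer is created.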


## Config Inference Arguments

In [9]:
@dataclass
class InferenceArgs:
    model_name : str
    batch_size : int

## Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int4

In [10]:
config = InferenceArgs(
    model_name="Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int4",
    batch_size=8
)

In [11]:
tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(config.model_name,
                                                                                         padding_side="left")

In [12]:
print(f"Model : {config.model_name} max context length : {tokenizer.model_max_length}")

Model : Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int4 max context length : 131072


### Dataloader

In [12]:
arguments = {
    "dataset" :train_dataset,
    "batch_size" : config.batch_size
}

qwen_2_5_3b_instruct_4int_dataloader = ChatbotDataloader(
    tokenizer,**arguments
)

Testing dataloader

In [63]:
iterator = iter(qwen_2_5_3b_instruct_4int_dataloader)
dic = next(iterator)
print(dic["inputs"]["input_ids"])
print(dic["inputs"]["input_ids"].shape)
print(dic["labels"])
print(dic["languages"])

tensor([[151643, 151643, 151643,  ..., 151644,  77091,    198],
        [151643, 151643, 151643,  ..., 151644,  77091,    198],
        [151643, 151643, 151643,  ..., 151644,  77091,    198],
        [151644,   8948,    271,  ..., 151644,  77091,    198]])
torch.Size([4, 3746])
['Best model is model_b based on its human preferability response for the input prompt.', 'Best model is model_b based on its human preferability response for the input prompt.', 'Best model is model_b based on its human preferability response for the input prompt.', 'Best model is model_b based on its human preferability response for the input prompt.']
['German', 'Russian', 'Czech', 'English']


### Loading model into VRAM

In [13]:
model = AutoModelForCausalLM.from_pretrained(
    config.model_name, 
    device_map="auto"
)
model.eval()

CUDA extension not installed.
CUDA extension not installed.
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 1536)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (k_proj): QuantLinear()
          (o_proj): QuantLinear()
          (q_proj): QuantLinear()
          (v_proj): QuantLinear()
        )
        (mlp): Qwen2MLP(
          (act_fn): SiLU()
          (down_proj): QuantLinear()
          (gate_proj): QuantLinear()
          (up_proj): QuantLinear()
        )
        (input_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((1536,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((1536,), eps=1e-06)
    (rotary_emb): Qwen2RotaryEmbedding()
  )
  (lm_head): Linear(in_features=1536, out_features=151936, bias=False)
)

Test model's text generation forward pass

In [13]:
with torch.no_grad():
    my_batch = next(iter(qwen_2_5_3b_instruct_4int_dataloader))
    my_batch_input = my_batch["inputs"].to("cuda")

    #logits = model(**my_batch_input).logits

    output_ids  = model.generate(
        **my_batch_input,
        max_new_tokens=512,
    )

    output_ids = output_ids.detach().cpu()

    response = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    print(parse_output_llm(response))

    #labels are already raw strings (see chatbot_collate), so they need no decoding
    print(parse_output_llm(my_batch["labels"]))



['a', 'a', 'a', 'a']
['b', 'b', 'b', 'b']


### Running the dataset through the model: Qwen2.5-1.5B-Int4

In [14]:
output_winners, label_winners = model_inference(model,qwen_2_5_3b_instruct_4int_dataloader)

Training set - Model Inference:   0%|          | 15/5147 [02:24<9:19:36,  6.54s/it] 
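
The progress bar above implies a long run. As a rough check, the remaining time can be extrapolated linearly from its figures; the helper name below is hypothetical, and the estimate assumes the per-batch time stays constant, which it may not since batch lengths vary.

```python
def remaining_hours(total_batches: int, done_batches: int, seconds_per_batch: float) -> float:
    """Linear extrapolation of the time left, in hours."""
    return (total_batches - done_batches) * seconds_per_batch / 3600

# Figures read off the progress bar: 15 of 5147 batches done at 6.54 s per batch.
hours_left = remaining_hours(total_batches=5147, done_batches=15, seconds_per_batch=6.54)
print(f"about {hours_left:.1f} hours left")
```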

In [None]:
custom_accurracy_metric(output_winners,label_winners)

## meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8

In [None]:
config = InferenceArgs(
    model_name="meta-llama/Llama-3.2-1B-Instruct-SpinQuant_INT4_EO8",
    batch_size=8
)

tokenizer: PreTrainedTokenizer | PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(config.model_name,
                                                                                         padding_side="left")

In [None]:
print(f"Model : {config.model_name} max context length : {tokenizer.model_max_length}")

### Dataloader

In [None]:
arguments = {
    "dataset" :train_dataset,
    "batch_size" : config.batch_size
}

llama_3_2_1b_spinquant_4int_dataloader = ChatbotDataloader(
    tokenizer,**arguments
)

### Loading model into VRAM

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    config.model_name, 
    device_map="auto"
)
model.eval()

In [None]:
with torch.no_grad():
    my_batch = next(iter(llama_3_2_1b_spinquant_4int_dataloader))
    my_batch_input = my_batch["inputs"].to("cuda")

    #logits = model(**my_batch_input).logits

    output_ids  = model.generate(
        **my_batch_input,
        max_new_tokens=512,
    )

    output_ids = output_ids.detach().cpu()

    response = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    print(parse_output_llm(response))

    #labels are already raw strings (see chatbot_collate), so they need no decoding
    print(parse_output_llm(my_batch["labels"]))

