
In [1]:
!pip install kaggle



In [2]:
from google.colab import files
import json

# Upload kaggle.json (the Kaggle API token)
uploaded = files.upload()

# Confirm the upload without echoing the file contents,
# since kaggle.json holds the secret API key
for filename in uploaded.keys():
    print(f"Uploaded: {filename}")

Saving kaggle.json to kaggle.json
Uploaded: kaggle.json


In [3]:
!mkdir -p ~/.kaggle  # -p prevents error if dir exists
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json  # Restrict permissions

In [4]:
!kaggle competitions list  # Should list competitions (no 401 error)

ref                                                                              deadline             category                reward  teamCount  userHasEntered  
-------------------------------------------------------------------------------  -------------------  ---------------  -------------  ---------  --------------  
https://www.kaggle.com/competitions/arc-prize-2025                               2025-11-03 23:59:00  Featured         1,000,000 Usd        605           False  
https://www.kaggle.com/competitions/google-gemma-3n-hackathon                    2025-08-06 23:59:00  Featured           150,000 Usd          0           False  
https://www.kaggle.com/competitions/make-data-count-finding-data-references      2025-09-09 23:59:00  Research           100,000 Usd        436           False  
https://www.kaggle.com/competitions/cmi-detect-behavior-with-sensor-data         2025-09-02 23:59:00  Featured            50,000 Usd       1553           False  
https://www.kaggle.com/compe

In [5]:
!kaggle competitions download -c llm-classification-finetuning

Downloading llm-classification-finetuning.zip to /content
  0% 0.00/57.0M [00:00<?, ?B/s]
100% 57.0M/57.0M [00:00<00:00, 942MB/s]


In [6]:
!unzip llm-classification-finetuning.zip

Archive:  llm-classification-finetuning.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [7]:
# @title Step 1: Setup and Data Loading

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Embedding, GlobalAveragePooling1D, concatenate
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import re # For text cleaning
from nltk.corpus import stopwords
import nltk
import string

# Download NLTK stopwords (if not already downloaded)
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

# Load the datasets
try:
    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')
    sample_submission_df = pd.read_csv('sample_submission.csv')
    print("Data loaded successfully!")
except FileNotFoundError:
    print("Ensure train.csv, test.csv, and sample_submission.csv are in your Colab environment.")
    # You might need to upload the files or connect to Google Drive
    # from google.colab import files
    # uploaded = files.upload()
    # For Kaggle competitions, the data is usually available in the input directory
    # train_df = pd.read_csv('/kaggle/input/llm-classification-finetuning/train.csv')
    # test_df = pd.read_csv('/kaggle/input/llm-classification-finetuning/test.csv')
    # sample_submission_df = pd.read_csv('/kaggle/input/llm-classification-finetuning/sample_submission.csv')
    raise  # Stop here: the code below needs the DataFrames to exist

print("\nTrain data shape:", train_df.shape)
print("Test data shape:", test_df.shape)
print("\nTrain data head:")
print(train_df.head())
print("\nTest data head:")
print(test_df.head())
print("\nSample submission head:")
print(sample_submission_df.head())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Data loaded successfully!

Train data shape: (57477, 9)
Test data shape: (3, 4)

Train data head:
       id             model_a              model_b  \
0   30192  gpt-4-1106-preview           gpt-4-0613   
1   53567           koala-13b           gpt-4-0613   
2   65089  gpt-3.5-turbo-0613       mistral-medium   
3   96401    llama-2-13b-chat  mistral-7b-instruct   
4  198779           koala-13b   gpt-3.5-turbo-0314   

                                              prompt  \
0  ["Is it morally right to try to have a certain...   
1  ["What is the difference between marriage lice...   
2  ["explain function calling. how would you call...   
3  ["How can I create a test set for a very rare ...   
4  ["What is the best way to travel from Tel-Aviv...   

                                          response_a  \
0  ["The question of whether it is morally right ...   
1  ["A marriage license is a legal document that ...   
2  ["Function calling is the process of invoking ...   
3  ["Creating a 
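The head output above shows that `prompt`, `response_a`, and `response_b` start with `["`, which suggests each field is a JSON-encoded list of conversation turns. The sketch below is one possible way to flatten such a field into plain text before cleaning. The helper name `join_turns` is hypothetical, and the handling of `null` turns is an assumption about the data rather than something the notebook confirms.

```python
import json

def join_turns(raw):
    """Parse a JSON-encoded list of turns and join the non-null ones with spaces.

    Falls back to the raw string when the value is not valid JSON.
    """
    try:
        turns = json.loads(raw)
    except (TypeError, ValueError):
        return str(raw)
    if not isinstance(turns, list):
        return str(turns)
    return " ".join(str(turn) for turn in turns if turn is not None)

example = '["Is it morally right?", null, "And what about this?"]'
print(join_turns(example))
```

Applied with `train_df['prompt'].apply(join_turns)`, this would run before `clean_text`, so that brackets and quote characters from the JSON encoding are dropped by parsing rather than by punctuation stripping.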

In [8]:
# @title Step 2: Data Preprocessing and Feature Engineering

# Text Cleaning Function
def clean_text(text):
    text = text.lower() # Convert to lowercase
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text) # Remove punctuation
    # Optional: Remove stopwords
    # stop_words = set(stopwords.words('english'))
    # text = ' '.join([word for word in text.split() if word not in stop_words])
    text = re.sub(r'\s+', ' ', text).strip() # Remove extra whitespace
    return text

# Apply cleaning to prompt and responses
train_df['prompt_cleaned'] = train_df['prompt'].apply(clean_text)
train_df['response_a_cleaned'] = train_df['response_a'].apply(clean_text)
train_df['response_b_cleaned'] = train_df['response_b'].apply(clean_text)

test_df['prompt_cleaned'] = test_df['prompt'].apply(clean_text)
test_df['response_a_cleaned'] = test_df['response_a'].apply(clean_text)
test_df['response_b_cleaned'] = test_df['response_b'].apply(clean_text)

print("\nText cleaning applied.")
print("\nTrain data with cleaned text:")
print(train_df[['prompt_cleaned', 'response_a_cleaned', 'response_b_cleaned']].head())

# Tokenization and Padding
max_words = 20000 # Maximum number of words to keep based on word frequency
maxlen = 256 # Maximum length of sequences

tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")

# Fit tokenizer on combined text from train and test for better vocabulary coverage
all_text = pd.concat([
    train_df['prompt_cleaned'], train_df['response_a_cleaned'], train_df['response_b_cleaned'],
    test_df['prompt_cleaned'], test_df['response_a_cleaned'], test_df['response_b_cleaned']
])
tokenizer.fit_on_texts(all_text)

# Convert text to sequences
train_sequences_prompt = tokenizer.texts_to_sequences(train_df['prompt_cleaned'])
train_sequences_response_a = tokenizer.texts_to_sequences(train_df['response_a_cleaned'])
train_sequences_response_b = tokenizer.texts_to_sequences(train_df['response_b_cleaned'])

test_sequences_prompt = tokenizer.texts_to_sequences(test_df['prompt_cleaned'])
test_sequences_response_a = tokenizer.texts_to_sequences(test_df['response_a_cleaned'])
test_sequences_response_b = tokenizer.texts_to_sequences(test_df['response_b_cleaned'])

# Pad sequences
train_padded_prompt = pad_sequences(train_sequences_prompt, maxlen=maxlen, padding='post', truncating='post')
train_padded_response_a = pad_sequences(train_sequences_response_a, maxlen=maxlen, padding='post', truncating='post')
train_padded_response_b = pad_sequences(train_sequences_response_b, maxlen=maxlen, padding='post', truncating='post')

test_padded_prompt = pad_sequences(test_sequences_prompt, maxlen=maxlen, padding='post', truncating='post')
test_padded_response_a = pad_sequences(test_sequences_response_a, maxlen=maxlen, padding='post', truncating='post')
test_padded_response_b = pad_sequences(test_sequences_response_b, maxlen=maxlen, padding='post', truncating='post')

print("\nText tokenization and padding applied.")
print("\nExample padded sequence (prompt):")
print(train_padded_prompt[0])

# Prepare Target Variable
# We need to represent the winner as a one-hot encoded vector
# [1, 0, 0] for winner_model_a, [0, 1, 0] for winner_model_b, [0, 0, 1] for winner_tie
y_train = train_df[['winner_model_a', 'winner_model_b', 'winner_tie']].values

print("\nTarget variable prepared:")
print(y_train[:5])


Text cleaning applied.

Train data with cleaned text:
                                      prompt_cleaned  \
0  is it morally right to try to have a certain p...   
1  what is the difference between marriage licens...   
2  explain function calling how would you call a ...   
3  how can i create a test set for a very rare ca...   
4  what is the best way to travel from telaviv to...   

                                  response_a_cleaned  \
0  the question of whether it is morally right to...   
1  a marriage license is a legal document that al...   
2  function calling is the process of invoking or...   
3  creating a test set for a very rare category c...   
4  the best way to travel from tel aviv to jerusa...   

                                  response_b_cleaned  
0  as an ai i dont have personal beliefs or opini...  
1  a marriage license and a marriage certificate ...  
2  function calling is the process of invoking a ...  
3  when building a classifier for a very rare cat..

In [9]:
# @title Step 3: Model Definition

# Define the model architecture
embedding_dim = 64 # Dimension of the word embeddings

# Input layers
prompt_input = Input(shape=(maxlen,), name='prompt_input')
response_a_input = Input(shape=(maxlen,), name='response_a_input')
response_b_input = Input(shape=(maxlen,), name='response_b_input')

# Embedding layer (shared for all inputs)
embedding_layer = Embedding(input_dim=max_words, output_dim=embedding_dim)

# Embed the inputs
prompt_embedding = embedding_layer(prompt_input)
response_a_embedding = embedding_layer(response_a_input)
response_b_embedding = embedding_layer(response_b_input)

# Global Average Pooling to reduce dimensionality
prompt_pooled = GlobalAveragePooling1D()(prompt_embedding)
response_a_pooled = GlobalAveragePooling1D()(response_a_embedding)
response_b_pooled = GlobalAveragePooling1D()(response_b_embedding)

# Concatenate the pooled embeddings and features comparing responses
# We can add simple features here, like length difference, before concatenating
# For simplicity, we'll just concatenate the pooled embeddings for now.
# To add more features, calculate them here and concatenate with the pooled embeddings.
concatenated = concatenate([prompt_pooled, response_a_pooled, response_b_pooled])

# Dense layers for classification
dense_1 = Dense(128, activation='relu')(concatenated)
dense_2 = Dense(64, activation='relu')(dense_1)

# Output layer: 3 units for the three outcomes (A wins, B wins, Tie)
# Softmax activation to get probabilities that sum to 1
output_layer = Dense(3, activation='softmax', name='output_layer')(dense_2)

# Define the model
model = Model(inputs=[prompt_input, response_a_input, response_b_input], outputs=output_layer)

# Compile the model
# Use categorical_crossentropy as the loss function for multi-class classification
# Use Adam optimizer
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()
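The comments in Step 3 mention adding simple comparison features such as a length difference. A minimal sketch of what those features could look like is shown here; the function name `length_features` and the choice of word counts (rather than characters or tokens) are assumptions, not part of the original notebook.

```python
def length_features(response_a, response_b):
    """Word-count features comparing two responses."""
    len_a = len(response_a.split())
    len_b = len(response_b.split())
    total = len_a + len_b
    return {
        "len_a": len_a,
        "len_b": len_b,
        "len_diff": len_a - len_b,
        # Share of the total words that belong to response A (0.5 if both are empty)
        "len_ratio": len_a / total if total else 0.5,
    }

print(length_features("a marriage license is a legal document", "it is a form"))
```

Features like these would enter the model through an extra `Input` layer concatenated with the pooled embeddings.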

In [10]:
# @title Step 4: Model Training

# Prepare data for model input
X_train = {
    'prompt_input': train_padded_prompt,
    'response_a_input': train_padded_response_a,
    'response_b_input': train_padded_response_b
}

# Split training data for validation using indices
# (train_test_split was already imported in Step 1)

train_indices, val_indices = train_test_split(
    np.arange(train_df.shape[0]),
    test_size=0.2,
    random_state=42,
    stratify=y_train
)

X_train_split = {
    'prompt_input': train_padded_prompt[train_indices],
    'response_a_input': train_padded_response_a[train_indices],
    'response_b_input': train_padded_response_b[train_indices]
}

X_val_split = {
    'prompt_input': train_padded_prompt[val_indices],
    'response_a_input': train_padded_response_a[val_indices],
    'response_b_input': train_padded_response_b[val_indices]
}

y_train_split = y_train[train_indices]
y_val_split = y_train[val_indices]


print("\nTraining and validation data split.")

# Train the model
epochs = 5 # Number of training epochs
batch_size = 32 # Size of the mini-batches

history = model.fit(
    X_train_split,
    y_train_split,
    epochs=epochs,
    batch_size=batch_size,
    validation_data=(X_val_split, y_val_split)
)

print("\nModel training finished.")


Training and validation data split.
Epoch 1/5
1437/1437 ━━━━━━━━━━━━━━━━━━━━ 10s 5ms/step - accuracy: 0.4246 - loss: 1.0686 - val_accuracy: 0.4559 - val_loss: 1.0501
Epoch 2/5
1437/1437 ━━━━━━━━━━━━━━━━━━━━ 7s 4ms/step - accuracy: 0.4864 - loss: 1.0194 - val_accuracy: 0.4498 - val_loss: 1.0556
Epoch 3/5
1437/1437 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - accuracy: 0.5550 - loss: 0.9295 - val_accuracy: 0.4415 - val_loss: 1.1118
Epoch 4/5
1437/1437 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.6468 - loss: 0.7857 - val_accuracy: 0.4313 - val_loss: 1.2409
Epoch 5/5
1437/1437 ━━━━━━━━━━━━━━━━━━━━ 5s 4ms/step - accuracy: 0.7100 - loss: 0.6557 - val_accuracy: 0.4292 - val_loss: 1.3857

Model training finished.
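In the log above, training loss falls every epoch while `val_loss` rises after epoch 1, which is the usual sign of overfitting. The sketch below replays the logged `val_loss` values through a patience-based stopping rule to show where training would have halted. The function `early_stop_epoch` and the patience value of 2 are illustrative assumptions; in Keras the same idea is available as `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)` passed to `model.fit` via `callbacks`.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` epochs pass without a new best val_loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch

# val_loss values copied from the training log above
logged_val_loss = [1.0501, 1.0556, 1.1118, 1.2409, 1.3857]
stop_epoch, best_epoch = early_stop_epoch(logged_val_loss, patience=2)
print(stop_epoch, best_epoch)  # 3 1
```

With these values the rule stops after epoch 3 and keeps the epoch 1 weights, whose validation loss (1.0501) is the lowest in the run.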


In [11]:
# @title Step 5: Model Evaluation

# Evaluate the model on the validation set
print("\nEvaluating model on validation data...")
loss, accuracy = model.evaluate(X_val_split, y_val_split, verbose=0)

print(f"Validation Loss: {loss:.4f}")
print(f"Validation Accuracy: {accuracy:.4f}")

# Note: Log Loss is the primary competition metric.
# The reported validation loss from model.evaluate is the categorical_crossentropy, which is equivalent to Log Loss for this task.


Evaluating model on validation data...
Validation Loss: 1.3857
Validation Accuracy: 0.4292
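To put the validation loss in context, it helps to compute multi-class log loss by hand. The sketch below is a plain-Python version of the metric (the function name `multiclass_log_loss` and the clipping constant are assumptions for illustration). It shows that predicting 1/3 for every class scores ln 3 ≈ 1.0986, so a validation loss of 1.3857 is worse than a uniform guess, and that a confident wrong prediction is penalised heavily.

```python
import math

def multiclass_log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for target, probs in zip(y_true, y_pred):
        for t, p in zip(target, probs):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total -= t * math.log(p)
    return total / len(y_true)

# Uniform guess over [A wins, B wins, Tie]
uniform_loss = multiclass_log_loss([[1, 0, 0]], [[1 / 3, 1 / 3, 1 / 3]])
print(round(uniform_loss, 4))  # 1.0986

# Confident but wrong prediction (hypothetical label and probabilities)
wrong_loss = multiclass_log_loss([[1, 0, 0]], [[0.05, 0.90, 0.05]])
print(round(wrong_loss, 4))  # 2.9957
```

Because the metric punishes overconfidence this strongly, the growing gap between training and validation loss matters more here than the drop in validation accuracy.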


In [12]:
# @title Step 6: Make Predictions on Test Data

# Prepare test data for model input
X_test = {
    'prompt_input': test_padded_prompt,
    'response_a_input': test_padded_response_a,
    'response_b_input': test_padded_response_b
}

# Make predictions on the test data
print("\nMaking predictions on test data...")
test_predictions = model.predict(X_test)

print("\nTest predictions shape:", test_predictions.shape)
print("\nExample test predictions (probabilities for [A wins, B wins, Tie]):")
print(test_predictions[:5])


Making predictions on test data...
1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 500ms/step

Test predictions shape: (3, 3)

Example test predictions (probabilities for [A wins, B wins, Tie]):
[[0.27195528 0.30061084 0.42743385]
 [0.22379935 0.4377555  0.33844513]
 [0.0421933  0.9260752  0.03173143]]


# Task
Explain the `KeyError: "['winner_model_tie'] not in index"` in the provided Python code, fix the error, and incorporate the fix into the existing code. Then, based on the provided plan, fine-tune the model to achieve a validation log loss of less than 1 and improve validation accuracy. The plan involves implementing advanced text representation using pre-trained language models, adopting a more sophisticated model architecture, incorporating feature engineering, refining model training and evaluation with stratified K-fold cross-validation and log loss, performing targeted hyperparameter tuning, exploring handling ties, considering ensemble methods, and iteratively addressing bias.
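On the `KeyError`: pandas raises `"[...] not in index"` when a list of column labels passed to `df[[...]]` contains a label that the DataFrame does not have. The target selection in Step 2 uses `winner_model_a`, `winner_model_b`, and `winner_tie`, so a request for `winner_model_tie` names a column that does not exist; the fix is to select `winner_tie`. The sketch below reproduces the check with plain lists, and the variable names are illustrative only.

```python
# Target columns as selected in Step 2 of the notebook
available = ["winner_model_a", "winner_model_b", "winner_tie"]

# Column list that triggers the KeyError
requested = ["winner_model_a", "winner_model_b", "winner_model_tie"]

# Labels that pandas would report as "not in index"
missing = [col for col in requested if col not in available]
print(missing)  # ['winner_model_tie']

# Corrected selection: use the real column name
fixed = ["winner_tie" if col == "winner_model_tie" else col for col in requested]
print(fixed == available)  # True
```

Running the same membership check against `train_df.columns` before indexing is a cheap way to catch this kind of typo.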

## Implement advanced text representation

### Subtask:
Replace the simple Embedding and GlobalAveragePooling layers with embeddings from a pre-trained language model to capture richer semantic information.


**Reasoning**:
Install the `transformers` library to use pre-trained language models.



In [13]:
!pip install transformers



**Reasoning**:
Load a pre-trained BERT model and its tokenizer, then tokenize and obtain embeddings for the cleaned text data.



In [None]:
from transformers import BertModel, BertTokenizer
import torch

# Choose a pre-trained model (e.g., 'bert-base-uncased')
model_name = 'bert-base-uncased'
# Named bert_tokenizer so it does not shadow the Keras tokenizer from Step 2
bert_tokenizer = BertTokenizer.from_pretrained(model_name)
# Use the GPU when one is available; CPU inference over the full training set is very slow
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
bert_model = BertModel.from_pretrained(model_name).to(device)

# Define a function to get BERT embeddings, processing texts in batches
def get_bert_embeddings(texts, tokenizer, model, max_len=256, batch_size=64):
    model.eval() # Set model to evaluation mode
    embeddings = []
    with torch.no_grad():
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            encoded_input = tokenizer(batch, return_tensors='pt', max_length=max_len, padding=True, truncation=True)
            encoded_input = {k: v.to(model.device) for k, v in encoded_input.items()}
            output = model(**encoded_input)
            # pooler_output is the [CLS] hidden state passed through BERT's pooling layer
            embeddings.append(output.pooler_output.cpu().numpy())
    return np.concatenate(embeddings, axis=0)

# Get embeddings for train and test data
print("Getting BERT embeddings for training data...")
train_prompt_embeddings = get_bert_embeddings(train_df['prompt_cleaned'].tolist(), bert_tokenizer, bert_model, maxlen)
train_response_a_embeddings = get_bert_embeddings(train_df['response_a_cleaned'].tolist(), bert_tokenizer, bert_model, maxlen)
train_response_b_embeddings = get_bert_embeddings(train_df['response_b_cleaned'].tolist(), bert_tokenizer, bert_model, maxlen)

print("Getting BERT embeddings for test data...")
test_prompt_embeddings = get_bert_embeddings(test_df['prompt_cleaned'].tolist(), bert_tokenizer, bert_model, maxlen)
test_response_a_embeddings = get_bert_embeddings(test_df['response_a_cleaned'].tolist(), bert_tokenizer, bert_model, maxlen)
test_response_b_embeddings = get_bert_embeddings(test_df['response_b_cleaned'].tolist(), bert_tokenizer, bert_model, maxlen)

print("\nBERT embeddings obtained.")
print("Train prompt embeddings shape:", train_prompt_embeddings.shape)
print("Test prompt embeddings shape:", test_prompt_embeddings.shape)


Getting BERT embeddings for training data...
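The output stops at this message, which gives a sense of how expensive the embedding step is: the training set has 57,477 rows and three text columns, so 172,431 texts have to pass through BERT. The sketch below counts forward passes for two batch sizes and includes a small batching helper. The helper name `batched` and the batch size of 64 are assumptions chosen for illustration.

```python
import math

def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

n_rows, n_fields = 57477, 3  # train rows, and prompt / response_a / response_b

for batch_size in (1, 64):
    passes = n_fields * math.ceil(n_rows / batch_size)
    print(batch_size, passes)
# 1 172431
# 64 2697

print(list(batched(["p1", "p2", "p3", "p4", "p5"], 2)))
```

Fewer, larger forward passes on a GPU are usually far faster than one pass per text, and saving the resulting arrays to disk avoids recomputing them after a runtime restart.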


**Reasoning**:
Update the model definition to use the pre-trained BERT embeddings instead of the simple Embedding and GlobalAveragePooling layers.



In [None]:
# @title Step 3: Model Definition (Updated with BERT Embeddings)

# Define the model architecture using pre-computed BERT embeddings
# The input to the model will now be the BERT embeddings themselves.
bert_embedding_dim = train_prompt_embeddings.shape[1] # Get the embedding dimension from the computed embeddings

# Input layers for BERT embeddings
prompt_input = Input(shape=(bert_embedding_dim,), name='prompt_input_bert')
response_a_input = Input(shape=(bert_embedding_dim,), name='response_a_input_bert')
response_b_input = Input(shape=(bert_embedding_dim,), name='response_b_input_bert')

# Concatenate the BERT embeddings
concatenated = concatenate([prompt_input, response_a_input, response_b_input])

# Dense layers for classification (can be the same as before or adjusted)
dense_1 = Dense(128, activation='relu')(concatenated)
dense_2 = Dense(64, activation='relu')(dense_1)

# Output layer: 3 units for the three outcomes (A wins, B wins, Tie)
# Softmax activation to get probabilities that sum to 1
output_layer = Dense(3, activation='softmax', name='output_layer')(dense_2)

# Define the model
model = Model(inputs=[prompt_input, response_a_input, response_b_input], outputs=output_layer)

# Compile the model
# Use categorical_crossentropy as the loss function for multi-class classification
# Use Adam optimizer
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()
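Plain concatenation leaves it to the dense layers to learn any comparison between the two responses. One common option, which is not part of the original notebook, is to add explicit comparison features such as the element-wise difference and product of the two response embeddings. The sketch below shows the arithmetic on small lists; the function name `comparison_features` is hypothetical.

```python
def comparison_features(emb_a, emb_b):
    """Element-wise difference followed by element-wise product of two vectors."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same length")
    diff = [a - b for a, b in zip(emb_a, emb_b)]
    prod = [a * b for a, b in zip(emb_a, emb_b)]
    return diff + prod

print(comparison_features([1.0, 2.0], [0.5, 4.0]))  # [0.5, -2.0, 0.5, 8.0]
```

Inside the Keras model, the `Subtract` and `Multiply` layers from `tensorflow.keras.layers` could compute the same quantities on `response_a_input` and `response_b_input` before the `concatenate` call.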

**Reasoning**:
Prepare the training and validation data using the pre-computed BERT embeddings and train the updated model.



In [None]:
# @title Step 4: Model Training (with BERT Embeddings)

# Prepare data for model input using BERT embeddings
X_train_bert = {
    'prompt_input_bert': train_prompt_embeddings,
    'response_a_input_bert': train_response_a_embeddings,
    'response_b_input_bert': train_response_b_embeddings
}

# Split training data for validation using indices (same split as before)
# train_indices and val_indices are already defined from the previous split

X_train_split_bert = {
    'prompt_input_bert': train_prompt_embeddings[train_indices],
    'response_a_input_bert': train_response_a_embeddings[train_indices],
    'response_b_input_bert': train_response_b_embeddings[train_indices]
}

X_val_split_bert = {
    'prompt_input_bert': train_prompt_embeddings[val_indices],
    'response_a_input_bert': train_response_a_embeddings[val_indices],
    'response_b_input_bert': train_response_b_embeddings[val_indices]
}

# y_train_split and y_val_split are already defined

print("\nTraining and validation data prepared with BERT embeddings.")

# Train the model
epochs = 5 # Number of training epochs
batch_size = 32 # Size of the mini-batches

history = model.fit(
    X_train_split_bert,
    y_train_split,
    epochs=epochs,
    batch_size=batch_size,
    validation_data=(X_val_split_bert, y_val_split)
)

print("\nModel training finished using BERT embeddings.")

**Reasoning**:
The error `NameError: name 'train_prompt_embeddings' is not defined` means the variables holding the BERT embeddings were never created. The output of the embedding cell ends at "Getting BERT embeddings for training data...", so that cell most likely did not run to completion before the later cells were executed. The embedding cell has to finish successfully before the model is defined and trained on its outputs.
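A small guard at the top of the training cell can turn this kind of `NameError` into a clearer message. The sketch below checks a namespace for the variables the cell depends on; the helper name `missing_names` is hypothetical, and in a notebook the namespace argument would be `globals()`.

```python
def missing_names(required, namespace):
    """Return the required variable names that are absent from the namespace."""
    return [name for name in required if name not in namespace]

required = [
    "train_prompt_embeddings",
    "train_response_a_embeddings",
    "train_response_b_embeddings",
]

# Stand-in for globals() in a session where the embedding cell did not finish
session = {"train_df": object(), "y_train": object()}
print(missing_names(required, session))
```

If the returned list is non-empty, the embedding cell should be rerun (or the saved arrays reloaded) before calling `model.fit`.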

