# LLM Classification Finetuning
By Josh Houlding

<b>Competition Page:</b> [https://www.kaggle.com/competitions/llm-classification-finetuning/overview](https://www.kaggle.com/competitions/llm-classification-finetuning/overview)

# Loading the data

In [90]:
import pandas as pd

# Load data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
sample_submission = pd.read_csv("sample_submission.csv")

In [91]:
# Show sample entries
print(f"Training set shape: {train.shape}")
train.head()

Training set shape: (57477, 9)


| | id | model_a | model_b | prompt | response_a | response_b | winner_model_a | winner_model_b | winner_tie |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 30192 | gpt-4-1106-preview | gpt-4-0613 | ["Is it morally right to try to have a certain... | ["The question of whether it is morally right ... | ["As an AI, I don't have personal beliefs or o... | 1 | 0 | 0 |
| 1 | 53567 | koala-13b | gpt-4-0613 | ["What is the difference between marriage lice... | ["A marriage license is a legal document that ... | ["A marriage license and a marriage certificat... | 0 | 1 | 0 |
| 2 | 65089 | gpt-3.5-turbo-0613 | mistral-medium | ["explain function calling. how would you call... | ["Function calling is the process of invoking ... | ["Function calling is the process of invoking ... | 0 | 0 | 1 |
| 3 | 96401 | llama-2-13b-chat | mistral-7b-instruct | ["How can I create a test set for a very rare ... | ["Creating a test set for a very rare category... | ["When building a classifier for a very rare c... | 1 | 0 | 0 |
| 4 | 198779 | koala-13b | gpt-3.5-turbo-0314 | ["What is the best way to travel from Tel-Aviv... | ["The best way to travel from Tel Aviv to Jeru... | ["The best way to travel from Tel-Aviv to Jeru... | 0 | 1 | 0 |
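
The three `winner_*` columns one-hot encode the outcome of each comparison. Many classifiers expect a single target column instead, so the following is a minimal sketch of collapsing them into one label. The `toy` frame and the `winner` column name are illustrative assumptions, not part of the loaded data.

```python
import pandas as pd

# Toy frame mirroring the three one-hot winner columns (hypothetical rows)
toy = pd.DataFrame({
    "winner_model_a": [1, 0, 0],
    "winner_model_b": [0, 1, 0],
    "winner_tie":     [0, 0, 1],
})

label_cols = ["winner_model_a", "winner_model_b", "winner_tie"]

# idxmax(axis=1) returns, for each row, the name of the column holding the 1
toy["winner"] = toy[label_cols].idxmax(axis=1)
print(toy["winner"].tolist())
```

The same two lines should work on `train` directly, assuming exactly one of the three columns is set in every row.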


# Cleaning the data

### Viewing a sample

In [92]:
# Function to show the prompt and responses for a given instance
def show_prompt_and_responses(index):
    prompt = train.iloc[index]["prompt"]
    response_a = train.iloc[index]["response_a"]
    response_b = train.iloc[index]["response_b"]
    return "Prompt: " + prompt + "\n\n Model A response: " + response_a + "\n\n Model B response: " + response_b

# Show example
print(show_prompt_and_responses(2))

Prompt: ["explain function calling. how would you call a function?"]

 Model A response: ["Function calling is the process of invoking or executing a function in a programming language. To call a function, you need to use the function's name followed by parentheses \"()\". Within the parentheses, you can pass arguments if the function requires any.\n\nHere is an example of how to call a function in Python:\n\n```python\n# Define a function\ndef greet(name):\n    print(\"Hello, \" + name + \"!\")\n\n# Call the function\ngreet(\"John\")\n```\n\nIn this example, the function \"greet\" is defined with a single parameter \"name\" which is used to print a greeting message. To call the function and pass an argument, we write \"greet(\"John\")\". This will execute the function and print \"Hello, John!\".\n\nIt is important to note that the function should be defined before calling it in most programming languages, as the interpreter needs to know about the function's existence before it can be

### Removing brackets

In [93]:
import re

# Function to remove square brackets and parentheses
def remove_brackets(text):
    text = re.sub(r'[\[\]\(\)]', '', str(text))
    return text

# Text columns shared by the train and test sets
TEXT_COLUMNS = ['prompt', 'response_a', 'response_b']

# Function to apply a cleaning function to every text column of the train and test sets
def apply_function(func):
    for df in (train, test):
        for column in TEXT_COLUMNS:
            df[column] = df[column].apply(func)

# Apply the function
apply_function(remove_brackets)
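
The sample shown earlier suggests that each field is a JSON-encoded list of conversation turns, which is where the outer brackets and quotes come from. As an alternative to stripping brackets with a regex (which also removes brackets and parentheses inside the text itself), here is a rough sketch that decodes the list instead. `parse_turns` is a hypothetical helper, and the assumption that every field is valid JSON may not hold for all rows, hence the fallback.

```python
import json

def parse_turns(raw):
    """Decode a JSON-encoded list of turns and join them into one string."""
    try:
        turns = json.loads(raw)
    except (TypeError, ValueError):
        # Not valid JSON (or not a string): fall back to the raw text
        return str(raw)
    if not isinstance(turns, list):
        return str(turns)
    # Skip non-string entries such as null turns
    return " ".join(turn for turn in turns if isinstance(turn, str))

example = '["explain function calling. how would you call a function?"]'
print(parse_turns(example))
```

Decoding also turns escaped `\n` sequences into real newline characters, which a later whitespace-normalization step can collapse.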

### Removing newline characters (`\n`)

In [94]:
# Function to replace escaped newlines (the literal two-character sequence "\n") with spaces
def remove_newlines(text):
    return text.replace("\\n", " ")

# Apply the function
apply_function(remove_newlines)

### Converting all text to lowercase

In [95]:
# Function to lowercase text
def lowercase_text(text):
    return text.lower()

# Apply the function
apply_function(lowercase_text)

### Removing punctuation

In [96]:
import string

# Function to remove punctuation
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

# Apply the function
apply_function(remove_punctuation)

### Removing stopwords

In [97]:
"""
from nltk.corpus import stopwords

# Build the stopword set once instead of once per word
STOPWORDS = set(stopwords.words('english'))

# Function to remove stopwords
def remove_stopwords(text):
    filtered_words = [word for word in text.split() if word not in STOPWORDS]
    return ' '.join(filtered_words)

# Apply the function
apply_function(remove_stopwords)
""";

### Lemmatizing text

In [98]:
"""
from nltk.stem import WordNetLemmatizer

# Create the lemmatizer once instead of once per word
lemmatizer = WordNetLemmatizer()

# Function to lemmatize text
def lemmatize_text(text):
    lemmatized_words = [lemmatizer.lemmatize(word) for word in text.split()]
    return ' '.join(lemmatized_words)

# Apply the function
apply_function(lemmatize_text)
""";

### Handling whitespace

In [99]:
# Function to collapse runs of whitespace into single spaces and trim both ends
def handle_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

# Apply the function
apply_function(handle_whitespace)

### Example entry after cleaning

In [100]:
# Show the same entry after data cleaning
print(show_prompt_and_responses(2))

Prompt: explain function calling how would you call a function

 Model A response: function calling is the process of invoking or executing a function in a programming language to call a function you need to use the functions name followed by parentheses within the parentheses you can pass arguments if the function requires any here is an example of how to call a function in python python define a function def greetname printhello name call the function greetjohn in this example the function greet is defined with a single parameter name which is used to print a greeting message to call the function and pass an argument we write greetjohn this will execute the function and print hello john it is important to note that the function should be defined before calling it in most programming languages as the interpreter needs to know about the functions existence before it can be called

 Model B response: function calling is the process of invoking a function in a program a function is a blo
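
The cleaned output above contains fused tokens such as `greetname` and `greetjohn`, because deleting punctuation outright joins whatever it separated. One possible alternative, sketched here under the assumption that separate tokens are preferable for TF-IDF, is to replace punctuation with a space and then collapse whitespace. `punctuation_to_space` is a hypothetical helper, not part of the notebook's pipeline.

```python
import re
import string

# Character class matching any ASCII punctuation character
_PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")

def punctuation_to_space(text):
    """Replace each punctuation character with a space, then collapse whitespace."""
    spaced = _PUNCT_RE.sub(" ", text)
    return re.sub(r"\s+", " ", spaced).strip()

print(punctuation_to_space('greet("john")'))  # greet john
```

The trade-off is that possessives and contractions are split as well, so `function's` becomes `function s`.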

# Modeling

### Vectorizing text with TF-IDF

In [101]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import joblib

# Function to vectorize text using TF-IDF.
# Fits a new vectorizer when none is passed; otherwise reuses the fitted one.
def vectorize_text_simple(texts, max_features=5000, fitted_vectorizer=None):
    if fitted_vectorizer is None:
        fitted_vectorizer = TfidfVectorizer(max_features=max_features)
        vectors = fitted_vectorizer.fit_transform(texts)
    else:
        vectors = fitted_vectorizer.transform(texts)
    return vectors, fitted_vectorizer  # Return both the vectors and the vectorizer

# Combine the prompt and responses into single text columns for vectorization
train['combined_text'] = train['prompt'] + " " + train['response_a'] + " " + train['response_b']
test['combined_text'] = test['prompt'] + " " + test['response_a'] + " " + test['response_b']

# Vectorize the training data
train_vectors, fitted_vectorizer = vectorize_text_simple(train['combined_text'])

# Save the fitted vectorizer
joblib.dump(fitted_vectorizer, 'tfidf_vectorizer.joblib')

# Vectorize the testing data using the fitted vectorizer
test_vectors, _ = vectorize_text_simple(test['combined_text'], fitted_vectorizer=fitted_vectorizer)
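
To illustrate how TF-IDF vectors like these can feed a three-way classifier, here is a small self-contained sketch on toy data. It uses scikit-learn's `LogisticRegression` as a stand-in, which is an assumption for illustration and not the model this notebook settles on. The toy rows and the single `winner` label column are hypothetical as well.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the train and test frames (hypothetical data)
toy_train = pd.DataFrame({
    "combined_text": [
        "what is python a programming language a snake",
        "capital of france paris london",
        "two plus two four five",
        "say hello hello there hello there",
        "name a color blue blue",
        "what is python a snake a programming language",
    ],
    "winner": [
        "winner_model_a", "winner_model_a", "winner_model_a",
        "winner_tie", "winner_tie", "winner_model_b",
    ],
})
toy_test = pd.DataFrame({"combined_text": ["what is python", "say hello"]})

# Fit TF-IDF on the training text only, then reuse it for the test text
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(toy_train["combined_text"])
X_test = vectorizer.transform(toy_test["combined_text"])

# Multiclass logistic regression accepts the sparse TF-IDF matrix directly
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, toy_train["winner"])

# One probability column per class, ordered as in clf.classes_
probs = pd.DataFrame(clf.predict_proba(X_test), columns=clf.classes_)
print(probs.round(3))
```

Note that a bag-of-words representation of the combined text does not record which response a word came from, so vectorizing the two responses separately may be worth trying.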

### H2O AutoML

In [102]:
"""
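import h2o
from h2o.automl import H2OAutoML

# Start (or connect to) a local H2O cluster
h2o.init()

# train has no single 'winner' column as loaded, so derive one from the one-hot winner columns
train['winner'] = train[['winner_model_a', 'winner_model_b', 'winner_tie']].idxmax(axis=1)

# Convert sparse matrices to pandas DataFrames
train_vectors_df = pd.DataFrame.sparse.from_spmatrix(train_vectors)
test_vectors_df = pd.DataFrame.sparse.from_spmatrix(test_vectors)

# Add other columns
# (assumes test.csv also contains model_a and model_b; drop them from both frames if it does not)
train_h2o = h2o.H2OFrame(pd.concat([train_vectors_df, train[['winner', 'model_a', 'model_b']]], axis=1))
test_h2o = h2o.H2OFrame(pd.concat([test_vectors_df, test[['model_a', 'model_b']]], axis=1))

# Identify target and predictors
y = "winner"
train_h2o[y] = train_h2o[y].asfactor()  # Treat the target as categorical (classification)
x = train_h2o.columns
x.remove(y)

# Run AutoML
aml = H2OAutoML(max_runtime_secs=10, seed=42)  # Adjust the runtime budget as needed
aml.train(x=x, y=y, training_frame=train_h2o)

# Get the best model
best_model = aml.leader

# Make predictions
predictions = best_model.predict(test_h2o)

# Convert predictions to a pandas DataFrame
predictions_df = predictions.as_data_frame()

# Prepare submission (if applicable)
# ... (your submission preparation code) ...

# Shut down the H2O cluster
h2o.cluster().shutdown()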
""";