<a href="https://colab.research.google.com/github/kanyijohn/Deep-Learning-ML-models/blob/main/LLM_Classification_Finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Explore using pre-trained language models (such as BERT or RoBERTa) to generate embeddings for the prompt and the two responses. Consider more sophisticated neural network architectures designed for sequence comparison or ranking, and fine-tune these models on the preference prediction task. Systematically tune hyperparameters such as embedding dimension, `maxlen`, number of layers, units in the dense layers, learning rate, batch size, and number of epochs.
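
One way to make the hyperparameter tuning systematic is to enumerate a search space and sample configurations from it. The sketch below is only an illustration: the `search_space` values and the helper names `iter_grid` and `sample_configs` are hypothetical placeholders, not settings taken from this notebook.

```python
import itertools
import random

# Hypothetical search space; the values are placeholders, not tuned recommendations.
search_space = {
    "maxlen": [128, 256, 512],
    "dense_units": [64, 128, 256],
    "num_layers": [1, 2],
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "batch_size": [32, 64],
    "epochs": [3, 5],
}


def iter_grid(space):
    """Yield every combination in the space as a dict (full grid search)."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def sample_configs(space, n, seed=0):
    """Random search: draw n distinct configurations from the full grid."""
    grid = list(iter_grid(space))
    rng = random.Random(seed)
    return rng.sample(grid, min(n, len(grid)))


configs = sample_configs(search_space, n=5)
for cfg in configs:
    print(cfg)
```

Each sampled dict can then be passed to whatever model-building function is used, with the validation log loss recorded per configuration.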

## Install necessary libraries

### Subtask:
Install the `transformers` library to use pre-trained language models.


**Reasoning**:
Install the transformers library using pip.



In [1]:
!pip install transformers



## Data loading

### Subtask:
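Download the competition data from Kaggle, then load the pre-trained model and tokenizer.


**Reasoning**:
Authenticate with the Kaggle API, download and clean the competition data, then import the necessary classes and load the pre-trained tokenizer and model.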



In [2]:
!pip install kaggle



In [3]:
from google.colab import files
import json

# Upload kaggle.json (the Kaggle API token)
uploaded = files.upload()

# Confirm the upload without echoing the file contents:
# kaggle.json holds a secret API key that must not appear in notebook output
for filename in uploaded.keys():
    print(f"Uploaded: {filename}")

Saving kaggle.json to kaggle.json
Uploaded: kaggle.json


In [4]:
!mkdir -p ~/.kaggle  # -p prevents error if dir exists
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json  # Restrict permissions
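
The token file can be sanity-checked without ever printing the secret. The following is a minimal sketch, assuming a POSIX file system; `check_kaggle_token` is a hypothetical helper, not part of the Kaggle CLI.

```python
import json
import os
import stat
from pathlib import Path


def check_kaggle_token(path):
    """Return a list of problems with a Kaggle token file; an empty list means it looks fine."""
    path = Path(path)
    if not path.is_file():
        return [f"{path} does not exist"]
    try:
        token = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    # Only check that the fields are present; never print their values.
    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing field: {field}")
    # Group/other permission bits should be clear, which is what chmod 600 achieves.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        problems.append(f"permissions too open: {oct(mode)}")
    return problems


print(check_kaggle_token(Path.home() / ".kaggle" / "kaggle.json"))
```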

In [5]:
!kaggle competitions list  # Should list competitions (no 401 error)

ref                                                                              deadline             category                reward  teamCount  userHasEntered  
-------------------------------------------------------------------------------  -------------------  ---------------  -------------  ---------  --------------  
https://www.kaggle.com/competitions/arc-prize-2025                               2025-11-03 23:59:00  Featured         1,000,000 Usd        657           False  
https://www.kaggle.com/competitions/google-gemma-3n-hackathon                    2025-08-06 23:59:00  Featured           150,000 Usd          0           False  
https://www.kaggle.com/competitions/make-data-count-finding-data-references      2025-09-09 23:59:00  Research           100,000 Usd        608           False  
https://www.kaggle.com/competitions/map-charting-student-math-misunderstandings  2025-10-15 23:59:00  Featured            55,000 Usd        243           False  
https://www.kaggle.com/compe

In [6]:
!kaggle competitions download -c llm-classification-finetuning

llm-classification-finetuning.zip: Skipping, found more recently modified local copy (use --force to force download)


In [7]:
!unzip llm-classification-finetuning.zip

Archive:  llm-classification-finetuning.zip
replace sample_submission.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: no
replace test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: no
replace train.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: no


In [8]:
# @title Step 1: Setup and Data Loading

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Embedding, GlobalAveragePooling1D, concatenate
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import re # For text cleaning
from nltk.corpus import stopwords
import nltk
import string

# Download NLTK stopwords (if not already downloaded)
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

# Load the datasets
try:
    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')
    sample_submission_df = pd.read_csv('sample_submission.csv')
    print("Data loaded successfully!")
except FileNotFoundError:
    print("Ensure train.csv, test.csv, and sample_submission.csv are in your Colab environment.")
    # You might need to upload the files or connect to Google Drive
    # from google.colab import files
    # uploaded = files.upload()
    # For Kaggle competitions, the data is usually available in the input directory
    # train_df = pd.read_csv('/kaggle/input/llm-classification-finetuning/train.csv')
    # test_df = pd.read_csv('/kaggle/input/llm-classification-finetuning/test.csv')
    # sample_submission_df = pd.read_csv('/kaggle/input/llm-classification-finetuning/sample_submission.csv')

print("\nTrain data shape:", train_df.shape)
print("Test data shape:", test_df.shape)
print("\nTrain data head:")
print(train_df.head())
print("\nTest data head:")
print(test_df.head())
print("\nSample submission head:")
print(sample_submission_df.head())

Data loaded successfully!

Train data shape: (57477, 9)
Test data shape: (3, 4)

Train data head:
       id             model_a              model_b  \
0   30192  gpt-4-1106-preview           gpt-4-0613   
1   53567           koala-13b           gpt-4-0613   
2   65089  gpt-3.5-turbo-0613       mistral-medium   
3   96401    llama-2-13b-chat  mistral-7b-instruct   
4  198779           koala-13b   gpt-3.5-turbo-0314   

                                              prompt  \
0  ["Is it morally right to try to have a certain...   
1  ["What is the difference between marriage lice...   
2  ["explain function calling. how would you call...   
3  ["How can I create a test set for a very rare ...   
4  ["What is the best way to travel from Tel-Aviv...   

                                          response_a  \
0  ["The question of whether it is morally right ...   
1  ["A marriage license is a legal document that ...   
2  ["Function calling is the process of invoking ...   
3  ["Creating a 
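
The head output above shows values that start with `["`, which suggests (an assumption, not confirmed anywhere in this notebook) that each `prompt` and response field is a JSON-encoded list of conversation turns. If so, decoding the list before cleaning avoids treating the brackets and quotes as text. A minimal sketch, with `parse_turns` as a hypothetical helper:

```python
import json


def parse_turns(raw):
    """Decode a JSON-encoded list of turns into one plain-text string."""
    try:
        turns = json.loads(raw)
    except (TypeError, ValueError):
        return str(raw)  # Not JSON: keep the value as plain text
    if not isinstance(turns, list):
        return str(turns)
    # Treat null turns as empty strings and join the turns with newlines.
    return "\n".join("" if t is None else str(t) for t in turns)


print(parse_turns('["What is 2+2?", "And 3+3?"]'))
```

It could be applied with `train_df['prompt'].apply(parse_turns)` before `clean_text`, assuming the columns really hold JSON lists.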

In [9]:
# @title Step 2: Data Preprocessing and Feature Engineering

# Text Cleaning Function
def clean_text(text):
    text = text.lower() # Convert to lowercase
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text) # Remove punctuation
    # Optional: Remove stopwords
    # stop_words = set(stopwords.words('english'))
    # text = ' '.join([word for word in text.split() if word not in stop_words])
    text = re.sub(r'\s+', ' ', text).strip() # Remove extra whitespace
    return text

# Apply cleaning to prompt and responses
train_df['prompt_cleaned'] = train_df['prompt'].apply(clean_text)
train_df['response_a_cleaned'] = train_df['response_a'].apply(clean_text)
train_df['response_b_cleaned'] = train_df['response_b'].apply(clean_text)

test_df['prompt_cleaned'] = test_df['prompt'].apply(clean_text)
test_df['response_a_cleaned'] = test_df['response_a'].apply(clean_text)
test_df['response_b_cleaned'] = test_df['response_b'].apply(clean_text)

print("\nText cleaning applied.")
print("\nTrain data with cleaned text:")
print(train_df[['prompt_cleaned', 'response_a_cleaned', 'response_b_cleaned']].head())

# Tokenization and Padding
max_words = 20000 # Maximum number of words to keep based on word frequency
maxlen = 256 # Maximum length of sequences

tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")

# Fit tokenizer on combined text from train and test for better vocabulary coverage
all_text = pd.concat([
    train_df['prompt_cleaned'], train_df['response_a_cleaned'], train_df['response_b_cleaned'],
    test_df['prompt_cleaned'], test_df['response_a_cleaned'], test_df['response_b_cleaned']
])
tokenizer.fit_on_texts(all_text)

# Convert text to sequences
train_sequences_prompt = tokenizer.texts_to_sequences(train_df['prompt_cleaned'])
train_sequences_response_a = tokenizer.texts_to_sequences(train_df['response_a_cleaned'])
train_sequences_response_b = tokenizer.texts_to_sequences(train_df['response_b_cleaned'])

test_sequences_prompt = tokenizer.texts_to_sequences(test_df['prompt_cleaned'])
test_sequences_response_a = tokenizer.texts_to_sequences(test_df['response_a_cleaned'])
test_sequences_response_b = tokenizer.texts_to_sequences(test_df['response_b_cleaned'])

# Pad sequences
train_padded_prompt = pad_sequences(train_sequences_prompt, maxlen=maxlen, padding='post', truncating='post')
train_padded_response_a = pad_sequences(train_sequences_response_a, maxlen=maxlen, padding='post', truncating='post')
train_padded_response_b = pad_sequences(train_sequences_response_b, maxlen=maxlen, padding='post', truncating='post')

test_padded_prompt = pad_sequences(test_sequences_prompt, maxlen=maxlen, padding='post', truncating='post')
test_padded_response_a = pad_sequences(test_sequences_response_a, maxlen=maxlen, padding='post', truncating='post')
test_padded_response_b = pad_sequences(test_sequences_response_b, maxlen=maxlen, padding='post', truncating='post')

print("\nText tokenization and padding applied.")
print("\nExample padded sequence (prompt):")
print(train_padded_prompt[0])

# Prepare Target Variable
# We need to represent the winner as a one-hot encoded vector
# [1, 0, 0] for winner_model_a, [0, 1, 0] for winner_model_b, [0, 0, 1] for winner_tie
y_train = train_df[['winner_model_a', 'winner_model_b', 'winner_tie']].values

print("\nTarget variable prepared:")
print(y_train[:5])


Text cleaning applied.

Train data with cleaned text:
                                      prompt_cleaned  \
0  is it morally right to try to have a certain p...   
1  what is the difference between marriage licens...   
2  explain function calling how would you call a ...   
3  how can i create a test set for a very rare ca...   
4  what is the best way to travel from telaviv to...   

                                  response_a_cleaned  \
0  the question of whether it is morally right to...   
1  a marriage license is a legal document that al...   
2  function calling is the process of invoking or...   
3  creating a test set for a very rare category c...   
4  the best way to travel from tel aviv to jerusa...   

                                  response_b_cleaned  
0  as an ai i dont have personal beliefs or opini...  
1  a marriage license and a marriage certificate ...  
2  function calling is the process of invoking a ...  
3  when building a classifier for a very rare cat..
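
Since the target is a one-hot vector over three outcomes and `log_loss` is imported as the metric, a useful reference point is the loss of a model that always predicts equal probabilities. The sketch below uses toy labels in the same `[winner_model_a, winner_model_b, winner_tie]` layout; `uniform_baseline_log_loss` is a hypothetical helper written with NumPy only.

```python
import numpy as np


def uniform_baseline_log_loss(y_onehot):
    """Multiclass log loss of predicting equal probability for every class."""
    y = np.asarray(y_onehot, dtype=float)
    n_classes = y.shape[1]
    probs = np.full_like(y, 1.0 / n_classes)
    return float(-np.mean(np.sum(y * np.log(probs), axis=1)))


# Toy labels in the [winner_model_a, winner_model_b, winner_tie] layout.
y_demo = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
class_ids = y_demo.argmax(axis=1)  # 0 = model A wins, 1 = model B wins, 2 = tie
print(class_ids)
print(uniform_baseline_log_loss(y_demo))
```

For three classes this baseline equals ln 3 (about 1.0986) regardless of the label distribution, so any trained model should score below it on validation data.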

In [10]:
from transformers import BertModel, BertTokenizer

# Load a pre-trained tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Load the pre-trained model
model = BertModel.from_pretrained('bert-base-uncased')

print("Pre-trained BERT tokenizer and model loaded successfully.")

Pre-trained BERT tokenizer and model loaded successfully.


## Generate embeddings

### Subtask:
Use the pre-trained model to generate embeddings for the prompt and responses.


**Reasoning**:
Define a function to generate BERT embeddings for text inputs using the loaded tokenizer and model, then apply it to the relevant columns in the training and test dataframes.
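
The cells below take the `[CLS]` vector as the sentence embedding. A common alternative is mean pooling over non-padding tokens. The NumPy sketch below contrasts the two on a toy array standing in for `last_hidden_state` and `attention_mask`; it illustrates the pooling arithmetic only and is not the method used in this notebook.

```python
import numpy as np


def cls_pool(hidden, mask):
    """Take the vector at position 0 (the [CLS] token) of each sequence."""
    return hidden[:, 0, :]


def mean_pool(hidden, mask):
    """Average token vectors, ignoring positions where mask == 0 (padding)."""
    mask = mask[:, :, None].astype(hidden.dtype)   # (batch, seq, 1)
    summed = (hidden * mask).sum(axis=1)           # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


# Toy batch: 2 sequences, 3 token positions, hidden size 2.
hidden = np.array([[[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],
                   [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]]])
mask = np.array([[1, 1, 0],   # last position is padding
                 [1, 1, 1]])

print(cls_pool(hidden, mask))
print(mean_pool(hidden, mask))
```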



In [None]:
import torch

def generate_bert_embeddings(texts, tokenizer, model, max_len=512, batch_size=32):
    """Generates BERT embeddings for a list of text inputs in batches."""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    embeddings = []

    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i + batch_size]
        encoded_inputs = tokenizer(
            batch_texts,
            padding=True,
            truncation=True,
            max_length=max_len,
            return_tensors='pt'  # Return PyTorch tensors
        )

        encoded_inputs = {key: val.to(device) for key, val in encoded_inputs.items()}

        with torch.no_grad():
            outputs = model(**encoded_inputs)

        cls_embeddings = outputs.last_hidden_state[:, 0, :]
        embeddings.append(cls_embeddings.cpu().numpy())

    return np.concatenate(embeddings)

# Generate embeddings for training data
print("Generating embeddings for training data...")
train_prompt_embeddings = generate_bert_embeddings(train_df['prompt_cleaned'].tolist(), tokenizer, model)
train_response_a_embeddings = generate_bert_embeddings(train_df['response_a_cleaned'].tolist(), tokenizer, model)
train_response_b_embeddings = generate_bert_embeddings(train_df['response_b_cleaned'].tolist(), tokenizer, model)
print("Training embeddings generated.")

# Generate embeddings for test data
print("Generating embeddings for test data...")
test_prompt_embeddings = generate_bert_embeddings(test_df['prompt_cleaned'].tolist(), tokenizer, model)
test_response_a_embeddings = generate_bert_embeddings(test_df['response_a_cleaned'].tolist(), tokenizer, model)
test_response_b_embeddings = generate_bert_embeddings(test_df['response_b_cleaned'].tolist(), tokenizer, model)
print("Test embeddings generated.")

print("\nShape of training prompt embeddings:", train_prompt_embeddings.shape)
print("Shape of training response_a embeddings:", train_response_a_embeddings.shape)
print("Shape of training response_b embeddings:", train_response_b_embeddings.shape)
print("\nShape of test prompt embeddings:", test_prompt_embeddings.shape)
print("Shape of test response_a embeddings:", test_response_a_embeddings.shape)
print("Shape of test response_b embeddings:", test_response_b_embeddings.shape)

Generating embeddings for training data...
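
Once the three embedding matrices exist, they still have to be combined into one input for a comparison model. One possible layout, shown here with random arrays in place of real embeddings, concatenates the raw vectors with their difference and element-wise product. `build_pair_features` is a hypothetical helper; the width 768 matches the hidden size of `bert-base-uncased`.

```python
import numpy as np


def build_pair_features(prompt_emb, resp_a_emb, resp_b_emb):
    """Concatenate prompt/response embeddings with simple comparison terms."""
    return np.concatenate(
        [
            prompt_emb,
            resp_a_emb,
            resp_b_emb,
            resp_a_emb - resp_b_emb,  # signed difference: flips sign if A and B are swapped
            resp_a_emb * resp_b_emb,  # element-wise product: symmetric in A and B
        ],
        axis=1,
    )


rng = np.random.default_rng(0)
p, a, b = (rng.normal(size=(4, 768)) for _ in range(3))
features = build_pair_features(p, a, b)
print(features.shape)  # (4, 3840)
```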


**Reasoning**:
The previous command failed because the dataframes `train_df` and `test_df` were not defined in the current session. The dataframes need to be reloaded before the BERT embeddings can be regenerated.
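
Because a session restart discards every in-memory variable, expensive results such as embeddings are worth caching on disk. A minimal sketch, assuming the embeddings are NumPy arrays; `cached_array` and `fake_embeddings` are hypothetical names used only for this illustration.

```python
import tempfile
from pathlib import Path

import numpy as np


def cached_array(path, compute):
    """Load an array from path if it exists; otherwise compute it and save it."""
    path = Path(path).with_suffix(".npy")
    if path.exists():
        return np.load(path)
    array = np.asarray(compute())
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, array)
    return array


calls = []


def fake_embeddings():
    """Stand-in for an expensive embedding computation."""
    calls.append(1)
    return np.ones((3, 768), dtype=np.float32)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "train_prompt_embeddings.npy"
    first = cached_array(target, fake_embeddings)   # computes and saves
    second = cached_array(target, fake_embeddings)  # loads from disk

print(len(calls))  # the expensive function ran only once
```

In Colab the cache would need to live somewhere that survives the runtime, for example a mounted Drive folder.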



In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import torch
from transformers import BertModel, BertTokenizer
import re
import string
import nltk

# Download NLTK stopwords (if not already downloaded)
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

# Load the datasets
try:
    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')
    sample_submission_df = pd.read_csv('sample_submission.csv')
    print("Data loaded successfully!")
except FileNotFoundError:
    print("Ensure train.csv, test.csv, and sample_submission.csv are in your Colab environment.")


# Text Cleaning Function
def clean_text(text):
    text = text.lower() # Convert to lowercase
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text) # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip() # Remove extra whitespace
    return text

# Apply cleaning to prompt and responses
train_df['prompt_cleaned'] = train_df['prompt'].apply(clean_text)
train_df['response_a_cleaned'] = train_df['response_a'].apply(clean_text)
train_df['response_b_cleaned'] = train_df['response_b'].apply(clean_text)

test_df['prompt_cleaned'] = test_df['prompt'].apply(clean_text)
test_df['response_a_cleaned'] = test_df['response_a'].apply(clean_text)
test_df['response_b_cleaned'] = test_df['response_b'].apply(clean_text)

print("\nText cleaning applied.")

# Load a pre-trained tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
print("\nPre-trained BERT tokenizer and model loaded successfully.")

def generate_bert_embeddings(texts, tokenizer, model, max_len=512, batch_size=32):
    """Generates BERT [CLS] embeddings for a list of text inputs, one batch at a time."""
    # Move the model to the GPU if one is available
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()  # Inference mode: disables dropout
    embeddings = []

    # Process in batches: encoding the whole dataset in a single forward pass would exhaust memory
    for i in range(0, len(texts), batch_size):
        encoded_inputs = tokenizer(
            texts[i:i + batch_size],
            padding=True,
            truncation=True,
            max_length=max_len,
            return_tensors='pt'  # Return PyTorch tensors
        )
        # Move inputs to the same device as the model
        encoded_inputs = {key: val.to(device) for key, val in encoded_inputs.items()}

        # Get hidden states from BERT
        with torch.no_grad(): # Disable gradient calculation for inference
            outputs = model(**encoded_inputs)

        # 'last_hidden_state' has shape (batch_size, sequence_length, hidden_size);
        # the [CLS] token sits at index 0 of each sequence
        cls_embeddings = outputs.last_hidden_state[:, 0, :]
        embeddings.append(cls_embeddings.cpu().numpy()) # Move back to CPU and convert to numpy

    return np.concatenate(embeddings)

# Generate embeddings for training data
print("Generating embeddings for training data...")
train_prompt_embeddings = generate_bert_embeddings(train_df['prompt_cleaned'].tolist(), tokenizer, model)
train_response_a_embeddings = generate_bert_embeddings(train_df['response_a_cleaned'].tolist(), tokenizer, model)
train_response_b_embeddings = generate_bert_embeddings(train_df['response_b_cleaned'].tolist(), tokenizer, model)
print("Training embeddings generated.")

# Generate embeddings for test data
print("Generating embeddings for test data...")
test_prompt_embeddings = generate_bert_embeddings(test_df['prompt_cleaned'].tolist(), tokenizer, model)
test_response_a_embeddings = generate_bert_embeddings(test_df['response_a_cleaned'].tolist(), tokenizer, model)
test_response_b_embeddings = generate_bert_embeddings(test_df['response_b_cleaned'].tolist(), tokenizer, model)
print("Test embeddings generated.")

print("\nShape of training prompt embeddings:", train_prompt_embeddings.shape)
print("Shape of training response_a embeddings:", train_response_a_embeddings.shape)
print("Shape of training response_b embeddings:", train_response_b_embeddings.shape)
print("\nShape of test prompt embeddings:", test_prompt_embeddings.shape)
print("Shape of test response_a embeddings:", test_response_a_embeddings.shape)
print("Shape of test response_b embeddings:", test_response_b_embeddings.shape)

**Reasoning**:
The previous command failed because `stopwords` was not imported from `nltk.corpus`. The import needs to be added before retrying the data loading, cleaning, and embedding generation.



In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import torch
from transformers import BertModel, BertTokenizer
import re
import string
import nltk
from nltk.corpus import stopwords

# Download NLTK stopwords (if not already downloaded)
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

# Load the datasets
try:
    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')
    sample_submission_df = pd.read_csv('sample_submission.csv')
    print("Data loaded successfully!")
except FileNotFoundError:
    print("Ensure train.csv, test.csv, and sample_submission.csv are in your Colab environment.")


# Text Cleaning Function
def clean_text(text):
    text = text.lower() # Convert to lowercase
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text) # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip() # Remove extra whitespace
    return text

# Apply cleaning to prompt and responses
train_df['prompt_cleaned'] = train_df['prompt'].apply(clean_text)
train_df['response_a_cleaned'] = train_df['response_a'].apply(clean_text)
train_df['response_b_cleaned'] = train_df['response_b'].apply(clean_text)

test_df['prompt_cleaned'] = test_df['prompt'].apply(clean_text)
test_df['response_a_cleaned'] = test_df['response_a'].apply(clean_text)
test_df['response_b_cleaned'] = test_df['response_b'].apply(clean_text)

print("\nText cleaning applied.")

# Load a pre-trained tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
print("\nPre-trained BERT tokenizer and model loaded successfully.")

def generate_bert_embeddings(texts, tokenizer, model, max_len=512, batch_size=32):
    """Generates BERT [CLS] embeddings for a list of text inputs, one batch at a time."""
    # Move the model to the GPU if one is available
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()  # Inference mode: disables dropout
    embeddings = []

    # Process in batches: encoding the whole dataset in a single forward pass would exhaust memory
    for i in range(0, len(texts), batch_size):
        encoded_inputs = tokenizer(
            texts[i:i + batch_size],
            padding=True,
            truncation=True,
            max_length=max_len,
            return_tensors='pt'  # Return PyTorch tensors
        )
        # Move inputs to the same device as the model
        encoded_inputs = {key: val.to(device) for key, val in encoded_inputs.items()}

        # Get hidden states from BERT
        with torch.no_grad(): # Disable gradient calculation for inference
            outputs = model(**encoded_inputs)

        # 'last_hidden_state' has shape (batch_size, sequence_length, hidden_size);
        # the [CLS] token sits at index 0 of each sequence
        cls_embeddings = outputs.last_hidden_state[:, 0, :]
        embeddings.append(cls_embeddings.cpu().numpy()) # Move back to CPU and convert to numpy

    return np.concatenate(embeddings)

# Generate embeddings for training data
print("Generating embeddings for training data...")
train_prompt_embeddings = generate_bert_embeddings(train_df['prompt_cleaned'].tolist(), tokenizer, model)
train_response_a_embeddings = generate_bert_embeddings(train_df['response_a_cleaned'].tolist(), tokenizer, model)
train_response_b_embeddings = generate_bert_embeddings(train_df['response_b_cleaned'].tolist(), tokenizer, model)
print("Training embeddings generated.")

# Generate embeddings for test data
print("Generating embeddings for test data...")
test_prompt_embeddings = generate_bert_embeddings(test_df['prompt_cleaned'].tolist(), tokenizer, model)
test_response_a_embeddings = generate_bert_embeddings(test_df['response_a_cleaned'].tolist(), tokenizer, model)
test_response_b_embeddings = generate_bert_embeddings(test_df['response_b_cleaned'].tolist(), tokenizer, model)
print("Test embeddings generated.")

print("\nShape of training prompt embeddings:", train_prompt_embeddings.shape)
print("Shape of training response_a embeddings:", train_response_a_embeddings.shape)
print("Shape of training response_b embeddings:", train_response_b_embeddings.shape)
print("\nShape of test prompt embeddings:", test_prompt_embeddings.shape)
print("Shape of test response_a embeddings:", test_response_a_embeddings.shape)
print("Shape of test response_b embeddings:", test_response_b_embeddings.shape)

**Reasoning**:
The previous command failed because the `except FileNotFoundError` branch only printed a message: execution continued, and the code then tried to use dataframes that had never been created. The dataframes must be loaded before any cleaning or processing. The earlier data loading cell succeeded, which indicates the files are present, so I will remove the try-except block and load the files directly.
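
The failure mode described here, an exception handler that prints and carries on, can be avoided by failing fast with a clear error. A small sketch using only the standard library; `require_files` is a hypothetical helper and the file name in the demo is deliberately one that does not exist.

```python
from pathlib import Path


def require_files(paths):
    """Return the paths as Path objects, or raise if any of them is missing."""
    resolved = [Path(p) for p in paths]
    missing = [str(p) for p in resolved if not p.is_file()]
    if missing:
        # Stop here so that later cells never run against undefined dataframes.
        raise FileNotFoundError("Missing input files: " + ", ".join(missing))
    return resolved


try:
    require_files(["definitely_missing_file.csv"])
except FileNotFoundError as err:
    print(err)
```

Calling `require_files(['train.csv', 'test.csv', 'sample_submission.csv'])` before `pd.read_csv` would stop the cell at the real cause instead of at a later `NameError`.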



In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import torch
from transformers import BertModel, BertTokenizer
import re
import string
import nltk
from nltk.corpus import stopwords

# Download NLTK stopwords (if not already downloaded)
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

# Load the datasets - Direct load as previous successful execution confirms files are present
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
sample_submission_df = pd.read_csv('sample_submission.csv')
print("Data loaded successfully!")


# Text Cleaning Function
def clean_text(text):
    text = text.lower() # Convert to lowercase
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text) # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip() # Remove extra whitespace
    return text

# Apply cleaning to prompt and responses
train_df['prompt_cleaned'] = train_df['prompt'].apply(clean_text)
train_df['response_a_cleaned'] = train_df['response_a'].apply(clean_text)
train_df['response_b_cleaned'] = train_df['response_b'].apply(clean_text)

test_df['prompt_cleaned'] = test_df['prompt'].apply(clean_text)
test_df['response_a_cleaned'] = test_df['response_a'].apply(clean_text)
test_df['response_b_cleaned'] = test_df['response_b'].apply(clean_text)

print("\nText cleaning applied.")

# Load a pre-trained tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
print("\nPre-trained BERT tokenizer and model loaded successfully.")

def generate_bert_embeddings(texts, tokenizer, model, max_len=512, batch_size=32):
    """Generates BERT [CLS] embeddings for a list of text inputs, one batch at a time."""
    # Move the model to the GPU if one is available
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()  # Inference mode: disables dropout
    embeddings = []

    # Process in batches: encoding the whole dataset in a single forward pass would exhaust memory
    for i in range(0, len(texts), batch_size):
        encoded_inputs = tokenizer(
            texts[i:i + batch_size],
            padding=True,
            truncation=True,
            max_length=max_len,
            return_tensors='pt'  # Return PyTorch tensors
        )
        # Move inputs to the same device as the model
        encoded_inputs = {key: val.to(device) for key, val in encoded_inputs.items()}

        # Get hidden states from BERT
        with torch.no_grad(): # Disable gradient calculation for inference
            outputs = model(**encoded_inputs)

        # 'last_hidden_state' has shape (batch_size, sequence_length, hidden_size);
        # the [CLS] token sits at index 0 of each sequence
        cls_embeddings = outputs.last_hidden_state[:, 0, :]
        embeddings.append(cls_embeddings.cpu().numpy()) # Move back to CPU and convert to numpy

    return np.concatenate(embeddings)

# Generate embeddings for training data
print("Generating embeddings for training data...")
train_prompt_embeddings = generate_bert_embeddings(train_df['prompt_cleaned'].tolist(), tokenizer, model)
train_response_a_embeddings = generate_bert_embeddings(train_df['response_a_cleaned'].tolist(), tokenizer, model)
train_response_b_embeddings = generate_bert_embeddings(train_df['response_b_cleaned'].tolist(), tokenizer, model)
print("Training embeddings generated.")

# Generate embeddings for test data
print("Generating embeddings for test data...")
test_prompt_embeddings = generate_bert_embeddings(test_df['prompt_cleaned'].tolist(), tokenizer, model)
test_response_a_embeddings = generate_bert_embeddings(test_df['response_a_cleaned'].tolist(), tokenizer, model)
test_response_b_embeddings = generate_bert_embeddings(test_df['response_b_cleaned'].tolist(), tokenizer, model)
print("Test embeddings generated.")

print("\nShape of training prompt embeddings:", train_prompt_embeddings.shape)
print("Shape of training response_a embeddings:", train_response_a_embeddings.shape)
print("Shape of training response_b embeddings:", train_response_b_embeddings.shape)
print("\nShape of test prompt embeddings:", test_prompt_embeddings.shape)
print("Shape of test response_a embeddings:", test_response_a_embeddings.shape)
print("Shape of test response_b embeddings:", test_response_b_embeddings.shape)

**Reasoning**:
The previous command failed because the files were not found. Although an earlier cell indicated successful unzipping, the current environment cannot locate the files by relative path. I will try loading the files from the `/content/` directory, which is the default location after unzipping in Colab.
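
Rather than guessing a single location, the notebook could probe a short list of candidate directories and use the first one that holds all three CSV files. The sketch below is an assumption-laden illustration: `find_data_dir` is a hypothetical helper, and the candidate list combines the current directory, `/content`, and the Kaggle input path that appears in a comment of the first loading cell.

```python
from pathlib import Path

REQUIRED = ("train.csv", "test.csv", "sample_submission.csv")


def find_data_dir(candidates, required=REQUIRED):
    """Return the first candidate directory that contains every required file, else None."""
    for candidate in candidates:
        directory = Path(candidate)
        if all((directory / name).is_file() for name in required):
            return directory
    return None


data_dir = find_data_dir([
    ".",
    "/content",
    "/kaggle/input/llm-classification-finetuning",
])
print(data_dir)  # None means none of the candidates holds all three files
```

When a directory is found, the files can be read with `pd.read_csv(data_dir / 'train.csv')` and so on.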


