# Kaggle - LLM Science Exam
##### Use LLMs to answer difficult science questions

In [1]:
!pip install transformers



In [2]:
!pip install jupyter --upgrade

Collecting jupyter
  Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Installing collected packages: jupyter
Successfully installed jupyter-1.0.0


In [3]:
!pip install ipywidgets



In [4]:
!pip install --upgrade tqdm



In [5]:
!jupyter nbextension enable --py widgetsnbextension

Config option `kernel_spec_manager_class` not recognized by `EnableNBExtensionApp`.
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


In [6]:
import warnings
warnings.filterwarnings("ignore")

In [7]:
# Library for data manipulation and analysis
import pandas as pd

# Library for numerical computing
import numpy as np

In [8]:
# Library for deep learning and tensor computation
import torch

# Library for progress bars
from tqdm.notebook import tqdm

In [9]:
# Library for NLP models and tokenization
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Library for data normalization
from sklearn.preprocessing import normalize

In [10]:
# Library for handling datasets
from datasets import Dataset

# Library for defining data classes
from dataclasses import dataclass

In [11]:
# Base tokenizer class
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy

# Library for multiple choice models
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

In [12]:
from typing import Optional, Union

In [13]:
# Step 1: Overview of the Dataset
train_data = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/train.csv')
train_data.head(5)

Unnamed: 0,id,prompt,A,B,C,D,E,answer
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,D
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,C
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,D


In [14]:
# Step 1 (continued): Preview of the test set
test_data = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/test.csv')
test_data.head()

Unnamed: 0,id,prompt,A,B,C,D,E
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...


In [15]:
# Step 2: Data Summary
print("Summary statistics:")
print(train_data.describe())

Summary statistics:
               id
count  200.000000
mean    99.500000
std     57.879185
min      0.000000
25%     49.750000
50%     99.500000
75%    149.250000
max    199.000000


In [16]:
# Step 3: Missing Values
print("Number of missing values in each column:")
print(train_data.isnull().sum())

Number of missing values in each column:
id        0
prompt    0
A         0
B         0
C         0
D         0
E         0
answer    0
dtype: int64


### 2. Preparing the T5 Model <a id="cell2"></a>

In [41]:
model_path = '/kaggle/input/transformers/t5-large'

In [18]:
model = T5ForConditionalGeneration.from_pretrained(model_path)

In [19]:
tokenizer = AutoTokenizer.from_pretrained(model_path)

### 3. Training and Validation <a id="cell3"></a>

In [20]:
valid_score = 0

In [21]:
model.eval()

T5ForConditionalGeneration(
  (shared): Embedding(32128, 1024)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 1024)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=1024, out_features=1024, bias=False)
              (k): Linear(in_features=1024, out_features=1024, bias=False)
              (v): Linear(in_features=1024, out_features=1024, bias=False)
              (o): Linear(in_features=1024, out_features=1024, bias=False)
              (relative_attention_bias): Embedding(32, 16)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=1024, out_features=4096, bias=False)
              (wo): Linear(in_features=4096, out_features=1024, bias=False)
              (d

In [22]:
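# Evaluation helper: returns the reciprocal of the 1-based position at which
# the value 1 first appears in `predicted_ranks` (0.0 if it never appears).
# The ground-truth answer is not an argument, so the result is only a true
# reciprocal rank if the caller marks the correct option with the value 1.
def reciprocal_rank(predicted_ranks):
    for rank, p in enumerate(predicted_ranks):
        if p == 1:
            return 1.0 / (rank + 1)
    return 0.0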

In [23]:
with torch.no_grad():
    for index in tqdm(range(train_data.shape[0])):
        # Row layout: [id, prompt, A, B, C, D, E, answer]
        columns = train_data.iloc[index].values
        # Encoder input: the prompt followed by a T5 sentinel token
        input_ids = tokenizer(columns[1] + " <extra_id_0>", return_tensors="pt").input_ids
        # Decoder targets: the sentinel followed by each of the five options
        labels = tokenizer(["<extra_id_0> " + columns[2+p] for p in range(5)], return_tensors="pt", padding=True).input_ids
        minlen = np.min([len(l) for l in labels])
        scores = []

        # Score each option by the model's loss on it (no text is generated)
        for p in range(5):
            loss = model(input_ids=input_ids, labels=labels[p][:minlen].unsqueeze(0)).loss.item()
            scores.append(loss)

        # Convert losses to ranks (0 = lowest loss), then reverse the array order
        predicted_ranks = np.array(scores).argsort().argsort()[::-1]

        # Accumulate the score; note that columns[7] (the true answer) is not consulted
        valid_score += reciprocal_rank(predicted_ranks)


In [24]:
valid_score /= train_data.shape[0]

In [25]:
print(f'score = {valid_score}')

score = 0.445
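
The loop above ranks the five options by loss, but a reciprocal-rank score only reflects accuracy when it is told where the correct option landed. Below is a minimal sketch of that idea, assuming that a lower loss means a more likely option. The helper name `reciprocal_rank_of_answer` and the loss values are made up for illustration and are not part of the original notebook.

```python
import numpy as np

def reciprocal_rank_of_answer(losses, answer_index):
    """Return 1 / (1-based rank of the correct option), ranking by ascending loss."""
    losses = np.asarray(losses, dtype=float)
    # argsort().argsort() converts scores into 0-based ranks:
    # the option with the lowest loss gets rank 0.
    ranks = losses.argsort().argsort()
    return 1.0 / (ranks[answer_index] + 1)

# Made-up losses for options A..E of a single question (illustration only).
toy_losses = [2.3, 0.7, 1.5, 3.1, 0.9]
toy_ranks = np.asarray(toy_losses).argsort().argsort()
print(toy_ranks)  # [3 0 2 4 1]

# If the correct answer were E (index 4), it would be ranked second.
print(reciprocal_rank_of_answer(toy_losses, answer_index=4))  # 0.5
```

In the notebook's setting, `answer_index` could be obtained from the `answer` column via a letter-to-index mapping such as `'ABCDE'.index(answer)`.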


### 4. Generating Predictions <a id="cell4"></a>

In [26]:
# Examples passed to preprocess() are expected to be dictionaries with keys
# 'prompt', 'A', 'B', 'C', 'D', 'E', and 'answer'.
options = 'ABCDE'
indices = list(range(5))

In [27]:
option_to_index = {option: index for option, index in zip(options, indices)}
index_to_option = {index: option for option, index in zip(options, indices)}

In [28]:

def preprocess(example):
    first_sentence = [example['prompt']] * 5
    second_sentence = [example[option] for option in options]
    tokenized_example = tokenizer(first_sentence, second_sentence, truncation=True)
    tokenized_example['label'] = option_to_index[example['answer']]
    return tokenized_example

In [29]:
@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    
    def __call__(self, features):
        label_name = 'label' if 'label' in features[0].keys() else 'labels'
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch   
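
The collator's flatten, pad, and reshape sequence can be illustrated without a tokenizer. The sketch below uses NumPy and made-up token ids. `pad_and_reshape` is a hypothetical stand-in for the `tokenizer.pad(...)` call followed by `.view(batch_size, num_choices, -1)`; it is not part of the transformers API, and the pad id of 0 is an assumption.

```python
import numpy as np

def pad_and_reshape(features, pad_id=0):
    """features: list of examples, each a list of per-choice token-id lists."""
    batch_size = len(features)
    num_choices = len(features[0])
    # Flatten (example, choice) pairs into one list of sequences.
    flat = [seq for example in features for seq in example]
    max_len = max(len(seq) for seq in flat)
    # Right-pad every sequence to the longest one in the batch.
    padded = np.full((len(flat), max_len), pad_id, dtype=np.int64)
    for row, seq in enumerate(flat):
        padded[row, :len(seq)] = seq
    # Restore the (batch, choice, sequence) layout.
    return padded.reshape(batch_size, num_choices, max_len)

# Made-up token ids: two examples with two choices each.
toy_features = [
    [[5, 6], [5, 7, 8]],
    [[9], [9, 10, 11, 12]],
]
batch = pad_and_reshape(toy_features)
print(batch.shape)  # (2, 2, 4)
```

The three-dimensional layout is what a multiple-choice head expects: one row of token ids per (question, option) pair, grouped by question.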

In [30]:
training_args = TrainingArguments(
    output_dir='./',
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=100,
    weight_decay=0.01,
    report_to='none'
)

In [31]:
model_dir = '/kaggle/input/llm-sci-exam-deberta-large-run01'
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMultipleChoice.from_pretrained(model_dir)

In [32]:
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
)

In [33]:

def predictions_to_map_output(predictions):
    sorted_answer_indices = np.argsort(-predictions)
    top_answer_indices = sorted_answer_indices[:,:3] # Get the first three answers in each row
    top_answers = np.vectorize(index_to_option.get)(top_answer_indices)
    return np.apply_along_axis(lambda row: ' '.join(row), 1, top_answers)
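
As a quick check of the helper above, the sketch below re-declares it together with the index-to-letter mapping and applies it to made-up logits. The `toy_logits` values are invented for illustration and are not model outputs.

```python
import numpy as np

# Same mapping as in the notebook: 0 -> 'A', ..., 4 -> 'E'.
index_to_option = dict(enumerate("ABCDE"))

def predictions_to_map_output(predictions):
    # Sort option indices by descending score and keep the top three per row.
    sorted_answer_indices = np.argsort(-predictions)
    top_answer_indices = sorted_answer_indices[:, :3]
    top_answers = np.vectorize(index_to_option.get)(top_answer_indices)
    return np.apply_along_axis(lambda row: ' '.join(row), 1, top_answers)

# Made-up logits for two questions (illustration only).
toy_logits = np.array([
    [0.1, 2.0, -1.0, 0.5, 1.5],
    [3.0, 0.0, 1.0, 2.0, -2.0],
])
toy_output = predictions_to_map_output(toy_logits)
print(toy_output)  # ['B E D' 'A D C']
```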

In [34]:

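# Placeholder label so that preprocess(), which reads example['answer'], also works on the test set
test_data['answer'] = 'B'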

In [35]:
# Tokenize the test set with the same preprocess function
test_ds = Dataset.from_pandas(test_data)
tokenized_test_ds = test_ds.map(preprocess, 
                                batched=False, 
                                remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


### 5. Creating Submission File <a id="cell5"></a>

In [36]:
# Actual model predictions for the test set (the 'B' labels set earlier were placeholders)
test_predictions = trainer.predict(tokenized_test_ds)

You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [37]:

# Copy so that adding a column does not assign into a slice of test_data
submission_df = test_data[['id']].copy()

In [38]:

submission_df['prediction'] = predictions_to_map_output(test_predictions.predictions)

In [39]:

submission_df.to_csv('submission.csv', index=False)

In [40]:
print(submission_df)

      id prediction
0      0      D B E
1      1      A B E
2      2      A C E
3      3      C E A
4      4      D A C
..   ...        ...
195  195      C A E
196  196      B C A
197  197      B A E
198  198      D C A
199  199      C D A

[200 rows x 2 columns]
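
Predictions in this space-separated format can be sanity-checked against known answers with a MAP@3-style score. This is a sketch under the assumption that the relevant metric is mean average precision at 3 with one correct label per question, as the helper name `predictions_to_map_output` suggests. The function `map_at_3` and the toy data are hypothetical and not taken from the notebook.

```python
def map_at_3(predictions, answers):
    """Mean average precision at 3, assuming exactly one correct label per row."""
    total = 0.0
    for prediction, answer in zip(predictions, answers):
        guesses = prediction.split()[:3]
        if answer in guesses:
            # Only the first hit counts: it contributes 1, 1/2, or 1/3.
            total += 1.0 / (guesses.index(answer) + 1)
    return total / len(answers)

# Made-up predictions and answers (illustration only).
toy_predictions = ["D B E", "A B E", "C A E", "B C A"]
toy_answers = ["D", "B", "E", "D"]
print(map_at_3(toy_predictions, toy_answers))
```

With these toy rows the correct letter appears first, second, third, and not at all, so the four contributions are 1, 1/2, 1/3, and 0 before averaging.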
