# How To Train a Model for the Open Book Q&A Technique
In this notebook we demonstrate how to train a model for use with the top-scoring Open Book Q&A method. The Open Book method was first presented by JJ (@jjinho) [here][1]; Quangteo (@quangbk) then reduced its RAM usage [here][2], and Anil (@nlztrk) combined it with Q&A [here][3]. Radek (@radek1) demonstrated the strength of Q&A [here][5]. Next, Mgoksu (@mgoksu) demonstrated how to achieve a top public LB score of 0.807 with this method [here][4] by finetuning DeBERTa large.

To train a model for use with Open Book Q&A, we need a CSV that contains `prompt` (i.e. the question), `A, B, C, D, E` (i.e. the answer choices), and a `context` column extracted from Wikipedia pages for each question. To generate the `context` column, we run Mgoksu's notebook [here][4]. In code cell #5, we load our CSV without the `context` column using `trn = pd.read_csv(OUR_DATASET.CSV)`. Then in code cell #21 our dataset is saved to disk as `test_context.csv` with the `context` column added.
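
To confirm that a CSV has the columns described above before training on it, a quick header check can help. This is a minimal sketch using only the standard library; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced here, and the `answer` column is assumed from the data shown later in this notebook.

```python
import csv
import io

# Column names assumed from the description above plus the 'answer' label column
REQUIRED_COLUMNS = ["prompt", "context", "A", "B", "C", "D", "E", "answer"]

def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [name for name in required if name not in header]

# Toy CSV without a context column, invented for illustration
sample = "prompt,A,B,C,D,E,answer\nWhat is 1+1?,1,2,3,4,5,B\n"
print(missing_columns(sample))  # ['context']
```

An empty list means the file is ready; any name in the list still has to be added, for example `context` by running the retrieval notebook.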

I searched for all publicly shared datasets, concatenated them into one 60k CSV, and then ran Mgoksu's notebook with `NUM_TITLES_INCLUDE = 5` and `NUM_SENTENCES_INCLUDE = 20`. This added a `context` column. I uploaded the resulting CSV file to a Kaggle dataset [here][6]. If you enjoy the notebook you are reading, please upvote the dataset too. Thanks!

![](https://miro.medium.com/v2/resize:fit:800/format:webp/1*bTGY3fKIgNefQxNsOYpnBw.png)
 
(image source [here][7])

[1]: https://www.kaggle.com/code/jjinho/open-book-llm-science-exam
[2]: https://www.kaggle.com/code/quangbk/open-book-llm-science-exam-reduced-ram-usage
[3]: https://www.kaggle.com/code/nlztrk/openbook-debertav3-large-baseline-single-model
[4]: https://www.kaggle.com/code/mgoksu/0-807-sharing-my-trained-with-context-model
[5]: https://www.kaggle.com/code/radek1/new-dataset-deberta-v3-large-training
[6]: https://www.kaggle.com/datasets/cdeotte/60k-data-with-context-v2
[7]: https://blog.gopenai.com/enrich-llms-with-retrieval-augmented-generation-rag-17b82a96b6f0

# Load CSV
We will load the 60k CSV of `prompt`, `A, B, C, D, E`, and `context` from my Kaggle dataset [here][1]. This dataset is all publicly shared datasets concatenated and then processed with Mgoksu's notebook [here][2] to create a `context` column. (To learn more about the datasets within, read my discussion post.) This Kaggle dataset also contains the competition `train.csv` with an added `context` column (to be used as a validation dataset).

In this training notebook, we have internet turned on and can choose whatever model we wish to download and train. After we finetune this model, we will create a second notebook with the Open Book Q&A technique and load the finetuned model from the output of this notebook. The second notebook will have internet turned off so that it can be submitted to Kaggle's competition.

[1]: https://www.kaggle.com/datasets/cdeotte/60k-data-with-context-v2
[2]: https://www.kaggle.com/code/mgoksu/0-807-sharing-my-trained-with-context-model

In [1]:
!nvidia-smi

Failed to initialize NVML: Unknown Error


In [2]:
###---- Environment config ----###

# MACHINE = "COLAB"
# device = "TPU"

# MACHINE = "KAGGLE"
# device = "TPU-VM"
# device = "GPU"

MACHINE = "JAYOO_PC"
device = "GPU"


# DEBUG = True
DEBUG = False
if DEBUG:
    print("IN DEBUG MODE")
    # device = "CPU"
    
# Set root directory
if MACHINE == "JAYOO_PC":
    ROOT = '/jayoo'  # local
elif MACHINE == "COLAB":
    ROOT = './drive/MyDrive/colab_env'
    from google.colab import drive
    drive.mount('/content/drive')
    !pwd
else:
    ROOT = ''  # Kaggle

print(f"Machine: {MACHINE}, device: {device}, root: {ROOT}")

import multiprocessing
print(multiprocessing.cpu_count())

Machine: JAYOO_PC, device: GPU, root: /jayoo
12


In [3]:
import os
if MACHINE == "KAGGLE":
    os.environ["CUDA_VISIBLE_DEVICES"]="0,1"
import random
from typing import Optional, Union
import pandas as pd, numpy as np, torch
from datasets import Dataset
from dataclasses import dataclass
from transformers import AutoTokenizer
from transformers import EarlyStoppingCallback
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
import gc
import ctypes



In [4]:
# Randomly shuffle the order of the five answer options in one row,
# updating 'answer' so it still points at the correct option text.
def shuffle_answers(row):
    new_row = row.copy()
    answers = ['A', 'B', 'C', 'D', 'E']
    shuffled_ans = answers.copy()
    random.shuffle(shuffled_ans)

    for i in range(len(answers)):
        target = shuffled_ans[i]
        new_row[answers[i]] = row[target]
        if target == row['answer']:
            new_row['answer'] = answers[i]

    return new_row


# Shuffle the answers of every row in df.
# Rows are addressed with df.loc[i], so df is assumed to have the default 0..n-1 index.
def shuffle_df(df):
    for i in range(len(df)):
        df.loc[i] = shuffle_answers(df.loc[i])
    
    return df


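# Replace an empty (formerly NaN) answer option with the text of a random incorrect option.
# Expects NaNs to have been converted to '' beforehand (fillna('')).
# If more than one option is empty, only the last one found is filled.
def fix_nan(row):
    nan_option = None
    options = []
    answers = ['A', 'B', 'C', 'D', 'E']
    for char in answers:
        if len(row[char]) > 0:
            # Only non-empty, incorrect options are candidates to copy from
            if char != row['answer']:
                options.append(char)
        else:
            nan_option = char

    if nan_option is not None:
        copy = random.choice(options)
        # pandas may emit a SettingWithCopyWarning here; the row is written back by replace_nans
        row[nan_option] = row[copy]

    return row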


# replace all nan answers
def replace_nans(df):
    for i in range(len(df)):
        df.loc[i] = fix_nan(df.loc[i])
    
    return df


def clean_memory():
    gc.collect()
    ctypes.CDLL("libc.so.6").malloc_trim(0)
    torch.cuda.empty_cache()


In [5]:
VER=2
# TRAIN WITH SUBSET OF 60K
# NUM_TRAIN_SAMPLES = 52984 #1024
if DEBUG:
    NUM_TRAIN_SAMPLES = 1024

# PARAMETER EFFICIENT FINE TUNING
# PEFT REQUIRES 1XP100 GPU NOT 2XT4
USE_PEFT = False
# NUMBER OF LAYERS TO FREEZE 
# DEBERTA LARGE HAS TOTAL OF 24 LAYERS
FREEZE_LAYERS = 0  #18
# BOOLEAN TO FREEZE EMBEDDINGS
FREEZE_EMBEDDINGS = False  #True
# LENGTH OF CONTEXT PLUS QUESTION ANSWER
MAX_INPUT = 768  # 256

# HUGGING FACE MODEL
MODEL = 'microsoft/deberta-v3-large'
# MODEL = ROOT + '/deberta-v3_model'
# TOK_DIR = ROOT + '/deberta-v3_tokenizer'

In [6]:
df_valid = pd.read_csv(ROOT+'/kaggle/input/bge/science_only/500val_bgeSci_5context.csv')

# bge_1 = pd.read_csv(ROOT+'/kaggle/input/bge/science_only/500val_bgeSci_5context.csv')
# bge_2 = pd.read_csv(ROOT+'/kaggle/input/bge/science_only/bgeSci_data2/500val_bgeSci_data2.csv')
# df_valid = pd.concat([bge_1, bge_2])

print('Validation data size:', df_valid.shape )
df_valid.head()

Validation data size: (500, 8)


Unnamed: 0,prompt,context,A,B,C,D,E,answer
0,What is the method of transcription in the lif...,-There are three different replication systems...,RNA-templated transcription is the method of t...,Transcription occurs through a unique mechanis...,Reverse transcription is the method of transcr...,DNA-templated transcription is the method of t...,Transcription does not occur in the life cycle...,D
1,What is the role of the viral fiber glycoprote...,"-ASFV is a large (175–215 nm), icosahedral, do...",The viral fiber glycoproteins are involved in ...,The viral fiber glycoproteins code for 40 prot...,The viral fiber glycoproteins are responsible ...,The viral fiber glycoproteins mediate endocyto...,The viral fiber glycoproteins are responsible ...,D
2,What is the significance of the faint Hα emiss...,-Single antenna detections Radio observations ...,The emission lines indicate that 3 Geminorum i...,The emission lines indicate that 3 Geminorum i...,The emission lines indicate that 3 Geminorum i...,The emission lines indicate that 3 Geminorum i...,The emission lines indicate that 3 Geminorum i...,A
3,What is the significance of the pedicellariae ...,-Structure The three basic segments of the typ...,They are used for climbing on corals.,They resemble the traps of the Venus fly trap ...,They are covered by short and stout spines.,They are found on the central disc of the sea ...,They are a characteristic feature of the Gonia...,B
4,What is the role of the microprocessor complex...,-The microprocessor complex is a protein compl...,The microprocessor complex is responsible for ...,The microprocessor complex is responsible for ...,The microprocessor complex is involved in the ...,The microprocessor complex is involved in the ...,The microprocessor complex is responsible for ...,A


In [7]:
# tf_c1_v1 = pd.read_csv(ROOT+'/kaggle/input/tf-idf_context/v1/500val_tfidf_context1_v1.csv')
# tf_c2_v1 = pd.read_csv(ROOT+'/kaggle/input/tf-idf_context/v1/500val_tfidf_context2_v1.csv')

# tf_c1_v3 = pd.read_csv(ROOT+'/kaggle/input/tf-idf_context/v3/500val_tfidf_context1_v3.csv')
# tf_c2_v3 = pd.read_csv(ROOT+'/kaggle/input/tf-idf_context/v3/500val_tfidf_context2_v3.csv')

# sci_val = pd.read_csv(ROOT+'/kaggle/input/bge/science_only/500val_bgeSci_5context.csv')
# wiki_val = pd.read_csv(ROOT+'/kaggle/input/bge/prefix/500val_prefix_bge_wikiAbstract.csv')

# all_val = pd.concat([tf_c1_v1, tf_c2_v1, tf_c1_v3, tf_c2_v3, sci_val, wiki_val])
# all_val.to_csv('3k_val.csv', index=False)

In [8]:
# 60k dataset
# # df_train = pd.read_csv(ROOT+'/kaggle/input/60k-data-with-context-v2/all_12_with_context2.csv')

# # bge no prefix
# # df_53k = pd.read_csv(ROOT+'/kaggle/input/bge/no_prefix/53k_bge_wikiAbstract.csv')
# # df_53k = df_53k.fillna('')
# # df_53k = replace_nans(df_53k)
# # df_53k = shuffle_df(df_53k)

# # # bge prefix
df_54k = pd.read_csv(ROOT+'/kaggle/input/bge/prefix/54k_prefix_bge_wikiAbstract.csv')
df_54k = df_54k.fillna('')
df_54k = replace_nans(df_54k)
df_54k = shuffle_df(df_54k)
non_sci = df_54k[(df_54k['source'] != 6) & (df_54k['source'] != 10) & (df_54k['source'] != 11) & (df_54k['source'] != 12)]

# # Combined data
combined = pd.read_csv(ROOT+'/kaggle/input/bge/science_only/177k_sci_context.csv')
# combined = combined.iloc[:118123]

# # 3k val
# all_val = pd.read_csv('3k_val.csv')
# all_val = shuffle_df(all_val)

# # # # 15k gpt original context
# # # # df_train = pd.read_csv(ROOT+'/kaggle/input/datasets/15kgpt_cleaned.csv')
# # # # 15k gpt bge wiki context
# # # gpt_wiki = pd.read_csv(ROOT+'/kaggle/input/bge/prefix/GPT_prefix_bge_wikiAbstract.csv')
# # # gpt_wiki = shuffle_df(gpt_wiki)
# # # # 15k gpt bge sci context
# # gpt_sci = pd.read_csv(ROOT+'/kaggle/input/bge/science_only/GPT_bge_science_v1.csv')

# gpt_tfidf = pd.read_csv(ROOT+'/kaggle/input/tf-idf_context/v3/15k_gpt/15kgpt_tfidf_context2_v3.csv')
# gpt_tfidf = shuffle_df(gpt_tfidf)
# # # gpt_long_10k = pd.read_csv(ROOT+'/kaggle/input/tf-idf_context/v3/15k_gpt/10k_long_tfidf.csv')
# # # gpt_long_10k = shuffle_df(gpt_long_10k)

# tfidf_20k = pd.read_csv(ROOT+'/kaggle/input/tf-idf_context/v3/20k_sci_tfidf/20k_tfidf_v3_fixed.csv')
# tfidf_20k = shuffle_df(tfidf_20k)
# # # sci_20k = pd.read_csv(ROOT+'/kaggle/input/bge/science_only/20k_bgeSci_noNaN.csv')
# # # sci_20k = shuffle_df(sci_20k)


# # from new 70k gpt3
# long_14k_2 = pd.read_csv(ROOT+'/kaggle/input/tf-idf_context/v3/70k_gpt3/14klong_tfidf_context2_v3.csv')
# long_14k_1 = pd.read_csv(ROOT+'/kaggle/input/tf-idf_context/v3/70k_gpt3/14klong_tfidf_context1_v3.csv')
# long_14k_1 = shuffle_df(long_14k_1)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  row[nan_option] = copy_text


In [9]:
# Create three datasets with shuffled answers
# df1 = pd.read_csv(ROOT+'/kaggle/input/bge/no_prefix/53k_bge_wikiAbstract.csv')
# df2 = pd.read_csv(ROOT+'/kaggle/input/bge/prefix/54k_prefix_bge_wikiAbstract.csv')
# df3 = pd.read_csv(ROOT+'/kaggle/input/chris_data/54k_nota.csv')

# df1 = shuffle_df(df1)
# df2 = shuffle_df(df2)
# df3 = shuffle_df(df3)

# df_train = pd.concat([gpt_sci, df_54k, gpt_tfidf, tfidf_20k, long_14k_2, long_14k_1])

df_train = pd.concat([df_54k, combined])

# df_train = pd.read_csv(ROOT+'/kaggle/input/bge/science_only/bgeSci_data2/sci_mix_128k.csv')

In [10]:
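# preprocess data
# Use the full training set, or the smaller debug subset chosen above when DEBUG is on
NUM_TRAIN_SAMPLES = min(NUM_TRAIN_SAMPLES, len(df_train)) if DEBUG else len(df_train)
# df_train = df_train.drop(columns="source")
df_train = df_train.fillna('').sample(NUM_TRAIN_SAMPLES)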
print('Train data size:', df_train.shape)
df_train.head()

Train data size: (231602, 9)


Unnamed: 0,prompt,A,B,C,D,E,answer,context,source
48973,What are biofilms resistant to many common for...,termination,sterilization,termination,assimilation,vaccination,B,"Moreover, from an evolutionary point of view, ...",10.0
11578,What is the notable characteristic of Bristol ...,"In 2010, the Liberal Democrat candidate achiev...",The winning candidate in every election from 1...,Party positions in the constituency remained u...,Bristol North West is known for its high voter...,The constituency has consistently leaned towar...,B,Bristol North West is a constituency represent...,3.0
58940,What is quantum engineering?,The development of technology that capitalizes...,The development of technology that capitalizes...,The development of technology that capitalizes...,The development of technology that capitalizes...,The development of technology that capitalizes...,B,Quantum engineering is the development of tech...,
149643,What are some major components of modern theor...,All of the above,Reaction networks,Theories of electrolyte solutions,Statistical thermodynamics,Molecular dynamics,A,"In recent years, it has consisted primarily of...",
123740,What is the main reason why shales emit more g...,Presence of clay,Presence of uranium and thorium,Presence of dolomite and limestone,Presence of gypsum and coal,Presence of radioactive potassium,E,How are gamma rays and neutrons produced by co...,


In [11]:
df_train = df_train.astype(str)

In [12]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 231602 entries, 48973 to 23609
Data columns (total 9 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   prompt   231602 non-null  object
 1   A        231602 non-null  object
 2   B        231602 non-null  object
 3   C        231602 non-null  object
 4   D        231602 non-null  object
 5   E        231602 non-null  object
 6   answer   231602 non-null  object
 7   context  231602 non-null  object
 8   source   231602 non-null  object
dtypes: object(9)
memory usage: 17.7+ MB


# Data Loader
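The code is from Radek's notebook [here][1], with modifications to the tokenization process: each input is the `context`, followed by ` #### `, the `prompt`, and one answer option, and only the context is truncated so that the whole input fits in `MAX_INPUT` tokens.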

[1]: https://www.kaggle.com/code/radek1/new-dataset-deberta-v3-large-training
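
To make the input layout concrete, the sketch below builds the same five text pairs as the `preprocess` function in the next cell, but without calling the tokenizer. `build_pairs` and the toy row are hypothetical and only illustrate the string format.

```python
def build_pairs(example, options="ABCDE"):
    """Return one (context, question + option) text pair per answer option."""
    first = ["[CLS] " + example["context"]] * len(options)
    second = [
        " #### " + example["prompt"] + " [SEP] " + example[option] + " [SEP]"
        for option in options
    ]
    return list(zip(first, second))

# Toy row, invented for illustration only
example = {
    "context": "Water boils at 100 C at sea level.",
    "prompt": "At what temperature does water boil at sea level?",
    "A": "50 C", "B": "100 C", "C": "150 C", "D": "200 C", "E": "0 C",
}
pairs = build_pairs(example)
print(len(pairs))  # 5
for context_text, question_text in pairs:
    print(repr(context_text), repr(question_text))
```

Every pair shares the same context, so the model scores the five options against identical evidence and the label is simply the index of the correct pair.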

In [13]:
option_to_index = {option: idx for idx, option in enumerate('ABCDE')}
index_to_option = {v: k for k,v in option_to_index.items()}

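def preprocess(example):
    # Repeat the context once per answer option. Special tokens are written by hand
    # because the tokenizer is called with add_special_tokens=False.
    first_sentence = [ "[CLS] " + example['context'] ] * 5
    second_sentences = [" #### " + example['prompt'] + " [SEP] " + example[option] + " [SEP]" for option in 'ABCDE']
    # truncation='only_first' shortens only the context, so the question and option stay intact
    tokenized_example = tokenizer(first_sentence, second_sentences, truncation='only_first', 
                                  max_length=MAX_INPUT, add_special_tokens=False)
    # Convert the answer letter to a class index (A -> 0, ..., E -> 4)
    tokenized_example['label'] = option_to_index[example['answer']]
    
    return tokenized_example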

@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    
    def __call__(self, features):
        label_name = 'label' if 'label' in features[0].keys() else 'labels'
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch

In [14]:
tokenizer = AutoTokenizer.from_pretrained(MODEL)
dataset_valid = Dataset.from_pandas(df_valid)
dataset = Dataset.from_pandas(df_train)
dataset = dataset.remove_columns(["__index_level_0__"])
dataset

Downloading (…)okenizer_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/580 [00:00<?, ?B/s]

Downloading spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Dataset({
    features: ['prompt', 'A', 'B', 'C', 'D', 'E', 'answer', 'context', 'source'],
    num_rows: 231602
})

In [15]:
tokenized_dataset_valid = dataset_valid.map(preprocess, remove_columns=['prompt', 'context', 'A', 'B', 'C', 'D', 'E', 'answer'])
tokenized_dataset = dataset.map(preprocess, remove_columns=['prompt', 'context', 'A', 'B', 'C', 'D', 'E', 'answer'])
tokenized_dataset

  0%|          | 0/500 [00:00<?, ?ex/s]

  0%|          | 0/231602 [00:00<?, ?ex/s]

Dataset({
    features: ['source', 'input_ids', 'token_type_ids', 'attention_mask', 'label'],
    num_rows: 231602
})

# Build Model
We will use a Hugging Face `AutoModelForMultipleChoice`. For the list of possible models, see Hugging Face's repository [here][1]. We can optionally use PEFT to accelerate training and use less memory. However, I have noticed that validation accuracy is lower. (Note that PEFT requires us to use 1xP100, not 2xT4 GPUs. I'm not sure why.) We can also optionally freeze layers. This also accelerates training and uses less memory, but validation accuracy may drop.

[1]: https://huggingface.co/models

In [16]:
model = AutoModelForMultipleChoice.from_pretrained(MODEL)

Downloading pytorch_model.bin:   0%|          | 0.00/874M [00:00<?, ?B/s]

Some weights of DebertaV2ForMultipleChoice were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['pooler.dense.bias', 'classifier.bias', 'pooler.dense.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
# NOTE PEFT REQUIRES US TO USE 1XP100 NOT 2XT4. I'M NOT SURE WHY.
if USE_PEFT:
    !pip install --no-index --no-deps /kaggle/input/llm-whls/peft-0.4.0-py3-none-any.whl

In [18]:
if USE_PEFT:
    print('We are using PEFT.')
    from peft import LoraConfig, get_peft_model, TaskType
    peft_config = LoraConfig(
        r=8, lora_alpha=4, task_type=TaskType.SEQ_CLS, lora_dropout=0.1, 
        bias="none", inference_mode=False, 
        target_modules=["query_proj", "value_proj"],
        modules_to_save=['classifier','pooler'],
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()

In [19]:
if FREEZE_EMBEDDINGS:
    print('Freezing embeddings.')
    for param in model.deberta.embeddings.parameters():
        param.requires_grad = False
if FREEZE_LAYERS>0:
    print(f'Freezing {FREEZE_LAYERS} layers.')
    for layer in model.deberta.encoder.layer[:FREEZE_LAYERS]:
        for param in layer.parameters():
            param.requires_grad = False

# MAP@3 Metric
The competition metric is MAP@3, so we will write a custom metric function to add to Hugging Face's trainer. Discussion [here][1].

[1]: https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/435602
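
Since each question has exactly one correct answer, MAP@3 reduces to 1, 1/2, or 1/3 when the correct option is ranked first, second, or third, and 0 otherwise, averaged over questions. The pure-Python sketch below illustrates this on toy scores; `map_at_3_plain` is a hypothetical name chosen so it does not shadow the notebook's own function.

```python
def map_at_3_plain(logits, labels):
    """MAP@3 for single-answer questions: 1/rank if the label is in the top 3, else 0."""
    total = 0.0
    for scores, label in zip(logits, labels):
        # Indices of the three highest-scoring options, best first
        top3 = sorted(range(len(scores)), key=lambda i: -scores[i])[:3]
        for rank, idx in enumerate(top3, start=1):
            if idx == label:
                total += 1.0 / rank
    return total / len(labels)

# Toy scores, invented for illustration
logits = [
    [0.1, 0.9, 0.0, 0.0, 0.0],  # label 1 is ranked 1st -> contributes 1
    [0.5, 0.4, 0.3, 0.0, 0.0],  # label 1 is ranked 2nd -> contributes 1/2
    [0.5, 0.4, 0.3, 0.2, 0.1],  # label 4 is outside the top 3 -> contributes 0
]
labels = [1, 1, 4]
print(map_at_3_plain(logits, labels))  # 0.5
```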

In [20]:
def map_at_3(predictions, labels):
    map_sum = 0
    pred = np.argsort(-1*np.array(predictions),axis=1)[:,:3]
    for x,y in zip(pred,labels):
        z = [1/i if y==j else 0 for i,j in zip([1,2,3],x)]
        map_sum += np.sum(z)
    return map_sum / len(predictions)

def compute_metrics(p):
    predictions = p.predictions.tolist()
    labels = p.label_ids.tolist()
    return {"map@3": map_at_3(predictions, labels)}

# Train and Save 
We will now train and save our model using Hugging Face's easy-to-use trainer. By adjusting the parameters in this notebook, we can achieve `CV MAP@3 = 0.915+` and a corresponding single-model `LB MAP@3 = 0.830+`, wow!

If we run this notebook outside of Kaggle, then we can train longer and with more RAM. If we run this notebook on Kaggle, then we need to use tricks to train models efficiently. Here are some ideas:
* use fp16 (this speeds up T4 not P100)
* use gradient_accumulation_steps (this simulates larger batch sizes)
* use gradient_checkpointing (this recomputes activations during the backward pass instead of storing them, trading extra compute for less GPU memory)
* use 2xT4 instead of 1xP100 (this doubles GPUs)
* freeze model embeddings (this reduces weights to train)
* freeze some model layers (this reduces weights to train)
* use PEFT (this reduces weights to train)
* increase LR and decrease epochs (this reduces work)
* use smaller models (this reduces weights to train)
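
To see how gradient accumulation simulates a larger batch, the arithmetic can be sketched as below. The values 8 and 4 match the `per_device_train_batch_size` and `gradient_accumulation_steps` used in this notebook's training arguments; the function names are hypothetical, and the update count is only an approximation because the trainer's handling of the last partial batch may differ slightly.

```python
import math

def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus=1):
    """Number of samples that contribute to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_gpus

def approx_updates_per_epoch(num_samples, per_device_batch, grad_accum_steps, num_gpus=1):
    """Approximate optimizer updates needed to see every sample once."""
    batch = effective_batch_size(per_device_batch, grad_accum_steps, num_gpus)
    return math.ceil(num_samples / batch)

print(effective_batch_size(8, 4))              # 32
print(approx_updates_per_epoch(231602, 8, 4))  # 7238
```

Halving the per-device batch and doubling the accumulation steps keeps the effective batch size the same while lowering peak memory, at the cost of more forward and backward passes per update.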

In [21]:
# del trainer
# del model
# clean_memory()

In [22]:
SAVE_STEPS = 200
EVAL_STEPS = 200
if DEBUG:  # don't save
    SAVE_STEPS = 1000000
    EVAL_STEPS = 10
    
training_args = TrainingArguments(
    warmup_ratio=0.1,
    learning_rate=2e-6, #2e-5
    per_device_train_batch_size=8, #1
    per_device_eval_batch_size=8,  #2
    num_train_epochs=2,  #2
    report_to='none',
    output_dir = f'./checkpoints_{VER}',
    overwrite_output_dir=True,
    fp16=True,
    gradient_accumulation_steps=4, # 8
    logging_steps=EVAL_STEPS,
    evaluation_strategy='steps',
    eval_steps=EVAL_STEPS,
    save_strategy="steps",
    save_steps=SAVE_STEPS,
    load_best_model_at_end=False,
    metric_for_best_model='map@3',
    lr_scheduler_type='cosine', #'cosine'
    weight_decay=0.01,
    save_total_limit=2,
    gradient_checkpointing=True,
)


# checkpoint = None  # train from scratch
checkpoint = ROOT + "/checkpoints_2/checkpoint-5200"  # resume checkpoint

ValueError: FP16 Mixed precision training with AMP or APEX (`--fp16`) and FP16 half precision evaluation (`--fp16_full_eval`) can only be used on CUDA or NPU devices.

In [None]:
torch.version.cuda

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset_valid,
    compute_metrics = compute_metrics,
    #callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)

trainer.train(resume_from_checkpoint=checkpoint)
trainer.save_model(f'model_v{VER}')

# Verify Saved Model
During training, we see the MAP@3 validation score above. Let's load the saved model and compute it again here to verify that our model is saved correctly.
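
The cell below ranks the five options by score and keeps the top three letters as a space-separated string. As a minimal sketch of that ranking step without NumPy (`top3_letters` is a hypothetical helper and the scores are invented):

```python
def top3_letters(scores, options="ABCDE"):
    """Return the three highest-scoring option letters, best first, space-separated."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return " ".join(options[i] for i in order[:3])

print(top3_letters([0.1, 2.3, -1.0, 0.7, 0.0]))  # B D A
```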

In [None]:
# del model, trainer
# if USE_PEFT:
#     model = AutoModelForMultipleChoice.from_pretrained(MODEL)
#     model = get_peft_model(model, peft_config)
#     checkpoint = torch.load(f'model_v{VER}/pytorch_model.bin')
#     model.load_state_dict(checkpoint)
# else:
#     model = AutoModelForMultipleChoice.from_pretrained(f'model_v{VER}')
# trainer = Trainer(model=model)

# test_df = pd.read_csv(ROOT+'/kaggle/input/60k-data-with-context-v2/train_with_context2.csv')
# tokenized_test_dataset = Dataset.from_pandas(test_df).map(
#         preprocess, remove_columns=['prompt', 'context', 'A', 'B', 'C', 'D', 'E'])

test_predictions = trainer.predict(tokenized_dataset_valid).predictions
test_df = df_valid.copy()
predictions_as_ids = np.argsort(-test_predictions, 1)
predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_as_ids]
predictions_as_string = test_df['prediction'] = [
    ' '.join(row) for row in predictions_as_answer_letters[:, :3]
]

# Compute Validation Score

# https://www.kaggle.com/code/philippsinger/h2ogpt-perplexity-ranking
import numpy as np
def precision_at_k(r, k):
    """Precision at k"""
    assert k <= len(r)
    assert k != 0
    return sum(int(x) for x in r[:k]) / k

def MAP_at_3(predictions, true_items):
    """Score is mean average precision at 3"""
    U = len(predictions)
    map_at_3 = 0.0
    for u in range(U):
        user_preds = predictions[u].split()
        user_true = true_items[u]
        user_results = [1 if item == user_true else 0 for item in user_preds]
        for k in range(min(len(user_preds), 3)):
            map_at_3 += precision_at_k(user_results, k+1) * user_results[k]
    return map_at_3 / U

m = MAP_at_3(test_df.prediction.values, test_df.answer.values)
print( 'CV MAP@3 =',m )



# Inference

In [None]:
# del tokenized_test_dataset
# # del trainer
# # del dataset
# clean_memory()

In [None]:
# from scipy.special import softmax
# from torch.utils.data import DataLoader

# def preprocess(example):
#     first_sentence = [example['prompt']] * 5
#     second_sentence = []
#     for option in options:
#         second_sentence.append(example[option])
    
#     tokenized_example = tokenizer(first_sentence, second_sentence, truncation='only_first')
#     tokenized_example['label'] = option_to_index[example['answer']]
#     return tokenized_example

# # Compute Validation Score
# # https://www.kaggle.com/code/philippsinger/h2ogpt-perplexity-ranking
# def precision_at_k(r, k):
#     """Precision at k"""
#     assert k <= len(r)
#     assert k != 0
#     return sum(int(x) for x in r[:k]) / k

# def MAP_at_3(predictions, true_items):
#     """Score is mean average precision at 3"""
#     U = len(predictions)
#     map_at_3 = 0.0
#     for u in range(U):
#         user_preds = predictions[u].split()
#         user_true = true_items[u]
#         user_results = [1 if item == user_true else 0 for item in user_preds]
#         for k in range(min(len(user_preds), 3)):
#             map_at_3 += precision_at_k(user_results, k+1) * user_results[k]
#     return map_at_3 / U

# # for formatting predictions as strings
# options = 'ABCDE'
# indices = list(range(5))
# option_to_index = {option: index for option, index in zip(options, indices)}
# index_to_option = {index: option for option, index in zip(options, indices)}

In [None]:
# test_df = pd.read_csv(ROOT+'/kaggle/input/bge/science_only/500val_bgeSci_5context.csv')
# print('Validation data size:', test_df.shape )
# test_df.head()

In [None]:
# test_df["prompt"] = test_df["context"].apply(lambda x: x[:2300]) + " #### " +  test_df["prompt"]  
# data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
# tokenized_test_dataset = Dataset.from_pandas(test_df[['prompt', 'A', 'B', 'C', 'D', 'E', 'answer']]).map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
# # tokenized_test_dataset = tokenized_test_dataset.remove_columns(["__index_level_0__"])
# test_dataloader = DataLoader(tokenized_test_dataset, batch_size=1, shuffle=False, collate_fn=data_collator)

In [None]:
# # predict
# test_predictions = []
# for batch in test_dataloader:
#     for k in batch.keys():
#         batch[k] = batch[k].to(torch.device('cuda:0'))
#     with torch.no_grad():
#         outputs = model(**batch)
#     test_predictions.append(outputs.logits.cpu().detach())

# test_predictions = torch.cat(test_predictions)
# test_predictions = softmax(test_predictions, axis=1).numpy()

# bge_preds = test_predictions

In [None]:
# combined_predictions = test_predictions
# predictions_as_ids = np.argsort(-combined_predictions, 1)
# predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_as_ids]
# predictions_as_string = test_df['prediction'] = [
#     ' '.join(row) for row in predictions_as_answer_letters[:, :3]
# ]
# # print MAP score
# m = MAP_at_3(test_df.prediction.values, test_df.answer.values)
# print( 'CV MAP@3 =',m )

In [None]:
# combined_predictions = (tfidf_preds_1 + tfidf_preds_2 + bge_preds) / 3
# predictions_as_ids = np.argsort(-combined_predictions, 1)
# predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_as_ids]
# predictions_as_string = test_df['prediction'] = [
#     ' '.join(row) for row in predictions_as_answer_letters[:, :3]
# ]

# # print MAP score
# m = MAP_at_3(test_df.prediction.values, test_df.answer.values)
# print( 'CV MAP@3 =',m )