# LLM Science Exam Optimise Ensemble Weights 

Among the high-scoring notebooks in this competition, ensembles of multiple models stand out. Ensembles are known empirically to be very powerful in NLP competitions.

[The voting ensemble was introduced](https://www.kaggle.com/code/radek1/an-introduction-to-voting-ensemble) by [radek1](https://www.kaggle.com/radek1), and many notebooks have been published on this basis.

On the other hand, ensembles that blend predicted probabilities appear to be used less often.

This notebook introduces ensembles using probabilities and shows how to optimise model weights with **scipy.optimize**.
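
A minimal sketch of the idea, not necessarily the code used in this notebook: stack each model's class probabilities, blend them with non-negative weights that sum to one, and let `scipy.optimize.minimize` choose the weights on a labelled hold-out set. The synthetic predictions, the helper names (`map_at_3`, `blend`, `blend_log_loss`, `optimise_weights`, `fake_model`) and the choice of log-loss with the SLSQP method are all assumptions made for illustration. MAP@3 is only reported afterwards, because it is piecewise constant in the weights and so gives a gradient-based optimiser nothing to follow.

```python
import numpy as np
from scipy.optimize import minimize


def map_at_3(probs, labels):
    """Mean average precision at 3 when each row has exactly one correct label."""
    top3 = np.argsort(-probs, axis=1)[:, :3]
    hits = top3 == labels[:, None]
    return float((hits / np.arange(1, 4)).sum(axis=1).mean())


def blend(weights, preds):
    """Weighted sum over the model axis; preds is (n_models, n_samples, n_classes)."""
    return np.tensordot(weights, preds, axes=1)


def blend_log_loss(weights, preds, labels):
    """Cross-entropy of the blended probabilities against the true labels."""
    probs = np.clip(blend(weights, preds), 1e-7, 1.0)
    return float(-np.log(probs[np.arange(len(labels)), labels]).mean())


def optimise_weights(preds, labels):
    """Minimise the blended log-loss with weights in [0, 1] that sum to 1."""
    n_models = preds.shape[0]
    start = np.full(n_models, 1.0 / n_models)
    result = minimize(
        blend_log_loss,
        start,
        args=(preds, labels),
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_models,
        constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
    )
    return result.x


# Synthetic stand-in for hold-out predictions (hypothetical data, not from the notebook)
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=300)
one_hot = np.eye(5)[labels]


def fake_model(noise):
    """Return softmax probabilities from noisy logits; larger noise means a weaker model."""
    logits = 2.0 * one_hot + noise * rng.normal(size=one_hot.shape)
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


preds = np.stack([fake_model(noise) for noise in (1.0, 2.0, 4.0)])
equal = np.full(3, 1.0 / 3.0)
weights = optimise_weights(preds, labels)

print("weights:", np.round(weights, 3))
print("MAP@3 equal / optimised:",
      map_at_3(blend(equal, preds), labels),
      map_at_3(blend(weights, preds), labels))
```

In practice `preds` would hold the per-model validation probabilities and `labels` the integer-encoded answers; the same weights are then applied to the test probabilities.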

Normally, OOF (out-of-fold) predictions are used to optimise model weights. However, the training data used appears to be mixed, and most of the available weights are for single models. Therefore, I'll use an evaluation dataset that appears not to have been used for training: the [MMLU-Dataset](https://www.kaggle.com/datasets/peiyuanliu2001/mmlu-dataset) shared by [Peiyuan Liu](https://www.kaggle.com/peiyuanliu2001). [See his discussion for details.](https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/433168) Please note that this dataset contains more than just STEM questions, so it may not be suitable as an evaluation dataset.

edit: Submission somehow failed because of the MMLU dataset, so I've created a separate dataset.

edit: [Chris Deotte](https://www.kaggle.com/cdeotte) once again published an [amazing dataset](https://www.kaggle.com/datasets/cdeotte/60k-data-with-context-v2) and notebooks. His [training code is here](https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model-part-1) and his [inference code is here](https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model-part-2). This version also uses his trained weights.

### References (see these as well)

Weight optimization

* [Optimise Blending Weights with Bonus :0](https://www.kaggle.com/code/gogo827jz/optimise-blending-weights-with-bonus-0/notebook) by [Yirun Zhang](https://www.kaggle.com/gogo827jz)

OpenBook and its tuning (too many to list, so only a selection)

* [OpenBook DeBERTaV3-Large Baseline (Single Model)](https://www.kaggle.com/code/nlztrk/openbook-debertav3-large-baseline-single-model) by [Anil Ozturk](https://www.kaggle.com/nlztrk)

* [[0.807] Sharing my trained-with-context model](https://www.kaggle.com/code/mgoksu/0-807-sharing-my-trained-with-context-model/notebook) by [MGöksu](https://www.kaggle.com/mgoksu)

Training and inference on the OpenBook dataset with context

* [How To Train Open Book Model - Part 1](https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model-part-1) by [Chris Deotte](https://www.kaggle.com/cdeotte)

* [How To Train Open Book Model - Part 2](https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model-part-2) by [Chris Deotte](https://www.kaggle.com/cdeotte)

Voting ensemble (too many to list, so just the original)

* [The voting ensemble was introduced](https://www.kaggle.com/code/radek1/an-introduction-to-voting-ensemble) by [radek1](https://www.kaggle.com/radek1)

### My other Notebooks

In this competition

* [Incorporate MAP@k metrics into HF Trainer](https://www.kaggle.com/code/itsuki9180/incorporate-map-k-metrics-into-hf-trainer)

* [Introducing Adversarial Weight Perturbation (AWP)](https://www.kaggle.com/code/itsuki9180/introducing-adversarial-weight-perturbation-awp)

* [Adversarial Weight Perturbation (AWP) Inference](https://www.kaggle.com/code/itsuki9180/adversarial-weight-perturbation-awp-inference)

* [Using DeepSpeed with HF🤗 Trainer](https://www.kaggle.com/code/itsuki9180/using-deepspeed-with-hf-trainer)

Weight optimization (almost the same as Yirun Zhang's)

* [G2Net_oof_weight_optimizer](https://www.kaggle.com/code/itsuki9180/g2net-oof-weight-optimizer)

# How To Train Model for Open Book Q&A Technique - Part 2
The notebook you are reading is a fork of Mgoksu's great notebook [here][1]. Mgoksu (@mgoksu) demonstrated how to achieve a top public LB score of 0.807 using the Open Book technique. The Open Book method was first presented by JJ (@jjinho) [here][2], then Quangteo (@quangbk) improved RAM usage [here][3], and Anil (@nlztrk) combined it with Q&A [here][4]. Radek (@radek1) demonstrated the strength of Q&A [here][5].

In my previous notebook [here][6] (i.e. Part 1), we demonstrated how to train a model for Open Book. The model was trained using my 60k Kaggle dataset [here][7]. If you enjoy the notebook you are reading, please upvote the dataset too. Thanks!

In this notebook, we will load the trained model output from my previous notebook. We will run inference with this model after running the code from Mgoksu's public notebook, which uses Open Book to search Wikipedia for context. For each test sample in the hidden dataset, we will append Wikipedia context. Then our trained model will infer the multiple choice answer (using both the question and the appended Wikipedia context). When predicting the answer, this notebook uses a 50/50 ensemble of the new Q&A model we trained and Mgoksu's original model. Here is a diagram showing the Open Book method:

![](https://miro.medium.com/v2/resize:fit:800/format:webp/1*bTGY3fKIgNefQxNsOYpnBw.png)

(image source [here][8])

[1]: https://www.kaggle.com/code/mgoksu/0-807-sharing-my-trained-with-context-model
[2]: https://www.kaggle.com/code/jjinho/open-book-llm-science-exam
[3]: https://www.kaggle.com/code/quangbk/open-book-llm-science-exam-reduced-ram-usage
[4]: https://www.kaggle.com/code/nlztrk/openbook-debertav3-large-baseline-single-model
[5]: https://www.kaggle.com/code/radek1/new-dataset-deberta-v3-large-training
[6]: https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model
[7]: https://www.kaggle.com/datasets/cdeotte/60k-data-with-context-v2
[8]: https://blog.gopenai.com/enrich-llms-with-retrieval-augmented-generation-rag-17b82a96b6f0
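
The retrieve-then-read flow described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the notebook's pipeline: the three-dimensional "embeddings" are hand-written toy vectors instead of sentence-transformer outputs, plain NumPy cosine similarity stands in for the Faiss index, and the helper names `normalise`, `top_k` and `build_prompt` are hypothetical. Only the final prompt layout (truncated context, then `" #### "`, then the question) mirrors the inference cells further down.

```python
import numpy as np


def normalise(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def top_k(query, corpus, k):
    """Return the indices of the k corpus vectors most similar to the query."""
    scores = normalise(corpus) @ normalise(query)
    return np.argsort(-scores)[:k]


def build_prompt(sentences, question, context_len):
    """Join retrieved sentences, truncate to context_len characters, append the question."""
    context = " ".join(sentences)
    return context[:context_len] + " #### " + question


# Toy corpus with hand-written vectors (assumed values for illustration only)
sentences = [
    "The mitochondrion produces ATP.",
    "Paris is the capital of France.",
    "ATP stores chemical energy.",
]
corpus = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9],
    [0.8, 0.3, 0.1],
])
question = "Which organelle produces ATP?"
query = np.array([1.0, 0.2, 0.0])

idx = top_k(query, corpus, k=2)
prompt = build_prompt([sentences[i] for i in idx], question, context_len=2300)
print(prompt)
```

The multiple-choice model then scores each answer option against this context-augmented prompt.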

In [None]:
# installing offline dependencies
!pip install -U /kaggle/input/faiss-gpu-173-python310/faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
!cp -rf /kaggle/input/sentence-transformers-222/sentence-transformers /kaggle/working/sentence-transformers
!pip install -U /kaggle/working/sentence-transformers
!pip install -U /kaggle/input/blingfire-018/blingfire-0.1.8-py3-none-any.whl

!pip install --no-index --no-deps /kaggle/input/llm-whls/transformers-4.31.0-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/peft-0.4.0-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/datasets-2.14.3-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/trl-0.5.0-py3-none-any.whl

In [None]:
from __future__ import annotations  # placed first so the cell is also valid as a plain script

import os, time
import gc
import re
import ctypes
from collections.abc import Iterable
from dataclasses import dataclass
from typing import Optional, Union

import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import blingfire as bf

import faiss
from faiss import write_index, read_index

from sentence_transformers import SentenceTransformer

import torch
from torch.utils.data import DataLoader
from datasets import Dataset
from transformers import AutoTokenizer
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy

from scipy.special import softmax

# libc handle, used below to call malloc_trim(0) and hand freed memory back to the OS
libc = ctypes.CDLL("libc.so.6")

In [None]:
SIM_MODEL = '/kaggle/input/sentencetransformers-allminilml6v2/sentence-transformers_all-MiniLM-L6-v2'
DEVICE = 0
MAX_LENGTH = 384
BATCH_SIZE = 32

trn = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/test.csv").drop("id", axis=1)

DEBUG = False
# DEBUG = False if len(trn)!=200 else True # Uncomment this line to save GPU quota; note that accurate weights then cannot be obtained when the notebook is saved
FILTER_LEN = 1 if DEBUG else 10
IND_SEARCH = 1 if DEBUG else 7
NUM_SENTENCES_INCLUDE = 1 if DEBUG else 22
CONTEXT_LEN = 1000 if DEBUG else 2300
VAL_SIZE = 200 if DEBUG else 1500

WIKI_PATH = "/kaggle/input/wikipedia-20230701"
wiki_files = os.listdir(WIKI_PATH)

In [None]:
def process_documents(documents: Iterable[str],
                      document_ids: Iterable,
                      split_sentences: bool = True,
                      filter_len: int = FILTER_LEN,
                      disable_progress_bar: bool = False) -> pd.DataFrame:
    
    df = sectionize_documents(documents, document_ids, disable_progress_bar)

    if split_sentences:
        df = sentencize(df.text.values, 
                        df.document_id.values,
                        df.offset.values, 
                        filter_len, 
                        disable_progress_bar)
    return df


def sectionize_documents(documents: Iterable[str],
                         document_ids: Iterable,
                         disable_progress_bar: bool = False) -> pd.DataFrame:

    processed_documents = []
    for document_id, document in tqdm(zip(document_ids, documents), total=len(documents), disable=disable_progress_bar):
        row = {}
        text, start, end = (document, 0, len(document))
        row['document_id'] = document_id
        row['text'] = text
        row['offset'] = (start, end)

        processed_documents.append(row)

    _df = pd.DataFrame(processed_documents)
    if _df.shape[0] > 0:
        return _df.sort_values(['document_id', 'offset']).reset_index(drop=True)
    else:
        return _df


def sentencize(documents: Iterable[str],
               document_ids: Iterable,
               offsets: Iterable[tuple[int, int]],
               filter_len: int = FILTER_LEN,
               disable_progress_bar: bool = False) -> pd.DataFrame:

    document_sentences = []
    for document, document_id, offset in tqdm(zip(documents, document_ids, offsets), total=len(documents), disable=disable_progress_bar):
        try:
            _, sentence_offsets = bf.text_to_sentences_and_offsets(document)
            for o in sentence_offsets:
                if o[1]-o[0] > filter_len:
                    sentence = document[o[0]:o[1]]
                    abs_offsets = (o[0]+offset[0], o[1]+offset[0])
                    row = {}
                    row['document_id'] = document_id
                    row['text'] = sentence
                    row['offset'] = abs_offsets
                    document_sentences.append(row)
        except Exception:
            # Skip documents that blingfire fails to split into sentences
            continue
    return pd.DataFrame(document_sentences)

In [None]:
trn = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/test.csv").drop("id", axis=1)
trn.head()

In [None]:
val = pd.read_csv('/kaggle/input/mmlu-dataset-valid-only/valid_mmlu_1526_ind0.csv',index_col=0)[:VAL_SIZE]

val['E'] = '' # MMLU questions have only four options; add an empty dummy option E so the five-option preprocessing works
val = val.replace(np.nan, '')

val['A'] = val['A'].map(str)
val['B'] = val['B'].map(str)
val['C'] = val['C'].map(str)
val['D'] = val['D'].map(str)
val['E'] = val['E'].map(str)

val.head()

In [None]:
model = SentenceTransformer(SIM_MODEL, device='cuda')
model.max_seq_length = MAX_LENGTH
model = model.half()

In [None]:
sentence_index = read_index("/kaggle/input/wikipedia-2023-07-faiss-index/wikipedia_202307.index")

In [None]:
prompt_embeddings = model.encode(trn.prompt.values, batch_size=BATCH_SIZE, device=DEVICE, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True)
prompt_embeddings = prompt_embeddings.detach().cpu().numpy()

prompt_embeddings_v = model.encode(val.prompt.values, batch_size=BATCH_SIZE, device=DEVICE, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True)
prompt_embeddings_v = prompt_embeddings_v.detach().cpu().numpy()

_ = gc.collect()

In [None]:
## Get the top IND_SEARCH pages that are likely to contain the topic of interest
search_score, search_index = sentence_index.search(prompt_embeddings, IND_SEARCH)

search_score_v, search_index_v = sentence_index.search(prompt_embeddings_v, IND_SEARCH)

In [None]:
## Save memory - delete sentence_index since it is no longer necessary
del sentence_index
del prompt_embeddings,  prompt_embeddings_v
_ = gc.collect()
libc.malloc_trim(0)

# Getting Sentences from the Relevant Titles

In [None]:
df = pd.read_parquet("/kaggle/input/wikipedia-20230701/wiki_2023_index.parquet", columns=['id', 'file'])

In [None]:
## Get the article and associated file location using the index
wikipedia_file_data = []
wikipedia_file_data_v = []

for i, (scr, idx) in tqdm(enumerate(zip(search_score, search_index)), total=len(search_score)):
    scr_idx = idx
    _df = df.loc[scr_idx].copy()
    _df['prompt_id'] = i
    wikipedia_file_data.append(_df)
wikipedia_file_data = pd.concat(wikipedia_file_data).reset_index(drop=True)
wikipedia_file_data = wikipedia_file_data[['id', 'prompt_id', 'file']].drop_duplicates().sort_values(['file', 'id']).reset_index(drop=True)

for i, (scr, idx) in tqdm(enumerate(zip(search_score_v, search_index_v)), total=len(search_score_v)):
    scr_idx = idx
    _df = df.loc[scr_idx].copy()
    _df['prompt_id'] = i
    wikipedia_file_data_v.append(_df)
wikipedia_file_data_v = pd.concat(wikipedia_file_data_v).reset_index(drop=True)
wikipedia_file_data_v = wikipedia_file_data_v[['id', 'prompt_id', 'file']].drop_duplicates().sort_values(['file', 'id']).reset_index(drop=True)


del df
_ = gc.collect()
libc.malloc_trim(0)

In [None]:
## Get the full text data
wiki_text_data = []
wiki_text_data_v = []

for file in tqdm(wikipedia_file_data.file.unique(), total=len(wikipedia_file_data.file.unique())):
    _id = [str(i) for i in wikipedia_file_data[wikipedia_file_data['file']==file]['id'].tolist()]
    _df = pd.read_parquet(f"{WIKI_PATH}/{file}", columns=['id', 'text'])

    _df_temp = _df[_df['id'].isin(_id)].copy()
    del _df
    _ = gc.collect()
    libc.malloc_trim(0)
    wiki_text_data.append(_df_temp)
wiki_text_data = pd.concat(wiki_text_data).drop_duplicates().reset_index(drop=True)

for file in tqdm(wikipedia_file_data_v.file.unique(), total=len(wikipedia_file_data_v.file.unique())):
    _id = [str(i) for i in wikipedia_file_data_v[wikipedia_file_data_v['file']==file]['id'].tolist()]
    _df = pd.read_parquet(f"{WIKI_PATH}/{file}", columns=['id', 'text'])

    _df_temp = _df[_df['id'].isin(_id)].copy()
    del _df
    _ = gc.collect()
    libc.malloc_trim(0)
    wiki_text_data_v.append(_df_temp)
wiki_text_data_v = pd.concat(wiki_text_data_v).drop_duplicates().reset_index(drop=True)


_ = gc.collect()

In [None]:
processed_wiki_text_data = process_documents(wiki_text_data.text.values, wiki_text_data.id.values)

processed_wiki_text_data_v = process_documents(wiki_text_data_v.text.values, wiki_text_data_v.id.values)

In [None]:

wiki_data_embeddings = model.encode(processed_wiki_text_data.text,
                                    batch_size=BATCH_SIZE,
                                    device=DEVICE,
                                    show_progress_bar=True,
                                    convert_to_tensor=True,
                                    normalize_embeddings=True)#.half()
wiki_data_embeddings = wiki_data_embeddings.detach().cpu().numpy()

wiki_data_embeddings_v = model.encode(processed_wiki_text_data_v.text,
                                    batch_size=BATCH_SIZE,
                                    device=DEVICE,
                                    show_progress_bar=True,
                                    convert_to_tensor=True,
                                    normalize_embeddings=True)#.half()
wiki_data_embeddings_v = wiki_data_embeddings_v.detach().cpu().numpy()

In [None]:
_ = gc.collect()

In [None]:
trn['answer_all'] = trn.apply(lambda x: " ".join([x['A'], x['B'], x['C'], x['D'], x['E']]), axis=1)
trn['prompt_answer_stem'] = trn['prompt'] + " " + trn['answer_all']

In [None]:
val['A'] = val['A'].map(str)
val['B'] = val['B'].map(str)
val['C'] = val['C'].map(str)
val['D'] = val['D'].map(str)
val['E'] = val['E'].map(str)

val['answer_all'] = val.apply(lambda x: " ".join([x['A'], x['B'], x['C'], x['D'], x['E']]), axis=1)
val['prompt_answer_stem'] = val['prompt'] + " " + val['answer_all']

In [None]:
question_embeddings = model.encode(trn.prompt_answer_stem.values, batch_size=BATCH_SIZE, device=DEVICE, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True)
question_embeddings = question_embeddings.detach().cpu().numpy()

question_embeddings_v = model.encode(val.prompt_answer_stem.values, batch_size=BATCH_SIZE, device=DEVICE, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True)
question_embeddings_v = question_embeddings_v.detach().cpu().numpy()

# Extracting Matching Prompt-Sentence Pairs

In [None]:
contexts = []
contexts_v = []

for r in tqdm(trn.itertuples(), total=len(trn)):

    prompt_id = r.Index

    prompt_indices = processed_wiki_text_data[processed_wiki_text_data['document_id'].isin(wikipedia_file_data[wikipedia_file_data['prompt_id']==prompt_id]['id'].values)].index.values

    # Reset per prompt so an empty match cannot reuse the previous prompt's context
    context = ""

    if prompt_indices.shape[0] > 0:
        prompt_index = faiss.index_factory(wiki_data_embeddings.shape[1], "Flat")
        prompt_index.add(wiki_data_embeddings[prompt_indices])

        ## Get the top matches for this prompt only
        ss, ii = prompt_index.search(question_embeddings[prompt_id:prompt_id + 1], NUM_SENTENCES_INCLUDE)
        for _s, _i in zip(ss[0], ii[0]):
            context += processed_wiki_text_data.loc[prompt_indices]['text'].iloc[_i] + " "

    contexts.append(context)
    
    
for r in tqdm(val.itertuples(), total=len(val)):

    prompt_id = r.Index

    prompt_indices = processed_wiki_text_data_v[processed_wiki_text_data_v['document_id'].isin(wikipedia_file_data_v[wikipedia_file_data_v['prompt_id']==prompt_id]['id'].values)].index.values

    # Reset per prompt so an empty match cannot reuse the previous prompt's context
    context = ""

    if prompt_indices.shape[0] > 0:
        prompt_index = faiss.index_factory(wiki_data_embeddings_v.shape[1], "Flat")
        prompt_index.add(wiki_data_embeddings_v[prompt_indices])

        ## Get the top matches for this prompt only
        ss, ii = prompt_index.search(question_embeddings_v[prompt_id:prompt_id + 1], NUM_SENTENCES_INCLUDE)
        for _s, _i in zip(ss[0], ii[0]):
            context += processed_wiki_text_data_v.loc[prompt_indices]['text'].iloc[_i] + " "

    contexts_v.append(context)

In [None]:
trn['context'] = contexts
val['context'] = contexts_v

In [None]:
trn[["prompt", "context", "A", "B", "C", "D", "E"]].to_csv("./test_context.csv", index=False)
val[["prompt", "context", "A", "B", "C", "D", "E", "answer"]].to_csv("./val_context.csv", index=False)

# Inference

In [None]:
test_df = pd.read_csv("test_context.csv")
test_df.index = list(range(len(test_df)))
test_df['id'] = list(range(len(test_df)))
# An empty context is read back from CSV as NaN, so fill it before slicing
test_df["prompt"] = test_df["context"].fillna("").apply(lambda x: x[:CONTEXT_LEN]) + " #### " +  test_df["prompt"]
test_df['answer'] = 'A'

val_df = pd.read_csv("val_context.csv")
val_df.index = list(range(len(val_df)))
val_df['id'] = list(range(len(val_df)))
val_df["prompt"] = val_df["context"].fillna("").apply(lambda x: x[:CONTEXT_LEN]) + " #### " +  val_df["prompt"]
val_df = val_df.replace(np.nan, '')

In [None]:
model_dir = "/kaggle/input/llm-science-run-context-2"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
model.eval()

In [None]:

options = 'ABCDE'
indices = list(range(5))

option_to_index = {option: index for option, index in zip(options, indices)}
index_to_option = {index: option for option, index in zip(options, indices)}

def preprocess(example):
  
    first_sentence = [example['prompt']] * 5
    second_sentence = []
    for option in options:
        second_sentence.append(example[option])
    
    tokenized_example = tokenizer(first_sentence, second_sentence, truncation='only_first')
    tokenized_example['label'] = option_to_index[example['answer']]
    return tokenized_example

In [None]:
@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    
    def __call__(self, features):
        label_name = "label" if 'label' in features[0].keys() else 'labels'
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch

In [None]:
tokenized_test_dataset = Dataset.from_pandas(test_df[['id', 'prompt', 'A', 'B', 'C', 'D', 'E', 'answer']].drop(columns=['id'])).map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
tokenized_test_dataset = tokenized_test_dataset.remove_columns(["__index_level_0__"])
data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
test_dataloader = DataLoader(tokenized_test_dataset, batch_size=1, shuffle=False, collate_fn=data_collator)

tokenized_val_dataset = Dataset.from_pandas(val_df[['id', 'prompt', 'A', 'B', 'C', 'D', 'E', 'answer']].drop(columns=['id'])).map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
tokenized_val_dataset = tokenized_val_dataset.remove_columns(["__index_level_0__"])

val_dataloader = DataLoader(tokenized_val_dataset, batch_size=1, shuffle=False, collate_fn=data_collator)

In [None]:
test_predictions = []
val_predictions = []

for batch in tqdm(test_dataloader):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    test_predictions.append(outputs.logits.cpu().detach())
    
for batch in tqdm(val_dataloader):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    val_predictions.append(outputs.logits.cpu().detach())

test_predictions = torch.cat(test_predictions)
val_predictions = torch.cat(val_predictions)

In [None]:
# Convert the logits to NumPy first: scipy's softmax is only guaranteed to return an ndarray
test_predictions = softmax(test_predictions.numpy(), axis=1)
val_predictions = softmax(val_predictions.numpy(), axis=1).astype(np.float16)


prob_lables = ['A_prob', 'B_prob', 'C_prob', 'D_prob', 'E_prob']
df_prob = pd.DataFrame(zip(*val_predictions.T), index=val_df.index, columns=prob_lables)
df_prob.to_csv('openbook_val.csv')
df_prob

In [None]:
ob_preds = test_predictions
ob_preds_v = val_predictions
del test_predictions, val_predictions

In [None]:
model_dir = "/kaggle/input/how-to-train-open-book-model-part-1/model_v2"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
model.eval()

In [None]:
test_predictionsc = []
val_predictionsc = []

for batch in tqdm(test_dataloader):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    test_predictionsc.append(outputs.logits.cpu().detach())
    
for batch in tqdm(val_dataloader):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    val_predictionsc.append(outputs.logits.cpu().detach())

test_predictionsc = torch.cat(test_predictionsc)
val_predictionsc = torch.cat(val_predictionsc)

In [None]:
# Convert the logits to NumPy first: scipy's softmax is only guaranteed to return an ndarray
test_predictionsc = softmax(test_predictionsc.numpy(), axis=1)
val_predictionsc = softmax(val_predictionsc.numpy(), axis=1).astype(np.float16)

prob_lables = ['A_prob', 'B_prob', 'C_prob', 'D_prob', 'E_prob']
df_prob = pd.DataFrame(zip(*val_predictionsc.T), index=val_df.index, columns=prob_lables)
df_prob.to_csv('chris_val.csv')
df_prob

In [None]:
gc.collect()

In [None]:
model_dir = "/kaggle/input/using-deepspeed-with-hf-trainer/checkpoints_1"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
model.eval()

In [None]:
test_predictionsi = []
val_predictionsi = []

for batch in tqdm(test_dataloader):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    test_predictionsi.append(outputs.logits.cpu().detach())
    
for batch in tqdm(val_dataloader):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    val_predictionsi.append(outputs.logits.cpu().detach())

test_predictionsi = torch.cat(test_predictionsi)
val_predictionsi = torch.cat(val_predictionsi)

In [None]:
# Convert the logits to NumPy first: scipy's softmax is only guaranteed to return an ndarray
test_predictionsi = softmax(test_predictionsi.numpy(), axis=1)
val_predictionsi = softmax(val_predictionsi.numpy(), axis=1).astype(np.float16)

prob_lables = ['A_prob', 'B_prob', 'C_prob', 'D_prob', 'E_prob']
df_prob = pd.DataFrame(zip(*val_predictionsi.T), index=val_df.index, columns=prob_lables)
df_prob.to_csv('itk_ob_val.csv')
df_prob

#### To increase diversity, we also use some model weights that do not use Open Book.
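
One rough way to check whether an extra model actually adds diversity is to compare its predictions with an existing model's on the validation set. The sketch below is an assumption-laden illustration rather than part of the original pipeline: the helper names `agreement_rate` and `probability_correlation` are hypothetical, and the two small probability tables are made up. Lower agreement or correlation suggests the models make different mistakes, which is typically when blending helps most.

```python
import numpy as np


def agreement_rate(probs_a, probs_b):
    """Fraction of rows where both models pick the same top-1 option."""
    return float((probs_a.argmax(axis=1) == probs_b.argmax(axis=1)).mean())


def probability_correlation(probs_a, probs_b):
    """Pearson correlation between the two models' flattened probabilities."""
    return float(np.corrcoef(probs_a.ravel(), probs_b.ravel())[0, 1])


# Made-up validation probabilities for two models (4 questions, 5 options)
open_book = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.20, 0.50, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.60, 0.10, 0.10],
    [0.40, 0.30, 0.10, 0.10, 0.10],
])
closed_book = np.array([
    [0.60, 0.20, 0.10, 0.05, 0.05],
    [0.50, 0.20, 0.10, 0.10, 0.10],
    [0.10, 0.20, 0.50, 0.10, 0.10],
    [0.10, 0.60, 0.10, 0.10, 0.10],
])

agreement = agreement_rate(open_book, closed_book)
correlation = probability_correlation(open_book, closed_book)
print("top-1 agreement:", agreement)
print("probability correlation:", round(correlation, 3))
```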

In [None]:
from typing import Optional, Union
import pandas as pd
import numpy as np
import torch
from datasets import Dataset
from dataclasses import dataclass
from transformers import AutoTokenizer
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer, AutoModel
from torch.utils.data import DataLoader
deberta_v3_large = '/kaggle/input/deberta-v3-large-hf-weights'
import os
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

In [None]:
option_to_index = {option: idx for idx, option in enumerate('ABCDE')}
index_to_option = {v: k for k,v in option_to_index.items()}

def preprocess(example):
    first_sentence = [example['prompt']] * 5
    second_sentences = [example[option] for option in 'ABCDE']
    tokenized_example = tokenizer(first_sentence, second_sentences, truncation=False)
    tokenized_example['label'] = option_to_index[example['answer']]
    
    return tokenized_example

@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    
    def __call__(self, features):
        label_name = 'label' if 'label' in features[0].keys() else 'labels'
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch 

In [None]:
tokenizer = AutoTokenizer.from_pretrained(deberta_v3_large)

test_df = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/test.csv')
test_df['answer'] = 'A' # dummy answer that allows us to preprocess the test dataset using functionality that works for the train set

val_df = pd.read_csv('/kaggle/input/mmlu-dataset-valid-only/valid_mmlu_1526_ind0.csv',index_col=0)[:VAL_SIZE]

val_df['E'] = '' # MMLU questions have only four options; add an empty dummy option E so the five-option preprocessing works
val_df = val_df.replace(np.nan, '')

val_df['A'] = val_df['A'].map(str)
val_df['B'] = val_df['B'].map(str)
val_df['C'] = val_df['C'].map(str)
val_df['D'] = val_df['D'].map(str)
val_df['E'] = val_df['E'].map(str)

val_df.reset_index(inplace=True, drop=True)

tokenized_test_dataset = Dataset.from_pandas(test_df.drop(columns=['id'])).map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
test_dataloader = DataLoader(tokenized_test_dataset, 1, shuffle=False, collate_fn=data_collator, num_workers=0, pin_memory=True,)


tokenized_val_dataset = Dataset.from_pandas(val_df).map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
val_dataloader = DataLoader(tokenized_val_dataset, 1, shuffle=False, collate_fn=data_collator, num_workers=0, pin_memory=True,)

In [None]:
model = AutoModelForMultipleChoice.from_pretrained('/kaggle/input/2023kagglellm-deberta-v3-large-model1').cuda()
model.eval()

preds = []
preds_v = []

for batch in tqdm(test_dataloader, total=len(test_dataloader)):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    preds.append(outputs.logits.cpu().detach())
    
for batch in tqdm(val_dataloader, total=len(val_dataloader)):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    preds_v.append(outputs.logits.cpu().detach())

hyc_preds = torch.cat(preds)
hyc_preds_v = torch.cat(preds_v)

del model
torch.cuda.empty_cache()

In [None]:
# Convert the logits to NumPy first: scipy's softmax is only guaranteed to return an ndarray
hyc_preds = softmax(hyc_preds.numpy(), axis=1)
hyc_preds_v = softmax(hyc_preds_v.numpy(), axis=1).astype(np.float16)
prob_lables = ['A_prob', 'B_prob', 'C_prob', 'D_prob', 'E_prob']
df_prob = pd.DataFrame(zip(*hyc_preds_v.T), index=val_df.index, columns=prob_lables)
df_prob.to_csv('hyc_val.csv')
df_prob

In [None]:
gc.collect()

In [None]:
import os, glob
from typing import Optional, Union
import pandas as pd
import numpy as np
from tqdm import tqdm

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

from datasets import Dataset
from dataclasses import dataclass
from transformers import AutoTokenizer, AutoConfig
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer, AutoModel

In [None]:
MODEL_DIR = '/kaggle/input/llm-kaggle-awp'
CONF_PATH = MODEL_DIR + '/deberta-v3-large_config.pth'
MODEL_PATH = MODEL_DIR + '/best_model_public.pt'

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
device

In [None]:
test_df = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/test.csv')
test_df['answer'] = 'A' # dummy answer that allows us to preprocess the test dataset using functionality that works for the train set
test_df = test_df.replace(np.nan, '')

val_df = pd.read_csv('/kaggle/input/mmlu-dataset-valid-only/valid_mmlu_1526_ind0.csv',index_col=0)[:VAL_SIZE]

val_df['E'] = '' # MMLU questions have only four options; add an empty dummy option E so the five-option preprocessing works
val_df = val_df.replace(np.nan, '')

val_df['A'] = val_df['A'].map(str)
val_df['B'] = val_df['B'].map(str)
val_df['C'] = val_df['C'].map(str)
val_df['D'] = val_df['D'].map(str)
val_df['E'] = val_df['E'].map(str)

val_df.reset_index(inplace=True, drop=True)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR+'/tokenizer')
tokenizer

In [None]:
class LlmseDataset(torch.utils.data.Dataset):
    def __init__(self, df):
        self.df = df
        self.a2i = {alp: idx for idx, alp in enumerate('ABCDE')}
        self.i2a = {v: k for k,v in self.a2i.items()}
        self.perm_dict = {0: [1,2,3,4],
                     1: [2,3,4,0], 
                     2: [3,4,0,1],
                     3: [4,0,1,2],
                     4: [0,1,2,3]}
  
    def __len__(self):
        return len(self.df)
        
    def __getitem__(self, idx):
        example = self.df.iloc[idx]
        tokenized_example = dict()              

        first_sentence = [example['prompt']] * 5
        second_sentences = [example[option] for option in 'ABCDE']

        # Append the other four options (in cyclic order) to each candidate,
        # so every choice is scored with the alternatives as context.
        for i in range(5):
            for v in self.perm_dict[i]:
                second_sentences[i] += ' ' + example[self.i2a[v]]

        tokenized_example = tokenizer(first_sentence, 
                                      second_sentences,
                                      truncation='only_first')
        tokenized_example['label'] = option_to_index[example['answer']]
        return tokenized_example
            
val_ds = LlmseDataset(val_df)    
test_ds = LlmseDataset(test_df)

In [None]:
data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)

val_dl = DataLoader(
    val_ds, 
    batch_size=1, 
    shuffle=False, 
    collate_fn=data_collator,
    num_workers=0,
    pin_memory=True,
    drop_last=False
)

test_dl = DataLoader(
    test_ds, 
    batch_size=1, 
    shuffle=False, 
    collate_fn=data_collator,
    num_workers=0,
    pin_memory=True,
    drop_last=False
)

In [None]:
class CustomModel(nn.Module):
    def __init__(self, model_conf, *, dropout=0.2, pretrained=True):
        super().__init__()

        # Transformer
        #self.config = AutoConfig.from_pretrained(model_conf)

        self.transformer = AutoModelForMultipleChoice.from_config(model_conf)

        #self._init_weights(self.fc, self.config)

    def _init_weights(self, module, config):
        module.weight.data.normal_(mean=0.0, std=config.initializer_range)
        if module.bias is not None:
            module.bias.data.zero_()

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        out = self.transformer(input_ids, attention_mask, token_type_ids=token_type_ids)
        x = out['logits'] 

        return x

In [None]:
config = torch.load(CONF_PATH)
model = CustomModel(model_conf=config)
model.load_state_dict(torch.load(MODEL_PATH, map_location=device))  # map_location lets the checkpoint load on CPU too
model.to(device)
model.eval()

In [None]:
y_preds = []
y_preds_v = []

with tqdm(test_dl, leave=True) as pbar:
    with torch.no_grad():
        for idx, batch in enumerate(pbar):
            inp_ids = batch['input_ids'].to(device)
            att_mask = batch['attention_mask'].to(device)
            token_type_ids = batch['token_type_ids'].to(device)

            y_pred = model(input_ids=inp_ids, 
                           attention_mask=att_mask, 
                           token_type_ids=token_type_ids)

            y_pred = y_pred.to(torch.float)

            y_preds.append(y_pred.cpu())
            
with tqdm(val_dl, leave=True) as pbar:
    with torch.no_grad():
        for idx, batch in enumerate(pbar):
            inp_ids = batch['input_ids'].to(device)
            att_mask = batch['attention_mask'].to(device)
            token_type_ids = batch['token_type_ids'].to(device)

            y_pred = model(input_ids=inp_ids, 
                           attention_mask=att_mask, 
                           token_type_ids=token_type_ids)

            y_pred = y_pred.to(torch.float)

            y_preds_v.append(y_pred.cpu())
            
        
itk_preds = torch.cat(y_preds)
itk_preds_v = torch.cat(y_preds_v)
del model, y_preds, y_preds_v
torch.cuda.empty_cache()

In [None]:
itk_preds = softmax(itk_preds, axis=1).numpy()
itk_preds_v = softmax(itk_preds_v, axis=1).numpy().astype(np.float16)
prob_labels = ['A_prob', 'B_prob', 'C_prob', 'D_prob', 'E_prob']
df_prob = pd.DataFrame(zip(*itk_preds_v.T), index=val_df.index, columns=prob_labels)
df_prob.to_csv('itk_awp_val.csv')
df_prob

In [None]:
del df_prob
gc.collect()

# Optimise model weights

In [None]:
from scipy.optimize import minimize, fsolve
import datetime
import time  # used to time the optimisation below
import torch.nn.functional as F
from numba import njit

In [None]:
def apk(actual, predicted, k=5):
    """
    Computes the average precision at k.
    This function computes the average precision at k between two lists of
    items.
    Parameters
    ----------
    actual : list
             A list of elements that are to be predicted (order doesn't matter)
    predicted : list
                A list of predicted elements (order does matter)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The average precision at k over the input lists
    """
    
    # requires all elements are unique
    assert (len(np.unique(predicted)) == len(predicted))

    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        # first condition checks whether it is valid prediction
        # second condition checks if prediction is not repeated
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)

    return score / min(len(actual), k)

def mapk(actual, predicted, k=5):
    
    """
    Computes the mean average precision at k.
    This function computes the mean average precision at k between two lists
    of lists of items.
    Parameters
    ----------
    actual : list
             A list of lists of elements that are to be predicted 
             (order doesn't matter in the lists)
    predicted : list
                A list of lists of predicted elements
                (order matters in the lists)
    k : int, optional
        The maximum number of predicted elements
    Returns
    -------
    score : double
            The mean average precision at k over the input lists
    """
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])
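
When every question has exactly one correct option, as in this competition, MAP@3 reduces to scoring 1, 1/2 or 1/3 depending on whether the answer is ranked first, second or third, and 0 otherwise. The following is a minimal vectorised sketch of that special case; `map_at_3` and the toy arrays are hypothetical names introduced for illustration, not part of the notebook.

```python
import numpy as np

def map_at_3(probs: np.ndarray, labels: np.ndarray) -> float:
    """MAP@3 when each question has exactly one correct option."""
    # Indices of the three highest-probability options, best first.
    top3 = np.argsort(-probs, axis=1)[:, :3]
    # hits[n, r] is True when the correct label sits at rank r (0-based).
    hits = top3 == labels[:, None]
    # A hit at rank r scores 1 / (r + 1); a miss scores 0.
    return float((hits / np.array([1.0, 2.0, 3.0])).sum(axis=1).mean())

probs = np.array([
    [0.60, 0.10, 0.10, 0.10, 0.10],  # label 0 ranked first  -> 1
    [0.30, 0.40, 0.10, 0.10, 0.10],  # label 0 ranked second -> 1/2
    [0.10, 0.20, 0.30, 0.35, 0.05],  # label 0 ranked fourth -> 0
])
labels = np.array([0, 0, 0])
score = map_at_3(probs, labels)  # (1 + 1/2 + 0) / 3
```

On single-answer data this should agree with `mapk(..., k=3)` above, while avoiding the Python-level loop.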

In [None]:
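@njit
def grad_func_jit(weights):
    # Analytic gradient passed to SLSQP as `jac`. `preds` and `target_values` are
    # globals; Numba freezes them at compile time, so define them before the first call.
    # With p = b * weights[i] + c, the expression below averages b * (p - a) / (p * (1 - p)),
    # the derivative of the element-wise binary log loss with respect to weights[i].
    # Note that this is not the same loss as the categorical CE in `func_to_optimise`.
    preds_clip = np.minimum(1 - 1e-15, np.maximum(preds, 1e-15))
    gradients = np.zeros(preds.shape[0])
    for i in range(preds.shape[0]):
        a, b, c = target_values, preds_clip[i], np.zeros((preds.shape[1], preds.shape[2]))
        a = np.eye(5)[a]  # one-hot targets, shape (n_samples, 5)
        for j in range(preds.shape[0]):
            if j != i:
                c += weights[j] * preds_clip[j]  # blend of every model except i
        gradients[i] = -np.mean((-a*b+(b**2)*weights[i]+b*c)/((b**2)*(weights[i]**2)+2*b*c*weights[i]-b*weights[i]+(c**2)-c))
    return gradients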

In [None]:
def calc_mtr(predicted, k=3):
    y_preds = np.argsort(-predicted, 1)
    map3 = mapk(target_values.reshape(-1, 1), y_preds.reshape(-1, 5), k=k)
    return map3

# def calc_loss(predicted):
#     score = F.cross_entropy(torch.tensor(predicted), torch.tensor(target_values)).numpy()
#     return score


def log_loss_numpy(y_pred):
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    loss = - target_values_one_hot * np.log(y_pred)
    loss = np.sum(loss, axis=-1)
    return loss.mean()

def func_to_optimise(weights):
    pred_blend = np.tensordot(weights, preds, axes = ((0), (0)))
    score = log_loss_numpy(pred_blend)
    return score

def func_to_map3(weights):
    pred_blend = np.tensordot(weights, preds, axes = ((0), (0)))
    score = calc_mtr(pred_blend)
    return score
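
`func_to_optimise` minimises the categorical cross-entropy of the blended probabilities. A gradient that matches that loss exactly can be derived in one line and checked against finite differences, as sketched below. The `toy_*` arrays, `ce_loss` and `ce_grad` are hypothetical stand-ins for the notebook's `preds`, `target_values` and loss function; this is an illustration under those assumptions, not the gradient the notebook actually passes to `minimize`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_samples, n_classes = 3, 50, 5

# Hypothetical stand-ins for the notebook's `preds` and `target_values`.
toy_preds = rng.dirichlet(np.ones(n_classes), size=(n_models, n_samples))
toy_targets = rng.integers(0, n_classes, size=n_samples)
toy_one_hot = np.eye(n_classes)[toy_targets]

def ce_loss(weights):
    blend = np.clip(np.tensordot(weights, toy_preds, axes=(0, 0)), 1e-15, 1 - 1e-15)
    return -(toy_one_hot * np.log(blend)).sum(axis=-1).mean()

def ce_grad(weights):
    blend = np.clip(np.tensordot(weights, toy_preds, axes=(0, 0)), 1e-15, 1 - 1e-15)
    # d/dw_i of -log(blend[n, y_n]) is -toy_preds[i, n, y_n] / blend[n, y_n].
    return -(toy_one_hot * toy_preds / blend).sum(axis=-1).mean(axis=-1)

# Central finite differences, one weight at a time.
w = np.array([0.5, 0.3, 0.2])
eps = 1e-6
basis = np.eye(n_models)
numeric = np.array([
    (ce_loss(w + eps * basis[i]) - ce_loss(w - eps * basis[i])) / (2 * eps)
    for i in range(n_models)
])
max_abs_diff = np.abs(numeric - ce_grad(w)).max()
```

If the analytic and numeric gradients disagree by more than rounding error, the `jac` function and the objective are not describing the same loss, which is worth checking before trusting the optimised weights.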

In [None]:
val_df = pd.read_csv('/kaggle/input/mmlu-dataset-valid-only/valid_mmlu_1526_ind0.csv',index_col=0)[:VAL_SIZE]

val_df['E'] = '' # MMLU questions have only four options, so add an empty option E to match the five-option format
val_df = val_df.replace(np.nan, '')

val_df['A'] = val_df['A'].map(str)
val_df['B'] = val_df['B'].map(str)
val_df['C'] = val_df['C'].map(str)
val_df['D'] = val_df['D'].map(str)
val_df['E'] = val_df['E'].map(str)

val_df.reset_index(inplace=True, drop=True)

In [None]:
options = 'ABCDE'
indices = list(range(5))

option_to_index = {option: index for option, index in zip(options, indices)}
index_to_option = {index: option for option, index in zip(options, indices)}
target_values = val_df['answer'].map(option_to_index).values
target_values_one_hot = np.eye(5)[target_values]

In [None]:
preds_dict = {
    'chris': val_predictionsc,
    'openbook': ob_preds_v,
    'itk_ob': val_predictionsi,   
    'hyc': hyc_preds_v,
    'itk_awp': itk_preds_v
}

In [None]:
preds = np.zeros((len(preds_dict), len(val_df), 5))
for i in range(preds.shape[0]):
    preds[i] = list(preds_dict.values())[i]

In [None]:
%%time

map3_scores = {}
for n, key in enumerate(preds_dict.keys()):
    score_val = calc_mtr(preds[n])
    map3_scores[key] = score_val
    print(f'{key:40s} CV_map@3:', score_val)
    
print('-' * 60)

loss_scores = {}
for n, key in enumerate(preds_dict.keys()):
    score_val = log_loss_numpy(preds[n])
    loss_scores[key] = score_val
    print(f'{key:40s} CV_CELoss:', score_val)
    
print('-' * 60)

ln(5) = 1.60943791243 is the cross-entropy of a uniform guess over five options, and the losses here are about 1.5, so the models seem to be making near-random predictions on this dataset. As I said at the beginning, the validation dataset may not be suitable, but I will continue.
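
The ln(5) baseline can be reproduced with a few lines of NumPy. The sketch below uses made-up labels and is only meant to show where the number comes from. It also computes ln(4), on the assumption that a model could in principle ignore the empty dummy option E in this validation set and guess uniformly over A to D.

```python
import numpy as np

n_classes = 5
labels = np.array([0, 1, 2, 3])  # made-up answers for four questions

# A model with no information assigns 1/5 to every option.
uniform = np.full((len(labels), n_classes), 1.0 / n_classes)

# Cross-entropy is the mean negative log-probability of the true option.
ce_uniform = -np.log(uniform[np.arange(len(labels)), labels]).mean()

# Uniform over A-D only (option E given zero probability).
ce_four = -np.log(1.0 / 4.0)
```

Here `ce_uniform` equals ln(5) ≈ 1.609 and `ce_four` equals ln(4) ≈ 1.386, so a loss near 1.5 sits between the two uninformed baselines.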

### Observe correlation
Increasing a model's weight just because its CV score is high is not a good approach. The correlation between models is also an important indicator when determining weights. In general, ensembling models that have good CV scores and low correlation with each other gives the best chance of a large score improvement.
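
As a rough illustration of why low correlation helps, take the simplifying assumption (not something measured in this notebook) that two models have errors $e_1, e_2$ with equal variance $\sigma^2$ and correlation $\rho$. The error variance of their equal-weight average is then:

$$
\operatorname{Var}\left(\frac{e_1 + e_2}{2}\right) = \frac{\sigma^2 + \sigma^2 + 2\rho\sigma^2}{4} = \frac{\sigma^2 (1 + \rho)}{2}
$$

With $\rho = 1$ the average is no better than a single model; the lower $\rho$ is, the more the averaging reduces the error.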

In [None]:
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
from matplotlib import pyplot
from matplotlib.ticker import ScalarFormatter
sns.set_context("talk")
style.use('fivethirtyeight')

subs = np.zeros((len(preds_dict), len(val_df), 5))

for i, p in enumerate(preds_dict.keys()):
    print(i,p)
    subs[i,:,:] = list(preds_dict.values())[i]
    
corr = np.corrcoef(subs.reshape(len(preds_dict), -1))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 12))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with square cells (rows and columns follow the order printed above)
sns.heatmap(corr, cmap=cmap, annot=True, fmt="g",
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
ax.set_ylim(corr.shape[0], 0)
plt.yticks(rotation=0)

## Optimise Blending Weights

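Maximising MAP@3 directly is very difficult (is it even possible?), because it depends only on the ranking of the options and gives a gradient-based optimiser nothing to follow. So CE loss is minimised here instead.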
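
The difficulty can be seen on a toy example: a small change in the weights usually leaves the ranking, and therefore MAP@3, unchanged, while the CE loss still moves. The sketch below uses two hypothetical models and two questions; `toy_preds`, `blend`, `map3` and `ce` are names made up for this illustration.

```python
import numpy as np

# Two hypothetical models, two questions, five options; correct answers are 0 and 1.
toy_preds = np.array([
    [[0.50, 0.20, 0.10, 0.10, 0.10], [0.30, 0.40, 0.10, 0.10, 0.10]],
    [[0.30, 0.40, 0.10, 0.10, 0.10], [0.45, 0.25, 0.10, 0.10, 0.10]],
])
labels = np.array([0, 1])

def blend(w0):
    # Weight w0 on the first model, 1 - w0 on the second.
    return w0 * toy_preds[0] + (1.0 - w0) * toy_preds[1]

def map3(p):
    top3 = np.argsort(-p, axis=1)[:, :3]
    return float(((top3 == labels[:, None]) / np.array([1.0, 2.0, 3.0])).sum(axis=1).mean())

def ce(p):
    return float(-np.log(p[np.arange(len(labels)), labels]).mean())

map3_a, map3_b = map3(blend(0.70)), map3(blend(0.71))
ce_a, ce_b = ce(blend(0.70)), ce(blend(0.71))
```

Moving the weight from 0.70 to 0.71 gives identical MAP@3 values but a slightly different CE loss, which is why the smooth loss is the more practical optimisation target.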

In [None]:
tol = 1e-10
init_guess = [1 / preds.shape[0]] * preds.shape[0]
bnds = [(0, 1) for _ in range(preds.shape[0])]
cons = {'type': 'eq', 
        'fun': lambda x: np.sum(x) - 1, 
        'jac': lambda x: [1] * len(x)}

print('Initial Blend Loss:', func_to_optimise(init_guess))
print('Initial Blend MAP@3:', func_to_map3(init_guess))
start_time = time.time()

res_scipy = minimize(fun = func_to_optimise, 
                     x0 = init_guess, 
                     method = 'SLSQP', 
                     tol = tol,
                     bounds = bnds,
                     jac = grad_func_jit, 
                     constraints = cons,
                     options={"disp":True,"maxiter":1000})

print(f'[{str(datetime.timedelta(seconds = time.time() - start_time))[2:7]}] Optimised Blend Loss:', res_scipy.fun, ', Optimised Blend MAP@3:', func_to_map3(res_scipy.x))
print('Optimised Weights:', res_scipy.x)
print('-' * 70)

for n, key in enumerate(preds_dict.keys()):
    print(f'{key:40s} Optimised Weights:', res_scipy.x[n])

# Apply weights and make submission

In [None]:
ws = np.asarray(res_scipy.x[:len(preds_dict)])
ws = ws / np.sum(ws)  # renormalise in case the sum-to-one constraint is only met approximately
ws

In [None]:
predictions_overall = test_predictionsc * ws[0] + ob_preds * ws[1] + test_predictionsi * ws[2] + hyc_preds * ws[3] + itk_preds * ws[4]
predictions_overall.shape
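
The hand-written weighted sum above can also be expressed as a contraction over the model axis, which avoids editing the formula each time a model is added. This is a sketch with random stand-in data; `test_preds` is a hypothetical stacked array of shape `(n_models, n_questions, 5)`, not a variable from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical test-set probabilities for three models: (n_models, n_questions, 5).
test_preds = rng.dirichlet(np.ones(5), size=(3, 4))
ws = np.array([0.5, 0.3, 0.2])

# Contract the model axis; the result has shape (n_questions, 5).
blended = np.tensordot(ws, test_preds, axes=(0, 0))

# Same result as writing the weighted sum out by hand.
manual = ws[0] * test_preds[0] + ws[1] * test_preds[1] + ws[2] * test_preds[2]

# Top three options per question, best first.
top3 = np.argsort(-blended, axis=1)[:, :3]
```

This mirrors how `func_to_optimise` builds the blend on the validation set, so the same code path is used for weight search and for submission.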

In [None]:
# Rank the options by blended probability and keep the top three.
predictions_overall = np.argsort(-predictions_overall)[:,:3]
predictions_overall[:5]

In [None]:
predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_overall]
predictions_as_answer_letters[:3]

In [None]:
predictions_as_string = test_df['prediction'] = [
    ' '.join(row) for row in predictions_as_answer_letters[:, :3]
]
predictions_as_string[:3]

In [None]:
submission = test_df[['id', 'prediction']]
submission.to_csv('submission.csv', index=False)

pd.read_csv('submission.csv').head(10)

In conclusion, we were at least able to confirm that the openbook model (based on Ozturk's and Chris's work), which uses a different method from the other models and scores highly, receives a higher weight.

Now it's your turn to blend. Try adding your own models and optimising their weights.

Also, running the notebook takes a long time, especially inference for the openbook model, so it's a good idea to separate the notebook that calculates the weights from the one that submits, as in Yirun Zhang's base notebook.
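
One simple way to split the work is to have the weight-search notebook write the weights to a small file that the submission notebook reads back. The sketch below uses JSON and a temporary directory; the file name and the weight values are made up for illustration, and on Kaggle the file would typically be attached as a dataset instead.

```python
import json
import os
import tempfile

# Hypothetical weights; in practice these would come from `res_scipy.x`.
weights = {"chris": 0.2, "openbook": 0.5, "itk_ob": 0.1, "hyc": 0.1, "itk_awp": 0.1}

# Weight-search notebook: write the weights to a small JSON file.
path = os.path.join(tempfile.mkdtemp(), "blend_weights.json")
with open(path, "w") as f:
    json.dump(weights, f, indent=2)

# Submission notebook: read them back instead of re-running the optimiser.
with open(path) as f:
    loaded = json.load(f)
```

Keying the weights by model name rather than by position makes it harder to apply a weight to the wrong model when the list of models changes.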

It is also important to switch the evaluation dataset to one relevant to STEM. If a model's weight is unnaturally high, suspect a leak, and make sure the evaluation dataset was not used for training.

### Wishing you happy kaggling!