This is the notebook that was used to make the Kaggle competition submission.
It uses a pre-built FAISS index for Wikipedia, along with the pre-chunked documents stored in an SQLite database.

Find the Wikipedia documents that best match each question + answer options text, using the FAISS index.
Find the document chunks most similar to the query.
Select the answer using Flan-T5 with in-context, RAG-based question answering.
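
The second step, ranking chunks by similarity to the query, can be sketched in plain Python as below. This is only an illustration: the notebook itself uses sentence-transformers embeddings and `util.cos_sim`, while `cosine`, `rank_chunks` and the toy vectors here are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_chunks(query_vec, chunk_vecs):
    """Return chunk indices ordered from most to least similar to the query."""
    scores = [cosine(query_vec, c) for c in chunk_vecs]
    return sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)

# Toy 2-d "embeddings" (hypothetical values, not produced by any model)
query = [1.0, 0.0]
chunks = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
order = rank_chunks(query, chunks)
print(order)
```

The chunk pointing almost in the same direction as the query ranks first, and the orthogonal one ranks last.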

In [2]:
import os
from tqdm.notebook import tqdm

on_kaggle = False
if os.path.exists("/kaggle/input"):
    on_kaggle = True


While some of the HuggingFace T5 configuration files have a maximum token count of 512, it appears the model was trained on 2048 tokens and uses relative positional embeddings, which in theory allow unlimited input size:

- https://github.com/huggingface/transformers/issues/5204
- https://github.com/google-research/text-to-text-transfer-transformer/issues/273
- https://huggingface.co/google/flan-t5-xxl/discussions/41

Therefore, I tried context lengths of up to 3000 tokens. The score seemed to increase slightly up to that context size.

In [3]:
MAX_CONTEXT = 3000
MAX_CONTEXT_RAG = MAX_CONTEXT - 200

CHUNK_SIZE = 256
# Better to load slightly more chunks than fit into the max context; this leaves room for selection by dropping some.
CHUNKS_TO_LOAD = int(MAX_CONTEXT / CHUNK_SIZE * 1.2) + 1
print(CHUNKS_TO_LOAD)
if CHUNK_SIZE == 512:
    CHUNKS_TO_LOAD = 10
if CHUNK_SIZE == 384:
    CHUNKS_TO_LOAD = 13
if CHUNK_SIZE == 256:
    CHUNKS_TO_LOAD = 15
CHUNKS_TO_LOAD

15


15
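
As a worked example of the arithmetic in the cell above, the formula can be evaluated for the chunk sizes the cell mentions. The helper name `chunks_to_load` and the `margin` parameter are introduced here only for illustration; the 1.2 factor and the `+ 1` come from the cell.

```python
def chunks_to_load(max_context, chunk_size, margin=1.2):
    """Chunks needed to fill the context, plus a 20% margin and one extra."""
    return int(max_context / chunk_size * margin) + 1

for size in (512, 384, 256):
    print(size, chunks_to_load(3000, size))
# prints:
# 512 8
# 384 10
# 256 15
```

For a chunk size of 256 the formula and the hard-coded value agree (15). For 512 and 384 the hard-coded values in the cell (10 and 13) are larger than what the formula gives.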

In [4]:
import numpy as np 
import pandas as pd 

import os

#paths on kaggle are different, as well as some pip installs
if on_kaggle:
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            pass
            #print(os.path.join(dirname, filename))



In [5]:
if on_kaggle:
    !pip install -U /kaggle/input/faiss-gpu-173-python310/faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
else:
    !pip install faiss-cpu




In [6]:
if on_kaggle:
    #sentence-transformers needs to build the wheel and write access to filesystem
    !cp -rf /kaggle/input/sentence-transformers-222/sentence-transformers /kaggle/working/sentence-transformers
    !pip install -U /kaggle/working/sentence-transformers

In [7]:
if on_kaggle:
    #https://www.kaggle.com/code/chesterx/llm-science-cut-fragment-length/notebook
    #transformers is already at newer version
    #!pip install --no-index --no-deps /kaggle/input/llm-whls/transformers-4.31.0-py3-none-any.whl
    !pip install --no-index --no-deps /kaggle/input/llm-whls/peft-0.4.0-py3-none-any.whl
    !pip install --no-index --no-deps /kaggle/input/llm-whls/datasets-2.14.3-py3-none-any.whl
    !pip install --no-index --no-deps /kaggle/input/llm-whls/trl-0.5.0-py3-none-any.whl

In [8]:
import psutil

def print_stats():
    # Getting % usage of virtual_memory ( 3rd field)
    print('RAM memory % used:', psutil.virtual_memory()[2])
    # Getting usage of virtual_memory in GB ( 4th field)
    print('RAM Used (GB):', psutil.virtual_memory()[3]/1000000000)

In [9]:
print_stats()

RAM memory % used: 9.8
RAM Used (GB): 11.994046464


In [10]:
if on_kaggle:
    data_dir = "/kaggle/input/kaggle-llm-science-exam"
    index_dir = "/kaggle/input/faiss-512-wikipedia-202308/"
    embedding_dir = "/kaggle/input/bge-small-en/bge-small-en"
    llm_dir = "/kaggle/input/flan-t5/pytorch/xl/3"
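    if CHUNK_SIZE == 256:
        chunk_database_path = "/kaggle/input/wikipedia-202308-chunks-256tk-sqlite/wikipedia_chunks_256.db"
    elif CHUNK_SIZE == 64:
        chunk_database_path = "/kaggle/input/wikipedia-202308-64tk/wikipedia_chunks_64.db"
    else:
        # only the 256 and 64 token chunk databases are attached on Kaggle
        raise ValueError(f"No chunk database configured for CHUNK_SIZE={CHUNK_SIZE}")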
else:
    data_dir = "/mystuff/science-exam"
    index_dir = ".."
    embedding_dir = "/mystuff/llm/bge-small-en"
    llm_dir = "/mystuff/llm/flan-t5-xl"
#    llm_dir = "/mystuff/llm/flan-ul2"
    chunk_database_path = f"../wikipedia_chunks_{CHUNK_SIZE}.db"


In [11]:
df_train = pd.read_csv(f'{data_dir}/train.csv')
df_test  = pd.read_csv(f'{data_dir}/test.csv')
df_samp = pd.read_csv(f'{data_dir}/sample_submission.csv')


In [12]:
print_stats()

RAM memory % used: 9.8
RAM Used (GB): 11.997659136


In [13]:
# for testing the processing limits on a test set of size 4k

#df_test_orig = df_test.copy()
#df_test = pd.concat([df_test, df_test], ignore_index=True) #400
#df_test = pd.concat([df_test, df_test], ignore_index=True) #800
#df_test = pd.concat([df_test, df_test], ignore_index=True) #1600
#df_test = pd.concat([df_test, df_test], ignore_index=True) #3200
#df_test = pd.concat([df_test, df_test_orig], ignore_index=True) #3400
#df_test = pd.concat([df_test, df_test_orig], ignore_index=True) #3600
#df_test = pd.concat([df_test, df_test_orig], ignore_index=True) #3800
#df_test = pd.concat([df_test, df_test_orig], ignore_index=True) #4000



In [14]:
DEVICE = "cuda"


In [15]:
def collect_answer_options(df, idx, options = ['A','B','C','D','E']):
    prompt = df.loc[idx, 'prompt']
    answers = df.loc[idx,options].tolist()
    correct_answer = None
    if "answer" in df.columns:
        correct_answer = df.loc[idx, 'answer']
    
    return prompt, answers, correct_answer

This loads the FAISS database into memory in a separate process, queries all prompt/answer pairs for nearest document ids.
Those doc ids can be later used to load the actual docs and their chunks.

In [16]:
from multiprocessing import Process, Queue
import faiss
import numpy as np
from faiss import write_index, read_index
from sentence_transformers import SentenceTransformer, util
import time
import traceback

def load_and_search_faiss(queue, df):
    try:
        start = time.time()

        print("loading index")
        sentence_index = read_index(f"{index_dir}/faiss_index_512_flat_small.index", faiss.IO_FLAG_MMAP)
        print("index loaded")

        print("loading embedding model")
        embedding_model_path = embedding_dir
        embedding_model = SentenceTransformer(embedding_model_path, device=DEVICE)
        print("model loaded")

        # collect full text for all prompt + answers in the df
        full_texts = []
        for idx in tqdm(range(df.shape[0])):
            prompt, answers, correct_answer = collect_answer_options(df, idx)
            full_text = prompt + "\n".join(answers)
            full_texts.append(full_text)

        print("encoding")
        q_embeddings = embedding_model.encode(full_texts)
        print("finished encoding")

        k = 6
        print("searching index")
        D, I = sentence_index.search(q_embeddings, k)
        print("search done")
        faiss_scores = D
        faiss_doc_ids = I

        # Put the results in the queue to send back to main process
        print("putting results in queue")
        queue.put((faiss_scores, faiss_doc_ids))
        end = time.time()
        diff = end - start
        print(f"returning: {diff}")
    except Exception as e:
        print(f"An error occurred in the subprocess: {e}")
        traceback.print_exc()
        queue.put(None)




In [17]:
df_train.head()

Unnamed: 0,id,prompt,A,B,C,D,E,answer
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,D
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,C
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,D


In [18]:
df_test.head()

Unnamed: 0,id,prompt,A,B,C,D,E
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...


In [19]:
# Concatenate Train + Test DataFrames for more efficient processing
df_full = pd.concat([df_train, df_test]).reset_index(drop=True)


As noted above, the FAISS search is done in a separate process. Otherwise the FAISS memory use seems very hard to control and to free afterwards; when the subprocess exits, its memory is released.
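
The pattern can be sketched without FAISS, as below. This is a minimal sketch, not the notebook's code: `_worker` and `run_isolated` are hypothetical names, the list sum stands in for loading and searching the index, and the `fork` start method is an assumption that holds on Linux (as on Kaggle) but not on Windows.

```python
import multiprocessing as mp

def _worker(queue, n):
    """Do the memory-hungry work and send only the small result back."""
    try:
        big = list(range(n))  # stand-in for loading a large index
        queue.put(sum(big))   # stand-in for the search result
    except Exception:
        queue.put(None)       # signal failure so the parent does not block forever

def run_isolated(n):
    ctx = mp.get_context("fork")  # assumption: a platform that supports fork
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(queue, n))
    proc.start()
    result = queue.get()  # read the result before joining the process
    proc.join()           # once the child has exited, its memory is back with the OS
    return result

print(run_isolated(1000))
```

Reading from the queue before calling `join()` matters: a child that has put a large object on a queue may not exit until that data has been consumed.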

In [20]:
q = Queue()

p = Process(target=load_and_search_faiss, args=(q,df_full))
p.start()
print("getting value")
#the get should block 
faiss_scores_full, faiss_doc_ids_full = q.get()
print("Best match document indices:", len(faiss_doc_ids_full))
#but better to sleep and join anyway
time.sleep(5)
print("join started")
p.join()
print("join done")


loading index
getting value
index loaded
loading embedding model


Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`,  it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.


model loaded


  0%|          | 0/400 [00:00<?, ?it/s]

encoding
finished encoding
searching index
search done
putting results in queue
returning: 8.586722373962402
Best match document indices: 400
join started
join done


In [21]:
# Split the result list back to train and test since both were processed together
split_index = len(df_train)
faiss_scores_train = faiss_scores_full[:split_index]
faiss_scores_test = faiss_scores_full[split_index:]
faiss_doc_ids_train = faiss_doc_ids_full[:split_index]
faiss_doc_ids_test = faiss_doc_ids_full[split_index:]


In [22]:
print_stats()

RAM memory % used: 10.0
RAM Used (GB): 12.198162432


In [23]:
DEVICE

'cuda'

Load the embeddings model to create embeddings for new document chunks as needed:

In [24]:
from sentence_transformers import SentenceTransformer, util

embedding_model_path = embedding_dir

embedding_model = SentenceTransformer(embedding_model_path, device=DEVICE)

Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`,  it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.


In [25]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig, AutoModelForSeq2SeqLM
from transformers.generation import GenerationConfig
from tqdm.notebook import tqdm

In [26]:
llm = llm_dir

In [27]:
tokenizer = AutoTokenizer.from_pretrained(llm, local_files_only=True)
#some of the discussions on T5 max length suggest trying this setting. I don't think it makes a difference, but why not:
tokenizer.model_max_length=2048

In [28]:
llm

'/mystuff/llm/flan-t5-xl'

In [29]:
model = AutoModelForSeq2SeqLM.from_pretrained(llm, device_map=DEVICE, local_files_only=True)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

MAP@3 calculation. Taken from a Kaggle notebook; I seem to have forgotten which one. Sorry about that:

In [30]:
# MAP3 

def calculate_MAP3(predictions, labels):
    U = len(predictions)  # Number of questions in the test set
    MAP3 = 0.0  # Mean Average Precision @ 3

    for i in range(U):
        n = len(predictions[i])  # Number of predictions per question
        relevant_labels = set(labels[i])  # Correct labels for the current question
        
        precision_sum = 0.0
        precision_at_k = 0.0
        relevant_count = 0

        for k in range(n):
            if predictions[i][k] in relevant_labels:
                relevant_count += 1
                precision_at_k = relevant_count / (k + 1)
                precision_sum += precision_at_k
                relevant_labels.remove(predictions[i][k])

            if relevant_count >= 3:
                break

        average_precision = precision_sum / min(len(labels[i]), 3)
        MAP3 += average_precision
    
    MAP3 /= U
    print("MAP@3 score:", MAP3)
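
For this competition each question has exactly one correct letter, so the metric reduces to 1/rank of the correct answer within the first three predictions, or 0 if it is missing. A simplified sketch under that single-answer assumption (`map_at_3` is a name introduced here, not the function above):

```python
def map_at_3(predictions, answers):
    """Mean of 1/rank of the correct answer within the first 3 predictions."""
    total = 0.0
    for preds, answer in zip(predictions, answers):
        for rank, pred in enumerate(preds[:3], start=1):
            if pred == answer:
                total += 1.0 / rank
                break  # only the first hit counts
    return total / len(predictions)

preds = [["A", "B", "C"], ["B", "A", "C"], ["C", "D", "E"]]
answers = ["A", "A", "A"]
print(map_at_3(preds, answers))  # (1 + 1/2 + 0) / 3 = 0.5
```

Unlike the function above, this version returns the score rather than printing it, which makes it easier to compare runs.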



Load document with the given ID from the SQLite database.
Includes all the document chunks in the database:

In [31]:
import sqlite3

def fetch_document_and_chunks_by_id(database_path, document_id):
    # Establish the database connection
    connection = sqlite3.connect(database_path)
    cursor = connection.cursor()

    # Prepare and execute the SQL query
    query = """
    SELECT 
        d.document_id, d.document_title,
        s.section_id, s.section_title,
        tc.chunk_id, tc.content
    FROM documents d
    LEFT JOIN sections s ON s.document_id = d.document_id
    LEFT JOIN text_chunks tc ON tc.document_id = d.document_id AND tc.section_id = s.section_id
    WHERE d.document_id = ?
    ORDER BY d.document_id, s.section_id, tc.chunk_id;
    """
    cursor.execute(query, (document_id,))

    # Initialize an empty dictionary to hold the document and its chunks
    # TODO: throw some error if this is not right
    document = {
        'id': document_id,
        'sections': {}
    }

    # Fetch and process the result
    for row in cursor:
        #print(row)
        _, document_title, section_id, section_title, chunk_id, content = row

        # Set the document title
        document['title'] = document_title

        # Add section if not already present
        if section_id and section_id not in document['sections']:
            document['sections'][section_id] = {
                'title': section_title,
                'chunks': {}
            }

        # Add chunk
        if chunk_id:
            document['sections'][section_id]['chunks'][chunk_id] = content

    # Close the database connection
    connection.close()

    return document
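
To see what the join returns, it can be run against a tiny in-memory database. The table and column names come from the query above; the column types, the sample rows and the choice of primary keys are assumptions made for this sketch, since the real schema is not shown in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: only the column names are taken from the query above
conn.executescript("""
CREATE TABLE documents (document_id INTEGER PRIMARY KEY, document_title TEXT);
CREATE TABLE sections (section_id INTEGER PRIMARY KEY, document_id INTEGER, section_title TEXT);
CREATE TABLE text_chunks (chunk_id INTEGER PRIMARY KEY, document_id INTEGER, section_id INTEGER, content TEXT);
INSERT INTO documents VALUES (1, 'Example page');
INSERT INTO sections VALUES (1, 1, 'Intro'), (2, 1, 'History');
INSERT INTO text_chunks VALUES (1, 1, 1, 'first chunk'), (2, 1, 1, 'second chunk'), (3, 1, 2, 'third chunk');
""")

query = """
SELECT d.document_id, d.document_title, s.section_id, s.section_title, tc.chunk_id, tc.content
FROM documents d
LEFT JOIN sections s ON s.document_id = d.document_id
LEFT JOIN text_chunks tc ON tc.document_id = d.document_id AND tc.section_id = s.section_id
WHERE d.document_id = ?
ORDER BY d.document_id, s.section_id, tc.chunk_id;
"""
rows = conn.execute(query, (1,)).fetchall()
conn.close()

# Group the flat rows into {section_title: [chunk, ...]}
sections = {}
for _, _, _, section_title, _, content in rows:
    sections.setdefault(section_title, []).append(content)
print(sections)
```

Each row repeats the document and section columns, which is why the function above rebuilds the nested structure row by row.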


Load a Wikipedia page with given ID, print it if asked:

In [32]:
def load_and_print_doc_by_id(document_id_to_query, print_all=False):
#    document_id_to_query = 1880580 # Replace with the document ID you're interested in
    document = fetch_document_and_chunks_by_id(chunk_database_path, document_id_to_query)
    #print(document)
    
    # Print the document and its chunks
    all_chunks = []
    #print(f"Document ID: {document['id']}, Title: {document['title']}")
    for sec_id, sec_data in document['sections'].items():
        #print(f"  Section ID: {sec_id}, Title: {sec_data['title']}")
        for chunk_id, content in sec_data['chunks'].items():
            if print_all:
                print(f"    Chunk ID: {chunk_id}, Content: {content[:50]}...")  # Printing first 50 characters of each chunk
            all_chunks.append(content)
    return document, all_chunks


Load given set of documents (Wikipedia pages) and return chunks for all of them as separate lists and as a single big list:

In [33]:
def get_doc_chunks(doc_ids, print_all=True):
    docs = []
    doc_chunks = []
    doc_chunks_flat = []
    for doc_id in doc_ids:
        doc, chunks = load_and_print_doc_by_id(int(doc_id+1), print_all)
        docs.append(doc)
        doc_chunks.append(chunks)
        doc_chunks_flat.extend(chunks)
    return doc_chunks, doc_chunks_flat

find_top_n_rag finds the indices of the top N largest values in the given np_array and sorts them from largest to smallest.
In this notebook it is used to find the indices of the document chunks with the highest similarity to the given prompt + answer options.

If the code seems strange, I recommend copy-pasting it to ChatGPT with prompt "what does this code do:" or something similar :)

In [34]:
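def find_top_n_rag(np_array, n):
    # indices of the n largest values, in arbitrary order
    ind = np.argpartition(np_array, -n)[-n:]
    # sort those indices by value (ascending), then reverse to get largest first
    sorted_ind = ind[np.argsort(np_array[ind])]
    return sorted_ind[::-1]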
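
For comparison, a plain-Python reference that should return the same indices for inputs without tied values. It uses `heapq` instead of NumPy, and `top_n_indices` is a name made up for this sketch.

```python
import heapq

def top_n_indices(values, n):
    """Indices of the n largest values, ordered from largest to smallest."""
    return heapq.nlargest(n, range(len(values)), key=lambda i: values[i])

scores = [0.1, 0.9, 0.4, 0.7]
best = top_n_indices(scores, 2)
print(best)  # [1, 3]
```

The NumPy version is the better choice for large arrays; this one is mainly useful for checking what the function is expected to return.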

In [35]:
import numpy as np

def find_bottom_n_rag(np_array, n):
    ind = np.argpartition(np_array, n)[:n]  # Find indices of smallest n elements
    sorted_ind = ind[np.argsort(np_array[ind])]  # Sort these indices
    return sorted_ind  # Return sorted indices


In [36]:
q_full_promps = []
# iterate all of df_train
for idx in tqdm(range(df_train.shape[0])):
    prompt, answers, correct_answer = collect_answer_options(df_train, idx)
    full_text = prompt + "\n".join(answers)
    q_full_promps.append(full_text)

print("encoding")
train_q_embeddings = embedding_model.encode(q_full_promps)


  0%|          | 0/200 [00:00<?, ?it/s]

encoding


In [37]:
q_full_promps = []
# iterate all of df_test
for idx in tqdm(range(df_test.shape[0])):
    prompt, answers, correct_answer = collect_answer_options(df_test, idx)
    full_text = prompt + "\n".join(answers)
    q_full_promps.append(full_text)

test_q_embeddings = embedding_model.encode(q_full_promps)

  0%|          | 0/200 [00:00<?, ?it/s]

In [38]:

# df = the dataframe to build contexts for, df_train or df_test in practice
# faiss_doc_ids = top documents by embeddings similarity for each query prompt + answer pairs
# for example, df_train[0] has a prompt and 5 answer options. faiss_doc_ids[0] has the set of most similar docs to these df_train[0] texts
# q_embeddings the query embeddings for the dataframe rows, to allow for effective chunk selection from the doc chunks
def build_contexts(df, faiss_doc_ids, q_embeddings):
    contexts = []
    doc_chunks_all = []

    print("loading chunks")
    for idx in tqdm(range(df.shape[0])):
        doc_ids = faiss_doc_ids[idx]
        doc_chunks, doc_chunks_flat = get_doc_chunks(doc_ids, False)
        # doc_chunks_all then holds one flat list per question, with the chunks of all its retrieved docs
        doc_chunks_all.append(doc_chunks_flat)
        
    print("flattening chunks")
    # flattening all chunks into a single list allows batch processing them more efficiently
    flat_chunks_all = [item for sublist in doc_chunks_all for item in sublist]
    print("embedding chunks")
    # thanks to the flattened list, we can now encode all chunks at once, batched by the encoding model
    chunk_embeddings = embedding_model.encode(flat_chunks_all)
    
    embeddings_list_of_lists = []
    start_idx = 0
    print("splitting embeddings back")
    for sublist in doc_chunks_all:
        # splitting embeddings from model back into sublists per doc/chunks
        end_idx = start_idx + len(sublist)
        embeddings_list_of_lists.append(chunk_embeddings[start_idx:end_idx])
        start_idx = end_idx

    print("finding top n for all prompts, building contexts")
    sim_scores_all = []
    # for each prompt / answer pairs, find best matching doc chunks to use as RAG QA context
    for idx in tqdm(range(df.shape[0])):
        # q_embeddings[idx] is the embeddings for prompt + answers together as a question
        # embeddings_list_of_lists is the embeddings for the chunks for closest docs
        sim_scores = util.cos_sim(q_embeddings[idx], embeddings_list_of_lists[idx])
        sim_scores_all.append(sim_scores)
        search_n = CHUNKS_TO_LOAD
        if len(sim_scores[0]) < search_n:
            #some documents may not have enough chunks, so have to cap it if that is the case
            search_n = len(sim_scores[0])
        #find the ones that have highest similarity scores by embedding
        top_n = find_top_n_rag(np.array(sim_scores)[0], search_n)

        total_tokens = 0
        total_tokens_prev = 0
        context = ""
        count = 0
        #top_n should now be indices into the chunk list
        for n in top_n:
            chunk = doc_chunks_all[idx][n]
            token_ids = tokenizer(chunk)["input_ids"]
            token_count = len(token_ids)
            total_tokens_prev = total_tokens
            total_tokens += token_count
            if total_tokens > MAX_CONTEXT_RAG:
                break
            count += 1
            context += "\n"+chunk
        # print(f"{count}: {total_tokens_prev}")
        
        contexts.append(context)
    return contexts, doc_chunks_all, embeddings_list_of_lists, sim_scores_all
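
The context-filling loop at the end of `build_contexts` can be isolated as below. This is a sketch under simplifying assumptions: `pack_chunks` is a hypothetical helper, and the default `count_tokens` counts whitespace-separated words instead of calling the real tokenizer.

```python
def pack_chunks(ranked_chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Take chunks in ranked order until the next one would exceed the budget."""
    selected = []
    total = 0
    for chunk in ranked_chunks:
        total += count_tokens(chunk)
        if total > max_tokens:
            break  # stop at the first overflow, as the notebook loop does
        selected.append(chunk)
    return selected

ranked = ["a b c", "d e", "f g h i", "j"]
print(pack_chunks(ranked, 5))  # ['a b c', 'd e']
```

Note that the loop stops at the first chunk that does not fit, so the last chunk `"j"` is not considered even though it would fit on its own. Continuing past oversized chunks would be a different design than the one used here.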

In [39]:
contexts_train, train_chunks_1, train_emb_lists, train_sim_scores_1 = build_contexts(df_train, faiss_doc_ids_train, train_q_embeddings)

loading chunks


  0%|          | 0/200 [00:00<?, ?it/s]

flattening chunks
embedding chunks
splitting embeddings back
finding top n for all prompts, building contexts


  0%|          | 0/200 [00:00<?, ?it/s]

In [40]:
contexts_test, test_chunks_1, test_emb_lists, test_sim_scores_1 = build_contexts(df_test, faiss_doc_ids_test, test_q_embeddings)

loading chunks


  0%|          | 0/200 [00:00<?, ?it/s]

flattening chunks
embedding chunks
splitting embeddings back
finding top n for all prompts, building contexts


  0%|          | 0/200 [00:00<?, ?it/s]

In [41]:
#this creates the actual LLM input, also known as prompt or prompt with context
def format_input_rag(df, idx, faiss_doc_ids, contexts, options = ['A','B','C','D','E']):
    preamble = "Answer the following multiple-choice question about wikipedia content with the correct answer. "\
               "Use the provided context to assist in the answer if useful. "\
               "Answer only with the letter of the choice."
    postamble = ""

    prompt = df.loc[idx, 'prompt']
    context = contexts[idx]
        
    answers = df.loc[idx,options].tolist()
    correct_answer = None
    if "answer" in df.columns:
        correct_answer = df.loc[idx, 'answer']

    options_text = ""
    for option, answer in zip(options, answers):
        options_text += (f"{option}) {answer}\n")
    
    input_text = f"{preamble}\n\nQuestion:\n{prompt}\n\nContext:\n{context}\nOptions:\n{options_text}\n{postamble}"
    return input_text, correct_answer

In [42]:
#this asks the model to choose one of the options, removes the selected option, and repeats 3 times to get the top 3
#if the model gives no response, that round is skipped (the Kaggle competition allowed at most 3 choices per question, but did not require 3)
def predict_choices_rag(tokenizer, model, df, faiss_doc_ids, contexts):
    result = []
    for i in tqdm(range(df.shape[0])):
        j = 1
        ans = []
        options = ['A', 'B', 'C', 'D', 'E']
        while j<=3:
            input_text, correct_answer = format_input_rag(df,i,faiss_doc_ids, contexts, options)
            model_inputs = tokenizer(input_text, return_tensors='pt').to(DEVICE)
            greedy_output = model.generate(**model_inputs, max_new_tokens=40)
            response = tokenizer.decode(greedy_output[0], skip_special_tokens=True)

            # sometimes the model does not answer with a letter..
            # print(f"{j}:{response}")
            if len(response) == 0:
                print("empty response, skipping")
                j+=1
                continue
            opt = response[0] if response[0] in 'ABCDE' else 'C' # Choose C if cannot infer the answer;)
            ans.append(opt)
            try:
                options.remove(opt)
            except ValueError:
                # opt was already picked in an earlier round; reset the options
                options = ['A', 'B', 'C', 'D', 'E']
            j+=1
        result.append(ans)
    return result
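
The elimination loop can be tried without a model by passing in a stub. In this sketch `pick_top3` and the two stub functions are hypothetical; the fallback to `'C'` and the reset of the option list mirror the cell above.

```python
ALL_OPTIONS = ['A', 'B', 'C', 'D', 'E']

def pick_top3(choose):
    """Ask `choose` for one option three times, removing each pick from the list."""
    options = list(ALL_OPTIONS)
    picks = []
    for _ in range(3):
        response = choose(options)
        if not response:
            continue  # empty response: skip this round
        opt = response[0] if response[0] in 'ABCDE' else 'C'
        picks.append(opt)
        if opt in options:
            options.remove(opt)
        else:
            options = list(ALL_OPTIONS)  # repeated pick: reset the options
    return picks

def first_remaining(options):
    return options[0]

def always_d(options):
    return "D"  # ignores which options are still offered

print(pick_top3(first_remaining))  # ['A', 'B', 'C']
print(pick_top3(always_d))         # ['D', 'D', 'D']
```

The second stub shows how a repeated letter can end up in a prediction: if the model answers with a letter that has already been removed, that letter is still appended.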

In [43]:
labels = [[x] for x in df_train['answer']]
predictions_train = predict_choices_rag(tokenizer, model, df_train, faiss_doc_ids_train, contexts_train)

  0%|          | 0/200 [00:00<?, ?it/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (3091 > 2048). Running this sequence through the model will result in indexing errors


In [44]:
calculate_MAP3(predictions_train, df_train["answer"])

MAP@3 score: 0.8725


In [45]:
predictions_test = predict_choices_rag(tokenizer, model, df_test, faiss_doc_ids_test, contexts_test)



  0%|          | 0/200 [00:00<?, ?it/s]

In [46]:
df_submission = df_samp
df_submission

Unnamed: 0,id,prediction
0,0,A B C
1,1,A B C
2,2,A B C
3,3,A B C
4,4,A B C
...,...,...
195,195,A B C
196,196,A B C
197,197,A B C
198,198,A B C


In [47]:
str_submissions = [" ".join(x) for x in predictions_test]
str_submissions[:5]

['D E A', 'A B C', 'B A D', 'A C D', 'D B A']

In [48]:
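# sanity check: assumes test.csv holds the same questions as train.csv (equal row count is the only check made here)
if df_train.shape[0] == df_test.shape[0]:
    calculate_MAP3(predictions_test, df_train["answer"])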

MAP@3 score: 0.8725


In [49]:
#these are some of the scores per combinations of different parameters.
#mainly chunk size and max tokens to put in context, and the resulting MAP3 score
# 512/3000: 0.84
# 512/2000: 0.864
# 512/1000: 0.836
# 512/500:  0.743
# 384/3000: 0.858
# 384/2000: 0.848
# 384/1000: 0.86
# 384/500:  0.789
# 256/3000: 0.8725, 0.8675
# 256/2000: 0.852
# 256/1000: 0.8425
# 256/500:  0.8575
# 192/3000: 0.868
# 192/2000: 0.856
# 192/1000: 0.8575
# 192/500:  0.8233
# 128/3000: 0.8525
# 128/2000: 0.8466
# 128/1000: 0.865
# 128/500:  0.8425
#  64/3000: 0.850
#  64/2000: 0.863
#  64/1000: 0.8625
#  64/500:  0.8275



In [50]:

df_submission["prediction"] = str_submissions


In [51]:
df_submission.to_csv("submission.csv", index=False)

In [52]:
!head submission.csv

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


id,prediction
0,D E A
1,A B C
2,B A D
3,A C D
4,D B A
5,C B E
6,A D B
7,D B E
8,C B C
