# Kaggle Large Language Model Science Exam

In this competition we are challenged to answer difficult science-based questions written by a Large Language Model. We are also told that:

> The dataset for this challenge was generated by giving gpt3.5 snippets of text on a range of scientific topics pulled from wikipedia, and asking it to write a multiple choice question (with a known answer), then filtering out easy questions.

An idea is to make this challenge a little easier by converting it to an ***open book science exam*** using semantic search and Wikipedia.

## Overview

1. We obtain the plain text version of the 2023-07-01 Wikipedia dump (https://www.kaggle.com/datasets/jjinho/wikipedia-20230701)
1. We convert the prompts into embeddings using sentence transformers (specifically the `all-MiniLM-L6-v2` model)
1. We also create embeddings of all the Wikipedia articles, using the first sentence of each article to provide more context (again with `all-MiniLM-L6-v2`)
1. We use `faiss` to perform similarity search and find the top-k articles most likely to contain the information needed
1. We get the full text of those articles and split them into sentences using the fast `blingfire` package
1. We obtain embeddings of these sentences, as well as embeddings of the prompt + answer choices, and perform similarity search again to get the top-k matching sentences for each question
1. We combine the questions, answer choices, and context to either perform question answering directly or feed them into an LLM
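
The two-stage retrieval described above can be sketched on toy data. This is only an illustration under loud assumptions: the `embed` function below is a bag-of-words stand-in for `all-MiniLM-L6-v2`, a sorted list stands in for `faiss`, and splitting on `". "` stands in for `blingfire`. All function names and the sample articles are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, candidates: list[str], k: int) -> list[str]:
    """Return the k candidates most similar to the query, best first."""
    q = embed(query)
    return sorted(candidates, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def retrieve_context(question: str, choices: list[str], articles: dict[str, str],
                     k_articles: int = 1, k_sentences: int = 1) -> list[str]:
    # Stage 1: rank articles by their first sentence against the question alone.
    first_sentence = {title: text.split(". ")[0] for title, text in articles.items()}
    best = top_k(question, list(first_sentence.values()), k_articles)
    titles = [t for t, s in first_sentence.items() if s in best]
    # Stage 2: rank sentences of the selected articles against question + choices.
    sentences = [s for t in titles for s in articles[t].split(". ")]
    return top_k(question + " " + " ".join(choices), sentences, k_sentences)


# Hypothetical two-article "Wikipedia"
articles = {
    "Photosynthesis": "Photosynthesis is a process used by plants to convert light into chemical energy. Chlorophyll absorbs mostly blue and red light",
    "Plate tectonics": "Plate tectonics is the theory that the lithosphere of Earth is divided into moving plates. Plates move a few centimetres per year",
}
context = retrieve_context("Which pigment absorbs light in plants during photosynthesis?",
                           ["Chlorophyll", "Keratin"], articles)
print(context)
```

Note that the second stage includes the answer choices in the query, which is what lets the sentence mentioning chlorophyll outrank the article's first sentence.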

## TODO

* ~~Enable off-line use~~
* Improve memory efficiency
* Make faster
* Use context information to train a model or run inference using an LLM (like https://www.kaggle.com/code/philippsinger/h2ogpt-perplexity-ranking/notebook)

# Get Necessary Packages

In [None]:
# faiss
!pip install -U /kaggle/input/faiss-cpu-173/faiss_cpu-1.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

In [None]:
## Needed otherwise encounter read-only error
!cp -rf /kaggle/input/sentence-transformers-222/sentence-transformers /kaggle/working/sentence-transformers

In [None]:
# sentence transformer
!pip install -U /kaggle/working/sentence-transformers

In [None]:
# blingfire
!pip install -U /kaggle/input/blingfire-018/blingfire-0.1.8-py3-none-any.whl

## Imports

In [None]:
import os
import gc
import pandas as pd
import numpy as np
import re
from tqdm.auto import tqdm
import blingfire as bf

from collections.abc import Iterable

import faiss
from faiss import write_index, read_index

from sentence_transformers import SentenceTransformer

## Code to Sentencize Text

In [None]:
def process_documents(documents: Iterable[str],
                      document_ids: Iterable,
                      split_sentences: bool = True,
                      filter_len: int = 3,
                      disable_progress_bar: bool = False) -> pd.DataFrame:
    """
    Main helper function to process documents (here, Wikipedia articles).

    :param documents: Iterable containing documents which are strings
    :param document_ids: Iterable containing document unique identifiers
    :param split_sentences: Flag to determine whether to further split documents into sentences
    :param filter_len: Sentences of this many characters or fewer are filtered out
    :param disable_progress_bar: Flag to disable tqdm progress bar
    :return: Pandas DataFrame containing the columns `document_id`, `text`, `offset`
    """
    
    df = sectionize_documents(documents, document_ids, disable_progress_bar)

    if split_sentences:
        df = sentencize(df.text.values, 
                        df.document_id.values,
                        df.offset.values, 
                        filter_len, 
                        disable_progress_bar)
    return df


def sectionize_documents(documents: Iterable[str],
                         document_ids: Iterable,
                         disable_progress_bar: bool = False) -> pd.DataFrame:
    """
    Wraps each document as a single section spanning its full text, so that
    the result can be passed on to `sentencize`.

    :param documents: Iterable containing documents which are strings
    :param document_ids: Iterable containing document unique identifiers
    :param disable_progress_bar: Flag to disable tqdm progress bar
    :return: Pandas DataFrame containing the columns `document_id`, `text`, `offset`
    """
    processed_documents = []
    for document_id, document in tqdm(zip(document_ids, documents), total=len(documents), disable=disable_progress_bar):
        row = {}
        text, start, end = (document, 0, len(document))
        row['document_id'] = document_id
        row['text'] = text
        row['offset'] = (start, end)

        processed_documents.append(row)

    _df = pd.DataFrame(processed_documents)
    if _df.shape[0] > 0:
        return _df.sort_values(['document_id', 'offset']).reset_index(drop=True)
    else:
        return _df


def sentencize(documents: Iterable[str],
               document_ids: Iterable,
               offsets: Iterable[tuple[int, int]],
               filter_len: int = 3,
               disable_progress_bar: bool = False) -> pd.DataFrame:
    """
    Split a document into sentences. Can be used with `sectionize_documents`
    to further split documents into more manageable pieces. Takes in offsets
    to ensure that after splitting, the sentences can be matched to the
    location in the original documents.

    :param documents: Iterable containing documents which are strings
    :param document_ids: Iterable containing document unique identifiers
    :param offsets: Iterable tuple of the start and end indices
    :param filter_len: Sentences of this many characters or fewer are filtered out
    :param disable_progress_bar: Flag to disable tqdm progress bar
    :return: Pandas DataFrame containing the columns `document_id`, `text`, `offset`
    """

    document_sentences = []
    for document, document_id, offset in tqdm(zip(documents, document_ids, offsets), total=len(documents), disable=disable_progress_bar):
        try:
            _, sentence_offsets = bf.text_to_sentences_and_offsets(document)
            for o in sentence_offsets:
                if o[1]-o[0] > filter_len:
                    sentence = document[o[0]:o[1]]
                    abs_offsets = (o[0]+offset[0], o[1]+offset[0])
                    row = {}
                    row['document_id'] = document_id
                    row['text'] = sentence
                    row['offset'] = abs_offsets
                    document_sentences.append(row)
        except Exception:
            ## Skip any document that fails to split rather than aborting the run
            continue
    return pd.DataFrame(document_sentences)
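
The offset bookkeeping in `sentencize` can be illustrated with a minimal, self-contained sketch. It assumes a naive regex splitter in place of `blingfire`, and the helper name `split_with_offsets` is hypothetical. The point is that sentence offsets are computed relative to the text that was split and then shifted by the start of that text, so each sentence can be located in the original document.

```python
import re


def split_with_offsets(text: str, base: int = 0, filter_len: int = 3) -> list[tuple[str, tuple[int, int]]]:
    """Split text into sentences and report (start, end) offsets shifted by `base`."""
    rows = []
    # Naive splitter (assumption): a sentence is a run of text ending in ., ! or ?
    for match in re.finditer(r"[^.!?]+[.!?]?", text):
        raw = match.group()
        start = match.start() + (len(raw) - len(raw.lstrip()))  # skip leading spaces
        end = match.start() + len(raw.rstrip())
        if end - start > filter_len:  # drop fragments that are too short
            rows.append((text[start:end], (start + base, end + base)))
    return rows


document = "Intro. Water boils at 100 C. Ok. Ice melts at 0 C."
section_start = 7  # pretend the section to split begins at character 7
rows = split_with_offsets(document[section_start:], base=section_start)

for sentence, (start, end) in rows:
    # Absolute offsets index back into the full document
    assert document[start:end] == sentence
    print(start, end, sentence)
```

The fragment `Ok.` is exactly 3 characters long, so with `filter_len=3` it is dropped, mirroring the strict `>` comparison used above.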

## Configurations

In [None]:
MODEL = '/kaggle/input/sentencetransformers-allminilml6v2/sentence-transformers_all-MiniLM-L6-v2'
DEVICE = 0
MAX_LENGTH = 384
BATCH_SIZE = 16

## Load Data

In [None]:
WIKI_PATH = "/kaggle/input/wikipedia-20230701"
wiki_files = os.listdir(WIKI_PATH)

In [None]:
trn = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/train.csv")

In [None]:
model = SentenceTransformer(MODEL, device='cuda')
model.max_seq_length = MAX_LENGTH
model = model.half()

#### Using precomputed index of the Wikipedia 2023-07 dump

The dataset can be found at: https://www.kaggle.com/datasets/jjinho/wikipedia-2023-07-faiss-index

In [None]:
sentence_index = read_index("/kaggle/input/wikipedia-2023-07-faiss-index/wikipedia_202307.index")

## Encode the prompts

We observe that the subject matter almost always appears at the end of the prompt. Here we simply encode the entire prompt using `sentence_transformers` so that semantic search can find articles likely to contain information relevant to the question.

In [None]:
prompt_embeddings = model.encode(trn.prompt.values, batch_size=BATCH_SIZE, device=DEVICE, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True).half()
prompt_embeddings = prompt_embeddings.detach().cpu().numpy()

In [None]:
_ = gc.collect()

In [None]:
prompt_embeddings.shape

In [None]:
## Get the top 3 pages that are likely to contain the topic of interest
search_score, search_index = sentence_index.search(prompt_embeddings, 3)

In [None]:
## Save memory - delete sentence_index since it is no longer necessary
del sentence_index
del prompt_embeddings
_ = gc.collect()

#### Load the Wikipedia Index File

In [None]:
df = pd.read_parquet("/kaggle/input/wikipedia-20230701/wiki_2023_index.parquet", columns=['id', 'file'])

In [None]:
## Get the article and associated file location using the index
wikipedia_file_data = []

for i, (scr, idx) in tqdm(enumerate(zip(search_score, search_index)), total=len(search_score)):
    
    ## Get indices by score threshold
    #scr_idx = idx[np.where(scr <= 0.85)]
    scr_idx = idx
    _df = df.loc[scr_idx].copy()
    _df['prompt_id'] = i
    wikipedia_file_data.append(_df)
wikipedia_file_data = pd.concat(wikipedia_file_data).reset_index(drop=True)
wikipedia_file_data = wikipedia_file_data[['id', 'prompt_id', 'file']].drop_duplicates().sort_values(['file', 'id']).reset_index(drop=True)

## Save memory - delete df since it is no longer necessary
del df
_ = gc.collect()
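
The bookkeeping in the cell above can be sketched with plain dictionaries. The data below is entirely hypothetical (made-up row numbers, article ids, and file names). The idea is to group the retrieved article ids by the file that stores them, so each file only needs to be read once, while remembering which articles belong to which prompt.

```python
from collections import defaultdict

# Hypothetical search output: top-2 article row numbers for three prompts
search_index = [[4, 1], [1, 3], [4, 0]]
# Hypothetical index table: row number -> (article id, parquet file name)
article_index = {
    0: ("10", "a.parquet"),
    1: ("11", "a.parquet"),
    3: ("30", "c.parquet"),
    4: ("42", "d.parquet"),
}

ids_by_file = defaultdict(set)          # file -> article ids to load from it
articles_by_prompt = defaultdict(list)  # prompt -> article ids retrieved for it
for prompt_id, rows in enumerate(search_index):
    for row in rows:
        article_id, file = article_index[row]
        ids_by_file[file].add(article_id)  # a set drops repeated articles
        articles_by_prompt[prompt_id].append(article_id)

for file, ids in sorted(ids_by_file.items()):
    print(file, sorted(ids))
```

Article `11` is retrieved for two prompts but appears only once under `a.parquet`, which is the same effect `drop_duplicates` has on the file-level lookups.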

In [None]:
## Get the full text data
wiki_text_data = []

for file in tqdm(wikipedia_file_data.file.unique(), total=len(wikipedia_file_data.file.unique())):
    _id = [str(i) for i in wikipedia_file_data[wikipedia_file_data['file']==file]['id'].tolist()]
    _df = pd.read_parquet(f"{WIKI_PATH}/{file}", columns=['id', 'text'])

    _df = _df[_df['id'].isin(_id)]
    wiki_text_data.append(_df)
    _ = gc.collect()
wiki_text_data = pd.concat(wiki_text_data).drop_duplicates().reset_index(drop=True)
_ = gc.collect()

## Split full-text Wikipedia Documents into Sentences

We split the Wikipedia documents into sentences because, in many cases, GPT3.5 appears to have taken the answers directly from the text. We therefore want to retrieve the most similar sentences to provide context.

In [None]:
## Parse documents into sentences
processed_wiki_text_data = process_documents(wiki_text_data.text.values, wiki_text_data.id.values)

In [None]:
## Get embeddings of the wiki text data
wiki_data_embeddings = model.encode(processed_wiki_text_data.text, batch_size=BATCH_SIZE, device=DEVICE, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True).half()
wiki_data_embeddings = wiki_data_embeddings.detach().cpu().numpy()

In [None]:
_ = gc.collect()

We found that encoding the prompt together with the answer choices gave the best retrieval quality.

In [None]:
## Combine all answers
trn['answer_all'] = trn.apply(lambda x: " ".join([x['A'], x['B'], x['C'], x['D'], x['E']]), axis=1)

## Search using the prompt and answers to guide the search
trn['prompt_answer_stem'] = trn['prompt'] + " " + trn['answer_all']

In [None]:
question_embeddings = model.encode(trn.prompt_answer_stem.values, batch_size=BATCH_SIZE, device=DEVICE, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True).half()
question_embeddings = question_embeddings.detach().cpu().numpy()

In [None]:
## Parameter to determine how many relevant sentences to include
NUM_SENTENCES_INCLUDE = 3

## List containing Question, Choices, Context
prompt_contexts = []

## List containing just Context
contexts = []

for r in trn.itertuples():
    prompt_context = ""
    ## Reset per prompt so a prompt with no retrieved sentences gets an empty context
    context = ""

    prompt_id = r.id

    prompt_indices = processed_wiki_text_data[processed_wiki_text_data['document_id'].isin(wikipedia_file_data[wikipedia_file_data['prompt_id']==prompt_id]['id'].values)].index.values
    prompt_context += "Question: " + trn.prompt.iloc[prompt_id] + "\n"

    prompt_context += "Choices:\n"
    prompt_context += "(A) " + trn.A.iloc[prompt_id] + "\n"
    prompt_context += "(B) " + trn.B.iloc[prompt_id] + "\n"
    prompt_context += "(C) " + trn.C.iloc[prompt_id] + "\n"
    prompt_context += "(D) " + trn.D.iloc[prompt_id] + "\n"
    prompt_context += "(E) " + trn.E.iloc[prompt_id] + "\n"

    if prompt_indices.shape[0] > 0:
        prompt_context += "Context:\n"
        ## Per Prompt Index
        prompt_index = faiss.index_factory(wiki_data_embeddings.shape[1], "Flat")
        prompt_index.add(wiki_data_embeddings[prompt_indices])

        context = ""
        
        ## Get the top matches for this question only
        ss, ii = prompt_index.search(question_embeddings[prompt_id:prompt_id + 1], NUM_SENTENCES_INCLUDE)
        for _s, _i in zip(ss[0], ii[0]):
            ## Threshold on the score; faiss returns -1 when fewer than k sentences exist
            if _i >= 0 and _s < 2:
                context += processed_wiki_text_data.loc[prompt_indices]['text'].iloc[_i] + "\n"
        prompt_context += context
        
    contexts.append(context)
    prompt_contexts.append(prompt_context)
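
The loop above keeps a sentence only when its score is below 2. Assuming the `"Flat"` factory string yields faiss's default L2 index, which reports squared Euclidean distances, and given that all embeddings were normalized, the score relates to cosine similarity as `||q - v||^2 = 2 - 2 * cos(q, v)`. Under that assumption, a threshold of 2 keeps sentences with positive cosine similarity. A small sketch in plain Python (the vectors are arbitrary examples):

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Dot product, which equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


q = normalize([1.0, 2.0, 0.5])
same_direction = normalize([2.0, 4.0, 1.0])
orthogonal = normalize([2.0, -1.0, 0.0])
opposite = [-x for x in q]

for name, v in [("same", same_direction), ("orthogonal", orthogonal), ("opposite", opposite)]:
    # For unit vectors: ||q - v||^2 = 2 - 2 * cos(q, v)
    assert math.isclose(squared_l2(q, v), 2 - 2 * cosine(q, v), abs_tol=1e-9)
    print(name, round(squared_l2(q, v), 6), round(cosine(q, v), 6))
```

The squared distance therefore ranges from 0 (identical direction) through 2 (orthogonal) to 4 (opposite), so a stricter threshold than 2 would demand a higher cosine similarity.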

In [None]:
trn['context'] = contexts

In [None]:
trn.to_csv("./train_context.csv", index=False)

## Open Book Test Taking!

Below we can see some results, which provide not only the question and choices but also context from Wikipedia that may contain crucial hints or even the answers themselves!

In [None]:
for i, p in enumerate(prompt_contexts[:10]):
    print(f"Question {i}")
    print(p)
    print()