# Document Retrieval

This notebook shows the results of document retrieval with different methods (***TF-IDF***, ***BM25*** and ***LLM embedding***) on differently pre-processed versions of the corpus, i.e. ***Lower-Casing*** (LC), ***Stopword Removal*** (SR) and ***Lemmatization*** (L).

As mentioned before, the entire corpus has been pre-processed with 3 main approaches. We created three cleaned datasets, split by language, using different combinations of these approaches. All our datasets can be found in the `clean_data/` folder.

1) `lower_case/`: contains the cleaned data that have only been split by language and lower-cased.
2) `lower_case_stop_words/`: contains the cleaned data that have been split by language, lower-cased and stripped of stopwords.
3) `lower_case_stop_words_lemmatization/`: contains the cleaned data that have been split by language, lower-cased, stripped of stopwords and lemmatized.
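
To make the three cumulative variants concrete, here is a minimal sketch of how such a pipeline could be chained. It is not the code behind `QueryClean` or the corpus cleaning: the function `pre_process_text`, the tiny `STOPWORDS` set and the `LEMMAS` lookup are hypothetical stand-ins, and the short keys `lc` and `lc_sw` are assumed by analogy with the `lc_sw_l` value used later in this notebook.

```python
# Hypothetical toy resources, for illustration only. A real pipeline would use
# language-specific stopword lists and a proper lemmatizer.
STOPWORDS = {"the", "are", "on", "a"}
LEMMAS = {"cats": "cat", "sitting": "sit", "mats": "mat"}


def pre_process_text(text, processing_wanted="lc_sw_l"):
    """Apply the steps named in `processing_wanted`, in a fixed order."""
    steps = processing_wanted.split("_")
    tokens = text.split()
    if "lc" in steps:  # lower-casing
        tokens = [token.lower() for token in tokens]
    if "sw" in steps:  # stopword removal
        tokens = [token for token in tokens if token not in STOPWORDS]
    if "l" in steps:  # lemmatization (dictionary lookup in this sketch)
        tokens = [LEMMAS.get(token, token) for token in tokens]
    return " ".join(tokens)


sentence = "The Cats are sitting on the mats"
for variant in ("lc", "lc_sw", "lc_sw_l"):
    print(variant, "->", pre_process_text(sentence, variant))
```

Each variant builds on the previous one, which is why the folder names above grow by one step at a time.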

***Author:*** Paulo Ribeiro

## Imports

In [None]:
import pandas as pd
import warnings
from data_helpers import QueryClean
from models.BM25s.bm25s import BM25sRetriever

warnings.filterwarnings('ignore')

%load_ext autoreload
%autoreload 2

## Queries

We load the queries and apply the same pre-processing steps as for the corpus.

In [None]:
# Choose the data split, the pre-processing variant and the k1 value of the model to load
data_type = 'test'
processing_wanted = 'lc_sw_l'
k1 = 1.1
submission_path = ""  # Set this to the folder where output.csv must be written for Kaggle

In [None]:
# Load the queries
query = QueryClean(
    queries_path=f'data/{data_type}.csv',
    processing_wanted=processing_wanted,
    show_progress=False
)

# Initialize the list that collects the matches of each language before writing one .csv file
match_per_lang = []

In [None]:
# Apply the chosen pre-processing steps
langs = query.pre_process()

## BM25s

We then run BM25s, the improved version of BM25, on our document retrieval task so that its performance can be compared with the other methods.
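
For readers unfamiliar with the role of `k1`, the sketch below scores documents with a textbook Okapi BM25 formula. It is an illustration under assumptions, not the implementation inside `BM25sRetriever`: the function names `bm25_scores` and `top_k_indices`, the length-normalization parameter `b = 0.75` and the non-negative IDF variant are all choices made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.1, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption of this sketch)
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 controls how quickly repeated occurrences of a term saturate
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k_indices(scores, top_k=10):
    """Return the indices of the `top_k` highest-scoring documents."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


docs = [["cat", "sat"], ["dog", "barked"], ["cat", "cat", "purred"]]
scores = bm25_scores(["cat"], docs, k1=1.1)
print(top_k_indices(scores, top_k=2))
```

A smaller `k1` makes the score saturate faster as a term repeats, while a larger one rewards repetition more strongly.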

In [None]:
# Initialize one BM25s retriever for each language present in the queries
bm25s_retrievers = {
    lang: BM25sRetriever(queries_df=query.data_clean[lang],
                         model_path=f'models/BM25s/bm25s_matrix/{processing_wanted}/k1_{k1}/bm25s_{lang}.pkl',
                         top_k=10)
    for lang in langs
}

In [None]:
# Compute the matching between query and document for each language separately
for lang in langs:
    bm25s = bm25s_retrievers[lang]
    bm25s.match()
    match_per_lang.append(bm25s.matches)

In [None]:
import os

# Stack all the pd.Series to create a unified pd.Series with all the matches
matches = pd.concat(match_per_lang, ignore_index=True)

# Write the matches to a .csv file on disk; os.path.join avoids targeting
# '/output.csv' (the filesystem root) when submission_path is left empty
matches.to_csv(os.path.join(submission_path, 'output.csv'),
               index=True,
               index_label='id')
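
The stacking step above relies on `ignore_index=True` to replace the per-language indices with one continuous index, which then becomes the `id` column of the submission. The sketch below shows this on toy data. The Series name `docids` and the document identifiers are hypothetical, since the actual layout of `bm25s.matches` is not shown in this notebook.

```python
import pandas as pd

# Hypothetical per-language results: one list of document ids per query
matches_en = pd.Series([["doc-1", "doc-7"], ["doc-3", "doc-2"]], name="docids")
matches_fr = pd.Series([["doc-9", "doc-4"]], name="docids")

# ignore_index=True discards the original indices (0, 1 and 0) and
# renumbers the stacked rows from 0 to n - 1
stacked = pd.concat([matches_en, matches_fr], ignore_index=True)

# With no path given, to_csv returns the CSV content as a string
csv_text = stacked.to_csv(index=True, index_label="id")
print(csv_text)
```

Note that the resulting row order follows the order in which the languages were appended, so the `id` values only line up with the original queries if that order matches the query file.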