# Document Retrieval

This notebook compares the document retrieval results of three methods (***TF-IDF***, ***BM25*** and ***LLM embedding***) on differently pre-processed versions of the corpus, i.e. ***Lower-Case*** (LC), ***Stopword Removal*** (SR) and ***Lemmatization*** (L).

As mentioned before, the entire corpus has been pre-processed using three main approaches. We created three cleaned datasets, split by language, using different combinations of these approaches. All our datasets can be found in the `clean_data/` folder.

1) `lower_case/`: contains the cleaned data that has only been split by language and lower-cased.
2) `lower_case_stop_words/`: contains the cleaned data that has been split by language, lower-cased and stripped of stopwords.
3) `lower_case_stop_words_lemmatization/`: contains the cleaned data that has been split by language, lower-cased, stripped of stopwords and lemmatized.
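
To make the three combinations concrete, here is a minimal sketch of how the steps stack on a single sentence. It is not the project's cleaning code: `STOP_WORDS`, `LEMMAS` and `pre_process` are hypothetical toy stand-ins for real per-language stopword lists and a real lemmatizer.

```python
# Toy resources (assumptions): a real pipeline would use per-language
# stopword lists and a proper lemmatizer instead of these tiny tables.
STOP_WORDS = {"the", "are", "on"}
LEMMAS = {"cats": "cat", "sitting": "sit", "mats": "mat"}


def pre_process(text, steps):
    """Apply the chosen steps in the order lower_case -> stop_words -> lemmatization."""
    tokens = text.split()
    if "lower_case" in steps:
        tokens = [t.lower() for t in tokens]
    if "stop_words" in steps:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if "lemmatization" in steps:
        tokens = [LEMMAS.get(t, t) for t in tokens]
    return " ".join(tokens)


sentence = "The cats are sitting on the mats"
lc = pre_process(sentence, ["lower_case"])
lc_sr = pre_process(sentence, ["lower_case", "stop_words"])
lc_sr_l = pre_process(sentence, ["lower_case", "stop_words", "lemmatization"])
print(lc, lc_sr, lc_sr_l, sep="\n")
```

Each folder corresponds to one of these three outputs, applied to the whole corpus rather than to one sentence.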

***Author:*** Paulo Ribeiro

## Imports

In [None]:
import numpy as np
import pandas as pd

from data_helpers import QueryClean
from TFIDF.tf_idf import TFIDFRetriever
from BM25.bm25 import BM25sRetriever
from LLM.llm_embedding import LLMRetriever
import warnings

warnings.filterwarnings('ignore')

%load_ext autoreload
%autoreload 2

## Queries

We load the queries and perform the same pre-processing steps as the corpus.

In [None]:
# Choose the pre-processing steps to perform.
# Keep the steps in this order; you may remove some of them:
# ['lower_case', 'stop_words', 'lemmatization']
process_steps = ['lower_case', 'stop_words', 'lemmatization']

# Load the queries
query = QueryClean(
    queries_path='data/test.csv',
    process_steps=process_steps,
    show_progress=False
)

# Description string used to pick the model files matching the pre-processing steps chosen above
text_pre_processing_desc = '_'.join(process_steps)

In [None]:
# Perform the chosen pre-processing steps
langs = query.pre_process()

## TF-IDF

Let's start by initializing a TF-IDF retriever for each query language.

In [None]:
tf_idf_retrievers = {
    lang: TFIDFRetriever(queries_df=query.data_clean[lang],
                         tf_idf_data_path=f'TFIDF/tf_idf_matrix/{text_pre_processing_desc}/tf_idf_{lang}.pkl',
                         lang=lang,
                         top_k=10)
    for lang in langs
}

Then we run the matching process to see the performance of our most basic document retrieval method, TF-IDF.
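
As an illustration of what TF-IDF matching involves, the sketch below builds L2-normalized TF-IDF vectors for a toy corpus and ranks documents by cosine similarity with a query. It is a simplified stand-in, not the `TFIDFRetriever` implementation: the helper names (`build_tf_idf`, `vectorize`, `top_k`), the smoothed IDF variant and the toy documents are all assumptions.

```python
import math
from collections import Counter


def vectorize(tokens, idf):
    """Turn tokens into an L2-normalized sparse TF-IDF vector (a dict)."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()} if norm else {}


def build_tf_idf(docs):
    """Compute IDF weights and one TF-IDF vector per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, one common variant among several.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return idf, [vectorize(tokens, idf) for tokens in tokenized]


def top_k(query, idf, doc_vectors, doc_ids, k=10):
    """Rank documents by cosine similarity (dot product of unit vectors)."""
    q = vectorize(query.split(), idf)
    scores = [sum(w * d.get(t, 0.0) for t, w in q.items()) for d in doc_vectors]
    ranked = sorted(range(len(doc_ids)), key=lambda i: scores[i], reverse=True)
    return [doc_ids[i] for i in ranked[:k]]


docs = ["cat sit mat", "dog chase cat", "stock market fall"]
doc_ids = ["doc-a", "doc-b", "doc-c"]  # hypothetical docids
idf, doc_vectors = build_tf_idf(docs)
best = top_k("stock market", idf, doc_vectors, doc_ids, k=2)
print(best)
```

Query terms that never appear in the corpus are dropped here, since they have no IDF weight.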

In [None]:
# TODO: Create script to handle the three methods pipeline to match all multilingual queries with their docids.
results = []
for lang in langs:
    tf_idf = tf_idf_retrievers[lang]
    tf_idf.vectorize_query()
    tf_idf.match()
    results.append(tf_idf.matches)

# Stack the results and reset the index
stacked_series = pd.concat(results, ignore_index=True)

In [None]:
stacked_series.to_csv('tf_idf_output.csv', index=True, index_label='id')

## BM25

Next, we use BM25, a ranking function that refines TF-IDF with term-frequency saturation and document-length normalization, and compare its performance on our document retrieval task.
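
For intuition, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the code behind `BM25sRetriever`: the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, the particular IDF variant and the toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Non-negative IDF variant (assumption: one of several in use).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # k1 caps the gain from repeated terms; b scales by document length.
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    ["cat", "cat", "cat", "cat"],
    ["cat", "dog", "bird", "fish"],
    ["stock", "market", "fall", "today"],
]
scores = bm25_scores(["cat"], docs)
print(scores)
```

The first document repeats the query term four times but scores less than four times the second one, which is the saturation effect that raw term-frequency weighting lacks.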

In [None]:
# TODO: Create script to handle the three methods pipeline to match all multilingual queries with their docids.
bm25_retrievers = {
    lang: BM25sRetriever(queries_df=query.data_clean[lang],
                         model_path=f'BM25/bm25_matrix/{text_pre_processing_desc}/bm25s_{lang}.pkl',
                         top_k=10)
    for lang in langs
}

In [None]:
results = []
for lang in langs:
    bm25 = bm25_retrievers[lang]
    bm25.match()
    results.append(bm25.matches)

In [None]:
# Stack the results and reset the index
stacked_series = pd.concat(results, ignore_index=True)

In [None]:
stacked_series.to_csv('bm25_output.csv', index=True, index_label='id')

## LLM Embedding

Finally, we use the embeddings of our corpus created by a Large Language Model (LLM) to perform our document retrieval task.
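
Since this part is not implemented yet, the following is only a sketch of how embedding-based retrieval usually works: documents and queries are mapped to vectors, and documents are ranked by cosine similarity with the query vector. The vectors, docids and the `retrieve` helper are made up for illustration and do not reflect the `LLMRetriever` interface.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query_vec, doc_vecs, doc_ids, k=10):
    """Return the k docids whose embeddings are closest to the query."""
    ranked = sorted(
        zip(doc_ids, doc_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical 3-dimensional embeddings; real models output far more dimensions.
doc_ids = ["doc-a", "doc-b", "doc-c"]
doc_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 0.95]]
query_vec = [0.8, 0.2, 0.1]
top = retrieve(query_vec, doc_vecs, doc_ids, k=2)
print(top)
```

For a large corpus, the same ranking is normally computed with a matrix product or an approximate nearest-neighbor index rather than a Python loop.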

In [None]:
pass

## Performance Comparison

For each of the three previous methods, we compute the performance using the Recall@10 metric and display the results on a bar chart.
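
As a reference for the metric, the sketch below computes Recall@k under the assumption that each query has exactly one relevant document, in which case Recall@k is the fraction of queries whose relevant docid appears in the top k results. The function name `recall_at_k` and the toy docids are hypothetical.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of queries whose single relevant docid is in the top-k list.

    retrieved: one ranked list of docids per query.
    relevant: one relevant docid per query (assumption: one positive each).
    """
    hits = sum(1 for docs, target in zip(retrieved, relevant) if target in docs[:k])
    return hits / len(relevant)


retrieved = [["d1", "d7", "d3"], ["d4", "d2", "d9"], ["d5", "d6", "d8"]]
relevant = ["d3", "d1", "d5"]
score = recall_at_k(retrieved, relevant, k=10)
print(f"Recall@10 = {score:.3f}")
```

If queries can have several relevant documents, the per-query value becomes the share of relevant documents found in the top k, averaged over all queries.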

In [None]:
pass