# Document Retrieval

This notebook shows the results of document retrieval with three methods (***TF-IDF***, ***BM25*** and ***LLM embedding***) on differently pre-processed versions of the corpus, i.e. ***Lower-Casing*** (LC), ***Stop-Word Removal*** (SR) and ***Lemmatization*** (L).

As mentioned before, the entire corpus has been pre-processed with three main approaches. We created three cleaned datasets, split by language, using different combinations of these approaches. All the datasets can be found in the `clean_data/` folder.

1) `lower_case/`: contains the cleaned data that has only been split by language and lower-cased.
2) `lower_case_stop_words/`: contains the cleaned data that has been split by language, lower-cased and stripped of stop words.
3) `lower_case_stop_words_lemmatization/`: contains the cleaned data that has been split by language, lower-cased, stripped of stop words and lemmatized.
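
To make the difference between the first two dataset variants concrete, here is a minimal sketch of lower-casing followed by optional stop-word removal. It is not the code behind `data_helpers`, whose internals are not shown in this notebook: the function names and the tiny `TOY_STOP_WORDS` set are hypothetical, and lemmatization is left out because it depends on a language-specific library.

```python
from typing import Optional, Set

# Hypothetical stop-word list, for illustration only; the real per-language
# lists used by the project are not shown in this notebook.
TOY_STOP_WORDS = {"the", "of", "a", "is", "in"}


def lower_case(text: str) -> str:
    """Lower-casing, the step shared by all three dataset variants."""
    return text.lower()


def remove_stop_words(text: str, stop_words: Set[str]) -> str:
    """Drop whitespace-separated tokens that appear in stop_words."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)


def preprocess(text: str, stop_words: Optional[Set[str]] = None) -> str:
    """Lower-case the text, then remove stop words if a list is given."""
    text = lower_case(text)
    if stop_words is not None:
        text = remove_stop_words(text, stop_words)
    return text


# Same input, two variants: lower-case only vs. lower-case + stop-word removal.
lc_only = preprocess("The Capital of France is Paris")
lc_sr = preprocess("The Capital of France is Paris", TOY_STOP_WORDS)
print(lc_only)
print(lc_sr)
```

Whatever combination is chosen for the corpus has to be applied to the queries as well, otherwise query tokens and corpus tokens no longer match.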

***Author:*** Paulo Ribeiro

## Import 

In [1]:
import pandas as pd
from data_helpers import QueryClean
from TFIDF.tf_idf import TFIDFRetriever
from BM25.bm25 import BM25Retriever
from LLM.llm_embedding import LLMRetriever
import warnings

warnings.filterwarnings('ignore')

%load_ext autoreload
%autoreload 2

## Queries

We load the queries and apply the same pre-processing steps that were applied to the corpus.

In [2]:
# Load the queries
query = QueryClean(queries_path='data/dev.csv')
query.split_per_lang()
langs = list(query.data_clean.keys())

# Select the pre-processing steps by (un)commenting the lines below.
# Keep text_pre_processing_desc in sync with them: it selects the matrix folder used later.
text_pre_processing_desc = 'lower_case_stop_words'
query.lower_case()
query.stop_words()
#query.lemmatization()

Loading queries...
Queries Loaded ! 



Lower casing 'en' texts:   0%|          | 0/200 [00:00<?, ?it/s]

Lower casing 'fr' texts:   0%|          | 0/200 [00:00<?, ?it/s]

Lower casing 'de' texts:   0%|          | 0/200 [00:00<?, ?it/s]

Lower casing 'es' texts:   0%|          | 0/200 [00:00<?, ?it/s]

Lower casing 'it' texts:   0%|          | 0/200 [00:00<?, ?it/s]

Lower casing 'ko' texts:   0%|          | 0/200 [00:00<?, ?it/s]

Lower casing 'ar' texts:   0%|          | 0/200 [00:00<?, ?it/s]

Removing stop words for 'en' texts:   0%|          | 0/200 [00:00<?, ?it/s]

Removing stop words for 'fr' texts:   0%|          | 0/200 [00:00<?, ?it/s]

Removing stop words for 'de' texts:   0%|          | 0/200 [00:00<?, ?it/s]

Removing stop words for 'es' texts:   0%|          | 0/200 [00:00<?, ?it/s]

Removing stop words for 'it' texts:   0%|          | 0/200 [00:00<?, ?it/s]

Removing stop words for 'ar' texts:   0%|          | 0/200 [00:00<?, ?it/s]

## TF-IDF

Let's start by instantiating a TF-IDF retriever for each query language.

In [3]:
tf_idf_retrievers = {
    lang: TFIDFRetriever(queries_df=query.data_clean[lang],
                         tf_idf_data_path=f'TFIDF/tf_idf_matrix/{text_pre_processing_desc}/tf_idf_{lang}.pkl',
                         lang=lang,
                         top_k=10) for lang in langs
}

Then we run the matching process to measure the performance of our most basic document retrieval method, TF-IDF.

In [4]:
results = []
for lang in langs:
    tf_idf = tf_idf_retrievers[lang]
    tf_idf.vectorize_query()
    tf_idf.match()
    results.append(tf_idf.matches)
    
# Stack the results and reset the index
stacked_series = pd.concat(results, ignore_index=True)

Computing the TF matrix: 100%|██████████| 200/200 [00:00<00:00, 64369.31it/s]


Computing TF-IDF matrix...
Done.



Computing the TF matrix: 100%|██████████| 200/200 [00:00<00:00, 13263.67it/s]


Computing TF-IDF matrix...
Done.



Computing the TF matrix: 100%|██████████| 200/200 [00:00<00:00, 34129.17it/s]


Computing TF-IDF matrix...
Done.



Computing the TF matrix: 100%|██████████| 200/200 [00:00<00:00, 49128.01it/s]


Computing TF-IDF matrix...
Done.



Computing the TF matrix: 100%|██████████| 200/200 [00:00<00:00, 26014.41it/s]


Computing TF-IDF matrix...
Done.



Computing the TF matrix: 100%|██████████| 200/200 [00:00<00:00, 31984.63it/s]


Computing TF-IDF matrix...
Done.



Computing the TF matrix: 100%|██████████| 200/200 [00:00<00:00, 50610.00it/s]


Computing TF-IDF matrix...
Done.



In [7]:
stacked_series.to_csv('tf_idf_output.csv', index=True)
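
For readers who want to see what the matching step amounts to, the following is a minimal, self-contained sketch of TF-IDF retrieval with cosine similarity. It is not the `TFIDFRetriever` implementation, which is not shown in this notebook: the helper names, the smoothed IDF formula and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tf_idf_vector(tokens, idf):
    """Sparse TF-IDF vector (dict) using raw term counts as TF."""
    counts = Counter(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_tokens, docs, k=10):
    """Return (doc_index, score) pairs for the k best-matching documents."""
    idf = build_idf(docs)
    query_vec = tf_idf_vector(query_tokens, idf)
    scored = [(i, cosine(query_vec, tf_idf_vector(d, idf))) for i, d in enumerate(docs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy, already pre-processed documents (hypothetical data).
toy_docs = [
    "paris capital france".split(),
    "berlin capital germany".split(),
    "pasta recipe tomato".split(),
]
ranking = top_k("capital france".split(), toy_docs, k=2)
print(ranking)
```

The document sharing both query terms ranks first, the one sharing only the common term `capital` ranks second, and the unrelated document is cut off by `k=2`.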

## BM25

Next, we use BM25, an improved variant of TF-IDF, and compare its performance on our document retrieval task.
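
As a point of reference, here is a minimal sketch of the standard Okapi BM25 scoring function. It is not the `BM25Retriever` implementation, which is not shown in this notebook: the defaults `k1=1.5` and `b=0.75` are common choices rather than values taken from the project, the IDF uses the non-negative "+1 inside the log" variant, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25.

    k1 controls term-frequency saturation, b controls length normalization.
    """
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    doc_freq = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        counts = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = counts.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5))
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Term frequency saturates and is normalized by document length.
            denom = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / denom
        scores.append(score)
    return scores


# Toy, already pre-processed documents (hypothetical data).
toy_docs = [
    "paris capital france".split(),
    "berlin capital germany".split(),
    "pasta recipe tomato".split(),
]
scores = bm25_scores("capital france".split(), toy_docs)
print(scores)
```

Compared with plain TF-IDF, repeated occurrences of a term contribute less and less to the score, and long documents are penalized relative to the average document length.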

## LLM Embedding

Finally, we use the embeddings of our corpus created by a Large Language Model (LLM) to perform our document retrieval task.
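
Embedding-based retrieval usually reduces to a nearest-neighbour search in vector space. The sketch below ranks documents by cosine similarity to a query vector. It is not the `LLMRetriever` implementation, which is not shown in this notebook: the function names are hypothetical, and the three-dimensional toy vectors stand in for real model embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_vec, doc_vecs, top_k=10):
    """Return the ids of the top_k documents closest to the query vector."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]


# Toy embeddings (hypothetical); a real model would produce far more dimensions.
toy_doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
toy_query_vec = [0.9, 0.1, 0.0]
nearest = retrieve(toy_query_vec, toy_doc_vecs, top_k=2)
print(nearest)
```

Unlike TF-IDF and BM25, this ranking does not require the query and the document to share any token, only that the model places them close together.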

## Performance Comparison

For the three methods above, we compute the Recall@10 metric and display the results in a bar chart.
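
A minimal sketch of how Recall@k can be computed is given below. It assumes each query comes with a set of relevant document ids and a ranked list of retrieved ids. The function name, the data layout and the toy ids are hypothetical, not taken from the project.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Mean over queries of |top-k retrieved ∩ relevant| / |relevant|.

    retrieved: list of ranked document-id lists, one per query.
    relevant:  list of sets of relevant document ids, one per query.
    """
    per_query = []
    for ranked_ids, relevant_ids in zip(retrieved, relevant):
        relevant_ids = set(relevant_ids)
        if not relevant_ids:
            continue  # queries without labeled relevant documents are skipped
        hits = len(set(ranked_ids[:k]) & relevant_ids)
        per_query.append(hits / len(relevant_ids))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy example (hypothetical ids): the first query finds its document, the second does not.
toy_retrieved = [["d1", "d7", "d3"], ["d4", "d2", "d9"]]
toy_relevant = [{"d3"}, {"d5"}]
score = recall_at_k(toy_retrieved, toy_relevant, k=10)
print(score)
```

When every query has exactly one relevant document, this reduces to the fraction of queries whose relevant document appears in the top 10 results.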