### **Toulouse School of Economics**
#### **M2 Statistics & Econometrics**
---

### **Mathematics of Deep Learning Algorithms, Part 2**
# **Final Project: *Performance Benchmarking of Different Information Retrieval Methods***

### **Anh-Dung LE, Paul MELKI**

---

In this project, we compare the performance of different Information Retrieval techniques, mainly **BM25** and a **BERT-based search engine**. Our corpus is built from the latest dump of English Wikipedia. Because of limited computational resources, we restrict our work to a small subset of this dump (mainly articles whose title starts with the letter 'A').

But first, we start with some preliminary steps: 

### **Preliminaries & Corpus Creation**

In [1]:
# Import required libraries
import os
from gensim.corpora import WikiCorpus
from rank_bm25 import BM25Okapi, BM25Plus

To create our own local text corpus from Wikipedia, we use the `WikiCorpus` class implemented in the `gensim.corpora` library. This class provides methods that facilitate the handling and manipulation of Wikipedia dumps, which are usually downloaded as BZ2-compressed XML files.

Based on this class, we write our own function to read the dump and save the corpus locally, with each Wikipedia article saved in its own `.txt` file:

In [4]:
# Define function to read and create corpus from downloaded dump
def make_corpus(in_file, out_directory):
    """
    Function that converts a Wikipedia .xml.bz2 dump into a 
    corpus, saving each article in a separate .txt file.
    
    Parameters
    ----------
    @param in_file: str, 
        A valid string specifying the path to the local *.xml.bz2 Wikipedia 
        dump file.
    @param out_directory: str,
        A valid string specifying the path to the directory in which we wish to
        save the created .txt files.
    """
    
    # Instantiate WikiCorpus object, based on the local dump file.
    wiki = WikiCorpus(in_file)
    print("Corpus is read!")
    
    # Make sure the output directory exists before writing to it.
    os.makedirs(out_directory, exist_ok=True)
    
    print("Getting texts...")
    # For each new article read (numbered from 1), do...
    for i, text in enumerate(wiki.get_texts(), start=1):
        # Take only the first 1000 words of each article, to keep sizes small.
        first_n_words = ' '.join(text[:1000])
        # Build the file path in an OS-independent way.
        out_file = os.path.join(out_directory, f'{i}.txt')
        # Write the text to its own file; the with-block closes the file.
        with open(out_file, 'w') as output_file:
            output_file.write(first_n_words)
        # If 1000 articles have been read, stop reading.
        if i == 1000:
            print(f'Processed {i} articles')
            break

    print('Processing Complete!')

In [3]:
# Initialize input and output paths
in_path = "C:\\Users\\Paul\\Documents\\Python Scripts\\Data\\enwiki-latest-pages-articles1.xml-p1p41242.bz2"
out_path = "C:\\Users\\Paul\\Documents\\Python Scripts\\Data\\Wiki Corpus"

# Create corpus!
make_corpus(in_path, out_path)

Now that the corpus is created, we also need to create a function to read the corpus from the files we created.

In [12]:
def read_corpus(corpus_directory):
    """
    Function that iteratively reads the saved articles from the corpus directory
    and appends the text to a list.
    
    Parameters
    ----------
    @param corpus_directory: str,
        A valid string specifying the path to the local directory in which the 
        files were saved using make_corpus().
        
    Returns
    -------
    @return corpus: list,
        A list containing the text of an article in each element.
    """
    
    # Initialize empty corpus list
    corpus = []
    
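    # For each file in the corpus directory, do...
    print("Reading local corpus, please wait...")
    for filename in os.listdir(corpus_directory):
        # Build the file path in an OS-independent way.
        file_path = os.path.join(corpus_directory, filename)
        # Read the article; the with-block closes the file afterwards.
        with open(file_path, 'r') as file:
            article_text = file.read()
        corpus.append(article_text)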
        
    # Done, return
    print("Done!")
    return corpus

In [13]:
# Read corpus! 
corpus = read_corpus("C:\\Users\\Paul\\Documents\\Python Scripts\\Data\\Wiki Corpus")
# Look at some example...
corpus[3][0:100]

Reading local corpus, please wait...
Done!


'atm or atm often refers to atmosphere unit or atm unit of atmospheric pressure automated teller mach'

### **BM25 Implementation**

The first Information Retrieval method we try is **BM25**, a ranking function from the TF-IDF family that scores every article against a given query and retrieves the articles with the highest scores.

Given a document $D$ and a query $Q$ that contains keywords $q_1, \dots, q_n$, we define the BM25 score of the document $D$ as:

$$
score(D, Q) = \sum_{i = 1}^n IDF(q_i) \cdot \frac{TF(q_i, D) \cdot (k_1 + 1)}{TF(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl} \right)}
$$

where: 
- $TF(q_i, D)$ is the *term frequency* of keyword $q_i$ in document $D$, i.e. the number of times $q_i$ occurs in $D$,
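- $IDF(q_i)$ is the *inverse document frequency* of keyword $q_i$; a common choice is $IDF(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)$, where $N$ is the number of documents in the corpus and $n(q_i)$ is the number of documents containing $q_i$,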
- $|D|$ is the length of the document $D$ in words,
- $avgdl$ is the average document length, in words, over the whole corpus,
- $k_1$ and $b$ are free parameters that are chosen rather than estimated, typically $k_1 \in [1.2, 2.0]$ and $b = 0.75$. They may also be tuned with a dedicated optimization procedure.
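
To get a feeling for the role of $k_1$ and $b$, the following minimal sketch evaluates only the term-frequency factor of the formula for a single keyword. The function name `tf_component` and the numbers used are illustrative assumptions, not part of any library.

```python
def tf_component(tf, doc_len, avgdl, k1=1.5, b=0.75):
    """Term-frequency factor of the BM25 score for one keyword (illustrative helper)."""
    # Length normalization: equals 1 for a document of average length.
    length_norm = 1 - b + b * doc_len / avgdl
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# Saturation: for a document of average length the factor grows with tf,
# but it always stays below k1 + 1.
saturation = [tf_component(tf, doc_len=100, avgdl=100) for tf in (1, 2, 5, 50)]

# Length normalization: with the same tf, a longer document gets a smaller factor.
short_doc = tf_component(3, doc_len=50, avgdl=100)
long_doc = tf_component(3, doc_len=200, avgdl=100)

print(saturation, short_doc, long_doc)
```

A larger $k_1$ lets repeated occurrences of a keyword keep adding to the score for longer, while $b = 0$ switches length normalization off entirely.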

After computing the BM25 score of each document, which measures its relevance to the given query, we sort the documents in descending order of score, from most relevant to least relevant.
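
The scoring and sorting steps can be sketched from scratch on a toy corpus. This is only an illustration under stated assumptions: the helper `bm25_scores`, the three example documents, and the smoothed IDF $\ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)$ are our own choices, and library implementations may differ in the details.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Return the BM25 score of every document for the query (illustrative helper)."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avgdl
        score = 0.0
        for q in query_tokens:
            tf = term_freq[q]
            if tf == 0:
                continue  # keyword absent from this document: contributes nothing
            # Smoothed IDF (an assumed variant that is never negative).
            idf = math.log((n_docs - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Made-up toy corpus, tokenized on whitespace.
docs = [
    "the cat sat on the mat",
    "dogs chase the cat",
    "algebra is a branch of mathematics",
]
tokenized_docs = [doc.split() for doc in docs]

scores = bm25_scores("cat mat".split(), tokenized_docs)

# Sort document indices from most relevant to least relevant.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

The first document matches both keywords and ranks first, the second matches only `cat`, and the third matches neither and receives a score of zero.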

On the implementation side, we use the `rank_bm25` library developed by Dorian Brown (https://github.com/dorianbrown/rank_bm25), which implements several variants of the BM25 algorithm. It can be installed with `pip install rank-bm25`.

In [None]:
# Tokenize the corpus on any whitespace, so that no empty tokens or
# tokens with an attached newline end up in the index
tokenized_corpus = [doc.split() for doc in corpus]

# Instantiate BM25 object from the tokenized corpus
bm25 = BM25Okapi(tokenized_corpus)
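
Whitespace handling matters at the tokenization step, because BM25 only matches tokens that are exactly equal. The sketch below, run on a made-up string, compares `str.split(" ")` with the argument-free `str.split()`.

```python
# Made-up text with a double space and newlines, as can occur in text read from files.
text = "atm  unit of\npressure\n"

# Splitting on a single space keeps empty strings and leaves newlines attached.
naive_tokens = text.split(" ")

# Splitting on any run of whitespace gives clean tokens.
clean_tokens = text.split()

print(naive_tokens)
print(clean_tokens)
```

With the first variant, a query keyword such as `pressure` would not match the token `'of\npressure\n'`, so the document would be missed for that keyword.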

In [None]:
query = "math"
tokenized_query = query.split(" ")

bm25.get_top_n(tokenized_query, corpus, n=1)