### **Toulouse School of Economics**
#### **M2 Statistics & Econometrics**
---

### **Mathematics of Deep Learning Algorithms, Part 2**
# **Final Project: *Performance Benchmarking of Different Information Retrieval Methods***

### **Anh-Dung LE, Paul MELKI**

---

In this project, we aim at comparing the performance of different Information Retrieval techniques, mainly: **BM25** and **BERT-based search engine**. We work on a corpus formed of the latest dump of English Wikipedia, and restrict our work to only a small subset of this dump (mainly, articles whose title starts with the letter 'A'), and that is due to unavailability of enough computational resources. 

But first, we start with some preliminary steps: 

### **Preliminaries & Corpus Creation**

In [3]:
# Install required libraries
!pip install rank-bm25



In [4]:
pip install -q tf-models-official==2.3.0

[K     |████████████████████████████████| 849kB 4.9MB/s 
[K     |████████████████████████████████| 174kB 9.6MB/s 
[K     |████████████████████████████████| 1.1MB 15.4MB/s 
[K     |████████████████████████████████| 102kB 9.0MB/s 
[K     |████████████████████████████████| 37.6MB 1.3MB/s 
[K     |████████████████████████████████| 358kB 48.8MB/s 
[?25h  Building wheel for py-cpuinfo (setup.py) ... [?25l[?25hdone


In [5]:
# Define the path to the project's directory
PATH = '/content/drive/MyDrive/College Material/Master 2/Mathematics of Deep Learning Algorithms/Final Project'

# Load drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
# Import required libraries
import os
import pprint as pp
import numpy as np
import json
import tensorflow as tf 
from gensim.corpora import WikiCorpus
from rank_bm25 import BM25Okapi, BM25Plus

from official.modeling import tf_utils
from official import nlp
from official.nlp import bert

# Load the required submodules
import official.nlp.optimization
import official.nlp.bert.bert_models
import official.nlp.bert.configs
import official.nlp.bert.run_classifier
import official.nlp.bert.tokenization
import official.nlp.data.classifier_data_lib
import official.nlp.modeling.losses
import official.nlp.modeling.models
import official.nlp.modeling.networks

In order to create our own local textual corpus based from Wikipedia, we make use of the class `WikiCorpus` implement in the `gensim.corpora` library. This class implements different functions that facilitate the handling and manipulation of Wikipedia dumps, which are usually downloaded as BZ2-compressed XML files.

Based on this library, we create our own function to read and save the corpus locally, with each Wikipedia being saved in its own `.txt` file:

In [7]:
# Define function to read and create corpus from downloaded dump
def make_corpus(in_file, out_directory):
    """
    Function that converts a Wikipedia .xml dump into a 
    corpus, saving each article in a separate .txt file.
    
    Parameters
    ----------
    @param in_file: str, 
        A valid string specifying the path to the local *.xml.bz2 Wikipedia 
        dump file.
    @param out_directory, str,
        A valid string specifying the path to the directory in which we wish to
        save the created .txt files.
    """
    
    # Instantiate WikiCorpus object, based on the local dump file.
    wiki = WikiCorpus(in_file)
    print("Corpus is read!")
    
    # Initialize counter of articles read.
    i = 0
    
    print("Getting texts...")
    # For new article read, do...
    for text in wiki.get_texts():
        # Create and open new file for new article.
        output_file = open(f'{out_directory}\\{str(i+1)}.txt', 'w')
        # Extract the text of the read article.
        article_text = bytes(' '.join(text), 'utf-8').decode('utf-8') + '\n'
        # Take only first 1000 words from each article, to keep sizes small.
        first_n_words = ' '.join(article_text.split(' ')[0:1000])
        # Write text to file & close the file.
        output_file.write(first_n_words)
        output_file.close()
        # Update counter
        i = i + 1
        # If 1000 articles have been read, stop reading.
        if (i % 1000 == 0):
            print(f'Processed {str(i)} articles')
            break

    print('Processing Complete!')

In [None]:
# Initialize input and output paths
in_path = "C:\\Users\\Paul\\Documents\\Python Scripts\\Data\\enwiki-latest-pages-articles1.xml-p1p41242.bz2"
out_path = "C:\\Users\\Paul\\Documents\\Python Scripts\\Data\\Wiki Corpus"

# Create corpus!
make_corpus(in_path, out_path)

NameError: name 'make_corpus' is not defined

Now that the corpus is created, we also need to create a function to read the corpus from the files we created.

In [8]:
def read_corpus(corpus_directory):
    """
    Function that iteratively reads the saved articles from the corpus directory
    and appends the text to a list.
    
    Parameters
    ----------
    @param corpus_directory: str,
        A valid string specifying the path to the local directory in which the 
        files were saved using make_corpus().
        
    Returns
    -------
    @return corpus, list
        A list containing the text of an article in each element.
    """
    
    # Initialize empty corpus list
    corpus = []
    
    # For each file in the corpus directory, do...
    print("Reading local corpus, please wait...")

    for filename in os.listdir(corpus_directory):
        file = open(f'{corpus_directory}/{filename}', 'r',
                    encoding="utf8")
        article_text = file.read()
        corpus.append(article_text)
        
    # Done, return
    print("Done!")
    return corpus

In [10]:
# Read corpus! 
corpus = read_corpus(f'{PATH}/Wiki Corpus/')

# Look at some example...
corpus[3][0:100]

# For some reason (only in Google Colab), this cell might need to be stopped
# for the first run, then run again. 

Reading local corpus, please wait...
Done!


'anarchism is political philosophy and movement that is sceptical of authority and rejects all involu'

### **BM25 Implementation**

The first Information Retrieval method we try is the **BM25** method, which is a TF-IDF method, that retrieves the article that has the highest score based on the query given. 

Given, a document $D$ and a $Q$ that contains keywords $q_1,..., q_n$, we define the BM25 score of the document $D$ as:

$$
score(D, Q) = \sum_{i = 1}^n IDF(q_i) \cdot \frac{TF(q_i, D) \cdot (k_1 + 1)}{TF(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl} \right)}
$$

where: 
- $TF(q_i, D)$ is the *text frequency* of keyword $q_i$ in document $D$,
- $IDF(q_i)$ is the *inverse document frequency* of keyword $q_i$, using the well-known definition,
- $|D|$ is the length of the document $D$ in words.
- $avgdl$ is the average document length in words in the whole corpus.
- $k_1$ and $b$ are free parameters that are chosen rather than estimated, and which are usually chosen as $k_1 \in [1.2, 2.0]$ and $b = 0.75$. These may also be chosen based on some advanced optimization.

After computing the BM25 score of each document, which gives the relevance of each document to the given query, we sort the documents in descending order from most relevant to least relevant.

On the implementation side, we use `Rank-BM25` library developed by Dorian Brown (https://github.com/dorianbrown/rank_bm25), and which implements different variants of the BM25 algorithm. It can be easily installed using `pip install rank-bm25`. 

In [11]:
# Tokenize the corpus
tokenized_corpus = [doc.split(" ") for doc in corpus]

# Instantiate BM25 object from the tokenized corpus
bm25 = BM25Okapi(tokenized_corpus)

Now we create  a simple function that a takes a string query, and a number `n` of required results, and returns the `n` most relevant results from our corpus:

In [12]:
def bm25okapi_search(tokenized_query, bm25, corpus, n_results = 1):
    """
    Function that takes a tokenized query and prints the first 100 words of the 
    n_results most relevant results found in the corpus, based on the BM25
    method.
    
    Parameters
    ----------
    @param tokenized_query: list, array-like
        A valid list containing the tokenized query.
    @param bm25: BM25 object,
        A valid object of type BM25 (BM25Okapi or BM25Plus) from the library
        `rank-bm25`, initialized with a valid corpus.
    @param corpus: list, array-like
        A valid list containing the corpus from which the BM25 object has been 
        initialized. As returned from function read_corpus().
    @param n_results: int, default = 1
        The number of top results to print.
    """
    
    # We skip checking validity of arguments for now... We assume the user 
    # knows what they're doing.
    
    # Get top results for the query
    top_results = bm25.get_top_n(tokenized_query, corpus, n = n_results)
    
    # Take only first 100 words from each result
    top_results_100words = [' '.join(top_result.split(' ')[0:100]) 
                             for top_result in top_results]
    
    # Print results
    print(f'Query: "{query}"\n')
    print(f'Top {n_results} results from Wikipedia:\n\n')
    i = 1
    for result in top_results_100words: 
        print(f'{i}. {result}\n\n')
        i = i + 1

As we know the topics of some of the articles included, we implement some queries about these topics and see whether their relevant articles are returned. Some of these topics included:
- Autism
- Anarchism 
- ATM 

We first try to implement some simple queries that include only the title of the article, and see if the relevant article is returned.

In [None]:
query = "autism"
tokenized_query = query.split(" ")
bm25okapi_search(tokenized_query = tokenized_query,
                 bm25 = bm25, 
                 corpus = corpus,
                 n_results = 5)

Query: "autism"

Top 5 results from Wikipedia:


1. autism is developmental disorder characterized by difficulties with social interaction and communication and by restricted and repetitive behavior parents often notice signs during the first three years of their child life these signs often develop gradually though some children with autism experience worsening in their communication and social skills after reaching developmental milestones at normal pace autism is associated with combination of genetic and environmental factors risk factors during pregnancy include certain infections such as rubella toxins including valproic acid alcohol cocaine pesticides lead and air pollution fetal growth restriction and autoimmune diseases controversies surround other proposed environmental causes for


2. alfonso cuarón born november is mexican film director screenwriter producer cinematographer and editor his other notable films from variety of film genres including the family drama little prin

In [None]:
query = "anarchism"
tokenized_query = query.split(" ")
bm25okapi_search(tokenized_query = tokenized_query,
                 bm25 = bm25, 
                 corpus = corpus,
                 n_results = 5)

Query: "anarchism"

Top 5 results from Wikipedia:


1. anarchism is political philosophy and movement that is sceptical of authority and rejects all involuntary coercive forms of hierarchy anarchism calls for the abolition of the state which it holds to be undesirable unnecessary and harmful it is usually described alongside libertarian marxism as the libertarian wing libertarian socialism of the socialist movement and as having historical association with anti capitalism and socialism the history of anarchism goes back to prehistory when humans arguably lived in anarchistic societies long before the establishment of formal states realms or empires with the rise of organised hierarchical bodies scepticism toward authority also rose


2. anarcho capitalism is political philosophy and economic theory that advocates the elimination of centralized states in favor of system of private property enforced by private agencies free markets and the right libertarian interpretation of self ownersh

In the above two queries, we see that the results obtained are relevant. For the query about "autism", only the top result seems to be relevant. However, this could be simply due to the unavailability of more relevant articles in the small corpus we have, and not due to a problem in the method.

In the second query related to "anarchism", we see that the top three results are relevant indeed: the first one being an article exactly related to the topic, the second one being a related one and the third being about an author (Ayn Rand) who wrote many pieces and books about anarchism.

So far, BM25 looks like a usesful method. However, we will see how it fails when the queries become more complicated, such as when they contain a question, a whole sentence, or an abbreviation. We search for an abbreviation ("ATM") and look at the results:

In [None]:
query = "ATM"
tokenized_query = query.split(" ")
bm25okapi_search(tokenized_query = tokenized_query,
                 bm25 = bm25, 
                 corpus = corpus,
                 n_results = 5)

Query: "ATM"

Top 5 results from Wikipedia:


1. events the battle of hormozdgan is fought ardashir defeats and kills artabanus effectively ending the parthian empire emperor constantius ii enters rome for the first time to celebrate his victory over magnus magnentius assassination of conrad of montferrat conrad king of jerusalem in tyre two days after his title to the throne is confirmed by election the killing is carried out by hashshashin nichiren japanese buddhist monk propounds namu myōhō renge kyō for the very first time and declares it to be the essence of buddhism in effect founding nichiren buddhism the battle of cerignola is fought it is noted


2. an abscess is collection of pus that has built up within the tissue of the body signs and symptoms of abscesses include redness pain warmth and swelling the swelling may feel fluid filled when pressed the area of redness often extends beyond the swelling carbuncles and boils are types of abscess that often involve hair follicles wi

We can clearly see that the results obtained are not relevant to queried term.

Let's try to query for the same topics as above, but using a more complicated query:

In [13]:
query = "what is anarchism?"
tokenized_query = query.split(" ")
bm25okapi_search(tokenized_query = tokenized_query,
                 bm25 = bm25, 
                 corpus = corpus,
                 n_results = 5)

Query: "what is anarchism?"

Top 5 results from Wikipedia:


1. an author is the creator or originator of any written work such as book or play and is also considered writer more broadly defined an author is the person who originated or gave existence to anything and whose authorship determines responsibility for what was created legal significance of authorship typically the first owner of copyright is the person who created the work the author if more than one person created the work then case of joint authorship can be made provided some criteria are met in the copyright laws of various jurisdictions there is necessity for little flexibility regarding what


2. aesthetics or esthetics is branch of philosophy that deals with the nature of beauty and taste as well as the philosophy of art its own area of philosophy that comes out of aesthetics it examines subjective and sensori emotional values or sometimes called judgments of sentiment and taste aesthetics covers both natural and art

In the above results, we can see clearly how BM25 fails as the queries become more complicated. This could mainly due to the fact that it is a pure TF-IDF method that does not prioritize keywords in the query over other words. 

This problem could be solved by combining BM25 with more advanced text processing techniques. Indeed, as we can see, we are not applying any advanced processing techniques such as lemmatization or keyword extraction. Further experiments will work on implementing these.

### **BERT-Based Implementation**

Following Nogueira and Cho's (2019) method, we try to implement BERT as a document re-ranker that will rank the relevance of the documents in the corpus with respect to a given query. 

As we know, BERT for classification tasks takes two sentences as input. Given a document $D$ and a query $Q$ that have been tokenized using a BERT tokenizer, we concatenate the query (Sentence 1) and the document (Sentence 2) together, separating them with a `[CLS]` classification token, and feed them to the original pre-trained BERT model implement as a binary classifier where the two classes are: 

$$
\begin{cases}
0 = \text{not relevant}, \\
1 = \text{relevant}
\end{cases}
$$

As such, BERT will return the probability of document $D$ being relevant to the query $Q$. Given a certain query $Q$, we apply this method on all documents $D_1, D_2, ..., D_n$ in the corpus and get a *relevance score* for each of them. The documents are then ranked by their obtained scores from most relevant to least relevant (similarly to BM25) and this will be the result of our information retrieval task.

First we start by preparing everything for the model:

We retrieve the BERT configurations directory from official Google servers, and read the BERT configs from `json` file:

In [14]:
# Retrieve BERT configs directory from official Google servers
gs_folder_bert = "gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-12_H-768_A-12"

# Let's take a look at the content of the directory
tf.io.gfile.listdir(gs_folder_bert)

['bert_config.json',
 'bert_model.ckpt.data-00000-of-00001',
 'bert_model.ckpt.index',
 'vocab.txt']

In [15]:
# Read BERT configs
bert_config_file = os.path.join(gs_folder_bert, 'bert_config.json')
config_dict = json.loads(tf.io.gfile.GFile(bert_config_file).read())
bert_config = bert.configs.BertConfig.from_dict(config_dict)

# Take a look at the BERT configs
config_dict

{'attention_probs_dropout_prob': 0.1,
 'hidden_act': 'gelu',
 'hidden_dropout_prob': 0.1,
 'hidden_size': 768,
 'initializer_range': 0.02,
 'intermediate_size': 3072,
 'max_position_embeddings': 512,
 'num_attention_heads': 12,
 'num_hidden_layers': 12,
 'type_vocab_size': 2,
 'vocab_size': 30522}

Now set up the BERT tokenizer that will be used to tokenize both the Wikipedia articles and the queries:

In [16]:
tokenizer = bert.tokenization.FullTokenizer(
    vocab_file = os.path.join(gs_folder_bert, 'vocab.txt'),
    do_lower_case = True
)

print('Vocab size: ', len(tokenizer.vocab))

Vocab size:  30522


Now, we create functions that will tokenize, encode and prepare our text to be fed into BERT for scoring:

In [17]:
def encode_text(text, tokenizer):
    """
    Function that takes a text string and a BERT-compatible tokenizer
    and returns the tokenized text with the '[SEP]' flag appended, 
    after taking a subset of the tokens' list in order to stay under 
    the 512 BERT max sequence length.

    This function is a utility function for the following 'bert_encode' function.

    Parameters
    ----------
    @param text: str,
        A valid string of text to be tokenized
    @param tokenizer: BERT.tokenization function,
        A valid BERT-compatible tokenizer

    Returns
    -------
    @return tokenized text, list
    """

    # Retrieve tokens from tokenizer
    tokens = list(tokenizer.tokenize(text))
    # Take only the first 450 elements
    tokens = tokens[0:450]
    # Append [SEP]
    tokens.append('[SEP]')
    return tokenizer.convert_tokens_to_ids(tokens)

def bert_encode(corpus, query, tokenizer):
    """
    Function that takes a corpus, a query and a tokenizer and returns the 
    query and all texts in the corpus concatenated together and separated by
    [CLS] flag, then tokenized and ready for BERT.

    This function utilizes the previous utility function 'encode_text'.
@param corpus: list,
        A valid list of string elements where each element is an article in our
        corpus. As returned from 'read_corpus' function. As returned from 'read_corpus' function.
    @param query: string,
        A valid text string which is the query for which answers need to be 
        retrieved.
    @param tokenizer: BERT.tokenization function,
        A valid BERT-compatible tokenizer.

    Returns
    -------
    @return inputs: dict,
        A dictionary containg three elements: 
            - input_word_ids: TF.io.tensor, 
                    The tokenized words ids.
            - input_mask: TF.io.tensor,
                    Tensor taking values based on whether the element at each 
                    position is a mask (flag) or not.
            - input_type_ids: TF.io.tensor,
                    Tensor taking values based on the type of the input element
                    at each position.
    """

    # Compute corpus length.
    corpus_length = len(corpus)

    # Transform each article in the corpus to a TF ragged constant.
    tf_corpus = tf.ragged.constant(
        [encode_text(article, tokenizer) for article in corpus]
        )

    # Encode the query, then transform it to a TF ragged constant of same 
    # length as the corpus.
    encoded_query = encode_text(query, tokenizer)
    tf_query = tf.ragged.constant(
        [encoded_query for i in range(corpus_length)]
        )
    
    # Create as many [CLS] flags as the number of articles in the corpus.
    cls = [tokenizer.convert_tokens_to_ids(['[CLS]'])] * tf_corpus.shape[0]
    # Concatenate all elements together 
    input_word_ids = tf.concat([cls, tf_query, tf_corpus], axis = -1)

    # Create masks tensor...
    input_mask = tf.ones_like(input_word_ids).to_tensor()

    # Create types tensors...
    type_cls = tf.zeros_like(cls)
    type_corpus = tf.zeros_like(tf_corpus)
    type_query = tf.zeros_like(tf_query)
    # ... and concatenate them together
    input_type_ids = tf.concat(
        [type_cls, type_query, type_corpus],
        axis = -1
    ).to_tensor()

    # Prepare results dictionary for returning...
    inputs = {
        'input_word_ids' : input_word_ids.to_tensor(),
        'input_mask' : input_mask,
        'input_type_ids' : input_type_ids
    }

    # Return...
    return inputs

Let's try our function and see if they work properly:

In [18]:
text = 'this is a text to test our functions'

# Try...
encode_text(text, tokenizer)

[2023, 2003, 1037, 3793, 2000, 3231, 2256, 4972, 102]

Now let's try to tokenize and encode our full corpus with a given query and take a look at the specifications of the obtained results:

In [55]:
query = 'anarchism'
query_data1 = bert_encode(
    corpus = corpus,
    query = query,
    tokenizer = tokenizer
)

In [20]:
for key, value in query_data1.items():
  print(f'{key:15s} shape: {value.shape}')

input_word_ids  shape: (1000, 456)
input_mask      shape: (1000, 456)
input_type_ids  shape: (1000, 456)


Everything looks working great!

#### **BERT without Finetuning**

As we know, the corpus on which BERT has been trained contains the **full English Wikipedia** (2,500M words) along with the BooksCorpus (800M words).

For this reason, we thought that we do not need to re-train and finetune BERT for our scoring task, since it has already seen the articles found in our corpus. Being trained on document-level corpus and not word-based ones, BERT would be able to idenitfy the connections between our queries and the articles available in the small corpus that we have.

Furthermore, finetuning BERT would require training again on query-answers data sets such as [**MSMARCO**](https://microsoft.github.io/msmarco/) or [**TREC-CAR**](https://trec.nist.gov/pubs/trec26/papers/Overview-CAR.pdf), which were used by Nogueira and Cho (2019) in their implementation. However, due to network constraints (downloading the huge data sets proved not possible) and computational constraints, as well as time constraints (according to Nogueira and Cho, finetuning BERT required more than 30 hours of training), we were unable to finetune it to our specific task. We assumed that it may provide good reasults 'out-of-the-box', however, experimental results have shown otherwise:

Let's create a BERT classifier and prepare it for precdiciton:

In [23]:
# BERT configs already imported
bert_classifier, bert_encoder = bert.bert_models.classifier_model(
    bert_config, num_labels = 2
)

As feeding the whole corpus at once to BERT results in RAM overload, we predict labels in batches instead, manually:

In [52]:
def bert_score(inputs, bert_classifier):
    """
    Function that takes a dictionary of inputs returned from
    'bert_encode' and computes a relevance score for each of the 
    documents in the corpus.
    
    This function is a utility function for the function 'bert_search'
    
    Parameters
    ----------
    @param inputs: dict,
        A dictionary containg three elements: 
            - input_word_ids: TF.io.tensor, 
                    The tokenized words ids.
            - input_mask: TF.io.tensor,
                    Tensor taking values based on whether the element at each 
                    position is a mask (flag) or not.
            - input_type_ids: TF.io.tensor,
                    Tensor taking values based on the type of the input element
                    at each position.
        As returned from 'bert_encode'
    @param bert_classifier: TF.classifier,
        A valid TensorFlow BERT classifier object.
    """
    
    # Initialize list of results
    results = []
    # Counter of scored articles
    i = 0
    
    # For each article in the corpus:
    while i < len(corpus):
        print(f'Scored {i + 5} examples!')
        # Create batch of 5 examples
        batch = {
            'input_type_ids': inputs['input_type_ids'][i:(i+5)],
            'input_mask': inputs['input_mask'][i:(i+5)],
            'input_word_ids': inputs['input_word_ids'][i:(i+5)]
        }
        # Compute scores for articles in the batch
        result = bert_classifier(batch, training = False)
        results.append(result)
        i = i + 6
        
    # Return obtained results
    return results
        
def bert_search(scores, corpus, n_results = 5):
    """
    Function that takes the scores obtained from the function 'bert_score' and
    returns the top n_results most relevant articles in the corpus based
    on the scores.
    
    Parameters
    ---------- 
    @param scores: list,
        A valid list of scores as returned by the function bert_score.
    @param corpus: list,
        A valid list of string elements where each element is an article in our
        corpus. As returned from 'read_corpus' function.
    @param n_results: int,
        A valid positive integer specifying the number of search results 
        required.
        
    Returns
    -------
    @return relevant_results: list,
        List of the n_results most relevant articles from the corpus.
    """

    # Retrieve the 2nd score returned by BERT, which is the relevance score.
    relevance_score = [scores[i][:, 1] for i in range(len(scores) - 1)]
    
    # Append all the results in one list
    relevance_list = []
    for i in range(len(relevance_score) - 1):
        for j in range(5):
            relevance_list.append(float(relevance_score[i][j]))
            
    # Retrieve the indices of the top n_results most relevant results.
    relevance_list = np.asarray(relevance_list, dtype = np.float32)
    idx = (-relevance_list).argsort()[:n_results]
    
    # Retrieve the relevant results from the corpus.
    relevant_results = [corpus[i] for i in idx]
    
    # Done, return!
    return relevant_results

Let's try to look at the search results using our created functions:

In [36]:
scores = bert_score(inputs = query_data1,
                    bert_classifier = bert_classifier)

Scored 5 examples!
Scored 11 examples!
Scored 17 examples!
Scored 23 examples!
Scored 29 examples!
Scored 35 examples!
Scored 41 examples!
Scored 47 examples!
Scored 53 examples!
Scored 59 examples!
Scored 65 examples!
Scored 71 examples!
Scored 77 examples!
Scored 83 examples!
Scored 89 examples!
Scored 95 examples!
Scored 101 examples!
Scored 107 examples!
Scored 113 examples!
Scored 119 examples!
Scored 125 examples!
Scored 131 examples!
Scored 137 examples!
Scored 143 examples!
Scored 149 examples!
Scored 155 examples!
Scored 161 examples!
Scored 167 examples!
Scored 173 examples!
Scored 179 examples!
Scored 185 examples!
Scored 191 examples!
Scored 197 examples!
Scored 203 examples!
Scored 209 examples!
Scored 215 examples!
Scored 221 examples!
Scored 227 examples!
Scored 233 examples!
Scored 239 examples!
Scored 245 examples!
Scored 251 examples!
Scored 257 examples!
Scored 263 examples!
Scored 269 examples!
Scored 275 examples!
Scored 281 examples!
Scored 287 examples!
Scored 29

In [56]:
n_results = 5
results = bert_search(scores, corpus, n_results)

tf.Tensor([-0.16832183 -0.16766648 -0.17123891 -0.18343538 -0.18021578], shape=(5,), dtype=float32)


In [59]:
print(f'Query: "{query}"\n')
print(f'Top {n_results} results from Wikipedia:\n\n')
i = 1
for result in results: 
    print(f'{i}. {result}\n\n')
    i = i + 1

Query: "anarchism"

Top 5 results from Wikipedia:


1. apollo was an october space mission carried out by the united states it was the first crewed flight in nasa apollo program and saw the resumption of human spaceflight by the agency after the fire that killed the three apollo astronauts in january the apollo crew was commanded by walter schirra with command module pilot donn eisele and lunar module pilot walter cunningham so designated even though apollo did not carry lunar module the three astronauts were originally designated for the second crewed apollo flight and then as backups for apollo after the fire crewed flights were suspended while the cause of the accident was investigated and improvements made to the spacecraft and safety procedures and uncrewed test flights made determined to prevent repetition of the fire the crew spent long periods of time monitoring the construction of their apollo command and service modules csm training continued over much of the month pause that

Unfortunately, we notice that BERT does not provide results that are relevant to the query. However, we believe that this downside can be solved if finetuning to our task was possible.

In reality, the main downside we find to BERT, in comparison to BM25, is time consumption. Indeed, ranking all documents in our corpus using BM25 is a task that takes only a couple of seconds using BM25 but consumes time in the order to minutes (more than 10 minutes in general) using BERT. This can be a game-changing downside especially since most users in information retrieval applications are not only looking for reliable and correct results, but also *fast* results. That is, that can be retrieved quickly.

**Note**: In order to use BERT trained on MS-MARCO, simply obtain the model from the official Nogueira and Cho (2019) repository (link provided in README) then load its `configs` as done previously, instead of the original BERT configuations.

---