# Okapi BM25 Retrieval model

This notebook implements the Query-Likelihood retrieval model for mixed-language retrieval. It takes about 9 hours to run on Colab with the high-RAM CPU configuration.

## Mount and Install

Mount the Drive folder that holds the files this notebook needs. If running locally, skip the next two cells, which handle mounting.

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/IR_Project
!ls

/content/drive/MyDrive/IR_Project
lid.176.bin  Project_qrels.csv	Project_topics.csv  __pycache__  requirements.txt  utils.py


## Install list of packages

Install all the modules needed for this notebook to run.

In [3]:
!pip install -r requirements.txt

Collecting fasttext-langdetect (from -r requirements.txt (line 1))
  Downloading fasttext-langdetect-1.0.5.tar.gz (6.8 kB)
  Preparing metadata (setup.py) ... done
Collecting translators (from -r requirements.txt (line 2))
  Downloading translators-5.8.9-py3-none-any.whl (54 kB)
Collecting sacremoses (from -r requirements.txt (line 4))
  Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Collecting sentence-transformers (from -r requirements.txt (line 5))
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting faiss-cpu (from -r requireme

## Modules and Translator

Import all required modules, including the custom utils module. Set the translator to use for this experiment: 'google', 'bing', 'alibaba' or 'baidu'.

In [4]:
from utils import *
# HTML, clear_output and display all live in IPython.display
from IPython.display import HTML, clear_output, display
import itertools
import json

TRANSLATOR = 'google'

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.



## Load Pyserini indexes

Load pre-built indexes.

In [5]:
searcher_en = LuceneSearcher.from_prebuilt_index('miracl-v1.0-en')
searcher_es = LuceneSearcher.from_prebuilt_index('miracl-v1.0-es')
searcher_fr = LuceneSearcher.from_prebuilt_index('miracl-v1.0-fr')

reader_en = IndexReader.from_prebuilt_index('miracl-v1.0-en')
reader_es = IndexReader.from_prebuilt_index('miracl-v1.0-es')
reader_fr = IndexReader.from_prebuilt_index('miracl-v1.0-fr')

Downloading index at https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene-index.miracl-v1.0-en.20221004.2b2856.tar.gz...


lucene-index.miracl-v1.0-en.20221004.2b2856.tar.gz: 16.5GB [03:39, 80.7MB/s]                            


Downloading index at https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene-index.miracl-v1.0-es.20221004.2b2856.tar.gz...


lucene-index.miracl-v1.0-es.20221004.2b2856.tar.gz: 5.06GB [01:06, 81.4MB/s]                            


Downloading index at https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene-index.miracl-v1.0-fr.20221004.2b2856.tar.gz...


lucene-index.miracl-v1.0-fr.20221004.2b2856.tar.gz: 6.23GB [01:49, 61.0MB/s]                            


## Load files

Load the topics file, which holds the queries, and the qrels file, which holds the relevance judgments. Build a judgment hashmap (query id to relevant document ids) for easier lookup.

In [6]:
topics_csv = pd.read_csv('Project_topics.csv')
# number the topics 1..N instead of hard-coding the topic count
topics_csv.insert(0, 'id', range(1, len(topics_csv) + 1))
topics = topics_csv.to_dict('records')

In [7]:
qrels_csv = pd.read_csv('Project_qrels.csv')
qrels = qrels_csv.values.tolist()

In [8]:
# map each query id to the list of document ids with a non-zero judgment
qrel_judgement_dict = dict()
for qrel in qrels:
    if qrel[3] != 0:
        qrel_judgement_dict.setdefault(qrel[0], []).append(qrel[1])
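
As a minimal sketch of the same grouping on toy data, the snippet below builds the judgment map with `collections.defaultdict`. The row layout (query id, document id, an unused column, relevance) is inferred from the indexing in the cell above; the rows themselves and the name `build_judgement_map` are made up for illustration.

```python
from collections import defaultdict

def build_judgement_map(rows):
    """Group relevant document ids by query id.

    Assumes each row is [query_id, doc_id, <unused>, relevance], matching
    the column positions used in the notebook cell; rows whose relevance
    is 0 are skipped.
    """
    judgements = defaultdict(list)
    for row in rows:
        query_id, doc_id, relevance = row[0], row[1], row[3]
        if relevance != 0:
            judgements[query_id].append(doc_id)
    return dict(judgements)

# Hypothetical qrels rows, not taken from Project_qrels.csv.
# The third column is a placeholder that the function ignores.
toy_qrels = [
    [1, 'doc-a', None, 1],
    [1, 'doc-b', None, 0],
    [2, 'doc-c', None, 1],
    [1, 'doc-d', None, 1],
]
toy_map = build_judgement_map(toy_qrels)
print(toy_map)
```

Queries with no relevant documents never appear as keys, so a lookup for such a query needs a default (for example `toy_map.get(query_id, [])`).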

## Retrieve and Calculate mAP

Retrieve the top documents using the Okapi BM25 model and calculate the Mean Average Precision (mAP@100).
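
The retrieval cell below merges the three per-language result lists by dividing each score by the sum of the scores in its own list and then sorting the union. A minimal sketch of that merge on made-up hits follows. The `(score, doc_id, language)` tuple layout is inferred from how the hits are indexed in the cell, and `merge_by_sum_normalization` is a hypothetical helper name. The sketch assumes positive scores and skips any list whose scores sum to zero, a guard the notebook cell does not have.

```python
import itertools

def merge_by_sum_normalization(*hit_lists, k=100):
    """Sum-normalize each hit list, then merge and keep the top k.

    Each hit is assumed to be a (score, doc_id, language) tuple.
    """
    normalized = []
    for hits in hit_lists:
        total = sum(score for score, _, _ in hits)
        if total == 0:
            # Nothing to normalize by (empty list or all-zero scores).
            continue
        normalized.append(
            [(score / total, doc_id, lang) for score, doc_id, lang in hits]
        )
    # Tuples sort by their first element, so this ranks by normalized score.
    merged = sorted(itertools.chain.from_iterable(normalized), reverse=True)
    return merged[:k]

# Made-up hits for two languages.
toy_en = [(6.0, 'en-1', 'en'), (2.0, 'en-2', 'en')]
toy_es = [(3.0, 'es-1', 'es'), (2.0, 'es-2', 'es')]
merged = merge_by_sum_normalization(toy_en, toy_es, k=3)
print([doc_id for _, doc_id, _ in merged])
```

Note that if a scorer returned negative values (for example log-likelihoods), dividing by a negative sum would flip the ordering within that list, so the sign of the scores is worth checking before relying on this kind of merge.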

In [9]:
average_precision_list = list()
top_3_en_docs = list()
top_3_es_docs = list()
top_3_fr_docs = list()
top_10_hits = list()
for i,topic in enumerate(topics[:5]):
    translated_queries = translate_query(TRANSLATOR,topic['title'])

    hits_en = query_likelihood_model(translated_queries['english'],reader_en,searcher_en,100,'en')
    hits_es = query_likelihood_model(translated_queries['spanish'],reader_es,searcher_es,100,'es')
    hits_fr = query_likelihood_model(translated_queries['french'],reader_fr,searcher_fr,100,'fr')

    top_3_en_docs.append(hits_en[:3])
    top_3_es_docs.append(hits_es[:3])
    top_3_fr_docs.append(hits_fr[:3])

    scores_sum_en = sum([x[0] for x in hits_en])
    scores_sum_es = sum([x[0] for x in hits_es])
    scores_sum_fr = sum([x[0] for x in hits_fr])

    normalized_hits_en = [(x[0]/scores_sum_en,x[1],x[2]) for x in hits_en]
    normalized_hits_es = [(x[0]/scores_sum_es,x[1],x[2]) for x in hits_es]
    normalized_hits_fr = [(x[0]/scores_sum_fr,x[1],x[2]) for x in hits_fr]

    hits = sorted(itertools.chain(normalized_hits_en, normalized_hits_es, normalized_hits_fr),reverse = True)[:100]

    top_10_hits.append(hits[:10])

    hit_docs = [hit[1] for hit in hits]

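    # look up judgments by topic id; a topic with no relevant documents gets an empty list
    relevant_docs = qrel_judgement_dict.get(topic['id'], [])
    hit_docs_relevance = [1 if hit_doc in relevant_docs else 0 for hit_doc in hit_docs]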

    # running count of relevant documents seen so far
    relevant_count = 0

    # precision at each rank k (1-based)
    precision = []
    for rank,relevance in enumerate(hit_docs_relevance, start=1):
        relevant_count += relevance
        precision.append(relevant_count/rank)

    # average precision: sum precision@k over the ranks that hold a relevant
    # document, normalized by the number of relevant documents retrieved
    average_precision = sum(p*rel for p,rel in zip(precision,hit_docs_relevance))
    if relevant_count != 0:
        average_precision = average_precision/relevant_count

    average_precision_list.append(average_precision)

    clear_output(wait=True)

    # print the query that was just processed
    print(topic['title'],'completed')

# clear console
clear_output(wait=True)
# find mean of all average precisions
mean_average_precision = sum(average_precision_list)/len(average_precision_list)
print(f'Mean Average Precision (@mAP100): {mean_average_precision}')

Mean Average Precision (@mAP100): 0.13553247017453943
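
To make the metric concrete, here is a small sketch of the average-precision arithmetic used in the loop above, run on a made-up relevance list. Like the cell, it divides by the number of relevant documents that were actually retrieved; the more common definition divides by the total number of relevant documents for the query, so the two can differ when some relevant documents are missed. The function names are illustrative only.

```python
def average_precision(relevance):
    """Average precision over a ranked list of 0/1 relevance flags.

    Normalizes by the number of relevant documents in the list itself,
    mirroring the notebook cell rather than the textbook definition.
    """
    relevant_seen = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            relevant_seen += 1
            # precision@rank, counted only at ranks holding a relevant document
            precision_sum += relevant_seen / rank
    return precision_sum / relevant_seen if relevant_seen else 0.0

def mean_average_precision(relevance_lists):
    """Mean of the per-query average precisions."""
    aps = [average_precision(r) for r in relevance_lists]
    return sum(aps) / len(aps)

# Made-up ranking: relevant documents at ranks 1 and 3.
toy_relevance = [1, 0, 1, 0]
ap = average_precision(toy_relevance)  # (1/1 + 2/3) / 2
toy_map_score = mean_average_precision([toy_relevance, [0, 0]])
print(ap, toy_map_score)
```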


## Ranked List

Display the top-10 documents from the merged list of normalized scores.

In [10]:
display(HTML('<h3>Display of Top 10 ranked documents example</h3>'))
display(HTML('<h3><span style="color:#ff0000;">Query</span> - {}</h3>'.format(topics[0]['title'])))
query = translate_query(TRANSLATOR,topics[0]['title'])

# map each language code to its index reader and the key of its translated query
readers = {
    'en': (reader_en, 'english'),
    'es': (reader_es, 'spanish'),
    'fr': (reader_fr, 'french'),
}

for _,doc_id,language in top_10_hits[0]:
    if language not in readers:
        continue
    reader, query_key = readers[language]
    # fetch the raw JSON document and render its title and highlighted text
    doc = json.loads(reader.doc(doc_id).raw())
    display(HTML('<h2><p style="color:#3484F0;">{}</p></h2>'.format(doc['title'])))
    display(HTML('<p>{}</p>'.format(pretty_print(doc['text'],query[query_key].split(' ')))))