# Normalized Discounted Cumulative Gain and Document Similarity

This notebook computes the nDCG value for a given query and measures the similarity between the top-3 documents retrieved in each language.

## Mount and Install

Mount the folder that contains the files required by this notebook. If you are running locally, skip the next two cells, which are only needed for mounting.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
%cd /content/drive/MyDrive/IR_Project
!ls

/content/drive/MyDrive/IR_Project
lid.176.bin  Project_qrels.csv	Project_topics.csv  __pycache__  requirements.txt  utils.py


## Install list of packages

Install all the modules needed for this notebook to run.

In [None]:
!pip install -r requirements.txt

Collecting fasttext-langdetect (from -r requirements.txt (line 1))
  Downloading fasttext-langdetect-1.0.5.tar.gz (6.8 kB)
  Preparing metadata (setup.py) ... done
Collecting translators (from -r requirements.txt (line 2))
  Downloading translators-5.8.9-py3-none-any.whl (54 kB)
Collecting sacremoses (from -r requirements.txt (line 4))
  Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Collecting sentence-transformers (from -r requirements.txt (line 5))
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting faiss-cpu (from -r requireme

## Modules and Translator

Import all required modules, including the custom `utils` module. Set the translator to the one needed for this experiment: 'google', 'bing', 'alibaba', or 'baidu'.

In [None]:
from utils import *
from IPython.display import clear_output
import itertools
import json
from IPython.core.display import HTML
from typing import Callable
import math

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.



## Load Pyserini indexes

Load pre-built indexes.

In [None]:
searcher_en = LuceneSearcher.from_prebuilt_index('miracl-v1.0-en')
searcher_es = LuceneSearcher.from_prebuilt_index('miracl-v1.0-es')
searcher_fr = LuceneSearcher.from_prebuilt_index('miracl-v1.0-fr')

reader_en = IndexReader.from_prebuilt_index('miracl-v1.0-en')
reader_es = IndexReader.from_prebuilt_index('miracl-v1.0-es')
reader_fr = IndexReader.from_prebuilt_index('miracl-v1.0-fr')

Downloading index at https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene-index.miracl-v1.0-en.20221004.2b2856.tar.gz...


lucene-index.miracl-v1.0-en.20221004.2b2856.tar.gz: 16.5GB [02:47, 106MB/s]                             


Downloading index at https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene-index.miracl-v1.0-es.20221004.2b2856.tar.gz...


lucene-index.miracl-v1.0-es.20221004.2b2856.tar.gz: 5.06GB [00:58, 92.9MB/s]                            


Downloading index at https://rgw.cs.uwaterloo.ca/pyserini/indexes/lucene-index.miracl-v1.0-fr.20221004.2b2856.tar.gz...


lucene-index.miracl-v1.0-fr.20221004.2b2856.tar.gz: 6.23GB [01:09, 96.9MB/s]                            


## Load files

Load the topics file, which contains the queries, and the qrels file, which contains the relevance judgments. Build judgment lookup dictionaries for ease of coding.

In [None]:
topics_csv = pd.read_csv('Project_topics.csv')
# Number the topics from 1 instead of hard-coding the topic count.
topics_csv.insert(0, 'id', range(1, len(topics_csv) + 1))
topics = topics_csv.to_dict('records')

In [None]:
qrels_csv = pd.read_csv('Project_qrels.csv')
qrels = qrels_csv.values.tolist()

In [None]:
qrel_judgement_dict = dict()
for qrel in qrels:
    if qrel[3] != 0:
        if qrel[0] not in qrel_judgement_dict:
            qrel_judgement_dict[qrel[0]] = [qrel[1]]
        else:
            qrel_judgement_dict[qrel[0]] += [qrel[1]]

In [None]:
doc_relevance_dict = dict()
for qrel in qrels:
    doc_relevance_dict[qrel[1]] = qrel[3]
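
Note that `doc_relevance_dict` is keyed by document ID alone, so if the same document is judged for more than one query, the later row overwrites the earlier one. A minimal sketch of a lookup keyed by both query and document follows. The toy rows, the helper `relevance_for`, and the column layout (query ID, document ID, an unused column, relevance) are assumptions inferred from the indexing used above, not taken from `Project_qrels.csv`.

```python
# Toy rows in the ASSUMED layout: (query_id, doc_id, unused, relevance).
toy_qrels = [
    (1, "doc_a", "en", 2),
    (1, "doc_b", "es", 0),
    (2, "doc_a", "en", 0),
]

# Keyed by document only: the second judgment of doc_a overwrites the first.
by_doc = {}
for row in toy_qrels:
    by_doc[row[1]] = row[3]

# Keyed by (query, document): both judgments of doc_a survive.
by_query_and_doc = {(row[0], row[1]): row[3] for row in toy_qrels}


def relevance_for(query_id, doc_id):
    """Hypothetical helper: judged relevance, or 0 for unjudged pairs."""
    return by_query_and_doc.get((query_id, doc_id), 0)
```

Whether this matters in practice depends on whether any document appears under more than one query in the qrels file.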

## Calculate nDCG and retrieve top docs
Calculate the normalized discounted cumulative gain and retrieve the top documents in each language to find their similarity.
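
As a standalone illustration of the metric, the sketch below computes DCG with a `log2(rank + 1)` discount (rank starting at 1) and divides it by the DCG of the same relevance values sorted in descending order, which is the form used in this notebook. The function names `dcg` and `ndcg` and the toy relevance list are illustrative and not part of `utils.py`.

```python
import math


def dcg(relevance):
    """Discounted cumulative gain; index 0 corresponds to rank 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance))


def ndcg(relevance):
    """DCG divided by the DCG of the best ordering of the same values."""
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal != 0 else 0.0


# Toy ranking: the most relevant document was retrieved at rank 2.
toy_relevance = [0, 2, 1]
toy_score = ndcg(toy_relevance)
```

A ranking that is already in descending order of relevance scores 1.0, and a ranking with no relevant documents scores 0.0 by convention here.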

In [None]:
def ndcg_and_top_docs(translator: str, retrieval_model: Callable, query: str) -> tuple[list, float]:
    """Return the top-3 hits per language and the nDCG of the merged top-100 ranking."""
    translated_queries = translate_query(translator, query)

    # Retrieve 100 hits per language using the translated query.
    hits_en = retrieval_model(translated_queries['english'], reader_en, searcher_en, 100, 'en')
    hits_es = retrieval_model(translated_queries['spanish'], reader_es, searcher_es, 100, 'es')
    hits_fr = retrieval_model(translated_queries['french'], reader_fr, searcher_fr, 100, 'fr')

    # Keep the top three hits of each language for the similarity analysis.
    top_hits = [hits_en[:3], hits_es[:3], hits_fr[:3]]

    # Normalize scores within each language so that the rankings can be merged.
    scores_sum_en = sum(x[0] for x in hits_en)
    scores_sum_es = sum(x[0] for x in hits_es)
    scores_sum_fr = sum(x[0] for x in hits_fr)

    normalized_hits_en = [(x[0] / scores_sum_en, x[1], x[2]) for x in hits_en]
    normalized_hits_es = [(x[0] / scores_sum_es, x[1], x[2]) for x in hits_es]
    normalized_hits_fr = [(x[0] / scores_sum_fr, x[1], x[2]) for x in hits_fr]

    # Merge the three rankings and keep the 100 highest normalized scores.
    hits = sorted(itertools.chain(normalized_hits_en, normalized_hits_es, normalized_hits_fr), reverse=True)[:100]

    # Look up the judged relevance of each merged hit; unjudged documents count as 0.
    relevance = [doc_relevance_dict.get(hit[1], 0) for hit in hits]

    # DCG with a log2(rank + 1) discount, where rank starts at 1.
    dcg = 0
    for i, rel in enumerate(relevance):
        dcg += rel / math.log2(i + 2)

    # Ideal DCG: the same retrieved relevance values in the best possible order.
    idcg = 0
    for i, rel in enumerate(sorted(relevance, reverse=True)):
        idcg += rel / math.log2(i + 2)

    ndcg = dcg / idcg if idcg != 0 else 0

    return top_hits, ndcg
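
The merging step in the function above can be tried in isolation. The sketch below normalizes each language's scores by their sum and then merges the lists by normalized score. The toy hits, the helper name `normalize`, and the guard against a zero score sum are assumptions added for illustration; the real hit tuples come from `utils.py`, and their third element is treated here as an opaque tag.

```python
import itertools


def normalize(hits):
    """Divide each score by the sum of scores in the same list.

    Hypothetical helper; the zero-sum guard is an addition that the
    notebook function does not have.
    """
    total = sum(score for score, _, _ in hits)
    if total == 0:
        return list(hits)
    return [(score / total, doc_id, tag) for score, doc_id, tag in hits]


# Toy hits as (score, doc_id, tag); raw scores are on different scales.
toy_en = [(12.0, "en_1", "en"), (4.0, "en_2", "en")]
toy_es = [(6.0, "es_1", "es"), (4.0, "es_2", "es")]

# Merge by normalized score, highest first.
merged = sorted(
    itertools.chain(normalize(toy_en), normalize(toy_es)),
    reverse=True,
)
merged_doc_ids = [doc_id for _, doc_id, _ in merged]
```

Because each list is divided by its own sum, a language whose scores are spread evenly over many hits contributes lower normalized scores than one dominated by a few strong hits.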

Find nDCG for all pairs of translators and models for comparison.

In [None]:
translators = ['google','bing','alibaba','baidu']
models = [okapi_bm25,okapi_tf_idf,query_likelihood_model]
model_names = ['Okapi BM25','Okapi TF-IDF','Query Likelihood Model']
ndcg_dict = dict()
for i,model in enumerate(models):
    if model_names[i] not in ndcg_dict:
        ndcg_dict[model_names[i]] = dict()
    for t in translators:
        if model_names[i] == 'Okapi TF-IDF' and t == 'google':
            top_hits,ndcg = ndcg_and_top_docs(t,model,topics[3]['title'])
        else:
            _,ndcg = ndcg_and_top_docs(t,model,topics[3]['title'])
        ndcg_dict[model_names[i]][t] = ndcg

print('nDCG scores for query:',topics[3]['title'])
display(pd.DataFrame(ndcg_dict))

nDCG scores for query: Venezuela económica crise


| Translator | Okapi BM25 | Okapi TF-IDF | Query Likelihood Model |
|---|---|---|---|
| google | 0.458833 | 0.418115 | 0.263593 |
| bing | 0.458833 | 0.418115 | 0.263593 |
| alibaba | 0.367031 | 0.297346 | 0.152925 |
| baidu | 0.458833 | 0.418115 | 0.263593 |


## Document similarity

Find the similarity between the top-3 documents retrieved in each language for the query **Venezuela económica crise** using the **Google** translator and the **Okapi TF-IDF** retrieval model.

In [None]:
documents = list()

for _,doc_id,_ in top_hits[0]:
    doc = json.loads(reader_en.doc(doc_id).raw())
    documents.append(doc['text'])

for _,doc_id,_ in top_hits[1]:
    doc = json.loads(reader_es.doc(doc_id).raw())
    documents.append(doc['text'])

for _,doc_id,_ in top_hits[2]:
    doc = json.loads(reader_fr.doc(doc_id).raw())
    documents.append(doc['text'])

document_similarity(documents)

Texts compared, in output order:

- EN-1: In 2017, Donald Trump's administration imposed more economic sanctions on Venezuela.
- EN-2: The Venezuelan economic crisis also known as Great Depression in Venezuela, refers to the deterioration that began to be noticed in the main macroeconomic indicators from the year 2012, and whose consequences have extended in time to the present, not only economically but also politically and socially.
- EN-3: By 2014, Venezuela had entered an economic recession and by 2016, the country had an inflation rate of 800%, the highest rate in its history. The International Monetary Fund expects inflation in Venezuela to be 1,000,000% for 2018.
- ES-1: Durante la crisis económica de Venezuela, la tasa de oro excavado cayó un 64,1 % entre febrero de 2013 y febrero de 2014, la producción de hierro cayó un 49,8 %.
- ES-2: Los efectos de la crisis económica empezaron a evidenciarse después de mediados del tercer mandato de Hugo Chávez.
- ES-3: La crisis económica en Venezuela ocurridas durante las dos primeras décadas del siglo XXI, también denominada depresión económica venezolana, o colapso económico venezolano, se refiere al deterioro económico en los principales indicadores macroeconómicos en Venezuela durante los Regímenes de Hugo Chávez (1999-2013) y Nicolás Maduro (desde 2013), y cuyas consecuencias se han extendido en el tiempo, no solo en el plano económico sino también en el político y social del país sudamericano.
- FR-1: La crise du Venezuela désigne une période de chamboulements sociaux, économiques et politiques débutant au Venezuela en 2013, sous la présidence de Nicolás Maduro.
- FR-2: À partir de 2014, les crises politique et économique ainsi que l’insécurité qui règne dans le pays ont détruit l’industrie touristique du Venezuela.
- FR-3: Cette dépendance de l'économie au pétrole participa à une importante crise financière en 1994, aggravée par la récession de 1993.

Pairwise similarity:

| | EN-1 | EN-2 | EN-3 | ES-1 | ES-2 | ES-3 | FR-1 | FR-2 | FR-3 |
|---|---|---|---|---|---|---|---|---|---|
| EN-1 | 1.0 | 0.422068 | 0.426839 | 0.217298 | 0.384837 | 0.539093 | 0.552248 | 0.397806 | 0.230664 |
| EN-2 | 0.422068 | 1.0 | 0.350683 | 0.301429 | 0.515376 | 0.745894 | 0.610397 | 0.509414 | 0.478101 |
| EN-3 | 0.426839 | 0.350683 | 1.0 | 0.416351 | 0.242391 | 0.530769 | 0.555703 | 0.445152 | 0.254061 |
| ES-1 | 0.217298 | 0.301429 | 0.416351 | 1.0 | 0.162814 | 0.35771 | 0.339749 | 0.403266 | 0.295619 |
| ES-2 | 0.384837 | 0.515376 | 0.242391 | 0.162814 | 1.0 | 0.533169 | 0.492747 | 0.451557 | 0.415943 |
| ES-3 | 0.539093 | 0.745894 | 0.530769 | 0.35771 | 0.533169 | 1.0 | 0.812009 | 0.584485 | 0.381604 |
| FR-1 | 0.552248 | 0.610397 | 0.555703 | 0.339749 | 0.492747 | 0.812009 | 1.0 | 0.596107 | 0.295842 |
| FR-2 | 0.397806 | 0.509414 | 0.445152 | 0.403266 | 0.451557 | 0.584485 | 0.596107 | 1.0 | 0.377511 |
| FR-3 | 0.230664 | 0.478101 | 0.254061 | 0.295619 | 0.415943 | 0.381604 | 0.295842 | 0.377511 | 1.0 |

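
The matrix above is symmetric with ones on the diagonal, which is what pairwise cosine similarity over text embeddings would produce. `document_similarity` is defined in `utils.py` and is not shown here, so treating it as cosine similarity is an assumption. The sketch below shows how such a matrix can be built from vectors; the two-dimensional toy vectors stand in for real sentence embeddings, and the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """All-pairs cosine similarity as a list of rows."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy 2-D vectors standing in for sentence embeddings (assumption).
toy_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
matrix = similarity_matrix(toy_embeddings)
```

Reading the notebook's matrix under this assumption, the highest cross-language value is 0.812009 between ES-3 and FR-1, both of which are definitional sentences about the Venezuelan crisis.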