
In [1]:
!pip install faiss-cpu
!pip install pyserini

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Rechecking Mr. TyDi (https://arxiv.org/pdf/2108.08787.pdf) for Bangla information retrieval

In [2]:
!python -m pyserini.search.lucene --threads 16 --batch-size 128 \
  --language bn \
  --topics mrtydi-v1.1-bengali-test \
  --index mrtydi-v1.1-bengali \
  --output run.mrtydi.bm25.bn.test.txt --bm25 --hits 100


Running mrtydi-v1.1-bengali-test topics, saving to run.mrtydi.bm25.bn.test.txt...
100% 111/111 [00:03<00:00, 29.91it/s]


In [3]:
#paper
!python -m pyserini.search.faiss --threads 16 --batch-size 512 \
  --encoder castorini/mdpr-question-nq \
  --topics mrtydi-v1.1-bengali-test \
  --index mrtydi-v1.1-bengali-mdpr-nq \
  --output run.mrtydi.mdpr-split-pft-nq.bn.test.txt --hits 100


Attempting to initialize pre-built index mrtydi-v1.1-bengali-mdpr-nq.
/root/.cache/pyserini/indexes/faiss.mrtydi-v1.1-bengali.20220207.5df364.e60cb6f1f7139cf0551f0ba4e4e83bf6 already exists, skipping download.
Initializing mrtydi-v1.1-bengali-mdpr-nq...
Running mrtydi-v1.1-bengali-test topics, saving to run.mrtydi.mdpr-split-pft-nq.bn.test.txt...
100% 111/111 [00:29<00:00,  3.75it/s]


In [4]:
!python -m pyserini.search.faiss --threads 16 --batch-size 512 \
  --encoder-class auto \
  --encoder castorini/mdpr-tied-pft-msmarco-ft-all \
  --topics mrtydi-v1.1-bengali-test \
  --index mrtydi-v1.1-bengali-mdpr-tied-pft-msmarco-ft-all \
  --output run.mrtydi.mdpr-tied-pft-msmarco-ft-all.bn.test.txt --hits 100


Attempting to initialize pre-built index mrtydi-v1.1-bengali-mdpr-tied-pft-msmarco-ft-all.
/root/.cache/pyserini/indexes/faiss.mrtydi-v1.1-bengali.20220524.7b099d5.d1e75f4960a723b068bb778a972ffb54 already exists, skipping download.
Initializing mrtydi-v1.1-bengali-mdpr-tied-pft-msmarco-ft-all...
Running mrtydi-v1.1-bengali-test topics, saving to run.mrtydi.mdpr-tied-pft-msmarco-ft-all.bn.test.txt...
  0% 0/111 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
100% 111/111 [00:17<00:00,  6.39it/s]


In [5]:
""" Reproduce the hybrid results in v1.1 """
#https://github.com/castorini/mr.tydi/blob/main/scripts/hybrid.py
import argparse
from collections import defaultdict

from tqdm import tqdm


lang2alpha = {'arabic': 0.77, 'bengali': 0.71, 'english': 1.0, 'finnish': 0.77, 'indonesian': 0.9, 'japanese': 0.99, 'korean': 1.0, 'russian': 0.93, 'swahili': 0.56, 'telugu': 0.73, 'thai': 0.84}


def load_runs(fn):
    runs = defaultdict(dict)
    with open(fn) as f:
        for line in f:
            qid, _, docid, _, score, _ = line.rstrip().split()
            runs[qid][docid] = float(score)
    return runs

class CFG:
  def __init__(self):  
    self.dense = '/content/run.mrtydi.mdpr-tied-pft-msmarco-ft-all.bn.test.txt' #/content/run.mrtydi.mdpr-split-pft-nq.bn.test.txt /content/run.mrtydi.mdpr-tied-pft-msmarco-ft-all.bn.test.txt
    self.sparse = '/content/run.mrtydi.bm25.bn.test.txt'
    self.output = "run_output.txt"
    self.lang = 'bengali' # one of {'arabic', 'bengali', 'english', 'finnish', 'indonesian', 'japanese', 'korean', 'russian', 'swahili', 'telugu', 'thai'}
    self.alpha = 0.0
    self.normalization = True
    self.weight_on_dense = True
def hybrid(paperwork = False):
    args = CFG()
    if(paperwork):
      print("paperwork......\n")
      args.dense = '/content/run.mrtydi.mdpr-split-pft-nq.bn.test.txt'
      args.output = 'paper_output.txt'
    print(args.dense)
    runs_1 = load_runs(args.dense) 
    runs_2 = load_runs(args.sparse)

    hybrid_result = {}
    output_f = open(args.output, 'w')
    alpha = args.alpha
    lang = args.lang
    if lang:
        if lang not in lang2alpha:
            raise ValueError(f"Unrecognized lang, need to be one of {list(lang2alpha.keys())}.")
        alpha = lang2alpha[lang]
    # alpha = 0.69
    print("alpha = ",alpha)
    for key in tqdm(list(set(runs_1.keys()).union(set(runs_2.keys())))):
        dense_hits = {docid: runs_1[key][docid] for docid in runs_1[key]} if key in runs_1 else {}
        sparse_hits = {docid: runs_2[key][docid] for docid in runs_2[key]} if key in runs_2 else {}

        hybrid_result = []
        min_dense_score = min(dense_hits.values()) if len(dense_hits) > 0 else 0
        max_dense_score = max(dense_hits.values()) if len(dense_hits) > 0 else 1
        min_sparse_score = min(sparse_hits.values()) if len(sparse_hits) > 0 else 0
        max_sparse_score = max(sparse_hits.values()) if len(sparse_hits) > 0 else 1
        for doc in set(dense_hits.keys()) | set(sparse_hits.keys()):
            if doc not in dense_hits:
                sparse_score = sparse_hits[doc]
                dense_score = min_dense_score
            elif doc not in sparse_hits:
                sparse_score = min_sparse_score
                dense_score = dense_hits[doc]
            else:
                sparse_score = sparse_hits[doc]
                dense_score = dense_hits[doc]

            if args.normalization:
                sparse_score = 0 if (max_sparse_score - min_sparse_score) == 0 else (
                    (sparse_score - (min_sparse_score + max_sparse_score) / 2) / (max_sparse_score - min_sparse_score))
                dense_score = 0 if (max_dense_score - min_dense_score) == 0 else (
                    (dense_score - (min_dense_score + max_dense_score) / 2) / (max_dense_score - min_dense_score))

            score = alpha * sparse_score + dense_score if not args.weight_on_dense else sparse_score + alpha * dense_score
            hybrid_result.append((doc, score))

        hybrid_result = sorted(hybrid_result, key=lambda x: x[1], reverse=True)
        for idx, item in enumerate(hybrid_result):
            output_f.write(f'{key} Q0 {item[0]} {idx+1} {item[1]} hybrid\n')
    output_f.close()
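
To make the scoring rule in `hybrid` easier to proofread, here is a minimal sketch of the same centered min-max normalization and weighted sum on toy scores for a single query. The helper names `center_minmax` and `fuse` and the toy scores are made up for illustration, and the sketch assumes both hit lists are non-empty. A document missing from one list takes that list's minimum score, as in the cell above.

```python
def center_minmax(score, lo, hi):
    """Centered min-max normalization: maps [lo, hi] onto [-0.5, 0.5]."""
    if hi == lo:
        return 0.0  # degenerate range: avoid division by zero
    return (score - (lo + hi) / 2) / (hi - lo)


def fuse(dense_hits, sparse_hits, alpha, weight_on_dense=True):
    """Fuse two non-empty {docid: score} dicts for one query.

    Returns a list of (docid, fused_score) sorted by descending score.
    """
    d_lo, d_hi = min(dense_hits.values()), max(dense_hits.values())
    s_lo, s_hi = min(sparse_hits.values()), max(sparse_hits.values())
    fused = []
    for doc in set(dense_hits) | set(sparse_hits):
        # A document absent from one list falls back to that list's minimum.
        d = center_minmax(dense_hits.get(doc, d_lo), d_lo, d_hi)
        s = center_minmax(sparse_hits.get(doc, s_lo), s_lo, s_hi)
        score = s + alpha * d if weight_on_dense else alpha * s + d
        fused.append((doc, score))
    return sorted(fused, key=lambda x: x[1], reverse=True)


# Toy scores (hypothetical), using the Bengali alpha from lang2alpha.
toy_dense = {"a": 80.0, "b": 70.0, "c": 60.0}
toy_sparse = {"b": 4.0, "c": 3.0, "d": 2.0}
ranking = fuse(toy_dense, toy_sparse, alpha=0.71)
for doc, score in ranking:
    print(doc, round(score, 4))
```

With `weight_on_dense=True` the sparse score enters with weight 1 and the dense score with weight `alpha`, so document `b` (top of the sparse list, middle of the dense list) ranks first in this toy case.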

In [6]:
hybrid()

/content/run.mrtydi.mdpr-tied-pft-msmarco-ft-all.bn.test.txt
alpha =  0.71


100%|██████████| 111/111 [00:00<00:00, 1966.50it/s]


In [7]:
hybrid(paperwork = True)

paperwork......

/content/run.mrtydi.mdpr-split-pft-nq.bn.test.txt
alpha =  0.71


100%|██████████| 111/111 [00:00<00:00, 1916.65it/s]


In [8]:

# run_output = load_runs('/content/run_output.txt')
# run_output

In [9]:
!python -m pyserini.eval.trec_eval \
  -c -M 100 -m recip_rank mrtydi-v1.1-bengali-test \
  run.mrtydi.mdpr-tied-pft-msmarco-ft-all.bn.test.txt

Downloading https://search.maven.org/remotecontent?filepath=uk/ac/gla/dcs/terrierteam/jtreceval/0.0.5/jtreceval-0.0.5-jar-with-dependencies.jar to /root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar...
/root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar already exists!
Skipping download.
Running command: ['java', '-jar', '/root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar', '-c', '-M', '100', '-m', 'recip_rank', '/root/.cache/pyserini/topics-and-qrels/qrels.mrtydi-v1.1-bn.test.txt', 'run.mrtydi.mdpr-tied-pft-msmarco-ft-all.bn.test.txt']
Results:
recip_rank            	all	0.6228


The dense retriever mrtydi-v1.1-bengali-mdpr-tied-pft-msmarco-ft-all alone reaches 0.6228 MRR@100.

In [10]:
!python -m pyserini.eval.trec_eval \
  -c -M 100 -m recip_rank mrtydi-v1.1-bengali-test \
  run.mrtydi.bm25.bn.test.txt

Running command: ['java', '-jar', '/root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar', '-c', '-M', '100', '-m', 'recip_rank', '/root/.cache/pyserini/topics-and-qrels/qrels.mrtydi-v1.1-bn.test.txt', 'run.mrtydi.bm25.bn.test.txt']
Results:
recip_rank            	all	0.4182


In [11]:
#paper 
!python -m pyserini.eval.trec_eval \
  -c -M 100 -m recip_rank mrtydi-v1.1-bengali-test \
  run.mrtydi.mdpr-split-pft-nq.bn.test.txt

Running command: ['java', '-jar', '/root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar', '-c', '-M', '100', '-m', 'recip_rank', '/root/.cache/pyserini/topics-and-qrels/qrels.mrtydi-v1.1-bn.test.txt', 'run.mrtydi.mdpr-split-pft-nq.bn.test.txt']
Results:
recip_rank            	all	0.2911


# Paper's hybrid (sparse: BM25 + dense: mrtydi-v1.1-bengali-mdpr-nq)

# Couldn't reproduce the paper's result for Bangla. The authors report that their hybrid reaches 0.555 MRR@100 for Bangla, but I can only get 0.5069 (please proofread the code cell below).

In [12]:
#paper

!python -m pyserini.eval.trec_eval \
  -c -M 100 -m recip_rank mrtydi-v1.1-bengali-test \
  paper_output.txt

Running command: ['java', '-jar', '/root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar', '-c', '-M', '100', '-m', 'recip_rank', '/root/.cache/pyserini/topics-and-qrels/qrels.mrtydi-v1.1-bn.test.txt', 'paper_output.txt']
Results:
recip_rank            	all	0.5069
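
One way to cross-check the `recip_rank` number independently of `trec_eval` is to compute MRR@100 directly from a run and the qrels. The sketch below is only an approximation under stated assumptions: `mrr_at_k` is a hypothetical helper, it treats any relevance label above 0 as relevant, it ranks documents by descending score, and it averages over every query in the qrels (my reading of what `-c` does). Tie-breaking may differ from `trec_eval`, so small deviations are possible.

```python
def mrr_at_k(run, qrels, k=100):
    """Mean reciprocal rank with cutoff k.

    run:   {qid: {docid: score}}
    qrels: {qid: {docid: relevance}}
    Queries in qrels with no relevant document in the top k contribute 0.
    """
    total = 0.0
    for qid, judged in qrels.items():
        ranked = sorted(run.get(qid, {}).items(), key=lambda x: x[1], reverse=True)[:k]
        for rank, (docid, _) in enumerate(ranked, start=1):
            if judged.get(docid, 0) > 0:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(qrels) if qrels else 0.0


# Toy data (hypothetical): q3 has a judgment but no retrieved documents.
toy_run = {"q1": {"d1": 3.0, "d2": 2.0}, "q2": {"d3": 1.0, "d4": 5.0}}
toy_qrels = {"q1": {"d2": 1}, "q2": {"d4": 1}, "q3": {"d9": 1}}
toy_mrr = mrr_at_k(toy_run, toy_qrels)
print(toy_mrr)
```

For the real files, `load_runs` from the hybrid cell already returns the `run` structure; the qrels path printed in the log above would need its own small parser.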


In [13]:
#paper
!python -m pyserini.eval.trec_eval \
  -c -m recall.100 mrtydi-v1.1-bengali-test \
  paper_output.txt

Running command: ['java', '-jar', '/root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar', '-c', '-m', 'recall.100', '/root/.cache/pyserini/topics-and-qrels/qrels.mrtydi-v1.1-bn.test.txt', 'paper_output.txt']
Results:
recall_100            	all	0.9279


# Mobassir's hybrid (sparse: BM25 + dense: mrtydi-v1.1-bengali-mdpr-tied-pft-msmarco-ft-all)

# The output of the code cell below shows that the hybrid combination I tried performs much better on the Bangla test set than the hybrid used in the original paper. Mine achieves 0.6321 MRR@100, a boost of ~0.1252 over the 0.5069 I reproduced for the paper's hybrid.

In [14]:
!python -m pyserini.eval.trec_eval \
  -c -M 100 -m recip_rank mrtydi-v1.1-bengali-test \
  run_output.txt

Running command: ['java', '-jar', '/root/.cache/pyserini/eval/jtreceval-0.0.5-jar-with-dependencies.jar', '-c', '-M', '100', '-m', 'recip_rank', '/root/.cache/pyserini/topics-and-qrels/qrels.mrtydi-v1.1-bn.test.txt', 'run_output.txt']
Results:
recip_rank            	all	0.6321
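
The size of the gain depends on which baseline it is measured against. A quick check using only the MRR@100 numbers that appear in this notebook (variable names are mine):

```python
mine = 0.6321        # BM25 + mdpr-tied-pft-msmarco-ft-all hybrid (this cell)
reproduced = 0.5069  # my reproduction of the paper's hybrid
reported = 0.555     # value the authors report for Bangla, as quoted above
dense_only = 0.6228  # mdpr-tied-pft-msmarco-ft-all without BM25

gain_vs_reproduced = round(mine - reproduced, 4)
gain_vs_reported = round(mine - reported, 4)
gain_vs_dense_only = round(mine - dense_only, 4)
print(gain_vs_reproduced, gain_vs_reported, gain_vs_dense_only)
```

So the ~0.1252 figure is relative to my own reproduction; against the number reported in the paper the gain is smaller, and most of the improvement comes from the stronger dense retriever rather than from the fusion step.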


In [15]:
from pyserini.search.lucene import LuceneSearcher

'''
mrtydi-v1.1-bengali
mrtydi-v1.1-bengali-lucene8  
'''
searcher = LuceneSearcher.from_prebuilt_index('mrtydi-v1.1-bengali-lucene8')
hits = searcher.search('বিশ্বের বৃহত্তম মরুভূমি')

# Print the first 10 hits:
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')
    

 1 428749#0        3.89819
 2 615945#0        3.85638
 3 822#58          3.81012
 4 2861#0          3.77815
 5 316292#3        3.74671
 6 18756#1         3.74520
 7 704892#2        3.71788
 8 1108#38         3.71407
 9 67429#0         3.70363
10 1108#54         3.70215


In [16]:
# from pyserini.search.lucene import LuceneSearcher
LuceneSearcher.list_prebuilt_indexes()

                        cacm                                                                                                    \
description              Lucene index of the CACM corpus. (Lucene 9)                                                             
filename                 lucene-index.cacm.tar.gz                                                                                
urls                     [https://github.com/castorini/anserini-data/raw/master/CACM/lucene-index.cacm.20221005.252b5e.tar.gz]   
md5                      cfe14d543c6a27f4d742fb2d0099b8e0                                                                        
size compressed (bytes)  2347197                                                                                                 
total_terms              320968                                                                                                  
documents                3204                                                             

In [17]:
import json
jsondoc = json.loads(hits[0].raw)
jsondoc

{'docid': '428749#0',
 'title': 'রাষ্ট্র অনুযায়ী বৃহত্তম শহর ও দ্বিতীয় বৃহত্তম শহরের তালিকা',
 'text': 'এই নিবন্ধে জনসংখ্যা অনুযায়ী প্রত্যেক রাষ্ট্রের বৃহত্তম ও দ্বিতীয় বৃহত্তম শহরের তালিকা উপস্থাপন করা হয়েছে। রাজ্যাংশ বা পরাধীন অঞ্চলের নামের সামনেই প্রথম বন্ধনীতে রাষ্ট্রের নাম উল্লেখ করা আছে।'}

In [18]:
# Print the first 10 hits, rendering each raw document as HTML
from IPython.display import HTML, display
for i in range(0, 10):
    print(f'idx = {i+1:2} -> similarity score = {hits[i].score:.5f} \n')

    display(HTML('<div style="font-family: Times New Roman; padding-bottom:10px">' + hits[i].raw + '</div>'))

idx =  1 -> similarity score = 3.89819 



idx =  2 -> similarity score = 3.85638 



idx =  3 -> similarity score = 3.81012 



idx =  4 -> similarity score = 3.77815 



idx =  5 -> similarity score = 3.74671 



idx =  6 -> similarity score = 3.74520 



idx =  7 -> similarity score = 3.71788 



idx =  8 -> similarity score = 3.71407 



idx =  9 -> similarity score = 3.70363 



idx = 10 -> similarity score = 3.70215 



# Hybrid trial

In [19]:

from pyserini.search.faiss import FaissSearcher, DprQueryEncoder
from pyserini.search.hybrid import HybridSearcher

ssearcher = LuceneSearcher.from_prebuilt_index('mrtydi-v1.1-bengali-lucene8')
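# Note: the CLI run in In [4] used `--encoder-class auto` for this checkpoint, and the warning
# in the output below says some checkpoint weights were not used by DPRQuestionEncoder, so
# DprQueryEncoder may be a mismatch here (untested assumption: AutoQueryEncoder is the matching class).
encoder = DprQueryEncoder('castorini/mdpr-tied-pft-msmarco-ft-all')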
dsearcher = FaissSearcher.from_prebuilt_index(
    'mrtydi-v1.1-bengali-mdpr-tied-pft-msmarco-ft-all', #mrtydi-v1.1-bengali-mdpr-tied-pft-nq
    encoder
)
hsearcher = HybridSearcher(dsearcher, ssearcher)
# Get the top hits
query = 'বিশ্বের বৃহত্তম মরুভূমি'
hits = hsearcher.search(query, alpha=0.71, normalization=True, weight_on_dense=True)

for i in range(0, 10):
    doc = ssearcher.doc(hits[i].docid)
    json_doc = json.loads(doc.raw())
    print(f'\nidx -> {i+1:2} , docid -> {hits[i].docid} , score -> {hits[i].score:.5f}')
    print(json_doc)


You are using a model of type bert to instantiate a model of type dpr. This is not supported for all configurations of models and can yield errors.
Some weights of the model checkpoint at castorini/mdpr-tied-pft-msmarco-ft-all were not used when initializing DPRQuestionEncoder: ['embeddings.word_embeddings.weight', 'encoder.layer.2.attention.self.key.bias', 'encoder.layer.1.output.dense.weight', 'encoder.layer.0.attention.self.value.weight', 'encoder.layer.5.output.LayerNorm.bias', 'encoder.layer.8.output.dense.weight', 'encoder.layer.9.output.LayerNorm.bias', 'encoder.layer.1.attention.output.dense.bias', 'encoder.layer.3.output.dense.bias', 'encoder.layer.8.intermediate.dense.weight', 'encoder.layer.1.intermediate.dense.bias', 'encoder.layer.4.attention.self.query.weight', 'encoder.layer.5.attention.self.key.weight', 'encoder.layer.9.attention.self.value.bias', 'encoder.layer.0.attention.self.value.bias', 'encoder.layer.3.intermediate.dense.weight', 'encoder.layer.5.output.dense.weig

Attempting to initialize pre-built index mrtydi-v1.1-bengali-mdpr-tied-pft-msmarco-ft-all.
/root/.cache/pyserini/indexes/faiss.mrtydi-v1.1-bengali.20220524.7b099d5.d1e75f4960a723b068bb778a972ffb54 already exists, skipping download.
Initializing mrtydi-v1.1-bengali-mdpr-tied-pft-msmarco-ft-all...

idx ->  1 , docid -> 428749#0 , score -> 0.14500
{'docid': '428749#0', 'title': 'রাষ্ট্র অনুযায়ী বৃহত্তম শহর ও দ্বিতীয় বৃহত্তম শহরের তালিকা', 'text': 'এই নিবন্ধে জনসংখ্যা অনুযায়ী প্রত্যেক রাষ্ট্রের বৃহত্তম ও দ্বিতীয় বৃহত্তম শহরের তালিকা উপস্থাপন করা হয়েছে। রাজ্যাংশ বা পরাধীন অঞ্চলের নামের সামনেই প্রথম বন্ধনীতে রাষ্ট্রের নাম উল্লেখ করা আছে।'}

idx ->  2 , docid -> 615945#0 , score -> -0.06830
{'docid': '615945#0', 'title': 'আয়তন অনুযায়ী কানাডীয় প্রদেশ ও অঞ্চলগুলির তালিকা', 'text': 'একটি দেশ হিসাবে, কানাডাতে দশটি প্রদেশ এবং তিনটি অঞ্চল রয়েছে। এই উপবিভাগগুলি ভূমি ও পানি উভয় ক্ষেত্রে ব্যাপকভাবে বিস্তৃত। ভূমি এলাকা দ্বারা বৃহত্তম উপবিভাগ হচ্ছে নুনাভুট অঞ্চল। জল এলাকা দ্বারা বৃহত্তম উপবিভা