<a href="https://colab.research.google.com/github/leonardo3108/robustez-query/blob/main/Originals%20%2B%20BM25%20%2B%20nDCG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

## Installing the pygaggle and pyserini packages

In [None]:
!pip install git+https://github.com/castorini/pygaggle.git

Collecting git+https://github.com/castorini/pygaggle.git
  Cloning https://github.com/castorini/pygaggle.git to /tmp/pip-req-build-e07jctw0
  Running command git clone -q https://github.com/castorini/pygaggle.git /tmp/pip-req-build-e07jctw0
  Running command git submodule update --init --recursive -q


In [None]:
!pip install pyserini



## Loading the MS MARCO Passage dataset

In [None]:
from pyserini.search import SimpleSearcher

# `SimpleSearcher` defaults to BM25 scoring function.
searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')

Attempting to initialize pre-built index msmarco-passage.
/root/.cache/pyserini/indexes/index-msmarco-passage-20201117-f87c94.1efad4f1ae6a77e235042eff4be1612d already exists, skipping download.
Initializing msmarco-passage...


## Loading the original queries

In [None]:
! wget -nc https://raw.githubusercontent.com/leonardo3108/robustez-query/main/queries-originals.txt

File ‘queries-originals.txt’ already there; not retrieving.



In [None]:
queries = []

# Each line holds a query number followed by the query text.
with open('queries-originals.txt') as query_file:
    for line in query_file:
        fields = line.strip().split()
        queries.append((fields[0], ' '.join(fields[1:])))
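
The parsing step above can be sketched as a small function, which makes it easy to check on a single line. This is a minimal sketch: the name `parse_query_line` is hypothetical, and the sample line is the first query shown in the BM25 output of this notebook.

```python
def parse_query_line(line):
    """Split a 'number text...' line into (query_number, query_text)."""
    fields = line.strip().split()
    # The first field is the query number; the rest is the query text,
    # rejoined with single spaces.
    return fields[0], ' '.join(fields[1:])


print(parse_query_line('23849 are naturalization records public information\n'))
```

Query numbers are kept as strings, which matches the string keys read from the qrels file.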

## Loading the TREC 2020 judgments

In [None]:
! wget -nc https://trec.nist.gov/data/deep/2020qrels-pass.txt

--2021-11-02 14:04:37--  https://trec.nist.gov/data/deep/2020qrels-pass.txt
Resolving trec.nist.gov (trec.nist.gov)... 132.163.4.175, 2610:20:6005:13::19
Connecting to trec.nist.gov (trec.nist.gov)|132.163.4.175|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 218617 (213K) [text/plain]
Saving to: ‘2020qrels-pass.txt’


2021-11-02 14:04:37 (5.80 MB/s) - ‘2020qrels-pass.txt’ saved [218617/218617]



In [None]:
# Maps (query number, passage id) to the graded relevance judgment.
judment = {}
with open('2020qrels-pass.txt') as qrels_file:
    for line in qrels_file:
        query_nr, _, pid, grade = line.rstrip().split()
        judment[(query_nr, pid)] = int(grade)

In [None]:
scale = {3:'perfectly relevant', 2:'highly relevant', 1:'related', 0:'irrelevant'}

# Execution

## BM25 search (top 10 ranks)

In [None]:
import json
from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.base import hits_to_texts

results = {}
texts = {}

print('BM25:')
for query_number, query_text in queries:
    print(query_number, query_text + ':')
    query = Query(query_text)
    hits = searcher.search(query.text, k=10)
    texts[query_number] = hits_to_texts(hits)
    results[query_number] = []

    for i, doc in enumerate(texts[query_number]):
        results[query_number].append(doc.metadata["docid"])
        content = json.loads(doc.text)['contents']
        print(f'\t{i+1:2} {doc.metadata["docid"]:15} {doc.score:.5f} {content}')

BM25:
23849 are naturalization records public information:
	 1 4348282         10.06630 Civil Records Definition. Civil records are a group of public records that pertain to civil registry records, civil family matters and non criminal civil offenses. These records vary a lot because of the nature of the information that is recorded.
	 2 2674124         9.86550 See our FAQ's to learn how you can benefit by using our search. Share: DISCLAIMER: Due to the nature of the origin of public record information, the public records and commercially available data sources used in the PeopleWise system may contain errors.
	 3 7119957         9.64430 Yes, in most cases Public Records are available to the public. Some documents, such as certain court records, confidential personal information, and other sensitive information may be kept sealed or is only available with a court order. In certain states, there is a waiting period to obtain Public Records that reveal private information. Which governme

# Evaluation

## Matching against the judgments

In [None]:
for query_number, query_text in queries:
    print(query_number, query_text + ':')
    for i, doc in enumerate(texts[query_number]):
        content = json.loads(doc.text)['contents']
        # Unjudged passages are treated as irrelevant (grade 0).
        grade = judment.get((query_number, doc.metadata["docid"]), 0)
        print(f'\t{i+1:2} {doc.metadata["docid"]:15} {doc.score:.5f}\t{grade} ({scale[grade]:18})    {content}')

23849 are naturalization records public information:
	 1 4348282         10.06630	0 (irrelevant        )    Civil Records Definition. Civil records are a group of public records that pertain to civil registry records, civil family matters and non criminal civil offenses. These records vary a lot because of the nature of the information that is recorded.
	 2 2674124         9.86550	0 (irrelevant        )    See our FAQ's to learn how you can benefit by using our search. Share: DISCLAIMER: Due to the nature of the origin of public record information, the public records and commercially available data sources used in the PeopleWise system may contain errors.
	 3 7119957         9.64430	0 (irrelevant        )    Yes, in most cases Public Records are available to the public. Some documents, such as certain court records, confidential personal information, and other sensitive information may be kept sealed or is only available with a court order. In certain states, there is a waiting period 

## Computing DCG@10

In [None]:
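import math

dcg10_bm25_original = {}

for query in results.keys():
    dcg = 0
    for i, doc in enumerate(results[query]):
        # Unjudged passages count as irrelevant (grade 0).
        grade = judment.get((query, doc), 0)
        # Gain 2**grade - 1, discounted by 1 / log2(rank + 1) with rank = i + 1.
        dcg += (2**grade - 1) * math.log(2) / math.log(i+2)
    dcg10_bm25_original[query] = dcg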
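
The loop above uses an exponential gain of `2**grade - 1` and a position discount of `1 / log2(rank + 1)`, which is what `math.log(2) / math.log(i+2)` evaluates to. As a minimal sketch, the same computation can be written as a standalone function over a ranked list of grades. The names `dcg_at_k` and `example_grades` are hypothetical and the grades are made up for illustration.

```python
import math


def dcg_at_k(grades, k=10):
    """DCG over the first k grades, given in ranked order."""
    return sum((2 ** grade - 1) / math.log2(i + 2)
               for i, grade in enumerate(grades[:k]))


# Hypothetical ranking: a grade-3 passage at rank 1 and a grade-2 passage at rank 3.
example_grades = [3, 0, 2]
print(dcg_at_k(example_grades))  # 7/log2(2) + 0 + 3/log2(4) = 8.5
```

With ten grade-3 passages this function returns about 31.8049, which matches the largest value in `idcg.csv`.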

In [None]:
print('Query: DCG@10')
for query in dcg10_bm25_original.keys():
    print('{}: {}'.format(query, dcg10_bm25_original[query]))

Query: DCG@10
23849: 0.2890648263178878
42255: 14.9165082750002
47210: 17.951615323586772
67316: 7.0
118440: 4.416508275000202
121171: 5.056649103653413
135802: 12.763483535311373
141630: 6.472085511897926
156498: 4.901788907020144
169208: 6.4525880959238044
174463: 7.416508275000202
258062: 9.023453784225214
324585: 7.0
330975: 5.946038405679959
332593: 11.219453000293806
336901: 3.0147359065137516
390360: 7.3169161285439195
405163: 15.246708288661921
555530: 0.3868528072345416
583468: 24.143953229684456
640502: 12.201419960397946
673670: 4.8927892607143715
701453: 19.768541251244844
730539: 12.629296253917301
768208: 12.406075075354144
877809: 17.95631607813879
911232: 5.749755712591929
914916: 12.093130290294111
938400: 0.0
940547: 4.945859079330484
997622: 9.478732993720168
1030303: 19.08560846251293
1037496: 13.9137922056359
1043135: 15.407525167228123
1051399: 0.0
1064670: 6.939725036603284
1071750: 6.847469127403145
1105792: 8.500950096341658
1106979: 14.57626737591509
1108651: 

In [None]:
# Save the per-query DCG@10 values; the file is closed automatically.
with open('dcg_bm25_original.csv', 'w') as w:
    w.write('query, dcg@10\n')
    for query in dcg10_bm25_original.keys():
        w.write('{}, {}\n'.format(query, dcg10_bm25_original[query]))

## Obtaining the IDCG

In [None]:
! wget -nc https://raw.githubusercontent.com/leonardo3108/robustez-query/main/idcg.csv

File ‘idcg.csv’ already there; not retrieving.



In [None]:
import csv

idcg = {}

print('query idcg')
# The csv module expects files opened with newline=''.
with open('idcg.csv', newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',', skipinitialspace=True)
    for query, value in csvreader:
        if query == 'query': continue  # skip the header row
        idcg[query] = float(value)
        print(query, value)

query idcg
23849 31.804915366618417
42255 18.7710512655814
47210 28.182676571548026
67316 26.849343238214693
118440 31.804915366618417
121171 13.630678014265037
135802 13.052548361629261
141630 20.154397028550864
156498 15.819558616729841
169208 23.877103260844436
174463 17.630678014265037
258062 31.804915366618417
324585 31.804915366618417
330975 31.804915366618417
332593 26.849343238214693
336901 14.9165082750002
390360 26.849343238214693
405163 31.804915366618417
555530 13.630678014265037
583468 31.804915366618417
640502 31.804915366618417
673670 29.44453607869094
701453 23.877103260844436
730539 17.630678014265037
768208 22.06598386330924
877809 31.804915366618417
911232 13.666771961378048
914916 31.804915366618417
938400 31.804915366618417
940547 13.630678014265037
997622 31.804915366618417
1030303 20.16042416454164
1037496 31.804915366618417
1043135 29.44453607869094
1051399 22.15439702855086
1064670 9.666771961378048
1071750 18.974207384587125
1105792 11.819558616729841
1106979 
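
The notebook downloads `idcg.csv` precomputed and does not show how it was produced. As a sketch of how such values could be derived, the code below assumes that IDCG@10 is the DCG of a query's judged grades sorted in descending order, with the same gain and discount used for DCG@10. The function `ideal_dcg_at_k` and the `toy_judgments` dictionary are hypothetical. The dictionary mirrors the `(query number, passage id) -> grade` layout of the judgments loaded earlier.

```python
import math


def ideal_dcg_at_k(judgments, query_nr, k=10):
    """DCG of the best possible ranking for one query (assumed IDCG definition)."""
    # Collect this query's grades and put the highest grades first.
    grades = sorted((grade for (q, _), grade in judgments.items() if q == query_nr),
                    reverse=True)
    return sum((2 ** grade - 1) / math.log2(i + 2)
               for i, grade in enumerate(grades[:k]))


# Made-up judgments for illustration only.
toy_judgments = {
    ('q1', 'd1'): 0,
    ('q1', 'd2'): 3,
    ('q1', 'd3'): 2,
    ('q2', 'd4'): 1,
}
print(ideal_dcg_at_k(toy_judgments, 'q1'))  # 7 + 3/log2(3)
```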

## Computing nDCG@10

In [None]:
ndcg10_bm25_original = {}

print('Query: nDCG@10')
for query in dcg10_bm25_original.keys():
    ndcg10_bm25_original[query] = dcg10_bm25_original[query] / idcg[query]
    print('{}: {}'.format(query, ndcg10_bm25_original[query]))

Query: nDCG@10
23849: 0.009088684028421673
42255: 0.794654921770477
47210: 0.6369734002380002
67316: 0.26071401217877443
118440: 0.1388624438735545
121171: 0.37097561092422787
135802: 0.9778537632415402
141630: 0.3211252364796388
156498: 0.3098562371921236
169208: 0.2702416631294328
174463: 0.4206592774820958
258062: 0.28371255449703187
324585: 0.2200917662980802
330975: 0.18695344216890325
332593: 0.41786694373682665
336901: 0.20210734650054762
390360: 0.2725175086640386
405163: 0.47938213678331165
555530: 0.028381039213873662
583468: 0.7591264731056414
640502: 0.3836331529183765
673670: 0.16616968416952885
701453: 0.8279287916664022
730539: 0.7163250468132252
768208: 0.5622262370989336
877809: 0.5645767602634546
911232: 0.42071059126768146
914916: 0.38022834366623515
938400: 0.0
940547: 0.3628476202104143
997622: 0.2980272981222516
1030303: 0.9466868507697815
1037496: 0.4374730146346952
1043135: 0.5232728111610009
1051399: 0.0
1064670: 0.7178947702842047
1071750: 0.3608830128506654
1

In [None]:
# Save the per-query nDCG@10 values; the file is closed automatically.
with open('ndcg_bm25_original.csv', 'w') as w:
    w.write('query, ndcg@10\n')
    for query in ndcg10_bm25_original.keys():
        w.write('{}, {}\n'.format(query, ndcg10_bm25_original[query]))
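
Per-query nDCG@10 values are often summarized as a mean over all queries. The sketch below shows one way to do that while guarding against a missing or zero IDCG, which the division above does not handle. The function name `mean_ndcg` and the toy dictionaries are hypothetical, and scoring such queries as 0.0 is an assumption, not something the notebook specifies.

```python
def mean_ndcg(dcg_by_query, idcg_by_query):
    """Average nDCG over queries, scoring 0.0 when the IDCG is missing or zero."""
    scores = []
    for query, dcg in dcg_by_query.items():
        ideal = idcg_by_query.get(query, 0.0)
        scores.append(dcg / ideal if ideal > 0 else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Made-up values for illustration only.
toy_dcg = {'q1': 7.0, 'q2': 0.0, 'q3': 1.0}
toy_idcg = {'q1': 14.0, 'q2': 10.0, 'q3': 0.0}
print(mean_ndcg(toy_dcg, toy_idcg))
```

In this notebook the call would be `mean_ndcg(dcg10_bm25_original, idcg)`.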