# FEVER Document Retrieval

**Purpose**: This notebook develops a baseline for document retrieval on the FEVER dataset with Apache Lucene and scores it using precision and recall.

**Input**: This notebook requires a prebuilt Lucene index and the FEVER JSONL files (here, `../train.jsonl`) to run.

## Setting up Lucene Query

In [2]:
import utils
import pickle
from tqdm import tqdm_notebook
from joblib import Parallel, delayed
from multiprocessing import cpu_count
import numpy as np

In [3]:
claims, labels, article_list, claim_set, claim_to_article = utils.extract_fever_jsonl_data("../train.jsonl")

Num Distinct Claims 109810
Num Data Points 125051


In [5]:
output = utils.query_lucene(claims[0])
retrieved = utils.process_lucene_output(output)
relevant = claim_to_article[claims[0]]

In [6]:
utils.preprocess_article_name(claims[0])

'nikolaj coster waldau worked with the fox broadcasting company '

In [7]:
utils.query_lucene(claims[0])

['Searching for: nikolaj coster waldau worked fox broadcasting company',
 '316945 total matching documents',
 '1. /home/moinnadeem/Documents/UROP/wiki-pages/processed_pages/Ved_verdens_ende.txt',
 '2. /home/moinnadeem/Documents/UROP/wiki-pages/processed_pages/Nukaaka_Coster-Waldau.txt',
 '3. /home/moinnadeem/Documents/UROP/wiki-pages/processed_pages/A_Second_Chance_-LRB-2014_film-RRB-.txt',
 '4. /home/moinnadeem/Documents/UROP/wiki-pages/processed_pages/A_Thousand_Times_Good_Night.txt',
 '5. /home/moinnadeem/Documents/UROP/wiki-pages/processed_pages/New_Amsterdam_-LRB-TV_series-RRB-.txt',
 '6. /home/moinnadeem/Documents/UROP/wiki-pages/processed_pages/The_Baker_-LRB-film-RRB-.txt',
 '7. /home/moinnadeem/Documents/UROP/wiki-pages/processed_pages/Nikolaj.txt',
 '8. /home/moinnadeem/Documents/UROP/wiki-pages/processed_pages/Nikolaj_Coster-Waldau.txt',
 '9. /home/moinnadeem/Documents/UROP/wiki-pages/processed_pages/Coster.txt',
 '10. /home/moinnadeem/Documents/UROP/wiki-pages/processed_pag
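The raw Lucene output mixes header lines with ranked file paths, so it has to be reduced to article names before it can be compared against the gold articles. The sketch below shows one way that step could work. It is an assumption, not the actual `utils.process_lucene_output`, whose source is not shown here; the name `parse_lucene_output` and the regular expression are hypothetical.

```python
import os
import re

# Hypothetical helper: the real utils.process_lucene_output is not shown in this notebook.
# Ranked result lines look like '1. /some/path/Article_Name.txt'.
RANKED_LINE = re.compile(r"^\d+\.\s+(.+)$")

def parse_lucene_output(lines):
    """Return article names in rank order, ignoring non-result lines."""
    articles = []
    for line in lines:
        match = RANKED_LINE.match(line.strip())
        if match is None:
            # Header lines ('Searching for: ...', 'N total matching documents') are skipped.
            continue
        filename = os.path.basename(match.group(1))
        name, _extension = os.path.splitext(filename)
        articles.append(name)
    return articles

sample = [
    "Searching for: nikolaj coster waldau worked fox broadcasting company",
    "316945 total matching documents",
    "1. /home/moinnadeem/Documents/UROP/wiki-pages/processed_pages/Ved_verdens_ende.txt",
    "2. /home/moinnadeem/Documents/UROP/wiki-pages/processed_pages/Nukaaka_Coster-Waldau.txt",
]
print(parse_lucene_output(sample))
```

Stripping the directory and the `.txt` extension leaves names in the same underscore-separated form as the file names, which only helps if the gold article names use that form too.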

In [8]:
utils.calculate_precision(retrieved, relevant, 10)

0.1

In [9]:
utils.calculate_recall(retrieved, relevant, 10)

0.5
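For reference, precision at k is the fraction of the top k retrieved documents that are relevant, and recall at k is the fraction of all relevant documents that appear in the top k. The sketch below implements these textbook definitions. It is an assumption about what `utils.calculate_precision` and `utils.calculate_recall` compute, since their source is not shown here; the function names and the toy document IDs are hypothetical.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    relevant_set = set(relevant)
    hits = sum(1 for doc in retrieved[:k] if doc in relevant_set)
    # Assumption: divide by k itself, even if fewer than k documents were retrieved.
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = len(relevant_set.intersection(retrieved[:k]))
    return hits / len(relevant_set)

# Toy data (hypothetical): ten retrieved documents, one of two relevant ones at rank 8.
retrieved = ["doc{}".format(i) for i in range(1, 11)]
relevant = ["doc8", "doc42"]

print(precision_at_k(retrieved, relevant, 10))  # 0.1
print(recall_at_k(retrieved, relevant, 10))     # 0.5
```

Some implementations divide precision by the number of documents actually returned rather than by k; the two conventions differ whenever fewer than k results come back.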

## Applying Statistics to Dataset

We calculate precision and recall at each cutoff k in (1, 2, 5, 10).

In [13]:
k = [1, 2, 5, 10]

In [14]:
def score_claim(claim):
    # Replace slashes with spaces before querying (presumably because "/" is special in Lucene query syntax).
    cleaned_claim = claim.replace("/", " ")
    choices = utils.query_lucene(cleaned_claim)
    retrieved = utils.process_lucene_output(choices)
    relevant = claim_to_article[claim]
    # Despite its name, mAP holds precision and recall at each cutoff in k, not mean average precision.
    mAP = {}
    for i in k:
        precision = utils.calculate_precision(retrieved=retrieved, relevant=relevant, k=i)
        recall = utils.calculate_recall(retrieved=retrieved, relevant=relevant, k=i)
        mAP[i] = {'precision': precision, 'recall': recall}
    return mAP

Scoring was run on the CSAIL cluster and the results were cached in `result.pkl`. The notebook loads this file so that the results can be shown here without rerunning the queries.

In [18]:
loadCached = True

In [20]:
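if not loadCached:
    # Score the first 500 distinct claims in parallel, then cache the per-claim results.
    claims_to_score = list(claim_to_article.keys())[:500]
    result = Parallel(n_jobs=8, verbose=1)(delayed(score_claim)(claim) for claim in claims_to_score)
    with open("result.pkl", "wb") as f:
        pickle.dump(result, f)
else:
    # Load the previously cached per-claim results.
    with open("result.pkl", "rb") as f:
        result = pickle.load(f)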

In [21]:
def calculatemAP(mAP, k):
    # Collect the per-claim precision and recall values under each cutoff in k.
    mAP_final = {}
    for i in k:
        mAP_final[i] = {'precision': [], 'recall': []}

    for ap in mAP:
        for cutoff, scores in ap.items():
            mAP_final[cutoff]['precision'].append(scores['precision'])
            mAP_final[cutoff]['recall'].append(scores['recall'])

    return mAP_final

def displaymAP(mAP):
    # Print the mean of each metric at each cutoff.
    for cutoff, metrics in mAP.items():
        for name, values in metrics.items():
            print("{} @ {}: {}".format(name, cutoff, np.mean(values)))

In [22]:
mAP = calculatemAP(result, k)

In [23]:
displaymAP(mAP)

recall @ 1: 0.20031331520793064
precision @ 1: 0.02151639967500445
recall @ 2: 0.28886552397970694
precision @ 2: 0.031180563702168523
recall @ 10: 0.5110500005811957
precision @ 10: 0.055638976872308885
recall @ 5: 0.41472813779176937
precision @ 5: 0.045006886603492405
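The lines above come out in dictionary order, so the cutoffs are not sorted. The sketch below shows one way to average the per-claim scores and report them in ascending order of k. It is an illustrative alternative to `calculatemAP` and `displaymAP`, not the notebook's own code: the `summarize` function and the toy scores are hypothetical, and `statistics.mean` stands in for `np.mean` to keep the example dependency-free.

```python
from statistics import mean

def summarize(per_claim_scores):
    """Average per-claim metrics into {k: {'precision': mean, 'recall': mean}}, sorted by k."""
    cutoffs = sorted({cutoff for scores in per_claim_scores for cutoff in scores})
    summary = {}
    for cutoff in cutoffs:
        summary[cutoff] = {
            metric: mean(scores[cutoff][metric] for scores in per_claim_scores)
            for metric in ("precision", "recall")
        }
    return summary

# Toy per-claim scores (hypothetical), in the same shape that score_claim returns.
toy = [
    {1: {"precision": 1.0, "recall": 0.5}, 2: {"precision": 0.5, "recall": 0.5}},
    {1: {"precision": 0.0, "recall": 0.0}, 2: {"precision": 0.5, "recall": 1.0}},
]

for cutoff, metrics in summarize(toy).items():
    for metric, value in metrics.items():
        print("{} @ {}: {:.4f}".format(metric, cutoff, value))
```

The sketch assumes every per-claim dictionary contains every cutoff, which holds when all claims are scored with the same list `k`.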
