# RCLC Baseline

## General Description of the model
Our model is composed of the following modules:
* Question-Answering (QA) system (DocumentQA). It provides the answers.
* Entity typing of the answers (Ultra Fine Entity Typing).
* Rescoring of the answers. A neural network takes the score of the QA system and the probability of an answer being of a certain type (10k types). Its output is the final confidence score.
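
To make the input and output of the rescoring step concrete, here is a minimal sketch under loud assumptions: the layer sizes, the `tanh` hidden layer, the random untrained weights, and the function name `rescore` are all hypothetical and are not taken from the actual model. It only shows a QA score concatenated with a vector of type probabilities being mapped to one confidence value.

```python
import math
import random


def rescore(qa_score, type_probs, w_hidden, b_hidden, w_out, b_out):
    """Map (QA score, type probabilities) to a confidence in (0, 1)."""
    features = [qa_score] + list(type_probs)
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))


# Toy sizes (assumption): 4 types instead of 10k, 3 hidden units, untrained weights.
rng = random.Random(0)
n_types, n_hidden = 4, 3
w_hidden = [[rng.uniform(-1, 1) for _ in range(n_types + 1)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]

confidence = rescore(0.39, [0.7, 0.1, 0.1, 0.1], w_hidden, b_hidden, w_out, 0.0)
print(confidence)
```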

## Prepare the input data
You can skip to Execution of the model if you have already prepared the input data.

### Download the data
Download the corpus from https://github.com/Coleridge-Initiative/rclc

### Convert the data into .txt files
Follow the steps in https://github.com/LARC-CMU-SMU/rclc_2019_baseline to convert the corpus into .txt files (kudos to @philipskokoh)


Put the txt files in ../data/input/files/text/

The input data consists of a .jsonld file with the metadata about publications and datasets, and a folder with all the publications in .txt format.

Our model assumes the existence of a .json file with the metadata about the publications, so we first create this file. Each entry in the file has the following format:

```
{'publication_id': 0,
 'unique_identifier': '9bab608b09ad5834ecf9',
 'title': '9bab608b09ad5834ecf9',
 'pub_date': '2015-10-15',
 'pdf_file_name': '9bab608b09ad5834ecf9.pdf',
 'text_file_name': '9bab608b09ad5834ecf9.pdf.txt'}
```
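
A quick sanity check of one entry can catch format mistakes before running the model. The sketch below is only an illustration: the helper `check_record` and the field types in `REQUIRED_FIELDS` are assumptions inferred from the example entry above, not a schema enforced by the model.

```python
# Field names come from the example entry; the expected types are an assumption.
REQUIRED_FIELDS = {
    "publication_id": int,
    "unique_identifier": str,
    "title": str,
    "pub_date": str,
    "pdf_file_name": str,
    "text_file_name": str,
}


def check_record(record):
    """Return a list of problems found in one publications.json entry."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


example = {
    "publication_id": 0,
    "unique_identifier": "9bab608b09ad5834ecf9",
    "title": "9bab608b09ad5834ecf9",
    "pub_date": "2015-10-15",
    "pdf_file_name": "9bab608b09ad5834ecf9.pdf",
    "text_file_name": "9bab608b09ad5834ecf9.pdf.txt",
}
print(check_record(example))  # an empty list means no problems were found
```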

In [288]:
from tqdm import tqdm_notebook
import json
import os

In [160]:
path = "../data/input/files/text/"
list_txt = os.listdir(path)

In [286]:
list_txt[0:2]

['9bab608b09ad5834ecf9.pdf.txt', 'bbc5405690917684db8c.pdf.txt']

The publication with id 2b4497873374d080d359 should not be included because there is a problem with its conversion into txt.

In [163]:
assert "2b4497873374d080d359.pdf.txt" not in list_txt
# If this assertion fails, remove the file from the folder and from the list

AssertionError: 

In [164]:
list_txt.remove("2b4497873374d080d359.pdf.txt")
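
The evaluation code further down describes this file as a blank txt. Assuming that is the failure mode, a scan like the following could flag similar files up front. The helper name `find_blank_texts` and the demo file names are hypothetical, and the demo runs on a temporary folder rather than on the real corpus.

```python
import os
import tempfile


def find_blank_texts(folder):
    """Return the sorted names of .txt files in `folder` that contain only whitespace."""
    blank = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(folder, name), encoding="utf-8", errors="ignore") as f:
            if not f.read().strip():
                blank.append(name)
    return blank


# Demo on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "aaa.pdf.txt"), "w", encoding="utf-8") as f:
        f.write("some text")
    with open(os.path.join(tmp, "bbb.pdf.txt"), "w", encoding="utf-8") as f:
        f.write("  \n")
    blank_files = find_blank_texts(tmp)

print(blank_files)
```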

In [165]:
# Our model assumes that the ids are numbers, but the original ids are strings, so we
# create this dictionary to transform from num_id to string_id (the original one)
dict_num_id2original_id = {}
list_json_datasets = []
for i, file in enumerate(list_txt):
    pub_id = file[0:-8]  # strip the ".pdf.txt" suffix
    list_json_datasets.append({
        "publication_id": i,
        "unique_identifier": pub_id,
        "title": pub_id,
        "pub_date": "2015-10-15",
        "pdf_file_name": file[0:-4],  # strip the ".txt" suffix
        "text_file_name": file
    })
    dict_num_id2original_id[i] = pub_id
    # Rename the file because the model assumes that the file names are the same as their ids (numbers)
    os.rename(path + file, path + str(i) + ".txt")

In [166]:
list_json_datasets[2]

{'publication_id': 2,
 'unique_identifier': '74c81f06754918327e5d',
 'title': '74c81f06754918327e5d',
 'pub_date': '2015-10-15',
 'pdf_file_name': '74c81f06754918327e5d.pdf',
 'text_file_name': '74c81f06754918327e5d.pdf.txt'}

In [167]:
len(list_json_datasets)

479

In [168]:
with open('../data/input/publications.json', 'w+') as fout:
    json.dump(list_json_datasets, fout)

We save this dictionary because the evaluation needs it to convert from num_id to string_id (the original id).

In [169]:
with open('./dict_num_id2original_id.json', 'w+') as f:
    json.dump(dict_num_id2original_id, f)
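
One detail worth keeping in mind when this file is read back: JSON object keys are always strings, so the integer keys become strings after a round trip. The short sketch below demonstrates this with the two ids shown earlier; the variable names are illustrative only.

```python
import json

mapping = {0: "9bab608b09ad5834ecf9", 1: "bbc5405690917684db8c"}

# Serializing and parsing again turns the integer keys into strings.
restored = json.loads(json.dumps(mapping))
print(list(restored.keys()))  # ['0', '1']

# Convert the keys back if integer ids are needed.
restored_int = {int(k): v for k, v in restored.items()}
```

This is why the evaluation code indexes the reloaded dictionaries with string keys such as `'0'`.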

## Execution of the model
Run the file code.sh from the terminal. It executes the model and writes the outputs to the file data/output/data_set_mentions.json

In [None]:
!python3 -m nltk.downloader punkt stopwords
%run ./docqa/scripts/run_on_user_documents_eval_final-mod_ph2_.py ./pretrain/cpu
%run ./ner/DocQA2UltraFine.py
%run ./ner/open_type/rcc_ultra_fine_main_v2.py release_model -lstm_type single -enhanced_mention -data_setup joint -add_crowd -multitask -mode test -reload_model_name release_model -load

## Evaluation

In [289]:
import numpy as np
from fuzzywuzzy import fuzz
from tqdm import tqdm_notebook  # needed here if the data preparation cells were skipped
import json

In [290]:
output_model_path = '../data/output/data_set_mentions.json'

In [291]:
#output from DocQA
with open(output_model_path, 'r') as f:
    output_model = json.load(f)

In [293]:
output_model[0]

{'publication_id': 0,
 'mention': 'National Health and Nutrition Examination Survey Data',
 'score': 0.38750842213630676}

In [255]:
with open("../corpus.jsonld", 'r') as f:
    corpus = json.load(f)

In [257]:
with open('dict_num_id2original_id.json', 'r') as f:
    dict_num_id2original_id = json.load(f)

In [258]:
dict_original_id2num_id = {v: k for k, v in dict_num_id2original_id.items()}

Let's create a dictionary <Dataset id, dataset metadata>

In [259]:
dict_datasets = {}
for instance in corpus["@graph"]:
    if instance["@type"] == "Dataset":
        dataset_id = instance["@id"].split("-")[-1]
        dict_datasets[dataset_id] = instance

In [261]:
dict_datasets['b5ecda6a60ca5c2476d2']

{'@id': 'https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset-b5ecda6a60ca5c2476d2',
 '@type': 'Dataset',
 'dct:alternative': [{'@value': 'Quarterly Food-at-Home Price Database'},
  {'@value': 'QFAFHP'},
  {'@value': 'QFAHPD'},
  {'@value': 'FAH'},
  {'@value': 'Quarterly Food at Home Prices Database'}],
 'dct:publisher': {'@value': 'U.S. Department of Agriculture'},
 'dct:title': {'@value': 'Quarterly Food at Home Prices'},
 'foaf:page': {'@type': 'xsd:anyURI',
  '@value': 'https://www.ers.usda.gov/data-products/quarterly-food-at-home-price-database'}}
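
The dataset id is the part of `@id` after the last hyphen. Since the URI contains other hyphens (for example in `Coleridge-Initiative`), only the last segment matters. A small sketch, with the hypothetical helper name `id_from_uri`:

```python
def id_from_uri(uri):
    """Return the part of the URI that follows the last '-'."""
    return uri.rsplit("-", 1)[-1]


uri = ("https://github.com/Coleridge-Initiative/adrf-onto/wiki/"
       "Vocabulary#dataset-b5ecda6a60ca5c2476d2")
print(id_from_uri(uri))  # b5ecda6a60ca5c2476d2
```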

In [262]:
def get_list_dataset_names(dataset_id):
    '''
    Given a dataset id, retrieve the list of names of this dataset
    '''
    dataset = dict_datasets[dataset_id]
    list_names = [dataset['dct:title']['@value']]
    if "dct:alternative" in dataset:
        if isinstance(dataset["dct:alternative"], list):
            for d in dataset["dct:alternative"]:
                list_names.append(d["@value"])
        else:
            list_names.append(dataset["dct:alternative"]["@value"])
    return list_names
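
In this corpus a property such as `dct:alternative` may hold a single object, a list of objects, or be absent. One possible way to handle the three cases uniformly is a small normalizing helper. The names `as_list` and `names_of` below are hypothetical, and the toy records are trimmed versions of the entry shown above.

```python
def as_list(value):
    """Wrap a single value in a list; pass lists through; map None to []."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def names_of(dataset):
    """Collect the title and all alternative names of one dataset record."""
    names = [dataset["dct:title"]["@value"]]
    names.extend(alt["@value"] for alt in as_list(dataset.get("dct:alternative")))
    return names


many = {"dct:title": {"@value": "Quarterly Food at Home Prices"},
        "dct:alternative": [{"@value": "QFAHPD"}, {"@value": "FAH"}]}
single = {"dct:title": {"@value": "Quarterly Food at Home Prices"},
          "dct:alternative": {"@value": "QFAHPD"}}
absent = {"dct:title": {"@value": "Quarterly Food at Home Prices"}}

print(names_of(many))
print(names_of(single))
print(names_of(absent))
```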

Let's create a dictionary that given a dataset id, returns the list of aliases of that dataset

In [263]:
dict_list_names_datasets = {}
for key in dict_datasets.keys():
    dict_list_names_datasets[key] = get_list_dataset_names(key)

### Entity Linking Model
This model maps a string mention of a dataset to its dataset id.

In [264]:
def get_id_dataset_name(dataset_name, dict_list_names_datasets):
    '''
    Retrieve the id of a dataset given its string mention in a publication.
    This is the Entity Linking model
    '''
    dict_dist = {}
    for key, list_aliases in dict_list_names_datasets.items():
        # Score the mention against every alias and keep the best score per dataset
        list_dist = []
        for alias in list_aliases:
            list_dist.append(fuzz.token_set_ratio(alias, dataset_name))
        dict_dist[key] = max(list_dist)
    return max(dict_dist.items(), key=lambda x: x[1])[0]
assert get_id_dataset_name("NHANES", dict_list_names_datasets) == '0a7b604ab2e52411d45a'
assert get_id_dataset_name("NHANES data", dict_list_names_datasets) == '0a7b604ab2e52411d45a'
assert get_id_dataset_name("NHANES dataset", dict_list_names_datasets) == '0a7b604ab2e52411d45a'
assert get_id_dataset_name("ICDB", dict_list_names_datasets) == '22571eb2d0cf42459c19'
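
The linker above always returns some dataset id, even when no alias resembles the mention. As a sketch of one possible variation, the function below also returns the best score so that weak matches can be rejected. Note the swapped-in technique: it uses `difflib.SequenceMatcher` from the standard library instead of `fuzz.token_set_ratio`, so its scores (0 to 1) are not comparable with fuzzywuzzy's. The name `link_mention`, the `min_score` parameter, and the toy ids `ds_a` and `ds_b` are assumptions for illustration.

```python
from difflib import SequenceMatcher


def link_mention(mention, aliases_by_id, min_score=0.0):
    """Return (dataset_id, score) of the closest alias, or (None, score) if too weak."""
    best_id, best_score = None, -1.0
    for dataset_id, aliases in aliases_by_id.items():
        for alias in aliases:
            score = SequenceMatcher(None, alias.lower(), mention.lower()).ratio()
            if score > best_score:
                best_id, best_score = dataset_id, score
    if best_score < min_score:
        return None, best_score
    return best_id, best_score


# Toy alias table with made-up ids.
toy_aliases = {
    "ds_a": ["National Health and Nutrition Examination Survey", "NHANES"],
    "ds_b": ["Quarterly Food at Home Prices", "QFAHPD"],
}
print(link_mention("nhanes", toy_aliases))
print(link_mention("zzzz", toy_aliases, min_score=0.5))
```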

### Select the best 5 answers per publication

In [265]:
def update_dict(d, k, v):
    if k in d:
        d[k].append(v)
    else:
        d[k] = [v]

In [266]:
dict_pub2answers = {}
for instance in output_model:
    pub_id = instance['publication_id'] 
    update_dict(dict_pub2answers, pub_id, (instance['mention'], instance['score']))

In [296]:
print(dict_pub2answers[0])

[('National Health and Nutrition Examination Survey Data', 0.38750842213630676), ('NHANES', 0.5256460905075073), ('NHANES', 0.6468685865402222), ('National Heath and Nutrition Examination Surveys', 0.5594545006752014)]


In [268]:
dict_y_preds = {}
for key in dict_pub2answers.keys():
    list_answers = sorted(dict_pub2answers[key], key=lambda t: t[1], reverse=True)[:5]
    dict_y_preds[str(key)] = [pred for (pred, score) in list_answers]

In [269]:
dict_y_preds['1']

['Investigation Approach',
 'Empirical Implications',
 'empirical implications',
 'Corporate taxation data']
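
The top 5 list can contain the same mention more than once (for example with different casing), which uses up slots. A possible refinement, shown here only as a sketch, is to keep the best score per mention before cutting to five. The helper `top_mentions` and the case-insensitive comparison are assumptions, not part of the baseline; the scores are rounded toy values.

```python
def top_mentions(answers, k=5):
    """Keep the highest-scoring entry per mention (case-insensitive), then take the top k."""
    best = {}
    for mention, score in answers:
        key = mention.lower()
        if key not in best or score > best[key][1]:
            best[key] = (mention, score)
    ranked = sorted(best.values(), key=lambda t: t[1], reverse=True)
    return [mention for mention, _ in ranked[:k]]


answers = [
    ("National Health and Nutrition Examination Survey Data", 0.39),
    ("NHANES", 0.53),
    ("NHANES", 0.65),
    ("nhanes", 0.20),
]
print(top_mentions(answers))
```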

In [270]:
from collections import OrderedDict

In [271]:
dict_y_pred_ids = {}
for key in tqdm_notebook(dict_y_preds.keys()):
    list_preds_ids = []
    for pred in dict_y_preds[key]:
        pred_id = get_id_dataset_name(pred, dict_list_names_datasets)
        list_preds_ids.append(pred_id)
    list_preds_ids = list(OrderedDict.fromkeys(list_preds_ids))
    while len(list_preds_ids) < 5:
        list_preds_ids.append("")
    dict_y_pred_ids[key] = list_preds_ids

In [295]:
dict_y_pred_ids['0']

['0a7b604ab2e52411d45a', '', '', '', '']
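
The deduplicate-then-pad step can be written compactly. The sketch below uses `dict.fromkeys`, which preserves insertion order on Python 3.7 and later, in place of `OrderedDict`. The helper name `dedupe_and_pad` and its parameters are hypothetical.

```python
def dedupe_and_pad(ids, size=5, filler=""):
    """Drop repeated ids (keeping first occurrences), then pad or cut to `size` items."""
    unique = list(dict.fromkeys(ids))[:size]
    return unique + [filler] * (size - len(unique))


print(dedupe_and_pad(["a", "a", "b"]))  # ['a', 'b', '', '', '']
```

The metric used below requires exactly five predictions per publication, which is why shorter lists are padded with empty strings.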

### Retrieve gold answers

- dict_golden_ans: publication id (num) to list of dataset names. We may not need this.
- dict_golden_ans_ids: publication id (num) to list of dataset ids.

In [273]:
dict_golden_ans = {}
dict_golden_ans_ids = {}
for instance in corpus["@graph"]:
    if instance["@type"] == "ResearchPublication":
        publication_id = instance["@id"].split("-")[-1]
        if publication_id != "2b4497873374d080d359": # BLANK TXT, SO SKIP
            list_dict_data_set_citation = []
            num_id = dict_original_id2num_id[publication_id]

            list_datasets = instance['cito:citesAsDataSource']
            if not isinstance(list_datasets, list):
                list_datasets = [list_datasets]

            list_datasets_ids = []
            for dataset in list_datasets:
                dataset_id = dataset["@id"].split("-")[-1]
                list_datasets_ids.append(dataset_id)
                list_dict_data_set_citation.append(
                    get_list_dataset_names(dataset_id)
                )
            dict_golden_ans[num_id] = list_dict_data_set_citation
            dict_golden_ans_ids[num_id] = list_datasets_ids

In [294]:
dict_golden_ans_ids['0']

['0a7b604ab2e52411d45a']

### Top5UpToD

In [275]:
'''
From https://github.com/LARC-CMU-SMU/rclc_2019_baseline/blob/master/src/eval_utils.py
'''
def top5UptoD_prec(y_true, y_pred):
    """ Evaluation metric used by rclc.
        This metric prioritizes precision and takes into account a variable
        number of datasets per publication.
        Reference: https://github.com/Coleridge-Initiative/rclc#evaluation
        Input:
            - y_true: ground truths
            - y_pred: top 5 predictions
    """
    if len(y_true) == 0:
        raise ValueError('Error: No ground truth!')
    if len(y_pred) != 5:
        raise ValueError('The length of y_pred must be equal to 5.')

    _D = len(y_true) if len(y_true) <= 5 else 5
    correct = 0
    error = 0
    m = 0
    for di in y_pred:
        if di in y_true:
            correct += 1
        else:
            error += 1
        m += 1
        if correct == _D:
            break
    return (correct * 1.) / m

In [276]:
list_scores = []
for key in dict_golden_ans_ids.keys():
    if key in dict_y_pred_ids:
        score = top5UptoD_prec(dict_golden_ans_ids[key], dict_y_pred_ids[key])
        list_scores.append(score)
    else:
        list_scores.append(0)
print(np.mean(list_scores))

0.631906750174
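
A few hand-checkable cases help in reading this number. The sketch below is a compact re-implementation of the same counting logic as `top5UptoD_prec` (without the input checks), written only for illustration under the hypothetical name `top5_upto_d`. The ids `a`, `b`, and `x` are toy values.

```python
def top5_upto_d(y_true, y_pred):
    """Precision over the predictions read until D = min(len(y_true), 5) hits are found."""
    d = min(len(y_true), 5)
    correct = 0
    seen = 0
    for pred in y_pred:
        seen += 1
        if pred in y_true:
            correct += 1
        if correct == d:
            break
    return correct / seen


# One gold dataset, found at rank 1: scanning stops immediately.
print(top5_upto_d(["a"], ["a", "", "", "", ""]))    # 1.0
# Two gold datasets, found at ranks 1 and 3: 2 correct out of 3 read.
print(top5_upto_d(["a", "b"], ["a", "x", "b", "", ""]))
# Two gold datasets, only one found: all 5 slots are read, so 1 / 5.
print(top5_upto_d(["a", "b"], ["a", "", "", "", ""]))  # 0.2
# Nothing found.
print(top5_upto_d(["a"], ["x", "", "", "", ""]))    # 0.0
```

The third case shows that padding with empty strings lowers the score whenever fewer gold datasets are found than exist.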
