# [TREC - COVID Data](https://ir.nist.gov/covidSubmit/index.html)
These notes use the [round 2](https://ir.nist.gov/covidSubmit/round2.html) data.

## Dataset

The dataset consists of three parts:

- [COVID-19 papers](https://www.semanticscholar.org/cord19), provided by the Allen Institute for Artificial Intelligence
- [list of valid ids](https://ir.nist.gov/covidSubmit/data/docids-rnd2.txt): the only documents we are allowed to use
- [topic set](https://ir.nist.gov/covidSubmit/data/topics-rnd2.xml): an XML file with different questions about COVID-19

The `cord_uid` column is the **only valid** id to identify a paper.
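
A minimal sketch of restricting the paper metadata to the valid ids. It assumes the metadata is a CSV with a `cord_uid` column and that the id list has one id per line. The function name `filter_valid_papers` and the sample rows are hypothetical, used only for illustration.

```python
import csv
import io


def filter_valid_papers(metadata_csv, valid_ids_text):
    """Return the metadata rows whose cord_uid is in the valid id list."""
    # Assumption: the id list contains one id per line.
    valid_ids = {line.strip() for line in valid_ids_text.splitlines() if line.strip()}
    reader = csv.DictReader(io.StringIO(metadata_csv))
    return [row for row in reader if row['cord_uid'] in valid_ids]


# Made-up sample data, not real CORD-19 records.
sample_metadata = "cord_uid,title\nabc123,Paper A\nxyz789,Paper B\n"
sample_valid_ids = "abc123\n"

papers = filter_valid_papers(sample_metadata, sample_valid_ids)
print([paper['cord_uid'] for paper in papers])
```

Using a set for the valid ids keeps each membership check cheap even when the id list is long.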

An entry in the topic set looks like:


```xml
 <topic number="1000">
      <query>covid effects, muggles vs. wizards</query>
      <question>Are wizards and muggles affected differently by COVID-19?</question>
      <narrative> 
      Seeking comparison of specific outcomes regarding infections in
      wizards vs. muggles population groups.
      </narrative>
 </topic>
```
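
As an illustration, an entry like this can also be read with Python's standard `xml.etree.ElementTree`. This is a sketch, assuming the entries are wrapped in a `<topics>` root element; the helper name `parse_topics` is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE_XML = """
<topics>
  <topic number="1000">
    <query>covid effects, muggles vs. wizards</query>
    <question>Are wizards and muggles affected differently by COVID-19?</question>
    <narrative>
      Seeking comparison of specific outcomes regarding infections in
      wizards vs. muggles population groups.
    </narrative>
  </topic>
</topics>
"""


def parse_topics(xml_text):
    """Turn each <topic> element into a plain dict with whitespace collapsed."""
    root = ET.fromstring(xml_text.strip())
    topics = []
    for topic in root.findall('topic'):
        entry = {'number': int(topic.get('number'))}
        for field in ('query', 'question', 'narrative'):
            # Fall back to '' when the element is missing, then collapse whitespace.
            entry[field] = ' '.join(topic.findtext(field, default='').split())
        topics.append(entry)
    return topics


parsed = parse_topics(SAMPLE_XML)
print(parsed[0]['number'], parsed[0]['query'])
```

Collapsing whitespace matters mostly for `narrative`, which spans several indented lines in the XML.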

### Topic set

In [24]:
import requests
import xmltodict
import json
import pprint

TOPIC_SET_URL = 'https://ir.nist.gov/covidSubmit/data/topics-rnd2.xml'
topic_set_xml = requests.get(TOPIC_SET_URL).text

topic_set = xmltodict.parse(topic_set_xml)

topics = topic_set['topics']['topic']

A single topic

In [29]:
topics[0]

OrderedDict([('@number', '1'),
             ('query', 'coronavirus origin'),
             ('question', 'what is the origin of COVID-19'),
             ('narrative',
              "seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans")])

All the queries

In [25]:
queries = [x['query'] for x in topics]
pprint.pprint(queries)

['coronavirus origin',
 'coronavirus response to weather changes',
 'coronavirus immunity',
 'how do people die from the coronavirus',
 'animal models of COVID-19',
 'coronavirus test rapid testing',
 'serological tests for coronavirus',
 'coronavirus under reporting',
 'coronavirus in Canada',
 'coronavirus social distancing impact',
 'coronavirus hospital rationing',
 'coronavirus quarantine',
 'how does coronavirus spread',
 'coronavirus super spreaders',
 'coronavirus outside body',
 'how long does coronavirus survive on surfaces',
 'coronavirus clinical trials',
 'masks prevent coronavirus',
 'what alcohol sanitizer kills coronavirus',
 'coronavirus and ACE inhibitors',
 'coronavirus mortality',
 'coronavirus heart impacts',
 'coronavirus hypertension',
 'coronavirus diabetes',
 'coronavirus biomarkers',
 'coronavirus early symptoms',
 'coronavirus asymptomatic',
 'coronavirus hydroxychloroquine',
 'coronavirus drug repurposing',
 'coronavirus remdesivir',
 'difference between cor

## Submissions

See the [round 2 submission instructions](https://ir.nist.gov/covidSubmit/submission2.html).

### Submission format


```
topicid Q0 docid rank score run-tag
```

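- `topicid` is the topic number, from `1` to `35`
- `Q0` is the literal string `Q0`
- `docid` is the `cord_uid` of the retrieved document
- `rank` is the position of this document in the ranked list for the topic
- `score` is the score the retrieval system computed for this document
- `run-tag` is a name that identifies the run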
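
To make the `rank` and `score` columns concrete, here is a minimal sketch that turns per-document scores into run lines. It assumes that a higher score means a better match and that ranks start at 1; `build_run_lines`, the doc ids, and the run tag are made up for illustration.

```python
def build_run_lines(topicid, scores, run_tag):
    """Format one topic's results as 'topicid Q0 docid rank score run-tag' lines.

    scores maps docid -> score. Assumption: a higher score is a better match.
    """
    # Sort by descending score; break ties by docid so the output is stable.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    lines = []
    for rank, (docid, score) in enumerate(ranked, start=1):
        lines.append(f"{topicid} Q0 {docid} {rank} {score:.4f} {run_tag}")
    return lines


# Hypothetical doc ids and scores.
lines = build_run_lines(1, {'docb': 0.5, 'doca': 2.25, 'docc': 0.5}, 'my-run')
print('\n'.join(lines))
```

The tie-break on `docid` is a design choice for reproducibility, not something the format requires.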

In [35]:
import pandas as pd

def run_row(topicid, docid, rank, run_tag='0'):
    return dict(topicid=topicid, docid=docid, rank=rank, run_tag=run_tag)

df = pd.DataFrame([run_row(0, 'a', 1), run_row(1, 'b', 1)])

In [41]:
OUT_PATH = './temp.tt'
df.to_csv(OUT_PATH, index=False, sep=' ', header=False)

In [42]:
with open(OUT_PATH) as f:
    print(f.read())

0 a 1 0
1 b 1 0
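
The rows printed above have four fields, whereas the submission format lists six. A small check like the following can catch that before submitting. It is only a sketch: `check_run_line` is a hypothetical helper that checks the shape of a line, not whether the doc id appears in the list of valid ids.

```python
def check_run_line(line):
    """Return a list of problems found in one run-file line (empty if none)."""
    fields = line.split()
    if len(fields) != 6:
        return [f"expected 6 fields, got {len(fields)}"]
    topicid, q0, docid, rank, score, run_tag = fields
    problems = []
    if not topicid.isdigit():
        problems.append("topicid is not a number")
    if q0 != "Q0":
        problems.append("second field is not the literal Q0")
    if not rank.isdigit():
        problems.append("rank is not a number")
    try:
        float(score)
    except ValueError:
        problems.append("score is not a number")
    return problems


print(check_run_line("0 a 1 0"))
print(check_run_line("1 Q0 a 1 0.5 my-run"))
```

Returning a list of problems instead of raising on the first one makes it easier to report everything wrong with a file in a single pass.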

