Research based on (Yang, Qi, Zhang, et al. 2018)
First you need Indri. Installation instructions for each operating system are available here.
Clone the repository with git:

git clone https://github.com/janaleible/hotpotQA-ir-task.git
Create a virtual environment with Python 3.6:

python -m venv venv
Activate the virtual environment:

source venv/bin/activate
Install the requirements with pip:

pip install -r requirements.txt
If pyndri does not compile, follow the instructions here with the virtual environment active.
Get the raw wiki data from HotpotQA.
Uncompress the file in ./data/raw.
From the root of the project, run:
(1) python data_processing.py -a title
(2) python data_processing.py -a trec
Process (1) will build a mapping from document title to wikipedia ID and vice-versa. The title2wid mapping has the structure Dict[str, List[int]] because some titles may not be disambiguated. Tests show this only happens for one article, with title Harry Diamond. TODO: ignore that one and create a fully one-to-one mapping.
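For illustration, assuming the mapping is pickled somewhere under data/ (the path and load step below are placeholders, not the module's actual API), a lookup could look like this:

import pickle

# Hypothetical location; adjust to wherever data_processing.py writes the mapping.
with open('data/processed/title2wid.pickle', 'rb') as file:
    title2wid = pickle.load(file)  # Dict[str, List[int]]

wids = title2wid.get('Harry Diamond', [])
if len(wids) == 1:
    print(f'wikipedia id: {wids[0]}')
else:
    print(f'ambiguous or unknown title, ids: {wids}')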
Process (2) will build TREC text files to be indexed with Indri. If you are running into memory issues, set the flag --use_less_memory 1. This will use constant memory and will be faster in building the index. A side effect is that the index will be slightly slower and that the original document will be much harder to recover from an external id.
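To give an idea of what the generated files contain: a TREC text collection wraps each article in DOC, DOCNO and TEXT tags. The snippet below is only a sketch of that format, not the exact output of data_processing.py:

# Sketch of writing a single article in TREC text format.
def to_trec(docno: str, text: str) -> str:
    return (
        '<DOC>\n'
        f'<DOCNO>{docno}</DOCNO>\n'
        '<TEXT>\n'
        f'{text}\n'
        '</TEXT>\n'
        '</DOC>\n'
    )

with open('data/trec/example.trectext', 'w') as file:  # hypothetical file name
    file.write(to_trec('12', 'Article text goes here.'))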
Once you have the TREC files, run Indri from the project root:

IndriBuildIndex build_indri_index.xml
The module retrieval.index provides an interface to the index and mappings. Some obvious functionality is implemented but much more is possible. Check out pyndri, IndriBuildIndex, IndriQueryLanguage, and IndriQueryLanguageReference for ideas.
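For example, a minimal pyndri session against the built index could look like the following sketch (the index path is assumed to match build_indri_index.xml; the retrieval.index module wraps similar calls):

import pyndri

index = pyndri.Index('index/')  # assumed index location

# Internal ids run from document_base() to maximum_document().
ext_doc_id, token_ids = index.document(index.document_base())

# A simple query returns (internal document id, score) pairs.
results = index.query('harry diamond', results_requested=10)
for int_doc_id, score in results:
    print(index.document(int_doc_id)[0], score)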
A known issue is that there is no easy way to retrieve the original TREC document from an external id. Right now you have to match the external id against the range of ids encoded in the .trectext file names, parse the XML until you find the document id, and retrieve the <TEXT> tag.
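Until that changes, a workaround along these lines is possible; the directory is a placeholder and, as noted above, a smarter version would first narrow down the candidate files via the id ranges in their names:

import glob
import re

def find_text(ext_doc_id: str, trec_dir: str = 'data/trec') -> str:
    # TREC text files are not well-formed XML as a whole, so scan with a regex.
    pattern = re.compile(
        r'<DOCNO>' + re.escape(ext_doc_id) + r'</DOCNO>\s*<TEXT>(.*?)</TEXT>',
        re.DOTALL,
    )
    for path in glob.glob(f'{trec_dir}/*.trectext'):
        with open(path) as file:
            match = pattern.search(file.read())
        if match:
            return match.group(1).strip()
    raise KeyError(ext_doc_id)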