Research based on (Yang, Qi, Zhang, et al. 2018)
First you need Indri. Installation instructions for each operating system are available here.
Clone the repository with git:

git clone https://github.com/janaleible/hotpotQA-ir-task.git
Create a virtual environment with Python 3.6:

python -m venv venv
Activate the virtual environment:

source venv/bin/activate
Install the requirements with pip:

pip install -r requirements.txt
If pyndri does not compile, follow the instructions here with the virtual environment active.
Get the raw wiki data from HotpotQA.
Uncompress the file in ./data/raw.
From the root of the project, run:
(1) python data_processing.py -a title
(2) python data_processing.py -a trec
Process (1) will build a mapping from document title to wikipedia ID and vice-versa. The title2wid mapping has the structure Dict[str, List[int]] because some titles may not be disambiguated. Tests show this only happens for one article, with title Harry Diamond. TODO: ignore that one and create a fully one-to-one mapping.
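For illustration, assuming the mapping is pickled somewhere under data/ (the path and load step below are placeholders, not the module's actual API), a lookup could look like this:

import pickle

# Hypothetical location; adjust to wherever data_processing.py writes the mapping.
with open('data/processed/title2wid.pickle', 'rb') as file:
    title2wid = pickle.load(file)  # Dict[str, List[int]]

wids = title2wid.get('Harry Diamond', [])
if len(wids) == 1:
    print(f'wikipedia id: {wids[0]}')
else:
    print(f'ambiguous or unknown title, ids: {wids}')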
Process (2) will build TREC text files to be indexed with Indri. If you are running into memory issues, set the flag --use_less_memory 1. This will use constant memory and will be faster in building the index. A side effect is that the index will be slightly slower and that the original document will be much harder to recover from an external id.
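To give an idea of what the generated files contain: a TREC text collection wraps each article in DOC, DOCNO and TEXT tags. The snippet below is only a sketch of that format, not the exact output of data_processing.py:

# Sketch of writing a single article in TREC text format.
def to_trec(docno: str, text: str) -> str:
    return (
        '<DOC>\n'
        f'<DOCNO>{docno}</DOCNO>\n'
        '<TEXT>\n'
        f'{text}\n'
        '</TEXT>\n'
        '</DOC>\n'
    )

with open('data/trec/example.trectext', 'w') as file:  # hypothetical file name
    file.write(to_trec('12', 'Article text goes here.'))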
Once you have the TREC files, run Indri from the project root:

IndriBuildIndex build_indri_index.xml
The module retrieval.index provides an interface to the index and mappings. Some obvious functionality is implemented but much more is possible. Check out pyndri, IndriBuildIndex, IndriQueryLanguage, and IndriQueryLanguageReference for ideas.
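For example, a minimal pyndri session against the built index could look like the following sketch (the index path is assumed to match build_indri_index.xml; the retrieval.index module wraps similar calls):

import pyndri

index = pyndri.Index('index/')  # assumed index location

# Internal ids run from document_base() to maximum_document().
ext_doc_id, token_ids = index.document(index.document_base())

# A simple query returns (internal document id, score) pairs.
results = index.query('harry diamond', results_requested=10)
for int_doc_id, score in results:
    print(index.document(int_doc_id)[0], score)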
A known issue is that there is no easy way to retrieve the original TREC document from an external id. Right now you have to match the external id against the range of ids encoded in the .trectext file names, parse the XML until you find the document id, and retrieve the <TEXT> tag.
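Until that changes, a workaround along these lines is possible; the directory is a placeholder and, as noted above, a smarter version would first narrow down the candidate files via the id ranges in their names:

import glob
import re

def find_text(ext_doc_id: str, trec_dir: str = 'data/trec') -> str:
    # TREC text files are not well-formed XML as a whole, so scan with a regex.
    pattern = re.compile(
        r'<DOCNO>' + re.escape(ext_doc_id) + r'</DOCNO>\s*<TEXT>(.*?)</TEXT>',
        re.DOTALL,
    )
    for path in glob.glob(f'{trec_dir}/*.trectext'):
        with open(path) as file:
            match = pattern.search(file.read())
        if match:
            return match.group(1).strip()
    raise KeyError(ext_doc_id)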