SearchQA

Associated paper:
https://arxiv.org/abs/1704.05179

Here are raw, split, and processed files: https://drive.google.com/drive/u/2/folders/1kBkQGooNyG0h8waaOJpgdGtOnlb1S649

One can collect the original json files through web search using the scripts in qacrawler. Please refer to the README in the folder for further details on how to use the scraper. The files in the tests folder can be used to try it out. The above link also contains the original json files that were collected using the Jeopardy! dataset.

There are also stat files that give the number of snippets found for the question associated with each filename. This number can range from 0 to 100. For some questions the crawler was set to collect the first 50 snippets, and for others the first 100. When a search did not return enough results to reach that limit, the available ones were collected. During training we ignored all files that contain 40 or fewer snippets to eliminate possible trivial cases. The training data also ignores snippets from the 51st onward.
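The two training restrictions described above can be illustrated with a minimal sketch. It is not the conversion script used for the released files. It assumes the snippets for each question have already been loaded into a Python list, and the function name, constant names, and question ids are hypothetical.

```python
from typing import Dict, List

# Thresholds taken from the description above.
MIN_EXCLUSIVE = 40  # questions with 40 or fewer snippets are dropped
MAX_KEPT = 50       # snippets from the 51st onward are ignored


def select_training_snippets(
    snippets_by_question: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Drop questions with too few snippets and truncate the rest.

    `snippets_by_question` is a hypothetical in-memory mapping from a
    question id to its list of snippets, in search-result order.
    """
    selected = {}
    for question_id, snippets in snippets_by_question.items():
        if len(snippets) <= MIN_EXCLUSIVE:
            continue  # possible trivial case, ignored for training
        selected[question_id] = snippets[:MAX_KEPT]
    return selected


if __name__ == "__main__":
    # Toy data only; real snippets come from the crawled json files.
    toy = {
        "q_few": ["snippet"] * 40,
        "q_enough": ["snippet"] * 41,
        "q_many": ["snippet"] * 100,
    }
    kept = select_training_snippets(toy)
    print({qid: len(snips) for qid, snips in kept.items()})
    # prints {'q_enough': 41, 'q_many': 50}
```

A question with exactly 40 snippets is dropped, while one with 41 is kept in full, and anything beyond the first 50 snippets is cut.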

And here is the link for the Jeopardy! files themselves:
https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/

NOTE: We will release the script that converts these to the training files above with appropriate restrictions.

Some requirements:
nltk==3.2.1
pandas==0.18.1
selenium==2.53.6
pytest==3.0.2
pytorch==0.1.11

