SearchQA

Associated paper:
https://arxiv.org/abs/1704.05179

The raw, split, and processed files are available here: https://drive.google.com/drive/u/2/folders/1kBkQGooNyG0h8waaOJpgdGtOnlb1S649


One can collect the original JSON files through web search using the scripts in qacrawler. Please refer to the README in that folder for details on how to use the scraper. The files in the tests folder can be used to try it out. The link above also contains the original JSON files collected using the Jeopardy! dataset.

There are also stat files that give the number of snippets found for the question associated with each filename. This number can range from 0 to 100: for some questions the crawler was set to collect the first 50 snippets, and for others the first 100. When the search did not return enough results to reach that limit, all available snippets were collected. During training we ignored all files containing 40 or fewer snippets to eliminate possibly trivial cases. The training data also ignores snippets from the 51st onward.
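The filtering rule above can be sketched as a small Python helper. This is a hedged illustration, not part of the released code: the dictionary mapping filenames to snippet counts is a hypothetical stand-in for whatever the stat files actually contain.

```python
def select_training_questions(snippet_counts, min_snippets=41, max_snippets=50):
    """Apply the filtering described above: drop questions with 40 or
    fewer snippets, and cap the snippets used per question at 50.

    snippet_counts: dict mapping a question's filename to the number of
    snippets the crawler retrieved for it (0 to 100).
    Returns a dict of kept filenames mapped to the snippet count used.
    """
    selected = {}
    for filename, count in snippet_counts.items():
        if count < min_snippets:
            # 40 or fewer snippets: skip as a possibly trivial case
            continue
        selected[filename] = min(count, max_snippets)
    return selected

# Example with made-up counts (filenames are illustrative only):
counts = {"q001.json": 37, "q002.json": 85, "q003.json": 42}
print(select_training_questions(counts))
```

Here `q001.json` is dropped (40 or fewer snippets), `q002.json` is capped at 50, and `q003.json` is kept with all 42 snippets.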

Here is the link to the Jeopardy! files themselves:
https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/

NOTE: We will release the script that converts these into the training files above, with appropriate restrictions.


Some requirements:

nltk==3.2.1
pandas==0.18.1
selenium==2.53.6
pytest==3.0.2
pytorch==0.1.11
