This repository contains FreebaseQA, a new data set for open-domain QA over the Freebase knowledge graph. The question-answer pairs in this data set are collected from various sources, including the TriviaQA data set (Joshi et al., 2017) and other trivia websites (QuizBalls, QuizZone, KnowQuiz), and are matched against Freebase to generate relevant subject-predicate-object triples, which are then verified by human annotators. As all questions in FreebaseQA are composed independently for human contestants in various trivia-like competitions, this data set shows richer linguistic variation and complexity than existing QA data sets, making it a good test bed for emerging KB-QA systems.
If you find this data set useful, please cite the paper:
[1] K. Jiang, D. Wu and H. Jiang, "FreebaseQA: A New Factoid QA Data Set Matching Trivia-Style Question-Answer Pairs with Freebase," Proc. of North American Chapter of the Association for Computational Linguistics (NAACL), June 2019.
All data is distributed under the CC-BY-4.0 license.
This data set contains 28,348 unique questions, divided into three subsets: train (20,358), dev (3,994), and eval (3,996), formatted as JSON files: `FreebaseQA-[train|dev|eval].json`.

We have also included `FreebaseQA-partial.json`, which is not officially part of FreebaseQA but may be useful for training models for certain NLP tasks, such as named entity recognition and entity linking.
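For reference, here is a minimal Python sketch for reading the three splits; it assumes the JSON files sit in the working directory and uses the top-level `Dataset`, `Version`, and `Questions` fields documented below:

```python
import json

# Load each split and report its size.
# Assumes the files are in the current directory and follow the
# top-level structure described in the field list below.
for split in ("train", "dev", "eval"):
    with open(f"FreebaseQA-{split}.json", encoding="utf-8") as f:
        data = json.load(f)
    print(data["Dataset"], data["Version"], len(data["Questions"]), "questions")
```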
Each file is formatted as follows:
- `Dataset`: The name of this data set
- `Version`: The version of the FreebaseQA data set
- `Questions`: The set of unique questions in this data set
  - `Question-ID`: The unique ID of each question
  - `RawQuestion`: The original question collected from the data sources
  - `ProcessedQuestion`: The question processed with some operations, such as removal of the trailing question mark and decapitalization
  - `Parses`: The semantic parse(s) for the question
    - `Parse-Id`: The ID of each semantic parse
    - `PotentialTopicEntityMention`: The potential topic entity mention in the question
    - `TopicEntityName`: The name or alias of the topic entity in the question, from Freebase
    - `TopicEntityMid`: The Freebase MID of the topic entity in the question
    - `InferentialChain`: The path from the topic entity node to the answer node in Freebase, labeled as a predicate
    - `Answers`: The answer found from this parse
      - `AnswersMid`: The Freebase MID of the answer
      - `AnswersName`: The answer string from the original question-answer pair
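To illustrate how these fields nest, the following sketch walks one file and prints, for each question, its processed text together with the inferential chain and answers of each parse. The nesting (`Questions` → `Parses` → `Answers` as lists of objects) is inferred from the descriptions above, so treat the exact access pattern as an assumption:

```python
import json

# Walk the nested structure: Questions -> Parses -> Answers.
# The field names come from the list above; the list-of-objects
# layout at each level is an assumption based on their descriptions.
with open("FreebaseQA-dev.json", encoding="utf-8") as f:
    data = json.load(f)

for question in data["Questions"]:
    print(question["Question-ID"], "-", question["ProcessedQuestion"])
    for parse in question["Parses"]:
        print("  chain: ", parse["InferentialChain"])
        for answer in parse["Answers"]:
            # AnswersName holds the answer string(s) from the original pair.
            print("  answer:", answer["AnswersMid"], answer["AnswersName"])
```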
Accuracy is used as the evaluation metric for this data set, i.e., a question is considered answered correctly only if the predicted answer exactly matches one of the given answers.
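As a concrete reading of this metric, here is a small sketch of an exact-match scorer; the `gold`/`predictions` inputs and the lowercase/strip normalization are illustrative assumptions rather than part of an official evaluation script:

```python
def exact_match_accuracy(gold, predictions):
    """Fraction of questions whose prediction matches a given answer.

    gold: dict mapping Question-ID -> iterable of acceptable answer strings.
    predictions: dict mapping Question-ID -> predicted answer string.
    A question counts as correct only if the prediction equals one of the
    given answers; the lowercasing/stripping is an assumed normalization.
    """
    correct = 0
    for qid, answers in gold.items():
        pred = predictions.get(qid, "").strip().lower()
        if pred in {a.strip().lower() for a in answers}:
            correct += 1
    return correct / len(gold) if gold else 0.0
```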
We have extracted a subset of Freebase (a 2.2 GB zip file) that includes all entities (16M) and triples (182M) relevant to the FreebaseQA questions. This subset can accompany the FreebaseQA data set to evaluate the accuracy of trained models in answering the questions. It can be downloaded from the following link: https://www.dropbox.com/sh/a25p7j2ir8gqnvx/AABJvjoI9mbHYj3hyfuxSdGaa?dl=0