happen2me/kqa-pro-datasets

annotations_creators: machine-generated, expert-generated
language: en
language_creators: found
license: mit
multilinguality: monolingual
pretty_name: KQA-Pro
size_categories: 10K<n<100K
source_datasets: original
tags: knowledge graph, freebase
task_categories: question-answering
task_ids: open-domain-qa

Dataset Card for KQA Pro

Dataset Description

Dataset Summary

KQA Pro is a large-scale dataset for complex question answering over a knowledge base. The questions are diverse and challenging, requiring multiple reasoning capabilities, including compositional reasoning, multi-hop reasoning, quantitative comparison, and set operations. Strong supervision is provided for each question in the form of a SPARQL query and a program.

Supported Tasks and Leaderboards

It supports knowledge-graph-based question answering. Specifically, it provides a SPARQL query and a program for each question.

Languages

English

Dataset Structure

train.json/val.json

[
    {
        'question': str,
        'sparql': str, # executable in our Virtuoso engine
        'program': 
        [
            {
                'function': str,  # function name
                'dependencies': [int],  # functional inputs, representing indices of the preceding functions
                'inputs': [str],  # textual inputs
            }
        ],
        'choices': [str],  # 10 answer choices
        'answer': str,  # golden answer
    }
]
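To make the schema concrete, here is a minimal sketch that parses a single record in the format above and walks its program. The record itself is invented for illustration (the question, SPARQL, function names, and two-item choice list are made up; real records have 10 choices):

```python
import json

# A hypothetical record mirroring the train.json/val.json schema above.
# Question, SPARQL, and program contents are invented for illustration;
# real records carry 10 answer choices.
record_json = """
[
    {
        "question": "Which movie is longer, A or B?",
        "sparql": "SELECT ?e WHERE { ... }",
        "program": [
            {"function": "Find", "dependencies": [], "inputs": ["A"]},
            {"function": "Find", "dependencies": [], "inputs": ["B"]},
            {"function": "SelectBetween", "dependencies": [0, 1], "inputs": ["duration", "greater"]}
        ],
        "choices": ["A", "B"],
        "answer": "A"
    }
]
"""

data = json.loads(record_json)
example = data[0]

# Each program step references earlier steps by index via 'dependencies',
# so the program forms a DAG whose last step produces the answer.
last_step = example["program"][-1]
print(last_step["function"])                       # SelectBetween
print([s["function"] for s in example["program"]])
```

Note how the final step consumes the results of steps 0 and 1 through its `dependencies` field.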

test.json

[
    {
        'question': str,
        'choices': [str],  # 10 answer choices
    }
]

Data Configs

This dataset has two configs, train_val and test, because the splits expose different fields. Specify the config when loading, e.g. load_dataset('drt/kqa_pro', 'train_val').

Data Splits

train, val, test

Additional Information

Knowledge Graph File

You can find the knowledge graph file kb.json in the original GitHub repository. It has the following format:

{
    'concepts':
    {
        '<id>':
        {
            'name': str,
            'instanceOf': ['<id>', '<id>'], # ids of parent concept
        }
    },
    'entities': # excluding concepts
    {
        '<id>': 
        {
            'name': str,
            'instanceOf': ['<id>', '<id>'], # ids of parent concept
            'attributes':
            [
                {
                    'key': str, # attribute key
                    'value':  # attribute value
                    {
                        'type': 'string'/'quantity'/'date'/'year',
                        'value': float/int/str, # float or int for quantity, int for year, 'yyyy/mm/dd' for date
                        'unit': str,  # for quantity
                    },
                    'qualifiers':
                    {
                        '<qk>':  # qualifier key, one key may have multiple corresponding qualifier values
                        [
                            {
                                'type': 'string'/'quantity'/'date'/'year',
                                'value': float/int/str,
                                'unit': str,
                            }, # the format of qualifier value is similar to attribute value
                        ]
                    }
                },
            ],
            'relations':
            [
                {
                    'predicate': str,
                    'object': '<id>', # NOTE: it may be a concept id
                    'direction': 'forward'/'backward',
                    'qualifiers':
                    {
                        '<qk>':  # qualifier key, one key may have multiple corresponding qualifier values
                        [
                            {
                                'type': 'string'/'quantity'/'date'/'year',
                                'value': float/int/str,
                                'unit': str,
                            }, # the format of qualifier value is similar to attribute value
                        ]
                    }
                },
            ]
        }
    }
}
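As a sketch of how to navigate this structure, here is a toy in-memory knowledge base mirroring the kb.json format. All ids, names, and values are invented for illustration, and `attribute_value` is a hypothetical helper, not part of the KQA Pro codebase:

```python
# Toy knowledge base in the kb.json format above; all ids and values
# are invented for illustration.
kb = {
    "concepts": {
        "Q5": {"name": "human", "instanceOf": []},
    },
    "entities": {
        "Q42": {
            "name": "Douglas Adams",
            "instanceOf": ["Q5"],
            "attributes": [
                {
                    "key": "height",
                    "value": {"type": "quantity", "value": 1.96, "unit": "metre"},
                    "qualifiers": {},
                }
            ],
            "relations": [
                {
                    "predicate": "place of birth",
                    "object": "Q84",
                    "direction": "forward",
                    "qualifiers": {},
                }
            ],
        }
    },
}

def attribute_value(kb, entity_id, key):
    """Return the first attribute value dict with the given key, or None."""
    for attr in kb["entities"][entity_id]["attributes"]:
        if attr["key"] == key:
            return attr["value"]
    return None

val = attribute_value(kb, "Q42", "height")
print(val["value"], val["unit"])  # 1.96 metre
```

The same lookup pattern extends to `relations` and to qualifier dictionaries, which map each qualifier key to a list of typed values.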

How to run SPARQLs and programs

We implement multiple baselines in our codebase, including a supervised SPARQL parser and a program parser.

For the SPARQL parser, we implement a query engine based on Virtuoso. You can install the engine following our instructions and then feed it your predicted SPARQL queries to get answers.

For the program parser, we implement a rule-based program executor, which receives a predicted program and returns the answer. A detailed introduction to our functions can be found in our paper.
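The program format (function, dependencies, inputs) lends itself to simple sequential execution, since each step's dependencies point to earlier steps. Below is a toy sketch of that execution model; the `Const` and `Add` functions are invented stand-ins, not the functions from the official executor:

```python
# Toy program executor: each step's 'dependencies' are indices of earlier
# steps whose results become extra arguments. The registered functions are
# invented stand-ins, not the official KQA Pro function set.
REGISTRY = {
    "Const": lambda inputs: int(inputs[0]),
    "Add": lambda inputs, a, b: a + b,
}

def execute(program):
    results = []
    for step in program:
        deps = [results[i] for i in step["dependencies"]]
        results.append(REGISTRY[step["function"]](step["inputs"], *deps))
    return results[-1]  # the last step's result is the answer

program = [
    {"function": "Const", "dependencies": [], "inputs": ["2"]},
    {"function": "Const", "dependencies": [], "inputs": ["3"]},
    {"function": "Add", "dependencies": [0, 1], "inputs": []},
]
print(execute(program))  # 5
```

Because dependencies always point backwards, a single left-to-right pass suffices; no explicit topological sort is needed.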

How to submit results of test set

You need to predict answers for all questions in the test set and write them to a text file in order, one per line. Here is an example:

Tron: Legacy
Palm Beach County
1937-03-01
The Queen
...
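Producing this file is a one-liner per answer. A minimal sketch, where `predictions` stands in for your model's outputs in test-set order and the file name `predict.txt` is an arbitrary choice:

```python
# Write one predicted answer per line, in test-set order.
# 'predictions' is a hypothetical model-output list; the file name is arbitrary.
predictions = ["Tron: Legacy", "Palm Beach County", "1937-03-01", "The Queen"]

with open("predict.txt", "w", encoding="utf-8") as f:
    for answer in predictions:
        f.write(answer + "\n")
```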

Then send the prediction file to us by email (caosl19@mails.tsinghua.edu.cn), and we will reply with your performance as soon as possible. To appear on the leaderboard, you also need to provide the following information:

  • model name
  • affiliation
  • open-ended or multiple-choice
  • whether your model uses SPARQL supervision
  • whether your model uses program supervision
  • single model or ensemble model
  • (optional) paper link
  • (optional) code link

Licensing Information

MIT License

Citation Information

If you find our dataset helpful in your work, please cite us:

@inproceedings{KQAPro,
  title={{KQA P}ro: A Large Diagnostic Dataset for Complex Question Answering over Knowledge Base},
  author={Cao, Shulin and Shi, Jiaxin and Pan, Liangming and Nie, Lunyiu and Xiang, Yutong and Hou, Lei and Li, Juanzi and He, Bin and Zhang, Hanwang},
  booktitle={ACL'22},
  year={2022}
}

Contributions

Thanks to @happen2me for adding this dataset.
