# Entity extractor
We will use this notebook to extract the entities found in QALD-json questions and store them under the `linked_entities` key.
To achieve this, we will work from the train and test QALD9 subsets. These subsets organize the questions into simple and complex (simple being questions that require only one triple in the SPARQL query). Since the main goal of extracting the linked entities is to measure the quality of an entity linker, we will focus only on the simple questions.

Let's import the datasets:

In [1]:
import json

def read_json(filename):
    with open(filename, 'r', encoding="utf8") as f:
        return json.load(f)
    
def save_json(filename, data):
    with open(filename, 'w', encoding="utf8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)

In [46]:
train_subsets = read_json('../datasets/train_subsets.json')
test_subsets = read_json('../datasets/test_subsets.json')

Now let's create a function that adds the linked entities to a question. We will try it with the following example question:

In [47]:
example_question = {
                "id": "86",
                "question": [
                    {
                        "language": "en",
                        "string": "What is the highest mountain in Germany?"
                    }
                ],
                "query": {
                    "sparql": "PREFIX wdt: <http://www.wikidata.org/prop/direct/> PREFIX wd: <http://www.wikidata.org/entity/> SELECT ?uri WHERE { ?uri wdt:P31 wd:Q8502 ; wdt:P2044 ?elevation ; wdt:P17 wd:Q183 . } ORDER BY DESC(?elevation) LIMIT 1"
                },
                "answers": [
                    {
                        "head": {
                            "vars": [
                                "uri"
                            ]
                        },
                        "results": {
                            "bindings": [
                                {
                                    "uri": {
                                        "type": "uri",
                                        "value": "http://www.wikidata.org/entity/Q3375"
                                    }
                                }
                            ]
                        }
                    }
                ]
            }

In [130]:
import re
#def extract_linked_entities_from_question(question:dict):
#    try:
#        sparql = question.get('query').get('sparql')
#        where_queries = re.findall(r'WHERE\s?{.+}', sparql, re.I)
#        triples = []
#        entities = []
#        
#        for where_query in where_queries:
#            where_query = re.findall(r'{.+}', where_query, re.I)[0][1:-1]
#            triples = triples + re.split(r'\.|;', where_query)
        
#        triples = list(filter(lambda x: 'wdt:P31 ' not in x, triples))
#        for triple in triples:
#            entities = entities + [ x.replace('wd:', '') for x in re.findall('wd:Q\d+', triple) ]
#            entities = entities + [ x.replace('<http://www.wikidata.org/entity/', '').replace('>','') for x in re.findall('<http:\/\/www\.wikidata\.org\/entity\/Q\d+>', triple) ]
        
#        return list(set(entities))
    
#    except:
#        print(question)
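def extract_linked_entities_from_question(question: dict):
    """Return the unique Wikidata entity IDs (Q...) mentioned in the question's SPARQL query."""
    try:
        sparql = question.get('query').get('sparql')
        # Entities written with the wd: prefix, e.g. wd:Q183
        prefix_entities = re.findall(r'wd:(Q\d+)', sparql)
        # Entities written as full URIs, e.g. <http://www.wikidata.org/entity/Q183>
        uri_entities = re.findall(r'<http://www\.wikidata\.org/entity/(Q\d+)>', sparql)
        return list(set(prefix_entities + uri_entities))
    except (AttributeError, TypeError):
        # Missing or malformed 'query'/'sparql' field: report the question and return no entities
        print(question)
        return []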

In [131]:
extract_linked_entities_from_question(example_question)

['Q8502', 'Q183']
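The example question only uses the `wd:` prefixed form. As a quick check that the full-URI form is captured too, here is a minimal sketch that runs the same two patterns on a hypothetical query string (written for illustration, not taken from QALD). The names `sample_sparql`, `prefixed` and `full_uri` are assumptions introduced here, and the result is sorted only to make the printed order deterministic.

```python
import re

# Hypothetical SPARQL string mixing the prefixed form (wd:Q183) and the
# full-URI form (<http://www.wikidata.org/entity/Q3375>); not from QALD.
sample_sparql = (
    "SELECT ?uri WHERE { ?uri wdt:P17 wd:Q183 . "
    "<http://www.wikidata.org/entity/Q3375> wdt:P17 wd:Q183 . }"
)

# Capture only the Q-identifier in each form.
prefixed = re.findall(r'wd:(Q\d+)', sample_sparql)
full_uri = re.findall(r'<http://www\.wikidata\.org/entity/(Q\d+)>', sample_sparql)

# Deduplicate, then sort so the output order does not depend on set ordering.
entities = sorted(set(prefixed + full_uri))
print(entities)  # ['Q183', 'Q3375']
```

Note that `wdt:P17` is not matched by the first pattern, since the pattern requires the literal prefix `wd:` followed by `Q` and digits.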

Now let's apply this to all the simple question subsets (singular, boolean, multiple and aggregation):

In [132]:
def add_linked_entities_to_dataset(dataset:dict):
    def apply_to_subset(subset):
        for question in subset:
            question['linked_entities'] = extract_linked_entities_from_question(question)
        return subset
    
    for key, value in dataset.get('simple').items():
        dataset['simple'][key] = apply_to_subset(value)
        
    return dataset
    

In [133]:
result_test_subsets = add_linked_entities_to_dataset(test_subsets)
result_train_subsets = add_linked_entities_to_dataset(train_subsets)
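Before saving, it may be worth checking how many questions ended up without any linked entity. The sketch below is one possible way to do that. It runs on a made-up toy dataset that only mirrors the layout the function above assumes (`dataset['simple'][subset_name]` is a list of question dicts). The helper `count_unlinked` and the toy values are hypothetical and not part of the original notebook.

```python
# Toy dataset mirroring the assumed layout: dataset['simple'][subset_name]
# is a list of question dicts. All values here are made up for illustration.
toy_dataset = {
    "simple": {
        "singular": [
            {"id": "1", "linked_entities": ["Q183"]},
            {"id": "2", "linked_entities": []},
        ],
        "boolean": [
            {"id": "3", "linked_entities": ["Q8502", "Q183"]},
        ],
    }
}

def count_unlinked(dataset):
    """Return {subset_name: number of questions with no linked entities}."""
    return {
        name: sum(1 for q in subset if not q.get("linked_entities"))
        for name, subset in dataset.get("simple", {}).items()
    }

print(count_unlinked(toy_dataset))  # {'singular': 1, 'boolean': 0}
```

On the real data the same helper could be called as `count_unlinked(result_train_subsets)`, assuming the subsets follow this layout.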

Let's save the results, overwriting the original subset files:

In [134]:
save_json('../datasets/train_subsets.json', result_train_subsets)
save_json('../datasets/test_subsets.json', result_test_subsets)