# Assignment 3: Simple Question Analysis

The goal of this assignment is to write a first version of an interactive QA system. The system should be able to take a question in natural language (English) as input, analyse the question, and generate a SPARQL query for it.

For now, we will restrict our attention to questions of the form

What/Who was/is/were (the/a/an) X of (the/a/an) Y? 

e.g.
* "What is the average heartrate of a chicken?",
* "What is the more common english name of the Pinus longaeva?",
* "What is the average height of a giraffe?",
* "What is the conservation status of the Atlantic salmon?",
* "What is the top speed of a cheetah?",
* "What is the heart rate of a giraffe?",
* "What is the wingspan of a raven?",
* "What is the maximum wingspan of a bat?",
* "What is the taxon rank of birch?"
* etc

You can find more examples by searching the all-questions JSON file (on Nestor) for strings matching your pattern, using Linux grep/egrep or other search tools.

Below, we briefly go over the steps that are needed to build a simple interactive QA system (reading user input, analyzing the string, finding names of entities and properties, finding the URI for an entity or property, and constructing the SPARQL query), and discuss some tools you can use.

In [None]:
regex = r'(?:Who|What) (?:was|is|were) ?(?:the|a|an)? (?:.+) of ?(?:the|a|an)? (?:.+)\?' # raw string, so that \? reaches the regex engine unchanged
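The pattern above only tells you whether a question has the expected shape. As a minimal sketch (the names `QUESTION_PATTERN` and `split_question` are illustrative and not part of the assignment), a variant with named capturing groups can also return the X and Y parts. Because `.+` is greedy, the split is made at the last " of " in the question.

```python
import re

# Same shape as the pattern above, but with named capturing groups
# for the property (X) and the entity (Y).
QUESTION_PATTERN = re.compile(
    r'(?:Who|What) (?:was|is|were) (?:(?:the|a|an) )?(?P<prop>.+)'
    r' of (?:(?:the|a|an) )?(?P<entity>.+)\?'
)

def split_question(question):
    """Return (property_text, entity_text), or None if the question does not match."""
    match = QUESTION_PATTERN.fullmatch(question.strip())
    if match is None:
        return None
    return match.group('prop'), match.group('entity')

print(split_question('What is the top speed of a cheetah?'))   # ('top speed', 'cheetah')
print(split_question('What is the taxon rank of birch?'))      # ('taxon rank', 'birch')
print(split_question('How fast is a cheetah?'))                # None
```

Questions that do not fit the pattern return `None`, which gives the caller a simple way to reject unsupported input before any lookup is attempted.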


## Interactivity

In a notebook, you can ask for user input by using the input function, as shown below. The text entered by the user is stored in the variable question.


In [None]:
question = input('Please ask a question\n')

## Linguistic Analysis with Spacy

To generate a SPARQL query, the system needs to find a property and an entity. The first step is to match the correct words in the question. So, for the first example, the property is indicated by the words _average heartrate_ and the entity by the word _chicken_. These words can be sent as keywords to the Wikidata API to find the corresponding Wikidata IDs (URIs). (More on this below.)

Pattern matching can be done using regular expressions (using the re python library). Alternatively, we can use Spacy, a toolkit for doing linguistic analysis.

[Spacy](https://spacy.io/usage) is a toolkit that comes with pretrained models for doing linguistic analysis in a number of languages. It can read in a text or sentence, tokenize the text (separate punctuation from words), assign Part-of-Speech (NOUN, VERB, PROPN, etc.) to tokens, lemmatize words (_actors_ --> _actor_), and detect named entities (like movie titles). See here for a short [tutorial](https://spacy.io/usage/spacy-101). 

When you install Spacy, make sure to also download the statistical model for analysing English sentences, en_core_web_sm (for example with `python -m spacy download en_core_web_sm`).

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm") # this loads the model for analysing English text

## Spacy tokenization and annotation

The Spacy nlp function analyses an input text (i.e. the question of the user), and assigns annotations to each token in the input. It returns a Doc object, which is a sequence of Token objects. Each Token object exposes attributes of that token, such as its text, lemma, and part-of-speech tag.

The example below illustrates how to iterate over the token objects and read some useful attributes. Spans can be useful if you want to grab multiple tokens. The analyzer also finds names of entities (persons, organisations, locations), but, unfortunately, this is often unreliable for animal and plant names. A more robust approach might be to find tokens that start with an uppercase letter.

In [None]:
question = 'What is the more common english name of the Pinus longaeva?'

parse = nlp(question) # parse the input

for word in parse : # iterate over the token objects
    print(word.text, word.lemma_, word.pos_)
print(parse[3:5].text) # you can also select multiple tokens as a span.
for ent in parse.ents : # the analysis also detects names of entities. Very unreliable for animal and plant names...
    print(ent.text, ent.label_)
for word in parse :
    if word.text.istitle() : # check if word starts with uppercase letter
        print(word)
print(parse[8:10].text.istitle()) # check if all words in a span start with uppercase (False here: the span is "the Pinus")
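The uppercase heuristic mentioned above can be sketched without Spacy. The example below is only an illustration under stated assumptions: `capitalized_runs` is a hypothetical helper, and a plain whitespace split stands in for Spacy tokens. It groups consecutive capitalised tokens into candidate names.

```python
def capitalized_runs(tokens):
    """Group consecutive tokens that start with an uppercase letter."""
    runs, current = [], []
    for token in tokens:
        if token[:1].isupper():
            current.append(token)
        else:
            if current:
                runs.append(' '.join(current))
            current = []
    if current:
        runs.append(' '.join(current))
    return runs

# Plain whitespace split as a stand-in for Spacy tokens.
tokens = 'What is the conservation status of the Atlantic salmon?'.rstrip('?').split()
# Skip the question word, which is always capitalised.
print(capitalized_runs(tokens[1:]))  # ['Atlantic']
```

With Spacy, the token texts could be passed in as `[word.text for word in parse]`. Note that the result here is only _Atlantic_, not _Atlantic salmon_, so the heuristic is a hint about where an entity name starts rather than a complete solution.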

## Visualisation

For a quick understanding of what the parser does, and how it assigns part-of-speech tags, entities, etc., you can also visualise parse results. Below, the entity visualiser and the parsing visualiser are demonstrated. You can ignore the arrows (dependency links) for now; we return to them next week.

This code is for illustration only; it is not part of the assignment.

In [None]:
from spacy import displacy

question = "Who is the main character of the movie Harry Potter"

parse = nlp(question)

displacy.render(parse, jupyter=True, style="ent")

displacy.render(parse, jupyter=True, style="dep")

## Finding a URI:  the Wikidata entity finder 

From the parse of the user question, you can extract the words for the property and the entity that are needed to formulate a SPARQL query. The IDs of the property and entity can be found by querying the Wikidata entity finder.

See the example below for finding the id of the movie The Godfather. In most cases, the first result is correct, but it may be necessary to try various ids...

Properties can be found by including 'type' : 'property' in the parameters. 

In [None]:
import requests

url = 'https://www.wikidata.org/w/api.php'
params = {'action':'wbsearchentities',
          'language':'en',
          'format':'json'}

params['search'] = 'The Godfather'
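results = requests.get(url, params=params).json() # not named json, to avoid shadowing the standard module name
for result in results['search']:
    # some results have no description, so fall back to an empty string
    print("{}\t{}\t{}".format(result['id'], result.get('label', ''), result.get('description', '')))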
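As a minimal sketch of the property search described above, the parameters can be built by a small helper that adds `'type': 'property'` on request. The names `build_search_params` and `candidate_ids` are hypothetical, and the response below is a hand-written stand-in with placeholder ids, so no network access is needed.

```python
def build_search_params(text, is_property=False):
    """Build the query parameters for a wbsearchentities request."""
    params = {'action': 'wbsearchentities',
              'language': 'en',
              'format': 'json',
              'search': text}
    if is_property:
        params['type'] = 'property'  # search properties instead of entities
    return params

def candidate_ids(response_json):
    """Return the ids found in a search response, in the order they were returned."""
    return [result['id'] for result in response_json.get('search', [])]

print(build_search_params('wingspan', is_property=True))

# Hand-written stand-in for a response; the ids are placeholders, not real results.
fake_response = {'search': [{'id': 'P0000001'}, {'id': 'P0000002'}]}
print(candidate_ids(fake_response))  # ['P0000001', 'P0000002']
```

A real response would come from `requests.get(url, params=build_search_params(...)).json()`. Keeping all candidate ids, rather than only the first, makes it possible to try the next one when the first does not lead to an answer.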

## Building a SPARQL query 

With the IDs of the property and the entity, a SPARQL query can be formulated, and the SPARQL endpoint of Wikidata can be queried for an answer. Note that this is the same as in the previous assignment.

Also note that for python, a SPARQL query is just a string, so you can construct the query by concatenating the start of the query, the ids, and the end of the query. 

In [None]:
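ID1 = 'Q47703' # entity: The Godfather (IDs are case-sensitive, so use uppercase Q and P)
ID2 = 'P577' # property: publication date
query = 'SELECT ?answer WHERE { wd:' + ID1 + ' wdt:' + ID2 + ' ?answer }' # ?answerLabel would only be bound if the label service were included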
print(query)
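Since the query is just a string, the concatenation can be wrapped in a small function. The sketch below is one possible way to do this, not a required design: `build_query` is a hypothetical name, and the id check is an added safeguard. It also includes the Wikidata label service so that `?answerLabel` is bound for answers that are entities.

```python
import re

def build_query(entity_id, property_id):
    """Build a simple SPARQL query from a Wikidata entity id and property id."""
    # Guard against ids that are not of the form Q123 / P123 (e.g. lowercase ids).
    if not re.fullmatch(r'Q\d+', entity_id) or not re.fullmatch(r'P\d+', property_id):
        raise ValueError('expected ids like Q123 and P123')
    return (
        'SELECT ?answer ?answerLabel WHERE { '
        f'wd:{entity_id} wdt:{property_id} ?answer . '
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . } }'
    )

print(build_query('Q47703', 'P577'))
```

Validating the ids before building the string catches mistakes such as `q47703` early, instead of sending a query that silently returns no results.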

# Assignment 

Using the steps outlined above, write a function that takes input from the user, analyses it with Spacy, extracts the relevant keywords, finds the Wikidata URIs for these (using the Wikidata entity finder or manually written rules), constructs the relevant SPARQL query, sends the query to the SPARQL endpoint, and prints the answer.

Include 10 examples of questions that worked for your system in the comments or in a separate markdown cell. 

In [4]:
import requests, re, spacy

def formatAnswer(data):
    data = data[0]
    ans = str(data['answerLabel']['value']).capitalize()

    unit = ''
    if 'unitLabel' in data:
        unit =  ' ' + data['unitLabel']['value']

    # If the answer is a (number > 1) + unit combination, make unit plural
    # (only when there is a unit; otherwise a bare number like 1984 would become "1984s")
    if unit and ans.replace('.','',1).isdigit() and float(ans) > 1.0:
        unit += 's'

    return ans + unit



def queryWikidata(query) :
    try:
        data = requests.get('https://query.wikidata.org/sparql',
            headers = {'User-Agent': 'QASys/0.0 (https://rug.nl/LTP/; josh@bruegger.it)'},
            params={'query': query, 'format': 'json'}).json()['results']['bindings']
    except (requests.RequestException, ValueError, KeyError):
        # network failure, a non-JSON reply (e.g. when rate limited), or unexpected JSON
        return 'Error: Query failed: Too many requests?'

    if len(data) == 0:
        return None

    return formatAnswer(data)



def buildQuery(entityId, propertyId):
    # Go through the statement node (p: / ps:) so that the unit of a quantity
    # can be fetched via psv: when there is one
    q = 'SELECT ?answerLabel ?unitLabel WHERE { wd:'
    q += entityId + ' p:' + propertyId + ' ?s . ?s ps:'
    q += propertyId + ' ?answer . OPTIONAL { ?s psv:'
    q += propertyId + ' ?u . ?u wikibase:quantityUnit ?unit . } '
    q += 'SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" } }'
    return q



def getWikidataIDs(query, isProperty = False):
    url = 'https://www.wikidata.org/w/api.php'
    params = {'action':'wbsearchentities',
            'language':'en',
            'format':'json',
            'search': query}
    if isProperty:
        params['type'] = 'property'

    return requests.get(url,params).json()['search']



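_nlp = None # cached Spacy pipeline, loaded on first use

def extractPropertyAndObject(question):
    global _nlp
    if _nlp is None:
        _nlp = spacy.load("en_core_web_trf") # load the English model once; reloading it per question is slow
    doc = _nlp(question) # parse the input

    nouns = list(doc.noun_chunks)
    removeArticle = lambda text: re.sub('^(?:the|a|an) ', '', text)

    # Expect at least three noun chunks: the question word, the property and the entity
    if len(nouns) < 3:
        return "ERR", "ERR"

    entity = removeArticle(nouns[-1].text)
    propertyText = removeArticle(nouns[1].text)
    if len(nouns) > 3:
        # The property spans several noun chunks (e.g. "means of locomotion"):
        # take everything from the second chunk up to the second-to-last one
        pattern = '(' + re.escape(nouns[1].text) + '.*?' + re.escape(nouns[-2].text) + ')'
        m = re.search(pattern, question)
        if m:
            propertyText = removeArticle(m.group(1))

    return propertyText, entity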



# Has to match (?:Who|What) (?:was|is|were) ?(?:the|a|an)? (?:.+) of ?(?:the|a|an)? (?:.+)\?
def getAnswer(question):
    propertyText, objectText = extractPropertyAndObject(question)

    #print('Looking for: ' + propertyText + ' of ' + objectText + '...')

    possibleObjects = getWikidataIDs(objectText)
    possibleProperties = getWikidataIDs(propertyText, True)

    for object in possibleObjects:
        for property in possibleProperties:
            #print('trying: ' + property['id'] + ' of ' + object['id'])
            answer = queryWikidata(buildQuery(object['id'], property['id']))
            if answer is not None:
                return answer

    return None


def answerQuestion(q):
    ans = getAnswer(q)
    if ans is not None:
        print('Question: ' + q)
        print('Answer: ' + ans)
    else:
        print("Sorry, I don't know!")

In [3]:
questions = '''What is the life expectancy of a cat?
What is the highest observed lifespan of a tiger?
What was the height of a mammoth?
What is the parent taxon of Teuthida?
What is the height of an elephant?
What is the gestation period of a LLama?
What is the flower color of Helianthus annuus?
What is the taxon name of the black mamba?
What is the means of locomotion of Stomatopoda?
What is the opposite of a perennial plant?
What is the heart rate of a chicken?
What is the life expectancy of a house cat?
What is the heart rate of a giraffe?'''


for q in questions.splitlines():
    answerQuestion(q)

Question: What is the life expectancy of a cat?
Answer: 15 years
Question: What is the highest observed lifespan of a tiger?
Answer: 26.3 years
Question: What was the height of a mammoth?
Answer: 3.4 metres
Question: What is the parent taxon of Teuthida?
Answer: Decapodiformes
Question: What is the height of an elephant?
Answer: 4 metres
Question: What is the gestation period of a LLama?
Answer: 358 days
Question: What is the flower color of Helianthus annuus?
Answer: Yellow
Question: What is the taxon name of the black mamba?
Answer: Dendroaspis polylepis
Question: What is the means of locomotion of Stomatopoda?
Answer: Rolling
Question: What is the opposite of a perennial plant?
Answer: Annual plant
Question: What is the heart rate of a chicken?
Answer: 275 beats per minutes
Question: What is the life expectancy of a house cat?
Answer: 15 years


In [2]:
# Interactive
while True:
    q = input('Please ask a question\n Type stop to exit.\n')
    if q == 'stop':
        break
    answerQuestion(q)



RegistryError: [E893] Could not find function 'spacy.Tagger.v2' in function registry 'architectures'. If you're using a custom function, make sure the code is available. If the function is provided by a third-party package, e.g. spacy-transformers, make sure the package is installed in your environment.

Available names: spacy-legacy.CharacterEmbed.v1, spacy-legacy.EntityLinker.v1, spacy-legacy.HashEmbedCNN.v1, spacy-legacy.MaxoutWindowEncoder.v1, spacy-legacy.MishWindowEncoder.v1, spacy-legacy.MultiHashEmbed.v1, spacy-legacy.Tagger.v1, spacy-legacy.TextCatBOW.v1, spacy-legacy.TextCatCNN.v1, spacy-legacy.TextCatEnsemble.v1, spacy-legacy.Tok2Vec.v1, spacy-legacy.TransitionBasedParser.v1, spacy-transformers.Tok2VecTransformer.v1, spacy-transformers.Tok2VecTransformer.v2, spacy-transformers.Tok2VecTransformer.v3, spacy-transformers.TransformerListener.v1, spacy-transformers.TransformerModel.v1, spacy-transformers.TransformerModel.v2, spacy-transformers.TransformerModel.v3, spacy.CharacterEmbed.v2, spacy.EntityLinker.v1, spacy.HashEmbedCNN.v2, spacy.MaxoutWindowEncoder.v2, spacy.MishWindowEncoder.v2, spacy.MultiHashEmbed.v2, spacy.PretrainCharacters.v1, spacy.PretrainVectors.v1, spacy.SpanCategorizer.v1, spacy.Tagger.v1, spacy.TextCatBOW.v2, spacy.TextCatCNN.v2, spacy.TextCatEnsemble.v2, spacy.TextCatLowData.v1, spacy.Tok2Vec.v2, spacy.Tok2VecListener.v1, spacy.TorchBiLSTMEncoder.v1, spacy.TransitionBasedParser.v2

### Questions tested

- What is the life expectancy of a cat?
- What is the highest observed lifespan of a tiger?
- What was the height of a mammoth?
- What is the parent taxon of Teuthida?
- What is the height of an elephant?
- What is the gestation period of a LLama?
- What is the flower color of Helianthus annuus?
- What is the taxon name of the black mamba?
- What is the means of locomotion of Stomatopoda?
- What is the opposite of a perennial plant?
- What is the heart rate of a chicken?
- What is the life expectancy of a house cat?
- What is the heart rate of a giraffe?