# Advanced Question Analysis

The goal of this assignment is to write a more flexible version of the interactive QA system. As in the previous assignment, the system should be able to take a question in natural language (English) as input, analyse the question, and generate a SPARQL query for it.

## Assignment  // Additional requirements

* Make sure that your system can analyse at least two more question types, for example questions that start with *How* or *When*, questions where the property is expressed by a verb, yes/no questions, etc.
* Apart from the techniques introduced last week (matching tokens on the basis of their lemma or part-of-speech), also include at least one pattern where you use the dependency relations to find the relevant property or entity in the question.
* Include 10 examples of questions that your system can handle and that illustrate that you cover additional question types.

## Examples

Here is a non-representative list of questions and question types to consider. See the list with all questions for more examples.

* What do cows eat?
* How old can a European hedgehog get?
* Which tree produces papayas?
* How fast can a cheetah run?
* When did dodos go extinct?
* Where do platypuses live?
* Is a lion a feline?
* Can you eat ferns?
* How many beats per minute does a chicken have at rest?
* Who gave the Canis lupus familiaris its name?
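
One possible first step for handling these question types is to label a question by its opening words before choosing an extraction pattern. The sketch below is only an illustration: the label names and the regular expressions are assumptions, not part of the assignment, and they cover only the kinds of questions listed above.

```python
import re

# Hypothetical labels and patterns; illustrative and far from complete.
# Order matters: "how many" must be tested before the general "how + adjective".
QUESTION_PATTERNS = [
    ("count", re.compile(r"^how many\b", re.I)),
    ("quantity", re.compile(r"^how (old|fast|long|heavy|tall|big)\b", re.I)),
    ("time", re.compile(r"^when\b", re.I)),
    ("location", re.compile(r"^where\b", re.I)),
    ("yes_no", re.compile(r"^(is|are|was|were|do|does|did|can)\b", re.I)),
    ("entity", re.compile(r"^(what|which|who)\b", re.I)),
]


def classify_question(question):
    """Return the label of the first pattern that matches, or 'unknown'."""
    text = question.strip()
    for label, pattern in QUESTION_PATTERNS:
        if pattern.search(text):
            return label
    return "unknown"


for q in ["How old can a European hedgehog get?", "Is a lion a feline?"]:
    print(q, "->", classify_question(q))
```

Each label can then be mapped to its own analysis routine, so that adding a question type means adding one pattern and one handler.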


In [None]:
import spacy

nlp = spacy.load("en_core_web_lg") # this loads the model for analysing English text

## Dependency Analysis with Spacy

All the functionality of spaCy used in the last assignment is still available for question analysis.

In addition, use the dependency relations assigned by spaCy. A dependency relation is a directed, labelled arc between two tokens in the input. For the question in the example below, the parser is expected to label *platypuses* as the subject (with label nsubj) and to give the word *live*, with lemma *live*, as the head of which this subject is a dependent. In a passive sentence, the subject would carry the label nsubjpass instead.


In [None]:
question = 'Where do platypuses live?'
# nlp = spacy.load('en_core_sci_lg')
parse = nlp(question) # parse the input

print(list(parse.noun_chunks))
# print(getWikidataIDs("live", True))

for word in parse : # iterate over the token objects 
    print(word.lemma_, word.pos_, word.dep_, word.head.lemma_)
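
A dependency-based pattern can be sketched as follows: find the token labelled as subject and read off the lemma of its head. To keep the sketch runnable without a model, it uses a small hypothetical `Tok` class as a stand-in for spaCy tokens, and the analysis of the question is written by hand, so a real parse may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy token."""
    text: str
    lemma: str
    dep: str
    head: int  # index of the head token; the root points to itself


def subject_and_verb(tokens: List[Tok]) -> Optional[Tuple[str, str]]:
    """Return (subject lemma, head lemma) for the first subject found."""
    for tok in tokens:
        if tok.dep in ("nsubj", "nsubjpass"):
            return tok.lemma, tokens[tok.head].lemma
    return None


# Hand-written analysis of "Where do platypuses live?" (an assumption,
# not actual parser output).
parse_sketch = [
    Tok("Where", "where", "advmod", 3),
    Tok("do", "do", "aux", 3),
    Tok("platypuses", "platypus", "nsubj", 3),
    Tok("live", "live", "ROOT", 3),
    Tok("?", "?", "punct", 3),
]

print(subject_and_verb(parse_sketch))  # ('platypus', 'live')
```

With spaCy, such a list could be built from a parsed question as `[Tok(t.text, t.lemma_, t.dep_, t.head.i) for t in parse]`. The subject lemma is then a candidate entity and the verb lemma a candidate property.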


## Phrases

You can also match the full phrase that is the subject of the sentence, or that stands in any other dependency relation, using the `subtree` attribute of a token.


In [None]:
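def phrase(word) :
    # Join the tokens in the subtree rooted at `word` into one string
    return " ".join(child.text for child in word.subtree)

for word in parse:
    # nsubj covers active questions such as the one parsed above;
    # nsubjpass and agent cover passive sentences
    if word.dep_ in ('nsubj', 'nsubjpass', 'agent') :
        phrase_text = phrase(word)
        print(phrase_text)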


## Visualisation

For a quick understanding of what the parser does, and how it assigns parts of speech, entities, etc., you can also visualise parse results. Below, the entity visualiser and the dependency visualiser are demonstrated.
This code is for illustration only; it is not part of the assignment.

In [None]:
from spacy import displacy

parse = nlp("Which tree produces papayas?")

displacy.render(parse, jupyter=True, style="ent")

displacy.render(parse, jupyter=True, style="dep")

In [6]:
import requests, re, spacy

def formatAnswer(data):
    # Only the first result binding is used
    data = data[0]
    ans = str(data['answerLabel']['value']).capitalize()

    unit = ''
    if 'unitLabel' in data:
        unit = ' ' + data['unitLabel']['value']

    # If the answer is a (number > 1) + unit combination, make the unit plural.
    # Without the check on `unit`, a bare number would get a stray 's'.
    if unit and ans.replace('.','',1).isdigit() and float(ans) > 1.0:
        unit += 's'

    return ans + unit



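def queryWikidata(query) :
    try:
        data = requests.get('https://query.wikidata.org/sparql',
            headers = {'User-Agent': 'QASys/0.0 (https://rug.nl/LTP/; josh@bruegger.it)'},
            params={'query': query, 'format': 'json'}).json()['results']['bindings']
    # Network errors, non-JSON responses (e.g. when rate limited), or an unexpected JSON shape
    except (requests.RequestException, ValueError, KeyError):
        return 'Error: Query failed: Too many requests?'

    # No bindings means this entity/property combination has no value
    if len(data) == 0:
        return None

    return formatAnswer(data)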



def buildQuery(object, property):
    # Statement-level query: returns the value and, for quantities, its unit
    q = 'SELECT ?answerLabel ?unitLabel WHERE { wd:'
    q += object + ' p:' + property + ' ?s . ?s ps:'
    q += property + ' ?answer . OPTIONAL { ?s psv:'
    q += property + ' ?u . ?u wikibase:quantityUnit ?unit . } '
    q += 'SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" } }'
    return q



def getWikidataIDs(query, isProperty = False):
    url = 'https://www.wikidata.org/w/api.php'
    headers = {'User-Agent': 'QASys/0.0 (https://rug.nl/LTP/; josh@bruegger.it)'}

    params = {'action':'wbsearchentities',
            'language':'en',
            'format':'json',
            'limit':'50',
            'search': query}
    if isProperty:
        params['type'] = 'property'

    return requests.get(url,params, headers = headers).json()['search']



def extractPropertyAndObject(question):
    # Note: loading the model on every call is slow; consider loading it once outside the function
    nlp = spacy.load("en_core_web_sm") # this loads the model for analysing English text
    doc = nlp(question) # parse the input

    nouns = list(doc.noun_chunks)
    removeArticle = lambda text: re.sub('^(?:the|a|an) ', '', text)

    # Expects at least three noun chunks: question word, property, object
    if len(nouns) < 3:
        return "ERR", "ERR"

    object = removeArticle(nouns[-1].text)
    property = removeArticle(nouns[1].text)
    if len(nouns) > 3:
        # The property spans several noun chunks: take everything from the
        # second chunk up to the second-to-last one
        pattern = '(' + re.escape(nouns[1].text) + '.*?' + re.escape(nouns[-2].text) + ')'
        m = re.search(pattern, question)
        if m:
            property = removeArticle(m.group(1))

    return property, object



# Has to match (?:Who|What) (?:was|is|were) ?(?:the|a|an)? (?:.+) of ?(?:the|a|an)? (?:.+)\?
def getAnswer(question):
    propertyText, objectText = extractPropertyAndObject(question)

    #print('Looking for: ' + propertyText + ' of ' + objectText + '...')

    possibleObjects = getWikidataIDs(objectText)
    possibleProperties = getWikidataIDs(propertyText, True)

    for object in possibleObjects:
        for property in possibleProperties:
            #print('trying: ' + property['id'] + ' of ' + object['id'])
            answer = queryWikidata(buildQuery(object['id'], property['id']))
            if answer is not None:
                return answer

    return None


def answerQuestion(q):
    ans = getAnswer(q)
    print('Question: ' + q)
    if ans is not None:
        print('Answer: ' + ans)
    else:
        print("Sorry, I don't know!")
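
The query built by `buildQuery` returns a value, which does not fit yes/no questions such as *Is a lion a feline?*. One possible approach, sketched below under stated assumptions, is a SPARQL `ASK` query that tests whether one entity can be reached from another. The function names are hypothetical, and the default path over instance of (P31), subclass of (P279) and parent taxon (P171) is an assumption that may not suit every question.

```python
def build_ask_query(entity_id, class_id, path="wdt:P31|wdt:P279|wdt:P171"):
    """Build an ASK query: is class_id reachable from entity_id via path?"""
    # The '+' makes the property path match one or more steps.
    return "ASK { wd:" + entity_id + " (" + path + ")+ wd:" + class_id + " . }"


def parse_ask_response(response_json):
    """ASK results in SPARQL JSON carry a top-level 'boolean' field."""
    return bool(response_json.get("boolean", False))


# Example IDs only; look them up (e.g. with getWikidataIDs) rather than
# trusting the ones used here.
print(build_ask_query("Q140", "Q25265"))
print(parse_ask_response({"head": {}, "boolean": True}))
```

The query string can be sent to the same endpoint as before. Because the response has no `results` key, it needs its own handling rather than going through `formatAnswer`.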

In [11]:
import spacy
import classy_classification

query = '''SELECT DISTINCT ?wdLabel WHERE {
  wd:Q6145 ?wdt ?a.
  ?wd wikibase:directClaim ?wdt .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}'''

try:
  data = requests.get('https://query.wikidata.org/sparql',
                      headers = {'User-Agent': 'QASys/0.0 (https://rug.nl/LTP/; josh@bruegger.it)'},
                      params={'query': query, 'format': 'json'}).json()['results']['bindings']
except (requests.RequestException, ValueError, KeyError):
  print('Error: Query failed: Too many requests?')
  data = [] # keeps the loop below from failing on an undefined name


properties = []
for item in data:
  p = item['wdLabel']['value']
  if 'ID' not in p:
    properties.append(p)

print(properties)

question = "How old can a European hedgehog get?"
qDoc = nlp(question)

sim = {}

for p in properties:
  pDoc = nlp(p)
  # Key on the label rather than the score, so that properties with equal scores are all kept
  sim[p] = pDoc.similarity(qDoc)

# Print the properties ordered by descending similarity
for p in sorted(sim, key=sim.get, reverse=True):
  print(str(sim[p]) + ' : ' + p)

# text1 = 'How can I kill someone?'
# text2 = 'What should I do to be a peaceful?'
# doc1 = nlp(text1)
# doc2 = nlp(text2)
# print("spaCy :", doc1.similarity(doc2))


# nlp = spacy.blank("en")
# nlp.add_pipe(
#     "text_categorizer",
#     config={
#         "data": properties,
#         "model": "facebook/bart-large-mnli",
#         # "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
#         "cat_type": "zero",
#     }
# )

# print(nlp(question)._.cats)

['sequenced genome URL', 'video', 'award received', 'parent taxon', 'taxon name', 'pronunciation audio', 'taxon common name', 'Commons gallery', 'Commons category', 'image', 'on focus list of Wikimedia project', 'taxon range map image', 'IUCN conservation status', "topic's main category", 'gestation period', 'diel cycle', 'instance of', 'taxon rank', 'NBN System Key', 'EPPO Code', 'ITIS TSN', 'highest observed lifespan']
0.5833975185661963 : on focus list of Wikimedia project
0.5758644747240347 : instance of
0.5558329773388655 : topic's main category
0.47577313833952456 : video
0.40596640635175096 : taxon common name
0.40422444163639043 : taxon range map image
0.39444809488485993 : highest observed lifespan
0.3905310878629736 : image
0.3770107429114158 : diel cycle
0.3622714463190898 : EPPO Code
0.35922744031450804 : Commons gallery
0.34479242451065223 : pronunciation audio
0.31448513764637925 : award received
0.30480809228589606 : gestation period
0.29195863626017393 : Commons category
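
In the output above, the property that answers the question, highest observed lifespan, is ranked only seventh, so whole-question similarity alone does not pick it. One possible remedy is to combine the scores with a small table of cue phrases. The sketch below is an assumption, not part of the system above: the `PROPERTY_HINTS` table and the `rank_properties` function are hypothetical, and the scores in the usage example are rounded from the output above.

```python
# Hypothetical hints: question cue -> property labels to try first.
PROPERTY_HINTS = {
    "how old": ["highest observed lifespan", "life expectancy"],
    "pregnant": ["gestation period"],
}


def rank_properties(question, scored):
    """Order property labels: hinted labels first, then by descending score.

    scored is a list of (similarity, label) pairs; using pairs rather than
    a dict keyed on the score keeps labels that share the same score.
    """
    q = question.lower()
    hinted = {label
              for cue, labels in PROPERTY_HINTS.items() if cue in q
              for label in labels}

    def sort_key(pair):
        score, label = pair
        return (0 if label in hinted else 1, -score)

    return [label for _, label in sorted(scored, key=sort_key)]


scores = [
    (0.583, "on focus list of Wikimedia project"),
    (0.576, "instance of"),
    (0.394, "highest observed lifespan"),
]
print(rank_properties("How old can a European hedgehog get?", scores))
```

The hint table has to be maintained by hand, so it only helps for question types that were anticipated. For other questions the ranking falls back to the similarity scores.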

In [2]:
import numpy as np
import spacy
# nlp = spacy.load('en_core_web_lg')

text1 = 'How can I kill someone?'
text2 = 'What should I do to be a peaceful?'
doc1 = nlp(text1)
doc2 = nlp(text2)
print("spaCy :", doc1.similarity(doc2))

spaCy : 0.9035856671380171


In [2]:
import spacy
import json
nlp = spacy.load('en_core_web_trf')

# Use a context manager so that the file is closed after reading
with open('../all_questions.json') as f:
    data = json.load(f)

for ex in data[1:5]:
    print(ex['string'])
    doc = nlp(ex['string'])
    nouns = list(doc.noun_chunks)
    print(doc.ents)
    print(nouns)
    # for ent in doc.ents :
    #     print(ent)

What is the life expectancy of a cat?
()
[What, the life expectancy, a cat]
How long is the gestation period of the European rabbit?
(European,)
[the gestation period, the European rabbit]
Who was the polar bear born in captivity at the Berlin Zoological Garden in 2006?
(the Berlin Zoological Garden, 2006)
[Who, the polar bear, captivity, the Berlin Zoological Garden]
How many Chinese common names are there for a great white shark?
(Chinese,)
[How many Chinese common names, a great white shark]


In [1]:
import spacy
nlp = spacy.load('en_core_sci_lg')

question = "What is the life expectancy of a cat?"

#Build wikidata search query from question
from spacy.tokens import Span
# wikidata_id is the attribute we add to the parse results
# wikidata_entity_link is the function that calls the api
# Span.set_extension('wikidata_id',getter=getWikidataIDs)

doc = nlp(question)

firstWord = doc[0].text.lower()
match firstWord:
    case 'what':
        print('what')
    case "is" | "does" | "are" | "do" | "can":
        print('is')
    case "where":
        print('where')
    case "which":
        print('which')
    case _:
        print("IDK")
# for ent in doc.ents :
#     print(ent.text, ent._.wikidata_id)




what

