# Advanced Question Analysis

The goal of this assignment is to write a more flexible version of the interactive QA system. As in the previous assignment, the system should be able to take a question in natural language (English) as input, analyse the question, and generate a SPARQL query for it.

## Assignment: additional requirements

* Make sure that your system can analyse at least two more question types, e.g. questions that start with *which* or *when*, or questions where the property is expressed by a verb.
* Apart from the techniques introduced last week (matching tokens on the basis of their lemma or part-of-speech), also include at least one pattern that uses dependency relations to find the relevant property or entity in the question.
* Include 10 examples of questions that your system can handle and that illustrate the additional question types you cover.

## Examples

Here is a non-representative list of questions and question types to consider. See the list with all questions for more examples.

* For what movie did Leonardo DiCaprio win an Oscar?
* How long is Pulp Fiction?
* How many episodes does Twin Peaks have?
* In what capital was the film The Fault in Our Stars filmed?
* In what year was The Matrix released?
* When did Alan Rickman die?
* Where was Morgan Freeman born?
* Which actor played Aragorn in Lord of the Rings?
* Which actors played the role of James Bond?
* Who directed The Shawshank Redemption?
* Which movies are directed by Alice Wu?


In [1]:
import spacy

nlp = spacy.load("en_core_web_sm") # this loads the model for analysing English text
                   

## Dependency Analysis with spaCy

All the functionality of spaCy used in the last assignment is still available for question analysis.

In addition, use the dependency relations assigned by spaCy. A dependency relation is a directed, labeled arc between two tokens in the input. In the example below, the parser detects that *movies* (lemma *movie*) is the subject of the passive sentence (with label nsubjpass), and that the head of this subject is the word *directed*, with lemma *direct*. The auxiliary *are* (lemma *be*) is itself a dependent of *directed*, with label auxpass.


In [2]:
question = 'Which movies are directed by Alice Wu?'

parse = nlp(question) # parse the input 

for word in parse : # iterate over the token objects 
    print(word.lemma_, word.pos_, word.dep_, word.head.lemma_)


which DET det movie
movie NOUN nsubjpass direct
be AUX auxpass direct
direct VERB ROOT direct
by ADP agent direct
Alice PROPN compound Wu
Wu PROPN pobj by
? PUNCT punct direct
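
The printed relations are enough to sketch a dependency-based pattern for this question type. The example below is a minimal illustration, not the required solution: the rows are copied by hand from the output above so that it runs without a spaCy model, and the function name `passive_pattern` is an assumption made for this sketch.

```python
# Rows copied from the printed parse above: (lemma, pos, dep, head lemma).
ROWS = [
    ("which", "DET", "det", "movie"),
    ("movie", "NOUN", "nsubjpass", "direct"),
    ("be", "AUX", "auxpass", "direct"),
    ("direct", "VERB", "ROOT", "direct"),
    ("by", "ADP", "agent", "direct"),
    ("Alice", "PROPN", "compound", "Wu"),
    ("Wu", "PROPN", "pobj", "by"),
    ("?", "PUNCT", "punct", "direct"),
]


def passive_pattern(rows):
    """Return (property, entity) for a 'Which X are VERBed by Y?' parse, else None."""
    root = next((r for r in rows if r[2] == "ROOT" and r[1] == "VERB"), None)
    has_passive_subject = any(r[2] == "nsubjpass" for r in rows)
    agent = next((r for r in rows if r[2] == "agent"), None)
    if root is None or agent is None or not has_passive_subject:
        return None
    # The entity is the object of the agent preposition plus its compound modifiers.
    pobj = next((r for r in rows if r[2] == "pobj" and r[3] == agent[0]), None)
    if pobj is None:
        return None
    modifiers = [r[0] for r in rows if r[2] == "compound" and r[3] == pobj[0]]
    return root[0], " ".join(modifiers + [pobj[0]])


print(passive_pattern(ROWS))  # ('direct', 'Alice Wu')
```

With a real parse, the same checks would be made on the dep_ and head attributes of the tokens instead of on tuples.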


## Phrases

You can also match the full phrase that forms the subject of the sentence, or the dependent in any other dependency relation, using the subtree attribute of a token, which yields the token together with all of its descendants.


In [3]:
def phrase(word) :
    children = []
    for child in word.subtree :
        children.append(child.text)
    return " ".join(children)
        
for word in parse:
    if word.dep_ == 'nsubjpass' or word.dep_ == 'agent' :
        phrase_text = phrase(word)
        print(phrase_text)
        

Which movies
by Alice Wu
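
To make clear what a subtree contains, here is a small sketch that recomputes the two phrases above from head indices alone. It assumes the head of each token as shown in the earlier parse output; the names `WORDS`, `HEADS`, `subtree`, and `phrase_text` are illustrative and not part of spaCy.

```python
WORDS = ["Which", "movies", "are", "directed", "by", "Alice", "Wu", "?"]
# Index of each token's head; the root ("directed", index 3) points at itself.
HEADS = [1, 3, 3, 3, 3, 6, 4, 3]


def subtree(index, heads):
    """Return the sorted indices of a token and all of its descendants."""
    found = {index}
    changed = True
    while changed:  # keep adding tokens whose head is already in the subtree
        changed = False
        for i, head in enumerate(heads):
            if head in found and i not in found:
                found.add(i)
                changed = True
    return sorted(found)


def phrase_text(index):
    """Join the words of the subtree rooted at the given token index."""
    return " ".join(WORDS[i] for i in subtree(index, HEADS))


print(phrase_text(1))  # Which movies
print(phrase_text(4))  # by Alice Wu
```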


## Visualisation

For a quick understanding of what the parser does and how it assigns parts of speech, entities, etc., you can also visualise parse results. Below, the entity visualiser and the dependency visualiser are demonstrated.
This code is for illustration only; it is not part of the assignment.

In [21]:
from spacy import displacy

question = "In what capital was the film The Fault in Our Stars filmed?"

parse = nlp(question)

displacy.render(parse, jupyter=True, style="ent")

displacy.render(parse, jupyter=True, style="dep")

## Assignment 4

In [5]:
q1 = "Which director directed Avatar"
q2 = "How long is Inception?"
q3 = "How many episodes does Twin Peaks have?"
q4 = "In what capital was the film The Fault in Our Stars filmed?"
q5 = "In what year was The Matrix released?"
q6 = "When did Alan Rickman die?"
q7 = "Where was Morgan Freeman born?"
q8 = "Which city was the movie Pulp Fiction filmed?"
q9 = "Where did Paul Walker die?"
q10 = "Who directed The Shawshank Redemption?"

### Improved QA (different from assignment 3)

In [6]:
import regex as re  # third-party regex module, needed for \p{Upper} in question_answer
import requests
from nltk.corpus import wordnet as wn  # used by nounify; requires the WordNet corpus

In [7]:
dep = ['nsubjpass', 'agent', 'nsubj','dobj']
tag = ['VBN', 'NNP', 'VB', 'JJ','VBD', 'WRB', 'NNS']
wh = ['Where', 'When']

# Manually map question keywords (mostly verbs) to the noun phrase used as the property label.
# This was also tried with NLTK (see nounify), but that does not work for some of them.
def find_synonyms(sentence):
    key = sentence.lower()
    if key == 'long':
        return "duration"
    if key == 'where born':
        return "place of birth"
    if key == 'where die':
        return "place of death"
    if key == 'when die':
        return "date of death"
    if key == 'how die':
        return "cause of death"
    return sentence

def nounify(verb_word):
    # return the first noun that is derivationally related to the verb, or None if there is none
    base = wn.morphy(verb_word, wn.VERB)
    if base is None:  # WordNet does not know this verb
        return None
    for verb_lemma in wn.lemmas(base, pos=wn.VERB):
        for related_form in verb_lemma.derivationally_related_forms():
            for synset in wn.synsets(related_form.name(), pos=wn.NOUN):
                for noun_lemma in synset.lemmas():
                    return noun_lemma.name()
    return None

def phrase(word) :
    # collect the text of every token in the subtree of word (the word and its descendants)
    children = []
    for child in word.subtree :
        children.append(child.text)
    return children

def get_entities(question):
    parse = nlp(question)
    entities = []
    for word in parse:
        if word.dep_ in dep:
            # use the subtree of a token with a relevant dependency as the entity phrase;
            # if several tokens match, the last one wins
            entities = phrase(word)
    return entities

def get_properties(entities, question):
    parse = nlp(question)
    properties = []
    for word in parse:
        # select property words by part-of-speech tag, skipping words that belong to the entity;
        # stop words are ignored, except 'Where' and 'When', which carry meaning here
        if (word.tag_ in tag and word.text not in entities) and (not word.is_stop or word.text in wh):
            properties.append(word.text)
    return properties

In [8]:
#from assignment 3

url = 'https://www.wikidata.org/w/api.php'
url1 = 'https://query.wikidata.org/sparql'

params1 = {'action':'wbsearchentities', 
          'language':'en',
          'format':'json'} #for entities
params2 = {'action':'wbsearchentities', 
          'language':'en',
          'format':'json',
          'type':'property'} #for properties

words = spacy.load("en_core_web_sm")

def entityID(entity):
    # search Wikidata for items whose label matches the entity string
    response = requests.get(url, params=dict(params1, search=entity)).json()
    return [result['id'] for result in response['search']]  # all candidate entity IDs

def propertyID(property1):
    # search Wikidata for properties whose label matches the property string
    response = requests.get(url, params=dict(params2, search=property1)).json()
    return [result['id'] for result in response['search']]  # all candidate property IDs

def query(ID1, ID2):
    # ask for the English label of every value of property ID2 on entity ID1
    sparql = 'SELECT ?answerLabel WHERE { wd:' + str(ID1) + ' wdt:' + str(ID2) + ' ?answer . SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . } }'
    results = requests.get(url1, params={'query': sparql, 'format': 'json'}).json()
    return results

def question_answer(question):
    entities = get_entities(question)
    properties = get_properties(entities, question)

    entity = " ".join(entities)
    # keep only the part of the phrase that starts at its first uppercase letter
    entity = " ".join(re.split("(?=\\p{Upper})", entity, maxsplit=1)[1:])
    property_label = find_synonyms(" ".join(properties))

    entity_ids = entityID(entity)
    property_ids = propertyID(property_label)

    # try each property/entity candidate pair in order and stop at the first pair that yields answers
    count = 0
    for property_id in property_ids:
        for entity_id in entity_ids:
            results = query(entity_id, property_id)
            if results is not None:
                for item in results['results']['bindings']:
                    for var in item:
                        count += 1
                        print('{}\t{}'.format(var, item[var]['value']))
            if count >= 1:
                return
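
The SPARQL string assembled in query can also be written as a small pure function, which makes it easy to inspect a query before sending it. This is only a sketch: the name `build_query`, the `language` parameter, and the ID check are assumptions added for illustration, and Q42 and P569 are placeholder IDs.

```python
import re


def build_query(entity_id, property_id, language="en"):
    """Build a SPARQL query for the labels of an entity's property values."""
    # Assumption for this sketch: IDs look like Q123 (item) and P123 (property).
    if not re.fullmatch(r"Q\d+", entity_id) or not re.fullmatch(r"P\d+", property_id):
        raise ValueError("expected IDs of the form Q123 and P123")
    return (
        "SELECT ?answerLabel WHERE { "
        f"wd:{entity_id} wdt:{property_id} ?answer . "
        "SERVICE wikibase:label { "
        f'bd:serviceParam wikibase:language "{language}" . '
        "} }"
    )


print(build_query("Q42", "P569"))
```

Rejecting malformed IDs early avoids sending a query that the endpoint can only answer with an error.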

### Questions

q1 = "Which director directed Avatar?"

q2 = "How long is Inception?"

q3 = "How many episodes does Twin Peaks have?"

q4 = "In what capital was the film The Fault in Our Stars filmed?"

q5 = "In what year was The Matrix released?"

q6 = "When did Alan Rickman die?"

q7 = "Where was Morgan Freeman born?"

q8 = "Which city was the movie Pulp Fiction filmed?"

q9 = "Where did Paul Walker die?"

q10 = "Who directed The Shawshank Redemption?"

In [9]:
question_answer(q1)

answerLabel	James Cameron


In [10]:
question_answer(q2)

answerLabel	148


In [11]:
question_answer(q3)

answerLabel	30


In [12]:
question_answer(q4)

answerLabel	Amsterdam


In [13]:
question_answer(q5)

answerLabel	1999-03-31T00:00:00Z
answerLabel	1999-06-17T00:00:00Z
answerLabel	1999-07-14T00:00:00Z
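
The question asks for a year, while the query returns full timestamps. One possible post-processing step, sketched here under the assumption that every value has the timestamp format shown above, reduces them to distinct years. The function name `release_years` is hypothetical and not part of the system above.

```python
from datetime import datetime


def release_years(timestamps):
    """Return the distinct years found in the timestamps, in ascending order."""
    years = {datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").year for ts in timestamps}
    return sorted(years)


# Values copied from the output above.
values = [
    "1999-03-31T00:00:00Z",
    "1999-06-17T00:00:00Z",
    "1999-07-14T00:00:00Z",
]
print(release_years(values))  # [1999]
```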


In [14]:
question_answer(q6)

answerLabel	2016-01-14T00:00:00Z


In [15]:
question_answer(q7)

answerLabel	Memphis


In [16]:
question_answer(q8)

answerLabel	Los Angeles


In [17]:
question_answer(q9)

answerLabel	Valencia


In [18]:
question_answer(q10)

answerLabel	Frank Darabont
