# Advanced Question Analysis

The goal of this assignment is to write a more flexible version of the interactive QA system. As in the previous assignment, the system should be able to take a question in natural language (English) as input, analyse the question, and generate a SPARQL query for it.

## Assignment: Additional requirements

* Make sure that your system can analyse at least two more question types, for example questions that start with *which* or *when*, or questions where the property is expressed by a verb.
* Apart from the techniques introduced last week (matching tokens on the basis of their lemma or part-of-speech), also include at least one pattern where you use the dependency relations to find the relevant property or entity in the question.
* Include 10 examples of questions that your system can handle and that illustrate that you cover additional question types.

## Examples

Here is a non-representative list of questions and question types to consider. See the list with all questions for more examples.

* For what movie did Leonardo DiCaprio win an Oscar?
* How long is Pulp Fiction?
* How many episodes does Twin Peaks have?
* In what capital was the film The Fault in Our Stars filmed?
* In what year was The Matrix released?
* When did Alan Rickman die?
* Where was Morgan Freeman born?
* Which actor played Aragorn in Lord of the Rings?
* Which actors played the role of James Bond?
* Who directed The Shawshank Redemption?
* Which movies are directed by Alice Wu?


In [1]:
import spacy

nlp = spacy.load("en_core_web_sm") # this loads the model for analysing English text
                   

## Dependency Analysis with spaCy

All the spaCy functionality used in the last assignment is still available for question analysis.

In addition, use the dependency relations assigned by spaCy. A dependency relation is a directed, labeled arc between two tokens in the input. In the example below, the parser detects that *movies* (lemma *movie*) is the subject of the passive sentence (label nsubjpass), and that its head is the word *directed* with lemma *direct*. The auxiliary *are* (lemma *be*) attaches to the same head with label auxpass.


In [2]:
question = 'Which movies are directed by Alice Wu?'

parse = nlp(question) # parse the input 

for word in parse : # iterate over the token objects 
    print(word.lemma_, word.pos_, word.dep_, word.head.lemma_)

which DET det movie
movie NOUN nsubjpass direct
be AUX auxpass direct
direct VERB ROOT direct
by ADP agent direct
Alice PROPN compound Wu
Wu PROPN pobj by
? PUNCT punct direct
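
The parse above can drive a dependency pattern for passive questions of the form "Which X are VERBed by Y?". The following is a minimal sketch, not part of the assignment code. To keep it runnable without a spaCy model, it works on a hand-written copy of the printed parse; the `Tok` class and the helpers `subtree_text` and `passive_pattern` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass

@dataclass
class Tok:
    """Stand-in for a spaCy token: index, text, lemma, dependency label, head index."""
    i: int
    text: str
    lemma: str
    dep: str
    head: int

# The parse printed above, copied by hand (head = index of the governing token).
parse_rows = [
    Tok(0, "Which", "which", "det", 1),
    Tok(1, "movies", "movie", "nsubjpass", 3),
    Tok(2, "are", "be", "auxpass", 3),
    Tok(3, "directed", "direct", "ROOT", 3),
    Tok(4, "by", "by", "agent", 3),
    Tok(5, "Alice", "Alice", "compound", 6),
    Tok(6, "Wu", "Wu", "pobj", 4),
    Tok(7, "?", "?", "punct", 3),
]

def subtree_text(tokens, head_index):
    """Join the texts of a token and all of its descendants, in sentence order."""
    keep = {head_index}
    changed = True
    while changed:  # keep adding tokens whose head is already in the set
        changed = False
        for t in tokens:
            if t.head in keep and t.i not in keep:
                keep.add(t.i)
                changed = True
    return " ".join(t.text for t in tokens if t.i in keep)

def passive_pattern(tokens):
    """Return (property lemma, entity phrase) for a passive question, else None."""
    root = next((t for t in tokens if t.dep == "ROOT"), None)
    if root is None:
        return None
    has_passive_subject = any(t.dep == "nsubjpass" and t.head == root.i for t in tokens)
    agent = next((t for t in tokens if t.dep == "agent" and t.head == root.i), None)
    if not has_passive_subject or agent is None:
        return None
    # The entity is the object of the agent preposition ("by Alice Wu").
    pobj = next((t for t in tokens if t.dep == "pobj" and t.head == agent.i), None)
    if pobj is None:
        return None
    return root.lemma, subtree_text(tokens, pobj.i)

print(passive_pattern(parse_rows))
```

With real spaCy tokens, the same checks would read `word.dep_`, `word.head` and `word.subtree` instead of the stand-in fields.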


## Phrases

You can also match the full phrase that fills the subject role, or any other dependency relation, using the token's `subtree` attribute, which yields the token and all of its descendants in sentence order.


In [3]:
def phrase(word) :
    children = []
    for child in word.subtree :
        children.append(child.text)
    return " ".join(children)
        
for word in parse:
    if word.dep_ == 'nsubjpass' or word.dep_ == 'agent' :
        phrase_text = phrase(word)
        print(phrase_text)
        

Which movies
by Alice Wu


## Visualisation

For a quick understanding of what the parser does and how it assigns part-of-speech tags, entities, etc., you can also visualise parse results. Below, the entity visualiser and the dependency visualiser are demonstrated.
This code is for illustration only; it is not part of the assignment.

In [20]:
from spacy import displacy

question = 'In how many films is Pulp Fiction?'

parse = nlp(question)

displacy.render(parse, jupyter=True, style="ent")

displacy.render(parse, jupyter=True, style="dep")

# Assignment Submission
### S3889807

## Code from last assignment
- Get wikidata IDs
- Generate SPARQL Queries
- Connect to wikidata endpoint to get SPARQL results

In [5]:
import requests

def get_wikidata_ids(name, search_property = False):
    """
    Returns a list of ID dictionaries (with labels and possibly descriptions)
    for a given name, either looking for entities or properties (set search_property:=True for the latter)
    Each dict contains keys: 'id', 'label', and possibly 'description'.
    If a description cannot be found, it will not be included in the dict.
    """
    all_results = []
    
    url = 'https://www.wikidata.org/w/api.php'
    params = {'action':'wbsearchentities', 
              'language':'en',
              'format':'json'}
    
    # add a param to the request if it needs to look for a property
    if search_property:
        params['type'] = 'property'
    
    params['search'] = name
    json = requests.get(url,params).json()
    
    # extract only the useful data from the json file
    try:
        for result in json['search']:
            # append an empty dictionary
            all_results.append({})
            # add the ID and label
            all_results[-1]['id'] = result['id']
            all_results[-1]['label'] = result['label']
            # add a description if it exists
            if 'description' in result.keys():
                all_results[-1]['description'] = result['description']
    except KeyError:
        # no results
        pass
    return all_results

In [6]:
def generate_sparql_query(entity_id, property_id):
    """ 
    Returns string with entity id and property id in place as a SPARQL query
    """
    query = f'''SELECT ?answerLabel WHERE {{
                wd:{entity_id} wdt:{property_id} ?answer.
                SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
                }}'''
    return query

def getSPARQLresults(query):
    """
    Relates to previous assignment. Return results (string) for a SPARQL query.
    The format is arbitrary and can be changed as desired.
    """
    url = 'https://query.wikidata.org/sparql'
    results = ""
    data = requests.get(url, params={'query': query, 'format': 'json'}).json()
    for item in data['results']['bindings']:
        for var in item :
            results+=('{}\t{}\n'.format(var,item[var]['value']))
            
    return results
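
`generate_sparql_query` always places the known entity in subject position. For a question such as "Which movies are directed by Alice Wu?", the known entity is the object of the triple, so the query needs the reverse direction. The following is a minimal sketch under that assumption; `generate_reverse_sparql_query` is a hypothetical helper that is not part of the original code, and `Q123` is a placeholder ID, not the actual Wikidata item for Alice Wu.

```python
def generate_reverse_sparql_query(entity_id, property_id):
    """
    Returns a SPARQL query string in which the known entity is the
    object of the triple, so the answer is the subject
    """
    query = f'''SELECT ?answerLabel WHERE {{
                ?answer wdt:{property_id} wd:{entity_id}.
                SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
                }}'''
    return query

# P57 is the Wikidata property "director"; Q123 is only a placeholder entity ID.
print(generate_reverse_sparql_query("Q123", "P57"))
```

The pipeline could try this form when the forward query returns nothing, at the cost of one extra request per entity/property pair.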

In [21]:
"""
Helpers
"""

def children(q, head, includeSelf):
    """
    Returns direct children and self if needed
    """
    children = []
    for token in q:
        if (token != head and token.head == head) or (includeSelf and token == head):
            children.append(token)
    return children

def get_root(doc):
    """
    Return the root of the dependency tree
    in a given nlp-parsed sentence (root)
    """
    for word in doc:
        if word.dep_ == "ROOT":
            return word
        
def phrase(word):
    """
    Given code: Return the phrase that the given word heads
    """
    children = []
    for child in word.subtree :
        children.append(child.text)
    return " ".join(children)

def findDep(q, dep):
    """
    Returns the first token whose dependency label is in dep, else False
    """
    for word in q:
        if word.dep_ in dep:
            return word
    return False

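def nominalize(word):
    """
    Map a question adjective/adverb lemma to a noun property label.
    Returns None if there is no mapping (callers filter out Nones)
    """
    nom_dict = {
        'much' : 'quantity',
        'long' : 'duration',
        'many' : 'quantity',
        'often' : 'frequency'
    }
    return nom_dict.get(word)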

In [22]:
### Entity extraction functions ###
import re
import spacy
from spacy.tokenizer import Tokenizer


def get_named_entities(doc):
    """ 
    spacy has entity recognition in-built, which might work well
    for names, but not for multi-word named entities (like movie titles)
    """
    return doc.ents

def custom_tokenizer(nlp):
    """
    spacy gives the programmer the ability to customize the tokenizer using regex.
    This one specifically looks for sets of contiguous words that all have an upper-
    case letter (i.e. that are titled). This can alternatively be done by using spacy's
    istitle() function on all combinations of words, but that is less efficient.
    e.g. "How I Met Your Mother" will be a single token using this.
    """
    token_re = re.compile(r"([A-Z][a-z']*(?:[\s][A-Z][a-z]+)*)")
    return Tokenizer(nlp.vocab, token_match = token_re.findall)

def get_entity_complex(q_str):
    """
    calls the above function on a query string
    """
    nlp.tokenizer = custom_tokenizer(nlp)
    doc = nlp(q_str)
    # return the last token, since the needed
    # entity is likely at the very end of the string
    return doc[-1].text

def get_closest_proper_noun(root):
    """
    It is often the case that the proper noun
    that is most closely associated with the root
    is the most relevant entity in the question.
    This is a recursive function starting at the 
    root and doing a depth-first search through the tree
    """
    pn = None
    for child in root.children:
        if child.pos_ == 'PROPN':
            pn = phrase(child)
            return pn
        
        pn = get_closest_proper_noun(child)
        if pn is not None:
            break
    
    return pn

### Parser for natural language query ###

def preprocess(query):
    """
    Preprocessing for looking for entities using
    non-dependency methods. This is not strictly
    necessary, but makes it slightly less brittle
    wrt the orthography of the sentence.
    """
    query = query.replace('?','')
    query = query.replace(query[0], query[0].lower(), 1)
    return query

def get_entity(query):
    """
    Return the entity of a given English query.
    The flow is:
        Check if there is an entity according to
        dependency tree
        if yes:
            return it
        else:
            preprocess query
            x1 <- get named entities
            x2 <- get entity with a custom tokenizer
            return the longest string between x1 and x2
    """
    nlp = spacy.load("en_core_web_sm") # this loads the model for analysing English text

    doc = nlp(query)
    entity, entity_temp = "", ""
    
    root = get_root(doc)
    entity_temp = get_closest_proper_noun(root)
    
    if entity_temp is not None:
        return entity_temp
    
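    query = preprocess(query)
    
    # longest named entity found by spaCy (doc.ents holds Span objects, so compare their text)
    entity_temp = max((ent.text for ent in get_named_entities(doc)), key=len, default="")
    entity = entity_temp if len(entity_temp)>len(entity) else entity
    
    # entity found by the custom tokenizer, run on the preprocessed query
    entity_temp = get_entity_complex(query)
    entity = entity_temp if len(entity_temp)>len(entity) else entity
    
    return entity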

In [27]:
### Property Extraction functions ###
def reduce_based_on_ids(property_list):
    """
    If there are multiple ways of getting a list of properties,
    then they may be repeated. This simply removes duplicates,
    while not changing the relative order within the input list.
    """
    p_set = {}
    for p in property_list:
        p_set[p['id']] = p

    return list(p_set.values())

def q_type_addition(doc):
    # dep_ is the string label; dep is an integer ID and never equals 'prep'
    if doc[0].dep_ == 'prep' or doc[-1].dep_ == 'prep':
        return True
    return False

def q_type_binary(doc):
    return doc[0].lemma_ in ['be', 'do', 'have']

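def questionType(doc):
    """Select question type and dispatch to the matching handler"""
    # NOTE: the handlers (whatOrWho, whenOrWhere, howLong, howMany, passive)
    # are not defined in this notebook and must exist before this is called
    options = {
            'What' : whatOrWho,
            'Who' : whatOrWho,
            'When' : whenOrWhere,
            'Where' : whenOrWhere,
            'Howlong' : howLong,
            'Howmany' : howMany}
    if (len(doc) > 1 and doc[0].text+doc[1].text in options):
        options[doc[0].text+doc[1].text](doc)
    elif (doc[0].text in options):
        options[doc[0].text](doc)
    else:
        print("question type not supported, but we'll try...")
        passive(doc)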

def get_root_related_props(doc, entity):
    """
    Several methods to try and get properties with
    respect to the root of the question.
    
    ps <- list of possible properties
    For each child in root:
        (i) it cannot be a property if it is the entity
        (ii) it cannot be a property if it is a question word (w-word)
        (iii) if it is a nominal subject, add it to ps
        (iv) if it is a direct object, add it to ps
        (v) if it is an adjective, add it to ps
    If the root itself is not a simple word, add it to ps (e.g. if root := 'direct')
    
    return list of possible properties.
    
    Note: The lemmas and the phrases are added in order to make sure
          multi-word properties (e.g. 'voice actor') are also considered
    """
    ps = []
    root = get_root(doc)

    for child in root.children:
        if phrase(child) == entity:
            continue
        if child.text.lower() in ['who', 'what', 'when', 'how', 'which']:
            continue
        if child.dep_ == 'nsubj':
            ps.append(phrase(child))
            ps.append(child.text)
        if child.dep_ == 'dobj':
            ps.append(phrase(child))
            ps.append(child.text)
        if child.pos_ == 'ADJ':
            ps.append(nominalize(child.lemma_))
            ps.append(child.text)
            ps.append(child.lemma_)
    if root.lemma_ not in ['be', 'have', 'do']:
        ps.append(root.text)
        ps.append(root.lemma_)
    return ps

def get_properties(q, entity):
    """
    Returns list of possible properties (list of strings)
    """
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(q)
    
    ps = get_root_related_props(doc, entity)

    # remove all Nones
    return [x for x in ps if x is not None]
    

SyntaxError: invalid syntax (<ipython-input-27-5f2ea0c77dc5>, line 20)

In [10]:
### Functions that are mainly heuristic/custom ###

def entity_related_to_movies(entity_list):
    """
    Given a list of dictionaries with information about the entity,
    check if the description contains a word that is related to a movie.
    These have been chosen based on wordnet's synsets. This can easily be
    extended or made more complex using nltk, but kept straightforward for now
    as it works well enough. This helps remove irrelevant entities that have
    the same name but are not related to movies (e.g. Lord of the Rings book series)
    """
    valid = []
    movie_relation = ['movie', 'film', 'picture', 'moving picture', 'motion', 'pic', 'flick', 'TV',
                      'television', 'show', 'animation']
    character_relation = ['fiction', 'fictitious', 'character']
    actor_relation = ['actor', 'actress', 'thespian']
    for word in movie_relation+character_relation+actor_relation:
        for e in entity_list:
            if 'description' in e.keys():
                if word in e['description']:
                    if e not in valid:
                        valid.append(e)
                    
    return valid
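
The substring test `word in e['description']` also matches keywords inside longer words, for example 'pic' inside 'epic'. The following is a minimal sketch of a whole-word variant, not the method used above. `matches_keyword` is a hypothetical helper, the descriptions are made up for illustration, and the sketch matches case-insensitively, which the original check does not.

```python
import re

def matches_keyword(description, keywords):
    """True if any keyword occurs as a whole word (case-insensitive) in description."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, description, flags=re.IGNORECASE) is not None

keywords = ['movie', 'film', 'pic', 'TV', 'show']

# Made-up descriptions, for illustration only.
print(matches_keyword("1994 American film", keywords))        # whole-word match on "film"
print(matches_keyword("epic poem from antiquity", keywords))  # "pic" occurs only inside "epic"
```

The trade-off is that inflected forms such as "films" no longer match unless they are added to the keyword list.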
            

In [24]:
import time

def pipeline(query, moderated = False):
    """
    Combines the above functions to create a pipeline to answer questions.
    
    Input: English string of the form "Who/What was/is/were (the) X of Y?"
    Output: Result of that query if found

    Often if there are too many queries sent to the endpoint at once,
    it will return none, so an optional boolean moderation is added
    to add an artificial 0.5 seconds between each request. This can be slower
    but has a lower chance of producing a request-related error. If an error
    occurs, try running with moderated = True
    """
    result = ''
    
    # get entities
    entity = get_entity(query)
    entity = entity.replace("the ", "")
    entity_ids = entity_related_to_movies(get_wikidata_ids(entity))
    
    # get properties
    properties = get_properties(query, entity)
    
    property_ids = []
    for p in properties:
        property_ids += get_wikidata_ids(p, True)
     
    # remove duplicates
    property_ids = reduce_based_on_ids(property_ids)
   
    # wrap in a try/except to help with request errors
    try:
        # for each combination of entities and properties
        # it is likely that the entities and properties
        # are sorted by relevance/similarity by wikidata
        # so return the first result that it finds. This
        # is not guaranteed however
        for entity_id in entity_ids:
            for property_id in property_ids:
                print(entity_id['label'], property_id['label'])
                # general SPARQL query
                sparql_query = generate_sparql_query(entity_id['id'], property_id['id'])
                result = getSPARQLresults(sparql_query)

                # check if there is a result
                if result is not None and result!='':
                    print("Closest answer:")
                    print(f"        entity: {entity_id['label']}")
                    print(f"      property: {property_id['label']}\n")
                    return result

                if moderated:
                    time.sleep(0.5)
                    
    except Exception:
        print("Error while searching!")
        if not moderated:
            print("Attempting moderated search!")
            return pipeline(query, moderated = True)
        
        else:
            pass # goes directly to final return statement
            
    return "Answer not found"
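
The docstring of `pipeline` describes retrying with a pause when the endpoint rejects rapid requests. The following is a minimal sketch of that idea as a loop around a single request, rather than a recursive call to the whole pipeline. `with_retries` is a hypothetical helper that is not part of the original code, and `flaky_fetch` only simulates a failing endpoint.

```python
import time

def with_retries(fetch, attempts=3, delay=0.5):
    """Call fetch() up to `attempts` times, sleeping `delay` seconds after each failure."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as error:  # in real use, narrow this to request/JSON errors
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error

# Stand-in for a network call: fails twice, then succeeds.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated endpoint error")
    return "ok"

result = with_retries(flaky_fetch, attempts=3, delay=0.01)
print(calls["n"], result)
```

In the pipeline, `fetch` could be a `lambda` that wraps `getSPARQLresults(sparql_query)`, so that only the failing request is repeated.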

## Question handling

This QA system should be able to handle several types of questions about movies, but it is specifically designed to work with the following patterns, where X is the property and Y is the entity:
- Who/What/When/etc was/is/were the/a/an X of Y? (from previous assignment, more passive, noun properties)
- Who/What/When/etc was/is/were Y X? (similar to above, more active, verb properties)
- How X is Y? (similar questions that use adjective properties)

The following are pairs of questions that the system is able to answer. They come in pairs to show that the same question phrased differently (as long as it follows one of the formats above) should give the same answer. A noun property (e.g. height) can be expressed as an adjective property (e.g. tall). Similarly, a verb property (acted) can be expressed as a noun property (actor).
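
One way to make the noun/adjective/verb correspondence explicit is a small lookup table from lemmas to noun labels. The following is a minimal sketch; `PROPERTY_NOUNS` and `candidate_labels` are hypothetical names introduced here, and the table holds only the pairs mentioned in this notebook.

```python
# Hypothetical lookup table: verb or adjective lemma -> noun label to search for.
PROPERTY_NOUNS = {
    "direct": "director",
    "act": "actor",
    "tall": "height",
    "long": "duration",
}

def candidate_labels(lemma):
    """Return search labels for a lemma: the mapped noun first, then the lemma itself."""
    labels = []
    mapped = PROPERTY_NOUNS.get(lemma)
    if mapped is not None:
        labels.append(mapped)
    labels.append(lemma)
    return labels

print(candidate_labels("tall"))
print(candidate_labels("cost"))
```

Keeping the raw lemma as a fallback means that unmapped words such as "cost" are still passed to the property search.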

In [17]:
qs = ['Who directed The Shawshank Redemption?'
     ,'Who is the director of The Shawshank Redemption?'
      
     ,'What is the birth date of Alan Rickman?'
     ,'When was Alan Rickman born?'
      
     ,'What is the height of Amitabh Bachchan?'
     ,'How tall is Amitabh Bachchan?'
      
     ,'What is the publication date of The Dark Knight?'
     ,'When was The Dark Knight published?'
     
     ,'Who acted as Gollum?'
     ,'Which actor played Gollum?'
     
     ,'What is the length of Interstellar?'
     ,'How long does Interstellar run?'
    ]
q11 = '''When did Alan Rickman die?'''
q12 = '''When was Pulp Fiction published?'''
q13 = '''Where was Morgan Freeman born?'''
q14 = '''Where does Home Alone originate?'''
q15 = '''Which movies are directed by Alice Wu?'''
q16 = '''How long is Pulp Fiction?'''
q17 = '''How many episodes does Twin Peaks have?'''
q18 = '''How long is Interstellar?'''
q19 = '''Which character was married by Aragorn'''
q20 = '''Which character did Aragorn marry?'''
for i in range(12,20):
    qs.append(globals()['q'+str(i)])
    
for q in qs[11:]:
    print(f"Query: {q}")
    print(pipeline(q))
    print("\t**********\n")

Query: How long does Interstellar run?
Closest answer:
        entity: Interstellar
      property: duration

answerLabel	169

	**********

Query: When was Pulp Fiction published?
Closest answer:
        entity: Pulp Fiction
      property: publication date

answerLabel	1994-05-21T00:00:00Z
answerLabel	1994-10-14T00:00:00Z
answerLabel	1994-11-03T00:00:00Z

	**********

Query: Where was Morgan Freeman born?
Closest answer:
        entity: Morgan Freeman
      property: date of birth

answerLabel	1937-06-01T00:00:00Z

	**********

Query: Where does Home Alone originate?
Closest answer:
        entity: Home Alone
      property: country of origin

answerLabel	United States of America

	**********

Query: Which movies are directed by Alice Wu?
Answer not found
	**********

Query: How long is Pulp Fiction?
Error while searching!
Attempting moderated search!
Error while searching!
Answer not found
	**********

Query: How many episodes does Twin Peaks have?
Error while searching!
Attempting m

In [18]:
print(pipeline("The Lord of the Rings was directed by whom?"))

Answer not found


In [28]:
print(pipeline("How much did Inception cost?"))

Inception student
Inception cost
Closest answer:
        entity: Inception
      property: cost

answerLabel	160000000

