# Question Analysis

The goal of this assignment is to write a first version of an interactive QA system. The system should be able to take a question in natural language (English) as input, analyse the question, and generate a SPARQL query for it.

For now, we will restrict our attention to questions of the form

Who/What was/is/were (the) X of Y? 

e.g.

* What are the genres of Inception?
* What are the main subjects of Saving Private Ryan?
* What are the names of the Coen Brothers?
* What is the box office of Interstellar?
* What is the country of origin of Black Mirror?
* What is the duration of I Am Legend?
* What is the main subject of "The Godfather"?
* Who are the founders of Pixar Animation Studios?
* Who is the composer of Lord of The Rings?
* etc

## Interactivity

In a notebook, you can ask for user input by using the input function, as shown below. The text entered by the user is stored in the variable question. 


In [None]:
question = input('Please ask a question\n')

## Linguistic Analysis with Spacy

To generate a SPARQL query, the system needs to find a property and an entity. The first step is to match the correct words in the question. So, for the first example, the property is indicated by the word _genres_ and the entity by the word _Inception_. These words can be sent as key-words to a wikidata api, to find the corresponding wikidata IDs (URIs). (More on this below.)

Pattern matching can be done using regular expressions (using the re python library). Alternatively, we can use Spacy, a toolkit for doing linguistic analysis.
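
As a rough illustration of the regular-expression route, the sketch below splits a question of the restricted form into its property and entity words. The names `QUESTION_RE` and `split_question` are made up for this example, and the pattern assumes that X itself contains no " of ", so a question such as "What is the country of origin of Black Mirror?" is split at the wrong place.

```python
import re

# Pattern for "Who/What was/is/are/were (the/a/an) X of Y?".
# Assumption: X contains no " of ", so the first " of " separates X from Y.
QUESTION_RE = re.compile(
    r"^(?:who|what)\s+(?:was|is|are|were)\s+(?:(?:the|an|a)\s+)?"
    r"(?P<prop>.+?)\s+of\s+(?P<entity>.+?)\??\s*$",
    re.IGNORECASE,
)

def split_question(question):
    """Return (property_words, entity_words), or None if the pattern does not match."""
    match = QUESTION_RE.match(question.strip())
    if match is None:
        return None
    # drop double quotes around titles such as "The Godfather"
    return match.group("prop"), match.group("entity").strip('"')

print(split_question("Who is the composer of Lord of The Rings?"))
# ('composer', 'Lord of The Rings')
print(split_question("What is the country of origin of Black Mirror?"))
# ('country', 'origin of Black Mirror')  <- the known limitation
```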

[Spacy](https://spacy.io/usage) is a toolkit that comes with pretrained models for doing linguistic analysis in a number of languages. It can read in a text or sentence, tokenize the text (separate punctuation from words), assign Part-of-Speech (NOUN, VERB, PROPN, etc.) to tokens, lemmatize words (_actors_ --> _actor_), and detect named entities (like movie titles). See here for a short [tutorial](https://spacy.io/usage/spacy-101). 

When you install Spacy, make sure to also download the statistical model for analysing English sentences, en_core_web_sm (for example with `python -m spacy download en_core_web_sm`).

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm") # this loads the model for analysing English text

## Spacy tokenization and annotation

The Spacy nlp function analyses an input text (i.e. the question of the user) and annotates each token in it. It returns a Doc object, which behaves like a sequence of Token objects; each Token exposes its annotations as attributes such as text, lemma_ and pos_.

The example below illustrates how to iterate over the token objects and read their attributes. Spans can be useful if you want to grab multiple tokens. The analyzer also finds names of entities (persons, organisations, locations), but, unfortunately, this often does not work for movie titles. A more robust approach might be to find tokens that start with an uppercase letter.

In [None]:
question = 'Who are the main characters of the movie Apocalypse Now?'

parse = nlp(question) # parse the input 

for word in parse : # iterate over the token objects 
    print(word.text, word.lemma_, word.pos_)
print(parse[3:5].text) # you can also select multiple tokens as a span. 
for ent in parse.ents : # the analysis also detects names of entities. Very unreliable for movie titles...
    print(ent.text, ent.label_)
for word in parse :
    if word.text.istitle() : # check if word starts with uppercase letter 
        print(word)
print(parse[8:10].text.istitle()) # check if all words in a span start with uppercase  
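
The uppercase heuristic can be taken one small step further by grouping neighbouring title-cased words into one candidate name. The sketch below works on a plain list of strings (for instance `[token.text for token in parse]`), so it runs without a Spacy model. The helper name `titlecase_spans` is made up for this example, and the first word of the question is skipped because it is always capitalised.

```python
def titlecase_spans(words):
    """Group consecutive title-cased words into candidate names."""
    spans, current = [], []
    for word in words:
        if word.istitle():
            current.append(word)
        elif current:
            # a non-titled word closes the current group
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

words = ["Who", "are", "the", "main", "characters", "of", "the",
         "movie", "Apocalypse", "Now", "?"]
print(titlecase_spans(words[1:]))  # ['Apocalypse Now']
```

Titles with lowercase words in the middle, such as Lord of The Rings, come out as separate groups, so this remains a heuristic.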

## Visualisation

For a quick understanding of what the parser does, and how it assigns part-of-speech, entities, etc., you can also visualise parse results. Below, the entity visualiser and parsing visualiser are demonstrated. You can ignore the arrows (dependency links) for now; we return to them next week. 

This code is for illustration only, it is not part of the assignment. 

In [None]:
from spacy import displacy

question = "Who is the main character of the movie Harry Potter"

parse = nlp(question)

displacy.render(parse, jupyter=True, style="ent")

displacy.render(parse, jupyter=True, style="dep")

## Accessing the Wikidata entity finder 

From the parse of the user question, you can extract the words for the property and the entity that are needed to formulate a SPARQL query. The ids of the property and entity can be found by accessing the wikidata entity finder.

See the example below for finding the id of the movie The Godfather. In most cases, the first result is correct, but it may be necessary to try various ids...

Properties can be found by including 'type' : 'property' in the parameters. 

In [None]:
import requests

url = 'https://www.wikidata.org/w/api.php'
params = {'action':'wbsearchentities', 
          'language':'en',
          'format':'json'}

params['search'] = 'The Godfather'
data = requests.get(url, params=params).json()
for result in data['search']:
    # not every result has a description, so fall back to an empty string
    print("{}\t{}\t{}".format(result['id'], result['label'], result.get('description', '')))

## Building a SPARQL query 

With the id of the property and the entity, a SPARQL query can be formulated, and the sparql endpoint of wikidata can be queried for an answer. Note that this is the same as in the previous assignment.

Also note that for python, a SPARQL query is just a string, so you can construct the query by concatenating the start of the query, the ids, and the end of the query. 

In [None]:
ID1 = 'Q47703' # entity ids start with an uppercase Q
ID2 = 'P577' # property ids start with an uppercase P
query = 'SELECT ?answerLabel WHERE { wd:' + ID1 + ' wdt:' + ID2 + '....'
print(query)

# Assignment 

Using the steps outlined above, write a function that takes input from the user, analyses it with Spacy, extracts the relevant key-words, finds the wikidata URIs for these, and sends the SPARQL query to the sparql endpoint, and prints the answer. 

Include 10 examples of questions that worked for your system in the comments or in a separate markdown cell. 

### Get ID from wikidata

In [1]:
import requests

def get_wikidata_ids(name, search_property = False):
    """
    Returns a list of ID dictionaries (with labels and possibly descriptions)
    for a given name, either looking for entities or properties (set search_property:=True for the latter)
    Each dict contains keys: 'id', 'label', and possibly 'description'.
    If a description cannot be found, it will not be included in the dict.
    """
    all_results = []
    
    url = 'https://www.wikidata.org/w/api.php'
    params = {'action':'wbsearchentities', 
              'language':'en',
              'format':'json'}
    
    # add a param to the request if it needs to look for a property
    if search_property:
        params['type'] = 'property'
    
    params['search'] = name
    data = requests.get(url, params=params).json()
    
    # extract only the useful data from the response;
    # an error response has no 'search' key, so default to an empty list
    for result in data.get('search', []):
        # start with the ID and label
        entry = {'id': result['id'], 'label': result['label']}
        # add a description if it exists
        if 'description' in result:
            entry['description'] = result['description']
        all_results.append(entry)
        
    return all_results

In [2]:
# example output
get_wikidata_ids('Interstellar', False)

[{'id': 'Q13417189',
  'label': 'Interstellar',
  'description': '2014 British-American science fiction film directed by Christopher Nolan'},
 {'id': 'Q41872',
  'label': 'interstellar medium',
  'description': 'matter and radiation in the space between the star systems in a galaxy'},
 {'id': 'Q3153615',
  'label': 'Interstellar',
  'description': 'Wikimedia disambiguation page'},
 {'id': 'Q21186666',
  'label': 'Interstellar',
  'description': 'Gué Pequeno song'},
 {'id': 'Q6057099', 'label': 'Interstellar'},
 {'id': 'Q59659728', 'label': 'Interstellar', 'description': '2005 film'},
 {'id': 'Q1054444',
  'label': 'interstellar cloud',
  'description': 'accumulation of gas, plasma and dust in a galaxy'}]

### Generate SPARQL Query and request results from wikidata endpoint

In [3]:
def generate_sparql_query(entity_id, property_id):
    """ 
    Returns string with entity id and property id in place as a SPARQL query
    """
    query = f'''SELECT ?answerLabel WHERE {{
                wd:{entity_id} wdt:{property_id} ?answer.
                SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
                }}'''
    return query
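
Because the ids are pasted straight into the query text, it can help to check them first. The sketch below is one possible guard, not part of the original solution: the helper name `normalise_id` is made up here, and it relies only on the fact that Wikidata item ids look like Q123 and property ids like P123.

```python
import re

# an uppercase Q or P followed by a number without leading zeros
ID_RE = re.compile(r"^[QP][1-9][0-9]*$")

def normalise_id(raw_id, kind):
    """Upper-case a Wikidata id and check that it looks like Q123 or P123.

    kind is 'Q' for entities and 'P' for properties.
    """
    candidate = raw_id.strip().upper()
    if not ID_RE.match(candidate) or not candidate.startswith(kind):
        raise ValueError(f"not a valid {kind}-id: {raw_id!r}")
    return candidate

print(normalise_id("q47703", "Q"))  # Q47703
print(normalise_id(" P577 ", "P"))  # P577
```

Anything that does not look like an id raises a ValueError instead of ending up inside the query string.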

In [4]:
import requests # in case a previous cell hadn't imported it already

def getSPARQLresults(query):
    """
    Relates to previous assignment. Returns results (string) for a SPARQL query.
    The format is arbitrary and can be changed as desired.
    """
    url = 'https://query.wikidata.org/sparql'
    results = ""
    data = requests.get(url, params={'query': query, 'format': 'json'}).json()
    for item in data['results']['bindings']:
        for var in item :
            results+=('{}\t{}\n'.format(var,item[var]['value']))
            
    return results
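
For reference, the loop above walks the standard SPARQL JSON results layout, in which each row of `results.bindings` maps a variable name to an object with a `value` field. The sketch below uses a hand-written sample in that layout (not a live response; the values are borrowed from the Shrek example further down) and a made-up helper `binding_values` that returns a list instead of a tab-separated string.

```python
# Hand-written sample in the SPARQL JSON results layout (not a live response)
sample = {
    "head": {"vars": ["answerLabel"]},
    "results": {
        "bindings": [
            {"answerLabel": {"type": "literal", "xml:lang": "en", "value": "Fergus"}},
            {"answerLabel": {"type": "literal", "xml:lang": "en", "value": "Farkle"}},
            {"answerLabel": {"type": "literal", "xml:lang": "en", "value": "Felicia"}},
        ]
    },
}

def binding_values(data, var):
    """Collect the plain values bound to one variable, skipping rows where it is unbound."""
    return [row[var]["value"] for row in data["results"]["bindings"] if var in row]

print(binding_values(sample, "answerLabel"))  # ['Fergus', 'Farkle', 'Felicia']
```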

### Extract Entities and Properties from Question String

A number of methods have been implemented here to find entities and properties. Some rely on assumptions or heuristics; how each one works is described in its docstring.

In [5]:
### imports ###
import re
import spacy
from spacy.tokenizer import Tokenizer
nlp = spacy.load("en_core_web_sm") # this loads the model for analysing English text

In [6]:
### Entity extraction functions ###

def get_named_entities(doc):
    """ 
    spacy has entity recognition in-built, which might work well
    for names, but not for multi-word named entities (like movie titles)
    """
    return doc.ents

def custom_tokenizer(nlp):
    """
    spacy gives the programmer the ability to customize the tokenizer using regex.
    This one specifically looks for sets of contiguous words that all start with an
    uppercase letter (i.e. that are titled). This can alternatively be done by using
    Python's str.istitle() on all combinations of words, but that is less efficient.
    e.g. "How I Met Your Mother" is meant to become a single token using this.
    """
    token_re = re.compile(r"([A-Z][a-z']*(?:[\s][A-Z][a-z]+)*)")
    return Tokenizer(nlp.vocab, token_match = token_re.findall)

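def get_entity_complex(q_str):
    """
    Runs the custom tokenizer on a query string and returns the last token.
    The tokenizer is called directly instead of being assigned to nlp.tokenizer,
    so the shared pipeline is not modified as a side effect.
    """
    doc = custom_tokenizer(nlp)(q_str)
    # return the last token since the needed 
    # entity is likely at the very end of the string
    return doc[-1].text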

def regex_entity_finder(q_str):
    """
    Naive regex search to find words that have capital letters.
    The string that is returned is from the first letter of the
    first found word, to the last letter of the last found word.
    e.g. "Lord of the Rings" will be found using this.
    """
    # look for consecutive titled words
    token_re = re.compile(r"([A-Z][a-z']*(?:[\s][A-Z][a-z]+)*)")
    uppers = list(token_re.finditer(q_str))
    # if there are titled words, return the relevant string
    if uppers:
        idx1 = uppers[0].span()[0]
        idx2 = uppers[-1].span()[1]
        return q_str[idx1:idx2]
    # else return an empty string
    return ""

def get_entity(doc):
    """
    Calls the entity finders above and keeps the longest candidate.
    Heuristic: a longer entity name may be more relevant
    e.g.: "Legend" is less relevant than "I Am Legend"
    """
    entity_, entity_temp = "", ""
    
    # if there are named entities
    # assign entity to it
    if get_named_entities(doc):
        entity_ = get_named_entities(doc)[0].text
    # check using a custom tokenizer
    entity_temp = get_entity_complex(str(doc))
    if len(entity_temp)>len(entity_):
        entity_ = entity_temp
    # check using a naive regex search
    entity_temp = regex_entity_finder(str(doc))
    if len(entity_temp)>len(entity_):
        entity_ = entity_temp
    
    return entity_
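
To see what the title-case pattern used above actually matches, the sketch below repeats it in a self-contained form and prints both the individual matches and the first-to-last slice. The names `TITLE_RE` and `first_to_last_titled` are made up for this illustration. The last example shows why the question word has to be lowercased before the search.

```python
import re

# same pattern as in regex_entity_finder
TITLE_RE = re.compile(r"([A-Z][a-z']*(?:[\s][A-Z][a-z]+)*)")

def first_to_last_titled(q_str):
    """Slice from the start of the first match to the end of the last match."""
    matches = list(TITLE_RE.finditer(q_str))
    if not matches:
        return ""
    return q_str[matches[0].start():matches[-1].end()]

for q in ["who is the composer of Lord of The Rings",
          "what is the duration of I Am Legend",
          "Who is the director of Shrek"]:
    print([m.group() for m in TITLE_RE.finditer(q)], "->", first_to_last_titled(q))
# ['Lord', 'The Rings'] -> Lord of The Rings
# ['I Am Legend'] -> I Am Legend
# ['Who', 'Shrek'] -> Who is the director of Shrek
```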
    

In [7]:
### Property Extraction functions ###

# Note: a more naive method is defined a couple of cells
# below this one, and not included here as it does not
# directly relate to spacy functions

def get_noun_property(doc):
    """
    For questions of the form given, the needed property
    will usually be a noun that comes immediately before the
    entity: e.g. _director_ of Shrek. For the current set of
    questions, this will be the first noun in the string query.
    """
    for word in doc : # iterate over the token objects
        if word.pos_ == 'NOUN':
            return word.lemma_
    # no noun found: return an empty string rather than None
    return ""

def get_property(doc):
    property_ = get_noun_property(doc)
    return property_


In [8]:
### Parser for natural language query ###

def preprocess(query):
    query = query.replace('?','')
    # lowercase the first letter so the question word is not mistaken for a title
    query = query[:1].lower() + query[1:]
    return query

def parser(query):
    query = preprocess(query)
    
    nlp = spacy.load("en_core_web_sm") # this loads the model for analysing English text
    doc = nlp(query)
    
    entity_ = get_entity(doc)
    property_ = get_property(doc)

    return entity_, property_

In [9]:
# example parsing: entity, property
parser("Who is the composer of Lord of The Rings?")

('Lord of The Rings', 'composer')

### Functions that use major assumptions

These functions are specifically designed for questions of the form *Who/What was/is/were (the) X of Y?*, and do reasonably well within the movie domain. Specifically, they (i) reduce the number of queries by removing non-movie-related entities, and (ii) use a very naive method to get a probable property from the question string. They are not required, but may provide better-quality results.

In [38]:
### Functions that are mostly heuristic/custom ###

def entity_related_to_movies(entity_list):
    """
    Given a list of dictionaries with information about the entity,
    check if the description contains a word that is related to a movie.
    These have been chosen based on wordnet's synsets. This can easily be
    extended or made more complex using nltk, but is kept straightforward for now
    as it works well enough. This helps remove non-relevant entities that have
    the same name but are not related to movies (e.g. Lord of the Rings book series)
    """
    valid = []
    movie_relation = ['movie', 'film', 'picture', 'moving picture', 'motion', 'pic', 'flick', 'TV',
                      'television', 'show', 'animation']
    character_relation = ['fiction', 'fictitious', 'character']
    actor_relation = ['actor', 'actress', 'thespian']
    for word in movie_relation+character_relation+actor_relation:
        for e in entity_list:
            if 'description' in e.keys():
                if word in e['description']:
                    if e not in valid:
                        valid.append(e)
                    
    return valid
            
def get_property_str_naive(q_str):
    """
    For the questions of the form:
        > Who/What was/is/were the/a/an X of Y?
    It is reasonable to assume X (the property) falls
    squarely between the first article and the first "of".
    """
    of_pos = q_str.find(' of ')
    # position and length of the first occurrence of each article that is present
    found = [(q_str.find(art), len(art)) for art in (' the ', ' a ', ' an ')
             if art in q_str]
    # if an article and "of" have been found
    # return X, else return an empty string
    if found and of_pos != -1:
        art_pos, art_len = min(found)
        # X starts right after the article and ends just before " of "
        return q_str[art_pos + art_len:of_pos].strip()
    return ""

def reduce_based_on_ids(property_list):
    """
    If there are multiple ways of getting a list of properties,
    then they may be repeated. This simply removes duplicates,
    while not changing the relative order within the input list.
    """
    p_set = {}
    for p in property_list:
        p_set[p['id']] = p

    return list(p_set.values())
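
One thing to keep in mind with `entity_related_to_movies` is that `word in e['description']` is a substring test, so a short keyword such as 'pic' also matches inside 'epic' or 'topic'. A possible refinement, shown here as a sketch with a made-up helper name `mentions_keyword`, is to match whole words only.

```python
import re

def mentions_keyword(description, keywords):
    """True if any keyword occurs as a whole word (case-insensitive) in the description."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, description, flags=re.IGNORECASE) is not None

print('pic' in 'epic poem')                                # True  (substring test)
print(mentions_keyword('epic poem', ['pic', 'film']))      # False (whole words only)
print(mentions_keyword('2005 film', ['movie', 'film']))    # True
```

The trade-off is that whole-word matching no longer accepts inflected forms such as "films" unless they are listed as keywords too.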

### Pipeline: Natural Language string → SPARQL query → Results

This function puts all the above functions together. It takes in a natural language query and returns the required result, if it can. The process is as follows:

    Input: English string of the form *Who/What was/is/were (the) X of Y?*
    Extract entities and properties
    Get a list of wikidata IDs that may be relevant
        Reduce the entity ID set to only those related to movies
        Get property IDs from the Spacy and naive methods
            Remove duplicate properties
    For each entity
        For each property
            Generate a sparql query with (entity, property)
            Call wikidata endpoint for results
            If there is a result
                It is likely this is a relevant result
                so return it
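
The search part of this outline (the two nested loops that stop at the first hit) can be tried out without network access. The sketch below is only an illustration: `first_answer`, `fake_store` and `fake_query` are made-up names, and the fake lookup stands in for the real calls to the Wikidata endpoint.

```python
from itertools import product

def first_answer(entity_ids, property_ids, run_query):
    """Return (entity_id, property_id, answer) for the first pair with a non-empty answer."""
    # product() varies the property fastest, like the nested for-loops
    for entity_id, property_id in product(entity_ids, property_ids):
        answer = run_query(entity_id, property_id)
        if answer:
            return entity_id, property_id, answer
    return None

# Hypothetical stand-in for the endpoint: only one pair has an answer
fake_store = {("entity-2", "property-1"): "some answer"}

def fake_query(entity_id, property_id):
    return fake_store.get((entity_id, property_id), "")

print(first_answer(["entity-1", "entity-2"], ["property-1", "property-2"], fake_query))
# ('entity-2', 'property-1', 'some answer')
```

Since the candidates are tried in the order in which they are listed, the order of the two id lists decides which answer is reported when several pairs would give a result.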

In [11]:
import time

def pipeline(query, moderated = False):
    """
    Combines the above functions to create a pipeline to answer questions.
    
    Input: English string of the form "Who/What was/is/were (the) X of Y?"
    Output: Result of that query if found
    
    Often if there are too many queries sent to the endpoint at once,
    it will return none, so an optional boolean moderation is added
    to add an artificial 0.5 seconds between each request. This can be slower
    but has a lower chance of producing a request-related error. If an error
    occurs, try running with moderated = True
    """
    result = ''
    
    # get entities
    entity_, property_ = parser(query)
    entity_ = entity_.replace("the ", "")
    entity_ids = entity_related_to_movies(get_wikidata_ids(entity_))
    
    # get property IDs
    property_ids = get_wikidata_ids(property_, True)
    # get naive property options too
    naive_property_ = get_property_str_naive(query)
    property_ids += get_wikidata_ids(naive_property_, True)
    # remove duplicates
    property_ids = reduce_based_on_ids(property_ids)
   
    # wrap in a try/except to help with request errors
    try:
        # for each combination of entities and properties
        # it is likely that the entities and properties
        # are sorted by relevance/similarity by wikidata
        # so return the first result that it finds. This
        # is not guaranteed however
        for entity_id in entity_ids:
            for property_id in property_ids:
#                 print(entity_id['label'], property_id['label'])
                # general SPARQL query
                sparql_query = generate_sparql_query(entity_id['id'], property_id['id'])
                result = getSPARQLresults(sparql_query)

                # check if there is a result
                if result:
                    print("Closest answer:")
                    print(f"        entity: {entity_id['label']}")
                    print(f"      property: {property_id['label']}\n")
                    return result

                if moderated:
                    time.sleep(0.5)
                    
    except Exception: # avoid a bare except, which would also swallow KeyboardInterrupt
        print("Error while searching!")
        if not moderated:
            print("Attempting moderated search!")
            return pipeline(query, moderated = True)
        
        else:
            pass # goes directly to final return statement
            
    return "Answer not found"

### Example Queries

In [35]:
q = "Who is a child of Shrek?"
print(pipeline(q))

Closest answer:
        entity: Shrek
      property: child

answerLabel	Fergus
answerLabel	Farkle
answerLabel	Felicia



In [34]:
q = "What are the main subjects of Saving Private Ryan?"
print(pipeline(q))

Closest answer:
        entity: Saving Private Ryan
      property: main subject

answerLabel	World War II
answerLabel	Invasion of Normandy
answerLabel	Sole Survivor Policy
answerLabel	altruistic suicide
answerLabel	Operation Overlord
answerLabel	rescue operation
answerLabel	comradeship



In [41]:
q = "What is a duration of I Am Legend?"
print(pipeline(q))

Closest answer:
        entity: I Am Legend
      property: duration

answerLabel	100



In [None]:
q = "What is the height of Amitabh Bachchan?"
print(pipeline(q))

In [43]:
q = "Who is the director of The Room?"
print(pipeline(q))

Closest answer:
        entity: The Room
      property: director

answerLabel	Tommy Wiseau



In [None]:
q = "What is the birth date of Tom Hanks?"
print(pipeline(q))

In [None]:
q = "Who is the lead actor of Johnny English?"
print(pipeline(q))

In [None]:
q = "Who is a founder of Disney Animation?"
print(pipeline(q))

In [None]:
q = "What is the father of Bruce Wayne?"
print(pipeline(q))

In [None]:
q = "What is the native language of Gollum?"
print(pipeline(q))

In [None]:
q = "What is the given names of Leonardo DiCaprio?"
print(pipeline(q))

In [None]:
# Extra question (question 11)
# The extra heuristics are helpful for any question of
# the given form, no matter the question itself.
# This is a downside too, depending on the needed
# generality of the QA system, and it might be a good
# idea to generalise them in the future
q = """Who is the director of the 1997 cult space horror 
               film which also happens to be one of the films that 
               is used in this long and convoluted question, Event Horizon (1997)?"""
print(pipeline(q))

In [None]:
# interaction is of course possible too
q = input('Please ask a question\n')
print(pipeline(q))