# Question Analysis

The goal of this assignment is to write a first version of an an interactive QA system. The system should be able to take a question in natural language (English) as input, analyse the question, and generate a SPARQL query for it.

For now, we will restrict our attention to questions of the form

Who/What was/is/were (the) X of Y? 

i.e.

* What are the genres of Inception?
* What are the main subjects of Saving Private Ryan?
* What are the names of the Coen Brothers?
* What is the box office of Interstellar?
* What is the country of origin of Black Mirror?
* What is the duration of I Am Legend?
* What is the main subject of "The Godfather"?
* Who are the founders of Pixar Animation Studios?
* Who is the composer of Lord of The Rings?
* etc

## Interactivity

In a notebook, you can ask for user input by using the input function, as shown below. The text entered by the user is stored in the variable question. 


In [1]:
question = input('Please ask a question\n')

Please ask a question
gjkh


## Linguistic Analysis with Spacy

To generate a SPARQL query, the system needs to find a a property and an entity. The first step is to match the correct words in the question. So, for the first example, the property is indicated by the word _genres_ and the entity by the word _Inception_ These words can be sent as key-words to a wikidata api, to find the corresponding wikidata IDs (URIs). (More on this below)

Pattern matching can be done using regular expressions (using the re python library). Alternatively, we can use a Spacy, a toolkit for doing linguistic analysis.

[Spacy](https://spacy.io/usage) is a toolkit that comes with pretrained models for doing linguistic analysis in a number of languages. It can read in a text or sentence, tokenize the text (separate punctuation from words), assign Part-of-Speech (NOUN, VERB, PROPN, etc.) to tokens, lemmatize words (_actors_ --> _actor_), and detect named entities (like movie titles). See here for a short [tutorial](https://spacy.io/usage/spacy-101). 

When you install Spacy, make sure to also download the statistical model for analysing English sentences, en_core_web_sm

In [2]:
import spacy
import pandas as pd
nlp = spacy.load("en_core_web_sm") # this loads the model for analysing English text
                   

## Spacy tokenization and annotation

The Spacy nlp function analyses an input text (i.e. the question of the user), and assigns an annotation to each token in the input. It returns a list of token objects, where each token object is a dictionary that has values for various attributes of each token in the sentence. 

The example below illustrates how to iterate over the token objects, and find interesting attributes. Spans can be useful if you want to grab multiple tokens. The analyzer also finds names of entities (persons, organisations, locations), but, unfortunately, for movie titles this often does not work. A more robust approach might be to find tokens that start with an uppercase. 

In [3]:
question = 'Who are the main characters of the movie Apocalypse Now?'

parse = nlp(question) # parse the input 

for word in parse : # iterate over the token objects 
    print(word.text, word.lemma_, word.pos_)
print('---------------')
print(parse[3:5].text) # you can also select multiple tokens as a span. 
print('---------------')

for ent in parse.ents : # the analysis also detects names of entities. Very unreliable for movie titles...
    print(ent.text, ent.label_)
    
print('---------------')
for word in parse :
        if word.text.istitle() : # check if word starts with uppercase letter 
            print(word)
print('---------------')
print(parse[8:10].text.istitle()) # check if all words in a span start with uppercase  

Who who PRON
are be AUX
the the DET
main main ADJ
characters character NOUN
of of ADP
the the DET
movie movie NOUN
Apocalypse apocalypse NOUN
Now now ADV
? ? PUNCT
---------------
main characters
---------------
Apocalypse Now LAW
---------------
Who
Apocalypse
Now
---------------
True


## Visualisation

For a quick understanding of what the parser does, and how it assigns part-of-speech, entities, etc. you can also visualise parse results. Below, the entity visualiser and parsing visualiser is demonstrated. You can ignore the arrows (dependency links) for now, we return to them next week. 

This code is for illustration only, it is not part of the assignment. 

In [4]:
from spacy import displacy

question = "Who is the main character of the movie Harry Potter"

parse = nlp(question)

displacy.render(parse, jupyter=True, style="ent")

displacy.render(parse, jupyter=True, style="dep")

## Accessing the Wikidata entity finder 

From the parse of the user question, you can extract the words for the property and the entity that are needed to formulate a SPARQL query. The ids of the property and entity can be found by accessing the wikidata entity finder.

See the example below for finding the id of the movie The Godfather. In most cases, the first result is correct, but it may be necessary to try various ids...

Properties can be found by including 'type' : 'property' in the parameters. 

In [5]:
import requests

url = 'https://www.wikidata.org/w/api.php'
params = {'action':'wbsearchentities', 
          'language':'en',
          'format':'json'}

params['search'] = 'The Godfather'
json = requests.get(url,params).json()
for result in json['search']:
    print("{}\t{}\t{}".format(result['id'], result['label'], result['description']))

Q47703	The Godfather	1972 American film directed by Francis Ford Coppola
Q243556	The Godfather	1969 novel by Mario Puzo
Q1139696	The Godfather	2006 open world action-adventure video game
Q443678	Nikola Peković	Montenegrin basketball player
Q1066512	Charles Wright	American professional wrestler
Q1158135	The Godfather	soundtrack of the 1972 crime film of the same name
Q4051101	The Godfather	1991 video game based on the Godfather movie trilogy


## Building a SPARQL query 

With the id of the property and the entity, an SPARQL query can be formulated, and the sparql endpoint of wikidata can be queried for an answer. Note that this is the same as in the previous assignment.

Also note that for python, a SPARQL query is just a string, so you can construct the query by concatenating the start of the query, the ids, and the end of the query. 

In [6]:
ID1 = 'q47703'
ID2 = 'p577'
query = 'SELECT ?answerLabel WHERE { wd:' + ID1 + ' wdt:' + ID2 + '....'
print(query)

SELECT ?answerLabel WHERE { wd:q47703 wdt:p577....


## Assignment 

Using the steps outlined above, write a function that takes input from the user, analyses it with Spacy, extracts the relevant key-words, finds the wikidata URIs for these, and sends the SPARQL query to the sparql endpoint, and prints the answer. 

Include 10 examples of questions that worked for your system in the comments or in a separate markdown cell. 

In [7]:
import requests

url = 'https://www.wikidata.org/w/api.php'
url1 = 'https://query.wikidata.org/sparql'

params1 = {'action':'wbsearchentities', 
          'language':'en',
          'format':'json'} #for entities
params2 = {'action':'wbsearchentities', 
          'language':'en',
          'format':'json',
          'type':'property'} #for properties

words = spacy.load("en_core_web_sm")

def get_entity(question):
    parse = nlp(question)
    label = ["WORK_OF_ART", "PERSON", "ORG", "NOUN"]
    for entity in parse.ents: #getting entities which have the correct labels eg work of art i.e. a movie
        if entity.label_ in label:
            return entity.text

def process_property(text):
        pos = ["NOUN", "ADP"] #all properties have a noun eg duration or main subject
        tokens = nlp(text)
        return ' '.join(token.lemma_ for token in tokens if token.pos_ in pos).rstrip('of ') 
        #in cases like country of origin of remove the of at end
        #get the lemma i.e. the simplest version of the token 
        
def entities(entity):
    ids = []
    params1['search'] = entity
    json = requests.get(url,params1).json()
    for result in json['search']:
        ids.append(format(result['id'])) #getting all the relevant entity IDs
    return ids

def properties(property1):
    ids = []
    params2['search'] = property1
    json = requests.get(url,params2).json()
    for result in json['search']:
        ids.append(format(result['id'])) #getting all relevant property IDs
    return ids

def query(ID1, ID2):
    query = 'SELECT ?answerLabel WHERE { wd:' + str(ID1) + " wdt:" + str(ID2) + ' ?answer .SERVICE wikibase:label {bd:serviceParam wikibase:language "en" .}}'
    results = requests.get(url1, params={'query': query, 'format': 'json'}).json()
    return results

def question_answer(question):
    propertyy = process_property(question)
    entity = get_entity(question)
    ID1 = entities(entity)
    ID2 = properties(propertyy)
    count = 0 
    
    for i in ID1:
        for j in ID2:
            results = query(i,j)
            if(results is not None):
                for item in results['results']['bindings']:
                    for var in item :
                        count += 1 #result was obtained so next loop don't print any more results
                        print('{}\t{}'.format(var,item[var]['value']))   
            if(count >= 1):
                return

### Questions
1) What is the duration of "Attack of the Clones"?

2) What is the box office of Interstellar?

3) What is the country of origin of Black Mirror?

4) Who are the founders of Pixar Animation Studios?

5) What is the main subject of "The Godfather?"

6) Who is the director of Pulp Fiction?

7) Who are the cast members of Titanic?

8) What is/are the publication dates of Avatar

9) What is the country of citizenship of Brad Pitt?

10) What is the date of birth of Christopher Nolan?

In [8]:
q1 = 'What is the duration of Attack of the Clones?'
question_answer(q1)

KeyError: 'search'

In [None]:
q2 = 'What is the box office of Interstellar?'
question_answer(q2)

In [None]:
q3 = 'What is the country of origin of Black Mirror?'
question_answer(q3)

In [None]:
q4 = 'Who are the founders of Pixar Animation Studios?'
question_answer(q4)

In [None]:
q5 = 'What is the main subject of "The Godfather?"'
question_answer(q5)

In [None]:
q6 = 'Who is the director of Pulp Fiction?'
question_answer(q6)

In [None]:
q7 = 'Who are the cast members of Titanic?'
question_answer(q7)

In [None]:
q8 = 'What is/are the publication dates of Avatar?'
question_answer(q8)

In [None]:
q9 = "What is the country of citizenship of Brad Pitt?"
question_answer(q9)

In [None]:
q10 = "What is the date of birth of Christopher Nolan?"
question_answer(q10)