# Advanced Question Analysis

The goal of this assignment is to write a more flexible version of the interactive QA system. As in the previous assignment, the system should be able to take a question in natural language (English) as input, analyse the question, and generate a SPARQL query for it.

## Assignment: Additional requirements

* Make sure that your system can analyse at least two more question types, e.g. questions that start with *which* or *when*, or questions where the property is expressed by a verb.
* Apart from the techniques introduced last week (matching tokens on the basis of their lemma or part of speech), include at least one pattern that uses dependency relations to find the relevant property or entity in the question.
* Include 10 examples of questions that your system can handle and that illustrate the additional question types you cover.

## Examples

Here is a non-representative list of questions and question types to consider. See the list with all questions for more examples.

* For what movie did Leonardo DiCaprio win an Oscar?
* How long is Pulp Fiction?
* How many episodes does Twin Peaks have?
* In what capital was the film The Fault in Our Stars filmed?
* In what year was The Matrix released?
* When did Alan Rickman die?
* Where was Morgan Freeman born?
* Which actor played Aragorn in Lord of the Rings?
* Which actors played the role of James Bond?
* Who directed The Shawshank Redemption?
* Which movies are directed by Alice Wu?
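
The questions above fall into a handful of surface types. The sketch below shows one possible way to sort them by their leading words only; the function name `classify_question` and the type labels are made up for illustration and are not part of the assignment or of the code further down.

```python
def classify_question(question):
    """Guess a coarse question type from the first one or two words."""
    words = question.lower().rstrip("?").split()
    if not words:
        return "unknown"
    if " ".join(words[:2]) == "how many":
        return "count"
    if words[0] == "how":
        return "how_adj"          # e.g. "How long is ..."
    if words[0] in {"is", "was", "are", "were", "do", "does", "did", "has", "have"}:
        return "binary"           # yes/no question
    if words[0] in {"when", "where"}:
        return words[0]
    if words[0] in {"who", "what", "which"}:
        return "wh"
    # preposition followed by a wh-word: "In what year ...", "For what movie ..."
    if len(words) > 1 and words[1] in {"what", "which", "whom"}:
        return "prep_wh"
    return "unknown"

for q in ["For what movie did Leonardo DiCaprio win an Oscar?",
          "How long is Pulp Fiction?",
          "How many episodes does Twin Peaks have?",
          "When did Alan Rickman die?",
          "Which actor played Aragorn in Lord of the Rings?"]:
    print(classify_question(q), "<-", q)
```

A check on leading words is brittle, so it is best used only to pick which analysis pattern to try first.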


In [1]:
import spacy

nlp = spacy.load('en_core_web_trf') # this loads the model for analysing English text
                   

# Assignment Submission
### SRECK

## Code from last assignment
- Get wikidata IDs
- Generate SPARQL Queries
- Connect to wikidata endpoint to get SPARQL results
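
For the last step, it helps to know the shape of the JSON that a SPARQL endpoint returns: `SELECT` queries put their rows under `results.bindings`, while `ASK` queries return a top-level `boolean`. The sketch below formats both cases offline. The function name `format_sparql_json` and the two sample responses are hand-written for illustration (the value `Example Person` is a placeholder, not real query output).

```python
def format_sparql_json(data, count_only=False):
    """Turn a decoded SPARQL JSON response into a printable answer."""
    # ASK queries carry a top-level "boolean" field and no bindings
    if "boolean" in data:
        return data["boolean"]
    bindings = data["results"]["bindings"]
    if count_only:
        # one binding per result row
        return len(bindings)
    lines = []
    for row in bindings:
        for var, cell in row.items():
            lines.append(f"{var}\t{cell['value']}")
    return "\n".join(lines)

# Hand-written sample responses (assumed shapes, not real results)
sample_select = {
    "head": {"vars": ["answerLabel"]},
    "results": {"bindings": [
        {"answerLabel": {"type": "literal", "value": "Example Person"}},
    ]},
}
sample_ask = {"head": {}, "boolean": True}

print(format_sparql_json(sample_select))
print(format_sparql_json(sample_select, count_only=True))
print(format_sparql_json(sample_ask))
```

Checking for the `boolean` key explicitly, rather than catching an exception, keeps unrelated errors from being mistaken for a `SELECT` response.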

In [2]:
"""
Query helpers
"""

import requests

def reduce_based_on_ids(id_list):
    """
    If there are multiple ways of getting a list of properties,
    then they may be repeated. This simply removes duplicates,
    while not changing the relative order within the input list.
    """
    id_set = {}
    for obj in id_list:
        id_set[obj['id']] = obj

    return list(id_set.values())

def get_wikidata_ids_of_word(name, search_property = False):
    """
    Returns a list of ID dictionaries (with labels and possibly descriptions)
    for a given name, either looking for entities or properties (set search_property:=True for the latter)
    Each dict contains keys: 'id', 'label', and possibly 'description'.
    If a description cannot be found, it will not be included in the dict.
    """
    all_results = []
    
    url = 'https://www.wikidata.org/w/api.php'
    params = {'action':'wbsearchentities', 
              'language':'en',
              'format':'json'}
    
    # add a param to the request if it needs to look for a property
    if search_property:
        params['type'] = 'property'
    
    params['search'] = name
    json = requests.get(url,params).json()
    
    # extract only the useful data from the json file
    try:
        for result in json['search']:
            # append an empty dictionary
            all_results.append({})
            # add the ID and label
            all_results[-1]['id'] = result['id']
            all_results[-1]['label'] = result['label']
            # add a description if it exists
            if 'description' in result.keys():
                all_results[-1]['description'] = result['description']
    except Exception:
        # no results
        pass
    
    return all_results

def get_wikidata_ids(list_of_words, search_property = False):
    """
    Returns a de-duplicated list of candidate ID dicts for the list of words
    """
    list_of_ids = []
    for word in list_of_words:
        list_of_ids += get_wikidata_ids_of_word(word, search_property)
    # remove duplicates
    set_of_ids = reduce_based_on_ids(list_of_ids)
    return set_of_ids

def simple_sparql_query(entity_id, property_id, entity_id_2 = None, reverse = False, binary = False):
    """ 
    Returns string with entity id and property id in place as a SPARQL query
    """
    if reverse:
        p2 = "wd:" + entity_id
        p1 = "?answer"
    else:
        p1 = "wd:" + entity_id
        p2 = "?answer"
        
    if binary:
        query = f'''ASK {{
            wd:{entity_id} wdt:{property_id} wd:{entity_id_2} .
        }}'''
    else:
        query = f'''SELECT ?answerLabel WHERE {{
            {p1} wdt:{property_id} {p2}.
            SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
        }}'''
    return query

def property_qualifier_query(entity_id, property_id, qualifierEntity_id, qualifierProperty_id, reverse) :
    # Find answer to the original property with qualifier property as filter
    if reverse:
        p2 = "wd:" + qualifierEntity_id
        p1 = "?item"
    else:
        p1 = "wd:" + qualifierEntity_id
        p2 = "?item"
    query = f'''SELECT ?itemLabel WHERE {{ 
        wd:{entity_id} p:{property_id} ?stat . 
        ?stat ps:{property_id} {p1} . 
        ?stat pq:{qualifierProperty_id} {p2} .
        SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
      }}'''
    return query

def simple_qualifier_query(entity_id, property_id) :
    query = f'''SELECT ?itemLabel WHERE {{ 
        wd:{entity_id} p:{property_id} ?stat . 
        ?stat ps:{property_id} ?item .
        SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
      }}'''
    return query

#Possibly ask query with qualifier?

def get_SPARQL_results(query, shouldBeCounted = False):
    """
    Relates to previous assignment. Return the results of a SPARQL query:
    a bool for ASK queries, otherwise a string (the number of bindings
    if shouldBeCounted is set).
    The format is arbitrary and can be changed as desired.
    """
    url = 'https://query.wikidata.org/sparql'
    if shouldBeCounted:
        result = 0
    else:
        result = ""
    # Max 1000 attempts
    for _ in range(1000):
        data = requests.get(url, params={'query': query, 'format': 'json'})
        if data.status_code == 200:
            break
    data = data.json()
    # ASK queries return a top-level 'boolean' instead of result bindings
    if 'boolean' in data:
        return data['boolean']
    for item in data['results']['bindings']:
        for var in item:
            if shouldBeCounted:
                result += 1
            else:
                result += ('{}\t{}\n'.format(var,item[var]['value']))
    
    return str(result)

In [3]:
"""
Linguistic Helpers
"""

def get_root(doc):
    """
    Return the root of the dependency tree
    in a given nlp-parsed sentence (root)
    """
    for word in doc:
        if word.dep_ == "ROOT":
            return word
        
def is_q_word(string):
    return string in ['who', 'what', 'which', 'how', 'when', 'where', 'why', 'whom']
                      
def is_exception(string):
    return string in ['was', 'is', 'does', 'did'] or is_q_word(string)
        
def phrase(word, remove = ""):
    """
    Given code: Return the phrase that the given word heads
    """
    children = []
    for child in word.subtree:
        children.append(child.text.replace(remove,''))
    return " ".join(children)

def check_key_words(doc):
    """
    Check for special question words and other keywords
    """
    key_words = {
        'how many' : 'number',
        'quantity' : 'number',
        'amount' : 'number',
        'number of': 'number',
        'how long' : 'duration',
        'how often' : 'frequency',
        'when' : 'date',
        'where' : 'place',
        'why' : 'cause',
        'whose' : 'owner',
        'birthday' : 'date of birth',
        'directed' : 'director'
    }
    for i in range(len(doc)-1):
        one_word = doc[i].text.lower()
        two_words = doc[i].text.lower() + " " + doc[i+1].text.lower()
        if two_words in list(key_words.keys()):
            return key_words[two_words]
        elif one_word in list(key_words.keys()):
            return key_words[one_word]
    return None

def add_variations(prop):
    if prop == 'born' or prop == 'bear':
        return ['birth place', 'birth date']
    else:
        return []

def remove_duplicates(ls):
    d = {}
    for l in ls:
        d[l] = 0
    return list(d.keys())

In [4]:
### Entity extraction functions ###
import re
import spacy
from spacy.tokenizer import Tokenizer

def get_named_entities(doc):
    """ 
    spacy has entity recognition in-built, which might work well
    for names, but not for multi-word named entities (like movie titles)
    """
    return doc.ents

def custom_tokenizer(nlp):
    """
    spaCy lets the programmer customize the tokenizer using a regex.
    This one specifically looks for runs of contiguous words that all start with
    an uppercase letter (i.e. that are titled). This can alternatively be done by
    calling str.istitle() on all combinations of words, but that is less efficient.
    e.g. "How I Met Your Mother" will be a single token using this.
    """
    token_re = re.compile(r"([A-Z0-9]+[a-z']*(?:[\s][A-Z][a-z]+|[\s][0-9]+)*)")
    return Tokenizer(nlp.vocab, token_match = token_re.findall)

def get_entity_complex(q_str):
    """
    Returns all runs of titled words (and numbers) in a query string,
    found with the same regex as custom_tokenizer above
    """
    #nlp.tokenizer = custom_tokenizer(nlp)
    #doc = nlp(q_str)
    # return the last named entity since the needed 
    # entity is likely at the very end of the string
    #return doc[-1].text
    q_word = ['who','what','was','when','in','how','where','which']
    if q_str.split()[0].lower() in q_word: 
        # drop the first word, since a leading question word is never part of an entity
        q_str = q_str.split(' ', 1)[1]
    entities = re.findall(r"([A-Z0-9]+[a-z']*(?:[\s][A-Z][a-z]+|[\s][0-9]+)*)", q_str)
        
    return entities


def get_closest_proper_noun(root, remove = ''):
    """
    It is often the case that the proper noun
    that is most closely associated with the root
    is the most relevant entity in the question.
    This is a recursive function starting at the 
    root and doing a depth-first search through the tree
    """
    pn = ""
    for child in root.children:
        if child.pos_ == 'PROPN':
            pn = phrase(child, remove)
            return pn
        
        pn = get_closest_proper_noun(child, remove)
        if pn != "":
            break
    
    return pn

def find_single_uppercase_entity(parse):
    """
    If we are only searching for one entity,
    we can just find the first and last index 
    of words beginning with an uppercase letter.
    There is a check to make sure that the first 
    word isn't added if it isn't part of the entity.
    """
    entities = []
    entityRange = [] 
    
    for i in range(len(parse)):
        word = parse[i]
        try:
            secondWord = parse[i+1]
        except:
            secondWord = False
        if word.text.istitle():
            # Check if word starts with uppercase letter for entities
            if i != 0 or (not is_exception(word.lemma_.lower())
                          and (not secondWord or not is_exception(secondWord.lemma_))):
                # If it isn't one of the question words (Who/What/Which/When) or a word before them
                entityRange.append(i)
    if len(entityRange) > 1:
        minEntity = entityRange[0]
        maxEntity = entityRange[len(entityRange)-1]
        entity = parse[minEntity:maxEntity+1].text
        entities.append(entity)
    elif len(entityRange) == 1:
        entity = parse[entityRange[0]].text
        entities.append(entity)
    return entities

def get_entity_from_index_range(parse, begin, end):
    return parse[begin:end+1].text

def add_entity_based_on_index(numberRange, frontIndex, backIndex, parse):
    entities = []
    if len(numberRange) > 1:
        minNumber = numberRange[0]
        maxNumber = numberRange[len(numberRange)-1]
        entities.append(get_entity_from_index_range(parse, minNumber, maxNumber))
    for number in numberRange:
        if not type(frontIndex) == bool:
            # Add title parts from the front
            entities.append(get_entity_from_index_range(parse, frontIndex, number))
        if not type(backIndex) == bool:
            # Add title parts from the back
            if backIndex > number:
                entities.append(get_entity_from_index_range(parse, number, backIndex))
            else:
                entities.append(get_entity_from_index_range(parse, backIndex, number))
            if not type(frontIndex) == bool and not type(backIndex) == bool:
                # Add Title parts from the front and back of this number
                if backIndex > number:
                    entities.append(get_entity_from_index_range(parse, frontIndex, backIndex))
    return entities

def find_single_number_entity(parse):
    """
    Find an entity with a number in it.
    Multiple possibilities will be returned 
    based on whether there are words starting 
    with uppercase letters before or after it.
    """
    entities = []
    numberRange = []
    frontIndex = False 
    backIndex = False
    
    for i in range(len(parse)):
        word = parse[i]
        try:
            secondWord = parse[i+1]
        except:
            secondWord = False
        if word.pos_ == "NUM":
            # Check if word is a number and add to range
            numberRange.append(i)
            # Add number itself as entity
            entities.append(word.text)
            # Show that number has been used
            if type(backIndex) == bool:
                backIndex = True
        if word.text.istitle():
            # Check if word starts with uppercase letter for entities
            if i != 0 or (not is_exception(word.lemma_.lower())
                          and (not secondWord or not is_exception(secondWord.lemma_))):
                # If it isn't one of the question words (Who/What/Which/When) or a word before them
                if not type(backIndex) == bool or backIndex == True:
                    # number has already been used, so part after number
                    backIndex = i
                elif type(frontIndex) == bool:
                    # First part of entity
                    frontIndex = i
    return entities + add_entity_based_on_index(numberRange, frontIndex, backIndex, parse)

def entity_split_sentence(parse):
    """
    Will return ranges for the sentences 
    so they contain at most one entity.
    """
    ranges = []
    minIndex = 0
    maxIndex = False
    for i in range(len(parse)):
        word = parse[i]
        try:
            secondWord = parse[i+1]
        except:
            secondWord = False
        if word.text.istitle():
            # Check if word starts with uppercase letter for entities
            if i != 0 or (not is_exception(word.lemma_.lower())
                          and (not secondWord or not is_exception(secondWord.lemma_))):
                    # If it isn't one of the question words (Who/What/Which/When) or a word before them
                    if type(maxIndex) == bool:
                        maxIndex = i
                    elif i > maxIndex+2:
                        ranges.append([minIndex, i-1])
                        minIndex = i
                        maxIndex = False
                    else:
                        maxIndex = i
    ranges.append([minIndex, len(parse)-1])
    return ranges
    
def find_uppercase_and_number_entities(parse):
    entities = []
    ranges = entity_split_sentence(parse)
    for r in ranges:
        subSentence = parse[r[0]:r[1]+1].text.replace('?','')
        subSentence = nlp(subSentence)
        uppercase = find_single_uppercase_entity(subSentence)
        number = find_single_number_entity(subSentence)
        entities += uppercase + number
    return entities

### Parser for natural language query ###

def preprocess(query):
    """
    Preprocessing for looking for entities using
    non-dependency methods. This is not strictly
    necessary, but makes it slightly less brittle
    wrt the orthography of the sentence.
    """
    query = query.replace('?','')
    query = query.replace(query[0], query[0].lower(), 1)
    return query

def get_entities(doc):
    """
    Return the possible entities of a given English query.
    The possibilities are:
        Proper noun phrase closest to root
        Named entities according to spaCy
        Regex-found entities (titled words in a row)
    """
    entities = []
    
    root = get_root(doc)
    entities.append(get_closest_proper_noun(root))
    if q_type_binary(doc):
        entities += find_uppercase_and_number_entities(doc)
    else:
        entities += find_single_uppercase_entity(doc)
        entities += find_single_number_entity(doc)
    query = preprocess(doc.text)
    entities += [str(e) for e in get_named_entities(doc)]
    entities.extend(get_entity_complex(query))
    
    entities = [str(e) for e in entities if " 's" not in e]
    
    entities += [entity.replace("the ", "") for entity in entities]
    entities += [entity.replace("'s", "").strip() for entity in entities]
    entities = remove_duplicates(entities)
    
    # Longer strings should be prioritized first
    entities = sorted(entities, key=len, reverse=True)
    
    return entities

In [5]:
### Property Extraction functions ###

##Question types
def q_type_addition(doc):
    """
    Returns True iff the question word is prefixed by a preposition,
    for example, if the question starts with In what, For how long, etc.
    """
    # dep_ is the string label; dep is an integer ID and never equals 'prep'
    return doc[0].dep_ == 'prep' or doc[len(doc)-1].dep_ == 'prep'

def q_type_count(doc):
    """
    Returns True iff the question asks for the number of results
    """
    return check_key_words(doc) == 'number'

def q_type_binary(doc):
    """
    Returns True if and only if the question is a yes/no question
    """
    return doc[0].lemma_ in ['be', 'do', 'have']

def q_type_which_what_who(doc):
    """
    Returns true if the first word or second word is which, what or who
    (basic)
    """
    wh = ['which','what','who']
    return doc[0].text.lower() in wh or doc[1].text.lower() in wh

def q_type_passive(doc):
    """
    Returns true for a passive sentence
    Normal passive sentences have a past participle of the verb
    at the second, third or fourth word.
    Would further need to check the sentence type of passive sentences.
    """
    # slicing avoids an IndexError on questions shorter than four tokens
    return any(token.tag_ == 'VBN' for token in doc[1:4])

def q_type_how_adj(doc):
    """
    Returns true for how + adj sentences like
    "How long was Titanic?"
    """
    for token in doc:
        if token.text.lower() == 'how':
            return token.nbor().pos_ == 'ADJ' or token.nbor().pos_ == 'ADV' 
    return False          
    
def q_type_when(doc):
    """
    Returns true if it's a when question, e.g. "When did Alan Rickman die?"
    Usually the when keyword is at the start.
    """
    return doc[0].text.lower() == 'when' 
    
def q_type_where(doc):
    """
    Returns true if it's a where question, e.g. "Where was Alan Rickman born?"
    Usually the where keyword is at the start.
    """
    return doc[0].text.lower() == 'where'

def get_root_related_props(doc, entities):
    """
    Several methods to try and get properties with
    respect to the root of the question.
    
    ps <- list of possible properties
    For each child in root:
        (i) it cannot be a property if it is an entity
        (ii) it cannot be a property if it is a question word (w-word)
        (iii) if it is a nominal subject, add it to ps
        (iv) if it is a direct object, add it to ps
        (v) if it is an adjective, add it to ps
    If the root itself is not a simple word, add it to ps (e.g. if root := 'direct')
    
    return list of possible properties.
    
    Note: The lemmas and the phrases are added in order to make sure
          multi-word properties (e.g. 'voice actor') are also considered
    """
    ps = {}
    root = get_root(doc)
    
    for child in root.children:
        if len(entities) < 2 and phrase(child) in entities:
            continue
        if is_q_word(child.text.lower()):
            continue
        if child.dep_ == 'nsubj':
            ps[phrase(child)] = 1
            ps[child.text] = 1
            ps[child.lemma_] = 1
        if child.dep_ == 'dobj':
            ps[phrase(child)] = 1
            ps[child.text] = 1
            ps[child.lemma_] = 1
        if child.pos_ == 'ADJ':
            ps[child.text] = 1
            ps[child.lemma_] = 1
    if root.lemma_ not in ['be', 'have', 'do']:
        ps[root.text] = 1
        ps[root.lemma_] = 1
    
    # Possibly add keyword
    key_word = check_key_words(doc)
    sorted_extended_ps = []
    if key_word != None:
        extended_ps = {}
        extended_ps[key_word] = 3
        for prop in list(ps.keys()):
            extended_ps[prop + ' ' + key_word] = 2
        sorted_extended_ps = list(dict(sorted(extended_ps.items(), key=lambda item: -item[1])).keys())
    
    # Sort in descending order of score (the dict values) and convert the keys to a list
    sorted_ps = list(dict(sorted(ps.items(), key=lambda item: -item[1])).keys())
    
    return sorted_ps, sorted_extended_ps

def dumb_property_finder(parse) :
    propRange = []
    prop = ""
    props = []
    for i in range(len(parse)) : # iterate over the token objects 
        word = parse[i]
    
        if word.dep_ == "ROOT" and word.lemma_ not in ['be', 'have', 'do']:
            # Set root as property when it isn't a form of to be, to have, or to do
            prop = word.text
    
        if not word.text.istitle() and (word.pos_ == "NOUN" or word.pos_ == "VERB") and not is_exception(word.text):
            # Properties are nouns or verbs
            previousWord = parse[i-1]
            if previousWord.pos_ == "ADJ" :
                # Also add adjectives of the properties
                propRange.append(i-1)
                props.append(parse[i-1:i+1].text)
            else:
                props.append(parse[i:i+1].text)
            propRange.append(i)
    
    if prop == "" and len(propRange) > 0:
        minProp = propRange[0]
        maxProp = propRange[len(propRange)-1]
        prop = parse[minProp:maxProp+1].text
    props.append(prop)
    return props
    
def get_properties(doc, entity):
    """
    Returns list of possible properties (list of strings)
    """
    ps, extended_ps = get_root_related_props(doc, entity)
    if q_type_binary(doc) or len(ps) == 0:
        props = dumb_property_finder(doc)
        if len(ps) == 0:
            extended_ps += props
        ps += props
        for prop in ps:
            ps += add_variations(prop)
        ps = remove_duplicates(ps)
    
    # Remove all Nones
    return [x for x in ps if x is not None], [x for x in extended_ps if x is not None]

In [6]:
def binary_query(entity_id, property_id, answer_id):
    # This is the ask (YES/NO) query
    query = f'''ASK {{
       wd:{entity_id} wdt:{property_id} wd:{answer_id} .
    }}'''
    return query
    
def isNumber(string):
    return type(string) == str and string.isdigit()
    
def orderAnswers(entities):
    answers = []
    # Add numbers as possible answers (for count ask queries)
    for entity in entities:
        if isNumber(entity):
            answers.append(entity)
    answers += get_wikidata_ids(entities)
    return answers
    
def binary_queries(entity_id, property_id, answer_ids, non_ids):
    non_ids.append(entity_id['id'])
    non_ids.append(property_id['id'])
    result = ''
    answer = ''
    for answer_id in answer_ids:
        if isNumber(answer_id):
            print("entity: ", entity_id['label'], entity_id['id'])
            print("property: ", property_id['label'])
            print("number: ", answer_id)
            answer = answer_id
            sparql_query = simple_sparql_query(entity_id['id'], property_id['id'])
            result = get_SPARQL_results(sparql_query, True)
            print(result)
            # get_SPARQL_results returns the count as a string, so convert before comparing
            if int(result) == int(answer_id):
                return 'Yes', answer
        elif answer_id['id'] not in non_ids:
            answer = answer_id['label']
            sparql_query = binary_query(entity_id['id'], property_id['id'], answer_id['id'])
            result = get_SPARQL_results(sparql_query)
            if result == True:
                return 'Yes', answer
    return None, answer

In [7]:
### Functions that are mostly heuristic/custom ###
from functools import lru_cache
from nltk.corpus import wordnet as wn

def get_synonyms(word, depth=1):
    """
    Using WordNet (via NLTK), return the lemma names of all synsets
    (synonyms/related words) of a given word. Using the depth argument,
    the user can recursively follow a given word's synonyms' synonyms
    to get more words, but with probably less relevance.
    Most applications should just need 
    depth=1 (return just the first level of synonyms).
    """
    # base case
    if depth == 0:
        return []
    
    # surface level synonyms
    related_words = []
    for syn in wn.synsets(word):
        related_words += [x.name().replace('_', ' ') for x in syn.lemmas()]
    
    # deeper synonyms
    for ls in [get_synonyms(x, depth-1) for x in related_words]:
        related_words += ls
    
    # remove duplicates and return
    return remove_duplicates(related_words)

@lru_cache(maxsize=None)
def get_movie_related_words(include_wordnet=True):
    """
    Finds all (several) related words for entities in 
    the domain of movies. The top level have been hard-coded
    and several more are found using WordNet's synsets.
    This also means that not all returned words may be
    strongly related to movies, just because of how WordNet
    is designed.
    
    Note: Cached for speed using the lru_cache wrapper
    """
    # naive relations, hand-written
    # starting off point for synonym searching
    movie_relation = ['movie', 'film', 'picture', 'moving picture', 'motion', 'pic', 'flick', 'TV',
                      'television', 'show', 'animation']
    character_relation = ['fiction', 'fictitious', 'character']
    actor_relation = ['actor', 'actress', 'thespian']
    music_relation = ['musician', 'music', 'score', 'compose', 'song']
    
    all_relations = []
    all_relations += movie_relation
    all_relations += character_relation
    all_relations += actor_relation
    all_relations += music_relation
    
    if include_wordnet:
        # get WordNet synsets
        all_syns = [get_synonyms(x) for x in all_relations]

        # add to relations
        for syn in all_syns:
            all_relations+=syn

    # remove duplicates and return
    return remove_duplicates(all_relations)

In [8]:
def entity_related_to_movies(entity_list):
    """
    Given a list of dictionaries with information about the entity,
    check if the description contains a word that is related to a movie.
    These have been chosen based on WordNet's synsets. This helps remove 
    irrelevant entities that have the same name but are not related 
    to movies (e.g. the Lord of the Rings book series).
    """
    valid = []
    all_relations = get_movie_related_words()
    for word in all_relations:
        for e in entity_list:
            if 'description' in e.keys():
                if word in e['description']:
                    if e not in valid:
                        valid.append(e)
                    
    return valid

In [9]:
def permute(doc, entity_ids, property_ids, answer_ids, isCountQuestion = False):
    # for each combination of entities and properties
    # it is likely that the entities and properties
    # are sorted by relevance/similarity by wikidata
    # so return the first result that it finds. This
    # is not guaranteed however
    for entity_id in entity_ids:
        for property_id in property_ids:
            # print(entity_id['label'], property_id['label'])
            if q_type_binary(doc):
                non_ids = []
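                # get_wikidata_ids expects a list of words; a bare string would be searched character by character
                non = entity_related_to_movies(get_wikidata_ids([entity_id['label']]))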
                for n in non:
                    non_ids.append(n['id'])
                result, answer_id = binary_queries(entity_id, property_id, answer_ids, non_ids)
            else:
                sparql_query = simple_sparql_query(entity_id['id'], property_id['id'])
                result = get_SPARQL_results(sparql_query, isCountQuestion)

            # if no result, try the reverse query
            if result is None or result == '':
                sparql_query = simple_sparql_query(entity_id['id'], property_id['id'], reverse = True)
                result = get_SPARQL_results(sparql_query, isCountQuestion)

            if result is not None and result != '':
                print("Closest answer:")
                print(f"        entity: {entity_id['label']}")
                print(f"      property: {property_id['label']}")
                if q_type_binary(doc):
                    print(f"        answer: {answer_id}")
                print("")
                return result
    return None

def pipeline(question):
    """
    Combines the above functions to create a pipeline to answer questions.
    
    Input: English question string
    Output: Result (answer) of wikidata queries for that question
    """
    result = ''
    
    # Tokenize/analyze the question with the spaCy model loaded in the first cell
    # (reloading the model on every call is slow)
    doc = nlp(question)
    
    # get entities & their ids
    entities = get_entities(doc)
    entity_ids = entity_related_to_movies(get_wikidata_ids(entities))
    
    # get properties
    properties, extended_properties = get_properties(doc, entities)
    
    property_ids = get_wikidata_ids(properties, True)
    extended_property_ids = get_wikidata_ids(extended_properties, True)
    
    answer_ids = []
    if q_type_binary(doc):
        answer_ids = orderAnswers(entities)
        
    print(entities)
    print(properties)
   
    # wrap in a try/except to help with request errors
    try:
        result = permute(doc, entity_ids, extended_property_ids, answer_ids)
        print(extended_properties)
        if result is None:
            print(properties)
            result = permute(doc, entity_ids, property_ids, answer_ids, q_type_count(doc))
        if result is not None:
            return result
        if q_type_binary(doc):
            return "No"
            
    except Exception:
        print("Error while searching!")
    
    # Guess
    if q_type_binary(doc):
        return "Default binary, Yes"
    elif q_type_count(doc):
        return "Default count"
        #return "1"
    return None

def ask_question(question, base_answer=False):
    # the main function used to ask queries
    # it is mainly a wrapper for the pipeline
    #try:
    ans = pipeline(question)
    #except Exception:
    #    ans = "Error"
    if ans is None:
        print(">>>>> Warning: Answer == None!")
        ans = "Answer not found"
        
    if base_answer: # strips the ans of any formatting
        ans = ans.replace('answerLabel\t', "").strip()
        ans = ans.replace('\n', ", ").strip()
        
    return ans

## Question handling

This QA system should be able to handle several types of questions about movies, but it is specifically designed to work with the following patterns, with X being the property and Y being the entity:
- Who/What/When/etc was/is/were the/a/an X of Y? (from previous assignment, more passive, noun properties)
- Who/What/When/etc was/is/were Y X? (similar to above, more active, verb properties)
- How X is Y? (similar questions that use adjective properties)
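
For the three patterns above to give the same answer, a verb or adjective property has to end up as the same search label as its noun form. A minimal sketch of that idea, using a small hand-written lookup table: the names `ALIASES` and `property_candidates` are hypothetical, and the table only covers word pairs that appear elsewhere in this notebook.

```python
# Hypothetical alias table: surface form in the question -> noun-form search labels
ALIASES = {
    "tall": ["height"],
    "long": ["duration"],
    "directed": ["director"],
    "acted": ["actor"],
    "born": ["birth date", "birth place"],
}

def property_candidates(word):
    """Return the word itself followed by any noun-form labels it maps to."""
    key = word.lower()
    return [key] + [label for label in ALIASES.get(key, []) if label != key]

print(property_candidates("tall"))    # ['tall', 'height']
print(property_candidates("height"))  # ['height']
print(property_candidates("born"))    # ['born', 'birth date', 'birth place']
```

Both "How tall is Y?" and "What is the height of Y?" then produce `height` as a candidate, so the property search sees the same label for either phrasing.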

The following are pairs of questions that the system is able to answer. They are given in pairs to show that the same question phrased differently (as long as it follows one of the above formats) should give the same answer. A noun property (e.g. height) can be translated to an adjective property (e.g. tall). Similarly, a verb property (acted) can be translated to a noun property (actor).

In [10]:
count_qs = ['How many episodes does Twin Peaks have?',
            'How many awards did Titanic win?',
            'How many awards did George Clooney receive?',
            'How many Pokémon episodes are there?',
            'What was the box office amount for the movie Psycho?',
            'What is the total number of cast members of Iron Man?',
            'What is the amount of Academy Award nominations that Morgan Freeman has?'
           ]
for q in count_qs:
    print(f"Query: {q}")
    print(ask_question(q))
    print("\t**********\n")
#30; 30; 17; 1115; 40,000,000 United States dollar; 64; 5
# We find 18 for the number of Twin Peaks episodes, since there are two TV series with exactly the same name

Query: How many episodes does Twin Peaks have?
['Twin Peaks']
['How many episodes', 'episodes', 'episode']
['number', 'How many episodes number', 'episodes number', 'episode number']
['How many episodes', 'episodes', 'episode']
Closest answer:
        entity: Twin Peaks: Fire Walk with Me
      property: number of episodes

0
	**********

Query: How many awards did Titanic win?
['Titanic']
['How many awards', 'awards', 'award', 'win']


KeyboardInterrupt: 

In [11]:
qs = ['How many genres does Pulp Fiction have?',
     'When was Alan Rickman born?',
     'Where was Alan Rickman born?',
     'How many episodes does Twin Peaks have?',
     'How long is Interstellar?']#,
     #'Who is the director of Blade Runner 2049?']


lonely_qs = ['Who directed The Shawshank Redemption?',
     'Who is the director of The Shawshank Redemption?',

     'What is the birth date of Alan Rickman?',
     'When was Alan Rickman born?',

     'What is the height of Amitabh Bachchan?',
     'How tall is Amitabh Bachchan?',

     'What is the publication date of The Dark Knight?',
     'When was The Dark Knight published?',

     'Who acted as Gollum?',
     'Which actor played Gollum?',

     'What is the length of Interstellar?',
     'How long does Interstellar run?',
     'When did Alan Rickman die?',
     'When was Pulp Fiction published?',
     'Where was Morgan Freeman born?',
     'Where does Home Alone originate?',
     'Which movies are directed by Alice Wu?',
     'How long is Pulp Fiction?',
     'How many episodes does Twin Peaks have?',
     'How long is Interstellar?',
     'Which character was married to Aragorn?',
     'Which character did Aragorn marry?']
    
for q in qs:
    print(f"Query: {q}")
    print(ask_question(q))
    print("\t**********\n")

for q in lonely_qs:
    print(f"Query: {q}")
    print(ask_question(q))
    print("\t**********\n")
# Use "en_core_web_trf" instead of "en_core_web_sm"

Query: How many genres does Pulp Fiction have?
['Pulp Fiction']
['How many genres', 'genres', 'genre']
['number', 'How many genres number', 'genres number', 'genre number']
['How many genres', 'genres', 'genre']
Closest answer:
        entity: Pulp Fiction
      property: genre

4
	**********

Query: When was Alan Rickman born?
['Alan Rickman']
['born', 'bear']
Closest answer:
        entity: Alan Rickman
      property: date of birth

['date', 'born date', 'bear date']
answerLabel	1946-02-21T00:00:00Z

	**********

Query: Where was Alan Rickman born?
['Alan Rickman']
['born', 'bear']
Closest answer:
        entity: Alan Rickman
      property: place of birth

['place', 'born place', 'bear place']
answerLabel	Hammersmith

	**********

Query: How many episodes does Twin Peaks have?
['Twin Peaks']
['How many episodes', 'episodes', 'episode']
['number', 'How many episodes number', 'episodes number', 'episode number']
['How many episodes', 'episodes', 'episode']
Default count
	**********



In [None]:
ask_question("What is the height of Amitabh Bachchan?")

In [None]:
from run_qs import get_q_list

# Fetch the question list once so that questions and given answers stay aligned
q_list = get_q_list()[:3]
qs = [q[0] for q in q_list]
real_anss = [q[1] for q in q_list]
total_ans = 0
not_found = 0

# The context manager closes the file even if a question raises an error
with open('answerlist.txt', 'w', encoding='utf-8') as f:
    f.write('No|Query|Given Answer|System Answer\n')

    for q, r_ans in zip(qs, real_anss):

        r_ans = ",".join(r_ans)

        print(f'{total_ans}) Query: {q}')

        ans = ask_question(q, base_answer=True)

        try:
            f.write(f'{total_ans}|{q}|{r_ans}|{ans}\n')
        except Exception:
            print(">>>>> Could not write to file")
            print(f'>>>>> {total_ans}|{q}|{r_ans}|{ans}\n')

        if ans in ['Answer not found', 'Error']:
            not_found += 1

        total_ans += 1

        print(f'\nAnswer: {ans}\n')
        print(f'Given Answer: {r_ans}\n')

print(f'Questions queried: {total_ans}')
# Guard against division by zero when the question list is empty
if total_ans:
    print(f'Not found ratio: {not_found/total_ans}')

In [None]:
print(ask_question("How many episodes does Twin Peaks have?"))

In [None]:
import pandas as pd
df = pd.read_csv('answerlist.txt', sep='|')

print('Questions that could not be answered:')
filtered = df[df['System Answer'] == 'Answer not found']

for query in filtered['Query']:
    print(query)