# Final Project

For the final project, the goal is to implement a QA system that answers all kinds of questions about movies, actors, and everything else related to the movie business.

## Test Questions

The result will be evaluated on, among other things, a set of test questions. The test questions are provided as a tab-separated CSV file in which each row consists of an ID and the text of the question:

    ID   Text of the question
    
Your results are also to be submitted as a tab-separated csv file. Code for reading in the data and writing the answers file is provided below. All you have to do is improve the question-answering function. 
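
As a minimal sketch of reading this format, assuming one `ID<TAB>question` pair per row and no header row: the helper name `read_questions` and the sample rows are hypothetical, and an in-memory buffer stands in for the real questions file.

```python
import csv
import io

def read_questions(handle):
    """Yield (question_id, question_text) pairs from a tab-separated file object."""
    for row in csv.reader(handle, delimiter='\t'):
        # Skip blank or malformed rows instead of raising an IndexError
        if len(row) < 2:
            continue
        yield row[0], row[1]

# Hypothetical sample data standing in for the real questions file
sample = io.StringIO("1\tWho directed Inception?\n\n2\tIs Tom Hanks an actor?\n")
questions = list(read_questions(sample))
print(questions)
```

Skipping short rows is a design choice made for this sketch; a stricter reader could raise an error instead.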

## Answer file

The answer file is also a tab-separated csv file, as in the example output below (in file 'our_team_answers.csv'). For list questions, return the list of answers separated by a comma, also as shown in the example below. 

Make sure to answer all questions, so if no answer is found by the system, insert a dummy answer such as 'No answer found'. 
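
The two rules above (comma-separated list answers and a dummy answer when nothing is found) could be sketched as follows. The helper `format_answer` and the sample answers are hypothetical placeholders, and an in-memory buffer replaces the real answer file.

```python
import csv
import io

def format_answer(answer):
    """Turn a raw answer (string, list, or None) into a single answer-file cell."""
    # Missing or empty answers become the dummy answer
    if answer is None or len(answer) == 0:
        return 'No answer found'
    # List answers are joined with commas
    if isinstance(answer, (list, tuple)):
        return ', '.join(str(a) for a in answer)
    return str(answer)

# Placeholder answers keyed by question ID
raw_answers = {'1': 'Some Director', '2': ['Actor A', 'Actor B'], '3': None}

out = io.StringIO()
writer = csv.writer(out, delimiter='\t', lineterminator='\n')
for question_id, answer in raw_answers.items():
    writer.writerow([question_id, format_answer(answer)])
print(out.getvalue())
```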


## Submission by:
### Team SREK 🐸
Team members:
- Joris Peters (s4001109)
- Ruhi Mahadeshwar (s4014456)
- Satchit Chatterji (s3889807)
- Yara Bikowski (s3989585)

---

# Program Code

### (IO at the end of file)

### Import Libraries
NOTE: Please make sure these libraries are available, and that the spaCy model 'en_core_web_trf' is downloaded and loadable!
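
One way to check this up front is sketched below. It relies only on the standard library's `importlib.util.find_spec`; the helper name `missing_modules` is made up for illustration, and the sketch assumes that the spaCy model is installed as a regular Python package.

```python
import importlib.util

def missing_modules(names):
    """Return the names of the modules that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Packages used by this notebook, plus the spaCy model package
required = ['spacy', 'requests', 'nltk', 'en_core_web_trf']
missing = missing_modules(required)
if missing:
    print('Please install:', ', '.join(missing))
else:
    print('All required packages were found.')
```

Note that this only checks that the packages can be located. The WordNet corpus used through NLTK is a separate download and is not covered by this check.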

In [None]:
"""NLP library"""
import spacy
"""Request operations"""
import requests
"""For regular expressions"""
import re
"""Tokenizer"""
from spacy.tokenizer import Tokenizer
"""Used for word similarity"""
from nltk.corpus import wordnet as wn
"""Cache similar function calls for speed"""
from functools import lru_cache
"""Regulate query times"""
import datetime

"""Load Spacy's large transformer model"""
nlp = spacy.load('en_core_web_trf')

### Query functions
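
The query builders in this section produce SPARQL strings for the Wikidata endpoint. The sketch below shows the shape of the simplest one without sending any request. The function name `build_simple_query` is hypothetical, and `Q123` is a placeholder entity ID chosen for illustration.

```python
def build_simple_query(entity_id, property_id, reverse=False):
    """Build a SELECT query for '<entity> <property> ?answer', or the reverse direction."""
    if reverse:
        subject, obj = '?answer', f'wd:{entity_id}'
    else:
        subject, obj = f'wd:{entity_id}', '?answer'
    return (
        'SELECT ?answerLabel WHERE {\n'
        f'  {subject} wdt:{property_id} {obj} .\n'
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        '}'
    )

# Placeholder IDs, not a real lookup
print(build_simple_query('Q123', 'P57'))
print(build_simple_query('Q123', 'P57', reverse=True))
```

The reversed form is useful when the known entity is the object of the statement rather than its subject.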

In [None]:
def reduce_based_on_ids(id_list):
    """
    If there are multiple ways of getting a list of properties,
    then they may be repeated. This simply removes duplicates,
    while not changing the relative order within the input list.
    """
    id_set = {}
    for obj in id_list:
        id_set[obj['id']] = obj

    return list(id_set.values())

def get_wikidata_ids_of_word(name, search_property = False):
    """
    Returns a list of ID dictionaries (with labels and possibly descriptions)
    for a given name, either looking for entities or properties (set search_property:=True for the latter)
    Each dict contains keys: 'id', 'label', and possibly 'description'.
    If a description cannot be found, it will not be included in the dict.
    """
    all_results = []
    
    url = 'https://www.wikidata.org/w/api.php'
    params = {'action':'wbsearchentities', 
              'language':'en',
              'format':'json'}
    
    # add a param to the request if it needs to look for a property
    if search_property:
        params['type'] = 'property'
    
    params['search'] = name
    json = requests.get(url,params).json()
    
    # extract only the useful data from the json file
    try:
        for result in json['search']:
            # append an empty dictionary
            all_results.append({})
            # add the ID and label
            all_results[-1]['id'] = result['id']
            all_results[-1]['label'] = result['label']
            # add a description if it exists
            if 'description' in result.keys():
                all_results[-1]['description'] = result['description']
    except Exception:
        # no results
        pass
    
    return all_results

def get_wikidata_ids(list_of_words, search_property = False):
    """
    Returns a set of candidate id's for the list of words
    """
    list_of_ids = []
    for word in list_of_words:
        list_of_ids += get_wikidata_ids_of_word(word, search_property)
    # remove duplicates
    set_of_ids = reduce_based_on_ids(list_of_ids)
    return set_of_ids

def simple_sparql_query(entity_id, property_id, binary_entity = None, reverse = False):
    """ 
    Returns a SPARQL query string with a given entity and property, and, possibly,
    reverses the arguments or adds an additional 'answer' for binary (Yes/No) questions
    """
    if reverse:
        p2 = "wd:" + entity_id
        p1 = "?answer"
    else:
        p1 = "wd:" + entity_id
        p2 = "?answer"
        
    if binary_entity is not None:
        query = f'''ASK {{
            wd:{entity_id} wdt:{property_id} wd:{binary_entity} .
        }}'''
    else:
        query = f'''SELECT ?answerLabel WHERE {{
            {p1} wdt:{property_id} {p2}.
            SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
        }}'''
    return query
    
def property_qualifier_query(entity_id, property_id, qualifierEntity_id, qualifierProperty_id, reverse = False) :
    """ 
    Returns a SPARQL query string for qualified sentence with two given entities and properties,
    possibly reversing the arguments
    """
    if reverse:
        p2 = "wd:" + qualifierEntity_id
        p1 = "?item"
    else:
        p1 = "wd:" + qualifierEntity_id
        p2 = "?item"
    query = f'''SELECT ?itemLabel WHERE {{ 
        wd:{entity_id} p:{property_id} ?stat . 
        ?stat ps:{property_id} {p1} . 
        ?stat pq:{qualifierProperty_id} {p2} .
        SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
      }}'''
    return query

def simple_qualifier_query(entity_id, property_id):
    """
    Use qualifier statement to also get a complete list of results.
    """
    query = f'''SELECT ?itemLabel WHERE {{ 
        wd:{entity_id} p:{property_id} ?stat . 
        ?stat ps:{property_id} ?item .
        SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
      }}'''
    return query

def get_SPARQL_results(query, shouldBeCounted = False):
    """
    Return results (string) for a SPARQL query. In case it should be counted,
    returns the number of results.
    """
    url = 'https://query.wikidata.org/sparql'
    if shouldBeCounted:
        result = 0
    else:
        result = ""
    # Max 1000 attempts
    for _ in range(1000):
        data = requests.get(url, params={'query': query, 'format': 'json'})
        if data.status_code == 200:
            break
    data = data.json()
    # ASK queries return a top-level 'boolean'; SELECT queries return bindings
    if 'boolean' in data:
        return data['boolean']
    for item in data['results']['bindings']:
        for var in item:
            if shouldBeCounted:
                result += 1
            else:
                result += ('{}\t{}\n'.format(var,item[var]['value']))
    
    if shouldBeCounted and result == 0:
        return None
    
    return str(result)
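
The retry loop in `get_SPARQL_results` repeats the request immediately after a failure. A gentler alternative is to wait between attempts, sketched here with exponential backoff. This is not what the code above does: `fetch_with_retries`, its parameters, and the fake endpoint are all hypothetical, and the fetch function is injected so that the sketch runs without network access.

```python
import time

def fetch_with_retries(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """
    Call fetch() until it returns a (status, payload) pair with status 200.
    Waits base_delay * 2**attempt seconds between attempts and
    returns None when every attempt fails.
    """
    for attempt in range(max_attempts):
        status, payload = fetch()
        if status == 200:
            return payload
        # No need to wait after the final attempt
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return None

# Fake endpoint that is rate-limited twice and then succeeds
responses = iter([(429, None), (429, None), (200, {'boolean': True})])
delays = []  # record the waits instead of actually sleeping
result = fetch_with_retries(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Returning `None` after the last failed attempt lets the caller fall back to a dummy answer instead of crashing while parsing an error response.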

### Linguistic helpers

In [None]:
def get_root(doc):
    """
    Return the root of the dependency tree
    in a given nlp-parsed sentence (root)
    """
    for word in doc:
        if word.dep_ == "ROOT":
            return word
        
def is_q_word(string):
    """
    Returns whether a string is a question word.
    """
    return string in ['who', 'what', 'which', 'how', 'in', 'when', 'where', 'why', 'whom']
                      
def is_exception(string):
    """
    Returns whether the word is a question word
    or the first word of a binary question.
    """
    return string in ['was', 'is', 'does', 'did'] or is_q_word(string)
        
def phrase(word, remove = ""):
    """
    Given code: Return the phrase that the given word heads
    """
    children = []
    for child in word.subtree:
        children.append(child.text.replace(remove,''))
    return " ".join(children)

def empty_prop_id_dict(prop_id, label):
    """
    Creates a dictionary with a given property id for qualifier shortcuts (see function below)
    """
    return {
        'id': prop_id,
        'label': label,
        'description': ''
    }

def check_qualified_words(doc):
    """
    Given the presence of keywords in the document, this function returns a list of
    id-dictionaries that will serve as shortcuts for answering qualified questions.
    """
    key_words = {
        'voice' : ['P453', 'P175', 'P725'],
        'play' : ['P453', 'P161']
    }
    res = []
    for token in doc:
        one_word = token.lemma_
        if token.pos_ == 'NUM':
            res.append(empty_prop_id_dict("P585", "point in time"))
        if one_word in list(key_words.keys()):
            res += [empty_prop_id_dict(x, one_word) for x in key_words[one_word]]
    return res

def check_key_words(doc):
    """
    Check for special question words and other keywords. These will be appended to the properties.
    """
    key_words = {
        'how many' : 'number',
        'quantity' : 'number',
        'amount' : 'number',
        'number of': 'number',
        'how long' : 'duration',
        'runtime' : 'duration',
        'how often' : 'frequency',
        'when' : 'date',
        'where' : 'location',
        'location' : 'location',
        'why' : 'cause',
        'cause' : 'cause',
        'whose' : 'owner',
        'birthday' : 'date of birth',
        'directed' : 'director',
        'about' : 'main subject',
        'release':'publication'
    }
    for i in range(len(doc)-1):
        one_word = doc[i].text.lower()
        two_words = doc[i].text.lower() + " " + doc[i+1].text.lower()
        if two_words in list(key_words.keys()):
            return key_words[two_words]
        elif one_word in list(key_words.keys()):
            return key_words[one_word]

        one_word = doc[i].lemma_.lower()
        if one_word in list(key_words.keys()):
            return key_words[one_word]

    return None

def add_variations(prop):
    """
    A normalizer that adds all possible variations based on the given property.
    """
    if prop == 'born' or prop == 'bear':
        return ['birth place', 'birth date']
    else:
        return []

def remove_duplicates(ls):
    """
    Removes duplicates from a list
    """
    d = {}
    for l in ls:
        d[l] = 0
    return list(d.keys())
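
The keyword lookup in `check_key_words` prefers two-word matches over one-word matches at each position. The sketch below shows that idea on plain token lists, without spaCy. The name `find_keyword` and the reduced keyword table are assumptions for illustration, lemmas are not consulted, and unlike the original loop the last token is also checked.

```python
KEY_WORDS = {
    'how many': 'number',
    'how long': 'duration',
    'when': 'date',
    'where': 'location',
}

def find_keyword(tokens, key_words=KEY_WORDS):
    """Return the category of the first matching keyword, trying bigrams before single words."""
    lowered = [token.lower() for token in tokens]
    for i, word in enumerate(lowered):
        # Two-word keywords such as 'how many' take priority at each position
        bigram = ' '.join(lowered[i:i + 2])
        if bigram in key_words:
            return key_words[bigram]
        if word in key_words:
            return key_words[word]
    return None

print(find_keyword('How many movies did she direct ?'.split()))
print(find_keyword('When was it released ?'.split()))
```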

### Entity and Property extraction functions

In [None]:
def get_named_entities(doc):
    """ 
    spacy has entity recognition in-built, which might work well
    for names, but not for multi-word named entities (like movie titles)
    """
    return doc.ents

def get_entity_complex(q_str):
    """
    Using regex, the function finds the most likely entities. 
    Some examples of entities it can find:
    24, Fault In Our Stars, Inception, Blade Runner 2049,
    12 Angry Men, Avengers: Endgame, WandaVision, Face/Off
    """
    if is_exception(q_str.split()[0].lower()): 
        q_str = q_str.split(' ', 1)[1] # remove the first word, since it is usually an uninformative question word
    entities = re.findall(r"([A-Z0-9]+[a-z']*(?:[:/-]?[\s]?[A-Z][a-z]+|[\s][0-9]+)*)", q_str)
        
    return entities


def get_closest_proper_noun(root, remove = ''):
    """
    It is often the case that the proper noun
    that is most closely associated with the root
    is the most relevant entity in the question.
    This is a recursive function that starts at the
    root and does a depth-first search through the tree.
    """
    pn = ""
    for child in root.children:
        if child.pos_ == 'PROPN':
            pn = phrase(child, remove)
            return pn
        
        # pass `remove` along so that nested matches are cleaned the same way
        pn = get_closest_proper_noun(child, remove)
        if pn != "":
            break
    
    return pn

def find_single_uppercase_entity(parse):
    """
    Used for binary questions:
    If we are only searching for one entity,
    we can just find the first and last index 
    of words beginning with an uppercase letter.
    There is a check to make sure that the first 
    word isn't added if it isn't part of the entity.
    """
    entities = []
    entity_range = [] 
    
    for i in range(len(parse)):
        word = parse[i]
        try:
            second_word = parse[i+1]
        except:
            second_word = False
        if word.text.istitle():
            # Check if word starts with uppercase letter for entities
            if i != 0 or (not is_exception(word.lemma_.lower())
                          and (not second_word or not is_exception(second_word.lemma_))):
                # If it isn't one of the question words (Who/What/Which/When) or a word before them
                entity_range.append(i)
    if len(entity_range) > 1:
        min_entity = entity_range[0]
        max_entity = entity_range[len(entity_range)-1]
        entity = parse[min_entity:max_entity+1].text
        entities.append(entity)
    elif len(entity_range) == 1:
        entity = parse[entity_range[0]].text
        entities.append(entity)
    return entities

def get_entity_from_index_range(parse, begin, end):
    """
    Return the text in the parsed text
    based on the given begin and end index.
    """
    return parse[begin:end+1].text

def add_entity_based_on_index(number_range, front_index, back_index, parse):
    """
    Based on the range, front index, and back index
    return all possible variations of entities.
    """
    entities = []
    if len(number_range) > 1:
        min_number = number_range[0]
        max_number = number_range[len(number_range)-1]
        entities.append(get_entity_from_index_range(parse, min_number, max_number))
    for number in number_range:
        if not type(front_index) == bool:
            # Add title parts from the front
            entities.append(get_entity_from_index_range(parse, front_index, number))
        if not type(back_index) == bool:
            # Add title parts from the back
            if back_index > number:
                entities.append(get_entity_from_index_range(parse, number, back_index))
            else:
                entities.append(get_entity_from_index_range(parse, back_index, number))
            if not type(front_index) == bool and not type(back_index) == bool:
                # Add Title parts from the front and back of this number
                if back_index > number:
                    entities.append(get_entity_from_index_range(parse, front_index, back_index))
    return entities

def find_single_number_entity(parse):
    """
    Find an entity with a number in it.
    Multiple possibilities will be returned 
    based on whether there are words starting 
    with uppercase letters before or after it.
    """
    entities = []
    number_range = []
    front_index = False 
    back_index = False
    
    for i in range(len(parse)):
        word = parse[i]
        try:
            second_word = parse[i+1]
        except:
            second_word = False
        if word.pos_ == "NUM":
            # Check if word is a number and add to range
            number_range.append(i)
            # Add number itself as entity
            entities.append(word.text)
            # Show that number has been used
            if type(back_index) == bool:
                back_index = True
        if word.text.istitle():
            # Check if word starts with uppercase letter for entities
            if i != 0 or (not is_exception(word.lemma_.lower())
                          and (not second_word or not is_exception(second_word.lemma_))):
                # If it isn't one of the question words (Who/What/Which/When) or a word before them
                if not type(back_index) == bool or back_index == True:
                    # number has already been used, so part after number
                    back_index = i
                elif type(front_index) == bool:
                    # First part of entity
                    front_index = i
    return entities + add_entity_based_on_index(number_range, front_index, back_index, parse)

def entity_split_sentence(parse):
    """
    Will return ranges for the sentences 
    so they contain at most one entity,
    but some entities might be split.
    """
    ranges = []
    min_index = 0
    max_index = False
    for i in range(len(parse)):
        word = parse[i]
        try:
            second_word = parse[i+1]
        except:
            second_word = False
        if word.text.istitle():
            # Check if word starts with uppercase letter for entities
            if i != 0 or (not is_exception(word.lemma_.lower())
                          and (not second_word or not is_exception(second_word.lemma_))):
                    # If it isn't one of the question words (Who/What/Which/When) or a word before them
                    if type(max_index) == bool:
                        max_index = i
                    elif i > max_index+2:
                        ranges.append([min_index, i-1])
                        min_index = i
                        max_index = False
                    else:
                        max_index = i
    ranges.append([min_index, len(parse)-1])
    return ranges
    
def find_uppercase_and_number_entities(parse):
    """
    First split the sentence if it contains multiple entities.
    Then run both the uppercase and number entity finders 
    on them to find all entities in the sentence.
    """
    entities = []
    ranges = entity_split_sentence(parse)
    for r in ranges:
        sub_sentence = parse[r[0]:r[1]+1].text.replace('?','')
        sub_sentence = nlp(sub_sentence)
        uppercase = find_single_uppercase_entity(sub_sentence)
        number = find_single_number_entity(sub_sentence)
        entities += uppercase + number
    return entities

### Parser for natural language query ###

def preprocess(query):
    """
    Preprocessing for looking for entities using
    non-dependency methods. This is not strictly
    necessary, but makes it slightly less brittle
    wrt the orthography of the sentence.
    """
    query = query.replace('?','')
    query = query.replace(query[0], query[0].lower(), 1)
    return query

def get_entities(doc):
    """
    Return the possible entities of a given English query.
    The possibilities are:
        Proper noun phrase closest to root
        Named entities according to spaCy
        Regex-found entities (titled words in a row)
    """
    entities = []
    
    root = get_root(doc)
    entities.append(get_closest_proper_noun(root))
    if q_type_binary(doc):
        entities += find_uppercase_and_number_entities(doc)
    else:
        entities += find_single_uppercase_entity(doc)
        entities += find_single_number_entity(doc)
    query = preprocess(doc.text)
    entities += [str(e) for e in get_named_entities(doc)]
    entities.extend(get_entity_complex(query))
    
    entities = [str(e) for e in entities if " 's" not in e]
    
    entities += [entity.replace("the ", "") for entity in entities]
    entities += [entity.replace("'s", "").strip() for entity in entities]
    entities = remove_duplicates(entities)
    
    # Longer strings should be prioritized first
    entities = sorted(entities, key=len, reverse=True)
    
    return entities
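
To see what the regular expression in `get_entity_complex` picks up, the sketch below applies the same pattern to two sample strings. The wrapper name `find_title_like_spans` and the sample questions are made up for illustration, and the step that drops the leading question word is left out.

```python
import re

# Same pattern as in get_entity_complex: a capitalised word or number, optionally
# followed by more capitalised words (joined by ':', '/', '-' or whitespace) or numbers
TITLE_PATTERN = r"([A-Z0-9]+[a-z']*(?:[:/-]?[\s]?[A-Z][a-z]+|[\s][0-9]+)*)"

def find_title_like_spans(text):
    """Return all substrings of text that look like titles or names."""
    return re.findall(TITLE_PATTERN, text)

print(find_title_like_spans('who directed Blade Runner 2049'))
# ['Blade Runner 2049']
print(find_title_like_spans('is Avengers: Endgame longer than 12 Angry Men'))
# ['Avengers: Endgame', '12 Angry Men']
```

Because the pattern depends on capitalisation, lower-case words inside a title (such as "of" or "the") split the title into several spans.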

In [None]:
### Property Extraction functions ###

##Question types
def q_type_addition(doc):
    """
    Returns True iff the question word is prefixed by a preposition,
    for example, if the question starts with In what, For how long, etc.
    """
    # dep_ is the string label; dep is an integer ID and never equals 'prep'
    return doc[0].dep_ == 'prep' or doc[len(doc)-1].dep_ == 'prep'

def q_type_count(doc):
    """
    Returns True iff the question asks for the number of results
    """
    return check_key_words(doc) == 'number'

def q_type_date(doc):
    """
    Returns True iff the question contains a number (presumably a date)
    """
    return 'NUM' in [x.pos_ for x in doc]

def q_type_qualifier(doc):
    """
    Returns True iff the question might contain qualifiers
    """
    return sum([x.dep_ in ['nsubj', 'dobj', 'pobj'] for x in doc]) > 2

def q_type_easy_qualifier(doc):
    """
    Returns True iff the question can be handled by some shortcuts
    """
    return sum([x.lemma_ in ['in', 'character', 'play', 'voice'] for x in doc]) > 2

def q_type_binary(doc):
    """
    Returns True if and only if the question is a yes/no question
    """
    return doc[0].lemma_ in ['be', 'do', 'have']

def get_root_related_props(doc, entities):
    """
    Several methods to try and get properties with
    respect to the root of the question.
    
    Return list of possible properties.
    
    Note: The lemmas, text and/or phrases are added in order to make sure
          multi-word properties (e.g. 'voice actor') are also considered
    """
    ps = {}
    root = get_root(doc)
    
    for child in root.children:
        if len(entities) < 2 and phrase(child) in entities:
            continue
        if is_q_word(child.text.lower()):
            continue
        if child.dep_ == 'nsubj':
            ps[phrase(child)] = 1
            ps[child.text] = 1
            ps[child.lemma_] = 1
        if child.dep_ == 'dobj':
            ps[phrase(child)] = 1
            ps[child.text] = 1
            ps[child.lemma_] = 1
        if child.pos_ == 'ADJ':
            ps[child.text] = 1
            ps[child.lemma_] = 1
    if root.lemma_ not in ['be', 'have', 'do']:
        ps[root.lemma_] = 1
        ps[root.text] = 1
    for token in doc:
        if token.dep_ == 'acl' and token.pos_ == 'VERB':
            ps[token.lemma_] = 1
    
    # Possibly add keyword
    key_word = check_key_words(doc)
    sorted_extended_ps = []
    if key_word != None:
        extended_ps = {}
        extended_ps[key_word] = 3
        for prop in list(ps.keys()):
            extended_ps[prop + ' ' + key_word] = 4
        sorted_extended_ps = list(dict(sorted(extended_ps.items(), key=lambda item: -item[1])).keys())
    
    # Sort in descending order of score (the dict values) and convert to list
    sorted_ps = list(dict(sorted(ps.items(), key=lambda item: -item[1])).keys())
    for token in doc:
        if key_word == 'location' and 'die' == token.lemma_: 
            sorted_extended_ps = ['place of death']
        if key_word == 'location' and 'bear' == token.lemma_: 
            sorted_extended_ps = ['place of birth']
    return sorted_ps, sorted_extended_ps

def dumb_property_finder(parse) :
    """
    Find the property based on the root
    or whether the word is a noun/verb
    also add adjectives if present.
    """
    begin_prop_index = False
    end_prop_index = False
    props = []
    for i in range(len(parse)):
        word = parse[i]
    
        if word.dep_ == "ROOT" and word.lemma_ not in ['be', 'have', 'do']:
            # Set root as property when it isn't a form of to be, to have, or to do
            props.append(word.lemma_)
    
        if (word.pos_ == "NOUN" or word.pos_ == "VERB") and not is_exception(word.text):
            # Properties are nouns or verbs
            props.append(phrase(word))
            # Guard i > 0 so that parse[i-1] does not wrap around to the last token
            if i > 0 and parse[i-1].pos_ == "ADJ":
                # Also add adjectives of the properties
                if type(begin_prop_index) == bool:
                    begin_prop_index = i-1
                props.append(parse[i-1:i+1].lemma_)
            if type(begin_prop_index) == bool:
                begin_prop_index = i
            end_prop_index = i
            if q_type_count(parse) and word.text in ['amount', 'number']:
                continue
            props.append(parse[i].lemma_)
        elif type(begin_prop_index) != bool and i > end_prop_index+1:
            props.append(parse[begin_prop_index:end_prop_index+1].text)
            begin_prop_index = False
    
    # Properties are returned in the order in which they were found
    return props
    
def get_properties(doc, entity):
    """
    Returns list of possible properties (list of strings)
    """
    ps, extended_ps = get_root_related_props(doc, entity)
    
    if q_type_binary(doc) or not q_type_qualifier(doc) or q_type_count(doc):
        props = dumb_property_finder(doc)
        if q_type_binary(doc) or q_type_count(doc):
            extended_ps += props
            ps = props + ps
        else:
            ps += props
    
    if q_type_date(doc):
        ps.append('point in time')
    if q_type_qualifier(doc):
        ps = [p for p in ps if p.lower() == p]
    
    # Remove all Nones
    return [x for x in ps if x is not None], [x for x in extended_ps if x is not None]

In [None]:
def is_number(string):
    """
    Returns whether the string is actually a string
    and whether this is a number.
    """
    if type(string) == str:
        string = string.replace(".","")
        string = string.replace(",","")
    else:
        return False
    return string.isdigit()
    
def order_answers(entities):
    """
    From all entities, add the numbers separately.
    Then also add all entity ids to the answer ids.
    """
    answers = []
    # Add numbers as possible answers (for count ask queries)
    for entity in entities:
        if is_number(entity):
            answers.append(entity)
    answers += get_wikidata_ids(entities)
    return answers
    
def binary_queries(entity_id, property_id, answer_ids, non_ids):
    """
    Try to get an answer to the binary question.
    Returns 'Yes' if an ASK query succeeds, or if the
    number of results equals the number in the question.
    Otherwise returns None.
    """
    non_ids.append(entity_id['id'])
    non_ids.append(property_id['id'])
    result = ''
    for answer_id in answer_ids:
        if is_number(answer_id):
            sparql_query = simple_sparql_query(entity_id['id'], property_id['id'])
            result = get_SPARQL_results(sparql_query, True)
            # the count comes back as a string (or None), so compare as integers;
            # separators are stripped in the same way as in is_number
            expected = int(answer_id.replace('.', '').replace(',', ''))
            if result is not None and int(result) == expected:
                return 'Yes'
        elif answer_id['id'] not in non_ids:
            sparql_query = simple_sparql_query(entity_id['id'], property_id['id'], answer_id['id'], False)
            result = get_SPARQL_results(sparql_query)
            if result == True:
                return 'Yes'
    return None

### Functions that are largely heuristic/custom

In [None]:
def get_synonyms(word, depth=1):
    """
    Using WordNet (via NLTK), return synsets (synonyms/related words)
    of a given word. Using the depth argument, the user can recursively 
    go down the tree of a given word's synonyms' synonyms to
    get more words, but with probably less relevance, traversing
    the tree in a BFS fashion. Most applications should just need 
    depth=1 (return just the first level of synonyms).
    """
    # base case
    if depth == 0:
        return []
    
    # surface level synonyms
    related_words = []
    for syn in wn.synsets(word):
        related_words += [x.name().replace('_', ' ') for x in syn.lemmas()]
    
    # deeper synonyms
    for ls in [get_synonyms(x, depth-1) for x in related_words]:
        related_words += ls
    
    # remove duplicates and return
    return remove_duplicates(related_words)

@lru_cache(maxsize=1)
def get_movie_related_words(include_wordnet=True):
    """
    Finds all (several) related words for entities in 
    the domain of movies. The top level have been hard-coded
    and several more are found using WordNet's synsets.
    This also means that not all returned words may be
    strongly related to movies, just because of how WordNet
    is designed.
    
    Note: Cached for speed using the lru_cache wrapper
    """
    # naive relations, hand-written
    # starting off point for synonym searching
    movie_relation = ['movie', 'film', 'picture', 'moving picture', 'motion', 'pic', 'flick', 'TV',
                      'television', 'show', 'animation', 'animation']
    character_relation = ['fiction', 'fictitious', 'character']
    actor_relation = ['actor', 'actress', 'thespian']
    music_relation = ['musician', 'music', 'score', 'compose', 'song']
    
    all_relations = []
    all_relations += movie_relation
    all_relations += character_relation
    all_relations += actor_relation
    all_relations += music_relation
    
    if include_wordnet:
        # get WordNet synsets
        all_syns = [get_synonyms(x) for x in all_relations]

        # add to relations
        for syn in all_syns:
            all_relations+=syn

    # remove duplicates and return
    return remove_duplicates(all_relations)
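
The two techniques used in this cell, order-preserving de-duplication and caching with `lru_cache`, can be seen in isolation in the sketch below. The function `build_vocabulary` and its word list are hypothetical. It returns a tuple rather than a list, which is a design choice of this sketch so that callers cannot accidentally modify the cached value.

```python
from functools import lru_cache

calls = []  # records how often the function body actually runs

@lru_cache(maxsize=1)
def build_vocabulary(include_extra=True):
    """Build a de-duplicated word tuple; the result is cached after the first call."""
    calls.append(include_extra)
    words = ['movie', 'film', 'movie', 'actor', 'film']
    # dict.fromkeys keeps the first occurrence of each word, in order
    return tuple(dict.fromkeys(words))

first = build_vocabulary()
second = build_vocabulary()  # served from the cache, the body does not run again
print(first, len(calls))
```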

In [None]:
def entity_related_to_movies(entity_list):
    """
    Given a list of dictionaries with information about the entity,
    check if the description contains a word that is related to a movie.
    These have been chosen based on WordNet's synsets. This helps remove 
    irrelevant entities that have the same name but are not related 
    to movies (e.g. the Lord of the Rings book series).
    """
    valid = []
    all_relations = get_movie_related_words()
    for word in all_relations:
        for e in entity_list:
            if 'description' in e.keys():
                if word in e['description']:
                    if e not in valid:
                        valid.append(e)
                    
    return valid
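
The filter above uses substring matching, so a short word such as 'pic' also matches inside 'epic'. A possible variation, sketched here and not the method used above, is to match whole words only. The helper `filter_by_description`, the candidate dictionaries, and their IDs are made up for illustration.

```python
import re

def filter_by_description(candidates, vocabulary):
    """Keep the candidates whose description contains a vocabulary word as a whole word."""
    alternatives = '|'.join(re.escape(word) for word in vocabulary)
    pattern = re.compile(r'\b(?:' + alternatives + r')\b', re.IGNORECASE)
    # Candidates without a description are treated as having an empty one
    return [c for c in candidates if pattern.search(c.get('description', ''))]

# Hypothetical search results sharing the same label
candidates = [
    {'id': 'X1', 'label': 'Example', 'description': '2010 film by a hypothetical director'},
    {'id': 'X2', 'label': 'Example', 'description': 'epic poem'},
    {'id': 'X3', 'label': 'Example'},
]
kept = filter_by_description(candidates, ['film', 'pic', 'TV'])
print([c['id'] for c in kept])
```

Whole-word matching drops some false positives, but it also misses inflected forms such as 'films' unless they are added to the vocabulary.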

### Combining it all together

In [None]:
def permute(doc, entity_ids, property_ids, answer_ids, is_qual_question = False, isCountQuestion = False):
    """
    Try each combination of entities and properties. The entities
    and properties are likely sorted by relevance/similarity by
    Wikidata, so the first result found is returned. This ordering
    is not guaranteed, however. The flags control whether the
    question is handled with a qualified query and whether
    the results should be counted.
    """
    cur_time = datetime.datetime.now()
    result = None
    for entity_id in entity_ids:
        for property_id in property_ids:
            if datetime.datetime.now() - cur_time > datetime.timedelta(seconds = 300) and is_qual_question:
                return None
            if q_type_binary(doc):
                non_ids = []
                # get_wikidata_ids expects a list; a bare string would be searched character by character
                non = entity_related_to_movies(get_wikidata_ids([entity_id['label']]))
                for n in non:
                    non_ids.append(n['id'])
                result = binary_queries(entity_id, property_id, answer_ids, non_ids)
            elif isCountQuestion:
                sparql_query = simple_sparql_query(entity_id['id'], property_id['id'])
                result = get_SPARQL_results(sparql_query, isCountQuestion)
                if result not in [None, '']:
                    temp_res = result
                    sparql_query = simple_qualifier_query(entity_id['id'], property_id['id'])
                    result = get_SPARQL_results(sparql_query, isCountQuestion)
                    if result not in [None, '']:
                        return result
                    # fall back to the count from the simple query
                    return temp_res
            elif is_qual_question:
                for entity_id_2 in entity_ids:
                    for property_id_2 in property_ids:
                        if entity_id_2 != entity_id:
                            for reverse in [True, False]:
                                sparql_query = property_qualifier_query(entity_id['id'], property_id['id'],
                                        entity_id_2['id'], property_id_2['id'], reverse)
                                result = get_SPARQL_results(sparql_query)
                                if result not in [None, '']:
                                    return result
            else:
                for reverse in [True, False]:
                    sparql_query = simple_sparql_query(entity_id['id'], property_id['id'], None, reverse)
                    result = get_SPARQL_results(sparql_query, False)
                    if q_type_count(doc):
                        if not is_number(result.replace('answerLabel\t', "").strip()):
                            result = None
                    if result not in [None, '']:
                        return result

    return result

def pipeline(question):
    """
    Combines the above functions to create a pipeline to answer questions.
    
    Input: English question string
    Output: Result (answer) of wikidata queries for that question
    """
    result = ''
    
    # Tokenize/analyze the question with the spaCy model loaded at import time
    # (reloading the transformer model for every question is slow)
    doc = nlp(question)
    
    # get entities & their ids
    entities = get_entities(doc)
    entity_ids = entity_related_to_movies(get_wikidata_ids(entities))
    
    # get properties
    properties, extended_properties = get_properties(doc, entities)
    
    # specific modifiers for yes/no questions
    answer_ids = []
    if q_type_binary(doc):
        # iterate over a copy, since the list is extended inside the loop
        for propertyy in list(properties):
            properties += add_variations(propertyy)
        answer_ids = order_answers(entities)
    
    # get property ids
    property_ids = get_wikidata_ids(properties, True)
    property_ids = reduce_based_on_ids(property_ids)
    extended_property_ids = get_wikidata_ids(extended_properties, True)
    
    # approach each type of question differently
    if q_type_qualifier(doc):
        qualifier_shortcuts = check_qualified_words(doc)
        if qualifier_shortcuts:
            # Qualifier shortcuts
            result = permute(doc, entity_ids, qualifier_shortcuts, answer_ids, True)
        if result in [None, '']:
            # Qualified normal properties
            result = permute(doc, entity_ids, property_ids, answer_ids, True)
    if result in [None, '']:
        # Keyword extended properties
        result = permute(doc, entity_ids, extended_property_ids, answer_ids)
        if result in [None, '']:
            # Normal properties, possibly count
            result = permute(doc, entity_ids, property_ids, answer_ids, False, q_type_count(doc))
    
    if result not in [None, '']:
        return result

    # Guesses if answer not found
    if q_type_binary(doc):
        return "No"
    elif q_type_count(doc):
        return '0'
    
    return None

def ask_question(question, base_answer=True):
    """
    The main function used to ask queries.
    It is mainly a wrapper for the pipeline.
    """
    ans = 'Answer not found'
    try:
        ans = pipeline(question)
    except Exception:
        ans = "Error encountered while searching!"
    
    if ans in [None, '']:
        ans = "Answer not found"
        
    if base_answer: # strips the ans of any formatting
        # result lines are prefixed with the variable name (answerLabel or itemLabel) and a tab
        ans = ans.replace('answerLabel\t', "").replace('itemLabel\t', "").strip()
        ans = ans.replace('\n', ", ").strip()
        
    return ans
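
The clean-up step at the end of `ask_question` turns raw result lines of the form `variable<TAB>value` into a comma-separated answer. A sketch of the same idea that does not depend on the variable name is given below. The helper `clean_result` and the sample strings are hypothetical.

```python
def clean_result(raw):
    """Extract the values from 'variable<TAB>value' lines and join them with commas."""
    values = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        name, sep, value = line.partition('\t')
        # Lines without a tab (e.g. a plain 'Yes') are kept as they are
        values.append((value if sep else name).strip())
    return ', '.join(values) if values else 'Answer not found'

print(clean_result('answerLabel\tActor A\nanswerLabel\tActor B\n'))
print(clean_result('itemLabel\tSome Character\n'))
```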

---
# Input-Output

In [None]:
import csv

In [None]:
answers = {}

with open('test_questions.csv', newline='', encoding='utf-8') as questions :
    reader = csv.reader(questions,delimiter='\t')
    for row in reader:
        question_id = row[0]
        question_text = row[1]
        print(f'Answering question: {question_id}')
        answers[question_id] = ask_question(question_text)
        
with open('our_team_answers.csv', mode='w', newline='', encoding='utf-8') as answerfile :
    writer = csv.writer(answerfile,delimiter='\t')
    for key in answers :
        writer.writerow([key,answers[key]])

## Interactive

In [None]:
q = input("Input your question!")
print(ask_question(q))