### === Purpose ===

The goal of this lab is to disambiguate entities in a text. For example, given a Wikipedia article:

    <Paris_17>
    Paris is a figure in the Greek mythology.

the goal is to determine that `<Paris_17> = <Paris_(mythology)>`.
Here, `<Paris_17>` is an artificial title of the Wikipedia article, and `<Paris_(mythology)>` is the unambiguous entity in the YAGO knowledge base.
(https://yago-knowledge.org/graph/%22Paris%22@en?relation=all&inverse=1)
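
As a small illustration of the naming scheme, the ambiguous name can be recovered from an artificial title by dropping the angle brackets and the running number. The sketch below mirrors what the provided `Page.label()` method does; the helper name `label_from_title` is hypothetical and not part of the template.

```python
def label_from_title(title):
    """Turn an artificial title such as '<Paris_17>' into the name 'Paris'."""
    # Skip the leading '<', cut at the last underscore (which precedes the
    # running number), and turn the remaining underscores into spaces.
    return title[1:title.rindex("_")].replace("_", " ")


print(label_from_title("<Paris_17>"))       # Paris
print(label_from_title("<Ashok_Kumar_2>"))  # Ashok Kumar
```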

### === Provided Data ===

We provide:
1. a preprocessed version of Simple Wikipedia, `wikipedia-ambiguous.txt`, which contains ambiguous article titles with their content, as in the example above.
2. a simplified version of the YAGO knowledge base.
3. a template for your code, `disambiguator.py`.
4. a gold standard sample.

### === Task ===

Your task is to complete the function `disambiguate()` in this file.
It receives as input (1) the ambiguous Wikipedia title ("Paris" in the example), (2) the article content, and (3) the knowledge base.
The function should return the unambiguous entity from YAGO, or `None` if it gives no answer (such pages are left out of the result file).
In order to ensure a fair evaluation, do not use any non-standard Python libraries except `nltk`.
The lab will be graded by a variant of the F1 score that gives higher weight to precision (with `beta=0.5`).
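
For reference, the general F-beta score combines precision `P` and recall `R` as `(1 + beta^2) * P * R / (beta^2 * P + R)`; with `beta=0.5`, precision counts more than recall. A minimal sketch of that formula follows (the function name `f_beta` is illustrative and not part of the template):

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Same two numbers, swapped: the precise-but-incomplete system scores higher.
high_precision = f_beta(precision=0.8, recall=0.4)
high_recall = f_beta(precision=0.4, recall=0.8)
print(round(high_precision, 3), round(high_recall, 3))
```

Under this measure, leaving a doubtful page unanswered can pay off more than guessing.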

Input:
`<Babilonia_0>`
Babilonia is a 1987 Argentine drama film directed and written by Jorge Salvador based on a play by Armando Discépolo.

Output:
`<Babilonia_0>` TAB `<Babilonia>` (the two fields are separated by a single tab character)
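
Each line of the result file holds the artificial title and the chosen entity, separated by one tab character. A sketch of producing and parsing such a line (the helper `format_result` is a hypothetical name, not part of the template):

```python
def format_result(title, entity):
    """Build one result line: title, a tab character, entity, newline."""
    return title + "\t" + entity + "\n"


line = format_result("<Babilonia_0>", "<Babilonia>")
# Splitting on the tab must give back exactly two fields.
fields = line.strip().split("\t")
print(fields)  # ['<Babilonia_0>', '<Babilonia>']
```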

### === Development and Testing ===

**In YAGO, the entities have readable ids, as in `<Ashok_Kumar_(British_politician)>`. This is, however, not the case in all knowledge bases. Therefore, your algorithm should not rely on the suffix "British politician"!**

To enforce this, we deliver two versions of the lab:
1) Development: with readable entity ids.
The corresponding YAGO knowledge base is `dev_yago.tsv`, and the gold standard is `dev_gold_samples.tsv`.
2) Testing: without readable entity ids.
The corresponding YAGO knowledge base is `test_yago.tsv`. Here, the British politician has the id `<Ashok_Kumar_1081507>`. This is the file that you will be evaluated on!
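
One way to stay independent of the id format is to select candidates through their labels and then compare facts, never the id string itself. The sketch below uses a hand-made dictionary shaped like the `inverseFacts` attribute of the provided `KnowledgeBase`; the ids and the helper `candidates_for` are made up for illustration.

```python
# Toy stand-in for kb.inverseFacts: object -> relation -> set of subjects.
# The ids below are invented for this example.
toy_inverse_facts = {
    '"Paris"': {"<label>": {"<Paris_001>", "<Paris_002>"}},
    '"Epirus"': {"<label>": {"<Epirus_003>"}},
}


def candidates_for(name, inverse_facts):
    """Return the ids of all entities whose label equals `name`."""
    # Labels are stored as quoted literals, e.g. '"Paris"'.
    by_relation = inverse_facts.get('"' + name + '"', {})
    return sorted(by_relation.get("<label>", set()))


print(candidates_for("Paris", toy_inverse_facts))
print(candidates_for("Berlin", toy_inverse_facts))
```

Because the lookup goes through the label literal, the same code behaves identically on the development and test knowledge bases.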
   
### === Submission ===

1. Take your code, any resources needed to run it, and the output of your code on the test dataset (there is no need to include the other datasets).
2. ZIP these files into a file called `firstName_lastName.zip`.
3. Submit it here before the deadline announced during the lab:

https://www.dropbox.com/request/aFP23kphMb4isbYGz0gm


### === Contact ===

If you have any additional questions, you can send an email to: nedeljko.radulovic@telecom-paris.fr

In [None]:
"""
This cell contains the classes and functions that are used for reading and parsing the simplified knowledge base.
Don't modify this code.
"""
import sys


class Page:
    '''
    This class is used to store title and content of a wiki page
    '''
    __author__ = "Jonathan Lajus"

    def __init__(self, title, content):
        self.content = content
        self.title = title
        if sys.version_info[0] < 3:
            self.title = title.decode("utf-8")
            self.content = content.decode("utf-8")

    def __eq__(self, other):
        return isinstance(other, self.__class__) and self.title == other.title and self.content == other.content

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return hash((self.title, self.content))

    def __str__(self):
        return 'Wikipedia page: "' + (self.title.encode("utf-8") if sys.version_info[0] < 3 else self.title) + '"'

    def __repr__(self):
        return self.__str__()

    def _to_tuple(self):
        return (self.title, self.content)

    # Only used for Disambiguation TP
    def label(self):
        return self.title[1:self.title.rindex("_")].replace("_", " ")


class Parsy:
    '''
    Parses a Wikipedia file, returns page objects
    '''
    __author__ = "Jonathan Lajus"

    def __init__(self, wikipediaFile):
        self.file = wikipediaFile

    def __iter__(self):
        title, content = None, ""
        with open(self.file, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line and title is not None:
                    yield Page(title, content.rstrip())
                    title, content = None, ""
                elif title is None:
                    title = line
                elif title is not None:
                    content += line + " "
            yield Page(title, content.rstrip())


def clean(entity):
    '''
    clean entity
    :param entity: example "<http://yago-knowledge.org/resource/Lochaber>"
    :return: <Lochaber>
    '''
    if entity[0] == '<':
        entity = entity[1:]
        entity = entity[entity.rfind("/")+1:]
        entity = entity[entity.rfind("#")+1:]
        entity = "<"+entity
    elif entity[0] == '"':
        entity = entity[0:entity.rfind('"')+1]
    return entity


class KnowledgeBase:
    '''
    A simple knowledge base. Don't modify this code.

    Load the knowledge base:
        kb = KnowledgeBase("yago.tsv")

    Access facts:
        albumsOfElvis = kb.facts["<Elvis>"]["<albums>"]

    Access inverse facts:
        entitiesCalledParis = kb.inverseFacts['"Paris"']["<label>"]
    '''
    __author__ = "Fabian Suchanek"

    def __init__(self, yagoFile):
        self.facts = {}
        self.inverseFacts = {}
        with open(yagoFile, encoding="utf-8") as file:
            print("Loading", yagoFile, end="...", flush=True)
            for line in file:
                split_line = line.split('\t')
                if len(split_line)<3:
                    raise RuntimeError("The file is not a valid KB file")
                subject = clean(split_line[0])
                relation = clean(split_line[1])
                obj = clean(split_line[2])
                self.facts.setdefault(subject, {})
                self.facts[subject].setdefault(relation,set())
                self.facts[subject][relation].add(obj)
                self.inverseFacts.setdefault(obj, {})
                self.inverseFacts[obj].setdefault(relation,set())
                self.inverseFacts[obj][relation].add(subject)
                if relation == "<type>":
                    self.facts.setdefault(obj, {})
                    self.facts[obj].setdefault("<label>", set())
                    self.facts[obj]["<label>"].add(obj.replace("(", "").replace(")", "").replace("<", "").replace(">", "").replace("_", " "))

        print("done", flush=True)


def evaluate(student_file, goldstandard_file):
    '''
    run this code to evaluate your model on a gold standard dataset.
    :param student_file: a result file generated by you
    :param goldstandard_file: a gold standard dataset
    :return:
    '''
    # Dictionaries
    goldstandard = dict()
    student = dict()

    # Reading first file
    with open(goldstandard_file, 'r', encoding='utf-8') as f:
        for line in f:
            temp = line.strip().split("\t")
            if len(temp) != 2:
                print("The line:", line, "has an incorrect number of tabs")
            else:
                if temp[0] in goldstandard:
                    print(temp[0], " has two solutions")
                goldstandard[temp[0]] = temp[1]

    # Reading second file
    with open(student_file, 'r', encoding='utf-8') as f:
        for line in f:
            temp = line.strip().split("\t")
            if len(temp) != 2:
                print("The line: '", line, "' has an incorrect number of tabs")
            else:
                if temp[0] in student:
                    print(temp[0], " has two solutions")
                student[temp[0]] = temp[1]

    true_pos = 0
    false_pos = 0
    false_neg = 0

    for key in student:
        if key in goldstandard:
            if student[key] == goldstandard[key]:
                true_pos += 1
            else:
                false_pos += 1
                print("You got", key, "wrong. Expected output: ", goldstandard[key], ",given:", student[key])

    for key in goldstandard:
        if key not in student:
            false_neg += 1
            if false_neg < 500:
                print("No solution was given for", key)
            elif false_neg == len(goldstandard):
                print("Other solutions not found...")

    if true_pos + false_pos != 0:
        precision = float(true_pos) / (true_pos + false_pos) * 100.0
    else:
        precision = 0.0

    if true_pos + false_neg != 0:
        recall = float(true_pos) / (true_pos + false_neg + false_pos) * 100.0
    else:
        recall = 0.0

    beta = 0.5

    if precision + recall != 0.0:
        f05 = (1 + beta * beta) * precision * recall / (beta * beta * precision + recall)
    else:
        f05 = 0.0

    # grade = 0.75 * precision + 0.25 * recall
    grade = f05

    print("Precision:", precision)
    print("Recall:", recall)
    print("F0.5:", f05)

In [None]:
# a preprocessed version of the Simple Wikipedia wikipedia-ambiguous.txt,
# which contains ambiguous article titles with their content.
wikipedia_file = "/content/wikipedia-ambiguous.txt"

# development dataset (suffix is readable)
# [ dev_kb_file ] a simplified YAGO knowledge base
# [ dev_result_file ] generate your prediction
# [ dev_gold_file ] a certain number of gold standard samples
dev_kb_file = "/content/dev_yago.tsv"
dev_result_file = "/content/dev_results.tsv"
dev_gold_file = '/content/dev_gold_samples.tsv'

# test dataset (suffix is unreadable)
# [ test_kb_file ] a simplified YAGO knowledge base
# [ test_result_file ] generate your prediction
# [ test_gold_file ] a certain number of gold standard samples
test_kb_file = "/content/test_yago.tsv"
test_result_file = "/content/results.tsv"
test_gold_file = '/content/test_gold_samples.tsv'


In [None]:
# YOUR CODE GOES HERE
def disambiguate(entityName, text, kb):
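    '''
    :param entityName: a string, the name appearing in wikipedia-ambiguous.txt
    :param text: the corresponding article content (context)
    :param kb: knowledge base
    :return: the chosen entity id from this kb, or None if no candidate
             matches or the best score is below the 0.15 threshold

    Relies on the module-level dictionary_of_identifiers (entity id -> label),
    which must be built before this function is called.
    '''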
    entities = []    # candidates whose label contains any word of the name
    entities2 = []   # candidates whose label contains the full name
    text = text.lower()

    # distinct words of the article, in order of first appearance
    text_list = list(dict.fromkeys(text.split()))
    to_replace0 = ['"']
    to_replace = ['>','<','"']
    elements_of_entityName = entityName.split()
    try :
        for identifier,label in dictionary_of_identifiers.items():
            if (entityName in label):
                entities2.append(identifier)
            for element_of_entity_name in elements_of_entityName:
                if element_of_entity_name in label:
                    entities.append(identifier)
        if (len(entities) == 0 ):
            return None

    except (KeyError):
        return None
    else:
        if  (len(entities2) != 0):
            entities = entities2
        scores = []
        for entity in entities:
            elements_of_entity = []
            for attribute in list(kb.facts[entity]):
                try:
                    my_string = list(kb.facts[list(kb.facts[entity][attribute])[0]]['<label>'])[0]
                    for c in to_replace0:
                        my_string = my_string.replace(c,'')
                    new_string = []
                    new_string.append(my_string.lower())
                    new_string = new_string[0].split()
                    for i in new_string:
                        if i not in elements_of_entity:
                            elements_of_entity.append(i)
                except (KeyError):
                    my_string = list(kb.facts[entity][attribute])[0]
                    for c in to_replace:
                        my_string = my_string.replace(c,'')
                    new_string = []
                    new_string.append(my_string.lower())
                    new_string = new_string[0].split()
                    for i in new_string:
                        if i not in elements_of_entity:
                            elements_of_entity.append(i)  
            score = 0
            text_list_for_entity = text_list.copy()
            for element in elements_of_entity:
                if element not in text_list_for_entity:
                    text_list_for_entity.append(element)
                # substring test against the lowercased article text
                if (element in text):
                    score = score + 1
            scores.append(score / len(text_list_for_entity))
        max_index = scores.index(max(scores))
        if (max(scores) < 0.15 ):  
            return None
        return entities[max_index]
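
The scoring idea in `disambiguate()` can be looked at in isolation: count how many words from an entity's facts occur in the article, relative to all distinct words involved. The sketch below is a simplified word-level variant (a Jaccard overlap on word sets), not a line-by-line copy of the function above, which tests substrings of the text. The name `overlap_score` and the sample word lists are illustrative only.

```python
def overlap_score(context_words, entity_words):
    """Jaccard overlap between two collections of words, in [0, 1]."""
    context, entity = set(context_words), set(entity_words)
    union = context | entity
    if not union:
        return 0.0  # nothing to compare
    return len(context & entity) / len(union)


context = "babilonia is a 1987 argentine drama film".split()
film_words = ["argentine", "drama", "film", "1987"]
city_words = ["ancient", "city", "mesopotamia"]
print(overlap_score(context, film_words))  # 4 shared / 7 distinct words
print(overlap_score(context, city_words))  # 0.0
```

A candidate whose facts share no word with the article scores 0 and would fall below any positive threshold.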

In [None]:
def evaluate_on_dev():
    '''
    evaluate your model on the development dataset.
    In the development dataset, each entity name (suffix) is readable.
    :return:
    '''

    # load YAGO knowledge base
    # example: kb.facts["<Babilonia>"]
    kb = KnowledgeBase(dev_kb_file)

    # predict each record and generate results.tsv file
    with open(dev_result_file, 'w', encoding="utf-8") as output:
        for page in Parsy(wikipedia_file):
            result = disambiguate(page.label(), page.content, kb)
            if result is not None:
                output.write(page.title+"\t"+result+"\n")

    # evaluate
    evaluate(dev_result_file, dev_gold_file)


def evaluate_on_test():
    '''
    evaluate your model on the test dataset.
    In the test dataset, each entity name (suffix) is un-readable.
    We hide all suffixes.
    :return:
    '''

    # load YAGO knowledge base
    # example: kb.facts["<Babilonia_1049451>"]
    kb = KnowledgeBase(test_kb_file)
    # predict each record and generate results.tsv file
    with open(test_result_file, 'w', encoding="utf-8") as output:
        for page in Parsy(wikipedia_file):
            result = disambiguate(page.label(), page.content, kb)
            if result is not None:
                output.write(page.title + "\t" + result + "\n")

    # evaluate
    evaluate(test_result_file, test_gold_file)
    

In [None]:
# evaluate
def build_identifier_labels(kb):
    '''
    Map each entity id to one of its labels (quotes stripped).
    Entities without a <label> fact fall back to their id, with the
    angle brackets removed and underscores turned into spaces.
    '''
    labels = {}
    for key in kb.facts:
        try:
            label = list(kb.facts[key]['<label>'])[0].replace('"', '')
        except (KeyError):
            label = key.replace('<', '').replace('>', '').replace('_', ' ')
        labels[key] = label
    return labels


# Development run (readable ids); uncomment to use it instead of the test run.
# kb = KnowledgeBase(dev_kb_file)
# dictionary_of_identifiers = build_identifier_labels(kb)
# evaluate_on_dev()

kb = KnowledgeBase(test_kb_file)
dictionary_of_identifiers = build_identifier_labels(kb)
evaluate_on_test()



Loading /content/test_yago.tsv...done
Loading /content/test_yago.tsv...done
You got <Willmar_1> wrong. Expected output:  <Willmar,_Minnesota_1032106> ,given: <Willmar_Township,_Kandiyohi_County,_Minnesota_1070559>
You got <Ashok_Kumar_2> wrong. Expected output:  <Ashok_Kumar_1063632> ,given: <Ashok_Kumar_1037344>
You got <Ashok_Kumar_3> wrong. Expected output:  <Ashok_Kumar_1006860> ,given: <Ashok_Kumar_1060537>
You got <Ashok_Kumar_5> wrong. Expected output:  <Ashok_Kumar_1053957> ,given: <Ashok_Kumar_1067851>
You got <Mortal_Kombat_2> wrong. Expected output:  <Mortal_Kombat_1062946> ,given: <Mortal_Kombat_1039139>
You got <Mortal_Kombat_3> wrong. Expected output:  <Mortal_Kombat_1087923> ,given: <Mortal_Kombat_1039139>
You got <Epirus_1> wrong. Expected output:  <Epirus_1048279> ,given: <Deidamia_I_of_Epirus_1044562>
You got <Epirus_2> wrong. Expected output:  <Epirus_1016742> ,given: <Deidamia_I_of_Epirus_1044562>
You got <Calla_1> wrong. Expected output:  <Calla_1020419> ,given: <C