## Extract type facts from a Wikipedia file


### === Purpose ===

The goal of this lab is to extract the class to which an entity belongs from Wikipedia.
For example, given the Wikipedia article about Leicester:

    Leicester is a small city in England
    
the goal is to extract:

    Leicester TAB city


### === Provided Data ===

We provide:

1. a preprocessed version of Simple Wikipedia (`wikipedia-first.txt`), which gives the title and first sentence of each article, as in the example above;
2. a template for your code, `extractor.py`;
3. a gold standard sample (`gold-standard-sample.tsv`).


### === Task ===

Complete the `extract_type()` function so that it extracts the type of the article entity from the content.
For example, given the content "Leicester is a beautiful English city in the UK", it should return "city".
Exclude terms that are too abstract ("member of...", "way of..."), and try to extract exactly the noun(s).
You can also skip articles (e.g. `return None`) if you are not sure or if the text does not contain any type.
In order to ensure a fair evaluation, do not use any non-standard Python libraries except `nltk` (`pip install nltk`).

Input:

    April
    April is the fourth month of the year with 30 days.

Output:

    April TAB month
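
To make the expected behaviour concrete, here is a minimal baseline sketch that uses only the standard library. It is not the intended solution: the names `naive_type`, `STOP_WORDS` and `ABSTRACT` are hypothetical, the stop-word list is an assumption, and a real submission would most likely rely on `nltk` part-of-speech tags instead of a fixed word list.

```python
import re

# Hypothetical word lists, chosen for illustration only.
STOP_WORDS = {"of", "in", "on", "at", "for", "with", "from", "by",
              "that", "which", "who", "and", "or", "to"}
ABSTRACT = {"member", "way"}  # the abstract terms named in the task description

# Matches "<is|are|was|were> <a|an|the> <rest of sentence>".
COPULA = re.compile(r"\b(?:is|are|was|were)\s+(?:an?|the)\s+(.*)", re.IGNORECASE)


def naive_type(sentence):
    """Return the last word before the first stop word after the copula, or None."""
    match = COPULA.search(sentence)
    if match is None:
        return None
    head = []
    for word in re.findall(r"[A-Za-z]+", match.group(1)):
        if word.lower() in STOP_WORDS:
            break
        head.append(word)
    if not head:
        return None
    candidate = head[-1].lower()
    # Skip the article rather than return a term that is too abstract.
    return None if candidate in ABSTRACT else candidate


print(naive_type("Leicester is a beautiful English city in the UK"))
print(naive_type("April is the fourth month of the year with 30 days."))
```

Returning `None` for abstract or missing types trades recall for precision, which suits a score that weights precision more heavily.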


### === Development and Testing ===

We provide a small set of gold samples for validating your model.
The final score is an F-beta measure, computed with the following equation:

`F_beta = (1 + beta * beta) * precision * recall / (beta * beta * precision + recall)`

with `beta = 0.5`, which puts more weight on precision than on recall.
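
As a worked example of the formula, the sketch below computes precision, recall and the F-beta score from raw counts. It assumes the textbook definitions of precision and recall and returns fractions in [0, 1]; the provided `eval_f1` reports percentages and also counts wrong answers in the recall denominator, so its numbers can differ. The function name `f_beta` and the counts are illustrative.

```python
def f_beta(true_pos, false_pos, false_neg, beta=0.5):
    """Return (precision, recall, F_beta) computed from raw counts."""
    predicted = true_pos + false_pos   # answers that were given
    expected = true_pos + false_neg    # answers that should have been found
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / expected if expected else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    score = (1 + beta * beta) * precision * recall / (beta * beta * precision + recall)
    return precision, recall, score


# Illustrative counts: 8 correct answers, 2 wrong answers, 6 missed entries.
precision, recall, f05 = f_beta(8, 2, 6, beta=0.5)
_, _, f1 = f_beta(8, 2, 6, beta=1.0)
print(precision, recall, f05, f1)
```

With these counts precision (0.8) is higher than recall (about 0.571), so the score with `beta = 0.5` (about 0.741) is higher than the balanced score with `beta = 1` (about 0.667).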


### === Submission ===

1. Gather your code, any resources needed to run it, and the output of your code on the test dataset (there is no need to include the other datasets).
2. ZIP these files into an archive called `firstName_lastName.zip`.
3. Submit it here before the deadline announced during the lab:


https://www.dropbox.com/request/n1pxRxyUHuq9w1ewOzWB


### === Contact ===

If you have any additional questions, you can send an email to: nedeljko.radulovic@telecom-paris.fr


In [None]:
"""
Don't modify this code.
"""

import sys

class Page:
    '''
    This class is used to store title and content of a wiki page
    '''
    __author__ = "Jonathan Lajus"

    def __init__(self, title, content):
        self.content = content
        self.title = title
        if sys.version_info[0] < 3:
            self.title = title.decode("utf-8")
            self.content = content.decode("utf-8")

    def __eq__(self, other):
        return isinstance(other, self.__class__) and self.title == other.title and self.content == other.content

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return hash((self.title, self.content))

    def __str__(self):
        return 'Wikipedia page: "' + (self.title.encode("utf-8") if sys.version_info[0] < 3 else self.title) + '"'

    def __repr__(self):
        return self.__str__()

    def _to_tuple(self):
        return (self.title, self.content)


class Parsy:
    '''
    Parse a Wikipedia file, return page objects
    '''
    __author__ = "Jonathan Lajus"

    def __init__(self, wikipediaFile):
        self.file = wikipediaFile

    def __iter__(self):
        title, content = None,""
        with open(self.file, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line and title is not None:
                    yield Page(title, content.rstrip())
                    title, content = None,""
                elif title is None:
                    title = line
                elif title is not None:
                    content += line + " "

    
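def eval_f1(gold_file, pred_file, debug=False):
    # `debug` is read below when reporting malformed lines; it defaults to False
    # so that the function does not raise a NameError.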

    # Dictionaries
    goldstandard = dict()
    student = dict()

    # Reading first file
    with open(gold_file, 'r', encoding="utf-8") as f:
        for line in f:
            temp = line.split("\t")
            if len(temp) != 2:
                print("The line:", line, "has an incorrect number of tabs")
            else:
                if temp[0] in goldstandard:
                    print(temp[0], " has two solutions")
                goldstandard[temp[0]] = str.lower(temp[1])

    # Reading second file
    with open(pred_file, 'r', encoding="utf-8") as f:
        for line in f:
            temp = line.split("\t")
            if len(temp) != 2:
                if not debug:
                    print("Comment :=>> The line: '", line, "' has an incorrect number of tabs")
                else:
                    print("The line: '", line, "' has an incorrect number of tabs")
            else:
                if temp[0] in student:
                    if not debug:
                        print("Comment :=>>", temp[0], "has two solutions")
                    else:
                        print(temp[0], " has two solutions")
                student[temp[0]] = str.lower(temp[1])

    true_pos = 0
    false_pos = 0
    false_neg = 0

    for key in student:
        if key in goldstandard:
            if student[key] == goldstandard[key]:
                true_pos += 1
            else:
                false_pos += 1
                print("You got", key, "wrong. Expected output: ", goldstandard[key], ",given:", student[key])

    for key in goldstandard:
        if key not in student:
            false_neg += 1
            print("No solution was given for", key)

    if true_pos + false_pos != 0:
        precision = float(true_pos) / (true_pos + false_pos) * 100.0
    else:
        precision = 0.0

    if true_pos + false_neg != 0:
        recall = float(true_pos) / (true_pos + false_neg + false_pos) * 100.0
    else:
        recall = 0.0

    beta = 0.5

    if precision + recall != 0.0:
        f05 = (1 + beta * beta) * precision * recall / (beta * beta * precision + recall)
    else:
        f05 = 0.0

    # grade = 0.75 * precision + 0.25 * recall
    grade = f05

    print("Comment :=>>", "Precision:", precision)
    print("Comment :=>>", "Recall:", recall)
    print("Simulated Grade (F0.5) :=>>", grade)


In [None]:
# a simplified wiki page document
wiki_file = '/content/wikipedia-first.txt'
# some gold samples for validation
gold_file = '/content/gold-standard-sample.tsv'
# predicted results generated by your model
# you are supposed to submit this file
result_file = 'results.tsv'

In [None]:
import nltk
nltk.download("popular")

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

True

In [None]:
import nltk
from nltk.tokenize import word_tokenize

NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}

def rule(grammar, tag):
    # Chunk the tagged sentence with the given grammar and collect the text of each NP chunk.
    cp = nltk.RegexpParser(grammar)
    result = cp.parse(tag)
    docs = []
    for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
        docs.append(" ".join(word for (word, _) in subtree.leaves()))
    if not docs:
        return 0  # 0 signals "no match" to the caller
    # Re-tag the first chunk and return its last token if that token is a noun.
    text2 = word_tokenize(docs[0])
    tagged = nltk.pos_tag(text2)
    if tagged[-1][1] in NOUN_TAGS:
        return text2[-1]
    return 0

In [None]:
from nltk import pos_tag
from nltk.tokenize import word_tokenize


def extract_type(wiki_page):
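    '''
    Extract the type (class) of the entity described by a wiki page.

    :param wiki_page: a Page object holding a title and the first sentence of its wiki page.
    :return: the extracted type as a string, or None if no type was found.
    '''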
    title = wiki_page.title
    content = wiki_page.content

    # Keep only alphanumerics, spaces and apostrophes; the last character is always kept.
    last = len(content) - 1
    content = "".join(
        ch for i, ch in enumerate(content)
        if ch.isalnum() or ch == ' ' or ch == "'" or i == last
    )

    # Code goes here
    text = word_tokenize(content)
    tag = nltk.pos_tag(text)
    
    # Drop tokens that tend to confuse the chunker before re-tagging.
    # Note: str.replace removes every occurrence of the matched text.
    for i, (word, pos) in enumerate(tag):
        if word == 'of':
            content = content.replace(word + ' ', '')
        if pos in ('RP', 'FW', 'CD'):  # particles, foreign words, cardinal numbers
            content = content.replace(word + ' ', '')
        if pos == 'POS' or word == "'s":
            # Remove the possessive marker and the word that precedes it.
            content = content.replace(tag[i-1][0], '')
            content = content.replace(word, '')

    text = word_tokenize(content)
    tag = nltk.pos_tag(text)


    # Chunk patterns, tried in order. Each is labelled NP because rule() collects NP chunks.
    # 1: subject, optional modal, verb(s), then determiner, adjectives and nouns.
    grammar1 = "NP: {<DT>?<NN|NNS|NNP|NNPS>*<MD>?<RB>?<VB|VBD|VBN|VBP|VBZ>+<DT>?<RBS|RBR>?<JJ|JJR|VBG|JJS|PRP$>*<NN|NNS>*}"
    # 2: as 1, with a "<verb> to <verb>" construction before the noun phrase.
    grammar2 = "NP: {<DT>?<NN|NNS|NNP|NNPS>*<MD>?<RB>?<VB|VBD|VBN|VBP|VBZ>+<TO><RB>?<VB|VBD|VBN|VBP|VBZ>+<DT>?<RBS|RBR>?<JJ|JJR|VBG|JJS|PRP$>*<NN|NNS>*}" # could, will
    # 3: fallback in which the verb is optional.
    grammar3 = "NP: {<DT>?<NN|NNS|NNP|NNPS>*<RB>?<VB|VBD|VBN|VBP|VBZ>*<DT>?<RBS|RBR>?<JJ|JJR|VBG|JJS|PRP$>*<NN|NNS>*}"

    grammars = [grammar1, grammar2, grammar3]

    for grammar in grammars:
        answer = rule(grammar, tag)
        if answer != 0:
            return answer

    return None 

In [None]:
def run():
    '''
    First, extract types from each sentence in the wiki file
    Next, use gold samples to evaluate your model
    :return:
    '''
    with open(result_file, 'w', encoding="utf-8") as output:
        for page in Parsy(wiki_file):
            typ = extract_type(page)
            if typ:
                output.write(page.title + "\t" + typ + "\n")

    # Evaluate on some gold samples for checking your model
    eval_f1(gold_file, result_file)


run()

You got Army wrong. Expected output:  military
 ,given: part

You got Bath wrong. Expected output:  thing
 ,given: people

You got Hillary Rodham Clinton wrong. Expected output:  senator
 ,given: junior

You got Tropical cyclone wrong. Expected output:  weather
 ,given: name

You got Ritual wrong. Expected output:  actions
 ,given: people

You got Medusa (animal) wrong. Expected output:  animals
 ,given: forms

You got Mud wrong. Expected output:  mixture
 ,given: dirt

You got Artillery wrong. Expected output:  guns
 ,given: word

You got Communication Studies wrong. Expected output:  study
 ,given: college

You got Phoenicia wrong. Expected output:  civilization ,given: civilization

Comment :=>> Precision: 88.88888888888889
Comment :=>> Recall: 88.88888888888889
Simulated Grade (F0.5) :=>> 88.88888888888889
