# Notebook 6 - Knowledge Representation and NER

CSI4106 Artificial Intelligence  
Fall 2019  
Prepared by Caroline Barrière and Julian Templeton

***INTRODUCTION***:  

In this notebook you will explore one of the techniques and one of the resources that you have seen in class for Named-Entity Recognition (NER). Instead of the typical PER, LOC, ORG entity types, we will go into the restaurant world and focus on entities of type *Cuisine*, *Dish* and others.

You will begin by loading our train and test corpora of restaurant-related sentences, and then move on to using a Gazetteer for NER of the words in the corpus.   

Then you will explore Wordnet, a lexical semantic network in which knowledge is organized by interrelated synsets (groups of synonyms). Using Wordnet, you will use the hyponyms of the word *dish* to decide which tokens to tag as *Dish*.   

An important note is that this notebook works with *Single Words* rather than *Multi-words*. This means that when tagging words you will not be able to recover every tag. For example, in the sentence "Find me a restaurant where they serve fettucini alfredo", the tag *Cuisine* should cover several words (fettucini alfredo), but looking at single words, only fettucini may be tagged correctly. Dealing with multi-words is quite complex, which is why we limit ourselves to single words in this notebook, but keep in mind that we would typically want to consider several words at once.

Also, the evaluation in this notebook is more qualitative in nature.  We will not perform the usual quantitative methods of precision/recall, but rather just look at examples of output and discuss.

***HOMEWORK***:  
Go through the notebook by running each cell, one at a time.  
Look for **(TO DO)** for the tasks that you need to perform. Do not edit the code outside of the questions which you are asked to answer unless specifically asked. Once you're done, sign the notebook (at the end of the notebook), and submit it.  

*The notebook will be marked out of 15.  
Each **(TO DO)** has a number of points associated with it.*
***

In [1]:
import re # For the regular expressions that we will be using
from nltk.corpus import wordnet # Import Wordnet

**1. Setting up the corpus**   
Before working on our NER tasks we must set up the corpus that we will be using. For this notebook we will be working with a corpus related to restaurants.   

The restaurant corpus is provided on Brightspace but is also available at: https://groups.csail.mit.edu/sls/downloads/ as the *MIT Restaurant Corpus*. This corpus provides the tokens of each sentence one per line, with sentences separated by a blank line and each token given an NER tag such as *Cuisine*, *Dish*, *Other (O)*, ... We will explore these soon.
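
To make the file format concrete, here is a minimal sketch of how such lines can be grouped into sentences. The sample lines are hypothetical stand-ins written in the tag-then-token layout that the loading code in this notebook assumes, and `parse_bio` is an illustrative helper name, not part of the notebook.

```python
import re

# Hypothetical excerpt in the assumed layout: "<BIO tag><whitespace><token>",
# with a blank line between sentences.
sample_lines = [
    "B-Rating\t5\n",
    "I-Rating\tstar\n",
    "O\tresturants\n",
    "\n",
    "B-Dish\tpancakes\n",  # last sentence, no trailing blank line
]

def parse_bio(lines):
    """Group token lines into [tags, tokens] pairs, one pair per sentence."""
    sentences, tags, tokens = [], [], []
    for line in lines:
        vals = line.split()
        if not vals:                      # blank line closes the current sentence
            if tokens:
                sentences.append([tags, tokens])
            tags, tokens = [], []
        else:
            tags.append(re.sub(r"^[BI]-", "", vals[0]))  # drop the B-/I- prefix
            tokens.append(vals[1])
    if tokens:                            # keep a final sentence without a blank line after it
        sentences.append([tags, tokens])
    return sentences

parsed = parse_bio(sample_lines)
print(parsed)
```

Each sentence comes out as a pair of parallel lists, tags first and tokens second, which is the same layout used throughout the rest of this notebook.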

In [2]:
# Begin collecting the sentences and tags from restauranttrain.bio.txt
corpus_train = []
with open("restauranttrain.bio.txt") as f:
    corpus_train = f.readlines()
    
# Begin collecting the sentences and tags from restauranttest.bio.txt
corpus_test = []
with open("restauranttest.bio.txt") as f:
    corpus_test = f.readlines()

In [3]:
# We will convert the corpus into pairs of sentence tokens and their respective tags.
# We will also limit the minimum length of sentences that we keep along with the total number
# of sentences.
def setup_sentences(corpus, min_sentence_len, max_sentences):
    sentences = [] # All of our tokenized sentences with their respective NER tags
    sentence = [[], []] # Storing the NER tags along with the sentences in two separate lists
    # Format our corpus as a list of lists for easy use later
    for line in corpus:
        vals = line.split()
        if (vals == []):
            sentences.append(sentence)
            sentence = [[], []]
        else:
            sentence[0].append(re.sub(r"[A-Z]-", "", vals[0])) # NER tag (we remove prefixes for simplicity)
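            sentence[1].append(vals[1]) # Token
    # Keep the last sentence when the file does not end with a blank line
    if sentence[1]:
        sentences.append(sentence)
    # Now we will only keep sentences containing at least min_sentence_len words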
    sentences = [sentence for sentence in sentences if len(sentence[1]) >= min_sentence_len]
    # Finally return up to max_sentences sentences
    return sentences[:max_sentences]

In [4]:
train_sentences = setup_sentences(corpus_train, 5, 1000)
test_sentences = setup_sentences(corpus_test, 5, 500)

In [5]:
print(len(train_sentences))
print(len(test_sentences))

1000
500


In [6]:
# Let us take a look at a few sample sentences and their NER tags from train_sentences
# We will discuss the tags themselves very soon!
print(train_sentences[1][1], "\n", train_sentences[1][0], "\n")
print(train_sentences[10][1], "\n", train_sentences[10][0], "\n")
print(train_sentences[100][1], "\n", train_sentences[100][0], "\n")
print(train_sentences[500][1], "\n", train_sentences[500][0], "\n")

['5', 'star', 'resturants', 'in', 'my', 'town'] 
 ['Rating', 'Rating', 'O', 'Location', 'Location', 'Location'] 

['about', 'how', 'much', 'is', 'a', 'midpriced', 'bottle', 'of', 'good', 'wine', 'at', 'davidos', 'italian', 'palace'] 
 ['O', 'O', 'O', 'O', 'O', 'Price', 'Rating', 'O', 'Rating', 'Cuisine', 'O', 'Restaurant_Name', 'Restaurant_Name', 'Restaurant_Name'] 

['are', 'there', 'any', 'authentic', 'vietnamese', 'restaurants', 'in', 'the', 'area', 'that', 'specialize', 'in', 'regional', 'dishes'] 
 ['O', 'O', 'O', 'Cuisine', 'Cuisine', 'O', 'Location', 'Location', 'Location', 'O', 'O', 'O', 'Cuisine', 'O'] 

['can', 'i', 'get', 'a', 'list', 'of', 'restaurants', 'that', 'are', 'still', 'open', 'that', 'serve', 'pancakes'] 
 ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'Hours', 'Hours', 'O', 'O', 'Dish'] 



**2. Understanding the corpus**   
Let us explore the corpus to look at the kinds of NER tags being used.

In [7]:
tag_dict = { }
# Collect all unique tags via a dictionary
for tags, sentence in train_sentences:
    for tag in tags:
        tag_dict[tag] = True
        
tags = []
# Output all tags used
for key in tag_dict:
    print(key)
    tags.append(key)

Rating
O
Amenity
Location
Restaurant_Name
Price
Hours
Dish
Cuisine


As you can see, there are 9 different NER tags being used. Note that the tag *O* means Other.   
So when we look at the sentence below, we can see that each word of a sentence in our corpus has an NER tag:  
['5', 'star', 'resturants', 'in', 'my', 'town']      
['Rating', 'Rating', 'O', 'Location', 'Location', 'Location']   

For example '5' is tagged *Rating*, 'in' is tagged *Location*, 'town' is tagged *Location*
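
As a small illustration of the tag inventory, the sketch below counts how often each tag occurs in two of the sample sentences printed earlier. The variable names are illustrative; only the sentences themselves come from the notebook output.

```python
from collections import Counter

# Two [tags, tokens] pairs copied from the sample output earlier in the notebook
sample = [
    [['Rating', 'Rating', 'O', 'Location', 'Location', 'Location'],
     ['5', 'star', 'resturants', 'in', 'my', 'town']],
    [['O'] * 9 + ['Hours', 'Hours', 'O', 'O', 'Dish'],
     ['can', 'i', 'get', 'a', 'list', 'of', 'restaurants', 'that', 'are',
      'still', 'open', 'that', 'serve', 'pancakes']],
]

# Count every tag occurrence across all sentences
tag_counts = Counter(tag for tags, _ in sample for tag in tags)

# The distinct tags, in alphabetical order
unique_tags = sorted(tag_counts)

print(unique_tags)
print(tag_counts.most_common())
```

Even in this tiny sample, *O* dominates the counts, which is typical for NER data where most tokens are not part of any entity.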

In [8]:
# Given a tag and list of tag/sentence lists, returns the list of all unique words with that
# NER tag.
def tag_words(target_tag, sentences):
    words = []
    for tags, sentence in sentences:
        for tag, word in zip(tags, sentence):
            if (target_tag == tag) and (word not in words):
                words.append(word)
    words.sort()
    return words

**(TO DO) Q1 - 2 marks**   
Using the function *tag_words()* above, explore the words associated with each type of named entity from *train_sentences*.  Write a few lines of code to output a few examples of *each type* of entity, as well as the number of different words for each type of entity.

In [9]:
# Give some examples (words) for each type of entity along with the number of different words
# tagged with each entity in train_sentences
print("Number of different words for entity 'Rating': "+ str(len(tag_words('Rating', train_sentences))))
print("Examples of entity 'Rating': "+ str(tag_words('Rating', train_sentences)))

print("\nNumber of different words for entity 'O': "+ str(len(tag_words('O', train_sentences))))
print("Examples of entity 'O': "+ str(tag_words('O', train_sentences)))

print("\nNumber of different words for entity 'Amenity': "+ str(len(tag_words('Amenity', train_sentences))))
print("Examples of entity 'Amenity': "+ str(tag_words('Amenity', train_sentences)))

print("\nNumber of different words for entity 'Location': "+ str(len(tag_words('Location', train_sentences))))
print("Examples of entity 'Location': "+ str(tag_words('Location', train_sentences)))

print("\nNumber of different words for entity 'Restaurant_Name': "+ str(len(tag_words('Restaurant_Name', train_sentences))))
print("Examples of entity 'Restaurant_Name': "+ str(tag_words('Restaurant_Name', train_sentences)))

print("\nNumber of different words for entity 'Price': "+ str(len(tag_words('Price', train_sentences))))
print("Examples of entity 'Price': "+ str(tag_words('Price', train_sentences)))

print("\nNumber of different words for entity 'Hours': "+ str(len(tag_words('Hours', train_sentences))))
print("Examples of entity 'Hours': "+ str(tag_words('Hours', train_sentences)))

print("\nNumber of different words for entity 'Dish': "+ str(len(tag_words('Dish', train_sentences))))
print("Examples of entity 'Dish': "+ str(tag_words('Dish', train_sentences)))

print("\nNumber of different words for entity 'Cuisine': "+ str(len(tag_words('Cuisine', train_sentences))))
print("Examples of entity 'Cuisine': "+ str(tag_words('Cuisine', train_sentences)))


Number of different words for entity 'Rating': 58
Examples of entity 'Rating': ['2', '3', '4', '5', 'a', 'at', 'best', 'bottle', 'busy', 'crowd', 'delicious', 'descent', 'excellent', 'famous', 'favorite', 'few', 'first', 'five', 'food', 'four', 'good', 'great', 'health', 'high', 'highest', 'highly', 'least', 'local', 'month', 'nice', 'nicest', 'notable', 'on', 'past', 'pleasing', 'quality', 'rate', 'rated', 'rating', 'ratings', 'reviewed', 'reviews', 'score', 'secret', 'service', 'star', 'starred', 'stars', 'start', 'suggestions', 'superior', 'top', 'toppings', 'tripadvisor', 'views', 'well', 'wonderful', 'zagat']

Number of different words for entity 'O': 367
Examples of entity 'O': ['00', '1', '100', '2', '20', '3', '30', '4', '6', '8', '98', 'a', 'able', 'ablle', 'about', 'accept', 'address', 'adress', 'all', 'allow', 'allows', 'along', 'alot', 'also', 'although', 'am', 'an', 'and', 'anniversary', 'any', 'anybody', 'anyplace', 'anything', 'anywhere', 'are', 'area', 'around', 'arrang

**3. Using Gazetteers for NER**    
Now we will explore using Gazetteers for NER. Specifically, we will be using Wikipedia's list of Cuisines along with Edit Distance to tag tokens as *Cuisine*.

***3.1. Setting up the Gazetteer***    
The first thing to do is to set up the Gazetteer from Wikipedia's list of Cuisines (https://en.wikipedia.org/wiki/List_of_cuisines). We could scrape the webpage and extract these, but to keep things simple the notebook includes the file *cuisine_gazetteer.txt*, which we will use to load the Gazetteer.

In [10]:
# Setup the Gazetteer
gazetteer = []
with open("cuisine_gazetteer.txt") as f:
    gazetteer = f.read().splitlines()

# Since we are working with Single Words, we will only keep the first word 
# of each entry in our Gazetteer.
gazetteer = list(set([cuisine.split()[0].lower() for cuisine in gazetteer]))
gazetteer.sort()
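
The effect of keeping only the first word can be seen on a few made-up entries. The sketch below uses hypothetical lines standing in for the contents of *cuisine_gazetteer.txt* (the real file is not reproduced here) and applies the same normalisation, with an added guard for empty lines.

```python
# Hypothetical raw entries; the actual file contents may differ
raw_entries = [
    "Italian cuisine",
    "Hong Kong cuisine",
    "New Mexican cuisine",
    "Anglo-Indian cuisine",
    "italian food",   # duplicate once lower-cased and truncated
    "",               # an empty line would make split()[0] fail, so it is skipped
]

# First word of each non-empty entry, lower-cased, de-duplicated, and sorted
single_word_terms = sorted({entry.split()[0].lower() for entry in raw_entries if entry.strip()})

print(single_word_terms)
```

This is why fragments such as 'hong' and 'new' can end up in a single-word Gazetteer: they are the first words of multi-word names, and on their own they no longer identify a cuisine.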

In [11]:
# Look at the contents of the Gazetteer
gazetteer

['ainu',
 'albanian',
 'andhra',
 'anglo-indian',
 'arab',
 'argentine',
 'armenian',
 'assyrian',
 'awadhi',
 'azerbaijani',
 'balochi',
 'bangladeshi',
 'belarusian',
 'bengali',
 'berber',
 'brazilian',
 'buddhist',
 'bulgarian',
 'cajun',
 'cantonese',
 'caribbean',
 'chechen',
 'chinese',
 'circassian',
 'crimean',
 'cypriot',
 'danish',
 'english',
 'estonian',
 'filipino',
 'french',
 'georgian',
 'german',
 'goan',
 'greek',
 'gujarati',
 'hong',
 'hyderabad',
 'indian',
 'indonesian',
 'inuit',
 'irish',
 'italian',
 'jamaican',
 'japanese',
 'jewish',
 'karnataka',
 'kazakh',
 'keralite',
 'korean',
 'kurdish',
 'laotian',
 'latvian',
 'lebanese',
 'lithuanian',
 'louisiana',
 'maharashtrian',
 'malay',
 'malaysian',
 'mangalorean',
 'mediterranean',
 'mexican',
 'mordovian',
 'mughal',
 'native',
 'nepalese',
 'new',
 'odia',
 'pakistani',
 'parsi',
 'pashtun',
 'pennsylvania',
 'peranakan',
 'persian',
 'peruvian',
 'polish',
 'portuguese',
 'punjabi',
 'rajasthani',
 'roma

***3.2. Tagging Cuisines using Edit Distance***    
Now that we have the Gazetteer, we need to define a metric to determine whether a token should be tagged as *Cuisine*. To do this we will use *Edit Distance*.   
Recall that for regular edit distance we only look at *Deletion* (+1), *Insertion* (+1), and *Replacement* (+1)

In [12]:
# Calculates the Edit Distance between two words s1 and s2
def edit_distance(s1, s2):
    m=len(s1)+1
    n=len(s2)+1

    tbl = {}
    for i in range(m): tbl[i,0]=i
    for j in range(n): tbl[0,j]=j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i,j] = min(tbl[i, j-1]+1, tbl[i-1, j]+1, tbl[i-1, j-1]+cost)

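    # Read the result from the last cell explicitly; this also works when
    # s1 or s2 is empty, in which case the loop variables are never set.
    return tbl[m-1, n-1]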

In [13]:
# Let us check the edit-distance with a few examples
print("Edit Distance between tuesday and thursday:", edit_distance("tuesday", "thursday"))
print("Edit Distance between artificial and thursday:", edit_distance("artificial", "thursday"))
print("Edit Distance between fresh and frozen:", edit_distance("fresh", "frozen"))

Edit Distance between tuesday and thursday: 2
Edit Distance between artificial and thursday: 8
Edit Distance between fresh and frozen: 4
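
To see how the distance is built up, it can help to print the whole dynamic-programming table rather than only its last cell. The following is a minimal sketch; `edit_distance_table` is an illustrative helper, not part of the notebook, and it uses a list of lists instead of a dictionary.

```python
def edit_distance_table(s1, s2):
    """Return the full dynamic-programming table for the edit distance of s1 and s2."""
    rows, cols = len(s1) + 1, len(s2) + 1
    tbl = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        tbl[i][0] = i              # delete the first i characters of s1
    for j in range(cols):
        tbl[0][j] = j              # insert the first j characters of s2
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            tbl[i][j] = min(tbl[i][j - 1] + 1,          # insertion
                            tbl[i - 1][j] + 1,          # deletion
                            tbl[i - 1][j - 1] + cost)   # replacement (or match)
    return tbl

table = edit_distance_table("crab", "arab")
distance = table[-1][-1]   # bottom-right cell holds the final distance

for row in table:
    print(row)
print("distance:", distance)
```

The bottom-right cell is the edit distance. Because a word can always be turned into another by replacing every aligned character and inserting or deleting the rest, the distance never exceeds the length of the longer word.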


**(TO DO) Q2 - 2 marks**      
With an understanding of the edit_distance function, we will define the function *ner_ed*, which computes the edit distance between every token in our corpus and each item in our Gazetteer. If the edit distance between a token and a cuisine from our Gazetteer is less than or equal to a specified threshold (an integer), we give the token the same tag as the entries in our Gazetteer (in this case, *Cuisine*) by appending it to the list of words with that tag. Specifically, we append the token and the gazetteer term as 'token/term'. The function returns a sorted list of the words that are determined to have the tag specified by the Gazetteer.    

Below is a partially completed implementation of ner_ed which you must complete. Based on the description above, add the appropriate loop structure so that the edit_distance logic is computed correctly (that logic is already there and does not need editing; you only need to set up the loops around it). Recall that the structure of sentences is a list of lists ([[tags], [tokens]]). Also note that we will call a word from a Gazetteer a *term* and a word from a sentence a *token*.

In [14]:
# Given a list of tag/sentence lists, a gazetteer, and a threshold, compute the edit distance
# between each token from each sentence and each cuisine from the gazetteer. If the edit distance
# is <= the provided threshold, we collect the word as one that would be tagged with the same tag
# as the Gazetteer content.
# In this section the tag would be Cuisine since the gazetteer is based on cuisines.
def ner_ed(sentences, gazetteer, threshold):
    words = []
    # TO DO - Setup the three loops:
    # Loop 1 (x is not used below, so every pass repeats the same comparisons)
    for x in sentences:
        # Loop 2 (introduces the variable token used below);
        # the candidates are the unique words tagged 'Cuisine' in sentences
        for token in tag_words('Cuisine', sentences):
            # Loop 3 (introduces the variable term used below)
            for term in gazetteer:
                # Computing the edit distance between sentence tokens and 
                # the Gazetteer terms.
                if (edit_distance(token, term) <= threshold):
                    if (token + "/" + term not in words):
                        words.append(token + "/" + term)
                    # Stop at the first gazetteer term within the threshold
                    break
    # Sort the list
    words.sort()
    return words
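
For comparison, here is a minimal sketch of the loop structure that the description asks for, where every token of every sentence is compared against the Gazetteer. The name `ner_ed_all_tokens`, the compact `edit_distance` helper, and the toy sentence are assumptions made for illustration; this is not the function that produced the outputs shown in this notebook.

```python
def edit_distance(s1, s2):
    """Edit distance with unit costs, keeping only two rows of the table."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(curr[j - 1] + 1,      # insertion
                            prev[j] + 1,          # deletion
                            prev[j - 1] + cost))  # replacement (or match)
        prev = curr
    return prev[-1]

def ner_ed_all_tokens(sentences, gazetteer, threshold):
    """Collect 'token/term' pairs for every token within the threshold of a term."""
    matches = set()
    for tags, tokens in sentences:        # Loop 1: each [tags, tokens] pair
        for token in tokens:              # Loop 2: each token of the sentence
            for term in gazetteer:        # Loop 3: each gazetteer term
                if edit_distance(token, term) <= threshold:
                    matches.add(token + "/" + term)
                    break                 # keep only the first matching term
    return sorted(matches)

# Toy input: one sentence in the [tags, tokens] layout used in this notebook
toy_sentences = [[['Price', 'Cuisine', 'O'], ['cheap', 'portugues', 'food']]]
toy_gazetteer = ['portuguese', 'thai']

exact = ner_ed_all_tokens(toy_sentences, toy_gazetteer, threshold=0)
fuzzy = ner_ed_all_tokens(toy_sentences, toy_gazetteer, threshold=1)
print(exact)
print(fuzzy)
```

Note that the gold tags are ignored here: a tagger should decide from the tokens alone, and the tags are only needed afterwards for evaluation. The cells that follow continue to use `ner_ed` as defined above.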

With *ner_ed* defined, let's test it using three different thresholds on *test_sentences*.

In [15]:
# The calls below give sorted lists of the words that are tagged as Cuisine
# This may take 40+ seconds (depending on your CPU), so be patient

# Threshold = 0
gz_ed_th0_words = ner_ed(test_sentences, gazetteer, threshold=0)
# Threshold = 1
gz_ed_th1_words = ner_ed(test_sentences, gazetteer, threshold=1)
# Threshold = 2
gz_ed_th2_words = ner_ed(test_sentences, gazetteer, threshold=2)

In [16]:
print("Distance 0")
print(gz_ed_th0_words)
print("Distance 1")
print(gz_ed_th1_words)
print("Distance 2")
print(gz_ed_th2_words)

Distance 0
['brazilian/brazilian', 'cajun/cajun', 'chinese/chinese', 'french/french', 'german/german', 'greek/greek', 'indian/indian', 'irish/irish', 'italian/italian', 'japanese/japanese', 'korean/korean', 'malaysian/malaysian', 'mexican/mexican', 'spanish/spanish', 'thai/thai', 'turkish/turkish', 'vietnamese/vietnamese']
Distance 1
['brazilian/brazilian', 'cajun/cajun', 'chinese/chinese', 'crab/arab', 'french/french', 'german/german', 'greek/greek', 'indian/indian', 'irish/irish', 'italian/italian', 'japanese/japanese', 'korean/korean', 'malaysian/malaysian', 'mexican/mexican', 'portugues/portuguese', 'spanish/spanish', 'thai/thai', 'turkish/turkish', 'vietnamese/vietnamese']
Distance 2
['american/mexican', 'and/ainu', 'bars/parsi', 'brazilian/brazilian', 'burger/berber', 'cajun/cajun', 'chicken/chechen', 'chinese/chinese', 'crab/arab', 'french/french', 'fruit/inuit', 'german/german', 'greek/greek', 'indian/indian', 'irish/irish', 'italian/italian', 'japanese/japanese', 'korean/korea

**(TO DO) Q3 - 4 marks**   
1) Describe what starts to happen as the threshold increases. Are any other cuisines found by increasing the threshold (That weren't previously there)?  
2) A good threshold should be relative to the length of the term. This means that rather than using an integer threshold, we use a float percentage. Then, when checking the threshold, we multiply the threshold by the length of the Gazetteer term. Copy your implementation of the *ner_ed* function into *ner_ed_rel* below and modify it to use a relative threshold as described above.  
3) Test this new approach using a 25% and 50% threshold. Print the returned lists.    
4) Does relative thresholding better capture variations than not using a relative thresholding? ***Look at the short words vs longer words.***
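
As a worked example of the relative threshold described in 2), the sketch below shows how many edits are tolerated for Gazetteer terms of different lengths. Since an edit distance is always an integer, the test `distance <= threshold * len(term)` is equivalent to `distance <= floor(threshold * len(term))`. The helper name `max_allowed_edits` is illustrative.

```python
import math

def max_allowed_edits(term, rel_threshold):
    """Largest integer distance d that satisfies d <= rel_threshold * len(term)."""
    return math.floor(rel_threshold * len(term))

# Terms taken from the cuisine Gazetteer listed earlier in the notebook
terms = ["new", "thai", "italian", "portuguese"]
allowed = {term: (max_allowed_edits(term, 0.25), max_allowed_edits(term, 0.50))
           for term in terms}

for term, (at_25, at_50) in allowed.items():
    print(f"{term:<12} length {len(term):>2}: up to {at_25} edit(s) at 25%, {at_50} at 50%")
```

At 25%, terms shorter than four letters only match exactly, while at 50% a ten-letter term tolerates five edits, which is half of the word.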

Q3 - ANSWER (1)   
Describe what starts to happen as the threshold increases:   
As the threshold increases, more tokens are matched, so both correct and incorrect matches become more likely. Short words are affected the most, because a few edits are enough to turn them into a different word. In the extreme case, since the edit distance between two words never exceeds the length of the longer word, a token of N letters matches any Gazetteer term of at most N letters once the threshold reaches N.    


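Are any other cuisines found by increasing the threshold (that weren't previously there)?    
Yes. When increasing the threshold from 0 to 1, two new matches appear: ['crab/arab', 'portugues/portuguese']. Only 'portugues/portuguese' is a genuine spelling variation of a cuisine; 'crab/arab' is a false match. When increasing the threshold from 0 to 2, many more matches appear, including the following seventeen: ['american/mexican', 'and/ainu', 'bars/parsi', 'burger/berber', 'chicken/chechen', 'crab/arab', 'fruit/inuit', 'latin/laotian', 'ranch/french', 'raw/arab', 'rib/arab', 'sea/new', 'soul/south', 'spanish/danish', 'sub/sri', 'todai/thai', 'wine/ainu']. These pair tokens with unrelated Gazetteer terms, and 'spanish/danish' even replaces the exact match 'spanish/spanish' found at lower thresholds.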


In [17]:
# Q3 - ANSWER (2)

def ner_ed_rel(sentences, gazetteer, threshold):
    words = []
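    # TO DO - Setup the three loops:
    # Loop 1 (x is not used below, so every pass repeats the same comparisons)
    for x in sentences:
        # Loop 2 (introduces the variable token used below);
        # the candidates are the unique words tagged 'Cuisine' in sentences
        for token in tag_words('Cuisine', sentences):
            # Loop 3 (introduces the variable term used below)
            for term in gazetteer:
                # Computing the edit distance between sentence tokens and 
                # the Gazetteer terms; the allowed distance scales with the term length.
                if (edit_distance(token, term) <= (threshold*len(term))):
                    if (token + "/" + term not in words):
                        words.append(token + "/" + term)
                    # Stop at the first gazetteer term within the threshold
                    break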
    # Sort the list
    words.sort()
    return words

In [18]:
# Q3 - ANSWER (3)

# 25% Threshold
gz_ed_th0_25_words = ner_ed_rel(test_sentences, gazetteer, threshold=0.25)
print("Distance with 25% Threshold")
print(gz_ed_th0_25_words)

# 50% threshold
gz_ed_th0_50_words = ner_ed_rel(test_sentences, gazetteer, threshold=0.50)
print("Distance with 50% Threshold")
print(gz_ed_th0_50_words)

Distance with 25% Threshold
['brazilian/brazilian', 'cajun/cajun', 'chinese/chinese', 'crab/arab', 'french/french', 'german/german', 'greek/greek', 'indian/indian', 'irish/irish', 'italian/italian', 'japanese/japanese', 'korean/korean', 'malaysian/malaysian', 'mexican/mexican', 'portugues/portuguese', 'spanish/spanish', 'thai/thai', 'turkish/turkish', 'vietnamese/vietnamese']
Distance with 50% Threshold
['afghan/mughal', 'american/armenian', 'and/ainu', 'asian/albanian', 'bars/parsi', 'brazilian/belarusian', 'burger/berber', 'cajun/cajun', 'cambodian/albanian', 'chicken/chechen', 'chinese/cantonese', 'crab/arab', 'cream/crimean', 'dining/danish', 'ethiopian/estonian', 'french/french', 'fruit/inuit', 'fusion/russian', 'german/georgian', 'greek/greek', 'hawaiian/albanian', 'health/keralite', 'indian/anglo-indian', 'irish/danish', 'italian/albanian', 'japanese/cantonese', 'joint/soviet', 'korean/georgian', 'latin/albanian', 'malaysian/albanian', 'mexican/armenian', 'portugues/portuguese',

Q3 - ANSWER (4)    
Does relative thresholding better capture variations than not using a relative thresholding?   

Relative thresholding better captures variations than an absolute threshold when the relative threshold is low. With a relative threshold of 25%, the matches were the same as those obtained with a threshold distance of 1. With a relative threshold of 50%, however, 45 matches were returned, 11 more than with a threshold distance of 2. Longer words fared worse under relative thresholding because the allowed distance grows with the length of the term: at 50%, even exact matches are displaced, so that 'brazilian' is paired with 'belarusian' and 'chinese' with 'cantonese'.

**4. Exploring Wordnet**   
Let's first explore the Wordnet interface within nltk a bit.  
You can also look at the [WordNet interface description](http://www.nltk.org/howto/wordnet.html)

In [19]:
# A synset is a concept associated with a set of synonyms
ratingSenses = wordnet.synsets('rating')
print(ratingSenses)
print(len(ratingSenses))

[Synset('evaluation.n.02'), Synset('evaluation.n.01'), Synset('rating.n.03'), Synset('military_rank.n.01'), Synset('rate.v.01'), Synset('rate.v.02'), Synset('rate.v.03'), Synset('rat.v.01'), Synset('rat.v.02'), Synset('fink.v.01'), Synset('rat.v.04'), Synset('rat.v.05'), Synset('denounce.v.04')]
13


This shows that there are 13 senses of rating, 4 nouns and 9 verbs.  The word displayed is the most representative word for each sense.  

You can try other words.  I recommend that you also perform the same search [online](http://wordnetweb.princeton.edu/perl/webwn) to better understand the results.

Let's look at the basic information in each synset.        

In [20]:
# We define a function to print the basic information
def printBasicSynsetInfo(d):
    print("SynLemmas")
    print(d.lemmas())
    # Print the synonyms
    print("Synonyms")
    synonyms = [l.name() for l in d.lemmas()]
    print(synonyms)
    # Print the definition
    print("Definition")
    print(d.definition())

In [21]:
# We can print the information for each sense of "rating"
for i in range(len(ratingSenses)):
    print("[Sense " + str(i) + "]")
    printBasicSynsetInfo(ratingSenses[i])
    print()

[Sense 0]
SynLemmas
[Lemma('evaluation.n.02.evaluation'), Lemma('evaluation.n.02.valuation'), Lemma('evaluation.n.02.rating')]
Synonyms
['evaluation', 'valuation', 'rating']
Definition
an appraisal of the value of something

[Sense 1]
SynLemmas
[Lemma('evaluation.n.01.evaluation'), Lemma('evaluation.n.01.rating')]
Synonyms
['evaluation', 'rating']
Definition
act of ascertaining or fixing the value or worth of

[Sense 2]
SynLemmas
[Lemma('rating.n.03.rating')]
Synonyms
['rating']
Definition
standing or position on a scale

[Sense 3]
SynLemmas
[Lemma('military_rank.n.01.military_rank'), Lemma('military_rank.n.01.military_rating'), Lemma('military_rank.n.01.paygrade'), Lemma('military_rank.n.01.rating')]
Synonyms
['military_rank', 'military_rating', 'paygrade', 'rating']
Definition
rank in a military organization

[Sense 4]
SynLemmas
[Lemma('rate.v.01.rate'), Lemma('rate.v.01.rank'), Lemma('rate.v.01.range'), Lemma('rate.v.01.order'), Lemma('rate.v.01.grade'), Lemma('rate.v.01.place')]
Sy

Wordnet also contains a manually developed taxonomy, which links each synset to its hypernyms (more general concepts) and hyponyms (more specific concepts).  

In [29]:
# We define a function to print the taxonomy information; it receives a synset
def printTaxonomyInfo(d):
    # Print the synonyms
    synonyms = [l.name() for l in d.lemmas()]
    print("Synonyms:")
    print(synonyms)
    # Print the hypernyms
    print("Hypernyms:")
    print(d.hypernyms())
    # Print the hyponyms *** We will use these later so note how we get the hyponyms ***
    print("Hyponyms:")
    print(d.hyponyms())

**(TO DO) Q4 - 2 marks**   
Choose two words and write code to print the taxonomic information for all senses of those words using printTaxonomyInfo().

In [45]:
# TO DO - Choose two words and print the taxonomic information for all 
# senses of those words using printTaxonomyInfo

# Word 1
dogSenses = wordnet.synsets('dog')
print('Word 1: dog')
for i in range(len(dogSenses)):
    print("[Sense " + str(i) + "]")
    printTaxonomyInfo(dogSenses[i])
    print()
# Word 2
print('Word 2: peanut')
peanutSenses = wordnet.synsets('peanut')
for i in range(len(peanutSenses)):
    print("[Sense " + str(i) + "]")
    printTaxonomyInfo(peanutSenses[i])
    print()

Word 1: dog
[Sense 0]
Synonyms:
['dog', 'domestic_dog', 'Canis_familiaris']
Hypernyms:
[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
Hyponyms:
[Synset('basenji.n.01'), Synset('corgi.n.01'), Synset('cur.n.01'), Synset('dalmatian.n.02'), Synset('great_pyrenees.n.01'), Synset('griffon.n.02'), Synset('hunting_dog.n.01'), Synset('lapdog.n.01'), Synset('leonberg.n.01'), Synset('mexican_hairless.n.01'), Synset('newfoundland.n.01'), Synset('pooch.n.01'), Synset('poodle.n.01'), Synset('pug.n.01'), Synset('puppy.n.01'), Synset('spitz.n.01'), Synset('toy_dog.n.01'), Synset('working_dog.n.01')]

[Sense 1]
Synonyms:
['frump', 'dog']
Hypernyms:
[Synset('unpleasant_woman.n.01')]
Hyponyms:
[]

[Sense 2]
Synonyms:
['dog']
Hypernyms:
[Synset('chap.n.01')]
Hyponyms:
[]

[Sense 3]
Synonyms:
['cad', 'bounder', 'blackguard', 'dog', 'hound', 'heel']
Hypernyms:
[Synset('villain.n.01')]
Hyponyms:
[Synset('perisher.n.01')]

[Sense 4]
Synonyms:
['frank', 'frankfurter', 'hotdog', 'hot_dog', 'dog', 'wien

The following shows how to get the name of each hyponym, using our example *ratingSenses*.

In [47]:
for i in range(len(ratingSenses)):
    print("[Sense " +  str(i) + "]")
    for hyponym in ratingSenses[i].hyponyms():
        # Access the hyponym names with: hyponym.name().split(".")[0]
        print(hyponym.name().split(".")[0])
    print()

[Sense 0]
bond_rating
mark
overvaluation
pricing
reevaluation
undervaluation

[Sense 1]
marking

[Sense 2]

[Sense 3]
flag_rank

[Sense 4]
downgrade
prioritize
reorder
seed
sequence
shortlist
subordinate
superordinate
upgrade

[Sense 5]

[Sense 6]
revalue

[Sense 7]

[Sense 8]

[Sense 9]

[Sense 10]

[Sense 11]

[Sense 12]
sell_out
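
The hyponym names printed above come from synset names of the form `lemma.pos.number`, such as 'military_rank.n.01'. The sketch below takes such a name apart using plain string operations, so it does not need nltk; `parse_synset_name` is an illustrative helper, not part of the nltk API.

```python
def parse_synset_name(name):
    """Split a synset name like 'military_rank.n.01' into (lemma, pos, sense number)."""
    lemma, pos, sense = name.rsplit(".", 2)   # split from the right, at most twice
    return lemma, pos, int(sense)

# Synset names that appear in the outputs earlier in this notebook
lemma, pos, sense = parse_synset_name("military_rank.n.01")
words_in_lemma = lemma.split("_")             # multi-word lemmas are joined by underscores

print(lemma, pos, sense)
print(words_in_lemma)
print(parse_synset_name("evaluation.n.02"))
```

Multi-word lemmas keep their underscores, so a name such as 'bond_rating' can never match a single token exactly. This matters when the hyponym names are used as a Gazetteer for single words.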



**5. Using Wordnet for NER**   
Having explored Wordnet, we will use what we have learned to try another method of NER tagging. Specifically, we will be trying to tag any words labelled *Dish* by using the edit distance and the edit distance thresholding technique that we used in section 3 with the *hyponyms* for the word *dish*.   
  
Since we have covered everything needed to answer this question in this notebook, you will be doing most of the coding yourself. 

**(TO DO) Q5 - 5 marks**   
1) Go through the senses for the word '*dish*' (recall how we used the functions printBasicSynsetInfo() and wordnet.synsets() above). Looking through the senses, select the single most appropriate sense of the word *dish* for our problem (Restaurant Corpus, look into the corpus or file if you need more context on which words are labelled *Dish*). Collect a list of all *hyponyms* for the word '*dish*' based on the selected sense of the word. You can think of this as a Gazetteer.    
2) Using thresholds 0, 1, and 2 with ner_ed, collect three lists of words that we would tag as *Dish* based on the *Edit Distance* between our Dish Gazetteer and test_sentences. Print the returned lists.     
3) Are more dishes being found with a larger threshold? Give an example to justify your answer.    
4) Using thresholds 25% and 50% with ner_ed_rel, collect two lists of words that we would tag as *Dish* based on the *Edit Distance* between our Dish Gazetteer and test_sentences. Print the returned lists.    
5) Does relative thresholding better capture variations than not using a relative thresholding for the tag *Dish*? Give an example to justify your answer.
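
When collecting hyponyms as in 1), it is worth being clear about the difference between direct hyponyms (one level down) and all hyponyms (every level below). The sketch below illustrates the difference on a hypothetical mini-taxonomy stored in a dictionary; the entries are made up for illustration and are not actual Wordnet content.

```python
# Hypothetical mini-taxonomy: each key maps to its direct hyponyms
toy_hyponyms = {
    "dish": ["pizza", "soup", "stew"],
    "pizza": ["sicilian_pizza"],
    "soup": ["broth", "chowder"],
    "chowder": ["clam_chowder"],
}

def direct_hyponyms(word):
    """Hyponyms exactly one level below word."""
    return sorted(toy_hyponyms.get(word, []))

def all_hyponyms(word):
    """Hyponyms at any depth below word, collected with an explicit stack."""
    found = []
    stack = list(toy_hyponyms.get(word, []))
    while stack:
        child = stack.pop()
        if child not in found:
            found.append(child)
            stack.extend(toy_hyponyms.get(child, []))
    return sorted(found)

direct = direct_hyponyms("dish")
everything = all_hyponyms("dish")
print(direct)
print(everything)
```

A Gazetteer built from direct hyponyms only is smaller and misses more specific items, while one built from all levels covers more terms but also offers more opportunities for false matches under a loose threshold.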

In [63]:
# TO DO (1)
# Look through the senses of the word dish to select the most appropriate sense for our corpus
# Recall how we used the printBasicSynsetInfo() function above
dishSenses = wordnet.synsets('dish')
for i in range(len(dishSenses)):
    print("[Sense " + str(i) + "]")
    printBasicSynsetInfo(dishSenses[i])
    print()

# Print which sense you have selected (ex: Sense X)
print('The sense chosen is: Sense 1.')

# Collect a list of all hyponyms for the word 'dish' only from the selected sense
dishHyponyms = []
for hyponym in dishSenses[1].hyponyms():
    dishHyponyms.append(hyponym.name().split(".")[0])
dishHyponyms.sort()

# Print the Dish Gazetteer
print(dishHyponyms)

[Sense 0]
SynLemmas
[Lemma('dish.n.01.dish')]
Synonyms
['dish']
Definition
a piece of dishware normally used as a container for holding or serving food

[Sense 1]
SynLemmas
[Lemma('dish.n.02.dish')]
Synonyms
['dish']
Definition
a particular item of prepared food

[Sense 2]
SynLemmas
[Lemma('dish.n.03.dish'), Lemma('dish.n.03.dishful')]
Synonyms
['dish', 'dishful']
Definition
the quantity that a dish will hold

[Sense 3]
SynLemmas
[Lemma('smasher.n.02.smasher'), Lemma('smasher.n.02.stunner'), Lemma('smasher.n.02.knockout'), Lemma('smasher.n.02.beauty'), Lemma('smasher.n.02.ravisher'), Lemma('smasher.n.02.sweetheart'), Lemma('smasher.n.02.peach'), Lemma('smasher.n.02.lulu'), Lemma('smasher.n.02.looker'), Lemma('smasher.n.02.mantrap'), Lemma('smasher.n.02.dish')]
Synonyms
['smasher', 'stunner', 'knockout', 'beauty', 'ravisher', 'sweetheart', 'peach', 'lulu', 'looker', 'mantrap', 'dish']
Definition
a very attractive or seductive looking woman

[Sense 4]
SynLemmas
[Lemma('dish.n.05.dish'), 

In [64]:
# TO DO (2)
# Using thresholds 0, 1, and 2 with ner_ed, collect three lists of words that we
# would tag as *Dish* based on the *Edit Distance* between our Dish Gazetteer and test_sentences. 

# Threshold = 0
dish_gz_ed_th0_words = ner_ed(test_sentences, dishHyponyms, threshold=0)
# Threshold = 1
dish_gz_ed_th1_words = ner_ed(test_sentences, dishHyponyms, threshold=1)
# Threshold = 2
dish_gz_ed_th2_words = ner_ed(test_sentences, dishHyponyms, threshold=2)

#Print the returned lists. 
print("Distance 0")
print(dish_gz_ed_th0_words)
print("Distance 1")
print(dish_gz_ed_th1_words)
print("Distance 2")
print(dish_gz_ed_th2_words)

Distance 0
['barbecue/barbecue', 'burrito/burrito', 'pizza/pizza', 'sushi/sushi', 'taco/taco']
Distance 1
['barbecue/barbecue', 'burrito/burrito', 'pizza/pizza', 'soul/soup', 'sushi/sushi', 'taco/taco']
Distance 2
['and/viand', 'barbecue/barbecue', 'burrito/burrito', 'fast/hash', 'food/mold', 'house/mousse', 'pho/poi', 'pizza/pizza', 'pub/poi', 'sea/stew', 'soul/soup', 'stand/viand', 'steak/stew', 'sub/soup', 'sushi/sushi', 'taco/taco']


Q5 - TO DO (3)   
Are more dishes being found with a larger threshold? Give an example to justify your answer.  

Yes, more matches are found as the threshold increases:

Threshold Distance 0: 5 matches found.  
Threshold Distance 1: 6 matches found.  
Threshold Distance 2: 16 matches found.  

For example, threshold 1 adds 'soul/soup', and threshold 2 adds ten more pairs such as 'steak/stew' and 'pho/poi'. Most of these additions pair a token with an unrelated term rather than identifying a new dish correctly.


In [65]:
# TO DO (4)
# Using thresholds 25% and 50% with ner_ed_rel, collect two lists of words that we would tag 
# as *Dish* based on the *Edit Distance* between our Dish Gazetteer and test_sentences.     
dish_gz_ed_th0_25_words = ner_ed_rel(test_sentences, dishHyponyms, threshold=0.25)
dish_gz_ed_th0_50_words = ner_ed_rel(test_sentences, dishHyponyms, threshold=0.50)

# Print the returned lists.
print("Distance with 25% Threshold")
print(dish_gz_ed_th0_25_words)
print("Distance with 50% Threshold")
print(dish_gz_ed_th0_50_words)

Distance with 25% Threshold
['barbecue/barbecue', 'burrito/burrito', 'pizza/pizza', 'soul/soup', 'sushi/sushi', 'taco/taco']
Distance with 50% Threshold
['and/viand', 'barbecue/barbecue', 'burger/turnover', 'burmese/barbecue', 'burrito/burrito', 'chicken/chicken_kiev', 'coffee/souffle', 'donut/fondue', 'fast/hash', 'food/fondue', 'french/french_toast', 'grill/egg_roll', 'house/mousse', 'latin/galantine', 'meat/meatball', 'meats/meatball', 'pizza/pizza', 'sandwich/sandwich_plate', 'sea/stew', 'seafood/snack_food', 'soul/souffle', 'spanish/spanish_rice', 'stand/custard', 'steak/stew', 'sub/soup', 'sushi/sashimi', 'taco/taco', 'waffle/souffle']


Q5 - TO DO (5)    
Does relative thresholding better capture variations than not using a relative thresholding for the tag *Dish*? Give an example to justify your answer.     

When using a relative threshold of 25%, the results were the same as with a threshold distance of 1. When using a relative threshold of 50%, the results contained 28 matches, 12 more than with a threshold distance of 2.  
So a low relative threshold (e.g. 25%) captures about the same variations as a low absolute threshold (e.g. distance 1), while a high relative threshold (e.g. 50%) admits more matches than a high absolute threshold (e.g. distance 2). Many of the extra matches are wrong: for example, at 50% 'sushi' is paired with 'sashimi' instead of 'sushi', and 'burger' is paired with 'turnover'.  
Therefore, we can conclude that relative thresholding does not better capture variations when using a high relative threshold but can better capture variations when using a low relative threshold.
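
The 'sushi/sashimi' pairing in the 50% output happens because matching stops at the first term within the threshold, and 'sashimi' comes before 'sushi' alphabetically. One possible alternative, sketched below, is to keep the closest term instead of the first one. This is a different technique from the one used in this notebook, and the helper names `first_match` and `closest_match` are assumptions for illustration.

```python
def edit_distance(s1, s2):
    """Edit distance with unit costs, keeping only two rows of the table."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def first_match(token, gazetteer, rel_threshold):
    """Return the first term within the relative threshold, or None."""
    for term in gazetteer:
        if edit_distance(token, term) <= rel_threshold * len(term):
            return term
    return None

def closest_match(token, gazetteer, rel_threshold):
    """Return the term with the smallest distance among those within the threshold, or None."""
    best_term, best_dist = None, None
    for term in gazetteer:
        dist = edit_distance(token, term)
        if dist <= rel_threshold * len(term) and (best_dist is None or dist < best_dist):
            best_term, best_dist = term, dist
    return best_term

# Sorted subset of the dish terms that appear in the outputs above
dish_terms = ["sashimi", "soup", "sushi"]

first = first_match("sushi", dish_terms, 0.50)
closest = closest_match("sushi", dish_terms, 0.50)
print(first, closest)
```

Choosing the closest term restores exact matches such as 'sushi/sushi', but it does not remove false matches for tokens that have no correct term in the Gazetteer at all.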

**This is the last notebook, hope you enjoyed them and learned some new things :) Best of luck with your projects!**

#### Signature

I, Kenny Nguyen, declare that the answers provided in this notebook are my own.