# <center>Notebook #7</center>
<center>Name: NamChi Nguyen</center>
<center>Student ID: 7236760</center>

## Knowledge representation and similarity 
### Grounding (Word-Sense Disambiguation) to WordNet

CSI4106 Artificial Intelligence  
Fall 2018  
Caroline Barrière

***

In this notebook, you will first explore WordNet, a lexical semantic network in which knowledge is organized by interrelated synsets (groups of synonyms).  Second, you will attempt Word-Sense Disambiguation (WSD), using a simple Lesk-like algorithm which compares BOWs (bags of words).  

This notebook uses the same NLTK package as the last notebook. We will also reuse some knowledge from the previous notebook (tokenization, lemmatization, POS tagging), so make sure to do the NLP Pipeline notebook before this one.

*As you now have more experience, this notebook requires that you write more code by yourself than the previous ones.*

***HOMEWORK***:  
Go through the notebook by running each cell, one at a time. Look for (**TO DO**) for the tasks that you need to perform.  
Make sure you *sign* (type your name) the notebook at the end. Once you're done, submit your notebook.

***

In [1]:
# let's import nltk, and wordnet

import nltk
from nltk.corpus import wordnet

**1. Exploring Wordnet**  

Let's first explore the WordNet interface within nltk.  
You can also look at the [WordNet interface description](http://www.nltk.org/howto/wordnet.html)

In [2]:
# a synset is a concept associated with a set of synonyms

paperSenses = wordnet.synsets('paper')
print(paperSenses)

[Synset('paper.n.01'), Synset('composition.n.08'), Synset('newspaper.n.01'), Synset('paper.n.04'), Synset('paper.n.05'), Synset('newspaper.n.02'), Synset('newspaper.n.03'), Synset('paper.v.01'), Synset('wallpaper.v.01')]


This shows that there are 9 senses of paper, 7 nouns and 2 verbs.  The word displayed is the most representative word for each sense.  
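The 7 noun / 2 verb split can be read directly off the synset names, which follow a `lemma.pos.number` pattern. The sketch below is a minimal illustration that works on the names copied from the output above, so it runs without the WordNet corpus; the helper name `pos_of` is an assumption made for this example. With live synsets, `s.pos()` should give the same letter.

```python
from collections import Counter

# Synset names copied from the output printed above for 'paper'.
paper_synset_names = [
    "paper.n.01", "composition.n.08", "newspaper.n.01", "paper.n.04",
    "paper.n.05", "newspaper.n.02", "newspaper.n.03", "paper.v.01",
    "wallpaper.v.01",
]

def pos_of(synset_name):
    """Return the POS letter from a name shaped like 'lemma.pos.nn' (hypothetical helper)."""
    # Split from the right so a lemma that itself contains '.' would not break the parse.
    lemma, pos, sense_number = synset_name.rsplit(".", 2)
    return pos

# Count how many senses fall under each part of speech.
pos_counts = Counter(pos_of(name) for name in paper_synset_names)
print(pos_counts)
```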

You can try other words.  I recommend that you also perform the same search [online](http://wordnetweb.princeton.edu/perl/webwn) to better understand the results.

Let's look at the basic information in each synset.        

In [3]:
# We define a function to print the basic information

def printBasicSynsetInfo(d):
    print("SynLemmas")
    print(d.lemmas())
    print("Synonyms")
    synonyms = [l.name() for l in d.lemmas()]
    print(synonyms)
    print("Definition")
    print(d.definition())

In [4]:
# We can print the information for each sense of "paper"

for i in range(len(paperSenses)):
    print("[Sense " + str(i) + "]")
    printBasicSynsetInfo(paperSenses[i])
    print()

[Sense 0]
SynLemmas
[Lemma('paper.n.01.paper')]
Synonyms
['paper']
Definition
a material made of cellulose pulp derived mainly from wood or rags or certain grasses

[Sense 1]
SynLemmas
[Lemma('composition.n.08.composition'), Lemma('composition.n.08.paper'), Lemma('composition.n.08.report'), Lemma('composition.n.08.theme')]
Synonyms
['composition', 'paper', 'report', 'theme']
Definition
an essay (especially one written as an assignment)

[Sense 2]
SynLemmas
[Lemma('newspaper.n.01.newspaper'), Lemma('newspaper.n.01.paper')]
Synonyms
['newspaper', 'paper']
Definition
a daily or weekly publication on folded sheets; contains news and articles and advertisements

[Sense 3]
SynLemmas
[Lemma('paper.n.04.paper')]
Synonyms
['paper']
Definition
a medium for written communication

[Sense 4]
SynLemmas
[Lemma('paper.n.05.paper')]
Synonyms
['paper']
Definition
a scholarly article describing the results of observations or stating hypotheses

[Sense 5]
SynLemmas
[Lemma('newspaper.n.02.newspaper'), Lemm

A detailed taxonomy has been manually developed in WordNet, making it a rich resource.  

**(TO-DO : Q1)** Choose two words, and write code to print the taxonomic information for all senses of those words.
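One way to get a feel for the taxonomy is to follow hypernym links upward from a synset. The sketch below is only an illustration: `hypernym_chain` and `ToySynset` are names invented here, and the toy root node is made up. The function relies only on the `name()` and `hypernyms()` methods, so it should also accept an NLTK synset, but it is demonstrated on stand-in objects so that it runs without the corpus.

```python
def hypernym_chain(synset):
    """Follow the first hypernym repeatedly and return the names visited (hypothetical helper)."""
    chain = [synset.name()]
    seen = {synset.name()}
    while synset.hypernyms():
        # A synset may have several parents; this sketch simply takes the first.
        synset = synset.hypernyms()[0]
        if synset.name() in seen:  # guard against accidental cycles
            break
        seen.add(synset.name())
        chain.append(synset.name())
    return chain

class ToySynset:
    """Stand-in exposing the two methods used above; not part of NLTK."""
    def __init__(self, name, parents=()):
        self._name = name
        self._parents = list(parents)

    def name(self):
        return self._name

    def hypernyms(self):
        return self._parents

# 'book.n.01' -> 'publication.n.01' mirrors the output in this notebook;
# the root node is invented for the example.
toy_root = ToySynset("toy_root.n.00")
publication = ToySynset("publication.n.01", [toy_root])
book = ToySynset("book.n.01", [publication])

print(hypernym_chain(book))
```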

In [5]:
# We define a function to print the taxonomy information; it receives a synset

def printTaxonomyInfo(d):
    synonyms = [l.name() for l in d.lemmas()]
    print(synonyms)
    print("Hypernyms:")
    print(d.hypernyms())
    print("Hyponyms:")
    print(d.hyponyms())

In [6]:
# Q1 - ANSWER
# We can print the taxonomy information for each sense of a word X

# Word #1
bookSenses = wordnet.synsets('book')
for i in range(len(bookSenses)):
    print("[Sense " + str(i) + "]")
    printTaxonomyInfo(bookSenses[i])
    print()

print()

# Word #2
ringSenses = wordnet.synsets('ring')
for i in range(len(ringSenses)):
    print("[Sense " + str(i) + "]")
    printTaxonomyInfo(ringSenses[i])
    print()

[Sense 0]
['book']
Hypernyms:
[Synset('publication.n.01')]
Hyponyms:
[Synset('appointment_book.n.01'), Synset('authority.n.07'), Synset('bestiary.n.01'), Synset('booklet.n.01'), Synset('catalog.n.01'), Synset('catechism.n.02'), Synset('copybook.n.01'), Synset('curiosa.n.01'), Synset('formulary.n.01'), Synset('phrase_book.n.01'), Synset('playbook.n.02'), Synset('pop-up_book.n.01'), Synset('prayer_book.n.01'), Synset('reference_book.n.01'), Synset('review_copy.n.01'), Synset('songbook.n.01'), Synset('storybook.n.01'), Synset('textbook.n.01'), Synset('tome.n.01'), Synset('trade_book.n.01'), Synset('workbook.n.01'), Synset('yearbook.n.01')]

[Sense 1]
['book', 'volume']
Hypernyms:
[Synset('product.n.02')]
Hyponyms:
[Synset('album.n.02'), Synset('coffee-table_book.n.01'), Synset('folio.n.03'), Synset('hardback.n.01'), Synset('journal.n.04'), Synset('notebook.n.01'), Synset('novel.n.02'), Synset('order_book.n.02'), Synset('paperback_book.n.01'), Synset('picture_book.n.01'), Synset('sketchboo

**2. Word-Sense Disambiguation.**  

Let's now implement a simple modified Lesk algorithm for WSD.  
The idea is to compare the sentence containing the ambiguous word W to all the definitions of W and choose the most similar.
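Before building the real version on WordNet definitions, the idea can be sketched on a toy sense inventory. Everything in this example is an assumption made for illustration: the two glosses are hand-written rather than taken from WordNet, tokenization is a plain whitespace split, and the names `simple_bow` and `lesk_like` are invented here.

```python
def simple_bow(text):
    """Lowercase and split on whitespace; a crude stand-in for a real tokenizer."""
    return set(text.lower().split())

def lesk_like(context, glosses):
    """Return (best_overlap, senses tied for that overlap)."""
    context_bow = simple_bow(context)
    overlaps = {
        sense: len(simple_bow(gloss) & context_bow)
        for sense, gloss in glosses.items()
    }
    best = max(overlaps.values())
    # Several senses may share the maximum, so all of them are returned.
    return best, [sense for sense, size in overlaps.items() if size == best]

# Hand-written glosses for two senses of 'bank' (hypothetical, not WordNet text).
toy_glosses = {
    "bank_river": "sloping land beside a body of water",
    "bank_money": "an institution that accepts deposits of money",
}

best, senses = lesk_like("he sat on the bank throwing rocks in the water", toy_glosses)
print(best, senses)
```

Here the only shared word is "water", which is enough to prefer the river sense; with real definitions the overlaps are often dominated by less informative words.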

(Step 1) Create a BOW (bag of words) for each definition.

In [7]:
# we will need the tokenizer

from nltk import word_tokenize

In [8]:
# define a small method to return the set of words found in a text
# we can exclude some words

def bow(text, excluded = None):
    text = text.replace("_", " ") # the compound nouns in wordnet text have _
    tokens = word_tokenize(text)
    setTokens = set(tokens)
    setTokens.discard(excluded)  # no-op when excluded is None or not in the set
    return setTokens

In [9]:
# testing 
print(bow("There is a lot of food on the table", excluded='table'))
print(bow("He wrote an excellent conference paper referred by many researchers", excluded='paper'))

{'of', 'There', 'food', 'is', 'lot', 'the', 'on', 'a'}
{'researchers', 'by', 'many', 'excellent', 'He', 'wrote', 'referred', 'conference', 'an'}


In [10]:
# make BOWs for all the senses in a received word
# exclude from the BOW, the word being defined

def makeDefBOWs(testWord):
    synsets = wordnet.synsets(testWord)
    defs = [s.definition() for s in synsets]
    bows = [bow(d, excluded=testWord) for d in defs]
    return bows

In [11]:
# try with different words, look at the resulting info

testWord = "cell" # bank, course, paper, ...
defBOWs = makeDefBOWs(testWord)
    
print(*defBOWs, sep="\n")  # to print a list on separate lines

{'small', 'any', 'compartment'}
{'animals', 'of', 'life', 'exist', 'as', 'basic', 'or', 'the', 'tissues', ';', 'unit', 'they', 'units', 'form', 'plants', ')', 'monads', 'may', 'independent', '(', 'structural', 'organisms', 'biology', 'and', 'all', 'colonies', 'in', 'functional', 'higher'}
{'device', 'result', 'current', 'of', 'reaction', 'chemical', 'as', 'electric', 'that', 'the', 'an', 'a', 'delivers'}
{'movement', 'serving', 'unit', 'of', 'political', 'as', 'part', 'the', 'or', 'nucleus', 'small', 'larger', 'a'}
{',', 'each', 'area', 'radiotelephone', 'divided', 'short-range', 'use', 'with', 'own', 'mobile', 'into', 'in', 'transmitter/receiver', 'for', 'sections', 'its', 'hand-held', 'an', 'a', 'small'}
{'which', 'nun', 'lives', 'room', 'monk', 'in', 'or', 'small', 'a'}
{'kept', 'room', 'is', 'where', 'prisoner', 'a'}


(Step 2) Create a method to compare BOWs

In [12]:
# We're interested in the size of the intersection between the BOWs
# The words in common are printed to help understand the results;
# uncomment the other prints to also see the two full BOWs

def bowOverlap(bow1, bow2):
    #print(bow1)
    #print(bow2)
    common = bow1.intersection(bow2)
    print(common)
    return len(common)

**(TO-DO: Q2)** Implement Step 3 of the algorithm.  Step 3 consists of comparing the BOW of a test sentence (let's call it our context C) containing an ambiguous word (X) to the BOWs of all the senses of X.  To do Step 3, you need to complete the method below, which receives a word X as well as the text C in which X occurs.  The method should return the synsets whose definition BOWs have the largest overlap with the BOW of C.  Notice that there could be more than one maximum, so your method should return all synsets with maximum intersection.

In [13]:
# Q2 - ANSWER

# method receives a word and its context
# returns all the synsets with maximum overlap

def findMostProbableSense(word, context):
    bows = makeDefBOWs(word)
    textBOW = bow(context)
    max_overlap = 0
    max_synsets = [] #list of synsets  
    olaps = [] # overlap size for each sense, in the same order as senses
  
    # find senses with max overlap
    senses = wordnet.synsets(word)
    for i in range(len(senses)):
        overlap = bowOverlap(bows[i], textBOW) 
        olaps.append(overlap) # Keep every overlap size so all senses tied for the max can be found
                
        # Print statements for verifying
        print("Overlap: ",overlap)
        print("BOWS: ",bows[i])
        print("textBOW: ", textBOW, "\n")
        
        if overlap > max_overlap:        
            max_overlap = overlap
    
    # Add all synsets w/ max intersection        
    for j in range(len(olaps)):
        if max_overlap == olaps[j]:
            max_synsets.append(senses[j]) 
            
    return max_synsets

##### Your method should return the chosen senses for the example below.  We will test your method using the following code.

In [14]:
# Show the BOWs of the senses with the overlap, and the chosen sense(s)
# You can try with various words and sentences

testWord = "cell"
testSentence = "He lived in this prison cell for many years." 

####  CALL TO YOUR METHOD RECEIVING THE WORD AND ITS CONTEXT
chosenSynsets = findMostProbableSense(testWord, testSentence)  

# print all the definitions of the most probable senses
for s in chosenSynsets:
    printBasicSynsetInfo(s)
    print()

set()
Overlap:  0
BOWS:  {'small', 'any', 'compartment'}
textBOW:  {'lived', 'many', 'this', 'He', 'cell', 'in', '.', 'for', 'years', 'prison'} 

{'in'}
Overlap:  1
BOWS:  {'animals', 'of', 'life', 'exist', 'as', 'basic', 'or', 'the', 'tissues', ';', 'unit', 'they', 'units', 'form', 'plants', ')', 'monads', 'may', 'independent', '(', 'structural', 'organisms', 'biology', 'and', 'all', 'colonies', 'in', 'functional', 'higher'}
textBOW:  {'lived', 'many', 'this', 'He', 'cell', 'in', '.', 'for', 'years', 'prison'} 

set()
Overlap:  0
BOWS:  {'device', 'result', 'current', 'of', 'reaction', 'chemical', 'as', 'electric', 'that', 'the', 'an', 'a', 'delivers'}
textBOW:  {'lived', 'many', 'this', 'He', 'cell', 'in', '.', 'for', 'years', 'prison'} 

set()
Overlap:  0
BOWS:  {'movement', 'serving', 'unit', 'of', 'political', 'as', 'part', 'the', 'or', 'nucleus', 'small', 'larger', 'a'}
textBOW:  {'lived', 'many', 'this', 'He', 'cell', 'in', '.', 'for', 'years', 'prison'} 

{'in', 'for'}
Overlap:

**(TO-DO: Q3)** What do you notice? With the example above for "cell", what are the words making the BOWs look similar?  Are these significant words?

*Q3-ANSWER*  
The overlap is driven by the prepositions 'in' and 'for', so the algorithm picks a definition that does not fit the context. These function words carry no meaning that helps disambiguation. The best sense should be the last definition, "a room where a prisoner is kept", since 'prison' and 'prisoner' are closely related, but exact word matching does not link them.
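The effect of function words can be checked by removing them before intersecting. The sketch below reuses two definition BOWs and the context BOW printed earlier in this notebook; the stop list is a small hand-written assumption, not NLTK's stop-word corpus, and `content_overlap` is a name invented for this example.

```python
# Context BOW and two definition BOWs copied from the outputs above.
context_bow = {'lived', 'many', 'this', 'He', 'cell', 'in', '.', 'for', 'years', 'prison'}
definition_bows = {
    "radiotelephone sense": {',', 'each', 'area', 'radiotelephone', 'divided', 'short-range',
                             'use', 'with', 'own', 'mobile', 'into', 'in',
                             'transmitter/receiver', 'for', 'sections', 'its', 'hand-held',
                             'an', 'a', 'small'},
    "prison sense": {'kept', 'room', 'is', 'where', 'prisoner', 'a'},
}

# Hypothetical, hand-written stop list of function words and punctuation.
STOP_WORDS = {'a', 'an', 'the', 'in', 'for', 'of', 'or', 'is', 'this', 'with',
              'its', 'into', ',', '.'}

def content_overlap(bow1, bow2, stop_words):
    """Words shared by both BOWs once stop words are removed."""
    return (bow1 & bow2) - stop_words

raw = {label: d & context_bow for label, d in definition_bows.items()}
filtered = {label: content_overlap(d, context_bow, STOP_WORDS)
            for label, d in definition_bows.items()}

for label in definition_bows:
    print(label, "| raw:", raw[label], "| filtered:", filtered[label])
```

Filtering removes the spurious match, but it does not by itself connect 'prison' to 'prisoner'; that would need some looser form of matching.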

**(TO-DO: Q4)  Refining our BOWs**

**Exploring variations:**
1. What if you lowercase everything?
2. What if you apply lemmatisation on all words in the BOWs?
3. What if you focus on only the NOUNS in the BOWs?

(hint) Go back to your NLP Pipeline notebook: for question (2), use the lemmatizer, and for question (3), perform POS tagging on the sentences. 
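As a rough illustration of variation (3), the sketch below keeps only the nouns from a sentence that has already been tagged. The Penn Treebank tags are assigned by hand here as an assumption, so the example runs without the tagger model; in the notebook they would come from `nltk.pos_tag`. The name `noun_bow` is invented for this example.

```python
# Hand-assigned Penn Treebank tags (an assumption; normally produced by a POS tagger).
tagged_tokens = [
    ("There", "EX"), ("is", "VBZ"), ("a", "DT"), ("lot", "NN"), ("of", "IN"),
    ("food", "NN"), ("on", "IN"), ("the", "DT"), ("table", "NN"),
]

def noun_bow(tagged, excluded=None):
    """Build a lowercase BOW from tokens whose tag starts with 'N' (NN, NNS, NNP, ...)."""
    nouns = {token.lower() for token, tag in tagged if tag.startswith("N")}
    nouns.discard(excluded)  # no-op when excluded is None or absent
    return nouns

print(noun_bow(tagged_tokens, excluded="table"))
```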

For your answer (code to write):  

a) First complete the BOW method below in which I've added parameters to possibly activate the lowercase, the lemmatization and the POS tagging.   
b) Add a few tests to see if your BOW works.  


In [15]:
# Q4 - ANSWER - part a)

# The parameters possibly ACTIVATE lowercase, lemmatization, and keeping only Nouns in BOWs.

# nltk contains a method to obtain the part-of-speech of each token
# Download the POS tagger model (the lemmatizer itself relies on the WordNet corpus)
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
wnl = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.ADV  # just use as default, for ADV the lemmatizer doesn't change anything 

    
# refine the method with parameters
def bow(text, excluded = None, lowercase = False, lemmatize=False, nounsOnly=False):
    
    text = text.replace("_", " ") # the compound nouns in wordnet text have _
    if lowercase:
        text = text.lower()
    tokens = word_tokenize(text)  # tokenize once, after the optional lowercasing

    if nounsOnly:
        # Keep only the tokens tagged as nouns.
        # Note: on this path the nouns are always lemmatized (as nouns),
        # whatever the value of the lemmatize flag.
        posTokens = nltk.pos_tag(tokens)
        tokens = [wnl.lemmatize(t, wordnet.NOUN) for t, tag in posTokens
                  if get_wordnet_pos(tag) == wordnet.NOUN]
    elif lemmatize:
        # Without a POS argument, the lemmatizer treats every token as a noun
        tokens = [wnl.lemmatize(t) for t in tokens]
    
    setTokens = set(tokens)
    setTokens.discard(excluded)  # no-op when excluded is None or not in the set
    return setTokens

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\viet_\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [16]:
# Q4 - ANSWER - part b)

# TEST YOUR METHOD 
print(bow("There is a lot of food on the table", excluded='table', lowercase=True, lemmatize=True, nounsOnly=True))

# Your example 1 - Lemmatization only
print(bow("She went to the play with her friend", excluded='play', lowercase=False, lemmatize=True, nounsOnly=False))

# Your example 2 - Nouns only
print(bow("They took a break from studying to eat lunch", excluded='break', lowercase=False, lemmatize=False, nounsOnly=True))


{'food', 'lot'}
{'She', 'her', 'friend', 'went', 'with', 'to', 'the'}
{'lunch'}


**(TO-DO: Q5)** TESTING BOW VARIATIONS IN LESK-LIKE DISAMBIGUATION

a) Redo the methods makeDefBOWs and findMostProbableSense to use the new parameters.  

b) Generate three example cases and test your disambiguation strategy programmed above.  An example case contains an ambiguous word (e.g. bank) and a sentence in which that word must be disambiguated (e.g. He sat on the bank throwing rocks in the water.).  

c) For your examples, which filtering seems to work better (with/without lemmatization, with/without focus only on nouns)?
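To compare the filters systematically for part c), the flag combinations can be enumerated in a loop rather than written out by hand. The sketch below is only a harness under stated assumptions: `compare_settings` is a name invented here, it expects a function with the same keyword parameters as `findMostProbableSense` (`lowercase`, `lemmatize`, `nounsOnly`), and `fake_disambiguate` is a made-up stand-in so the example runs on its own. Its return values are placeholders, not real results.

```python
from itertools import product

def compare_settings(disambiguate, word, sentence):
    """Run a disambiguation function under every lemmatize/nounsOnly combination."""
    results = {}
    for lemmatize, nouns_only in product([False, True], repeat=2):
        results[(lemmatize, nouns_only)] = disambiguate(
            word, sentence, lowercase=True, lemmatize=lemmatize, nounsOnly=nouns_only
        )
    return results

def fake_disambiguate(word, sentence, lowercase=False, lemmatize=False, nounsOnly=False):
    """Hypothetical stand-in; in the notebook, pass findMostProbableSense instead."""
    return ["sense_A"] if nounsOnly else ["sense_A", "sense_B"]

table = compare_settings(
    fake_disambiguate, "sign", "The deaf girl made a gesture for hello in sign language."
)
for (lemmatize, nouns_only), senses in table.items():
    print(f"lemmatize={lemmatize}, nounsOnly={nouns_only}: {senses}")
```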


In [17]:
# Q5 - ANSWER - part a)

# add the parameters to makeDefBOWs as well, same defaults
def makeDefBOWs(testWord, lowercase=False, lemmatize=False, nounsOnly=False):
    synsets = wordnet.synsets(testWord)
    defs = [s.definition() for s in synsets]
    bows = [bow(d, excluded=testWord, lowercase=lowercase, lemmatize=lemmatize, nounsOnly=nounsOnly) for d in defs]
    return bows
   

def findMostProbableSense(word, text, lowercase=False, lemmatize=False, nounsOnly=False):
    bows = makeDefBOWs(word, lowercase, lemmatize, nounsOnly)
    textBOW = bow(text, excluded=word, lowercase=lowercase, lemmatize=lemmatize, nounsOnly=nounsOnly)
    max_overlap = 0
    max_synsets = [] #list of synsets  
    olaps = [] # overlap size for each sense, in the same order as senses
  
    # find senses with max overlap
    senses = wordnet.synsets(word)
    for i in range(len(senses)):
        overlap = bowOverlap(bows[i], textBOW) 
        olaps.append(overlap) # Keep every overlap size so all senses tied for the max can be found
                
        # Print statements for verifying
        print("Overlap: ",overlap)
        print("BOWS: ",bows[i])
        print("textBOW: ", textBOW, "\n")
        
        if overlap > max_overlap:        
            max_overlap = overlap
    
    # Add all synsets w/ max intersection        
    for j in range(len(olaps)):
        if max_overlap == olaps[j]:
            max_synsets.append(senses[j]) 
            
    return max_synsets
   

In [18]:
# Q5 - ANSWER - part b)
testWord = "table"
testSentence = "There is a lot of food on the table."

chosenSynsets = findMostProbableSense(testWord, testSentence, lowercase=True, lemmatize=True, nounsOnly=True)  

# print all the definitions of the most probable senses
for s in chosenSynsets:
    printBasicSynsetInfo(s)

    
# Your example 1 - with lemmatization only, without only nouns
print("\nEx 1. Lemmatization only\n")

testWord_1 = "sign"
testSentence_1 = "The deaf girl made a gesture for hello in sign language."

chosenSynsets_1 = findMostProbableSense(testWord_1, testSentence_1, lowercase=True, lemmatize=True, nounsOnly=False)  
for s in chosenSynsets_1:
    printBasicSynsetInfo(s)
    
# Your example 2 - without lemmatization, with only nouns
print("\nEx 2. Nouns only\n")
chosenSynsets_2 = findMostProbableSense(testWord_1, testSentence_1, lowercase=True, lemmatize=False, nounsOnly=True)  
for s in chosenSynsets_2:
    printBasicSynsetInfo(s) 
    
# Your example 3 - with lemmatization and only nouns
print("\nEx 3. Lemmatization & nouns only\n")
chosenSynsets_3 = findMostProbableSense(testWord_1, testSentence_1, lowercase=True, lemmatize=True, nounsOnly=True)
for s in chosenSynsets_3:
    printBasicSynsetInfo(s)

set()
Overlap:  0
BOWS:  {'set', 'row', 'data', 'column'}
textBOW:  {'food', 'lot'} 

set()
Overlap:  0
BOWS:  {'furniture', 'piece', 'leg', 'top'}
textBOW:  {'food', 'lot'} 

set()
Overlap:  0
BOWS:  {'furniture', 'piece', 'tableware', 'meal'}
textBOW:  {'food', 'lot'} 

set()
Overlap:  0
BOWS:  {'tableland', 'edge'}
textBOW:  {'food', 'lot'} 

set()
Overlap:  0
BOWS:  {'game', 'company', 'people', 'meal'}
textBOW:  {'food', 'lot'} 

{'food'}
Overlap:  1
BOWS:  {'food', 'meal'}
textBOW:  {'food', 'lot'} 

set()
Overlap:  0
BOWS:  {'time'}
textBOW:  {'food', 'lot'} 

set()
Overlap:  0
BOWS:  {'enter', 'arrange', 'form'}
textBOW:  {'food', 'lot'} 

SynLemmas
[Lemma('board.n.04.board'), Lemma('board.n.04.table')]
Synonyms
['board', 'table']
Definition
food or meals in general

Ex 1. Lemmatization only

{'a'}
Overlap:  1
BOWS:  {'something', 'indication', 'of', 'visible', 'clue', 'ha', 'immediately', 'perceptible', '(', 'that', 'not', ')', 'a', 'apparent', 'happened'}
textBOW:  {'deaf', '

set()
Overlap:  0
BOWS:  {'something', 'indication', 'clue'}
textBOW:  {'deaf', 'gesture', 'language', 'girl', 'hello'} 

set()
Overlap:  0
BOWS:  {'display', 'message'}
textBOW:  {'deaf', 'gesture', 'language', 'girl', 'hello'} 

{'gesture'}
Overlap:  1
BOWS:  {'action', 'gesture', 'message'}
textBOW:  {'deaf', 'gesture', 'language', 'girl', 'hello'} 

set()
Overlap:  0
BOWS:  {'structure', 'board', 'advertisement'}
textBOW:  {'deaf', 'gesture', 'language', 'girl', 'hello'} 

set()
Overlap:  0
BOWS:  {'zodiac', 'area', 'astrology'}
textBOW:  {'deaf', 'gesture', 'language', 'girl', 'hello'} 

set()
Overlap:  0
BOWS:  {'disease', 'disorder', 'presence', 'evidence', 'medicine'}
textBOW:  {'deaf', 'gesture', 'language', 'girl', 'hello'} 

set()
Overlap:  0
BOWS:  {'distinction', 'charge', 'pole'}
textBOW:  {'deaf', 'gesture', 'language', 'girl', 'hello'} 

set()
Overlap:  0
BOWS:  {'thing', 'event'}
textBOW:  {'deaf', 'gesture', 'language', 'girl', 'hello'} 

{'gesture', 'language'}
Overl

*Q5 - ANSWER - part c)*  
Ambiguous word used: sign  
In a sentence: "The deaf girl made a gesture for hello in sign language."

Case 1: Lemmatization only  
Definitions:
- a gesture that is part of a sign language  
- make the sign of the cross over someone in order to call on God for protection; consecrate  
- used of the language of the deaf  

Case 2: Nouns only  
Definitions:  
- a gesture that is part of a sign language  
- used of the language of the deaf  

Case 3: Lemmatization & nouns  
Definitions:  
- a gesture that is part of a sign language  
- used of the language of the deaf  

In the cases above, filtering by nouns only (Case 2) selected the most appropriate definitions for the context by focusing on nouns such as 'language' and 'gesture'. Lemmatization helps when words are not in their base form, but in Case 1 an incorrect definition was also chosen because of matches on the prepositions 'for' and 'in' and the article 'the'.

#### Signature

I, -------NamChi Nguyen--------------, declare that the answers provided in this notebook are my own.