# Taaltheorie en Taalverwerking · 2019 · Assignment 5

NLTK has an interface for working with WordNet in more detail than is possible with the web interface alone. In this assignment, we will explore that interface further.

As you are becoming more comfortable with NLTK, this assignment will involve some searching through the documentation yourself in order to find the best functions or methods for your needs. Specifically, you will want to look at:

  * *Natural Language Processing with Python* (The NLTK Book), Chapter 2, Section 5 (http://www.nltk.org/book/ch02.html)
  * The `wordnet` module documentation (http://www.nltk.org/api/nltk.corpus.reader.html#module-nltk.corpus.reader.wordnet)
  * The WordNet HOWTO for NLTK (http://www.nltk.org/howto/wordnet.html)

In [None]:
# FILL THIS IN FOR YOUR GROUP, also name your file as: tttv-w19-<group>-<name1>-<name2>.ipynb

# Group        :
# Name - UvaID :
# Name - UvaID :
# Date         :

### Loading the WordNet interface

The first time that you use WordNet, you will need to download it. You may comment out the `nltk.download('wordnet')` line in the cell below once WordNet is working on your machine (but it does no harm to leave it there).

You will also need the `math` module for this assignment, and so let's load that now, too.

In [None]:
from math import inf, log  # the names this assignment needs; avoids a wildcard import
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')

### Using WordNet

Most WordNet queries start by retrieving the list of synsets for a particular word. NLTK allows these queries to be restricted to a specific part of speech, for example `wn.NOUN`, `wn.VERB`, `wn.ADJ`, or `wn.ADV`.

In [None]:
wn.synsets('bank')

# [Synset('bank.n.01'),
#  Synset('depository_financial_institution.n.01'),
#  Synset('bank.n.03'),
#  Synset('bank.n.04'),
#  Synset('bank.n.05'),
#  Synset('bank.n.06'),
#  Synset('bank.n.07'),
#  Synset('savings_bank.n.02'),
#  Synset('bank.n.09'),
#  Synset('bank.n.10'),
#  Synset('bank.v.01'),
#  Synset('bank.v.02'),
#  Synset('bank.v.03'),
#  Synset('bank.v.04'),
#  Synset('bank.v.05'),
#  Synset('deposit.v.02'),
#  Synset('bank.v.07'),
#  Synset('trust.v.01')]

In [None]:
wn.synsets('bank', pos = wn.NOUN)

# [Synset('bank.n.01'),
#  Synset('depository_financial_institution.n.01'),
#  Synset('bank.n.03'),
#  Synset('bank.n.04'),
#  Synset('bank.n.05'),
#  Synset('bank.n.06'),
#  Synset('bank.n.07'),
#  Synset('savings_bank.n.02'),
#  Synset('bank.n.09'),
#  Synset('bank.n.10')]

### Question 1: Word similarity based on WordNet path length  (6 pts)

Although NLTK provides `path_similarity()`, a function (and a method) for computing path-length similarity in WordNet, its definition differs from the definitions of path-length similarity in our textbook. In particular, NLTK defines similarity only between synsets, not between whole words, and it normalises the similarity measure to fall between zero and one.
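To make the difference concrete, the sketch below compares the two scoring schemes as functions of the shortest path distance `d` (the number of edges) between two synsets. It assumes that NLTK's normalised score is computed as `1 / (d + 1)`; the function names `nltk_style_score` and `textbook_style_score` are hypothetical and serve only as illustration, not as part of NLTK.

```python
import math


def nltk_style_score(distance):
    """Normalised score in (0, 1], assumed to mirror NLTK's path_similarity()."""
    return 1.0 / (distance + 1)


def textbook_style_score(distance):
    """Unnormalised score -log(distance); only defined here for distance >= 1."""
    return -math.log(distance)


# Print both scores side by side for a few path distances.
for d in range(1, 6):
    print(d, round(nltk_style_score(d), 2), round(textbook_style_score(d), 2))
```

Both scores decrease as the path gets longer, but only the normalised one stays between zero and one; the unnormalised score is zero or negative for any distance of at least one edge.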

Write a function `textbook_similarity()` that computes the similarity between two words, represented as Python strings, using the definitions of Equations 20.19 and 20.20 (2nd edition) or Equations 17.21 and 17.33 (3rd edition) in your textbook. In other words, return the maximum similarity across all pairs of senses of the two words, where similarity is defined to be `-log(shortest_path_length)`. Your function should include an argument to restrict the search to a specific part of speech. 

**N.B.:** Your function will require special treatment if the two words are the same (or part of the same synset), in which case the similarity should be `inf`, or if there is no path at all between the two words, in which case the similarity should be `-inf`. You may want to write a helper function to make this conversion for you.
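One possible shape for such a helper is sketched below. It assumes that distances come from something like NLTK's `Synset.shortest_path_distance()`, which returns `None` when no path exists; the names `distance_to_similarity` and `best_similarity` are hypothetical, and the distances in the usage lines are made up rather than taken from WordNet.

```python
import math


def distance_to_similarity(distance):
    """Convert a shortest path distance (or None) into a textbook-style similarity."""
    if distance is None:   # no path between the two synsets
        return -math.inf
    if distance == 0:      # identical synsets
        return math.inf
    return -math.log(distance)


def best_similarity(distances):
    """Return the maximum similarity over all sense pairs; -inf if there are none."""
    return max((distance_to_similarity(d) for d in distances), default=-math.inf)


# Made-up distances for three sense pairs; None means "no path".
print(best_similarity([7, None, 5]))
# A word with no senses for the requested part of speech yields no pairs at all.
print(best_similarity([]))
```

Keeping the conversion separate from the search over sense pairs means the special cases are handled in one place, and the `default` argument covers the situation where one of the words has no synsets for the requested part of speech.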

In [None]:
def textbook_similarity(word1, word2, pos):
    #--> Replace the fake return statement with your implementation.
    return 0.0

print(textbook_similarity('port', 'port', wn.NOUN))  # Should be inf
print(textbook_similarity('port', 'bank', wn.NOUN))  # Should be -1.61
print(textbook_similarity('port', 'bank', wn.VERB))  # Should be -1.39
print(textbook_similarity('port', 'drink', wn.VERB)) # Should be 0.00
print(textbook_similarity('port', 'couch', wn.VERB)) # Should be -inf

### Question 2: The Lesk family

#### 2.1 Simplified Lesk (4 pts)

Implement the *simplified* Lesk algorithm to disambiguate words in the context of single sentences. Implement the algorithm as a function that accepts a word and its context (the sentence). Your function should return an object of the NLTK WordNet `Synset` class.

**Hint:** Use the pseudo-code in your textbook as a model. The NLTK documentation will help you extract glosses (definitions) and examples. You may need to experiment with the interface before writing your function, so that you understand exactly how NLTK returns information from WordNet.
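The core of the algorithm, counting the content words that a sense's gloss shares with the sentence, can be tried out without WordNet first. The sketch below uses made-up sense labels and glosses (`TOY_GLOSSES`) and a tiny stopword list (`TOY_STOPWORDS`); all names are hypothetical. In your own function, the glosses and examples would instead come from NLTK, for instance through `Synset.definition()` and `Synset.examples()`.

```python
# Tiny illustrative stopword list; the assignment provides a fuller STOPWORDS set.
TOY_STOPWORDS = {'a', 'the', 'of', 'that', 'and', 'into', 'in', 'it', 'will', 'can'}

# Made-up glosses keyed by made-up sense labels (not real WordNet data).
TOY_GLOSSES = {
    'bank_river': 'sloping land beside a body of water',
    'bank_money': 'a financial institution that accepts deposits and lends money',
}


def content_words(text):
    """Lower-case the text, split on whitespace, and drop stopwords."""
    return {token for token in text.lower().split() if token not in TOY_STOPWORDS}


def best_sense_by_overlap(glosses, sentence):
    """Pick the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence)
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(content_words(gloss) & context)
        if overlap > best_overlap:   # ties keep the earlier sense
            best_sense, best_overlap = sense, overlap
    return best_sense


print(best_sense_by_overlap(
    TOY_GLOSSES, 'the bank can guarantee deposits will cover tuition costs'))
```

When no sense overlaps with the context at all, this sketch falls back to the first sense in the dictionary, which is one way to mimic a "most frequent sense" default if the senses are listed in frequency order.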

In [None]:
STOPWORDS = {
    'a', 'able', 'about', 'across', 'after', 'all', 'almost', 'also', 
    'am', 'among', 'an', 'and', 'any', 'are', 'as', 'at', 
    'be', 'because', 'been', 'but', 'by', 'can', 'cannot', 'could', 
    'dear', 'did', 'do', 'does', 'either', 'else', 'ever', 'every', 
    'for', 'from', 'get', 'got', 'had', 'has', 'have', 'he', 
    'her', 'hers', 'him', 'his', 'how', 'however', 'i', 'if', 
    'in', 'into', 'is', 'it', 'its', 'just', 'least', 'let', 
    'like', 'likely', 'may', 'me', 'might', 'most', 'must', 
    'my', 'neither', 'no', 'nor', 'not', 'of', 'off', 'often', 
    'on', 'only', 'or', 'other', 'our', 'own', 'rather', 'said', 
    'say', 'says', 'she', 'should', 'since', 'so', 'some', 'than', 
    'that', 'the', 'their', 'them', 'then', 'there', 'these', 'they', 
    'this', 'tis', 'to', 'too', 'twas', 'us', 'wants', 'was', 
    'we', 'were', 'what', 'when', 'where', 'which', 'while', 'who', 
    'whom', 'why', 'will', 'with', 'would', 'yet', 'you', 'your'}

def simplified_lesk(word, sentence):
    #--> Replace the pass statement with your implementation.
    pass

# Example 20.10 from the textbook.
# Should return Synset('depository_financial_institution.n.01')
# Double-check your implementation against the expected outputs in 2.3
print(simplified_lesk(
    'bank',
    'the bank can guarantee deposits will eventually cover future '
    'tuition costs because it invests in adjustable-rate mortgage securities'))

#### 2.2 Original Lesk (2 pts)

Implement the *original* Lesk algorithm to disambiguate words in the context of single sentences. Implement the algorithm as a function that accepts a word and its context (the sentence).

**Hint:** Use your simplified Lesk function as a basis. You may find it handy to write a helper function that computes signatures from word senses. 
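As a rough illustration of the idea, the sketch below compares the signature of each sense of the target word with the combined signatures of all senses of the other words in the context. It is only one of several ways to read the original algorithm, and it runs on made-up glosses loosely modelled on the classic "pine cone" example; `TOY_SENSES`, `signature`, and `original_lesk_sketch` are hypothetical names, not NLTK functions.

```python
TOY_STOPWORDS = {'a', 'an', 'the', 'of', 'or', 'that', 'and', 'in', 'to', 'from'}

# Made-up glosses per word and sense label (not real WordNet data).
TOY_SENSES = {
    'pine': {
        'pine_tree': 'kind of evergreen tree with needle shaped leaves',
        'pine_yearn': 'waste away through sorrow or illness',
    },
    'cone': {
        'cone_shape': 'solid body which narrows to a point',
        'cone_fruit': 'fruit of certain evergreen tree',
    },
}


def signature(gloss):
    """Return the set of content words in a gloss."""
    return {token for token in gloss.lower().split() if token not in TOY_STOPWORDS}


def original_lesk_sketch(word, context_words, senses):
    """Pick the sense of `word` whose signature overlaps most with the
    signatures of all senses of the other context words."""
    context_signature = set()
    for other in context_words:
        if other != word:
            for gloss in senses.get(other, {}).values():
                context_signature |= signature(gloss)
    # max() keeps the first sense when several senses tie.
    return max(senses[word],
               key=lambda sense: len(signature(senses[word][sense]) & context_signature))


print(original_lesk_sketch('pine', ['pine', 'cone'], TOY_SENSES))
print(original_lesk_sketch('cone', ['pine', 'cone'], TOY_SENSES))
```

The only structural change from the simplified variant is what the sense signature is compared against: the signatures of the neighbouring words' senses rather than the words of the sentence itself.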

In [None]:
def original_lesk(word, sentence):
    #--> Replace the pass statement with your implementation.
    pass

# Exercise 20.4 from the textbook (and an example from the original Lesk paper)
print(original_lesk('time', 'time flies like an arrow'))  # Should return time.n.02
print(original_lesk('flies', 'time flies like an arrow')) # Should return flies.n.01
print(original_lesk('arrow', 'time flies like an arrow')) # Should return arrow.n.02

#### 2.3 Comparing the simplified and original Lesk algorithms (2 pts)

In [None]:
print(simplified_lesk('time', 'time flies like an arrow'))  # Should return time.n.01 
print(simplified_lesk('flies', 'time flies like an arrow')) # Should return fly.v.08
print(simplified_lesk('arrow', 'time flies like an arrow')) # Should return arrow.n.01
# If the function returns other answers, check your simplified Lesk algorithm for mistakes.

The two Lesk variants choose a different sense for every word tested in the famous sentence 'Time flies like an arrow.' Examine their output. Does either algorithm disambiguate the sentence correctly? Which version do you think does a better job, and why?

**ANSWER:** *Double-click on this Markdown cell and replace this text with your answer.*