## Exercises
https://web.stanford.edu/~jurafsky/slp3/C.pdf

In [1]:
import nltk
from nltk.corpus import wordnet as wn

from functools import reduce

C.1. Collect a small corpus of example sentences of varying lengths from any
newspaper or magazine. Using WordNet or any standard dictionary, determine how many senses there are for each of the open-class words in each sentence. How many distinct combinations of senses are there for each sentence?
How does this number seem to vary with sentence length?

> But even in places where the virus is under control, schools lack the means to safely provide full-time instruction. In New York City, the nation’s largest school district says that it can only safely provide a few days each week of in-person instruction.
https://www.nytimes.com/2020/07/10/opinion/coronavirus-schools-reopening.html

In [2]:
sentence_1 = "But even in places where the virus is under control, schools lack the means to safely provide full-time instruction."
sentence_2 = "In New York City, the nation’s largest school district says that it can only safely provide a few days each week of in-person instruction."

words_1 = [word for word in nltk.tokenize.word_tokenize(sentence_1) if len(wn.synsets(word)) != 0]
words_2 = [word for word in nltk.tokenize.word_tokenize(sentence_2) if len(wn.synsets(word)) != 0]

nb_senses_1 = [len(wn.synsets(word)) for word in words_1]
nb_senses_2 = [len(wn.synsets(word)) for word in words_2]

combinations_1 = reduce(lambda a,b: a*b, nb_senses_1)
combinations_2 = reduce(lambda a,b: a*b, nb_senses_2)

print("length 1st sentence:", len(nltk.tokenize.word_tokenize(sentence_1)))
print("length 2nd sentence:", len(nltk.tokenize.word_tokenize(sentence_2)))
print("combinations for the 1st sentence:", combinations_1)
print("combinations for the 2nd sentence:", combinations_2)
# divide by the number of tokens (len(sentence_1) would count characters)
print("1st sentence, combinations per token:", combinations_1 // len(nltk.tokenize.word_tokenize(sentence_1)))
print("2nd sentence, combinations per token:", combinations_2 // len(nltk.tokenize.word_tokenize(sentence_2)))

length 1st sentence: 21
length 2nd sentence: 28
combinations for the 1st sentence: 286289203200
combinations for the 2nd sentence: 18927077621760
1st sentence, combinations per token: 13632819200
2nd sentence, combinations per token: 675967057920
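
Since the number of combinations is a product of per-word sense counts, it should grow roughly exponentially with sentence length, so a log scale may be easier to compare than the raw counts. The sketch below is only an illustration: it reuses the token counts and combination counts printed above, and the helper names `log10_combinations` and `mean_senses_per_token` are my own, not part of NLTK.

```python
import math

# Figures copied from the output above: (number of tokens, sense combinations).
results = {
    "sentence 1": (21, 286289203200),
    "sentence 2": (28, 18927077621760),
}

def log10_combinations(combinations: int) -> float:
    """Order of magnitude of the number of sense combinations."""
    return math.log10(combinations)

def mean_senses_per_token(n_tokens: int, combinations: int) -> float:
    """Geometric mean of the per-token sense counts.

    Tokens without a WordNet entry are treated as contributing a factor of 1.
    """
    return combinations ** (1 / n_tokens)

for name, (n_tokens, combinations) in results.items():
    print(
        name,
        "log10(combinations):", round(log10_combinations(combinations), 2),
        "mean senses per token:", round(mean_senses_per_token(n_tokens, combinations), 2),
    )
```

With only two sentences this is a rough indication at best; a larger sample of sentences would be needed to say how the count really varies with length.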


C.2. Using WordNet or a standard reference dictionary, tag each open-class word
in your corpus with its correct tag. Was choosing the correct sense always a
straightforward task? Report on any difficulties you encountered.

In [3]:
# keep only the ambiguous words (more than one sense); single-sense words need no disambiguation

words_1 = [word for word in words_1 if len(wn.synsets(word)) > 1]
words_2 = [word for word in words_2 if len(wn.synsets(word)) > 1]

senses_1 = [wn.synsets(word) for word in words_1]
senses_2 = [wn.synsets(word) for word in words_2]

#### sentence 1

- even$^{11}$
- in -> not a noun/verb/adjective/adverb
- place$^4$
- virus$^1$
- is$^1$
- under$^4$
- control$^1$
- schools$^1$
- lack$^1$
- means$^2$
- provide$^2$
- full-time$^1$
- instruction$^3$

--------------------------

#### sentence 2

- In -> not a noun/verb/adjective/adverb
- New -> part of New York City
- City -> part of New York City
- nation$^1$
- s -> not a noun/verb/adjective/adverb
- largest$^1$
- school$^1$
- district$^1$
- says$^3$
- can -> not a noun/verb/adjective/adverb
- only$^3$
- provide$^2$
- a -> not a noun/verb/adjective/adverb
- few$^2$
- days$^2$
- each$^1$
- week$^2$
- instruction$^3$

--------------------------

Choosing the correct sense was not always straightforward. I found that the definition alone is often not enough for me to determine the sense of a word; the usage examples often help more (English is not my native language).

C.3. Using your favorite dictionary, simulate the original Lesk word overlap disambiguation algorithm described on page 16 on the phrase _Time flies like an arrow_. Assume that the words are to be disambiguated one at a time, from left to right, and that the results from earlier decisions are used later in the process.
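
One way to simulate this by hand is sketched below. It is a minimal illustration under stated assumptions, not the textbook's reference procedure: the glosses in `TOY_GLOSSES` are hypothetical and hand-written (a real run would copy them from a dictionary), `STOP_WORDS` is an arbitrary small list, and the function names are my own. Each word is compared against the glosses of its neighbours; for words already disambiguated only the chosen gloss is used, and for words still to come all of their glosses are used.

```python
# Hypothetical, hand-written glosses; a real run would copy them from a dictionary.
TOY_GLOSSES = {
    "time": [
        "the continuum of experience in which events pass from future to past",
        "a period an inmate serves in prison",
    ],
    "flies": [
        "small winged insects",
        "passes quickly, as time or events do",
    ],
    "arrow": [
        "a projectile shot from a bow that moves quickly",
        "a mark that points in a direction",
    ],
}
# Arbitrary small stop list (an assumption, not taken from the textbook).
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "from", "that", "as", "or", "do", "which"}

def content_words(gloss):
    """Lowercased gloss words without punctuation or stop words."""
    return {w.strip(",.").lower() for w in gloss.split()} - STOP_WORDS

def original_lesk(words, glosses):
    """Disambiguate left to right; return {word: index of the chosen gloss}."""
    chosen = {}
    for i, word in enumerate(words):
        if word not in glosses:
            continue  # no dictionary entry (here: "like", "an")
        context = set()
        for j, other in enumerate(words):
            if j == i or other not in glosses:
                continue
            if other in chosen:
                # earlier decision: only the selected gloss counts
                context |= content_words(glosses[other][chosen[other]])
            else:
                # not decided yet: every gloss of the neighbour counts
                for gloss in glosses[other]:
                    context |= content_words(gloss)
        overlaps = [len(content_words(gloss) & context) for gloss in glosses[word]]
        chosen[word] = overlaps.index(max(overlaps))  # first sense wins ties
    return chosen

print(original_lesk("time flies like an arrow".split(), TOY_GLOSSES))
```

The result depends entirely on the wording of the glosses, which is the main weakness of the approach: with short dictionary definitions the overlaps are often zero and the first sense wins by default.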

C.4. Build an implementation of your solution to the previous exercise. Using WordNet, implement the original Lesk word overlap disambiguation algorithm described on page 16 on the phrase _Time flies like an arrow_.

In [4]:
def compute_overlap(sentence: list, word: str, num_sense: int) -> int:
    """Count the sentence words found in the definition and examples of one sense of `word`."""
    synset = wn.synsets(word)[num_sense]
    definition = synset.definition()
    # note: `w in definition` is a substring test on a string, so "an" also matches "instance"
    count = len([w for w in sentence if w in definition])
    count += len([w for w in sentence for example in synset.examples() if w in example])

    return count

def simplified_lesk(sentence: list, word: str) -> int:
    """Return the index of the sense of `word` with the largest overlap (the first one on ties)."""
    overlap = [compute_overlap(sentence, word, i) for i in range(len(wn.synsets(word)))]
    max_overlap = max(overlap)

    return overlap.index(max_overlap)

In [6]:
sentence = nltk.tokenize.word_tokenize("Time flies like an arrow")

[(word, simplified_lesk(sentence, word)) for word in sentence]

[('Time', 1), ('flies', 13), ('like', 7), ('an', 0), ('arrow', 1)]
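
These indices should be read with caution. In `compute_overlap`, `w in definition` tests whether `w` is a substring of the gloss, so a short word such as "an" matches inside longer words, and the target word itself is also counted. The sketch below contrasts that with a token-based overlap. It is an illustration only: the gloss string is made up, and the names `substring_overlap` and `token_overlap` are my own, not part of NLTK.

```python
import re

def substring_overlap(sentence, gloss):
    """Overlap as counted above: a word matches if it occurs anywhere in the gloss string."""
    return sum(1 for w in sentence if w in gloss)

def token_overlap(sentence, gloss, target):
    """Overlap on whole lowercased tokens, ignoring the word being disambiguated."""
    gloss_tokens = set(re.findall(r"[a-z]+", gloss.lower()))
    context = {w.lower() for w in sentence} - {target.lower()}
    return len(context & gloss_tokens)

sentence = ["Time", "flies", "like", "an", "arrow"]
gloss = "a significant instance"  # made-up gloss, for illustration only

# "an" occurs inside "significant" and "instance", but not as a word of its own
print(substring_overlap(sentence, gloss))
print(token_overlap(sentence, gloss, target="Time"))
```

Swapping a token-based count into `simplified_lesk` would probably change the senses selected above, so the printed indices would need to be regenerated.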