<a href="https://colab.research.google.com/github/gonzalovaldenebro/NaturalLanguageProcessing-Portfolio/blob/main/F4_2_WordSenseDisambiguation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 195: Natural Language Processing
## Word Sense Disambiguation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F4_2_WordSenseDisambiguation.ipynb)


## References

Word Senses and WordNet, Chapter 23 of *Speech and Language Processing* by Daniel Jurafsky & James H. Martin: https://web.stanford.edu/~jurafsky/slp3/23.pdf

WordNet documentation: https://www.nltk.org/api/nltk.corpus.reader.wordnet.html

SemCor Corpus Module documentation: https://www.nltk.org/api/nltk.corpus.reader.semcor.html

NLTK Stopwords: https://pythonspot.com/nltk-stop-words/

Lemmatization with NLTK: https://www.geeksforgeeks.org/python-lemmatization-with-nltk/

In [1]:
import sys
!{sys.executable} -m pip install nltk



In [2]:
#you shouldn't need to do this in Colab, but I had to do it on my own machine
#in order to connect to the nltk service
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context


## Word Sense Disambiguation

As we explored last time, one word can have many *senses*.

The **WordNet** database can be used to look up different word senses of a particular word.

The task of figuring out which sense is being used in a given context is called **word sense disambiguation** (WSD).

It is important for:
* extracting proper meaning from text
* translation - e.g., different senses of one word in English might have different translations
* question answering

## Typical approach for WSD

Look at the *context* of a word - what other words are around it

For example, consider the word **bank** in

"I need to go to the bank and deposit my paycheck."

We can determine from *deposit*, *paycheck*, and maybe even *go to* that we're talking about a financial institution and not a river bank.

Which definition does the context share the most words with?

*Definition 1:* 'sloping land (especially the slope beside a body of water)'

*Definition 2:* 'a financial institution that accepts deposits and channels the money into lending activities'


In [3]:
def compute_overlap(set1, set2):
    count_overlap = 0
    for item in set1:
        if item in set2:
            count_overlap += 1
    return count_overlap


sentence = ["i", "need", "to", "go", "to", "the", "bank", "and", "deposit", "my", "paycheck"]
definition1 = ["sloping", "land", "especially", "the", "slope", "beside", "a", "body", "of", "water"]
definition2 = ["a", "financial", "institution", "that", "accepts", "deposits", "and", "channels", "the", "money", "into", "lending", "activities"]

print(compute_overlap(sentence,definition1))
print(compute_overlap(sentence,definition2))

1
2
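
To see what is behind those two counts, it can help to print the shared words themselves instead of just counting them. This is a small sketch using Python's built-in set intersection on the same token lists; the names `shared1` and `shared2` are introduced here only for illustration.

```python
sentence = ["i", "need", "to", "go", "to", "the", "bank", "and", "deposit", "my", "paycheck"]
definition1 = ["sloping", "land", "especially", "the", "slope", "beside", "a", "body", "of", "water"]
definition2 = ["a", "financial", "institution", "that", "accepts", "deposits", "and", "channels", "the", "money", "into", "lending", "activities"]

# Set intersection keeps only the words that appear in both lists
shared1 = set(sentence) & set(definition1)
shared2 = set(sentence) & set(definition2)

print(sorted(shared1))  # ['the']
print(sorted(shared2))  # ['and', 'the']
```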


### Discuss: What problems do you see with this approach?

## The Simplified Lesk Algorithm

The **Simplified Lesk Algorithm** loops over all possible word senses to find the one whose definition/examples share the most words in common with the sentence context.

Given a `word` and `sentence`
1. Make a *set* of all the words in the sentence (may need to tokenize)
2. Look up all the `synsets` for `word` in **WordNet**
3. Loop through the list of `synsets`
    * create a signature - the set of all the words that appear in the definition and list of examples for this synset from **WordNet** (may need to tokenize)
    * compute the overlap between the signature and the word context
    * if this is better than the previous best overlap, save the new sense
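
The steps above can be sketched without WordNet at all. The version below is only an illustration: `TOY_SENSES` is a hypothetical, hand-written sense inventory (with shortened glosses), `toy_simplified_lesk` is a made-up name, and tokenization is a plain `split()` rather than a real tokenizer.

```python
# Hypothetical mini sense inventory standing in for WordNet (assumption, not real WordNet data)
TOY_SENSES = {
    "bank": {
        "bank.river": "sloping land beside a body of water",
        "bank.finance": "a financial institution that accepts deposits and lends money",
    }
}

def toy_simplified_lesk(word, sentence, senses=TOY_SENSES):
    """Return the sense key whose gloss shares the most words with the sentence."""
    context = set(sentence.lower().split())        # step 1: set of context words
    best_sense, best_overlap = None, 0
    for sense, gloss in senses.get(word, {}).items():  # steps 2-3: loop over the senses
        signature = set(gloss.lower().split())     # signature built from the gloss
        overlap = len(signature & context)         # words shared with the context
        if overlap > best_overlap:                 # keep only strict improvements
            best_sense, best_overlap = sense, overlap
    return best_sense

print(toy_simplified_lesk("bank", "she accepts cash deposits at the bank"))  # bank.finance
```

Because only strict improvements are kept, the function returns `None` when no sense shares any word with the sentence.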

In [4]:
from nltk.corpus import wordnet as wn
#nltk.download('wordnet') #only need to do this once

def simplified_lesk(word,sentence):
    best_sense = 0

    #fill this in

    return best_sense

### Discuss: How should we tokenize our text data for this problem?

I think that we should:

1. Tokenize the input data
2. Sentence tokenization
3. Word tokenization
4. Synset retrieval
5. Signature creation
6. Word context tokenization
7. Overlap calculation
8. Sense disambiguation

### Group Exercise: Finish implementing this algorithm

In [8]:
from nltk.corpus import wordnet as wn
import nltk
nltk.download('punkt') # Only need to do this once
nltk.download('wordnet') #only need to do this once
from nltk.tokenize import word_tokenize

def simplified_lesk(word, sentence):
    best_sense = None
    max_overlap = 0

    # Tokenize the sentence
    sentence_words = set(word_tokenize(sentence))
    print('Here is the tokenized sentence: ',sentence_words)
    print('-----------------------------------------------')

    # Get the synsets for the target word
    word_synsets = wn.synsets(word)
    print('Word Synsets: ',word_synsets)
    print('-----------------------------------------------')

    for synset in word_synsets:
        # Create a signature for the current synset
        signature = set(word_tokenize(synset.definition()))

        for example in synset.examples():
            signature.update(word_tokenize(example))
            print('Signature: ',signature)

        # Calculate the overlap between the signature and the word context
        overlap = len(signature.intersection(sentence_words))

        # If the current overlap is better than the previous best, update best_sense
        if overlap > max_overlap:
            max_overlap = overlap
            best_sense = synset  # only update when this synset beats the previous best

    return best_sense

# Example usage:
word = "bank"
sentence = "I went to the bank to deposit my money."

best_sense = simplified_lesk(word, sentence)
if best_sense:
    print("Best Sense:", best_sense.name())
else:
    print("No sense found for the word in the given context.")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Here is the tokenized sentence:  {'my', 'the', 'I', 'deposit', 'bank', 'money', 'went', '.', 'to'}
-----------------------------------------------
Word Synsets:  [Synset('bank.n.01'), Synset('depository_financial_institution.n.01'), Synset('bank.n.03'), Synset('bank.n.04'), Synset('bank.n.05'), Synset('bank.n.06'), Synset('bank.n.07'), Synset('savings_bank.n.02'), Synset('bank.n.09'), Synset('bank.n.10'), Synset('bank.v.01'), Synset('bank.v.02'), Synset('bank.v.03'), Synset('bank.v.04'), Synset('bank.v.05'), Synset('deposit.v.02'), Synset('bank.v.07'), Synset('trust.v.01')]
-----------------------------------------------
Signature:  {')', 'sloping', 'on', 'the', 'water', 'canoe', 'beside', 'a', 'up', 'bank', 'land', 'pulled', '(', 'body', 'they', 'especially', 'of', 'slope'}
Signature:  {'bank', 'land', 'body', 'especially', 'of', 'sloping', 'currents', 'beside', 'river', 'watched', 'canoe', 'and', 'he', '(', 'they', ')', 'on', 'the', 'water', 'a', 'up', 'sat', 'pulled', 'slope'}
Signa

## Improving the algorithm

Two things we could do to try to improve the Lesk algorithm:

1. Remove tokens that don't carry meaning like punctuation and *stopwords* (words like "the", "is", "to", etc.)

2. Lemmatize the words - convert them into their base form

For example, lemmatization should let us match the word "deposit(s)" in
* "a financial institution that accepts **deposits** and channels the money into lending activities"
* "I need to go to the bank and **deposit** my paycheck."
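
As a rough illustration of both ideas on the two strings above, here is a sketch that uses only the standard library. Everything in it is an assumption made for the example: `TOY_STOPWORDS` is a tiny hand-picked list, and stripping a trailing "s" is a crude stand-in for real lemmatization, not what a proper lemmatizer does.

```python
import string

# Tiny hand-picked stopword list for illustration only (assumption)
TOY_STOPWORDS = {"a", "i", "to", "the", "and", "my", "that", "into"}

def naive_normalize(text):
    """Lowercase, strip punctuation, drop stopwords, and crudely strip a plural 's'."""
    tokens = set()
    for raw in text.lower().split():
        token = raw.strip(string.punctuation)
        if not token or token in TOY_STOPWORDS:
            continue
        # Crude stand-in for lemmatization: drop a trailing 's' (but not 'ss')
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
            token = token[:-1]
        tokens.add(token)
    return tokens

definition = "a financial institution that accepts deposits and channels the money into lending activities"
sentence = "I need to go to the bank and deposit my paycheck."

shared = naive_normalize(definition) & naive_normalize(sentence)
print(shared)  # {'deposit'}
```

The crude suffix rule also turns "activities" into "activitie", which is one reason to prefer a dictionary-based lemmatizer.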

## Stopwords Corpus



In [10]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords') #only need to do this once
stops = set(stopwords.words('english'))
print(stops)

{'their', 'few', 'with', 'any', 'o', 'ourselves', 'myself', "doesn't", 'there', 'she', "hadn't", 'if', 'some', 'nor', 'very', "you'll", 'during', "wasn't", 'again', 'my', 'does', 'further', 'haven', "shouldn't", 'we', 'an', 'because', 'same', 'is', 'they', 'under', 'no', 'were', 'not', 'am', 'on', 'why', 'having', 'the', "mustn't", "shan't", 'own', 'when', 'was', 'you', 'hers', 'up', 'but', "should've", 'yours', "it's", 'so', 'had', 'until', "needn't", 'these', 'both', 'our', "didn't", 'theirs', 'be', 'only', 'ain', 'being', 'couldn', 'then', 'most', 'about', 'it', 'after', 'shouldn', 'that', 'should', "won't", 'yourself', "you're", 'above', 'where', 'will', 'he', 'll', 'other', 'its', 'now', "weren't", 'here', 'too', 'which', 'mightn', 'a', "wouldn't", 'in', 'whom', 'from', 'isn', 'hasn', 'y', 'off', 'itself', 'at', 'over', 'out', 'her', 'who', 'themselves', 'aren', 'before', 'through', 'of', 'against', "that'll", 'once', 'needn', 'mustn', 'do', 'doesn', 'such', 'won', 'me', 're', 'yo

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## WordNet Lemmatizer

In [11]:
from nltk.stem import WordNetLemmatizer
#nltk.download('wordnet') #do it once

lemmatizer = WordNetLemmatizer()

print("deposit :", lemmatizer.lemmatize("deposit"))
print("deposits:", lemmatizer.lemmatize("deposits"))

deposit : deposit
deposits: deposit


## Exercise

Add stopword removal and lemmatization to your Lesk Algorithm implementation.

In [14]:
from nltk.corpus import wordnet as wn
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

#nltk.download('punkt')      # Only need to do this once
#nltk.download('wordnet')    # Only need to do this once
#nltk.download('stopwords')  # Only need to do this once

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def simplified_lesk(word, sentence):
    best_sense = None
    max_overlap = 0

    # Tokenize and lemmatize the sentence
    sentence_words = [lemmatizer.lemmatize(word) for word in word_tokenize(sentence) if word.lower() not in stop_words]
    print('-----------------------------------------------')
    print('Here is the tokenized and lemmatized sentence: ', sentence_words)
    print('-----------------------------------------------')

    # Get the synsets for the target word
    word_synsets = wn.synsets(word)
    print('Word Synsets: ', word_synsets)
    print('-----------------------------------------------')

    for synset in word_synsets:
        # Create a signature for the current synset
        definition = [lemmatizer.lemmatize(word) for word in word_tokenize(synset.definition())]
        signature = set(definition)

        for example in synset.examples():
            example_words = [lemmatizer.lemmatize(word) for word in word_tokenize(example)]
            signature.update(example_words)
            print('Signature: ', signature)

        # Calculate the overlap between the signature and the word context
        overlap = len(signature.intersection(sentence_words))

        # If the current overlap is better than the previous best, update best_sense
        if overlap > max_overlap:
            max_overlap = overlap
            best_sense = synset

    return best_sense

# Example usage:
word = "bank"
sentence = "I went to the bank to deposit my money."

best_sense = simplified_lesk(word, sentence)
if best_sense:
    print('-----------------------------------------------')
    print("Best Sense:", best_sense.name())
    print('-----------------------------------------------')
else:
    print('-----------------------------------------------')
    print("No sense found for the word in the given context.")
    print('-----------------------------------------------')


-----------------------------------------------
Here is the tokenized and lemmatized sentence:  ['went', 'bank', 'deposit', 'money', '.']
-----------------------------------------------
Word Synsets:  [Synset('bank.n.01'), Synset('depository_financial_institution.n.01'), Synset('bank.n.03'), Synset('bank.n.04'), Synset('bank.n.05'), Synset('bank.n.06'), Synset('bank.n.07'), Synset('savings_bank.n.02'), Synset('bank.n.09'), Synset('bank.n.10'), Synset('bank.v.01'), Synset('bank.v.02'), Synset('bank.v.03'), Synset('bank.v.04'), Synset('bank.v.05'), Synset('deposit.v.02'), Synset('bank.v.07'), Synset('trust.v.01')]
-----------------------------------------------
Signature:  {')', 'sloping', 'on', 'the', 'water', 'canoe', 'beside', 'a', 'up', 'bank', 'land', 'pulled', '(', 'body', 'they', 'especially', 'of', 'slope'}
Signature:  {'bank', 'land', 'body', 'especially', 'of', 'sloping', 'beside', 'river', 'watched', 'canoe', 'current', 'and', 'he', '(', 'they', ')', 'on', 'the', 'water', 'a',

## Dataset for evaluating WSD

The SemCor corpus in NLTK contains text that has been tagged with WordNet senses (mostly as `Lemma` objects).

In [15]:
import nltk
nltk.download('semcor') #do this once
from nltk.corpus import semcor

[nltk_data] Downloading package semcor to /root/nltk_data...


In [16]:
# Get a list of file identifiers in SemCor
file_ids = semcor.fileids()
print(file_ids) #looks like they're from the brown dataset

['brown1/tagfiles/br-a01.xml', 'brown1/tagfiles/br-a02.xml', 'brown1/tagfiles/br-a11.xml', 'brown1/tagfiles/br-a12.xml', 'brown1/tagfiles/br-a13.xml', 'brown1/tagfiles/br-a14.xml', 'brown1/tagfiles/br-a15.xml', 'brown1/tagfiles/br-b13.xml', 'brown1/tagfiles/br-b20.xml', 'brown1/tagfiles/br-c01.xml', 'brown1/tagfiles/br-c02.xml', 'brown1/tagfiles/br-c04.xml', 'brown1/tagfiles/br-d01.xml', 'brown1/tagfiles/br-d02.xml', 'brown1/tagfiles/br-d03.xml', 'brown1/tagfiles/br-d04.xml', 'brown1/tagfiles/br-e01.xml', 'brown1/tagfiles/br-e02.xml', 'brown1/tagfiles/br-e04.xml', 'brown1/tagfiles/br-e21.xml', 'brown1/tagfiles/br-e24.xml', 'brown1/tagfiles/br-e29.xml', 'brown1/tagfiles/br-f03.xml', 'brown1/tagfiles/br-f10.xml', 'brown1/tagfiles/br-f19.xml', 'brown1/tagfiles/br-f43.xml', 'brown1/tagfiles/br-g01.xml', 'brown1/tagfiles/br-g11.xml', 'brown1/tagfiles/br-g15.xml', 'brown1/tagfiles/br-h01.xml', 'brown1/tagfiles/br-j01.xml', 'brown1/tagfiles/br-j02.xml', 'brown1/tagfiles/br-j03.xml', 'brown1/t

In [17]:
# Access the sense-tagged sentences from a file
sentences = semcor.sents(file_ids[0])
print(sentences)

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', 'Atlanta', "'s", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term', 'end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]


In [18]:
# Access the sense tags for those sentences
tags = semcor.tagged_sents(file_ids[0],tag="sem")
print(tags)

[[['The'], Tree(Lemma('group.n.01.group'), [Tree('NE', ['Fulton', 'County', 'Grand', 'Jury'])]), Tree(Lemma('state.v.01.say'), ['said']), Tree(Lemma('friday.n.01.Friday'), ['Friday']), ['an'], Tree(Lemma('probe.n.01.investigation'), ['investigation']), ['of'], Tree(Lemma('atlanta.n.01.Atlanta'), ['Atlanta']), ["'s"], Tree(Lemma('late.s.03.recent'), ['recent']), Tree(Lemma('primary.n.01.primary_election'), ['primary', 'election']), Tree(Lemma('produce.v.04.produce'), ['produced']), ['``'], ['no'], Tree(Lemma('evidence.n.01.evidence'), ['evidence']), ["''"], ['that'], ['any'], Tree(Lemma('abnormality.n.04.irregularity'), ['irregularities']), Tree(Lemma('happen.v.01.take_place'), ['took', 'place']), ['.']], [['The'], Tree(Lemma('jury.n.01.jury'), ['jury']), Tree(Lemma('far.r.02.far'), ['further']), Tree(Lemma('state.v.01.say'), ['said']), ['in'], Tree(Lemma('term.n.02.term'), ['term']), Tree(Lemma('end.n.02.end'), ['end']), Tree(Lemma('presentment.n.01.presentment'), ['presentments']), ['

This is a complex format - notice that some (but not all!) of the words are grouped together in a tree structure.

In [19]:
# tags[0] holds the tags for the first sentence, sentences[0]
for tag in tags[0]:
    print(tag)

['The']
(Lemma('group.n.01.group') (NE Fulton County Grand Jury))
(Lemma('state.v.01.say') said)
(Lemma('friday.n.01.Friday') Friday)
['an']
(Lemma('probe.n.01.investigation') investigation)
['of']
(Lemma('atlanta.n.01.Atlanta') Atlanta)
["'s"]
(Lemma('late.s.03.recent') recent)
(Lemma('primary.n.01.primary_election') primary election)
(Lemma('produce.v.04.produce') produced)
['``']
['no']
(Lemma('evidence.n.01.evidence') evidence)
["''"]
['that']
['any']
(Lemma('abnormality.n.04.irregularity') irregularities)
(Lemma('happen.v.01.take_place') took place)
['.']


Notice
* Some tokens don't have a tag - stopwords, punctuation, etc. - these show up as a string inside a list
* "Fulton County Grand Jury" is grouped under Lemma('group.n.01.group')
* "primary election" is grouped as a compound word with Lemma('primary.n.01.primary_election')

This is going to be tough to work with. Here's an attempt to loop through them, match them up with the word from the sentence, and handle these issues.

In [20]:
# for keeping track of which word and tag we're on
word_idx = 0
tag_idx = 0

while tag_idx < len(tags[0]) and word_idx < len(sentences[0]):
    word = sentences[0][word_idx] #the current word
    tag = tags[0][tag_idx] #the tag for the current word

    # check for tags that got assigned to compound words like primary_election
    if len(tag) > 1:
        print("Word:",sentences[0][word_idx:(word_idx+len(tag))]) #all the words in the compound
        print("Tag:",tag)
        word_idx += len(tag) #move to the next word that isn't part of the compound

    # for Tree objects, check if it really tagged a word and not a group
    elif type(tag) is nltk.Tree and type(tag[0]) is str:
        print("Word:",word)
        print("Tag:",tag)

        # here's how we can get the synset for tags that give us a Lemma
        if  type(tag.label()) != str:
            actual_sense = tag.label().synset()
            #pred_sense = simplified_lesk(word," ".join(sentences[0]))
            #this is where you could check if you correctly matched the actual sense

        word_idx += 1 #advance to next word

    # check if it's a punctuation/stopword - if we got here, it means tag was not of type nltk.Tree
    elif type(tag[0]) is str:
        print("Word:",word)
        print("Tag:",tag)
        word_idx += 1

    # If we get here, it means the Tree contained a group of words, and we can count
    # how many with len( tag.leaves() )
    else:
        print("Word:",word)
        print("Tag:",tag)
        print("Words in this group:",tag.leaves())
        word_idx += len(tag.leaves())
    tag_idx += 1
    print()

Word: The
Tag: ['The']

Word: Fulton
Tag: (Lemma('group.n.01.group') (NE Fulton County Grand Jury))
Words in this group: ['Fulton', 'County', 'Grand', 'Jury']

Word: said
Tag: (Lemma('state.v.01.say') said)

Word: Friday
Tag: (Lemma('friday.n.01.Friday') Friday)

Word: an
Tag: ['an']

Word: investigation
Tag: (Lemma('probe.n.01.investigation') investigation)

Word: of
Tag: ['of']

Word: Atlanta
Tag: (Lemma('atlanta.n.01.Atlanta') Atlanta)

Word: 's
Tag: ["'s"]

Word: recent
Tag: (Lemma('late.s.03.recent') recent)

Word: ['primary', 'election']
Tag: (Lemma('primary.n.01.primary_election') primary election)

Word: produced
Tag: (Lemma('produce.v.04.produce') produced)

Word: ``
Tag: ['``']

Word: no
Tag: ['no']

Word: evidence
Tag: (Lemma('evidence.n.01.evidence') evidence)

Word: ''
Tag: ["''"]

Word: that
Tag: ['that']

Word: any
Tag: ['any']

Word: irregularities
Tag: (Lemma('abnormality.n.04.irregularity') irregularities)

Word: ['took', 'place']
Tag: (Lemma('happen.v.01.take_place') took place)

Word: .
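
One possible simplification, offered only as a sketch: since each tagged chunk already carries its own words, they can be read straight from the chunk instead of being aligned against `sentences[0]` by index. To keep the example runnable without NLTK or the corpus, `MiniTree` below is a hypothetical stand-in that mimics the `label()` and `leaves()` methods of `nltk.Tree`, and plain strings stand in for the `Lemma` labels. `chunk_words_and_label` is also a made-up helper name.

```python
class MiniTree:
    """Hypothetical stand-in for nltk.Tree, just enough for this sketch."""
    def __init__(self, label, children):
        self._label = label
        self.children = children

    def label(self):
        return self._label

    def leaves(self):
        # Collect the words at the bottom of the tree, descending into nested groups
        words = []
        for child in self.children:
            if isinstance(child, MiniTree):
                words.extend(child.leaves())
            else:
                words.append(child)
        return words

def chunk_words_and_label(chunk):
    """Return (words, label); label is None for untagged tokens stored as plain lists."""
    if hasattr(chunk, "leaves"):
        return chunk.leaves(), chunk.label()
    return list(chunk), None

# Toy chunks shaped like the first sentence above (labels are strings here, not Lemma objects)
toy_sentence = [
    ["The"],
    MiniTree("group.n.01.group", [MiniTree("NE", ["Fulton", "County", "Grand", "Jury"])]),
    MiniTree("state.v.01.say", ["said"]),
    MiniTree("primary.n.01.primary_election", ["primary", "election"]),
]

for chunk in toy_sentence:
    words, label = chunk_words_and_label(chunk)
    print(words, label)
```

A chunk with exactly one word and a non-`None` label would be a single tagged word; the helper relies only on duck typing, so whether it behaves the same on the actual corpus objects would need to be checked against real SemCor data.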

## Applied Exploration

For cases where the SemCor dataset has a single word tagged with a WordNet sense, run your `simplified_lesk` code on it and see if it matches. Go through all of the sentences in a particular file_id and compute an accuracy score.
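
For the scoring step, a minimal sketch might look like the following. It assumes you have already collected `(predicted, actual)` pairs of sense names, for example by comparing the name of the synset returned by `simplified_lesk` with the name of the synset from the tag; the `accuracy` helper and the pairs below are made up for illustration and are not real results.

```python
def accuracy(pairs):
    """Fraction of (predicted, actual) pairs that match; None if there are no pairs."""
    pairs = list(pairs)
    if not pairs:
        return None
    correct = sum(1 for predicted, actual in pairs if predicted == actual)
    return correct / len(pairs)

# Made-up sense names purely to exercise the function (not real results)
toy_pairs = [
    ("state.v.01", "state.v.01"),
    ("bank.n.01", "depository_financial_institution.n.01"),
    (None, "evidence.n.01"),  # no sense predicted, counted as wrong
    ("jury.n.01", "jury.n.01"),
]

print(accuracy(toy_pairs))  # 0.5
```

Counting a missing prediction as wrong is a design choice; skipping those cases instead would give a different score.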

Write notes here on what you did and the results you got.