# Word Sense Disambiguation

### Downloads
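You will need to download the WordNet data with NLTK's downloader. `word_tokenize`, used below, also needs the `punkt` tokenizer models (`punkt_tab` on recent NLTK releases); download them the same way if they are missing.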

In [20]:
import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger') # English part-of-speech tagger model

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/sahithimv/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/sahithimv/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

The code below uses NLTK to tokenize each sentence and the Lesk algorithm to pick a WordNet synset (a set of synonyms that share one meaning) for the target word. Lesk compares the words in each candidate synset's definition with the words in the sentence and returns the synset with the largest overlap.

In [7]:

from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

sentence_1 = "I saw a bat flying in the night sky."
sentence_2 = "The baseball player swung the bat with precision."

tokenized_sentence_1 = word_tokenize(sentence_1.lower())
tokenized_sentence_2 = word_tokenize(sentence_2.lower())

sense_1 = lesk(tokenized_sentence_1, 'bat')
sense_2 = lesk(tokenized_sentence_2, 'bat')

print(f"Original Sentence 1: {sentence_1}")
print(f"Sense 1: {sense_1.name()} - Definition: {sense_1.definition()}")
print("-------------------------")
print(f"Original Sentence 2: {sentence_2}")
print(f"Sense 2: {sense_2.name()} - Definition: {sense_2.definition()}")


Original Sentence 1: I saw a bat flying in the night sky.
Sense 1: cricket_bat.n.01 - Definition: the club used in playing cricket
-------------------------
Original Sentence 2: The baseball player swung the bat with precision.
Sense 2: bat.v.01 - Definition: strike with, or as if with a baseball bat


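The example above uses the Lesk algorithm, which disambiguates a word by measuring the word overlap between the sentence and each dictionary definition. The output also shows the limits of this approach: sentence 1 is tagged as a cricket bat rather than the animal, and sentence 2 gets a verb sense because no part of speech was given. Passing `pos='n'` to `lesk` restricts the candidates to noun senses.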
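To make the overlap idea concrete, here is a minimal sketch of a simplified Lesk in plain Python. The two glosses in `TOY_GLOSSES`, the stopword list, and the function names are made up for illustration; NLTK's `lesk` works on WordNet definitions instead and does not filter stopwords.

```python
import re

# Hypothetical mini-dictionary: two hand-written glosses for "bat".
TOY_GLOSSES = {
    "bat (animal)": "nocturnal flying mammal that navigates the night sky by echolocation",
    "bat (club)": "wooden club a player uses to hit the ball in baseball or cricket",
}

# Small illustrative stopword list (an assumption, not NLTK's list).
STOPWORDS = {"a", "an", "the", "in", "of", "to", "by", "or", "i", "that", "with"}


def tokens(text):
    """Lowercase the text, keep alphabetic words, and drop stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def toy_lesk(context, glosses):
    """Return (best_sense, overlaps); overlaps maps each sense to its shared words."""
    context_words = tokens(context)
    overlaps = {
        sense: context_words & tokens(gloss) for sense, gloss in glosses.items()
    }
    # The sense whose gloss shares the most words with the context wins.
    best = max(overlaps, key=lambda sense: len(overlaps[sense]))
    return best, overlaps


for sentence in (
    "I saw a bat flying in the night sky.",
    "The baseball player swung the bat with precision.",
):
    best, overlaps = toy_lesk(sentence, TOY_GLOSSES)
    print(f"{sentence} -> {best} (shared words: {sorted(overlaps[best])})")
```

With these toy glosses, the first sentence shares `flying`, `night`, and `sky` with the animal gloss, and the second shares `baseball` and `player` with the club gloss. Real WordNet definitions are much shorter on context words, which is one reason the NLTK result above goes wrong.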

### Supporting different languages

You will need to install the pywsd library, which provides several variants of the Lesk algorithm for WSD.

In [18]:
pip install pywsd

Note: you may need to restart the kernel to use updated packages.


After this, you can use pywsd's `simple_lesk` to perform WSD:

In [22]:
from pywsd.lesk import simple_lesk
import nltk

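def perform_lesk(sentence, word, pos='n'):
    """Print the sense that simple_lesk picks for `word` in `sentence`."""
    try:
        # Normalize the context: lowercase, tokenize, and rejoin into one string.
        tokenized_sentence = nltk.word_tokenize(sentence.lower())
        sense = simple_lesk(' '.join(tokenized_sentence), word, pos=pos)
        print(f"Sentence: {sentence}")
        if sense:
            print(f"Sense: {sense.name()} - Definition: {sense.definition()}")
        else:
            # simple_lesk returned nothing usable for this word.
            print(f"No sense found for '{word}' in the given context.")
        print("-------------------------")
    except Exception as e:
        # Report tokenizer or lookup failures instead of stopping the notebook.
        print(f"Error: {e}")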

# "I took a sheet of paper to take notes."
sentence_3 = "Tomé una hoja de papel para tomar notas."
# "The wind moved the leaves of the trees."
sentence_4 = "El viento movía las hojas de los árboles."

perform_lesk(sentence_3, 'hoja', pos='n')
perform_lesk(sentence_4, 'hoja', pos='n')




Sentence: Tomé una hoja de papel para tomar notas.
No sense found for 'hoja' in the given context.
-------------------------
Sentence: El viento movía las hojas de los árboles.
No sense found for 'hoja' in the given context.
-------------------------
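Neither call above finds a sense, most likely because pywsd looks words up in the English WordNet, where 'hoja' has no entry. A possible workaround, sketched here under the assumption that the Open Multilingual WordNet data has been downloaded with `nltk.download('omw-1.4')`, is to ask NLTK's WordNet reader for the synsets linked to the Spanish lemma via `lang='spa'`. The function name `spanish_synsets` and its `lookup` parameter are hypothetical; `lookup` exists only so the WordNet call can be swapped out when testing.

```python
def spanish_synsets(word, pos=None, lookup=None):
    """Return the WordNet synsets linked to a Spanish lemma.

    `lookup` is a hypothetical hook (not part of NLTK): a callable taking
    (word, pos) and returning a list of synsets. When omitted, NLTK's
    WordNet reader is queried with lang='spa'.
    """
    if lookup is None:
        # Imported lazily; needs the 'wordnet' and 'omw-1.4' NLTK data packages.
        from nltk.corpus import wordnet as wn

        def lookup(w, p):
            return wn.synsets(w, pos=p, lang="spa")

    # Lemmas are matched in lowercase, so normalize the query first.
    return lookup(word.lower(), pos)
```

Calling `spanish_synsets('hoja', pos='n')` and printing `synset.name()` and `synset.definition()` for each result lists the candidate senses. The definitions come back in English by default, so gloss overlap with a Spanish sentence will be small; this retrieves candidates but does not by itself choose between them.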


I have also added handling for cases where no suitable sense can be found, i.e. the word has no entry in the dictionary, along with a `try`/`except` for unexpected errors. You might want to add this even for English text.
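For English text, one way to apply the same idea is to check for `None` before calling `.name()`: NLTK's `lesk` returns `None` when it finds no candidate synsets, and the first example would then fail with an `AttributeError`. The helper name `describe_sense` and the `StubSynset` stand-in are hypothetical, used only so this sketch runs without WordNet data.

```python
class StubSynset:
    """Stand-in with the two synset methods used below (illustration only)."""

    def __init__(self, name, definition):
        self._name = name
        self._definition = definition

    def name(self):
        return self._name

    def definition(self):
        return self._definition


def describe_sense(sense):
    """Format a synset-like object, or report that no sense was found."""
    if sense is None:
        return "No sense found."
    return f"{sense.name()} - Definition: {sense.definition()}"


# In the notebook, `sense` would come from lesk(tokens, word) instead.
print(describe_sense(StubSynset("bat.n.01", "nocturnal mouselike mammal")))
print(describe_sense(None))
```

Returning a string instead of printing inside the helper keeps the formatting reusable, for example when collecting results for many sentences.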

Keep in mind that this is a very simple example. You might need machine learning models or other more complex methods for special use cases.