<h1><b><font color = 'brown' size = '6'>
Word Sense Disambiguation Using the Lesk Algorithm
</font></b></h1>

<h2>
<b>

<ul>
<font color = 'brown green' size = '5'>

<li>
The Lesk algorithm is used for word sense disambiguation, that is, for choosing the intended meaning of a word that has more than one sense.
</li><br>

<li>
Suppose we have a sentence such as "On the bank of river Ganga, there lies the scent of spirituality" and another sentence, "I'm going to withdraw some cash from the bank".
</li><br>

<li>
Here, the same word, "bank", is used with two different meanings.
</li><br>

<li>
For text processing results to be accurate, the context of the words needs to be considered.
</li><br>

<li>
In the Lesk algorithm, each possible sense of an ambiguous word is looked up in a lexical resource such as WordNet, where senses are stored as synsets, each with its own definition.
</li><br>

<li>
The definition that is closest to the context in which the word appears
in the sentence is taken as the correct sense.
</li><br>

<li>
Let's perform a simple exercise to get a better idea of how we can implement this.
</li><br>

</font>
</ul>
</b>
</h2>
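
The idea in the list above can be sketched in a few lines of plain Python. This is a simplified, overlap-counting variant of Lesk, not the exact procedure used in the exercise that follows. The `SENSES` dictionary, its two hand-written glosses, the stop-word list, and the helper names `content_words` and `simplified_lesk` are all assumptions made for illustration.

```python
import re

# Hypothetical, hand-written glosses for two senses of "bank" (illustrative only)
SENSES = {
    "financial": "an institution where people can deposit or withdraw cash or money",
    "river": "the land alongside or sloping down to a river or lake",
}

# A tiny, assumed stop-word list so that function words do not count as overlap
STOP_WORDS = {"a", "an", "the", "of", "or", "to", "on", "in", "from",
              "some", "there", "where", "can", "i", "m"}

def content_words(text):
    """Lowercase the text and return its set of non-stop-word tokens."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS

def simplified_lesk(sentence, senses):
    """Pick the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence)
    overlaps = {name: len(context & content_words(gloss))
                for name, gloss in senses.items()}
    return max(overlaps, key=overlaps.get), overlaps

best_river, counts_river = simplified_lesk(
    "On the bank of river Ganga, there lies the scent of spirituality", SENSES)
best_cash, counts_cash = simplified_lesk(
    "I'm going to withdraw some cash from the bank", SENSES)
print(best_river, counts_river)
print(best_cash, counts_cash)
```

With these assumed glosses, the first sentence overlaps only with the river sense (through "river") and the second only with the financial sense (through "withdraw" and "cash").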

<h1><b><font color = 'brown'>
Exercise: Implementing the Lesk Algorithm Using String Similarity and
Text Vectorization
</font></b></h1>

<b>In this exercise, we are going to implement the Lesk algorithm step by step using the techniques we have learned so far. <br>

We will find the meaning of the word "bank" in the sentence, "On the banks of river Ganga, there lies the scent of spirituality."<br>

We will use cosine similarity as well as Jaccard similarity here. Follow these steps to complete this exercise: </b>

1. Open a Jupyter or Colab Notebook.

2. Insert a new cell and add the following code to import the necessary libraries:

In [None]:
import pandas as pd
import nltk
from sklearn.metrics.pairwise import cosine_similarity
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
import numpy as np

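3. Define a function that returns the TF-IDF vectors of a corpus:

In [None]:
def get_tf_idf_vectors(corpus):
    """Return a dense (n_documents, n_terms) array of TF-IDF weights."""
    vectorizer = TfidfVectorizer()
    # Learn the vocabulary and IDF weights, then vectorize every document
    tf_idf_vectors = vectorizer.fit_transform(corpus)

    return tf_idf_vectors.toarray()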
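
To make the numbers produced by `TfidfVectorizer` less opaque, the following sketch computes TF-IDF weights by hand. It assumes scikit-learn's default settings (a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and uses a simplified tokenizer. The function name `manual_tf_idf` and the toy corpus are illustrative assumptions, not part of the exercise.

```python
import math
import re
from collections import Counter

def manual_tf_idf(corpus):
    """Compute L2-normalized TF-IDF rows using a smoothed IDF term."""
    # Keep tokens of two or more word characters (single letters such as "a" are dropped)
    docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
    vocab = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term
    df = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        # Guard against an empty document, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

vocab, rows = manual_tf_idf(["the river bank", "the cash bank", "a quiet lake"])
print(vocab)
print([round(w, 3) for w in rows[0]])
```

In the first toy document, "river" receives a larger weight than "bank" or "the" because it occurs in fewer documents.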

4. Define a function to convert the corpus to lowercase:

In [None]:
def to_lower_case(corpus):
    """Return a copy of the corpus with every sentence lowercased."""
    return [sentence.lower() for sentence in corpus]

5. Define a function that scores each candidate definition against the sentence
and returns the definition with the highest similarity score:

In [None]:
def find_sentence_definition(sent_vector, definition_vectors):
    """Return the key and score of the definition most similar to the sentence."""
    max_score = -np.inf
    definition_id = None

    for key, def_vector in definition_vectors.items():
        # cosine_similarity expects 2-D inputs and returns a 1x1 array here
        similarity_score = cosine_similarity(sent_vector.reshape(1, -1), def_vector.reshape(1, -1))
        if similarity_score > max_score:
            max_score = similarity_score
            definition_id = key

    return definition_id, max_score
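
For reference, the score that `cosine_similarity` computes for a pair of vectors is their dot product divided by the product of their lengths. The following is a minimal pure-Python sketch of that calculation and of the "pick the best-scoring definition" step. The helper name `cosine`, the choice to return 0.0 for an all-zero vector, and the toy vectors are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|), or 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors: the sentence shares one dimension with def2 and none with def1
sentence = [1.0, 1.0, 0.0]
definitions = {"def1": [0.0, 0.0, 1.0], "def2": [1.0, 0.0, 1.0]}

scores = {key: cosine(sentence, vec) for key, vec in definitions.items()}
best = max(scores, key=scores.get)
print(best, scores)
```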

6. Define a corpus whose first three entries are the sentence and the two
candidate definitions, followed by a few unrelated sentences:

In [None]:
corpus = ["On the banks of river Ganga, there lies the scent of spirituality",
          "An institute where people can store extra cash or money.",
          "The land alongside or sloping down to a river or lake",
          "What you do defines you",
          "Your deeds define you",
          "Once upon a time there lived a king.",
          "Who is your queen?",
          "He is desperate",
          "Is he not desperate?"]
corpus

['On the banks of river Ganga, there lies the scent of spirituality',
 'An institute where people can store extra cash or money.',
 'The land alongside or sloping down to a river or lake',
 'What you do defines you',
 'Your deeds define you',
 'Once upon a time there lived a king.',
 'Who is your queen?',
 'He is desperate',
 'Is he not desperate?']

7. Use the previously defined functions to find the definition of the word "bank":

In [None]:
lower_case_corpus = to_lower_case(corpus)
lower_case_corpus

['on the banks of river ganga, there lies the scent of spirituality',
 'an institute where people can store extra cash or money.',
 'the land alongside or sloping down to a river or lake',
 'what you do defines you',
 'your deeds define you',
 'once upon a time there lived a king.',
 'who is your queen?',
 'he is desperate',
 'is he not desperate?']

In [None]:
corpus_tf_idf = get_tf_idf_vectors(lower_case_corpus)
corpus_tf_idf

array([[0.        , 0.        , 0.2652394 , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.2652394 , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.2652394 ,
        0.        , 0.        , 0.        , 0.53047881, 0.2652394 ,
        0.        , 0.        , 0.        , 0.        , 0.22229132,
        0.2652394 , 0.        , 0.2652394 , 0.        , 0.44458264,
        0.22229132, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        ],
       [0.        , 0.32104135, 0.        , 0.32104135, 0.32104135,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.32104135, 0.        , 0.        , 0.32104135,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.32104135, 0.        , 0.        , 0.        ,
        0.        , 0.26905771, 0.32104135, 0.        , 0.        ,
   

In [None]:
sent_vector = corpus_tf_idf[0]
sent_vector

array([0.        , 0.        , 0.2652394 , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.2652394 , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.2652394 ,
       0.        , 0.        , 0.        , 0.53047881, 0.2652394 ,
       0.        , 0.        , 0.        , 0.        , 0.22229132,
       0.2652394 , 0.        , 0.2652394 , 0.        , 0.44458264,
       0.22229132, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        ])

In [None]:
definition_vectors = {'def1': corpus_tf_idf[1], 'def2': corpus_tf_idf[2]}
definition_vectors

{'def1': array([0.        , 0.32104135, 0.        , 0.32104135, 0.32104135,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.32104135, 0.        , 0.        , 0.32104135,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.32104135, 0.        , 0.        , 0.        ,
        0.        , 0.26905771, 0.32104135, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.32104135, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.32104135,
        0.        , 0.        , 0.        ]),
 'def2': array([0.25799474, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.25799474, 0.        , 0.25799474,
        0.25799474, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.25799474, 0.25799474, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.43243946, 0.        , 0.        

In [None]:
definition_id, score = find_sentence_definition(sent_vector, definition_vectors)
definition_id, score

('def2', array([[0.14419131]]))

In [None]:
print("The definition of word 'bank' is '{}' with a similarity of {}".format(definition_id, score))

The definition of word 'bank' is 'def2' with a similarity of [[0.14419131]]
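
The introduction to this exercise also mentions Jaccard similarity, which the steps above do not use. As a hedged sketch of how it could be applied to the same sentence and definitions: the Jaccard similarity of two token sets is the size of their intersection divided by the size of their union. The tokenizer and the function names `tokens` and `jaccard` are assumptions for illustration, and no stemming or stop-word removal is applied.

```python
import re

def tokens(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

sentence = "On the banks of river Ganga, there lies the scent of spirituality"
definitions = {
    "def1": "An institute where people can store extra cash or money.",
    "def2": "The land alongside or sloping down to a river or lake",
}

scores = {key: jaccard(tokens(sentence), tokens(text))
          for key, text in definitions.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

In this sketch the only shared tokens are "the" and "river", both with the second definition, so a stop-word filter or stemming (so that "banks" matches "bank") would be natural refinements.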
