### Exercise 2.17: Implementing the Lesk Algorithm Using String Similarity and Text Vectorization

In this exercise, we will implement the Lesk algorithm step by step using the techniques covered so far. The goal is to find the meaning of the word "bank" in the sentence "On the banks of river Ganga, there lies the scent of spirituality." We will score each candidate definition against the sentence using cosine similarity over TF-IDF vectors; Jaccard similarity over token sets is an alternative measure for the same comparison.

In [1]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
import numpy as np

In [2]:
# Method to get TFIDF vectors
def get_tf_idf_vectors(corpus):
    tfidf_vectorizer = TfidfVectorizer()
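    # Keep the sparse result: each row slice stays 2-D, which cosine_similarity expects,
    # and newer scikit-learn versions reject the np.matrix returned by .todense()
    tfidf_results = tfidf_vectorizer.fit_transform(corpus)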
    return tfidf_results
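
To make the vectorization step less of a black box, the sketch below recomputes TF-IDF weights in plain Python. It assumes the default `TfidfVectorizer` settings (lowercasing, tokens of two or more word characters, smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1`, and L2 normalization). The helper names `tokenize`, `tfidf_vectors`, and `cosine` are illustrative and not part of scikit-learn.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_vectors(corpus):
    # Term counts per document
    docs = [Counter(tokenize(doc)) for doc in corpus]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        # Weight = term count * smoothed inverse document frequency
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in doc.items()
        }
        # Scale each document vector to unit (L2) length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else weights)
    return vectors

def cosine(u, v):
    # Both vectors are unit length, so the dot product equals the cosine
    return sum(w * v.get(t, 0.0) for t, w in u.items())

toy_corpus = ["the river bank", "the bank holds money", "a quiet lake"]
vecs = tfidf_vectors(toy_corpus)
print(cosine(vecs[0], vecs[1]))  # positive: the documents share "the" and "bank"
print(cosine(vecs[0], vecs[2]))  # 0.0: the documents share no terms
```

Storing each vector as a dictionary keeps the sketch short; the vectorizer itself returns a sparse matrix with one column per vocabulary term.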

In [3]:
# Method to convert corpus into lower case
def to_lower_case(corpus):
    lowercase_corpus = [x.lower() for x in corpus]
    return lowercase_corpus

In [4]:
# Method to find similarity between sentence and the possible definitions
def find_sentence_definition(sent_vector, definition_vectors):
    result_dict = {}
    for definition_id, def_vector in definition_vectors.items():
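        # cosine_similarity returns a 1x1 array for two single-row inputs
        sim = cosine_similarity(sent_vector, def_vector)
        result_dict[definition_id] = sim[0][0]
    # Pick the definition with the highest similarity score
    definition_id, score = max(result_dict.items(), key=lambda x: x[1])
    return definition_id, score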

In [5]:
# Index 0 is the sentence, indices 1 and 2 are the candidate definitions of "bank";
# the remaining sentences only contribute to the IDF statistics
corpus = ["On the banks of river Ganga, there lies the scent of spirituality.",
          "An institute where people can store extra cash or money.",
          "The land alongside or sloping down to a river or lake",
          "What you do defines you",
          "Your deeds define you",
          "Once upon a time there lived a king.",
          "Who is your queen?",
          "He is desperate",
          "Is he not desperate?"]

In [6]:
# Find definition of the word bank
lower_case_corpus = to_lower_case(corpus)
corpus_tf_idf = get_tf_idf_vectors(lower_case_corpus)
sent_vector = corpus_tf_idf[0]
definition_vectors = {'def1':corpus_tf_idf[1], 'def2':corpus_tf_idf[2]}
definition_id, score = find_sentence_definition(sent_vector, definition_vectors)
print("The definition of word {} is {} with similarity of {:.4f}".format('bank', definition_id, score))

The definition of word bank is def2 with similarity of 0.1766

Since def2 is the riverbank definition, the algorithm has picked the correct sense of "bank" for this sentence. The other candidate, def1, shares no words with the sentence, so its cosine similarity is 0. In this exercise, we have learned how to use text vectorization and text similarity to find the right definition of an ambiguous word.
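
Jaccard similarity can stand in for cosine similarity in the same procedure, and it is closer to the word-overlap counting of the original Lesk formulation. The following is a minimal sketch under simple assumptions: tokens are lowercase alphabetic runs, no stop words are removed, and the helper names `token_set`, `jaccard_similarity`, and `lesk_jaccard` are hypothetical.

```python
import re

def token_set(text):
    # Lowercase the text and collect its distinct alphabetic tokens
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard_similarity(a, b):
    # Size of the intersection divided by size of the union
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def lesk_jaccard(sentence, definitions):
    # Score every candidate definition against the sentence
    sent_tokens = token_set(sentence)
    scores = {
        def_id: jaccard_similarity(sent_tokens, token_set(text))
        for def_id, text in definitions.items()
    }
    # Return the best-scoring definition ID along with all scores
    best_id = max(scores, key=scores.get)
    return best_id, scores

sentence = "On the banks of river Ganga, there lies the scent of spirituality."
definitions = {
    "def1": "An institute where people can store extra cash or money.",
    "def2": "The land alongside or sloping down to a river or lake",
}
best_id, scores = lesk_jaccard(sentence, definitions)
print(best_id, scores)
```

With these inputs, def2 wins again because it shares "the" and "river" with the sentence, while def1 shares nothing. Because a stop word such as "the" counts as much as "river" here, removing stop words before scoring would be a natural refinement.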