<a href="https://www.kaggle.com/sid9300/learning-some-nlp-functions?scriptVersionId=84326156" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# <font color='#4a8bad'>Table of contents</font>
***
* [Soundex](#soundex)
* [Levenshtein Edit Distance](#led)
* [Levenshtein Distance in 'nltk' Library](#lednltk)
    * [Levenshtein Distance](#ld)
    * [Damerau-Levenshtein Distance](#dld)
* [Heteronyms Detection](#hd)
    * [Example I](#hdi)
    * [Example II](#hdii)
* [Navigating Wordnet Relationships](#nwr)
* [Word-Sense Disambiguation](#wsd)
* [Lesk Algorithm](#lesk)
    * [Example I](#leski)
    * [Example II](#leskii)
    * [Example III](#leskiii)
* [Automatic POS Tagging + Lesk with spaCy](#apt)

<a id="soundex"></a>
<div style="color:blue;
           display:fill;
           border-radius:5px;
           background-color:blue;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: center;
           padding: 10px;
           color:white;">
Soundex
</h1>
</div>

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

It converts a string to a four-character code based on how the string sounds when spoken in English. The first character of the code is the first letter of the input string, converted to upper case; the remaining three characters are digits that encode the consonants that follow, padded with zeros if the word is too short.

In [1]:
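def get_soundex(word):
    
    # Change the word to uppercase
    word = word.upper()
    
    # Map letter groups to their soundex codes.
    # Vowels and 'Y' act as separators ('.'); 'H' and 'W' are ignored ('-').
    # (Implementations differ on 'Y'; here it is treated like a vowel.)
    dictionary = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6", "AEIOUY": ".", "HW": "-"}
    
    def get_code(char):
        for key, code in dictionary.items():
            if char in key:
                return code
        return "-"    # anything that is not a letter is ignored
    
    # Initialize the soundex with the first letter of the word
    soundex = word[0]
    
    # Remember the code of the previous letter, starting with the first one,
    # so that a repeat right after the first letter (e.g. 'Pf') is skipped
    previous = get_code(word[0])
    
    # Logic to prepare the soundex
    for char in word[1:]:
        code = get_code(char)
        if code == "-":
            continue                 # 'H' and 'W' do not separate equal codes
        if code != "." and code != previous:
            soundex += code
        previous = code              # a vowel resets the comparison
    
    # Trim or pad to make soundex a 4-character code
    soundex = soundex[:4].ljust(4, "0")
    
    return soundex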

print("Soundex Example :", 1)
print("--------------------")
print("Aggrawal :", get_soundex("Aggrawal"))
print("Agrawal  :", get_soundex("Agrawal"))
print("Aggarwal :", get_soundex("Aggarwal"))
print("Agarwal  :", get_soundex("Agarwal"))
print('\n')
print("Soundex Example :", 2)
print("--------------------")
print("Bombay   :", get_soundex("Bombay"))
print("Bambai   :", get_soundex("Bambai"))
print("Mumbai   :", get_soundex("Mumbai"))

Soundex Example : 1
--------------------
Aggrawal : A264
Agrawal  : A264
Aggarwal : A264
Agarwal  : A264


Soundex Example : 2
--------------------
Bombay   : B510
Bambai   : B510
Mumbai   : M510


<a id="led"></a>
<div style="color:blue;
           display:fill;
           border-radius:5px;
           background-color:blue;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: center;
           padding: 10px;
           color:white;">
Levenshtein Edit Distance
</h1>
</div>

The Levenshtein distance is the minimum number of single-character edits (insertions, deletions or substitutions) required to turn the source string into the target string.

In [2]:
def lev_distance(source='', target=''):
    """Return the Levenshtein distance between source and target"""
    
    # get length of both strings
    n1, n2 = len(source), len(target)
    
    # create matrix using length of both strings - source string sits on columns, target string sits on rows
    matrix = [ [ 0 for i1 in range(n1 + 1) ] for i2 in range(n2 + 1) ]
    
    # fill the first row - columns 1 to n1 (cell [0][0] stays 0)
    for i1 in range(1, n1 + 1):
        matrix[0][i1] = i1
    
    # fill the first column - rows 1 to n2 (cell [0][0] stays 0)
    for i2 in range(1, n2 + 1):
        matrix[i2][0] = i2
    
    # fill the matrix
    for i2 in range(1, n2 + 1):
        for i1 in range(1, n1 + 1):
            
            # check whether letters being compared are same
            if (source[i1-1] == target[i2-1]):
                value = matrix[i2-1][i1-1]               # top-left cell value
            else:
                value = min(matrix[i2-1][i1]   + 1,      # top cell value      + 1
                            matrix[i2][i1-1]   + 1,      # left cell value     + 1
                            matrix[i2-1][i1-1] + 1)      # top-left cell value + 1
            
            matrix[i2][i1] = value
    
    # return bottom-right cell value
    return matrix[-1][-1]

print("Levenshtein Edit Distance Example :", 1)
print("--------------------------------------")
print("cat vs cta :", lev_distance('cat', 'cta'))

Levenshtein Edit Distance Example : 1
--------------------------------------
cat vs cta : 2
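
Each cell of the matrix only depends on the row above it and on the cell to its left, so the full matrix is not strictly needed to get the final number. The sketch below is one possible variant that keeps just two rows; the name `lev_distance_two_rows` is made up for this illustration and is not part of any library.

```python
def lev_distance_two_rows(source='', target=''):
    """Levenshtein distance using two rows instead of the full matrix."""
    # previous[j] = distance between the processed part of source and target[:j]
    previous = list(range(len(target) + 1))

    for i, s_char in enumerate(source, start=1):
        # first cell of the row: deleting i letters of source gives the empty string
        current = [i]
        for j, t_char in enumerate(target, start=1):
            if s_char == t_char:
                value = previous[j - 1]                   # letters match, no extra edit
            else:
                value = 1 + min(previous[j],              # deletion
                                current[j - 1],           # insertion
                                previous[j - 1])          # substitution
            current.append(value)
        previous = current

    return previous[-1]

print("cat vs cta :", lev_distance_two_rows('cat', 'cta'))   # 2
```

The result for `cat` vs `cta` is still 2, because plain Levenshtein has no swap operation: the transposed letters cost two substitutions.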


<a id="lednltk"></a>
<div style="color:blue;
           display:fill;
           border-radius:5px;
           background-color:blue;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: center;
           padding: 10px;
           color:white;">
Levenshtein Distance in 'nltk' Library
</h1>
</div>

In [3]:
# Import library
from nltk.metrics.distance import edit_distance

<a id="ld"></a>
<div style="color:lightblue;
           display:fill;
           border-radius:5px;
           background-color:lightblue;
           font-size:75%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: left;
           padding: 10px;
           color:black;">
Levenshtein Distance
</h1>
</div>

In [4]:
print("apple vs appel :", edit_distance("apple", "appel"))

apple vs appel : 2


<a id="dld"></a>
<div style="color:lightblue;
           display:fill;
           border-radius:5px;
           background-color:lightblue;
           font-size:75%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: left;
           padding: 10px;
           color:black;">
Damerau-Levenshtein Distance
</h1>
</div>

In addition to insertions, deletions and substitutions, the Damerau-Levenshtein distance counts a transposition (a swap of two adjacent letters) as a single edit.

In [5]:
print("apple vs appel :", edit_distance("apple", "appel", transpositions=True))

apple vs appel : 1
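
To see where the saved edit comes from, here is a minimal sketch of the restricted form of this distance (often called optimal string alignment). It is an illustration only: `osa_distance` is a hypothetical name, and this is not necessarily how `nltk` computes the value internally.

```python
def osa_distance(source, target):
    """Edit distance with adjacent transpositions (optimal string alignment)."""
    n1, n2 = len(source), len(target)

    # d[i][j] = distance between source[:i] and target[:j]
    d = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(n1 + 1):
        d[i][0] = i
    for j in range(n2 + 1):
        d[0][j] = j

    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match or substitution

            # extra case: the last two letters are swapped in the other string
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)

    return d[n1][n2]

print("apple vs appel :", osa_distance("apple", "appel"))   # 1
print("cat vs cta     :", osa_distance("cat", "cta"))       # 1
```

The only difference from the plain Levenshtein recurrence is the extra `if` block, which looks two cells back along the diagonal when the last two letters are swapped.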


<a id="hd"></a>
<div style="color:blue;
           display:fill;
           border-radius:5px;
           background-color:blue;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: center;
           padding: 10px;
           color:white;">
Heteronyms Detection
</h1>
</div>

Heteronyms are words that share a spelling but differ in pronunciation and meaning, such as "desert" (to abandon) and "desert" (an arid region). When the two readings belong to different parts of speech, a POS tagger can tell them apart.

In [6]:
# Importing SpaCy Library
import spacy

# Load pre-trained SpaCy model for performing basic
# NLP tasks such as POS tagging, parsing, etc.
model = spacy.load("en_core_web_sm")

def print_token(text):
    
    # Use the model to process the input sentence
    tokens = model(text)

    # Print the tokens and their respective PoS tags.
    for token in tokens:
        print(token.text, "--", token.pos_, "--", token.tag_)

<a id="hdi"></a>
<div style="color:lightblue;
           display:fill;
           border-radius:5px;
           background-color:lightblue;
           font-size:75%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: left;
           padding: 10px;
           color:black;">
Example I
</h1>
</div>


In [7]:
print_token("She wished she could desert him in the desert.")

She -- PRON -- PRP
wished -- VERB -- VBD
she -- PRON -- PRP
could -- AUX -- MD
desert -- VERB -- VB
him -- PRON -- PRP
in -- ADP -- IN
the -- DET -- DT
desert -- NOUN -- NN
. -- PUNCT -- .
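
One simple way to turn the tagged output above into a detector is to flag every word form that shows up with more than one POS tag. The sketch below assumes the tags are already available as `(text, pos)` pairs, copied here by hand from the output above so that it runs without spaCy; `find_pos_ambiguous` is a hypothetical helper name.

```python
from collections import defaultdict

def find_pos_ambiguous(tagged_tokens):
    """Return {word: sorted POS tags} for words tagged with more than one POS."""
    tags_by_word = defaultdict(set)
    for text, pos in tagged_tokens:
        if pos != "PUNCT":                       # punctuation is not interesting here
            tags_by_word[text.lower()].add(pos)
    return {word: sorted(tags)
            for word, tags in tags_by_word.items()
            if len(tags) > 1}

# (text, pos) pairs copied from the output of Example I
example_1 = [
    ("She", "PRON"), ("wished", "VERB"), ("she", "PRON"), ("could", "AUX"),
    ("desert", "VERB"), ("him", "PRON"), ("in", "ADP"), ("the", "DET"),
    ("desert", "NOUN"), (".", "PUNCT"),
]

print(find_pos_ambiguous(example_1))   # {'desert': ['NOUN', 'VERB']}
```

This check only catches heteronyms whose readings differ in POS. In the "bass" sentence of Example II both occurrences are tagged `NOUN`, so it would not flag them.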


<a id="hdii"></a>
<div style="color:lightblue;
           display:fill;
           border-radius:5px;
           background-color:lightblue;
           font-size:75%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: left;
           padding: 10px;
           color:black;">
Example II
</h1>
</div>


In [8]:
print_token("The bass swam around the bass drum on the ocean floor.")

The -- DET -- DT
bass -- NOUN -- NN
swam -- NOUN -- NN
around -- ADP -- IN
the -- DET -- DT
bass -- NOUN -- NN
drum -- NOUN -- NN
on -- ADP -- IN
the -- DET -- DT
ocean -- NOUN -- NN
floor -- NOUN -- NN
. -- PUNCT -- .


<a id="nwr"></a>
<div style="color:blue;
           display:fill;
           border-radius:5px;
           background-color:blue;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: center;
           padding: 10px;
           color:white;">
Navigating Wordnet Relationships
</h1>
</div>

In [9]:
!pip install nltk



In [10]:
from nltk import download
download('wordnet')

from nltk.corpus import wordnet

[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [11]:
# Synsets
tractor = wordnet.synsets("tractor")
tractor

[Synset('tractor.n.01'), Synset('tractor.n.02')]

In [12]:
# Definitions of senses
for i, syn in enumerate(tractor, start=1):
    print("{:02d}".format(i), " : ", syn.definition())

01  :  a wheeled vehicle with large wheels; used in farming and other applications
02  :  a truck that has a cab but no body; used for pulling large trailers or vans


In [13]:
# Hypernyms : Relation between a concept and its superordinate
tractor = wordnet.synset('tractor.n.01')
tractor.hypernyms()

[Synset('self-propelled_vehicle.n.01')]

In [14]:
self_propelled_vehicle = wordnet.synset('self-propelled_vehicle.n.01')
self_propelled_vehicle.hypernyms()

[Synset('wheeled_vehicle.n.01')]
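
Calling `hypernyms()` repeatedly walks up the hierarchy one level at a time. The sketch below shows the same walk on a toy dictionary that holds only the two links printed above, so it runs without the WordNet data; the names `HYPERNYMS`, `hypernym_chain` and `invert` are made up for this illustration, and the real hierarchy continues above `wheeled_vehicle.n.01`.

```python
# Toy hypernym links copied from the two outputs above.
# With NLTK, each lookup would be a call to synset.hypernyms() instead.
HYPERNYMS = {
    "tractor.n.01": ["self-propelled_vehicle.n.01"],
    "self-propelled_vehicle.n.01": ["wheeled_vehicle.n.01"],
}

def hypernym_chain(name, hypernyms=HYPERNYMS):
    """Follow the first hypernym of each node until none is recorded."""
    chain = [name]
    while hypernyms.get(chain[-1]):
        chain.append(hypernyms[chain[-1]][0])
    return chain

def invert(relations):
    """Turn child -> parents links into parent -> children links (hyponyms)."""
    inverse = {}
    for child, parents in relations.items():
        for parent in parents:
            inverse.setdefault(parent, []).append(child)
    return inverse

print(" -> ".join(hypernym_chain("tractor.n.01")))
# tractor.n.01 -> self-propelled_vehicle.n.01 -> wheeled_vehicle.n.01

HYPONYMS = invert(HYPERNYMS)
print(HYPONYMS["self-propelled_vehicle.n.01"])   # ['tractor.n.01']
```

Inverting the links shows that hyponymy is the same relation read in the other direction, which is why `tractor.n.01` appears among the hyponyms of `self-propelled_vehicle.n.01`.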

In [15]:
# Meronyms : Relation between a part and its whole
wheeled_vehicle = wordnet.synset('wheeled_vehicle.n.01')
wheeled_vehicle.part_meronyms()

[Synset('axle.n.01'),
 Synset('brake.n.01'),
 Synset('splasher.n.01'),
 Synset('wheel.n.01')]

In [16]:
# Hyponyms : Relation between a concept and its subordinate
wheeled_vehicle.hyponyms()

[Synset('baby_buggy.n.01'),
 Synset('bicycle.n.01'),
 Synset('boneshaker.n.01'),
 Synset('car.n.02'),
 Synset('handcart.n.01'),
 Synset('horse-drawn_vehicle.n.01'),
 Synset('motor_scooter.n.01'),
 Synset('rolling_stock.n.01'),
 Synset('scooter.n.02'),
 Synset('self-propelled_vehicle.n.01'),
 Synset('skateboard.n.01'),
 Synset('trailer.n.04'),
 Synset('tricycle.n.01'),
 Synset('unicycle.n.01'),
 Synset('wagon.n.01'),
 Synset('wagon.n.04'),
 Synset('welcome_wagon.n.01')]

In [17]:
# Holonyms : Relation between whole and its parts
axle = wordnet.synset('axle.n.01')
axle.part_holonyms()

[Synset('wheeled_vehicle.n.01')]

In [18]:
self_propelled_vehicle.hyponyms()

[Synset('armored_vehicle.n.01'),
 Synset('carrier.n.02'),
 Synset('forklift.n.01'),
 Synset('locomotive.n.01'),
 Synset('motor_vehicle.n.01'),
 Synset('personnel_carrier.n.01'),
 Synset('reconnaissance_vehicle.n.01'),
 Synset('recreational_vehicle.n.01'),
 Synset('streetcar.n.01'),
 Synset('tracked_vehicle.n.01'),
 Synset('tractor.n.01'),
 Synset('weapons_carrier.n.01')]

In [19]:
motor_vehicle = wordnet.synset('motor_vehicle.n.01')
motor_vehicle.hyponyms()

[Synset('amphibian.n.01'),
 Synset('bloodmobile.n.01'),
 Synset('car.n.01'),
 Synset('doodlebug.n.01'),
 Synset('four-wheel_drive.n.01'),
 Synset('go-kart.n.01'),
 Synset('golfcart.n.01'),
 Synset('hearse.n.01'),
 Synset('motorcycle.n.01'),
 Synset('snowplow.n.01'),
 Synset('truck.n.01')]

In [20]:
car = wordnet.synset('car.n.01')
car.part_meronyms()

[Synset('accelerator.n.01'),
 Synset('air_bag.n.01'),
 Synset('auto_accessory.n.01'),
 Synset('automobile_engine.n.01'),
 Synset('automobile_horn.n.01'),
 Synset('buffer.n.06'),
 Synset('bumper.n.02'),
 Synset('car_door.n.01'),
 Synset('car_mirror.n.01'),
 Synset('car_seat.n.01'),
 Synset('car_window.n.01'),
 Synset('fender.n.01'),
 Synset('first_gear.n.01'),
 Synset('floorboard.n.02'),
 Synset('gasoline_engine.n.01'),
 Synset('glove_compartment.n.01'),
 Synset('grille.n.02'),
 Synset('high_gear.n.01'),
 Synset('hood.n.09'),
 Synset('luggage_compartment.n.01'),
 Synset('rear_window.n.01'),
 Synset('reverse.n.02'),
 Synset('roof.n.02'),
 Synset('running_board.n.01'),
 Synset('stabilizer_bar.n.01'),
 Synset('sunroof.n.01'),
 Synset('tail_fin.n.02'),
 Synset('third_gear.n.01'),
 Synset('window.n.02')]

<a id="wsd"></a>
<div style="color:blue;
           display:fill;
           border-radius:5px;
           background-color:blue;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: center;
           padding: 10px;
           color:white;">
Word-Sense Disambiguation
</h1>
</div>

In [21]:
from nltk import wsd
from nltk.corpus import wordnet as wn

In [22]:
X = 'The die is cast.'
Y = 'Roll the die to get a 6.'
Z = 'What is dead may never die.'

# To know the senses of 'die'
wn.synsets('die')

[Synset('die.n.01'),
 Synset('die.n.02'),
 Synset('die.n.03'),
 Synset('die.v.01'),
 Synset('die.v.02'),
 Synset('die.v.03'),
 Synset('fail.v.04'),
 Synset('die.v.05'),
 Synset('die.v.06'),
 Synset('die.v.07'),
 Synset('die.v.08'),
 Synset('die.v.09'),
 Synset('die.v.10'),
 Synset('die.v.11')]

In [23]:
# Restrict the senses of 'die' to nouns
wn.synsets('die', pos=wn.NOUN)

[Synset('die.n.01'), Synset('die.n.02'), Synset('die.n.03')]

In [24]:
# Definitions of the noun senses of 'die'
i = 1
for syn in wn.synsets('die', pos=wn.NOUN):
    print("{:02d}".format(i), " : ", syn.definition())
    i += 1

01  :  a small cube with 1 to 6 spots on the six faces; used in gambling to generate random numbers
02  :  a device used for shaping metal
03  :  a cutting tool that is fitted into a diestock and used for cutting male (external) screw threads on screws or bolts or pipes or rods


In [25]:
# Definitions of the verb senses of 'die'
i = 1
for syn in wn.synsets('die', pos=wn.VERB):
    print("{:02d}".format(i), " : ", syn.definition())
    i += 1

01  :  pass from physical life and lose all bodily attributes and functions necessary to sustain life
02  :  suffer or face the pain of death
03  :  be brought to or as if to the point of death by an intense emotion such as embarrassment, amusement, or shame
04  :  stop operating or functioning
05  :  feel indifferent towards
06  :  languish as with love or desire
07  :  cut or shape with a die
08  :  to be on base at the end of an inning, of a player
09  :  lose sparkle or bouquet
10  :  disappear or come to an end
11  :  suffer spiritual death; be damned (in the religious sense)


<a id="lesk"></a>
<div style="color:blue;
           display:fill;
           border-radius:5px;
           background-color:blue;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: center;
           padding: 10px;
           color:white;">
Lesk Algorithm
</h1>
</div>

<a id="leski"></a>
<div style="color:lightblue;
           display:fill;
           border-radius:5px;
           background-color:lightblue;
           font-size:75%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: left;
           padding: 10px;
           color:black;">
Example I
</h1>
</div>

In [26]:
# Trying to find nearest match for the word 'die'
print("Statement : ", X)
syn = wsd.lesk(X.split(), 'die')
print("Match     : ", syn)

Statement :  The die is cast.
Match     :  Synset('die.v.07')


In [27]:
# Getting the definition
print("Wrong Definition : ", syn.definition())

Wrong Definition :  cut or shape with a die


In [28]:
print("Right Definition : ", wsd.lesk(X.split(), 'die', pos=wn.NOUN).definition())

Right Definition :  a cutting tool that is fitted into a diestock and used for cutting male (external) screw threads on screws or bolts or pipes or rods


<a id="leskii"></a>
<div style="color:lightblue;
           display:fill;
           border-radius:5px;
           background-color:lightblue;
           font-size:75%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: left;
           padding: 10px;
           color:black;">
Example II
</h1>
</div>

In [29]:
# Trying to find nearest match for the word 'die'
print("Statement        : ", Y)
print("\n")
print("Wrong Definition : ", wsd.lesk(Y.split(), 'die').definition())
print("Right Definition : ", wsd.lesk(Y.split(), 'die', pos=wn.NOUN).definition())

Statement        :  Roll the die to get a 6.


Wrong Definition :  to be on base at the end of an inning, of a player
Right Definition :  a small cube with 1 to 6 spots on the six faces; used in gambling to generate random numbers


<a id="leskiii"></a>
<div style="color:lightblue;
           display:fill;
           border-radius:5px;
           background-color:lightblue;
           font-size:75%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: left;
           padding: 10px;
           color:black;">
Example III
</h1>
</div>

In [30]:
# Trying to find nearest match for the word 'die'
print("Statement        : ", Z)
print("\n")
print("Wrong Definition : ", wsd.lesk(Z.split(), 'die').definition())
print("Right Definition : ", wsd.lesk(Z.split(), 'die', pos=wn.VERB).definition())

Statement        :  What is dead may never die.


Wrong Definition :  a cutting tool that is fitted into a diestock and used for cutting male (external) screw threads on screws or bolts or pipes or rods
Right Definition :  stop operating or functioning
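
The idea behind Lesk is to pick the sense whose dictionary definition shares the most words with the sentence. The sketch below applies that overlap count to the three noun definitions of 'die' printed earlier, copied by hand so it runs without WordNet. It is a simplified illustration: `simplified_lesk` and `NOUN_GLOSSES` are hypothetical names, and ties are broken by dictionary order here, which may differ from what `nltk` does.

```python
# Noun definitions of 'die', copied from the WordNet output earlier in the notebook
NOUN_GLOSSES = {
    "die.n.01": "a small cube with 1 to 6 spots on the six faces; used in gambling to generate random numbers",
    "die.n.02": "a device used for shaping metal",
    "die.n.03": "a cutting tool that is fitted into a diestock and used for cutting male (external) screw threads on screws or bolts or pipes or rods",
}

def simplified_lesk(context_tokens, glosses):
    """Return (best sense, overlap counts) by counting shared words."""
    context = set(context_tokens)
    overlaps = {sense: len(context & set(gloss.split()))
                for sense, gloss in glosses.items()}
    best = max(overlaps, key=overlaps.get)
    return best, overlaps

for sentence in ["The die is cast.", "Roll the die to get a 6."]:
    best, overlaps = simplified_lesk(sentence.split(), NOUN_GLOSSES)
    print(sentence, "->", best, overlaps)
```

For "Roll the die to get a 6." the winner is `die.n.01`, but only through the words "the", "to" and "a". For "The die is cast." the winner is `die.n.03`, through the single word "is". The match is driven by function words rather than by meaning, which helps explain why the results in the examples above are so sensitive to the POS restriction.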


<a id="apt"></a>
<div style="color:blue;
           display:fill;
           border-radius:5px;
           background-color:blue;
           font-size:150%;
           font-family:Verdana;
           letter-spacing:0.5px">
<h1 style="text-align: center;
           padding: 10px;
           color:white;">
Automatic POS Tagging + Lesk with spaCy
</h1>
</div>

In [31]:
!pip install spacy



In [32]:
from spacy import load

nlp = load("en_core_web_sm")

In [33]:
import warnings

# Map spaCy's coarse POS tags to the WordNet POS constants that Lesk accepts
POS_MAP = {
    'VERB': wn.VERB,
    'NOUN': wn.NOUN,
    'PROPN': wn.NOUN
}

def lesk(doc, word):
    # Find the first token in the document whose text matches the word
    token = next((t for t in doc if t.text == word), None)
    if token is None:
        raise ValueError(f"Word '{word}' does not appear in the document: {doc.text}.")
    
    # Translate the token's POS tag; None means no POS filter is applied
    pos = POS_MAP.get(token.pos_)
    if pos is None:
        warnings.warn(f'POS tag {token.pos_} for {token.text} has no WordNet mapping. Falling back to default Lesk behaviour.')
    
    # Run Lesk on the tokenised sentence, restricted to the mapped POS if there is one
    return wsd.lesk([t.text for t in doc], token.text, pos=pos)

In [34]:
# Trying to find nearest match for the word 'die'
print("Statement  : ", Y)
doc = nlp(Y)
print("Definition : ", lesk(doc, 'die').definition())

Statement  :  Roll the die to get a 6.
Definition :  a small cube with 1 to 6 spots on the six faces; used in gambling to generate random numbers


In [35]:
# Trying to find nearest match for the word 'google'
T = "I work at google."
print("Statement  : ", T)
doc = nlp(T)
print("Definition : ", lesk(doc, 'google').definition())
print("\n")
T = "I will google it."
print("Statement  : ", T)
doc = nlp(T)
print("Definition : ", lesk(doc, 'google').definition())

Statement  :  I work at google.
Definition :  a widely used search engine that uses text-matching techniques to find web pages that are important and relevant to a user's search


Statement  :  I will google it.
Definition :  search the internet (for information) using the Google search engine
