# Search and Comparison Methods

All of us are familiar with searching a text for a specified word or character sequence (pattern). 

The goal is either to find an exact occurrence (match) or to find an inexact match, for example with regular expressions, in which some characters carry a special meaning, or with fuzzy matching.

In the inexact case, the match is usually a sequence of characters that is similar, but not identical, to the pattern.
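
The difference between exact and inexact matching can be sketched with the standard library alone. The sample sentence, the pattern `colou?r`, and the use of `difflib` as a stand-in for fuzzy matching are assumptions chosen for illustration; none of them come from the text above.

```python
import difflib
import re

text = "The colour of the sky differs from the color of the sea."

# Exact match: index of the first occurrence, or -1 if the pattern is absent
exact_pos = text.find("color")

# Regular expression: "?" makes the preceding "u" optional,
# so both spellings are found
regex_hits = re.findall(r"colou?r", text)

# Fuzzy match: rank candidates by similarity to a misspelled word
# (difflib scores a ratio of matching characters, not an edit distance)
fuzzy_hits = difflib.get_close_matches(
    "ligting", ["listing", "lighting", "apple"], n=2
)

print(exact_pos)
print(regex_hits)  # ['colour', 'color']
print(fuzzy_hits)  # ['lighting', 'listing']
```

The exact search skips `colour` entirely, while the regular expression and the fuzzy lookup both tolerate small variations in spelling.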

Similarity can also be measured by the way words sound: do two words sound alike even though they are spelled differently?

Another measure asks how many changes (edits) are necessary to get from one word to the other. The fewer edits required, the higher the similarity. This category of comparison includes the __Levenshtein distance__.

The `python-Levenshtein` package used later in this section can be installed with `pip install python-Levenshtein`.
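
To make the idea of counting edits concrete, here is a minimal pure-Python sketch of the Levenshtein distance using the classic dynamic-programming table. The function name `levenshtein` is chosen for illustration, and the code is not how any of the libraries in this section are implemented internally; it only shows the recurrence.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `source` into `target`."""
    rows, cols = len(source) + 1, len(target) + 1
    # table[i][j] = distance between source[:i] and target[:j]
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete all i characters
    for j in range(cols):
        table[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return table[-1][-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("banana", "bahama"))   # 2
```

The table costs time and memory proportional to the product of the two lengths, which is one reason compiled implementations are preferred for long strings.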

# Implementing Levenshtein Distance

- NLTK ships a ready-to-use edit distance function, `nltk.edit_distance`
- `editdistance` (`pip install editdistance`)
- `python-Levenshtein` (`pip install python-Levenshtein`)
- `pyxDamerauLevenshtein`
- `pylev` (`pip install pylev`)

## Using NLTK

In [2]:
import nltk
 
w1 = 'data science'
w2 = 'deep learning'
 
nltk.edit_distance(w1, w2)

10

#### Example #2
Basic spelling checker: suppose you have a misspelled word and a list of candidate words, and you want to find the nearest suggestion.

In [3]:
mistake = "ligting"
 
words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living', 'lighting', 'orange', 'walking', 'zoo']

edit_dist = {}

for word in words:
    dist = nltk.edit_distance(mistake, word)
    
    edit_dist[word] = dist

In [4]:
edit_dist

{'apple': 7,
 'bag': 6,
 'drawing': 4,
 'listing': 1,
 'linking': 2,
 'living': 2,
 'lighting': 1,
 'orange': 6,
 'walking': 4,
 'zoo': 7}

In [5]:
# Sort the words by ascending edit distance (ties keep their original order)
dict(sorted(edit_dist.items(), key=lambda item: item[1]))

{'listing': 1,
 'lighting': 1,
 'linking': 2,
 'living': 2,
 'drawing': 4,
 'walking': 4,
 'bag': 6,
 'orange': 6,
 'apple': 7,
 'zoo': 7}
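
A spelling checker usually needs only the closest candidates rather than the full ranking. The sketch below reuses the distances shown above as a literal dictionary; the helper name `nearest_words` is hypothetical.

```python
def nearest_words(distances: dict[str, int]) -> list[str]:
    """Return every word that shares the smallest distance."""
    best = min(distances.values())
    return [word for word, dist in distances.items() if dist == best]


# Distances from "ligting", copied from the output above
edit_dist = {'apple': 7, 'bag': 6, 'drawing': 4, 'listing': 1, 'linking': 2,
             'living': 2, 'lighting': 1, 'orange': 6, 'walking': 4, 'zoo': 7}

print(nearest_words(edit_dist))  # ['listing', 'lighting']
```

Both words are returned because edit distance alone cannot choose between them; a real checker would likely need an extra signal, such as word frequency, to break the tie.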

#### Example #3

Comparing sentences or paragraphs is useful in applications such as __plagiarism detection__ (checking whether one article was copied from another) and __translation memory systems__ (which store previously translated sentences; when a new untranslated sentence arrives, the system retrieves a similar stored sentence that a human translator can lightly edit instead of translating from scratch).

In [6]:
sent1 = "It might help to re-install Python if possible."

sent2 = "It can help to install Python again if possible."
sent3 = "It can be so helpful to reinstall C++ if possible."
sent4 = "help It possible Python to re-install if might." # This has the same words as sent1 with a different order.
sent5 = "I love Python programming."

In [7]:
ed_sent_1_2 = nltk.edit_distance(sent1, sent2)
ed_sent_1_3 = nltk.edit_distance(sent1, sent3)
ed_sent_1_4 = nltk.edit_distance(sent1, sent4)
ed_sent_1_5 = nltk.edit_distance(sent1, sent5)

In [8]:
print(ed_sent_1_2, 'Edit Distance between sent1 and sent2')
print(ed_sent_1_3, 'Edit Distance between sent1 and sent3')
print(ed_sent_1_4, 'Edit Distance between sent1 and sent4')
print(ed_sent_1_5, 'Edit Distance between sent1 and sent5')

14 Edit Distance between sent1 and sent2
19 Edit Distance between sent1 and sent3
32 Edit Distance between sent1 and sent4
33 Edit Distance between sent1 and sent5
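
The character-level distance penalises reordering heavily: sent4 contains the same words as sent1 yet scores 32, almost as far away as the unrelated sent5 at 33. One possible variation, not used in the cells above, is to count edits over words instead of characters. The sketch below assumes whitespace tokenisation and a hypothetical helper `sequence_edit_distance`, written in plain Python so that it accepts lists of tokens.

```python
def sequence_edit_distance(source, target):
    """Levenshtein distance over any two sequences (strings or token lists)."""
    previous = list(range(len(target) + 1))
    for i, s_item in enumerate(source, start=1):
        current = [i]
        for j, t_item in enumerate(target, start=1):
            cost = 0 if s_item == t_item else 1
            current.append(min(previous[j] + 1,          # delete s_item
                               current[j - 1] + 1,       # insert t_item
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


sent1 = "It might help to re-install Python if possible."
sent2 = "It can help to install Python again if possible."

# Whitespace tokenisation is an assumption; punctuation stays attached to words
tokens1, tokens2 = sent1.split(), sent2.split()

word_distance = sequence_edit_distance(tokens1, tokens2)
# Normalise by the longer sentence so the score falls between 0 and 1
similarity = 1 - word_distance / max(len(tokens1), len(tokens2))

print(word_distance)         # 3
print(round(similarity, 2))  # 0.67
```

Dividing by the longer length makes scores comparable across sentence pairs of different sizes, which raw distances are not.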


## Using editdistance

In [11]:
!pip install editdistance

Collecting editdistance
  Downloading editdistance-0.6.2-cp311-cp311-win_amd64.whl (22 kB)
Installing collected packages: editdistance
Successfully installed editdistance-0.6.2


In [12]:
import editdistance

In [13]:
editdistance.eval('banana', 'bahama')

2

In [14]:
# Two longer strings, reused for the timing comparisons later on
a = 'fsffvfdsbbdfvvdavavavavavava'
b = 'fvdaabavvvvvadvdvavavadfsfsdafvvav'

## Using Levenshtein

In [29]:
!pip install python-Levenshtein



In [30]:
import Levenshtein

In [31]:
%timeit Levenshtein.distance(a, b)

2.19 µs ± 194 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


## Using pylev

In [34]:
!pip install pylev

Collecting pylev
  Downloading pylev-1.4.0-py2.py3-none-any.whl (6.1 kB)
Installing collected packages: pylev
Successfully installed pylev-1.4.0


In [35]:
import pylev
distance = pylev.levenshtein('kitten', 'sitting')
distance

3

In [36]:
%timeit pylev.levenshtein(a, b)

616 µs ± 45.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## A short application - Find matching city names

In [17]:
import numpy as np
import pandas as pd

In [18]:
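# Quick NumPy check: L1 and L2 norms of a small vector
vec = np.array([1, 2, 3])  # named vec so it does not overwrite the string a defined earlier
print(vec)

l1 = np.linalg.norm(vec, 1)  # L1 norm: |1| + |2| + |3|
print(l1)

l2 = np.linalg.norm(vec, 2)  # L2 (Euclidean) norm: sqrt(1 + 4 + 9)
print(l2)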

[1 2 3]
6.0
3.7416573867739413


In [39]:
location = r'D:\MYLEARN\datasets\city_names.txt'

In [40]:
names_df = pd.read_csv(location, header=None, names=['city_name'])
names_df.shape


(385, 1)

In [41]:
names_df.head()

|   | city_name |
|---|---|
| 0 | Aberdeen |
| 1 | Abilene |
| 2 | Akron |
| 3 | Albany |
| 4 | Albuquerque |


In [42]:
given_city_name = 'ren'

# Compute the (case-sensitive) Levenshtein distance from the query to every city name
for idx, row in names_df.iterrows():
    distance = Levenshtein.distance(given_city_name, row.city_name)
    names_df.loc[idx, 'dist'] = distance

In [43]:
names_df.sort_values(['dist'])[:10]

|   | city_name | dist |
|---|---|---|
| 255 | Orem | 2.0 |
| 290 | Reno | 2.0 |
| 45 | Bryan | 3.0 |
| 2 | Akron | 3.0 |
| 98 | Erie | 3.0 |
| 116 | Fresno | 3.0 |
| 206 | Mesa | 3.0 |
| 369 | Warren | 3.0 |
| 248 | Ogden | 3.0 |
| 103 | Fargo | 4.0 |
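
Note that the comparison above is case-sensitive, so the lowercase query `ren` pays one edit for the capital `R` in `Reno`. A minimal sketch of a case-insensitive variant follows. It assumes a small hand-picked sample of the city names from the output above, and it uses a hypothetical pure-Python helper `distance` (a memoised recursive form of the Levenshtein recurrence) in place of the `Levenshtein` package so that it runs without extra installs.

```python
from functools import lru_cache


def distance(source: str, target: str) -> int:
    """Levenshtein distance via the memoised recursive definition."""
    @lru_cache(maxsize=None)
    def solve(i: int, j: int) -> int:
        # Distance between source[:i] and target[:j]
        if i == 0:
            return j
        if j == 0:
            return i
        cost = 0 if source[i - 1] == target[j - 1] else 1
        return min(solve(i - 1, j) + 1,         # deletion
                   solve(i, j - 1) + 1,         # insertion
                   solve(i - 1, j - 1) + cost)  # substitution or match
    return solve(len(source), len(target))


# Sample taken from the output above, not the full data set
cities = ["Orem", "Reno", "Akron", "Erie", "Fresno"]
query = "ren"

# casefold() normalises case on both sides before comparing
ranked = sorted((distance(query.casefold(), city.casefold()), city)
                for city in cities)

print(ranked[0])  # (1, 'Reno')
```

With case folded, Reno drops to distance 1 and becomes the unique best match, whereas in the case-sensitive ranking it tied with Orem at 2.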
