## Goal

Can we use the fuzzywuzzy library (now published as thefuzz) to detect similar words like `homework` and `hw`?


`Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.`

https://pypi.org/project/fuzzywuzzy/0.18.0/

https://github.com/seatgeek/thefuzz
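
For reference, the Levenshtein distance mentioned in the project description is the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into another. The sketch below is a plain dynamic-programming version written for illustration only; the function name `levenshtein` is hypothetical, and the library's own scoring may be computed differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]


print(levenshtein("homework", "hw"))
```

The distance counts character edits only, so it carries no information about whether two words mean the same thing.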

## Install

In [2]:
!pip install fuzzywuzzy==0.18.0
!pip install thefuzz

Collecting thefuzz
  Downloading https://files.pythonhosted.org/packages/24/7c/2acf47d228b0c0879468b4e2fd15a14eb58bd97897b4bb8a9a7ed47d22f7/thefuzz-0.19.0-py2.py3-none-any.whl
Installing collected packages: thefuzz
Successfully installed thefuzz-0.19.0


## Import

In [10]:
from thefuzz import fuzz
from thefuzz import process

## Test 1

In [4]:
words = ['assignment', 'ass', 'assessment', 'homework', 'paper', 'test', 'exam', 'midterm', 'final']

## Compute similarity between a query and all other words

In [21]:
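# Index of the query word in the list.
query_inx = words.index("assignment")

# Map (query index, target index) -> similarity score from 0 to 100.
sims = {}

for target_inx in range(len(words)):
    key = query_inx, target_inx
    sims[key] = fuzz.ratio(words[query_inx], words[target_inx])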

## Sort similarities

In [22]:
sorted_sims = dict(sorted(sims.items(), key=lambda item: item[1], reverse=True))

## Log based on similarity

In [23]:
sorted_sims 

{(0, 0): 100,
 (0, 2): 70,
 (0, 1): 46,
 (0, 5): 29,
 (0, 6): 29,
 (0, 4): 27,
 (0, 7): 24,
 (0, 3): 22,
 (0, 8): 13}

In [24]:
for key, score in sorted_sims.items():
    print("Sim({}, {}) --> {}".format(words[key[0]], words[key[1]], score))

Sim(assignment, assignment) --> 100
Sim(assignment, assessment) --> 70
Sim(assignment, ass) --> 46
Sim(assignment, test) --> 29
Sim(assignment, exam) --> 29
Sim(assignment, paper) --> 27
Sim(assignment, midterm) --> 24
Sim(assignment, homework) --> 22
Sim(assignment, final) --> 13
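
The compute, sort, and print steps can be wrapped in one small helper so they are reusable for any query word. The sketch below is a stand-in, not the notebook's code: `ratio_score` and `rank_by_similarity` are hypothetical names, and scoring uses the standard library's `difflib.SequenceMatcher` instead of `fuzz.ratio` so that it runs without extra installs. For the word list used here it gives the same ranking as the output above, but that is not guaranteed for other inputs.

```python
from difflib import SequenceMatcher


def ratio_score(a: str, b: str) -> int:
    """Similarity from 0 to 100, using difflib as an assumed stand-in scorer."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def rank_by_similarity(query, candidates):
    """Return (candidate, score) pairs sorted from most to least similar."""
    scored = [(candidate, ratio_score(query, candidate)) for candidate in candidates]
    # sorted() is stable, so tied candidates keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


words = ['assignment', 'ass', 'assessment', 'homework', 'paper',
         'test', 'exam', 'midterm', 'final']

for word, score in rank_by_similarity("assignment", words):
    print("Sim({}, {}) --> {}".format("assignment", word, score))
```

Returning a list of pairs instead of a dict keyed by index tuples keeps the words next to their scores, which avoids the index lookups at print time.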


## Try "homework"

In [25]:
query_inx = words.index("homework")

sims = {}

for target_inx in range(len(words)):
    key = query_inx, target_inx
    sims[key] = fuzz.ratio(words[query_inx], words[target_inx])

In [26]:
sorted_sims = dict(sorted(sims.items(), key=lambda item: item[1], reverse=True))

In [27]:
for key, score in sorted_sims.items():
    print("Sim({}, {}) --> {}".format(words[key[0]], words[key[1]], score))

Sim(homework, homework) --> 100
Sim(homework, midterm) --> 40
Sim(homework, paper) --> 31
Sim(homework, assignment) --> 22
Sim(homework, assessment) --> 22
Sim(homework, test) --> 17
Sim(homework, exam) --> 17
Sim(homework, ass) --> 0
Sim(homework, final) --> 0


## Observations

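As expected, `Levenshtein Distance` has no notion of the underlying meaning of words, so the scores are not a useful measure of semantic similarity. For example, `homework` scores higher against `midterm` (40) than against `assignment` (22).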
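
To make this concrete, the sketch below lists which character runs two words share and the score that results, including the `homework` / `hw` pair from the goal. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for the library's scorer (an assumption), and `shared_pieces` and `ratio_score` are hypothetical helpers, not part of thefuzz.

```python
from difflib import SequenceMatcher


def shared_pieces(a: str, b: str) -> list:
    """Character runs that the matcher aligns between a and b."""
    matcher = SequenceMatcher(None, a, b)
    # The final matching block is a zero-length sentinel, so skip empty runs.
    return [a[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size > 0]


def ratio_score(a: str, b: str) -> int:
    """Similarity from 0 to 100, using difflib as an assumed stand-in scorer."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


for other in ("hw", "midterm", "assignment"):
    print(other, shared_pieces("homework", other), ratio_score("homework", other))
```

The score depends only on the shared characters and the word lengths, so an abbreviation such as `hw` gets no credit for meaning the same thing as `homework`.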