# Levenshtein distance

The Levenshtein distance, also known as the edit distance, is a way to measure how different two strings are by counting the minimum number of operations needed to transform one string into the other. These operations can be insertions, deletions, or substitutions of individual characters.

Imagine you have two words, let's say "kitten" and "sitting," and you want to find out how different they are using Levenshtein distance.

Here's how it works step by step:

1. First, we set up a grid, where one string is on the top (in this case "kitten") and the other is on the side (in this case "sitting"). Each cell in the grid represents a comparison between two characters, one from the top string and one from the side string.

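2. We fill in the grid from the top-left corner to the bottom-right corner. The grid has one extra row and one extra column for the empty string; they hold 0, 1, 2, and so on, because turning an empty string into a prefix of length n takes n insertions. At each remaining cell, we compare the characters from the two strings. If they are the same, no operation is needed, and we simply copy the value from the cell diagonally up and to the left.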

3. If the characters are different, we have three options. Each one costs a single operation, and we keep the cheapest:

- **Insertion**: We can insert a character from the side string into the top string. This takes the value from the cell above and adds 1.
- **Deletion**: We can delete a character from the top string. This takes the value from the cell on the left and adds 1.
- **Substitution**: We can substitute one character with another. This takes the value from the cell diagonally up and to the left and adds 1.

We keep filling in the grid, considering these three operations, until we reach the bottom-right corner of the grid. The number in that cell represents the Levenshtein distance between the two words. In our example, the Levenshtein distance between "kitten" and "sitting" is 3.
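
The grid described above can be sketched in a few lines of Python. This is only an illustration of the steps in the text: the helper names `build_grid` and `format_grid` are made up for this example, and the layout (top string across the columns, side string down the rows) follows the description rather than any particular library.

```python
def build_grid(top, side):
    """Fill the grid so that grid[i][j] is the distance between
    the first j characters of `top` and the first i characters of `side`."""
    grid = [[0] * (len(top) + 1) for _ in range(len(side) + 1)]

    # Extra first column and row: distance to or from the empty string.
    for i in range(len(side) + 1):
        grid[i][0] = i
    for j in range(len(top) + 1):
        grid[0][j] = j

    for i in range(1, len(side) + 1):
        for j in range(1, len(top) + 1):
            if side[i - 1] == top[j - 1]:
                # Same character: copy the diagonal value.
                grid[i][j] = grid[i - 1][j - 1]
            else:
                # Different characters: cheapest of the three options, plus 1.
                grid[i][j] = 1 + min(
                    grid[i - 1][j],      # cell above (insertion)
                    grid[i][j - 1],      # cell on the left (deletion)
                    grid[i - 1][j - 1],  # diagonal cell (substitution)
                )
    return grid


def format_grid(top, side, grid):
    """Return the grid as text, labelled with the characters of both strings."""
    lines = ["   " + "".join(f"{ch:>3}" for ch in " " + top)]
    for label, row in zip(" " + side, grid):
        lines.append(f"{label:>3}" + "".join(f"{value:>3}" for value in row))
    return "\n".join(lines)


grid = build_grid("kitten", "sitting")
print(format_grid("kitten", "sitting", grid))
print("Distance:", grid[-1][-1])
```

The last line prints the bottom-right cell, which is the distance for the full strings.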

$$
\operatorname{lev}(a, b) = \begin{cases}
  |a| & \text{ if } |b| = 0, \\
  |b| & \text{ if } |a| = 0, \\
  \operatorname{lev}(\text{tail}(a),\text{tail}(b)) & \text{ if } a[0] = b[0], \\
  1 + \min \begin{cases}
          \operatorname{lev}(\text{tail}(a), b) \\
          \operatorname{lev}(a, \text{tail}(b)) \\
          \operatorname{lev}(\text{tail}(a), \text{tail}(b)) \\
       \end{cases} & \text{ otherwise}
\end{cases}
$$

Here's how the formula relates to this process:

* `lev(a, b)` represents the Levenshtein distance between strings `a` and `b`, and `tail(x)` is the string `x` without its first character.
* The formula starts by checking whether either string is empty (length 0). If one of them is empty, the distance is simply the length of the other string, because every remaining character must be inserted or deleted.
* If the first characters of `a` and `b` are the same, no operation is needed for them, so the distance equals the distance between the remaining parts of the strings, which is calculated recursively.
* If the first characters are different, we pay for one operation and take the cheapest of three continuations: `lev(tail(a), b)` corresponds to deleting the first character of `a`, `lev(a, tail(b))` to inserting the first character of `b`, and `lev(tail(a), tail(b))` to substituting one for the other.

In simpler terms, it's like solving a puzzle by finding the shortest path through the grid, where each cell represents a choice to either match characters, insert, delete, or substitute, and we're looking for the fewest steps to make the two strings the same.
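
To make the link between the formula and code concrete, here is a minimal recursive sketch that follows the four cases one by one. The function name `lev_recursive` and the use of index offsets in place of `tail` are choices made for this example. Without the cache, the plain recursion would recompute the same subproblems many times, so `functools.lru_cache` is added to keep it practical for short strings.

```python
from functools import lru_cache


def lev_recursive(a, b):
    """Levenshtein distance written to mirror the recursive definition."""

    @lru_cache(maxsize=None)
    def go(i, j):
        # a[i:] and b[j:] play the role of repeatedly applying tail().
        if j == len(b):          # |b| = 0: delete what is left of a
            return len(a) - i
        if i == len(a):          # |a| = 0: insert what is left of b
            return len(b) - j
        if a[i] == b[j]:         # first characters match: no cost
            return go(i + 1, j + 1)
        return 1 + min(
            go(i + 1, j),        # lev(tail(a), b): deletion
            go(i, j + 1),        # lev(a, tail(b)): insertion
            go(i + 1, j + 1),    # lev(tail(a), tail(b)): substitution
        )

    return go(0, 0)


print(lev_recursive("kitten", "sitting"))
```

Because this version recurses once per character, very long inputs can hit Python's recursion limit; the grid-based approach avoids that.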

```python
def levenshtein_distance(s1, s2):
    # Create a (len(s1) + 1) x (len(s2) + 1) matrix to store the distances
    matrix = [[0 for _ in range(len(s2) + 1)] for _ in range(len(s1) + 1)]

    # Initialize the first column and row (distance to the empty string)
    for i in range(len(s1) + 1):
        matrix[i][0] = i
    for j in range(len(s2) + 1):
        matrix[0][j] = j

    # Fill in the matrix
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            matrix[i][j] = min(
                matrix[i - 1][j] + 1,         # Deletion
                matrix[i][j - 1] + 1,         # Insertion
                matrix[i - 1][j - 1] + cost   # Substitution (or match when cost is 0)
            )

    # The final value in the matrix is the Levenshtein distance
    return matrix[len(s1)][len(s2)]

# Example usage:
s1 = "kitten"
s2 = "sitting"
distance = levenshtein_distance(s1, s2)
print(f"The Levenshtein distance between '{s1}' and '{s2}' is {distance}")
```

```
The Levenshtein distance between 'kitten' and 'sitting' is 3
```
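
The distance alone does not say which edits were made. As a possible extension, the sketch below walks back through the same kind of matrix from the bottom-right corner to recover one shortest sequence of operations. The function name `edit_script` and the wording of the operation strings are invented for this example, and when several shortest sequences exist it simply returns the first one it finds.

```python
def edit_script(source, target):
    """Return one minimal list of edits that turns `source` into `target`."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,
                dist[i][j - 1] + 1,
                dist[i - 1][j - 1] + cost,
            )

    # Walk back from the bottom-right corner, preferring the diagonal step.
    ops = []
    i, j = len(source), len(target)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if source[i - 1] == target[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost:
                    ops.append(f"substitute '{source[i - 1]}' -> '{target[j - 1]}'")
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(f"delete '{source[i - 1]}'")
            i -= 1
        else:
            ops.append(f"insert '{target[j - 1]}'")
            j -= 1

    ops.reverse()  # operations were collected from the end of the strings
    return ops


for op in edit_script("kitten", "sitting"):
    print(op)
```

For "kitten" and "sitting" this lists two substitutions and one insertion, which matches the distance of 3.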


Fuzzywuzzy is a Python library for fuzzy (approximate) string matching. It compares strings and returns similarity scores from 0 to 100. Its scores come from edit-based comparison in the spirit of the Levenshtein distance: the library relies on Python's built-in `difflib.SequenceMatcher` and uses the faster python-Levenshtein package when it is installed.

**String Comparison**: `fuzz.ratio()` compares two strings and returns a similarity score, often called a "fuzzy match score", as an integer from 0 (nothing in common) to 100 (identical). The fewer insertions, deletions, and substitutions are needed to turn one string into the other, the higher the score.

```python
from fuzzywuzzy import fuzz

string1 = "apple"
string2 = "apples"

# Calculate similarity score between string1 and string2
similarity_score = fuzz.ratio(string1, string2)
print(f"Similarity Score: {similarity_score}%")
```

```
Similarity Score: 91%
```



**Partial Ratio**: One of the functions provided by Fuzzywuzzy is `fuzz.partial_ratio()`. It scores the best partial match: the shorter string is compared against substrings of the longer string, and the highest score wins. This is particularly useful when one string may appear, possibly with small differences, inside a longer one. The score for each candidate is computed with the same edit-based comparison as `fuzz.ratio()`.

```python
from fuzzywuzzy import fuzz

string1 = "programming in Python is fun"
string2 = "Python programming is really fun"

# Calculate partial ratio between string1 and string2
partial_ratio = fuzz.partial_ratio(string1, string2)
print(f"Partial Ratio: {partial_ratio}%")
```

```
Partial Ratio: 72%
```
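
To show the idea of a "best partial match" without depending on the library, here is a simplified sketch built on the standard library's `difflib`. It slides a window the length of the shorter string across the longer one and keeps the best score. The name `partial_ratio_sketch` is made up for this example, and the approach is an assumption-laden simplification: the real `fuzz.partial_ratio()` chooses its candidate substrings differently, so the numbers will not always agree.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(s1, s2):
    """Best similarity (0-100) between the shorter string and any
    equally long window of the longer string."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    if not shorter:
        return 0  # choice made for this sketch: nothing to match against

    window = len(shorter)
    best = 0.0
    for start in range(len(longer) - window + 1):
        candidate = longer[start:start + window]
        score = SequenceMatcher(None, shorter, candidate).ratio()
        best = max(best, score)
    return round(100 * best)


# The shorter string appears verbatim inside the longer one.
print(partial_ratio_sketch("Python", "programming in Python is fun"))
```

In this usage example the shorter string occurs exactly inside the longer one, so the best window is a perfect match.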


**Token Set Ratio**: Fuzzywuzzy also offers `fuzz.token_set_ratio()`, which is handy when comparing sets of words (tokens) within strings. It splits both strings into tokens, separates the tokens they share from the tokens unique to each, and scores the resulting combinations. Because it works on sets, word order and repeated words do not affect the result. The individual comparisons again use the same edit-based scoring as `fuzz.ratio()`.

```python
from fuzzywuzzy import fuzz

string1 = "data science in Python is interesting"
string2 = "Python is great for data science"

# Calculate token set ratio between string1 and string2
token_set_ratio = fuzz.token_set_ratio(string1, string2)
print(f"Token Set Ratio: {token_set_ratio}%")
```

```
Token Set Ratio: 81%
```
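
The token-set idea can be approximated with Python sets and `difflib`. The sketch below is an illustration under stated assumptions, not the library's code: the name `token_set_ratio_sketch` is invented, tokens are produced by lowercasing and splitting on whitespace only (the library also strips punctuation), and scoring uses `difflib.SequenceMatcher` directly.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity of two strings as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio_sketch(s1, s2):
    tokens1 = set(s1.lower().split())
    tokens2 = set(s2.lower().split())

    # Shared tokens and the tokens unique to each string, in sorted order
    # so that the original word order no longer matters.
    common = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))

    combined1 = f"{common} {only1}".strip()
    combined2 = f"{common} {only2}".strip()

    # Score the three pairings and keep the best one.
    return max(
        ratio(common, combined1),
        ratio(common, combined2),
        ratio(combined1, combined2),
    )


print(token_set_ratio_sketch(
    "data science in Python is interesting",
    "Python is great for data science",
))
print(token_set_ratio_sketch("fuzzy string matching", "matching string fuzzy"))
```

The second call reorders the same three words, so both strings reduce to the same token set and the sketch scores them as identical.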


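**Scoring Mechanism**: Fuzzywuzzy turns an edit-based comparison of the two strings into a score from 0 to 100. The higher the score, the more similar the strings are. The score is not simply the Levenshtein distance divided by the string length: `fuzz.ratio()` follows the `difflib.SequenceMatcher` formula 2·M/T, where M is the number of matching characters and T is the total length of both strings. For "kitten" and "sitting", M is 4 ("itt" and "n") and T is 13, which gives 8/13, or about 62%, even though the Levenshtein distance is 3.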

```python
from fuzzywuzzy import fuzz

string1 = "kitten"
string2 = "sitting"

# Calculate similarity score between string1 and string2
similarity_score = fuzz.ratio(string1, string2)
print(f"Similarity Score: {similarity_score}%")
```

```
Similarity Score: 62%
```
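
To see where a score like 62% can come from, the sketch below compares two ways of turning a string comparison into a percentage, using only the standard library. The helper names are made up for this example. The first normalizes the Levenshtein distance by the longer string's length; the second uses the `difflib.SequenceMatcher` ratio, which, to the best of my knowledge, is the formula `fuzz.ratio()` follows.

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Levenshtein distance, keeping only two rows of the grid in memory."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein_score(a, b):
    """100 * (1 - distance / length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


def matching_ratio_score(a, b):
    """100 * 2M/T, where M is matching characters and T the total length."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


for a, b in [("kitten", "sitting"), ("apple", "apples")]:
    print(a, b, normalized_levenshtein_score(a, b), matching_ratio_score(a, b))
```

The two scores differ for the same pair of strings, which is why a Fuzzywuzzy score cannot be read back directly as a Levenshtein distance.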


**Customizable Thresholds**: Because every score is a number from 0 to 100, you can choose a similarity threshold and filter out results that fall below it. The threshold is a value you set in your own code, and the right cutoff depends on your data. This can be useful in applications like string deduplication or fuzzy string matching.

```python
from fuzzywuzzy import fuzz

string1 = "apple"
string2 = "apples"

# Calculate similarity score between string1 and string2
similarity_score = fuzz.ratio(string1, string2)

# Set a similarity threshold (e.g., 80%)
threshold = 80

if similarity_score >= threshold:
    print("Strings are similar.")
else:
    print("Strings are not similar.")
```

```
Strings are similar.
```


**Practical Use Cases**: Fuzzywuzzy is often used in scenarios where approximate string matching is required, such as deduplication of records in databases, spell-checking, and search engines. It's valuable when dealing with user-generated text that may contain typos, abbreviations, or slight variations.

```python
from fuzzywuzzy import fuzz

# Sample list of strings (possibly containing duplicates)
strings = ["apple", "apples", "appl", "banana", "app"]

# Deduplicate the list based on a similarity threshold (e.g., 80%)
threshold = 80

# Create a dictionary to store unique strings
unique_strings = {}

for string in strings:
    # Check if a similar string is already in the unique_strings dictionary
    found_similar = False
    for key in unique_strings.keys():
        if fuzz.ratio(string, key) >= threshold:
            found_similar = True
            break
    if not found_similar:
        unique_strings[string] = True

# Extract the unique strings
unique_strings_list = list(unique_strings.keys())
print("Unique Strings:", unique_strings_list)
```

```
Unique Strings: ['apple', 'banana', 'app']
```


In this example, Fuzzywuzzy is used to deduplicate a list of strings by comparing each string to the existing unique strings based on a similarity threshold. If a similar string is found, the new string is not added to the `unique_strings` dictionary. This can be useful for cleaning up datasets or lists with potentially similar entries.
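
The same deduplication loop can be wrapped in a small reusable function, which makes it easy to see how the threshold changes the outcome. This is a sketch under assumptions: the names `similarity` and `deduplicate` are invented here, and `difflib.SequenceMatcher` stands in for `fuzz.ratio()` so the example runs without extra packages. With Fuzzywuzzy installed, `fuzz.ratio` could be passed as the `scorer` argument instead.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer: similarity as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def deduplicate(strings, threshold, scorer=similarity):
    """Keep each string only if no already-kept string scores >= threshold."""
    unique = []
    for candidate in strings:
        if not any(scorer(candidate, kept) >= threshold for kept in unique):
            unique.append(candidate)
    return unique


strings = ["apple", "apples", "appl", "banana", "app"]
print(deduplicate(strings, threshold=80))
print(deduplicate(strings, threshold=70))
```

Lowering the threshold from 80 to 70 also treats "app" as a near-duplicate of "apple" (their score is 75), so the cutoff is worth tuning against real data. The result also depends on the order of the input, because each string is only compared with the strings kept so far.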