# Levenshtein distance and TheFuzz

The Levenshtein distance, also known as the edit distance, is a way to measure how different two strings are by counting the minimum number of operations needed to transform one string into the other. These operations can be insertions, deletions, or substitutions of individual characters.

Imagine you have two words, let's say "kitten" and "sitting," and you want to find out how different they are using Levenshtein distance.

Here's how it works step by step:

1. First, we set up a grid, where one string is on the top (in this case "kitten") and the other is on the side (in this case "sitting"), with one extra row and one extra column for the empty string. Each cell holds the distance between the part of the top string up to that column and the part of the side string up to that row. The extra first row and column are filled with 0, 1, 2, and so on, because turning an empty string into a string of length n takes n operations.

2. We fill in the rest of the grid from the top-left corner to the bottom-right corner. At each cell, we compare the characters from the two strings. If they are the same, no operation is needed, and we simply copy the value from the diagonal cell above and to the left.

3. If the characters are different, we have three options, each of which costs one operation:

- **Insertion**: We can insert a character from the side string into the top string. This means we take the value from the cell above.
- **Deletion**: We can delete a character from the top string. This means we take the value from the cell on the left.
- **Substitution**: We can substitute one character with another. This means we take the value from the cell diagonally up and to the left.

We pick the smallest of these three values and add 1. We keep filling in the grid this way until we reach the bottom-right corner. The number in that cell is the Levenshtein distance between the two words. In our example, the Levenshtein distance between "kitten" and "sitting" is 3.
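The grid described above can be printed to see how the values build up. The following is a minimal sketch under the same layout (top string across the columns, side string down the rows); the helper name `build_grid` is hypothetical and chosen only for this illustration.

```python
def build_grid(top, side):
    """Return the filled grid with `top` across the columns and `side` down the rows."""
    rows, cols = len(side) + 1, len(top) + 1
    grid = [[0] * cols for _ in range(rows)]
    for c in range(cols):
        grid[0][c] = c  # first row: prefix of `top` versus the empty string
    for r in range(rows):
        grid[r][0] = r  # first column: empty string versus prefix of `side`
    for r in range(1, rows):
        for c in range(1, cols):
            if top[c - 1] == side[r - 1]:
                grid[r][c] = grid[r - 1][c - 1]  # match: copy the diagonal value
            else:
                grid[r][c] = 1 + min(
                    grid[r - 1][c],      # insertion (cell above)
                    grid[r][c - 1],      # deletion (cell to the left)
                    grid[r - 1][c - 1],  # substitution (diagonal cell)
                )
    return grid


top, side = "kitten", "sitting"
grid = build_grid(top, side)

# Print a header row, then one labelled row per character of the side string.
print(" ", " ".join(f"{ch:>2}" for ch in " " + top))
for label, row in zip(" " + side, grid):
    print(label, " ".join(f"{value:2d}" for value in row))
print("distance:", grid[-1][-1])  # distance: 3
```

The bottom-right value is the distance; every other cell is the distance between two shorter prefixes.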

$$
\operatorname{lev}(a, b) = \begin{cases}
  |a| & \text{ if } |b| = 0, \\
  |b| & \text{ if } |a| = 0, \\
  \operatorname{lev}(\text{tail}(a),\text{tail}(b)) & \text{ if } a[0] = b[0], \\
  1 + \min \begin{cases}
          \operatorname{lev}(\text{tail}(a), b) \\
          \operatorname{lev}(a, \text{tail}(b)) \\
          \operatorname{lev}(\text{tail}(a), \text{tail}(b)) \\
       \end{cases} & \text{ otherwise}
\end{cases}
$$

Here's how the formula relates to this process:

* `lev(a, b)` represents the Levenshtein distance between strings `a` and `b`.
* The formula starts by checking if either of the strings is empty (length is 0). If one of them is empty, the distance is simply the length of the other string.
* If the first characters of the strings `a` and `b` are the same, we look at the Levenshtein distance between the remaining parts of the strings, which is calculated recursively.
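* If the first characters are different, we consider all three operations and choose the one that minimizes the distance, adding 1 for the operation itself: deleting the first character of `a` (`lev(tail(a), b)`), inserting the first character of `b` (`lev(a, tail(b))`), or substituting one for the other (`lev(tail(a), tail(b))`).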

In simpler terms, it's like solving a puzzle by finding the shortest path through the grid, where each cell represents a choice to either match characters, insert, delete, or substitute, and we're looking for the fewest steps to make the two strings the same.
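The recursive formula can also be written down almost literally. This is a minimal sketch, not an efficient implementation: the function name `lev` mirrors the formula, `tail(x)` is expressed as the slice `x[1:]`, and `functools.lru_cache` is added here (it is not part of the formula) so that repeated subproblems are not recomputed.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def lev(a: str, b: str) -> int:
    """Levenshtein distance following the recursive definition case by case."""
    if len(b) == 0:
        return len(a)
    if len(a) == 0:
        return len(b)
    if a[0] == b[0]:
        # First characters match: no cost, continue with the tails.
        return lev(a[1:], b[1:])
    return 1 + min(
        lev(a[1:], b),      # delete the first character of a
        lev(a, b[1:]),      # insert the first character of b
        lev(a[1:], b[1:]),  # substitute one first character for the other
    )


print(lev("kitten", "sitting"))  # 3
```

For long strings the grid-based version is preferable, because the recursion depth grows with the string length.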

```python
def levenshtein_distance(s1, s2):
    # Create a matrix to store the distances between all prefixes
    matrix = [[0 for _ in range(len(s2) + 1)] for _ in range(len(s1) + 1)]

    # Initialize the first column and row (distance to the empty string)
    for i in range(len(s1) + 1):
        matrix[i][0] = i
    for j in range(len(s2) + 1):
        matrix[0][j] = j

    # Fill in the matrix
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            matrix[i][j] = min(
                matrix[i - 1][j] + 1,         # Deletion
                matrix[i][j - 1] + 1,         # Insertion
                matrix[i - 1][j - 1] + cost   # Substitution (free if the characters match)
            )

    # The final value in the matrix is the Levenshtein distance
    return matrix[len(s1)][len(s2)]

# Example usage:
s1 = "kitten"
s2 = "sitting"
distance = levenshtein_distance(s1, s2)
print(f"The Levenshtein distance between '{s1}' and '{s2}' is {distance}")
```

```text
The Levenshtein distance between 'kitten' and 'sitting' is 3
```


# TheFuzz

String similarity and matching are common tasks in data analysis, natural language processing, and information retrieval. TheFuzz is a Python library that provides tools for comparing strings and finding similar matches. In this tutorial, you will learn how to use TheFuzz to perform various string matching operations, including simple and partial ratios, token sort and token set ratios, partial token sort ratios, and ranking candidates with the `process` module.

## Prerequisites

Before we begin, make sure you have TheFuzz installed. You can install it using pip:

```bash
pip install thefuzz
```



1. **Simple Ratio**:

Imagine you have a database of product names, and you want to find products that are similar to a given query. The `fuzz.ratio()` function calculates a similarity score between two strings by comparing them in full, character by character. It returns a value between 0 and 100, where higher values indicate greater similarity.

```python
from thefuzz import fuzz

# Database of product names
products = ["Apple iPhone 13 Pro", "Samsung Galaxy S21", "Google Pixel 6", "Sony Xperia 1 III"]

# User query
user_query = "Samsung Galaxy S20"

# Calculate similarity ratio for each product
similarities = [(product, fuzz.ratio(user_query, product)) for product in products]

# Find the most similar product
most_similar_product = max(similarities, key=lambda x: x[1])

print(f"The most similar product to '{user_query}' is '{most_similar_product[0]}' with a similarity ratio of {most_similar_product[1]}")
```

```text
The most similar product to 'Samsung Galaxy S20' is 'Samsung Galaxy S21' with a similarity ratio of 94
```
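To make the 0 to 100 scale more concrete, here is a minimal standard-library sketch of one common way to define such a ratio: twice the length of the longest common subsequence, divided by the combined length of both strings. The helper names `lcs_length` and `similarity_ratio` are hypothetical, and this is an assumption about how the score behaves, not TheFuzz's actual implementation, which may differ in details.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity_ratio(a, b):
    """Illustrative 0-100 score: 2 * matched characters / total characters."""
    total = len(a) + len(b)
    if total == 0:
        return 100  # two empty strings are treated as identical in this sketch
    return round(100 * 2 * lcs_length(a, b) / total)


print(similarity_ratio("Samsung Galaxy S20", "Samsung Galaxy S21"))  # 94
```

Here 17 of the 18 characters in each string line up, so the score is 2 × 17 / 36 ≈ 0.944, which rounds to 94.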


2. **Partial Ratio**:

Now, let's say you have a list of customer names, and you want to find matches for partially typed names. The `fuzz.partial_ratio()` function scores the best-matching part of the longer string rather than the whole string, which makes it useful when the query is only a fragment of a database entry.

```python
from thefuzz import fuzz

# List of customer names
customer_names = ["John Doe", "Jane Smith", "Robert Johnson", "Sarah Williams"]

# Partially typed customer name
partial_name = "Rob Jhon"

# Calculate partial ratio for each name
similar_names = [(name, fuzz.partial_ratio(partial_name, name)) for name in customer_names]

# Find the best matching name
best_match = max(similar_names, key=lambda x: x[1])

print(f"The best match for '{partial_name}' is '{best_match[0]}' with a partial ratio of {best_match[1]}")
```

```text
The best match for 'Rob Jhon' is 'Robert Johnson' with a partial ratio of 62
```
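The idea of partial matching can be sketched with the standard library alone: slide a window the size of the shorter string across the longer one and keep the best score. The function name `sliding_partial_ratio` is hypothetical, and `difflib.SequenceMatcher` is used here as a stand-in scorer, so the numbers will not necessarily agree with `fuzz.partial_ratio()`.

```python
from difflib import SequenceMatcher


def sliding_partial_ratio(a, b):
    """Best 0-100 score of the shorter string against any equal-length window of the longer one."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        # Assumption for this sketch: an empty string only matches another empty string.
        return 100 if not longer else 0
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        score = SequenceMatcher(None, shorter, window, autojunk=False).ratio()
        best = max(best, score)
    return round(100 * best)


print(sliding_partial_ratio("Johnson", "Robert Johnson"))  # 100
```

Because "Johnson" appears verbatim inside "Robert Johnson", one window matches exactly and the score is 100, even though the full strings differ.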


3. **Token Sort Ratio**:

The `fuzz.token_sort_ratio()` function splits each string into words, sorts the words, and then compares the results. Two strings that contain the same words in a different order therefore still receive a high score.

```python
from thefuzz import fuzz

string1 = "fuzzy wuzzy was a bear"
string2 = "wuzzy fuzzy was a bear"

# Calculate token sort ratio using fuzz.token_sort_ratio
token_sort_ratio = fuzz.token_sort_ratio(string1, string2)
print(f"Token Sort Ratio: {token_sort_ratio}")
```

```text
Token Sort Ratio: 100
```
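The score of 100 is easier to see once the words are sorted by hand. The sketch below uses a hypothetical helper, `sort_tokens`, that lowercases, splits on whitespace, and sorts; TheFuzz applies its own preprocessing, so treat this as an approximation of the idea and not the library's exact steps.

```python
def sort_tokens(text):
    """Lowercase the text, split it into words, and rejoin the words in sorted order."""
    return " ".join(sorted(text.lower().split()))


first = sort_tokens("fuzzy wuzzy was a bear")
second = sort_tokens("wuzzy fuzzy was a bear")

print(first)            # a bear fuzzy was wuzzy
print(second)           # a bear fuzzy was wuzzy
print(first == second)  # True
```

After sorting, both strings are identical, so any character-level comparison of them yields a perfect score.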


4. **Token Set Ratio**:

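The `fuzz.token_set_ratio()` function compares the sets of distinct words in each string, so repeated words are ignored and extra words are penalized less than with a full comparison. It is well suited to strings that share a core set of words but differ in length.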

```python
from thefuzz import fuzz

string1 = "fuzzy was a bear"
string2 = "fuzzy fuzzy was a bear"

# Calculate token set ratio using fuzz.token_set_ratio
token_set_ratio = fuzz.token_set_ratio(string1, string2)
print(f"Token Set Ratio: {token_set_ratio}")
```

```text
Token Set Ratio: 100
```
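The set-based view can be illustrated without the library. The sketch below splits both strings into word sets and reports what they share and what is unique to each side; the helper name `token_sets` is hypothetical, and the real function goes on to score strings built from these groups, which is not reproduced here.

```python
def token_sets(a, b):
    """Return (common words, words only in a, words only in b), each sorted."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    common = sorted(tokens_a & tokens_b)
    only_a = sorted(tokens_a - tokens_b)
    only_b = sorted(tokens_b - tokens_a)
    return common, only_a, only_b


common, only_a, only_b = token_sets("fuzzy was a bear", "fuzzy fuzzy was a bear")
print(common)  # ['a', 'bear', 'fuzzy', 'was']
print(only_a)  # []
print(only_b)  # []
```

The duplicated "fuzzy" disappears once the words are treated as a set, which leaves nothing to distinguish the two strings.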


5. **Partial Token Sort Ratio**:

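The `fuzz.partial_token_sort_ratio()` function combines the two earlier ideas: it sorts the words of each string and then looks for the best partial match between the sorted results. It is useful when one string contains the other's words plus some extras, in any order.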

```python
from thefuzz import fuzz

string1 = "fuzzy was a bear"
string2 = "wuzzy fuzzy was a bear"

# Calculate partial token sort ratio using fuzz.partial_token_sort_ratio
partial_token_sort_ratio = fuzz.partial_token_sort_ratio(string1, string2)
print(f"Partial Token Sort Ratio: {partial_token_sort_ratio}")
```

```text
Partial Token Sort Ratio: 100
```
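One way to see why this example scores 100 is to sort the words of both strings and check whether the shorter result appears inside the longer one. The helper `sort_tokens` is hypothetical and only approximates the library's preprocessing.

```python
def sort_tokens(text):
    """Lowercase the text, split it into words, and rejoin the words in sorted order."""
    return " ".join(sorted(text.lower().split()))


short_sorted = sort_tokens("fuzzy was a bear")
long_sorted = sort_tokens("wuzzy fuzzy was a bear")

print(short_sorted)                 # a bear fuzzy was
print(long_sorted)                  # a bear fuzzy was wuzzy
print(short_sorted in long_sorted)  # True
```

The sorted shorter string is a verbatim substring of the sorted longer string, so a partial comparison finds a perfect match.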


6. **Process Module**:

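The `process` module ranks a list of choices against a query. `process.extract()` returns the top matches as `(choice, score)` tuples, up to the number given by `limit`, and `process.extractOne()` returns only the single best match.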

```python
from thefuzz import process

choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]

# Extract top matches using process.extract
top_matches = process.extract("new york jets", choices, limit=2)
print("Top Matches:", top_matches)

# Extract a single best match using process.extractOne
best_match = process.extractOne("cowboys", choices)
print("Best Match:", best_match)
```

```text
Top Matches: [('New York Jets', 100), ('New York Giants', 79)]
Best Match: ('Dallas Cowboys', 90)
```
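Conceptually, extracting top matches means scoring every choice and keeping the highest-scoring ones. The sketch below shows that pattern with the standard library; `simple_score` and `extract_top` are hypothetical names, and `difflib.SequenceMatcher` stands in for TheFuzz's scorer, so the scores it produces are not expected to match the output above.

```python
import heapq
from difflib import SequenceMatcher


def simple_score(query, choice):
    """Stand-in scorer: case-insensitive 0-100 similarity from difflib."""
    matcher = SequenceMatcher(None, query.lower(), choice.lower(), autojunk=False)
    return round(100 * matcher.ratio())


def extract_top(query, choices, limit=5, scorer=simple_score):
    """Score every choice against the query and return the `limit` best (choice, score) pairs."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])


choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
top_two = extract_top("new york jets", choices, limit=2)
print(top_two[0])  # ('New York Jets', 100)
print(top_two[1][0])
```

Passing the scorer as a parameter keeps the ranking logic separate from the similarity measure, so a different measure can be swapped in without touching the ranking code.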
