<a href="https://colab.research.google.com/github/hthomas229/PurpleCrown/blob/main/rapidfuzz_video.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RapidFuzz

## A fast string matching library for Python and C++, using string similarity calculations.

RapidFuzz is a high-performance Python library for fuzzy string matching, ideal for tasks where approximate string comparison is needed. Here are some compelling use cases:

---

### 🔍 Common Use Cases for RapidFuzz

#### 1. **Data Deduplication**
- Identify and merge duplicate records in datasets where names or addresses may be slightly different.
- Example: `"Jon Smith"` vs `"John Smith"`.

#### 2. **Record Linkage Across DataFrames**
- Match similar entries across two datasets, such as customer names or product titles.


#### 3. **Search Autocomplete and Suggestions**
- Improve user experience by suggesting close matches to user input.
- Example: Typing `"iphne"` returns `"iPhone"`.

#### 4. **Natural Language Processing**
- Compare user queries or text snippets for similarity in chatbots or recommendation engines.
- Useful for intent recognition or clustering similar queries.

#### 5. **Spell Checking and Correction**
- Detect and correct typos by comparing input against a dictionary of valid words.

#### 6. **Filtering and Ranking Search Results**
- Rank results based on similarity to the search query.



Install RapidFuzz

In [1]:
!pip install rapidfuzz

Collecting rapidfuzz
  Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages: rapidfuzz
Successfully installed rapidfuzz-3.13.0


Import the necessary libraries

In [2]:
from rapidfuzz import fuzz, process
import pandas as pd  # read the CSV into a DataFrame
import os  # check that the file path exists

# Basic Usage

Simple Fuzz Ratio

In [7]:
str1 = "Hello Newman!"
str2 = "Hello Jerry!"

ratio = fuzz.ratio(str1, str2)

ratio

64.0
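The 64.0 above can be reproduced by hand. Here is a minimal pure-Python sketch, assuming `fuzz.ratio` is the normalized Indel similarity (insertions and deletions only), which works out to `200 * LCS / (len(s1) + len(s2))`, where LCS is the length of the longest common subsequence. The helper names `lcs_length` and `indel_ratio` are made up for illustration. RapidFuzz itself uses a much faster native implementation.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(s1: str, s2: str) -> float:
    """Similarity on a 0-100 scale: 200 * LCS / (len(s1) + len(s2))."""
    total = len(s1) + len(s2)
    if total == 0:
        return 100.0  # this sketch treats two empty strings as identical
    return 200.0 * lcs_length(s1, s2) / total


# "Hello " plus "e" and "!" are shared: LCS = 8, combined length = 25
print(indel_ratio("Hello Newman!", "Hello Jerry!"))  # 64.0
```
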

Partial Ratio – Match Substrings

Useful when one string is contained in another.

In [8]:
str1 = "Jimmy likes Elaine"
str2 = "Jimmy"

ratio = fuzz.ratio(str1, str2)
ratio_partial = fuzz.partial_ratio(str1, str2)

ratio, ratio_partial

(43.47826086956522, 100.0)
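To see why the partial score jumps to 100, here is a simplified sketch of the idea: slide a window the length of the shorter string across the longer one and keep the best plain ratio. This is an assumption-laden approximation, not RapidFuzz's actual algorithm, which is more refined and can return different scores on other inputs. The names `indel_ratio` and `partial_ratio_sketch` are hypothetical.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence."""
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            curr.append(prev[j - 1] + 1 if ch1 == ch2 else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(s1: str, s2: str) -> float:
    """Plain ratio on a 0-100 scale (assumed normalized Indel similarity)."""
    total = len(s1) + len(s2)
    return 100.0 if total == 0 else 200.0 * lcs_length(s1, s2) / total


def partial_ratio_sketch(s1: str, s2: str) -> float:
    """Best plain ratio between the shorter string and any same-length window of the longer."""
    shorter, longer = sorted((s1, s2), key=len)
    if not shorter:
        # Edge-case choice made for this sketch only
        return 100.0 if not longer else 0.0
    width = len(shorter)
    return max(
        indel_ratio(shorter, longer[start:start + width])
        for start in range(len(longer) - width + 1)
    )


# The window "Jimmy" at the start of the longer string is an exact match
print(partial_ratio_sketch("Jimmy likes Elaine", "Jimmy"))  # 100.0
```
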

Token Sort Ratio – Ignore Word Order

In [11]:
str1 = "Kenny Rogers Roasters"
str2 = "Roasters Kenny Rogers"

ratio = fuzz.ratio(str1, str2)
ratio_partial = fuzz.partial_ratio(str1, str2)
ratio_token_sort = fuzz.token_sort_ratio(str1, str2)

ratio, ratio_partial, ratio_token_sort

(50.0, 72.72727272727273, 87.5)
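The idea behind ignoring word order can be sketched in a few lines: split each string on whitespace, sort the tokens, rejoin them, and only then compute the plain ratio. This is a rough pure-Python illustration with made-up helper names and made-up sample strings, assuming the plain ratio is the normalized Indel similarity. It is not RapidFuzz's implementation.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence."""
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            curr.append(prev[j - 1] + 1 if ch1 == ch2 else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(s1: str, s2: str) -> float:
    """Plain ratio on a 0-100 scale (assumed normalized Indel similarity)."""
    total = len(s1) + len(s2)
    return 100.0 if total == 0 else 200.0 * lcs_length(s1, s2) / total


def token_sort_ratio_sketch(s1: str, s2: str) -> float:
    """Sort the whitespace-separated tokens of each string, then compare."""
    sorted1 = " ".join(sorted(s1.split()))
    sorted2 = " ".join(sorted(s2.split()))
    return indel_ratio(sorted1, sorted2)


a = "fuzzy wuzzy was a bear"
b = "wuzzy fuzzy was a bear"
# Both strings sort to "a bear fuzzy was wuzzy", so the order difference disappears
print(token_sort_ratio_sketch(a, b))  # 100.0
```
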

Token Set Ratio – Match Common Words

In [12]:
str1 = "The Soup Nazi The"
str2 = "Nazi Soup the and more stuff"

ratio = fuzz.ratio(str1, str2)
ratio_partial = fuzz.partial_ratio(str1, str2)
ratio_token_sort = fuzz.token_sort_ratio(str1, str2)
ratio_token_set = fuzz.token_set_ratio(str1, str2)

ratio, ratio_partial, ratio_token_sort, ratio_token_set

(40.0, 53.333333333333336, 62.22222222222222, 81.81818181818181)
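The 81.82 above can be reproduced with the classic token-set approach: build the set of tokens in each string, then compare the sorted intersection against "intersection + leftover tokens" for each side and keep the best plain ratio. The sketch below assumes that approach and the normalized Indel similarity as the plain ratio. The helper names are hypothetical, and RapidFuzz's real code is optimized differently.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence."""
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            curr.append(prev[j - 1] + 1 if ch1 == ch2 else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(s1: str, s2: str) -> float:
    """Plain ratio on a 0-100 scale (assumed normalized Indel similarity)."""
    total = len(s1) + len(s2)
    return 100.0 if total == 0 else 200.0 * lcs_length(s1, s2) / total


def token_set_ratio_sketch(s1: str, s2: str) -> float:
    """Compare shared tokens against shared-plus-leftover tokens; keep the best score."""
    tokens1, tokens2 = set(s1.split()), set(s2.split())
    if not tokens1 or not tokens2:
        return 0.0  # edge-case choice made for this sketch
    shared = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = f"{shared} {only1}".strip()
    combined2 = f"{shared} {only2}".strip()
    return max(
        indel_ratio(shared, combined1),
        indel_ratio(shared, combined2),
        indel_ratio(combined1, combined2),
    )


# Shared tokens: "Nazi Soup"; "The" and "the" differ because the comparison is case-sensitive
score = token_set_ratio_sketch("The Soup Nazi The", "Nazi Soup the and more stuff")
print(round(score, 2))  # 81.82
```
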

Process Extract - to find best matches

In [13]:
seinfeld_quotes = [
    "No soup for you!",
    "These pretzels are making me thirsty.",
    "Serenity now!",
    "Yada yada yada.",
    "George likes his chicken very spicy!",
    "The sea was angry that day, my friends."
]


user_query = "very thirsty"

multiple_matches = process.extract(user_query, seinfeld_quotes)

if multiple_matches:
  print(f"Top matches for '{user_query}': ")
  for match, score, index in multiple_matches:
    print(f"- {match} (Score: {score:.2f})")
else:
  print("No matches found")

Top matches for 'very thirsty': 
- George likes his chicken very spicy! (Score: 85.50)
- These pretzels are making me thirsty. (Score: 73.64)
- The sea was angry that day, my friends. (Score: 52.50)
- Serenity now! (Score: 40.00)
- No soup for you! (Score: 27.14)


Process Extract One - to find best match

In [16]:
from rapidfuzz import process

seinfeld_quotes = [
    "No soup for you!",
    "These pretzels are making me thirsty.",
    "Serenity now!",
    "Yada yada yada.",
    "George likes his chicken very spicy!",
    "The sea was angry that day, my friends."
]

user_query = "very thirsty"

best_match = process.extractOne(user_query, seinfeld_quotes)

if best_match:
    print(f"Best match: {best_match[0]} (Score: {best_match[1]:.2f})")
else:
    print("No match found.")

Best match: George likes his chicken very spicy! (Score: 85.50)


## Advanced: Choose Scorer, Item Limits & Cutoff Score

1.   Choose scoring method
2.   Limit items returned
3.   Score-cutoff
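Before trying each option, it may help to see how the three interact. The sketch below is a plain-Python stand-in for the ranking step, not RapidFuzz's code: it scores every choice with a pluggable scorer, drops anything below the cutoff, sorts by score, and truncates to the limit. `extract_sketch` and `simple_scorer` are hypothetical names, and `difflib.SequenceMatcher` is used only as a stand-in scorer (RapidFuzz's default scorer is `WRatio`, so real scores will differ).

```python
from difflib import SequenceMatcher


def simple_scorer(s1: str, s2: str) -> float:
    """Stand-in scorer on a 0-100 scale (not one of RapidFuzz's scorers)."""
    return 100.0 * SequenceMatcher(None, s1, s2).ratio()


def extract_sketch(query, choices, scorer=simple_scorer, limit=5, score_cutoff=None):
    """Return (choice, score, index) tuples, best first."""
    scored = [
        (choice, scorer(query, choice), index)
        for index, choice in enumerate(choices)
    ]
    if score_cutoff is not None:
        # Keep only results that reach the cutoff
        scored = [item for item in scored if item[1] >= score_cutoff]
    # Stable sort: ties keep their original order
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored if limit is None else scored[:limit]


quotes = ["No soup for you!", "Serenity now!", "Yada yada yada."]

# 1. Swap the scorer, 2. limit the number of results, 3. apply a cutoff
by_length = extract_sketch("x", quotes, scorer=lambda q, c: float(len(c)), limit=2)
exact_only = extract_sketch("Serenity now!", quotes, score_cutoff=100)
print(by_length)
print(exact_only)
```

The index in each tuple is the position of the choice in the original list, which is what makes it possible to look up the matching row later.
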


Scorer

In [18]:


seinfeld_quotes = [
    "No soup for you!",
    "These pretzels are making me thirsty.",
    "Serenity now!",
    "Yada yada yada.",
    "George likes his chicken very spicy!",
    "The sea was angry that day, my friends."
]

user_query = "very thirsty"

# Using partial_ratio as the custom scorer
best_match_partial = process.extractOne(user_query, seinfeld_quotes, scorer=fuzz.partial_ratio)

if best_match_partial:
    print(f"Best match using partial_ratio: {best_match_partial[0]} (Score: {best_match_partial[1]:.2f})")
else:
    print("No match found.")

Best match using partial_ratio: These pretzels are making me thirsty. (Score: 81.82)


Limit Items Returned

In [19]:


seinfeld_quotes = [
    "No soup for you!",
    "These pretzels are making me thirsty.",
    "Serenity now!",
    "Yada yada yada.",
    "The sea was angry that day, my friends.",
    "Jerry, these are load bearing walls!",
    "The Contest."
]

user_query = "the"

# Get up to 3 best matches
multiple_matches_limited = process.extract(user_query, seinfeld_quotes, limit=3)

if multiple_matches_limited:
    print(f"Top {len(multiple_matches_limited)} matches for '{user_query}':")
    for match, score, index in multiple_matches_limited:
        print(f"- {match} (Score: {score:.2f})")
else:
    print("No matches found.")



Top 3 matches for 'the':
- The Contest. (Score: 68.40)
- Jerry, these are load bearing walls! (Score: 60.00)
- These pretzels are making me thirsty. (Score: 40.00)


Score Cutoff

In [21]:


seinfeld_quotes = [
    "No soup for you!",
    "These pretzels are making me thirsty.",
    "Serenity now!",
    "George likes his chicken very spicy.",
    "Yada yada yada.",
    "The sea was angry that day, my friends."
]

user_query = "very thirsty"

# Set a score cutoff of 70
best_match_cutoff = process.extract(user_query, seinfeld_quotes, score_cutoff=70)

if best_match_cutoff:
    print(f"Matches found above the cutoff score for '{user_query}':")
    for match, score, index in best_match_cutoff:
        print(f"- {match} (Score: {score:.2f})")
else:
    print(f"No match found above the cutoff score for '{user_query}'.")



Matches found above the cutoff score for 'very thirsty':
- George likes his chicken very spicy. (Score: 85.50)
- These pretzels are making me thirsty. (Score: 73.64)


WRatio & QRatio

In [23]:
str1 = "These pretzels are making me thirsty."
str2 = "are making me thirsty these pretels. "


# Compare with a simple ratio
ratio_score = fuzz.ratio(str1, str2)
print(f"Ratio score: {ratio_score:.2f}")


# Compare with a partial ratio
partial_ratio_score = fuzz.partial_ratio(str1, str2)
print(f"Partial Ratio score: {partial_ratio_score:.2f}")

# Using WRatio which incorporates weighting
wratio_score = fuzz.WRatio(str1, str2)
print(f"WRatio score: {wratio_score:.2f}")


# Using QRatio for comparison
qratio_score = fuzz.QRatio(str1, str2)
print(f"QRatio score: {qratio_score:.2f}")

Ratio score: 59.46
Partial Ratio score: 74.58
WRatio score: 75.48
QRatio score: 59.46


# Search Database From User Input

Store the CSV file path in a variable


In [24]:
CSV_FILE = '/content/product.csv'

Make sure the file exists, using the `os` library

In [25]:
if not os.path.exists(CSV_FILE):
  print(f"{CSV_FILE} not found. Please make sure it exists.")

Load the CSV into a pandas DataFrame

1.   Read the file into a DataFrame
2.   Check for errors



In [28]:
try:
  df = pd.read_csv(CSV_FILE)
except Exception as e:
  print(f"✖️ Error reading CSV: {e}")

View the first rows

In [29]:
df.head(20)

Unnamed: 0,Product ID,Product Name,Category,Price,Stock,Supplier,Date Added
0,P001,Wireless Earbuds,Electronics,59.99,150,SoundWave Inc,2023-01-10
1,P002,Bluetooth Speaker,Electronics,89.5,75,AudioMax Ltd,2023-02-15
2,P003,Smart Watch,Wearables,199.99,40,TimeTech Corp,2023-03-05
3,P004,Yoga Mat,Fitness,25.0,200,FitLife Co,2023-01-22
4,P005,Stainless Steel Bottle,Accessories,18.75,300,EcoGear Ltd,2023-04-10
5,P006,Laptop Stand,Electronics,45.99,80,DeskPro Solutions,2023-05-18
6,P007,Noise-Canceling Headphones,Electronics,249.95,35,SilentAir Inc,2023-06-01
7,P008,Resistance Bands,Fitness,15.5,250,FitLife Co,2023-02-28
8,P009,Portable Charger,Electronics,34.99,180,PowerGo Ltd,2023-03-14
9,P010,Smart Light Bulb,Home,12.99,400,BrightHome Inc,2023-07-22


Make sure the required column exists

In [30]:
if 'Product Name' not in df.columns:
  print(f"Column not found. Columns are:  {list(df.columns)}")
  exit()

Extract the list of product names

In [34]:
product_names = df['Product Name'].tolist()
product_names

['Wireless Earbuds',
 'Bluetooth Speaker',
 'Smart Watch',
 'Yoga Mat',
 'Stainless Steel Bottle',
 'Laptop Stand',
 'Noise-Canceling Headphones',
 'Resistance Bands',
 'Portable Charger',
 'Smart Light Bulb',
 'Waterproof Backpack',
 'Electric Toothbrush',
 'Running Shoes',
 'Protein Powder',
 'Wireless Mouse',
 'Yoga Block',
 'Smart Thermostat',
 'Desk Lamp',
 'USB-C Hub',
 'Folding Chair',
 'Action Camera',
 'Sleep Mask',
 'Insulated Mug',
 'Wireless Keyboard',
 'Fitness Tracker',
 'Back Support Cushion',
 'Smart Doorbell',
 'Electric Kettle',
 'Phone Grip',
 'LED Strip Lights',
 'Water Flosser',
 'Mesh Running Shorts',
 'Smart Scale',
 'Neck Pillow',
 'Portable SSD',
 'Desk Organizer',
 'Metal Water Bottle',
 'Wireless Charging Pad',
 'Jump Rope',
 'Smart Plug',
 'Bluetooth Tracker',
 'Cooling Towel',
 'LED Desk Lamp',
 'Wireless Earbuds Pro',
 'Posture Corrector',
 'Travel Backpack',
 'Smart Ring',
 'Gaming Mouse',
 'Resistance Tube Set',
 'Electric Fan']

Get user search input string

In [76]:
user_input = input('Search products by name:  ').strip()

Search products by name:  smrt


Check for no input

In [77]:
if not user_input:
  print("No input provided.")
  exit()

Find the best match

We'll use a score cutoff of 60; adjust higher for stricter matching.

In [78]:
result = process.extractOne(user_input, product_names, score_cutoff=60)

In [79]:
if result:
  best_match, score, index = result
  matched_product = df.iloc[index]
  print("\n☑️  Best Match Found:  ")
  print(f" Product:  {matched_product['Product Name']}")
  print(f" ID:  {matched_product['Product ID']}")
  print(f" Category:  {matched_product['Category']}")
  print(f" Price:  {matched_product['Price']}")
  print(f" Stock:  {matched_product['Stock']}")
  print(f" Supplier:  {matched_product['Supplier']}")

else:
  print(f"\n✖️ No close match found for '{user_input}'")



☑️  Best Match Found:  
 Product:  Smart Watch
 ID:  P003
 Category:  Wearables
 Price:  199.99
 Stock:  40
 Supplier:  TimeTech Corp
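The lookup above works because the index returned with the match is the position of the name in `product_names`, and that list was built in the same order as the DataFrame rows. The same pattern can be sketched without pandas or RapidFuzz: the rows below are a small hand-copied sample from the table, `find_product` and `score` are hypothetical helpers, and `difflib.SequenceMatcher` is only a stand-in scorer, so its scores will not match RapidFuzz's.

```python
from difflib import SequenceMatcher

# Small sample of rows, in the same order as the names list built from them
products = [
    {"Product ID": "P001", "Product Name": "Wireless Earbuds", "Price": 59.99},
    {"Product ID": "P003", "Product Name": "Smart Watch", "Price": 199.99},
    {"Product ID": "P004", "Product Name": "Yoga Mat", "Price": 25.0},
]


def score(query: str, choice: str) -> float:
    """Case-insensitive stand-in scorer on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, query.lower(), choice.lower()).ratio()


def find_product(query, rows, score_cutoff=60):
    """Return (row, score) for the best-scoring name, or None if below the cutoff."""
    names = [row["Product Name"] for row in rows]
    if not names:
        return None
    best_index = max(range(len(names)), key=lambda i: score(query, names[i]))
    best_score = score(query, names[best_index])
    if best_score < score_cutoff:
        return None
    # The position in `names` is also the position of the full record in `rows`
    return rows[best_index], best_score


match = find_product("smart wach", products)
if match:
    row, match_score = match
    print(f"{row['Product Name']} ({row['Product ID']}), score {match_score:.1f}")
else:
    print("No close match found.")
```

If the list of names is ever filtered or reordered separately from the rows, the positions stop lining up, so it is safest to build both from the same source in one step.
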


List other suggestions

In [81]:
suggestions = process.extract(user_input, product_names, limit=3)
if suggestions:
  print("\n Did you possibly mean?")
  for sugg, sugg_score, _ in suggestions:
    print(f" {sugg} (similarity: {sugg_score:.1f})")


 Did you possibly mean?
 Smart Watch (similarity: 67.5)
 Smart Light Bulb (similarity: 67.5)
 Smart Thermostat (similarity: 67.5)
