# **Product Identification with Semantic Search and Levenshtein Distance**



## **Intro**

This notebook is a proof of concept: a way to quantify how closely a supplied product identification number matches entries in a large dataset of known products.

The solution should:

1. Return a list of possible "hits" along with a metric showing how close each match is.
2. Scale to millions of products.
3. Be fast.
4. Be free.
5. Be easy to maintain.

What this is not:

1. An evaluation of different vector databases, index types, or semantic similarity models. Additional work would be needed to assess their impact.
2. A solution for fluid, dynamic product lists that change constantly. It is intended for lists that are updated no more than once per day.

What you'll need:

A Hugging Face token.

I suggest running the notebook in Colab. If you run it elsewhere, set the environment variable HF_TOKEN to your Hugging Face authentication token.

I also suggest changing the runtime type to GPU, although this is not required.

### Faker

A great library for generating test data. It is easy to use, fast, and has numerous providers for domain-specific test data (e.g., vehicles, internet).

See: https://github.com/joke2k/faker

### Hugging Face

One of the best resources for machine learning models, datasets, and tools of all types. This notebook uses the sentence-transformers library with a model trained specifically for sentence similarity tasks. Publishing your dataset on Hugging Face is easy and convenient, and makes your data available to the development community.

See: https://huggingface.co/

### Semantic Search
Convert words (or text) to numeric representations (embeddings) using a trained NLP model, then use a mathematical function (a distance or similarity measure) to find other words or texts whose embeddings lie close to that of your search term.

See: https://huggingface.co/learn/nlp-course/en/chapter5/6
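As a rough illustration of the "find what is nearby" step, the sketch below ranks a few toy vectors by cosine similarity to a query vector. The labels and three-dimensional values are invented for illustration and are not produced by any model; the embeddings generated later in this notebook have 384 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (made-up values, not real model output)
toy_embeddings = {
    "tractor": [0.9, 0.1, 0.0],
    "loader": [0.7, 0.3, 0.1],
    "banana": [0.0, 0.2, 0.9],
}

def nearest(query_vector, embeddings, k=2):
    """Return the k labels whose vectors are most similar to the query."""
    ranked = sorted(
        embeddings,
        key=lambda label: cosine_similarity(query_vector, embeddings[label]),
        reverse=True,
    )
    return ranked[:k]

query = [0.8, 0.2, 0.0]
print(nearest(query, toy_embeddings))
```

A vector index such as FAISS performs the same kind of nearest-neighbour lookup, but is designed to do so efficiently over very large collections.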

### Levenshtein Distance
Returns a numeric value representing the "distance" between two strings: the minimum number of single-character edits (insertions, deletions, or substitutions) needed to make the strings identical.

For example,

```
String 1:    ABC123
String 2:    BBC123
             ______
To Change:   1-----
Distance:    1
Change 1 character, "B" to "A".

String 1:    ABC123
String 2:    123ABC
             ______
To Change:   111111
Distance:    6             
Change 6 characters "1" to "A", "2" to "B", etc.
```             
See: https://github.com/ztane/python-Levenshtein
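The notebook relies on the python-Levenshtein library for this calculation. Purely to show what is being computed, here is a minimal pure-Python sketch of the standard dynamic-programming approach; the function name `levenshtein` is chosen for this example and is not part of that library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]

print(levenshtein("ABC123", "BBC123"))  # 1
print(levenshtein("ABC123", "123ABC"))  # 6
```

The two calls reproduce the worked examples above.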

### FAISS Index

FAISS makes it efficient and easy to index embedded data, and provides a variety of index types and algorithms for similarity search.

See: https://faiss.ai/

### Combine Semantic Search for Speed with Levenshtein Distance for Accuracy

Calculating Levenshtein distance against every product is time-consuming and bogs down at scale. If you first build a vector index of your model numbers, you can use semantic search to pull out a small set of likely matches, then calculate the distance only for that subset and report the matches that need the fewest character changes.
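The two-stage idea can be sketched without any external libraries. In the sketch below, stage 1 uses shared character bigrams as a crude stand-in for the embedding search (an assumption made only to keep the example self-contained; the notebook itself uses sentence embeddings and FAISS), and stage 2 re-ranks the shortlist by edit distance. The helper names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]

def bigrams(text):
    """Set of adjacent character pairs, a crude stand-in for an embedding."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def shortlist(query, catalog, k):
    """Stage 1 (cheap): keep the k entries sharing the most bigrams with the query."""
    query_bigrams = bigrams(query)
    return sorted(
        catalog,
        key=lambda item: len(query_bigrams & bigrams(item)),
        reverse=True,
    )[:k]

def search(query, catalog, k=3):
    """Stage 2 (exact): re-rank only the shortlist by edit distance."""
    return sorted((levenshtein(query, item), item) for item in shortlist(query, catalog, k))

# A handful of model_search values taken from the sample data in this notebook
catalog = ["9570RT", "R40441789", "S6901089", "DHS745", "L4240HSTC", "962L1700"]
results = search("X957ORT", catalog)
print(results[0])  # (2, '9570RT')
```

The expensive distance calculation runs on only `k` candidates rather than the whole catalog, which is the point of the combination.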


## **Constants**

In [None]:
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
COLUMN_TO_EMBED = "model_search"
EMBEDDING_COLUMN = "model_search_embeddings"
HF_DATASET = "blade57/ModelNumbers4Searching_Full"
FAISS_INDEX = "model_search.faiss"

## **Prepare Dataset**

**Database schema:**
- brand: Faker field, manufacturer's name.
- model_number: Faker field unique to brand.
- model_name: Faker field, model description.
- year: Faker field.
- randomdata: int from 1000-2000, appended to model_number when the Faker value is shorter than 6 characters.
- model_search: Based on "cleaned" version of model_number. Used for creating model number embeddings.
- model_search_embeddings: Generated embeddings of model_search (not stored in persisted dataset).

### **Faker**

I took the generated fake model numbers and created a search version ('model_search') by removing spaces and unwanted characters and converting everything to upper case.




In [None]:
# install dependencies
!pip install -q Faker faker-vehicle

In [None]:
# create Faker object and add vehicle provider
from faker import Faker
from faker_vehicle import VehicleProvider

fake = Faker()
fake.add_provider(VehicleProvider)

In [None]:
# function to generate fake data
import re
import random

def remove_junk(search_term: str):
  """
  Removes unwanted characters from a search term.

  Args:
    search_term: The search term to be cleaned.

  Returns:
    The cleaned search term with unwanted characters removed.
  """
  # remove blank space
  results = search_term.replace(' ','')
  # remove unnecessary characters
  results = re.sub(r'[/\+\-_=~*%$#@!"(){}]', '', results)
  # upper case
  return results.upper()

def create_rows_faker(num: int=1):
  """
  Creates a list of rows with fake data.

  Args:
    num (int): The number of rows to create.

  Returns:
    list: A list of dictionaries containing fake data.
  """
  return_set = []
  for x in range(num):
    randomdata = random.randint(1000,2000)
    model_number = fake.machine_model()
    # if model number is less than 6 characters, add randomdata, adjust as desired
    if len(model_number) < 6:
      model_number += str(randomdata)
    return_set.append({"brand":fake.machine_make(),
                   "model_number":model_number,
                   "model_name":fake.machine_category(),
                   "year":fake.machine_year(),
                   "randomdata":randomdata,
                   "model_search":remove_junk(model_number)
                       })
  return return_set


In [None]:
# generate fake data
import pandas as pd

number_of_sample_rows = 500
df_faker = pd.DataFrame(create_rows_faker(number_of_sample_rows))

# uncomment to save if you wish
#df_faker.to_csv('Test_Data.csv', index=False)

print(df_faker.info())
df_faker.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   brand         500 non-null    object
 1   model_number  500 non-null    object
 2   model_name    500 non-null    object
 3   year          500 non-null    object
 4   randomdata    500 non-null    int64 
 5   model_search  500 non-null    object
dtypes: int64(1), object(5)
memory usage: 23.6+ KB
None


| | brand | model_number | model_name | year | randomdata | model_search |
|---|---|---|---|---|---|---|
| 0 | Fendt | CC232HF | Wheel Loader | 2017 | 1781 | CC232HF |
| 1 | Caterpillar | ED160-5 BLADE RUNNER | Wheel Loader | 2009 | 1978 | ED1605BLADERUNNER |
| 2 | Sumitomo | L220G1194 | Wheel Loader | 2008 | 1194 | L220G1194 |
| 3 | Hitachi | R914 COMPACT LITRONIC | Loader Backhoe | 2011 | 1534 | R914COMPACTLITRONIC |
| 4 | AGCO | H2200B | Hydraulic Excavator | 2016 | 1459 | H2200B |


### Load from Hugging Face Hub

Import the generated Faker data from Hugging Face using their Datasets library.

In [None]:
import os
from google.colab import userdata

os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')

In [None]:
# install dependencies for using Hugging Face datasets
!pip install -q datasets

In [None]:
from datasets import load_dataset

# load dataset
ds = load_dataset(HF_DATASET, split='train')


Downloading readme:   0%|          | 0.00/629 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data:   0%|          | 0.00/2.69M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [None]:
# dataset info
print(ds)

Dataset({
    features: ['brand', 'model_number', 'model_name', 'year', 'randomdata', 'model_search'],
    num_rows: 50000
})


In [None]:
# single record dictionary
print(ds[0])

{'brand': 'Landini', 'model_number': 'L4240HSTC', 'model_name': 'Hydraulic Excavator', 'year': 2017, 'randomdata': 1439, 'model_search': 'L4240HSTC'}


In [None]:
# load dataset into a dataframe
import pandas as pd

df_hf = ds.to_pandas()

print(f'Rows: {len(df_hf)}')
df_hf.head()


Rows: 50000


| | brand | model_number | model_name | year | randomdata | model_search |
|---|---|---|---|---|---|---|
| 0 | Landini | L4240HSTC | Hydraulic Excavator | 2017 | 1439 | L4240HSTC |
| 1 | John Deere | LS1401203 | 4WD Tractor | 2007 | 1203 | LS1401203 |
| 2 | Volvo | R40441789 | Wheel Loader | 2017 | 1789 | R40441789 |
| 3 | Volvo | Lexion 520 | 4WD Tractor | 2012 | 1415 | Lexion520 |
| 4 | Caterpillar | 9570RT | 2WD Tractor | 2005 | 1531 | 9570RT |


In [None]:
# remove duplicate rows from df_hf -- I haven't found any

df_hf = df_hf.drop_duplicates()
print(f'Rows: {len(df_hf)}')

Rows: 50000


### Load from Repo

Use this to load a copy of the test data and the FAISS index from a repo. The files are stored in a local directory, data. I include a CSV version of the dataset (the same data as on Hugging Face).

In [None]:
# clone repo
import os
from pathlib import Path

data_path = Path("data/")

if data_path.is_dir():
  print("No need to clone repo")
else:
  !git clone https://github.com/nicholassolomon/ModelNumberSearch.git
  data_path.mkdir(parents=True, exist_ok=True)
  !mv ModelNumberSearch/Data/*.* data
  !rm -rf ModelNumberSearch

Cloning into 'ModelNumberSearch'...
remote: Enumerating objects: 21, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 21 (delta 1), reused 8 (delta 1), pack-reused 12
Receiving objects: 100% (21/21), 68.66 MiB | 20.89 MiB/s, done.
Resolving deltas: 100% (1/1), done.


## **Embedding with Hugging Face**

### Load Embedding Model

- https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

In [None]:
# install dependencies
!pip install -q sentence-transformers

In [None]:
# Load Model and Embedding Function
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(EMBEDDING_MODEL)

In [None]:
# embedding function
def create_embeddings(text):
  """
  Creates an embedding from a given text using the model

  Args:
    text: The text to be embedded.

  Returns:
    A NumPy array of shape (1, embedding_dim) containing the embedding of the text.
  """
  embeddings = model.encode([text])
  return embeddings



In [None]:
# create embeddings
from datasets import load_dataset

ds = load_dataset(HF_DATASET, split='train')

# for test purposes, cut dataset down to 50 rows
ds_train_small = ds.select(range(50))

# run embedding function against dataset and save embedding to new column
ds_with_embeddings = ds_train_small.map(lambda example: {EMBEDDING_COLUMN: create_embeddings(example[COLUMN_TO_EMBED])[0]})

Repo card metadata block was not found. Setting CardData to empty.


Map:   0%|          | 0/50 [00:00<?, ? examples/s]

In [None]:
# examine dataset with embedding column
ds_with_embeddings

Dataset({
    features: ['brand', 'model_number', 'model_name', 'year', 'randomdata', 'model_search', 'model_search_embeddings'],
    num_rows: 50
})

In [None]:
# examine embedded row

embedded_model = ds_with_embeddings[0][EMBEDDING_COLUMN]
print(f'Type: {type(embedded_model)}')
print(f'Length: {len(embedded_model)}')
print(f'Slice: {embedded_model[:5]}')

Type: <class 'list'>
Length: 384
Slice: [-0.019833004102110863, 0.03396640717983246, -0.010078956373035908, -0.027987472712993622, -0.016674449667334557]


## **FAISS for Semantic Searches**

Note that I added the model_search embeddings to a new dataset column, model_search_embeddings. This column is used to create the FAISS index, which is then saved to a separate file. Once that is done, there is no need to regenerate the embeddings or to store them in the dataset: if you load the original dataset and the saved index, semantic searching works, provided the dataset rows are in the same order as when the index was built. The FAISS library uses the index for searching!
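To see why the embeddings do not have to live in the dataset, it may help to look at a toy version of what an exact nearest-neighbour index does. The class below is a hypothetical brute-force stand-in written for this example (it is not the FAISS API), and it assumes squared L2 distance as the metric: the index keeps its own copy of the vectors and returns row positions, which are then used to look up the original rows.

```python
class TinyFlatIndex:
    """Hypothetical brute-force stand-in for an exact L2 index (not the FAISS API)."""

    def __init__(self):
        self._vectors = []

    def add(self, vectors):
        """Store a copy of each vector; its position becomes its row id."""
        self._vectors.extend(list(v) for v in vectors)

    def search(self, query, k):
        """Return (distances, row_ids) of the k stored vectors closest to query."""
        scored = sorted(
            (sum((q - x) ** 2 for q, x in zip(query, vec)), row_id)
            for row_id, vec in enumerate(self._vectors)
        )[:k]
        return [d for d, _ in scored], [i for _, i in scored]

# The "dataset" holds only the original columns, no embeddings.
rows = [{"model_search": "9570RT"}, {"model_search": "DHS745"}, {"model_search": "S6901089"}]
# Made-up 2-D vectors standing in for real embeddings, listed in row order.
vectors = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]

index = TinyFlatIndex()
index.add(vectors)

distances, row_ids = index.search([0.0, 1.0], k=2)
matches = [rows[i]["model_search"] for i in row_ids]
print(matches)
```

Because the lookup goes through row positions, the dataset and the index must stay aligned: reordering or filtering the rows after the index is built would map results to the wrong products.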

*You'll need to restart your session after running the install of faiss-`x`pu.*


In [None]:
# install dependencies
import torch

!sudo apt-get install libomp-dev

# determine cpu or gpu availability
if torch.cuda.is_available():
  !pip install -q faiss-gpu
  !nvidia-smi
else:
  !pip install -q faiss-cpu

### **RESTART SESSION**

In [None]:
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
COLUMN_TO_EMBED = "model_search"
EMBEDDING_COLUMN = "model_search_embeddings"
HF_DATASET = "blade57/ModelNumbers4Searching_Full"
FAISS_INDEX = "model_search.faiss"

In [None]:
# Load Model and Embedding Function
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(EMBEDDING_MODEL)



In [None]:
# embedding function
def create_embeddings(text):
  """
  Creates an embedding from a given text using the model

  Args:
    text: The text to be embedded.

  Returns:
    A NumPy array of shape (1, embedding_dim) containing the embedding of the text.
  """
  embeddings = model.encode([text])
  return embeddings



In [None]:
# load fresh copy of data from Hugging Face
from datasets import load_dataset

# load dataset
ds = load_dataset(HF_DATASET, split='train')

# for test purposes, cut dataset down to 50 rows
ds = ds.select(range(50))

# create embeddings
ds = ds.map(lambda example: {EMBEDDING_COLUMN: create_embeddings(example[COLUMN_TO_EMBED])[0]})

ds

Repo card metadata block was not found. Setting CardData to empty.


Dataset({
    features: ['brand', 'model_number', 'model_name', 'year', 'randomdata', 'model_search', 'model_search_embeddings'],
    num_rows: 50
})

In [None]:
# create index on embedding column model_search_embeddings

ds.add_faiss_index(column=EMBEDDING_COLUMN)

  0%|          | 0/1 [00:00<?, ?it/s]

Dataset({
    features: ['brand', 'model_number', 'model_name', 'year', 'randomdata', 'model_search', 'model_search_embeddings'],
    num_rows: 50
})

In [None]:
# save faiss index

# first argument is the index name (the embedding column), second is the output file
ds.save_faiss_index(EMBEDDING_COLUMN,
                    FAISS_INDEX)

## **Basic Searching using FAISSIndex and Hugging Face Datasets**

The Hugging Face Datasets library works well with FAISS indexes.

- **add_faiss_index()**:  add a dense index to a dataset.
- **load_faiss_index()**:  load a FaissIndex from disk.
- **save_faiss_index()**:  save a FaissIndex to disk.
- **get_nearest_examples()**: find nearest examples based on the query.

There are many other methods; see the documentation for more information.

See: https://huggingface.co/docs/datasets/v1.5.0/package_reference/main_classes.html

In [None]:
# search functions
import pandas as pd
import re

def remove_junk(search_term: str):
  """
  Removes unwanted characters from a search term.

  Args:
    search_term: The search term to be cleaned.

  Returns:
    The cleaned search term with unwanted characters removed.
  """
  # remove blank space
  results = search_term.replace(' ','')
  # remove unnecessary characters
  results = re.sub(r'[/\+\-_=~*%$#@!"(){}]', '', results)
  # upper case
  return results.upper()

def query_df(ds_with_faiss, search_text, return_no=10):
  """
  Queries the dataset for the most similar model numbers to the search text using the SentenceTransformers model and the Faiss index.
  Returns the results in a pandas dataframe.

  Args:
    ds_with_faiss: Dataset with a FAISS index attached.
    search_text: The text to be used for the search.
    return_no: The number of results to return (defaults to 10).

  Returns:
    A tuple containing a pandas dataframe of the results, the scores, and the raw search results.
  """
  search = remove_junk(search_text)
  search_embedding = create_embeddings(search)
  scores, search_results = ds_with_faiss.get_nearest_examples(EMBEDDING_COLUMN,
                                                   search_embedding,
                                                   k=return_no)
  results = pd.DataFrame({
    'scores': scores,
    'model_search': search_results['model_search'],
    'model_number': search_results['model_number'],
    'model_name': search_results['model_name'],
    'brand': search_results['brand'],
    'search_for': search
  })
  # sort by scores
  results = results.sort_values(by=['scores'], ascending=True)
  return results, scores, search_results


In [None]:
# load a new copy of dataset and load faiss saved index
ds_new_copy = load_dataset(HF_DATASET, split='train')

# for test purposes, cut dataset down to 50 rows
ds_new_copy = ds_new_copy.select(range(50))

# load FAISS index for dataset
ds_new_copy.load_faiss_index(EMBEDDING_COLUMN, FAISS_INDEX)


Repo card metadata block was not found. Setting CardData to empty.


In [None]:
# process results into df and result sets -- experiment with variations!

search_for = 'x957Ort'  # the actual model is 9570RT

# rows to return
rows = 10

result_df, scores, results = query_df(ds_new_copy, search_for, rows)

result_df.head(rows)


| | scores | model_search | model_number | model_name | brand | search_for |
|---|---|---|---|---|---|---|
| 0 | 0.91859 | S6901089 | S6901089 | 4WD Tractor | Washburn | X957ORT |
| 1 | 0.933182 | R40441789 | R40441789 | Wheel Loader | Volvo | X957ORT |
| 2 | 0.939604 | 450E6415 | 450E/6415 | Hydraulic Excavator | New Holland | X957ORT |
| 3 | 1.028025 | 9570RT | 9570RT | 2WD Tractor | Caterpillar | X957ORT |
| 4 | 1.028831 | 962L1700 | 962L1700 | Combine | Case IH | X957ORT |
| 5 | 1.084639 | DHS745 | DHS745 | 4WD Tractor | Caterpillar | X957ORT |
| 6 | 1.085883 | L4240HSTC | L4240HSTC | Hydraulic Excavator | Landini | X957ORT |
| 7 | 1.098147 | S6501126 | S6501126 | Disc | Mecalac | X957ORT |
| 8 | 1.103843 | CS1421748 | CS1421748 | Combine | Vibromax | X957ORT |
| 9 | 1.111076 | 5325Utility | 5325 Utility | 4WD Tractor | Caterpillar | X957ORT |


## **Applying Levenshtein Distance**

In [None]:
!pip install -q python-Levenshtein

In [None]:
# code
from Levenshtein import distance

def get_ls_rank(search1, search2):
  """
  Calculates the Levenshtein distance between two strings.

  Args:
    search1: The first string.
    search2: The second string.

  Returns:
    The Levenshtein distance between the two strings.
  """
  return distance(s1=str(search1).upper(),
                  s2=str(search2).upper()
                  )

In [None]:
# using dataframe from prior section, get Levenshtein distance --add as new column

result_df['LS_rank'] = result_df['model_search'].apply(lambda x: get_ls_rank(search_for, x))
# resort df by LS ranking
result_df = result_df.sort_values(by=['LS_rank'], ascending=True)

result_df.head(rows)

| | scores | model_search | model_number | model_name | brand | search_for | LS_rank |
|---|---|---|---|---|---|---|---|
| 3 | 1.028025 | 9570RT | 9570RT | 2WD Tractor | Caterpillar | X957ORT | 2 |
| 5 | 1.084639 | DHS745 | DHS745 | 4WD Tractor | Caterpillar | X957ORT | 6 |
| 7 | 1.098147 | S6501126 | S6501126 | Disc | Mecalac | X957ORT | 7 |
| 0 | 0.91859 | S6901089 | S6901089 | 4WD Tractor | Washburn | X957ORT | 7 |
| 4 | 1.028831 | 962L1700 | 962L1700 | Combine | Case IH | X957ORT | 8 |
| 2 | 0.939604 | 450E6415 | 450E/6415 | Hydraulic Excavator | New Holland | X957ORT | 8 |
| 6 | 1.085883 | L4240HSTC | L4240HSTC | Hydraulic Excavator | Landini | X957ORT | 8 |
| 1 | 0.933182 | R40441789 | R40441789 | Wheel Loader | Volvo | X957ORT | 9 |
| 8 | 1.103843 | CS1421748 | CS1421748 | Combine | Vibromax | X957ORT | 9 |
| 9 | 1.111076 | 5325Utility | 5325 Utility | 4WD Tractor | Caterpillar | X957ORT | 9 |
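Several candidates in the table above share the same LS_rank, and sorting on that column alone leaves their relative order unspecified. One possible refinement, sketched here on a few rows copied from that table, is to break ties with the vector score. This assumes a lower score means a closer match, as the ascending sort in `query_df` suggests.

```python
# (score, model_search, LS_rank) triples copied from the result table above
candidates = [
    (1.098147, "S6501126", 7),
    (0.918590, "S6901089", 7),
    (1.028025, "9570RT", 2),
    (1.084639, "DHS745", 6),
]

# Sort by edit distance first, then by vector score (lower is closer for both)
ranked = sorted(candidates, key=lambda row: (row[2], row[0]))

for score, model_search, ls_rank in ranked:
    print(ls_rank, score, model_search)
```

With a pandas dataframe the equivalent would be to pass both column names to `sort_values`, for example `by=['LS_rank', 'scores']`.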


# **Putting It All Together**

It's fine to start and run the notebook from here. Required dependencies and code are duplicated.

## **Semantic Searching Models and Applying Levenshtein Distance**

You'll need to restart your session after running the install of faiss-`x`pu.

In [None]:
# install dependencies
import os
from google.colab import userdata

os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')

import torch

!pip install -q sentence-transformers datasets python-Levenshtein

!sudo apt-get install libomp-dev

if torch.cuda.is_available():
  !pip install -q faiss-gpu
  !nvidia-smi
else:
  !pip install -q faiss-cpu

### **RESTART SESSION**

In [None]:
# UPDATED constants (FAISS_INDEX)
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
COLUMN_TO_EMBED = "model_search"
EMBEDDING_COLUMN = "model_search_embeddings"
HF_DATASET = "blade57/ModelNumbers4Searching_Full"
FAISS_INDEX = "/content/data/ModelSearch_Full.faiss"  # use the FULL INDEX

In [None]:
# Load Embedding Model
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

model = SentenceTransformer(EMBEDDING_MODEL)

In [None]:
# functions
import pandas as pd
import re
from Levenshtein import distance

def remove_junk(search_term: str):
  """
  Removes unwanted characters from a search term.

  Args:
    search_term: The search term to be cleaned.

  Returns:
    The cleaned search term with unwanted characters removed.
  """
  # remove blank space
  results = search_term.replace(' ','')
  # remove unnecessary characters
  results = re.sub(r'[/\+\-_=~*%$#@!"(){}]', '', results)
  # upper case
  return results.upper()

# embedding function
def create_embeddings(text):
  """
  Creates an embedding from a given text using the model

  Args:
    text: The text to be embedded.

  Returns:
    A NumPy array of shape (1, embedding_dim) containing the embedding of the text.
  """
  cleaned_text = remove_junk(text)
  embeddings = model.encode([cleaned_text])
  return embeddings

def query_df(ds_with_faiss, search_text, return_no=10):
  """
  Queries the dataset for the most similar model numbers to the search text using the SentenceTransformers model and the Faiss index.
  Returns the results in a pandas dataframe.

  Args:
    ds_with_faiss: Dataset with a FAISS index
    search_text: The text to be used for the search.
    return_no: The number of results to return (defaults to 10).

  Returns:
    A tuple containing a pandas dataframe of the results, the scores, and the raw search results.
  """
  search = remove_junk(search_text)
  search_embedding = create_embeddings(search)
  scores, search_results = ds_with_faiss.get_nearest_examples(EMBEDDING_COLUMN,
                                                   search_embedding,
                                                   k=return_no)
  results = pd.DataFrame({
    'scores': scores,
    'model_search': search_results['model_search'],
    'model_number': search_results['model_number'],
    'model_name': search_results['model_name'],
    'brand': search_results['brand'],
    'search_for': search
  })
  return results, scores, search_results

def get_ls_rank(search1, search2):
  """
  Calculates the Levenshtein distance between two strings.

  Args:
    search1: The first string.
    search2: The second string.

  Returns:
    The Levenshtein distance between the two strings.
  """
  return distance(s1=str(search1).upper(),
                  s2=str(search2).upper()
                  )


In [None]:
# clone repo
import os
from pathlib import Path

data_path = Path("data/")

if data_path.is_dir():
  print("No need to clone repo")
else:
  !git clone https://github.com/nicholassolomon/ModelNumberSearch.git
  data_path.mkdir(parents=True, exist_ok=True)
  !mv ModelNumberSearch/Data/*.* data
  !rm -rf ModelNumberSearch

Cloning into 'ModelNumberSearch'...
remote: Enumerating objects: 21, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 21 (delta 1), reused 8 (delta 1), pack-reused 12
Receiving objects: 100% (21/21), 68.66 MiB | 9.57 MiB/s, done.
Resolving deltas: 100% (1/1), done.


In [None]:
# load data from Hugging Face Hub
from datasets import load_dataset

# load dataset
ds_full = load_dataset(HF_DATASET, split='train')

Downloading readme:   0%|          | 0.00/629 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data:   0%|          | 0.00/2.69M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [None]:
# attach prior created FAISS index to dataset
ds_full.load_faiss_index(EMBEDDING_COLUMN, FAISS_INDEX)

In [None]:
# Testing
import time

start_time = time.perf_counter()

search_for = '9570RT'  # the actual model is 9570RT

rows = 100

results_dataframe, scores, results = query_df(ds_with_faiss = ds_full,
                                              search_text = search_for,
                                              return_no = rows)
end_time = time.perf_counter()

# get LS rank and resort by LS rank (ascending)
results_dataframe['LS_rank'] = results_dataframe['model_search'].apply(lambda x: get_ls_rank(search_for, x))
# sort by LS distance
results_dataframe = results_dataframe.sort_values(by=['LS_rank'], ascending=True)

print(f'Query Time: {end_time-start_time} seconds')

results_dataframe.head(rows)

Query Time: 0.041526122000050236 seconds


| | scores | model_search | model_number | model_name | brand | search_for | LS_rank |
|---|---|---|---|---|---|---|---|
| 16 | 0.309984 | 9530T1335 | 9530T1335 | Rock Truck | Hyundai | 9570RT333 | 4 |
| 21 | 0.325521 | 9570RT | 9570RT | Crawler Tractor | Ezee-On | 9570RT333 | 4 |
| 23 | 0.325521 | 9570RT | 9570RT | 2WD Tractor | Hitachi | 9570RT333 | 4 |
| 22 | 0.325521 | 9570RT | 9570RT | 2WD Tractor | Caterpillar | 9570RT333 | 4 |
| 25 | 0.325521 | 9570RT | 9570RT | Loader Backhoe | Kubota | 9570RT333 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10 | 0.280887 | 953D1687 | 953D1687 | Disc | JCB | 9570RT333 | 8 |
| 84 | 0.394219 | 9530Scraper | 9530 Scraper | Mini Excavator | Land Pride | 9570RT333 | 8 |
| 17 | 0.310028 | 953MH1711 | 953MH1711 | Crawler Tractor | Hitachi | 9570RT333 | 8 |
| 63 | 0.376837 | 938F1750 | 938F1750 | Skid Steer Loader | Massey Ferguson | 9570RT333 | 9 |
