# **Model Identification with Semantic Search and Levenshtein Distance**


Other Notebooks:
ModelNumber_VecorStore.ipynb:  from RC days, exploring this with appliance models.


## **Intro**

You'll need:
- Hugging Face token

Suggest running in Colab. If not, you'll need to set the environment variable `HF_TOKEN` to your Hugging Face authentication token.

Also suggest you change the runtime type to GPU.
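
If you run outside Colab, one way to handle the token is to export `HF_TOKEN` in your shell before starting Jupyter and check for it at the top of the notebook. The helper below is a minimal sketch; `require_hf_token` is a hypothetical name used only for this example and is not part of any Hugging Face library.

```python
import os

def require_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token stored in env_var, or raise a clear error if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face token."
        )
    return token
```

Calling `require_hf_token()` early fails fast with a readable message instead of an authentication error later in the notebook.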

### Semantic Search
Convert words (or text) to numeric representations (embeddings) using a trained NLP model. Then use a mathematical function, such as cosine similarity or Euclidean distance, to identify other words (or text) near your search term.
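
To make the idea concrete, the sketch below ranks a few items by cosine similarity to a query vector. The three-dimensional vectors are made-up toy values, not real model output, and `cosine_similarity` and `nearest` are illustrative helper names rather than functions from any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec, items, k=2):
    """Return the k (label, similarity) pairs most similar to query_vec."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy "embeddings": invented values for illustration only.
toy_embeddings = {
    "9570RT": [0.9, 0.1, 0.2],
    "9570R": [0.8, 0.2, 0.3],
    "LB34B": [0.1, 0.9, 0.4],
}
query = [0.45, 0.05, 0.1]  # points in the same direction as "9570RT"

for label, similarity in nearest(query, toy_embeddings):
    print(label, round(similarity, 3))
```

A real embedding model works the same way in principle, only with far more dimensions and with vectors learned from data.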

### Levenshtein Distance
Return a numeric value representing the "distance" between two strings: the minimum number of single-character edits (insertions, deletions, or substitutions) needed to make the strings identical.

For example,

```
String 1:    ABC123
String 2:    BBC123
             ______
To Change:   1-----
Distance:    1
Change 1 character, "B" to "A".

String 1:    ABC123
String 2:    123ABC
             ______
To Change:   111111
Distance:    6             
Change 6 characters "1" to "A", "2" to "B", etc.
```             
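
The two distances above can be checked with a small dynamic-programming implementation. This is a textbook sketch for illustration, and the function name `levenshtein` is just for this example; the notebook itself uses the `python-Levenshtein` package later on.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]

print(levenshtein("ABC123", "BBC123"))
print(levenshtein("ABC123", "123ABC"))
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.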

### Combine Semantic Search for Speed with Levenshtein Distance for Accuracy

Calculating Levenshtein distance against every model number is time consuming and bogs down at scale. But if you first build a vector index of your model numbers, then use semantic search to pull out a small set of candidate matches, it's much quicker to calculate the distance on just that subset and report the matches that need the fewest character changes.
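
The two-stage idea can be sketched without any model downloads. In the example below, a character-bigram overlap score stands in for the embedding and FAISS lookup used later in this notebook (that substitution is an assumption made so the snippet runs on its own), and the helper names are hypothetical. The catalog entries are `model_search` values that appear in the results further down.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]

def bigrams(text: str) -> set:
    """Set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def cheap_similarity(a: str, b: str) -> float:
    """Jaccard overlap of character bigrams; a stand-in for embedding similarity."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0

def search(query: str, catalog: list, shortlist_size: int = 3) -> list:
    # Stage 1: cheap similarity picks a small shortlist of candidates.
    shortlist = sorted(catalog, key=lambda m: cheap_similarity(query, m), reverse=True)
    shortlist = shortlist[:shortlist_size]
    # Stage 2: the more expensive edit distance runs only on the shortlist.
    return sorted(shortlist, key=lambda m: levenshtein(query, m))

catalog = ["9570RT", "LT85A1084", "450E6415", "962L1700", "S6901089", "R40441789"]
print(search("X9570RT", catalog))
```

The shortlist size is the knob to tune: a larger shortlist costs more distance calculations but is less likely to miss the right model in stage 1.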


## **Constants**

In [16]:
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
COLUMN_TO_EMBED = "model_search"
EMBEDDING_COLUMN = "model_search_embeddings"
HF_DATASET = "blade57/ModelNumbers4Searching_Full"
FAISS_INDEX = "model_search.faiss"

## **Prepare Dataset**

- Semantic Model Identification.ipynb: https://colab.research.google.com/drive/1bqNwYPrxbNbi5FSQaSdhtUk-tmiD21xY

**Database schema:**
- brand: Faker field, manufacturer's name.
- model_number: Faker field unique to brand.
- model_name: Faker field, model description.
- year: Faker field.
- randomdata: int from 1000-2000, append to model_number when Faker is too short.
- model_search: Based on "cleaned" version of model_number. Used for creating model number embeddings.
- model_search_embeddings: embeddings.

### **Faker**

Used Faker to generate test data. Easy and quick. It created a variety of model numbers (what we're searching for).

I took the generated fake model numbers and created a search version ('model_search') by removing spaces and unwanted characters and converting everything to upper case.


> **Faker References:**
- github: https://github.com/joke2k/faker
- documentation: https://faker.readthedocs.io/en/master/




In [28]:
# install dependencies
!pip install -q Faker faker-vehicle


In [30]:
# create Faker object and add vehicle provider
from faker import Faker
from faker_vehicle import VehicleProvider
import pandas as pd

fake = Faker()
fake.add_provider(VehicleProvider)

In [52]:
# function to generate fake data
import random
import re

def remove_junk(search_term: str):
  """
  Removes unwanted characters from a search term.

  Args:
    search_term: The search term to be cleaned.

  Returns:
    The cleaned search term with unwanted characters removed.
  """
  # remove blank space
  results = search_term.replace(' ','')
  # remove unnecessary characters
  results = re.sub(r'[/\+\-_=~*%$#@!"(){}]', '', results)
  # upper case
  return results.upper()

def create_rows_faker(num: int=1):
  """
  Creates a list of rows with fake data.

  Args:
    num (int): The number of rows to create.

  Returns:
    list: A list of dictionaries containing fake data.
  """
  return_set = []
  for x in range(num):
    randomdata = random.randint(1000,2000)
    model_number = fake.machine_model()
    # if model number is less than 6 characters, add randomdata, adjust as desired
    if len(model_number) < 6:
      model_number += str(randomdata)
    return_set.append({"brand":fake.machine_make(),
                   "model_number":model_number,
                   "model_name":fake.machine_category(),
                   "year":fake.machine_year(),
                   "randomdata":randomdata,
                   "model_search":remove_junk(model_number)
                       })
  return return_set


In [53]:
# generate fake data
import pandas as pd
import random
import re

number_of_sample_rows = 500
df_faker = pd.DataFrame(create_rows_faker(number_of_sample_rows))

# uncomment to save if you wish
#df_faker.to_csv('Test_Data.csv', index=False)

print(df_faker.info())
df_faker.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   brand         500 non-null    object
 1   model_number  500 non-null    object
 2   model_name    500 non-null    object
 3   year          500 non-null    object
 4   randomdata    500 non-null    int64 
 5   model_search  500 non-null    object
dtypes: int64(1), object(5)
memory usage: 23.6+ KB
None


Unnamed: 0,brand,model_number,model_name,year,randomdata,model_search
0,Massey Ferguson,T6701208,Motor Grader,2013,1208,T6701208
1,Wirtgen,T6070 Plus,Hydraulic Excavator,2017,1581,T6070PLUS
2,Kubota,LB34B1766,Mini Excavator,2003,1766,LB34B1766
3,Komatsu,8310R1948,Midi Excavator,2013,1948,8310R1948
4,Rostselmash,CP74B1729,Wheel Loader,2007,1729,CP74B1729


### Load from Hugging Face Hub

In [None]:
import os
from google.colab import userdata

os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')

In [None]:
# install dependencies for using Hugging Face datasets
!pip install -q datasets

In [3]:
from datasets import load_dataset

# load dataset
ds = load_dataset(HF_DATASET, split='train')


Downloading readme:   0%|          | 0.00/629 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data:   0%|          | 0.00/2.69M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [4]:
# dataset info
print(ds)

Dataset({
    features: ['brand', 'model_number', 'model_name', 'year', 'randomdata', 'model_search'],
    num_rows: 50000
})


In [5]:
# single record dictionary
print(ds[0])

{'brand': 'Landini', 'model_number': 'L4240HSTC', 'model_name': 'Hydraulic Excavator', 'year': 2017, 'randomdata': 1439, 'model_search': 'L4240HSTC'}


In [7]:
# load dataset into a dataframe
import pandas as pd

df_hf = ds.to_pandas()

print(f'Rows: {len(df_hf)}')
df_hf.head()


Rows: 50000


Unnamed: 0,brand,model_number,model_name,year,randomdata,model_search
0,Landini,L4240HSTC,Hydraulic Excavator,2017,1439,L4240HSTC
1,John Deere,LS1401203,4WD Tractor,2007,1203,LS1401203
2,Volvo,R40441789,Wheel Loader,2017,1789,R40441789
3,Volvo,Lexion 520,4WD Tractor,2012,1415,Lexion520
4,Caterpillar,9570RT,2WD Tractor,2005,1531,9570RT


In [8]:
# remove duplicate rows from df_hf (none found so far)

df_hf = df_hf.drop_duplicates()
print(f'Rows: {len(df_hf)}')

Rows: 50000


### Load from Repo

Use this to load a copy of the test data and the FAISS index from a repo. The files are stored in a local `data/` directory.

In [9]:
# clone repo
import os
from pathlib import Path

data_path = Path("data/")

if data_path.is_dir():
  print("No need to clone repo")
else:
  !git clone https://github.com/nicholassolomon/ModelNumberSearch.git
  data_path.mkdir(parents=True, exist_ok=True)
  !mv ModelNumberSearch/Data/*.* data
  !rm -rf ModelNumberSearch

Cloning into 'ModelNumberSearch'...
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 15 (delta 0), reused 3 (delta 0), pack-reused 12
Receiving objects: 100% (15/15), 68.62 MiB | 25.60 MiB/s, done.


## **Embedding with Hugging Face**

### Load Embedding Model

- https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

In [10]:
# install dependencies
!pip install -q sentence-transformers


In [17]:
# Load Model and Create Embedding Function
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

model = SentenceTransformer(EMBEDDING_MODEL)



In [18]:
# embedding function
def create_embeddings(text):
  """
  Creates an embedding from a given text using the model

  Args:
    text: The text to be embedded.

  Returns:
    A NumPy array of shape (1, embedding_dim) containing the embedding of the text.
  """
  embeddings = model.encode([text])
  return embeddings



In [20]:
# create embeddings
# load dataset
ds = load_dataset(HF_DATASET, split='train')

# for test purposes, cut dataset down to 50 rows
ds_train_small = ds.select(range(50))

# run embedding function against dataset and save embedding to new column
ds_with_embeddings = ds_train_small.map(lambda example: {EMBEDDING_COLUMN: create_embeddings(example[COLUMN_TO_EMBED])[0]})

Repo card metadata block was not found. Setting CardData to empty.


Map:   0%|          | 0/50 [00:00<?, ? examples/s]

In [21]:
# examine dataset with embedding column
ds_with_embeddings

Dataset({
    features: ['brand', 'model_number', 'model_name', 'year', 'randomdata', 'model_search', 'model_search_embeddings'],
    num_rows: 50
})

In [22]:
# examine embedded row

embedded_model = ds_with_embeddings[0][EMBEDDING_COLUMN]
print(f'Type: {type(embedded_model)}')
print(f'Length: {len(embedded_model)}')
print(f'Slice: {embedded_model[:5]}')

Type: <class 'list'>
Length: 384
Slice: [-0.019833004102110863, 0.03396640717983246, -0.010078956373035908, -0.027987472712993622, -0.016674449667334557]


## **FAISS for Semantic Searches**

- Build FAISS Index: Build_FAISS_Index.ipynb https://colab.research.google.com/drive/1L5ATG9tSnFf5e4Vv0PwXZDF-P02ffo-I

In [30]:
# install dependencies
!sudo apt-get install libomp-dev
!pip install faiss-gpu

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libomp-dev is already the newest version (1:14.0-55~exp2).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


In [23]:
# load fresh copy of data from Hugging Face
from datasets import load_dataset

# load dataset
ds = load_dataset(HF_DATASET, split='train')

# for test purposes, cut dataset down to 50 rows
ds = ds.select(range(50))

# create embeddings
ds = ds.map(lambda example: {EMBEDDING_COLUMN: create_embeddings(example[COLUMN_TO_EMBED])[0]})

ds

Repo card metadata block was not found. Setting CardData to empty.


Map:   0%|          | 0/50 [00:00<?, ? examples/s]

Dataset({
    features: ['brand', 'model_number', 'model_name', 'year', 'randomdata', 'model_search', 'model_search_embeddings'],
    num_rows: 50
})

In [24]:
# create index on embedding column model_search_embeddings

ds.add_faiss_index(column=EMBEDDING_COLUMN)

  0%|          | 0/1 [00:00<?, ?it/s]

Dataset({
    features: ['brand', 'model_number', 'model_name', 'year', 'randomdata', 'model_search', 'model_search_embeddings'],
    num_rows: 50
})

In [25]:
# save faiss index

ds.save_faiss_index(EMBEDDING_COLUMN,
                    FAISS_INDEX)

## **Basic Searching using FAISS**

- Search with FAISS.ipynb: https://colab.research.google.com/drive/1E88ystX0tVbdYLtntqGv8LczPisDBgrF
- Searching.ipynb: https://colab.research.google.com/drive/1NNH8mLmS74lZXfRY6e2HsWKVFBMon7QQ

In [34]:
# search functions
import pandas as pd

def query(ds_with_faiss, search_text, return_no=10):
  """
  Queries the dataset for the most similar model numbers to the search text using the SentenceTransformers model and the Faiss index.

  Args:
    ds_with_faiss:  Dataset with FAISS index
    search_text: The text to be used for the search.
    return_no: The number of results to return.

  Returns:
    A tuple containing the scores and the search results.
  """
  search_embedding = create_embeddings(search_text)
  scores, search_results = ds_with_faiss.get_nearest_examples(EMBEDDING_COLUMN,
                                                   search_embedding,
                                                   k=return_no)
  return scores, search_results

def query_df(ds_with_faiss, search_text, return_no=10):
  """
  Queries the dataset for the most similar model numbers to the search text using the SentenceTransformers model and the Faiss index.
  Returns the results in a pandas dataframe.

  Args:
    ds_with_faiss: Dataset with a FAISS index.
    search_text: The text to be used for the search.
    return_no: The number of results to return.

  Returns:
    A tuple containing a pandas dataframe of the results, the scores, and the raw search results.
  """
  search_embedding = create_embeddings(search_text)
  scores, search_results = ds_with_faiss.get_nearest_examples(EMBEDDING_COLUMN,
                                                   search_embedding,
                                                   k=return_no)
  results = pd.DataFrame({
    'scores': scores,
    'model_search': search_results['model_search'],
    'model_number': search_results['model_number'],
    'model_name': search_results['model_name'],
    'brand': search_results['brand'],
    'search_for': search_text
  })
  return results, scores, search_results


In [30]:
# load a new copy of dataset and load faiss saved index
ds_new_copy = load_dataset(HF_DATASET, split='train')

# for test purposes, cut dataset down to 50 rows
ds_new_copy = ds_new_copy.select(range(50))

# load FAISS index for dataset
ds_new_copy.load_faiss_index(EMBEDDING_COLUMN, FAISS_INDEX)


Repo card metadata block was not found. Setting CardData to empty.


In [41]:
# process results into df and result sets

search_for = 'X9570RT'  # the actual model is 9570RT

# rows to return
rows = 10

result_df, scores, results = query_df(ds_new_copy, search_for, rows)

# sort by scores (ascending: a lower score means a closer match)
result_df = result_df.sort_values(by=['scores'], ascending=True)
result_df.head(rows)


Unnamed: 0,scores,model_search,model_number,model_name,brand,search_for
0,0.573626,9570RT,9570RT,2WD Tractor,Caterpillar,X9570RT
1,0.872193,LT85A1084,LT85A1084,Multi Terrain Loader,Champion,X9570RT
2,0.873749,450E6415,450E/6415,Hydraulic Excavator,New Holland,X9570RT
3,0.884443,7730152PTOhp,7730 152 PTO hp,Vibratory Compactor,Caterpillar,X9570RT
4,0.89761,962L1700,962L1700,Combine,Case IH,X9570RT
5,0.903676,S6901089,S6901089,4WD Tractor,Washburn,X9570RT
6,0.910964,R40441789,R40441789,Wheel Loader,Volvo,X9570RT
7,0.933778,325BLL,325B LL,Wheel Loader,New Holland,X9570RT
8,0.979653,L4240HSTC,L4240HSTC,Hydraulic Excavator,Landini,X9570RT
9,1.004836,R480LC-9,R480LC-9,Utility Tractor,Volvo BM,X9570RT


## **Applying Levenshtein Distance**

In [42]:
!pip install -q python-Levenshtein


In [43]:
# Levenshtein distance helper

from Levenshtein import distance

def get_ls_rank(search1, search2):
  """
  Calculates the Levenshtein distance between two strings.

  Args:
    search1: The first string.
    search2: The second string.

  Returns:
    The Levenshtein distance between the two strings.
  """
  return distance(s1=str(search1).upper(),
                  s2=str(search2).upper()
                  )

In [44]:
# using dataframe from prior section, get Levenshtein distance --add as new column

result_df['LS_rank'] = result_df['model_search'].apply(lambda x: get_ls_rank(search_for, x))
result_df = result_df.sort_values(by=['LS_rank'], ascending=True)

result_df.head(rows)

Unnamed: 0,scores,model_search,model_number,model_name,brand,search_for,LS_rank
0,0.573626,9570RT,9570RT,2WD Tractor,Caterpillar,X9570RT,1
5,0.903676,S6901089,S6901089,4WD Tractor,Washburn,X9570RT,6
7,0.933778,325BLL,325B LL,Wheel Loader,New Holland,X9570RT,6
1,0.872193,LT85A1084,LT85A1084,Multi Terrain Loader,Champion,X9570RT,7
4,0.89761,962L1700,962L1700,Combine,Case IH,X9570RT,7
8,0.979653,L4240HSTC,L4240HSTC,Hydraulic Excavator,Landini,X9570RT,7
9,1.004836,R480LC-9,R480LC-9,Utility Tractor,Volvo BM,X9570RT,8
2,0.873749,450E6415,450E/6415,Hydraulic Excavator,New Holland,X9570RT,8
6,0.910964,R40441789,R40441789,Wheel Loader,Volvo,X9570RT,9
3,0.884443,7730152PTOhp,7730 152 PTO hp,Vibratory Compactor,Caterpillar,X9570RT,11
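
Several rows above tie on `LS_rank` (for example, two rows at distance 8), and a single-key sort does not guarantee their relative order. One option, sketched below with plain tuples copied from the table above, is to break ties with the FAISS score. In pandas the equivalent would be `sort_values(by=['LS_rank', 'scores'])`. Using the score as a tie-breaker is a suggestion, not something this notebook does elsewhere.

```python
# (model_search, FAISS score, Levenshtein distance) copied from the table above
table_rows = [
    ("9570RT", 0.573626, 1),
    ("S6901089", 0.903676, 6),
    ("325BLL", 0.933778, 6),
    ("R480LC-9", 1.004836, 8),
    ("450E6415", 0.873749, 8),
]

# Sort by edit distance first, then by score, so ties go to the closer embedding.
ranked = sorted(table_rows, key=lambda row: (row[2], row[1]))

for model, score, dist in ranked:
    print(f"{model:<10} LS_rank={dist} score={score:.6f}")
```

With the tie-breaker, `450E6415` moves ahead of `R480LC-9` because its score is lower at the same distance.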


# **Putting It All Together**

## **Semantic Searching Models and Applying Levenshtein Distance**

- ModelNumber_Testing.ipynb: https://colab.research.google.com/drive/1JjclQmNMuYbFuThMqmTJJzZuVNiajhQm#scrollTo=_XAa4f5mP3Me
- Test_FAISS_Indexing v2.ipynb: https://colab.research.google.com/drive/1-4ceIJsXw9n7UH_shsOGfGYVL91U2uY4

In [1]:
# install dependencies
import os
from google.colab import userdata

os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')

import torch

!pip install -q sentence-transformers datasets python-Levenshtein

!sudo apt-get install libomp-dev

if torch.cuda.is_available():
  !pip install -q faiss-gpu
  !nvidia-smi
else:
  !pip install -q faiss-cpu



In [2]:
# UPDATED constants (FAISS_INDEX)
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
COLUMN_TO_EMBED = "model_search"
EMBEDDING_COLUMN = "model_search_embeddings"
HF_DATASET = "blade57/ModelNumbers4Searching_Full"
FAISS_INDEX = "/content/data/ModelSearch_Full.faiss"  # use the FULL INDEX

In [3]:
# Load Embedding Model
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

model = SentenceTransformer(EMBEDDING_MODEL)


In [48]:
# functions
import re

import pandas as pd
from Levenshtein import distance

def remove_junk(search_term: str):
  """
  Removes unwanted characters from a search term.

  Args:
    search_term: The search term to be cleaned.

  Returns:
    The cleaned search term with unwanted characters removed.
  """
  # remove blank space
  results = search_term.replace(' ','')
  # remove unnecessary characters
  results = re.sub(r'[/\+\-_=~*%$#@!"(){}]', '', results)
  # upper case
  return results.upper()

# embedding function
def create_embeddings(text):
  """
  Creates an embedding from a given text using the model

  Args:
    text: The text to be embedded.

  Returns:
    A NumPy array of shape (1, embedding_dim) containing the embedding of the text.
  """
  embeddings = model.encode([text])
  return embeddings

def query_df(ds_with_faiss, search_text, return_no=10):
  """
  Queries the dataset for the most similar model numbers to the search text using the SentenceTransformers model and the Faiss index.
  Returns the results in a pandas dataframe.

  Args:
    ds_with_faiss: Dataset with a FAISS index
    search_text: The text to be used for the search.
    return_no: The number of results to return (defaults to 10).

  Returns:
    A tuple containing a pandas dataframe of the results, the scores, and the raw search results.
  """
  search_embedding = create_embeddings(search_text)
  scores, search_results = ds_with_faiss.get_nearest_examples(EMBEDDING_COLUMN,
                                                   search_embedding,
                                                   k=return_no)
  results = pd.DataFrame({
    'scores': scores,
    'model_search': search_results['model_search'],
    'model_number': search_results['model_number'],
    'model_name': search_results['model_name'],
    'brand': search_results['brand'],
    'search_for': search_text
  })
  return results, scores, search_results

def get_ls_rank(search1, search2):
  """
  Calculates the Levenshtein distance between two strings.

  Args:
    search1: The first string.
    search2: The second string.

  Returns:
    The Levenshtein distance between the two strings.
  """
  return distance(s1=str(search1).upper(),
                  s2=str(search2).upper()
                  )


In [5]:
# clone repo
import os
from pathlib import Path

data_path = Path("data/")

if data_path.is_dir():
  print("No need to clone repo")
else:
  !git clone https://github.com/nicholassolomon/ModelNumberSearch.git
  data_path.mkdir(parents=True, exist_ok=True)
  !mv ModelNumberSearch/Data/*.* data
  !rm -rf ModelNumberSearch

Cloning into 'ModelNumberSearch'...
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 15 (delta 0), reused 3 (delta 0), pack-reused 12
Receiving objects: 100% (15/15), 68.62 MiB | 30.58 MiB/s, done.


In [6]:
# load data
from datasets import load_dataset

# load dataset
ds_full = load_dataset(HF_DATASET, split='train')

Downloading readme:   0%|          | 0.00/629 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data:   0%|          | 0.00/2.69M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [7]:
# attach prior created FAISS index to dataset
ds_full.load_faiss_index(EMBEDDING_COLUMN, FAISS_INDEX)

In [57]:
# Testing
import time

start_time = time.perf_counter()

search_for = ' _A9570_*@ #$RT'  # the actual model is 9570RT by Hitachi - 2WD Tractor
search_for = remove_junk(search_for)

rows = 100

results_dataframe, scores, results = query_df(ds_with_faiss = ds_full,
                                              search_text = search_for,
                                              return_no = rows)
end_time = time.perf_counter()

# get LS rank for each candidate, then re-sort by LS rank (ascending)
results_dataframe['LS_rank'] = results_dataframe['model_search'].apply(lambda x: get_ls_rank(search_for, x))
# sort by LS distance
results_dataframe = results_dataframe.sort_values(by=['LS_rank'], ascending=True)

print(f'Query Time: {end_time-start_time} seconds')

results_dataframe.head(rows)

Query Time: 0.050506315999882645 seconds


Unnamed: 0,scores,model_search,model_number,model_name,brand,search_for,LS_rank
78,0.535003,9570RT,9570RT,2WD Tractor,Caterpillar,A9570RT,1
79,0.535003,9570RT,9570RT,4WD Tractor,Case,A9570RT,1
75,0.535003,9570RT,9570RT,Loader Backhoe,Kubota,A9570RT,1
80,0.535003,9570RT,9570RT,Crawler Tractor,Ezee-On,A9570RT,1
76,0.535003,9570RT,9570RT,Hydraulic Excavator,Landini,A9570RT,1
...,...,...,...,...,...,...,...
99,0.555884,A924LITRONIC,A924 LITRONIC,MFWD Tractor,Liebherr,A9570RT,9
98,0.555884,A924LITRONIC,A924 LITRONIC,MFWD Tractor,Massey Ferguson,A9570RT,9
42,0.489241,A918LITRONIC,A918 LITRONIC,Vibratory Smooth Drum Roller,Case IH,A9570RT,9
41,0.489241,A918LITRONIC,A918 LITRONIC,Combine,Hesston,A9570RT,9
