In [1]:
import pandas as pd

Algorithm Description:

Calculate semantically meaningful vector embeddings of 'PART_DESCRIPTION' text.
Then, for a new part description, calculate its embedding, compute the cosine similarity against every stored embedding, and return the K most similar items (those with the highest cosine similarity).

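To make the algorithm faster, we could first run K-Means clustering on the database embeddings, then compare a new embedding against the cluster centroids (the mean embeddings), and finally search only within the most similar cluster. This trades some accuracy for speed, because close matches that fall in a neighbouring cluster are missed.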
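
A minimal sketch of that two-stage search, assuming scikit-learn's `KMeans` and synthetic embeddings in place of the real database. The cluster count, the array shapes, and the helper name `clustered_search` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Synthetic stand-in for the database embeddings (assumption: 200 items, 16 dims).
rng = np.random.default_rng(0)
db_embeddings = rng.normal(size=(200, 16))

# Normalise to unit length so Euclidean K-Means groups vectors roughly by direction.
db_unit = db_embeddings / np.linalg.norm(db_embeddings, axis=1, keepdims=True)

n_clusters = 8  # hypothetical value; tune for the real dataset
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(db_unit)


def clustered_search(query, top_k=5):
    """Return indices and scores of the top_k items within the closest cluster."""
    query = np.asarray(query, dtype=float).reshape(1, -1)
    # Stage 1: pick the cluster whose centroid is most similar to the query.
    centroid_scores = cosine_similarity(query, kmeans.cluster_centers_)[0]
    best_cluster = int(np.argmax(centroid_scores))
    # Stage 2: exact cosine similarity, restricted to that cluster's members.
    member_idx = np.flatnonzero(kmeans.labels_ == best_cluster)
    member_scores = cosine_similarity(query, db_unit[member_idx])[0]
    order = np.argsort(member_scores)[::-1][:top_k]
    return member_idx[order], member_scores[order]


indices, scores = clustered_search(db_embeddings[3], top_k=5)
```

The result is approximate: only one cluster is searched, so a common refinement is to search the two or three closest clusters instead of one.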

In [None]:
df = pd.read_csv("Fuse.csv", sep=";")

In [None]:
pd.options.display.max_columns = None
df

Check missing values

In [None]:
df.isna().sum()

Inspect unique values of each feature

In [None]:
for col in df.columns:
    if col != "PART_ID":
        print(f"Col: {col}; Categories: {df[col].unique()} \n")

Comments on data:

    1 - multiple types of features: numeric (e.g. body height), unordered and ordered multinomial categorical (e.g. Fuse Material and Blow Characteristic), text (e.g. Part Description, although this is just a combination of several other features into text format)
    
    2 - some categorical features have very high cardinality (e.g. Application)

Observed problems with data:

    1 - many collinear/redundant features (e.g. Body Height + Body length = Fuse Size or Physical Dimension; Mounting Feature = Mounting; 'Additional Features' includes redundant information about other features)
    
    2 - many missing values

Potential solutions:

    - Leverage the fact that there are redundant features to try to complete feature values for as many samples as possible (e.g. build missing PART_DESCRIPTION values by constructing strings from values of other columns)

    - then drop redundant features
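
The two steps above can be sketched on a toy frame. The column names (`MOUNTING`, `MOUNTING_FEATURE`, `FUSE_MATERIAL`) and the `"Fuse ..."` description format are hypothetical stand-ins, since the real schema of `Fuse.csv` is not shown here.

```python
import pandas as pd

# Toy frame; the column names are hypothetical stand-ins for the real ones.
toy = pd.DataFrame(
    {
        "PART_ID": [1, 2, 3],
        "MOUNTING": ["Through Hole", None, None],
        "MOUNTING_FEATURE": [None, "Surface Mount", None],
        "FUSE_MATERIAL": ["Glass", "Ceramic", None],
        "PART_DESCRIPTION": ["Fuse Glass Through Hole", None, None],
    }
)

# 1. Coalesce a redundant pair: take MOUNTING, fall back to MOUNTING_FEATURE.
toy["MOUNTING"] = toy["MOUNTING"].combine_first(toy["MOUNTING_FEATURE"])

# 2. Rebuild missing descriptions from whatever feature values are present.
source_cols = ["FUSE_MATERIAL", "MOUNTING"]


def build_description(row):
    parts = [str(row[c]) for c in source_cols if pd.notna(row[c])]
    # Keep the value missing when there is nothing to build a description from.
    return "Fuse " + " ".join(parts) if parts else None


built = toy.apply(build_description, axis=1)
toy["PART_DESCRIPTION"] = toy["PART_DESCRIPTION"].fillna(built)

# 3. Drop the column that is now fully covered by its counterpart.
toy = toy.drop(columns=["MOUNTING_FEATURE"])
```

Existing descriptions are left untouched; only rows with a missing description receive a constructed one, and rows with no usable features stay missing.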

In [None]:
import boto3
from sklearn.metrics.pairwise import cosine_similarity
import json
import numpy as np

To keep the demonstration simple, we use a pretrained semantic text embedding model from Amazon (Titan Text Embeddings V2, accessed through Amazon Bedrock).

In [None]:
region = "us-east-1"  # change to your region
bedrock_client = boto3.client(service_name="bedrock-runtime", region_name=region)
model_id = "amazon.titan-embed-text-v2:0"

In [None]:
embedding_col = "PART_DESCRIPTION"

To keep the demonstration simple, drop rows with a missing description for now

In [None]:
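# Take an explicit copy so that adding the "embedding" column later does not
# trigger pandas' SettingWithCopyWarning.
df_filt = df[df[embedding_col].notna()].copy()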

In [None]:
df_filt[embedding_col].isna().sum()  # sanity check: should be 0

Embed all descriptions in dataset

In [None]:
def get_embedding(text):
    body = json.dumps({"inputText": text})
    resp = bedrock_client.invoke_model(modelId=model_id, body=body)
    resp_body = json.loads(resp["body"].read())
    return np.array(resp_body["embedding"], dtype=float)

In [None]:
# Embed the description column (embedding_col), not the leftover loop variable `col`.
df_filt["embedding"] = df_filt[embedding_col].apply(lambda x: get_embedding(str(x)))
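
Each embedding is a separate API call, so repeated descriptions are paid for more than once. A possible sketch that embeds every distinct string only once is shown below; `embed_unique` and `fake_embedding` are hypothetical names, and the fake embedder stands in for `get_embedding` so the example runs without Bedrock access.

```python
import numpy as np
import pandas as pd


def embed_unique(texts, embed_fn):
    """Embed each distinct string once and map the vectors back onto the Series."""
    as_text = texts.astype(str)
    cache = {text: embed_fn(text) for text in as_text.unique()}
    return as_text.map(lambda text: cache[text])


# Stand-in embedder for demonstration only; it records calls instead of using Bedrock.
calls = []


def fake_embedding(text):
    calls.append(text)
    return np.array([len(text), text.count(" ")], dtype=float)


descriptions = pd.Series(["Fuse Glass 5x20mm", "Fuse Ceramic", "Fuse Glass 5x20mm"])
embeddings = embed_unique(descriptions, fake_embedding)
```

In the notebook, passing `get_embedding` as `embed_fn` would give the same column as the cell above, with fewer calls whenever descriptions repeat.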

Save to pickle

In [None]:
df_filt.to_pickle("df_with_embeddings.pkl")

Load from pickle

In [2]:
df_load = pd.read_pickle("df_with_embeddings.pkl")

Similarity score calculation

In [None]:
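def calc_similarity_scores(query_embedding, embeddings):
    # Stack the per-row vectors into one (n_items, dim) matrix and score them
    # in a single call; this also avoids shadowing the built-in `object`.
    matrix = np.vstack(list(embeddings))
    return cosine_similarity([query_embedding], matrix)[0]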

Compare new embedding with all known embeddings, return most similar K items

In [None]:
def find_similar_parts(part_description, database, top_k=5):
    part_embedding = get_embedding(part_description)
    database_embeddings = database["embedding"]
    scores = calc_similarity_scores(part_embedding, database_embeddings)
    idx = np.argsort(scores)[::-1][0:top_k]
    similar_items = database.iloc[idx].copy(deep=True)
    return similar_items

Test with a new part description

In [None]:
test_item = "Fuse Glass Very Fast 5x20mm"

In [None]:
similar_items = find_similar_parts(test_item, df_load)
similar_items

Steps to include other features in the dataset:

- Create a part description that incorporates information from all features systematically (rather than arbitrarily, as it does currently)
- Use a semantic vector embedding of that description, as now
- Given a new query, find the most similar items through cosine similarity
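
One way the first step could look is a single template function used for both database rows and queries, so that both sides are embedded in the same format. The column names, their order, and the `Label: value` layout are assumptions for illustration, not the dataset's actual schema.

```python
import pandas as pd

# Hypothetical column names and order; adapt to the real dataset schema.
FEATURE_ORDER = ["FUSE_MATERIAL", "BLOW_CHARACTERISTIC", "FUSE_SIZE"]


def describe(features):
    """Render a mapping of feature values as one fixed-order, labelled string."""
    parts = []
    for name in FEATURE_ORDER:
        value = features.get(name)
        # Skip features that are absent or missing instead of writing "nan".
        if value is not None and not pd.isna(value):
            label = name.replace("_", " ").title()
            parts.append(f"{label}: {value}")
    return "; ".join(parts)


parts_df = pd.DataFrame(
    {
        "FUSE_MATERIAL": ["Glass", "Ceramic"],
        "BLOW_CHARACTERISTIC": ["Very Fast", None],
        "FUSE_SIZE": ["5x20mm", "6.3x32mm"],
    }
)

# Database side: one description per row, built the same way for every part.
parts_df["PART_DESCRIPTION"] = parts_df.apply(describe, axis=1)

# Query side: the same template, so queries and database texts share one format.
query_text = describe({"FUSE_MATERIAL": "Glass", "FUSE_SIZE": "5x20mm"})
```

The resulting `query_text` could then be passed to `find_similar_parts` in place of a free-form string.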