![Cover image](https://d1fiydes8a4qgo.cloudfront.net/blog/2025/february/1/linkedin_card.png)

In [40]:
import ast
import base64
from datetime import datetime
import json
import os

import boto3
from botocore.exceptions import NoCredentialsError

from IPython.display import display, Image, HTML

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.autonotebook import tqdm

In [20]:
# Move up a directory if needed
if not os.path.exists('data/solemates_shoe_directory.csv'):
    os.chdir('..')
    assert os.path.exists('data/solemates_shoe_directory.csv'), "Data directory not found."

# Define paths dynamically
DATA_PATH = "data/solemates_shoe_directory.csv"
IMAGE_PATH = "data/footwear"  # Adjust if images are stored elsewhere

# Load shoe data
Let's start by reading the SoleMates shoe dataset. This dataset contains detailed product information, such as shoe colors and heel heights, which we'll transform into embeddings.

In [21]:
# Load the SoleMates shoe dataset
df_shoes = pd.read_csv(DATA_PATH)

# Convert 'color_details' from string representation of a list to an actual list
df_shoes['color_details'] = df_shoes['color_details'].apply(ast.literal_eval)

# Display the first few rows of the dataset
df_shoes.head()

| Unnamed: 0 | product_title | gender | product_type | color | usage | color_details | heel_height | heel_type | price_usd | brand | product_id | image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Puma men future cat remix sf black casual shoes | men | casual shoes | black | casual | [] | | | 220 | puma | 1 | 1.jpg |
| 1 | Buckaroo men flores black formal shoes | men | formal shoes | black | formal | [] | | | 155 | buckaroo | 2 | 2.jpg |
| 2 | Gas men europa white shoes | men | casual shoes | white | casual | [] | | | 105 | gas | 3 | 3.jpg |
| 3 | Nike men's incinerate msl white blue shoe | men | sports shoes | white | sports | [blue] | | | 125 | nike | 4 | 4.jpg |
| 4 | Clarks men hang work leather black formal shoes | men | formal shoes | black | formal | [] | | | 220 | clarks | 5 | 5.jpg |


# Visualize shoes

In [22]:
width = 100
images_html = ""
for img_file in df_shoes.head()['image']:
    img_path = os.path.join(IMAGE_PATH, img_file)
    # Add each image as an HTML <img> tag
    images_html += f'<img src="{img_path}" style="width:{width}px; margin-right:10px;">'
# Display all images in a row using HTML
display(HTML(f'<div style="display: flex; align-items: center;">{images_html}</div>'))

# Cost of vectorization and pre-embedded dataset
Vectorizing a dataset with Amazon Bedrock and the Titan multimodal model incurs costs based on the number of input tokens and images:

- **Text embeddings:** $0.0008 per 1,000 input tokens
- **Image embeddings:** $0.00006 per image

The provided SoleMates dataset is small, containing just 1,306 pairs of shoes, so it is affordable to vectorize. For this dataset, I calculated the token count and the total cost of vectorization:

- **Token count:** 12,746 tokens
- **Images:** 1,306
- **Total cost:** $0.0885568

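The total follows directly from the two prices and the two counts. As a minimal sketch, the arithmetic can be reproduced like this (the helper name `estimate_embedding_cost` and the constant names are illustrative, not part of any AWS SDK):

```python
# Prices quoted in this guide for the Titan multimodal embeddings model
TEXT_PRICE_PER_1K_TOKENS = 0.0008   # USD per 1,000 input tokens
IMAGE_PRICE_PER_IMAGE = 0.00006     # USD per image


def estimate_embedding_cost(token_count, image_count):
    """Return the estimated vectorization cost in USD (hypothetical helper)."""
    text_cost = token_count / 1000 * TEXT_PRICE_PER_1K_TOKENS
    image_cost = image_count * IMAGE_PRICE_PER_IMAGE
    return text_cost + image_cost


# Counts reported for the SoleMates dataset
total_cost = estimate_embedding_cost(token_count=12746, image_count=1306)
print(f"Estimated cost: ${total_cost:.7f}")
```

Most of the cost comes from the images (1,306 × $0.00006 = $0.07836). The text portion adds only about one cent.
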
If you prefer not to generate embeddings yourself or don't have access to AWS, you can use a pre-embedded dataset that I've prepared as a CSV file. This file includes all embeddings and token counts, allowing you to follow the guide without incurring additional costs. However, for hands-on experience, I recommend running the embedding process to understand the workflow.

To load the pre-embedded dataset, use the following code:
```python
# Load pre-embedded dataset
df_shoes_with_embeddings = pd.read_csv('../data/solemates_shoe_directory_with_embeddings_token_count.csv')

# Convert string representations to actual lists
df_shoes_with_embeddings['titan_embedding'] = df_shoes_with_embeddings['titan_embedding'].apply(ast.literal_eval)
```
This step is entirely optional and designed to accommodate various levels of access and resources.
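
The `ast.literal_eval` step is needed because CSV files store every cell as text, so a list-valued column comes back as strings. The sketch below illustrates this with a toy DataFrame and an in-memory CSV (the toy data and variable names are made up for illustration):

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column, standing in for 'titan_embedding'
df_original = pd.DataFrame({
    "product_id": [1, 2],
    "titan_embedding": [[0.1, 0.2], [0.3, 0.4]],
})

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
df_original.to_csv(buffer, index=False)
buffer.seek(0)
df_loaded = pd.read_csv(buffer)

# After the round trip, each embedding is a string such as "[0.1, 0.2]"
type_before_parsing = type(df_loaded.loc[0, "titan_embedding"])

# ast.literal_eval parses the string back into a Python list
df_loaded["titan_embedding"] = df_loaded["titan_embedding"].apply(ast.literal_eval)
type_after_parsing = type(df_loaded.loc[0, "titan_embedding"])

print(type_before_parsing, type_after_parsing)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for parsing data read from a file.
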

# Set up AWS Bedrock client
You'll need access to Amazon Bedrock foundation models.

## What is AWS Bedrock?
Amazon Bedrock is a fully managed service offering high-performing foundation models (FMs) for building generative AI applications.

Bedrock is serverless and offers multiple foundation models to choose from.

## What is AWS Titan?
The Amazon Titan Multimodal Embeddings G1 model is a multimodal embedding model that converts both product text and images into vectors.

[Learn more about AWS Titan G1 ↗](https://docs.aws.amazon.com/bedrock/latest/userguide/titan-multiemb-models.html)

In [23]:
# Define your AWS profile
# The profile name is read from the AWS_PROFILE environment variable
# If AWS_PROFILE is not set, 'aws_profile' is None and your default AWS profile is used
aws_profile = os.environ.get('AWS_PROFILE')

# Specify the AWS region where Bedrock is available
aws_region_name = "us-east-1"

try:
    # Set the default session for the specified profile
    if aws_profile:
        boto3.setup_default_session(profile_name=aws_profile)
    else:
        boto3.setup_default_session()  # Use default AWS profile if none is specified

    # Initialize the Bedrock runtime client
    bedrock_runtime = boto3.client(
        service_name="bedrock-runtime",
        region_name=aws_region_name
    )
except NoCredentialsError:
    print("AWS credentials not found. Please configure your AWS profile.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

# Generate embeddings for product data
To prepare our product data for the vector database, we'll generate embeddings for each product using AWS Titan. These embeddings combine image and text data to represent each product in a format suitable for search and recommendation systems.

Before generating embeddings, we'll initialize two new columns in the dataset:

- **titan_embedding:** To store the embedding vectors
- **token_count:** To store the token count for each product title

Then, we'll define a function to generate embeddings and apply it to the dataset.

# Initialize columns for embeddings

In [24]:
# Initialize columns to store embeddings and token counts
df_shoes['titan_embedding'] = None  # Placeholder for embedding vectors
df_shoes['token_count'] = None  # Placeholder for token counts

# Define function for generating embeddings

In [25]:
# Main function to generate image and text embeddings
def generate_embeddings(df, image_col='image', text_col='product_title', embedding_col='embedding', image_folder=None):

    if image_folder is None:
        raise ValueError("Please provide the path to the image folder.")

    for index, row in tqdm(df.iterrows(), total=df.shape[0], desc="Generating embeddings"):
        try:

            # Read and encode the image as base64
            image_path = os.path.join(image_folder, row[image_col])
            with open(image_path, 'rb') as img_file:
                image_base64 = base64.b64encode(img_file.read()).decode('utf-8')

            # Prepare input data for AWS Titan
            input_data = {"inputImage": image_base64, "inputText": row[text_col]}

            # Invoke the AWS Titan model via Bedrock runtime
            response = bedrock_runtime.invoke_model(
                body=json.dumps(input_data),
                modelId="amazon.titan-embed-image-v1",
                accept="application/json",
                contentType="application/json"
            )
            response_body = json.loads(response.get("body").read())

            # Extract embedding and token count from response
            embedding = response_body.get("embedding")
            token_count = response_body.get("inputTextTokenCount")

            # Validate and save the embedding and token count
            if isinstance(embedding, list):
                df.at[index, embedding_col] = embedding  # Save embedding as a list
                df.at[index, 'token_count'] = int(token_count)  # Save token count as an integer
            else:
                raise ValueError("Unexpected response: embedding is not a list.")

        except Exception as e:
            print(f"Error for row {index}: {e}")
            df.at[index, embedding_col] = None  # Handle errors gracefully

    return df
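
The request body built inside the loop can also be factored into a small helper, which makes that step easy to test without calling AWS. This is a minimal sketch: `build_titan_request_body` is a hypothetical name, and the only fields it uses are the `inputImage` and `inputText` keys already shown in the function above.

```python
import base64
import json


def build_titan_request_body(image_bytes=None, text=None):
    """Build the JSON request body for one product (hypothetical helper)."""
    if image_bytes is None and text is None:
        raise ValueError("Provide image bytes, text, or both.")

    payload = {}
    if image_bytes is not None:
        # Images are sent as base64-encoded strings
        payload["inputImage"] = base64.b64encode(image_bytes).decode("utf-8")
    if text is not None:
        payload["inputText"] = text
    return json.dumps(payload)


# Placeholder bytes stand in for the contents of a real image file
body = build_titan_request_body(image_bytes=b"fake-image-bytes", text="red heels")
print(sorted(json.loads(body).keys()))
```

The returned string can be passed as the `body` argument of `bedrock_runtime.invoke_model`.
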

# Generate embeddings

In [26]:
# Generate embeddings for the product data
df_shoes = generate_embeddings(
    df=df_shoes, 
    embedding_col='titan_embedding', 
    image_folder=IMAGE_PATH
)

Generating embeddings:   0%|          | 0/1306 [00:00<?, ?it/s]

# Save dataset for reuse

In [30]:
# Get today's date in YYYY_MM_DD format
today = datetime.now().strftime('%Y_%m_%d')

# Save the dataset with generated embeddings to a new CSV file
df_shoes.to_csv(f'shoes_with_embeddings_token_{today}.csv', index=False)
print(f"Dataset with embeddings saved as 'shoes_with_embeddings_token_{today}.csv'")

Dataset with embeddings saved as 'shoes_with_embeddings_token_2025_02_02.csv'


# Querying the vectorized data
Now that your dataset contains embeddings, we can implement a simple vector similarity search to find the products most relevant to a query.

## Define helper functions

### a. Request AWS Titan embeddings

In [31]:
def request_embedding(image_base64=None, text_description=None):
    """
    Request embeddings from AWS Titan multimodal model.

    Parameters:
        image_base64 (str, optional): Base64 encoded image string.
        text_description (str, optional): Text description.

    Returns:
        list: Embedding vector.
    """
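    if image_base64 is None and text_description is None:
        raise ValueError("Provide an image, a text description, or both.")

    # Include only the inputs that were provided, so no null fields are sent
    input_data = {}
    if image_base64 is not None:
        input_data["inputImage"] = image_base64
    if text_description is not None:
        input_data["inputText"] = text_description
    body = json.dumps(input_data)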

    # Invoke the Titan multimodal model
    response = bedrock_runtime.invoke_model(
        body=body,
        modelId="amazon.titan-embed-image-v1",
        accept="application/json",
        contentType="application/json"
    )

    response_body = json.loads(response.get("body").read())

    if response_body.get("message"):
        raise ValueError(f"Embeddings generation error: {response_body.get('message')}")

    return response_body.get("embedding")

### b. Cosine similarity function

In [32]:
def compute_cosine_similarities(query_vec, embeddings):
    """
    Compute cosine similarities between a query vector and a list of embeddings.
    
    Parameters:
        query_vec (list or np.array): Query embedding vector.
        embeddings (list of lists): List of product embedding vectors.
    
    Returns:
        np.array: Array of cosine similarity scores.
    """
    # Convert lists to NumPy arrays
    query_vec = np.array(query_vec).reshape(1, -1)
    embeddings = np.array(embeddings)
    return cosine_similarity(query_vec, embeddings).flatten()
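
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The NumPy-only sketch below computes the same quantity on toy vectors. The helper name and the toy data are illustrative, and unlike scikit-learn's implementation, this version does not guard against zero-length vectors.

```python
import numpy as np


def cosine_similarities_numpy(query_vec, embeddings):
    """Cosine similarity of one query against each row of `embeddings` (illustrative helper)."""
    query_vec = np.asarray(query_vec, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)
    # Dot product of every embedding with the query
    dot_products = embeddings @ query_vec
    # Product of the vector lengths (assumes no zero-length vectors)
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec)
    return dot_products / norms


toy_query = [1.0, 0.0]
toy_embeddings = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
scores = cosine_similarities_numpy(toy_query, toy_embeddings)
print(scores)
```

The first toy vector points in the same direction as the query (score 1), the second is perpendicular (score 0), and the third sits at 45 degrees (score about 0.707). Vector length has no effect on the score.
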

### c. Dot product similarity function

In [33]:
def compute_dot_product_similarity(query_vec, embeddings):
    """
    Compute dot product similarities between a query vector and a list of embeddings.
    
    Parameters:
        query_vec (list or np.array): The query embedding vector.
        embeddings (list of lists): List of product embedding vectors.
    
    Returns:
        np.array: Array of dot product similarity scores.
    """
    query_vec = np.array(query_vec).flatten()  # Ensure query is a 1D array
    embeddings = np.array(embeddings)
    # Compute dot product similarity for each embedding
    return np.dot(embeddings, query_vec)

## Query your data

### a. Prepare the product embeddings

In [34]:
product_embeddings = df_shoes['titan_embedding'].tolist()
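
If any embedding request failed earlier, that row holds `None` in the `titan_embedding` column, and the similarity functions cannot stack the vectors into a matrix. One way to guard against this, sketched here on a toy DataFrame (the data and variable names are made up for illustration), is to drop those rows first:

```python
import numpy as np
import pandas as pd

# Toy frame: the second row mimics a product whose embedding request failed
df_toy = pd.DataFrame({
    "product_id": [1, 2, 3],
    "titan_embedding": [[0.1, 0.2], None, [0.3, 0.4]],
})

# Keep only rows that have an embedding; reset the index so that
# positions in the embedding matrix line up with df_valid.iloc positions
df_valid = df_toy[df_toy["titan_embedding"].notna()].reset_index(drop=True)
embedding_matrix = np.array(df_valid["titan_embedding"].tolist())

print(embedding_matrix.shape)
```

If you filter this way, pass the filtered DataFrame (here `df_valid`) to any later lookup by position, so the indices returned by the similarity search still point at the right products.
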

### b. Generate a query embedding

In [35]:
customer_query = "red heels"
query_embedding = request_embedding(text_description=customer_query)

### c. Compute similarities

In [41]:
cosine_similarities = compute_cosine_similarities(query_embedding, product_embeddings)
dot_similarities = compute_dot_product_similarity(query_embedding, product_embeddings)

### d. Retrieve top matching products
Select the top N similar products based on each similarity measure. 

Here, `num_neighbors` is set to 6, but you can adjust this number as needed.

In [42]:
num_neighbors = 6  # Change this to the desired number of neighbors
top_indices_cosine = np.argsort(cosine_similarities)[::-1][:num_neighbors]
top_indices_dot = np.argsort(dot_similarities)[::-1][:num_neighbors]
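
To see what the `argsort` expression does, here is the same pattern on a handful of toy scores (the helper name `top_k_indices` and the scores are illustrative):

```python
import numpy as np


def top_k_indices(scores, k):
    """Return the indices of the k highest scores, best first (illustrative helper)."""
    scores = np.asarray(scores)
    # argsort sorts ascending, [::-1] reverses to descending, [:k] keeps the top k
    return np.argsort(scores)[::-1][:k]


toy_scores = [0.12, 0.87, 0.45, 0.91, 0.30]
top_indices = top_k_indices(toy_scores, k=3)
print(top_indices)  # [3 1 2]
```

The result contains positions, not scores: index 3 holds the highest score (0.91), followed by index 1 (0.87) and index 2 (0.45).
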

## Visualize the results

In [43]:
def visualize_neighbors(df, neighbor_indices, image_data_path='data/footwear', width=100):
    """
    Visualize the neighbor products returned by a similarity search.
    
    Parameters:
        df (pd.DataFrame): DataFrame containing product data.
        neighbor_indices (list): List of indices for the neighbor products.
        image_data_path (str): Folder path where product images are stored.
        width (int): Display width of each image in pixels.
    """
    images_html = ""
    for idx in neighbor_indices:
        product = df.iloc[idx]
        img_file = product['image']
        product_title = product['product_title']
        img_path = os.path.join(image_data_path, img_file)
        images_html += (
            f'<div style="margin-right:10px; text-align:center;">'
            f'<img src="{img_path}" style="width:{width}px;"><br>'
            f'<span>{product_title}</span>'
            f'</div>'
        )
    display(HTML(f'<div style="display: flex; align-items: center;">{images_html}</div>'))

In [44]:
# Example usage: Visualize the top neighbors from the cosine similarity search
visualize_neighbors(df_shoes, top_indices_cosine, image_data_path='../data/footwear', width=100)

# Example usage: Visualize the top neighbors from the dot product similarity search
visualize_neighbors(df_shoes, top_indices_dot, image_data_path='../data/footwear', width=100)

## Notes
In this example, both cosine and dot product methods returned the same products.

# Try another query

In [45]:
customer_query = "blue shoes"
query_embedding = request_embedding(text_description=customer_query)
cosine_similarities = compute_cosine_similarities(query_embedding, product_embeddings)
dot_similarities = compute_dot_product_similarity(query_embedding, product_embeddings)
num_neighbors = 6  # Change this to the desired number of neighbors
top_indices_cosine = np.argsort(cosine_similarities)[::-1][:num_neighbors]
top_indices_dot = np.argsort(dot_similarities)[::-1][:num_neighbors]

# Visualize the top neighbors from the cosine similarity search
visualize_neighbors(df_shoes, top_indices_cosine, image_data_path='../data/footwear', width=100)

# Visualize the top neighbors from the dot product similarity search
visualize_neighbors(df_shoes, top_indices_dot, image_data_path='../data/footwear', width=100)

## Note
You might notice a slight difference in the ranking for **"blue shoes"**, illustrating how the two similarity metrics can sometimes produce different results.
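
One plausible reason for such differences, assuming the embeddings are not all the same length, is that the dot product rewards longer vectors while cosine similarity looks only at direction. The toy example below (made-up two-dimensional vectors, not real Titan embeddings) shows the two metrics picking different winners:

```python
import numpy as np

query = np.array([1.0, 0.0])
products = np.array([
    [0.9, 0.1],   # short vector pointing almost exactly along the query
    [3.0, 3.0],   # long vector at 45 degrees to the query
])

# Dot product: sensitive to both direction and vector length
dot_scores = products @ query

# Cosine similarity: dot product divided by the vector lengths
cosine_scores = dot_scores / (np.linalg.norm(products, axis=1) * np.linalg.norm(query))

best_by_dot = int(np.argmax(dot_scores))
best_by_cosine = int(np.argmax(cosine_scores))
print(best_by_dot, best_by_cosine)  # 1 0
```

If every vector is first scaled to unit length, the two metrics produce identical scores and therefore identical rankings.
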

## Next Steps

Now that you've built a vector search solution with AWS Titan, here are some ideas to take your project even further:

- **Integrate and experiment:**  
  Embed your vector search into your e-commerce platform and try out different similarity metrics (cosine vs. dot product) to fine-tune your recommendations.

- **Scale your solution:**  
  As your product database grows, consider a specialized vector search library like FAISS or a managed vector database like Pinecone for improved performance.

- **Learn More and Level Up:**  
  Ready to build an AI agent that handles complex, multi-color product queries? Check out our free mini-course **"How to Build an AI Agent to Handle Multi-Color Product Queries"**:

  ![The AI agent looks through the retrieved vector database results and provides a response with recommended shoes](https://d1fiydes8a4qgo.cloudfront.net/blog/2025/january/1/2_agent_response.png "The AI agent looks through the retrieved vector database results and provides a response with recommended shoes")


In this step-by-step mini-course, you'll learn how to create an AI agent in **Jupyter Notebook** using **Python**, **Pinecone**, and **LlamaIndex** - no deployment required. 
You'll get hands-on experience building a smarter query engine that goes beyond basic retrieval.

<div style="text-align: left;">
    <a href="https://norahsakal.gumroad.com/l/mini-course-1" target="_blank" style="
        background-color: #1976d2;
        color: white;
        padding: 10px 20px;
        border-radius: 5px;
        text-decoration: none;
        font-family: 'Arial', sans-serif;
        font-size: 16px;
        font-weight: bold;
        display: inline-block;
        transition: 0.3s ease-in-out;
    " onmouseover="this.style.backgroundColor='#1565c0'" 
       onmouseout="this.style.backgroundColor='#1976d2'">
        Join for free via Gumroad ↗
    </a>
</div>
