# Supercharging Search and Retrieval for E-Commerce with Voyage AI and Pixeltable

**Best-in-class embedding models and rerankers for unstructured product data**

Modern e-commerce platforms deal with massive amounts of unstructured data: product descriptions, specifications, customer reviews, and images. Traditional keyword search often fails to capture the semantic meaning behind customer queries like "comfortable shoes for standing all day" or "gift ideas for a tech enthusiast."

In this tutorial, we'll demonstrate how to build a powerful semantic search system for product data by combining:

- **[Pixeltable](https://pixeltable.com)**: A multimodal data infrastructure that handles embeddings, indexing, and retrieval as declarative table operations for all data types
- **[Voyage AI](https://voyageai.com)**: State-of-the-art embedding models and rerankers purpose-built for search and retrieval

We'll use real Amazon product data from Hugging Face to showcase:

1. 🔍 **Semantic Product Search**: Find products by meaning, not just keywords
2. 🎯 **Reranking for Precision**: Improve search relevance with Voyage AI's reranker
3. 📊 **Incremental Updates**: Add new products without reprocessing the entire catalog

### Prerequisites

- A Voyage AI account with an API key ([get one free](https://www.voyageai.com/))
- Basic familiarity with Python and data operations


## Setup

First, let's install the required packages and configure our environment.


In [None]:
%pip install -qU pixeltable voyageai datasets


In [2]:
import os
import getpass

if 'VOYAGE_API_KEY' not in os.environ:
    os.environ['VOYAGE_API_KEY'] = getpass.getpass('Enter your Voyage AI API key: ')


In [None]:
import pixeltable as pxt
from pixeltable.functions import voyageai
from datasets import load_dataset

# Create a fresh workspace for this demo
pxt.drop_dir('ecommerce_search', force=True)
pxt.create_dir('ecommerce_search')


## Load Amazon Product Data from Hugging Face

We'll use the [Amazon Product Dataset 2020](https://huggingface.co/datasets/calmgoose/amazon-product-data-2020) from Hugging Face, which contains roughly 10,000 real product listings (the train split has 10,002 rows) with rich metadata including:

- Product names and descriptions
- Categories and specifications
- Pricing information
- Product images

Pixeltable can import Hugging Face datasets directly using the `source` parameter.


In [None]:
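# Load a subset of the Amazon product dataset (500 products for demo)
hf_dataset = load_dataset(
    'calmgoose/amazon-product-data-2020',
    split='train[:500]'
)

# Report how many rows were loaded and which columns are available
print(f'Loaded {len(hf_dataset)} products')
print(f'\nAvailable columns: {hf_dataset.column_names}')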


Generating train split: 100%|██████████| 10002/10002 [00:00<00:00, 58263.28 examples/s]

Loaded 500 products

Available columns: ['Uniq Id', 'Product Name', 'Category', 'Upc Ean Code', 'Selling Price', 'Model Number', 'About Product', 'Product Specification', 'Technical Details', 'Shipping Weight', 'Product Dimensions', 'Image', 'Variants', 'Product Url', 'Is Amazon Seller']





In [None]:
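# Preview a sample product - note the Image column contains multiple URLs separated by |
sample = hf_dataset[0]
print(f"Product: {sample['Product Name'][:80]}...")
print(f"Category: {sample['Category']}")
print(f"Price: {sample['Selling Price']}")
print(f"\nAbout: {sample['About Product'][:200]}...")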


Product: DB Longboards CoreFlex Crossbow 41" Bamboo Fiberglass Longboard Complete...
Category: Sports & Outdoors | Outdoor Recreation | Skates, Skateboards & Scooters | Skateboarding | Standard Skateboards & Longboards | Longboards
Price: $237.68

About: Make sure this fits by entering your model number. | RESPONSIVE FLEX: The Crossbow features a bamboo core encased in triaxial fiberglass and HD plastic for a responsive flex pattern that’s second to n...


### Import into Pixeltable

Now let's import this dataset into Pixeltable. Pixeltable automatically maps Hugging Face types to appropriate column types.


In [6]:
# Import the dataset into Pixeltable
products = pxt.create_table(
    'ecommerce_search.products',
    source=hf_dataset
)

products.head(3)


Created table 'products'.
Inserted 500 rows with 0 errors in 0.42 s (1186.38 rows/s)


Uniq_Id,Product_Name,Category,Upc_Ean_Code,Selling_Price,Model_Number,About_Product,Product_Specification,Technical_Details,Shipping_Weight,Product_Dimensions,Image,Variants,Product_Url,Is_Amazon_Seller
4c69b61db1fc16e7013b43fc926e502d,"DB Longboards CoreFlex Crossbow 41"" Bamboo Fiberglass Longboard Complete","Sports & Outdoors | Outdoor Recreation | Skates, Skateboards & Scooters | Skateboarding | Standard Skateboards & Longboards | Longboards",,\$237.68,,"Make sure this fits by entering your model number. | RESPONSIVE FLEX: The Crossbow features a bamboo core encased in triaxial fiberglass and HD plastic for a responsive flex pattern that’s second to none. Pumping & carving have never been so satisfying! Flex 2 is recommended for people 120 to 170 pounds. | COREFLEX TECH: CoreFlex construction is water resistant, impact resistant, scratch resistant and has a flex like you won’t believe. These boards combine fiberglass, epoxy, HD plastic and b ...... spired by the hills, waves, beaches & mountains all around our headquarters in the Northwest | BEST IN THE WORLD: DB was founded out of sheer love of longboarding with a mission to create the best custom longboards in the world, to do it sustainably, & to treat customers & employees like family | BEYOND COMPARE: Try our skateboards & accessories if you've tried similar products by Sector 9, Landyachtz, Arbor, Loaded, Globe, Orangatang, Hawgs, Powell-Peralta, Blood Orange, Caliber or Gullwing",Shipping Weight: 10.7 pounds (View shipping rates and policies)|ASIN: B07KMVJJK7| #474 in Longboards Skateboard,,10.7 
pounds,,https://images-na.ssl-images-amazon.com/images/I/51j3fPQTQkL.jpg|https://images-na.ssl-images-amazon.com/images/I/31hKM3cSoSL.jpg|https://images-na.ssl-images-amazon.com/images/I/51WlHdwghfL.jpg|https://images-na.ssl-images-amazon.com/images/I/51FsyLRBzwL.jpg|https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel.jpg,https://www.amazon.com/DB-Longboards-CoreFlex-Fiberglass-Longboard/dp/B07KMVJJK7|https://www.amazon.com/DB-Longboards-CoreFlex-Fiberglass-Longboard/dp/B07KMN5KS7|https://www.amazon.com/DB-Longboards-CoreFlex-Fiberglass-Longboard/dp/B07KMXK857|https://www.amazon.com/DB-Longboards-CoreFlex-Fiberglass-Longboard/dp/B07KMW2VFR,https://www.amazon.com/DB-Longboards-CoreFlex-Fiberglass-Longboard/dp/B07KMVJJK7,Y
66d49bbed043f5be260fa9f7fbff5957,"Electronic Snap Circuits Mini Kits Classpack, FM Radio, Motion Detector, Music Box (Set of 5)",Toys & Games | Learning & Education | Science Kits & Toys,,\$99.95,55324.0,"Make sure this fits by entering your model number. | Snap circuits mini kits classpack provides basic electronic circuitry activities for students in grades 2-6 | Includes 5 separate mini building kits- an FM radio, a motion detector, music box, space battle sound effects, and a flying saucer | Each kit includes separate components and instructions to build | Each component represents one function in a circuit; components snap together to create working models of everyday electronic devices | Activity guide provides additional projects to teach students how circuitry works",Product Dimensions: 14.7 x 11.1 x 10.2 inches ; 4.06 pounds |Shipping Weight: 4 pounds (View shipping rates and policies)|Domestic Shipping: Item can be shipped within U.S.|International Shipping: This item can be shipped to select countries outside of the U.S. Learn More|ASIN: B008AK6DAS|Item model number: 55324| #3032 in Science Kits & Toys,"The snap circuits mini kits classpack provides basic electric circuitry information for students in grades 2-6. This classpack includes 5 snap-together building kits. Components snap together to create working models of everyday electronic devices. Kits included are an FM radio, a motion detector, a music box, space battle sound effects, and a flying saucer. Each mini kit comes with individual components, and an activity guide which includes instructions and additional project ideas. Each pr ...... ce principles into classroom or homeschool projects. Teachers in pre-K, elementary, and secondary classrooms use science education kits, manipualtives, and products alongside science, technology, engineering, and math (STEM) curriculum to demonstrate STEM concepts and real-world applications through hands-on activities. 
Science education projects include a broad range of activities, such as practical experiments in engineering, aeronautics, robotics, chemistry, physics, biology, and geology.",4 pounds,14.7 x 11.1 x 10.2 inches 4.06 pounds,https://images-na.ssl-images-amazon.com/images/I/51M0KnJxjKL.jpg|https://images-na.ssl-images-amazon.com/images/I/5166GD8OkXL.jpg|https://images-na.ssl-images-amazon.com/images/I/61o5S1VnaNL.jpg|https://images-na.ssl-images-amazon.com/images/I/61t4Q0rPYjL.jpg|https://images-na.ssl-images-amazon.com/images/I/61NASUAyqcL.jpg|https://images-na.ssl-images-amazon.com/images/I/51OMrADdyJL.jpg|https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel.jpg,,https://www.amazon.com/Electronic-Circuits-Classpack-Motion-Detector/dp/B008AK6DAS,Y
2c55cae269aebf53838484b0d7dd931a,"3Doodler Create Flexy 3D Printing Filament Refill Bundle (X5 Pack, Over 1000'. of Extruded Plastics! - Innovate",Toys & Games | Arts & Crafts | Craft Kits,,\$34.99,,"Make sure this fits by entering your model number. | ✅【Smooth 3D drawing experienced the best 3D drawing experience by only using 3Doodler Create Plastics with 3Doodler Create+ and create 3D Printing pen. | ✅【Safe to use】the 3Doodler Create Plastics, conforms to the health requirements of ASTM-D-4236 & require no additional labelling in accordance with the US Consumer Product safety Commission’s Regulations as mandated by Labeling of Hazardous Art Materials Act (LHAMA). | 👍【3Doodler very own ...... g fun】this bundle includes 5 refill filament packs, that's a total of 1043 ft. Of 3D drawing and doodling fun! | 📱【The 3Doodler app】get an interactive experience! The app is packed with dedicated easy to follow stencil section and step by step interactive instructions, receive badges for completed projects and photograph & share YOUR creations directly on social media. The app is fully built on iOS & Android. | ✅【All your favorite colors】this pack includes: green, blue, pink, orange & yellow",ProductDimensions:10.3x3.4x0.8inches|ItemWeight:12.8ounces|ShippingWeight:12.8ounces(Viewshippingratesandpolicies)|ASIN:B07D36747F|Manufacturerrecommendedage:14yearsandup,"show up to 2 reviews by default No longer are you bound by the rigid constraints of hard plastic! Our FLEXY line you can now squeeze, stretch, and twist your creations providing a truly dynamic Doodling experience. Do you want to take your creativity to new levels? Explore the wide variety of FLEXY plastic refill colors for your 3Doodler Create 3D pen! Flexy Plastics are compatible with the 3Doodler V.1, 2.0, and create 3D printing pens. Available in single & mixed color pack containing 25 strands each, and single colors tubes containing 100 strands. 
| 12.8 ounces (View shipping rates and policies)",12.8 ounces,,https://images-na.ssl-images-amazon.com/images/I/513cBC8PqpL.jpg|https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel.jpg,,https://www.amazon.com/3Doodler-Plastic-Innovate-Filament-Refills/dp/B07D36747F,Y
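The preview above shows the Hugging Face column names (for example `Product Name`) appearing in the Pixeltable table as `Product_Name`, so table-side code has to use the underscored form. The sketch below reproduces that renaming under the assumption that every character that is not valid in a Python identifier is replaced by an underscore. `to_identifier` is a hypothetical helper written for illustration; it is not Pixeltable's own implementation.

```python
import re

def to_identifier(name: str) -> str:
    """Hypothetical helper: map a raw column name to an identifier-style name."""
    # Replace every character that is not a letter, digit, or underscore
    cleaned = re.sub(r'\W', '_', name.strip())
    # Identifiers cannot start with a digit, so prefix those with an underscore
    return f'_{cleaned}' if cleaned[:1].isdigit() else cleaned

hf_columns = ['Uniq Id', 'Product Name', 'Selling Price', 'About Product', 'Is Amazon Seller']
renamed = {col: to_identifier(col) for col in hf_columns}

for raw, clean in renamed.items():
    print(f'{raw!r} -> {clean!r}')
```

A mapping like this is handy when rows are built from the original Hugging Face records but inserted into the Pixeltable table.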


## Multi-Column Embedding Strategy

Instead of combining all product fields into a single text, we'll create **separate embedding indexes** for each searchable column. This approach offers several advantages:

- **Flexible weighting**: Combine results from different columns with custom weights
- **Column-specific queries**: Search only product names, or only descriptions
- **Better relevance**: Each embedding captures the semantic meaning of its specific field
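To make the weighting idea concrete, here is a minimal, library-free sketch of how per-column similarities can be blended into one score. It assumes cosine similarity and uses tiny hand-made two-dimensional vectors in place of real embeddings; the function names `cosine` and `combined_score` are illustrative only.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def combined_score(query_vec, field_vecs, weights) -> float:
    """Weighted sum of the query's similarity to each field's vector."""
    return sum(weights[field] * cosine(query_vec, vec) for field, vec in field_vecs.items())

# Toy vectors standing in for the embeddings of one product's three fields
query = [1.0, 0.0]
product = {'name': [1.0, 0.0], 'category': [0.0, 1.0], 'description': [1.0, 1.0]}
weights = {'name': 0.4, 'category': 0.2, 'description': 0.4}

score = combined_score(query, product, weights)
print(round(score, 4))
```

Because each field keeps its own vector, changing the weights re-ranks results without recomputing any embeddings.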


In [None]:
# Define the embedding function once for reuse
# The .using() syntax fixes the model parameter, creating a specialized embedding function
embed_fn = voyageai.embeddings.using(model='voyage-3.5', input_type='document')

# Add embedding indexes for each searchable text column
products.add_embedding_index('Product_Name', embedding=embed_fn)
products.add_embedding_index('Category', embedding=embed_fn)
products.add_embedding_index('About_Product', embedding=embed_fn)



In [None]:
# View the table structure - note the embedding indexes
products


## Semantic Product Search

With embedding indexes on multiple columns, we can now perform semantic searches. Let's create a search function that combines similarity scores from all three columns with configurable weights.


In [None]:
def search_products(query: str, limit: int = 5, 
                     name_weight: float = 0.4, 
                     category_weight: float = 0.2, 
                     description_weight: float = 0.4):
    """
    Search products using weighted similarity across multiple columns.
    
    Args:
        query: Search query
        limit: Number of results to return
        name_weight: Weight for product name similarity
        category_weight: Weight for category similarity  
        description_weight: Weight for description similarity
    """
    # Compute similarity for each column
    # (the Pixeltable table uses underscored column names, e.g. Product_Name)
    name_sim = products['Product_Name'].similarity(query)
    category_sim = products['Category'].similarity(query)
    description_sim = products['About_Product'].similarity(query)
    
    # Combine with weights
    combined_score = (
        name_weight * name_sim + 
        category_weight * category_sim + 
        description_weight * description_sim
    )
    
    return (
        products
        .order_by(combined_score, asc=False)
        .limit(limit)
        .select(
            products['Product_Name'],
            products['Category'],
            products['Selling_Price'],
            name_score=name_sim,
            category_score=category_sim,
            description_score=description_sim,
            combined_score=combined_score
        )
        .collect()
    )


Let's try some realistic e-commerce search scenarios. Notice how the combined score weighs the individual column similarities:


In [None]:
# Search 1: Natural language query
search_products("fun games for kids birthday party")


In [None]:
# Search 2: Conceptual query - semantic search understands meaning, not just keywords
search_products("gift ideas for someone who loves the outdoors")


In [None]:
# Search 3: Adjust weights to prioritize product names over descriptions
search_products("educational toys", name_weight=0.6, category_weight=0.2, description_weight=0.2)


In [None]:
# Search 4: Category-focused search
search_products("skateboard accessories", name_weight=0.3, category_weight=0.5, description_weight=0.2)


## Boost Relevance with Voyage AI Reranking

While semantic search is powerful, we can further improve result quality using Voyage AI's reranker. The two-stage retrieval pattern works like this:

1. **First stage**: Use embeddings to quickly retrieve a broad set of candidates (e.g., top 20)
2. **Second stage**: Use the reranker to precisely score and reorder results

This approach combines the speed of embedding search with the precision of cross-encoder reranking.
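The two-stage pattern can be sketched without any external service. In the example below, a cheap word-overlap count plays the role of the embedding search and a Jaccard ratio plays the role of the reranker. Both scorers are toy stand-ins chosen for illustration, not what Voyage AI or Pixeltable compute.

```python
def word_overlap(query: str, doc: str) -> int:
    """Cheap first-stage score: number of words shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def jaccard(query: str, doc: str) -> float:
    """Stricter second-stage score: shared words divided by all distinct words."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def two_stage_search(query, docs, n_candidates=20, top_k=5):
    # Stage 1: keep a broad candidate set using the cheap score
    candidates = sorted(docs, key=lambda d: word_overlap(query, d), reverse=True)[:n_candidates]
    # Stage 2: reorder only those candidates with the more precise score
    return sorted(candidates, key=lambda d: jaccard(query, d), reverse=True)[:top_k]

docs = [
    'wooden toy blocks for toddlers',
    'toy blocks',
    'toddlers toys and blocks and many other unrelated things for the home',
    'garden hose',
]
results = two_stage_search('toy blocks for toddlers', docs, n_candidates=3, top_k=2)
print(results)
```

The long, loosely related listing passes the first stage on raw overlap but drops out after the second stage, which is the kind of correction a reranker is meant to provide.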


In [None]:
# Create a query function that retrieves candidates for reranking
# Uses combined similarity across all columns
@pxt.query
def get_candidates(query_text: str, n_candidates: int = 20):
    """Retrieve top candidates using combined embedding similarity."""
    # Column names are the underscored versions created on import
    name_sim = products['Product_Name'].similarity(query_text)
    category_sim = products['Category'].similarity(query_text)
    description_sim = products['About_Product'].similarity(query_text)
    combined = 0.4 * name_sim + 0.2 * category_sim + 0.4 * description_sim
    
    return (
        products
        .order_by(combined, asc=False)
        .limit(n_candidates)
        .select(
            products['Product_Name'],
            products['Selling_Price'],
            products['About_Product']
        )
    )


In [None]:
# Create a table to store search queries and their reranked results
searches = pxt.create_table(
    'ecommerce_search.searches',
    {'query': pxt.String}
)

# Add computed column for candidates (retrieves top 15 from embedding search)
searches.add_computed_column(
    candidates=get_candidates(searches.query, n_candidates=15)
)

# Add computed column for reranked results using Voyage AI reranker
# Reranks based on product descriptions for more precise relevance
searches.add_computed_column(
    reranked=voyageai.rerank(
        searches.query,
        searches.candidates['About_Product'],
        model='rerank-2.5',
        top_k=5
    )
)


In [None]:
# Test the reranking pipeline with a complex query
test_query = "durable toys for active toddlers"
searches.insert([{'query': test_query}])


In [None]:
# View the reranked results with relevance scores
searches.select(
    searches.query,
    searches.reranked['results']
).where(searches.query == test_query).collect()


## Compare Embedding Search vs. Reranked Results

Let's compare the quality of results before and after reranking to see the improvement:


In [None]:
comparison_query = "safe and educational baby toys"

# Insert the query for reranking
searches.insert([{'query': comparison_query}])

# Embedding search results (before reranking)
search_products(comparison_query, limit=5)


In [None]:
# Reranked results (after reranking with Voyage AI)
searches.select(
    searches.query,
    searches.reranked['results']
).where(searches.query == comparison_query).collect()


## Incremental Updates: Adding New Products

One of Pixeltable's key strengths is handling incremental updates. When new products are added to the catalog, their embeddings are computed automatically, so there is no need to reprocess the entire dataset.
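The idea behind incremental updates can be illustrated with a small in-memory sketch: only rows that have not been seen before trigger an embedding call. `IncrementalIndex` and `toy_embed` are hypothetical names used for illustration and say nothing about how Pixeltable is implemented internally.

```python
from typing import Callable

class IncrementalIndex:
    """Toy index that embeds each row once and skips rows it already holds."""

    def __init__(self, embed: Callable[[str], list[float]]):
        self._embed = embed
        self.vectors: dict[str, list[float]] = {}
        self.embed_calls = 0  # counts how often the (expensive) embedder runs

    def insert(self, rows: dict[str, str]) -> None:
        for row_id, text in rows.items():
            if row_id in self.vectors:
                continue  # already indexed, nothing to recompute
            self.vectors[row_id] = self._embed(text)
            self.embed_calls += 1

def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: text length and number of spaces."""
    return [float(len(text)), float(text.count(' '))]

index = IncrementalIndex(toy_embed)
index.insert({'p1': 'longboard', 'p2': 'science kit'})    # initial catalog
index.insert({'new_001': 'STEM building kit'})            # only this row is embedded
index.insert({'p1': 'longboard'})                         # no new work
print(index.embed_calls)
```

The embedding cost grows with the number of new rows rather than with the size of the catalog.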


In [None]:
# Add new products - embeddings for all three indexes are computed automatically!
# Keys must match the table's (underscored) column names
new_products = [
    {
        'Uniq_Id': 'new_001',
        'Product_Name': 'Ultimate STEM Building Kit - 500 Pieces',
        'Category': 'Toys & Games | Building Toys | Building Sets',
        'About_Product': 'Educational building set with 500 pieces for ages 6+. Includes gears, motors, and instruction booklet for 50 projects. Develops problem-solving and engineering skills.',
        'Selling_Price': '$49.99'
    },
    {
        'Uniq_Id': 'new_002', 
        'Product_Name': 'Outdoor Adventure Binoculars for Kids',
        'Category': 'Toys & Games | Sports & Outdoor Play | Exploration Toys',
        'About_Product': 'Kid-friendly binoculars with 8x magnification, rubber grip, and neck strap. Perfect for bird watching, camping, and nature exploration. Shockproof design.',
        'Selling_Price': '$24.99'
    }
]

products.insert(new_products)


In [None]:
# Search should now find the new products
search_products("STEM toys for kids who like to build things")


## Working with Product Images

The Amazon dataset includes multiple image URLs per product (separated by `|`). Let's create a view that splits these into individual rows, enabling image-based search and analysis.

We'll create a custom iterator to split the pipe-separated image URLs into individual rows.


In [None]:
from pixeltable.iterators import ComponentIterator
import pixeltable.type_system as ts
from typing import Any, Iterator

class ImageUrlSplitter(ComponentIterator):
    """Iterator that splits pipe-separated image URLs into individual rows."""
    
    def __init__(self, image_urls: str):
        self._urls = []
        if image_urls:
            # Split on | and filter out empty/placeholder URLs
            self._urls = [
                url.strip() for url in image_urls.split('|') 
                if url.strip() and 'transparent-pixel' not in url
            ]
        self._iter = iter(enumerate(self._urls))
    
    def __next__(self) -> dict[str, Any]:
        idx, url = next(self._iter)
        return {'image_idx': idx, 'image_url': url}
    
    def close(self) -> None:
        pass
    
    @classmethod
    def input_schema(cls) -> dict[str, ts.ColumnType]:
        return {'image_urls': ts.StringType(nullable=True)}
    
    @classmethod
    def output_schema(cls, *args, **kwargs) -> tuple[dict[str, ts.ColumnType], list[str]]:
        return {
            'image_idx': ts.IntType(),
            'image_url': ts.StringType()
        }, []


In [None]:
# Create a view that splits image URLs into individual rows
product_images = pxt.create_view(
    'ecommerce_search.product_images',
    products,
    iterator=ImageUrlSplitter._create(image_urls=products['Image'])
)

product_images


In [None]:
# Add a computed column that converts the URL to an actual image
# (pxt.Image is a column type, so the URL string is cast with astype)
product_images.add_computed_column(image=product_images.image_url.astype(pxt.Image))

# View sample images with their products
product_images.select(
    product_images['Product_Name'],
    product_images.image_idx,
    product_images.image
).limit(6).collect()


In [None]:
# Count images per product
product_images.group_by(product_images['Uniq_Id']).select(
    product_images['Product_Name'],
    image_count=product_images.image_idx.count()
).order_by(product_images['Product_Name']).limit(10).collect()


## Summary

In this tutorial, we demonstrated how to build a production-ready semantic search system for e-commerce by combining:

### Pixeltable Capabilities
- **Hugging Face Integration**: Import datasets directly with automatic type mapping
- **Multi-Column Embedding Indexes**: Separate indexes for product name, category, and description
- **Weighted Search**: Combine similarity scores with custom weights per column
- **Custom Iterators**: Split multi-value fields (like images) into individual rows
- **Query Functions**: Reusable retrieval logic for complex pipelines

### Voyage AI Features
- **voyage-3.5**: Best-in-class embedding model for retrieval tasks
- **rerank-2.5**: High-precision reranker for improved relevance

### Key Benefits
1. **Flexible Multi-Column Search**: Weight different product attributes based on query intent
2. **Two-Stage Retrieval**: Combine fast embedding search with precise reranking
3. **Image Handling**: Split and process multiple product images per listing
4. **Incremental Updates**: Add new products without reprocessing

This architecture scales from small catalogs to millions of products and adapts easily to other use cases like document search, support ticket routing, or recommendation systems.


## Learn More

**Pixeltable Resources**
- [Documentation](https://docs.pixeltable.com/)
- [RAG Operations Tutorial](https://docs.pixeltable.com/howto/use-cases/rag-operations)
- [Working with Hugging Face](https://docs.pixeltable.com/howto/providers/working-with-hugging-face)

**Voyage AI Resources**
- [Voyage AI Documentation](https://docs.voyageai.com/)
- [Embedding Models Guide](https://docs.voyageai.com/docs/embeddings)
- [Reranker Guide](https://docs.voyageai.com/docs/reranker)

**Get Started**
- [Sign up for Voyage AI](https://www.voyageai.com/) (free tier available)
- [Install Pixeltable](https://github.com/pixeltable/pixeltable): `pip install pixeltable`
