# Supercharging Search and Retrieval for E-Commerce with Voyage AI

**Best-in-class embedding models and rerankers for unstructured product data**

Modern e-commerce platforms deal with massive amounts of unstructured data: product descriptions, specifications, customer reviews, and images. Traditional keyword search often fails to capture the semantic meaning behind customer queries like "comfortable shoes for standing all day" or "gift ideas for a tech enthusiast."
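
To make that limitation concrete, here is a minimal sketch of naive keyword matching. The `keyword_score` helper and the two product titles are made up for illustration; they are not part of the dataset or of any library used later.

```python
def keyword_score(query: str, text: str) -> int:
    """Count how many distinct query words appear verbatim in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words)

# Hypothetical product titles, for illustration only
catalog = [
    "Cushioned nursing clogs with arch support",
    "Wall calendar for all day planning",
]

query = "comfortable shoes for standing all day"
keyword_scores = {title: keyword_score(query, title) for title in catalog}

# The title sharing the most literal words wins, regardless of meaning
best_keyword_match = max(catalog, key=lambda title: keyword_scores[title])
print(keyword_scores)
print("Best keyword match:", best_keyword_match)
```

The clogs are the relevant product, but they share no words with the query, so word overlap ranks the calendar first. Closing that gap is what the embedding-based search in the rest of this tutorial is for.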

In this tutorial, we'll demonstrate how to build a powerful semantic search system for product data by combining:

- **[Pixeltable](https://pixeltable.com)**: AI data infrastructure that handles embeddings, indexing, and retrieval as declarative table operations
- **[Voyage AI](https://voyageai.com)**: State-of-the-art embedding models and rerankers purpose-built for search and retrieval

We'll use real Amazon product data from Hugging Face to showcase:

1. 🔍 **Semantic Product Search**: Find products by meaning, not just keywords
2. 🎯 **Reranking for Precision**: Improve search relevance with Voyage AI's reranker
3. 📊 **Incremental Updates**: Add new products without reprocessing the entire catalog

### Prerequisites

- A Voyage AI account with an API key ([get one free](https://www.voyageai.com/))
- Basic familiarity with Python and data operations


## Setup

First, let's install the required packages and configure our environment.


In [None]:
%pip install -qU pixeltable voyageai datasets


In [None]:
import os
import getpass

if 'VOYAGE_API_KEY' not in os.environ:
    os.environ['VOYAGE_API_KEY'] = getpass.getpass('Enter your Voyage AI API key: ')


In [None]:
import pixeltable as pxt
from pixeltable.functions import voyageai
from datasets import load_dataset

# Create a fresh workspace for this demo
pxt.drop_dir('ecommerce_search', force=True)
pxt.create_dir('ecommerce_search')


## Load Amazon Product Data from Hugging Face

We'll use the [Amazon Product Dataset 2020](https://huggingface.co/datasets/calmgoose/amazon-product-data-2020) from Hugging Face, which contains 10,000 real product listings with rich metadata including:

- Product names and descriptions
- Categories and specifications
- Pricing information
- Product images

Pixeltable can import Hugging Face datasets directly using the `source` parameter of `pxt.create_table`.


In [None]:
# Load a subset of the Amazon product dataset (500 products for demo)
hf_dataset = load_dataset(
    'calmgoose/amazon-product-data-2020',
    split='train[:500]'
)

print(f"Loaded {len(hf_dataset)} products")
print(f"\nAvailable columns: {hf_dataset.column_names}")


In [None]:
# Preview a sample product
sample = hf_dataset[0]
print(f"Product: {sample['Product Name'][:80]}...")
print(f"Category: {sample['Category']}")
print(f"Price: {sample['Selling Price']}")
print(f"\nAbout: {sample['About Product'][:200]}..." if sample['About Product'] else "No description")


### Import into Pixeltable

Now let's import this dataset into Pixeltable. Pixeltable automatically maps Hugging Face types to appropriate column types.


In [None]:
# Import the dataset into Pixeltable
products = pxt.create_table(
    'ecommerce_search.products',
    source=hf_dataset
)

products.head(3)


## Prepare Product Data for Search

For effective semantic search, we need to combine relevant product information into a single searchable text field. We'll create a computed column that concatenates the product name, category, and description.


In [None]:
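# Create a combined text field for embedding
# Missing (None or empty) fields are skipped; Optional lets the UDF receive None values
from typing import Optional

@pxt.udf
def combine_product_text(name: Optional[str], category: Optional[str], about: Optional[str]) -> str: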
    """Combine product fields into a single searchable text."""
    parts = []
    if name:
        parts.append(f"Product: {name}")
    if category:
        parts.append(f"Category: {category}")
    if about:
        parts.append(f"Description: {about}")
    return " | ".join(parts) if parts else "No information available"

products.add_computed_column(
    search_text=combine_product_text(
        products['Product Name'],
        products['Category'],
        products['About Product']
    )
)


In [None]:
# Preview the combined search text
products.select(
    products['Product Name'],
    products.search_text
).head(2)


## Add Voyage AI Embeddings for Semantic Search

Now comes the magic! We'll use Voyage AI's `voyage-3.5` model, one of the best embedding models for retrieval tasks, to create semantic embeddings of our product data.

Pixeltable's embedding index makes this incredibly simple:
- Embeddings are computed automatically for all existing and new products
- The index enables fast similarity search across the catalog
- Everything updates incrementally as new products are added
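
As a rough picture of what "similarity search" computes, the sketch below ranks a few items by cosine similarity. It assumes cosine similarity as the metric and uses made-up three-dimensional vectors in place of real embeddings; the actual metric depends on how the index is configured, and real embeddings have far more dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (hypothetical values)
toy_embeddings = {
    "building blocks set": [0.9, 0.1, 0.0],
    "camping lantern": [0.1, 0.9, 0.2],
    "wooden puzzle": [0.8, 0.2, 0.1],
}
query_vector = [1.0, 0.0, 0.0]  # pretend embedding of a query about construction toys

# Rank items from most to least similar to the query vector
ranked = sorted(
    toy_embeddings,
    key=lambda name: cosine_similarity(query_vector, toy_embeddings[name]),
    reverse=True,
)
print(ranked)
```

An embedding index performs the same kind of ranking, but over the whole catalog and without comparing the query against every row one by one.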


In [None]:
# Add an embedding index using Voyage AI's voyage-3.5 model
products.add_embedding_index(
    'search_text',
    string_embed=voyageai.embeddings.using(
        model='voyage-3.5',
        input_type='document'
    )
)


## Semantic Product Search

With our embedding index in place, we can now perform semantic searches that understand the meaning behind customer queries rather than just matching keywords.

Let's try some realistic e-commerce search scenarios:


In [None]:
def search_products(query: str, limit: int = 5):
    """Search for products using semantic similarity."""
    sim = products.search_text.similarity(query)
    results = (
        products
        .order_by(sim, asc=False)
        .limit(limit)
        .select(
            products['Product Name'],
            products['Category'],
            products['Selling Price'],
            score=sim
        )
    )
    return results.collect()


In [None]:
# Search 1: Natural language query
print("🔍 Query: 'fun games for kids birthday party'\n")
search_products("fun games for kids birthday party")


In [None]:
# Search 2: Conceptual query that wouldn't work well with keyword search
print("🔍 Query: 'gift ideas for someone who loves the outdoors'\n")
search_products("gift ideas for someone who loves the outdoors")


In [None]:
# Search 3: Problem-based query
print("🔍 Query: 'educational toys that help children learn'\n")
search_products("educational toys that help children learn")


## Boost Relevance with Voyage AI Reranking

While semantic search is powerful, we can further improve result quality using Voyage AI's reranker. The two-stage retrieval pattern works like this:

1. **First stage**: Use embeddings to quickly retrieve a broad set of candidates (e.g., top 20)
2. **Second stage**: Use the reranker to precisely score and reorder results

This approach combines the speed of embedding search with the precision of cross-encoder reranking.
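
The control flow of the two stages can be sketched in plain Python. Everything here is a stand-in: `two_stage_search`, `shared_words`, and `shared_word_density` are hypothetical helpers, and the two toy scorers only play the roles of the embedding search (cheap, broad) and the reranker (more expensive, more precise). They are not how either model scores text.

```python
from typing import Callable

def two_stage_search(
    query: str,
    documents: list[str],
    cheap_score: Callable[[str, str], float],
    precise_score: Callable[[str, str], float],
    n_candidates: int = 20,
    top_k: int = 5,
) -> list[str]:
    """Retrieve broadly with a cheap scorer, then reorder with a precise one."""
    # Stage 1: score every document cheaply and keep a broad candidate set
    candidates = sorted(
        documents, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:n_candidates]
    # Stage 2: apply the expensive scorer to the candidates only
    return sorted(
        candidates, key=lambda doc: precise_score(query, doc), reverse=True
    )[:top_k]

def shared_words(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of distinct words in common."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def shared_word_density(query: str, doc: str) -> float:
    """Toy second-stage scorer: shared words relative to document length."""
    return shared_words(query, doc) / len(doc.split())

# Hypothetical documents, for illustration only
docs = [
    "durable truck toy bundle with stickers and extra wheels and batteries",
    "durable steel truck",
    "toy truck",
    "garden hose",
]
results = two_stage_search(
    "durable truck", docs, shared_words, shared_word_density,
    n_candidates=3, top_k=2,
)
print(results)
```

The second stage only ever sees `n_candidates` documents, which is why an expensive scorer stays affordable even when the catalog is large.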


In [None]:
# Create a query function that retrieves candidates for reranking
@pxt.query
def get_candidates(query_text: str, n_candidates: int = 20):
    """Retrieve top candidates using embedding similarity."""
    sim = products.search_text.similarity(query_text)
    return (
        products
        .order_by(sim, asc=False)
        .limit(n_candidates)
        .select(
            products['Product Name'],
            products['Selling Price'],
            products.search_text
        )
    )


In [None]:
# Create a table to store search queries and their reranked results
searches = pxt.create_table(
    'ecommerce_search.searches',
    {'query': pxt.String}
)

# Add computed column for candidates
searches.add_computed_column(
    candidates=get_candidates(searches.query, n_candidates=15)
)

# Add computed column for reranked results using Voyage AI reranker
searches.add_computed_column(
    reranked=voyageai.rerank(
        searches.query,
        searches.candidates.search_text,
        model='rerank-2.5',
        top_k=5
    )
)


In [None]:
# Test the reranking pipeline with a complex query
test_query = "durable toys for active toddlers"
searches.insert([{'query': test_query}])

print(f"🔍 Query: '{test_query}'\n")
print("="*60)


In [None]:
# View the reranked results with relevance scores
# Name the JSON path explicitly so the output column is reliably called 'results'
result = searches.select(
    searches.query,
    results=searches.reranked['results']
).where(searches.query == test_query).collect()

print("\n🎯 Top 5 Reranked Results:\n")
for i, item in enumerate(result['results'][0][:5], 1):
    print(f"{i}. [Score: {item['relevance_score']:.3f}]")
    print(f"   {item['document'][:100]}...\n")


## Compare Embedding Search vs. Reranked Results

Let's compare the quality of results before and after reranking to see the improvement:


In [None]:
comparison_query = "safe and educational baby toys"

# Insert the query
searches.insert([{'query': comparison_query}])

# Get embedding-only results
print(f"🔍 Query: '{comparison_query}'\n")
print("="*60)
print("\n📊 EMBEDDING SEARCH (Top 5):\n")

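embedding_results = search_products(comparison_query, limit=5)
# collect() returns a Pixeltable result set, so convert to pandas before calling iterrows()
for i, row in embedding_results.to_pandas().iterrows():
    print(f"{i+1}. [{row['score']:.3f}] {row['Product Name'][:70]}...")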


In [None]:
# Get reranked results
print("\n🎯 RERANKED RESULTS (Top 5):\n")

reranked_result = searches.select(
    results=searches.reranked['results']
).where(searches.query == comparison_query).collect()

for i, item in enumerate(reranked_result['results'][0][:5], 1):
    # Extract product name from the search_text
    doc = item['document']
    product_part = doc.split(' | ')[0].replace('Product: ', '')[:70]
    print(f"{i}. [{item['relevance_score']:.3f}] {product_part}...")
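
Eyeballing two printed lists only goes so far. One simple way to summarize a comparison like the one above is to record how far each product moved between the two rankings. The sketch below uses a hypothetical `rank_changes` helper and placeholder product names rather than real search output.

```python
def rank_changes(before: list[str], after: list[str]) -> dict[str, int]:
    """Positions gained (positive) or lost (negative) for items in both lists."""
    before_position = {item: i for i, item in enumerate(before)}
    return {
        item: before_position[item] - i
        for i, item in enumerate(after)
        if item in before_position
    }

# Placeholder rankings; in practice these would be product names from the two searches
embedding_order = ["Product A", "Product B", "Product C", "Product D"]
reranked_order = ["Product C", "Product A", "Product E"]

changes = rank_changes(embedding_order, reranked_order)
overlap = len(set(embedding_order) & set(reranked_order))
print(changes)
print(f"{overlap} products appear in both lists")
```

Items that appear only in the reranked list, such as `Product E` here, came from deeper in the candidate set than the embedding-only top results.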


## Incremental Updates: Adding New Products

One of Pixeltable's key strengths is handling incremental updates. When new products are added to the catalog, embeddings are computed automatically, with no need to reprocess the entire dataset.
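
Conceptually, incremental indexing means the embedding function runs only for rows that have not been seen before. The toy class below illustrates that idea with a call counter. It is a simplified mental model under that assumption, not Pixeltable's implementation, and `ToyIncrementalIndex` and `fake_embed` are hypothetical names.

```python
from typing import Callable

class ToyIncrementalIndex:
    """Stores one vector per row id and embeds each row at most once."""

    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self.embed_fn = embed_fn
        self.vectors: dict[str, list[float]] = {}
        self.embed_calls = 0

    def insert(self, rows: dict[str, str]) -> None:
        for row_id, text in rows.items():
            if row_id in self.vectors:
                continue  # already indexed, nothing to recompute
            self.vectors[row_id] = self.embed_fn(text)
            self.embed_calls += 1

def fake_embed(text: str) -> list[float]:
    """Placeholder for a real embedding call (hypothetical)."""
    return [float(len(text))]

index = ToyIncrementalIndex(fake_embed)
index.insert({"p1": "toy truck", "p2": "wooden puzzle"})
calls_after_first_batch = index.embed_calls

# Adding one product triggers exactly one more embedding call
index.insert({"p3": "kids binoculars"})
calls_after_second_batch = index.embed_calls
print(calls_after_first_batch, calls_after_second_batch)
```

With a paid embedding API, this is the property that keeps the cost of an update proportional to the number of new products rather than to the size of the catalog.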


In [None]:
# Add a few new products manually
new_products = [
    {
        'Uniq Id': 'new_001',
        'Product Name': 'Ultimate STEM Building Kit - 500 Pieces',
        'Category': 'Toys & Games | Building Toys | Building Sets',
        'About Product': 'Educational building set with 500 pieces for ages 6+. Includes gears, motors, and instruction booklet for 50 projects. Develops problem-solving and engineering skills.',
        'Selling Price': '$49.99'
    },
    {
        'Uniq Id': 'new_002', 
        'Product Name': 'Outdoor Adventure Binoculars for Kids',
        'Category': 'Toys & Games | Sports & Outdoor Play | Exploration Toys',
        'About Product': 'Kid-friendly binoculars with 8x magnification, rubber grip, and neck strap. Perfect for bird watching, camping, and nature exploration. Shockproof design.',
        'Selling Price': '$24.99'
    }
]

# Insert new products - embeddings are computed automatically!
products.insert(new_products)

print("✅ New products added and indexed automatically!")


In [None]:
# Search should now find the new products
print("🔍 Query: 'STEM toys for kids who like to build things'\n")
search_products("STEM toys for kids who like to build things")


## Summary

In this tutorial, we demonstrated how to build a production-ready semantic search system for e-commerce by combining:

### Pixeltable Capabilities
- **Hugging Face Integration**: Import datasets directly with automatic type mapping
- **Computed Columns**: Transform and prepare data declaratively
- **Embedding Indexes**: Fast similarity search with automatic updates
- **Query Functions**: Reusable retrieval logic for complex pipelines

### Voyage AI Features
- **voyage-3.5**: Best-in-class embedding model for retrieval tasks
- **rerank-2.5**: High-precision reranker for improved relevance

### Key Benefits
1. **Semantic Understanding**: Find products by meaning, not just keywords
2. **Two-Stage Retrieval**: Combine fast embedding search with precise reranking
3. **Incremental Updates**: Add new products without reprocessing
4. **Declarative Pipeline**: Define once, update automatically

This architecture scales from small catalogs to millions of products and adapts easily to other use cases like document search, support ticket routing, or recommendation systems.


## Learn More

**Pixeltable Resources**
- [Documentation](https://docs.pixeltable.com/)
- [RAG Operations Tutorial](https://docs.pixeltable.com/howto/use-cases/rag-operations)
- [Working with Hugging Face](https://docs.pixeltable.com/howto/providers/working-with-hugging-face)

**Voyage AI Resources**
- [Voyage AI Documentation](https://docs.voyageai.com/)
- [Embedding Models Guide](https://docs.voyageai.com/docs/embeddings)
- [Reranker Guide](https://docs.voyageai.com/docs/reranker)

**Get Started**
- [Sign up for Voyage AI](https://www.voyageai.com/) (free tier available)
- [Install Pixeltable](https://github.com/pixeltable/pixeltable): `pip install pixeltable`
