# Build semantic search for text

Create a searchable knowledge base that finds content by meaning, not just keywords.


## Problem

You have a collection of text content (articles, notes, documentation) and need to find relevant items based on meaning.

Keyword search fails when users phrase queries differently from the source text:

| Query | Keyword match | Semantic match |
|-------|---------------|----------------|
| "how to fix bugs" | ❌ No results | ✓ "Debugging best practices" |
| "ML training" | ❌ No results | ✓ "Machine learning model optimization" |
| "deploy to cloud" | ❌ No results | ✓ "Production infrastructure setup" |

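To see why the keyword column comes up empty, here is a minimal sketch of a naive keyword matcher in plain Python. It assumes an all-terms match against the titles only; `tokenize` and `keyword_search` are hypothetical helpers written for this illustration and are not part of Pixeltable.

```python
import re

TITLES = [
    'Debugging best practices',
    'Machine learning model optimization',
    'Production infrastructure setup',
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r'[a-z0-9]+', text.lower()))


def keyword_search(query: str, titles: list[str]) -> list[str]:
    """Return the titles that contain every word of the query (hypothetical helper)."""
    terms = tokenize(query)
    return [t for t in titles if terms <= tokenize(t)]


# None of the example queries shares all of its words with any title.
for q in ['how to fix bugs', 'ML training', 'deploy to cloud']:
    print(q, '->', keyword_search(q, TITLES))

# A query that reuses the title's own wording does match.
print(keyword_search('best practices', TITLES))
```

The matcher only succeeds when the query reuses the source wording, which is the gap that embedding-based search is meant to close.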

## Solution

**What's in this recipe:**
- Create a text table with embeddings
- Search by semantic similarity
- Combine with metadata filters

Add an embedding index to the text column. Pixeltable then generates an embedding for each row and enables similarity search on that column.


### Setup


In [1]:
%pip install -qU pixeltable sentence-transformers


Note: you may need to restart the kernel to use updated packages.


In [2]:
import pixeltable as pxt
from pixeltable.functions.huggingface import sentence_transformer


### Create knowledge base


In [3]:
# Create a fresh directory
pxt.drop_dir('search_demo', force=True)
pxt.create_dir('search_demo')


Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory 'search_demo'.


<pixeltable.catalog.dir.Dir at 0x344053e30>

In [4]:
# Create table with content and metadata
kb = pxt.create_table('search_demo.articles', {
    'title': pxt.String,
    'content': pxt.String,
    'category': pxt.String
})


Created table 'articles'.


In [5]:
# Insert sample content
kb.insert([
    {'title': 'Debugging best practices', 
     'content': 'Use logging, breakpoints, and unit tests to identify and fix issues in your code.',
     'category': 'engineering'},
    {'title': 'Machine learning model optimization', 
     'content': 'Improve training efficiency with batch normalization, learning rate schedules, and early stopping.',
     'category': 'ml'},
    {'title': 'Production infrastructure setup', 
     'content': 'Deploy applications using containers, load balancers, and automated scaling.',
     'category': 'devops'},
    {'title': 'API design principles', 
     'content': 'Create RESTful endpoints with proper versioning, authentication, and error handling.',
     'category': 'engineering'},
])


Inserting rows into `articles`: 4 rows [00:00, 356.61 rows/s]
Inserted 4 rows with 0 errors.


4 rows inserted, 12 values computed.

### Add semantic search

Create an embedding index on the content column:


In [6]:
# Index the content column; embeddings are computed with all-MiniLM-L6-v2
kb.add_embedding_index(
    column='content',
    string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2')
)


### Search by meaning

Find content semantically similar to your query:


In [7]:
# Rank all rows by similarity to the query and keep the top 2
query = "how to fix bugs"
sim = kb.content.similarity(query)

results = (
    kb
    .order_by(sim, asc=False)
    .select(kb.title, kb.content, score=sim)
    .limit(2)
)
results.collect()


| title | content | score |
|-------|---------|-------|
| Debugging best practices | Use logging, breakpoints, and unit tests to identify and fix issues in your code. | 0.391 |
| API design principles | Create RESTful endpoints with proper versioning, authentication, and error handling. | 0.186 |


### Filter by metadata

Combine semantic search with metadata filters:


In [8]:
# Search within a specific category
query = "best practices"
sim = kb.content.similarity(query)

results = (
    kb
    .where(kb.category == 'engineering')  # Filter first
    .order_by(sim, asc=False)
    .select(kb.title, kb.category, score=sim)
    .limit(2)
)
results.collect()


| title | category | score |
|-------|----------|-------|
| API design principles | engineering | 0.238 |
| Debugging best practices | engineering | 0.157 |

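Why the filter belongs before the limit can be sketched in plain Python. The rows and scores below are made-up toy values, not output from the table above, and `top_k` is a hypothetical helper. The sketch illustrates a general property of top-k search and says nothing about how Pixeltable executes the query internally.

```python
# Toy (title, category, score) rows; the scores are invented for illustration.
ROWS = [
    ('Article A', 'ml', 0.62),
    ('Article B', 'devops', 0.48),
    ('Article C', 'engineering', 0.31),
    ('Article D', 'engineering', 0.12),
]


def top_k(rows, k):
    """Return the k rows with the highest score (hypothetical helper)."""
    return sorted(rows, key=lambda r: r[2], reverse=True)[:k]


# Filter first, then rank: both engineering rows survive.
filter_then_rank = top_k([r for r in ROWS if r[1] == 'engineering'], k=2)

# Rank first, then filter: the top 2 overall contain no engineering rows.
rank_then_filter = [r for r in top_k(ROWS, k=2) if r[1] == 'engineering']

print([r[0] for r in filter_then_rank])
print(rank_then_filter)
```

Truncating to the top results before filtering can silently drop rows that the category filter should have kept, so the filter is applied to the full table and the limit comes last.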

## Explanation

**How similarity search works:**

1. Your query is converted to an embedding vector using the same model as the index
2. Pixeltable finds the most similar vectors in the index
3. Results are ranked by cosine similarity, which ranges from -1 to 1 (higher means more similar)

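The ranking step can be sketched in plain Python. This is an illustration under stated assumptions, not Pixeltable's implementation: the 3-dimensional vectors are invented stand-ins for real embeddings, and `cosine_similarity` is a hypothetical helper. It also shows that cosine similarity can be negative, with -1 meaning the vectors point in opposite directions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" with invented values; a real model such as
# all-MiniLM-L6-v2 produces 384-dimensional vectors.
query_vec = [1.0, 0.0, 0.0]
docs = {
    'same direction': [2.0, 0.0, 0.0],
    'partly related': [1.0, 1.0, 0.0],
    'unrelated': [0.0, 0.0, 3.0],
    'opposite': [-1.0, 0.0, 0.0],
}

# Rank document names by similarity to the query, highest first.
ranked = sorted(
    docs,
    key=lambda name: cosine_similarity(query_vec, docs[name]),
    reverse=True,
)
for name in ranked:
    print(f'{name}: {cosine_similarity(query_vec, docs[name]):.3f}')
```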
**Embedding models:**

| Model | Speed | Quality | Use case |
|-------|-------|---------|----------|
| `all-MiniLM-L6-v2` | Fast | Good | General text |
| `all-mpnet-base-v2` | Medium | Better | Higher accuracy |
| OpenAI `text-embedding-3-small` | Network-bound (API call) | Best | Production apps |

**New content is indexed automatically:**

When you insert new rows, Pixeltable generates their embeddings and updates the index. No extra code is needed.


## See also

- [Vector database documentation](https://docs.pixeltable.com/platform/embedding-indexes)
- [Split documents for RAG](./doc-chunk-for-rag.ipynb)
