# Create text embeddings with OpenAI

Generate vector embeddings for text data to enable semantic search and similarity matching.

## Problem

You need to convert text into vector embeddings for:

- Semantic search (find similar documents)
- RAG pipelines (retrieve relevant context)
- Clustering and classification

| Use case | Input | Output |
|----------|-------|--------|
| Document search | Articles | Find related articles |
| Product matching | Descriptions | Find similar products |
| FAQ retrieval | Questions | Match to answers |

## Solution

**What's in this recipe:**

- Generate embeddings with OpenAI's models
- Store embeddings as computed columns
- Use embeddings for similarity queries

You add an embedding column that automatically generates vectors for new rows. The embeddings are cached and only recomputed when the source text changes.
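The caching behavior described above can be pictured with a small stand-alone sketch. This is a conceptual illustration only, not Pixeltable's implementation: `CachedEmbedder` and `fake_embed` are hypothetical names, and the "embedding" is a toy function rather than a model call.

```python
import hashlib


class CachedEmbedder:
    """Toy cache: recompute an embedding only when the source text changes."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # hypothetical embedding function
        self.cache = {}           # row_id -> (text_hash, vector)
        self.calls = 0            # how many times embed_fn actually ran

    def get(self, row_id, text):
        text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        cached = self.cache.get(row_id)
        if cached is not None and cached[0] == text_hash:
            return cached[1]      # source unchanged: reuse the stored vector
        self.calls += 1
        vector = self.embed_fn(text)
        self.cache[row_id] = (text_hash, vector)
        return vector


def fake_embed(text):
    # Stand-in for a real model call; returns a deterministic 2-d "vector".
    return [float(len(text)), float(text.count(" "))]


embedder = CachedEmbedder(fake_embed)
embedder.get(1, "hello world")        # computed
embedder.get(1, "hello world")        # served from the cache
embedder.get(1, "hello there world")  # text changed, so recomputed
print(embedder.calls)
```

The counter shows that the second lookup did not call the embedding function again, which is the property that makes repeated queries cheap.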

### Setup

In [None]:
%pip install -qU pixeltable openai

In [2]:
import os
import getpass

if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')

In [3]:
import pixeltable as pxt
from pixeltable.functions.openai import embeddings

In [4]:
# Create a fresh directory
pxt.drop_dir('embed_demo', force=True)
pxt.create_dir('embed_demo')

Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory 'embed_demo'.


<pixeltable.catalog.dir.Dir at 0x14ee4fcd0>

### Create table with embedding column

In [5]:
# Create table for documents
docs = pxt.create_table(
    'embed_demo.documents',
    {'title': pxt.String, 'content': pxt.String}
)

Created table 'documents'.


In [6]:
# Add embedding column using OpenAI's text-embedding-3-small
docs.add_computed_column(
    embedding=embeddings(docs.content, model='text-embedding-3-small')
)

Added 0 column values with 0 errors.


No rows affected.

### Insert documents

In [7]:
# Insert sample documents
sample_docs = [
    {'title': 'Python Basics', 'content': 'Python is a high-level programming language known for its clear syntax and readability.'},
    {'title': 'Machine Learning', 'content': 'Machine learning is a subset of AI that enables systems to learn from data.'},
    {'title': 'Web Development', 'content': 'Web development involves building websites and web applications using HTML, CSS, and JavaScript.'},
    {'title': 'Data Science', 'content': 'Data science combines statistics, programming, and domain expertise to extract insights from data.'},
    {'title': 'Cloud Computing', 'content': 'Cloud computing provides on-demand computing resources over the internet.'},
]

docs.insert(sample_docs)

Inserting rows into `documents`: 5 rows [00:00, 553.22 rows/s]
Inserted 5 rows with 0 errors.


5 rows inserted, 15 values computed.

In [9]:
# Collect titles together with their full embedding vectors
result = docs.select(docs.title, docs.embedding).collect()
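Embedding vectors are long, so it can help to look at only the first few components when inspecting them. The helper below is a minimal sketch that assumes each embedding is available as a plain sequence of floats; `preview` and the sample values are hypothetical and not part of Pixeltable.

```python
def preview(vector, n=5, ndigits=4):
    """Return the first n components of a vector, rounded for display."""
    return [round(float(x), ndigits) for x in vector[:n]]


# Made-up values standing in for the start of a real embedding
example = [0.0123456, -0.0456789, 0.0789123, 0.0012345, -0.0333333, 0.0999999]
print(preview(example))
```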

### Query by similarity

Find documents similar to a query by creating an embedding index:

In [10]:
# Add embedding index for semantic search
docs.add_embedding_index(
    column="content",
    string_embed=embeddings.using(model="text-embedding-3-small")
)

In [12]:
# Search for similar documents
sim = docs.content.similarity("artificial intelligence applications")
results = (
    docs.where(sim > 0.2)
    .order_by(sim, asc=False)
    .limit(3)
    .select(docs.title, docs.content, sim=sim)
)
results.collect()

| title | content | sim |
|-------|---------|-----|
| Machine Learning | Machine learning is a subset of AI that enables systems to learn from data. | 0.415 |
| Data Science | Data science combines statistics, programming, and domain expertise to extract insights from data. | 0.256 |
| Web Development | Web development involves building websites and web applications using HTML, CSS, and JavaScript. | 0.205 |
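Conceptually, the query above scores every row against the query vector, drops rows at or below the threshold, sorts by score, and keeps the top few. The sketch below reproduces that logic by brute force on toy 2-d vectors. It assumes cosine similarity and uses hypothetical names (`top_k`, `toy_rows`); an embedding index exists precisely to avoid this full scan on large tables.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, rows, threshold=0.2, k=3):
    """Score all rows, keep those above the threshold, return the k best."""
    scored = [(title, cosine(query_vec, vec)) for title, vec in rows]
    kept = [(title, score) for title, score in scored if score > threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Toy data: (title, vector) pairs standing in for rows with embeddings
toy_rows = [
    ("A", [1.0, 0.0]),
    ("B", [0.6, 0.8]),
    ("C", [0.0, 1.0]),
    ("D", [-1.0, 0.0]),
]
hits = top_k([1.0, 0.0], toy_rows)
print([title for title, _ in hits])
```

Only rows "A" and "B" pass the 0.2 threshold here, so fewer than `k` results come back, just as a real query can return fewer rows than its `limit`.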


## Explanation

**OpenAI embedding models:**

| Model | Dimensions | Use case |
|-------|------------|----------|
| `text-embedding-3-small` | 1536 | Cost-effective, good quality |
| `text-embedding-3-large` | 3072 | Higher accuracy |
| `text-embedding-ada-002` | 1536 | Legacy model |
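The dimension count directly affects storage. As a rough back-of-the-envelope sketch, assuming each component is stored as a 4-byte float and ignoring any index or row overhead (both are assumptions, not statements about how Pixeltable stores vectors), the raw vector size can be estimated like this:

```python
def raw_vector_mib(num_rows, dimensions, bytes_per_value=4):
    """Estimate raw vector storage in MiB (assumes 4-byte floats, no overhead)."""
    return num_rows * dimensions * bytes_per_value / (1024 ** 2)


for model, dims in [
    ("text-embedding-3-small", 1536),
    ("text-embedding-3-large", 3072),
]:
    print(model, raw_vector_mib(1_000_000, dims), "MiB per million rows")
```

Doubling the dimensions doubles the estimate, which is one reason to start with the smaller model unless the accuracy gain is needed.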

**Similarity metrics:**

| Metric | Description |
|--------|-------------|
| `cosine` | Cosine similarity; suited to text similarity (default) |
| `ip` | Inner product |
| `l2` | Euclidean distance |
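To make the three metrics concrete, here is a minimal pure-Python sketch that computes each one for a pair of toy vectors. The function names are illustrative and not Pixeltable's API. For unit-length vectors, cosine similarity equals the inner product, and the squared Euclidean distance equals `2 - 2 * cosine`.

```python
import math


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    norms = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norms


# Two unit-length toy vectors
a = [0.6, 0.8]
b = [1.0, 0.0]

print("ip:", inner_product(a, b))
print("cosine:", cosine_similarity(a, b))
print("l2:", l2_distance(a, b))
```

Note that higher is better for `cosine` and `ip`, while lower is better for `l2`, since it measures distance rather than similarity.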

**Key benefits of computed embedding columns:**

- Embeddings are generated automatically on insert
- Results are cached, so subsequent queries trigger no recomputation
- Index enables fast similarity search at scale

## See also

- [Semantic text search](https://docs.pixeltable.com/howto/cookbooks/search/search-semantic-text) - Full semantic search patterns
- [Chunk documents for RAG](https://docs.pixeltable.com/howto/cookbooks/text/doc-chunk-for-rag) - Prepare documents for retrieval