# Working with Twelve Labs in Pixeltable

Pixeltable's Twelve Labs integration lets you create multimodal embeddings for text, images, audio, and video with the Twelve Labs Embed API. These embeddings support semantic search within and across modalities, such as finding images, audio, or video with a text query.

### Prerequisites
- A Twelve Labs account with an API key (https://playground.twelvelabs.io/)

### Important Notes

- Twelve Labs usage may incur costs based on your plan.
- Audio and video embeddings require a minimum duration of 4 seconds.
- The `marengo3.0` model produces 512-dimensional embeddings.

## Setup

First, install the required libraries and configure your API key.

In [26]:
%pip install -qU pixeltable twelvelabs

Note: you may need to restart the kernel to use updated packages.


In [27]:
import os
import getpass

if 'TWELVELABS_API_KEY' not in os.environ:
    os.environ['TWELVELABS_API_KEY'] = getpass.getpass('Enter your Twelve Labs API key: ')

In [28]:
import pixeltable as pxt
from pixeltable.functions.twelvelabs import embed

# Create a fresh directory for our demo
pxt.drop_dir('twelvelabs_demo', force=True)
pxt.create_dir('twelvelabs_demo')

Created directory 'twelvelabs_demo'.


<pixeltable.catalog.dir.Dir at 0x36102f350>

## Text Embeddings with Embedding Index

Create text embeddings and enable semantic search using an embedding index.

In [29]:
# Create a table with text and add an embedding index
text_t = pxt.create_table('twelvelabs_demo.text_search', {'text': pxt.String})

# Add embedding index for text similarity search
text_t.add_embedding_index(
    'text',
    string_embed=embed.using(model_name='marengo3.0')
)

Created table 'text_search'.


In [30]:
# Insert sample documents
documents = [
    "Artificial intelligence is transforming video understanding and analysis.",
    "Machine learning models can detect objects and actions in video streams.",
    "Natural language processing enables understanding of spoken words in audio.",
    "Computer vision techniques analyze visual patterns in images and videos.",
    "Deep learning models generate embeddings that capture semantic meaning.",
    "Multimodal AI systems combine understanding of text, images, and audio.",
]

text_t.insert({'text': doc} for doc in documents)

Inserting rows into `text_search`: 6 rows [00:00, 979.94 rows/s]
Inserted 6 rows with 0 errors.


6 rows inserted, 12 values computed.

In [31]:
# Perform semantic similarity search
query = "How do AI systems understand video content?"
sim = text_t.text.similarity(string=query)

text_t.order_by(sim, asc=False).limit(3).select(text_t.text, score=sim).collect()

text,score
Artificial intelligence is transforming video understanding and analysis.,0.867
Computer vision techniques analyze visual patterns in images and videos.,0.669
Machine learning models can detect objects and actions in video streams.,0.656
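
To make the `score` column more concrete, here is a minimal sketch of similarity ranking in plain Python. It assumes the index compares vectors by cosine similarity, which this tutorial does not state explicitly. The 3-dimensional vectors, `toy_rows`, `toy_query`, and the helper names are all hypothetical stand-ins for real 512-dimensional embeddings, and the code does not show how Pixeltable implements the index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, rows, k):
    """Return the k (text, score) pairs most similar to query_vec, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy vectors standing in for real embeddings
toy_rows = [
    ("video understanding", [0.9, 0.1, 0.0]),
    ("spoken words in audio", [0.1, 0.9, 0.1]),
    ("visual patterns", [0.7, 0.2, 0.3]),
]
toy_query = [1.0, 0.0, 0.0]

for text, score in top_k(toy_query, toy_rows, k=2):
    print(f"{text}: {score:.3f}")
```

Under that assumption, the query above corresponds to `order_by(sim, asc=False).limit(3)`: score every row against the query vector, sort in descending order, and keep the first few.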


## Image Embeddings with Embedding Index

Create image embeddings and search for similar images using text queries (cross-modal search).

In [32]:
# Create a table with images and add an embedding index
image_t = pxt.create_table('twelvelabs_demo.image_search', {'image': pxt.Image})

# Add embedding index using the generic 'embedding' parameter
# The Twelve Labs embed function supports multiple modalities (text, image, etc.)
# so it will automatically support both image indexing and text-based queries
image_t.add_embedding_index(
    'image',
    embedding=embed.using(model_name='marengo3.0')
)

Created table 'image_search'.


In [33]:
# Insert sample images
image_urls = [
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000025.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000139.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000042.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000061.jpg',
]

image_t.insert({'image': url} for url in image_urls)

Inserting rows into `image_search`: 4 rows [00:00, 1270.62 rows/s]
Inserted 4 rows with 0 errors.


4 rows inserted, 12 values computed.

In [None]:
# Search images using text (cross-modal search)
sim = image_t.image.similarity(string="animals in nature")

image_t.order_by(sim, asc=False).limit(2).select(image_t.image, score=sim).collect()

## Audio Embeddings with Embedding Index

Create audio embeddings and enable semantic search over audio content. Audio segments must be at least 4 seconds long, so we use `audio_splitter` to cut the audio into 5-second chunks with a 4-second minimum.

**Note:** This feature was added in PR #990. You can now index audio columns and search them using text queries.
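
The effect of combining a fixed chunk length with a minimum duration can be sketched in plain Python. This illustrates the idea only and is not the splitter's actual implementation: `chunk_bounds` is a hypothetical helper, and it assumes that a trailing chunk shorter than the minimum is dropped rather than merged into the previous chunk.

```python
def chunk_bounds(total_sec, chunk_sec=5.0, min_chunk_sec=4.0):
    """Split [0, total_sec) into fixed-length chunks, dropping a too-short final chunk."""
    bounds = []
    start = 0.0
    while start < total_sec:
        end = min(start + chunk_sec, total_sec)
        # Keep the chunk only if it meets the minimum duration
        if end - start >= min_chunk_sec:
            bounds.append((start, end))
        start += chunk_sec
    return bounds

print(chunk_bounds(12.0))  # the 2-second remainder is too short and is dropped
print(chunk_bounds(14.5))  # the 4.5-second remainder meets the minimum and is kept
```

With a 5-second chunk length and a 4-second minimum, every chunk that reaches the embedding model satisfies the Twelve Labs duration requirement, at the cost of possibly losing a short tail of the recording.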

In [None]:
from pixeltable.functions.audio import audio_splitter

# Create a base table for audio files
audio_t = pxt.create_table('twelvelabs_demo.audio_files', {'audio': pxt.Audio})

# Insert a sample audio file (JFK speech excerpt)
audio_url = 'https://github.com/pixeltable/pixeltable/raw/release/tests/data/audio/jfk_1961_0109_cityuponahill-excerpt.flac'
audio_t.insert([{'audio': audio_url}])

In [None]:
# Create a view that chunks the audio into searchable segments
# Twelve Labs requires minimum 4 second duration
audio_chunks_v = pxt.create_view(
    'twelvelabs_demo.audio_chunks',
    audio_t,
    iterator=audio_splitter(
        audio_t.audio,
        chunk_duration_sec=5.0,
        min_chunk_duration_sec=4.0
    )
)

# Add embedding index using the generic 'embedding' parameter
# The Twelve Labs embed function supports multiple modalities, so it will
# automatically resolve to both audio and text embeddings
audio_chunks_v.add_embedding_index(
    'audio_chunk',
    embedding=embed.using(model_name='marengo3.0')
)

In [None]:
# Search audio chunks using text
sim = audio_chunks_v.audio_chunk.similarity(string="speech about government and politics")

audio_chunks_v.order_by(sim, asc=False).limit(3).select(
    audio_chunks_v.audio_chunk,
    score=sim
).collect()

In [None]:
# You can also retrieve the embedding directly
audio_chunks_v.select(
    audio_chunks_v.audio_chunk,
    embedding=audio_chunks_v.audio_chunk.embedding()
).limit(2).collect()

## Video Embeddings with Embedding Index

Create video embeddings and enable semantic search over video content. Video segments must be at least 4 seconds long, so we use `video_splitter` to cut the video into 5-second segments with a 4-second minimum.

**Note:** This feature was added in PR #990. You can now index video columns and search them using text queries.

In [None]:
from pixeltable.functions.video import video_splitter

# Create a base table for video files
video_t = pxt.create_table('twelvelabs_demo.video_files', {'video': pxt.Video})

# Insert a sample video file
video_url = 'https://github.com/pixeltable/pixeltable/raw/release/tests/data/videos/bangkok_half_res.mp4'
video_t.insert([{'video': video_url}])

In [None]:
# Create a view that segments the video into searchable chunks
# Twelve Labs requires minimum 4 second duration
video_segments_v = pxt.create_view(
    'twelvelabs_demo.video_segments',
    video_t,
    iterator=video_splitter(
        video=video_t.video,
        duration=5.0,
        min_segment_duration=4.0
    )
)

# Add embedding index using the generic 'embedding' parameter
# The Twelve Labs embed function supports multiple modalities, so it will
# automatically resolve to both video and text embeddings
video_segments_v.add_embedding_index(
    'video_segment',
    embedding=embed.using(model_name='marengo3.0')
)

In [None]:
# Search video segments using text
sim = video_segments_v.video_segment.similarity(string="city traffic and urban scenery")

video_segments_v.order_by(sim, asc=False).limit(3).select(
    video_segments_v.video_segment,
    score=sim
).collect()

In [None]:
# Retrieve the embedding directly
video_segments_v.select(
    video_segments_v.video_segment,
    embedding=video_segments_v.video_segment.embedding()
).limit(2).collect()

## Computed Columns with Embeddings

You can also create computed columns that store embeddings directly. This is useful when you need the raw vectors for purposes other than similarity search.

In [None]:
# Create a table with a computed embedding column
docs_t = pxt.create_table('twelvelabs_demo.docs_with_embeddings', {'text': pxt.String})

# Add computed column for embeddings
docs_t.add_computed_column(
    embedding=embed(docs_t.text, model_name='marengo3.0')
)

# Insert data
docs_t.insert([
    {'text': 'Video understanding is a key area of AI research.'},
    {'text': 'Audio analysis helps in transcription and content moderation.'},
])

# View the embeddings
docs_t.select(docs_t.text, docs_t.embedding).collect()
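
As one example of a use beyond similarity search, stored vectors can be compared pairwise to flag near-duplicate rows. The sketch below is plain Python and makes several assumptions: the vectors have already been copied out of the table into ordinary lists, `toy_vectors` is hypothetical 3-dimensional data, and the 0.95 threshold is an arbitrary illustrative value rather than a recommended setting.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def near_duplicates(vectors, threshold=0.95):
    """Return index pairs (i, j), i < j, whose cosine similarity is at least threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(vectors), 2)
        if cosine(a, b) >= threshold
    ]

# Hypothetical vectors: rows 0 and 1 point in almost the same direction
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
]
print(near_duplicates(toy_vectors))
```

The pairwise loop is quadratic in the number of rows, so it suits small tables; larger collections would normally go through an embedding index instead.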

## Available Models

Twelve Labs embedding models include:

| Model | Embedding Dimension | Description |
|-------|-------------------|-------------|
| `marengo3.0` | 512 | Latest multimodal embedding model |
| `Marengo-retrieval-2.7` | 1024 | Retrieval-optimized model |
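
Because the two models produce vectors of different lengths, a quick length check can catch a mismatch between the model you configured and the vectors you are handling. This is a minimal sketch: `MODEL_DIMS` and `check_dimension` are hypothetical names, and the dimensions are copied from the table above rather than queried from the API.

```python
# Dimensions copied from the table above
MODEL_DIMS = {
    "marengo3.0": 512,
    "Marengo-retrieval-2.7": 1024,
}

def check_dimension(model_name, vector):
    """Raise ValueError if the vector length differs from the model's listed dimension."""
    expected = MODEL_DIMS[model_name]
    if len(vector) != expected:
        raise ValueError(f"{model_name}: expected {expected} values, got {len(vector)}")
    return True

check_dimension("marengo3.0", [0.0] * 512)  # passes silently
```

Vectors from the two models are not interchangeable, so a table or index built with one model should be rebuilt rather than reused if you switch to the other.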

### Learn More

- [Twelve Labs Documentation](https://docs.twelvelabs.io/)
- [Embed API Guide](https://docs.twelvelabs.io/v1.3/docs/guides/create-embeddings)
- [Pixeltable Embedding Indexes](https://docs.pixeltable.com/platform/embedding-indexes)

If you have any questions, don't hesitate to reach out.