# Working with Twelve Labs in Pixeltable

Pixeltable's Twelve Labs integration enables you to create powerful multimodal embeddings for text, images, audio, and video using the Twelve Labs Embed API.

### Prerequisites
- A Twelve Labs account with an API key (https://playground.twelvelabs.io/)

### Important Notes

- Twelve Labs usage may incur costs based on your plan.
- Audio and video embeddings require a minimum duration of 4 seconds.

One of the most powerful features of Twelve Labs' model is true **cross-modal search**. You can query a video index using **any modality**: text, image, audio, or video. The model projects all modalities into the same semantic embedding space, enabling searches like:

- **Text to Video**: Find videos matching a text description
- **Image to Video**: Find videos similar to an image
- **Audio to Video**: Find videos with similar audio/speech
- **Video to Video**: Find videos similar to another video

Let's get started!

## Setup

First, install the required libraries and configure your API key.

In [None]:
%pip install -qU pixeltable twelvelabs

In [2]:
import os
import getpass

if 'TWELVELABS_API_KEY' not in os.environ:
    os.environ['TWELVELABS_API_KEY'] = getpass.getpass('Enter your Twelve Labs API key: ')

In [None]:
import pixeltable as pxt
from pixeltable.functions.twelvelabs import embed

# Create a fresh directory for our demo
pxt.drop_dir('twelvelabs_demo', force=True)
pxt.create_dir('twelvelabs_demo')

## Text Embeddings

Create text embeddings using the Twelve Labs embed function.

In [4]:
# Create a table with text and add an embedding index
text_t = pxt.create_table('twelvelabs_demo.text_search', {'text': pxt.String})

# Add embedding index for text similarity search
text_t.add_embedding_index(
    'text',
    embedding=embed.using(model_name='marengo3.0')
)

Created table 'text_search'.


In [5]:
# Insert sample documents
documents = [
    "Artificial intelligence is transforming video understanding and analysis.",
    "Machine learning models can detect objects and actions in video streams.",
    "Natural language processing enables understanding of spoken words in audio.",
    "Computer vision techniques analyze visual patterns in images and videos.",
    "Deep learning models generate embeddings that capture semantic meaning.",
    "Multimodal AI systems combine understanding of text, images, and audio.",
]

text_t.insert({'text': doc} for doc in documents)

Inserted 6 rows with 0 errors in 0.68 s (8.87 rows/s)


6 rows inserted.

In [6]:
# Perform semantic similarity search
query = "How do AI systems understand video content?"
sim = text_t.text.similarity(string=query)

text_t.order_by(sim, asc=False).limit(3).select(text_t.text, score=sim).collect()

text,score
Artificial intelligence is transforming video understanding and analysis.,0.868
Computer vision techniques analyze visual patterns in images and videos.,0.668
Machine learning models can detect objects and actions in video streams.,0.657


## Document Embeddings (PDF, HTML, Markdown)

Create embeddings from documents like PDFs, HTML, and Markdown files. Use the `document_splitter` iterator to chunk documents into searchable text segments.

In [7]:
from pixeltable.functions.document import document_splitter

# Create a table with documents
doc_t = pxt.create_table('twelvelabs_demo.documents', {'document': pxt.Document})

# Create a view that chunks documents into text segments
doc_chunks_v = pxt.create_view(
    'twelvelabs_demo.doc_chunks',
    doc_t,
    iterator=document_splitter(
        document=doc_t.document,
        separators='sentence'  # Split by sentence for fine-grained search
    )
)

# Add embedding index on the text chunks
doc_chunks_v.add_embedding_index(
    'text',
    embedding=embed.using(model_name='marengo3.0')
)

In [8]:
# Insert a PDF document
pdf_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Argus-Market-Digest-June-2024.pdf'
doc_t.insert([{'document': pdf_url}])

Inserted 201 rows with 0 errors in 21.50 s (9.35 rows/s)


201 rows inserted.

In [9]:
# Search document chunks using text query
sim = doc_chunks_v.text.similarity(string="market performance and stock trends")

doc_chunks_v.order_by(sim, asc=False).limit(3).select(
    doc_chunks_v.text,
    score=sim
).collect()

text,score
"Growth markets (17% of total) were down 1% in GAAP and up 6% in local currency, reflecting solid demand led by Resources and Banking & Capital Markets.",0.607
"Stock investors did just that on Thursday, pushing the Dow Jones Industrial Average higher by 0.77% but the Nasdaq Composite and S&P 500 lower by 0.79% and 0.25%, respectively.",0.59
"Accenture's share price, which rose in 2023 on competitive share gains and AI interest, has struggled in 2024 with the consulting market.",0.565


## Image Embeddings

Create image embeddings and search using both **text queries** (cross-modal search) and **image queries** (image-to-image search).

In [10]:
# Create a table with images and add an embedding index
image_t = pxt.create_table('twelvelabs_demo.image_search', {'image': pxt.Image})

# Add embedding index - supports both image indexing and text-based queries
image_t.add_embedding_index(
    'image',
    embedding=embed.using(model_name='marengo3.0')
)

Created table 'image_search'.


In [11]:
# Insert sample images
image_urls = [
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000025.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000139.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000042.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000061.jpg',
]

image_t.insert({'image': url} for url in image_urls)

Inserted 4 rows with 0 errors in 1.37 s (2.91 rows/s)


4 rows inserted.

In [12]:
# Search images using text (cross-modal search)
sim = image_t.image.similarity(string="animals in nature")

image_t.order_by(sim, asc=False).limit(2).select(image_t.image, score=sim).collect()

image,score
,0.072
,0.049


In [13]:
# Image-based image search (image-to-image similarity)
# Load a query image and find similar images in the table

# Download and load a query image
query_image_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000001.jpg'
sim_image = image_t.image.similarity(query_image_url)

image_t.order_by(sim_image, asc=False).limit(2).select(
    image=image_t.image, 
    similarity=sim_image
).collect()

image,similarity
,0.032
,0.011


## Text + Image Combined Embeddings

A unique feature of Twelve Labs is the ability to create embeddings from **both text and image together**. This captures the joint semantic representation of multimodal content and is useful for image captioning, visual question answering, and other multimodal applications.

In [14]:
# Create a table for text+image combined embeddings
multimodal_t = pxt.create_table(
    'twelvelabs_demo.text_image_combined',
    {'image': pxt.Image, 'caption': pxt.String}
)

# Add computed column that creates embeddings from BOTH text and image together
# This uses the text_image embedding type in the Twelve Labs API
multimodal_t.add_computed_column(
    combined_embedding=embed(
        multimodal_t.caption,  # text parameter
        multimodal_t.image,    # image parameter (optional)
        model_name='marengo3.0'
    )
)

Created table 'text_image_combined'.
Added 0 column values with 0 errors in 0.00 s


No rows affected.

In [15]:
# Insert images with captions
multimodal_t.insert([
    {
        'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000139.jpg',
        'caption': 'A person standing next to an elephant'
    },
    {
        'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000025.jpg',
        'caption': 'A giraffe in a natural habitat'
    }
])

Inserted 2 rows with 0 errors in 0.43 s (4.67 rows/s)


2 rows inserted.

In [16]:
# View the combined embeddings
multimodal_t.select(
    multimodal_t.image,
    multimodal_t.caption,
    multimodal_t.combined_embedding
).collect()

image,caption,combined_embedding
,A person standing next to an elephant,[-0.052 -0.08 0.008 -0.043 0.041 -0.008 ... -0.028 -0.034 -0.025 0.037 0.01 -0.048]
,A giraffe in a natural habitat,[-0.03 -0.042 -0.004 0.016 0. 0.066 ... -0.056 -0.031 0.047 -0.028 0.101 -0.029]


## Audio Embeddings with Embedding Index

Create audio embeddings and search using text queries. Audio segments must be at least 4 seconds long.

Twelve Labs audio embeddings support **embedding options** to focus on different aspects:
- `'audio'`: Focus on the raw audio signal (sounds, music, ambient noise)
- `'transcription'`: Focus on the spoken content (what is said)

In [17]:
from pixeltable.functions.audio import audio_splitter

# Create a base table for audio files
audio_t = pxt.create_table('twelvelabs_demo.audio_files', {'audio': pxt.Audio})

# Insert a sample audio file (JFK speech excerpt)
audio_url = 'https://github.com/pixeltable/pixeltable/raw/release/tests/data/audio/jfk_1961_0109_cityuponahill-excerpt.flac'
audio_t.insert([{'audio': audio_url}])

Inserted 1 row with 0 errors in 0.68 s (1.48 rows/s)


1 row inserted.

In [18]:
# Create a view that chunks the audio into searchable segments
# Twelve Labs requires minimum 4 second duration
audio_chunks_v = pxt.create_view(
    'twelvelabs_demo.audio_chunks',
    audio_t,
    iterator=audio_splitter(
        audio_t.audio,
        chunk_duration_sec=5.0,
        min_chunk_duration_sec=4.0
    )
)

# Add embedding index for similarity search
audio_chunks_v.add_embedding_index(
    'audio_chunk',
    embedding=embed.using(model_name='marengo3.0')
)

### Search audio chunks using text query

In [19]:
sim = audio_chunks_v.audio_chunk.similarity(string="speech about government and politics")

audio_chunks_v.order_by(sim, asc=False).limit(3).select(
    audio_chunks_v.audio_chunk,
    score=sim
).collect()

audio_chunk,score
,0.1
,0.095
,0.091


### Audio-based audio search (audio-to-audio similarity)

In [20]:
# You can also search using an audio file as the query
query_audio_path = '/tmp/query_audio.flac'

# Download a sample audio file to use as query
import urllib.request
urllib.request.urlretrieve(
    'https://github.com/pixeltable/pixeltable/raw/release/tests/data/audio/jfk_1961_0109_cityuponahill-excerpt.flac',
    query_audio_path
)

# Search for similar audio segments using the audio query
sim_audio = audio_chunks_v.audio_chunk.similarity(audio=query_audio_path)

audio_chunks_v.order_by(sim_audio, asc=False).limit(2).select(
    audio_chunks_v.audio_chunk,
    similarity=sim_audio
).collect()

audio_chunk,similarity
,0.946
,0.707


### Audio Embedding Options

Use `embedding_option` to focus on specific aspects of the audio content.

In [21]:
# Create computed column with transcription-focused embedding
audio_chunks_v.add_computed_column(
    transcription_embedding=embed(
        audio_chunks_v.audio_chunk,
        model_name='marengo3.0',
        embedding_option=['transcription']  # Focus on spoken content
    )
)

Added 12 column values with 0 errors in 2.27 s (5.28 rows/s)


12 rows updated.

In [22]:
# View the transcription-focused embeddings
audio_chunks_v.select(
    audio_chunks_v.audio_chunk,
    audio_chunks_v.transcription_embedding
).limit(2).collect()

audio_chunk,transcription_embedding
,[-0.085 -0.088 -0.118 -0.015 0.049 -0.048 ... -0.018 -0.047 -0.002 0.003 0.011 -0.013]
,[ 0.003 -0.018 -0.002 0.038 -0.001 0.039 ... 0.015 -0.029 0.009 0.053 0.072 -0.027]


## Video Embeddings with Embedding Index

Create video embeddings and search using all modalities. Video segments must be at least 4 seconds long.

Twelve Labs video embeddings support **embedding options** to focus on different aspects:
- `'visual'`: Focus on visual content (what you see)
- `'audio'`: Focus on audio content (what you hear)
- `'transcription'`: Focus on spoken content (what is said)

In [23]:
from pixeltable.functions.video import video_splitter

# Create a base table for video files
video_t = pxt.create_table('twelvelabs_demo.video_files', {'video': pxt.Video})

# Insert a sample video file
video_url = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness.mp4'
video_t.insert([{'video': video_url}])

Inserted 1 row with 0 errors in 0.57 s (1.75 rows/s)


1 row inserted.

In [24]:
# Create a view that segments the video
# Twelve Labs requires minimum 4 second duration
video_segments_v = pxt.create_view(
    'twelvelabs_demo.video_segments',
    video_t,
    iterator=video_splitter(
        video=video_t.video,
        duration=5.0,
        min_segment_duration=4.0
    )
)

# Add embedding index for similarity search
video_segments_v.add_embedding_index(
    'video_segment',
    embedding=embed.using(model_name='marengo3.0')
)

In [25]:
# Search video segments using text query
sim = video_segments_v.video_segment.similarity(string="a man with grey hair")

video_segments_v.order_by(sim, asc=False).limit(3).select(
    video_segments_v.video_segment,
    score=sim
).collect()

video_segment,score
,0.319
,0.303
,0.254


In [None]:
# Search videos using an image query
from PIL import Image
import requests
from io import BytesIO

# Load a query image
query_image_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000001.jpg'
response = requests.get(query_image_url)
query_img = Image.open(BytesIO(response.content))

# Search videos using the image
sim_image = video_segments_v.video_segment.similarity(image=query_img)

video_segments_v.order_by(sim_image, asc=False).limit(2).select(
    video_segments_v.video_segment,
    score=sim_image
).collect()

video_segment,score
,0.553
,0.545


In [None]:
# Search videos using a video
query_video_path = '/tmp/query_video.mp4'

# Download a sample video file to use as query
import urllib.request
urllib.request.urlretrieve(
    'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/The-Pursuit-of-Happiness.mp4',
    query_video_path
)

# Search for similar video segments using the video query
sim_video = video_segments_v.video_segment.similarity(video=query_video_path)

video_segments_v.order_by(sim_video, asc=False).limit(2).select(
    video_segments_v.video_segment,
    similarity=sim_video
).collect()

video_segment,similarity
,0.905
,0.579


In [28]:
# Download a sample audio file
audio_query_path = '/tmp/cross_modal_audio_query.flac'
urllib.request.urlretrieve(
    'https://github.com/pixeltable/pixeltable/raw/release/tests/data/audio/jfk_1961_0109_cityuponahill-excerpt.flac',
    audio_query_path
)

# Search videos using the audio
sim_audio = video_segments_v.video_segment.similarity(audio=audio_query_path)

video_segments_v.order_by(sim_audio, asc=False).limit(2).select(
    video_segments_v.video_segment,
    score=sim_audio
).collect()

video_segment,score
,0.046
,0.009


### Video Embedding Options

Use `embedding_option` to focus on specific aspects of the video content.

In [29]:
# Create computed column with visual-focused embedding
video_segments_v.add_computed_column(
    visual_embedding=embed(
        video_segments_v.video_segment,
        model_name='marengo3.0',
        embedding_option=['visual']  # Focus on visual content only
    )
)

Added 51 column values with 0 errors in 14.62 s (3.49 rows/s)


51 rows updated.

In [30]:
# View the visual-focused embeddings
video_segments_v.select(
    video_segments_v.video_segment,
    video_segments_v.visual_embedding
).limit(2).collect()

video_segment,visual_embedding
,[ 0.034 0.071 -0.038 0.062 -0.01 0.061 ... 0.047 -0.069 -0.009 -0.021 0.036 0.002]
,[ 0.03 0.047 -0.034 0.068 -0.001 0.009 ... 0.041 -0.029 -0.035 0.039 0.031 -0.014]


## Summary

### Twelve Labs Embed API

| Feature | Syntax | Description |
|---------|--------|-------------|
| Text embeddings | `embed(text, model_name=...)` | Embed text strings |
| Image embeddings | `embed(image, model_name=...)` | Embed images |
| Text + Image | `embed(text, image, model_name=...)` | Joint text+image embedding (unique to Twelve Labs) |
| Audio embeddings | `embed(audio, model_name=..., embedding_option=[...])` | Embed audio with options: `'audio'`, `'transcription'` |
| Video embeddings | `embed(video, model_name=..., embedding_option=[...])` | Embed video with options: `'visual'`, `'audio'`, `'transcription'` |

**Documentation:** [Twelve Labs Embed API](https://docs.twelvelabs.io/v1.3/docs/guides/create-embeddings)

### Pixeltable Features Used

| Feature | Syntax | Documentation |
|---------|--------|---------------|
| Tables | `pxt.create_table(name, schema)` | [Tables Guide](https://docs.pixeltable.com/platform/tables) |
| Views | `pxt.create_view(name, base, iterator=...)` | [Views Guide](https://docs.pixeltable.com/platform/views) |
| Computed Columns | `table.add_computed_column(col=expr)` | [Computed Columns](https://docs.pixeltable.com/platform/computed-columns) |
| Embedding Indices | `table.add_embedding_index(col, embedding=...)` | [Embedding Indexes](https://docs.pixeltable.com/platform/embedding-indexes) |
| Similarity Search | `col.similarity(string=... / image=... / audio=... / video=...)` | [Embedding Indexes](https://docs.pixeltable.com/platform/embedding-indexes) |
| Document Splitter | `document_splitter(document, separators=...)` | [Iterators](https://docs.pixeltable.com/platform/iterators) |
| Audio Splitter | `audio_splitter(audio, chunk_duration_sec=...)` | [Iterators](https://docs.pixeltable.com/platform/iterators) |
| Video Splitter | `video_splitter(video, duration=...)` | [Iterators](https://docs.pixeltable.com/platform/iterators) |

**Documentation:** [Pixeltable SDK Reference](https://docs.pixeltable.com/sdk/latest/pixeltable) | [GitHub](https://github.com/pixeltable/pixeltable)

### Learn More

- [Twelve Labs Documentation](https://docs.twelvelabs.io/)
- [Embed API Guide](https://docs.twelvelabs.io/v1.3/docs/guides/create-embeddings)
- [Pixeltable Embedding Indexes](https://docs.pixeltable.com/platform/embedding-indexes)

If you have any questions, don't hesitate to reach out.