# Working with Twelve Labs in Pixeltable

Pixeltable's Twelve Labs integration lets you create multimodal embeddings for text, images, audio, and video through the Twelve Labs Embed API.

### Prerequisites
- A Twelve Labs account with an API key (https://playground.twelvelabs.io/)

### Important Notes

- Twelve Labs usage may incur costs based on your plan.
- Audio and video embeddings require a minimum duration of 4 seconds.
- The `marengo3.0` model produces 512-dimensional embeddings.
- Similarity search accepts `string=` for text queries and `image=` for image queries (a PIL `Image` object).

## Setup

First, install the required libraries and configure your API key.

In [None]:
%pip install -qU pixeltable twelvelabs

In [None]:
import os
import getpass

if 'TWELVELABS_API_KEY' not in os.environ:
    os.environ['TWELVELABS_API_KEY'] = getpass.getpass('Enter your Twelve Labs API key: ')

In [None]:
import pixeltable as pxt
from pixeltable.functions.twelvelabs import embed

# Create a fresh directory for our demo
pxt.drop_dir('twelvelabs_demo', force=True)
pxt.create_dir('twelvelabs_demo')

## Text Embeddings

Create text embeddings using the Twelve Labs `embed` function.

In [None]:
# Create a table with text and add an embedding index
text_t = pxt.create_table('twelvelabs_demo.text_search', {'text': pxt.String})

# Add embedding index for text similarity search
text_t.add_embedding_index(
    'text',
    embedding=embed.using(model_name='marengo3.0')
)

In [None]:
# Insert sample documents
documents = [
    "Artificial intelligence is transforming video understanding and analysis.",
    "Machine learning models can detect objects and actions in video streams.",
    "Natural language processing enables understanding of spoken words in audio.",
    "Computer vision techniques analyze visual patterns in images and videos.",
    "Deep learning models generate embeddings that capture semantic meaning.",
    "Multimodal AI systems combine understanding of text, images, and audio.",
]

text_t.insert({'text': doc} for doc in documents)

In [None]:
# Perform semantic similarity search
query = "How do AI systems understand video content?"
sim = text_t.text.similarity(string=query)

text_t.order_by(sim, asc=False).limit(3).select(text_t.text, score=sim).collect()
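
To make the ranking step concrete, here is a minimal, library-free sketch of what a similarity search does conceptually: score every stored vector against the query vector, sort in descending order, and keep the top results. It assumes cosine similarity as the metric and uses made-up 3-dimensional vectors in place of real 512-dimensional embeddings. `cosine_similarity` and `top_k` are hypothetical helper names, not part of Pixeltable or Twelve Labs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, rows, k=3):
    """Rank (text, vector) rows by similarity to query_vec, highest first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up toy vectors standing in for real embeddings (assumption)
rows = [
    ('video understanding', [0.9, 0.1, 0.0]),
    ('spoken words in audio', [0.1, 0.9, 0.0]),
    ('visual patterns', [0.7, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]

results = top_k(query_vec, rows, k=2)
```

In this toy data, the two rows whose vectors point most nearly in the same direction as the query come back first, which mirrors what `order_by(sim, asc=False).limit(...)` does with indexed embeddings.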

## Document Embeddings (PDF, HTML, Markdown)

Create embeddings from documents like PDFs, HTML, and Markdown files. Use the `document_splitter` iterator to chunk documents into searchable text segments.

In [None]:
from pixeltable.functions.document import document_splitter

# Create a table with documents
doc_t = pxt.create_table('twelvelabs_demo.documents', {'document': pxt.Document})

# Create a view that chunks documents into text segments
doc_chunks_v = pxt.create_view(
    'twelvelabs_demo.doc_chunks',
    doc_t,
    iterator=document_splitter(
        document=doc_t.document,
        separators='sentence'  # Split by sentence for fine-grained search
    )
)

# Add embedding index on the text chunks
doc_chunks_v.add_embedding_index(
    'text',
    embedding=embed.using(model_name='marengo3.0')
)
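
As a rough illustration of what sentence-level chunking produces, the sketch below splits a string on sentence-ending punctuation with a regular expression. This is a deliberately naive stand-in and not how `document_splitter` works internally; `naive_sentence_split` is a hypothetical helper and the sample text is made up.

```python
import re

def naive_sentence_split(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

# Made-up sample text (assumption), not taken from the PDF used in this guide
sample = 'Stocks rose on Monday. Bond yields fell! Will the trend continue?'
chunks = naive_sentence_split(sample)
```

Each resulting sentence corresponds to one row of the chunk view, and the embedding index is built on that row's `text` column.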

In [None]:
# Insert a PDF document
pdf_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Argus-Market-Digest-June-2024.pdf'
doc_t.insert([{'document': pdf_url}])

In [None]:
# Search document chunks using text query
sim = doc_chunks_v.text.similarity(string="market performance and stock trends")

doc_chunks_v.order_by(sim, asc=False).limit(3).select(
    doc_chunks_v.text,
    score=sim
).collect()

## Image Embeddings

Create image embeddings and search using both **text queries** (cross-modal search) and **image queries** (image-to-image search).

In [None]:
# Create a table with images and add an embedding index
image_t = pxt.create_table('twelvelabs_demo.image_search', {'image': pxt.Image})

# Add embedding index - supports both image indexing and text-based queries
image_t.add_embedding_index(
    'image',
    embedding=embed.using(model_name='marengo3.0')
)

In [None]:
# Insert sample images
image_urls = [
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000025.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000139.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000042.jpg',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000061.jpg',
]

image_t.insert({'image': url} for url in image_urls)

In [None]:
# Search images using text (cross-modal search)
sim = image_t.image.similarity(string="animals in nature")

image_t.order_by(sim, asc=False).limit(2).select(image_t.image, score=sim).collect()

In [None]:
# Image-based image search (image-to-image similarity)
# Load a query image and find similar images in the table
from PIL import Image
import urllib.request

# Download and load a query image
query_image_url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/7/7a/Huskiesatrest.jpg/640px-Huskiesatrest.jpg'
with urllib.request.urlopen(query_image_url) as response:
    query_image = Image.open(response)
    query_image.load()  # Read the pixel data before the response is closed

# Search for similar images using the image query
sim_image = image_t.image.similarity(image=query_image)

image_t.order_by(sim_image, asc=False).limit(2).select(
    image=image_t.image,
    similarity=sim_image
).collect()

## Text + Image Combined Embeddings

Twelve Labs can also create a single embedding from **both text and an image together**. The result captures the joint semantic representation of the two inputs and is useful for image captioning, visual question answering, and other multimodal applications.

In [None]:
# Create a table for text+image combined embeddings
multimodal_t = pxt.create_table(
    'twelvelabs_demo.text_image_combined',
    {'image': pxt.Image, 'caption': pxt.String}
)

# Add computed column that creates embeddings from BOTH text and image together
# This uses the text_image embedding type in the Twelve Labs API
multimodal_t.add_computed_column(
    combined_embedding=embed(
        multimodal_t.caption,  # text parameter
        multimodal_t.image,    # image parameter (optional)
        model_name='marengo3.0'
    )
)

In [None]:
# Insert images with captions
multimodal_t.insert([
    {
        'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000139.jpg',
        'caption': 'A person standing next to an elephant'
    },
    {
        'image': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000025.jpg',
        'caption': 'A giraffe in a natural habitat'
    }
])

In [None]:
# View the combined embeddings
multimodal_t.select(
    multimodal_t.image,
    multimodal_t.caption,
    multimodal_t.combined_embedding
).collect()

## Audio Embeddings with Embedding Index

Create audio embeddings and search using text queries. Audio segments must be at least 4 seconds long.

Twelve Labs audio embeddings support **embedding options** to focus on different aspects:
- `'audio'`: Focus on the raw audio signal (sounds, music, ambient noise)
- `'transcription'`: Focus on the spoken content (what is said)

In [None]:
from pixeltable.functions.audio import audio_splitter

# Create a base table for audio files
audio_t = pxt.create_table('twelvelabs_demo.audio_files', {'audio': pxt.Audio})

# Insert a sample audio file (JFK speech excerpt)
audio_url = 'https://github.com/pixeltable/pixeltable/raw/release/tests/data/audio/jfk_1961_0109_cityuponahill-excerpt.flac'
audio_t.insert([{'audio': audio_url}])

In [None]:
# Create a view that chunks the audio into searchable segments
# Twelve Labs requires minimum 4 second duration
audio_chunks_v = pxt.create_view(
    'twelvelabs_demo.audio_chunks',
    audio_t,
    iterator=audio_splitter(
        audio_t.audio,
        chunk_duration_sec=5.0,
        min_chunk_duration_sec=4.0
    )
)

# Add embedding index for similarity search
audio_chunks_v.add_embedding_index(
    'audio_chunk',
    embedding=embed.using(model_name='marengo3.0')
)
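
The interaction between `chunk_duration_sec` and `min_chunk_duration_sec` can be sketched in plain Python. The helper below is hypothetical and only models the arithmetic, under the assumption that a trailing chunk shorter than the minimum is dropped; the actual splitter may differ in details such as how chunk boundaries are aligned.

```python
def chunk_bounds(total_sec, chunk_sec=5.0, min_chunk_sec=4.0):
    """Return (start, end) pairs for fixed-length chunks.

    Assumption: a final chunk shorter than min_chunk_sec is discarded.
    """
    bounds = []
    start = 0.0
    while start < total_sec:
        end = min(start + chunk_sec, total_sec)
        if end - start >= min_chunk_sec:
            bounds.append((start, end))
        start += chunk_sec
    return bounds

# A 12-second clip: two full chunks, and the 2-second remainder is dropped
short_tail = chunk_bounds(12.0)

# A 14.5-second clip: the 4.5-second remainder meets the minimum and is kept
kept_tail = chunk_bounds(14.5)
```

With these settings every chunk is between 4 and 5 seconds long, so each one satisfies the 4-second minimum for audio embeddings.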

In [None]:
# Search audio chunks using text query
sim = audio_chunks_v.audio_chunk.similarity(string="speech about government and politics")

audio_chunks_v.order_by(sim, asc=False).limit(3).select(
    audio_chunks_v.audio_chunk,
    score=sim
).collect()

### Audio Embedding Options

Use `embedding_option` to focus on specific aspects of the audio content.

In [None]:
# Create computed column with transcription-focused embedding
audio_chunks_v.add_computed_column(
    transcription_embedding=embed(
        audio_chunks_v.audio_chunk,
        model_name='marengo3.0',
        embedding_option=['transcription']  # Focus on spoken content
    )
)

In [None]:
# View the transcription-focused embeddings
audio_chunks_v.select(
    audio_chunks_v.audio_chunk,
    audio_chunks_v.transcription_embedding
).limit(2).collect()

## Video Embeddings with Embedding Index

Create video embeddings and search using text queries. Video segments must be at least 4 seconds long.

Twelve Labs video embeddings support **embedding options** to focus on different aspects:
- `'visual'`: Focus on visual content (what you see)
- `'audio'`: Focus on audio content (what you hear)
- `'transcription'`: Focus on spoken content (what is said)

In [None]:
from pixeltable.functions.video import video_splitter

# Create a base table for video files
video_t = pxt.create_table('twelvelabs_demo.video_files', {'video': pxt.Video})

# Insert a sample video file
video_url = 'https://github.com/pixeltable/pixeltable/raw/release/tests/data/videos/bangkok_half_res.mp4'
video_t.insert([{'video': video_url}])

In [None]:
# Create a view that segments the video
# Twelve Labs requires minimum 4 second duration
video_segments_v = pxt.create_view(
    'twelvelabs_demo.video_segments',
    video_t,
    iterator=video_splitter(
        video=video_t.video,
        duration=5.0,
        min_segment_duration=4.0
    )
)

# Add embedding index for similarity search
video_segments_v.add_embedding_index(
    'video_segment',
    embedding=embed.using(model_name='marengo3.0')
)

In [None]:
# Search video segments using text query
sim = video_segments_v.video_segment.similarity(string="city traffic and urban scenery")

video_segments_v.order_by(sim, asc=False).limit(3).select(
    video_segments_v.video_segment,
    score=sim
).collect()

### Video Embedding Options

Use `embedding_option` to focus on specific aspects of the video content.

In [None]:
# Create computed column with visual-focused embedding
video_segments_v.add_computed_column(
    visual_embedding=embed(
        video_segments_v.video_segment,
        model_name='marengo3.0',
        embedding_option=['visual']  # Focus on visual content only
    )
)

In [None]:
# View the visual-focused embeddings
video_segments_v.select(
    video_segments_v.video_segment,
    video_segments_v.visual_embedding
).limit(2).collect()
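
Because the accepted option names differ by modality, it can help to validate them before adding a computed column. The sketch below encodes only the options listed in this guide; `VALID_OPTIONS` and `check_embedding_option` are hypothetical names, and the Twelve Labs API remains the source of truth for what is accepted.

```python
# Options as listed in this guide for each modality (not queried from the API)
VALID_OPTIONS = {
    'audio': {'audio', 'transcription'},
    'video': {'visual', 'audio', 'transcription'},
}

def check_embedding_option(modality, options):
    """Raise ValueError if any option is not listed for the modality."""
    allowed = VALID_OPTIONS[modality]
    unknown = sorted(set(options) - allowed)
    if unknown:
        raise ValueError(f'Unsupported {modality} options: {unknown}')
    return list(options)

# 'visual' is listed for video, so this passes through unchanged
video_opts = check_embedding_option('video', ['visual'])

# 'visual' is not listed for audio, so this raises
try:
    check_embedding_option('audio', ['visual'])
    audio_error = None
except ValueError as exc:
    audio_error = str(exc)
```

A check like this surfaces a mistyped option locally, before any request is sent.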

## Available Models

Twelve Labs embedding models include:

| Model | Embedding Dimension | Description |
|-------|---------------------|-------------|
| `marengo3.0` | 512 | Latest multimodal embedding model |
| `Marengo-retrieval-2.7` | 1024 | Retrieval-optimized model |
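
When downstream code needs to know the vector size for a given model, the table above can be kept as a small lookup. A minimal sketch, using only the two models and dimensions listed here; `EMBEDDING_DIMS`, `embedding_dim`, and `has_expected_dim` are hypothetical names.

```python
# Dimensions copied from the table above
EMBEDDING_DIMS = {
    'marengo3.0': 512,
    'Marengo-retrieval-2.7': 1024,
}

def embedding_dim(model_name):
    """Look up the embedding dimension for a listed model name."""
    if model_name not in EMBEDDING_DIMS:
        raise KeyError(f'Unknown model: {model_name}')
    return EMBEDDING_DIMS[model_name]

def has_expected_dim(vector, model_name):
    """Check that a vector's length matches the model's listed dimension."""
    return len(vector) == embedding_dim(model_name)

# A zero vector of the right length, used only to exercise the check
ok = has_expected_dim([0.0] * 512, 'marengo3.0')
```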

## Summary of Twelve Labs Features

| Feature | Description |
|---------|-------------|
| Text embeddings | `embed(text, model_name=...)` |
| Image embeddings | `embed(image, model_name=...)` |
| Text + Image combined | `embed(text, image, model_name=...)` - single joint embedding |
| Audio embeddings | `embed(audio, model_name=..., embedding_option=[...])` |
| Video embeddings | `embed(video, model_name=..., embedding_option=[...])` |
| Document search | `document_splitter` + text embedding on chunks |
| Embedding indices | `add_embedding_index(col, embedding=embed.using(...))` |
| Text similarity search | `col.similarity(string="query")` |
| Image similarity search | `col.similarity(image=pil_image)` |

### Learn More

- [Twelve Labs Documentation](https://docs.twelvelabs.io/)
- [Embed API Guide](https://docs.twelvelabs.io/v1.3/docs/guides/create-embeddings)
- [Pixeltable Embedding Indexes](https://docs.pixeltable.com/platform/embedding-indexes)

If you have any questions, don't hesitate to reach out.