# RAG Operations in Pixeltable
In this tutorial, we'll explore Pixeltable's flexible handling of RAG operations on unstructured text. In a traditional AI workflow, such operations might be implemented as a Python script that runs on a periodic schedule or in response to certain events. In Pixeltable, as with everything else, they are implemented as persistent table operations that update incrementally as new data becomes available. In our tutorial workflow, we'll chunk PDF documents in various ways with a document splitter, then apply several kinds of embeddings to the chunks.

## Set Up the Table Structure

We start by installing the necessary dependencies, creating a fresh Pixeltable directory `rag_ops_demo` (dropping any existing directory of that name first), and setting up the table structure for our new workflow.

In [None]:
%pip install -qU pixeltable sentence-transformers spacy tiktoken
!python -m spacy download en_core_web_sm -q

In [None]:
import pixeltable as pxt

# Ensure a clean slate for the demo
pxt.drop_dir('rag_ops_demo', force=True)
# Create the Pixeltable workspace
pxt.create_dir('rag_ops_demo')

## Creating Tables and Views

Now we'll create the tables that represent our workflow, starting with a table to hold references to source documents. The table contains a single column `source_doc` whose elements have type `pxt.Document`, representing a general document instance. In this tutorial, we'll be working with PDF documents, but Pixeltable supports a range of other document types, such as Markdown and HTML.

In [3]:
docs = pxt.create_table(
    'rag_ops_demo.docs',
    {'source_doc': pxt.Document}
)

Created table 'docs'.


If we take a peek at the `docs` table, we see its very simple structure.

In [4]:
docs

table 'rag_ops_demo.docs'

Column Name,Type,Computed With
source_doc,Document,


Next we create a view to represent chunks of our PDF documents. A Pixeltable view is a virtual table, which is dynamically derived from a source table by applying a transformation and/or selecting a subset of data. In this case, our view represents a one-to-many transformation from source documents into individual sentences. This is achieved using Pixeltable's built-in `document_splitter` class.

Note that the `docs` table is currently empty, so creating this view doesn't actually *do* anything yet: it simply defines an operation that we want Pixeltable to execute when it sees new data.

In [None]:
from pixeltable.functions.document import document_splitter

sentences = pxt.create_view(
    'rag_ops_demo.sentences',  # Name of the view
    docs,  # Table from which the view is derived
    iterator=document_splitter(
        docs.source_doc,
        separators='sentence',  # Chunk docs into sentences
        metadata='title,heading,sourceline'
    )
)

Let's take a peek at the new `sentences` view.

In [6]:
sentences

view 'rag_ops_demo.sentences' (of 'rag_ops_demo.docs')

Column Name,Type,Computed With
pos,Required[Int],
text,Required[String],
title,String,
heading,Json,
sourceline,Int,
source_doc,Document,


We see that `sentences` inherits the `source_doc` column from `docs`, together with some new fields:
- `pos`: The position in the source document where the sentence appears.
- `text`: The text of the sentence.
- `title`, `heading`, and `sourceline`: The metadata we requested when we set up the view.
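
To make the one-to-many idea concrete, here is a minimal pure-Python sketch of a splitter that turns one document's text into several rows, each with a `pos` and a `text` field. It is an illustration only, not Pixeltable's implementation: the function name `split_into_sentences` is hypothetical, the sentence rule is deliberately naive, and treating `pos` as a 0-based ordinal is an assumption.

```python
import re
from typing import Iterator


def split_into_sentences(doc_text: str) -> Iterator[dict]:
    """Yield one row per sentence (hypothetical helper, not part of Pixeltable)."""
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    parts = re.split(r'(?<=[.!?])\s+', doc_text.strip())
    pos = 0  # assumed: 0-based ordinal of the sentence within its document
    for part in parts:
        if part:
            yield {'pos': pos, 'text': part}
            pos += 1


rows = list(split_into_sentences(
    'Stocks rose on Thursday. The DJIA gained 299.90 points. Will it last?'
))
for row in rows:
    print(row)
```

Note that the decimal point in `299.90` does not end a sentence here, because it is not followed by whitespace. A production-grade splitter handles many more cases than this rule does.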

## Data Ingestion

Ok, now it's time to insert some data into our workflow. A document in Pixeltable is just a URL; the following command inserts a single row into the `docs` table with the `source_doc` field set to the specified URL:

In [7]:
docs.insert([{'source_doc': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Argus-Market-Digest-June-2024.pdf'}])

Inserting rows into `docs`: 1 rows [00:00, 292.76 rows/s]
Inserting rows into `sentences`: 217 rows [00:00, 42910.00 rows/s]
Inserted 218 rows with 0 errors.


218 rows inserted, 2 values computed.

We can see that two things happened. First, a single row was inserted into `docs`, containing the URL representing our source PDF. Then, the view `sentences` was incrementally updated by applying the `document_splitter` according to the definition of the view. This illustrates an important principle in Pixeltable: by default, anytime Pixeltable sees new data, the update is incrementally propagated to any downstream views or computed columns.
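
The propagation principle can be sketched with a toy model in plain Python. The `ToyTable` and `ToyView` classes below are hypothetical stand-ins written for this illustration; they say nothing about how Pixeltable is implemented internally. The point they show is that each insert hands only the new rows to downstream views, so no base row is processed twice.

```python
class ToyTable:
    """Hypothetical stand-in for a base table that has derived views."""

    def __init__(self):
        self.rows = []
        self.views = []

    def insert(self, new_rows):
        new_rows = list(new_rows)
        self.rows.extend(new_rows)
        # Only the newly inserted rows are pushed downstream.
        for view in self.views:
            view.process(new_rows)


class ToyView:
    """Derives zero or more rows from each base row via `expand`."""

    def __init__(self, base, expand):
        self.expand = expand
        self.rows = []
        self.processed = 0  # number of base rows this view has handled
        base.views.append(self)
        self.process(base.rows)  # populate from data already in the base table

    def process(self, base_rows):
        for row in base_rows:
            self.rows.extend(self.expand(row))
            self.processed += 1


toy_docs = ToyTable()
toy_words = ToyView(toy_docs, str.split)  # one row per whitespace-separated word
toy_docs.insert(['alpha beta'])
toy_docs.insert(['gamma delta epsilon'])
print(toy_words.rows, toy_words.processed)
```

After two inserts the view has handled exactly two base rows, one per insert, rather than re-reading the whole table each time.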

We can see the effect of the insertion with the `select` command. There's a single row in `docs`:

In [8]:
docs.select(docs.source_doc.fileurl).show()

source_doc_fileurl
https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Argus-Market-Digest-June-2024.pdf


And here are the first 20 rows in `sentences`. The content of the PDF is broken into individual sentences, as expected.

In [9]:
sentences.select(sentences.text, sentences.heading).show(20)

text,heading
MARKET DIGEST,
- 1 -,
"FRIDAY, JUNE 21, 2024",
"JUNE 20, DJIA: 39,134.76 UP 299.90",
Independent Equity Research Since 1934 ARGUS,
A R G U S R E S E A R C H C O M P,
A N Y • 6 1 B R O,
"A D W A Y • N E W Y O R K , N. Y. 1 0 0 0 6 • ( 2 1 2 ) 4 2 5 - 7 5 0 0 LONDON SALES & MARKETING OFFICE TEL 011-44-207-256-8383 /",
FAX 011-44-207-256-8363,
®,


## Experimenting with Chunking

Of course, chunking into sentences isn't the only way to split a document. We might want to experiment with different chunking strategies to see which one performs best in a particular application. Pixeltable makes this easy: we simply create several views of the same source table. Here are a few examples. Notice that as each new view is created, it is initially populated from the data already in `docs`.

In [None]:
chunks = pxt.create_view(
    'rag_ops_demo.chunks', docs,
    iterator=document_splitter(
        docs.source_doc,
        separators='sentence,token_limit',
        limit=2048,
        overlap=0,
        metadata='title,heading,sourceline'
    )
)

Inserting rows into `chunks`: 217 rows [00:00, 47827.85 rows/s]


In [None]:
short_chunks = pxt.create_view(
    'rag_ops_demo.short_chunks', docs,
    iterator=document_splitter(
        docs.source_doc,
        separators='sentence,token_limit',
        limit=72,
        overlap=0,
        metadata='title,heading,sourceline'
    )
)

Inserting rows into `short_chunks`: 219 rows [00:00, 49104.70 rows/s]


In [None]:
short_char_chunks = pxt.create_view(
    'rag_ops_demo.short_char_chunks', docs,
    iterator=document_splitter(
        docs.source_doc,
        separators='sentence,char_limit',
        limit=72,
        overlap=0,
        metadata='title,heading,sourceline'
    )
)

Inserting rows into `short_char_chunks`: 459 rows [00:00, 63241.10 rows/s]


In [13]:
chunks.select(chunks.text, chunks.heading).show(20)

text,heading
MARKET DIGEST,
- 1 -,
"FRIDAY, JUNE 21, 2024",
"JUNE 20, DJIA: 39,134.76 UP 299.90",
Independent Equity Research Since 1934 ARGUS,
A R G U S R E S E A R C H C O M P,
A N Y • 6 1 B R O,
"A D W A Y • N E W Y O R K , N. Y. 1 0 0 0 6 • ( 2 1 2 ) 4 2 5 - 7 5 0 0 LONDON SALES & MARKETING OFFICE TEL 011-44-207-256-8383 /",
FAX 011-44-207-256-8363,
®,


In [14]:
short_chunks.select(short_chunks.text, short_chunks.heading).show(20)

text,heading
MARKET DIGEST,
- 1 -,
"FRIDAY, JUNE 21, 2024",
"JUNE 20, DJIA: 39,134.76 UP 299.90",
Independent Equity Research Since 1934 ARGUS,
A R G U S R E S E A R C H C O M P,
A N Y • 6 1 B R O,
"A D W A Y • N E W Y O R K , N. Y. 1 0 0 0 6 • ( 2 1 2 ) 4 2 5 - 7 5 0 0 LONDON SALES & MARKETING OFFICE",
TEL 011-44-207-256-8383 /,
FAX 011-44-207-256-8363,


In [15]:
short_char_chunks.select(short_char_chunks.text, short_char_chunks.heading).show(20)

text,heading
MARKET DIGEST,
- 1 -,
"FRIDAY, JUNE 21, 2024",
"JUNE 20, DJIA: 39,134.76 UP 299.90",
Independent Equity Research Since 1934 ARGUS,
A R G U S R E S E A R C H C O M P,
A N Y • 6 1 B R O,
"A D W A Y • N E W Y O R K , N. Y. 1 0 0 0 6 • ( 2 1 2 ) 4 2 5 -",
7 5 0 0 LONDON SALES & MARKETING OFFICE TEL 011-44-207-256-8383 /,
FAX 011-44-207-256-8363,
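
Judging by the row counts and outputs above, the limit appears to split sentences that are too long rather than merge short ones: `chunks` (limit 2048) has the same 217 rows as `sentences`, while the 72-character limit breaks the long address line in two. The sketch below imitates that behavior for a character limit using Python's standard `textwrap` module. It is an illustration under that assumption, not Pixeltable's algorithm; the helper name `split_with_char_limit` is hypothetical, and the real splitter may pick different break points.

```python
import textwrap


def split_with_char_limit(sentences, limit):
    """Split any sentence longer than `limit` characters at whitespace (illustrative only)."""
    pieces = []
    for sentence in sentences:
        # A sentence that already fits comes back as a single piece.
        pieces.extend(textwrap.wrap(sentence, width=limit, break_on_hyphens=False))
    return pieces


sample = [
    'MARKET DIGEST',
    'A D W A Y • N E W Y O R K , N. Y. 1 0 0 0 6 • ( 2 1 2 ) 4 2 5 - 7 5 0 0 '
    'LONDON SALES & MARKETING OFFICE TEL 011-44-207-256-8383 /',
]
pieces = split_with_char_limit(sample, limit=72)
for piece in pieces:
    print(len(piece), repr(piece))
```

A token limit works the same way in spirit, except that length is measured in tokenizer tokens instead of characters, which is why `short_chunks` and `short_char_chunks` produce different row counts for the same limit of 72.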


Now let's add a few more documents to our workflow. Notice how all of the downstream views are updated incrementally, processing just the new documents as they are inserted.

In [16]:
urls = [
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Argus-Market-Watch-June-2024.pdf',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Company-Research-Alphabet.pdf',
    'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Zacks-Nvidia-Report.pdf'
]
docs.insert({'source_doc': url} for url in urls)

Inserting rows into `docs`: 3 rows [00:00, 1969.77 rows/s]
Inserting rows into `chunks`: 742 rows [00:00, 61926.41 rows/s]
Inserting rows into `short_chunks`: 747 rows [00:00, 67743.68 rows/s]
Inserting rows into `sentences`: 742 rows [00:00, 67949.90 rows/s]
Inserting rows into `short_char_chunks`: 1165 rows [00:00, 3603.41 rows/s]
Inserted 3399 rows with 0 errors.


3399 rows inserted, 6 values computed.
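
As a quick sanity check, the total of 3399 is just the sum of the per-table counts reported in the log above: the three new rows in `docs` plus the rows derived from them in each of the four views. The dictionary below simply copies those counts from the output.

```python
# Row counts copied from the insert log above
inserted = {
    'docs': 3,
    'chunks': 742,
    'short_chunks': 747,
    'sentences': 742,
    'short_char_chunks': 1165,
}
total = sum(inserted.values())
print(total)
```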

## Further Experiments

This is a good time to mention another important guiding principle of Pixeltable. The preceding examples all used the built-in `document_splitter` class with various configurations. That's probably fine as a first cut or to prototype an application quickly, and it might be sufficient for some applications. But other applications might want to do more sophisticated kinds of chunking, implementing their own specialized logic or leveraging third-party tools. Pixeltable imposes no constraints on the AI or RAG operations a workflow uses: the iterator interface is highly general, and it's easy to implement new operations or adapt existing code or third-party tools into the Pixeltable workflow.
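
As one example of specialized chunking logic, here is a minimal sliding-window chunker with overlap, written in plain Python. The function name `sliding_window_chunks` and its parameters are hypothetical, and the sketch is not wired into Pixeltable's iterator interface (that interface is not shown in this tutorial); it only illustrates the kind of custom logic an application might want to plug in.

```python
def sliding_window_chunks(words, size, overlap):
    """Group `words` into windows of `size` items, each sharing `overlap` items with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError('overlap must satisfy 0 <= overlap < size')
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the input
    return chunks


text = 'one two three four five six seven'
windows = sliding_window_chunks(text.split(), size=3, overlap=1)
for window in windows:
    print(window)
```

Overlapping windows repeat a little context across chunk boundaries, which can help retrieval when a relevant passage straddles two chunks, at the cost of storing and embedding more text.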

## Computing Embeddings

Next, let's look at how embeddings can be added seamlessly to existing Pixeltable workflows. To compute our embeddings, we'll use the Hugging Face `sentence-transformers` package, running it over the `chunks` view, which breaks our documents up into sentence-based chunks. Pixeltable has a built-in `sentence_transformer` adapter, and all we have to do is add a new column that leverages it. Pixeltable takes care of the rest, applying the new column to all existing data in the view.

In [17]:
from pixeltable.functions.huggingface import sentence_transformer

chunks.add_computed_column(minilm_embed=sentence_transformer(
    chunks.text,
    model_id='paraphrase-MiniLM-L6-v2'
))

Added 959 column values with 0 errors.


959 rows updated, 959 values computed.

The new column is a *computed column*: it is defined as a function on top of existing data and updated incrementally as new data are added to the workflow. Let's have a look at how the new column affected the `chunks` view.

In [18]:
chunks

view 'rag_ops_demo.chunks' (of 'rag_ops_demo.docs')

Column Name,Type,Computed With
pos,Required[Int],
text,Required[String],
title,String,
heading,Json,
sourceline,Int,
minilm_embed,"Required[Array[(384,), float32]]","sentence_transformer(text, model_id='paraphrase-MiniLM-L6-v2')"
source_doc,Document,


In [19]:
chunks.select(chunks.text, chunks.heading, chunks.minilm_embed).head()

text,heading,minilm_embed
MARKET DIGEST,,[-0.33 -0.824 -0.397 0.008 -0.325 -0.624 ... 0.406 -0.113 0.172 -0.475 0.669 -0.102]
- 1 -,,[-0.597 0.507 0.367 -0.109 -0.264 0.052 ... -0.089 0.237 0.35 0.153 0.837 -0.025]
"FRIDAY, JUNE 21, 2024",,[ 0.016 -0.178 0.134 -0.284 0.019 0.419 ... -0.35 -0.102 0.181 -0.476 -0.243 -0.209]
"JUNE 20, DJIA: 39,134.76 UP 299.90",,[ 0.32 -0.037 -0.18 0.118 -0.058 0.171 ... -0.274 0.051 -0.1 0.237 -0.367 -0.241]
Independent Equity Research Since 1934 ARGUS,,[-0.813 -0.261 -0.306 0.03 0.038 0.014 ... -0.481 -0.132 -0.07 -0.399 0.106 -0.271]
A R G U S R E S E A R C H C O M P,,[-0.107 0.194 -0.395 -0.058 -0.438 0.55 ... -0.499 -0.48 -0.315 -0.341 0.587 -0.008]
A N Y • 6 1 B R O,,[ 0.338 0.381 -0.174 -0.187 -0.655 0.058 ... -0.544 -0.366 -0.167 0.337 0.366 -0.179]
"A D W A Y • N E W Y O R K , N. Y. 1 0 0 0 6 • ( 2 1 2 ) 4 2 5 - 7 5 0 0 LONDON SALES & MARKETING OFFICE TEL 011-44-207-256-8383 /",,[-0.065 -0.214 -0.014 -0.384 -0.343 -0.139 ... -0.023 -0.036 -0.354 -0.129 0.064 0.19 ]
FAX 011-44-207-256-8363,,[-0.879 0.129 -0.125 -0.053 -0.06 -0.41 ... -0.193 0.273 0.723 -0.062 0.351 0.156]
®,,[-0.432 0.495 -0.327 -0.704 -0.235 -0.077 ... 0.222 -0.406 0.456 -0.064 0.441 0.266]
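
Embedding vectors like these are typically compared with cosine similarity when retrieving chunks for a query. The sketch below shows that calculation in plain Python on toy 3-dimensional vectors; the vectors, labels, and query are made up for illustration and stand in for the real 384-dimensional `minilm_embed` values. It does not use any Pixeltable API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (illustrative values only)
embeddings = {
    'chunk A': [1.0, 0.0, 0.0],
    'chunk B': [0.6, 0.8, 0.0],
    'chunk C': [0.0, 0.0, 1.0],
}
query = [1.0, 0.1, 0.0]

# Rank chunks from most to least similar to the query
ranked = sorted(
    embeddings,
    key=lambda name: cosine_similarity(query, embeddings[name]),
    reverse=True,
)
print(ranked)
```

Chunks whose vectors point in nearly the same direction as the query vector rank first, regardless of vector length.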


Similarly, we might want to add a CLIP embedding to our workflow; once again, it's just another computed column:

In [20]:
from pixeltable.functions.huggingface import clip

chunks.add_computed_column(clip_embed=clip(
    chunks.text, model_id='openai/clip-vit-base-patch32'
))

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Added 959 column values with 0 errors.


959 rows updated, 959 values computed.

In [21]:
chunks

view 'rag_ops_demo.chunks' (of 'rag_ops_demo.docs')

Column Name,Type,Computed With
pos,Required[Int],
text,Required[String],
title,String,
heading,Json,
sourceline,Int,
minilm_embed,"Required[Array[(384,), float32]]","sentence_transformer(text, model_id='paraphrase-MiniLM-L6-v2')"
clip_embed,"Required[Array[(512,), float32]]","clip(text, model_id='openai/clip-vit-base-patch32')"
source_doc,Document,


In [22]:
chunks.select(chunks.text, chunks.heading, chunks.clip_embed).head()

text,heading,clip_embed
MARKET DIGEST,,[ 0.276 0.039 -0.095 -0.055 -0.193 -0.061 ... -0.298 -0.007 0.181 0.417 0.091 -0.072]
- 1 -,,[ 0.031 0.106 0.159 0.429 -0.145 -0.123 ... -0.233 0.124 0.031 -0.841 0.138 0.045]
"FRIDAY, JUNE 21, 2024",,[-0.198 -0.326 -0.505 -0.247 0.219 -0.505 ... -0.099 -0.195 0.694 0.123 -0.031 -0.302]
"JUNE 20, DJIA: 39,134.76 UP 299.90",,[-0.204 -0.214 0.03 0.103 -0.246 -0.155 ... 0.071 0.017 0.416 0.552 0.077 0.118]
Independent Equity Research Since 1934 ARGUS,,[-0.119 -0.158 -0.202 -0.239 -0.359 0.17 ... -0.257 -0.053 0.035 0.134 -0.093 0.114]
A R G U S R E S E A R C H C O M P,,[ 0.052 0.103 -0.212 0.007 0.396 0.075 ... -0.07 -0.023 -0.071 -0.769 -0.233 0.182]
A N Y • 6 1 B R O,,[ 0.085 0.043 0.238 0.021 0.012 -0.089 ... -0.239 -0.07 -0.029 -0.563 -0.007 0.216]
"A D W A Y • N E W Y O R K , N. Y. 1 0 0 0 6 • ( 2 1 2 ) 4 2 5 - 7 5 0 0 LONDON SALES & MARKETING OFFICE TEL 011-44-207-256-8383 /",,[-0.246 0.184 0.357 0.078 0.376 0.135 ... -0.189 -0.169 0.369 -0.353 0.012 -0.123]
FAX 011-44-207-256-8363,,[-0.109 0.136 -0.139 -0.098 0.185 -0.032 ... -0.099 -0.093 0.126 0.112 -0.349 0.058]
®,,[ 0.023 0.237 0.13 0.275 -0.013 -0.158 ... -0.102 0.027 -0.081 -1.035 0.181 0.205]
