[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/pixeltable/pixeltable/blob/release/docs/notebooks/use-cases/rag-demo.ipynb)&nbsp;&nbsp;
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pixeltable/pixeltable/blob/release/docs/notebooks/use-cases/rag-demo.ipynb)

# Document Indexing and RAG

In this tutorial, we'll demonstrate how retrieval-augmented generation (RAG) operations can be implemented in Pixeltable. In particular, we'll develop a RAG application that indexes a collection of PDF documents and uses ChatGPT to answer questions about them.

In a traditional RAG workflow, such operations might be implemented as a Python script that runs on a periodic schedule or in response to certain events. In Pixeltable, they are implemented as persistent tables that are updated automatically and incrementally as new data becomes available.

**If you are running this tutorial in Colab:**
In order to make the tutorial run a bit snappier, let's switch to a GPU-equipped instance for this Colab session. To do that, click on the `Runtime -> Change runtime type` menu item at the top, then select the `GPU` radio button and click on `Save`.

We first set up our OpenAI API key:

In [1]:
import os
import getpass
if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

Next, we install the packages needed for this tutorial and set up our environment.

In [2]:
%pip install -q pixeltable sentence-transformers tiktoken openai openpyxl

Note: you may need to restart the kernel to use updated packages.


In [3]:
import numpy as np
import pixeltable as pxt

# Ensure a clean slate for the demo
pxt.drop_dir('rag_demo', force=True)
pxt.create_dir('rag_demo')

Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/home/marcel/.pixeltable/pgdata
Created directory 'rag_demo'.


<pixeltable.catalog.dir.Dir at 0x7eb34c1eceb0>

Next we'll create a table containing the sample questions we want to answer. The questions are stored in an Excel spreadsheet, along with a set of "ground truth" answers to help evaluate our model pipeline. We can use Pixeltable's handy `import_excel()` utility to load them. Note that we can pass the URL of the spreadsheet directly to the import utility.

In [4]:
base = 'https://github.com/pixeltable/pixeltable/raw/main/docs/resources/rag-demo/'
qa_url = base + 'Q-A-Rag.xlsx'
queries_t = pxt.io.import_excel('rag_demo.queries', qa_url)

Created table 'queries'.
Inserting rows into `queries`: 8 rows [00:00, 1853.94 rows/s]
Inserted 8 rows with 0 errors.


In [5]:
queries_t.head()

S__No_,Question,correct_answer
1,What is roughly the current mortage rate?,0.07
2,What is the current dividend yield for Alphabet Inc. (\$GOOGL)?,0.0046
3,What is the market capitalization of Alphabet?,\$2182.8 Billion
4,What are the latest financial metrics for Accenture PLC?,missed consensus forecasts and strong total bookings rising by 22% annually
5,What is the overall latest rating for Amazon.com from analysts?,SELL
6,What is the operating cash flow of Amazon in Q1 2024?,"18,989 Million"
7,What is the expected EPS for Nvidia in Q1 2026?,0.73 EPS
8,What are the main reasons to buy Nvidia?,"Datacenter, GPUs Demands, Self-driving, and cash-flow"


## Outline

There are two major parts to our RAG application:

1. Document Indexing: Load the documents, split them into chunks, and index them using a vector embedding.
2. Querying: For each question on our list, do a top-k lookup for the most relevant chunks, use them to construct a ChatGPT prompt, and send the enriched prompt to an LLM.

We'll implement both parts in Pixeltable.

## 1. Document Indexing

All data in Pixeltable, including documents, resides in tables.

Tables are persistent containers that can serve as the store of record for your data. Since we are starting from scratch, we begin with an empty table `rag_demo.documents` that has a single column, `document`.

In [6]:
documents_t = pxt.create_table(
    'rag_demo.documents',
    {'document': pxt.Document}
)

documents_t

Created table 'documents'.


table 'rag_demo.documents'

Column Name,Type,Computed With
document,Document,


Next, we'll insert our first few source documents into the new table. We'll leave the rest for later, in order to show how to update the indexed document base incrementally.

In [7]:
document_urls = [
    base + 'Argus-Market-Digest-June-2024.pdf',
    base + 'Argus-Market-Watch-June-2024.pdf',
    base + 'Company-Research-Alphabet.pdf',
    base + 'Jefferson-Amazon.pdf',
    base + 'Mclean-Equity-Alphabet.pdf',
    base + 'Zacks-Nvidia-Report.pdf',
]

In [8]:
documents_t.insert({'document': url} for url in document_urls[:3])
documents_t.show()

Inserting rows into `documents`: 3 rows [00:00, 1925.76 rows/s]
Inserted 3 rows with 0 errors.


document


In RAG applications, we often decompose documents into smaller units, or chunks, rather than treating each document as a single entity. In this example, we'll use Pixeltable's built-in `DocumentSplitter`, but in general the chunking methodology is highly customizable. `DocumentSplitter` has a variety of options for controlling the chunking behavior, and it's also possible to replace it entirely with a user-defined iterator (or an adapter for a third-party document splitter).
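To make the idea of token-limited chunking concrete, here is a minimal sketch in plain Python. It is not how `DocumentSplitter` is implemented: the function name `split_by_token_limit` is hypothetical, and whitespace-separated words stand in for the tokens a real tokenizer would produce.

```python
def split_by_token_limit(text: str, limit: int = 300) -> list[dict]:
    """Split text into chunks of at most `limit` whitespace-delimited tokens.

    Assumption: words approximate tokens; a real splitter would use a tokenizer.
    """
    if limit <= 0:
        raise ValueError('limit must be positive')
    tokens = text.split()
    chunks = []
    # Step through the token list in strides of `limit`, numbering each chunk.
    for pos, start in enumerate(range(0, len(tokens), limit)):
        chunk_text = ' '.join(tokens[start:start + limit])
        chunks.append({'pos': pos, 'text': chunk_text})
    return chunks


# Toy input: seven "tokens", split into chunks of at most three.
sample = ' '.join(f'w{i}' for i in range(7))
chunks = split_by_token_limit(sample, limit=3)
for chunk in chunks:
    print(chunk['pos'], chunk['text'])
```

Each chunk carries a position number and its text, which loosely mirrors the `pos` and `text` columns of the view created in the next step.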

In Pixeltable, operations such as chunking can be automated by creating **views** of the base `documents` table. A view is a virtual derived table: rather than adding data directly to the view, we define it via a computation over the base table. In this example, the view is defined by iteration over the chunks of a `DocumentSplitter`.

In [9]:
from pixeltable.iterators import DocumentSplitter

chunks_t = pxt.create_view(
    'rag_demo.chunks',
    documents_t,
    iterator=DocumentSplitter.create(
        document=documents_t.document,
        separators='token_limit',
        limit=300
    )
)

Inserting rows into `chunks`: 41 rows [00:00, 21767.91 rows/s]


Our `chunks` view now has 3 columns:

In [10]:
chunks_t

view 'rag_demo.chunks' (of 'rag_demo.documents')

Column Name,Type,Computed With
pos,Required[Int],
text,Required[String],
document,Document,


- `text` is the chunk text produced by the `DocumentSplitter`
- `pos` is a system-generated integer column, starting at 0, that provides a sequence number for each row
- `document` is simply the `document` column from the base table `documents`. We won't need it here, but having access to the base table's columns (in effect a parent-child join) can be quite useful.

Notice that as soon as we created it, `chunks` was automatically populated with data from the existing documents in our base table. We can select the first 2 chunks from each document using common dataframe operations, in order to get a feel for what was extracted:

In [11]:
chunks_t.where(chunks_t.pos < 2).show()

pos,text,document
0,"MARKET DIGEST - 1 -  FRIDAY, JUNE 21, 2024 JUNE 20, DJIA: 39,134.76 UP 299.90 Independent Equity Research Since 1934 ARGUS A R G U S R E S E A R C H C O M P A N Y • 6 1 B R O A D W A Y • N E W Y O R K , N. Y. 1 0 0 0 6 • ( 2 1 2 ) 4 2 5 - 7 5 0 0 LONDON SALES & MARKETING OFFICE TEL 011-44-207-256-8383 / FAX 011-44-207-256-8363 ® Good Morning. This is the Market Digest for Friday, June 21, 2024, with analysis of the financial markets and comments on Accenture plc. IN THIS ISSUE: * Growth Stock: Accenture plc: Shares rally on AI optimism (Jim Kelleher) MARKET REVIEW: Yogi Berra famously said ""When you come to a fork in the road, take it."" Stock investors did just that on Thursday, pushing the Dow Jones Industrial Average higher by 0.77% but the Nasdaq Composite and S&P",
1,"500 lower by 0.79% and 0.25%, respectively. In a rare event, shares of Nvidia not only lost ground, but lost a relatively meaningful amount (3.5%), proving nothing can go up forever. Still, the major indices are comfortably ahead for the year to date — and the big non-AI mover for stocks is the future direction of interest rates, which remains a concern for Wall Street and (for one day at least) offset AI mania. ACCENTURE PLC (NYSE: ACN, \$306.16) BUY ACN: Shares rally on AI optimism * Accen ...... entum for the business. * Total bookings were strong in 3Q24, rising 22% annually on acceleration in managed services and resulting in a 1.3 book-to-bill. Accenture still appears to be taking share from competitors. * We believe that Accenture has the financial resources, customer presence, and market strength to thrive as companies accelerate the process of digital transformation and begin their AI journeys. ANALYSIS INVESTMENT THESIS Shares of BUY-rated Accenture plc (NYSE: ACN) rose solid",
0,"Friday, June 21, 2024 Intermediate Term: Market Outlook Bullish -------------- PORTFOLIO STRATEGY ------------- Equity: 72% Cash: 1% Today's Market Movers IMPACT aGlobal Shares Lower GILD Pops on HIV Drug Results SRPT Soars on FDA Approval SWBI Drops on Sales Guidance + + + - a a a Recent Research Review ADSK, MRNA, IQV, WMB, BUD, LYFT, SRE, BP, AEE, PPC, JNPR, ORCL, CMG, TPR, DPZ, EOG, COST, PLTR, COR, VRTX Statistics Diary 12-Mth S&P 500 Forcast: S&P 500 Current/Next EPS: S&P 500 P/E: 12-Mth S&P P/E Range: 10-Year Yield: 12-Mth 10-Yr. Bond Forecast: Current Fed Funds Target: 12-Mth Fed Funds Forecast: 4800-5600 247/265 22.16 18.1 - 21.1 4.26% 3.50-4.50% 4.62% 4.50-5.50% DJIA: S&P 500: NASDAQ: Lrg/Small Cap: Growth/Value: PREVIOUS CLOSE 200-DAY AVERAGE 39134.76",
1,"37058.23 5473.17 4831.39 17721.59 15160.55 1.48 1.37 2.07 1.86 CURRENT RANKING Five-Day Put/Call: Momentum: Bullish Sentiment: Mutual Fund Cash: Vickers Insider Index: 1.00 Positive 346000 Neutral 44% Positive 1.70% Negative 3.42 Negative Housing Sentiment Slumps Mortgage rates near 7% are pushing prospective buyers to the sidelines and could turn housing to a drag on 2Q GDP after a strong contribution to 1Q growth. ""Millions of potential homebuyers have been priced out of the market by ...... rding to The State of The Nation's Housing report, which was published yesterday by Harvard's Joint Center for Housing Studies. Fannie Mae's Home Purchase Sentiment Index for May dropped by 2.5 points to an all-time survey low of 69.4. Just 14% of consumers said that it is a good time to buy a home, down from 20% in April. Doug Duncan, Chief Economist at Fannie Mae, said ""While many respondents expressed optimism at the beginning of the year that mortgage rates would decline,",
0,"Company Research Highlights ® Report created on June 21, 2024 This is not an investment recommendation from Fidelity Investments. Fidelity provides this information as a service to investors from independent, third-party sources. Performance of analyst recommendations are provided by StarMine from Refinitiv. Current analyst recommendations are collected and standardized by Investars. See each section in this report for third-party content attribution, as well as the final page of the report ...... Market Capitalization: \$2182.8 B Interactive Media & Services Industry Business Description Data provided by S&P Compustat Alphabet Inc. offers various products and platforms in the United States, Europe, the Middle East, Africa, the Asia-Pacific, Canada, and Latin America. It operates through Google Services, Google Cloud, and Other Bets segments. Key Statistics Employee Count 182,502 Institutional Ownership 80.9% Total Revenue (TTM) \$80,539.00 3/31/2024 Revenue Growth (TTM vs. Prior TTM)",
1,+11.78% Enterprise Value \$2103.1 B 6/20/2024 Ex. Dividend Date 6/10/2024 Dividend \$0.200000 Dividend Yield (Annualized) 0.45% 6/20/2024 P/E (TTM) 27.0 6/20/2024 Earning Yield (TTM) +3.70% 6/20/2024 EPS (Adjusted TTM) \$1.89 4/25/2024 Consensus EPS Estimate (Q2 2024) \$1.84 EPS Growth (TTM vs. Prior TTM) +45.2% 3-Year Price Performance Data provided by DataScope from Refinitiv Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 2021 2022 2023 2024 -40% -20% 0% 20% 40% 0 500 Average Monthly Volume (Millions) 50-Day Moving Average 200-Day Moving Average Trading Characteristics 52 Week High 6/12/2024 \$180.41 52 Week Low 7/11/2023 \$115.35 % Price Above/Below  20-Day Average 2.4  50-Day Average,


Now let's compute vector embeddings for the document chunks and store them in a vector index. Pixeltable has built-in support for vector indexing using a variety of embedding model families, and it's easy for users to add new ones via UDFs. In this demo, we're going to use the E5 model from the Hugging Face `sentence_transformers` library, which runs locally.

The following command creates a vector index on the `text` column in the `chunks` table, using the E5 embedding model. (For details on index creation, see the [Embedding and Vector Indices](https://docs.pixeltable.com/docs/embedding-vector-indexes) guide.) Note that defining the index is sufficient in order to load it with the existing data (and also to update it when the underlying data changes, as we'll see later).

In [12]:
from pixeltable.functions.huggingface import sentence_transformer

chunks_t.add_embedding_index(
    'text',
    embedding=sentence_transformer.using(model_id='intfloat/e5-large-v2')
)

This completes the first part of our application, creating an indexed document base. Next, we'll use it to run some queries.

## 2. Querying

In order to express a top-k lookup against our index, we use Pixeltable's `similarity` operator in combination with the standard `order_by` and `limit` operations. Before building this into our application, let's run a sample query to make sure it works.

In [13]:
query_text = "What is the expected EPS for Nvidia in Q1 2026?"
sim = chunks_t.text.similarity(query_text)
nvidia_eps_query = (
    chunks_t
    .order_by(sim, asc=False)
    .select(similarity=sim, text=chunks_t.text)
    .limit(5)
)
nvidia_eps_query.collect()

similarity,text
0.801,"on 6/20/2024: \$176.30 Communication Services Sector Market Capitalization: \$2182.8 B Interactive Media & Services Industry Alphabet Class A The content on this page is provided by third parties and not Fidelity. Fidelity did not prepare and does not endorse such content. All are third-party companies that are not affiliated with Fidelity. See each section in this report for third-party content attribution and see page 4 for full disclosures. Page 2 Report created on June 21, 2024 Equity Sum ...... rom contributors. The number of contributors for each security where there is a consensus recommendation is provided. Each contributor determines how their individual recommendation scale maps to the I/B/E/S from Refinitiv 5-point scale. Visit Fidelity.com for firm descriptions, detailed methodologies, and more information on the Equity Summary Score, First Call Consensus, opinion history and performance, and most current available research reports for GOOGL. Equity Summary Score (7 Firms†)"
0.799,/B/E/S from Refinitiv Earnings in US Dollars ACTUALS ESTIMATES STARMINE ESTIMATES GOOGL PRICE Q2 Q1 2024 2024 2024 2024 Q4 Q3 Q2 Q1 2023 2023 2023 2023 Q4 Q3 Q2 Q1 2022 2022 2022 2022 \$0 \$100 \$200 Today 1.23 1.21 1.06 1.05 1.17 1.44 1.55 1.64 1.89 1.84 Actuals vs. Estimates for Fiscal Year First Call Estimates  Actual (\$) Consensus (\$) Analysts in Consensus Low / High Range (\$) Previous Year (Ends 12/31/23) 5.80 5.74 48 5.26 / 6.10 Next Year (Ends 12/31/24) -- 7.57 53 6.87 / 8.17 Industry Comparisons** Data provided by S&P Compustat Valuation Ratios (trailing 12 months) GOOGL Industry GOOGL Percentile Rank in Industry Price / Earnings 27.0 28.1 -- PEG Ratio 1
0.796,+11.78% Enterprise Value \$2103.1 B 6/20/2024 Ex. Dividend Date 6/10/2024 Dividend \$0.200000 Dividend Yield (Annualized) 0.45% 6/20/2024 P/E (TTM) 27.0 6/20/2024 Earning Yield (TTM) +3.70% 6/20/2024 EPS (Adjusted TTM) \$1.89 4/25/2024 Consensus EPS Estimate (Q2 2024) \$1.84 EPS Growth (TTM vs. Prior TTM) +45.2% 3-Year Price Performance Data provided by DataScope from Refinitiv Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 2021 2022 2023 2024 -40% -20% 0% 20% 40% 0 500 Average Monthly Volume (Millions) 50-Day Moving Average 200-Day Moving Average Trading Characteristics 52 Week High 6/12/2024 \$180.41 52 Week Low 7/11/2023 \$115.35 % Price Above/Below  20-Day Average 2.4  50-Day Average
0.795,"Friday, June 21, 2024 Intermediate Term: Market Outlook Bullish -------------- PORTFOLIO STRATEGY ------------- Equity: 72% Cash: 1% Today's Market Movers IMPACT aGlobal Shares Lower GILD Pops on HIV Drug Results SRPT Soars on FDA Approval SWBI Drops on Sales Guidance + + + - a a a Recent Research Review ADSK, MRNA, IQV, WMB, BUD, LYFT, SRE, BP, AEE, PPC, JNPR, ORCL, CMG, TPR, DPZ, EOG, COST, PLTR, COR, VRTX Statistics Diary 12-Mth S&P 500 Forcast: S&P 500 Current/Next EPS: S&P 500 P/E: 12-Mth S&P P/E Range: 10-Year Yield: 12-Mth 10-Yr. Bond Forecast: Current Fed Funds Target: 12-Mth Fed Funds Forecast: 4800-5600 247/265 22.16 18.1 - 21.1 4.26% 3.50-4.50% 4.62% 4.50-5.50% DJIA: S&P 500: NASDAQ: Lrg/Small Cap: Growth/Value: PREVIOUS CLOSE 200-DAY AVERAGE 39134.76"
0.794,"2024: \$176.30 Communication Services Sector Market Capitalization: \$2182.8 B Interactive Media & Services Industry Alphabet Class A The content on this page is provided by third parties and not Fidelity. Fidelity did not prepare and does not endorse such content. All are third-party companies that are not affiliated with Fidelity. See each section in this report for third-party content attribution.  © 2024 FMR LLC. All rights reserved. 447628.8.0 Page 4 Report created on June 21, 2024 Impo ...... alyst recommendations data and analysis provided by Investars. All are third-party companies that are not affiliated with Fidelity. This report provides quotes and data concerning the financial markets, securities and other subjects. Content is supplied by companies that are not affiliated with Fidelity (""Third-Party Content""). Most Third-Party Content and its source are clearly and prominently identified. Fidelity does not prepare, edit, or endorse Third-Party Content. News and research are"
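Conceptually, a top-k lookup like the one above scores every chunk against the query and keeps the best matches. The sketch below illustrates this in plain Python, assuming cosine similarity as the metric. The three-dimensional vectors, the chunk labels, and the helper names `cosine_similarity` and `rank_top_k` are made up for illustration; a real index stores model-generated embeddings and avoids scanning every row.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def rank_top_k(query_vec: list[float], chunk_vecs: dict, k: int = 5) -> list[dict]:
    """Score every chunk against the query and return the k most similar."""
    scored = [
        {'text': text, 'sim': cosine_similarity(query_vec, vec)}
        for text, vec in chunk_vecs.items()
    ]
    # Highest similarity first, like order_by(sim, asc=False).limit(k).
    scored.sort(key=lambda row: row['sim'], reverse=True)
    return scored[:k]


# Hypothetical toy embeddings; real ones would come from the embedding model.
toy_chunks = {
    'chunk about EPS estimates': [0.9, 0.1, 0.0],
    'chunk about mortgage rates': [0.1, 0.9, 0.0],
    'chunk about legal disclaimers': [0.0, 0.1, 0.9],
}
results = rank_top_k([1.0, 0.0, 0.0], toy_chunks, k=2)
for row in results:
    print(round(row['sim'], 3), row['text'])
```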


We perform this context retrieval for each row of our `queries` table by adding it as a computed column. In this case, the operation is a top-k similarity lookup against the data in the `chunks` table. To implement this operation, we'll use Pixeltable's `@query` decorator to enhance the capabilities of the `chunks` table.

In [14]:
# A @query is essentially a reusable, parameterized query that is attached to a table (or view),
# which is a modular way of getting data from that table.

@pxt.query
def top_k(query_text: str):
    sim = chunks_t.text.similarity(query_text)
    return (
        chunks_t.order_by(sim, asc=False)
            .select(chunks_t.text, sim=sim)
            .limit(5)
    )

# Now add a computed column to `queries_t`, calling the query
# `top_k` that we just defined.
queries_t.add_computed_column(
    question_context=top_k(queries_t.Question)
)

Added 8 column values with 0 errors.


8 rows updated, 8 values computed.

Our `queries` table now looks like this:

In [15]:
queries_t

table 'rag_demo.queries'

Column Name,Type,Computed With
S__No_,Int,
Question,String,
correct_answer,String,
question_context,Json,top_k(Question)


The new column `question_context` now contains the result of executing the query for each row, formatted as a list of dictionaries:

In [16]:
queries_t.select(queries_t.question_context).head(1)

question_context
"[{""sim"": 0.795, ""text"": "" that simply hasn't happened, and current sentiment \nreflects pent-up frustration with the overall lack of purchase affordability.\"" \nBased on the ...... .5% for April. The Zillow Home Value index rose by 4.3% in April and \n4.3% in May. High mortgage rates are a challenge, but we remain bullish on \n""}, {""sim"": 0.794, ""text"": ""\n37058.23\n5473.17\n4831.39\n17721.59\n15160.55\n1.48\n1.37\n2.07\n1.86\nCURRENT\nRANKING\nFive-Day Put/Call:\nMomentum:\nBullish Sentiment:\nMutual Fund Cash:\n ...... \nEconomist \nat \nFannie \nMae, \nsaid \n\""While \nmany \nrespondents expressed optimism at the beginning of the year that mortgage \nrates would decline,""}, {""sim"": 0.779, ""text"": ""+11.78%\nEnterprise Value\n\$2103.1 B\n6/20/2024\nEx. Dividend Date\n6/10/2024\nDividend\n\$0.200000\nDividend Yield (Annualized)\n0.45%\n6/20/2024\nP/E (TTM)\n ...... e\nTrading Characteristics\n52 Week High\n6/12/2024\n\$180.41\n52 Week Low\n7/11/2023\n\$115.35\n% Price Above/Below\n 20-Day Average\n2.4\n 50-Day Average""}, {""sim"": 0.768, ""text"": "", redistribution or disclosure is prohibited by law and can result in prosecution. The content of this report\nmay be derived from Argus research r ...... ny loss arising from the use of this report, nor shall Argus treat all recipients of this report as\ncustomers simply by virtue of their receipt of""}, {""sim"": 0.765, ""text"": "".20; the new guidance implies growth of 2%-3% from FY23. That is down from 3%-5% growth\nguidance offered in March 2024.\nWe are reducing our FY24 n ...... billion in FY23, \$9.54 billion in FY22, \$8.98 billion in FY21, \$8.36 billion in FY20,\nand \$6.63 billion in FY19.\nAccenture forecast free cash flow""}]"


### Asking the LLM

Now it's time for the final step in our application: feeding the document chunks and questions to an LLM for resolution. In this demo, we'll use OpenAI for this, but any other inference cloud or local model could be used instead.

We start by defining a UDF that takes a top-k list of context chunks and a question and turns them into a ChatGPT prompt.

In [17]:
# Define a UDF to create an LLM prompt given a top-k list of
# context chunks and a question.
@pxt.udf
def create_prompt(top_k_list: list[dict], question: str) -> str:
    concat_top_k = '\n\n'.join(
        elt['text'] for elt in reversed(top_k_list)
    )
    return f'''
    PASSAGES:

    {concat_top_k}

    QUESTION:

    {question}'''

We then add the prompt as another computed column on `queries`:

In [18]:
queries_t.add_computed_column(
    prompt=create_prompt(queries_t.question_context, queries_t.Question)
)

Added 8 column values with 0 errors.


8 rows updated, 16 values computed.
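The passage ordering produced by `create_prompt` can be checked outside Pixeltable. The sketch below reimplements the same joining logic as a plain function, with a simplified layout that drops the indentation. The name `build_prompt` and the toy passages are hypothetical. Because the context list arrives sorted most-similar first, `reversed` places the most similar passage last, closest to the question.

```python
def build_prompt(top_k_list: list[dict], question: str) -> str:
    """Join passages (least similar first) and append the question."""
    passages = '\n\n'.join(elt['text'] for elt in reversed(top_k_list))
    return f'PASSAGES:\n\n{passages}\n\nQUESTION:\n\n{question}'


# Toy context in the same shape as `question_context`: sorted by descending similarity.
context = [
    {'sim': 0.9, 'text': 'most similar passage'},
    {'sim': 0.5, 'text': 'least similar passage'},
]
prompt = build_prompt(context, 'What is the answer?')
print(prompt)
```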

We now have a new string column containing the prompt:

In [19]:
queries_t

table 'rag_demo.queries'

Column Name,Type,Computed With
S__No_,Int,
Question,String,
correct_answer,String,
question_context,Json,top_k(Question)
prompt,String,"create_prompt(question_context, Question)"


In [20]:
queries_t.select(queries_t.prompt).head(1)

prompt
"PASSAGES:  .20; the new guidance implies growth of 2%-3% from FY23. That is down from 3%-5% growth guidance offered in March 2024. We are reducing our FY24 non-GAAP earnings estimate to \$11.88 per diluted share from \$12.11. However, we are raising our FY25 forecast to \$12.70 per diluted share from \$12.68. Our long-term annualized EPS growth rate forecast is 10%. FINANCIAL STRENGTH & DIVIDEND Accenture's financial strength ranking is High. The company, which formerly had fractional d ...... of 4.10 million (SAAR), down from 4.23 million in May 2023. Next week, we expect the Commerce Department to report May New Home Sales of 640,000 (SAAR), down from 741,000 a year earlier. The S&P/Case-Shiller National Home Price Index jumped 6.5% in March. We expect it to rise about 4.5% for April. The Zillow Home Value index rose by 4.3% in April and 4.3% in May. High mortgage rates are a challenge, but we remain bullish on QUESTION:  What is roughly the current mortage rate?"


We now add another computed column to call OpenAI. For the `chat_completions()` call, we need to construct two messages, containing the instructions to the model and the prompt. For the latter, we can simply reference the `prompt` column we just added.

In [21]:
from pixeltable.functions import openai

# Assemble the prompt and instructions into OpenAI's message format
messages = [
    {
        'role': 'system',
        'content': 'Please read the following passages and answer the question based on their contents.'
    },
    {
        'role': 'user',
        'content': queries_t.prompt
    }
]

# Add a computed column that calls OpenAI
queries_t.add_computed_column(
    response=openai.chat_completions(model='gpt-4o-mini', messages=messages)
)

Added 8 column values with 0 errors.


8 rows updated, 8 values computed.

Our `queries` table now contains a JSON-structured column `response`, which holds the entire API response structure. At the moment, we're only interested in the response content, which we can extract easily into another computed column:

In [22]:
queries_t.add_computed_column(
    answer=queries_t.response.choices[0].message.content
)

Added 8 column values with 0 errors.


8 rows updated, 8 values computed.
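The JSON path `response.choices[0].message.content` mirrors ordinary indexing into the response structure. As a minimal sketch, the dictionary below is a hand-written, heavily abbreviated stand-in for a chat completion response. Its `id` value is a placeholder, and a real response contains additional fields.

```python
# Hypothetical, abbreviated stand-in for a chat completion response.
response = {
    'id': 'chatcmpl-example',  # placeholder value, not a real identifier
    'choices': [
        {
            'index': 0,
            'message': {
                'role': 'assistant',
                'content': 'The current mortgage rate is near 7%.',
            },
            'finish_reason': 'stop',
        }
    ],
}

# Same path as the computed column: first choice, its message, then the text.
answer = response['choices'][0]['message']['content']
print(answer)
```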

We now have the following `queries` schema:

In [23]:
queries_t

table 'rag_demo.queries'

Column Name,Type,Computed With
S__No_,Int,
Question,String,
correct_answer,String,
question_context,Json,top_k(Question)
prompt,String,"create_prompt(question_context, Question)"
response,Required[Json],"chat_completions(model='gpt-4o-mini', messages=[{'role': 'system', 'content': 'Please read the following passages and answer the question based on their contents.'}, {'role': 'user', 'content': prompt}])"
answer,Json,response.choices[0].message.content


Let's take a look at what we got back:

In [24]:
queries_t.select(queries_t.Question, queries_t.correct_answer, queries_t.answer).show()

Question,correct_answer,answer
What is roughly the current mortage rate?,0.07,The current mortgage rate is near 7%.
What is the overall latest rating for Amazon.com from analysts?,SELL,"The provided passages contain detailed information about Alphabet Inc. (GOOGL) and its ratings from various analysts, but there is no information about Amazon.com (AMZN) or its ratings. Therefore, I cannot provide the overall latest rating for Amazon.com from analysts based on the content provided."
What is the market capitalization of Alphabet?,\$2182.8 Billion,"The market capitalization of Alphabet Inc. is \$2,182.8 billion."
What is the current dividend yield for Alphabet Inc. (\$GOOGL)?,0.0046,"The passages do not provide any information regarding the current dividend yield for Alphabet Inc. (\$GOOGL). In fact, it appears that GOOGL does not distribute dividends, as the data presented emphasizes other financial metrics without mentioning dividends. Therefore, based on the information given, the current dividend yield for Alphabet Inc. is 0%."
What is the expected EPS for Nvidia in Q1 2026?,0.73 EPS,"The provided passages do not contain any information about Nvidia's expected EPS for Q1 2026. The data primarily focuses on Alphabet Class A (GOOGL) and various financial statistics related to it. To find the expected EPS for Nvidia in Q1 2026, you would need to consult specific reports or sources that provide detailed forecasts for that company."
What is the operating cash flow of Amazon in Q1 2024?,"18,989 Million","The provided passages do not contain information about Amazon's operating cash flow for Q1 2024. They primarily focus on Alphabet Class A (GOOGL) and Accenture's financial metrics. Therefore, it is not possible to answer the question based on the given content."
What are the latest financial metrics for Accenture PLC?,missed consensus forecasts and strong total bookings rising by 22% annually,"The latest financial metrics for Accenture PLC are as follows: 1. **Revenue for FY23**: \$64.1 billion, up 4% on a GAAP basis. 2. **Non-GAAP earnings per diluted share for FY23**: \$11.67, a 9% increase. 3. **Guidance for fiscal 4Q24 revenue**: \$16.05-\$16.65 billion, which translates to an annual increase of 2% to 6% in local currency, with a negative 2% impact from currency translation expected. 4. **FY24 revenue growth guidance**: 1.5%-2.5% in local currency, lowered from earlier guidance o ...... 0. 7. **Operating margin for FY24**: Projected to expand by 10 basis points to 15.5% from 15.4% in FY23. 8. **Free cash flow for FY23**: \$9.0 billion. 9. **Shareholder return forecast for FY24**: At least \$7.7 billion. 10. **Dividend per share**: A quarterly dividend of \$1.29 per share, reflecting a 15% increase announced in September 2023. Additionally, as of the end of 3Q24, Accenture has \$1.68 billion in debt and \$5.54 billion in cash. Cash flow from operations was \$9.52 billion in FY23."
What are the main reasons to buy Nvidia?,"Datacenter, GPUs Demands, Self-driving, and cash-flow","The passages provided do not explicitly discuss the reasons to buy Nvidia. However, it does mention that shares of Nvidia lost ground recently, indicating some volatility in its performance, and it highlights a general positive sentiment towards AI technologies impacting the market, which Nvidia is a significant player in due to its involvement in AI-driven hardware and software solutions. To deduce potential reasons for buying Nvidia based on the information provided: 1. **Strong Market P ...... omfortably ahead for the year, indicating investor confidence that could influence Nvidia positively in the long term. 4. **Future Growth Potential**: Nvidia has a strong potential for growth as businesses increasingly integrate AI into their operations. In summary, while the passages do not provide specific reasons to buy Nvidia, the underlying themes of growth in AI technology, Nvidia’s market position, and the general market performance are favorable indicators for potential investment."


The application works, but, as expected, a few questions couldn't be answered due to the missing documents. As a final step, let's add the remaining documents to our document base, and run the queries again.

## Incremental Updates

Pixeltable's views and computed columns update automatically in response to new data. We can see this when we add the remaining documents to our `documents` table. Watch how the `chunks` view is updated to stay in sync with `documents`:

In [25]:
documents_t.insert({'document': p} for p in document_urls[3:])

Inserting rows into `documents`: 3 rows [00:00, 1949.63 rows/s]
Inserting rows into `chunks`: 68 rows [00:00, 601.57 rows/s]
Inserted 71 rows with 0 errors.


71 rows inserted, 6 values computed.

In [26]:
documents_t.show()

document


(Note: although Pixeltable updates `documents` and `chunks`, it **does not** automatically update the `queries` table. This is by design: we don't want every row in `queries` to be re-executed each time a single new document is added to the document base. However, newly added rows will be run against the new, incrementally updated index.)

To confirm that the `chunks` index got updated, we'll re-run the chunks retrieval query for the question

```What is the expected EPS for Nvidia in Q1 2026?```

Previously, our most similar chunk had a similarity score of ~0.80. Let's see what we get now:

In [27]:
nvidia_eps_query.collect()

similarity,text
0.863,"4 7,192 A 13,507 A 18,120 A 22,103 A 60,922 A EPS Estimates(2)  Q1 Q2 Q3 Q4 Annual* 2026 0.73 E 0.73 E 0.77 E 0.81 E 3.04 E 2025 0.61 A 0.62 E 0.67 E 0.71 E 2.62 E 2024 0.11 A 0.27 A 0.40 A 0.52 A 1.30 A *Quarterly figures may not add up to annual. 1) The data in the charts and tables, except the estimates, is as of 06/11/2024. 2) The report's text, the analyst-provided estimates, and the price target are as of 06/12/2024. Zacks Report Date: June 12, 2024 © 2024 Zacks Investment Research, All Rights Reserved 10 S. Riverside Plaza Suite 1600 · Chicago, IL 60606 Overview NVIDIA Corporation is the worldwide leader in visual computing technologies and the inventor of the graphic processing unit, or GPU. Over the years, the company's focus has evolved from PC graphics to artificial intelligence (AI) based solutions that"
0.855,"and Microsoft's Xbox One will also be going with AMD. NVIDIA also has limited scope for growth in the apps processor market as it is dominated by Apple, Samsung and Qualcomm. We believe that competitive pressure from two CPU vendors, Intel and AMD, who are planning to integrate graphics cores into their chips can negatively impact NVIDIA's revenues in the long haul.  A substantial portion of the company's sales is derived from outside the United States. Sales revenues to customers outside ...... stimates, Revenues Rise Y/Y NVIDIA reported first-quarter fiscal 2025 earnings of \$6.12 per share, which beat the Zacks Consensus Estimate by 11.48% and increased 19% sequentially. Notably, NVDA posted earnings of \$1.09 per share in the year-ago quarter. Revenues jumped 262% year over year to \$26.04 billion and beat the Zacks Consensus Estimate by 7.02%. On a sequential basis, revenues increased 18%. NVIDIA is riding on a strong and innovative portfolio, with the growing adoption of its GPUs"
0.848,"ations for Windows to deliver maximum performance on NVIDIA GeForce RTX AI PCs and workstations. For the Professional Visualization domain, it launched NVIDIA RTX 500 and 1000 professional Ada generation laptop GPUs for AI-enhanced workflows, NVIDIA RTX A400 and A1000 GPUs for desktop workstations and NVIDIA Omniverse Cloud APIs. Operating Details NVIDIA's non-GAAP gross margin increased to 78.9% from 66.8% in the year-ago quarter and 76.7% from the previous quarter, mainly driven by higher ...... ted benefits. However, as a percentage of total revenues, non-GAAP operating expenses declined to 9.6% from 24.3% in the year-ago quarter and 30.7% in the previous quarter. The non-GAAP operating income was \$18.06 billion compared with \$3.05 billion in the year-ago quarter. Sequentially, the figure jumped 22.4%. Balance Sheet and Cash Flow As of Apr 28, 2024, NVDA's cash, cash equivalents and marketable securities were \$31.44 billion, up from \$25.98 billion as of Jan 28, 2024. As of Apr 28,"
0.841,"2024, the total long-term debt was \$8.46 billion, unchanged sequentially. NVIDIA generated \$15.4 billion in operating cash flow, up from the previous quarter's \$11.5 billion. The company ended the fiscal first quarter with a free cash flow of \$14.94 billion. In the fiscal first quarter, it returned \$7.8 billion to shareholders through dividend payouts and share repurchases. Guidance For the second quarter of fiscal 2025, NVIDIA anticipates revenues of \$28 billion (+/-2%), higher than the Zac ...... IGX platform. On Jun 2, NVIDIA announced that the world's leaders in robot development are adopting the NVIDIA Isaac robotics platform for the research, development and production of the next generation of AI-enabled autonomous machines and robots. On Jun 2, NVIDIA announced new NVIDIA RTX technology to power AI assistants and digital humans running on new GeForce RTX AI laptops. On Jun 2, NVIDIA announced the widespread adoption of the NVIDIA Spectrum-X Ethernet networking platform as well"
0.84,"9.78%. NVIDIA Drive Thor solution was adopted by BYD, XPENG, GAC's AION Hyper, Nuro and others in the reported quarter. Lucid and IM Motors are using the NVIDIA DRIVE Orin platform for vehicle models targeting the European market. OEM and Other revenues moved up 1% year over year but declined 13% sequentially to \$78 million. The figure missed the consensus mark by 15.39%. Expanding Portfolio Aids Prospects In the fiscal first quarter, NVIDIA launched the Blackwell platform targeted for AI co ...... ucture. Moreover, the company launched NVIDIA AI Enterprise 5.0 with NVIDIA NIM inference microservices to speed enterprise app development. For the gaming domain, NVIDIA launched AI gaming technologies for NVIDIA ACE and Neural Graphics. Moreover, it unveiled AI performance FY Quarter Ending 1/31/2024 Earnings Reporting Date May 22, 2024 Sales Surprise 7.02% EPS Surprise 11.48% Quarterly EPS 0.61 Annual EPS (TTM) 1.80 Zacks Equity Research www.zacks.com Page 5 of 10 optimizations and integr"


Our most similar chunk now has a score of ~0.86, and the query pulls in more relevant chunks from the newly inserted documents.
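To see why the top score can rise after an insert, here is a minimal NumPy sketch of top-k retrieval. It assumes cosine similarity over embedding vectors; the helper `top_k_cosine` and the 2-D toy vectors are hypothetical and are not part of Pixeltable or of this tutorial's embedding model.

```python
import numpy as np

def top_k_cosine(query_vec, chunk_vecs, k=5):
    """Return (indices, scores) of the k chunks most similar to query_vec."""
    # Normalize so that the dot product equals the cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]

# Toy 2-D "embeddings" (made-up values for illustration only)
query = np.array([1.0, 0.0])
old_chunks = np.array([[1.0, 1.0], [0.0, 1.0]])
new_chunks = np.array([[2.0, 0.5]])  # stands in for a newly inserted document

_, before = top_k_cosine(query, old_chunks, k=1)
_, after = top_k_cosine(query, np.vstack([old_chunks, new_chunks]), k=1)
print(f'best score before: {before[0]:.3f}, after: {after[0]:.3f}')
```

The query and the old chunks are unchanged; the best score improves only because the enlarged index now contains a closer vector.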

Let's recompute the `response` column of the `queries` table, which will automatically recompute the `answer` column as well.

In [28]:
queries_t.recompute_columns('response')

Inserting rows into `queries`: 8 rows [00:00, 2128.01 rows/s]


8 rows updated, 16 values computed.

As a final step, let's look at the answers again and compare them with the expected ones:

In [29]:
queries_t.select(
    queries_t.Question,
    queries_t.correct_answer,
    queries_t.answer
).show()

Question,correct_answer,answer
What is roughly the current mortage rate?,0.07,The current mortgage rate is near 7%.
What is the market capitalization of Alphabet?,\$2182.8 Billion,"The market capitalization of Alphabet Class A (GOOGL) is \$2,182.8 billion."
What is the current dividend yield for Alphabet Inc. (\$GOOGL)?,0.0046,"The provided passages do not mention a current dividend yield for Alphabet Inc. (\$GOOGL). In fact, it appears that the company does not have a dividend yield indicated, as it typically does not pay dividends. For the latest and most accurate information on dividend yields, it is recommended to check financial news sources or the company’s official investor relations page."
What is the overall latest rating for Amazon.com from analysts?,SELL,"The provided passages contain detailed information regarding Alphabet Inc. (GOOGL) but do not include any data related to Amazon.com (AMZN). Therefore, there is no information available about the overall latest rating for Amazon.com from analysts in the content you've shared."
What is the operating cash flow of Amazon in Q1 2024?,"18,989 Million","The provided passages do not contain any information regarding the operating cash flow of Amazon in Q1 2024. They focus on the financial details of Alphabet Class A (GOOGL) and Accenture, but do not mention Amazon or its operating cash flow. Therefore, I am unable to provide the requested information about Amazon's operating cash flow."
What is the expected EPS for Nvidia in Q1 2026?,0.73 EPS,"The passages provided do not contain any information about Nvidia or its expected EPS (earnings per share) for Q1 2026. The information mentioned mainly pertains to Alphabet Class A (GOOGL) and its financial data. Therefore, I cannot provide an expected EPS for Nvidia based on the given content."
What are the main reasons to buy Nvidia?,"Datacenter, GPUs Demands, Self-driving, and cash-flow","The passages provided do not explicitly mention the reasons to buy Nvidia. However, they do highlight general trends in stock performance, particularly how some stocks, including Nvidia, have been affected by broader market factors such as interest rates and AI-related developments. ### From the context around Nvidia: 1. **Market Resilience**: Despite a notable drop in share price (3.5% loss), the major indices were described as comfortably ahead for the year, indicating a potential recover ...... AI technologies, positioning it well in a growing market. 2. **Future Growth Prospects**: As businesses increasingly adopt AI solutions, Nvidia's products will likely remain in high demand. 3. **Resilience in Performance**: Despite temporary market setbacks, Nvidia’s overall performance in the tech sector is usually robust. In summary, the reasons to consider buying Nvidia may revolve around its strong market position in AI, potential for growth, and resilience against market fluctuations."
What are the latest financial metrics for Accenture PLC?,missed consensus forecasts and strong total bookings rising by 22% annually,"The latest financial metrics for Accenture PLC are as follows: 1. **Record revenue for FY23**: \$64.1 billion, which is up 4% on a GAAP basis. 2. **Non-GAAP earnings**: Rose 9% to \$11.67 per diluted share in FY23. 3. **Guidance for fiscal 4Q24 revenue**: Expected to be between \$16.05 billion and \$16.65 billion, which would represent an increase of 2% to 6% annually in local currency, but flat to up 4% on a GAAP basis due to a 2% negative currency translation impact. 4. **FY24 revenue growth ...... or FY24; FY23 free cash flow was \$9.0 billion. 9. **Shareholder returns expectation for FY24**: At least \$7.7 billion to be returned to shareholders. 10. **Cash at the end of 3Q24**: \$5.54 billion, down from \$9.05 billion at the end of FY23. 11. **Debt as of the end of 3Q24**: \$1.68 billion. 12. **Cash flow from operations in FY23**: \$9.52 billion. These metrics reflect Accenture's overall financial health and performance, including revenues, earnings, cash flow, and return to shareholders."
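When reading through a table like this, it can help to flag answers in which the model reports that the retrieved passages lacked the information. The following is a rough sketch in plain Python: the helper `looks_unanswered` and its marker phrases are assumptions chosen to match the wording seen above, not a Pixeltable feature, and `rows` is assumed to be a list of (question, answer) pairs taken from the `queries` table.

```python
# Heuristic marker phrases (an assumption; tune them to the model's wording)
NO_ANSWER_MARKERS = (
    'do not contain',
    'do not mention',
    'do not include',
    'do not explicitly mention',
)

def looks_unanswered(answer: str) -> bool:
    """Return True if the answer text signals that the context lacked the information."""
    text = answer.lower()
    return any(marker in text for marker in NO_ANSWER_MARKERS)

# Two (question, answer) pairs copied from the output above; the second answer is shortened
rows = [
    ('What is roughly the current mortage rate?',
     'The current mortgage rate is near 7%.'),
    ('What is the expected EPS for Nvidia in Q1 2026?',
     'The passages provided do not contain any information about Nvidia'),
]

unanswered = [question for question, answer in rows if looks_unanswered(answer)]
print(unanswered)
```

A phrase match like this is only a first pass; an answer can avoid every marker phrase and still disagree with `correct_answer`, so a manual comparison remains useful.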
