In [1]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# BigFrames AI Operator Tutorial

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/googleapis/python-bigquery-dataframes/blob/main/notebooks/experimental/ai_operators.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/experimental/ai_operators.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/bigquery/import?url=https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/experimental/ai_operators.ipynb">
      <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTW1gvOovVlbZAIZylUtf5Iu8-693qS1w5NJw&s" alt="BQ logo" width="35">
      Open in BQ Studio
    </a>
  </td>
</table>

This notebook provides a hands-on preview of AI operator APIs powered by the Gemini model.

The notebook is divided into two sections. The first section introduces the API syntax with examples, aiming to familiarize you with how AI operators work. The second section applies AI operators to a large real-world dataset and presents performance statistics.

This work is inspired by [this paper](https://arxiv.org/pdf/2407.11418) and powered by BigQuery ML and Vertex AI.

# Preparation

First, import the BigFrames modules.



In [2]:
import bigframes
import bigframes.pandas as bpd

Make sure the BigFrames version is at least `1.42.0`.

In [3]:
from packaging.version import Version

assert Version(bigframes.__version__) >= Version("1.42.0")

Turn on the AI operator experiment. You will see a warning that these operators are still experimental. If you don't turn on the experiment before using the operators, you will get a `NotImplementedError`.

In [4]:
bigframes.options.experiments.ai_operators = True

Specify your GCP project and location.

In [5]:
bpd.options.bigquery.project = 'bigframes-dev'
bpd.options.bigquery.location = 'US'

**Optional**: turn off the progress bar so that only the operation results are printed.

In [6]:
bpd.options.display.progress_bar = None

Create LLM instances. They will be passed in as parameters for each AI operator.

This tutorial uses the "gemini-2.0-flash-001" model for text generation and "text-embedding-005" for embedding. While these are recommended, you can choose [other Vertex AI LLM models](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models) based on your needs and availability. Ensure you have [sufficient quota](https://cloud.google.com/vertex-ai/generative-ai/docs/quotas) for your chosen models and adjust it if necessary.

In [7]:
from bigframes.ml import llm
gemini_model = llm.GeminiTextGenerator(model_name="gemini-2.0-flash-001")
text_embedding_model = llm.TextEmbeddingGenerator(model_name="text-embedding-005")

**Note**: AI operators can be expensive over a large set of data. For this reason, version `1.42.0` added the `bigframes.options.compute.ai_ops_confirmation_threshold` option, which makes BigFrames ask for your confirmation when the amount of data to be processed is too large. If the number of rows exceeds your threshold, you are prompted for keyboard input: 'y' to proceed and 'n' to abort. If you abort the operation, no LLM processing is done.

The default threshold is 0, which means the operators always ask for confirmation. You are free to adjust the value as needed. You can also set the threshold to `None` to disable this feature.

In [8]:
if Version(bigframes.__version__) >= Version("1.42.0"):
    bigframes.options.compute.ai_ops_confirmation_threshold = 1000

If you would like your operations to fail automatically when the data is too large, set `bigframes.options.compute.ai_ops_threshold_autofail` to `True`:

In [9]:
# if Version(bigframes.__version__) >= Version("1.42.0"):
#     bigframes.options.compute.ai_ops_threshold_autofail = True
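
To make the interaction between the threshold and the autofail option concrete, here is a minimal sketch of the decision described above. It is not the BigFrames implementation: `check_threshold`, its return values, and the exception type are hypothetical names chosen only for illustration.

```python
from typing import Optional


def check_threshold(row_count: int, threshold: Optional[int], autofail: bool = False) -> str:
    """Decide what happens before `row_count` rows are sent to the LLM.

    Hypothetical helper for illustration; not part of BigFrames.
    """
    # A threshold of None disables the check entirely.
    if threshold is None or row_count <= threshold:
        return "proceed"
    # Above the threshold: either fail right away or ask the user.
    if autofail:
        raise ValueError(f"{row_count} rows exceed the threshold of {threshold}")
    return "ask"


print(check_threshold(3, threshold=0))        # ask
print(check_threshold(3, threshold=1000))     # proceed
print(check_threshold(5000, threshold=None))  # proceed
```

With the default threshold of 0, any non-empty input falls into the "ask" branch, which matches the "always ask" behavior described above.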

# API Samples

You will learn about each AI operator by trying some examples.

## AI Filtering

AI filtering allows you to filter your dataframe based on the instruction (i.e. prompt) you provide.

First, create a dataframe:

In [10]:
df = bpd.DataFrame({'country': ['USA', 'Germany', 'Japan'], 'city': ['Seattle', 'Berlin', 'Kyoto']})
df

Unnamed: 0,country,city
0,USA,Seattle
1,Germany,Berlin
2,Japan,Kyoto


Now, filter this dataframe by keeping only the rows where the value in the `city` column is the capital of the value in the `country` column. To reference a column in your instruction, wrap its name in a pair of braces. In this example, your instruction should look like this:
```
The {city} is the capital of the {country}.
```

Note that this is not a Python f-string, so you shouldn't prefix your instruction with an `f`.

In [11]:
df.ai.filter("The {city} is the capital of the {country}", model=gemini_model)

`db_dtypes` is a preview feature and subject to change.


Unnamed: 0,country,city
1,Germany,Berlin


The filter operator extracts the values from the referenced columns to enrich your instruction with context. The instruction is then sent to the designated model for evaluation. For filtering operations, the LLM is asked to return only `True` or `False` for each row, and the operator removes the rows accordingly.
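
The flow can be sketched in plain Python. This is only an illustration of the idea, assuming the column references are filled in by simple text substitution; the real operator's prompt construction may differ. `render` and `stand_in_model` are hypothetical names, and the "model" is a hard-coded lookup rather than an LLM.

```python
import re

rows = [
    {"country": "USA", "city": "Seattle"},
    {"country": "Germany", "city": "Berlin"},
    {"country": "Japan", "city": "Kyoto"},
]


def render(instruction: str, row: dict) -> str:
    # Replace each {column} reference with that row's value.
    return re.sub(r"\{(\w+)\}", lambda m: str(row[m.group(1)]), instruction)


def stand_in_model(prompt: str) -> bool:
    # Hypothetical stand-in for the LLM: a lookup of statements known to be true.
    return prompt in {"The Berlin is the capital of the Germany"}


instruction = "The {city} is the capital of the {country}"
kept = [row for row in rows if stand_in_model(render(instruction, row))]
print(kept)  # [{'country': 'Germany', 'city': 'Berlin'}]
```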

## AI Mapping

AI mapping allows you to combine values from multiple columns into a single output based on your instruction.

Here is an example:

In [12]:
df = bpd.DataFrame({
    "ingredient_1": ["Bun", "Soy Bean", "Sausage"],
    "ingredient_2": ["Beef Patty", "Bittern", "Long Bread"]
    })
df

Unnamed: 0,ingredient_1,ingredient_2
0,Bun,Beef Patty
1,Soy Bean,Bittern
2,Sausage,Long Bread


Now, ask the LLM what kind of food can be made from the two ingredients in each row. The column reference syntax in your instruction stays the same. In addition, you need to specify the output column name.

If you are using BigFrames version `2.5.0` or later, the column name is specified with the `output_schema` parameter. This parameter expects a dictionary input in the form of `{'col_name': 'type_name'}`.

In [13]:
df.ai.map("What is the food made from {ingredient_1} and {ingredient_2}? One word only.", model=gemini_model, output_schema={"food": "string"})

`db_dtypes` is a preview feature and subject to change.


Unnamed: 0,ingredient_1,ingredient_2,food
0,Bun,Beef Patty,Hamburger
1,Soy Bean,Bittern,Tofu
2,Sausage,Long Bread,Hotdog


If you are using BigFrames version `2.4.0` or earlier, the column name is specified with the `output_column` parameter. The outputs are always strings.

In [None]:
# df.ai.map("What is the food made from {ingredient_1} and {ingredient_2}? One word only.", output_column="food", model=gemini_model)

`db_dtypes` is a preview feature and subject to change.


Unnamed: 0,ingredient_1,ingredient_2,food
0,Bun,Beef Patty,Burger
1,Soy Bean,Bittern,Tofu
2,Sausage,Long Bread,Hotdog
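
If the same notebook has to run on both sides of the `2.5.0` boundary, one option is to pick the keyword argument at runtime. The sketch below is an assumption-laden illustration: `parse_version` and `map_output_kwargs` are hypothetical helpers, only plain numeric versions such as `2.5.0` are handled, and every version below `2.5.0` is treated as using `output_column`. `packaging.version.Version`, used earlier in this notebook, copes with pre-release tags more robustly.

```python
def parse_version(version: str) -> tuple:
    # "2.5.0" -> (2, 5, 0); assumes purely numeric components.
    return tuple(int(part) for part in version.split(".")[:3])


def map_output_kwargs(bigframes_version: str, column_name: str) -> dict:
    """Hypothetical helper: build the output argument for `df.ai.map`."""
    if parse_version(bigframes_version) >= (2, 5, 0):
        # Newer versions take a {'col_name': 'type_name'} dictionary.
        return {"output_schema": {column_name: "string"}}
    # Older versions take the column name only; outputs are strings.
    return {"output_column": column_name}


print(map_output_kwargs("2.5.0", "food"))  # {'output_schema': {'food': 'string'}}
print(map_output_kwargs("2.4.0", "food"))  # {'output_column': 'food'}
```

The result could then be unpacked into the call, for example `df.ai.map(instruction, model=gemini_model, **map_output_kwargs(bigframes.__version__, "food"))`.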


## AI Joining

AI joining can join two dataframes based on the instruction you provided.

First, you prepare two dataframes:

In [14]:
cities = bpd.DataFrame({'city': ['Seattle', 'Ottawa', 'Berlin', 'Shanghai', 'New Delhi']})
continents = bpd.DataFrame({'continent': ['North America', 'Africa', 'Asia']})

You want to join `cities` with `continents` to form a new dataframe in which, for each row, the city from the `cities` dataframe is in the continent from the `continents` dataframe. You can reuse the column reference syntax introduced earlier:

In [15]:
cities.ai.join(continents, "{city} is in {continent}", model=gemini_model)

`db_dtypes` is a preview feature and subject to change.


Unnamed: 0,city,continent
0,Seattle,North America
1,Ottawa,North America
2,Shanghai,Asia
3,New Delhi,Asia


!! **Important:** AI join can trigger prohibitively expensive operations! This operation first cross joins the two dataframes, then invokes AI filter on each row. That means if you have two dataframes of sizes `M` and `N`, the total number of queries sent to the LLM is on the scale of `M * N`.
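
To see where the `M * N` figure comes from, here is a small plain-Python sketch that enumerates the prompts a cross join would produce for the two example dataframes. It assumes one prompt per pair and uses an f-string only as a stand-in for the operator's own column substitution.

```python
from itertools import product

cities = ["Seattle", "Ottawa", "Berlin", "Shanghai", "New Delhi"]
continents = ["North America", "Africa", "Asia"]

# The cross join pairs every left row with every right row.
pairs = list(product(cities, continents))
prompts = [f"{city} is in {continent}" for city, continent in pairs]

print(len(prompts))  # 15, i.e. 5 * 3
print(prompts[0])    # Seattle is in North America

# The same arithmetic for two dataframes of 1,000 rows each.
print(1_000 * 1_000)  # 1000000
```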

### Self Joins

This self-join example demonstrates a special case: what happens when the joining columns exist in both dataframes? In that case, you need to disambiguate your column references by attaching the `left.` and `right.` prefixes to your column names.

Create an example dataframe:

In [16]:
animals = bpd.DataFrame({'animal': ['cow', 'cat', 'spider', 'elephant']})

You want to compare the weights of these animals, and output all the pairs where the animal on the left is heavier than the animal on the right. In this case, you use `left.animal` and `right.animal` to differentiate the data sources:

In [17]:
animals.ai.join(animals, "{left.animal} generally weighs heavier than {right.animal}", model=gemini_model)

`db_dtypes` is a preview feature and subject to change.


Unnamed: 0,animal_left,animal_right
0,cow,cat
1,cow,spider
2,cat,spider
3,elephant,cow
4,elephant,cat
5,elephant,spider
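
As a rough illustration of how the `left.` and `right.` prefixes pick a side, the sketch below resolves prefixed references against a pair of rows. `render_join` is a hypothetical helper, not the operator's code, and the sketch assumes a plain cross join that also pairs each row with itself.

```python
import re
from itertools import product

animals = [{"animal": name} for name in ["cow", "cat", "spider", "elephant"]]


def render_join(instruction: str, left_row: dict, right_row: dict) -> str:
    sides = {"left": left_row, "right": right_row}

    def resolve(match):
        # group(1) is the side ("left" or "right"), group(2) is the column name.
        return str(sides[match.group(1)][match.group(2)])

    return re.sub(r"\{(left|right)\.(\w+)\}", resolve, instruction)


instruction = "{left.animal} generally weighs heavier than {right.animal}"
prompts = [render_join(instruction, l, r) for l, r in product(animals, animals)]

print(len(prompts))  # 16
print(prompts[1])    # cow generally weighs heavier than cat
```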


## AI Top K

AI Top K selects the top K values based on your instruction. Here is an example:

In [18]:
df = bpd.DataFrame({"Animals": ["Corgi", "Orange Cat", "Parrot", "Tarantula"]})

You want to find the top two most popular pets:

In [19]:
df.ai.top_k("{Animals} are more popular as pets", model=gemini_model, k=2)

`db_dtypes` is a preview feature and subject to change.


Unnamed: 0,Animals
0,Corgi
1,Orange Cat


Under the hood, the AI top K operator performs pairwise comparisons with the LLM. The top K results are returned in the order of their indices instead of their ranks.
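
The difference between rank order and index order can be shown with a small sketch. The popularity scores are made up and stand in for the LLM's pairwise judgement, and sorting with a comparator is only a convenient stand-in, since the operator's actual selection algorithm is not described here.

```python
from functools import cmp_to_key

animals = ["Corgi", "Orange Cat", "Parrot", "Tarantula"]

# Made-up scores standing in for the LLM's pairwise judgement.
popularity = {"Corgi": 3, "Orange Cat": 4, "Parrot": 2, "Tarantula": 1}


def compare(i: int, j: int) -> int:
    # Negative when row i should rank ahead of row j.
    return popularity[animals[j]] - popularity[animals[i]]


def top_k_indices(k: int) -> list:
    ranked = sorted(range(len(animals)), key=cmp_to_key(compare))
    # `ranked` is in rank order; the result is reported in index order.
    return sorted(ranked[:k])


ranked_first = sorted(range(len(animals)), key=cmp_to_key(compare))[:2]
print(ranked_first)                            # [1, 0]
print([animals[i] for i in top_k_indices(2)])  # ['Corgi', 'Orange Cat']
```

Even though "Orange Cat" ranks first under these made-up scores, "Corgi" is listed first because it has the smaller index.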

## AI Search

AI search finds the values most similar to your query within a single column. Here is an example:

In [20]:
df = bpd.DataFrame({"creatures": ["salmon", "sea urchin", "baboons", "frog", "chimpanzee"]})
df

Unnamed: 0,creatures
0,salmon
1,sea urchin
2,baboons
3,frog
4,chimpanzee


You want to get the top 2 creatures that are most similar to "monkey":

In [21]:
df.ai.search("creatures", query="monkey", top_k = 2, model = text_embedding_model, score_column='similarity score')

`db_dtypes` is a preview feature and subject to change.


Unnamed: 0,creatures,similarity score
2,baboons,0.708434
4,chimpanzee,0.635844


Note that you are using a text embedding model this time. This model generates embedding vectors for both your query and the values in the search space. The operator then uses BigQuery's built-in `VECTOR_SEARCH` function to find the nearest neighbors of your query.

In addition, `score_column` is an optional parameter for storing the distances between the results and your query. If not set, the score column won't be attached to the result.
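
The nearest-neighbor idea can be sketched with brute force in plain Python. Everything specific here is an assumption for illustration: the two-dimensional vectors are invented rather than produced by an embedding model, and cosine distance is used without any claim that it is the metric the operator uses.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-D "embeddings", invented for illustration only.
embeddings = {
    0: ("salmon", [0.9, 0.1]),
    1: ("sea urchin", [0.8, 0.3]),
    2: ("baboons", [0.1, 0.9]),
    3: ("frog", [0.5, 0.5]),
    4: ("chimpanzee", [0.2, 0.8]),
}
query_vector = [0.0, 1.0]  # pretend embedding of "monkey"

# Score every row against the query, then keep the two smallest distances.
scored = sorted(
    (cosine_distance(vector, query_vector), index, name)
    for index, (name, vector) in embeddings.items()
)
top_2 = scored[:2]
print([name for _, _, name in top_2])  # ['baboons', 'chimpanzee']
```

Each tuple in `top_2` also carries the distance, which plays the role of the optional score column.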

## AI Similarity Join

When you want to perform multiple similarity queries in the same value space, you could use similarity join to simplify your call. For example:

In [22]:
df1 = bpd.DataFrame({'animal': ['monkey', 'spider', 'salmon', 'giraffe', 'sparrow']})
df2 = bpd.DataFrame({'animal': ['scorpion', 'baboon', 'owl', 'elephant', 'tuna']})

In this example, you want to pick the most related animal from `df2` for each value in `df1`.

In [23]:
df1.ai.sim_join(df2, left_on='animal', right_on='animal', top_k=1, model=text_embedding_model, score_column='distance')

`db_dtypes` is a preview feature and subject to change.


Unnamed: 0,animal,animal_1,distance
0,monkey,baboon,0.620521
1,spider,scorpion,0.728024
2,salmon,tuna,0.782141
3,giraffe,elephant,0.7135
4,sparrow,owl,0.810864


!! **Important:** Like AI join, this operator can also be very expensive. To guard against unexpected processing of large datasets, use the `bigframes.options.compute.ai_ops_confirmation_threshold` option to specify a threshold.

# Performance Analyses

In this section, you will use BigQuery's public Hacker News dataset to perform some heavy work. We recommend that you read the code without executing it in order to save time and money. The execution results are attached after each cell for your reference.

First, load 3k rows from the table:

In [24]:
hacker_news = bpd.read_gbq("bigquery-public-data.hacker_news.full")[['title', 'text', 'by', 'score', 'timestamp', 'type']].head(3000)
hacker_news

Unnamed: 0,title,text,by,score,timestamp,type
0,,,,,2010-04-16 19:52:51+00:00,comment
1,,I&#x27;d agree about border control with a cav...,bandrami,,2023-06-04 06:12:00+00:00,comment
2,,So 4 pickups? At least pickups are high margin...,seanmcdirmid,,2023-09-19 14:19:46+00:00,comment
3,Workplace Wellness Programs Don’t Work Well. W...,,anarbadalov,2.0,2018-08-07 12:17:45+00:00,story
4,,Are you implying that to be a good developer y...,ecesena,,2016-06-10 19:38:25+00:00,comment
5,,It pretty much works with other carriers. My s...,toast0,,2024-08-13 03:11:32+00:00,comment
6,,,,,2020-06-07 22:43:03+00:00,comment
7,,&quot;not operated for profit&quot; and &quot;...,radford-neal,,2020-03-19 00:24:47+00:00,comment
8,,It&#x27;s a good description of one applicatio...,dkarl,,2024-10-07 13:38:18+00:00,comment
9,,"Might be a bit high, but....<p><i>&quot;For ex...",tyingq,,2017-01-23 19:49:15+00:00,comment


Then, keep only the rows that have text content:

In [25]:
hacker_news_with_texts = hacker_news[hacker_news['text'].notnull()]
len(hacker_news_with_texts)

2533

You can get an idea of the input token length by calculating the average string length.

In [26]:
hacker_news_with_texts['text'].str.len().mean()

393.2356889064355
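
To turn the average string length into a rough token figure, you could apply the common rule of thumb of about four characters per token. This ratio is an assumption, not a measurement: actual token counts depend on the model's tokenizer and on the text itself.

```python
avg_chars = 393.2356889064355  # mean text length reported above
row_count = 2533               # rows with text content
CHARS_PER_TOKEN = 4            # rough heuristic, an assumption

avg_tokens = avg_chars / CHARS_PER_TOKEN
total_tokens = avg_tokens * row_count

print(round(avg_tokens))  # 98
print(total_tokens)       # rough total of input tokens across all rows
```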

**Optional**: You can raise the confirmation threshold for a smoother experience.

In [None]:
if Version(bigframes.__version__) >= Version("1.42.0"):
    bigframes.options.compute.ai_ops_confirmation_threshold = 5000

Now it's the LLM's turn. You want to keep only the rows whose texts are about the iPhone. This will take several minutes to finish.

In [28]:
iphone_comments = hacker_news_with_texts.ai.filter("The {text} is mainly focused on iPhone", gemini_model)
iphone_comments

`db_dtypes` is a preview feature and subject to change.


Unnamed: 0,title,text,by,score,timestamp,type
445,,"If I want to manipulate a device, I&#x27;ll bu...",exelius,,2017-09-21 17:39:37+00:00,comment
967,,"<a href=""https:&#x2F;&#x2F;archive.ph&#x2F;nnE...",blinding-streak,,2023-04-30 19:10:16+00:00,comment
975,,I&#x27;ve had my 6S Plus now for 36 months and...,throwaway427,,2019-01-03 18:06:33+00:00,comment
1253,,Apple is far more closed and tyrannical with i...,RyanMcGreal,,2012-12-21 00:45:40+00:00,comment
1274,,An iOS version was released earlier this year....,pls2halp,,2017-12-09 06:36:41+00:00,comment
1548,,I’m not sure how that fits with Apple pursuing...,alphabettsy,,2021-12-26 19:41:38+00:00,comment
1630,,"Not sure if you’re being ironic, but I use an ...",lxgr,,2025-03-29 03:57:25+00:00,comment
1664,,Quoting from the article I linked you:<p>&gt;&...,StreamBright,,2017-09-11 19:57:34+00:00,comment
1884,,"&gt; Not all wireless headsets are the same, h...",cptskippy,,2021-11-16 13:28:44+00:00,comment
2251,,"Will not buy any more apple product, iphone 4s...",omi,,2012-09-11 14:42:52+00:00,comment


The performance of the AI operators depends on the length of your input as well as your quota. Here are our benchmarks for running the previous operation with Gemini Flash 1.5 over data of different sizes, assuming your quota is [the default 200 requests per minute](https://cloud.google.com/vertex-ai/generative-ai/docs/quotas):

* 800 Rows -> ~4m
* 2550 Rows -> ~13m
* 8500 Rows -> ~40m

These numbers can give you a general idea of how fast the operators run.
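
As a back-of-the-envelope check, the benchmarks are close to what you get by dividing the row count by the quota. The sketch assumes one LLM request per row and that the quota is the only bottleneck, so treat the result as a lower bound rather than a prediction. `estimated_minutes` is a hypothetical helper.

```python
def estimated_minutes(rows: int, requests_per_minute: int) -> float:
    # Assumes one LLM request per row and a quota-bound workload.
    return rows / requests_per_minute


for rows in (800, 2550, 8500):
    print(rows, estimated_minutes(rows, 200))
# 800 4.0
# 2550 12.75
# 8500 42.5
```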

Now, use the LLM to summarize the sentiments towards iPhone:

In [29]:
iphone_comments.ai.map("Summarize the sentiment of the {text}. Your answer should have at most 3 words", output_column="sentiment", model=gemini_model)

`db_dtypes` is a preview feature and subject to change.


Unnamed: 0,title,text,by,score,timestamp,type,sentiment
445,,"If I want to manipulate a device, I&#x27;ll bu...",exelius,,2017-09-21 17:39:37+00:00,comment,"Pragmatic, slightly annoyed"
967,,"<a href=""https:&#x2F;&#x2F;archive.ph&#x2F;nnE...",blinding-streak,,2023-04-30 19:10:16+00:00,comment,I lack the ability to access external websites...
975,,I&#x27;ve had my 6S Plus now for 36 months and...,throwaway427,,2019-01-03 18:06:33+00:00,comment,"Generally positive, impressed."
1253,,Apple is far more closed and tyrannical with i...,RyanMcGreal,,2012-12-21 00:45:40+00:00,comment,Negative towards Apple
1274,,An iOS version was released earlier this year....,pls2halp,,2017-12-09 06:36:41+00:00,comment,"Neutral, factual statement."
1548,,I’m not sure how that fits with Apple pursuing...,alphabettsy,,2021-12-26 19:41:38+00:00,comment,Skeptical and critical.
1630,,"Not sure if you’re being ironic, but I use an ...",lxgr,,2025-03-29 03:57:25+00:00,comment,"Wants interoperability, frustrated."
1664,,Quoting from the article I linked you:<p>&gt;&...,StreamBright,,2017-09-11 19:57:34+00:00,comment,Extremely positive review
1884,,"&gt; Not all wireless headsets are the same, h...",cptskippy,,2021-11-16 13:28:44+00:00,comment,Skeptical and critical
2251,,"Will not buy any more apple product, iphone 4s...",omi,,2012-09-11 14:42:52+00:00,comment,"Negative, regretful."


Here is another example: find the rows whose authors have animals in their names.

In [30]:
hacker_news = bpd.read_gbq("bigquery-public-data.hacker_news.full")[['title', 'text', 'by', 'score', 'timestamp', 'type']].head(3000)
hacker_news


Unnamed: 0,title,text,by,score,timestamp,type
0,,,,,2010-04-16 19:52:51+00:00,comment
1,,I&#x27;d agree about border control with a cav...,bandrami,,2023-06-04 06:12:00+00:00,comment
2,,So 4 pickups? At least pickups are high margin...,seanmcdirmid,,2023-09-19 14:19:46+00:00,comment
3,Workplace Wellness Programs Don’t Work Well. W...,,anarbadalov,2.0,2018-08-07 12:17:45+00:00,story
4,,Are you implying that to be a good developer y...,ecesena,,2016-06-10 19:38:25+00:00,comment
5,,It pretty much works with other carriers. My s...,toast0,,2024-08-13 03:11:32+00:00,comment
6,,,,,2020-06-07 22:43:03+00:00,comment
7,,&quot;not operated for profit&quot; and &quot;...,radford-neal,,2020-03-19 00:24:47+00:00,comment
8,,It&#x27;s a good description of one applicatio...,dkarl,,2024-10-07 13:38:18+00:00,comment
9,,"Might be a bit high, but....<p><i>&quot;For ex...",tyingq,,2017-01-23 19:49:15+00:00,comment


In [31]:
hacker_news.ai.filter("{by} contains animal name", model=gemini_model)

`db_dtypes` is a preview feature and subject to change.


Unnamed: 0,title,text,by,score,timestamp,type
15,,&gt; Just do what most American cities do with...,AnthonyMouse,,2021-10-04 23:10:50+00:00,comment
16,,It&#x27;s not a space. The l and the C are at ...,antninja,,2013-07-13 09:48:34+00:00,comment
23,,I wish this would happen. There&#x27;s a &quo...,coredog64,,2018-02-12 16:03:37+00:00,comment
27,,"Flash got close, but was too complex and expen...",surfingdino,,2024-05-08 05:02:37+00:00,comment
36,,I think the &quot;algo genius&quot; type of de...,poisonborz,,2024-06-04 07:39:08+00:00,comment
150,,No one will be doing anything practical with a...,NeutralCrane,,2025-02-01 14:26:25+00:00,comment
160,,I think this is more semantics than anything.<...,superb-owl,,2022-06-08 16:55:54+00:00,comment
205,,Interesting to think of sign language localisa...,robin_reala,,2019-02-01 11:49:23+00:00,comment
231,,Probably because of their key location.,ape4,,2014-08-29 14:55:40+00:00,comment
250,,"I realize this is a bit passe, but there were ...",FeepingCreature,,2023-10-15 11:32:44+00:00,comment


Here are the runtime numbers with a [raised quota](https://cloud.google.com/vertex-ai/generative-ai/docs/quotas) of 500 requests per minute:
* 3000 rows -> ~6m
* 10000 rows -> ~26m