In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# BigFrames Semantic Operator Demo

We implemented the semantic operators based on ideas from the "Lotus" paper: https://arxiv.org/pdf/2407.11418.

This notebook gives you a hands-on preview of the semantic operator APIs, which are powered by LLMs. The demonstration is divided into two sections:

The first section introduces the API syntax with some simple examples. We aim to get you familiar with how BigFrames semantic operators work. 

The second section applies semantic operators to large real-world datasets. The examples are designed to benchmark the performance of the operators and, perhaps, to spark some ideas for your next application.

You can open this notebook on Google Colab [here](https://colab.research.google.com/github/googleapis/python-bigquery-dataframes/blob/main/notebooks/experimental/semantic_operators.ipynb).

Without further ado, let's get started.

# Preparation

First, let's import BigFrames packages.

In [1]:
import bigframes
import bigframes.pandas as bpd

Make sure the BigFrames version is at least `1.22.0`

In [2]:
from packaging.version import Version

assert Version(bigframes.__version__) >= Version("1.22.0")

Turn on the semantic operator experiment. You will see a warning saying that these operators are still experimental. This step is required; without it, calling these operators raises `NotImplementedError`.

In [3]:
bigframes.options.experiments.semantic_operators = True



Optional: turn off the progress bar so that only the operation results are printed.

In [4]:
# bpd.options.display.progress_bar = None

Let's also create some LLM instances for these operators. They will be passed in as parameters in each method call.

In [5]:
import bigframes.ml.llm as llm
gemini_model = llm.GeminiTextGenerator(model_name=llm._GEMINI_1P5_FLASH_001_ENDPOINT)
text_embedding_model = llm.TextEmbeddingGenerator(model_name="text-embedding-005")



# API Syntax

In this section we will go through the semantic operator APIs with small examples.

## Semantic Filtering

Semantic filtering allows you to filter your dataframe based on the instruction (i.e. prompt) you provided. Let's first create a small dataframe:

In [6]:
df = bpd.DataFrame({'country': ['USA', 'Germany', 'Japan'], 'city': ['Seattle', 'Berlin', 'Kyoto']})
df

Unnamed: 0,country,city
0,USA,Seattle
1,Germany,Berlin
2,Japan,Kyoto


Now, let's filter this dataframe by keeping only the rows where the value in the `city` column is the capital of the value in the `country` column. You reference a column by wrapping its name in a pair of braces in your instruction. In this example, our instruction should look like this:
```
The {city} is the capital of the {country}.
```

Note that this is not a Python f-string, so you shouldn't prefix your instruction with an `f`. Let's give it a try:

In [7]:
df.semantics.filter("The {city} is the capital of the {country}", model=gemini_model)



Unnamed: 0,country,city
1,Germany,Berlin


The filter operator extracts the values from the referenced columns to enrich your instruction with context. The instruction is then sent to the designated model for evaluation. For filtering operations, the LLM is asked to return only `True` or `False` for each row, and the operator removes the rows accordingly.
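
The steps above can be sketched in plain Python. This is only an illustration of the idea, not the BigFrames implementation: `referenced_columns`, `render_instruction`, and `semantic_filter_sketch` are hypothetical helper names, and `fake_llm_judge` is a hard-coded stand-in for the model.

```python
import string


def referenced_columns(instruction):
    """Return the column names that appear inside braces."""
    return [name for _, name, _, _ in string.Formatter().parse(instruction) if name]


def render_instruction(instruction, row):
    """Fill each column reference with the value from one row."""
    values = {col: row[col] for col in referenced_columns(instruction)}
    return instruction.format(**values)


def fake_llm_judge(prompt):
    """Hypothetical stand-in for the LLM: answers True or False per prompt."""
    known_true = {"The Berlin is the capital of the Germany"}
    return prompt in known_true


def semantic_filter_sketch(rows, instruction, judge):
    """Keep only the rows for which the judge answers True."""
    return [row for row in rows if judge(render_instruction(instruction, row))]


rows = [
    {"country": "USA", "city": "Seattle"},
    {"country": "Germany", "city": "Berlin"},
    {"country": "Japan", "city": "Kyoto"},
]
instruction = "The {city} is the capital of the {country}"
kept = semantic_filter_sketch(rows, instruction, fake_llm_judge)
print(kept)
```

Because the instruction is an ordinary string, the braces survive until the operator fills them in row by row, which is why an `f` prefix would be wrong here.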

## Semantic Mapping

Semantic mapping allows you to combine values from multiple columns into a single output based on your instruction. To demonstrate this, let's create an example dataframe:

In [8]:
df = bpd.DataFrame({
    "ingredient_1": ["Bun", "Soy Bean", "Sausage"], 
    "ingredient_2": ["Beef Patty", "Bittern", "Long Bread"]
    })
df

Unnamed: 0,ingredient_1,ingredient_2
0,Bun,Beef Patty
1,Soy Bean,Bittern
2,Sausage,Long Bread


Now, let's ask the LLM what kind of food can be made from the two ingredients in each row. The column reference syntax in your instruction stays the same. In addition, you need to name the column that will hold the mapping results by setting the `output_column` parameter.

In [9]:
df.semantics.map("What is the food made from {ingredient_1} and {ingredient_2}? One word only.", output_column="food", model=gemini_model)



Unnamed: 0,ingredient_1,ingredient_2,food
0,Bun,Beef Patty,Burger
1,Soy Bean,Bittern,Tofu
2,Sausage,Long Bread,Hotdog


The mechanism behind semantic mapping is very similar to that of semantic filtering. The one major difference: instead of asking the LLM to reply true or false for each row, the operator lets the LLM reply with free-form strings and attaches them to the dataframe as a new column.

## Semantic Joining

Semantic joining can join two dataframes based on the instruction you provided. First, let's prepare two dataframes.

In [10]:
cities = bpd.DataFrame({'city': ['Seattle', 'Ottawa', 'Berlin', 'Shanghai', 'New Delhi']})
continents = bpd.DataFrame({'continent': ['North America', 'Africa', 'Asia']})

We want to join `cities` with `continents` to form a new dataframe such that, in each row, the city from the `cities` dataframe is in the continent from the `continents` dataframe. We can reuse the column reference syntax introduced earlier:

In [11]:
cities.semantics.join(continents, "{city} is in {continent}", model=gemini_model)



Unnamed: 0,city,continent
0,Seattle,North America
1,Ottawa,North America
2,Shanghai,Asia
3,New Delhi,Asia


!! **Important:** Semantic join can trigger prohibitively expensive operations! This operation first cross joins the two dataframes, then invokes semantic filter on each row. That means if you have two dataframes of sizes `M` and `N`, the total number of queries sent to the LLM is on the scale of `M * N`. Therefore, we have added a parameter `max_rows`, a threshold that guards against unexpectedly expensive calls. With this parameter, the operator first calculates the size of your cross-joined data and compares it with the threshold. If the size exceeds the threshold, the function aborts early with a `ValueError`. You can manually set the value of `max_rows` to raise or lower the threshold.
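
To make the cost concrete, here is a minimal sketch of the "check the size, then cross join, then filter" flow in plain Python. It is not the BigFrames implementation: `semantic_join_sketch` and `fake_llm_judge` are hypothetical names, the lookup table stands in for the LLM, and the default of `1000` for `max_rows` is an arbitrary value chosen for this sketch.

```python
from itertools import product


def semantic_join_sketch(left_rows, right_rows, judge, max_rows=1000):
    """Cross join two lists, then keep the pairs that the judge accepts."""
    joined_size = len(left_rows) * len(right_rows)
    # Guard: refuse to start if the cross join is larger than the threshold.
    if joined_size > max_rows:
        raise ValueError(
            f"Cross join would produce {joined_size} rows, above max_rows={max_rows}."
        )
    return [(l, r) for l, r in product(left_rows, right_rows) if judge(l, r)]


call_log = []  # records every simulated LLM call


def fake_llm_judge(city, continent):
    """Hypothetical stand-in for the LLM, backed by a hard-coded lookup."""
    call_log.append((city, continent))
    facts = {
        ("Seattle", "North America"),
        ("Ottawa", "North America"),
        ("Shanghai", "Asia"),
        ("New Delhi", "Asia"),
    }
    return (city, continent) in facts


cities = ["Seattle", "Ottawa", "Berlin", "Shanghai", "New Delhi"]
continents = ["North America", "Africa", "Asia"]

pairs = semantic_join_sketch(cities, continents, fake_llm_judge)
print(len(call_log), "judge calls for", len(cities), "x", len(continents), "rows")

# With a threshold below M * N, the sketch aborts before any judge call is made.
try:
    semantic_join_sketch(cities, continents, fake_llm_judge, max_rows=10)
    guard_message = None
except ValueError as err:
    guard_message = str(err)
print(guard_message)
```

Even this tiny example needs 15 judge calls for 5 cities and 3 continents, so the number of calls grows quickly as either side gets larger.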

### Self Joins

We use a self-join example to demonstrate a special case: what happens when the joining columns exist in both dataframes? In that case, you need to disambiguate your column references by attaching `left.` and `right.` prefixes to the column names.

Let's create an example data frame:

In [12]:
animals = bpd.DataFrame({'animal': ['cow', 'cat', 'spider', 'elephant']})

We want to compare the weights of these animals, and output all the pairs where the animal on the left is heavier than the animal on the right. In this case, we use `left.animal` and `right.animal` to differentiate the data sources:

In [13]:
animals.semantics.join(animals, "{left.animal} generally weighs heavier than {right.animal}", model=gemini_model)



Unnamed: 0,animal_left,animal_right
0,cow,cat
1,cow,spider
2,cat,spider
3,elephant,cow
4,elephant,cat
5,elephant,spider
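
The `left.` and `right.` prefixes read much like attribute lookups. As a rough illustration (an assumption about the idea, not how BigFrames resolves the references internally), Python's own `str.format` can render such an instruction when each side is wrapped in a small namespace object. `render_pair` is a hypothetical helper name.

```python
from types import SimpleNamespace


def render_pair(instruction, left_row, right_row):
    """Fill {left.<col>} and {right.<col>} references from two row dicts."""
    return instruction.format(
        left=SimpleNamespace(**left_row),
        right=SimpleNamespace(**right_row),
    )


instruction = "{left.animal} generally weighs heavier than {right.animal}"
prompt = render_pair(instruction, {"animal": "cow"}, {"animal": "cat"})
print(prompt)  # cow generally weighs heavier than cat
```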


## Semantic Aggregation

Semantic aggregation merges all the values in a column into one. At the moment, you can only aggregate a single column in each operator call. Let's create an example:

In [14]:
df = bpd.DataFrame({
    "Movies": [
        "Titanic",
        "The Wolf of Wall Street",
        "Killers of the Flower Moon",
        "The Revenant",
        "Inception",
        "Shuttle Island",
        "The Great Gatsby",
    ],
    "Year": [1997, 2013, 2023, 2015, 2010, 2010, 2013],
})
df

Unnamed: 0,Movies,Year
0,Titanic,1997
1,The Wolf of Wall Street,2013
2,Killers of the Flower Moon,2023
3,The Revenant,2015
4,Inception,2010
5,Shuttle Island,2010
6,The Great Gatsby,2013


Let's ask the LLM to find the actors or actresses who starred in all of these movies:

In [15]:
agg_df = df.semantics.agg("Find the actors/actresses who starred in all {Movies}. Reply with their names only.", model=gemini_model)
agg_df



0    Leonardo DiCaprio 

Name: Movies, dtype: string

Instead of going through the rows one by one, this operator batches multiple rows into a single request to the LLM. It then aggregates the batched results with the same technique, repeating until only one value is left. You can set the batch size with the `max_agg_rows` parameter, which defaults to 10.
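
The batching scheme described above can be sketched as a repeated reduction. This is a simplified illustration rather than the operator's actual code: `batched_aggregate` is a hypothetical name, and `fake_llm_combine` uses Python's `max` as a stand-in for the LLM call that merges one batch into a single value.

```python
def batched_aggregate(values, combine, max_agg_rows=10):
    """Reduce values in batches until one value is left.

    Returns the final value and the number of reduction rounds.
    """
    if not values:
        raise ValueError("nothing to aggregate")
    if max_agg_rows < 2:
        raise ValueError("max_agg_rows must be at least 2 so each round shrinks the data")
    values = list(values)
    rounds = 0
    while len(values) > 1:
        # One simulated LLM request per batch of at most max_agg_rows values.
        values = [
            combine(values[start:start + max_agg_rows])
            for start in range(0, len(values), max_agg_rows)
        ]
        rounds += 1
    return values[0], rounds


requests = []  # records the size of every simulated request


def fake_llm_combine(batch):
    """Hypothetical stand-in for the LLM: merges a batch into one value."""
    requests.append(len(batch))
    return max(batch)


result, rounds = batched_aggregate(list(range(25)), fake_llm_combine, max_agg_rows=10)
print(result, rounds, requests)
```

With 25 values and a batch size of 10, the first round issues three requests (10, 10, and 5 values) and the second round merges the three partial results in one more request.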

## Semantic Top K

Semantic Top K selects the top K values based on your instruction. Here is an example:

In [16]:
df = bpd.DataFrame({"Animals": ["Corgi", "Orange Cat", "Parrot", "Tarantula"]})

We want to find the top two most popular pets:

In [17]:
df.semantics.top_k("{Animals} are more popular as pets", model=gemini_model, k=2)



Unnamed: 0,Animals
1,Orange Cat
2,Parrot


Under the hood, the semantic top K operator performs pairwise comparisons with the LLM. Because it uses the quickselect algorithm, the top K results are returned in the order of their indices rather than their ranks.
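
To show why quickselect yields index order rather than rank order, here is a small sketch in plain Python. It is an illustration under assumptions, not the BigFrames code: `top_k_sketch` is a hypothetical name, and a numeric comparison replaces the pairwise LLM comparison.

```python
import random


def top_k_sketch(items, k, is_better, seed=0):
    """Quickselect-style top-k; returns (index, item) pairs in index order."""
    rng = random.Random(seed)
    candidates = list(range(len(items)))
    selected = []
    while candidates and len(selected) < k:
        pivot = rng.choice(candidates)
        # Partition the remaining candidates around the pivot with pairwise comparisons.
        better = [i for i in candidates if i != pivot and is_better(items[i], items[pivot])]
        better_set = set(better)
        worse = [i for i in candidates if i != pivot and i not in better_set]
        need = k - len(selected)
        if len(better) >= need:
            # All remaining winners are on the "better" side; discard the rest.
            candidates = better
        else:
            # Everything better than the pivot, plus the pivot itself, is a winner.
            selected.extend(better)
            selected.append(pivot)
            candidates = worse
    # The winners are known, but their relative ranks were never fully sorted.
    return [(i, items[i]) for i in sorted(selected)]


values = [5, 7, 1, 9, 3]
result = top_k_sketch(values, k=2, is_better=lambda a, b: a > b)
print(result)  # [(1, 7), (3, 9)]
```

The largest value, 9, appears second in the result because it sits at a higher index than 7. Quickselect only decides which items belong in the top K, so sorting them by rank would need extra comparisons.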

## Semantic Search

Semantic search finds the values in a single column that are most similar to your query. Here is an example:

In [18]:
df = bpd.DataFrame({"creatures": ["salmon", "sea urchin", "baboons", "frog", "chimpanzee"]})
df

Unnamed: 0,creatures
0,salmon
1,sea urchin
2,baboons
3,frog
4,chimpanzee


We want to get the top 2 creatures that are most similar to "monkey":

In [19]:
df.semantics.search("creatures", query="monkey", top_k=2, model=text_embedding_model, score_column='similarity score')





Unnamed: 0,creatures,similarity score
2,baboons,0.773411
4,chimpanzee,0.781101


Notice that we are using a text embedding model this time. This model generates embedding vectors for both your query and the values in the search space. The operator then uses BigQuery's built-in `VECTOR_SEARCH` function to find the nearest neighbors of your query.

In addition, `score_column` is an optional parameter for storing the distances between the results and your query. If not set, the score column won't be attached to the result.
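
The nearest-neighbor step can be illustrated with a few toy vectors. This sketch assumes cosine distance and made-up two-dimensional "embeddings"; the real operator delegates the search to BigQuery, and its distance metric and vector sizes may differ. `cosine_distance` and `nearest` are hypothetical helper names.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query_vec, vectors, top_k):
    """Return the top_k (distance, label) pairs closest to the query."""
    scored = [(cosine_distance(query_vec, vec), label) for label, vec in vectors.items()]
    scored.sort()
    return scored[:top_k]


# Toy vectors for illustration only; real embeddings have hundreds of dimensions.
toy_vectors = {
    "a": (1.0, 0.0),
    "b": (0.0, 1.0),
    "c": (0.9, 0.1),
}
matches = nearest((1.0, 0.05), toy_vectors, top_k=2)
print([label for _, label in matches])  # ['a', 'c']
```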

## Semantic Similarity Join

When you have multiple queries to search in the same value space, you could use similarity join to simplify your call. For example:

In [20]:
df1 = bpd.DataFrame({'animal': ['monkey', 'spider', 'salmon', 'giraffe', 'sparrow']})
df2 = bpd.DataFrame({'animal': ['scorpion', 'baboon', 'owl', 'elephant', 'tuna']})

In this example, we want to pick the most related animal from `df2` for each value in `df1`, and this is how it's done:

In [21]:
df1.semantics.sim_join(df2, left_on='animal', right_on='animal', top_k=1, model=text_embedding_model, score_column='distance')





Unnamed: 0,animal,animal_1,distance
0,monkey,baboon,0.747665
1,spider,scorpion,0.890909
2,salmon,tuna,0.925461
3,giraffe,elephant,0.887858
4,sparrow,owl,0.932959


!! **Important:** Like semantic join, this operator can also be very expensive. To guard against unexpected processing of large datasets, use the `max_rows` parameter to provide a threshold.

## Semantic Cluster

Semantic cluster groups similar values together. For example:

In [22]:
df = bpd.DataFrame({'Product': ['Smartphone', 'Laptop', 'Coffee Maker', 'T-shirt', 'Jeans']})

We want to cluster these products into 3 groups, and this is how:

In [23]:
df.semantics.cluster_by(column='Product', output_column='Cluster ID', model=text_embedding_model, n_clusters=3)



Unnamed: 0,Product,Cluster ID
0,Smartphone,3
1,Laptop,3
2,Coffee Maker,1
3,T-shirt,2
4,Jeans,2


This operator uses the embedding model to generate a vector for each value, and then uses the KMeans algorithm to group them.
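
For readers unfamiliar with KMeans, the grouping step can be sketched with a tiny version of Lloyd's algorithm. This is a teaching sketch under assumptions, not what the operator runs: `kmeans_sketch` is a hypothetical name, the vectors are made-up two-dimensional points rather than real embeddings, the centroids are naively initialized from the first `n_clusters` vectors, and the labels are 0-based, unlike the cluster IDs in the output above.

```python
def kmeans_sketch(vectors, n_clusters, n_iter=10):
    """Tiny Lloyd's algorithm; returns one 0-based cluster label per vector."""
    # Naive initialization: use the first n_clusters vectors as centroids.
    centroids = [list(v) for v in vectors[:n_clusters]]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: attach each vector to its closest centroid.
        labels = [
            min(
                range(n_clusters),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])),
            )
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


toy_vectors = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0)]
labels = kmeans_sketch(toy_vectors, n_clusters=2)
print(labels)  # [0, 1, 0, 1, 0]
```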

# Performance Analyses

In this section we will use BigQuery's public Hacker News dataset to perform some heavier work. First, let's load 3,000 rows from the table:

In [24]:
hacker_news = bpd.read_gbq("bigquery-public-data.hacker_news.full")[['title', 'text', 'by', 'score', 'timestamp', 'type']].head(3000)
hacker_news

Unnamed: 0,title,text,by,score,timestamp,type
0,,"Well, most people aren&#x27;t alcoholics, so I...",slipframe,,2021-06-26 02:37:56+00:00,comment
1,,"No, you don&#x27;t really <i>need</i> a smartp...",vetinari,,2023-04-19 15:56:34+00:00,comment
2,,It&#x27;s for the late Paul Allen RIP. Should&...,lsr_ssri,,2018-10-16 01:07:55+00:00,comment
3,,Yup they are dangerous. Be careful Donald Trump.,Sven7,,2015-08-10 16:05:54+00:00,comment
4,,"Sure, it&#x27;s totally reasonable. Just point...",nicoburns,,2020-10-05 11:20:51+00:00,comment
5,,I wonder how long before special forces start ...,autisticcurio,,2020-09-01 15:38:50+00:00,comment
6,The Impending NY Tech Apocalypse: Here's What ...,,gaoprea,3.0,2011-09-27 22:43:27+00:00,story
7,,Where would you relocate to? I'm assuming that...,pavel_lishin,,2011-09-16 19:02:01+00:00,comment
8,Eureca beta is live. A place for your business...,,ricardos,1.0,2012-10-15 13:09:32+00:00,story
9,,"It doesn’t work on Safari, and WebKit based br...",archiewood,,2023-04-21 16:45:13+00:00,comment


Then, let's keep only the rows that have text content:

In [25]:
hacker_news_with_texts = hacker_news[hacker_news['text'].notnull()]
len(hacker_news_with_texts)

2558

Let's calculate the average text length in all the rows:

In [26]:
hacker_news_with_texts['text'].str.len().mean()

390.7251759186865

Now it's the LLM's turn. Let's keep the rows in which the text talks about the iPhone. This will take several minutes to finish.

In [27]:
iphone_comments = hacker_news_with_texts.semantics.filter("The {text} is mainly focused on iPhone", model=gemini_model)
iphone_comments





Unnamed: 0,title,text,by,score,timestamp,type
16,,you have to auth again when you use apple pay.,empath75,,2017-09-12 18:58:20+00:00,comment
413,,Well last time I got angry down votes for sayi...,drieddust,,2021-01-11 19:27:27+00:00,comment
797,,New iPhone should be announced on September. L...,meerita,,2019-07-30 20:54:42+00:00,comment
1484,,Why would this take a week? i(phone)OS was ori...,TheOtherHobbes,,2021-06-08 09:25:24+00:00,comment
1529,,&gt;or because Apple drama brings many clicks?...,weberer,,2022-09-05 13:16:02+00:00,comment
1561,,"Location: Sydney, AU<p>Remote: Yes<p>Willing t...",drEv0,,2016-05-03 23:55:26+00:00,comment


The performance of the semantic operators depends on the length of your input as well as your quota. Here are my benchmarks for running the previous operation over data of different sizes.

* 800 Rows -> 1m 21.3s
* 2550 Rows -> 5m 9s
* 8500 Rows -> 16m 34.4s

These numbers can give you a general idea of how fast the operators run.
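
To compare these runs on a common scale, the timings can be converted into throughput. The snippet below uses only the numbers listed above. The result, roughly 8 to 10 rows per second, reflects this one environment and quota and should not be read as a guarantee.

```python
# (rows, elapsed seconds) taken from the benchmark list above.
benchmarks = [
    (800, 1 * 60 + 21.3),
    (2550, 5 * 60 + 9.0),
    (8500, 16 * 60 + 34.4),
]

throughputs = [rows / seconds for rows, seconds in benchmarks]

for (rows, seconds), rate in zip(benchmarks, throughputs):
    print(f"{rows:>5} rows in {seconds:7.1f}s: {rate:.1f} rows/s")
```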

Now let's use the LLM to summarize the sentiments towards the iPhone:

In [28]:
iphone_comments.semantics.map("Summarize the sentiment of the {text}. Your answer should have at most 3 words", output_column="sentiment", model=gemini_model)



Unnamed: 0,title,text,by,score,timestamp,type,sentiment
16,,you have to auth again when you use apple pay.,empath75,,2017-09-12 18:58:20+00:00,comment,"Frustrated, Negative, Annoyed"
413,,Well last time I got angry down votes for sayi...,drieddust,,2021-01-11 19:27:27+00:00,comment,"Frustrated, feeling cheated."
797,,New iPhone should be announced on September. L...,meerita,,2019-07-30 20:54:42+00:00,comment,Excited anticipation.
1484,,Why would this take a week? i(phone)OS was ori...,TheOtherHobbes,,2021-06-08 09:25:24+00:00,comment,"Frustrated, critical, obvious."
1529,,&gt;or because Apple drama brings many clicks?...,weberer,,2022-09-05 13:16:02+00:00,comment,"Negative, clickbait, controversy."
1561,,"Location: Sydney, AU<p>Remote: Yes<p>Willing t...",drEv0,,2016-05-03 23:55:26+00:00,comment,Seeking employment in Australia.


Here is another example: we find the rows whose authors have animal names in their usernames.

In [29]:
hacker_news = bpd.read_gbq("bigquery-public-data.hacker_news.full")[['title', 'text', 'by', 'score', 'timestamp', 'type']].head(3000)
hacker_news



Unnamed: 0,title,text,by,score,timestamp,type
0,,"Well, most people aren&#x27;t alcoholics, so I...",slipframe,,2021-06-26 02:37:56+00:00,comment
1,,"No, you don&#x27;t really <i>need</i> a smartp...",vetinari,,2023-04-19 15:56:34+00:00,comment
2,,It&#x27;s for the late Paul Allen RIP. Should&...,lsr_ssri,,2018-10-16 01:07:55+00:00,comment
3,,Yup they are dangerous. Be careful Donald Trump.,Sven7,,2015-08-10 16:05:54+00:00,comment
4,,"Sure, it&#x27;s totally reasonable. Just point...",nicoburns,,2020-10-05 11:20:51+00:00,comment
5,,I wonder how long before special forces start ...,autisticcurio,,2020-09-01 15:38:50+00:00,comment
6,The Impending NY Tech Apocalypse: Here's What ...,,gaoprea,3.0,2011-09-27 22:43:27+00:00,story
7,,Where would you relocate to? I'm assuming that...,pavel_lishin,,2011-09-16 19:02:01+00:00,comment
8,Eureca beta is live. A place for your business...,,ricardos,1.0,2012-10-15 13:09:32+00:00,story
9,,"It doesn’t work on Safari, and WebKit based br...",archiewood,,2023-04-21 16:45:13+00:00,comment


In [30]:
hacker_news.semantics.filter("{by} contains animal name", model=gemini_model)





Unnamed: 0,title,text,by,score,timestamp,type
24,Working Best at Coffee Shops,,GiraffeNecktie,249.0,2011-04-19 14:25:17+00:00,story
96,,i resisted switching to chrome for months beca...,catshirt,,2011-04-06 08:02:24+00:00,comment
106,,I was about to say the same thing myself. For ...,geophile,,2011-12-08 21:13:08+00:00,comment
184,,I think it&#x27;s more than hazing. It may be ...,bayesianhorse,,2015-06-18 16:42:53+00:00,comment
223,,I don&#x27;t understand why a beginner would s...,wolco,,2019-02-03 14:35:43+00:00,comment
284,,I leaerned more with one minute of this than a...,agumonkey,,2016-07-16 06:19:39+00:00,comment
297,,I've suggested a <i>rationale</i> for the tabo...,mechanical_fish,,2008-12-17 04:42:02+00:00,comment
306,,Do you have any reference for this?<p>I&#x27;m...,banashark,,2023-11-13 19:57:00+00:00,comment
316,,Default search scope is an option in the Finde...,kitsunesoba,,2017-08-13 17:15:19+00:00,comment
386,,Orthogonality and biology aren&#x27;t friends.,agumonkey,,2016-04-24 16:33:41+00:00,comment


Here are my performance numbers:
* 3000 rows -> 6m 9.2s
* 10000 rows -> 26m 42.4s