In [None]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## Build a Vector Search application using BigQuery DataFrames (aka BigFrames)

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/googleapis/python-bigquery-dataframes/blob/main/notebooks/generative_ai/bq_dataframes_llm_kmeans.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/googleapis/python-bigquery-dataframes/tree/main/notebooks/generative_ai/bq_dataframes_llm_kmeans.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/googleapis/python-bigquery-dataframes/tree/main/notebooks/generative_ai/bq_dataframes_llm_kmeans.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/bigquery/import?url=https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/generative_ai/bq_dataframes_llm_kmeans.ipynb">
      <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTW1gvOovVlbZAIZylUtf5Iu8-693qS1w5NJw&s" alt="BQ logo" width="35">
      Open in BQ Studio
    </a>
  </td>
</table>


**Author:** Sudipto Guha (Google)

**Last updated:** March 16th 2025

## Overview

This notebook will guide you through a practical example of using [BigFrames](https://github.com/googleapis/python-bigquery-dataframes/issues) to perform [vector search](https://cloud.google.com/bigquery/docs/vector-search-intro) and analysis on a patent dataset within BigQuery. We will leverage Python and BigFrames to efficiently process, analyze, and gain insights from a large-scale dataset without moving data from BigQuery.

Here's a breakdown of what we'll cover:

1. **Data Ingestion and Embedding Generation:**
We will start by reading a public patent dataset directly from BigQuery into a BigFrames DataFrame.
We'll demonstrate how to use BigFrames' `TextEmbeddingGenerator` to create text embeddings for the patent abstracts. This process converts the textual data into numerical vectors that capture the semantic meaning of each abstract.
We'll show how BigFrames efficiently performs this embedding generation within BigQuery, avoiding data transfer to the client-side.
Finally, we'll store the generated embeddings back into a new BigQuery table for subsequent analysis.

2. **Indexing and Similarity Search:**
Here we'll create a vector index using BigFrames to enable fast and scalable similarity searches.
We'll demonstrate how to create an IVF index for efficient approximate nearest neighbor searches.
We'll then perform a vector search using a sample query string to find patents that are semantically similar to the query. This showcases how vector search goes beyond keyword matching to find relevant results based on meaning.

3. **AI-Powered Summarization with Retrieval Augmented Generation (RAG):**
To further enhance the analysis, we'll implement a RAG pipeline.
We'll retrieve the most similar patents based on the vector search results from step 2.
We'll use BigFrames' `GeminiTextGenerator` to create a prompt for an LLM to generate a concise summary of the retrieved patents.
This demonstrates how to combine vector search with generative AI to extract and synthesize meaningful insights from complex patent data.


We will tie these pieces together in Python using BigQuery DataFrames. [Click here](https://cloud.google.com/bigquery/docs/dataframes-quickstart) to learn more about BigQuery DataFrames!

### Dataset

This notebook uses the [BQ Patents Public Dataset](https://bigquery.cloud.google.com/dataset/patents-public-data:patentsview). The data is read from the `patents-public-data.google_patents_research.publications` table.

### Costs

This tutorial uses billable components of Google Cloud:

* BigQuery (compute)
* BigQuery ML
* Generative AI support on Vertex AI

Learn about [BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models), [Generative AI support on Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing#generative_ai_models),
and [BigQuery ML pricing](https://cloud.google.com/bigquery/pricing#bqml),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Setup & initialization

Make sure you have the required roles and permissions listed below:

* For [vector embedding generation](https://cloud.google.com/bigquery/docs/generate-text-embedding#required_roles)
* For [vector index creation](https://cloud.google.com/bigquery/docs/vector-index#roles_and_permissions)

## Before you begin

Complete the tasks in this section to set up your environment.

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Click here](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com,bigqueryconnection.googleapis.com,aiplatform.googleapis.com) to enable the following APIs:

  * BigQuery API
  * BigQuery Connection API
  * Vertex AI API

4. If you are running this notebook locally, install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, see the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [1]:
# set your project ID below
PROJECT_ID = ""  # @param {type:"string"}

# set your region
REGION = "US"  # @param {type: "string"}

# Set the project id in gcloud
#! gcloud config set project {PROJECT_ID}

#### Authenticate your Google Cloud account

Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below.

**Vertex AI Workbench**

Do nothing; you are already authenticated.

**Local JupyterLab instance**

Uncomment and run the following cell:

In [None]:
# ! gcloud auth login

**Colab**

Uncomment and run the following cell:

In [2]:
# from google.colab import auth
# auth.authenticate_user()



Now we are ready to use BigQuery DataFrames!

## Step 1: Data Ingestion and Embedding Generation

Import libraries

In [33]:
import bigframes.pandas as bf
import bigframes.ml as bf_ml
import bigframes.bigquery as bf_bq
import bigframes.ml.llm as bf_llm


from google.cloud import bigquery
from google.cloud import storage

# Construct a BigQuery client object.
client = bigquery.Client()

import pandas as pd
from IPython.display import Image, display
from PIL import Image as PILImage
import io

import json
from IPython.display import Markdown

# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bf.options.bigquery.project = PROJECT_ID
bf.options.bigquery.location = REGION



Partial ordering mode allows BigQuery DataFrames to push down many more row and column filters. On large clustered and partitioned tables, this can greatly reduce the number of bytes scanned and computation slots used. This [blog post](https://medium.com/google-cloud/introducing-partial-ordering-mode-for-bigquery-dataframes-bigframes-ec35841d95c0) goes over it in more detail.

In [4]:
bf.options.bigquery.ordering_mode = "partial"

If you want to reset the location of the created DataFrame or Series objects, reset the session by executing `bf.close_session()`. After that, you can reuse `bf.options.bigquery.location` to specify another location.

Data Input - read the data from a publicly available BigQuery dataset

In [17]:
publications = bf.read_gbq('patents-public-data.google_patents_research.publications')



In [18]:
## create patents base table (subset of 10k out of ~110M records)

keep = (publications.embedding_v1.str.len() > 0) & (publications.title.str.len() > 0) & (publications.abstract.str.len() > 30)

## Take 10000 arbitrary rows to analyze (peek() makes no guarantee about which rows are returned, so this is not a uniform random sample)
publications = publications[keep].peek(10000)

In [11]:
## take a look at the sample dataset

publications.head(5)

Unnamed: 0,publication_number,title,title_translated,abstract,abstract_translated,cpc,cpc_low,cpc_inventive_low,top_terms,similar,url,country,publication_description,cited_by,embedding_v1
0,AU-338190-S,Compressor wheel,False,Newness and distinctiveness is claimed in the ...,False,[],[],[],['compressor wheel' 'newness' 'distinctiveness...,"[{'publication_number': 'AU-338190-S', 'applic...",https://patents.google.com/patent/AU338190S,Australia,Design,[],[ 5.2067090e-02 -1.5462303e-01 -1.3415462e-01 ...
1,CN-100525651-C,Method for processing egg products,False,The invention discloses a processing method of...,False,[],[],[],['egg' 'processing method' 'egg body' 'pack' '...,"[{'publication_number': 'CN-101396133-B', 'app...",https://patents.google.com/patent/CN100525651C,China,Granted Patent,[],[-0.05154578 -0.00437102 0.01365495 -0.168424...
2,TW-I725505-B,Improved carbon molecular sieve adsorbent,False,Disclosed herein are rapid cycle pressure swin...,False,"[{'code': 'B01D2253/116', 'inventive': False, ...",['B01D2253/116' 'B01D2253/10' 'B01D2253/00' 'B...,['B01D2253/116' 'B01D2253/10' 'B01D2253/00' 'B...,['swing adsorption' 'pressure swing' 'molecula...,"[{'publication_number': 'EP-1867379-B1', 'appl...",https://patents.google.com/patent/TWI725505B,Taiwan,Granted Patent or patent of addition,[],[ 0.0163008 -0.20972364 0.02052403 -0.003073...
3,EP-0248026-B1,A system for supplying strip to a processing line,False,A system (10) for supplying strip material (S)...,False,"[{'code': 'B65H2701/37', 'inventive': False, '...",['B65H2701/37' 'B65H2701/30' 'B65H2701/00' 'B6...,['B65H2701/37' 'B65H2701/30' 'B65H2701/00' 'B6...,['strip material' 'assembly' 'coil' 'take' 'pr...,"[{'publication_number': 'EP-0248026-B1', 'appl...",https://patents.google.com/patent/EP0248026B1,European Patent Office,Granted patent,[],[-0.04377723 0.04111805 -0.0929429 0.043924...
4,MY-135762-A,Method for producing acrylic acid,False,A PROCESS FOR THE FRACTIONAL CONDENSATION OF A...,False,"[{'code': 'C07C51/50', 'inventive': True, 'fir...",['C07C51/50' 'C07C51/42' 'C07C51/00' 'C07C' 'C...,['C07C51/50' 'C07C51/42' 'C07C51/00' 'C07C' 'C...,['acrylic acid' 'producing acrylic' 'stabilize...,"[{'publication_number': 'SG-157371-A1', 'appli...",https://patents.google.com/patent/MY135762A,Malaysia,Granted patent / Utility model,[],[ 0.10407669 0.01262973 -0.22623734 -0.171453...


Generate the text embeddings

In [13]:
from bigframes.ml.llm import TextEmbeddingGenerator

text_model = TextEmbeddingGenerator() # No connection id needed

In [19]:
## rename abstract column to content as the desired column on which embedding will be generated
publications = publications[["publication_number", "title", "abstract"]].rename(columns={'abstract': 'content'})

## generate the embeddings
## takes ~2-3 mins to run
embedding = text_model.predict(publications)[["publication_number", "title", "content", "ml_generate_embedding_result","ml_generate_embedding_status"]]

## keep only rows where embedding generation succeeded; the status value is an empty string on success
embedding = embedding[embedding["ml_generate_embedding_status"] == ""]




In [20]:
embedding.head(5)



Unnamed: 0,publication_number,title,content,ml_generate_embedding_result,ml_generate_embedding_status
5753,HN-1996000102-A,NEW PESTICIDES,THE PRESENT INVENTION REFERS TO,[-0.02709213 0.0366395 0.03931784 -0.003942...,
8115,AU-325874-S,Baby sling,Adjustable baby sling with velcro.,[ 6.44167811e-02 -2.01051459e-02 -3.39564607e-...,
5415,AU-2016256863-A1,Microbial compositions and methods for denitri...,The present invention provides compositions an...,[-5.90537786e-02 2.38401629e-03 7.22754598e-...,
8886,FR-2368509-A1,NEW DEODORANTS OR FRESHENERS AND COMPOSITIONS ...,Polyanionic polyamide salts comprising a conca...,[-3.44522446e-02 5.64815439e-02 -1.35829514e-...,
5661,US-2006051255-A1,Gas generator,A gas generator insulated by a vacuum-jacket v...,[-1.50892800e-02 6.56989636e-03 2.34969519e-...,
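
The blank `ml_generate_embedding_status` values in the preview above mark rows whose embedding was generated successfully. The library-free sketch below illustrates that convention on plain Python dictionaries. The rows, the error text, and the helper name `split_by_status` are made up for illustration and are not part of BigFrames.

```python
# Hypothetical rows mimicking the columns returned by text_model.predict().
rows = [
    {"publication_number": "DOC-1", "ml_generate_embedding_status": ""},
    {"publication_number": "DOC-2", "ml_generate_embedding_status": "hypothetical error message"},
    {"publication_number": "DOC-3", "ml_generate_embedding_status": ""},
]

def split_by_status(rows, status_column="ml_generate_embedding_status"):
    """Separate rows with an empty status (success) from rows carrying an error message."""
    succeeded = [row for row in rows if row[status_column] == ""]
    failed = [row for row in rows if row[status_column] != ""]
    return succeeded, failed

succeeded, failed = split_by_status(rows)
print([row["publication_number"] for row in succeeded])
print([row["publication_number"] for row in failed])
```

Keeping the failed rows around, rather than silently dropping them, makes it possible to retry or inspect them later.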


In [21]:
# store embeddings in a BQ table
DATASET_ID = ""  # @param {type:"string"}
TEXT_EMBEDDING_TABLE_ID = "" # @param {type:"string"}
embedding.to_gbq(f"{DATASET_ID}.{TEXT_EMBEDDING_TABLE_ID}", if_exists='replace')

'bqml_llm_trial.patent_embedding_BF-n'

## Step 2: Indexing and Similarity Search

### [Create a Vector Index](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_create_vector_index) using BigFrames


**Index Type**

The algorithm to use to build the vector index.
The supported values are IVF and TREE_AH.

In [22]:
## create vector index (note: only works on tables with more than 5000 rows)

bf_bq.create_vector_index(
    table_id = f"{DATASET_ID}.{TEXT_EMBEDDING_TABLE_ID}",
    column_name = "ml_generate_embedding_result",
    replace= True,
    index_name = "bf_python_index",
    distance_type="cosine",
    index_type= "ivf"
)
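
To build intuition for what an IVF (inverted file) index does, here is a minimal pure-Python sketch. It is not BigQuery's implementation: it assumes two fixed centroids (a real index would typically learn them by clustering), and the names `build_ivf` and `ivf_search` and the toy 2-D vectors are hypothetical. The idea is that each vector is filed under its nearest centroid, and a query only scans the lists whose centroids are closest to it.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: cosine_distance(vec, centroids[i]))
        lists[nearest].append(vec_id)
    return lists

def ivf_search(query, vectors, centroids, lists, fraction_lists_to_search, top_k):
    """Scan only the closest fraction of lists, then rank those candidates exactly."""
    n_probe = max(1, math.ceil(fraction_lists_to_search * len(centroids)))
    ranked_lists = sorted(range(len(centroids)), key=lambda i: cosine_distance(query, centroids[i]))
    candidates = [vec_id for i in ranked_lists[:n_probe] for vec_id in lists[i]]
    candidates.sort(key=lambda vec_id: cosine_distance(query, vectors[vec_id]))
    return candidates[:top_k]

# Toy data: one cluster near the x-axis, one near the y-axis.
vectors = {
    "a": [1.0, 0.1], "b": [0.9, 0.2], "c": [1.0, 0.0],
    "d": [0.1, 1.0], "e": [0.2, 0.9],
}
centroids = [[1.0, 0.0], [0.0, 1.0]]
lists = build_ivf(vectors, centroids)

# Probing half of the lists (one of two) scans only the x-axis cluster.
result = ivf_search([1.0, 0.08], vectors, centroids, lists, fraction_lists_to_search=0.5, top_k=2)
print(lists)
print(result)
```

Scanning fewer lists is faster but can miss true neighbors that were filed under a list that was not probed, which is why the search is approximate.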

### Vector Search (semantic search) using Vector Index

Approximate nearest neighbor (ANN) search using the vector index created above

In [23]:
## Set variable for vector search

TEXT_SEARCH_STRING = "Chip assemblies employing solder bonds to back-side lands including an electrolytic nickel layer"  ## replace with whatever search string you want to use for the vector search
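FRACTION_LISTS_TO_SEARCH = 0.01  ## fraction of IVF lists to probe; not passed to vector_search below, so the default is used unless you add fraction_lists_to_search=FRACTION_LISTS_TO_SEARCH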

In [24]:
# convert search string to dataframe
TEXT_SEARCH_DF = bf.DataFrame([TEXT_SEARCH_STRING], columns=['search_string'])

#generate embedding of search query
search_query = bf.DataFrame(text_model.predict(TEXT_SEARCH_DF))



In [25]:
## search the base table for the user's query

vector_search_results = bf_bq.vector_search(
                  base_table=f"{DATASET_ID}.{TEXT_EMBEDDING_TABLE_ID}",
                  column_to_search="ml_generate_embedding_result",
                  query=search_query,
                  distance_type="COSINE",
                  query_column_to_search="ml_generate_embedding_result",
                  top_k=5)



In [27]:
## View the returned results based on similarity with the user's query

vector_search_results[['content', 'publication_number',
       'title', 'content_1', 'distance']].rename(columns={'content': 'query', 'content_1':'abstract (relevant match)' , 'title':'title (relevant match)'})

Unnamed: 0,query,publication_number,title (relevant match),abstract (relevant match),distance
0,Chip assemblies employing solder bonds to back...,KR-102569815-B1,electronic device package,An electronic device package technology is dis...,0.357673
0,Chip assemblies employing solder bonds to back...,US-8962389-B2,Microelectronic packages including patterned d...,Embodiments of microelectronic packages and me...,0.344263
0,Chip assemblies employing solder bonds to back...,TW-I256279-B,Substrate for electrical device and methods of...,Substrate for electrical devices and methods o...,0.3687
0,Chip assemblies employing solder bonds to back...,US-2005230147-A1,"Wiring board, and electronic device with an el...",An electronic device is mounted on a wiring bo...,0.304293
0,Chip assemblies employing solder bonds to back...,US-6686652-B1,Locking lead tips and die attach pad for a lea...,An assembly and method suitable for use in pac...,0.364334
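
The `distance` column holds the cosine distance between the query embedding and each match, where smaller values mean closer matches. The rows above are not ordered by distance. As a minimal sketch (plain Python, not the BigFrames API), the snippet below shows how cosine distance behaves on toy vectors and ranks the publication numbers and distances copied from the table above.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; ranges from 0 to 2."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy vectors: the magnitude is ignored, only the direction matters.
same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)

# (publication_number, distance) pairs copied from the results table above.
results = [
    ("KR-102569815-B1", 0.357673),
    ("US-8962389-B2", 0.344263),
    ("TW-I256279-B", 0.3687),
    ("US-2005230147-A1", 0.304293),
    ("US-6686652-B1", 0.364334),
]

# Sort ascending so that the closest match comes first.
ranked = sorted(results, key=lambda pair: pair[1])
for publication_number, distance in ranked:
    print(f"{distance:.6f}  {publication_number}")
```

On the BigFrames DataFrame itself, the equivalent ordering would presumably be `vector_search_results.sort_values("distance")`.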


In [28]:
## Brute force result (for comparison)


brute_force_result = bf_bq.vector_search(
                  base_table=f"{DATASET_ID}.{TEXT_EMBEDDING_TABLE_ID}",
                  column_to_search="ml_generate_embedding_result",
                  query=search_query,
                  top_k=5,
                  distance_type="COSINE",
                  use_brute_force=True)
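
One common way to compare the two result sets is recall: the share of exact (brute-force) neighbors that the approximate search also returned. The sketch below is a plain-Python illustration under stated assumptions. The ANN ids are copied from the results table earlier in this step, while the brute-force ids are hypothetical placeholders, because the notebook does not display the brute-force output. The helper name `recall_at_k` is also an assumption.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of exact (brute-force) neighbors also returned by the ANN search."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)

# Publication numbers returned by the ANN search (from the results table above).
ann_ids = [
    "KR-102569815-B1",
    "US-8962389-B2",
    "TW-I256279-B",
    "US-2005230147-A1",
    "US-6686652-B1",
]

# HYPOTHETICAL brute-force ids: four overlap with the ANN results, one does not.
exact_ids = [
    "KR-102569815-B1",
    "US-8962389-B2",
    "TW-I256279-B",
    "US-2005230147-A1",
    "HYPOTHETICAL-ID-1",
]

recall = recall_at_k(ann_ids, exact_ids)
print(f"recall@5 = {recall:.2f}")
```

In the notebook, the two id lists could be taken from the `publication_number` columns of `vector_search_results` and `brute_force_result` after converting them to local Python lists.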


## Step 3: AI-Powered Summarization with Retrieval Augmented Generation (RAG)

Patent documents can be dense and time-consuming to digest. AI-powered patent summarization uses Retrieval Augmented Generation (RAG) to streamline this process. By retrieving relevant patent information through vector search and then synthesizing it with a large language model, we can generate concise, human-readable summaries, saving valuable time and effort. The code below walks through how to set this up, continuing with the same user query as in Step 2.

In [37]:
## gemini model

llm_model = bf_llm.GeminiTextGenerator(model_name = "gemini-2.0-flash-001") ## replace with other model as needed

We will use the same user query from Step 2, and pass the list of abstracts returned by the vector search into the prompt for the RAG application.

In [35]:
TEMPERATURE = 0.4

In [34]:
# Extract strings into a list of JSON strings
json_strings = [json.dumps({'abstract': s}) for s in vector_search_results['content_1']]
ALL_ABSTRACTS = json_strings

# Print the result (optional)
print(ALL_ABSTRACTS)

['{"abstract": "Substrate for electrical devices and methods of fabricating such substrate are disclosed. An embodiment for an electrical device with substrate comprised of a chip having an active surface; a substrate being coupled with the chip; and a plurality of conductive wires (bumps) electrically connecting the chip to the substrate. In additionally, the present invention of the substrate for electrical devices may be comprised of an adhesive mean or a submember as required, and furthermore, by mean of using substrate, the present invention may be capable of affording a number of advantages, it is possible to include a thinner electrical device thickness, enhanced reliability, and a decreased cost in production."}', '{"abstract": "An electronic device is mounted on a wiring board, which includes: a substrate having through holes, and lands extending on surfaces of the substrate and adjacent to openings of the through holes. Further, at least one coating layer is provided, which c

In [41]:
## Setup the LLM prompt

prompt = f"""
You are an expert patent analyst. I will provide you the abstracts of the top 5 patents in json format retrieved by a vector search based on a user's query.
Your task is to analyze these abstracts and generate a concise, coherent summary that encapsulates the core innovations and concepts shared among them.

In your output, share the original user query.
Then output the concise, coherent summary that encapsulates the core innovations and concepts shared among the top 5 abstracts. The heading for this section should
be : Summary of the top 5 abstracts that are semantically closest to the user query.

User Query: {TEXT_SEARCH_STRING}
Top 5 abstracts: {ALL_ABSTRACTS}

Instructions:

Focus on identifying the common themes and key technological advancements described in the abstracts.
Synthesize the information into a clear and concise summary, approximately 150-200 words.
Avoid simply copying phrases from the abstracts. Instead, aim to provide a cohesive overview of the shared concepts.
Highlight the potential applications and benefits of the described inventions.
Maintain a professional and objective tone.
Do not mention the individual patents by number, focus on summarizing the shared concepts.
"""

print(prompt)


You are an expert patent analyst. I will provide you the abstracts of the top 5 patents in json format retrieved by a vector search based on a user's query.
Your task is to analyze these abstracts and generate a concise, coherent summary that encapsulates the core innovations and concepts shared among them.

In your output, share the original user query.
Then output the concise, coherent summary that encapsulates the core innovations and concepts shared among the top 5 abstracts. The heading for this section should
be : Summary of the top 5 abstracts that are semantically closest to the user query.

User Query: Chip assemblies employing solder bonds to back-side lands including an electrolytic nickel layer
Top 5 abstracts: ['{"abstract": "Substrate for electrical devices and methods of fabricating such substrate are disclosed. An embodiment for an electrical device with substrate comprised of a chip having an active surface; a substrate being coupled with the chip; and a plurality of 
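
The prompt above hard-codes "top 5" and embeds a Python list of JSON strings. As a possible variation, the sketch below builds a shortened prompt from any number of abstracts and serializes them as a single JSON array. The function name `build_rag_prompt`, the abbreviated prompt wording, and the placeholder abstracts are assumptions made for illustration, not the notebook's own prompt.

```python
import json

def build_rag_prompt(user_query, abstracts):
    """Build a summarization prompt whose "top N" wording matches len(abstracts)."""
    n = len(abstracts)
    # Serialize all abstracts as one well-formed JSON array.
    abstracts_json = json.dumps([{"abstract": text} for text in abstracts], indent=2)
    return (
        f"You are an expert patent analyst. Below are the abstracts of the top {n} patents, "
        "in JSON format, retrieved by a vector search for a user's query.\n"
        f"User Query: {user_query}\n"
        f"Top {n} abstracts:\n{abstracts_json}\n"
        "Summarize the concepts shared among these abstracts in 150-200 words."
    )

# Placeholder inputs, invented for illustration only.
sample_abstracts = ["First placeholder abstract.", "Second placeholder abstract."]
sample_prompt = build_rag_prompt("placeholder query", sample_abstracts)
print(sample_prompt)
```

Deriving the count from the input keeps the prompt consistent if `top_k` in the vector search is changed later.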

In [38]:
## Define a function that takes the input prompt and runs the LLM

def predict(prompt: str, temperature: float = TEMPERATURE) -> str:
    # Wrap the prompt in a single-row DataFrame (named input_df to avoid shadowing the built-in input())
    input_df = bf.DataFrame(
        {
            "prompt": [prompt],
        }
    )

    # Run the model and return the generated text of the first (and only) row
    return llm_model.predict(input_df, temperature=temperature).ml_generate_text_llm_result.iloc[0]

In [42]:
# Invoke LLM with prompt
response = predict(prompt, temperature = TEMPERATURE)

# Print results as Markdown
Markdown(response)





User Query: Chip assemblies employing solder bonds to back-side lands including an electrolytic nickel layer

Summary of the top 5 abstracts that are semantically closest to the user query.

The top five patent abstracts describe advancements in microelectronic packaging, focusing on improved chip-to-substrate interconnection and enhanced reliability.  A common thread is the development of novel substrate designs and assembly methods to facilitate robust electrical connections.  Several abstracts highlight techniques for creating reliable connections between chips and substrates, emphasizing the use of conductive materials and adhesives to ensure strong and durable bonds.  These methods aim to improve the overall reliability and performance of electronic devices.  The innovations include improved techniques for preventing delamination or peeling of conductive lands, leading to more robust assemblies.  The use of encapsulating materials and specialized die-attach methods are also prominent, suggesting a focus on protecting the chip and its connections from environmental factors.  These advancements collectively contribute to the creation of thinner, more reliable, and cost-effective electronic devices, with applications spanning various consumer electronics and other industries.  While the abstracts don't explicitly mention electrolytic nickel layers, the focus on improved solder bond reliability and substrate design suggests that such a layer could be a complementary enhancement to the described technologies.


## Summary and next steps

Ready to dive deeper and explore the endless possibilities? Start building your own vector search applications with BigFrames and BigQuery today! Check out our [documentation](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_vector_search), explore our sample [notebooks](https://github.com/googleapis/python-bigquery-dataframes/tree/main/notebooks), and unleash the power of vector analytics on your data.
The BigFrames team would also love to hear from you. To reach out, send an email to bigframes-feedback@google.com or file an issue at the [open source BigFrames repository](https://github.com/googleapis/python-bigquery-dataframes/issues). To receive updates about BigFrames, subscribe to the BigFrames email list.