In [1]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## Use BigQuery DataFrames to cluster and characterize complaints

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/googleapis/python-bigquery-dataframes/tree/main/notebooks/generative_ai/bq_dataframes_llm_kmeans.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/googleapis/python-bigquery-dataframes/tree/main/notebooks/generative_ai/bq_dataframes_llm_kmeans.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/googleapis/python-bigquery-dataframes/tree/main/notebooks/generative_ai/bq_dataframes_llm_kmeans.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

## Overview

The goal of this notebook is to demonstrate a comment characterization algorithm for an online business. We will accomplish this using [Google's PaLM 2](https://ai.google/discover/palm2/) and [KMeans clustering](https://en.wikipedia.org/wiki/K-means_clustering) in three steps:

1. Use PaLM2TextEmbeddingGenerator to [generate text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) for each of 5000 complaints sent to an online bank. If you're not familiar with what a text embedding is, it's a list of numbers that are like coordinates in an imaginary "meaning space" for sentences. (It's like [word embeddings](https://en.wikipedia.org/wiki/Word_embedding), but for more general text.) The important point for our purposes is that similar sentences are close to each other in this imaginary space.
2. Use KMeans clustering to group together complaints whose text embeddings are near each other. This will give us sets of similar complaints, but we don't yet know _why_ these complaints are similar.
3. Prompt PaLM2TextGenerator in English asking what the difference is between the groups of complaints that we got. Thanks to the power of modern LLMs, the response might give us a very good idea of what these complaints are all about, but remember to ["understand the limits of your dataset and model."](https://ai.google/responsibility/responsible-ai-practices/#:~:text=Understand%20the%20limitations%20of%20your%20dataset%20and%20model)
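
To make the idea of "closeness" in embedding space concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the sentences are made up for illustration (real embeddings have hundreds of dimensions and come from the model), and `cosine_similarity` and `most_similar` are hypothetical helper names, not part of BigQuery DataFrames.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings"; these values are assumptions for illustration only.
toy_embeddings = {
    "My card was charged twice": [0.9, 0.1, 0.0],
    "I was billed two times for one purchase": [0.8, 0.2, 0.1],
    "The mortgage servicer lost my paperwork": [0.1, 0.9, 0.3],
}


def most_similar(query, embeddings):
    """Return the other sentence whose embedding is closest to the query's."""
    others = (s for s in embeddings if s != query)
    return max(
        others,
        key=lambda s: cosine_similarity(embeddings[query], embeddings[s]),
    )


print(most_similar("My card was charged twice", toy_embeddings))
```

With these toy vectors, the two billing sentences point in nearly the same direction, so the billing complaint is picked over the mortgage one.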

We will tie these pieces together in Python using BigQuery DataFrames. [Click here](https://cloud.google.com/bigquery/docs/dataframes-quickstart) to learn more about BigQuery DataFrames!

### Dataset

This notebook uses the [CFPB Consumer Complaint Database](https://console.cloud.google.com/marketplace/product/cfpb/complaint-database).

### Costs

This tutorial uses billable components of Google Cloud:

* BigQuery (compute)
* BigQuery ML
* Generative AI support on Vertex AI

Learn about [BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models), [Generative AI support on Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing#generative_ai_models),
and [BigQuery ML pricing](https://cloud.google.com/bigquery/pricing#bqml),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Before you begin

Complete the tasks in this section to set up your environment.

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Click here](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com,bigqueryconnection.googleapis.com,run.googleapis.com,artifactregistry.googleapis.com,cloudbuild.googleapis.com,cloudresourcemanager.googleapis.com) to enable the following APIs:

  * BigQuery API
  * BigQuery Connection API
  * Cloud Run API
  * Artifact Registry API
  * Cloud Build API
  * Cloud Resource Manager API
  * Vertex AI API

4. If you are running this notebook locally, install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, see the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [2]:
# set your project ID below
PROJECT_ID = ""  # @param {type:"string"}

# Set the project id in gcloud
! gcloud config set project {PROJECT_ID}

Updated property [core/project].


#### Set the region

You can also change the `REGION` variable used by BigQuery. Learn more about [BigQuery regions](https://cloud.google.com/bigquery/docs/locations#supported_locations).

In [3]:
REGION = "US"  # @param {type: "string"}

#### Authenticate your Google Cloud account

Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below.

**Vertex AI Workbench**

No action is needed; you are already authenticated.

**Local JupyterLab instance**

Uncomment and run the following cell:

In [4]:
# ! gcloud auth login

**Colab**

Uncomment and run the following cell:

In [5]:
# from google.colab import auth
# auth.authenticate_user()

If you later want to change the location of the DataFrame or Series objects that BigQuery DataFrames creates, first reset the session by executing `bf.close_session()`, where `bf` is the `bigframes.pandas` module imported in Step 1. After that, you can set `bf.options.bigquery.location` to another location.

### Connect to Vertex AI

In order to use PaLM2TextGenerator, we will need to set up a [cloud resource connection](https://cloud.google.com/bigquery/docs/create-cloud-resource-connection).

In [6]:
from google.cloud import bigquery_connection_v1 as bq_connection

CONN_NAME = "bqdf-llm"

client = bq_connection.ConnectionServiceClient()
new_conn_parent = f"projects/{PROJECT_ID}/locations/{REGION}"
exists_conn_parent = f"projects/{PROJECT_ID}/locations/{REGION}/connections/{CONN_NAME}"
cloud_resource_properties = bq_connection.CloudResourceProperties({})

try:
    request = client.get_connection(
        request=bq_connection.GetConnectionRequest(name=exists_conn_parent)
    )
    CONN_SERVICE_ACCOUNT = f"serviceAccount:{request.cloud_resource.service_account_id}"
except Exception:
    connection = bq_connection.types.Connection(
        {"friendly_name": CONN_NAME, "cloud_resource": cloud_resource_properties}
    )
    request = bq_connection.CreateConnectionRequest(
        {
            "parent": new_conn_parent,
            "connection_id": CONN_NAME,
            "connection": connection,
        }
    )
    response = client.create_connection(request)
    CONN_SERVICE_ACCOUNT = (
        f"serviceAccount:{response.cloud_resource.service_account_id}"
    )
print(CONN_SERVICE_ACCOUNT)

serviceAccount:bqcx-1084210331973-vl8v@gcp-sa-bigquery-condel.iam.gserviceaccount.com


### Set permissions for the service account

The resource connection service account requires certain project-level permissions:
 - `roles/aiplatform.user` and `roles/bigquery.connectionUser`: These roles are required for the connection to create a model definition using the LLM in Vertex AI ([documentation](https://cloud.google.com/bigquery/docs/generate-text#give_the_service_account_access)).
 - `roles/run.invoker`: This role is required for the connection to have read-only access to Cloud Run services that back custom/remote functions ([documentation](https://cloud.google.com/bigquery/docs/remote-functions#grant_permission_on_function)).
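
The role grants in the next cell differ only in the role name, so the commands can also be generated from a list. The sketch below only builds and prints the command strings; it does not run them. `binding_commands` is a hypothetical helper, and the project ID and service account shown are placeholders, not real values.

```python
# Roles listed above; the flags mirror the gcloud commands used in this notebook.
ROLES = [
    "roles/bigquery.connectionUser",
    "roles/aiplatform.user",
    "roles/run.invoker",
]


def binding_commands(project_id, member, roles=ROLES):
    """Return one `gcloud projects add-iam-policy-binding` command per role."""
    return [
        f"gcloud projects add-iam-policy-binding {project_id} "
        f"--condition=None --no-user-output-enabled "
        f"--member={member} --role='{role}'"
        for role in roles
    ]


# Placeholder values (assumptions); substitute PROJECT_ID and CONN_SERVICE_ACCOUNT.
for command in binding_commands(
    "my-project", "serviceAccount:example@example.iam.gserviceaccount.com"
):
    print(command)
```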

Set these permissions by running the following `gcloud` commands:

In [7]:
!gcloud projects add-iam-policy-binding {PROJECT_ID} --condition=None --no-user-output-enabled --member={CONN_SERVICE_ACCOUNT} --role='roles/bigquery.connectionUser'
!gcloud projects add-iam-policy-binding {PROJECT_ID} --condition=None --no-user-output-enabled --member={CONN_SERVICE_ACCOUNT} --role='roles/aiplatform.user'
!gcloud projects add-iam-policy-binding {PROJECT_ID} --condition=None --no-user-output-enabled --member={CONN_SERVICE_ACCOUNT} --role='roles/run.invoker'


Now we are ready to use BigQuery DataFrames!

## Step 1: Text embedding 

Project Setup

In [8]:
import bigframes.pandas as bf

bf.options.bigquery.project = PROJECT_ID
bf.options.bigquery.location = REGION

Data Input

In [9]:
input_df = bf.read_gbq("bigquery-public-data.cfpb_complaints.complaint_database")

In [10]:
issues_df = input_df[["consumer_complaint_narrative"]].dropna()
issues_df.head(n=5) # View the first five complaints

Unnamed: 0,consumer_complaint_narrative
0,"In XXXX, Citimortgage coerced a voluntary judg..."
4,I have 2 credit cards from Citi bank as well a...
5,This is in regards to the Government taking ac...
8,I write to dispute {$32.00} in late charges ( ...
9,I decided to close my Citibank checking accoun...


In [11]:
# Choose 5,000 complaints randomly and store them in a column in a DataFrame
downsampled_issues_df = issues_df.sample(n=5000)

Generate the text embeddings

In [12]:
from bigframes.ml.llm import PaLM2TextEmbeddingGenerator

model = PaLM2TextEmbeddingGenerator() # No connection id needed

In [13]:
# Will take ~3 minutes to compute the embeddings
predicted_embeddings = model.predict(downsampled_issues_df)
# Notice the lists of numbers that are our text embeddings for each complaint
predicted_embeddings.head() 

Unnamed: 0,text_embedding
109,"[-0.010294735431671143, -0.017596377059817314,..."
181,"[-0.004606005270034075, -0.0029765090439468622..."
690,"[-0.023824475705623627, -0.03503825515508652, ..."
1068,"[-0.005357897840440273, 0.024292852729558945, ..."
1613,"[0.023095030337572098, -0.016921309754252434, ..."


In [14]:
# Join the complaints with their embeddings in the same DataFrame
combined_df = downsampled_issues_df.join(predicted_embeddings)

We now have the complaints and their text embeddings as two columns in `combined_df`. Recall that complaints with numerically similar text embeddings should have similar meanings. We will now group similar complaints together.

## Step 2: KMeans clustering

In [15]:
from bigframes.ml.cluster import KMeans

cluster_model = KMeans(n_clusters=10) # We will divide our complaints into 10 groups

Perform KMeans clustering

In [16]:
# Use KMeans clustering to calculate our groups. Will take ~3 minutes.
cluster_model.fit(combined_df[["text_embedding"]])
clustered_result = cluster_model.predict(combined_df[["text_embedding"]])
# Notice the CENTROID_ID column, which is the ID number of the group that
# each complaint belongs to.
clustered_result.head(n=5)

Unnamed: 0,CENTROID_ID
109,1
181,3
690,9
1068,10
1613,6


In [17]:
# Join the group number to the complaints and their text embeddings
combined_clustered_result = combined_df.join(clustered_result)

Our DataFrame `combined_clustered_result` now has three columns: the complaints, their text embeddings, and an ID from 1 to 10 (inclusive) indicating which group of semantically similar complaints each one belongs to.
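
To make the clustering step less opaque, here is a toy version of the K-means idea (Lloyd's algorithm) in plain Python on two-dimensional points. This is a sketch for intuition only, not the implementation that BigQuery ML runs; the points are made up, the initial centroids are chosen by hand, and the labels here are 0-based, unlike the 1-based `CENTROID_ID` values above.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(
        range(len(centroids)),
        key=lambda i: squared_distance(point, centroids[i]),
    )


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for i in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids


# Two visually obvious groups of made-up points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

Each point ends up labeled with the index of its nearest centroid, which is the role `CENTROID_ID` plays for the complaint embeddings.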

## Step 3: Summarize the complaints

Build prompts
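
The cells in this step assemble each prompt with an explicit loop. As an alternative, a small helper can do the same formatting and also copes with groups that have fewer than five complaints. `build_comment_list` is a hypothetical name, and the sample comments are invented; with real data you could pass a plain list of narratives instead.

```python
def build_comment_list(title, comments, limit=5):
    """Format up to `limit` comments as a numbered list under a title line."""
    lines = [f"{title}:"]
    for number, comment in enumerate(comments[:limit], start=1):
        lines.append(f"{number}. {comment}")
    return "\n".join(lines) + "\n"


# Invented sample comments, for illustration only.
example_prompt = build_comment_list(
    "comment list 1", ["Late fee dispute", "Card charged twice"]
)
print(example_prompt)
```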

In [18]:
# Using bigframes, with syntax identical to pandas,
# select the complaints in the first and second groups
cluster_1_result = combined_clustered_result[
    combined_clustered_result["CENTROID_ID"] == 1
][["consumer_complaint_narrative"]]
cluster_1_result_pandas = cluster_1_result.head(5).to_pandas()

cluster_2_result = combined_clustered_result[
    combined_clustered_result["CENTROID_ID"] == 2
][["consumer_complaint_narrative"]]
cluster_2_result_pandas = cluster_2_result.head(5).to_pandas()

In [19]:
# Build plain-text prompts to send to PaLM 2. Use only 5 complaints from each group.
prompt1 = 'comment list 1:\n'
for i in range(5):
    prompt1 += str(i + 1) + '. ' + \
        cluster_1_result_pandas["consumer_complaint_narrative"].iloc[i] + '\n'

prompt2 = 'comment list 2:\n'
for i in range(5):
    prompt2 += str(i + 1) + '. ' + \
        cluster_2_result_pandas["consumer_complaint_narrative"].iloc[i] + '\n'

print(prompt1)
print(prompt2)


comment list 1:
1. On or about XXXX XXXX, 2016, I submitted to my servicer a loan modification request. They accepted and when I called back approximately two weeks later, Ocwen told me very rudely that they can not accept my request for workout assistance because the owner of the loan does not participate in modifications. They are not offering me any sort of options to foreclosure. How can the investor not abide by California laws designed to help people who are in distress. Also, each time I call Ocwen they are very rude and I have asked for Spanish speaking agents and they flat out tell me there is none. I believe Ocwen is headquartered XXXX and they simply do not understand Ca laws, nor that they are discriminating against people like me who feel more comfortable to speak in native tongue.
2. I have a mortgage loan which is interest only and the balance is {$700000.00}. The interest rate is 5.75 and since the real estate market crash of XX/XX/XXXX/XX/XX/XXXX I have been trying to 

In [20]:
# The plain English request we will make of PaLM 2
prompt = (
    "Please highlight the most obvious difference between "
    "the two lists of comments:\n" + prompt1 + prompt2
)
print(prompt)

Please highlight the most obvious difference between the two lists of comments:
comment list 1:
1. On or about XXXX XXXX, 2016, I submitted to my servicer a loan modification request. They accepted and when I called back approximately two weeks later, Ocwen told me very rudely that they can not accept my request for workout assistance because the owner of the loan does not participate in modifications. They are not offering me any sort of options to foreclosure. How can the investor not abide by California laws designed to help people who are in distress. Also, each time I call Ocwen they are very rude and I have asked for Spanish speaking agents and they flat out tell me there is none. I believe Ocwen is headquartered XXXX and they simply do not understand Ca laws, nor that they are discriminating against people like me who feel more comfortable to speak in native tongue.
2. I have a mortgage loan which is interest only and the balance is {$700000.00}. The interest rate is 5.75 and sin

Get a response from PaLM 2 LLM

In [21]:
from bigframes.ml.llm import PaLM2TextGenerator

session = bf.get_global_session()
connection = f"{PROJECT_ID}.{REGION}.{CONN_NAME}"
q_a_model = PaLM2TextGenerator(session=session, connection_name=connection)


In [None]:
# Make a DataFrame containing only a single row with our prompt for PaLM 2
df = bf.DataFrame({"prompt": [prompt]})

In [None]:
# Send the request for PaLM 2 to generate a response to our prompt
major_difference = q_a_model.predict(df)
# PaLM 2's response is the only row in the dataframe result 
major_difference["ml_generate_text_llm_result"].iloc[0]

We now see PaLM2TextGenerator's characterization of the different comment groups. Thanks for using BigQuery DataFrames!