In [1]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

## Use BigQuery DataFrames to cluster and characterize complaints

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/googleapis/python-bigquery-dataframes/blob/main/notebooks/generative_ai/bq_dataframes_llm_kmeans.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/googleapis/python-bigquery-dataframes/tree/main/notebooks/generative_ai/bq_dataframes_llm_kmeans.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/googleapis/python-bigquery-dataframes/tree/main/notebooks/generative_ai/bq_dataframes_llm_kmeans.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
  <td>
    <a href="https://pantheon.corp.google.com/bigquery?project=bigframes-dev&ws=!1m7!1m6!12m5!1m3!1sbigframes-dev!2sus-central1!3s06318b1a-ab57-46e4-b0a2-a0ad6665b0ee!2e2">
      <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTW1gvOovVlbZAIZylUtf5Iu8-693qS1w5NJw&s" alt="BQ logo" width="35">
      Open in BQ Studio
    </a>
  </td> 
</table>

## Overview

The goal of this notebook is to demonstrate a comment characterization algorithm for an online business. We will accomplish this using [Google's Embedding Models](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#models) and [KMeans clustering](https://en.wikipedia.org/wiki/K-means_clustering) in three steps:

1. Use TextEmbeddingGenerator to [generate text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) for each of 10,000 complaints sent to an online bank. If you're not familiar with text embeddings, an embedding is a list of numbers that act like coordinates in an imaginary "meaning space" for sentences. (It's like [word embeddings](https://en.wikipedia.org/wiki/Word_embedding), but for more general text.) The important point for our purposes is that similar sentences are close to each other in this imaginary space.
2. Use KMeans clustering to group together complaints whose text embeddings are near each other. This will give us sets of similar complaints, but we don't yet know _why_ these complaints are similar.
3. Prompt GeminiTextGenerator in plain English, asking what the difference is between the groups of complaints that we got. The response can give us a good idea of what the complaints in each group are about, but remember to ["understand the limitations of your dataset and model."](https://ai.google/responsibility/responsible-ai-practices/#:~:text=Understand%20the%20limitations%20of%20your%20dataset%20and%20model)

We will tie these pieces together in Python using BigQuery DataFrames. [Click here](https://cloud.google.com/bigquery/docs/dataframes-quickstart) to learn more about BigQuery DataFrames!
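
To make the "meaning space" idea from step 1 concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the example sentences are invented for illustration and do not come from the embedding model, whose vectors have far more dimensions. Euclidean distance is used as one simple way to measure closeness.

```python
import math

# Hypothetical 3-dimensional "embeddings", made up for illustration only.
toy_embeddings = {
    "My card was charged twice": [0.9, 0.1, 0.0],
    "I was billed two times for one purchase": [0.8, 0.2, 0.1],
    "A debt collector keeps calling me": [0.1, 0.9, 0.3],
}

query = "My card was charged twice"

# Distance from the query sentence to every other sentence.
distances = {
    text: math.dist(toy_embeddings[query], vector)
    for text, vector in toy_embeddings.items()
    if text != query
}

# The sentence with the smallest distance is the most similar one.
closest = min(distances, key=distances.get)
print(closest)
```

In this toy setup, the two billing complaints sit near each other while the debt collection complaint is farther away, which is the property that clustering relies on in step 2.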

### Dataset

This notebook uses the [CFPB Consumer Complaint Database](https://console.cloud.google.com/marketplace/product/cfpb/complaint-database).

### Costs

This tutorial uses billable components of Google Cloud:

* BigQuery (compute)
* BigQuery ML
* Generative AI support on Vertex AI

Learn about [BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models), [Generative AI support on Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing#generative_ai_models),
and [BigQuery ML pricing](https://cloud.google.com/bigquery/pricing#bqml),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Before you begin

Complete the tasks in this section to set up your environment.

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Click here](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com,bigqueryconnection.googleapis.com,aiplatform.googleapis.com) to enable the following APIs:

  * BigQuery API
  * BigQuery Connection API
  * Vertex AI API

4. If you are running this notebook locally, install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, see the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [2]:
# set your project ID below
PROJECT_ID = ""  # @param {type:"string"}

# Set the project id in gcloud
#! gcloud config set project {PROJECT_ID}

#### Authenticate your Google Cloud account

Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below.

**Vertex AI Workbench**

Do nothing; you are already authenticated.

**Local JupyterLab instance**

Uncomment and run the following cell:

In [3]:
# ! gcloud auth login

**Colab**

Uncomment and run the following cell:

In [4]:
# from google.colab import auth
# auth.authenticate_user()

Now we are ready to use BigQuery DataFrames!

## Step 1: Text embedding 

BigQuery DataFrames setup

In [5]:
import bigframes.pandas as bf

# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bf.options.bigquery.project = PROJECT_ID

If you want to change the location of the created DataFrame or Series objects, reset the session by executing `bf.close_session()`. After that, you can set `bf.options.bigquery.location` to specify another location.

Data Input - read the data from a publicly available BigQuery dataset

In [6]:
input_df = bf.read_gbq("bigquery-public-data.cfpb_complaints.complaint_database")

In [7]:
issues_df = input_df[["consumer_complaint_narrative"]].dropna()
issues_df.peek(n=5) # View an arbitrary five complaints

Unnamed: 0,consumer_complaint_narrative
2557016,I've been disputing fraud accounts on my credi...
2557686,American Express Platinum totally messed up my...
2558170,I recently looked at my credit report and noti...
2558545,Select Portfolio Servicing contacted my insura...
2558652,I checked my credit report and I am upset on w...


Downsample DataFrame to 10,000 records for model training.

In [8]:
# Choose 10,000 complaints randomly and store them in a column in a DataFrame
downsampled_issues_df = issues_df.sample(n=10000)

Generate the text embeddings

In [9]:
from bigframes.ml.llm import TextEmbeddingGenerator

model = TextEmbeddingGenerator() # No connection id needed

In [10]:
# Will take ~3 minutes to compute the embeddings
predicted_embeddings = model.predict(downsampled_issues_df)
# Notice the lists of numbers that are our text embeddings for each complaint
predicted_embeddings



Unnamed: 0,ml_generate_embedding_result,ml_generate_embedding_statistics,ml_generate_embedding_status,content
415,[ 2.56774724e-02 -1.06168222e-02 3.06945704e-...,"{""token_count"":171,""truncated"":false}",,DEPT OF EDUCATION/XXXX is stating I was late ...
596,[ 5.90653270e-02 -9.31344274e-03 -7.12460047e-...,"{""token_count"":668,""truncated"":false}",,I alerted my credit card company XX/XX/2017 th...
706,[ 0.01298233 0.00130001 0.01800315 0.037078...,"{""token_count"":252,""truncated"":false}",,Sallie mae is corrupt. I have tried to talk t...
804,[-1.39777679e-02 1.68943349e-02 5.53999236e-...,"{""token_count"":412,""truncated"":false}",,In accordance with the Fair Credit Reporting a...
861,[ 2.33309343e-02 -2.36528926e-03 3.37129943e-...,"{""token_count"":160,""truncated"":false}",,"Hello, My name is XXXX XXXX XXXX. I have a pro..."
1030,[ 0.06060313 -0.06495965 -0.03605044 -0.028016...,"{""token_count"":298,""truncated"":false}",,"Hello, I would like to complain about PayPal H..."
1582,[ 0.01255985 -0.01652482 -0.02638046 0.036858...,"{""token_count"":814,""truncated"":false}",,Transunion is listing personal information ( n...
1600,[ 5.13355099e-02 4.01246967e-03 5.72342947e-...,"{""token_count"":653,""truncated"":false}",,"On XX/XX/XXXX, I called Citizen Bank at XXXX t..."
2060,[ 6.44792162e-04 4.95899878e-02 4.67925966e-...,"{""token_count"":136,""truncated"":false}",,Theses names are the known liars that I have s...
2283,[ 4.71848622e-02 -8.68239347e-03 5.80501892e-...,"{""token_count"":478,""truncated"":false}",,My house was hit by a tree XX/XX/2018. My insu...


The model may have encountered errors while calculating embeddings for some rows. Filter out the errored rows before training the model. Alternatively, select these rows and retry the embeddings.

In [12]:
successful_rows = (
    (predicted_embeddings["ml_generate_embedding_status"] == "")
    # Series.str.len() gives the length of an array.
    # See: https://stackoverflow.com/a/41340543/101923
    & (predicted_embeddings["ml_generate_embedding_result"].str.len() != 0)
)
predicted_embeddings = predicted_embeddings[successful_rows]
predicted_embeddings


Unnamed: 0,ml_generate_embedding_result,ml_generate_embedding_statistics,ml_generate_embedding_status,content
415,[ 2.56774724e-02 -1.06168222e-02 3.06945704e-...,"{""token_count"":171,""truncated"":false}",,DEPT OF EDUCATION/XXXX is stating I was late ...
596,[ 5.90653270e-02 -9.31344274e-03 -7.12460047e-...,"{""token_count"":668,""truncated"":false}",,I alerted my credit card company XX/XX/2017 th...
706,[ 0.01298233 0.00130001 0.01800315 0.037078...,"{""token_count"":252,""truncated"":false}",,Sallie mae is corrupt. I have tried to talk t...
804,[-1.39777679e-02 1.68943349e-02 5.53999236e-...,"{""token_count"":412,""truncated"":false}",,In accordance with the Fair Credit Reporting a...
861,[ 2.33309343e-02 -2.36528926e-03 3.37129943e-...,"{""token_count"":160,""truncated"":false}",,"Hello, My name is XXXX XXXX XXXX. I have a pro..."
1030,[ 0.06060313 -0.06495965 -0.03605044 -0.028016...,"{""token_count"":298,""truncated"":false}",,"Hello, I would like to complain about PayPal H..."
1582,[ 0.01255985 -0.01652482 -0.02638046 0.036858...,"{""token_count"":814,""truncated"":false}",,Transunion is listing personal information ( n...
1600,[ 5.13355099e-02 4.01246967e-03 5.72342947e-...,"{""token_count"":653,""truncated"":false}",,"On XX/XX/XXXX, I called Citizen Bank at XXXX t..."
2060,[ 6.44792162e-04 4.95899878e-02 4.67925966e-...,"{""token_count"":136,""truncated"":false}",,Theses names are the known liars that I have s...
2283,[ 4.71848622e-02 -8.68239347e-03 5.80501892e-...,"{""token_count"":478,""truncated"":false}",,My house was hit by a tree XX/XX/2018. My insu...


We now have the complaints and their text embeddings in the `content` and `ml_generate_embedding_result` columns of our `predicted_embeddings` DataFrame.
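
The filtering rule used above can be illustrated on plain Python data. This is only a sketch: the rows and the error message below are hypothetical stand-ins shaped like the model output, and the `is_successful` helper is not part of BigQuery DataFrames. It also shows how the failed rows could be set aside for a retry.

```python
# Hypothetical rows shaped like the embedding output; all values are made up.
rows = [
    {
        "content": "complaint A",
        "ml_generate_embedding_result": [0.1, 0.2],
        "ml_generate_embedding_status": "",
    },
    {
        "content": "complaint B",
        "ml_generate_embedding_result": [],
        "ml_generate_embedding_status": "hypothetical error message",
    },
    {
        "content": "complaint C",
        "ml_generate_embedding_result": [0.3, 0.4],
        "ml_generate_embedding_status": "",
    },
]


def is_successful(row):
    """A row succeeded if its status is empty and its embedding is non-empty."""
    return (
        row["ml_generate_embedding_status"] == ""
        and len(row["ml_generate_embedding_result"]) != 0
    )


# Keep the good rows for clustering and collect the rest for a retry.
successful = [row for row in rows if is_successful(row)]
to_retry = [row for row in rows if not is_successful(row)]

print([row["content"] for row in successful])
print([row["content"] for row in to_retry])
```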

## Step 2: Create k-means model and predict clusters

In [13]:
from bigframes.ml.cluster import KMeans

cluster_model = KMeans(n_clusters=10) # We will divide our complaints into 10 groups

Perform KMeans clustering

In [15]:
# Use KMeans clustering to calculate our groups. Will take ~3 minutes.
cluster_model.fit(predicted_embeddings[["ml_generate_embedding_result"]])
clustered_result = cluster_model.predict(predicted_embeddings)
# Notice the CENTROID_ID column, which is the ID number of the group that
# each complaint belongs to.
clustered_result.peek(n=5)

Unnamed: 0,CENTROID_ID,NEAREST_CENTROIDS_DISTANCE,ml_generate_embedding_result,ml_generate_embedding_statistics,ml_generate_embedding_status,content
3172121,1,"[{'CENTROID_ID': 1, 'DISTANCE': 0.756634267893...",[ 3.18095312e-02 -3.54472063e-02 -7.13569671e-...,"{""token_count"":10,""truncated"":false}",,Company did not provide verification and detai...
2137420,1,"[{'CENTROID_ID': 1, 'DISTANCE': 0.606628249825...",[ 1.91578846e-02 5.55988774e-02 8.88887007e-...,"{""token_count"":100,""truncated"":false}",,I have already filed a dispute with Consumer A...
2350775,1,"[{'CENTROID_ID': 1, 'DISTANCE': 0.606676295233...",[ 2.25369893e-02 2.29400061e-02 -6.42273854e-...,"{""token_count"":100,""truncated"":false}",,I informed Central Financial Control & provide...
2904146,1,"[{'CENTROID_ID': 1, 'DISTANCE': 0.596729348974...",[ 9.35115516e-02 4.27814946e-03 4.62085977e-...,"{""token_count"":100,""truncated"":false}",,I received a letter from a collections agency ...
1075571,1,"[{'CENTROID_ID': 1, 'DISTANCE': 0.453806107968...",[-1.93953840e-03 -5.80236455e-03 8.49655271e-...,"{""token_count"":100,""truncated"":false}",,"I have not done business with this company, i ..."


Our DataFrame `clustered_result` now includes a `CENTROID_ID` column containing an ID from 1 to 10 (inclusive) that indicates which group of semantically similar complaints each row belongs to.
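
For intuition about what a centroid ID means, the following is a minimal sketch of the textbook k-means procedure (Lloyd's algorithm) in plain Python on made-up 2-D points. It is not the BigQuery ML implementation, which has its own initialization and options. The `kmeans` function, the points, and the starting centroids are assumptions chosen for illustration.

```python
import math


def kmeans(points, initial_centroids, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = [list(c) for c in initial_centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: recompute each centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # Leave a centroid in place if no point chose it.
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Two visibly separate groups of hypothetical 2-D "embeddings".
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
centroids, assignments = kmeans(points, initial_centroids=[(0.0, 0.0), (6.0, 6.0)])

# Report 1-based IDs, mirroring the numbering of the CENTROID_ID column.
centroid_ids = [a + 1 for a in assignments]
print(centroid_ids)
```

Each point ends up labeled with the ID of the centroid it is closest to, which is the role the `CENTROID_ID` column plays for the complaint embeddings.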

## Step 3: Use Gemini to summarize complaint clusters

Build prompts - we will choose just two of our clusters and prompt GeminiTextGenerator to identify their salient characteristics. The prompt is natural language in a Python string.

In [16]:
# Using bigframes, with syntax identical to pandas,
# select the complaints in the first and second groups
cluster_1_result = clustered_result[
    clustered_result["CENTROID_ID"] == 1
][["content"]]
cluster_1_result_pandas = cluster_1_result.head(5).to_pandas()

cluster_2_result = clustered_result[
    clustered_result["CENTROID_ID"] == 2
][["content"]]
cluster_2_result_pandas = cluster_2_result.head(5).to_pandas()

In [17]:
# Build plain-text prompts to send to Gemini. Use only 5 complaints from each group.
prompt1 = 'comment list 1:\n'
for i in range(5):
    prompt1 += str(i + 1) + '. ' + \
        cluster_1_result_pandas["content"].iloc[i] + '\n'

prompt2 = 'comment list 2:\n'
for i in range(5):
    prompt2 += str(i + 1) + '. ' + \
        cluster_2_result_pandas["content"].iloc[i] + '\n'

print(prompt1)
print(prompt2)

comment list 1:
1. This debt was from my identity being stolen I didnt open any account that resulted in this collection i have completed a police report which can be verified with the XXXX police @ XXXX report # XXXX and i have a notarized identity theft affidavit from ftc please remove this off of my credit and close my file ASAP
2. On XX/XX/XXXX  this company received certified mail asking for validation of debt. On XX/XX/XXXX  the company still did not validate debt owed and they did not mark the debt disputed by   XX/XX/XXXX   through the major credit reporting bureaus. This is a violation of the FDCPA and FCRA. I did send a second letter which the company received on   XX/XX/XXXX  . A lady from the company called and talked to me about the debt on XX/XX/XXXX    but again did not have the credit bureaus mark the item as disputed. The company still violated the laws.  Section [ 15 U.S.C. 1681s-2 ] ( 3 )  duty to provide notice of dispute. If the completeness or accuracy of any info
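
The prompt-building loops above can also be written as one reusable helper. This is a sketch of one possible alternative rather than the notebook's method: `build_comment_list` is a hypothetical name, and the sample complaints are shortened placeholders. It produces the same "title, then numbered lines" layout.

```python
def build_comment_list(title, comments):
    """Return a titled, numbered list of comments, one per line."""
    lines = [f"{title}:"]
    lines += [f"{i}. {comment}" for i, comment in enumerate(comments, start=1)]
    return "\n".join(lines) + "\n"


# Hypothetical, shortened complaints used only to show the output shape.
sample_comments = ["I was charged twice.", "My dispute was ignored."]
prompt_part = build_comment_list("comment list 1", sample_comments)
print(prompt_part)
```

Because the helper iterates over whatever it is given, it also works when a cluster has fewer than five complaints, a case where indexing with `.iloc[i]` inside `range(5)` would raise an `IndexError`.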

In [18]:
# The plain English request we will make of Gemini
prompt = (
    "Please highlight the most obvious difference between "
    "the two lists of comments:\n" + prompt1 + prompt2
)
print(prompt)

Please highlight the most obvious difference between the two lists of comments:
comment list 1:
1. This debt was from my identity being stolen I didnt open any account that resulted in this collection i have completed a police report which can be verified with the XXXX police @ XXXX report # XXXX and i have a notarized identity theft affidavit from ftc please remove this off of my credit and close my file ASAP
2. On XX/XX/XXXX  this company received certified mail asking for validation of debt. On XX/XX/XXXX  the company still did not validate debt owed and they did not mark the debt disputed by   XX/XX/XXXX   through the major credit reporting bureaus. This is a violation of the FDCPA and FCRA. I did send a second letter which the company received on   XX/XX/XXXX  . A lady from the company called and talked to me about the debt on XX/XX/XXXX    but again did not have the credit bureaus mark the item as disputed. The company still violated the laws.  Section [ 15 U.S.C. 1681s-2 ] ( 3 )

Get a response from Gemini by making a call to Vertex AI using our connection.

In [19]:
from bigframes.ml.llm import GeminiTextGenerator

q_a_model = GeminiTextGenerator()

In [20]:
# Make a DataFrame containing only a single row with our prompt for Gemini
df = bf.DataFrame({"prompt": [prompt]})

In [21]:
# Send the request for Gemini to generate a response to our prompt
major_difference = q_a_model.predict(df)
# Gemini's response is the only row in the resulting DataFrame
major_difference["ml_generate_text_llm_result"].iloc[0]



"## Key Differences between Comment Lists 1 and 2:\n\n**Comment List 1:**\n\n* **Focuses on Legal Violations:** The comments in List 1 primarily focus on how the debt collectors violated specific laws, such as the FDCPA and FCRA, by not validating debt, not marking accounts as disputed, and using illegal collection tactics.\n* **Detailed Evidence:** Commenters provide detailed evidence of their claims, including dates, reference numbers, police reports, and copies of communications.\n* **Formal Tone:** The language in List 1 is more formal and uses legal terminology, suggesting the commenters may have a deeper understanding of their rights.\n* **Emphasis on Debt Accuracy:** Many comments explicitly deny owing the debt and question its validity, requesting proof and demanding removal from credit reports. \n\n**Comment List 2:**\n\n* **Focus on Harassment and Intimidation:** The comments in List 2 highlight the harassing and intimidating behavior of the debt collectors, such as making mu

We now see GeminiTextGenerator's characterization of the different comment groups. Thanks for using BigQuery DataFrames!

## Summary and next steps

You've used the ML and LLM capabilities of BigQuery DataFrames to help analyze and understand a large dataset of unstructured feedback.

Learn more about BigQuery DataFrames in the [documentation](https://cloud.google.com/python/docs/reference/bigframes/latest) and find more sample notebooks in the [GitHub repo](https://github.com/googleapis/python-bigquery-dataframes/tree/main/notebooks).