# Using the Vertex AI PaLM API to explain BQML Clustering

This example demonstrates how to use the Vertex AI PaLM API to explain the results of BigQuery ML (BQML) clustering.


Let's install the Vertex AI libraries, then restart the runtime.

# Install packages

In [None]:
from google.colab import auth
auth.authenticate_user()

In [None]:
#!gsutil cp gs://vertex_sdk_llm_private_releases/SDK/google_cloud_aiplatform-1.23.0.llm.alpha.23.03.28-py2.py3-none-any.whl .
!gsutil cp gs://vertex_sdk_llm_private_releases/SDK/google_cloud_aiplatform-1.25.dev20230502+language.models-py2.py3-none-any.whl .

!pip install google_cloud_aiplatform-1.25.dev20230502+language.models-py2.py3-none-any.whl "shapely<2.0.0"

Copying gs://vertex_sdk_llm_private_releases/SDK/google_cloud_aiplatform-1.25.dev20230502+language.models-py2.py3-none-any.whl...
Operation completed over 1 objects/2.4 MiB.                                      
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing ./google_cloud_aiplatform-1.25.dev20230502+language.models-py2.py3-none-any.whl
Collecting shapely<2.0.0
  Downloading Shapely-1.8.5.post1-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (2.0 MB)
Collecting google-cloud-resource-manager<3.0.0dev,>=1.3.3 (from google-cloud-aiplatform==1.25.dev20230502+language.models)
  Downloading google_cloud_resource_manager-1.10.0-py2.py3-none-an

# Core code
Let's define some variables that will be used throughout this notebook.

These are the GCP project ID (`project_id`), the model name (`model_name`, which can be any name you prefer), and the dataset name (`dataset_name`).
The dataset must already exist in the project identified by `project_id`, and you need permission to create and delete models in it.

In [None]:
import pandas as pd
from typing import Union
import sys
from google.cloud import bigquery
from google.colab import auth
auth.authenticate_user()

In [None]:
#@title Setup Project Variables { run: "auto", display-mode: "form" }
project_id = 'cloud-llm-preview1' #@param {type:"string"}
dataset_name = "bqml_tutorial_us" #@param {type:"string"}
model_name = "ecommerce_customer_segment_cluster5" #@param {type:"string"}
eval_name = model_name + "_eval"
LOCATION = "us-central1"  # @param {type:"string"}
client = bigquery.Client(project=project_id)

## Create a K-means model to cluster ecommerce data

This query can be run in BigQuery on its own. Try it out!

In [None]:
query = """
SELECT
  user_id,
  order_id,
  sale_price,
  created_at as order_created_date
FROM `bigquery-public-data.thelook_ecommerce.order_items`
WHERE created_at BETWEEN CAST('2020-01-01 00:00:00' AS TIMESTAMP)
AND CAST('2023-01-01 00:00:00' AS TIMESTAMP)
"""
df = client.query(query).to_dataframe()
df.head()


Unnamed: 0,user_id,order_id,sale_price,order_created_date
0,27396,34003,2.5,2022-08-29 20:52:51+00:00
1,33356,41399,2.5,2021-09-11 09:25:35+00:00
2,4135,5159,2.5,2021-10-15 09:50:16+00:00
3,47833,59560,2.5,2022-05-31 03:51:50+00:00
4,68859,85914,2.5,2022-12-04 04:40:47+00:00


## `CREATE MODEL` using `KMEANS`

First build the `CREATE MODEL` query. The cells that follow define a helper that submits the query as a job and waits for the job to complete.

In [None]:
query = """
CREATE OR REPLACE MODEL `{0}.{1}`
OPTIONS (
  MODEL_TYPE = "KMEANS",
  NUM_CLUSTERS = 5,
  KMEANS_INIT_METHOD = "KMEANS++",
  STANDARDIZE_FEATURES = TRUE )
AS (
SELECT * EXCEPT (user_id)
FROM (
  SELECT user_id,
    DATE_DIFF(CURRENT_DATE(), CAST(MAX(order_created_date) as DATE), day) AS days_since_order, -- RECENCY
    COUNT(order_id) AS count_orders, -- FREQUENCY
    AVG(sale_price) AS avg_spend -- MONETARY
  FROM (
    SELECT user_id,
      order_id,
      sale_price,
      created_at as order_created_date
    FROM `bigquery-public-data.thelook_ecommerce.order_items`
    WHERE created_at BETWEEN CAST('2020-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2023-01-01 00:00:00' AS TIMESTAMP)
  )
  GROUP BY user_id, order_id
 )
)
""".format(dataset_name, model_name)
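
To make the three features in the query concrete, here is a minimal offline sketch that derives them from a few hypothetical toy rows in plain Python. The rows, the `rfm_features` helper, and the fixed `today` date are assumptions for illustration only. Because the query groups by both `user_id` and `order_id`, `COUNT(order_id)` counts the rows in each (user, order) group, and the sketch mirrors that grouping.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy rows: (user_id, order_id, sale_price, order_created_date)
rows = [
    (1, 10, 20.0, date(2022, 1, 5)),
    (1, 10, 40.0, date(2022, 1, 5)),
    (1, 11, 15.0, date(2022, 6, 1)),
    (2, 12, 100.0, date(2021, 3, 9)),
]


def rfm_features(rows, today):
    """Group by (user_id, order_id), as the query does, and derive the three features."""
    groups = defaultdict(list)
    for user_id, order_id, sale_price, created in rows:
        groups[(user_id, order_id)].append((sale_price, created))

    features = {}
    for key, items in groups.items():
        prices = [price for price, _ in items]
        last_date = max(created for _, created in items)
        features[key] = {
            "days_since_order": (today - last_date).days,  # recency
            "count_orders": len(items),                     # frequency
            "avg_spend": sum(prices) / len(prices),         # monetary
        }
    return features


# A fixed date stands in for CURRENT_DATE() so the result is reproducible.
features = rfm_features(rows, today=date(2023, 1, 1))
for key, value in sorted(features.items()):
    print(key, value)
```

In the real query, `STANDARDIZE_FEATURES = TRUE` asks BigQuery ML to put these features on a comparable scale before clustering, which matters here because days, counts, and dollar amounts have very different ranges.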


In [None]:
# Wrapper that runs a query/job with the BigQuery client and returns the result as a DataFrame
def run_bq_query(sql: str) -> pd.DataFrame:
    """
    Input: SQL query, as a string, to execute in BigQuery
    Returns the query results as a pandas DataFrame; raises an exception if the query is invalid or fails
    """

    # Try dry run before executing query to catch any errors
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    client.query(sql, job_config=job_config)

    # If dry run succeeds without errors, proceed to run query
    job_config = bigquery.QueryJobConfig()
    client_result = client.query(sql, job_config=job_config)

    job_id = client_result.job_id

    # Wait for the query/job to finish, then fetch and return the result as a DataFrame
    df = client_result.result().to_arrow().to_pandas()
    print(f"Finished job_id: {job_id}")
    return df
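
`run_bq_query` waits for the job by blocking on `result()`. If an explicit Python loop is preferred, for example to print progress between checks, a polling helper could look like the sketch below. The `wait_for_job` helper and the `FakeJob` class are hypothetical; the sketch only assumes the job object exposes a `done()` method, which BigQuery query jobs do.

```python
import time


def wait_for_job(job, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll job.done() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while not job.done():
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish in time")
        time.sleep(poll_seconds)
    return job


class FakeJob:
    """Stand-in for a query job, used only to demonstrate the loop offline."""

    def __init__(self, polls_until_done):
        self.remaining = polls_until_done
        self.job_id = "fake-job"

    def done(self):
        # Report completion once the configured number of polls has happened.
        self.remaining -= 1
        return self.remaining <= 0


job = wait_for_job(FakeJob(polls_until_done=3), poll_seconds=0.01)
print(f"Finished job_id: {job.job_id}")
```

With a real client, the object returned by `client.query(sql)` would take the place of `FakeJob`.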

In [None]:
print(query)

# this should take under 5 minutes to create the model
run_bq_query(query)


SELECT
  user_id,
  order_id,
  sale_price,
  created_at as order_created_date
FROM `bigquery-public-data.thelook_ecommerce.order_items`
WHERE created_at BETWEEN CAST('2020-01-01 00:00:00' AS TIMESTAMP)
AND CAST('2023-01-01 00:00:00' AS TIMESTAMP)

Finished job_id: b4775f6f-2d4e-4d4c-810d-3746b92cb8ac


Unnamed: 0,user_id,order_id,sale_price,order_created_date
0,27396,34003,2.50,2022-08-29 20:52:51+00:00
1,33356,41399,2.50,2021-09-11 09:25:35+00:00
2,4135,5159,2.50,2021-10-15 09:50:16+00:00
3,47833,59560,2.50,2022-05-31 03:51:50+00:00
4,68859,85914,2.50,2022-12-04 04:40:47+00:00
...,...,...,...,...
121710,83904,104908,9.32,2022-11-11 06:35:07+00:00
121711,84906,106189,9.32,2022-08-01 13:45:12+00:00
121712,91014,113735,9.32,2021-06-02 13:22:42+00:00
121713,97761,122130,9.82,2021-04-21 11:37:16+00:00


Let's take a look at the model's clustering performance using two metrics: the Davies-Bouldin index and the mean squared distance.

In [None]:
query = """
SELECT *
FROM ML.EVALUATE(MODEL `{0}.{1}`)
""".format(dataset_name, model_name)
run_bq_query(query)


Finished job_id: 975446f5-5a0a-4c45-8d2a-234d34865c4d


Unnamed: 0,davies_bouldin_index,mean_squared_distance
0,1.016709,0.967718
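
For intuition about these two metrics, here is a minimal sketch that applies their textbook definitions to hypothetical toy points. It is not BigQuery ML's implementation, and its numbers are unrelated to the model above. For both metrics, lower values indicate tighter, better separated clusters.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def mean_squared_distance(clusters):
    """Average squared Euclidean distance from each point to its cluster centroid."""
    total, count = 0.0, 0
    for points in clusters:
        c = centroid(points)
        total += sum(math.dist(p, c) ** 2 for p in points)
        count += len(points)
    return total / count


def davies_bouldin_index(clusters):
    """Mean, over clusters, of the worst (scatter_i + scatter_j) / centroid distance ratio."""
    centroids = [centroid(points) for points in clusters]
    scatter = [
        sum(math.dist(p, c) for p in points) / len(points)
        for points, c in zip(clusters, centroids)
    ]
    worst = []
    for i in range(len(clusters)):
        ratios = [
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(len(clusters))
            if j != i
        ]
        worst.append(max(ratios))
    return sum(worst) / len(worst)


# Two hypothetical, well separated clusters in two dimensions.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(4.0, 0.0), (4.0, 2.0)],
]
print(davies_bouldin_index(clusters))   # 0.5
print(mean_squared_distance(clusters))  # 1.0
```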


Now let's get the cluster (centroid) information

In [None]:
query = """
SELECT
  CONCAT('cluster ', CAST(centroid_id as STRING)) as centroid,
  avg_spend as average_spend,
  count_orders as count_of_orders,
  days_since_order
FROM (
  SELECT centroid_id, feature, ROUND(numerical_value, 2) as value
  FROM ML.CENTROIDS(MODEL `{0}.{1}`)
)
PIVOT (
  SUM(value)
  FOR feature IN ('avg_spend',  'count_orders', 'days_since_order')
)
ORDER BY centroid_id
""".format(dataset_name, model_name)
run_bq_query(query)

Finished job_id: b8ccbfb7-0f91-4bf2-ac0b-00cf5c3fb144


Unnamed: 0,centroid,average_spend,count_of_orders,days_since_order
0,cluster 1,45.77,1.23,850.51
1,cluster 2,359.51,1.16,525.65
2,cluster 3,136.63,1.27,416.48
3,cluster 4,57.69,3.51,497.3
4,cluster 5,38.5,1.21,317.57
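
`ML.CENTROIDS` returns one row per centroid and feature, and the `PIVOT` clause in the query turns those rows into one row per centroid. A rough plain-Python equivalent of that reshaping, using the values of the first two clusters from the table above, might look like this. The `pivot_centroids` helper is hypothetical and only illustrates the idea.

```python
def pivot_centroids(rows, features):
    """Turn (centroid_id, feature, value) rows into one dict per centroid."""
    wide = {}
    for centroid_id, feature, value in rows:
        if feature not in features:
            continue  # mirrors the FOR feature IN (...) filter
        record = wide.setdefault(centroid_id, {"centroid": f"cluster {centroid_id}"})
        # SUM(value) per (centroid, feature); each pair occurs once here.
        record[feature] = record.get(feature, 0) + round(value, 2)
    return [wide[cid] for cid in sorted(wide)]


# Long-format rows shaped like the relevant ML.CENTROIDS columns.
long_rows = [
    (1, "avg_spend", 45.77),
    (1, "count_orders", 1.23),
    (1, "days_since_order", 850.51),
    (2, "avg_spend", 359.51),
    (2, "count_orders", 1.16),
    (2, "days_since_order", 525.65),
]

table = pivot_centroids(long_rows, {"avg_spend", "count_orders", "days_since_order"})
for record in table:
    print(record)
```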


Whew! That's a lot of metrics and cluster info. How about we explain this to our colleagues using the magic of LLMs?

In [None]:
df = client.query(query).to_dataframe()

# Build one plain-text line per cluster to include in the prompt
cluster_info = []
for _, row in df.iterrows():
  cluster_info.append(
    f"{row['centroid']}, average spend ${row['average_spend']}, "
    f"count of orders per person {row['count_of_orders']}, "
    f"days since last order {row['days_since_order']}"
  )

print("\n".join(cluster_info))

cluster 1, average spend $45.77, count of orders per person 1.23, days since last order 850.51
cluster 2, average spend $359.51, count of orders per person 1.16, days since last order 525.65
cluster 3, average spend $136.63, count of orders per person 1.27, days since last order 416.48
cluster 4, average spend $57.69, count of orders per person 3.51, days since last order 497.3
cluster 5, average spend $38.5, count of orders per person 1.21, days since last order 317.57


## Explain with Vertex AI PaLM API

Import the Vertex AI SDK and initialize it with the project and location.

In [None]:
from google.cloud import aiplatform
from google.cloud.aiplatform.private_preview.language_models import TextGenerationModel, ChatModel

aiplatform.init(project=project_id, location=LOCATION)

Generate a text prediction

In [None]:
from google.cloud.aiplatform.private_preview.language_models import TextGenerationModel

model = TextGenerationModel.from_pretrained("text-bison-001")

preamble = "Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, and explain step by step; what would be the next marketing step for these clusters"

print(str.join("\n", cluster_info))
print('------------------------------')
print(model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))

cluster 1, average spend $45.77, count of orders per person 1.23, days since last order 850.51
cluster 2, average spend $359.51, count of orders per person 1.16, days since last order 525.65
cluster 3, average spend $136.63, count of orders per person 1.27, days since last order 416.48
cluster 4, average spend $57.69, count of orders per person 3.51, days since last order 497.3
cluster 5, average spend $38.5, count of orders per person 1.21, days since last order 317.57
------------------------------
**Cluster 1: The Occasional Shopper**

This cluster is made up of customers who spend an average of $45.77 per order and place an order every 850.51 days. They are likely to be looking for affordable, everyday items that they can use on a regular basis.

**Brand persona:** The Occasional Shopper is a busy person who is always on the go. They are looking for products that are easy to use and don't require a lot of time or effort. They are also looking for products that are affordable and wo

Voila! We've now used k-means clustering to create groups of spenders and explain their profiles.
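
The `predict` call above passes `temperature`, `top_p`, and `top_k`. As a rough illustration of what such sampling parameters commonly mean, the sketch below applies temperature scaling, top-k filtering, and top-p filtering to a hypothetical toy token distribution. The `filter_distribution` helper and the toy logits are assumptions, and the source does not describe how the PaLM API implements these steps internally.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=40, top_p=0.8):
    """Return the renormalized token probabilities left after top-k and top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    probs = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)

    # Top-k: keep only the k most likely tokens.
    probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, tok in probs:
        kept.append((p, tok))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for p, _ in kept)
    return {tok: p / norm for p, tok in kept}


# Hypothetical next-token distribution: a=0.5, b=0.3, c=0.15, d=0.05.
logits = {
    "a": math.log(0.5),
    "b": math.log(0.3),
    "c": math.log(0.15),
    "d": math.log(0.05),
}
kept = filter_distribution(logits, temperature=1.0, top_k=3, top_p=0.7)
print(kept)
```

In this toy case, top-k drops `d`, top-p then drops `c`, and the next token would be sampled from `a` and `b` only. Lower temperatures and smaller `top_p` or `top_k` values generally make the output less varied.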

Sometimes, though, you want a little bit [extra](https://cloud.google.com/blog/transform/prompt-debunking-five-generative-ai-misconceptions).

In [None]:
from google.cloud.aiplatform.private_preview.language_models import TextGenerationModel

model = TextGenerationModel.from_pretrained("text-bison-001")

preamble = "Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite bakery item, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group."

print(model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.45,
    top_p=0.8, top_k=40,
))

**Cluster 1: The Swifties**

* Favorite Taylor Swift song: "Love Story"
* Purchasing behavior: This group is made up of loyal Taylor Swift fans who love her classic, country-inspired music. They are typically young women who are looking for affordable, stylish clothing and accessories.
* Witty e-mail headline: "Looking for the perfect Love Story? Shop our collection of Taylor Swift-inspired clothing and accessories!"

**Cluster 2: The Style Seekers**

* Favorite Taylor Swift song: "Blank Space"
* Purchasing behavior: This group is made up of fashion-forward women who are always looking for the latest trends. They are typically older than the Swifties and have a higher disposable income.
* Witty e-mail headline: "Are you ready to shake things up? Shop our collection of Taylor Swift-inspired fashion!"

**Cluster 3: The Casual Fans**

* Favorite Taylor Swift song: "Shake It Off"
* Purchasing behavior: This group is made up of casual Taylor Swift fans who enjoy her music but are not necess