# Using the Vertex AI PaLM API to explain BQML Clustering

This example demonstrates how to use the Vertex AI PaLM API to explain BQML clustering.


Let's install the Vertex AI SDK, then restart the runtime so the upgraded package is picked up.

# Install packages

In [5]:
!pip install google-cloud-aiplatform --upgrade --quiet

# Core code
Let's define some variables that will be used throughout this notebook.

These are the GCP project ID `PROJECT_ID`, the region `REGION`, the dataset name `DATASET`, and the model name `BQML_MODEL`, which can be any name you prefer.
The dataset needs to exist in the `PROJECT_ID` project, and you'll need permission to create and delete models in it.

## Setup

In [35]:
from typing import Union

import pandas as pd

import vertexai.preview.language_models
from google.cloud import aiplatform
from google.cloud import bigquery

from IPython.display import display, Markdown


In [15]:
PROJECT_ID = 'mg-ce-demos'
REGION = 'us-central1'
DATASET = "bqml_demos" 
BQML_MODEL = "ecommerce_customer_segment_cluster5" 
EVAL = BQML_MODEL + "_eval"
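
The dataset and model names are later substituted into SQL text with `str.format`, so a malformed name only surfaces when BigQuery rejects the query. The following is a minimal sketch of an early check. `model_ref` is a hypothetical helper, not part of the BigQuery client, and the naming rule it enforces (letters, digits, and underscores only) is an assumption that should be checked against the BigQuery naming documentation.

```python
import re

# Assumption: dataset and model names use only letters, digits, and underscores.
_NAME_RE = re.compile(r"[A-Za-z0-9_]+")


def model_ref(dataset: str, model: str) -> str:
    """Return a backtick-quoted `dataset.model` reference for use in SQL text."""
    for label, value in (("dataset", dataset), ("model", model)):
        if not _NAME_RE.fullmatch(value):
            raise ValueError(f"invalid {label} name: {value!r}")
    return f"`{dataset}.{model}`"


print(model_ref("bqml_demos", "ecommerce_customer_segment_cluster5"))
```

Validating before formatting keeps a typo in a name from turning into a confusing SQL syntax error later in the notebook.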


In [62]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)
aiplatform.init(project = PROJECT_ID, location = REGION)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)

## Create a K-means model to cluster ecommerce data

This query can be run in BigQuery on its own. Try it out!

In [63]:
query = """
SELECT
  user_id,
  order_id,
  sale_price,
  created_at as order_created_date
FROM `mg-ce-demos.thelook_ecommerce.order_items`
WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
"""
df = bq.query(query).to_dataframe()
df.head()


Unnamed: 0,user_id,order_id,sale_price,order_created_date
0,40085,50021,2.5,2023-04-07 08:09:34+00:00
1,90363,112961,2.5,2022-12-25 23:23:29+00:00
2,3922,4887,2.5,2022-09-29 13:47:16+00:00
3,56076,70152,2.5,2022-03-27 03:01:23+00:00
4,90828,113540,2.5,2023-03-18 08:30:11+00:00


## `CREATE MODEL` using `KMEANS`

Build the query, then start the model creation job. The `run_bq_query` wrapper blocks on `result()` until the job completes.

In [64]:
query = """
CREATE OR REPLACE MODEL `{0}.{1}`
OPTIONS (
  MODEL_TYPE = "KMEANS",
  NUM_CLUSTERS = 5,
  KMEANS_INIT_METHOD = "KMEANS++",
  STANDARDIZE_FEATURES = TRUE )
AS (
SELECT * EXCEPT (user_id)
FROM (
  SELECT user_id,
    DATE_DIFF(CURRENT_DATE(), CAST(MAX(order_created_date) as DATE), day) AS days_since_order, -- RECENCY
    COUNT(order_id) AS count_orders, -- FREQUENCY
    AVG(sale_price) AS avg_spend -- MONETARY
  FROM (
    SELECT user_id,
      order_id,
      sale_price,
      created_at as order_created_date
    FROM `mg-ce-demos.thelook_ecommerce.order_items`
    WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
  )
  GROUP BY user_id, order_id
 )
)
""".format(DATASET, BQML_MODEL)


In [65]:
# Wrapper that runs a query or job with the BigQuery client and returns the result as a DataFrame
def run_bq_query(sql: str) -> pd.DataFrame:

    # Try a dry run before executing the query to catch any errors
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    bq.query(sql, job_config=job_config)

    # If the dry run succeeds without errors, proceed to run the query
    job_config = bigquery.QueryJobConfig()
    client_result = bq.query(sql, job_config=job_config)

    job_id = client_result.job_id

    # Wait for the query/job to finish running, then get and return the data frame
    df = client_result.result().to_arrow().to_pandas()
    print(f"Finished job_id: {job_id}")
    return df

In [66]:
print(query)

# this should take under 5 minutes to create the model
run_bq_query(query)


CREATE OR REPLACE MODEL `bqml_demos.ecommerce_customer_segment_cluster5`
OPTIONS (
  MODEL_TYPE = "KMEANS",
  NUM_CLUSTERS = 5,
  KMEANS_INIT_METHOD = "KMEANS++",
  STANDARDIZE_FEATURES = TRUE )
AS (
SELECT * EXCEPT (user_id)
FROM (
  SELECT user_id,
    DATE_DIFF(CURRENT_DATE(), CAST(MAX(order_created_date) as DATE), day) AS days_since_order, -- RECENCY
    COUNT(order_id) AS count_orders, -- FREQUENCY
    AVG(sale_price) AS avg_spend -- MONETARY
  FROM (
    SELECT user_id,
      order_id,
      sale_price,
      created_at as order_created_date
    FROM `mg-ce-demos.thelook_ecommerce.order_items`
    WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
  )
  GROUP BY user_id, order_id
 )
)

Finished job_id: 01821c42-7129-4064-bb34-5bb7c784d037
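
The training query derives three features from the order rows: recency (`days_since_order`), frequency (`count_orders`), and monetary value (`avg_spend`). The following is a rough sketch of that aggregation in plain Python. The sample rows and the `rfm_features` helper are hypothetical, `today` is passed in explicitly instead of using the current date, and the sketch groups by user only, whereas the query above groups by `user_id, order_id`.

```python
from collections import defaultdict
from datetime import date


def rfm_features(rows, today):
    """Aggregate (user_id, order_id, sale_price, order_date) rows per user."""
    by_user = defaultdict(list)
    for user_id, order_id, sale_price, order_date in rows:
        by_user[user_id].append((order_id, sale_price, order_date))

    features = {}
    for user_id, items in by_user.items():
        last_order = max(order_date for _, _, order_date in items)
        features[user_id] = {
            # Recency: days between the reference date and the latest order.
            "days_since_order": (today - last_order).days,
            # Frequency: like COUNT(order_id), this counts rows in the group.
            "count_orders": len(items),
            # Monetary: mean sale price over the group's rows.
            "avg_spend": sum(price for _, price, _ in items) / len(items),
        }
    return features


# Hypothetical sample rows, not taken from the dataset.
rows = [
    (1, 100, 20.0, date(2023, 1, 10)),
    (1, 101, 40.0, date(2023, 3, 1)),
    (2, 200, 5.0, date(2022, 6, 15)),
]
features = rfm_features(rows, today=date(2023, 4, 1))
print(features)
```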


Let's take a look at the model's clustering performance using two metrics: the Davies-Bouldin index and the mean squared distance.

In [67]:
query = """
SELECT *
FROM ML.EVALUATE(MODEL `{0}.{1}`)
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: beb2fae5-a071-40a9-8fdc-0a9aa6d44e6a


Unnamed: 0,davies_bouldin_index,mean_squared_distance
0,1.012579,0.975499
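
To make the two metrics concrete, here is a small sketch of their textbook definitions on toy data. The Davies-Bouldin index averages, over clusters, the worst ratio of within-cluster scatter to between-centroid distance, so lower values indicate tighter, better separated clusters. The mean squared distance averages the squared distance from each point to its centroid. The function names and sample points are hypothetical, and BigQuery ML's exact computation (for example, on standardized features) may differ.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def davies_bouldin(clusters):
    """clusters: one list of points per cluster."""
    centers = [centroid(pts) for pts in clusters]
    # Scatter: mean distance from each point to its own centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in pts) / len(pts)
        for pts, ctr in zip(clusters, centers)
    ]
    worst = []
    for i in range(len(clusters)):
        # Similarity of cluster i to every other cluster; keep the worst case.
        ratios = [
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(len(clusters))
            if j != i
        ]
        worst.append(max(ratios))
    return sum(worst) / len(worst)


def mean_squared_distance(clusters):
    """Average squared distance from each point to its cluster centroid."""
    total = 0.0
    count = 0
    for pts in clusters:
        ctr = centroid(pts)
        total += sum(math.dist(p, ctr) ** 2 for p in pts)
        count += len(pts)
    return total / count


# Two well-separated toy clusters (hypothetical data).
toy = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]
print(davies_bouldin(toy), mean_squared_distance(toy))
```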


Now let's get the cluster (centroid) information

In [68]:
query = """
SELECT
  CONCAT('cluster ', CAST(centroid_id as STRING)) as centroid,
  avg_spend as average_spend,
  count_orders as count_of_orders,
  days_since_order
FROM (
  SELECT centroid_id, feature, ROUND(numerical_value, 2) as value
  FROM ML.CENTROIDS(MODEL `{0}.{1}`)
)
PIVOT (
  SUM(value)
  FOR feature IN ('avg_spend',  'count_orders', 'days_since_order')
)
ORDER BY centroid_id
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: 9d4f9016-c929-4cce-baa0-571dcf0f91aa


Unnamed: 0,centroid,average_spend,count_of_orders,days_since_order
0,cluster 1,146.47,1.15,153.47
1,cluster 2,46.37,1.24,385.58
2,cluster 3,52.43,2.67,139.33
3,cluster 4,388.32,1.15,202.86
4,cluster 5,37.73,1.0,99.92
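
Before handing these numbers to a language model, it can help to pick out the extremes by hand so that any generated description can be checked against them. The sketch below copies the centroid values from the table above into a plain dictionary. The `extreme` helper is hypothetical.

```python
# Centroid values copied from the ML.CENTROIDS output above.
centroids = {
    "cluster 1": {"average_spend": 146.47, "count_of_orders": 1.15, "days_since_order": 153.47},
    "cluster 2": {"average_spend": 46.37, "count_of_orders": 1.24, "days_since_order": 385.58},
    "cluster 3": {"average_spend": 52.43, "count_of_orders": 2.67, "days_since_order": 139.33},
    "cluster 4": {"average_spend": 388.32, "count_of_orders": 1.15, "days_since_order": 202.86},
    "cluster 5": {"average_spend": 37.73, "count_of_orders": 1.0, "days_since_order": 99.92},
}


def extreme(feature, pick=max):
    """Return the cluster name with the largest (or smallest) value of a feature."""
    return pick(centroids, key=lambda name: centroids[name][feature])


print("highest average spend:", extreme("average_spend"))          # cluster 4
print("most orders:", extreme("count_of_orders"))                  # cluster 3
print("most recent order:", extreme("days_since_order", pick=min)) # cluster 5
print("longest since last order:", extreme("days_since_order"))    # cluster 2
```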


Whew! That's a lot of metrics and cluster info. How about we explain this to our colleagues using the magic of LLMs?

In [69]:
df = bq.query(query).to_dataframe()

# Turn each centroid row into one line of plain text for the prompt
cluster_info = []
for _, row in df.iterrows():
    cluster_info.append(
        f"{row['centroid']}, average spend ${row['average_spend']}, "
        f"count of orders per person {row['count_of_orders']}, "
        f"days since last order {row['days_since_order']}"
    )

print("\n".join(cluster_info))

cluster 1, average spend $146.47, count of orders per person 1.15, days since last order 153.47
cluster 2, average spend $46.37, count of orders per person 1.24, days since last order 385.58
cluster 3, average spend $52.43, count of orders per person 2.67, days since last order 139.33
cluster 4, average spend $388.32, count of orders per person 1.15, days since last order 202.86
cluster 5, average spend $37.73, count of orders per person 1.0, days since last order 99.92


## Explain with Vertex AI PaLM API

Load the PaLM text model (`text-bison@001`), then generate a text prediction.

In [70]:
textgen_model = vertexai.preview.language_models.TextGenerationModel.from_pretrained('text-bison@001')

In [71]:
preamble = "Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, and explain step by step; what would be the next marketing step for these clusters:"

display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))
display(Markdown('## Response:'))
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Prompt:

Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, and explain step by step; what would be the next marketing step for these clusters:


cluster 1, average spend $146.47, count of orders per person 1.15, days since last order 153.47
cluster 2, average spend $46.37, count of orders per person 1.24, days since last order 385.58
cluster 3, average spend $52.43, count of orders per person 2.67, days since last order 139.33
cluster 4, average spend $388.32, count of orders per person 1.15, days since last order 202.86
cluster 5, average spend $37.73, count of orders per person 1.0, days since last order 99.92


## Response:

**Cluster 1: The Frequent Buyers**

These customers are our most loyal and valuable. They spend a lot of money on our products and services, and they come back to us again and again. We should focus on keeping these customers happy by offering them exclusive deals and promotions, and by making sure that they have a positive experience every time they shop with us.

**Cluster 2: The Bargain Hunters**

These customers are looking for the best possible deal. They are willing to shop around for the best price, and they are not afraid to switch brands if they find a better deal. We should focus on offering these customers competitive prices and promotions, and we should make sure that our products and services are of high quality.

**Cluster 3: The Socializers**

These customers are always looking for new ways to connect with their friends and family. They are more likely to buy products and services that they can share with their social networks. We should focus on creating social media campaigns that appeal to these customers, and we should make sure that our products and services are easy to share.

**Cluster 4: The Luxury Seekers**

These customers are looking for the best possible quality. They are willing to pay a premium for products and services that are made with the finest materials and craftsmanship. We should focus on offering these customers high-end products and services, and we should make sure that our customer service is top-notch.

**Cluster 5: The Casual Shoppers**

These customers are not looking for anything in particular. They are just browsing and seeing what's available. We should focus on making our products and services easy to find and shop for, and we should make sure that our prices are competitive.

Once we have identified the different customer personas, we can develop marketing campaigns that are specifically tailored to each persona. This will help us to reach our target customers and increase our sales.

Voila! We've now used k-means clustering to group spenders and the PaLM API to explain their profiles.

Sometimes, though, you want a little bit [extra](https://cloud.google.com/blog/transform/prompt-debunking-five-generative-ai-misconceptions).

In [72]:
preamble = "Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:"

display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))
display(Markdown('## Response:'))
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Prompt:

Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:


cluster 1, average spend $146.47, count of orders per person 1.15, days since last order 153.47
cluster 2, average spend $46.37, count of orders per person 1.24, days since last order 385.58
cluster 3, average spend $52.43, count of orders per person 2.67, days since last order 139.33
cluster 4, average spend $388.32, count of orders per person 1.15, days since last order 202.86
cluster 5, average spend $37.73, count of orders per person 1.0, days since last order 99.92


## Response:

**Cluster 1: The Movie Buff**

* Favorite movie: The Shawshank Redemption
* Purchasing behavior: This group spends a lot of money on movies, and they tend to buy them often. They are likely to be big fans of classic movies, and they may be interested in collecting DVDs or Blu-rays.
* E-mail headline: "The Shawshank Redemption is Now on Blu-ray!"

**Cluster 2: The Casual Moviegoer**

* Favorite movie: The Avengers
* Purchasing behavior: This group spends less money on movies than the Movie Buffs, but they still buy them fairly often. They are likely to be fans of popular movies, and they may be interested in seeing new releases in theaters.
* E-mail headline: "The Avengers: Endgame is Now in Theaters!"

**Cluster 3: The Family Movie Night Group**

* Favorite movie: Frozen
* Purchasing behavior: This group spends a moderate amount of money on movies, and they tend to buy them in bulk. They are likely to be families with young children, and they may be interested in movies that are appropriate for all ages.
* E-mail headline: "Frozen 2 is Now Available on DVD!"

**Cluster 4: The Big Spender**

* Favorite movie: The Godfather
* Purchasing behavior: This group spends a lot of money on movies, and they tend to buy them infrequently. They are likely to be very passionate about movies, and they may be interested in collecting rare or hard-to-find titles.
* E-mail headline: "The Godfather: The Coppola Restoration is Now Available on Blu-ray!"

**Cluster 5: The Bargain Hunter**

* Favorite movie: The Princess Bride
* Purchasing behavior: This group spends the least amount of money on movies, and they tend to buy them only when they are on sale. They are likely to be very price-conscious, and they may be interested in finding movies that are available for free or at a discount.
* E-mail headline: "The Princess Bride is Now on Sale!"

In [83]:
def persona_builder(preamble):
    prompt = preamble + "\n" + str.join("\n", cluster_info)
    result = textgen_model.predict(
        prompt,
        max_output_tokens=1024,
        temperature=0.35,
        top_p=0.8,
        top_k=40,
    )
    
    return prompt, result
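
`persona_builder` reads `cluster_info` and `textgen_model` from the notebook's global state. One possible variant, sketched below, passes both in as arguments so the prompt assembly can be checked without calling the API. `build_prompt`, `persona_builder_with`, and the `EchoModel` stand-in are hypothetical names introduced for illustration. The only assumption about the real model is that it exposes `predict(prompt, **params)`, as used above.

```python
def build_prompt(preamble, cluster_info):
    """Join the task preamble and the cluster lines, one item per line."""
    return preamble + "\n" + "\n".join(cluster_info)


def persona_builder_with(model, preamble, cluster_info, **params):
    """Build the prompt, call model.predict, and return (prompt, response text)."""
    prompt = build_prompt(preamble, cluster_info)
    response = model.predict(prompt, **params)
    return prompt, str(response)


class EchoModel:
    """Hypothetical stand-in for the text model; it never calls an API."""

    def predict(self, prompt, **params):
        lines = len(prompt.splitlines())
        return f"[{lines} prompt lines, temperature={params.get('temperature')}]"


prompt, response = persona_builder_with(
    EchoModel(),
    "Label each cluster:",
    ["cluster 1, average spend $146.47", "cluster 2, average spend $46.37"],
    temperature=0.35,
)
print(prompt)
print(response)
```

Passing the model in also makes it straightforward to swap in a different model object later without editing the function body.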

In [84]:
#persona_builder("Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:")

## UI for Demo

In [89]:
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown(
    """
    ## Persona Builder and Marketing Bot

    ### Enter a task...

    """)
    with gr.Row():
      with gr.Column():
        pre_text = "Pretend you're a creative strategist, given the following clusters "
        input_text = gr.Textbox(label="Task", placeholder="Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, and explain step by step; what would be the next marketing step for these clusters:")
        
        
    with gr.Row():
      generate = gr.Button("Generate Response")

    with gr.Row():
      label2 = gr.Textbox(label="Prompt")
    with gr.Row():
      label3 = gr.Textbox(label="Response generated by LLM")

    generate.click(persona_builder,input_text, [label2, label3])
demo.launch(share=False, debug=False)

Running on local URL:  http://127.0.0.1:7867

To create a public link, set `share=True` in `launch()`.


