# Using the Vertex AI PaLM API to explain BQML Clustering - Our Very Own Persona Builder

## Here's what we're going to build:

<p align="center">
  <img alt="Conceptual Flow" src="slides/process0.png" width="100%">
</p>

# Core code
Let's define some variables that will be used throughout this notebook.

These are the GCP project ID `PROJECT_ID`, the region `REGION`, the BigQuery dataset name `DATASET`, and the model name `BQML_MODEL`, which can be any name you prefer.
The dataset must already exist in the `PROJECT_ID` project, and you'll need permission to create and delete models in it.

## Setup

In [1]:
from typing import Union
import sys

import os
import io
import json
import base64
import requests
import concurrent.futures
import time

import numpy as np
import pandas as pd

import vertexai
from vertexai.language_models import TextGenerationModel, TextEmbeddingModel
from vertexai.preview.language_models import TextGenerationModel as TextGenerationModel_preview
from google.cloud import aiplatform
from google.cloud import documentai
from google.cloud.documentai_v1 import Document
from google.cloud import storage
from google.cloud import bigquery

from IPython.display import display, Markdown, Latex
import markdown

print("Vertex AI version: " + str(aiplatform.__version__))



Vertex AI version: 1.36.0


In [2]:
PROJECT_ID = 'mg-ce-demos'
REGION = 'us-central1'
DATASET = "bqml_demos" 
BQML_MODEL = "ecommerce_customer_segment_cluster5" 
EVAL = BQML_MODEL + "_eval"


In [3]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)
aiplatform.init(project = PROJECT_ID, location = REGION)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)

## Create a K-means model to cluster ecommerce data

### Let's look at the data first

<p align="center">
  <img alt="Conceptual Flow" src="slides/process1.png" width="100%">
</p>

In [4]:
query = """
SELECT
  user_id,
  order_id,
  sale_price,
  created_at as order_created_date
FROM `mg-ce-demos.thelook_ecommerce.order_items`
WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
"""
df = bq.query(query).to_dataframe()
df.head()


| | user_id | order_id | sale_price | order_created_date |
|---|---|---|---|---|
| 0 | 40085 | 50021 | 2.5 | 2023-04-07 08:09:34+00:00 |
| 1 | 90363 | 112961 | 2.5 | 2022-12-25 23:23:29+00:00 |
| 2 | 3922 | 4887 | 2.5 | 2022-09-29 13:47:16+00:00 |
| 3 | 56076 | 70152 | 2.5 | 2022-03-27 03:01:23+00:00 |
| 4 | 90828 | 113540 | 2.5 | 2023-03-18 08:30:11+00:00 |


### `CREATE MODEL` using `KMEANS`

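Build the `CREATE MODEL` query, then submit it as a BigQuery job and wait for the job to complete. Because the statement uses `IF NOT EXISTS`, rerunning it once the model exists returns almost immediately.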

<p align="center">
  <img alt="Conceptual Flow" src="slides/process2.png" width="100%">
</p>

In [5]:
query = """
CREATE MODEL IF NOT EXISTS `{0}.{1}`
OPTIONS (
  MODEL_TYPE = "KMEANS",
  NUM_CLUSTERS = 5,
  KMEANS_INIT_METHOD = "KMEANS++",
  STANDARDIZE_FEATURES = TRUE )
AS (
SELECT * EXCEPT (user_id)
FROM (
  SELECT user_id,
    DATE_DIFF(CURRENT_DATE(), CAST(MAX(order_created_date) as DATE), day) AS days_since_order, -- RECENCY
    COUNT(order_id) AS count_orders, -- FREQUENCY
    AVG(sale_price) AS avg_spend -- MONETARY
  FROM (
    SELECT user_id,
      order_id,
      sale_price,
      created_at as order_created_date
    FROM `mg-ce-demos.thelook_ecommerce.order_items`
    WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
  )
  GROUP BY user_id, order_id -- one row per (user, order) pair, not one row per user
 )
)
""".format(DATASET, BQML_MODEL)
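
The `STANDARDIZE_FEATURES = TRUE` option matters because k-means relies on distances between points: without rescaling, a feature measured in hundreds of days would swamp one measured in single-digit counts. The snippet below is a minimal sketch of z-score scaling on made-up numbers. It assumes the population standard deviation; how BigQuery ML computes this internally is not shown in this notebook, and `standardize` is a hypothetical helper.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a list of numbers: subtract the mean, divide by the std. dev."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Toy feature columns (illustrative values, not taken from the dataset)
days_since_order = [10, 100, 200, 400]
avg_spend = [20.0, 45.0, 60.0, 250.0]

z_days = standardize(days_since_order)
z_spend = standardize(avg_spend)

# After scaling, both columns have mean 0 and standard deviation 1,
# so neither dominates a Euclidean distance.
print([round(z, 2) for z in z_days])
print([round(z, 2) for z in z_spend])
```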


In [6]:
# Wrapper that runs a query/job with the BigQuery client and returns the result as a DataFrame
def run_bq_query(sql: str) -> pd.DataFrame:

    # Optional: uncomment to dry-run the query first and catch errors before executing it
    #job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    #bq.query(sql, job_config=job_config)

    # Submit the query/job
    job_config = bigquery.QueryJobConfig()
    client_result = bq.query(sql, job_config=job_config)

    job_id = client_result.job_id

    # Block until the query/job finishes, then convert the result to a DataFrame
    df = client_result.result().to_arrow().to_pandas()
    print(f"Finished job_id: {job_id}")
    return df
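
Calling `result()` blocks until the job finishes. If you would rather see progress while a long model-training job runs, an explicit polling loop is one alternative. The following is a minimal sketch, assuming the job object exposes a `done()` method (as BigQuery's `QueryJob` does). `wait_for_job` and `FakeJob` are hypothetical names used only for illustration, and `FakeJob` stands in for a real job so the example runs without credentials.

```python
import time


def wait_for_job(job, poll_seconds=2.0, timeout_seconds=600.0):
    """Poll job.done() until it returns True or the timeout elapses."""
    start = time.monotonic()
    while not job.done():
        elapsed = time.monotonic() - start
        if elapsed > timeout_seconds:
            raise TimeoutError(f"job still running after {elapsed:.0f}s")
        print(f"still running... {elapsed:.0f}s elapsed")
        time.sleep(poll_seconds)
    return job


class FakeJob:
    """Stand-in for a BigQuery job; reports done after `polls_needed` checks."""

    def __init__(self, polls_needed):
        self.polls_left = polls_needed

    def done(self):
        self.polls_left -= 1
        return self.polls_left <= 0


finished = wait_for_job(FakeJob(polls_needed=3), poll_seconds=0.01)
print("job finished")
```

With a real client, the same loop could be applied to the object returned by `bq.query(sql)` before calling `result()` to fetch the rows.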

In [7]:
%%timeit

run_bq_query(query)

Finished job_id: 446d8320-2ec2-4864-aed9-dba251360381
Finished job_id: ecd27b31-1006-4af1-9aa7-c689c0d44979
Finished job_id: c545314d-bebe-435d-9dc4-5f13804d6194
Finished job_id: b8abfdc8-b53f-4329-b153-0cb98d71ad72
Finished job_id: 72e1ada0-7878-4fd3-ac4d-42dbb1a4bf03
Finished job_id: ccd74a71-4637-42e1-8144-89e1e0d365b7
Finished job_id: 5b740483-e8fc-4eca-90d4-a864abefe6ab
Finished job_id: bf28de2f-081c-475a-9b34-465a1d64c907
The slowest run took 15.03 times longer than the fastest. This could mean that an intermediate result is being cached.
2.72 s ± 4.24 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


Let's take a look at the model's clustering performance using two metrics: the Davies-Bouldin index and the mean squared distance. For both, lower values indicate tighter, better-separated clusters.

In [8]:
query = """
SELECT *
FROM ML.EVALUATE(MODEL `{0}.{1}`)
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: 027490ae-ab50-42d9-ad49-d9fd31aeaff8


| | davies_bouldin_index | mean_squared_distance |
|---|---|---|
| 0 | 1.023378 | 1.051785 |
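
To make these two metrics less abstract, here is a minimal pure-Python sketch of how they are usually defined, run on toy 2-D points rather than the notebook's data. It assumes the standard textbook definitions (Euclidean distance, scatter as the mean distance to the centroid). BigQuery ML's exact computation may differ in detail, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def davies_bouldin(clusters):
    """Average, over clusters, of the worst (scatter_i + scatter_j) / separation_ij."""
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance from each point to its own centroid
    scatter = [
        sum(euclidean(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    worst = []
    for i in range(len(clusters)):
        ratios = [
            (scatter[i] + scatter[j]) / euclidean(centers[i], centers[j])
            for j in range(len(clusters))
            if j != i
        ]
        worst.append(max(ratios))
    return sum(worst) / len(worst)


def mean_squared_distance(clusters):
    """Mean squared distance from each point to its cluster centroid."""
    total, count = 0.0, 0
    for c in clusters:
        ctr = centroid(c)
        for p in c:
            total += euclidean(p, ctr) ** 2
            count += 1
    return total / count


# Two toy clusterings: same cluster shapes, different separation
far_apart = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
close_together = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]

print(davies_bouldin(far_apart), davies_bouldin(close_together))
print(mean_squared_distance(far_apart))
```

The well-separated clustering gets the lower Davies-Bouldin score, while the mean squared distance is the same for both because it only measures how tight each cluster is.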


### Now let's get the cluster (centroid) information

<p align="center">
  <img alt="Conceptual Flow" src="slides/process3.png" width="100%">
</p>

In [9]:
query = """
SELECT
  CONCAT('cluster ', CAST(centroid_id as STRING)) as centroid,
  avg_spend as average_spend,
  count_orders as count_of_orders,
  days_since_order
FROM (
  SELECT centroid_id, feature, ROUND(numerical_value, 2) as value
  FROM ML.CENTROIDS(MODEL `{0}.{1}`)
)
PIVOT (
  SUM(value)
  FOR feature IN ('avg_spend',  'count_orders', 'days_since_order')
)
ORDER BY centroid_id
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: 9ac002a3-6912-4a9b-8cb0-e9ad6eef4f6e


| | centroid | average_spend | count_of_orders | days_since_order |
|---|---|---|---|---|
| 0 | cluster 1 | 49.44 | 1.23 | 102.87 |
| 1 | cluster 2 | 59.56 | 3.51 | 87.9 |
| 2 | cluster 3 | 251.34 | 1.14 | 205.85 |
| 3 | cluster 4 | 57.47 | 3.49 | 354.32 |
| 4 | cluster 5 | 48.87 | 1.22 | 376.5 |
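
`ML.CENTROIDS` returns one row per (centroid, feature) pair, which is why the query uses `PIVOT` to get one row per cluster. The reshaping itself can be sketched in plain Python. The rows below copy the first two clusters from the table above, and `pivot` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict

# Long-format rows shaped like ML.CENTROIDS output: (centroid_id, feature, value)
long_rows = [
    (1, "avg_spend", 49.44),
    (1, "count_orders", 1.23),
    (1, "days_since_order", 102.87),
    (2, "avg_spend", 59.56),
    (2, "count_orders", 3.51),
    (2, "days_since_order", 87.9),
]


def pivot(rows):
    """Turn (centroid_id, feature, value) rows into {centroid_id: {feature: value}}."""
    wide = defaultdict(dict)
    for centroid_id, feature, value in rows:
        wide[centroid_id][feature] = value
    return dict(wide)


wide = pivot(long_rows)
for centroid_id in sorted(wide):
    print(f"cluster {centroid_id}", wide[centroid_id])
```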


Whew! That's a lot of metrics and cluster info. How about we explain this to our colleagues using the magic of LLMs?

In [10]:
df = bq.query(query).to_dataframe()

# Turn each centroid row into one plain-text line for the prompt
cluster_info = []
for _, row in df.iterrows():
    cluster_info.append(
        f"{row['centroid']}, average spend ${row['average_spend']}, "
        f"count of orders per person {row['count_of_orders']}, "
        f"days since last order {row['days_since_order']}"
    )

print("\n".join(cluster_info))

cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Explain with Vertex AI PaLM API

### First, we want to instantiate the large language model and create the prompt

<p align="center">
  <img alt="Conceptual Flow" src="slides/process4.png" width="100%">
</p>

In [11]:
textgen_model = TextGenerationModel.from_pretrained('text-bison')

In [12]:
preamble = """Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, 
and explain step by step; what would be the next marketing step for these clusters:

"""

display(Markdown('## Prompt:'))
print(preamble)
print(str.join("\n", cluster_info))

## Prompt:

Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, 
and explain step by step; what would be the next marketing step for these clusters:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


### Now, we send our prompt to the Vertex AI PaLM API for some LLM magic

<p align="center">
  <img alt="Conceptual Flow" src="slides/process5.png" width="100%">
</p>

In [13]:
# Send prompt to LLM
response = textgen_model.predict(
    preamble + "\n" + "\n".join(cluster_info),
    max_output_tokens=1024,
    temperature=0.4,
    top_p=0.8,
    top_k=40,
)

display(Markdown('## Response:'))
display(Markdown(response.text))
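
The `temperature`, `top_k` and `top_p` arguments all shape how the next token is sampled. The toy sketch below shows the general idea on a four-word vocabulary with made-up scores. It is an illustration only: the order in which a real service applies these filters may differ, and `filter_distribution` is a hypothetical function, not part of the Vertex AI SDK.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature: divide scores before softmax; <1 sharpens, >1 flattens
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = sorted(
        ((tok, e / total) for tok, e in exps.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    # Top-k: keep only the k most likely tokens
    if top_k is not None:
        probs = probs[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize the surviving tokens so they sum to 1
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


# Made-up scores for four candidate next words
logits = {"loyal": 2.0, "thrifty": 1.5, "luxury": 0.5, "dormant": -1.0}

focused = filter_distribution(logits, temperature=0.4, top_k=40, top_p=0.8)
diverse = filter_distribution(logits, temperature=2.0, top_k=40, top_p=0.8)
print(sorted(focused), sorted(diverse))

# Draw one token from the filtered distribution
rng = random.Random(0)
token = rng.choices(list(focused), weights=list(focused.values()), k=1)[0]
print(token)
```

A low temperature concentrates probability on the top candidates, so fewer tokens survive the `top_p` cut; raising it lets more varied words through.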

## Response:

 **Cluster 1**: 
* Brand persona: Thrifty shoppers who are looking for a good deal. 
* Title labels: 
 -  Savvy Spender
 -  Budget-Minded Buyer
 -  Value-Conscious Consumer 
* Marketing step: Offer discounts and promotions to entice these customers to make more purchases.


**Cluster 2**: 
* Brand persona:  Loyal customers who are happy with your products and services. 
* Title labels: 
 -  Loyal Customer
 -  Repeat Purchaser
 -  Brand Advocate 
* Marketing step: Reward these customers for their loyalty with a loyalty program or exclusive discounts.


**Cluster 3**: 
* Brand persona:  High-end shoppers who are looking for the best of the best. 
* Title labels: 
 -  Luxury Seeker
 -  Affluent Consumer
 -  Discerning Shopper 
* Marketing step:  Position your brand as a high-end luxury brand and target your marketing efforts to affluent consumers.


**Cluster 4**: 
* Brand persona:  Customers who are looking for a specific product or service and are willing to wait for it. 
* Title labels: 
 -  Patient Purchaser
 -  Long-Term Customer
 -  Deliberate Buyer 
* Marketing step:  Create content that helps these customers find the products or services they are looking for and provide them with information about the benefits of waiting for your product or service.


**Cluster 5**: 
* Brand persona:  Customers who are not very engaged with your brand and are likely to churn. 
* Title labels: 
 -  Disengaged Customer
 -  At-Risk Customer
 -  Potential Churn Customer
* Marketing step:  Reach out to these customers with personalized messages and offers to try to win them back.

Voilà! We've now used k-means clustering to create groups of spenders and an LLM to explain their profiles.

Sometimes, though, you want a little bit [extra](https://cloud.google.com/blog/transform/prompt-debunking-five-generative-ai-misconceptions).

In [14]:
preamble = """Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, 
a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:

"""

display(Markdown('## Prompt:'))
print(preamble)
print(str.join("\n", cluster_info))

response = textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.4,
    top_p=0.8,
    top_k=40,
)

display(Markdown('## Response:'))
display(Markdown(response.text))

## Prompt:

Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, 
a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Response:

 **Cluster 1**: 
* Creative brand persona*: Thrifty shoppers who are looking for a good deal. Their favorite movie is "The Wolf of Wall Street" because it shows how money can be made and lost quickly. 
* Summary of how this relates to their purchasing behavior*: They are likely to compare prices before making a purchase and are more likely to buy items that are on sale. 
* Witty email headline for marketing campaign*: "Save money like a wolf on Wall Street with our amazing deals!"

**Cluster 2**: 
* Creative brand persona*: Busy professionals who don't have a lot of time to shop. Their favorite movie is "The Devil Wears Prada" because it shows how the fashion industry can be cutthroat and demanding. 
* Summary of how this relates to their purchasing behavior*: They are likely to value convenience and are more likely to buy items that can be purchased online or with a quick trip to the store. 
* Witty email headline for marketing campaign*: "Look like a boss and shop like a pro with our convenient shopping options!"

**Cluster 3**: 
* Creative brand persona*: Luxury seekers who are looking for the best of the best. Their favorite movie is "The Great Gatsby" because it shows how the wealthy live in a world of excess and opulence. 
* Summary of how this relates to their purchasing behavior*: They are likely to be willing to pay more for high-quality items and are more likely to buy items from luxury brands. 
* Witty email headline for marketing campaign*: "Indulge in luxury like Gatsby with our exquisite collection!"

**Cluster 4**: 
* Creative brand persona*: Frugal shoppers who are looking for a bargain. Their favorite movie is "The Big Short" because it shows how the financial crisis of 2008 was caused by greed and corruption. 
* Summary of how this relates to their purchasing behavior*: They are likely to be very price-conscious and are more likely to buy items that are on sale or discounted. 
* Witty email headline for marketing campaign*: "Get more bang for your buck with our amazing deals!"

**Cluster 5**: 
* Creative brand persona*: Spontaneous shoppers who are looking for something new and exciting. Their favorite movie is "The Hangover" because it shows how a night of partying can lead to unexpected and hilarious consequences. 
* Summary of how this relates to their purchasing behavior*: They are likely to be impulsive and are more likely to buy items that are new or trendy. 
* Witty email headline for marketing campaign*: "Shop like there's no tomorrow with our exciting new products!"

In [15]:
preamble = """Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite food 
and a detailed description of an image containing the favorite food that does not include people:

"""

display(Markdown('## Prompt:'))
print(preamble)
print(str.join("\n", cluster_info))

response = textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.4,
    top_p=0.8,
    top_k=40,
)

display(Markdown('## Response:'))
display(Markdown(response.text))

## Prompt:

Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite food 
and a detailed description of an image containing the favorite food that does not include people:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Response:

 **Cluster 1**: 

* Creative brand persona: Thrifty Foodie
* Favorite food: A simple but delicious meal of roasted chicken, vegetables, and potatoes.
* Image: A close-up of a plate of roasted chicken, vegetables, and potatoes. The chicken is golden brown and juicy, the vegetables are colorful and crisp, and the potatoes are fluffy and soft. The background is a simple white, allowing the food to be the star of the show.

**Cluster 2**: 

* Creative brand persona: Busy Professional
* Favorite food: A quick and easy meal of a salad with grilled salmon.
* Image: A bowl of mixed greens topped with grilled salmon, avocado, tomatoes, and cucumbers. The salmon is perfectly cooked, the avocado is creamy and ripe, and the tomatoes and cucumbers are fresh and crisp. The background is a busy office setting, with papers and files scattered about.

**Cluster 3**: 

* Creative brand persona: Affluent Entertainer
* Favorite food: An elegant meal of filet mignon with asparagus and mashed potatoes.
* Image: A plate of filet mignon, asparagus, and mashed potatoes. The filet mignon is cooked to perfection, the asparagus is tender and green, and the mashed potatoes are creamy and smooth. The background is a beautifully set table, with fine china and crystal.

**Cluster 4**: 

* Creative brand persona: Health-Conscious Athlete
* Favorite food: A nutritious meal of quinoa with grilled chicken and vegetables.
* Image: A bowl of quinoa topped with grilled chicken, vegetables, and a drizzle of dressing. The quinoa is fluffy and nutty, the chicken is tender and juicy, and the vegetables are colorful and crisp. The background is a gym, with weights and exercise equipment in the background.

**Cluster 5**: 

* Creative brand persona: Laid-Back Adventurer
* Favorite food: A hearty meal of chili with cornbread.
* Image: A bowl of chili with cornbread on the side. The chili is thick and flavorful, with a variety of beans, meat, and vegetables. The cornbread is golden brown and fluffy. The background is a campfire, with trees and mountains in the distance.

In [16]:
def persona_builder(preamble):
    """Build the full prompt from the user's instruction; return (prompt, LLM response text)."""
    pre_text = """Pretend you're a creative strategist, given the following clusters """
    prompt = pre_text + preamble + "\n\n" + "\n".join(cluster_info)
    # Send prompt to LLM with the same sampling settings used earlier
    result = textgen_model.predict(
        prompt,
        max_output_tokens=1024,
        temperature=0.4,
        top_p=0.8,
        top_k=40,
    )

    return prompt, result.text
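
Because the function above calls the model directly, it can't be exercised without API access. One option is to separate prompt assembly from the model call so the prompt logic can be checked offline. This is a minimal sketch of that idea: `build_prompt`, `persona_builder_with` and `fake_predict` are hypothetical names, and the cluster lines are copied from the output earlier in this notebook.

```python
from typing import Callable, List, Tuple

PRE_TEXT = "Pretend you're a creative strategist, given the following clusters "


def build_prompt(instruction: str, cluster_lines: List[str]) -> str:
    """Assemble the full prompt: fixed opening, user instruction, then cluster lines."""
    return PRE_TEXT + instruction.strip() + "\n\n" + "\n".join(cluster_lines)


def persona_builder_with(
    predict: Callable[[str], str], instruction: str, cluster_lines: List[str]
) -> Tuple[str, str]:
    """Return (prompt, response) using whatever predict function is passed in."""
    prompt = build_prompt(instruction, cluster_lines)
    return prompt, predict(prompt)


def fake_predict(prompt: str) -> str:
    """Stub that stands in for the LLM so the example runs offline."""
    return f"(stub reply to a {len(prompt)}-character prompt)"


clusters = [
    "cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87",
    "cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9",
]

prompt, reply = persona_builder_with(
    fake_predict, "come up with a title label for each cluster:", clusters
)
print(prompt)
print(reply)
```

To use the real model, you could pass a small function that wraps `textgen_model.predict(...)` and returns its `.text` in place of `fake_predict`.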

#### Notes

A beautiful mountain range with snow-capped peaks and a river running through it. The sun is shining and the sky is blue.  Photorealistic

come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group

## Now we can stick it behind a UI

<p align="center">
  <img alt="Conceptual Flow" src="slides/process6.png" width="100%">
</p>

In [17]:
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown(
    """
    ## Persona Builder and Marketing Bot
    """)
    with gr.Row():
        input_text = gr.Textbox(label="Pretend you're a creative strategist, given the following clusters", value="come up with creative brand persona for each that includes the detail of their favorite movie and a detailed description of an image representing the movie that does not include people:")
    with gr.Row():
        generate = gr.Button("Generate Response")
    with gr.Row():
        label2 = gr.Textbox(label="Prompt")
    with gr.Row():
        label3 = gr.Markdown(label="Response generated by LLM")

    generate.click(persona_builder, input_text, [label2, label3])
    
demo.launch(share=False, debug=False)

Running on local URL:  http://127.0.0.1:7875

To create a public link, set `share=True` in `launch()`.


