# Using the Vertex AI PaLM API to explain BQML Clustering - Our Very Own Persona Builder

## Here's what we're going to build:

<p align="center">
  <img alt="Conceptual Flow" src="slides/process0.png" width="100%">
</p>

# Core code
Let's define some variables that will be used throughout this notebook.

These are the GCP project ID (`PROJECT_ID`), the region (`REGION`), the BigQuery dataset name (`DATASET`), and the model name (`BQML_MODEL`), which can be any name you prefer.
The dataset must already exist in the project named by `PROJECT_ID`, and you need permission to create and delete models in it.

## Setup

In [21]:
from typing import Union
import sys

import os
import io
import json
import base64
import requests
import concurrent.futures
import time

import numpy as np
import pandas as pd

import vertexai
from vertexai.language_models import TextGenerationModel, TextEmbeddingModel
from vertexai.preview.language_models import TextGenerationModel as TextGenerationModel_preview
from google.cloud import aiplatform
from google.cloud import documentai
from google.cloud.documentai_v1 import Document
from google.cloud import storage
from google.cloud import bigquery

from IPython.display import display, Markdown, Latex
import markdown

print("Vertex AI version: " + str(aiplatform.__version__))

Vertex AI version: 1.36.0


In [2]:
PROJECT_ID = 'mg-ce-demos'
REGION = 'us-central1'
DATASET = "bqml_demos" 
BQML_MODEL = "ecommerce_customer_segment_cluster5" 
EVAL = BQML_MODEL + "_eval"


In [3]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)
aiplatform.init(project = PROJECT_ID, location = REGION)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)

## Create a K-means model to cluster ecommerce data

### Let's look at the data first

<p align="center">
  <img alt="Conceptual Flow" src="slides/process1.png" width="100%">
</p>

In [4]:
query = """
SELECT
  user_id,
  order_id,
  sale_price,
  created_at as order_created_date
FROM `mg-ce-demos.thelook_ecommerce.order_items`
WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
"""
df = bq.query(query).to_dataframe()
df.head()


|   | user_id | order_id | sale_price | order_created_date |
|---|---------|----------|------------|--------------------|
| 0 | 40085 | 50021 | 2.5 | 2023-04-07 08:09:34+00:00 |
| 1 | 90363 | 112961 | 2.5 | 2022-12-25 23:23:29+00:00 |
| 2 | 3922 | 4887 | 2.5 | 2022-09-29 13:47:16+00:00 |
| 3 | 56076 | 70152 | 2.5 | 2022-03-27 03:01:23+00:00 |
| 4 | 90828 | 113540 | 2.5 | 2023-03-18 08:30:11+00:00 |


### `CREATE MODEL` using `KMEANS`

Build the `CREATE MODEL` statement, then submit it as a BigQuery job. The helper function defined below blocks on the job's result until the job completes.
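
Waiting for a job can also be written as an explicit polling loop. The following is a minimal sketch, assuming a job object that exposes a `done()` method, as BigQuery's `QueryJob` does. The names `wait_for_job` and `FakeJob` are hypothetical and exist only so the loop can run without a cloud project.

```python
import time


def wait_for_job(job, poll_seconds=1.0, timeout_seconds=600.0):
    """Poll job.done() until it returns True; return the number of polls made."""
    deadline = time.monotonic() + timeout_seconds
    polls = 0
    while True:
        polls += 1
        if job.done():
            return polls
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job not finished after {timeout_seconds} s")
        time.sleep(poll_seconds)


class FakeJob:
    """Hypothetical stand-in for a query job: reports done on the Nth poll."""

    def __init__(self, polls_until_done=3):
        self._remaining = polls_until_done

    def done(self):
        self._remaining -= 1
        return self._remaining <= 0


polls_needed = wait_for_job(FakeJob(), poll_seconds=0.01)
print(f"job finished after {polls_needed} polls")
```

In the notebook itself, calling `.result()` on the job does the waiting, so a loop like this is only useful if you want to report progress or enforce your own timeout.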

<p align="center">
  <img alt="Conceptual Flow" src="slides/process2.png" width="100%">
</p>

In [5]:
query = """
CREATE MODEL IF NOT EXISTS `{0}.{1}`
OPTIONS (
  MODEL_TYPE = "KMEANS",
  NUM_CLUSTERS = 5,
  KMEANS_INIT_METHOD = "KMEANS++",
  STANDARDIZE_FEATURES = TRUE )
AS (
SELECT * EXCEPT (user_id)
FROM (
  SELECT user_id,
    DATE_DIFF(CURRENT_DATE(), CAST(MAX(order_created_date) as DATE), day) AS days_since_order, -- RECENCY
    COUNT(order_id) AS count_orders, -- FREQUENCY
    AVG(sale_price) AS avg_spend -- MONETARY
  FROM (
    SELECT user_id,
      order_id,
      sale_price,
      created_at as order_created_date
    FROM `mg-ce-demos.thelook_ecommerce.order_items`
    WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
  )
  GROUP BY user_id, order_id
 )
)
""".format(DATASET, BQML_MODEL)


In [6]:
# Wrapper that runs a query/job with the BigQuery client and returns the result as a DataFrame
def run_bq_query(sql: str) -> pd.DataFrame:

    # Optional: uncomment to dry-run the query first and catch errors before executing
    #job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    #bq.query(sql, job_config=job_config)

    # Submit the query as a job
    job_config = bigquery.QueryJobConfig()
    client_result = bq.query(sql, job_config=job_config)

    job_id = client_result.job_id

    # Block until the job finishes, then convert the result to a DataFrame
    df = client_result.result().to_arrow().to_pandas()
    print(f"Finished job_id: {job_id}")
    return df

In [7]:
%%timeit

run_bq_query(query)

Finished job_id: 87613ceb-17b1-434d-97a8-f39eeeffac72
Finished job_id: d794d2d7-5781-4125-b891-09398e5559ed
Finished job_id: c01973c1-a10b-4913-8bfc-d499959570ea
Finished job_id: cda32035-8831-49c4-a9db-f651e8bfd9e6
Finished job_id: 4e859834-ac49-4ad3-9403-bd3926cea41d
Finished job_id: 93f1f915-84ee-49bc-a256-23e7d513f385
Finished job_id: 513a0fe5-1fdf-4ba8-8292-0e7d44f54f1b
Finished job_id: 42a41e60-9d55-4ddf-acd7-eb14dd19235c
The slowest run took 12.51 times longer than the fastest. This could mean that an intermediate result is being cached.
2.59 s ± 3.65 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


Let's take a look at the model's clustering performance using two metrics: the Davies-Bouldin index and the mean squared distance. For both, lower values indicate tighter, better-separated clusters.

In [8]:
query = """
SELECT *
FROM ML.EVALUATE(MODEL `{0}.{1}`)
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: 777b06c0-445c-46ba-8010-25c279e08c06


|   | davies_bouldin_index | mean_squared_distance |
|---|----------------------|-----------------------|
| 0 | 1.023378 | 1.051785 |
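
For intuition about these two numbers, the sketch below computes both metrics from their textbook definitions on a tiny made-up dataset. It assumes Euclidean distance and uses hypothetical helper names (`centroid`, `mean_squared_distance`, `davies_bouldin_index`). How BigQuery ML computes the values internally, for example on standardized features, is not shown in this notebook.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def mean_squared_distance(clusters):
    """Mean of squared distances from each point to its own cluster centroid."""
    total, count = 0.0, 0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += math.dist(p, c) ** 2
            count += 1
    return total / count


def davies_bouldin_index(clusters):
    """Average, over clusters, of the worst (scatter_i + scatter_j) / separation_ij."""
    centroids = [centroid(points) for points in clusters]
    scatter = [
        sum(math.dist(p, c) for p in points) / len(points)
        for points, c in zip(clusters, centroids)
    ]
    worst_ratios = []
    for i in range(len(clusters)):
        ratios = [
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(len(clusters))
            if j != i
        ]
        worst_ratios.append(max(ratios))
    return sum(worst_ratios) / len(worst_ratios)


# Two hypothetical, well-separated clusters in two dimensions.
toy_clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(4.0, 0.0), (4.0, 2.0)],
]
toy_dbi = davies_bouldin_index(toy_clusters)
toy_msd = mean_squared_distance(toy_clusters)
print(toy_dbi, toy_msd)
```

Moving the two toy clusters further apart lowers the Davies-Bouldin index, while spreading points within a cluster raises both metrics.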


### Now let's get the cluster (centroid) information

<p align="center">
  <img alt="Conceptual Flow" src="slides/process3.png" width="100%">
</p>

In [10]:
query = """
SELECT
  CONCAT('cluster ', CAST(centroid_id as STRING)) as centroid,
  avg_spend as average_spend,
  count_orders as count_of_orders,
  days_since_order
FROM (
  SELECT centroid_id, feature, ROUND(numerical_value, 2) as value
  FROM ML.CENTROIDS(MODEL `{0}.{1}`)
)
PIVOT (
  SUM(value)
  FOR feature IN ('avg_spend',  'count_orders', 'days_since_order')
)
ORDER BY centroid_id
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: a1e05a1f-5e1b-4c23-b2a6-a1360d71aa9f


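|   | centroid | average_spend | count_of_orders | days_since_order |
|---|----------|---------------|-----------------|------------------|
| 0 | cluster 1 | 49.44 | 1.23 | 102.87 |
| 1 | cluster 2 | 59.56 | 3.51 | 87.9 |
| 2 | cluster 3 | 251.34 | 1.14 | 205.85 |
| 3 | cluster 4 | 57.47 | 3.49 | 354.32 |
| 4 | cluster 5 | 48.87 | 1.22 | 376.5 |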
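
The `PIVOT` step in the query turns one row per (centroid, feature) pair into one row per centroid. The same reshaping can be sketched in plain Python. This is only an illustration: `pivot_centroids` is a hypothetical helper, and `long_rows` reuses the values of clusters 1 and 2 from the result above, arranged in the long form that the query selects from `ML.CENTROIDS`.

```python
# Long-form rows: (centroid_id, feature, value), values taken from the result above.
long_rows = [
    (1, "avg_spend", 49.44),
    (1, "count_orders", 1.23),
    (1, "days_since_order", 102.87),
    (2, "avg_spend", 59.56),
    (2, "count_orders", 3.51),
    (2, "days_since_order", 87.9),
]


def pivot_centroids(rows):
    """Collect features per centroid and return one dict per centroid, ordered by id."""
    wide = {}
    for centroid_id, feature, value in rows:
        wide.setdefault(centroid_id, {})[feature] = value
    return [{"centroid": f"cluster {cid}", **wide[cid]} for cid in sorted(wide)]


wide_rows = pivot_centroids(long_rows)
for row in wide_rows:
    print(row)
```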


Whew! That's a lot of metrics and cluster info. How about we explain this to our colleagues using the magic of LLMs?

In [11]:
df = bq.query(query).to_dataframe()

# Turn each centroid row into one line of plain text for the prompt
cluster_info = []
for _, row in df.iterrows():
    cluster_info.append(
        f"{row['centroid']}, average spend ${row['average_spend']}, "
        f"count of orders per person {row['count_of_orders']}, "
        f"days since last order {row['days_since_order']}"
    )

print("\n".join(cluster_info))

cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Explain with Vertex AI PaLM API

### First, we want to instantiate the large language model and create the prompt

<p align="center">
  <img alt="Conceptual Flow" src="slides/process4.png" width="100%">
</p>

In [12]:
textgen_model = TextGenerationModel.from_pretrained('text-bison')

In [25]:
preamble = """Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, 
and explain step by step; what would be the next marketing step for these clusters:

"""

display(Markdown('## Prompt:'))
print(preamble)
print(str.join("\n", cluster_info))

## Prompt:

Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, 
and explain step by step; what would be the next marketing step for these clusters:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


### Now, we send our prompt to the Vertex AI PaLM API for some LLM magic

<p align="center">
  <img alt="Conceptual Flow" src="slides/process5.png" width="100%">
</p>
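
The request in the next cell sets `temperature`, `top_p`, and `top_k`. As a rough intuition for what these sampling parameters do, here is a toy sketch on a made-up next-token distribution. It assumes the common definitions (keep the `top_k` most likely tokens, then the smallest prefix whose cumulative probability reaches `top_p`, then sharpen or flatten with temperature). The helper names and the distribution are hypothetical, and this is not a description of how the `text-bison` model decodes internally.

```python
def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, cut at cumulative top_p, renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def apply_temperature(probs, temperature):
    """Raise each probability to 1/temperature and renormalize."""
    scaled = {token: prob ** (1.0 / temperature) for token, prob in probs.items()}
    total = sum(scaled.values())
    return {token: value / total for token, value in scaled.items()}


# Hypothetical next-token distribution.
toy_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}

filtered = filter_top_k_top_p(toy_probs, top_k=3, top_p=0.7)
sharpened = apply_temperature(filtered, temperature=0.4)
print(filtered)
print(sharpened)
```

A temperature below 1 concentrates probability on the most likely tokens, which is why lower values tend to give more repeatable responses.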

In [27]:
# Send prompt to LLM
response = textgen_model.predict(
    preamble + "\n" + "\n".join(cluster_info),
    max_output_tokens=1024,
    temperature=0.4,
    top_p=0.8,
    top_k=40,
)

display(Markdown('## Response:'))
display(Markdown(response.text))

## Response:

 **Cluster 1**: 
* Brand persona: Thrifty shoppers who are looking for a good deal.
* Title labels: 
 - "The Penny Pincher"
 - "The Bargain Hunter"
 - "The Frugal Fashionista"
* Next marketing step: Offer discounts and promotions to entice these customers to make another purchase.


**Cluster 2**: 
* Brand persona: Loyal customers who are happy with your products and services.
* Title labels: 
 - "The VIP"
 - "The Brand Ambassador"
 - "The Repeat Customer"
* Next marketing step: Reward these customers for their loyalty with a loyalty program or exclusive discounts.


**Cluster 3**: 
* Brand persona: High-end shoppers who are looking for the best of the best.
* Title labels: 
 - "The Luxury Seeker"
 - "The Connoisseur"
 - "The Affluent Aficionado"
* Next marketing step: Offer premium products and services to appeal to these customers' tastes.


**Cluster 4**: 
* Brand persona: Occasional shoppers who are not very engaged with your brand.
* Title labels: 
 - "The Window Shopper"
 - "The Casual Browser"
 - "The One-Time Customer"
* Next marketing step: Retarget these customers with personalized ads and emails to remind them of your brand and encourage them to make another purchase.


**Cluster 5**: 
* Brand persona: Lapsed customers who have not made a purchase in a long time.
* Title labels: 
 - "The Lost Customer"
 - "The Inactive Account"
 - "The Forgotten Fan"
* Next marketing step: Reach out to these customers with a special offer or promotion to win them back.

Voilà! We've now used k-means clustering to group customers by spending behavior, and an LLM to explain each group's profile.

Sometimes, though, you want a little bit [extra](https://cloud.google.com/blog/transform/prompt-debunking-five-generative-ai-misconceptions).

In [28]:
preamble = """Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, 
a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:

"""

display(Markdown('## Prompt:'))
print(preamble)
print(str.join("\n", cluster_info))

response = textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.4,
    top_p=0.8,
    top_k=40,
)

display(Markdown('## Response:'))
display(Markdown(response.text))

## Prompt:

Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, 
a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Response:

 **Cluster 1**: 
Creative brand persona: Thrifty shoppers who enjoy a good deal. Their favorite movie is "The Princess Bride" because it's a classic that the whole family can enjoy. They relate to the character of Westley, who is always trying to save the day. 
E-mail headline: "Save money like a pro with our latest deals!"

**Cluster 2**: 
Creative brand persona: Busy professionals who value convenience. Their favorite movie is "The Devil Wears Prada" because it's a relatable story about a young woman who is trying to make it in the fashion industry. They relate to the character of Andy, who is always trying to balance her work and personal life. 
E-mail headline: "Get the latest styles without leaving your desk!"

**Cluster 3**: 
Creative brand persona: Luxury seekers who appreciate the finer things in life. Their favorite movie is "The Great Gatsby" because it's a story about a man who is trying to recapture his lost love. They relate to the character of Gatsby, who is always throwing lavish parties in an attempt to win back the woman he loves. 
E-mail headline: "Indulge in the lap of luxury with our latest collection!"

**Cluster 4**: 
Creative brand persona: Spontaneous shoppers who are always looking for the next big thing. Their favorite movie is "The Hangover" because it's a story about a group of friends who go on a wild adventure. They relate to the characters of the movie, who are always getting into crazy situations. 
E-mail headline: "Live life on the edge with our latest products!"

**Cluster 5**: 
Creative brand persona: Traditionalists who value quality and craftsmanship. Their favorite movie is "The Shawshank Redemption" because it's a story about a man who is wrongly convicted of a crime and spends years in prison. They relate to the character of Andy, who never gives up hope, even when things are tough. 
E-mail headline: " Timeless classics that will never go out of style!"

In [30]:
preamble = """Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite food 
and a detailed description of an image containing the favorite food that does not include people:

"""

display(Markdown('## Prompt:'))
print(preamble)
print(str.join("\n", cluster_info))

response = textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.4,
    top_p=0.8,
    top_k=40,
)

display(Markdown('## Response:'))
display(Markdown(response.text))

## Prompt:

Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite food 
and a detailed description of an image containing the favorite food that does not include people:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Response:

 **Cluster 1**: 

* Creative brand persona: The Savvy Spender 
* Favorite food: Avocado toast 
* Image: A close-up of an avocado toast on a plate. The toast is topped with sliced avocado, a drizzle of olive oil, and a sprinkle of salt and pepper. The background is a light green color.

**Cluster 2**: 

* Creative brand persona: The Busy Professional 
* Favorite food: Salad 
* Image: A large bowl of salad with a variety of vegetables, including lettuce, tomatoes, cucumbers, and carrots. The salad is topped with a grilled chicken breast and a drizzle of dressing. The background is a white tablecloth.

**Cluster 3**: 

* Creative brand persona: The Affluent Epicurean 
* Favorite food: Steak 
* Image: A juicy steak on a plate with a side of asparagus and mashed potatoes. The steak is cooked to perfection and the asparagus is grilled to a tender crisp. The background is a dark red color.

**Cluster 4**: 

* Creative brand persona: The Occasional Indulger 
* Favorite food: Pizza 
* Image: A large pizza with a variety of toppings, including pepperoni, sausage, mushrooms, and onions. The pizza is hot and cheesy and the crust is perfectly golden brown. The background is a blue color.

**Cluster 5**: 

* Creative brand persona: The Frugal Foodie 
* Favorite food: Ramen 
* Image: A bowl of ramen with noodles, broth, and a variety of toppings, including egg, pork, and vegetables. The ramen is steaming hot and the broth is flavorful. The background is a light brown color.

In [39]:
def persona_builder(preamble):
    """Build the full prompt from the user's instruction and return (prompt, response text)."""
    pre_text = """Pretend you're a creative strategist, given the following clusters """
    prompt = pre_text + preamble + "\n\n" + "\n".join(cluster_info)
    result = textgen_model.predict(
        prompt,
        max_output_tokens=1024,
        temperature=0.4,
        top_p=0.8,
        top_k=40,
    )

    return prompt, result.text
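
One possible refinement is to keep prompt assembly separate from the model call, so the prompt text can be checked without calling the API. The sketch below is an assumption about how that split might look, not part of the original notebook. `build_prompt` is a hypothetical helper, and the two cluster lines are copied from the output earlier in the notebook.

```python
DEFAULT_ROLE = "Pretend you're a creative strategist, given the following clusters "


def build_prompt(instruction, cluster_lines, role=DEFAULT_ROLE):
    """Join the role text, the user's instruction, and one line per cluster."""
    instruction = instruction.strip()
    if not instruction:
        raise ValueError("instruction must not be empty")
    return role + instruction + "\n\n" + "\n".join(cluster_lines)


example_clusters = [
    "cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87",
    "cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9",
]
example_prompt = build_prompt(
    "come up with a brand persona for each:", example_clusters
)
print(example_prompt)
```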

#### Notes

A beautiful mountain range with snow-capped peaks and a river running through it. The sun is shining and the sky is blue.  Photorealistic

come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group

## Now we can stick it behind a UI

<p align="center">
  <img alt="Conceptual Flow" src="slides/process6.png" width="100%">
</p>

In [44]:
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown(
    """
    ## Persona Builder and Marketing Bot
    """)
    with gr.Row():
        input_text = gr.Textbox(label="Pretend you're a creative strategist, given the following clusters: ", value="come up with a brand persona for each and generate an email headline and outreach strategy for a new marketing campaign:") 
    with gr.Row():
        generate = gr.Button("Generate Response")
    with gr.Row():
        label2 = gr.Textbox(label="Prompt")
    with gr.Row():
        label3 = gr.Markdown(label="Response generated by LLM")

    generate.click(persona_builder, input_text, [label2, label3])
    
demo.launch(share=False, debug=False)

Running on local URL:  http://127.0.0.1:7870

To create a public link, set `share=True` in `launch()`.


