# Using the Vertex AI PaLM API to explain BQML Clustering - Our Very Own Persona Builder

## Here's what we're going to build:

<p align="center">
  <img alt="Conceptual Flow" src="slides/process0.png" width="100%">
</p>

# Core code
Let's define some variables that will be used throughout this notebook.

These are the GCP project ID `PROJECT_ID`, the region `REGION`, the dataset name `DATASET`, and the model name `BQML_MODEL`, which can be any name you prefer.
The dataset needs to exist in the same project as `PROJECT_ID`, and you'll need permission to create and delete models in it.

## Setup

In [1]:
from typing import Union
import sys

import os
import io
import json
import base64
import requests
import concurrent.futures
import time

import numpy as np
import pandas as pd

import vertexai.preview.language_models
from google.cloud import aiplatform
from google.cloud import documentai
from google.cloud.documentai_v1 import Document
from google.cloud import storage
from google.cloud import bigquery

from IPython.display import display, Markdown, Latex


In [2]:
PROJECT_ID = 'mg-ce-demos'
REGION = 'us-central1'
DATASET = "bqml_demos" 
BQML_MODEL = "ecommerce_customer_segment_cluster5" 
EVAL = BQML_MODEL + "_eval"


In [3]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)
aiplatform.init(project = PROJECT_ID, location = REGION)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)

## Create a K-means model to cluster ecommerce data

### Let's look at the data first

<p align="center">
  <img alt="Conceptual Flow" src="slides/process1.png" width="100%">
</p>

In [4]:
query = """
SELECT
  user_id,
  order_id,
  sale_price,
  created_at as order_created_date
FROM `mg-ce-demos.thelook_ecommerce.order_items`
WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
"""
df = bq.query(query).to_dataframe()
df.head()


| | user_id | order_id | sale_price | order_created_date |
|---|---|---|---|---|
| 0 | 40085 | 50021 | 2.5 | 2023-04-07 08:09:34+00:00 |
| 1 | 90363 | 112961 | 2.5 | 2022-12-25 23:23:29+00:00 |
| 2 | 3922 | 4887 | 2.5 | 2022-09-29 13:47:16+00:00 |
| 3 | 56076 | 70152 | 2.5 | 2022-03-27 03:01:23+00:00 |
| 4 | 90828 | 113540 | 2.5 | 2023-03-18 08:30:11+00:00 |


### `CREATE MODEL` using `KMEANS`

Create a query, then start the model creation job. The `run_bq_query` helper defined below blocks on `result()` until the job completes.

<p align="center">
  <img alt="Conceptual Flow" src="slides/process2.png" width="100%">
</p>
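The `CREATE MODEL` query in the next cell derives three RFM-style features: recency (`days_since_order`), frequency (`count_orders`) and monetary value (`avg_spend`). As a rough illustration of what those aggregates compute, here is a plain-Python sketch over a few made-up order-item rows. The rows, the `rfm_per_user` helper and the fixed `today` date are all hypothetical. Note that this sketch groups by user only, whereas the query groups by `user_id, order_id`, so its rows are per order rather than per user.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order-item rows: (user_id, order_id, sale_price, order_date).
ROWS = [
    (1, 100, 20.0, date(2023, 1, 10)),
    (1, 100, 30.0, date(2023, 1, 10)),
    (1, 101, 10.0, date(2023, 3, 1)),
    (2, 200, 250.0, date(2022, 6, 15)),
]


def rfm_per_user(rows, today):
    """Return {user_id: (days_since_order, count_orders, avg_spend)}."""
    by_user = defaultdict(list)
    for user_id, order_id, sale_price, order_date in rows:
        by_user[user_id].append((order_id, sale_price, order_date))

    features = {}
    for user_id, items in by_user.items():
        last_order = max(order_date for _, _, order_date in items)
        days_since_order = (today - last_order).days  # recency
        # Frequency: like COUNT(order_id), this counts rows, not distinct orders.
        count_orders = len(items)
        avg_spend = sum(price for _, price, _ in items) / len(items)  # monetary
        features[user_id] = (days_since_order, count_orders, avg_spend)
    return features


features = rfm_per_user(ROWS, today=date(2023, 4, 1))
for user_id, values in sorted(features.items()):
    print(user_id, values)
```

Because the count is taken over order-item rows, a user with one order containing several items still gets a count above 1, which is worth keeping in mind when reading the centroid values later on.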

In [13]:
query = """
CREATE MODEL IF NOT EXISTS `{0}.{1}`
OPTIONS (
  MODEL_TYPE = "KMEANS",
  NUM_CLUSTERS = 5,
  KMEANS_INIT_METHOD = "KMEANS++",
  STANDARDIZE_FEATURES = TRUE )
AS (
SELECT * EXCEPT (user_id)
FROM (
  SELECT user_id,
    DATE_DIFF(CURRENT_DATE(), CAST(MAX(order_created_date) as DATE), day) AS days_since_order, -- RECENCY
    COUNT(order_id) AS count_orders, -- FREQUENCY
    AVG(sale_price) AS avg_spend -- MONETARY
  FROM (
    SELECT user_id,
      order_id,
      sale_price,
      created_at as order_created_date
    FROM `mg-ce-demos.thelook_ecommerce.order_items`
    WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
  )
  GROUP BY user_id, order_id
 )
)
""".format(DATASET, BQML_MODEL)


In [14]:
# Wrapper that runs a query/job with the BigQuery client and returns the result as a DataFrame
def run_bq_query(sql: str) -> pd.DataFrame:

    # Optional: uncomment to do a dry run first and catch errors before executing
    #job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    #bq.query(sql, job_config=job_config)

    # Run the query
    job_config = bigquery.QueryJobConfig()
    client_result = bq.query(sql, job_config=job_config)

    job_id = client_result.job_id

    # Wait for the query/job to finish, then return the result as a DataFrame
    df = client_result.result().to_arrow().to_pandas()
    print(f"Finished job_id: {job_id}")
    return df

In [15]:
%%timeit

run_bq_query(query)

Finished job_id: ab3960f3-9ca9-4097-be99-03cebce34fdf
Finished job_id: 6efd2b08-2bc1-4afa-9c5e-da8da9d73dad
Finished job_id: 06ee8bff-2815-4b87-8f5e-7a7db47ee2c1
Finished job_id: 7ee8cada-0181-4f44-8cb1-19c369b8b7be
Finished job_id: 98fec4cc-ac7b-4225-aeab-94ee03f9de6a
Finished job_id: fff17cbb-c186-4360-b879-fa38ca8f7242
Finished job_id: f1b72b83-6b12-40cd-b9a4-592209503c00
Finished job_id: abc6f9ce-b265-4299-8044-3541444e53ec
815 ms ± 48.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Let's take a look at the model's clustering performance using two metrics: the Davies-Bouldin index and the mean squared distance. For both, lower values indicate tighter clusters.

In [16]:
query = """
SELECT *
FROM ML.EVALUATE(MODEL `{0}.{1}`)
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: 0193d04f-0b17-4016-9aea-ef4944e32c05


| | davies_bouldin_index | mean_squared_distance |
|---|---|---|
| 0 | 1.023378 | 1.051785 |
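To make the two metrics more concrete, here is a minimal sketch of how they are commonly defined, computed in plain Python on a tiny made-up dataset. The function names and the sample points are hypothetical, and this is the textbook definition rather than BigQuery ML's exact implementation (which, for this model, works on standardized features).

```python
import math


def _centroid(points):
    # Component-wise mean of a list of equal-length tuples.
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def davies_bouldin(clusters):
    """Mean, over clusters, of the worst (scatter_i + scatter_j) / centroid distance."""
    centroids = [_centroid(points) for points in clusters]
    # Scatter: average distance from each point to its own centroid.
    scatter = [
        sum(math.dist(p, c) for p in points) / len(points)
        for points, c in zip(clusters, centroids)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
    return total / k


def mean_squared_distance(clusters):
    """Average squared distance from every point to its cluster centroid."""
    squared = [
        math.dist(p, _centroid(points)) ** 2
        for points in clusters
        for p in points
    ]
    return sum(squared) / len(squared)


# Two hypothetical, well-separated clusters in 2D.
CLUSTERS = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
db = davies_bouldin(CLUSTERS)
msd = mean_squared_distance(CLUSTERS)
print(db, msd)
```

Moving the two clusters closer together raises the Davies-Bouldin value, while spreading the points within a cluster raises both metrics.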


### Now let's get the cluster (centroid) information

<p align="center">
  <img alt="Conceptual Flow" src="slides/process3.png" width="100%">
</p>

In [17]:
query = """
SELECT
  CONCAT('cluster ', CAST(centroid_id as STRING)) as centroid,
  avg_spend as average_spend,
  count_orders as count_of_orders,
  days_since_order
FROM (
  SELECT centroid_id, feature, ROUND(numerical_value, 2) as value
  FROM ML.CENTROIDS(MODEL `{0}.{1}`)
)
PIVOT (
  SUM(value)
  FOR feature IN ('avg_spend',  'count_orders', 'days_since_order')
)
ORDER BY centroid_id
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: 77fab11a-df6e-4295-b595-0f25ae18a80e


| | centroid | average_spend | count_of_orders | days_since_order |
|---|---|---|---|---|
| 0 | cluster 1 | 49.44 | 1.23 | 102.87 |
| 1 | cluster 2 | 59.56 | 3.51 | 87.9 |
| 2 | cluster 3 | 251.34 | 1.14 | 205.85 |
| 3 | cluster 4 | 57.47 | 3.49 | 354.32 |
| 4 | cluster 5 | 48.87 | 1.22 | 376.5 |
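The `PIVOT` step in the query above turns one row per (centroid, feature) pair into one row per centroid with a column per feature. A minimal plain-Python sketch of the same reshaping follows, using the first two clusters' values from the table above. The `pivot_centroids` helper is hypothetical, and the long row shape `(centroid_id, feature, value)` mirrors the subquery's columns.

```python
# Long-format rows: (centroid_id, feature, value), as selected by the subquery.
LONG_ROWS = [
    (1, "avg_spend", 49.44),
    (1, "count_orders", 1.23),
    (1, "days_since_order", 102.87),
    (2, "avg_spend", 59.56),
    (2, "count_orders", 3.51),
    (2, "days_since_order", 87.9),
]


def pivot_centroids(rows):
    """Return {centroid_id: {feature: summed value}}."""
    wide = {}
    for centroid_id, feature, value in rows:
        features = wide.setdefault(centroid_id, {})
        # Mirrors SUM(value): each pair occurs once here, so the sum is the value itself.
        features[feature] = features.get(feature, 0) + value
    return wide


wide = pivot_centroids(LONG_ROWS)
for centroid_id, features in sorted(wide.items()):
    print(f"cluster {centroid_id}", features)
```

Since every (centroid, feature) pair appears exactly once, the aggregate inside `PIVOT` only exists to satisfy the syntax and does not change any value.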


Whew! That's a lot of metrics and cluster info. How about we explain this to our colleagues using the magic of LLMs?

In [18]:
df = bq.query(query).to_dataframe()

# Build one plain-text line per cluster to feed into the prompt
cluster_info = []
for i, row in df.iterrows():
  cluster_info.append("{0}, average spend ${1}, count of orders per person {2}, days since last order {3}"
    .format(row["centroid"], row["average_spend"], row["count_of_orders"], row["days_since_order"]) )

print(str.join("\n", cluster_info))

cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Explain with Vertex AI PaLM API

### First, we want to instantiate the large language model and create the prompt

<p align="center">
  <img alt="Conceptual Flow" src="slides/process4.png" width="100%">
</p>

In [19]:
textgen_model = vertexai.preview.language_models.TextGenerationModel.from_pretrained('text-bison@001')

In [20]:
preamble = "Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, and explain step by step; what would be the next marketing step for these clusters:"
display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))

## Prompt:

Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, and explain step by step; what would be the next marketing step for these clusters:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


### Now, we send our prompt to the Vertex AI PaLM API for some LLM magic

<p align="center">
  <img alt="Conceptual Flow" src="slides/process5.png" width="100%">
</p>
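The `predict` call in the next cell passes `temperature`, `top_p` and `top_k`. As a rough illustration of what the last two control, the sketch below filters a made-up next-token distribution: keep the `top_k` most probable tokens, then keep the smallest prefix of those whose cumulative probability reaches `top_p`, and renormalize. This is a simplified picture of the general technique, not the service's actual implementation. The `filter_candidates` helper and the token probabilities are hypothetical.

```python
def filter_candidates(probs, top_k, top_p):
    """Return the renormalized distribution left after top-k, then top-p filtering."""
    # Rank tokens from most to least probable and keep at most top_k of them.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]

    # Keep tokens until their cumulative probability reaches top_p.
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token probabilities.
PROBS = {"loyal": 0.5, "frequent": 0.3, "casual": 0.15, "rare": 0.05}

kept = filter_candidates(PROBS, top_k=3, top_p=0.75)
print(kept)
```

Smaller `top_k` and `top_p` values narrow the pool of candidate tokens, which tends to make responses more predictable. Lower `temperature` values push in the same direction by sharpening the distribution before sampling.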

In [21]:
display(Markdown('## Response:'))
# Send prompt to LLM
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Response:

**Cluster 1: The Frequent Shopper**

This cluster is made up of customers who spend an average of $49.44 per order and place an order every 102.87 days. They are likely to be loyal customers who are looking for a convenient and affordable way to shop.

**Brand persona:** The Frequent Shopper is a busy person who is always on the go. They are looking for a shopping experience that is easy and efficient, and they are willing to pay a little more for convenience.

**Title label:** The Frequent Shopper

**Next marketing step:** The next marketing step for this cluster would be to focus on convenience and value. This could be done by offering free shipping on orders over a certain amount, or by providing a loyalty program that rewards customers for their purchases.

**Cluster 2: The Avid Shopper**

This cluster is made up of customers who spend an average of $59.56 per order and place an order every 87.9 days. They are likely to be passionate about shopping and are always looking for new products to try.

**Brand persona:** The Avid Shopper is a fashion-forward person who is always on the lookout for the latest trends. They are willing to spend a little more money to get the best quality products, and they are not afraid to experiment with new styles.

**Title label:** The Avid Shopper

**Next marketing step:** The next marketing step for this cluster would be to focus on newness and excitement. This could be done by featuring new products on the homepage, or by sending out email newsletters with the latest trends.

**Cluster 3: The Luxury Shopper**

This cluster is made up of customers who spend an average of $251.34 per order and place an order every 205.85 days. They are likely to be affluent and have a taste for luxury.

**Brand persona:** The Luxury Shopper is a successful person who appreciates the finer things in life. They are willing to pay a premium for quality products and services, and they are not afraid to splurge on themselves.

**Title label:** The Luxury Shopper

**Next marketing step:** The next marketing step for this cluster would be to focus on exclusivity and personalization. This could be done by offering limited-edition products, or by providing a concierge service that caters to the individual needs of each customer.

**Cluster 4: The Value Shopper**

This cluster is made up of customers who spend an average of $57.47 per order and place an order every 354.32 days. They are likely to be budget-conscious and are looking for the best possible value for their money.

**Brand persona:** The Value Shopper is a savvy shopper who is always looking for a good deal. They are not afraid to comparison shop, and they are willing to sacrifice quality for a lower price.

**Title label:** The Value Shopper

**Next marketing step:** The next marketing step for this cluster would be to focus on affordability and savings. This could be done by offering discounts and coupons, or by highlighting the value of the products being sold.

**Cluster 5: The Occasional Shopper**

This cluster is made up of customers who spend an average of $48.87 per order and place an order every 376.56 days. They are likely to be infrequent shoppers who only buy when they need something specific.

**Brand persona:** The Occasional Shopper is a busy person who does not have a lot of time to shop. They are looking for a convenient and easy way to buy the things they need, and they are not afraid to pay a little more for convenience.

**Title label:** The Occasional Shopper

**Next marketing step:** The next marketing step for this cluster would be to focus on convenience and ease of use. This could be done by offering a wide variety of products, or by providing a quick and easy checkout process.

Voilà! We've now used k-means clustering to create groups of spenders, and an LLM to explain their profiles.

Sometimes, though, you want a little bit [extra](https://cloud.google.com/blog/transform/prompt-debunking-five-generative-ai-misconceptions).

In [22]:
preamble = "Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:"

display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))
display(Markdown('## Response:'))
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Prompt:

Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Response:

**Cluster 1: The Casual Shopper**

* Favorite movie: The Princess Bride
* Summary: This cluster is made up of casual shoppers who enjoy a variety of movies. They typically make one or two small purchases per month and have a long average purchase cycle.
* Witty e-mail headline: "The Princess Bride fans, we have something special for you!"

**Cluster 2: The Frequent Shopper**

* Favorite movie: The Shawshank Redemption
* Summary: This cluster is made up of frequent shoppers who love classic movies. They typically make multiple purchases per month and have a shorter average purchase cycle.
* Witty e-mail headline: "The Shawshank Redemption fans, we know you love a good deal! Check out our latest sale."

**Cluster 3: The Luxury Shopper**

* Favorite movie: The Godfather
* Summary: This cluster is made up of luxury shoppers who enjoy high-quality movies. They typically make one or two large purchases per year and have a long average purchase cycle.
* Witty e-mail headline: "The Godfather fans, treat yourself to the ultimate movie experience with our selection of luxury DVDs."

**Cluster 4: The Family Shopper**

* Favorite movie: Toy Story
* Summary: This cluster is made up of family shoppers who enjoy movies for all ages. They typically make multiple purchases per month and have a short average purchase cycle.
* Witty e-mail headline: "Toy Story fans, get your kids excited for the new movie with our selection of toys and games."

**Cluster 5: The Patient Shopper**

* Favorite movie: The Lord of the Rings
* Summary: This cluster is made up of patient shoppers who enjoy epic movies. They typically make one or two large purchases per year and have a very long average purchase cycle.
* Witty e-mail headline: "The Lord of the Rings fans, we know you're in it for the long haul. Check out our special edition box set."

In [24]:
preamble = "Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a description of an image that describes scenery related to their favorite movie:"

display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))
display(Markdown('## Response:'))
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Prompt:

Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a description of an image that describes scenery related to their favorite movie:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Response:

**Cluster 1**

* Favorite movie: The Shawshank Redemption
* Summary: This cluster is made up of people who are loyal customers who spend an average of $49.44 per order. They typically place an order every 102 days. The Shawshank Redemption is a popular choice for this group because it is a feel-good movie that is perfect for a rainy day. The image of a field of green grass with a white picket fence and a red barn in the background would be a good representation of the scenery in this movie.

**Cluster 2**

* Favorite movie: The Godfather
* Summary: This cluster is made up of people who are high-value customers who spend an average of $59.56 per order. They typically place an order every 87 days. The Godfather is a popular choice for this group because it is a classic movie that is known for its violence and drama. The image of a dark and stormy night with a full moon would be a good representation of the scenery in this movie.

**Cluster 3**

* Favorite movie: The Dark Knight
* Summary: This cluster is made up of people who are impulsive shoppers who spend an average of $251.34 per order. They typically place an order every 205 days. The Dark Knight is a popular choice for this group because it is a thrilling movie that is full of action and suspense. The image of a city skyline at night with a full moon would be a good representation of the scenery in this movie.

**Cluster 4**

* Favorite movie: The Lord of the Rings: The Return of the King
* Summary: This cluster is made up of people who are loyal customers who spend an average of $57.47 per order. They typically place an order every 354 days. The Lord of the Rings: The Return of the King is a popular choice for this group because it is a long and epic movie that is perfect for a rainy day. The image of a mountain range with a waterfall in the foreground would be a good representation of the scenery in this movie.

**Cluster 5**

* Favorite movie: The Princess Bride
* Summary: This cluster is made up of people who are nostalgic shoppers who spend an average of $48.87 per order. They typically place an order every 376 days. The Princess Bride is a popular choice for this group because it is a feel-good movie that is perfect for a rainy day. The image of a field of flowers with a castle in the background would be a good representation of the scenery in this movie.

In [25]:
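def persona_builder(preamble):
    # Prepend the fixed strategist framing, then add the task text and the cluster summaries
    pre_text = "Pretend you're a creative strategist, given the following clusters "
    prompt = pre_text + preamble + "\n" + str.join("\n", cluster_info)
    result = textgen_model.predict(
        prompt,
        max_output_tokens=1024,
        temperature=0.35,
        top_p=0.8,
        top_k=40,
    )

    # Return plain strings so the UI text boxes can display them
    return prompt, result.text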
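The prompt assembly in `persona_builder` can be exercised without calling the API by swapping the model for a stub. The sketch below is one way to do that, assuming the real response object exposes its generated text as `.text`. The names `build_persona` and `fake_predict`, and the shortened `CLUSTER_INFO` list, are hypothetical stand-ins for illustration.

```python
from types import SimpleNamespace

# Two of the cluster summary lines produced earlier in the notebook.
CLUSTER_INFO = [
    "cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87",
    "cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9",
]


def build_persona(task, predict, cluster_info=CLUSTER_INFO):
    """Assemble the prompt, call `predict`, and return (prompt, response text)."""
    pre_text = "Pretend you're a creative strategist, given the following clusters "
    prompt = pre_text + task + "\n" + "\n".join(cluster_info)
    response = predict(prompt)
    return prompt, response.text


def fake_predict(prompt, **kwargs):
    # Hypothetical stand-in for the model call: reports how many lines it received.
    line_count = len(prompt.splitlines())
    return SimpleNamespace(text=f"stub response for {line_count} prompt lines")


prompt, text = build_persona("suggest a persona name for each:", fake_predict)
print(prompt)
print(text)
```

Passing the predict function in as an argument keeps the prompt logic separate from the model client, so the same function could be pointed at the real model by supplying a wrapper around `textgen_model.predict`.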

In [26]:
# Example call: persona_builder already prepends the strategist framing, so pass only the task
#persona_builder("analyse them and come up with a creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for a marketing campaign targeted to their group:")

## Now we can stick it behind a UI

<p align="center">
  <img alt="Conceptual Flow" src="slides/process6.png" width="100%">
</p>

In [27]:
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown(
    """
    ## Persona Builder and Marketing Bot

    ### Enter a task...

    """)
    with gr.Row():
        with gr.Column():
            input_text = gr.Textbox(label="Task", placeholder="Pretend you're a creative strategist, given the following clusters: ")

    with gr.Row():
        generate = gr.Button("Generate Response")

    with gr.Row():
        label2 = gr.Textbox(label="Prompt")
    with gr.Row():
        label3 = gr.Textbox(label="Response generated by LLM")

    generate.click(persona_builder, input_text, [label2, label3])
demo.launch(share=False, debug=False)

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


