# Using the Vertex AI PaLM API to explain BQML Clustering - Our Very Own Persona Builder

## Here's what we're going to build:

<p align="center">
  <img alt="Conceptual Flow" src="slides/process0.png" width="100%">
</p>

# Core code
Let's define some variables that will be used throughout this notebook.

These are the GCP project ID `PROJECT_ID`, the region `REGION`, the dataset name `DATASET`, and the model name `BQML_MODEL`, which can be any name you prefer.
The dataset needs to exist in the project identified by `PROJECT_ID`, and you'll need permission to create and delete resources in it.

## Setup

In [20]:
from typing import Union
import sys

import os
import io
import json
import base64
import requests
import concurrent.futures
import time

import numpy as np
import pandas as pd

import vertexai.preview.language_models
from google.cloud import aiplatform
from google.cloud import documentai
from google.cloud.documentai_v1 import Document
from google.cloud import storage
from google.cloud import bigquery

from IPython.display import display, Markdown, Latex


In [2]:
PROJECT_ID = 'mg-ce-demos'
REGION = 'us-central1'
DATASET = "bqml_demos" 
BQML_MODEL = "ecommerce_customer_segment_cluster5" 
EVAL = BQML_MODEL + "_eval"


In [3]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)
aiplatform.init(project = PROJECT_ID, location = REGION)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)

## Create a K-means model to cluster ecommerce data

### Let's look at the data first

<p align="center">
  <img alt="Conceptual Flow" src="slides/process1.png" width="100%">
</p>

In [4]:
query = """
SELECT
  user_id,
  order_id,
  sale_price,
  created_at as order_created_date
FROM `mg-ce-demos.thelook_ecommerce.order_items`
WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
"""
df = bq.query(query).to_dataframe()
df.head()


| | user_id | order_id | sale_price | order_created_date |
|---|---|---|---|---|
| 0 | 40085 | 50021 | 2.5 | 2023-04-07 08:09:34+00:00 |
| 1 | 90363 | 112961 | 2.5 | 2022-12-25 23:23:29+00:00 |
| 2 | 3922 | 4887 | 2.5 | 2022-09-29 13:47:16+00:00 |
| 3 | 56076 | 70152 | 2.5 | 2022-03-27 03:01:23+00:00 |
| 4 | 90828 | 113540 | 2.5 | 2023-03-18 08:30:11+00:00 |


### `CREATE MODEL` using `KMEANS`

Build the `CREATE MODEL` query, then start the model creation job. The `run_bq_query` helper defined below blocks on `result()` until the job completes.

<p align="center">
  <img alt="Conceptual Flow" src="slides/process2.png" width="100%">
</p>

In [5]:
query = """
CREATE MODEL IF NOT EXISTS `{0}.{1}`
OPTIONS (
  MODEL_TYPE = "KMEANS",
  NUM_CLUSTERS = 5,
  KMEANS_INIT_METHOD = "KMEANS++",
  STANDARDIZE_FEATURES = TRUE )
AS (
SELECT * EXCEPT (user_id)
FROM (
  SELECT user_id,
    DATE_DIFF(CURRENT_DATE(), CAST(MAX(order_created_date) as DATE), day) AS days_since_order, -- RECENCY
    COUNT(order_id) AS count_orders, -- FREQUENCY
    AVG(sale_price) AS avg_spend -- MONETARY
  FROM (
    SELECT user_id,
      order_id,
      sale_price,
      created_at as order_created_date
    FROM `mg-ce-demos.thelook_ecommerce.order_items`
    WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
  )
  GROUP BY user_id, order_id -- one row per order, so count_orders is the number of items in that order
 )
)
""".format(DATASET, BQML_MODEL)


In [6]:
# Wrapper that runs a query/job with the BigQuery client and returns the result as a DataFrame
def run_bq_query(sql: str) -> pd.DataFrame:

    # Optional: uncomment to dry-run the query first and catch errors before executing it
    #job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    #bq.query(sql, job_config=job_config)

    # Run the query
    job_config = bigquery.QueryJobConfig()
    client_result = bq.query(sql, job_config=job_config)

    job_id = client_result.job_id

    # Block until the query/job finishes, then get and return the data frame
    df = client_result.result().to_arrow().to_pandas()
    print(f"Finished job_id: {job_id}")
    return df
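
The helper above relies on the blocking `result()` call. If you'd rather wait with an explicit Python loop, here is a minimal sketch. It assumes only that the job object exposes a `done()` method, as BigQuery query jobs do; `wait_for_job` and `FakeJob` are hypothetical names used for illustration so the example runs without a GCP project.

```python
import time


def wait_for_job(job, poll_seconds=1.0, timeout_seconds=600.0):
    """Poll job.done() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while not job.done():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job did not finish within {timeout_seconds} s")
        time.sleep(poll_seconds)
    return job


class FakeJob:
    """Stand-in for a query job: reports done after a fixed number of polls."""

    def __init__(self, polls_needed):
        self.polls_needed = polls_needed
        self.polls = 0

    def done(self):
        self.polls += 1
        return self.polls >= self.polls_needed


job = wait_for_job(FakeJob(polls_needed=3), poll_seconds=0.01)
print(f"job finished after {job.polls} polls")
```

With a real client you would pass the object returned by `bq.query(sql)` instead of `FakeJob`.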

In [7]:
%%timeit

run_bq_query(query)

Finished job_id: ae17b3bd-313c-4263-abb7-caa10e0dad68
Finished job_id: eb51865d-3960-4109-89b1-512f056917ee
Finished job_id: 7dd2735f-952d-44c7-81ef-e5d51a2ea21d
Finished job_id: cf4e3aec-91c4-4af9-a7e3-fe7d57a89b17
Finished job_id: 16629d23-892c-4193-be66-28e7ea9ec8f7
Finished job_id: 8431ff9f-0eb9-4284-9464-9f97aa89bf49
Finished job_id: 10f6b640-0dbf-4541-8021-b67a82318901
Finished job_id: 2520b917-b6f7-4b88-850b-298a6d116f52
The slowest run took 9.47 times longer than the fastest. This could mean that an intermediate result is being cached.
1.69 s ± 2.03 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
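
The model above is created with `STANDARDIZE_FEATURES = TRUE`. The three features live on very different scales (days, order counts, dollars), so without rescaling the largest-valued feature would dominate the distance calculation. Here is a minimal sketch of z-score standardization on made-up values. The numbers and the `standardize` helper are illustrative assumptions, and BigQuery ML's exact formula may differ.

```python
from statistics import mean, pstdev


def standardize(column):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Toy feature columns (hypothetical values, not taken from the dataset).
features = {
    "days_since_order": [90.0, 100.0, 350.0, 380.0],
    "count_orders": [1.0, 3.0, 1.0, 3.0],
    "avg_spend": [50.0, 60.0, 250.0, 55.0],
}

scaled = {name: standardize(values) for name, values in features.items()}
for name, values in scaled.items():
    print(name, [round(v, 2) for v in values])
```

After scaling, every column has mean 0 and standard deviation 1, so each feature contributes comparably to the Euclidean distance that k-means minimizes.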


Let's take a look at the model's clustering performance using two metrics: the Davies-Bouldin index and the mean squared distance. Lower values are better for both.

In [8]:
query = """
SELECT *
FROM ML.EVALUATE(MODEL `{0}.{1}`)
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: 4caa110f-7915-4686-a349-66772d3f74de


| | davies_bouldin_index | mean_squared_distance |
|---|---|---|
| 0 | 1.023378 | 1.051785 |
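
To get a feel for what the Davies-Bouldin index measures, here is a small sketch of the textbook definition on toy 2-D points: for each cluster, take the worst-case ratio of combined within-cluster scatter to the distance between centroids, then average over clusters. The data and helper names are assumptions for illustration; BigQuery ML computes its value on the model's own feature space, so its numbers will not match this toy example.

```python
import math


def centroid(points):
    """Mean point of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of points. Lower is better."""
    centroids = [centroid(pts) for pts in clusters]
    # Scatter: average distance from each point to its own centroid.
    scatter = [
        sum(math.dist(p, c) for p in pts) / len(pts)
        for pts, c in zip(clusters, centroids)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Same cluster shapes, different separation between the two clusters.
far_apart = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
close_together = [[(0.0, 0.0), (0.0, 1.0)], [(1.0, 0.0), (1.0, 1.0)]]

print(davies_bouldin(far_apart))
print(davies_bouldin(close_together))
```

The well-separated pair scores lower than the pair that sits close together, which is why a smaller index indicates better-defined clusters.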


### Now let's get the cluster (centroid) information

<p align="center">
  <img alt="Conceptual Flow" src="slides/process3.png" width="100%">
</p>

In [9]:
query = """
SELECT
  CONCAT('cluster ', CAST(centroid_id as STRING)) as centroid,
  avg_spend as average_spend,
  count_orders as count_of_orders,
  days_since_order
FROM (
  SELECT centroid_id, feature, ROUND(numerical_value, 2) as value
  FROM ML.CENTROIDS(MODEL `{0}.{1}`)
)
PIVOT (
  SUM(value)
  FOR feature IN ('avg_spend',  'count_orders', 'days_since_order')
)
ORDER BY centroid_id
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: f6e92809-3b1e-4ccf-8566-0c5e66af58d5


| | centroid | average_spend | count_of_orders | days_since_order |
|---|---|---|---|---|
| 0 | cluster 1 | 49.44 | 1.23 | 102.87 |
| 1 | cluster 2 | 59.56 | 3.51 | 87.9 |
| 2 | cluster 3 | 251.34 | 1.14 | 205.85 |
| 3 | cluster 4 | 57.47 | 3.49 | 354.32 |
| 4 | cluster 5 | 48.87 | 1.22 | 376.5 |
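
`ML.CENTROIDS` returns one row per centroid and feature, and the `PIVOT` clause in the query turns that long format into one row per cluster. The same reshaping can be sketched in plain Python. The `pivot_centroids` helper is a hypothetical name, and the rows below reuse the values for clusters 1 and 2 from the table above.

```python
from collections import defaultdict


def pivot_centroids(rows):
    """Turn (centroid_id, feature, value) rows into {centroid_id: {feature: value}}."""
    wide = defaultdict(dict)
    for centroid_id, feature, value in rows:
        wide[centroid_id][feature] = value
    # Return a plain dict ordered by centroid id, like ORDER BY centroid_id.
    return {cid: wide[cid] for cid in sorted(wide)}


long_rows = [
    (2, "avg_spend", 59.56),
    (1, "avg_spend", 49.44),
    (1, "count_orders", 1.23),
    (2, "count_orders", 3.51),
    (1, "days_since_order", 102.87),
    (2, "days_since_order", 87.9),
]

for cid, features in pivot_centroids(long_rows).items():
    print(f"cluster {cid}: {features}")
```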


Whew! That's a lot of metrics and cluster info. How about we explain this to our colleagues using the magic of LLMs?

In [10]:
df = bq.query(query).to_dataframe()

# Turn each centroid row into one plain-text line for the prompt
cluster_info = []
for i, row in df.iterrows():
  cluster_info.append(
    f"{row['centroid']}, average spend ${row['average_spend']}, "
    f"count of orders per person {row['count_of_orders']}, "
    f"days since last order {row['days_since_order']}")

print("\n".join(cluster_info))

cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Explain with Vertex AI PaLM API

### First, we want to instantiate the large language model and create the prompt

<p align="center">
  <img alt="Conceptual Flow" src="slides/process4.png" width="100%">
</p>

In [11]:
textgen_model = vertexai.preview.language_models.TextGenerationModel.from_pretrained('text-bison@001')

In [12]:
preamble = "Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, and explain step by step; what would be the next marketing step for these clusters:"
display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))

## Prompt:

Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, and explain step by step; what would be the next marketing step for these clusters:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


### Now, we send our prompt to the Vertex AI PaLM API for some LLM magic

<p align="center">
  <img alt="Conceptual Flow" src="slides/process5.png" width="100%">
</p>

In [13]:
display(Markdown('## Response:'))
# Send prompt to LLM
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Response:

* **Cluster 1: The Occasional Purchaser**

This cluster is made up of customers who make small, infrequent purchases. They are typically looking for something specific and are not interested in browsing or exploring. The average spend in this cluster is $49.44, the count of orders per person is 1.23, and the days since last order is 102.87.

The next marketing step for this cluster would be to focus on creating awareness of the brand and driving traffic to the website. This could be done through paid advertising, social media marketing, or email marketing. Once customers are aware of the brand, they can be targeted with offers and promotions that encourage them to make a purchase.

* **Cluster 2: The Frequent Shopper**

This cluster is made up of customers who make frequent, small purchases. They are typically looking for everyday essentials and are not interested in spending a lot of money. The average spend in this cluster is $59.56, the count of orders per person is 3.51, and the days since last order is 87.9.

The next marketing step for this cluster would be to focus on providing convenience and value. This could be done through offering free shipping, fast delivery, and a wide variety of products. The brand could also consider developing a loyalty program to reward customers for their repeat business.

* **Cluster 3: The Luxury Buyer**

This cluster is made up of customers who make large, infrequent purchases. They are typically looking for high-quality products and are willing to pay a premium for them. The average spend in this cluster is $251.34, the count of orders per person is 1.14, and the days since last order is 205.85.

The next marketing step for this cluster would be to focus on creating a sense of exclusivity and luxury. This could be done through using high-quality imagery and copy, and by offering exclusive products and services. The brand could also consider partnering with influencers or celebrities to help generate awareness and excitement.

* **Cluster 4: The Loyal Customer**

This cluster is made up of customers who make regular, medium-sized purchases. They are typically loyal to the brand and are not interested in shopping around. The average spend in this cluster is $57.47, the count of orders per person is 3.49, and the days since last order is 354.32.

The next marketing step for this cluster would be to focus on maintaining and growing their loyalty. This could be done through offering personalized service, rewards programs, and exclusive promotions. The brand could also consider developing new products and services that appeal to this group of customers.

* **Cluster 5: The Forgotten Customer**

This cluster is made up of customers who have not made a purchase in a long time. They may have forgotten about the brand or they may have found a new one. The average spend in this cluster is $48.87, the count of orders per person is 1.22, and the days since last order is 376.5.

The next marketing step for this cluster would be to re-engage them with the brand. This could be done through email marketing, social media marketing, or direct mail. The brand could also consider offering a special promotion or discount to encourage them to make a purchase.

Voila! We've now used k-means clustering to create groups of spenders and explain their profiles.
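
If you want to reuse the persona titles downstream, for example as segment labels, a small parser can pull them out of the Markdown response. This is a sketch that assumes the `**Cluster N: Title**` pattern seen in the response above; LLM output is not guaranteed to follow it, and `extract_personas` is a hypothetical helper.

```python
import re

# Matches bold headings such as "**Cluster 1: The Occasional Purchaser**".
TITLE_PATTERN = re.compile(r"\*\*Cluster (\d+): (.+?)\*\*")


def extract_personas(response_text):
    """Return {cluster_number: persona_title} for every heading found."""
    return {
        int(number): title.strip()
        for number, title in TITLE_PATTERN.findall(response_text)
    }


sample_response = """* **Cluster 1: The Occasional Purchaser**

This cluster is made up of customers who make small, infrequent purchases.

* **Cluster 2: The Frequent Shopper**

This cluster is made up of customers who make frequent, small purchases.
"""

print(extract_personas(sample_response))
```

An empty dictionary signals that the response did not follow the expected layout, which is a cue to inspect the raw text rather than trust the parse.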

Sometimes, though, you want a little bit [extra](https://cloud.google.com/blog/transform/prompt-debunking-five-generative-ai-misconceptions).

In [14]:
preamble = "Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:"

display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))
display(Markdown('## Response:'))
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Prompt:

Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Response:

**Cluster 1: The Bargain Hunter**

This customer is always looking for a good deal. They love the movie "The Princess Bride" because it's a classic that never gets old. They are drawn to the movie's humor, adventure, and romance. This customer is likely to be interested in our products because they are affordable and offer good value.

**E-mail Headline:** Save big on your next purchase!

**Cluster 2: The Frequent Shopper**

This customer loves to shop and they are always looking for new products to try. They love the movie "Mean Girls" because it's a funny and relatable story about high school. This customer is likely to be interested in our products because they are new and exciting.

**E-mail Headline:** New arrivals you won't want to miss!

**Cluster 3: The Luxury Seeker**

This customer is looking for the best of the best. They love the movie "The Great Gatsby" because it's a visually stunning and luxurious film. This customer is likely to be interested in our products because they are high-quality and stylish.

**E-mail Headline:** Treat yourself to something special!

**Cluster 4: The Family Oriented**

This customer loves to spend time with their family. They love the movie "Home Alone" because it's a heartwarming and funny story about family. This customer is likely to be interested in our products because they are fun and family-friendly.

**E-mail Headline:** The perfect gift for the whole family!

**Cluster 5: The Patient Saver**

This customer is willing to wait for a good deal. They love the movie "The Shawshank Redemption" because it's a story about hope and perseverance. This customer is likely to be interested in our products because they are well-made and durable.

**E-mail Headline:** The best things come to those who wait!
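
Every `predict` call in this notebook passes the same `temperature`, `top_p`, and `top_k` values. The sketch below illustrates what such parameters generally do to a next-token distribution, using a generic textbook scheme on made-up logits. It is not the PaLM API's implementation; the order of operations, the token names, and the `filter_distribution` helper are all assumptions for illustration.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=None, top_p=1.0):
    """Return renormalized token probabilities after temperature, top-k and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature: divide logits before the softmax; lower values sharpen the peak.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    biggest = max(scaled.values())
    exps = {tok: math.exp(v - biggest) for tok, v in scaled.items()}
    norm = sum(exps.values())
    ranked = sorted(
        ((tok, e / norm) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


toy_logits = {"loyal": 2.0, "frequent": 1.0, "casual": 0.5, "rare": -1.0}

neutral = filter_distribution(toy_logits, temperature=1.0, top_k=40, top_p=0.8)
notebook_like = filter_distribution(toy_logits, temperature=0.35, top_k=40, top_p=0.8)
print("temperature=1.0 :", list(neutral))
print("temperature=0.35:", list(notebook_like))
```

On these toy logits, the lower temperature concentrates so much probability on the top token that the top-p cut-off keeps fewer candidates, which is the general reason lower temperatures give more repeatable output.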

In [19]:
preamble = "Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite food and a detailed description of an image containing their favorite food:"

display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))
display(Markdown('## Response:'))
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Prompt:

Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite food and a detailed description of an image containing their favorite food:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Response:

**Cluster 1: The Casual Diner**

This cluster is made up of people who enjoy casual dining experiences. They typically order one or two dishes per person and spend around $50 per order. They dine out about once every two weeks.

The Casual Diner's favorite food is pizza. They love the combination of crispy crust, gooey cheese, and flavorful toppings. They would love an image of a pizza with a perfectly golden crust, melted cheese bubbling around the edges, and fresh toppings like pepperoni, mushrooms, and green peppers.

**Cluster 2: The Family Meal**

This cluster is made up of families with young children. They typically order multiple dishes per person and spend around $60 per order. They dine out about once a week.

The Family Meal's favorite food is chicken nuggets. They love the way the nuggets are crispy on the outside and tender on the inside. They would love an image of a plate of chicken nuggets with a side of fries and ketchup.

**Cluster 3: The Luxury Seeker**

This cluster is made up of people who enjoy fine dining experiences. They typically order multiple dishes per person and spend over $200 per order. They dine out about once a month.

The Luxury Seeker's favorite food is filet mignon. They love the rich, buttery flavor of the steak and the way it melts in their mouth. They would love an image of a perfectly cooked filet mignon, seared on the outside and pink on the inside, with a side of mashed potatoes and asparagus.

**Cluster 4: The Social Butterfly**

This cluster is made up of people who enjoy going out to eat with friends and family. They typically order multiple dishes per person and spend around $50 per order. They dine out about once a week.

The Social Butterfly's favorite food is pasta. They love the way the pasta can be customized to everyone's liking. They would love an image of a plate of pasta with a variety of sauces and toppings, such as marinara, pesto, and alfredo.

**Cluster 5: The Homebody**

This cluster is made up of people who enjoy cooking and eating at home. They typically order one or two dishes per person and spend around $40 per order. They dine out about once a month.

The Homebody's favorite food is tacos. They love the way they can be customized to everyone's liking. They would love an image of a plate of tacos with a variety of fillings, such as chicken, beef, and pork.

In [17]:
def persona_builder(preamble):
    # Prepend the fixed role text, then append one line per cluster
    pre_text = "Pretend you're a creative strategist, given the following clusters "
    prompt = pre_text + preamble + "\n" + "\n".join(cluster_info)
    result = textgen_model.predict(
        prompt,
        max_output_tokens=1024,
        temperature=0.35,
        top_p=0.8,
        top_k=40,
    )

    # Return the response text so the UI textbox receives a plain string
    return prompt, result.text
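
Separating prompt assembly from the model call makes the prompt easy to inspect and test without spending API quota. Here is a minimal sketch of that split; `build_prompt` is a hypothetical helper, and the cluster lines reuse two rows printed earlier in the notebook.

```python
PRE_TEXT = "Pretend you're a creative strategist, given the following clusters "


def build_prompt(task, cluster_lines, pre_text=PRE_TEXT):
    """Combine the fixed role text, the user's task, and one line per cluster."""
    task = task.strip()
    if not task:
        raise ValueError("task must not be empty")
    return pre_text + task + "\n" + "\n".join(cluster_lines)


example_lines = [
    "cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87",
    "cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9",
]

print(build_prompt("come up with a creative brand persona for each:", example_lines))
```

The resulting string can then be passed to `textgen_model.predict` exactly as `persona_builder` does.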

#### Notes

A beautiful mountain range with snow-capped peaks and a river running through it. The sun is shining and the sky is blue.  Photorealistic

come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group

## Now we can stick it behind a UI

<p align="center">
  <img alt="Conceptual Flow" src="slides/process6.png" width="100%">
</p>

In [18]:
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown(
    """
    ## Persona Builder and Marketing Bot

    ### Enter a task...

    """)
    with gr.Row():
        with gr.Column():
            input_text = gr.Textbox(label="Task", placeholder="Pretend you're a creative strategist, given the following clusters: ")

    with gr.Row():
        generate = gr.Button("Generate Response")

    with gr.Row():
        label2 = gr.Textbox(label="Prompt")
    with gr.Row():
        label3 = gr.Textbox(label="Response generated by LLM")

    generate.click(persona_builder, input_text, [label2, label3])
demo.launch(share=False, debug=False)

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
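
A UI callback benefits from a little input checking and error handling, so that an empty task or a failed API call shows a message in the response box instead of an unhandled exception. Here is one possible sketch; `make_safe_callback` and `fake_builder` are hypothetical names, and the fake builder stands in for `persona_builder` so the example runs offline.

```python
def make_safe_callback(builder):
    """Wrap a (task) -> (prompt, response) function for use as a UI callback."""

    def callback(task):
        # Reject empty input before calling the model.
        if not task or not task.strip():
            return "", "Please enter a task first."
        try:
            prompt, response = builder(task)
        except Exception as exc:
            # Surface the error text in the response box.
            return "", f"Request failed: {exc}"
        return prompt, str(response)

    return callback


def fake_builder(task):
    """Offline stand-in for persona_builder."""
    return "PROMPT: " + task, "RESPONSE"


safe = make_safe_callback(fake_builder)
print(safe("describe each cluster"))
print(safe("   "))
```

In the notebook you could pass `make_safe_callback(persona_builder)` to `generate.click` in place of `persona_builder`.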


