# Using the Vertex AI PaLM API to explain BQML Clustering - Our Very Own Persona Builder

## Here's what we're going to build:

<p align="center">
  <img alt="Conceptual Flow" src="slides/process0.png" width="100%">
</p>

# Core code
Let's define some variables that will be used throughout this notebook.

These are the GCP project ID (`PROJECT_ID`), the region (`REGION`), the BigQuery dataset name (`DATASET`), and the model name (`BQML_MODEL`), which can be any name you prefer.
The dataset must already exist in the `PROJECT_ID` project, and you'll need appropriate access to create and delete resources in it.
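
The queries in this notebook splice the dataset and model names into SQL with `str.format`. As a minimal sketch, a small helper could validate those names before building the backtick-quoted reference. The helper name `qualified_model` and the conservative identifier pattern (letters, digits, underscores) are assumptions made for illustration, not part of the notebook.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Assumption: names consist only of letters, digits and underscores.
_IDENTIFIER = re.compile(r"^[A-Za-z0-9_]+$")


def qualified_model(dataset: str, model: str) -> str:
    """Return a backtick-quoted `dataset.model` reference for use in a query."""
    for part in (dataset, model):
        if not _IDENTIFIER.match(part):
            raise ValueError(f"unexpected characters in identifier: {part!r}")
    return f"`{dataset}.{model}`"


print(qualified_model("bqml_demos", "ecommerce_customer_segment_cluster5"))
```

Rejecting anything outside the pattern keeps a stray backtick or semicolon in a name from changing the meaning of the generated SQL.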

## Setup

In [1]:
from typing import Union
import sys

import os
import io
import json
import base64
import requests
import concurrent.futures
import time

import numpy as np
import pandas as pd

import vertexai.preview.language_models
from google.cloud import aiplatform
from google.cloud import documentai
from google.cloud.documentai_v1 import Document
from google.cloud import storage
from google.cloud import bigquery

from IPython.display import display, Markdown, Latex


In [2]:
PROJECT_ID = 'mg-ce-demos'
REGION = 'us-central1'
DATASET = "bqml_demos" 
BQML_MODEL = "ecommerce_customer_segment_cluster5" 
EVAL = BQML_MODEL + "_eval"


In [3]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)
aiplatform.init(project = PROJECT_ID, location = REGION)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)

## Create a K-means model to cluster ecommerce data

### Let's look at the data first

<p align="center">
  <img alt="Conceptual Flow" src="slides/process1.png" width="100%">
</p>

In [4]:
query = """
SELECT
  user_id,
  order_id,
  sale_price,
  created_at as order_created_date
FROM `mg-ce-demos.thelook_ecommerce.order_items`
WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
"""
df = bq.query(query).to_dataframe()
df.head()


| | user_id | order_id | sale_price | order_created_date |
|---|---|---|---|---|
| 0 | 40085 | 50021 | 2.5 | 2023-04-07 08:09:34+00:00 |
| 1 | 90363 | 112961 | 2.5 | 2022-12-25 23:23:29+00:00 |
| 2 | 3922 | 4887 | 2.5 | 2022-09-29 13:47:16+00:00 |
| 3 | 56076 | 70152 | 2.5 | 2022-03-27 03:01:23+00:00 |
| 4 | 90828 | 113540 | 2.5 | 2023-03-18 08:30:11+00:00 |


### `CREATE MODEL` using `KMEANS`

Create the query, then start the model creation job with the `run_bq_query` wrapper defined below, which blocks until the job completes.

<p align="center">
  <img alt="Conceptual Flow" src="slides/process2.png" width="100%">
</p>

In [5]:
query = """
CREATE MODEL IF NOT EXISTS `{0}.{1}`
OPTIONS (
  MODEL_TYPE = "KMEANS",
  NUM_CLUSTERS = 5,
  KMEANS_INIT_METHOD = "KMEANS++",
  STANDARDIZE_FEATURES = TRUE )
AS (
SELECT * EXCEPT (user_id)
FROM (
  SELECT user_id,
    DATE_DIFF(CURRENT_DATE(), CAST(MAX(order_created_date) as DATE), day) AS days_since_order, -- RECENCY
    COUNT(order_id) AS count_orders, -- FREQUENCY
    AVG(sale_price) AS avg_spend -- MONETARY
  FROM (
    SELECT user_id,
      order_id,
      sale_price,
      created_at as order_created_date
    FROM `mg-ce-demos.thelook_ecommerce.order_items`
    WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
  )
  GROUP BY user_id, order_id
 )
)
""".format(DATASET, BQML_MODEL)
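
The inner `SELECT` above derives three recency/frequency/monetary (RFM) features. The following plain-Python sketch mirrors that aggregation on a few made-up rows so the shape of the features is easy to inspect. Note that this sketch groups by `user_id` only, whereas the query above groups by `user_id, order_id`. The function name `rfm_features` and the sample rows are illustrative assumptions.

```python
from collections import defaultdict
from datetime import date

# Made-up rows: (user_id, order_id, sale_price, order date).
rows = [
    (1, 100, 20.0, date(2023, 1, 10)),
    (1, 101, 40.0, date(2023, 3, 1)),
    (2, 200, 250.0, date(2022, 6, 15)),
]


def rfm_features(rows, today):
    """Aggregate order rows into recency, frequency and monetary features per user."""
    by_user = defaultdict(list)
    for user_id, order_id, sale_price, order_date in rows:
        by_user[user_id].append((order_id, sale_price, order_date))

    features = {}
    for user_id, items in by_user.items():
        last_order = max(order_date for _, _, order_date in items)
        features[user_id] = {
            "days_since_order": (today - last_order).days,  # recency
            "count_orders": len(items),  # frequency: counts rows, like COUNT(order_id)
            "avg_spend": sum(price for _, price, _ in items) / len(items),  # monetary
        }
    return features


features = rfm_features(rows, today=date(2023, 4, 1))
for user_id, values in features.items():
    print(user_id, values)
```

Passing `today` explicitly keeps the sketch deterministic; the query uses `CURRENT_DATE()` instead, so `days_since_order` grows each day the model is retrained.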


In [6]:
# Wrapper that runs a query/job with the BigQuery client and returns the result as a DataFrame
def run_bq_query(sql: str) -> pd.DataFrame:

    # Optional: uncomment to dry-run the query first and surface errors without executing it
    #job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    #bq.query(sql, job_config=job_config)

    # Submit the query with the default job configuration
    job_config = bigquery.QueryJobConfig()
    client_result = bq.query(sql, job_config=job_config)

    job_id = client_result.job_id

    # Block until the query/job finishes, then convert the result to a DataFrame
    df = client_result.result().to_arrow().to_pandas()
    print(f"Finished job_id: {job_id}")
    return df

In [7]:
%%timeit

run_bq_query(query)

Finished job_id: 535ae74c-5e66-41e5-9d06-09034e8228e5
Finished job_id: b4707452-384c-4917-84f5-9ce7e96713ee
Finished job_id: f546e9be-9ad1-46fa-a071-d47bb19fd8f3
Finished job_id: 1b8292e9-0a2a-41d0-8e0b-882b97f5ab76
Finished job_id: 095e0f8d-bef8-4157-bf84-73a47ba84632
Finished job_id: 328d9c7d-bbb8-4688-9788-771943ddc342
Finished job_id: 0962d153-86a3-40be-9c50-8ee7c2f5f370
Finished job_id: e305ea4b-1620-461c-9b74-f0f437964089
The slowest run took 10.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1.87 s ± 2.55 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
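
To get a feel for what `MODEL_TYPE = "KMEANS"` with `STANDARDIZE_FEATURES = TRUE` does conceptually, here is a minimal plain-Python sketch of z-score standardization followed by Lloyd's k-means iterations. It is not BigQuery ML's implementation: it uses a simplistic "first k points" initialization rather than `KMEANS++`, and the data points are made up.

```python
import math


def standardize(points):
    """Z-score each column (population std); constant columns map to 0."""
    columns = list(zip(*points))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds)) for p in points]


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with a naive init (first k points), NOT k-means++."""
    centroids = list(points[:k])
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment


# Made-up (avg_spend, count_orders) pairs.
raw = [(50, 1), (250, 3), (55, 1), (245, 3)]
centroids, assignment = kmeans(standardize(raw), k=2)
print(assignment)
```

Standardizing first matters because `avg_spend` is on a much larger scale than `count_orders`; without it, distances would be dominated by spend alone.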


Let's take a look at the model's clustering performance using two metrics: the Davies-Bouldin index and the mean squared distance. For both, lower values indicate tighter, better-separated clusters.

In [8]:
query = """
SELECT *
FROM ML.EVALUATE(MODEL `{0}.{1}`)
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: b42c75b0-51b7-424c-a5eb-db0287bc0dc3


| | davies_bouldin_index | mean_squared_distance |
|---|---|---|
| 0 | 1.023378 | 1.051785 |
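
For intuition about the two metrics, the sketch below computes them from their textbook definitions on a toy dataset. The Davies-Bouldin index averages, over clusters, the worst ratio of combined within-cluster scatter to between-centroid distance. The mean squared distance averages each point's squared distance to its own centroid. Whether BigQuery ML computes them in exactly this way (for example, on standardized features) is an assumption here, and the function name `cluster_metrics` is illustrative.

```python
import math


def centroid(points):
    """Component-wise mean of a list of points."""
    return tuple(sum(v) / len(points) for v in zip(*points))


def cluster_metrics(clusters):
    """Return (davies_bouldin_index, mean_squared_distance) for a list of clusters."""
    centroids = [centroid(c) for c in clusters]
    # Scatter: average distance of a cluster's points to its centroid.
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c) for c, m in zip(clusters, centroids)
    ]
    k = len(clusters)
    davies_bouldin = sum(
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ) / k
    n_points = sum(len(c) for c in clusters)
    mean_squared_distance = sum(
        math.dist(p, m) ** 2 for c, m in zip(clusters, centroids) for p in c
    ) / n_points
    return davies_bouldin, mean_squared_distance


# Two tight, well-separated toy clusters.
toy = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
db_index, msd = cluster_metrics(toy)
print(db_index, msd)
```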


### Now let's get the cluster (centroid) information

<p align="center">
  <img alt="Conceptual Flow" src="slides/process3.png" width="100%">
</p>

In [9]:
query = """
SELECT
  CONCAT('cluster ', CAST(centroid_id as STRING)) as centroid,
  avg_spend as average_spend,
  count_orders as count_of_orders,
  days_since_order
FROM (
  SELECT centroid_id, feature, ROUND(numerical_value, 2) as value
  FROM ML.CENTROIDS(MODEL `{0}.{1}`)
)
PIVOT (
  SUM(value)
  FOR feature IN ('avg_spend',  'count_orders', 'days_since_order')
)
ORDER BY centroid_id
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: 76b1f771-f211-4e54-9d15-dc195af5b5d2


| | centroid | average_spend | count_of_orders | days_since_order |
|---|---|---|---|---|
| 0 | cluster 1 | 49.44 | 1.23 | 102.87 |
| 1 | cluster 2 | 59.56 | 3.51 | 87.9 |
| 2 | cluster 3 | 251.34 | 1.14 | 205.85 |
| 3 | cluster 4 | 57.47 | 3.49 | 354.32 |
| 4 | cluster 5 | 48.87 | 1.22 | 376.5 |
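
The `PIVOT` in the query above turns one row per (centroid, feature) pair into one row per centroid. A rough plain-Python equivalent, using the values for clusters 1 and 2 from the output above, could look like this. The function name `pivot_centroids` is an illustrative assumption.

```python
from collections import defaultdict

# Long-form rows shaped like (centroid_id, feature, value).
long_rows = [
    (1, "avg_spend", 49.44),
    (1, "count_orders", 1.23),
    (1, "days_since_order", 102.87),
    (2, "avg_spend", 59.56),
    (2, "count_orders", 3.51),
    (2, "days_since_order", 87.9),
]


def pivot_centroids(long_rows):
    """Collect each centroid's features into a single wide row, ordered by centroid id."""
    wide = defaultdict(dict)
    for centroid_id, feature, value in long_rows:
        wide[centroid_id][feature] = value
    return [{"centroid": f"cluster {cid}", **wide[cid]} for cid in sorted(wide)]


wide_rows = pivot_centroids(long_rows)
for row in wide_rows:
    print(row)
```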


Whew! That's a lot of metrics and cluster info. How about we explain this to our colleagues using the magic of LLMs?

In [10]:
df = bq.query(query).to_dataframe()

# Turn each centroid row into a one-line, human-readable summary for the prompt
cluster_info = []
for _, row in df.iterrows():
    cluster_info.append(
        f"{row['centroid']}, average spend ${row['average_spend']}, "
        f"count of orders per person {row['count_of_orders']}, "
        f"days since last order {row['days_since_order']}"
    )

print("\n".join(cluster_info))

cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Explain with Vertex AI PaLM API

### First, we want to instantiate the large language model and create the prompt

<p align="center">
  <img alt="Conceptual Flow" src="slides/process4.png" width="100%">
</p>

In [11]:
textgen_model = vertexai.preview.language_models.TextGenerationModel.from_pretrained('text-bison@001')

In [12]:
preamble = "Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, and explain step by step; what would be the next marketing step for these clusters:"
display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))

## Prompt:

Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, and explain step by step; what would be the next marketing step for these clusters:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5
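
The prompt is simply the instruction followed by one line per cluster. Since the same assembly is repeated for every prompt in this notebook, it could be factored into a small helper. The sketch below is one way to do that; the name `build_prompt` and the empty-input check are assumptions, not part of the notebook.

```python
from typing import Sequence


def build_prompt(preamble: str, cluster_lines: Sequence[str]) -> str:
    """Join the instruction and the cluster summaries, one cluster per line."""
    if not cluster_lines:
        raise ValueError("no cluster summaries to describe")
    return preamble.strip() + "\n" + "\n".join(line.strip() for line in cluster_lines)


example = build_prompt(
    "Describe a persona for each of these clusters:",
    [
        "cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87",
        "cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9",
    ],
)
print(example)
```

Failing early on an empty cluster list avoids sending the model an instruction with nothing to describe.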


### Now, we send our prompt to the Vertex AI PaLM API for some LLM magic

<p align="center">
  <img alt="Conceptual Flow" src="slides/process5.png" width="100%">
</p>

In [13]:
display(Markdown('## Response:'))
# Send prompt to LLM
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Response:

**Cluster 1: The Occasional Shopper**

This cluster is made up of customers who make small, infrequent purchases. They are typically looking for something specific and are not interested in browsing or exploring the site. The average spend per order is $49.44, and they place an order every 102.87 days.

**Brand persona:** The Occasional Shopper is a busy person who is always on the go. They are looking for a quick and easy way to find the things they need, and they are not interested in spending a lot of time shopping.

**Title label:** The Occasional Shopper

**Next marketing step:** The next marketing step for this cluster is to focus on making it easy for customers to find what they need quickly and easily. This could be done by improving the site's search function, or by providing more information about products on the product pages.

**Cluster 2: The Frequent Shopper**

This cluster is made up of customers who make frequent, small purchases. They are typically looking for everyday items, and they are not interested in spending a lot of money on each purchase. The average spend per order is $59.56, and they place an order every 87.9 days.

**Brand persona:** The Frequent Shopper is a practical person who is looking for good value for their money. They are not interested in spending a lot of time shopping, and they are looking for a store that they can trust.

**Title label:** The Frequent Shopper

**Next marketing step:** The next marketing step for this cluster is to focus on providing good value for money. This could be done by offering discounts, or by highlighting the quality of the products.

**Cluster 3: The Big Spender**

This cluster is made up of customers who make large, infrequent purchases. They are typically looking for something special, and they are not afraid to spend a lot of money. The average spend per order is $251.34, and they place an order every 205.85 days.

**Brand persona:** The Big Spender is a luxury shopper who is looking for the best that money can buy. They are not interested in bargains, and they are willing to pay a premium for quality.

**Title label:** The Big Spender

**Next marketing step:** The next marketing step for this cluster is to focus on highlighting the luxury and exclusivity of the products. This could be done by using high-quality images, or by featuring testimonials from satisfied customers.

**Cluster 4: The Loyal Customer**

This cluster is made up of customers who make regular, small purchases. They are typically looking for everyday items, and they are not interested in spending a lot of money on each purchase. The average spend per order is $57.47, and they place an order every 354.32 days.

**Brand persona:** The Loyal Customer is a customer who is loyal to the brand. They are not interested in shopping around, and they are happy to pay a premium for the convenience of shopping with a brand they trust.

**Title label:** The Loyal Customer

**Next marketing step:** The next marketing step for this cluster is to focus on building loyalty. This could be done by offering loyalty programs, or by providing excellent customer service.

**Cluster 5: The Forgotten Customer**

This cluster is made up of customers who have not made a purchase in a long time. They are typically looking for something specific, and they are not interested in browsing or exploring the site. The average spend per order is $48.87, and they place an order every 376.56 days.

**Brand persona:** The Forgotten Customer is a customer who has been forgotten by the brand. They are not interested in shopping around, but they are also not happy with the current state of their relationship with the brand.

**Title label:** The Forgotten Customer

**Next marketing step:** The next marketing step for this cluster is to focus on re-engaging with customers. This could be done by sending out email newsletters, or by offering special discounts.

Voilà! We've now used k-means clustering to group customers by spending behavior, and an LLM to explain each group's profile.

Sometimes, though, you want a little bit [extra](https://cloud.google.com/blog/transform/prompt-debunking-five-generative-ai-misconceptions).

In [14]:
preamble = "Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:"

display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))
display(Markdown('## Response:'))
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Prompt:

Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Response:

**Cluster 1: The Bargain Hunter**

This customer is always looking for a good deal. They love movies like "The Princess Bride" and "The Shawshank Redemption" because they are both classic films that can be enjoyed over and over again. They are typically not big spenders, but they do make frequent purchases. You could target this customer with a campaign that offers discounts on popular movies or a loyalty program that rewards them for their purchases.

**E-mail Headline:** Save Big on Your Favorite Movies!

**Cluster 2: The Movie Buff**

This customer is passionate about movies. They love to watch new releases and they are always up-to-date on the latest trends. They love movies like "The Dark Knight" and "The Avengers" because they are action-packed and full of suspense. They are typically high spenders and they make frequent purchases. You could target this customer with a campaign that offers exclusive access to new releases or a subscription service that gives them access to a library of movies.

**E-mail Headline:** Get the Inside Scoop on the Latest Movies!

**Cluster 3: The Social Butterfly**

This customer loves to be around people and they enjoy sharing their experiences with others. They love movies like "The Hangover" and "Bridesmaids" because they are funny and make them laugh. They are typically high spenders and they make frequent purchases. You could target this customer with a campaign that offers social media giveaways or a rewards program that encourages them to share their purchases with their friends.

**E-mail Headline:** Share Your Favorite Movies with Your Friends and Get Rewarded!

**Cluster 4: The Family Man**

This customer loves to spend time with their family. They love movies like "The Incredibles" and "Toy Story" because they are fun and entertaining for the whole family. They are typically moderate spenders and they make frequent purchases. You could target this customer with a campaign that offers family-friendly movies or a subscription service that gives them access to a library of movies that the whole family can enjoy.

**E-mail Headline:** Family Movie Night Made Easy!

**Cluster 5: The Introvert**

This customer is more introverted and they prefer to spend time alone. They love movies like "The Notebook" and "The Shawshank Redemption" because they are emotional and thought-provoking. They are typically low spenders and they make infrequent purchases. You could target this customer with a campaign that offers personalized recommendations or a subscription service that gives them access to a library of movies that they are likely to enjoy.

**E-mail Headline:** Find Your Perfect Movie with Our Personalized Recommendations!
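
Every `predict` call in this notebook uses the same sampling settings: `temperature=0.35`, `top_p=0.8`, and `top_k=40`. The sketch below illustrates the common interpretation of those three knobs on a toy token distribution: temperature rescales the logits, top-k keeps the k most likely tokens, and top-p keeps the smallest set of tokens whose cumulative probability reaches p. This is an illustration under those assumptions, not the PaLM API's actual implementation, and the logits are made up.

```python
import math


def filter_distribution(logits, temperature=0.35, top_k=40, top_p=0.8):
    """Return renormalized probabilities of the tokens that survive top-k, then top-p."""
    # Temperature: divide logits before the softmax; values below 1 sharpen the distribution.
    scaled = {token: value / temperature for token, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {token: math.exp(value - peak) for token, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((token, w / total) for token, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    ranked = ranked[:top_k]  # top-k: keep only the k most likely tokens

    kept, cumulative = [], 0.0
    for token, prob in ranked:  # top-p: stop once the kept mass reaches top_p
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for _, prob in kept)
    return {token: prob / norm for token, prob in kept}


toy_logits = {"Shopper": 2.0, "Spender": 1.0, "Customer": 0.0}
neutral = filter_distribution(toy_logits, temperature=1.0)
sharpened = filter_distribution(toy_logits, temperature=0.35)
print(neutral)
print(sharpened)
```

With these toy logits, lowering the temperature concentrates enough probability on the top token that the top-p cut-off leaves fewer candidates, which is one reason a low temperature tends to produce more repeatable responses.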

In [15]:
preamble = "Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite food and a detailed description of an image containing their favorite food:"

display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))
display(Markdown('## Response:'))
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Prompt:

Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite food and a detailed description of an image containing their favorite food:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Response:

**Cluster 1: The Casual Diner**

This cluster is made up of people who enjoy casual dining experiences. They typically order one or two dishes per person, and they spend around $50 per order. They dine out about once every two weeks.

The Casual Diner's favorite food is pizza. They love the combination of crispy crust, gooey cheese, and flavorful toppings. They imagine an image of a pizza with a perfectly golden crust, topped with fresh mozzarella cheese, pepperoni, and basil. The pizza is steaming hot and smells delicious. The Casual Diner can't wait to dig in!

**Cluster 2: The Family Meal**

This cluster is made up of families with young children. They typically order a variety of dishes to share, and they spend around $60 per order. They dine out about once a week.

The Family Meal's favorite food is chicken nuggets. The kids love the crispy nuggets and the dipping sauce. The parents appreciate that the nuggets are a quick and easy meal that the whole family can enjoy. They imagine an image of a plate of chicken nuggets, french fries, and apple slices. The kids are digging in with their hands, and the parents are smiling and enjoying the moment.

**Cluster 3: The Luxury Seeker**

This cluster is made up of people who enjoy fine dining experiences. They typically order multiple courses per person, and they spend over $200 per order. They dine out about once a month.

The Luxury Seeker's favorite food is filet mignon. They love the rich, buttery flavor of the steak. They imagine an image of a perfectly cooked filet mignon, seared on the outside and pink on the inside. The steak is served with a side of asparagus and mashed potatoes. The Luxury Seeker can't wait to savor every bite!

**Cluster 4: The Social Butterfly**

This cluster is made up of people who enjoy going out to eat with friends. They typically order a variety of dishes to share, and they spend around $60 per order. They dine out about once a week.

The Social Butterfly's favorite food is sushi. They love the variety of flavors and textures that sushi offers. They imagine an image of a platter of sushi, with sashimi, nigiri, and rolls. The Social Butterfly is surrounded by friends, and they are all laughing and enjoying the meal.

**Cluster 5: The Budget-Conscious Diner**

This cluster is made up of people who are looking for a good value when they dine out. They typically order one or two dishes per person, and they spend around $50 per order. They dine out about once every two weeks.

The Budget-Conscious Diner's favorite food is pasta. They love the hearty, filling feeling that pasta gives them. They imagine an image of a plate of spaghetti with meatballs and marinara sauce. The pasta is topped with Parmesan cheese and fresh basil. The Budget-Conscious Diner is smiling and satisfied, knowing that they got a delicious meal for a great price.

In [16]:
def persona_builder(preamble):
    # Prepend the fixed role instruction, then append the cluster summaries
    pre_text = "Pretend you're a creative strategist, given the following clusters "
    prompt = pre_text + preamble + "\n" + "\n".join(cluster_info)
    result = textgen_model.predict(
        prompt,
        max_output_tokens=1024,
        temperature=0.35,
        top_p=0.8,
        top_k=40,
    )

    # Return plain strings so both can be displayed as text
    return prompt, result.text
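
`persona_builder` depends on the global `cluster_info` list and on a live model, which makes it awkward to try out without credentials. One possible variation, sketched here, injects both so the prompt assembly can be exercised with a stub. The names `make_persona_builder` and `fake_predict` are hypothetical and not part of the notebook.

```python
from typing import Callable, Sequence, Tuple

PRE_TEXT = "Pretend you're a creative strategist, given the following clusters "


def make_persona_builder(
    predict: Callable[[str], str],
    cluster_lines: Sequence[str],
    pre_text: str = PRE_TEXT,
) -> Callable[[str], Tuple[str, str]]:
    """Return a function that maps a task to (prompt, response text)."""

    def build(task: str) -> Tuple[str, str]:
        prompt = pre_text + task + "\n" + "\n".join(cluster_lines)
        return prompt, predict(prompt)

    return build


def fake_predict(prompt: str) -> str:
    """Stub standing in for the real model call; it only reports the prompt size."""
    return f"(stub) received {len(prompt.splitlines())} lines"


builder = make_persona_builder(fake_predict, ["cluster 1, average spend $49.44"])
prompt, response = builder("come up with a persona for each:")
print(prompt)
print(response)
```

In the notebook itself, the stub could be swapped for a function that calls `textgen_model.predict(...)` and returns the response's `text`, leaving the rest unchanged.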

#### Notes

A beautiful mountain range with snow-capped peaks and a river running through it. The sun is shining and the sky is blue.  Photorealistic

come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group

## Now we can stick it behind a UI

<p align="center">
  <img alt="Conceptual Flow" src="slides/process6.png" width="100%">
</p>

In [17]:
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown(
    """
    ## Persona Builder and Marketing Bot

    ### Enter a task...

    """)
    with gr.Row():
        with gr.Column():
            input_text = gr.Textbox(label="Task", placeholder="Pretend you're a creative strategist, given the following clusters: ")

    with gr.Row():
        generate = gr.Button("Generate Response")

    with gr.Row():
        label2 = gr.Textbox(label="Prompt")
    with gr.Row():
        label3 = gr.Textbox(label="Response generated by LLM")

    generate.click(persona_builder, input_text, [label2, label3])
demo.launch(share=False, debug=False)

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


