# Using the Vertex AI PaLM API to explain BQML Clustering - Our Very Own Persona Builder

## Here's what we're going to build:

<p align="center">
  <img alt="Conceptual Flow" src="slides/process0.png" width="100%">
</p>

# Core code
Let's define some variables that will be used throughout this notebook.

These are the GCP project ID (`PROJECT_ID`), the region (`REGION`), the dataset name (`DATASET`), and the model name (`BQML_MODEL`), which can be any name you prefer.
The dataset must already exist in the project named by `PROJECT_ID`, and you need permission to create and delete models in it.
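
As a rough illustration of how these settings fit together, the sketch below builds the fully qualified `project.dataset.model` identifier that the later queries refer to. The helper name, the example values, and the character check are assumptions made for this example, not part of the notebook.

```python
import re

# Assumption: restrict dataset and model names to letters, digits and
# underscores. This is deliberately conservative, not the official rule set.
_NAME_RE = re.compile(r"^[A-Za-z0-9_]+$")


def qualified_model_id(project_id: str, dataset: str, model: str) -> str:
    """Return `project.dataset.model`, validating dataset and model names."""
    for label, value in (("dataset", dataset), ("model", model)):
        if not _NAME_RE.match(value):
            raise ValueError(f"invalid {label} name: {value!r}")
    return f"{project_id}.{dataset}.{model}"


# Hypothetical values, for illustration only.
example_id = qualified_model_id("my-project", "my_dataset", "my_kmeans_model")
print(example_id)
```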

## Setup

In [1]:
from typing import Union
import sys

import os
import io
import json
import base64
import requests
import concurrent.futures
import time

import numpy as np
import pandas as pd

import vertexai.preview.language_models
from google.cloud import aiplatform
from google.cloud import documentai
from google.cloud.documentai_v1 import Document
from google.cloud import storage
from google.cloud import bigquery

from IPython.display import display, Markdown, Latex


In [2]:
PROJECT_ID = 'mg-ce-demos'
REGION = 'us-central1'
DATASET = "bqml_demos" 
BQML_MODEL = "ecommerce_customer_segment_cluster5" 
EVAL = BQML_MODEL + "_eval"


In [3]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)
aiplatform.init(project = PROJECT_ID, location = REGION)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)

## Create a K-means model to cluster ecommerce data

### Let's look at the data first

<p align="center">
  <img alt="Conceptual Flow" src="slides/process1.png" width="100%">
</p>

In [4]:
query = """
SELECT
  user_id,
  order_id,
  sale_price,
  created_at as order_created_date
FROM `mg-ce-demos.thelook_ecommerce.order_items`
WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
"""
df = bq.query(query).to_dataframe()
df.head()


| | user_id | order_id | sale_price | order_created_date |
|---|---|---|---|---|
| 0 | 40085 | 50021 | 2.5 | 2023-04-07 08:09:34+00:00 |
| 1 | 90363 | 112961 | 2.5 | 2022-12-25 23:23:29+00:00 |
| 2 | 3922 | 4887 | 2.5 | 2022-09-29 13:47:16+00:00 |
| 3 | 56076 | 70152 | 2.5 | 2022-03-27 03:01:23+00:00 |
| 4 | 90828 | 113540 | 2.5 | 2023-03-18 08:30:11+00:00 |


### `CREATE MODEL` using `KMEANS`

Build the `CREATE MODEL` query, then run it as a BigQuery job and block until the job completes.
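
The training query in this section derives three RFM (recency, frequency, monetary) features from order items. The sketch below computes the same kind of features in plain Python on made-up rows. It groups by user only and counts distinct orders, which is an assumption about the intent and is not identical to the `GROUP BY user_id, order_id` used in the notebook's SQL.

```python
from collections import defaultdict
from datetime import date

# Toy order items: (user_id, order_id, sale_price, order_date). Made-up values.
order_items = [
    (1, 101, 20.0, date(2023, 1, 10)),
    (1, 101, 40.0, date(2023, 1, 10)),
    (1, 102, 30.0, date(2023, 3, 1)),
    (2, 201, 100.0, date(2023, 2, 15)),
]


def rfm_per_user(rows, today):
    """Aggregate order items into recency, frequency and monetary features."""
    grouped = defaultdict(list)
    for user_id, order_id, sale_price, order_date in rows:
        grouped[user_id].append((order_id, sale_price, order_date))

    features = {}
    for user_id, items in grouped.items():
        last_order = max(order_date for _, _, order_date in items)
        features[user_id] = {
            # Recency: days between the reference date and the latest order.
            "days_since_order": (today - last_order).days,
            # Frequency: number of distinct orders.
            "count_orders": len({order_id for order_id, _, _ in items}),
            # Monetary: mean sale price across all items.
            "avg_spend": sum(price for _, price, _ in items) / len(items),
        }
    return features


rfm = rfm_per_user(order_items, today=date(2023, 4, 1))
print(rfm)
```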

<p align="center">
  <img alt="Conceptual Flow" src="slides/process2.png" width="100%">
</p>

In [5]:
query = """
CREATE MODEL IF NOT EXISTS `{0}.{1}`
OPTIONS (
  MODEL_TYPE = "KMEANS",
  NUM_CLUSTERS = 5,
  KMEANS_INIT_METHOD = "KMEANS++",
  STANDARDIZE_FEATURES = TRUE )
AS (
SELECT * EXCEPT (user_id)
FROM (
  SELECT user_id,
    DATE_DIFF(CURRENT_DATE(), CAST(MAX(order_created_date) as DATE), day) AS days_since_order, -- RECENCY
    COUNT(order_id) AS count_orders, -- FREQUENCY
    AVG(sale_price) AS avg_spend -- MONETARY
  FROM (
    SELECT user_id,
      order_id,
      sale_price,
      created_at as order_created_date
    FROM `mg-ce-demos.thelook_ecommerce.order_items`
    WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
  )
  GROUP BY user_id, order_id -- NOTE: this yields one row per order, not per user; group by user_id alone for per-customer RFM
 )
)
""".format(DATASET, BQML_MODEL)


In [6]:
# Wrapper that runs a query/job with the BigQuery client and returns the result as a DataFrame
def run_bq_query(sql: str) -> pd.DataFrame:

    # Optional: uncomment for a dry run that catches errors before executing the query
    #job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    #bq.query(sql, job_config=job_config)

    # Submit the query with the default job configuration
    job_config = bigquery.QueryJobConfig()
    client_result = bq.query(sql, job_config=job_config)

    job_id = client_result.job_id

    # result() blocks until the query/job finishes; then convert the rows to a DataFrame
    df = client_result.result().to_arrow().to_pandas()
    print(f"Finished job_id: {job_id}")
    return df
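
The wrapper above waits by blocking on `result()`. If an explicit polling loop is preferred, for example to log progress or enforce a timeout, a minimal sketch could look like the following. It relies only on the job object exposing a `done()` method; `wait_for_job` and `FakeJob` are hypothetical names introduced here so the example runs without a BigQuery connection.

```python
import time


def wait_for_job(job, poll_seconds=1.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll `job.done()` until it returns True or the timeout elapses."""
    waited = 0.0
    while not job.done():
        if waited >= timeout_seconds:
            raise TimeoutError(f"job not finished after {timeout_seconds} s")
        sleep(poll_seconds)
        waited += poll_seconds
    return job


class FakeJob:
    """Hypothetical stand-in for a query job; reports done after N polls."""

    def __init__(self, polls_until_done):
        self.remaining = polls_until_done

    def done(self):
        self.remaining -= 1
        return self.remaining < 0


# Skip real sleeping so the example finishes instantly.
finished = wait_for_job(FakeJob(polls_until_done=3), sleep=lambda seconds: None)
print("job finished:", finished.done())
```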

In [7]:
%%timeit

run_bq_query(query)

Finished job_id: 24aadbaa-47ac-4a58-be27-62d27256c701
Finished job_id: 698df591-34c5-4031-bad8-3b1d5ea31d76
Finished job_id: 1aad5cc6-86b8-4c61-bbde-cf3652928786
Finished job_id: 7c0e1432-fd44-4f0e-bfd8-f6c326d3c15b
Finished job_id: dcb3d109-44e7-483a-bc49-db4c4c8df89a
Finished job_id: 180d643b-c862-44bf-8880-e8214cb7bf4d
Finished job_id: 037716dc-fc10-4ae8-84aa-9019b156dae4
Finished job_id: 39777414-36dc-455b-9d4f-03a353ff1a00
The slowest run took 12.55 times longer than the fastest. This could mean that an intermediate result is being cached.
2.09 s ± 3.13 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
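
For intuition about what the training job above does, here is a toy version of k-means (Lloyd's algorithm) run on standardized features. It is only a sketch: the starting centroids are supplied by the caller instead of being chosen by k-means++ as the notebook's `KMEANS_INIT_METHOD` option requests, the data are made up, and BigQuery ML's actual implementation is not shown in this notebook.

```python
import math


def standardize(points):
    """Z-score each column so all features share a comparable scale."""
    columns = list(zip(*points))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        # Fall back to 1.0 for constant columns to avoid dividing by zero.
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [
        tuple((x - m) / s for x, m, s in zip(point, means, stds))
        for point in points
    ]


def kmeans(points, initial_centroids, iterations=10):
    """Plain Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = list(initial_centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(col) for col in zip(*members))
    return centroids, assignments


# Made-up (avg_spend, count_orders) pairs forming two obvious groups.
raw = [(50.0, 1.0), (55.0, 1.0), (250.0, 3.0), (260.0, 3.0)]
scaled = standardize(raw)
centroids, labels = kmeans(scaled, initial_centroids=[scaled[0], scaled[2]])
print(labels)
```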


Let's take a look at the model's clustering performance using two metrics: the Davies-Bouldin index and the mean squared distance. For both, lower values indicate tighter, better-separated clusters.

In [8]:
query = """
SELECT *
FROM ML.EVALUATE(MODEL `{0}.{1}`)
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: 39d7239c-4600-4650-b6d4-9b552c4df244


| | davies_bouldin_index | mean_squared_distance |
|---|---|---|
| 0 | 1.023378 | 1.051785 |
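
To make the two metrics reported above more concrete, the sketch below computes them on a tiny hand-made example using their standard textbook definitions. BigQuery ML's exact computation (for instance, whether it works on standardized features) is an assumption not verified here, so treat this as intuition only.

```python
import math


def centroid(points):
    """Mean position of a list of points."""
    return tuple(sum(col) / len(col) for col in zip(*points))


def davies_bouldin_index(clusters):
    """clusters: list of lists of points. Lower values mean better separation."""
    centers = [centroid(members) for members in clusters]
    # Scatter: average distance from each member to its own centroid.
    scatter = [
        sum(math.dist(p, center) for p in members) / len(members)
        for members, center in zip(clusters, centers)
    ]
    worst_ratios = []
    for i in range(len(clusters)):
        # For each cluster, take its most similar (worst-case) neighbour.
        worst_ratios.append(max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(len(clusters)) if j != i
        ))
    return sum(worst_ratios) / len(worst_ratios)


def mean_squared_distance(clusters):
    """Average squared distance from every point to its assigned centroid."""
    squared = [
        math.dist(p, centroid(members)) ** 2
        for members in clusters for p in members
    ]
    return sum(squared) / len(squared)


# Two tight clusters that sit far apart: both metrics come out small.
toy_clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
dbi = davies_bouldin_index(toy_clusters)
msd = mean_squared_distance(toy_clusters)
print(dbi, msd)
```

In this toy case every point lies 1 unit from its centroid and the centroids are 10 units apart, so the index is (1 + 1) / 10 = 0.2 and the mean squared distance is 1.0.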


### Now let's get the cluster (centroid) information

<p align="center">
  <img alt="Conceptual Flow" src="slides/process3.png" width="100%">
</p>

In [9]:
query = """
SELECT
  CONCAT('cluster ', CAST(centroid_id as STRING)) as centroid,
  avg_spend as average_spend,
  count_orders as count_of_orders,
  days_since_order
FROM (
  SELECT centroid_id, feature, ROUND(numerical_value, 2) as value
  FROM ML.CENTROIDS(MODEL `{0}.{1}`)
)
PIVOT (
  SUM(value)
  FOR feature IN ('avg_spend',  'count_orders', 'days_since_order')
)
ORDER BY centroid_id
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: cba29c7e-0f19-43e1-b78f-8ff3db88e41e


| | centroid | average_spend | count_of_orders | days_since_order |
|---|---|---|---|---|
| 0 | cluster 1 | 49.44 | 1.23 | 102.87 |
| 1 | cluster 2 | 59.56 | 3.51 | 87.9 |
| 2 | cluster 3 | 251.34 | 1.14 | 205.85 |
| 3 | cluster 4 | 57.47 | 3.49 | 354.32 |
| 4 | cluster 5 | 48.87 | 1.22 | 376.5 |
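
The `PIVOT` in the query above turns one row per (centroid, feature) pair into one row per centroid. The same reshaping can be sketched in plain Python as follows. The input rows reuse the first two clusters from the table above, and `pivot_centroids` is a hypothetical helper written for this example.

```python
from collections import defaultdict

# Long-format rows shaped like the inner query: (centroid_id, feature, value).
long_rows = [
    (1, "avg_spend", 49.44),
    (1, "count_orders", 1.23),
    (1, "days_since_order", 102.87),
    (2, "avg_spend", 59.56),
    (2, "count_orders", 3.51),
    (2, "days_since_order", 87.9),
]


def pivot_centroids(rows):
    """Turn (centroid_id, feature, value) rows into one dict per centroid."""
    wide = defaultdict(dict)
    for centroid_id, feature, value in rows:
        wide[centroid_id][feature] = value
    # Sort by centroid id, mirroring ORDER BY centroid_id in the query.
    return [
        {"centroid": f"cluster {cid}", **features}
        for cid, features in sorted(wide.items())
    ]


wide_rows = pivot_centroids(long_rows)
for row in wide_rows:
    print(row)
```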


Whew! That's a lot of metrics and cluster info. How about we explain this to our colleagues using the magic of LLMs?

In [10]:
df = bq.query(query).to_dataframe()

# Build one descriptive line per cluster to feed into the prompt
cluster_info = []
for i, row in df.iterrows():
  cluster_info.append(
    f"{row['centroid']}, average spend ${row['average_spend']}, "
    f"count of orders per person {row['count_of_orders']}, "
    f"days since last order {row['days_since_order']}")

print("\n".join(cluster_info))

cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Explain with Vertex AI PaLM API

### First, we want to instantiate the large language model and create the prompt

<p align="center">
  <img alt="Conceptual Flow" src="slides/process4.png" width="100%">
</p>

In [11]:
textgen_model = vertexai.preview.language_models.TextGenerationModel.from_pretrained('text-bison@001')

In [12]:
preamble = "Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, and explain step by step; what would be the next marketing step for these clusters:"
display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))

## Prompt:

Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, and explain step by step; what would be the next marketing step for these clusters:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


### Now, we send our prompt to the Vertex AI PaLM API for some LLM magic

<p align="center">
  <img alt="Conceptual Flow" src="slides/process5.png" width="100%">
</p>

In [13]:
display(Markdown('## Response:'))
# Send prompt to LLM
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Response:

**Cluster 1: The Casual Shopper**

This cluster is made up of customers who spend an average of $49.44 per order and place an order every 102.87 days. They are typically young adults who are looking for affordable, everyday items. The brand persona for this cluster is "The Casual Shopper," and the title label is "Affordable Everyday Essentials."

The next marketing step for this cluster is to focus on promoting the brand's affordable prices and wide selection of everyday items. This can be done through a variety of channels, such as social media, email marketing, and paid advertising.

**Cluster 2: The Frequent Shopper**

This cluster is made up of customers who spend an average of $59.56 per order and place an order every 87.9 days. They are typically middle-aged adults who are looking for quality, name-brand items. The brand persona for this cluster is "The Frequent Shopper," and the title label is "Quality Name-Brand Items."

The next marketing step for this cluster is to focus on promoting the brand's quality, name-brand items and its convenient shopping experience. This can be done through a variety of channels, such as social media, email marketing, and in-store promotions.

**Cluster 3: The Luxury Shopper**

This cluster is made up of customers who spend an average of $251.34 per order and place an order every 205.85 days. They are typically affluent adults who are looking for high-end, luxury items. The brand persona for this cluster is "The Luxury Shopper," and the title label is "Luxury and Exclusivity."

The next marketing step for this cluster is to focus on promoting the brand's luxury and exclusivity. This can be done through a variety of channels, such as social media, email marketing, and in-store events.

**Cluster 4: The Loyal Shopper**

This cluster is made up of customers who spend an average of $57.47 per order and place an order every 354.32 days. They are typically long-time customers who are looking for quality, reliable products. The brand persona for this cluster is "The Loyal Shopper," and the title label is "Quality and Reliability."

The next marketing step for this cluster is to focus on promoting the brand's quality and reliability. This can be done through a variety of channels, such as social media, email marketing, and in-store promotions.

**Cluster 5: The Occasional Shopper**

This cluster is made up of customers who spend an average of $48.87 per order and place an order every 376.51 days. They are typically busy adults who are looking for convenience and value. The brand persona for this cluster is "The Occasional Shopper," and the title label is "Convenience and Value."

The next marketing step for this cluster is to focus on promoting the brand's convenience and value. This can be done through a variety of channels, such as social media, email marketing, and in-store promotions.

Voilà! We've now used k-means clustering to create groups of spenders and an LLM to explain their profiles.

Sometimes, though, you want a little bit [extra](https://cloud.google.com/blog/transform/prompt-debunking-five-generative-ai-misconceptions).

In [14]:
preamble = "Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:"

display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))
display(Markdown('## Response:'))
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Prompt:

Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Response:

**Cluster 1: The Bargain Hunter**

This customer is always looking for a good deal. They love the movie "The Princess Bride" because it's a classic that never gets old. They are drawn to the movie's humor, adventure, and romance. The customer's purchasing behavior reflects their love of a good deal. They typically make small purchases, but they do so frequently. This allows them to stock up on their favorite items when they are on sale.

**E-mail Headline:** Save big on your next purchase!

**Cluster 2: The Frequent Shopper**

This customer loves to shop. They love the movie "Mean Girls" because it's a hilarious and relatable comedy. They are drawn to the movie's characters, who are all so different but ultimately find common ground. The customer's purchasing behavior reflects their love of shopping. They make frequent, large purchases. They are always looking for new products to try and love to share their finds with their friends.

**E-mail Headline:** Shop till you drop!

**Cluster 3: The Luxury Seeker**

This customer loves to indulge. They love the movie "The Great Gatsby" because it's a lavish and glamorous tale. They are drawn to the movie's beautiful costumes, sets, and music. The customer's purchasing behavior reflects their love of luxury. They make large, infrequent purchases. They are willing to pay a premium for the best quality products.

**E-mail Headline:** Treat yourself to something special!

**Cluster 4: The Family Oriented**

This customer loves to spend time with their family. They love the movie "Home Alone" because it's a heartwarming and funny story about family. They are drawn to the movie's lovable characters and its message of togetherness. The customer's purchasing behavior reflects their love of family. They make purchases that benefit the entire family, such as food, clothing, and toys.

**E-mail Headline:** Make your family's day!

**Cluster 5: The Cautious Shopper**

This customer is careful with their money. They love the movie "The Shawshank Redemption" because it's a story about hope and redemption. They are drawn to the movie's inspiring characters and its message of perseverance. The customer's purchasing behavior reflects their cautious nature. They make small, thoughtful purchases. They are always looking for the best deals.

**E-mail Headline:** Save money without sacrificing quality!

In [15]:
preamble = "Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite food and a detailed description of an image containing their favorite food:"

display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))
display(Markdown('## Response:'))
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Prompt:

Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite food and a detailed description of an image containing their favorite food:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Response:

**Cluster 1: The Casual Diner**

This cluster is made up of people who enjoy casual dining experiences. They typically order one or two entrees per person, and they spend around $50 per order. They dine out about once every two weeks.

The Casual Diner's favorite food is pizza. They love the combination of crispy crust, gooey cheese, and flavorful toppings. They would love an image of a pizza with a perfectly golden crust, melted cheese oozing over the edges, and fresh toppings like pepperoni, mushrooms, and olives.

**Cluster 2: The Family Meal**

This cluster is made up of families with young children. They typically order a variety of dishes to share, and they spend around $60 per order. They dine out about once a week.

The Family Meal's favorite food is chicken nuggets. They love the way that chicken nuggets are a kid-friendly finger food that everyone can enjoy. They would love an image of a plate of chicken nuggets with a side of fries and ketchup.

**Cluster 3: The Luxury Seeker**

This cluster is made up of people who enjoy fine dining experiences. They typically order expensive entrees and cocktails, and they spend around $250 per order. They dine out about once a month.

The Luxury Seeker's favorite food is filet mignon. They love the rich flavor of the beef and the way it melts in their mouth. They would love an image of a perfectly cooked filet mignon, seared on the outside and pink on the inside, with a side of asparagus and mashed potatoes.

**Cluster 4: The Social Butterfly**

This cluster is made up of people who enjoy going out with friends. They typically order appetizers and shareable plates, and they spend around $60 per order. They dine out about once a week.

The Social Butterfly's favorite food is nachos. They love the way that nachos are a fun and easy way to share a meal with friends. They would love an image of a plate of nachos piled high with chips, cheese, guacamole, salsa, and sour cream.

**Cluster 5: The Homebody**

This cluster is made up of people who prefer to cook at home. They typically order simple dishes that are easy to make, and they spend around $50 per order. They dine out about once a month.

The Homebody's favorite food is spaghetti and meatballs. They love the comfort food feeling of a home-cooked meal. They would love an image of a bowl of spaghetti and meatballs with a side of garlic bread.

In [16]:
def persona_builder(preamble):
    # Prepend the fixed role instruction, then append one line per cluster
    pre_text = "Pretend you're a creative strategist, given the following clusters "
    prompt = pre_text + preamble + "\n" + "\n".join(cluster_info)
    result = textgen_model.predict(
        prompt,
        max_output_tokens=1024,
        temperature=0.35,
        top_p=0.8,
        top_k=40,
    )

    # Return plain text so the UI textbox shows the response body
    return prompt, result.text
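
`persona_builder` fixes three sampling controls: `temperature`, `top_k` and `top_p`. The toy sketch below shows one common way such filters are combined on a made-up score table. The ordering used here (temperature, then top-k, then top-p) and the function name are assumptions for illustration; the service's internal implementation is not shown in this notebook.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=40, top_p=0.8):
    """Return the renormalised token probabilities left after filtering."""
    # Temperature: divide scores before the softmax; < 1 sharpens, > 1 flattens.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda item: item[1], reverse=True,
    )
    # Top-k: keep only the k most probable tokens.
    ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    kept_total = sum(prob for _, prob in kept)
    return {tok: prob / kept_total for tok, prob in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}  # made-up scores
neutral = filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
sharp = filter_distribution(toy_logits, temperature=0.35, top_k=3, top_p=0.8)
print(neutral)
print(sharp)
```

With the lower temperature the top token alone already exceeds the `top_p` threshold, so only one candidate survives; at temperature 1.0 two candidates remain. This is why lower temperatures tend to give more repeatable output.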

#### Notes

A beautiful mountain range with snow-capped peaks and a river running through it. The sun is shining and the sky is blue.  Photorealistic

come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group

## Now we can stick it behind a UI

<p align="center">
  <img alt="Conceptual Flow" src="slides/process6.png" width="100%">
</p>

In [17]:
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown(
    """
    ## Persona Builder and Marketing Bot

    ### Enter a task...

    """)
    with gr.Row():
      with gr.Column():
        input_text = gr.Textbox(label="Task", placeholder="Pretend you're a creative strategist, given the following clusters: ")
        
        
    with gr.Row():
      generate = gr.Button("Generate Response")

    with gr.Row():
      label2 = gr.Textbox(label="Prompt")
    with gr.Row():
      label3 = gr.Textbox(label="Response generated by LLM")

    generate.click(fn=persona_builder, inputs=input_text, outputs=[label2, label3])
demo.launch(share=False, debug=False)

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
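
Before pointing a UI at a live model, it can help to exercise the callback offline. The sketch below injects a stub in place of the real model so the prompt assembly and the two return values can be checked without any API call. `make_persona_callback` and `EchoModel` are hypothetical names for this example, and the stub only mimics a response object with a `text` attribute.

```python
from types import SimpleNamespace


def make_persona_callback(model, cluster_lines):
    """Build a UI callback that returns (prompt, response text)."""
    def callback(task):
        prompt = (
            "Pretend you're a creative strategist, given the following clusters "
            + task + "\n" + "\n".join(cluster_lines)
        )
        response = model.predict(prompt, max_output_tokens=1024, temperature=0.35)
        return prompt, response.text
    return callback


class EchoModel:
    """Hypothetical stub: returns a canned reply instead of calling an API."""

    def predict(self, prompt, **params):
        return SimpleNamespace(text=f"(stub reply to {len(prompt)} characters)")


callback = make_persona_callback(EchoModel(), ["cluster 1, average spend $49.44"])
prompt, reply = callback("name each cluster")
print(prompt)
print(reply)
```

Both returned values are plain strings, which matches the two textboxes the click handler writes to.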


