# Using the Vertex AI PaLM API to explain BQML Clustering - Our Very Own Persona Builder

## Here's what we're going to build:

<p align="center">
  <img alt="Conceptual Flow" src="slides/process0.png" width="100%">
</p>

# Core code
Let's define some variables that will be used throughout this notebook.

These are the GCP project ID (`PROJECT_ID`), the region (`REGION`), the BigQuery dataset name (`DATASET`), and the model name (`BQML_MODEL`), which can be any name you prefer.
The dataset must already exist in the project named by `PROJECT_ID`, and you need permission to create and delete models in it.

## Setup

In [18]:
from typing import Union
import sys

import os
import io
import json
import base64
import requests
import concurrent.futures
import time

import numpy as np
import pandas as pd

import vertexai.preview.language_models
from google.cloud import aiplatform
from google.cloud import documentai
from google.cloud.documentai_v1 import Document
from google.cloud import storage
from google.cloud import bigquery

from IPython.display import display, Markdown, Latex


In [2]:
PROJECT_ID = 'mg-ce-demos'
REGION = 'us-central1'
DATASET = "bqml_demos" 
BQML_MODEL = "ecommerce_customer_segment_cluster5" 
EVAL = BQML_MODEL + "_eval"


In [3]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)
aiplatform.init(project = PROJECT_ID, location = REGION)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)

## Create a K-means model to cluster ecommerce data

### Let's look at the data first

<p align="center">
  <img alt="Conceptual Flow" src="slides/process1.png" width="100%">
</p>

In [4]:
query = """
SELECT
  user_id,
  order_id,
  sale_price,
  created_at as order_created_date
FROM `mg-ce-demos.thelook_ecommerce.order_items`
WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
"""
df = bq.query(query).to_dataframe()
df.head()


| | user_id | order_id | sale_price | order_created_date |
|---|---|---|---|---|
| 0 | 40085 | 50021 | 2.5 | 2023-04-07 08:09:34+00:00 |
| 1 | 90363 | 112961 | 2.5 | 2022-12-25 23:23:29+00:00 |
| 2 | 3922 | 4887 | 2.5 | 2022-09-29 13:47:16+00:00 |
| 3 | 56076 | 70152 | 2.5 | 2022-03-27 03:01:23+00:00 |
| 4 | 90828 | 113540 | 2.5 | 2023-03-18 08:30:11+00:00 |
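
The model query later in this notebook turns rows like these into three features: recency (`days_since_order`), frequency (`count_orders`) and monetary value (`avg_spend`). As a rough sketch of that aggregation in plain Python, the snippet below uses made-up rows and a hypothetical `rfm_features` helper. It groups by `user_id` alone and counts rows, which is a simplifying assumption and not a copy of the SQL.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order rows shaped like the preview above (values are made up).
rows = [
    {"user_id": 1, "order_id": 10, "sale_price": 20.0, "order_date": date(2023, 1, 5)},
    {"user_id": 1, "order_id": 11, "sale_price": 40.0, "order_date": date(2023, 3, 1)},
    {"user_id": 2, "order_id": 12, "sale_price": 100.0, "order_date": date(2023, 2, 9)},
]


def rfm_features(rows, today):
    """Aggregate order rows into recency, frequency and monetary values per user."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row["user_id"]].append(row)

    features = {}
    for user_id, items in by_user.items():
        last_order = max(item["order_date"] for item in items)
        features[user_id] = {
            "days_since_order": (today - last_order).days,  # recency
            "count_orders": len(items),  # frequency
            "avg_spend": sum(i["sale_price"] for i in items) / len(items),  # monetary
        }
    return features


features = rfm_features(rows, today=date(2023, 3, 11))
print(features)
```

Passing `today` explicitly keeps the sketch reproducible, whereas the SQL uses `CURRENT_DATE()`.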


### `CREATE MODEL` using `KMEANS`

Build the `CREATE MODEL` query, then submit it as a BigQuery job and wait for the job to complete.

<p align="center">
  <img alt="Conceptual Flow" src="slides/process2.png" width="100%">
</p>
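
For intuition about what a k-means model does, here is a minimal pure-Python sketch of the basic assign-and-average loop (Lloyd's algorithm). It is not BigQuery ML's implementation: the `kmeans` function, the toy points and the fixed starting centroids are all assumptions for illustration, and it skips the `KMEANS++` initialization used in the query below.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid, then re-average."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if not members:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
                continue
            new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
final_centroids, labels = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(final_centroids, labels)
```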

In [5]:
query = """
CREATE MODEL IF NOT EXISTS `{0}.{1}`
OPTIONS (
  MODEL_TYPE = "KMEANS",
  NUM_CLUSTERS = 5,
  KMEANS_INIT_METHOD = "KMEANS++",
  STANDARDIZE_FEATURES = TRUE )
AS (
SELECT * EXCEPT (user_id)
FROM (
  SELECT user_id,
    DATE_DIFF(CURRENT_DATE(), CAST(MAX(order_created_date) as DATE), day) AS days_since_order, -- RECENCY
    COUNT(order_id) AS count_orders, -- FREQUENCY
    AVG(sale_price) AS avg_spend -- MONETARY
  FROM (
    SELECT user_id,
      order_id,
      sale_price,
      created_at as order_created_date
    FROM `mg-ce-demos.thelook_ecommerce.order_items`
    WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
  )
  GROUP BY user_id, order_id
 )
)
""".format(DATASET, BQML_MODEL)


In [6]:
# Wrapper that runs a query/job with the BigQuery client and returns the result as a DataFrame
def run_bq_query(sql: str) -> pd.DataFrame:

    # Optional: uncomment to dry-run the query first and catch errors before executing
    #job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    #bq.query(sql, job_config=job_config)

    # Run the query with the default job configuration
    job_config = bigquery.QueryJobConfig()
    client_result = bq.query(sql, job_config=job_config)

    job_id = client_result.job_id

    # Block until the query/job finishes, then convert the result to a DataFrame
    df = client_result.result().to_arrow().to_pandas()
    print(f"Finished job_id: {job_id}")
    return df

In [7]:
%%timeit

run_bq_query(query)

Finished job_id: 116cf4b7-ce29-46e5-88e9-c7598e132f20
Finished job_id: 1af8e3b3-d370-4624-a677-93d545cccf5f
Finished job_id: 8a6b5f83-ba94-483a-9f9a-228f25063084
Finished job_id: 4007d91d-128a-4666-8a3f-d56dbe8ca4d5
Finished job_id: 792d3018-a50d-4432-8dd5-44c3c0dd2a84
Finished job_id: 1729fd34-9804-4e2a-81d9-f044e900a1fb
Finished job_id: ff462401-ebdb-403d-baf3-ee459931ec95
Finished job_id: d752dcd5-8384-410b-8f52-19f5ebc080c3
The slowest run took 9.24 times longer than the fastest. This could mean that an intermediate result is being cached.
1.85 s ± 2.23 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
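
One option in the `CREATE MODEL` statement above deserves a note: `STANDARDIZE_FEATURES = TRUE`. The three features live on very different scales (days, counts, dollars), so without rescaling the largest-valued feature would dominate the distance calculation. The sketch below shows z-score standardization on made-up values. The `standardize` helper and the use of the population standard deviation are assumptions, not a description of BigQuery ML's internals.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Made-up feature columns on very different scales
days_since_order = [10, 200, 390]
avg_spend = [40.0, 50.0, 60.0]

z_days = standardize(days_since_order)
z_spend = standardize(avg_spend)
print(z_days)
print(z_spend)
```

After standardization both columns span the same range, so each contributes comparably to the Euclidean distance.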


Let's take a look at the model's clustering performance using two metrics: the Davies-Bouldin index and the mean squared distance. Lower values of both indicate tighter, better-separated clusters.

In [8]:
query = """
SELECT *
FROM ML.EVALUATE(MODEL `{0}.{1}`)
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: d4307833-5fdc-4401-bf47-ab2f592531d2


| | davies_bouldin_index | mean_squared_distance |
|---|---|---|
| 0 | 1.023378 | 1.051785 |
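
To make the Davies-Bouldin index less abstract, the sketch below computes it by hand for two toy clusters using the textbook definition: for each cluster, take the worst ratio of combined within-cluster scatter to between-centroid distance, then average those worst cases. The `davies_bouldin` function and the points are illustrative assumptions, and BigQuery ML may differ in details such as feature standardization.

```python
import math


def davies_bouldin(clusters):
    """clusters: list of lists of points (tuples). Returns the Davies-Bouldin index."""
    centroids = [
        tuple(sum(dim) / len(pts) for dim in zip(*pts)) for pts in clusters
    ]
    # Scatter: average distance from each point to its own centroid
    scatter = [
        sum(math.dist(p, c) for p in pts) / len(pts)
        for pts, c in zip(clusters, centroids)
    ]
    worst = []
    for i in range(len(clusters)):
        ratios = [
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(len(clusters))
            if j != i
        ]
        worst.append(max(ratios))
    return sum(worst) / len(worst)


clusters = [
    [(0.0, 0.0), (0.0, 2.0)],  # centroid (0, 1), scatter 1
    [(4.0, 0.0), (4.0, 2.0)],  # centroid (4, 1), scatter 1
]
db = davies_bouldin(clusters)
print(db)  # (1 + 1) / 4 = 0.5
```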


### Now let's get the cluster (centroid) information

<p align="center">
  <img alt="Conceptual Flow" src="slides/process3.png" width="100%">
</p>

In [9]:
query = """
SELECT
  CONCAT('cluster ', CAST(centroid_id as STRING)) as centroid,
  avg_spend as average_spend,
  count_orders as count_of_orders,
  days_since_order
FROM (
  SELECT centroid_id, feature, ROUND(numerical_value, 2) as value
  FROM ML.CENTROIDS(MODEL `{0}.{1}`)
)
PIVOT (
  SUM(value)
  FOR feature IN ('avg_spend',  'count_orders', 'days_since_order')
)
ORDER BY centroid_id
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: 6f19ce0e-0df1-4571-882b-800cee8db7f7


| | centroid | average_spend | count_of_orders | days_since_order |
|---|---|---|---|---|
| 0 | cluster 1 | 49.44 | 1.23 | 102.87 |
| 1 | cluster 2 | 59.56 | 3.51 | 87.9 |
| 2 | cluster 3 | 251.34 | 1.14 | 205.85 |
| 3 | cluster 4 | 57.47 | 3.49 | 354.32 |
| 4 | cluster 5 | 48.87 | 1.22 | 376.5 |


Whew! That's a lot of metrics and cluster info. How about we explain this to our colleagues using the magic of LLMs?

In [10]:
df = bq.query(query).to_dataframe()

# Describe each cluster in one plain-English line for the prompt
cluster_info = []
for i, row in df.iterrows():
  cluster_info.append("{0}, average spend ${1}, count of orders per person {2}, days since last order {3}"
    .format(row["centroid"], row["average_spend"], row["count_of_orders"], row["days_since_order"]) )

print("\n".join(cluster_info))

cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5
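
The same row-to-sentence step can be tried without BigQuery. This small sketch uses two rows copied from the centroid table above and a hypothetical `describe_cluster` helper built on f-strings; it is an alternative phrasing of the loop, not the notebook's own code.

```python
# Two rows copied from the centroid table; the dict layout is an assumption
centroids = [
    {"centroid": "cluster 1", "average_spend": 49.44, "count_of_orders": 1.23, "days_since_order": 102.87},
    {"centroid": "cluster 2", "average_spend": 59.56, "count_of_orders": 3.51, "days_since_order": 87.9},
]


def describe_cluster(row):
    """Render one centroid row as a single line of text for the prompt."""
    return (
        f"{row['centroid']}, average spend ${row['average_spend']}, "
        f"count of orders per person {row['count_of_orders']}, "
        f"days since last order {row['days_since_order']}"
    )


lines = [describe_cluster(row) for row in centroids]
print("\n".join(lines))
```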


## Explain with Vertex AI PaLM API

### First, we want to instantiate the large language model and create the prompt

<p align="center">
  <img alt="Conceptual Flow" src="slides/process4.png" width="100%">
</p>

In [11]:
textgen_model = vertexai.preview.language_models.TextGenerationModel.from_pretrained('text-bison@001')

In [12]:
preamble = "Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, and explain step by step; what would be the next marketing step for these clusters:"
display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))

## Prompt:

Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, and explain step by step; what would be the next marketing step for these clusters:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


### Now, we send our prompt to the Vertex AI PaLM API for some LLM magic

<p align="center">
  <img alt="Conceptual Flow" src="slides/process5.png" width="100%">
</p>
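
The call in the next cell sets `temperature`, `top_p` and `top_k`. As a loose illustration of what such parameters typically do, the sketch below filters a toy next-token distribution: temperature sharpens or flattens the probabilities, `top_k` keeps only the k most likely tokens, and `top_p` keeps the smallest leading group whose probabilities add up to at least p. The `filter_candidates` function, the toy logits and the order of the steps are assumptions, and the actual service may apply them differently.

```python
import math


def filter_candidates(logits, temperature, top_k, top_p):
    """Return the renormalized token probabilities that survive top-k and top-p filtering."""
    # Temperature-scaled softmax (subtract the max for numerical stability)
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}

    # top-k: keep the k most probable tokens
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # top-p: keep tokens until their cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}
neutral = filter_candidates(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
cool = filter_candidates(toy_logits, temperature=0.35, top_k=3, top_p=0.8)
print(neutral)
print(cool)
```

With the lower temperature, the most likely token alone exceeds the `top_p` threshold, so fewer candidates remain and the output becomes more predictable.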

In [13]:
display(Markdown('## Response:'))
# Send prompt to LLM
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Response:

The data is about customer clusters. We can create brand personas for each cluster and give them creative title labels.

**Cluster 1: The Occasional Shopper**
This cluster is made up of customers who make small, infrequent purchases. They are typically looking for deals and are not loyal to any particular brand. The average spend per customer in this cluster is $49.44, and they make an average of 1.23 orders per person. The days since their last order is 102.87.

**Title label:** The Bargain Hunter

**Marketing step:** This cluster is a good target for promotional offers and coupons. You can also try to upsell them by offering them related products or services.

**Cluster 2: The Frequent Shopper**
This cluster is made up of customers who make regular, medium-sized purchases. They are typically loyal to a particular brand and are looking for quality products. The average spend per customer in this cluster is $59.56, and they make an average of 3.51 orders per person. The days since their last order is 87.9.

**Title label:** The Loyal Customer

**Marketing step:** This cluster is a good target for loyalty programs and special offers. You can also try to cross-sell them by offering them products or services that they might be interested in.

**Cluster 3: The Big Spender**
This cluster is made up of customers who make large, infrequent purchases. They are typically looking for high-end products and are not concerned about the price. The average spend per customer in this cluster is $251.34, and they make an average of 1.14 orders per person. The days since their last order is 205.85.

**Title label:** The Luxury Shopper

**Marketing step:** This cluster is a good target for high-end marketing campaigns and promotions. You can also try to upsell them by offering them related products or services.

**Cluster 4: The Loyal Spender**
This cluster is made up of customers who make regular, large purchases. They are typically loyal to a particular brand and are looking for quality products. The average spend per customer in this cluster is $57.47, and they make an average of 3.49 orders per person. The days since their last order is 354.32.

**Title label:** The Valued Customer

**Marketing step:** This cluster is a good target for loyalty programs and special offers. You can also try to cross-sell them by offering them products or services that they might be interested in.

**Cluster 5: The Inactive Shopper**
This cluster is made up of customers who have not made a purchase in a long time. They are typically not loyal to any particular brand and are looking for deals. The average spend per customer in this cluster is $48.87, and they make an average of 1.22 orders per person. The days since their last order is 376.5.

**Title label:** The Forgotten Customer

**Marketing step:** This cluster is a good target for reactivation campaigns. You can try to reach out to them with special offers or coupons, or you can try to upsell them by offering them related products or services.

Voilà! We've now used k-means clustering to create groups of spenders and an LLM to explain their profiles.

Sometimes, though, you want a little bit [extra](https://cloud.google.com/blog/transform/prompt-debunking-five-generative-ai-misconceptions).

In [19]:
preamble = "Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:"

display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))
display(Markdown('## Response:'))
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Prompt:

Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Response:

3 **Cluster 1: The Bargain Hunter**

This customer is always looking for a good deal. They love the movie "The Princess Bride" because it's a classic that never gets old. They are drawn to the movie's humor, romance, and adventure. The customer's purchasing behavior reflects their love of a good deal. They typically only make one purchase every few months, but they spend a lot of money when they do.

**E-mail Headline:** The Princess Bride: A Classic for Bargain Hunters

**Body:**

Do you love a good deal? Then you'll love our new collection of The Princess Bride merchandise! We have everything you need to celebrate your favorite movie, from t-shirts and mugs to DVDs and Blu-rays. And best of all, our prices are so low, you'll feel like you're getting away with something.

So what are you waiting for? Shop our collection of The Princess Bride merchandise today!

**Cluster 2: The Frequent Shopper**

This customer is a loyal shopper who loves to buy new things. They love the movie "Mean Girls" because it's a relatable comedy about high school life. They are drawn to the movie's humor, fashion, and drama. The customer's purchasing behavior reflects their love of shopping. They make multiple purchases every month, and they spend a lot of money on each purchase.

**E-mail Headline:** Mean Girls: The Movie for Fashionistas

**Body:**

Do you love fashion? Then you'll love the movie "Mean Girls"! This iconic comedy follows a group of high school girls who are all about the latest trends. From their clothes to their hairstyles, these girls know how to make a statement.

If you're looking for some new fashion inspiration, check out our collection of Mean Girls merchandise. We have everything you need to look like a total queen bee, from t-shirts and dresses to jewelry and handbags.

So what are you waiting for? Shop our collection of Mean Girls merchandise today!

**Cluster 3: The Big Spender**

This customer is a high-end shopper who loves to splurge. They love the movie "The Wolf of Wall Street" because it's a rags-to-riches story about a successful businessman. They are drawn to the movie's fast-paced action, lavish lifestyle, and big dreams. The customer's purchasing behavior reflects their love of luxury. They make large purchases every few months, and they don't mind spending a lot of money.

**E-mail Headline:** The Wolf of Wall Street: The Movie for Big Spenders

**Body:**

Do you love luxury? Then you'll love the movie "The Wolf of Wall Street"! This biographical drama follows the life of Jordan Belfort, a successful stockbroker who lived a life of excess. From his lavish parties to his private jets, Jordan Belfort had it all.

If you're looking for some inspiration to live your best life, check out our collection of The Wolf of Wall Street merchandise. We have everything you need to look and feel like a millionaire, from t-shirts and hats to watches and cufflinks.

So what are you waiting for? Shop our collection of The Wolf of Wall Street merchandise today!

**Cluster 4: The Casual Shopper**

This customer is a casual shopper who likes to buy things on a whim. They love the movie "The Big Lebowski" because it's a quirky comedy about a slacker who gets caught up in a series of misadventures. They are drawn to the movie's humor, laid-back attitude, and offbeat characters. The customer's purchasing behavior reflects their casual approach to shopping. They make small purchases every few weeks, and they don't mind spending a lot of money.

**E-mail Headline:** The Big Lebowski: The Movie for Casual Shoppers

**Body:**

Do you like to shop on a whim? Then you'll love the movie "The Big Lebowski"! This cult classic follows the life of The Dude, a lazy slacker who gets caught up in a series of hilarious misadventures. From his bowling buddies to his rug that really tied the room together, The Dude is a lovable character who will have you laughing out loud.

If you're looking for a movie that will make you forget about your troubles, check out "The Big Lebowski" today!

**Cluster 5: The Loyal Customer**

This customer is a loyal shopper who loves to buy from the same brands. They love the movie "Casablanca" because it's a timeless classic about love and sacrifice. They are drawn to the movie's beautiful cinematography, romantic story, and unforgettable characters. The customer's purchasing behavior reflects their loyalty

In [15]:
preamble = "Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite food and a detailed description of an image containing their favorite food:"

display(Markdown('## Prompt:'))
print(preamble)
print('\n')
print(str.join("\n", cluster_info))
display(Markdown('## Response:'))
display(Markdown(str(textgen_model.predict(
   preamble + "\n" + str.join("\n", cluster_info),
    max_output_tokens=1024,
    temperature=0.35,
    top_p=0.8,
    top_k=40,
))))

## Prompt:

Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite food and a detailed description of an image containing their favorite food:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Response:

2 **Cluster 1: The Casual Diner**

This cluster is made up of people who enjoy casual dining experiences. They typically order one or two dishes per person and spend around $50 per order. They dine out about once every two weeks.

The Casual Diner's favorite food is pizza. They love the combination of crispy crust, gooey cheese, and flavorful toppings. They would love an image of a pizza with a perfectly charred crust, melted cheese, and fresh vegetables.

**Cluster 2: The Family Meal**

This cluster is made up of families with young children. They typically order multiple dishes per person and spend around $60 per order. They dine out about once a week.

The Family Meal's favorite food is chicken nuggets. They love the way the nuggets are crispy on the outside and tender on the inside. They would love an image of a plate of chicken nuggets with a side of fries and ketchup.

**Cluster 3: The Luxury Seeker**

This cluster is made up of people who enjoy fine dining experiences. They typically order multiple courses per person and spend around $250 per order. They dine out about once a month.

The Luxury Seeker's favorite food is filet mignon. They love the rich flavor of the beef and the way it melts in their mouth. They would love an image of a perfectly cooked filet mignon with a side of asparagus and mashed potatoes.

**Cluster 4: The Social Butterfly**

This cluster is made up of people who enjoy going out with friends. They typically order multiple dishes per person and spend around $60 per order. They dine out about once a week.

The Social Butterfly's favorite food is sushi. They love the variety of flavors and textures that sushi offers. They would love an image of a platter of sushi with sashimi, nigiri, and rolls.

**Cluster 5: The Homebody**

This cluster is made up of people who enjoy cooking at home. They typically order one or two dishes per person and spend around $50 per order. They dine out about once a month.

The Homebody's favorite food is pasta. They love the way pasta can be customized to their liking. They would love an image of a bowl of pasta with a flavorful sauce and fresh vegetables.

In [16]:
def persona_builder(preamble):
    # Prepend the fixed role instruction, then append the cluster descriptions
    pre_text = "Pretend you're a creative strategist, given the following clusters "
    prompt = pre_text + preamble + "\n" + "\n".join(cluster_info)
    result = textgen_model.predict(
        prompt,
        max_output_tokens=1024,
        temperature=0.35,
        top_p=0.8,
        top_k=40,
    )

    # Return plain strings so both values can be shown in text boxes
    return prompt, result.text
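
The prompt assembly in `persona_builder` can be exercised offline by swapping the model for a stub. In the sketch below, `FakeModel`, `FakeResponse` and `build_persona` are hypothetical names introduced only for this test; the stub reports how many lines it received instead of calling the API.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    text: str


class FakeModel:
    """Hypothetical stand-in for the text model; it reports the prompt's line count."""

    def predict(self, prompt, **params):
        return FakeResponse(text=f"received {len(prompt.splitlines())} lines")


def build_persona(task, cluster_lines, model):
    """Assemble the prompt the same way as persona_builder, with the model injected."""
    pre_text = "Pretend you're a creative strategist, given the following clusters "
    prompt = pre_text + task + "\n" + "\n".join(cluster_lines)
    response = model.predict(
        prompt, max_output_tokens=1024, temperature=0.35, top_p=0.8, top_k=40
    )
    return prompt, response.text


cluster_lines = [
    "cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87",
    "cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9",
]
prompt, text = build_persona("suggest a persona for each:", cluster_lines, FakeModel())
print(prompt)
print(text)
```

Passing the model as an argument, instead of reading a global, is a design choice that makes the function easy to test.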

#### Notes

A beautiful mountain range with snow-capped peaks and a river running through it. The sun is shining and the sky is blue.  Photorealistic

come up with creative brand persona for each that includes the detail of their favorite movie, a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group

## Now we can stick it behind a UI

<p align="center">
  <img alt="Conceptual Flow" src="slides/process6.png" width="100%">
</p>

In [17]:
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown(
    """
    ## Persona Builder and Marketing Bot

    ### Enter a task...

    """)
    with gr.Row():
        with gr.Column():
            input_text = gr.Textbox(label="Task", placeholder="Pretend you're a creative strategist, given the following clusters: ")

    with gr.Row():
        generate = gr.Button("Generate Response")

    with gr.Row():
        label2 = gr.Textbox(label="Prompt")
    with gr.Row():
        label3 = gr.Textbox(label="Response generated by LLM")

    # Wire the button: the task text goes in, the prompt and the response come out
    generate.click(persona_builder, input_text, [label2, label3])
demo.launch(share=False, debug=False)


Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


