# Using the Vertex AI PaLM API to explain BQML Clustering - Our Very Own Persona Builder

## Here's what we're going to build:

<p align="center">
  <img alt="Conceptual Flow" src="slides/process0.png" width="100%">
</p>

# Core code
Let's define some variables that will be used throughout this notebook.

These are the GCP project ID `PROJECT_ID`, the region `REGION`, the dataset name `DATASET`, and the model name `BQML_MODEL`, which can be any name you prefer.
The dataset must exist in the `PROJECT_ID` project, and you need permission to create and delete models in it.

## Setup

In [23]:
from typing import Union
import sys

import os
import io
import json
import base64
import requests
import concurrent.futures
import time

import numpy as np
import pandas as pd

import vertexai
from vertexai.language_models import TextGenerationModel, TextEmbeddingModel
from vertexai.preview.language_models import TextGenerationModel as TextGenerationModel_preview
from vertexai.preview.generative_models import GenerativeModel, Part, Image

from google.cloud import aiplatform
from google.cloud import documentai
from google.cloud.documentai_v1 import Document
from google.cloud import storage
from google.cloud import bigquery

from IPython.display import display, Markdown, Latex
import markdown

print("Vertex AI version: " + str(aiplatform.__version__))

Vertex AI version: 1.41.0


In [2]:
PROJECT_ID = 'mg-ce-demos'
REGION = 'us-central1'
DATASET = "bqml_demos" 
BQML_MODEL = "ecommerce_customer_segment_cluster5" 
EVAL = BQML_MODEL + "_eval"
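
Because the dataset and model names are later substituted into SQL strings with `.format()`, it can help to validate them first. The following is a minimal sketch, not part of the original notebook: `model_ref` is a hypothetical helper, and the identifier pattern is a deliberately conservative assumption rather than BigQuery's exact naming rules.

```python
import re

# Conservative identifier pattern: an assumption for this sketch,
# not BigQuery's exact naming rules.
_SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def model_ref(dataset: str, model: str) -> str:
    """Return a backtick-quoted `dataset.model` reference for use in SQL."""
    for name in (dataset, model):
        if not _SAFE_NAME.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"`{dataset}.{model}`"


print(model_ref("bqml_demos", "ecommerce_customer_segment_cluster5"))
```

A name containing a backtick, space, or semicolon raises `ValueError` instead of reaching the query text.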


In [3]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)
aiplatform.init(project = PROJECT_ID, location = REGION)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)

## Create a K-means model to cluster ecommerce data

### Let's look at the data first

<p align="center">
  <img alt="Conceptual Flow" src="slides/process1.png" width="100%">
</p>

In [4]:
query = """
SELECT
  user_id,
  order_id,
  sale_price,
  created_at as order_created_date
FROM `mg-ce-demos.thelook_ecommerce.order_items`
WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
"""
df = bq.query(query).to_dataframe()
df.head()


| | user_id | order_id | sale_price | order_created_date |
|---|---|---|---|---|
| 0 | 40085 | 50021 | 2.5 | 2023-04-07 08:09:34+00:00 |
| 1 | 90363 | 112961 | 2.5 | 2022-12-25 23:23:29+00:00 |
| 2 | 3922 | 4887 | 2.5 | 2022-09-29 13:47:16+00:00 |
| 3 | 56076 | 70152 | 2.5 | 2022-03-27 03:01:23+00:00 |
| 4 | 90828 | 113540 | 2.5 | 2023-03-18 08:30:11+00:00 |


### `CREATE MODEL` using `KMEANS`

Build the `CREATE MODEL` query, then start the model creation job and wait for it to complete.
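
The notebook's `run_bq_query` helper waits by calling the blocking `result()` method. If you prefer an explicit polling loop with a timeout, the idea can be sketched as below. `FakeJob` and `wait_for_job` are hypothetical names used only for illustration; the sketch assumes the job object exposes a `done()` method, as BigQuery's `QueryJob` does.

```python
import time


class FakeJob:
    """Stand-in for a BigQuery job; reports done after a few polls."""

    def __init__(self, polls_until_done: int):
        self._remaining = polls_until_done

    def done(self) -> bool:
        self._remaining -= 1
        return self._remaining <= 0


def wait_for_job(job, poll_seconds: float = 1.0, timeout_seconds: float = 600.0) -> int:
    """Poll job.done() until it returns True; return the number of polls made."""
    deadline = time.monotonic() + timeout_seconds
    polls = 0
    while True:
        polls += 1
        if job.done():
            return polls
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job not finished after {timeout_seconds} s")
        time.sleep(poll_seconds)


print(wait_for_job(FakeJob(polls_until_done=3), poll_seconds=0.01))
```

The timeout keeps a stuck job from blocking the notebook indefinitely, which the plain `result()` call does not do unless you pass it a timeout.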

<p align="center">
  <img alt="Conceptual Flow" src="slides/process2.png" width="100%">
</p>

#### model code

In [5]:
query = """
CREATE MODEL IF NOT EXISTS `{0}.{1}`
OPTIONS (
  MODEL_TYPE = "KMEANS",
  NUM_CLUSTERS = 5,
  KMEANS_INIT_METHOD = "KMEANS++",
  STANDARDIZE_FEATURES = TRUE )
AS (
SELECT * EXCEPT (user_id)
FROM (
  SELECT user_id,
    DATE_DIFF(CURRENT_DATE(), CAST(MAX(order_created_date) as DATE), day) AS days_since_order, -- RECENCY
    COUNT(order_id) AS count_orders, -- FREQUENCY
    AVG(sale_price) AS avg_spend -- MONETARY
  FROM (
    SELECT user_id,
      order_id,
      sale_price,
      created_at as order_created_date
    FROM `mg-ce-demos.thelook_ecommerce.order_items`
    WHERE created_at BETWEEN CAST('2022-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2024-01-01 00:00:00' AS TIMESTAMP)
  )
  GROUP BY user_id, order_id
 )
)
""".format(DATASET, BQML_MODEL)

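To make the recency, frequency, and monetary features in the query above concrete, here is a small pure-Python sketch on invented data. It is an illustration under assumptions, not the notebook's method: it groups by user only (the query above also groups by `order_id`), uses a fixed `today` instead of `CURRENT_DATE()`, and standardizes with the population standard deviation, which may differ from what `STANDARDIZE_FEATURES` does internally.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev

# Toy order items: (user_id, order_id, sale_price, order_date). Invented data.
rows = [
    (1, 100, 20.0, date(2023, 1, 10)),
    (1, 101, 40.0, date(2023, 3, 1)),
    (2, 200, 250.0, date(2022, 6, 15)),
]


def rfm_features(rows, today):
    """Return {user_id: (days_since_order, count_orders, avg_spend)}."""
    by_user = defaultdict(list)
    for user_id, order_id, price, order_date in rows:
        by_user[user_id].append((order_id, price, order_date))
    features = {}
    for user_id, items in by_user.items():
        recency = (today - max(d for _, _, d in items)).days
        frequency = len(items)  # mirrors COUNT(order_id): counts rows
        monetary = mean(p for _, p, _ in items)
        features[user_id] = (recency, frequency, monetary)
    return features


def standardize(column):
    """Z-score a list of numbers (population standard deviation)."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma if sigma else 0.0 for x in column]


feats = rfm_features(rows, today=date(2024, 1, 1))
print(feats)
print(standardize([f[2] for f in feats.values()]))
```

Standardizing matters for k-means because the three features live on very different scales (days, counts, dollars), and unscaled distances would be dominated by the largest one.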

In [6]:
# Wrapper that runs a query/job with the BigQuery client and returns the result as a DataFrame
def run_bq_query(sql: str) -> pd.DataFrame:

    # Optional: uncomment to dry-run the query first and catch errors before executing
    #job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    #bq.query(sql, job_config=job_config)

    # Run the query with the default job configuration
    job_config = bigquery.QueryJobConfig()
    client_result = bq.query(sql, job_config=job_config)

    job_id = client_result.job_id

    # Wait for the query/job to finish, then get and return the DataFrame
    df = client_result.result().to_arrow().to_pandas()
    print(f"Finished job_id: {job_id}")
    return df

In [7]:
%%timeit

run_bq_query(query)

Finished job_id: d8286e3b-9c0f-4333-b62d-3f3c1a255be9
Finished job_id: 11bc6c95-c031-4418-916d-8ddae9727c88
Finished job_id: c96b1d75-14a0-42d6-ba23-2bcfe683396d
Finished job_id: 3ff7479b-3c91-4fa8-a758-44c435de1523
Finished job_id: 6b70b8d1-36fa-4f0e-b2a5-b592c73509cd
Finished job_id: d534964e-e2f7-4df5-9f30-6393862d1e4d
Finished job_id: c559e7d4-7953-4f61-92ca-71971960f7c6
Finished job_id: 543ee217-fd26-480c-891c-db1ee27de466
The slowest run took 4.65 times longer than the fastest. This could mean that an intermediate result is being cached.
1.9 s ± 1.53 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


Let's take a look at the model's clustering performance using two metrics: the Davies–Bouldin index and the mean squared distance. Lower values are better for both.

In [8]:
query = """
SELECT *
FROM ML.EVALUATE(MODEL `{0}.{1}`)
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: ca69d7e2-a7c1-49b7-8b3d-b5a0ccc9ee5b


| | davies_bouldin_index | mean_squared_distance |
|---|---|---|
| 0 | 1.023378 | 1.051785 |
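
To give a feel for what these two numbers measure, the sketch below computes both on a tiny invented dataset using the standard textbook definitions. It assumes these definitions match what `ML.EVALUATE` reports; the function names are illustrative and not part of any library.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def davies_bouldin(clusters):
    """clusters: list of lists of points. Lower values mean better separation."""
    cents = [centroid(c) for c in clusters]
    # Scatter: average distance from each point to its own centroid.
    scatter = [
        sum(math.dist(p, cents[i]) for p in c) / len(c)
        for i, c in enumerate(clusters)
    ]
    worst = []
    for i in range(len(clusters)):
        # Compare cluster i with every other cluster and keep the worst ratio.
        ratios = [
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(len(clusters))
            if j != i
        ]
        worst.append(max(ratios))
    return sum(worst) / len(worst)


def mean_squared_distance(clusters):
    """Average squared distance from every point to its own centroid."""
    total, n = 0.0, 0
    for c in clusters:
        cen = centroid(c)
        total += sum(math.dist(p, cen) ** 2 for p in c)
        n += len(c)
    return total / n


toy = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]
print(davies_bouldin(toy), mean_squared_distance(toy))
```

In the toy data each cluster has a scatter of 1 and the centroids are 10 apart, so the clusters are tight and well separated and the index is small.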


### Now let's get the cluster (centroid) information

<p align="center">
  <img alt="Conceptual Flow" src="slides/process3.png" width="100%">
</p>

In [9]:
query = """
SELECT
  CONCAT('cluster ', CAST(centroid_id as STRING)) as centroid,
  avg_spend as average_spend,
  count_orders as count_of_orders,
  days_since_order
FROM (
  SELECT centroid_id, feature, ROUND(numerical_value, 2) as value
  FROM ML.CENTROIDS(MODEL `{0}.{1}`)
)
PIVOT (
  SUM(value)
  FOR feature IN ('avg_spend',  'count_orders', 'days_since_order')
)
ORDER BY centroid_id
""".format(DATASET, BQML_MODEL)

run_bq_query(query)

Finished job_id: 01b0ad9b-fad2-4ef4-a056-ca941a93751f


| | centroid | average_spend | count_of_orders | days_since_order |
|---|---|---|---|---|
| 0 | cluster 1 | 49.44 | 1.23 | 102.87 |
| 1 | cluster 2 | 59.56 | 3.51 | 87.9 |
| 2 | cluster 3 | 251.34 | 1.14 | 205.85 |
| 3 | cluster 4 | 57.47 | 3.49 | 354.32 |
| 4 | cluster 5 | 48.87 | 1.22 | 376.5 |
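
`ML.CENTROIDS` returns one row per centroid and feature, and the `PIVOT` clause turns those rows into one row per centroid. The same reshaping can be sketched in plain Python, here with the first two clusters from the result above. The `pivot` helper is hypothetical, and it assumes exactly one value per (centroid, feature) pair, in which case overwriting gives the same result as `SUM`.

```python
from collections import defaultdict

# Long-format rows as (centroid_id, feature, value).
long_rows = [
    (1, "avg_spend", 49.44), (1, "count_orders", 1.23), (1, "days_since_order", 102.87),
    (2, "avg_spend", 59.56), (2, "count_orders", 3.51), (2, "days_since_order", 87.9),
]


def pivot(rows):
    """Turn (key, feature, value) rows into {key: {feature: value}}."""
    wide = defaultdict(dict)
    for key, feature, value in rows:
        wide[key][feature] = value
    return dict(wide)


for cid, feats in sorted(pivot(long_rows).items()):
    print(f"cluster {cid}", feats)
```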


Whew! That's a lot of metrics and cluster info. How about we explain this to our colleagues using the magic of LLMs?

In [10]:
df = bq.query(query).to_dataframe()

# Build one plain-text line per cluster for the prompt
cluster_info = []
for i, row in df.iterrows():
  cluster_info.append(
    f"{row['centroid']}, average spend ${row['average_spend']}, "
    f"count of orders per person {row['count_of_orders']}, "
    f"days since last order {row['days_since_order']}")

print("\n".join(cluster_info))

cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Explain with Vertex AI PaLM API

### First, we want to instantiate the large language model and create the prompt

<p align="center">
  <img alt="Conceptual Flow" src="slides/process4.png" width="100%">
</p>

In [11]:
textgen_model = TextGenerationModel.from_pretrained('text-bison')

In [12]:
def gemini_generate(prompt):
    result = []
    model = GenerativeModel("gemini-pro")
    responses = model.generate_content(
        [prompt],
        generation_config={
            "max_output_tokens": 2048,
            "temperature": 0.5,
            "top_p": 1
        },
        stream=True,
    )
  
    for response in responses:
        result.append(response.candidates[0].content.parts[0].text)

    result = ''.join(result)
    return result

In [30]:
preamble = """Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, 
and explain step by step; what would be the next marketing step for these clusters:

"""
prompt = preamble + "\n" + str.join("\n", cluster_info)
display(Markdown('## Prompt:'))
print(prompt)
#print(preamble)
#print(str.join("\n", cluster_info))

## Prompt:

Pretend you're a creative strategist, given the following clusters come up with creative brand persona and title labels for each of these clusters, 
and explain step by step; what would be the next marketing step for these clusters:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


### Now, we send our prompt to Google GenAI API for some LLM magic

<p align="center">
  <img alt="Conceptual Flow" src="slides/process5.png" width="100%">
</p>

In [31]:
# Send prompt to LLM
response = textgen_model.predict(
    prompt,
    max_output_tokens=1024,
    temperature=0.4,
    top_p=0.8,
    top_k=40,
)

display(Markdown('## Response:'))
display(Markdown(response.text))

## Response:

 **Cluster 1: "The Occasional Shoppers"**

* Brand persona: These customers are typically busy individuals who don't have a lot of time to shop. They're looking for convenience and value, and they're not particularly brand-loyal.
* Title label: "The Occasional Shoppers"
* Next marketing step: Offer these customers discounts and promotions to encourage them to shop more frequently. You could also try sending them personalized emails with product recommendations based on their past purchases.

**Cluster 2: "The Regulars"**

* Brand persona: These customers are loyal to your brand and they shop with you on a regular basis. They're typically familiar with your products and they're happy with the quality and value that you offer.
* Title label: "The Regulars"
* Next marketing step: Reward these customers for their loyalty with exclusive discounts and offers. You could also try sending them personalized emails with thank-you notes and special promotions.

**Cluster 3: "The Big Spenders"**

* Brand persona: These customers are high-value customers who spend a lot of money with your brand. They're typically affluent and they're looking for luxury and exclusivity.
* Title label: "The Big Spenders"
* Next marketing step: Offer these customers VIP treatment, such as early access to new products, exclusive discounts, and personalized customer service. You could also try sending them handwritten thank-you notes and gifts.

**Cluster 4: "The Lapsed Customers"**

* Brand persona: These customers used to shop with you regularly, but they haven't made a purchase in a while. They may have been lost to a competitor or they may have simply lost interest in your brand.
* Title label: "The Lapsed Customers"
* Next marketing step: Try to win back these customers by sending them personalized emails with special offers and discounts. You could also try running a social media campaign to remind them of your brand.

**Cluster 5: "The One-Time Buyers"**

* Brand persona: These customers have only made one purchase from your brand. They may have been satisfied with their purchase, but they haven't been motivated to shop with you again.
* Title label: "The One-Time Buyers"
* Next marketing step: Try to convert these customers into repeat customers by sending them personalized emails with product recommendations and special offers. You could also try running a social media campaign to remind them of your brand.

In [32]:
# Send prompt to LLM
response_gem = gemini_generate(prompt)
display(Markdown('## Response:'))
display(Markdown(response_gem))

## Response:

**Cluster 1: "Occasional Shoppers"**

* Title Label: "Budget-Conscious Explorers"

**Cluster 2: "Regular Shoppers"**

* Title Label: "Value-Seeking Repeaters"

**Cluster 3: "Loyal Customers"**

* Title Label: "Exclusive Connoisseurs"

**Cluster 4: "Infrequent Buyers"**

* Title Label: "Selective Spenders"

**Cluster 5: "Dormant Customers"**

* Title Label: "Forgotten Loyalists"

**Next Marketing Step for Each Cluster:**

**Cluster 1: Occasional Shoppers**

* Focus on personalized promotions and discounts to encourage repeat purchases.
* Offer incentives for referrals to increase customer base.

**Cluster 2: Regular Shoppers**

* Reward loyalty with personalized offers and exclusive perks.
* Upsell complementary products and services to increase average spend.

**Cluster 3: Loyal Customers**

* Maintain exceptional customer service to foster brand advocacy.
* Offer exclusive access to new products and experiences to enhance loyalty.

**Cluster 4: Infrequent Buyers**

* Send targeted campaigns to re-engage customers and remind them of the brand.
* Offer limited-time promotions or incentives to encourage purchases.

**Cluster 5: Dormant Customers**

* Conduct market research to understand reasons for inactivity.
* Send personalized re-engagement emails or offer incentives to reactivate customers.

Voilà! We've now used k-means clustering to group customers by spending behavior, and an LLM to explain each group's profile.

Sometimes, though, you want a little bit [extra](https://cloud.google.com/blog/transform/prompt-debunking-five-generative-ai-misconceptions).

In [33]:
preamble = """Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, 
a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:

"""
prompt = preamble + "\n" + str.join("\n", cluster_info)

display(Markdown('## Prompt:'))
print(prompt)

response = textgen_model.predict(
   prompt,
    max_output_tokens=1024,
    temperature=0.4,
    top_p=0.8,
    top_k=40,
)

display(Markdown('## Response:'))
display(Markdown(response.text))

## Prompt:

Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, 
a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Response:

 **Cluster 1: The Occasional Treaters**

* **Favorite movie:** "The Princess Bride"
* **Purchasing behavior:** They make infrequent purchases, but when they do, they tend to splurge on a few high-quality items. They're looking for something special that will make them feel good.
* **E-mail headline:** "Treat Yourself! Indulge in Your Favorite Delights."

**Cluster 2: The Regular Indulgers**

* **Favorite movie:** "When Harry Met Sally"
* **Purchasing behavior:** They make regular purchases, and they're always looking for new and exciting products to try. They're not afraid to experiment, and they're always up for a good deal.
* **E-mail headline:** "New Arrivals! Discover the Latest and Greatest."

**Cluster 3: The Connoisseurs**

* **Favorite movie:** "The Godfather"
* **Purchasing behavior:** They make infrequent purchases, but they're willing to pay top dollar for the best of the best. They're looking for something truly exceptional that will impress their friends and family.
* **E-mail headline:** "Exclusive Offer! Indulge in the Finest of the Finest."

**Cluster 4: The Seasonal Shoppers**

* **Favorite movie:** "Home Alone"
* **Purchasing behavior:** They make purchases in bursts, usually around the holidays or other special occasions. They're looking for something that will make their celebrations extra special.
* **E-mail headline:** "Get Ready for the Holidays! Stock Up on Your Favorites."

**Cluster 5: The Minimalists**

* **Favorite movie:** "The Little Prince"
* **Purchasing behavior:** They make infrequent purchases, and they're very selective about what they buy. They're looking for something that's simple, functional, and affordable.
* **E-mail headline:** "Less is More! Discover Our Minimalist Collection."

In [34]:
response_gem = gemini_generate(prompt)

display(Markdown('## Prompt:'))
print(prompt)

display(Markdown('## Response:'))
display(Markdown(response_gem))

## Prompt:

Pretend you're a creative strategist, analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite movie, 
a summary of how this relates to their purchasing behavior, and a witty e-mail headline for marketing campaign targeted to their group:


cluster 1, average spend $49.44, count of orders per person 1.23, days since last order 102.87
cluster 2, average spend $59.56, count of orders per person 3.51, days since last order 87.9
cluster 3, average spend $251.34, count of orders per person 1.14, days since last order 205.85
cluster 4, average spend $57.47, count of orders per person 3.49, days since last order 354.32
cluster 5, average spend $48.87, count of orders per person 1.22, days since last order 376.5


## Response:

**Cluster 1: The Occasional Shoppers**

* **Brand Persona:** Emily, a busy professional who values convenience and efficiency. She's not a frequent shopper but when she does, she prefers to make a single purchase of essential items.
* **Favorite Movie:** "The Matrix" (1999) - Represents Emily's desire for a streamlined and efficient shopping experience, free from distractions.
* **Purchasing Behavior:** Purchases small quantities of necessary items, prioritizes time-saving options like online ordering or quick in-store pickups.
* **Witty Email Headline:** "Simplify Your Shopping: The Matrix of Convenience"

**Cluster 2: The Regulars**

* **Brand Persona:** John, a family man who enjoys browsing and comparing products before making a purchase. He's a loyal customer who appreciates personalized recommendations and a wide selection.
* **Favorite Movie:** "The Godfather" (1972) - Conveys John's sense of tradition, loyalty, and appreciation for quality products.
* **Purchasing Behavior:** Makes multiple purchases of various items, researches products thoroughly, and values customer service and loyalty programs.
* **Witty Email Headline:** "Join Our Family of Shoppers: The Godfather of Retail"

**Cluster 3: The Luxe Shoppers**

* **Brand Persona:** Sophia, a sophisticated and discerning consumer who seeks out exclusive and high-end products. She's willing to pay a premium for quality and exclusivity.
* **Favorite Movie:** "The Devil Wears Prada" (2006) - Reflects Sophia's aspiration for style, luxury, and a touch of glamour in her purchases.
* **Purchasing Behavior:** Makes infrequent but significant purchases, focusing on designer brands, limited editions, and personalized experiences.
* **Witty Email Headline:** "Indulge in the Devilish Delights of Luxury: The Prada of Retail"

**Cluster 4: The Occasional Treaters**

* **Brand Persona:** David, a sporadic shopper who enjoys indulging in occasional treats or gifts for himself or others. He's not a regular customer but appreciates unique and memorable experiences.
* **Favorite Movie:** "The Shawshank Redemption" (1994) - Evokes David's desire to escape the mundane and create special moments through his purchases.
* **Purchasing Behavior:** Makes infrequent purchases of non-essential items, often as gifts or for special occasions.
* **Witty Email Headline:** "Escape the Ordinary: The Shawshank Redemption of Retail"

**Cluster 5: The Lapsed Shoppers**

* **Brand Persona:** Sarah, a former customer who has stopped shopping due to negative experiences or a lack of engagement. She's open to reconnecting but needs to be convinced of the value of the brand.
* **Favorite Movie:** "The Sixth Sense" (1999) - Represents Sarah's need to see the potential and value in the brand again.
* **Purchasing Behavior:** Has made few or no recent purchases, may have had negative experiences in the past.
* **Witty Email Headline:** "Unlock the Sixth Sense: Rediscover the Hidden Value of Our Brand"

In [36]:
def persona_builder(preamble):
    # Prepend the fixed role instruction, then append one line per cluster
    pre_text = """Pretend you're a creative strategist, given the following clusters """
    prompt = pre_text + preamble + "\n\n" + "\n".join(cluster_info)

    # Alternative: use the PaLM text model instead of Gemini
    #result = textgen_model.predict(
    #    prompt,
    #    max_output_tokens=1024,
    #    temperature=0.4,
    #    top_p=0.8,
    #    top_k=40,
    #)

    response_gem = gemini_generate(prompt)

    return prompt, response_gem
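
Since `persona_builder` reads the global `cluster_info` and calls the model directly, it is hard to try out without Vertex AI access. One possible variation, sketched below under assumptions, passes the cluster lines and the generation function in as arguments. `build_persona_prompt` and `echo_model` are hypothetical names, and `echo_model` is only a stub standing in for a real LLM call.

```python
from typing import Callable, List, Tuple

PRE_TEXT = "Pretend you're a creative strategist, given the following clusters "


def build_persona_prompt(
    preamble: str,
    cluster_lines: List[str],
    generate: Callable[[str], str],
) -> Tuple[str, str]:
    """Assemble the prompt, pass it to `generate`, return (prompt, response)."""
    prompt = PRE_TEXT + preamble + "\n\n" + "\n".join(cluster_lines)
    return prompt, generate(prompt)


def echo_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: reports the prompt length."""
    return f"(stub) received {len(prompt)} characters"


lines = ["cluster 1, average spend $49.44", "cluster 2, average spend $59.56"]
prompt, reply = build_persona_prompt("suggest a persona for each:", lines, echo_model)
print(prompt)
print(reply)
```

In the notebook you could pass `gemini_generate` as the `generate` argument and `cluster_info` as the cluster lines to get the same behavior as `persona_builder`.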

## Now we can stick it behind a UI

<p align="center">
  <img alt="Conceptual Flow" src="slides/process6.png" width="100%">
</p>

In [37]:
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown(
    """
    ## Persona Builder and Marketing Bot
    """)
    with gr.Row():
        input_text = gr.Textbox(label="Pretend you're a creative strategist, given the following clusters: ", 
                                value=" analyse the following clusters and come up with creative brand persona for each that includes the detail of their favorite kind of sunglasses and a detailed description of an image including an animal wearing their favorite type of sunglasses:") 
    with gr.Row():
        generate = gr.Button("Generate Response")
    with gr.Row():
        label2 = gr.Textbox(label="Prompt")
    with gr.Row():
        label3 = gr.Markdown(label="Response generated by LLM")

    generate.click(persona_builder, input_text, [label2, label3])
    
demo.launch(share=False, debug=False)

Running on local URL:  http://127.0.0.1:7865

To create a public link, set `share=True` in `launch()`.


