# Getting Started

## Setup Database Environment - Retail CLV Regression Demo

This notebook sets up the required database objects, including mock retail data, for a **Customer Lifetime Value (CLV) Regression** demo.

### Tables Created:
1. **CUSTOMER_DEMOGRAPHICS**: Customer profile features (4 features)
2. **PURCHASE_BEHAVIOR**: Transaction-based features (3 features + target)

The tables join on `CUSTOMER_ID` to demonstrate feature engineering workflows.
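As a rough illustration of the join on `CUSTOMER_ID`, the sketch below uses Python's built-in `sqlite3` module as a local stand-in for Snowflake. The reduced column set and the two rows per table are made up for illustration only; the real tables are created later in this notebook.

```python
import sqlite3

# In-memory SQLite database, used only as a local stand-in for Snowflake.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CUSTOMER_DEMOGRAPHICS (
    CUSTOMER_ID INTEGER PRIMARY KEY, AGE INTEGER, LOYALTY_TIER INTEGER);
CREATE TABLE PURCHASE_BEHAVIOR (
    CUSTOMER_ID INTEGER PRIMARY KEY, AVG_ORDER_VALUE REAL, LIFETIME_VALUE REAL);
-- Hypothetical sample rows, not taken from the demo data.
INSERT INTO CUSTOMER_DEMOGRAPHICS VALUES (1, 34, 2), (2, 51, 4);
INSERT INTO PURCHASE_BEHAVIOR VALUES (1, 120.50, 8400.00), (2, 310.00, 41250.75);
""")

# One row per customer: demographic features next to behavioural features and the target.
joined_rows = conn.execute("""
SELECT d.CUSTOMER_ID, d.AGE, d.LOYALTY_TIER, p.AVG_ORDER_VALUE, p.LIFETIME_VALUE
FROM CUSTOMER_DEMOGRAPHICS d
JOIN PURCHASE_BEHAVIOR p ON d.CUSTOMER_ID = p.CUSTOMER_ID
ORDER BY d.CUSTOMER_ID
""").fetchall()

for row in joined_rows:
    print(row)
```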

In [20]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


**Install SQLGlot** <br>
Install SQLGlot with pip in the conda environment **py-snowpark_df_ml_fs** by running the cell below. We use this package to format the SQL that Snowpark generates so that it is human-readable in the Dynamic Tables that Feature Store creates. The install runs from within the notebook because other users have reported issues when installing directly from the OS.

In [21]:
!python3 -m pip install "sqlglot[rs]" --no-deps
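To give a feel for what the formatting step might look like, here is a minimal sketch using `sqlglot.transpile` with `pretty=True`. The function name `format_sql` and the sample query are hypothetical, and the fallback to the unformatted string when SQLGlot is not importable is an assumption added so the snippet runs anywhere; the notebooks that follow may format SQL differently.

```python
try:
    import sqlglot
except ImportError:  # SQLGlot not installed in this environment
    sqlglot = None


def format_sql(sql: str, dialect: str = "snowflake") -> str:
    """Return a pretty-printed version of `sql` (hypothetical helper)."""
    if sqlglot is None:
        # Fall back to the raw statement rather than failing.
        return sql
    # transpile returns one string per statement; this sketch assumes a single statement.
    return sqlglot.transpile(sql, read=dialect, write=dialect, pretty=True)[0]


print(format_sql("select customer_id, age from customer_demographics where age > 30"))
```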



#### Notebook Packages

In [None]:
# Python packages
import os
from os import listdir
from os.path import isfile, join
import time
import json
import datetime


# SNOWFLAKE
# Snowpark
from snowflake.snowpark import Session, DataFrame, Window, WindowSpec
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T
from snowflake.snowpark.version import VERSION
from snowflake.ml.utils import connection_params

from helper.useful_fns import run_sql

### Setup Snowflake connection and database parameters

Change the settings below if you need to adapt them to your Snowflake account, e.g. if you need to use a different role with ACCOUNTADMIN privileges to set up the environment.

In [23]:
# ===========================================
# CONFIGURATION - Modify these as needed
# ===========================================

# Roles
admin_role = 'ACCOUNTADMIN'              # Role with ACCOUNTADMIN privileges for setup
demo_role = 'RETAIL_REGRESSION_DEMO_ROLE'           # The data scientist role for this demo

# Database
database_name = 'RETAIL_REGRESSION_DEMO'

# Schema
schema_name = 'DS'

# Warehouse
warehouse_name = 'RETAIL_REGRESSION_DEMO_WH'
warehouse_size = 'SMALL'

# Data Generation
num_customers = 10000                     # Number of customers to generate 
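Because these values are interpolated directly into SQL strings later in the notebook, it can help to check them first. The sketch below is one possible guard, assuming all names are meant to be unquoted Snowflake identifiers (a letter or underscore followed by letters, digits, underscores or dollar signs). The helper `check_identifier` is hypothetical and not part of the original notebook.

```python
import re

# Pattern for unquoted identifiers: letter or underscore first, then letters, digits, _ or $.
_IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def check_identifier(name: str) -> str:
    """Return `name` unchanged if it looks like a safe unquoted identifier (hypothetical helper)."""
    if not _IDENTIFIER_RE.fullmatch(name):
        raise ValueError(f"Not a valid unquoted identifier: {name!r}")
    return name


# Same values as the configuration cell above.
settings = {
    "admin_role": "ACCOUNTADMIN",
    "demo_role": "RETAIL_REGRESSION_DEMO_ROLE",
    "database_name": "RETAIL_REGRESSION_DEMO",
    "schema_name": "DS",
    "warehouse_name": "RETAIL_REGRESSION_DEMO_WH",
}
for key, value in settings.items():
    check_identifier(value)
print("All identifiers look valid")
```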

In [24]:
# Create Snowflake Session object
with open('connection.json', 'r') as f:
    connection_parameters = json.load(f)
# connection_parameters = connection_params.SnowflakeLoginOptions("ak32940")
session = Session.builder.configs(connection_parameters).create()
session.sql_simplifier_enabled = True
snowflake_environment = session.sql('SELECT current_user(), current_version()').collect()
snowpark_version = VERSION

You might have more than one threads sharing the Session object trying to update sql_simplifier_enabled. Updating this while other tasks are running can potentially cause unexpected behavior. Please update the session configuration before starting the threads.
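The contents of `connection.json` are not shown in this notebook. The sketch below illustrates one way to load such a file and fail early when expected keys are missing. The required keys (`account`, `user`) are an assumption based on common Snowpark connection settings, the helper name is hypothetical, and the values written to the temporary file are placeholders only.

```python
import json
import os
import tempfile

# Assumption: the minimum keys expected in the file; other auth setups may need more.
REQUIRED_KEYS = ("account", "user")


def load_connection_parameters(path: str) -> dict:
    """Load connection parameters from a JSON file and check required keys (hypothetical helper)."""
    with open(path, "r") as f:
        params = json.load(f)
    missing = [key for key in REQUIRED_KEYS if not params.get(key)]
    if missing:
        raise ValueError(f"Missing connection settings: {missing}")
    return params


# Demonstration with a temporary file holding placeholder values.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "connection.json")
    with open(demo_path, "w") as f:
        json.dump({"account": "<your_account>", "user": "<your_user>"}, f)
    demo_params = load_connection_parameters(demo_path)

print(sorted(demo_params))
```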


In [25]:
# Current Environment Details
print('\nConnection Established with the following parameters:')
print(f'User                        : {snowflake_environment[0][0]}')
print(f'Role                        : {session.get_current_role()}')
print(f'Database                    : {session.get_current_database()}')
print(f'Schema                      : {session.get_current_schema()}')
print(f'Warehouse                   : {session.get_current_warehouse()}')
print(f'Snowflake version           : {snowflake_environment[0][1]}')
print(f'Snowpark for Python version : {snowpark_version[0]}.{snowpark_version[1]}.{snowpark_version[2]} \n')


Connection Established with the following parameters:
User                        : JARCHEN
Role                        : "ACCOUNTADMIN"
Database                    : "SAMPLES_DB"
Schema                      : "NOTEBOOKS"
Warehouse                   : "ENRICH_WH"
Snowflake version           : 9.37.1
Snowpark for Python version : 1.38.0 



In [26]:
run_sql(f'''use role {admin_role}''', session)

use role ACCOUNTADMIN 
 [Row(status='Statement executed successfully.')] 



{'use role ACCOUNTADMIN': [Row(status='Statement executed successfully.')]}

In [27]:
# Setup master role and permissions
run_sql(f'''use role {admin_role}''', session)

# Create the demo role
run_sql(f'''create role if not exists {demo_role}''', session)

# Grant role to SYSADMIN (best practice for role hierarchy)
run_sql(f'''grant role {demo_role} to role SYSADMIN''', session)

# Create warehouse
run_sql(f'''create warehouse if not exists {warehouse_name} 
            warehouse_size = {warehouse_size}
            auto_suspend = 60
            auto_resume = true
            initially_suspended = true''', session)

# Grant warehouse permissions to demo role
run_sql(f'''grant all on warehouse {warehouse_name} to role {demo_role}''', session)

# Use the warehouse
run_sql(f'''use warehouse {warehouse_name}''', session)

# Grant task execution permissions to demo role
run_sql(f'''grant execute managed task on account to role {demo_role}''', session)
run_sql(f'''grant execute task on account to role {demo_role}''', session)


use role ACCOUNTADMIN 
 [Row(status='Statement executed successfully.')] 

create role if not exists RETAIL_REGRESSION_DEMO_ROLE 
 [Row(status='RETAIL_REGRESSION_DEMO_ROLE already exists, statement succeeded.')] 

grant role RETAIL_REGRESSION_DEMO_ROLE to role SYSADMIN 
 [Row(status='Statement executed successfully.')] 

create warehouse if not exists RETAIL_REGRESSION_DEMO_WH 
            warehouse_size = SMALL
            auto_suspend = 60
            auto_resume = true
            initially_suspended = true 
 [Row(status='RETAIL_REGRESSION_DEMO_WH already exists, statement succeeded.')] 

grant all on warehouse RETAIL_REGRESSION_DEMO_WH to role RETAIL_REGRESSION_DEMO_ROLE 
 [Row(status='Statement executed successfully.')] 

use warehouse RETAIL_REGRESSION_DEMO_WH 
 [Row(status='Statement executed successfully.')] 

grant execute managed task on account to role RETAIL_REGRESSION_DEMO_ROLE 
 [Row(status='Statement executed successfully.')] 

grant execute task on account to role RETAI

{'grant execute task on account to role RETAIL_REGRESSION_DEMO_ROLE': [Row(status='Statement executed successfully.')]}

In [None]:
# Database setup
run_sql(f'''use role {admin_role}''', session)

# Create database
run_sql(f'''create database if not exists {database_name}''', session)
# run_sql(f'''create or replace database {database_name}''', session)

# Grant database permissions to demo role
run_sql(f'''grant all on database {database_name} to role {demo_role}''', session)
run_sql(f'''grant all on all schemas in database {database_name} to role {demo_role}''', session)
run_sql(f'''grant all on future schemas in database {database_name} to role {demo_role}''', session)


use role ACCOUNTADMIN 
 [Row(status='Statement executed successfully.')] 

create or replace database RETAIL_REGRESSION_DEMO 
 [Row(status='Database RETAIL_REGRESSION_DEMO successfully created.')] 

grant all on database RETAIL_REGRESSION_DEMO to role RETAIL_REGRESSION_DEMO_ROLE 
 [Row(status='Statement executed successfully.')] 

grant all on all schemas in database RETAIL_REGRESSION_DEMO to role RETAIL_REGRESSION_DEMO_ROLE 
 [Row(status='Statement executed successfully. 1 objects affected.')] 

grant all on future schemas in database RETAIL_REGRESSION_DEMO to role RETAIL_REGRESSION_DEMO_ROLE 
 [Row(status='Statement executed successfully.')] 



{'grant all on future schemas in database RETAIL_REGRESSION_DEMO to role RETAIL_REGRESSION_DEMO_ROLE': [Row(status='Statement executed successfully.')]}

In [29]:
# Schema setup with permissions
# Switch to demo role for schema creation (to ensure ownership)
run_sql(f'''use role {demo_role}''', session)
run_sql(f'''use warehouse {warehouse_name}''', session)
run_sql(f'''use database {database_name}''', session)

# Create schema
run_sql(f'''create schema if not exists {schema_name}''', session)

# Switch back to admin to grant permissions
run_sql(f'''use role {admin_role}''', session)

# Grant schema permissions to demo role
run_sql(f'''grant usage on schema {database_name}.{schema_name} to role {demo_role}''', session)
run_sql(f'''grant create table on schema {database_name}.{schema_name} to role {demo_role}''', session)
run_sql(f'''grant create view on schema {database_name}.{schema_name} to role {demo_role}''', session)
run_sql(f'''grant create tag on schema {database_name}.{schema_name} to role {demo_role}''', session)
run_sql(f'''grant create dynamic table on schema {database_name}.{schema_name} to role {demo_role}''', session)

# Grant permissions on existing and future objects
run_sql(f'''grant select, insert, update, delete on all tables in schema {database_name}.{schema_name} to role {demo_role}''', session)
run_sql(f'''grant select, insert, update, delete on future tables in schema {database_name}.{schema_name} to role {demo_role}''', session)
run_sql(f'''grant select, references on all views in schema {database_name}.{schema_name} to role {demo_role}''', session)
run_sql(f'''grant select, references on future views in schema {database_name}.{schema_name} to role {demo_role}''', session)
run_sql(f'''grant select, monitor on all dynamic tables in schema {database_name}.{schema_name} to role {demo_role}''', session)
run_sql(f'''grant select, monitor on future dynamic tables in schema {database_name}.{schema_name} to role {demo_role}''', session)


use role RETAIL_REGRESSION_DEMO_ROLE 
 [Row(status='Statement executed successfully.')] 

use warehouse RETAIL_REGRESSION_DEMO_WH 
 [Row(status='Statement executed successfully.')] 

use database RETAIL_REGRESSION_DEMO 
 [Row(status='Statement executed successfully.')] 

create schema if not exists DS 
 [Row(status='Schema DS successfully created.')] 

use role ACCOUNTADMIN 
 [Row(status='Statement executed successfully.')] 

grant usage on schema RETAIL_REGRESSION_DEMO.DS to role RETAIL_REGRESSION_DEMO_ROLE 
 [Row(status='Statement executed successfully.')] 

grant create table on schema RETAIL_REGRESSION_DEMO.DS to role RETAIL_REGRESSION_DEMO_ROLE 
 [Row(status='Statement executed successfully.')] 

grant create view on schema RETAIL_REGRESSION_DEMO.DS to role RETAIL_REGRESSION_DEMO_ROLE 
 [Row(status='Statement executed successfully.')] 

grant create tag on schema RETAIL_REGRESSION_DEMO.DS to role RETAIL_REGRESSION_DEMO_ROLE 
 [Row(status='Statement executed successfully.')] 

gran

{'grant select, monitor on future dynamic tables in schema RETAIL_REGRESSION_DEMO.DS to role RETAIL_REGRESSION_DEMO_ROLE': [Row(status='Statement executed successfully.')]}
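The eleven grant statements in the cell above follow a regular pattern, so they could also be generated in a loop. The sketch below builds the same statement strings in the same order; the function name `schema_grant_statements` is hypothetical, and each string would still need to be passed to `run_sql` with an active session.

```python
def schema_grant_statements(database_name, schema_name, demo_role):
    """Build the schema-level grant statements used in this notebook (hypothetical helper)."""
    schema = f"{database_name}.{schema_name}"
    statements = [f"grant usage on schema {schema} to role {demo_role}"]

    # Privileges to create new objects in the schema.
    for object_type in ("table", "view", "tag", "dynamic table"):
        statements.append(
            f"grant create {object_type} on schema {schema} to role {demo_role}"
        )

    # Privileges on existing ("all") and future objects of each type.
    object_privileges = {
        "tables": "select, insert, update, delete",
        "views": "select, references",
        "dynamic tables": "select, monitor",
    }
    for object_type, privileges in object_privileges.items():
        for scope in ("all", "future"):
            statements.append(
                f"grant {privileges} on {scope} {object_type} "
                f"in schema {schema} to role {demo_role}"
            )
    return statements


grants = schema_grant_statements(
    "RETAIL_REGRESSION_DEMO", "DS", "RETAIL_REGRESSION_DEMO_ROLE"
)
for statement in grants:
    print(statement)
```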

In [30]:
# =====================================================
# TABLE 1: CUSTOMER_DEMOGRAPHICS
# Features:
#   1. AGE: Customer age (18-75)
#   2. ANNUAL_INCOME: Estimated annual income ($20k-$200k)
#   3. LOYALTY_TIER: Categorical tier (1=Bronze, 2=Silver, 3=Gold, 4=Platinum)
#   4. TENURE_MONTHS: How long they've been a customer (1-120 months)
# =====================================================

# Switch to demo role for table creation
run_sql(f'''use role {demo_role}''', session)
run_sql(f'''use warehouse {warehouse_name}''', session)
run_sql(f'''use database {database_name}''', session)
run_sql(f'''use schema {schema_name}''', session)

# Create CUSTOMER_DEMOGRAPHICS table
run_sql(f'''
CREATE OR REPLACE TABLE CUSTOMER_DEMOGRAPHICS (
    CUSTOMER_ID         INTEGER         NOT NULL PRIMARY KEY,
    AGE                 INTEGER         NOT NULL,
    ANNUAL_INCOME       DECIMAL(10,2)   NOT NULL,
    LOYALTY_TIER        INTEGER         NOT NULL,
    TENURE_MONTHS       INTEGER         NOT NULL,
    SIGNUP_DATE         DATE            NOT NULL,
    CREATED_AT          TIMESTAMP_NTZ   DEFAULT CURRENT_TIMESTAMP()
) 
CLUSTER BY (CUSTOMER_ID)
COMMENT = 'Customer demographic features for CLV regression model'
''', session)

# Insert mock customer demographics data
run_sql(f'''
INSERT INTO CUSTOMER_DEMOGRAPHICS (
    CUSTOMER_ID,
    AGE,
    ANNUAL_INCOME,
    LOYALTY_TIER,
    TENURE_MONTHS,
    SIGNUP_DATE
)
SELECT 
    SEQ4() + 1 AS CUSTOMER_ID,
    GREATEST(18, LEAST(75, ROUND(40 + (RANDOM() % 20) - 10 + (RANDOM() % 10)))) AS AGE,
    ROUND(CASE 
        WHEN UNIFORM(0, 100, RANDOM()) < 60 THEN UNIFORM(40000, 100000, RANDOM())
        WHEN UNIFORM(0, 100, RANDOM()) < 85 THEN UNIFORM(20000, 40000, RANDOM())
        ELSE UNIFORM(100000, 200000, RANDOM())
    END, 2) AS ANNUAL_INCOME,
    CASE 
        WHEN UNIFORM(0, 100, RANDOM()) < 40 THEN 1
        WHEN UNIFORM(0, 100, RANDOM()) < 70 THEN 2
        WHEN UNIFORM(0, 100, RANDOM()) < 90 THEN 3
        ELSE 4
    END AS LOYALTY_TIER,
    GREATEST(1, LEAST(120, CASE 
        WHEN UNIFORM(0, 100, RANDOM()) < 50 THEN UNIFORM(1, 24, RANDOM())
        WHEN UNIFORM(0, 100, RANDOM()) < 80 THEN UNIFORM(24, 60, RANDOM())
        ELSE UNIFORM(60, 120, RANDOM())
    END)) AS TENURE_MONTHS,
    DATEADD('month', -TENURE_MONTHS, CURRENT_DATE()) AS SIGNUP_DATE
FROM TABLE(GENERATOR(ROWCOUNT => {num_customers}))
''', session)

print(f'Created CUSTOMER_DEMOGRAPHICS table with {num_customers} rows')

use role RETAIL_REGRESSION_DEMO_ROLE 
 [Row(status='Statement executed successfully.')] 

use warehouse RETAIL_REGRESSION_DEMO_WH 
 [Row(status='Statement executed successfully.')] 

use database RETAIL_REGRESSION_DEMO 
 [Row(status='Statement executed successfully.')] 

use schema DS 
 [Row(status='Statement executed successfully.')] 


CREATE OR REPLACE TABLE CUSTOMER_DEMOGRAPHICS (
    CUSTOMER_ID         INTEGER         NOT NULL PRIMARY KEY,
    AGE                 INTEGER         NOT NULL,
    ANNUAL_INCOME       DECIMAL(10,2)   NOT NULL,
    LOYALTY_TIER        INTEGER         NOT NULL,
    TENURE_MONTHS       INTEGER         NOT NULL,
    SIGNUP_DATE         DATE            NOT NULL,
    CREATED_AT          TIMESTAMP_NTZ   DEFAULT CURRENT_TIMESTAMP()
) 
CLUSTER BY (CUSTOMER_ID)
COMMENT = 'Customer demographic features for CLV regression model'
 
 [Row(status='Table CUSTOMER_DEMOGRAPHICS successfully created.')] 


INSERT INTO CUSTOMER_DEMOGRAPHICS (
    CUSTOMER_ID,
    AGE,
   

In [31]:
# =====================================================
# TABLE 2: PURCHASE_BEHAVIOR
# Features:
#   1. AVG_ORDER_VALUE: Average transaction amount ($15-$500)
#   2. PURCHASE_FREQUENCY: Orders per month (0.1-8)
#   3. RETURN_RATE: Percentage of items returned (0-30%)
#   
# Target Variable:
#   4. LIFETIME_VALUE: Total customer value to predict (regression target)
# =====================================================

# Create PURCHASE_BEHAVIOR table
run_sql(f'''
CREATE OR REPLACE TABLE PURCHASE_BEHAVIOR (
    CUSTOMER_ID             INTEGER         NOT NULL PRIMARY KEY,
    AVG_ORDER_VALUE         DECIMAL(10,2)   NOT NULL,
    PURCHASE_FREQUENCY      DECIMAL(5,2)    NOT NULL,
    RETURN_RATE             DECIMAL(5,2)    NOT NULL,
    LIFETIME_VALUE          DECIMAL(12,2)   NOT NULL,
    LAST_PURCHASE_DATE      DATE            NOT NULL,
    TOTAL_ORDERS            INTEGER         NOT NULL,
    CREATED_AT              TIMESTAMP_NTZ   DEFAULT CURRENT_TIMESTAMP(),
    FOREIGN KEY (CUSTOMER_ID) REFERENCES CUSTOMER_DEMOGRAPHICS(CUSTOMER_ID)
)
CLUSTER BY (CUSTOMER_ID)
COMMENT = 'Purchase behavior features and CLV target for regression model'
''', session)

# Insert mock purchase behavior data with realistic correlations
run_sql(f'''
INSERT INTO PURCHASE_BEHAVIOR (
    CUSTOMER_ID, AVG_ORDER_VALUE, PURCHASE_FREQUENCY, RETURN_RATE,
    LIFETIME_VALUE, LAST_PURCHASE_DATE, TOTAL_ORDERS
)
SELECT 
    c.CUSTOMER_ID,
    -- AVG_ORDER_VALUE: Correlates with income and loyalty tier
    ROUND(GREATEST(15, LEAST(500,
        UNIFORM(50, 150, RANDOM()) + (c.ANNUAL_INCOME / 5000) + (c.LOYALTY_TIER * 20) + UNIFORM(-30, 30, RANDOM())
    )), 2) AS AVG_ORDER_VALUE,
    -- PURCHASE_FREQUENCY: Orders per month, correlates with loyalty
    ROUND(GREATEST(0.1, LEAST(8,
        0.5 + (c.LOYALTY_TIER * 0.8) + (UNIFORM(0, 200, RANDOM()) / 100.0) 
        - (CASE WHEN c.TENURE_MONTHS < 6 THEN 0.3 ELSE 0 END)
    )), 2) AS PURCHASE_FREQUENCY,
    -- RETURN_RATE: 0-30%
    ROUND(GREATEST(0, LEAST(30,
        UNIFORM(2, 15, RANDOM()) + (CASE WHEN PURCHASE_FREQUENCY > 4 THEN 5 ELSE 0 END) + UNIFORM(-5, 5, RANDOM())
    )), 2) AS RETURN_RATE,
    -- LIFETIME_VALUE (TARGET)
    ROUND(GREATEST(50,
        (AVG_ORDER_VALUE * PURCHASE_FREQUENCY * c.TENURE_MONTHS)
        * (1 - RETURN_RATE / 100) * (1 + c.LOYALTY_TIER * 0.1) * (1 + c.ANNUAL_INCOME / 500000)
        + UNIFORM(-500, 500, RANDOM())
    ), 2) AS LIFETIME_VALUE,
    -- LAST_PURCHASE_DATE
    DATEADD('day', -GREATEST(1, ROUND(30 / GREATEST(PURCHASE_FREQUENCY, 0.5) + UNIFORM(0, 14, RANDOM()))), CURRENT_DATE()) AS LAST_PURCHASE_DATE,
    -- TOTAL_ORDERS
    GREATEST(1, ROUND(PURCHASE_FREQUENCY * c.TENURE_MONTHS)) AS TOTAL_ORDERS
FROM CUSTOMER_DEMOGRAPHICS c
''', session)

print(f'Created PURCHASE_BEHAVIOR table with {num_customers} rows')


CREATE OR REPLACE TABLE PURCHASE_BEHAVIOR (
    CUSTOMER_ID             INTEGER         NOT NULL PRIMARY KEY,
    AVG_ORDER_VALUE         DECIMAL(10,2)   NOT NULL,
    PURCHASE_FREQUENCY      DECIMAL(5,2)    NOT NULL,
    RETURN_RATE             DECIMAL(5,2)    NOT NULL,
    LIFETIME_VALUE          DECIMAL(12,2)   NOT NULL,
    LAST_PURCHASE_DATE      DATE            NOT NULL,
    TOTAL_ORDERS            INTEGER         NOT NULL,
    CREATED_AT              TIMESTAMP_NTZ   DEFAULT CURRENT_TIMESTAMP(),
    FOREIGN KEY (CUSTOMER_ID) REFERENCES CUSTOMER_DEMOGRAPHICS(CUSTOMER_ID)
)
CLUSTER BY (CUSTOMER_ID)
COMMENT = 'Purchase behavior features and CLV target for regression model'
 
 [Row(status='Table PURCHASE_BEHAVIOR successfully created.')] 


INSERT INTO PURCHASE_BEHAVIOR (
    CUSTOMER_ID, AVG_ORDER_VALUE, PURCHASE_FREQUENCY, RETURN_RATE,
    LIFETIME_VALUE, LAST_PURCHASE_DATE, TOTAL_ORDERS
)
SELECT 
    c.CUSTOMER_ID,
    -- AVG_ORDER_VALUE: Correlates with income and loyalty tier
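To make the `LIFETIME_VALUE` expression in the INSERT statement easier to follow, the sketch below restates it as a plain Python function. The function name and the example customer are hypothetical; the `noise` argument stands in for the `UNIFORM(-500, 500, RANDOM())` term and defaults to 0 so the deterministic part can be checked by hand.

```python
def lifetime_value(avg_order_value, purchase_frequency, tenure_months,
                   return_rate_pct, loyalty_tier, annual_income, noise=0.0):
    """Mirror of the LIFETIME_VALUE SQL expression (illustrative only)."""
    # Gross spend: order value x orders per month x months as a customer.
    gross_spend = avg_order_value * purchase_frequency * tenure_months
    value = (
        gross_spend
        * (1 - return_rate_pct / 100)     # discount for returned items
        * (1 + loyalty_tier * 0.1)        # uplift per loyalty tier
        * (1 + annual_income / 500000)    # small uplift for higher income
        + noise
    )
    # GREATEST(50, ...) in the SQL puts a floor of 50 on the target.
    return round(max(50, value), 2)


# Hypothetical customer: $100 orders, 2 per month, 12 months, 10% returns, tier 2, $50k income.
example_clv = lifetime_value(100, 2, 12, 10, 2, 50000)
print(example_clv)
```

For this hypothetical customer the factors are 2400 × 0.9 × 1.2 × 1.1, which is about 2851.20 before noise.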
 

In [32]:
# CLV by Loyalty Tier (Correlation Check)
print("\n=== CLV by Loyalty Tier (Correlation Check) ===")
session.sql('''
SELECT 
    d.LOYALTY_TIER,
    COUNT(*) AS CUSTOMER_COUNT,
    ROUND(AVG(p.LIFETIME_VALUE), 2) AS AVG_CLV,
    ROUND(MIN(p.LIFETIME_VALUE), 2) AS MIN_CLV,
    ROUND(MAX(p.LIFETIME_VALUE), 2) AS MAX_CLV
FROM CUSTOMER_DEMOGRAPHICS d
JOIN PURCHASE_BEHAVIOR p ON d.CUSTOMER_ID = p.CUSTOMER_ID
GROUP BY d.LOYALTY_TIER ORDER BY d.LOYALTY_TIER
''').show()


=== CLV by Loyalty Tier (Correlation Check) ===
-------------------------------------------------------------------------
|"LOYALTY_TIER"  |"CUSTOMER_COUNT"  |"AVG_CLV"  |"MIN_CLV"  |"MAX_CLV"  |
-------------------------------------------------------------------------
|1               |3901              |11151.99   |50.00      |81969.45   |
|2               |4216              |18921.08   |50.00      |129523.43  |
|3               |1690              |28499.07   |197.58     |156880.06  |
|4               |193               |37337.17   |1116.69    |193320.94  |
-------------------------------------------------------------------------
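The customer counts per tier above are not a 40/30/20/10 split, which the thresholds in the `LOYALTY_TIER` CASE expression might suggest at first glance. A likely explanation is that each `WHEN` branch calls `UNIFORM(0, 100, RANDOM())` again, so every branch is tested against a fresh draw. The sketch below computes the resulting branch probabilities exactly, assuming independent integer draws on the inclusive range 0 to 100; the function name is hypothetical.

```python
from fractions import Fraction


def sequential_case_probabilities(thresholds, lo=0, hi=100):
    """Probability of each CASE branch when every WHEN uses its own uniform draw."""
    n_values = hi - lo + 1          # number of equally likely integer outcomes
    remaining = Fraction(1)         # probability that no earlier branch matched
    probabilities = []
    for threshold in thresholds:
        # P(draw < threshold) for one fresh draw.
        hits = min(max(threshold - lo, 0), n_values)
        p_match = Fraction(hits, n_values)
        probabilities.append(remaining * p_match)
        remaining *= 1 - p_match
    probabilities.append(remaining)  # the ELSE branch
    return probabilities


tier_probabilities = sequential_case_probabilities([40, 70, 90])
for tier, probability in enumerate(tier_probabilities, start=1):
    print(f"Tier {tier}: {float(probability):.3f}")
```

Under these assumptions the tiers come out at roughly 40%, 42%, 17% and 2%, which is close to the observed counts of 3901, 4216, 1690 and 193. The same pattern applies to the `ANNUAL_INCOME` and `TENURE_MONTHS` CASE expressions.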



## -------------------------------------------------------------------------------------

## CLEAN UP


In [33]:
# session.close()

In [34]:
from datetime import datetime
from zoneinfo import ZoneInfo
formatted_time = datetime.now(ZoneInfo("Australia/Melbourne")).strftime("%A, %B %d, %Y %I:%M:%S %p %Z")

print(f"The last run time in Melbourne is: {formatted_time}")

The last run time in Melbourne is: Wednesday, December 03, 2025 10:33:07 PM AEDT
