# End to End CLV Model Notebook

# Getting Started with Snowflake Feature Store
This use case shows how Snowflake Feature Store (and Model Registry) can be used to maintain and store features, retrieve them for training, and perform micro-batch inference.

In the development (TRAINING) environment we will
- create FeatureViews in the Feature Store that maintain the required customer-behaviour features.
- use these features to train a model, and save the model in the Snowflake Model Registry.
- plot the clusters for the trained model to verify them visually.

In the production (SERVING) environment we will
- re-create the FeatureViews on production data
- generate an Inference FeatureView that uses the saved model to perform incremental inference

# Feature Engineering & Model Training

In [21]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


#### Notebook Packages

In [22]:
# Python packages
import os
import json
import timeit

# SNOWFLAKE
# Snowpark
from snowflake.snowpark import Session, DataFrame, Window, WindowSpec, Row

import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T

# Snowflake Feature Store
from snowflake.ml.feature_store import (
    FeatureView,
    Entity
)

# COMMON FUNCTIONS
from helper.useful_fns import dataset_check_and_update, formatSQL, create_ModelRegistry, create_FeatureStore, create_SF_Session, get_spine_df

### Setup Snowflake connection and database parameters

In [23]:
role_name, database_name, schema_name, session, warehouse_name = create_SF_Session(
    connection_file = '../../connection.json',
    role="ACCOUNTADMIN"
)

You might have more than one threads sharing the Session object trying to update sql_simplifier_enabled. Updating this while other tasks are running can potentially cause unexpected behavior. Please update the session configuration before starting the threads.



Connection Established with the following parameters:
User                        : JARCHEN
Role                        : "ACCOUNTADMIN"
Database                    : "RETAIL_REGRESSION_DEMO"
Schema                      : "DS"
Warehouse                   : "RETAIL_REGRESSION_DEMO_WH"
Snowflake version           : 9.39.2
Snowpark for Python version : 1.38.0 



In [24]:
# Create compute pool
def create_compute_pool(name: str, instance_family: str, min_nodes: int = 1, max_nodes: int = 10) -> list[Row]:
    query = f"""
        CREATE COMPUTE POOL IF NOT EXISTS {name}
            MIN_NODES = {min_nodes}
            MAX_NODES = {max_nodes}
            INSTANCE_FAMILY = {instance_family}
    """
    return session.sql(query).collect()

compute_pool = "CLV_MODEL_POOL_CPU"
create_compute_pool(compute_pool, "CPU_X64_L")

[Row(status='CLV_MODEL_POOL_CPU already exists, statement succeeded.')]
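
Because `create_compute_pool` interpolates its arguments straight into the SQL text, it can help to validate them before building the statement. The following is a minimal sketch of that idea, not part of the notebook's helpers: `build_compute_pool_sql` and the identifier pattern are illustrative assumptions, and the pattern only accepts unquoted identifiers.

```python
import re

# Assumption: only unquoted identifiers are accepted (letter or underscore
# first, then letters, digits, underscores or dollar signs).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def build_compute_pool_sql(name: str, instance_family: str,
                           min_nodes: int = 1, max_nodes: int = 10) -> str:
    """Return a CREATE COMPUTE POOL statement after basic input checks."""
    for label, value in (("name", name), ("instance_family", instance_family)):
        if not _IDENTIFIER.fullmatch(value):
            raise ValueError(f"{label} is not a plain identifier: {value!r}")
    if not (isinstance(min_nodes, int) and isinstance(max_nodes, int)):
        raise TypeError("min_nodes and max_nodes must be integers")
    if not 1 <= min_nodes <= max_nodes:
        raise ValueError("expected 1 <= min_nodes <= max_nodes")
    return (
        f"CREATE COMPUTE POOL IF NOT EXISTS {name} "
        f"MIN_NODES = {min_nodes} "
        f"MAX_NODES = {max_nodes} "
        f"INSTANCE_FAMILY = {instance_family}"
    )


print(build_compute_pool_sql("CLV_MODEL_POOL_CPU", "CPU_X64_L"))
```

The returned string could then be passed to `session.sql(...).collect()` exactly as the cell above does.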

## MODEL DEVELOPMENT
* Create Snowflake Model-Registry
* Create Snowflake Feature-Store
* Establish and Create CUSTOMER Entity in the development Snowflake FeatureStore
* Create Source Data references and perform basic data-cleansing
* Create & Run Preprocessing Function to create features
* Create FeatureView_Preprocess from Preprocess Dataframe SQL
* Create training data from FeatureView_Preprocess (asof join)
* Create & Fit Snowpark-ml pipeline 
* Save model in Model Registry
* 'Verify' and approve model
* Create new FeatureView_Model_Inference with Transforms UDF + KMeans model

In [25]:
# Create/Reference Snowflake Model Registry - Common across Environments
mr = create_ModelRegistry(session, database_name, '_MODELLING')

# Create/Reference Snowflake Feature Store - Common across Environments
fs = create_FeatureStore(session, database_name, '_FEATURE_STORE', warehouse_name)


Model Registry (_MODELLING) already exists
Feature Store (_FEATURE_STORE) already exists


In [26]:
cust_tbl = '.'.join([database_name, schema_name,'CUSTOMERS'])
cust_sdf = session.table(cust_tbl)
print(cust_tbl, cust_sdf.count())
cust_sdf.limit(10).show()

RETAIL_REGRESSION_DEMO.DS.CUSTOMERS 10000
------------------------------------------------------------------------------------------------------------------------------------------------
|"CUSTOMER_ID"  |"AGE"  |"ANNUAL_INCOME"  |"LOYALTY_TIER"  |"GENDER"  |"STATE"  |"TENURE_MONTHS"  |"SIGNUP_DATE"  |"CREATED_AT"                |
------------------------------------------------------------------------------------------------------------------------------------------------
|1              |36     |66091.00         |medium          |female    |WA       |6                |2025-06-19     |2025-12-19 03:18:43.893000  |
|2              |32     |47309.00         |low             |female    |NSW      |5                |2025-07-19     |2025-12-19 03:18:43.893000  |
|3              |18     |54797.00         |low             |male      |NSW      |29               |2023-07-19     |2025-12-19 03:18:43.893000  |
|4              |35     |27852.00         |low             |female    |NSW      |22     

### CUSTOMER Entity
Establish and Create CUSTOMER Entity in Snowflake FeatureStore for this Use-Case

In [27]:
if "CUSTOMER" not in json.loads(fs.list_entities().select(F.to_json(F.array_agg("NAME", True))).collect()[0][0]):
    customer_entity = Entity(name="CUSTOMER", join_keys=["CUSTOMER_ID"],desc="Primary Key for CUSTOMER ORDER")
    fs.register_entity(customer_entity)
else:
    customer_entity = fs.get_entity("CUSTOMER")

fs.list_entities().show()

------------------------------------------------------------------------------
|"NAME"    |"JOIN_KEYS"      |"DESC"                          |"OWNER"       |
------------------------------------------------------------------------------
|CUSTOMER  |["CUSTOMER_ID"]  |Primary Key for CUSTOMER ORDER  |ACCOUNTADMIN  |
------------------------------------------------------------------------------
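
The membership test in the cell above goes through `array_agg` and a JSON round trip. A shorter get-or-register pattern is sketched below. `get_or_register_entity` is a hypothetical helper, and it assumes only what the cell above already relies on: `list_entities()` yields rows with a `NAME` column, and the store exposes `get_entity` and `register_entity`.

```python
from typing import Any, Callable


def get_or_register_entity(fs: Any, name: str,
                           make_entity: Callable[[], Any]) -> Any:
    """Return the named entity, registering it first if it does not exist.

    `fs` is expected to behave like the notebook's FeatureStore object.
    `make_entity` builds the entity lazily, so nothing is constructed when
    the entity is already registered.
    """
    existing = {row["NAME"] for row in fs.list_entities().collect()}
    if name in existing:
        return fs.get_entity(name)
    entity = make_entity()
    fs.register_entity(entity)
    return entity


# Usage in the notebook would look like this (not executed here):
# customer_entity = get_or_register_entity(
#     fs, "CUSTOMER",
#     lambda: Entity(name="CUSTOMER", join_keys=["CUSTOMER_ID"],
#                    desc="Primary Key for CUSTOMER ORDER"),
# )
```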



### Create & Load Source Data

Our Feature engineering pipelines are defined using Snowpark dataframes (or SQL expressions).  In the `QS_feature_engineering_fns.py` file we have created two feature engineering functions to create our pipeline :
* __uc01_load_data__(order_data: DataFrame, lineitem_data: DataFrame, order_returns_data: DataFrame) -> DataFrame   
* __uc01_pre_process__(data: DataFrame) -> DataFrame

`uc01_load_data` takes the source tables as dataframe objects and joins them together, performing some data cleansing by replacing NAs with default values. It returns a dataframe as its output.

`uc01_pre_process` takes the dataframe output from `uc01_load_data` and derives the features that will be used in our model. It returns a dataframe as output, which we will use to provide the feature-pipeline definition within our FeatureView.

In this way we can build up a complex pipeline step by step and use it to derive a FeatureView that will be maintained as a pipeline in Snowflake.

We will import the functions, and create dataframes from them using the dataframes we created earlier pointing to the tables in our TRAINING (Development) schema.  We will use the last dataframe we create at the end of the pipeline as our input to the FeatureView.
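
To make the idea of derived features concrete, here is a small standard-library sketch that computes three of them for a single customer. It is not the code in `uc01_pre_process`: the formulas are assumptions inferred from the column names and from one sample row in the output later in this notebook, and the reference date `2025-12-19` is likewise an assumption.

```python
from datetime import date


def derive_features(total_orders: int, tenure_months: int,
                    signup_date: date, last_purchase_date: date,
                    as_of: date) -> dict:
    """Compute three illustrative per-customer features."""
    return {
        # Orders per month of tenure; None avoids dividing by zero.
        "AVERAGE_ORDER_PER_MONTH": (
            total_orders / tenure_months if tenure_months else None
        ),
        "DAYS_SINCE_LAST_PURCHASE": (as_of - last_purchase_date).days,
        "DAYS_SINCE_SIGNUP": (as_of - signup_date).days,
    }


features = derive_features(
    total_orders=64,
    tenure_months=21,
    signup_date=date(2024, 3, 19),
    last_purchase_date=date(2025, 12, 3),
    as_of=date(2025, 12, 19),  # assumed reference date
)
print(features)
```

In the notebook the same kind of logic is expressed on Snowpark dataframes, so it runs inside Snowflake rather than in local Python.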


In [28]:
# Feature Engineering Functions
from feature_engineering_fns import uc01_load_data, uc01_pre_process

In [29]:
# Tables
cust_tbl      = '.'.join([database_name, schema_name, 'CUSTOMERS'])
behavior_tbl  = '.'.join([database_name, schema_name, 'PURCHASE_BEHAVIOR'])

# Snowpark Dataframes (kept in separate variables so the table names stay available)
cust_sdf      = session.table(cust_tbl)
behavior_sdf  = session.table(behavior_tbl)

# Row Counts
print(f'\nTABLE ROW_COUNTS IN {schema_name}')
print(cust_tbl, cust_sdf.count())
print(behavior_tbl, behavior_sdf.count())


TABLE ROW_COUNTS IN DS
RETAIL_REGRESSION_DEMO.DS.CUSTOMERS 10000
RETAIL_REGRESSION_DEMO.DS.PURCHASE_BEHAVIOR 10000


In [30]:
raw_data = uc01_load_data(cust_sdf, behavior_sdf)

In [31]:
# Format and print the SQL for the Snowpark Dataframe
rd_sql = formatSQL(raw_data.queries['queries'][0], True)
print(os.linesep.join(rd_sql.split(os.linesep)[:1000]))

WITH SNOWPARK_LEFT AS (
  SELECT
    "CUSTOMER_ID" AS "l_0000_CUSTOMER_ID",
    "AGE" AS "AGE",
    "ANNUAL_INCOME" AS "ANNUAL_INCOME",
    "LOYALTY_TIER" AS "LOYALTY_TIER",
    "GENDER" AS "GENDER",
    "STATE" AS "STATE",
    "TENURE_MONTHS" AS "TENURE_MONTHS",
    "SIGNUP_DATE" AS "SIGNUP_DATE",
    "CREATED_AT" AS "CREATED_AT"
  FROM RETAIL_REGRESSION_DEMO.DS.CUSTOMERS
), SNOWPARK_RIGHT AS (
  SELECT
    "CUSTOMER_ID" AS "r_0001_CUSTOMER_ID",
    "AVG_ORDER_VALUE" AS "AVG_ORDER_VALUE",
    "PURCHASE_FREQUENCY" AS "PURCHASE_FREQUENCY",
    "RETURN_RATE" AS "RETURN_RATE",
    "LIFETIME_VALUE" AS "LIFETIME_VALUE",
    "LAST_PURCHASE_DATE" AS "LAST_PURCHASE_DATE",
    "TOTAL_ORDERS" AS "TOTAL_ORDERS",
    "UPDATED_AT" AS "UPDATED_AT"
  FROM RETAIL_REGRESSION_DEMO.DS.PURCHASE_BEHAVIOR
), cte AS (
  SELECT
    *
  FROM (
    SNOWPARK_LEFT AS SNOWPARK_LEFT
      LEFT OUTER JOIN SNOWPARK_RIGHT AS SNOWPARK_RIGHT
        ON (
          "l_0000_CUSTOMER_ID" = "r_0001_CUSTOMER_ID"
        )
  

In [32]:
raw_data.show()

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"CUSTOMER_ID"  |"AGE"  |"ANNUAL_INCOME"  |"LOYALTY_TIER"  |"GENDER"  |"STATE"  |"TENURE_MONTHS"  |"SIGNUP_DATE"  |"CUSTOMER_CREATED_AT"       |"AVG_ORDER_VALUE"  |"PURCHASE_FREQUENCY"  |"RETURN_RATE"  |"LIFETIME_VALUE"  |"LAST_PURCHASE_DATE"  |"TOTAL_ORDERS"  |"BEHAVIOR_UPDATED_AT"       |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|1              |36     |66091.00         |medium          |female    |WA       |6                |2025-06-19     |2025-12-1

### Create & Run Preprocessing Function 

In [33]:
preprocessed_data = uc01_pre_process(raw_data)

In [34]:
preprocessed_data.show()

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"CUSTOMER_ID"  |"AGE"  |"LOYALTY_TIER"  |"GENDER"  |"STATE"  |"TENURE_MONTHS"  |"SIGNUP_DATE"  |"CUSTOMER_CREATED_AT"       |"AVG_ORDER_VALUE"  |"PURCHASE_FREQUENCY"  |"RETURN_RATE"  |"LIFETIME_VALUE"  |"LAST_PURCHASE_DATE"  |"TOTAL_ORDERS"  |"BEHAVIOR_UPDATED_AT"       |"ANNUAL_INCOME"  |"AVERAGE_ORDER_PER_MONTH"  |"DAYS_SINCE_LAST_PURCHASE"  |"DAYS_SINCE_SIGNUP"  |"EXPECTED_DAYS_BETWEEN_PURCHASES"  |"DAYS_SINCE_EXPECTED_LAST_PURCHASE_DATE"  |
----------------------------------------------------------------------------------------------------

In [35]:
# Format and print the SQL for the Snowpark Dataframe
ppd_sql = formatSQL(preprocessed_data.queries['queries'][0], True)
print(os.linesep.join(ppd_sql.split(os.linesep)[:1000]))

WITH SNOWPARK_LEFT AS (
  SELECT
    "CUSTOMER_ID" AS "l_0000_CUSTOMER_ID",
    "AGE" AS "AGE",
    "ANNUAL_INCOME" AS "ANNUAL_INCOME",
    "LOYALTY_TIER" AS "LOYALTY_TIER",
    "GENDER" AS "GENDER",
    "STATE" AS "STATE",
    "TENURE_MONTHS" AS "TENURE_MONTHS",
    "SIGNUP_DATE" AS "SIGNUP_DATE",
    "CREATED_AT" AS "CREATED_AT"
  FROM RETAIL_REGRESSION_DEMO.DS.CUSTOMERS
), SNOWPARK_RIGHT AS (
  SELECT
    "CUSTOMER_ID" AS "r_0001_CUSTOMER_ID",
    "AVG_ORDER_VALUE" AS "AVG_ORDER_VALUE",
    "PURCHASE_FREQUENCY" AS "PURCHASE_FREQUENCY",
    "RETURN_RATE" AS "RETURN_RATE",
    "LIFETIME_VALUE" AS "LIFETIME_VALUE",
    "LAST_PURCHASE_DATE" AS "LAST_PURCHASE_DATE",
    "TOTAL_ORDERS" AS "TOTAL_ORDERS",
    "UPDATED_AT" AS "UPDATED_AT"
  FROM RETAIL_REGRESSION_DEMO.DS.PURCHASE_BEHAVIOR
), cte AS (
  SELECT
    *
  FROM (
    SNOWPARK_LEFT AS SNOWPARK_LEFT
      LEFT OUTER JOIN SNOWPARK_RIGHT AS SNOWPARK_RIGHT
        ON (
          "l_0000_CUSTOMER_ID" = "r_0001_CUSTOMER_ID"
        )
  

### Create Preprocessing FeatureView from Preprocess Dataframe (SQL)

In [36]:
# Define descriptions for the FeatureView's Features.  These will be added as comments to the database object
preprocess_features_desc = {  
   "AVERAGE_ORDER_PER_MONTH":"Average number of orders per month",
   "DAYS_SINCE_LAST_PURCHASE":"Days since last purchase",
   "DAYS_SINCE_SIGNUP":"Days since signup",
   "EXPECTED_DAYS_BETWEEN_PURCHASES":"Expected days between purchases",
   "DAYS_SINCE_EXPECTED_LAST_PURCHASE_DATE":"Days since expected last purchase date from LAST_PURCHASE_DATE"
}

ppd_fv_name    = "FV_PREPROCESS"
ppd_fv_version = "V_1"

try:
   # If FeatureView already exists just return the reference to it
   fv_uc01_preprocess = fs.get_feature_view(name=ppd_fv_name,version=ppd_fv_version)
except Exception:
   # Create the FeatureView instance
   fv_uc01_preprocess_instance = FeatureView(
      name=ppd_fv_name, 
      entities=[customer_entity], 
      feature_df=preprocessed_data,      # <- We can use the snowpark dataframe as-is from our Python
      # feature_df=preprocessed_data.queries['queries'][0],    # <- Or we can use SQL, in this case linted from the dataframe generated SQL to make more human readable
      timestamp_col="BEHAVIOR_UPDATED_AT",
      refresh_freq="60 minute",            # <- specifying optional refresh_freq creates FeatureView as Dynamic Table, else created as View.
      desc="Customer Modelling Features").attach_feature_desc(preprocess_features_desc)

   # Register the FeatureView instance.  Creates  object in Snowflake
   fv_uc01_preprocess = fs.register_feature_view(
      feature_view=fv_uc01_preprocess_instance, 
      version=ppd_fv_version, 
      block=True,     # whether function call blocks until initial data is available
      overwrite=False # whether to replace existing feature view with same name/version
   )
   print(f"Feature View : {ppd_fv_name}_{ppd_fv_version} created")   
else:
   print(f"Feature View : {ppd_fv_name}_{ppd_fv_version} already created")
finally:
   fs.list_feature_views().show(20)
spine = fv_uc01_preprocess

Feature View : FV_PREPROCESS_V_1 already created
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"NAME"         |"VERSION"  |"DATABASE_NAME"         |"SCHEMA_NAME"   |"CREATED_ON"                |"OWNER"       |"DESC"                       |"ENTITIES"    |"REFRESH_FREQ"  |"REFRESH_MODE"  |"SCHEDULING_STATE"  |"WAREHOUSE"  |"CLUSTER_BY"                            |"ONLINE_CONFIG"                                |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [37]:
# You can also use the following to retrieve a Feature View instance for use within Python
FV_UC01_PREPROCESS_V_1 = fs.get_feature_view(ppd_fv_name, 'V_1')

In [38]:
# We can look at the FeatureView's contents with
FV_UC01_PREPROCESS_V_1.feature_df.sort(F.col("CUSTOMER_ID"), ascending=False).show(10)

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"CUSTOMER_ID"  |"AGE"  |"GENDER"  |"STATE"  |"LOYALTY_TIER"  |"TENURE_MONTHS"  |"SIGNUP_DATE"  |"CUSTOMER_CREATED_AT"       |"AVG_ORDER_VALUE"  |"PURCHASE_FREQUENCY"  |"RETURN_RATE"  |"LIFETIME_VALUE"  |"LAST_PURCHASE_DATE"  |"TOTAL_ORDERS"  |"BEHAVIOR_UPDATED_AT"       |"ANNUAL_INCOME"  |"AVERAGE_ORDER_PER_MONTH"  |"DAYS_SINCE_LAST_PURCHASE"  |"DAYS_SINCE_SIGNUP"  |"EXPECTED_DAYS_BETWEEN_PURCHASES"  |"DAYS_SINCE_EXPECTED_LAST_PURCHASE_DATE"  |
----------------------------------------------------------------------------------------------------

### Create training data Dataset from FeatureView_Preprocess

In [39]:
# Create Spine
spine_sdf = get_spine_df(spine)
spine_sdf.sort('CUSTOMER_ID').show(5)

-----------------------------------------
|"CUSTOMER_ID"  |"ASOF_DATE"  |"COL_1"  |
-----------------------------------------
|1              |2025-12-30   |values1  |
|2              |2025-12-30   |values1  |
|3              |2025-12-30   |values1  |
|4              |2025-12-30   |values1  |
|5              |2025-12-30   |values1  |
-----------------------------------------
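
The spine pairs each `CUSTOMER_ID` with an `ASOF_DATE`. When a dataset is generated against a FeatureView that has a timestamp column, each spine row should pick up the most recent feature values recorded at or before its `ASOF_DATE`, which is the as-of join mentioned in the overview. The sketch below illustrates that lookup rule in plain Python. `asof_lookup` and the feature history are hypothetical and only demonstrate the semantics, not how Snowflake implements the join.

```python
from bisect import bisect_right
from datetime import date


def asof_lookup(history, as_of):
    """Return the latest value whose timestamp is <= as_of, else None.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx else None


# Hypothetical feature history for one customer: (update date, TOTAL_ORDERS).
history = [
    (date(2025, 12, 1), 60),
    (date(2025, 12, 19), 64),
    (date(2026, 1, 5), 70),  # later than the spine date, so it is ignored
]

print(asof_lookup(history, date(2025, 12, 30)))  # 64
print(asof_lookup(history, date(2025, 11, 1)))   # None
```

Ignoring rows newer than the spine date is what keeps future information out of the training data.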



In [40]:
from helper.useful_fns import dataset_check_and_update
def generate_training_df(spine_sdf, feature_view, feature_store):
    dataset_name = 'TRAINING_DATASET'
    schema_name = feature_store.list_feature_views().to_pandas()['SCHEMA_NAME'][0]

    dataset_version = dataset_check_and_update(session, dataset_name, schema_name= schema_name)
    # Generate_Dataset
    training_dataset = feature_store.generate_dataset( 
        name = dataset_name,
        version = dataset_version,
        spine_df = spine_sdf, 
        features = [feature_view], 
        spine_timestamp_col = 'ASOF_DATE'
        )                                     
    # Create a snowpark dataframe reference from the Dataset
    training_dataset_sdf = training_dataset.read.to_snowpark_dataframe()
    
    return training_dataset_sdf

In [41]:
# Generate_Dataset
training_dataset_sdf_v1 = generate_training_df(spine_sdf, fv_uc01_preprocess, feature_store=fs)
# Display some sample data
training_dataset_sdf_v1.sort('CUSTOMER_ID').show(5)

['V_1', 'V_2', 'V_3'] V_3




---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"CUSTOMER_ID"  |"ASOF_DATE"  |"COL_1"  |"AGE"  |"GENDER"  |"STATE"  |"LOYALTY_TIER"  |"TENURE_MONTHS"  |"SIGNUP_DATE"  |"CUSTOMER_CREATED_AT"       |"AVG_ORDER_VALUE"   |"PURCHASE_FREQUENCY"  |"RETURN_RATE"  |"LIFETIME_VALUE"    |"LAST_PURCHASE_DATE"  |"TOTAL_ORDERS"  |"ANNUAL_INCOME"  |"AVERAGE_ORDER_PER_MONTH"  |"DAYS_SINCE_LAST_PURCHASE"  |"DAYS_SINCE_SIGNUP"  |"EXPECTED_DAYS_BETWEEN_PURCHASES"  |"DAYS_SINCE_EXPECTED_LAST_PURCHASE_DATE"  |
--------------------------------------------------------------------------------------------------------

In [42]:
training_dataset_sdf_v1.to_pandas().head()

Unnamed: 0,CUSTOMER_ID,ASOF_DATE,COL_1,AGE,GENDER,STATE,LOYALTY_TIER,TENURE_MONTHS,SIGNUP_DATE,CUSTOMER_CREATED_AT,...,RETURN_RATE,LIFETIME_VALUE,LAST_PURCHASE_DATE,TOTAL_ORDERS,ANNUAL_INCOME,AVERAGE_ORDER_PER_MONTH,DAYS_SINCE_LAST_PURCHASE,DAYS_SINCE_SIGNUP,EXPECTED_DAYS_BETWEEN_PURCHASES,DAYS_SINCE_EXPECTED_LAST_PURCHASE_DATE
0,3441,2025-12-30,values1,31,female,SA,medium,21,2024-03-19,2025-12-19 03:18:43.893,...,11.0,14172.320312,2025-12-03,64,88795,3.047619,16,640,9.90099,6
1,3648,2025-12-30,values1,18,male,WA,low,24,2023-12-19,2025-12-19 03:18:43.893,...,18.0,7436.75,2025-12-08,67,87415,2.791667,11,731,10.791367,0
2,6015,2025-12-30,values1,18,male,NSW,medium,8,2025-04-19,2025-12-19 03:18:43.893,...,11.0,2771.850098,2025-12-04,27,22167,3.375,15,244,8.928571,6
3,6035,2025-12-30,values1,53,male,TAS,medium,17,2024-07-19,2025-12-19 03:18:43.893,...,12.0,7231.109863,2025-11-27,65,24284,3.823529,22,518,7.8125,14
4,9284,2025-12-30,values1,34,female,SA,low,11,2025-01-19,2025-12-19 03:18:43.893,...,13.0,4291.930176,2025-12-01,21,60054,1.909091,18,334,15.384615,3


In [43]:
# Python
from time import perf_counter

# ML
import pandas as pd
import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Snowpark
from snowflake.ml.data.data_connector import DataConnector
from snowflake.ml.registry import Registry as ModelRegistry
from snowflake.snowpark import Session, Row
from snowflake.ml.dataset import Dataset
from snowflake.ml.dataset import load_dataset
from snowflake.ml.experiment import ExperimentTracking
from snowflake.ml.experiment.callback.xgboost import SnowflakeXgboostCallback
from snowflake.ml.model.model_signature import infer_signature
from snowflake.snowpark.context import get_active_session

In [44]:
import math
def create_data_connector(session, dataset_name) -> DataConnector:
    """Load data from Snowflake DataSet"""
    ds = Dataset.load(
        session=session, 
        name=dataset_name
    )
    ds_latest_version = str(ds.list_versions()[-1])
    ds_df = load_dataset(
        session, 
        dataset_name, 
        ds_latest_version
    )
    return DataConnector.from_dataset(ds_df)


def compare_params(input_d, extracted_d):
    ignore_keys = ['callbacks'] # Ignore complex objects
    mismatches = []
    
    for key, val in input_d.items():
        if key in ignore_keys: continue
            
        # Check if key exists in extraction
        if key not in extracted_d:
            mismatches.append(f"Missing key: {key}")
            continue
            
        ex_val = extracted_d[key]
        
        # Handle Float vs Int (63 vs 63.0) and NaNs
        if isinstance(val, (int, float)) and isinstance(ex_val, (int, float)):
            # Check for NaN in both (NaN != NaN in Python, so we must handle explicitly)
            if pd.isna(val) and pd.isna(ex_val):
                continue
            if not math.isclose(val, ex_val):
                mismatches.append(f"{key}: {val} (Input) != {ex_val} (Row)")
        
        # Standard comparison for strings/others
        elif val != ex_val:
            mismatches.append(f"{key}: {val} != {ex_val}")
            
    return mismatches

def generate_train_val_set(dataframe: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Generate train and validation dataset"""
    # Split data
    X = dataframe[['RETURN_RATIO', 'FREQUENCY']]
    y = dataframe["RETURN_ROW_PRICE"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print("Split data into train and validation sets")

    # Combine features and target for each split
    train_df = pd.concat([X_train, y_train], axis=1)
    val_df = pd.concat([X_test, y_test], axis=1)
    return train_df, val_df

def build_pipeline(**model_params) -> Pipeline:
    """Create pipeline with preprocessors and model"""
    # Define column types
    feature_cols = ['RETURN_RATIO', 'FREQUENCY'] 

    # Create preprocessing steps
    preprocessor = ColumnTransformer(
        transformers=[
            ('NUM', MinMaxScaler(), feature_cols)
        ],
        remainder='passthrough',
    )

    model = xgb.XGBRegressor(**(model_params))

    return Pipeline([("preprocessor", preprocessor), ("regressor", model)])


def evaluate_model(model: Pipeline, X_test: pd.DataFrame, y_test: pd.Series) -> dict[str, float]:
    """Evaluate model performance"""
    # Make predictions
    y_pred = model.predict(X_test)
    # Calculate metrics
    metrics = {
        "mean_absolute_error": mean_absolute_error(y_test, y_pred),
        "mean_absolute_percentage_error": mean_absolute_percentage_error(y_test, y_pred),
        "r2_score": r2_score(y_test, y_pred),
    }

    return metrics


def train():
    from snowflake.ml.modeling import tune
    from snowflake.ml.modeling.tune.search import RandomSearch, BayesOpt
    session = get_active_session()
    # Get tuner context
    tuner_context = tune.get_tuner_context()
    params = tuner_context.get_hyper_params()
    dm = tuner_context.get_dataset_map()
    model_name = params.pop("model_name")
    mr_schema_name = params.pop("mr_schema_name")
    experiment_name = params.pop("experiment_name")
    
    # Initialize experiment tracking for this trial
    exp = ExperimentTracking(session=session, schema_name=mr_schema_name)
    exp.set_experiment(experiment_name)

    run = exp.start_run()
    print(f"Started run: {run.name}")

    # with exp.start_run():
    # Load data
    train_data = dm["train"].to_pandas()
    val_data = dm["val"].to_pandas()

    # Separate features and target
    X_train = train_data.drop('RETURN_ROW_PRICE', axis=1)
    y_train = train_data['RETURN_ROW_PRICE']
    X_val = val_data.drop('RETURN_ROW_PRICE', axis=1)
    y_val = val_data['RETURN_ROW_PRICE']

    # Train model
    sig = infer_signature(X_train, y_train)
    callback = SnowflakeXgboostCallback(
        exp, model_name=model_name, model_signature=sig
    )
    params['callbacks'] = [callback]

    # Unpack the hyperparameters so they reach XGBRegressor as keyword arguments
    model = build_pipeline(**params)
    # Log model parameters with the log_param(...) or log_params(...) methods
    exp.log_params(params)

    print("Training model...", end="")
    start = perf_counter()
    model.fit(X_train, y_train)
    elapsed = perf_counter() - start
    print(f" done! Elapsed={elapsed:.3f}s")

    # Evaluate model
    print("Evaluating model...", end="")
    start = perf_counter()
    metrics = evaluate_model(
        model,
        X_val,
        y_val,
    )
    elapsed = perf_counter() - start
    print(f" done! Elapsed={elapsed:.3f}s")

    # Log model metrics with the log_metric(...) or log_metrics(...) methods
    exp.log_metrics(metrics)

    # Report to HPO framework (the tuner minimizes mean_absolute_percentage_error)
    tuner_context.report(
        metrics=metrics, 
        model=model
    )
    return {
        "run_name": run.name, 
        "params": params,
        "mean_absolute_error": metrics['mean_absolute_error'],
        "mean_absolute_percentage_error": metrics['mean_absolute_percentage_error'],
        "r2_score": metrics['r2_score'],
        "model": model,
        "X_train": X_train,
        "metrics": metrics
    }
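
The three metrics returned by `evaluate_model` can be checked by hand on a tiny example. The sketch below re-implements them with the standard library only. `regression_metrics` and the sample values are illustrative assumptions, and this simplified MAPE requires non-zero targets, whereas scikit-learn guards the denominator with a small epsilon.

```python
def regression_metrics(y_true, y_pred):
    """Compute MAE, MAPE and R^2 for two equal-length numeric sequences."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    mae = sum(abs(t - p) for t, p in pairs) / n
    # Simplified MAPE: assumes every true value is non-zero.
    mape = sum(abs(t - p) / abs(t) for t, p in pairs) / n
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in pairs)       # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "mean_absolute_error": mae,
        "mean_absolute_percentage_error": mape,
        "r2_score": 1 - ss_res / ss_tot,
    }


# Hypothetical targets and predictions.
metrics = regression_metrics([100, 200, 400], [110, 180, 400])
print(metrics)
```

Here the absolute errors are 10, 20 and 0, so the mean absolute error is 10 and the percentage error averages two 10% misses and one exact hit.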


In [45]:
def train_remote(
        source_dataset: str, 
        model_name: str, 
        mr_schema_name: str,
        experiment_name: str
    ):
    from snowflake.ml.modeling import tune
    from snowflake.ml.modeling.tune.search import RandomSearch, BayesOpt

    # Retrieve session from SPCS service context
    session = Session.builder.getOrCreate()

    # Load data
    print("Loading data...", end="", flush=True)
    start = perf_counter()
    dc = create_data_connector(session, dataset_name=source_dataset)
    df = dc.to_pandas()
    elapsed = perf_counter() - start
    print(f" done! Loaded {len(df)} rows, elapsed={elapsed:.3f}s")

    print(f"Building train/val data")
    train_df, val_df = generate_train_val_set(df)

    X = train_df[['RETURN_RATIO', 'FREQUENCY']]
    y = train_df["RETURN_ROW_PRICE"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    # Create DataConnectors
    dataset_map = {
        "train": DataConnector.from_dataframe(session.create_dataframe(train_df)),
        "val": DataConnector.from_dataframe(session.create_dataframe(val_df)),
    }

    # Define search space for XGBoost
    search_space = {
        'mr_schema_name': mr_schema_name,
        'model_name': model_name,
        'experiment_name': experiment_name,
        'n_estimators': tune.randint(50, 200),
        'random_state': 42,
    }

    # Configure tuner
    tuner_config = tune.TunerConfig(
        metric='mean_absolute_percentage_error',
        mode='min',
        search_alg=RandomSearch(),
        num_trials=2,
    )

    # Create tuner
    tuner = tune.Tuner(
        train_func=train,
        search_space=search_space, 
        tuner_config=tuner_config
    )

    print(f"HPO starting")
    results = tuner.run(dataset_map=dataset_map)

    best_config = results.best_result[0] if isinstance(results.best_result, list) else results.best_result
    best_model = results.best_model[0] if isinstance(results.best_model, list) else results.best_model
    best_config_record = best_config.to_dict(orient='records')[0]
    best_config_dict = {
        str(k).removeprefix('config/'): v 
        for k, v in best_config_record.items() 
        if k.startswith('config/')
    }
    results_df: pd.DataFrame = results.results
    exp = ExperimentTracking(session=session, schema_name=mr_schema_name)
    exp.set_experiment(experiment_name)
    param_cols = [c for c in results_df.columns if str(c).startswith('params/')]

    for index, row in results_df.iterrows():
        run_name = row['run_name']
        exp.start_run(run_name)

        # run = runs.run_name
        metrics = {
            "mean_absolute_error": row['metrics/mean_absolute_error'],
            "mean_absolute_percentage_error": row['metrics/mean_absolute_percentage_error'],
            "r2_score": row['metrics/r2_score'],
        }
        params_series = row[param_cols]
        params_dict = {
            str(k).removeprefix('params/'): v 
            for k, v in params_series.items()
        }
        diffs = compare_params(best_config_dict, params_dict)

        if not diffs:
            # Save model to registry
            print("Logging model to Model Registry...", end="")
            exp.log_model(
                model=best_model, 
                model_name=model_name, 
                metrics=metrics,
                sample_input_data=X_train,
                conda_dependencies=["xgboost"],
            ) # type: ignore
            
        exp.end_run(row['run_name'])
    return {
        "results": results.results,
        "best_config": best_config,
        "best_model": best_model
    }
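
Matching the best configuration to a row of the results table relies on two small steps: stripping a column prefix such as `config/` and comparing values in a way that tolerates `63` versus `63.0` and `NaN` on both sides. The sketch below isolates those steps. `strip_prefix`, `values_match` and the sample record are hypothetical and only mirror the logic of `compare_params` and the dictionary comprehensions above.

```python
import math


def strip_prefix(record: dict, prefix: str) -> dict:
    """Keep only keys that start with `prefix`, with the prefix removed."""
    return {
        key.removeprefix(prefix): value
        for key, value in record.items()
        if key.startswith(prefix)
    }


def values_match(a, b) -> bool:
    """Compare two values, treating NaN == NaN and 63 == 63.0 as matches."""
    numeric = (int, float)
    if isinstance(a, numeric) and isinstance(b, numeric):
        if math.isnan(a) and math.isnan(b):
            return True
        return math.isclose(a, b)
    return a == b


# Hypothetical flattened result row.
record = {
    "config/n_estimators": 63,
    "config/random_state": 42,
    "metrics/r2_score": 0.9,
}

print(strip_prefix(record, "config/"))
print(values_match(63, 63.0), values_match(float("nan"), float("nan")))
```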

In [46]:
train_job = train_remote(
    source_dataset="TPCXAI_SF0001_QUICKSTART_INC._TRAINING_FEATURE_STORE.UC01_TRAINING",
    model_name = "MODEL_1.UC01_SNOWFLAKEML_RF_REGRESSOR_MODEL",
    mr_schema_name = "MODEL_1",
    experiment_name="MY_EXPERIMENT"
)

ImportError: cannot import name 'tune' from 'snowflake.ml.modeling' (unknown location)
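
The `ImportError` above means `snowflake.ml.modeling.tune` could not be resolved in the environment that ran the cell. Checking for the module before calling `train_remote` gives an earlier and clearer failure. The sketch below shows one way to do that with the standard library. `module_available` is a hypothetical helper, and it does not determine why the module is missing (for example package version or runtime).

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located without executing the module itself."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or the name is malformed.
        return False


print(module_available("json"))
print(module_available("package_that_does_not_exist.tune"))

# In this notebook the check would be (not executed here):
# if not module_available("snowflake.ml.modeling.tune"):
#     raise RuntimeError("snowflake.ml.modeling.tune is not available in this environment")
```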

## CLEAN UP

In [None]:
# session.close()

In [None]:
from datetime import datetime
from zoneinfo import ZoneInfo
formatted_time = datetime.now(ZoneInfo("Australia/Melbourne")).strftime("%A, %B %d, %Y %I:%M:%S %p %Z")

print(f"The last run time in Melbourne is: {formatted_time}")