Version: 0.0.2  Updated date: 07/05/2024
Conda Environment : py-snowpark_df_ml_fs-1.15.0_v1

# Getting Started with Snowflake Feature Store
We will use a worked use case to show how Snowflake Feature Store (and Model Registry) can be used to maintain and store features, retrieve them for training, and perform micro-batch inference.

In the development (TRAINING) environment we will
- create FeatureViews in the Feature Store that maintain the required customer-behaviour features.
- use these Features to train a model, and save the model in the Snowflake model-registry.
- plot the residuals of the trained model to verify it visually.

In the production (SERVING) environment we will
- re-create the FeatureViews on production data
- generate an Inference FeatureView that uses the saved model to perform incremental inference

# Feature Engineering & Model Training

In [None]:
%load_ext autoreload
%autoreload 2

#### Notebook Packages

In [None]:
# Python packages
import os
import json
import timeit

# SNOWFLAKE
# Snowpark
from snowflake.snowpark import Session, DataFrame, Window, WindowSpec, Row
from feature_engineering_fns import uc01_load_data, uc01_pre_process_v2
import snowflake.snowpark.functions as F
import snowflake.snowpark.types as T

# Snowflake Feature Store
from snowflake.ml.feature_store import (
    FeatureView,
    Entity)

# COMMON FUNCTIONS
from useful_fns import check_and_update, dataset_check_and_update, create_ModelRegistry, create_FeatureStore, create_SF_Session, get_spine_df

### Setup Snowflake connection and database parameters

In [None]:
# Schemas
tpcxai_training_schema     = 'TRAINING'

In [None]:
fs_qs_role, tpcxai_database, tpcxai_training_schema, session, warehouse_env = create_SF_Session(tpcxai_training_schema, role="ACCOUNTADMIN")

## MODEL DEVELOPMENT
* Create Snowflake Model-Registry
* Create Snowflake Feature-Store
* Establish and Create CUSTOMER Entity in the development Snowflake FeatureStore
* Create Source Data references and perform basic data-cleansing
* Create & Run Preprocessing Function to create features
* Create FeatureView_Preprocess from Preprocess Dataframe SQL
* Create training data from FeatureView_Preprocess (asof join)
* Create & Fit Snowpark-ml pipeline 
* Save model in Model Registry
* 'Verify' and approve model
* Create new FeatureView_Model_Inference with Transforms UDF + XGBoost regressor model
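
The "asof join" step in the list above can be pictured as a point-in-time lookup: each spine row (entity key plus timestamp) is paired with the most recent feature row at or before that timestamp. The sketch below is plain Python over made-up rows, not the Feature Store implementation. The helper `asof_join` and the sample values are illustrative assumptions, and the inclusive comparison is assumed; only the column names come from this notebook.

```python
from bisect import bisect_right
from collections import defaultdict


def asof_join(spine, features, key="O_CUSTOMER_SK",
              spine_ts="ASOF_DATE", feature_ts="LATEST_ORDER_DATE"):
    """Attach to each spine row the latest feature row with timestamp <= spine timestamp."""
    # Group feature rows by entity key and sort each group by timestamp
    by_key = defaultdict(list)
    for row in features:
        by_key[row[key]].append(row)
    for rows in by_key.values():
        rows.sort(key=lambda r: r[feature_ts])

    joined = []
    for s in spine:
        rows = by_key.get(s[key], [])
        stamps = [r[feature_ts] for r in rows]
        # Index of the last feature row at or before the spine timestamp (-1 if none)
        idx = bisect_right(stamps, s[spine_ts]) - 1
        match = rows[idx] if idx >= 0 else {}
        out = dict(s)
        out.update({k: v for k, v in match.items() if k not in (key, feature_ts)})
        joined.append(out)
    return joined


# Made-up sample rows; ISO date strings sort chronologically
features = [
    {"O_CUSTOMER_SK": 1, "LATEST_ORDER_DATE": "2024-01-10", "FREQUENCY": 2.0},
    {"O_CUSTOMER_SK": 1, "LATEST_ORDER_DATE": "2024-03-05", "FREQUENCY": 3.5},
    {"O_CUSTOMER_SK": 2, "LATEST_ORDER_DATE": "2024-02-01", "FREQUENCY": 1.0},
]
spine = [
    {"O_CUSTOMER_SK": 1, "ASOF_DATE": "2024-02-15"},
    {"O_CUSTOMER_SK": 1, "ASOF_DATE": "2024-03-05"},
    {"O_CUSTOMER_SK": 2, "ASOF_DATE": "2024-01-01"},
]
result = asof_join(spine, features)
for row in result:
    print(row)
```

The third spine row predates every feature row for that customer, so it comes back without feature values. This is the property that keeps future information out of the training data.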

In [None]:
# Set the Schema
tpcxai_schema = tpcxai_training_schema

# Create/Reference Snowflake Model Registry - Common across Environments
mr = create_ModelRegistry(session, tpcxai_database, 'MODEL_1')

# Create/Reference Snowflake Feature Store for Training (Development) Environment
fs = create_FeatureStore(session, tpcxai_database, f'''_{tpcxai_schema}_FEATURE_STORE''', warehouse_env)


In [None]:
from snowflake.ml.dataset import Dataset
ds = Dataset.load(session=session, name='TPCXAI_SF0001_QUICKSTART_INC._TRAINING_FEATURE_STORE.UC01_TRAINING')
ds.fully_qualified_name, ds.list_versions()

In [None]:
from snowflake.ml.dataset import load_dataset
training_dataset_sdf_v1 = load_dataset(session, 'TPCXAI_SF0001_QUICKSTART_INC._TRAINING_FEATURE_STORE.UC01_TRAINING', 'V_1')
training_dataset_sdf_v1 = training_dataset_sdf_v1.read.to_snowpark_dataframe()

In [None]:
# Display some sample data
training_dataset_sdf_v1.sort('O_CUSTOMER_SK').show(5)

In [None]:
training_dataset_sdf_v1.to_pandas().head()

### Fit Snowpark-ML Transforms & Model using Dataset training data

We fit the transformers on the training Dataset so that the same global values (for example, each column's minimum and maximum) are used when transforming the data for training and, later, for inference with the model.

These transforms are model-specific: they are persisted within the model pipeline, not stored in the Feature Store.
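
To make the point about shared global values concrete, here is a minimal plain-Python sketch of min-max scaling in which the parameters are learned once from the training rows and then reused unchanged on new rows. The helpers `fit_min_max` and `apply_min_max` and the sample values are hypothetical; in this notebook the pipeline's MinMaxScaler plays this role.

```python
def fit_min_max(rows, cols):
    """Learn the per-column (min, max) from the training rows only."""
    return {c: (min(r[c] for r in rows), max(r[c] for r in rows)) for c in cols}


def apply_min_max(rows, params):
    """Scale rows with previously learned parameters; nothing is re-fitted here."""
    scaled = []
    for r in rows:
        out = dict(r)
        for c, (lo, hi) in params.items():
            span = hi - lo
            # A constant training column has zero span; map it to 0.0
            out[c] = (r[c] - lo) / span if span else 0.0
        scaled.append(out)
    return scaled


# Made-up training and inference rows
train_rows = [
    {"FREQUENCY": 1.0, "RETURN_RATIO": 0.0},
    {"FREQUENCY": 5.0, "RETURN_RATIO": 0.5},
]
new_rows = [{"FREQUENCY": 3.0, "RETURN_RATIO": 0.25}]

params = fit_min_max(train_rows, ["FREQUENCY", "RETURN_RATIO"])
scaled_new = apply_min_max(new_rows, params)
print(params)
print(scaled_new)
```

In this sketch, a new value outside the training range maps outside [0, 1], because the stored minimum and maximum are not updated at inference time.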

In [None]:
def train_test_split(training_dataset_sdf):
    weights = [0.6, 0.4]
    training_dataset_sdf = training_dataset_sdf.with_column("FREQUENCY", F.round(F.col("FREQUENCY"), 3))
    training_dataset_sdf = training_dataset_sdf.with_column("RETURN_RATIO", F.round(F.col("RETURN_RATIO"), 3))
    training_dataset_sdf = training_dataset_sdf.with_column("RETURN_ROW_PRICE", F.round(F.col("RETURN_ROW_PRICE"), 3))
    training_dataset_sdf = training_dataset_sdf.select(['RETURN_RATIO', 'FREQUENCY', 'RETURN_ROW_PRICE'])

    train_df, test_df = training_dataset_sdf.random_split(weights, seed=42) # Using a seed for reproducibility
    return train_df, test_df
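
The seed in the cell above is what makes the split repeatable. The plain-Python sketch below illustrates the idea with a hypothetical `seeded_split` helper that assigns each row to a split by a weighted random draw. It is an illustration only, not how Snowpark implements `random_split`, and the resulting split sizes are approximate rather than exact.

```python
import random


def seeded_split(rows, weights, seed):
    """Assign each row to one split with probability proportional to its weight."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for i, w in enumerate(weights):
            cumulative += w
            if draw < cumulative:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point rounding at the upper edge
            splits[-1].append(row)
    return splits


rows = list(range(1000))
train_a, test_a = seeded_split(rows, [0.6, 0.4], seed=42)
train_b, test_b = seeded_split(rows, [0.6, 0.4], seed=42)
print(len(train_a), len(test_a))
print(train_a == train_b)  # same seed, same assignment
```

Every row lands in exactly one split, and re-running with the same seed reproduces the same assignment.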

In [None]:
#### Use-Case 01 - Specific Packages
# from sklearn.pipeline import Pipeline as sml_Pipeline
from snowflake.ml.modeling.pipeline.pipeline import Pipeline as sml_Pipeline
# from sklearn.preprocessing import MinMaxScaler as sml_MinMaxScaler
from snowflake.ml.modeling.preprocessing.min_max_scaler import MinMaxScaler as sml_MinMaxScaler
# from sklearn.compose import ColumnTransformer as sml_ColumnTransformer
from snowflake.ml.modeling.compose.column_transformer import ColumnTransformer as sml_ColumnTransformer

# from xgboost import XGBRegressor
from snowflake.ml.modeling.xgboost.xgb_regressor import XGBRegressor as sml_XGBRegressor
import pandas as pd
from snowflake.snowpark.dataframe import DataFrame
def uc01_train(featurevector: DataFrame, n_estimators=100):
    feature_cols = ['RETURN_RATIO', 'FREQUENCY'] 
    mms_output_cols = ['RETURN_RATIO', 'FREQUENCY', 'RETURN_ROW_PRICE']
       
    target_col = "RETURN_ROW_PRICE"

    # Create preprocessing steps
    preprocessor = sml_ColumnTransformer(
        transformers=[
            ('NUM', sml_MinMaxScaler(), feature_cols)
        ],
        remainder='passthrough',
        output_cols=mms_output_cols
    )

    # Create pipeline
    pipeline = sml_Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', sml_XGBRegressor(
                random_state=42,
                n_estimators=n_estimators,
                input_cols=feature_cols,
                label_cols=[target_col],
                output_cols=['PREDICTION']
            )
        )
    ])

    model = pipeline.fit(featurevector)
    return model

In [None]:
#### Use-Case 01 - Specific Packages
from sklearn.pipeline import Pipeline as sk_Pipeline
from sklearn.preprocessing import MinMaxScaler as sk_MinMaxScaler
from sklearn.compose import ColumnTransformer as sk_ColumnTransformer
from xgboost import XGBRegressor
import pandas as pd

def uc01_train_sklearn(featurevector: pd.DataFrame, n_estimators=100):
    feature_cols = ['RETURN_RATIO', 'FREQUENCY'] 
    mms_output_cols = ['RETURN_RATIO', 'FREQUENCY', 'RETURN_ROW_PRICE']
       
    target_col = "RETURN_ROW_PRICE"

    # Create preprocessing steps
    preprocessor = sk_ColumnTransformer(
        transformers=[
            ('NUM', sk_MinMaxScaler(), feature_cols)
        ],
        remainder='passthrough'
    )

    # Create pipeline
    pipeline = sk_Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', XGBRegressor(
                random_state=42,
                n_estimators=n_estimators
            )
        )
    ])

    X = featurevector[feature_cols]
    y = featurevector[[target_col]]
    model = pipeline.fit(X, y)
    return model

In [None]:
## Fit the Model
model_name = "MODEL_1.UC01_SNOWFLAKEML_RF_REGRESSOR_MODEL"
train_df, test_df = train_test_split(training_dataset_sdf_v1)
train_pd_df = train_df.to_pandas()

In [None]:
train_result = uc01_train(train_df)
train_sklern_result = uc01_train_sklearn(train_pd_df)

# Check for the latest version of this model in registry, and increment version
mr_df = mr.show_models()
model_version = check_and_update(mr_df, model_name)
print('model version:\t',model_version)
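
The implementation of `check_and_update` lives in `useful_fns` and is not shown in this notebook. As a rough idea of what a version-increment helper of this kind might do, the sketch below derives the next `V_<n>` name from a list of existing version names. The function `next_version` and the `V_` naming pattern are assumptions for illustration, not the actual helper.

```python
import re


def next_version(existing_versions, prefix="V_"):
    """Return prefix + (highest numeric suffix + 1), or prefix + '1' if none match."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$", re.IGNORECASE)
    numbers = [int(m.group(1)) for v in existing_versions if (m := pattern.match(v))]
    return f"{prefix}{max(numbers, default=0) + 1}"


print(next_version([]))
print(next_version(["V_1", "V_2", "V_10"]))
print(next_version(["V_3", "draft"]))
```

Comparing the numeric suffixes as integers avoids the string-ordering trap in which `V_10` would sort before `V_2`.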

In [None]:
sample = train_df.limit(10)
sample_pd = train_pd_df.head(10)

In [None]:
# Save the Model to the Model Registry
model_base = mr.log_model(
    model= train_result,
    model_name= model_name,
    comment="TPCXAI USE CASE 01 - XGB Regressor",
    sample_input_data=sample,
    # options= {
    #     "enable_explainability": True,
    #     "method_options": {"predict": {"case_sensitive": True}, "explain": {"case_sensitive": True}}
    # }
)

In [None]:
# Save the Model to the Model Registry
model_sklearn = mr.log_model(
    model= train_sklern_result,
    model_name= model_name + "_SKLEARN",
    comment="TPCXAI USE CASE 01 - XGB Regressor",
    sample_input_data=sample_pd[['RETURN_RATIO', 'FREQUENCY']],
    options= {
        "enable_explainability": True,
    }
)

In [None]:
# Get and set default for latest version of the model
m = mr.get_model(model_name)
latest_version = m.show_versions().iloc[-1]['name']
mv = m.version(latest_version)
m.default = latest_version
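
Selecting `iloc[-1]` relies on `show_versions()` returning the versions in creation order. If that ordering is not guaranteed in your environment, one option is to pick the newest version explicitly. The sketch below uses plain dictionaries standing in for the rows of `show_versions()`; the `created_on` field name and the helper `latest_version_name` are assumptions to check against the actual output.

```python
from datetime import datetime


def latest_version_name(version_rows):
    """Return the name of the version with the most recent creation timestamp."""
    return max(version_rows, key=lambda r: r["created_on"])["name"]


# Made-up rows, deliberately out of order
version_rows = [
    {"name": "V_2", "created_on": datetime(2024, 5, 7, 10, 0)},
    {"name": "V_1", "created_on": datetime(2024, 5, 6, 9, 0)},
    {"name": "V_3", "created_on": datetime(2024, 5, 7, 12, 30)},
]
newest = latest_version_name(version_rows)
print(newest)
```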

## We noticed the feature wasn't correct

In [None]:
order_tbl = '.'.join([tpcxai_database, tpcxai_schema,'ORDERS'])
order_sdf = session.table(order_tbl)
customer_entity = fs.get_entity("ORDER")
# Tables
line_item_tbl            = '.'.join([tpcxai_database, tpcxai_schema,'LINEITEM'])
order_returns_tbl        = '.'.join([tpcxai_database, tpcxai_schema,'ORDER_RETURNS'])
# Snowpark Dataframe
line_item_sdf            = session.table(line_item_tbl)
order_returns_sdf        = session.table(order_returns_tbl)
raw_data = uc01_load_data(order_sdf, line_item_sdf, order_returns_sdf)

In [None]:
def generate_training_df(spine_sdf, feature_view):
    dataset_name = 'UC01_TRAINING_V2'
    dataset_version = dataset_check_and_update(session, dataset_name)
    # Generate_Dataset
    training_dataset = fs.generate_dataset( 
        name = dataset_name,
        version = dataset_version,
        spine_df = spine_sdf, 
        features = [feature_view], 
        spine_timestamp_col = 'ASOF_DATE'
        )                                     
    # Create a snowpark dataframe reference from the Dataset
    training_dataset_sdf = training_dataset.read.to_snowpark_dataframe()
    
    return training_dataset_sdf

In [None]:
preprocessed_data_v2 = uc01_pre_process_v2(raw_data)
ppd_fv_name    = "FV_UC01_PREPROCESS"
ppd_fv_version = "V_2"
# Define descriptions for the FeatureView's Features.  These will be added as comments to the database object
preprocess_features_desc = {  "FREQUENCY":"Average yearly order frequency",
                              "RETURN_RATIO":"Average of, Per Order Returns Ratio.  Per order returns ratio : total returns value / total order value" }

# Create the FeatureView instance
fv_uc01_preprocess_instance = FeatureView(
    name=ppd_fv_name, 
    entities=[customer_entity], 
    feature_df=preprocessed_data_v2,      # <- We can use the snowpark dataframe as-is from our Python
    # feature_df=preprocessed_data_v2.queries['queries'][0],    # <- Or we can use SQL, in this case linted from the dataframe generated SQL to make more human readable
    timestamp_col="LATEST_ORDER_DATE",
    refresh_freq="60 minute",            # <- specifying optional refresh_freq creates FeatureView as Dynamic Table, else created as View.
    desc="Features to support Use Case 01").attach_feature_desc(preprocess_features_desc)

# Register the FeatureView instance.  Creates  object in Snowflake
fv_uc01_preprocess_v2 = fs.register_feature_view(
    feature_view=fv_uc01_preprocess_instance, 
    version=ppd_fv_version, 
    block=True,     # whether function call blocks until initial data is available
    overwrite=False # whether to replace existing feature view with same name/version
)
spine_sdf = get_spine_df(fv_uc01_preprocess_v2)
training_dataset_sdf_v2 = generate_training_df(spine_sdf, fv_uc01_preprocess_v2)

In [None]:
## Fit the Model
model_name = "MODEL_1.UC01_SNOWFLAKEML_XGB_REGRESSOR_MODEL_V2"
train_df_v2, test_df_v2 = train_test_split(training_dataset_sdf_v2)
train_result = uc01_train(train_df_v2)

# Check for the latest version of this model in registry, and increment version
mr_df = mr.show_models()
model_version = check_and_update(mr_df, model_name)
print('model version:\t',model_version)
# Save the Model to the Model Registry, using sample rows from the V2 training data
sample = train_df_v2.to_pandas().head(10)
mv_xgb = mr.log_model(
    model= train_result,
    model_name= model_name,
    version_name= model_version,
    comment="TPCXAI USE CASE 01 - XGB Regressor",
    sample_input_data=sample,
    options= {
        # "enable_explainability": True
    }
)
# Get and set default for latest version of the model
m = mr.get_model(model_name)
latest_version = m.show_versions().iloc[-1]['name']
mv = m.version(latest_version)
m.default = latest_version

In [None]:
model_name = "MODEL_1.UC01_SNOWFLAKEML_XGB_REGRESSOR_MODEL_V2"
train_result = uc01_train(train_df_v2, n_estimators = 110)
# Check for the latest version of this model in registry, and increment version
mr_df = mr.show_models()
model_version = check_and_update(mr_df, model_name)
print('model version:\t',model_version)
# Save the Model to the Model Registry, using sample rows from the V2 training data
sample = train_df_v2.to_pandas().head(10)
mv_xgb = mr.log_model(
    model= train_result,
    model_name= model_name,
    version_name= model_version,
    comment="TPCXAI USE CASE 01 - XGB Regressor",
    sample_input_data=sample,
    options= {
        # "enable_explainability": True
    }
)
# Get and set default for latest version of the model
m = mr.get_model(model_name)
latest_version = m.show_versions().iloc[-1]['name']
mv = m.version(latest_version)
m.default = latest_version

In [None]:
# Generate a training set as a Snowpark DataFrame
training_set_sdf = fs.generate_training_set(
                                        spine_df = spine_sdf, 
                                        features = [fv_uc01_preprocess_v2], 
                                        spine_timestamp_col = 'ASOF_DATE'
                                        )                                     
# Display some sample data
training_set_sdf.sort('O_CUSTOMER_SK').show(5)

In [None]:
mr.show_models()


In [None]:
model_name = "MODEL_1.UC01_SNOWFLAKEML_XGB_REGRESSOR_MODEL_V2"
train_result = uc01_train(train_df_v2, n_estimators = 110)
# Check for the latest version of this model in registry, and increment version
mr_df = mr.show_models()
model_version = check_and_update(mr_df, model_name)
print('model version:\t',model_version)
# Save the Model to the Model Registry
mv_kmeans = mr.log_model(
    model= train_result,
    model_name= model_name,
    version_name= model_version,
    comment="TPCXAI USE CASE 01 - XGB Regressor",
    sample_input_data=sample,
    options= {
        # "enable_explainability": True
    }
)
# Get and set default for latest version of the model
m = mr.get_model(model_name)
latest_version = m.show_versions().iloc[-1]['name']
mv = m.version(latest_version)
m.default = latest_version

#### Check the model predictions

We will check the predictions produced by the model. We create an inference function using the Snowflake Model Registry. This packages our model as a Python function, which can be called from [Python](https://docs.snowflake.com/developer-guide/snowpark-ml/model-registry/overview#calling-model-methods) or directly from [SQL](https://docs.snowflake.com/sql-reference/commands-model#label-snowpark-model-registry-model-methods). The model can therefore be used for prediction within our Feature Engineering pipeline by creating an inference Feature View.

In [None]:
def uc01_serve(featurevector, model) -> DataFrame:
    # Run the registered model version's predict method over the feature rows
    predictions = model.run(featurevector, function_name="predict")
    return predictions

In [None]:
# Create Spine
inference_spine_sdf =  fv_uc01_preprocess_v2.feature_df.group_by('O_CUSTOMER_SK').agg(F.max('LATEST_ORDER_DATE').as_('ASOF_DATE'))

# Retrieve the feature values for each spine row
inference_dataset_sdf = fs.retrieve_feature_values(spine_df = inference_spine_sdf, features = [fv_uc01_preprocess_v2],  spine_timestamp_col = 'ASOF_DATE' )
#inference_dataset_sdf = fs.read_feature_view(fv_uc01_preprocess_v2)

start = timeit.default_timer()
# serve_result = uc01_serve(training_dataset_sdf, train_result['MODEL'])
inference_result_sdf = uc01_serve(inference_dataset_sdf, mv)
end = timeit.default_timer()
serve_time = end - start
print('serve time:\t', serve_time)
inference_result_sdf.show()
# inference_sample_sdf = inference_result_sdf.sample(n = 10000)
# inference_sample_sdf.show()
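
The inference spine built at the top of the cell above holds one row per customer, with that customer's most recent `LATEST_ORDER_DATE` as `ASOF_DATE`. A plain-Python equivalent over made-up rows may make the shape of the spine easier to see. The helper `build_spine` and the sample values are illustrative assumptions; the column names are the ones used in this notebook.

```python
def build_spine(feature_rows, key="O_CUSTOMER_SK",
                ts="LATEST_ORDER_DATE", out_ts="ASOF_DATE"):
    """Group by entity key and keep the maximum timestamp per key."""
    latest = {}
    for row in feature_rows:
        k = row[key]
        if k not in latest or row[ts] > latest[k]:
            latest[k] = row[ts]
    # One spine row per entity, ordered by key for readability
    return [{key: k, out_ts: v} for k, v in sorted(latest.items())]


# Made-up feature rows; ISO date strings sort chronologically
feature_rows = [
    {"O_CUSTOMER_SK": 2, "LATEST_ORDER_DATE": "2024-02-01"},
    {"O_CUSTOMER_SK": 1, "LATEST_ORDER_DATE": "2024-01-10"},
    {"O_CUSTOMER_SK": 1, "LATEST_ORDER_DATE": "2024-03-05"},
]
spine_rows = build_spine(feature_rows)
print(spine_rows)
```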

In [None]:
inference_result_sdf.to_pandas()

In [None]:
# Plot residuals (actual - predicted) against the predicted values
import matplotlib.pyplot as plt

plt_df = inference_result_sdf.select(F.col("PREDICTION"), F.col("RETURN_ROW_PRICE")).to_pandas()
y_predicted = plt_df['PREDICTION']
y_actual = plt_df['RETURN_ROW_PRICE']
# --- 1. Calculate Residuals ---
# Residual = actual value minus predicted value
residuals = y_actual - y_predicted

# --- 2. Create the Residual Plot ---
fig, ax = plt.subplots(figsize=(10, 6))

ax.scatter(y_predicted, residuals, edgecolors=(0, 0, 0), alpha=0.8)

# --- 3. Add a Horizontal Line at Zero ---
# This line represents zero error
ax.axhline(y=0, color='r', linestyle='--', lw=2)

# --- 4. Add Labels and Title ---
ax.set_xlabel('Predicted Values', fontsize=12)
ax.set_ylabel('Residuals (Actual - Predicted)', fontsize=12)
ax.set_title('Residual Plot', fontsize=16)

plt.show()
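
A residual plot is easier to interpret alongside a few summary numbers. The sketch below computes the mean residual, mean absolute error, and root mean squared error in plain Python. The helper `residual_summary` and the sample values are made up for illustration; with the pandas Series from the cell above you could pass `y_actual` and `y_predicted` directly.

```python
import math


def residual_summary(y_actual, y_predicted):
    """Summarise residuals (actual - predicted) with mean, MAE and RMSE."""
    residuals = [a - p for a, p in zip(y_actual, y_predicted)]
    n = len(residuals)
    return {
        "mean_residual": sum(residuals) / n,          # systematic over/under-prediction
        "mae": sum(abs(r) for r in residuals) / n,    # typical error size
        "rmse": math.sqrt(sum(r * r for r in residuals) / n),  # penalises large errors more
    }


summary = residual_summary([10.0, 20.0, 30.0], [12.0, 18.0, 30.0])
print(summary)
```

A mean residual far from zero suggests systematic bias, which would show up in the plot as points sitting mostly above or below the dashed zero line.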

We can look at the query that contains our inference function. It uses the Model Registry SQL API to call the inference function: `MODEL_VERSION_ALIAS!PREDICT(RETURN_RATIO, FREQUENCY) AS TMP_RESULT`.
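
For orientation, the general shape of such a query can be sketched by assembling the SQL text by hand. The helper `build_predict_sql`, the model name `MY_DB.MODEL_1.MY_MODEL`, and the table `MY_DB.TRAINING.MY_FEATURES` below are hypothetical placeholders. The query Snowpark actually generates will be longer and may differ in detail.

```python
def build_predict_sql(model_fqn, version, source_table, input_cols,
                      alias="MODEL_VERSION_ALIAS"):
    """Build a query that binds a model version to an alias and calls its PREDICT method."""
    args = ", ".join(input_cols)
    return (
        f"WITH {alias} AS MODEL {model_fqn} VERSION {version}\n"
        f"SELECT *, {alias}!PREDICT({args}) AS TMP_RESULT\n"
        f"FROM {source_table}"
    )


# Hypothetical object names, for illustration only
sql = build_predict_sql(
    model_fqn="MY_DB.MODEL_1.MY_MODEL",
    version="V_1",
    source_table="MY_DB.TRAINING.MY_FEATURES",
    input_cols=["RETURN_RATIO", "FREQUENCY"],
)
print(sql)
```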

In [None]:

ind_sql = inference_result_sdf.queries['queries'][0]
# Print at most the first 1000 lines; splitlines() handles any newline style
ind_fmtd_sql = '\n'.join(ind_sql.splitlines()[:1000])
print(ind_fmtd_sql)

## CLEAN UP

In [None]:
# session.close()

In [None]:
from datetime import datetime
from zoneinfo import ZoneInfo
formatted_time = datetime.now(ZoneInfo("Australia/Melbourne")).strftime("%A, %B %d, %Y %I:%M:%S %p %Z")

print(f"The last run time in Melbourne is: {formatted_time}")