# Demo of Snowpark end-to-end Machine Learning

The purpose of this demo is to show how the Snowpark ML library can be used for end-to-end machine learning.

The dataset is from https://archive-beta.ics.uci.edu/dataset/222/bank+marketing  
**Run 00_Load_demo_data.ipynb to upload the Parquet files used for this Notebook**

It has the following columns:  
**bank client data**:  
1 - age (numeric)  
2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",  
                                   "blue-collar","self-employed","retired","technician","services")   
3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)  
4 - education (categorical: "unknown","secondary","primary","tertiary")  
5 - default: has credit in default? (binary: "yes","no")  
6 - balance: average yearly balance, in euros (numeric)   
7 - housing: has housing loan? (binary: "yes","no")  
8 - loan: has personal loan? (binary: "yes","no")  
**related with the last contact of the current campaign**:  
9 - contact: contact communication type (categorical: "unknown","telephone","cellular")   
10 - day: last contact day of the month (numeric)  
11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")  
12 - duration: last contact duration, in seconds (numeric)  
**other attributes**:  
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)  
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)  
15 - previous: number of contacts performed before this campaign and for this client (numeric)  
16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")  
**Output variable (desired target)**:  
17 - y - has the client subscribed to a term deposit? (binary: "yes","no")  

Start by importing needed libraries

In [None]:
import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

In [None]:
# Imports 
import snowflake.snowpark as S
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T
from snowflake.snowpark import Window

from snowflake.ml.version import VERSION as snowml_version

import snowflake.ml.modeling.preprocessing as pp
from snowflake.ml.modeling.pipeline import Pipeline
from snowflake.ml.modeling.ensemble import RandomForestClassifier
from snowflake.ml.modeling.metrics import correlation, precision_recall_fscore_support, accuracy_score, confusion_matrix
from snowflake.ml.registry import model_registry

import json

# Make sure we do not get line breaks when doing show on wide dataframes
from IPython.display import HTML, display
display(HTML("<style>pre { white-space: pre !important; }</style>"))

import pandas as pd
import sqlparse

from matplotlib import pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sn
%matplotlib inline

# Print the version of Snowpark we are using
print(f"Using Snowpark: {S.__version__}")
print(f"Using Snowflake ML: {snowml_version}")

Helper functions for nicer printing of Snowpark dataframe schema and SQL.

In [None]:
# Helper functions for nicer printing
def print_sql(df):
    for query in df.queries['queries']:
        print(sqlparse.format(query, reindent=True))

def print_schema(df):
    print("schema:")
    for col in df.schema.fields:
        print(f" |-- {col.name}: {col.datatype} (Nullable: {col.nullable})")

def shape(df):
    return (df.count(), len(df.columns))

Connect to Snowflake

This example is using a JSON file with the following structure
```
{
    "account":"MY SNOWFLAKE ACCOUNT",
    "user": "MY USER",
    "password":"MY PASSWORD",
    "role":"MY ROLE",
    "warehouse":"MY WH",
    "database":"MY DB",
    "schema":"MY SCHEMA"
}

```
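
A missing key in this file only shows up later as a connection error, so it can help to check the file before building the session. The sketch below is one way to do that. `load_connection_parameters` and `REQUIRED_KEYS` are hypothetical names, not part of Snowpark, and we assume that every key in the structure above is required.

```python
import json
import os
import tempfile

# Keys taken from the JSON structure shown above (assumed to all be required).
REQUIRED_KEYS = {"account", "user", "password", "role", "warehouse", "database", "schema"}


def load_connection_parameters(path):
    """Load a JSON credentials file and fail early if a key is missing."""
    with open(path) as f:
        params = json.load(f)
    missing = REQUIRED_KEYS - set(params)
    if missing:
        raise KeyError(f"Missing connection parameters: {sorted(missing)}")
    return params


# Usage with a throwaway file that holds placeholder values only.
example = {key: f"MY_{key.upper()}" for key in REQUIRED_KEYS}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(example, tmp)
params = load_connection_parameters(tmp.name)
os.remove(tmp.name)
print(sorted(params))
```

The returned dictionary has the same shape as the one passed to `Session.builder.configs` in the next cell.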

In [None]:
with open('../creds.json') as f:
    connection_parameters = json.load(f)

session = Session.builder.configs(connection_parameters).create()
print("Current role: " + session.get_current_role() + ", Current schema: " + session.get_fully_qualified_current_schema() + ", Current WH: " + session.get_current_warehouse())

In [None]:
# Parameters
source_path = "@SOURCE_FILES/BANK_MARKETING" # Where the source parquet files are stored
sp_udf_stage = "BANK_STAGE" # Name of the stage used for storing the code for the SP and UDF, as well as the trained model files

## Loading of source data
### Loading Parquet files with inferring the schema.

In [None]:
session.sql(f"ls {source_path}").select('"name"').show(30, max_width=150)

Take a peek at the files. To read a file as Parquet with a SELECT, we need to create a temporary file format object.

In [None]:
session.sql("create or replace temp file format parq1 type='PARQUET'").collect()
session.sql(f"select $1 from {source_path} (file_format=>parq1 )").show(2)

Load the Parquet files and infer the schema. When using **read.parquet**, the file format object is created automatically for us.

In [None]:
df_reader = session.read.parquet(source_path)
df_reader.show()

The SQL for df_reader shows that the data is read directly from the stage every time we access it.

In [None]:
print_sql(df_reader)

Save the data into a physical table in Snowflake

In [None]:
session.sql("DROP TABLE IF EXISTS bank_marketing_snowml").collect()
df_reader.copy_into_table("bank_marketing_snowml")

## Create a Snowpark Dataframe

In [None]:
df_bank_marketing = session.table("bank_marketing_snowml")
display(f"Dataframe shape: {shape(df_bank_marketing)}")
df_bank_marketing.show()

In [None]:
print_sql(df_bank_marketing)

## Data understanding

Start by verifying the data types. Simply put, we will treat character columns as categorical.

In [None]:
print_schema(df_bank_marketing)

DAY is stored as a number but can be treated as categorical, since a month has a fixed number of days. Changing its data type to character does that. The same cell also renames DEFAULT to CREDIT_DEFAULT.

In [None]:
df_bank_marketing_prep = df_bank_marketing.with_column("DAY", F.to_varchar(F.col("DAY"))).with_column_renamed("DEFAULT", "CREDIT_DEFAULT")
print_schema(df_bank_marketing_prep)

Get basic statistics about the categorical and numeric columns

In [None]:
df_bank_marketing_prep.describe().show()

Create variables with our categorical, numeric and target column names so we can use them with encoders and scalers

In [None]:
cat_cols = [c.name for c in df_bank_marketing_prep.schema.fields if isinstance(c.datatype, T.StringType) and c.name != 'Y']
numeric_types = (T.DecimalType, T.LongType, T.DoubleType, T.FloatType, T.IntegerType)
num_cols = [c.name for c in df_bank_marketing_prep.schema.fields if isinstance(c.datatype, numeric_types)]
target_col = "Y"

Distribution of target values

In [None]:
df_bank_marketing_prep.group_by(target_col).count().show()

Frequency tables for each categorical feature

In [None]:
for col in cat_cols:
    # show() prints the result itself and returns None, so it is not wrapped in display()
    df_bank_marketing_prep.select(F.count_distinct(col).as_(f"{col} distinct values")).show()
    df_bank_marketing_prep.group_by(col).count()\
                          .select(col, (F.call_function("RATIO_TO_REPORT", F.col("COUNT")).over() * 100).as_("% observations"))\
                          .sort(F.col("% observations").desc()).show(31)
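
RATIO_TO_REPORT divides each group's count by the sum of all counts. The plain-Python sketch below shows the same calculation on a small made-up sample. The values and the helper name `ratio_to_report` are assumptions for illustration only.

```python
from collections import Counter


def ratio_to_report(counts):
    """Return each group's share of the total, in percent."""
    total = sum(counts.values())
    return {group: count / total * 100 for group, count in counts.items()}


# Hypothetical sample of a categorical column.
marital = ["married", "single", "married", "divorced",
           "married", "single", "married", "single"]
counts = Counter(marital)
shares = ratio_to_report(counts)

# Sort descending, like the .sort(... .desc()) call in the cell above.
for group, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group:10s} {pct:5.1f}")
```

The shares always add up to 100, which is a quick sanity check for the frequency tables.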

Relationship between each of the categorical features and the target column

In [None]:
for col in cat_cols:
    window = Window.partition_by(col)
    # show() prints the result itself and returns None, so it is not wrapped in display()
    df_bank_marketing_prep.group_by(col, F.col(target_col))\
                          .count()\
                          .select(col, F.col(target_col), (F.call_function("RATIO_TO_REPORT", F.col("COUNT")).over(window) * 100).as_("percentage"))\
                          .pivot(target_col, ['no', 'yes']).agg(F.sum("percentage")).show(50)
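
With a window partitioned by the categorical column, the ratio is computed within each category value, so the "no" and "yes" shares of one category add up to 100. A minimal sketch in plain Python, using made-up counts:

```python
from collections import defaultdict

# Hypothetical (category, target, count) rows, shaped like the output of
# group_by(col, target_col).count().
rows = [
    ("married", "no", 80), ("married", "yes", 20),
    ("single", "no", 30), ("single", "yes", 10),
]

# Total per partition (the category value), like Window.partition_by(col).
totals = defaultdict(int)
for category, _, count in rows:
    totals[category] += count

# Share of each target value within its partition, pivoted to one entry per category.
pivoted = defaultdict(dict)
for category, target, count in rows:
    pivoted[category][target] = count / totals[category] * 100

for category, target_shares in pivoted.items():
    print(category, target_shares)
```

Reading across one row of the pivoted result then shows how strongly a category value leans towards "yes" or "no".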


Check the correlation between all numeric variables using the correlation function

In [None]:
corr_matrix = correlation(df=df_bank_marketing_prep)
corr_matrix

In [None]:
sn.heatmap(corr_matrix, annot=True)
plt.show()
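
Each cell of the matrix is a correlation coefficient between two numeric columns. The sketch below computes the Pearson coefficient in plain Python, on the assumption that this is the coefficient the correlation function returns. The column values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for two numeric columns.
duration = [100, 200, 300, 400, 500]
campaign = [5, 4, 3, 2, 1]

print(pearson(duration, duration))  # a column against itself
print(pearson(duration, campaign))  # a perfectly decreasing relationship
```

Values close to 1 or -1 on the heatmap point to columns that carry nearly the same information.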

Have a look at the PDAYS column: it holds the number of days since the last contact, and is -1 if the customer has never been contacted

In [None]:
df_bank_marketing_prep.group_by("PDAYS").count().sort(F.col("COUNT").desc()).show()

The majority of the customers have never been contacted (PDAYS is -1). What are the min and max values?

In [None]:
min_max = df_bank_marketing_prep.select(F.min("PDAYS").as_("MIN_VAL"), F.max("PDAYS").as_("MAX_VAL")).collect()[0]
min_max

Since the value range is rather wide, we can bin it so we get fewer values. There are multiple ways to do this, but to keep things simple we will create 20 equal-width bins. We could use the WIDTH_BUCKET function in Snowflake for this, but we also want to name the bins after their range, so we create a function that dynamically generates the bins and their labels.
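
For comparison, the sketch below mimics in plain Python what an equal-width bucket function such as WIDTH_BUCKET returns: a bucket number rather than a readable label. The function name and the sample range are assumptions for illustration.

```python
def width_bucket(value, min_value, max_value, num_buckets):
    """Equal-width bucket number: 0 below the range, num_buckets + 1 at or above it."""
    if value < min_value:
        return 0
    if value >= max_value:
        return num_buckets + 1
    width = (max_value - min_value) / num_buckets
    return int((value - min_value) // width) + 1


# Hypothetical PDAYS values, bucketed into 20 bins over the range [0, 880).
for pdays in [-1, 0, 44, 450, 871]:
    print(pdays, width_bucket(pdays, 0, 880, 20))
```

A bucket number such as 11 says little on its own, which is why the notebook generates range labels instead.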

In [None]:
def manual_bucketize(df, column, max_values=(), labels=()):
    """Replace column with column_BUCKET, using the first label whose max value is above the column value."""
    condition = None
    for idx, bucket in enumerate(labels):
        if idx <= len(max_values) - 1:
            if condition is None:
                # The first branch starts the CASE expression
                condition = F.when(F.col(column) < F.lit(max_values[idx]), F.lit(bucket))
            else:
                condition = condition.when(F.col(column) < F.lit(max_values[idx]), F.lit(bucket))
        else:
            # A label without a max value becomes the ELSE branch
            condition = condition.otherwise(F.lit(bucket))
    df = df.with_column(column + '_BUCKET', condition)
    df = df.drop(column)
    return df
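
The function builds one CASE expression: each label is paired with an upper bound, and the first bound that is above the value decides the label. A plain-Python sketch of that lookup, with made-up bounds and labels (`bucket_label` is a hypothetical helper):

```python
def bucket_label(value, max_values, labels):
    """Return the first label whose upper bound is above value."""
    for upper, label in zip(max_values, labels):
        if value < upper:
            return label
    # A trailing label without an upper bound plays the role of otherwise().
    if len(labels) > len(max_values):
        return labels[len(max_values)]
    return None  # no branch matched, which corresponds to NULL in SQL


# Hypothetical bounds and labels.
max_values = [0, 100, 200]
labels = ["NEVER", "0-99", "100-199", "200+"]

for pdays in [-1, 50, 150, 500]:
    print(pdays, bucket_label(pdays, max_values, labels))
```

Without the extra trailing label, any value at or above the last bound gets no bucket at all.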


We need to generate the max values for each bin and the labels

In [None]:
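increment = int(round((min_max['MAX_VAL'] - min_max['MIN_VAL']) / 20, ndigits=0))
bin_vals = [0]  # exclusive upper bound of the first bin: PDAYS < 0 means never contacted
bin_lables = ['NEVER']
for idx, val in enumerate(range(1, min_max['MAX_VAL'], increment), start=1):
    bin_vals.append(val)
    # Label each bin with its inclusive range
    bin_lables.append(f'{bin_vals[idx-1]}-{bin_vals[idx]-1}')
# One extra label without an upper bound, so values from the last boundary up to the
# maximum fall into the otherwise branch instead of becoming NULL
bin_lables.append(f"{bin_vals[-1]}-{min_max['MAX_VAL']}")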

Call the function to generate the bins

In [None]:
df_bank_marketing_binned_prep = manual_bucketize(df_bank_marketing_prep, 'PDAYS', bin_vals, 
                                          bin_lables)
df_bank_marketing_binned_prep.show(20)

In [None]:
print_sql(df_bank_marketing_binned_prep)

We will use PDAYS_BUCKET instead of PDAYS, so we remove PDAYS from the numeric columns list and add PDAYS_BUCKET to the categorical columns list

In [None]:
cat_cols.append("PDAYS_BUCKET")
num_cols.remove("PDAYS")

In [None]:
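# save_as_table requires a table name; the name used here is an assumed placeholder
df_bank_marketing_binned_prep.write.save_as_table("bank_marketing_binned_prep", mode="overwrite")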

## Using snowml for preprocessing and training

Snowml lets us create pipelines and have the training of scikit-learn, XGBoost and LightGBM models, as well as inference, pushed down to Snowflake automatically.

A pipeline can also combine all the preprocessing and training steps.

Generate output column names for the columns we use the transformers on.

In [None]:
cat_cols_ohe = [col + '_OHE' for col in  cat_cols]
num_cols_out = [col + '_SCALED' for col in num_cols]

We want to use 1 and 0 for the target column (Y)

In [None]:
df_bank_marketing_binned_prep = df_bank_marketing_binned_prep.with_column("Y", F.iff(F.col("Y") == F.lit("yes"), F.lit(1), F.lit(0)))

In [None]:
print_schema(df_bank_marketing_binned_prep)

In [None]:
df_bank_marketing_binned_prep.show()

In [None]:
preprocessor = Pipeline(steps=[
    # Standard scaler for numerical columns
    ('scaler', pp.StandardScaler(input_cols=num_cols, output_cols=num_cols_out, drop_input_cols=True)),
    # One Hot Encoder transformer for categorical columns
    ('onehot', pp.OneHotEncoder(input_cols=cat_cols, output_cols=cat_cols_ohe, drop_input_cols=True, sparse=False, handle_unknown='ignore')),
])

# Combine into one pipeline with a RandomForestClassifier
model_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(label_cols=target_col, output_cols=['PREDICTED_RESPONSE'], n_jobs=-1)),
])

model_pipe.fit(df_bank_marketing_binned_prep)
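
To make the two preprocessing steps concrete, the sketch below shows what standard scaling and one-hot encoding compute, in plain Python on made-up values. It assumes scikit-learn's convention of dividing by the population standard deviation, and the helper names are hypothetical.

```python
import math


def standard_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def fit_categories(values):
    """Learn the category list from the training values."""
    return sorted(set(values))


def one_hot(value, categories):
    """One indicator per known category; an unseen value gives all zeros."""
    return [1 if value == category else 0 for category in categories]


# Hypothetical column values.
scaled = standard_scale([30, 40, 50])
categories = fit_categories(["single", "married", "single"])

print(scaled)
print(categories, one_hot("single", categories), one_hot("widowed", categories))
```

The all-zero vector for an unseen value mirrors the intent of `handle_unknown='ignore'` in the pipeline above.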

Get a Snowpark DataFrame with the predictions using the training data

In [None]:
df_predictions = model_pipe.predict(df_bank_marketing_binned_prep).cache_result()
df_predictions.show()

skl = model_pipe.to_sklearn()

Calculate metrics based on the training data. We will use them later when storing the pipeline in the model registry.

In [None]:
precision_recall_fscore_metrics = precision_recall_fscore_support(df=df_predictions, y_true_col_names='Y', y_pred_col_names='PREDICTED_RESPONSE', average='binary')
accuracy_metric =  accuracy_score(df=df_predictions, y_true_col_names='Y', y_pred_col_names='PREDICTED_RESPONSE')
cm = confusion_matrix(df=df_predictions, y_true_col_name='Y', y_pred_col_name='PREDICTED_RESPONSE')
print(f"Precision: {precision_recall_fscore_metrics[0]}")
print(f"Recall: {precision_recall_fscore_metrics[1]}")
print(f"fbeta: {precision_recall_fscore_metrics[2]}")
print(f"Accuracy: {accuracy_metric}")
print(f"Confusion matrix: {cm}")
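
All of these metrics derive from the four cells of the confusion matrix. A minimal plain-Python sketch on made-up labels, with 1 as the positive class (`binary_metrics` is a hypothetical helper):

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, accuracy and confusion matrix for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": (tp + tn) / len(pairs),
        # Rows are the true class, columns the predicted class.
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


# Hypothetical true and predicted labels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```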

## Deploy Model/Pipeline using Snowflake Model Registry (Private Preview)
The Snowflake Model Registry lets us store models/pipelines in Snowflake with additional metadata, deploy those models to Snowflake, and retrieve them. The API can also be used to apply a model to data.

### Open/Create Model Registry
A model registry needs to be created before it can be used. The creation will create a new database in the current account so the active role needs to have permissions to create a database. After the first creation, the model registry can be opened without the need to create it again.

In [None]:
# Create a new model registry. This will be a no-op if the registry already exists.
model_registry.create_model_registry(session=session, database_name='MODEL_REGISTRY', schema_name='MODEL_REGISTRY_SCHEMA')

Connect to the model registry

In [None]:
snowml_registry = model_registry.ModelRegistry(session=session, database_name='MODEL_REGISTRY', schema_name='MODEL_REGISTRY_SCHEMA')

### Register a new Model
Registering a new model is always performed through the relational API.

The call to log_model executes a few steps:

1. The given model object is serialized and uploaded to a stage.
2. An entry in the Model Registry is created for the model, referencing the model stage location.
3. Additional metadata is updated for the model as provided in the call.

For the serialization to work, the model object needs to be serializable in python.
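
A quick way to check this before logging is a round trip through a serializer. The sketch below uses `pickle` as a stand-in. Which serializer the registry actually uses is not stated here, so treat this as an assumption, and `is_picklable` as a hypothetical helper.

```python
import pickle


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Hypothetical stand-in for fitted model state.
model_state = {"threshold": 0.5, "classes": [0, 1]}
restored = pickle.loads(pickle.dumps(model_state))

print(restored == model_state)
print(is_picklable(model_state), is_picklable(lambda x: x))
```

Objects that hold lambdas, open connections or file handles are typical candidates for failing this check.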

In [None]:
model_nm = 'pp_predict_response'

Check if we already have stored a model with the same name

In [None]:
model_list = snowml_registry.list_models()
model_list.filter(F.col("NAME") == model_nm).select("NAME", "VERSION", "TYPE", "TAGS", "METRICS").show(max_width=150)

We can store multiple models with the same name as long as they have different versions

In [None]:
model_v = '5'

In [None]:
model_ref = snowml_registry.log_model(model=model_pipe, model_name=model_nm, model_version=model_v,
                                     description=f'Version {model_v} of a Pipeline with OneHotEncoder, StandardScaler and RandomForestClassifier to predict response',
                                     tags={"stage": "testing", "classifier_type": "pipeline"},
                                     options={"embed_local_ml_library": True})


The log_model method returns a reference to the model, which we can use to get information about the stored model

In [None]:
model_ref.get_name() , model_ref.get_version()

Check that the model is in the model registry

In [None]:
model_list.filter(F.col("NAME") == model_nm).select("NAME", "VERSION", "TYPE", "TAGS", "METRICS").show(max_width=150)

### Add Metrics
Metrics are a type of metadata annotation that can be associated with models stored in the Model Registry. Metrics often take the form of scalars, but more complex objects such as arrays or dictionaries are also supported. In the example below, we add scalars and a 2-dimensional numpy array (the confusion matrix) as metrics.

In [None]:
# Add metrics
model_ref.set_metric(metric_name="train_accuracy", metric_value=accuracy_metric)
model_ref.set_metric(metric_name="train_precision", metric_value=precision_recall_fscore_metrics[0])
model_ref.set_metric(metric_name="train_recall", metric_value=precision_recall_fscore_metrics[1])
model_ref.set_metric(metric_name="train_f1", metric_value=precision_recall_fscore_metrics[2])
model_ref.set_metric(metric_name="train_confusion_matrix", metric_value=cm)

Get all metrics for a model

In [None]:
model_ref.get_metrics()

Get value for one metric

In [None]:
model_ref.get_metric_value('train_precision')

### List Model in Registry
Listing models in the registry returns a Snowpark DataFrame, which allows the caller to select and filter the models as needed. In the example below, we list the name, version, type, tags, and metrics for the model we just added.

In [None]:
model_list = snowml_registry.list_models()
model_list.filter(F.col("NAME") == 'pp_predict_response').select("NAME", "VERSION", "TYPE", "TAGS", "METRICS").show(max_width=150)

### Model Deployment
The registry can be used to create a deployment, which can then be used for prediction. A deployment takes the form of a UDF and can be either permanent or temporary.

#### Permanent deployment
Start by checking if we already have deployments for the model

In [None]:
model_ref.list_deployments().select("MODEL_NAME", "MODEL_VERSION", "DEPLOYMENT_NAME", "TARGET_PLATFORM").show()

In [None]:
deploy_name = f"pp_predict_response_{model_v}_udf"
model_ref.deploy(deployment_name=deploy_name, target_method='predict', permanent=True, options={"relax_version": True})

In [None]:
model_ref.list_deployments().select("MODEL_NAME", "MODEL_VERSION", "DEPLOYMENT_NAME", "TARGET_PLATFORM").show()

Use the deployed model

In [None]:
model_ref.predict(deployment_name=deploy_name, data=df_bank_marketing_binned_prep).select("PREDICTED_RESPONSE").show()

Since the model is deployed as a UDF, we can call it directly using SQL or Snowpark. The UDF expects an object (dict) containing all columns, which we can generate with the object_construct function. The UDF also returns an object/dict, so we need to extract the PREDICTED_RESPONSE value from it.

In [None]:
object_list = [] 
for col in df_bank_marketing_binned_prep.columns:
    object_list.extend([F.lit(col), F.col(col)])

# Use the deployment created above instead of a hard-coded version number
udf_name = f"model_registry.MODEL_REGISTRY_SCHEMA.{deploy_name}"
df_bank_marketing_binned_prep.select(F.call_function(udf_name, F.object_construct(*object_list)).as_('response'))\
                             .select(F.col("response")['PREDICTED_RESPONSE'].as_('PREDICTED_RESPONSE')).show()
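
The list passed to object_construct alternates between column names and column values. The plain-Python sketch below shows that shape for a single made-up row and how the pairs turn into one object. `fake_predict_udf` is a hypothetical stand-in for the deployed UDF, not its real logic.

```python
# Hypothetical column names and one row of values.
columns = ["AGE", "JOB", "BALANCE"]
row = {"AGE": 42, "JOB": "technician", "BALANCE": 1500}

# Flat [key, value, key, value, ...] list, the same shape as object_list.
flat = []
for col in columns:
    flat.extend([col, row[col]])

# Pairing up keys and values gives one object per row.
obj = dict(zip(flat[0::2], flat[1::2]))


def fake_predict_udf(features):
    """Stand-in for the UDF: returns an object that carries the prediction."""
    return {"PREDICTED_RESPONSE": int(features["BALANCE"] > 1000)}


response = fake_predict_udf(obj)
print(flat)
print(response["PREDICTED_RESPONSE"])
```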

### Load a model from the registry
It is also possible to load a stored model back into memory

In [None]:
restored_model = model_ref.load_model()
restored_model.predict(df_bank_marketing_binned_prep).show()

### Model Registry with Scikit-Learn

Snowflake Model Registry also supports popular open source Python ML frameworks such as scikit-learn, XGBoost, HuggingFace, PyTorch, etc.

In [None]:
from sklearn.ensemble import RandomForestClassifier as skl_RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder as skl_OneHotEncoder
from sklearn.preprocessing import StandardScaler as skl_StandardScaler
from sklearn.compose import ColumnTransformer as skl_ColumnTransformer
from sklearn.pipeline import Pipeline as skl_Pipeline

pd_train = df_bank_marketing_binned_prep.to_pandas()
X = pd_train[[*cat_cols, *num_cols]]
y = pd_train["Y"]

# One Hot Encoder transformer for categorical columns
cat_transformer = skl_Pipeline(steps=[
    ('onehot', skl_OneHotEncoder(handle_unknown='ignore'))
])
# Standard scaler for numerical columns
num_transformer = skl_Pipeline(steps=[
    ('scaler', skl_StandardScaler())
])

# Combine into a column transformer
preprocessor = skl_ColumnTransformer(
  [
        ('num', num_transformer, num_cols),
        ('cat', cat_transformer, cat_cols),
    ],  verbose_feature_names_out=False,
)

# Create a pipeline with the column transformer and training of a Random Forest Classifier
pipe = skl_Pipeline(steps=[('preprocessor', preprocessor),
                       ('classifier', skl_RandomForestClassifier(n_jobs=-1))])
rfc_model = pipe.fit(X, y)

In [None]:
skl_model_name = "skl_predict_response"
skl_model_ref = snowml_registry.log_model(model=rfc_model, model_name=skl_model_name, model_version='1',
                                     description='Scikit-Learn Pipeline with OneHotEncoder, StandardScaler and RandomForestClassifier to predict response',
                                     tags={"stage": "testing", "classifier_type": "pipeline"},
                                     sample_input_data=X.head(), options={"embed_local_ml_library": True})


In [None]:
model_list.select("NAME", "VERSION", "TYPE", "TAGS", "METRICS").show(max_width=150)

In [None]:
skl_deploy_name = "skl_model_response_udf"
skl_model_ref.deploy(deployment_name=skl_deploy_name, target_method='predict', permanent=True,options={"relax_version": True})

In [None]:
skl_model_ref.predict(deployment_name=skl_deploy_name, data=X.head())

In [None]:
session.close()