# Citibike ML
In this example we use the [Citibike dataset](https://ride.citibikenyc.com/system-data). Citibike is a bicycle-sharing system in New York City. Every day, users choose from 20,000 bicycles at 1,300 stations around the city.

To ensure customer satisfaction, Citibike needs to predict how many bicycles will be needed at each station. Maintenance teams check each station, repair or replace bicycles, and relocate bicycles between stations based on predicted demand. The business needs to be able to run reports showing how many bicycles will be needed at a given station on a given day.

## ML Ops
In this section of the demo, we use Snowpark's Python client-side DataFrame API as well as the Snowpark server-side runtime to create an **ML Ops pipeline**. We take the functions created by the ML Engineer and turn them into a set of functions that can be easily automated with the company's orchestration tools.

The ML Engineer must create a pipeline to **automate deployment** of models and batch predictions so that business users can easily consume them from dashboards and analytics tools such as Tableau or Power BI. Predictions will be made for the top 10 busiest stations. Each prediction must be accompanied by an explanation of which features had the most impact on it.

For this demo flow we will assume that the organization has the following **policies and processes**:
- **Dev Tools**: The ML engineer can develop in their tool of choice (e.g. VS Code, IntelliJ, PyCharm, Eclipse). Snowpark Python makes it possible to use any environment that has a Python kernel. For this demo we use Jupyter.
- **Data Governance**: To preserve customer privacy, no data can be stored locally. The ingest system may store data temporarily, but in production it must be assumed that it will not preserve intermediate data products between runs. Snowpark Python allows the user to push down all operations to Snowflake and bring the code to the data.
- **Automation**: Although the ML engineer can use any IDE or notebook for development, the final product of the work stream must be Python code. Well-documented, modularized code is necessary for good ML operations and to interface with the company's CI/CD and orchestration tools.
- **Compliance**: Any ML model must be traceable back to the original dataset used for training. The business needs to be able to easily remove specific user data from training datasets and retrain models.

Input: Data in the `trips` table, plus the feature engineering, training and prediction functions from the data scientist.  
Output: Automatable pipeline of feature engineering, train, predict.

### 1. Load credentials and connect to Snowflake

In [None]:
from dags.snowpark_connection import snowpark_connect
session, state_dict = snowpark_connect('./include/state.json')

### 1. Set Up Training and Inference Pipeline

We will generate a unique identifier which we will use to provide lineage across all components of the pipeline.

In [None]:
from snowflake.snowpark import functions as F
import uuid
model_id = str(uuid.uuid1()).replace('-', '_')

state_dict.update({'model_id': model_id})
state_dict.update({'weather_table_name': 'WEATHER',
                   'holiday_table_name': 'HOLIDAYS',
                   'feature_table_name' : 'FEATURES_'+model_id,
                   'pred_table_name': 'PREDS_'+model_id,
                   'eval_table_name': 'EVALS_'+model_id,
                   'forecast_table_name': 'FORECAST_'+model_id,
                   'forecast_steps': 30,
                   'train_udf_name': 'station_train_predict_udf',
                   'train_func_name': 'station_train_predict_func',
                   'eval_udf_name': 'eval_model_output_udf',
                   'eval_func_name': 'eval_model_func'
                  })

import json
with open('./include/state.json', 'w') as sdf:
    json.dump(state_dict, sdf)
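
The dashes in the UUID are replaced because the ID is appended to table names such as `FEATURES_<model_id>`. As a rough sanity check, the sketch below derives an ID the same way, tests the resulting name against a simple pattern for unquoted identifiers, and round-trips a small state dictionary through JSON. The regular expression and the temporary file path are illustrative assumptions, not part of the demo code.

```python
import json
import re
import tempfile
import uuid
from pathlib import Path

# Same derivation as in the cell above: a time-based UUID with dashes
# replaced by underscores so it can be embedded in an unquoted identifier.
model_id = str(uuid.uuid1()).replace('-', '_')
feature_table_name = 'FEATURES_' + model_id

# Assumption: an unquoted identifier starts with a letter or underscore and
# contains only letters, digits, underscores and dollar signs.
UNQUOTED_IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_$]*')
is_valid = UNQUOTED_IDENTIFIER.fullmatch(feature_table_name) is not None

# Round-trip a small state dictionary through a throwaway JSON file
# (the real pipeline writes to ./include/state.json instead).
state = {'model_id': model_id, 'feature_table_name': feature_table_name}
with tempfile.TemporaryDirectory() as tmp_dir:
    state_path = Path(tmp_dir) / 'state.json'
    state_path.write_text(json.dumps(state))
    restored = json.loads(state_path.read_text())
```

Note that the prefix matters: a bare UUID can begin with a digit, which would not be accepted as an unquoted identifier under the pattern assumed here.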

We will deploy the model training and inference as a permanent [Python Snowpark User-Defined Function (UDF)](https://docs.snowflake.com/en/LIMITEDACCESS/snowpark-python.html#creating-user-defined-functions-udfs-for-dataframes). This makes the function available not only to our automated training/inference pipeline but also to any users who need it for manually generated predictions.

Because the function is permanent, it needs a stage to hold its code.

In [None]:
session.sql('CREATE STAGE IF NOT EXISTS ' + state_dict['model_stage_name']).collect()

For production, we need to be able to reproduce results. The `trips` table will change as new data is loaded each month, so we need a point-in-time snapshot. Snowflake [Zero-Copy Cloning](https://docs.snowflake.com/en/sql-reference/sql/create-clone.html) allows us to do this with copy-on-write semantics, so we don't keep multiple copies of the same data. We use the unique `model_id` generated above to identify each training/inference run as well as the features and predictions it generates. We can also use [object tagging](https://docs.snowflake.com/en/user-guide/object-tagging.html) to tag each object with the `model_id`.

In [None]:
clone_table_name = 'TRIPS_CLONE_'+state_dict["model_id"]
state_dict.update({"clone_table_name":clone_table_name})

_ = session.sql('CREATE OR REPLACE TABLE '+clone_table_name+" CLONE "+state_dict["trips_table_name"]).collect()
_ = session.sql('CREATE TAG IF NOT EXISTS model_id_tag').collect()
_ = session.sql("ALTER TABLE "+clone_table_name+" SET TAG model_id_tag = '"+state_dict["model_id"]+"'").collect()
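
The same `ALTER TABLE ... SET TAG` statement is repeated for every table the pipeline creates, and all of these statements are built by string concatenation. One possible way to reduce the repetition is a small helper that builds the statements and rejects unexpected identifiers. The sketch below is only an illustration: `clone_and_tag_statements` and its validation rule are hypothetical and are not part of `dags.mlops_pipeline`.

```python
import re

# Assumption: only plain unquoted identifiers, optionally qualified as
# database.schema.table, are accepted; anything else is rejected, not quoted.
_IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*){0,2}')


def _check_identifier(name: str) -> str:
    """Return name unchanged if it looks like a safe unquoted identifier."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f'unsafe identifier: {name!r}')
    return name


def clone_and_tag_statements(source_table, clone_table, tag_name, tag_value):
    """Build the SQL used to snapshot a table and tag the snapshot."""
    source = _check_identifier(source_table)
    clone = _check_identifier(clone_table)
    tag = _check_identifier(tag_name)
    # Double any single quotes so the value stays inside the string literal.
    value = tag_value.replace("'", "''")
    return [
        f'CREATE OR REPLACE TABLE {clone} CLONE {source}',
        f'CREATE TAG IF NOT EXISTS {tag}',
        f"ALTER TABLE {clone} SET TAG {tag} = '{value}'",
    ]
```

Each returned statement could then be executed with `session.sql(stmt).collect()`, exactly as in the cell above.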

We will start by importing the functions created by the ML Engineer.

In [None]:
from dags.mlops_pipeline import materialize_holiday_table
from dags.mlops_pipeline import materialize_weather_table
from dags.mlops_pipeline import deploy_pred_train_udf
from dags.mlops_pipeline import deploy_eval_udf
from dags.mlops_pipeline import create_forecast_table
from dags.mlops_pipeline import create_feature_table
from dags.mlops_pipeline import train_predict
from dags.mlops_pipeline import evaluate_station_model

The pipeline will be orchestrated by our company's orchestration framework, but we will test the steps here.

In [None]:
holiday_table_name = materialize_holiday_table(session=session, 
                                               holiday_table_name=state_dict['holiday_table_name'])

In [None]:
weather_table_name = materialize_weather_table(session=session,
                                               weather_table_name=state_dict['weather_table_name'])

In [None]:
model_udf_name = deploy_pred_train_udf(session=session, 
                                       udf_name=state_dict['train_udf_name'],
                                       function_name=state_dict['train_func_name'],
                                       model_stage_name=state_dict['model_stage_name'])

In [None]:
eval_udf_name = deploy_eval_udf(session=session, 
                                udf_name=state_dict['eval_udf_name'],
                                function_name=state_dict['eval_func_name'],
                                model_stage_name=state_dict['model_stage_name'])

In [None]:
feature_table_name = create_feature_table(session, 
                                          trips_table_name=state_dict['clone_table_name'], 
                                          holiday_table_name=state_dict['holiday_table_name'], 
                                          weather_table_name=state_dict['weather_table_name'],
                                          feature_table_name=state_dict['feature_table_name'])

_ = session.sql("ALTER TABLE "+feature_table_name+" SET TAG model_id_tag = '"+state_dict["model_id"]+"'").collect()


In [None]:
forecast_table_name = create_forecast_table(session, 
                                            holiday_table_name=state_dict['holiday_table_name'], 
                                            weather_table_name=state_dict['weather_table_name'], 
                                            forecast_table_name=state_dict['forecast_table_name'],
                                            start_date='2020-03-01', 
                                            steps=state_dict['forecast_steps'])

_ = session.sql("ALTER TABLE "+forecast_table_name+" SET TAG model_id_tag = '"+state_dict["model_id"]+"'").collect()


In [None]:
session.use_warehouse(state_dict['compute_parameters']['train_warehouse'])

In [None]:
pred_table_name = train_predict(session, 
                                station_train_pred_udf_name=state_dict['train_udf_name'], 
                                feature_table_name=state_dict['feature_table_name'], 
                                forecast_table_name=state_dict['forecast_table_name'],
                                pred_table_name=state_dict['pred_table_name'])

_ = session.sql("ALTER TABLE "+pred_table_name+" SET TAG model_id_tag = '"+state_dict["model_id"]+"'").collect()


In [None]:
session.use_warehouse(state_dict['compute_parameters']['default_warehouse'])

In [None]:
eval_table_name = evaluate_station_model(session, 
                                         eval_model_udf_name=state_dict['eval_udf_name'], 
                                         pred_table_name=state_dict['pred_table_name'], 
                                         eval_table_name=state_dict['eval_table_name'])

_ = session.sql("ALTER TABLE "+eval_table_name+" SET TAG model_id_tag = '"+state_dict["model_id"]+"'").collect()


In [None]:
#session.sql('ALTER WAREHOUSE IF EXISTS '+state_dict['compute_parameters']['train_warehouse']+' SUSPEND').collect()

## 2. Consolidate Ingest, Training, Inference and Evaluation

In [None]:
files_to_download = ['202003-citibike-tripdata.csv.zip']
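
One way to consolidate the stages is to express each one as a function that takes the state dictionary and returns an updated copy, then run them in order. The sketch below shows only that pattern. `run_steps`, `ingest` and `build_features` are hypothetical stand-ins, not the demo's actual ingest, training or evaluation functions.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
Step = Tuple[str, Callable[[State], State]]


def run_steps(steps: List[Step], state: State) -> State:
    """Run each step in order, passing the updated state to the next one."""
    for name, step in steps:
        try:
            # Each step receives a shallow copy, so a failing step cannot
            # add or replace keys in the caller's dictionary.
            state = step(dict(state))
        except Exception as exc:
            raise RuntimeError(f'pipeline step {name!r} failed') from exc
    return state


# Hypothetical stand-ins for the real pipeline stages.
def ingest(state: State) -> State:
    state['ingested_files'] = list(state['files_to_download'])
    return state


def build_features(state: State) -> State:
    state['feature_table_name'] = 'FEATURES_' + str(state['model_id'])
    return state


final_state = run_steps(
    [('ingest', ingest), ('features', build_features)],
    {'model_id': 'demo',
     'files_to_download': ['202003-citibike-tripdata.csv.zip']},
)
```

Keeping every stage behind the same signature is what lets an orchestration tool schedule the stages individually or as a single task.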

In [None]:
session.close()