# ML Engineering Exercise 3

## Parallel Feature Engineering, Training and Inference

Exercise: The original code performed feature engineering, training, inference and evaluation on a single station. The ML Engineer initially took this code and wrapped it in various loops to operate over all stations. This is REALLY SLOW. Instead, we would like to vectorize as much of the code as possible, push all operations down into Snowflake, and leverage the parallel nature of Snowpark user-defined functions.

Snowpark user-defined functions (UDFs) are generally a good candidate for so-called "embarrassingly parallel" workloads. Using Snowpark Python UDFs for training, as this code does, stretches the original intention of UDFs. However, this code is meant to give an idea of the art of the possible.

**Note**: At the current time in the Snowpark Python Private Preview, row-based parallelization is an additional feature flag that is not enabled on all accounts with Snowpark Python. Much of the work can be parallelized without this feature, but the final training is currently limited in parallelization.

**Note**: At the current time in Snowpark Python Private Preview the Snowpark Python UDFs are limited to scalar functions (one row in, one value out).  Much of the complexity around parallelizing UDFs comes from the current lack of user-defined table functions and this code will get MUCH easier in the near future.
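
To make the "one row in, one value out" shape concrete, here is a minimal local sketch in plain Python (not Snowpark). The handler name `toy_station_handler` and the sample rows are hypothetical; the point is only that each row can be processed without looking at any other row, which is what makes the workload embarrassingly parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_station_handler(row: dict) -> float:
    """Hypothetical scalar handler: one row in, one value out."""
    counts = row["daily_counts"]
    return sum(counts) / len(counts)


# Hypothetical rows: one per station, with that station's data packed into the row.
rows = [
    {"station_id": "519", "daily_counts": [10, 20, 30]},
    {"station_id": "497", "daily_counts": [4, 6]},
    {"station_id": "285", "daily_counts": [7]},
]

# No row depends on any other row, so the calls can run in any order or in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(toy_station_handler, rows))

means_by_station = {row["station_id"]: mean for row, mean in zip(rows, results)}
print(means_by_station)
```

Because a scalar handler only ever sees one row, anything it needs for a station has to be packed into that row first, which is the motivation for the aggregation step later in this exercise.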

Input: Data in `trips` table.  Feature engineering, train, predict functions from data scientist.  
Output: Prediction models available to business users in SQL. Evaluation reports for monitoring.

### 1. Load  credentials and connect to Snowflake

In [None]:
from dags.snowpark_connection import snowpark_connect
session, state_dict = snowpark_connect()

### 2.  Materialize the holidays and weather features


In [None]:
from snowflake.snowpark import functions as F
from citibike_ml.mlops_pipeline import materialize_holiday_table, materialize_weather_table

trips_table_name = state_dict['trips_table_name']
#holiday_table_name = state_dict['holiday_table_name']
#weather_table_name = state_dict['weather_table_name']
#model_stage_name = state_dict['model_stage_name']

holiday_table_name = materialize_holiday_table(session=session,
                                               holiday_table_name=state_dict['holiday_table_name'])
precip_table_name = materialize_weather_table(session=session, 
                                              weather_table_name=state_dict['weather_table_name'])

### 3.  Create a vectorized feature generation
Previously, the data scientist picked one station for training and predictions. We want to generate features for all stations in parallel. We can leverage the power of the Snowflake SQL execution engine for this, and Snowpark allows us to write it in Python.  

Snowflake [window functions](https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/_autosummary/snowflake.snowpark.html#snowflake.snowpark.Window) are a powerful tool for vectorizing work.  Our initial feature engineering code from the data scientist used window functions to calculate the lag features.

In [None]:
import snowflake.snowpark as snp 
trips_df = session.table(trips_table_name)
holiday_df = session.table(holiday_table_name)
precip_df = session.table(precip_table_name)
station_id = '519'

date_window = snp.Window.orderBy('DATE')

# Previously we started with a filter on a single station_id

feature_df = trips_df.filter(F.col('START_STATION_ID') == station_id)\
                     .select(F.to_date(F.col('STARTTIME')).alias('DATE'),
                             F.col('START_STATION_ID').alias('STATION_ID'))\
                     .groupBy(F.col('STATION_ID'), F.col('DATE'))\
                        .count()\
                     .withColumn('LAG_1', F.lag(F.col('COUNT'), offset=1, default_value=None).over(date_window))\
                     .withColumn('LAG_7', F.lag(F.col('COUNT'), offset=7, default_value=None).over(date_window))\
                        .na.drop()\
                     .join(holiday_df, 'DATE', join_type='left').na.fill({'HOLIDAY':0})\
                     .join(precip_df, 'DATE', 'inner')

In [None]:
feature_df.show(5), feature_df.select('STATION_ID').distinct().count()

We can create a multi-level window function that partitions by station_id and orders by date within each partition.  
  
Notice there is no `filter()` initially.
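
As a local illustration of why the partition matters, the following pandas sketch (not Snowpark) compares a lag ordered by date alone with a lag computed per station. The toy table and the column name `LAG_1_GLOBAL` are hypothetical; `groupby(...).shift()` is used here as a rough pandas analogue of `lag()` over a partitioned, ordered window.

```python
import pandas as pd

# Hypothetical daily trip counts for two stations.
daily = pd.DataFrame({
    "STATION_ID": ["A", "A", "A", "B", "B", "B"],
    "DATE": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"] * 2),
    "COUNT": [10, 11, 12, 100, 101, 102],
})
daily = daily.sort_values(["STATION_ID", "DATE"]).reset_index(drop=True)

# Lag ordered by date only: the previous row can belong to another station.
by_date = daily.sort_values(["DATE", "STATION_ID"])
daily["LAG_1_GLOBAL"] = by_date["COUNT"].shift(1)  # assignment aligns on the index

# Lag within each station: the previous row is always the same station's prior day.
daily["LAG_1"] = daily.groupby("STATION_ID")["COUNT"].shift(1)

print(daily)
```

In the toy output, station A's second day picks up station B's count in `LAG_1_GLOBAL`, while `LAG_1` only ever sees station A's own history.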

In [None]:
sid_date_window = snp.Window.partitionBy(F.col('STATION_ID')).orderBy(F.col('DATE').asc())

feature_df = trips_df.select(F.to_date(F.col('STARTTIME')).alias('DATE'),
                             F.col('START_STATION_ID').alias('STATION_ID'))\
                     .groupBy(F.col('STATION_ID'), F.col('DATE'))\
                        .count()\
                     .withColumn('LAG_1', F.lag(F.col('COUNT'), offset=1, default_value=None).over(sid_date_window))\
                     .withColumn('LAG_7', F.lag(F.col('COUNT'), offset=7, default_value=None).over(sid_date_window))\
                        .na.drop()\
                     .join(holiday_df, 'DATE', join_type='left').na.fill({'HOLIDAY':0})\
                     .join(precip_df, 'DATE', 'inner')

In [None]:
feature_df.show(5), feature_df.select('STATION_ID').distinct().count()

Our feature dataframe now has feature sets for 1061 of the original 1081 stations. Twenty stations have too few days of trip data (seven or fewer), so they are dropped entirely: `lag()` leaves nulls in each station's first seven rows, and `na.drop()` removes those rows rather than imputing them.  
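
The effect of the lag features on short histories can be checked with a small pandas sketch. The helper `lag7_rows_remaining` is hypothetical and only mimics the `LAG_1`/`LAG_7` plus drop-nulls steps for a single station.

```python
import pandas as pd


def lag7_rows_remaining(n_days: int) -> int:
    """Rows left for one station after adding LAG_1/LAG_7 and dropping nulls."""
    df = pd.DataFrame({"COUNT": range(n_days)})
    df["LAG_1"] = df["COUNT"].shift(1)
    df["LAG_7"] = df["COUNT"].shift(7)
    return len(df.dropna())


for n_days in (5, 7, 8, 30):
    print(n_days, lag7_rows_remaining(n_days))
```

A station needs at least eight days of data to keep a single feature row, because the first seven rows always have a null `LAG_7`.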
  
  
There is one more step.  Our downstream model training will do a 365-day split, using the first year for training and the second year for validation.  This is important because of the annual seasonality that our model needs to capture.  
So we need to generate features only for stations that have at least two years' worth of data.  Again, we can do this with a second window function.
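
A rough pandas analogue of this second window is `groupby(...).transform("count")`, which attaches each station's row count to every one of its rows without collapsing them. The sketch below uses a hypothetical toy table and a threshold of 3 days (`MIN_DAYS`) in place of `365*2` to keep the data small.

```python
import pandas as pd

MIN_DAYS = 3  # hypothetical stand-in for 365*2 so the toy data stays small

features = pd.DataFrame({
    "STATION_ID": ["A"] * 4 + ["B"] * 2,
    "DATE": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04",
                            "2020-01-01", "2020-01-02"]),
    "COUNT": [10, 11, 12, 13, 5, 6],
})

# Like count(DATE) over a window partitioned by STATION_ID: every row keeps
# its own values and gains the size of its station's group.
features["DAY_COUNT"] = features.groupby("STATION_ID")["DATE"].transform("count")

# Keep only rows belonging to stations with enough history.
kept = features[features["DAY_COUNT"] >= MIN_DAYS]
print(kept["STATION_ID"].unique().tolist())
```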

In [None]:
sid_date_window = snp.Window.partitionBy(F.col('STATION_ID')).orderBy(F.col('DATE').asc())
sid_window = snp.Window.partitionBy(F.col('STATION_ID'))


feature_df = trips_df.select(F.to_date(F.col('STARTTIME')).alias('DATE'),
                             F.col('START_STATION_ID').alias('STATION_ID'))\
                     .groupBy(F.col('STATION_ID'), F.col('DATE'))\
                        .count()\
                     .withColumn('LAG_1', F.lag(F.col('COUNT'), offset=1, default_value=None).over(sid_date_window))\
                     .withColumn('LAG_7', F.lag(F.col('COUNT'), offset=7, default_value=None).over(sid_date_window))\
                        .na.drop()\
                     .join(holiday_df, 'DATE', join_type='left').na.fill({'HOLIDAY':0})\
                     .join(precip_df, 'DATE', 'inner')\
                     .withColumn('DAY_COUNT', F.count(F.col('DATE')).over(sid_window))\
                        .filter(F.col('DAY_COUNT') >= 365*2)

Our feature set should not include any stations with fewer than 737 days (365*2+7) of data, so the minimum `DAY_COUNT` of the remaining feature rows should be at least 730.

In [None]:
feature_df.select(F.min('DAY_COUNT')).collect()[0][0]

Now how many stations have at least two years of data?

In [None]:
feature_df.select('STATION_ID').distinct().count()

### 4.  Vectorize the training and inference

Because we currently only have scalar functions for Snowpark Python UDFs we must aggregate the features for each station to a single cell.   This will be much easier in the future with vectorized input and user-defined table functions.
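
The "stuffing" idea can be sketched locally in pandas before looking at the Snowpark version. This is only an illustration with a hypothetical toy table: each station's rows are packed into one cell as a list of row-lists, alongside the column names needed to rebuild them.

```python
import pandas as pd

# Hypothetical feature rows for two stations.
features = pd.DataFrame({
    "STATION_ID": ["A", "A", "B"],
    "DATE": ["2020-01-08", "2020-01-09", "2020-01-08"],
    "COUNT": [12, 15, 7],
    "LAG_1": [11, 12, 6],
})
input_columns = [c for c in features.columns if c != "STATION_ID"]

# "Stuff" each station's rows into one cell: a list of row-lists per station.
stuffed = pd.DataFrame([
    {
        "STATION_ID": station_id,
        "INPUT_DATA": group[input_columns].values.tolist(),
        "INPUT_COLUMN_LIST": input_columns,
        "TARGET_COLUMN": "COUNT",
    }
    for station_id, group in features.groupby("STATION_ID")
])

# A scalar handler can rebuild the per-station table from a single row.
first = stuffed.iloc[0]
station_df = pd.DataFrame(first["INPUT_DATA"], columns=first["INPUT_COLUMN_LIST"])
print(station_df)
```

The result has one row per station, so a function that receives one row at a time still sees that station's full history.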

In [None]:
feature_df = feature_df.drop('DAY_COUNT')

In [None]:
feature_column_list = feature_df.columns
feature_column_list.remove('\"STATION_ID\"')
feature_column_list = [f.replace('\"', "") for f in feature_column_list]
feature_column_array = F.array_construct(*[F.lit(x) for x in feature_column_list])

feature_df_stuffed = feature_df.groupBy(F.col('STATION_ID'))\
                               .agg(F.array_agg(F.array_construct(*feature_column_list)).alias('INPUT_DATA'))\
                               .withColumn('INPUT_COLUMN_LIST', feature_column_array)\
                               .withColumn('TARGET_COLUMN', F.lit('COUNT'))

Let's check to make sure the aggregation happened at the right level.

In [None]:
feature_df_stuffed.show(1)

### 5. Update the Training/Prediction Code to use UDF parallelization
Now that we can generate the features in parallel, we can also use the Snowflake UDF structure to train all of our stations in parallel.  The handler runs up to eight times in parallel on each node of the warehouse, so to train on 1061 stations we will need a larger warehouse.  
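
The sizing argument is simple arithmetic, sketched below. The function `training_waves` and the node counts are hypothetical; the sketch assumes every handler slot trains exactly one station at a time and that all stations take about the same time.

```python
import math


def training_waves(n_stations: int, n_nodes: int, handlers_per_node: int = 8) -> int:
    """Sequential rounds needed if each handler slot trains one station at a time."""
    return math.ceil(n_stations / (n_nodes * handlers_per_node))


for n_nodes in (1, 8, 32):  # hypothetical node counts
    print(n_nodes, training_waves(1061, n_nodes))
```

Under these assumptions, a single node would need 133 rounds for 1061 stations, while 32 nodes would need 5.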
  
First, we need to add the two-year logic to our UDF handler as well, in case someone accidentally calls it with data that wasn't filtered for the two-year minimum.
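
The guard and the time-based split can be sketched in isolation with pandas. The helper `split_last_year` is hypothetical; it follows the same idea as the handler (sort by date, hold out the last 365 rows for validation) but leaves out the model entirely.

```python
import pandas as pd


def split_last_year(df: pd.DataFrame, cutpoint: int = 365):
    """Return (train, valid), or None if there is less than two years of data."""
    if len(df) < 2 * cutpoint:
        return None
    # The split is positional, so the rows must be in date order first.
    ordered = df.sort_values("DATE").reset_index(drop=True)
    return ordered.iloc[:-cutpoint], ordered.iloc[-cutpoint:]


dates = pd.date_range("2019-01-01", periods=800, freq="D")
toy = pd.DataFrame({"DATE": dates[::-1], "COUNT": range(800)})  # deliberately unsorted

train, valid = split_last_year(toy)
print(len(train), len(valid), train["DATE"].max() < valid["DATE"].min())
print(split_last_year(toy.head(100)))
```

Sorting before slicing is what guarantees that every validation date is later than every training date, even when the rows arrive in arbitrary order.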

In [None]:
%%writefile /station_train_predict.py

def station_train_predict_func(input_data: list, 
                               input_columns: list, 
                               target_column: str,
                               max_epochs: int) -> str:

    import pandas as pd
    df = pd.DataFrame(input_data, columns = input_columns)
    
    #Due to annual seasonality we need at least one year of data for training 
    #and a second year of data for validation
    if len(df) < 365*2:
        df['PRED'] = 'NULL'
    else:
        feature_columns = input_columns.copy()
        feature_columns.remove('DATE')
        feature_columns.remove(target_column)

        from torch import tensor
        from pytorch_tabnet.tab_model import TabNetRegressor

        model = TabNetRegressor()

        #cutpoint = round(len(df)*(train_valid_split/100))
        cutpoint = 365

        # NOTE: a time-based train/valid split requires the input data to be sorted by date.
        # Reset the index after sorting so the pd.concat(axis=1) below aligns row-for-row.
        df['DATE'] = pd.to_datetime(df['DATE'])
        df = df.sort_values(by='DATE', ascending=True).reset_index(drop=True)

        y_valid = df[target_column][-cutpoint:].values.reshape(-1, 1)
        X_valid = df[feature_columns][-cutpoint:].values
        y_train = df[target_column][:-cutpoint].values.reshape(-1, 1)
        X_train = df[feature_columns][:-cutpoint].values

        model.fit(
            X_train, y_train,
            eval_set=[(X_valid, y_valid)],
            max_epochs=max_epochs,
            patience=100,
            batch_size=1024, 
            virtual_batch_size=128,
            num_workers=0,
            drop_last=False)

        df['PRED'] = model.predict(tensor(df[feature_columns].round(2).values))
        df['DATE'] = df['DATE'].dt.strftime('%Y-%m-%d')
        df = pd.concat([df, pd.DataFrame(model.explain(df[feature_columns].values)[0], 
                               columns = feature_columns).add_prefix('EXPL_').round(2)], axis=1)
    
    return [df.values.tolist(), df.columns.tolist()]


In [None]:
from citibike_ml.mlops_pipeline import deploy_pred_train_udf

model_stage_name = state_dict['model_stage_name']
_ = session.sql('CREATE STAGE IF NOT EXISTS ' + model_stage_name).collect()

model_udf_name = deploy_pred_train_udf(session=session, model_stage_name=model_stage_name)

**NOTE**: The following code will not currently work due to a bug in the Snowpark backend.  This will be fixed in the 6.3.0 code push.  For now we use a limit function which essentially bypasses the row-wise parallelization.

In [None]:
_ = session.sql('USE WAREHOUSE LG_WH').collect()

max_epochs=10

output_list = feature_df_stuffed.limit(1)\
                                .select('STATION_ID', F.call_udf(model_udf_name, 
                                                                 'INPUT_DATA', 
                                                                 'INPUT_COLUMN_LIST', 
                                                                 'TARGET_COLUMN', 
                                                                 F.lit(max_epochs)).alias('OUTPUT_DATA')).collect()

In [None]:
import ast
import pandas as pd
df = pd.DataFrame(ast.literal_eval(output_list[0]['OUTPUT_DATA'])[0], 
                  columns = ast.literal_eval(output_list[0]['OUTPUT_DATA'])[1])

df.head()

There are essentially no changes to the actual training and prediction code, except that we need to un-stuff the data that was aggregated into a single cell per station.
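
Un-stuffing can be sketched locally as follows. The `output_cells` strings are hypothetical stand-ins for the UDF's `OUTPUT_DATA` values, assumed here to be the string form of `[rows, column_names]`.

```python
import ast

import pandas as pd

# Hypothetical OUTPUT_DATA cells: the string form of [rows, column_names].
output_cells = {
    "519": "[[['2020-01-08', 12, 11.6], ['2020-01-09', 15, 14.2]], ['DATE', 'COUNT', 'PRED']]",
    "497": "[[['2020-01-08', 7, 6.9]], ['DATE', 'COUNT', 'PRED']]",
}

frames = []
for station_id, cell in output_cells.items():
    # Parse the cell once, then rebuild the per-station table.
    rows, columns = ast.literal_eval(cell)
    frame = pd.DataFrame(rows, columns=columns)
    frame["STATION_ID"] = station_id
    frames.append(frame)

# One concat at the end is cheaper than growing a DataFrame inside the loop.
predictions = pd.concat(frames, ignore_index=True)
print(predictions)
```

`ast.literal_eval` only accepts Python literals, so a cell containing JSON-style `null` or `true` would need `json.loads` instead.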

In [None]:
%%writefile citibike_ml/parallel_udf.py

def generate_feature_table(session, 
                           clone_table_name, 
                           feature_table_name, 
                           holiday_table_name, 
                           precip_table_name) -> str:
    
    from snowflake.snowpark import functions as F
    import snowflake.snowpark as snp
    
    clone_df = session.table(clone_table_name)
    holiday_df = session.table(holiday_table_name)
    precip_df = session.table(precip_table_name)

    window = snp.Window.partitionBy(F.col('STATION_ID')).orderBy(F.col('DATE').asc())
    sid_window = snp.Window.partitionBy(F.col('STATION_ID'))


    feature_df = clone_df.select(F.to_date(F.col('STARTTIME')).alias('DATE'),
                                 F.col('START_STATION_ID').alias('STATION_ID'))\
                         .groupBy(F.col('STATION_ID'), F.col('DATE'))\
                            .count()\
                         .withColumn('DAY_COUNT', F.count(F.col('DATE')).over(sid_window))\
                            .filter(F.col('DAY_COUNT') >= 365*2)\
                         .withColumn('LAG_1', F.lag(F.col('COUNT'), offset=1, default_value=None).over(window))\
                         .withColumn('LAG_7', F.lag(F.col('COUNT'), offset=7, default_value=None).over(window))\
                            .na.drop()\
                         .join(holiday_df, 'DATE', join_type='left').na.fill({'HOLIDAY':0})\
                         .join(precip_df, 'DATE', 'inner')\
                         .withColumn('DAY_COUNT', F.count(F.col('DATE')).over(sid_window))\
                            .filter(F.col('DAY_COUNT') >= 365*2)\
                         .drop('DAY_COUNT')
    
    feature_column_list = feature_df.columns
    feature_column_list.remove('\"STATION_ID\"')
    feature_column_list = [f.replace('\"', "") for f in feature_column_list]
    feature_column_array = F.array_construct(*[F.lit(x) for x in feature_column_list])

    feature_df_stuffed = feature_df.groupBy(F.col('STATION_ID'))\
                                   .agg(F.array_agg(F.array_construct(*feature_column_list)).alias('INPUT_DATA'))\
                                   .withColumn('INPUT_COLUMN_LIST', feature_column_array)\
                                   .withColumn('TARGET_COLUMN', F.lit('COUNT'))
    
    feature_df_stuffed.limit(50).write.mode('overwrite').saveAsTable(feature_table_name)        

    return feature_table_name

def train_predict_feature_table(session, station_train_pred_udf_name, feature_table_name, pred_table_name) -> str:
    from snowflake.snowpark import functions as F
    import pandas as pd
    import ast
    
    max_epochs=10

    output_list = session.table(feature_table_name)\
                         .select('STATION_ID', F.call_udf(station_train_pred_udf_name, 
                                                          'INPUT_DATA', 
                                                          'INPUT_COLUMN_LIST', 
                                                          'TARGET_COLUMN', 
                                                          F.lit(max_epochs)).alias('OUTPUT_DATA')).collect()
    frames = []
    for row in output_list:
        # Each OUTPUT_DATA cell holds [rows, column_names]; parse it once per station
        data, columns = ast.literal_eval(row['OUTPUT_DATA'])
        tempdf = pd.DataFrame(data=data, columns=columns)
        tempdf['STATION_ID'] = str(row['STATION_ID'])
        frames.append(tempdf)
    # Concatenate once instead of growing the DataFrame inside the loop
    df = pd.concat(frames, axis=0) if frames else pd.DataFrame()
        
    session.createDataFrame(df).write.mode('overwrite').saveAsTable(pred_table_name)
    
    return pred_table_name

### 6. Test

In [None]:
%%time
from citibike_ml.parallel_udf import generate_feature_table, train_predict_feature_table

feature_table_name = generate_feature_table(session=session, 
                                            clone_table_name=trips_table_name, 
                                            feature_table_name='TRIPS_FEATURES_TEST', 
                                            holiday_table_name=holiday_table_name,
                                            precip_table_name=precip_table_name
                                           )

In [None]:
session.table(feature_table_name).show(1)

In [None]:
_ = session.sql('USE WAREHOUSE X2L_WH').collect()

pred_table_name = train_predict_feature_table(session=session, 
                                              station_train_pred_udf_name=model_udf_name, 
                                              feature_table_name=feature_table_name, 
                                              pred_table_name='PRED_TEST'
                                             )

In [None]:
session.table(pred_table_name).show(1)