# Techniques in Spark to train multiple models and perform scalable inferencing

## 1. Preparation before class

### Environment preparation
1. Prepare a Databricks instance with Ls8s_v2
2. Prepare an Azure ML workspace
3. Prepare a service principal with its secret key registered in Key Vault. The service principal should have Contributor access to your Azure ML workspace

### Download data from Microsoft Open Dataset

https://azure.microsoft.com/en-us/services/open-datasets/catalog/sample-oj-sales-simulated

In [6]:
data = spark.read.format("csv").option("header", True).load("wasbs://ojsales-simulatedcontainer@azureopendatastorage.blob.core.windows.net/oj_sales_data/Store10*.csv")

In [7]:
# Write to a local Delta table for fast reading
data.write.format("delta").saveAsTable("OJ_Sales_Data")

In [8]:
%sql optimize OJ_Sales_Data zorder by store, brand

path,metrics
,"List(1, 16, List(456483, 456483, 456483.0, 1, 456483), List(29570, 37082, 36254.5625, 16, 580073), 0, List(minCubeSize(107374182400), List(0, 0), List(16, 580073), 0, List(16, 580073), 1, null), 1, 16, 0, false)"


In [9]:
%sql select * from OJ_Sales_Data limit 10

WeekStarting,Store,Brand,Quantity,Advert,Price,Revenue
1990-06-14,1094,minute.maid,17892,1,2.09,37394.28
1990-06-21,1094,minute.maid,14053,1,2.45,34429.850000000006
1990-06-28,1094,minute.maid,17341,1,2.47,42832.27
1990-07-05,1094,minute.maid,17194,1,2.42,41609.48
1990-07-12,1094,minute.maid,17945,1,2.39,42888.55
1990-07-19,1094,minute.maid,17371,1,2.3,39953.3
1990-07-26,1094,minute.maid,9825,1,2.36,23187.0
1990-08-02,1094,minute.maid,10849,1,2.58,27990.42
1990-08-09,1094,minute.maid,12084,1,2.0,24168.0
1990-08-16,1094,minute.maid,10484,1,2.32,24322.88


In [10]:
%sql select count (distinct store, brand) from OJ_Sales_Data 

"count(DISTINCT store, brand)"
300


In [11]:
%sql select distinct brand from OJ_Sales_Data 

brand
dominicks
tropicana
minute.maid


## Pre-training exercise

1. Read about the Pandas Function APIs: https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/pandas-function-apis
2. Answer the following questions:
- What is the advantage of this technology over a regular Python UDF?
- What role does Apache Arrow play in it?
- Why use an iterator and `yield` rather than a regular list and `return`?
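
As a starting point for the last question, here is a plain-Python sketch (no Spark required) that contrasts the two styles. The function names and the `load_model` stand-in are hypothetical and exist only for this illustration; the sketch shows the general generator pattern, not how Spark itself is implemented.

```python
from typing import Iterable, Iterator, List


def load_model() -> float:
    """Hypothetical stand-in for an expensive one-time setup step."""
    return 2.0


def score_with_list(batches: Iterable[List[float]]) -> List[List[float]]:
    """Eager style: every scored batch is kept in memory until the end."""
    model = load_model()
    results = []
    for batch in batches:
        results.append([model * x for x in batch])
    return results


def score_with_yield(batches: Iterable[List[float]]) -> Iterator[List[float]]:
    """Lazy style: setup runs once, then batches are scored one at a time."""
    model = load_model()
    for batch in batches:
        yield [model * x for x in batch]


batches = [[1.0, 2.0], [3.0], [4.0, 5.0]]

eager = score_with_list(batches)        # all three batches are scored here
lazy = score_with_yield(iter(batches))  # nothing is scored yet
first = next(lazy)                      # only the first batch is scored

print(eager)
print(first)
```

With `yield`, the caller pulls one batch at a time, so only the batch being processed has to be held in memory, while the setup step still runs only once for the whole stream.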

Using the OJ sales dataset above and the Pandas Function APIs, find the best-selling week for each store and brand, formatted as week_number-yyyy.
The result set should look like this:

In [15]:
import pandas as pd
result_sample= pd.DataFrame({"store": [1066, 1067, 1068],'Brand':['dominicks', 'tropicana','tropicana'],"Best_Selling_Week": ['23-1992', '24-1991','24-1991']})
display(result_sample)

store,Brand,Best_Selling_Week
1066,dominicks,23-1992
1067,tropicana,24-1991
1068,tropicana,24-1991
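
If you get stuck, the per-group logic can be prototyped in plain pandas first. The sketch below is one possible approach, not the reference solution: the function name `best_selling_week` is hypothetical, and it assumes that "week number" means the ISO week, which the exercise does not specify.

```python
import pandas as pd


def best_selling_week(pdf: pd.DataFrame) -> pd.DataFrame:
    """Return one row with the highest-Quantity week of a single (Store, Brand) group."""
    # The CSV was read without a schema, so Quantity arrives as a string
    quantity = pd.to_numeric(pdf["Quantity"])
    best = pdf.loc[quantity.idxmax()]
    # Assumption: ISO calendar week; isocalendar() gives (year, week, weekday)
    iso = pd.Timestamp(best["WeekStarting"]).isocalendar()
    label = f"{iso[1]}-{iso[0]}"
    return pd.DataFrame({
        "Store": [best["Store"]],
        "Brand": [best["Brand"]],
        "Best_Selling_Week": [label],
    })


# Toy group built from the first rows of the sample output shown earlier
toy = pd.DataFrame({
    "WeekStarting": ["1990-06-14", "1990-06-21", "1990-06-28"],
    "Store": ["1094"] * 3,
    "Brand": ["minute.maid"] * 3,
    "Quantity": ["17892", "14053", "17341"],
})
toy_result = best_selling_week(toy)
print(toy_result)

# On the cluster, the same function could be applied per group, for example:
# data.groupby("Store", "Brand").applyInPandas(
#     best_selling_week,
#     schema="Store string, Brand string, Best_Selling_Week string")
```

Prototyping on a small pandas frame makes the function easy to debug before it runs on the cluster.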


### Optional reading

We'll use the forecasting models and utilities from the Many Models repo (the AML PRS method) for comparison. To prepare for the training day, it's useful to get familiar with the classes and libraries there.

In [17]:
https://github.com/microsoft/solution-accelerator-many-models/blob/master/Custom_Script/scripts/timeseries_utilities.py
https://github.com/microsoft/solution-accelerator-many-models/blob/master/Custom_Script/scripts/train.py
https://github.com/microsoft/solution-accelerator-many-models/blob/master/Custom_Script/scripts/forecast.py

## 2. Training content

### Many Model Training

In [20]:
# Prepare values to broadcast
tenant_id ='72f988bf-86f1-41af-91ab-2d7cd011db47' 
service_principal_id='af883abf-89dd-4889-bdb3-1ee84f68465e'
service_principal_password=dbutils.secrets.get('scope1','app01-pass')
subscription_id = '0e9bace8-7a81-4922-83b5-d995ff706507'
# Azure Machine Learning resource group NOT the managed resource group
resource_group = 'azureml' 

#Azure Machine Learning workspace name, NOT Azure Databricks workspace
workspace_name = 'ws01ent'  

### Test with a single store & brand combination (single time series)

In [22]:
%run ./timeseries_utilities


In [23]:
#Getting data
import pandas as pd
train_data_df = spark.sql("select to_timestamp(WeekStarting) WeekStarting, float(Quantity), Brand,Revenue, Store from OJ_Sales_Data where Store = '1066' and Brand ='tropicana'").toPandas()


In [24]:
display(train_data_df)

WeekStarting,Quantity,Brand,Revenue,Store
1990-06-14T00:00:00.000+0000,13198.0,tropicana,29695.5,1066
1990-06-21T00:00:00.000+0000,12188.0,tropicana,27179.24,1066
1990-06-28T00:00:00.000+0000,10453.0,tropicana,25505.32,1066
1990-07-05T00:00:00.000+0000,13390.0,tropicana,35349.6,1066
1990-07-12T00:00:00.000+0000,12798.0,tropicana,29691.36,1066
1990-07-19T00:00:00.000+0000,18476.0,tropicana,49146.16,1066
1990-07-26T00:00:00.000+0000,16244.0,tropicana,35087.04,1066
1990-08-02T00:00:00.000+0000,16057.0,tropicana,35807.11,1066
1990-08-09T00:00:00.000+0000,16888.0,tropicana,35127.04,1066
1990-08-16T00:00:00.000+0000,14045.0,tropicana,30056.300000000003,1066


In [25]:

# Use the single time series loaded above to test the utility functions
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
import joblib
import os
target_column= 'Quantity'
timestamp_column= 'WeekStarting'
timeseries_id_columns= [ 'Store', 'Brand']
drop_columns=['Revenue', 'Store', 'Brand']
model_type= 'lr'
model_name=train_data_df['Store'][0]+"_"+train_data_df['Brand'][0]
test_size=20
# 1.0 Put the timestamp in the index and sort chronologically (the data is already loaded from the Delta table)
data = train_data_df \
        .set_index('WeekStarting') \
        .sort_index(ascending=True)

# 2.0 Split the data into train and test sets
train = data[:-test_size]
test = data[-test_size:]

# 3.0 Create and fit the forecasting pipeline
# The pipeline will drop unhelpful features, make a calendar feature, and make lag features
lagger = SimpleLagger(target_column, lag_orders=[1, 2, 3, 4])
transform_steps = [('column_dropper', ColumnDropper(drop_columns)),
                   ('calendar_featurizer', SimpleCalendarFeaturizer()), ('lagger', lagger)]
forecaster = SimpleForecaster(transform_steps, LinearRegression(), target_column, timestamp_column)
forecaster.fit(train)
print('Featurized data example:')
print(forecaster.transform(train).head())
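
The pipeline comment above mentions lag features. As a rough illustration of the idea, the following pandas sketch builds lag columns with `shift`. It is not the actual `SimpleLagger` implementation from `timeseries_utilities`; the helper name `add_lags` and the column naming scheme are assumptions made for this example.

```python
import pandas as pd


def add_lags(df: pd.DataFrame, target: str, lag_orders: list) -> pd.DataFrame:
    """Add one column per lag order holding the target value from `lag` rows earlier."""
    out = df.copy()
    for lag in lag_orders:
        out[f"{target}_lag{lag}"] = out[target].shift(lag)
    return out


# Toy weekly series; the values are made up for illustration
weekly = pd.DataFrame(
    {"Quantity": [10.0, 12.0, 9.0, 15.0, 11.0]},
    index=pd.date_range("1990-06-14", periods=5, freq="7D"),
)
lagged = add_lags(weekly, "Quantity", [1, 2])
print(lagged)
```

The first rows of each lag column are `NaN` because no earlier observation exists for them.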


In [26]:
from azureml.core.authentication import ServicePrincipalAuthentication
from azureml.core import Workspace
from azureml.core import Model

import cloudpickle
import numpy as np  # needed for the RMSE and MAPE calculations below

sp_auth = ServicePrincipalAuthentication(tenant_id =tenant_id,
                                         service_principal_id=service_principal_id,
                                         service_principal_password=service_principal_password)
# Instantiate Azure Machine Learning workspace
ws = Workspace.get(name=workspace_name,
                   subscription_id=subscription_id,
                   resource_group=resource_group,auth= sp_auth)


# 4.0 Get predictions on test set
forecasts = forecaster.forecast(test)
compare_data = test.assign(forecasts=forecasts).dropna()

# 5.0 Calculate accuracy metrics for the fit
mse = mean_squared_error(compare_data[target_column], compare_data['forecasts'])
rmse = np.sqrt(mse)
mae = mean_absolute_error(compare_data[target_column], compare_data['forecasts'])
actuals = compare_data[target_column].values
preds = compare_data['forecasts'].values
mape = np.mean(np.abs((actuals - preds) / actuals) * 100)

# 7.0 Train model with full dataset
forecaster.fit(data)

# 8.0 Save the forecasting pipeline
with open(model_name, mode='wb') as file:
   cloudpickle.dump(forecaster, file)

# model_path must point to the file written above, which is named model_name
model = Model.register(workspace=ws, model_name=model_name, model_path=model_name, tags={'mse':str(mse), 'mape': str(mape), 'rmse': str(rmse)})
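
To make the accuracy metrics in step 5.0 concrete, here is a small worked example with NumPy. The numbers are invented for illustration and are not taken from the OJ sales data.

```python
import numpy as np

# Made-up actuals and forecasts for three weeks
actuals = np.array([100.0, 200.0, 400.0])
preds = np.array([110.0, 180.0, 400.0])

errors = actuals - preds                        # [-10, 20, 0]
mse = np.mean(errors ** 2)                      # mean of [100, 400, 0]
rmse = np.sqrt(mse)                             # same units as Quantity
mae = np.mean(np.abs(errors))                   # mean of [10, 20, 0]
mape = np.mean(np.abs(errors / actuals) * 100)  # mean of [10%, 10%, 0%]

print(f"mse={mse:.2f} rmse={rmse:.2f} mae={mae:.2f} mape={mape:.2f}")
```

Note that MAPE divides by the actual values, so it is undefined for any week in which the actual quantity is zero.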



#### Scale up to many-model training with the Pandas Function API

In [28]:
#Prepare the core training function

from azureml.core.authentication import ServicePrincipalAuthentication
from azureml.core import Workspace
from azureml.core import Model
import cloudpickle
# Use cloudpickle rather than joblib to dump the forecaster: joblib has issues with this multi-level object
def many_model_train(train_data_df):
  sp_auth = ServicePrincipalAuthentication(tenant_id =tenant_id,
                                         service_principal_id=service_principal_id,
                                         service_principal_password=service_principal_password)
  # Instantiate Azure Machine Learning workspace
  ws = Workspace.get(name=workspace_name,
                     subscription_id=subscription_id,
                     resource_group=resource_group,auth= sp_auth)


  target_column= 'Quantity'
  timestamp_column= 'WeekStarting'
  timeseries_id_columns= [ 'Store', 'Brand']
  drop_columns=['Revenue', 'Store', 'Brand']
  model_type= 'lr'
  # .iloc[0] takes the first row by position, regardless of the group's index labels
  model_name=train_data_df['Store'].iloc[0]+"_"+train_data_df['Brand'].iloc[0]
  test_size=20
  # 1.0 Put the timestamp in the index and sort chronologically (each call receives one Store/Brand group)
  data = train_data_df \
          .set_index('WeekStarting') \
          .sort_index(ascending=True)

  # 2.0 Split the data into train and test sets
  train = data[:-test_size]
  test = data[-test_size:]

  # 3.0 Create and fit the forecasting pipeline
  # The pipeline will drop unhelpful features, make a calendar feature, and make lag features
  lagger = SimpleLagger(target_column, lag_orders=[1, 2, 3, 4])
  transform_steps = [('column_dropper', ColumnDropper(drop_columns)),
                     ('calendar_featurizer', SimpleCalendarFeaturizer()), ('lagger', lagger)]
  forecaster = SimpleForecaster(transform_steps, LinearRegression(), target_column, timestamp_column)
  forecaster.fit(train)

  # 4.0 Get predictions on test set
  forecasts = forecaster.forecast(test)
  compare_data = test.assign(forecasts=forecasts).dropna()

  # 5.0 Calculate accuracy metrics for the fit
  mse = mean_squared_error(compare_data[target_column], compare_data['forecasts'])
  rmse = np.sqrt(mse)
  mae = mean_absolute_error(compare_data[target_column], compare_data['forecasts'])
  actuals = compare_data[target_column].values
  preds = compare_data['forecasts'].values
  mape = np.mean(np.abs((actuals - preds) / actuals) * 100)

  # 7.0 Train model with full dataset
  forecaster.fit(data)

  # 8.0 Save the forecasting pipeline
  with open(model_name, mode='wb') as file:
    cloudpickle.dump(forecaster, file)
  model = Model.register(workspace=ws, model_name=model_name, model_path=model_name, tags={'mse':str(mse), 'mape': str(mape), 'rmse': str(rmse)})
  
  return pd.DataFrame({'mse':[mse], 'mape': [mape], 'rmse': [rmse], 'filename':[model_name]})


In [29]:
df = spark.sql("select to_timestamp(WeekStarting) WeekStarting, float(Quantity), Brand,Revenue, Store from OJ_Sales_Data")
df = df.repartition(200) #to increase parallelism
result = df.groupby(["Brand","Store"]).applyInPandas(many_model_train, schema="mse float, mape float, rmse float, filename string ")


In [30]:
display(result)

mse,mape,rmse,filename
10501595.0,20.475067,3240.6165,1031_tropicana
8323296.5,17.982645,2885.0125,1021_minute.maid
8422692.0,18.61009,2902.1875,1074_tropicana
12016312.0,22.212152,3466.4553,1077_minute.maid
6714000.0,13.475155,2591.1387,1078_minute.maid
10599569.0,23.259544,3255.6978,1019_minute.maid
5647451.5,15.766435,2376.4368,1090_tropicana
8836949.0,19.134098,2972.7007,1099_tropicana
7331310.0,14.901359,2707.6392,1014_minute.maid
10048782.0,20.76326,3169.9814,1020_minute.maid
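
`applyInPandas` calls the function once per group and combines the returned DataFrames into a single result. Before running on the cluster, that behavior can be approximated locally with plain pandas. The sketch below is a rough emulation: `summarize_group` is a hypothetical stand-in for `many_model_train`, and the toy rows are made up.

```python
import pandas as pd


def summarize_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for many_model_train: one output row per group."""
    name = f"{pdf['Store'].iloc[0]}_{pdf['Brand'].iloc[0]}"
    return pd.DataFrame({"filename": [name], "rows": [len(pdf)]})


# Toy input covering two (Brand, Store) groups
sales = pd.DataFrame({
    "Store": ["1066", "1066", "1066"],
    "Brand": ["tropicana", "tropicana", "minute.maid"],
    "Quantity": [13198.0, 12188.0, 17892.0],
})

# One function call per group, then concatenate, as applyInPandas does
local_result = pd.concat(
    [summarize_group(group) for _, group in sales.groupby(["Brand", "Store"])],
    ignore_index=True,
)
print(local_result)
```

A local dry run like this catches errors in the per-group function, such as a wrong column name, without waiting for a Spark job to fail on a worker.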


### Many Model Inferencing: can you score using multiple models in parallel?

In [32]:
# Homework: prepare a function for the Pandas Function API that produces forecasts for multiple stores and brands, given the test data
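
As a hint for the homework, the scoring function can follow the same per-group shape as the training function: derive the model name from the group, obtain the model, and return the forecasts as a DataFrame. The skeleton below is only one possible shape and runs locally in plain pandas. `MeanModel`, `MODEL_REGISTRY`, and `many_model_forecast` are hypothetical names; in the notebook, the lookup would instead fetch the registered model from Azure ML and unpickle it.

```python
import pandas as pd


class MeanModel:
    """Hypothetical stand-in for a trained forecaster; predicts a constant."""

    def __init__(self, value: float):
        self.value = value

    def forecast(self, pdf: pd.DataFrame) -> list:
        return [self.value] * len(pdf)


# Hypothetical in-memory registry keyed by the "<Store>_<Brand>" naming used in training
MODEL_REGISTRY = {
    "1066_tropicana": MeanModel(13000.0),
    "1094_minute.maid": MeanModel(15000.0),
}


def many_model_forecast(pdf: pd.DataFrame) -> pd.DataFrame:
    """Score one (Store, Brand) group with the model that belongs to it."""
    model_name = f"{pdf['Store'].iloc[0]}_{pdf['Brand'].iloc[0]}"
    # Assumption: in the notebook this lookup is replaced by downloading the
    # registered Azure ML model and loading it with cloudpickle
    model = MODEL_REGISTRY[model_name]
    scored = pdf.assign(Forecast=model.forecast(pdf))
    return scored[["WeekStarting", "Store", "Brand", "Forecast"]]


# Made-up scoring data covering two time series
scoring = pd.DataFrame({
    "WeekStarting": pd.to_datetime(["1992-10-01", "1992-10-08", "1992-10-01"]),
    "Store": ["1066", "1066", "1094"],
    "Brand": ["tropicana", "tropicana", "minute.maid"],
})
forecasts = pd.concat(
    [many_model_forecast(group) for _, group in scoring.groupby(["Store", "Brand"])],
    ignore_index=True,
)
print(forecasts)

# On the cluster, the same function could be wired up as, for example:
# test_df.groupby("Store", "Brand").applyInPandas(
#     many_model_forecast,
#     schema="WeekStarting timestamp, Store string, Brand string, Forecast double")
```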