# Many Model Training Using Synapse Spark 2.4 with AutoML

### Environment preparation
1. Prepare a Synapse Spark 2.4 pool.
   As of this writing, AutoML is not supported on Python 3.8 or later, so we have to use Spark 2.4 for AutoML.
2. To work around the incompatibility between Spark pandas UDFs and pyarrow >= 0.15, downgrade pyarrow to 0.14.1. Use the requirements file to pin the version.
3. Prepare an Azure ML workspace.
4. Prepare a service principal whose secret key is registered in Key Vault. The service principal should have Contributor access to your Azure ML workspace.
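
The version constraints in steps 1 and 2 can be checked from a notebook cell before any training starts. The following is a minimal sketch, not part of the original notebook: the helper names are hypothetical, and it assumes plain dotted version strings such as `0.14.1` (the version that a requirements-file line like `pyarrow==0.14.1` would pin).

```python
import sys


def parse_version(text):
    """Turn a dotted version such as '0.14.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


def pyarrow_ok_for_spark24(pyarrow_version):
    """True when the version is older than 0.15, the first release named as incompatible."""
    return parse_version(pyarrow_version) < (0, 15)


def python_ok_for_automl(version_info=sys.version_info):
    """True when the interpreter is older than Python 3.8."""
    return tuple(version_info[:2]) < (3, 8)


# Check the pinned pyarrow version and the running interpreter
print("pyarrow 0.14.1 compatible:", pyarrow_ok_for_spark24("0.14.1"))
print("Python compatible:", python_ok_for_automl())
```

On the pool itself, the installed version could be passed in as `pyarrow.__version__`.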

### Download data from Microsoft Open Dataset

https://azure.microsoft.com/en-us/services/open-datasets/catalog/sample-oj-sales-simulated

In [1]:
data = spark.read.format("csv").option("header", True).load("wasbs://ojsales-simulatedcontainer@azureopendatastorage.blob.core.windows.net/oj_sales_data/Store10*.csv")

StatementMeta(spark001, 4, 1, Finished, Available)

In [71]:
# Write to a local Delta table for fast reading
data.write.format("delta").mode("overwrite").saveAsTable("OJ_Sales_Data")

StatementMeta(spark001, 6, 20, Finished, Available)

In [2]:
%%sql 
select * from OJ_Sales_Data limit 10

StatementMeta(spark001, 4, 2, Finished, Available)

<Spark SQL result set with 10 rows and 7 fields>

In [3]:
%%sql 
select count (distinct store, brand) from OJ_Sales_Data 

StatementMeta(spark001, 4, 3, Finished, Available)

<Spark SQL result set with 1 rows and 1 fields>

In [76]:
spark.conf.set('spark.sql.execution.arrow.maxRecordsPerBatch', 100)
# The default is 10000, which in some cases may defeat the purpose of parallelism

StatementMeta(spark001, 8, 2, Finished, Available)
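
To see what the batch size changes, the sketch below counts how many Arrow record batches one partition would be cut into. It is an illustration only: the function name is hypothetical, and it assumes Spark splits a partition into batches of at most `maxRecordsPerBatch` rows, with zero or a negative value meaning no limit. Spark's documentation also states that this setting is not applied to grouped map UDFs, where each group arrives as a single pandas DataFrame.

```python
import math


def arrow_batch_count(rows_in_partition, max_records_per_batch=10000):
    """Number of Arrow record batches needed to carry one partition."""
    if rows_in_partition <= 0:
        return 0
    if max_records_per_batch <= 0:
        # Zero or negative is treated as "no limit": everything goes in one batch
        return 1
    return math.ceil(rows_in_partition / max_records_per_batch)


# Compare the default batch size with the value set in the cell above
for batch_size in (10000, 100):
    print(batch_size, arrow_batch_count(25000, batch_size))
```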

### Many Model Training

In [83]:
# Prepare values to broadcast
tenant_id = ''
service_principal_id = ''
service_principal_password = ''
subscription_id = ''
# Azure Machine Learning resource group, NOT the managed resource group
resource_group = ''

# Azure Machine Learning workspace name, NOT Azure Databricks workspace
workspace_name = ''

StatementMeta(spark001, 8, 9, Finished, Available)
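
Rather than typing the values into the cell, they can be read from a configuration source and validated up front, so that an empty string fails fast instead of surfacing later inside a Spark task. This is a sketch under assumptions: the `AML_` prefix, the key names, and `load_settings` are hypothetical, and a plain mapping stands in for wherever the secret is actually stored (in Synapse that would typically be Key Vault).

```python
import os

# Hypothetical setting names; one per value used by the training cells
REQUIRED_KEYS = (
    "TENANT_ID",
    "SERVICE_PRINCIPAL_ID",
    "SERVICE_PRINCIPAL_PASSWORD",
    "SUBSCRIPTION_ID",
    "RESOURCE_GROUP",
    "WORKSPACE_NAME",
)


def load_settings(source=os.environ, prefix="AML_"):
    """Read every required setting from `source` and fail if any is missing or empty."""
    settings = {}
    missing = []
    for key in REQUIRED_KEYS:
        value = source.get(prefix + key, "")
        if value:
            settings[key.lower()] = value
        else:
            missing.append(prefix + key)
    if missing:
        raise KeyError("Missing settings: " + ", ".join(missing))
    return settings


# Example with an explicit mapping of placeholder values
example = {"AML_" + key: "placeholder-" + key.lower() for key in REQUIRED_KEYS}
settings = load_settings(example)
print(sorted(settings))
```

The returned dictionary can then be unpacked into `tenant_id`, `service_principal_id`, and the other variables above.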

### Test with a single store & brand combination (single time series)

In [55]:
#Getting data
import pandas as pd
train_data_df = spark.sql("select to_timestamp(WeekStarting) WeekStarting, float(Quantity), Brand,Revenue, Store from OJ_Sales_Data where Store = '1066' and Brand ='tropicana'").toPandas()


StatementMeta(spark001, 6, 4, Finished, Available)

In [89]:
from azureml.core.experiment import Experiment
from azureml.core.authentication import ServicePrincipalAuthentication
from azureml.core import Workspace
from azureml.core import Model
from azureml.train.automl import AutoMLConfig

# Authenticate with the service principal
sp_auth = ServicePrincipalAuthentication(tenant_id=tenant_id,
                                         service_principal_id=service_principal_id,
                                         service_principal_password=service_principal_password)
# Instantiate Azure Machine Learning workspace
ws = Workspace.get(name=workspace_name,
                   subscription_id=subscription_id,
                   resource_group=resource_group, auth=sp_auth)


experiment_name = 'automl-ml-forecast-local'

experiment=Experiment(ws, experiment_name)



# Prepare the data for one series to test the training logic
import joblib

test_size = 20

# 1.0 Sort chronologically so the positional split below holds out the most recent weeks
train_data_df = train_data_df.sort_values('WeekStarting').reset_index(drop=True)

# Name the model after its store and brand, e.g. 1066_tropicana
model_name = train_data_df['Store'].iloc[0] + "_" + train_data_df['Brand'].iloc[0]

# 2.0 Split the data into train and test sets
train = train_data_df[:-test_size]
test = train_data_df[-test_size:]




time_column_name = 'WeekStarting'
time_series_id_column_names = ['Store', 'Brand']
time_series_settings = {
    'time_column_name': time_column_name,
    'time_series_id_column_names': time_series_id_column_names,
    'forecast_horizon': 2,
}

automl_config = AutoMLConfig(
    task='forecasting',
    debug_log='automl_oj_sales_errors.log',
    primary_metric='normalized_root_mean_squared_error',
    experiment_timeout_minutes=20,
    training_data=train,
    label_column_name="Quantity",
    n_cross_validations=5,
    **time_series_settings
)

local_run = experiment.submit(automl_config, show_output = False)
best_run, fitted_model = local_run.get_output()

with open(model_name, mode='wb') as file:
    joblib.dump(fitted_model, file)

model = Model.register(workspace=ws, model_name=model_name, model_path=model_name)


StatementMeta(spark001, 8, 15, Finished, Available)

NameError: name 'train_data_df' is not defined

In [13]:
best_run, fitted_model = local_run.get_output()

with open(model_name, mode='wb') as file:
    joblib.dump(fitted_model, file)

model = Model.register(workspace=ws, model_name=model_name, model_path=model_name)


StatementMeta(spark001, 5, 13, Finished, Available)

Registering model 1066_tropicana
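
The single-series test holds out a `test` frame but never scores it. One way to use it is to compute the same kind of metric that AutoML optimizes. The sketch below is an assumption-laden illustration: `nrmse` is a hypothetical helper, and it assumes the normalized root mean squared error is the RMSE divided by the range of the actual values, which may differ in detail from how AutoML normalizes per series.

```python
import math


def nrmse(actual, predicted):
    """Root mean squared error divided by the range (max - min) of the actual values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    value_range = max(actual) - min(actual)
    if value_range == 0:
        raise ValueError("actual values are constant, so the range is zero")
    mean_squared_error = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mean_squared_error) / value_range


# Toy numbers standing in for test['Quantity'] and the model's forecasts
actual_quantity = [10.0, 20.0, 30.0]
forecast_quantity = [12.0, 18.0, 33.0]
print(round(nrmse(actual_quantity, forecast_quantity), 4))
```

With the notebook's data, `actual` would be `test['Quantity'].tolist()` and `predicted` the forecasts produced by `fitted_model` for the same weeks.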

### Scale up to many-model training with the pandas function API

In [95]:
# Prepare the core training function

from azureml.core.authentication import ServicePrincipalAuthentication
from azureml.core import Workspace
from azureml.core import Model
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
import pandas as pd
# Use cloudpickle, not joblib, to dump the model: joblib has issues with multi-level objects
import cloudpickle
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Output schema: one row per trained (Store, Brand) model
schema = StructType([
    StructField("Store", StringType(), True),
    StructField("Brand", StringType(), True),
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def many_model_train(train_data_df):
    # This runs on an executor, so authenticate inside the function
    sp_auth = ServicePrincipalAuthentication(tenant_id=tenant_id,
                                             service_principal_id=service_principal_id,
                                             service_principal_password=service_principal_password)
    # Instantiate Azure Machine Learning workspace
    ws = Workspace.get(name=workspace_name,
                       subscription_id=subscription_id,
                       resource_group=resource_group, auth=sp_auth)
    experiment_name = 'automl-ml-forecast-local'
    experiment = Experiment(ws, experiment_name)

    # Store and brand are constant within a group, so the first value is sufficient
    store = str(train_data_df['Store'].iloc[0])
    brand = str(train_data_df['Brand'].iloc[0])
    model_name = store + "_" + brand

    # 1.0 Sort the group chronologically, then hold out the most recent weeks
    test_size = 20
    train_data_df = train_data_df.sort_values('WeekStarting').reset_index(drop=True)
    train = train_data_df[:-test_size]
    test = train_data_df[-test_size:]

    # 2.0 Configure and run AutoML for this series
    time_column_name = 'WeekStarting'
    time_series_id_column_names = ['Store', 'Brand']
    time_series_settings = {
        'time_column_name': time_column_name,
        'time_series_id_column_names': time_series_id_column_names,
        'forecast_horizon': 2,
    }

    automl_config = AutoMLConfig(
        task='forecasting',
        debug_log='automl_oj_sales_errors.log',
        primary_metric='normalized_root_mean_squared_error',
        experiment_timeout_minutes=20,
        iterations=2,
        training_data=train,
        label_column_name="Quantity",
        n_cross_validations=2,
        **time_series_settings
    )

    local_run = experiment.submit(automl_config, show_output=False)
    best_run, fitted_model = local_run.get_output()

    # 3.0 Serialize the fitted model and register it under <store>_<brand>
    with open(model_name, mode='wb') as file:
        cloudpickle.dump(fitted_model, file)

    model = Model.register(workspace=ws, model_name=model_name, model_path=model_name)

    return pd.DataFrame({'Store': [store], 'Brand': [brand]})


StatementMeta(spark001, 8, 21, Finished, Available)
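
The training function writes each model to a file named after the series in the executor's working directory. When many groups are processed on the same node, writing into a fresh temporary directory per call is one way to keep the files apart. The sketch below is not the notebook's code: `dump_model` and `load_model` are hypothetical helpers, and the standard-library `pickle` stands in for `cloudpickle`, which offers a matching `dump` call.

```python
import os
import pickle
import tempfile


def dump_model(model, model_name, directory=None):
    """Serialize `model` to <directory>/<model_name>.pkl and return the file path."""
    # A fresh temporary directory per call avoids clashes between concurrent tasks
    directory = directory or tempfile.mkdtemp(prefix="many_models_")
    path = os.path.join(directory, model_name + ".pkl")
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    return path


def load_model(path):
    """Read back an object written by dump_model."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A plain dictionary stands in for the fitted model
path = dump_model({"store": "1066", "brand": "tropicana"}, "1066_tropicana")
restored = load_model(path)
print(os.path.basename(path), restored)
```

The returned path would then be passed as `model_path` when registering the model.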

In [96]:
df = spark.sql("select to_timestamp(WeekStarting) WeekStarting, float(Quantity), Brand,Revenue, Store from OJ_Sales_Data")
result = df.groupby(["Brand","Store"]).apply(many_model_train)


StatementMeta(spark001, 8, 22, Finished, Available)
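
Because a grouped map function receives one pandas DataFrame per group and returns one pandas DataFrame, its logic can be tried on a small sample with pandas alone before it is submitted to the cluster. The sketch below illustrates that pattern with made-up data: `run_grouped_locally` and `summarize_group` are hypothetical names, and `summarize_group` merely stands in for the training function.

```python
import pandas as pd


def run_grouped_locally(pdf, keys, func):
    """Call `func` once per group and stack the per-group results."""
    pieces = [
        func(group.reset_index(drop=True))
        for _, group in pdf.groupby(keys, sort=True)
    ]
    return pd.concat(pieces, ignore_index=True)


def summarize_group(group):
    """Stand-in for the training function: returns one row per (Store, Brand)."""
    return pd.DataFrame({
        "Store": [str(group["Store"].iloc[0])],
        "Brand": [str(group["Brand"].iloc[0])],
        "Rows": [len(group)],
    })


# Made-up sample with three distinct (Brand, Store) combinations
sample = pd.DataFrame({
    "Store": ["1066", "1066", "1066", "1067"],
    "Brand": ["tropicana", "tropicana", "dominicks", "tropicana"],
    "Quantity": [10.0, 12.0, 7.0, 9.0],
})
local_result = run_grouped_locally(sample, ["Brand", "Store"], summarize_group)
print(local_result)
```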

In [97]:
display(result.head(10))

StatementMeta(spark001, 8, 23, Submitted, Running)

### Many Model Inferencing: Can you score using multiple models in parallel?

#### Homework: prepare a pandas function UDF that produces forecasts for multiple store and brand combinations, given the test data
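
As a starting point, the per-group scoring logic can be written and tested in plain pandas before it is wrapped in a grouped map UDF. The following is only a skeleton under assumptions: `score_group`, `ConstantModel`, and the `load_model` callback are hypothetical, and the stand-in model exposes a simple `predict` method. In a real solution, `load_model` would fetch the model registered as `<store>_<brand>` from the Azure ML workspace, and the call on the fitted AutoML model may differ from `predict`.

```python
import pandas as pd


class ConstantModel:
    """Hypothetical stand-in for a fitted model: always forecasts the same value."""

    def __init__(self, value):
        self.value = value

    def predict(self, frame):
        return [self.value] * len(frame)


def score_group(test_df, load_model):
    """Score one (Store, Brand) group with the model named '<store>_<brand>'."""
    store = str(test_df["Store"].iloc[0])
    brand = str(test_df["Brand"].iloc[0])
    model = load_model(store + "_" + brand)
    return pd.DataFrame({
        "Store": store,
        "Brand": brand,
        "WeekStarting": test_df["WeekStarting"].values,
        "Forecast": model.predict(test_df),
    })


# An in-memory dictionary stands in for the model registry
models = {"1066_tropicana": ConstantModel(11.0)}
test_df = pd.DataFrame({
    "Store": ["1066", "1066"],
    "Brand": ["tropicana", "tropicana"],
    "WeekStarting": pd.to_datetime(["1992-09-24", "1992-10-01"]),
})
forecast = score_group(test_df, lambda name: models[name])
print(forecast)
```

The remaining work is to define the output schema for these four columns and apply the function per group, in the same way as the training function.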