<i>Copyright (c) Microsoft Corporation.</i>

<i>Licensed under the MIT License.</i> 

# Automated Machine Learning (AutoML) on Azure for Retail Sales Forecasting

This notebook demonstrates how to apply [AutoML in Azure Machine Learning services](https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml) to train and tune machine learning models for forecasting product sales in retail. We will use the Orange Juice dataset to illustrate the steps of utilizing AutoML as well as how to combine an AutoML model with a custom model for better performance.

AutoML automates the tasks of machine learning model development. It helps data scientists and other practitioners build scalable, high-quality machine learning models in less time. AutoML in Azure Machine Learning allows you to train and tune a model using a target metric that you specify. The service iterates through machine learning algorithms and feature selection approaches, producing a score that measures the quality of each machine learning pipeline. The best model is then selected based on these scores. For more technical details about Azure AutoML, please check [this paper](https://papers.nips.cc/paper/7595-probabilistic-matrix-factorization-for-automated-machine-learning.pdf).

## Global Settings and Imports

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import os
import sys
import math
import warnings
import datetime
import logging
import azureml.core
import azureml.automl
import pandas as pd

from matplotlib import pyplot as plt
from fclib.azureml.azureml_utils import (
    get_or_create_workspace,
    get_or_create_amlcompute,
)
from fclib.dataset.ojdata import download_ojdata, FIRST_WEEK_START
from fclib.common.utils import align_outputs
from fclib.evaluation.evaluation_utils import MAPE
from fclib.models.multiple_linear_regression import fit, predict

from azureml.core import Workspace
from azureml.core.dataset import Dataset
from azureml.core.experiment import Experiment
from azureml.core.model import Model
from automl.client.core.common import constants
from azureml.train.automl import AutoMLConfig
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.automl.core._vendor.automl.client.core.common import metrics

warnings.filterwarnings("ignore")

print("System version: {}".format(sys.version))
print("This notebook was created using version 1.0.85 of the Azure ML SDK")
print("You are currently using version", azureml.core.VERSION, "of the Azure ML SDK")

System version: 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0]
This notebook was created using version 1.0.85 of the Azure ML SDK
You are currently using version 1.0.85 of the Azure ML SDK


In [3]:
# Data directory
DATA_DIR = os.path.join("ojdata")

# Forecasting settings
GAP = 2
LAST_WEEK = 138

# Number of test periods
NUM_TEST_PERIODS = 3

# Column names
time_column_name = "week_start"
target_column_name = "move"
grain_column_names = ["store", "brand"]
index_column_names = [time_column_name] + grain_column_names

# Subset of stores used in the notebook
USE_STORES = [2, 5, 8]

## Set up Azure Machine Learning Workspace

An Azure ML workspace is an Azure resource that organizes and coordinates the actions of many other Azure resources to assist in executing and sharing machine learning workflows. In particular, an Azure ML workspace coordinates storage, databases, and compute resources, providing added functionality for machine learning experimentation, deployment, inference, and the monitoring of deployed models. To create an Azure ML workspace, you first need access to an Azure subscription. An Azure subscription allows you to manage storage, compute, and other assets in the Azure cloud. You can [create a new subscription](https://azure.microsoft.com/en-us/free/) or access existing subscription information from the [Azure portal](https://portal.azure.com/). Once you have access to an Azure subscription, you can create an Azure ML workspace by following the instructions [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace). You can also do so [using Azure CLI](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace-cli) or the `Workspace.create()` method in the Azure ML SDK.

Once you have created an Azure ML workspace, you can download its configuration file (`config.json`) from the Azure portal as follows:

<img src="https://user-images.githubusercontent.com/20047467/76651752-8827b180-653b-11ea-942d-99cf0bdc4f96.png" width="900" height="320">

### Prepare Azure ML Workspace

In the following cell, `get_or_create_workspace()` creates a workspace object from the details stored in the `config.json` file that you downloaded. We assume that you have stored this config file in the directory `./.azureml`. If the existing workspace cannot be loaded, the cell will try to create a new workspace with the subscription ID, resource group, workspace name, and region specified at the beginning of the cell.

The cell can fail if you don't have permission to access the workspace. You may need to log into your Azure account and set the default subscription to the one that the workspace belongs to, using the Azure CLI command `az account set --subscription <name or id>`.
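
Before running the cell, it can help to confirm that the downloaded configuration file is where the notebook expects it. The sketch below checks that a `config.json` exists in a given directory and carries the three keys a downloaded workspace configuration normally contains (`subscription_id`, `resource_group`, `workspace_name`). The helper name `check_workspace_config` is hypothetical and is not part of `fclib` or the Azure ML SDK; the demo writes placeholder values to a temporary directory so that it never touches a real configuration.

```python
import json
import os
import tempfile

# Keys that a workspace config.json downloaded from the Azure portal normally contains.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def check_workspace_config(config_dir):
    """Return the parsed config.json from config_dir, or raise if it is unusable.

    Hypothetical helper for illustration only.
    """
    config_file = os.path.join(config_dir, "config.json")
    if not os.path.isfile(config_file):
        raise FileNotFoundError("No config.json found in {}".format(config_dir))
    with open(config_file) as f:
        config = json.load(f)
    # Report every missing or empty key at once.
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("config.json is missing: {}".format(", ".join(missing)))
    return config


# Demo with placeholder values written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_dir = os.path.join(tmp_dir, ".azureml")
    os.makedirs(demo_dir)
    with open(os.path.join(demo_dir, "config.json"), "w") as f:
        json.dump(
            {
                "subscription_id": "<subscription-id>",
                "resource_group": "<resource-group>",
                "workspace_name": "<workspace-name>",
            },
            f,
        )
    demo_config = check_workspace_config(demo_dir)
    print(sorted(demo_config))
```

In practice you would call `check_workspace_config("./.azureml")` against your own file before connecting to the workspace.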

In [None]:
# Please specify the AzureML workspace attributes below if you want to create a new one.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
workspace_name = "<workspace-name>"
workspace_region = "<workspace-region>"

# Connect to a workspace
ws = get_or_create_workspace(
    config_path="./.azureml",
    subscription_id=subscription_id,
    resource_group=resource_group,
    workspace_name=workspace_name,
    workspace_region=workspace_region,
)
print(
    "Workspace name: " + ws.name,
    "Azure region: " + ws.location,
    "Resource group: " + ws.resource_group,
    sep="\n",
)

### Create compute resources for your experiments

We run AutoML on a dynamically scalable compute cluster. In the next cell, we create an AmlCompute target with a specific cluster name, VM size, and maximum number of nodes if the cluster does not already exist; otherwise, we reuse the existing one. For other VM sizes, see the [Azure VM sizes documentation](https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-general).

In [5]:
# Choose a name for your cluster
cluster_name = "cpu-cluster"
# VM Size
vm_size = "STANDARD_D2_V2"
# Maximum number of nodes of the cluster
max_nodes = 4

# Create a new AmlCompute if it does not exist or reuse an existing one
cpu_cluster = get_or_create_amlcompute(
    workspace=ws,
    compute_name=cluster_name,
    vm_size=vm_size,
    min_nodes=0,
    max_nodes=max_nodes,
    verbose=True,
)

Found compute target: cpu-cluster
Rescaling to 4 nodes
Updating
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


### Define Experiment

To run AutoML, you need to create an Experiment. An Experiment corresponds to a prediction problem you are trying to solve, while a Run corresponds to a specific approach to the problem.

In [None]:
# choose a name for the run history container in the workspace
experiment_name = "automl-ojforecasting"

experiment = Experiment(ws, experiment_name)

output = {}
output["SDK version"] = azureml.core.VERSION
output["Workspace"] = ws.name
output["SKU"] = ws.sku
output["Resource Group"] = ws.resource_group
output["Location"] = ws.location
output["Run History Name"] = experiment_name
pd.set_option("display.max_colwidth", -1)
outputDf = pd.DataFrame(data=output, index=[""])
outputDf.T

## Data Preparation

We need to split the Orange Juice data into training and test sets. By default, the following cell will download and split the data. If you've already done so, you may skip this part by switching `DOWNLOAD_SPLIT_DATA` to `False`.

We store the training data and test data in dataframes. The training data includes `train_df` and `aux_df`, with `train_df` containing the historical sales up to week 135 (the time we make forecasts) and `aux_df` containing price/promotion information up until week 138. We assume that future price and promotion information up to a certain number of weeks ahead is predetermined and known. The test data is stored in `test_df`, which contains the sales of each product in weeks 137 and 138. Assuming the current week is week 135, our goal is to forecast the sales in weeks 137 and 138 using the training data. There is a one-week gap between the current week and the first target week of forecasting, as we want to leave time for planning inventory in practice.

### Data split

In [7]:
df = pd.read_csv(os.path.join(DATA_DIR, "yx.csv"))
df = df.loc[df.week <= LAST_WEEK]

In [8]:
# Convert logarithm of the unit sales to unit sales
df["move"] = df["logmove"].apply(lambda x: round(math.exp(x)))
# Add timestamp column
df["week_start"] = df["week"].apply(lambda x: FIRST_WEEK_START + datetime.timedelta(days=(x - 1) * 7))
# Select a subset of stores for demo purpose
df_sub = df[df.store.isin(USE_STORES)]
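
The two conversions above can be checked in isolation. The sketch below is a minimal stand-alone version; the start date `1989-09-14` is an assumption inferred from the training-data preview later in this notebook (where week 135 starts on 1992-04-09 and a `logmove` of 10.40499 corresponds to a `move` of 33024), whereas the notebook itself imports `FIRST_WEEK_START` from `fclib.dataset.ojdata`. The function names are hypothetical.

```python
import datetime
import math

# Assumption: inferred from the data preview (week 135 <-> 1992-04-09),
# not read from fclib.dataset.ojdata.
ASSUMED_FIRST_WEEK_START = datetime.datetime(1989, 9, 14)


def week_to_start_date(week):
    """Map a 1-based week number to the date on which that week starts."""
    return ASSUMED_FIRST_WEEK_START + datetime.timedelta(days=(week - 1) * 7)


def logmove_to_move(logmove):
    """Convert the natural logarithm of unit sales back to whole unit sales."""
    return round(math.exp(logmove))


print(week_to_start_date(135).date())  # 1992-04-09
print(logmove_to_move(10.40499))  # 33024
```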

In [9]:
# Split data into training and test sets
def split_last_n_by_grain(df, n):
    """Group df by grain and split on last n rows for each group."""
    df_grouped = df.sort_values(time_column_name).groupby(  # Sort by ascending time
        grain_column_names, group_keys=False
    )
    df_head = df_grouped.apply(lambda dfg: dfg.iloc[:-n])
    df_tail = df_grouped.apply(lambda dfg: dfg.iloc[-n:])
    return df_head, df_tail


train_df, test_df = split_last_n_by_grain(df_sub, NUM_TEST_PERIODS)
# reset_index returns a new DataFrame, so the result must be assigned back
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

# Save data locally
local_data_pathes = [
    os.path.join(DATA_DIR, "train.csv"),
    os.path.join(DATA_DIR, "test.csv"),
]

train_df.to_csv(local_data_pathes[0], index=None, header=True)
test_df.to_csv(local_data_pathes[1], index=None, header=True)
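
To see what a per-series split of this kind produces, the sketch below applies the same idea to a toy dataframe. It is an alternative formulation, not the function used above: instead of `groupby(...).apply(...)`, it ranks rows from the end of each group with `cumcount(ascending=False)`. The function name `split_last_n_by_group` and the toy data are made up for illustration.

```python
import pandas as pd


def split_last_n_by_group(df, group_cols, time_col, n):
    """Split off the last n rows (by time) of every group.

    Returns (head, tail), where tail holds the last n rows of each group.
    """
    df_sorted = df.sort_values(time_col)
    # 0 for the most recent row of a group, 1 for the one before it, and so on.
    rank_from_end = df_sorted.groupby(group_cols).cumcount(ascending=False)
    head = df_sorted[rank_from_end >= n].reset_index(drop=True)
    tail = df_sorted[rank_from_end < n].reset_index(drop=True)
    return head, tail


# Toy data: two stores, one brand, weeks 131 to 135.
toy = pd.DataFrame(
    {
        "store": [2] * 5 + [5] * 5,
        "brand": [1] * 10,
        "week": list(range(131, 136)) * 2,
    }
)

toy_train, toy_test = split_last_n_by_group(toy, ["store", "brand"], "week", n=2)
print(toy_train["week"].max(), sorted(toy_test["week"].unique()))
```

Each store keeps weeks 131 to 133 for training and contributes weeks 134 and 135 to the test set, so no series is left without test rows.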

### Upload data to datastore

The [Machine Learning service workspace](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-workspace) is paired with a storage account, which contains the default datastore. We will use it to upload the training and test data and create [tabular datasets](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py) for training and testing. A tabular dataset defines a series of lazily-evaluated, immutable operations to load data from the data source into a tabular representation.


In [10]:
datastore = ws.get_default_datastore()
datastore.upload_files(files=local_data_pathes, target_path="dataset/", overwrite=True, show_progress=True)

Uploading an estimated of 2 files
Uploading ojdata/test.csv
Uploading ojdata/train.csv
Uploaded ojdata/test.csv, 1 files out of an estimated total of 2
Uploaded ojdata/train.csv, 2 files out of an estimated total of 2
Uploaded 2 files


$AZUREML_DATAREFERENCE_d4b75808692b4fe6aea7563ae3929c60

### Create dataset for training

In [11]:
train_dataset = Dataset.Tabular.from_delimited_files(path=datastore.path("dataset/train.csv"))

In [12]:
train_dataset.to_pandas_dataframe().tail()

Unnamed: 0,store,brand,week,logmove,constant,price1,price2,price3,price4,price5,...,price7,price8,price9,price10,price11,deal,feat,profit,move,week_start
2976,8,11,131,10.40499,1,0.027969,0.043646,0.043594,0.032344,0.031094,...,0.039844,0.031094,0.024844,0.024688,0.023359,0,0.0,5.52,33024,1992-03-12
2977,8,11,132,10.38542,1,0.027969,0.043646,0.043594,0.042031,0.031094,...,0.031094,0.031094,0.024844,0.024688,0.023359,1,1.0,5.48,32384,1992-03-19
2978,8,11,133,9.373819,1,0.045156,0.043646,0.043594,0.031094,0.037344,...,0.031094,0.031094,0.020156,0.024688,0.023359,0,0.0,5.38,11776,1992-03-26
2979,8,11,134,9.340667,1,0.039062,0.043646,0.043594,0.031094,0.031094,...,0.039844,0.031094,0.020156,0.024688,0.023359,0,0.0,7.16,11392,1992-04-02
2980,8,11,135,10.514991,1,0.039062,0.043646,0.043594,0.042031,0.031094,...,0.039844,0.029531,0.026406,0.024688,0.023359,1,1.0,8.29,36864,1992-04-09


## Modeling

For forecasting tasks, AutoML uses pre-processing and estimation steps that are specific to time-series. AutoML will undertake the following pre-processing steps:
* Detect time-series sample frequency (e.g. hourly, daily, weekly) and create new records for absent time points to make the series regular. A regular time series has a well-defined frequency and has a value at every sample point in a contiguous time span.
* Impute missing values in the target (via forward-fill) and feature columns (using median column values)
* Create grain-based features to enable fixed effects across different series
* Create time-based features to assist in learning seasonal patterns
* Encode categorical variables to numeric quantities
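
The first, second, and fourth steps in the list above can be imitated by hand with pandas on a toy weekly series. This is only a sketch of the ideas, not AutoML's internal implementation; the column names and values are made up, and a single series is used so that no grain handling is needed.

```python
import pandas as pd

# Toy weekly series in which the week starting 1992-03-26 is absent.
raw = pd.DataFrame(
    {
        "week_start": pd.to_datetime(["1992-03-12", "1992-03-19", "1992-04-02", "1992-04-09"]),
        "move": [100.0, 120.0, 90.0, 110.0],
        "price": [2.0, 1.8, 2.2, 2.4],
    }
)

# Make the series regular: one row every 7 days, NaN where a week was missing.
regular = raw.set_index("week_start").asfreq("7D")

# Impute the target by forward-fill and the feature by its median.
regular["move"] = regular["move"].ffill()
regular["price"] = regular["price"].fillna(regular["price"].median())

# Add a simple time-based feature that could help capture seasonality.
regular["month"] = regular.index.month

print(regular)
```

The inserted row for 1992-03-26 inherits the previous week's `move` and the median `price` of the observed weeks.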

In this notebook, AutoML will train a single, regression-type model across all time-series in a given training set. This allows the model to generalize across related series. To create a training job, we use an `AutoMLConfig` object to define the settings and data. The following table summarizes the `AutoMLConfig` parameters:

|Property|Description|
|-|-|
|**task**|forecasting|
|**primary_metric**|This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics: <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**experiment_timeout_hours**|Experimentation timeout in hours.|
|**enable_early_stopping**|If early stopping is on, training will stop when the primary metric is no longer improving.|
|**training_data**|Input dataset, containing both features and label column.|
|**label_column_name**|The name of the label column.|
|**compute_target**|The remote compute for training.|
|**n_cross_validations**|Number of cross-validation folds to use for model/pipeline selection|
|**enable_voting_ensemble**|Allow AutoML to create a Voting ensemble of the best performing models|
|**enable_stack_ensemble**|Allow AutoML to create a Stack ensemble of the best performing models|
|**debug_log**|Log file path for writing debugging information|
|**time_column_name**|Name of the datetime column in the input data|
|**grain_column_names**|Name(s) of the columns defining individual series in the input data|
|**drop_column_names**|Name(s) of columns to drop prior to modeling|
|**max_horizon**|Maximum desired forecast horizon in units of time-series frequency|
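
For intuition about the `normalized_mean_absolute_error` primary metric, the sketch below computes a mean absolute error divided by the range of the actual values, which is how Azure's documentation describes the metric. How AutoML normalizes across multiple series in a forecasting task is not reproduced here, so treat this as an approximation. The function name and the sample numbers are illustrative.

```python
def normalized_mae(y_true, y_pred):
    """Mean absolute error divided by the range (max - min) of the actual values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("Cannot normalize: all actual values are identical")
    return mae / value_range


actual = [100, 200, 300]
predicted = [110, 190, 330]
print(normalized_mae(actual, predicted))
```

Because the error is scaled by the range of the data, values are comparable across series with very different sales volumes, and lower is better.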

### Model training

In [13]:
time_series_settings = {
    "time_column_name": time_column_name,
    "grain_column_names": grain_column_names,
    "drop_column_names": ["logmove"],  # 'logmove' is a leaky feature, so we remove it.
    "max_horizon": NUM_TEST_PERIODS,
}

automl_config = AutoMLConfig(
    task="forecasting",
    debug_log="automl_oj_sales_errors.log",
    primary_metric="normalized_mean_absolute_error",
    experiment_timeout_hours=1.0,  # You may increase this number to improve model accuracy
    training_data=train_dataset,
    label_column_name=target_column_name,
    compute_target=cpu_cluster,
    enable_early_stopping=True,
    n_cross_validations=3,
    verbosity=logging.INFO,
    **time_series_settings
)
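
With time-series data, cross-validation folds have to respect time order, so that a model is never validated on periods that precede its training data. One common construction is rolling-origin evaluation, sketched below for three folds and a three-week horizon to mirror `n_cross_validations=3` and `max_horizon=3`. Whether AutoML builds its folds in exactly this way is an assumption here; the function name and the week range are illustrative.

```python
def rolling_origin_folds(time_points, n_folds, horizon):
    """Build (train, validation) splits whose validation windows roll forward in time.

    Each validation window covers `horizon` consecutive periods, and every
    training set contains only periods that precede its validation window.
    """
    points = sorted(time_points)
    folds = []
    for k in range(n_folds, 0, -1):
        cutoff = len(points) - k * horizon
        if cutoff <= 0:
            raise ValueError("Not enough history for the requested folds")
        folds.append((points[:cutoff], points[cutoff:cutoff + horizon]))
    return folds


weeks = list(range(124, 136))  # weeks 124 to 135
folds = rolling_origin_folds(weeks, n_folds=3, horizon=3)
for train_weeks, valid_weeks in folds:
    print("train up to week", train_weeks[-1], "-> validate on", valid_weeks)
```

The last fold trains on weeks 124 to 132 and validates on weeks 133 to 135, the most recent window available in the training data.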

In [14]:
remote_run = experiment.submit(automl_config, show_output=False)
remote_run

Experiment,Id,Type,Status,Details Page,Docs Page
automl-ojforecasting-test,AutoML_12ee1cb2-3442-4092-a09e-b5a06647b845,automl,Starting,Link to Azure Machine Learning studio,Link to Documentation


In [15]:
remote_run.wait_for_completion()

{'runId': 'AutoML_12ee1cb2-3442-4092-a09e-b5a06647b845',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2020-05-13T17:42:54.119105Z',
 'endTimeUtc': '2020-05-13T18:46:01.363453Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'normalized_mean_absolute_error',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '3',
  'target': 'cpu-cluster',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"8227060c-e76b-4722-a91a-b95c571881ec\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"dataset/train.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"novanta-mdw-rg\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"7c9d382c-5964-48db-9cf6-c595c7ba4339\\\\\\",

### Retrieve the best model

Each run within an Experiment stores serialized (i.e. pickled) pipelines from the AutoML iterations. After the training job is done, we can retrieve the pipeline with the best performance on the validation dataset.

In [None]:
best_run, fitted_model = remote_run.get_output()
print(fitted_model.steps)
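
Conceptually, picking the best pipeline amounts to comparing validation scores on the primary metric, minimizing error-type metrics and maximizing score-type ones. The sketch below illustrates that rule with made-up pipeline names and scores; it does not call the Azure ML SDK and is not how `get_output()` is implemented.

```python
# Primary metrics for which a smaller value is better; for the others
# (for example r2_score) a larger value is better.
ERROR_METRICS = {
    "normalized_root_mean_squared_error",
    "normalized_mean_absolute_error",
}


def select_best(scores, primary_metric):
    """Return the name of the best-scoring pipeline for the given primary metric."""
    if not scores:
        raise ValueError("No scores to compare")
    if primary_metric in ERROR_METRICS:
        return min(scores, key=scores.get)
    return max(scores, key=scores.get)


# Hypothetical validation scores, for illustration only.
validation_scores = {"pipeline_a": 0.071, "pipeline_b": 0.054, "pipeline_c": 0.063}
best = select_best(validation_scores, "normalized_mean_absolute_error")
print(best)  # pipeline_b
```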

## Additional Reading

\[1\] Nicolo Fusi, Rishit Sheth, and Melih Elibol. 2018. Probabilistic Matrix Factorization for Automated Machine Learning. In Advances in Neural Information Processing Systems. 3348-3357.<br>
\[2\] Azure AutoML Package Docs: https://docs.microsoft.com/en-us/python/api/azureml-train-automl/azureml.train.automl?view=azure-ml-py <br>
\[3\] Azure Automated Machine Learning Examples: https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning <br>


