Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Demand forecasting and the [many models solution accelerator](https://github.com/Azure/azureml-examples/tree/main/python-sdk/tutorials/automl-with-azureml/forecasting-many-models)
---
This notebook takes the time series data sent from the coolers to the PickList and PickListItem tables and prepares it for use in Azure Machine Learning. Once finished with 
the tasks in this notebook, you will pick up after the data preparation phase in the many models solution accelerator to train forecasting models that predict the 
number and type of items taken out of inventory at each individual cooler. These forecasts enable intelligent restocking and better decisions about the balance of inventory to stock.

### Prerequisites 
At this point, you should have already:

1. Deployed the solution
1. Run the [load_data notebook](./load_data.ipynb)

## 1.0 Initialization

Here we initialize the libraries and variables and configure the storage we will be using (the default Synapse workspace Data Lake).

### Library Initialization:

In [20]:
from pyspark.sql.functions import *
from pyspark.sql.window import Window
from pyspark.sql.types import *
from notebookutils import mssparkutils
import json
sc = spark.sparkContext


### Initialize variables and configure workspace storage

In [51]:
synapse_account_name = 'connected-cooler-sa-synapsews'
data_lake_account_name = 'connectedcoolersasynadls' # Synapse Workspace ADLS
file_system_name = 'connectedcoolersasynfs'
ml_data_container = 'mldata'

ml_directory = 'cooler-data'
ds_train_path = 'ds-train'
ds_inference_path = 'ds-inference'
timestamp_split = '2021-09-15 00:00:00'

t_file_path = f'abfss://{ml_data_container}@{data_lake_account_name}.dfs.core.windows.net/{ds_train_path}'
I_file_path = f'abfss://{ml_data_container}@{data_lake_account_name}.dfs.core.windows.net/{ds_inference_path}'

spark.conf.set("spark.storage.synapse.linkedServiceName", f"{synapse_account_name}-WorkspaceDefaultStorage")
spark.conf.set("fs.azure.account.oauth.provider.type", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider")

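The two output paths above follow the `abfss://<container>@<account>.dfs.core.windows.net/<path>` pattern. As a minimal sketch, the composition can be factored into a small helper; the function name `abfss_uri` is hypothetical and is not part of the notebook or of any Azure library.

```python
def abfss_uri(container: str, account: str, path: str) -> str:
    """Compose an ABFS (secure) URI for a path inside an ADLS Gen2 container.

    Hypothetical helper for illustration only; it mirrors the f-strings
    used for t_file_path and I_file_path in the cell above.
    """
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"


# Values copied from the variable cell above.
ml_data_container = "mldata"
data_lake_account_name = "connectedcoolersasynadls"

train_uri = abfss_uri(ml_data_container, data_lake_account_name, "ds-train")
inference_uri = abfss_uri(ml_data_container, data_lake_account_name, "ds-inference")
print(train_uri)
print(inference_uri)
```

Note that `ml_directory` is defined in the cell above but does not appear in either path, so the training and inference files are written directly under the `ds-train` and `ds-inference` folders of the `mldata` container.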

## 2.0 Data Preparation
In the Data Preparation notebook, we registered the orange juice inference data to the Workspace. You can choose to run the pipeline on the subset of 10 series or the full dataset of 11,973 series. We recommend starting with 10 series, then expanding.

### Get data sets
In this example, the datasets are stored in the blob storage set up with Synapse. This allows us to use the existing linked service (configured above) for access.

In [None]:
# NOTE: database_name is not defined in the cells above; define it before running this cell.
picklist_df = spark.read.parquet(f'abfss://{file_system_name}@{data_lake_account_name}.dfs.core.windows.net/{database_name}/picklist')
picklistitem_df = spark.read.parquet(f'abfss://{file_system_name}@{data_lake_account_name}.dfs.core.windows.net/{database_name}/picklistitem')
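
The partitioning step later in this notebook expects a `joined_df` DataFrame, but the join itself is not shown. The following is a minimal sketch of how it might be built from the two DataFrames read above. It assumes that both tables share a key column; the column name `PickListId` and the helper `join_picklists` are hypothetical and are not confirmed by the source schema.

```python
def join_picklists(picklist_df, picklistitem_df, key="PickListId"):
    """Inner-join the PickList and PickListItem DataFrames on a shared key.

    `key` is a hypothetical column name; replace it with the column that
    actually links the two tables in your schema.
    """
    # DataFrame.join(other, on=..., how=...) is the standard PySpark join call.
    return picklist_df.join(picklistitem_df, on=key, how="inner")


# In the notebook this would be called with the DataFrames read above:
# joined_df = join_picklists(picklist_df, picklistitem_df)
```

An inner join keeps only pick lists that have at least one item row, which seems to be what the per-item time series needs; a different join type may suit other schemas better.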

### Join data and partition by cooler and sku, then split into training and inferencing data
Join the PickList and PickListItem tables together. The result is the time series data showing the number of items removed from each cooler per hour. 
This is the data that will be used to train our demand forecasting model. The following code organizes the data into CSV files by cooler and item. 
The power of the many models solution is that it lets us easily develop a separate model to predict sales of a particular item within a particular cooler.
In a production implementation of the solution, the cooler data could be augmented with other data, such as local holidays or weather data, to improve prediction accuracy.

Next, we split the training data from the inferencing test data. You can control where the data is split by setting the timestamp_split variable.

In [52]:
# Partition the data by cooler and SKU, then split each partition into training and inference sets.
# joined_df is the PickList/PickListItem join described above; it must exist before this cell runs.
for cooler in joined_df.select('CoolerId').distinct().collect():
    for sku in joined_df.select('ItemSku').distinct().collect():
        tmp_df = joined_df.where((joined_df.CoolerId == cooler['CoolerId']) & (joined_df.ItemSku == f"{sku['ItemSku']}"))
        # Rows at or before timestamp_split become training data.
        tmp_df.where(tmp_df.PickListFulfilledTimestamp <= timestamp_split) \
            .repartition(1) \
            .write \
            .option("header",True) \
            .mode('append') \
            .csv(f"{t_file_path}")
        # Rows after timestamp_split become inference data.
        tmp_df.where(tmp_df.PickListFulfilledTimestamp > timestamp_split) \
            .repartition(1) \
            .write \
            .option("header",True) \
            .mode('append') \
            .csv(f"{I_file_path}")


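The partition-then-split logic of the cell above can be illustrated without a Spark session. The following is a minimal sketch in plain Python, not the notebook's implementation: it groups rows by `(CoolerId, ItemSku)` and splits each group at `timestamp_split`, sending rows at or before the split to training and later rows to inference. The `partition_and_split` helper and the sample rows are hypothetical; the column names are taken from the cell above.

```python
from collections import defaultdict
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def partition_and_split(rows, timestamp_split):
    """Group rows by (CoolerId, ItemSku), then split each group by time.

    Returns two dicts keyed by (CoolerId, ItemSku): training rows
    (timestamp <= split) and inference rows (timestamp > split).
    """
    split_at = datetime.strptime(timestamp_split, TIMESTAMP_FORMAT)
    train = defaultdict(list)
    inference = defaultdict(list)
    for row in rows:
        key = (row["CoolerId"], row["ItemSku"])
        fulfilled = datetime.strptime(
            row["PickListFulfilledTimestamp"], TIMESTAMP_FORMAT
        )
        # Same boundary rule as the Spark cell: <= goes to training.
        target = train if fulfilled <= split_at else inference
        target[key].append(row)
    return dict(train), dict(inference)


# Hypothetical sample rows, for illustration only.
sample_rows = [
    {"CoolerId": 1, "ItemSku": "A", "PickListFulfilledTimestamp": "2021-09-14 10:00:00"},
    {"CoolerId": 1, "ItemSku": "A", "PickListFulfilledTimestamp": "2021-09-15 00:00:00"},
    {"CoolerId": 1, "ItemSku": "A", "PickListFulfilledTimestamp": "2021-09-16 08:00:00"},
    {"CoolerId": 2, "ItemSku": "B", "PickListFulfilledTimestamp": "2021-09-20 12:00:00"},
]

train_sets, inference_sets = partition_and_split(sample_rows, "2021-09-15 00:00:00")
for key, group in sorted(train_sets.items()):
    print("train", key, len(group))
for key, group in sorted(inference_sets.items()):
    print("inference", key, len(group))
```

A cooler and SKU pair with no rows on one side of the split, like `(2, "B")` here, simply has no entry in that set. In the Spark cell, the nested loops also visit cooler and SKU combinations that never occur together, so some writes may contain no data rows.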

#### At this point, the data is configured and ready for training with the [Many Models Solution Accelerator](https://github.com/microsoft/solution-accelerator-many-models).