Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Data Preparation
---

This repository uses simulated orange juice sales data from [Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/) to walk you through the process of training many models and forecasting on Azure Machine Learning. 

This notebook covers the steps needed to prepare the data for this solution accelerator:

1. Download the sample data
2. Split it into train/test sets
3. Connect to your workspace and upload the data to its Datastore
4. Register the train and test datasets in your workspace

### Prerequisites
If you have already run the [00_Setup_AML_Workspace](00_Setup_AML_Workspace.ipynb) notebook, you are all set.


## 1.0 Download sample data

The time series data used in this example was simulated based on the University of Chicago's Dominick's Finer Foods dataset, which featured two years of sales of 3 different orange juice brands for individual stores. You can learn more about the dataset [here](https://azure.microsoft.com/en-us/services/open-datasets/catalog/sample-oj-sales-simulated/). 

The full dataset includes simulated sales for 3,991 stores with 3 orange juice brands each, thus allowing 11,973 models to be trained to showcase the power of the many models pattern.
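
To make that count concrete: the many models pattern trains one model per time series, which here means one per (store, brand) pair. The sketch below enumerates those pairs. The store IDs and brand names are placeholders chosen for illustration, not the dataset's real values.

```python
from itertools import product

# Placeholder identifiers (assumptions for illustration only)
n_stores = 3991
store_ids = range(n_stores)
brands = ['brand_a', 'brand_b', 'brand_c']

# One time series, and therefore one model, per (store, brand) pair
series_keys = list(product(store_ids, brands))
print(len(series_keys))  # 11973
```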

You'll need the `azureml-opendatasets` package to download the data. You can install it with the following:

%pip install --upgrade azureml-opendatasets

We'll start by downloading only the first 10 files. To work with all 11,973 files (one per model), change `dataset_maxfiles` in the cell below.

In [None]:
dataset_maxfiles = 10 # Set to 11973 or 0 to get all the files

In [None]:
import os
from azureml.opendatasets import OjSalesSimulated

# Reference the full dataset (nothing is downloaded yet)
oj_sales_files = OjSalesSimulated.get_file_dataset()

# Pull only the first `dataset_maxfiles` files
if dataset_maxfiles:
    oj_sales_files = oj_sales_files.take(dataset_maxfiles)

# Create a local folder to download the files into
target_path = 'oj_sales_data'
os.makedirs(target_path, exist_ok=True)

# Download the data
oj_sales_files.download(target_path, overwrite=True)
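
After the download, it can be worth confirming how many files actually landed in the local folder. The helper below is a minimal sketch, not part of this repository: `count_csv_files` is a hypothetical name, and it assumes the downloaded files use a `.csv` extension and may sit in nested subfolders.

```python
from pathlib import Path


def count_csv_files(folder):
    """Return the number of .csv files under `folder`, including subfolders."""
    return sum(1 for _ in Path(folder).rglob('*.csv'))
```

Calling `count_csv_files(target_path)` on a freshly created folder should match `dataset_maxfiles` when that value is non-zero.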

## 2.0 Split into train/test sets

We will now create a train/test split for each dataset. The test files will contain the last 20 weeks of observations from each original data file. In order to ensure the splitting respects temporal ordering, we need to provide the name of the timestamp column.

In [None]:
from scripts.notebooks.helper import split_data_train_test

# Provide name of timestamp column in the data and number of periods to reserve for the test set
timestamp_column = 'WeekStarting'
n_test_periods = 20

# Split each file and store in corresponding directory
train_path, test_path = split_data_train_test(target_path, timestamp_column, n_test_periods)
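
The implementation of `split_data_train_test` lives in this repository's helper module and is not shown here. As a rough illustration of what a split that respects temporal ordering involves, the sketch below sorts rows by timestamp and holds out the most recent periods. `split_last_n_periods` is a hypothetical name, the `%Y-%m-%d` date format is an assumption about the data, and the sample rows are made up.

```python
from datetime import datetime


def split_last_n_periods(rows, timestamp_column, n_test_periods, date_format='%Y-%m-%d'):
    """Split rows so the last `n_test_periods` distinct timestamps form the test set.

    `rows` is a list of dicts, for example as produced by csv.DictReader.
    """
    def parse(row):
        return datetime.strptime(row[timestamp_column], date_format)

    # Sort chronologically so the split never leaks future rows into training
    ordered = sorted(rows, key=parse)
    periods = sorted({parse(row) for row in ordered})
    if not 0 < n_test_periods < len(periods):
        raise ValueError('n_test_periods must be at least 1 and less than the number of periods')

    # Everything from the cutoff timestamp onward goes to the test set
    cutoff = periods[-n_test_periods]
    train = [row for row in ordered if parse(row) < cutoff]
    test = [row for row in ordered if parse(row) >= cutoff]
    return train, test


# Made-up weekly rows, deliberately out of order
sample = [{'WeekStarting': f'2020-01-{day:02d}', 'Quantity': str(day)} for day in (29, 1, 15, 8, 22)]
train, test = split_last_n_periods(sample, 'WeekStarting', 2)
print([row['WeekStarting'] for row in test])  # ['2020-01-22', '2020-01-29']
```

Splitting on distinct timestamps rather than row counts keeps the test window the same length even if a file contains several rows per week.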

## 3.0 Upload data to Datastore in AML Workspace

In the [setup notebook](00_Setup_AML_Workspace.ipynb) you created a [Workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py). We are going to register the data in that Workspace.

In [None]:
from azureml.core.workspace import Workspace

ws = Workspace.from_config()

# Take a look at Workspace
ws.get_details()

We will upload both sets of data files to your Workspace's default [Datastore](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data).
A Datastore is a storage location whose contents are made accessible for training and forecasting. See the [Datastore documentation](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore(class)?view=azure-ml-py) for how to access data in a Datastore.

In [None]:
# Connect to default datastore
datastore = ws.get_default_datastore()

# Upload train data
ds_train_path = target_path + '_train'
datastore.upload(src_dir=train_path, target_path=ds_train_path, overwrite=True)

# Upload test data
ds_test_path = target_path + '_test'
datastore.upload(src_dir=test_path, target_path=ds_test_path, overwrite=True)

## 4.0 Register dataset in AML Workspace

The last step is creating and registering [datasets](https://docs.microsoft.com/en-us/azure/machine-learning/concept-data#datasets) in Azure Machine Learning for the train and test sets.

Using a [FileDataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) is currently the best way to take advantage of the many models pattern, so we create FileDatasets in the next cell. We then [register](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets#register-datasets) the FileDatasets in your Workspace; this associates the train/test sets with simple names that can be easily referred to later on when we train models and produce forecasts.

In [None]:
from azureml.core.dataset import Dataset

# Create file datasets
ds_train = Dataset.File.from_files(path=datastore.path(ds_train_path), validate=False)
ds_test = Dataset.File.from_files(path=datastore.path(ds_test_path), validate=False)

# Register the file datasets
dataset_name = 'oj_data_small'
train_dataset_name = dataset_name + '_train'
test_dataset_name = dataset_name + '_test'
ds_train.register(ws, train_dataset_name, create_new_version=True)
ds_test.register(ws, test_dataset_name, create_new_version=True)

## 5.0 *[Optional]* Interact with the registered dataset

Once registered, a dataset can be retrieved by name with the command below. This is how the datasets will be accessed in later notebooks.

In [None]:
oj_ds = Dataset.get_by_name(ws, name=train_dataset_name)
oj_ds

It is also possible to download the data from the registered dataset:

In [None]:
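# With no target_path, the files go to a temporary directory; the call returns the local file paths
oj_ds.download()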

## Next Steps

Now that you have created your datasets, you are ready to move to the training notebook to train the models.

There are two options:
- Train using a custom script: open notebook [02_Training_Pipeline.ipynb](02_Training_Pipeline.ipynb)
- Train using Automated ML: open notebook [02b_Train_AutoML.ipynb](02b_Train_AutoML.ipynb)