Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# 01b Data Preparation 
This notebook uses simulated orange juice sales data to walk you through the process of training many models on Azure Machine Learning using Automated ML. 

The time series data used in this example was simulated based on the University of Chicago's Dominick's Finer Foods dataset, which featured two years of sales of 3 different orange juice brands for individual stores. The full simulated dataset includes 3,991 stores with 3 orange juice brands each, which allows 11,973 models (3,991 × 3) to be trained to showcase the power of the many models pattern.

  
In this notebook, two datasets will be created: one with all 11,973 files and one with only 10 files that can be used to quickly test and debug. For each dataset, you'll be walked through the process of:

1. Registering the blob container as a Datastore to the Workspace
2. Registering a File Dataset to the Workspace

### Prerequisites
If you have already run the [00_Setup_AML_Workspace](../../00_Setup_AML_Workspace.ipynb) notebook, you are all set.


### 1.0 Connect to your Workspace and Datastore
In the configuration notebook you created a [Workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py). We are going to use that Workspace to register the data. You also set up the Datastore, which in this example is a container in Blob storage where we will store the data.

In [None]:
from azureml.core.workspace import Workspace
ws = Workspace.from_config() 
datastore = ws.get_default_datastore()

# Take a look at Workspace
ws.get_details()

### 2.0 Data Preparation
The OJ data is available in a public blob container. The data is split into a training set and an inference set. For the current dataset, the split was made on the time column ('WeekStarting'), before and after '1992-5-28'.
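
As a rough illustration of a time-based split, the sketch below separates rows on the 'WeekStarting' column using only the Python standard library. It is not the code that produced the OJ folders: the helper `split_on_time`, the `Quantity` column, and the sample rows are hypothetical, and the choice to send rows dated exactly on the cutoff to the inference set is an assumption, since the notebook does not say which side the cutoff date belongs to.

```python
from datetime import date

# Assumption: rows dated strictly before the cutoff go to training and the
# rest go to inference. The notebook does not state which side the cutoff
# date itself belongs to.
CUTOFF = date(1992, 5, 28)


def split_on_time(rows, time_column="WeekStarting", cutoff=CUTOFF):
    """Split rows (dicts holding ISO-format date strings) into (train, inference)."""
    train, inference = [], []
    for row in rows:
        week = date.fromisoformat(row[time_column])
        (train if week < cutoff else inference).append(row)
    return train, inference


# Made-up rows for illustration only; they are not taken from the OJ data.
sample_rows = [
    {"WeekStarting": "1992-05-14", "Quantity": 11000},
    {"WeekStarting": "1992-05-21", "Quantity": 12000},
    {"WeekStarting": "1992-05-28", "Quantity": 9000},
    {"WeekStarting": "1992-06-04", "Quantity": 10000},
]

train_rows, inference_rows = split_on_time(sample_rows)
print(len(train_rows), len(inference_rows))
```

Splitting on a date rather than at random keeps every inference row later in time than every training row, which is what a forecasting scenario needs.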

The container has 'oj_data' and 'oj_inference' folders that contain training and inference data, respectively, for the 11,973 models. It also has 'oj_data_small' and 'oj_inference_small' folders that contain training and inference data for 10 models.

To create the [FileDataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) needed for the ParallelRunStep, you first need to register the blob container to the workspace.

<b> To use your own data, create a local folder with each group as a separate file. The data has to be pre-split into different files by group; the OJ data was already split by Store and Brand. For time series, the groups must not split up individual time series. That is, each group must contain one or more whole time series.</b>
Then use that folder in section 3.0 to upload your data to the Datastore.
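
One way to pre-split your own data is sketched below, using only the Python standard library. It writes one CSV file per Store and Brand combination, so each file holds whole time series. The helper `write_group_files`, the file naming scheme, the `Quantity` column, and the sample rows are all hypothetical; adapt them to your own columns and folder.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def write_group_files(rows, group_columns, output_dir):
    """Write one CSV per unique combination of group_columns; return the paths."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(str(row[col]) for col in group_columns)
        groups[key].append(row)

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    paths = []
    for key, group_rows in sorted(groups.items()):
        # Hypothetical naming scheme: group values joined with underscores.
        path = output_dir / ("_".join(key) + ".csv")
        with path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(group_rows[0].keys()))
            writer.writeheader()
            writer.writerows(group_rows)
        paths.append(path)
    return paths


# Made-up rows for illustration only; they are not taken from the OJ data.
sample_rows = [
    {"Store": 1, "Brand": "brand_a", "WeekStarting": "1990-06-14", "Quantity": 100},
    {"Store": 1, "Brand": "brand_a", "WeekStarting": "1990-06-21", "Quantity": 120},
    {"Store": 1, "Brand": "brand_b", "WeekStarting": "1990-06-14", "Quantity": 80},
    {"Store": 2, "Brand": "brand_a", "WeekStarting": "1990-06-14", "Quantity": 95},
]

local_folder = Path(tempfile.mkdtemp()) / "my_data"
group_files = write_group_files(sample_rows, ["Store", "Brand"], local_folder)
print([p.name for p in group_files])
```

Because every row of a given Store and Brand lands in the same file, no individual time series is split across groups.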

### 3.0 Register the blob container as a Datastore

A Datastore is a place where data can be stored that is then made accessible to a compute either by means of mounting or copying the data to the compute target.

Please refer to [Datastore](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.datastore(class)?view=azure-ml-py) documentation on how to access data from Datastore.

In this next step, we will register the blob container as a datastore in the Workspace.

In [None]:
from azureml.core import Datastore

# To use your own blob container, change the values below and also pass in account_key
blob_datastore_name = "automl_many_models"
container_name = "automl-sample-notebook-data"
account_name = "automlsamplenotebookdata"

oj_datastore = Datastore.register_azure_blob_container(workspace=ws, 
                                                       datastore_name=blob_datastore_name, 
                                                       container_name=container_name,
                                                       account_name=account_name,
                                                       create_if_not_exists=True)    

### 4.0 Create the FileDatasets 

Now that the datastore is available from the Workspace, [FileDatasets](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) can be created. Datasets in Azure Machine Learning are references to specific data in a Datastore. We are using FileDatasets because each group is stored as a separate file, which helps to separate the data needed for training each model.

In [None]:
from azureml.core.dataset import Dataset

ds_name = 'oj_data'
input_ds = Dataset.File.from_files(path=oj_datastore.path(ds_name + '/'), validate=False)

inference_name = 'oj_inference'
inference_ds = Dataset.File.from_files(path=oj_datastore.path(inference_name + '/'), validate=False)

ds_name_small = 'oj_data_small'
input_ds_small = Dataset.File.from_files(path=oj_datastore.path(ds_name_small + '/'), validate=False)

inference_name_small = 'oj_inference_small'
inference_ds_small = Dataset.File.from_files(path=oj_datastore.path(inference_name_small + '/'), validate=False)

### 5.0 Register the FileDatasets to the Workspace 
Finally, register the datasets to your Workspace so they can be used as inputs to the training pipeline in the next notebook. We will use the inference datasets as part of the forecasting pipeline.

In [None]:
input_ds.register(ws, ds_name, create_new_version=True)
inference_ds.register(ws, inference_name, create_new_version=True)

input_ds_small.register(ws, ds_name_small, create_new_version=True)
inference_ds_small.register(ws, inference_name_small, create_new_version=True)

### 6.0 Call the Registered dataset *(Optional)*
After the data is registered, it can be retrieved by name using the command below. This is how the datasets will be accessed in future notebooks.

In [None]:
oj_ds = Dataset.get_by_name(ws, name=ds_name_small)
oj_ds

You can also download the data from the dataset in the future. 

In [None]:
#oj_ds.download()

## Next Steps
Now that you have created your dataset, you are ready to move to the Training Notebook to create models. 