Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# 01b Environment Setup
This notebook uses simulated orange juice sales data to walk you through the process of training many models on Azure Machine Learning using Automated ML. 

The time series data used in this example was simulated from the University of Chicago's Dominick's Finer Foods dataset, which featured two years of sales of 3 different orange juice brands for individual stores. The full simulated dataset includes 3,991 stores with 3 orange juice brands each, which allows 11,973 models to be trained to showcase the power of the many models pattern.

  
In this notebook, two datasets will be created: one with all 11,973 files and one with only 10 files that can be used to quickly test and debug. For each dataset, you'll be walked through the process of:

1. Downloading the data from Azure Open Datasets
2. Uploading the data to Azure Blob Storage
3. Registering a File Dataset to the Workspace

## Prerequisites
You will need to set up your workspace using the configuration notebook.


### 1.0 Connect to your Workspace and Datastore
In the configuration notebook you created a [Workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py). We are going to use that environment to register the data. You also set up the Datastore, which in this example is a container in Blob storage where we will store the data.

In [None]:
from azureml.core.workspace import Workspace
ws = Workspace.from_config() 
datastore = ws.get_default_datastore()

# Take a look at Workspace
ws.get_details()

### 2.0 Download the data from Azure Open Datasets
To download the data, import OjSalesSimulated from Azure Open Datasets. Two datasets are created here: one with all 11,973 files and one with 10 files. You can change the size of the smaller dataset to suit your needs.



If you have to install OjSalesSimulated, run the following command and restart the kernel.

In [None]:
#!pip install azureml-opendatasets

<b>To use your own data, create a local folder with each group as a separate file. The data has to be pre-split into different files by group; the OJ data was already split by Store and Brand. For time series data, the groups must not split up individual time series. That is, each group must contain one or more whole time series.</b>
Then use that folder in section 3.0 to upload your data to the Datastore.
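
If your own data starts out as a single combined CSV file, the pre-splitting described above could look roughly like the sketch below. It is only an illustration: `split_csv_by_group`, the file names, and the `Quantity` column are hypothetical and not part of this notebook or of Azure Machine Learning. It assumes the whole file fits in memory and that the group values are safe to use in file names.

```python
import csv
import os
import tempfile
from collections import defaultdict


def split_csv_by_group(source_csv, output_dir, group_columns):
    """Write one CSV per unique combination of group_columns (hypothetical helper)."""
    os.makedirs(output_dir, exist_ok=True)

    # Read every row and bucket it by its group key, e.g. (Store, Brand).
    with open(source_csv, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows_by_group = defaultdict(list)
        for row in reader:
            key = tuple(row[col] for col in group_columns)
            rows_by_group[key].append(row)

    # Each group keeps all of its rows, so no time series is split across files.
    written = []
    for key, rows in rows_by_group.items():
        file_path = os.path.join(output_dir, "_".join(key) + ".csv")
        with open(file_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        written.append(file_path)
    return sorted(written)


# Tiny made-up example: two store/brand groups in one combined file.
work_dir = tempfile.mkdtemp()
source = os.path.join(work_dir, "all_sales.csv")
with open(source, "w", newline="") as f:
    f.write("WeekStarting,Store,Brand,Quantity\n")
    f.write("1990-06-14,1000,dominicks,120\n")
    f.write("1990-06-21,1000,dominicks,135\n")
    f.write("1990-06-14,1001,tropicana,80\n")

group_files = split_csv_by_group(
    source, os.path.join(work_dir, "my_data"), ["Store", "Brand"]
)
print([os.path.basename(p) for p in group_files])
```

The resulting folder (here `my_data`) plays the role that `oj_sales_data` plays in the rest of this notebook.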

In [None]:
from azureml.opendatasets import OjSalesSimulated

# Pull all of the data
oj_sales_files = OjSalesSimulated.get_file_dataset()

# Pull the first 10 files
oj_sales_files_small = OjSalesSimulated.get_file_dataset().take(10)

Next, create the folders that the data will be downloaded to. 

In [None]:
import os

# Create the download folders; exist_ok=True makes the cell safe to re-run
oj_sales_path = "oj_sales_data"
os.makedirs(oj_sales_path, exist_ok=True)

oj_sales_path_small = "oj_sales_data_small"
os.makedirs(oj_sales_path_small, exist_ok=True)

Finally, download the files to the folder you created. 

In [None]:
oj_sales_files.download(oj_sales_path, overwrite=True)
oj_sales_files_small.download(oj_sales_path_small, overwrite=True)
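
As an optional sanity check, you may want to confirm how many files actually landed in each folder. The sketch below is a minimal illustration using only the standard library; `count_csv_files` is a hypothetical helper, and it assumes the downloaded files carry a `.csv` extension and may sit in nested subfolders. The demonstration runs against a temporary folder with placeholder files rather than the real download.

```python
import tempfile
from pathlib import Path


def count_csv_files(folder):
    """Count .csv files anywhere under folder (hypothetical helper)."""
    return sum(1 for _ in Path(folder).rglob("*.csv"))


# Self-contained demonstration: a temporary folder with 10 placeholder CSVs
# in a nested subfolder, plus one file that should not be counted.
demo_dir = Path(tempfile.mkdtemp())
nested = demo_dir / "nested" / "subfolder"
nested.mkdir(parents=True)
for i in range(10):
    (nested / f"file_{i}.csv").write_text("WeekStarting,Quantity\n")
(nested / "readme.txt").write_text("not a data file\n")

demo_count = count_csv_files(demo_dir)
print(demo_count)
```

In this notebook, calling `count_csv_files(oj_sales_path_small)` after the download should give 10 if those assumptions hold.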

### 3.0 Split the data and upload the files to your Datastore
We will split the data so that part of it can be used for inferencing. For the current dataset, we split on the time column ('WeekStarting') into the data before and after '1992-5-28'.

You are now ready to load the historical orange juice sales data. To create the [FileDataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) needed for the ParallelRunStep, you first need to upload the CSV files to your blob datastore.

Please *note* that this step takes time because it processes all 11,973 files to produce the train and test data. The following scripts will create 'upload_train_data' and 'upload_test_data' folders under the 'oj_sales_data' and 'oj_sales_data_small' folders. The data will then be uploaded to the datastore.

In [None]:
from scripts.helper import split_data_upload_to_datastore
time_column_name = 'WeekStarting'
target_date = '1992-5-28'


target_path = 'oj_sales_data' 
target_inference_path = 'oj_sales_inference'
split_data_upload_to_datastore(data_path = oj_sales_path,
                               column_name = time_column_name,
                               date = target_date,
                               datastore = datastore,
                               train_ds_target_path = target_path,
                               test_ds_target_path = target_inference_path)


target_path_small = 'oj_sales_data_small'
target_inference_path_small = 'oj_sales_inference_small'
split_data_upload_to_datastore(data_path = oj_sales_path_small,
                               column_name = time_column_name,
                               date = target_date,
                               datastore = datastore,
                               train_ds_target_path = target_path_small,
                               test_ds_target_path = target_inference_path_small)
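
The implementation of `split_data_upload_to_datastore` lives in `scripts/helper.py` and is not shown here. As a rough sketch of the kind of time-based split described in this section, the example below separates the rows of one file into a training part and an inference part. `split_rows_by_date` and the `Quantity` column are hypothetical, and placing rows that fall exactly on the cutoff date in the training part is an assumption; the actual helper may treat the boundary differently.

```python
import csv
import io
from datetime import datetime


def split_rows_by_date(rows, column_name, date):
    """Return (before, after) row lists split on a date column (hypothetical helper).

    Assumption: rows dated on or before the cutoff go to the first list.
    """
    cutoff = datetime.strptime(date, "%Y-%m-%d")
    before, after = [], []
    for row in rows:
        row_date = datetime.strptime(row[column_name], "%Y-%m-%d")
        if row_date <= cutoff:
            before.append(row)
        else:
            after.append(row)
    return before, after


# Made-up sample standing in for a single store/brand file.
sample = io.StringIO(
    "WeekStarting,Quantity\n"
    "1992-05-14,100\n"
    "1992-05-21,110\n"
    "1992-05-28,120\n"
    "1992-06-04,130\n"
)
rows = list(csv.DictReader(sample))

train_rows, test_rows = split_rows_by_date(rows, "WeekStarting", "1992-5-28")
print(len(train_rows), len(test_rows))
```

Applying the same split to every file independently keeps each group's time series in one piece on each side of the cutoff.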

### 4.0 Create the FileDatasets 

Now that the files exist in the datastore, [FileDatasets](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) can be created. Datasets in Azure Machine Learning are references to specific data in a Datastore. We are using FileDatasets because each group is stored as a separate file, which keeps the data needed to train each model separate.

In [None]:
from azureml.core.dataset import Dataset

ds_name = 'oj_data'
input_ds = Dataset.File.from_files(path=datastore.path(target_path + '/'), validate=False)

inference_name = 'oj_inference'
inference_ds = Dataset.File.from_files(path=datastore.path(target_inference_path + '/'), validate=False)

ds_name_small = 'oj_data_small'
input_ds_small = Dataset.File.from_files(path=datastore.path(target_path_small + '/'), validate=False)

inference_name_small = 'oj_inference_small'
inference_ds_small = Dataset.File.from_files(path=datastore.path(target_inference_path_small + '/'), validate=False)

### 5.0 Register the FileDatasets to the Workspace
Finally, register the datasets to your Workspace so they can be used as inputs to the training pipeline in the next notebook. We will use the inference datasets as part of the forecasting pipeline.

In [None]:
input_ds.register(ws, ds_name, create_new_version=True)
inference_ds.register(ws, inference_name, create_new_version=True)

input_ds_small.register(ws, ds_name_small, create_new_version=True)
inference_ds_small.register(ws, inference_name_small, create_new_version=True)

### 6.0 Call the Registered dataset *(Optional)*
After the data is registered, it can be retrieved by name using the command below. This is how the datasets will be accessed in future notebooks.

In [None]:
oj_ds = Dataset.get_by_name(ws, name = ds_name_small)
oj_ds

You can also download the data from the dataset in the future. 

In [None]:
#oj_ds.download()

### 7.0 Delete the local files *(Optional)*

In [None]:
import shutil

shutil.rmtree(oj_sales_path)
shutil.rmtree(oj_sales_path_small)

## Next Steps
Now that you have created your datasets, you are ready to move to the training notebook to create models.