Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# 01 Data Preparation
---

This solution accelerator uses simulated orange juice sales data to walk you through the process of training many models on Azure Machine Learning. 

The time series data used in this example was simulated based on the University of Chicago's Dominick's Finer Foods dataset, which featured two years of sales of 3 different orange juice brands for individual stores. The full simulated dataset includes 3,991 stores with 3 orange juice brands each, so 11,973 models can be trained to showcase the power of the many models pattern.
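As a quick sanity check on the figures above, the number of series (and therefore models) is the number of stores times the number of brands. This is only an arithmetic sketch; the brand names are taken from the file names that appear in the download output later in this notebook.

```python
# Brand names as they appear in the downloaded file names
# (e.g. Store1000_minute.maid.csv).
brands = ["dominicks", "minute.maid", "tropicana"]
n_stores = 3991  # number of simulated stores stated above

# One time series (and one model) per store/brand combination.
n_series = n_stores * len(brands)
print(f"{n_stores:,} stores x {len(brands)} brands = {n_series:,} series")
```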

  
In this notebook, two datasets will be created: one with all 11,973 files and one with only 10 files that can be used to quickly test and debug. For each dataset, you'll walk through the process of:

1. Downloading the data from Azure Open Datasets
2. Uploading the data to Datastore
3. Registering a File Dataset to the Workspace


### Prerequisites 
At this point, you should have already: 
1. Created your AML Workspace
2. Run 00_Environment_Setup.ipynb to configure the environment

## 1.0 Connect to your workspace and datastore
In the Environment Setup notebook you created a [Workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py). We are going to use that workspace to register the data. You also set up the Datastore, which in this example is a container in Blob storage where we will store the data. The Datastore object contains the connection to the storage location.

In [1]:
from azureml.core.workspace import Workspace
ws = Workspace.from_config()
datastore = ws.get_default_datastore()

print('Workspace Name: ' + ws.name, 
      'Azure Region: ' + ws.location, 
      'Subscription Id: ' + ws.subscription_id, 
      'Resource Group: ' + ws.resource_group, sep = '\n')

Workspace Name: ManyModelsAccelerator
Azure Region: westus2
Subscription Id: f97fb87f-32d7-4d7c-9bc5-ea43b4fea7ac
Resource Group: ManyModelsRG


## 2.0 Download the data from Azure Open Datasets
To download the data, import OjSalesSimulated from Azure Open Datasets. Two datasets are defined: one with all 11,973 files and one with 10 files. The smaller dataset is designed to let you quickly test and debug the notebooks. In the cells below, the steps for the full dataset are commented out so that only the small dataset is downloaded, uploaded, and registered; uncomment them to process the full dataset as well.

To use your own data, create a local folder with each time series as a separate file. Then use that folder as the source directory in section 3.0 to upload your data to the Datastore.
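If your own data lives in a single combined file, one way to get one file per series is sketched below. This is a minimal example using only the standard library, not part of the accelerator itself. The helper name `split_into_series_files` and the key columns `Store` and `Brand` are assumptions for illustration; replace them with the columns that identify a series in your data. The sketch holds all rows in memory, so it suits data that fits on one machine.

```python
import csv
import os
from collections import defaultdict


def split_into_series_files(combined_csv, out_dir, key_columns=("Store", "Brand")):
    """Write one CSV per unique combination of key_columns; return the paths.

    key_columns are hypothetical column names; adjust them to your data.
    """
    os.makedirs(out_dir, exist_ok=True)

    # Group the rows of the combined file by their series key.
    with open(combined_csv, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        groups = defaultdict(list)
        for row in reader:
            key = tuple(row[col] for col in key_columns)
            groups[key].append(row)

    # Write each group to its own file, named after the key values.
    paths = []
    for key, rows in sorted(groups.items()):
        file_name = "_".join(key) + ".csv"
        path = os.path.join(out_dir, file_name)
        with open(path, "w", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        paths.append(path)
    return paths
```

For example, `split_into_series_files("all_sales.csv", "my_series_data")` (both names are placeholders) would fill `my_series_data` with one file per store and brand, and that folder can then be passed as `src_dir` in section 3.0.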

In [None]:
#!pip install azureml-opendatasets

In [2]:
from azureml.opendatasets import OjSalesSimulated

# Pull all of the data
oj_sales_files = OjSalesSimulated.get_file_dataset()

# Pull the first 10 files
oj_sales_files_small = OjSalesSimulated.get_file_dataset().take(10)

If you receive an error importing OjSalesSimulated, run the following command and then *restart the kernel*:
```
!pip install azureml-opendatasets==0.1.0.8487238 --extra-index-url https://azuremlsdktestpypi.azureedge.net/CLI-SDK-Runners-Validation/8487238/
```

Next, create the folders that the data will be downloaded to:

In [3]:
import os

#oj_sales_path = "oj_sales_data"
#if not os.path.exists(oj_sales_path):
#    os.mkdir(oj_sales_path)
    
oj_sales_path_small = "oj_sales_data_small"
if not os.path.exists(oj_sales_path_small):
    os.mkdir(oj_sales_path_small)

Finally, download the files to the folder you created. This may take a few minutes:

In [4]:
#oj_sales_files.download(oj_sales_path, overwrite=True)
oj_sales_files_small.download(oj_sales_path_small, overwrite=True)

['E:\\solution-accelerator-many-models\\01_Data_Preparation\\oj_sales_data_small\\Store1000_dominicks.csv',
 'E:\\solution-accelerator-many-models\\01_Data_Preparation\\oj_sales_data_small\\Store1000_minute.maid.csv',
 'E:\\solution-accelerator-many-models\\01_Data_Preparation\\oj_sales_data_small\\Store1000_tropicana.csv',
 'E:\\solution-accelerator-many-models\\01_Data_Preparation\\oj_sales_data_small\\Store1001_dominicks.csv',
 'E:\\solution-accelerator-many-models\\01_Data_Preparation\\oj_sales_data_small\\Store1001_minute.maid.csv',
 'E:\\solution-accelerator-many-models\\01_Data_Preparation\\oj_sales_data_small\\Store1001_tropicana.csv',
 'E:\\solution-accelerator-many-models\\01_Data_Preparation\\oj_sales_data_small\\Store1002_dominicks.csv',
 'E:\\solution-accelerator-many-models\\01_Data_Preparation\\oj_sales_data_small\\Store1002_minute.maid.csv',
 'E:\\solution-accelerator-many-models\\01_Data_Preparation\\oj_sales_data_small\\Store1002_tropicana.csv',
 'E:\\solution-acceler
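The file names in the output above follow a `Store<id>_<brand>.csv` pattern, so the store and brand of each series can be recovered from the path alone. The sketch below shows one way to do that with a regular expression. The helper name `parse_series_file` is hypothetical, and the pattern is inferred only from the names shown above, so treat it as an assumption rather than a documented naming contract.

```python
import re

# Matches names such as Store1000_minute.maid.csv at the end of a path.
# The brand part may contain dots but no path separators.
_SERIES_FILE = re.compile(r"Store(\d+)_([^\\/]+)\.csv$")


def parse_series_file(path):
    """Return (store_id, brand) parsed from a downloaded file path."""
    match = _SERIES_FILE.search(path)
    if match is None:
        raise ValueError(f"Unexpected file name: {path}")
    return int(match.group(1)), match.group(2)


# Works with both Windows and POSIX style separators.
example_paths = [
    r"oj_sales_data_small\Store1000_dominicks.csv",
    "oj_sales_data_small/Store1000_minute.maid.csv",
]
for p in example_paths:
    print(parse_series_file(p))
```

Applied to the list returned by `download`, this gives a quick check that the small dataset contains the store and brand combinations you expect.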

## 3.0 Upload the files to your datastore
To create the [FileDataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) needed for the ParallelRunStep, you first need to upload the CSV files to your blob datastore.

In [5]:
'''
target_path = 'oj_sales_data' 
datastore.upload(src_dir = oj_sales_path,
                target_path = target_path,
                overwrite = True, 
                show_progress = False)
'''
target_path_small = 'oj_sales_data_small'
datastore.upload(src_dir = oj_sales_path_small,
                target_path = target_path_small,
                overwrite = True, 
                show_progress = False)

$AZUREML_DATAREFERENCE_0bddada51b6d48d9b96ede3ab942dbca

## 4.0 Create the FileDatasets 

Now that the files exist in the datastore, [FileDatasets](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) can be created. Datasets in Azure Machine Learning are references to specific data in a datastore. We are using FileDatasets since we are storing each series as a separate file. This will help to separate the data needed for model training.

In [6]:
from azureml.core.dataset import Dataset

'''
ds_name = 'oj_data'
path_on_datastore = datastore.path(target_path + '/')
input_ds = Dataset.File.from_files(path=path_on_datastore, validate=False)
'''

ds_name_small = 'oj_data_small'
path_on_datastore_small = datastore.path(target_path_small + '/')
input_ds_small = Dataset.File.from_files(path=path_on_datastore_small, validate=False)

## 5.0 Register the FileDatasets to the workspace 
Finally, register the dataset to your Workspace so it can be used as an input to the training pipeline in the next notebook. This same dataset will also be used as part of the scoring and forecasting pipelines.

In [7]:
#registered_ds = input_ds.register(ws, ds_name, create_new_version=True)

registered_ds_small = input_ds_small.register(ws, ds_name_small, create_new_version=True)

---
## Next Steps
Now that you have created your dataset, you are ready to move to the [02_Training_Pipeline.ipynb](https://github.com/microsoft/solution-accelerator-many-models/blob/master/02_Training/02_Training_Pipeline.ipynb) notebook to train the models. 