Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Data Preparation and Setup
---

This notebook walks you through the steps needed to configure your environment and data for this solution accelerator:

1. Connecting to your workspace
2. Deploying a compute cluster for training and forecasting
3. Creating and registering the dataset used in this accelerator

### Prerequisites
If you have already run the [00_Setup_AML_Workspace](../00_Setup_AML_Workspace.ipynb) notebook, you are all set.

## 1.0 Connect to your Workspace
In the [00_Setup_AML_Workspace](../00_Setup_AML_Workspace.ipynb) notebook you created a [Workspace](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py). Load it from the saved configuration with `Workspace.from_config()`.
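`Workspace.from_config()` reads a `config.json` file that identifies the workspace. If the call fails, it can help to check that the file carries the three expected keys. The following is a minimal sketch of such a check; the helper name `missing_config_keys` and the placeholder values are hypothetical and are not part of the accelerator.

```python
import json

# Keys that a workspace config.json is expected to carry.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(config_text):
    """Return the required keys that are absent or empty in a config.json string."""
    config = json.loads(config_text)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Placeholder values only; substitute the contents of your own config.json.
example = json.dumps({
    "subscription_id": "<subscription-id>",
    "resource_group": "<resource-group>",
    "workspace_name": "<workspace-name>",
})
print(missing_config_keys(example))
```

An empty list means all three keys are present; any key in the returned list needs to be added before `from_config()` can locate the workspace.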

In [1]:
from azureml.core.workspace import Workspace
ws = Workspace.from_config() 


# Take a look at Workspace
ws.get_details()

{'id': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/erwright-automl/providers/Microsoft.MachineLearningServices/workspaces/erwrightDevTest',
 'name': 'erwrightDevTest',
 'location': 'westus2',
 'type': 'Microsoft.MachineLearningServices/workspaces',
 'sku': 'Enterprise',
 'workspaceid': 'cb7c6316-1e4f-4176-88d3-c9d532bc6250',
 'description': '',
 'friendlyName': 'erwrightDevTest',
 'creationTime': '2019-12-20T17:22:41.1330323+00:00',
 'containerRegistry': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourceGroups/erwright-automl/providers/Microsoft.ContainerRegistry/registries/erwrightdevt4b19fb88',
 'keyVault': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/erwright-automl/providers/microsoft.keyvault/vaults/erwrightkeyvault85243c18',
 'applicationInsights': '/subscriptions/381b38e9-9840-4719-a5a0-61d9585e1e91/resourcegroups/erwright-automl/providers/microsoft.insights/components/erwrightinsights82d97763',
 'identityPrincipalId': '7

## 2.0 Create compute

In this step we create a compute cluster that will be used for the training and forecasting pipelines. This is a one-time setup, so you won't need to re-run it in future notebooks.

We create a STANDARD_D13_V2 compute cluster. D-series VMs are used for tasks that require higher compute power and temporary disk performance. This [page](https://docs.microsoft.com/en-us/azure/cloud-services/cloud-services-sizes-specs) gives you more information on VM sizes to help you decide which will best fit your use case.

In [7]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpucluster"

# Reuse the cluster if it already exists; otherwise create it
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found an existing cluster, using it instead.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D13_V2',
                                                           min_nodes=0,
                                                           max_nodes=5)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

Creating
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## 3.0 Create dataset

This solution accelerator uses simulated orange juice sales data from [Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/) to walk you through the process of training many models on Azure Machine Learning. You can learn more about the dataset [here](https://azure.microsoft.com/en-us/services/open-datasets/catalog/sample-oj-sales-simulated/). The full dataset includes simulated sales for 3,991 stores with 3 orange juice brands each, which allows 11,973 models to be trained to showcase the power of the many models pattern.

We'll start by downloading the first 10 files, but you can easily edit the code below to train all 11,973 models.
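The model count follows from one series per store and brand: 3,991 × 3 = 11,973. The sketch below reproduces that arithmetic and lists which series a 10-file subset would cover. It assumes file names of the form `Store<id>_<brand>.csv` with store IDs starting at 1000 and listed in store order, as suggested by the download output in this notebook; the real numbering and the order returned by the service may differ.

```python
from itertools import islice, product

N_STORES = 3991
BRANDS = ("dominicks", "minute.maid", "tropicana")
FIRST_STORE_ID = 1000  # assumption: inferred from the sample file names


def series_file_names(n_stores=N_STORES, first_store_id=FIRST_STORE_ID):
    """Yield one hypothetical file name per (store, brand) series."""
    store_ids = range(first_store_id, first_store_id + n_stores)
    for store_id, brand in product(store_ids, BRANDS):
        yield f"Store{store_id}_{brand}.csv"


# One model per series.
total_models = N_STORES * len(BRANDS)
print(total_models)  # 11973

# Under the ordering assumption, the first 10 files cover three full stores
# plus one brand of a fourth store.
first_ten = list(islice(series_file_names(), 10))
print(first_ten)
```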

In [None]:
#%pip install --upgrade azureml-opendatasets

In [2]:
import os
from azureml.core.dataset import Dataset
from azureml.opendatasets import OjSalesSimulated

# Reference the full dataset
oj_sales_files = OjSalesSimulated.get_file_dataset()

# Keep only the first 10 files
oj_sales_files_small = oj_sales_files.take(10)

# Create a folder for the download
target_path = 'oj_sales_data'
os.makedirs(target_path, exist_ok=True)

# Download the data
oj_sales_files_small.download(target_path, overwrite=True)

['C:\\Users\\erwright\\Source\\Repos\\solution-accelerator-many-models\\Custom_Script\\oj_sales_data\\https\\azureopendatastorage.azurefd.net\\ojsales-simulatedcontainer\\oj_sales_data\\Store1000_dominicks.csv',
 'C:\\Users\\erwright\\Source\\Repos\\solution-accelerator-many-models\\Custom_Script\\oj_sales_data\\https\\azureopendatastorage.azurefd.net\\ojsales-simulatedcontainer\\oj_sales_data\\Store1000_minute.maid.csv',
 'C:\\Users\\erwright\\Source\\Repos\\solution-accelerator-many-models\\Custom_Script\\oj_sales_data\\https\\azureopendatastorage.azurefd.net\\ojsales-simulatedcontainer\\oj_sales_data\\Store1000_tropicana.csv',
 'C:\\Users\\erwright\\Source\\Repos\\solution-accelerator-many-models\\Custom_Script\\oj_sales_data\\https\\azureopendatastorage.azurefd.net\\ojsales-simulatedcontainer\\oj_sales_data\\Store1001_dominicks.csv',
 'C:\\Users\\erwright\\Source\\Repos\\solution-accelerator-many-models\\Custom_Script\\oj_sales_data\\https\\azureopendatastorage.azurefd.net\\ojsales

In [3]:
from scripts.helper import split_data_upload_to_datastore

datastore = ws.get_default_datastore()
target_path = 'oj_sales_data'
ds_train_path = target_path + '_train'
ds_test_path = target_path + '_test'
split_data_upload_to_datastore(target_path, 'WeekStarting', 20, datastore, ds_train_path, ds_test_path)



oj_sales_data\https\azureopendatastorage.azurefd.net\ojsales-simulatedcontainer\oj_sales_data\Store1001_dominicks.csv
  WeekStarting  Store      Brand  Quantity  Advert  Price   Revenue
0   1990-06-14   1001  dominicks     12383       1   2.30  28480.90
1   1990-06-21   1001  dominicks     19572       1   2.25  44037.00
2   1990-06-28   1001  dominicks     13737       1   2.18  29946.66
3   1990-07-05   1001  dominicks     11989       1   2.08  24937.12
4   1990-07-12   1001  dominicks     17410       1   2.59  45091.90
oj_sales_data\https\azureopendatastorage.azurefd.net\ojsales-simulatedcontainer\oj_sales_data\Store1000_dominicks.csv
  WeekStarting  Store      Brand  Quantity  Advert  Price   Revenue
0   1990-06-14   1000  dominicks     12003       1   2.59  31087.77
1   1990-06-21   1000  dominicks     10239       1   2.39  24471.21
2   1990-06-28   1000  dominicks     17917       1   2.48  44434.16
3   1990-07-05   1000  dominicks     14218       1   2.33  33127.94
4   1990-07-12  
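The source of `split_data_upload_to_datastore` is not shown in this notebook. As an illustration only, the sketch below shows one way a time-based train/test split can work, assuming the third argument (`20`) is the number of most recent periods held out for testing. The function `split_last_n` is hypothetical, and the sample rows are copied from the output above.

```python
import csv
import io


def split_last_n(rows, time_column, n_test):
    """Sort rows by time_column and hold out the last n_test rows for testing."""
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    ordered = sorted(rows, key=lambda row: row[time_column])
    if n_test <= 0:
        return ordered, []
    return ordered[:-n_test], ordered[-n_test:]


sample_csv = """WeekStarting,Store,Brand,Quantity
1990-06-14,1001,dominicks,12383
1990-06-21,1001,dominicks,19572
1990-06-28,1001,dominicks,13737
1990-07-05,1001,dominicks,11989
1990-07-12,1001,dominicks,17410
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
train, test = split_last_n(rows, "WeekStarting", 2)
print([row["WeekStarting"] for row in train])
print([row["WeekStarting"] for row in test])
```

Splitting by time rather than at random keeps the test rows strictly later than the training rows, which matches how a forecast is evaluated.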

In [4]:
from azureml.core.dataset import Dataset

# Datastore folders written by the split step
ds_train_path = target_path + '_train'
ds_test_path = target_path + '_test'

# Create file datasets
ds_train = Dataset.File.from_files(path=datastore.path(ds_train_path), validate=False)
ds_test = Dataset.File.from_files(path=datastore.path(ds_test_path), validate=False)

# Register the file datasets
dataset_name = 'oj_data_small'
train_dataset_name = dataset_name + '_train'
test_dataset_name = dataset_name + '_test'
ds_train.register(ws, train_dataset_name, create_new_version=True)
ds_test.register(ws, test_dataset_name, create_new_version=True)

{
  "source": [
    "('workspaceblobstore', 'oj_sales_data_test')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "d5ad6720-8274-42ed-a430-c48b13ff9065",
    "name": "oj_data_small_test",
    "version": 1,
    "workspace": "Workspace.create(name='erwrightDevTest', subscription_id='381b38e9-9840-4719-a5a0-61d9585e1e91', resource_group='erwright-automl')"
  }
}

Next, we create and register a [dataset](https://docs.microsoft.com/en-us/azure/machine-learning/concept-data#datasets) in Azure Machine Learning. 

Using a [FileDataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) is currently the best way to take advantage of the many models pattern, so we create a FileDataset below:

In [4]:
# Connect to default datastore
datastore = ws.get_default_datastore()

# Upload the data
datastore.upload(src_dir=target_path,
                 target_path=target_path,
                 overwrite=True)

# Create a file dataset
path_on_datastore = datastore.path(target_path)
ds = Dataset.File.from_files(path=path_on_datastore, validate=False)

# Register the file dataset
dataset_name = 'oj_data_small'
ds.register(ws, dataset_name, create_new_version=True)

Uploading an estimated of 10 files
Uploading oj_sales_data\https\azureopendatastorage.azurefd.net\ojsales-simulatedcontainer\oj_sales_data\Store1000_dominicks.csv
Uploading oj_sales_data\https\azureopendatastorage.azurefd.net\ojsales-simulatedcontainer\oj_sales_data\Store1000_minute.maid.csv
Uploading oj_sales_data\https\azureopendatastorage.azurefd.net\ojsales-simulatedcontainer\oj_sales_data\Store1000_tropicana.csv
Uploading oj_sales_data\https\azureopendatastorage.azurefd.net\ojsales-simulatedcontainer\oj_sales_data\Store1001_dominicks.csv
Uploading oj_sales_data\https\azureopendatastorage.azurefd.net\ojsales-simulatedcontainer\oj_sales_data\Store1001_minute.maid.csv
Uploading oj_sales_data\https\azureopendatastorage.azurefd.net\ojsales-simulatedcontainer\oj_sales_data\Store1001_tropicana.csv
Uploading oj_sales_data\https\azureopendatastorage.azurefd.net\ojsales-simulatedcontainer\oj_sales_data\Store1002_dominicks.csv
Uploading oj_sales_data\https\azureopendatastorage.azurefd.net\oj

{
  "source": [
    "('workspaceblobstore', 'oj_sales_data')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "14c5dd78-f894-45d7-94cd-5a20d34ca922",
    "name": "oj_data_small",
    "version": 2,
    "workspace": "Workspace.create(name='erwrightDevTest', subscription_id='381b38e9-9840-4719-a5a0-61d9585e1e91', resource_group='erwright-automl')"
  }
}
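In the many models pattern, each file in the dataset corresponds to one series and therefore one model. As a rough sketch, a per-model key can be recovered from a file name, assuming the `Store<id>_<brand>.csv` naming seen in the upload output. The function `series_key` is hypothetical and is not part of the accelerator's scripts.

```python
import re
from pathlib import PurePosixPath

# Assumed naming convention: Store<id>_<brand>.csv (brand may contain dots).
FILE_PATTERN = re.compile(r"^Store(?P<store>\d+)_(?P<brand>.+)\.csv$")


def series_key(path):
    """Return (store_id, brand) parsed from a series file path."""
    # Normalize Windows separators so the final path component can be taken.
    name = PurePosixPath(path.replace("\\", "/")).name
    match = FILE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected file name: {name}")
    return int(match.group("store")), match.group("brand")


print(series_key("oj_sales_data/Store1000_minute.maid.csv"))
print(series_key(r"oj_sales_data\Store1001_tropicana.csv"))
```

A key like this can be used to name or tag the model trained on each file.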

Now that you've set up your workspace and created a dataset, move on to 02_Training_Pipeline.ipynb to train and score the models.