Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Environment Setup
---

This notebook walks you through the steps needed to configure your environment for this solution accelerator:

1. Connecting to your workspace and creating a config.json file (this can be skipped if you are running on a Notebook VM)
2. Deploying a compute cluster for training and forecasting
3. Creating and registering the dataset used in this accelerator

### Prerequisites
You need an Azure Machine Learning (AML) workspace for this notebook. If you haven't created one already, you can create one in step 1.1 below.

## 1.0 Connect to workspace

Connect this solution accelerator to your AML workspace. This step isn't necessary if you're using a Notebook VM.

The following cell lets you specify your workspace parameters. It uses the Python function os.getenv to read values from environment variables, which is useful for automation. If an environment variable is not set, the parameter falls back to the default value given in the cell.

In [None]:
import os

subscription_id = os.getenv("SUBSCRIPTION_ID", default="<my-subscription-id>")
resource_group = os.getenv("RESOURCE_GROUP", default="<my-resource-group>")
workspace_name = os.getenv("WORKSPACE_NAME", default="<my-workspace-name>")
workspace_region = os.getenv("WORKSPACE_REGION", default="westus2")
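
If an environment variable is missing and the default is left untouched, a placeholder such as <my-subscription-id> is passed on as-is and the connection attempt further down fails. A minimal sketch of an early check, assuming that placeholders are always wrapped in angle brackets; the helper name find_placeholders and the example values are hypothetical and not part of the Azure ML SDK:

```python
def find_placeholders(params):
    """Return the names of parameters that still hold a '<...>' placeholder."""
    return [
        name
        for name, value in params.items()
        if value.startswith("<") and value.endswith(">")
    ]


# Hypothetical values for illustration; in the notebook, pass the variables set above.
example_params = {
    "subscription_id": "<my-subscription-id>",
    "resource_group": "many-models-rg",
    "workspace_name": "<my-workspace-name>",
}

unset = find_placeholders(example_params)
if unset:
    print("Still using placeholder defaults for:", ", ".join(unset))
```

In the notebook itself, the same check could be applied to a dictionary built from subscription_id, resource_group and workspace_name before trying to connect.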

In [None]:
from azureml.core import Workspace

try:
    ws = Workspace(subscription_id=subscription_id, 
                   resource_group=resource_group, 
                   workspace_name=workspace_name)
    print("Workspace configuration succeeded. Skip the workspace creation steps below")
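except Exception as e:
    # Catch Exception instead of using a bare except, so that interrupting the cell still works
    print(f"Workspace not accessible ({e}). Change your parameters or create a new workspace below")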

In [None]:
ws.get_details()

### 1.1 Create workspace if needed
If you don't have a workspace already, uncomment the lines below to create one:

In [None]:
# Create the workspace using the specified parameters
# ws = Workspace.create(
#    name=workspace_name,
#    subscription_id=subscription_id,
#    resource_group=resource_group, 
#    location=workspace_region,
#    create_resource_group=True,
#    sku='basic',
#    exist_ok=True
#)
#ws.get_details()

### 1.2 Write config file
Write the details of the workspace to a config.json file:

In [None]:
ws.write_config()
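
With the default arguments of SDK v1, the file is typically written to a .azureml folder under the current working directory, and later notebooks can reconnect with Workspace.from_config() instead of repeating the parameters. The sketch below only inspects such a file with the standard library. It assumes the file is a JSON object with the keys subscription_id, resource_group and workspace_name; the helper name missing_config_keys and the demo values are hypothetical:

```python
import json
import os
import tempfile

# Keys assumed to be present in a workspace config file
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(config_path):
    """Return the required keys that are absent or empty in a workspace config file."""
    with open(config_path) as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Self-contained demonstration with a throwaway file and hypothetical values.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "config.json")
    with open(demo_path, "w") as f:
        json.dump({"subscription_id": "0000", "resource_group": "my-rg"}, f)
    missing = missing_config_keys(demo_path)

print("Missing keys:", missing)
```

Pointing missing_config_keys at the file written by the cell above should return an empty list if the workspace details were saved completely.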

## 2.0 Create compute

In this step we create a compute cluster that will be used for the training and forecasting pipelines. This is a one-time setup, so you won't need to re-run it in future notebooks.

We create a STANDARD_D13_V2 compute cluster. D-series VMs are used for tasks that require higher compute power and temporary disk performance. This [page](https://docs.microsoft.com/en-us/azure/cloud-services/cloud-services-sizes-specs) gives you more information on VM sizes to help you decide which one best fits your use case.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpucluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found an existing cluster, using it instead.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D13_V2',
                                                           min_nodes=0,
                                                           max_nodes=5)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

## 3.0 Create dataset

This solution accelerator uses simulated orange juice sales data from [Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/) to walk you through the process of training many models on Azure Machine Learning. You can learn more about the dataset [here](https://azure.microsoft.com/en-us/services/open-datasets/catalog/sample-oj-sales-simulated/). The full dataset includes simulated sales for 3,991 stores with 3 orange juice brands each, which allows 11,973 models (3,991 × 3) to be trained to showcase the power of the many models pattern.

We'll start by downloading the first 10 files, but you can easily edit the code below to train all 11,973 models.

In [None]:
#%pip install --upgrade azureml-opendatasets

In [None]:
from azureml.core.dataset import Dataset
from azureml.opendatasets import OjSalesSimulated

# Reference the full dataset
oj_sales_files = OjSalesSimulated.get_file_dataset()

# Keep only the first 10 files; use oj_sales_files instead to work with the full dataset
oj_sales_files_small = oj_sales_files.take(10)

# Create a local folder for the download
target_path = 'oj_sales_data'
os.makedirs(target_path, exist_ok=True)

# Download the data
oj_sales_files_small.download(target_path, overwrite=True)
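
Before uploading, it can be worth confirming that the expected number of files arrived. The sketch below counts files recursively, since a download may place them in subfolders of the target folder. It assumes the data files use the .csv extension; the helper name find_csv_files and the demo file names are hypothetical:

```python
import glob
import os
import tempfile


def find_csv_files(root_dir):
    """Return all .csv files under root_dir, including those in subfolders."""
    pattern = os.path.join(root_dir, "**", "*.csv")
    return sorted(glob.glob(pattern, recursive=True))


# Self-contained demonstration on a throwaway folder with hypothetical file names.
with tempfile.TemporaryDirectory() as demo_dir:
    nested = os.path.join(demo_dir, "nested")
    os.makedirs(nested)
    for name in ("a.csv", "b.csv", "notes.txt"):
        open(os.path.join(nested, name), "w").close()
    csv_count = len(find_csv_files(demo_dir))

print(f"Found {csv_count} CSV files")
```

In the notebook, len(find_csv_files(target_path)) would be expected to equal the number passed to take, which is 10 in the cell above.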

Next, we create and register a [dataset](https://docs.microsoft.com/en-us/azure/machine-learning/concept-data#datasets) in Azure Machine Learning. 

Using a [FileDataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) is currently the best way to take advantage of the many models pattern, so we create a FileDataset below:

In [None]:
# Connect to default datastore
datastore = ws.get_default_datastore()

# Upload the data
datastore.upload(src_dir=target_path,
                 target_path=target_path,
                 overwrite=True)

# Create a file dataset
path_on_datastore = datastore.path(target_path)
ds = Dataset.File.from_files(path=path_on_datastore, validate=False)

# Register the file dataset
dataset_name = 'oj_data_small'
ds.register(ws, dataset_name, create_new_version=True)
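
To confirm the registration, the dataset can be looked up again by name. This is a minimal sketch, assuming the SDK v1 calls Dataset.get_by_name and FileDataset.to_path are available in your installed version; the helper name list_registered_files is hypothetical:

```python
def list_registered_files(workspace, dataset_name="oj_data_small"):
    """Return the file paths in the latest version of a registered FileDataset."""
    # Imported inside the function so that defining this sketch does not require the SDK
    from azureml.core import Dataset

    registered = Dataset.get_by_name(workspace, name=dataset_name)
    return registered.to_path()
```

Calling list_registered_files(ws) in the notebook should return one path per uploaded file, assuming the upload and registration above succeeded.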

Now that you've set up your workspace and created a dataset, move on to 01_Training_Pipeline.ipynb to train and score the models.