# 01 Data Preparation
This solution accelerator uses simulated orange juice sales data to walk you through the process of training many models on Azure Machine Learning. The data was simulated based on the University of Chicago's Dominick's Finer Foods dataset, which featured sales of 3 different orange juice brands for individual stores. The full simulated dataset includes 3,991 stores with 3 orange juice brands each, so 11,973 models can be trained to showcase the power of the many models pattern.
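
The model count quoted above follows from training one model per store and brand pair. A minimal check of that arithmetic, using only the figures stated in this notebook (the variable names are illustrative):

```python
# One model is trained per (store, brand) pair.
n_stores = 3991   # stores in the full simulated dataset
n_brands = 3      # orange juice brands per store

n_models = n_stores * n_brands
print(n_models)
```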

In this notebook, two datasets will be created: one with all 11,973 files and one with only 10 files that can be used to quickly test and debug. For each dataset, you'll walk through the process of:

1. Downloading the data from Azure Open Datasets
2. Uploading the data to Azure Blob Storage
3. Registering a File Dataset to the Workspace


### Prerequisites 
At this point, you should have already: 
1. Created your AML Workspace
2. Run the Environment Setup Notebook

### 1.0 Connect to your workspace and datastore

In [None]:
from azureml.core.workspace import Workspace
ws = Workspace.from_config()
# Use the workspace's default datastore (blob storage) for the uploads below
datastore = ws.get_default_datastore()

### 2.0 Download the data from Azure Open Datasets 
To download the data, import OjSalesSimulated from Azure Open Datasets. Two datasets are created: one with all 11,973 files and one with 10 files. You can change the number of files in the small dataset to suit your needs.

In [None]:
from azureml.opendatasets import OjSalesSimulated

# Pull all of the data
oj_sales_files = OjSalesSimulated.get_file_dataset()

# Pull the first 10 files
oj_sales_files_small = OjSalesSimulated.get_file_dataset().take(10)

Next, create the folders that the data will be downloaded to. 

In [None]:
import os 

oj_sales_path = "oj_sales_data"
if not os.path.exists(oj_sales_path):
    os.mkdir(oj_sales_path)
    
oj_sales_path_small = "oj_sales_data_small"
if not os.path.exists(oj_sales_path_small):
    os.mkdir(oj_sales_path_small)
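
As an alternative to the `os.path.exists` check above, `os.makedirs` with `exist_ok=True` creates a folder only if it is missing, so the cell can be rerun safely. The sketch below works inside a temporary directory (an assumption made here so that it does not touch your real data folders):

```python
import os
import tempfile

# Throwaway base directory, used only for this illustration.
base_dir = tempfile.mkdtemp()

created = []
for folder in ("oj_sales_data", "oj_sales_data_small"):
    path = os.path.join(base_dir, folder)
    os.makedirs(path, exist_ok=True)  # creates the folder if it is missing
    os.makedirs(path, exist_ok=True)  # a repeated call does not raise
    created.append(path)
```

In the notebook itself, the same call could be applied directly to `oj_sales_path` and `oj_sales_path_small`.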

Finally, download the files to the folder you created. 

In [None]:
oj_sales_files.download(oj_sales_path, overwrite=True)
oj_sales_files_small.download(oj_sales_path_small, overwrite=True)
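
After the download finishes, it can be useful to confirm how many files arrived before uploading them. The helper below, `count_csv_files`, is a hypothetical name introduced for this sketch; it walks a folder recursively because the downloaded files may sit in nested subfolders. The demo uses placeholder files in a temporary directory rather than the real data.

```python
import os
import tempfile

def count_csv_files(root):
    """Count .csv files under root, including nested subfolders."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += sum(1 for name in filenames if name.lower().endswith(".csv"))
    return total

# Demo with placeholder files (not the real dataset file names).
demo_root = tempfile.mkdtemp()
nested = os.path.join(demo_root, "nested")
os.makedirs(nested)
for i in range(10):
    with open(os.path.join(nested, f"placeholder_{i}.csv"), "w") as f:
        f.write("col\n1\n")

demo_count = count_csv_files(demo_root)
print(demo_count)
```

In the notebook, calling `count_csv_files(oj_sales_path_small)` should give 10 and `count_csv_files(oj_sales_path)` should give 11,973, assuming each downloaded file is a CSV as described in this notebook.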

### 3.0 Upload the files to your data store
To create the [FileDataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py), you first need to upload the CSV files to your blob datastore.

In [None]:
target_path = 'oj_sales_data' 
datastore.upload(src_dir = oj_sales_path,
                target_path = target_path,
                overwrite = True, 
                show_progress = False)

target_path_small = 'oj_sales_data_small'
datastore.upload(src_dir = oj_sales_path_small,
                target_path = target_path_small,
                overwrite = True, 
                show_progress = False)

### 4.0 Create the file datasets 
Now that the files exist in the datastore, [FileDatasets](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py) can be created. 

In [None]:
from azureml.core.dataset import Dataset

ds_name = 'oj_data'
path_on_datastore = datastore.path(target_path + '/')
input_ds = Dataset.File.from_files(path=path_on_datastore, validate=False)

ds_name_small = 'oj_data_small'
path_on_datastore_small = datastore.path(target_path_small + '/')
input_ds_small = Dataset.File.from_files(path=path_on_datastore_small, validate=False)

### 5.0 Register the file dataset to the workspace 
Finally, register each dataset to your workspace so it can be used as an input to the training pipeline in the next notebook. The same dataset will also be used in the scoring and forecasting pipelines.

In [None]:
registered_ds = input_ds.register(ws, ds_name, create_new_version=True)

registered_ds_small = input_ds_small.register(ws, ds_name_small, create_new_version=True)

### 6.0 Call the registered dataset *(Optional)*
After registering the data, you can retrieve it by name using the command below. This is how the datasets will be accessed in future notebooks. 

In [None]:
oj_ds = Dataset.get_by_name(ws, name = ds_name)
oj_ds

You can also download the data from the dataset in the future. 

In [None]:
oj_ds.download()

### 7.0 Delete the local files *(Optional)*

In [None]:
import shutil

shutil.rmtree(oj_sales_path)
shutil.rmtree(oj_sales_path_small)
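
Note that `shutil.rmtree` raises an error if the folder no longer exists, for example when this cell is run twice. A minimal sketch of a guarded cleanup follows; `remove_dir_if_present` is a hypothetical helper name, and the demo runs against a temporary directory instead of the real data folders.

```python
import os
import shutil
import tempfile

def remove_dir_if_present(path):
    """Delete a directory tree; do nothing if it is already gone."""
    if os.path.isdir(path):
        shutil.rmtree(path)

# Demo on a throwaway directory.
demo_dir = tempfile.mkdtemp()
remove_dir_if_present(demo_dir)
remove_dir_if_present(demo_dir)  # second call is a no-op
```

In the notebook, the helper could be called with `oj_sales_path` and `oj_sales_path_small` in place of the direct `shutil.rmtree` calls.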

## Next Steps
Now that you have created your dataset, you are ready to move to the Training Notebook to create models. 