# Data Preparation

We will use Azure Open Datasets to pull in the Orange Juice Sales data. The goal is to forecast the weekly quantity of orange juice sold for each Brand at each Store.

The data in this example comes from the University of Chicago's Dominick's Finer Foods dataset. Dominick's was a grocery chain in the Chicago metropolitan area.

#### This Notebook will walk through the following steps: 
- Pulling the data locally 
- Grouping the data by Store and Brand 
- Uploading the data to Azure Blob Storage
- Registering a File Dataset to the Workspace 

# Prerequisites 
This example runs on an Azure Machine Learning Notebook VM. If you have already run the Environment Setup notebook, or you have an AML Workspace and Datastore set up, you are ready to proceed.

### Call the workspace and datastore
We need to load the Workspace and Datastore that were set up in the Environment Setup notebook.

In [2]:
from azureml.core.workspace import Workspace

ws = Workspace(subscription_id="bbd86e7d-3602-4e6d-baa4-40ae2ad9303c", resource_group="ManyModelsSA", workspace_name="ManyModelsSAv1")
# ws = Workspace.from_config()
datastore = ws.get_default_datastore()

### (in progress) Leverage Open Datasets to load the Orange Juice Sales data
We will use Open Datasets to pull in the subset of data we want to work with. You can set the sample size to determine the number of series you would like to use to build models. (add in what the max is) 

In [None]:
# hold for pulling data from open datasets 

#from azureml.opendatasets import OjSales

### Temporarily reading data from local directory
Currently, we are reading the data from our local directory and selecting the subset that we would like to build models off of. 

In [None]:
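import pandas as pd

time_column_name = 'WeekStarting'
# Named oj_sales_raw because the grouping step below refers to the data by that name
oj_sales_raw = pd.read_csv("generated_oj_sales.csv", parse_dates=[time_column_name])
oj_sales_raw.head()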

### Split the data into groups 
We want to group our data by Store and Brand. This is the level at which we would like to forecast quantity.

In [None]:
# group data by store and brand 
store_brand_groups = [x for _, x in oj_sales_raw.groupby(['Store', 'Brand'])]
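
Iterating over a pandas groupby object yields (key, frame) pairs, so the Store and Brand keys can be kept rather than discarded with _. The sketch below illustrates this on a small made-up frame. The Quantity column, the toy values, and the names keyed_groups and file_names are assumptions for illustration only, not part of the OJ sales schema.

```python
import pandas as pd

# Toy data standing in for the OJ sales frame (values are made up)
toy = pd.DataFrame({
    "Store": [1000, 1000, 1001, 1001],
    "Brand": ["dominicks", "tropicana", "dominicks", "dominicks"],
    "Quantity": [10, 12, 7, 9],
})

# Keep the (store, brand) key alongside each group's rows
keyed_groups = {
    (int(store), brand): frame
    for (store, brand), frame in toy.groupby(["Store", "Brand"])
}

# The keys can then be used directly, for example to derive file names
file_names = ["store{}_{}.csv".format(store, brand) for store, brand in keyed_groups]
print(file_names)
```

Keeping the keys is one way to name the output files without re-deriving Store and Brand from each frame.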

### Save the data locally 
To upload the data to Blob Storage, we first need to save it locally. We create a directory and write one csv file per Store and Brand group. 

In [None]:
import os

# Create a data directory in the local path
data_dir = "data"
os.makedirs(data_dir, exist_ok=True)

# Create a folder for the oj_sales data
oj_sales_path = os.path.join(data_dir, "oj_sales_data")
os.makedirs(oj_sales_path, exist_ok=True)

# Save each store/brand group to its own csv file for upload
for grp in store_brand_groups:
    # Every row in a group shares the same Store and Brand, so the first row is enough
    store = grp['Store'].iloc[0]
    brand = grp['Brand'].iloc[0]
    file_name = 'store{}_{}.csv'.format(store, brand)
    grp.to_csv(os.path.join(oj_sales_path, file_name), index=False)

### Upload the individual datasets to Blob Storage
We upload the data to Blob and will create the FileDataset from this folder of csv files.  

In [None]:
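# NOTE: sample_size is not defined anywhere in this notebook yet;
# set it before running this cell (see the Open Datasets step above).
target_path = 'oj_sales_data' + str(sample_size*3)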

datastore.upload(src_dir = oj_sales_path,
                target_path = target_path,
                overwrite = True, 
                show_progress = True)

### Create the file dataset 
We need to define the path of the data to create the [FileDataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py). 

In [None]:
from azureml.core.dataset import Dataset

ds_name = 'oj_data'
path_on_datastore = datastore.path(target_path + '/')

input_ds = Dataset.File.from_files(path=path_on_datastore, validate=False)

### Register the file dataset to the workspace 
We want to register the dataset to our workspace so we can call it as an input into our Pipeline for forecasting. 

In [None]:
registered_ds = input_ds.register(ws, ds_name, create_new_version=True)
named_ds = registered_ds.as_named_input(ds_name)

### Call the registered dataset
After registering the data, we can call it into our Notebook. We will use this in the Training and Scoring Notebooks. You can also download the data from the FileDataset back to a local directory. 

In [None]:
oj_ds = Dataset.get_by_name(ws, name = ds_name)
oj_ds

In [None]:
# download data locally 
oj_ds.download()

### Subset a small number of datasets 

The following blocks demonstrate how to subset 9 datasets: 3 stores, each with 3 different brands. To use a different number of files, change the argument passed to the take() method.

In [3]:
from azureml.core.dataset import Dataset

FileDstAllModels = Dataset.get_by_name(ws, name='AllDataProd')
FileDstAllModels

{
  "source": [
    "('workspaceblobstore', 'AllDataProd/')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "13d13057-951b-45e4-8c6f-ced91644781c",
    "name": "AllDataProd",
    "version": 2,
    "workspace": "Workspace.create(name='ManyModelsSAv1', subscription_id='bbd86e7d-3602-4e6d-baa4-40ae2ad9303c', resource_group='ManyModelsSA')"
  }
}

In [7]:
FileDst9Models = FileDstAllModels.take(9)

The to_path() method returns the file paths in the subset.

In [8]:
FileDst9Models.to_path()

array(['/Store1000_dominicks.csv', '/Store1000_minute.maid.csv',
       '/Store1000_tropicana.csv', '/Store1001_dominicks.csv',
       '/Store1001_minute.maid.csv', '/Store1001_tropicana.csv',
       '/Store1002_dominicks.csv', '/Store1002_minute.maid.csv',
       '/Store1002_tropicana.csv'], dtype=object)
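
The brand part of these names can itself contain a dot (minute.maid), so splitting a name on its first dot would cut the brand short. If Store and Brand need to be recovered from the paths, one possible approach is sketched below. Here parse_store_brand is a hypothetical helper, not part of the Azure ML SDK, and it assumes the StoreNNNN_brand.csv naming shown in the output above.

```python
import os

def parse_store_brand(path):
    """Split a path such as '/Store1000_minute.maid.csv' into (store, brand)."""
    # splitext removes only the final extension, so 'minute.maid' stays intact
    stem, _ext = os.path.splitext(os.path.basename(path))
    # Split on the first underscore only: left is 'Store1000', right is the brand
    store_part, brand = stem.split("_", 1)
    return int(store_part[len("Store"):]), brand

paths = [
    "/Store1000_dominicks.csv",
    "/Store1000_minute.maid.csv",
    "/Store1002_tropicana.csv",
]
parsed = [parse_store_brand(p) for p in paths]
print(parsed)
```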

Then we create a temporary directory to download the 9 files.

In [9]:
import os
import tempfile

data_folder = tempfile.mkdtemp()
FileDst9Models.download(data_folder, overwrite=True)

array(['/var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1000_dominicks.csv',
       '/var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1000_minute.maid.csv',
       '/var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1000_tropicana.csv',
       '/var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1001_dominicks.csv',
       '/var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1001_minute.maid.csv',
       '/var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1001_tropicana.csv',
       '/var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1002_dominicks.csv',
       '/var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1002_minute.maid.csv',
       '/var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1002_tropicana.csv'],
      dtype=object)

Check the downloaded files.

In [16]:
os.listdir(data_folder)

['Store1000_dominicks.csv',
 'Store1002_minute.maid.csv',
 'Store1000_tropicana.csv',
 'Store1001_minute.maid.csv',
 'Store1001_tropicana.csv',
 'Store1002_dominicks.csv',
 'Store1000_minute.maid.csv',
 'Store1001_dominicks.csv',
 'Store1002_tropicana.csv']
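
The listing above is not in the same order as the to_path() output because os.listdir returns entries in arbitrary order. When a stable order matters, sorting the result is a simple remedy. A minimal sketch, using a throwaway directory and three of the file names above (demo_dir and csv_files are illustrative names):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as demo_dir:
    # Create empty placeholder files in a deliberately unsorted order
    for name in ["Store1001_tropicana.csv",
                 "Store1000_dominicks.csv",
                 "Store1000_tropicana.csv"]:
        open(os.path.join(demo_dir, name), "w").close()

    # Sort the listing so the order does not depend on the file system
    csv_files = sorted(f for f in os.listdir(demo_dir) if f.endswith(".csv"))

print(csv_files)
```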

Upload the downloaded files to the blob container, in a folder named 'oj_sales_9_datasets'.

In [12]:
target_path = 'oj_sales_9_datasets'

datastore.upload(src_dir = data_folder,
                target_path = target_path,
                overwrite = True, 
                show_progress = True)

Uploading an estimated of 9 files
Uploading /var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1000_dominicks.csv
Uploading /var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1000_minute.maid.csv
Uploading /var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1000_tropicana.csv
Uploading /var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1001_dominicks.csv
Uploading /var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1001_minute.maid.csv
Uploading /var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1001_tropicana.csv
Uploading /var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1002_dominicks.csv
Uploading /var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1002_minute.maid.csv
Uploading /var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1002_tropicana.csv
Uploaded /var/folders/ds/k3qt7v792nz1vf7vxm0vwbz5q68fz8/T/tmp9wywgj79/Store1001_tropicana.csv, 1 files ou

$AZUREML_DATAREFERENCE_cbaac2c516464edf9511efe7e0dd41d6

We then register the file dataset to the Workspace under the name 'FileDst9Models'. We can call this registered dataset later in our notebook. This is a one-time setup that you don't need to re-run. 

In [13]:
datastore_paths = [(datastore, target_path)]
ojsales9_ds = Dataset.File.from_files(path=datastore_paths)
ojsales9_ds = ojsales9_ds.register(workspace=ws,
                                   name='FileDst9Models',
                                   description='9 files')

In [14]:
ojsales9_ds

{
  "source": [
    "('workspaceblobstore', 'oj_sales_9_datasets')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "bd170b79-d43f-4f76-8ad2-0831f5e7c28e",
    "name": "FileDst9Models",
    "version": 1,
    "description": "9 files",
    "workspace": "Workspace.create(name='ManyModelsSAv1', subscription_id='bbd86e7d-3602-4e6d-baa4-40ae2ad9303c', resource_group='ManyModelsSA')"
  }
}

Lastly, we delete the temporary directory.

In [19]:
import shutil
shutil.rmtree(data_folder)

### Subset datasets without explicitly cleaning up
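
The previous section created the directory with tempfile.mkdtemp() and removed it with shutil.rmtree(). TemporaryDirectory, used as a context manager, performs the removal automatically when the with block exits. The sketch below contrasts the two patterns using an empty placeholder file. It uses only the standard library and makes no Azure ML calls, and all variable names are illustrative.

```python
import os
import shutil
import tempfile
from tempfile import TemporaryDirectory

# Manual pattern: mkdtemp() creates the directory and leaves cleanup to the caller
manual_dir = tempfile.mkdtemp()
manual_exists_before = os.path.isdir(manual_dir)
shutil.rmtree(manual_dir)
manual_exists_after = os.path.isdir(manual_dir)

# Context-manager pattern: the directory is removed when the with block exits
with TemporaryDirectory() as temp_dir:
    open(os.path.join(temp_dir, "placeholder.csv"), "w").close()
    exists_inside = os.path.isdir(temp_dir)
exists_after = os.path.isdir(temp_dir)

print(manual_exists_before, manual_exists_after, exists_inside, exists_after)
```

The context manager also removes the directory if the body raises an exception, which the manual pattern does not do unless it is wrapped in try/finally.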

In [None]:
from tempfile import TemporaryDirectory

target_path = 'oj_sales_9_datasets'

with TemporaryDirectory() as temp_dir:
    FileDst9Models.download(temp_dir, overwrite=True)
    datastore.upload(src_dir = temp_dir,
                target_path = target_path,
                overwrite = True, 
                show_progress = True)

In [None]:
datastore_paths = [(datastore, target_path)]
ojsales9_ds = Dataset.File.from_files(path=datastore_paths)
ojsales9_ds = ojsales9_ds.register(workspace=ws,
                                   name='FileDst9Models',
                                   description='9 files')