# Data Preparation

We will use Azure Open Datasets to pull in the Orange Juice Sales data and forecast the weekly quantity of orange juice sold for each Brand at each Store.

The data comes from the University of Chicago's Dominick's Finer Foods dataset. Dominick's was a grocery chain in the Chicago metropolitan area.

This Notebook will walk through the following steps: 
- Pulling down the data locally. 
- Grouping the data by Store and Brand. 
- Uploading the data to Azure Blob Storage.
- Registering a File Dataset to the Workspace. 

# Prerequisites 
This example runs on an Azure Machine Learning Notebook VM. If you have already run the Environment Setup notebook, or you already have an AML Workspace and Datastore set up, you are all set.

### Call the workspace and datastore
From the Environment Setup notebook, we need to call the Workspace and Datastore we set up. 

In [None]:
from azureml.core.workspace import Workspace
ws = Workspace.from_config()
datastore = ws.get_default_datastore()

### (in progress) Leverage Open Datasets to load the Orange Juice Sales data
We will use Open Datasets to pull in the subset of data we want to work with. You can set the sample size to determine the number of series you would like to use to build models. (add in what the max is) 

In [None]:
# hold for pulling data from open datasets 

#from azureml.opendatasets import OjSales

### Temporarily reading data from local directory
Currently, we are reading the data from our local directory and selecting the subset that we would like to build models off of. 

In [None]:
import pandas as pd

time_column_name = 'WeekStarting'
data = pd.read_csv("generated_oj_sales.csv", parse_dates=[time_column_name])
data.head()

The sample size can be adjusted to run different numbers of models. We subset the data by store number. The data contains 3,991 stores and 11,973 series (three per store). If you would like to run models for all 11,973 series, you may skip the subsetting step.

In [None]:
import random

sample_size = 10
store_list = data['Store'].unique().tolist()

# Pull a random subset of stores
store_sample = random.sample(store_list, sample_size)
print(store_sample)

In [None]:
# Filter the table to the subset 
oj_sales_raw = data[data['Store'].isin(store_sample)]
print(len(oj_sales_raw['Store'].unique()))
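
Because `random.sample` draws a different subset on every run, the stores selected above will change between runs. If a repeatable subset is preferred, one option is to draw from a seeded generator. The sketch below is illustrative only: the `sample_stores` helper, the seed value, and the store numbers are assumptions and are not part of the original notebook.

```python
import random


def sample_stores(store_list, sample_size, seed=42):
    """Return a repeatable random subset of store numbers.

    Hypothetical helper: the name and the default seed are illustrative.
    """
    if sample_size > len(store_list):
        raise ValueError("sample_size exceeds the number of available stores")
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    # Sort first so the draw does not depend on the order of the input list
    return rng.sample(sorted(store_list), sample_size)


# Made-up store numbers standing in for data['Store'].unique().tolist()
example_stores = list(range(1000, 1020))
store_sample = sample_stores(example_stores, sample_size=5)
print(store_sample)
```

With a fixed seed, rerunning the cell should select the same stores as long as the underlying store list is unchanged.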

### Split the data into groups 
We want to group our data by Store and Brand. This is the level at which we want to forecast quantity.

In [None]:
# group data by store and brand 
store_brand_groups = [x for _, x in oj_sales_raw.groupby(['Store', 'Brand'])]
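
To make the grouping step concrete, the sketch below applies the same list comprehension to a tiny, made-up table. Only the column names `Store`, `Brand`, and `WeekStarting` come from this notebook; the `Quantity` column name and all of the values are assumptions for illustration.

```python
import pandas as pd

# Toy data: two stores, two brands, two weeks each (all values are invented)
toy = pd.DataFrame({
    "WeekStarting": pd.to_datetime(["1990-06-14", "1990-06-21"] * 4),
    "Store": [1000, 1000, 1000, 1000, 1001, 1001, 1001, 1001],
    "Brand": ["dominicks", "dominicks", "tropicana", "tropicana"] * 2,
    "Quantity": [10, 12, 7, 9, 11, 13, 8, 6],
})

# groupby yields (key, frame) pairs; keep only the frames.
# Each frame holds the rows of exactly one store/brand series.
toy_groups = [frame for _, frame in toy.groupby(["Store", "Brand"])]

print(len(toy_groups))
for frame in toy_groups:
    print(frame["Store"].iloc[0], frame["Brand"].iloc[0], len(frame))
```

Each element of the resulting list is an ordinary DataFrame, so it can be written to its own CSV file in the next step.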

### Download the data locally 
To upload the data to Blob Storage, we first need to save it locally. We will create a directory and write one CSV file per Store/Brand group.

In [None]:
import os

# Create a data directory in the local path
data_dir = "data"
os.makedirs(data_dir, exist_ok=True)

# Create a folder for the oj_sales data
oj_sales_path = os.path.join(data_dir, "oj_sales_data")
os.makedirs(oj_sales_path, exist_ok=True)

# Save each store/brand group to its own CSV file for upload
for grp in store_brand_groups:
    # Every row in a group shares the same Store and Brand, so take the first
    store = grp['Store'].iloc[0]
    brand = grp['Brand'].iloc[0]
    file_name = 'store' + str(store) + '_' + str(brand) + '.csv'

    grp.to_csv(os.path.join(oj_sales_path, file_name), index=False)
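
Before uploading, it can be worth confirming that the folder holds exactly one CSV file per series. The sketch below is one way to do that with the standard library. The `list_series_files` helper is hypothetical, and the demo runs against a temporary directory with empty placeholder files rather than the real data.

```python
import os
import tempfile


def list_series_files(folder):
    """Return the sorted names of the .csv files directly inside folder."""
    return sorted(
        name for name in os.listdir(folder)
        if name.endswith(".csv") and os.path.isfile(os.path.join(folder, name))
    )


# Demo with empty placeholder files that follow the store/brand naming pattern
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["store1000_dominicks.csv", "store1000_tropicana.csv", "notes.txt"]:
        open(os.path.join(tmp_dir, name), "w").close()
    demo_files = list_series_files(tmp_dir)

print(demo_files)
```

In this notebook, `len(list_series_files(oj_sales_path))` would be expected to equal `len(store_brand_groups)`, assuming the folder was empty before the files were written. Files left over from an earlier run with a different sample would otherwise be counted, and uploaded, as well.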
    

### Upload the individual datasets to Blob Storage
We upload the folder of CSV files to Blob Storage and will create the FileDataset from it.

In [None]:
target_path = 'oj_sales_data' + str(sample_size*3)

datastore.upload(src_dir = oj_sales_path,
                target_path = target_path,
                overwrite = True, 
                show_progress = True)

### Create the file dataset 
We need to define the path of the data to create the [FileDataset](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.file_dataset.filedataset?view=azure-ml-py). 

In [None]:
from azureml.core.dataset import Dataset

ds_name = 'oj_data_'+ str(sample_size*3)
path_on_datastore = datastore.path(target_path + '/')

input_ds = Dataset.File.from_files(path=path_on_datastore, validate=False)

### Register the file dataset to the workspace 
We want to register the dataset to our workspace so we can call it as an input into our Pipeline for forecasting. 

In [None]:
registered_ds = input_ds.register(ws, ds_name, create_new_version=True)
named_ds = registered_ds.as_named_input(ds_name)

### Call the registered dataset
After registering the data, we can call it into our Notebook. We will use this in the Training and Scoring Notebooks. You can also download the data from the FileDataset back to a local directory.

In [None]:
oj_ds = Dataset.get_by_name(ws, name = ds_name)
oj_ds

In [None]:
# download data locally 
oj_ds.download()