# Data Preparation

We will be leveraging Azure Open Datasets to pull in the Orange Juice Sales Data. 

This Notebook will walk through the following steps: 
- Pulling down the data locally. 
- Grouping the data by Store and Brand. 
- Uploading the data to Azure Blob Storage.
- Registering a File Dataset to the Workspace. 

### Call the workspace and datastore

In [None]:
from azureml.core.workspace import Workspace
ws = Workspace.from_config()
# Get the workspace's default datastore (backed by Azure Blob Storage)
datastore = ws.get_default_datastore()

### Leverage Open Datasets to load the Orange Juice Sales data
We will use Open Datasets to pull in the subset of data we want to work with. You can set the sample size to determine the number of series you would like to use to build models. (add in what the max is) 
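
As a rough illustration of what limiting the number of series could look like, the sketch below keeps only the first `sample_size` store/brand combinations of an in-memory DataFrame. The toy data and the helper name `take_first_series` are hypothetical and are not part of the Open Datasets API; the column names `Store` and `Brand` are taken from the grouping step later in this notebook.

```python
import pandas as pd


def take_first_series(df, keys, sample_size):
    """Keep the rows of the first `sample_size` distinct key combinations (hypothetical helper)."""
    wanted = df[keys].drop_duplicates().head(sample_size)
    return df.merge(wanted, on=keys, how="inner")


# Toy stand-in for the sales data, with four distinct store/brand series
toy = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 3],
    "Brand": ["a", "a", "b", "a", "a", "b"],
    "Quantity": [10, 12, 7, 5, 6, 9],
})

subset = take_first_series(toy, ["Store", "Brand"], sample_size=2)
n_series = subset.groupby(["Store", "Brand"]).ngroups
```

Here `subset` contains only the rows of the first two series, which keeps the number of models to train small while experimenting.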

In [1]:
# hold for pulling data from open datasets 

#from azureml.opendatasets import OjSales

In [None]:
# add filtering to what part of the oj data you want to bring in 
sample_size = 10

# loop to pull in correct # series 

oj_sales_raw = None  # placeholder: add in code here from subset pull

## NY Taxi Open Source Data for reference 
#for sample_month in range(number_of_months):
#    temp_df_green = NycTlcGreen(start + relativedelta(months=sample_month), end + relativedelta(months=sample_month)) \
#        .to_pandas_dataframe()
#    green_df_raw = green_df_raw.append(temp_df_green.sample(sample_size))

### Split the data into groups 

In [None]:
# group data by store and brand 
store_brand_groups = [x for _, x in oj_sales_raw.groupby(['Store', 'Brand'])]
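
To make the grouping step concrete, here is a minimal sketch on a toy DataFrame. The data values are made up for illustration; only the `Store` and `Brand` column names come from this notebook. Unlike the list above, this variant keeps the group keys by building a dictionary.

```python
import pandas as pd

# Toy stand-in for oj_sales_raw (values are illustrative only)
toy_sales = pd.DataFrame({
    "Store": [2, 2, 2, 5],
    "Brand": ["dominicks", "tropicana", "dominicks", "tropicana"],
    "Quantity": [100, 80, 120, 60],
})

# One DataFrame per (Store, Brand) combination, keyed by that combination
groups = {key: frame for key, frame in toy_sales.groupby(["Store", "Brand"])}

# Number of rows in each group
group_sizes = {key: len(frame) for key, frame in groups.items()}
```

Each value in `groups` is an ordinary DataFrame holding the rows of a single series, which is the unit that gets written to its own file in the next step.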

### Download the data locally 

In [None]:
import os

# Create a data directory in the local path
data_dir = "data"
os.makedirs(data_dir, exist_ok=True)

# Create a folder for the oj_sales data
oj_sales_path = os.path.join(data_dir, "oj_sales_data")
os.makedirs(oj_sales_path, exist_ok=True)

# Save each store/brand group to its own csv file for upload
for grp in store_brand_groups:
    # Every row in a group shares the same Store and Brand, so read them from the first row
    store = grp['Store'].iloc[0]
    brand = grp['Brand'].iloc[0]
    file_name = 'store{}_{}.csv'.format(store, brand)

    grp.to_csv(os.path.join(oj_sales_path, file_name), index=False)
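
The one-file-per-series layout can be checked on toy data before running it against the real dataset. The sketch below is an assumption-laden example rather than the notebook's own method: it uses a temporary directory and made-up values, and takes the store and brand from the `groupby` keys.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the sales data (values are illustrative only)
toy_sales = pd.DataFrame({
    "Store": [2, 2, 2, 5],
    "Brand": ["dominicks", "tropicana", "dominicks", "tropicana"],
    "Quantity": [100, 80, 120, 60],
})

# Write one csv per (Store, Brand) pair into a temporary folder
out_dir = tempfile.mkdtemp()
for (store, brand), frame in toy_sales.groupby(["Store", "Brand"]):
    file_name = "store{}_{}.csv".format(store, brand)
    frame.to_csv(os.path.join(out_dir, file_name), index=False)

written = sorted(os.listdir(out_dir))

# Read one file back to confirm it holds only that series
reloaded = pd.read_csv(os.path.join(out_dir, "store2_dominicks.csv"))
```

Listing `written` is a quick way to confirm that the file names match the pattern expected by whatever consumes the folder later.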
    

### Upload the individual datasets to Blob Storage
We will create the FileDataset from this folder of csv files on Blob. 

In [None]:
# upload to blob 
target_path = 'oj_sales_data'

datastore.upload(src_dir = oj_sales_path,
                target_path = target_path,
                overwrite = True, 
                show_progress = True)

### Create the file dataset 

In [None]:
from azureml.core.dataset import Dataset

ds_name = 'oj_data'
path_on_datastore = datastore.path(target_path + '/')

input_ds = Dataset.File.from_files(path=path_on_datastore, validate=False)

### Register the file dataset to the workspace 

In [None]:
registered_ds = input_ds.register(ws, ds_name, create_new_version=True)
named_ds = registered_ds.as_named_input(ds_name)

### Call the registered dataset

In [None]:
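import pandas as pd

oj_ds = Dataset.get_by_name(ws, name = ds_name)

# A FileDataset has no to_pandas_dataframe(), so download the files and read them with pandas
df = pd.concat((pd.read_csv(f) for f in oj_ds.download()), ignore_index=True)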

In [None]:
df.head()