This notebook is a quick utility that splits the example Test Data CSV into a folder structure resembling the expected layout of incoming data.
It was created to help demonstrate the Data Drift tools of Azure ML.

In [None]:
# Load the local test data CSV into pandas
import pandas as pd
import datetime as dt
df = pd.read_csv('location-of-test-data-in-repository')

In [None]:
# Remove special characters from column/feature names
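# regex=True is required: recent pandas versions treat the pattern as a literal string by default
df.columns = df.columns.str.replace(r'[%&()*,]', '', regex=True)
df.columns = df.columns.str.replace(r'[ ,]', '_', regex=True)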
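
As a quick check of the cleanup step, the sketch below applies the same two replacements to a few made-up column names. The names are hypothetical and are not taken from the test data.

```python
import pandas as pd

# Hypothetical column names, for illustration only
sample = pd.DataFrame(columns=['Temp (C)', 'Humidity %', 'Flow*Rate', 'R&D Cost'])

# Drop the special characters, then turn spaces (and any commas) into underscores
cleaned = sample.columns.str.replace(r'[%&()*,]', '', regex=True)
cleaned = cleaned.str.replace(r'[ ,]', '_', regex=True)

print(list(cleaned))
```

Note that a name ending in a removed character, such as `Humidity %`, keeps its trailing space and therefore ends in an underscore.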

In [None]:
df.columns

In [None]:
# Add a column for a date which strips off the Hours/Minutes of timestamp_x
# This is used for splitting the data into each subfolder
timestamps = pd.to_datetime(df['timestamp_x'], format='%m/%d/%Y %H:%M')
df['date'] = timestamps.apply(lambda x: x.strftime('%Y/%m/%d') if pd.notna(x) else None)
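
To see what the derived `date` values look like, here is a minimal sketch on three made-up timestamps, one of them missing. The sample values are assumptions; only the format string comes from the cell above.

```python
import pandas as pd

# Hypothetical timestamps in the same month/day/year hour:minute layout
stamps = pd.Series(['01/15/2020 08:30', '01/15/2020 17:45', None])

parsed = pd.to_datetime(stamps, format='%m/%d/%Y %H:%M')

# Format valid timestamps as year/month/day; the lambda returns None for missing ones
dates = parsed.apply(lambda x: x.strftime('%Y/%m/%d') if pd.notna(x) else None)

print(dates.tolist())
```

Both valid rows collapse to the same `2020/01/15` string, which is what later lets them land in the same subfolder.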


In [None]:
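import os

# Create a new file for each day's worth of data
for date in df['date'].dropna().unique():
    # 'date' looks like '2020/01/15'
    year, month, day = date.split('/')
    # Select only the rows that belong to this date
    minidf = df[df['date'] == date]
    # Create the year/month/day subfolder if it does not exist yet
    folder = f'./{year}/{month}/{day}'
    os.makedirs(folder, exist_ok=True)
    minidf.to_csv(f'{folder}/data.csv')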
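
The same split can be tried safely on a toy DataFrame before running it on the real data. The sketch below is one possible variant, not the notebook's own method: it uses `groupby` instead of an explicit loop over dates, writes into a temporary directory, and passes `index=False`, all of which are choices made for this illustration. The data is made up.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data: two rows on one day, one row on the next
df_demo = pd.DataFrame({
    'date': ['2020/01/15', '2020/01/15', '2020/01/16'],
    'value': [1, 2, 3],
})

# Write into a throwaway directory so nothing in the repository is touched
root = tempfile.mkdtemp()

written = []
for date, group in df_demo.groupby('date'):
    # 'date' looks like '2020/01/15', so its parts double as the folder path
    folder = os.path.join(root, *date.split('/'))
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, 'data.csv')
    group.to_csv(path, index=False)
    written.append(path)

print(len(written), 'files written under', root)
```

By default `groupby` skips rows whose key is missing, so rows without a valid date are left out, much like the explicit check in the loop.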

## Upload to Azure

Once the local copy of the folder structure is created, you can quickly push it to blob storage using the Azure CLI:

```azurecli
az storage azcopy blob upload -c <NAME OF CONTAINER> --account-name <NAME OF STORAGE ACCOUNT> -s <wherever you have saved the split csv files> --recursive
```


## Register datastore

Now that the data is uploaded to Azure, we can register the datastore. If you uploaded to a datastore that was already registered, as in this example, simply retrieve that datastore.

In [None]:
from azureml.core import Datastore, Dataset, Workspace
ws = Workspace('SUBSCRIPTION ID', 'RESOURCE GROUP', 'WORKSPACE NAME')

In [None]:
ds = Datastore.get(ws, 'DATASTORE NAME')

## Create Dataset

Using wildcards to handle all of the date-based subfolders, we can register our Dataset as a tabular dataset.
The `partition_format` option allows us to add columns to the Dataset based on the folder path. In this example, we add a "line" and "upload_date" column based on the folder structure.

In [None]:
partitioned_dataset = Dataset.Tabular.from_delimited_files(path=[(ds, '*/*/*/data.csv')], partition_format='{line}/{upload_date:yyyy/MM/dd}/data.csv')
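
To make the idea of deriving columns from the folder path concrete, here is a rough pure-Python stand-in. It is not how Azure ML implements `partition_format`; the helper `parse_partition` and the example path `line1/2020/01/15/data.csv` are hypothetical and only mimic the `{line}/{upload_date:yyyy/MM/dd}/data.csv` pattern.

```python
import re
from datetime import datetime


def parse_partition(path):
    """Extract 'line' and 'upload_date' from a path shaped like
    <line>/<yyyy>/<MM>/<dd>/data.csv (hypothetical helper)."""
    pattern = r'(?P<line>[^/]+)/(?P<y>\d{4})/(?P<m>\d{2})/(?P<d>\d{2})/data\.csv'
    match = re.fullmatch(pattern, path)
    if match is None:
        # The path does not follow the expected layout
        return None
    upload_date = datetime(int(match['y']), int(match['m']), int(match['d']))
    return {'line': match['line'], 'upload_date': upload_date}


result = parse_partition('line1/2020/01/15/data.csv')
print(result)
```

In this sketch a path with only three folder levels, such as `2020/01/15/data.csv`, does not match the pattern and yields `None`, so the layout of the uploaded folders has to agree with the format string.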

In [None]:
rows = partitioned_dataset.take(5)

In [None]:
rows.to_pandas_dataframe()

## Register the Dataset

The local version of the Tabular Dataset has been created. Before registering it in Azure ML, we need to mark which column or columns hold our timestamps.

The `timestamp` column is a "fine grain" timestamp that we can filter on, while the `partition_timestamp` is a "coarse grain" timestamp used for partitioning the data into groups. Since `timestamp_x` includes hours and minutes, we use it as the `timestamp` parameter, and the `date` column serves as the `partition_timestamp`. The `partition_timestamp` is optional.

In [None]:
partitioned_dataset = partitioned_dataset.with_timestamp_columns(timestamp='timestamp_x', partition_timestamp='date')
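
The difference between the two timestamp granularities can be illustrated locally with pandas, without any Azure ML calls. The rows below are made up; only the column names `timestamp_x` and `date` mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical rows; column names mirror the ones used in this notebook
demo = pd.DataFrame({
    'timestamp_x': pd.to_datetime([
        '2020-01-15 08:30', '2020-01-15 17:45', '2020-01-16 09:00',
    ]),
    'value': [1, 2, 3],
})

# Coarse-grain column: the timestamp truncated to midnight of its day
demo['date'] = demo['timestamp_x'].dt.normalize()

# Fine grain: keep rows later than 2020-01-15 12:00, down to the minute
after_noon = demo[demo['timestamp_x'] > pd.Timestamp('2020-01-15 12:00')]

# Coarse grain: count rows per day
rows_per_day = demo.groupby('date').size()

print(len(after_noon))
print(rows_per_day.tolist())
```

The fine-grain column can separate the two rows recorded on the same day, while the coarse-grain column treats them as one group.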

In [None]:
partitioned_dataset.register(ws, 'PartitionedData', create_new_version=True)