# Data Preparation

This notebook creates a dataset that a training job will consume. It follows a set of conventions so that it can be integrated into an [Azure Machine Learning Pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-ml-pipelines).

## Import Libraries 

Import every library that the data prep step requires in the next cell. Keeping all imports together at the top ensures that the Python script created for our pipeline conforms to the [PEP 8 style guide's recommendation on imports](https://www.python.org/dev/peps/pep-0008/#imports).

In [None]:
import argparse
import tempfile
import os
import logging
import pandas as pd
from azureml.core import Dataset, Workspace, Datastore
from azureml.data import TabularDataset, FileDataset
from azureml.core.run import _OfflineRun, Run

## Setting script parameters with argparse

Any parameters, such as the dataset name or the output folder name, should be defined using argparse. This is good practice in general and also makes the transition to pipelines smoother. Set each default to the value you want to use when running this notebook interactively on the compute instance. For example, to prepare a registered dataset called 'my_raw_data', you would use:

```
parser.add_argument("--input_dataset", default="my_raw_data")
```

If you have more datasets than the one provided below, you can add further arguments. If you would prefer a variable name with clearer semantic meaning (e.g. sales_dataset), feel free to change it.
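As a minimal sketch of adding a second dataset argument, the snippet below defines two dataset parameters with more descriptive names. The argument names `--sales_dataset` and `--customer_dataset`, their defaults, and the `build_parser` helper are hypothetical. It also illustrates why `parse_known_args` is used rather than `parse_args`: arguments that the parser does not define (such as those a notebook kernel may add to the command line) are collected separately instead of raising an error.

```python
import argparse


def build_parser():
    """Build a parser with one argument per input dataset (names are hypothetical)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--sales_dataset",
                        default="my_raw_sales",
                        help="name of the registered sales dataset")
    parser.add_argument("--customer_dataset",
                        default="my_raw_customers",
                        help="name of the registered customer dataset")
    parser.add_argument("--output_folder",
                        default="outputs",
                        help="the folder name where you want to place outputs")
    return parser


# Simulate a command line that contains an argument the parser does not define,
# similar to the extra arguments a notebook kernel may add.
example_argv = ["--sales_dataset", "sales_2020", "-f", "kernel.json"]
example_args, unknown = build_parser().parse_known_args(example_argv)

print(example_args.sales_dataset)     # value supplied on the command line
print(example_args.customer_dataset)  # falls back to the default
print(unknown)                        # arguments the parser did not recognise
```

In a pipeline step, the same arguments would be supplied by the step definition, so the defaults only matter for interactive use.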

In [None]:
parser = argparse.ArgumentParser()
parser.add_argument("--input_dataset",
                    default="",
                    help="the input dataset name")
parser.add_argument("--output_folder",
                    default="outputs",
                    help="the folder name where you want to place outputs")
args, _ = parser.parse_known_args()

dataset_name1 = args.input_dataset
output_folder = args.output_folder

## Load Data (must be a registered dataset in AzureML workspace)

Below we have written a function for you that will fetch a _registered_ dataset. If you have not registered a dataset for this project, please go to the [Azure Machine Learning studio](https://ml.azure.com/), locate __Datasets__ (under __Assets__ in the left-hand menu) and click on __Create Dataset__. Follow the instructions to register your dataset.

The function below will return:

* Dataset object
* Mount folder. If the dataset is a file dataset, this is the path of the folder where the dataset is mounted. Otherwise, this is returned as an empty string.

This function is designed to work both in a Compute Instance and on another compute target (e.g. a training cluster) as part of an AzureML pipeline. You can therefore use this notebook in an interactive session on a compute instance to do EDA and prepare data for training, and you will not have to change the code to incorporate it into an AzureML pipeline.

In [None]:
def get_registered_dataset(dataset_name):
    """Return the registered dataset and, for file datasets, its mount folder."""
    run = Run.get_context()
    dset = None
    mount_folder = ''

    if isinstance(run, _OfflineRun):
        # Interactive session (e.g. compute instance): fetch the dataset by name.
        ws = Workspace.from_config()
        dset = Dataset.get_by_name(ws, dataset_name)

        if isinstance(dset, FileDataset):
            # File datasets are mounted to a temporary folder.
            mount_folder = tempfile.mkdtemp()
            print('This is a file dataset and therefore mounting to ' + mount_folder)
            mount_context = dset.mount(mount_folder)
            mount_context.start()
    else:
        # Pipeline run: the dataset is passed to the step as a named input.
        print("dataset name " + dataset_name)
        dset = run.input_datasets[dataset_name]
        print(dset)
        if isinstance(dset, str):
            # A mounted file dataset arrives as the path of its mount folder.
            mount_folder = dset
            print('This is a file dataset and therefore it has already been mounted to ' + mount_folder)
            print('contents of folder')
            print(os.listdir(mount_folder))

    return dset, mount_folder

Below we call the function with the dataset name taken from the script parameters defined above.

In [None]:
dataset, mount_folder = get_registered_dataset(dataset_name1)

# if the dataset is tabular type and you want to render it into pandas dataframe use:
# dataframe = dataset.to_pandas_dataframe()

# if the dataset is tabular and you want to render it into a spark dataframe use:
# dataframe = dataset.to_spark_dataframe()

# if you want to take (say) a 10% sample of a tabular dataset use:
# dataframe = dataset.take_sample(probability=0.10).to_pandas_dataframe()

# if you want to take the first n rows (here n = 100) of a tabular dataset use:
# dataframe = dataset.take(100).to_pandas_dataframe()

# if using a file dataset then you can view the mounted files using:
# os.listdir(mount_folder)
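Note that `os.listdir` only shows the top level of the mounted folder. If the file dataset contains nested folders, a recursive listing may be more useful. The helper below is a hypothetical sketch (`list_mounted_files` is not part of the AzureML SDK), and a temporary folder stands in for `mount_folder` so that the example runs on its own.

```python
import os
import tempfile


def list_mounted_files(folder, suffix=""):
    """Return sorted paths, relative to `folder`, of files ending with `suffix`."""
    found = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if name.endswith(suffix):
                full_path = os.path.join(root, name)
                found.append(os.path.relpath(full_path, folder))
    return sorted(found)


# A temporary folder stands in for `mount_folder` in this sketch.
demo_folder = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_folder, "2020"))
for relative_path in ["readme.txt", os.path.join("2020", "sales.csv")]:
    with open(os.path.join(demo_folder, relative_path), "w") as handle:
        handle.write("placeholder\n")

# List only the CSV files, then every file, including those in subfolders.
csv_files = list_mounted_files(demo_folder, suffix=".csv")
all_files = list_mounted_files(demo_folder)
print(csv_files)
print(all_files)
```

With a real file dataset you would pass `mount_folder` in place of `demo_folder`.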

## Write data prep code below

## Write out training dataset(s) for next step in pipeline 

Write your training datasets to the folder given by the `output_folder` parameter defined in the argparse section above. The next cell creates the directory for you.

In [None]:
os.makedirs(output_folder, exist_ok=True)

# write your training sets (or any other data required for training) into the folder created above, e.g.
# file_path = os.path.join(output_folder, 'training_data.csv')
# dataframe.to_csv(file_path)
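As one possible sketch of this step, the snippet below splits a dataframe into training and validation sets and writes both to an output folder. The dataframe contents, the 80/20 split, and the file names are assumptions for illustration only, and a temporary directory stands in for `output_folder`.

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for the prepared dataframe (hypothetical columns).
prepared = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# Hold out 20% of the rows for validation; the fixed seed keeps the split repeatable.
validation = prepared.sample(frac=0.2, random_state=42)
training = prepared.drop(validation.index)

# A temporary directory stands in for `output_folder` in this sketch.
demo_output_folder = os.path.join(tempfile.mkdtemp(), "outputs")
os.makedirs(demo_output_folder, exist_ok=True)

training_path = os.path.join(demo_output_folder, "training_data.csv")
validation_path = os.path.join(demo_output_folder, "validation_data.csv")

# index=False avoids writing the pandas index as an extra unnamed column.
training.to_csv(training_path, index=False)
validation.to_csv(validation_path, index=False)

print(sorted(os.listdir(demo_output_folder)))
```

Writing each split to its own file keeps the next pipeline step simple, since it only needs the folder path and the agreed file names.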