# Azure Machine Learning - AutoML Pipeline Sample
## 01 - Environment Creation

The notebook below is part of a set of notebooks designed to train, evaluate, and register a custom regression/classification model using Azure ML's AutoML capabilities. Then, as part of a batch process, that model is used to score a new dataset, compute SHAP values for all scored data, and write the results as a CSV to an AML-linked blob store. The notebooks are designed to be run in sequence, following the number/letter designation in their titles.

This notebook contains logic for saving, uploading, and registering CSV datasets to an Azure ML-linked blob storage account. 

Using the built-in regression and classification datasets from Scikit-Learn, we first load the California Housing and Iris datasets and save them locally as CSV files. (Note: for demonstration purposes, we save two copies of the same data, one for training and one for scoring. In most production scenarios these would be distinct datasets.) Then, using the Azure Machine Learning File System utility, we upload these files to blob storage and register the training datasets (`Regression_HousingData` and `Classification_IrisData`) for use in future pipelines.

### Import required packages

In [None]:
from azure.ai.ml import MLClient  
from azure.ai.ml.entities import Data  
from azure.identity import DefaultAzureCredential  
from azure.ai.ml.constants import AssetTypes  

from sklearn.datasets import fetch_california_housing, load_iris
import pandas as pd
import os

### Load sample datasets from Scikit-Learn and create unified dataframes with both input and target features

In [None]:
# Fetch the California housing dataset  
california_housing = fetch_california_housing()  
X = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
y = pd.DataFrame(california_housing.target, columns=california_housing.target_names)
housing_data = pd.concat([X, y], axis=1)

# Load the iris dataset  
iris = load_iris()  
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.DataFrame(iris.target, columns=['Plant'])
iris_data = pd.concat([X, y], axis=1)
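
The `Plant` column created above holds Scikit-Learn's integer class codes rather than species names. As a small illustration, the sketch below maps those codes back to names. `decode_plant_codes` is a hypothetical helper, and the names are hard-coded (in the order Scikit-Learn reports them in `iris.target_names`) so the snippet runs without Scikit-Learn installed.

```python
# Class names for the Iris dataset, listed in target-code order (0, 1, 2)
IRIS_TARGET_NAMES = ["setosa", "versicolor", "virginica"]


def decode_plant_codes(codes):
    """Map integer class codes (0, 1, 2) to Iris species names."""
    return [IRIS_TARGET_NAMES[int(code)] for code in codes]


print(decode_plant_codes([0, 2, 1]))
```

With the dataframe above, `decode_plant_codes(iris_data['Plant'])` would return one species name per row.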

### Save datasets locally

In [None]:
os.makedirs('./housing_data', exist_ok=True)
os.makedirs('./iris_data', exist_ok=True)

housing_data.to_csv('./housing_data/regression_training_dataset.csv', index=False)
iris_data.to_csv('./iris_data/classification_training_dataset.csv', index=False)

housing_data.to_csv('./housing_data/regression_scoring_dataset.csv', index=False)
iris_data.to_csv('./iris_data/classification_scoring_dataset.csv', index=False)
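
The cell above writes the same rows to both the training and scoring files, which is fine for a demonstration. As a rough sketch of how two distinct sets could be produced instead, the hypothetical helper `split_indices` below shuffles row positions with a fixed seed and holds out a fraction for scoring. The 20% holdout and the seed value are arbitrary assumptions, not settings from this notebook.

```python
import random


def split_indices(n_rows, scoring_fraction=0.2, seed=42):
    """Return (training, scoring) lists of row positions with no overlap."""
    if not 0 < scoring_fraction < 1:
        raise ValueError("scoring_fraction must be between 0 and 1")
    positions = list(range(n_rows))
    # A seeded Random instance gives the same shuffle on every run
    random.Random(seed).shuffle(positions)
    n_scoring = int(round(n_rows * scoring_fraction))
    return sorted(positions[n_scoring:]), sorted(positions[:n_scoring])


train_idx, score_idx = split_indices(150)  # 150 rows, as in the Iris data
print(len(train_idx), len(score_idx))
```

With the dataframes above, `iris_data.iloc[train_idx]` and `iris_data.iloc[score_idx]` would select the two subsets before each is written out with `to_csv`.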

### Get a connection to your Azure ML workspace

Update the values for `subscription_id`, `resource_group`, and `workspace_name` to match your Azure ML workspace.

In [None]:
# Define workspace details   
subscription_id = ''
resource_group = ''
workspace_name = ''  
  
# Authenticate to Azure  
credential = DefaultAzureCredential()  
  
# Connect to your workspace using the values defined above
ml_client = MLClient(
    credential=credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace_name,
)

datastore = ml_client.datastores.get_default() 
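
The three workspace values above are left blank and must be filled in before the later cells can build a valid datastore URI. One way to catch a forgotten value early is a small check like the one below. `require_settings` is a hypothetical helper, and the sample values are placeholders rather than real identifiers.

```python
def require_settings(**settings):
    """Return the settings unchanged, or raise naming every empty one."""
    missing = [
        name
        for name, value in settings.items()
        if value is None or not str(value).strip()
    ]
    if missing:
        raise ValueError("Missing workspace settings: " + ", ".join(missing))
    return settings


# Placeholder values for illustration only
checked = require_settings(
    subscription_id="00000000-0000-0000-0000-000000000000",
    resource_group="my-resource-group",
    workspace_name="my-workspace",
)
print(sorted(checked))
```

Calling it with the notebook's own variables right after they are defined would stop the run with a clear message instead of a less obvious failure further down.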

### Upload regression & classification data to the default AML blobstore using the `AzureMachineLearningFileSystem` utility

The code below will upload the contents of our newly created `housing_data` and `iris_data` directories to the relative path specified in the `upload(...)` command.

In [None]:
from azureml.fsspec import AzureMachineLearningFileSystem
# instantiate file system using following URI
fs_uri = f'azureml://subscriptions/{subscription_id}/resourcegroups/{resource_group}/workspaces/{workspace_name}/datastores/{datastore.name}/paths/'
print(fs_uri)
fs = AzureMachineLearningFileSystem(fs_uri)

# upload the housing_data folder; recursive=True is required when lpath is a folder
# (use recursive=False when lpath points to a single file)
fs.upload(lpath='housing_data', rpath='data/housing_data', recursive=True, **{'overwrite': 'MERGE_WITH_OVERWRITE'})

# upload the iris_data folder in the same way
fs.upload(lpath='iris_data', rpath='data/iris_data', recursive=True, **{'overwrite': 'MERGE_WITH_OVERWRITE'})
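
The datastore URI used above follows a fixed pattern, and the registration cells later append a relative file path to it. As an illustration, the hypothetical helper `build_datastore_uri` below assembles the same string, which makes it easier to see where `rpath` values such as `data/housing_data` end up. All argument values in the usage line are placeholders, and `workspaceblobstore` is only assumed as the datastore name.

```python
def build_datastore_uri(subscription_id, resource_group, workspace_name,
                        datastore_name, relative_path=""):
    """Assemble an azureml:// datastore URI in the same shape as fs_uri."""
    base = (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace_name}"
        f"/datastores/{datastore_name}/paths/"
    )
    # Drop any leading slash so the result never contains 'paths//'
    return base + relative_path.lstrip("/")


print(build_datastore_uri(
    "sub-id", "my-rg", "my-ws", "workspaceblobstore",
    "data/housing_data/regression_training_dataset.csv",
))
```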

### Confirm data upload

You should now be able to see your uploaded CSVs in the AML-linked blob store, as shown below:

![Azure ML Data](img/aml_data.png "Uploaded Data")

### Register classification/regression training datasets

Here, we will create registered (read: saved & versioned) copies of our uploaded training data to simplify usage in subsequent steps.

In [None]:
# Register the dataset  
housing_data_uri = fs_uri + 'data/housing_data/regression_training_dataset.csv'
data_asset = Data(  
    path=housing_data_uri,  
    type=AssetTypes.URI_FILE,  
    description='California housing dataset from Scikit-learn to be used in building a model for predicting median home price',  
    name='Regression_HousingData',  
)  

# Create or update the dataset  
registered_data_asset = ml_client.data.create_or_update(data_asset)  

print(f"Dataset {registered_data_asset.name} is registered.") 

In [None]:
# Register the dataset  
iris_data_uri = fs_uri + 'data/iris_data/classification_training_dataset.csv'
data_asset = Data(  
    path=iris_data_uri,  
    type=AssetTypes.URI_FILE,  
    description='Iris dataset from Scikit-learn to be used in building a model for classifying plant type based on attributes',  
    name='Classification_IrisData',  
)  

# Create or update the dataset  
registered_data_asset = ml_client.data.create_or_update(data_asset)  

print(f"Dataset {registered_data_asset.name} is registered.") 
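
Once registered, an asset can be looked up by name in later steps. A minimal sketch of such a lookup follows, assuming an `MLClient` like the `ml_client` created earlier. `get_latest_data_asset` is a hypothetical wrapper around `ml_client.data.get(...)`, and the commented usage lines need a live workspace to run.

```python
def get_latest_data_asset(ml_client, name):
    """Fetch the most recent registered version of a data asset by name."""
    # label="latest" asks the service for the newest version of the asset
    return ml_client.data.get(name=name, label="latest")


# Usage with the client created earlier (requires a live workspace):
# asset = get_latest_data_asset(ml_client, "Regression_HousingData")
# print(asset.name, asset.version, asset.path)
```

Requesting the `latest` label avoids hard-coding a version number, at the cost of picking up any newer version registered in the meantime. Pass an explicit `version` to `ml_client.data.get` when a fixed snapshot is needed.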

### Confirm Dataset creation

After registering your datasets, they should appear within your AML workspace as registered assets as shown below:

![Azure ML Datasets](img/aml_datasets.png "Registered Datasets")