# Kaggle Red Wine Quality 

This project uses both automated ML and bespoke ML to build a model that predicts the quality of red wine from a subset of its features. The data was taken from a Kaggle project (https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009).

To perform the Auto ML, this notebook uses the Azure ML SDK to set up an Auto ML experiment. It then progresses to creating an automated pipeline that runs training experiments and registers the resulting machine learning models.

## Steps taken in this notebook
 -  Connecting to an Azure workspace
 -  Preparing the data
 -  Creating a Compute Target
 -  Running an Auto ML experiment on Azure
 -  Analysing the Auto ML output
 -  Creating a pipeline that trains and registers a model



### Step 1 - Connecting to an Azure workspace

An Azure ML Workspace and a compute instance were set up in portal.azure.com. The following code connects to that workspace.

In [8]:
import azureml.core
print("Ready to use Azure ML", azureml.core.VERSION)

Ready to use Azure ML 1.7.0


In [9]:
from azureml.core import Workspace
ws = Workspace.from_config()
print('Ready to use the {} workspace on Azure ML {}'.format(ws.name, azureml.core.VERSION))

Ready to use the redwine workspace on Azure ML 1.7.0


### Step 2 - Preparing the data

The data used in this experiment was downloaded from https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009.
If the dataset is not already registered in the workspace, the steps below upload the CSV file to the default datastore, create a tabular dataset from it and register that dataset within Azure.

In [10]:
from azureml.core import Dataset

default_ds = ws.get_default_datastore()

if 'red wine dataset' not in ws.datasets:
    default_ds.upload_files(files=['./RedWine/Red Wine Quality.csv'], # Upload the red wine csv file in /RedWine
                        target_path='red-wine-data/', # Put it in a folder path in the datastore
                        overwrite=True, # Replace existing files of the same name
                        show_progress=True)

    #Create a tabular dataset from the path on the datastore.
    full_rw_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'red-wine-data/*.csv'))

    # Register the tabular dataset
    try:
        full_rw_data_set = full_rw_data_set.register(workspace=ws, 
                                name='red wine dataset',
                                description='red wine quality data',
                                tags = {'format':'CSV'},
                                create_new_version=True)
        print('Dataset registered.')
    except Exception as ex:
        print(ex)
else:
    print('Dataset already registered.')

Dataset already registered.


### Step 3 - Create a Compute Target
Creating a Compute Cluster in Azure to be used for subsequent ML tasks. 

__*Note*__ - If local compute is going to be used, this step is not required. 

In [11]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "red-wine-1" 

# Verify that cluster exists
try:
    training_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If not, create it
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS2_V2', max_nodes=2)
    training_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

training_cluster.wait_for_completion(show_output=True)

Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


### Step 4 - Using Auto ML to guide the choice of ML algorithm


Split the dataset into training and validation subsets

In [12]:
redwine_ds = ws.datasets.get("red wine dataset")
train_ds, test_ds = redwine_ds.random_split(percentage=0.7, seed=123)
print("Data ready!")

Data ready!
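For intuition, a seeded random split can be sketched in plain Python. This is not the Azure ML implementation of `random_split`; it assumes each row is assigned independently with the given probability, which is why the resulting sizes are only approximately 70/30. The helper name `seeded_split` and the 1,000 placeholder rows are illustrative.

```python
import random

def seeded_split(rows, percentage, seed):
    """Assign each row to the first part with probability `percentage`."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        (first if rng.random() < percentage else second).append(row)
    return first, second

rows = list(range(1000))  # placeholder rows, not the wine data
train_rows, test_rows = seeded_split(rows, percentage=0.7, seed=123)
print(len(train_rows), len(test_rows))
```

Re-running with the same seed reproduces the same two subsets, which keeps later model comparisons on an equal footing.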


Configuring an AutoML experiment to run on a compute cluster:
- Define a run configuration that includes the required packages for the experiment environment.
- Set the configuration settings to specify how many combinations to try (if the number of iterations is not specified, the default is 1000).
- Set the metric to use when evaluating models. The task here is 'regression', so the options to choose from are:
 - Spearman Correlation
 - Normalized Root Mean Squared Error
 - R-squared
 - Normalized Mean Absolute Error
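To make the four metrics concrete, the sketch below computes them in plain Python on made-up values. It is not the code AutoML runs. It assumes the normalized errors are the raw errors divided by the range of the true values, which is consistent with the results recorded later in this notebook (for example, 0.555793 / 0.111159 = 5, the spread of the quality scores). All function names are illustrative.

```python
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def normalized(error, y_true):
    """Divide an error by the range of the true values (assumed normalisation)."""
    return error / (max(y_true) - min(y_true))

def r2_score(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(y_true, y_pred):
    """Pearson correlation of the rank-transformed values."""
    a, b = average_ranks(y_true), average_ranks(y_pred)
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Made-up quality scores and predictions, for illustration only
y_true = [5, 6, 5, 7, 3, 8]
y_pred = [5.2, 5.8, 5.5, 6.4, 4.1, 7.0]

print('spearman_correlation', spearman(y_true, y_pred))
print('normalized_root_mean_squared_error', normalized(rmse(y_true, y_pred), y_true))
print('r2_score', r2_score(y_true, y_pred))
print('normalized_mean_absolute_error', normalized(mae(y_true, y_pred), y_true))
```

Spearman correlation and R-squared are scores to maximise, while the two normalized errors are minimised.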


In [9]:
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(name='Automated ML Experiment',
                             task='regression',
                             compute_target=training_cluster,
                             enable_local_managed=True,
                             training_data = train_ds,
                             validation_data = test_ds,
                             label_column_name='quality',
                             iterations=500,
                             primary_metric = 'r2_score',
                             max_concurrent_iterations=10,
                             featurization='auto',
                             model_explainability=True
                             )

print("Ready for Auto ML run.")

Ready for Auto ML run.


Running the Auto ML experiment using the configuration created above.

In [10]:
from azureml.core.experiment import Experiment
from azureml.widgets import RunDetails

print('Submitting Auto ML experiment...')
automl_experiment = Experiment(ws, 'redwine_automl')
automl_run = automl_experiment.submit(automl_config)
RunDetails(automl_run).show()
automl_run.wait_for_completion(show_output=True)

Submitting Auto ML experiment...


Determining the best model from the AutoML experiment.

In [None]:
from azureml.core.experiment import Experiment
from azureml.widgets import RunDetails

best_run, fitted_model = automl_run.get_output()
print(best_run)
print(fitted_model)
best_run_metrics = best_run.get_metrics()
for metric_name in best_run_metrics:
    metric = best_run_metrics[metric_name]
    print(metric_name, metric)

In [None]:
import pandas as pd

children = list(automl_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics
rundata = pd.DataFrame(metricslist).sort_index(axis=1)  # order the columns by iteration number
rundata.to_csv('RedWine/rundata.csv')
rundata


Transposing the results of the AutoML and picking out those with the best metrics to allow comparison of the best models more easily.

In [2]:
import pandas as pd
import numpy as np
rundata = pd.read_csv("RedWine/rundata.csv")  # same path the results were written to above
toprundata = rundata.transpose()


new_headers = toprundata.iloc[0]
toprundata = toprundata[1:] 
toprundata.columns = new_headers


# Scores are maximised; all other metrics are errors and are minimised
higher_is_better = {'explained_variance', 'r2_score', 'spearman_correlation'}
best_rows = []
for x in new_headers:
    # Keep the iteration(s) with the best value for this metric
    best = toprundata[x].max() if x in higher_is_better else toprundata[x].min()
    best_rows.append(toprundata[toprundata[x] == best])
# pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
df_toprundata = pd.concat(best_rows).drop_duplicates()
df_toprundata



Unnamed: 0,explained_variance,mean_absolute_error,mean_absolute_percentage_error,median_absolute_error,normalized_mean_absolute_error,normalized_median_absolute_error,normalized_root_mean_squared_error,normalized_root_mean_squared_log_error,r2_score,root_mean_squared_error,root_mean_squared_log_error,spearman_correlation
498,0.49232,0.40188,7.42868,0.301803,0.0803761,0.0603607,0.111159,0.106438,0.490989,0.555793,0.0863138,0.732499
356,0.441908,0.396111,7.40795,0.284356,0.0792223,0.0568713,0.116871,0.112212,0.437325,0.584357,0.090996,0.694177
499,0.490253,0.398293,7.35103,0.292824,0.0796586,0.0585648,0.111459,0.106531,0.48823,0.557297,0.0863895,0.728791
312,0.366982,0.40974,7.61672,0.248313,0.0819479,0.0496626,0.124368,0.118526,0.362821,0.621842,0.0961164,0.639898
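The four rows above are the iterations that win at least one metric. The sketch below replays that selection on a few columns copied from the table, maximising scores and minimising errors. The helper name `best_iteration` is illustrative and only a subset of the metrics is included.

```python
# Values copied from the results table above (iteration -> metric value)
results = {
    'r2_score': {498: 0.490989, 356: 0.437325, 499: 0.48823, 312: 0.362821},
    'mean_absolute_error': {498: 0.40188, 356: 0.396111, 499: 0.398293, 312: 0.40974},
    'mean_absolute_percentage_error': {498: 7.42868, 356: 7.40795, 499: 7.35103, 312: 7.61672},
    'median_absolute_error': {498: 0.301803, 356: 0.284356, 499: 0.292824, 312: 0.248313},
    'root_mean_squared_error': {498: 0.555793, 356: 0.584357, 499: 0.557297, 312: 0.621842},
}

# Scores are maximised; every other metric here is an error and is minimised
HIGHER_IS_BETTER = {'explained_variance', 'r2_score', 'spearman_correlation'}

def best_iteration(metric_name, values):
    """Return the iteration with the best value for one metric."""
    pick = max if metric_name in HIGHER_IS_BETTER else min
    return pick(values, key=values.get)

winners = {name: best_iteration(name, values) for name, values in results.items()}
for name, iteration in winners.items():
    print(f'{name}: iteration {iteration}')
```

Iteration 498 leads on R-squared and root mean squared error, while 356, 499 and 312 each lead on one of the absolute-error metrics, so no single run is best on every measure.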


In [3]:
from azureml.core.run import Run
from azureml.core import Workspace
ws = Workspace.from_config()

from azureml.core.experiment import Experiment
experiment = Experiment(workspace=ws, name='redwine_automl')

list_runs = experiment.get_runs()
for run in list_runs:
    print(run.id)
# try to get latest run ID dynamically

AutoML_5dc0c0a3-d23d-4fb4-b1e6-fe01685fb180
AutoML_57cc9c2c-4533-4174-83d0-69b7e8704656


In [134]:
from azureml.core.experiment import Experiment
from azureml.widgets import RunDetails
from ast import literal_eval
import json

run_id = 'AutoML_5dc0c0a3-d23d-4fb4-b1e6-fe01685fb180'
child_run_id = pd.DataFrame(df_toprundata.index.tolist())
child_run_id['id'] =  run_id + '_' + child_run_id.astype(str)
list_child_run_id = list(child_run_id['id'])
list_child_run_id

headers = ['child_run_id', 'ensembled_algorithms', 'pipeline_script']
child_run_information =  pd.DataFrame(columns=headers)

for x in list_child_run_id:
    run = Run(experiment=experiment, run_id=x)
    # Only ensemble runs carry the 'ensembled_algorithms' tag; use 'Na' for the rest
    row = {'child_run_id': x,
           'ensembled_algorithms': run.tags.get('ensembled_algorithms', 'Na'),
           'pipeline_script': run.tags['pipeline_script']}
    # pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
    child_run_information = pd.concat([child_run_information, pd.DataFrame([row])])
child_run_information

###### need to figure out how to get 'pipeline_name' or 'pipeline_id' dependent on whether ensemble or not. 
# child_run_information_1 = pd.io.json.json_normalize(child_run_information.pipeline_script.apply(json.loads))
# #child_run_information = pd.io.json.json_normalize(child_run_information.objects.apply(json.loads))
# child_run_information_1

# need pipeline_name for non-ensembled runs & pipeline_id for ensembled runs (within the pipeline script)

# test = Run(experiment=experiment, run_id='AutoML_5dc0c0a3-d23d-4fb4-b1e6-fe01685fb180_356')
# tags_df = pd.read_json(test.tags['pipeline_script'])
# tags_df = (tags_df["objects"].apply(pd.Series).merge(tags_df, left_index = True, right_index = True))
# tags_df = (tags_df["param_kwargs"].apply(pd.Series).merge(tags_df, left_index = True, right_index = True))
# child_run_id  = ['AutoML_5dc0c0a3-d23d-4fb4-b1e6-fe01685fb180_356', 'AutoML_5dc0c0a3-d23d-4fb4-b1e6-fe01685fb180_356']
# tags_df['child_run_id'] = child_run_id
# # tags_df
    

Unnamed: 0,child_run_id,ensembled_algorithms,pipeline_script
0,AutoML_5dc0c0a3-d23d-4fb4-b1e6-fe01685fb180_498,"['ExtremeRandomTrees', 'RandomForest', 'Random...","{""pipeline_id"":""__AutoML_Ensemble__"",""objects""..."
0,AutoML_5dc0c0a3-d23d-4fb4-b1e6-fe01685fb180_356,Na,"{""objects"": [{""class_name"": ""RobustScaler"", ""m..."
0,AutoML_5dc0c0a3-d23d-4fb4-b1e6-fe01685fb180_499,"['ExtremeRandomTrees', 'RandomForest', 'Random...","{""pipeline_id"":""__AutoML_Stack_Ensemble__"",""ob..."
0,AutoML_5dc0c0a3-d23d-4fb4-b1e6-fe01685fb180_312,Na,"{""objects"": [{""class_name"": ""RobustScaler"", ""m..."
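One possible way to tell ensemble and non-ensemble runs apart is to parse the `pipeline_script` tag as JSON and check for a `pipeline_id` key. The sketch below assumes that non-ensemble tags have no `pipeline_id`, which the truncated output above suggests but does not prove. The two tag strings are hypothetical: only their leading keys match the output, and `ExampleRegressor` and `describe_pipeline` are made-up names.

```python
import json

def describe_pipeline(pipeline_script):
    """Return ('ensemble', pipeline_id) or ('single', [class names])."""
    spec = json.loads(pipeline_script)
    if 'pipeline_id' in spec:
        return 'ensemble', spec['pipeline_id']
    return 'single', [obj.get('class_name') for obj in spec.get('objects', [])]

# Hypothetical tag values: only the leading keys match the truncated output above
ensemble_tag = '{"pipeline_id": "__AutoML_Ensemble__", "objects": []}'
single_tag = '{"objects": [{"class_name": "RobustScaler"}, {"class_name": "ExampleRegressor"}]}'

print(describe_pipeline(ensemble_tag))
print(describe_pipeline(single_tag))
```

Applied to the `pipeline_script` column, a helper like this would give one short label per child run without expanding the full JSON.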


### Step 5 - Creating a script for the pipeline steps

This pipeline consists of the following steps:

 - Training the model using an estimator
 - Registering the trained model

Pipelines consist of one or more *steps*, which can be Python scripts, or specialized steps like an Auto ML training estimator or a data transfer step that copies data from one location to another. Each step can run in its own compute context.


Creating a folder for the pipeline step files

In [13]:
import os
experiment_folder = 'RedWine/Redwine_Pipeline'
os.makedirs(experiment_folder, exist_ok=True)

print(experiment_folder)


RedWine/Redwine_Pipeline


Training the model using an estimator

In [17]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "redwine1"

# Verify that cluster exists
try:
    pipeline_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If not, create it
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS2_V2', 
                                                           max_nodes=2)
    pipeline_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

pipeline_cluster.wait_for_completion(show_output=True)

Found existing cluster, use it.

Running


In [22]:
from azureml.core.experiment import Experiment
from azureml.widgets import RunDetails

knn_experiment = Experiment(ws, 'redwine_knn')
#kautoml_run = automl_experiment.submit(automl_config)
#RunDetails(automl_run).show()
#automl_run.wait_for_completion(show_output=True)

In [24]:
##### work on this more #####

##%%writefile $experiment_folder/train_redwine.py
# Import libraries
from azureml.core import Run
import argparse
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
#from sklearn.metrics import roc_curve
#import matplotlib.pyplot as plt

## Get parameters
#parser = argparse.ArgumentParser()
#parser.add_argument('--output_folder', type=str, dest='output_folder', default="redwine_model", help='output folder')
#args = parser.parse_args()
#output_folder = args.output_folder

# Get the experiment run context
run = Run.get_context()
print(run)
experimentcontext = knn_experiment.get_runs()
print(experimentcontext)
# load the dataset (passed as an input dataset)
print("Loading Data...")
dataset = run.input_datasets['redwine_train'].to_pandas_dataframe()
le = preprocessing.LabelEncoder()
dataset_labelencoded = le.fit_transform(dataset)
print(dataset)
# Separate features and labels
#X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a decision tree model
#print('Training a decision tree model')
#model = DecisionTreeClassifier().fit(X_train, y_train)

# calculate accuracy
#y_hat = model.predict(X_test)
#acc = np.average(y_hat == y_test)
#print('Accuracy:', acc)
#run.log('Accuracy', np.float(acc))

# calculate AUC
#y_scores = model.predict_proba(X_test)
#auc = roc_auc_score(y_test,y_scores[:,1])
#print('AUC: ' + str(auc))
#run.log('AUC', np.float(auc))

# plot ROC curve
#fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
#fig = plt.figure(figsize=(6, 4))
# Plot the diagonal 50% line
#plt.plot([0, 1], [0, 1], 'k--')
# Plot the FPR and TPR achieved by our model
#plt.plot(fpr, tpr)
#plt.xlabel('False Positive Rate')
#plt.ylabel('True Positive Rate')
#plt.title('ROC Curve')
#run.log_image(name = "ROC", plot = fig)
#plt.show()

# Save the trained model
#os.makedirs(output_folder, exist_ok=True)
#output_path = output_folder + "/model.pkl"
#joblib.dump(value=model, filename=output_path)

#run.complete()

<azureml.core.run._OfflineRun object at 0x7fa70c102550>
<generator object Run._rehydrate_runs at 0x7fa70c11b780>
Loading Data...


AttributeError: '_OfflineRun' object has no attribute 'input_datasets'
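The error arises because `Run.get_context()` returns an offline placeholder when the code runs interactively in the notebook instead of inside a submitted run, so no input datasets are attached. One way to keep the script usable in both settings is to fall back to another data source when `input_datasets` is missing. The sketch below uses stand-in classes rather than the Azure ML SDK; `load_training_data`, `OfflineRunStub` and `SubmittedRunStub` are hypothetical names.

```python
class OfflineRunStub:
    """Stand-in for the offline run object: it has no input_datasets attribute."""

class SubmittedRunStub:
    """Stand-in for a submitted run that carries named inputs."""
    def __init__(self, input_datasets):
        self.input_datasets = input_datasets

def load_training_data(run, input_name, fallback):
    """Prefer the run's named input; otherwise call `fallback()`."""
    inputs = getattr(run, 'input_datasets', None)
    if inputs and input_name in inputs:
        return inputs[input_name]
    return fallback()

def local_rows():
    # Placeholder for loading the data some other way outside a submitted run
    return ['rows loaded outside a submitted run']

submitted = SubmittedRunStub({'redwine_train': ['rows from the run input']})
print(load_training_data(OfflineRunStub(), 'redwine_train', local_rows))
print(load_training_data(submitted, 'redwine_train', local_rows))
```

In this notebook the fallback could fetch the registered dataset with `ws.datasets.get("red wine dataset")`, as in Step 4, before converting it to a pandas dataframe.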

The script for the second step of the pipeline will load the model from where it was saved, and then register it in the workspace. It includes a single **model_folder** parameter that contains the path where the model was saved.

In [None]:
%%writefile $experiment_folder/register_diabetes.py
# Import libraries
import argparse
import joblib
from azureml.core import Workspace, Model, Run

# Get parameters
parser = argparse.ArgumentParser()
parser.add_argument('--model_folder', type=str, dest='model_folder', default="diabetes_model", help='model location')
args = parser.parse_args()
model_folder = args.model_folder

# Get the experiment run context
run = Run.get_context()

# load the model
print("Loading model from " + model_folder)
model_file = model_folder + "/model.pkl"
model = joblib.load(model_file)

Model.register(workspace=run.experiment.workspace,
               model_path = model_file,
               model_name = 'diabetes_model',
               tags={'Training context':'Pipeline'})

run.complete()
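The way the **model_folder** parameter reaches the script can be tried without Azure by handing `argparse` an explicit argument list. This is a minimal sketch: the wrapper `parse_args` and the path `/tmp/step_output` are illustrative, while the argument definition mirrors the script above.

```python
import argparse

def parse_args(argv=None):
    """Parse the registration script's single parameter."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_folder', type=str, dest='model_folder',
                        default='diabetes_model', help='model location')
    return parser.parse_args(argv)

# No arguments: the default folder name is used
print(parse_args([]).model_folder)
# Argument supplied, as a pipeline step would do
print(parse_args(['--model_folder', '/tmp/step_output']).model_folder)
```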

## Prepare a Compute Environment for the Pipeline Steps

In this exercise, you'll use the same compute for both steps, but it's important to realize that each step is run independently; so you could specify different compute contexts for each step if appropriate.

First, get the compute target you created in a previous lab (if it doesn't exist, it will be created).

> **Important**: Change *your-compute-cluster* to the name of your compute cluster in the code below before running it!

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "your-compute-cluster"

# Verify that cluster exists
try:
    pipeline_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # If not, create it
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS2_V2', 
                                                           max_nodes=2)
    pipeline_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

pipeline_cluster.wait_for_completion(show_output=True)

The compute will require a Python environment with the necessary package dependencies installed, so you'll need to create a run configuration.

In [None]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import RunConfiguration

# Create a Python environment for the experiment
diabetes_env = Environment("diabetes-experiment-env")
diabetes_env.python.user_managed_dependencies = False # Let Azure ML manage dependencies
diabetes_env.docker.enabled = True # Use a docker container

# Create a set of package dependencies
diabetes_packages = CondaDependencies.create(conda_packages=['scikit-learn','ipykernel','matplotlib', 'pandas'],
                                             pip_packages=['azureml-sdk','pyarrow'])

# Add the dependencies to the environment
diabetes_env.python.conda_dependencies = diabetes_packages

# Register the environment (just in case previous lab wasn't completed)
diabetes_env.register(workspace=ws)
registered_env = Environment.get(ws, 'diabetes-experiment-env')

# Create a new runconfig object for the pipeline
pipeline_run_config = RunConfiguration()

# Use the compute you created above. 
pipeline_run_config.target = pipeline_cluster

# Assign the environment to the run configuration
pipeline_run_config.environment = registered_env

print ("Run configuration created.")

## Create and Run a Pipeline

Now you're ready to create and run a pipeline.

First you need to define the steps for the pipeline, and any data references that need to be passed between them. In this case, the first step must write the model to a folder that can be read from by the second step. Since the steps will be run on remote compute (and in fact, could each be run on different compute), the folder path must be passed as a data reference to a location in a datastore within the workspace. The **PipelineData** object is a special kind of data reference that is used for interim storage locations that can be passed between pipeline steps, so you'll create one and use it as the output for the first step and the input for the second step. Note that you also need to pass it as a script argument so the code can access the datastore location referenced by the data reference.

In [None]:
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep, EstimatorStep
from azureml.train.estimator import Estimator

# Get the training dataset
diabetes_ds = ws.datasets.get("diabetes dataset")

# Create a PipelineData (temporary Data Reference) for the model folder
model_folder = PipelineData("model_folder", datastore=ws.get_default_datastore())

estimator = Estimator(source_directory=experiment_folder,
                        compute_target = pipeline_cluster,
                        environment_definition=pipeline_run_config.environment,
                        entry_script='train_diabetes.py')

train_step = EstimatorStep(name = "Train Model",
                           estimator=estimator, 
                           estimator_entry_script_arguments=['--output_folder', model_folder],
                           inputs=[diabetes_ds.as_named_input('diabetes_train')],
                           outputs=[model_folder],
                           compute_target = pipeline_cluster,
                           allow_reuse = True)

# Step 2, run the model registration script
register_step = PythonScriptStep(name = "Register Model",
                                source_directory = experiment_folder,
                                script_name = "register_diabetes.py",
                                arguments = ['--model_folder', model_folder],
                                inputs=[model_folder],
                                compute_target = pipeline_cluster,
                                runconfig = pipeline_run_config,
                                allow_reuse = True)

print("Pipeline steps defined")
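The folder hand-off between the two steps can be mimicked locally to see the idea in isolation. The sketch below is an analogy, not the Azure ML mechanism: a temporary directory stands in for the datastore location, `pickle` stands in for `joblib`, and a plain dictionary stands in for a trained model. The function names are illustrative.

```python
import os
import pickle
import tempfile

def train_step(output_folder):
    """Write a model file into the shared folder (a dict stands in for a model)."""
    os.makedirs(output_folder, exist_ok=True)
    model = {'algorithm': 'placeholder', 'n_neighbors': 5}
    with open(os.path.join(output_folder, 'model.pkl'), 'wb') as f:
        pickle.dump(model, f)

def register_step(model_folder):
    """Read the model back from the folder the first step wrote to."""
    with open(os.path.join(model_folder, 'model.pkl'), 'rb') as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as shared_root:
    # Both steps receive the same path, just as both scripts receive the data reference
    model_folder = os.path.join(shared_root, 'model_folder')
    train_step(model_folder)
    loaded = register_step(model_folder)

print(loaded)
```

The only contract between the steps is the folder path and the file name inside it, which is why the path has to be passed to both scripts as an argument.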

OK, you're ready to build the pipeline from the steps you've defined and run it as an experiment.

In [None]:
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.widgets import RunDetails

# Construct the pipeline
pipeline_steps = [train_step, register_step]
pipeline = Pipeline(workspace = ws, steps=pipeline_steps)
print("Pipeline is built.")

# Create an experiment and run the pipeline
experiment = Experiment(workspace = ws, name = 'diabetes-training-pipeline')
pipeline_run = experiment.submit(pipeline, regenerate_outputs=True)
print("Pipeline submitted for execution.")

RunDetails(pipeline_run).show()
pipeline_run.wait_for_completion()

The widget above shows details of the pipeline as it runs. You can also monitor pipeline runs in the **Experiments** page in [Azure Machine Learning studio](https://ml.azure.com).

When the pipeline has finished, a new model should be registered with a *Training context* tag indicating it was trained in a pipeline. Run the following code to verify this.

In [None]:
from azureml.core import Model

for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

This is a simple example, designed to demonstrate the principle. In reality, you could build more sophisticated logic into the pipeline steps - for example, evaluating the model against some test data to calculate a performance metric like AUC or accuracy, comparing the metric to that of any previously registered versions of the model, and only registering the new model if it performs better.
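The "only register if it performs better" rule could look like the sketch below. It is a plain-Python illustration with a hypothetical helper, `should_register`; how the previous scores are obtained (for example, from tags on registered models) is left open.

```python
def should_register(new_score, previous_scores, higher_is_better=True):
    """Return True when the new model beats every previously registered version."""
    if not previous_scores:
        return True  # nothing registered yet
    if higher_is_better:
        return new_score > max(previous_scores)
    return new_score < min(previous_scores)

# A score such as R-squared: higher is better
print(should_register(0.49, [0.44, 0.36]))
# An error such as RMSE: lower is better
print(should_register(0.58, [0.56], higher_is_better=False))
# First model ever trained
print(should_register(0.50, []))
```

A check like this would sit in the registration step, just before the call that registers the model.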

You can use the [Azure Machine Learning extension for Azure DevOps](https://marketplace.visualstudio.com/items?itemName=ms-air-aiagility.vss-services-azureml) to combine Azure ML pipelines with Azure DevOps pipelines (yes, it *is* confusing that they have the same name!) and integrate model retraining into a *continuous integration/continuous deployment (CI/CD)* process. For example, you could use an Azure DevOps *build* pipeline to trigger an Azure ML pipeline that trains and registers a model, and when the model is registered it could trigger an Azure DevOps *release* pipeline that deploys the model as a web service, along with the application or service that consumes the model.