# Lab 1 - Training a Machine Learning Model

In this lab you will set up the Azure Machine Learning service and use it to track the training of a `scikit-learn` model.

## Get the lab datasets
The following cell downloads the datasets used by this lab. Click into the cell and press `Shift + Enter` to execute it.

In [None]:
import os

# Create a temporary folder to store locally relevant content for this notebook
datasetsFolderName = '../datasets'
os.makedirs(datasetsFolderName, exist_ok=True)
print('Content files will be saved to {0}'.format(datasetsFolderName))

filesToDownload = ['UsedCars_Clean.csv', 'UsedCars_Affordability.csv']

for fileToDownload in filesToDownload:
  # Quote the output path and URL so the shell treats each as a single argument
  downloadCommand = "wget -O '{0}/{1}' 'https://databricksdemostore.blob.core.windows.net/data/aml-labs/{1}'".format(datasetsFolderName, fileToDownload)
  print(downloadCommand)
  os.system(downloadCommand)
  
#List all downloaded files
os.listdir(datasetsFolderName)

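If `wget` is not available in your environment, the same files could be fetched with Python's standard library instead. The following is a minimal sketch and not part of the original lab: the helper names `dataset_url` and `download_datasets` are hypothetical, and the base URL is simply the one used in the cell above. The download call is left commented out because it requires network access.

```python
import os
import urllib.request

# Base URL copied from the wget command in the cell above
BASE_URL = 'https://databricksdemostore.blob.core.windows.net/data/aml-labs'

def dataset_url(file_name):
    """Build the download URL for one dataset file (hypothetical helper)."""
    return '{0}/{1}'.format(BASE_URL, file_name)

def download_datasets(folder, file_names):
    """Download each file into `folder` and return the local paths (hypothetical helper)."""
    os.makedirs(folder, exist_ok=True)
    local_paths = []
    for file_name in file_names:
        local_path = os.path.join(folder, file_name)
        urllib.request.urlretrieve(dataset_url(file_name), local_path)
        local_paths.append(local_path)
    return local_paths

# Example call (requires network access):
# download_datasets('../datasets', ['UsedCars_Clean.csv', 'UsedCars_Affordability.csv'])
```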
## Train a simple model

The following cell loads the sampled dataset. Use `Shift + Enter` to execute the cell. Take a moment to look at the data loaded into the Pandas DataFrame. It contains data about used cars such as the price (in dollars), age (in months), KM (kilometers driven) and other attributes like whether it has an automatic transmission, the number of doors, and the weight.

In [None]:
# Step 1 - load the data
########################
import numpy as np
import pandas as pd
import os

pathToCsvFile = os.path.join(datasetsFolderName, 'UsedCars_Clean.csv')
df = pd.read_csv(pathToCsvFile, delimiter=',')
print(df)


We are going to try to build a model that can answer the question "Can I afford a car that is X months old and has Y kilometers on it, given I have $12,000 to spend?". To do this, we will engineer an `Affordable` label that is 1 when the price is below $12,000 and 0 otherwise. Execute the following cell.

In [None]:
# Step 2 - add the Affordable label
######################################
df['Affordable'] = np.where(df['Price']<12000, 1, 0)
df_affordability = df[["Age","KM", "Affordable"]]
print(df_affordability)

We are going to train a Logistic Regression model in Azure Databricks. This type of model requires us to standardize the scale of our training features, Age and KM, so we use the `StandardScaler` from scikit-learn to transform them so that each has a mean of approximately 0 (with most values falling between -2.96 and 1.29). When we use scikit-learn, models are trained on the driver node. Execute the following cell and observe the difference in min and max values between the unscaled and scaled DataFrames.

In [None]:
# Step 3 - Scale the numeric feature values
###########################################
X = df_affordability[["Age", "KM"]].values
y = df_affordability[["Affordable"]].values

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X.astype(float))

print(pd.DataFrame(X).describe().round(2))
print(pd.DataFrame(X_scaled).describe().round(2))

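To see what standardization does to a single column, the sketch below applies the same formula by hand: subtract the column mean, then divide by the population standard deviation (the form `StandardScaler` uses by default). The ages are made-up values for illustration only, and `standardize` is a hypothetical helper rather than part of the lab.

```python
import statistics

def standardize(values):
    """Return (x - mean) / std for each value, using the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Made-up ages in months, for illustration only
ages = [12.0, 24.0, 36.0, 48.0, 60.0]
scaled_ages = standardize(ages)

# After scaling, the column has mean 0 and standard deviation 1
print([round(v, 3) for v in scaled_ages])
print(round(statistics.mean(scaled_ages), 3), round(statistics.pstdev(scaled_ages), 3))
```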
Train the model by fitting a LogisticRegression against the scaled input features (X_scaled) and the labels (y). Execute the following cell.

In [None]:
# Step 4 - Fit a Logistic Regression
####################################
from sklearn import linear_model
# Create a linear model for Logistic Regression
clf = linear_model.LogisticRegression(C=1)

# Flatten labels
y = np.ravel(y)

# we create an instance of Classifier and fit the data.
clf.fit(X_scaled, y)

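To make the prediction step concrete, the sketch below shows how a fitted binary logistic regression turns two scaled features into a 0/1 label: it computes a weighted sum, passes it through the sigmoid function, and predicts 1 when the resulting probability exceeds 0.5. The weights and bias are made-up values for illustration, not the coefficients learned by `clf`, and `predict_affordable` is a hypothetical helper.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_affordable(scaled_age, scaled_km, weights, bias):
    """Return (label, probability) for one car, given scaled features."""
    z = weights[0] * scaled_age + weights[1] * scaled_km + bias
    probability = sigmoid(z)
    label = 1 if probability > 0.5 else 0
    return label, probability

# Hypothetical coefficients, NOT the values learned by the lab's model
example_weights = (1.5, 0.8)
example_bias = 0.2

label, probability = predict_affordable(0.5, -0.25, example_weights, example_bias)
print(label, round(probability, 3))
```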
Try a prediction: if you set the age to 60 months and the KM to 40,000, does the model predict you can afford the car? Execute the cell and find out.

In [None]:
# Step 5 - Test the trained model's prediction
##############################################
age = 60
km = 40000

scaled_input = scaler.transform([[age, km]])
prediction = clf.predict(scaled_input)

print("Can I afford a car that is {} month(s) old with {} KM's on it?".format(age,km))
print("Yes (1)" if prediction[0] == 1 else "No (0)")

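Now, let's get a sense for how accurate the model is. Note that the following cell scores the model on the same data it was trained on, so the result is likely to be an optimistic estimate. Execute the cell. What was your model's accuracy?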

In [None]:
# Step 6 - Measure the model's performance
##########################################
scaled_inputs = scaler.transform(X.astype(float))
predictions = clf.predict(scaled_inputs)
print(predictions)

from sklearn.metrics import accuracy_score
score = accuracy_score(y, predictions)
# Use the built-in round(), which works whether score is a Python float or a NumPy float
print("Model Accuracy: {}".format(round(score, 3)))

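Accuracy is simply the fraction of predictions that match the true labels. As a minimal sketch of that calculation, with made-up labels and a hypothetical `accuracy` helper standing in for `accuracy_score`:

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that equal the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("true_labels and predicted_labels must have the same length")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

# Made-up labels for illustration: four of the five predictions are correct
example_truth = [1, 0, 1, 1, 0]
example_predictions = [1, 0, 0, 1, 0]
example_accuracy = accuracy(example_truth, example_predictions)
print(example_accuracy)
```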
One thing that can affect the model's performance is how much of the available labeled data is used to train it. In the next cell, you define a function that uses `train_test_split` from scikit-learn to split the data into training and test sets using different percentages. The model is trained on the training portion and scored on the held-out test portion. Execute the cell to define this function.

In [None]:
# Step 7 - Define a method to experiment with different training subset sizes
#############################################################################
from sklearn.model_selection import train_test_split
full_X = df_affordability[["Age", "KM"]]
full_Y = df_affordability[["Affordable"]]

def train_eval_model(full_X, full_Y, training_set_percentage):
    train_X, test_X, train_Y, test_Y = train_test_split(full_X, full_Y, train_size=training_set_percentage, random_state=42)
    
    # Flatten labels
    train_Y = np.ravel(train_Y)
    test_Y = np.ravel(test_Y)
    
    # Convert to float
    train_X = train_X.astype(float)
    test_X = test_X.astype(float)

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(train_X)
    clf = linear_model.LogisticRegression(C=1)
    clf.fit(X_scaled, train_Y)

    scaled_inputs = scaler.transform(test_X)
    predictions = clf.predict(scaled_inputs)
    score = accuracy_score(test_Y, predictions)

    return (clf, score)

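The idea behind a holdout split can be sketched in a few lines of plain Python: shuffle the rows with a fixed seed, then cut the list at the requested fraction. This is only an illustration under simplifying assumptions. `holdout_split` is a hypothetical helper, and its shuffle will not reproduce the exact rows that `train_test_split` selects with `random_state=42`.

```python
import random

def holdout_split(rows, train_fraction, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Twenty made-up row indices, split with the same fractions used later in this lab
rows = list(range(20))
for fraction in (0.25, 0.5, 0.75, 0.9):
    train_rows, test_rows = holdout_split(rows, fraction)
    print(fraction, len(train_rows), len(test_rows))
```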
## Use Azure Machine Learning to log performance metrics
In the steps that follow, you will train multiple models using different sizes of training data and observe the impact on performance (accuracy). Each time you create a new model, you are executing a Run in the terminology of Azure Machine Learning service. In this case, you will create one Experiment and execute multiple Runs within it, each with different training percentages (and resultant varying accuracies).

Execute the following cell to quickly verify you have the Azure Machine Learning SDK installed on your cluster. If you get a version number back without error, you are ready to proceed.

In [None]:
# Step 8 - Verify AML SDK Installed
#####################################################################
import azureml.core
print("SDK Version:", azureml.core.VERSION)

# import the Workspace class 
from azureml.core import Workspace

All Azure Machine Learning entities are organized within a Workspace. You can create an AML Workspace in the Azure Portal, but as the code in the following cell shows, you can also create a Workspace directly from code. Set the values for `subscription_id`, `resource_group`, `workspace_name` and `workspace_region` as directed by the comments. Execute Step 9. You will be prompted to log in to your Azure Subscription by the command output.

In [None]:
# Step 9 - Create a workspace
#####################################################################

#Provide the Subscription ID of your existing Azure subscription
subscription_id = "<your subscription>"

#Provide values for the Resource Group and Workspace that will be created
resource_group = "<your resource group>"
workspace_name = "<your workspace name>"
workspace_region = "<your region>"

ws = Workspace.create(
    name = workspace_name,
    subscription_id = subscription_id,
    resource_group = resource_group, 
    location = workspace_region,
    exist_ok = True)

print("Workspace Provisioning complete.")
ws.get_details()
ws.write_config()

To begin capturing metrics, you must first create an Experiment and then call `start_logging()` on that Experiment. The return value of this call is a Run. This root run can have other child runs. When you are finished with an experiment run, use `complete()` to close out the root run. Execute the following cell to train four different models using differing amounts of training data and log the results to Azure Machine Learning.

In [None]:
# Step 10 - Create an experiment and log metrics for multiple training runs
###########################################################################
from azureml.core.run import Run
from azureml.core.experiment import Experiment

# start a training run by defining an experiment
myexperiment = Experiment(ws, "usedcars_training_local")
root_run = myexperiment.start_logging()

# Train one model per training set fraction and log each as a child run
for training_set_percentage in [0.25, 0.5, 0.75, 0.9]:
    run = root_run.child_run("Training_Set_Percentage-%0.5F" % training_set_percentage)
    model, score = train_eval_model(full_X, full_Y, training_set_percentage)
    # The value is stored as a fraction, so multiply by 100 when reporting a percentage
    print("With %0.2f percent of data, model accuracy reached %0.4f." % (training_set_percentage * 100, score))
    run.log("Training_Set_Percentage", training_set_percentage)
    run.log("Accuracy", score)
    run.complete()

# Close out the experiment
root_run.complete()

Now that you have captured history for various runs, you can review them. You could use the Azure Portal for this: go to the Azure Portal, find your Azure Machine Learning Workspace, select Experiments and select the `usedcars_training_local` experiment. However, in this case we will use the AML SDK to query for the runs. Execute the following cell to view the runs and their status.

In [None]:
# Step 11 - Review captured runs
################################
# Go to the Azure Portal, find your Azure Machine Learning Workspace, select Experiments and select usedcars_training_local

# You can also query the run history using the SDK.
# The following command lists all of the child runs of the root run
runs = list(root_run.get_children())
print(runs)

## Train remotely using Azure ML Compute

Up until now, all of your training was executed locally on the same machine running Jupyter. Now you will execute the same logic targeting a remote Azure ML Compute, which you will provision from code.

In [None]:
# Step 12 - Create an Azure ML Compute cluster
#############################################################################################

# Read the workspace config from file
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

# Create Azure ML cluster
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os

# choose a name for your cluster
cluster_name = "cpu-bai-cluster"
cluster_min_nodes = 1
cluster_max_nodes = 3
vm_size = "STANDARD_DS11_V2"

if cluster_name in ws.compute_targets:
    compute_target = ws.compute_targets[cluster_name]
    if compute_target and isinstance(compute_target, AmlCompute):
        print('Found existing compute target, using this compute target instead of creating: ' + cluster_name)
    else:
        print("Error: A compute target with name " + cluster_name + " was found, but it is not of type AmlCompute.")
else:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size, 
                                                                min_nodes = cluster_min_nodes, 
                                                                max_nodes = cluster_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of the current cluster status, use the 'status' property
    print(compute_target.status.serialize())

With your cluster ready, you need to upload the training data to the default DataStore for your AML Workspace (which uses Azure Storage). 

In [None]:
# Step 13 - Upload the dataset to the DataStore
###############################################
ds = ws.get_default_datastore()
print(ds.datastore_type, ds.account_name, ds.container_name)
ds.upload(src_dir='../datasets', target_path='used_cars', overwrite=True, show_progress=True)

Next, you will need to create a training script that is similar to the code you have executed locally to train the model. 

In [None]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

In [None]:
%%writefile $script_folder/train.py

import argparse
import os
import numpy as np
import pandas as pd

from sklearn import linear_model 
# sklearn.externals.joblib has been removed from recent scikit-learn releases; import joblib directly
import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

from azureml.core import Run

# let user feed in 2 parameters, the location of the data files (from datastore), and the training set percentage to use
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
parser.add_argument('--training-set-percentage', type=float, dest='training_set_percentage', default=0.25, help='percentage of dataset to use for training')
args = parser.parse_args()

data_folder = os.path.join(args.data_folder, 'used_cars')
print('Data folder:', data_folder)
data_csv_path = os.path.join(data_folder, 'UsedCars_Clean.csv')
print('Path to CSV file dataset:' + data_csv_path)

# Load the data from the mounted datastore path
df = pd.read_csv(data_csv_path)
df['Affordable'] = np.where(df['Price']<12000, 1, 0)
df_affordability = df[["Age","KM", "Affordable"]]


# Now experiment with different training subsets
from sklearn.model_selection import train_test_split
full_X = df_affordability[["Age", "KM"]]
full_Y = df_affordability[["Affordable"]]

def train_eval_model(full_X, full_Y,training_set_percentage):
    train_X, test_X, train_Y, test_Y = train_test_split(full_X, full_Y, train_size=training_set_percentage, random_state=42)
    
    # Flatten labels
    train_Y = np.ravel(train_Y)
    test_Y = np.ravel(test_Y)
    
    # Convert to float
    train_X = train_X.astype(float)
    test_X = test_X.astype(float)

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(train_X)
    clf = linear_model.LogisticRegression(C=1)
    clf.fit(X_scaled, train_Y)

    scaled_inputs = scaler.transform(test_X)
    predictions = clf.predict(scaled_inputs)
    score = accuracy_score(test_Y, predictions)

    return (clf, score)

# Acquire the current run
run = Run.get_context()


training_set_percentage = args.training_set_percentage
model, score = train_eval_model(full_X, full_Y, training_set_percentage)
# The value is stored as a fraction, so multiply by 100 when reporting a percentage
print("With %0.2f percent of data, model accuracy reached %0.4f." % (training_set_percentage * 100, score))
run.log("Training_Set_Percentage", training_set_percentage)
run.log("Accuracy", score)


# note: files saved in the outputs folder are automatically uploaded into the experiment record
# make sure the folder exists before writing to it
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/model.pkl')


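To check how `train.py` interprets its command-line arguments without submitting a job, you can rebuild the same parser locally and pass it an explicit argument list. This is a minimal sketch: `build_parser` is a hypothetical helper that mirrors the two arguments defined in the script above, and `/tmp/data` is a made-up path.

```python
import argparse

def build_parser():
    """Recreate the argument parser used by train.py (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', type=str, dest='data_folder',
                        help='data folder mounting point')
    parser.add_argument('--training-set-percentage', type=float,
                        dest='training_set_percentage', default=0.25,
                        help='percentage of dataset to use for training')
    return parser

# Both arguments supplied; '/tmp/data' is a made-up path for illustration
example_args = build_parser().parse_args(
    ['--data-folder', '/tmp/data', '--training-set-percentage', '0.3'])
print(example_args.data_folder, example_args.training_set_percentage)

# Omitting the percentage falls back to the script's default of 0.25
default_args = build_parser().parse_args(['--data-folder', '/tmp/data'])
print(default_args.training_set_percentage)
```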
Create an estimator that describes the configuration of the job that will execute your model training script.

In [None]:
# Step 14 - Create estimator
#############################
from azureml.train.estimator import Estimator

script_params = {
    '--data-folder': ds.as_mount(),
    '--training-set-percentage': 0.3
}

est_config = Estimator(source_directory=script_folder,
                       script_params=script_params,
                       compute_target=compute_target,
                       entry_script='train.py',
                       conda_packages=['scikit-learn', 'pandas'])

Submit the job using the `submit()` method of the Experiment object.

In [None]:
# Step 15 - Execute the estimator job
#####################################

# Create new experiment
from azureml.core import Experiment
experiment_name = "usedcars_training_amlcompute"
exp = Experiment(workspace=ws, name=experiment_name)

run = exp.submit(config=est_config)
run


You can monitor the job through the Azure Portal or by using the AML Jupyter widget.

In [None]:
from azureml.widgets import RunDetails
RunDetails(run).show()


In [None]:
# Poll for job status
run.wait_for_completion(show_output=True)  # value of True will display a verbose, streaming log

# Examine the recorded metrics from the run
print(run.get_metrics())