# AutoML 101: auto-ml-classification

In this example we showcase how you can use the AutoML Classifier for a simple classification problem.

In this notebook you will see:
1. Creating or reusing an existing Project and Workspace
2. Instantiating AutoML Classifier
3. Training the Model using local compute
4. Exploring the results
5. Testing the fitted model

In [None]:
import azureml.core
import pandas as pd
from azureml.core.authentication import ServicePrincipalAuthentication
from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
import logging

In [None]:
subscription_id = "<Azure Subscription ID>"
resource_group = "<Azure Resource Group>"
workspace_name = "AMLSample"
workspace_region = "eastus2"

tenant_id = "<Azure Tenant ID>"
app_id = "<Azure AD Application ID>"
app_key = "<Azure AD Application Key>"

auth_sp = ServicePrincipalAuthentication(tenant_id = tenant_id,
                                         username = app_id,
                                         password = app_key)
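
Hardcoding the application key in a notebook is easy to leak. As a minimal sketch, the service principal settings could instead be read from environment variables. The variable names below (`AZURE_TENANT_ID`, `AZURE_APP_ID`, `AZURE_APP_KEY`) and the helper `load_sp_credentials` are hypothetical and not part of the Azure ML SDK.

```python
import os

# Hypothetical environment variable names; use whatever your deployment defines.
REQUIRED_VARS = ("AZURE_TENANT_ID", "AZURE_APP_ID", "AZURE_APP_KEY")


def load_sp_credentials(env=os.environ):
    """Return service principal settings read from a mapping of variables.

    Raises KeyError naming every missing variable, so a misconfigured
    environment fails before any authentication attempt is made.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "tenant_id": env["AZURE_TENANT_ID"],
        "app_id": env["AZURE_APP_ID"],
        "app_key": env["AZURE_APP_KEY"],
    }
```

The returned values could then stand in for the literal `tenant_id`, `app_id`, and `app_key` strings used above.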

In [None]:
# import the Workspace class
from azureml.core import Workspace

ws = Workspace.create(name = workspace_name,
                      auth = auth_sp,
                      subscription_id = subscription_id,
                      resource_group = resource_group, 
                      create_resource_group = True,
                      location = workspace_region,
                      exist_ok = True)

ws.get_details()

In [None]:
from azureml.core.experiment import Experiment

# choose a name for experiment
experiment_name = 'automl-classification'
# project folder
project_folder = '/home/.azureml'

experiment=Experiment(ws, experiment_name)

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace Name'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Project Directory'] = project_folder
output['Experiment Name'] = experiment.name
# None shows full column contents (pandas >= 1.0; older releases used -1)
pd.set_option('display.max_colwidth', None)
pd.DataFrame(data = output, index = ['']).T

## Diagnostics

Opt in to diagnostics collection to help improve the experience, quality, and security of future releases.

In [None]:
from azureml.telemetry import set_diagnostics_collection
set_diagnostics_collection(send_diagnostics=True)

## Load BlobStore Dataset

In [None]:
import azureml.dataprep as dprep

dataflow = dprep.read_csv(path='https://commonartifacts.blob.core.windows.net/automl/UCI_Adult_train.csv')
X_train = dataflow.drop_columns('label(IsOver50K)').skip(100)
y_train = dataflow.keep_columns('label(IsOver50K)').skip(100)
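
The cell above separates the label column from the features and skips the first 100 rows. The same idea can be sketched in plain Python on a tiny in-memory CSV. The helper `split_rows`, the `age` column, and the sample rows are made up for illustration; only the label column name comes from the notebook.

```python
import csv
import io

LABEL = "label(IsOver50K)"


def split_rows(csv_text, n_skip):
    """Split CSV text into (skipped rows, X, y).

    The first n_skip data rows are set aside, and the label column is
    separated from the feature columns in the remaining rows.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    skipped, kept = rows[:n_skip], rows[n_skip:]
    features = [{k: v for k, v in row.items() if k != LABEL} for row in kept]
    labels = [row[LABEL] for row in kept]
    return skipped, features, labels


# Made-up sample data with one feature column and the label column.
sample = "age,label(IsOver50K)\n39,0\n50,1\n38,0\n53,1\n"
skipped, X, y = split_rows(sample, n_skip=1)
```

Rows set aside this way never reach training, so they remain available for testing later.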

## Instantiate Auto ML Classifier

Instantiate an AutoMLConfig object, which defines the settings for the AutoML experiment. You can reuse this object to trigger multiple runs; each run will be part of the same experiment.

|Property|Description|
|-|-|
|**primary_metric**|This is the metric that you want to optimize.<br> Auto ML Classifier supports the following primary metrics: <br><i>AUC_macro</i><br><i>AUC_weighted</i><br><i>accuracy</i><br><i>weighted_accuracy</i><br><i>norm_macro_recall</i><br><i>balanced_accuracy</i><br><i>average_precision_score_weighted</i>|
|**max_time_sec**|Time limit in seconds for each iteration|
|**iterations**|Number of iterations. In each iteration, Auto ML Classifier trains on the data with a specific pipeline|
|**n_cross_validations**|Number of cross validation splits|
|**verbosity**|Verbosity level for AutoML log file|
|**X**|The training features to use when fitting pipelines during AutoML experiment|
|**y**|Training labels to use when fitting pipelines during AutoML experiment|
|**preprocess**|Flag indicating whether AutoML should preprocess your data for you, for example by handling missing data, text data, and other common feature extraction. Note: if the input data is sparse, you cannot set preprocess to True|
|**concurrent_iterations**|Maximum number of iterations that will be executed in parallel. This should be less than the number of cores on the AzureML compute|
|**spark_context**|Spark context|
|**path**|Path to the AzureML project folder|
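
To make **n_cross_validations** concrete, here is a minimal sketch of how k-fold cross validation partitions row indices. It illustrates the general technique only; how AutoML actually splits the data (shuffling, stratification) is not described in this notebook, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_rows, n_splits):
    """Yield (train_indices, validation_indices) for contiguous k-fold splits.

    Every row lands in exactly one validation fold; the remaining rows
    form the training set for that fold.
    """
    if not 2 <= n_splits <= n_rows:
        raise ValueError("n_splits must be between 2 and n_rows")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_rows // n_splits + (1 if i < n_rows % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, valid
        start += size


folds = list(k_fold_indices(n_rows=5, n_splits=2))
```

With `n_cross_validations = 2`, each pipeline would be fitted twice under this scheme, once per fold, and scored on the rows it did not see.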

In [None]:
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task = 'classification',
                             primary_metric = 'accuracy',
                             max_time_sec = 3600,
                             iterations = 5,
                             n_cross_validations = 2,
                             verbosity = logging.INFO,
                             X = X_train, 
                             y = y_train,
                             preprocess = True,
                             concurrent_iterations=5,
                             spark_context = sc,  # sc: the SparkContext predefined in Spark notebook environments
                             path=project_folder,
                             enable_cache=False 
                             )
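
Since **max_time_sec** limits each iteration, the settings above imply a rough ceiling on total training time. The sketch below assumes iterations run in waves of `concurrent_iterations` and ignores setup and scheduling overhead. `worst_case_seconds` is a hypothetical helper, not an SDK function.

```python
import math


def worst_case_seconds(iterations, max_time_sec, concurrent_iterations):
    """Rough upper bound on training time, in seconds.

    Assumes iterations execute in waves of `concurrent_iterations`,
    each wave taking at most `max_time_sec`.
    """
    if min(iterations, max_time_sec, concurrent_iterations) < 1:
        raise ValueError("all arguments must be at least 1")
    waves = math.ceil(iterations / concurrent_iterations)
    return waves * max_time_sec


parallel_bound = worst_case_seconds(5, 3600, 5)    # values used in this notebook
sequential_bound = worst_case_seconds(5, 3600, 1)  # same run without parallelism
```

Under these assumptions the configuration in this notebook is bounded by one hour, versus five hours if the iterations ran one at a time.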

## Training the Model

Submit the AutoML configuration with the experiment's *submit* method. Depending on the data and the number of iterations, this can run for a while. When `show_output` is True, iteration results are printed to the console.

The *submit* method on the Experiment triggers the training of the model. It can be called with the following parameters:

|**Parameter**|**Description**|
|-|-|
|**automl_config**|AutoML config instantiated in the previous step|
|**show_output**| True/False to turn on/off console output|

In [None]:
from azureml.train.automl.run import AutoMLRun

expt_run = experiment.submit(automl_config, show_output=True)

## Exploring the results

#### Widget for monitoring runs

The widget will sit on "loading" until the first iteration completes; then an auto-updating graph and table appear. It refreshes once per minute, so you should see the graph update as child runs complete.

NOTE: The widget displays a link at the bottom. This links to a web UI for exploring the individual run details. The cell below prints the portal URL for the run.

In [None]:
print(expt_run.get_portal_url())

#### Retrieve All Child Runs
You can also use SDK methods to fetch all the child runs and see the individual metrics that are logged.

In [None]:
import pandas as pd

# Collect the float-valued metrics of every child run, keyed by iteration number
children = list(expt_run.get_children())
metricslist = {}
for run in children:
    properties = run.get_properties()
    metrics = {k: v for k, v in run.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics

# One column per iteration, ordered by iteration number
rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata
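
Once the per-iteration metrics are in a dictionary, ranking the iterations takes only a few lines. The sketch below uses made-up metric values shaped like `metricslist`; `best_iteration` is a hypothetical helper and not how the SDK selects the best run.

```python
# Made-up metrics, shaped like the dict built above: {iteration: {metric: value}}
sample_metrics = {
    0: {"accuracy": 0.81, "log_loss": 0.44},
    1: {"accuracy": 0.84, "log_loss": 0.39},
    2: {"accuracy": 0.83, "log_loss": 0.36},
}


def best_iteration(metrics_by_iteration, metric, higher_is_better=True):
    """Return the iteration number with the best value for `metric`.

    Iterations that did not log the metric are ignored.
    """
    candidates = {it: m[metric]
                  for it, m in metrics_by_iteration.items() if metric in m}
    if not candidates:
        raise KeyError(metric)
    pick = max if higher_is_better else min
    return pick(candidates, key=candidates.get)


best_by_accuracy = best_iteration(sample_metrics, "accuracy")
best_by_log_loss = best_iteration(sample_metrics, "log_loss", higher_is_better=False)
```

Note that the two metrics can disagree about which iteration is best, as they do in this made-up data.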

### Retrieve the Best Model

Below we select the best pipeline from our iterations. The *get_output* method on the run object (`expt_run`) returns the best run and the fitted model from the submitted experiment run. Optional arguments to *get_output* let you retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

In [None]:
best_run, fitted_model = expt_run.get_output()
print(best_run)
print(fitted_model)

#### Best Model based on any other metric
Retrieve the run and the model with the smallest `log_loss`:

In [None]:
lookup_metric = "log_loss"
best_run, fitted_model = expt_run.get_output(metric = lookup_metric)
print(best_run)
print(fitted_model)
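
For context on why a smaller `log_loss` is better, here is a minimal sketch of the standard binary log loss formula in plain Python. It is a textbook definition for illustration, not the SDK's own implementation, and `binary_log_loss` is a hypothetical helper.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities.

    `p_pred` holds the predicted probability of class 1 for each row.
    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    if not y_true or len(y_true) != len(p_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


confident_right = binary_log_loss([1, 0], [0.9, 0.1])
confident_wrong = binary_log_loss([1, 0], [0.1, 0.9])
```

Confident correct predictions yield a loss near zero, while confident wrong ones are penalized heavily, which is why the metric is minimized.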

#### Model from a specific iteration
Retrieve the run and the model from a given iteration:

In [None]:
iteration = 3
best_run, fitted_model = expt_run.get_output(iteration = iteration)
print(best_run)
print(fitted_model)

### Test the Best Fitted Model

#### Load Test Data

In [None]:
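# The model was trained on the UCI Adult data, so the test data must have the
# same columns. The training cell skipped the first 100 rows of the dataflow;
# assumption: those rows can serve as a small hold-out set.
test_flow = dataflow.take(100)
X_test = test_flow.drop_columns('label(IsOver50K)').to_pandas_dataframe()
y_test = test_flow.keep_columns('label(IsOver50K)').to_pandas_dataframe()['label(IsOver50K)'].values

#### Testing our best pipeline
We will predict the label for 2 randomly chosen rows and compare each prediction with the true label.

In [None]:
# Randomly select rows and test
import numpy as np

for index in np.random.choice(len(y_test), 2):
    predicted = fitted_model.predict(X_test.iloc[index:index + 1])[0]
    label = y_test[index]
    compare = "Index:%d Label value = %s  Predicted value = %s " % (index, label, predicted)
    print(compare)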
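
Spot-checking two rows gives a feel for the model but not a score. A simple next step is to compute accuracy over every test row, sketched below in plain Python. The helper `accuracy` is hypothetical, and the sample labels are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return hits / len(y_true)


# Made-up labels and predictions for illustration.
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

Assuming `fitted_model.predict(X_test)` returns one prediction per row, calling `accuracy(list(y_test), list(fitted_model.predict(X_test)))` would score the whole test set.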