# AutoML example

## Use case
Predict the sale price of an apartment.

Example features of the dataset are shown below.

![](./data_table.png)

## Create Azure ML

1. Create an Azure Machine Learning workspace using the Azure Portal, an ARM template, the Azure CLI, or another tool.

2. Go to https://ml.azure.com/ and log in.


## Load data to Azure ML

Go to the Datasets tab and create a new dataset by importing the data file. Set the correct dataset properties and column names.
![](./load_data.png)
![](./data_schema.png)

## Create Notebook in Azure ML

Create a new notebook in Azure ML.

Then create a new compute target, keeping in mind that the pricing tiers differ. I created mine and named it _default4_.


## Authenticate to Azure

The code below authenticates to Azure. Run the cell and follow the instructions.

In [1]:
from azureml.core import Workspace, Dataset

# Get the Workspace defined in the default config.json file
ws = Workspace.from_config()

Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code CESZALQCT to authenticate.
You have logged in. Now let us find all the subscriptions to which you have access...
Interactive authentication successfully completed.
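
`Workspace.from_config()` reads the workspace coordinates from a `config.json` file. On an Azure ML compute instance this file is normally already in place; elsewhere it can be downloaded from the portal or written by hand. Below is a minimal sketch of writing it by hand, assuming the standard three keys. The helper name `write_workspace_config` and all placeholder values are hypothetical and must be replaced with your own.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json with the three keys Workspace.from_config() looks for."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=2))
    return path


# Placeholder values for illustration only; substitute your own.
demo_dir = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_dir,
    subscription_id="00000000-0000-0000-0000-000000000000",
    resource_group="my-resource-group",
    workspace_name="my-workspace",
)
print(config_path.read_text())
```

The resulting file can then be passed explicitly with `Workspace.from_config(path=...)` if it does not sit next to the notebook.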


## Load data
Load the data into a pandas DataFrame.

In [2]:
# Load Data
dataset_name = 'apartments'
aml_dataset = ws.datasets[dataset_name]

# Use a pandas DataFrame just to take a sneak peek at the data and schema
full_df = aml_dataset.to_pandas_dataframe()
full_df.head()

| Unnamed: 0 | SalePrice | YearBuilt | YrSold | MonthSold | Size(sqf) | Floor | HallwayType | HeatingType | AptManageType | N_Parkinglot(Ground) | ... | N_FacilitiesNearBy(Mall) | N_FacilitiesNearBy(ETC) | N_FacilitiesNearBy(Park) | N_SchoolNearBy(Elementary) | N_SchoolNearBy(Middle) | N_SchoolNearBy(High) | N_SchoolNearBy(University) | N_FacilitiesInApt | N_FacilitiesNearBy(Total) | N_SchoolNearBy(Total) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 141592 | 2006 | 2007 | 8 | 814 | 3 | terraced | individual_heating | management_in_trust | 111.0 | ... | 1.0 | 1.0 | 0.0 | 3.0 | 2.0 | 2.0 | 2.0 | 5 | 6.0 | 9.0 |
| 1 | 51327 | 1985 | 2007 | 8 | 587 | 8 | corridor | individual_heating | self_management | 80.0 | ... | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 0.0 | 3 | 12.0 | 4.0 |
| 2 | 48672 | 1985 | 2007 | 8 | 587 | 6 | corridor | individual_heating | self_management | 80.0 | ... | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 0.0 | 3 | 12.0 | 4.0 |
| 3 | 380530 | 2006 | 2007 | 8 | 2056 | 8 | terraced | individual_heating | management_in_trust | 249.0 | ... | 1.0 | 0.0 | 0.0 | 2.0 | 2.0 | 1.0 | 2.0 | 5 | 3.0 | 7.0 |
| 4 | 221238 | 1993 | 2007 | 8 | 1761 | 3 | mixed | individual_heating | management_in_trust | 523.0 | ... | 1.0 | 5.0 | 0.0 | 4.0 | 3.0 | 5.0 | 5.0 | 4 | 14.0 | 17.0 |


## Split data

Split the data into a training set and a test set.

In [3]:
train_dataset, test_dataset = aml_dataset.random_split(0.8, seed=23423)

# Use Pandas DF only to check the data
train_dataset_df = train_dataset.to_pandas_dataframe()
test_dataset_df = test_dataset.to_pandas_dataframe()

print(train_dataset_df.describe())

           SalePrice    YearBuilt       YrSold    MonthSold    Size(sqf)  \
count    4712.000000  4712.000000  4712.000000  4712.000000  4712.000000   
mean   221190.035017  2002.953523  2012.690365     6.182725   954.845289   
std    105798.535899     8.860901     2.913136     3.399517   381.186609   
min     32743.000000  1978.000000  2007.000000     1.000000   135.000000   
25%    144752.000000  1993.000000  2010.000000     3.000000   644.000000   
50%    207964.000000  2006.000000  2013.000000     6.000000   910.000000   
75%    291150.000000  2008.000000  2015.000000     9.000000  1149.000000   
max    585840.000000  2015.000000  2017.000000    12.000000  2337.000000   

             Floor  N_Parkinglot(Ground)  N_Parkinglot(Basement)        N_APT  \
count  4712.000000           4712.000000             4712.000000  4712.000000   
mean     11.929542            196.078311              572.262097     5.646647   
std       7.519036            218.253233              408.010536     2.8
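
`random_split(0.8, seed=23423)` returns two datasets, the first holding approximately 80% of the rows; fixing the seed makes the split repeatable. The sketch below illustrates the idea of a seeded, approximate split on plain row indices. It is not Azure ML's implementation, and `approximate_split` is a hypothetical helper.

```python
import random


def approximate_split(n_rows, fraction, seed):
    """Assign each row index to the first part with probability `fraction`."""
    rng = random.Random(seed)
    first, second = [], []
    for idx in range(n_rows):
        (first if rng.random() < fraction else second).append(idx)
    return first, second


train_idx, test_idx = approximate_split(1000, 0.8, seed=23423)
again_train, again_test = approximate_split(1000, 0.8, seed=23423)

print(len(train_idx), len(test_idx))  # sizes are close to 800 / 200, not exact
print(train_idx == again_train)       # same seed, same split
```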

## Get the compute node

In [4]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

amlcompute_cluster_name = "default4"

found = False
cts = ws.compute_targets

if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'ComputeInstance':
     found = True
     print('Found existing training cluster.')
     # Get existing cluster
     # Method 1:
     aml_remote_compute = cts[amlcompute_cluster_name]
     # Method 2:
     # aml_remote_compute = ComputeTarget(ws, amlcompute_cluster_name)
    
if not found:
     print('Creating a new training cluster...')
     provisioning_config = AmlCompute.provisioning_configuration(vm_size = "STANDARD_D3_V2", # for GPU, use "STANDARD_NC12"
                                                                 #vm_priority = 'lowpriority', # optional
                                                                 max_nodes = 20)
     # Create the cluster.
     aml_remote_compute = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
    
print('Checking cluster status...')
# Can poll for a minimum number of nodes and for a specific timeout.
# If no min_node_count is provided, it will use the scale settings for the cluster.
aml_remote_compute.wait_for_completion(show_output = True)

Found existing training cluster.
Checking cluster status...

Running


## Metrics to evaluate models
List the primary metrics supported for regression tasks, to decide which one to set in the next step.

In [6]:
from azureml.train import automl

automl.utilities.get_primary_metrics('regression')

['r2_score',
 'normalized_root_mean_squared_error',
 'spearman_correlation',
 'normalized_mean_absolute_error']
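
The two `normalized_*` metrics are the usual RMSE and MAE divided by the range of the target, which makes them comparable across datasets with different price scales. Here is a small sketch of that computation, assuming normalization by `max(y) - min(y)` as described in the Azure ML documentation. The helper `normalized_errors` is hypothetical and the sample prices are made up.

```python
import math


def normalized_errors(y_true, y_pred):
    """Return (normalized RMSE, normalized MAE), both divided by the range of y_true."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    value_range = max(y_true) - min(y_true)
    return rmse / value_range, mae / value_range


# Made-up prices, for illustration only.
y_true = [100_000, 200_000, 300_000, 400_000]
y_pred = [110_000, 190_000, 310_000, 390_000]

nrmse, nmae = normalized_errors(y_true, y_pred)
print(f"normalized RMSE: {nrmse:.4f}, normalized MAE: {nmae:.4f}")
```

Lower is better for both normalized errors, whereas `r2_score` and `spearman_correlation` are better when higher.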

## Define experiment

The cell below defines the experiment to run.
The three main settings are:
 - task - 'regression'
 - primary_metric - 'r2_score'
 - label_column_name - 'SalePrice'

In [7]:
import logging
import os

from azureml.train.automl import AutoMLConfig

project_folder = './'
os.makedirs(project_folder, exist_ok=True)

automl_config = AutoMLConfig(compute_target=aml_remote_compute,
                             task='regression',
                             primary_metric='r2_score',
                             experiment_timeout_minutes=15,                            
                             training_data=train_dataset,
                             label_column_name="SalePrice",
                             n_cross_validations=5,
                             # blacklist_models='XGBoostClassifier', 
                             # iteration_timeout_minutes=5,                                                    
                             enable_early_stopping=True,
                             featurization='auto',
                             debug_log='automated_ml_errors.log',
                             verbosity=logging.INFO,
                             path=project_folder
                             )
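
The same settings are often collected in a dictionary and unpacked into `AutoMLConfig(**settings)`, which makes it easy to validate them before submitting. The following is a sketch under that assumption, reusing the regression metric list from the previous step. `check_settings` is a hypothetical helper, and the `AutoMLConfig` call is left as a comment because it needs a live workspace.

```python
REGRESSION_METRICS = [
    "r2_score",
    "normalized_root_mean_squared_error",
    "spearman_correlation",
    "normalized_mean_absolute_error",
]

automl_settings = {
    "task": "regression",
    "primary_metric": "r2_score",
    "label_column_name": "SalePrice",
    "experiment_timeout_minutes": 15,
    "n_cross_validations": 5,
    "enable_early_stopping": True,
    "featurization": "auto",
}


def check_settings(settings):
    """Fail early on a metric that the regression task does not support."""
    if settings["primary_metric"] not in REGRESSION_METRICS:
        raise ValueError(f"Unsupported primary metric: {settings['primary_metric']}")
    return settings


check_settings(automl_settings)

# With a workspace available, the dictionary can be unpacked directly:
# automl_config = AutoMLConfig(compute_target=aml_remote_compute,
#                              training_data=train_dataset,
#                              **automl_settings)
```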

## Run the experiment


In [8]:
from azureml.core import Experiment
from datetime import datetime

now = datetime.now()
time_string = now.strftime("%m-%d-%Y-%H")
experiment_name = "regress-automl-remote-{0}".format(time_string)
print(experiment_name)

experiment = Experiment(workspace=ws, name=experiment_name)

import time
start_time = time.time()
            
run = experiment.submit(automl_config, show_output=True)

print('Manual run timing: --- %s seconds needed for running the whole Remote AutoML Experiment ---' % (time.time() - start_time))

regress-automl-remote-12-30-2020-17
Running on remote.
No run_configuration provided, running on default4 with default configuration
Running on remote compute: default4
Parent Run ID: AutoML_b3d0f831-2811-4050-a9ba-b7fa01774933

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

******************************************************************************************
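
The experiment name is built from the current date and hour, so every run submitted within the same hour lands in the same experiment. A short sketch of that naming scheme with fixed timestamps follows; `experiment_name_for` is a hypothetical helper.

```python
from datetime import datetime


def experiment_name_for(moment):
    """Build the experiment name from month, day, year and hour."""
    return "regress-automl-remote-{0}".format(moment.strftime("%m-%d-%Y-%H"))


# Two submissions in the same hour share one experiment name.
first = experiment_name_for(datetime(2020, 12, 30, 17, 5))
second = experiment_name_for(datetime(2020, 12, 30, 17, 55))
print(first)            # regress-automl-remote-12-30-2020-17
print(first == second)  # True
```

Adding `%M` to the format string would give each submission its own experiment, at the cost of a longer experiment list in the workspace.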

## Show the best result

In [10]:
best_run, fitted_model = run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: regress-automl-remote-12-30-2020-17,
Id: AutoML_b3d0f831-2811-4050-a9ba-b7fa01774933_12,
Type: azureml.scriptrun,
Status: Completed)
RegressionPipeline(pipeline=Pipeline(memory=None,
                                     steps=[('datatransformer',
                                             DataTransformer(enable_dnn=None,
                                                             enable_feature_sweeping=None,
                                                             feature_sweeping_config=None,
                                                             feature_sweeping_timeout=None,
                                                             featurization_config=None,
                                                             force_text_dnn=None,
                                                             is_cross_validation=None,
                                                             is_onnx_compatible=None,
                                          

## Predict values

Remove the label column (`SalePrice`) from the test set so that only the features are passed to the model.

In [14]:
import pandas as pd

#Remove Label/y column
if 'SalePrice' in test_dataset_df.columns:
    y_test_df = test_dataset_df.pop('SalePrice')

x_test_df = test_dataset_df

Predict values and print the first 10 predictions.

In [15]:
y_predictions = fitted_model.predict(x_test_df)

print('10 predictions: ')
print(y_predictions[:10])

10 predictions: 
[ 53260.29297958 328536.45112727 189534.82674841  84655.59642399
  53845.40404312 177425.06995282 182509.36666898 126561.71172731
 196948.99789632  91413.50337866]


In [16]:
y_predictions.shape

(1179,)
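
The row counts seen so far (4712 training rows in the `describe()` output, 1179 test predictions) can be used to confirm that the split is close to the requested 80/20:

```python
train_rows = 4712  # count from train_dataset_df.describe()
test_rows = 1179   # length of y_predictions

train_share = train_rows / (train_rows + test_rows)
print(f"train share: {train_share:.4f}")  # train share: 0.7999
```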

Calculate and show r2 score:

In [17]:
from sklearn.metrics import r2_score

print('R2 Score:')
r2_score(y_test_df, y_predictions)

R2 Score:


0.9763867186258007

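The best model reaches an R² score of about 0.976 on the test set, meaning it explains roughly 98% of the variance in sale price. Note that R² is a goodness-of-fit measure, not a classification accuracy.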
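
For reference, R² is defined as one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. Below is a minimal sketch with made-up values; `r2_by_hand` is a hypothetical helper that should agree with `sklearn.metrics.r2_score` on the same inputs.

```python
def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up prices, for illustration only.
y_true = [100_000, 200_000, 300_000, 400_000]
y_pred = [110_000, 190_000, 310_000, 390_000]

print(f"R2: {r2_by_hand(y_true, y_pred):.3f}")  # R2: 0.992
```

Because R² is unitless, it is often reported together with an error in price units, such as the mean absolute error, to show how far off a typical prediction is.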