Copyright (c) Microsoft Corporation. All rights reserved.


# Tutorial: Use automated machine learning to predict taxi fares

In this tutorial, you use automated machine learning in Azure Machine Learning service to create a regression model to predict NYC taxi fare prices. This process accepts training data and configuration settings, and automatically iterates through combinations of different feature normalization/standardization methods, models, and hyperparameter settings to arrive at the best model.

In this tutorial you learn the following tasks:

* Download, transform, and clean data using Azure Open Datasets
* Train an automated machine learning regression model
* Calculate model accuracy

If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version](https://aka.ms/AMLFree) of Azure Machine Learning service today.

## Prerequisites

* Complete the [setup tutorial](https://docs.microsoft.com/azure/machine-learning/service/tutorial-1st-experiment-sdk-setup) if you don't already have an Azure Machine Learning service workspace or notebook virtual machine.
* After you complete the setup tutorial, open the **tutorials/regression-automated-ml.ipynb** notebook using the same notebook server.

This tutorial is also available on [GitHub](https://github.com/Azure/MachineLearningNotebooks/tree/master/tutorials) if you wish to run it in your own [local environment](https://docs.microsoft.com/azure/machine-learning/service/how-to-configure-environment#local). Run `pip install azureml-sdk[automl] azureml-opendatasets azureml-widgets` to get the required packages.

## Download and prepare data

Import the necessary packages. The Open Datasets package contains a class representing each data source (`NycTlcGreen` for example) to easily filter date parameters before downloading.

In [4]:
from azureml.opendatasets import NycTlcGreen
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta

Begin by creating a dataframe to hold the data. The cell below reads the intrusion detection dataset from a CSV file into `kdd_df` with `pd.read_csv`, then previews the last 10 rows with `tail(10)`.

In [5]:
#load dataset 
kdd_df = pd.read_csv("https://library.startlearninglabs.uw.edu/DATASCI420/2019/Datasets/Intrusion%20Detection.csv")
kdd_df.tail(10)

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,Class
97298,0,tcp,ftp_data,SF,0,2072,0,0,0,1,...,84,1.0,0.0,1.0,0.02,0.0,0.0,0.0,0.0,1
97299,31,tcp,telnet,SF,2402,3814,0,0,0,3,...,2,1.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,1
97300,33,tcp,telnet,SF,2402,3815,0,0,0,3,...,3,1.0,0.0,0.33,0.0,0.0,0.0,0.0,0.0,1
97301,162,tcp,telnet,SF,1567,2738,0,0,0,3,...,4,1.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,1
97302,127,tcp,telnet,SF,1567,2736,0,0,0,1,...,5,1.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,1
97303,321,tcp,telnet,RSTO,1506,1887,0,0,0,0,...,6,1.0,0.0,0.17,0.0,0.0,0.0,0.17,0.17,1
97304,45,tcp,telnet,SF,2336,4201,0,0,0,3,...,7,1.0,0.0,0.14,0.0,0.0,0.0,0.14,0.14,1
97305,176,tcp,telnet,SF,1559,2732,0,0,0,3,...,8,1.0,0.0,0.12,0.0,0.0,0.0,0.12,0.12,1
97306,61,tcp,telnet,SF,2336,4194,0,0,0,3,...,9,1.0,0.0,0.11,0.0,0.0,0.0,0.11,0.11,1
97307,47,tcp,telnet,SF,2402,3816,0,0,0,3,...,10,1.0,0.0,0.1,0.0,0.0,0.0,0.1,0.1,1


In [6]:
# Use a label encoder to convert the three categorical features to numeric
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
kdd_df['protocol_type'] = le.fit_transform(kdd_df['protocol_type'])
kdd_df['service'] = le.fit_transform(kdd_df['service'])
kdd_df['flag'] = le.fit_transform(kdd_df['flag'])
kdd_df.describe()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,Class
count,97308.0,97308.0,97308.0,97308.0,97308.0,97308.0,97308.0,97308.0,97308.0,97308.0,...,97308.0,97308.0,97308.0,97308.0,97308.0,97308.0,97308.0,97308.0,97308.0,97308.0
mean,216.62,1.18,10.74,7.61,1157.12,3385.56,0.0,0.0,0.0,0.05,...,202.01,0.85,0.06,0.13,0.02,0.0,0.0,0.06,0.06,0.0
std,1359.01,0.42,3.12,1.61,34220.86,37573.05,0.0,0.0,0.01,0.86,...,86.97,0.31,0.18,0.28,0.05,0.03,0.02,0.22,0.22,0.02
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,1.0,10.0,8.0,147.0,136.0,0.0,0.0,0.0,0.0,...,170.0,0.91,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,1.0,10.0,8.0,231.0,421.0,0.0,0.0,0.0,0.0,...,255.0,1.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0
75%,0.0,1.0,10.0,8.0,313.0,2124.0,0.0,0.0,0.0,0.0,...,255.0,1.0,0.01,0.07,0.03,0.0,0.0,0.0,0.0,0.0
max,58329.0,2.0,24.0,8.0,2194619.0,5134218.0,1.0,0.0,3.0,30.0,...,255.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
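
To make the encoding step concrete, here is a minimal pure-Python sketch of what a label encoder does: it maps each distinct value to its position in the sorted list of distinct values. The `encode_labels` helper and the sample protocol names are illustrative assumptions, not part of the notebook.

```python
def encode_labels(values):
    """Map each distinct value to its index in the sorted list of distinct values."""
    classes = sorted(set(values))
    index = {label: i for i, label in enumerate(classes)}
    return [index[v] for v in values], classes

# Hypothetical sample of protocol names
sample = ["tcp", "udp", "icmp", "tcp"]
codes, classes = encode_labels(sample)

print(classes)  # ['icmp', 'tcp', 'udp']
print(codes)    # [1, 2, 0, 1]
```

Because the notebook refits the same `le` object on each column, `le.classes_` only holds the mapping for the last column fitted (`flag`). Keep one encoder per column if you need to invert the codes later.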


In [7]:
# Remove the zero-value columns
kdd_df_new = kdd_df.drop(['is_host_login','wrong_fragment','num_outbound_cmds'],axis=1)

In [8]:
# Split the new dataframe into input features X and output label Y
X = kdd_df_new.iloc[:,:38]
Y = kdd_df_new.iloc[:,-1]

In [9]:
# Split the data into train and test sets. The same split is reused across the classifiers below.
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,Y,random_state=34,test_size=0.3)
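
As a quick check on the split, the sketch below works out how many rows land in each set for `test_size=0.3`. It assumes the test share is rounded up and the remainder goes to training, which agrees with the total support of 29,193 in the classification report later in this notebook. The row count comes from the `describe()` output above.

```python
import math

n_samples = 97308   # row count reported by kdd_df.describe()
test_size = 0.3     # same fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # test share, rounded up (assumption)
n_train = n_samples - n_test               # the remainder is used for training

print(n_train, n_test)  # 68115 29193
```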

In [10]:
#scale the train/test data using MinMaxScaler. Same scaler is used across all classifiers for consistency 

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

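
The scaling cell fits the scaler on the training data only and then reuses it on the test data. A minimal pure-Python sketch of that idea for a single column follows. The helper names and sample values are hypothetical, and the zero-range handling is a simplification rather than a claim about scikit-learn's internals.

```python
def fit_min_max(column):
    """Learn the minimum and maximum from the training column."""
    return min(column), max(column)

def transform_min_max(column, lo, hi):
    """Rescale values with the learned bounds: (x - lo) / (hi - lo)."""
    span = hi - lo
    # Simplification: a constant training column maps to 0.0 in this sketch
    return [(x - lo) / span if span else 0.0 for x in column]

train = [0.0, 147.0, 313.0]   # hypothetical training values
test = [231.0, 400.0]         # hypothetical test values

lo, hi = fit_min_max(train)                  # fit on train only
train_scaled = transform_min_max(train, lo, hi)
test_scaled = transform_min_max(test, lo, hi)

print(train_scaled[0], train_scaled[-1])  # 0.0 1.0
print(test_scaled[1] > 1.0)               # True
```

Test values that fall outside the training range end up outside `[0, 1]`. That is expected when the bounds are learned from the training data alone.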


## Configure workspace


Create a workspace object from the existing workspace. A [Workspace](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs. `Workspace.from_config()` reads the file **config.json** and loads the authentication details into an object named `ws`. `ws` is used throughout the rest of the code in this tutorial.

In [2]:
from azureml.core.workspace import Workspace
ws = Workspace.from_config()

### Define training settings

Define the experiment parameters and model settings for training. View the full list of [settings](https://docs.microsoft.com/azure/machine-learning/service/how-to-configure-auto-train). Submitting the experiment with these default settings will take approximately 5-10 minutes, but if you want a shorter run time, reduce the `iterations` parameter.


|Property| Value in this tutorial |Description|
|----|----|---|
|**iteration_timeout_minutes**|2|Time limit in minutes for each iteration. Reduce this value to decrease total runtime.|
|**iterations**|20|Number of iterations. In each iteration, a new machine learning model is trained with your data. This is the primary value that affects total run time.|
|**primary_metric**| accuracy | Metric that you want to optimize. The best-fit model will be chosen based on this metric.|
|**preprocess**| False | When **True**, the experiment can preprocess the input data (handling missing data, converting text to numeric, etc.). It is **False** here; the earlier cells already label-encode and scale the data.|
|**verbosity**| logging.INFO | Controls the level of logging.|
|**n_cross_validations**|2|Number of cross-validation splits to perform when validation data is not specified.|
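
To illustrate what a number of cross-validation splits means, here is a plain k-fold sketch in pure Python. It shows the general technique only. How automated machine learning builds its splits internally is not shown in this tutorial, and the `k_fold_indices` helper and sample sizes are hypothetical.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample
        size = base + (1 if fold < extra else 0)
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size

folds = list(k_fold_indices(n_samples=10, n_splits=5))
for train_idx, valid_idx in folds:
    print(valid_idx)
```

Each fold trains on the other folds and validates on its own slice, so every row contributes to the validation score once.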

In [46]:
import logging

automl_settings = {
    "iteration_timeout_minutes": 2,
    "iterations": 20,
    "primary_metric": 'accuracy',
    "preprocess": False,
    "verbosity": logging.INFO,
    "n_cross_validations": 2
}

Use your defined training settings as a `**kwargs` parameter to an `AutoMLConfig` object. Additionally, specify your training data and the type of model, which is `classification` in this case.

In [47]:
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='classification',
                             debug_log='automated_ml_errors.log',
                             X=X_train_scaled,
                             y=y_train.values.flatten(),
                             **automl_settings)
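
If the `**` syntax is unfamiliar, the sketch below shows that unpacking a settings dictionary is equivalent to writing each key as a keyword argument. `make_config` is a hypothetical stand-in used only for illustration; it is not the `AutoMLConfig` API.

```python
import logging

def make_config(task, **settings):
    """Hypothetical stand-in for a constructor that accepts keyword settings."""
    return {"task": task, **settings}

settings = {
    "iteration_timeout_minutes": 2,
    "iterations": 20,
    "verbosity": logging.INFO,
}

# Unpacking the dict passes each key as a keyword argument
unpacked = make_config("classification", **settings)
explicit = make_config(
    "classification",
    iteration_timeout_minutes=2,
    iterations=20,
    verbosity=logging.INFO,
)

print(unpacked == explicit)  # True
```

Keeping the settings in one dictionary makes it easy to reuse them across several experiment configurations.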

Automated machine learning pre-processing steps (feature normalization, handling missing data, converting text to numeric, etc.) become part of the underlying model. When using the model for predictions, the same pre-processing steps applied during training are applied to your input data automatically.

### Train the automatic classification model

Create an experiment object in your workspace. An experiment acts as a container for your individual runs. Pass the defined `automl_config` object to the experiment, and set `show_output` to `True` to view progress during the run.

After starting the experiment, the output shown updates live as the experiment runs. For each iteration, you see the model type, the run duration, and the training accuracy. The field `BEST` tracks the best running training score based on your metric type.

In [48]:
from azureml.core.experiment import Experiment
experiment = Experiment(ws, "KDD-experiment")
local_run = experiment.submit(automl_config, show_output=True)

Running on local machine
Parent Run ID: AutoML_bb55fe6a-96f8-4e1a-b554-209ee61b6c9d
Current status: DatasetCrossValidationSplit. Generating CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
****************************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   MinMaxScaler SGD                               0:00:12       0.9998    0.9998
         1   MinMaxScaler SGD                               0:00:13       0.9936    0.9998
         2   StandardScalerWrapper SGD                  
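
The `BEST` column is a running best of the `METRIC` column. A minimal sketch of that bookkeeping, assuming a metric where higher is better (as with accuracy); the `running_best` helper is illustrative:

```python
def running_best(metrics):
    """Return the best score observed so far after each iteration."""
    best = float("-inf")
    history = []
    for metric in metrics:
        best = max(best, metric)
        history.append(best)
    return history

# First two METRIC values from the log above
metrics = [0.9998, 0.9936]
print(running_best(metrics))  # [0.9998, 0.9998]
```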

## Explore the results

Explore the results of automatic training with a [Jupyter widget](https://docs.microsoft.com/python/api/azureml-widgets/azureml.widgets?view=azure-ml-py). The widget allows you to see a graph and table of all individual run iterations, along with training accuracy metrics and metadata. Additionally, you can filter on different accuracy metrics than your primary metric with the dropdown selector.

In [49]:
from azureml.widgets import RunDetails
RunDetails(local_run).show()


### Retrieve the best model

Select the best model from your iterations. The `get_output` function returns the best run and the fitted model for the last fit invocation. By using the overloads on `get_output`, you can retrieve the best run and fitted model for any logged metric or a particular iteration.

In [50]:
best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: KDD-experiment,
Id: AutoML_bb55fe6a-96f8-4e1a-b554-209ee61b6c9d_18,
Type: None,
Status: Completed)
Pipeline(memory=None,
     steps=[('prefittedsoftvotingclassifier', PreFittedSoftVotingClassifier(classification_labels=None,
               estimators=[('0', Pipeline(memory=None,
     steps=[('MinMaxScaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('SGDClassifierWrapper', SGDClassifierWrapper(alpha=0.816418367346938...666666666667, 0.13333333333333333, 0.06666666666666667, 0.26666666666666666, 0.06666666666666667]))])
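
The fitted model printed above is a soft-voting ensemble. As a rough illustration of the general soft-voting technique, the sketch below averages class probabilities from several models using per-model weights and picks the class with the highest average. The probabilities and weights are hypothetical, and the internals of `PreFittedSoftVotingClassifier` are not shown in this tutorial.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-model class probabilities, then argmax."""
    n_classes = len(probabilities[0])
    total = sum(weights)
    averaged = [
        sum(w * p[c] for p, w in zip(probabilities, weights)) / total
        for c in range(n_classes)
    ]
    return averaged.index(max(averaged)), averaged

# Hypothetical class probabilities from three models for one sample
probs = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
weights = [0.5, 0.25, 0.25]

label, averaged = soft_vote(probs, weights)
print(label)  # 0
```

Two of the three models favor class 1, yet the confident and heavily weighted first model tips the average toward class 0. A simple majority vote would have chosen class 1.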


### Test the best model accuracy

In [51]:
from sklearn.metrics import classification_report, confusion_matrix

# Predict labels for the test data
predict = fitted_model.predict(X_test_scaled)

# Score the model on the test data
testscore = fitted_model.score(X_test_scaled, y_test)
print("Test Score: ", testscore)
# Score the model on the training data
print("Train Score: ", fitted_model.score(X_train_scaled, y_train))

# Compare the predictions with the true test labels
print("Confusion Matrix :")
print(confusion_matrix(y_test, predict))
print("Classification Report: ")
print(classification_report(y_test, predict))

Test Score:  0.9998287260644675
Train Score:  0.9999119136754019
Confusion Matrix :
[[29184     0]
 [    5     4]]
Classification Report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     29184
           1       1.00      0.44      0.62         9

   micro avg       1.00      1.00      1.00     29193
   macro avg       1.00      0.72      0.81     29193
weighted avg       1.00      1.00      1.00     29193
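
The near-perfect test score hides how the model does on the rare class. The sketch below recomputes the key numbers from the confusion matrix printed above, reading rows as true labels and columns as predicted labels. The variable names are illustrative.

```python
# Confusion matrix from the output above: rows are true labels, columns are predictions
cm = [[29184, 0],
      [5, 4]]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tn + fp + fn + tp)
recall = tp / (tp + fn)        # share of true class 1 rows that were found
precision = tp / (tp + fp)     # share of class 1 predictions that were right
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(recall, 2), round(precision, 2), round(f1, 2))
# 0.9998 0.44 1.0 0.62
```

Only 4 of the 9 class 1 rows in the test set are detected, so accuracy alone is a poor guide on data this imbalanced.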



In [55]:
#!pip install --upgrade pip
#!pip install --upgrade imblearn

Collecting imblearn
  Downloading https://files.pythonhosted.org/packages/81/a7/4179e6ebfd654bd0eac0b9c06125b8b4c96a9d0a8ff9e9507eb2a26d2d7e/imblearn-0.0-py2.py3-none-any.whl
Collecting imbalanced-learn (from imblearn)
  Downloading https://files.pythonhosted.org/packages/e6/62/08c14224a7e242df2cef7b312d2ef821c3931ec9b015ff93bb52ec8a10a3/imbalanced_learn-0.5.0-py3-none-any.whl (173kB)
Collecting scikit-learn>=0.21 (from imbalanced-learn->imblearn)
  Downloading https://files.pythonhosted.org/packages/a0/c5/d2238762d780dde84a20b8c761f563fe882b88c5a5fb03c056547c442a19/scikit_learn-0.21.3-cp36-cp36m-manylinux1_x86_64.whl (6.7MB)
ERROR: azureml-train-automl 1.0.62 has requirement scikit-learn<=0.20.3,>=0.19.0, but you'll have scikit-learn 0.21.3 which is incompatible.
ERROR: azureml-train-automl 1.0.62 has requirement wheel==0.

In [56]:
# Import SMOTE from imbalanced-learn
from imblearn.over_sampling import SMOTE
# 'not majority' resamples every class except the majority class,
# so the minority class (bad ones) is expanded
sm = SMOTE(sampling_strategy='not majority',random_state=1)
# Fit SMOTE on X and Y and return the resampled data
# (fit_resample replaces the older fit_sample alias)
X_res, y_res = sm.fit_resample(X, Y)

Using TensorFlow backend.
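
For intuition about what SMOTE adds to the data, the sketch below generates one synthetic minority sample by interpolating between a minority point and one of its minority-class neighbors. It deliberately omits the nearest-neighbor search, and the points and helper name are hypothetical. It is a sketch of the core idea, not the imbalanced-learn implementation.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Return a point on the line segment between `point` and `neighbor`."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [p + lam * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(1)
minority_point = [1.0, 10.0]      # hypothetical minority sample
minority_neighbor = [3.0, 14.0]   # hypothetical neighbor from the same class

synthetic = smote_like_sample(minority_point, minority_neighbor, rng)
print(synthetic)
```

Note that the notebook resamples the full `X` and `Y` before splitting, so the test set in the second run also contains synthetic rows. Its scores are therefore not directly comparable with the first run.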


In [57]:
#split the data into train_res and test_res.
from sklearn.model_selection import train_test_split
X_train_res,X_test_res,y_train_res,y_test_res = train_test_split(X_res,y_res,random_state=34,test_size=0.3)

In [58]:
#scale the train/test data using RobustScaler

from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_train_scaled_res = scaler.fit_transform(X_train_res)
X_test_scaled_res = scaler.transform(X_test_res)
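
This run swaps the min-max scaler for a robust scaler. A minimal sketch of the idea for one column, assuming the usual definition (subtract the median, divide by the interquartile range) with linearly interpolated quartiles; the `robust_scale` helper and sample values are hypothetical:

```python
import statistics

def robust_scale(column):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, median, q3 = statistics.quantiles(column, n=4, method="inclusive")
    iqr = q3 - q1  # this sketch assumes a non-zero IQR
    return [(x - median) / iqr for x in column]

values = [1.0, 2.0, 3.0, 4.0, 100.0]   # hypothetical column with one outlier
print(robust_scale(values))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier does not affect the median or the quartiles, so the remaining values keep a sensible spread instead of being squeezed toward zero as they would be under min-max scaling.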

In [59]:
import logging

automl_settings = {
    "iteration_timeout_minutes": 2,
    "iterations": 20,
    "primary_metric": 'accuracy',
    "preprocess": False,
    "verbosity": logging.INFO,
    "n_cross_validations": 2
}

In [61]:
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='classification',
                             debug_log='automated_ml_errors.log',
                             X=X_train_scaled_res,
                             y=y_train_res.flatten(),
                             **automl_settings)

In [62]:
from azureml.core.experiment import Experiment
experiment = Experiment(ws, "KDD-experiment_SMOTE")
local_run = experiment.submit(automl_config, show_output=True)

Running on local machine
Parent Run ID: AutoML_f1eed1b5-52bc-4ce9-8e9f-5acd76aad5a1
Current status: DatasetCrossValidationSplit. Generating CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
****************************************************************************************************

 ITERATION   PIPELINE                                       DURATION      METRIC      BEST
         0   MinMaxScaler SGD                               0:00:13       0.9993    0.9993
         1   MinMaxScaler SGD                               0:00:13       0.9983    0.9993
         2   StandardScalerWrapper SGD                  

--- Logging error ---
Traceback (most recent call last):
  File "/anaconda/envs/azureml_py36/lib/python3.6/logging/handlers.py", line 72, in emit
    self.doRollover()
  File "/anaconda/envs/azureml_py36/lib/python3.6/logging/handlers.py", line 169, in doRollover
    os.rename(sfn, dfn)
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/azmnt/code/Users/manpo/samples-1.0.65/tutorials/automated_ml_errors.log.6' -> '/mnt/azmnt/code/Users/manpo/samples-1.0.65/tutorials/automated_ml_errors.log.7'
Call stack:
  File "/anaconda/envs/azureml_py36/lib/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/anaconda/envs/azureml_py36/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/anaconda/envs/azureml_py36/lib/python3.6/threading.py", line 1182, in run
    self.function(*self.args, **self.kwargs)
  File "/anaconda/envs/azureml_py36/lib/python3.6/site-packages/azureml/automl/core/timer_utilities.py", line 40, in _run
  

StandardScalerWrapper SGD                      0:00:14       0.9989    0.9996
        12   StandardScalerWrapper ExtremeRandomTrees       0:00:20       0.9968    0.9996
        13   MinMaxScaler RandomForest                      0:00:36       0.9973    0.9996
        14   MinMaxScaler ExtremeRandomTrees                0:00:15       0.9897    0.9996
        15   MinMaxScaler BernoulliNaiveBayes               0:00:14       0.9959    0.9996
        16   StandardScalerWrapper BernoulliNaiveBayes      0:00:13       0.9962    0.9996
        17   MinMaxScaler RandomForest                      0:00:20       0.9956    0.9996
        18   VotingEnsemble                                 0:00:28       0.9997    0.9997
        19   StackEnsemble                                  0:00:47       0.9994    0.9997


In [63]:
from azureml.widgets import RunDetails
RunDetails(local_run).show()


In [64]:
best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: KDD-experiment_SMOTE,
Id: AutoML_f1eed1b5-52bc-4ce9-8e9f-5acd76aad5a1_18,
Type: None,
Status: Completed)
Pipeline(memory=None,
     steps=[('prefittedsoftvotingclassifier', PreFittedSoftVotingClassifier(classification_labels=None,
               estimators=[('5', Pipeline(memory=None,
     steps=[('MinMaxScaler', MinMaxScaler(copy=True, feature_range=(0, 1))), ('RandomForestClassifier', RandomForestClassifier(bootstrap=True, cla...333333333333, 0.08333333333333333, 0.08333333333333333, 0.08333333333333333, 0.08333333333333333]))])


In [65]:
from sklearn.metrics import classification_report, confusion_matrix

# Predict labels for the resampled test data
predict = fitted_model.predict(X_test_scaled_res)

# Score the model on the resampled test data
testscore = fitted_model.score(X_test_scaled_res, y_test_res)
print("Test Score: ", testscore)
# Score the model on the resampled training data
print("Train Score: ", fitted_model.score(X_train_scaled_res, y_train_res))

# Compare the predictions with the true test labels
print("Confusion Matrix :")
print(confusion_matrix(y_test_res, predict))
print("Classification Report: ")
print(classification_report(y_test_res, predict))


Test Score:  0.9995374098377507
Train Score:  0.9997503469443201
Confusion Matrix :
[[29451    27]
 [    0 28889]]
Classification Report: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     29478
           1       1.00      1.00      1.00     28889

   micro avg       1.00      1.00      1.00     58367
   macro avg       1.00      1.00      1.00     58367
weighted avg       1.00      1.00      1.00     58367

