
# Automated ML

### Import Dependencies
The cell below imports all the dependencies needed to complete the project.

In [1]:
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources
import azureml.core
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.dataset import Dataset
from azureml.core.environment import Environment
from azureml.core.webservice import AciWebservice, Webservice
from azureml.core.model import InferenceConfig, Model
from azureml.train.automl import AutoMLConfig
from azureml.pipeline.steps import AutoMLStep
from azureml.widgets import RunDetails
from pprint import pprint
import json
import requests
import logging
import os
import csv

### Overview

Attrition is a major concern for any organization. The IBM HR Attrition Case Study is a fictional dataset that aims to identify the factors that influence whether an employee leaves the firm or stays.

Dataset link: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

The dataset consists of 35 columns, from which we aim to predict whether an employee will leave their job. This is a binary classification problem, where the outcome 'Attrition' is either 'true' or 'false'.
In this experiment we use AutoML to find the best-performing model for the dataset. We then deploy that model and interact with the deployment.

### Import Workspace

In [2]:
ws = Workspace.from_config()

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id,
      'Resource group: ' + ws.resource_group, sep = '\n')

Workspace name: quick-starts-ws-138608
Azure region: southcentralus
Subscription id: 9a7511b8-150f-4a58-8528-3e7d50216c31
Resource group: aml-quickstarts-138608


### Create Experiment

In [3]:
# choosing a name for experiment
experiment_name = 'capstone-automl'
experiment=Experiment(ws, experiment_name)

run = experiment.start_logging()

### Create Compute Cluster

In [4]:
cluster_name = "notebook138608"

try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target, using it!')
except ComputeTargetException:
    print('Creating a new compute target!')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    
    # create the cluster
    cpu_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
    
cpu_cluster.wait_for_completion(show_output=True)
 
# Using get_status() to get a detailed status for the current cluster.
print(cpu_cluster.get_status().serialize())

Found existing compute target, using it!

Running
{'errors': [], 'creationTime': '2021-02-11T17:50:41.294287+00:00', 'createdBy': {'userObjectId': '8528587c-835e-4446-87fb-ee3ba202662a', 'userTenantId': '660b3398-b80e-49d2-bc5b-ac1dc93b5254', 'userName': None}, 'modifiedTime': '2021-02-11T17:53:14.698308+00:00', 'state': 'Running', 'vmSize': 'STANDARD_DS3_V2'}


## Dataset

The cell below accesses the external data used in this project: the IBM HR Analytics Employee Attrition & Performance dataset from Kaggle, loaded here from a CSV copy hosted on GitHub.

In [5]:
# Load the dataset from the Workspace if it is already registered; otherwise create it from the file
# NOTE: update the key to match the dataset name
key = "Employee Attrition"
description_text = "IBM HR Analytics Employee Attrition & Performance"

if key in ws.datasets.keys():
    dataset = ws.datasets[key]
else:
    # Create an AML Dataset from the CSV file
    data = 'https://raw.githubusercontent.com/manas-v/Capstone-Project-Azure-Machine-Learning-Engineer/main/WA_Fn-UseC_-HR-Employee-Attrition.csv'
    dataset = Dataset.Tabular.from_delimited_files(data)
    # Register the Dataset in the Workspace
    dataset = dataset.register(workspace=ws,
                               name=key,
                               description=description_text)


df = dataset.to_pandas_dataframe()
df.describe()

| | Age | DailyRate | DistanceFromHome | Education | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | ... | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 |
| mean | 36.92381 | 802.485714 | 9.192517 | 2.912925 | 1.0 | 1024.865306 | 2.721769 | 65.891156 | 2.729932 | 2.063946 | ... | 2.712245 | 80.0 | 0.793878 | 11.279592 | 2.79932 | 2.761224 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
| std | 9.135373 | 403.5091 | 8.106864 | 1.024165 | 0.0 | 602.024335 | 1.093082 | 20.329428 | 0.711561 | 1.10694 | ... | 1.081209 | 0.0 | 0.852077 | 7.780782 | 1.289271 | 0.706476 | 6.126525 | 3.623137 | 3.22243 | 3.568136 |
| min | 18.0 | 102.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 30.0 | 1.0 | 1.0 | ... | 1.0 | 80.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 30.0 | 465.0 | 2.0 | 2.0 | 1.0 | 491.25 | 2.0 | 48.0 | 2.0 | 1.0 | ... | 2.0 | 80.0 | 0.0 | 6.0 | 2.0 | 2.0 | 3.0 | 2.0 | 0.0 | 2.0 |
| 50% | 36.0 | 802.0 | 7.0 | 3.0 | 1.0 | 1020.5 | 3.0 | 66.0 | 3.0 | 2.0 | ... | 3.0 | 80.0 | 1.0 | 10.0 | 3.0 | 3.0 | 5.0 | 3.0 | 1.0 | 3.0 |
| 75% | 43.0 | 1157.0 | 14.0 | 4.0 | 1.0 | 1555.75 | 4.0 | 83.75 | 3.0 | 3.0 | ... | 4.0 | 80.0 | 1.0 | 15.0 | 3.0 | 3.0 | 9.0 | 7.0 | 3.0 | 7.0 |
| max | 60.0 | 1499.0 | 29.0 | 5.0 | 1.0 | 2068.0 | 4.0 | 100.0 | 4.0 | 5.0 | ... | 4.0 | 80.0 | 3.0 | 40.0 | 6.0 | 4.0 | 40.0 | 18.0 | 15.0 | 17.0 |

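The summary above reports a standard deviation of 0.0 for `EmployeeCount` and `StandardHours`, so each of those columns holds a single value in every row and cannot help separate the two classes. A minimal sketch of how such constant columns could be spotted follows. The helper `constant_columns` and the three sample rows are hypothetical illustrations and not part of the original notebook; only the constant values 1 and 80 are taken from the summary.

```python
def constant_columns(rows):
    """Return the names of columns that hold one distinct value across all rows."""
    if not rows:
        return []
    return [name for name in rows[0] if len({row[name] for row in rows}) == 1]


# Hypothetical sample rows; the Age values are made up for illustration.
rows = [
    {"Age": 41, "EmployeeCount": 1, "StandardHours": 80},
    {"Age": 49, "EmployeeCount": 1, "StandardHours": 80},
    {"Age": 37, "EmployeeCount": 1, "StandardHours": 80},
]

constant = constant_columns(rows)
print(constant)
```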

## AutoML Configuration

The AutoML settings and configuration are given below.
This is a binary classification problem whose label column 'Attrition' takes the value 'true' or 'false'. The experiment times out after 20 minutes, at most 5 iterations run concurrently, and the primary metric for the run is AUC_weighted.

In [6]:
# AutoML settings
automl_settings = {
    "experiment_timeout_minutes": 20,
    "max_concurrent_iterations": 5,
    "primary_metric" : 'AUC_weighted'
}

# Automl config
automl_config = AutoMLConfig(compute_target=cpu_cluster,
                             task = "classification",
                             training_data=dataset,
                             label_column_name="Attrition",   
                             path = './capstone-project',
                             enable_early_stopping= True,
                             featurization= 'auto',
                             debug_log = "automl_errors.log",
                             **automl_settings
                             )
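
The primary metric chosen above, AUC_weighted, can be read as the one-vs-rest AUC of each class averaged with weights proportional to class size. The sketch below illustrates that reading in plain Python. It is not the AutoML implementation: the helpers `binary_auc` and `weighted_auc`, the labels, and the scores are all hypothetical.

```python
from itertools import product


def binary_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def weighted_auc(labels, scores_for_true):
    """Average the one-vs-rest AUC of both classes, weighted by class size."""
    n_true = sum(1 for y in labels if y)
    n_false = len(labels) - n_true
    auc_true = binary_auc(labels, scores_for_true)
    # For the 'false' class, flip the labels and score each row with 1 - p.
    auc_false = binary_auc([not y for y in labels], [1.0 - s for s in scores_for_true])
    return (n_true * auc_true + n_false * auc_false) / len(labels)


# Hypothetical labels and predicted probabilities of Attrition == true.
labels = [True, False, False, True, False, False]
scores = [0.9, 0.4, 0.35, 0.3, 0.2, 0.1]

auc = binary_auc(labels, scores)
auc_weighted = weighted_auc(labels, scores)
print(auc, auc_weighted)
```

With only two classes, both one-vs-rest AUCs count the same winning pairs, so the weighted and unweighted averages coincide.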

In [7]:
# Submitting the experiment
remote_run = experiment.submit(automl_config, show_output=True)

Running on remote.
No run_configuration provided, running on notebook138608 with default configuration
Running on remote compute: notebook138608
Parent Run ID: AutoML_b1870f09-4190-4612-b6ef-ed0770d273ed

Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Cross validation
STATUS:       DONE
DESCRIPTION:  Each iteration of the trained model was validated through cross-validation.
              
DETAILS:      
+---------------------------------+
|Number of folds                  |
|3                                |
+---------------------------------+

******************************************

## Run Details

The cell below uses the `RunDetails` widget to show the details of the runs in the experiment.

In [8]:
RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

## Best Model

The best run from the AutoML run was a VotingEnsemble with an AUC_weighted of 0.83328615.
The cells below retrieve the best model from the AutoML experiment and display its properties.

In [9]:
best_run, fitted_model = remote_run.get_output()
print(best_run)

best_run_metrics = best_run.get_metrics()
print('Best Run Id: ', best_run.id)

Package:azureml-automl-runtime, training version:1.21.0, current version:1.20.0
Package:azureml-core, training version:1.21.0.post1, current version:1.20.0
Package:azureml-dataprep, training version:2.8.2, current version:2.7.3
Package:azureml-dataprep-native, training version:28.0.0, current version:27.0.0
Package:azureml-dataprep-rslex, training version:1.6.0, current version:1.5.0
Package:azureml-dataset-runtime, training version:1.21.0, current version:1.20.0
Package:azureml-defaults, training version:1.21.0, current version:1.20.0
Package:azureml-interpret, training version:1.21.0, current version:1.20.0
Package:azureml-pipeline-core, training version:1.21.0, current version:1.20.0
Package:azureml-telemetry, training version:1.21.0, current version:1.20.0
Package:azureml-train-automl-client, training version:1.21.0, current version:1.20.0
Package:azureml-train-automl-runtime, training version:1.21.0, current version:1.20.0


Run(Experiment: capstone-automl,
Id: AutoML_b1870f09-4190-4612-b6ef-ed0770d273ed_43,
Type: azureml.scriptrun,
Status: Completed)
Best Run Id:  AutoML_b1870f09-4190-4612-b6ef-ed0770d273ed_43


In [10]:
print(fitted_model)

Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('prefittedsoftvotingclassifier',...
                                                                                                random_state=None,
                                                                                                solver='saga',
                                                                                                tol=0.0001,
                                                          

In [11]:
def print_model(fitted_model, prefix=""):
    for step in fitted_model.steps:
        print(prefix + step[0])
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            pprint({'estimators': list(
                e[0] for e in step[1].estimators), 'weights': step[1].weights})
            print()
            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0] + ' - ')
        else:
            pprint(step[1].get_params())
            print()

print_model(fitted_model)

datatransformer
{'enable_dnn': None,
 'enable_feature_sweeping': None,
 'feature_sweeping_config': None,
 'feature_sweeping_timeout': None,
 'featurization_config': None,
 'force_text_dnn': None,
 'is_cross_validation': None,
 'is_onnx_compatible': None,
 'logger': None,
 'observer': None,
 'task': None,
 'working_dir': None}

prefittedsoftvotingclassifier
{'estimators': ['23',
                '34',
                '20',
                '35',
                '33',
                '39',
                '14',
                '31',
                '36',
                '26',
                '29'],
 'weights': [0.07692307692307693,
             0.07692307692307693,
             0.07692307692307693,
             0.07692307692307693,
             0.07692307692307693,
             0.07692307692307693,
             0.23076923076923078,
             0.07692307692307693,
             0.07692307692307693,
             0.07692307692307693,
             0.07692307692307693]}

23 - maxabsscaler
{'co
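
The weights printed above are multiples of 1/13: ten estimators carry 1/13 each and one carries 3/13. Assuming the ensemble combines its members by soft voting in the usual sense (a weighted average of each member's class probabilities), the effect of the heavier weight can be sketched as follows. The function `soft_vote` and the probability values are hypothetical; only the weights come from the output.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-model class-probability vectors."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]


# Weights as printed above: ten estimators at 1/13 and one at 3/13.
weights = [1 / 13] * 6 + [3 / 13] + [1 / 13] * 4

# Hypothetical [P(false), P(true)] outputs from the eleven estimators.
probabilities = [[0.6, 0.4]] * 6 + [[0.1, 0.9]] + [[0.6, 0.4]] * 4

ensemble = soft_vote(probabilities, weights)
prediction = ensemble.index(max(ensemble))  # 0 -> false, 1 -> true
print(ensemble, prediction)
```

In this made-up case ten members lean towards 'false', yet the single estimator with triple weight is confident enough to tip the averaged probability towards 'true'.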

In [12]:
for metric_name in best_run_metrics:
    metric = best_run_metrics[metric_name]
    print(metric_name,"-" , metric)

weighted_accuracy - 0.9604529149222975
matthews_correlation - 0.3888480908516447
recall_score_weighted - 0.8659863945578231
AUC_weighted - 0.8332861544179403
precision_score_macro - 0.8422248463868334
AUC_micro - 0.9306191864500902
f1_score_micro - 0.8659863945578231
log_loss - 0.3453232756270071
f1_score_weighted - 0.832072588447846
balanced_accuracy - 0.6117077860299189
precision_score_weighted - 0.8581631391955735
f1_score_macro - 0.64264656690687
accuracy - 0.8659863945578231
average_precision_score_micro - 0.9193624589788132
recall_score_macro - 0.6117077860299189
AUC_macro - 0.8332861544179403
average_precision_score_weighted - 0.889197221377286
average_precision_score_macro - 0.7745118342826475
norm_macro_recall - 0.22341557205983772
recall_score_micro - 0.8659863945578231
precision_score_micro - 0.8659863945578231
confusion_matrix - aml://artifactId/ExperimentRun/dcid.AutoML_b1870f09-4190-4612-b6ef-ed0770d273ed_43/confusion_matrix
accuracy_table - aml://artifactId/ExperimentRun
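
The gap between accuracy (about 0.87) and balanced_accuracy (about 0.61) in the metrics above is what one would expect when one class is much rarer than the other. The sketch below shows how the two metrics diverge, using hypothetical confusion-matrix counts that are not taken from this run. It also checks that normalising the reported balanced_accuracy against the 0.5 chance level for two classes reproduces the reported norm_macro_recall.

```python
def recall_per_class(tn, fp, fn, tp):
    """Return (recall of the negative class, recall of the positive class)."""
    return tn / (tn + fp), tp / (tp + fn)


# Hypothetical confusion-matrix counts for an imbalanced binary problem.
tn, fp, fn, tp = 95, 5, 15, 5

accuracy = (tn + tp) / (tn + fp + fn + tp)
recall_false, recall_true = recall_per_class(tn, fp, fn, tp)

# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced_accuracy = (recall_false + recall_true) / 2

# Rescale so that chance level (0.5 for two classes) maps to 0 and perfect recall to 1.
norm_macro_recall = (balanced_accuracy - 0.5) / (1 - 0.5)

# Same normalisation applied to the balanced_accuracy value printed above.
reported_norm = (0.6117077860299189 - 0.5) / (1 - 0.5)

print(accuracy, balanced_accuracy, norm_macro_recall, reported_norm)
```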

In [13]:
automodel = best_run.register_model(model_name='automl_model', 
                                    model_path='outputs/model.pkl',
                                    tags={'Method':'AutoML'},
                                    properties={'AUC_weighted': best_run_metrics['AUC_weighted']})

print(automodel)

Model(workspace=Workspace.create(name='quick-starts-ws-138608', subscription_id='9a7511b8-150f-4a58-8528-3e7d50216c31', resource_group='aml-quickstarts-138608'), name=automl_model, id=automl_model:1, version=1, tags={'Method': 'AutoML'}, properties={'AUC_weighted': '0.8332861544179403'})


## Model Deployment

Remember you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

The cells below download the scoring script and environment file, create an inference config, and deploy the registered model as a web service.

In [14]:
# Download scoring file 
best_run.download_file('outputs/scoring_file_v_1_0_0.py', 'score.py')

# Download environment file
best_run.download_file('outputs/conda_env_v_1_0_0.yml', 'env.yml')

In [15]:
aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, 
                                               memory_gb=1, 
                                               description='Predict Employee Attrition with AutoML')

In [16]:
inference_config = InferenceConfig(entry_script="score.py", environment=best_run.get_environment())

service = Model.deploy(workspace=ws, 
                       name='automl-webservice', 
                       models=[automodel], 
                       inference_config=inference_config, 
                       deployment_config=aciconfig)

In [17]:
service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running............
Succeeded
ACI service creation operation finished, operation "Succeeded"


In [18]:
print("Service State: ",service.state)
print("Scoring URI: ",service.scoring_uri)
print("Swagger URI: ",service.swagger_uri)

Service State:  Healthy
Scoring URI:  http://f3ddd345-4ab2-4ab6-820d-ea12ce8b328c.southcentralus.azurecontainer.io/score
Swagger URI:  http://f3ddd345-4ab2-4ab6-820d-ea12ce8b328c.southcentralus.azurecontainer.io/swagger.json


In [19]:
!python logs.py

2021-02-11T19:29:53,340712900+00:00 - gunicorn/run 
2021-02-11T19:29:53,369703800+00:00 - rsyslog/run 
2021-02-11T19:29:53,382010500+00:00 - iot-server/run 
rsyslogd: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libuuid.so.1: no version information available (required by rsyslogd)
2021-02-11T19:29:53,424950600+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx:

The cell below sends a request to the deployed web service to test it.

In [20]:
#Import test data
test_df = df.sample(4) # sample data from original dataset
label_df = test_df.pop('Attrition')

test_sample = json.dumps({'data': test_df.to_dict(orient='records')})

print(test_sample)

{"data": [{"Age": 43, "BusinessTravel": "Travel_Rarely", "DailyRate": 782, "Department": "Research & Development", "DistanceFromHome": 6, "Education": 4, "EducationField": "Other", "EmployeeCount": 1, "EmployeeNumber": 661, "EnvironmentSatisfaction": 2, "Gender": "Male", "HourlyRate": 50, "JobInvolvement": 2, "JobLevel": 4, "JobRole": "Research Director", "JobSatisfaction": 4, "MaritalStatus": "Divorced", "MonthlyIncome": 16627, "MonthlyRate": 2671, "NumCompaniesWorked": 4, "Over18": true, "OverTime": true, "PercentSalaryHike": 14, "PerformanceRating": 3, "RelationshipSatisfaction": 3, "StandardHours": 80, "StockOptionLevel": 1, "TotalWorkingYears": 21, "TrainingTimesLastYear": 3, "WorkLifeBalance": 2, "YearsAtCompany": 1, "YearsInCurrentRole": 0, "YearsSinceLastPromotion": 0, "YearsWithCurrManager": 0}, {"Age": 56, "BusinessTravel": "Travel_Rarely", "DailyRate": 310, "Department": "Research & Development", "DistanceFromHome": 7, "Education": 2, "EducationField": "Technical Degree", "E

In [21]:
scoring_uri = service.scoring_uri
input_data = test_sample

# Set the content type
headers = {'Content-Type': 'application/json'}

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
print(resp.text)

"{\"result\": [false, false, true, false]}"

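The response text above is a JSON string that itself contains JSON, so a single `json.loads` call returns a `str` rather than a dictionary. A minimal sketch of one way to unwrap it follows. The helper name `decode_scoring_response` is hypothetical, and `raw_text` simply reproduces the response printed above.

```python
import json


def decode_scoring_response(text):
    """Return the list of predictions from a scoring response body."""
    payload = json.loads(text)
    # The service wrapped its JSON in a JSON string, so decode a second time.
    if isinstance(payload, str):
        payload = json.loads(payload)
    return payload["result"]


# The response body exactly as printed above (raw string keeps the backslashes).
raw_text = r'"{\"result\": [false, false, true, false]}"'

predictions = decode_scoring_response(raw_text)
print(predictions)
```

The resulting list can then be compared element by element with the held-out `label_df` values.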

The cells below print the logs of the web service and delete the service.

In [22]:
print(service.get_logs())

2021-02-11T19:29:53,340712900+00:00 - gunicorn/run 
2021-02-11T19:29:53,369703800+00:00 - rsyslog/run 
2021-02-11T19:29:53,382010500+00:00 - iot-server/run 
rsyslogd: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libuuid.so.1: no version information available (required by rsyslogd)
2021-02-11T19:29:53,424950600+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_20a8278aa8b20dd48cc50f56a6d2586c/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml

In [None]:
service.delete()