# Automated ML

*TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.*

In [1]:
import train_xgb # The module for loading external data
import os
import pandas as pd
import json
import ast
import pickle
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core import Workspace, Dataset, Experiment, Model
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.train.automl import AutoMLConfig
from azureml.widgets import RunDetails

In [2]:
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

quick-starts-ws-154647
aml-quickstarts-154647
southcentralus
61c5c3f0-6dc7-4ed9-a7f3-c704b20e3b30


In [3]:
# Create compute cluster
# Choose a name for your CPU cluster
cpu_cluster_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

InProgress......
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## Dataset

### Overview
*TODO: In this markdown cell, give an overview of the dataset you are using. Also mention the task you will be performing.*

This project processes a data set describing the sale of individual residential property in Ames, Iowa from 2006 to 2010. The data set contains 2930 observations and a large number of explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous) involved in assessing home values. The **Ames Housing dataset** was compiled by Dean De Cock for use in data science education.

The project's goal is to train and deploy a machine learning model for **prediction of the sales price based on parameters of a house**.

*TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.*

The Ames Housing dataset is downloaded from Kaggle.com and registered to the workspace.

In [4]:
# Try to load the dataset from the workspace. Otherwise, load it from Kaggle
found = False
ds_key = 'Ames-housing-dataset'
ds_desc = 'Ames Housing training data.'

if ds_key in ws.datasets.keys():
    found = True
    dataset = ws.datasets[ds_key]
    print(f'Found registered {ds_key}, use it.')
    
if not found:
    train, test = train_xgb.load_data_clean(source='kaggle')
    print(f"train.shape = {train.shape}, test.shape = {test.shape}")
    # Register the train dataset
    blob = ws.get_default_datastore()
    dataset = TabularDatasetFactory.register_pandas_dataframe(train, blob, name=ds_key, description=ds_desc)

train.shape = (1460, 80), test.shape = (1459, 79)


Method register_pandas_dataframe: This is an experimental method, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.


Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/0a966fcc-ee6e-4d27-be6f-6e1df4496ac6/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.


In [5]:
ds_key in ws.datasets.keys()

True

## AutoML Configuration

*TODO: Explain why you chose the automl settings and configuration you used below.*

Because the experiment is time-constrained, the AutoML settings:
+ set the experiment timeout to 2 hours,
+ enable early stopping,
+ leave neural networks disabled (this is the default setting).

The number of cross-validation folds is set to 3 because the dataset has more than 1,000 data points. ONNX-compatible models are enabled to ease cross-platform migration.
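
To illustrate what `n_cross_validations = 3` means for the 1,460 training rows, the sketch below splits row indices into three folds and uses each fold once for validation. It is a plain-Python illustration only: the helper `kfold_indices` is hypothetical, and the way AutoML shuffles and splits the data internally is not shown in this notebook.

```python
def kfold_indices(n_rows, n_folds):
    """Yield (train_indices, validation_indices) for each fold, without shuffling."""
    # Spread the remainder over the first folds so fold sizes differ by at most 1
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        # Everything outside the validation slice is used for training
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size


folds = list(kfold_indices(n_rows=1460, n_folds=3))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train={len(train_idx)}, validation={len(val_idx)}")
```

Each row is used for validation exactly once, so every reported cross-validated metric is an average over three held-out folds.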

In [6]:
# Choose a name for experiment
experiment_name = 'Ames-housing-AutoML'

experiment=Experiment(ws, experiment_name)

In [7]:
# Set parameters for AutoMLConfig
automl_settings = {
    'experiment_timeout_minutes' : 120,
    'n_cross_validations' : 3,
    'enable_early_stopping' : True,
    'iteration_timeout_minutes' : 5,
    'max_concurrent_iterations' : 4,
    'max_cores_per_iteration' : -1,
    'enable_onnx_compatible_models' : True
}

automl_config = AutoMLConfig(
    task='regression',
    primary_metric='normalized_root_mean_squared_error',
    compute_target=cpu_cluster,
    training_data=dataset,
    label_column_name='SalePrice',
    **automl_settings)

## Run Details

*OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?*

Predicting a sale price from the parameters of a house is a **regression** problem, which is a supervised machine learning task. The regression algorithms available in automated ML are listed in the [Regression class](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.constants.supportedmodels.regression?view=azure-ml-py).

+ An example of a linear model is [**ElasticNet**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet). It is a linear regression model trained with both L1- and L2-norm regularization of the coefficients. This combination allows for learning a sparse model in which few of the weights are non-zero, as with Lasso, while still maintaining the regularization properties of Ridge.
+ An example of a tree-based algorithm is [**XGBoostRegressor**](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn). It is an accurate and effective off-the-shelf procedure that supports a number of different loss functions and is designed with attention to both systems optimization and machine learning principles.
+ An example of a neural network is [**TensorFlowDNNRegressor**](https://www.tensorflow.org/api_docs/python/tf/estimator/DNNRegressor). DNN models are not enabled by default due to their performance requirements.
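
As a rough illustration of the ElasticNet regularization described in the list above, the sketch below computes the penalty term in the form used by scikit-learn's `ElasticNet` objective, `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The function name and the example weights are hypothetical and are not taken from the trained model.

```python
def elastic_net_penalty(weights, alpha, l1_ratio):
    """Penalty added to the squared-error loss for a given weight vector."""
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_squared = sum(w * w for w in weights)
    # l1_ratio mixes the Lasso (L1) and Ridge (L2) parts of the penalty
    return alpha * l1_ratio * l1_norm + 0.5 * alpha * (1.0 - l1_ratio) * l2_norm_squared


example_weights = [0.5, -2.0, 0.0, 1.5]  # hypothetical coefficients
mixed_penalty = elastic_net_penalty(example_weights, alpha=0.001, l1_ratio=0.5)
lasso_penalty = elastic_net_penalty(example_weights, alpha=0.001, l1_ratio=1.0)
print(mixed_penalty, lasso_penalty)
```

An `l1_ratio` of 1 removes the L2 term and leaves a pure Lasso penalty, while a value of 0 leaves only the Ridge-style term.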

Models based on gradient tree boosting usually perform better because they combine the predictions of several base estimators. For the same reason, the best model found by automated ML is usually a voting ensemble. Neural networks can do even better, but at the cost of higher compute requirements.

*TODO: In the cell below, use the `RunDetails` widget to show the different experiments.*

In [8]:
# TODO: Submit your experiment
aml_run = experiment.submit(automl_config)
RunDetails(aml_run).show()
aml_run.wait_for_completion(show_output=True)

Submitting remote run.


Experiment,Id,Type,Status,Details Page,Docs Page
Ames-housing-AutoML,AutoML_8a1f49c7-7001-4525-9922-362961aa3ee3,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation



Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

****************************************************************************************************

TYPE:         High cardinality feature detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and no high cardinality features were detected.
              Learn more about high cardinality feature h

{'runId': 'AutoML_8a1f49c7-7001-4525-9922-362961aa3ee3',
 'target': 'cpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-08-16T08:25:25.312432Z',
 'endTimeUtc': '2021-08-16T08:48:35.329871Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'normalized_root_mean_squared_error',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '3',
  'target': 'cpu-cluster',
  'DataPrepJsonString': '{\\"training_data\\": {\\"datasetId\\": \\"174428c7-1944-4ccc-b64f-f9b7e7dfedfb\\"}, \\"datasets\\": 0}',
  'EnableSubsampling': None,
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'display_task_type': 'regression',
  'dependencies_versions': '{"azureml-widgets": "1.32.0", "azureml-train": "1.32.0", "azureml-train-restclients-hyperdrive": "1.32.0", "azureml-train-core": "1.32.0", "azureml-train-automl": "1.32.0", "azureml-train-automl-runtime": "1.32.0", "azureml-train-automl-c

## Best Model

*TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.*

In [9]:
# Get your best run and save the model from that run.
best_run, fitted_model = aml_run.get_output()
print(best_run)
print(fitted_model)

Package:azureml-automl-runtime, training version:1.33.0, current version:1.32.0
Package:azureml-core, training version:1.33.0, current version:1.32.0
Package:azureml-dataprep, training version:2.20.1, current version:2.18.0
Package:azureml-dataprep-native, training version:38.0.0, current version:36.0.0
Package:azureml-dataprep-rslex, training version:1.18.0, current version:1.16.1
Package:azureml-dataset-runtime, training version:1.33.0, current version:1.32.0
Package:azureml-defaults, training version:1.33.0, current version:1.32.0
Package:azureml-interpret, training version:1.33.0, current version:1.32.0
Package:azureml-mlflow, training version:1.33.0, current version:1.32.0
Package:azureml-pipeline-core, training version:1.33.0, current version:1.32.0
Package:azureml-responsibleai, training version:1.33.0, current version:1.32.0
Package:azureml-telemetry, training version:1.33.0, current version:1.32.0
Package:azureml-train-automl-client, training version:1.33.0, current version:1.

Run(Experiment: Ames-housing-AutoML,
Id: AutoML_8a1f49c7-7001-4525-9922-362961aa3ee3_36,
Type: azureml.scriptrun,
Status: Completed)
RegressionPipeline(pipeline=Pipeline(memory=None,
                                     steps=[('datatransformer',
                                             DataTransformer(enable_dnn=False, enable_feature_sweeping=False, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=True, is_onnx_compatible=True, observer=None, task='regression', working_dir='/mnt/batch/ta...
), random_state=0, reg_alpha=1.4583333333333335, reg_lambda=2.3958333333333335, subsample=0.6, tree_method='hist'))], verbose=False)), ('7', Pipeline(memory=None, steps=[('maxabsscaler', MaxAbsScaler(copy=True)), ('elasticnet', ElasticNet(alpha=0.001, copy_X=True, fit_intercept=True, l1_ratio=1, max_iter=1000, normalize=False, positive=False, precompute=False, random_state=None, selection='cyclic', tol=0.0001, warm_

In [10]:
best_run_metrics = best_run.get_metrics()
best_run_metrics

{'r2_score': 0.8942710428327086,
 'root_mean_squared_error': 25771.906465315406,
 'spearman_correlation': 0.9562738525955498,
 'mean_absolute_percentage_error': 9.197768325624113,
 'root_mean_squared_log_error': 0.13022613109168246,
 'normalized_root_mean_squared_error': 0.03578934379296681,
 'normalized_median_absolute_error': 0.014509132245016371,
 'normalized_mean_absolute_error': 0.021677741931194747,
 'explained_variance': 0.8944568099173953,
 'median_absolute_error': 10448.026129636288,
 'normalized_root_mean_squared_log_error': 0.04236093258268204,
 'mean_absolute_error': 15610.141964653338,
 'predicted_true': 'aml://artifactId/ExperimentRun/dcid.AutoML_8a1f49c7-7001-4525-9922-362961aa3ee3_36/predicted_true',
 'residuals': 'aml://artifactId/ExperimentRun/dcid.AutoML_8a1f49c7-7001-4525-9922-362961aa3ee3_36/residuals'}
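
The normalized metrics relate to the raw ones through the range of the target. Assuming, as the Azure ML documentation describes, that `normalized_root_mean_squared_error` is the RMSE divided by the range (max minus min) of the label, the two values reported above imply the spread of `SalePrice` seen during training. The sketch below is only a consistency check on the printed numbers; the helper functions are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of paired values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def normalized_rmse(y_true, y_pred):
    """RMSE divided by the range of the true values (assumed normalization)."""
    return rmse(y_true, y_pred) / (max(y_true) - min(y_true))


# Values copied from the metrics dictionary above
reported_rmse = 25771.906465315406
reported_nrmse = 0.03578934379296681

# If NRMSE = RMSE / range, then range = RMSE / NRMSE
implied_range = reported_rmse / reported_nrmse
print(round(implied_range))
```

Under that assumption, an NRMSE of about 0.036 means the typical error is roughly 3.6% of the spread between the cheapest and the most expensive house in the training data.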

In [11]:
details = best_run.get_details()
# A pattern for extracting structure of the VotingEnsemble model
voting_ensemble = {k:ast.literal_eval(details['properties'][k]) for k in ['ensembled_algorithms', 'ensemble_weights']}
pd.DataFrame(voting_ensemble).sort_values(by='ensemble_weights', ascending=False)

Unnamed: 0,ensembled_algorithms,ensemble_weights
1,XGBoostRegressor,0.4
2,XGBoostRegressor,0.4
3,ElasticNet,0.13
0,XGBoostRegressor,0.07
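
The weights in this table can be read as the coefficients of a weighted average over the member models' predictions. The sketch below assumes that interpretation; the per-model predictions are made-up numbers for a single house, not results from this run.

```python
# Algorithms and weights copied from the table above
ensemble = [
    ("XGBoostRegressor", 0.4),
    ("XGBoostRegressor", 0.4),
    ("ElasticNet", 0.13),
    ("XGBoostRegressor", 0.07),
]

# Hypothetical price predictions of the four members for one house
hypothetical_predictions = [181000.0, 179500.0, 186000.0, 175000.0]


def weighted_vote(predictions, weights):
    """Weighted average of member predictions, normalized by the weight sum."""
    total_weight = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total_weight


weights = [weight for _, weight in ensemble]
ensemble_prediction = weighted_vote(hypothetical_predictions, weights)
print(round(ensemble_prediction, 2))
```

Because the three XGBoost members carry 87% of the total weight, the ensemble output stays close to their predictions, with ElasticNet acting as a small linear correction.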


In [13]:
# Save metrics and details for ex-post examination
os.makedirs('./aml-outputs', exist_ok=True)
with open('aml-outputs/best_aml_metrics.json', 'w') as file:
    json.dump(best_run_metrics, file)
with open('aml-outputs/best_aml_details.txt', 'w') as file:
    file.write(str(details))

In [14]:
# Explore the main model in the ensemble
main_dict = {}
main_dict['preprocessor'] = str(type(fitted_model.steps[1][1].get_params()['estimators'][0][1].get_params()['steps'][0][1]))

In [15]:
main_dict['estimator_type'] = str(type(fitted_model.steps[1][1].get_params()['estimators'][0][1].get_params()['steps'][1][1]))

In [16]:
main_dict['estimator_param'] = fitted_model.steps[1][1].get_params()['estimators'][0][1].get_params()['steps'][1][1].get_params()

In [17]:
# Save details of the main model for ex-post examination
with open('aml-outputs/best_aml_main_model.json', 'w') as file:
    json.dump(main_dict, file)
main_dict

{'preprocessor': "<class 'azureml.automl.runtime.shared.model_wrappers.StandardScalerWrapper'>",
 'estimator_type': "<class 'azureml.automl.runtime.shared.model_wrappers.XGBoostRegressor'>",
 'estimator_param': {'base_score': 0.5,
  'booster': 'gbtree',
  'colsample_bylevel': 1,
  'colsample_bynode': 1,
  'colsample_bytree': 1,
  'gamma': 0,
  'importance_type': 'gain',
  'learning_rate': 0.1,
  'max_delta_step': 0,
  'max_depth': 3,
  'min_child_weight': 1,
  'missing': nan,
  'n_estimators': 100,
  'n_jobs': -1,
  'nthread': None,
  'objective': 'reg:squarederror',
  'random_state': 0,
  'reg_alpha': 0,
  'reg_lambda': 1,
  'scale_pos_weight': 1,
  'seed': None,
  'silent': None,
  'subsample': 1,
  'verbosity': 0,
  'tree_method': 'auto',
  'verbose': -10}}
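
One detail worth noting about the dump above: the `'missing': nan` entry is written by `json.dump` as the bare token `NaN`, which Python's `json` module reads back but strict JSON parsers reject. A minimal sketch of replacing NaN with `null` before saving follows; the helper `replace_nan` is hypothetical and the dictionary is a shortened stand-in for `main_dict`.

```python
import json
import math


def replace_nan(value):
    """Recursively replace float NaN values with None so the result is strict JSON."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {key: replace_nan(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [replace_nan(item) for item in value]
    return value


# Shortened stand-in for main_dict
example = {"estimator_param": {"missing": float("nan"), "max_depth": 3}}

cleaned = replace_nan(example)
# allow_nan=False makes json.dumps raise if any NaN is left
text = json.dumps(cleaned, allow_nan=False)
print(text)
```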

In [18]:
# Check the path to the model
for i,n in enumerate(best_run.get_file_names()):
    print(i,n)

0 automl_driver.py
1 azureml-logs/55_azureml-execution-tvmps_18b8019307bf121532254d3f74299ad2cbf5f8eb2347d513a73fe17a9410aff2_d.txt
2 azureml-logs/65_job_prep-tvmps_18b8019307bf121532254d3f74299ad2cbf5f8eb2347d513a73fe17a9410aff2_d.txt
3 azureml-logs/70_driver_log.txt
4 azureml-logs/75_job_post-tvmps_18b8019307bf121532254d3f74299ad2cbf5f8eb2347d513a73fe17a9410aff2_d.txt
5 azureml-logs/process_info.json
6 azureml-logs/process_status.json
7 explanation/386a65d0/expected_values.interpret.json
8 explanation/386a65d0/features.interpret.json
9 explanation/386a65d0/global_names/0.interpret.json
10 explanation/386a65d0/global_rank/0.interpret.json
11 explanation/386a65d0/global_values/0.interpret.json
12 explanation/386a65d0/local_importance_values.interpret.json
13 explanation/386a65d0/rich_metadata.interpret.json
14 explanation/386a65d0/true_ys_viz.interpret.json
15 explanation/386a65d0/visualization_dict.interpret.json
16 explanation/386a65d0/ys_pred_viz.interpret.json
17 explanation/b21056

In [19]:
# Save the best model
os.makedirs('./aml-outputs/', exist_ok=True)
# Note: indices 33-41 are specific to the file listing of this run
file_names = best_run.get_file_names()
for name in file_names[33:42]:
    print(name)
    best_run.download_file(name, output_file_path='./aml-outputs/')

outputs/conda_env_v_1_0_0.yml
outputs/env_dependencies.json
outputs/internal_cross_validated_models.pkl
outputs/model.onnx
outputs/model.pkl
outputs/model_onnx.json
outputs/pipeline_graph.json
outputs/scoring_file_v_1_0_0.py
outputs/scoring_file_v_2_0_0.py
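
The index range used above depends on the exact order and number of files in this particular run. A sketch of selecting the artifacts by their `outputs/` prefix instead is shown below; the file list is a shortened, hand-copied sample of the listing, and with a real run the names would come from `best_run.get_file_names()`.

```python
# Shortened sample of the run's file listing
file_names = [
    "automl_driver.py",
    "azureml-logs/70_driver_log.txt",
    "outputs/conda_env_v_1_0_0.yml",
    "outputs/model.pkl",
    "outputs/scoring_file_v_1_0_0.py",
]

# Keep only the artifacts stored under the outputs/ folder
wanted = [name for name in file_names if name.startswith("outputs/")]
for name in wanted:
    print(name)
```

Filtering by prefix keeps working if a later run adds or removes log or explanation files ahead of the `outputs/` entries.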


Remember you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

*TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.*

In [20]:
# Register the best model
model = Model.register(ws, model_path='aml-outputs/model.pkl', model_name='Ames-Housing-AutoML-Model', tags=best_run_metrics)
print(model.name, model.id, model.version, sep='\t')

Registering model Ames-Housing-AutoML-Model
Ames-Housing-AutoML-Model	Ames-Housing-AutoML-Model:1	1


In [21]:
from azureml.core.webservice import AciWebservice
aciconfig = AciWebservice.deploy_configuration(cpu_cores=1,
                                              memory_gb=1,
                                              tags={"data" : "Kaggle", "method" : "AutoML"},
                                              description="Predict Ames Housing Prices",
                                              auth_enabled=True,
                                              enable_app_insights=True)

In [22]:
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies

env = Environment("project-env")
cd = CondaDependencies('aml-outputs/conda_env_v_1_0_0.yml')
env.python.conda_dependencies = cd
# Register environment to re-use later
env.register(workspace=ws)

{
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20210615.v1",
        "baseImageRegistry": {
            "address": null,
            "password": null,
            "registryIdentity": null,
            "username": null
        },
        "enabled": false,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": null
    },
    "environmentVariables": {
        "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
    },
    "inferencingStackVersion": null,
    "name": "project-env",
    "python": {
        "baseCondaEnvironment": null,
        "condaDependencies": {
            "channels": [
                "anaconda",
                "conda-forge"
  

In [23]:
%%time
import uuid
from azureml.core.webservice import Webservice
from azureml.core.model import InferenceConfig
from azureml.core.environment import Environment
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()
model = Model(ws, 'Ames-Housing-AutoML-Model')

myenv = Environment.get(workspace=ws, name="project-env")
inference_config = InferenceConfig(entry_script="aml-outputs/scoring_file_v_1_0_0.py", environment=myenv)

service_name = 'ames-housing-aml-' + str(uuid.uuid4())[:4]
service = Model.deploy(workspace=ws,
                      name=service_name,
                      models=[model],
                      inference_config=inference_config,
                      deployment_config=aciconfig)

service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-08-16 08:55:21+00:00 Creating Container Registry if not exists..
2021-08-16 09:05:21+00:00 Registering the environment.
2021-08-16 09:05:23+00:00 Building image..
2021-08-16 09:22:29+00:00 Generating deployment configuration.
2021-08-16 09:22:31+00:00 Submitting deployment to compute..
2021-08-16 09:22:39+00:00 Checking the status of deployment ames-housing-aml-a698..
2021-08-16 09:26:35+00:00 Checking the status of inference endpoint ames-housing-aml-a698.
Failed


ERROR:azureml.core.webservice.webservice:Service deployment polling reached non-successful terminal state, current service state: Unhealthy
Operation ID: 238ed8fc-ce62-4929-a516-3ac8ad5de684
More information can be found using '.get_logs()'
Error:
{
  "code": "AciDeploymentFailed",
  "statusCode": 400,
  "message": "Aci Deployment failed with exception: Error in entry script, ModuleNotFoundError: No module named 'azureml.api', please run print(service.get_logs()) to get details.",
  "details": [
    {
      "code": "CrashLoopBackOff",
      "message": "Error in entry script, ModuleNotFoundError: No module named 'azureml.api', please run print(service.get_logs()) to get details."
    }
  ]
}



WebserviceException: WebserviceException:
	Message: Service deployment polling reached non-successful terminal state, current service state: Unhealthy

In [24]:
print(service.get_logs())

None
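
Because `get_logs()` returned `None` here, the error payload printed by the failed deployment is the main diagnostic available in this notebook. The sketch below parses a copy of that payload with the standard `json` module to pull out the error codes and the name of the missing module; the regular expression is an assumption about the message format seen above.

```python
import json
import re

# Error payload copied from the failed deployment output
error_text = '''{
  "code": "AciDeploymentFailed",
  "statusCode": 400,
  "message": "Aci Deployment failed with exception: Error in entry script, ModuleNotFoundError: No module named 'azureml.api', please run print(service.get_logs()) to get details.",
  "details": [
    {
      "code": "CrashLoopBackOff",
      "message": "Error in entry script, ModuleNotFoundError: No module named 'azureml.api', please run print(service.get_logs()) to get details."
    }
  ]
}'''

error = json.loads(error_text)
detail_codes = [detail["code"] for detail in error["details"]]

# Extract the module name from the ModuleNotFoundError message, if present
match = re.search(r"No module named '([^']+)'", error["message"])
missing_module = match.group(1) if match else None

print(error["code"], detail_codes, missing_module)
```

A missing module in the entry script suggests that the problem lies in the packages available inside the scoring environment rather than in the registered model itself.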


In [None]:
from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()
auth_header = interactive_auth.get_authentication_header()

In [None]:
# Prepare data for request
_ , test = train_xgb.load_data_clean()
test = train_xgb.label_encode(test)
data = {'data': test.head().to_dict(orient='list')}

# Replace the next cell with the code from 'Consume' tab of the endpoint
# and delete 'data = {}' assignment as data is defined in this cell!  
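
The request body built here has the shape `{'data': {column: [values, ...]}}`, which is what `DataFrame.to_dict(orient='list')` produces. A minimal sketch of the same structure without pandas is shown below; the two columns and their values are illustrative and are not taken from the test set.

```python
import json

# Illustrative rows; a real request would carry all feature columns
rows = [
    {"LotArea": 8450, "YearBuilt": 2003},
    {"LotArea": 9600, "YearBuilt": 1976},
]

# Turn a list of row dictionaries into one list of values per column
columns = {key: [row[key] for row in rows] for key in rows[0]}
data = {"data": columns}

# Encode the payload as bytes, ready to be sent as the HTTP request body
body = json.dumps(data).encode("utf-8")
print(data)
```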

*TODO: In the cell below, send a request to the web service you deployed to test it.*

In [None]:
import urllib.request
import json
import os
import ssl

def allowSelfSignedHttps(allowed):
    # bypass the server certificate verification on client side
    if allowed and not os.environ.get('PYTHONHTTPSVERIFY', '') and getattr(ssl, '_create_unverified_context', None):
        ssl._create_default_https_context = ssl._create_unverified_context

allowSelfSignedHttps(True) # this line is needed if you use self-signed certificate in your scoring service.

# Request data goes here
#

body = str.encode(json.dumps(data))

url = 'http://56120259-5690-491e-a8d6-bc13c75b82de.southcentralus.azurecontainer.io/score'
api_key = '' # Replace this with the API key for the web service
headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ api_key)}

req = urllib.request.Request(url, body, headers)

try:
    response = urllib.request.urlopen(req)

    result = response.read()
    print(result)
except urllib.error.HTTPError as error:
    print("The request failed with status code: " + str(error.code))

    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())
    print(json.loads(error.read().decode("utf8", 'ignore')))

*TODO: In the cell below, print the logs of the web service and delete the service*

In [None]:
# Print the service logs, then delete the web service
print(service.get_logs())
service.delete()
# delete() deprovisions and removes the AmlCompute target
cpu_cluster.delete()