# Automated ML

In [1]:
from azureml.core import Workspace, Experiment
from azureml.core.dataset import Dataset
from azureml.train.automl import AutoMLConfig
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.exceptions import ComputeTargetException
from azureml.widgets import RunDetails
import os
import shutil
import logging


## Dataset

### Overview

The "Airbnb for Boston with fraud detection" dataset was downloaded from Kaggle using the following link:

https://www.kaggle.com/datasets/hawkingcr/airbnb-for-boston-with-fraud-detection/download?datasetVersionNumber=1

The downloaded file is saved as "output.csv" in the "data" directory. The task is to classify whether an Airbnb listing is fraudulent or not.

A notebook named "data_process.ipynb" was created to pre-process the data. First, a correlation analysis against the target column "fraud" was used to identify and remove non-significant features. Next, the data was split into "train.csv" and "test.csv", and the class balance of the training set was examined. Because the training target was imbalanced, the minority class was upsampled to balance it.
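
The upsampling code itself lives in the pre-processing notebook and is not reproduced here. As a rough illustration only, the sketch below shows one common way to upsample a minority class with `sklearn.utils.resample`. The helper name `upsample_minority` and the toy data are hypothetical, and the actual notebook may use a different technique.

```python
import pandas as pd
from sklearn.utils import resample


def upsample_minority(df: pd.DataFrame, label: str, seed: int = 0) -> pd.DataFrame:
    """Resample every smaller class (with replacement) up to the majority class size."""
    counts = df[label].value_counts()
    majority_size = int(counts.max())
    parts = []
    for cls, n in counts.items():
        cls_rows = df[df[label] == cls]
        if n < majority_size:
            # Draw rows with replacement until this class matches the majority
            cls_rows = resample(
                cls_rows, replace=True, n_samples=majority_size, random_state=seed
            )
        parts.append(cls_rows)
    # Shuffle so the classes are interleaved in the result
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Hypothetical toy data: four legitimate listings, two fraudulent ones
toy = pd.DataFrame({"price": [10, 20, 30, 40, 50, 60], "fraud": [0, 0, 0, 0, 1, 1]})
balanced = upsample_minority(toy, label="fraud")
print(balanced["fraud"].value_counts().to_dict())
```

Upsampling should be applied to the training split only, so that the test split keeps the original class distribution.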

In [2]:
ws = Workspace.from_config()

experiment_name = 'udacity-aml-capstone'
experiment=Experiment(ws, experiment_name)

datastore = ws.get_default_datastore()
train_data_file = "train.csv"
src_dir = "./data"
target_path = "airbnb_boston"

train_data_dir = "./tmp_dir"
os.makedirs(train_data_dir, exist_ok=True)

src_file_path = os.path.join(src_dir,train_data_file)
#print(src_file_path)
dest = shutil.copy(src_file_path,train_data_dir)
#print("After copying:")
#print(os.listdir(train_data_dir))
print("train data path:",dest)

#datastore.upload(
    #src_dir=train_data_dir, target_path=target_path, overwrite=True, show_progress=True
#)

#datastore.upload_files(
    #["./data/train.csv"],target_path="airbnb_boston", overwrite=True, show_progress=True
#)

#Dataset.File.upload_directory(src_dir,(datastore,target_path),pattern=train_data_file,
                              #overwrite=True, show_progress=True)
Dataset.File.upload_directory(train_data_dir,(datastore,target_path),
                              overwrite=True, show_progress=True)

# Create a tabular dataset from the uploaded file so it can be accessed during training on remote compute
datastore_path = os.path.join(target_path,train_data_file)
print("datastore train data path: ",datastore_path)
train_ds = Dataset.Tabular.from_delimited_files(
    path=datastore.path(datastore_path)
)

train_ds.to_pandas_dataframe().head()

train data path: ./tmp_dir/train.csv
Validating arguments.
Arguments validated.
Uploading file to airbnb_boston
Uploading an estimated of 1 files
Uploading ./tmp_dir/train.csv
Uploaded ./tmp_dir/train.csv, 1 files out of an estimated total of 1
Uploaded 1 files
Creating new dataset
datastore train data path:  airbnb_boston/train.csv


Unnamed: 0,host_response_rate,host_identity_verified,host_total_listings_count,is_location_exact,property_type,accommodates,price,minimum_nights,number_of_reviews,review_scores_rating,instant_bookable,cancellation_policy,reviews_per_month,fraud
0,95,1,3,1,8,2,6500,2,8,93.0,0,1,0.63,1
1,100,1,1,1,0,8,50000,1,88,98.0,0,1,4.2,1
2,100,1,1,1,8,2,9000,1,192,95.0,0,1,5.58,1
3,90,1,1,1,0,2,11500,1,54,88.0,1,2,3.58,1
4,92,0,8,1,2,6,27500,2,29,91.0,1,2,0.72,1


### Create or Attach an AmlCompute cluster

In [3]:

cluster_name = "my-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing cluster, use it.")
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_D2_V2", max_nodes=4
    )
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
compute_target.wait_for_completion(show_output=True)


Found existing cluster, use it.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


## AutoML Configuration

TODO: Explain why you chose the AutoML settings and configuration you used below.

iteration_timeout_minutes: Time limit in minutes for each iteration. Increase this value for larger datasets that need more time for each iteration.

experiment_timeout_minutes: Maximum amount of time in minutes that all iterations combined can take before the experiment terminates.

max_concurrent_iterations: Maximum number of iterations executed in parallel. It is set to 4 to match the cluster's max_nodes.

enable_early_stopping: Flag to enable early termination if the score is not improving in the short term.

primary_metric: Metric that you want to optimize. The best-fit model will be chosen based on this metric.

featurization: By using auto, the experiment can preprocess the input data (handling missing data, converting text to numeric, etc.)

verbosity: Controls the level of logging.

n_cross_validations: Number of cross-validation folds to use when validation data is not specified.
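
To make the choice of `primary_metric` more concrete: as far as I understand the Azure documentation, `AUC_weighted` is the mean of the per-class AUC scores, weighted by the number of true instances in each class. The sketch below illustrates that idea with scikit-learn on hypothetical toy scores. The helper `weighted_auc` is my own name and is not AutoML's implementation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def weighted_auc(y_true, proba, classes):
    """Support-weighted mean of one-vs-rest AUC scores.

    proba[:, i] is the predicted probability of classes[i].
    """
    y_true = np.asarray(y_true)
    aucs, weights = [], []
    for idx, cls in enumerate(classes):
        is_cls = (y_true == cls).astype(int)  # one-vs-rest labels for this class
        aucs.append(roc_auc_score(is_cls, proba[:, idx]))
        weights.append(is_cls.sum())  # number of true instances of this class
    return float(np.average(aucs, weights=weights))


# Hypothetical toy example: probabilities for class 1, class 0 is the complement
p_fraud = np.array([0.1, 0.4, 0.35, 0.8])
toy_proba = np.column_stack([1 - p_fraud, p_fraud])
toy_auc = weighted_auc([0, 0, 1, 1], toy_proba, classes=[0, 1])
print(round(toy_auc, 4))
```

For a binary problem whose class probabilities sum to one, both per-class AUC values coincide, so the weighted mean equals the ordinary binary AUC.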

In [4]:

automl_settings = {
    "iteration_timeout_minutes": 10,
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 4,
    "enable_early_stopping": True,
    "primary_metric": 'AUC_weighted',
    "featurization": 'auto',
    "verbosity": logging.INFO,
    "n_cross_validations": 5
}

automl_config = AutoMLConfig(
    task="classification",
    compute_target=compute_target,
    training_data=train_ds,
    label_column_name="fraud",
    blocked_models=["KNN", "LinearSVM"],
    enable_onnx_compatible_models=True,
    **automl_settings)

In [5]:
# Submit experiment
auto_run = experiment.submit(automl_config)

Submitting remote run.


Experiment,Id,Type,Status,Details Page,Docs Page
udacity-aml-capstone,AutoML_96a49aca-3d02-478d-8078-fcd7d9e78594,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation


## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

TODO: In the cell below, use the `RunDetails` widget to show the different experiments.

In [6]:
RunDetails(auto_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [7]:
# Retrieve and save best automl model.

auto_run.wait_for_completion(show_output=True)
assert auto_run.get_status() == "Completed"

# get_output() returns both the best child run and its fitted model;
# get_best_child() (next cell) returns only the best child run.
best_auto_run, best_model = auto_run.get_output()
# best_auto_child = auto_run.get_best_child()

print(best_auto_run.get_details())
#print(best_auto_child.get_details())


Experiment,Id,Type,Status,Details Page,Docs Page
udacity-aml-capstone,AutoML_96a49aca-3d02-478d-8078-fcd7d9e78594,automl,Completed,Link to Azure Machine Learning studio,Link to Documentation




********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

********************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

********************************************************************************************

TYPE:         High cardinality feature detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and no high cardinality features were detected.
              Learn more about high cardinality feat

Package:azureml-automl-runtime, training version:1.52.0.post1, current version:1.51.0.post1
Package:azureml-core, training version:1.52.0, current version:1.51.0
Package:azureml-dataprep, training version:4.11.4, current version:4.10.8
Package:azureml-dataprep-rslex, training version:2.18.4, current version:2.17.12
Package:azureml-dataset-runtime, training version:1.52.0, current version:1.51.0
Package:azureml-defaults, training version:1.52.0, current version:1.51.0
Package:azureml-interpret, training version:1.52.0, current version:1.51.0
Package:azureml-mlflow, training version:1.52.0, current version:1.51.0
Package:azureml-pipeline-core, training version:1.52.0, current version:1.51.0
Package:azureml-responsibleai, training version:1.52.0, current version:1.51.0
Package:azureml-telemetry, training version:1.52.0, current version:1.51.0
Package:azureml-train-automl-client, training version:1.52.0, current version:1.51.0.post1
Package:azureml-train-automl-runtime, training version:1.

In [8]:
# another way to get best run

best_auto_child = auto_run.get_best_child()
print(best_auto_child.get_details())

{'runId': 'AutoML_96a49aca-3d02-478d-8078-fcd7d9e78594_14', 'target': 'my-cluster', 'status': 'Completed', 'startTimeUtc': '2023-07-30T00:45:43.289818Z', 'endTimeUtc': '2023-07-30T00:46:22.254027Z', 'services': {}, 'properties': {'runTemplate': 'automl_child', 'pipeline_id': 'a72eb56f3d4aadb7b7f0149ba6e5f05657a95ca1', 'pipeline_spec': '{"objects": [{"class_name": "StandardScaler", "module": "sklearn.preprocessing", "param_args": [], "param_kwargs": {"with_mean": false, "with_std": false}, "prepared_kwargs": {}, "spec_class": "preproc"}, {"class_name": "RandomForestClassifier", "module": "sklearn.ensemble", "param_args": [], "param_kwargs": {}, "prepared_kwargs": {}, "spec_class": "sklearn"}], "pipeline_id": "a72eb56f3d4aadb7b7f0149ba6e5f05657a95ca1", "module": "sklearn.pipeline", "class_name": "Pipeline", "pipeline_name": "{ StandardScaler, RandomForestClassifier }"}', 'training_percent': '100', 'predicted_cost': '0.5', 'iteration': '14', '_aml_system_scenario_identification': 'Remote.

In [11]:
# download some outputs

import json
import pandas as pd

output_dir = "./outputs"
os.makedirs(output_dir, exist_ok=True)
    
script_file_name = output_dir + "/score.py"
best_auto_run.download_file("outputs/scoring_file_v_1_0_0.py", script_file_name)

# Download the featurization summary JSON file locally
featurization_file_name = output_dir + "/featurization_summary.json"
best_auto_run.download_file(
    "outputs/featurization_summary.json", featurization_file_name
)

# Render the JSON as a pandas DataFrame
with open(featurization_file_name, "r") as f:
    records = json.load(f)

print(records)
records_pd = pd.DataFrame.from_records(records)
records_pd.head()


[{'RawFeatureName': 'host_response_rate', 'TypeDetected': 'Categorical', 'Dropped': 'No', 'EngineeredFeatureCount': 46, 'Transformations': ['StringCast-CharGramCountVectorizer'], 'TransformationParams': {'Transformer1': {'Input': ['host_response_rate'], 'TransformationFunction': 'StringCast', 'Operator': None, 'FeatureType': 'Categorical', 'ShouldOutput': False, 'TransformationParams': {}}, 'Transformer2': {'Input': ['Transformer1'], 'TransformationFunction': 'CountVectorizer', 'Operator': 'CharGram', 'FeatureType': None, 'ShouldOutput': True, 'TransformationParams': {'analyzer': 'word', 'binary': True, 'decode_error': 'strict', 'encoding': 'utf-8', 'input': 'content', 'lowercase': False, 'max_df': 1.0, 'max_features': None, 'min_df': 1, 'ngram_range': [1, 1], 'stop_words': None, 'strip_accents': None, 'token_pattern': '(?u)\\b\\w\\w+\\b', 'vocabulary': None}}}}, {'RawFeatureName': 'host_identity_verified', 'TypeDetected': 'Categorical', 'Dropped': 'No', 'EngineeredFeatureCount': 1, 'T

Unnamed: 0,RawFeatureName,TypeDetected,Dropped,EngineeredFeatureCount,Transformations,TransformationParams
0,host_response_rate,Categorical,No,46,[StringCast-CharGramCountVectorizer],{'Transformer1': {'Input': ['host_response_rat...
1,host_identity_verified,Categorical,No,1,[ModeCatImputer-StringCast-LabelEncoder],{'Transformer1': {'Input': ['host_identity_ver...
2,host_total_listings_count,Categorical,No,34,[StringCast-CharGramCountVectorizer],{'Transformer1': {'Input': ['host_total_listin...
3,is_location_exact,Categorical,No,1,[ModeCatImputer-StringCast-LabelEncoder],{'Transformer1': {'Input': ['is_location_exact...
4,property_type,Categorical,No,13,[StringCast-CharGramCountVectorizer],"{'Transformer1': {'Input': ['property_type'], ..."


In [12]:
# Save the best model
import joblib

best_model_file = output_dir + "/best_model.pkl"
#print(best_model)
joblib.dump(best_model,best_model_file)


['./outputs/best_model.pkl']
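
The save step relies on a `joblib` round trip preserving the fitted pipeline. The self-contained sketch below checks that property on a small stand-in pipeline. The toy data, the pipeline steps, and the file name `toy_model.pkl` are assumptions for illustration and are unrelated to the AutoML model itself.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the real training set
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
    ]
).fit(X_toy, y_toy)

# Dump to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_model.pkl")
    joblib.dump(toy_pipeline, toy_path)
    restored = joblib.load(toy_path)

# The restored pipeline should reproduce the original predictions exactly
same_predictions = bool((toy_pipeline.predict(X_toy) == restored.predict(X_toy)).all())
print(same_predictions)
```

Loading a pickled model generally requires library versions compatible with the ones used for training, which is worth keeping in mind given the package version differences logged above.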

## Model Deployment

Remember that you only have to deploy one of the two models you trained, but you still need to register both. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [13]:
from azureml.core.model import Model

autorun_model = Model.register(model_path=best_model_file,
                            model_name="autorun_model",
                            workspace=ws)

autorun_model_path = Model.get_model_path(model_name='autorun_model',_workspace=ws)
print("registered best model path: ",autorun_model_path)

# read back model to test
saved_model = joblib.load(autorun_model_path)
# registered_model = joblib.load(best_model_file)

print("saved model:",saved_model)


Registering model autorun_model
registered best model path:  azureml-models/autorun_model/1/best_model.pkl
saved model: Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=False, enable_feature_sweeping=False, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=True, is_onnx_compatible=True, observer=None, task='classification', working_dir='/mnt/batch/tasks/shared/LS_root/mount...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_

In [14]:
# Another way to register the model: directly from the AutoML run

model_name = best_auto_run.properties["model_name"]
print(model_name)
description = "AutoML model trained on Airbnb Boston data to predict fraudulent listings"
tags = None
registered_model = auto_run.register_model(
    model_name=model_name, description=description, tags=tags
)

print(
    auto_run.model_id
)  # This will be written to the script file later in the notebook.
print("registered model: ",registered_model)

AutoML96a49aca314
AutoML96a49aca314
registered model:  Model(workspace=Workspace.create(name='quick-starts-ws-239435', subscription_id='f9d5a085-54dc-4215-9ba6-dad5d86e60a0', resource_group='aml-quickstarts-239435'), name=AutoML96a49aca314, id=AutoML96a49aca314:1, version=1, tags={}, properties={})


## Test model

In [16]:
import pandas as pd
from sklearn.metrics import confusion_matrix

test_file = src_dir + "/test.csv"
df_test = pd.read_csv(test_file)
df_test = df_test[pd.notnull(df_test['fraud'])]

y_test = df_test['fraud']
X_test = df_test.drop(['fraud'], axis=1)

ypred = best_model.predict(X_test)
cm = confusion_matrix(y_test, ypred)

pd.DataFrame(cm).style.background_gradient(cmap='Blues', low=0, high=0.9)

Unnamed: 0,0,1
0,647,51
1,77,122
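
The confusion matrix can be turned into summary metrics for the fraud class. The sketch below uses the counts reported above and assumes scikit-learn's layout, where rows are true labels and columns are predicted labels. The variable name `cm_reported` is introduced here for illustration.

```python
import numpy as np

# Counts copied from the confusion matrix above (rows: true 0/1, columns: predicted 0/1)
cm_reported = np.array([[647, 51], [77, 122]])
tn, fp, fn, tp = cm_reported.ravel()

accuracy = (tp + tn) / cm_reported.sum()
precision = tp / (tp + fp)  # share of predicted frauds that are real frauds
recall = tp / (tp + fn)     # share of real frauds that were detected
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Because the test set keeps the original class imbalance (698 legitimate versus 199 fraudulent listings), precision and recall for the fraud class are more informative here than accuracy alone.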


In [17]:
ypred = saved_model.predict(X_test)
cm = confusion_matrix(y_test, ypred)

pd.DataFrame(cm).style.background_gradient(cmap='Blues', low=0, high=0.9)

Unnamed: 0,0,1
0,647,51
1,77,122


## Deploy Webservice

In [18]:
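# registered_model is an azureml Model (a handle to the registry entry), not a fitted
# estimator, so it has no predict() method and the call below raises AttributeError.
ypred = registered_model.predict(X_test)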
cm = confusion_matrix(y_test, ypred)

pd.DataFrame(cm).style.background_gradient(cmap='Blues', low=0, high=0.9)

AttributeError: 'Model' object has no attribute 'predict'

In [21]:
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice
from azureml.core.webservice import Webservice
from azureml.core.model import Model
from azureml.core.environment import Environment

inference_config = InferenceConfig(
    environment=best_auto_run.get_environment(), entry_script=script_file_name
)

aciconfig = AciWebservice.deploy_configuration(
    cpu_cores=2,
    memory_gb=2,
    tags={"area": "bmData", "type": "automl_classification"},
    description="sample service for Automl Classification",
)

aci_service_name = model_name.lower()
print(aci_service_name)
aci_service = Model.deploy(ws, aci_service_name, [registered_model], inference_config, aciconfig)
aci_service.wait_for_deployment(True)
print(aci_service.state)

automl96a49aca314
Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2023-07-30 01:28:58+00:00 Creating Container Registry if not exists..
2023-07-30 01:38:59+00:00 Registering the environment.
2023-07-30 01:38:59+00:00 Use the existing image.
2023-07-30 01:39:00+00:00 Submitting deployment to compute..
2023-07-30 01:39:04+00:00 Checking the status of deployment automl96a49aca314..
2023-07-30 01:40:57+00:00 Checking the status of inference endpoint automl96a49aca314.
Succeeded
ACI service creation operation finished, operation "Succeeded"
Healthy


TODO: In the cell below, print the logs of the web service and delete the service

In [22]:
aci_service.get_logs()



TODO: In the cell below, send a request to the web service you deployed to test it.

In [28]:
from numpy import array
import requests
import json

X_test_json = X_test.to_json(orient="records")
#data = '{"data": ' + X_test_json + "}"
data = '{"data": ' + X_test_json + ', "method": "predict"}'
#print("test data:", data)
headers = {"Content-Type": "application/json"}

resp = requests.post(aci_service.scoring_uri, data, headers=headers)

y_pred = json.loads(json.loads(resp.text))["result"]
#print(y_pred)

#print(y_test)
actual = array(y_test)
#actual = actual[:, 0]
print(len(y_pred), " ", len(actual))
#print(actual)

# Use the predictions returned by the web service (y_pred), not the earlier local ypred
cm = confusion_matrix(actual, y_pred)

pd.DataFrame(cm).style.background_gradient(cmap='Blues', low=0, high=0.9)

897   897


Unnamed: 0,0,1
0,647,51
1,77,122
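
The request cell builds its JSON body by string concatenation and decodes the response twice. The sketch below shows an equivalent, slightly safer way to do both with the `json` module and runs without a deployed endpoint. The helper names `build_payload` and `parse_result` and the toy data are hypothetical. The `"data"`, `"method"`, and `"result"` keys and the double-encoded response are taken from the cell above and are assumed to match the generated scoring script.

```python
import json

import pandas as pd


def build_payload(features: pd.DataFrame, method: str = "predict") -> str:
    """Serialize feature rows into the JSON body used in the request cell."""
    records = json.loads(features.to_json(orient="records"))
    return json.dumps({"data": records, "method": method})


def parse_result(response_text: str) -> list:
    """Decode the service response, which may be JSON-encoded twice."""
    decoded = json.loads(response_text)
    if isinstance(decoded, str):  # the body was a JSON string containing JSON
        decoded = json.loads(decoded)
    return decoded["result"]


# Hypothetical two-row feature frame and a simulated service response
toy_features = pd.DataFrame({"price": [6500, 50000], "accommodates": [2, 8]})
payload = build_payload(toy_features)
simulated_response = json.dumps(json.dumps({"result": [1, 0]}))
predictions = parse_result(simulated_response)
print(payload)
print(predictions)
```

With a live service, `payload` would be passed as the body of `requests.post(aci_service.scoring_uri, payload, headers=headers)` and `resp.text` would be handed to `parse_result`.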


**Submission Checklist**
- I have registered the model.
- I have deployed the model with the best accuracy as a webservice.
- I have tested the webservice by sending a request to the model endpoint.
- I have deleted the webservice and shutdown all the computes that I have used.
- I have taken a screenshot showing the model endpoint as active.
- The project includes a file containing the environment details.
