# Amazon SageMaker Autopilot Candidate Definition Notebook

This notebook was automatically generated by the AutoML job **YTrain**.
This notebook allows you to customize the candidate definitions and execute the SageMaker Autopilot workflow.

The dataset has **25** columns and the column named **Your es­tim­ated rev­en­ue (USD)** is used as
the target column. This is being treated as a **Regression** problem. 
This notebook will build a **[Regression](https://en.wikipedia.org/wiki/Regression_analysis)** model that
**minimizes** the "**MSE**" quality metric of the trained models.
The "**MSE**" metric stands for mean squared error: the average of the squared differences between the model's predictions and the true values. Lower values indicate a better fit.
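
As a small illustration of the MSE metric described above, the following sketch computes it by hand. The function name and the sample values are made up for this example and are not taken from the dataset.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up revenue values (USD), for illustration only
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
mse = mean_squared_error(actual, predicted)  # (4 + 4 + 9) / 3
```

Because the errors are squared, a single large miss raises the metric more than several small ones.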

As part of the AutoML job, the input dataset has been randomly split into two pieces, one for **training** and one for
**validation**. This notebook helps you inspect and modify the data transformation approaches proposed by Amazon SageMaker Autopilot. You can interactively
train the data transformation models and use them to transform the data. Finally, you can execute a multiple algorithm hyperparameter optimization (multi-algo HPO)
job that helps you find the best model for your dataset by jointly optimizing the data transformations and machine learning algorithms.

<div class="alert alert-info"> 💡 <strong> Available Knobs</strong>
Look for sections like this for recommended settings that you can change.
</div>


---

## Contents

1. [Sagemaker Setup](#Sagemaker-Setup)
    1. [Downloading Generated Modules](#Downloading-Generated-Modules)
    1. [SageMaker Autopilot Job and Amazon Simple Storage Service (Amazon S3) Configuration](#SageMaker-Autopilot-Job-and-Amazon-Simple-Storage-Service-(Amazon-S3)-Configuration)
1. [Candidate Pipelines](#Candidate-Pipelines)
    1. [Generated Candidates](#Generated-Candidates)
    1. [Selected Candidates](#Selected-Candidates)
1. [Executing the Candidate Pipelines](#Executing-the-Candidate-Pipelines)
    1. [Run Data Transformation Steps](#Run-Data-Transformation-Steps)
    1. [Multi Algorithm Hyperparameter Tuning](#Multi-Algorithm-Hyperparameter-Tuning)
1. [Model Selection and Deployment](#Model-Selection-and-Deployment)
    1. [Tuning Job Result Overview](#Tuning-Job-Result-Overview)
    1. [Model Deployment](#Model-Deployment)

---

## Sagemaker Setup

Before you launch the SageMaker Autopilot jobs, set up the environment for Amazon SageMaker:
- Check environment & dependencies.
- Create a few helper objects/functions to organize input/output data and SageMaker sessions.

**Minimal Environment Requirements**

- Jupyter: Tested on `JupyterLab 1.0.6`, `jupyter_core 4.5.0` and `IPython 6.4.0`
- Kernel: `conda_python3`
- Dependencies required
  - `sagemaker-python-sdk>=2.40.0`
    - Use `!pip install sagemaker==2.40.0` to install this dependency.
    - The kernel may need to be restarted after installation.
- Expected Execution Role/permission
  - S3 access to the bucket that stores the notebook.

### Downloading Generated Modules
Download the generated data transformation modules and a SageMaker Autopilot helper module used by this notebook.
Those artifacts will be downloaded to the **YTrain-artifacts** folder.

In [2]:
!mkdir -p YTrain-artifacts
!aws s3 sync s3://ads508projectbucket-jy/YTrain/sagemaker-automl-candidates/YTrain-pr-1-a9b349827acf4d0991229e52cb4d5e25914b670ed1624729a43/generated_module YTrain-artifacts/generated_module --only-show-errors
!aws s3 sync s3://ads508projectbucket-jy/YTrain/sagemaker-automl-candidates/YTrain-pr-1-a9b349827acf4d0991229e52cb4d5e25914b670ed1624729a43/notebooks/sagemaker_automl YTrain-artifacts/sagemaker_automl --only-show-errors

import sys
sys.path.append("YTrain-artifacts")

### SageMaker Autopilot Job and Amazon Simple Storage Service (Amazon S3) Configuration

The following configuration has been derived from the SageMaker Autopilot job. These items configure where this notebook will
look for generated candidates, and where input and output data are stored on Amazon S3.

In [3]:
from sagemaker_automl import uid, AutoMLLocalRunConfig

# Where the preprocessed data from the existing AutoML job is stored
BASE_AUTOML_JOB_NAME = 'YTrain'
BASE_AUTOML_JOB_CONFIG = {
    'automl_job_name': BASE_AUTOML_JOB_NAME,
    'automl_output_s3_base_path': 's3://ads508projectbucket-jy/YTrain',
    'data_transformer_image_repo_version': '2.5-1-cpu-py3',
    'algo_image_repo_versions': {'xgboost': '1.3-1-cpu-py3', 'linear-learner': 'training-cpu', 'mlp': 'training-cpu'},
    'algo_inference_image_repo_versions': {'xgboost': '1.3-1-cpu-py3', 'linear-learner': 'inference-cpu', 'mlp': 'inference-cpu'}
}

# Path conventions of the output data storage path from the local AutoML job run of this notebook
LOCAL_AUTOML_JOB_NAME = 'YTrain-notebook-run-{}'.format(uid())
LOCAL_AUTOML_JOB_CONFIG = {
    'local_automl_job_name': LOCAL_AUTOML_JOB_NAME,
    'local_automl_job_output_s3_base_path': 's3://ads508projectbucket-jy/YTrain/{}'.format(LOCAL_AUTOML_JOB_NAME),
    'data_processing_model_dir': 'data-processor-models',
    'data_processing_transformed_output_dir': 'transformed-data',
    'multi_algo_tuning_output_dir': 'multi-algo-tuning'
}

AUTOML_LOCAL_RUN_CONFIG = AutoMLLocalRunConfig(
    role='arn:aws:iam::012360440082:role/LabRole',
    base_automl_job_config=BASE_AUTOML_JOB_CONFIG,
    local_automl_job_config=LOCAL_AUTOML_JOB_CONFIG,
    security_config={'EnableInterContainerTrafficEncryption': False, 'VpcConfig': {}})

AUTOML_LOCAL_RUN_CONFIG.display()

This notebook is initialized to use the following configuration: 
        <table>
        <tr><th colspan=2>Name</th><th>Value</th></tr>
        <tr><th>General</th><th>Role</th><td>arn:aws:iam::012360440082:role/LabRole</td></tr>
        <tr><th rowspan=2>Base AutoML Job</th><th>Job Name</th><td>YTrain</td></tr>
        <tr><th>Base Output S3 Path</th><td>s3://ads508projectbucket-jy/YTrain</td></tr>
        <tr><th rowspan=5>Interactive Job</th><th>Job Name</th><td>YTrain-notebook-run-03-01-11-18</td></tr>
        <tr><th>Base Output S3 Path</th><td>s3://ads508projectbucket-jy/YTrain/YTrain-notebook-run-03-01-11-18</td></tr>
        <tr><th>Data Processing Trained Model Directory</th><td>s3://ads508projectbucket-jy/YTrain/YTrain-notebook-run-03-01-11-18/data-processor-models</td></tr>
        <tr><th>Data Processing Transformed Output</th><td>s3://ads508projectbucket-jy/YTrain/YTrain-notebook-run-03-01-11-18/transformed-data</td></tr>
        <tr><th>Algo Tuning Model Output Directory</th><td>s3://ads508projectbucket-jy/YTrain/YTrain-notebook-run-03-01-11-18/multi-algo-tuning</td></tr>
        </table>
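
The paths in the table above appear to follow a simple convention: the base output path, then the interactive job name, then a sub-directory name. The sketch below reproduces that convention with plain string formatting. `local_run_paths` is a hypothetical helper, not a function of the `sagemaker_automl` module, and the real module may build its paths differently.

```python
def local_run_paths(base_output_path, job_name, subdirs):
    """Join the base S3 path, the job name, and each sub-directory name."""
    job_base = "{}/{}".format(base_output_path.rstrip("/"), job_name)
    paths = {"base": job_base}
    for key, subdir in subdirs.items():
        paths[key] = "{}/{}".format(job_base, subdir)
    return paths


# Values copied from the configuration cell and its output table
paths = local_run_paths(
    "s3://ads508projectbucket-jy/YTrain",
    "YTrain-notebook-run-03-01-11-18",
    {
        "data_processing_model_dir": "data-processor-models",
        "data_processing_transformed_output_dir": "transformed-data",
        "multi_algo_tuning_output_dir": "multi-algo-tuning",
    },
)
```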
        

## Candidate Pipelines

The `AutoMLInteractiveRunner` keeps track of selected candidates and automates many of the steps needed to execute the feature engineering and tuning steps.

In [4]:
from sagemaker_automl import AutoMLInteractiveRunner, AutoMLLocalCandidate

automl_interactive_runner = AutoMLInteractiveRunner(AUTOML_LOCAL_RUN_CONFIG)

### Generated Candidates

The SageMaker Autopilot Job has analyzed the dataset and has generated **8** machine learning
pipeline(s) that use **3** algorithm(s). Each pipeline contains a set of feature transformers and an
algorithm.

<div class="alert alert-info"> 💡 <strong> Available Knobs</strong>

1. The resource configuration: instance type & count
1. The selected candidate pipeline definitions, chosen by running the corresponding cells
1. The linked data transformation script can be reviewed and updated. Please refer to the [README.md](./YTrain-artifacts/generated_module/README.md) for detailed customization instructions.
</div>

**[dpp0-xgboost](YTrain-artifacts/generated_module/candidate_data_processors/dpp0.py)**: This data transformation strategy first transforms 'numeric' features using [RobustImputer (converts missing values to nan)](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/impute/base.py). It merges all the generated features and applies [RobustStandardScaler](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/data.py). The
transformed data will be used to tune an *xgboost* model. Here is the definition:

In [5]:
automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp0",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
            "volume_size_in_gb":  50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "transforms_label": False,
        "transformed_data_format": "text/csv",
        "sparse_encoding": False
    },
    "algorithm": {
        "name": "xgboost",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
    }
})

2022-04-03 01:11:19,013 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,083 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2022-04-03 01:11:19,089 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,104 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2022-04-03 01:11:19,105 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,123 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.


**[dpp1-xgboost](YTrain-artifacts/generated_module/candidate_data_processors/dpp1.py)**: This data transformation strategy first transforms 'numeric' features using [RobustImputer (converts missing values to nan)](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/impute/base.py), 'categorical' features using [ThresholdOneHotEncoder](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/encoders.py). It merges all the generated features and applies [RobustStandardScaler](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/data.py). The
transformed data will be used to tune an *xgboost* model. Here is the definition:

In [6]:
automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp1",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
            "volume_size_in_gb":  50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "transforms_label": False,
        "transformed_data_format": "text/csv",
        "sparse_encoding": False
    },
    "algorithm": {
        "name": "xgboost",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
    }
})

2022-04-03 01:11:19,140 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,156 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2022-04-03 01:11:19,157 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,173 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2022-04-03 01:11:19,175 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,190 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.


**[dpp2-xgboost](YTrain-artifacts/generated_module/candidate_data_processors/dpp2.py)**: This data transformation strategy first transforms 'numeric' features using [RobustImputer (converts missing values to nan)](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/impute/base.py), 'categorical' features using [ThresholdOneHotEncoder](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/encoders.py). It merges all the generated features and applies [RobustStandardScaler](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/data.py). The
transformed data will be used to tune an *xgboost* model. Here is the definition:

In [7]:
automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp2",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
            "volume_size_in_gb":  50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "transforms_label": False,
        "transformed_data_format": "text/csv",
        "sparse_encoding": False
    },
    "algorithm": {
        "name": "xgboost",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
    }
})

2022-04-03 01:11:19,241 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,255 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2022-04-03 01:11:19,257 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,272 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2022-04-03 01:11:19,274 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,289 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.


**[dpp3-linear-learner](YTrain-artifacts/generated_module/candidate_data_processors/dpp3.py)**: This data transformation strategy first transforms 'numeric' features using [combined RobustImputer and RobustMissingIndicator](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/impute/base.py) followed by [QuantileExtremeValuesTransformer](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/base.py), 'categorical' features using [ThresholdOneHotEncoder](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/encoders.py). It merges all the generated features and applies [RobustPCA](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/decomposition/robust_pca.py) followed by [RobustStandardScaler](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/data.py). The
transformed data will be used to tune a *linear-learner* model. Here is the definition:

In [8]:
automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp3",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
            "volume_size_in_gb":  50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "transforms_label": False,
        "transformed_data_format": "application/x-recordio-protobuf",
        "sparse_encoding": False
    },
    "algorithm": {
        "name": "linear-learner",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
    }
})

2022-04-03 01:11:19,342 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,354 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2022-04-03 01:11:19,356 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,425 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2022-04-03 01:11:19,426 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,445 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.


**[dpp4-linear-learner](YTrain-artifacts/generated_module/candidate_data_processors/dpp4.py)**: This data transformation strategy first transforms 'numeric' features using [RobustImputer](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/impute/base.py). It merges all the generated features and applies [RobustStandardScaler](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/data.py). The
transformed data will be used to tune a *linear-learner* model. Here is the definition:

In [9]:
automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp4",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
            "volume_size_in_gb":  50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "transforms_label": False,
        "transformed_data_format": "application/x-recordio-protobuf",
        "sparse_encoding": False
    },
    "algorithm": {
        "name": "linear-learner",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
    }
})

2022-04-03 01:11:19,479 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,492 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2022-04-03 01:11:19,494 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,514 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2022-04-03 01:11:19,516 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,535 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.


**[dpp5-xgboost](YTrain-artifacts/generated_module/candidate_data_processors/dpp5.py)**: This data transformation strategy first transforms 'numeric' features using [RobustImputer (converts missing values to nan)](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/impute/base.py), 'categorical' features using [ThresholdOneHotEncoder](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/encoders.py). It merges all the generated features and applies [RobustStandardScaler](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/data.py). The
transformed data will be used to tune an *xgboost* model. Here is the definition:

In [10]:
automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp5",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
            "volume_size_in_gb":  50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "transforms_label": False,
        "transformed_data_format": "text/csv",
        "sparse_encoding": False
    },
    "algorithm": {
        "name": "xgboost",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
    }
})

2022-04-03 01:11:19,580 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,593 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2022-04-03 01:11:19,595 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,611 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2022-04-03 01:11:19,612 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,627 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.


**[dpp6-mlp](YTrain-artifacts/generated_module/candidate_data_processors/dpp6.py)**: This data transformation strategy transforms 'numeric' features using [RobustImputer](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/impute/base.py) followed by [RobustStandardScaler](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/data.py). The
transformed data will be used to tune an *mlp* model. Here is the definition:

In [11]:
automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp6",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
            "volume_size_in_gb":  50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "transforms_label": False,
        "transformed_data_format": "text/csv",
        "sparse_encoding": False
    },
    "algorithm": {
        "name": "mlp",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "candidate_specific_static_hyperparameters": {
            "num_categorical_features": '0',
        }
    }
})

2022-04-03 01:11:19,680 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,694 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2022-04-03 01:11:19,695 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,714 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2022-04-03 01:11:19,719 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,786 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.


**[dpp7-xgboost](YTrain-artifacts/generated_module/candidate_data_processors/dpp7.py)**: This data transformation strategy first transforms 'numeric' features using [RobustImputer](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/impute/base.py), 'categorical' features using [ThresholdOneHotEncoder](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/encoders.py). It merges all the generated features and applies [RobustPCA](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/decomposition/robust_pca.py) followed by [RobustStandardScaler](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/data.py). The
transformed data will be used to tune an *xgboost* model. Here is the definition:

In [12]:
automl_interactive_runner.select_candidate({
    "data_transformer": {
        "name": "dpp7",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
            "volume_size_in_gb":  50
        },
        "transform_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
        "transforms_label": False,
        "transformed_data_format": "text/csv",
        "sparse_encoding": False
    },
    "algorithm": {
        "name": "xgboost",
        "training_resource_config": {
            "instance_type": "ml.m5.4xlarge",
            "instance_count": 1,
        },
    }
})

2022-04-03 01:11:19,797 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,815 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2022-04-03 01:11:19,817 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,832 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2022-04-03 01:11:19,833 INFO sagemaker.image_uris: Same images used for training and inference. Defaulting to image scope: inference.
2022-04-03 01:11:19,848 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.


### Selected Candidates

You have selected the following candidates (please run the cell below and click on the feature transformer links for details):

In [13]:
automl_interactive_runner.display_candidates()

Candidate Name,Algorithm,Feature Transformer
dpp0-xgboost,xgboost,dpp0.py
dpp1-xgboost,xgboost,dpp1.py
dpp2-xgboost,xgboost,dpp2.py
dpp3-linear-learner,linear-learner,dpp3.py
dpp4-linear-learner,linear-learner,dpp4.py
dpp5-xgboost,xgboost,dpp5.py
dpp6-mlp,mlp,dpp6.py
dpp7-xgboost,xgboost,dpp7.py
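
The eight candidate definitions in this notebook differ only in the transformer name, the algorithm, the transformed data format, and (for `mlp`) one extra static hyperparameter. If you edit them often, a small builder can reduce the repetition. This is a sketch: `make_candidate` is a hypothetical helper that is not part of the `sagemaker_automl` module, and its defaults simply mirror the values used in the cells of this notebook.

```python
def make_candidate(transformer_name, algorithm_name,
                   data_format="text/csv",
                   instance_type="ml.m5.4xlarge",
                   static_hyperparameters=None):
    """Build a candidate definition dict shaped like the ones passed to select_candidate."""
    resource = {"instance_type": instance_type, "instance_count": 1}
    candidate = {
        "data_transformer": {
            "name": transformer_name,
            "training_resource_config": dict(resource, volume_size_in_gb=50),
            "transform_resource_config": dict(resource),
            "transforms_label": False,
            "transformed_data_format": data_format,
            "sparse_encoding": False,
        },
        "algorithm": {
            "name": algorithm_name,
            "training_resource_config": dict(resource),
        },
    }
    if static_hyperparameters:
        # Only some candidates (dpp6-mlp in this notebook) carry this key
        candidate["algorithm"]["candidate_specific_static_hyperparameters"] = dict(
            static_hyperparameters)
    return candidate


dpp3 = make_candidate("dpp3", "linear-learner",
                      data_format="application/x-recordio-protobuf")
dpp6 = make_candidate("dpp6", "mlp",
                      static_hyperparameters={"num_categorical_features": "0"})
# Inside the notebook: automl_interactive_runner.select_candidate(dpp3)
```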


The feature engineering pipeline consists of the generated transformer modules and two SageMaker jobs:

1. Generated trainable data transformer Python modules like [dpp0.py](YTrain-artifacts/generated_module/candidate_data_processors/dpp0.py), which have been downloaded to the local file system
2. A **training** job to train the data transformers
3. A **batch transform** job to apply the trained transformation to the dataset to generate the algorithm-compatible data

The transformers and their training pipeline are built using the open-source **[sagemaker-scikit-learn-container][]** and **[sagemaker-scikit-learn-extension][]**.

[sagemaker-scikit-learn-container]: https://github.com/aws/sagemaker-scikit-learn-container
[sagemaker-scikit-learn-extension]: https://github.com/aws/sagemaker-scikit-learn-extension
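
The split between a training job and a batch transform job mirrors the usual fit/transform pattern: statistics are learned once from the training data and then applied unchanged to other data. The toy class below illustrates that pattern for a single numeric column using only the standard library. It is a simplified stand-in written for this example; the real `RobustImputer` and `RobustStandardScaler` may behave differently.

```python
import statistics


class MedianImputeScale:
    """Toy imputer + scaler for one numeric column (missing values are None)."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.median_ = statistics.median(observed)
        filled = [self.median_ if v is None else v for v in values]
        self.mean_ = statistics.mean(filled)
        self.std_ = statistics.pstdev(filled) or 1.0  # avoid division by zero
        return self

    def transform(self, values):
        filled = [self.median_ if v is None else v for v in values]
        return [(v - self.mean_) / self.std_ for v in filled]


train = [1.0, None, 3.0, 5.0]
transformer = MedianImputeScale().fit(train)             # analogous to the training job
scaled_validation = transformer.transform([3.0, None])   # analogous to the batch transform job
```

Note that `transform` reuses the median, mean, and standard deviation learned in `fit`, so the validation data never influences the fitted statistics.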

## Executing the Candidate Pipelines

Each candidate pipeline consists of two steps: feature transformation and algorithm training.
For efficiency, first execute the feature transformation step, which will generate a featurized dataset on S3
for each pipeline.

After each featurized dataset is prepared, execute a multi-algorithm tuning job that will run tuning jobs
in parallel for each pipeline. This tuning job will execute training jobs to find the best set of
hyper-parameters for each pipeline, as well as finding the overall best performing pipeline.

### Run Data Transformation Steps

Now you are ready to start executing all data transformation steps. The cell below may take some time to finish,
so feel free to go grab a cup of coffee. To expedite the process, you can set `parallel_jobs` to a value up to 10.
Before increasing the number of jobs that run in parallel, check your account limits and request an increase if needed.

In [14]:
automl_interactive_runner.fit_data_transformers(parallel_jobs=7)

2022-04-03 01:11:19,979 INFO root: [Worker_3:dpp3-linear-learner]Executing step: train_data_transformer
2022-04-03 01:11:19,981 INFO sagemaker.image_uris: Defaulting to the only supported framework/algorithm version: latest.
2022-04-03 01:11:19,999 INFO sagemaker.image_uris: Ignoring unnecessary instance type: None.
2022-04-03 01:11:20,221 INFO sagemaker: Creating training-job with name: YTrain-notebook-run-03-01-11-18-dpp3-train-03-01-11-19
2022-04-03 01:11:20,267 ERROR root: Failed to fit data transformer for dpp3-linear-learner
Traceback (most recent call last):
  File "YTrain-artifacts/sagemaker_automl/interactive_runner.py", line 143, in _process_data_transformer_future
    future.result()
  File "/opt/conda/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/opt/conda/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/opt/conda/lib/python3.7/concurrent/futures/thread.py", lin
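
The traceback above passes through `future.result()`, which re-raises in the caller whatever exception the worker thread hit. The sketch below shows that general pattern with `concurrent.futures`: jobs run in a bounded thread pool and failures are collected per job instead of stopping the others. It is an illustration only, not the helper module's actual code; `run_in_parallel` and the job names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_parallel(jobs, parallel_jobs=2):
    """Run named callables concurrently; return (results, failures) dicts."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=parallel_jobs) as pool:
        futures = {pool.submit(job): name for name, job in jobs.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()  # re-raises the worker's exception
            except Exception as exc:
                failures[name] = exc
    return results, failures


def failing_job():
    raise RuntimeError("simulated training failure")


results, failures = run_in_parallel(
    {"dpp0": lambda: "ok", "dpp3": failing_job}, parallel_jobs=2)
```

Inspecting the collected exceptions after the run makes it easier to see which candidate failed and why before retrying.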

### Multi Algorithm Hyperparameter Tuning

Now that the algorithm compatible transformed datasets are ready, you can start the multi-algorithm model tuning job
to find the best predictive model. The following algorithm training job configuration for each
algorithm is auto-generated by the AutoML Job as part of the recommendation.

<div class="alert alert-info"> 💡 <strong> Available Knobs</strong>

1. Hyperparameter ranges
2. Objective metrics
3. Recommended static algorithm hyperparameters.

Please refer to [XGBoost tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-tuning.html) and [Linear learner tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner-tuning.html) for detailed explanations of the parameters.
</div>

The AutoML recommendation job has recommended the following hyperparameters, objectives and accuracy metrics for
the algorithm and problem type:

In [15]:
ALGORITHM_OBJECTIVE_METRICS = {
    'xgboost': 'validation:mse',
    'linear-learner': 'validation:mse',
    'mlp': 'validation:mse',
}

STATIC_HYPERPARAMETERS = {
    'xgboost': {
        'objective': 'reg:squarederror',
        'eval_metric': 'mse',
        '_kfold': 5,
        '_num_cv_round': 3,
    },
    'linear-learner': {
        'predictor_type': 'regressor',
        'ml_application': 'linear_learner',
        'num_epochs': 50,
        'reporting_metrics': 'mse',
        'eval_metric': 'mse',
        'kfold': 5,
        'num_cv_rounds': 3,
        'prediction_storage_mode': 'store_cv_avg_predictions',
    },
    'mlp': {
        'problem_type': 'regression',
        'ml_application': 'mlp',
        'use_batchnorm': 'true',
        'activation': 'relu',
        'warmup_epochs': 10,
        'reporting_metrics': 'mse',
        'eval_metric': 'mse',
        'kfold': 5,
        'num_cv_rounds': 3,
        'prediction_storage_mode': 'store_cv_avg_predictions',
    },
}

The following tunable hyperparameter search ranges are recommended for the Multi-Algo tuning job:

In [16]:
from sagemaker.parameter import CategoricalParameter, ContinuousParameter, IntegerParameter

ALGORITHM_TUNABLE_HYPERPARAMETER_RANGES = {
    'xgboost': {
        'num_round': IntegerParameter(64, 1024, scaling_type='Logarithmic'),
        'max_depth': IntegerParameter(2, 8, scaling_type='Logarithmic'),
        'eta': ContinuousParameter(1e-3, 1.0, scaling_type='Logarithmic'),
        'gamma': ContinuousParameter(1e-6, 64.0, scaling_type='Logarithmic'),
        'min_child_weight': ContinuousParameter(1e-6, 32.0, scaling_type='Logarithmic'),
        'subsample': ContinuousParameter(0.5, 1.0, scaling_type='Linear'),
        'colsample_bytree': ContinuousParameter(0.3, 1.0, scaling_type='Linear'),
        'lambda': ContinuousParameter(1e-6, 2.0, scaling_type='Logarithmic'),
        'alpha': ContinuousParameter(1e-6, 2.0, scaling_type='Logarithmic'),
    },
    'linear-learner': {
        'mini_batch_size': IntegerParameter(128, 512, scaling_type='Linear'),
        'wd': ContinuousParameter(1e-12, 1e-2, scaling_type='Logarithmic'),
        'learning_rate': ContinuousParameter(1e-6, 1e-2, scaling_type='Logarithmic'),
    },
    'mlp': {
        'mini_batch_size': IntegerParameter(128, 512, scaling_type='Linear'),
        'learning_rate': ContinuousParameter(1e-6, 1e-2, scaling_type='Logarithmic'),
        'weight_decay': ContinuousParameter(1e-12, 1e-2, scaling_type='Logarithmic'),
        'dropout_prob': ContinuousParameter(0.25, 0.5, scaling_type='Linear'),
        'embedding_size_factor': ContinuousParameter(0.65, 0.95, scaling_type='Linear'),
        'network_type': CategoricalParameter(['feedforward', 'widedeep']),
        'layers': CategoricalParameter(['256', '50, 25', '100, 50', '200, 100', '256, 128', '300, 150', '200, 100, 50']),
    },
}
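
Several ranges above use `scaling_type='Logarithmic'`, which is typically chosen when a parameter spans several orders of magnitude. The sketch below illustrates the idea by sampling uniformly in log space over the `eta` range. It shows only the effect of the scaling, not how the SageMaker tuner actually searches; `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
eta_samples = [sample_log_uniform(1e-3, 1.0, rng) for _ in range(1000)]

# Fraction of samples below 0.1; with log scaling this covers two of the
# three decades in [1e-3, 1.0], so it should be near 2/3 rather than ~10%.
share_below_0_1 = sum(s < 0.1 for s in eta_samples) / len(eta_samples)
```

With linear scaling, roughly 90% of uniform samples would land above 0.1, leaving small learning rates barely explored.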

#### Prepare Multi-Algorithm Tuner Input

To use the multi-algorithm HPO tuner, prepare some inputs and parameters. Prepare a dictionary whose keys are the names of the trained pipeline candidates and whose values are, respectively:

1. Estimators for the recommended algorithm
2. Hyperparameter search ranges
3. Objective metrics

In [17]:
multi_algo_tuning_parameters = automl_interactive_runner.prepare_multi_algo_parameters(
    objective_metrics=ALGORITHM_OBJECTIVE_METRICS,
    static_hyperparameters=STATIC_HYPERPARAMETERS,
    hyperparameters_search_ranges=ALGORITHM_TUNABLE_HYPERPARAMETER_RANGES)
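
To make the shape of such a dictionary concrete, the sketch below builds one entry per candidate by looking up the per-algorithm settings. It is an assumption about the general structure, not the internals of `prepare_multi_algo_parameters`: `build_tuning_parameters` is hypothetical, estimator creation is omitted, and plain tuples stand in for the SageMaker parameter objects.

```python
def build_tuning_parameters(candidates, objective_metrics,
                            static_hyperparameters, search_ranges):
    """Map each candidate name to the settings of its algorithm."""
    params = {}
    for candidate_name, algorithm in candidates.items():
        params[candidate_name] = {
            "algorithm": algorithm,
            "objective_metric": objective_metrics[algorithm],
            "static_hyperparameters": dict(static_hyperparameters[algorithm]),
            "hyperparameter_ranges": dict(search_ranges[algorithm]),
        }
    return params


# Reduced example with two of the candidates from this notebook
example = build_tuning_parameters(
    candidates={"dpp0-xgboost": "xgboost", "dpp3-linear-learner": "linear-learner"},
    objective_metrics={"xgboost": "validation:mse", "linear-learner": "validation:mse"},
    static_hyperparameters={
        "xgboost": {"objective": "reg:squarederror"},
        "linear-learner": {"predictor_type": "regressor"},
    },
    search_ranges={
        "xgboost": {"max_depth": (2, 8)},          # placeholder for IntegerParameter
        "linear-learner": {"mini_batch_size": (128, 512)},
    },
)
```

Candidates that share an algorithm receive separate copies of the same settings, so editing one entry does not affect another.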

Below you prepare the input data for the multi-algo tuner:

In [18]:
multi_algo_tuning_inputs = automl_interactive_runner.prepare_multi_algo_inputs()