# AWS Autopilot

[Resource limits/quotas](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-quotas.html)

- Autopilot supports tabular data formatted as CSV or Parquet.
- Each column should be a feature with a specific data type, and each row should be an observation.
- Accepted column data types include numerical, categorical, text, and time series that consist of strings of comma-separated numbers.
- Supports building models on datasets up to hundreds of GBs.
- Autopilot uses k-fold cross-validation to build models in both hyperparameter optimization (HPO) and ensembling training modes. In HPO mode, it is applied automatically to small datasets with 50,000 or fewer training instances. In ensembling mode, cross-validation is performed regardless of dataset size.
- ![Autopilot cross validation](https://docs.aws.amazon.com/images/sagemaker/latest/dg/images/autopilot/autopilot-metrics-kfold.PNG)
- In HPO mode, you can see the training and validation metrics from each fold in your `/aws/sagemaker/TrainingJobs` CloudWatch Logs.
- Cross-validation can increase training times by an average of 20%. Training times may also increase significantly for complex datasets.
    - Hold-out validation is the [best option](https://towardsdatascience.com/5-minute-guide-to-cross-validation-be3c5b0ae693) for our data size.
    - [Guideline](https://stats.stackexchange.com/a/307849): Let 𝑚 be the number of samples in your dataset:
        - If 𝑚 ≤ 20: use Leave-one-out cross validation.
        - If 20 < 𝑚 ≤ 100: use k-fold cross validation with a relatively large 𝑘≤𝑚 keeping in mind computational cost.
        - If 100 < 𝑚 ≤ 1,000,000: use regular k-fold cross validation (𝑘=5).
        - If there is not enough computational power and 𝑚 > 10,000:  use hold-out cross validation.
        - If 𝑚 ≥ 1,000,000: use hold-out cross validation, but if computational power is available you can use k-fold cross validation (𝑘=5) if you want to squeeze that extra performance out of your model.
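The guideline above can be written down as a small lookup, which makes the thresholds easier to compare with Autopilot's own rule. This is only a sketch: the function names are hypothetical, the returned labels are informal strings, and two choices are assumptions rather than part of the guideline (the limited-compute rule is checked before plain k-fold, and 𝑚 = 1,000,000 exactly, which both of the last two ranges cover, is treated as hold-out). The second function encodes the Autopilot behaviour as described in these notes, using the `Mode` values from the `create_auto_ml_job` API.

```python
def choose_validation_strategy(m: int, limited_compute: bool = False) -> str:
    """Map a sample count m to the validation scheme suggested by the guideline."""
    if m < 1:
        raise ValueError("m must be a positive sample count")
    if m <= 20:
        return "leave-one-out"
    if m <= 100:
        return "k-fold (large k, k <= m)"
    # Assumption: the compute-constrained rule wins over plain k-fold.
    if limited_compute and m > 10_000:
        return "hold-out"
    # Assumption: exactly 1,000,000 samples falls into the hold-out bucket.
    if m < 1_000_000:
        return "k-fold (k=5)"
    return "hold-out (or k-fold with k=5 if compute allows)"


def autopilot_uses_cross_validation(mode: str, training_instances: int) -> bool:
    """Whether Autopilot is expected to cross-validate, per the notes above."""
    if mode == "ENSEMBLING":
        return True  # performed regardless of dataset size
    if mode == "HYPERPARAMETER_TUNING":
        return training_instances <= 50_000
    raise ValueError(f"unhandled training mode: {mode!r}")


if __name__ == "__main__":
    for m in (15, 80, 5_000, 2_000_000):
        print(m, "->", choose_validation_strategy(m))
```

Comparing the two shows where Autopilot in HPO mode stops cross-validating (above 50,000 instances) well before the guideline's hold-out threshold.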

### Notebooks / reports generated by Autopilot

- The notebooks give full visibility into how the data was wrangled and how the models were selected, trained, and tuned for each of the candidates tested.
- The notebooks also provide educational tools to help you learn about and conduct your own ML experiments.
- You can learn about the impact of various inputs and trade-offs made in experiments by examining the various data exploration and candidate definition notebooks exposed by Autopilot. 
- You can also conduct further experiments on the higher-performing candidates by modifying the notebooks and rerunning them.

## Resources

- [Developer Guide: Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development.html)
    - [Training modes](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-model-support-validation.html), i.e. hyperparameter tuning vs. ensembling.
    - [Metrics and validation](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-metrics-validation.html)
- [sagemaker module docs](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html)
    - AutoML functions
        - [create_auto_ml_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_auto_ml_job)
        - [describe_auto_ml_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.describe_auto_ml_job)
        - [list_candidates_for_auto_ml_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.list_candidates_for_auto_ml_job)
        - [list_auto_ml_jobs](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.list_auto_ml_jobs)
        - [stop_auto_ml_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.stop_auto_ml_job)
- [SageMaker API Reference](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Operations.html)
- Fairness papers (both very similar)
    - [Fairness Measures for Machine Learning in Finance](https://pages.awscloud.com/rs/112-TZM-766/images/Fairness.Measures.for.Machine.Learning.in.Finance.pdf)
    - [Amazon AI Fairness and Explainability Whitepaper](https://pages.awscloud.com/rs/112-TZM-766/images/Amazon.AI.Fairness.and.Explainability.Whitepaper.pdf)
- [Source code](https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/automl)
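As a rough illustration of how the boto3 AutoML functions listed above fit together, the sketch below builds a request for `create_auto_ml_job` and submits it. The helper names, the bucket paths, the `churn` target column, and the role ARN are all made up for illustration. `ProblemType` and `AutoMLJobObjective` are left out on the assumption that Autopilot should infer them.

```python
def build_auto_ml_job_request(job_name, train_s3_uri, output_s3_uri,
                              target_column, role_arn,
                              mode="ENSEMBLING", max_candidates=10):
    """Assemble keyword arguments for SageMaker.Client.create_auto_ml_job."""
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [
            {
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                    }
                },
                "TargetAttributeName": target_column,
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "AutoMLJobConfig": {
            # "AUTO", "ENSEMBLING", or "HYPERPARAMETER_TUNING"
            "Mode": mode,
            "CompletionCriteria": {"MaxCandidates": max_candidates},
        },
        "RoleArn": role_arn,
    }


def submit_auto_ml_job(request, client=None):
    """Submit the request; boto3 is imported lazily so the builder has no dependencies."""
    if client is None:
        import boto3  # requires AWS credentials and a configured region
        client = boto3.client("sagemaker")
    return client.create_auto_ml_job(**request)


if __name__ == "__main__":
    # Hypothetical values; nothing is sent to AWS here.
    request = build_auto_ml_job_request(
        job_name="churn-autopilot-demo",
        train_s3_uri="s3://example-bucket/churn/train/",
        output_s3_uri="s3://example-bucket/churn/output/",
        target_column="churn",
        role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    )
    print(request["AutoMLJobConfig"])
```

Keeping the request builder separate from the submit call makes the configuration easy to inspect or unit test before anything is sent to AWS.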

## Examples

- [AWS's SageMaker Examples on GitHub](https://github.com/aws/amazon-sagemaker-examples)
    - [Direct Marketing with Amazon SageMaker Autopilot](https://github.com/aws/amazon-sagemaker-examples/blob/main/autopilot/sagemaker_autopilot_direct_marketing.ipynb)
    - [Customer Churn Prediction with Amazon SageMaker Autopilot](https://github.com/aws/amazon-sagemaker-examples/blob/main/autopilot/autopilot_customer_churn.ipynb)
    - [Top Candidates Customer Churn Prediction with Amazon SageMaker Autopilot and Batch Transform (Python SDK)](https://github.com/aws/amazon-sagemaker-examples/blob/main/autopilot/autopilot_customer_churn_high_level_with_evaluation.ipynb)
    - [Bringing your own data processing code to SageMaker Autopilot](https://github.com/aws/amazon-sagemaker-examples/blob/main/autopilot/custom-feature-selection/Feature_selection_autopilot.ipynb)
    - [Explaining Autopilot Models](https://github.com/aws/amazon-sagemaker-examples/blob/main/autopilot/model-explainability/explaining_customer_churn_model.ipynb)
    - [Move Amazon SageMaker Autopilot ML models from experimentation to production using Amazon SageMaker Pipelines](https://github.com/aws/amazon-sagemaker-examples/blob/main/autopilot/sagemaker-autopilot-pipelines/autopilot_pipelines_demo_notebook.ipynb)
    - [Housing Price Prediction with Amazon SageMaker Autopilot](https://github.com/aws/amazon-sagemaker-examples/blob/main/autopilot/autopilot_california_housing.ipynb)
    - [Regression with Amazon SageMaker Autopilot (Parquet input)](https://github.com/aws/amazon-sagemaker-examples/blob/main/autopilot/sagemaker_autopilot_abalone_parquet_input.ipynb)
    - [Deploy Autopilot models to serverless inference endpoints](https://github.com/aws/amazon-sagemaker-examples/blob/main/autopilot/autopilot-serverless-inference/autopilot-models-serverless-inference.ipynb)
- [Data Science on AWS (book repo)](https://github.com/data-science-on-aws/data-science-on-aws)
    - [Data Science on AWS (book repo) - Ch 3](https://github.com/data-science-on-aws/data-science-on-aws/tree/main/03_automl)

A SageMaker Autopilot job consists of the following high-level steps:
- **Analyzing Data**, where the dataset is analyzed and Autopilot comes up with a list of ML pipelines that should be tried out on the dataset. The dataset is also split into train and validation sets.
- **Feature Engineering**, where Autopilot performs feature transformation on individual features of the dataset as well as at an aggregate level.
- **Model Tuning**, where the top performing pipeline is selected along with the optimal hyperparameters for the training algorithm (the last stage of the pipeline). 
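These steps surface in the `AutoMLJobSecondaryStatus` field returned by `describe_auto_ml_job` (for example `AnalyzingData`, `FeatureEngineering`, `ModelTuning`), so a job can be followed with a simple polling loop. The sketch below is one possible way to do that, not the SDK's own waiter: `wait_for_auto_ml_job` is a hypothetical helper, and the client is passed in so that any object with a `describe_auto_ml_job` method works.

```python
import time

# Job-level statuses after which polling can stop.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_auto_ml_job(client, job_name, poll_seconds=30.0,
                         on_update=print, sleep=time.sleep):
    """Poll describe_auto_ml_job until the job ends, reporting each new step."""
    last_secondary = None
    while True:
        desc = client.describe_auto_ml_job(AutoMLJobName=job_name)
        status = desc["AutoMLJobStatus"]
        secondary = desc["AutoMLJobSecondaryStatus"]
        # Report only when the job moves to a new step.
        if secondary != last_secondary:
            on_update(f"{status}: {secondary}")
            last_secondary = secondary
        if status in TERMINAL_STATUSES:
            return desc
        sleep(poll_seconds)


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   final = wait_for_auto_ml_job(boto3.client("sagemaker"), "my-autopilot-job")
```

The `on_update` and `sleep` parameters exist only to make the loop easy to test or to plug into a logger.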

### Tasks automated by Amazon SageMaker Autopilot (AutoML)

- Explores your data
- Selects the algorithms relevant to your problem type
- Prepares the data to facilitate model training and tuning
- Applies cross-validation resampling to all candidate algorithms when appropriate
- Produces metrics to assess the predictive quality of its machine learning model candidates
- Ranks all of the optimized models tested by their performance
- Finds the best performing model to deploy
- Generates a report that indicates the importance of each feature for the predictions made by the best candidate
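The ranking step can be reproduced on the client side from the output of `list_candidates_for_auto_ml_job`, where each candidate carries a `FinalAutoMLJobObjectiveMetric` with a `Value` and a `Type` (`Maximize` or `Minimize`). The sketch below is a hypothetical helper that orders candidates best-first; the candidate names and scores in the usage example are invented. The API can also sort server-side with `SortBy="FinalObjectiveMetricValue"`, and `describe_auto_ml_job` returns a `BestCandidate` directly, so this is mainly useful for inspecting the full leaderboard.

```python
def rank_candidates(candidates):
    """Return candidates ordered best-first by their final objective metric."""
    def sort_key(candidate):
        metric = candidate["FinalAutoMLJobObjectiveMetric"]
        value = metric["Value"]
        # Negate maximized metrics so that an ascending sort puts the best first.
        # Assumption: a missing Type is treated as "Maximize".
        return -value if metric.get("Type", "Maximize") == "Maximize" else value

    # Candidates that never produced a final metric are skipped.
    scored = [c for c in candidates if "FinalAutoMLJobObjectiveMetric" in c]
    return sorted(scored, key=sort_key)


if __name__ == "__main__":
    # Invented data shaped like the "Candidates" list in the API response.
    sample = [
        {"CandidateName": "cand-a",
         "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:f1", "Value": 0.81, "Type": "Maximize"}},
        {"CandidateName": "cand-b",
         "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:f1", "Value": 0.88, "Type": "Maximize"}},
        {"CandidateName": "cand-c"},  # no final metric, e.g. a failed candidate
    ]
    for c in rank_candidates(sample):
        print(c["CandidateName"], c["FinalAutoMLJobObjectiveMetric"]["Value"])
```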

### Custom Solution State

- [x] Explore the data
- [x] Select relevant algorithms
- [x] Feature engineering
- [ ] Apply cross-validation resampling to candidate algorithms when appropriate
- [x] Produce metrics to assess the predictive quality
- [ ] Rank models by their performance
- [ ] Choose best performing model
- [x] Register with Model Registry
- [x] Deploy to endpoint
- [ ] Monitor models

Interpretability / Clarify?