# Introduction to AutoML

AutoML, or Automated Machine Learning, automates the end-to-end work of applying machine learning to real-world problems. It covers the complete pipeline from the raw dataset to a deployable machine learning model, including:

1. **Data Preprocessing**: Handling missing values, encoding categorical variables, feature scaling, etc.
2. **Feature Engineering**: Creating new features from existing data to improve model performance.
3. **Model Selection**: Choosing the best model from a wide range of algorithms.
4. **Hyperparameter Tuning**: Optimizing the parameters of the chosen model to improve performance.
5. **Model Evaluation**: Assessing the performance of the model using various metrics.
6. **Model Deployment**: Making the model available for use in production.

AutoML tools aim to make machine learning accessible to non-experts and improve the efficiency of experts by automating repetitive tasks and providing state-of-the-art models with minimal effort.
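
To make the preprocessing step concrete, here is a minimal hand-written sketch of three transformations an AutoML tool would otherwise choose and apply for you: mean imputation, standardization, and one-hot encoding. The toy rows and the column names `age` and `city` are made up for illustration, and the sketch uses only the Python standard library rather than any AutoML framework.

```python
from statistics import mean, pstdev

# Hypothetical toy dataset: one numeric column with a missing value,
# one categorical column.
rows = [
    {"age": 25.0, "city": "Paris"},
    {"age": None, "city": "Lyon"},
    {"age": 35.0, "city": "Paris"},
]

# 1. Impute missing numeric values with the mean of the known values.
known_ages = [r["age"] for r in rows if r["age"] is not None]
fill_value = mean(known_ages)
ages = [r["age"] if r["age"] is not None else fill_value for r in rows]

# 2. Standardize the column to zero mean and unit variance.
age_mean = mean(ages)
age_std = pstdev(ages)
scaled_ages = [(a - age_mean) / age_std for a in ages]

# 3. One-hot encode the categorical column (one 0/1 flag per category).
categories = sorted({r["city"] for r in rows})
one_hot = [[1 if r["city"] == c else 0 for c in categories] for r in rows]

# Final feature matrix: scaled numeric value followed by the one-hot flags.
features = [[s] + flags for s, flags in zip(scaled_ages, one_hot)]
for row in features:
    print(row)
```

Each of these choices (which imputation strategy, whether to scale, how to encode) is a decision that an AutoML system searches over instead of leaving it to the user.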

Popular AutoML frameworks include:
- **Google Cloud AutoML**
- **H2O AutoML** (from H2O.ai)
- **Auto-Sklearn**
- **TPOT**
- **MLBox**

In this notebook, we will explore how to use one of these AutoML frameworks to build and evaluate a machine learning model.

# Auto-Sklearn

Auto-Sklearn is an open-source AutoML tool built on top of the popular scikit-learn library. It automates the process of model selection, hyperparameter tuning, and ensemble construction. 

Auto-Sklearn leverages **Bayesian optimization** to find the best model and hyperparameters for a given dataset.

**How Auto-Sklearn Works**

Auto-Sklearn automates the following:
1. Model Selection – Tries multiple ML models (e.g., Decision Trees, Random Forests, SVMs).
2. Hyperparameter Optimization – Uses Bayesian Optimization to tune hyperparameters.
3. Feature Engineering & Preprocessing – Automatically applies transformations like normalization, one-hot encoding, and missing value imputation.
4. Meta-Learning – Uses knowledge from past datasets to warm-start the search with configurations that worked well on similar data.
5. Ensembling – Combines multiple models to improve performance.

Auto-Sklearn is particularly useful for tabular data (classification and regression tasks).
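
The ensembling step can be illustrated with a small sketch of greedy ensemble selection, the general technique of adding models one at a time (with replacement) whenever they reduce validation error. This is a simplified, standard-library-only illustration, not Auto-Sklearn's actual implementation; the model names, the predictions, and the helper `greedy_ensemble` are all hypothetical.

```python
from collections import Counter

def mse(pred, target):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def average(prediction_lists):
    """Element-wise mean of several prediction lists."""
    return [sum(col) / len(col) for col in zip(*prediction_lists)]

def greedy_ensemble(val_preds, y_val, ensemble_size):
    """Pick models one at a time (with replacement) to minimise validation MSE.

    Returns a dict mapping each selected model name to its ensemble weight.
    """
    chosen = []
    for _ in range(ensemble_size):
        best_name = min(
            val_preds,
            key=lambda name: mse(
                average([val_preds[m] for m in chosen + [name]]), y_val
            ),
        )
        chosen.append(best_name)
    counts = Counter(chosen)
    return {name: counts[name] / ensemble_size for name in counts}

# Hypothetical validation predictions from three already-trained models.
y_val = [1.0, 2.0, 3.0, 4.0]
val_preds = {
    "model_a": [1.5, 2.5, 3.5, 4.5],  # consistently too high
    "model_b": [0.5, 1.5, 2.5, 3.5],  # consistently too low
    "model_c": [2.5, 2.5, 2.5, 2.5],  # ignores the input
}

weights = greedy_ensemble(val_preds, y_val, ensemble_size=4)
print(weights)
```

In this toy case the two biased models cancel each other out, so they end up with equal weight and the constant model is never selected. A model's weight is simply the fraction of selection steps in which it was picked.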

In the following sections, we will demonstrate how to use Auto-Sklearn to build and evaluate a machine learning model.

**How Auto-Sklearn Differs from Grid Search or Random Search**

<table border="1">
    <tr>
        <th>Feature</th>
        <th>Grid Search</th>
        <th>Random Search</th>
        <th>Auto-Sklearn</th>
    </tr>
    <tr>
        <td>Model Selection</td>
        <td>Manual</td>
        <td>Manual</td>
        <td>Automatic</td>
    </tr>
    <tr>
        <td>Hyperparameter Tuning</td>
        <td>Exhaustive</td>
        <td>Random Sampling</td>
        <td>Bayesian Optimization</td>
    </tr>
    <tr>
        <td>Preprocessing</td>
        <td>Manual</td>
        <td>Manual</td>
        <td>Automatic</td>
    </tr>
    <tr>
        <td>Uses Prior Knowledge</td>
        <td>❌ No</td>
        <td>❌ No</td>
        <td>✅ Yes (Meta-Learning)</td>
    </tr>
</table>
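
The difference between the two manual strategies in the table can be seen in a small standard-library sketch. The objective `validation_error` and the hyperparameter names `learning_rate` and `depth` are hypothetical stand-ins for training and scoring a real model. Bayesian optimization is not shown here; roughly speaking, it would use the results of earlier trials to decide which configuration to try next instead of following a fixed grid or sampling blindly.

```python
import itertools
import random

def validation_error(learning_rate, depth):
    """Hypothetical stand-in for a model's validation error (lower is better)."""
    return (learning_rate - 0.13) ** 2 + 0.01 * (depth - 5) ** 2

# Grid search: evaluate every combination of a fixed set of values.
learning_rates = [0.01, 0.1, 0.2, 0.3]
depths = [2, 4, 6, 8]
grid_trials = [
    (validation_error(lr, d), lr, d)
    for lr, d in itertools.product(learning_rates, depths)
]
grid_best = min(grid_trials)

# Random search: same budget (16 trials), but values are sampled from ranges,
# so configurations between the grid points can be reached.
rng = random.Random(0)
random_trials = []
for _ in range(len(grid_trials)):
    lr = rng.uniform(0.01, 0.3)
    d = rng.randint(2, 8)
    random_trials.append((validation_error(lr, d), lr, d))
random_best = min(random_trials)

print(f"grid search best:   error={grid_best[0]:.4f}, "
      f"lr={grid_best[1]}, depth={grid_best[2]}")
print(f"random search best: error={random_best[0]:.4f}, "
      f"lr={random_best[1]:.3f}, depth={random_best[2]}")
```

Both strategies treat every trial independently and only tune one model family that the user picked by hand, which is the gap the "Automatic" and "Meta-Learning" entries in the table refer to.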


In [2]:
!pip install auto-sklearn joblib

Collecting auto-sklearn
  Downloading auto-sklearn-0.15.0.tar.gz (6.5 MB)

## Auto-Sklearn for Classification

This section walks through the end-to-end process of using Auto-Sklearn for a classification task.

**Classification Task: Data Preparation**

We load the Iris dataset, which is a popular dataset for classification tasks. The dataset is then split into training and testing sets using an 80-20 split.

In [3]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from autosklearn.classification import AutoSklearnClassifier

ModuleNotFoundError: No module named 'autosklearn'

In [None]:
# Load dataset
data = load_iris()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Classification Task: Model Training**

**What Happens Behind the Scenes?**
- Auto-Sklearn tries multiple models (e.g., Random Forest, SVM, Gradient Boosting).
- It optimizes hyperparameters automatically.
- The best models are ensembled to improve performance.

In [None]:
# Initialize AutoSklearnClassifier
automl = AutoSklearnClassifier(
    time_left_for_this_task=300,  # Total search budget in seconds (5 minutes)
    per_run_time_limit=30,  # Max time in seconds for a single model fit
    ensemble_size=10  # Number of selection steps used to build the final ensemble
)
# Train the model (runs the search, then builds the ensemble)
automl.fit(X_train, y_train)

**Classification Task: Model Evaluation**

The trained model is used to make predictions on the test data. We then calculate the accuracy of the model by comparing the predicted labels with the true labels.

In [None]:
# Evaluate the model
y_pred = automl.predict(X_test)
accuracy = np.mean(y_pred == y_test)
print(f'Classification Accuracy: {accuracy:.2f}')
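
A single accuracy figure can hide the fact that one class is predicted worse than the others. The sketch below computes per-class recall next to overall accuracy using only the standard library. The label lists are made up for illustration; in this notebook they would correspond to `y_test` and `y_pred`, and `per_class_recall` is a hypothetical helper, not part of Auto-Sklearn.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Hypothetical true and predicted labels for a three-class problem.
y_true_demo = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred_demo = [0, 0, 0, 1, 2, 1, 2, 2, 1, 2]

accuracy_demo = sum(
    t == p for t, p in zip(y_true_demo, y_pred_demo)
) / len(y_true_demo)
recall_demo = per_class_recall(y_true_demo, y_pred_demo)

print(f"accuracy: {accuracy_demo:.2f}")
for label, recall in recall_demo.items():
    print(f"class {label}: recall {recall:.2f}")
```

Here the overall accuracy is 0.80, while class 0 is always recognized and classes 1 and 2 are each confused once.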

## Auto-Sklearn for Regression

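First, import the Boston housing dataset loader and AutoSklearnRegressor from the autosklearn library. Note that load_boston was removed in scikit-learn 1.2, so this import requires an older scikit-learn release.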

In [None]:
from sklearn.datasets import load_boston
from autosklearn.regression import AutoSklearnRegressor

**Regression Task: Data Preparation**

We load the Boston housing dataset, which is commonly used for regression tasks. The dataset is then split into training and testing sets using an 80-20 split.

In [None]:
# Load dataset
data = load_boston()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Regression Task: Model Training**

In [None]:
# Initialize AutoSklearnRegressor
automl = AutoSklearnRegressor(time_left_for_this_task=120, per_run_time_limit=30)

# Train the model
automl.fit(X_train, y_train)

**Regression Task: Model Evaluation**

The trained model is used to make predictions on the test data. We then calculate the mean squared error (MSE) of the model by comparing the predicted values with the true values.

In [None]:
# Evaluate the model
y_pred = automl.predict(X_test)
mse = np.mean((y_pred - y_test) ** 2)
print(f'Regression Mean Squared Error: {mse:.2f}')
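
MSE is expressed in squared units of the target, which makes it hard to read on its own. A common companion is the root mean squared error (RMSE), which is in the target's own units, and the coefficient of determination, R² = 1 − MSE / Var(y). The sketch below computes all three with the standard library; the values are made up for illustration and `regression_report` is a hypothetical helper.

```python
import math

def regression_report(y_true, y_pred):
    """Return MSE, RMSE and R^2 for two equally long lists of numbers."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mean_true = sum(y_true) / n
    # Variance of the targets: the MSE of always predicting the mean.
    variance = sum((t - mean_true) ** 2 for t in y_true) / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1.0 - mse / variance}

# Hypothetical true and predicted values.
y_true_demo = [10.0, 20.0, 30.0, 40.0]
y_pred_demo = [12.0, 18.0, 33.0, 37.0]

report = regression_report(y_true_demo, y_pred_demo)
print(f"MSE:  {report['mse']:.2f}")
print(f"RMSE: {report['rmse']:.2f}")
print(f"R^2:  {report['r2']:.3f}")
```

An R² close to 1 means the model explains most of the variance in the target, while a value near 0 means it does no better than predicting the mean.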

## Auto-Sklearn's Features

## Understanding Auto-Sklearn’s Output

In [None]:
# Get the best models used - details about the models in the final ensemble.
print(automl.show_models())

In [None]:
# Get the Leaderboard - Ranked list of best models.
print(automl.leaderboard())

In [None]:
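# Get the pipeline - preprocessing + model of the top-ranked ensemble member.
# Assumes auto-sklearn >= 0.14, where show_models() returns a dict keyed by model id.
best_id = automl.leaderboard().index[0]
print(automl.show_models()[best_id])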

In [None]:
# Get the best model - top row of the leaderboard (sorted by cost, lowest first)
print(automl.leaderboard().head(1))

In [None]:
# Get the final ensemble
models = automl.get_models_with_weights()
for weight, model in models:
    print(f"Weight: {weight}, Model: {model}")
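
The raw listing is easier to read when members are sorted by weight and shown as shares of the ensemble. The sketch below assumes, as the loop above does, that the result is a list of `(weight, model)` pairs; `demo_models` is a hypothetical stand-in with plain strings instead of fitted pipelines, and `summarize_ensemble` is an illustrative helper.

```python
# Hypothetical stand-in for automl.get_models_with_weights().
demo_models = [
    (0.2, "pipeline with random forest"),
    (0.5, "pipeline with gradient boosting"),
    (0.3, "pipeline with SVM"),
]

def summarize_ensemble(models_with_weights):
    """Return (share, description) rows sorted from heaviest to lightest member."""
    total = sum(weight for weight, _ in models_with_weights)
    # Sort on the weight only, so the model objects never need to be comparable.
    ranked = sorted(models_with_weights, key=lambda pair: pair[0], reverse=True)
    return [(weight / total, str(model)) for weight, model in ranked]

summary = summarize_ensemble(demo_models)
for share, description in summary:
    print(f"{share:6.1%}  {description}")
```

With a real model, `summarize_ensemble(automl.get_models_with_weights())` would print the string representation of each fitted pipeline, which can be long.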

In [None]:
# Get the results - Performance metrics for all tried models.
print(automl.cv_results_)

In [None]:
# Get the statistics - general summary of the AutoML process.
# Shows details like the number of models tried, total time taken, and the best score.
print(automl.sprint_statistics())

In [None]:
# Saving and Loading Auto-Sklearn Models
# Save the model
import joblib
joblib.dump(automl, "automl_model.pkl")

# Load the model
automl = joblib.load("automl_model.pkl")