Title: Auto-Sklearn Tutorial for Tabular Data

Introduction: In this tutorial, we will explore auto-sklearn, an automated machine learning (AutoML) tool that automates the selection and tuning of machine learning models. Auto-sklearn builds on the popular scikit-learn library and uses Bayesian optimization, meta-learning, and ensemble construction to find and fine-tune models for your data. The steps below walk through a regression task on a tabular dataset.

## Step 1: Install Necessary Libraries

In [None]:
%pip install matplotlib
%pip install pandas
%pip install auto-sklearn

## Step 2: Import Libraries and Load Dataset

In [None]:
from pprint import pprint
import autosklearn.regression
import matplotlib.pyplot as plt
import pandas as pd
import sklearn.metrics  # needed for r2_score in Step 7

# Load the dataset
df = pd.read_csv("/path/to/your/dataset")

## Step 3: Split the Dataset into Training and Testing Sets


In [None]:
# Define the ratio of the dataset to be used for training
# You can change the value of train_ratio to adjust the train-test split ratio
train_ratio = 0.8

# Calculate the split index
split_index = int(len(df) * train_ratio)

# Split the dataset by position (no shuffling): the first rows train, the rest test
train_df = df[:split_index]
test_df = df[split_index:]
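
The positional split above keeps the file's row order, which can bias the test set if the rows are sorted (for example by date or by target value). A minimal sketch of a shuffled alternative follows, using only the standard library. The helper `shuffled_split_indices` and its `seed` parameter are hypothetical names introduced here for illustration; they are not part of auto-sklearn or pandas.

```python
import random


def shuffled_split_indices(n_rows, train_ratio=0.8, seed=0):
    """Return (train_idx, test_idx): shuffled, non-overlapping row positions.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    split_index = int(n_rows * train_ratio)
    return indices[:split_index], indices[split_index:]


# Example with 10 rows and the same 80/20 ratio used above
train_idx, test_idx = shuffled_split_indices(10, train_ratio=0.8, seed=0)
print(len(train_idx), len(test_idx))
```

With a DataFrame, the positions could then be applied as `train_df = df.iloc[train_idx]` and `test_df = df.iloc[test_idx]`.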

## Step 4: Prepare the Data and fit a regressor



In [None]:
# Define the target column
target_column = "your_target_column"

# Separate features and target variable for training and testing sets
X_train = train_df.drop(columns=[target_column]).to_numpy()
y_train = train_df[target_column].to_numpy()

X_test = test_df.drop(columns=[target_column]).to_numpy()
y_test = test_df[target_column].to_numpy()

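automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=120,  # total search budget, in seconds
    per_run_time_limit=30,  # time limit for a single model fit, in seconds
    tmp_folder="/tmp/autosklearn_regression_example_tmp",
)
# dataset_name is only a label for nicer output; replace it with your dataset's name
automl.fit(X_train, y_train, dataset_name="your_dataset_name")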

## Step 5: View the models found by auto-sklearn



In [None]:
print(automl.leaderboard())

## Step 6: Print the final ensemble constructed by auto-sklearn



In [None]:
pprint(automl.show_models(), indent=4)

## Step 7: Get the Score of the final ensemble
After training the estimator, we can quantify the goodness of fit. One option
is the [R2 score](https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score).
Its values range from negative infinity to 1, where 1 is the best possible value. A dummy
estimator that always predicts the mean of the data has an R2 score of 0.
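
To make these properties concrete, here is a small sketch that computes the R2 score by hand as 1 - SS_res / SS_tot on made-up numbers. The function name `r2_by_hand` is hypothetical, and the sketch assumes the true values are not all identical (otherwise SS_tot is zero); in the tutorial itself, `sklearn.metrics.r2_score` does this work.

```python
def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Illustrative only; assumes y_true is not constant (SS_tot > 0).
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]  # made-up target values
mean_prediction = [sum(y_true) / len(y_true)] * len(y_true)

perfect = r2_by_hand(y_true, y_true)  # perfect predictions
dummy = r2_by_hand(y_true, mean_prediction)  # always predicting the mean
reversed_order = r2_by_hand(y_true, [4.0, 3.0, 2.0, 1.0])  # worse than the mean

print(perfect, dummy, reversed_order)
```

The three cases give 1, 0, and a negative score respectively, matching the range described above.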



In [None]:
train_predictions = automl.predict(X_train)
print("Train R2 score:", sklearn.metrics.r2_score(y_train, train_predictions))
test_predictions = automl.predict(X_test)
print("Test R2 score:", sklearn.metrics.r2_score(y_test, test_predictions))

## Step 8: Plot the predictions
Furthermore, we can now visually inspect the predictions. We plot the true value against the
predictions and show results on train and test data. Points on the diagonal depict perfect
predictions. Points below the diagonal were overestimated by the model (predicted value is higher
than the true value), points above the diagonal were underestimated (predicted value is lower than
the true value).
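
The reading of the diagonal described above can be sketched as a simple comparison. The function `estimation_direction` and the sample pairs are hypothetical, introduced only to illustrate the rule for a plot with predicted values on the x-axis and true values on the y-axis.

```python
def estimation_direction(true_value, predicted_value):
    """Classify one prediction relative to the diagonal (predicted on x, true on y)."""
    if predicted_value > true_value:
        return "overestimated"  # point lies below the diagonal
    if predicted_value < true_value:
        return "underestimated"  # point lies above the diagonal
    return "exact"  # point lies on the diagonal


# Made-up (true, predicted) pairs for illustration
pairs = [(100.0, 120.0), (100.0, 80.0), (100.0, 100.0)]
labels = [estimation_direction(t, p) for t, p in pairs]
print(labels)
```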



In [None]:
plt.scatter(train_predictions, y_train, label="Train samples", c="#d95f02")
plt.scatter(test_predictions, y_test, label="Test samples", c="#7570b3")
plt.xlabel("Predicted value")
plt.ylabel("True value")
plt.legend()
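# Derive the diagonal and axis limits from the data instead of hard-coding them
lo = min(y_train.min(), y_test.min(), train_predictions.min(), test_predictions.min())
hi = max(y_train.max(), y_test.max(), train_predictions.max(), test_predictions.max())
plt.plot([lo, hi], [lo, hi], c="k", zorder=0)
plt.xlim([lo, hi])
plt.ylim([lo, hi])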
plt.tight_layout()
plt.show()