# AutoML with H2O

This notebook walks through training models with H2O AutoML. Before starting, either run the Data_Preparation_for_ML notebook or place your datasets, named `train_dataset.csv` and `test_dataset.csv`, in the `/data/` folder.

## Steps Included:
1. Selecting and loading datasets
2. Selecting the target variable
3. Performance exploration (calculating GINI)
4. Feature importance analysis
5. Correlation treatment
6. Feature selection
7. H2O AutoML


---
## Importing necessary libraries

In [1]:
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import h2o
from h2o.automl import H2OAutoML
import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, roc_curve
import ipywidgets as widgets
from IPython.display import display

# Initialize H2O
h2o.init()

print("Packages imported and H2O initialized.")


Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "13.0.2" 2020-01-14; OpenJDK Runtime Environment (build 13.0.2+8); OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode, sharing)
  Starting server from /home/default/lib/python3.10/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmplvktwbdb
  JVM stdout: /tmp/tmplvktwbdb/h2o_python_75198895_started_from_python.out
  JVM stderr: /tmp/tmplvktwbdb/h2o_python_75198895_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html


H2O_cluster_uptime:,02 secs
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.42.0.2
H2O_cluster_version_age:,10 months and 10 days
H2O_cluster_name:,H2O_from_python_python_75198895_jsar0h
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,29.97 Gb
H2O_cluster_total_cores:,20
H2O_cluster_allowed_cores:,20


Packages imported and H2O initialized.


---
## Selecting and Loading Datasets

Enter the paths to your train and test datasets in the text boxes below, then run the following cell to load them.


In [4]:
# Widget to select train and test datasets
train_file_widget = widgets.Text(
    value='/data/train_dataset.csv',
    description='Train Dataset:',
    disabled=False
)
display(train_file_widget)

test_file_widget = widgets.Text(
    value='/data/test_dataset.csv',
    description='Test Dataset:',
    disabled=False
)
display(test_file_widget)


Text(value='/data/train_dataset.csv', description='Train Dataset:')

Text(value='/data/test_dataset.csv', description='Test Dataset:')

In [5]:
# Load datasets
train_file = train_file_widget.value
test_file = test_file_widget.value

train_data = pd.read_csv(train_file)
test_data = pd.read_csv(test_file)

print("Train and test datasets loaded.")
display(train_data.head())
display(test_data.head())


Train and test datasets loaded.


Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,Survived
0,2.310969,-1.664466,-0.597,1.558939,-0.904751,-0.07609,-0.615833,0.082718,0.104453,-0.856158,0
1,-0.324932,-0.750348,-1.847883,-0.872708,0.358119,-0.345926,-0.24634,-0.352741,-0.014974,0.29197,0
2,-1.609571,-0.803272,1.228848,1.114457,0.816196,-0.009331,0.241954,-0.612879,0.24948,-0.290291,0
3,-1.275964,-0.229808,-0.943715,-0.141607,0.918611,-0.119819,-0.241904,-0.087031,-0.516926,-0.062049,0
4,-0.896649,4.165155,-2.127456,-0.790586,0.634079,0.302518,0.021939,0.340533,-1.047267,-0.367064,0


Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,Survived
0,-0.39629,0.207673,0.132268,-0.721755,0.675525,1.258544,-1.284308,1.709685,0.380903,0.126862,1
1,-0.702403,-0.826772,0.125794,0.35561,0.591617,-0.641283,0.27446,-0.519303,0.101023,-0.104779,0
2,-1.555742,-0.358339,-1.126078,-1.370477,1.813102,-0.811142,0.448751,-0.818909,0.031103,-0.311042,0
3,-0.034517,1.16251,-1.506031,-0.980642,-0.648214,1.301233,0.238183,-1.207968,0.522121,0.320525,1
4,-0.565579,0.859583,1.709824,-1.246519,-2.059776,0.762434,-0.815891,0.291687,-0.6183,0.053933,1
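Later cells drop the target column from both frames and score the test set with a model fitted on the train set, so it can help to confirm that both files share the same columns before going further. The following is a minimal sketch on toy data; the helper name `check_schema` and the demo frames are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def check_schema(train, test):
    """Return (columns missing from test, columns only in test)."""
    missing_in_test = [c for c in train.columns if c not in test.columns]
    extra_in_test = [c for c in test.columns if c not in train.columns]
    return missing_in_test, extra_in_test


# Toy frames standing in for train_data and test_data (hypothetical values)
train_demo = pd.DataFrame({"PC1": [0.1], "PC2": [0.2], "Survived": [0]})
test_demo = pd.DataFrame({"PC1": [0.3], "Survived": [1], "extra": [5]})

missing, extra = check_schema(train_demo, test_demo)
print(f"Missing in test: {missing}")
print(f"Only in test: {extra}")
```

In the notebook itself, the same check could be run as `check_schema(train_data, test_data)`; two empty lists mean the schemas match.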


---
## Selecting the Target Variable

Choose the target column from the dropdown below, then run the following cell to confirm the selection.


In [6]:
# Widget to select target variable
target_column_widget = widgets.Dropdown(
    options=train_data.columns.tolist(),
    description='Target Column:',
    disabled=False
)
display(target_column_widget)


Dropdown(description='Target Column:', options=('PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8', 'PC9',…

In [7]:
# Select target column
target_column = target_column_widget.value
print(f"Target column selected: {target_column}")


Target column selected: Survived


---
## Performance Exploration

This step fits a baseline XGBoost classifier and calculates the GINI coefficient (2 × AUC − 1) on the train and test datasets.


In [8]:
# GINI coefficient derived from ROC AUC: GINI = 2 * AUC - 1
def gini_coefficient(y_true, y_prob):
    return 2 * roc_auc_score(y_true, y_prob) - 1

# Fit a baseline model, then calculate GINI on train and test datasets
# (use_label_encoder is deprecated in recent XGBoost releases, so it is omitted)
model = xgb.XGBClassifier(eval_metric='logloss')
model.fit(train_data.drop(columns=[target_column]), train_data[target_column])

train_pred = model.predict_proba(train_data.drop(columns=[target_column]))[:, 1]
test_pred = model.predict_proba(test_data.drop(columns=[target_column]))[:, 1]

train_gini = gini_coefficient(train_data[target_column], train_pred)
test_gini = gini_coefficient(test_data[target_column], test_pred)

print(f"Train GINI: {train_gini}")
print(f"Test GINI: {test_gini}")




Train GINI: 1.0
Test GINI: 0.7359073359073358
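The relationship between GINI and AUC used above can be checked by hand on a few rows. The sketch below computes AUC as the share of (positive, negative) pairs in which the positive row receives the higher score, then rescales it to GINI. The function names and the four-row example are illustrative assumptions, not part of the original notebook, and the pairwise loop is only meant for small inputs.

```python
def auc_by_pairs(y_true, y_prob):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute AUC.")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc):
    """Rescale AUC from [0.5, 1] to a GINI in [0, 1]."""
    return 2 * auc - 1


# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc = auc_by_pairs(y_true, y_prob)
gini = gini_from_auc(auc)
print(f"AUC: {auc}, GINI: {gini}")  # AUC: 0.75, GINI: 0.5
```

Three of the four positive/negative pairs are ordered correctly, which gives an AUC of 0.75 and a GINI of 0.5. A GINI of 0 corresponds to random ranking and 1 to perfect separation.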


---
## Feature Importance Analysis

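Feature importance is estimated in two ways: from XGBoost's built-in importances and from the absolute coefficients of an L1-regularized logistic regression fitted on standardized features.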


In [9]:
# Feature importance using XGBoost
xgb_model = xgb.XGBClassifier(eval_metric='logloss')  # use_label_encoder is deprecated in recent XGBoost releases
xgb_model.fit(train_data.drop(columns=[target_column]), train_data[target_column])
xgb_feature_importance = pd.Series(xgb_model.feature_importances_, index=train_data.drop(columns=[target_column]).columns)

# Feature importance using L1 regularized logistic regression
scaler = StandardScaler()
X_scaled = scaler.fit_transform(train_data.drop(columns=[target_column]))

l1_model = LogisticRegression(penalty='l1', solver='liblinear')
l1_model.fit(X_scaled, train_data[target_column])
l1_feature_importance = pd.Series(np.abs(l1_model.coef_[0]), index=train_data.drop(columns=[target_column]).columns)

# Display feature importance
print("XGBoost Feature Importance:")
display(xgb_feature_importance.sort_values(ascending=False).head(10))

print("L1 Regularized Logistic Regression Feature Importance:")
display(l1_feature_importance.sort_values(ascending=False).head(10))




XGBoost Feature Importance:


PC2     0.178369
PC1     0.153187
PC4     0.094098
PC8     0.093857
PC7     0.093557
PC5     0.088732
PC6     0.079109
PC9     0.076733
PC10    0.072368
PC3     0.069990
dtype: float32

L1 Regularized Logistic Regression Feature Importance:


PC1     1.041824
PC4     0.627449
PC7     0.546009
PC5     0.516006
PC6     0.505490
PC8     0.493831
PC2     0.418723
PC9     0.131494
PC3     0.093078
PC10    0.050084
dtype: float64

---
## Correlation Treatment

This step identifies highly correlated features and optionally drops them. Select the correlation threshold with the slider below.


In [10]:
# Widget to select correlation threshold
correlation_threshold_widget = widgets.FloatSlider(
    value=0.9,
    min=0.5,
    max=1.0,
    step=0.05,
    description='Corr Threshold:',
    continuous_update=False,
    orientation='horizontal'
)
display(correlation_threshold_widget)


FloatSlider(value=0.9, continuous_update=False, description='Corr Threshold:', max=1.0, min=0.5, step=0.05)

In [12]:
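# Identify highly correlated features (the target column is excluded so it can never be dropped)
correlation_threshold = correlation_threshold_widget.value
corr_matrix = train_data.drop(columns=[target_column]).corr().abs()
# Keep only the upper triangle so each feature pair is checked once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))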

# Find features with correlation greater than the threshold
to_drop = [column for column in upper.columns if any(upper[column] > correlation_threshold)]
print(f"Highly correlated features (threshold = {correlation_threshold}): {to_drop}")

# Widget to confirm dropping highly correlated features
drop_corr_features_widget = widgets.ToggleButtons(
    options=['Yes', 'No'],
    description='Drop Features?',
    disabled=False,
    button_style=''
)
display(drop_corr_features_widget)


Highly correlated features (threshold = 0.7): []


ToggleButtons(description='Drop Features?', options=('Yes', 'No'), value='Yes')

In [13]:
# Drop highly correlated features if confirmed
if drop_corr_features_widget.value == 'Yes':
    train_data.drop(columns=to_drop, inplace=True)
    test_data.drop(columns=to_drop, inplace=True)
    print(f"Dropped features: {to_drop}")
else:
    print("No features were dropped.")


Dropped features: []
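Since the run above found no correlated features, the effect of the upper-triangle filter is easier to see on a small synthetic frame. The sketch below is an illustration under stated assumptions: the helper `find_correlated_features` and the columns `a`, `b`, `c` are hypothetical, with `b` constructed to be almost a multiple of `a`.

```python
import numpy as np
import pandas as pd


def find_correlated_features(features, threshold):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = features.corr().abs()
    # Boolean mask for the strict upper triangle: each pair appears once,
    # and the diagonal (self-correlation of 1.0) is ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],  # nearly 2 * a
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly related to a and b
})

to_drop_demo = find_correlated_features(demo, threshold=0.9)
print(to_drop_demo)  # ['b']
```

Only the later column of each correlated pair is flagged, so `a` is kept and `b` is dropped. Which member of a pair survives therefore depends on column order.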


---
## Feature Selection

Based on the feature importance analysis, you may choose to use all features or select a subset of important features. Please make your selection below.


In [14]:
# Suggest features to use based on feature importance analysis
important_features_xgb = xgb_feature_importance.sort_values(ascending=False).head(10).index.tolist()
important_features_l1 = l1_feature_importance.sort_values(ascending=False).head(10).index.tolist()

# Order-preserving union, restricted to columns that survived correlation treatment
suggested_features = [
    feature
    for feature in dict.fromkeys(important_features_xgb + important_features_l1)
    if feature in train_data.columns
]

print(f"Suggested important features based on analysis: {suggested_features}")

# Widget to select whether to use all features or only important features
use_important_features_widget = widgets.ToggleButtons(
    options=['All Features', 'Important Features'],
    description='Feature Selection:',
    disabled=False,
    button_style=''
)
display(use_important_features_widget)


Suggested important features based on analysis: ['PC8', 'PC6', 'PC9', 'PC7', 'PC3', 'PC10', 'PC1', 'PC2', 'PC4', 'PC5']


ToggleButtons(description='Feature Selection:', options=('All Features', 'Important Features'), value='All Fea…

In [15]:
# Adjust dataset based on user selection
if use_important_features_widget.value == 'Important Features':
    train_data = train_data[suggested_features + [target_column]]
    test_data = test_data[suggested_features + [target_column]]
    print("Using only important features for training.")
else:
    print("Using all features for training.")


Using only important features for training.
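With only ten features in this dataset, taking the top ten from each ranking selects every column, so the "Important Features" option has no effect here. The sketch below shows how a smaller cut-off changes the result. The helper `top_k_union`, the value `k=2`, and the four-feature demo series are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd


def top_k_union(importances, k):
    """Union of the top-k features from each ranking, in first-seen order."""
    selected = []
    for scores in importances:
        selected.extend(scores.sort_values(ascending=False).head(k).index)
    # dict.fromkeys removes duplicates while preserving insertion order
    return list(dict.fromkeys(selected))


# Hypothetical importance scores for four features
xgb_demo = pd.Series({"PC1": 0.15, "PC2": 0.18, "PC3": 0.07, "PC4": 0.09})
l1_demo = pd.Series({"PC1": 1.04, "PC2": 0.42, "PC3": 0.09, "PC4": 0.63})

subset = top_k_union([xgb_demo, l1_demo], k=2)
print(subset)  # ['PC2', 'PC1', 'PC4']
```

`PC3` ranks in the top two of neither series and is left out, while the remaining features keep a stable, reproducible order.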


---
## H2O AutoML

Now we will proceed with H2O AutoML to train models on the prepared dataset.


In [17]:
# Convert Pandas dataframes to H2O frames
h2o_train = h2o.H2OFrame(train_data)
h2o_test = h2o.H2OFrame(test_data)

# Convert target to categorical if it is binary
if train_data[target_column].nunique() == 2:
    h2o_train[target_column] = h2o_train[target_column].asfactor()
    h2o_test[target_column] = h2o_test[target_column].asfactor()
    print(f"Target variable '{target_column}' converted to categorical for binary classification.")

# Define the features and target
x = h2o_train.columns
y = target_column
x.remove(y)

# Initialize and train H2O AutoML
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y=y, training_frame=h2o_train)

# View the AutoML Leaderboard
lb = aml.leaderboard
print("H2O AutoML Leaderboard:")
lb.head(rows=lb.nrows)


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Target variable 'Survived' converted to categorical for binary classification.
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
H2O AutoML Leaderboard:


model_id,auc,logloss,aucpr,mean_per_class_error,rmse,mse
DeepLearning_grid_1_AutoML_2_20240605_92810_model_1,0.845763,0.476767,0.788465,0.176869,0.368668,0.135916
StackedEnsemble_BestOfFamily_1_AutoML_2_20240605_92810,0.842943,0.451653,0.785457,0.183239,0.372913,0.139064
StackedEnsemble_AllModels_1_AutoML_2_20240605_92810,0.84117,0.448532,0.793814,0.186618,0.370431,0.137219
DeepLearning_grid_2_AutoML_2_20240605_92810_model_1,0.839725,0.525613,0.78391,0.209359,0.388044,0.150578
GLM_1_AutoML_2_20240605_92810,0.836792,0.466872,0.785564,0.22378,0.38575,0.148803
DeepLearning_grid_3_AutoML_2_20240605_92810_model_1,0.835476,0.563024,0.802568,0.204568,0.386621,0.149476
DRF_1_AutoML_2_20240605_92810,0.827707,0.581636,0.761793,0.232822,0.395776,0.156639
XGBoost_grid_1_AutoML_2_20240605_92810_model_2,0.827535,0.489446,0.783441,0.228788,0.394078,0.155297
GBM_grid_1_AutoML_2_20240605_92810_model_2,0.827472,0.479921,0.785956,0.229814,0.393576,0.154902
XRT_1_AutoML_2_20240605_92810,0.827215,0.578214,0.778665,0.219914,0.39351,0.15485


---
# Next Steps

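You can now evaluate the models and select the best one for your use case. The H2O AutoML leaderboard above ranks the trained models by AUC. Because no `leaderboard_frame` was passed to `aml.train`, these metrics come from cross-validation on the training data, so check the selected model on the test set before relying on it.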
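One way to start the evaluation is to score the leading model on the held-out test frame, which the AutoML cell creates but does not use. The sketch below assumes the `aml` and `h2o_test` objects from that cell; `evaluate_leader` is a hypothetical helper name. It relies on `aml.leader`, `model_performance`, `auc()`, and `logloss()` from the H2O Python API, and on the target being binary so that AUC is defined.

```python
def evaluate_leader(aml, test_frame):
    """Score the AutoML leader on a held-out frame and return key metrics.

    Assumes a binary classification target, so that auc() is available
    on the returned metrics object.
    """
    perf = aml.leader.model_performance(test_frame)
    auc = perf.auc()
    return {
        "auc": auc,
        "gini": 2 * auc - 1,  # same GINI definition as earlier in the notebook
        "logloss": perf.logloss(),
    }


# Usage inside the notebook, after the AutoML cell has run:
# test_metrics = evaluate_leader(aml, h2o_test)
# print(test_metrics)
```

Comparing the returned test GINI with the baseline XGBoost test GINI from the performance exploration step gives a like-for-like view of what AutoML added.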

See the **Getting Started** - **Working_With_MLFlow_in_Keboola** notebook to learn how to register and deploy your model in Keboola for batch or real-time scoring.


In [None]:
# Shut down the local H2O cluster once you are done
h2o.cluster().shutdown(prompt=False)