## **H2O usage example**
Based on this [guide](https://towardsdatascience.com/artificial-intelligence-made-easy-187ecb90c299).

In [0]:
!pip install h2o

In [16]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Take a dataset from sklearn as an example
cancer = load_breast_cancer()
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns=np.append(cancer['feature_names'], ['target']))
print(df.head())
print("~" * 15)
print("Dataset columns:")
print(df.columns)

   mean radius  mean texture  ...  worst fractal dimension  target
0        17.99         10.38  ...                  0.11890     0.0
1        20.57         17.77  ...                  0.08902     0.0
2        19.69         21.25  ...                  0.08758     0.0
3        11.42         20.38  ...                  0.17300     0.0
4        20.29         14.34  ...                  0.07678     0.0

[5 rows x 31 columns]
~~~~~~~~~~~~~~~
Dataset columns:
Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness'

In [11]:
# For simplicity, pick 5 random columns (without replacement, so no column repeats)
random_cols = np.random.choice(df.columns.shape[0], size=5, replace=False)
chosen_cols = df.columns[random_cols]
print(chosen_cols)
data = df.loc[:, chosen_cols]

Index(['worst fractal dimension', 'concavity error', 'mean fractal dimension',
       'worst smoothness', 'perimeter error'],
      dtype='object')


In [19]:
import h2o

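# Save the data to CSV and load it back as an H2OFrame.
# Note: h2o.import_file needs a running H2O cluster, so run the h2o.init() cell below first.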
data.to_csv('drive/My Drive/automl/data-example.csv', index=False)

df = h2o.import_file("drive/My Drive/automl/data-example.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [20]:
# Start a local H2O server (or attach to one that is already running) and connect to it
h2o.init(nthreads=-1, max_mem_size=8)

Checking whether there is an H2O instance running at http://localhost:54321 . connected.

| Property | Value |
|---|---|
| H2O_cluster_uptime: | 24 mins 22 secs |
| H2O_cluster_timezone: | Etc/UTC |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.30.0.3 |
| H2O_cluster_version_age: | 16 days |
| H2O_cluster_name: | H2O_from_python_unknownUser_vi8njh |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 8.000 Gb |
| H2O_cluster_total_cores: | 2 |
| H2O_cluster_allowed_cores: | 2 |

In [0]:
y = 'perimeter error'
# list.remove() returns None, so build the predictor list explicitly
x = [col for col in df.columns if col != y]
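
A pitfall worth keeping in mind when building the predictor list: `list.remove` mutates the list in place and returns `None`, so assigning its result to `x` does not yield a list of column names. H2O appears to fall back to all non-response columns when `x` is `None`, which can hide the mistake. The sketch below uses plain Python lists with the five column names sampled earlier; `broken_x` is a hypothetical name used only for illustration.

```python
# Column names taken from the sampled frame above
columns = ['worst fractal dimension', 'concavity error',
           'mean fractal dimension', 'worst smoothness', 'perimeter error']
y = 'perimeter error'

# Pitfall: list.remove() mutates the list in place and returns None
broken_x = list(columns).remove(y)

# Explicit alternative: keep every column that is not the response
x = [col for col in columns if col != y]

print(broken_x)
print(x)
```

The list comprehension also makes it obvious which columns reach the model, which helps when the frame later gains extra columns.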

In [24]:
splits = df.split_frame(ratios=[0.7, 0.15], seed=1)
train = splits[0]
valid = splits[1]
test = splits[2]

print("Training set size:", train.nrow)
print("Validation set size:", valid.nrow)
print("Test set size:", test.nrow)

Training set size: 393
Validation set size: 89
Test set size: 87
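
The split sizes are not exactly 70/15/15 percent of the 569 rows (that would be roughly 398/85/85). This is expected: the H2O documentation describes `split_frame` as a probabilistic split rather than an exact one. The sketch below illustrates the general idea in plain Python; it is not H2O's actual implementation, and the function name `probabilistic_split` and the one-uniform-draw-per-row scheme are assumptions made for the illustration.

```python
import random

def probabilistic_split(n_rows, ratios, seed=1):
    """Assign each row index to a split by drawing one uniform number per row.

    `ratios` lists the fractions for all splits but the last; the remainder
    goes to the final split. Sizes are only approximately proportional.
    """
    rng = random.Random(seed)
    # Cumulative thresholds, e.g. roughly [0.7, 0.85] for ratios [0.7, 0.15]
    thresholds = []
    total = 0.0
    for r in ratios:
        total += r
        thresholds.append(total)
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        u = rng.random()
        # The first threshold above u decides the split; otherwise use the last split
        idx = next((i for i, t in enumerate(thresholds) if u < t), len(ratios))
        splits[idx].append(row)
    return splits

train_idx, valid_idx, test_idx = probabilistic_split(569, [0.7, 0.15])
print(len(train_idx), len(valid_idx), len(test_idx))
```

Every row lands in exactly one split, but the counts fluctuate around the requested ratios from seed to seed.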


In [0]:
# A simple baseline model (random forest)
from h2o.estimators.random_forest import H2ORandomForestEstimator
rf = H2ORandomForestEstimator(seed=1)

In [26]:
import time
start = time.time()
rf.train(x=x, y=y, training_frame=train)
end = time.time()
print("Training took %.2f seconds." % (end - start))

drf Model Build progress: |███████████████████████████████████████████████| 100%
Training took 1.78 seconds.
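
`time.time()` is fine for a rough measurement like this one. For interval timing the standard library also offers `time.perf_counter()`, a monotonic high-resolution clock. A minimal sketch follows, assuming a hypothetical helper named `timed` and a stand-in workload in place of `rf.train(...)`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the wrapped block took, using a monotonic clock."""
    start = time.perf_counter()
    result = {}
    try:
        yield result
    finally:
        result['seconds'] = time.perf_counter() - start
        print("%s took %.2f seconds." % (label, result['seconds']))

# Stand-in workload; in the notebook this would be the rf.train(...) call
with timed("Training") as t:
    sum(i * i for i in range(100_000))
```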


In [27]:
y_hat = rf.predict(test_data=test)

rf_performance = rf.model_performance(test)
print(rf_performance)

drf prediction progress: |████████████████████████████████████████████████| 100%

ModelMetricsRegression: drf
** Reported on test data. **

MSE: 1.8549567888337524
RMSE: 1.3619679837770609
MAE: 0.987156408545615
RMSLE: 0.3076747905658349
Mean Residual Deviance: 1.8549567888337524
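
The reported metrics are related in simple ways: RMSE is the square root of MSE, and in this report Mean Residual Deviance has the same value as MSE. The sketch below spells out the standard definitions in plain Python. The helper name `regression_metrics` and the sample values are made up for illustration and are not taken from the model.

```python
import math

def regression_metrics(actual, predicted):
    """Standard definitions of MSE, RMSE, MAE and RMSLE for non-negative targets."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # RMSLE compares log(1 + value), so it reflects relative rather than absolute error
    rmsle = math.sqrt(
        sum((math.log1p(a) - math.log1p(p)) ** 2 for a, p in zip(actual, predicted)) / n
    )
    return {'MSE': mse, 'RMSE': math.sqrt(mse), 'MAE': mae, 'RMSLE': rmsle}

# Made-up values, for illustration only
actual = [2.0, 3.5, 1.0, 4.0]
predicted = [2.5, 3.0, 1.5, 3.0]
metrics = regression_metrics(actual, predicted)
print(metrics)

# Consistency check on the reported numbers: RMSE should be the square root of MSE
print(math.isclose(math.sqrt(1.8549567888337524), 1.3619679837770609, rel_tol=1e-9))
```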



In [31]:
from h2o.automl import H2OAutoML

aml = H2OAutoML(max_models=5, max_runtime_secs=300, seed=1)
aml.train(x=x, y=y, training_frame=train)

AutoML progress: |████████████████████████████████████████████████████████| 100%


In [32]:
lb = aml.leaderboard
lb.head(rows=lb.nrows)

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
|---|---|---|---|---|---|
| XGBoost_2_AutoML_20200530_091146 | 3.03653 | 1.74256 | 3.03653 | 1.1191 | 0.350986 |
| StackedEnsemble_BestOfFamily_AutoML_20200530_091146 | 3.12438 | 1.76759 | 3.12438 | 1.09152 | 0.330789 |
| StackedEnsemble_AllModels_AutoML_20200530_091146 | 3.15944 | 1.77748 | 3.15944 | 1.09337 | 0.33025 |
| DRF_1_AutoML_20200530_091146 | 3.24955 | 1.80265 | 3.24955 | 1.09593 | 0.332442 |
| XGBoost_3_AutoML_20200530_091146 | 3.39513 | 1.84259 | 3.39513 | 1.16198 | 0.360832 |
| XGBoost_1_AutoML_20200530_091146 | 3.76225 | 1.93965 | 3.76225 | 1.23811 | 0.386427 |
| GLM_1_AutoML_20200530_091146 | 3.8513 | 1.96247 | 3.8513 | 1.20467 | 0.362027 |


