In this notebook, several regression models from scikit-learn are tried. Each model is trained on X_train_observed and then validated on X_train_estimated, which is a single hold-out split rather than k-fold cross-validation. (There is no particular reason for dividing the data into training and test parts this way; it was simply the easiest option.)

This validation is run only on dataset A. The model with the lowest mean absolute error (MAE) is then chosen and used for the other datasets.
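
The selection procedure described above can be sketched as follows. This is only a minimal illustration on synthetic data, not the notebook's actual pipeline: the arrays, the candidate list and the helper name `pick_best_by_mae` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor


def pick_best_by_mae(models, X_fit, y_fit, X_val, y_val):
    """Fit every candidate on (X_fit, y_fit) and score it on the hold-out set.

    Returns the name of the model with the lowest MAE and all scores.
    """
    scores = {}
    for name, model in models.items():
        model.fit(X_fit, y_fit)
        scores[name] = mean_absolute_error(y_val, model.predict(X_val))
    best_name = min(scores, key=scores.get)
    return best_name, scores


# Synthetic stand-in data (hypothetical, only to make the sketch runnable).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=300)

# The first 240 rows play the role of X_train_observed,
# the remaining 60 rows the role of X_train_estimated.
X_fit, X_val = X[:240], X[240:]
y_fit, y_val = y[:240], y[240:]

candidates = {
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "elastic_net": ElasticNet(alpha=0.01, l1_ratio=0.5),
}
best_name, scores = pick_best_by_mae(candidates, X_fit, y_fit, X_val, y_val)
print(best_name, scores)
```

Keeping all scores in one dictionary makes it easy to compare the candidates side by side before committing to one of them.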


In [2]:
%load_ext autoreload

In [3]:
%autoreload

# load libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR
import h2o
from h2o.automl import H2OAutoML
import sklearn.model_selection  # binds the name `sklearn` and loads the submodule used for train_test_split

# load my custom functions
from solutions.few_regression_types import data_preprocess


In [68]:
# read dataset A
# for simplicity, X_train_estimated is used as hold-out test data for validation
y = pd.read_parquet("../../dataset/A/train_targets.parquet")
X_train = pd.read_parquet("../../dataset/A/X_train_observed.parquet")
X_test = pd.read_parquet("../../dataset/A/X_train_estimated.parquet")

In [4]:
# preprocess the data
X_train, y_train = data_preprocess.preprocess_train_data(X_train, y, "everything")
X_test, y_test = data_preprocess.preprocess_train_data(X_test, y, "everything")

In [5]:
print(f"X_train.shape = {X_train.shape}")
print(f"X_test.shape = {X_test.shape}")
print(f"y_train.shape = {y_train.shape}")
print(f"y_test.shape = {y_test.shape}")

X_train.shape = (29667, 47)
X_test.shape = (4394, 47)
y_train.shape = (29667, 1)
y_test.shape = (4394, 1)


## Machine learning methods

In [6]:
# decision tree
decision_tree = DecisionTreeRegressor()
decision_tree.fit(X_train, y_train)
y_pred_tree = decision_tree.predict(X_test)

mae_tree = np.mean(np.abs(np.array(y_test) - y_pred_tree))
mae_tree    # MAE = mean absolute error; an earlier run gave mae_tree = 616.575890061115

624.4372789807927
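
A shape detail worth checking when MAE is computed by hand: the output above shows `y_test.shape = (4394, 1)`, whereas scikit-learn's `predict` returns a one-dimensional array for a single target. Subtracting an `(n,)` array from an `(n, 1)` array broadcasts to an `(n, n)` matrix of all pairwise differences, so its mean is not the per-sample MAE. The sketch below uses small made-up arrays (not the notebook's data) to show the difference and one way to avoid it.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up values: a column-shaped target, like a one-column DataFrame converted
# with np.array(), and a flat prediction vector, like the result of predict().
y_true_col = np.array([[0.0], [10.0], [20.0]])   # shape (3, 1)
y_pred_flat = np.array([1.0, 9.0, 22.0])         # shape (3,)

# Broadcasting turns the difference into a (3, 3) matrix of pairwise errors.
pairwise = np.abs(y_true_col - y_pred_flat)
mae_broadcast = pairwise.mean()

# Flattening the target first gives the usual element-wise MAE.
mae_elementwise = np.mean(np.abs(y_true_col.ravel() - y_pred_flat))

# sklearn's helper accepts both shapes and returns the element-wise value.
mae_sklearn = mean_absolute_error(y_true_col, y_pred_flat)

print(pairwise.shape, mae_broadcast, mae_elementwise, mae_sklearn)
```

If the two numbers differ on the real data, using `y_test.values.ravel()` or `mean_absolute_error` in the cells of this section would give the element-wise value.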

In [7]:
# random forest
random_forest = RandomForestRegressor(n_estimators=100)
random_forest.fit(X_train, y_train.values.ravel()) # ravel() flattens y to 1-D, which avoids scikit-learn's DataConversionWarning
y_pred_forest = random_forest.predict(X_test)

mae_forest = np.mean(np.abs(np.array(y_test) - y_pred_forest))
mae_forest  # 599.9553060312836

600.5727608942257

In [8]:
# gradient boosting
gradient_boosting = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1) 
gradient_boosting.fit(X_train, y_train.values.ravel())
y_pred_grad = gradient_boosting.predict(X_test)

mae_grad = np.mean(np.abs(np.array(y_test) - y_pred_grad))
mae_grad    # 592.2928322998536

592.2928322998534

In [9]:
# elastic net
elastic_net = ElasticNet(alpha=1.0, l1_ratio=0.5, random_state=42)
elastic_net.fit(X_train, y_train.values.ravel())
y_pred_elast_net = elastic_net.predict(X_test)

mae_elast_net = np.mean(np.abs(np.array(y_test) - y_pred_elast_net))
mae_elast_net   # 599.946498024572

599.9464980245721

In [10]:
# support vector regression
svr_model = SVR(kernel='rbf', C=1.)
svr_model.fit(X_train, y_train.values.ravel())
y_pred_svr = svr_model.predict(X_test)

mae_svr = np.mean(np.abs(np.array(y_test) - y_pred_svr))
mae_svr

473.76274242555763

In [11]:
# hyperparameter tuning of SVR
# no need to run this section, it takes too long; the results are approx.: 
# [345, 362, 397, 437, 473, 497, 513] 
# lower C gives us better results
for C in [0.001, 0.03, 0.1, 0.3, 1, 3, 10]:
    svr_model = SVR(kernel='rbf', C=C)
    svr_model.fit(X_train, y_train.values.ravel())
    y_pred_svr = svr_model.predict(X_test)

    print(np.mean(np.abs(np.array(y_test) - y_pred_svr)), end=", ")

345.1870758178765, 362.3174570223702, 397.974746466719, 437.0422504854231, 473.76274242555763, 497.00397971674676, 513.0037055004162, 
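
The manual loop above can also be expressed with scikit-learn's search utilities. The sketch below swaps in `GridSearchCV` with a `PredefinedSplit`, so that every value of `C` is scored on the same fixed hold-out rows. It is an alternative formulation, not what the notebook ran: the data is synthetic and the names `X_fit`, `X_val` and `test_fold` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVR

# Synthetic stand-in data (hypothetical; the notebook uses dataset A instead).
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(200, 3))
y_fit = X_fit[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)
X_val = rng.normal(size=(50, 3))
y_val = X_val[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)

# Stack both parts and mark which rows may only be used for fitting (-1)
# and which rows form the single validation fold (0).
X_all = np.vstack([X_fit, X_val])
y_all = np.concatenate([y_fit, y_val])
test_fold = np.concatenate([
    np.full(len(X_fit), -1),
    np.zeros(len(X_val), dtype=int),
])

search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [0.001, 0.03, 0.1, 0.3, 1, 3, 10]},
    scoring="neg_mean_absolute_error",  # sklearn maximises, hence the negated MAE
    cv=PredefinedSplit(test_fold),
    refit=False,  # keep the sketch cheap; refit separately if a final model is needed
)
search.fit(X_all, y_all)

best_C = search.best_params_["C"]
best_mae = -search.best_score_
print(best_C, best_mae)
```

The per-value scores are available in `search.cv_results_`, which replaces the comma-separated printout of the loop.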

In [90]:
# H2O AutoML
y = pd.read_parquet("../../dataset/A/train_targets.parquet")
X = pd.concat([
    pd.read_parquet("../../dataset/A/X_train_observed.parquet"),
    pd.read_parquet("../../dataset/A/X_train_estimated.parquet")
], ignore_index=True)
X_test = pd.read_parquet("../../dataset/A/X_train_estimated.parquet")

# Preprocess the data: attach the target to each feature row by timestamp
X_y_bacon = X.merge(y, left_on = "date_forecast", right_on = "time")
X_y_bacon = X_y_bacon.dropna(subset = ["pv_measurement"])

columns_to_drop = ["snow_drift:idx", "elevation:m", 'snow_melt_10min:mm', 'fresh_snow_12h:cm',
                   'fresh_snow_3h:cm',  'fresh_snow_6h:cm', 'wind_speed_w_1000hPa:ms', 
                   'snow_water:kgm2', 'snow_density:kgm3', 'fresh_snow_1h:cm', 'fresh_snow_24h:cm', 
                   'wind_speed_v_10m:ms', 'wind_speed_u_10m:ms', 'wind_speed_10m:ms',
                   'msl_pressure:hPa', 'pressure_100m:hPa', 'pressure_50m:hPa', 'dew_or_rime:idx', 
                   'date_forecast', 'clear_sky_energy_1h:J', 'diffuse_rad_1h:J', 'direct_rad_1h:J', 
                   'dew_point_2m:K','super_cooled_liquid_water:kgm2', 'ceiling_height_agl:m', 'cloud_base_agl:m', 'prob_rime:p',
                   'time','date_forecast']
X_y_bacon.drop(columns=columns_to_drop, inplace=True)

# Split data 
train,test = sklearn.model_selection.train_test_split(X_y_bacon, test_size = 0.25)

# Start (or connect to) a local H2O cluster
h2o.init(max_mem_size = "16G") # adjust max_mem_size to the RAM available

train_frame = h2o.H2OFrame(train)
test_frame = h2o.H2OFrame(test)

print("X_train_frame:",train_frame.shape)
print("X_test_frame:",test_frame.shape)

x = train_frame.columns[:-1] 
y = train_frame.columns[-1] 


aml = H2OAutoML(max_runtime_secs=60) # a seed can be passed here, e.g. H2OAutoML(..., seed=123456); train() does not accept one
aml.train(x=x, y=y, training_frame=train_frame, validation_frame=test_frame)


print(aml.leaderboard)

h2o.cluster().shutdown() # h2o.shutdown() is deprecated in favour of this call


Checking whether there is an H2O instance running at http://localhost:54321. connected.


H2O_cluster_uptime:          2 hours 27 mins
H2O_cluster_timezone:        Europe/Prague
H2O_data_parsing_timezone:   UTC
H2O_cluster_version:         3.44.0.1
H2O_cluster_version_age:     21 days, 8 hours and 54 minutes
H2O_cluster_name:            H2O_from_python_simon_oebwgv
H2O_cluster_total_nodes:     1
H2O_cluster_free_memory:     14.00 Gb
H2O_cluster_total_cores:     8
H2O_cluster_allowed_cores:   8


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
X_train_frame: (25545, 21)
X_test_frame: (8516, 21)
AutoML progress: |█
19:21:29.704: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
19:21:29.704: AutoML: XGBoost is not available; skipping it.

██████████████████████████████████████████████████████████████| (done) 100%
model_id                                                    rmse     mse      mae    rmsle    mean_residual_deviance
StackedEnsemble_AllModels_2_AutoML_7_20231106_192129     434.657  188926  196.063      nan                    188926
StackedEnsemble_AllModels_1_AutoML_7_20231106_192129     435.113  189324  194.854      nan    



H2O session _sid_ba80 closed.
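
The preprocessing in the cell above pairs each weather row with the target measured at the same timestamp, then drops rows without a measurement. A minimal sketch on made-up frames follows. The column names `date_forecast`, `time` and `pv_measurement` are taken from the code above; the feature column `direct_rad:W`, the timestamps and all values are invented for the illustration.

```python
import numpy as np
import pandas as pd

# Invented weather features, one row every 15 minutes.
features = pd.DataFrame({
    "date_forecast": pd.date_range("2023-06-01 10:00", periods=8, freq="15min"),
    "direct_rad:W": [100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 160.0, 170.0],
})

# Invented hourly targets; one measurement is missing.
targets = pd.DataFrame({
    "time": pd.to_datetime(["2023-06-01 10:00", "2023-06-01 11:00"]),
    "pv_measurement": [250.0, np.nan],
})

# Inner join: only feature rows whose timestamp has a matching target row survive.
merged = features.merge(targets, left_on="date_forecast", right_on="time")

# Remove rows where the target is missing, then drop the timestamp columns
# so that the target ends up as the last column.
merged = merged.dropna(subset=["pv_measurement"])
merged = merged.drop(columns=["date_forecast", "time"])

print(merged)
```

Because the target ends up as the last column, selecting `columns[:-1]` as features and `columns[-1]` as the response works as in the cell above.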


# Prediction on real test data


Among the scikit-learn models, SVR had the lowest mean absolute error on the hold-out set, and the smallest tested value, C = 0.001, gave the best result. So far the models have only been validated on the training data. SVR is now applied to the real test data of datasets A, B and C to generate the output CSV file.

In [None]:
prediction = []

for letter in ['A', 'B', 'C']:
    # read the data
    print(f"dataset {letter}")
    X_train = pd.concat([
        pd.read_parquet(f"../../dataset/{letter}/X_train_observed.parquet"),
        pd.read_parquet(f"../../dataset/{letter}/X_train_estimated.parquet")
    ], ignore_index=True)
    y_train = pd.read_parquet(f"../../dataset/{letter}/train_targets.parquet")
    X_test = pd.read_parquet(f"../../dataset/{letter}/X_test_estimated.parquet")
    # preprocess the data
    X_train, y_train = data_preprocess.preprocess_train_data(X_train, y_train, "everything")
    X_test = data_preprocess.preprocess_test_data(X_test, "everything")
    # fit the model and predict
    model = SVR(kernel='rbf', C=.001)
    model.fit(X_train, y_train.values.ravel())
    prediction = np.concatenate((prediction, model.predict(X_test)))
prediction[prediction < 0.] = 0. # energy production can't be negative
df = pd.DataFrame({'prediction': prediction})
df.to_csv('svr.csv', index_label='id')
print("done")
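
The post-processing at the end of the cell above (stacking the per-location predictions, clipping negatives and indexing rows by `id`) can be isolated in a small helper. This is a sketch under assumptions: the function name `build_submission` and the example numbers are invented, while the `prediction` column and the `id` index label follow the code above.

```python
import numpy as np
import pandas as pd


def build_submission(per_location_predictions):
    """Concatenate per-location predictions, clip negatives, and index rows by id."""
    stacked = np.concatenate(
        [np.asarray(p, dtype=float) for p in per_location_predictions]
    )
    clipped = np.clip(stacked, 0.0, None)  # energy production can't be negative
    submission = pd.DataFrame({"prediction": clipped})
    submission.index.name = "id"
    return submission


# Made-up predictions for three locations, in the order A, B, C.
example = build_submission([[12.5, -3.0], [0.0, 7.25], [-0.1]])
print(example)
```

Calling `example.to_csv("svr.csv")` would then write the index under the `id` header, matching `index_label='id'` in the cell above.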

AutoML applied to the real datasets A, B and C.

In [9]:
predictions = []  # one DataFrame of predictions per location

for letter in ['A', 'B', 'C']:
    # Read the data
    print(f"dataset {letter}")
    X = pd.concat([
        pd.read_parquet(f"../../dataset/{letter}/X_train_observed.parquet"),
        pd.read_parquet(f"../../dataset/{letter}/X_train_estimated.parquet")
    ], ignore_index=True)
    y = pd.read_parquet(f"../../dataset/{letter}/train_targets.parquet")
    X_test = pd.read_parquet(f"../../dataset/{letter}/X_test_estimated.parquet")
    
    # Preprocess the data: attach the target to each feature row by timestamp
    X_y_bacon = X.merge(y, left_on = "date_forecast", right_on = "time")
    X_y_bacon = X_y_bacon.dropna(subset = ["pv_measurement"])

    columns_to_drop = ["snow_drift:idx", "elevation:m", 'snow_melt_10min:mm', 'fresh_snow_12h:cm',
                       'fresh_snow_3h:cm',  'fresh_snow_6h:cm', 'wind_speed_w_1000hPa:ms', 
                       'snow_water:kgm2', 'snow_density:kgm3', 'fresh_snow_1h:cm', 'fresh_snow_24h:cm', 
                       'wind_speed_v_10m:ms', 'wind_speed_u_10m:ms', 'wind_speed_10m:ms',
                       'msl_pressure:hPa', 'pressure_100m:hPa', 'pressure_50m:hPa', 'dew_or_rime:idx', 
                       'clear_sky_energy_1h:J', 'diffuse_rad_1h:J', 'direct_rad_1h:J', 
                       'dew_point_2m:K','super_cooled_liquid_water:kgm2', 'ceiling_height_agl:m', 'cloud_base_agl:m', 'prob_rime:p',
                       'time','date_forecast']
    X_y_bacon.drop(columns=columns_to_drop, inplace=True)

    # Split data 
    train,test = sklearn.model_selection.train_test_split(X_y_bacon, test_size = 0.25)

    # Start (or connect to) a local H2O cluster
    h2o.init(max_mem_size = "16G") # adjust max_mem_size to the RAM available

    train_frame = h2o.H2OFrame(train)
    test_frame = h2o.H2OFrame(test)

    print("X_train_frame:",train_frame.shape)
    print("X_test_frame:",test_frame.shape)
    
    x = train_frame.columns[:-1]  # feature columns
    y = train_frame.columns[-1]   # target column (pv_measurement)
    x_test = h2o.H2OFrame(X_test)

    aml = H2OAutoML(max_runtime_secs = 60,
                    sort_metric = "MAE",
                    stopping_metric = "MAE",
                    seed = 313373)
    aml.train(x = x, 
              y = y,
              training_frame = train_frame,
              validation_frame = test_frame)
    
    print(aml.leaderboard)
    best_model = aml.get_best_model(criterion='MAE')
    prediction = best_model.predict(x_test)
    predictions.append(h2o.as_list(prediction))
    print("done")

# Write all locations to a single file; writing inside the loop would overwrite it on every pass
predictions_df = pd.concat(predictions, ignore_index=True)
predictions_df[predictions_df < 0.] = 0. # energy production can't be negative
predictions_df.to_csv('automl.csv', index_label='id')
h2o.cluster().shutdown() # h2o.shutdown() is deprecated in favour of this call

dataset A
Checking whether there is an H2O instance running at http://localhost:54321. connected.


H2O_cluster_uptime:          26 mins 30 secs
H2O_cluster_timezone:        Europe/Prague
H2O_data_parsing_timezone:   UTC
H2O_cluster_version:         3.44.0.1
H2O_cluster_version_age:     21 days, 10 hours and 42 minutes
H2O_cluster_name:            H2O_from_python_simon_aj0j1i
H2O_cluster_total_nodes:     1
H2O_cluster_free_memory:     13.87 Gb
H2O_cluster_total_cores:     8
H2O_cluster_allowed_cores:   8


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
X_train_frame: (25545, 21)
X_test_frame: (8516, 21)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |█
21:09:40.552: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
21:09:40.552: AutoML: XGBoost is not available; skipping it.

██████████████████████████████████████████████████████████████| (done) 100%
model_id                                                     mae     rmse     mse    rmsle    mean_residual_deviance
StackedEnsemble_AllModels_1_AutoML_5_20231106_210940     197.033  438.165  191989      nan                    191989

H2O_cluster_uptime:          30 mins 16 secs
H2O_cluster_timezone:        Europe/Prague
H2O_data_parsing_timezone:   UTC
H2O_cluster_version:         3.44.0.1
H2O_cluster_version_age:     21 days, 10 hours and 46 minutes
H2O_cluster_name:            H2O_from_python_simon_aj0j1i
H2O_cluster_total_nodes:     1
H2O_cluster_free_memory:     13.83 Gb
H2O_cluster_total_cores:     8
H2O_cluster_allowed_cores:   8


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
X_train_frame: (24614, 21)
X_test_frame: (8205, 21)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |█
21:13:27.240: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
21:13:27.240: AutoML: XGBoost is not available; skipping it.

██████████████████████████████████████████████████████████████| (done) 100%
model_id                                                     mae     rmse      mse      rmsle    mean_residual_deviance
StackedEnsemble_AllModels_3_AutoML_6_20231106_211327     37.9752  79.7832  6365.36  nan                         6

H2O_cluster_uptime:          34 mins 03 secs
H2O_cluster_timezone:        Europe/Prague
H2O_data_parsing_timezone:   UTC
H2O_cluster_version:         3.44.0.1
H2O_cluster_version_age:     21 days, 10 hours and 50 minutes
H2O_cluster_name:            H2O_from_python_simon_aj0j1i
H2O_cluster_total_nodes:     1
H2O_cluster_free_memory:     13.77 Gb
H2O_cluster_total_cores:     8
H2O_cluster_allowed_cores:   8


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
X_train_frame: (19553, 21)
X_test_frame: (6518, 21)
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |█
21:17:13.750: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
21:17:13.750: AutoML: XGBoost is not available; skipping it.

██████████████████████████████████████████████████████████████| (done) 100%
model_id                                                     mae     rmse      mse    rmsle    mean_residual_deviance
StackedEnsemble_AllModels_1_AutoML_7_20231106_211713     21.8546  54.1475  2931.95      nan                   2931.



H2O session _sid_abc6 closed.
