In this notebook, few regression types are tried using sklearn. They are trained on dataset X_train_observed and then crossvalidated on X_train_estimated (there is no particular reason behind dividing the dataset to training and testing part like this, I just decided it would be easiest way). 

This crossvalidation is tried just on dataset A. Then the solution with the lowest mean absolute error (mae) is chosen and used for the other datasets.


In [15]:
%load_ext autoreload

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [16]:
%autoreload

# load libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import ElasticNet
from sklearn.svm import SVR

# load my custom functions
from solutions.few_regression_types import data_preprocess


In [17]:
# read dataset A
# for simplicity, I use X_train_estimated as test data for cross validation
y = pd.read_parquet("../../dataset/A/train_targets.parquet")
X_train = pd.read_parquet("../../dataset/A/X_train_observed.parquet")
X_test = pd.read_parquet("../../dataset/A/X_train_estimated.parquet")

In [4]:
# edit data
X_train, y_train = data_preprocess.preprocess_train_data(X_train, y, "everything")
X_test, y_test = data_preprocess.preprocess_train_data(X_test, y, "everything")

In [5]:
print(f"X_train.shape = {X_train.shape}")
print(f"X_test.shape = {X_test.shape}")
print(f"y_train.shape = {y_train.shape}")
print(f"y_test.shape = {y_test.shape}")

X_train.shape = (29667, 47)
X_test.shape = (4394, 47)
y_train.shape = (29667, 1)
y_test.shape = (4394, 1)


## Machine learning methods

In [6]:
# decision tree
decision_tree = DecisionTreeRegressor()
decision_tree.fit(X_train, y_train)
y_pred_tree = decision_tree.predict(X_test)

mae_tree = np.mean(np.abs(np.array(y_test) - y_pred_tree))
mae_tree    # mae means mean absolute error, mae_tree = 616.575890061115

616.575890061115

In [7]:
# random forest
random_forest = RandomForestRegressor(n_estimators=100)
random_forest.fit(X_train, y_train.values.ravel()) # ravel part is because of scikit's data conversion warning, it does not have to be there
y_pred_forest = random_forest.predict(X_test)

mae_forest = np.mean(np.abs(np.array(y_test) - y_pred_forest))
mae_forest  # 599.9553060312836

599.9553060312836

In [8]:
# gradient boosting
gradient_boosting = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1) 
gradient_boosting.fit(X_train, y_train.values.ravel())
y_pred_grad = gradient_boosting.predict(X_test)

mae_grad = np.mean(np.abs(np.array(y_test) - y_pred_grad))
mae_grad    # 592.2928322998536

592.2928322998536

In [9]:
# elastic net
elastic_net = ElasticNet(alpha=1.0, l1_ratio=0.5, random_state=42)
elastic_net.fit(X_train, y_train.values.ravel())
y_pred_elast_net = elastic_net.predict(X_test)

mae_elast_net = np.mean(np.abs(np.array(y_test) - y_pred_elast_net))
mae_elast_net   # 599.946498024572

  model = cd_fast.enet_coordinate_descent(


599.946498024572

In [10]:
# support vector regression
svr_model = SVR(kernel='rbf', C=1.)
svr_model.fit(X_train, y_train.values.ravel())
y_pred_svr = svr_model.predict(X_test)

mae_svr = np.mean(np.abs(np.array(y_test) - y_pred_svr))
mae_svr

[LibSVM].......................
*
optimization finished, #iter = 23992
obj = -19923.080123, rho = -23.964617
nSV = 29662, nBSV = 29662


345.1870758178765

In [11]:
# hyperparameter tuning of SVR
# no need to run this section, it takes too long; the results are approx.: 
# [345, 362, 397, 437, 473, 497, 513] 
# lower C gives us better results
for C in [0.001, 0.03, 0.1, 0.3, 1, 3, 10]:
    svr_model = SVR(kernel='rbf', C=C)
    svr_model.fit(X_train, y_train.values.ravel())
    y_pred_svr = svr_model.predict(X_test)

    print(np.mean(np.abs(np.array(y_test) - y_pred_svr)), end=", ")

345.1870758178765, 362.3174570223702, 397.974746466719, 437.0422504854231, 473.76274242555763, 497.00397971674676, 513.0037055004162, 

# Prediction on real test data


SVR model came out with the lowest mean absolute error. So far we did only cross validation on the training data. SVR will be used on the real test data, on the datasets B and C and to generate the output csv file.

In [21]:
prediction = []

for letter in ['A', 'B', 'C']:
    # read the data
    print(f"dataset {letter}")
    X_train = pd.concat([
        pd.read_parquet(f"../../dataset/{letter}/X_train_observed.parquet"),
        pd.read_parquet(f"../../dataset/{letter}/X_train_estimated.parquet")
    ], ignore_index=True)
    y_train = pd.read_parquet(f"../../dataset/{letter}/train_targets.parquet")
    X_test = pd.read_parquet(f"../../dataset/{letter}/X_test_estimated.parquet")
    # preprocess the data
    X_train, y_train = data_preprocess.preprocess_train_data(X_train, y_train, "everything")
    X_test = data_preprocess.preprocess_test_data(X_test, "everything")
    # learn 
    model = SVR(kernel='rbf', C=.001)
    model.fit(X_train, y_train.values.ravel())
    prediction = np.concatenate((prediction, model.predict(X_test)))
prediction[prediction < 0.] = 0. # energy production can't be negative
df = pd.DataFrame({'prediction': prediction})
df.to_csv('svr.csv', index_label='id')
print("done")

dataset A
dataset B
dataset C
