# AutoML Regressor

This is a component that trains an AutoML Regressor model using [auto-sklearn](https://github.com/automl/auto-sklearn). 
<br>
auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.

This notebook shows:
- how to use the SDK to load the dataset and save a model.
- how to receive parameters from the platform.

In [None]:
dataset = "boston" #@param {type:"string"}
target = "medv" #@param {type:"string"}
experiment_id = "6db0fff7-ba9d-4f64-8cbe-9a31bd8b3644" #@param {type:"string"}
operator_id = "1fbe7220-0b4b-4eb6-aba2-b8afec49250d" #@param {type:"string"}
duration = 60 #@param {type:"integer"}

## Load dataset

Import and put the whole dataset in a pandas.DataFrame.

In [None]:
from platiagro import load_dataset

df = load_dataset(name=dataset)
X = df.drop(target, axis=1).to_numpy()
y = df[target].to_numpy()

## Load metadata about the dataset
For example, the cell below gets the feature type of each column in the dataset (e.g. categorical, numerical, or datetime).

In [None]:
import numpy as np
from platiagro import stat_dataset
from platiagro.featuretypes import infer_featuretypes

try:
    metadata = stat_dataset(name=dataset)
    featuretypes = metadata["featuretypes"]
except KeyError:
    featuretypes = infer_featuretypes(df)

featuretypes = np.array(featuretypes)

## Replace NaN values
Remove features whose values are all NA.<br>
If some values are missing, then use the mean for numerical features, and the mode for categorical features.
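
As a minimal sketch of the rule above, the toy example below applies the same three steps to a small, made-up DataFrame (the column names `empty`, `age` and `color` are hypothetical and not part of the dataset used in this notebook).

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): one all-NA column, one numerical, one categorical.
toy = pd.DataFrame({
    "empty": [np.nan, np.nan, np.nan, np.nan],
    "age": [20.0, np.nan, 40.0, 30.0],
    "color": ["red", "blue", None, "red"],
})

# Drop columns in which every value is missing.
toy = toy.dropna(axis="columns", how="all")

# Numerical column: fill gaps with the column mean.
toy["age"] = toy["age"].fillna(toy["age"].mean())

# Categorical column: fill gaps with the most frequent value (the mode).
toy["color"] = toy["color"].fillna(toy["color"].mode().iloc[0])

print(toy)
```

The missing `age` becomes 30.0 (the mean of 20, 40 and 30) and the missing `color` becomes `red`, the most frequent value.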

In [None]:
na_free = df.dropna(axis="columns", how="all")
only_na = df.loc[:, ~df.columns.isin(na_free.columns)]

featuretypes = featuretypes[df.columns.isin(na_free.columns)]
df = na_free

In [None]:
from platiagro.featuretypes import CATEGORICAL, NUMERICAL

numerical_indexes = (featuretypes == NUMERICAL)
numerical_nan_replacement = df.iloc[:, numerical_indexes].mean(axis=0)
df.fillna(numerical_nan_replacement, inplace=True)

categorical_indexes = (featuretypes == CATEGORICAL)
categorical_nan_replacement = df.iloc[:, categorical_indexes].mode(axis=0).iloc[0]
df.fillna(categorical_nan_replacement, inplace=True)

In [None]:
X = df.drop(target, axis=1).to_numpy()
columns = df.columns.to_numpy()
target_index = np.argwhere(columns == target)
columns = np.delete(columns, target_index)
featuretypes = np.delete(featuretypes, target_index)

## Remove datetime features
Datetime columns require separate preprocessing or feature extraction steps.
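
This component simply drops datetime columns. For reference, one possible feature extraction step is sketched below with the pandas `.dt` accessor; it is an illustration only, not something this notebook does, and the `sold_at` column and derived names are hypothetical.

```python
import pandas as pd

# Hypothetical column of timestamps stored as strings.
toy = pd.DataFrame({"sold_at": ["2020-01-15", "2020-06-30", "2021-12-25"]})

# Parse the strings, then derive plain numeric features from the timestamps.
sold_at = pd.to_datetime(toy["sold_at"])
toy["sold_year"] = sold_at.dt.year
toy["sold_month"] = sold_at.dt.month
toy["sold_dayofweek"] = sold_at.dt.dayofweek  # Monday=0, Sunday=6

# Drop the raw datetime column once the numeric features exist.
toy = toy.drop(columns=["sold_at"])

print(toy)
```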

In [None]:
from platiagro.featuretypes import DATETIME

datetime_indexes = (featuretypes == DATETIME)
X = X[:, np.where(~datetime_indexes)[0]]
featuretypes = np.delete(featuretypes, np.where(datetime_indexes))

## Encode categorical features

Many machine learning algorithms cannot operate on categorical data directly. They require all input variables and output variables to be numeric.<br>
This means that categorical data must be converted to a numerical form.<br>
The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.
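
A small sketch of what ordinal encoding produces, using a made-up `colors` feature (hypothetical data, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical feature with three distinct values.
colors = np.array([["red"], ["green"], ["blue"], ["green"]], dtype=object)

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(colors)

# Categories are sorted before integers are assigned,
# so blue -> 0, green -> 1, red -> 2.
print(encoder.categories_[0])
print(encoded.ravel())
```

Note that the integers impose an order (blue < green < red) that the original categories may not have; whether this matters depends on the model trained afterwards.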

In [None]:
from sklearn.preprocessing import OrdinalEncoder

categorical_indexes = (featuretypes == CATEGORICAL)
feature_encoder = OrdinalEncoder()

if categorical_indexes.any():
    X[:, categorical_indexes] = feature_encoder.fit_transform(X[:, categorical_indexes])

## Split dataset into train/test splits

Training Dataset: the sample of data used to fit the model.

Test Dataset: the sample of data used to provide an unbiased evaluation of a model fit on the training dataset.
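
To illustrate the split sizes, the sketch below uses ten made-up samples with `train_size=0.7`. The `random_state` argument is an addition for reproducibility of this example only; the notebook's own split does not set it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# 70% of the rows go to training, the remaining 30% to testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=0
)

print(X_tr.shape, X_te.shape)
```

Every sample lands in exactly one of the two sets, so the test rows are never seen during fitting.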

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

## Fit a model using autosklearn.regression.AutoSklearnRegressor

In [None]:
from autosklearn.regression import AutoSklearnRegressor

estimator = AutoSklearnRegressor(
    time_left_for_this_task=duration,
    per_run_time_limit=duration,
)
estimator.fit(X_train, y_train, feat_type=featuretypes)
estimator.refit(X_train, y_train)

## Measure the performance

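R² (the coefficient of determination) measures how much of the variance in the observed outcome values is explained by the model's predictions. The best possible score is 1.0; a model that always predicts the mean of the observed values scores 0.0, and a model that does worse than that gets a negative score.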
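
To make the metric concrete, the sketch below computes the coefficient of determination by hand as 1 - SS_res / SS_tot, which is the definition documented for `sklearn.metrics.r2_score`. The helper name `r2` and the sample values are hypothetical.

```python
import numpy as np

def r2(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_obs = [1.0, 2.0, 3.0, 4.0]

perfect = r2(y_obs, y_obs)                     # predictions match exactly
mean_only = r2(y_obs, [2.5, 2.5, 2.5, 2.5])    # always predict the mean
reversed_ = r2(y_obs, [4.0, 3.0, 2.0, 1.0])    # worse than the mean

print(perfect, mean_only, reversed_)
```

The three cases give 1.0, 0.0 and -3.0, which shows that the score on held-out data is not bounded below by zero.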

In [None]:
from sklearn.metrics import r2_score

# uses the model to make predictions on the Test Dataset
y_pred = estimator.predict(X_test)

# computes R²
r2 = r2_score(y_test, y_pred)

## Save metrics

Record the metrics used to evaluate the model.<br>
It's a good way to document the experiments, and it also helps avoid running the same experiment twice.

In [None]:
from platiagro import save_metrics

save_metrics(experiment_id=experiment_id, operator_id=operator_id, r2_score=r2)

## Save figure

Record a matplotlib figure to document the experiment.

In [None]:
import numpy as np
import seaborn as sns
from platiagro import save_figure
from scipy.stats import gaussian_kde

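# The relative error is undefined when a target is zero,
# so fall back to the absolute error in that case.
abs_err = bool(np.any(y_test == 0))
if abs_err:
    err = y_pred - y_test
else:
    err = (y_pred - y_test) / y_test

# Estimate the error density on an evenly spaced grid.
kde = gaussian_kde(err)
x_err = np.linspace(err.min(), err.max(), 1000)
p_err = kde(x_err)

# Plot the estimated density against the error values.
ax = sns.lineplot(x=x_err, y=p_err)
ax.set_xlabel("Absolute error" if abs_err else "Relative error")
ax.set_ylabel("Density")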

save_figure(experiment_id=experiment_id, operator_id=operator_id, figure=ax.figure)

## Save model

Stores the model artifacts in object storage.<br>
It will make the model available for future deployments.
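
The preprocessing artifacts are stored next to the estimator so that new rows can be transformed the same way at prediction time. The sketch below shows one way they might be applied, in the same order as this notebook (select columns, fill NaN, drop datetime columns, encode categoricals). The `preprocess` helper and the toy data are hypothetical, and the platform's actual deployment code may differ.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

def preprocess(df, artifacts):
    """Apply stored preprocessing artifacts to new rows (hypothetical helper)."""
    df = df[list(artifacts["columns"])]
    df = df.fillna(artifacts["numerical_nan_replacement"])
    df = df.fillna(artifacts["categorical_nan_replacement"])
    X = df.to_numpy(dtype=object)
    # Drop datetime columns, then encode the remaining categorical ones.
    X = X[:, ~artifacts["datetime_indexes"]]
    cat = artifacts["categorical_indexes"]
    if cat.any():
        X[:, cat] = artifacts["feature_encoder"].transform(X[:, cat])
    return X

# Toy artifacts standing in for the values saved above (made-up data).
train = pd.DataFrame({"size": [1.0, 2.0, 3.0], "kind": ["a", "b", "a"]})
artifacts = {
    "columns": np.array(["size", "kind"]),
    "datetime_indexes": np.array([False, False]),
    "categorical_indexes": np.array([False, True]),
    "numerical_nan_replacement": pd.Series({"size": 2.0}),
    "categorical_nan_replacement": pd.Series({"kind": "a"}),
    "feature_encoder": OrdinalEncoder().fit(train[["kind"]].to_numpy()),
}

new_rows = pd.DataFrame({"size": [np.nan, 5.0], "kind": ["b", None]})
X_new = preprocess(new_rows, artifacts)
print(X_new)
```

The resulting array could then be passed to the stored estimator's `predict` method.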

In [None]:
from platiagro import save_model

save_model(experiment_id=experiment_id,
           model={"estimator": estimator,
                  "feature_encoder": feature_encoder,
                  "columns": columns,
                  "datetime_indexes": datetime_indexes,
                  "categorical_indexes": categorical_indexes,
                  "numerical_nan_replacement": numerical_nan_replacement,
                  "categorical_nan_replacement": categorical_nan_replacement})