# Support Vector Regression

This is a component that trains a Support Vector Regression model using [Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html). 
<br>
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

This notebook shows:
- how to use the [SDK](https://platiagro.github.io/sdk/) to load datasets, save models and other artifacts.
- how to declare parameters and use them to build reusable components.

## Declare parameters
Components may declare (and use) these default parameters:
- dataset
- target
- experiment_id
- operator_id

Use these parameters to load/save datasets, models, metrics, and figures with the help of [PlatIAgro SDK](https://platiagro.github.io/sdk/).

You may also declare custom parameters to set when running an experiment.
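
As a hedged illustration, custom parameters could follow the same annotation style as the default ones. The names `kernel` and `C` below are hypothetical (this notebook does not declare them), and the sketch only shows how such values might be forwarded to the estimator.

```python
from sklearn.svm import SVR

# Hypothetical custom parameters, annotated like the default ones
kernel = "rbf" #@param {type:"string"}
C = 1.0 #@param {type:"number"}

# Forward the parameter values to the estimator constructor
example_estimator = SVR(kernel=kernel, C=C, gamma="auto")
```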

In [None]:
dataset = "boston" #@param {type:"string"}
target = "medv" #@param {type:"string"}
experiment_id = "50ecf2eb-b805-4629-aea4-66053ed9ec8c" #@param {type:"string"}
operator_id = "104740b1-82c1-4321-9bd3-b2ef931f56f1" #@param {type:"string"}

## Load dataset

Load the whole dataset into a `pandas.DataFrame`, then split it into the feature matrix `X` and the target vector `y`.

In [None]:
from platiagro import load_dataset

df = load_dataset(name=dataset)
X = df.drop(target, axis=1).to_numpy()
y = df[target].to_numpy()

## Load metadata about the dataset
For example, the cell below gets the feature type of each column in the dataset (e.g. categorical, numerical, or datetime).
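
To make the idea of a feature type concrete, here is a rough sketch that guesses types from pandas dtypes. The helper `guess_featuretypes` and its lowercase labels are hypothetical; this is not how the SDK's `infer_featuretypes` is implemented, and the labels are not the SDK constants.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def guess_featuretypes(frame):
    """Hypothetical dtype-based guess of each column's feature type."""
    types = []
    for column in frame.columns:
        if is_datetime64_any_dtype(frame[column]):
            types.append("datetime")
        elif is_numeric_dtype(frame[column]):
            types.append("numerical")
        else:
            # Anything else (strings, objects) is treated as categorical
            types.append("categorical")
    return types


toy_frame = pd.DataFrame({
    "age": [31, 45],
    "city": ["Campinas", "Recife"],
    "seen": pd.to_datetime(["2020-01-01", "2020-01-02"]),
})
toy_types = guess_featuretypes(toy_frame)
```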

In [None]:
import numpy as np
from platiagro import stat_dataset
from platiagro.featuretypes import infer_featuretypes

try:
    metadata = stat_dataset(name=dataset)
    featuretypes = metadata["featuretypes"]
except KeyError:
    featuretypes = infer_featuretypes(df)

featuretypes = np.array(featuretypes)

## Replace NaN values
Remove features whose values are all NA.<br>
If some values are missing, then use the mean for numerical features, and the mode for categorical features.
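
A small sketch on made-up data may help show the rule in isolation. The frame `toy` and its column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame: "all_na" is entirely missing, the other columns are partially missing
toy = pd.DataFrame({
    "height": [1.0, np.nan, 3.0],
    "color": ["red", None, "red"],
    "all_na": [np.nan, np.nan, np.nan],
})

# Drop columns in which every value is missing
toy = toy.dropna(axis="columns", how="all")

# Numerical column: fill gaps with the mean
toy["height"] = toy["height"].fillna(toy["height"].mean())
# Categorical column: fill gaps with the mode (most frequent value)
toy["color"] = toy["color"].fillna(toy["color"].mode().iloc[0])
```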

In [None]:
na_free = df.dropna(axis="columns", how="all")
only_na = df.loc[:, ~df.columns.isin(na_free.columns)]

featuretypes = featuretypes[df.columns.isin(na_free.columns)]
df = na_free

In [None]:
from platiagro.featuretypes import CATEGORICAL, NUMERICAL

numerical_indexes = (featuretypes == NUMERICAL)
numerical_nan_replacement = df.iloc[:, numerical_indexes].mean(axis=0)
df.fillna(numerical_nan_replacement, inplace=True)

categorical_indexes = (featuretypes == CATEGORICAL)
categorical_nan_replacement = df.iloc[:, categorical_indexes].mode(axis=0).iloc[0]
df.fillna(categorical_nan_replacement, inplace=True)

In [None]:
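# Rebuild X from the imputed DataFrame
X = df.drop(target, axis=1).to_numpy()
columns = df.columns.to_numpy()
# Drop the target from the column names and feature types so both align with X
target_index = np.argwhere(columns == target)
columns = np.delete(columns, target_index)
featuretypes = np.delete(featuretypes, target_index)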

## Remove datetime features
Datetime columns require separate preprocessing or feature extraction steps.
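
This notebook simply drops datetime columns. As one possible feature extraction step (an assumption, not something this component does), calendar fields could be derived with pandas before the column is removed. The frame `events` and its column names are made up.

```python
import pandas as pd

# Toy frame with a single datetime column (illustrative data only)
events = pd.DataFrame({"when": pd.to_datetime(["2020-01-15", "2021-07-04"])})

# Derive numerical features from the datetime column, then drop the original
events["year"] = events["when"].dt.year
events["month"] = events["when"].dt.month
events["dayofweek"] = events["when"].dt.dayofweek  # Monday=0 ... Sunday=6
events = events.drop(columns="when")
```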

In [None]:
from platiagro.featuretypes import DATETIME

datetime_indexes = (featuretypes == DATETIME)
# Keep only the non-datetime columns; the same boolean mask applies to X and featuretypes
X = X[:, ~datetime_indexes]
featuretypes = featuretypes[~datetime_indexes]

## Encode categorical features

Many machine learning algorithms cannot operate on categorical data directly. They require all input variables and output variables to be numeric.<br>
This means that categorical data must be converted to a numerical form.<br>
The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.
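
A minimal sketch of the encoding on a single made-up feature follows. Note that `OrdinalEncoder` assigns integers by the sorted order of the category values by default, which does not necessarily reflect any meaningful ranking.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# One categorical feature with three distinct categories
sizes = np.array([["small"], ["large"], ["medium"], ["small"]], dtype=object)

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(sizes)

# Categories are sorted alphabetically here: large=0, medium=1, small=2
learned_categories = list(encoder.categories_[0])
```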

In [None]:
from sklearn.preprocessing import OrdinalEncoder

categorical_indexes = (featuretypes == CATEGORICAL)
feature_encoder = OrdinalEncoder()

if categorical_indexes.any():
    X[:, categorical_indexes] = feature_encoder.fit_transform(X[:, categorical_indexes])

## Split dataset into train/test splits

Training Dataset: the sample of data used to fit the model.

Test Dataset: the sample of data used to provide an unbiased evaluation of a model fit on the training dataset.
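
The sketch below shows the split on made-up arrays. Passing `random_state` is an addition for reproducibility; the cell in this notebook does not set it, so its split differs from run to run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# 70% of the rows go to training, the remaining 30% to testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, random_state=0
)
```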

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

## Fit a model using sklearn.svm.SVR

In [None]:
from sklearn.svm import SVR

estimator = SVR(gamma="auto")
estimator.fit(X_train, y_train)    

## Measure the performance

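R² (the coefficient of determination) is the proportion of the variance in the observed values that the model's predictions explain. It is at most 1, and it is negative when the model performs worse than always predicting the mean of the observed values.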
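
As a minimal sketch on made-up numbers, the value returned by scikit-learn's `r2_score` can be reproduced by hand as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the observed values from their mean.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# R² = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Always predicting the mean of the observed values scores exactly 0
r2_mean_predictor = r2_score(y_true, np.full_like(y_true, y_true.mean()))
```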

In [None]:
from sklearn.metrics import r2_score

# uses the model to make predictions on the Test Dataset
y_pred = estimator.predict(X_test)

# computes R²
r2 = r2_score(y_test, y_pred)

## Save metrics

Record the metrics used to evaluate the model.<br>
It's a good way to document the experiments, and it also helps to avoid running the same experiment twice.

In [None]:
from platiagro import save_metrics

save_metrics(experiment_id=experiment_id, operator_id=operator_id, r2_score=r2)

## Save figure

Record a matplotlib figure to document the experiment.

In [None]:
import matplotlib.pyplot as plt
from platiagro import save_figure
from scipy.stats import gaussian_kde


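def annotate_plot(e, s, plt, y_lim, h, abs_err):
    """Draw arrows from 0 to min(e) and max(e) at height y_lim[1]/h.

    e: errors in the band; s: percentage shown as the band label;
    plt: the pyplot module; abs_err: True if errors are absolute, not relative.
    """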
    if h < 2:
        p = 0.05
    else:
        p = 0.1
    plt.annotate("", xy=(max(e), y_lim[1]/h),
                 xytext=(0, y_lim[1]/h),
                 arrowprops=dict(arrowstyle="->"))
    plt.annotate("", xy=(min(e), y_lim[1]/h),
                 xytext=(0, y_lim[1]/h),
                 arrowprops=dict(arrowstyle="->"))
    plt.annotate("{}%".format(s),
                 xy=(0, (1+p)*y_lim[1]/h),
                 ha="center")
    if abs_err:
        plt.annotate("{:.2f}".format(max(e)),
                     xy=((0+max(e))/2, (1-p)*y_lim[1]/h),
                     ha="center")
        plt.annotate("{:.2f}".format(min(e)),
                     xy=((0+min(e))/2, (1-p)*y_lim[1]/h),
                     ha="center")
    else:
        plt.annotate("{:.2f}%".format(100*max(e)),
                     xy=((0+max(e))/2, (1-p)*y_lim[1]/h),
                     ha="center")
        plt.annotate("{:.2f}%".format(100*min(e)),
                     xy=((0+min(e))/2, (1-p)*y_lim[1]/h),
                     ha="center")

In [None]:
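abs_err = False
if np.any(y_test == 0):
    # Relative error is undefined when a true value is zero: use absolute error
    err = y_pred - y_test
    abs_err = True
else:
    # Relative error, shown as a percentage in the plot
    err = (y_pred - y_test) / y_test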

sorted_idx = np.argsort(np.abs(err))
n = int(0.7*len(y_test))
idx = sorted_idx[:n]
e = err[idx]

n = int(0.95*len(y_test))
idx = sorted_idx[:n]
aux = err[idx]
x_lim = (aux.min(), aux.max())

plt.figure()

kde = gaussian_kde(err)
x_err = np.linspace(err.min(), err.max(), 1000)
p_err = kde(x_err)
plt.plot(x_err, p_err, 'b-')

y_lim = plt.ylim()
plt.ylim((0, y_lim[1]))
y_lim = plt.ylim()
plt.xlim(x_lim)
plt.plot([e.min(), e.min()], y_lim, "r--")
plt.plot([e.max(), e.max()], y_lim, "r--")

# Shade the area between e.min() and e.max()
plt.fill_betweenx(y_lim, e.min(), e.max(),
                  facecolor="red",  # The fill color
                  color="red",      # The outline color
                  alpha=0.2)        # Transparency of the fill

annotate_plot(e, 70, plt, y_lim, 2, abs_err)
annotate_plot(aux, 95, plt, y_lim, 1.2, abs_err)

plt.grid(True)
plt.title("Error Distribution")

save_figure(experiment_id=experiment_id, operator_id=operator_id, figure=plt.gcf())

## Save model

Store the model artifacts in object storage.<br>
This makes the model available for future deployments.

In [None]:
from platiagro import save_model

save_model(experiment_id=experiment_id,
           model={"estimator": estimator,
                  "feature_encoder": feature_encoder,
                  "columns": columns,
                  "datetime_indexes": datetime_indexes,
                  "categorical_indexes": categorical_indexes,
                  "numerical_nan_replacement": numerical_nan_replacement,
                  "categorical_nan_replacement": categorical_nan_replacement})
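
The sketch below illustrates, under stated assumptions, why the preprocessing objects are stored next to the estimator: at prediction time, new rows need the same encoding that was applied during training. The helper `predict_with_artifacts` and the toy data are hypothetical, the dictionary is built in memory rather than loaded through the SDK, and the NaN replacement and datetime steps are left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.svm import SVR

# Toy training data standing in for the real dataset
train = pd.DataFrame({"size": ["s", "m", "l", "m"], "weight": [1.0, 2.0, 3.0, 2.5]})
target_values = np.array([10.0, 20.0, 30.0, 25.0])

# Encode the categorical column, as in the encoding cell of this notebook
categorical_mask = np.array([True, False])
toy_encoder = OrdinalEncoder()
X_fit = train.to_numpy()
X_fit[:, categorical_mask] = toy_encoder.fit_transform(X_fit[:, categorical_mask])

# A reduced version of the dictionary passed to save_model
artifacts = {
    "estimator": SVR(gamma="auto").fit(X_fit, target_values),
    "feature_encoder": toy_encoder,
    "columns": train.columns.to_numpy(),
    "categorical_indexes": categorical_mask,
}


def predict_with_artifacts(artifacts, new_rows):
    """Hypothetical helper: re-apply the stored encoding, then predict."""
    # Select the columns in the order used during training
    X_new = new_rows[list(artifacts["columns"])].to_numpy()
    mask = artifacts["categorical_indexes"]
    if mask.any():
        X_new[:, mask] = artifacts["feature_encoder"].transform(X_new[:, mask])
    return artifacts["estimator"].predict(X_new)


new_rows = pd.DataFrame({"size": ["m"], "weight": [2.2]})
predictions = predict_with_artifacts(artifacts, new_rows)
```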