# Support Vector Classification

This component trains a Support Vector Classification (SVC) model using [Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
<br>
Scikit-learn is an open-source machine learning library that supports supervised and unsupervised learning. It also provides tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

This notebook shows:
- how to use the SDK to load the dataset and save a model.
- how to receive parameters from the platform.

In [None]:
dataset = "titanic" #@param {type:"string"}
target = "Survived" #@param {type:"string"}
experiment_id = "1550eac4-931b-41f1-bf58-a958b2249620" #@param {type:"string"}
operator_id = "c0deb81a-540e-4d51-bf8f-c332f9b9fd73" #@param {type:"string"}

## Load dataset

Load the whole dataset into a `pandas.DataFrame`.

In [None]:
from platiagro import load_dataset

df = load_dataset(name=dataset)
X = df.drop(target, axis=1).to_numpy()
y = df[target].to_numpy()

In [None]:
columns = df.columns.tolist()
target_idx = columns.index(target)

## Load metadata about the dataset
For example, the cell below gets the feature type of each column in the dataset (e.g., categorical, numerical, or datetime).

In [None]:
from platiagro import stat_dataset
from platiagro.featuretypes import infer_featuretypes

try:
    metadata = stat_dataset(name=dataset)
    featuretypes = metadata["featuretypes"]
except KeyError:
    featuretypes = infer_featuretypes(df)

## Replace NaN values
Replace missing values with the column mean for numerical features and with the column mode (the most frequent value) for categorical features.

In [None]:
from platiagro.featuretypes import CATEGORICAL, NUMERICAL

numerical = [columns[idx] for idx, ft in enumerate(featuretypes) if ft == NUMERICAL and idx != target_idx]
numerical_nan_replacement = df.loc[:, numerical].mean(axis=0)

categorical = [columns[idx] for idx, ft in enumerate(featuretypes) if ft == CATEGORICAL and idx != target_idx]
categorical_nan_replacement = df.loc[:, categorical].mode(axis=0).iloc[0]

df.fillna(numerical_nan_replacement, inplace=True)
df.fillna(categorical_nan_replacement, inplace=True)
X = df.drop(target, axis=1).to_numpy()
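The replacement step above can be illustrated on a tiny frame. This is a minimal sketch with made-up values; the column names `Age` and `Embarked` are only illustrative and are not read from the platform dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one numerical and one categorical column (made-up values).
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", "S", None],
})

# Column-wise mean for the numerical column, most frequent value for the categorical one.
age_fill = toy[["Age"]].mean(axis=0)
embarked_fill = toy[["Embarked"]].mode(axis=0).iloc[0]

# Passing a Series to fillna fills each column with the value stored under its name,
# so each call only touches the columns listed in that Series.
filled = toy.fillna(age_fill).fillna(embarked_fill)
print(filled)
```

Because each replacement Series is indexed by column name, the two `fillna` calls do not interfere with each other.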

## Encode categorical features

Many machine learning algorithms cannot operate on categorical data directly. They require all input variables and output variables to be numeric.<br>
This means that categorical data must be converted to a numerical form.<br>
The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# removes the target's feature type so the indexes match the columns of X
featuretypes.pop(target_idx)

# selects the categorical features
categorical_idxs = [idx for idx, ft in enumerate(featuretypes) if ft == CATEGORICAL]
feature_encoder = OrdinalEncoder()

if len(categorical_idxs) > 0:
    X[:, categorical_idxs] = feature_encoder.fit_transform(X[:, categorical_idxs])
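To make the encoding concrete, here is a minimal sketch on a made-up array (the values are illustrative, not taken from the dataset). It shows the integer codes that `OrdinalEncoder` assigns and how to map them back.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical matrix: two features, three rows (made-up values).
toy_X = np.array([
    ["male", "S"],
    ["female", "C"],
    ["female", "S"],
], dtype=object)

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy_X)

# Categories are sorted per column, so the codes follow alphabetical order.
print(encoder.categories_)
print(encoded)

# inverse_transform recovers the original category values.
decoded = encoder.inverse_transform(encoded)
```

Note that the integer codes imply an order that the categories may not have; whether that matters depends on the estimator and the data.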

## Encode target labels

The target labels are converted to integers with values between 0 and n_classes - 1.

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
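A small sketch of the same idea on made-up labels: `LabelEncoder` sorts the distinct classes and assigns each one its position, and `inverse_transform` maps predictions back to the original labels.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up target labels, for illustration only.
toy_labels = ["yes", "no", "yes", "maybe"]

encoder = LabelEncoder()
codes = encoder.fit_transform(toy_labels)

# classes_ holds the sorted distinct labels; each code is an index into it.
print(encoder.classes_)
print(codes)

# Maps integer codes back to the original labels.
original = encoder.inverse_transform(codes)
```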

## Split dataset into train/test splits

Training Dataset: the sample of data used to fit the model.

Test Dataset: the sample of data used to provide an unbiased evaluation of a model fit on the training dataset.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
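The split above is random, so results can differ between runs. If reproducibility or class balance matters, `train_test_split` also accepts `random_state` and `stratify`. The sketch below uses synthetic data and an arbitrary seed; neither setting is used by this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, 2 features, two balanced classes.
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.array([0] * 10 + [1] * 10)

# random_state fixes the shuffle; stratify keeps class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, train_size=0.7, random_state=42, stratify=toy_y
)

print(len(X_tr), len(X_te))
print(np.bincount(y_tr), np.bincount(y_te))
```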

## Fit a model using sklearn.svm.SVC

In [None]:
from sklearn.svm import SVC

estimator = SVC(gamma="auto")
estimator.fit(X_train, y_train)    
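SVMs with an RBF kernel (the `SVC` default) are sensitive to feature scale, and raw numerical columns and ordinal codes can span very different ranges. One common option, which this notebook does not apply, is to standardize the features before fitting. A minimal sketch on synthetic data, assuming a `StandardScaler` step suits the dataset:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two synthetic classes; the second feature lives on a much larger scale.
toy_X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=[1.0, 100.0], size=(50, 2)),
    rng.normal(loc=[5.0, 500.0], scale=[1.0, 100.0], size=(50, 2)),
])
toy_y = np.array([0] * 50 + [1] * 50)

# The pipeline learns the scaling on the data passed to fit and reuses it in predict.
scaled_svc = make_pipeline(StandardScaler(), SVC(gamma="auto"))
scaled_svc.fit(toy_X, toy_y)

train_accuracy = scaled_svc.score(toy_X, toy_y)
print(train_accuracy)
```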

## Measure the performance
A [**Confusion Matrix**](https://en.wikipedia.org/wiki/Confusion_matrix) tabulates the predicted classes against the actual classes of a classification model.<br>
[Accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification), [Recall, Precision, and F-measure](https://en.wikipedia.org/wiki/Precision_and_recall) can all be computed from its entries.

In [None]:
import numpy as np
import pandas as pd
# aliased so the DataFrame created below does not shadow the imported function
from sklearn.metrics import confusion_matrix as compute_confusion_matrix

# uses the model to make predictions on the Test Dataset
y_pred = estimator.predict(X_test)

# computes confusion matrix (rows: true labels, columns: predicted labels)
labels = np.unique(y)
data = compute_confusion_matrix(y_test, y_pred, labels=labels)

# puts matrix in pandas.DataFrame for better format
labels = label_encoder.inverse_transform(labels)
confusion_matrix = pd.DataFrame(data, columns=labels, index=labels)
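As a worked example of how the metrics follow from the matrix, the sketch below uses a hypothetical 2x2 confusion matrix (the counts are made up) in scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical binary confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[50, 10],
               [5, 35]])

# For labels [0, 1], ravel() yields the cells in this order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted positives that are correct
recall = tp / (tp + fn)              # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```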

## Save metrics

Record the metrics used to evaluate the model.<br>
This documents the experiment and helps avoid running the same experiment twice.

In [None]:
from platiagro import save_metrics

save_metrics(experiment_id=experiment_id, operator_id=operator_id, confusion_matrix=confusion_matrix)

## Save model

Store the model artifacts in an object storage.<br>
This makes the model available for future deployments.
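The encoders, column list, and NaN replacements are saved together with the estimator so that the same preprocessing can be replayed at prediction time. The helper below is a hypothetical sketch of that replay, not part of the SDK: `predict_from_artifacts` and its `categorical_idxs` argument are assumptions, and the indexes of the categorical columns must be supplied by the caller because they are not among the saved artifacts.

```python
import numpy as np
import pandas as pd


def predict_from_artifacts(artifacts: dict, new_df: pd.DataFrame,
                           target: str, categorical_idxs: list) -> np.ndarray:
    """Hypothetical helper: replays the training-time preprocessing on new rows."""
    # Keeps the feature columns in the same order used for training.
    feature_columns = [c for c in artifacts["columns"] if c != target]
    features = new_df.loc[:, feature_columns]

    # Fills missing values with the replacements computed on the training data.
    features = features.fillna(artifacts["numerical_nan_replacement"])
    features = features.fillna(artifacts["categorical_nan_replacement"])

    # Encodes categorical columns with the already fitted encoder (transform, not fit).
    X_new = features.to_numpy(dtype=object)
    if len(categorical_idxs) > 0:
        encoder = artifacts["feature_encoder"]
        X_new[:, categorical_idxs] = encoder.transform(X_new[:, categorical_idxs])

    # Predicts integer classes and maps them back to the original labels.
    y_pred = artifacts["estimator"].predict(X_new)
    return artifacts["label_encoder"].inverse_transform(y_pred)
```

With its default settings, `OrdinalEncoder.transform` raises an error for categories it did not see during fitting, so new data with unseen values needs separate handling.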

In [None]:
from platiagro import save_model

save_model(experiment_id=experiment_id,
           model={"estimator": estimator,
                  "feature_encoder": feature_encoder,
                  "label_encoder": label_encoder,
                  "columns": columns,
                  "numerical_nan_replacement": numerical_nan_replacement,
                  "categorical_nan_replacement": categorical_nan_replacement})