# AutoML Classifier

This component trains an AutoML classifier using [auto-sklearn](https://github.com/automl/auto-sklearn).
<br>
auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.

This notebook shows:
- how to use the [SDK](https://platiagro.github.io/sdk/) to load datasets, save models and other artifacts.
- how to declare parameters and use them to build reusable components.

## Declare parameters and model hyperparameters
Components may declare (and use) these default parameters:
- dataset
- target

Use these parameters to load/save datasets, models, metrics, and figures with the help of [PlatIAgro SDK](https://platiagro.github.io/sdk/). <br />
You may also declare custom parameters to set when running an experiment.

Select the hyperparameters and their respective values to be used when training the model:
- time_left_for_this_task
- per_run_time_limit
- ensemble_size

These are only a few of the parameters offered by the model class; you may also use any other existing parameter. <br />
Check the [model parameters](https://automl.github.io/auto-sklearn/master/api.html#autosklearn.classification.AutoSklearnClassifier) for more information.

In [None]:
# parameters
dataset = "iris" #@param {type:"string"}
target = "Species" #@param {type:"string"}

# hyperparameters
time_left_for_this_task = 60 #@param {type:"integer", label:"Time limit (in seconds)", description:"Time limit for the search for appropriate models"}
per_run_time_limit = 60 #@param {type:"integer", label:"Time limit (in seconds)", description:"Time limit for a single call to the machine learning model"}
ensemble_size = 50 #@param {type:"integer", label:"Ensemble Learning", description:"Number of models added to the ensemble built by Ensemble selection from libraries of models"}
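
The two time limits interact: `time_left_for_this_task` bounds the whole search, while `per_run_time_limit` bounds a single model fit. The auto-sklearn API reference lists the default for `per_run_time_limit` as one tenth of `time_left_for_this_task`. The sketch below is only an illustration of that relationship; `runs_within_budget` is a hypothetical helper, not part of auto-sklearn or the SDK.

```python
def runs_within_budget(time_left_for_this_task, per_run_time_limit):
    """Hypothetical helper: how many full-length runs fit in the total budget."""
    if time_left_for_this_task <= 0 or per_run_time_limit <= 0:
        raise ValueError("time limits must be positive")
    return time_left_for_this_task // per_run_time_limit


# values used in this notebook: a single full-length run fills the whole budget
runs_with_notebook_values = runs_within_budget(60, 60)

# one tenth of the total budget leaves room for several candidate models
runs_with_smaller_limit = runs_within_budget(60, 6)
```

If the search should evaluate many candidates, it may help to set `per_run_time_limit` well below `time_left_for_this_task`.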

## Load dataset

Load the whole dataset into a pandas.DataFrame, then split it into the feature matrix `X` and the target vector `y`.

In [None]:
from platiagro import load_dataset

df = load_dataset(name=dataset)
X = df.drop(target, axis=1).to_numpy()
y = df[target].to_numpy()

## Load metadata about the dataset
For example, below we get the feature type of each column in the dataset (e.g. categorical, numerical, or datetime) and drop the entry that corresponds to the target column.

In [None]:
import numpy as np
from platiagro import stat_dataset

metadata = stat_dataset(name=dataset)
featuretypes = metadata["featuretypes"]

columns = df.columns.to_numpy()
featuretypes = np.array(featuretypes)
target_index = np.argwhere(columns == target)
columns = np.delete(columns, target_index)
featuretypes = np.delete(featuretypes, target_index)
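
The effect of the `np.argwhere` / `np.delete` pair can be checked on a small example. This is a sketch with made-up column names and feature type strings (both are assumptions, not values returned by `stat_dataset`):

```python
import numpy as np

# toy stand-ins for df.columns and metadata["featuretypes"]
columns = np.array(["SepalLength", "SepalWidth", "Species"])
featuretypes = np.array(["Numerical", "Numerical", "Categorical"])
target = "Species"

# position of the target column
target_index = np.argwhere(columns == target)

# remove the target from both arrays so they stay aligned with X
feature_columns = np.delete(columns, target_index)
feature_types = np.delete(featuretypes, target_index)
```

After this step, the i-th entry of `feature_types` describes the i-th column of `X`.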

## Encode target labels

The target labels are converted to ordinal integers with values between 0 and n_classes-1.

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
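
As a small illustration of what the encoder does, the sketch below encodes a toy list of labels (made up for this example) and maps the integers back to the original strings:

```python
from sklearn.preprocessing import LabelEncoder

# toy labels, not taken from the dataset
species = ["setosa", "virginica", "versicolor", "setosa"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(species)      # classes are sorted, then numbered
decoded = encoder.inverse_transform(encoded)  # integers back to the original labels
```

Keeping the fitted encoder around is what later allows predictions to be reported with the original class names.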

## Split dataset into train/test splits

Training Dataset: the sample of data used to fit the model.

Test Dataset: the sample of data used to provide an unbiased evaluation of a model fit on the training dataset.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
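
The split above is random, so results can differ between runs. A possible variation, sketched here on toy data, fixes `random_state` for reproducibility and passes `stratify` so that both splits keep the class proportions. The value 42 and the toy arrays are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 20 samples, 2 balanced classes
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, stratify=y_toy, random_state=42)

# the same random_state yields the same split again
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, train_size=0.7, stratify=y_toy, random_state=42)
```

Stratification matters most for small or imbalanced datasets, where a purely random split can leave a class underrepresented in the test set.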

## Fit a model using autosklearn.classification.AutoSklearnClassifier

In [None]:
from sklearn.compose import ColumnTransformer
from autosklearn.classification import AutoSklearnClassifier
from platiagro.featuretypes import NUMERICAL
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

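# boolean mask over the input columns: True where the feature is numerical
numerical_indexes = (featuretypes == NUMERICAL)

# the first ColumnTransformer outputs the numerical columns first, followed by
# the remaining ones, so the next step must select columns by their new positions
categorical_indexes = list(range(int(numerical_indexes.sum()), len(featuretypes)))

pipeline = Pipeline(steps=[
    # fills missing values: mean for numerical features, most frequent value for the others
    ('handle missing values', ColumnTransformer(
    [('imputer_mean', SimpleImputer(strategy='mean'), numerical_indexes),
     ('imputer_mode', SimpleImputer(strategy='most_frequent'), ~numerical_indexes)],
    remainder='drop')),
    # encodes non-numerical features as ordinal integers
    ('handle categorical features', ColumnTransformer(
    [('feature_encoder', OrdinalEncoder(), categorical_indexes)],
    remainder='passthrough')),
    # removes features that have the same value in every sample
    ('variance threshold', VarianceThreshold(threshold=0)),
    ('estimator', AutoSklearnClassifier(time_left_for_this_task=time_left_for_this_task,
                                        per_run_time_limit=per_run_time_limit,
                                        ensemble_size=ensemble_size)),
])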

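pipeline.fit(X_train, y_train)

# refit expects the same preprocessed features the estimator received during fit,
# so run X_train through every step that precedes the estimator first
X_train_preprocessed = pipeline[:-1].transform(X_train)
pipeline.named_steps.estimator.refit(X_train_preprocessed, y_train)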
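
One detail worth keeping in mind when chaining `ColumnTransformer` steps: the output columns follow the order in which the transformers are listed, not the order of the input columns. The sketch below shows this on a toy array, using identity transformers and a made-up mask (both are assumptions for the example):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# toy input: only the middle column is flagged as numerical
X_toy = np.array([[10.0, 1.0, 20.0],
                  [11.0, 2.0, 21.0]])
is_numerical = np.array([False, True, False])

# FunctionTransformer() without arguments passes values through unchanged
reorder = ColumnTransformer(
    [("numerical", FunctionTransformer(), is_numerical),
     ("other", FunctionTransformer(), ~is_numerical)],
    remainder="drop")

# the numerical column now comes first, followed by the other two
X_out = reorder.fit_transform(X_toy)
```

Any step placed after such a transformer has to address columns by their new positions.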

## Measure the performance
The [**Confusion Matrix**](https://en.wikipedia.org/wiki/Confusion_matrix) tabulates the true classes against the predicted classes, summarizing the performance of a classifier.<br>
Metrics such as [Accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification), [Recall, Precision, and F-measure](https://en.wikipedia.org/wiki/Precision_and_recall) can be computed from it.

In [None]:
import numpy as np
import pandas as pd
from sklearn import metrics

# uses the model to make predictions on the Test Dataset
y_pred = pipeline.predict(X_test)

# computes confusion matrix (rows: true classes, columns: predicted classes)
# calling it through the module keeps the function from being shadowed by the variable below
labels = np.unique(y)
data = metrics.confusion_matrix(y_test, y_pred, labels=labels)

# puts matrix in pandas.DataFrame for better format
labels = label_encoder.inverse_transform(labels)
confusion_matrix = pd.DataFrame(data, columns=labels, index=labels)
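
To make the link between the matrix and the metrics concrete, the sketch below derives accuracy, per-class recall, and per-class precision from a small made-up matrix. It assumes the scikit-learn layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# toy confusion matrix for 3 classes (rows: true, columns: predicted)
cm = np.array([[5, 0, 0],
               [0, 4, 1],
               [0, 2, 3]])

# correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()

# recall: correct predictions divided by the number of true samples per class
recall = np.diag(cm) / cm.sum(axis=1)

# precision: correct predictions divided by the number of predictions per class
precision = np.diag(cm) / cm.sum(axis=0)
```

A class that never appears in the predictions would produce a division by zero in `precision`, so real code may need to handle that case.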

## Save metrics

Record the metrics used to evaluate the model.<br>
It's a good way to document the experiments, and it also helps to avoid running the same experiment twice.

In [None]:
from platiagro import save_metrics

save_metrics(confusion_matrix=confusion_matrix)

## Save model

Stores the model artifacts in an object storage.<br>
This makes the model available for future deployments.

In [None]:
from platiagro import save_model

save_model(pipeline=pipeline,
           label_encoder=label_encoder,
           columns=columns)