# Isolation Forest Clustering - Experiment

This component trains an Isolation Forest model using [Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html).
<br>
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

This notebook shows:
- how to use the [SDK](https://platiagro.github.io/sdk/) to load datasets, save models and other artifacts.
- how to declare parameters and use them to build reusable components.

## Declare parameters
Components may declare (and use) these default parameters:
- dataset

Use these parameters to load/save datasets, models, metrics, and figures with the help of [PlatIAgro SDK](https://platiagro.github.io/sdk/).

You may also declare custom parameters to set when running an experiment.

In [None]:
dataset = "iris" #@param {type:"string"}
# max_samples may be "auto", an int (number of samples), or a float (fraction of samples)
max_samples = "auto" #@param {type:"float"}
contamination = 0.1 #@param {type:"float"}
max_features = 1.0 #@param {type:"float"}
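
To make the effect of `contamination` concrete, here is a minimal sketch on synthetic data (the data and the `random_state` value are assumptions for illustration, not part of this component). With `contamination=0.1`, roughly 10% of the training samples should end up labeled as outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption for illustration): 200 points from a 2-D Gaussian.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))

# Same parameter values as the defaults declared above.
model = IsolationForest(max_samples="auto", contamination=0.1,
                        max_features=1.0, random_state=0)
labels = model.fit_predict(X)  # 1 = inlier, -1 = outlier

# Share of training samples that the model flags as outliers.
flagged_fraction = float(np.mean(labels == -1))
print(f"fraction flagged as outliers: {flagged_fraction:.2f}")
```

Raising `contamination` moves the decision threshold so that more samples are flagged; it does not change how the trees are built.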

## Load dataset

Import and put the whole dataset in a pandas.DataFrame.

In [None]:
from platiagro import load_dataset

df = load_dataset(name=dataset)

## Load metadata about the dataset
For example, below we get the feature type of each column in the dataset (e.g., categorical, numerical, or datetime).

In [None]:
import numpy as np
import pandas as pd
from platiagro import stat_dataset

metadata = stat_dataset(name=dataset)
featuretypes = metadata["featuretypes"]
featuretypes = np.array(featuretypes)
columns = df.columns.to_numpy()
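
When no metadata is available, a similar array of feature types could be derived from the pandas dtypes. The sketch below is only an approximation of that idea: `infer_featuretypes`, the column names, and the three label strings are hypothetical stand-ins, and the real constants should be imported from `platiagro.featuretypes`.

```python
import numpy as np
import pandas as pd

# Placeholder labels; the real constants live in platiagro.featuretypes.
NUMERICAL, CATEGORICAL, DATETIME = "Numerical", "Categorical", "DateTime"

def infer_featuretypes(frame: pd.DataFrame) -> np.ndarray:
    """Hypothetical helper: map each column's dtype to a feature-type label."""
    types = []
    for column in frame.columns:
        dtype = frame[column].dtype
        if pd.api.types.is_datetime64_any_dtype(dtype):
            types.append(DATETIME)
        elif (pd.api.types.is_numeric_dtype(dtype)
              and not pd.api.types.is_bool_dtype(dtype)):
            types.append(NUMERICAL)
        else:
            # Strings, booleans, and anything else are treated as categorical.
            types.append(CATEGORICAL)
    return np.array(types)

sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "species": ["setosa", "setosa", "virginica"],
    "measured_at": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
})
featuretypes_demo = infer_featuretypes(sample)
print(featuretypes_demo)
```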

## Features configuration

In [None]:
from platiagro.featuretypes import NUMERICAL

# Selects the indexes of numerical and non-numerical features
numerical_indexes = np.where(featuretypes == NUMERICAL)[0]
non_numerical_indexes = np.where(featuretypes != NUMERICAL)[0]

# After the handle_missing_values step,
# numerical features are grouped at the beginning of the array
numerical_indexes_after_handle_missing_values = \
    np.arange(len(numerical_indexes))
non_numerical_indexes_after_handle_missing_values = \
    np.arange(len(numerical_indexes), len(featuretypes))
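
The reindexing is needed because a `ColumnTransformer` concatenates its outputs in the order the transformers are listed, not in the original column order. The following sketch shows this on a toy frame; the column names and the label strings are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical columns): a categorical column followed by a numerical one.
toy = pd.DataFrame({
    "color": pd.Series(["red", "blue", "blue", "red"], dtype=object),
    "width": [1.0, np.nan, 3.0, 2.0],
})
toy_types = np.array(["Categorical", "Numerical"])  # placeholder labels

num_idx = np.where(toy_types == "Numerical")[0]  # position 1 in the input
cat_idx = np.where(toy_types != "Numerical")[0]  # position 0 in the input

imputer = ColumnTransformer(
    [("imputer_mean", SimpleImputer(strategy="mean"), num_idx),
     ("imputer_mode", SimpleImputer(strategy="most_frequent"), cat_idx)],
    remainder="drop")
out = imputer.fit_transform(toy)

# Output order follows the transformer list: "width" is now column 0, "color" column 1.
num_idx_after = np.arange(len(num_idx))
cat_idx_after = np.arange(len(num_idx), len(toy_types))
print(out)
```

Any later step that addresses columns by position therefore has to use the "after" indexes rather than the original ones.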

## Fit a model using sklearn.ensemble.IsolationForest

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
from category_encoders.ordinal import OrdinalEncoder

pipeline = Pipeline(steps=[
            ("handle_missing_values",
             ColumnTransformer(
                [("imputer_mean", SimpleImputer(strategy="mean"), numerical_indexes),
                 ("imputer_mode", SimpleImputer(strategy="most_frequent"), non_numerical_indexes)],
                 remainder="drop")),
            ("handle_categorical_features",
             ColumnTransformer(
                 [("feature_encoder", OrdinalEncoder(), non_numerical_indexes_after_handle_missing_values)],
                 remainder="passthrough")),
            ("estimator", IsolationForest(max_samples=max_samples,
                            contamination=contamination,
                            max_features=max_features))
])

# fit_predict returns one label per row: 1 for inliers, -1 for outliers
score = pipeline.fit_predict(df)
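
`fit_predict` yields discrete labels, while `IsolationForest` also exposes continuous scores through `decision_function` and `score_samples`. The sketch below compares the three on synthetic data (the data and the `random_state` are assumptions for illustration).

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Synthetic stand-in for the encoded dataset: a tight cluster plus a few far-away points.
inliers = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
far_points = rng.uniform(low=8.0, high=10.0, size=(5, 2))
X = np.vstack([inliers, far_points])

forest = IsolationForest(contamination=0.1, random_state=42)
labels = forest.fit_predict(X)          # array of 1 (inlier) and -1 (outlier)
decision = forest.decision_function(X)  # negative values correspond to outliers
raw_scores = forest.score_samples(X)    # lower means more abnormal

print("outliers flagged:", int(np.sum(labels == -1)))
```

The labels are just the sign of `decision_function`, which in turn is `score_samples` shifted by the fitted `offset_`.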

## Measure the performance

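Isolation Forest is unsupervised, so there are no ground-truth labels to compare against. One way to gauge the result is to look at the predicted labels (1 for inliers, -1 for outliers), for example the share of samples flagged as anomalies. Below, the encoded data is projected onto two dimensions with PCA and each point is colored by its predicted label.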

In [None]:
from sklearn.decomposition import PCA

# Run all except the last step
df_encoded = Pipeline(steps=pipeline.steps[:-1]).transform(df)

# Dimension reduction
pca = PCA(n_components=2)
reduced = pca.fit_transform(df_encoded)

X_pca = pd.DataFrame(reduced, columns=["X", "Y"])

X_pca["Anomaly"] = score
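
A two-component projection can hide structure if the first two components carry little of the variance. As a rough check, `explained_variance_ratio_` reports the share kept by each component; the sketch below uses a synthetic matrix as a stand-in for the encoded dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Synthetic 4-feature matrix (an assumption), with very different column scales.
X = rng.normal(size=(150, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

pca_demo = PCA(n_components=2)
reduced_demo = pca_demo.fit_transform(X)

# Share of the total variance kept by each of the two components.
kept = pca_demo.explained_variance_ratio_
print(f"variance kept by 2 components: {kept.sum():.1%}")
```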

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

ax = sns.scatterplot(x="X", y="Y", hue="Anomaly", data=X_pca)

ax.set_title("PCA Graph", fontweight="bold")

In [None]:
from platiagro import save_figure

save_figure(figure=plt.gcf())

## Save metrics

Record the metrics used to evaluate the model.<br>
This documents the experiment and helps avoid running the same experiment twice.

In [None]:
from platiagro import save_metrics

save_metrics(anomaly_score=score)
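
Scalar summaries are often easier to compare across runs than a full array of labels. The helper below is a hypothetical sketch of how the +1/-1 labels could be reduced to a few numbers; whether `save_metrics` accepts these exact keys or value types is not verified here.

```python
import numpy as np

def summarize_labels(labels):
    """Hypothetical helper: reduce +1/-1 labels to scalar summaries."""
    labels = np.asarray(labels)
    n_outliers = int(np.sum(labels == -1))
    return {
        "n_samples": int(labels.size),
        "n_outliers": n_outliers,
        "outlier_fraction": n_outliers / labels.size,
    }

# Ten hand-written labels, two of which mark outliers.
summary = summarize_labels([1, 1, -1, 1, 1, 1, -1, 1, 1, 1])
print(summary)
```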

## Save model

Stores the model artifacts in an object storage.<br>
This makes the model available for future deployments.

In [None]:
from platiagro import save_model

save_model(pipeline=pipeline,
           columns=columns)
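
For local experimentation outside the platform, a fitted pipeline can also be serialized with `joblib`. This is a swapped-in technique shown only as a sketch, not how `save_model` works internally; the pipeline and data below are simplified assumptions.

```python
import io

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))  # synthetic data for the round trip

demo_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("estimator", IsolationForest(random_state=0)),
])
demo_pipeline.fit(X)

# Serialize to an in-memory buffer and load it back.
buffer = io.BytesIO()
joblib.dump(demo_pipeline, buffer)
buffer.seek(0)
restored = joblib.load(buffer)

# The restored pipeline should reproduce the original predictions.
same_predictions = np.array_equal(demo_pipeline.predict(X), restored.predict(X))
print("round trip preserved predictions:", same_predictions)
```

Serialized pipelines should be loaded with the same scikit-learn version that produced them, and only from trusted sources.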