# IsolationForestDetect Example

In this example, we explore how to detect anomalies in a dataset using the Isolation Forest
algorithm, which isolates observations by randomly selecting a feature and then randomly
selecting a split value between the minimum and maximum values of that feature.
Observations that can be isolated with fewer splits are more likely to be anomalies.
This subclass uses the `sklearn.ensemble.IsolationForest` class from `sklearn` in the
background.
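
To make the isolation idea concrete, here is a minimal pure-Python sketch on a single feature. It is not how `sklearn` implements the algorithm (the real algorithm also picks a feature at random and builds whole trees on subsamples). The helper names `isolation_depth` and `mean_depth` and the toy `ages` values are illustrative assumptions. The sketch counts how many random splits are needed before a value is alone in its partition.

```python
import random


def isolation_depth(values, target, rng, max_depth=50):
    """Count the random splits needed until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:
            break  # remaining values are identical, no further split is possible
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth


def mean_depth(values, target, n_trees=200, seed=0):
    """Average the isolation depth over many independent random runs."""
    rng = random.Random(seed)
    total = sum(isolation_depth(values, target, rng) for _ in range(n_trees))
    return total / n_trees


# Made-up ages: nine similar values and one extreme value
ages = [29, 30, 31, 32, 33, 34, 35, 36, 37, 95]
outlier_depth = mean_depth(ages, 95)
inlier_depth = mean_depth(ages, 33)
print("outlier:", outlier_depth, "inlier:", inlier_depth)
```

The extreme value is usually separated by the very first split, while a value in the middle of the cluster needs at least two splits, which is the intuition behind scoring anomalies by path length.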

`sklearn.ensemble.IsolationForest` can only handle numerical data. This subclass, however, also accepts categorical
input by applying ordinal encoding before calling the sklearn class. To enable this behavior,
set `enable_encoder=True`. If you'd like to use a different type of encoding,
consider using the Pipeline class and calling your own encoder before calling this subclass.
For more details, see [sklearn's documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html#).
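
As a rough illustration of what ordinal encoding followed by an Isolation Forest looks like, the sketch below uses plain scikit-learn on a made-up toy frame. It is an assumption about the general idea only and does not show the internals of this library's encoder. The column names and values are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import OrdinalEncoder

# Toy data with one categorical and one numerical column (values are made up)
toy = pd.DataFrame({
    "department": ["Sales", "Ops", "Sales", "Tech", "Ops", "Sales"],
    "age": [30, 31, 29, 32, 30, 75],
})

# Replace each category with a numeric code so that sklearn can consume it
encoder = OrdinalEncoder()
encoded = toy.copy()
encoded["department"] = encoder.fit_transform(toy[["department"]]).ravel()

forest = IsolationForest(random_state=0).fit(encoded)
labels = forest.predict(encoded)  # -1 = anomaly, +1 = normal
print(encoded)
print(labels)
```

Ordinal codes impose an arbitrary order on the categories, which is one reason you might prefer to plug in a different encoder.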

In [3]:
import sys
sys.path.append('../../../notebooks')

import pandas as pd
import numpy as np
from raimitigations.datadiagnostics import IsolationForestDetect
from download import download_datasets
from itertools import compress

Load the data:

In [6]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset = dataset[:10000].dropna(axis=0)

dataset

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,14934,Procurement,region_13,Master's & above,f,other,1,37,4.0,7,1,0,71,0
9996,22040,Sales & Marketing,region_33,Master's & above,m,sourcing,1,39,3.0,7,0,0,48,0
9997,14188,Finance,region_13,Master's & above,f,sourcing,1,33,4.0,4,1,0,58,0
9998,73566,Operations,region_28,Master's & above,m,other,1,32,4.0,4,1,0,57,1


This class uses a `.fit()`, `.predict()` and `.transform()` interface. The `col_predict` parameter specifies the columns to be included in error prediction; if `None`, all columns are evaluated for errors. The `mode` string parameter can take the following values:
- `"column"`: fit and prediction will be applied to each column independently. An error matrix of the same shape as the data will be returned by `predict`.
- `"row"`: fit is applied over the whole data and prediction over each row. A list of erroneous row indices will be returned by `predict`.

With `mode`=`"column"`, `.predict()` returns a matrix of the same shape as the input data in which each element is -1 (erroneous), +1 (non-erroneous), or `np.nan` (column not included in `col_predict`).
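
The shape and meaning of such an error matrix can be reproduced with plain scikit-learn. The sketch below is an assumption about one possible approach (one forest per column); the library's actual implementation may differ. `column_error_matrix` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def column_error_matrix(df, col_predict, random_state=0):
    """Fit one IsolationForest per selected column; return a matrix shaped like df."""
    matrix = np.full(df.shape, np.nan)
    for j, col in enumerate(df.columns):
        if col not in col_predict:
            continue  # columns outside col_predict stay np.nan
        values = df[[col]].to_numpy()
        forest = IsolationForest(random_state=random_state).fit(values)
        matrix[:, j] = forest.predict(values)  # -1 = erroneous, +1 = non-erroneous
    return matrix


# Made-up data: two numerical columns and one column left out of col_predict
toy = pd.DataFrame({
    "age": [30, 31, 29, 32, 30, 31, 29, 95],
    "score": [50, 52, 49, 51, 50, 48, 53, 51],
    "note": list("abcdefgh"),
})
error_matrix = column_error_matrix(toy, col_predict=["age", "score"])
print(error_matrix)
```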

We can use the following function to print out erroneous values returned by the error matrix:

In [4]:
def print_erroneous_values_per_column(df, error_matrix):
    # Boolean mask of the cells flagged as erroneous (-1) in the error matrix
    mask = np.asarray(error_matrix) == -1
    for i, col in enumerate(df):
        # Collect the distinct flagged values of this column
        errors = set(compress(list(df[col]), mask[:, i]))
        if errors:
            print("Column: ", col)
            print(list(errors))
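
To check the helper's logic on a small input, the sketch below rebuilds it as a hypothetical function, `erroneous_values_by_column`, that returns a dictionary instead of printing. The toy frame and the hand-written error matrix are made up for illustration.

```python
import numpy as np
import pandas as pd


def erroneous_values_by_column(df, error_matrix):
    """Map each column name to the set of its values flagged with -1."""
    mask = np.asarray(error_matrix) == -1
    flagged = {}
    for i, col in enumerate(df.columns):
        values = set(df[col][mask[:, i]])
        if values:
            flagged[col] = values
    return flagged


toy = pd.DataFrame({"age": [30, 31, 95], "dept": ["a", "b", "c"]})
# Hand-written matrix: "age" was evaluated, "dept" was not (np.nan)
toy_matrix = np.array([
    [1, np.nan],
    [1, np.nan],
    [-1, np.nan],
])
result = erroneous_values_by_column(toy, toy_matrix)
print(result)
```

Because the values are collected into a set, repeated erroneous values are reported once per column and their order is not preserved.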

#### `mode`=`"row"`:

Using the default setting, `enable_encoder`=False, categorical columns will be excluded from anomaly detection:

In [10]:
numerical_columns = list(dataset.select_dtypes(include=['number']).columns)
numerical_columns

['employee_id',
 'no_of_trainings',
 'age',
 'previous_year_rating',
 'length_of_service',
 'KPIs_met >80%',
 'awards_won?',
 'avg_training_score',
 'is_promoted']

In [16]:
isf_detector_1 = IsolationForestDetect(
    df=dataset,
    col_predict=numerical_columns,
    mode="row",
    isf_params={
        "n_estimators": 100,
        "max_samples": "auto",
        "contamination": "auto",
        "max_features": 1.0,
        "bootstrap": False,
        "n_jobs": None,
        "random_state": 0,
        "warm_start": False,
    },
    enable_encoder=False,
    verbose=False,
)
isf_detector_1.fit()
indices = isf_detector_1.predict(dataset)
len(indices)

2023
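
The number of flagged rows depends strongly on the `contamination` setting. In scikit-learn, `contamination="auto"` uses a fixed score threshold (`offset_` is -0.5) rather than a target fraction, whereas a float value sets the threshold so that roughly that fraction of the training rows is flagged. The sketch below compares the two on synthetic data with plain scikit-learn; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # synthetic data without planted anomalies

auto_forest = IsolationForest(contamination="auto", random_state=0).fit(X)
fixed_forest = IsolationForest(contamination=0.05, random_state=0).fit(X)

# Count the training rows labelled as anomalies (-1) under each setting
n_auto = int((auto_forest.predict(X) == -1).sum())
n_fixed = int((fixed_forest.predict(X) == -1).sum())
print("auto:", n_auto, "fixed 5%:", n_fixed)
```

If you have a prior expectation of how many rows are anomalous, passing a float `contamination` through `isf_params` is one way to control the size of the result.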

Now, with `enable_encoder`=True, we can set `col_predict`=None to include all columns in the anomaly detection.

We can also pass an `sklearn.ensemble.IsolationForest` object directly. Note that in this case, `isf_params` will be ignored.

In [21]:
from sklearn.ensemble import IsolationForest
sklearn_obj = IsolationForest(random_state=100)
isf_detector_2 = IsolationForestDetect(
    df=dataset,
    col_predict=None,
    sklearn_obj=sklearn_obj,
    enable_encoder=True,
    verbose=False,
)
isf_detector_2.fit()
indices = isf_detector_2.predict(dataset)
len(indices)

3383

We can call the `.transform()` method to remove the rows flagged as erroneous when `mode`=`"row"`:

In [22]:
isf_detector_2.transform(dataset)

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
1,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
2,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
3,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0
4,58896,Analytics,region_2,Bachelor's,m,sourcing,2,31,3.0,7,0,0,85,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5467,47263,Technology,region_13,Bachelor's,m,other,1,27,4.0,5,1,0,79,0
5468,59001,Operations,region_2,Bachelor's,m,other,1,41,2.0,15,0,0,61,0
5469,17717,Operations,region_2,Bachelor's,f,sourcing,1,39,2.0,11,0,0,61,0
5470,22040,Sales & Marketing,region_33,Master's & above,m,sourcing,1,39,3.0,7,0,0,48,0
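
The effect of this row-wise removal can be approximated with plain scikit-learn and pandas. The sketch below is an assumption about the general behavior (drop the flagged rows, then renumber the index as in the output above), not the library's own code; the toy data is made up.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

toy = pd.DataFrame({
    "age": [30, 31, 29, 32, 30, 31, 29, 95],
    "score": [50, 52, 49, 51, 50, 48, 53, 99],
})

forest = IsolationForest(random_state=0).fit(toy)
labels = forest.predict(toy)  # one label per row: -1 = anomaly, +1 = normal

# Keep only the rows labelled +1 and renumber the index from zero
cleaned = toy[labels == 1].reset_index(drop=True)
print(len(toy), "->", len(cleaned))
```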


#### `mode`=`"column"`:

Again using the default setting, `enable_encoder`=False, categorical columns will be excluded from anomaly detection:

In [19]:
isf_detector_3 = IsolationForestDetect(
    df=dataset,
    col_predict=numerical_columns,
    mode="column",
    isf_params={
        "n_estimators": 100,
        "max_samples": "auto",
        "contamination": "auto",
        "max_features": 1.0,
        "bootstrap": False,
        "n_jobs": None,
        "random_state": 0,
        "warm_start": False,
    },
    enable_encoder=False,
    verbose=False,
)
isf_detector_3.fit()
error_matrix = isf_detector_3.predict(dataset)
print_erroneous_values_per_column(dataset, error_matrix)

Column:  employee_id
[73729, 73730, 24580, 5, 73735, 8199, 8200, 73738, 12, 73741, 73742, 24590, 8211, 73748, 73749, 73753, 65561, 8219, 24602, 65567, 33, 34, 8229, 38, 37, 24616, 65579, 8236, 8238, 8239, 65584, 65583, 8245, 65591, 24633, 24636, 8253, 73788, 73791, 65600, 8257, 65603, 71, 65609, 24650, 73804, 76, 65616, 73809, 65618, 84, 65621, 85, 65624, 65625, 24665, 73819, 8284, 91, 73822, 8285, 98, 24675, 100, 73829, 24674, 8292, 24682, 107, 8300, 65643, 24689, 73846, 24696, 124, 73853, 65660, 24702, 65664, 129, 24706, 65667, 132, 128, 24710, 65671, 135, 137, 24714, 8330, 139, 73869, 32910, 24721, 8340, 24725, 150, 73880, 65689, 65690, 32921, 32922, 32925, 159, 65698, 65699, 24740, 32934, 167, 8360, 65705, 73899, 24748, 24750, 73906, 32947, 180, 8373, 8374, 73911, 24761, 8380, 65725, 8386, 24771, 197, 32966, 8392, 73929, 32970, 8395, 32971, 202, 8400, 209, 8402, 8403, 212, 32979, 24790, 73943, 211, 65747, 219, 32987, 65757, 222, 32990, 24799, 24801, 73952, 24803, 65763, 24798, 8422

With encoding enabled, we can include categorical columns as well:

In [23]:
isf_detector_4 = IsolationForestDetect(
    df=dataset,
    col_predict=None,
    mode="column",
    isf_params={
        "n_estimators": 100,
        "max_samples": "auto",
        "contamination": "auto",
        "max_features": 1.0,
        "bootstrap": False,
        "n_jobs": None,
        "random_state": 0,
        "warm_start": False,
    },
    enable_encoder=True,
    verbose=False,
)
isf_detector_4.fit()
error_matrix = isf_detector_4.predict(dataset)
print_erroneous_values_per_column(dataset, error_matrix)

Column:  employee_id
[73729, 73730, 24580, 5, 73735, 8199, 8200, 73738, 12, 73741, 73742, 24590, 8211, 73748, 73749, 73753, 65561, 8219, 24602, 65567, 33, 34, 8229, 38, 37, 24616, 65579, 8236, 8238, 8239, 65584, 65583, 8245, 65591, 24633, 24636, 8253, 73788, 73791, 65600, 8257, 65603, 71, 65609, 24650, 73804, 76, 65616, 73809, 65618, 84, 65621, 85, 65624, 65625, 24665, 73819, 8284, 91, 73822, 8285, 98, 24675, 100, 73829, 24674, 8292, 24682, 107, 8300, 65643, 24689, 73846, 24696, 124, 73853, 65660, 24702, 65664, 129, 24706, 65667, 132, 128, 24710, 65671, 135, 137, 24714, 8330, 139, 73869, 32910, 24721, 8340, 24725, 150, 73880, 65689, 65690, 32921, 32922, 32925, 159, 65698, 65699, 24740, 32934, 167, 8360, 65705, 73899, 24748, 24750, 73906, 32947, 180, 8373, 8374, 73911, 24761, 8380, 65725, 8386, 24771, 197, 32966, 8392, 73929, 32970, 8395, 32971, 202, 8400, 209, 8402, 8403, 212, 32979, 24790, 73943, 211, 65747, 219, 32987, 65757, 222, 32990, 24799, 24801, 73952, 24803, 65763, 24798, 8422

We can call the `.transform()` method to transform the data. With `mode`=`"column"`, we have the following two options:
1. Remove erroneous values from the data:

In [24]:
isf_detector_4.transform(dataset)


Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,,Sales & Marketing,region_7,,,,1.0,35.0,,8.0,,0.0,49.0,0.0
1,,Operations,region_22,Bachelor's,m,other,1.0,30.0,,4.0,0.0,0.0,60.0,0.0
2,,Sales & Marketing,,Bachelor's,m,,1.0,34.0,3.0,7.0,0.0,0.0,50.0,0.0
3,,Sales & Marketing,,Bachelor's,m,other,,39.0,,,0.0,0.0,50.0,0.0
4,48945.0,,region_26,Bachelor's,m,other,1.0,,3.0,2.0,0.0,0.0,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,,,,,,,,,,,,,,
9996,,,,,,,,,,,,,,
9997,,,,,,,,,,,,,,
9998,,,,,,,,,,,,,,
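
Blanking out flagged cells given an error matrix can be sketched with `DataFrame.mask` from pandas. This is a minimal illustration under the assumption that the error matrix has the same shape as the data; the toy frame and the hand-written matrix are hypothetical, and the library may do this differently.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [30, 31, 95],
    "department": ["Sales", "Ops", "Sales"],
})
# Hand-written error matrix: only the age in the last row is flagged
toy_matrix = np.array([
    [1, 1],
    [1, 1],
    [-1, 1],
])

# DataFrame.mask replaces the cells where the condition is True with NaN
cleaned = toy.mask(toy_matrix == -1)
print(cleaned)
```

Integer columns that receive a missing value are converted to floats, which matches the `1.0`, `35.0` style of the values in the output above.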


2. Or, we can pass an imputer to impute these values after removal. Here we use the `IterativeDataImputer` offered by this library:

In [25]:
from raimitigations.dataprocessing import IterativeDataImputer
from sklearn.ensemble import RandomForestRegressor
imputer = IterativeDataImputer(
    df=dataset,
    col_impute=None,
    enable_encoder=True,
    iterative_params={
        'estimator': RandomForestRegressor(),
        'missing_values': np.nan,
        'sample_posterior': False,
        'max_iter': 3,
        'tol': 1e-3,
        'n_nearest_features': None,
        'initial_strategy': 'mean',
        'imputation_order': 'ascending',
        'skip_complete': False,
        'min_value': -np.inf,
        'max_value': np.inf,
        'random_state': 100},
    verbose=False
)

isf_detector_4.transform(dataset, imputer=imputer)




Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,43375.50,Sales & Marketing,region_7,Bachelor's,f,other,1.00,35.00,1.72,8.00,0.0,0.0,49.00,0.0
1,45322.45,Operations,region_22,Bachelor's,m,other,1.00,30.00,2.24,4.00,0.0,0.0,60.00,0.0
2,23665.69,Sales & Marketing,region_28,Bachelor's,m,other,1.00,34.00,3.00,7.00,0.0,0.0,50.00,0.0
3,24128.57,Sales & Marketing,region_33,Bachelor's,m,other,1.02,39.00,3.60,8.62,0.0,0.0,50.00,0.0
4,48945.00,Legal,region_26,Bachelor's,m,other,1.00,29.95,3.00,2.00,0.0,0.0,58.25,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8850,32554.44,HR,region_23,Bachelor's,f,referred,1.25,30.03,3.69,4.00,0.0,0.0,63.43,0.0
8851,32554.44,HR,region_23,Bachelor's,f,referred,1.25,30.03,3.69,4.00,0.0,0.0,63.43,0.0
8852,32554.44,HR,region_23,Bachelor's,f,referred,1.25,30.03,3.69,4.00,0.0,0.0,63.43,0.0
8853,32554.44,HR,region_23,Bachelor's,f,referred,1.25,30.03,3.69,4.00,0.0,0.0,63.43,0.0
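
The keys in `iterative_params` match the constructor parameters of scikit-learn's `IterativeImputer`, so the sketch below shows that class on its own. That `IterativeDataImputer` builds on it is an assumption here, and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables the next import
from sklearn.impute import IterativeImputer

# Toy numerical frame with two missing cells
toy = pd.DataFrame({
    "age": [30.0, 31.0, np.nan, 32.0, 30.0],
    "length_of_service": [4.0, 5.0, 3.0, np.nan, 4.0],
})

# Each column with missing values is modelled from the other columns
imputer = IterativeImputer(max_iter=3, initial_strategy="mean", random_state=100)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

Imputed values are real-valued estimates, so integer or identifier-like columns can come back with fractional values, as with `employee_id` in the output above. It may be worth excluding such columns from imputation or rounding them afterwards.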
