# IsolationForestDetect Example

In this example, we explore how to detect anomalies in a dataset using the Isolation Forest
algorithm, which isolates observations by randomly selecting a feature and then randomly
selecting a split value between the minimum and maximum values of that feature.
Observations that can be isolated with fewer splits are more likely to be anomalies.
The `IsolationForestDetect` subclass uses the `sklearn.ensemble.IsolationForest` class from `sklearn` in the
background.
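
To make the isolation idea concrete, here is a minimal sketch that uses only the Python standard library. It is not how `sklearn` or this library implements the algorithm: it works on a single feature, skips subsampling, and the helper names `isolation_depth` and `mean_depth` are made up for illustration. It only shows that a value far from the rest tends to be separated after fewer random splits.

```python
import random


def isolation_depth(values, target, rng, max_depth=50):
    """Count the random splits needed until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:  # remaining values are identical, no further split is possible
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth


def mean_depth(values, target, n_trees=200, seed=0):
    """Average the isolation depth of `target` over many random trees."""
    rng = random.Random(seed)
    total = sum(isolation_depth(values, target, rng) for _ in range(n_trees))
    return total / n_trees


# Toy single-feature data (illustrative values, not taken from the dataset).
ages = [30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 95]
outlier_depth = mean_depth(ages, 95)
inlier_depth = mean_depth(ages, 34)
print("mean depth for 95:", outlier_depth)
print("mean depth for 34:", inlier_depth)
```

The value 95 should need noticeably fewer splits on average than 34, which is the signal an isolation forest turns into an anomaly score.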

`sklearn.ensemble.IsolationForest` can only handle numerical data. This subclass, however, also accepts categorical
input by applying ordinal encoding before calling the sklearn class. To enable this behavior,
set `enable_encoder=True`. If you'd like to use a different type of encoding,
consider using the `Pipeline` class and calling your own encoder before this subclass.
For more details see [sklearn's documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html#).
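
As a rough illustration of what ordinal encoding does to a categorical column, the sketch below maps each distinct category to an integer code. It is a hand-written stand-in, not the encoder this library uses internally; the helper names and the choice of sorted order for the codes are assumptions made for the example.

```python
def fit_ordinal_mapping(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def encode(values, mapping):
    """Replace every category with its integer code."""
    return [mapping[value] for value in values]


# A few values in the style of the `education` column.
education = ["Master's & above", "Bachelor's", "Bachelor's", "Below Secondary"]
mapping = fit_ordinal_mapping(education)
encoded = encode(education, mapping)
print(mapping)
print(encoded)
```

The integer codes impose an order on the categories that may not be meaningful, which is one reason to prefer a different encoder for some columns.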

In [1]:
import sys
sys.path.append('../../../notebooks')

import pandas as pd
import numpy as np
from raimitigations.datadiagnostics import IsolationForestDetect
from download import download_datasets
from itertools import compress

Load the data:

In [2]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset = dataset[:10000].dropna(axis=0).drop('employee_id', axis=1)

dataset

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,Procurement,region_13,Master's & above,f,other,1,37,4.0,7,1,0,71,0
9996,Sales & Marketing,region_33,Master's & above,m,sourcing,1,39,3.0,7,0,0,48,0
9997,Finance,region_13,Master's & above,f,sourcing,1,33,4.0,4,1,0,58,0
9998,Operations,region_28,Master's & above,m,other,1,32,4.0,4,1,0,57,1


This class uses a `.fit()`, `.predict()` and `.transform()` interface. The `col_predict` parameter specifies the columns to be included in error prediction; if it is `None`, all columns are evaluated for errors. The `mode` string parameter can take the following values:
- `"column"`: fit and prediction will be applied to each column independently. An error matrix of the same shape as the data will be returned by `predict`.
- `"row"`: fit is applied over the whole data and prediction over each row. A list of erroneous row indices will be returned by `predict`.

With `mode`=`"column"`, `.predict()` returns a matrix of the same shape as the input data in which each element is -1 if the value is erroneous, +1 if it is not, or `np.nan` if its column is not included in `col_predict`.

We can use the following function to print out erroneous values returned by the error matrix:

In [3]:
def print_erroneous_values_per_column(df, error_matrix):
    # Boolean mask of the cells flagged as erroneous (-1) by predict()
    mask = np.asarray(error_matrix) == -1
    for i, col in enumerate(df):
        # Distinct flagged values in this column
        errors = set(compress(df[col], mask[:, i]))
        if errors:
            print("Column: ", col)
            print(list(errors))
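
The -1 / +1 / NaN convention can also be checked by hand on a tiny example. The sketch below uses plain lists in place of a DataFrame and a NumPy array, and the error matrix is written by hand, so it is a hypothetical `predict()` result rather than real output. The helper `flagged_values` is made up for illustration.

```python
import math

columns = ["department", "no_of_trainings", "is_promoted"]
rows = [
    ["Sales & Marketing", 1, 0],
    ["Legal", 1, 0],
    ["Operations", 10, 1],
]
# Hypothetical error matrix for mode="column"; "is_promoted" is assumed
# to be left out of col_predict, so its column is all NaN.
error_matrix = [
    [1.0, 1.0, math.nan],
    [-1.0, 1.0, math.nan],
    [1.0, -1.0, math.nan],
]


def flagged_values(rows, error_matrix, columns):
    """Return {column: sorted distinct values whose flag is -1}."""
    flagged = {}
    for j, col in enumerate(columns):
        values = {row[j] for row, flags in zip(rows, error_matrix) if flags[j] == -1}
        if values:
            flagged[col] = sorted(values)
    return flagged


result = flagged_values(rows, error_matrix, columns)
# Columns that were not evaluated show up as all-NaN.
skipped = [
    col for j, col in enumerate(columns)
    if all(math.isnan(flags[j]) for flags in error_matrix)
]
print(result)
print(skipped)
```

Note that a NaN flag never compares equal to -1, so columns outside `col_predict` are silently ignored when collecting flagged values.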

#### `mode`=`"row"`:

Using the default setting, `enable_encoder`=False, categorical columns will be excluded from anomaly detection:

In [4]:
numerical_columns = list(dataset.select_dtypes(include=['number']).columns)
numerical_columns

['no_of_trainings',
 'age',
 'previous_year_rating',
 'length_of_service',
 'KPIs_met >80%',
 'awards_won?',
 'avg_training_score',
 'is_promoted']

In [5]:
isf_detector_1 = IsolationForestDetect(
    df=dataset,
    col_predict=numerical_columns,
    mode="row",
    isf_params={
        "n_estimators": 100,
        "max_samples": "auto",
        "contamination": "auto",
        "max_features": 1.0,
        "bootstrap": False,
        "n_jobs": None,
        "random_state": 0,
        "warm_start": False,
    },
    enable_encoder = False,
    verbose=False,
)
isf_detector_1.fit()
indices = isf_detector_1.predict(dataset)
len(indices)

2084

Now, using `enable_encoder`=True, we can use `col_predict`=None and include all columns in the anomaly detection by default. 

We also have options for further customization:
- We can pass an `sklearn.ensemble.IsolationForest` object directly. Note that in this case `isf_params` will be ignored.
- We can save our prediction results to a log file using the parameter `json_log_path`. This parameter defaults to `None`, in which case no log file is saved. Given `mode`=`"row"`, the log file will include:
    - "object_config": contains the set attributes of the concrete class;
    - "erroneous_rows": contains a list of erroneous row indices.

In [6]:
from sklearn.ensemble import IsolationForest
sklearn_obj = IsolationForest(random_state=100)
isf_detector_2 = IsolationForestDetect(
    df=dataset,
    col_predict=None,
    sklearn_obj=sklearn_obj,
    enable_encoder = True,
    json_log_path='../logs/log_2.json',
    verbose=False,
)
isf_detector_2.fit()
indices = isf_detector_2.predict(dataset)
len(indices)

3151

We can call the `.transform()` method to remove the erroneous rows when `mode`=`"row"`:

In [7]:
isf_detector_2.transform(dataset)

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
1,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
2,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
3,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0
4,Analytics,region_2,Bachelor's,m,sourcing,2,31,3.0,7,0,0,85,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5699,Technology,region_13,Bachelor's,m,other,1,27,4.0,5,1,0,79,0
5700,Operations,region_2,Bachelor's,m,other,1,41,2.0,15,0,0,61,0
5701,Operations,region_2,Bachelor's,f,sourcing,1,39,2.0,11,0,0,61,0
5702,Sales & Marketing,region_33,Master's & above,m,sourcing,1,39,3.0,7,0,0,48,0
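
In the output above, the index restarts at 0 and ends at 5702, and the first row of the original data is no longer present. A minimal sketch of this kind of row removal, using plain Python lists, could look as follows. It assumes the indices are positional, which the example does not confirm, and `drop_rows` is a hypothetical helper rather than part of the library.

```python
def drop_rows(rows, erroneous_indices):
    """Return the rows whose position is not listed in `erroneous_indices`."""
    to_drop = set(erroneous_indices)
    return [row for i, row in enumerate(rows) if i not in to_drop]


rows = [
    ("Sales & Marketing", 35),
    ("Operations", 30),
    ("Sales & Marketing", 34),
    ("Technology", 45),
]
erroneous = [0, 3]  # hypothetical predict() result for mode="row"
kept = drop_rows(rows, erroneous)
print(kept)
```

Because the result is a new list, the remaining rows are renumbered from 0, which mirrors the reset index in the transformed DataFrame.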


#### `mode`=`"column"`:

Again using the default setting, `enable_encoder`=False, categorical columns are excluded from anomaly detection:

In [19]:
isf_detector_3 = IsolationForestDetect(
    df=dataset,
    col_predict=['no_of_trainings','previous_year_rating','length_of_service',
                 'KPIs_met >80%','awards_won?','avg_training_score'],
    mode="column",
    isf_params={
        "n_estimators": 100,
        "max_samples": "auto",
        "contamination": "auto",
        "max_features": 1.0,
        "bootstrap": False,
        "n_jobs": None,
        "random_state": 0,
        "warm_start": False,
    },
    enable_encoder=False,
    verbose=False,
)
isf_detector_3.fit()
error_matrix = isf_detector_3.predict(dataset)
print_erroneous_values_per_column(dataset, error_matrix)

Column:  no_of_trainings
[2, 3, 4, 5, 6, 7, 8, 10]
Column:  previous_year_rating
[1.0, 2.0, 4.0, 5.0]
Column:  length_of_service
[1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]
Column:  KPIs_met >80%
[1]
Column:  awards_won?
[1]
Column:  avg_training_score
[39, 40, 41, 42, 43, 44, 45, 46, 47, 54, 55, 56, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]


With encoding enabled, we can include categorical columns. Let's also add a log file path. Given `mode`=`"column"`, the log file will include:
- "object_config": contains the set attributes of the concrete class;
- a mapping from every column containing errors to a list of its erroneous values.

In [20]:
isf_detector_4 = IsolationForestDetect(
    df=dataset,
    col_predict=['department', 'region', 'education', 'recruitment_channel',
                 'no_of_trainings', 'previous_year_rating', 'length_of_service',
                 'KPIs_met >80%', 'awards_won?', 'avg_training_score'],
    mode="column",
    isf_params={
        "n_estimators": 100,
        "max_samples": "auto",
        "contamination": "auto",
        "max_features": 1.0,
        "bootstrap": False,
        "n_jobs": None,
        "random_state": 0,
        "warm_start": False,
    },
    enable_encoder=True,
    json_log_path='../logs/log_4.json',
    verbose=False,
)
isf_detector_4.fit()
error_matrix = isf_detector_4.predict(dataset)
print_erroneous_values_per_column(dataset, error_matrix)

Column:  department
['Analytics', 'HR', 'Finance', 'R&D', 'Technology', 'Legal']
Column:  region
['region_10', 'region_32', 'region_31', 'region_9', 'region_28', 'region_19', 'region_1', 'region_4', 'region_34', 'region_13', 'region_12', 'region_23', 'region_30', 'region_25', 'region_3', 'region_20', 'region_16', 'region_27', 'region_17', 'region_14', 'region_33', 'region_24', 'region_8', 'region_11', 'region_21', 'region_29', 'region_6', 'region_18', 'region_5']
Column:  education
['Below Secondary', "Master's & above"]
Column:  recruitment_channel
['referred', 'sourcing']
Column:  no_of_trainings
[2, 3, 4, 5, 6, 7, 8, 10]
Column:  previous_year_rating
[1.0, 2.0, 4.0, 5.0]
Column:  length_of_service
[1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]
Column:  KPIs_met >80%
[1]
Column:  awards_won?
[1]
Column:  avg_training_score
[39, 40, 41, 42, 43, 44, 45, 46, 47, 54, 55, 56, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 7

In [21]:
error_matrix

array([[ 1.,  1., -1., ...,  1.,  1., nan],
       [ 1.,  1.,  1., ...,  1.,  1., nan],
       [ 1., -1.,  1., ...,  1.,  1., nan],
       ...,
       [-1., -1., -1., ...,  1.,  1., nan],
       [ 1., -1., -1., ...,  1.,  1., nan],
       [ 1., -1.,  1., ...,  1., -1., nan]])

We can call the `.transform()` method to transform the data. With `mode`=`"column"`, we have the following two options:
1. Remove erroneous values from the data, leaving missing values in their place:

In [22]:
isf_detector_4.transform(dataset)

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,Sales & Marketing,region_7,,f,,1.0,35,,8.0,,0.0,49.0,0
1,Operations,region_22,Bachelor's,m,other,1.0,30,,4.0,0.0,0.0,60.0,0
2,Sales & Marketing,,Bachelor's,m,,1.0,34,3.0,7.0,0.0,0.0,50.0,0
3,Sales & Marketing,,Bachelor's,m,other,,39,,,0.0,0.0,50.0,0
4,,region_26,Bachelor's,m,other,1.0,45,3.0,2.0,0.0,0.0,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,Procurement,,,f,other,1.0,37,,7.0,,0.0,,0
9996,Sales & Marketing,,,m,,1.0,39,3.0,7.0,0.0,0.0,48.0,0
9997,,,,f,,1.0,33,,4.0,,0.0,58.0,0
9998,Operations,,,m,other,1.0,32,,4.0,,0.0,57.0,1


2. Pass an imputer to fill in these values after removal. Here we use the `IterativeDataImputer` offered by this library:

In [23]:
from raimitigations.dataprocessing import IterativeDataImputer
from sklearn.ensemble import RandomForestRegressor
imputer = IterativeDataImputer(
    df=dataset,
    col_impute=None,
    enable_encoder=True,
    iterative_params={
        'estimator': RandomForestRegressor(),
        'missing_values': np.nan,
        'sample_posterior': False,
        'max_iter': 3,
        'tol': 1e-3,
        'n_nearest_features': None,
        'initial_strategy': 'mean',
        'imputation_order': 'ascending',
        'skip_complete': False,
        'min_value': -np.inf,
        'max_value': np.inf,
        'random_state': 100},
    verbose=False
)

isf_detector_4.transform(dataset, imputer=imputer)



Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,Sales & Marketing,region_7,Bachelor's,f,other,1.0,35.0,3.0,8.00,0.0,0.0,49.00,0.0
1,Operations,region_22,Bachelor's,m,other,1.0,30.0,3.0,4.00,0.0,0.0,60.00,0.0
2,Sales & Marketing,region_15,Bachelor's,m,other,1.0,34.0,3.0,7.00,0.0,0.0,50.00,0.0
3,Sales & Marketing,region_22,Bachelor's,m,other,1.0,39.0,3.0,4.86,0.0,0.0,50.00,0.0
4,Procurement,region_26,Bachelor's,m,other,1.0,45.0,3.0,2.00,0.0,0.0,81.00,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8850,Procurement,region_2,Bachelor's,f,other,1.0,37.0,3.0,7.00,0.0,0.0,80.05,0.0
8851,Sales & Marketing,region_2,Bachelor's,m,other,1.0,39.0,3.0,7.00,0.0,0.0,48.00,0.0
8852,Operations,region_2,Bachelor's,f,other,1.0,33.0,3.0,4.00,0.0,0.0,58.00,0.0
8853,Operations,region_22,Bachelor's,m,other,1.0,32.0,3.0,4.00,0.0,0.0,57.00,1.0
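
The two options can be pictured with a small standard-library sketch: first blank out the cells flagged -1, then fill the gaps. The fill step here is a plain column mean, used only as a simple stand-in; it is not the iterative imputation performed by `IterativeDataImputer`. The error matrix is written by hand and both helper functions are hypothetical.

```python
import math


def mask_errors(rows, error_matrix):
    """Replace every cell flagged -1 with None; keep +1 and NaN cells as they are."""
    return [
        [None if flag == -1 else value for value, flag in zip(row, flags)]
        for row, flags in zip(rows, error_matrix)
    ]


def impute_column_means(rows):
    """Fill None cells with the mean of the non-missing values in the same column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        present = [row[j] for row in rows if row[j] is not None]
        means.append(sum(present) / len(present) if present else None)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


# Toy numeric data: no_of_trainings, avg_training_score, is_promoted
rows = [
    [1, 49, 0],
    [1, 60, 0],
    [10, 50, 1],
]
# Hypothetical error matrix; the last column is assumed to be outside col_predict.
error_matrix = [
    [1.0, 1.0, math.nan],
    [1.0, -1.0, math.nan],
    [-1.0, 1.0, math.nan],
]

masked = mask_errors(rows, error_matrix)    # option 1: values removed
imputed = impute_column_means(masked)       # option 2: values filled in
print(masked)
print(imputed)
```

Integer columns become floating point once a mean is written into them, similar to how the imputed table above shows values such as 4.86 in `length_of_service`.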
