# ActiveDetect Example

In this example, we explore how to predict errors in a dataset with the error modules available in `ActiveDetect`, using a modified version of the HR promotion dataset.

This subclass uses the error detection modules presented in Sanjay Krishnan et al.'s activedetect repo and paper: [BoostClean: Automated Error Detection and Repair for Machine Learning](https://arxiv.org/abs/1711.01299).

In [2]:
import sys
sys.path.append('../../../notebooks')

import pandas as pd
import numpy as np
from raimitigations.datadiagnostics import *
from download import download_datasets
import random
import string
from itertools import compress

Load the data:

In [3]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset = dataset[:10000].drop('employee_id', axis=1)

dataset

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,Procurement,region_13,Master's & above,f,other,1,37,4.0,7,1,0,71,0
9996,Sales & Marketing,region_33,Master's & above,m,sourcing,1,39,3.0,7,0,0,48,0
9997,Finance,region_13,Master's & above,f,sourcing,1,33,4.0,4,1,0,58,0
9998,Operations,region_28,Master's & above,m,other,1,32,4.0,4,1,0,57,1


#### Dataset Edit
Real-world datasets tend to contain errors, and this one is no different. For the sake of this tutorial, however, we also add some synthetic errors to highlight the functionality offered by `ActiveDetect`.

In [4]:
def get_random_indices(df: pd.DataFrame, size: int) -> list:
    indices = df.index.to_list()
    return random.sample(indices, size)
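
Note that `random.sample` draws from Python's global random state, so the selected rows change between runs unless a seed is set beforehand. As a minimal sketch (the helper name `get_seeded_indices` and its `seed` argument are hypothetical and not part of this tutorial's code), a dedicated `random.Random` instance makes the selection repeatable:

```python
import random

def get_seeded_indices(index_values: list, size: int, seed: int) -> list:
    # Hypothetical variant of get_random_indices: a private generator
    # keeps the draw independent of the global random state.
    rng = random.Random(seed)
    return rng.sample(index_values, size)

first = get_seeded_indices(list(range(10000)), 6, seed=100)
second = get_seeded_indices(list(range(10000)), 6, seed=100)
print(first == second)  # True: the same seed gives the same indices
```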

Add quantitative errors to `"avg_training_score"` column:

In [5]:
rand_indices = get_random_indices(dataset, 6)
dataset.loc[rand_indices, "avg_training_score"] = dataset.loc[rand_indices, "avg_training_score"] * 1000

Add punctuation errors to `"gender"` column:

In [6]:
rand_indices = get_random_indices(dataset, 5)
dataset.loc[rand_indices, "gender"] = ['','.',',','  ','. ']

Add semantic errors to `"education"` column:

In [7]:
rand_indices = get_random_indices(dataset, 2)
dataset.loc[rand_indices, "education"] = ["not an education status 1", "not an education status 2"]

Add distribution errors to `"region"` column:

In [8]:
dataset["region"].value_counts()

region_2     2231
region_22    1183
region_7      886
region_15     509
region_13     488
region_26     427
region_31     342
region_4      320
region_27     311
region_16     286
region_28     236
region_11     216
region_23     214
region_29     189
region_32     184
region_19     158
region_14     155
region_17     151
region_20     144
region_5      139
region_25     134
region_6      134
region_1      124
region_30     117
region_8      115
region_10     113
region_24      84
region_12      82
region_9       78
region_21      65
region_3       64
region_34      58
region_33      56
region_18       7
Name: region, dtype: int64

In [9]:
rand_indices = get_random_indices(dataset, 5000)
dataset.loc[rand_indices, "region"] = "region_x"

Add a synthetic string column `"X"` (including synthetic errors) to evaluate string and character similarity errors:

In [10]:
random.seed(100)
# Fill "X" with one random 5-letter lowercase string per row
dataset["X"] = [''.join(random.choice(string.ascii_lowercase) for _ in range(5)) for _ in range(dataset.shape[0])]
rand_indices = get_random_indices(dataset, 2)
dataset.loc[rand_indices, "X"] = [str(random.randint(0, 1000)), "?????****?????"]

## Error Modules
`ActiveDetect` calls on different error modules to predict errors in the data. It automatically detects what type of data each column has (numerical, categorical, string) and calls applicable error modules.

It exposes a `.fit()`, `.predict()` and `.transform()` interface. The `col_predict` parameter specifies the columns to include in error prediction; if `None`, all columns are evaluated for errors. The `mode` string parameter takes one of the following values:
- `"column"`: prediction will be applied to each column. An error matrix of the same shape as the data will be returned by `predict`.
- `"row"`: prediction will be applied over each row as a whole. A list of erroneous row indices will be returned by `predict`.

With the default setting, `mode`=`"column"`, `.predict()` returns a matrix of the same shape as the input data in which each element is -1 if erroneous, +1 if non-erroneous, or `np.nan` for columns not included in `col_predict`. This matrix is the union of the error matrices returned by all modules passed to `error_modules`. To get the error matrix of a single module after calling `.predict()`, call `.get_error_module_matrix(<error module name>)`.
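
To make the union concrete, here is a minimal NumPy sketch under the assumption that a cell is marked -1 in the combined matrix if any module marks it -1. The two small matrices are made-up examples, and `union_error_matrices` is a hypothetical helper, not a function of the library:

```python
import numpy as np

def union_error_matrices(matrices):
    # A cell is erroneous (-1) if any module flags it, otherwise +1.
    stacked = np.stack(matrices)
    combined = np.where((stacked == -1).any(axis=0), -1.0, 1.0)
    # Keep np.nan where every module skipped the cell (column not predicted).
    combined[np.isnan(stacked).all(axis=0)] = np.nan
    return combined

module_a = np.array([[1, -1, np.nan], [1, 1, np.nan]])
module_b = np.array([[-1, 1, np.nan], [1, 1, np.nan]])
combined = union_error_matrices([module_a, module_b])
print(combined)
```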

We can use the following function to print the values that the error matrix flags as erroneous:

In [11]:
def print_erroneous_values_per_column(df, error_matrix):
    # Boolean mask of the cells flagged as erroneous (-1)
    mask = error_matrix == -1
    for i, col in enumerate(df):
        # Unique erroneous values found in this column
        errors = set(compress(list(df[col]), mask[:, i]))
        if errors:
            print("Column: ", col)
            print(list(errors))

In [12]:
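def check_dataframe_matrix(dataframe, matrix, indices):
    # Every row index reported in "row" mode should contain at least one -1
    # in the "column"-mode error matrix. Assumes a default integer index,
    # so that index labels match matrix row positions.
    for index, _ in dataframe.iterrows():
        if index in indices:
            if -1 not in matrix[index, :]:
                print(f"Error: Row {index} in the matrix does not contain -1.")
                return
    print('all good.')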
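
The same relationship between the two modes can be sketched with NumPy: under the assumption that a row counts as erroneous when any of its cells is -1, the row indices follow from the column-mode matrix. The matrix below is a made-up example, and the library's actual row-mode logic may differ:

```python
import numpy as np

toy_matrix = np.array([
    [1, 1, 1],
    [1, -1, 1],
    [np.nan, 1, 1],
    [-1, np.nan, -1],
])
# Rows with at least one -1 cell; np.nan cells never compare equal to -1.
erroneous_rows = np.where((toy_matrix == -1).any(axis=1))[0].tolist()
print(erroneous_rows)  # [1, 3]
```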

Now, although we can use `ActiveDetect` to call multiple modules at once, let's first explore each available module individually. We'll look at both `mode`s for each module:

### QuantitativeErrorModule
This module detects both quantitative parsing failures and abnormal values in a `numerical` column using standard deviation. It takes the following parameter:
- `thresh`: a standard deviation count threshold that determines how many standard deviations a non-erroneous value may lie from the column mean. This parameter defaults to 3.5.
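
As an illustration of what such a threshold means, the sketch below applies a plain z-score rule to made-up scores. It is an assumption about the general idea, not the module's actual implementation; `flag_by_std` and the sample values are hypothetical:

```python
import statistics

def flag_by_std(values, thresh=3.5):
    # Flag values lying more than `thresh` standard deviations from the mean.
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [False] * len(values)
    return [abs(v - mean) / std > thresh for v in values]

# Thirty plausible training scores plus one value multiplied by 1000.
scores = [48, 52, 50, 49, 51] * 6 + [49000]
flags = flag_by_std(scores)
print([s for s, f in zip(scores, flags) if f])  # [49000]
```

A single extreme value inflates both the mean and the standard deviation, so with very few rows even a large outlier may stay under the threshold.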

In [13]:
## mode = "column"
active_detector_1 = ActiveDetect(
    df=dataset,
    col_predict=None,
    mode="column",
    error_modules=[QuantitativeErrorModule()],
    verbose=False,
)
active_detector_1.fit()
error_matrix = active_detector_1.predict(dataset)
print_erroneous_values_per_column(dataset, error_matrix)

Column:  no_of_trainings
[4, 5, 6, 7, 8, 10]
Column:  length_of_service
[32, 33, 34, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
Column:  awards_won?
[1]
Column:  avg_training_score
[64000, 84000, 49000, 69000, 82000, 58000]


In [14]:
## mode = "row"
active_detector_1 = ActiveDetect(
    df=dataset,
    col_predict=None,
    mode="row",
    error_modules=[QuantitativeErrorModule()],
    verbose=False,
)
active_detector_1.fit()
indices = active_detector_1.predict(dataset)
len(indices)

471

In [15]:
check_dataframe_matrix(dataset, error_matrix, indices)

all good.


### PuncErrorModule
This module detects values that consist only of punctuation, whitespace, etc. in `categorical` and `string` columns. It takes no parameters.
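
A rough way to picture this check, assuming "punctuation" means the characters in Python's `string.punctuation` (the module's real rule may be broader), is the hypothetical helper below:

```python
import string

def is_punctuation_only(value: str) -> bool:
    # True when every character is punctuation or whitespace;
    # the empty string also counts, since it has no other characters.
    return all(ch in string.punctuation or ch.isspace() for ch in value)

samples = ["f", "m", "", ".", ",", "  ", ". ", "?????****?????"]
flagged = [s for s in samples if is_punctuation_only(s)]
print(flagged)
```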

In [16]:
## mode = "column"
active_detector_2 = ActiveDetect(
    df=dataset,
    col_predict=None,
    error_modules=[PuncErrorModule()],
    verbose=False,
)
active_detector_2.fit()
error_matrix = active_detector_2.predict(dataset)
print_erroneous_values_per_column(dataset, error_matrix)

Column:  gender
['', '.', ',', '  ', '. ']
Column:  X
['?????****?????']


In [17]:
## mode = "row"
active_detector_2 = ActiveDetect(
    df=dataset,
    col_predict=None,
    mode="row",
    error_modules=[PuncErrorModule()],
    verbose=False,
)
active_detector_2.fit()
indices = active_detector_2.predict(dataset)
len(indices)

6

In [18]:
check_dataframe_matrix(dataset, error_matrix, indices)

all good.


### SemanticErrorModule
This module detects values that do not belong in a `categorical` column, using the Word2Vec architecture. It takes the following parameters:

- `thresh`: a float specifying the similarity score threshold that determines when a value doesn't belong to the set. This parameter defaults to 3.5.
- `fail_thresh`: an int representing the fraction of tokens not found in the corpus before short-circuiting. This parameter defaults to 5.

In [19]:
## mode = "column"
active_detector_3 = ActiveDetect(
    df=dataset,
    col_predict=None,
    error_modules=[SemanticErrorModule(thresh=1.5, fail_thresh=2)],
    verbose=False,
)
active_detector_3.fit()
error_matrix = active_detector_3.predict(dataset)
print_erroneous_values_per_column(dataset, error_matrix)

Column:  department
['R&D', 'Procurement']
Column:  education
['not an education status 1', 'not an education status 2']
Column:  gender
['', '.', ',', '  ', '. ']


In [20]:
## mode = "row"
active_detector_3 = ActiveDetect(
    df=dataset,
    col_predict=None,
    mode="row",
    error_modules=[SemanticErrorModule(thresh=1.5, fail_thresh=2)],
    verbose=False,
)
active_detector_3.fit()
indices = active_detector_3.predict(dataset)
len(indices)

1502

In [21]:
check_dataframe_matrix(dataset, error_matrix, indices)

all good.


### DistributionErrorModule
This module detects values that appear more or less frequently than is typical in the dataset, using standard deviation. It applies to all data types and takes the following parameters:
- `thresh`: a standard deviation count threshold that determines how many standard deviations the frequency count of a non-erroneous value may lie from the mean count. This parameter defaults to 3.5.
- `fail_thresh`: the minimum number of unique values required for a successful run of the module. This parameter defaults to 2.
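
The idea can be sketched by applying a z-score rule to value counts rather than to the values themselves. This is an illustrative assumption, not the module's actual code; `flag_by_frequency` and the toy region data are hypothetical:

```python
from collections import Counter
import statistics

def flag_by_frequency(values, thresh=3.5, fail_thresh=2):
    counts = Counter(values)
    # Too few distinct values to estimate a meaningful spread of counts.
    if len(counts) < fail_thresh:
        return set()
    freq = list(counts.values())
    mean = statistics.fmean(freq)
    std = statistics.pstdev(freq)
    if std == 0:
        return set()
    # Values whose count is more than `thresh` stds from the mean count.
    return {v for v, c in counts.items() if abs(c - mean) / std > thresh}

# Ten regions with 10 rows each, plus one over-represented region.
regions = [f"region_{i}" for i in range(1, 11) for _ in range(10)]
regions += ["region_x"] * 500
flagged = flag_by_frequency(regions, thresh=2.5)
print(flagged)  # {'region_x'}
```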

In [22]:
## mode = "column"
active_detector_4 = ActiveDetect(
    df=dataset,
    col_predict=None,
    error_modules=[DistributionErrorModule(thresh=2.5, fail_thresh=2)],
    verbose=False,
)
active_detector_4.fit()
error_matrix = active_detector_4.predict(dataset)
print_erroneous_values_per_column(dataset, error_matrix)

Column:  region
['region_x']
Column:  no_of_trainings
[1]
Column:  avg_training_score
[49, 50]


In [23]:
## mode = "row"
active_detector_4 = ActiveDetect(
    df=dataset,
    col_predict=None,
    mode="row",
    error_modules=[DistributionErrorModule(thresh=2.5, fail_thresh=2)],
    verbose=False,
)
active_detector_4.fit()
indices = active_detector_4.predict(dataset)
len(indices)

9166

In [24]:
check_dataframe_matrix(dataset, error_matrix, indices)

all good.


### StringSimilarityErrorModule
This module detects values that do not belong in a `string` column. It fine-tunes Word2Vec on the given data and compares the likelihood scores of the input values, using standard deviation, to predict possibly erroneous values. It takes the following parameter:
- `thresh`: a standard deviation count threshold that determines how many standard deviations a non-erroneous string's likelihood score may lie from the dataset mean. This parameter defaults to 3.5.

In [25]:
## mode = "column"
active_detector_5 = ActiveDetect(
    df=dataset,
    col_predict=['X'],
    error_modules=[StringSimilarityErrorModule(thresh=1.5)],
    verbose=False,
)
active_detector_5.fit()
error_matrix = active_detector_5.predict(dataset)
print_erroneous_values_per_column(dataset, error_matrix)

Column:  X
['gomjs', 'cmsig', 'ugqgq']


In [26]:
## mode = "row"
active_detector_5 = ActiveDetect(
    df=dataset,
    col_predict=['X'],
    mode="row",
    error_modules=[StringSimilarityErrorModule(thresh=1.5)],
    verbose=False,
)
active_detector_5.fit()
indices = active_detector_5.predict(dataset)
len(indices)

6

In [27]:
check_dataframe_matrix(dataset, error_matrix, indices)

all good.


### CharSimilarityErrorModule
This module detects values that do not belong in a `string` column. It fine-tunes Word2Vec on the given data at the character level and compares the likelihood scores of the input values, using standard deviation, to predict possibly erroneous values. It takes the following parameter:
- `thresh`: a standard deviation count threshold that determines how many standard deviations a non-erroneous string's likelihood score may lie from the dataset mean. This parameter defaults to 3.5.

In [28]:
## mode = "column"
active_detector_6 = ActiveDetect(
    df=dataset,
    col_predict=None,
    error_modules=[CharSimilarityErrorModule(thresh=5)],
    verbose=False,
)
active_detector_6.fit()
error_matrix = active_detector_6.predict(dataset)
print_erroneous_values_per_column(dataset, error_matrix)

Column:  X
['?????****?????', '900']


In [29]:
## mode = "row"
active_detector_6 = ActiveDetect(
    df=dataset,
    col_predict=None,
    mode="row",
    error_modules=[CharSimilarityErrorModule(thresh=5)],
    verbose=False,
)
active_detector_6.fit()
indices = active_detector_6.predict(dataset)
len(indices)

2

In [30]:
check_dataframe_matrix(dataset, error_matrix, indices)

all good.


## Calling multiple error modules
Next, let's use `ActiveDetect` to call multiple error modules at once. We can also use the `json_log_path` parameter to save the prediction results to a JSON log file. This parameter defaults to `None`, in which case no log file is saved. With `mode`=`"column"`, the log file includes:
- "object_config": contains set attributes of the concrete class;
- a mapping from every column containing errors to a list of its erroneous values.
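
The exact file layout is not shown in this example, so the snippet below writes a made-up log containing the two kinds of entries described above and reads it back with the standard `json` module. The content of `object_config`, the file name, and the sample values are assumptions for illustration only:

```python
import json
import os
import tempfile

# Hypothetical log content mirroring the description above.
sample_log = {
    "object_config": {"mode": "column"},
    "education": ["not an education status 1", "not an education status 2"],
}

log_path = os.path.join(tempfile.mkdtemp(), "log_example.json")
with open(log_path, "w") as f:
    json.dump(sample_log, f)

with open(log_path) as f:
    log = json.load(f)

# Every key other than "object_config" is assumed to be a column name.
errors_per_column = {k: v for k, v in log.items() if k != "object_config"}
print(errors_per_column)
```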

In [31]:
## mode = "column"
active_detector_7 = ActiveDetect(
    df=dataset,
    col_predict=None,
    error_modules=[QuantitativeErrorModule(), DistributionErrorModule(thresh=2.5, fail_thresh=2), SemanticErrorModule(thresh=1.5, fail_thresh=2)],
    json_log_path="../logs/log_7.json",
    verbose=False,
)
active_detector_7.fit()
error_matrix = active_detector_7.predict(dataset)
print_erroneous_values_per_column(dataset, error_matrix)

Column:  department
['R&D', 'Procurement']
Column:  region
['region_x']
Column:  education
['not an education status 1', 'not an education status 2']
Column:  gender
['', '.', ',', '  ', '. ']
Column:  no_of_trainings
[1, 4, 5, 6, 7, 8, 10]
Column:  length_of_service
[32, 33, 34, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
Column:  awards_won?
[1]
Column:  avg_training_score
[64000, 84000, 49000, 69000, 58000, 49, 50, 82000]


In [32]:
error_matrix

array([[ 1.,  1.,  1., ..., -1.,  1.,  1.],
       [ 1.,  1.,  1., ...,  1.,  1.,  1.],
       [ 1.,  1.,  1., ..., -1.,  1.,  1.],
       ...,
       [ 1., -1.,  1., ...,  1.,  1.,  1.],
       [ 1., -1.,  1., ...,  1.,  1.,  1.],
       [-1., -1.,  1., ...,  1.,  1.,  1.]])

If you'd like to isolate the error matrix of a single module:

In [33]:
active_detector_7.get_error_module_matrix('QuantitativeErrorModule')

array([[1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       ...,
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1],
       [1, 1, 1, ..., 1, 1, 1]])

Finally, we can call the `.transform()` function to transform the data. With `mode`=`"column"`, we have the following two options:
1. Remove erroneous values from the data:

In [34]:
active_detector_7.transform(dataset)

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,X
0,Sales & Marketing,region_7,Master's & above,f,sourcing,,35,5.0,8.0,1,0.0,,0,eooyf
1,Operations,region_22,Bachelor's,m,other,,30,5.0,4.0,0,0.0,60.0,0,wmxln
2,Sales & Marketing,region_19,Bachelor's,m,sourcing,,34,3.0,7.0,0,0.0,,0,qzdrd
3,Sales & Marketing,region_23,Bachelor's,m,other,2.0,39,1.0,10.0,0,0.0,,0,cxoib
4,Technology,,Bachelor's,m,other,,45,3.0,2.0,0,0.0,73.0,0,vugkh
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,,,Master's & above,f,other,,37,4.0,7.0,1,0.0,71.0,0,kjtsp
9996,Sales & Marketing,,Master's & above,m,sourcing,,39,3.0,7.0,0,0.0,48.0,0,xtgaz
9997,Finance,,Master's & above,f,sourcing,,33,4.0,4.0,1,0.0,58.0,0,uycfb
9998,Operations,,Master's & above,m,other,,32,4.0,4.0,1,0.0,57.0,1,vgdui
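
Conceptually, this removal step resembles replacing every cell flagged -1 with a missing value. A minimal pandas sketch of that idea, using a made-up frame and error matrix (not the library's implementation):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({"gender": ["f", ".", "m"], "score": [49, 60, 49000]})
toy_errors = np.array([[1, 1], [-1, 1], [1, -1]])

# Replace every cell flagged -1 with NaN, leaving other cells untouched.
cleaned = toy_df.mask(toy_errors == -1)
print(cleaned)
```

Integer columns that receive a missing value are upcast to float, which is consistent with values such as `8.0` and `60.0` in the output above.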


2. Or, we can pass an imputer to impute these values after removal. Here we use the `IterativeDataImputer` offered by this library:

In [35]:
from raimitigations.dataprocessing import IterativeDataImputer
from sklearn.ensemble import RandomForestRegressor
imputer = IterativeDataImputer(
    df=dataset,
    col_impute=None,
    enable_encoder=True,
    iterative_params={
        'estimator': RandomForestRegressor(),
        'missing_values': np.nan,
        'sample_posterior': False,
        'max_iter': 3,
        'tol': 1e-3,
        'n_nearest_features': None,
        'initial_strategy': 'mean',
        'imputation_order': 'ascending',
        'skip_complete': False,
        'min_value': -np.inf,
        'max_value': np.inf,
        'random_state': 100},
    verbose=False
)

active_detector_7.transform(dataset, imputer=imputer)



Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,X
0,Sales & Marketing,region_7,Master's & above,f,sourcing,2.27,35.0,5.0,8.0,1.0,0.0,48.11,0.0,eooyf
1,Operations,region_22,Bachelor's,m,other,2.17,30.0,5.0,4.0,0.0,0.0,60.00,0.0,wmxln
2,Sales & Marketing,region_19,Bachelor's,m,sourcing,2.06,34.0,3.0,7.0,0.0,0.0,49.64,0.0,qzdrd
3,Sales & Marketing,region_23,Bachelor's,m,other,2.00,39.0,1.0,10.0,0.0,0.0,49.04,0.0,cxoib
4,Technology,region_28,Bachelor's,m,other,2.21,45.0,3.0,2.0,0.0,0.0,73.00,0.0,vugkh
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,Sales & Marketing,region_26,Master's & above,f,other,2.36,37.0,4.0,7.0,1.0,0.0,71.00,0.0,kjtsp
9996,Sales & Marketing,region_21,Master's & above,m,sourcing,2.11,39.0,3.0,7.0,0.0,0.0,48.00,0.0,xtgaz
9997,Finance,region_21,Master's & above,f,sourcing,2.17,33.0,4.0,4.0,1.0,0.0,58.00,0.0,uycfb
9998,Operations,region_28,Master's & above,m,other,2.46,32.0,4.0,4.0,1.0,0.0,57.00,1.0,vgdui


Now using `mode`=`"row"`:

Let's save our results to a log file as well. With this `mode`, the log file includes:
- "object_config": contains set attributes of the concrete class;
- "erroneous_rows": containing a list of erroneous row indices.

In [38]:
## mode = "row"
active_detector_8 = ActiveDetect(
    df=dataset,
    col_predict=None,
    mode="row",
    error_modules=[QuantitativeErrorModule(), PuncErrorModule(), SemanticErrorModule(thresh=1.5, fail_thresh=2)],
    json_log_path="../logs/log_8.json",
    verbose=False,
)
active_detector_8.fit()
indices = active_detector_8.predict(dataset)
len(indices)

1889

Again, we can call the `.transform()` function to transform the data. With `mode`=`"row"`, the `transform` method removes the erroneous rows from the dataset.

In [39]:
active_detector_8.transform(dataset)

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,X
0,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0,eooyf
1,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0,wmxln
2,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0,qzdrd
3,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0,cxoib
4,Technology,region_x,Bachelor's,m,other,1,45,3.0,2,0,0,73,0,vugkh
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8106,Operations,region_8,Bachelor's,f,sourcing,1,26,5.0,3,1,0,60,0,nwszn
8107,Operations,region_x,Bachelor's,f,sourcing,1,39,2.0,11,0,0,61,0,gbwrw
8108,Sales & Marketing,region_x,Master's & above,m,sourcing,1,39,3.0,7,0,0,48,0,xtgaz
8109,Finance,region_x,Master's & above,f,sourcing,1,33,4.0,4,1,0,58,0,uycfb
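
The row-wise removal can be pictured as dropping the predicted indices and renumbering what remains, which appears to match the contiguous index in the output above. A minimal pandas sketch with made-up data and indices (an assumption about the behavior, not the library's code):

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"gender": ["f", ".", "m", "m"], "score": [49, 60, 49000, 58]}
)
toy_indices = [1, 2]  # made-up erroneous row indices

# Drop the flagged rows and renumber the remaining ones from 0.
kept = toy_df.drop(index=toy_indices).reset_index(drop=True)
print(kept)
```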
