# IterativeDataImputer Example

In this example, we will explore how to impute missing values in the HR promotion dataset using iterative imputation.

This method imputes the missing values of a feature using the other features. It works in a round-robin fashion: each feature with missing values is modeled in turn as a function of the remaining features, and the fitted model is used to predict the missing entries.
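A toy sketch of the round-robin idea, using ordinary least squares in NumPy rather than the estimators used later in this notebook. The function name `round_robin_impute` and the data are hypothetical and exist only for illustration; scikit-learn's implementation is more elaborate.

```python
import numpy as np

def round_robin_impute(X, n_iter=10):
    """Toy round-robin imputation using ordinary least squares."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: fill each column with the mean of its observed values.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue  # nothing to impute in this column
            others = np.delete(filled, j, axis=1)
            # Design matrix: intercept plus all the other features.
            A = np.column_stack([np.ones(len(others)), others])
            observed = ~missing[:, j]
            # Fit on rows where column j is observed, predict where it is missing.
            coef, *_ = np.linalg.lstsq(A[observed], filled[observed, j], rcond=None)
            filled[missing[:, j], j] = A[missing[:, j]] @ coef
    return filled

# Made-up data: the second column is exactly twice the first.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [np.nan, 8.0],
    [5.0, 10.0],
    [6.0, 12.0],
])
X_imputed = round_robin_impute(X)
print(np.round(X_imputed, 3))
```

Each pass refits one model per incomplete column, so the estimates for one column benefit from the updated estimates of the others.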

This subclass uses the `sklearn.impute.IterativeImputer` class from `sklearn` in the background (note that this sklearn class is still in an experimental stage).
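Because the scikit-learn estimator is experimental, using it directly requires an explicit opt-in import before the class itself can be imported. A minimal sketch of calling it on a small, made-up numeric array (the data is illustrative and not taken from the HR dataset):

```python
import numpy as np
# IterativeImputer is experimental: this import must come first to enable it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up data: the second column is roughly twice the first.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, np.nan],
    [np.nan, 8.2],
    [5.0, 9.9],
])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

Note that this class only accepts numeric input, which is the limitation that the `enable_encoder` option discussed later in this notebook works around.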

In [1]:
import sys
sys.path.append('../../../notebooks')

import pandas as pd
import numpy as np
from raimitigations.dataprocessing import IterativeDataImputer

from download import download_datasets

from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.ensemble import RandomForestRegressor

## Handling a DataFrame with column names

In [2]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset = dataset[:10000]

dataset

Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,65141,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,7513,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,2542,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,48945,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,14934,Procurement,region_13,Master's & above,f,other,1,37,4.0,7,1,0,71,0
9996,22040,Sales & Marketing,region_33,Master's & above,m,sourcing,1,39,3.0,7,0,0,48,0
9997,14188,Finance,region_13,Master's & above,f,sourcing,1,33,4.0,4,1,0,58,0
9998,73566,Operations,region_28,Master's & above,m,other,1,32,4.0,4,1,0,57,1


In [3]:
print(dataset.isna().any())
print(dataset['education'].unique())
print(dataset['previous_year_rating'].unique())

employee_id             False
department              False
region                  False
education                True
gender                  False
recruitment_channel     False
no_of_trainings         False
age                     False
previous_year_rating     True
length_of_service       False
KPIs_met >80%           False
awards_won?             False
avg_training_score      False
is_promoted             False
dtype: bool
["Master's & above" "Bachelor's" nan 'Below Secondary']
[ 5.  3.  1.  4. nan  2.]


### Without Enabling Encoding

As we can see above, both the **education** and **previous_year_rating** columns have missing values.

However, note that the dataset includes categorical columns such as **education**, while sklearn's `sklearn.impute.IterativeImputer` can only handle numerical data. The good news is that `IterativeDataImputer` can handle categorical data through the boolean parameter `enable_encoder`.

First, let's try the default value `enable_encoder=False`. With this setting, categorical columns are excluded from the imputation process whether or not they have missing values, so we use `col_impute` to specify that only `previous_year_rating` should be imputed.

Additionally, we can set the parameters of `sklearn.impute.IterativeImputer` through the `iterative_params` dictionary, as long as it uses the following format. If nothing is passed to this parameter, default values are used.

In [11]:
imputer = IterativeDataImputer(
    df=dataset,
    col_impute=['previous_year_rating'],
    enable_encoder=False,
    iterative_params={
        'estimator': RandomForestRegressor(),
        'missing_values': np.nan,
        'sample_posterior': False,
        'max_iter': 10,
        'tol': 1e-3,
        'n_nearest_features': None,
        'initial_strategy': 'mean',
        'imputation_order': 'ascending',
        'skip_complete': False,
        'min_value': -np.inf,
        'max_value': np.inf,
        'random_state': None}
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df


If you'd like to include these columns, you need to set 'enable_encoder'=True.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass for imputation.
  self._fit()


[IterativeImputer] Completing matrix with shape (10000, 9)
[IterativeImputer] Change: 1.7071906463137385, scaled tolerance: 78.297 
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (10000, 9)


Unnamed: 0,employee_id,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,education,region,recruitment_channel,department,gender
0,65438.0,1.0,35.0,5.0,8.0,1.0,0.0,49.0,0.0,Master's & above,region_7,sourcing,Sales & Marketing,f
1,65141.0,1.0,30.0,5.0,4.0,0.0,0.0,60.0,0.0,Bachelor's,region_22,other,Operations,m
2,7513.0,1.0,34.0,3.0,7.0,0.0,0.0,50.0,0.0,Bachelor's,region_19,sourcing,Sales & Marketing,m
3,2542.0,2.0,39.0,1.0,10.0,0.0,0.0,50.0,0.0,Bachelor's,region_23,other,Sales & Marketing,m
4,48945.0,1.0,45.0,3.0,2.0,0.0,0.0,73.0,0.0,Bachelor's,region_26,other,Technology,m
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,14934.0,1.0,37.0,4.0,7.0,1.0,0.0,71.0,0.0,Master's & above,region_13,other,Procurement,f
9996,22040.0,1.0,39.0,3.0,7.0,0.0,0.0,48.0,0.0,Master's & above,region_33,sourcing,Sales & Marketing,m
9997,14188.0,1.0,33.0,4.0,4.0,1.0,0.0,58.0,0.0,Master's & above,region_13,sourcing,Finance,f
9998,73566.0,1.0,32.0,4.0,4.0,1.0,0.0,57.0,1.0,Master's & above,region_28,other,Operations,m


In [12]:
print(new_df.isna().any())

employee_id             False
no_of_trainings         False
age                     False
previous_year_rating    False
length_of_service       False
KPIs_met >80%           False
awards_won?             False
avg_training_score      False
is_promoted             False
education                True
region                  False
recruitment_channel     False
department              False
gender                  False
dtype: bool
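With `enable_encoder=False`, the result resembles what you would get by imputing only the numeric columns and leaving the rest untouched. The sketch below reproduces that idea with pandas and scikit-learn on a small made-up frame. Forwarding the dictionary with `**iterative_params` is an assumption about how a wrapper might pass parameters along, not a description of `IterativeDataImputer`'s internals.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Made-up rows that mimic a few columns of the HR dataset.
df = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above",
                  "Bachelor's", "Below Secondary", "Bachelor's"],
    "age": [30, 35, 41, 28, 50, 33],
    "previous_year_rating": [3.0, np.nan, 5.0, 2.0, np.nan, 4.0],
    "avg_training_score": [60, 49, 73, 50, 58, 71],
})

iterative_params = {"estimator": BayesianRidge(), "max_iter": 10, "random_state": 0}

# Only numeric columns take part in the imputation.
numeric_cols = list(df.select_dtypes(include="number").columns)
imputer = IterativeImputer(**iterative_params)

imputed = df.copy()
imputed[numeric_cols] = imputer.fit_transform(df[numeric_cols])
print(imputed.isna().any())
```

As in the output above, the categorical column keeps its missing value because it never entered the imputer.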


### With Enabling Encoding

Now, if we use `enable_encoder=True`, categorical columns can both be imputed and be used in the imputation of other features.

Using the original dataset, we know that both **education** and **previous_year_rating** have missing values. Below, we will not specify the columns to use for imputation (`IterativeDataImputer` will determine this for us).

In [9]:
imputer = IterativeDataImputer(
    df=dataset,
    col_impute=None,
    enable_encoder=True,
    iterative_params={
        'estimator': RandomForestRegressor(),
        'missing_values': np.nan,
        'sample_posterior': False,
        'max_iter': 10,
        'tol': 1e-3,
        'n_nearest_features': None,
        'initial_strategy': 'mean',
        'imputation_order': 'ascending',
        'skip_complete': False,
        'min_value': -np.inf,
        'max_value': np.inf,
        'random_state': None}
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df

No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']


If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass for imputation.
  self._fit()


[IterativeImputer] Completing matrix with shape (10000, 14)
[IterativeImputer] Change: 1.704667046565469, scaled tolerance: 78.297 
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (10000, 14)


Note that encoded columns are not guaranteed to reverse transform if they have imputed values.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass.
  transf_df = self._transform(transf_df)


Unnamed: 0,employee_id,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,65438.0,Sales & Marketing,region_7,2.0,f,sourcing,1.0,35.0,5.0,8.0,1.0,0.0,49.0,0.0
1,65141.0,Operations,region_22,0.0,m,other,1.0,30.0,5.0,4.0,0.0,0.0,60.0,0.0
2,7513.0,Sales & Marketing,region_19,0.0,m,sourcing,1.0,34.0,3.0,7.0,0.0,0.0,50.0,0.0
3,2542.0,Sales & Marketing,region_23,0.0,m,other,2.0,39.0,1.0,10.0,0.0,0.0,50.0,0.0
4,48945.0,Technology,region_26,0.0,m,other,1.0,45.0,3.0,2.0,0.0,0.0,73.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,14934.0,Procurement,region_13,2.0,f,other,1.0,37.0,4.0,7.0,1.0,0.0,71.0,0.0
9996,22040.0,Sales & Marketing,region_33,2.0,m,sourcing,1.0,39.0,3.0,7.0,0.0,0.0,48.0,0.0
9997,14188.0,Finance,region_13,2.0,f,sourcing,1.0,33.0,4.0,4.0,1.0,0.0,58.0,0.0
9998,73566.0,Operations,region_28,2.0,m,other,1.0,32.0,4.0,4.0,1.0,0.0,57.0,1.0


Note that once the **education** column has been encoded and imputed, it can't necessarily be reverse-transformed: the column may now contain imputed values that the encoder can't map back to a category.
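The reverse-transform caveat can be illustrated with a plain-Python sketch. The codes 0.0 and 2.0 match the encoded **education** values in the table above; the code 1 for `'Below Secondary'`, the imputed value 0.73, and the helper `nearest_category` are assumptions made for illustration only.

```python
# Ordinal codes in alphabetical order; the code for 'Below Secondary' is assumed.
categories = ["Bachelor's", "Below Secondary", "Master's & above"]
encode = {cat: float(code) for code, cat in enumerate(categories)}
decode = {code: cat for cat, code in encode.items()}

# 0.73 stands in for a value predicted by a regressor during imputation.
imputed_codes = [2.0, 0.0, 0.73, 1.0]

# A strict reverse lookup has no category for a code the encoder never produced.
strict = [decode.get(code) for code in imputed_codes]
print(strict)

def nearest_category(code):
    """Hypothetical workaround: snap the value to the closest valid code."""
    snapped = min(max(round(code), 0), len(categories) - 1)
    return categories[snapped]

print([nearest_category(code) for code in imputed_codes])
```

Snapping to the nearest code treats the categories as evenly spaced and ordered, which may not be meaningful for every categorical column.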

In [8]:
print(new_df.isna().any())

employee_id             False
department              False
region                  False
education               False
gender                  False
recruitment_channel     False
no_of_trainings         False
age                     False
previous_year_rating    False
length_of_service       False
KPIs_met >80%           False
awards_won?             False
avg_training_score      False
is_promoted             False
dtype: bool


## Handling a DataFrame without column names

Even if the dataset has no header row, we can perform the same operations by referring to columns by index instead of by name. The next few cells demonstrate how to do this.

In [13]:
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv', header=None, skiprows=1)
dataset.drop(columns=[0], inplace=True)
dataset = dataset[:10000]
dataset

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5.0,8,1,0,49,0
1,Operations,region_22,Bachelor's,m,other,1,30,5.0,4,0,0,60,0
2,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3.0,7,0,0,50,0
3,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1.0,10,0,0,50,0
4,Technology,region_26,Bachelor's,m,other,1,45,3.0,2,0,0,73,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,Procurement,region_13,Master's & above,f,other,1,37,4.0,7,1,0,71,0
9996,Sales & Marketing,region_33,Master's & above,m,sourcing,1,39,3.0,7,0,0,48,0
9997,Finance,region_13,Master's & above,f,sourcing,1,33,4.0,4,1,0,58,0
9998,Operations,region_28,Master's & above,m,other,1,32,4.0,4,1,0,57,1


In [14]:
print(dataset.isna().any())
print(dataset.iloc[:, 2].unique())
print(dataset.iloc[:, 7].unique())

1     False
2     False
3      True
4     False
5     False
6     False
7     False
8      True
9     False
10    False
11    False
12    False
13    False
dtype: bool
["Master's & above" "Bachelor's" nan 'Below Secondary']
[ 5.  3.  1.  4. nan  2.]
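Note that the `isna()` output above reports missing values under column labels 3 and 8, while the `iloc` calls use positions 2 and 7. After dropping column 0, the remaining labels and their zero-based positions are off by one. The small pandas sketch below, on made-up data, shows the difference. Judging from the outputs later in this section, `IterativeDataImputer` appears to refer to columns by position, which would explain the use of `col_impute=[7]`.

```python
import pandas as pd

# Made-up frame without a header: default labels are 0, 1, 2.
df = pd.DataFrame([[10, "a", 1.5], [11, "b", None]])
df = df.drop(columns=[0])

labels = list(df.columns)        # labels that remain after the drop
by_position = df.iloc[:, 1]      # second remaining column, selected by position
by_label = df[1]                 # column whose label is 1

print(labels)
print(by_position.name, by_position.isna().any())
print(by_label.tolist())
```

Here position 1 selects the column labeled 2, just as position 7 selects the column labeled 8 in the dataset above.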


### Without Enabling Encoding

In [15]:
imputer = IterativeDataImputer(
    df=dataset,
    col_impute=[7],
    enable_encoder=False,
    iterative_params={
        'estimator': BayesianRidge(),
        'missing_values': np.nan,
        'sample_posterior': False,
        'max_iter': 10,
        'tol': 1e-3,
        'n_nearest_features': None,
        'initial_strategy': 'mean',
        'imputation_order': 'ascending',
        'skip_complete': False,
        'min_value': -np.inf,
        'max_value': np.inf,
        'random_state': None}
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df

If you'd like to include these columns, you need to set 'enable_encoder'=True.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass for imputation.
  self._fit()


[IterativeImputer] Completing matrix with shape (10000, 8)
[IterativeImputer] Change: 1.095270803516288, scaled tolerance: 0.099 
[IterativeImputer] Change: 0.0, scaled tolerance: 0.099 
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (10000, 8)


Unnamed: 0,5,6,7,8,9,10,11,12,0,3,4,2,1
0,1.0,35.0,5.0,8.0,1.0,0.0,49.0,0.0,Sales & Marketing,f,sourcing,Master's & above,region_7
1,1.0,30.0,5.0,4.0,0.0,0.0,60.0,0.0,Operations,m,other,Bachelor's,region_22
2,1.0,34.0,3.0,7.0,0.0,0.0,50.0,0.0,Sales & Marketing,m,sourcing,Bachelor's,region_19
3,2.0,39.0,1.0,10.0,0.0,0.0,50.0,0.0,Sales & Marketing,m,other,Bachelor's,region_23
4,1.0,45.0,3.0,2.0,0.0,0.0,73.0,0.0,Technology,m,other,Bachelor's,region_26
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,1.0,37.0,4.0,7.0,1.0,0.0,71.0,0.0,Procurement,f,other,Master's & above,region_13
9996,1.0,39.0,3.0,7.0,0.0,0.0,48.0,0.0,Sales & Marketing,m,sourcing,Master's & above,region_33
9997,1.0,33.0,4.0,4.0,1.0,0.0,58.0,0.0,Finance,f,sourcing,Master's & above,region_13
9998,1.0,32.0,4.0,4.0,1.0,0.0,57.0,1.0,Operations,m,other,Master's & above,region_28


In [17]:
print(new_df.isna().any())

5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
0     False
3     False
4     False
2      True
1     False
dtype: bool


### With Enabling Encoding

In [18]:
imputer = IterativeDataImputer(
    df=dataset,
    col_impute=None,
    enable_encoder=True,
    iterative_params={
        'estimator': RandomForestRegressor(),
        'missing_values': np.nan,
        'sample_posterior': False,
        'max_iter': 10,
        'tol': 1e-3,
        'n_nearest_features': None,
        'initial_strategy': 'mean',
        'imputation_order': 'ascending',
        'skip_complete': False,
        'min_value': -np.inf,
        'max_value': np.inf,
        'random_state': None}
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df


No columns specified for imputation. These columns have been automatically identified:
['2', '7']
[IterativeImputer] Completing matrix with shape (10000, 13)


If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass for imputation.
  self._fit()


[IterativeImputer] Change: 1.664667046565469, scaled tolerance: 0.099 
[IterativeImputer] Change: 0.7699999999999998, scaled tolerance: 0.099 
[IterativeImputer] Change: 0.6699999999999999, scaled tolerance: 0.099 
[IterativeImputer] Change: 0.5999999999999996, scaled tolerance: 0.099 
[IterativeImputer] Change: 0.6800000000000002, scaled tolerance: 0.099 
[IterativeImputer] Change: 0.5900000000000003, scaled tolerance: 0.099 
[IterativeImputer] Change: 0.6800000000000002, scaled tolerance: 0.099 
[IterativeImputer] Change: 0.6100000000000003, scaled tolerance: 0.099 
[IterativeImputer] Change: 0.5100000000000002, scaled tolerance: 0.099 
[IterativeImputer] Change: 0.5699999999999998, scaled tolerance: 0.099 


Note that encoded columns are not guaranteed to reverse transform if they have imputed values.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass.
  transf_df = self._transform(transf_df)


[IterativeImputer] Completing matrix with shape (10000, 13)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,Sales & Marketing,region_7,2.0,f,sourcing,1.0,35.0,5.0,8.0,1.0,0.0,49.0,0.0
1,Operations,region_22,0.0,m,other,1.0,30.0,5.0,4.0,0.0,0.0,60.0,0.0
2,Sales & Marketing,region_19,0.0,m,sourcing,1.0,34.0,3.0,7.0,0.0,0.0,50.0,0.0
3,Sales & Marketing,region_23,0.0,m,other,2.0,39.0,1.0,10.0,0.0,0.0,50.0,0.0
4,Technology,region_26,0.0,m,other,1.0,45.0,3.0,2.0,0.0,0.0,73.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,Procurement,region_13,2.0,f,other,1.0,37.0,4.0,7.0,1.0,0.0,71.0,0.0
9996,Sales & Marketing,region_33,2.0,m,sourcing,1.0,39.0,3.0,7.0,0.0,0.0,48.0,0.0
9997,Finance,region_13,2.0,f,sourcing,1.0,33.0,4.0,4.0,1.0,0.0,58.0,0.0
9998,Operations,region_28,2.0,m,other,1.0,32.0,4.0,4.0,1.0,0.0,57.0,1.0


In [19]:
print(new_df.isna().any())

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
dtype: bool
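The `scaled tolerance` values in the logs differ between the two halves of this notebook (78.297 versus 0.099) even though `tol` is `1e-3` in every run. scikit-learn documents the stopping condition as `max(abs(X_t - X_{t-1})) / max(abs(X[known_vals])) < tol`, so the tolerance scales with the largest observed absolute value. The sketch below reproduces both numbers under the assumption that the largest `employee_id` is 78297 and that the largest remaining value is 99. These two values are inferred from the logs, not read from the dataset, and `scaled_tolerance` is a hypothetical helper.

```python
import numpy as np

tol = 1e-3

# Made-up rows: an identifier column followed by two ordinary features.
with_id = np.array([
    [65438.0, 49.0, 5.0],
    [78297.0, 99.0, np.nan],
])
without_id = with_id[:, 1:]  # same data with the identifier column dropped

def scaled_tolerance(X, tol):
    """tol multiplied by the largest absolute observed (non-missing) value."""
    observed = ~np.isnan(X)
    return tol * np.max(np.abs(X[observed]))

print(round(scaled_tolerance(with_id, tol), 3))
print(round(scaled_tolerance(without_id, tol), 3))
```

If this reading is right, it would explain why the runs that include the identifier column stop after a single iteration, while the last run above uses all 10 iterations without meeting the criterion.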
