# Demonstration of DataFrame Missing Value Imputer

Even when the input to an `sklearn` transformer is a `pandas` DataFrame, the transformer's out-of-the-box output is a `numpy` array, which loses the column names.  Starting with `sklearn` 1.0, column names are captured at fit time, but they are only stored in an attribute of the transformer object (`feature_names_in_`).  Making use of this attribute requires additional steps from the data scientist.

For this reason, custom transformers are recommended to simplify the work of the data scientist.  This notebook demonstrates the issue and the proposed solution.

**Note: This notebook requires `sklearn` 1.0 or higher**

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.datasets import make_regression
import sklearn
print(f"sklearn version {sklearn.__version__}")

sklearn version 1.0.2


## Create synthetic data for demonstration

In [2]:
N_SAMPLES = 10
N_FEATURES = 3

X, _ = make_regression(n_samples=N_SAMPLES, n_features=N_FEATURES, random_state=123)

# create some missing values
np.random.seed(123)
idx = np.random.choice(range(N_SAMPLES), size=int(0.5 * N_SAMPLES))
X[idx, 0] = np.nan
idx = np.random.choice(range(N_SAMPLES), size=int(0.3 * N_SAMPLES))
X[idx, 1] = np.nan
idx = np.random.choice(range(N_SAMPLES), size=int(0.5 * N_SAMPLES))
X[idx, 2] = np.nan

df = pd.DataFrame(np.concatenate([X], axis=1), columns=['X00', 'X01', 'X02'])
df = df.astype({'X00': np.float32, 'X01': np.float32, 'X02': np.float32})

print(df.dtypes)
df

X00    float32
X01    float32
X02    float32
dtype: object


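|   | X00 | X01 | X02 |
|---|---|---|---|
| 0 | 1.651437 | -0.5786 | NaN |
| 1 | NaN | NaN | NaN |
| 2 | NaN | -0.637752 | -1.253881 |
| 3 | NaN | 0.997345 | -1.085631 |
| 4 | 1.265936 | -0.428913 | -2.426679 |
| 5 | -0.094709 | -0.678886 | -0.86674 |
| 6 | NaN | NaN | -0.434351 |
| 7 | 0.737369 | 0.386186 | 1.004054 |
| 8 | -0.861755 | -0.140069 | -1.428681 |
| 9 | -0.443982 | NaN | NaN |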


## Standard `sklearn` Missing Value Imputation

Standard `sklearn` output is a `numpy` array, which has lost all column name information.

In [3]:
impute = SimpleImputer()
impute.fit_transform(df)

array([[ 1.6514366 , -0.5786002 , -0.92741555],
       [ 0.37571594, -0.15438393, -0.92741555],
       [ 0.37571594, -0.6377515 , -1.2538806 ],
       [ 0.37571594,  0.99734545, -1.0856307 ],
       [ 1.2659363 , -0.42891264, -2.4266791 ],
       [-0.09470897, -0.6788862 , -0.8667404 ],
       [ 0.37571594, -0.15438393, -0.43435127],
       [ 0.7373686 ,  0.3861864 ,  1.004054  ],
       [-0.8617549 , -0.14006872, -1.4286807 ],
       [-0.44398195, -0.15438393, -0.92741555]], dtype=float32)

Column name information is retained only in the transformer's `feature_names_in_` attribute.

In [4]:
impute.feature_names_in_

array(['X00', 'X01', 'X02'], dtype=object)

If the data scientist requires column names for later analysis, they have to write something like this to recreate a `pandas` DataFrame with column names.

In [5]:
arr1 = impute.fit_transform(df)

df2 = pd.DataFrame(arr1, columns=impute.feature_names_in_.tolist())
df2

|   | X00 | X01 | X02 |
|---|---|---|---|
| 0 | 1.651437 | -0.5786 | -0.927416 |
| 1 | 0.375716 | -0.154384 | -0.927416 |
| 2 | 0.375716 | -0.637752 | -1.253881 |
| 3 | 0.375716 | 0.997345 | -1.085631 |
| 4 | 1.265936 | -0.428913 | -2.426679 |
| 5 | -0.094709 | -0.678886 | -0.86674 |
| 6 | 0.375716 | -0.154384 | -0.434351 |
| 7 | 0.737369 | 0.386186 | 1.004054 |
| 8 | -0.861755 | -0.140069 | -1.428681 |
| 9 | -0.443982 | -0.154384 | -0.927416 |


If indicator variables are needed to identify the missing value locations, then this code will provide that information.

In [6]:
impute2 = SimpleImputer(add_indicator=True)
impute2.fit_transform(df)

array([[ 1.6514366 , -0.5786002 , -0.92741555,  0.        ,  0.        ,
         1.        ],
       [ 0.37571594, -0.15438393, -0.92741555,  1.        ,  1.        ,
         1.        ],
       [ 0.37571594, -0.6377515 , -1.2538806 ,  1.        ,  0.        ,
         0.        ],
       [ 0.37571594,  0.99734545, -1.0856307 ,  1.        ,  0.        ,
         0.        ],
       [ 1.2659363 , -0.42891264, -2.4266791 ,  0.        ,  0.        ,
         0.        ],
       [-0.09470897, -0.6788862 , -0.8667404 ,  0.        ,  0.        ,
         0.        ],
       [ 0.37571594, -0.15438393, -0.43435127,  1.        ,  1.        ,
         0.        ],
       [ 0.7373686 ,  0.3861864 ,  1.004054  ,  0.        ,  0.        ,
         0.        ],
       [-0.8617549 , -0.14006872, -1.4286807 ,  0.        ,  0.        ,
         0.        ],
       [-0.44398195, -0.15438393, -0.92741555,  0.        ,  1.        ,
         1.        ]], dtype=float32)

To create the DataFrame with column names, the data scientist has to code this.

In [7]:
arr2 = impute2.fit_transform(df)

df3 = pd.DataFrame(arr2, columns=impute2.feature_names_in_.tolist() 
                   + [f"j{c}" for c in impute2.feature_names_in_.tolist()])
df3

|   | X00 | X01 | X02 | jX00 | jX01 | jX02 |
|---|---|---|---|---|---|---|
| 0 | 1.651437 | -0.5786 | -0.927416 | 0.0 | 0.0 | 1.0 |
| 1 | 0.375716 | -0.154384 | -0.927416 | 1.0 | 1.0 | 1.0 |
| 2 | 0.375716 | -0.637752 | -1.253881 | 1.0 | 0.0 | 0.0 |
| 3 | 0.375716 | 0.997345 | -1.085631 | 1.0 | 0.0 | 0.0 |
| 4 | 1.265936 | -0.428913 | -2.426679 | 0.0 | 0.0 | 0.0 |
| 5 | -0.094709 | -0.678886 | -0.86674 | 0.0 | 0.0 | 0.0 |
| 6 | 0.375716 | -0.154384 | -0.434351 | 1.0 | 1.0 | 0.0 |
| 7 | 0.737369 | 0.386186 | 1.004054 | 0.0 | 0.0 | 0.0 |
| 8 | -0.861755 | -0.140069 | -1.428681 | 0.0 | 0.0 | 0.0 |
| 9 | -0.443982 | -0.154384 | -0.927416 | 0.0 | 1.0 | 1.0 |


## Custom DataFrame Missing Value Imputer

Here is an example of a custom transformer that outputs a `pandas` DataFrame.

This custom transformer is a subclass of the `sklearn.impute.SimpleImputer` class.  Only the `transform()` method is overridden to output a `pandas` DataFrame instead of the normal `numpy` array.  All other methods of `sklearn.impute.SimpleImputer` are reused.

In [8]:
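class DFSimpleImputer(SimpleImputer):
    # No __init__ override: sklearn's get_params()/clone() read the explicitly
    # named constructor arguments, which a bare **kwargs signature would hide.

    def transform(self, X):
        """Impute like SimpleImputer, but return a pandas DataFrame with column names."""
        columns = self.feature_names_in_.tolist()
        if self.indicator_ is not None:
            # indicator_.features_ holds the indices of the columns that get an
            # indicator (by default, only those with missing values during fit)
            columns += [f'j{columns[i]}' for i in self.indicator_.features_]
        return pd.DataFrame(super().transform(X), columns=columns)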


In [9]:
impute_df = DFSimpleImputer()
impute_df.fit_transform(df)

|   | X00 | X01 | X02 |
|---|---|---|---|
| 0 | 1.651437 | -0.5786 | -0.927416 |
| 1 | 0.375716 | -0.154384 | -0.927416 |
| 2 | 0.375716 | -0.637752 | -1.253881 |
| 3 | 0.375716 | 0.997345 | -1.085631 |
| 4 | 1.265936 | -0.428913 | -2.426679 |
| 5 | -0.094709 | -0.678886 | -0.86674 |
| 6 | 0.375716 | -0.154384 | -0.434351 |
| 7 | 0.737369 | 0.386186 | 1.004054 |
| 8 | -0.861755 | -0.140069 | -1.428681 |
| 9 | -0.443982 | -0.154384 | -0.927416 |


If indicator variables are required, then the call looks like this.

In [10]:
impute_df2 = DFSimpleImputer(add_indicator=True)
impute_df2.fit_transform(df)

|   | X00 | X01 | X02 | jX00 | jX01 | jX02 |
|---|---|---|---|---|---|---|
| 0 | 1.651437 | -0.5786 | -0.927416 | 0.0 | 0.0 | 1.0 |
| 1 | 0.375716 | -0.154384 | -0.927416 | 1.0 | 1.0 | 1.0 |
| 2 | 0.375716 | -0.637752 | -1.253881 | 1.0 | 0.0 | 0.0 |
| 3 | 0.375716 | 0.997345 | -1.085631 | 1.0 | 0.0 | 0.0 |
| 4 | 1.265936 | -0.428913 | -2.426679 | 0.0 | 0.0 | 0.0 |
| 5 | -0.094709 | -0.678886 | -0.86674 | 0.0 | 0.0 | 0.0 |
| 6 | 0.375716 | -0.154384 | -0.434351 | 1.0 | 1.0 | 0.0 |
| 7 | 0.737369 | 0.386186 | 1.004054 | 0.0 | 0.0 | 0.0 |
| 8 | -0.861755 | -0.140069 | -1.428681 | 0.0 | 0.0 | 0.0 |
| 9 | -0.443982 | -0.154384 | -0.927416 | 0.0 | 1.0 | 1.0 |


By using the custom class, data scientists do not have to remember to write additional code such as this:
```
df3 = pd.DataFrame(arr2, columns=impute2.feature_names_in_.tolist() 
                   + [f"j{c}" for c in impute2.feature_names_in_.tolist()])
```

While the above code is simple, it is prone to inconsistent use, e.g., different names for the indicator variables.

By using a custom class, the problem is solved once: all data scientists can take advantage of the solution, and programming conventions are applied consistently.