# Unbiasing Imputer #

This notebook is about handling missing data in a way that does not increase bias (gender bias, racial bias, etc.) and may even reduce it.



## Problem statement ##
The most common way to handle missing data is to drop the records that contain it. The second most common way is to replace each missing value with the most likely value: the most frequent value for categorical features and the mean for numerical features. `scikit-learn` has a class available for this: [SimpleImputer](http://scikit-learn.org/dev/modules/generated/sklearn.impute.SimpleImputer.html). The problem with this approach is that, although it preserves the mean, it reduces the standard deviation, sometimes very significantly. To demonstrate this, let's take a simple array, replace half of its values with the mean, and see what happens to the standard deviation:

In [2]:
import numpy as np
from scipy.stats import norm
original_data = norm.rvs(loc=1.0, scale=0.5, size=1000, random_state=1386)
original_data[:20]

array([1.53547966, 0.99260019, 0.88633099, 1.21320929, 1.03287069,
       1.34151072, 0.98476757, 1.17019719, 1.10089714, 0.48023982,
       1.49781353, 1.21862054, 1.91732282, 0.55931941, 0.53091708,
       1.3266663 , 0.94301855, 1.1107632 , 0.42426201, 1.39311814])

In [3]:
#Now replace every other element with the mean 1.0
missing_elements = np.asarray([0,1]*500)
updated_data = original_data * (1-missing_elements) + missing_elements
updated_data[:20]

array([1.53547966, 1.        , 0.88633099, 1.        , 1.03287069,
       1.        , 0.98476757, 1.        , 1.10089714, 1.        ,
       1.49781353, 1.        , 1.91732282, 1.        , 0.53091708,
       1.        , 0.94301855, 1.        , 0.42426201, 1.        ])

In [4]:
#Now, let's get mean and std of the new distribution:
mean, std = norm.fit(updated_data)
print(f'Mean: {mean}, std: {std}')

Mean: 1.0117580053066189, std: 0.33428315977079176


As you can see, the mean is essentially unchanged, but the standard deviation is much smaller (about 0.33 instead of the 0.5 used to generate the data). While imputing data this way may increase the performance of the model, it also amplifies the bias that already exists in the data. To prevent this amplification, we have to replace the missing values with samples drawn from a normal distribution with the same mean and standard deviation as the observed data. For categorical features, the counterpart is a multinomial distribution.
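
To quantify the shrinkage, here is a short derivation under simplifying assumptions introduced for illustration only: a fraction $p$ of the values is missing completely at random, and the observed data have mean $\mu$ and standard deviation $\sigma$. Replacing every missing value with $\mu$ gives

$$
\operatorname{Var}(X_{\text{mean}}) = (1-p)\,\sigma^2 + p \cdot 0 = (1-p)\,\sigma^2,
\qquad
\operatorname{std}(X_{\text{mean}}) = \sigma\sqrt{1-p}.
$$

With $p = 0.5$ and $\sigma = 0.5$ this predicts a standard deviation of about $0.354$, in the same range as the $0.334$ measured above; the sample standard deviation of the retained values need not equal the nominal $0.5$. Replacing each missing value with an independent draw from $\mathcal{N}(\mu, \sigma^2)$ instead gives

$$
\operatorname{Var}(X_{\text{sampled}}) = (1-p)\,\sigma^2 + p\,\sigma^2 = \sigma^2,
$$

so the spread is preserved in expectation.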

For debiasing, we can go one step further and increase the standard deviation of the distribution from which we sample values for numerical features, and apply a similar transformation to the multinomial distribution for categorical features.
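
The effect of widening the sampling distribution can be sketched with plain NumPy. This is a minimal illustration on synthetic data, not the notebook's imputer: the helper `impute`, the scaling factor `k`, the seed, and the sample size are all assumptions made for this example. If a fraction `p` of the values is imputed from a normal distribution whose standard deviation is scaled by `k`, the resulting standard deviation should be close to `sigma * sqrt(1 + p * (k**2 - 1))`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

# Synthetic data: 10,000 draws from N(1.0, 0.5^2), roughly half marked missing.
data = rng.normal(loc=1.0, scale=0.5, size=10_000)
missing = rng.random(data.size) < 0.5
observed = data[~missing]

# Statistics are estimated from the observed values only.
mu, sigma = observed.mean(), observed.std()


def impute(k):
    """Fill missing slots with draws from N(mu, (k * sigma)^2). Hypothetical helper."""
    filled = data.copy()
    filled[missing] = rng.normal(loc=mu, scale=k * sigma, size=missing.sum())
    return filled


mean_imputed = np.where(missing, mu, data)  # plain mean imputation
sampled = impute(k=1.0)                     # keeps the spread roughly unchanged
inflated = impute(k=1.5)                    # deliberately widens the spread

for name, arr in [("mean", mean_imputed), ("sampled", sampled), ("inflated", inflated)]:
    print(f"{name:>8}: mean={arr.mean():.3f}, std={arr.std():.3f}")
```

All three variants keep the mean close to the observed mean; only the spread differs, which is the quantity the scaling factor is meant to control.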

In this notebook I suggest two classes, one for numerical and one for categorical features.

In [5]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy.ma as ma
from sklearn.utils.validation import check_is_fitted
class NumericalUnbiasingImputer(BaseEstimator, TransformerMixin):
    """Un-biasing imputation transformer for completing missing values.

    Parameters
    ----------
    std_scaling_factor : float, default=1
        The fitted per-column std is multiplied by this factor before
        sampling. Values above 1 widen the sampling distribution,
        values below 1 narrow it.
    random_state : int, default=7294
        Seed passed to ``norm.rvs`` when sampling replacement values.
    """
    def __init__(self, std_scaling_factor=1, random_state=7294):
        self.std_scaling_factor = std_scaling_factor
        self.random_state = random_state

        
    def fit(self, X: np.ndarray, y=None):
        """Fit the imputer on X.

        Parameters
        ----------
        X : np.ndarray of shape (n_samples, n_features)
            Input data, where ``n_samples`` is the number of samples and
            ``n_features`` is the number of features. Missing values
            are encoded as NaN.

        Returns
        -------
        self : NumericalUnbiasingImputer
        """
        # Mask the NaNs so that they are ignored by the column statistics.
        mask = np.isnan(X)
        masked_X = ma.masked_array(X, mask=mask)

        mean_masked = np.ma.mean(masked_X, axis=0)
        std_masked = np.ma.std(masked_X, axis=0)
        mean = np.ma.getdata(mean_masked)
        std = np.ma.getdata(std_masked)
        # Columns with no observed values get NaN statistics.
        mean[np.ma.getmask(mean_masked)] = np.nan
        std[np.ma.getmask(std_masked)] = np.nan
        self.mean_ = mean
        self.std_ = std * self.std_scaling_factor

        return self
    
     
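    def transform(self, X):
        """Impute all missing values in X.

        Parameters
        ----------
        X : np.ndarray of shape (n_samples, n_features)
            The input data to complete.

        Returns
        -------
        Xnew : np.ndarray of shape (n_samples, n_features)
            Copy of X with each NaN replaced by a draw from the normal
            distribution fitted to its column.
        """
        check_is_fitted(self, ['mean_', 'std_'])

        mask = np.isnan(X)

        def transform_single(index):
            col = X[:, index].copy()
            mask_col = mask[:, index]
            # Draw a full column of candidates, then keep only the draws
            # that land on missing positions. The same seed is reused for
            # every column, so imputed values are correlated across columns
            # and identical across repeated calls.
            sample = np.asarray(norm.rvs(loc=self.mean_[index], scale=self.std_[index],
                                         size=col.shape[0], random_state=self.random_state))
            col[mask_col] = sample[mask_col]
            return col

        Xnew = np.vstack([transform_single(index) for index in range(X.shape[1])]).T

        return Xnew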
    


In [6]:
imputer = NumericalUnbiasingImputer()
missing_indicator = missing_elements.copy().astype(np.float16)
missing_indicator[missing_indicator == 1] = np.nan
data_with_missing_values = original_data + missing_indicator
data_with_missing_values = np.vstack([data_with_missing_values, original_data*5]).T
imputer.fit(data_with_missing_values)
transformed = imputer.transform(data_with_missing_values)
print(transformed[:20,:])
transformed.shape


[[ 1.53547966  7.6773983 ]
 [ 1.28446105  4.96300096]
 [ 0.88633099  4.43165496]
 [ 1.26161414  6.06604643]
 [ 1.03287069  5.16435346]
 [ 1.54999452  6.70755359]
 [ 0.98476757  4.92383783]
 [ 1.78611804  5.85098593]
 [ 1.10089714  5.50448571]
 [ 1.01113317  2.4011991 ]
 [ 1.49781353  7.48906767]
 [ 1.2508774   6.09310269]
 [ 1.91732282  9.58661411]
 [ 1.13145139  2.79659703]
 [ 0.53091708  2.65458541]
 [ 1.81200067  6.6333315 ]
 [ 0.94301855  4.71509276]
 [-0.0711673   5.55381598]
 [ 0.42426201  2.12131004]
 [ 1.94678859  6.96559068]]


(1000, 2)

In [7]:
#Let's see how it is different from the original array:
new_mean, new_std = norm.fit(transformed[:,0])
print(f'Mean: {new_mean}, Std: {new_std}')

Mean: 1.0197348250784546, Std: 0.4659586665233841


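Some of the difference from the standard deviation of 0.5 used to generate the data can be explained by the fact that the imputer was fitted on incomplete data: the mean and standard deviation were estimated from only the 500 observed values.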
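
One way to check this explanation is to compare the standard deviation of the full array with that of the half that was kept. The sketch below regenerates the data with the same call as above; the names `original`, `observed`, and `hidden` are introduced only for this example.

```python
import numpy as np
from scipy.stats import norm

# Same generating call as in the notebook.
original = norm.rvs(loc=1.0, scale=0.5, size=1000, random_state=1386)

observed = original[::2]   # even indices were kept
hidden = original[1::2]    # odd indices were marked as missing

print(f"std of all 1000 values:     {original.std():.4f}")
print(f"std of 500 observed values: {observed.std():.4f}")
print(f"std of 500 hidden values:   {hidden.std():.4f}")
```

If the observed half has a lower sample standard deviation than the nominal 0.5, the imputer samples from a slightly narrower distribution, and the completed column inherits that.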

Now we need to do the same for the categorical features.
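
As a rough idea of what sampling-based imputation means for a categorical column, the sketch below fills missing entries with draws from the observed category frequencies. It is a simplified illustration, not the notebook's categorical class: the helper `sample_impute_categorical` is hypothetical, missing values are assumed to be encoded as `None`, and no debiasing transformation of the probabilities is applied.

```python
import numpy as np


def sample_impute_categorical(values, rng):
    """Replace None entries with draws from the observed category frequencies.

    Hypothetical helper for illustration only.
    """
    values = np.asarray(values, dtype=object)
    missing = np.array([v is None for v in values])

    # Empirical distribution of the observed categories.
    categories, counts = np.unique(values[~missing].astype(str), return_counts=True)
    probs = counts / counts.sum()

    filled = values.copy()
    filled[missing] = rng.choice(categories, size=missing.sum(), p=probs)
    return filled


rng = np.random.default_rng(0)  # arbitrary seed for this sketch
colors = ["red", "blue", None, "red", None, "green", "red", None]
filled = sample_impute_categorical(colors, rng)
print(filled)
```

Unlike most-frequent imputation, which would turn every missing entry into `"red"`, this keeps the minority categories represented among the imputed values.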