# <p style="color:Blue;">Missing Data Imputation</p>

## Numerical Variables
#### Mean / Median Imputation: If the variable is normally distributed, the mean and median are approximately the same; if the variable is skewed, the median is a better representation. When to use: the data is missing completely at random, and no more than 5% of the variable contains missing data.
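
A minimal pandas sketch of median imputation, assuming a hypothetical skewed column named `income` and toy train/test frames that are not part of the original notes. The statistic is learned on the training set only and then reused for the test set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a right-skewed variable with missing values
train = pd.DataFrame({"income": [20.0, 22.0, 25.0, np.nan, 30.0, 200.0]})
test = pd.DataFrame({"income": [np.nan, 28.0]})

# Fraction of missing values, to compare against the 5% rule of thumb
missing_fraction = train["income"].isnull().mean()

# The mean is pulled up by the outlier, the median is not
mean_value = train["income"].mean()
median_value = train["income"].median()

# Learn the value on the training set and apply it to both sets
train["income_imputed"] = train["income"].fillna(median_value)
test["income_imputed"] = test["income"].fillna(median_value)
```

On this toy data the mean is far above the median, which illustrates why the median is usually preferred for skewed variables.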
#### Arbitrary value imputation: Arbitrary value imputation consists of replacing all occurrences of missing values (NA) within a variable with an arbitrary value. Typically used arbitrary values are 0, 999, -999 (or other combinations of 9s) or -1 (if the distribution is positive). Be careful not to choose an arbitrary value too similar to the mean or median (or any other common value of the variable distribution).
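
As an illustration, the sketch below fills a hypothetical positive variable `age` with -1, once with pandas and once with scikit-learn's `SimpleImputer(strategy="constant")`. The column name and data are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: a strictly positive variable with two NAs
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0, np.nan]})

# pandas: replace every NA with the arbitrary value -1
df["age_arbitrary"] = df["age"].fillna(-1)

# scikit-learn: the same idea with a constant-strategy imputer
imputer = SimpleImputer(strategy="constant", fill_value=-1)
age_imputed = imputer.fit_transform(df[["age"]])  # returns a numpy array
```

Because -1 lies outside the observed range of `age`, the imputed rows stay distinguishable from real observations.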
#### End of tail imputation: End of tail imputation is equivalent to arbitrary value imputation, but the arbitrary values are selected automatically at the end of the variable distribution. If the variable is normally distributed, we can use the mean plus or minus 3 times the standard deviation. If the variable is skewed, we can use the IQR proximity rule. Suitable for numerical variables.
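
The two rules can be sketched as follows. The columns `height` and `income` are hypothetical, and the IQR factor of 1.5 is an assumption (a factor of 3 is also commonly used for more extreme values).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one roughly symmetric and one skewed variable
df = pd.DataFrame({
    "height": [160.0, 170.0, np.nan, 180.0, 175.0, 165.0],
    "income": [20.0, 22.0, 25.0, np.nan, 30.0, 200.0],
})

# Approximately normal variable: right tail at mean + 3 * std
gaussian_tail = df["height"].mean() + 3 * df["height"].std()

# Skewed variable: IQR proximity rule, upper bound = Q3 + 1.5 * IQR
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr_tail = q3 + 1.5 * (q3 - q1)

# Fill the NAs with the value found at the end of each distribution
df["height_imputed"] = df["height"].fillna(gaussian_tail)
df["income_imputed"] = df["income"].fillna(iqr_tail)
```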

## Categorical Variables
#### Frequent category imputation: Mode imputation consists of replacing all occurrences of missing values (NA) within a variable by the mode, or the most frequent value.
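
A minimal sketch of mode imputation with pandas, using a hypothetical toy version of the `BsmtQual` column. `Series.mode()` ignores NAs by default and may return several values when there is a tie, so the first one is taken here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for a categorical variable
train = pd.DataFrame({"BsmtQual": ["TA", "Gd", "TA", np.nan, "Ex", "TA"]})

# Most frequent category, learned on the training set
mode_value = train["BsmtQual"].mode()[0]

# Replace every NA with the most frequent category
train["BsmtQual_imputed"] = train["BsmtQual"].fillna(mode_value)
```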
#### Adding a “missing” category: This method consists of treating missing data as an additional label or category of the variable. This is the most widely used method of missing data imputation for categorical variables.
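
In pandas this can be sketched with a single `fillna` call. The label `"Missing"` and the toy `FireplaceQu` data are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for a categorical variable with NAs
df = pd.DataFrame({"FireplaceQu": ["Gd", np.nan, "TA", np.nan, "Gd"]})

# Treat the NAs as their own category
df["FireplaceQu_imputed"] = df["FireplaceQu"].fillna("Missing")

# The new label now shows up alongside the original categories
counts = df["FireplaceQu_imputed"].value_counts()
```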

## Both
#### Random sample imputation: Random sample imputation consists of taking a random observation from the pool of available observations of the variable and using that randomly extracted value to fill the NA.

Extract 3 random observations from the column `abc` of `df`. Note that we set random_state
to ensure the reproducibility of the example: <br>
df["abc"].sample(n=3, random_state=1)
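
Building on the `sample` call above, a full imputation could look like the sketch below. The toy data is hypothetical. The key step is re-indexing the sampled values so that pandas aligns them with the rows that are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values
df = pd.DataFrame({"abc": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0]})

missing_mask = df["abc"].isnull()
n_missing = int(missing_mask.sum())

# Draw as many observed values as there are NAs
random_sample = df["abc"].dropna().sample(n=n_missing, random_state=1)

# Give the sampled values the index of the missing rows so they align
random_sample.index = df.index[missing_mask]

# Fill only the missing rows with the sampled values
df["abc_imputed"] = df["abc"].copy()
df.loc[missing_mask, "abc_imputed"] = random_sample
```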

#### Complete Case Analysis
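
Complete case analysis is commonly understood as keeping only the observations that have values for all the variables of interest. A minimal sketch with `dropna`, on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with NAs in different rows
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "MSZoning": ["RL", "RM", np.nan, "RL"],
})

# Keep only the rows where every selected variable is observed
complete_cases = df.dropna(subset=["LotFrontage", "MSZoning"])

# Check how much data survives before committing to this approach
fraction_kept = len(complete_cases) / len(df)
```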
#### Adding a “Missing” indicator: A Missing Indicator is an additional binary variable, which indicates whether the data was missing for an observation (1) or not (0).

X_train['ABC_MI'] = np.where(X_train['abc'].isnull(), 1, 0)


### When to use a missing indicator:
Typically, mean, median and mode imputation are done together with adding a binary "missing indicator" variable to capture those observations where the data was missing (see lecture "Missing Indicator"), thus covering 2 angles:
if the data was missing completely at random, this would be captured by the mean, median or mode imputation, and if it wasn't, this would be captured by the additional "missing indicator" variable.
Both methods are extremely straightforward to implement, and therefore are a top choice in data science competitions.
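
The combination described above can be sketched in pandas as follows, assuming a hypothetical column `abc`. The indicator has to be created before the NAs are overwritten.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values
X_train = pd.DataFrame({"abc": [1.0, np.nan, 3.0, 4.0, np.nan, 100.0]})

# 1) Flag the missing observations first (1 = missing, 0 = observed)
X_train["abc_NA"] = X_train["abc"].isnull().astype(int)

# 2) Then impute with the median learned on the training set
median_value = X_train["abc"].median()
X_train["abc"] = X_train["abc"].fillna(median_value)
```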

### SimpleImputer

In [None]:
# Imputation transformer for completing missing values:
import numpy as np
from sklearn.impute import SimpleImputer
SimpleImputer(missing_values=np.nan, strategy='mean', fill_value=None, copy=True, add_indicator=False)

In [None]:
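imputer = SimpleImputer(strategy='median')
imputer.fit(X_train[cols_to_use])
# NOTE: transform returns a numpy array, so we rebuild the DataFrame;
# we transform the same columns the imputer was fitted on and keep the original index
X_train_imputed = imputer.transform(X_train[cols_to_use])
X_train = pd.DataFrame(X_train_imputed, columns=cols_to_use, index=X_train.index)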
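
A self-contained version of the cell above, on hypothetical toy data, might look like this. It also inspects `statistics_`, the attribute where a fitted `SimpleImputer` stores the learned values, and reuses the fitted imputer on a test set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy train and test sets
X_train = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "MasVnrArea": [0.0, 100.0, np.nan, 50.0],
})
X_test = pd.DataFrame({
    "LotFrontage": [np.nan, 90.0],
    "MasVnrArea": [20.0, np.nan],
})
cols_to_use = ["LotFrontage", "MasVnrArea"]

# Learn the medians on the training set only
imputer = SimpleImputer(strategy="median")
imputer.fit(X_train[cols_to_use])
learned_medians = dict(zip(cols_to_use, imputer.statistics_))

# Apply the training medians to the test set and rebuild the DataFrame
X_test_imputed = pd.DataFrame(
    imputer.transform(X_test[cols_to_use]),
    columns=cols_to_use,
    index=X_test.index,
)
```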

### ColumnTransformer

Applies transformers to columns of an array or pandas DataFrame.

This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

In [None]:
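from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# first we need to make lists, indicating which features
# will be imputed with each method

features_numeric = ['BsmtUnfSF', 'LotFrontage', 'MasVnrArea']
features_categoric = ['BsmtQual', 'FireplaceQu', 'MSZoning',
                      'Street', 'Alley']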

# then we instantiate the imputers, within a pipeline
# we create one mean imputer and one frequent category imputer
# by changing the parameter in the strategy

numeric_imputer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
])

categoric_imputer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
])

# then we put the features list and the transformers together
# using the column transformer

preprocessor = ColumnTransformer(transformers=[
    ('numeric_imputer', numeric_imputer, features_numeric),
    ('categoric_imputer', categoric_imputer, features_categoric)
])

# now we fit the preprocessor
preprocessor.fit(X_train)

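# and now we can impute the data
# NOTE: the output is a numpy array containing only the listed features
# (the default remainder='drop' discards every other column)
X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)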
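
To make the behaviour of the output easier to see, here is a small self-contained sketch on hypothetical toy data. The column `Untouched` is made up to show that columns not listed in any transformer are dropped by default. The imputers are passed directly rather than wrapped in pipelines, which is an equivalent shortcut for single-step transformers.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one numeric, one categorical, one unlisted column
X_train = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "MSZoning": pd.Series(["RL", np.nan, "RL"], dtype=object),
    "Untouched": [1, 2, 3],
})
features_numeric = ["LotFrontage"]
features_categoric = ["MSZoning"]

preprocessor = ColumnTransformer(
    transformers=[
        ("numeric_imputer", SimpleImputer(strategy="mean"), features_numeric),
        ("categoric_imputer", SimpleImputer(strategy="most_frequent"), features_categoric),
    ],
    remainder="drop",  # the default: unlisted columns are discarded
)

# The result is a numpy array; columns follow the order of the transformers
imputed = preprocessor.fit_transform(X_train)
X_train_imputed = pd.DataFrame(
    imputed,
    columns=features_numeric + features_categoric,
    index=X_train.index,
)
```

Setting `remainder="passthrough"` instead would keep the unlisted columns, appended after the transformed ones.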

### Add a Missing Indicator

In [None]:
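from sklearn.impute import MissingIndicator

indicator = MissingIndicator(error_on_new=True, features='missing-only')
indicator.fit(X_train)

# transform returns a boolean numpy array with one column per feature
# that had missing values during fit
tmp = indicator.transform(X_train)

# so we need to join it manually to the original X_train

# let's create a column name for each of the new MissingIndicators
indicator_cols = [c+'_NA' for c in X_train.columns[indicator.features_]]

# and now we concatenate; drop=True keeps the old index from
# being added as an extra column
X_train = pd.concat([
    X_train.reset_index(drop=True),
    pd.DataFrame(tmp, columns=indicator_cols)],
    axis=1)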
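
As an alternative to the manual join, `SimpleImputer` accepts `add_indicator=True`, which imputes and appends the indicator columns in one step. The sketch below uses hypothetical toy data and column names. The fitted `MissingIndicator` is exposed as `indicator_`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: columns "a" and "c" contain NAs, "b" does not
X_train = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [np.nan, 8.0, 10.0],
})

# Impute with the median and append the indicator columns in one step
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputed = imputer.fit_transform(X_train)

# Indicators are created only for features that had NAs during fit
indicator_cols = [c + "_NA" for c in X_train.columns[imputer.indicator_.features_]]

X_train_imputed = pd.DataFrame(
    imputed,
    columns=list(X_train.columns) + indicator_cols,
    index=X_train.index,
)
```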