Most machine learning libraries raise an error if you try to build a model using data with missing values. Here we go through three strategies for dealing with missing values.

# Strategies

### Drop Columns with Missing Values
The simplest option is to eliminate any column with a missing value. While it quickly allows the models to run without error, we lose access to a lot of potentially useful information.
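
As a minimal sketch of this strategy, the toy example below counts the missing values per column and then drops every column that has at least one. The DataFrame and its column names (`rooms`, `area`, `age`) are hypothetical and are not part of the dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
toy = pd.DataFrame({
    "rooms": [3, 4, 2, 5],
    "area": [70.0, np.nan, 55.0, 120.0],
    "age": [10.0, 25.0, np.nan, 3.0],
})

# Count missing values per column before deciding what to drop.
missing_counts = toy.isnull().sum()
print(missing_counts)

# Keep only the columns that have no missing values.
cols_with_missing = [col for col in toy.columns if toy[col].isnull().any()]
reduced = toy.drop(columns=cols_with_missing)
print(reduced.columns.tolist())
```

Here two of the three columns are dropped because of a single missing entry each, which illustrates how much information this strategy can discard.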

### Imputation
Fill in the missing information with some number. For example, we can use the mean value along each column.
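
A small sketch of mean imputation, assuming scikit-learn's `SimpleImputer` with `strategy="mean"`. The toy DataFrame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "area": [70.0, np.nan, 50.0, 120.0],
    "age": [10.0, 20.0, np.nan, 3.0],
})

# Learn each column's mean from the non-missing values, then fill the gaps.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,  # fit_transform returns a bare array, so restore names
    index=toy.index,
)
print(filled)
```

The missing `area` entry becomes the mean of the other three `area` values, and likewise for `age`.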

### An Extension to Imputation
Imputation is the standard approach and it usually works well. However, it is often useful to know which values were imputed, so we can add a boolean (true or false) column denoting whether each value was imputed or not.
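
The idea can be sketched on toy data as follows. The DataFrame, its column names, and the `_was_missing` suffix are illustrative assumptions. In this sketch only the original value columns are passed to the imputer, so the indicator columns stay boolean.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; only "area" has a missing value.
toy = pd.DataFrame({
    "area": [70.0, np.nan, 50.0, 120.0],
    "rooms": [3.0, 4.0, 2.0, 5.0],
})

# Record which entries are missing before they are filled in.
flagged = toy.copy()
for col in toy.columns:
    if toy[col].isnull().any():
        flagged[col + "_was_missing"] = toy[col].isnull()

# Impute the original value columns with their column means.
value_cols = list(toy.columns)
flagged[value_cols] = SimpleImputer(strategy="mean").fit_transform(toy[value_cols])
print(flagged)
```

`SimpleImputer` also accepts `add_indicator=True`, which appends similar indicator columns to its output array. The manual loop is used here because it keeps the column names explicit.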

# The Code

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('./ressources/RawDataRef_2011.csv')

# Select target and features
y = data['Home Prices']
features = ['Mid-Century Highrise Households', 'Percent Mid-Century Highrise Households', 'Social Housing Units', 'Social Housing Waiting List']
X = data[features]


# Divide data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

In [9]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

In [10]:
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop columns with missing values):
136441.83214285717


In [11]:
from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

MAE from Approach 2 (Imputation):
136441.83214285717


In [12]:
# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))

MAE from Approach 3 (An Extension to Imputation):
136441.83214285717


#### Note
Unfortunately, I'm not using the best data set here: all three approaches give the same MAE, so the results don't demonstrate the usual case in which imputation is better than dropping columns.
**However**, while imputation normally performs better, it does not always. Sometimes factors such as noise in the dataset mean that dropping columns, or setting the missing values to zero, produces a more accurate model.
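
For completeness, here is a minimal sketch of the "set the values to zero" alternative, assuming `SimpleImputer` with `strategy="constant"` and `fill_value=0`. The toy DataFrame is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "area": [70.0, np.nan, 50.0, 120.0],
    "age": [10.0, 20.0, np.nan, 3.0],
})

# Replace every missing value with the constant 0 instead of a column statistic.
zero_imputer = SimpleImputer(strategy="constant", fill_value=0)
zero_filled = pd.DataFrame(
    zero_imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)
print(zero_filled)
```

The resulting training and validation frames could be passed to `score_dataset` in the same way as the other approaches to compare the MAE.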