# Missing Values

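An entry in a column may be missing, for example because the value was never recorded. Many models cannot be trained on data that contains missing values, so these entries have to be handled before fitting.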

## Solutions
- **1: Drop the columns with missing values:** This is rarely a good choice unless most of the values in the column are missing, because the model loses all the information held in the dropped columns.
- **2: Imputation:** Fill the missing values with some value (e.g. the mean value of the column).
- **3: Extension to imputation:** Imputed values are usually not exactly right, so for each column with imputed values we can add a new column that says (TRUE or FALSE) whether the corresponding value in that row was imputed.
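
The three options can be sketched on a tiny, made-up table before applying them to real data. The `toy` DataFrame and its values are invented for illustration and are not part of the Melbourne dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "Rooms" is complete, "YearBuilt" has one missing entry
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "YearBuilt": [1990.0, np.nan, 2010.0],
})

# 1: drop every column that contains at least one missing value
dropped = toy.dropna(axis=1)

# 2: fill each missing value with the mean of its column
imputed = toy.fillna(toy.mean())

# 3: same as 2, plus a flag column recording which rows were imputed
extended = imputed.copy()
extended["YearBuilt_was_missing"] = toy["YearBuilt"].isnull()

print(dropped.columns.tolist())
print(imputed["YearBuilt"].tolist())
print(extended["YearBuilt_was_missing"].tolist())
```

Option 1 keeps only `Rooms`, option 2 replaces the gap with the column mean, and option 3 additionally remembers that the second row was filled in.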

## Scores of the Different Approaches

In [34]:
import pandas as pd

# file path location
melbourne_file_path = 'input/CSV/melb_data.csv'
# store the data in a DataFrame named melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path)

y = melbourne_data.Price

X = melbourne_data.drop(["Price"], axis=1)
X = X.select_dtypes(exclude=['object'])

### Initial setup

In [35]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# split data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

# get names of columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
cols_with_missing

['Car', 'BuildingArea', 'YearBuilt']
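
Knowing which columns have gaps does not say how large the gaps are, and that matters when deciding whether dropping a column is reasonable. A minimal sketch on a made-up DataFrame (`toy` and its values are hypothetical; on the real data the same calls could be applied to `X`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different amounts of missing values per column
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Car": [1.0, np.nan, 2.0, 1.0],
    "YearBuilt": [np.nan, np.nan, 2010.0, np.nan],
})

# number of missing entries per column, keeping only columns that have any
missing_count = toy.isnull().sum()
missing_count = missing_count[missing_count > 0]

# fraction of rows that are missing in each column
missing_share = toy.isnull().mean()

print(missing_count)
print(missing_share)
```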

### 1 Drop the columns

In [36]:
# drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

# prediction and metrics
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

175703.48185157913


### 2 Imputation

In [38]:
from sklearn.impute import SimpleImputer

# impute data
my_imp = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imp.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imp.transform(X_valid))

# imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

# prediction and metrics
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

169237.0268668034
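
`SimpleImputer()` with no arguments fills each column with its mean. The fill values are learned by `fit_transform` on the training data and reused by `transform` on the validation data, so the validation set does not influence the imputation. A small sketch on made-up arrays (the numbers are invented for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and validation data with missing entries
train = np.array([[1.0, 10.0],
                  [3.0, np.nan],
                  [np.nan, 30.0]])
valid = np.array([[np.nan, np.nan]])

imp = SimpleImputer()  # default strategy is "mean"
train_filled = imp.fit_transform(train)  # learns the column means from train
valid_filled = imp.transform(valid)      # reuses those means on valid

print(imp.statistics_)  # per-column fill values learned during fit
print(valid_filled)
```

The `strategy` parameter also accepts `"median"`, `"most_frequent"`, and `"constant"`; whether any of them would score better on this dataset is not tested here.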


### 3 Extend imputation

In [39]:
from sklearn.impute import SimpleImputer

# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

# prediction and metrics
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))

169795.45249719475
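
scikit-learn can also build the indicator columns itself: `SimpleImputer` has an `add_indicator` parameter that appends one flag column for each feature that had missing values during `fit`. The sketch below uses made-up data and is an alternative to the manual `_was_missing` loop, not the code that produced the score shown here. The flags come back as 0.0/1.0 rather than TRUE/FALSE.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: only the second column has a missing value
train = np.array([[1.0, 10.0],
                  [3.0, np.nan],
                  [5.0, 30.0]])

imp = SimpleImputer(add_indicator=True)
out = imp.fit_transform(train)

# two imputed feature columns followed by one indicator column
print(out.shape)
print(out)
```

Only the second column gets an indicator because, by default, flags are added only for features that contained missing values when the imputer was fitted.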
