## Missing Values

### Option 1. Drop Columns with Missing Values

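* Drop every column that contains at least one missing value.
* The model loses access to a lot of (potentially useful) information with this approach, because a column is discarded even if only a few of its entries are missing.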

```python
# Get names of columns with missing values
# cf) any() : returns True if any item in an iterable is true.
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop the same columns from training and validation data
# cf) axis=1 : columns (axis=0 : rows)
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)
```
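
To see how much this option can discard, here is a minimal sketch on a made-up frame. The `toy_train` and `toy_valid` names and all values are invented for illustration and are not part of the exercise data. The columns to drop are chosen from the training data only, and the same columns are then removed from the validation data so both frames keep identical features.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration
toy_train = pd.DataFrame({
    "rooms": [3, 4, 2],
    "lot_area": [500.0, np.nan, 320.0],        # one missing entry
    "garage_year": [np.nan, np.nan, 1999.0],   # two missing entries
})
toy_valid = pd.DataFrame({
    "rooms": [5, 3],
    "lot_area": [410.0, 600.0],
    "garage_year": [2005.0, np.nan],
})

# Columns with at least one missing value in the training data
toy_cols_with_missing = [col for col in toy_train.columns
                         if toy_train[col].isnull().any()]

# Drop the same columns from both frames
reduced_toy_train = toy_train.drop(toy_cols_with_missing, axis=1)
reduced_toy_valid = toy_valid.drop(toy_cols_with_missing, axis=1)

print(toy_cols_with_missing)             # ['lot_area', 'garage_year']
print(list(reduced_toy_train.columns))   # ['rooms']
```

Here `lot_area` is dropped even though only one of its three training entries is missing, which is the information loss described above.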

### Option 2. Imputation

* Fill in the missing values with some number, for example the mean value of each column.
* The imputed value won't be exactly right in most cases, but it usually leads to more accurate models than you would get from dropping the column entirely.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Imputation
# cf) SimpleImputer() : imputation transformer for completing missing values (default strategy='mean').
# cf) Fit the imputer on the training data, then use the fitted imputer to impute the values in the validation data.
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns
```
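
The fit/transform split is easier to see on a tiny example. The sketch below uses invented `toy_train` and `toy_valid` frames (not the exercise data) and shows that the fill values are the column means learned from the training data, which are then reused unchanged for the validation data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, invented for illustration
toy_train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})
toy_valid = pd.DataFrame({"a": [np.nan, 10.0], "b": [7.0, np.nan]})

toy_imputer = SimpleImputer()  # default strategy='mean'

# fit_transform learns the column means from the training data and fills it
imputed_toy_train = pd.DataFrame(toy_imputer.fit_transform(toy_train),
                                 columns=toy_train.columns)
# transform reuses the training means; it does not look at validation means
imputed_toy_valid = pd.DataFrame(toy_imputer.transform(toy_valid),
                                 columns=toy_valid.columns)

print(toy_imputer.statistics_)  # training means: 2.0 for 'a', 5.0 for 'b'
print(imputed_toy_valid)
```

The missing entry in validation column `a` becomes 2.0 (the training mean of 1.0 and 3.0), not 10.0, so nothing about the validation data influences the fill values.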

### Option 3. Advanced Imputation

* Impute the missing values as before.
* Additionally, for each column with missing entries in the original dataset, we add a new column that marks which entries were imputed.
* In some cases, this will meaningfully improve results. In other cases, it doesn't help at all.

<img src="02_imputation.png" alt="title" width="300"/>

```python
# Assumes cols_with_missing (Option 1) and pd, SimpleImputer (Option 2) are already defined

# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns
```
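
As a possible alternative to building the `_was_missing` columns by hand, `SimpleImputer` accepts an `add_indicator=True` argument that appends indicator columns to the imputed output. This is a minimal sketch on invented toy data, not the approach used in the code above. Indicator columns are only added for features that had missing values when the imputer was fitted.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, invented for illustration
toy_train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})
toy_valid = pd.DataFrame({"a": [np.nan, 10.0], "b": [7.0, 8.0]})

# add_indicator=True appends one 0/1 column per feature that had
# missing values in the data passed to fit (here only 'a')
indicator_imputer = SimpleImputer(add_indicator=True)
train_out = indicator_imputer.fit_transform(toy_train)
valid_out = indicator_imputer.transform(toy_valid)

# Output columns: imputed 'a', imputed 'b', indicator for 'a'
print(train_out)
print(valid_out)
```

The result is a NumPy array, so column names still need to be restored afterwards if a DataFrame is required.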

## Exercise Notes

* Set up target and features
```python
# Remove rows with missing target
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
# Separate target from predictors
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)
```

* Number of missing values in each column of training data
```python
missing_val_count_by_column = X_train.isnull().sum()
print(missing_val_count_by_column[missing_val_count_by_column > 0])
```
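
Raw counts are easier to judge next to the share of rows they represent. The following sketch, on an invented `toy_train` frame rather than the exercise data, reports both. The `missing_share` column is an illustrative addition, not something the exercise asks for.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration
toy_train = pd.DataFrame({
    "rooms": [3, 4, 2, 5],
    "lot_area": [500.0, np.nan, 320.0, 410.0],        # 1 of 4 missing
    "garage_year": [np.nan, np.nan, 1999.0, 2005.0],  # 2 of 4 missing
})

is_missing = toy_train.isnull()
summary = pd.DataFrame({
    "missing_count": is_missing.sum(),   # number of missing entries per column
    "missing_share": is_missing.mean(),  # fraction of rows that are missing
})

# Keep only the columns that actually have missing values
summary_missing_only = summary[summary["missing_count"] > 0]
print(summary_missing_only)
```

A column with a small share of missing entries is usually a better candidate for imputation than for dropping.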

* Computing MAE
```python
from sklearn.metrics import mean_absolute_error

# First argument: true target values; second argument: model predictions
mae = mean_absolute_error(y_true, y_pred)
```
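
MAE is the average of the absolute differences between the true values and the predictions. The sketch below checks this on made-up numbers (the arrays are invented for illustration) by comparing `mean_absolute_error` with a manual NumPy computation.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical values, invented for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae_sklearn = mean_absolute_error(y_true, y_pred)

# Same quantity by hand: absolute errors are 0.5, 0.5, 0.0, 1.0
mae_manual = np.mean(np.abs(y_true - y_pred))

print(mae_sklearn, mae_manual)  # both 0.5
```

Lower MAE is better, so the three options for missing values can be compared by fitting the same model on each prepared dataset and comparing the resulting validation MAE.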