# Null and Missing Values Notebook

This notebook walks you through the problem of missing and null values in data cleaning. There are two broad approaches to handling them. The first is deletion, where you drop the rows or columns that contain null values. The second is imputation, where you fill in the missing values. There are many kinds of imputation, such as using a summary statistic, linear regression, or K-Nearest Neighbours.

## Importing Libraries and Dataset
First, we import the necessary libraries and the dataset. The dataset used in this assignment is Kaggle's Melbourne Housing dataset. The notebook develops a model that predicts the price of a house from the numerical variables.

In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

raw_data = pd.read_csv('Melbourne_housing_FULL.csv')
raw_data.head()

__Describe what you notice from the header of the data.__ <br>

Non-numerical variables are hard to use when training a model, because most models only accept numerical inputs. Some of them can be converted into dummy variables, but others, such as the address, are hard to interpret. __Drop the non-numerical variables, then remove the rows where Price is null. Store the result in `full_data`.__
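As a rough illustration of the two steps, here is a sketch on a small made-up table. The frame `toy` and its values are hypothetical and are not taken from the Melbourne data; the idea is only to show one way pandas can select numeric columns and drop rows with a missing target.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the Melbourne data.
toy = pd.DataFrame({
    "Suburb": ["A", "B", "C", "D"],
    "Rooms": [2, 3, 4, 3],
    "Price": [500000.0, np.nan, 750000.0, 620000.0],
})

# Keep only the numeric columns, then drop rows whose target is missing.
toy_numeric = toy.select_dtypes(include="number")
toy_clean = toy_numeric.dropna(subset=["Price"])
print(toy_clean)
```

Rows with a missing target cannot be used for supervised training or scoring, which is why they are removed rather than imputed.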

In [None]:
# Begin Solution
# End Solution
full_data.head()

Below is a function that counts the missing values per column, which lets us see how the missing values are distributed.

In [None]:
def missing_vals(data):
    missing_val_count_by_column = (data.isnull().sum())
    return (missing_val_count_by_column[missing_val_count_by_column > 0])

missing_vals(full_data)

In [None]:
full_data.count()

## Dropping Columns and Rows

The count in the cell above shows that many columns contain missing values. This part of the notebook approaches the problem by deleting rows and columns.

__Drop the variables YearBuilt and BuildingArea from the data and store the result in `data`.__

In [None]:
# Begin Solution

# End Solution
data.head(5)

__Why do you think we should drop those variables instead of imputing them?__ <br>

In [None]:
missing_vals(data)

Below are a helper function for scoring a dataset and the train-test split of the data. The helper function is credited to https://www.kaggle.com/dansbecker/handling-missing-values.

In [None]:
def score_dataset(X_train, X_test, y_train, y_test):
    """Fit a random forest on the training split and return the mean absolute error on the test split."""
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)

# Hold out 30% of the rows for testing; Price is the target.
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=['Price']), data['Price'],
    train_size=0.7, test_size=0.3, random_state=0)

Now we shall explore deletion and imputation with our existing dataset. We will first drop all the rows with a null value, and then drop all the columns with null values. We start with the rows. We will not score a model on the row-dropped data, because it contains a different number of data points and its mean absolute error would not be comparable with the other approaches. We will, however, score the model with dropped columns.
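To make the difference between the two kinds of deletion concrete, here is a minimal sketch on a hypothetical toy frame (the values are invented for illustration). It uses `DataFrame.dropna` with `axis=0` for rows and `axis=1` for columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values in two of three columns.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Landsize": [120.0, np.nan, 300.0, 250.0],
    "Car": [1.0, 2.0, np.nan, 1.0],
})

rows_dropped = toy.dropna(axis=0)  # removes every row containing a null
cols_dropped = toy.dropna(axis=1)  # removes every column containing a null
print(rows_dropped.shape, cols_dropped.shape)
```

Dropping rows keeps all the variables but loses data points, while dropping columns keeps all the data points but loses variables.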

In [None]:
dropped_rows = data.copy(deep=True)
# Begin Solution
# End Solution
print(missing_vals(dropped_rows))
assert missing_vals(dropped_rows).sum() == 0
dropped_rows.head()

There should be no missing values in the new dataset, but now let us look at the number of points remaining in the dataset.

In [None]:
dropped_rows.count()

We notice that dropping the rows with missing values removed around a third of the original dataset. Consider the case where we had not dropped the two variables before dropping the rows with missing values. How many points would you expect to remain? Now let us try this with columns.<br>
_Hint: First find the columns with missing values across the entire dataset, then drop those columns from both train and test._ <br>
__What do you expect to see if we drop the columns with missing data points?__ <br>
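The hint matters because a column may have missing values in only one of the two splits. The sketch below shows the idea on two hypothetical toy splits (`toy_train` and `toy_test` are invented names and values), so that both end up with the same columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy splits: each has a null in a different column.
toy_train = pd.DataFrame({"Rooms": [2, 3], "Car": [1.0, 2.0], "Landsize": [np.nan, 200.0]})
toy_test = pd.DataFrame({"Rooms": [4, 3], "Car": [np.nan, 1.0], "Landsize": [150.0, 300.0]})

# Find columns with a missing value in either split.
combined = pd.concat([toy_train, toy_test])
cols_with_missing = combined.columns[combined.isnull().any()]

# Drop the same columns from both splits so their features still match.
toy_train_reduced = toy_train.drop(columns=cols_with_missing)
toy_test_reduced = toy_test.drop(columns=cols_with_missing)
print(list(toy_train_reduced.columns), list(toy_test_reduced.columns))
```

If the columns were chosen from the training split alone, `Car` would survive and the test split would still contain a null.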


In [None]:
# Begin Solution

# End Solution
assert missing_vals(dropped_train_cols).sum() == 0
assert missing_vals(dropped_test_cols).sum() == 0
dropped_train_cols.head()

We will now build a model using the remaining columns. Use the helper function `score_dataset` to run the model. <br>
__Comment on the resulting mean absolute error.__ <br>


In [None]:
# Begin Solution
score_dataset(dropped_train_cols, dropped_test_cols, y_train, y_test)
# End Solution

## Imputation of Values

Imputation fills in missing values with substitutes, such as the column mean or values predicted by a linear regression or by K-Nearest Neighbours (KNN). <br>
__Describe these three imputation methods.__<br>
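Regression imputation is not implemented elsewhere in this notebook, so here is a minimal sketch of the idea on a hypothetical toy frame. It assumes a single predictor column (`Rooms`) that is fully observed, and uses scikit-learn's `LinearRegression` to predict the missing `Landsize` value; real data would usually need more predictors and more care.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy frame where Landsize is exactly 100 * Rooms.
toy = pd.DataFrame({
    "Rooms": [1, 2, 3, 4, 5],
    "Landsize": [100.0, 200.0, np.nan, 400.0, 500.0],
})

# Fit on the rows where Landsize is known, then predict the missing rows.
known = toy["Landsize"].notnull()
reg = LinearRegression().fit(toy.loc[known, ["Rooms"]], toy.loc[known, "Landsize"])
toy.loc[~known, "Landsize"] = reg.predict(toy.loc[~known, ["Rooms"]])
print(toy)
```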

Below, implement a simple imputer that uses the mean of each column. Store the results in `imputed_X_train` and `imputed_X_test`.
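As a reference for how `SimpleImputer` behaves, here is a small sketch on hypothetical toy splits (the frames and values are invented). The imputer is fitted on the training split only, and the learned column means are then reused for the test split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy splits with missing values.
toy_train = pd.DataFrame({"Rooms": [2.0, 4.0, np.nan], "Car": [1.0, np.nan, 3.0]})
toy_test = pd.DataFrame({"Rooms": [np.nan, 5.0], "Car": [2.0, np.nan]})

# Learn the column means from the training split only.
mean_imputer = SimpleImputer(strategy="mean")
toy_train_filled = pd.DataFrame(mean_imputer.fit_transform(toy_train), columns=toy_train.columns)

# Reuse the training means to fill the test split.
toy_test_filled = pd.DataFrame(mean_imputer.transform(toy_test), columns=toy_test.columns)
print(toy_test_filled)
```

Fitting on the training split alone keeps information from the test split out of the imputed values.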

In [None]:
# Begin Solution

# End Solution
score_dataset(imputed_X_train, imputed_X_test, y_train, y_test)

In this notebook, we explore KNN imputation in more detail. Implement a KNN imputer and try different values for the number of nearest neighbours, using uniform weights. Store the results in `KNN_X_train` and `KNN_X_test`. __Comment on what you find.__ <br>
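To see how the number of neighbours changes an imputed value, here is a sketch with `KNNImputer` on a hypothetical toy array (the numbers are made up). With uniform weights, the missing entry is filled with the plain average of that column over the `k` closest rows that have a value there.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy array: the second column is missing in the third row.
toy = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [6.0, 60.0],
    [10.0, 100.0],
])

# Impute the same entry with different numbers of neighbours.
results = {}
for k in (1, 2, 4):
    imputer = KNNImputer(n_neighbors=k, weights="uniform")
    filled = imputer.fit_transform(toy)
    results[k] = filled[2, 1]
    print(k, results[k])
```

A small `k` follows the closest rows, while a large `k` moves the imputed value towards the overall column mean.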


In [None]:
# Begin Solution

# End Solution
score_dataset(KNN_X_train, KNN_X_test, y_train, y_test)

Further avenues for exploring imputation include adding an indicator of which values were imputed as an extra feature in the regression, and trying different imputation models.
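One possible way to mark imputed values as a feature is the `add_indicator` option of scikit-learn's imputers. The sketch below uses a hypothetical toy array; it appends one 0/1 column for each input column that had missing values when the imputer was fitted.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy array: only the first column has a missing value.
toy = np.array([
    [2.0, 1.0],
    [np.nan, 2.0],
    [4.0, 3.0],
])

# Mean-impute and append a missing-value indicator column.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
augmented = imputer.fit_transform(toy)
print(augmented)
```

The extra column lets the model learn whether the fact that a value was missing is itself informative.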