# Null and Missing Values Notebook

This notebook walks you through the problem of missing and null values in data cleaning. There are two broad approaches to handling them. The first is deletion, where you drop the rows or columns that contain null values. The second is imputation, where you fill in the missing values. There are many kinds of imputation, such as summary statistics, linear regression, and K-Nearest Neighbours.
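
As a quick illustration of the two approaches, here is a minimal sketch on a made-up three-column table (the column names and values are hypothetical and are not taken from the housing data). It shows row deletion, column deletion, and mean imputation with plain pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the housing dataset.
toy = pd.DataFrame({
    "rooms": [2, 3, 4, 3],
    "car": [1.0, np.nan, 2.0, 1.0],
    "landsize": [200.0, 150.0, np.nan, 100.0],
})

rows_dropped = toy.dropna(axis=0)     # deletion: keep only complete rows
cols_dropped = toy.dropna(axis=1)     # deletion: keep only complete columns
mean_filled = toy.fillna(toy.mean())  # imputation: fill gaps with column means

print(rows_dropped.shape)             # (2, 3)
print(cols_dropped.columns.tolist())  # ['rooms']
print(mean_filled)
```

Deletion loses either observations or features, while imputation keeps the table's shape at the cost of inventing values.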

## Importing Libraries and Dataset
First, we import the necessary libraries and dataset. The dataset used in this assignment is Kaggle's Melbourne Housing dataset. The notebook develops a model that predicts the price of a house from the numerical variables.

In [2]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

raw_data = pd.read_csv('Melbourne_housing_FULL.csv')
raw_data.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,3/09/2016,2.5,3067.0,...,1.0,1.0,126.0,,,Yarra City Council,-37.8014,144.9958,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra City Council,-37.7996,144.9984,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra City Council,-37.8079,144.9934,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,4/02/2016,2.5,3067.0,...,2.0,1.0,0.0,,,Yarra City Council,-37.8114,145.0116,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra City Council,-37.8093,144.9944,Northern Metropolitan,4019.0


__Describe what you notice from the header of the data.__ <br>
## Student solution here ##
The data mixes numerical columns with text (object) columns such as names and addresses, and several columns contain null values.

Non-numerical variables are hard to use when training a model, because most models accept only numerical inputs. Some of them can be converted into dummy variables, but others, such as the address, are hard to encode meaningfully. __Drop the non-numerical variables, then remove the rows where Price is null.__

In [3]:
# Begin Solution
full_data = raw_data.select_dtypes(exclude=['object'])
full_data = full_data.dropna(axis=0, subset=['Price'])
# End Solution
full_data.head()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
1,2,1480000.0,2.5,3067.0,2.0,1.0,1.0,202.0,,,-37.7996,144.9984,4019.0
2,2,1035000.0,2.5,3067.0,2.0,1.0,0.0,156.0,79.0,1900.0,-37.8079,144.9934,4019.0
4,3,1465000.0,2.5,3067.0,3.0,2.0,0.0,134.0,150.0,1900.0,-37.8093,144.9944,4019.0
5,3,850000.0,2.5,3067.0,3.0,2.0,1.0,94.0,,,-37.7969,144.9969,4019.0
6,4,1600000.0,2.5,3067.0,3.0,1.0,2.0,120.0,142.0,2014.0,-37.8072,144.9941,4019.0
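
The notebook drops the text columns outright, but the dummy-value route mentioned above can be sketched as follows. This is only an illustration on a hypothetical toy frame: the `Type` codes are made up for the example, and `pd.get_dummies` is just one possible encoder.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numerical column.
toy = pd.DataFrame({"Type": ["h", "u", "t", "h"], "Rooms": [2, 3, 3, 4]})

# Replace the categorical column with one 0/1 indicator column per category.
encoded = pd.get_dummies(toy, columns=["Type"], dtype=int)

print(encoded.columns.tolist())  # ['Rooms', 'Type_h', 'Type_t', 'Type_u']
```

This works for columns with a handful of categories. A column such as Address, where nearly every value is unique, would produce almost one indicator per row, which is why it is hard to encode usefully.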


Below is a function that helps count the missing values by columns which will allow us to see how the missing values are distributed.

In [4]:
def missing_vals(data):
    """Return the number of missing values for each column that has any."""
    missing_val_count_by_column = data.isnull().sum()
    return missing_val_count_by_column[missing_val_count_by_column > 0]

missing_vals(full_data)

Distance             1
Postcode             1
Bedroom2          6441
Bathroom          6447
Car               6824
Landsize          9265
BuildingArea     16591
YearBuilt        15163
Lattitude         6254
Longtitude        6254
Propertycount        3
dtype: int64

In [5]:
full_data.count()

Rooms            27247
Price            27247
Distance         27246
Postcode         27246
Bedroom2         20806
Bathroom         20800
Car              20423
Landsize         17982
BuildingArea     10656
YearBuilt        12084
Lattitude        20993
Longtitude       20993
Propertycount    27244
dtype: int64
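
Raw counts are easier to compare as fractions of the total. From the two outputs above, BuildingArea is missing in 16591 of 27247 rows (about 61%) and YearBuilt in 15163 of 27247 rows (about 56%). The sketch below shows one way to compute such fractions; `missing_fraction` is a hypothetical helper and the data is a made-up toy frame.

```python
import numpy as np
import pandas as pd

def missing_fraction(data):
    """Return the share of missing values per column, largest first."""
    frac = data.isnull().mean()  # mean of a boolean mask = fraction of True
    return frac[frac > 0].sort_values(ascending=False)

# Hypothetical toy data, not the housing dataset.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

result = missing_fraction(toy)
print(result)
```

Column `c` has no missing values, so it does not appear in the result.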

## Dropping Columns and Rows

The count in the cell above shows that there are many columns with missing values present in the data. This part of the notebook approaches the problem by deletion of rows and columns.

__Drop the YearBuilt and BuildingArea variables from the data.__

In [6]:
# Begin Solution
data = full_data.drop(columns = ['YearBuilt', 'BuildingArea'])
# End Solution
data.head(5)

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,Lattitude,Longtitude,Propertycount
1,2,1480000.0,2.5,3067.0,2.0,1.0,1.0,202.0,-37.7996,144.9984,4019.0
2,2,1035000.0,2.5,3067.0,2.0,1.0,0.0,156.0,-37.8079,144.9934,4019.0
4,3,1465000.0,2.5,3067.0,3.0,2.0,0.0,134.0,-37.8093,144.9944,4019.0
5,3,850000.0,2.5,3067.0,3.0,2.0,1.0,94.0,-37.7969,144.9969,4019.0
6,4,1600000.0,2.5,3067.0,3.0,1.0,2.0,120.0,-37.8072,144.9941,4019.0


__Why do you think we should drop those variables instead of imputing into them?__ <br>
## Student solution here ##
More than half of the values in these variables are missing, so any imputed values would be mostly guesswork. It is better to drop them.

In [7]:
missing_vals(data)

Distance            1
Postcode            1
Bedroom2         6441
Bathroom         6447
Car              6824
Landsize         9265
Lattitude        6254
Longtitude       6254
Propertycount       3
dtype: int64

Below are a helper function for scoring a dataset and the train-test split of the data. The helper function is credited to https://www.kaggle.com/dansbecker/handling-missing-values.

In [8]:
def score_dataset(X_train, X_test, y_train, y_test):
    """Fit a random forest on the training split and return the test MAE."""
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mean_absolute_error(y_test, preds)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=['Price']), data['Price'],
    train_size=0.7, test_size=0.3, random_state=0)

Now we shall explore deletion and imputation with our existing dataset. Let's first drop all the rows that contain a null value, and then drop all the columns that contain null values. We will not score a model on the dropped-rows data: it would be evaluated on a different set of data points, so its mean absolute error would not be comparable with the others. We will, however, score the model with dropped columns.

In [9]:
dropped_rows = data.copy(deep=True)
# Begin Solution
dropped_rows.dropna(inplace=True)
# End Solution
print(missing_vals(dropped_rows))
assert missing_vals(dropped_rows).sum() == 0
dropped_rows.head()

Series([], dtype: int64)


Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,Lattitude,Longtitude,Propertycount
1,2,1480000.0,2.5,3067.0,2.0,1.0,1.0,202.0,-37.7996,144.9984,4019.0
2,2,1035000.0,2.5,3067.0,2.0,1.0,0.0,156.0,-37.8079,144.9934,4019.0
4,3,1465000.0,2.5,3067.0,3.0,2.0,0.0,134.0,-37.8093,144.9944,4019.0
5,3,850000.0,2.5,3067.0,3.0,2.0,1.0,94.0,-37.7969,144.9969,4019.0
6,4,1600000.0,2.5,3067.0,3.0,1.0,2.0,120.0,-37.8072,144.9941,4019.0


There should be no missing values in the new dataset. Now let us look at the number of data points that remain.

In [10]:
dropped_rows.count()

Rooms            17679
Price            17679
Distance         17679
Postcode         17679
Bedroom2         17679
Bathroom         17679
Car              17679
Landsize         17679
Lattitude        17679
Longtitude       17679
Propertycount    17679
dtype: int64

We notice that we have removed around a third of the original dataset by dropping the rows with missing values. Consider the case where we had not dropped the YearBuilt and BuildingArea variables before dropping the rows with missing values. How many points would you expect? Now let us try this with columns.<br>
_Hint : Find the missing columns across the entire dataset first and drop those columns for both train and test._ <br>
__What do you expect to see if we drop the columns with missing datapoints?__ <br>
## Student solution here ##
I expect only a few columns to remain, each with plenty of data points, but too few features to interpret or model well.

In [11]:
# Begin Solution
cols_with_missing = [col for col in data.columns if data[col].isnull().any()]
dropped_train_cols = X_train.drop(columns=cols_with_missing)
dropped_test_cols = X_test.drop(columns=cols_with_missing)
# End Solution
assert missing_vals(dropped_train_cols).sum() == 0
assert missing_vals(dropped_test_cols).sum() == 0
dropped_train_cols.head()

Unnamed: 0,Rooms
30058,3
20196,3
12976,1
28998,3
23676,3


We will now make a model using the remaining columns. Use the helper function score_dataset to run the model. <br>
__Comment on the resultant mean absolute error__ <br>
## Student solution here ##
There is only one column remaining so the model cannot predict well.

In [12]:
# Begin Solution
score_dataset(dropped_train_cols, dropped_test_cols, y_train, y_test)
# End Solution

391181.0008684981

## Imputation of values

Imputation fills in missing values with substitutes, such as the column mean or values predicted by a linear regression or KNN model. <br>
__Describe those 3 different imputation methods.__ <br>
## Student solution here ##
1. Mean imputation computes the mean of each column and uses that value to fill the missing entries in that column.
2. Linear regression imputation fits a regression that predicts the column with missing values from the other columns, then uses the fitted equation to predict each missing value from the other values in the same row.
3. KNN imputation uses the K-nearest-neighbours algorithm: for each row with a missing value, it finds the k rows with the smallest Euclidean distance (computed on the features that are observed) and fills the gap with the average of their values, either uniform or weighted by distance.
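
The three methods described above can be compared on a tiny example. This is a minimal sketch on made-up data in which the second column is exactly twice the first and one value is missing. The regression step is hand-rolled with `LinearRegression` purely for illustration and is not a method used elsewhere in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LinearRegression

# Toy data: column 1 is exactly 2 * column 0; one value is missing.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
])

# 1. Mean imputation: fill with the mean of the observed values in the column.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# 2. Regression imputation: fit column 1 on column 0 using complete rows,
#    then predict the missing entry from the same row's column 0.
known = ~np.isnan(X[:, 1])
reg = LinearRegression().fit(X[known, :1], X[known, 1])
reg_filled = X.copy()
reg_filled[~known, 1] = reg.predict(X[~known, :1])

# 3. KNN imputation: average column 1 over the 2 nearest rows.
knn_filled = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)

print(round(float(mean_filled[2, 1]), 3))  # 4.667
print(round(float(reg_filled[2, 1]), 3))   # 6.0
print(round(float(knn_filled[2, 1]), 3))   # 6.0
```

The mean ignores the relationship between the columns, whereas the regression and KNN fills both recover the value implied by the neighbouring rows in this contrived case.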

Below you need to implement a simple imputer that fills each column with its mean, which is the default strategy of SimpleImputer.

In [13]:
# Begin Solution
my_imputer = SimpleImputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)
# End Solution
score_dataset(imputed_X_train, imputed_X_test, y_train, y_test)

187468.98265814345

In this notebook, we are going to explore KNN in more detail. Implement a KNN imputer and explore different values for the number of nearest neighbours with uniform weights. __Comment on what you find.__ <br>
## Student solution here ##
With 1 or 2 nearest neighbours, KNN imputation is worse than the simple mean imputer, but the error falls as the number increases. At around 8 nearest neighbours it outperforms the mean imputer, although the gap is small (about 186,000 versus 187,500).

In [14]:
# Begin Solution
KNN_imputer = KNNImputer(n_neighbors=8, weights="uniform")
KNN_X_train = KNN_imputer.fit_transform(X_train)
KNN_X_test = KNN_imputer.transform(X_test)
# End Solution
score_dataset(KNN_X_train, KNN_X_test, y_train, y_test)

186008.4040770988
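
One way to explore several values of n_neighbors is a simple loop. The sketch below runs on synthetic data rather than the housing data, so its numbers say nothing about the results above. The data-generating step, the 30% missing rate, and the fixed random seeds are all assumptions made for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical, not the housing dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=300)

# Knock out roughly 30% of the entries in the first two columns only,
# so every row keeps some observed features for the distance computation.
mask = np.zeros_like(X, dtype=bool)
mask[:, :2] = rng.random((300, 2)) < 0.3
X[mask] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for k in [1, 2, 4, 8, 16]:
    imputer = KNNImputer(n_neighbors=k, weights="uniform")
    X_tr_k = imputer.fit_transform(X_tr)  # fit on the training split only
    X_te_k = imputer.transform(X_te)
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_tr_k, y_tr)
    scores[k] = mean_absolute_error(y_te, model.predict(X_te_k))

for k, mae in scores.items():
    print(f"n_neighbors={k:2d}  MAE={mae:.3f}")
```

Fixing `random_state` on the forest keeps the comparison between values of k from being blurred by run-to-run randomness in the model itself.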

Further avenues for exploring imputation include adding an indicator feature that marks which values were imputed and feeding it to the regression, as well as trying different imputation models.
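
Marking imputed values as a feature can be sketched with the `add_indicator` option of `SimpleImputer`. The data below is a made-up toy array, and this is one possible way to build such indicator columns rather than the approach this notebook prescribes.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each column.
X = np.array([
    [1.0, np.nan],
    [2.0, 4.0],
    [np.nan, 6.0],
])

# add_indicator=True appends one 0/1 column per feature that had missing
# values during fit, flagging which entries were imputed.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_marked = imputer.fit_transform(X)

print(X_marked.shape)  # (3, 4): two imputed columns plus two indicator columns
print(X_marked)
```

The extra columns let a downstream model learn whether the fact that a value was missing is itself predictive.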