# Treating Missing Values

There are broadly two ways to treat missing values:
1. Delete: Remove the rows or columns that contain missing values
2. Impute:
    * Imputing by a simple statistic: Replace the missing values by another value, commonly the mean, median, mode etc.
    * Predictive techniques: Use statistical models such as k-NN, SVM etc. to predict and impute missing values.
   
In general, imputation makes assumptions about the missing values and replaces them with substitute values such as the mean or median. It should be used only when you are reasonably confident about those assumptions.

Otherwise, deletion is often safer and recommended. You may lose some data, but you will not make any unreasonable assumptions.
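
As a minimal sketch of the two options, assuming a small made-up DataFrame (the `toy` frame, its column names and its values are hypothetical, not taken from the Melbourne data), deletion and simple imputation could look like this in pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from the Melbourne dataset
toy = pd.DataFrame({
    'rooms': [2, 3, 4, 3],
    'landsize': [120.0, np.nan, 300.0, 180.0],
})

# Option 1: delete every row that has at least one missing value
deleted = toy.dropna()

# Option 2: impute the missing values with a simple statistic (here, the median)
landsize_median = toy['landsize'].median()
imputed = toy.fillna({'landsize': landsize_median})

print(len(deleted))                  # 3 rows remain
print(imputed['landsize'].tolist())  # the NaN is replaced by the median, 180.0
```

Deletion shrinks the data, while imputation keeps all four rows at the cost of assuming that the missing value looks like a typical one.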

In [53]:
import numpy as np
import pandas as pd

df = pd.read_csv('datasets/melbourne.csv')
df

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,03-09-2016,2.5,3067.0,...,1.0,1.0,126.0,,,Yarra,-37.80140,144.99580,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,03-12-2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.79960,144.99840,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,04-02-2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.80790,144.99340,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,04-02-2016,2.5,3067.0,...,2.0,1.0,0.0,,,Yarra,-37.81140,145.01160,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,04-03-2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.80930,144.99440,Northern Metropolitan,4019.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23542,Wyndham Vale,25 Clitheroe Dr,3,u,,PN,Harcourts,26-08-2017,27.2,3024.0,...,1.0,0.0,552.0,119.0,1990.0,,-37.90032,144.61839,Western Metropolitan,5262.0
23543,Wyndham Vale,19 Dalrymple Bvd,4,h,,S,hockingstuart,26-08-2017,27.2,3024.0,...,,,,,,,-37.87882,144.60184,Western Metropolitan,5262.0
23544,Yallambie,17 Amaroo Wy,4,h,1100000.0,S,Buckingham,26-08-2017,12.7,3085.0,...,3.0,2.0,,,,,-37.72006,145.10547,Northern Metropolitan,1369.0
23545,Yarraville,6 Agnes St,4,h,1285000.0,SP,Village,26-08-2017,6.3,3013.0,...,1.0,1.0,362.0,112.0,1920.0,,-37.81188,144.88449,Western Metropolitan,6543.0


# Treating Missing Values in Columns

In [46]:
# percentage of missing values in each column
round(100 * (df.isnull().sum() / len(df.index)), 2)

Suburb            0.00
Address           0.00
Rooms             0.00
Type              0.00
Price            21.88
Method            0.00
SellerG           0.00
Date              0.00
Distance          0.00
Postcode          0.00
Bedroom2         19.03
Bathroom         19.04
Car              19.65
Landsize         26.06
BuildingArea     57.46
YearBuilt        50.99
CouncilArea      33.51
Lattitude        18.28
Longtitude       18.28
Regionname        0.00
Propertycount     0.00
dtype: float64
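
The same percentages can also be obtained with `isnull().mean()`, because the mean of a boolean column is the fraction of `True` values. A small sketch on hypothetical data (the `toy` frame and its columns are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'price' is missing in 2 of 4 rows
toy = pd.DataFrame({
    'price': [1.0, np.nan, 3.0, np.nan],
    'rooms': [2, 3, 4, 3],
})

# Count-based version, as in the cell above
by_sum = round(100 * (toy.isnull().sum() / len(toy.index)), 2)

# Equivalent version using the mean of the boolean mask
by_mean = (100 * toy.isnull().mean()).round(2)

print(by_mean)
```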

Notice that some columns have roughly 19%, 22%, 26% or even 57% missing values. When dealing with columns, you have two simple choices: either **delete or retain the column**. If you retain the column, you'll have to treat the rows that have missing values.

If you delete the rows with missing values, you lose data. If you impute, you introduce bias.

Apart from the number of missing values, the decision to delete or retain a variable depends on various other factors, such as:

* The analysis task at hand
* The usefulness of the variable (based on your understanding of the problem)
* The total size of the available data (if you have enough, you can afford to throw away some of it)

For example, let's say that we want to build a (linear regression) model to predict house prices in Melbourne. Even though the variable `Price` has about 22% missing values, you cannot drop it, since that is what we want to predict.

Similarly, you would expect some other variables such as `Bedroom2`, `Bathroom`, `Landsize` etc. to be important predictors of `Price`, so they should not be removed.

There are others such as `BuildingArea` which, although they seem important, have more than 50% missing values. With so many values missing, you cannot delete the affected rows without losing a lot of data, and you cannot impute them without introducing heavy bias.

Thus, for this exercise, let's remove columns having more than 30% missing values, i.e. `BuildingArea`, `YearBuilt`, `CouncilArea`.


In [47]:
# drop the columns having more than 30% missing values
df = df.drop(columns=['BuildingArea', 'YearBuilt', 'CouncilArea'])


round(100 * (df.isnull().sum()/len(df.index)), 2)

Suburb            0.00
Address           0.00
Rooms             0.00
Type              0.00
Price            21.88
Method            0.00
SellerG           0.00
Date              0.00
Distance          0.00
Postcode          0.00
Bedroom2         19.03
Bathroom         19.04
Car              19.65
Landsize         26.06
Lattitude        18.28
Longtitude       18.28
Regionname        0.00
Propertycount     0.00
dtype: float64
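
The 30% rule can also be applied programmatically instead of naming the columns by hand. The following is only a sketch on hypothetical data (the `toy` frame, its columns and the variable names `threshold`, `missing_pct` and `cols_to_drop` are assumptions made for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'b' is 75% missing, 'c' is 25% missing
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': [1.0, np.nan, 3.0, 4.0],
})

threshold = 30  # percent; same cut-off as in the text

# Percentage of missing values per column
missing_pct = 100 * toy.isnull().mean()

# Columns whose missing percentage exceeds the threshold
cols_to_drop = missing_pct[missing_pct > threshold].index.tolist()

reduced = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
print(reduced.columns.tolist())
```

A fixed threshold is a convenience, not a rule: a column that matters for the analysis may still be worth keeping above the cut-off.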

# Treating Missing Values in Rows

In [66]:
# number of missing values in each row
df.isnull().sum(axis=1)

0        3
1        2
2        0
3        3
4        0
        ..
23541    1
23542    2
23544    4
23545    1
23546    2
Length: 19055, dtype: int64

In [67]:
# retaining the rows having <= 5 NaNs
df = df[df.isnull().sum(axis = 1) <= 5]

# percentage of remaining rows that still have more than 5 NaNs (should be 0)
round(100 * len(df[df.isnull().sum(axis=1) > 5].index) / len(df.index), 2)

0.0
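
The same row filter can be written with `dropna(thresh=...)`, where `thresh` is the minimum number of non-NaN values a row must have to be kept. A sketch on hypothetical data (the `toy` frame and the `max_nans` name are illustrative assumptions; the cut-off is 1 here rather than 5 because the toy frame has only three columns):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with 0, 1, 3 and 1 NaNs in rows 0..3
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [1.0, 2.0, np.nan, np.nan],
    'c': [1.0, 2.0, np.nan, 4.0],
})

max_nans = 1  # keep rows with at most this many NaNs

# Boolean-mask version, in the style of the cell above
by_mask = toy[toy.isnull().sum(axis=1) <= max_nans]

# dropna version: require at least (number of columns - max_nans) non-NaN values
by_thresh = toy.dropna(thresh=toy.shape[1] - max_nans)

print(by_mask.index.tolist())  # [0, 1, 3]
```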

In [68]:
df

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,03-09-2016,2.5,3067.0,...,1.0,1.0,126.0,,,Yarra,-37.80140,144.99580,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,03-12-2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.79960,144.99840,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,04-02-2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.80790,144.99340,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,04-02-2016,2.5,3067.0,...,2.0,1.0,0.0,,,Yarra,-37.81140,145.01160,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,04-03-2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.80930,144.99440,Northern Metropolitan,4019.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23541,Williamstown,96 Verdon St,4,h,2500000.0,PI,Sweeney,26-08-2017,6.8,3016.0,...,1.0,5.0,866.0,157.0,1920.0,,-37.85908,144.89299,Western Metropolitan,6380.0
23542,Wyndham Vale,25 Clitheroe Dr,3,u,,PN,Harcourts,26-08-2017,27.2,3024.0,...,1.0,0.0,552.0,119.0,1990.0,,-37.90032,144.61839,Western Metropolitan,5262.0
23544,Yallambie,17 Amaroo Wy,4,h,1100000.0,S,Buckingham,26-08-2017,12.7,3085.0,...,3.0,2.0,,,,,-37.72006,145.10547,Northern Metropolitan,1369.0
23545,Yarraville,6 Agnes St,4,h,1285000.0,SP,Village,26-08-2017,6.3,3013.0,...,1.0,1.0,362.0,112.0,1920.0,,-37.81188,144.88449,Western Metropolitan,6543.0


Notice that we have now removed most of the rows where multiple columns (`Bedroom2`, `Bathroom`, `Landsize`) were missing.
We still have about 21% missing values in the column `Price` and 9% in `Landsize`. Since `Price` still contains a lot of missing data (and imputing 21% of the values of a variable you want to predict would introduce heavy bias), it's a bad idea to impute those values.

Thus, let's remove the rows with a missing `Price` as well. Notice that you can use `np.isnan(df['column'])` to flag the rows where a column is missing, and a `~` to negate the condition so that only the rows with non-missing values are kept.

In [73]:
df.shape

(14926, 21)

In [74]:
# removing NaN Price rows
df = df[~np.isnan(df['Price'])]

round(100 * (df.isnull().sum() / len(df.index)), 2)


Suburb            0.00
Address           0.00
Rooms             0.00
Type              0.00
Price             0.00
Method            0.00
SellerG           0.00
Date              0.00
Distance          0.00
Postcode          0.00
Bedroom2          0.00
Bathroom          0.01
Car               0.71
Landsize          8.86
BuildingArea     48.00
YearBuilt        39.98
CouncilArea      18.04
Lattitude         0.15
Longtitude        0.15
Regionname        0.00
Propertycount     0.00
dtype: float64
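
For comparison, here is a sketch of three ways to drop rows with a missing value in one column, on hypothetical data (the `toy` frame and its columns are made up for illustration). `np.isnan` works only on numeric columns, whereas `notna()` and `dropna(subset=...)` also handle text columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'price' is missing in row 1, 'suburb' in row 2
toy = pd.DataFrame({
    'price': [100.0, np.nan, 300.0],
    'suburb': ['A', 'B', None],
})

# NumPy mask negated with ~, in the style of the cell above
by_isnan = toy[~np.isnan(toy['price'])]

# pandas-native mask
by_notna = toy[toy['price'].notna()]

# dropna restricted to the 'price' column; the missing 'suburb' does not matter here
by_dropna = toy.dropna(subset=['price'])

print(by_dropna.index.tolist())  # [0, 2]
```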

In [72]:
df.shape

(14926, 21)

In [75]:
len(df)

14926

Now, `Landsize` is the only variable with a significant number of missing values. Let's give this variable a chance and consider imputing the NaNs.


The decision (whether and how to impute) will depend upon the distribution of the variable. For example, if all the observations lie in a narrow range (say between 800 sq.ft and 820 sq.ft), you can reasonably impute the missing values with the mean or median `Landsize`.

Let's look at the distribution:

In [78]:
df['Landsize'].describe()

count     13603.000000
mean        558.116371
std        3987.326586
min           0.000000
25%         176.500000
50%         440.000000
75%         651.000000
max      433014.000000
Name: Landsize, dtype: float64

Notice that the minimum is 0, the maximum is 433014, the mean is 558 and the median (50%) is 440. There's also a significant difference between the 25th and the 75th percentiles (176.5 and 651).

Thus, imputing this with the mean or median would likely introduce considerable bias, so we should remove the rows with NaNs instead.
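
To see why a wide, skewed distribution makes a single fill value questionable, here is a sketch with hypothetical land sizes (the numbers are made up and only loosely mimic the skew seen in `describe()` above). One extreme value pulls the mean far away from the median, so neither statistic is a safe stand-in for a typical missing value:

```python
import pandas as pd

# Hypothetical land sizes with one extreme value
landsize = pd.Series([150.0, 200.0, 450.0, 600.0, 700.0, 400000.0])

mean_value = landsize.mean()
median_value = landsize.median()

# Interquartile range: spread of the middle 50% of the values
iqr = landsize.quantile(0.75) - landsize.quantile(0.25)

print(mean_value, median_value, iqr)
```

When the mean and median disagree this much, or the interquartile range is large relative to the median, dropping the affected rows is usually the more defensible choice.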