In real-world datasets, you will almost always end up with some missing values. <br>
For example:

- someone may not want to disclose the number of bedrooms in their house.
- a property's size may never have been recorded.

Some machine learning libraries raise an error if you try to build a model on a 
dataset that includes missing values.

### 1. Drop the Column
This is one of the simplest approaches: drop every column that contains NaN values. <br>
<img src = 'https://i.imgur.com/Sax80za.png' >
Unless most values in the dropped columns are missing, the model loses access to a lot of (potentially useful!) information with this approach. As an extreme example, consider a dataset with 10,000 rows, where one important column is missing a single entry. This approach would drop the column entirely!
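As a minimal sketch of this approach, the snippet below drops every column that has at least one missing entry. The `toy` frame and its values are made up for illustration; only the column names are borrowed from the housing data.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): two columns contain NaN.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Car": [1.0, np.nan, 2.0, 0.0],
    "BuildingArea": [79.0, 150.0, np.nan, 142.0],
})

# Names of the columns that have at least one missing entry.
cols_with_nan = [col for col in toy.columns if toy[col].isna().any()]

# Drop those columns; only fully observed columns remain.
toy_dropped = toy.drop(columns=cols_with_nan)

print(cols_with_nan)
print(list(toy_dropped.columns))
```

Note that `Car` is dropped even though only one of its four values is missing, which is exactly the information loss described above.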

### 2. Imputation
With imputation, you fill in the NaN values with some number. For example, you can fill each missing entry with the mean of its column. <br>
<img src = 'https://i.imgur.com/4BpnlPA.png' />
The imputed value won't be exactly right in most cases, but it usually leads to more accurate models than you would get from dropping the column entirely.
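A small sketch of mean imputation, assuming scikit-learn's `SimpleImputer` is used for the job. The `toy` frame is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration) with one NaN per column.
toy = pd.DataFrame({
    "Car": [1.0, np.nan, 2.0, 0.0],
    "BuildingArea": [80.0, 150.0, np.nan, 130.0],
})

# Replace each NaN with the mean of the observed values in its column.
imputer = SimpleImputer(strategy="mean")

# fit_transform returns a NumPy array, so rebuild the DataFrame
# with the original column names and index.
toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(toy_imputed)
```

With a train/test split, the usual practice is to call `fit_transform` on the training data and only `transform` on the test data, so the column means come from the training rows alone.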

### 3. Extended Imputation
Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing.
<br>
<img src = 'https://i.imgur.com/UWOyg4a.png' />
In this approach, we impute the missing values, as before. Additionally, for each column with missing entries in the original dataset, we add a new column that shows the location of the imputed entries.

In some cases, this will meaningfully improve results. In other cases, it doesn't help at all.
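The idea can be sketched in plain pandas as follows. The `toy` frame and the `_was_missing` suffix are illustrative choices, not something taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): only "Car" has a missing value.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Car": [1.0, np.nan, 2.0, 0.0],
})

toy_plus = toy.copy()
cols_with_nan = [col for col in toy.columns if toy[col].isna().any()]

for col in cols_with_nan:
    # Record which entries were missing before filling them in.
    toy_plus[col + "_was_missing"] = toy[col].isna()
    # Then impute with the column mean, as in plain imputation.
    toy_plus[col] = toy[col].fillna(toy[col].mean())

print(toy_plus)
```

scikit-learn's `SimpleImputer` also accepts `add_indicator=True`, which appends similar indicator columns to its output.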

### Melbourne Housing Prediction

In this example, we will work with the Melbourne Housing dataset. Our model will use information such as the number of rooms and land size to predict home price.

In [78]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

In [3]:
data = pd.read_csv('melb_data.csv')

In [4]:
data.head()

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 40 Federation La | 3 | h | 850000.0 | PI | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 94.0 | NaN | NaN | Yarra | -37.7969 | 144.9969 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 55a Park St | 4 | h | 1600000.0 | VB | Nelson | 4/06/2016 | 2.5 | 3067.0 | ... | 1.0 | 2.0 | 120.0 | 142.0 | 2014.0 | Yarra | -37.8072 | 144.9941 | Northern Metropolitan | 4019.0 |


In [5]:
data.tail()

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13575 | Wheelers Hill | 12 Strada Cr | 4 | h | 1245000.0 | S | Barry | 26/08/2017 | 16.7 | 3150.0 | ... | 2.0 | 2.0 | 652.0 | NaN | 1981.0 | NaN | -37.90562 | 145.16761 | South-Eastern Metropolitan | 7392.0 |
| 13576 | Williamstown | 77 Merrett Dr | 3 | h | 1031000.0 | SP | Williams | 26/08/2017 | 6.8 | 3016.0 | ... | 2.0 | 2.0 | 333.0 | 133.0 | 1995.0 | NaN | -37.85927 | 144.87904 | Western Metropolitan | 6380.0 |
| 13577 | Williamstown | 83 Power St | 3 | h | 1170000.0 | S | Raine | 26/08/2017 | 6.8 | 3016.0 | ... | 2.0 | 4.0 | 436.0 | NaN | 1997.0 | NaN | -37.85274 | 144.88738 | Western Metropolitan | 6380.0 |
| 13578 | Williamstown | 96 Verdon St | 4 | h | 2500000.0 | PI | Sweeney | 26/08/2017 | 6.8 | 3016.0 | ... | 1.0 | 5.0 | 866.0 | 157.0 | 1920.0 | NaN | -37.85908 | 144.89299 | Western Metropolitan | 6380.0 |
| 13579 | Yarraville | 6 Agnes St | 4 | h | 1285000.0 | SP | Village | 26/08/2017 | 6.3 | 3013.0 | ... | 1.0 | 1.0 | 362.0 | 112.0 | 1920.0 | NaN | -37.81188 | 144.88449 | Western Metropolitan | 6543.0 |


In [41]:
dict(data.isna()['CouncilArea'].value_counts())

{False: 12211, True: 1369}

In [50]:
# Loop through all columns to find out whether they contain NaN values or not
cols = data.columns
for c in cols:
    nan_cols = dict(data.isna()[c].value_counts())
    for key,value in nan_cols.items():
        print(f'{c} has {value} {key} values')
    print('-----------------')

Suburb has 13580 False values
-----------------
Address has 13580 False values
-----------------
Rooms has 13580 False values
-----------------
Type has 13580 False values
-----------------
Price has 13580 False values
-----------------
Method has 13580 False values
-----------------
SellerG has 13580 False values
-----------------
Date has 13580 False values
-----------------
Distance has 13580 False values
-----------------
Postcode has 13580 False values
-----------------
Bedroom2 has 13580 False values
-----------------
Bathroom has 13580 False values
-----------------
Car has 13518 False values
Car has 62 True values
-----------------
Landsize has 13580 False values
-----------------
BuildingArea has 7130 False values
BuildingArea has 6450 True values
-----------------
YearBuilt has 8205 False values
YearBuilt has 5375 True values
-----------------
CouncilArea has 12211 False values
CouncilArea has 1369 True values
-----------------
Lattitude has 13580 False values
---------------
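The same per-column counts can be obtained more compactly with `isna().sum()`. The sketch below runs on a made-up `toy` frame rather than `melb_data.csv`, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Car": [1.0, np.nan, 2.0, 0.0],
    "YearBuilt": [np.nan, 1900.0, np.nan, 2014.0],
})

# Number of missing entries per column.
missing_counts = toy.isna().sum()

# Keep only the columns that actually have missing entries.
missing_counts = missing_counts[missing_counts > 0]

print(missing_counts)
```

Applied to the housing data, `data.isna().sum()` would give one count per column in a single call instead of a nested loop.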

In [72]:
Ycol = 'Price'                  # target: the sale price we want to predict
Xcol = data.columns.drop(Ycol)  # every other column is a candidate feature

In [74]:
X_train,X_test,Y_train,Y_test = train_test_split(data[Xcol],data[Ycol],train_size = 0.8, random_state=7)

In [75]:
Y_train.shape,Y_test.shape, X_train.shape,X_test.shape

((10864,), (2716,), (10864, 20), (2716, 20))

In [70]:
cols_with_missinng = [col for col in data.columns if data[col].isnull().any()]

In [71]:
cols_with_missinng

['Car', 'BuildingArea', 'YearBuilt', 'CouncilArea']

##### Drop columns in training and validation data


In [77]:
reduced_X_train = X_train.drop(cols_with_missinng,axis = 1)
reduced_X_test = X_test.drop(cols_with_missinng,axis = 1)

In [None]:
def RegressionScore():
    ...
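A scoring helper for comparing the approaches would typically fit a model on the training split and report the error on the test split. The following is only a sketch under assumptions: the name `regression_score`, its parameters, the `n_estimators` value, and the synthetic data are all hypothetical and are not taken from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def regression_score(X_train, X_test, y_train, y_test):
    """Fit a random forest on the training split and return test-set MSE."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return mean_squared_error(y_test, predictions)


# Synthetic numeric data, made up so the sketch runs on its own.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=7)
score = regression_score(X_tr, X_te, y_tr, y_te)
print(f"MSE: {score:.3f}")
```

A random forest needs numeric inputs, so with the housing data the text columns (such as `Suburb` or `Address`) would have to be removed or encoded first, for example with `select_dtypes(exclude='object')`.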