## Objective - Learn the techniques to handle missing data.

-  Python libraries represent __missing numbers as nan__, which is short for "not a number".
-  Most libraries (including scikit-learn) will give you an error if you try to build a model using data with missing values, so you'll need to choose one of the strategies below.
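
To make the `nan` representation concrete, here is a minimal sketch that counts missing entries per column. The toy DataFrame and its values are made up for illustration and are not taken from the Melbourne data used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; np.nan marks the missing entries
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Car": [1.0, np.nan, 2.0],
    "YearBuilt": [np.nan, np.nan, 1998.0],
})

# isnull() flags each missing cell; sum() counts the flags per column
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Columns with a non-zero count are the ones that need one of the strategies below.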

### Solutions:
1. #### A Simple Option: Drop Columns with Missing Values
2. #### A Better Option: Imputation
    Imputation fills in each missing value with some number (scikit-learn's imputer uses the column mean by default).
    <blockquote>
    from sklearn.impute import SimpleImputer<br>
    imputer = SimpleImputer()<br>
    imputed_data = imputer.fit_transform(data)<br>
    </blockquote>
3. #### Extension of Imputation
     Imputed values may be systematically above or below their actual values (which weren't collected in the dataset), or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing.
     <blockquote>
    <p>
      make a copy to avoid changing the original data (when imputing)<br>
        new_data = original_data.copy()<br><br>

     make new columns indicating what will be imputed<br>
        cols_with_missing = [col for col in new_data.columns <br>
                                 if new_data[col].isnull().any()]<br><br>
        for col in cols_with_missing:<br>
            new_data[col + '_was_missing'] = new_data[col].isnull()<br><br>

     imputation (keep the expanded column names, since the imputer returns a plain array)<br>
        from sklearn.impute import SimpleImputer<br>
        my_imputer = SimpleImputer()<br>
        imputed_columns = new_data.columns<br>
        new_data = pd.DataFrame(my_imputer.fit_transform(new_data))<br>
        new_data.columns = imputed_columns<br>
</p>
</blockquote>
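
Option 1 in the list above has no snippet of its own, so here is a minimal sketch of dropping columns with missing values. The DataFrame `data` and its values are hypothetical stand-ins for a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset
data = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Car": [1.0, np.nan, 2.0],
    "Landsize": [150.0, 200.0, 320.0],
})

# Collect the names of columns that contain at least one missing value
cols_with_missing = [col for col in data.columns if data[col].isnull().any()]

# Drop those columns; the same list should also be dropped from any test set
reduced_data = data.drop(cols_with_missing, axis=1)
print(reduced_data.columns.tolist())
```

This is simple, but it throws away every value in a column even when only a few entries are missing.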



In [4]:
import pandas as pd
df = pd.read_csv('./Data/melb_data.csv')

In [7]:
print(df.shape,'\n')
print(df.info(),'\n')
print(df.describe().T,'\n')
print(df.head(20),'\n')
print(df.tail(20),'\n')

(13580, 21) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
Suburb           13580 non-null object
Address          13580 non-null object
Rooms            13580 non-null int64
Type             13580 non-null object
Price            13580 non-null float64
Method           13580 non-null object
SellerG          13580 non-null object
Date             13580 non-null object
Distance         13580 non-null float64
Postcode         13580 non-null float64
Bedroom2         13580 non-null float64
Bathroom         13580 non-null float64
Car              13518 non-null float64
Landsize         13580 non-null float64
BuildingArea     7130 non-null float64
YearBuilt        8205 non-null float64
CouncilArea      12211 non-null object
Lattitude        13580 non-null float64
Longtitude       13580 non-null float64
Regionname       13580 non-null object
Propertycount    13580 non-null float64
dtypes: float64(12), int64(1), object(8)
memory u

In [21]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

target = df['Price']
predictors = df.drop(['Price'],axis=1)

# for the sake of simplicity, consider only the numeric predictors
only_numeric_predictors = predictors.select_dtypes(exclude=['object'])

##  Create a Function to Measure the Quality of an Approach


In [22]:
X_train, X_test, Y_train, Y_test = train_test_split(only_numeric_predictors, target, train_size=0.7,test_size=0.3, random_state=7)

def score_dataset(x_train, x_test, y_train, y_test):
    model = RandomForestRegressor()
    model.fit(x_train, y_train)
    pred = model.predict(x_test)
    return mean_absolute_error(y_test, pred)

##  Get Model Score from Dropping Columns with Missing Values

In [24]:
col_with_missings = [col for col in only_numeric_predictors.columns if predictors[col].isnull().any()]

reduced_x_train = X_train.drop(col_with_missings, axis=1)
reduced_x_test = X_test.drop(col_with_missings, axis=1)

print("Mean Absolute Error from dropping columns with Missing Values:")
print(score_dataset(reduced_x_train, reduced_x_test, Y_train, Y_test))

Mean Absolute Error from dropping columns with Missing Values:
191928.72203286813


##  Get Model Score from Imputation

In [27]:
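from sklearn.impute import SimpleImputer

# Imputer was removed from sklearn.preprocessing; SimpleImputer is its replacement
imputer = SimpleImputer()
# learn the column means from the training data only, then reuse them on the test data
imputed_x_train = imputer.fit_transform(X_train)
imputed_x_test = imputer.transform(X_test)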

print("Mean Absolute Error from Imputation:")
print(score_dataset(imputed_x_train, imputed_x_test, Y_train, Y_test))

Mean Absolute Error from Imputation:
194837.63279332352


## Get Score from Imputation with Extra Columns Showing What Was Imputed

In [36]:
imputed_x_train_plus = X_train.copy()
imputed_x_test_plus = X_test.copy()

# columns with missing values (a list, so it can be iterated more than once)
col_with_missings = [col for col in only_numeric_predictors.columns if only_numeric_predictors[col].isnull().any()]
for col in col_with_missings:
    imputed_x_train_plus[col+'_was_missing'] = X_train[col].isnull()
    imputed_x_test_plus[col+'_was_missing'] = X_test[col].isnull()

from sklearn.impute import SimpleImputer
imputer = SimpleImputer()
imputed_x_train_plus = imputer.fit_transform(imputed_x_train_plus)
# transform only: reuse the statistics learned from the training data
imputed_x_test_plus = imputer.transform(imputed_x_test_plus)

print("Mean Absolute Error from Imputation while Tracking What Was Imputed:")
print(score_dataset(imputed_x_train_plus, imputed_x_test_plus, Y_train, Y_test))

Mean Absolute Error from Imputation while Tracking What Was Imputed:
192147.56810669287
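
As an alternative to building the `_was_missing` columns by hand, scikit-learn's `SimpleImputer` accepts an `add_indicator=True` argument that appends the indicator columns itself. The sketch below uses hypothetical toy frames (`train`, `test`) rather than the Melbourne data, so it says nothing about the scores above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy split; values are illustrative only
train = pd.DataFrame({"Car": [1.0, np.nan, 2.0], "Landsize": [150.0, 200.0, 250.0]})
test = pd.DataFrame({"Car": [np.nan, 3.0], "Landsize": [300.0, 100.0]})

# add_indicator=True appends one 0/1 column per feature that had
# missing values when the imputer was fitted (here only "Car")
imputer = SimpleImputer(strategy="mean", add_indicator=True)
train_out = imputer.fit_transform(train)  # learn the means from the training data
test_out = imputer.transform(test)        # reuse those means on the test data

print(train_out.shape, test_out.shape)
```

The result is a NumPy array with the imputed features first and the indicator columns after them.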


### Learning/Takeaways
One of the many nice things about imputation is that it can be included in a scikit-learn Pipeline. ___Pipelines simplify model building, model validation and model deployment.___
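
A minimal sketch of that idea, assuming toy data in place of the Melbourne dataset: `X`, `y`, and the model settings (`n_estimators=10`, `random_state=0`) are illustrative choices, not taken from the analysis above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; values are illustrative only
X = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 2, 5],
    "Car": [1.0, np.nan, 2.0, 1.0, np.nan, 2.0],
})
y = pd.Series([500.0, 650.0, 800.0, 640.0, 480.0, 950.0])

# Imputation and the model are bundled into a single estimator
pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestRegressor(n_estimators=10, random_state=0),
)

# fit() fits the imputer and then the model; predict() applies the
# already-fitted imputer before predicting
pipeline.fit(X, y)
preds = pipeline.predict(X)
print(preds.shape)
```

Because the imputer is fitted inside the pipeline, it only ever learns its statistics from whatever data is passed to `fit`, which helps avoid refitting it on test data by accident.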