### In this notebook, we will learn how to handle missing data and compare methods for dealing with missing values.

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

In [2]:
file_path = 'C:/Users/hp/Desktop/MACHINE LEARNING/MELBOURNE-HOUSING/train.csv'
home_data = pd.read_csv(file_path)
y = home_data.Price

In [5]:
melb_data = home_data.drop(['Price'],axis=1)

In [8]:
X = melb_data.select_dtypes(exclude=['object'])

In [11]:
train_X, test_X, train_y, test_y = train_test_split(X,y,train_size=0.8,test_size=0.2, random_state=0)

In [12]:
def score_dataset(train_X, test_X, train_y, test_y):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(train_X, train_y)
    pred = model.predict(test_X)
    return mean_absolute_error(test_y,pred)
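
For reference, the MAE returned by `score_dataset` is the average absolute difference between the predictions and the true values. A minimal sketch with hand-checkable numbers (the toy values are hypothetical, not taken from the housing data):

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical toy values, not taken from the housing data
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

# The absolute errors are 10, 10 and 30; MAE is their mean
mae = mean_absolute_error(actual, predicted)

# The same quantity computed by hand
manual = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

print(mae, manual)
```

A lower MAE means the predictions are, on average, closer to the true prices.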

#### Now we will try different approaches to dealing with the missing values and compare the resulting MAE (mean absolute error) for each

#### Dropping columns with missing values :
Drop every column that has at least one missing cell.<br>
The steps are: <br>
 - Get the names of the columns with missing values
 - Drop those columns from both the training and the testing datasets
 - Pass the reduced datasets to the scoring function to get the MAE
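
The steps above can be sketched on a tiny frame before applying them to the real data. The toy frames and their column values are hypothetical and only illustrate the mechanics:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy_train = pd.DataFrame({"Rooms": [2, 3, 4], "Car": [1.0, np.nan, 2.0]})
toy_test = pd.DataFrame({"Rooms": [3, 2], "Car": [1.0, 1.0]})

# Decide which columns to drop by looking at the training data
cols_with_missing = [c for c in toy_train.columns if toy_train[c].isnull().any()]

# Drop the same columns from both sets so they keep identical features
reduced_toy_train = toy_train.drop(columns=cols_with_missing)
reduced_toy_test = toy_test.drop(columns=cols_with_missing)

print(cols_with_missing, list(reduced_toy_test.columns))
```

Note that `Car` is dropped from the test frame as well, even though it has no missing values there, because the model needs the same columns in both sets.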

In [13]:
missing_columns = [col for col in train_X.columns if train_X[col].isnull().any()]
missing_columns

['Car', 'BuildingArea', 'YearBuilt']

In [14]:
reduced_train_X = train_X.drop(missing_columns,axis=1)
reduced_test_X = test_X.drop(missing_columns,axis=1)

In [15]:
print('MAE from dropping the columns with missing data:    ')
print(score_dataset(reduced_train_X, reduced_test_X, train_y, test_y))

MAE from dropping the columns with missing data:    
183550.22137772635


#### Using the method of imputation:
Imputation replaces each missing value with a substitute, for example the mean of the other values in the column. This avoids deleting the entire column.<br>
The steps are: <br>
 - Import the SimpleImputer class from the sklearn.impute module and create an instance (by default it fills with the column mean)
 - Fit the imputer on the training data, then transform both the training and the testing datasets
 - Imputation returns an array without the column names, so put them back
 - Pass the imputed datasets to the scoring function to get the MAE
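
As a small illustration of these steps, the sketch below imputes a toy frame. The frames and values are hypothetical; passing `columns=` and `index=` to `pd.DataFrame` is one possible way to restore the labels in a single step:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; the values are illustrative only
toy_train = pd.DataFrame({"Landsize": [100.0, 300.0, np.nan],
                          "Rooms": [2.0, 3.0, 4.0]})
toy_test = pd.DataFrame({"Landsize": [np.nan, 500.0],
                         "Rooms": [1.0, 2.0]})

imputer = SimpleImputer()  # default strategy is "mean"

# fit_transform learns the column means from the training data only
filled_train = pd.DataFrame(imputer.fit_transform(toy_train),
                            columns=toy_train.columns, index=toy_train.index)

# transform reuses the training means; it does not refit on the test data
filled_test = pd.DataFrame(imputer.transform(toy_test),
                           columns=toy_test.columns, index=toy_test.index)

print(filled_train)
print(filled_test)
```

The missing `Landsize` in the test frame is filled with 200.0, the mean of the two known training values, rather than with anything computed from the test data.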

In [16]:
from sklearn.impute import SimpleImputer
myimputer = SimpleImputer()

In [22]:
imputed_train_X = pd.DataFrame(myimputer.fit_transform(train_X))
imputed_train_X.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,1.0,5.0,3182.0,1.0,1.0,1.0,0.0,153.764119,1940.0,-37.85984,144.9867,13240.0
1,2.0,8.0,3016.0,2.0,2.0,1.0,193.0,153.764119,1964.839866,-37.858,144.9005,6380.0
2,3.0,12.6,3020.0,3.0,1.0,1.0,555.0,153.764119,1964.839866,-37.7988,144.822,3755.0


In [23]:
imputed_test_X = pd.DataFrame(myimputer.transform(test_X))
imputed_test_X.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,4.0,8.0,3016.0,4.0,2.0,2.0,450.0,190.0,1910.0,-37.861,144.8985,6380.0
1,2.0,6.6,3011.0,2.0,1.0,0.0,172.0,81.0,1900.0,-37.81,144.8896,2417.0
2,3.0,10.5,3020.0,3.0,1.0,1.0,581.0,153.764119,1964.839866,-37.7674,144.82421,4217.0


In [24]:
imputed_train_X.columns = train_X.columns
imputed_test_X.columns = test_X.columns
imputed_test_X.head(3)

Unnamed: 0,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
0,4.0,8.0,3016.0,4.0,2.0,2.0,450.0,190.0,1910.0,-37.861,144.8985,6380.0
1,2.0,6.6,3011.0,2.0,1.0,0.0,172.0,81.0,1900.0,-37.81,144.8896,2417.0
2,3.0,10.5,3020.0,3.0,1.0,1.0,581.0,153.764119,1964.839866,-37.7674,144.82421,4217.0


In [25]:
print('MAE value with the imputation approach is:  ')
print(score_dataset(imputed_train_X,imputed_test_X,train_y,test_y))

MAE value with the imputation approach is:  
178166.46269899711


So, imputation (Approach 2) gives a lower MAE than dropping the columns (Approach 1): about 178,166 versus 183,550. Approach 2 therefore performed better on this dataset.

#### Extension to the imputation
The third approach also imputes, but it first adds a new column for each column with missing values, marking the cells where imputation will take place. This keeps track of which values were imputed. <br>
The steps are:<br>
 - Make a copy of the original training and testing data
 - For each column with missing values, add a boolean column indicating which entries are missing
 - Then impute in the same way as before

In [26]:
train_X_plus = train_X.copy()
test_X_plus = test_X.copy()

In [28]:
missing_columns = [col for col in train_X.columns if train_X[col].isnull().any()]
for col in missing_columns:
    train_X_plus[col + '_missing'] = train_X_plus[col].isnull()
    test_X_plus[col + '_missing'] = test_X_plus[col].isnull()

In [30]:
train_X_plus.head(5)

Unnamed: 0,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount,Car_missing,BuildingArea_missing,YearBuilt_missing
12167,1,5.0,3182.0,1.0,1.0,1.0,0.0,,1940.0,-37.85984,144.9867,13240.0,False,True,False
6524,2,8.0,3016.0,2.0,2.0,1.0,193.0,,,-37.858,144.9005,6380.0,False,True,True
8413,3,12.6,3020.0,3.0,1.0,1.0,555.0,,,-37.7988,144.822,3755.0,False,True,True
2919,3,13.0,3046.0,3.0,1.0,1.0,265.0,,1995.0,-37.7083,144.9158,8870.0,False,True,False
6043,3,13.3,3020.0,3.0,1.0,2.0,673.0,673.0,1970.0,-37.7623,144.8272,4217.0,False,False,False


In [32]:
myimputer = SimpleImputer()
imputed_train_X = pd.DataFrame(myimputer.fit_transform(train_X_plus))
imputed_test_X = pd.DataFrame(myimputer.transform(test_X_plus))


imputed_train_X.columns = train_X_plus.columns
imputed_test_X.columns = test_X_plus.columns
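
As an aside, `SimpleImputer` can also append the indicator columns itself through its `add_indicator` parameter. This is an alternative to the manual loop used in this notebook, not what the notebook runs. A minimal sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; only "Car" has a missing value
toy = pd.DataFrame({"Car": [1.0, np.nan, 3.0],
                    "Rooms": [2.0, 3.0, 4.0]})

# add_indicator=True appends one indicator column per feature
# that had missing values when the imputer was fitted
indicator_imputer = SimpleImputer(add_indicator=True)
out = indicator_imputer.fit_transform(toy)

# Columns of `out`: imputed Car, Rooms, then the indicator for Car
print(out.shape)
print(out)
```

The result is a plain array, so the column names would still need to be restored before inspecting it as a DataFrame.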

In [33]:
print('MAE value with the extension to imputation method is:   ')
print(score_dataset(imputed_train_X, imputed_test_X, train_y, test_y))

MAE value with the extension to imputation method is:   
178927.503183954


#### So, why did imputation perform better than dropping the columns?
The training data has 10864 rows and 12 columns, where three columns contain missing data. For each column, less than half of the entries are missing. Thus, dropping the columns removes a lot of useful information, and so it makes sense that imputation would perform better.
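
One way to check a claim like "less than half of the entries are missing" is to compute the fraction of missing values per column. The sketch below uses a hypothetical toy frame; on the real data the same expression could be applied to `train_X`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({"BuildingArea": [np.nan, 120.0, np.nan, 90.0, 150.0],
                    "Rooms": [2, 3, 3, 4, 1]})

# isnull() gives booleans; their mean is the fraction of missing entries
missing_fraction = toy.isnull().mean()

# Keep only the columns that actually have missing values
print(missing_fraction[missing_fraction > 0])
```

Here `BuildingArea` is missing in 2 of 5 rows, a fraction of 0.4, so most of the column would still carry usable information after imputation.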