<a href="https://colab.research.google.com/github/mhax100/Dubstech-Data-Science-Workshops/blob/master/Copy_of_Feature_Engineering_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FEATURE ENGINEERING PROJECT:
We apply feature engineering to the California housing dataset and build an ML model that predicts the median house value.

## STEP 1: IMPORT PACKAGES & LIBRARIES

Importing the packages and libraries we require for our feature engineering project.

In [0]:
import numpy as np
import pandas as pd
import math
import tensorflow as tf
from IPython.core import display as ICD

## STEP 2: READ IN DATA
We read the data into a data frame and use the .head() function to preview the first rows.

In [0]:
# WE STORE OUR DATA INTO A DATA FRAME CALLED cal_df ON WHICH WE WILL OPERATE
cal_df = pd.read_csv("sample_data/california_housing_train.csv", sep=",")
print('Original Dataset:')
# WE WILL DISPLAY OUR DATASET USING THE DISPLAY PACKAGE AND HEAD FUNCTION
ICD.display(cal_df.head(15))

Original Dataset:


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0
5,-114.58,33.63,29.0,1387.0,236.0,671.0,239.0,3.3438,74000.0
6,-114.58,33.61,25.0,2907.0,680.0,1841.0,633.0,2.6768,82400.0
7,-114.59,34.83,41.0,812.0,168.0,375.0,158.0,1.7083,48500.0
8,-114.59,33.61,34.0,4789.0,1175.0,3134.0,1056.0,2.1782,58400.0
9,-114.6,34.83,46.0,1497.0,309.0,787.0,271.0,2.1908,48100.0


## STEP 3: INVESTIGATE DATASET
Find out how many rows and columns the dataset has, count the null/missing values in each column, and drop any rows that contain them.

In [0]:
# WE WILL STORE THE NUMBER OF NULL VALUES FOR EACH COLUMN IN A DATA FRAME CALLED 'a'
a = pd.DataFrame(cal_df.isnull().sum())

# WE STORE THE NUMBER OF MISSING/NULL VALUES CORRESPONDING TO THE COLUMN NAME IN
# A SEPARATE COLUMN CALLED '# of null values'
a['# of null values'] = a[0]

# WE TAKE THE COLUMNS OF THIS DATAFRAME AND STORE IT IN ANOTHER DATAFRAME CALLED 'b'
# in order to display
b = a[['# of null values']]

# WE PRINT OUT THE NUMBER OF NULL VALUES FOR EACH COLUMN BEFORE WE DROP THE VALUES
# AND DISPLAY IT USING OUR DATAFRAME 'b'. WE ALSO SEE HOW MANY ROWS AND COLUMNS OUR
# DATASET HAS BEFORE DROPPING MISSING VALUES USING THE .shape function.
print('Before Dropping Null Values:')
print('# of Rows, Columns: ',cal_df.shape)
ICD.display(b)

# THIS IS A RELATIVELY CLEAN DATASET WITH 0 MISSING/NULL VALUES. HOWEVER,
# FOR PRACTICE WE HAVE GONE AHEAD AND WRITTEN THE CODE FOR DROPPING NULL VALUES IN CASE
# ANY HAD SHOWN UP.
cal_df = cal_df.dropna(axis=0)

# WE REPEAT THE SAME PROCESS OF MAKING A DATAFRAME TO DISPLAY HOW MANY NULL VALUES EACH COLUMN HAS.
# THIS TIME, ALL THESE VALUES SHOULD BE ZERO (IF THEY WEREN'T ZERO ALREADY).
# WE ONCE AGAIN DISPLAY THE DATAFRAME WITH OUR NULL VALUE COUNTS AND PRINT THE NUMBER
# OF ROWS AND COLUMNS WE ARE LEFT WITH (THIS SHOWS HOW MANY ROWS WE LOST BY DROPPING NULL VALUES).
# IN OUR CASE, SINCE THE DATASET ALREADY HAD NO MISSING VALUES, WE SHOULD RETAIN OUR
# ORIGINAL NUMBER OF ROWS AND COLUMNS
a = pd.DataFrame(cal_df.isnull().sum())
a['# of null values'] = a[0]
b = a[['# of null values']]
print('After Dropping Null Values:')
print('# of Rows, Columns: ',cal_df.shape)
ICD.display(b)

Before Dropping Null Values:
# of Rows, Columns:  (17000, 9)


Unnamed: 0,# of null values
longitude,0
latitude,0
housing_median_age,0
total_rooms,0
total_bedrooms,0
population,0
households,0
median_income,0
median_house_value,0


After Dropping Null Values:
# of Rows, Columns:  (17000, 9)


Unnamed: 0,# of null values
longitude,0
latitude,0
housing_median_age,0
total_rooms,0
total_bedrooms,0
population,0
households,0
median_income,0
median_house_value,0
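
The null-count table above is built in three steps (`a`, the renamed column, then `b`). As a minimal sketch, the same report can likely be produced in one expression with `Series.to_frame`. The helper name `null_report` and the small `sample_df` (three rows copied from the output in Step 2) are assumptions made so the example runs on its own; in the notebook you would pass `cal_df` instead.

```python
import pandas as pd

# Stand-in for cal_df: three rows copied from the Step 2 output.
sample_df = pd.DataFrame({
    "longitude": [-114.31, -114.47, -114.56],
    "latitude": [34.19, 34.4, 33.69],
    "total_rooms": [5612.0, 7650.0, 720.0],
    "households": [472.0, 463.0, 117.0],
    "median_house_value": [66900.0, 80100.0, 85700.0],
})

def null_report(df):
    """Return a one-column DataFrame holding the null count of every column."""
    return df.isnull().sum().to_frame(name="# of null values")

report = null_report(sample_df)
print("# of Rows, Columns: ", sample_df.shape)
print(report)
```

Wrapping the logic in a function also avoids repeating the same three lines before and after `dropna`.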


## STEP 4: FEATURE ENGINEERING
We clean the features that need cleaning and add our own features that we expect to improve the prediction. We can also drop any features that appear less valuable to our model.

In [0]:
# WE WILL NOW MAKE NEW FEATURES THAT WE EXPECT TO IMPROVE THE PERFORMANCE OF OUR MODEL.
# EACH NEW COLUMN IS COMPUTED FROM EXISTING COLUMNS IN OUR DATAFRAME AND ADDED TO IT,
# HERE BY DIVIDING THE TOTALS BY THE NUMBER OF HOUSEHOLDS.

cal_df['num_rooms'] = cal_df['total_rooms'] / cal_df['households']
cal_df['num_bedrooms'] = cal_df['total_bedrooms'] / cal_df['households']
cal_df['persons_per_house'] = cal_df['population'] / cal_df['households']
cal_df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,num_rooms,num_bedrooms,persons_per_house
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,11.889831,2.71822,2.150424
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0,16.522678,4.105832,2.438445
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,6.153846,1.487179,2.846154
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,6.641593,1.49115,2.278761
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0,5.549618,1.244275,2.381679
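
The ratio features above divide by `households`, so a row with zero households would produce an infinite value that a regression cannot use. The sketch below shows one possible guard. The helper `safe_ratio` is a hypothetical name, and the third row's `households` value of 0.0 is invented purely to trigger the edge case (the first two rows come from the Step 2 output).

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    "total_rooms": [5612.0, 7650.0, 720.0],
    "households": [472.0, 463.0, 0.0],  # 0.0 is invented to show the edge case
})

def safe_ratio(numerator, denominator):
    """Divide two columns and turn infinite results into NaN."""
    ratio = numerator / denominator
    return ratio.replace([np.inf, -np.inf], np.nan)

sample_df["num_rooms"] = safe_ratio(sample_df["total_rooms"], sample_df["households"])
print(sample_df)
```

Rows that end up as NaN can then be removed with the same `dropna` call used in Step 3.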


# STEP 4.5: MAKE YOUR OWN FEATURE

In [0]:
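# EXPLORE: TRY BUILDING YOUR OWN FEATURES FROM THE EXISTING COLUMNS. SOME EXAMPLES:
cal_df['persons_per_bedroom'] = cal_df['population'] / cal_df['total_bedrooms']
cal_df['rooms_per_person'] = cal_df['total_rooms'] / cal_df['population']
# NOTE: THE NEXT TWO FEATURES ARE COMPUTED FROM median_house_value, THE VALUE WE WANT
# TO PREDICT, SO THEY MUST NOT BE USED AS PREDICTORS IN THE MODEL.
cal_df['room_value'] = cal_df['median_house_value'] / cal_df['num_rooms']
cal_df['bedroom_value'] = cal_df['median_house_value'] / cal_df['num_bedrooms']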
cal_df.head()


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,num_rooms,num_bedrooms,persons_per_house,persons_per_bedroom,room_per_person,rooms_per_person
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0,11.889831,2.71822,2.150424,0.791115,5.529064,5.529064
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0,16.522678,4.105832,2.438445,0.593898,6.775908,6.775908
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0,6.153846,1.487179,2.846154,1.913793,2.162162,2.162162
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0,6.641593,1.49115,2.278761,1.52819,2.914563,2.914563
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0,5.549618,1.244275,2.381679,1.91411,2.330128,2.330128
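
Two of the columns created in this step, `room_value` and `bedroom_value`, are computed from `median_house_value`, which is the value the model is meant to predict. The formula in Step 5 leaves them out. One way to make that exclusion explicit is sketched below, where the helper `predictor_columns`, the list `TARGET_DERIVED`, and the two-row `sample_df` are all assumptions for illustration.

```python
import pandas as pd

TARGET = "median_house_value"
# Hypothetical list: columns that were computed from the target.
TARGET_DERIVED = ["room_value", "bedroom_value"]

def predictor_columns(df, target=TARGET, target_derived=TARGET_DERIVED):
    """Return the column names that are safe to use as predictors."""
    excluded = {target, *target_derived}
    return [col for col in df.columns if col not in excluded]

# Two rows taken from the notebook output above.
sample_df = pd.DataFrame({
    "median_income": [1.4936, 1.82],
    "num_rooms": [11.889831, 16.522678],
    "median_house_value": [66900.0, 80100.0],
})
sample_df["room_value"] = sample_df["median_house_value"] / sample_df["num_rooms"]

predictors = predictor_columns(sample_df)
formula = TARGET + " ~ " + " + ".join(predictors)
print(formula)
```

The resulting string has the same shape as the formulas passed to `smf.ols` in Step 5.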


# STEP 5: TRAINING THE MACHINE LEARNING MODEL
## TYPE 1: WITHOUT FEATURE ENGINEERING

In [0]:
# TRAIN MODEL WITH ALL THE ORIGINAL COLUMNS
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split
from math import sqrt
from sklearn.metrics import mean_squared_error

# HOLD OUT 20% OF THE ROWS FOR VALIDATION; random_state MAKES THE SPLIT REPRODUCIBLE
train, validate = train_test_split(cal_df, test_size=0.2, random_state=0)
model = smf.ols(formula="median_house_value ~ housing_median_age + total_rooms + total_bedrooms + population + households + median_income + latitude + longitude", data=train).fit()
# print IS NEEDED BECAUSE ONLY THE LAST EXPRESSION OF A CELL IS DISPLAYED AUTOMATICALLY
print(model.summary())

# PREDICT ON THE HELD-OUT ROWS; THE RMSE IS COMPUTED IN STEP 6
predicted_validation = model.predict(validate)



## TYPE 2: WITH FEATURE ENGINEERING

In [0]:
# TRAIN MODEL WITH FEATURE ENGINEERING
# THE SAME test_size AND random_state GIVE THE SAME SPLIT AS IN TYPE 1, SO THE TWO MODELS ARE COMPARABLE
tr, te = train_test_split(cal_df, test_size=0.2, random_state=0)
new_model = smf.ols(formula="median_house_value ~ num_rooms + num_bedrooms + persons_per_house + persons_per_bedroom + rooms_per_person + housing_median_age + median_income + latitude + longitude + total_rooms + total_bedrooms + population + households", data=tr).fit()
# print IS NEEDED BECAUSE ONLY THE LAST EXPRESSION OF A CELL IS DISPLAYED AUTOMATICALLY
print(new_model.summary())

new_predicted_validation = new_model.predict(te)

# STEP 6: RESULTS OF THE MACHINE LEARNING MODEL

In [0]:
# WITHOUT FEATURE ENGINEERING
rmse = sqrt(mean_squared_error(validate["median_house_value"], predicted_validation))
rmse
 

68833.25637469959
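
The value above is the root mean squared error: the square root of the average squared difference between the actual and predicted house values, in the same units as `median_house_value`. As a rough sketch of what `sqrt(mean_squared_error(...))` computes, the hypothetical helper `rmse_manual` below does the same calculation with NumPy. The actual values are the first three `median_house_value` rows from Step 2, while the predictions are invented for illustration.

```python
import numpy as np

def rmse_manual(actual, predicted):
    """Square root of the mean squared difference between two sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

toy_actual = [66900.0, 80100.0, 85700.0]     # from the Step 2 output
toy_predicted = [70000.0, 78000.0, 90000.0]  # invented predictions
print(rmse_manual(toy_actual, toy_predicted))
```

A lower RMSE means the predictions are, on average, closer to the true values.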

In [0]:
# WITH FEATURE ENGINEERING
new_rmse = sqrt(mean_squared_error(te["median_house_value"], new_predicted_validation))
new_rmse

66898.30711792968
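
To put the two results side by side, the short sketch below computes the absolute and relative reduction in RMSE from the two values printed above. The variable names `absolute_gain` and `relative_gain` are introduced here for illustration only.

```python
# RMSE values copied from the two outputs above.
rmse = 68833.25637469959      # without feature engineering
new_rmse = 66898.30711792968  # with feature engineering

absolute_gain = rmse - new_rmse
relative_gain = absolute_gain / rmse

print(f"RMSE reduction: {absolute_gain:.2f} ({relative_gain:.2%})")
# prints: RMSE reduction: 1934.95 (2.81%)
```

On this validation split, the engineered features lower the error by roughly 2.8%.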