## Deep Neural Network with Keras for Regression

- In this notebook we will apply a deep neural network (DNN) to one of Kaggle's datasets, [Predict Hourly Wage](https://www.kaggle.com/c/predict-hourly-wage/).
- After getting predictions from our neural net, we will submit them to the Kaggle competition and check our score.

**NN basics**

- Deep learning is an increasingly popular branch of machine learning.
- A neural network (NN) with more than one hidden layer is called a **Deep Neural Network (DNN)**.
- An NN takes an input, processes it through hidden layers using weights, and produces an output/prediction.
- An NN can learn useful combinations of the input features on its own, so manual feature selection matters less than with many classical models (though it can still help).
- During training, forward propagation computes a prediction, and backpropagation adjusts the weights to bring that prediction closer to the target value.
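
The weight-adjustment idea can be illustrated with a minimal sketch: a single weight fitted by gradient descent on a mean squared error loss. The toy data, the learning rate and the number of steps are assumptions chosen for illustration; a real network repeats the same forward/backward/update cycle for many weights at once.

```python
# Toy data (hypothetical): the target is exactly 2 * x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0               # initial weight
learning_rate = 0.01  # assumed step size

for step in range(200):
    # Forward pass: compute predictions with the current weight
    preds = [w * x for x in xs]
    # Backward pass: gradient of the mean squared error with respect to w
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # Update: move the weight against the gradient
    w -= learning_rate * grad

print(round(w, 3))
```

After enough steps the weight settles very close to 2, the value that generated the toy targets.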

- In this notebook we will build a DNN for a regression problem.
- We will predict the hourly wage of an employee.
- The dataset can be downloaded from [here](https://www.kaggle.com/c/predict-hourly-wage/data).

**Content:**

1. Load data
2. Data pre-processing
3. Prepare feature and target variable
4. Build DNN model
5. Compile the model
6. Train the model
7. Predict on new data
8. Submit the prediction to Kaggle for evaluation
9. Additional: Increasing model capacity with additional layers and nodes

**1. Load data using Pandas**

In [41]:
# Import pandas and numpy for file handling and numeric algebra respectively
import pandas as pd
import numpy as np

In [42]:
# Load the data
data=pd.read_csv('Income_training.csv')

In [43]:
# Check column types and non-null counts using the pandas 'info' method
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3197 entries, 0 to 3196
Data columns (total 4 columns):
compositeHourlyWages    3197 non-null float64
age                     3197 non-null int64
yearsEducation          3197 non-null int64
sex1M0F                 3197 non-null int64
dtypes: float64(1), int64(3)
memory usage: 100.0 KB


In [44]:
# Check 1st two rows using pandas 'head' method
data.head(2)

   compositeHourlyWages  age  yearsEducation  sex1M0F
0                 21.38   58              10        1
1                 25.15   42              16        1


**2. Data pre-processing**

- This dataset is already clean (i.e. no missing values and no categorical values), so no pre-processing is strictly required.
- However, we can standardize the data, since wage, age and years of education are measured in different units and on different scales. [try it]
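
A minimal sketch of the optional standardization step, using plain NumPy on a few toy feature rows (age, yearsEducation, sex1M0F). The array names and values are illustrative assumptions, not part of the notebook's pipeline. The point to note is that the mean and standard deviation are computed on the training features only and then reused for the test features.

```python
import numpy as np

# Toy feature rows (age, yearsEducation, sex1M0F); illustrative values only
X_train_toy = np.array([[58, 10, 1],
                        [42, 16, 1],
                        [36, 20, 0],
                        [24, 10, 0]], dtype=float)
X_test_toy = np.array([[39, 12, 1],
                       [50, 12, 0]], dtype=float)

# Column-wise statistics computed from the training rows only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)

# Apply the same transformation to both sets
X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std

print(X_train_scaled.round(2))
```

If you try this on the real data, the same `mean` and `std` would have to be applied to `Income_testing.csv` before calling `predict()`.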

**3. Prepare feature and target variable**

In [45]:
# Prepare feature variable i.e. all features except target i.e. compositeHourlyWages
X_train=data.drop(columns=['compositeHourlyWages'])
# Prepare target variable i.e. compositeHourlyWages
y_train=data['compositeHourlyWages']

In [46]:
# Check feature data
X_train.head(2)

   age  yearsEducation  sex1M0F
0   58              10        1
1   42              16        1


In [47]:
# Check target data
y_train.head()

0    21.38
1    25.15
2     8.57
3    12.07
4    10.97
Name: compositeHourlyWages, dtype: float64

**4. Build DNN model**
- For basic information on Keras models, refer to the Keras API notebook [here](https://github.com/jay-pm/Deep-Learning/tree/master/The%20Keras%20API).

In [48]:
# Import Sequential and Dense from Keras

from keras.models import Sequential
from keras.layers import Dense

In [49]:
# create the model
model=Sequential()
# get the number of columns of training data
n_cols=X_train.shape[1] # X_train.shape >> (3197, 3)

- We will use the ‘add()’ function to add layers to our model. We will add two hidden layers and an output layer.
- Dense is the standard fully connected layer: every node in the previous layer connects to every node in the current layer.
- We use 6 nodes in each of our hidden layers.
- The number of nodes in a hidden layer can be in the hundreds or thousands.
- Increasing the number of nodes in a layer increases model capacity, but at a higher computational cost.
- The activation function adds non-linearity to the model.
- We will use the ReLU activation function, which works well in practice for many neural networks.
- The input shape specifies the shape of a single input sample. The number of columns in our input is stored in ‘n_cols’, and the trailing comma in ‘(n_cols,)’ simply makes it a one-element tuple. The number of rows is not part of the input shape, so the model accepts any number of rows.
- The last layer is the output layer. It has only one node, which holds our prediction.

In [50]:
# 1st hidden layer
model.add(Dense(6,activation='relu', input_shape=(n_cols,)))

# 2nd hidden layer
model.add(Dense(6, activation='relu'))

# output layer
model.add(Dense(1))
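
To make concrete what these three layers compute, here is a NumPy sketch of the forward pass for a 3-6-6-1 network. The weights are random placeholders (an assumption for illustration, not the weights Keras will learn), and the helper names `relu` and `forward` are hypothetical. The sketch also counts the trainable parameters of this architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cols = 3  # age, yearsEducation, sex1M0F

# Random weights and zero biases, shaped like the three Dense layers
W1, b1 = rng.normal(size=(n_cols, 6)), np.zeros(6)
W2, b2 = rng.normal(size=(6, 6)), np.zeros(6)
W3, b3 = rng.normal(size=(6, 1)), np.zeros(1)

def relu(z):
    # ReLU keeps positive values and replaces negative values with 0
    return np.maximum(z, 0.0)

def forward(X):
    h1 = relu(X @ W1 + b1)   # 1st hidden layer
    h2 = relu(h1 @ W2 + b2)  # 2nd hidden layer
    return h2 @ W3 + b3      # output layer, no activation (regression)

# The two feature rows shown earlier in the notebook
X = np.array([[58, 10, 1], [42, 16, 1]], dtype=float)
out = forward(X)

# (3*6 + 6) + (6*6 + 6) + (6*1 + 1) trainable parameters
n_params = sum(a.size for a in (W1, b1, W2, b2, W3, b3))
print(out.shape, n_params)
```

Each input row produces one output value, and the parameter count is small because each layer has only a few nodes.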

**5. Compile the model**

- Here we pass two arguments when compiling the model: optimizer and loss.
- The optimizer determines how the weights are updated during training, including the learning rate that is used.
- 'adam' is a good default optimizer and works well most of the time.
- The Adam optimizer adapts the effective step size for each weight throughout training.
- The learning rate determines how large each weight update is. A smaller learning rate may lead to more accurate weights (up to a certain point), but training will take longer.
- The loss function depends on the problem at hand. Mean squared error is a common loss function for regression and optimizes for predicting the mean, as in least squares regression. For classification, use binary_crossentropy [2 classes] or categorical_crossentropy [multiclass].
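
As a small worked example of the claim that mean squared error optimizes for predicting the mean, the sketch below scores a few constant guesses against the first five wages printed by `y_train.head()`. The helper `mse` and the alternative guesses (10.0 and 20.0) are assumptions made for illustration.

```python
def mse(y_true, y_pred):
    # Average of the squared differences between targets and predictions
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# First five targets, as printed by y_train.head()
wages = [21.38, 25.15, 8.57, 12.07, 10.97]
mean_wage = sum(wages) / len(wages)

# Score three constant predictions: below the mean, the mean, above the mean
for guess in (10.0, mean_wage, 20.0):
    loss = mse(wages, [guess] * len(wages))
    print(round(guess, 3), round(loss, 3))
```

Among constant predictions, the mean of the targets gives the lowest mean squared error, which is why a model trained with this loss is pulled towards conditional means.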

In [51]:
# compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

**6. Train the model**

- For training the model, we will call the ‘fit()’ function on our model with the following five arguments: training data (X), target data (y), validation split, the number of epochs and callbacks.

- The validation split holds out part of the training data for validation. We set it to 0.2, so 20% of the training data we provide is set aside to measure model performance and is not used to update the weights. Note that Keras takes this fraction from the end of the data, before any shuffling, rather than sampling it at random.

- The number of epochs is the number of times the model will cycle through the data. The more epochs we run, the more the model will improve, up to a certain point. After that point, the model stops improving with each epoch. In addition, more epochs mean longer training time. To handle this, we will use ‘early stopping’.

- Early stopping halts training before the maximum number of epochs is reached if the model stops improving. By default it monitors the validation loss. We set its patience to 3, which means training stops after 3 consecutive epochs in which the monitored value does not improve.
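
The patience rule can be sketched in plain Python. This is a simplified illustration of the idea, not Keras's implementation: the function name `early_stopping_epoch` and the validation losses are hypothetical, and options such as a minimum improvement threshold are ignored.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # Patience was never exhausted: all epochs run
    return len(val_losses)

# Hypothetical validation losses: the best value is reached at epoch 3
losses = [10.0, 9.0, 8.0, 8.5, 8.2, 8.1, 7.0]
print(early_stopping_epoch(losses, patience=3))
```

With these losses, epochs 4, 5 and 6 fail to beat the best value from epoch 3, so training stops at epoch 6 and the later improvement at epoch 7 is never seen. A larger patience trades longer training for a lower risk of stopping too early.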

In [52]:
from keras.callbacks import EarlyStopping

# train the model
model.fit(X_train,y_train,validation_split=.2,epochs=40,callbacks=[EarlyStopping(patience=3)])

Train on 2557 samples, validate on 640 samples
Epoch 1/40
...
Epoch 40/40


<keras.callbacks.History at 0xf14d908>
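
The sample counts in the first line of the training output follow from the validation split. The arithmetic below assumes the split point is obtained by truncating `n_samples * (1 - validation_split)` to an integer, which reproduces the numbers reported above.

```python
n_samples = 3197        # rows in Income_training.csv, from data.info()
validation_split = 0.2

# Assumed rule: truncate to an integer to get the training size
n_train = int(n_samples * (1 - validation_split))
n_val = n_samples - n_train

print(n_train, n_val)
```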

**7. Predict on new data**
- We will use the ‘predict()’ function, passing in our new data. The output will be hourly wage predictions.

In [53]:
# load the new data
data_new=pd.read_csv('Income_testing.csv') # this data does not contain the target (compositeHourlyWages), which we have to predict

In [54]:
# check the test data
data_new.head()

   ID  age  yearsEducation  sex1M0F
0   1   36              20        0
1   2   38              17        0
2   3   24              10        0
3   4   39              12        1
4   5   50              12        0


In [55]:
data_test=data_new.drop(columns='ID') # drop ID column as it is not in our training data
data_test.head(2)

   age  yearsEducation  sex1M0F
0   36              20        0
1   38              17        0


In [56]:
# predict on new data with already created model
y_predict=model.predict(data_test)

In [57]:
y_predict

array([[ 19.2833252 ],
       [ 17.7055378 ],
       [ 10.51393127],
       [ 15.98699665],
       [ 16.70557022],
       [ 16.53741455],
       [ 16.95337296],
       [ 17.46391296],
       [ 11.87750721],
       [ 17.45021057],
       [  9.72189045],
       [ 11.71563816],
       [ 13.06551266],
       [ 13.50887966],
       [ 14.07300091],
       [ 20.64690018],
       [ 18.8474865 ],
       [ 15.28630447],
       [ 13.45529842],
       [ 19.34431839],
       [ 15.46275043],
       [ 14.37625408],
       [ 15.4963398 ],
       [ 20.54603195],
       [ 14.82516575],
       [ 19.59964752],
       [ 18.15631676],
       [ 13.06551266],
       [  9.00960541],
       [ 14.33450603],
       [ 12.17888927],
       [ 18.31227875],
       [ 17.51749039],
       [ 13.7305069 ],
       [ 11.93108749],
       [ 15.04679012],
       [ 16.89360809],
       [ 14.7441721 ],
       [ 18.24348259],
       [ 16.71175194],
       [ 15.50386333],
       [ 18.24348259],
       [ 18.20236969],
       [ 11

In [58]:
y_predict.shape

(800, 1)
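
The predictions come back as a two-dimensional array with one column, because the output layer has a single node. A small NumPy sketch, using the first three predicted values from the output above, shows how such an array can be flattened to one dimension when a plain list of values is more convenient. The variable names are assumptions for illustration.

```python
import numpy as np

# First three predictions from the output above, kept in (n, 1) shape
y_predict_head = np.array([[19.2833252], [17.7055378], [10.51393127]])

# ravel() returns a flattened 1-D array with the same values
flat = y_predict_head.ravel()

print(y_predict_head.shape, flat.shape)
```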

**8. Submit the prediction to Kaggle for evaluation**

In [59]:
# write prediction data to data_new (ravel flattens the (800, 1) array to 1-D)
data_new['compositeHourlyWages']=y_predict.ravel()
# prepare submission file with only ID and predicted wages in column 'compositeHourlyWages'
data_new[['ID', 'compositeHourlyWages']].to_csv('HourlyWage_DNN.csv', index=False)

In [60]:
# check the submission file
submission=pd.read_csv('HourlyWage_DNN.csv')
print(submission.head())
print(submission.shape)

   ID  compositeHourlyWages
0   1             19.283325
1   2             17.705538
2   3             10.513931
3   4             15.986997
4   5             16.705570
(800, 2)


- Submit 'HourlyWage_DNN.csv' to Kaggle at https://www.kaggle.com/c/predict-hourly-wage/submit
- Our model's score is 6.47462.
<img src='Kaggle score_hourlyWageDNN.jpg'>
- The top score for this competition is 6.33222 [https://www.kaggle.com/c/predict-hourly-wage/leaderboard].
<img src='topScore.jpg'>

In [61]:
# save the model
model.save('HourlyWagePred_DNN.h5')