## Build a Regression Model in Keras Path: D

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>
      
1. <a href="#instructions">Instructions</a>   
2. <a href="#about_dataset">About the dataset</a>  
3. <a href="#prepare_data">Prepare the data</a>  
4. <a href="#helper_function">Helper function</a> 
5. <a href="#normalized_dataset">Normalized the data</a> <br>
6. <a href="#train_the_model">Train and test regression model</a> <br> 
7. <a href="#preport">Report Model performance</a>  <br>
</font>
</div>

<h2 id="instructions">Instructions</h2>
This assignment is to build a regression model in Keras with followingz:

<div>
    
- Use a normalized version of the data.
- Three hidden layers, each of 10 nodes and ReLU activation function.
- Use the adam optimizer and the mean squared error as the loss function.

1.Randomly split the data into a training and test sets by holding 30% of the data for testing.

2.Train the model on the training data using 50 epochs.

3.Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength. 

4.Repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors.

5.Report the mean and the standard deviation of the mean squared errors compare to that from Step B?

</div>

import required libraries:

In [1]:
import pandas as pd
import numpy as np
import pylab as pl

<h2 id="about_dataset">About the dataset</h2>
<strong>The dataset is about the compressive strength of different samples of concrete based on the volumes of the different predictors that were used to make them. The predictors include:</strong>

<strong>1. Cement</strong><br>
<strong>2. Blast Furnace Slag</strong><br>
<strong>3. Fly Ash</strong><br>
<strong>4. Water</strong><br>
<strong>5. Superplasticizer</strong><br>
<strong>6. Coarse Aggregate</strong><br>
<strong>7. Fine Aggregate</strong><br>
the data can be found here again: https://cocl.us/concrete_data.

Let's download the data and read it into a pandas dataframe.

<h2 id="prepare_data">Prepare the data</h2>

In [2]:
concrete_data = pd.read_csv('https://cocl.us/concrete_data/concrete_data.csv')
concrete_data.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


#### Let's check how many data points we have.

In [3]:
concrete_data.shape

(1030, 9)

There are 1030 samples of dataset, it quite a few samples, we have to be careful not to overfit the training data.


In [4]:
concrete_data.describe()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


Let's check the dataset for any missing values.

In [5]:
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

Great!, our data looks very clean, next start to build our model.

Let's import the rest of the library that we will need to build our regressoin model.

In [6]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

Using TensorFlow backend.


<h2 id="helper_function">Helper function</h2>
define helper function to create and test regression model

In [7]:
# function to create train/stest dataset
def get_datatset(df, test_size, random):
    
    # get the column as feature
    feature_columns = concrete_data.columns
    
    # get the features
    X = df[feature_columns[feature_columns != 'Strength']] # all columns except Strength
    
    # get the label
    y = df['Strength'] # Strength column
    
    return train_test_split(X, y, test_size=test_size, random_state=random)

In [8]:
# function to verify datatset
def verify_dataset(X_trainset, X_testset, y_trainset, y_testset):
    print("++++++++++++++++++++++++++++++++++++++++++++++++")
    print("X_trainset shape : {}".format(X_trainset.shape))
    print("y_trainset shape : {}".format(y_trainset.shape))
    print("X_testset shape : {}".format(X_testset.shape))
    print("y_testset shape : {}".format(y_testset.shape))

In [9]:
# regression model with 
# Three hidden layers, each of 10 nodes and ReLU activation function.
# Use the adam optimizer and the mean squared error as the loss function.
def regression_model(input_shape):
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(input_shape,)))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mse'])
    return model

In [10]:
# function to test model
def test_regression_model(model, X_trainset, X_testset, y_trainset, y_testset, num_epochs):
    # number of input
    input_shape = X_trainset.shape[1] 
    
    # train model with trainginset
    model.fit(X_trainset, y_trainset, epochs=num_epochs, verbose=0)
    
    # prediction with testing set
    y_predict = model.predict_classes(X_testset)
    
    # get mean squared error
    return mean_squared_error(y_testset, y_predict)

<h2 id="path_b_normalized_dataset">Normalized the data</h2>
Normalize the data is by subtracting the mean from the individual predictors and dividing by the standard deviation.

In [11]:
concrete_data_norm = (concrete_data - concrete_data.mean()) / concrete_data.std()
concrete_data.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


<h2 id="train_the_model">Train and test regression model</h2>
We will build the neural network with the following instead:

- Three hidden layers, each of 10 nodes and ReLU activation function.


### step 1. Randomly split the data into a training and test sets by holding 30% of the data for testing.

In [12]:
test_size = 0.3 # 30%

### step 2. Train the model on the training data using 100 epochs.

In [13]:
num_epochs = 50

In [14]:
# experiment 50 times 
path_d_metrics = []

# iterate 50 times
for i in range(1, 50):
    print("++++++++++++++++++++++++++++++++++++++++++++++++")
    print("Step : {}".format(i))
    
    # get random training and testing data set with normalized data
    X_trainset, X_testset, y_trainset, y_testset = get_datatset(concrete_data_norm, test_size, i)
    
    # validate dataset
    verify_dataset(X_trainset, X_testset, y_trainset, y_testset)
    
    # get input shape
    input_shape = X_trainset.shape[1] 
    
    # build regression model
    model = regression_model(input_shape)
    
    # evaluation compute the mean squared
    mse = test_regression_model(model, X_trainset, X_testset, y_trainset, y_testset, num_epochs)
    
    path_d_metrics.append(mse)
    
    print("++++++++++++++++++++++++++++++++++++++++++++++++")
    print("Step : {} Mean squared error = {}".format(i, mse))

++++++++++++++++++++++++++++++++++++++++++++++++
Step : 1
++++++++++++++++++++++++++++++++++++++++++++++++
X_trainset shape : (721, 8)
y_trainset shape : (721,)
X_testset shape : (309, 8)
y_testset shape : (309,)
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Use tf.cast instead.
++++++++++++++++++++++++++++++++++++++++++++++++
Step : 1 Mean squared error = 0.6128969621383685
++++++++++++++++++++++++++++++++++++++++++++++++
Step : 2
++++++++++++++++++++++++++++++++++++++++++++++++
X_trainset shape : (721, 8)
y_trainset shape : (721,)
X_testset shape : (309, 8)
y_testset shape : (309,)
++++++++++++++++++++++++++++++++++++++++++++++++
Step : 2 Mean squared error = 0.6508372195659627
++++++++++++++++++++++++++++++++++++++++++++++++
Step : 3
++++++++++++++++++++++++++++++++++++++++++++++++
X_trainset shape : (721, 8)
y_trainset shape : (721,)
X_testset shape : (309, 8)
y_testset shape : (309,)
+++++++++++++++++++++++++++++++++++++++++++++

### step 3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength.

In [15]:
path_d_mean = np.mean(path_d_metrics)
path_d_std = np.std(path_d_metrics)

<h2 id="preport">Report Model performance</h2>
How does the mean of the mean squared errors compare to that from Step B?

In [16]:
print("++++++++++++++++++++++++++++++++++++++++++++++++")
print("mean of squared errors : {}".format(path_d_mean))
print("standard deviation of squared errors : {}".format(path_d_std))

++++++++++++++++++++++++++++++++++++++++++++++++
mean of squared errors : 0.6381609132986579
standard deviation of squared errors : 0.03936061580750752


### Compare to that from Path B

Path D mean of squared errors : 0.6381609132986579

Path D standard deviation of squared errors : 0.03936061580750752


Path B: mean of squared errors : 0.6654493296422676

Path B: standard deviation of squared errors : 0.040174342338809704

The result show step C better that step B ~4%
With increase number of epochs the MSE better abound 4%