

# IBM AI Engineering Professional Certificate

## Build a Regression Model in Keras

[Peter Buchanan](https://www.linkedin.com/in/buchananpeter/) - May 2020


### Table of Contents
* [1. Data Preparation](#1.-Data-Preparation)
    * [1.1. Download data into a Pandas dataframe](#1.1.-Download-data-into-a-Pandas-dataframe)
    * [1.2. Split data into predictors and target](#1.2.-Split-data-into-predictors-and-target)
* [2. Part A: Build a baseline model](#2.-Part-A:-Build-a-baseline-model)
* [3. Part B: Normalize the data](#3.-Part-B:-Normalize-the-data)
* [4. Part C: Increase the number of epochs to 100](#4.-Part-C:-Increase-the-number-of-epochs-to-100)
* [5. Part D: Increase the number of hidden layers](#5.-Part-D:-Increase-the-number-of-hidden-layers)

## 1. Data Preparation

<div class="alert alert-block alert-success">
<b>Import: </b>Search for named modules, and bind name to local scope
</div>

In [1]:
#!pip install keras
#!pip install tensorflow==2.0.0-rc0

# increase Jupyter cell width
from IPython.display import display, HTML
display(HTML("<style>.container {width:75% !important;}</style>"))
from IPython.display import clear_output

# add logging
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Import Pandas
import pandas as pd
from pandas.io.json import json_normalize
pd.set_option("display.width", 201)
pd.set_option("display.max_rows", 1001)
pd.set_option("display.max_columns", 1001)
pd.set_option('max_colwidth', 120)

# Import numerical and machine learning libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Import Keras
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
from keras.layers.convolutional import Conv2D # to add convolutional layers
from keras.layers.convolutional import MaxPooling2D # to add pooling layers
from keras.layers import Flatten # to flatten data for fully connected layers


Using TensorFlow backend.


<div class="alert alert-block alert-success">
<b>Constants: </b>Declare and assign value to a constant
</div>

In [2]:
DATA_FOLDER = 'data/'
CONCRETE_DATA_URL = 'https://cocl.us/concrete_data'

### 1.1. Download data into a Pandas dataframe
Compressive strength of concrete based on the quantity of each ingredient:
- Cement
- Blast Furnace Slag
- Fly Ash
- Water
- Superplasticizer
- Coarse Aggregate
- Fine Aggregate
  
Data can be downloaded here: <a href="https://cocl.us/concrete_data">Concrete Data</a>


In [3]:
try:
        
    # Import to DataFrame: concrete_data_df
    concrete_data_df = pd.read_csv(CONCRETE_DATA_URL)
     
except Exception:
    logger.error('Dataset: Read file into DataFrame: ', exc_info=True)
    
else:
    logger.info('Dataset: Read file into DataFrame: OK')    

INFO:__main__:Dataset: Read file into DataFrame: OK


In [4]:
print("concrete_data_df shape is " , concrete_data_df.shape)
concrete_data_df.head(10)

concrete_data_df shape is  (1030, 9)


| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.89 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.27 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.3 |
| 5 | 266.0 | 114.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 90 | 47.03 |
| 6 | 380.0 | 95.0 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 43.7 |
| 7 | 380.0 | 95.0 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 28 | 36.45 |
| 8 | 266.0 | 114.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 28 | 45.85 |
| 9 | 475.0 | 0.0 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 28 | 39.29 |


The first concrete sample contains 540 kg of cement, 0 kg of blast furnace slag, 0 kg of fly ash, 162 kg of water, 2.5 kg of superplasticizer, 1040 kg of coarse aggregate, and 676 kg of fine aggregate per cubic meter of mixture (the units used by the source dataset). At 28 days old, this mix has a compressive strength of 79.99 MPa.

In [5]:
concrete_data_df.dtypes

Cement                float64
Blast Furnace Slag    float64
Fly Ash               float64
Water                 float64
Superplasticizer      float64
Coarse Aggregate      float64
Fine Aggregate        float64
Age                     int64
Strength              float64
dtype: object

Check the dataset for any missing values, starting with summary statistics:

In [6]:
concrete_data_df.describe()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|---|---|---|---|---|---|---|---|---|---|
| count | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 |
| mean | 281.167864 | 73.895825 | 54.18835 | 181.567282 | 6.20466 | 972.918932 | 773.580485 | 45.662136 | 35.817961 |
| std | 104.506364 | 86.279342 | 63.997004 | 21.354219 | 5.973841 | 77.753954 | 80.17598 | 63.169912 | 16.705742 |
| min | 102.0 | 0.0 | 0.0 | 121.8 | 0.0 | 801.0 | 594.0 | 1.0 | 2.33 |
| 25% | 192.375 | 0.0 | 0.0 | 164.9 | 0.0 | 932.0 | 730.95 | 7.0 | 23.71 |
| 50% | 272.9 | 22.0 | 0.0 | 185.0 | 6.4 | 968.0 | 779.5 | 28.0 | 34.445 |
| 75% | 350.0 | 142.95 | 118.3 | 192.0 | 10.2 | 1029.4 | 824.0 | 56.0 | 46.135 |
| max | 540.0 | 359.4 | 200.1 | 247.0 | 32.2 | 1145.0 | 992.6 | 365.0 | 82.6 |


Every column reports a count of 1030, and the null counts below are all zero, so the data is clean and ready to be used in the model.

In [7]:
concrete_data_df.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

### 1.2. Split data into predictors and target
The target variable is the concrete sample strength. Therefore, predictors will be all the other columns.

In [8]:
concrete_data_columns = concrete_data_df.columns
predictors = concrete_data_df[concrete_data_columns[concrete_data_columns != 'Strength']]
target = concrete_data_df['Strength']
n_cols = predictors.shape[1]

In [9]:
predictors.head(5)

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 |


In [10]:
target.head(5)

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

## 2. Part A: Build a baseline model

Use the Keras library to build a neural network with the following:
- One hidden layer of 10 nodes, and a ReLU activation function
- The Adam optimizer and mean squared error as the loss function.
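
To get a feel for how small these networks are, the trainable parameters of a stack of fully connected layers can be counted by hand: each layer has one weight per input-output pair plus one bias per output. The sketch below assumes the eight predictor columns shown in section 1.2; `dense_param_count` is a hypothetical helper written for this illustration, not a Keras function. It covers both the one-hidden-layer baseline and the three-hidden-layer variant used in Part D.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    for n_outputs in layer_sizes:
        # One weight per input-output pair, plus one bias per output node
        total += n_inputs * n_outputs + n_outputs
        n_inputs = n_outputs
    return total


N_PREDICTORS = 8  # assumption: the eight predictor columns from section 1.2

# Hidden layers of 10 nodes followed by a single output node
one_hidden = dense_param_count(N_PREDICTORS, [10, 1])
three_hidden = dense_param_count(N_PREDICTORS, [10, 10, 10, 1])
print(one_hidden, three_hidden)
```

These totals should match the trainable parameter counts that Keras reports from `model.summary()` for the two models.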

In [11]:
# One hidden layer of 10 nodes, and a ReLU activation function
def regression_model_one_hidden_layer():
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model


# Three hidden layers of 10 nodes, and a ReLU activation function
def regression_model_three_hidden_layer():
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model


# Iterate named regression model
def iterate_regression_model(regression_model, iterations, epochs, predictor):
    mean_squared_error_list = []

    for i in range(iterations):
                
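        # random_state=0 fixes the split, so every iteration trains and tests on the same partition;
        # results vary between iterations only through the random weight initialization.
        # Drop random_state (or pass random_state=i) to draw a new random split each iteration.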
        X_train, X_test, y_train, y_test = train_test_split(predictor, target, test_size=0.3, random_state=0)

        # create regression model
        model = regression_model()

        # fit model, suppress logging
        model.fit(X_train, y_train, epochs=epochs, verbose=0)

        # test model and append list
        y_predict = model.predict(X_test)
        mean_squared_error_list.append(mean_squared_error(y_test, y_predict))
        print('Iteration {0:3} of {1:3}: Mean Squared Error: {2:.3f}   '.format(i+1, iterations, mean_squared_error_list[i]), end='\r',flush=True)
    
    # return mean and standard deviation of mean_squared_error_list
    return np.mean(mean_squared_error_list), np.std(mean_squared_error_list)


Repeat the following steps 50 times, appending each result to a list of mean squared errors.

- Randomly split the data into training and test sets, holding out 30% of the data for testing.<br>Use the train_test_split helper function from Scikit-learn.<br><br>
- Train the model on the training data using 50 epochs.<br><br>
- Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength.<br>You can use the mean_squared_error function from Scikit-learn.
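
Whether each repetition really sees a different split depends on how the random seed is handled. The sketch below is a standard-library illustration, not scikit-learn's implementation, and `split_indices` is a hypothetical helper. It suggests how a fixed seed reproduces the same partition on every repetition, while a seed tied to the repetition number yields a new one.

```python
import random


def split_indices(n_samples, test_fraction, seed):
    """Shuffle row indices with a seeded generator and cut off a test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    # Returns (train_indices, test_indices)
    return indices[n_test:], indices[:n_test]


N_SAMPLES = 1030  # number of rows in the concrete dataset

# Same seed on every repetition: the partition never changes
fixed = [split_indices(N_SAMPLES, 0.3, seed=0)[1] for _ in range(3)]

# Seed tied to the repetition number: a new partition each time
varied = [split_indices(N_SAMPLES, 0.3, seed=i)[1] for i in range(3)]

print(len(fixed[0]), fixed[0] == fixed[1], varied[0] == varied[1])
```

With train_test_split, the analogous choice is the random_state argument: a constant value repeats one split, while omitting it or varying it per repetition draws new splits.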

### Report the mean and the standard deviation of the mean squared errors
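
As a reminder of what is being reported, here is a minimal standard-library sketch of the two levels of aggregation: a mean squared error per repetition, then the mean and standard deviation across repetitions. The numbers are hypothetical and chosen only to make the arithmetic easy; `mean_squared_error_simple` is an illustrative stand-in for the Scikit-learn function.

```python
import statistics


def mean_squared_error_simple(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual/predicted values for two repetitions
mse_run_1 = mean_squared_error_simple([3.0, 5.0], [2.0, 7.0])  # (1 + 4) / 2
mse_run_2 = mean_squared_error_simple([4.0, 6.0], [1.0, 6.0])  # (9 + 0) / 2
mse_list = [mse_run_1, mse_run_2]

mean_of_mse = statistics.mean(mse_list)
# Population standard deviation, the same form np.std uses by default
std_of_mse = statistics.pstdev(mse_list)
print(mean_of_mse, std_of_mse)
```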

In [12]:
try:
        
    #Report the mean and the standard deviation of the mean squared errors
    mean_mse, std_mse = iterate_regression_model(regression_model_one_hidden_layer, 50, 50, predictors)
     
except Exception:
    logger.error('Model: regression_model_one_hidden_layer: ', exc_info=True)
    
else:
    
    # Report the mean and stddev of the mean squared errors
    clear_output()
    print('Result: Regression Model: One hidden layer: 50 epochs, 50 iterations')
    print('\nMean: %.3f'%(mean_mse))
    print('Standard Deviation: %.3f'%(std_mse))
       

Result: Regression Model: One hidden layer: 50 epochs, 50 iterations

Mean: 275.094
Standard Deviation: 340.107


## 3. Part B: Normalize the data

Repeat Part A but use a normalized version of the data.<br>
Recall that one way to normalize the data is to subtract each predictor's mean from its values and divide by its standard deviation.
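
As a quick worked check of this formula, the sketch below standardizes the first Cement value (540.0) by hand, using the Cement mean and standard deviation reported by `describe()` in section 1.1. The `standardize` helper is hypothetical and exists only for this illustration.

```python
def standardize(value, mean, std):
    """Z-score: distance from the mean in units of standard deviation."""
    return (value - mean) / std


# Cement statistics as reported by concrete_data_df.describe()
CEMENT_MEAN = 281.167864
CEMENT_STD = 104.506364

z_first_row = standardize(540.0, CEMENT_MEAN, CEMENT_STD)
z_at_mean = standardize(CEMENT_MEAN, CEMENT_MEAN, CEMENT_STD)
print(round(z_first_row, 3), z_at_mean)
```

The first value comes out at roughly 2.477, in line with the first Cement entry of the normalized data, and a value equal to the mean maps to 0. Both `describe()` and `DataFrame.std()` use the sample standard deviation by default, so the two are consistent.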

In [13]:
# Normalize the data
predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()

INFO:numexpr.utils:NumExpr defaulting to 4 threads.


| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 0.862735 | -1.217079 | -0.279597 |
| 1 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 1.055651 | -1.217079 | -0.279597 |
| 2 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 3.55134 |
| 3 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 5.055221 |
| 4 | -0.790075 | 0.678079 | -0.846733 | 0.488555 | -1.038638 | 0.070492 | 0.647569 | 4.976069 |


In [14]:
try:
        
    #Report the mean and the standard deviation of the mean squared errors
    mean_mse, std_mse = iterate_regression_model(regression_model_one_hidden_layer, 50, 50, predictors_norm)
     
except Exception:
    logger.error('Model: regression_model_one_hidden_layer: ', exc_info=True)
    
else:
    
    # Report the mean and stddev of the mean squared errors
    clear_output()
    print('Result: Regression Model on normalized data: One hidden layer: 50 epochs, 50 iterations')
    print('\nMean: %.3f'%(mean_mse))
    print('Standard Deviation: %.3f'%(std_mse))


Result: Regression Model on normalized data: One hidden layer: 50 epochs, 50 iterations

Mean: 356.706
Standard Deviation: 101.307


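### How does the mean of the mean squared errors compare to that from Part A?

The mean of the mean squared errors rose from 275.094 to 356.706, but the standard deviation fell sharply, from 340.107 to 101.307.<br>
Because the spread in Part A is larger than its mean, the difference between the two means is within run-to-run variation; the clearer effect of normalization is more consistent results.<br>
Standardizing centers each predictor and puts all predictors on a comparable scale, which likely makes training less sensitive to the random weight initialization.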

## 4. Part C: Increase the number of epochs to 100
Repeat Part B but use 100 epochs this time for training.

In [15]:
try:
        
    #Report the mean and the standard deviation of the mean squared errors
    mean_mse, std_mse = iterate_regression_model(regression_model_one_hidden_layer, 50, 100, predictors_norm)
     
except Exception:
    logger.error('Model: regression_model_one_hidden_layer: ', exc_info=True)
    
else:
    
    # Report the mean and stddev of the mean squared errors
    clear_output()
    print('Result: Regression Model on normalized data: One hidden layer: 100 epochs, 50 iterations')
    print('\nMean: %.3f'%(mean_mse))
    print('Standard Deviation: %.3f'%(std_mse))


Result: Regression Model on normalized data: One hidden layer: 100 epochs, 50 iterations

Mean: 148.077
Standard Deviation: 10.754


### How does the mean of the mean squared errors compare to that from Part B?

The mean of the mean squared errors on the test set fell by more than half, from 356.706 to 148.077, and the standard deviation fell from 101.307 to 10.754.<br>
Doubling the number of epochs from 50 to 100 improved the prediction quality of the model, which suggests the model was still underfit after 50 epochs.


## 5. Part D: Increase the number of hidden layers
Repeat Part B but use a neural network with the following instead:

- Three hidden layers, each of 10 nodes and ReLU activation function.

In [16]:
try:
        
    #Report the mean and the standard deviation of the mean squared errors
    mean_mse, std_mse = iterate_regression_model(regression_model_three_hidden_layer, 50, 50, predictors_norm)
     
except Exception:
    logger.error('Model: regression_model_three_hidden_layer: ', exc_info=True)
    
else:
    
    # Report the mean and stddev of the mean squared errors
    clear_output()
    print('Result: Regression Model on normalized data: Three hidden layers: 50 epochs, 50 iterations')
    print('\nMean: %.3f'%(mean_mse))
    print('Standard Deviation: %.3f'%(std_mse))


Result: Regression Model on normalized data: Three hidden layers: 50 epochs, 50 iterations

Mean: 116.593
Standard Deviation: 11.252


### How does the mean of the mean squared errors compare to that from Part B?

The mean of the mean squared errors fell from 356.706 in Part B to 116.593, and the standard deviation fell from 101.307 to 11.252.<br>
With the same 50 epochs, the network with three hidden layers learned to predict the 'Strength' target substantially better than the network with a single hidden layer.
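
To put the four results side by side, the sketch below takes the mean MSE values reported in Parts A to D, converts each to a root mean squared error (which is back in MPa, the unit of Strength), and expresses it as a relative change against Part B. The `relative_change` helper and the dictionary layout are illustrative choices, not part of the original analysis.

```python
import math

# Mean of the mean squared errors reported in Parts A to D
results = {
    "A": 275.094,  # raw data, one hidden layer, 50 epochs
    "B": 356.706,  # normalized, one hidden layer, 50 epochs
    "C": 148.077,  # normalized, one hidden layer, 100 epochs
    "D": 116.593,  # normalized, three hidden layers, 50 epochs
}


def relative_change(baseline, value):
    """Fractional change of value relative to baseline."""
    return (value - baseline) / baseline


baseline = results["B"]
for part, mse in results.items():
    rmse = math.sqrt(mse)  # MSE is in MPa squared, so RMSE is in MPa
    change = relative_change(baseline, mse)
    print(f"Part {part}: RMSE {rmse:.2f} MPa, change vs Part B {change:+.1%}")
```

Because each figure is an average over 50 stochastic training runs, small differences between parts should be read alongside the reported standard deviations.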