<h2> Predicting the Compressive Strength of Different Samples of Concrete with the Keras Library</h2>

<b> By Michael Kumakech</b>

<b> INTRODUCTION</b>

In this project, the researcher builds a regression model with the Keras library to predict concrete compressive strength from the data set available at the following URL: https://cocl.us/concrete_data.

<b> Import necessary libraries</b>

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

In [2]:
# Download the data set from the following URL
!wget -O cement.csv https://cocl.us/concrete_data


--2022-03-22 10:55:44--  https://cocl.us/concrete_data
Resolving cocl.us (cocl.us)... 150.239.113.228, 150.239.82.32
Connecting to cocl.us (cocl.us)|150.239.113.228|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv [following]
--2022-03-22 10:55:45--  https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv
Resolving s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)... 67.228.254.196
Connecting to s3-api.us-geo.objectstorage.softlayer.net (s3-api.us-geo.objectstorage.softlayer.net)|67.228.254.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58988 (58K) [text/csv]
Saving to: ‘cement.csv’


2022-03-22 10:55:45 (386 KB/s) - ‘cement.csv’ saved [58988/58988]



In [3]:
df = pd.read_csv('cement.csv')  # read the contents of the cement.csv file
df.head()  # display the first five records

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


In [4]:
# display the last five records
df.tail()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
1025,276.4,116.0,90.3,179.6,8.9,870.1,768.3,28,44.28
1026,322.2,0.0,115.6,196.0,10.4,817.9,813.4,28,31.18
1027,148.5,139.4,108.6,192.7,6.1,892.4,780.0,28,23.7
1028,159.1,186.7,0.0,175.6,11.3,989.6,788.9,28,32.77
1029,260.9,100.5,78.3,200.6,8.6,864.5,761.5,28,32.4


In [5]:
df.shape

(1030, 9)

In [6]:
df.describe()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [7]:
df.isnull().sum()  # count the missing values in each column to check that the data is clean

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

The data is clean: no column has missing values.

<b>  Separating predictors and target</b>

In [26]:
df_columns = df.columns

predictors = df[df_columns[df_columns != 'Strength']] # all columns except Strength
target = df['Strength'] # Strength column

In [27]:
predictors.head()# check the first five records of predictor

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [28]:
target.head()# check the first five records of target feature

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

<b> Normalization of the Data </b>

In [29]:
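# Normalization is left disabled here, so the model below is trained on the raw predictors.
# To enable it, subtract the mean of each column and divide by its standard deviation:
#predictors_norm = (predictors - predictors.mean()) / predictors.std()
#predictors_norm.head()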

In [30]:
n_cols = predictors.shape[1]  # number of predictor columns; used as the input size of the network
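
The normalization referred to in this section (subtract the mean, divide by the standard deviation) can be illustrated on a small array. This is only a sketch: the helper name `zscore` is an assumption made for illustration, and the sample rows are the first three Cement and Water values from the table above.

```python
import numpy as np

def zscore(columns: np.ndarray) -> np.ndarray:
    """Standardize each column to zero mean and unit standard deviation."""
    mean = columns.mean(axis=0)
    std = columns.std(axis=0, ddof=1)  # ddof=1 matches the default of pandas' DataFrame.std()
    return (columns - mean) / std

# First three (Cement, Water) rows from the table above.
sample = np.array([[540.0, 162.0],
                   [540.0, 162.0],
                   [332.5, 228.0]])

normalized = zscore(sample)
print(normalized)
```

After this transformation every column has mean 0 and standard deviation 1, so predictors measured on very different scales (for example Age and Coarse Aggregate) contribute on a comparable scale.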

<b>  Import the Keras library </b>

In [31]:
import keras

In [32]:
from keras.models import Sequential
from keras.layers import Dense

<b> Build a Neural Network</b>

In [33]:

def regression_model(): # define regression model
    # create a model with one hidden layer of 10 units and a single output unit
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    # model.add(Dense(100, activation='relu'))  # optional second hidden layer, currently disabled
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

ReLU (Rectified Linear Unit) is one of the most widely used activation functions. It outputs x if x is positive and zero otherwise, that is, max(0, x). ReLU is often used for hidden layers.
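
As a small illustration of the ReLU rule described above, the following sketch applies it element-wise with NumPy. The function name `relu` and the input values are chosen here for illustration only; Keras applies the same rule internally when `activation='relu'` is set.

```python
import numpy as np

def relu(x):
    """Return x where x is positive and 0 elsewhere."""
    return np.maximum(0.0, x)

values = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
activated = relu(values)
print(activated)
```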

Adaptive Moment Estimation (Adam) is an extension of stochastic gradient descent. It keeps running averages of the gradient and of the squared gradient and uses them to adapt the step size of each parameter. It is commonly used for problems involving large amounts of data or many parameters, and it is computationally efficient with modest memory requirements.
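
To make the Adam update rule more concrete, here is a minimal single-parameter sketch. It is not the Keras implementation: the function `adam_minimize`, the toy objective f(x) = (x - 3)^2, and the learning rate of 0.1 are assumptions made for illustration. The decay rates `beta1=0.9` and `beta2=0.999` are the commonly cited defaults.

```python
import math

def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a one-dimensional function given its gradient, using Adam updates."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # running average of the gradient
        v = beta2 * v + (1 - beta2) * g * g    # running average of the squared gradient
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3); its minimum is at x = 3.
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```

Dividing by the square root of the squared-gradient average is what makes the step size adaptive: parameters with consistently large gradients take proportionally smaller steps.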

<b>Hold out 30% of the data set for testing the model and use 70% for training</b>

In [34]:
from sklearn.model_selection import train_test_split  # utility for splitting data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.3, random_state=42)  # hold out 30% of the rows for testing
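
The idea behind a seeded 70/30 split can be sketched with NumPy alone. This is an illustration of the principle, not a reimplementation of `train_test_split`: the helper `split_indices` is hypothetical, and its shuffling will not reproduce the exact rows scikit-learn selects for the same seed.

```python
import math
import numpy as np

def split_indices(n_rows, test_size=0.3, seed=42):
    """Shuffle row indices reproducibly and cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = math.ceil(test_size * n_rows)  # size of the held-out part
    return shuffled[n_test:], shuffled[:n_test]

# 1030 is the number of rows reported by df.shape above.
train_idx, test_idx = split_indices(1030)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split repeatable, which is the role `random_state` plays in the cell above.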

<b>  Train and Test the Network</b>

In [35]:
# build the model
model = regression_model()

In [36]:
# fit the model
epochs = 50  # number of training epochs (can be increased, e.g. to 100)
model.fit(X_train, y_train, epochs=epochs, verbose=1)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x7f727410b2e0>

<b>  Evaluating the Model</b>

In [37]:
loss_val = model.evaluate(X_test, y_test)
y_pred = model.predict(X_test)
loss_val



106.0365219116211

In [38]:
from sklearn.metrics import mean_squared_error

In [39]:
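mean_square_error = mean_squared_error(y_test, y_pred)  # a single scalar MSE for this split
mean = np.mean(mean_square_error)  # the mean of one value is the value itself
standard_deviation = np.std(mean_square_error)  # the standard deviation of one value is 0
print(mean, standard_deviation)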

106.03652124524191 0.0
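
For reference, the mean squared error can be computed by hand, which also shows why the standard deviation printed above is 0.0: it is taken over a single number. In this sketch the `actual` values are the first three Strength values shown earlier, while the `predicted` values are hypothetical and chosen only for illustration.

```python
import numpy as np

actual = np.array([79.99, 61.89, 40.27])   # first three Strength values from the data
predicted = np.array([70.0, 60.0, 45.0])   # hypothetical predictions, for illustration only

squared_errors = (actual - predicted) ** 2
mse = squared_errors.mean()                # average of the squared differences

print(mse)
print(np.std(mse))             # a single number has no spread
print(np.std(squared_errors))  # the per-sample squared errors do have spread
```

A standard deviation only becomes informative once several MSE values are collected, for example one per train/test split.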


In [40]:
total_mean_squared_errors = 50
epochs = 50
mean_squared_errors = []
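for i in range(0, total_mean_squared_errors):
    # draw a new 70/30 split for every repetition
    X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.3, random_state=i)
    # note: `model` is the network fitted above, so it keeps training across repetitions
    # instead of starting from fresh weights, and rows in this test set may have been seen in earlier fits
    model.fit(X_train, y_train, epochs=epochs, verbose=0)
    MSE = model.evaluate(X_test, y_test, verbose=0)  # test-set loss (MSE)
    print("MSE "+str(i+1)+": "+str(MSE))
    y_pred = model.predict(X_test)
    mean_square_error = mean_squared_error(y_test, y_pred)  # the same quantity computed with scikit-learn
    mean_squared_errors.append(mean_square_error)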

mean_squared_errors = np.array(mean_squared_errors)
mean = np.mean(mean_squared_errors)
standard_deviation = np.std(mean_squared_errors)

print('\n')
print("Below is the mean and standard deviation of " +str(total_mean_squared_errors) + " mean squared errors without normalized data. Total number of epochs for each training is: " +str(epochs) + "\n")
print("Mean: "+str(mean))
print("Standard Deviation: "+str(standard_deviation))

MSE 1: 72.51903533935547
MSE 2: 70.06085968017578
MSE 3: 52.49831008911133
MSE 4: 51.35323715209961
MSE 5: 49.96198654174805
MSE 6: 53.58228302001953
MSE 7: 58.94122314453125
MSE 8: 45.4842414855957
MSE 9: 47.988468170166016
MSE 10: 50.42008972167969
MSE 11: 48.25065231323242
MSE 12: 47.80887985229492
MSE 13: 55.9586067199707
MSE 14: 52.171783447265625
MSE 15: 50.44282913208008
MSE 16: 43.49639892578125
MSE 17: 50.274818420410156
MSE 18: 49.23743438720703
MSE 19: 44.23754119873047
MSE 20: 47.904502868652344
MSE 21: 44.29347610473633
MSE 22: 45.54547119140625
MSE 23: 44.154354095458984
MSE 24: 46.15468978881836
MSE 25: 48.24483108520508
MSE 26: 48.43710708618164
MSE 27: 52.11391830444336
MSE 28: 45.284263610839844
MSE 29: 52.67596435546875
MSE 30: 54.42585754394531
MSE 31: 52.157081604003906
MSE 32: 42.682308197021484
MSE 33: 47.23645782470703
MSE 34: 49.14694595336914
MSE 35: 47.461360931396484
MSE 36: 55.01321792602539
MSE 37: 51.22016143798828
MSE 38: 53.80708312988281
MSE 39: 47.811
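
A variant of this repeated evaluation refits a fresh model on every split, so that no repetition benefits from training done in an earlier one. The sketch below shows that protocol only. To stay small and self-contained it swaps in an ordinary least-squares fit for the Keras network and uses synthetic data rather than the concrete data set; the names `fit_fresh_model` and `predict` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data (not the concrete data set): y = 2*x0 - x1 + noise.
X = rng.normal(size=(200, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

def fit_fresh_model(X_train, y_train):
    """Fit a new least-squares model from scratch (stand-in for building and fitting a new network)."""
    design = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return coef

def predict(coef, X_new):
    return np.column_stack([X_new, np.ones(len(X_new))]) @ coef

mse_per_run = []
for seed in range(10):
    order = np.random.default_rng(seed).permutation(len(X))
    test_idx, train_idx = order[:60], order[60:]        # 30% test, 70% train
    coef = fit_fresh_model(X[train_idx], y[train_idx])  # nothing carried over from earlier runs
    residuals = y[test_idx] - predict(coef, X[test_idx])
    mse_per_run.append(np.mean(residuals ** 2))

mse_per_run = np.array(mse_per_run)
print("Mean:", mse_per_run.mean())
print("Standard deviation:", mse_per_run.std())
```

With the Keras model, the equivalent change would be to call `regression_model()` inside the loop before `model.fit`, at the cost of longer run time.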

In [41]:
from sklearn.metrics import r2_score

In [42]:
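# r2_score expects (y_true, y_pred); R-squared is not symmetric, so the order used below (y_pred first) affects the reported value
print("R2-score: %.2f" % r2_score(y_pred , y_test) )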

R2-score: 0.75
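
Because R-squared is defined relative to the variance of the true values, the order of the two arguments matters. The following sketch computes R-squared by hand to show this. The helper `r_squared` is written here for illustration, the `actual` values are the first five Strength values shown earlier, and the `predicted` values are hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread of the TRUE values
    return 1.0 - ss_res / ss_tot

actual = [79.99, 61.89, 40.27, 41.05, 44.30]  # first five Strength values from the data
predicted = [70.0, 60.0, 45.0, 43.0, 47.0]    # hypothetical predictions, for illustration only

print(r_squared(actual, predicted))   # true values first, as the definition expects
print(r_squared(predicted, actual))   # swapped order gives a different number
```

The residual sum of squares is identical in both calls; only the total sum of squares changes, because it is computed from whichever array is passed first.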


<h2> Conclusion</h2>

The mean squared error (MSE) is the average squared difference between the estimated values and the actual values. MSE is a risk function, corresponding to the expected value of the squared error loss. Its value therefore varies with the data used to evaluate it, as the different train/test splits of the data set show.

On the question of how to avoid overfitting, the experiments conducted suggest that there is no single optimal number of epochs. The appropriate number of epochs differs from one data set to another, and the main factors to monitor when choosing it are the training error and the validation error.
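
One common way to let the validation error decide the number of epochs is patience-based early stopping: training stops once the validation loss has not improved for a fixed number of epochs. The sketch below shows only the stopping logic. The function `best_epoch_with_patience` and the loss values are hypothetical and were not produced by the model above.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (epoch with the lowest validation loss so far, epoch at which training stops)."""
    best_loss = float("inf")
    best_epoch = 0
    last_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, last_epoch

# Hypothetical validation losses, one per epoch.
hypothetical_val_losses = [120.0, 90.0, 70.0, 65.0, 66.0, 68.0, 71.0, 60.0]
best, stopped_at = best_epoch_with_patience(hypothetical_val_losses)
print(best, stopped_at)
```

Keras offers this behaviour through its `EarlyStopping` callback, which would require passing validation data to `model.fit`; whether it helps on this data set would have to be checked experimentally.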

With one hidden layer of 10 nodes, one output layer, and 50 epochs, the ANN regression model had a reported R-squared value of 0.75, a mean squared error of 50.72, and a standard deviation of 22.65. R-squared is the proportion of variance explained by the model rather than a classification-style accuracy.