# Peer-graded Assignment: Concrete Compressive Strength Prediction
**Student:** Luis Ignacio Reyes Castro

### Assignment Topic
In this project, we will build a regression model using the Keras library to model the same data about concrete compressive strength that we used in Lab 3.

For your convenience, the data can be found here again: 
[https://cocl.us/concrete_data](https://cocl.us/concrete_data). To recap, the predictors in the data of concrete strength include:
* Cement
* Blast Furnace Slag
* Fly Ash
* Water
* Superplasticizer
* Coarse Aggregate
* Fine Aggregate

Let's start by installing and importing the required libraries.

In [1]:
print( '-' * 72 )
print( 'BEGIN INSTALLATION OF REQUIRED LIBRARIES' )
print( '-' * 72 )
!pip install numpy==2.0.2
!pip install pandas==2.2.2
!pip install tensorflow_cpu==2.18.0
!pip install scikit-learn==1.6.0
!pip install matplotlib==3.9.3
print( '-' * 72 )
print( 'INSTALLATION OF REQUIRED LIBRARIES COMPLETE' )
print( '-' * 72 )

------------------------------------------------------------------------
BEGIN INSTALLATION OF REQUIRED LIBRARIES
------------------------------------------------------------------------
------------------------------------------------------------------------
INSTALLATION OF REQUIRED LIBRARIES COMPLETE
------------------------------------------------------------------------


In [2]:
import pandas as pd
import numpy as np
import random

import keras
from keras.models import Sequential
from keras.layers import Input
from keras.layers import Dense

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

import warnings
warnings.simplefilter('ignore', FutureWarning)

2025-02-26 02:00:34.362489: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-02-26 02:00:34.433206: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Flake8 can be noisy in notebooks. The following lines silence E303 (too many blank lines), the most common warning here.

In [3]:
!echo "[flake8]" > ~/.config/flake8
!echo "ignore = E303" >> ~/.config/flake8

Next we set random number generator seeds for reproducibility.

In [4]:
lucky_number = 42
random.seed(lucky_number)
np.random.seed(lucky_number)
keras.utils.set_random_seed(lucky_number)

Now let's download the data and load it into a pandas dataframe.

In [5]:
filepath='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv'
concrete_data = pd.read_csv(filepath)
concrete_data.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


As instructed, we will be using the first seven columns as predictors.

In [6]:
n_cols     = 7
predictors = concrete_data.iloc[ :, :n_cols]
target     = concrete_data['Strength']
X = predictors.to_numpy()
y = target.to_numpy()

Finally, we define the function in charge of the train-test split.

In [7]:
def split_data( test_size) :
    X_train, X_test, y_train, y_test = \
    train_test_split( X, y, test_size = test_size)
    return X_train, X_test, y_train, y_test

## Part A: Build a Baseline Model

As instructed, we build a neural network model with the following specifications:
- One hidden layer with 10 nodes with ReLU activations.
- Optimizer: Adam (short for Adaptive Moment Estimation)
- Loss: Mean Squared Error

In [8]:
# Define neural network model
def regression_model() :
    model = Sequential()
    model.add( Input( shape = (n_cols,)) )
    model.add( Dense( 10, activation = 'relu') )
    model.add( Dense(1) )
    model.compile( optimizer = 'adam', loss = 'mean_squared_error')
    return model

Now we repeat the following steps 50 times:
1. Randomly split the data into training and test sets, holding out 30% of the data for testing.
2. Train the model on the training data using 50 epochs.
3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength.

In [9]:
# Initialize number of iterations and MSEs placeholder
iterations = 50
mses       = np.zeros((iterations,))
# Iterate...
for i in range(iterations) :
    print(f'ITERATION {i}:')
    # Split data into training and test sets
    X_train, X_test, y_train, y_test = split_data( test_size = 0.30)
    # Build and train neural network model
    model = regression_model()
    model.fit( X_train, y_train, epochs = 50, verbose = 0)
    # Compute the Mean Squared Error (MSE)
    mses[i] = model.evaluate( X_test, y_test)
    print(f'Test Data Mean Squared Error: {mses[i]:.2f}')
# Compute and report the mean and std of the MSEs
print('ITERATIONS COMPLETE')
mses_mean = mses.mean()
mses_std  = mses.std()
print('Test Data Statistics:')
print(f'- Mean of the MSEs: {mses_mean:.2f}')
print(f'- Standard Deviation of the MSEs: {mses_std:.2f}')

ITERATION 0:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 177.4348
Test Data Mean Squared Error: 175.16
ITERATION 1:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 285.6011
Test Data Mean Squared Error: 301.09
ITERATION 2:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 230.8200
Test Data Mean Squared Error: 239.50
ITERATION 3:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 1914.8876
Test Data Mean Squared Error: 1960.17
ITERATION 4:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - loss: 165.0546
Test Data Mean Squared Error: 151.50
ITERATION 5:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 902.4086
Test Data Mean Squared Error: 1048.47
ITERATION 6:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 163.6655
Test Data Mean Squared Error: 167.65
ITERAT

## Part B: Normalize the Data

Here we scale the predictors to improve model performance.
- The training data predictors are scaled by subtracting the mean and dividing by the standard deviation.
- To prevent _data leakage_, also known as _data snooping_, the test data predictors are scaled in the same way as the training data predictors, using the training data predictors' mean and standard deviation.
- The scaling is executed using scikit-learn's StandardScaler, which we learned about in the previous course (Machine Learning with Python).

In addition, **because this is a regression problem, the targets are also scaled**.
- The training data targets are scaled by subtracting the mean and dividing by the standard deviation.
- As before, to prevent _data leakage_, the test data targets are scaled using the training data targets' mean and standard deviation.
- Again, we use scikit-learn's StandardScaler.
- After prediction, the model outputs are inverse-scaled (i.e., inverse-transformed) so that the final outputs are mapped to their original scale. This is important because it allows us to compare the results of Parts A and B.
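
The scaling steps above can be sketched in a few lines of plain Python. This is only an illustration, not the code used in the notebook: the helper names (`fit_scaler`, `transform`, `inverse_transform`) and the strength values are made up, and the helpers are a hand-rolled stand-in for scikit-learn's StandardScaler (which, like this sketch, uses the population standard deviation).

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Return (mean, std) of the values, using the population standard deviation."""
    return fmean(values), pstdev(values)


def transform(values, mean, std):
    """Standardize values with the given mean and standard deviation."""
    return [(v - mean) / std for v in values]


def inverse_transform(scaled, mean, std):
    """Map standardized values back to the original scale."""
    return [s * std + mean for s in scaled]


# Hypothetical strength values, for illustration only
y_train = [20.0, 30.0, 40.0, 50.0]
y_test = [25.0, 45.0]

# The statistics come from the training split only
mean, std = fit_scaler(y_train)

# Both splits are scaled with the training statistics, so nothing leaks from the test split
y_train_scaled = transform(y_train, mean, std)
y_test_scaled = transform(y_test, mean, std)

# Inverse scaling recovers the original units, which is what makes MSEs comparable across parts
y_test_restored = inverse_transform(y_test_scaled, mean, std)
print(y_test_restored)
```

Note that the scaled test values generally do not have zero mean or unit variance; only the training values do, because only they were used to compute the statistics.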

In [10]:
# Wrap the code in a function for use in Parts B, C & D
def build_train_test_with_normalization( epochs) :
    # Initialize number of iterations and MSEs placeholder
    iterations = 50
    mses       = np.zeros((iterations,))
    # Iterate...
    for i in range(iterations) :
        print(f'ITERATION {i}:')
        # Split data into training and test sets
        X_train, X_test, y_train, y_test = split_data( test_size = 0.30)
        # Reshape targets
        y_train = y_train.reshape(-1,1)
        y_test  = y_test.reshape(-1,1)
        # Initialize and fit training data scalers
        scaler_X = StandardScaler()
        scaler_y = StandardScaler()
        scaler_X.fit(X_train)
        scaler_y.fit(y_train)
        # Scale the training and test data
        X_train_scaled = scaler_X.transform(X_train)
        y_train_scaled = scaler_y.transform(y_train)
        X_test_scaled  = scaler_X.transform(X_test)
        y_test_scaled  = scaler_y.transform(y_test)
        # Build and train neural network model
        model = regression_model()
        model.fit( X_train_scaled,
                   y_train_scaled, epochs = epochs, verbose = 0)
        # Compute scaled predictions
        y_pred_scaled = model.predict(X_test_scaled)
        # Inverse scale predictions
        y_pred = scaler_y.inverse_transform(y_pred_scaled)
        # Compute the Mean Squared Error (MSE)
        mses[i] = mean_squared_error( y_test, y_pred)
        print(f'Test Data Mean Squared Error: {mses[i]:.2f}')
    # Compute and report the mean and std of the MSEs
    print('ITERATIONS COMPLETE')
    mses_mean = mses.mean()
    mses_std  = mses.std()
    print('Test Data Statistics:')
    print(f'- Mean of the MSEs: {mses_mean:.2f}')
    print(f'- Standard Deviation of the MSEs: {mses_std:.2f}')
    return

In [11]:
build_train_test_with_normalization( epochs = 50)

ITERATION 0:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
Test Data Mean Squared Error: 168.15
ITERATION 1:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
Test Data Mean Squared Error: 152.56
ITERATION 2:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
Test Data Mean Squared Error: 159.64
ITERATION 3:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
Test Data Mean Squared Error: 147.22
ITERATION 4:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
Test Data Mean Squared Error: 161.63
ITERATION 5:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
Test Data Mean Squared Error: 148.86
ITERATION 6:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
Test Data Mean Squared Error: 146.24
ITERATION 7:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
Test Data Mean Squared Error: 156.35
ITERATIO

**How does the mean of the mean squared errors compare to that from Part A?**
- The mean of the MSEs has decreased significantly.
  - Part A mean of the MSEs: 523.77
  - Part B mean of the MSEs: 148.64
  - Percentage decrease: 71.62%
- The standard deviation of the MSEs has decreased even more sharply.
  - Part A STD of the MSEs: 609.71
  - Part B STD of the MSEs: 10.90
  - Percentage decrease: 98.21%
- The lesson is: For this dataset and model, data normalization significantly improves model performance and makes it far more consistent across random splits.
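
As a quick check of the arithmetic, the percentage decreases quoted above can be reproduced from the reported summary statistics. The helper name `percent_decrease` is an illustrative choice, not something defined elsewhere in the notebook.

```python
def percent_decrease(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Summary statistics reported for Parts A and B
mean_a, std_a = 523.77, 609.71
mean_b, std_b = 148.64, 10.90

print(f"Mean of the MSEs: {percent_decrease(mean_a, mean_b):.2f}% lower")
print(f"STD of the MSEs:  {percent_decrease(std_a, std_b):.2f}% lower")
```

The same helper applies to the comparisons in the later parts, with Part B as the baseline.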

## Part C: Increase the Number of Epochs

Here we re-run the code from Part B with 100 epochs.

In [12]:
build_train_test_with_normalization( epochs = 100)

ITERATION 0:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step
Test Data Mean Squared Error: 134.52
ITERATION 1:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
Test Data Mean Squared Error: 149.62
ITERATION 2:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
Test Data Mean Squared Error: 128.98
ITERATION 3:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
Test Data Mean Squared Error: 128.65
ITERATION 4:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
Test Data Mean Squared Error: 140.65
ITERATION 5:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
Test Data Mean Squared Error: 138.42
ITERATION 6:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
Test Data Mean Squared Error: 144.49
ITERATION 7:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
Test Data Mean Squared Error: 144.57
ITERATIO

**How does the mean of the mean squared errors compare to that from Part B?**
- The mean of the MSEs has decreased.
  - Part B mean MSE: 148.64
  - Part C mean MSE: 141.11
  - Percentage decrease: 5.07%
- The standard deviation of the MSEs has also decreased.
  - Part B STD of the MSEs: 10.90
  - Part C STD of the MSEs: 8.07
  - Percentage decrease: 25.96%
- The lesson is: In this particular case, i.e., for this dataset and for this _shallow_ neural network model (1 hidden layer), more training epochs lead to better model performance.

## Part D: Increase the Number of Hidden Layers

As instructed, we build a neural network model with the following specifications:
- Three hidden layers, each with 10 nodes with ReLU activations.
- Optimizer: Adam (short for Adaptive Moment Estimation)
- Loss: Mean Squared Error

In [13]:
# Re-define neural network model
del regression_model
def regression_model() :
    model = Sequential()
    model.add( Input( shape = (n_cols,)) )
    model.add( Dense( 10, activation = 'relu') )
    model.add( Dense( 10, activation = 'relu') )
    model.add( Dense( 10, activation = 'relu') )
    model.add( Dense(1) )
    model.compile( optimizer = 'adam', loss = 'mean_squared_error')
    return model

With the neural network model re-defined, we can re-run the code from Part B.

In [14]:
build_train_test_with_normalization( epochs = 50)

ITERATION 0:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
Test Data Mean Squared Error: 128.67
ITERATION 1:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
Test Data Mean Squared Error: 145.59
ITERATION 2:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step
Test Data Mean Squared Error: 132.06
ITERATION 3:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
Test Data Mean Squared Error: 145.31
ITERATION 4:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step
Test Data Mean Squared Error: 138.45
ITERATION 5:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
Test Data Mean Squared Error: 149.23
ITERATION 6:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step
Test Data Mean Squared Error: 142.27
ITERATION 7:
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
Test Data Mean Squared Error: 144.37
ITERATIO

**How does the mean of the mean squared errors compare to that from Part B?**
- The mean of the MSEs has decreased.
  - Part B mean of the MSEs: 148.64
  - Part D mean of the MSEs: 143.41
  - Percentage decrease: 3.52%
- The standard deviation of the MSEs has also decreased.
  - Part B STD of the MSEs: 10.90
  - Part D STD of the MSEs: 8.99
  - Percentage decrease: 17.52%
- However, the model performed slightly better in Part C (mean of the MSEs: 141.11) than in Part D (mean of the MSEs: 143.41). The gap is small relative to the standard deviations, so it may be noise, but it may also be due to overfitting. The lesson is: In this particular case, i.e., for this dataset and this _deep_ neural network model (3 hidden layers), adding extra hidden layers increases model complexity and thus might lead to overfitting.
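
To put "model complexity" in numbers, here is a rough sketch that counts the trainable parameters of the two fully connected architectures used in this notebook. The helper `dense_param_count` is a hypothetical name introduced for illustration; it assumes every Dense layer has one weight per input-unit pair plus one bias per unit, which is the Keras default.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_sizes:
        # One weight per (input, unit) pair, plus one bias per unit
        total += fan_in * units + units
        fan_in = units
    return total


n_cols = 7  # number of predictors used in this notebook

shallow = dense_param_count(n_cols, [10, 1])          # Parts A, B and C
deep = dense_param_count(n_cols, [10, 10, 10, 1])     # Part D

print(f"One hidden layer:    {shallow} parameters")
print(f"Three hidden layers: {deep} parameters")
```

Under these assumptions the Part D model has more than three times as many parameters as the earlier one, which is consistent with the overfitting hypothesis but does not prove it.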