# B. Normalize the data (5 marks) 

Repeat Part A but use a normalized version of the data. Recall that one way to normalize the data is by subtracting the mean from the individual predictors and dividing by the standard deviation.

How does the mean of the mean squared errors compare to that from Step A?

# Regression model using Keras (Data normalized)

Lets import numpy and pandas to help us load and analyze data

In [3]:
import numpy as np
import pandas as pd

In [4]:
#lets load the data, and take a look at the data using .head()
data = pd.read_csv("concrete_data.csv")
data.head()


Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


In [5]:
#lets print the shape of the data (i.e. number of rows and columns)
data.shape

(1030, 9)

Therefore, our dataset has 1030 rows and only 9 columns.
Lets take a look at the data for any missing values before we start building the model using the data.

In [6]:
data.describe()


Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [7]:
data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

The data looks good so far, therfore we can begin the next steps.

Since, for the first part, we are not to normalioze the data, I will jump straight to splitting the dataset.

### Lets divide our dataset into predictors (X) and target variable (y) (independent and dependent variable)

In [8]:
X = data[['Cement','Blast Furnace Slag','Fly Ash',
                  'Water','Superplasticizer','Coarse Aggregate','Fine Aggregate','Age']]

X.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [9]:
y = data[['Strength']]
y.head()

Unnamed: 0,Strength
0,79.99
1,61.89
2,40.27
3,41.05
4,44.3


### Lets convert both, X and y into arrays

In [10]:
X = X.values
X

array([[ 540. ,    0. ,    0. , ..., 1040. ,  676. ,   28. ],
       [ 540. ,    0. ,    0. , ..., 1055. ,  676. ,   28. ],
       [ 332.5,  142.5,    0. , ...,  932. ,  594. ,  270. ],
       ...,
       [ 148.5,  139.4,  108.6, ...,  892.4,  780. ,   28. ],
       [ 159.1,  186.7,    0. , ...,  989.6,  788.9,   28. ],
       [ 260.9,  100.5,   78.3, ...,  864.5,  761.5,   28. ]])

In [11]:
y = y.values
y

array([[79.99],
       [61.89],
       [40.27],
       ...,
       [23.7 ],
       [32.77],
       [32.4 ]])

## Lets Normalize the X (independent variables/predictors)

In [12]:
X_normalized = (X - X.mean()) / X.std() 

In [13]:
print(X_normalized.shape)
print(X.shape)

(1030, 8)
(1030, 8)


### Now that we have both, the target and predictor variabels, lets move onto splitting our dataset.


In [14]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split( X_normalized, y, test_size=0.3, random_state=42)
print(f"Train Set = {X_train.shape},{y_train.shape}")
print(f"Test Set = {X_test.shape},{y_test.shape}")

Train Set = (721, 8),(721, 1)
Test Set = (309, 8),(309, 1)


30% of the dataset has been reserved for testing as per the instructions

### Lets import some important libraries for building our model

In [15]:
import tensorflow

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

In [16]:
#lets define n_cols to be the size of the number of variables in X
n_cols = X_test.shape[1]
print(n_cols)

8


#### Therefore, we will have 8 nodes in the input layer of the ANN.

In [17]:
#lets create our model
def regression_model():
    # create the model
    model = tensorflow.keras.Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # compile thye model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model


The above function creates a model that has one hidden layer with 10 neurons and uses ReLU activation function. It uses the adam optimizer and the mean squared error as the loss function, as per instructions.

In [18]:
#lets build the model
model = regression_model()

2022-04-13 14:35:07.284874: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  SSE4.1 SSE4.2
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2022-04-13 14:35:07.285322: I tensorflow/core/common_runtime/process_util.cc:115] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.


Lets train the model with 50 epochs

In [19]:
# fit the model
epochs = 50
model.fit(X_train, y_train, epochs=epochs, verbose=2)

Train on 721 samples
Epoch 1/50
721/721 - 0s - loss: 1450.1316
Epoch 2/50
721/721 - 0s - loss: 1411.0284
Epoch 3/50
721/721 - 0s - loss: 1371.6819
Epoch 4/50
721/721 - 0s - loss: 1332.0697
Epoch 5/50
721/721 - 0s - loss: 1291.5961
Epoch 6/50
721/721 - 0s - loss: 1249.9402
Epoch 7/50
721/721 - 0s - loss: 1207.3595
Epoch 8/50
721/721 - 0s - loss: 1162.6883
Epoch 9/50
721/721 - 0s - loss: 1116.3489
Epoch 10/50
721/721 - 0s - loss: 1069.0254
Epoch 11/50
721/721 - 0s - loss: 1019.6831
Epoch 12/50
721/721 - 0s - loss: 970.7288
Epoch 13/50
721/721 - 0s - loss: 920.6431
Epoch 14/50
721/721 - 0s - loss: 870.9823
Epoch 15/50
721/721 - 0s - loss: 821.7653
Epoch 16/50
721/721 - 0s - loss: 773.5706
Epoch 17/50
721/721 - 0s - loss: 726.0541
Epoch 18/50
721/721 - 0s - loss: 680.4849
Epoch 19/50
721/721 - 0s - loss: 636.1126
Epoch 20/50
721/721 - 0s - loss: 594.2987
Epoch 21/50
721/721 - 0s - loss: 554.2411
Epoch 22/50
721/721 - 0s - loss: 517.2179
Epoch 23/50
721/721 - 0s - loss: 482.4808
Epoch 24/50

<tensorflow.python.keras.callbacks.History at 0x7fb0b8203e50>

In [20]:
#Lets evaluate the model now:

loss_ = model.evaluate(X_test, y_test, verbose  =2)
y_pred = model.predict(X_test)
loss_



309/1 - 0s - loss: 244.1764


268.62339935796547



Now we need to compute the mean squared error between the predicted concrete strength and the actual concrete strength.

Let's import the mean_squared_error function from Scikit-learn.



In [21]:
from sklearn.metrics import mean_squared_error

In [22]:
mean_square_error = mean_squared_error(y_test, y_pred)
mean = np.mean(mean_square_error)
standard_deviation = np.std(mean_square_error)
print(f"Mean of MSE = {mean}")
print(f"Standard Deviation of MSE = {standard_deviation}")

Mean of MSE = 268.62340942009075
Standard Deviation of MSE = 0.0


### Now, we will repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors and calculate the mean and Standard deviation of the list.

In [23]:
z =1 #for indexing 
mse_list_50 = [] #empty list for the 50 values 
model = regression_model()
epochs = 50
for x in range(50):
    
    X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.3, random_state=x)
    model.fit(X_train, y_train, epochs=epochs, verbose=0)
    loss_1 = model.evaluate(X_test, y_test, verbose  =0)
    print(f" {z}: MSE = {loss_1}")
    y_pred1 = model.predict(X_test)
    mean_square_error = mean_squared_error(y_test, y_pred1)
    mse_list_50.append(mean_square_error)
    z += 1
    
#lets convert the list mse_list_50 into array before we calculate the mean and the standard deviation of the mean squared errors.
mse_array_50 = np.array(mse_list_50)
mse_array_50_mean = np.mean(mse_array_50)
mse_array_50_std = np.std(mse_array_50)
print(f"Mean of all 50 Mean squared error values = {mse_array_50_mean}")
print(f"Standard Deviation of all 50 Mean squared error values = {mse_array_50_std}")

 1: MSE = 253.65496055825244
 2: MSE = 218.91899963181382
 3: MSE = 214.45071885192277
 4: MSE = 220.57252161711165
 5: MSE = 197.28304836433682
 6: MSE = 160.46466355802173
 7: MSE = 173.15946288864976
 8: MSE = 126.89095276767767
 9: MSE = 142.55969021003995
 10: MSE = 124.7042385953144
 11: MSE = 113.13132528656895
 12: MSE = 109.44876883793803
 13: MSE = 116.43116599765024
 14: MSE = 119.2698263520176
 15: MSE = 106.93716818306439
 16: MSE = 101.24694100784252
 17: MSE = 100.14456655755399
 18: MSE = 95.20003200889019
 19: MSE = 86.88076076075482
 20: MSE = 106.97797497499336
 21: MSE = 89.26086539357998
 22: MSE = 87.92968258657116
 23: MSE = 92.81697833113685
 24: MSE = 88.40291648543769
 25: MSE = 92.81960410207607
 26: MSE = 85.81355458478711
 27: MSE = 95.3424291765034
 28: MSE = 90.7981791079623
 29: MSE = 87.56704916846019
 30: MSE = 86.2289626976433
 31: MSE = 84.32435936295099
 32: MSE = 77.83063591028109
 33: MSE = 71.47631850751858
 34: MSE = 82.61176974102132
 35: MSE =

In [24]:
#a look at the list of 50 mean squared errors
mse_list_50

[253.6549580600918,
 218.91899911393568,
 214.45071547673413,
 220.572515846307,
 197.28304764132756,
 160.46466461210213,
 173.1594628665499,
 126.89095454331132,
 142.55969185041894,
 124.70423809855293,
 113.13132486996386,
 109.44876785449831,
 116.43116585203373,
 119.26982552615455,
 106.9371650175403,
 101.24694017219355,
 100.14456539463497,
 95.20003116582488,
 86.88075865134273,
 106.97797477677871,
 89.26086341173526,
 87.92968109926554,
 92.81698104564093,
 88.40291655030335,
 92.81960037688077,
 85.81355501987021,
 95.34242933386099,
 90.79817500936639,
 87.56705058417077,
 86.22896188826967,
 84.3243595986096,
 77.83063355662888,
 71.47631662976559,
 82.61176737209637,
 78.11758098294065,
 88.37036601611327,
 81.68768175563683,
 80.18702522526183,
 70.31023958191744,
 70.34208138692726,
 76.34483849077779,
 71.4675444240268,
 65.40383987137025,
 71.34504095805903,
 70.28408880043554,
 71.83411949926827,
 66.34418389678142,
 66.53886668948292,
 61.84839760600185,
 65.31342

## How does the mean of the mean squared errors compare to that from Step A?

In [29]:
print("Without Normalization")
print(f"\nMean of all 50 Mean squared error values = 115.86157194804495")
print(f"Standard Deviation of all 50 Mean squared error values = 40.07537429387559")

Without Normalization

Mean of all 50 Mean squared error values = 115.86157194804495
Standard Deviation of all 50 Mean squared error values = 40.07537429387559


In [28]:
print("With Normalization")
print(f"\nMean of all 50 Mean squared error values = {mse_array_50_mean}")
print(f"Standard Deviation of all 50 Mean squared error values = {mse_array_50_std}")

With Normalization

Mean of all 50 Mean squared error values = 105.14580754892266
Standard Deviation of all 50 Mean squared error values = 45.379225803390646
