# Build a baseline model (5 marks) 

Use the Keras library to build a neural network with the following:

- One hidden layer of 10 nodes, and a ReLU activation function

- Use the Adam optimizer and the mean squared error as the loss function.

1. Randomly split the data into training and test sets by holding out 30% of the data for testing. You can use the train_test_split helper function from Scikit-learn.

2. Train the model on the training data using 50 epochs.

3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength. You can use the mean_squared_error function from Scikit-learn.

4. Repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors.

5. Report the mean and the standard deviation of the mean squared errors.

# Regression model using Keras (Data not normalized)

Let's import NumPy and pandas to help us load and analyze the data.

In [36]:
import numpy as np
import pandas as pd

In [37]:
# Load the data and preview the first rows with .head()
data = pd.read_csv("concrete_data.csv")
data.head()


| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.89 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.27 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.3 |


In [38]:
# Print the shape of the data (i.e. the number of rows and columns)
data.shape

(1030, 9)

Therefore, our dataset has 1030 rows and only 9 columns.
Let's look at the summary statistics and check for missing values before we start building the model.

In [39]:
data.describe()


| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|---|---|---|---|---|---|---|---|---|---|
| count | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 |
| mean | 281.167864 | 73.895825 | 54.18835 | 181.567282 | 6.20466 | 972.918932 | 773.580485 | 45.662136 | 35.817961 |
| std | 104.506364 | 86.279342 | 63.997004 | 21.354219 | 5.973841 | 77.753954 | 80.17598 | 63.169912 | 16.705742 |
| min | 102.0 | 0.0 | 0.0 | 121.8 | 0.0 | 801.0 | 594.0 | 1.0 | 2.33 |
| 25% | 192.375 | 0.0 | 0.0 | 164.9 | 0.0 | 932.0 | 730.95 | 7.0 | 23.71 |
| 50% | 272.9 | 22.0 | 0.0 | 185.0 | 6.4 | 968.0 | 779.5 | 28.0 | 34.445 |
| 75% | 350.0 | 142.95 | 118.3 | 192.0 | 10.2 | 1029.4 | 824.0 | 56.0 | 46.135 |
| max | 540.0 | 359.4 | 200.1 | 247.0 | 32.2 | 1145.0 | 992.6 | 365.0 | 82.6 |


In [40]:
data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

The data looks good so far, so we can move on to the next steps.

Since the data is not to be normalized in this first part, I will move straight to splitting the dataset.

### Let's divide our dataset into predictors (X) and the target variable (y), i.e. the independent and dependent variables

In [41]:
X = data[['Cement','Blast Furnace Slag','Fly Ash',
                  'Water','Superplasticizer','Coarse Aggregate','Fine Aggregate','Age']]

X.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 |


In [42]:
y = data[['Strength']]
y.head()

| | Strength |
|---|---|
| 0 | 79.99 |
| 1 | 61.89 |
| 2 | 40.27 |
| 3 | 41.05 |
| 4 | 44.3 |


### Let's convert both X and y into arrays

In [43]:
X = X.values
X

array([[ 540. ,    0. ,    0. , ..., 1040. ,  676. ,   28. ],
       [ 540. ,    0. ,    0. , ..., 1055. ,  676. ,   28. ],
       [ 332.5,  142.5,    0. , ...,  932. ,  594. ,  270. ],
       ...,
       [ 148.5,  139.4,  108.6, ...,  892.4,  780. ,   28. ],
       [ 159.1,  186.7,    0. , ...,  989.6,  788.9,   28. ],
       [ 260.9,  100.5,   78.3, ...,  864.5,  761.5,   28. ]])

In [44]:
y = y.values
y

array([[79.99],
       [61.89],
       [40.27],
       ...,
       [23.7 ],
       [32.77],
       [32.4 ]])

### Now that we have both the target and predictor variables, let's move on to splitting our dataset.


In [45]:
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for testing; random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f"Train Set = {X_train.shape},{y_train.shape}")
print(f"Test Set = {X_test.shape},{y_test.shape}")

Train Set = (721, 8),(721, 1)
Test Set = (309, 8),(309, 1)


As per the instructions, 30% of the dataset (309 of 1030 rows) has been reserved for testing.
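
To make the split concrete, here is a minimal NumPy-only sketch of what a 70/30 random hold-out amounts to. It is not how train_test_split is implemented internally; the helper name `holdout_split` and the use of `np.random.default_rng` are assumptions made for illustration. It also shows that a fixed seed reproduces the same split on every call.

```python
import numpy as np

def holdout_split(n_rows, test_size=0.3, seed=None):
    """Return (train_idx, test_idx) for a random hold-out split (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)        # random order of the row indices
    n_test = round(n_rows * test_size)        # number of rows held out for testing
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = holdout_split(1030, test_size=0.3, seed=42)
print(len(train_idx), len(test_idx))          # 721 309

# The same seed yields the same split; a different seed yields a different one.
again_train, _ = holdout_split(1030, seed=42)
other_train, _ = holdout_split(1030, seed=7)
print(np.array_equal(train_idx, again_train))   # True
print(np.array_equal(train_idx, other_train))   # False
```

The row indices can then be used to slice both arrays, for example `X[train_idx]` and `y[train_idx]`.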

### Let's import some important libraries for building our model

In [46]:
import tensorflow

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [47]:
# Define n_cols as the number of predictor columns in X
n_cols = X_test.shape[1]
print(n_cols)

8


#### Therefore, we will have 8 nodes in the input layer of the ANN.

In [48]:
# Define a function that builds and compiles the model
def regression_model():
    # create the model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))  # hidden layer: 10 ReLU nodes
    model.add(Dense(1))  # output layer: one linear node for the predicted strength
    
    # compile the model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model


The above function creates a model with one hidden layer of 10 neurons using the ReLU activation function, followed by a single output node. It uses the Adam optimizer and the mean squared error as the loss function, as per the instructions.
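
As a sanity check on the architecture, the sketch below reproduces the computation this network performs using plain NumPy and counts its trainable parameters. The weights are random placeholders rather than trained values, and names such as `W1`, `b1`, and `forward` are illustrative only.

```python
import numpy as np

n_cols, n_hidden = 8, 10
rng = np.random.default_rng(0)

# Placeholder weights with the same shapes as the two Dense layers.
W1, b1 = rng.normal(size=(n_cols, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, 1)), np.zeros(1)

def forward(X):
    """Hidden layer with ReLU, followed by a single linear output node."""
    hidden = np.maximum(0.0, X @ W1 + b1)
    return hidden @ W2 + b2

X_demo = rng.normal(size=(5, n_cols))   # five made-up rows, not real concrete data
print(forward(X_demo).shape)            # (5, 1)

n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)                         # 8*10 + 10 + 10*1 + 1 = 101
```

With the Keras model itself, `model.summary()` should report the same parameter total.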

In [49]:
# Build the model
model = regression_model()

Let's train the model for 50 epochs.

In [50]:
# fit the model
epochs = 50
model.fit(X_train, y_train, epochs=epochs, verbose=1)

Train on 721 samples
Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7faf3facf450>

In [51]:
# Evaluate the model on the test set; with loss='mean_squared_error', evaluate() returns the test MSE

loss_ = model.evaluate(X_test, y_test, verbose=0)
y_pred = model.predict(X_test)
loss_



151.3155773125806



Now we need to compute the mean squared error between the predicted concrete strength and the actual concrete strength.

Let's import the mean_squared_error function from Scikit-learn.



In [52]:
from sklearn.metrics import mean_squared_error

In [53]:
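# mean_squared_error returns a single number here, so its mean is that same number
# and its standard deviation is trivially 0; a meaningful spread needs several runs.
mean_square_error = mean_squared_error(y_test, y_pred)
mean = np.mean(mean_square_error)
standard_deviation = np.std(mean_square_error)
print(f"Mean of MSE = {mean}")
print(f"Standard Deviation of MSE = {standard_deviation}")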

Mean of MSE = 151.31558038069792
Standard Deviation of MSE = 0.0
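
For reference, the mean squared error is the average of the squared differences between the actual and predicted values. The sketch below computes it by hand on a few made-up numbers (not taken from the dataset) and shows why the standard deviation printed above is zero: `np.std` of a single number is always 0.

```python
import numpy as np

# Made-up targets and predictions, for illustration only.
y_true = np.array([[3.0], [5.0], [7.0]])
y_hat = np.array([[2.0], [5.0], [9.0]])

mse = np.mean((y_true - y_hat) ** 2)   # (1 + 0 + 4) / 3
print(mse)
print(np.std(mse))                     # 0.0: a single value has no spread
print(np.std([mse, 2 * mse]))          # non-zero once there are several values
```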


### Now, we will repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors, and calculate the mean and standard deviation of the list.

In [54]:
mse_list_50 = []  # collects the 50 test-set MSE values
# Note: the model is created once, outside the loop, so every pass continues
# training the same network rather than starting from fresh weights.
model = regression_model()
epochs = 50
for run in range(1, 51):
    # random_state=42 reproduces the same 70/30 split on every pass
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model.fit(X_train, y_train, epochs=epochs, verbose=0)
    loss_1 = model.evaluate(X_test, y_test, verbose=0)
    print(f" {run}: MSE = {loss_1}")
    y_pred1 = model.predict(X_test)
    mean_square_error = mean_squared_error(y_test, y_pred1)
    mse_list_50.append(mean_square_error)

# Convert the list to an array, then compute the mean and standard deviation of the mean squared errors.
mse_array_50 = np.array(mse_list_50)
mse_array_50_mean = np.mean(mse_array_50)
mse_array_50_std = np.std(mse_array_50)
print(f"Mean of all 50 Mean squared error values = {mse_array_50_mean}")
print(f"Standard Deviation of all 50 Mean squared error values = {mse_array_50_std}")

 1: MSE = 393.0620283108313
 2: MSE = 146.54383326656995
 3: MSE = 110.63551639198872
 4: MSE = 108.09608681688032
 5: MSE = 109.26204002405062
 6: MSE = 110.51688059093883
 7: MSE = 113.00702156140966
 8: MSE = 108.92978678089129
 9: MSE = 110.5800561750591
 10: MSE = 108.46591053194213
 11: MSE = 108.38249512706373
 12: MSE = 108.19451931456531
 13: MSE = 108.76265183235834
 14: MSE = 109.28072692969856
 15: MSE = 108.75264379507516
 16: MSE = 118.00767973865892
 17: MSE = 109.44109243250973
 18: MSE = 109.81839263323441
 19: MSE = 114.40316172479426
 20: MSE = 107.93408538917122
 21: MSE = 106.81830275946065
 22: MSE = 111.54775492350261
 23: MSE = 106.67421532294512
 24: MSE = 107.03749183235045
 25: MSE = 107.81180456463959
 26: MSE = 107.28254440455761
 27: MSE = 109.022875542008
 28: MSE = 106.92063247038709
 29: MSE = 108.96980088119754
 30: MSE = 106.08166392798563
 31: MSE = 113.14672293555003
 32: MSE = 107.41609789406984
 33: MSE = 110.76961638317911
 34: MSE = 108.94810905
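
Note that the loop above builds the model once and passes random_state=42 on every iteration, so each pass continues training the same network on the same split. This is consistent with the large MSE on the first pass and the much smaller values afterwards. One way to make the 50 repetitions independent is to redraw the split and rebuild the model inside the loop. The following is a minimal sketch under that assumption, not the notebook's own method: `repeated_mse` and `MeanModel` are hypothetical names, `MeanModel` is only a stand-in with a Keras-like fit/predict interface so the example runs without TensorFlow, and the split is drawn with a NumPy permutation where `train_test_split(X, y, test_size=0.3, random_state=seed)` would serve the same purpose.

```python
import numpy as np

class MeanModel:
    """Stand-in with a Keras-like fit/predict interface (hypothetical, not a neural network)."""
    def fit(self, X, y, epochs=1, verbose=0):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full((len(X), 1), self.mean_)

def repeated_mse(X, y, build_model, n_repeats=50, test_size=0.3, epochs=50):
    """Redraw the split and train a fresh model on every repetition."""
    mse_values = []
    for seed in range(n_repeats):
        rng = np.random.default_rng(seed)      # a different split per repetition
        order = rng.permutation(len(X))
        n_test = round(len(X) * test_size)
        test_idx, train_idx = order[:n_test], order[n_test:]

        model = build_model()                  # fresh, untrained model each time
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        y_pred = model.predict(X[test_idx])
        mse_values.append(float(np.mean((y[test_idx] - y_pred) ** 2)))
    return np.array(mse_values)

# Synthetic data with the same shape conventions as X and y in the notebook.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = rng.normal(size=(100, 1))

mse_values = repeated_mse(X_demo, y_demo, MeanModel, n_repeats=50)
print(mse_values.mean(), mse_values.std())
```

With the notebook's own objects, the call would be `repeated_mse(X, y, regression_model)`, which reports the spread across 50 independently trained models.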

In [55]:
# Inspect the list of 50 mean squared errors
mse_list_50

[393.06202657047317,
 146.54383939693764,
 110.63552042850878,
 108.09608892056339,
 109.26204311192639,
 110.51688241801976,
 113.00702128836211,
 108.92979098104176,
 110.58005693258667,
 108.46591384518227,
 108.38249646422054,
 108.19452395290452,
 108.76265431045226,
 109.28072359001362,
 108.75264260994138,
 118.00767910656869,
 109.44109610458231,
 109.8183928928208,
 114.40316443692173,
 107.9340851118793,
 106.81830471171722,
 111.54775827001008,
 106.67421606021996,
 107.03749552133158,
 107.8118039128104,
 107.2825421999724,
 109.02288002452134,
 106.9206325683592,
 108.96979918877763,
 106.08166515307482,
 113.1467234546087,
 107.416098568517,
 110.76961940951553,
 108.9481087099542,
 106.90136021757647,
 111.69538271228679,
 107.32505723600616,
 117.78107511399115,
 106.91079480679562,
 124.33097549565112,
 107.53140029927042,
 106.57030983504063,
 108.60116580228315,
 107.69089991029679,
 107.06464884212451,
 109.44576187025213,
 106.70670211463471,
 107.77843309534894,
 