# Regression Model Application on Concrete Data

### 1. Assignment Topic:

In this project, you will build a regression model using the Keras library to model the same concrete compressive strength data that we used in lab 3.

### 2. Concrete Data:

For your convenience, the data can be found here again: https://cocl.us/concrete_data. To recap, the predictors in the concrete strength data include:

    1. Cement
    2. Blast Furnace Slag
    3. Fly Ash
    4. Water
    5. Superplasticizer
    6. Coarse Aggregate
    7. Fine Aggregate
    8. Age

In [1]:
import pandas as pd
import numpy as np

In [2]:
concrete_data = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv')
concrete_data.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.89 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.27 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.3 |


Check the shape of the data

In [3]:
concrete_data.shape

(1030, 9)

Check the data distribution

In [4]:
concrete_data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Cement | 1030.0 | 281.167864 | 104.506364 | 102.0 | 192.375 | 272.9 | 350.0 | 540.0 |
| Blast Furnace Slag | 1030.0 | 73.895825 | 86.279342 | 0.0 | 0.0 | 22.0 | 142.95 | 359.4 |
| Fly Ash | 1030.0 | 54.18835 | 63.997004 | 0.0 | 0.0 | 0.0 | 118.3 | 200.1 |
| Water | 1030.0 | 181.567282 | 21.354219 | 121.8 | 164.9 | 185.0 | 192.0 | 247.0 |
| Superplasticizer | 1030.0 | 6.20466 | 5.973841 | 0.0 | 0.0 | 6.4 | 10.2 | 32.2 |
| Coarse Aggregate | 1030.0 | 972.918932 | 77.753954 | 801.0 | 932.0 | 968.0 | 1029.4 | 1145.0 |
| Fine Aggregate | 1030.0 | 773.580485 | 80.17598 | 594.0 | 730.95 | 779.5 | 824.0 | 992.6 |
| Age | 1030.0 | 45.662136 | 63.169912 | 1.0 | 7.0 | 28.0 | 56.0 | 365.0 |
| Strength | 1030.0 | 35.817961 | 16.705742 | 2.33 | 23.71 | 34.445 | 46.135 | 82.6 |


Check for any missing values in the data

In [5]:
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

There are no missing values in the data.

Separate the data into predictors and target

In [6]:
predictors = concrete_data.drop('Strength', axis=1) # all columns except Strength
target = concrete_data['Strength'] # Strength column

Quickly check the data

In [7]:
predictors.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 |


In [8]:
target.head(3)

0    79.99
1    61.89
2    40.27
Name: Strength, dtype: float64

Finally, the last step is to normalize the data by subtracting the mean and dividing by the standard deviation.

In [9]:
predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 0.862735 | -1.217079 | -0.279597 |
| 1 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 1.055651 | -1.217079 | -0.279597 |
| 2 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 3.55134 |
| 3 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 5.055221 |
| 4 | -0.790075 | 0.678079 | -0.846733 | 0.488555 | -1.038638 | 0.070492 | 0.647569 | 4.976069 |
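
As a quick sanity check on the normalization step, the sketch below applies the same subtract-the-mean, divide-by-the-standard-deviation formula to a single column using only the standard library. The sample values are the first five `Cement` entries from the dataset preview; the helper name `zscore` is hypothetical. It uses the sample standard deviation (`statistics.stdev`, dividing by n - 1), which is also the default for pandas `std()`.

```python
import statistics

def zscore(values):
    """Standardize values: subtract the mean, divide by the sample std (n - 1)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation, like pandas .std()
    return [(v - mean) / std for v in values]

cement_sample = [540.0, 540.0, 332.5, 332.5, 198.6]  # first five Cement values
normalized = zscore(cement_sample)

# After standardization the column has mean ~0 and sample std ~1.
print(statistics.mean(normalized), statistics.stdev(normalized))
```

The five-row results differ from the notebook's normalized values because the notebook computes the mean and standard deviation over all 1,030 rows.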


Let's save the number of predictors to n_cols since we will need this number when building our network.

In [10]:
n_cols = predictors_norm.shape[1] # number of predictors

## Import necessary libraries

In [11]:
import keras

Using TensorFlow backend.


In [12]:
from keras.models import Sequential
from keras.layers import Dense

### Build Neural Network model

In [13]:
model = Sequential()
model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
model.add(Dense(1))
    
# compile model
model.compile(optimizer='adam', loss='mean_squared_error')
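
To make the architecture concrete, the following sketch counts the trainable parameters of the network defined above: one hidden `Dense` layer of 10 units followed by a single output unit. It assumes `n_cols = 8`, matching the eight predictor columns in this notebook; the helper `dense_params` is hypothetical and only mirrors the weights-plus-biases arithmetic of a fully connected layer.

```python
def dense_params(n_inputs, n_units):
    """Parameters in a fully connected layer: one weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units

n_cols = 8  # assumption: eight predictor columns, as in predictors_norm
hidden = dense_params(n_cols, 10)  # 8 * 10 weights + 10 biases
output = dense_params(10, 1)       # 10 weights + 1 bias
total = hidden + output
print(hidden, output, total)  # 90 11 101
```

Calling `model.summary()` in Keras should report the same per-layer counts.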

### 1. Randomly split the data into training and test sets by holding 30% of the data for testing. You can use the train_test_split helper function from Scikit-learn.

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
# Hold out 30% of the data for testing; random_state=1 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.30, random_state=1, shuffle=True)
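
The notebook normalizes with statistics computed over the full dataset before splitting. A common alternative, sketched below as an assumption rather than what this notebook does, is to compute the mean and standard deviation on the training rows only and reuse them for the test rows, so that no test information enters preprocessing. The data are synthetic and the helper names (`split_indices`, `fit_scaler`, `apply_scaler`) are hypothetical.

```python
import random
import statistics

def split_indices(n_rows, test_fraction, seed):
    """Shuffle row indices and hold out test_fraction of them for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]  # train, test

def fit_scaler(column):
    """Return (mean, sample std) of the training column."""
    return statistics.mean(column), statistics.stdev(column)

def apply_scaler(column, mean, std):
    """Standardize a column with previously fitted statistics."""
    return [(v - mean) / std for v in column]

# Synthetic single-feature data, for illustration only.
rng = random.Random(0)
feature = [rng.uniform(100.0, 500.0) for _ in range(100)]

train_idx, test_idx = split_indices(len(feature), test_fraction=0.30, seed=1)
train = [feature[i] for i in train_idx]
test = [feature[i] for i in test_idx]

mean, std = fit_scaler(train)               # statistics come from training rows only
train_norm = apply_scaler(train, mean, std)
test_norm = apply_scaler(test, mean, std)   # test rows reuse the training statistics

print(len(train_norm), len(test_norm))  # 70 30
```

With pandas, the same idea would be to compute `X_train.mean()` and `X_train.std()` and apply them to both `X_train` and `X_test`.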

### 2. Train the model on the training data using 50 epochs.

In [16]:
history = model.fit(X_train, y_train, epochs=50, verbose=2)

Epoch 1/50
 - 0s - loss: 1567.0162
Epoch 2/50
 - 0s - loss: 1549.9476
Epoch 3/50
 - 0s - loss: 1533.1200
Epoch 4/50
 - 0s - loss: 1516.3646
Epoch 5/50
 - 0s - loss: 1499.5616
Epoch 6/50
 - 0s - loss: 1482.3764
Epoch 7/50
 - 0s - loss: 1464.7635
Epoch 8/50
 - 0s - loss: 1446.2442
Epoch 9/50
 - 0s - loss: 1427.2840
Epoch 10/50
 - 0s - loss: 1407.5170
Epoch 11/50
 - 0s - loss: 1387.1018
Epoch 12/50
 - 0s - loss: 1365.7773
Epoch 13/50
 - 0s - loss: 1343.8409
Epoch 14/50
 - 0s - loss: 1321.2726
Epoch 15/50
 - 0s - loss: 1297.6959
Epoch 16/50
 - 0s - loss: 1273.4355
Epoch 17/50
 - 0s - loss: 1248.3921
Epoch 18/50
 - 0s - loss: 1223.0643
Epoch 19/50
 - 0s - loss: 1196.4503
Epoch 20/50
 - 0s - loss: 1170.1329
Epoch 21/50
 - 0s - loss: 1142.8898
Epoch 22/50
 - 0s - loss: 1115.0976
Epoch 23/50
 - 0s - loss: 1087.0281
Epoch 24/50
 - 0s - loss: 1058.3560
Epoch 25/50
 - 0s - loss: 1029.6355
Epoch 26/50
 - 0s - loss: 1000.9304
Epoch 27/50
 - 0s - loss: 971.7605
Epoch 28/50
 - 0s - loss: 942.8683
...

### 3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength. You can use the mean_squared_error function from Scikit-learn.

In [17]:
pred = model.predict(X_test)

In [18]:
from sklearn.metrics import mean_squared_error

In [19]:
err = mean_squared_error(y_test, pred)
print(err)

461.6492474547625


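The error is very large for this setup: an MSE of about 461.65 exceeds the variance of `Strength` over the full dataset (16.71² ≈ 279), which is roughly the error that always predicting the mean strength would give. The training loss was also still falling steeply in the logged epochs, so 50 epochs are likely too few for this network to converge.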
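
For reference, the sketch below spells out what a mean squared error is and compares a model's error against a trivial baseline that always predicts the mean of the targets. The numbers are made up for illustration and the function name `mse` is hypothetical; scikit-learn's `mean_squared_error` computes the same average of squared differences for a single target.

```python
def mse(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Illustrative values only, not taken from the concrete dataset.
actual = [30.0, 40.0, 50.0, 60.0]
predicted = [28.0, 43.0, 49.0, 64.0]

model_error = mse(actual, predicted)  # (4 + 9 + 1 + 16) / 4

# Baseline: always predict the mean of the targets.
mean_target = sum(actual) / len(actual)
baseline_error = mse(actual, [mean_target] * len(actual))

print(model_error, baseline_error)  # 7.5 125.0
```

A model is only useful if its error is clearly below this baseline.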

### 4. Repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors.

In [20]:
# Define model1, which is reused across all 50 iterations below
model1 = Sequential()
model1.add(Dense(10, activation='relu', input_shape=(n_cols,)))
model1.add(Dense(1))

# compile model
model1.compile(optimizer='adam', loss='mean_squared_error')

In [21]:
# Array to hold the 50 mean squared errors
error = np.zeros(50)

# Repeat split, fit and evaluate 50 times.
# Note: random_state=1 gives the same split every time, and model1 keeps
# its weights between iterations, so training accumulates across the loop.
for i in range(50):
    X_tr, X_tt, y_tr, y_tt = train_test_split(predictors_norm, target, test_size=0.30, random_state=1, shuffle=True)
    hist = model1.fit(X_tr, y_tr, epochs=50, verbose=0)
    preds = model1.predict(X_tt)
    err = mean_squared_error(y_tt, preds)
    error[i] = err
    print("Error from iteration {}, is {}".format(i+1,err))

Error from iteration 1, is 408.6096390845315
Error from iteration 2, is 197.58088228839287
Error from iteration 3, is 126.90579527153004
Error from iteration 4, is 94.39206526058925
Error from iteration 5, is 81.06747776276107
Error from iteration 6, is 72.13628263967274
Error from iteration 7, is 61.39078025927579
Error from iteration 8, is 56.74311571971817
Error from iteration 9, is 54.05706656539921
Error from iteration 10, is 52.38337341991615
Error from iteration 11, is 51.256775469408105
Error from iteration 12, is 50.25023139826584
Error from iteration 13, is 49.79815514393613
Error from iteration 14, is 49.23746533828033
Error from iteration 15, is 48.85476107988239
Error from iteration 16, is 48.23222949927553
Error from iteration 17, is 47.716365496334106
Error from iteration 18, is 47.47313180637061
Error from iteration 19, is 47.17779632567817
Error from iteration 20, is 46.66156117238517
Error from iteration 21, is 46.6822811416837
Error from iteration 22, is 46.748518135
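
In the loop above, `model1` is created once and `random_state=1` fixes the split, so each iteration continues training the same network on the same data, which is why the errors fall steadily. A hedged sketch of the alternative reading of the task, a new random split and a freshly initialized model on every repetition, is shown below. To stay self-contained it uses a stand-in model that predicts the training mean instead of a Keras network; `repeated_mse`, `fit_predict` and `mean_predictor` are hypothetical names, and the data are synthetic.

```python
import random
import statistics

def mse(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def repeated_mse(fit_predict, X, y, n_repeats=50, test_fraction=0.30, seed=0):
    """Repeat split -> train from scratch -> evaluate, returning one MSE per repetition."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        indices = list(range(len(X)))
        rng.shuffle(indices)                      # a different split each repetition
        n_test = round(len(X) * test_fraction)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        X_tt = [X[i] for i in test_idx]
        y_tt = [y[i] for i in test_idx]
        preds = fit_predict(X_tr, y_tr, X_tt)     # must build and train a new model
        errors.append(mse(y_tt, preds))
    return errors

def mean_predictor(X_tr, y_tr, X_tt):
    """Stand-in for a model: predicts the training-set mean for every test row."""
    return [statistics.mean(y_tr)] * len(X_tt)

# Synthetic data, for illustration only.
data_rng = random.Random(42)
X = [[data_rng.random()] for _ in range(100)]
y = [10.0 * row[0] + data_rng.gauss(0.0, 1.0) for row in X]

errors = repeated_mse(mean_predictor, X, y)
print(len(errors), statistics.mean(errors), statistics.pstdev(errors))
```

In the notebook, `fit_predict` would instead create a new `Sequential` model, compile it, fit it for 50 epochs, and return its predictions for the test rows.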

### 5. Report the mean and the standard deviation of the mean squared errors.

In [22]:
dev = np.std(error)
Mn = np.mean(error)
print("Mean is {}, and std. deviation is {}".format(Mn,dev))

Mean is 61.238712498673, and std. deviation is 55.53675131737623
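
Note that `np.std` divides by n by default (population standard deviation), whereas pandas `std()` divides by n - 1. The sketch below shows the difference using the standard library's `statistics.pstdev` and `statistics.stdev`; the five error values are rounded copies of the first five iterations printed earlier.

```python
import statistics

errors = [408.61, 197.58, 126.91, 94.39, 81.07]  # first five MSEs, rounded

mean_error = statistics.mean(errors)
population_std = statistics.pstdev(errors)  # divides by n, like np.std(errors)
sample_std = statistics.stdev(errors)       # divides by n - 1, like np.std(errors, ddof=1)

print(mean_error, population_std, sample_std)
```

With 50 values the two estimates differ by about 1%, so either is reasonable as long as the choice is stated.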


### How does the mean of the mean squared errors compare to that from Step A?

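Compared with Step A, the mean and the standard deviation of the mean squared errors both fell by more than 50% once the data were normalized. The mean (61.24) and the standard deviation (55.54) are close to each other, which reflects the wide spread between the first iterations (MSE above 400) and the later ones (MSE below 50).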