# Part A

Let's start by importing the <em>pandas</em> and the Numpy libraries.

In [130]:
import pandas as pd
import numpy as np

We will be playing around with the same dataset that we used in the videos.

<strong>The dataset is about the compressive strength of different samples of concrete based on the volumes of the different ingredients that were used to make them. Ingredients include:</strong>

<strong>1. Cement</strong>

<strong>2. Blast Furnace Slag</strong>

<strong>3. Fly Ash</strong>

<strong>4. Water</strong>

<strong>5. Superplasticizer</strong>

<strong>6. Coarse Aggregate</strong>

<strong>7. Fine Aggregate</strong>

Let's download the data and read it into a <em>pandas</em> dataframe.

In [131]:
concrete_data = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv')
concrete_data.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


So the first concrete sample has 540 cubic meter of cement, 0 cubic meter of blast furnace slag, 0 cubic meter of fly ash, 162 cubic meter of water, 2.5 cubic meter of superplaticizer, 1040 cubic meter of coarse aggregate, 676 cubic meter of fine aggregate. Such a concrete mix which is 28 days old, has a compressive strength of 79.99 MPa. 

#### Let's check how many data points we have.

In [132]:
concrete_data.shape

(1030, 9)

So, there are approximately 1000 samples to train our model on. Because of the few samples, we have to be careful not to overfit the training data.

Let's check the dataset for any missing values.

In [133]:
concrete_data.describe()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [134]:
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

The data looks very clean and is ready to be used to build our model.

#### Split data into predictors and target

The target variable in this problem is the concrete sample strength. Therefore, our predictors will be all the other columns.

In [135]:
concrete_data_columns = concrete_data.columns

predictors = concrete_data[concrete_data_columns[concrete_data_columns != 'Strength']] # all columns except Strength
target = concrete_data['Strength'] # Strength column

<a id="item2"></a>

Let's do a quick sanity check of the predictors and the target dataframes.

In [136]:
predictors.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [137]:
target.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

# Splitting into traning and test sets

In [138]:
from sklearn.model_selection import train_test_split

In [139]:
X_train, X_test, y_train, y_test = train_test_split(
     predictors, target, test_size=0.3, random_state=42)

In [140]:
X_train.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
196,194.7,0.0,100.5,165.6,7.5,1006.4,905.9,28
631,325.0,0.0,0.0,184.0,0.0,1063.0,783.0,7
81,318.8,212.5,0.0,155.7,14.3,852.1,880.4,3
526,359.0,19.0,141.0,154.0,10.9,942.0,801.0,3
830,162.0,190.0,148.0,179.0,19.0,838.0,741.0,28


In [141]:
X_test.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
31,266.0,114.0,0.0,228.0,0.0,932.0,670.0,365
109,362.6,189.0,0.0,164.9,11.6,944.7,755.8,7
136,389.9,189.0,0.0,145.9,22.0,944.7,755.8,28
88,362.6,189.0,0.0,164.9,11.6,944.7,755.8,3
918,145.0,0.0,179.0,202.0,8.0,824.0,869.0,28


In [142]:
print(y_train.head(),"\n",y_test.head())

196    25.72
631    17.54
81     25.20
526    23.64
830    33.76
Name: Strength, dtype: float64 
 31     52.91
109    55.90
136    74.50
88     35.30
918    10.54
Name: Strength, dtype: float64


In [143]:
n_cols = X_train.shape[1] # number of predictors
n_cols

8

# Importing Keras

In [148]:
import keras
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import mean_squared_error

In [169]:
# define regression model
def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

In [170]:
# build the model
model = regression_model()

# fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test),epochs=50, verbose=2) 

scores = model.evaluate(X_test, y_test, verbose=0)
scores

Train on 721 samples, validate on 309 samples
Epoch 1/50
 - 0s - loss: 62331.1385 - val_loss: 48732.2892
Epoch 2/50
 - 0s - loss: 38487.2074 - val_loss: 29863.8592
Epoch 3/50
 - 0s - loss: 23159.4678 - val_loss: 17727.4231
Epoch 4/50
 - 0s - loss: 13327.5659 - val_loss: 9947.4176
Epoch 5/50
 - 0s - loss: 7230.6246 - val_loss: 5315.9222
Epoch 6/50
 - 0s - loss: 3868.1501 - val_loss: 2856.4456
Epoch 7/50
 - 0s - loss: 2283.0024 - val_loss: 1777.5971
Epoch 8/50
 - 0s - loss: 1689.8207 - val_loss: 1361.6078
Epoch 9/50
 - 0s - loss: 1484.8479 - val_loss: 1229.3305
Epoch 10/50
 - 0s - loss: 1413.8751 - val_loss: 1155.6715
Epoch 11/50
 - 0s - loss: 1367.9756 - val_loss: 1112.1226
Epoch 12/50
 - 0s - loss: 1325.9028 - val_loss: 1072.9809
Epoch 13/50
 - 0s - loss: 1285.3086 - val_loss: 1043.9006
Epoch 14/50
 - 0s - loss: 1244.6341 - val_loss: 1005.9979
Epoch 15/50
 - 0s - loss: 1204.5852 - val_loss: 974.0061
Epoch 16/50
 - 0s - loss: 1166.7144 - val_loss: 942.8566
Epoch 17/50
 - 0s - loss: 1129

312.8867878836721

In [171]:
yhat = model.predict(X_test, verbose=0)

In [172]:
mean_squared_error(y_test, yhat)

312.8867820595263

#### . Repeating  50 times, i.e., create a list of 50 mean squared errors.

In [167]:
mean_squared_errors = []
for i in range(50):
    X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.3)
    model = regression_model()
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, verbose=0)
    yhat = model.predict(X_test, verbose=0)
    mse = mean_squared_error(yhat, y_pred)
    mean_squared_errors.append(mse)
    print(i, mse)

0 274.84244
1 297.7638
2 363.07367
3 1168.5137
4 261.04865
5 237.53903
6 254.795
7 326.63596
8 561.9569
9 1164.1183
10 267.80872
11 221.44736
12 235.82582
13 284.56467
14 279.14795
15 289.59528
16 338.42255
17 483.0723
18 277.6725
19 259.70218
20 568.8101
21 251.46298
22 388.0746
23 517.5044
24 242.87762
25 266.01486
26 329.07208
27 326.9737
28 693.75446
29 454.33423
30 318.132
31 1469.3091
32 651.8206
33 234.3149
34 288.67764
35 230.08131
36 286.35074
37 431.16934
38 727.0083
39 222.43942
40 1308.1841
41 386.39627
42 225.35707
43 186.40585
44 969.60657
45 214.34952
46 653.0831
47 190.50063
48 297.6643
49 245.23883


In [168]:
print(np.mean(mean_squared_errors))
print(np.std(mean_squared_errors))

428.4503
298.84494


## mean = 428 & std = 298

# Part B

# Normalizing the data

In [180]:
X_train_norm = (X_train - X_train.mean()) / X_train.std()
y_train_norm = (y_train - y_train.mean()) / y_train.std()


X_train_norm.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
219,-1.094577,-0.866874,1.682062,-0.201728,-0.302889,1.113327,0.092485,-0.682444
913,0.162154,-0.866874,0.806307,-0.793779,1.128632,-0.25509,0.141573,-0.292751
164,1.372198,0.344848,-0.858094,-1.291101,1.718081,-1.562602,1.439262,0.689275
735,0.648077,-0.866874,-0.858094,0.532416,-1.060752,1.079635,0.456241,1.141319
704,-0.29328,1.034493,-0.858094,1.077103,-1.060752,0.063689,-0.989971,-0.682444


In [174]:
# define regression model
def regression_modelB():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

In [175]:
model = regression_modelB()

# fit the model
model.fit(X_train_norm, y_train_norm, validation_data=(X_test_norm, y_test_norm), epochs=50, verbose=2)

# evaluate the model
scores = model.evaluate(X_test_norm, y_test_norm, verbose=0)
scores

Train on 721 samples, validate on 309 samples
Epoch 1/50
 - 0s - loss: 1.2484 - val_loss: 1.0128
Epoch 2/50
 - 0s - loss: 0.9388 - val_loss: 0.7954
Epoch 3/50
 - 0s - loss: 0.7630 - val_loss: 0.6702
Epoch 4/50
 - 0s - loss: 0.6630 - val_loss: 0.6048
Epoch 5/50
 - 0s - loss: 0.6094 - val_loss: 0.5608
Epoch 6/50
 - 0s - loss: 0.5733 - val_loss: 0.5285
Epoch 7/50
 - 0s - loss: 0.5474 - val_loss: 0.5025
Epoch 8/50
 - 0s - loss: 0.5252 - val_loss: 0.4820
Epoch 9/50
 - 0s - loss: 0.5074 - val_loss: 0.4629
Epoch 10/50
 - 0s - loss: 0.4911 - val_loss: 0.4465
Epoch 11/50
 - 0s - loss: 0.4769 - val_loss: 0.4320
Epoch 12/50
 - 0s - loss: 0.4623 - val_loss: 0.4176
Epoch 13/50
 - 0s - loss: 0.4492 - val_loss: 0.4054
Epoch 14/50
 - 0s - loss: 0.4371 - val_loss: 0.3943
Epoch 15/50
 - 0s - loss: 0.4260 - val_loss: 0.3841
Epoch 16/50
 - 0s - loss: 0.4155 - val_loss: 0.3745
Epoch 17/50
 - 0s - loss: 0.4062 - val_loss: 0.3654
Epoch 18/50
 - 0s - loss: 0.3970 - val_loss: 0.3573
Epoch 19/50
 - 0s - loss: 0

0.1820443809321783

## mean and std 

In [179]:
mean_squared_errors = []
for i in range(50):
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)
    model = regression_modelB()
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, verbose=0)
    yhat = model.predict(X_test, verbose=0)
    mse = mean_squared_error(y_test, yhat)
    mean_squared_errors.append(mse) 
    print(i, mse)

0 290.9822009366341
1 236.8994414349722
2 456.12914203407155
3 272.9973084238885
4 407.8842179756613
5 326.9726959064244
6 274.8956327746963
7 303.15395452750414
8 343.01898559270006
9 285.99162363881214
10 295.40897978580995
11 209.81835747523206
12 489.3348356279121
13 261.25849753743677
14 405.85085090879517
15 295.12501138596974
16 360.3906738093616
17 280.6329534866911
18 247.9584381978255
19 275.84663472257
20 251.11570254272863
21 422.04637763138646
22 295.2588744304219
23 371.70984326063973
24 363.9526258329691
25 441.3624672417323
26 310.5508820991759
27 363.61452499553155
28 421.3293334878431
29 261.61156992595204
30 264.69101855871423
31 448.4898879907667
32 412.37187533933985
33 270.9038179783254
34 272.92202546366826
35 293.09617432918776
36 243.426792900958
37 310.58642677862974
38 323.27882115526813
39 278.55285771497716
40 307.98356868566174
41 402.8419550817915
42 301.1444866408699
43 362.2539475503274
44 519.3456032079444
45 320.3914872647727
46 287.4404097559864
47 3

In [181]:
print(np.mean(mean_squared_errors))
print(np.std(mean_squared_errors))

327.36819750784815
69.83400968919267


### for normalized data | mean = 327.36 & std = 69.83

# Part C

In [182]:
mean_squared_errors = []
for i in range(50):
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)
    model = regression_modelB()
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, verbose=0)
    yhat = model.predict(X_test, verbose=0)
    mse = mean_squared_error(yhat, y_pred)
    mean_squared_errors.append(mse) 
    print(i, mse)

0 268.9686
1 290.74677
2 304.6586
3 321.54758
4 306.35614
5 320.44824
6 298.24747
7 334.34604
8 307.47678
9 311.57092
10 280.81317
11 306.33682
12 241.70374
13 279.28763
14 290.51718
15 284.39948
16 303.6806
17 301.2023
18 337.41553
19 308.29684
20 315.73154
21 302.352
22 301.37598
23 288.43713
24 266.26392
25 300.77557
26 321.00552
27 358.8969
28 300.14728
29 288.85822
30 334.92175
31 311.39874
32 341.98224
33 286.80594
34 280.37772
35 311.89917
36 267.73563
37 298.72177
38 352.45975
39 292.05948
40 338.17136
41 297.7308
42 292.45828
43 279.71313
44 272.88602
45 329.5979
46 271.29068
47 289.81937
48 288.3032
49 314.36548


In [183]:
print(np.mean(mean_squared_errors))
print(np.std(mean_squared_errors))

301.89124
23.359968


### for normalized data and 100 epochs | mean = 301.89 & std = 23.35

# Part D

In [184]:
def regression_modelD():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

In [186]:
model = regression_modelD()

# fit the model
model.fit(X_train_norm, y_train_norm, validation_data=(X_test_norm, y_test_norm), epochs=50, verbose=2)

# evaluate the model
scores = model.evaluate(X_test_norm, y_test_norm, verbose=0)
scores

Train on 721 samples, validate on 309 samples
Epoch 1/50
 - 0s - loss: 1.0269 - val_loss: 0.9360
Epoch 2/50
 - 0s - loss: 0.8484 - val_loss: 0.7752
Epoch 3/50
 - 0s - loss: 0.7122 - val_loss: 0.6382
Epoch 4/50
 - 0s - loss: 0.6127 - val_loss: 0.5442
Epoch 5/50
 - 0s - loss: 0.5468 - val_loss: 0.4864
Epoch 6/50
 - 0s - loss: 0.5033 - val_loss: 0.4485
Epoch 7/50
 - 0s - loss: 0.4746 - val_loss: 0.4182
Epoch 8/50
 - 0s - loss: 0.4487 - val_loss: 0.3949
Epoch 9/50
 - 0s - loss: 0.4288 - val_loss: 0.3719
Epoch 10/50
 - 0s - loss: 0.4081 - val_loss: 0.3509
Epoch 11/50
 - 0s - loss: 0.3927 - val_loss: 0.3327
Epoch 12/50
 - 0s - loss: 0.3772 - val_loss: 0.3167
Epoch 13/50
 - 0s - loss: 0.3604 - val_loss: 0.3025
Epoch 14/50
 - 0s - loss: 0.3471 - val_loss: 0.2878
Epoch 15/50
 - 0s - loss: 0.3323 - val_loss: 0.2733
Epoch 16/50
 - 0s - loss: 0.3171 - val_loss: 0.2605
Epoch 17/50
 - 0s - loss: 0.3031 - val_loss: 0.2491
Epoch 18/50
 - 0s - loss: 0.2888 - val_loss: 0.2345
Epoch 19/50
 - 0s - loss: 0

0.1402108401156552

In [191]:
mean_squared_errors = []
for i in range(50):
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)    
    model = regression_modelD()
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, verbose=0)
    yhat = model.predict(X_test, verbose=0)
    mse = mean_squared_error(yhat, y_pred)
    mean_squared_errors.append(mse)
    print(i, mse)

0 277.13687
1 270.72336
2 286.53787
3 248.34433
4 261.07468
5 250.75339
6 264.34106
7 284.7224
8 255.88115
9 276.42813
10 292.78903
11 294.42728
12 299.28326
13 252.6857
14 285.8546
15 281.9327
16 332.0602
17 244.86868
18 300.89056
19 316.15485
20 261.0397
21 237.36183
22 275.91916
23 284.91025
24 291.0709
25 276.51163
26 313.15417
27 254.26613
28 266.3056
29 309.0993
30 261.75684
31 306.1691
32 298.80032
33 324.15485
34 308.54868
35 305.7513
36 300.0947
37 241.34555
38 290.33563
39 291.84094
40 287.8569
41 252.24797
42 309.74506
43 264.58322
44 287.91348
45 259.09348
46 278.07614
47 285.56714
48 265.19818
49 260.25937


In [192]:
print(np.mean(mean_squared_errors))
print(np.std(mean_squared_errors))

280.51736
22.630545


### for normalized data , 50 epochs and 3 hidden layers | mean = 280.51 & std = 22.63

----

## part A | mean = 428.45 & std = 298.84
## part B | mean = 327.36 & std = 69.83
## part C | mean = 301.89 & std = 23.35
## part D | mean = 280.51 & std = 22.63

----