# Question A

## Looking at the data

In [140]:
import pandas as pd
datadf = pd.read_csv("concrete_data.csv")
datadf.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


## Preparing the data

In [141]:
from sklearn.model_selection import train_test_split

# Features are every column except the target, 'Strength'
X = datadf.drop(columns=['Strength'])
y = datadf['Strength']
# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

## Creating the model as per instruction

In [155]:
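from keras.models import Sequential
from keras.layers import Dense

# define the model; despite its name, this is a regression network (one linear output, MSE loss)
def classification_model(hidden_layers=1):
    # create model
    model = Sequential()
    # First hidden layer: 10 nodes, ReLU activation, 8 input features
    model.add(Dense(10, activation='relu', input_shape=(8,)))
    # Any additional hidden layers, each with 10 nodes and ReLU activation
    for i in range(1, hidden_layers):
        model.add(Dense(10, activation='relu'))
    # Output layer: a single linear unit for the predicted strength
    model.add(Dense(1))
    # Use the adam optimizer and the mean squared error as the loss function.
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model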
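
As a sanity check on the architecture, the number of trainable parameters can be worked out by hand. The sketch below is only an illustration: `count_dense_params` is a hypothetical helper, not part of Keras, and it assumes at least one hidden layer, 10 nodes per hidden layer and a single output unit, as in the model above.

```python
def count_dense_params(n_inputs, hidden_layers=1, width=10):
    """Weights plus biases of a stack of Dense layers ending in one output unit."""
    total = 0
    previous = n_inputs
    for _ in range(hidden_layers):
        total += previous * width + width  # weight matrix + one bias per node
        previous = width
    total += previous * 1 + 1  # single-unit output layer
    return total


one_layer = count_dense_params(8, hidden_layers=1)     # (8*10 + 10) + (10 + 1)
three_layers = count_dense_params(8, hidden_layers=3)  # 90 + 110 + 110 + 11
print(one_layer, three_layers)
```

These totals should match what `model.summary()` reports for the corresponding networks.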

## Training the model

In [143]:
import keras
from keras.models import Sequential
from keras.layers import Dense
model = classification_model()
# Train the model on the training data for 50 epochs, tracking the loss on the test split
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, verbose=0)

<keras.callbacks.callbacks.History at 0x18f4e356488>

## Looking at the predictions

In [144]:
preview = datadf.drop(columns=['Strength'])
preview['Predicted Strength'] = model.predict(preview)
preview['Actual Strength'] = datadf['Strength']
preview.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Predicted Strength,Actual Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,33.964439,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,31.648523,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,65.088371,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,85.351509,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,94.503014,44.3


In [145]:
# evaluate the model
scores = model.evaluate(X_test, y_test, verbose=0)
scores

515.961541629532

The same value can be computed with the `mean_squared_error` function from scikit-learn; the two results agree up to floating-point precision.

In [146]:
from sklearn.metrics import mean_squared_error 
y_true = y_test
y_pred = model.predict(X_test)
mean_squared_error(y_true, y_pred) 

515.961537703522
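
For reference, both cells report the average of the squared differences between actual and predicted values. A minimal sketch of that computation in plain Python follows; the numbers are made up for illustration and are not taken from the concrete data set.

```python
import statistics


def mse(y_true, y_pred):
    """Mean of the squared differences between two equally long sequences."""
    return statistics.fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])


# Illustrative values only
demo_true = [3.0, 5.0, 2.0]
demo_pred = [2.0, 7.0, 2.0]
demo_mse = mse(demo_true, demo_pred)  # (1 + 4 + 0) / 3
print(demo_mse)
```

Because the errors are squared, the score is expressed in squared strength units, which is why the values look large compared with the strengths themselves.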

## Do this 50 times

In [147]:
scores = []
# Note: `model` is not re-created here, so its weights carry over from one split to the next
for i in range(0, 50):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, verbose=0)
    scores.append(model.evaluate(X_test, y_test, verbose=0))

In [148]:
import statistics
print('Mean: ' + str(statistics.mean(scores)))
print('Standard deviation: ' + str(statistics.stdev(scores)))

Mean: 86.68467699612614
Standard deviation: 29.309258474440732
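
The loop above reuses the same `model` object for every split, so each repetition starts from the weights of the previous one. If the intent is 50 independent runs, one option is to build a fresh model inside the loop. The sketch below shows that pattern with the standard library only. `repeated_holdout` and `MeanRegressor` are hypothetical names, and the toy regressor merely stands in for the Keras network.

```python
import random
import statistics


class MeanRegressor:
    """Hypothetical stand-in for the Keras model: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = statistics.fmean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def repeated_holdout(make_model, X, y, n_repeats=50, test_size=0.3, seed=0):
    """Return one test MSE per repetition, building a new model for each split."""
    rng = random.Random(seed)
    n_test = round(len(X) * test_size)
    scores = []
    for _ in range(n_repeats):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        model = make_model()  # untrained model, so nothing leaks between splits
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        y_pred = model.predict([X[i] for i in test_idx])
        y_true = [y[i] for i in test_idx]
        scores.append(statistics.fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))
    return scores


# Illustrative data only (not the concrete data set)
X_demo = [[float(i)] for i in range(20)]
y_demo = [2.0 * i for i in range(20)]
scores_demo = repeated_holdout(MeanRegressor, X_demo, y_demo, n_repeats=5)
print(statistics.mean(scores_demo), statistics.stdev(scores_demo))
```

With the notebook's network, the factory passed as `make_model` could be `classification_model`, with the `fit` call adapted to pass `epochs`. The scores would then differ from the ones printed above.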


# Question B

Normalize the data

In [149]:
from sklearn.preprocessing import StandardScaler
# Standardize every column, including the target 'Strength', to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(datadf)
datascdf = pd.DataFrame(scaler.transform(datadf), columns=datadf.columns)
datascdf.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,2.477915,-0.856888,-0.847144,-0.916764,-0.620448,0.863154,-1.21767,-0.279733,2.645408
1,2.477915,-0.856888,-0.847144,-0.916764,-0.620448,1.056164,-1.21767,-0.279733,1.561421
2,0.491425,0.795526,-0.847144,2.175461,-1.039143,-0.526517,-2.240917,3.553066,0.266627
3,0.491425,0.795526,-0.847144,2.175461,-1.039143,-0.526517,-2.240917,5.057677,0.31334
4,-0.790459,0.678408,-0.847144,0.488793,-1.039143,0.070527,0.647884,4.978487,0.507979
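
What the scaler does to each column can be sketched by hand: subtract the column mean and divide by the population standard deviation. The helper below is an illustration only. It is applied to the five `Age` values shown earlier, so the resulting numbers differ from the table above, which was scaled with the statistics of the full data set.

```python
import statistics


def standardize(column):
    """Z-score a column: zero mean and unit (population) standard deviation."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)  # divides by n, the convention StandardScaler uses
    return [(x - mu) / sigma for x in column]


ages = [28, 28, 270, 365, 360]  # Age values of the first five rows
z_ages = standardize(ages)
print(z_ages)
```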


Do as above, but this time the normalized results must be inversely transformed before computing the mean and the standard deviation of the scores. Since the scaler expects the same columns it was fitted on, this requires quite a lot of boilerplate.

In [150]:
def question_a_an_b(epochs):
    # Uses the global `model`, which keeps training across splits and across calls
    Xsc = datascdf.drop(columns=['Strength'])
    ysc = datascdf['Strength']
    scores = []
    for i in range(0, 50):
        Xsc_train, Xsc_test, ysc_train, ysc_test = train_test_split(Xsc, ysc, test_size=0.3)
        model.fit(Xsc_train, ysc_train, validation_data=(Xsc_test, ysc_test), epochs=epochs, verbose=0)

        # Put the predictions in the last column (where the scaler expects 'Strength')
        # and map the whole frame back to the original units
        result = Xsc_test.copy() # MIND THE COPY, so that Xsc_test itself is left unchanged
        result['Predicted Strength'] = model.predict(Xsc_test)
        result_columns = result.columns
        result = scaler.inverse_transform(result)
        result = pd.DataFrame(result)
        result.columns = result_columns
        y_pred = result['Predicted Strength']

        # Same procedure for the actual values
        result = Xsc_test.copy()
        result['Actual Strength'] = ysc_test
        result_columns = result.columns
        result = scaler.inverse_transform(result)
        result = pd.DataFrame(result)
        result.columns = result_columns
        y_true = result['Actual Strength']

        # MSE in the original (unscaled) strength units
        scores.append(mean_squared_error(y_true, y_pred))
    return scores

scores = question_a_an_b(50)
print('Mean: ' + str(statistics.mean(scores)))
print('Standard deviation: ' + str(statistics.stdev(scores)))

Mean: 34.444658149766916
Standard deviation: 8.674616871528379


The mean before normalization was 86.685, now it is 34.445.

The standard deviation before normalization was 29.309, now it is 8.675.

Normalization decreased the average error and reduced its spread across runs.
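
The boilerplate in `question_a_an_b` exists only to undo the scaling of the target. Because standardization is a linear map, un-scaling multiplies every prediction error by the target's standard deviation, so the MSE in original units equals the scaled MSE times that standard deviation squared. The sketch below checks this on made-up numbers; the helper names are hypothetical.

```python
import statistics

# Illustrative target values only (not the concrete data set)
y = [10.0, 20.0, 30.0, 40.0]
mu = statistics.fmean(y)
sigma = statistics.pstdev(y)  # population std, as used by StandardScaler


def standardize(values):
    return [(v - mu) / sigma for v in values]


def unstandardize(values):
    return [v * sigma + mu for v in values]


def mse(a, b):
    return statistics.fmean([(p - q) ** 2 for p, q in zip(a, b)])


y_true_sc = standardize(y)
y_pred_sc = [v + 0.1 for v in y_true_sc]  # pretend predictions, in scaled units

mse_scaled = mse(y_true_sc, y_pred_sc)
mse_original = mse(unstandardize(y_true_sc), unstandardize(y_pred_sc))
print(mse_scaled, mse_original, sigma ** 2 * mse_scaled)
```

With the fitted scaler from above, the corresponding values for `Strength` should be available as `scaler.mean_[-1]` and `scaler.scale_[-1]`, since `Strength` is the last column of `datadf`.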

# Question C

Doing it again, but this time over 100 epochs each time.

In [156]:
scores = question_a_an_b(100)
print('Mean: ' + str(statistics.mean(scores)))
print('Standard deviation: ' + str(statistics.stdev(scores)))

Mean: 26.202931506081747
Standard deviation: 2.280154534985959


The mean with normalization and 50 epochs was 34.445, now it is 26.203.

The standard deviation with normalization and 50 epochs was 8.675, now it is 2.280.

Using 100 epochs instead of 50 decreased the average error and reduced its spread across runs.

# Question D

In [157]:
model = classification_model(3)
scores = question_a_an_b(50)
print('Mean: ' + str(statistics.mean(scores)))
print('Standard deviation: ' + str(statistics.stdev(scores)))

Mean: 22.851870209000793
Standard deviation: 8.586851404365877


In [158]:
model = classification_model(3)
scores = question_a_an_b(100)
print('Mean: ' + str(statistics.mean(scores)))
print('Standard deviation: ' + str(statistics.stdev(scores)))

Mean: 19.667367694771894
Standard deviation: 4.794289423161374


| Test MSE | **A** **Not** normalized, 1 layer, 50 epochs | **B** Normalized, 1 layer, 50 epochs | **C** Normalized, 1 layer, 100 epochs | **D** Normalized, 3 layers, 50 epochs | **D** Normalized, 3 layers, 100 epochs |
|-|-|-|-|-|-|
| Mean | 86.685 | 34.445 | 26.203 | 22.852 | 19.667 |
| Standard deviation | 29.309 | 8.675 | 2.280 | 8.587 | 4.794 |
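
The table entries are the printed means and standard deviations rounded to three decimals. A small sketch of how such a table could be assembled from the raw outputs follows; `markdown_table` is a hypothetical helper and the column labels are abbreviated.

```python
# Means and standard deviations copied from the cell outputs above
results = {
    "A": (86.68467699612614, 29.309258474440732),
    "B": (34.444658149766916, 8.674616871528379),
    "C": (26.202931506081747, 2.280154534985959),
    "D, 50 epochs": (22.851870209000793, 8.586851404365877),
    "D, 100 epochs": (19.667367694771894, 4.794289423161374),
}


def markdown_table(results, decimals=3):
    """Render {label: (mean, std)} as a Markdown table with one column per label."""
    header = "| | " + " | ".join(results) + " |"
    divider = "|" + "-|" * (len(results) + 1)
    means = "| Mean | " + " | ".join(f"{m:.{decimals}f}" for m, _ in results.values()) + " |"
    stds = "| Standard deviation | " + " | ".join(f"{s:.{decimals}f}" for _, s in results.values()) + " |"
    return "\n".join([header, divider, means, stds])


table = markdown_table(results)
print(table)
```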



Normalizing the data improved the results. With normalized data, training for 100 epochs instead of 50 improved them further, and so did using 3 hidden layers instead of 1, although the standard deviation was lower with the extra epochs than with the extra layers. Combining both (3 layers and 100 epochs) gave the lowest mean error but not the lowest standard deviation, which was obtained with a single layer and 100 epochs. Since simpler models train faster, **the best choice seems to be a single layer with 100 epochs**.