# Regression model with Keras
## A. Build a baseline model (5 marks)
- Define a function baseline_model to create the baseline neural network model with one hidden layer of 10 nodes and ReLU activation.
- Initialize an empty list mse_list to store the mean squared errors calculated in each iteration.
- Loop 50 times, in each iteration:
  - Split the data into training and test sets.
  - Build and train the model on the training data.
  - Evaluate the model on the test data and calculate the mean squared error.
  - Append the mean squared error to the mse_list.
- Calculate the mean and standard deviation of the mean squared errors from the list and print them out.
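
The procedure in the list above can be sketched independently of Keras. The block below is a minimal illustration, not the notebook's implementation: `repeated_mse` and `least_squares_fit_predict` are hypothetical names, the data is synthetic, and an ordinary least-squares fit stands in for the neural network so that the example runs with NumPy alone.

```python
import numpy as np

def repeated_mse(fit_predict, X, y, n_runs=50, test_size=0.3, seed=0):
    """Return the test MSE from n_runs random train/test splits."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_size * len(X)))
    mse_list = []
    for _ in range(n_runs):
        # Fresh random split on every iteration
        idx = rng.permutation(len(X))
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        # Train on the training rows, predict the held-out rows
        y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        mse_list.append(float(np.mean((y[test_idx] - y_pred) ** 2)))
    return mse_list

def least_squares_fit_predict(X_train, y_train, X_test):
    # Stand-in model (assumption): linear least squares with an intercept
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    A_test = np.column_stack([X_test, np.ones(len(X_test))])
    return A_test @ coef

# Synthetic data with 8 predictors, mirroring the shape of the task only
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.normal(size=(200, 8))
y_demo = X_demo @ np.arange(1.0, 9.0) + demo_rng.normal(scale=0.5, size=200)

mse_demo = repeated_mse(least_squares_fit_predict, X_demo, y_demo, n_runs=50)
print(len(mse_demo), np.mean(mse_demo), np.std(mse_demo))
```

Passing the model as a function keeps the split-train-evaluate loop in one place, so the same loop could be reused when the model or the preprocessing changes in later parts.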

Cement data

The dataset is about the compressive strength of different samples of concrete based on the volumes of the different ingredients that were used to make them. The predictors are the seven ingredients below plus the age of the sample:

1. Cement

2. Blast Furnace Slag

3. Fly Ash

4. Water

5. Superplasticizer

6. Coarse Aggregate

7. Fine Aggregate

8. Age

In [None]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense

In [2]:
# Load the data into a Pandas DataFrame
concrete_data = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv')
concrete_data.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


In [3]:
# Splitting the data into features (X) and target (y)
X = concrete_data.drop(columns=['Strength'])
y = concrete_data['Strength']

In [4]:
# Define the baseline regression model
def baseline_model():
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(X.shape[1],)))
    model.add(Dense(1))
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

In [5]:
# Initialize a list to store mean squared errors
mse_list = []

In [6]:
# Repeat steps 1 - 3, 50 times
for _ in range(50):
    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    
    # Build and train the model
    model = baseline_model()
    model.fit(X_train, y_train, epochs=50, verbose=0)
    
    # Evaluate the model on the test data
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)
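
A shape detail in the evaluation step is worth illustrating. For a network ending in `Dense(1)`, `model.predict` returns an array of shape `(n, 1)`, while the targets have shape `(n,)`. `mean_squared_error` handles this, but a hand-written MSE would silently broadcast to an `(n, n)` matrix. The sketch below uses small made-up arrays in place of real predictions.

```python
import numpy as np

# Made-up values standing in for y_test and model.predict(X_test)
y_true = np.array([1.0, 2.0, 3.0])         # shape (3,)
y_pred = np.array([[1.0], [2.0], [4.0]])   # shape (3, 1), like a Dense(1) output

# Pitfall: (3,) minus (3, 1) broadcasts to a (3, 3) matrix of all pairwise differences
broadcast_diff = y_true - y_pred
wrong_mse = np.mean(broadcast_diff ** 2)

# Flattening the predictions first gives the element-wise differences
correct_mse = np.mean((y_true - y_pred.ravel()) ** 2)

print(broadcast_diff.shape)    # (3, 3)
print(wrong_mse, correct_mse)
```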











In [7]:
# Calculate mean and standard deviation of mean squared errors
mean_mse = np.mean(mse_list)
std_mse = np.std(mse_list)
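
As a side note on the standard deviation computed here: `np.std` uses `ddof=0` by default, which is the population formula (divide by N). For a sample of 50 runs the sample formula (`ddof=1`, divide by N - 1) is a common alternative. The values below are made up purely to show the difference.

```python
import numpy as np

# Made-up MSE values, for illustration only
mse_values = np.array([100.0, 120.0, 140.0, 160.0])

population_std = np.std(mse_values)          # ddof=0: divides by N
sample_std = np.std(mse_values, ddof=1)      # ddof=1: divides by N - 1

print(population_std, sample_std)
```

With 50 runs the two differ by about 1%, so the choice does not change any conclusion here, but it should be stated when results are compared with other tools.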

In [15]:
import pprint as pp

print("The 50 MSE from our model:")
pp.pprint(mse_list)

print("Mean Squared Error (MSE) across 50 runs:", mean_mse)
print("Standard Deviation of MSE across 50 runs:", std_mse)

The 50 MSE from our model:
[1449.5564794588665,
 124.81163492182192,
 130.29771170313742,
 118.29256733173607,
 139.51851486946015,
 221.614117210219,
 130.24801397482614,
 153.419446776072,
 100.25656116186184,
 99.02090544033874,
 1818.977193926198,
 117.19773879515962,
 175.15560079891628,
 76.34180551870477,
 332.8392820096209,
 591.2228649748192,
 105.99073926054807,
 125.00643714692251,
 620.5891305043426,
 892.9708182921638,
 109.48945812619155,
 519.7431042269554,
 147.3800970367127,
 151.77851967548673,
 264.29953969090406,
 109.2221610115333,
 142.66981638586824,
 1168.857807707891,
 450.17123486883315,
 349.86013594943984,
 239.2755845969015,
 93.35491152256016,
 351.55826705974243,
 157.41471162887916,
 596.0653165278425,
 130.80606682529276,
 145.39627279033206,
 113.3491761709147,
 451.5830437595016,
 111.77809898441139,
 180.5753125452564,
 421.22247749276755,
 140.56345606416264,
 117.58563127696118,
 1247.5077485139846,
 1772.2532996937737,
 431.7940444342002,
 130.908

## B. Normalize the data (5 marks) 

Repeat Part A but use a normalized version of the data. Recall that one way to normalize the data is by subtracting the mean from the individual predictors and dividing by the standard deviation.

How does the mean of the mean squared errors compare to that from Step A?
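
The normalization described above (subtract the mean, divide by the standard deviation) can be sketched in NumPy as follows. This is an illustrative variant, not the notebook's method: the hypothetical helper `standardize` computes the statistics on the training rows only and reuses them for the test rows, so that the test data does not influence the scaling. The numbers are the Cement and Water values from the first rows shown earlier, plus one made-up test row.

```python
import numpy as np

def standardize(train, test):
    """Z-score both arrays using the training mean and standard deviation."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)             # population std (ddof=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant columns
    return (train - mean) / std, (test - mean) / std

# Columns: Cement, Water (training rows taken from the head of the dataset)
train_demo = np.array([[540.0, 162.0],
                       [332.5, 228.0],
                       [198.6, 192.0]])
test_demo = np.array([[266.0, 228.0]])  # made-up row for illustration

train_z, test_z = standardize(train_demo, test_demo)
print(train_z.mean(axis=0), train_z.std(axis=0))
print(test_z)
```

After the transformation each training column has mean 0 and standard deviation 1; the test rows are scaled with the same constants and will generally not have exactly those statistics.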

In [17]:
# Import libraries
from sklearn.preprocessing import StandardScaler

# Normalize the data
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)

# Define the baseline regression model
def baseline_model():
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(X_normalized.shape[1],)))
    model.add(Dense(1))
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

# Initialize a list to store mean squared errors
mse_list_normalized = []

# Repeat steps 1 - 3, 50 times
for _ in range(50):
    # Split the normalized data into training and test sets
    X_train_norm, X_test_norm, y_train, y_test = train_test_split(X_normalized, y, test_size=0.3)
    
    # Build and train the model
    model = baseline_model()
    model.fit(X_train_norm, y_train, epochs=50, verbose=0)
    
    # Evaluate the model on the normalized test data
    y_pred = model.predict(X_test_norm)
    mse = mean_squared_error(y_test, y_pred)
    mse_list_normalized.append(mse)

# Calculate mean and standard deviation of mean squared errors
mean_mse_normalized = np.mean(mse_list_normalized)
std_mse_normalized = np.std(mse_list_normalized)

print("Mean Squared Error (MSE) across 50 runs with normalized data:", mean_mse_normalized)
print("Standard Deviation of MSE across 50 runs with normalized data:", std_mse_normalized)




Mean Squared Error (MSE) across 50 runs with normalized data: 370.48961542065786
Standard Deviation of MSE across 50 runs with normalized data: 123.3840618756732


## C. Increase the number of epochs (5 marks)

Repeat Part B but use 100 epochs this time for training.

How does the mean of the mean squared errors compare to that from Step B?

In [18]:
# Splitting the data into features (X) and target (y)
X = concrete_data.drop(columns=['Strength'])
y = concrete_data['Strength']

# Normalize the data
scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)

# Define the baseline regression model
def baseline_model():
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(X_normalized.shape[1],)))
    model.add(Dense(1))
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

# Initialize a list to store mean squared errors
mse_list_normalized_100_epochs = []

# Repeat steps 1 - 3, 50 times
for _ in range(50):
    # Split the normalized data into training and test sets
    X_train_norm, X_test_norm, y_train, y_test = train_test_split(X_normalized, y, test_size=0.3)
    
    # Build and train the model with 100 epochs
    model = baseline_model()
    model.fit(X_train_norm, y_train, epochs=100, verbose=0)
    
    # Evaluate the model on the normalized test data
    y_pred = model.predict(X_test_norm)
    mse = mean_squared_error(y_test, y_pred)
    mse_list_normalized_100_epochs.append(mse)

# Calculate mean and standard deviation of mean squared errors
mean_mse_normalized_100_epochs = np.mean(mse_list_normalized_100_epochs)
std_mse_normalized_100_epochs = np.std(mse_list_normalized_100_epochs)

print("Mean Squared Error (MSE) across 50 runs with normalized data and 100 epochs:", mean_mse_normalized_100_epochs)
print("Standard Deviation of MSE across 50 runs with normalized data and 100 epochs:", std_mse_normalized_100_epochs)




Mean Squared Error (MSE) across 50 runs with normalized data and 100 epochs: 164.51256072361764
Standard Deviation of MSE across 50 runs with normalized data and 100 epochs: 17.33833984634371


## D. Increase the number of hidden layers (5 marks)

Repeat part B but use a neural network with the following instead:

- Three hidden layers, each of 10 nodes and ReLU activation function.

How does the mean of the mean squared errors compare to that from Step B?
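
To make the difference in model size concrete, the number of trainable parameters of each architecture can be counted by hand: a dense layer with `fan_in` inputs and `units` nodes has `fan_in * units` weights plus `units` biases. The helper `dense_param_count` below is a hypothetical name, and the count assumes 8 input features, as in the predictor list above.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_sizes:
        total += fan_in * units + units  # weight matrix plus bias vector
        fan_in = units
    return total

# One hidden layer of 10 nodes, then a single output node
one_hidden = dense_param_count(8, [10, 1])
# Three hidden layers of 10 nodes each, then a single output node
three_hidden = dense_param_count(8, [10, 10, 10, 1])

print(one_hidden, three_hidden)  # 101 321
```

The three-hidden-layer network therefore has roughly three times as many parameters to fit from the same training data.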

In [19]:
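# Splitting the data into features (X) and target (y).
# Note: this cell trains on the raw (unnormalized) X. To repeat Part B as the
# task asks, X_normalized would have to be passed to train_test_split instead.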
X = concrete_data.drop(columns=['Strength'])
y = concrete_data['Strength']

# Define the updated regression model with three hidden layers
def updated_model():
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(X.shape[1],)))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

# Initialize a list to store mean squared errors
mse_list_updated = []

# Repeat steps 1 - 3, 50 times
for _ in range(50):
    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    
    # Build and train the updated model
    model = updated_model()
    model.fit(X_train, y_train, epochs=50, verbose=0)
    
    # Evaluate the updated model on the test data
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_list_updated.append(mse)

# Calculate mean and standard deviation of mean squared errors for the updated model
mean_mse_updated = np.mean(mse_list_updated)
std_mse_updated = np.std(mse_list_updated)

print("Mean Squared Error (MSE) across 50 runs for the updated model:", mean_mse_updated)
print("Standard Deviation of MSE across 50 runs for the updated model:", std_mse_updated)


Mean Squared Error (MSE) across 50 runs for the updated model: 117.38601020983754
Standard Deviation of MSE across 50 runs for the updated model: 45.92629442755122


## Summary
### Data Normalization (Part B vs. Part A):
Normalizing the data left the mean squared error essentially unchanged (361.17 before, 370.49 after, a difference that is small relative to the run-to-run spread) but sharply reduced the standard deviation of the errors (from 422.54 to 123.38). Normalization therefore mainly stabilizes the model's performance across different runs.
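
Whether a difference between two mean MSEs is meaningful can be gauged roughly with the standard error of each mean, i.e. the standard deviation divided by the square root of the number of runs. The sketch below applies this to the Part A and Part B figures quoted above; it treats the runs as independent and ignores the distinction between population and sample standard deviation, so it is an approximation rather than a formal test.

```python
import math

n_runs = 50
mean_a, std_a = 361.17, 422.54   # Part A figures quoted in the summary
mean_b, std_b = 370.49, 123.38   # Part B figures quoted in the summary

# Standard error of each mean over 50 runs
se_a = std_a / math.sqrt(n_runs)
se_b = std_b / math.sqrt(n_runs)

# Difference of the means and its approximate standard error
diff = mean_b - mean_a
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)

print(round(se_a, 2), round(se_b, 2))
print(round(diff, 2), round(se_diff, 2))
```

The difference between the two means is much smaller than its standard error, so the two means cannot be told apart on the basis of these runs.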

### Increased Epochs (Part C vs. Part B):
Increasing the number of epochs from 50 to 100 more than halved the mean squared error (from 370.49 to 164.51) and reduced the standard deviation about sevenfold (from 123.38 to 17.34). This suggests that 50 epochs were not enough for the model to converge: with longer training it fits the data better and more consistently across runs.

### Additional Hidden Layers (Part D vs. Part B):
Replacing the single hidden layer with three hidden layers of 10 nodes each decreased the mean squared error (from 370.49 to 117.39) and the standard deviation (from 123.38 to 45.93). The Part D code trains on the unnormalized features, however, so this comparison changes the input scaling as well as the depth. Relative to Part C (100 epochs, one hidden layer), the three-layer model has a lower mean (117.39 vs. 164.51) but a higher standard deviation (45.93 vs. 17.34), so it performs better on average and less consistently across runs.

In summary, training for more epochs and using more hidden layers both lowered the mean squared error, while normalization mainly reduced the variability across runs rather than the mean. There is also a trade-off between average error and stability: the three-hidden-layer model reached the lowest mean MSE but varied more from run to run than the single-layer model trained for 100 epochs.