# Compiling a Model: Loss Functions, Optimizers, and Metrics in TensorFlow
This notebook introduces three fundamental concepts in TensorFlow:
- **Loss Functions**: Measure how far predictions are from true values.
- **Optimizers**: Update model weights to minimize the loss.
- **Metrics**: Evaluate model performance during training and testing.

We will explore each concept with explanations and code examples.

## Loss Functions

Before training a Keras model, we need to compile it. The `compile()` function tells TensorFlow how the model should learn. One of the most important parts of this setup is the loss function.

The loss function measures how far off the model’s predictions are from the actual targets. Training is all about reducing this value: the smaller the loss, the closer the predictions are to reality.

Different problems call for different loss functions: regression tasks often use Mean Squared Error, binary classification uses Binary Crossentropy, and multi-class classification uses Categorical Crossentropy. Choosing the right loss is essential because it directly shapes how your model learns.

### Loss Functions for Regression Problems

Regression problems involve predicting continuous values, like house prices or temperature.

- **Mean Squared Error (MSE):** This is the most common loss function for regression. It calculates the average of the squared differences between predictions and true values. Squaring the difference penalizes larger errors more heavily and ensures the result is never negative.
$$
MSE = \frac{1}{N} \sum_{i=1}^{N} \big(y_{true,i} - y_{pred,i}\big)^2
$$

Where N is the number of samples. You typically specify it in Keras using the string identifier `mean_squared_error` or `mse`.

In [None]:
import tensorflow as tf

# Sample data
y_true = tf.constant([[1.], [0.], [1.]])
y_pred = tf.constant([[0.9], [0.2], [0.8]])

# --- Manual implementation with for loop ---
mse_manual = 0.0
N = y_true.shape[0]
for i in range(N):
    mse_manual += (float(y_true[i, 0]) - float(y_pred[i, 0]))**2
mse_manual /= N
print("Manual MSE:", mse_manual)

# --- TensorFlow/Keras implementation ---
mse_fn = tf.keras.losses.MeanSquaredError()
mse_value = mse_fn(y_true, y_pred)
print("tf.keras MSE:", mse_value.numpy())


- **Mean Absolute Error (MAE):** This loss function calculates the average of the absolute differences between predictions and true values. Unlike MSE, MAE doesn't square the errors, making it less sensitive to outliers. If your dataset contains significant outliers that you don't want to dominate the loss, MAE might be a better choice.
$$
MAE = \frac{1}{N} \sum_{i=1}^{N} \left| y_{true,i} - y_{pred,i} \right|
$$

You can specify it using `mean_absolute_error` or `mae`.

In [None]:
import tensorflow as tf

# Sample data
y_true = tf.constant([[1.], [0.], [1.]])
y_pred = tf.constant([[0.9], [0.2], [0.8]])

# --- Manual implementation with for loop ---
mae_manual = 0.0
N = y_true.shape[0]
for i in range(N):
    mae_manual += abs(float(y_true[i, 0]) - float(y_pred[i, 0]))
mae_manual /= N
print("Manual MAE:", mae_manual)

# --- TensorFlow/Keras implementation ---
mae_fn = tf.keras.losses.MeanAbsoluteError()
mae_value = mae_fn(y_true, y_pred)
print("tf.keras MAE:", mae_value.numpy())
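
To make the outlier claim concrete, the following minimal sketch compares the two losses on a made-up list of errors, with and without a single large miss. The error values and the helper names `mse_of` and `mae_of` are illustrative assumptions, not part of Keras.

```python
# Hypothetical prediction errors: four small misses, then one large outlier.
errors_clean = [0.5, -0.5, 0.5, -0.5, 0.5]
errors_outlier = [0.5, -0.5, 0.5, -0.5, 10.0]

def mse_of(errors):
    """Mean of the squared errors."""
    return sum(e ** 2 for e in errors) / len(errors)

def mae_of(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

for name, errs in [("clean", errors_clean), ("with outlier", errors_outlier)]:
    print(f"{name:>12}: MSE={mse_of(errs):.2f}  MAE={mae_of(errs):.2f}")
```

A single outlier inflates the MSE far more than the MAE, which is why a model trained with MSE works harder to fit outliers.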


### Loss Functions for Classification Problems

Classification problems involve predicting a discrete class label, like identifying spam emails or classifying images.

- **Binary Crossentropy:** Use this loss function for binary (two-class) classification problems. It measures the distance between the true probability distribution (e.g., [0, 1] or [1, 0]) and the predicted probability distribution. It expects the model's final layer to have a single output unit with a sigmoid activation function (outputting a probability between 0 and 1), and the target values should be 0 or 1. The formula for binary crossentropy for a single prediction is:
$$
Loss = - \Big( y_{true} \cdot \log(y_{pred}) + (1 - y_{true}) \cdot \log(1 - y_{pred}) \Big)
$$

The final loss is averaged over all samples. Specify it using the string `binary_crossentropy`.

In [None]:
import tensorflow as tf
import math

# Example data: binary classification
y_true = tf.constant([[1.], [0.], [1.]])
y_pred = tf.constant([[0.9], [0.2], [0.8]])

# --- Manual implementation ---
bce_manual = 0.0
N = y_true.shape[0]
eps = 1e-7  # to avoid log(0)
for i in range(N):
    yt = float(y_true[i, 0])
    yp = float(y_pred[i, 0])
    bce_manual += -(yt * math.log(yp + eps) + (1 - yt) * math.log(1 - yp + eps))
bce_manual /= N
print("Manual Binary Crossentropy:", bce_manual)

# --- TensorFlow/Keras implementation ---
bce_fn = tf.keras.losses.BinaryCrossentropy()
bce_value = bce_fn(y_true, y_pred)
print("tf.keras BinaryCrossentropy:", bce_value.numpy())
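
The per-sample formula can also be evaluated by hand to see how sharply the loss grows as a prediction becomes confidently wrong. This is a small sketch; `bce_single` is a hypothetical helper name and the probabilities are arbitrary.

```python
import math

def bce_single(y_true_value, y_pred_value):
    """Binary crossentropy for one prediction, following the formula above."""
    return -(y_true_value * math.log(y_pred_value)
             + (1 - y_true_value) * math.log(1 - y_pred_value))

# The true label is 1; the predicted probability moves from
# confidently right (0.9) to confidently wrong (0.1).
for p in [0.9, 0.5, 0.1]:
    print(f"y_true=1, y_pred={p}: loss={bce_single(1, p):.3f}")
```

The loss is small when the predicted probability agrees with the label and grows without bound as the prediction approaches the wrong extreme.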



- **Categorical Crossentropy:** This is the standard loss function for multi-class classification when your target labels are one-hot encoded. For example, if you have three classes, the targets might look like [1, 0, 0], [0, 1, 0], or [0, 0, 1]. It expects the model's final layer to have C output units (where C is the number of classes) and use a softmax activation function, which outputs a probability distribution across the classes. The formula for a single sample is:

$$
Loss = - \sum_{c=1}^{C} y_{true,c} \cdot \log(y_{pred,c})
$$

Where C is the number of classes. The final loss is averaged over all samples. Specify it using the string `categorical_crossentropy`.

In [None]:
# Example data: 3-class classification, one-hot labels
y_true = tf.constant([[1,0,0], [0,1,0], [0,0,1]], dtype=tf.float32)
y_pred = tf.constant([[0.7,0.2,0.1], [0.1,0.8,0.1], [0.2,0.2,0.6]], dtype=tf.float32)

# --- Manual implementation ---
cce_manual = 0.0
N = y_true.shape[0]
eps = 1e-7
for i in range(N):
    for c in range(y_true.shape[1]):
        cce_manual += -float(y_true[i][c]) * math.log(float(y_pred[i][c]) + eps)
cce_manual /= N
print("Manual Categorical Crossentropy:", cce_manual)

# --- TensorFlow/Keras implementation ---
cce_fn = tf.keras.losses.CategoricalCrossentropy()
cce_value = cce_fn(y_true, y_pred)
print("tf.keras CategoricalCrossentropy:", cce_value.numpy())


- **Sparse Categorical Crossentropy:** This loss function serves the same purpose as categorical crossentropy but is used when your target labels are provided as integers (e.g., 0, 1, 2 for three classes) rather than one-hot encoded vectors. This is often more convenient as it avoids the need to explicitly convert integer labels to one-hot vectors. The model output requirements (C units, softmax activation) remain the same as for categorical crossentropy. Specify it using `sparse_categorical_crossentropy`. This often saves memory and computation compared to using `categorical_crossentropy` with explicitly one-hot encoded labels, especially for a large number of classes.

In [None]:
# Example data: 3-class classification, integer labels
y_true = tf.constant([0, 1, 2])  # class indices
y_pred = tf.constant([[0.7,0.2,0.1], [0.1,0.8,0.1], [0.2,0.2,0.6]], dtype=tf.float32)

# --- Manual implementation ---
scce_manual = 0.0
N = y_true.shape[0]
eps = 1e-7
for i in range(N):
    true_class = int(y_true[i])
    scce_manual += -math.log(float(y_pred[i][true_class]) + eps)
scce_manual /= N
print("Manual Sparse Categorical Crossentropy:", scce_manual)

# --- TensorFlow/Keras implementation ---
scce_fn = tf.keras.losses.SparseCategoricalCrossentropy()
scce_value = scce_fn(y_true, y_pred)
print("tf.keras SparseCategoricalCrossentropy:", scce_value.numpy())


## Optimizers

While basic gradient descent provides the foundation, several more sophisticated optimizers have been developed to improve convergence speed and stability. TensorFlow's Keras API provides easy access to many of them. Here are some of the most frequently used:

### SGD (Stochastic Gradient Descent)

This is the classic optimizer. Instead of calculating the gradient using the entire dataset (which is computationally expensive), SGD estimates the gradient using a small random subset of the data called a mini-batch. It is computationally efficient and conceptually simple. However, its updates can be noisy, convergence can be slow, it is sensitive to the choice of learning rate, and it may get stuck in local minima or saddle points more easily than adaptive methods.

Keras's SGD optimizer often includes enhancements:

- **Momentum:** This introduces a "velocity" component. Updates accumulate momentum in a consistent direction, helping to accelerate convergence, especially through flat regions or shallow ravines, and dampening oscillations.
- **Nesterov Momentum:** A slight variation of momentum that often provides faster convergence in practice. It calculates the gradient "looking ahead" slightly in the direction of the momentum update.


In [None]:
# Basic SGD
sgd_optimizer_basic = tf.keras.optimizers.SGD(learning_rate=0.01)

# SGD with momentum
sgd_optimizer_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# SGD with Nesterov momentum
sgd_optimizer_nesterov = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
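
To illustrate what the `momentum` and `nesterov` arguments change, here is a minimal pure-Python sketch of the update rule on a single scalar weight. It follows the velocity formulation described in the Keras SGD documentation. The function name `sgd_minimize` and the flat quadratic loss are assumptions chosen for illustration.

```python
def sgd_minimize(grad_fn, w0, learning_rate, steps, momentum=0.0, nesterov=False):
    """Run SGD (optionally with momentum) on a single scalar weight."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        g = grad_fn(w)
        # The velocity accumulates past gradients; momentum=0 gives plain SGD.
        velocity = momentum * velocity - learning_rate * g
        if nesterov:
            # Nesterov variant: apply the momentum term once more ("look ahead").
            w = w + momentum * velocity - learning_rate * g
        else:
            w = w + velocity
    return w

# Hypothetical flat loss surface f(w) = 0.01 * w**2, whose gradient is 0.02 * w.
def flat_gradient(w):
    return 0.02 * w

w_plain = sgd_minimize(flat_gradient, 5.0, learning_rate=0.1, steps=100)
w_momentum = sgd_minimize(flat_gradient, 5.0, learning_rate=0.1, steps=100, momentum=0.9)
w_nesterov = sgd_minimize(flat_gradient, 5.0, learning_rate=0.1, steps=100,
                          momentum=0.9, nesterov=True)

print(f"plain SGD: {w_plain:.3f}")
print(f"momentum:  {w_momentum:.3f}")
print(f"nesterov:  {w_nesterov:.3f}")
```

On this shallow slope, both momentum variants end much closer to the minimum at 0 than plain SGD after the same number of steps.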

### Adam (Adaptive Moment Estimation)

Adam is often the default choice for many deep learning tasks due to its effectiveness and relative ease of use. It's an adaptive learning rate optimizer, meaning it computes individual learning rates for different parameters. It does this by keeping track of exponentially decaying averages of past gradients (first moment, like momentum) and past squared gradients (second moment, capturing the variance).

Generally, Adam converges quickly and performs well on a wide range of problems. However, it can sometimes converge to suboptimal solutions compared to finely tuned SGD with momentum, and it requires more memory to store the moving averages.

Important hyperparameters include `learning_rate`, `beta_1` (decay rate for the first moment), `beta_2` (decay rate for the second moment), and `epsilon` (a small value to prevent division by zero).

In [None]:
# Adam optimizer with default parameters (learning_rate=0.001)
adam_optimizer_default = tf.keras.optimizers.Adam()

# Adam optimizer with a custom learning rate
adam_optimizer_custom = tf.keras.optimizers.Adam(learning_rate=0.0005)

# Adam optimizer with custom beta values
adam_optimizer_betas = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.99)
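
The role of the two moving averages can be seen in a hand-written version of a single Adam step. This is a simplified sketch for one scalar weight, using the standard bias-corrected formulation. The default values mirror those documented for `tf.keras.optimizers.Adam`, and the helper name `adam_step` is hypothetical.

```python
import math

def adam_step(w, g, m, v, t, learning_rate=0.001,
              beta_1=0.9, beta_2=0.999, epsilon=1e-7):
    """One Adam update for a scalar weight w with gradient g at step t (1-based)."""
    m = beta_1 * m + (1 - beta_1) * g        # first moment: average of gradients
    v = beta_2 * v + (1 - beta_2) * g * g    # second moment: average of squared gradients
    m_hat = m / (1 - beta_1 ** t)            # bias correction for the zero initialization
    v_hat = v / (1 - beta_2 ** t)
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w, m, v

# Two weights whose gradients differ by four orders of magnitude.
w_large, _, _ = adam_step(w=1.0, g=100.0, m=0.0, v=0.0, t=1)
w_small, _, _ = adam_step(w=1.0, g=0.01, m=0.0, v=0.0, t=1)

print(f"step taken with gradient 100.0: {1.0 - w_large:.6f}")
print(f"step taken with gradient 0.01:  {1.0 - w_small:.6f}")
```

Both weights move by roughly `learning_rate` on the first step, because the update is normalized by the gradient's own magnitude. This is what "individual learning rates for different parameters" means in practice.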

### RMSprop (Root Mean Square Propagation)

RMSprop is another adaptive learning rate method that also maintains a moving average of the squared gradients. It divides the learning rate by the square root of this average. This effectively adapts the learning rate per parameter, decreasing it for parameters with large gradients and increasing it for parameters with small gradients.

- Pros: Works well in practice, particularly for recurrent neural networks (RNNs). Good alternative to Adagrad (which can suffer from learning rates becoming too small).
- Cons: Less commonly used as a default than Adam, but still very effective.


In [None]:
import tensorflow as tf

# RMSprop optimizer with default parameters (learning_rate=0.001)
rmsprop_optimizer_default = tf.keras.optimizers.RMSprop()

# RMSprop optimizer with explicit learning rate and rho (both at their defaults) plus momentum
rmsprop_optimizer_custom = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9, momentum=0.1)
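
As a rough illustration of the per-parameter scaling, the sketch below applies the basic RMSprop rule (without momentum) to a constant scalar gradient and records the size of each update. The helper name `rmsprop_steps` and the gradient values are assumptions made for this example.

```python
import math

def rmsprop_steps(gradient, n_steps, learning_rate=0.001, rho=0.9, epsilon=1e-7):
    """Return the size of each RMSprop update for a constant scalar gradient."""
    avg_sq = 0.0
    sizes = []
    for _ in range(n_steps):
        # Moving average of the squared gradient.
        avg_sq = rho * avg_sq + (1 - rho) * gradient ** 2
        # The learning rate is divided by the root of that average.
        sizes.append(learning_rate * gradient / (math.sqrt(avg_sq) + epsilon))
    return sizes

large = rmsprop_steps(gradient=10.0, n_steps=50)
small = rmsprop_steps(gradient=0.1, n_steps=50)

print(f"gradient=10.0 -> last step size {large[-1]:.6f}")
print(f"gradient=0.1  -> last step size {small[-1]:.6f}")
```

Once the moving average has warmed up, both parameters take steps of about `learning_rate`, even though their raw gradients differ by a factor of 100.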


### Choosing an Optimizer

Keras offers other optimizers like Adagrad, Adadelta, Adamax, and Nadam. While less frequently used as initial choices compared to Adam or SGD, they have specific properties that might be beneficial for certain types of data or network architectures (e.g., Adagrad for sparse data). Selecting the best optimizer often involves some experimentation:

- Start with Adam: Its adaptive nature and generally strong performance make it an excellent starting point for most problems. Use the default learning rate (0.001) initially.
- Try SGD with Momentum: If Adam doesn't yield satisfactory results or if you suspect it's converging too quickly to a poor minimum, try SGD with momentum (e.g., momentum=0.9). This often requires more careful tuning of the learning rate. You might need to experiment with values like 0.1, 0.01, 0.001.
- Consider RMSprop: It's a solid alternative, especially if you encounter issues with Adam or are working with RNNs.
- Learning Rate Schedules: Instead of a fixed learning rate, you can use learning rate schedules that decrease the learning rate over time during training. This can help achieve finer convergence later in the training process. Keras callbacks or tf.keras.optimizers.schedules can implement this.
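
To show what a schedule computes, here is a plain-Python sketch of exponential decay. It mirrors the formula documented for `tf.keras.optimizers.schedules.ExponentialDecay`. The function itself and the parameter values are illustrative assumptions.

```python
def exponential_decay(step, initial_learning_rate=0.01, decay_steps=1000,
                      decay_rate=0.5, staircase=False):
    """Learning rate after `step` optimizer updates under exponential decay."""
    # With staircase=True the rate drops in discrete jumps every decay_steps.
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_learning_rate * decay_rate ** exponent

for step in [0, 500, 1000, 2000]:
    print(f"step {step:>4}: lr = {exponential_decay(step):.5f}")
```

In Keras, a schedule object of this kind is passed as the `learning_rate` argument of an optimizer in place of a fixed float.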


## Metrics

When compiling a model in TensorFlow/Keras, you configure the learning process by specifying a loss function, an optimizer, and metrics. Metrics do not directly influence weight updates, but they provide valuable feedback on how well the model is performing.

The choice of metrics depends heavily on your specific machine learning task (e.g., classification, regression) and on which aspects of performance matter most:
- **Classification**: Accuracy, Precision, Recall, AUC
- **Regression**: MAE, MSE, RMSE

### Classification metrics

In [None]:
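# Example of classification metrics
# Placeholder model (assumed here only so this cell runs on its own)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])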
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=[
                  tf.keras.metrics.BinaryAccuracy(name='accuracy'),
                  tf.keras.metrics.Precision(name='precision'),
                  tf.keras.metrics.Recall(name='recall'),
                  tf.keras.metrics.AUC(name='auc')
              ])

#### Accuracy
Measures the proportion of correct predictions.

$$Accuracy = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$$

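Note that accuracy can be misleading with imbalanced datasets: a model that always predicts the majority class on a 95/5 split scores 95% accuracy while never detecting the minority class.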

In [None]:
import tensorflow as tf

# Sample data
y_true = tf.constant([[1.], [0.], [1.], [0.]])
y_pred = tf.constant([[1.], [1.], [1.], [0.]])

# Using BinaryAccuracy metric
metric = tf.keras.metrics.BinaryAccuracy()
metric.update_state(y_true, y_pred)

# Display result
print("Accuracy:", metric.result().numpy())
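
The weakness of accuracy on imbalanced data can be checked with a few lines of plain Python. The dataset below is made up for illustration, and the "model" is simply a rule that always predicts the majority class.

```python
# Hypothetical imbalanced dataset: 95 negatives and 5 positives.
labels_imbalanced = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority class (0).
preds_majority = [0] * 100

pairs = list(zip(labels_imbalanced, preds_majority))
accuracy_demo = sum(t == p for t, p in pairs) / len(pairs)
true_positives = sum(1 for t, p in pairs if t == 1 and p == 1)
recall_demo = true_positives / sum(labels_imbalanced)

print("Accuracy:", accuracy_demo)
print("Recall:  ", recall_demo)
```

The accuracy looks strong even though not a single positive example was found, which is why additional metrics are usually reported for such datasets.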

#### Precision
Out of all predicted positives, how many are truly positive?

$$Precision = \frac{TP}{TP + FP}$$

Useful when false positives are costly.

In [None]:
# Using Precision metric
metric = tf.keras.metrics.Precision()
metric.update_state(y_true, y_pred)

# Display result
print("Precision:", metric.result().numpy())

#### Recall
Out of all actual positives, how many did the model correctly identify?

$$Recall = \frac{TP}{TP + FN}$$

Useful when false negatives are costly.

In [None]:
# Using Recall metric
metric = tf.keras.metrics.Recall()
metric.update_state(y_true, y_pred)

# Display result
print("Recall:", metric.result().numpy())
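
As a cross-check on the two formulas, precision and recall can be computed by counting true positives, false positives, and false negatives directly. This sketch reuses the same four sample labels and predictions as plain Python lists. The variable names are chosen here for illustration.

```python
# Same sample values as in the accuracy cell, written as plain lists.
labels_demo = [1, 0, 1, 0]
preds_demo = [1, 1, 1, 0]

pairs = list(zip(labels_demo, preds_demo))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted 1, truly 1
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted 1, truly 0
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted 0, truly 1

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)

print("Manual Precision:", precision_manual)
print("Manual Recall:", recall_manual)
```

These counts should agree with the Keras metrics for the same data, since the predictions here are already hard 0/1 values.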

#### AUC (Area Under ROC Curve)
Measures the ability of the model to distinguish between classes.
- AUC = 1.0 → Perfect classifier
- AUC = 0.5 → Random guessing

In [None]:
# Using AUC metric
metric = tf.keras.metrics.AUC()
metric.update_state(y_true, y_pred)

# Display result
print("AUC:", metric.result().numpy())
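
One way to build intuition for AUC is its ranking interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that quantity exactly by comparing all pairs. The helper `pairwise_auc` and the scores are assumptions for illustration. `tf.keras.metrics.AUC` instead approximates the curve over a set of thresholds, so its value can differ slightly.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted probabilities for four examples.
auc_labels = [1, 0, 1, 0]
auc_scores = [0.9, 0.4, 0.6, 0.7]
print("Pairwise AUC:", pairwise_auc(auc_labels, auc_scores))
```

Three of the four positive/negative pairs are ordered correctly here, and a classifier that assigns every example the same score lands at 0.5.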

### Regression metrics

In regression problems, the goal is to predict a **continuous value**.  
To evaluate how well our model performs, we use metrics such as:

- **Mean Absolute Error (MAE)**
- **Mean Squared Error (MSE)**
- **Root Mean Squared Error (RMSE)**

These metrics provide different perspectives on prediction errors.

#### Mean Absolute Error (MAE)


$$
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$



- Represents the average absolute difference between true and predicted values.  
- Easy to interpret since it is in the same units as the target variable.  
- Less sensitive to outliers compared to MSE.


In [None]:
import tensorflow as tf

# Example true and predicted values
y_true = tf.constant([[3.0], [-0.5], [2.0], [7.0]])
y_pred = tf.constant([[2.5], [0.0],  [2.0], [8.0]])

# Compute MAE
mae_metric = tf.keras.metrics.MeanAbsoluteError()
mae_metric.update_state(y_true, y_pred)
print("Mean Absolute Error (MAE):", mae_metric.result().numpy())


#### Mean Squared Error (MSE)


$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$



- Penalizes larger errors more heavily due to squaring.  
- Units are the square of the target variable.  
- Commonly used as a loss function and monitoring metric.


In [None]:
# Compute MSE
mse_metric = tf.keras.metrics.MeanSquaredError()
mse_metric.update_state(y_true, y_pred)
print("Mean Squared Error (MSE):", mse_metric.result().numpy())

#### Root Mean Squared Error (RMSE)

$$
RMSE = \sqrt{MSE}
$$



- Same units as the target variable.  
- More interpretable than MSE.  
- Still penalizes large errors significantly.


In [None]:
# Compute RMSE
rmse_metric = tf.keras.metrics.RootMeanSquaredError()
rmse_metric.update_state(y_true, y_pred)
print("Root Mean Squared Error (RMSE):", rmse_metric.result().numpy())
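
The three regression metrics can also be verified by hand on the same four sample values, which makes the relationship between them explicit. This is a plain-Python sketch, and the variable names are chosen here for illustration.

```python
import math

# Same sample values as in the MAE cell, written as plain lists.
targets = [3.0, -0.5, 2.0, 7.0]
predictions = [2.5, 0.0, 2.0, 8.0]

errors = [t - p for t, p in zip(targets, predictions)]
mae_check = sum(abs(e) for e in errors) / len(errors)
mse_check = sum(e ** 2 for e in errors) / len(errors)
rmse_check = math.sqrt(mse_check)

print("MAE: ", mae_check)
print("MSE: ", mse_check)
print("RMSE:", round(rmse_check, 4))
```

The RMSE comes out larger than the MAE here because the single error of 1.0 is weighted more heavily once squared.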

#### Custom Metrics in TensorFlow

Sometimes built-in metrics are not enough. TensorFlow lets you define **custom metrics** either as simple Python functions or by subclassing `tf.keras.metrics.Metric`.

This gives flexibility to measure exactly what matters for the problem.



In [None]:
# Custom metric function
def custom_metric_function(y_true, y_pred):
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.cast(y_pred, tf.float32)
    mask = tf.where(y_true > 5, tf.constant(10.0, dtype=tf.float32), tf.constant(1.0, dtype=tf.float32))
    diff = tf.abs(y_true - y_pred) * mask
    return tf.reduce_mean(diff)

# Evaluate on the regression sample data defined above
custom_value = custom_metric_function(y_true, y_pred)
print("Custom weighted MAE:", custom_value.numpy())

# Pass the custom function to model.compile() via the metrics list
# model.compile(optimizer='adam',
#               loss='mse',
#               metrics=['mae', custom_metric_function])

## Putting it all together
We now build a simple binary classification model with synthetic data.
We will compile the model with a loss, optimizer, and metrics, then train and evaluate it.

In [None]:
import numpy as np

# Generate synthetic dataset
X = np.random.rand(1000, 3)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Build model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(3,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile model with loss, optimizer, and metrics
model.compile(
    loss='binary_crossentropy',
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    metrics=['accuracy', tf.keras.metrics.AUC(name='auc')]
)

# Train model
history = model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2)

## Evaluation
After training, we can evaluate the model on new data and inspect metrics.

In [None]:
# Evaluate on test data
X_test = np.random.rand(200, 3)
y_test = (X_test[:, 0] + X_test[:, 1] > 1).astype(int)

results = model.evaluate(X_test, y_test, verbose=0)
print("Test Loss, Accuracy, AUC:", results)

## Optimizing Hyperparameters with Grid Search (Keras)

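We now combine TensorFlow/Keras with Scikit-learn's **Grid Search** (through the SciKeras wrapper) to automatically test combinations of hyperparameters. This systematic approach finds the best settings among the candidates listed in the grid, here the **learning rate**, the **number of neurons** in the hidden layer, the optimizer, the batch size, and the number of epochs.

The grid compares the **Adam optimizer**, an adaptive optimization algorithm that already incorporates momentum-like behavior, against plain SGD. The dataset is the four-point XOR problem, so with `cv=2` each fold trains on only two samples and the scores are illustrative only.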

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam, SGD 
from scikeras.wrappers import KerasClassifier 
from sklearn.model_selection import GridSearchCV
import numpy as np

# 1. Prepare Data 
X_train = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]], dtype=np.float32)
Y_train = np.array([[0.], [1.], [1.], [0.]], dtype=np.float32)

# 2. Function to create the Keras model
def create_model(learning_rate=0.01, n_neurons=2, optimizer_name='Adam'):
    """Creates a neural network model with specified hyperparameters, including optimizer."""
    model = Sequential([
        # Hidden layer uses Sigmoid activation
        Dense(n_neurons, activation='sigmoid', input_shape=(2,)), 
        Dense(1) 
    ])
    
    # Select optimizer based on the parameter value
    if optimizer_name == 'Adam':
        optimizer = Adam(learning_rate=learning_rate)
    elif optimizer_name == 'SGD':
        # Using SGD as an alternative optimizer
        optimizer = SGD(learning_rate=learning_rate) 
    else:
        # Fallback for safety
        optimizer = Adam(learning_rate=learning_rate)

    model.compile(
        optimizer=optimizer,
        loss='mse',
        metrics=['mean_squared_error']
    )
    return model

# 3. Create the KerasClassifier wrapper
# Note: epochs and batch_size are not set here; the Grid Search supplies them.
keras_model = KerasClassifier(
    model=create_model, 
    verbose=0, 
    loss='mse' 
)

# 4. Define the Grid Search parameter space
# 'epochs' and 'batch_size' are parameters of the SciKeras wrapper itself (no prefix needed).
# 'model__' parameters target arguments of the create_model function.
param_grid = {
    'model__learning_rate': [0.1, 0.01, 0.001], 
    'model__n_neurons': [2, 4, 8],
    'model__optimizer_name': ['Adam', 'SGD'],
    'batch_size': [2, 4],
    'epochs': [10, 20] # New parameter for tuning the number of training passes
}

# 5. Initialize and Run Grid Search
print("\n Starting Grid Search to find optimal hyperparameters (this may take a moment)...")

# Grid search will perform cross-validation (cv=2) over the training data
grid = GridSearchCV(estimator=keras_model, param_grid=param_grid, cv=2, scoring='neg_mean_squared_error')
grid_result = grid.fit(X_train, Y_train)

# 6. Summarize Results
print("\n Grid Search Complete.")

print(f"Best Loss (MSE, lower is better): {-grid_result.best_score_:.4f}")

# Clean up parameter names for display
best_params = {k.replace('model__', ''): v for k, v in grid_result.best_params_.items()}
print(f"Best Hyperparameters: {best_params}")
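
Grid search cost grows multiplicatively with every parameter added, so it is worth estimating the number of training runs before starting. The sketch below enumerates the same grid with `itertools.product`. The grid is copied into a separate dictionary (`param_grid_demo`, a name chosen here) so that the example is self-contained.

```python
from itertools import product

# Copy of the parameter grid used above.
param_grid_demo = {
    'model__learning_rate': [0.1, 0.01, 0.001],
    'model__n_neurons': [2, 4, 8],
    'model__optimizer_name': ['Adam', 'SGD'],
    'batch_size': [2, 4],
    'epochs': [10, 20],
}
cv_folds = 2

# Every combination of one value per parameter.
combinations = list(product(*param_grid_demo.values()))
n_fits = len(combinations) * cv_folds

print("Combinations:", len(combinations))
print("Cross-validation fits:", n_fits)
print("First combination:", dict(zip(param_grid_demo.keys(), combinations[0])))
```

With its default `refit=True`, `GridSearchCV` also trains one final model on the full training data using the best combination.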