# Deep Neural Networks: Visualizations and Explanations 

This notebook provides visualizations and explanations for key concepts related to training deep neural networks, complementing the lecture slides. We will explore:

1.  **Adaptive Learning Rates:** Visualizing how different optimization algorithms navigate a loss landscape.
2.  **Error Landscapes:** Understanding saddle points and their impact on optimization.
3.  **Debugging:** Interpreting loss and accuracy curves during training.
4.  **Performance Metrics:** Evaluating model performance, especially on imbalanced datasets.

**Instructions:**
- Complete all tasks marked with **TODO** comments
- Answer all questions in the designated cells
- Run all cells to verify your implementations
- The exercises should take approximately 30 minutes to complete

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score,
    recall_score, f1_score, roc_curve, auc,
)
import seaborn as sns

# Set style for plots
plt.style.use('seaborn-v0_8-whitegrid')

## 1. Adaptive Learning Rates and Optimization Algorithms

Gradient descent algorithms minimize the loss function by iteratively moving the model parameters against the gradient. Different algorithms adapt the learning rate or use momentum to navigate the loss landscape more effectively. Let's visualize the paths taken by different optimizers on a simple quadratic loss surface whose curvature is ten times larger along `w2` than along `w1`.
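
To make "adapting the learning rate" concrete, the following minimal sketch compares the very first step of plain SGD and RMSProp at the start point used in the visualization. The helper name `grad_bowl` and the hyperparameter values are illustrative assumptions, not part of the exercise code; the RMSProp formula mirrors the one implemented in the cell that follows.

```python
import numpy as np

# Gradient of the quadratic bowl 0.5 * (w1**2 + 10 * w2**2) used in this section.
# `grad_bowl` is a hypothetical helper name chosen for this sketch.
def grad_bowl(w):
    return np.array([w[0], 10.0 * w[1]])

w0 = np.array([4.0, 1.5])              # same start point as the visualization
g = grad_bowl(w0)                      # gradient at the start: [4.0, 15.0]
lr, decay_rate, eps = 0.1, 0.9, 1e-8   # illustrative values (assumption)

# Plain SGD: the step is proportional to the raw gradient,
# so the steep w2 direction receives a much larger step.
sgd_step = lr * g

# RMSProp, first iteration (the squared-gradient accumulator starts at zero):
# each coordinate is divided by the root of its own accumulated squared gradient.
r = (1 - decay_rate) * g**2
rmsprop_step = lr / (np.sqrt(r) + eps) * g

print("gradient:     ", g)
print("SGD step:     ", sgd_step)
print("RMSProp step: ", rmsprop_step)
```

In this sketch the SGD step differs by the same factor as the gradient components, while the first RMSProp step has the same magnitude in both coordinates, because each gradient component is normalized by its own running scale.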

In [None]:
# Define a simple quadratic loss function (representing a simple error landscape)
def loss_function(w1, w2):
    return 0.5 * (w1**2 + 10 * w2**2) # Elliptical bowl shape

# Define gradients for the loss function
def gradients(w1, w2):
    return np.array([w1, 10 * w2])

# --- Optimization Algorithms Implementation ---

# Standard SGD
def sgd(w_init, n_iterations, learning_rate):
    w_path = [w_init]
    w = w_init.copy()
    for _ in range(n_iterations):
        grad = gradients(w[0], w[1])
        w = w - learning_rate * grad
        w_path.append(w.copy())
    return np.array(w_path)

# TODO: Implement SGD with Momentum
# The momentum update rule is: v = momentum * v + learning_rate * gradient, w = w - v
def sgd_momentum(w_init, n_iterations, learning_rate, momentum=0.9):
    w_path = [w_init]
    w = w_init.copy()
    v = np.zeros_like(w)
    for _ in range(n_iterations):
        # TODO: Implement momentum update
        # 1. Calculate gradient
        # 2. Update velocity v using momentum formula
        # 3. Update weights w
        # 4. Append to w_path
        pass
    return np.array(w_path)

# RMSProp
def rmsprop(w_init, n_iterations, learning_rate, decay_rate=0.9, epsilon=1e-8):
    w_path = [w_init]
    w = w_init.copy()
    r = np.zeros_like(w) # Accumulated squared gradient
    for _ in range(n_iterations):
        grad = gradients(w[0], w[1])
        r = decay_rate * r + (1 - decay_rate) * grad**2
        update = (learning_rate / (np.sqrt(r) + epsilon)) * grad
        w = w - update
        w_path.append(w.copy())
    return np.array(w_path)

# Adam
def adam(w_init, n_iterations, learning_rate, beta1=0.9, beta2=0.999, epsilon=1e-8):
    w_path = [w_init]
    w = w_init.copy()
    s = np.zeros_like(w) # First moment estimate
    r = np.zeros_like(w) # Second moment estimate
    t = 0
    for _ in range(n_iterations):
        t += 1
        grad = gradients(w[0], w[1])
        s = beta1 * s + (1 - beta1) * grad
        r = beta2 * r + (1 - beta2) * grad**2
        s_hat = s / (1 - beta1**t)
        r_hat = r / (1 - beta2**t)
        update = (learning_rate / (np.sqrt(r_hat) + epsilon)) * s_hat
        w = w - update
        w_path.append(w.copy())
    return np.array(w_path)

# --- Visualization ---

# Initial point
w_init = np.array([4.0, 1.5])
n_iterations = 50

# Run optimizers
path_sgd_fast = sgd(w_init, n_iterations, learning_rate=0.18) # Can oscillate
path_sgd_slow = sgd(w_init, n_iterations, learning_rate=0.05) # Slow convergence
# TODO: Uncomment after implementing sgd_momentum
# path_momentum = sgd_momentum(w_init, n_iterations, learning_rate=0.05, momentum=0.9)
path_rmsprop = rmsprop(w_init, n_iterations, learning_rate=0.1)
path_adam = adam(w_init, n_iterations, learning_rate=0.2)

# Create contour plot
w1_range = np.linspace(-5, 5, 100)
w2_range = np.linspace(-2, 2, 100)
W1, W2 = np.meshgrid(w1_range, w2_range)
Z = loss_function(W1, W2)

plt.figure(figsize=(10, 6))
contour = plt.contour(W1, W2, Z, levels=np.logspace(-1, 3, 10), cmap='viridis')
plt.colorbar(contour, label='Loss Value')

# Plot paths
plt.plot(path_sgd_fast[:, 0], path_sgd_fast[:, 1], 'o-', label='SGD (LR=0.18)', markersize=3, alpha=0.7)
plt.plot(path_sgd_slow[:, 0], path_sgd_slow[:, 1], 's-', label='SGD (LR=0.05)', markersize=3, alpha=0.7)
# TODO: Uncomment after implementing sgd_momentum
# plt.plot(path_momentum[:, 0], path_momentum[:, 1], '^-', label='Momentum', markersize=3, alpha=0.7)
plt.plot(path_rmsprop[:, 0], path_rmsprop[:, 1], 'd-', label='RMSProp', markersize=3, alpha=0.7)
plt.plot(path_adam[:, 0], path_adam[:, 1], '*-', label='Adam', markersize=4, alpha=0.7)

plt.plot(w_init[0], w_init[1], 'ko', label='Start') # Start point
plt.plot(0, 0, 'rx', label='Minimum', markersize=10) # Minimum point

plt.title('Optimization Paths on a Quadratic Loss Surface')
plt.xlabel('w1')
plt.ylabel('w2')
plt.legend()
plt.axis('equal')
plt.grid(True)
plt.show()

**Task 1:** After implementing the momentum algorithm and running the visualization, answer the following questions:

1. How does momentum help with optimization compared to standard SGD?
2. Why does SGD with a high learning rate oscillate more in the w2 direction than in the w1 direction?
3. Which algorithm would you choose for a real-world deep learning problem and why?

**Your answers to Task 1:**

1. 

2. 

3. 

## 2. Error Landscapes: Saddle Points

In high-dimensional spaces, saddle points are much more common than local minima. A saddle point is a point where the gradient is zero but the function curves upward along some directions and downward along others, so it is a minimum in some dimensions and a maximum in others. Because the gradient is close to zero in its neighborhood, optimization algorithms can slow down significantly near saddle points.
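
One way to check the "minimum along some dimensions, maximum along others" property numerically is to look at the signs of the Hessian eigenvalues at the critical point. The sketch below does this for the function used in this section; `numerical_hessian` is a hypothetical helper written for this illustration (a plain central-difference estimate), not something the exercise requires.

```python
import numpy as np

def saddle(w):
    # Same function as saddle_function in this section, taking a vector argument
    return w[0]**2 - w[1]**2

def numerical_hessian(f, w, h=1e-4):
    """Central-difference estimate of the Hessian of f at w (hypothetical helper)."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        e_i = np.zeros(n)
        e_i[i] = h
        for j in range(n):
            e_j = np.zeros(n)
            e_j[j] = h
            # Second-order mixed difference quotient for d^2 f / (dw_i dw_j)
            H[i, j] = (f(w + e_i + e_j) - f(w + e_i - e_j)
                       - f(w - e_i + e_j) + f(w - e_i - e_j)) / (4 * h**2)
    return H

H = numerical_hessian(saddle, np.array([0.0, 0.0]))
eigenvalues = np.linalg.eigvalsh(H)   # H is symmetric; eigenvalues come back in ascending order

print("Hessian:\n", np.round(H, 3))
print("Eigenvalues:", np.round(eigenvalues, 3))
```

One positive and one negative eigenvalue indicate a saddle; all positive would indicate a local minimum, and all negative a local maximum.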

In [None]:
# Define a saddle point function
def saddle_function(w1, w2):
    return w1**2 - w2**2

# TODO: Implement the gradient function for the saddle function
def saddle_gradients(w1, w2):
    # The gradient of f(w1,w2) = w1^2 - w2^2 is [∂f/∂w1, ∂f/∂w2]
    # TODO: Calculate and return the gradient as a numpy array
    return np.array([0, 0])  # Replace with correct gradient

# --- Optimization Algorithms (modified for saddle function) ---
def sgd_saddle(w_init, n_iterations, learning_rate):
    w_path = [w_init]
    w = w_init.copy()
    for _ in range(n_iterations):
        grad = saddle_gradients(w[0], w[1])
        w = w - learning_rate * grad
        w_path.append(w.copy())
    return np.array(w_path)

def sgd_momentum_saddle(w_init, n_iterations, learning_rate, momentum=0.9):
    w_path = [w_init]
    w = w_init.copy()
    v = np.zeros_like(w)
    for _ in range(n_iterations):
        grad = saddle_gradients(w[0], w[1])
        v = momentum * v + learning_rate * grad
        w = w - v
        w_path.append(w.copy())
    return np.array(w_path)

# --- Visualization ---
w_init_saddle = np.array([0.1, 1.5]) # Start slightly off the w2 (unstable) axis, with w2 != 0 so the iterates can move away from the saddle
n_iterations_saddle = 100

path_sgd_saddle = sgd_saddle(w_init_saddle, n_iterations_saddle, learning_rate=0.1)
path_momentum_saddle = sgd_momentum_saddle(w_init_saddle, n_iterations_saddle, learning_rate=0.1, momentum=0.7) # Lower momentum helps visualize escape

# Create contour plot for saddle function
w1_saddle_range = np.linspace(-2, 2, 100)
w2_saddle_range = np.linspace(-2, 2, 100)
W1_saddle, W2_saddle = np.meshgrid(w1_saddle_range, w2_saddle_range)
Z_saddle = saddle_function(W1_saddle, W2_saddle)

plt.figure(figsize=(8, 8))
contour_saddle = plt.contour(W1_saddle, W2_saddle, Z_saddle, levels=20, cmap='coolwarm')
plt.colorbar(contour_saddle, label='Function Value')

# Plot paths
plt.plot(path_sgd_saddle[:, 0], path_sgd_saddle[:, 1], 'o-', label='SGD', markersize=3, alpha=0.7)
plt.plot(path_momentum_saddle[:, 0], path_momentum_saddle[:, 1], '^-', label='Momentum', markersize=3, alpha=0.7)

plt.plot(w_init_saddle[0], w_init_saddle[1], 'ko', label='Start') # Start point
plt.plot(0, 0, 'rx', label='Saddle Point', markersize=10) # Saddle point at (0,0)

plt.title('Optimization Paths near a Saddle Point')
plt.xlabel('w1 (minimum direction)')
plt.ylabel('w2 (maximum direction)')
plt.legend()
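# With the correct gradient the iterates grow without bound along w2,
# so clip the view to the contour window instead of autoscaling
plt.xlim(-2, 2)
plt.ylim(-2, 2)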
plt.grid(True)
plt.show()

**Task 2:** After implementing the saddle point gradient function and observing the visualization, explain:

1. Why are saddle points more common than local minima in high-dimensional spaces?
2. How does momentum help with escaping saddle points?
3. What would happen if we initialized both optimizers exactly at the saddle point (0,0)?

**Your answers to Task 2:**

1. 

2. 

3. 

## 3. Debugging: Interpreting Loss and Accuracy Curves

Monitoring loss and accuracy curves during training is crucial for debugging and understanding model behavior.
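
One common way to act on these curves is to find the epoch after which validation performance stops improving. The following is a minimal sketch of a patience-based check, assuming a higher-is-better validation metric; the function name `early_stopping_epoch` and the toy curve are hypothetical and only serve as illustration.

```python
import numpy as np

def early_stopping_epoch(val_metric, patience=5):
    """Return (stop_index, best_index) for a higher-is-better validation metric.

    The scan stops at the first index where the metric has not improved for
    `patience` consecutive epochs. If that never happens, stop_index is the
    last index of the curve.
    """
    best_index = 0
    for i in range(1, len(val_metric)):
        if val_metric[i] > val_metric[best_index]:
            best_index = i            # new best value, reset the waiting period
        elif i - best_index >= patience:
            return i, best_index      # no improvement for `patience` epochs
    return len(val_metric) - 1, best_index

# Toy validation accuracy: rises for 10 epochs, then slowly degrades
val_acc = np.concatenate([np.linspace(0.60, 0.80, 10),
                          np.linspace(0.79, 0.70, 10)])
stop, best = early_stopping_epoch(val_acc, patience=3)
print(f"best epoch index: {best}, stopped at index: {stop}")
```

With real, noisy curves a larger patience or a smoothed metric is often used, so that a single noisy epoch does not trigger the stop.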

In [None]:
# Simulate some training curves (fixed seed so the plots are reproducible)
np.random.seed(42)
epochs = np.arange(1, 101)

# Scenario 1: Good convergence
loss_good = 1.0 / epochs**0.5 + np.random.rand(100) * 0.1
acc_train_good = 0.95 - 0.4 / epochs**0.5 + np.random.rand(100) * 0.02
acc_val_good = 0.90 - 0.3 / epochs**0.5 + np.random.rand(100) * 0.02

# Scenario 2: Overfitting
loss_overfit = 0.5 / epochs**0.7 + np.random.rand(100) * 0.05
acc_train_overfit = 0.98 - 0.2 / epochs**0.8 + np.random.rand(100) * 0.01
acc_val_overfit = 0.75 + 0.1 * np.sin(epochs / 20) + np.random.rand(100) * 0.03 # Validation accuracy stagnates/decreases

# Scenario 3: Learning rate too high (oscillating loss)
loss_high_lr = 0.5 + np.abs(0.5 * np.sin(epochs / 5) + np.random.rand(100) * 0.2) + 1.0 / epochs

# TODO: Create a scenario for underfitting
# Scenario 4: Underfitting - implement your own curves that would represent underfitting
# Hint: Consider what happens when the model is too simple or training is insufficient
loss_underfit = None  # TODO: Create appropriate loss curve
acc_train_underfit = None  # TODO: Create appropriate training accuracy curve
acc_val_underfit = None  # TODO: Create appropriate validation accuracy curve

# Plotting
fig, axs = plt.subplots(2, 2, figsize=(18, 10))

# Plot 1: Good Convergence
axs[0, 0].plot(epochs, loss_good, label='Training Loss')
axs[0, 0].plot(epochs, acc_train_good, label='Training Accuracy')
axs[0, 0].plot(epochs, acc_val_good, label='Validation Accuracy')
axs[0, 0].set_title('Good Convergence')
axs[0, 0].set_xlabel('Epochs')
axs[0, 0].set_ylabel('Value')
axs[0, 0].legend()
axs[0, 0].grid(True)

# Plot 2: Overfitting
axs[0, 1].plot(epochs, loss_overfit, label='Training Loss')
axs[0, 1].plot(epochs, acc_train_overfit, label='Training Accuracy')
axs[0, 1].plot(epochs, acc_val_overfit, label='Validation Accuracy')
axs[0, 1].set_title('Overfitting')
axs[0, 1].set_xlabel('Epochs')
axs[0, 1].set_ylabel('Value')
axs[0, 1].legend()
axs[0, 1].grid(True)

# Plot 3: High Learning Rate
axs[1, 0].plot(epochs, loss_high_lr, label='Training Loss')
axs[1, 0].set_title('High Learning Rate (Oscillating Loss)')
axs[1, 0].set_xlabel('Epochs')
axs[1, 0].set_ylabel('Loss')
axs[1, 0].legend()
axs[1, 0].grid(True)

# TODO: Uncomment after implementing underfitting scenario
# Plot 4: Underfitting
# axs[1, 1].plot(epochs, loss_underfit, label='Training Loss')
# axs[1, 1].plot(epochs, acc_train_underfit, label='Training Accuracy')
# axs[1, 1].plot(epochs, acc_val_underfit, label='Validation Accuracy')
# axs[1, 1].set_title('Underfitting')
# axs[1, 1].set_xlabel('Epochs')
# axs[1, 1].set_ylabel('Value')
# axs[1, 1].legend()
# axs[1, 1].grid(True)

plt.tight_layout()
plt.show()

**Task 3:** After implementing the underfitting scenario and observing all four plots, complete the following table by listing the key characteristics of each scenario and suggesting at least one solution for each problem:

**Your answers to Task 3:**

| Scenario | Key Characteristics | Potential Solutions |
|----------|---------------------|---------------------|
| Good Convergence | | |
| Overfitting | | |
| High Learning Rate | | |
| Underfitting | | |

## 4. Performance Metrics for Imbalanced Datasets

Accuracy can be misleading on imbalanced datasets. Metrics like precision, recall, F1-score, and the ROC curve provide a more nuanced view of performance.
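
As a quick illustration of why accuracy can mislead, consider a baseline that always predicts the majority class. The sketch below uses synthetic labels with the same 95/5 split as the dataset in this section; the variable names are our own, and the convention of reporting an undefined precision as 0 is an assumption made for this example.

```python
import numpy as np

# Synthetic labels with the same 95/5 class split used in this section
y_true = np.array([0] * 95 + [1] * 5)
y_majority = np.zeros_like(y_true)   # baseline: always predict class 0

# Confusion-matrix counts computed by hand
tp = int(np.sum((y_true == 1) & (y_majority == 1)))
fp = int(np.sum((y_true == 0) & (y_majority == 1)))
fn = int(np.sum((y_true == 1) & (y_majority == 0)))
tn = int(np.sum((y_true == 0) & (y_majority == 0)))

baseline_accuracy = (tp + tn) / len(y_true)
baseline_recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
# Precision is undefined when nothing is predicted positive; reported as 0 here
baseline_precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0

print(f"accuracy={baseline_accuracy:.2f}, "
      f"recall={baseline_recall:.2f}, "
      f"precision={baseline_precision:.2f}")
```

The baseline reaches 95% accuracy without identifying a single positive example, which is the gap that recall and precision expose.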

In [None]:
# Generate an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, 
                           n_informative=2, n_redundant=10, 
                           n_clusters_per_class=1, weights=[0.95, 0.05], 
                           flip_y=0, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Train a simple classifier
model = LogisticRegression(solver='liblinear', random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1] # Probabilities for ROC curve

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"Dataset Class Distribution (Test Set): {np.bincount(y_test)}")
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')

# TODO: Calculate specificity (true negative rate)
# Specificity = TN / (TN + FP)
# Extract values from confusion matrix: cm = [[TN, FP], [FN, TP]]
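specificity = None  # TODO: Calculate specificity
# Guard the print so the cell still runs before the TODO is completed
if specificity is not None:
    print(f'Specificity: {specificity:.4f}')
else:
    print('Specificity: not implemented yet (complete the TODO above)')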

# Plot Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Predicted 0', 'Predicted 1'], yticklabels=['Actual 0', 'Actual 1'])
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()

# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

**Task 4:** After calculating specificity and examining all the metrics, answer the following questions:

1. Why is accuracy alone insufficient for evaluating models on imbalanced datasets?
2. In a medical diagnosis scenario where the positive class represents a rare disease, which metric would be most important to optimize and why?
3. What does the area under the ROC curve (AUC) represent, and why is it useful for imbalanced datasets?
4. Suggest two techniques that could improve the model's performance on the minority class.

**Your answers to Task 4:**

1. 

2. 

3. 

4. 

## 5. Conclusion

This notebook demonstrated key aspects of training and evaluating deep neural networks:

*   Adaptive optimization algorithms like Adam and RMSProp often converge faster and more reliably than standard SGD, especially on complex loss surfaces.
*   Understanding the behavior of optimizers near saddle points is important, as these are common in high-dimensional spaces.
*   Monitoring training curves (loss, accuracy) is essential for debugging issues like overfitting or poor learning rates.
*   For imbalanced datasets, metrics beyond accuracy (precision, recall, F1, AUC) are crucial for a complete performance evaluation.

**Final Task:** Summarize the most important insight you gained from each of the four sections of this notebook and how you might apply it in your own deep learning projects.

**Your final summary:**

1. Adaptive Learning Rates: 

2. Error Landscapes: 

3. Debugging: 

4. Performance Metrics: 

_This notebook has been created with the help of LLMs, 17.06.2025._