# Module 2: Deep Learning Primer

## Learning Objectives
By the end of this module, you will be able to:
- Understand the fundamental concepts of neural networks
- Recognize common neural network architectures
- Understand how networks learn through gradient descent and backpropagation
- Build a simple image classifier with TensorFlow/Keras

---

## 1. From ML to Deep Learning

### What Makes Deep Learning "Deep"?

**Traditional Machine Learning:**
- Requires manual **feature engineering** (humans decide what patterns to look for)
- Works well with structured, tabular data
- Limited ability to handle raw data like images, audio, text

**Deep Learning:**
- **Automatic feature learning** from raw data
- Uses multiple layers (hence "deep") to learn hierarchical representations
- Excels at unstructured data: images, speech, text, video

```
Image → [Low-level: edges] → [Mid-level: shapes] → [High-level: objects] → "Cat"
Text  → [Characters] → [Words] → [Phrases] → [Meaning/Context]
```

### When to Use Deep Learning?

| Use Traditional ML | Use Deep Learning |
|-------------------|-------------------|
| Small datasets (<10K samples) | Large datasets (100K+) |
| Structured/tabular data | Unstructured data (images, text, audio) |
| Need interpretability | Performance is priority |
| Limited compute resources | Have GPUs available |

---

## 2. The Artificial Neuron

### Inspired by Biology

An artificial neuron mimics (loosely) how biological neurons work:

```
Biological:  Dendrites → Cell Body → Axon → Synapses
Artificial:  Inputs → Weighted Sum → Activation → Output
```

### Mathematical Model

```
                   ┌─────────────────────┐
     x₁ ──w₁──▶    │                     │
     x₂ ──w₂──▶    │  z = Σ(wᵢ·xᵢ) + b   │ ──▶ a = σ(z) ──▶ output
     x₃ ──w₃──▶    │                     │
          ⬆        └─────────────────────┘
        bias (b)
```

Where:
- **xᵢ**: Input features
- **wᵢ**: Weights (learnable parameters)
- **b**: Bias (another learnable parameter)
- **σ**: Activation function (introduces non-linearity)
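As a minimal sketch of the formula above, the snippet below computes `z` and `a` for one neuron with three inputs, using sigmoid as the activation. The input values, weights, and bias are made-up numbers chosen only for illustration, and `neuron_forward` is a hypothetical helper name, not a library function.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(x, w, b):
    """Compute z = sum(w_i * x_i) + b, then a = sigmoid(z)."""
    z = np.dot(w, x) + b
    return z, sigmoid(z)

# Illustrative values only (assumed, not from a real dataset)
x = np.array([1.0, 2.0, 3.0])    # inputs x1..x3
w = np.array([0.5, -1.0, 0.25])  # weights w1..w3
b = 0.75                         # bias

z, a = neuron_forward(x, w, b)
print(f"z = {z:.2f}, a = {a:.2f}")  # z = 0.00, a = 0.50
```

Here the weighted sum is 0.5 - 2.0 + 0.75 = -0.75, the bias brings `z` to 0, and the sigmoid of 0 is 0.5.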

### Activation Functions

Without activation functions, a neural network would just be linear transformations stacked together (equivalent to a single linear function). Activations introduce **non-linearity**:

- **ReLU** (Rectified Linear Unit): `max(0, x)` - The most common choice for hidden layers
- **Sigmoid**: `1 / (1 + e^(-x))` - Output between 0 and 1
- **Tanh**: `(e^x - e^(-x)) / (e^x + e^(-x))` - Output between -1 and 1
- **Softmax**: `e^(x_i) / Σ_j e^(x_j)` - Converts a vector of scores into a probability distribution (non-negative values that sum to 1)
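Two of the claims above are easy to check numerically. The sketch below first implements softmax on a made-up score vector, then verifies that two stacked linear layers with no activation in between give the same result as a single linear layer. The layer sizes and random values are arbitrary assumptions for the demonstration.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])  # made-up raw scores for three classes
probs = softmax(scores)
print(probs.round(3), probs.sum())

# Two linear layers with no activation between them (random, illustrative weights)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

stacked = W2 @ (W1 @ x + b1) + b2            # layer 2 applied to layer 1
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)   # one equivalent linear layer
print(np.allclose(stacked, collapsed))       # True
```

Inserting a non-linear activation between the two layers breaks this equivalence, which is what lets deeper networks represent more than a single linear map.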

In [None]:
# Install required packages
!pip install -q tensorflow matplotlib numpy

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Visualize activation functions
x = np.linspace(-5, 5, 100)

# Define activation functions
relu = np.maximum(0, x)
sigmoid = 1 / (1 + np.exp(-x))
tanh = np.tanh(x)

fig, axes = plt.subplots(1, 3, figsize=(14, 4))

axes[0].plot(x, relu, 'b-', linewidth=2)
axes[0].axhline(y=0, color='k', linestyle='-', linewidth=0.5)
axes[0].axvline(x=0, color='k', linestyle='-', linewidth=0.5)
axes[0].set_title('ReLU: max(0, x)', fontsize=12)
axes[0].set_xlabel('z')
axes[0].set_ylabel('a = ReLU(z)')
axes[0].grid(True, alpha=0.3)

axes[1].plot(x, sigmoid, 'g-', linewidth=2)
axes[1].axhline(y=0.5, color='r', linestyle='--', alpha=0.5)
axes[1].axvline(x=0, color='k', linestyle='-', linewidth=0.5)
axes[1].set_title('Sigmoid: 1/(1+e^(-x))', fontsize=12)
axes[1].set_xlabel('z')
axes[1].set_ylabel('a = σ(z)')
axes[1].grid(True, alpha=0.3)

axes[2].plot(x, tanh, 'm-', linewidth=2)
axes[2].axhline(y=0, color='k', linestyle='-', linewidth=0.5)
axes[2].axvline(x=0, color='k', linestyle='-', linewidth=0.5)
axes[2].set_title('Tanh', fontsize=12)
axes[2].set_xlabel('z')
axes[2].set_ylabel('a = tanh(z)')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("💡 ReLU is the most popular because it's simple, fast, and helps avoid vanishing gradients!")

---

## 3. Neural Network Architecture

A neural network is made of layers of neurons, with each layer's outputs feeding into the next layer:

```
Input Layer          Hidden Layers              Output Layer
    ○                    ○    ○                     ○
    ○ ──────────────▶    ○    ○  ──────────────▶    ○
    ○                    ○    ○                     ○
    ○                    ○    ○
(features)           (learned                   (predictions)
                    representations)
```

### Common Architectures

| Architecture | Best For | Key Idea |
|-------------|----------|----------|
| **Dense/MLP** | Tabular data, simple tasks | Fully connected layers |
| **CNN** | Images, spatial data | Convolutions detect local patterns |
| **RNN/LSTM** | Sequences, time series | Memory of previous inputs |
| **Transformer** | Text, long sequences | Attention mechanism (GPT, BERT) |

> **🔑 For Generative AI:** Transformers are the foundation of modern LLMs (GPT, BERT, LLaMA, Claude)
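To make the "attention mechanism" entry a little more concrete, here is a rough NumPy sketch of scaled dot-product attention, the core operation inside a Transformer layer. It is deliberately simplified: real Transformers add learned projection matrices for queries, keys, and values, multiple heads, and masking, all of which are omitted here. The function name, toy sizes, and random "embeddings" are assumptions for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query returns a weighted average of the values,
    weighted by how similar the query is to each key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n_queries, n_keys)
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                  # toy sizes (assumed)
X = rng.normal(size=(seq_len, d_model))  # stand-in for 4 token embeddings

# Self-attention without learned projections: Q, K, and V are all X
output, weights = scaled_dot_product_attention(X, X, X)
print(output.shape)         # (4, 8)
print(weights.sum(axis=1))  # each row of attention weights sums to 1
```

Every output row mixes information from all four positions at once, which is why attention handles long-range dependencies without stepping through the sequence one element at a time.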

---

## 4. How Neural Networks Learn

### The Learning Loop

```
┌─────────────────────────────────────────────────────────────┐
│  1. FORWARD PASS                                            │
│     Input → Network → Prediction                            │
│                                                             │
│  2. LOSS CALCULATION                                        │
│     How wrong was the prediction? (Loss = f(prediction, y)) │
│                                                             │
│  3. BACKWARD PASS (Backpropagation)                         │
│     Calculate gradients: ∂Loss/∂weights                     │
│                                                             │
│  4. UPDATE WEIGHTS (Gradient Descent)                       │
│     weights = weights - learning_rate × gradient            │
│                                                             │
│  Repeat over many batches and epochs                        │
└─────────────────────────────────────────────────────────────┘
```
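The four steps in the loop above can be sketched in plain NumPy for the simplest possible "network": a single weight and bias fitting a straight line. The synthetic data (`y = 2x + 1` plus noise), the learning rate, and the number of epochs are assumptions chosen to keep the example small. Because the whole dataset is used in every update, one iteration here equals one epoch.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (assumed for illustration): y = 2x + 1 plus a little noise
X = rng.uniform(-1, 1, size=100)
y = 2.0 * X + 1.0 + rng.normal(scale=0.05, size=100)

w, b = 0.0, 0.0        # parameters to learn
learning_rate = 0.1

for epoch in range(200):
    # 1. Forward pass
    y_pred = w * X + b
    # 2. Loss calculation (mean squared error)
    loss = np.mean((y_pred - y) ** 2)
    # 3. Backward pass: gradients of the loss with respect to w and b
    grad_w = np.mean(2 * (y_pred - y) * X)
    grad_b = np.mean(2 * (y_pred - y))
    # 4. Update weights (gradient descent)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned w={w:.2f}, b={b:.2f}, final loss={loss:.4f}")
```

The learned values should land close to the true slope 2 and intercept 1. Frameworks such as Keras run the same loop, but compute step 3 automatically for networks with many layers.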

### Key Concepts

**Loss Function:** Measures how wrong predictions are
- Classification: Cross-entropy loss
- Regression: Mean squared error
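As a small illustration of both losses, the sketch below computes mean squared error for a regression example and cross-entropy for a single classification example. All numbers are made up, and the helper names are hypothetical; in practice you would use the loss functions built into your framework.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(true_class, probs):
    """Negative log of the probability assigned to the correct class."""
    return -np.log(probs[true_class])

# Regression: made-up targets and predictions
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 3.0])
mse = mean_squared_error(y_true, y_pred)
print(mse)  # about 0.417

# Classification: made-up predicted probabilities for 3 classes; class 0 is correct
confident_right = np.array([0.9, 0.05, 0.05])
confident_wrong = np.array([0.1, 0.8, 0.1])
print(cross_entropy(0, confident_right))  # about 0.105
print(cross_entropy(0, confident_wrong))  # about 2.303
```

Cross-entropy grows quickly when the model puts little probability on the correct class, so confident wrong answers are penalized heavily.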

**Gradient Descent:** Repeatedly nudges the weights in the direction that reduces the loss (the negative of the gradient)
- Think of it as rolling downhill on the loss landscape

**Learning Rate:** How large a step to take on each update
- Too high: Overshoots the minimum and can diverge
- Too low: Training converges very slowly
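The effect of the learning rate is easy to see on a toy problem. The sketch below runs 20 steps of gradient descent on the quadratic loss `(w - 3)**2 + 2`, whose minimum is at `w = 3`, with three learning rates. The specific rates and the helper name `run_gradient_descent` are assumptions picked for the demonstration.

```python
def gradient(w):
    """Derivative of the toy loss (w - 3)**2 + 2."""
    return 2 * (w - 3)

def run_gradient_descent(learning_rate, steps=20, w_start=-2.0):
    """Apply the update rule w = w - learning_rate * gradient(w) repeatedly."""
    w = w_start
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w

for lr in (0.001, 0.1, 1.1):
    w_final = run_gradient_descent(lr)
    print(f"learning_rate={lr}: w after 20 steps = {w_final:.3f}")
```

With a rate of 0.001 the weight has barely moved from its starting point, with 0.1 it ends close to 3, and with 1.1 each step overshoots by more than the last, so the weight moves further and further from the minimum.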

**Backpropagation:** Efficiently computes the gradient of the loss with respect to every weight by applying the chain rule backward through the layers
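For a single sigmoid neuron with squared-error loss, the chain rule can be written out by hand and checked numerically. The sketch below is one way to do that. The example values and function names are assumptions, and real frameworks perform this differentiation automatically across all layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Squared error of a single sigmoid neuron."""
    a = sigmoid(np.dot(w, x) + b)
    return (a - y) ** 2

def backprop_gradients(w, b, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw."""
    z = np.dot(w, x) + b
    a = sigmoid(z)
    dL_da = 2 * (a - y)      # derivative of (a - y)**2
    da_dz = a * (1 - a)      # derivative of the sigmoid
    dL_dz = dL_da * da_dz
    return dL_dz * x, dL_dz  # dz/dw = x, dz/db = 1

# Made-up example values
x = np.array([0.5, -1.0])
w = np.array([0.3, 0.8])
b, y = 0.1, 1.0

grad_w, grad_b = backprop_gradients(w, b, x, y)

# Numerical check with central finite differences
eps = 1e-6
numeric_w = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    numeric_w[i] = (loss(w + step, b, x, y) - loss(w - step, b, x, y)) / (2 * eps)

print(grad_w, numeric_w)
print(np.allclose(grad_w, numeric_w, atol=1e-6))  # True
```

The finite-difference check needs two loss evaluations per weight, while the chain-rule version gets all gradients from one forward and one backward computation. That efficiency gap is why backpropagation is used for networks with millions of weights.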

In [None]:
# Visualizing Gradient Descent on a simple loss landscape
def loss_function(w):
    """Simple quadratic loss function"""
    return (w - 3) ** 2 + 2

def gradient(w):
    """Derivative of loss function"""
    return 2 * (w - 3)

# Gradient descent simulation
w = -2  # Starting point
learning_rate = 0.1
history = [w]

for _ in range(20):
    grad = gradient(w)
    w = w - learning_rate * grad  # Update rule
    history.append(w)

# Visualize
w_range = np.linspace(-3, 8, 100)
loss_values = loss_function(w_range)

plt.figure(figsize=(10, 5))
plt.plot(w_range, loss_values, 'b-', linewidth=2, label='Loss landscape')
plt.plot(history, [loss_function(w) for w in history], 'ro-', markersize=8, label='Gradient descent steps')
plt.axvline(x=3, color='g', linestyle='--', alpha=0.7, label='Optimal w=3')
plt.xlabel('Weight (w)', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Gradient Descent: Finding the Minimum', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Started at w={history[0]:.2f}, converged to w={history[-1]:.2f}")
print("Optimal value is w=3.00")

---

## 5. Optional Demo: TensorFlow Playground

🎮 **Interactive Exploration:**

Visit [TensorFlow Playground](https://playground.tensorflow.org/) to:

1. See how adding layers and neurons affects learning
2. Experiment with different activation functions
3. Watch the decision boundary evolve during training
4. Understand feature learning visually

**Try These Experiments:**
1. Start with the spiral dataset and 1 hidden layer - can you fit it?
2. Add more layers - how does it change?
3. Try ReLU vs Sigmoid - which learns faster?
4. What happens with too high a learning rate?

---

## 6. Hands-On: Image Classification with TensorFlow

Let's build a neural network to classify handwritten digits (MNIST dataset).

MNIST contains 70,000 grayscale images of digits 0-9 (60,000 for training and 10,000 for testing), each 28×28 pixels.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

print(f"TensorFlow version: {tf.__version__}")

# Load MNIST dataset
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()

print(f"\n📊 Dataset loaded:")
print(f"   Training samples: {X_train.shape[0]}")
print(f"   Test samples: {X_test.shape[0]}")
print(f"   Image shape: {X_train.shape[1:]} (28×28 pixels, grayscale)")

# Visualize some samples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_train[i], cmap='gray')
    ax.set_title(f'Label: {y_train[i]}', fontsize=11)
    ax.axis('off')
plt.suptitle('Sample MNIST Images', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
# Preprocess the data

# 1. Normalize pixel values to 0-1 range (originally 0-255)
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0

# 2. Flatten images from 28×28 to 784-dimensional vectors
#    (for our simple dense network)
X_train_flat = X_train.reshape(-1, 28 * 28)
X_test_flat = X_test.reshape(-1, 28 * 28)

print(f"Preprocessed shapes:")
print(f"  X_train: {X_train_flat.shape}")
print(f"  X_test: {X_test_flat.shape}")

In [None]:
# Build the Neural Network

model = keras.Sequential([
    # Input layer (784 features = 28×28 pixels)
    layers.Input(shape=(784,)),

    # Hidden layer 1: 128 neurons with ReLU activation
    layers.Dense(128, activation='relu', name='hidden_1'),

    # Hidden layer 2: 64 neurons with ReLU activation
    layers.Dense(64, activation='relu', name='hidden_2'),

    # Output layer: 10 neurons (one per digit) with softmax for probabilities
    layers.Dense(10, activation='softmax', name='output')
])

# Display model architecture
model.summary()

print("\n💡 Total parameters to learn:", model.count_params())

In [None]:
# Compile the model
model.compile(
    optimizer='adam',  # Popular adaptive optimizer
    loss='sparse_categorical_crossentropy',  # For multi-class classification
    metrics=['accuracy']
)

print("✅ Model compiled!")
print("   Optimizer: Adam (adaptive learning rate)")
print("   Loss: Sparse Categorical Crossentropy")
print("   Metric: Accuracy")

In [None]:
# Train the model
print("🎓 Training the neural network...\n")

history = model.fit(
    X_train_flat, y_train,
    epochs=10,  # Number of full passes through training data
    batch_size=32,  # How many samples per gradient update
    validation_split=0.1,  # Use 10% of training data for validation
    verbose=1
)

print("\n✅ Training complete!")

In [None]:
# Visualize training progress
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Accuracy plot
axes[0].plot(history.history['accuracy'], 'b-', label='Training')
axes[0].plot(history.history['val_accuracy'], 'r-', label='Validation')
axes[0].set_title('Model Accuracy Over Epochs', fontsize=12)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Accuracy')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Loss plot
axes[1].plot(history.history['loss'], 'b-', label='Training')
axes[1].plot(history.history['val_loss'], 'r-', label='Validation')
axes[1].set_title('Model Loss Over Epochs', fontsize=12)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Loss')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Evaluate on test set
test_loss, test_accuracy = model.evaluate(X_test_flat, y_test, verbose=0)

print(f"\nFinal Test Results:")
print(f"   Test Accuracy: {test_accuracy:.2%}")
print(f"   Test Loss: {test_loss:.4f}")

In [None]:
# Make predictions and visualize
predictions = model.predict(X_test_flat[:10], verbose=0)

fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_test[i], cmap='gray')
    pred_digit = np.argmax(predictions[i])
    confidence = predictions[i][pred_digit]
    true_digit = y_test[i]

    color = 'green' if pred_digit == true_digit else 'red'
    ax.set_title(f'Pred: {pred_digit} ({confidence:.0%})\nTrue: {true_digit}',
                 fontsize=10, color=color)
    ax.axis('off')

plt.suptitle('Model Predictions on Test Images', fontsize=14)
plt.tight_layout()
plt.show()

---

## 7. Why Deep Learning Powers Generative AI

The concepts we learned today are the foundation of LLMs and generative models:

| Deep Learning Concept | Generative AI Application |
|----------------------|---------------------------|
| Neural networks | GPT, BERT, LLaMA are massive neural networks |
| Backpropagation | How LLMs are trained on text data |
| Loss functions | Next-word prediction loss for language models |
| Activation functions | Used throughout transformer architectures |
| Hidden layers | LLMs have dozens to hundreds of layers |
| Softmax | Converts outputs to probability over vocabulary |

### Scale Comparison

| Model | Parameters |
|-------|------------|
| Our MNIST classifier | ~109,000 |
| GPT-2 | 1.5 billion |
| GPT-3 | 175 billion |
| GPT-4 | Not publicly disclosed (unofficial estimates: ~1.7 trillion) |
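The "~109,000" figure for the MNIST classifier follows directly from the layer sizes used in this module (784 → 128 → 64 → 10). Each dense layer has one weight per input-output pair plus one bias per unit. The short calculation below reproduces the count; `dense_params` is a hypothetical helper written for this example.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [784, 128, 64, 10]  # input, hidden_1, hidden_2, output
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(per_layer)       # [100480, 8256, 650]
print(sum(per_layer))  # 109386
```

Almost all of the parameters sit in the first layer, because every one of the 784 input pixels connects to each of the 128 hidden units.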

---

## 📝 Student Exercise

### Challenge: Improve the MNIST Classifier

Try modifying the network architecture to improve accuracy:

1. Add more hidden layers
2. Change the number of neurons per layer
3. Try different activation functions
4. Add dropout for regularization
5. Train for more epochs

In [None]:
# Student Challenge: Build an improved model

improved_model = keras.Sequential([
    layers.Input(shape=(784,)),

    # TODO: Add your layers here
    # Try: More layers, different sizes, dropout, etc.
    # Example:
    # layers.Dense(256, activation='relu'),
    # layers.Dropout(0.3),  # Regularization
    # layers.Dense(128, activation='relu'),

    layers.Dense(10, activation='softmax')
])

# Compile and train
# improved_model.compile(...)
# improved_model.fit(...)

print("Complete the improved model above!")

---

## 🎯 Key Takeaways

1. **Deep learning** automatically learns features from raw data through multiple layers
2. **Neurons** compute weighted sums and apply activation functions
3. **Training** involves forward pass → loss → backpropagation → weight update
4. **Gradient descent** optimizes weights by following the slope of the loss
5. Modern **LLMs are massive neural networks** built on these same principles

---

### Next Module: Overview of Generative AI →
We'll explore autoencoders, VAEs, and the foundations of generative models!