<a href="https://www.kaggle.com/code/mrafraim/dl-day-33-phase-03-summary-day-20-33?scriptVersionId=295164189" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Phase 03 Summary: Day 20–33

Welcome to Day 33!

Topics Covered in Phase 03:

✔ CNN fundamentals  
✔ CNN training in PyTorch  
✔ Filters, gradients, and visualization  
✔ RNN, LSTM, GRU intuition  
✔ NLP preprocessing  
✔ CNN vs RNN decision-making  
✔ Regularization techniques  
✔ Hyperparameter tuning  


---


# SECTION 1: CNN RECAP (Day 20–23)


## What CNNs Actually Learn

CNNs learn filters (kernels).

Each filter:
- Is a small weight matrix
- Slides over input
- Detects a specific pattern

Example:
- Edge
- Curve
- Texture
- Stroke
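To make "slides over input" concrete, here is a minimal NumPy sketch of one filter applied to a toy image. The helper `slide_filter` and the hand-written `vertical_edge` kernel are illustrative assumptions; a trained CNN learns its own weights instead of using fixed ones.

```python
import numpy as np

def slide_filter(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 5x5 image: dark on the left (0), bright on the right (1)
image = np.array([[0, 0, 0, 1, 1]] * 5, dtype=float)

# Hand-written vertical-edge detector (assumed for illustration)
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)

response = slide_filter(image, vertical_edge)
print(response)  # every row is [0, 3, 3]: the filter fires where dark meets bright
```

The response is zero over the flat dark region and large where the window covers the dark-to-bright transition, which is what "detects a specific pattern" means in practice.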


## CNN Filter Shape (Revisited)

```python
model.conv1.weight.shape
```

Output:
```python
torch.Size([16, 1, 3, 3])
```

Meaning:

- 16 filters (output channels)
- 1 input channel (grayscale)
- 3×3 spatial size (kernel height × width)

Each filter acts as one pattern detector.
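The shape can be reproduced with a small stand-in model. `TinyCNN` below is a hypothetical class, not the model from the earlier days; it only assumes a first layer named `conv1` with 1 input channel, 16 filters, and a 3×3 kernel.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):  # hypothetical stand-in model
    def __init__(self):
        super().__init__()
        # 1 grayscale input channel -> 16 filters of size 3x3
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)

    def forward(self, x):
        return torch.relu(self.conv1(x))

model = TinyCNN()
weight_shape = tuple(model.conv1.weight.shape)

# A batch of 8 MNIST-sized images: (batch, channels, height, width)
x = torch.randn(8, 1, 28, 28)
feature_maps = model(x)

print(weight_shape)               # (16, 1, 3, 3)
print(tuple(feature_maps.shape))  # (8, 16, 26, 26)
```

Each of the 16 filters produces its own 26×26 feature map, since a 3×3 kernel without padding shrinks a 28×28 input by 2 in each dimension.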

## CNN Training Loop

For each batch:
1. Forward pass → predictions
2. Compute loss
3. Backprop → gradients for filters
4. Optimizer updates filters

This cycle repeats for every batch in every epoch, so the filters are refined through thousands of small updates, not one.
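The four steps map onto a PyTorch loop roughly as follows. This is a minimal sketch on random data: the model layout, batch sizes, and learning rate are assumptions chosen only to keep the example small. Note the extra `optimizer.zero_grad()` call, which PyTorch needs because gradients accumulate by default.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small illustrative CNN (assumed architecture)
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Synthetic stand-in for a DataLoader: 3 batches of 8 grayscale images
batches = [(torch.randn(8, 1, 28, 28), torch.randint(0, 10, (8,)))
           for _ in range(3)]

conv_weight_before = model[0].weight.detach().clone()
losses = []
for epoch in range(2):
    for images, labels in batches:
        optimizer.zero_grad()             # clear gradients from the previous batch
        logits = model(images)            # 1. forward pass -> predictions
        loss = criterion(logits, labels)  # 2. compute loss
        loss.backward()                   # 3. backprop -> gradients for filters
        optimizer.step()                  # 4. optimizer updates filters
        losses.append(loss.item())

filters_changed = not torch.equal(conv_weight_before, model[0].weight.detach())
print(len(losses), filters_changed)
```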


# SECTION 2: RNN / LSTM RECAP (Day 24–29)


## Why RNNs Exist

A convolution only sees order inside its local window, so CNNs ignore longer-range ordering.

RNNs address the case where:
> “What happened *before* matters.”

They introduce:
- Hidden state
- Temporal dependency
- Sequential processing


## RNN Unrolling Intuition

At time step $t$:
$$
h_t = f(x_t, h_{t-1})
$$

Meaning:
- The current hidden state depends on the current input and the previous hidden state
- Memory flows forward, one time step at a time
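The recurrence can be unrolled by hand in a few lines of NumPy. This sketch assumes the common choice $f = \tanh(W_x x_t + W_h h_{t-1} + b)$; the names `W_x`, `W_h`, `b` and all sizes are illustrative, and the weights are random, not trained.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 4, 5

# Randomly initialised parameters (assumed shapes, untrained)
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

xs = rng.normal(size=(seq_len, input_size))  # one input vector per time step
h = np.zeros(hidden_size)                    # h_0: empty memory

hidden_states = []
for x_t in xs:
    # h_t = f(x_t, h_{t-1}); the same weights are reused at every step
    h = np.tanh(W_x @ x_t + W_h @ h + b)
    hidden_states.append(h)

hidden_states = np.stack(hidden_states)
print(hidden_states.shape)  # (5, 4): one hidden state per time step
```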


## Why LSTM > RNN

Plain RNN:
- Suffers from vanishing gradients

LSTM adds:
- A separate cell state that carries long-term memory
- Forget gate
- Input gate
- Output gate

Result:
- More stable long-term memory
- Better gradient flow
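A single LSTM step written out in NumPy shows where each gate acts. This is a sketch of the standard formulation; the stacked parameter layout (`W`, `U`, `b` ordered as forget, input, output, candidate) is an assumption made for this example and differs from the internal layout PyTorch uses.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b stack the f, i, o gates and the candidate g."""
    z = W @ x_t + U @ h_prev + b        # shape: (4 * hidden_size,)
    f, i, o, g = np.split(z, 4)
    f = sigmoid(f)                      # forget gate: how much old memory to keep
    i = sigmoid(i)                      # input gate: how much new content to write
    o = sigmoid(o)                      # output gate: how much memory to expose
    g = np.tanh(g)                      # candidate content
    c_t = f * c_prev + i * g            # additive cell-state update
    h_t = o * np.tanh(c_t)              # hidden state read from the cell state
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(scale=0.5, size=(4 * hidden_size, input_size))
U = rng.normal(scale=0.5, size=(4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)  # (4,) (4,)
```

The additive update `c_t = f * c_prev + i * g` is the part that helps gradients: when the forget gate stays close to 1, the cell state passes through a step almost unchanged.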


## Mini Example: Sequence Prediction

Task:
> Predict next word given previous words

RNN/LSTM:
- Processes tokens one-by-one
- Updates hidden state
- Outputs probabilities at each step
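A minimal PyTorch sketch of that pipeline is shown below. `NextWordLSTM`, the vocabulary size, and the layer sizes are hypothetical, and the model is untrained, so the probabilities are meaningless; the point is the shape of the data at each stage.

```python
import torch
import torch.nn as nn

class NextWordLSTM(nn.Module):  # hypothetical model for illustration
    def __init__(self, vocab_size, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        emb = self.embed(token_ids)  # (batch, seq_len, embed_dim)
        out, _ = self.lstm(emb)      # hidden state at every step: (batch, seq_len, hidden_dim)
        return self.head(out)        # logits over the vocabulary: (batch, seq_len, vocab_size)

vocab_size = 50
model = NextWordLSTM(vocab_size)

tokens = torch.randint(0, vocab_size, (2, 7))  # 2 sequences of 7 token ids
probs = torch.softmax(model(tokens), dim=-1)   # next-word distribution at each step

print(tuple(probs.shape))  # (2, 7, 50)
```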


# SECTION 3: CNN vs RNN (Day 30)


## Architectural Decision Summary

| Question | Use CNN | Use RNN |
|--------|--------|--------|
| Is order critical? | ❌ | ✅ |
| Is speed important? | ✅ | ❌ |
| Long context needed? | ❌ | ✅ |
| Production scale? | ✅ | ❌ |

Key insight:
> CNN = pattern extractor  
> RNN = memory-based model


# SECTION 4: Regularization (Day 31)


## Dropout Recap

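Dropout randomly disables neurons during training:
$$
\tilde{h} = h \odot r
$$

Where:
- $r \sim \text{Bernoulli}(p)$ is a binary mask sampled independently for each neuron, with $p$ the probability of keeping it
- $\odot$ is element-wise multiplication
- PyTorch's `nn.Dropout(p)` uses the opposite convention: its `p` is the probability of dropping a neuron, and kept activations are rescaled by $1/(1-p)$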

Effect:
- Prevents co-adaptation
- Improves generalization
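The mask formula can be sketched in NumPy as below. Two assumptions are made here: $p$ is treated as the keep probability (`keep_prob`), and the "inverted dropout" rescaling by `1 / keep_prob` is added so the expected activation stays the same, which is what common frameworks do in practice.

```python
import numpy as np

def dropout(h, keep_prob, rng, training=True):
    """Inverted dropout: mask with r ~ Bernoulli(keep_prob), then rescale."""
    if not training:
        return h                                   # no masking at inference time
    r = rng.binomial(1, keep_prob, size=h.shape)   # 1 = keep, 0 = drop
    return h * r / keep_prob                       # rescale to preserve the mean

rng = np.random.default_rng(0)
h = np.ones((1000, 100))

h_train = dropout(h, keep_prob=0.8, rng=rng)
h_eval = dropout(h, keep_prob=0.8, rng=rng, training=False)

dropped_fraction = np.mean(h_train == 0)
print(round(dropped_fraction, 2))  # close to 0.2
print(round(h_train.mean(), 2))    # close to 1.0, the original mean
```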


## BatchNorm vs LayerNorm

| Model | Preferred Normalization |
|----|------------------------|
| CNN | BatchNorm |
| RNN | LayerNorm |

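Reason:
- CNN → large image batches give stable per-channel batch statistics
- RNN → batch statistics shift across time steps and sequence lengths, so normalizing each sample on its own is more reliable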
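The difference comes down to which axes the statistics are computed over. The sketch below uses random tensors with assumed shapes to show that `nn.BatchNorm2d` normalizes each channel across the batch, while `nn.LayerNorm` normalizes each token's feature vector independently of the batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# CNN-style activations: (batch, channels, height, width), shifted and scaled
images = torch.randn(32, 16, 8, 8) * 3 + 5
bn = nn.BatchNorm2d(16)            # modules start in training mode
bn_out = bn(images)
# Statistics are shared across the batch and spatial dims, one set per channel
per_channel_mean = bn_out.mean(dim=(0, 2, 3))

# RNN-style activations: (batch, seq_len, features)
seq = torch.randn(4, 10, 64) * 3 + 5
ln = nn.LayerNorm(64)
ln_out = ln(seq)
# Statistics are computed per token, independent of batch size and sequence length
per_token_mean = ln_out.mean(dim=-1)

print(tuple(per_channel_mean.shape))  # (16,)
print(tuple(per_token_mean.shape))    # (4, 10)
```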


# SECTION 5: Hyperparameters (Day 32)


## Learning Rate (Most Critical)

Update rule (gradient descent with learning rate $\alpha$):
$$
\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)
$$

Rules of thumb:
- Too large → divergence
- Too small → slow training
- Tune LR before architecture
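The first two rules of thumb can be seen on a toy problem. This sketch applies the update rule to $L(\theta) = \theta^2$, an assumed example loss with gradient $2\theta$; the three learning rates are chosen only to illustrate the three regimes.

```python
def gradient_descent(lr, steps=20, theta=1.0):
    """Minimise L(theta) = theta**2 using theta <- theta - lr * dL/dtheta."""
    for _ in range(steps):
        grad = 2.0 * theta          # dL/dtheta for L = theta**2
        theta = theta - lr * grad
    return theta

too_large = gradient_descent(lr=1.1)     # overshoots more each step: diverges
too_small = gradient_descent(lr=0.001)   # barely moves from the start value 1.0
reasonable = gradient_descent(lr=0.1)    # shrinks steadily towards the minimum at 0

print(abs(too_large) > 1.0)   # True
print(too_small > 0.9)        # True
print(abs(reasonable) < 0.02) # True
```

Each step multiplies $\theta$ by $(1 - 2\alpha)$, so the iterate grows in magnitude whenever $|1 - 2\alpha| > 1$ and converges otherwise.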


## CNN vs RNN Hyperparameters

| Parameter | CNN | RNN |
|--------|-----|-----|
| Learning rate | Higher | Lower |
| Batch size | Larger | Smaller |
| Gradient clipping | Rarely needed | Strongly recommended |
| Optimizer | SGD / Adam | Adam / RMSprop |
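Gradient clipping for a recurrent model fits between `backward()` and `step()`. The sketch below uses `torch.nn.utils.clip_grad_norm_` on a small LSTM; the model sizes, the deliberately large targets, and `max_norm = 1.0` are assumptions for illustration, not recommended values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(4, 20, 8)      # 4 sequences, 20 time steps each
y = torch.randn(4, 1) * 100    # large targets to provoke large gradients

optimizer.zero_grad()
out, _ = rnn(x)
loss = nn.functional.mse_loss(head(out[:, -1]), y)  # predict from the last step
loss.backward()

max_norm = 1.0
# Rescales all gradients in place so their combined L2 norm is at most max_norm;
# the return value is the norm measured before clipping
norm_before = float(torch.nn.utils.clip_grad_norm_(params, max_norm))
norm_after = float(torch.sqrt(sum(p.grad.pow(2).sum() for p in params)))

optimizer.step()
print(norm_before, norm_after)
```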


# Phase 03 Final Takeaways

- CNNs learn spatial patterns
- RNNs learn temporal dependencies
- Training stability is not automatic
- Regularization is not optional
- Hyperparameters control learning dynamics
- Architecture choice is a business decision as much as a technical one: it trades accuracy against speed, cost, and deployment constraints

---


<p style="text-align:center; font-size:18px;">
© 2026 Mostafizur Rahman
</p>
