# Week 1: Foundations of Convolutional Neural Networks

Implement the foundational layers of CNNs (pooling, convolutions) and stack them properly in a deep network to solve multi-class image classification problems.

**Learning Objectives**:

* Explain the convolution operation
* Apply two different types of pooling operations
* Identify the components used in a convolutional neural network (padding, stride, filter, ...) and their purpose
* Build a convolutional neural network
* Implement convolutional and pooling layers in numpy, including forward propagation
* Implement helper functions to use when implementing a TensorFlow model
* Create a mood classifier using the TF Keras Sequential API
* Build a ConvNet to identify sign language digits using the TF Keras Functional API
* Build and train a ConvNet in TensorFlow for a binary classification problem
* Build and train a ConvNet in TensorFlow for a multiclass classification problem
* Explain different use cases for the Sequential and Functional APIs

---


## Computer Vision

This introduction to Convolutional Neural Networks (CNNs) covers why computer vision matters and why convolutions become a technical necessity when working with high-resolution image data.

### Importance and Impact of Computer Vision

* **Rapid Advancement:** Deep learning has propelled computer vision into real-world utility, enabling self-driving cars, advanced face recognition, and relevant content curation in consumer apps.
* **Cross-Fertilization:** Architectural innovations in computer vision often inspire breakthroughs in other fields, such as speech recognition.

### Key Computer Vision Problems

* **Image Classification:** Determining whether an object (e.g., a cat) is present in an image.
* **Object Detection:** Not only identifying objects but also determining their specific positions and drawing bounding boxes around them.
* **Neural Style Transfer:** Repainting a content image in the artistic style of a reference image (e.g., turning a landscape photo into a "Picasso" style painting).

<img src='images/cv.png' width=750px>

### The Challenge of Input Scale

* **Small Images:** A $64 \times 64$ RGB image has 12,288 features ($64 \times 64 \times 3$), which is manageable for standard fully connected networks.
* **Large Images:** A modest $1000 \times 1000$ (1-megapixel) image results in 3,000,000 input features.
* **Parameter Explosion:** In a fully connected layer with just 1,000 hidden units, a 1-megapixel input would require a weight matrix with 3 billion parameters.
* **Overfitting:** With billions of parameters, models are highly prone to overfitting without massive amounts of data.
* **Resource Constraints:** The memory and computational power required to train such a network are generally infeasible for standard hardware.

<img src='images/input_scale.png' width=750px>
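
The arithmetic behind these numbers can be checked with a short sketch. The helper name `fc_weight_count` is hypothetical and introduced only for illustration; biases are ignored, as in the text.

```python
# Hypothetical helper: number of weights between a flattened image and a
# fully connected layer (biases ignored).
def fc_weight_count(height, width, channels, hidden_units):
    input_features = height * width * channels
    return input_features * hidden_units

small = 64 * 64 * 3        # input features of a small RGB image
large = 1000 * 1000 * 3    # input features of a 1-megapixel RGB image

print(small)                                  # 12288
print(large)                                  # 3000000
print(fc_weight_count(1000, 1000, 3, 1000))   # 3000000000
```

The weight count grows linearly with the number of pixels, which is why the fully connected approach stops scaling at high resolutions.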

### The Solution: Convolutional Operations

* To process high-resolution images efficiently without a parameter explosion, deep learning uses **Convolutional Neural Networks (CNNs)**.
* **Convolutions:** This operation is the fundamental building block of CNNs, allowing the network to learn local patterns (like edges) while drastically reducing the number of parameters compared to fully connected layers.

---

## Edge Detection Example

The convolution operation is a fundamental building block of Convolutional Neural Networks (CNNs). It allows a model to learn features—starting with simple edges and progressing to complex objects—by sliding a filter over an input image.

### What is the Convolution Operation?

In computer vision, convolution is used to detect specific features, such as vertical or horizontal lines.
* **The Input:** A grayscale image is represented as a matrix of pixel intensities (e.g., a $6 \times 6 \times 1$ matrix).
* **The Filter (or Kernel):** A smaller matrix (typically $3 \times 3$) designed to identify a specific pattern. For a vertical edge detector, a common filter is:

$$ \begin{bmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{bmatrix}$$

* **The Notation:** In math and deep learning, the asterisk ($*$) denotes the convolution operation (not to be confused with standard multiplication).

### The Mechanics of Convolution

The process of convolving a $6 \times 6$ image with a $3 \times 3$ filter results in a $4 \times 4$ output matrix.

1. **Overlay:** Place the $3 \times 3$ filter over the top-left $3 \times 3$ patch of the image.
2. **Element-wise Product:** Multiply each of the 9 numbers in the filter by the corresponding pixel value in the image patch.
3. **Summation:** Add those 9 products together to get a single value for the first cell of the output matrix.
4. **Shift (Slide):** Move the filter one pixel to the right (the "stride") and repeat the calculation. Once the row is finished, move down and start the next row.
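
The four steps can also be written as a single formula. The symbols here are assumptions introduced for illustration: $X$ is the $6 \times 6$ image, $K$ the $3 \times 3$ filter, and $Y$ the $4 \times 4$ output, all with 1-based indices.

$$Y_{i,j} = \sum_{a=1}^{3}\sum_{b=1}^{3} X_{i+a-1,\,j+b-1}\,K_{a,b}, \qquad i, j \in \{1, \dots, 4\}$$

For example, $Y_{1,1}$ is the sum of the nine products over the top-left patch, and $Y_{1,2}$ uses the patch shifted one pixel to the right.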

### Intuition: Why It Detects Edges

The filter acts as a mathematical "transition detector."
* **Vertical Edge Case:** Imagine an image where the left half is bright (pixel value 10) and the right half is dark (pixel value 0).
* **The Calculation:** When the filter (with 1s on the left and -1s on the right) sits on the transition:
    * The 1s multiply the bright pixels (high positive sum).
    * The -1s multiply the dark pixels (near-zero sum).
* **The Result:** The sum is a large positive number (e.g., 30). In areas where the color is uniform (all 10s or all 0s), the 1s and -1s cancel each other out, resulting in 0.
* **Visual Output:** The final $4 \times 4$ matrix will show a bright "strip" in the middle, representing the detected edge.

<img src='images/edge_detection.png' width=750px>
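
The edge example above can be reproduced with a minimal NumPy sketch. The function name `conv2d_valid` is hypothetical; it assumes a stride of 1, no padding, and no kernel flip, and it uses explicit loops for readability rather than speed.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    n_h, n_w = image.shape
    f_h, f_w = kernel.shape
    out = np.zeros((n_h - f_h + 1, n_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + f_h, j:j + f_w]
            out[i, j] = np.sum(patch * kernel)  # element-wise product, then sum
    return out

# 6x6 image: bright (10) left half, dark (0) right half.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]], dtype=float)

result = conv2d_valid(image, vertical_edge)
print(result)  # every row is 0, 30, 30, 0
```

The two middle columns of the $4 \times 4$ output hold the value 30, which is the bright strip marking the edge; the outer columns are 0 because the filter sits on uniform regions there.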

### Practical Implementation

In deep learning frameworks, you don't perform these sums manually. Functions are built-in to handle high-dimensional convolutions:
* **TensorFlow:** `tf.nn.conv2d`
* **Keras:** the `Conv2D` layer (`tf.keras.layers.Conv2D`)
* **Output Dimensions:** For an $n \times n$ image and an $f \times f$ filter, the output size is generally $(n - f + 1) \times (n - f + 1)$. In our example, $6 - 3 + 1 = 4$.

---

## More on Edge Detection

This section discusses edge detection transitions, specialized filters, and the transition to learned parameters in Convolutional Neural Networks (CNNs).

### Positive vs. Negative Edge Transitions

* **Direction Matters:** Edge detection identifies the direction of light intensity change.
    * **Light to Dark:** A transition from high pixel values ($10$) to low values ($0$) results in a positive output (e.g., $+30$).
    * **Dark to Light:** A transition from low pixel values ($0$) to high values ($10$) results in a negative output (e.g., $-30$).
* **Absolute Value:** If the specific direction of the transition is irrelevant to the task, the absolute value of the output matrix can be used to treat all edges equally.
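
A small sketch of the sign behaviour, computed on a single $3 \times 3$ patch in plain Python. The helper `patch_response` and the two example patches are assumptions made for illustration.

```python
# Vertical edge filter from the previous section.
VERTICAL = [[1, 0, -1],
            [1, 0, -1],
            [1, 0, -1]]

def patch_response(patch, kernel):
    """Sum of element-wise products between one 3x3 patch and the filter."""
    return sum(p * k
               for patch_row, kernel_row in zip(patch, kernel)
               for p, k in zip(patch_row, kernel_row))

light_to_dark = [[10, 10, 0]] * 3   # bright on the left, dark on the right
dark_to_light = [[0, 0, 10]] * 3    # dark on the left, bright on the right

pos = patch_response(light_to_dark, VERTICAL)
neg = patch_response(dark_to_light, VERTICAL)
print(pos, neg, abs(neg))  # 30 -30 30
```

Taking `abs()` discards the direction of the transition and keeps only its strength.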

### Horizontal vs. Vertical Detection

* **Vertical Filters:** Designed with vertical columns of weights (e.g., positive on left, negative on right) to detect changes across the x-axis.
* **Horizontal Filters:** Designed with horizontal rows of weights to detect changes across the y-axis (top vs. bottom). A typical horizontal filter is the 90-degree rotation of the vertical filter:

$$\text{Horizontal Filter} = \begin{bmatrix}1&1&1\\0&0&0\\-1&-1&-1\end{bmatrix}$$

### Specialized Hand-Coded Filters

Historically, researchers developed specific matrices to make edge detection more robust to noise:
* **Sobel Filter:** Puts more weight on the central row, which makes the detector somewhat more robust.

$$\text{Sobel} = \begin{bmatrix}1&0&-1\\2&0&-2\\1&0&-1\end{bmatrix}$$

* **Scharr Filter:** Weights the central row even more heavily, giving the detector slightly different properties.

$$\text{Scharr} = \begin{bmatrix}3&0&-3\\10&0&-10\\3&0&-3\end{bmatrix}$$

### The Deep Learning Paradigm: Learned Filters

The most significant shift in modern AI is moving from hand-coded filters to learned parameters.

**Learnable Weights:** Instead of manually picking values like $1$ or $10$, the nine numbers in a $3\times3$ filter are treated as parameters ($w_1$ through $w_9$). The network uses backpropagation to automatically learn the optimal weights based on the dataset. Learned filters can detect edges at any orientation (e.g., $45^\circ$, $73^\circ$) or even complex textures that do not have a standard mathematical name.

### Mathematical Representation

Whether hand-coded or learned, the operation remains a convolution. For a $6\times6$ input and a $3\times3$ filter, we are optimizing the parameters to produce a $4\times4$ feature map that minimizes the loss function:

$$(n-f+1) \times (n-f+1) = (6-3+1) \times (6-3+1) = 4 \times 4$$

---

## Padding

This section describes the key principles of Padding, which is a vital modification to the standard convolution operation used to build deep neural networks.

### The Problems with "Valid" Convolutions

Without padding, standard convolutions (called "Valid" convolutions) suffer from two primary issues:
* **Image Shrinkage:** Every layer reduces the spatial dimensions. In deep networks with hundreds of layers, the image would shrink to $1\times{1}$ very quickly.
* **Loss of Edge Information:** Pixels at the corners and edges are only "touched" by the filter a few times, whereas central pixels are included in many overlapping $3\times3$ regions. This results in the network effectively "throwing away" data from the borders.

### The Solution: Padding ($p$)

Padding involves adding a border of pixels (usually filled with zeros) around the original image before applying the filter.
* **Dimensionality Preservation:** If we pad a $6\times6$ image with a border of $p=1$, it becomes an $8\times8$ matrix. Convolving this with a $3\times3$ filter results in a $6\times6$ output, perfectly preserving the original size.
* **Updated Output Formula:** With padding $p$ included, the output dimension for an $n\times{n}$ image and $f\times{f}$ filter is:

$$(n+2p-f+1)\times(n+2p-f+1)$$

Example: For $n=6, f=3, p=1$:

$$(6+2(1)-3+1)=6$$
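
As a quick check of this example, zero padding can be sketched with NumPy's `np.pad`. The image contents are placeholder values; only the shapes matter here.

```python
import numpy as np

image = np.arange(36, dtype=float).reshape(6, 6)  # placeholder 6x6 image
p = 1

# Add a border of p zeros on every side: 6x6 -> 8x8.
padded = np.pad(image, pad_width=p, mode="constant", constant_values=0)

f = 3
out_size = padded.shape[0] - f + 1   # equals n + 2p - f + 1
print(padded.shape, out_size)        # (8, 8) 6
```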

### Common Padding Choices
1. **Valid Convolution:** This means $p=0$. The output size is simply $(n-f+1) \times (n-f+1)$.
2. **Same Convolution:** The padding is chosen so that the output size equals the input size.

For a "Same" convolution with stride $s=1$, the padding $p$ must satisfy:

$$p=\frac{f-1}{2}$$
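
A minimal sketch of this rule, assuming stride 1. The helper name `same_padding` is hypothetical, and rejecting even filter sizes is a choice made here for illustration.

```python
def same_padding(f):
    """Padding that keeps output size equal to input size (stride 1, odd f)."""
    if f % 2 == 0:
        raise ValueError("symmetric 'same' padding needs an odd filter size")
    return (f - 1) // 2

n = 6
for f in (3, 5, 7):
    p = same_padding(f)
    out = n + 2 * p - f + 1          # output size formula with padding
    print(f, p, out)                 # out stays equal to n for every f
```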

### Why Filters are Almost Always Odd

In computer vision, filters ($f$) like $3\times3$, $5\times5$, or $7\times7$ are standard. Even-sized filters are rarely used for two reasons:
* **Symmetry:** If $f$ is even, the padding $p$ would need to be asymmetric (e.g., padding more on the left than the right), which is computationally messy.
* **Central Pixel:** Odd-sized filters have a clear "central pixel," which is helpful for tracking the position and orientation of features in the image.

---

## Strided Convolutions

This section discusses the key mechanics of **Strided Convolutions** and the technical distinction between convolution and cross-correlation.

### Strided Convolutions

Strided convolutions introduce a "jump" in the sliding window process, effectively downsampling the image during the convolution itself.
* **The Mechanism:** Instead of moving the filter $1$ pixel at a time, we move it by $s$ pixels (the stride). With $s=2$, the filter moves two positions at a time, both horizontally and vertically, skipping the intermediate positions.
* **Purpose:** This is primarily used to reduce the spatial dimensions (width and height) of the feature maps as the data moves deeper into the network, reducing computational load and increasing the receptive field.

### The General Output Dimension Formula

When combining input size ($n$), filter size ($f$), padding ($p$), and stride ($s$), the dimension of the output is calculated as:

$$\lfloor\frac{n+2p-f}{s}+1\rfloor\times\lfloor\frac{n+2p-f}{s}+1\rfloor$$

**The Floor Function:** The $\lfloor{z}\rfloor$ notation denotes the "floor" of $z$, which means rounding down to the nearest integer.

**The Logic:** In deep learning conventions, a filter must lie entirely within the image (or padded area). If the stride causes the filter to "hang" off the edge, that final calculation is simply discarded.

Example: For $n=7, f=3, p=0, s=2$:

$$\lfloor\frac{7+0-3}{2}+1\rfloor=\lfloor\frac{4}{2}+1\rfloor=3$$

The resulting output is a $3\times3$ matrix.
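
The general formula translates directly into integer arithmetic. This is a sketch with a hypothetical helper name, `conv_output_size`; Python's `//` operator performs the floor for non-negative values.

```python
def conv_output_size(n, f, p=0, s=1):
    """floor((n + 2p - f) / s) + 1, using integer floor division."""
    if n + 2 * p < f:
        raise ValueError("filter is larger than the padded input")
    return (n + 2 * p - f) // s + 1

print(conv_output_size(7, 3, p=0, s=2))  # 3
print(conv_output_size(6, 3))            # 4
print(conv_output_size(6, 3, p=1))       # 6
print(conv_output_size(8, 3, s=2))       # 3
```

The last call shows the floor in action: $\frac{8-3}{2}+1 = 3.5$, and the final window that would hang off the edge is dropped.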

### Convolution vs. Cross-Correlation

There is a common terminology mismatch between pure mathematics and deep learning:
* **Mathematical Convolution:** Requires "flipping" the filter both horizontally and vertically (mirroring) before performing the element-wise product. This gives the operation the associative property:

$$(A*B)*C=A*(B*C)$$

* **Cross-Correlation:** The exact same operation but without the flipping step.
* **Deep Learning Convention:** In AI literature, we perform cross-correlation but almost universally refer to it as "convolution."
* **Why it doesn't matter:** In a neural network, the filter values are learned via backpropagation. If a "flipped" filter were better, the network would simply learn the weights in that flipped orientation automatically. Omitting the flip simplifies implementation without affecting performance.
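
The difference is easiest to see in one dimension, where NumPy offers both operations: `np.correlate` slides the kernel as-is, while `np.convolve` flips it first. The signal and kernel values below are made up for illustration.

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([1.0, 0.0, -1.0])

# Cross-correlation: slide the kernel as-is (what deep learning calls "convolution").
cross_corr = np.correlate(signal, kernel, mode="valid")

# Mathematical convolution: the kernel is flipped before sliding.
true_conv = np.convolve(signal, kernel, mode="valid")

# Flipping the kernel by hand turns one operation into the other.
flipped = np.correlate(signal, kernel[::-1], mode="valid")

print(cross_corr)  # both entries are -2
print(true_conv)   # both entries are 2
print(flipped)     # matches true_conv
```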

### Summary Table of Parameters

| Parameter | Symbol | Effect on Output Size |
| --- | --- | --- |
| Input Size | $n$ | Larger input leads to larger output |
| Filter Size | $f$ | Larger filter reduces output size |
| Padding | $p$ | Increases output size (preserves borders) |
| Stride | $s$ | Larger stride significantly reduces output size |

---

## Convolutions Over Volume

This section explains how the convolution operation expands from 2D grayscale images into 3D Volumes, which is the standard way CNNs process color images and complex feature maps.

### Convolving over RGB Channels

* **The Input Volume:** A color image is typically represented as a $6 \times 6 \times 3$ volume, where the last dimension corresponds to the Red, Green, and Blue (RGB) channels.
* **The 3D Filter:** To convolve over this volume, the filter must also be 3D. A $3 \times 3$ filter becomes a $3 \times 3 \times 3$ volume.
* **Channel Matching Rule:** The number of channels in the filter must exactly match the number of channels in the input.
    * Formula: $n \times n \times n_c$ input requires an $f \times f \times n_c$ filter.
* **The Calculation:** The filter (containing $3 \times 3 \times 3 = 27$ parameters) overlays a 3D patch of the image. You perform an element-wise product of all $27$ pairs of numbers and sum them into a single value.
* **Dimensionality Reduction:** Even though the input and filter are 3D, the output of a single filter is a 2D matrix of size $(n - f + 1) \times (n - f + 1) \times 1$.

### Multiple Feature Detection

In practice, we don't just want to detect one type of feature (like vertical edges). We want to detect many features simultaneously.
* **Stacking Filters:** If you use two different filters (e.g., one for vertical edges and one for horizontal edges), each produces its own $4 \times 4$ output.
* **Output Volumes:** These 2D outputs are stacked together to form a new 3D volume.
If you use $2$ filters, the output is $4 \times 4 \times 2$.
If you use $128$ filters, the output is $4 \times 4 \times 128$.
* **Channels vs. Depth:** While some literature refers to this third dimension as "depth," the term "channels" is preferred in this context to avoid confusion with the "depth" (number of layers) of the neural network itself.

<img src='images/con_vol.png' width=750px>
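
A minimal NumPy sketch of convolving over a volume with several filters. The function name `conv_volume` and the random placeholder data are assumptions; the sketch uses stride 1, no padding, and no kernel flip.

```python
import numpy as np

def conv_volume(x, filters):
    """Convolve an (n, n, n_c) volume with filters of shape (f, f, n_c, k).

    Returns a volume of shape (n - f + 1, n - f + 1, k).
    """
    n_h, n_w, n_c = x.shape
    f, _, f_c, k = filters.shape
    assert f_c == n_c, "filter channels must match input channels"
    out = np.zeros((n_h - f + 1, n_w - f + 1, k))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + f, j:j + f, :]          # (f, f, n_c) slice
            for c in range(k):
                # One number per filter: sum over all f * f * n_c products.
                out[i, j, c] = np.sum(patch * filters[:, :, :, c])
    return out

rng = np.random.default_rng(0)
image = rng.random((6, 6, 3))        # placeholder RGB image
filters = rng.random((3, 3, 3, 2))   # two 3x3x3 filters

out = conv_volume(image, filters)
print(out.shape)  # (4, 4, 2)
```

Each filter contributes one $4 \times 4$ slice, and the slices are stacked along the last axis to form the output channels.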

### Summary of Dimensions

For an input volume with height/width $n$ and $n_c$ channels, using $n_c'$ filters of size $f$:

**Input:**

$$n \times n \times n_c$$

**Filter (each of the $n_c'$ filters):**

$$f \times f \times n_c$$

**Output:**

$$(n - f + 1) \times (n - f + 1) \times n_c'$$

(Note: This assumes a stride $s = 1$ and padding $p = 0$. If these are changed, the spatial dimensions $n - f + 1$ are updated using the standard formulas discussed previously.)

### Why This Matters

This approach allows the network to look for different patterns in different colors (e.g., an edge that only appears in the Red channel) or look for universal patterns by using the same weights across all three channels.

---

## One Layer of Convolutional Network

In this section, we demonstrate the transition from a single convolution operation to a complete Convolutional Neural Network (CNN) Layer. This involves adding biases, non-linearities, and understanding parameter scaling.

### Building a Full CNN Layer

One layer of a CNN consists of several sequential operations that map an input volume $a^{[l-1]}$ to an output activation volume $a^{[l]}$:
1. **Convolution:** Convolve the input volume with $n_c^{[l]}$ different filters (weights $W^{[l]}$).
2. **Bias Addition:** Add a bias $b^{[l]}$ to each filter's output. Through broadcasting, a single real number per filter is added to every element in that filter's output map.
3. **Non-linearity:** Apply an activation function, most commonly ReLU ($g(z) = \max(0, z)$).
4. **Stacking:** Combine the results into a 3D volume whose number of channels equals the number of filters used.

<img src='images/cnn_one_layer_example.png' width=800px>
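
The bias and activation steps can be sketched in NumPy as follows. The convolution output `z_conv` is a placeholder with made-up values (it stands in for the result of step 1), and the variable names are assumptions chosen for illustration.

```python
import numpy as np

# Placeholder convolution output for 2 filters: shape (4, 4, 2).
z_conv = np.zeros((4, 4, 2))
z_conv[..., 0] = -1.0   # every position of filter 0's map
z_conv[..., 1] = 2.0    # every position of filter 1's map

# One bias per filter; shape (1, 1, 2) broadcasts over height and width.
b = np.array([0.5, -0.5]).reshape(1, 1, 2)

z = z_conv + b            # bias addition via broadcasting
a = np.maximum(0, z)      # ReLU non-linearity

print(a.shape)    # (4, 4, 2)
print(a[0, 0])    # channel 0 is clipped to 0, channel 1 stays at 1.5
```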

### Parameter Calculation Exercise

A major advantage of CNNs is parameter sharing, which makes them less prone to overfitting than fully connected networks.

**Example Case:**

* **Input:** $6 \times 6 \times 3$ image
* **Filters:** $10$ filters of size $3 \times 3 \times 3$
* **Parameters per filter:** $(3 \times 3 \times 3) = 27$ weights $+ 1$ bias = $28$ parameters.
* **Total parameters for the layer:** $28 \times 10 = 280$

**Key Insight:** The number of parameters ($280$) remains constant regardless of the input image size (whether it's $1,000 \times 1,000$ or $5,000 \times 5,000$).
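
The count can be expressed as a small function. The name `conv_layer_params` is hypothetical; note that the input height and width do not appear among its arguments, which is the key insight above.

```python
def conv_layer_params(f, n_c_prev, n_c):
    """(f * f * n_c_prev weights + 1 bias) per filter, times n_c filters."""
    return (f * f * n_c_prev + 1) * n_c

print(conv_layer_params(f=3, n_c_prev=3, n_c=10))  # 280
```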

### Notation Summary for Layer $l$

To describe deep networks, we use specific notation for each layer's properties:
* **Filter Size:** $f^{[l]}$
* **Padding:** $p^{[l]}$
* **Stride:** $s^{[l]}$
* **Number of Filters:** $n_c^{[l]}$

**Dimensions of Volumes:**
* **Input:** $n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]}$
* **Output:** $n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$

The height $n_H^{[l]}$ is calculated as follows; the width $n_W^{[l]}$ uses the same formula with $n_W^{[l-1]}$ in place of $n_H^{[l-1]}$:

$$n_H^{[l]} = \lfloor \frac{n_H^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1 \rfloor$$

**Weights and Activations:**
* **Weights ($W^{[l]}$):** The total volume of weights is $f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]}$.
* **Biases ($b^{[l]}$):** A vector of dimension $n_c^{[l]}$ (or represented as $1 \times 1 \times 1 \times n_c^{[l]}$).
* **Activations ($a^{[l]}$):** $n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$.

In vectorized implementations (batch size or mini-batch size $m$), the dimension of $A^{[l]}$ is $m \times n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$.

**Note on Convention:** While some frameworks use "channels first" ($m, n_c, n_H, n_W$), here we follow the "channels last" convention ($m, n_H, n_W, n_c$).

---

## Simple Convolutional Network Example

This section describes the architecture of a complete Deep Convolutional Neural Network (ConvNet) example. This example demonstrates how spatial dimensions shrink while feature depth increases as data moves toward a final classification.

### Anatomy of a Typical ConvNet

A standard image classification pipeline follows a specific progression: taking an input image $x$ (e.g., $39 \times 39 \times 3$) and transforming it into a high-level feature vector for prediction.
* **Layer 1 (Conv1):**
    * **Input:** $39 \times 39 \times 3$
    * **Hyperparameters:** $f^{[1]}=3, s^{[1]}=1, p^{[1]}=0$ (Valid), $10$ filters.
    * **Output:** $37 \times 37 \times 10$.
* **Layer 2 (Conv2):**
    * **Input:** $37 \times 37 \times 10$
    * **Hyperparameters:** $f^{[2]}=5, s^{[2]}=2, p^{[2]}=0$, $20$ filters.
    * **Output:** $17 \times 17 \times 20$.
* **Layer 3 (Conv3):**
    * **Input:** $17 \times 17 \times 20$
    * **Hyperparameters:** $f^{[3]}=5, s^{[3]}=2, p^{[3]}=0$, $40$ filters.
    * **Output:** $7 \times 7 \times 40$.

<img src='images/conv_net_example.png' width=800px>

### The Design Pattern: Spatial vs. Depth

A key observation in ConvNet design is the inverse relationship between spatial dimensions and channel depth:
* **Height and Width ($n_H, n_W$):** Typically decrease as you go deeper into the network (e.g., $39 \to 37 \to 17 \to 7$).
* **Number of Channels ($n_c$):** Typically increase as you go deeper (e.g., $3 \to 10 \to 20 \to 40$). This allows the network to detect a wider variety of complex features even as the resolution coarsens.

### Final Classification: Flattening and FC

Once the features are extracted into a small 3D volume ($7 \times 7 \times 40$):
* **Flattening:** The volume is unrolled into a single long vector.
* **Total Units:** $7 \times 7 \times 40 = 1,960$ units.
* **Fully Connected (FC):** This vector is fed into a logistic regression or softmax output unit.
* **Output:** A final prediction $\hat{y}$ (e.g., $0$ or $1$ for "cat vs. no cat").

### Dimensionality Recap Formula

To compute the spatial size for any layer, we use the generalized formula:

$$n^{[l]} = \lfloor \frac{n^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1 \rfloor$$

For the transition from Layer 1 to Layer 2 in the example:

$$\lfloor \frac{37 + 0 - 5}{2} + 1 \rfloor = \lfloor \frac{32}{2} + 1 \rfloor = 17$$
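
Applying the formula layer by layer reproduces every shape in the example. This is a sketch in plain Python; the tuple layout of `layers` is an assumption made for illustration.

```python
# (filter size, stride, padding, number of filters) for Conv1..Conv3,
# copied from the example above.
layers = [(3, 1, 0, 10), (5, 2, 0, 20), (5, 2, 0, 40)]

n, channels = 39, 3
shapes = [(n, n, channels)]
for f, s, p, n_filters in layers:
    n = (n + 2 * p - f) // s + 1     # floor((n + 2p - f) / s) + 1
    channels = n_filters
    shapes.append((n, n, channels))

flattened = n * n * channels
print(shapes)      # [(39, 39, 3), (37, 37, 10), (17, 17, 20), (7, 7, 40)]
print(flattened)   # 1960
```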

### Three Types of ConvNet Layers

Most modern architectures are composed of three primary building blocks:
* **Convolutional (Conv):** Extracts features using sliding filters.
* **Pooling (Pool):** Reduces spatial size to decrease computation and parameters (to be discussed next).
* **Fully Connected (FC):** Performs the final high-level reasoning and classification.

---

## Pooling Layers

This section briefly touches on Pooling layers, which serve as a critical tool for reducing the spatial size of representations and making feature detection more robust.

### The Mechanics of Pooling

Pooling layers slide a window across the input volume, similar to a convolution, but instead of a dot product with weights, they perform a fixed mathematical operation.
* **Max Pooling:** The most common type. Within each window (e.g., $2\times2$), only the maximum value is preserved.
    * **Intuition:** A high value represents the presence of a specific feature (like a cat's eye). If that feature is detected anywhere in the window, the max operation preserves that "signal."
* **Average Pooling:** Computes the mean value of all pixels in the window.
    * **Usage:** While less common than Max Pooling, it is sometimes used very deep in a network to collapse spatial dimensions (e.g., from $7\times7\times1000$ to $1\times1\times1000$).

### Key Properties and Hyperparameters

Unlike convolutional layers, pooling layers are "static" operations.
* **No Learnable Parameters:** There are no weights ($W$) or biases ($b$) for gradient descent to update. Once you choose the hyperparameters, the computation is fixed.
* **Hyperparameters:**
    * **Filter Size ($f$):** Common choice is $f=2$ or $f=3$.
    * **Stride ($s$):** Common choice is $s=2$.
    * **Padding ($p$):** Very rarely used; usually $p=0$.
* **Channel Independence:** Pooling is applied to each channel ($n_c$) independently. The number of channels in the output is always identical to the number of channels in the input.

### Dimensionality and Formulas

The same formula used for convolutions applies to pooling to determine the output dimensions ($n_H \times n_W \times n_c$):

Output Height/Width:

$$\lfloor\frac{n+2p-f}{s}+1\rfloor$$

Example ($f=2, s=2$):

A $4\times4\times n_c$ input pooled with $f=2, s=2$ results in a $2\times2\times n_c$ output. This effectively halves the height and width, reducing the total spatial area by $75\%$.

<img src='images/max_pooling.png' width=750px>
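
Both pooling types can be sketched in NumPy with one function. The name `pool2d`, its `mode` argument, and the input values are assumptions made for illustration; padding is taken to be 0.

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Pool an (n_h, n_w, n_c) volume; each channel is handled independently."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    n_h, n_w, n_c = x.shape
    out_h = (n_h - f) // s + 1
    out_w = (n_w - f) // s + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w, n_c))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * s:i * s + f, j * s:j * s + f, :]
            out[i, j, :] = reduce(window, axis=(0, 1))  # reduce over height, width
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]], dtype=float).reshape(4, 4, 1)

max_out = pool2d(x, mode="max")
avg_out = pool2d(x, mode="average")
print(max_out[:, :, 0])  # max per 2x2 window: 9, 2 / 6, 3
print(avg_out[:, :, 0])  # mean per 2x2 window: 3.75, 1.25 / 3.75, 2.0
```

The function has no weights to learn: once `f`, `s`, and `mode` are chosen, the output is fully determined by the input.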

### Why Use Pooling?

* **Computational Efficiency:** By shrinking $n_H$ and $n_W$, it reduces the number of operations required in subsequent layers.
* **Robustness:** It provides a form of translational invariance. If a feature shifts by a pixel or two, the "max" in that local region will likely stay the same.
* **Overfitting Prevention:** By reducing the number of total activations, it helps the network focus on the most important features.

---

## Neural Network Example

This section demonstrates the architectural integration of a full Convolutional Neural Network (ConvNet). This example follows the classic logic of the LeNet-5 architecture, which is a foundational model for handwritten digit recognition.

### ConvNet Layer Convention

There is a common point of confusion in AI literature regarding what constitutes a "layer":
* **The Weight Rule:** Most practitioners only count layers that contain learnable parameters (weights and biases).
* **Grouping:** Because Pooling layers have no weights, they are typically grouped with the preceding Convolutional layer. Thus, "Layer 1" is often defined as $(\text{Conv1} + \text{Pool1})$.

### The Progressive Architecture (Example Walkthrough)

A standard network follows a repeating pattern of feature extraction followed by classification.
1. **Feature Extraction (Layers 1 & 2):**
    * **Layer 1:** A $32\times32\times3$ input is convolved with $6$ filters ($5\times5$), resulting in $28\times28\times6$. It is then downsampled via Max Pooling ($f=2, s=2$) to $14\times14\times6$.
    * **Layer 2:** A second convolution with $16$ filters ($5\times5$) results in $10\times10\times16$, which is pooled down to $5\times5\times16$.
2. **Flattening:** The final $5\times5\times16$ volume is "unrolled" into a 1D vector of $400$ units ($5\times5\times16 = 400$).
3. **Classification (Fully Connected Layers):**
    * **FC3:** $400$ units connect to $120$ neurons.
    * **FC4:** $120$ units connect to $84$ neurons.
    * **Output Layer:** $84$ units connect to a Softmax layer with $10$ outputs (representing digits $0$ through $9$).

<img src='images/cnn_example.png' width=900px>

### Parameter vs. Activation Trends

Analyzing the data flow of a ConvNet reveals several critical patterns:
* **Spatial Dimensions ($n_H, n_W$):** Gradually decrease as you go deeper ($32 \to 28 \to 14 \to 10 \to 5$).
* **Channel Depth ($n_c$):** Gradually increases ($3 \to 6 \to 16$).
* **Parameter Distribution:**
    * **Conv Layers:** Use relatively few parameters due to weight sharing.
    * **FC Layers:** Contain the vast majority of the network's parameters.
    * **Pool Layers:** Contain exactly $0$ parameters.
* **Activation Size:** Ideally should decrease gradually. A sudden "drop" in activation size can lead to a loss of representational power.

### Dimensionality Summary Table

For an input $32\times32\times3$, the transformation looks like this:

| Layer | Type | Activation Shape | Activation Size | Parameters |
| --- | --- | --- | --- | --- |
| Input | Image | $32 \times 32 \times 3$ | 3,072 | 0 |
| Layer 1 | Conv | $28 \times 28 \times 6$ | 4,704 | $(5 \cdot 5 \cdot 3 + 1) \cdot 6 = 456$ |
| Layer 1 | Pool | $14 \times 14 \times 6$ | 1,176 | 0 |
| Layer 2 | Conv | $10 \times 10 \times 16$ | 1,600 | $(5 \cdot 5 \cdot 6 + 1) \cdot 16 = 2,416$ |
| Layer 2 | Pool | $5 \times 5 \times 16$ | 400 | 0 |
| FC 3 | Dense | $120 \times 1$ | 120 | $400 \cdot 120 + 120 = 48,120$ |
| FC 4 | Dense | $84 \times 1$ | 84 | $120 \cdot 84 + 84 = 10,164$ |
| Output | Softmax | $10 \times 1$ | 10 | $84 \cdot 10 + 10 = 850$ |
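
The parameter column can be recomputed with a short sketch. The helper names `conv_params` and `dense_params` are hypothetical; the layer sizes are taken from the table.

```python
def conv_params(f, n_c_prev, n_c):
    return (f * f * n_c_prev + 1) * n_c   # weights plus one bias per filter

def dense_params(n_in, n_out):
    return n_in * n_out + n_out           # weight matrix plus bias vector

params = {
    "Conv1": conv_params(5, 3, 6),
    "Conv2": conv_params(5, 6, 16),
    "FC3": dense_params(400, 120),
    "FC4": dense_params(120, 84),
    "Softmax": dense_params(84, 10),
}
total = sum(params.values())
fc_share = (params["FC3"] + params["FC4"] + params["Softmax"]) / total

print(params)
print(total)               # 62006
print(round(fc_share, 3))  # 0.954
```

Roughly 95% of the parameters sit in the fully connected layers, which matches the parameter distribution described above.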

### Expert Advice

When designing your own network, don't try to "guess" these hyperparameters ($f, s, p, n_c$) from scratch. The industry standard is to start with a proven architecture (like LeNet, AlexNet, or VGG) that has worked on similar datasets and adapt it to your specific needs.

---

## Why Convolutions?

This section discusses the concluding concepts for this week regarding why Convolutional Neural Networks (CNNs) are mathematically and practically superior to standard networks for computer vision.

### The Two Primary Advantages of Convolutions

CNNs use significantly fewer parameters than Fully Connected (FC) networks, which makes them faster to train and less prone to overfitting.

* **Parameter Sharing:** A feature detector (like a vertical edge detector) that is useful in the top-left of an image is likely useful in the bottom-right. Instead of learning a different detector for every pixel location, the CNN learns one set of filter weights and applies it across the entire image.
    * *Example:* A single-channel $5\times5$ filter has only $26$ parameters ($25$ weights plus $1$ bias), regardless of the input image size.
* **Sparsity of Connections:** In a convolutional layer, each output value depends only on a small local region of the input (the size of the filter).
    * *Example:* In a $3\times3$ convolution, one output unit is connected to only $9$ input pixels. In an FC layer, that same unit would be connected to every single pixel in the image.

### Parameter Comparison Example

Consider a $32\times32\times3$ input ($3,072$ units) transforming into a $28\times28\times6$ output ($4,704$ units).
* Fully Connected Approach:

$$3,072 \times 4,704 \approx 14,000,000\text{ parameters}$$

* Convolutional Approach ($6$ filters of $5\times5$):

$$(5 \times 5 \times 3 + 1) \times 6 = 456\text{ parameters}$$

The CNN reduces the parameter count by a factor of more than $30,000$ in this single layer transition.

### Translation Invariance

CNNs naturally capture Translation Invariance, which is the property that an object (like a cat) is still the same object even if it is shifted a few pixels in any direction. Because the network applies the same filters across all positions, it learns to recognize features regardless of their specific coordinates.

### Training the Network

Training a CNN follows the same fundamental principles as a standard Deep Neural Network:
1. **Initialization:** Randomly initialize weights $W$ and biases $b$.
2. **Forward Prop:** Pass the image through Conv, Pool, and FC layers to generate a prediction $\hat{y}$.
3. **Cost Function:** Calculate the cost $J$, the average loss over $m$ training examples:

$$J = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)}, y^{(i)})$$

4. **Optimization:** Use algorithms like Gradient Descent, Momentum, RMSProp, or Adam to update the parameters and minimize the cost.
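
The cost in step 3 can be sketched in plain Python. The text does not fix a particular loss $\mathcal{L}$, so cross-entropy on one-hot labels is an assumption here, and the predictions are made-up softmax outputs.

```python
import math

def cross_entropy(y_hat, y):
    """Loss for one example; y is a one-hot list, y_hat a list of probabilities."""
    return -sum(t * math.log(p) for p, t in zip(y_hat, y) if t > 0)

def cost(predictions, labels):
    """J = (1/m) * sum of per-example losses."""
    m = len(predictions)
    return sum(cross_entropy(p, t) for p, t in zip(predictions, labels)) / m

predictions = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]   # made-up softmax outputs
labels = [[1, 0, 0], [0, 1, 0]]                    # one-hot targets

print(round(cost(predictions, labels), 4))  # 0.2899
```

An optimizer such as gradient descent would then adjust the filter weights and biases to push this value down.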

### Final Pro Tip

Due to the high number of hyperparameters in CNNs, the most effective strategy for most developers is **Transfer Learning** or **Architecture Adaptation**: using a structure that has already been proven effective in research (like ResNet or Inception) and fine-tuning it for your specific data.