# Day 20: CNN Introduction

Welcome to Day 20!

Today you'll learn:

- What a CNN is
- Why CNNs exist
- How the convolution operation works, step by step
- What filters, stride, and padding are
- How to compute a convolution by hand
- How to implement convolution using NumPy

---

# What is a CNN?

A Convolutional Neural Network (CNN) is a specialized type of neural network designed to work with grid-structured data, especially images.

Examples of grid-structured data:
- Images → 2D grid of pixels  
- Videos → 3D grid (height × width × time)  
- Spectrograms → time × frequency grid  

CNNs are the standard architecture for computer vision because they preserve and exploit the spatial structure of data.

# Why Traditional Neural Networks Fail for Images

Consider a grayscale image of size 28 × 28:

- Total pixels:  
  $$
  28 \times 28 = 784
  $$

If we use a fully connected layer with just 1,000 neurons:

$$
784 \times 1000 = 784{,}000 \text{ weights (plus } 1{,}000 \text{ biases)}
$$

Problems:
1. **Too many parameters** → slow training, high memory usage  
2. **No spatial awareness**  
   - Neighboring pixels are treated the same as distant pixels  
3. **No translation understanding**  
   - The same object in a different position looks completely new  

Fully connected networks ignore how images are structured.
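
To make the parameter count concrete, here is a small sketch comparing the dense layer above with a hypothetical convolutional layer. The choice of 32 filters of size 3×3 is an assumption made for illustration, not a value from this notebook, and bias terms are counted in both cases.

```python
# Dense layer from the text: 28x28 grayscale input, 1,000 neurons.
n_inputs = 28 * 28
n_neurons = 1000
dense_weights = n_inputs * n_neurons        # one weight per (pixel, neuron) pair
dense_total = dense_weights + n_neurons     # plus one bias per neuron

# Hypothetical conv layer (assumption): 32 filters, each 3x3, one input channel.
n_filters = 32
filter_size = 3
conv_total = n_filters * (filter_size * filter_size * 1) + n_filters  # weights + biases

print("Dense weights:", dense_weights)
print("Dense total (with biases):", dense_total)
print("Conv total (with biases):", conv_total)
```

Under these assumptions the convolutional layer needs a few hundred parameters, while the dense layer needs several hundred thousand.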

# Core Ideas Behind CNNs

CNNs are built on three key assumptions about images. Let’s break down each one carefully.

### 1. Locality

Locality means that the most important visual information in an image is contained in small, local regions rather than spread across the entire image.

* **Local region / patch:** A small area of the image, e.g., a 3×3 or 5×5 block of pixels.
* **Edge:** A boundary where the intensity of pixels changes sharply (e.g., where light meets dark).
* **Corner:** A point where two edges meet.
* **Texture:** Repeating patterns in a small region, like stripes or dots.

Pixels are meaningful relative to their neighbors, not the entire image. Detecting edges, corners, or textures locally helps the network understand small building blocks of the image.

### 2. Parameter Sharing

Parameter sharing means the same set of weights (filter) is applied across multiple positions in the image.

* **Filter (kernel):** A small matrix of numbers that slides over the image to detect specific patterns.
* **Weights:** Numbers in the filter that the network learns during training.
* **Feature map:** The result of applying a filter across the image, showing where the pattern occurs.

Instead of learning a separate detector for every location in the image, CNNs learn one filter and reuse it everywhere.

* This drastically reduces the number of parameters (learnable weights), making training more efficient.
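* It also allows the network to detect the same pattern regardless of its position. Strictly speaking, the feature map shifts along with the pattern (translation *equivariance*); later stages such as pooling turn this into approximate **translation invariance**.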
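
The position-independence described above can be checked numerically. The sketch below uses a hypothetical helper `conv2d_valid` and a made-up 8×8 image containing a small bright square. Shifting the square two pixels to the right shifts the feature map by the same amount, because the same weights are used at every position.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Made-up 8x8 image with a 2x2 bright square (illustrative data only).
image = np.zeros((8, 8))
image[2:4, 1:3] = 1.0

# The same square moved 2 pixels to the right.
shifted = np.roll(image, shift=2, axis=1)

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

fm = conv2d_valid(image, kernel)
fm_shifted = conv2d_valid(shifted, kernel)

# The response pattern is identical, just moved 2 columns to the right.
print(np.array_equal(fm_shifted[:, 2:], fm[:, :-2]))
```

The comparison should print `True`: one shared filter produces the same responses wherever the pattern sits.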

### 3. Spatial Hierarchy

Spatial hierarchy means simple patterns combine to form complex structures in a layered manner.

* **Layer:** A level in the neural network that transforms input into more abstract features.
* **Edges → shapes → object parts → full objects:** This describes how visual features are learned progressively:

  1. Early layers detect simple features (edges, corners)
  2. Middle layers combine them into shapes or textures
  3. Deep layers recognize complex objects

CNNs automatically learn a hierarchy of features, building up from local patterns to high-level concepts. This is one of the main reasons CNNs work so well for images.


**Summary of New Terms**

| Term                   | Definition                                                    |
| ---------------------- | ------------------------------------------------------------- |
| Local region / patch   | Small area of the image (e.g., 3×3 pixels)                    |
| Edge                   | Boundary of sharp pixel intensity change                      |
| Corner                 | Intersection point of two edges                               |
| Texture                | Repeating pattern in a local area                             |
| Filter / Kernel        | Small learnable matrix applied across the image               |
| Weights                | Learnable numbers in the filter that detect patterns          |
| Feature map            | Output of a filter showing where a pattern occurs             |
| Layer                  | Level in a neural network that transforms input into features |
| Translation invariance | Ability to detect the same pattern regardless of its location |


# What Makes a CNN?

A Convolutional Neural Network (CNN) is a neural network designed to work with images by understanding what patterns exist and where they appear.

To understand this, we need a precise picture of three terms:
- Convolution operation  
- Filter (Kernel)  
- Feature map  

We will explain each from scratch.

### 1. Convolution Operation

A convolution is a mathematical operation where a small matrix (filter) slides over an input (like an image) and computes a weighted sum at each position.

Instead of looking at the entire image at once (like fully connected layers), convolution focuses on small local regions.

- Imagine placing a small transparent grid on top of an image.
- You slide it step by step.
- At each position, you check how well the grid matches the image underneath.
- This is different from full matrix multiplication (used in dense layers) because we only focus on small local patches instead of the whole image at once.

This is how CNNs scan images.

**Example:**

Input image patch:

$$
X =
\begin{bmatrix}
1 & 2 & 0 \\
0 & 1 & 3 \\
1 & 2 & 1
\end{bmatrix}
$$

Filter (kernel):
$$
K =
\begin{bmatrix}
0 & 1 & 0 \\
1 & -4 & 1 \\
0 & 1 & 0
\end{bmatrix}
$$

- Slide the filter across the image
- At each position, compute:
$$
\text{sum of } (X \odot K) = \sum_{i,j} X_{i,j} \cdot K_{i,j}
$$

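The result at each position is one number that tells how strongly the pattern exists at that location. Collecting these numbers over all positions gives the feature map. Here $X$ and $K$ are both 3×3, so the filter fits at only one position and the output is a single value.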
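
As a quick check of the formula above, the following sketch evaluates the weighted sum for the 3×3 patch $X$ and the filter $K$ from this example. The variable name `response` is chosen here for illustration.

```python
import numpy as np

X = np.array([[1, 2, 0],
              [0, 1, 3],
              [1, 2, 1]])

K = np.array([[0, 1, 0],
              [1, -4, 1],
              [0, 1, 0]])

# Element-wise product, then sum over all 9 entries.
response = int(np.sum(X * K))
print(response)  # 2 + 0 - 4 + 3 + 2 = 3
```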

**Example:**


Image (3×3):

$$
I = \begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
7 & 8 & 9
\end{bmatrix}
$$

Filter (2×2):

$$
F = \begin{bmatrix}
1 & 0 \\
0 & -1
\end{bmatrix}
$$

Step 1: Patches of the image

Top-left patch:

$$
P_1 = \begin{bmatrix}
1 & 2 \\
4 & 5
\end{bmatrix}
$$

Top-right patch:

$$
P_2 = \begin{bmatrix}
2 & 3 \\
5 & 6
\end{bmatrix}
$$

Bottom-left patch:

$$
P_3 = \begin{bmatrix}
4 & 5 \\
7 & 8
\end{bmatrix}
$$

Bottom-right patch:

$$
P_4 = \begin{bmatrix}
5 & 6 \\
8 & 9
\end{bmatrix}
$$


Step 2: Flatten for dot product

Filter flattened: $[1, 0, 0, -1]$  
Top-left patch flattened: $[1, 2, 4, 5]$  

Step 3: Dot product

Top-left patch: $1 \cdot 1 + 0 \cdot 2 + 0 \cdot 4 + (-1) \cdot 5 = -4$  

Top-right patch: $1 \cdot 2 + 0 \cdot 3 + 0 \cdot 5 + (-1) \cdot 6 = -4$  

Bottom-left patch: $1 \cdot 4 + 0 \cdot 5 + 0 \cdot 7 + (-1) \cdot 8 = -4$  

Bottom-right patch: $1 \cdot 5 + 0 \cdot 6 + 0 \cdot 8 + (-1) \cdot 9 = -4$  


Resulting Feature Map (2×2)

$$
\text{Feature Map} = \begin{bmatrix}
-4 & -4 \\
-4 & -4
\end{bmatrix}
$$

This demonstrates how the convolution operation computes a dot product between the filter and each patch of the image, producing a feature map that highlights where the filter pattern appears.
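
The same steps can be sketched in NumPy. This is a minimal illustration using the image $I$ and filter $F$ from the example above; the loop structure is one possible way to write it.

```python
import numpy as np

I = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
F = np.array([[1, 0],
              [0, -1]])

out_h = I.shape[0] - F.shape[0] + 1   # 3 - 2 + 1 = 2
out_w = I.shape[1] - F.shape[1] + 1   # 3 - 2 + 1 = 2
feature_map = np.zeros((out_h, out_w), dtype=int)

for i in range(out_h):
    for j in range(out_w):
        patch = I[i:i + 2, j:j + 2]                 # Step 1: extract the patch
        # Steps 2 and 3: flatten both, then take the dot product.
        feature_map[i, j] = np.dot(patch.flatten(), F.flatten())

print(feature_map)
```

Every entry is −4 because this filter computes "top-left minus bottom-right", and in this image that difference is the same for all four patches.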

In short:

Convolution is a mathematical operation that:
- Slides a small matrix (filter / kernel) over an input (like an image)
- Computes element-wise multiplication
- Sums the result to produce a feature map

This allows the network to detect:
- Edges
- Corners
- Textures
- Shapes


### 2. Filter (Kernel)

A filter (also called a kernel) is a small learnable matrix of numbers used to detect a specific visual pattern.

- Its job: detect specific patterns (edges, textures, corners) in images.
- Typical sizes: 3×3, 5×5, 7×7  
- Learns during training  
- The same filter is applied across the entire image  

**Example:**  Filter (Edge Detector)

Vertical edge filter (Sobel):

$$
\begin{bmatrix}
-1 & 0 & 1 \\
-2 & 0 & 2 \\
-1 & 0 & 1
\end{bmatrix}
$$

- This filter responds strongly to vertical edges*.
- When applied to an image:
    - Strong response → vertical edge exists
    - Weak response → no vertical edge

**Why Filters Matter**
- One filter → detects one type of pattern  
- Multiple filters → detect multiple patterns (edges, corners, textures)
- The network learns the best filter values during training; we don’t handcraft them.

---
#### <b style="color:skyblue;">*What Does “Vertical Edge” Mean?</b>

A vertical edge is a location in an image where the pixel intensity changes sharply from left to right, while remaining relatively similar from top to bottom.

In simple terms:

> A vertical edge is a vertical boundary between two different regions of brightness or color.


Imagine a black object on a white background:

White | Black <br>
White | Black <br>
White | Black


The boundary between white and black is vertical, so this boundary is called a vertical edge.

Your eyes immediately notice this boundary; CNNs learn to notice the same thing.


Consider a grayscale image patch:

$$
\begin{bmatrix}
10 & 10 & 200 \\
10 & 10 & 200 \\
10 & 10 & 200
\end{bmatrix}
$$

- Left side pixels → dark (low values)
- Right side pixels → bright (high values)

The sudden change from left to right indicates a vertical edge.


**Why CNNs Detect Vertical Edges First**

Edges are the simplest and most informative visual patterns:
- They define object boundaries
- They help form shapes
- They remain fairly stable under lighting changes

In their first convolutional layer, CNNs usually learn filters that respond to:
- Vertical edges
- Horizontal edges
- Diagonal edges


**Vertical Edge Filter (Example)**

A common vertical edge detector (Sobel filter):

$$
\begin{bmatrix}
-1 & 0 & 1 \\
-2 & 0 & 2 \\
-1 & 0 & 1
\end{bmatrix}
$$

**What This Filter Does**
- Left side → negative weights
- Right side → positive weights
- Strong response when left ≠ right

Large output magnitude → vertical edge detected (the sign shows whether brightness rises or falls from left to right)  
Output near zero → no vertical edge
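
To see these responses numerically, here is a small sketch that applies the Sobel filter above to the 3×3 edge patch from this section and to a flat patch. The flat patch (all values 10) is an assumption added here for contrast.

```python
import numpy as np

sobel_vertical = np.array([[-1, 0, 1],
                           [-2, 0, 2],
                           [-1, 0, 1]])

edge_patch = np.array([[10, 10, 200],
                       [10, 10, 200],
                       [10, 10, 200]])

flat_patch = np.full((3, 3), 10)  # uniform brightness, no edge (illustrative)

edge_response = int(np.sum(edge_patch * sobel_vertical))
flat_response = int(np.sum(flat_patch * sobel_vertical))

print("Edge patch response:", edge_response)   # -40 + 800 = 760
print("Flat patch response:", flat_response)   # -40 + 40 = 0
```

If the bright pixels were on the left instead, the response would have the same magnitude with a negative sign.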

---
### 3. Feature Map

A feature map is the output of applying a filter over an image.

- Each value in the feature map shows how strongly the filter matched at that position.
- Feature maps preserve the 2D structure of the image.

Example
- Input image size: 28×28  
- Filter size: 3×3  
- Output feature map size: 26×26* (each value indicates how strongly the pattern is present at that location)

High values in the feature map = strong presence of the pattern.

Think of it as a heatmap highlighting the pattern the filter detects.

---

#### <b style="color:skyblue;">*Why does a 28×28 image become a 26×26 feature map?</b>

A 3×3 filter is a small matrix that looks at 9 pixels at a time.

The filter:
- Starts at the top-left corner
- Computes a dot product with the underlying 3×3 patch
- Moves one pixel at a time (this movement is called stride = 1)
- Stops when it can no longer fully fit inside the image

Critical rule
> The filter must fit entirely inside the image; no partial overlap is allowed. (Padding, covered later, relaxes this rule. We are NOT using padding here.)

Visual intuition (1D first)

Imagine a 1D line of 28 pixels  
A filter of size 3

Valid positions:

[1 2 3] ✔ <br>
[2 3 4] ✔ <br>
... <br>
[26 27 28] ✔


How many positions?<br>
28 - 3 + 1 = 26


That’s the key formula.

Now extend to 2D (real images)

Input
- Height = 28
- Width = 28

Filter
- Height = 3
- Width = 3

Output height

28 - 3 + 1 = 26

Output width

28 - 3 + 1 = 26

Final output feature map size: 26 × 26

**General formula:**

For no padding, stride = 1:

- Output Size = (N - F + 1)

Where:
- `N` = input size
- `F` = filter size

For 2D:

- Output Height = (H - FH + 1)
- Output Width = (W - FW + 1)

**Why not 28×28?**

Because:
- The filter cannot sit partially outside the image
- A 3×3 window that starts in the last 2 rows or columns would extend past the border

CNNs are physically constrained pattern scanners, not abstract math tricks.

If you keep stacking convolutions without padding:
- Feature maps shrink fast
- Deep networks collapse spatial dimensions

That’s why padding exists: to control this shrinkage and the loss of border information.
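
The shrinking effect can be traced with the formula above. The sketch below assumes a 28×28 input and five stacked 3×3 convolutions with stride 1 and no padding; the number of layers is an arbitrary choice for illustration.

```python
def valid_output_size(n, f):
    """Output size for input size n and filter size f (stride 1, no padding)."""
    return n - f + 1

size = 28
sizes = [size]
for _ in range(5):          # five stacked 3x3 convolutions (assumption)
    size = valid_output_size(size, 3)
    sizes.append(size)

print(sizes)  # [28, 26, 24, 22, 20, 18]
```

Each 3×3 layer removes two rows and two columns, so the loss adds up quickly in deeper stacks.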

---

### Spatial Structure

Spatial structure means the 2D arrangement of pixels in an image is preserved.

- Unlike dense layers that flatten the image and destroy pixel positions, CNNs maintain height × width of the image through feature maps.
- This allows the network to understand where patterns occur, not just what patterns exist.

Why This Is Important
- A pattern at the top of an image remains at the top of the feature map.
- CNNs know both what the pattern is and where it appears.

### Putting Everything Together

1. Start with an image.
2. Slide a filter over small local regions.
3. Perform convolution at each position.
4. Produce a feature map.
5. Repeat with multiple filters and layers.

Early layers detect edges and textures.  
Deeper layers combine them into shapes and objects.

> CNNs learn small pattern detectors (filters) that slide across images, creating feature maps that show what patterns exist and where they appear.


# Stride

* Stride is the number of pixels the filter moves at each step when sliding over the input image.
* Default stride is 1 (move one pixel at a time).
* Increasing stride reduces the size of the output feature map because the filter skips positions.

Example 1: Stride = 1

Input image (5×5):

$$
I = \begin{bmatrix}
1 & 2 & 3 & 0 & 1 \\
0 & 1 & 2 & 3 & 1 \\
1 & 0 & 1 & 2 & 0 \\
2 & 1 & 0 & 1 & 2 \\
1 & 2 & 1 & 0 & 1
\end{bmatrix}
$$

Filter (3×3), stride 1:

* Top-left patch = rows 0–2, cols 0–2
* Next patch = rows 0–2, cols 1–3
* Next patch = rows 0–2, cols 2–4

Output feature map size formula:

$$
\text{Output Height} = \frac{H - F}{\text{stride}} + 1
$$

$$
\text{Output Width} = \frac{W - F}{\text{stride}} + 1
$$

Here, H = 5, W = 5, F = 3, stride = 1 → Output = 3×3. If the stride does not divide H − F evenly, the fraction is rounded down.

Example 2: Stride = 2

* Filter jumps 2 pixels at a time instead of 1.
* Top-left patch = rows 0–2, cols 0–2
* Next patch = rows 0–2, cols 2–4
* Next patch would start at col 4 and need cols 4–6 → exceeds boundary → stop

Output feature map size:

$$
\text{Output Height} = \frac{5 - 3}{2} + 1 = 2
$$

$$
\text{Output Width} = \frac{5 - 3}{2} + 1 = 2
$$

* So, stride = 2 produces a smaller 2×2 feature map.
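
The patch positions listed in both examples follow from a simple rule: start at 0, advance by the stride, and stop once the filter would extend past the border. A minimal sketch, where `patch_starts` is a hypothetical helper name:

```python
def patch_starts(n, f, stride):
    """Start indices along one axis for input size n and filter size f."""
    return list(range(0, n - f + 1, stride))

print(patch_starts(5, 3, 1))  # [0, 1, 2] -> 3 positions per axis
print(patch_starts(5, 3, 2))  # [0, 2]    -> 2 positions per axis
```

The number of start positions per axis is the output size along that axis.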

### Key Points About Stride

1. **Controls feature map size**

   * Larger stride → smaller feature map → fewer computations
   * Smaller stride → larger feature map → more detailed spatial info

2. **Works with padding**

   * Padding is often combined with stride > 1 to control how much the spatial dimensions shrink

3. **Intuition**

   * Think of the filter as a “window” scanning the image.
   * Stride = 1 → window moves slowly, capturing fine details.
   * Stride = 2 → window moves faster, capturing coarse information.

### Quick Visual Intuition

If filter = 2×2, stride = 1:

Step 1: filter covers top-left 2x2<br>
Step 2: move right by 1 pixel → next 2x2 patch<br>
Step 3: move right by 1 pixel → next 2x2 patch<br>
...


If stride = 2:

Step 1: filter covers top-left 2x2<br>
Step 2: move right by 2 pixels → next 2x2 patch (skips one position, so patches no longer overlap)<br>
...

* Each move defines where the dot product is computed and thus which position in the feature map gets filled.

# Padding 

Padding is a technique used to add extra pixels around the border of an image before applying a convolution. It controls the spatial size of the output feature map and affects how edges are treated.


### Why Padding is Used

1. **Control output size**  
   - Without padding, feature maps shrink after each convolution.  
   - Padding can preserve the original spatial dimensions.

2. **Preserve edge information**  
   - Without padding, pixels at the edge are included in fewer patches, so their influence is reduced.  
   - Padding lets border pixels take part in more patches, which reduces this imbalance.

3. **Enable deeper networks**  
   - When stacking many convolution layers, padding prevents feature maps from shrinking too quickly.


### Common Types of Padding

#### 1. Valid Padding ("no padding")
- No extra pixels added.  
- Output size shrinks:

$$
\text{Output Height} = H - F + 1
$$

$$
\text{Output Width} = W - F + 1
$$

Example: 5×5 image, 3×3 filter → Output: 3×3

#### 2. Same Padding
- Pads the image so that the output size equals the input size.  
- Formula for padding size:

$$
\text{Padding} = \frac{F - 1}{2} \quad (\text{assuming stride = 1})
$$

- For odd-sized filters the result is a whole number (e.g., 3×3 → padding = 1, 5×5 → padding = 2)  
- 5×5 image with 3×3 filter and padding = 1 → Output = 5×5


### How Padding Works

Example: 5×5 image, 3×3 filter, **padding = 1**

Original image:

$$
\begin{bmatrix}
1 & 2 & 3 & 0 & 1 \\
0 & 1 & 2 & 3 & 1 \\
1 & 0 & 1 & 2 & 0 \\
2 & 1 & 0 & 1 & 2 \\
1 & 2 & 1 & 0 & 1
\end{bmatrix}
$$

After padding with zeros around the border:

$$
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 2 & 3 & 0 & 1 & 0 \\
0 & 0 & 1 & 2 & 3 & 1 & 0 \\
0 & 1 & 0 & 1 & 2 & 0 & 0 \\
0 & 2 & 1 & 0 & 1 & 2 & 0 \\
0 & 1 & 2 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
$$

- Convolution is now applied to the padded image.  
- With a 3×3 filter and stride 1, the output feature map is 5×5, matching the original input size.


### Key Points

- Padding helps preserve spatial dimensions.  
- Common in modern CNNs to maintain consistent feature map size across layers.  
- Works together with stride to control output shape.  
- Two main types: valid (no padding), same (padding to preserve size).

### Output Size Formula with Padding

If input height = H, filter size = F, padding = P, stride = S:

$$
\text{Output Height} = \frac{H - F + 2P}{S} + 1
$$

$$
\text{Output Width} = \frac{W - F + 2P}{S} + 1
$$

Example: H = 5, F = 3, P = 1, S = 1 → Output Height = $(5-3+2 \cdot 1)/1 + 1 = 5$
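
The formula can be wrapped in a small helper. This sketch uses integer (floor) division, the usual convention when the division is not exact; that convention is an assumption here, since the examples in this notebook all divide evenly. The name `conv_output_size` is hypothetical.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output size along one axis: floor((n - f + 2p) / s) + 1."""
    return (n - f + 2 * p) // s + 1

print(conv_output_size(5, 3, p=1, s=1))   # 5  (same padding)
print(conv_output_size(5, 3, p=0, s=1))   # 3  (valid padding)
print(conv_output_size(5, 3, p=0, s=2))   # 2  (stride 2)
print(conv_output_size(28, 3))            # 26
```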


# Basic Components of Convolution

- **Input Image**: $H \times W$ (Height × Width)  
- **Filter (Kernel)**: $K \times K$ (Height × Width)  
- **Stride ($S$)**: Number of pixels the filter moves at each step  
- **Padding ($P$)**: Extra pixels (usually zeros) added around the input to control output size  


### Output Size Formula

For an input of size $N$ and a filter of size $K$:

$$
\text{Output Size} = \frac{N - K + 2P}{S} + 1
$$

Where:  
- $N$ = input size (height or width)  
- $K$ = kernel size (height or width)  
- $P$ = padding size  
- $S$ = stride  


### Examples

1. **No padding, stride = 1**  
Input: $5 \times 5$, Filter: $3 \times 3$, $P=0$, $S=1$  

$$
\text{Output Size} = \frac{5-3+2 \cdot 0}{1} + 1 = 3
$$  

Feature map size: $3 \times 3$

2. **Same padding, stride = 1**  
Input: $5 \times 5$, Filter: $3 \times 3$, $P=1$, $S=1$  

$$
\text{Output Size} = \frac{5-3+2 \cdot 1}{1} + 1 = 5
$$  

Feature map size: $5 \times 5$  

Using padding preserves the spatial dimensions of the input.

### Solve for Padding

If you want same padding (output size $O = N$), plug $O = N$ into the output size formula, writing $F$ for the filter size (called $K$ above):

$$
N = \frac{N - F + 2P}{S} + 1
$$

Rearranging:

$$
N - 1 = \frac{N - F + 2P}{S}
$$

$$
2P = S(N-1) - N + F
$$

$$
\boxed{P = \frac{(S \cdot (N-1) - N + F)}{2}}
$$

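* This is the **general formula for padding**. Algebraically it holds for any stride $S$, but for $S > 1$ the required padding grows with $N$ and may not be a whole number, so size-preserving padding is mostly used with $S = 1$.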
* For stride $S=1$, it simplifies to:

$$
P = \frac{F-1}{2}
$$


### Notes

1. **Odd-sized filters**: $F = 3,5,7,\dots$ → with $S = 1$, $P$ is a whole number.
2. **Even-sized filters**: $F = 2,4,\dots$ → with $S = 1$, $P$ is fractional → split the padding asymmetrically between top/bottom and left/right.
3. **Stride > 1**: Padding limits how much the output feature map shrinks.
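
For the even-sized case in note 2, here is a hedged sketch of asymmetric padding with `np.pad`. It assumes a 2×2 filter and stride 1, where the formula gives $P = 0.5$, i.e. one extra pixel in total per axis. Placing that pixel on the bottom and right is an arbitrary choice made here; libraries differ on which side they pad.

```python
import numpy as np

image = np.arange(25).reshape(5, 5)   # made-up 5x5 input
kernel = np.ones((2, 2))              # made-up 2x2 filter

# Total padding of 1 per axis: 0 before, 1 after (assumed split).
padded = np.pad(image, pad_width=((0, 1), (0, 1)), mode="constant", constant_values=0)

out_h = padded.shape[0] - kernel.shape[0] + 1
out_w = padded.shape[1] - kernel.shape[1] + 1
print(padded.shape, "->", (out_h, out_w))  # (6, 6) -> (5, 5)
```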

### Example

* Input: $7$, Filter: $3$, Stride: $1$

$$
P = \frac{(1 \cdot (7-1) - 7 + 3)}{2} = \frac{(6-7+3)}{2} = \frac{2}{2} = 1
$$

* Input: $7$, Filter: $3$, Stride: $2$

$$
P = \frac{(2 \cdot (7-1) - 7 + 3)}{2} = \frac{(12-7+3)}{2} = \frac{8}{2} = 4
$$


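| Input N | Filter F | Stride S | Padding P | Output O |
|---------|----------|----------|-----------|----------|
| 7       | 3        | 1        | 1         | 7        |
| 7       | 3        | 2        | 4         | 7        |
| 5       | 5        | 1        | 2         | 5        |
| 5       | 3        | 1        | 0         | 3        |

In the first three rows, the padding from the formula keeps the output the same size as the input. The last row uses no padding (valid convolution) for comparison, so the output shrinks to 3.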
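
The boxed formula and the table rows can be checked with a short sketch. `same_padding` and `output_size` are hypothetical helper names; `Fraction` is used so that a fractional padding is shown exactly rather than rounded.

```python
from fractions import Fraction

def same_padding(n, f, s=1):
    """Padding per side so that the output size equals n (may be fractional)."""
    return Fraction(s * (n - 1) - n + f, 2)

def output_size(n, f, p, s=1):
    """Output size from the general formula (exact division assumed)."""
    return (n - f + 2 * p) / s + 1

for n, f, s in [(7, 3, 1), (7, 3, 2), (5, 5, 1)]:
    p = same_padding(n, f, s)
    print(f"N={n}, F={f}, S={s} -> P={p}, output={output_size(n, f, p, s)}")

print(same_padding(5, 2, 1))  # 1/2: an even filter gives fractional padding
```

Note that the stride-2 case needs 4 pixels of padding on each side of a 7-pixel input, which is wider than the 3-pixel filter itself.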

# How CNNs Represent an Image

CNNs treat images not as long vectors, but as structured data with spatial meaning.
This single design choice is why CNNs work and why fully connected networks fail at vision.

Let’s break it down from first principles.


### What an image really is (numerically)

An image is not a picture to a computer. It is a grid of numbers.

#### Grayscale image
Each pixel → one intensity value (brightness)

- 0 = black  
- 255 = white (or 1.0 when pixel values are normalized to the 0–1 range)

Mathematically:
$$
X \in \mathbb{R}^{H \times W}
$$

Example:

28 × 28 → MNIST digit*

That is 784 numbers arranged in a grid, not a list.

<u>*MNIST Dataset</u>

MNIST (Modified National Institute of Standards and Technology) is a benchmark dataset of handwritten digits used to introduce and evaluate computer vision models.

It contains 70,000 grayscale images of digits from 0 to 9, where each image is a 28 × 28 pixel grid representing a single handwritten digit. Every image is paired with a label indicating the correct digit.

MNIST is widely used because it is simple, standardized, and ideal for learning fundamental concepts such as image representation, convolution, and feature extraction in Convolutional Neural Networks (CNNs).

#### RGB image
Each pixel has 3 values:
- Red
- Green
- Blue

Mathematically:
$$
X \in \mathbb{R}^{H \times W \times 3}
$$

Example:

224 × 224 × 3 → ImageNet image*

Think of it as:
- 3 stacked grayscale images
- One per color channel
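
In NumPy terms, grayscale and RGB images differ only in shape. The sketch below uses randomly generated pixel values (not real images) to show the layout; the seed and the variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Grayscale: one intensity per pixel, shape (H, W).
gray = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# RGB: three intensities per pixel, shape (H, W, 3).
rgb = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Each channel on its own is a 2D grid, like a grayscale image.
red, green, blue = rgb[:, :, 0], rgb[:, :, 1], rgb[:, :, 2]

print(gray.shape)   # (28, 28)
print(rgb.shape)    # (224, 224, 3)
print(red.shape)    # (224, 224)
```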

<u>*ImageNet Images</u>

ImageNet is a large-scale computer vision dataset designed for real-world image recognition tasks.

An ImageNet image is typically a high-resolution RGB image represented as a tensor of shape:
$$
X \in \mathbb{R}^{H \times W \times 3}
$$
where the three channels correspond to Red, Green, and Blue color values.

In practice, images are commonly resized or cropped to 224 × 224 × 3 before being fed into CNNs. The full ImageNet collection contains millions of images; the widely used ILSVRC benchmark subset covers 1,000 object categories, including animals, vehicles, tools, and everyday objects.

ImageNet is used to train and evaluate deep CNN architectures and serves as a standard benchmark for modern computer vision systems.


### Why CNNs do NOT flatten images

#### What flattening does

Flattening is the process of converting a multi-dimensional tensor (like an image) into a 1-dimensional vector.

Flattening converts:

H × W × C  →  (H · W · C)

Example:

28 × 28 → 784

This destroys spatial relationships.

After flattening, the model no longer knows:
- Which pixels were neighbors
- Where edges or corners were
- Whether patterns are local or global

To the model:

> pixel #1 is no more related to pixel #2 than to pixel #782

That’s catastrophic for vision.
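
A short sketch makes the loss of neighborhood information visible. Each pixel of a 28×28 grid is labeled with its position in the flattened vector (row-major order, which is NumPy's default); the chosen pixel (10, 10) is arbitrary.

```python
import numpy as np

H, W = 28, 28
# Label each pixel with its index in the flattened vector.
grid = np.arange(H * W).reshape(H, W)
flat = grid.flatten()

r, c = 10, 10
print(flat.shape)                     # (784,)
print(grid[r, c], grid[r, c + 1])     # horizontal neighbors: 290 291
print(grid[r, c], grid[r + 1, c])     # vertical neighbors:   290 318
```

Two pixels that touch vertically end up 28 positions apart in the vector, and nothing in a dense layer's weights records that they were adjacent.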

#### CNN core assumption

CNNs assume:

> Important visual patterns are local and repeat across space

Examples:
- An edge is meaningful wherever it appears
- A corner is still a corner at any location
- A texture is defined by neighboring pixels

A network can only exploit this assumption if:
- Height, width, and channels are preserved

Hence:

No flattening at the beginning.

### Tensor representation (how CNNs see images)

CNNs represent images as 3D tensors:

(height, width, channels)

Example (*CIFAR-10):

32 × 32 × 3

This allows:
- Convolutions across height & width
- Filters to look at local pixel neighborhoods
- Channels to encode different visual information

<u>*CIFAR-10 Dataset</u>

CIFAR-10 is a benchmark dataset for image classification consisting of 60,000 color images (32 × 32 × 3) across 10 classes such as airplane, car, cat, dog, and more.

- **Training images:** 50,000  
- **Test images:** 10,000  
- **Image size:** 32 × 32 pixels  
- **Channels:** 3 (RGB)  
- **Classes:** 10

CIFAR-10 is more complex than MNIST because images are colored, contain natural objects, and include variations in pose, lighting, and background. It is widely used to test CNN architectures and understand how models detect patterns like edges, textures, and shapes across all channels.

### What convolution actually uses from this tensor

A convolutional filter operates like this:

- Looks at a small spatial window (e.g., 3×3)
- Sees all channels at once

Example filter shape for RGB input:

3 × 3 × 3

So for each spatial location, the filter:
- Reads Red, Green, Blue together
- Learns color-aware patterns (e.g., colored edges)

This is impossible after flattening.
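
Here is a minimal sketch of a single 3×3×3 filter applied to an RGB input. The image and filter values are random placeholders, and the loop-based `conv2d_multichannel` is a hypothetical helper written for clarity rather than speed.

```python
import numpy as np

def conv2d_multichannel(image, kernel):
    """Valid convolution (stride 1) of an (H, W, C) image with a (kh, kw, C) filter."""
    kh, kw, _ = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The patch covers a small spatial window across all channels at once.
            patch = image[i:i + kh, j:j + kw, :]
            out[i, j] = np.sum(patch * kernel)
    return out

rng = np.random.default_rng(0)           # arbitrary seed
image = rng.random((32, 32, 3))          # placeholder RGB image
kernel = rng.standard_normal((3, 3, 3))  # placeholder filter weights

feature_map = conv2d_multichannel(image, kernel)
print(feature_map.shape)  # (30, 30): one 2D map per filter
```

Each output value sums 27 products (3 × 3 positions × 3 channels), so one filter still yields a single 2D feature map.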

### Concrete example (edge detection)

Suppose this vertical edge exists:

| dark | dark | bright |<br>
| dark | dark | bright |<br>
| dark | dark | bright |

A CNN:
- Detects it using local neighborhoods
- Keeps its position
- Passes this structure to deeper layers

A fully connected network:
- Sees 9 unrelated numbers
- Has no idea they form a vertical line

### When flattening DOES happen

Flattening is delayed until:
- The network has extracted high-level features
- Spatial structure is no longer critical

Typical flow:

Image → Conv → Conv → Pool → Conv → Flatten → Dense → Output

At that point:
- Each feature summarizes a high-level pattern rather than raw pixels
- Position matters less than presence
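
The typical flow above can be traced as a sequence of shapes. Everything specific in this sketch is an assumption for illustration: a 28×28×1 input, 3×3 filters with stride 1 and no padding, 2×2 non-overlapping pooling, and filter counts of 8, 16, and 32.

```python
def conv_shape(shape, n_filters, f=3):
    """Shape after a valid, stride-1 convolution with n_filters filters."""
    h, w, _ = shape
    return (h - f + 1, w - f + 1, n_filters)

def pool_shape(shape, size=2):
    """Shape after non-overlapping size x size pooling."""
    h, w, c = shape
    return (h // size, w // size, c)

shape = (28, 28, 1)
trace = [("Image", shape)]
for name, step in [
    ("Conv", lambda s: conv_shape(s, 8)),
    ("Conv", lambda s: conv_shape(s, 16)),
    ("Pool", pool_shape),
    ("Conv", lambda s: conv_shape(s, 32)),
]:
    shape = step(shape)
    trace.append((name, shape))

flat_length = shape[0] * shape[1] * shape[2]
trace.append(("Flatten", (flat_length,)))

for name, s in trace:
    print(f"{name:8s} {s}")
```

Under these assumptions the spatial grid shrinks from 28×28 to 10×10 while the number of channels grows, and only then is the result flattened into a vector for the dense layer.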

# CNN vs Fully Connected Networks

| Aspect | Fully Connected NN | CNN |
|------|-------------------|-----|
| Input handling | Flattened | Spatially preserved |
| Parameter count | Very large | Much smaller |
| Translation awareness | No | Yes |
| Scalability to images | Poor | Excellent |
| Vision performance | Weak | State-of-the-art |


# Intuition from Human Vision

CNNs loosely mimic the human visual system:
- Early layers → detect edges
- Middle layers → detect shapes
- Deeper layers → recognize objects

This hierarchical processing is a key reason for CNN success.

# Common Applications of CNNs

CNNs are used in:
- Image classification (ResNet, EfficientNet)
- Object detection (YOLO, Faster R-CNN)
- Face recognition
- Medical image analysis
- Autonomous vehicle perception

If your data has spatial structure, CNNs are usually the right tool.

# A Critical Misconception

❌ “CNNs understand images like humans.”

Reality:
- CNNs detect statistical patterns
- They have no semantic understanding
- They can fail badly on unfamiliar data

CNNs are pattern extractors, not intelligent observers.


# Manual Convolution Example

Input Image (5×5)

$$
I = \begin{bmatrix}
1 & 2 & 3 & 0 & 1 \\
0 & 1 & 2 & 3 & 1 \\
1 & 0 & 1 & 2 & 0 \\
2 & 1 & 0 & 1 & 2 \\
1 & 2 & 1 & 0 & 1
\end{bmatrix}
$$

Filter (3×3)

$$
F = \begin{bmatrix}
1 & 0 & -1 \\
1 & 0 & -1 \\
1 & 0 & -1
\end{bmatrix}
$$


**Step 1: Determine Output Size**

Input: $5 \times 5$, Filter: $3 \times 3$, Stride = 1, Padding = 0

$$
\text{Output Size} = (H - F + 1) \times (W - F + 1) = (5-3+1) \times (5-3+1) = 3 \times 3
$$


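**Step 2: Compute Dot Products for Each Patch**

In the expressions below, $\cdot$ means multiplying the patch and the filter element by element and summing the results, not matrix multiplication.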

**Top-left patch:**

$$
\begin{bmatrix}
1 & 2 & 3 \\
0 & 1 & 2 \\
1 & 0 & 1
\end{bmatrix} 
\cdot 
\begin{bmatrix}
1 & 0 & -1 \\
1 & 0 & -1 \\
1 & 0 & -1
\end{bmatrix} = -4
$$

**Top-middle patch:**

$$
\begin{bmatrix}
2 & 3 & 0 \\
1 & 2 & 3 \\
0 & 1 & 2
\end{bmatrix} \cdot F = -2
$$

**Top-right patch:**

$$
\begin{bmatrix}
3 & 0 & 1 \\
2 & 3 & 1 \\
1 & 2 & 0
\end{bmatrix} \cdot F = 4
$$

**Middle-left patch:**

$$
\begin{bmatrix}
0 & 1 & 2 \\
1 & 0 & 1 \\
2 & 1 & 0
\end{bmatrix} \cdot F = 0
$$

**Middle-middle patch:**

$$
\begin{bmatrix}
1 & 2 & 3 \\
0 & 1 & 2 \\
1 & 0 & 1
\end{bmatrix} \cdot F = -4
$$

**Middle-right patch:**

$$
\begin{bmatrix}
2 & 3 & 1 \\
1 & 2 & 0 \\
0 & 1 & 2
\end{bmatrix} \cdot F = 0
$$

**Bottom-left patch:**

$$
\begin{bmatrix}
1 & 0 & 1 \\
2 & 1 & 0 \\
1 & 2 & 1
\end{bmatrix} \cdot F = 2
$$

**Bottom-middle patch:**

$$
\begin{bmatrix}
0 & 1 & 2 \\
1 & 0 & 1 \\
2 & 1 & 0
\end{bmatrix} \cdot F = 0
$$

**Bottom-right patch:**

$$
\begin{bmatrix}
1 & 2 & 0 \\
0 & 1 & 2 \\
1 & 0 & 1
\end{bmatrix} \cdot F = -1
$$

**Step 3: Final 3×3 Feature Map**

$$
\text{Feature Map} = \begin{bmatrix}
-4 & -2 & 4 \\
0 & -4 & 0 \\
2 & 0 & -1
\end{bmatrix}
$$

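This demonstrates how the filter slides over the image, computes a dot product for each patch, and produces a feature map that highlights where the vertical edge pattern appears. Because this filter has $+1$ on the left and $-1$ on the right, positive values mark patches that are brighter on the left and negative values mark patches that are brighter on the right.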


In [1]:
import numpy as np

# Input image (5x5)
image = np.array([
    [1, 2, 3, 0, 1],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 0],
    [2, 1, 0, 1, 2],
    [1, 2, 1, 0, 1]
])

# Filter (3x3). Note: the name `filter` shadows Python's built-in filter() in this notebook.
filter = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1]
])

# Output feature map size
H_out = image.shape[0] - filter.shape[0] + 1
W_out = image.shape[1] - filter.shape[1] + 1
feature_map = np.zeros((H_out, W_out))

# Perform convolution
for i in range(H_out):
    for j in range(W_out):
        patch = image[i:i+filter.shape[0], j:j+filter.shape[1]]
        feature_map[i, j] = np.sum(patch * filter)  # element-wise multiplication + sum (dot product)

print("Feature Map:\n", feature_map)


Feature Map:
 [[-4. -2.  4.]
 [ 0. -4.  0.]
 [ 2.  0. -1.]]
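
The nested loops above can also be written without explicit Python loops. This is an alternative sketch, not the notebook's method; it assumes NumPy 1.20 or newer, where `sliding_window_view` is available, and uses the name `kernel` instead of `filter`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

image = np.array([
    [1, 2, 3, 0, 1],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 0],
    [2, 1, 0, 1, 2],
    [1, 2, 1, 0, 1],
])
kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
])

# windows[i, j] is the 3x3 patch whose top-left corner is at (i, j).
windows = sliding_window_view(image, kernel.shape)   # shape (3, 3, 3, 3)

# Multiply every patch by the kernel and sum over the last two axes.
feature_map = (windows * kernel).sum(axis=(2, 3))
print(feature_map)
```

The values match the loop-based result; only the dtype differs (integers here instead of floats).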


In [2]:
# Zero padding (adding zeros around the border of the input image before applying convolution.)
padded_image = np.pad(image, pad_width=1, mode='constant', constant_values=0)
padded_image


array([[0, 0, 0, 0, 0, 0, 0],
       [0, 1, 2, 3, 0, 1, 0],
       [0, 0, 1, 2, 3, 1, 0],
       [0, 1, 0, 1, 2, 0, 0],
       [0, 2, 1, 0, 1, 2, 0],
       [0, 1, 2, 1, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0]])

In [3]:
# Output feature map size
H_out = padded_image.shape[0] - filter.shape[0] + 1
W_out = padded_image.shape[1] - filter.shape[1] + 1
feature_map = np.zeros((H_out, W_out))

# Perform convolution
for i in range(H_out):
    for j in range(W_out):
        patch = padded_image[i:i+filter.shape[0], j:j+filter.shape[1]]
        feature_map[i, j] = np.sum(patch * filter)  # dot product

print("Padded Image:\n", padded_image)
print("\nFeature Map after Convolution:\n", feature_map)

Padded Image:
 [[0 0 0 0 0 0 0]
 [0 1 2 3 0 1 0]
 [0 0 1 2 3 1 0]
 [0 1 0 1 2 0 0]
 [0 2 1 0 1 2 0]
 [0 1 2 1 0 1 0]
 [0 0 0 0 0 0 0]]

Feature Map after Convolution:
 [[-3. -4.  0.  3.  3.]
 [-3. -4. -2.  4.  5.]
 [-2.  0. -4.  0.  6.]
 [-3.  2.  0. -1.  3.]
 [-3.  2.  2. -2.  1.]]
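
Stride and padding can be combined in one function. The sketch below is a hypothetical generalization of the loops above (the name `conv2d` and its parameters are chosen here, not taken from a library); it reuses the same 5×5 image and 3×3 filter with padding 1 and stride 2.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Zero-pad `image`, then slide `kernel` over it with the given stride."""
    padded = np.pad(image, pad_width=padding, mode="constant", constant_values=0)
    kh, kw = kernel.shape
    out_h = (padded.shape[0] - kh) // stride + 1
    out_w = (padded.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride   # top-left corner of the patch
            out[i, j] = np.sum(padded[r:r + kh, c:c + kw] * kernel)
    return out

image = np.array([
    [1, 2, 3, 0, 1],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 0],
    [2, 1, 0, 1, 2],
    [1, 2, 1, 0, 1],
])
kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
])

result = conv2d(image, kernel, stride=2, padding=1)
print(result.shape)  # (3, 3): floor((5 - 3 + 2*1) / 2) + 1 = 3 per axis
print(result)
```

With stride 2, the result equals every second row and column of the stride-1 padded feature map shown above.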


# Key Takeaways from Day 20

- CNNs preserve spatial structure
- Convolution extracts local patterns
- Filters learn features automatically
- Stride controls resolution
- Padding preserves size and borders
- CNNs scale efficiently for images

---

<p style="text-align:center; font-size:18px;">
© 2025 Mostafizur Rahman
</p>

