# Day 20: CNN Introduction

Welcome to Day 20!

Today you'll learn:

- What a CNN is
- Why CNNs exist
- How the convolution operation works, step by step
- What filters, stride, and padding do
- How to compute a convolution by hand
- How to implement convolution using NumPy


---

# What is a Convolutional Neural Network (CNN)?

A Convolutional Neural Network (CNN) is a specialized type of neural network designed to work with grid-structured data, especially images.

Examples of grid-structured data:
- Images → 2D grid of pixels  
- Videos → 3D grid (height × width × time)  
- Spectrograms → time × frequency grid  

CNNs are the standard architecture for computer vision because they preserve and exploit the spatial structure of data.

## Why Traditional Neural Networks Fail for Images

Consider a grayscale image of size 28 × 28:

- Total pixels:  
  $$
  28 \times 28 = 784
  $$

If we use a fully connected layer with just 1,000 neurons:

$$
784 \times 1000 = 784{,}000 \text{ parameters}
$$

Problems:
1. **Too many parameters** → slow training, high memory usage  
2. **No spatial awareness**  
   - Neighboring pixels are treated the same as distant pixels  
3. **No translation understanding**  
   - The same object in a different position looks completely new  

Fully connected networks ignore how images are structured.
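
To make the parameter count concrete, here is a small sketch comparing the dense layer above with a hypothetical convolutional layer. The choice of 32 filters of size 3×3 is an assumption made for illustration, not a figure from this notebook, and biases are ignored in both counts.

```python
# Dense layer from the text: each of the 784 pixels connects to each of 1,000 neurons.
num_pixels = 28 * 28
num_neurons = 1000
dense_weights = num_pixels * num_neurons

# Hypothetical conv layer (assumption): 32 filters, each 3x3, on a 1-channel image.
num_filters = 32
kernel_size = 3
conv_weights = num_filters * kernel_size * kernel_size

print(f"Dense weights: {dense_weights:,}")  # 784,000
print(f"Conv weights:  {conv_weights:,}")   # 288
print(f"Roughly {dense_weights // conv_weights:,}x fewer weights in the conv layer")
```

The conv layer's weight count does not depend on the image size at all, which is why the gap widens further for larger images.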

## Core Ideas Behind CNNs

CNNs are built on three key assumptions about images. Let’s break down each one carefully.

### 1. Locality

Locality means most important visual information in an image is contained in small, local regions, rather than spread across the entire image.

* **Local region / patch:** A small area of the image, e.g., a 3×3 or 5×5 block of pixels.
* **Edge:** A boundary where the intensity of pixels changes sharply (e.g., where light meets dark).
* **Corner:** A point where two edges meet.
* **Texture:** Repeating patterns in a small region, like stripes or dots.

Pixels are meaningful relative to their neighbors, not the entire image. Detecting edges, corners, or textures locally helps the network understand small building blocks of the image.

### 2. Parameter Sharing

Parameter sharing means the same set of weights (filter) is applied across multiple positions in the image.

* **Filter (kernel):** A small matrix of numbers that slides over the image to detect specific patterns.
* **Weights:** Numbers in the filter that the network learns during training.
* **Feature map:** The result of applying a filter across the image, showing where the pattern occurs.

Instead of learning a separate detector for every location in the image, CNNs learn one filter and reuse it everywhere.

* This drastically reduces the number of parameters (learnable weights), making training more efficient.
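* It also allows the network to recognize the same pattern regardless of its position, a property usually called **translation invariance**. More precisely, when the pattern moves in the input, its response moves by the same amount in the feature map (translation equivariance).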
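
The position-independence described above can be checked numerically. This is a minimal sketch: the helper `correlate2d_valid`, the 8×8 test image, and the vertical-edge kernel are illustrative choices, not part of the original notebook. The same filter is applied to an image and to a copy shifted two columns to the right.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

image = np.zeros((8, 8))
image[2:5, 2] = 1.0                        # a short vertical bar in column 2
shifted = np.roll(image, shift=2, axis=1)  # the same bar, now in column 4

fmap = correlate2d_valid(image, kernel)
fmap_shifted = correlate2d_valid(shifted, kernel)

# The response pattern is identical, just moved two columns to the right.
print(np.array_equal(fmap_shifted[:, 2:], fmap[:, :-2]))  # True
```

One set of nine weights detects the bar in both positions; no second detector had to be learned for the new location.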

### 3. Spatial Hierarchy

Spatial hierarchy means simple patterns combine to form complex structures in a layered manner.

* **Layer:** A level in the neural network that transforms input into more abstract features.
* **Edges → shapes → object parts → full objects:** This describes how visual features are learned progressively:

  1. Early layers detect simple features (edges, corners)
  2. Middle layers combine them into shapes or textures
  3. Deep layers recognize complex objects

CNNs automatically learn a hierarchy of features, building up from local patterns to high-level concepts. This is one of the main reasons CNNs work so well for images.
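
One way to see why deeper layers can represent larger structures is to count how many input pixels a single output value depends on (its receptive field). The sketch below assumes a stack of identical convolution layers; the function name `receptive_field` and the layer settings are hypothetical and chosen only for illustration.

```python
def receptive_field(num_layers, kernel_size=3, stride=1):
    """Width, in input pixels, seen by one unit after stacking identical conv layers."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump  # each layer widens the view by (k - 1) steps
        jump *= stride                  # a stride > 1 makes later steps larger
    return rf

for depth in (1, 2, 3, 5):
    print(depth, receptive_field(depth))  # 1 -> 3, 2 -> 5, 3 -> 7, 5 -> 11
```

With 3×3 filters and stride 1, each extra layer lets a unit see two more pixels in each direction of the original image, so later layers can respond to larger patterns.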


 **Summary of New Terms**

| Term                   | Definition                                                    |
| ---------------------- | ------------------------------------------------------------- |
| Local region / patch   | Small area of the image (e.g., 3×3 pixels)                    |
| Edge                   | Boundary of sharp pixel intensity change                      |
| Corner                 | Intersection point of two edges                               |
| Texture                | Repeating pattern in a local area                             |
| Filter / Kernel        | Small learnable matrix applied across the image               |
| Weights                | Learnable numbers in the filter that detect patterns          |
| Feature map            | Output of a filter showing where a pattern occurs             |
| Layer                  | Level in a neural network that transforms input into features |
| Translation invariance | Ability to detect the same pattern regardless of its location |


## What Makes a Convolutional Neural Network (CNN)?

A Convolutional Neural Network (CNN) is a neural network designed to work with images by understanding what patterns exist and where they appear.

To understand this, we must clearly define three new terms:
- Convolution operation  
- Filter (Kernel)  
- Feature map  

We will explain each from scratch.

## 1. Convolution Operation

A convolution is a mathematical operation where a small matrix (filter) slides over an input (like an image) and computes a weighted sum at each position.

Instead of looking at the entire image at once (like fully connected layers), convolution focuses on small local regions.

- Imagine placing a small transparent grid on top of an image.
- You slide it step by step.
- At each position, you check how well the grid matches the image underneath.
- This is different from full matrix multiplication (used in dense layers) because we only focus on small local patches instead of the whole image at once.

This is how CNNs scan images.

Example

Input image patch:

$$
X =
\begin{bmatrix}
1 & 2 & 0 \\
0 & 1 & 3 \\
1 & 2 & 1
\end{bmatrix}
$$

Filter (kernel):
$$
K =
\begin{bmatrix}
0 & 1 & 0 \\
1 & -4 & 1 \\
0 & 1 & 0
\end{bmatrix}
$$

- Slide the filter across the image
- At each position, compute:
$$
\text{sum of } (X \odot K) = \sum_{i,j} X_{i,j} \cdot K_{i,j}
$$

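The result at each position is one number that tells how strongly the pattern is present at that location; collecting these numbers over all positions gives the feature map. Here $X$ and $K$ are both 3×3, so the filter fits in exactly one position and the result is a single value: $2 - 4 + 3 + 2 = 3$.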
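
The weighted sum for the example $X$ and $K$ above can be reproduced in a few lines of NumPy. This is a minimal sketch using the same two matrices; the variable names `products` and `response` are illustrative.

```python
import numpy as np

X = np.array([[1, 2, 0],
              [0, 1, 3],
              [1, 2, 1]])

K = np.array([[0, 1, 0],
              [1, -4, 1],
              [0, 1, 0]])

products = X * K            # element-wise product, same shape as X
response = products.sum()   # weighted sum: one number for this position

print(products)
print(response)  # 3
```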

## 2. Filter (Kernel)

A filter (also called a kernel) is a small learnable matrix of numbers used to detect a specific visual pattern.

- Its job: detect specific patterns (edges, textures, corners) in images.
- Typical sizes: 3×3, 5×5, 7×7  
- Learns during training  
- The same filter is applied across the entire image  

Example: Edge-Detection Filter

Vertical edge filter (Sobel):

$$
\begin{bmatrix}
-1 & 0 & 1 \\
-2 & 0 & 2 \\
-1 & 0 & 1
\end{bmatrix}
$$

- This filter responds strongly to vertical edges.
- When applied to an image:
    - Strong response → vertical edge exists
    - Weak response → no vertical edge

**Why Filters Matter**
- One filter → detects one type of pattern  
- Multiple filters → detect multiple patterns (edges, corners, textures)
- The network learns the best filter values during training; we don't handcraft them.
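
To see the "strong response" and "weak response" cases with actual numbers, the sketch below applies the Sobel filter to two hand-made 3×3 patches. Both patches are hypothetical examples chosen for illustration.

```python
import numpy as np

sobel_vertical = np.array([[-1, 0, 1],
                           [-2, 0, 2],
                           [-1, 0, 1]])

# Hypothetical patches: a dark-to-bright vertical edge and a uniform region
edge_patch = np.array([[0, 0, 1],
                       [0, 0, 1],
                       [0, 0, 1]])
flat_patch = np.ones((3, 3))

edge_response = np.sum(edge_patch * sobel_vertical)
flat_response = np.sum(flat_patch * sobel_vertical)

print(edge_response)  # 4
print(flat_response)  # 0.0
```

The filter's weights sum to zero, so any uniform region produces no response, while a left-to-right intensity change produces a large one.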



## 3. Feature Map

**Definition:**  
A **feature map** is the output produced after applying a filter across the image.

- Each value in the feature map shows **how strongly the filter matched** at that position.
- Feature maps preserve the **2D structure** of the image.

### Example
- Input image size: 28×28  
- Filter size: 3×3  
- Output feature map size: 26×26  

High values in the feature map = strong presence of the pattern.

Think of it as a **heatmap** of detected features.
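
The 28×28 → 26×26 shape can be confirmed directly. The sketch below is one possible way to do it, assuming NumPy 1.20 or newer for `sliding_window_view`; the random image is only a stand-in for real pixel data.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
image = rng.random((28, 28))  # stand-in for a grayscale image (assumption)
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

# Every 3x3 patch of the image, one per valid filter position
patches = sliding_window_view(image, kernel.shape)

# Weighted sum of each patch with the kernel: one value per position
feature_map = np.einsum("ijmn,mn->ij", patches, kernel)

print(patches.shape)      # (26, 26, 3, 3)
print(feature_map.shape)  # (26, 26)
```

A 3×3 filter can be placed at 26 different row offsets and 26 different column offsets inside a 28×28 image, which is where the output size comes from.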

---

## 4. Spatial Structure

**Definition:**  
**Spatial structure** means the **relative position of pixels** is preserved.

- CNNs keep the two-dimensional (height × width) layout of the image, even when a feature map is slightly smaller than its input.
- Unlike dense layers, CNNs do **not flatten images early**.

### Why This Is Important
- A pattern at the top of an image remains at the top of the feature map.
- CNNs know both **what** the pattern is and **where** it appears.

---

## Putting Everything Together

1. Start with an image.
2. Slide a **filter** over small local regions.
3. Perform **convolution** at each position.
4. Produce a **feature map**.
5. Repeat with multiple filters and layers.

Early layers detect **edges and textures**.  
Deeper layers combine them into **shapes and objects**.

---

## One-Line Intuition

> CNNs learn small pattern detectors (filters) that slide across images, creating feature maps that show what patterns exist and where they appear.



---

## 1. Convolution Operation

**Definition:**
A **convolution** is a mathematical operation where a **small matrix (filter)** slides over an input (like an image) and computes a **weighted sum** at each position.

* Think of it as **scanning a small window across the image** to detect patterns.
* This is different from **full matrix multiplication** (used in dense layers) because we only focus on **small local patches** instead of the whole image at once.

**Example:**

Small 3×3 image patch:

$$
X =
\begin{bmatrix}
1 & 2 & 0 \\
0 & 1 & 3 \\
1 & 2 & 1
\end{bmatrix}
$$

Filter (kernel):

$$
K =
\begin{bmatrix}
0 & 1 & 0 \\
1 & -4 & 1 \\
0 & 1 & 0
\end{bmatrix}
$$

* Slide the filter across the image
* At each position, compute:

$$
\sum_{i,j} X_{i,j} \cdot K_{i,j}
$$

* Result → **feature map** highlighting edges, corners, or textures

---

## 2. Filter / Kernel

**Definition:**
A **filter (or kernel)** is a **small, learnable matrix of numbers** in CNNs.

* Its job: detect **specific patterns** (edges, textures, corners) in images.
* Typical sizes: 3×3, 5×5, 7×7

**Example:**

* Vertical edge filter (Sobel):

$$
\begin{bmatrix}
-1 & 0 & 1 \\
-2 & 0 & 2 \\
-1 & 0 & 1
\end{bmatrix}
$$

* When applied to an image:

  * Strong response → vertical edge exists
  * Weak response → no vertical edge

**Key Point:**
The CNN **learns the best filter values** during training; we don't handcraft them.

---

## 3. Feature Map

**Definition:**
A **feature map** is the output of applying a filter over an image.

* Shows **where a particular pattern occurs** in the image.
* Usually a smaller 2D grid than the original image (depending on padding and stride).

**Example:**

* Input image → 28×28
* Filter → 3×3
* Feature map → 26×26 (each value indicates **how strongly the pattern is present** at that location)

Think of it as a **heatmap** highlighting the pattern the filter detects.

---

## 4. Spatial Structure

**Definition:**
**Spatial structure** means the **2D arrangement of pixels** in an image is preserved.

* Unlike dense layers, which flatten the image and discard pixel positions, CNNs keep the **height × width** layout in their feature maps (possibly at a reduced size).
* This allows the network to understand **where patterns occur**, not just **what patterns exist**.

**Example:**

* Horizontal edge at the top of an image → top of feature map lights up
* Horizontal edge at the bottom → bottom of feature map lights up
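
The two cases above can be reproduced with a small experiment. This is a sketch under stated assumptions: the 8×8 images, the simple horizontal-edge kernel, and the helpers `correlate2d_valid` and `active_rows` are all illustrative and not part of the original notebook.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def active_rows(feature_map):
    """Indices of feature-map rows that contain any non-zero response."""
    return np.flatnonzero(np.abs(feature_map).sum(axis=1))

# Simple horizontal-edge detector (assumed for illustration)
horizontal_kernel = np.array([[ 1,  1,  1],
                              [ 0,  0,  0],
                              [-1, -1, -1]])

top_image = np.zeros((8, 8))
top_image[:2, :] = 1.0       # bright band at the top

bottom_image = np.zeros((8, 8))
bottom_image[-2:, :] = 1.0   # bright band at the bottom

top_rows = active_rows(correlate2d_valid(top_image, horizontal_kernel))
bottom_rows = active_rows(correlate2d_valid(bottom_image, horizontal_kernel))

print(top_rows)     # [0 1]
print(bottom_rows)  # [4 5]
```

The feature map has 6 rows, and the response appears in its first two rows for the top edge and its last two rows for the bottom edge, so the location of the edge is carried through.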

---

## 5. Putting it all together

**Step-by-step intuition:**

1. Start with an image:

```
[ 0 0 0 ]
[ 0 1 0 ]
[ 0 0 0 ]
```

2. Apply a **filter** (like a vertical edge detector) → slide across all positions.

3. Compute **dot products at each position** → produce **feature map**.

4. Feature map shows **where the vertical edges appear**.

* Early layers → simple patterns (edges, corners)
* Middle layers → combine into shapes, textures
* Deep layers → detect objects (faces, cars, digits)

---

✅ **Summary in Beginner Terms:**

| Term              | Simple Definition                                        |
| ----------------- | -------------------------------------------------------- |
| Convolution       | Sliding a small filter over an image to detect patterns  |
| Filter / Kernel   | Small learnable matrix that looks for a specific feature |
| Feature Map       | Output showing where the filter’s pattern occurs         |
| Spatial Structure | Keeping 2D pixel arrangement intact                      |

---



## How CNNs Represent an Image

An image is represented as a tensor.

- Grayscale image:
  $$
  X \in \mathbb{R}^{H \times W}
  $$

- RGB image:
  $$
  X \in \mathbb{R}^{H \times W \times 3}
  $$

CNNs **do not flatten** images at the beginning.  
They keep height, width, and channels intact.

This is a critical design choice.
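
A short sketch of these shapes in NumPy, and of what flattening does to neighbouring pixels. The 28×28 size and the channels-last layout are assumptions carried over from the earlier examples.

```python
import numpy as np

H, W = 28, 28
gray = np.zeros((H, W))      # grayscale: H x W
rgb = np.zeros((H, W, 3))    # RGB: H x W x 3 (channels last)

print(gray.shape, rgb.shape)  # (28, 28) (28, 28, 3)

# Flattening, as a dense layer would require, discards the grid layout
flat = gray.reshape(-1)
print(flat.shape)             # (784,)

# Pixels (0, 0) and (1, 0) are vertical neighbours in the grid,
# but end up W positions apart in the flattened vector
idx_a = np.ravel_multi_index((0, 0), gray.shape)
idx_b = np.ravel_multi_index((1, 0), gray.shape)
print(idx_b - idx_a)          # 28
```

After flattening, nothing in the vector itself says that positions 0 and 28 were adjacent pixels; a CNN avoids this loss by working on the grid directly.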

---

## Convolution Operation (High-Level View)

CNNs use small matrices called **filters** (or kernels), for example:

$$
K =
\begin{bmatrix}
-1 & 0 & 1 \\
-2 & 0 & 2 \\
-1 & 0 & 1
\end{bmatrix}
$$

This filter:
- Slides across the image
- Computes dot products
- Produces a **feature map**

Mathematically:

$$
(X * K)(i, j) = \sum_{m} \sum_{n} X(i+m, j+n) \cdot K(m, n)
$$

Each filter specializes in detecting a specific pattern (edges, textures, etc.).
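
The double sum above can be written out literally in code. One detail worth knowing: the formula slides the kernel without flipping it, which mathematicians call cross-correlation; textbook convolution flips the kernel on both axes first. Deep learning libraries commonly use the unflipped version and still call it convolution. The sketch below illustrates the difference; the function names and the small test image are assumptions made for this example.

```python
import numpy as np

def cross_correlate(X, K):
    """Literal version of the double sum: no kernel flip, no padding, stride 1."""
    kh, kw = K.shape
    out = np.zeros((X.shape[0] - kh + 1, X.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for m in range(kh):
                for n in range(kw):
                    out[i, j] += X[i + m, j + n] * K[m, n]
    return out

def true_convolve(X, K):
    """Textbook convolution: the same sliding sum with the kernel flipped on both axes."""
    return cross_correlate(X, np.flip(K))

K = np.array([[-1, 0, 1],
              [-2, 0, 2],
              [-1, 0, 1]])
X = np.arange(25).reshape(5, 5) % 7  # arbitrary small test image (assumption)

corr = cross_correlate(X, K)
conv = true_convolve(X, K)

# Flipping this particular kernel just negates it, so the two results differ only in sign.
print(np.array_equal(conv, -corr))  # True
```

Because the filter values are learned, the network can learn either orientation, so the distinction rarely matters in practice.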

---

## CNN vs Fully Connected Networks

| Aspect | Fully Connected NN | CNN |
|------|-------------------|-----|
| Input handling | Flattened | Spatially preserved |
| Parameter count | Very large | Much smaller |
| Translation awareness | ❌ | ✅ |
| Scalability to images | Poor | Excellent |
| Vision performance | Weak | Strong |

---

## Intuition from Human Vision

CNNs loosely mimic the human visual system:
- Early layers → detect edges
- Middle layers → detect shapes
- Deeper layers → recognize objects

This hierarchical processing is a **key reason for CNN success**.

---

## Common Applications of CNNs

CNNs are used in:
- Image classification (ResNet, EfficientNet)
- Object detection (YOLO, Faster R-CNN)
- Face recognition
- Medical image analysis
- Autonomous vehicle perception

If your data has **spatial structure**, CNNs are usually the right tool.

---

## A Critical Misconception

❌ *“CNNs understand images like humans.”*

Reality:
- CNNs detect statistical patterns
- They have no semantic understanding
- They can fail badly on unfamiliar data

CNNs are **pattern extractors**, not intelligent observers.

---

## Summary

- CNNs are designed for **spatial data**
- Convolution enables **efficient pattern detection**
- Parameter sharing makes CNNs scalable
- Hierarchical features enable strong visual performance

This explains **why CNNs exist**, not just how they work.


# Why Convolutional Neural Networks (CNNs)?

Traditional neural networks:
- Flatten images → lose spatial structure
- Too many parameters
- Poor scalability for images

CNNs:
- Preserve spatial relationships
- Use local connectivity
- Share parameters (filters)
- Are translation invariant

> CNNs learn patterns, not pixels.


# Convolution

Convolution is a mathematical operation that:
- Slides a small matrix (filter / kernel) over an input
- Computes element-wise multiplication
- Sums the result to produce a feature map

This allows the network to detect:
- Edges
- Corners
- Textures
- Shapes


# Basic Components

- **Input Image**: $H \times W$
- **Filter (Kernel)**: $k \times k$
- **Stride**: how far the filter moves each step
- **Padding**: zeros added around the input

### Output Size Formula

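$$
\text{Output Size} = \left\lfloor \frac{N - K + 2P}{S} \right\rfloor + 1
$$

The floor brackets round down, which only matters when $S$ does not divide $N - K + 2P$ evenly.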

Where:
- $N$ = input size
- $K$ = kernel size
- $P$ = padding
- $S$ = stride
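
The formula is easy to turn into a helper for checking layer sizes. This is a minimal sketch; the function name `conv_output_size` is hypothetical, and integer division is used so that a stride that does not divide evenly rounds down, which is the usual convention.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output length along one dimension for input n, kernel k, padding p, stride s."""
    return (n - k + 2 * p) // s + 1

print(conv_output_size(28, 3))            # 26
print(conv_output_size(5, 3))             # 3
print(conv_output_size(5, 3, p=1))        # 5
print(conv_output_size(5, 3, s=2))        # 2
print(conv_output_size(28, 3, p=1, s=2))  # 14
```

The first line reproduces the 28×28 → 26×26 example, and the second matches the 5×5 image with a 3×3 filter used in the manual example.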


## Manual Convolution Example

### Input Image (5×5)

$$
\begin{bmatrix}
1 & 2 & 3 & 0 & 1 \\
0 & 1 & 2 & 3 & 1 \\
1 & 0 & 1 & 2 & 0 \\
2 & 1 & 0 & 1 & 2 \\
1 & 2 & 1 & 0 & 1
\end{bmatrix}
$$

### Filter (3×3)

$$
\begin{bmatrix}
1 & 0 & -1 \\
1 & 0 & -1 \\
1 & 0 & -1
\end{bmatrix}
$$


# One Convolution Step

Take top-left $3 \times 3$ region:

$$
\begin{bmatrix}
1 & 2 & 3 \\
0 & 1 & 2 \\
1 & 0 & 1
\end{bmatrix}
$$

Multiply element-wise with filter and sum:

$$
(1 \times 1) + (2 \times 0) + (3 \times (-1)) +
(0 \times 1) + (1 \times 0) + (2 \times (-1)) +
(1 \times 1) + (0 \times 0) + (1 \times (-1))
$$

$$
= 1 - 3 - 2 + 1 - 1 = -4
$$

That value becomes **one pixel in the output feature map**.


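```python
import numpy as np

# 5x5 input image from the manual example
image = np.array([
    [1, 2, 3, 0, 1],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 0],
    [2, 1, 0, 1, 2],
    [1, 2, 1, 0, 1]
])

# 3x3 filter from the manual example
kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1]
])

# No padding, stride 1: (5 - 3) / 1 + 1 = 3 rows and 3 columns
output = np.zeros((3, 3))

for i in range(3):
    for j in range(3):
        region = image[i:i+3, j:j+3]             # 3x3 window with top-left corner (i, j)
        output[i, j] = np.sum(region * kernel)   # element-wise product, then sum

output
```

Output:

```
array([[-4., -2.,  4.],
       [ 0., -4.,  0.],
       [ 2.,  0., -1.]])
```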

## What Filters Learn

- Vertical edge detector
- Horizontal edge detector
- Diagonal patterns
- Texture patterns

Early CNN layers:
> Detect simple patterns

Deeper CNN layers:
> Combine patterns into objects


## Stride

Stride controls how much the filter moves each step.

- Stride = 1 → detailed feature map
- Stride = 2 → smaller feature map
- Larger stride → more compression, less detail

Effect:
- Reduces spatial size
- Increases the receptive field of later layers (each output value covers a wider area of the original image)
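
The effect of stride can be seen on the 5×5 image and 3×3 filter from the manual example. This is a sketch: the function `correlate2d_strided` is a hypothetical helper written for this illustration, with no padding.

```python
import numpy as np

def correlate2d_strided(image, kernel, stride=1):
    """Slide `kernel` over `image` without padding, moving `stride` pixels per step."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride  # top-left corner of the current window
            out[i, j] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.array([[1, 2, 3, 0, 1],
                  [0, 1, 2, 3, 1],
                  [1, 0, 1, 2, 0],
                  [2, 1, 0, 1, 2],
                  [1, 2, 1, 0, 1]])
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

out1 = correlate2d_strided(image, kernel, stride=1)
out2 = correlate2d_strided(image, kernel, stride=2)

print(out1.shape)  # (3, 3)
print(out2.shape)  # (2, 2)
print(out2)        # values -4, 4 in the first row and 2, -1 in the second
```

With stride 2 the filter visits only every second position, so the 2×2 result contains exactly the four corner values of the stride-1 feature map.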


## Padding

Padding adds zeros around the image.

### Why Padding?
- Prevents shrinking feature maps too fast
- Preserves border information

Types:
- **Valid**: no padding
- **Same**: enough padding that output size = input size at stride 1 (for a 3×3 filter, one pixel on each side)


```python
# Add a one-pixel border of zeros around the 5x5 image
padded_image = np.pad(image, pad_width=1, mode='constant', constant_values=0)
padded_image
```

Output:

```
array([[0, 0, 0, 0, 0, 0, 0],
       [0, 1, 2, 3, 0, 1, 0],
       [0, 0, 1, 2, 3, 1, 0],
       [0, 1, 0, 1, 2, 0, 0],
       [0, 2, 1, 0, 1, 2, 0],
       [0, 1, 2, 1, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0]])
```
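
To connect padding back to the output size, the sketch below runs the same 3×3 filter over both the original and the zero-padded image. The helper `correlate2d_valid` is a hypothetical function written for this illustration; the image and filter are the ones from the manual example.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (no extra padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1, 2, 3, 0, 1],
                  [0, 1, 2, 3, 1],
                  [1, 0, 1, 2, 0],
                  [2, 1, 0, 1, 2],
                  [1, 2, 1, 0, 1]])
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])

valid = correlate2d_valid(image, kernel)                 # no padding
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)
same = correlate2d_valid(padded, kernel)                 # one pixel of zero padding

print(valid.shape)  # (3, 3)
print(same.shape)   # (5, 5)
print(np.array_equal(same[1:-1, 1:-1], valid))  # True
```

The padded result keeps the 5×5 input size, and its interior is identical to the unpadded result; the extra ring of values comes from windows that overlap the border.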

## CNN vs Fully Connected Layers

| Aspect | Fully Connected | CNN |
|------|----------------|-----|
| Parameters | Very high | Low (shared) |
| Spatial awareness | ❌ | ✅ |
| Translation invariant | ❌ | ✅ |
| Image scalability | Poor | Excellent |


# Key Takeaways from Day 20

- CNNs preserve spatial structure
- Convolution extracts local patterns
- Filters learn features automatically
- Stride controls resolution
- Padding preserves size and borders
- CNNs scale efficiently for images

---

<p style="text-align:center; font-size:18px;">
© 2025 Mostafizur Rahman
</p>

