# Day 39: CNN Multi-Class Classification

Welcome to Day 39!

Today You’ll Learn

1. How CNNs produce multi-class predictions  
2. What logits really represent  
3. Softmax and probability interpretation  
4. Why CrossEntropyLoss includes Softmax internally  
5. Handling class imbalance properly  
6. Top-k accuracy and when to use it  
7. Practical evaluation strategy for real-world CNNs


---

# Multi-Class Classification

## Problem Setup

You are given:

- An image $x$  
- $C$ possible classes  

Goal:

$$
\text{Predict exactly ONE class out of } C
$$

This is called:

**Single-label multi-class classification**

Classes are mutually exclusive.

Example:

- CIFAR-10 → $C = 10$  
- ImageNet → $C = 1000$

## What the CNN Produces

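After feature extraction, a final linear layer maps the features to a vector of $C$ scores:

```python
import torch.nn as nn
classifier = nn.Linear(in_features=512, out_features=10)  # e.g. d = 512 features, C = 10 classes
```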

Mathematically:

If final feature vector is:

$$
h \in \mathbb{R}^{d}
$$

Then:

$$
z = Wh + b
$$

Where:

* $W \in \mathbb{R}^{C \times d}$
* $b \in \mathbb{R}^{C}$

Output:

$$
z = [z_1, z_2, ..., z_C]
$$
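
As a rough illustration of the shapes involved, the sketch below builds a final linear layer and applies it to a batch of feature vectors. The feature size `d = 512`, the batch size of 4, and the random features are assumptions chosen for the example, not values from any particular CNN.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d, C = 512, 10                  # assumed feature size and number of classes
classifier = nn.Linear(d, C)    # holds W (C x d) and b (C)

h = torch.randn(4, d)           # stand-in for a batch of 4 CNN feature vectors
logits = classifier(h)          # z = W h + b for every sample in the batch

print(classifier.weight.shape)  # torch.Size([10, 512])
print(logits.shape)             # torch.Size([4, 10])
```

Each row of `logits` is one vector $z$ with $C$ entries, one score per class.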


## What Are Logits?

Logits are the raw output scores of a neural network before converting them into probabilities.

The values $z_i$ are called:

* Logits
* Raw scores
* Unnormalized predictions

Important:

$$
z_i \in (-\infty, +\infty)
$$

They are **NOT probabilities**.

They are simply the output of the final linear layer:

$$
z = Wh + b
$$

Where:

- $h$ = feature vector  
- $W$ = weight matrix  
- $b$ = bias  

### Key Properties

- Logits are real numbers  
- They are NOT probabilities  
- They do NOT sum to 1  
- They can be negative or positive

Example:

If $C = 3$:

$$
z = [2.3,\ -1.1,\ 0.7]
$$

These are just scores.

## How Prediction Works

We choose:

$$
\hat{y} = \arg\max_i z_i
$$

Meaning:

Pick the class with the largest score.

Example:

$$
[2.3,\ -1.1,\ 0.7]
$$

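Largest = $2.3$, so predict class 1.

Only the ranking matters.

The scale does NOT matter: multiplying every logit by the same positive constant, or adding the same constant to every logit, never changes which one is largest.

Example with much larger scores:

$$
[100,\ 50,\ -20]
$$

Still class 1.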
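
One quick way to check that only the ranking matters is to rescale or shift the logits and confirm the argmax stays put. The sketch below uses plain Python; the `argmax` helper and the constants 10 and 5 are made up for the illustration.

```python
def argmax(scores):
    """Return the 0-based index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

z = [2.3, -1.1, 0.7]
scaled = [10 * v for v in z]    # multiply every logit by a positive constant
shifted = [v + 5 for v in z]    # add the same constant to every logit

# Python indexes from 0, so "class 1" in the text is index 0 here.
print(argmax(z), argmax(scaled), argmax(shifted))  # 0 0 0
print(argmax([100, 50, -20]))                      # 0
```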

## Why They Are Not Probabilities

Probabilities must:

* Be between $0$ and $1$
* Sum to $1$

Logits:

* Can be negative
* Can be large
* Do NOT sum to $1$

They are raw evidence scores.

Probabilities are obtained later using Softmax.



---

### <p style="text-align:center; color:orange; font-size:18px;">Optional: Explain $\hat{y} = \arg\max_i z_i$</p>

Let’s decode it slowly.


**Step 1. What Is $z_i$?**

The model outputs:

$$
z = [z_1, z_2, ..., z_C]
$$

Each $z_i$ is a score for class $i$.

Example (3 classes):

$$
z = [2.3,\ -1.1,\ 0.7]
$$

**Step 2. What Does “max” Mean?**

The maximum value here is:

$$
2.3
$$

That’s the largest score.

**Step 3. What Does “argmax” Mean?**

Important:

- **max** → gives the value  
- **argmax** → gives the index (position)

Example:

$$
z = [2.3,\ -1.1,\ 0.7]
$$

- max = 2.3  
- argmax = 1  

(assuming indexing starts from 1)

Because 2.3 is at position 1.

**Step 4. What Is $\hat{y}$?**

$\hat{y}$ means:

> Predicted label

So:

$$
\hat{y} = \arg\max_i z_i
$$

means:

> The predicted class is the index of the largest score.


**Super Simple Version**

The model outputs scores:

Class 1 → 2.3<br>
Class 2 → -1.1<br>
Class 3 → 0.7

Biggest score = 2.3  
So prediction = Class 1.

**One-Line Meaning**

$$
\hat{y} = \arg\max_i z_i
$$

means:

> Pick the class with the highest score.

---

# Softmax

Softmax is a normalization function that converts logits into a probability distribution across classes.

Given logits:

$$
z = [z_1, z_2, ..., z_C]
$$

Softmax computes:

$$
\text{Softmax}(z_i) =
\frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}
$$

for each class $i$.

Softmax transforms:

$$
(-\infty, +\infty)
$$

into:

$$
(0,1)
$$

while ensuring:

$$
\sum_{i=1}^{C} P(y=i) = 1
$$

So the outputs become valid probabilities.

**Key Properties**

1. Outputs are between 0 and 1  
2. Outputs sum to 1  
3. Preserves ranking (largest logit → largest probability)  
4. Depends only on the differences between logits (adding the same constant to every logit leaves the output unchanged)

## Manual Softmax Example

Suppose the model outputs:

$$
z = [2.0,\ 1.0,\ 0.1]
$$

These are raw scores.

Compute:

$$
e^{z_i}
$$

Using approximate values:

$$
e^{2.0} \approx 7.39
$$

$$
e^{1.0} \approx 2.72
$$

$$
e^{0.1} \approx 1.11
$$

Now we have:

$$
[7.39,\ 2.72,\ 1.11]
$$

Compute the Sum:

$$
7.39 + 2.72 + 1.11 = 11.22
$$


Divide Each by the Sum

Softmax formula:

$$
P(y=i) = \frac{e^{z_i}}{\sum_j e^{z_j}}
$$

Now compute probabilities:

Class 1:

$$
\frac{7.39}{11.22} \approx 0.66
$$

Class 2:

$$
\frac{2.72}{11.22} \approx 0.24
$$

Class 3:

$$
\frac{1.11}{11.22} \approx 0.10
$$


Final Softmax Output:

$$
[0.66,\ 0.24,\ 0.10]
$$


**Check**

Do they sum to 1?

$$
0.66 + 0.24 + 0.10 = 1.00
$$

Yes.

**Interpretation**

The model assigns:

- 66% probability to class 1  
- 24% to class 2  
- 10% to class 3  

Prediction = class 1 (highest probability).
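
The hand calculation above can be reproduced in a few lines of plain Python. This is a minimal sketch: the `softmax` helper is written for this example, and it subtracts the largest logit before exponentiating, a common stability trick that does not change the result.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]   # same ratios as exp(z) / sum(exp(z))
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 2) for p in probs])  # [0.66, 0.24, 0.1]
print(round(sum(probs), 6))          # 1.0
```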

## Why Exponential?

The exponential function:

- Makes all outputs positive  
- Magnifies larger logits more strongly  

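The ratio between two class probabilities is $e^{z_i - z_j}$, so the gap between logits matters exponentially: a logit that is larger by 1 gets about 2.7× the probability, and one that is larger by 3 gets about 20×.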

## Critical Rule (PyTorch)

### Do NOT apply Softmax before CrossEntropyLoss

```python
criterion = torch.nn.CrossEntropyLoss()
```

Why?

Because `CrossEntropyLoss` already computes:

$$
\text{LogSoftmax} + \text{Negative Log Likelihood}
$$

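Internally, the LogSoftmax step computes:

$$
\log\left(\frac{e^{z_i}}{\sum_j e^{z_j}}\right)
$$

and the Negative Log Likelihood step then picks the entry for the true class and negates it.

If you apply Softmax manually first:

* Softmax is effectively applied twice (once by you, once inside the loss)
* The loss sees values squeezed into $(0, 1)$ instead of raw logits, so it can never get close to zero
* Gradients with respect to the logits are damped, which can slow or stall training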

### Correct Usage

Model should output **logits**, not probabilities.

Pass logits directly to:

```python
loss = criterion(logits, labels)
```
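
To see the rule in action, the sketch below compares three computations on one made-up sample: the loss on raw logits, a manual negative log-softmax, and the mistaken version with Softmax applied first. The `logits` and `labels` values are chosen only for illustration.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.1]])   # one sample, three classes
labels = torch.tensor([0])                 # true class index (0-based)

criterion = torch.nn.CrossEntropyLoss()

loss = criterion(logits, labels)                       # correct: raw logits go in
manual = -F.log_softmax(logits, dim=1)[0, labels[0]]   # -log softmax at the true class
wrong = criterion(F.softmax(logits, dim=1), labels)    # mistake: softmax applied twice

print(round(loss.item(), 3), round(manual.item(), 3))  # both about 0.417
print(round(wrong.item(), 3))                          # a different, larger value
```

The first two numbers agree, which is the sense in which Softmax already lives inside the loss. The third is what training would optimize if probabilities were passed in by mistake.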


# CrossEntropy Loss

CrossEntropy Loss is the negative log of the probability assigned to the true class: small when the model is confident and correct, huge when it is confident and wrong.

For the true class $y$:

$$
L = -\log(P(y))
$$

It only cares about the probability assigned to the correct class.

## What This Means

If the model assigns:

- High probability → small loss  
- Low probability → large loss  

Because:

- $\log(1) = 0$
- $\log(\text{small number})$ is very negative  
- Negative sign makes it large positive loss  

## Manual Example 1. Moderate Confidence

Suppose true class = Class 1.

Model prediction:

$$
P = [0.51,\ 0.30,\ 0.19]
$$

Correct class probability:

$$
P(y) = 0.51
$$

Compute loss:

$$
L = -\log(0.51)
$$

Using natural log:

$$
\log(0.51) \approx -0.673
$$

So:

$$
L \approx 0.673
$$

## Manual Example 2. High Confidence

Now model predicts:

$$
P = [0.95,\ 0.03,\ 0.02]
$$

Correct class probability:

$$
P(y) = 0.95
$$

Compute:

$$
L = -\log(0.95)
$$

$$
\log(0.95) \approx -0.051
$$

So:

$$
L \approx 0.051
$$

Much smaller loss.


## Manual Example 3. Confident but Wrong

True class = Class 1.

Model predicts:

$$
P = [0.01,\ 0.97,\ 0.02]
$$

Correct class probability:

$$
P(y) = 0.01
$$

Compute:

$$
L = -\log(0.01)
$$

$$
\log(0.01) \approx -4.605
$$

So:

$$
L \approx 4.605
$$

Huge loss.
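
The three manual examples can be checked with a short plain-Python sketch. The `cross_entropy` helper is written for this illustration and takes probabilities, not logits.

```python
import math

def cross_entropy(probs, true_index):
    """Negative natural log of the probability assigned to the true class."""
    return -math.log(probs[true_index])

examples = {
    "moderate confidence": [0.51, 0.30, 0.19],
    "high confidence": [0.95, 0.03, 0.02],
    "confident but wrong": [0.01, 0.97, 0.02],
}

# The true class is Class 1 in every example, which is index 0 in Python.
losses = {name: cross_entropy(p, 0) for name, p in examples.items()}
for name, value in losses.items():
    print(f"{name}: {value:.3f}")
# moderate confidence: 0.673
# high confidence: 0.051
# confident but wrong: 4.605
```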

---

### <p style="text-align:center; color:orange; font-size:18px;"> Clear Your Confusion (Optional) </p>

#### Prediction Step (Argmax)

When predicting, we pick the largest probability:

$$
P = [0.01,\ 0.97,\ 0.02]
$$

Largest value = 0.97, so the model predicts Class 2.

That’s **prediction**.


#### Loss Step (Training)

Loss does NOT care what the model predicted.

Loss cares about:

> What probability did the model assign to the TRUE class?

In this example:

True class = Class 1.

So the correct class index is 1.

Therefore:

$$
P(y) = P(\text{true class}) = 0.01
$$

Not 0.97.

Because 0.97 belongs to Class 2.


#### Why We Don’t Use 0.97

Because 0.97 is the probability of the **wrong class**.

The loss must measure:

> How wrong was the model about the correct answer?

The model was extremely confident in Class 2 (0.97).

But Class 2 is wrong.

So the model gave only 0.01 probability to the true class.

That is terrible.

So loss becomes large:

$$
L = -\log(0.01) \approx 4.605
$$

Huge penalty.


#### The Core Rule

Prediction uses:

$$
\hat{y} = \arg\max_i P_i
$$

Loss uses:

$$
L = -\log(P_{\text{true class}})
$$

Two completely different things.

---


## Why CrossEntropy Works So Well

It:

- Penalizes confident wrong predictions heavily  
- Rewards confident correct predictions  
- Pushes probability mass toward true class  

It doesn’t just optimize accuracy.

It optimizes **confidence quality**.

## Key Insight

Accuracy treats:

- a correct prediction at 0.51 and a correct prediction at 0.95 the same  

CrossEntropy does not.

It forces the model to:

> Be correct and be confident.
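
A small sketch of this difference, using two hypothetical models that are both right on every sample but differ in confidence. The helper functions and the probability rows are made up for the illustration.

```python
import math

def accuracy(prob_rows, labels):
    """Fraction of rows whose highest-probability class is the true class."""
    hits = sum(row.index(max(row)) == y for row, y in zip(prob_rows, labels))
    return hits / len(labels)

def mean_cross_entropy(prob_rows, labels):
    """Average of -log P(true class) over all rows."""
    return sum(-math.log(row[y]) for row, y in zip(prob_rows, labels)) / len(labels)

labels = [0, 0]                                       # both samples belong to class index 0
hesitant = [[0.51, 0.30, 0.19], [0.51, 0.30, 0.19]]   # correct, but barely
confident = [[0.95, 0.03, 0.02], [0.95, 0.03, 0.02]]  # correct and sure

print(accuracy(hesitant, labels), accuracy(confident, labels))  # 1.0 1.0
print(round(mean_cross_entropy(hesitant, labels), 3))           # 0.673
print(round(mean_cross_entropy(confident, labels), 3))          # 0.051
```

Accuracy cannot tell the two models apart, while the loss clearly can.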

# Where Does Softmax Fit?

We’ll go through the training flow mathematically, step by step.

**Step 0. Model Output (Logits)**

The model outputs raw scores:

$$
z = [2.0,\ 1.0,\ 0.1]
$$

These are logits.

They are just real numbers:

$$
z_i \in (-\infty, +\infty)
$$

No probabilities yet.


**Step 1. Apply Softmax (Convert to Probabilities)**

Softmax formula:

$$
P(y=i) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}
$$

Compute exponentials

$$
e^{2.0} \approx 7.39
$$

$$
e^{1.0} \approx 2.72
$$

$$
e^{0.1} \approx 1.11
$$

Compute denominator

$$
7.39 + 2.72 + 1.11 = 11.22
$$

Compute probabilities

$$
P_1 = \frac{7.39}{11.22} \approx 0.66
$$

$$
P_2 = \frac{2.72}{11.22} \approx 0.24
$$

$$
P_3 = \frac{1.11}{11.22} \approx 0.10
$$

Now we have:

$$
P = [0.66,\ 0.24,\ 0.10]
$$

This is where **Softmax happens**.


**Step 2. Compute CrossEntropy Loss**

Definition:

$$
L = -\log P(y)
$$

Assume true class is **Class 1**.

So we use:

$$
P(y) = 0.66
$$

Compute:

$$
L = -\log(0.66)
$$

$$
\log(0.66) \approx -0.415
$$

So:

$$
L \approx 0.415
$$

Small loss → good prediction.


**Step 3. Confident but Wrong Case**

Suppose logits are:

$$
z = [0.1,\ 4.0,\ 0.2]
$$

Apply Softmax

Exponentials:

$$
e^{0.1} \approx 1.11
$$

$$
e^{4.0} \approx 54.6
$$

$$
e^{0.2} \approx 1.22
$$

Sum:

$$
1.11 + 54.6 + 1.22 \approx 56.93
$$

Probabilities:

$$
P \approx [0.02,\ 0.96,\ 0.02]
$$

True class is still Class 1.

So:

$$
L = -\log(0.02)
$$

$$
\log(0.02) \approx -3.91
$$

$$
L \approx 3.91
$$

Huge loss.

Because the model gave very low probability to the true class.


### Where Softmax Happens in PyTorch

When you write:

```python
criterion = torch.nn.CrossEntropyLoss()
loss = criterion(logits, target)
```

Internally PyTorch computes:

$$
-\left(
z_y - \log \sum_j e^{z_j}
\right)
$$

This expression is mathematically equivalent to:

$$
-\log \left(
\frac{e^{z_y}}{\sum_j e^{z_j}}
\right)
$$

Which is:

$$
-\log(\text{Softmax}(z_y))
$$

So during training, Softmax is applied internally.

You do NOT call it manually.
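
The log-sum-exp form is also what makes the computation numerically safe. The sketch below is a plain-Python illustration of that idea with helper names invented for the example; it does not reproduce PyTorch's actual kernels. It evaluates $-(z_y - \log \sum_j e^{z_j})$ directly from logits, using the usual max-subtraction trick.

```python
import math

def log_sum_exp(logits):
    """Stable log(sum(exp(z))) using the max-subtraction trick."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def cross_entropy_from_logits(logits, true_index):
    """-(z_y - log sum_j exp(z_j)), computed without forming probabilities."""
    return -(logits[true_index] - log_sum_exp(logits))

small = cross_entropy_from_logits([2.0, 1.0, 0.1], 0)
large = cross_entropy_from_logits([1000.0, 999.0, 998.1], 0)

print(round(small, 3))  # 0.417
print(round(large, 3))  # 0.417, even though exp(1000) alone would overflow
```

The result 0.417 differs slightly from the hand-computed 0.415 because the manual walkthrough rounds the probability 0.659 up to 0.66 before taking the log.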


### Full Pipeline Summary

**During Training**

$$
\text{Logits} \rightarrow \text{CrossEntropyLoss}
$$

Softmax is inside the loss.

**During Inference**

If you want probabilities:

$$
\text{Logits} \rightarrow \text{Softmax} \rightarrow \text{Probabilities}
$$
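
A minimal inference sketch, assuming a tiny stand-in model and random tensors in place of real images. The model architecture and input shape are placeholders, not a recommended CNN.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in model: flatten a 3x8x8 "image" and map it to 10 logits.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
images = torch.randn(4, 3, 8, 8)          # batch of 4 fake images

model.eval()                              # inference mode for layers like dropout/batchnorm
with torch.no_grad():                     # no gradients needed for prediction
    logits = model(images)
    probs = torch.softmax(logits, dim=1)  # softmax over the class dimension
    preds = probs.argmax(dim=1)           # predicted class index per image

print(probs.sum(dim=1))   # each row sums to 1 (up to rounding)
print(preds.shape)        # torch.Size([4])
```

Since Softmax preserves ranking, `logits.argmax(dim=1)` gives the same predictions; Softmax is only needed when the probabilities themselves are of interest.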


# Class Imbalance

Suppose dataset distribution:

- Class A → 90%  
- Class B → 5%  
- Class C → 5%  

A lazy model can learn:

$$
\hat{y} = \text{Class A (always)}
$$

Accuracy:

$$
90\%
$$

But:

- Recall for B = 0  
- Recall for C = 0  

The model looks good on accuracy but is practically useless.
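
The numbers above are easy to reproduce. This sketch builds a toy label list with the 90/5/5 split and scores a model that always predicts Class A. The `recall` helper and the macro-averaged recall at the end are additions for illustration.

```python
labels = ["A"] * 90 + ["B"] * 5 + ["C"] * 5   # 90% / 5% / 5% split
preds = ["A"] * len(labels)                    # lazy model: always predict A

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

def recall(cls):
    """Fraction of true `cls` samples that were predicted as `cls`."""
    hits = sum(p == cls for p, y in zip(preds, labels) if y == cls)
    return hits / labels.count(cls)

per_class = {c: recall(c) for c in "ABC"}
macro_recall = sum(per_class.values()) / len(per_class)

print(accuracy)                # 0.9
print(per_class)               # {'A': 1.0, 'B': 0.0, 'C': 0.0}
print(round(macro_recall, 3))  # 0.333
```

Averaging recall over classes exposes the problem that plain accuracy hides.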

## Why This Happens

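CrossEntropy minimizes the **average loss** over all $N$ training samples:

$$
L = -\frac{1}{N} \sum_{n=1}^{N} \log P(y_n)
$$

where $y_n$ is the true class of sample $n$.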

If 90% of samples are Class A:

- Gradients are dominated by Class A  
- The optimizer mostly improves Class A  
- Minority classes barely influence updates  

The optimization objective is skewed.

## Solution 1: Weighted Loss

Modify loss:

$$
L = -w_y \log P(y)
$$

Where:

- $w_y$ = weight of the true class  

Rare classes get larger weights.

Example:

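```python
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
class_weights = torch.tensor([0.2, 0.4, 0.4], device=device)  # illustrative: rare classes get larger weights
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)
```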

Effect:

* Errors on minority classes produce larger gradients
* Optimizer is forced to care about them


**Important Insight**

Weights should often be:

$$
w_i \propto \frac{1}{\text{frequency}_i}
$$

or a normalized version of inverse frequency.

If the weights are much larger than this, or left unnormalized under extreme imbalance, you risk:

* Overcompensating
* Training instability

---

### <p style="text-align:center; color:orange; font-size:18px;"> Inverse Frequency (Optional) </p>

We want each class to contribute roughly equally to training.

Suppose dataset:

* Class A → 90%
* Class B → 5%
* Class C → 5%

If we do nothing:

Expected gradient contribution:

$$
0.9 \nabla L_A + 0.05 \nabla L_B + 0.05 \nabla L_C
$$

Class A dominates updates.

So we increase weights for rare classes.


**Why Inverse Frequency?**

If a class appears less often, we want its total influence over time to match others.

Let frequency of class $i$ be:

$$
f_i
$$

If we choose:

$$
w_i = \frac{1}{f_i}
$$

Then expected contribution becomes:

$$
f_i \cdot \frac{1}{f_i} \cdot \nabla L_i
= \nabla L_i
$$

Meaning:

Each class contributes equally in expectation.

That’s the mathematical reason.


**Concrete Numbers**

Dataset:

* A → 0.90
* B → 0.05
* C → 0.05

Inverse frequency weights:

$$
w_A = \frac{1}{0.90} \approx 1.11
$$

$$
w_B = \frac{1}{0.05} = 20
$$

$$
w_C = 20
$$

Now:

Expected contribution:

$$
0.90 \times 1.11 \approx 1
$$

$$
0.05 \times 20 = 1
$$

Balanced.


**Why Not Use Huge Weights Directly?**

Notice:

Rare classes got weight 20.

That’s large.

If imbalance is extreme:

Example:

* A → 99%
* B → 1%

Then:

$$
w_B = 100
$$

Now gradients for B are 100× larger.

This can cause:

* Exploding gradients
* Training instability
* Overfitting minority samples

That’s what “overcompensating” means.


**Why Normalize?**

Instead of raw inverse frequency:

We often normalize:

$$
w_i = \frac{1/f_i}{\sum_j (1/f_j)}
$$

or scale so that:

$$
\sum_i w_i = C
$$

This keeps gradient magnitudes reasonable.

You preserve balance without blowing up updates.
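
A minimal sketch of the recipe in plain Python, assuming a toy label list with the 90/5/5 split. It computes raw inverse-frequency weights and then rescales them so they sum to $C$; the variable names are chosen for the example.

```python
from collections import Counter

labels = ["A"] * 90 + ["B"] * 5 + ["C"] * 5
counts = Counter(labels)
classes = sorted(counts)
n, C = len(labels), len(counts)

freq = {c: counts[c] / n for c in classes}      # f_i
raw = {c: 1 / freq[c] for c in classes}         # w_i = 1 / f_i
scale = C / sum(raw.values())
weights = {c: raw[c] * scale for c in classes}  # rescaled so the weights sum to C

print({c: round(w, 3) for c, w in weights.items()})  # {'A': 0.081, 'B': 1.459, 'C': 1.459}
print({c: round(freq[c] * weights[c], 4) for c in classes})  # same value for every class
```

Rescaling changes the overall size of the weights but keeps $f_i \cdot w_i$ identical across classes, which is the balance condition.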


**Core Insight**

You want:

$$
f_i \cdot w_i \approx \text{constant}
$$

That’s the real objective.

Inverse frequency achieves that.

But scaling must be controlled.

---


## Solution 2: Weighted Sampling

Instead of modifying loss, modify sampling.

Use:

```python
from torch.utils.data import WeightedRandomSampler
```

Idea:

* Oversample minority classes
* Create balanced mini-batches

Now each batch contains roughly equal class representation.

Loss stays unchanged.
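
A sketch of how this might look in practice, using a toy dataset with 900/50/50 samples per class. The random features, `num_samples=3000`, and the batch size are assumptions for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Toy imbalanced dataset: 900 / 50 / 50 samples for classes 0 / 1 / 2.
labels = torch.tensor([0] * 900 + [1] * 50 + [2] * 50)
features = torch.randn(len(labels), 8)                 # placeholder features
dataset = TensorDataset(features, labels)

class_counts = torch.bincount(labels)                  # tensor([900, 50, 50])
sample_weights = 1.0 / class_counts[labels].float()    # one weight per sample

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=3000,     # indices drawn per epoch (assumed value)
    replacement=True,     # lets rare samples be drawn repeatedly
)
loader = DataLoader(dataset, batch_size=60, sampler=sampler)  # do not also set shuffle=True

seen = torch.cat([y for _, y in loader])
fractions = torch.bincount(seen, minlength=3) / len(seen)
print(fractions)          # roughly one third per class
```

The weights are per sample, not per class: every sample gets the inverse of its class count, so each class is drawn with about the same total probability.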

**Strategic Difference**

Weighted Loss:

* Adjusts gradient magnitude

Weighted Sampling:

* Adjusts data distribution seen by optimizer

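Sampling often produces more stable training, because every batch contains minority samples instead of an occasional one carrying a very large weight.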

## Solution 3: Targeted Augmentation

Augment minority classes more aggressively.

Example:

* More rotations
* Stronger color jitter
* Random erasing

This increases their effective data size.

Instead of staying at 5% of the raw samples, the minority classes take up a larger share of what the model sees during training.


## Which One Is Better?

It depends.

* Mild imbalance → Weighted loss may be enough
* Severe imbalance → Weighted sampling often works better
* Small minority dataset → Combine sampling + augmentation

In high-risk domains (medical, fraud detection):

You often combine all three.

## Final Mental Model

Imbalanced data → biased gradients → biased model.

Fix by:

1. Reweighting loss
2. Resampling data
3. Expanding minority manifold via augmentation


# Top-k Accuracy

<p style="text-align:center; color:red; font-size:18px;">To be continued...</p>

---

<p style="text-align:center; color:skyblue; font-size:18px;">
© 2026 Mostafizur Rahman
</p>
