In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

| **Line of Code**                               | **What It Does**                                                                                                  | **Simple Analogy**                                                                                |
| ---------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| `import torch`                                 | Brings in **PyTorch**, a library for tensors and deep learning.                                                   | Like opening a **big toolbox** for machine learning.                                              |
| `import torch.nn as nn`                        | Imports the **neural network building blocks** (layers, activations, etc.) and gives them the nickname `nn`.      | Like taking out a box of **Lego blocks** to build neural networks.                                |
| `import torch.optim as optim`                  | Imports **optimizers**, which help the model learn by adjusting weights step by step. Nickname = `optim`.         | Like having a **guide** who tells you which way to walk when climbing a hill.                     |
| `from torchvision import datasets, transforms` | `datasets`: ready-to-use image collections (like MNIST). `transforms`: tools to resize, convert, or clean images. | `datasets` = a **library of books**, `transforms` = the **photocopier/scanner** to prepare pages. |
| `from torch.utils.data import DataLoader`      | Splits datasets into **mini-batches**, optionally shuffles them, and feeds them to the network.                   | Like slicing a **big pizza** into smaller pieces so it’s easier to eat.                           |


In [2]:
# Load the datasets
train_dataset = datasets.MNIST(
    root = "data",
    train = True,
    transform = transforms.ToTensor(),
    download = True
)

test_dataset = datasets.MNIST(
    root = "data",
    train = False,
    transform = transforms.ToTensor(),
    download = True
)

| **Line of Code**                      | **What It Does**                                                                                | **Simple Analogy**                                                                              |
| ------------------------------------- | ----------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
| `# Load the datasets` | A **comment** (not code). It’s just a note to remind us that the next lines will load datasets. | Like putting a **sticky note** on a page saying “Here’s where we load the data.” |
| `train_dataset = datasets.MNIST(...)` | Loads the **MNIST training set** (60,000 images of handwritten digits 0–9). | Like getting a **practice workbook** full of math problems to train on. |
| `root = "data"` | Tells PyTorch to **store/download** the dataset inside a folder called `"data"`. | Like choosing a **folder on your computer** where you’ll keep your homework. |
| `train = True` | Says we want the **training portion** of MNIST (used for learning). | Like using the **practice problems** section of a textbook. |
| `transform = transforms.ToTensor()` | Converts each image into a **tensor** with pixel values scaled to the range 0 to 1 (so PyTorch can work with it as numbers). | Like scanning a **paper photo** and turning it into a **digital image** your computer can read. |
| `download = True` | If MNIST isn’t already in `"data"`, it will **download it from the internet**. | Like saying “If I don’t have this workbook, go buy it online.” |
| `test_dataset = datasets.MNIST(...)`  | Loads the **MNIST testing set** (10,000 images of handwritten digits).                          | Like having a **final exam paper** to check if you really learned from practice.                |
| `train = False`                       | Means this time we want the **test portion** of MNIST (used for evaluation).                    | Like opening the **exam questions section** of a book, not practice problems.                   |
| *(Other arguments same as above)*     | Root folder, transform, and download work the same way as before.                               | Same as above.                                                                                  |


In [3]:
# Wrap the datasets in dataloader for batching
train_loader = DataLoader(dataset = train_dataset, batch_size = 64, shuffle = True)
test_loader = DataLoader(dataset = test_dataset, batch_size = 64, shuffle = False)

| **Line of Code**                                 | **What It Does**                                                                                                | **Simple Analogy**                                                                                       |
| ------------------------------------------------ | --------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------- |
| `# Wrap the datasets in dataloader for batching` | A **comment** to remind us the next lines are about batching data. | Like writing a note: “Now we’ll cut the pizza into slices.” |
| `train_loader = DataLoader(...)` | Creates a **data loader** for the training set. It will feed the model small batches of images during training. | Like a **waiter** bringing food to the table plate by plate instead of dumping the whole buffet at once. |
| `dataset = train_dataset` | Tells it to use the **training dataset** we created earlier. | Like saying “Serve food from the practice workbook.” |
| `batch_size = 64` | Each batch will contain **64 images** at a time (the last batch may be smaller). | Like giving the student **64 practice problems** in one go instead of the whole book. |
| `shuffle = True` | Reshuffles the training data at the start of every epoch, before it is split into batches. | Like shuffling a deck of cards so the student doesn’t memorize the **order** of questions. |
| `test_loader = DataLoader(...)` | Creates a **data loader** for the test set. | Like preparing the **exam questions** to be served in chunks. |
| `dataset = test_dataset` | Uses the **testing dataset**. | Like saying “Serve food from the exam paper.” |
| `batch_size = 64` | Again, 64 test images at a time. | Like checking the student’s answers on **64 exam questions** at once. |
| `shuffle = False` | Doesn’t shuffle test data (keeps it in the same order). | Like saying “The exam paper should stay in its original order, no shuffling.” |
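
To make the batching concrete without downloading MNIST, here is a minimal sketch that wraps a small synthetic dataset in a `DataLoader`. The names `fake_images`, `fake_labels`, and `fake_dataset` are made up for illustration, and the 200 blank images are an assumption standing in for real data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST (assumption): 200 blank "images" with dummy labels.
fake_images = torch.zeros(200, 1, 28, 28)
fake_labels = torch.zeros(200, dtype=torch.long)
fake_dataset = TensorDataset(fake_images, fake_labels)

loader = DataLoader(dataset=fake_dataset, batch_size=64, shuffle=True)

batch_sizes = []
for images, labels in loader:
    # images has shape [batch, 1, 28, 28]; labels has shape [batch]
    batch_sizes.append(images.shape[0])

print(len(loader))   # number of batches per epoch: 4
print(batch_sizes)   # [64, 64, 64, 8], the last batch holds the leftovers
```

By the same arithmetic, the 60,000 MNIST training images at `batch_size = 64` give 938 batches per epoch, with 32 images in the last one.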



In [4]:
# Define a simple neural network
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28*28, 128)  #nn.Linear(input_size, output_size)
        self.fc2 = nn.Linear(128,10)
    def forward(self,x):
        x = x.view(-1, 28*28)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

| **Line of Code**                   | **What It Does**                                                                                                              | **Simple Analogy**                                                                                |
| ---------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| `# Define a simple neural network` | A **comment** to explain the next code defines the model. | Like writing a label on a box: “This is where we design the brain.” |
| `class SimpleNN(nn.Module):` | Creates a new class called `SimpleNN` which is a type of **neural network model** (`nn.Module`). | Like saying “I’m building my own robot by extending the blueprint from a robot factory.” |
| `def __init__(self):` | The **constructor function**. Runs once when you create the model. | Like setting up the robot’s body parts when it’s first built. |
| `super().__init__()` | Calls the parent class (`nn.Module`) to set things up properly. | Like telling the factory: “Finish the standard setup before I add custom parts.” |
| `self.fc1 = nn.Linear(28*28, 128)` | First layer: takes an input of size **28×28 pixels = 784 numbers** and maps them to **128 hidden units**. | Like converting a **large raw photo** into a **smaller summary** the robot can understand. |
| `self.fc2 = nn.Linear(128, 10)` | Second layer: takes the 128 hidden values and maps them to **10 outputs** (digits 0–9). | Like the robot preparing one score slot for each of the **10 possible answers** (0–9). |
| `def forward(self, x):` | Defines **how data flows** through the network. | Like saying “When you give my robot some input, here’s the step-by-step process it follows.” |
| `x = x.view(-1, 28*28)` | Flattens each image (28×28 pixels) into a **1D vector of 784 numbers**. `-1` means “figure out the batch size automatically.” | Like taking a **folded paper** (image) and laying it **flat** so the robot can read it in a line. |
| `x = torch.relu(self.fc1(x))` | Passes the input through the first layer (`fc1`), then applies **ReLU activation** (keeps positives, turns negatives into 0). | Like filtering signals: the robot ignores “negative” signals and only keeps useful ones. |
| `x = self.fc2(x)` | Feeds the result into the second layer (`fc2`) to get **10 output scores**. | Like the robot writing down a raw score for each of the 10 digits. |
| `return x` | Returns the output predictions (called **logits**). | Like the robot handing over its score sheet: the digit with the highest score (say, 7) is its answer, but the scores are not percentages yet. |


| Parameter in `nn.Linear(input_size, output_size)` | Meaning                                                                                  | Example in our case                                                       |
| ------------------------------------------------- | ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
| `input_size`                                      | How many numbers come **into** the layer (like the number of doors into a room).         | `28*28 = 784` → Each MNIST image has 784 pixels (each pixel = one input). |
| `output_size`                                     | How many numbers come **out** of the layer (like the number of windows out of the room). | `128` → The layer produces 128 features (compressed useful signals).      |


| Parameter in `nn.Linear(128, 10)` | Meaning                                                      | Example in our case                               |
| --------------------------------- | ------------------------------------------------------------ | ------------------------------------------------- |
| `128`                             | Number of inputs to this layer (features coming from `fc1`). | 128 features summarizing the image.               |
| `10`                              | Number of outputs from this layer.                           | 10 classes (digits 0–9).                          |
| Output                            | What the layer produces.                                     | 10 raw values (logits), one score for each digit. |


**So the flow is:
Pixels (784) → Features (128) → Digit Scores (10).**
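
The layer sizes also determine how many numbers the network has to learn. As a rough sketch in plain Python (the helper `linear_param_count` is a made-up name, and it assumes each linear layer keeps its default bias term), the count works out like this:

```python
def linear_param_count(input_size, output_size):
    # One weight per (input, output) pair, plus one bias per output.
    return input_size * output_size + output_size

fc1_params = linear_param_count(28 * 28, 128)   # 784 -> 128
fc2_params = linear_param_count(128, 10)        # 128 -> 10
total_params = fc1_params + fc2_params

print(fc1_params, fc2_params, total_params)  # 100480 1290 101770
```

Almost all of the parameters sit in the first layer, because it is the one connected to all 784 pixels.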

### 🔎 What is a "Digit Score"?

- When the network reaches the last layer (`fc2`), it produces **10 numbers**.  
- Each number corresponds to one digit class:  
  - First number → score for digit **0**  
  - Second number → score for digit **1**  
  - … and so on until digit **9**  

These numbers are called **logits** in deep learning.  
They are **not yet probabilities**: they can be positive, negative, or any value.  

---

### ⚡ Analogy

Imagine you’re judging a handwriting contest with **10 possible winners (digits 0–9)**.  

- For each digit, the judge (network) gives a **score** (like a raw rating).  

**Example (logits):**

- Digit 0 → -2.1  → The network thinks it doesn’t look like a 0.
- Digit 1 → 0.5   → It looks a little like a 1.
- Digit 7 → 3.7   → It looks strongly like a 7.
- Digit 9 → 1.2   → It looks somewhat like a 9.

Even though these aren’t probabilities, the **highest score wins** → here, digit **7**.  

---

### 🧠 Big Picture

- **Digit scores = raw outputs (logits) from the final layer.**  
- They represent how strongly the network thinks the input image belongs to each digit.  
- The **highest score’s digit is the prediction**.  

Later, we usually apply a **softmax function** to turn these raw scores into **probabilities** (e.g., “digit 7 = 85% likely”).  
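
Picking the winner and converting scores to probabilities can both be sketched in a few lines of plain Python. The logits below reuse the four scores from the analogy and fill in the other six digits with made-up values, so treat them as hypothetical; `softmax` is a hand-written helper, not a library call.

```python
import math

def softmax(logits):
    # Subtract the max first for numerical stability; the result is unchanged.
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for digits 0..9 (values for 2-6 and 8 are invented).
logits = [-2.1, 0.5, 0.0, -1.0, 0.3, -0.4, 0.1, 3.7, -0.7, 1.2]

predicted_digit = max(range(len(logits)), key=lambda i: logits[i])
probs = softmax(logits)

print(predicted_digit)           # 7
print(probs.index(max(probs)))   # 7, softmax never changes which score is largest
```

Because softmax preserves the ordering of the scores, the prediction can be read straight from the logits; softmax only matters when you want the scores expressed as probabilities.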


In [5]:
"""
Explanation of 
def forward(self,x):
    x = x.view(-1, 28*28)
    x = torch.relu(self.fc1(x))
    x = self.fc2(x)
    return x
    """


| **Line of Code**              | **What It Does**                                                                                                                  | **Simple Analogy**                                                                              |
| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------- |
| `def forward(self, x):`       | Defines **how input data flows** through the network layers. This is called when you run `model(x)`.                              | Like writing down the **step-by-step instructions** for how your robot should process an image. |
| `x = x.view(-1, 28*28)`       | Flattens the image (28×28 pixels) into a **single row of 784 numbers**. The `-1` automatically adjusts for batch size.            | Like unfolding a **folded newspaper** and laying it flat in one line so it’s easier to read.    |
| `x = torch.relu(self.fc1(x))` | Sends the data through the **first layer** (`fc1`) and applies the **ReLU activation** (turns negatives into 0, keeps positives). | Like the robot filtering signals: “Ignore bad signals (negative) and keep useful ones.”         |
| `x = self.fc2(x)`             | Passes the result into the **second layer** to get **10 output scores** (one for each digit 0–9).                                 | Like the robot saying: “Okay, I’ve processed the signals. Here are my 10 possible guesses.”     |
| `return x`                    | Returns the **final output** (raw predictions/logits).                                                                            | Like the robot handing you its final answer sheet.                                              |


### 👉 In short:

- Flatten the picture 📄

- Process it with first layer + filter 🎛️

- Send it through second layer 🔢

- Give back predictions ✅
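
The four steps above can be traced by checking tensor shapes at each stage. This is a small sketch that uses standalone layers with the same sizes as `fc1` and `fc2` and a random batch in place of real images (both are assumptions made for illustration).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only so the random weights and inputs are repeatable

fc1 = nn.Linear(28 * 28, 128)
fc2 = nn.Linear(128, 10)

x = torch.rand(64, 1, 28, 28)     # a fake batch of 64 images (assumption)
flat = x.view(-1, 28 * 28)        # flatten: [64, 1, 28, 28] -> [64, 784]
hidden = torch.relu(fc1(flat))    # first layer + ReLU: [64, 128], no negatives left
logits = fc2(hidden)              # second layer: [64, 10], one score per digit

print(flat.shape, hidden.shape, logits.shape)
```

The `-1` in `view` resolves to 64 here because 64 × 1 × 28 × 28 values divided into rows of 784 leaves exactly 64 rows.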

# 🏋️ Training vs 🧪 Testing in a Neural Network (Example with Digit "5")

---

## 🏋️ Training Phase (Learning Time)

**Input:**
- You feed an image of “5” into the network (28×28 pixels).

**Forward Pass:**
- The network flattens it → processes it through layers → outputs 10 raw scores (logits).
- Example output: [ -0.8, 0.2, 0.1, -0.5, 0.3, 2.9, 0.7, -1.1, 0.0, -0.2 ]
- The biggest score is at index **5** (2.9).  
- Network predicts **“5”**.

**Compare with True Label:**
- We know the correct answer (label) is “5” because this is training data.
- The model’s output is compared with the true label using a **loss function** (usually cross-entropy).

**Error Calculation (Loss):**
- If the network gave the true class “5” a low score (say it favored “3” instead), the loss would be **high**.
- If it gave “5” a clearly higher score than every other digit, the loss is **low**.

**Backpropagation + Optimizer:**
- Backpropagation computes the gradients: how much each weight contributed to the loss.
- The optimizer (SGD/Adam, etc.) uses those gradients to update the network’s weights so it does **a little better next time**.
- This repeats **thousands of times** (once per batch) until the network gets very good.

---

## 🧪 Testing Phase (Evaluation Time)

**Input:**
- Now you give a new image of “5” (from the test set or even a random hand-drawn one).

**Forward Pass:**
- Same process: flatten → layers → 10 output scores.

**Prediction:**
- Pick the highest score as the network’s guess.
- Example: [ -1.0, 0.5, 0.3, -0.4, 0.7, 3.8, 1.1, -0.6, 0.0, -0.2 ]
- Highest is at index **5** (3.8).  
- Prediction = **“5”**.

**No Backpropagation:**
- In testing, we **don’t adjust weights** anymore.
- We only check: *Did the model’s guess match the true answer?*

---

## ⚖️ Analogy

- **Training**: Like a student practicing math problems. If they get one wrong, the teacher explains the mistake, and the student **learns** (updates their brain).  
- **Testing**: Like the final exam. The student just writes answers: no feedback, no learning. Only **grading** happens.

---

## ✅ Summary

- During **training**, the model learns from its mistakes by comparing predictions with correct labels.  
- During **testing**, the model only predicts and we check accuracy. No learning happens.


In [6]:
# Initialize model, loss, optimizer
model = SimpleNN()
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

| **Line of Code**                                       | **What It Does**                                                                                                                                                                                 | **Simple Analogy**                                                                                                                                                           |
| ------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `model = SimpleNN()` | Creates an object of the `SimpleNN` class (the neural network we defined). | Like **building the robot** from your blueprint so it’s ready to work. |
| `loss_fn = nn.CrossEntropyLoss()` | Defines the **loss function**. Cross-entropy is used for classification tasks (like MNIST digits). It measures how far the predictions are from the correct answer. | Like a **teacher** checking how wrong your homework answers are. |
| `optimizer = optim.Adam(model.parameters(), lr=0.001)` | Sets up the **optimizer** (Adam). It will update the model’s parameters (weights) using the gradients during training. The `lr=0.001` is the learning rate: how big each update step should be. | Like giving the student a **study strategy**: how quickly to learn from mistakes. A small learning rate = small careful steps, a big one = big jumps (risk of overshooting). |


# Optimizer & Learning Rate Explained (Simple Terms)

### 1. What is Adam?
- **Adam = Adaptive Moment Estimation** (an optimizer).
- It’s the algorithm that **updates the weights** of your neural network after each batch.
- Think of it as a **smart tutor** who:
  - 📒 Remembers the direction of your recent corrections (**momentum**).
  - 🔧 Adjusts the step size for each weight based on how large its recent gradients have been (**adaptive learning rate**).
- Compared to plain SGD, Adam often **learns faster and more smoothly** with less tuning, though not on every problem.
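
To show what "momentum" and "adaptive" mean in practice, here is a minimal sketch of the standard Adam update rule for a single weight in plain Python. The function name `adam_minimize`, the toy objective, and the step size `lr=0.1` are assumptions chosen for this illustration; the notebook itself relies on `optim.Adam` rather than code like this.

```python
import math

def adam_minimize(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0  # running averages of the gradient and of the squared gradient
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g       # momentum: memory of past gradients
        v = beta2 * v + (1 - beta2) * g * g   # adaptive part: typical gradient size
        m_hat = m / (1 - beta1 ** t)          # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)  # step scaled by gradient size
    return w

# Toy problem (assumption): minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = adam_minimize(lambda w: 2 * (w - 3), w=0.0)
print(round(w_final, 2))
```

Dividing by the square root of `v_hat` is what makes the step size adapt: a weight with consistently large gradients takes proportionally smaller steps.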

---

### 2. What is Learning Rate (lr)?
- **Learning rate = step size** for each weight update.
- Analogy: Learning to ride a bike 🚲
  - **High learning rate** → you make huge corrections → fast, but you wobble/fall.
  - **Low learning rate** → tiny corrections → slow, but more stable.

---

### 3. Should it be higher or lower?
- **High lr (e.g., 0.1)**:
  - 🚀 Learns quickly.
  - ❌ Might overshoot and never settle (too jumpy).
- **Low lr (e.g., 0.0001)**:
  - 🐢 Learns slowly.
  - ✅ More precise and stable.

---

### 4. What’s a good value?
- **Default = 0.001** for Adam in PyTorch → a balanced starting point (and the value used in the code above).
- Tune if needed:
  - If the loss **barely decreases** → 🔼 increase lr.
  - If the loss **bounces/oscillates** or blows up → 🔽 decrease lr.
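
The trade-off can be seen on a toy problem. This sketch uses plain gradient descent (not Adam) on f(w) = w², and the three learning rates are picked only to make the behaviors visible on this toy function; they are not recommendations for MNIST.

```python
def gradient_descent(lr, steps=20, w=1.0):
    # Minimize f(w) = w^2 (gradient 2*w) with a fixed step size.
    for _ in range(steps):
        w -= lr * 2 * w
    return w

w_low = gradient_descent(lr=0.01)    # small steps: still far from 0 after 20 steps
w_good = gradient_descent(lr=0.4)    # larger steps: essentially at 0
w_high = gradient_descent(lr=1.1)    # too large: every step overshoots and |w| grows

print(abs(w_low), abs(w_good), abs(w_high))
```

With `lr=1.1` each update multiplies `w` by -1.2, so the weight flips sign and moves further from the minimum every step, which is the "bouncing" described above.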

---

📌 **Quick Analogy Recap**:
- **Optimizer** = how you correct yourself while riding a bike.  
- **Learning rate** = how big those corrections are.  
- **Adam** = a smart rider who remembers past wobbles and improves correction each time.  


# 📉 What is a Loss Function?

A **loss function** tells the model how wrong its prediction was.  

- The **smaller the loss** → the better the model is doing.  

---

## 🔹 Why CrossEntropyLoss?

- It’s the most common loss function for **classification problems** (like MNIST digit recognition).  
- It combines **(log-)Softmax + Negative Log Likelihood (NLL)** in one step, which is why the model returns raw logits and has no softmax layer of its own.  

---

## 🔹 How it Works

### 1. Logits (raw model outputs)
Example → `[-2.1, 0.5, 3.7, 1.2]`

### 2. Softmax → Probabilities
Converts logits into numbers between 0 and 1 (that sum to 1).  
After softmax → `[0.003, 0.036, 0.888, 0.073]` (rounded)

### 3. Compare with True Label
- Suppose the correct class is **2** (index = 2).  
- The probability for class 2 is **0.888**, so the loss is -log(0.888) ≈ **0.12**.  

👉 If the probability was **low**, the loss would be **high** → punishing the model.  
👉 If the probability was **high**, the loss would be **small** → rewarding the model.  
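
The three steps can be checked by hand. This is a minimal plain-Python sketch of softmax followed by negative log likelihood for the example logits; it mirrors the idea behind `nn.CrossEntropyLoss` for a single sample but is not PyTorch's actual implementation.

```python
import math

logits = [-2.1, 0.5, 3.7, 1.2]   # example logits from the text
true_class = 2                   # index of the correct class

# Step 1: softmax turns logits into probabilities that sum to 1.
exps = [math.exp(v) for v in logits]
probs = [e / sum(exps) for e in exps]

# Step 2: the loss is the negative log of the probability given to the true class.
loss = -math.log(probs[true_class])

# For comparison: the loss if the true class had been index 0 (a low-probability class).
loss_if_wrong = -math.log(probs[0])

print([round(p, 3) for p in probs])   # [0.003, 0.036, 0.888, 0.073]
print(round(loss, 3))                 # 0.119
print(loss_if_wrong > loss)           # True
```

The loss only looks at the probability assigned to the correct class, so a confident correct answer costs little and a confident wrong answer costs a lot.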

---

## 🔹 Intuition

Think of it like a **teacher grading multiple-choice answers**:  

- The model writes its confidence for each option.  
- The loss function checks if the model gave **high confidence to the correct answer**.  

✅ If yes → **small penalty (good job)**  
❌ If no → **large penalty (bad job)**  

---

## ✅ In Short
`CrossEntropyLoss` measures **how far the predicted probability distribution is from the true answer**.


In [7]:
# Training loop
for epoch in range(10):   
    for images, labels in train_loader:
        outputs = model(images)         
        loss = loss_fn(outputs, labels)   

        optimizer.zero_grad()             
        loss.backward()                   
        optimizer.step()                  

    print(f"Epoch {epoch+1}, Loss = {loss.item():.4f}")

Epoch 1, Loss = 0.1303
Epoch 2, Loss = 0.1749
Epoch 3, Loss = 0.0333
Epoch 4, Loss = 0.1395
Epoch 5, Loss = 0.1421
Epoch 6, Loss = 0.0163
Epoch 7, Loss = 0.0027
Epoch 8, Loss = 0.1560
Epoch 9, Loss = 0.0148
Epoch 10, Loss = 0.0053


| Line of Code                                          | What It Does                                                                           | Analogy                                                                     |
| ----------------------------------------------------- | -------------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| `for epoch in range(10):`                             | Runs the training process for 10 full passes through the dataset (10 epochs).          | Like practicing the entire question paper 10 times.                         |
| `for images, labels in train_loader:`                 | Loops through batches of training data (images + their correct labels).                | Getting practice questions in small sets instead of the whole exam at once. |
| `outputs = model(images)`                             | Passes the batch of images through the neural network to get predictions (logits).     | The student attempts answers based on their current knowledge.              |
| `loss = loss_fn(outputs, labels)`                     | Calculates how far the predictions are from the correct answers using a loss function. | Teacher checks answers and gives a penalty score (higher = more mistakes).  |
| `optimizer.zero_grad()`                               | Resets (clears) old gradients before calculating new ones.                             | Erasing yesterday’s corrections before starting today’s work.               |
| `loss.backward()`                                     | Backpropagation: calculates gradients (how much each parameter should change).         | Teacher gives feedback on where and how much the student went wrong.        |
| `optimizer.step()`                                    | Updates the model’s parameters using the optimizer (like Adam).                        | Student improves their understanding based on feedback.                     |
| `print(f"Epoch {epoch+1}, Loss = {loss.item():.4f}")` | Displays the loss of the **last batch** in each epoch (not the epoch average), which is why the printed values jump around instead of falling steadily. | Spot-checking the student on the final question of each practice round.     |
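
Because the loop prints only the final batch's loss, the numbers can go up between epochs even while the model improves. One common alternative is to report the average loss over the whole epoch. The sketch below does that on synthetic random data (an assumption, so the loss values themselves are meaningless) with an `nn.Sequential` model that has the same layer sizes as `SimpleNN`; `running_loss`, `seen`, and `epoch_losses` are names invented here.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data (assumption): 256 random "images" with random labels.
dataset = TensorDataset(torch.rand(256, 1, 28, 28), torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Same layer sizes as SimpleNN, written with nn.Sequential for brevity.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

epoch_losses = []
for epoch in range(3):
    running_loss, seen = 0.0, 0
    for images, labels in loader:
        outputs = model(images)
        loss = loss_fn(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Weight each batch's mean loss by its size so a short last batch counts fairly.
        running_loss += loss.item() * labels.size(0)
        seen += labels.size(0)

    epoch_losses.append(running_loss / seen)
    print(f"Epoch {epoch+1}, Average Loss = {epoch_losses[-1]:.4f}")
```

The averaged value is smoother than a single batch's loss, which makes it easier to tell whether training is actually progressing.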


In [8]:
# Test accuracy
correct, total = 0, 0
with torch.no_grad():  # no gradients during testing
    for images, labels in test_loader:
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        correct += (predicted == labels).sum().item()
        total += labels.size(0)

print(f"Test Accuracy = {100 * correct / total:.2f}%")

Test Accuracy = 97.58%


| Line of Code                                             | What It Does                                                       | Analogy                                                                                  |
| -------------------------------------------------------- | ------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
| `correct, total = 0, 0`                                  | Initializes counters for correct predictions and total samples.    | Starting with a clean scorecard before grading an exam.                                  |
| `with torch.no_grad():`                                  | Turns off gradient tracking (saves memory & speeds up testing).    | During the real exam, no teacher feedback/corrections are given, just checking answers.  |
| `for images, labels in test_loader:`                     | Loops over batches of test data.                                   | Distributes exam papers to students in groups.                                           |
| `outputs = model(images)`                                | Gets model predictions for the test batch.                         | Student writes answers based on their learning.                                          |
| `_, predicted = torch.max(outputs, 1)`                   | Selects the class with the highest score (the model’s choice).     | Student picks the answer they are most confident about.                                  |
| `correct += (predicted == labels).sum().item()`          | Counts how many answers are correct in this batch.                 | Teacher marks the number of correct answers.                                             |
| `total += labels.size(0)`                                | Keeps track of total number of questions checked.                  | Counting total questions in the exam.                                                    |
| `print(f"Test Accuracy = {100 * correct / total:.2f}%")` | Calculates and prints accuracy = (correct ÷ total × 100).          | Final exam result sheet: percentage score.                                               |
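
The counting logic is easier to follow on a tiny hand-made batch. In this sketch the logits and labels are invented for illustration (4 samples, 3 classes instead of 10), and it also shows that `torch.max(outputs, 1)` returns the same indices as `outputs.argmax(dim=1)`.

```python
import torch

# Hypothetical logits for a batch of 4 images over 3 classes (made up for illustration).
outputs = torch.tensor([
    [2.0, 0.1, -1.0],   # highest score at index 0
    [0.3, 1.5, 0.2],    # highest score at index 1
    [-0.5, 0.0, 0.9],   # highest score at index 2
    [1.2, 0.4, 0.8],    # highest score at index 0
])
labels = torch.tensor([0, 1, 1, 0])

with torch.no_grad():
    _, predicted = torch.max(outputs, 1)   # (max values, their indices) for each row
    same_as_argmax = torch.equal(predicted, outputs.argmax(dim=1))

correct = (predicted == labels).sum().item()   # rows 0, 1 and 3 match their labels
total = labels.size(0)
accuracy = 100 * correct / total

print(predicted.tolist(), correct, total, accuracy)  # [0, 1, 2, 0] 3 4 75.0
```

The first value returned by `torch.max` (the maximum scores themselves) is discarded with `_` because only the winning index is needed for accuracy.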
