# MNIST CNN for Kids (but still real science!)

This notebook teaches you how a **CNN (Convolutional Neural Network)** learns to read **handwritten digits** (0–9) from the **MNIST** dataset.

It is written so a **10‑year‑old** can follow it, but it also includes the **real technical details** (shapes, kernels, pooling, dropout, training history, etc.).

**What you will learn:**
- What a **tensor** is (a fancy word for a number box with dimensions)
- What **shape** means (like height × width × channels)
- What **convolution kernels** (filters) are (like tiny stamp patterns)
- Why we use **pooling** (shrinking while keeping important info)
- Why we use **dropout** (to prevent memorizing)
- How to build and train a CNN with **TensorFlow / Keras**
- How to pick different test images and get predictions

## 0) Install / Imports

If you are using Google Colab, TensorFlow is usually already installed.

If you're running locally and TensorFlow isn't installed, run in a terminal:

```bash
pip install tensorflow matplotlib numpy ipywidgets
```

Now we import the libraries.

In [None]:
# Imports (tools we need)
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# Make results repeatable (so you see similar numbers each time)
tf.keras.utils.set_random_seed(42)

print("TensorFlow version:", tf.__version__)

## 1) Load the MNIST dataset

MNIST is a famous dataset: 70,000 tiny images of handwritten digits (60,000 for training and 10,000 for testing).

- Each image is **28 pixels wide** and **28 pixels tall**
- Each pixel is a number from **0 to 255** (0 = black, 255 = white)

In [None]:
# Load MNIST (built into Keras)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

print("x_train shape:", x_train.shape)  # (60000, 28, 28)
print("y_train shape:", y_train.shape)  # (60000,)
print("x_test shape :", x_test.shape)   # (10000, 28, 28)
print("y_test shape :", y_test.shape)   # (10000,)

print("Example labels:", y_train[:10])

### What does `shape` mean?

`shape` is like a **label on a box** telling you its dimensions.

For `x_train.shape = (60000, 28, 28)`:
- 60000 images
- each image is 28×28 pixels

So `x_train[i]` is one image and has shape `(28, 28)`.
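To get a feel for shapes without loading the real dataset, here is a small sketch. It uses a made-up stand-in array (`fake_images` is a hypothetical name, not part of MNIST) with only 5 blank images, and shows how the shape gets shorter each time you reach deeper into the box.

```python
import numpy as np

# A made-up stand-in for x_train: 5 blank "images", each 28x28 pixels.
fake_images = np.zeros((5, 28, 28), dtype=np.uint8)

print(fake_images.shape)        # the whole box: (5, 28, 28)
print(fake_images[0].shape)     # one image: (28, 28)
print(fake_images[0, 3].shape)  # one row of one image: (28,)
print(fake_images.ndim)         # how many dimensions the box has: 3

one_image_shape = fake_images[0].shape
```

Each index you add "opens" one level of the box, so the shape loses its first number.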

In [None]:
# Let's look at one image
i = 0
img = x_train[i]
label = y_train[i]

print("One image shape:", img.shape)
print("Its label is:", label)

plt.imshow(img, cmap="gray")
plt.title(f"Digit: {label}")
plt.axis("off")
plt.show()

## 2) Prepare the data for a CNN

### Step A: Normalize (make pixel values small)
Right now pixels are 0..255.
Neural networks often learn better if we scale to **0..1**.

So we divide by 255.

### Step B: Add a channel dimension
By default, CNN layers in Keras expect each image to be shaped like:

`(height, width, channels)`

MNIST is grayscale (one channel), so channels = 1.

We change:
- from `(28, 28)` to `(28, 28, 1)`
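Both steps can be tried on a tiny made-up image first. The sketch below assumes a 2×2 array called `tiny` (a hypothetical example, not MNIST data) and shows three ways to add the channel dimension that should all give the same result.

```python
import numpy as np

# A made-up 2x2 "image" with pixel values between 0 and 255.
tiny = np.array([[0, 51], [102, 255]], dtype=np.uint8)

# Step A: normalize so the values fall between 0 and 1.
tiny_scaled = tiny.astype("float32") / 255.0
print(tiny_scaled)

# Step B: three equivalent ways to add a channel dimension at the end.
a = tiny_scaled[..., None]
b = np.expand_dims(tiny_scaled, axis=-1)
c = tiny_scaled.reshape(2, 2, 1)
print(a.shape, b.shape, c.shape)  # (2, 2, 1) three times
```

The pixel values do not change in Step B. Only the "label on the box" changes.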

In [None]:
# Normalize to [0, 1]
x_train = x_train.astype("float32") / 255.0
x_test  = x_test.astype("float32") / 255.0

# Add the channel dimension
x_train = x_train[..., None]  # (60000, 28, 28, 1)
x_test  = x_test[..., None]   # (10000, 28, 28, 1)

print("After preprocessing:")
print("x_train shape:", x_train.shape)
print("x_test shape :", x_test.shape)

## 3) The Big Idea: What is a Convolution?

Imagine you have a **tiny stamp** (called a **kernel** or **filter**), like a 3×3 square.

You press that stamp onto the image in many places.

- If the stamp matches the local pattern (like an edge), you get a **big number**
- If it doesn't match, you get a **small number**

So a convolution filter is like a little robot looking for a specific pattern:
- edges
- curves
- corners
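Here is the "big number vs. small number" idea worked out by hand on two made-up 3×3 patches. This is only a sketch for intuition: the stamp below is hand-written, while a real CNN learns its own stamps.

```python
import numpy as np

# A stamp that looks for "dark on top, bright on the bottom" (a horizontal edge).
kernel = np.array([[-1, -1, -1],
                   [ 0,  0,  0],
                   [ 1,  1,  1]], dtype=np.float32)

# Patch A contains that pattern: dark top rows, bright bottom row.
edge_patch = np.array([[0, 0, 0],
                       [0, 0, 0],
                       [1, 1, 1]], dtype=np.float32)

# Patch B is evenly bright everywhere, so there is no edge at all.
flat_patch = np.ones((3, 3), dtype=np.float32)

# Multiply number by number, then add everything up.
edge_score = float(np.sum(edge_patch * kernel))
flat_score = float(np.sum(flat_patch * kernel))

print(edge_score)  # 3.0 (the stamp matches)
print(flat_score)  # 0.0 (the -1s and +1s cancel out)
```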

### Kernel shape
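For grayscale images, one Conv2D kernel (one filter) has shape:
`(kernel_height, kernel_width, input_channels)`

Example: `(3, 3, 1)` means a 3×3 patch and 1 input channel.
Keras stacks all of a layer's filters into one weight tensor of shape `(kernel_height, kernel_width, input_channels, filters)`, so 16 filters give `(3, 3, 1, 16)`.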

In [None]:
# Let's create a simple edge-detecting kernel (hand-made) just for intuition
kernel = np.array([
    [-1, -1, -1],
    [ 0,  0,  0],
    [ 1,  1,  1],
], dtype=np.float32)

print("Kernel shape:", kernel.shape)
print(kernel)

### A quick demo: apply a kernel to an image (very simplified)

This is NOT the full Keras Conv2D (which learns kernels automatically),
but it helps you understand the idea.

We slide the kernel over the image. At each position we multiply the kernel by the patch underneath it, number by number, and add everything up (a dot product).

In [None]:
def simple_convolution2d(image_2d, kernel_2d):
    """Very simple convolution (no padding, stride 1)."""
    H, W = image_2d.shape
    kh, kw = kernel_2d.shape
    out = np.zeros((H - kh + 1, W - kw + 1), dtype=np.float32)
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image_2d[r:r+kh, c:c+kw]
            out[r, c] = np.sum(patch * kernel_2d)
    return out

demo_img = x_train[0].squeeze()  # (28,28)
conv_out = simple_convolution2d(demo_img, kernel)

plt.figure(figsize=(10,4))
plt.subplot(1,2,1)
plt.imshow(demo_img, cmap="gray")
plt.title("Original image")
plt.axis("off")

plt.subplot(1,2,2)
plt.imshow(conv_out, cmap="gray")
plt.title("After kernel (edges pop out)")
plt.axis("off")
plt.show()

print("Original shape:", demo_img.shape, "-> Convolution output shape:", conv_out.shape)

## 4) Pooling: Why do we shrink images?

After convolution, we often do **pooling** to shrink the feature maps.

Think of it like:
- You have a big LEGO sculpture.
- You want a smaller summary version that still shows the important shape.

**MaxPooling2D (2×2)** looks at each 2×2 square and keeps the **biggest number**.
That keeps the strongest signal (like the strongest edge).

If we start with 28×28:
- after 2×2 pooling we get 14×14
- after another 2×2 pooling we get 7×7
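Max pooling is simple enough to try with plain NumPy. The helper below (`max_pool_2x2` is a hypothetical name for this sketch, not a Keras function) assumes the height and width are even numbers, and it is meant for intuition only.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2; assumes even height and width."""
    h, w = feature_map.shape
    # Cut the map into 2x2 blocks, then keep the biggest number in each block.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

# A made-up 4x4 feature map.
fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 0, 5, 6],
                 [1, 2, 7, 8]], dtype=np.float32)

pooled = max_pool_2x2(fmap)
print(pooled)  # [[4. 2.]
               #  [2. 8.]]

# The same shrinking applied twice to a 28x28 map.
big = np.zeros((28, 28), dtype=np.float32)
once = max_pool_2x2(big)
twice = max_pool_2x2(once)
print(big.shape, once.shape, twice.shape)  # (28, 28) (14, 14) (7, 7)
```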

## 5) Dropout: Why do we randomly turn off neurons?

Dropout is like training a sports team:
- If only one superstar ever scores, the team is in trouble whenever that player is missing.
- Dropout forces the network to **not depend on just one neuron**.

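So during training, dropout randomly turns off some neurons (it sets their outputs to 0).
When the model makes real predictions, dropout is switched off and every neuron gets to play.
This helps the model **generalize** instead of memorizing.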
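The idea can be sketched in a few lines of NumPy. This is a simplified stand-in, not the Keras implementation: `dropout_sketch` is a hypothetical helper, and the random seed is chosen arbitrarily. Like the Keras `Dropout` layer, it scales the surviving values up by `1 / (1 - rate)` during training so the overall signal strength stays about the same.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, just for repeatability

def dropout_sketch(x, rate, training):
    """Simplified dropout: zero out a random fraction `rate` while training."""
    if not training:
        return x  # at prediction time nothing is dropped
    keep_mask = rng.random(x.shape) >= rate   # True means "this neuron stays on"
    return x * keep_mask / (1.0 - rate)       # scale up the survivors

signals = np.ones(8, dtype=np.float32)

train_out = dropout_sketch(signals, rate=0.25, training=True)
test_out = dropout_sketch(signals, rate=0.25, training=False)

print("training  :", train_out)  # a mix of 0.0 and about 1.33
print("predicting:", test_out)   # all ones, unchanged
```

Which neurons get switched off changes on every call, so no single neuron can become the team's only superstar.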

## 6) Build the CNN (Keras)

We will build a small CNN:

1. **Conv2D** (learns kernels like stamps)
2. **MaxPooling2D** (shrinks while keeping important signals)
3. **Conv2D**
4. **MaxPooling2D**
5. **Dropout**
6. **Flatten** (turn 2D maps into a 1D list)
7. **Dense** (decision-making)
8. **Dense(10)** with softmax (probabilities for digits 0–9)

### Shapes through the network
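- Input: `(28, 28, 1)`
- After the first Conv2D (`padding="same"`, 16 filters): `(28, 28, 16)`
- After the first MaxPool (2×2): `(14, 14, 16)`
- After the second Conv2D (`padding="same"`, 32 filters): `(14, 14, 32)`
- After the second MaxPool (2×2): `(7, 7, 32)`
- After Flatten: `7*7*32 = 1568` numbers in a line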
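The shapes also tell you how many numbers the model has to learn. The sketch below does that arithmetic by hand, assuming the layer sizes used in this notebook's model (3×3 kernels, 16 and then 32 filters, a Dense layer with 64 units, and 10 outputs). The helper names are hypothetical. The total should match the parameter count that `model.summary()` prints.

```python
def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Weights in every kernel, plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters

def dense_params(inputs, units):
    """One weight per input-unit pair, plus one bias per unit."""
    return inputs * units + units

h, w = 28, 28
conv1 = conv_params(3, 3, 1, 16)    # padding="same" keeps 28x28
h, w = h // 2, w // 2               # first 2x2 pooling: 14x14
conv2 = conv_params(3, 3, 16, 32)   # still 14x14
h, w = h // 2, w // 2               # second 2x2 pooling: 7x7
flat = h * w * 32                   # Flatten: 7 * 7 * 32
dense1 = dense_params(flat, 64)
dense2 = dense_params(64, 10)
total = conv1 + conv2 + dense1 + dense2

print("conv1 :", conv1)    # 160
print("conv2 :", conv2)    # 4640
print("flat  :", flat)     # 1568
print("dense1:", dense1)   # 100416
print("dense2:", dense2)   # 650
print("total :", total)    # 105866
```

Notice that most of the numbers live in the first Dense layer, not in the convolution kernels.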

In [None]:
# Build a small CNN model
model = tf.keras.Sequential([
    # Input: 28x28 grayscale image (1 channel)
    tf.keras.layers.Input(shape=(28, 28, 1)),

    # Convolution: learn 16 different 3x3 kernels (filters)
    tf.keras.layers.Conv2D(filters=16, kernel_size=(3,3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(pool_size=(2,2)),  # 28x28 -> 14x14

    # Another convolution block
    tf.keras.layers.Conv2D(filters=32, kernel_size=(3,3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(pool_size=(2,2)),  # 14x14 -> 7x7

    # Dropout: randomly turn off 25% of signals during training
    tf.keras.layers.Dropout(0.25),

    # Flatten 7x7x32 -> 1568 numbers
    tf.keras.layers.Flatten(),

    # Dense layer for mixing information
    tf.keras.layers.Dense(64, activation="relu"),

    # Output layer: 10 digits
    tf.keras.layers.Dense(10, activation="softmax")
])

# Compile: choose how it learns
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

# Show the model blueprint (including layer output shapes)
model.summary()

## 7) Train the CNN

During training, the model:
- guesses the digit
- measures how wrong it was (**loss**)
- adjusts its kernels and other weights to be less wrong (**learning**)
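To see how "how wrong it was" becomes a number, here is a sketch of the loss for a single image. It assumes the usual definition of sparse categorical cross-entropy for one example: minus the logarithm of the probability the model gave to the correct digit. The probabilities are made up for illustration.

```python
import numpy as np

def sparse_cross_entropy(probs, true_label):
    """Loss for one example: -log(probability given to the correct digit)."""
    return float(-np.log(probs[true_label]))

true_label = 7

# Made-up guess 1: the model is confident and right (91% on digit 7).
right = np.full(10, 0.01)
right[7] = 0.91

# Made-up guess 2: the model is confident and wrong (91% on digit 1).
wrong = np.full(10, 0.01)
wrong[1] = 0.91

loss_right = sparse_cross_entropy(right, true_label)
loss_wrong = sparse_cross_entropy(wrong, true_label)

print(round(loss_right, 3))  # small loss, about 0.094
print(round(loss_wrong, 3))  # big loss, about 4.605
```

A confident wrong answer is punished much more than a confident right answer is rewarded, which pushes the model to fix its worst mistakes first.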

An **epoch** means the model sees the whole training dataset once.

We also use **validation_split=0.2**, which holds back the last 20% of the training images so the model never trains on them:
- 80% training
- 20% validation (mini-test during training)
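A quick bit of arithmetic shows what the split means in numbers. The sketch assumes the 60,000 MNIST training images and a batch size of 64 (the value passed to `model.fit` in this notebook).

```python
import math

n_images = 60000
validation_split = 0.2
batch_size = 64

n_val = round(n_images * validation_split)          # images held back
n_train = n_images - n_val                          # images used for learning
steps_per_epoch = math.ceil(n_train / batch_size)   # batches in one epoch

print("training images  :", n_train)          # 48000
print("validation images:", n_val)            # 12000
print("steps per epoch  :", steps_per_epoch)  # 750
```

So the progress bar during training should count up to about 750 steps for each epoch.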

In [None]:
history = model.fit(
    x_train, y_train,
    epochs=3,
    batch_size=64,
    validation_split=0.2,
    verbose=1
)

## 8) Plot training history (loss and accuracy)

- Loss should go **down**
- Accuracy should go **up**
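- Validation curves help you notice overfitting: if training loss keeps falling while validation loss starts rising, the model is memorizing instead of learning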
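Here is one way to spot that pattern in code. The numbers below are made up for illustration only (they are NOT results from this notebook), and `fake_history` is a hypothetical dictionary shaped like `history.history`.

```python
# Made-up loss values for 5 epochs, for illustration only.
fake_history = {
    "loss":     [0.40, 0.20, 0.12, 0.08, 0.05],
    "val_loss": [0.30, 0.18, 0.15, 0.17, 0.21],
}

val_loss = fake_history["val_loss"]

# The epoch (counting from 0) where validation loss was lowest.
best_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i])

# Gap between validation and training loss at each epoch.
gaps = [round(v - t, 2) for t, v in zip(fake_history["loss"], val_loss)]

print("best epoch:", best_epoch)  # 2
print("gaps      :", gaps)
```

In this made-up run, training loss keeps falling after epoch 2 but validation loss climbs, and the gap grows. That growing gap is the warning sign for overfitting.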

In [None]:
hist = history.history

plt.figure(figsize=(12,4))

plt.subplot(1,2,1)
plt.plot(hist["loss"], label="train loss")
plt.plot(hist["val_loss"], label="val loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.title("Loss")

plt.subplot(1,2,2)
plt.plot(hist["accuracy"], label="train acc")
plt.plot(hist["val_accuracy"], label="val acc")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.title("Accuracy")

plt.show()

## 9) Test accuracy

Now we evaluate using the official test set (images the model never trained on).
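Accuracy itself is a very simple idea: the fraction of guesses that match the true labels. The sketch below uses five made-up labels and predictions (not real model output) to show the calculation.

```python
import numpy as np

# Made-up example: 5 true labels and 5 guesses, where the last guess is wrong.
true_labels = np.array([7, 2, 1, 0, 4])
predictions = np.array([7, 2, 1, 0, 9])

correct = predictions == true_labels       # True where the guess matches
accuracy = float(np.mean(correct))         # fraction of True values

print(correct)   # [ True  True  True  True False]
print(accuracy)  # 0.8
```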

In [None]:
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print("Test accuracy:", float(test_acc))
print("Test loss    :", float(test_loss))

## 10) Predict one image and see probabilities

The model outputs 10 probabilities that add up to 1.0.
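Those probabilities come from the `softmax` function in the last layer. Here is a minimal NumPy sketch of softmax, using made-up raw scores (not values from the trained model), to show why the outputs are all positive and add up to 1.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into positive numbers that add up to 1."""
    shifted = scores - np.max(scores)  # subtracting the max avoids huge exponents
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Made-up raw scores for the digits 0..9.
scores = np.array([1.0, 0.5, 3.0, 0.0, -1.0, 0.2, 0.1, 2.0, 0.3, 0.0])
probs = softmax(scores)

predicted_digit = int(np.argmax(probs))

print(np.round(probs, 3))
print("sum:", float(probs.sum()))          # very close to 1.0
print("predicted digit:", predicted_digit)  # 2 (it had the highest score)
```

The predicted digit is simply the position of the largest probability, which is what `np.argmax` finds.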

In [None]:
idx = 0
img = x_test[idx]
true_label = int(y_test[idx])

probs = model.predict(img[None, ...], verbose=0)[0]
pred_label = int(np.argmax(probs))

print("True label:", true_label)
print("Predicted:", pred_label)

plt.imshow(img.squeeze(), cmap="gray")
plt.title(f"True: {true_label} | Pred: {pred_label}")
plt.axis("off")
plt.show()

plt.figure(figsize=(8,3))
plt.bar(range(10), probs)
plt.xticks(range(10))
plt.xlabel("Digit")
plt.ylabel("Probability")
plt.title("Model confidence")
plt.show()

## 11) Choose another image and predict (interactive option)

**Option A (widgets):** If `ipywidgets` works in your notebook, you get a slider.

**Option B (manual):** If widgets don't work, just change `idx = ...` and re-run.

In [None]:
def show_prediction(idx: int):
    img = x_test[idx]
    true_label = int(y_test[idx])
    probs = model.predict(img[None, ...], verbose=0)[0]
    pred_label = int(np.argmax(probs))

    plt.figure(figsize=(4,4))
    plt.imshow(img.squeeze(), cmap="gray")
    plt.title(f"Index {idx} | True: {true_label} | Pred: {pred_label}")
    plt.axis("off")
    plt.show()

    plt.figure(figsize=(8,3))
    plt.bar(range(10), probs)
    plt.xticks(range(10))
    plt.ylim(0, 1)
    plt.xlabel("Digit")
    plt.ylabel("Probability")
    plt.title("Probabilities")
    plt.show()

try:
    import ipywidgets as widgets
    from IPython.display import display

    slider = widgets.IntSlider(value=0, min=0, max=len(x_test)-1, step=1, description="Image idx:")
    ui = widgets.interactive_output(show_prediction, {"idx": slider})
    display(slider, ui)
except Exception as e:
    # Widgets could not be imported or displayed, so fall back to manual mode
    print(f"ipywidgets not available here ({e}). Manual mode works!")
    idx = 123
    show_prediction(idx)

## 12) Quick quiz (to check understanding)

1. If an image is `(28, 28)` and we add a channel dimension, what is the new shape?
2. Why do we divide pixels by 255?
3. What does a 3×3 kernel do as it slides over an image?
4. What does MaxPooling2D (2×2) do, and why is it helpful?
5. What problem does Dropout try to prevent?
6. Why do we use `softmax` in the last layer?