### Introduction Notes
We are interested in minimizing the error of our function over the test set. The gap between the test set and train set is given approximately as:

$E_{\text{test}} - E_{\text{train}} = k(h/P)^{\alpha}$

If possible we'd like this term to be lower.

P is the number of training examples, so increasing P decreases this gap.
h is a measure of the capacity, so as the capacity goes up this value goes up, which makes sense as it will likely have more capacity to overfit

In practice this is minimized by minimizing $E_{\text{train}} + \beta H (W)$ which is just where we add regularization.

---

We minimize this by computing the gradients and then updating the weights based on these gradients.

Popular procedure is SGD or the "on-line update" where we don't get the complete gradient of the training set, but we get the gradient on the basis of a single or small amount of samples.

### Convolutional Neural Networks
One advantage of convolutional neural networks is that they are better at handling invariance with respect to translations or local distortions. This is because with a large MLP, it could learn the different features in different positions, but it would be very inefficient as it would need multiple little mini networks within the network to handle detecting the important patterns positioned at different locations.

Another issue with fully connected MLPs is that there is kind of this loss of information. The variables can be fed in in any fixed order and the outcome won't change. The information going into an MLP is just a 1d vector of numbers. Yet obviously images have a strong and important 2D structure.

Convolutional networks use "local receptive fields" or patches. They extract the elementary features from them such as edges, corners etc... These features are then combined by subsequent convolutional layers in order to detect higher-order features.

Because feature detectors that are useful on one part of the image are likely to be useful for the entire image, we use the single patch that has one set of weights to go across the entire image.

Units in a layer are organized into planes where all the units share the same set of weights. What this means essentially is that we can kind of think of it as sliding a window of the same weights across the input. This would be opposed to, maybe in the bottom right we have a set of weights that detects a corner here and another set of weights in the top left that detects this corner here.

This does not mean that each individual feature map has the same set of weights across them. That would be quite duplicative for no reason.

In LeNet-5 the first layer has 6 planes or feature maps. A unit in a feature map has 25 inputs connected to a 5x5 area in the input. A sequential implementation (scanning across) would slide this 5x5 set of weights across the input and multiply each by the weight and then add the bias and squash at the end.

Once a feature is detected, the location becomes less important, what's important is the position relative to other features.

To reduce the precision with which the positions are encoded there is sub-sampling of layers which they use averaging but you could also use max pooling or something. This helps to reduce sensitivity to shifts and distortions, also may help with overfitting? Not sure.

Before I go into implementing CNNs I want to rework my code from the previous backprop notebook.

I'm going to, based on Karpathy's micrograd, quickly write up / copy over the code so that I can have an autograd engine that implements .backwards. From there I'll add code on top of this that implements the CNN.

Because I want to practice this, I used o3 to turn the implementation of Value into an assignment with methods that I should implement myself.

In [20]:
#!/usr/bin/env python3
from typing import Iterable, Union

Number = Union[int, float]


class Value:
    """A node in a dynamically‑built computation graph."""

    # -------------------------------------------------------------- T1
    def __init__(self, data: Number, _children: Iterable["Value"] = (), _op: str = ""):
        """Create a **leaf** (when ``_children`` is empty) or an **operation
        result** (when created inside an operator).

        What this method must do
        ------------------------
        1. Store ``data`` – the wrapped scalar (float/int).
        2. Create ``grad`` and set it to **0.0**.  We accumulate partial
           derivatives here during back‑prop.
        3. Track the immediate predecessors with ``self._prev = set(_children)``.
           We use a **set** so look‑ups are O(1) and there are no duplicates.
        4. Save ``_op`` – a short string describing *how* this node was
           produced ("+", "*", "ReLU", etc.).  It is *only* for graphviz /
           debugging convenience.
        5. Initialise ``self._backward`` to a **do‑nothing** lambda.  Each
           operator will later *replace* this with a closure that knows
           how to distribute ``out.grad`` to its parents.

        Why these pieces?
        -----------------
        * Automatic differentiation works by walking the graph from the
          output back to the leaves, so every node needs a list of its
          parents (``_prev``) **and** the rule to push its gradient back
          (``_backward``).
        * ``grad`` starts at 0 so multiple downstream paths can safely
          *accumulate* into it.
        """
        self.data = data
        self.grad = 0
        self._backward = lambda: None
        self.prev = set(_children)
        self._op = _op

    def __add__(self, other: Union["Value", Number]):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        
        return out

    def __mul__(self, other: Union["Value", Number]):
        """self * other."""
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    __rmul__ = __mul__

    # -------------------------------------------------------------- T4
    def __pow__(self, other: Number):
        """self ** other (scalar exponent)."""
        out = Value(self.data ** other, (self,), '**')
        def _backward():
            self.grad += (other) * self.data ** (other - 1)
        out._backward = _backward
        return out
        

    # -------------------------------------------------------------- T5
    def relu(self):
        """ReLU activation: max(0, x)."""
        out = Value(max(self.data, 0), (self,), "relu")
        def _backward():
            self.grad += max(0, out.grad)
        out._backward = _backward
        return out

    # -------------------------------------------------------------- T6
    def backward(self):
        """Compute ``d(output)/d(node)`` for *every* ``node`` that influences
        this Value (call it *out*).

        Behaviour overview
        ------------------
        *The chain rule tells us we must process nodes **in reverse
        topological order** – children **before** parents – so that when
        we reach a node, the gradients flowing into it from all of its
        consumers have already been accumulated.*

        Implementation recipe
        ---------------------
        1. **Topological sort**
           Depth‑first search starting from ``self`` collects nodes in a
           list ``topo`` such that parents appear **before** children.
        2. **Seed the output**
           A node’s gradient with respect to itself is 1, so set
           ``self.grad = 1.0``.
        3. **Reverse sweep**
           Iterate ``for v in reversed(topo): v._backward()``.  Each
           stored ``_backward`` closure adds its *local* contribution to
           ``child.grad``.
        """
        # so we start from my node and go backwards calling backwards on all nodes from
        # root me, all the way to the leaves

        self.grad = 1
        topographical_ancestors = []
        seen = set()
        def recurse(node):
            topographical_ancestors.append(node)
            seen.add(node)
            if(node.prev):
                for node in node.prev:
                    if(node not in seen):
                        recurse(node)
        recurse(self)
        print(topographical_ancestors)
        for node in topographical_ancestors:
            node._backward()
            

    # -------------------------------------------------------------- T7 helpers
    def __neg__(self):
        return self * -1

    def __sub__(self, other: Union["Value", Number]):
        return self + (-other)

    def __rsub__(self, other: Union["Value", Number]):
        return other + (-self)

    # -------------------------------------------------------------- T8
    def __truediv__(self, other: Union["Value", Number]):
        return self * other ** -1

    def __rtruediv__(self, other: Union["Value", Number]):
        return other * self ** -1

    # -------------------------------------------------------------- T9
    def __repr__(self):
        return str(self.data)

In [27]:
from math import isclose

EPS = 1e-6

def close(a, b, eps=EPS):
    assert isclose(a, b, rel_tol=0.0, abs_tol=eps), f"{a} != {b}"


def test_add_mul_pow():
    # z = x * y + x ** 2
    x = Value(2.0)
    y = Value(3.0)
    z = x * y + x ** 2
    z.backward()

    # forward value
    close(z.data, 2.0 * 3.0 + 2.0 ** 2)  # 10.0

    # analytic grads
    close(x.grad, y.data + 2 * x.data)   # 3 + 4 = 7
    close(y.grad, x.data)               # 2
    close(z.grad, 1.0)


def test_relu():
    a = Value(-1.0).relu()
    b = Value(2.5).relu()
    out = a + b
    out.backward()

    close(a.data, 0.0)
    close(b.data, 2.5)

def test_division():
    a = Value(3.0)
    b = Value(2.0)
    c = a / b  # shorthand for a * b ** -1
    c.backward()

    close(c.data, 1.5)
    close(a.grad, 1 / b.data)              # 0.5
    close(b.grad, -a.data / b.data ** 2)   # -0.75


def test_chain_complex():
    # u = (x - y / z).relu() * z
    x = Value(4.0)
    y = Value(1.0)
    z = Value(2.0)
    u = (x - y / z).relu() * z
    u.backward()

    close(x.grad, 2.0)
    close(y.grad, -1.0)
    close(z.grad, 3.0)


if __name__ == "__main__":
    # simple CLI run without pytest
    test_add_mul_pow()
    test_relu()
    # test_division()
    test_chain_complex()
    print("All tests passed ✨")


[10.0, 6.0, 3.0, 2.0, 4.0]
[2.5, 0, -1.0, 2.5, 2.5]


AttributeError: 'int' object has no attribute 'data'

In [None]:
## Reimplement Karpathy's setup from micrograd to create NNs

In [None]:
## test micrograd on simple excercise from past paper like the adder problem

In [None]:
## set up yann le cunn's loss function

In [None]:
## Test running just a fully connected network on mnist

In [None]:
## run the CNN on it

In [None]:
## try augmenting the data and seeing what happens

In [None]:
## add some regularization