# Tiny Neural Network Using the Autograd Engine

## Overview

In this notebook, I build a tiny neural network entirely on top of my own reverse-mode autodiff engine.

There is:
- No NumPy
- No PyTorch
- No vectorization
- Only scalar Nodes and a computational graph

The goal is not performance — it is understanding gradient flow deeply.

First, import the autograd engine as a library:

In [223]:
# Import only the names this notebook uses instead of a wildcard
from autograd import Node, backward, zero_grad

## Phase 1 — Single Input, Single Neuron

We start with the simplest possible model:

y = w·x + b

Loss:
L = (y − y_true)²

This phase validates:

- Parameter nodes store gradients correctly
- Multiplication and addition propagate gradients
- Reverse-mode traversal works
- The zero_grad lifecycle is correct
- The manual derivative matches the engine output

If this fails, the engine is broken.

### Single-Input Single Neuron

Parameters

In [224]:
w = Node(0)
b = Node(0)

Data

In [225]:
x = 2
ytrue = 10

Training

In [226]:
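n = 0.01  # learning rate
for step in range(100):
    # Reset gradients left over from the previous step
    zero_grad(w)
    zero_grad(b)

    # Forward pass and squared-error loss
    y = w*x + b
    l = (y-ytrue)**2

    backward(l)

    # Hand-derived gradients, for comparison with the engine
    manual_dw = 2*(y.value - ytrue)*x
    manual_db = 2*(y.value - ytrue)

    print("engine:", w.grad, b.grad)
    print("manual:", manual_dw, manual_db)

    # Gradient-descent update on the raw values
    w.value -= n*w.grad
    b.value -= n*b.grad

    print(step, l.value)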


engine: -40 -20
manual: -40 -20
0 100
engine: -36.0 -18.0
manual: -36.0 -18.0
1 81.0
engine: -32.4 -16.2
manual: -32.4 -16.2
2 65.61
engine: -29.16 -14.58
manual: -29.16 -14.58
3 53.1441
engine: -26.244 -13.122
manual: -26.244 -13.122
4 43.046721
engine: -23.6196 -11.8098
manual: -23.6196 -11.8098
5 34.86784400999999
engine: -21.25764 -10.62882
manual: -21.25764 -10.62882
6 28.242953648099995
engine: -19.131876 -9.565938
manual: -19.131876 -9.565938
7 22.876792454960995
engine: -17.218688399999994 -8.609344199999997
manual: -17.218688399999994 -8.609344199999997
8 18.5302018885184
engine: -15.496819559999999 -7.748409779999999
manual: -15.496819559999999 -7.748409779999999
9 15.009463529699909
engine: -13.947137603999998 -6.973568801999999
manual: -13.947137603999998 -6.973568801999999
10 12.157665459056926
engine: -12.552423843599996 -6.276211921799998
manual: -12.552423843599996 -6.276211921799998
11 9.847709021836106
engine: -11.29718145924 -5.64859072962
manual: -11.29718145924 -5.

### What This Confirms

The gradients printed by the engine match the manually derived gradients:

dL/dw = 2(y − y_true)x  
dL/db = 2(y − y_true)

This confirms:

- Chain rule is implemented correctly
- Gradients accumulate via +=
- Backward traversal is correct
- Parameter updates modify value, not graph structure
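
A third, independent check on the two formulas above is a central finite difference. The sketch below uses plain Python floats rather than the engine, so `loss` and `numeric_grad` are illustrative helpers, not part of the autograd library. It evaluates the gradient at the initial point w = 0, b = 0 with the same data as Phase 1 (x = 2, y_true = 10).

```python
def loss(w, b, x=2.0, y_true=10.0):
    """Squared error of the single neuron y = w*x + b."""
    y = w * x + b
    return (y - y_true) ** 2


def numeric_grad(f, w, b, eps=1e-6):
    """Central finite differences for dL/dw and dL/db."""
    dw = (f(w + eps, b) - f(w - eps, b)) / (2 * eps)
    db = (f(w, b + eps) - f(w, b - eps)) / (2 * eps)
    return dw, db


w0, b0 = 0.0, 0.0
num_dw, num_db = numeric_grad(loss, w0, b0)

# Closed-form gradients from the formulas above, at the same point
y0 = w0 * 2.0 + b0
ana_dw = 2 * (y0 - 10.0) * 2.0
ana_db = 2 * (y0 - 10.0)

print("numeric :", num_dw, num_db)
print("analytic:", ana_dw, ana_db)
```

Both lines should agree with the first `engine:` line of the log (-40 and -20) up to floating-point rounding. Because the loss is quadratic in w and b, the central difference carries no truncation error here.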

## Phase 2 — Multi-Input Neuron

Now we extend the neuron to:

y = w1·x1 + w2·x2 + b

This introduces multiple parents in the computational graph.

Graph structure becomes branched:

(w1·x1) + (w2·x2) + b

Reverse-mode must:

- Propagate gradients through multiple branches
- Accumulate gradient contributions correctly
- Handle shared nodes safely
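
To make the accumulation requirement concrete, here is a minimal sketch of a scalar graph in which one node feeds two branches. `TinyNode` and `tiny_backward` are hypothetical stand-ins written for this example only; they are not the notebook's engine and support just `+` and `*`. The point is that each branch adds its contribution with `+=`, so the shared node ends up with the sum.

```python
class TinyNode:
    """Hypothetical stand-in for an autograd node; supports only + and *."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents
        self.backward_fn = lambda: None

    def __add__(self, other):
        out = TinyNode(self.value + other.value, (self, other))

        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad

        out.backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = TinyNode(self.value * other.value, (self, other))

        def backward_fn():
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad

        out.backward_fn = backward_fn
        return out


def tiny_backward(root):
    """Call backward_fn on every node in reverse topological order."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)

    visit(root)
    root.grad = 1.0
    for node in reversed(order):
        node.backward_fn()


# w_shared feeds two branches: y = w*xa + w*xb, so dy/dw = xa + xb
w_shared = TinyNode(3.0)
xa, xb = TinyNode(2.0), TinyNode(5.0)
y_demo = w_shared * xa + w_shared * xb
tiny_backward(y_demo)
print(w_shared.grad)  # 7.0
```

If the backward functions assigned with `=` instead of accumulating with `+=`, the second branch would overwrite the first and the result would be 2.0 or 5.0 instead of 7.0.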

### Multi-Input Single Neuron

Parameters

In [227]:
w1 = Node(0)
w2 = Node(0)
b = Node(0)

Data


In [228]:
x1 = 2
x2 = 3
ytrue = 16

Training

In [229]:
n = 0.01
for step in range(100):
    zero_grad(w1)
    zero_grad(w2)
    zero_grad(b)
    
    y = w1*x1 + w2*x2 + b
    l = (y-ytrue)**2

    backward(l)

    delta = 2*(y.value - ytrue)

    manual_dw1 = delta * x1
    manual_dw2 = delta * x2
    manual_db = delta
    
    print("engine:", w1.grad, w2.grad, b.grad)
    print("manual:", manual_dw1, manual_dw2, manual_db)

    w1.value -= n * w1.grad
    w2.value -= n * w2.grad
    b.value -= n * b.grad

    print("step:", step, "Loss:", l.value)

engine: -64 -96 -32
manual: -64 -96 -32
step: 0 Loss: 256
engine: -46.08 -69.12 -23.04
manual: -46.08 -69.12 -23.04
step: 1 Loss: 132.7104
engine: -33.1776 -49.7664 -16.5888
manual: -33.1776 -49.7664 -16.5888
step: 2 Loss: 68.79707135999999
engine: -23.887871999999994 -35.831807999999995 -11.943935999999997
manual: -23.887871999999994 -35.831807999999995 -11.943935999999997
step: 3 Loss: 35.664401793023984
engine: -17.199267839999997 -25.798901759999996 -8.599633919999999
manual: -17.199267839999997 -25.798901759999996 -8.599633919999999
step: 4 Loss: 18.488425889503635
engine: -12.383472844800004 -18.575209267200005 -6.191736422400002
manual: -12.383472844800004 -18.575209267200005 -6.191736422400002
step: 5 Loss: 9.584399981118693
engine: -8.916100448255996 -13.374150672383994 -4.458050224127998
manual: -8.916100448255996 -13.374150672383994 -4.458050224127998
step: 6 Loss: 4.968552950211923
engine: -6.419592322744322 -9.629388484116483 -3.209796161372161
manual: -6.419592322744322 -

Next, train the same neuron on a small dataset of two examples, starting from freshly initialized parameters.

In [230]:
w1 = Node(0)
w2 = Node(0)
b = Node(0)

In [231]:
dataset = [
    (2,3,16),
    (1,1,5)
]

In [232]:
n = 0.01
for step in range(100):
    for x1, x2, ytrue in dataset:
        
        zero_grad(w1)
        zero_grad(w2)
        zero_grad(b)

        y = w1*x1 + w2*x2 + b
        l = (y-ytrue)**2
        
        backward(l)
        
        delta = 2*(y.value - ytrue)

        manual_dw1 = delta * x1
        manual_dw2 = delta * x2
        manual_db = delta
    
        print("engine:", w1.grad, w2.grad, b.grad)
        print("manual:", manual_dw1, manual_dw2, manual_db)

        w1.value -= n * w1.grad
        w2.value -= n * w2.grad
        b.value -= n * b.grad

        print("step:", step, "Loss:", l.value)

engine: -64 -96 -32
manual: -64 -96 -32
step: 0 Loss: 256
engine: -6.16 -6.16 -6.16
manual: -6.16 -6.16 -6.16
step: 0 Loss: 9.4864
engine: -44.601600000000005 -66.9024 -22.300800000000002
manual: -44.601600000000005 -66.9024 -22.300800000000002
step: 1 Loss: 124.33142016000002
engine: -3.114303999999999 -3.114303999999999 -3.114303999999999
manual: -3.114303999999999 -3.114303999999999 -3.114303999999999
step: 1 Loss: 2.4247223511039984
engine: -31.365719039999995 -47.048578559999996 -15.682859519999997
manual: -31.365719039999995 -47.048578559999996 -15.682859519999997
step: 2 Loss: 61.48802068101364
engine: -1.0455026176000004 -1.0455026176000004 -1.0455026176000004
manual: -1.0455026176000004 -1.0455026176000004 -1.0455026176000004
step: 2 Loss: 0.2732689308521132
engine: -22.332397080576 -33.498595620864 -11.166198540288
manual: -22.332397080576 -33.498595620864 -11.166198540288
step: 3 Loss: 31.170997460282468
engine: 0.3571713642905596 0.3571713642905596 0.3571713642905596
manual

### Observations with Dataset Training

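Training on more than one example, with an update after every example, is stochastic gradient descent (SGD) with a batch size of one.

We observe:

- The printed loss alternates between large and small values, because each line reports a different sample and each update is computed from that sample alone
- Gradients must be zeroed before each backward call, otherwise contributions from the previous sample leak into the next update
- Parameters must persist across steps, so only their values are modified in place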

This validates the full training lifecycle:

1. zero_grad  
2. forward  
3. backward  
4. update  
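
The same four steps can be sketched with plain floats and the hand-derived gradients, without the engine. This is an illustrative reimplementation, not the notebook's code: `sgd_losses` is a hypothetical helper, and the gradient slots `g1`, `g2`, `gb` stand in for the `.grad` fields of the parameter nodes.

```python
def sgd_losses(samples, lr=0.01, epochs=3):
    """Per-sample SGD on y = w1*x1 + w2*x2 + b; returns the loss seen at each update."""
    w1 = w2 = b = 0.0
    losses = []
    for _ in range(epochs):
        for x1, x2, y_true in samples:
            # 1. zero_grad: start from empty gradient slots
            g1 = g2 = gb = 0.0
            # 2. forward
            y = w1 * x1 + w2 * x2 + b
            loss = (y - y_true) ** 2
            # 3. backward: closed-form derivative of the squared error
            delta = 2 * (y - y_true)
            g1 += delta * x1
            g2 += delta * x2
            gb += delta
            # 4. update
            w1 -= lr * g1
            w2 -= lr * g2
            b -= lr * gb
            losses.append(loss)
    return losses


losses = sgd_losses([(2, 3, 16), (1, 1, 5)])
for i, value in enumerate(losses):
    print("step:", i // 2, "Loss:", value)
```

Starting from zero weights, the first two losses should match the log above (256, then about 9.4864), and the sequence alternates between the two samples in the same way.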

## Wrapping the Above in Classes

In [233]:
import random

### Abstraction — Turning Neuron into a Class

To scale the system, we abstract the neuron into a reusable class.

This introduces:

- Weight vector abstraction
- Dimension safety checks
- Dot-product computation
- Random initialization to break symmetry

Now the neuron represents a hyperplane in ℝⁿ.

#### For a single input neuron

In [234]:
# Neuron for a single input
class neuron:
    def __init__(self):
        self.w = Node(random.uniform(-0.1, 0.1))
        self.b = Node(0)
    def pred(self, x):
        y = self.w * x + self.b
        return y

#### For multi-input neuron

In [235]:
# Updated __init__ and pred for multiple inputs
def __init__(self, dim=1):
    self.dim = dim
    self.b = Node(0)
    # One randomly initialized weight per input dimension
    self.w = [Node(random.uniform(-0.1, 0.1)) for _ in range(dim)]

neuron.__init__ = __init__

def pred(self, x):
    if len(x) != len(self.w):
        raise ValueError("Input dimension does not match neuron weight dimension")
    # Accumulate the dot product w·x one scalar term at a time
    wx = Node(0)
    for i in range(self.dim):
        wx += x[i]*self.w[i]
    return wx + self.b

neuron.pred = pred

Example

In [236]:
dataset = [
    (2,3,16),
    (1,1,5)
]

In [237]:
n = 0.01
model = neuron(2)
for step in range(100):
    for x1, x2, ytrue in dataset:
        x = [x1, x2]

        for w in model.w:
            zero_grad(w)
        zero_grad(model.b)

        y = model.pred(x)
        l = (y-ytrue)**2
        
        backward(l)

        delta = 2*(y.value - ytrue)

        manual_dw1 = delta * x1
        manual_dw2 = delta * x2
        manual_db = delta
    
        print("engine:", model.w[0].grad, model.w[1].grad, model.b.grad)
        print("manual:", manual_dw1, manual_dw2, manual_db)

        for w in model.w:
            w.value -= n * w.grad
        model.b.value -= n* model.b.grad

        print("step:", step, "Loss:", l.value)

engine: -64.37048576725331 -96.55572865087997 -32.18524288362666
manual: -64.37048576725331 -96.55572865087997 -32.18524288362666
step: 0 Loss: 258.9724648695101
engine: -6.237448499825584 -6.237448499825584 -6.237448499825584
manual: -6.237448499825584 -6.237448499825584 -6.237448499825584
step: 0 Loss: 9.726440946994108
engine: -44.84976211246424 -67.27464316869636 -22.42488105623212
manual: -44.84976211246424 -67.27464316869636 -22.42488105623212
step: 1 Loss: 125.71882259653955
engine: -3.1722158630881934 -3.1722158630881934 -3.1722158630881934
manual: -3.1722158630881934 -3.1722158630881934 -3.1722158630881934
step: 1 Loss: 2.515738370507093
engine: -31.530496913833083 -47.29574537074963 -15.765248456916542
manual: -31.530496913833083 -47.29574537074963 -15.765248456916542
step: 2 Loss: 62.13576472707735
engine: -1.0900530964729178 -1.0900530964729178 -1.0900530964729178
manual: -1.0900530964729178 -1.0900530964729178 -1.0900530964729178
step: 2 Loss: 0.29705393828254906
engine: -

## Phase 3 — Adding Depth (2 → 3 → 1)

Now we build a multi-layer network:

Input (2)
  ↓
Linear (2 → 3)
  ↓
ReLU
  ↓
Linear (3 → 1)
  ↓
Loss

This is where real backpropagation complexity begins.

We now test:

- Gradient flow through depth
- Nonlinear gating via ReLU
- Reverse topological traversal correctness

### Layer Abstraction

A layer is simply a collection of neurons sharing the same input.

Each neuron corresponds to one row of a weight matrix.

Mathematically:

y = W·x + b

Even though everything is scalar-based,
this reproduces matrix multiplication behavior.
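
As a small illustration of the row-per-neuron view, the sketch below computes a layer output with plain floats. The helper names and the numbers in `W_demo`, `b_demo`, and `x_demo` are made up for the example and are not part of the notebook.

```python
def neuron_output(weights, bias, x):
    """One neuron: dot product of its weight row with x, plus its bias."""
    return sum(w_i * x_i for w_i, x_i in zip(weights, x)) + bias


def layer_output(W, b, x):
    """One layer: apply every neuron (one row of W) to the same input x."""
    return [neuron_output(row, b_j, x) for row, b_j in zip(W, b)]


# Illustrative 2 -> 3 layer: three rows of two weights each
W_demo = [[1.0, 2.0],
          [0.0, -1.0],
          [0.5, 0.5]]
b_demo = [0.0, 1.0, -1.0]
x_demo = [2.0, 3.0]

print(layer_output(W_demo, b_demo, x_demo))  # [8.0, -2.0, 1.5]
```

Each entry of the result is one row of W·x + b, which is exactly what a list of scalar neurons produces.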

In [238]:
# Layer class for the different layers
class layer:
    def __init__(self, dim_in, dim_out):
        self.dim_in = dim_in
        self.dim_out = dim_out
        self.neurons = [neuron(dim_in) for _ in range(dim_out)]
    
    def forward(self, x):
        if len(x) != self.dim_in:
            raise ValueError(f"Layer expected input dimension {self.dim_in}, got {len(x)}")
        # Every neuron sees the same input; collect one output per neuron
        return [unit.pred(x) for unit in self.neurons]

### ReLU — First Nonlinearity

ReLU(z) = max(0, z)

Backward rule:

If z > 0:
    dL/dz = dL/da
Else:
    dL/dz = 0

This introduces gradient gating.

If a neuron's pre-activation is not positive for a given sample,
no gradient flows back through it, so its weights receive no update from that sample.

In [239]:
def relu(node):
    out = Node(
        value = max(0, node.value),
        parents = (node,),
    )
    def relu_backward():
        # Gate: pass the upstream gradient only where the input was positive
        if node.value > 0:
            node.grad += out.grad
    out.backward_fn = relu_backward
    return out

### Model Building

### Full Forward Pass

For each example:

1. h = hidden.forward(x)
2. a = relu(h)  (element-wise)
3. y = output.forward(a)
4. L = (y − y_true)²

Backward then computes gradients for all 13 parameters
in a single reverse traversal.

This is where reverse-mode autodiff becomes powerful.
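
The figure of 13 parameters follows from counting weights and biases per layer. A minimal sketch, where `count_params` is a hypothetical helper and not part of the engine:

```python
def count_params(layer_dims):
    """Weights plus biases for a chain of fully connected layers."""
    total = 0
    for dim_in, dim_out in zip(layer_dims[:-1], layer_dims[1:]):
        total += dim_in * dim_out + dim_out  # weight matrix + bias vector
    return total


print(count_params([2, 3, 1]))  # 13
```

The hidden layer contributes 2·3 + 3 = 9 parameters and the output layer 3·1 + 1 = 4.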

In [240]:
hidden = layer(2,3)
output = layer(3,1)

In [241]:
dataset = [
    (2,3,16),
    (1,1,5)
]

In [None]:
n = 0.01  # learning rate

# Collect every trainable Node once, so zeroing and updating share one list.
# The loop variable is called unit to avoid shadowing the neuron class.
params = []
for unit in hidden.neurons + output.neurons:
    params.extend(unit.w)
    params.append(unit.b)

for step in range(100):
    for x1, x2, ytrue in dataset:
        x = [x1, x2]

        for p in params:
            zero_grad(p)

        # Forward: linear -> ReLU -> linear -> squared error
        h = hidden.forward(x)
        a = [relu(node) for node in h]
        y = output.forward(a)[0]
        l = (y - ytrue)**2

        backward(l)

        # Uncomment to inspect the hidden-layer gradients for every sample
        # for unit in hidden.neurons:
        #     print([w.grad for w in unit.w], unit.b.grad)

        for p in params:
            p.value -= n * p.grad

    if step % 10 == 0:
        print(f"Step {step:3d} | Loss: {l.value:.6f}")

Training output:
Step   0 | Loss: 21.232770
Step  10 | Loss: 5.194363
Step  20 | Loss: 1.437143
Step  30 | Loss: 0.345769
Step  40 | Loss: 0.073744
Step  50 | Loss: 0.014611
Step  60 | Loss: 0.002788
Step  70 | Loss: 0.000523
Step  80 | Loss: 0.000097
Step  90 | Loss: 0.000018


#### Dead Neurons and Gradient Flow

If a hidden neuron has negative pre-activation:

ReLU'(z) = 0

Then:

dL/dw_hidden = 0

This demonstrates how nonlinearities gate gradients.

Only active neurons update.

This is also why initialization matters.
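
The gating effect can be reproduced by hand for a single sample. The sketch below is a plain-float forward and backward pass for a 2 → 3 → 1 network, written independently of the engine. The function name and all weight values are assumptions chosen so that the second hidden unit has a negative pre-activation.

```python
def forward_backward(W1, b1, w2, b2, x, y_true):
    """Hand-written pass for a 2 -> 3 -> 1 ReLU network.

    Returns the loss, the hidden pre-activations z, and dL/dW1 (one row per hidden unit).
    """
    z = [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W1, b1)]
    a = [max(0.0, zj) for zj in z]
    y = sum(vj * aj for vj, aj in zip(w2, a)) + b2
    loss = (y - y_true) ** 2

    dy = 2 * (y - y_true)
    # ReLU gate: the upstream gradient passes only where z > 0
    dz = [dy * vj if zj > 0 else 0.0 for vj, zj in zip(w2, z)]
    dW1 = [[dzj * xi for xi in x] for dzj in dz]
    return loss, z, dW1


W1_demo = [[0.5, 0.5], [-1.0, -1.0], [0.2, 0.1]]  # second row gives z < 0
b1_demo = [0.0, 0.0, 0.0]
w2_demo = [1.0, 1.0, 1.0]

loss_demo, z_demo, dW1_demo = forward_backward(
    W1_demo, b1_demo, w2_demo, 0.0, x=[2.0, 3.0], y_true=16.0
)
for zj, row in zip(z_demo, dW1_demo):
    print("z =", zj, "dL/dw =", row)
```

The second hidden unit has z = -5.0 and a gradient row of zeros, so a gradient step leaves its weights unchanged for this sample, while the two active units receive nonzero gradients.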

## Key Engineering Insights

- Reverse-mode autodiff computes all gradients in one backward pass.
- Gradients accumulate via +=, not assignment.
- zero_grad is mandatory before backward.
- Random initialization breaks symmetry.
- ReLU introduces gradient gating.
- Depth multiplies derivative terms via chain rule.
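- Even multi-layer networks can collapse to simpler representations: a hidden unit whose ReLU is inactive contributes nothing to the output, and without a nonlinearity two stacked linear layers reduce to a single linear map.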
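
The symmetry point in the list above can be checked by hand. The sketch below computes hidden-layer gradients for a 2 → 3 → 1 ReLU network with plain floats; `hidden_grads` and all weight values are illustrative assumptions, and biases are fixed at zero to keep it short.

```python
def hidden_grads(W1, w2, x, y_true):
    """dL/dW1 for a 2 -> 3 -> 1 ReLU network with zero biases and squared error."""
    z = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    y = sum(v * max(0.0, zj) for v, zj in zip(w2, z))
    dy = 2 * (y - y_true)
    return [[dy * v * xi if zj > 0 else 0.0 for xi in x] for v, zj in zip(w2, z)]


# Identical initialization: every hidden unit starts with the same weights
same = hidden_grads([[0.1, 0.1]] * 3, [0.1] * 3, [2, 3], 16)

# Distinct initialization: the units start from different points
mixed = hidden_grads(
    [[0.1, 0.1], [0.2, -0.05], [0.05, 0.3]], [0.1, -0.2, 0.3], [2, 3], 16
)

print("identical init:", same)
print("distinct init :", mixed)
```

With identical initialization all three gradient rows are equal, so the hidden units would stay copies of one another after every update. With distinct starting values the rows differ and the units can specialize.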

This project transformed backpropagation
from an abstract formula into a concrete computational system.