# Building Micrograd from Scratch

**Video**: [Karpathy's Neural Networks: Zero to Hero](https://www.youtube.com/watch?v=VMj-3S1tku0)

---

## How This Notebook Works

1. **Part 0**: See the magic first (finished product doing something cool)
2. **Parts 1-6**: Build it yourself, piece by piece, with heavy annotations
3. **Parts 7-8**: Use it to train a real neural network

Let's start by seeing what we're building toward!

In [None]:
%load_ext claude_code_jupyter

---
# Part 0: See the Magic First

Before we build anything, let's see what micrograd can do.

Here's the finished Value class (don't worry about understanding it yet):

In [None]:
# THE FINISHED PRODUCT - just run this, we'll understand it later
import math

class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op
    
    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
    
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
    
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
    
    def __radd__(self, other): return self + other
    def __rmul__(self, other): return self * other
    def __neg__(self): return self * -1
    def __sub__(self, other): return self + (-other)
    
    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

print("Value class loaded! Let's see what it can do...")

### The Magic in Action

Remember how, in the calculus notebook, you manually computed gradients for `L = (a × b) + c`?

**Watch micrograd do it automatically:**

In [None]:
# Create some values
a = Value(2.0)
b = Value(3.0)
c = Value(4.0)

# Do math (forward pass)
L = (a * b) + c

print(f"L = (a × b) + c = ({a.data} × {b.data}) + {c.data} = {L.data}")
print()
print("Now watch this...")
print()

In [None]:
# ONE LINE computes ALL gradients!
L.backward()

print("After L.backward():")
print(f"  a.grad = {a.grad}  ← 'nudge a up by 1, L goes up by {a.grad}'")
print(f"  b.grad = {b.grad}  ← 'nudge b up by 1, L goes up by {b.grad}'")
print(f"  c.grad = {c.grad}  ← 'nudge c up by 1, L goes up by {c.grad}'")
print()
print("Remember from calculus notebook:")
print("  - Multiplication rule: gradient = the OTHER input")
print(f"  - So a.grad = b = {b.data} ✓")
print(f"  - And b.grad = a = {a.data} ✓")
print("  - Addition rule: gradient flows through equally")
print(f"  - So c.grad = 1 ✓")

### 🤯 That's It!

With plain Python numbers, `2 * 3 + 4 = 10` and that's all you get.

With Value objects, you get the answer **AND** you can ask "how does each input affect the output?"

**This is the entire foundation of deep learning.** A neural network is just a big expression like this, and `.backward()` tells us how to adjust every weight.

---

Now let's build this ourselves, step by step!

---
# Part 1: The Dumb Way - Numerical Derivatives

Before we automate anything, let's remember how to compute derivatives "by hand" (with code).

**The method:** Nudge the input, see how much the output changes.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# A simple function
def f(x):
    return 3 * x**2 - 4 * x + 5

# Plot it
xs = np.arange(-5, 5, 0.25)
ys = f(xs)
plt.plot(xs, ys)
plt.title('f(x) = 3x² - 4x + 5')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.grid(True)
plt.show()

In [None]:
# Compute the derivative at x = 3 using the "nudge" method
x = 3.0
h = 0.0001  # tiny nudge

# How much does f change when we nudge x?
derivative = (f(x + h) - f(x)) / h

print(f"At x = {x}:")
print(f"  f(x) = {f(x)}")
print(f"  f(x + h) = {f(x + h)}")
print(f"  Derivative ≈ {derivative:.4f}")
print()
print(f"Check: derivative of 3x² - 4x + 5 is 6x - 4")
print(f"       At x=3: 6(3) - 4 = {6*3 - 4} ✓")

### Why h = 0.0001?

- **Too big** (h = 1): You get the average slope over a big range, not the instant slope
- **Too small** (h = 0.0000000001): Computer floating point errors mess things up
- **Just right** (h ≈ 0.0001): Small enough to be accurate, big enough to avoid rounding errors
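
A quick way to see this trade-off is to measure the error for a few step sizes. The sketch below reuses the same `f` and its known derivative `6x - 4`; the helper name `forward_difference` is just an illustrative choice, not something from the video.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

def forward_difference(func, x, h):
    """Estimate the slope of func at x by nudging x up by h."""
    return (func(x + h) - func(x)) / h

exact = 6 * 3.0 - 4  # analytic derivative of f at x = 3

# Compare a too-big, a reasonable, and a too-small step size
errors = {h: abs(forward_difference(f, 3.0, h) - exact) for h in (1.0, 1e-4, 1e-15)}
for h, err in errors.items():
    print(f"h = {h:g}: |error| = {err:.6g}")
```

With ordinary 64-bit floats, the middle step size should give by far the smallest error; the exact figures for the tiny step depend on floating point rounding.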

---
# Part 2: Multiple Inputs - The Problem

What if we have multiple variables? Let's compute `d = a * b + c`

In [None]:
# Our expression: d = a * b + c
a = 2.0
b = -3.0
c = 10.0

d = a * b + c
print(f"d = a * b + c = {a} * {b} + {c} = {d}")

In [None]:
# To find dd/da, dd/db, dd/dc, we have to nudge EACH variable separately
h = 0.0001

def compute_d(a, b, c):
    return a * b + c

d_original = compute_d(a, b, c)

# Nudge a
dd_da = (compute_d(a + h, b, c) - d_original) / h
print(f"dd/da = {dd_da:.4f}  (expected: b = {b})")

# Nudge b
dd_db = (compute_d(a, b + h, c) - d_original) / h
print(f"dd/db = {dd_db:.4f}  (expected: a = {a})")

# Nudge c
dd_dc = (compute_d(a, b, c + h) - d_original) / h
print(f"dd/dc = {dd_dc:.4f}  (expected: 1)")

### The Problem: This is Slow!

For 3 variables, we needed 3 separate nudge computations.

For a neural network with **1 million** weights, we'd need **1 million** nudge computations, each one a full forward pass through the network.

**Micrograd's solution:** Compute ALL gradients in ONE backward pass using the chain rule.
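
To make the cost of the nudge approach concrete, here is a minimal sketch that counts how often the function gets evaluated. The names `numerical_gradient` and `counted_d` are hypothetical helpers introduced only for this illustration.

```python
def numerical_gradient(func, inputs, h=1e-4):
    """Estimate every partial derivative by nudging one input at a time."""
    base = func(inputs)
    grads = []
    for i in range(len(inputs)):
        nudged = list(inputs)
        nudged[i] += h  # nudge only input i
        grads.append((func(nudged) - base) / h)
    return grads

evaluations = []  # records every call so we can count them

def counted_d(values):
    """d = a * b + c, logging each evaluation."""
    evaluations.append(values)
    a, b, c = values
    return a * b + c

grads = numerical_gradient(counted_d, [2.0, -3.0, 10.0])
print(f"gradients ≈ {[round(g, 4) for g in grads]}")
print(f"function evaluations: {len(evaluations)}")
```

For n inputs this approach needs n + 1 evaluations (one baseline plus one per input), which is why it does not scale to large networks.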

---
# Part 3: Building the Value Class

Now we build the magic box that remembers its history.

### What Value Needs to Track

```python
self.data      # The actual number (like 2.0)
self.grad      # The gradient (dL/d_this_value) - starts at 0
self._prev     # Who made me? (the parent Values)
self._backward # How to compute parent gradients (the calculus rule)
self._op       # What operation made me? (for visualization)
```

In [None]:
# Version 1: Just the container (no operations yet)

class Value:
    
    def __init__(self, data):
        self.data = data      # The number
        self.grad = 0.0       # Gradient (computed later)
        self._prev = set()    # Parents (empty for now)
        self._backward = lambda: None  # Backward function (does nothing yet)
    
    def __repr__(self):
        return f"Value(data={self.data})"

# Test it
a = Value(2.0)
print(a)
print(f"a.data = {a.data}")
print(f"a.grad = {a.grad}")
print(f"a._prev = {a._prev}")

### Adding Addition

When we do `c = a + b`, we want:
1. `c.data` = sum of the numbers
2. `c._prev` = `{a, b}` (c remembers its parents)
3. `c._backward` = function that applies the **addition rule**

**Remember from calculus:** For `z = x + y`, both inputs get the gradient equally.
- `dx = dz × 1`
- `dy = dz × 1`

In [None]:
# Version 2: Add addition

class Value:
    
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)  # Now we can pass in parents!
        self._backward = lambda: None
    
    def __repr__(self):
        return f"Value(data={self.data})"
    
    def __add__(self, other):
        # Create the output Value
        out = Value(
            self.data + other.data,  # The sum
            (self, other)            # Remember parents!
        )
        
        # Define how gradients flow backward through addition
        # Addition rule: gradient flows through equally (× 1)
        def _backward():
            self.grad += out.grad   # my gradient += output's gradient × 1
            other.grad += out.grad  # other's gradient += output's gradient × 1
        
        out._backward = _backward
        return out

# Test it
a = Value(2.0)
b = Value(3.0)
c = a + b

print(f"a = {a}")
print(f"b = {b}")
print(f"c = a + b = {c}")
print(f"c._prev = {c._prev}  ← c remembers its parents!")

### Let's Manually Test the Backward Pass

In [None]:
a = Value(2.0)
b = Value(3.0)
c = a + b  # c = 5

# Pretend c is our final output, so dc/dc = 1
c.grad = 1.0

# Now call backward to flow gradients to parents
c._backward()

print(f"c.grad = {c.grad} (we set this to 1)")
print(f"a.grad = {a.grad} (should be 1 - addition passes gradient through)")
print(f"b.grad = {b.grad} (should be 1 - addition passes gradient through)")

### Adding Multiplication

**Remember from calculus:** For `z = x × y`, each input's gradient is the OTHER input.
- `dx = dz × y`
- `dy = dz × x`

In [None]:
# Version 3: Add multiplication

class Value:
    
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None
    
    def __repr__(self):
        return f"Value(data={self.data})"
    
    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # Addition rule: gradient flows through equally
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
    
    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # Multiplication rule: gradient = other input × output gradient
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

# Test: d = a * b + c
a = Value(2.0)
b = Value(3.0)
c = Value(4.0)

d = a * b  # intermediate: 6
L = d + c  # final: 10

print(f"a = {a.data}, b = {b.data}, c = {c.data}")
print(f"d = a * b = {d.data}")
print(f"L = d + c = {L.data}")

In [None]:
# Manually do backward pass
# Step 1: Start at L, set its gradient to 1
L.grad = 1.0

# Step 2: L = d + c (addition) → both get gradient 1
L._backward()
print(f"After L._backward():")
print(f"  d.grad = {d.grad} (addition: passes through)")
print(f"  c.grad = {c.grad} (addition: passes through)")

# Step 3: d = a * b (multiplication) → gradient = other input
d._backward()
print(f"After d._backward():")
print(f"  a.grad = {a.grad} (multiplication: = b = {b.data})")
print(f"  b.grad = {b.grad} (multiplication: = a = {a.data})")

### 📝 Quick Check

These are the same gradients we computed by hand in the calculus notebook!

- `a.grad = 3` because a is multiplied by b=3
- `b.grad = 2` because b is multiplied by a=2
- `c.grad = 1` because c is just added
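
Spelled out with the chain rule (a sketch using the same values a = 2, b = 3, c = 4 and the intermediate d = a × b from the cells above), those three numbers come from:

$$
\begin{aligned}
\frac{\partial L}{\partial d} &= 1, \qquad \frac{\partial L}{\partial c} = 1 && \text{(addition rule, since } L = d + c\text{)} \\
\frac{\partial L}{\partial a} &= \frac{\partial L}{\partial d} \cdot \frac{\partial d}{\partial a} = 1 \cdot b = 3 \\
\frac{\partial L}{\partial b} &= \frac{\partial L}{\partial d} \cdot \frac{\partial d}{\partial b} = 1 \cdot a = 2
\end{aligned}
$$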

---
# Part 4: Automating the Backward Pass

We don't want to call `_backward()` on each node manually. We want ONE call: `L.backward()`

### The Challenge: Order Matters!

We must process nodes in the right order:
- `L` first (set its grad to 1)
- Then `d` and `c` (they depend on L's grad)
- Then `a` and `b` (they depend on d's grad)

This is called a **topological sort**: going backward, a node is processed only after every node that uses its value has been processed.
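
To see the ordering on its own, here is a minimal sketch of the same depth-first idea on a plain dictionary instead of `Value` objects. The graph layout and the name `topological_order` are assumptions made for this illustration.

```python
def topological_order(node, inputs_of):
    """Return nodes so that every node appears after all of its inputs."""
    order, visited = [], set()

    def visit(n):
        if n not in visited:
            visited.add(n)
            for parent in inputs_of.get(n, ()):  # visit inputs first
                visit(parent)
            order.append(n)  # then record this node

    visit(node)
    return order

# L = d + c, d = a * b  (each key maps to the nodes it was built from)
graph = {"L": ("d", "c"), "d": ("a", "b")}

order = topological_order("L", graph)
backward_order = list(reversed(order))
print("forward order: ", order)
print("backward order:", backward_order)
```

Reversing the list puts `L` first and guarantees that `d` comes before `a` and `b`, which is the order the backward pass needs.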

In [None]:
# Version 4: Add automatic backward()

class Value:
    
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None
    
    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"
    
    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
    
    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
    
    def backward(self):
        # Step 1: Build a list of all nodes in topological order
        topo = []
        visited = set()
        
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:  # Visit all parents first
                    build_topo(child)
                topo.append(v)  # Then add this node
        
        build_topo(self)
        
        # Step 2: Start with gradient of 1 at the output
        self.grad = 1.0
        
        # Step 3: Go backward through the list, calling each _backward()
        for node in reversed(topo):
            node._backward()

print("Value class with backward() ready!")

In [None]:
# Test it! Same expression: L = a * b + c
a = Value(2.0)
b = Value(3.0)
c = Value(4.0)

L = a * b + c

print(f"Before backward():")
print(f"  L = {L.data}")
print(f"  a.grad = {a.grad}, b.grad = {b.grad}, c.grad = {c.grad}")
print()

# ONE CALL computes all gradients!
L.backward()

print(f"After L.backward():")
print(f"  a.grad = {a.grad} (should be b = 3)")
print(f"  b.grad = {b.grad} (should be a = 2)")
print(f"  c.grad = {c.grad} (should be 1)")

### 🎉 It Works!

One call to `L.backward()` computed all three gradients automatically.

This is the core of backpropagation!
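
One detail worth noticing in every `_backward` above: gradients are added with `+=` rather than assigned with `=`. The sketch below, using plain floats and a hypothetical helper `grad_of_a_plus_a`, walks through `out = a + a` by hand to show why that matters when a value is used more than once.

```python
def grad_of_a_plus_a(accumulate):
    """Backpropagate through out = a + a by hand and return a's gradient."""
    a_grad = 0.0
    out_grad = 1.0  # d(out)/d(out)
    # The addition rule fires once per operand, and both operands are `a`.
    for _ in range(2):
        if accumulate:
            a_grad += out_grad  # what Value does
        else:
            a_grad = out_grad   # overwriting loses one contribution
    return a_grad

print(f"with +=: {grad_of_a_plus_a(True)}")   # matches d(2a)/da = 2
print(f"with  =: {grad_of_a_plus_a(False)}")  # too small
```

Accumulating is what makes the result agree with the true derivative of `2a`.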

---
# Part 5: Making it Robust

Our Value class works, but it can't handle:
- `2 * a` (number on the left)
- `a - b` (subtraction)
- `a ** 2` (powers)

Let's add these:

In [None]:
# Version 5: More operations

class Value:
    
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None
        self._op = _op  # For debugging/visualization
    
    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
    
    def __add__(self, other):
        # Handle: Value + number
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
    
    def __radd__(self, other):  # number + Value
        return self + other
    
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
    
    def __rmul__(self, other):  # number * Value
        return self * other
    
    def __pow__(self, other):
        # Only supporting int/float powers for simplicity
        assert isinstance(other, (int, float))
        out = Value(self.data ** other, (self,), f'**{other}')
        def _backward():
            # Power rule: d/dx(x^n) = n * x^(n-1)
            self.grad += other * (self.data ** (other - 1)) * out.grad
        out._backward = _backward
        return out
    
    def __neg__(self):  # -Value
        return self * -1
    
    def __sub__(self, other):  # Value - other
        return self + (-other)
    
    def __rsub__(self, other):  # other - Value
        return other + (-self)
    
    def __truediv__(self, other):  # Value / other
        return self * other**-1
    
    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

print("Enhanced Value class ready!")
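
The `__radd__` and `__rmul__` methods rely on Python's reflected-operator fallback: when `2 * a` fails on the `int` side, Python tries `a.__rmul__(2)`. Here is a small standalone sketch of that mechanism; the `Meters` class is hypothetical and exists only for this illustration.

```python
class Meters:
    """Hypothetical wrapper used only to illustrate reflected operators."""

    def __init__(self, amount):
        self.amount = amount

    def __mul__(self, other):   # handles Meters * number
        return Meters(self.amount * other)

    def __rmul__(self, other):  # handles number * Meters
        return self * other

left = Meters(3.0) * 2   # calls Meters.__mul__
right = 2 * Meters(3.0)  # int.__mul__ gives up, so Python calls Meters.__rmul__
print(left.amount, right.amount)
```

Without `__rmul__`, the second expression would raise a `TypeError`.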

In [None]:
# Test all the new operations
x = Value(3.0)

# Test: f(x) = 2*x^2 - 5*x + 3
y = 2 * x**2 - 5*x + 3

print(f"x = {x.data}")
print(f"y = 2*x² - 5*x + 3 = {y.data}")
print()

y.backward()
print(f"dy/dx = {x.grad}")
print(f"Expected: 4*x - 5 = 4*{x.data} - 5 = {4*x.data - 5}")

---
# Part 6: Adding tanh (Activation Function)

Neural networks need **activation functions** to be interesting. A common choice, and the one we'll use, is `tanh`.

```
tanh(x) = (e^(2x) - 1) / (e^(2x) + 1)
```

Its derivative is: `d/dx tanh(x) = 1 - tanh(x)²`
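
Both formulas can be sanity-checked numerically before wiring them into `Value`. This is a minimal sketch that uses the standard library's `math.tanh` as the reference; the helper names `tanh_from_exp` and `tanh_grad` are illustrative assumptions.

```python
import math

def tanh_from_exp(x):
    """tanh written with exponentials, as in the formula above."""
    e = math.exp(2 * x)
    return (e - 1) / (e + 1)

def tanh_grad(x):
    """Analytic derivative: 1 - tanh(x)^2."""
    return 1 - math.tanh(x) ** 2

x, h = 0.5, 1e-4
# Central difference: nudge in both directions for a more accurate estimate
numeric = (math.tanh(x + h) - math.tanh(x - h)) / (2 * h)

print(f"tanh via exp:   {tanh_from_exp(x):.6f}")
print(f"math.tanh:      {math.tanh(x):.6f}")
print(f"analytic slope: {tanh_grad(x):.6f}")
print(f"numeric slope:  {numeric:.6f}")
```

As a side note, `math.exp(2 * x)` raises `OverflowError` for very large `x`, whereas `math.tanh` does not, so the exponential form is best kept to moderate inputs.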

In [None]:
import math

# Version 6: Add tanh

class Value:
    
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None
        self._op = _op
    
    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
    
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
    
    def __radd__(self, other): return self + other
    
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
    
    def __rmul__(self, other): return self * other
    
    def __pow__(self, other):
        assert isinstance(other, (int, float))
        out = Value(self.data ** other, (self,), f'**{other}')
        def _backward():
            self.grad += other * (self.data ** (other - 1)) * out.grad
        out._backward = _backward
        return out
    
    def __neg__(self): return self * -1
    def __sub__(self, other): return self + (-other)
    def __rsub__(self, other): return other + (-self)
    def __truediv__(self, other): return self * other**-1
    
    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
        out = Value(t, (self,), 'tanh')
        def _backward():
            # Derivative of tanh: 1 - tanh²
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        return out
    
    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

print("Value class with tanh ready!")

In [None]:
# Test a neuron-like computation!
# neuron output = tanh(w1*x1 + w2*x2 + b)

# inputs
x1 = Value(2.0)
x2 = Value(0.0)

# weights
w1 = Value(-3.0)
w2 = Value(1.0)

# bias
b = Value(6.88137)

# Forward pass: compute the neuron output
n = x1*w1 + x2*w2 + b  # weighted sum
o = n.tanh()            # activation

print(f"Neuron computation:")
print(f"  x1={x1.data}, x2={x2.data}")
print(f"  w1={w1.data}, w2={w2.data}")
print(f"  b={b.data}")
print(f"  n = x1*w1 + x2*w2 + b = {n.data:.4f}")
print(f"  o = tanh(n) = {o.data:.4f}")

In [None]:
# Backward pass
o.backward()

print(f"Gradients (how each input affects the output):")
print(f"  do/dx1 = {x1.grad:.4f}")
print(f"  do/dx2 = {x2.grad:.4f}")
print(f"  do/dw1 = {w1.grad:.4f}  ← This tells us how to adjust w1!")
print(f"  do/dw2 = {w2.grad:.4f}  ← This tells us how to adjust w2!")
print(f"  do/db  = {b.grad:.4f}")

---
# Part 7: Building a Neural Network

Now we have all the pieces. Let's build actual neural network components!

### What We'll Build

```
Neuron: takes inputs, multiplies by weights, adds bias, applies tanh
Layer:  a bunch of neurons
MLP:    Multiple Layers stacked (Multi-Layer Perceptron)
```

In [None]:
import random

class Neuron:
    """A single neuron: weighted sum of inputs + bias, through tanh"""
    
    def __init__(self, nin):  # nin = number of inputs
        # Random weights between -1 and 1
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))
    
    def __call__(self, x):
        # w · x + b (dot product + bias)
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()
    
    def parameters(self):
        return self.w + [self.b]

# Test a neuron
n = Neuron(3)  # Neuron with 3 inputs
x = [1.0, 2.0, 3.0]
out = n(x)

print(f"Neuron with 3 inputs")
print(f"  Weights: {[f'{w.data:.3f}' for w in n.w]}")
print(f"  Bias: {n.b.data:.3f}")
print(f"  Input: {x}")
print(f"  Output: {out.data:.4f}")
print(f"  Number of parameters: {len(n.parameters())}")

In [None]:
class Layer:
    """A layer of neurons"""
    
    def __init__(self, nin, nout):  # nin inputs, nout neurons
        self.neurons = [Neuron(nin) for _ in range(nout)]
    
    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs
    
    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]

# Test a layer
layer = Layer(3, 4)  # 3 inputs, 4 neurons
x = [1.0, 2.0, 3.0]
out = layer(x)

print(f"Layer: 3 inputs → 4 neurons")
print(f"  Outputs: {[f'{o.data:.3f}' for o in out]}")
print(f"  Number of parameters: {len(layer.parameters())}")

In [None]:
class MLP:
    """Multi-Layer Perceptron: stack of layers"""
    
    def __init__(self, nin, nouts):  # nouts is list of layer sizes
        sz = [nin] + nouts  # e.g., [3, 4, 4, 1] for 3→4→4→1
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]
    
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
    
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

# Create a network: 3 inputs → 4 neurons → 4 neurons → 1 output
net = MLP(3, [4, 4, 1])

x = [2.0, 3.0, -1.0]
out = net(x)

print(f"MLP: 3 → 4 → 4 → 1")
print(f"  Input: {x}")
print(f"  Output: {out.data:.4f}")
print(f"  Total parameters: {len(net.parameters())}")

### Network Architecture

```
Input (3)     Hidden (4)    Hidden (4)    Output (1)
   ●──────────●──────────────●──────────────●
   ●──────────●──────────────●
   ●──────────●──────────────●
              ●──────────────●

41 parameters = (3×4 + 4) + (4×4 + 4) + (4×1 + 1)
              = 16 + 20 + 5 = 41
```
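
The same count can be reproduced with a few lines of code. This is a sketch with a hypothetical helper, `count_parameters`, that takes the same `nin` and `nouts` arguments as `MLP`.

```python
def count_parameters(nin, nouts):
    """Count weights and biases in a fully connected net with these layer sizes."""
    sizes = [nin] + list(nouts)
    total = 0
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        # one weight per connection, plus one bias per neuron
        total += fan_in * fan_out + fan_out
    return total

print(count_parameters(3, [4, 4, 1]))  # 41
```

The result should agree with `len(net.parameters())` for a network built with the same sizes.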

---
# Part 8: Training!

Now the grand finale: let's train this network to learn something!

### The Training Loop

```
1. Forward pass: compute predictions
2. Compute loss: how wrong are we?
3. Backward pass: compute gradients
4. Update: nudge parameters to reduce loss
5. Repeat!
```
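
Before running this on the full network, the loop can be sketched on the smallest possible model: a single weight `w` fitting `w * x = y`, with the gradient written out by hand instead of coming from `backward()`. The data point and learning rate here are made up for illustration.

```python
x, y = 3.0, 6.0  # one training example; the ideal weight is 2.0
w = 0.0          # starting guess
learning_rate = 0.01
losses = []

for step in range(100):
    pred = w * x                  # 1. forward pass
    loss = (pred - y) ** 2        # 2. loss: squared error
    grad = 2 * (pred - y) * x     # 3. backward pass: d(loss)/dw by the chain rule
    w -= learning_rate * grad     # 4. update: step against the gradient
    losses.append(loss)           # 5. repeat

print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.3e}, w = {w:.6f}")
```

The network version follows the same five steps; the only difference is that `loss.backward()` fills in the gradient for every parameter at once.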

In [None]:
# Training data: 4 examples
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]  # What we want the network to output

print("Training data:")
for x, y in zip(xs, ys):
    print(f"  {x} → {y}")

In [None]:
# Create fresh network
net = MLP(3, [4, 4, 1])

# Training loop!
learning_rate = 0.1

for step in range(100):
    
    # === FORWARD PASS ===
    # Make predictions for all inputs
    ypred = [net(x) for x in xs]
    
    # === COMPUTE LOSS ===
    # Sum of squared errors: sum of (prediction - target)²
    loss = sum((yp - yt)**2 for yp, yt in zip(ypred, ys))
    
    # === BACKWARD PASS ===
    # First, zero all gradients (important!)
    for p in net.parameters():
        p.grad = 0.0
    # Then compute new gradients
    loss.backward()
    
    # === UPDATE ===
    # Nudge each parameter in the direction that reduces loss
    for p in net.parameters():
        p.data -= learning_rate * p.grad
    
    # Print progress
    if step % 10 == 0:
        print(f"Step {step:3d}: loss = {loss.data:.4f}")

print(f"\nFinal loss: {loss.data:.6f}")

In [None]:
# Check final predictions!
print("Final predictions vs targets:")
print()
for x, y in zip(xs, ys):
    pred = net(x)
    print(f"  Input: {x}")
    print(f"  Predicted: {pred.data:+.4f}  Target: {y:+.1f}")
    print()

---
# 🎉 Congratulations!

You just built:

1. **An autograd engine** (Value class) - ~50 lines
2. **A neural network library** (Neuron, Layer, MLP) - ~30 lines  
3. **A training loop** - ~15 lines

This is the **entire foundation** of modern deep learning!

PyTorch, TensorFlow, and JAX are built on this same idea; they add:
- More operations (convolution, attention, etc.)
- GPU acceleration
- More optimizers (Adam, etc.)
- Better numerical stability

But the core idea is **exactly what you just built**.

---

## Key Takeaways

| Concept | What It Means |
|---------|---------------|
| `.data` | The actual number |
| `.grad` | How much the loss changes if we nudge this value |
| `._prev` | Who made me (parents in the computation graph) |
| `._backward` | The calculus rule for this operation |
| `.backward()` | Walk the graph backwards, fill in all gradients |
| Training | Forward → Loss → Backward → Update → Repeat |

---

**Next:** makemore (building a character-level language model)