# Building Micrograd from Scratch

Follow along with Karpathy's video! This notebook mirrors his lecture structure.

**Video**: [The spelled-out intro to neural networks and backpropagation](https://www.youtube.com/watch?v=VMj-3S1tku0)

In [1]:
%load_ext claude_code_jupyter
import numpy as np
import matplotlib.pyplot as plt


🚀 Claude Code Magic loaded!
Features:
  • Full agentic Claude Code execution
  • Cell-based code approval workflow
  • Real-time message streaming
  • Session state preservation
  • Conversation continuity across cells

Usage:
  %cc <instructions>       # Continue with additional instructions (one-line)
  %%cc <instructions>      # Continue with additional instructions (multi-line)
  %cc_new (or %ccn)        # Start fresh conversation
  %cc --help               # Show available options and usage information

Context management:
  %cc --import <file>       # Add a file to be included in initial conversation messages
  %cc --add-dir <dir>       # Add a directory to Claude's accessible directories
  %cc --mcp-config <file>   # Set path to a .mcp.json file containing MCP server configurations
  %cc --cells-to-load <num> # The number of cells to load into a new conversation (default: all for first %cc, none for %cc_new)

Output:
  %cc --model <name>       # Model to use for Cl

---
## Part 1: Numerical Derivatives (Video ~0:00-0:15)

Start simple: how do we compute a derivative numerically?

In [None]:
# Karpathy's example function
def f(x):
    return 3 * x**2 - 4 * x + 5

# Plot it
xs = np.arange(-5, 5, 0.25)
ys = f(xs)
plt.plot(xs, ys)
plt.title('f(x) = 3x² - 4x + 5')
plt.grid(True)
plt.show()

In [None]:
# Approximate the derivative at a point with a forward difference
# (the limit definition evaluated at a small, finite h)
h = 0.0001
x = 3.0

derivative = (f(x + h) - f(x)) / h
print(f"f({x}) = {f(x)}")
print(f"f'({x}) ≈ {derivative}")
print(f"Exact: 6x - 4 = 6({x}) - 4 = {6*x - 4}")

In [None]:
%cc Why do we use a small h like 0.0001? What happens if h is too big or too small?
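
One way to explore this question is to sweep `h` and compare each estimate against the exact derivative `6x - 4`. The sketch below redefines `f` so it runs on its own; the particular `h` values are illustrative choices, not taken from the lecture.

```python
# Forward-difference error for several step sizes h (illustrative values).
def f(x):
    return 3 * x**2 - 4 * x + 5

x = 3.0
exact = 6 * x - 4  # analytic derivative of f at x

errors = {}
for h in [1.0, 1e-2, 1e-4, 1e-8, 1e-16]:
    approx = (f(x + h) - f(x)) / h
    errors[h] = abs(approx - exact)
    print(f"h = {h:<8g} approx = {approx:<22} error = {errors[h]:.3e}")
```

A large `h` measures the slope of a secant line rather than the tangent, so the estimate is biased. A very small `h` runs into floating-point round-off: at `h = 1e-16` the sum `x + h` rounds back to `x`, and the estimate collapses to zero.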

---
## Part 2: Derivatives of Expressions with Multiple Variables (Video ~0:15-0:30)

What if our function has multiple inputs?

In [None]:
# Function of multiple variables
a = 2.0
b = -3.0
c = 10.0

d = a * b + c
print(f"d = a*b + c = {a}*{b} + {c} = {d}")

In [None]:
# Derivative with respect to each variable
h = 0.0001

# d/da
d1 = a * b + c
d2 = (a + h) * b + c
print(f"dd/da = {(d2 - d1) / h}")

# d/db
d1 = a * b + c
d2 = a * (b + h) + c
print(f"dd/db = {(d2 - d1) / h}")

# d/dc
d1 = a * b + c
d2 = a * b + (c + h)
print(f"dd/dc = {(d2 - d1) / h}")

In [None]:
# YOUR TURN: What are the expected values analytically?
# dd/da should be ?
# dd/db should be ?
# dd/dc should be ?

%cc For d = a*b + c, what are the partial derivatives dd/da, dd/db, and dd/dc? Explain why.
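
For reference, the product and sum rules give dd/da = b, dd/db = a, and dd/dc = 1. The check below is a minimal sketch using the same values of `a`, `b`, and `c` as above; the helper names `d_fn` and `numerical_partial` are made up for this example.

```python
def d_fn(a, b, c):
    return a * b + c

def numerical_partial(fn, args, i, h=1e-4):
    """Forward-difference estimate of the partial derivative w.r.t. args[i]."""
    bumped = list(args)
    bumped[i] += h
    return (fn(*bumped) - fn(*args)) / h

a, b, c = 2.0, -3.0, 10.0
analytic = {"a": b, "b": a, "c": 1.0}  # dd/da = b, dd/db = a, dd/dc = 1
numeric = {name: numerical_partial(d_fn, (a, b, c), i)
           for i, name in enumerate("abc")}

for name in "abc":
    print(f"dd/d{name}: numeric = {numeric[name]:.6f}, analytic = {analytic[name]:.6f}")
```

Because `d` is linear in each variable taken separately, the forward difference matches the analytic value up to round-off here.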

---
## Part 3: Building the Value Class (Video ~0:30-1:00)

Now we build the core abstraction!

In [None]:
class Value:
    
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op  # The operation that produced this node
    
    def __repr__(self):
        return f"Value(data={self.data})"
    
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        
        return out
    
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        
        return out

In [None]:
# Test it!
a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)

d = a * b + c
print(f"d = {d}")
print(f"d._prev = {d._prev}")
print(f"d._op = '{d._op}'")

In [None]:
%cc Walk me through what happens internally when I write d = a * b + c with the Value class
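
As a rough sketch of the answer: Python evaluates `a * b` first by calling `a.__mul__(b)`, then calls `__add__` on that intermediate result with `c`. The block below uses a stripped-down stand-in named `MiniValue` (a hypothetical name, with no gradient logic) so that it runs on its own without overwriting the `Value` class above.

```python
class MiniValue:
    """Stripped-down Value: forward pass and graph bookkeeping only."""
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op

    def __add__(self, other):
        return MiniValue(self.data + other.data, (self, other), '+')

    def __mul__(self, other):
        return MiniValue(self.data * other.data, (self, other), '*')

a, b, c = MiniValue(2.0), MiniValue(-3.0), MiniValue(10.0)

# d = a * b + c desugars to two method calls, multiplication first:
ab = a.__mul__(b)   # new node: data = -6.0, _op = '*', _prev = {a, b}
d = ab.__add__(c)   # new node: data = 4.0,  _op = '+', _prev = {ab, c}

print(ab.data, ab._op, ab._prev == {a, b})
print(d.data, d._op, d._prev == {ab, c})
```

Each operator call creates a new node that remembers its operands in `_prev` and its operation in `_op`, which is what lets the graph be walked backwards later.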

---
## Part 4: Visualizing the Graph (Video ~1:00-1:15)

In [None]:
from graphviz import Digraph

def trace(root):
    """Build a set of all nodes and edges in the graph"""
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root):
    """Visualize the computation graph"""
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'})
    
    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        dot.node(name=uid, label=f"{{ data: {n.data:.4f} | grad: {n.grad:.4f} }}", shape='record')
        if n._op:
            dot.node(name=uid + n._op, label=n._op)
            dot.edge(uid + n._op, uid)
    
    for n1, n2 in edges:
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)
    
    return dot

In [None]:
# Visualize our expression
a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a * b + c

draw_dot(d)

---
## Part 5: Implementing Backward (Video ~1:15-1:45)

Now we implement the full backward pass using topological sort!

In [None]:
class Value:
    
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op
    
    def __repr__(self):
        return f"Value(data={self.data})"
    
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        
        return out
    
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        
        return out
    
    def backward(self):
        # Topological sort
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        
        # Go backwards and apply chain rule
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

In [None]:
# Test backpropagation!
a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a * b + c

d.backward()

print(f"d = {d.data}")
print()
print("Gradients:")
print(f"  a.grad = {a.grad} (expected: b = -3)")
print(f"  b.grad = {b.grad} (expected: a = 2)")
print(f"  c.grad = {c.grad} (expected: 1)")

In [None]:
# Visualize with gradients
draw_dot(d)

In [None]:
%cc Why do we need topological sort for backward()? What would go wrong without it?
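
To make the ordering issue concrete, here is a small sketch where one node feeds the output along two paths. It uses a minimal stand-in class called `Node` (a hypothetical name) and hand-written processing orders instead of `backward()`, so the correct and incorrect orders can be compared side by side.

```python
class Node:
    """Minimal stand-in for Value with + and * only."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(children)

    def __add__(self, other):
        out = Node(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Node(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

def build():
    # b feeds d both directly and through c, so b.grad has two contributions.
    a = Node(1.0)
    b = a * Node(2.0)
    c = b + Node(1.0)
    d = b * c  # d = b * (b + 1) with b = 2a, so dd/da = 2 * (2b + 1) = 10
    return a, b, c, d

def run(order, root):
    root.grad = 1.0
    for node in order:
        node._backward()

# Valid reverse topological order: c is processed before b.
a, b, c, d = build()
run([d, c, b], d)
good_grad = a.grad

# Invalid order: b propagates to a before c has added its share to b.grad.
a, b, c, d = build()
run([d, b, c], d)
bad_grad = a.grad

print(f"correct order: a.grad = {good_grad}")
print(f"wrong order:   a.grad = {bad_grad}")
```

In the wrong order, `b._backward()` fires while `b.grad` still holds only the direct contribution from `d`, so `a` receives an incomplete gradient. A reverse topological order guarantees that every node's gradient is fully accumulated before the node passes it on.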

---
## Part 6: More Operations (Video ~1:45-2:00)

Let's add tanh, power, and more!

In [None]:
import math

class Value:
    
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op
    
    def __repr__(self):
        return f"Value(data={self.data:.4f})"
    
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out
    
    def __radd__(self, other):  # other + self
        return self + other
    
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
    
    def __rmul__(self, other):  # other * self
        return self * other
    
    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only supporting int/float powers"
        out = Value(self.data ** other, (self,), f'**{other}')
        def _backward():
            self.grad += other * (self.data ** (other - 1)) * out.grad
        out._backward = _backward
        return out
    
    def __neg__(self):
        return self * -1
    
    def __sub__(self, other):
        return self + (-other)
    
    def __truediv__(self, other):
        return self * other**-1
    
    def tanh(self):
        x = self.data
        # Same value as (exp(2x) - 1) / (exp(2x) + 1), but avoids OverflowError for large x
        t = math.tanh(x)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward
        return out
    
    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

In [None]:
# Test a neuron-like computation!
# inputs
x1 = Value(2.0)
x2 = Value(0.0)
# weights
w1 = Value(-3.0)
w2 = Value(1.0)
# bias
b = Value(6.8813735870195432)

# neuron: w1*x1 + w2*x2 + b
x1w1 = x1 * w1
x2w2 = x2 * w2
x1w1x2w2 = x1w1 + x2w2
n = x1w1x2w2 + b
o = n.tanh()

print(f"Output: {o.data}")

In [None]:
o.backward()

print("Gradients:")
print(f"  x1.grad = {x1.grad:.4f}")
print(f"  x2.grad = {x2.grad:.4f}")
print(f"  w1.grad = {w1.grad:.4f}")
print(f"  w2.grad = {w2.grad:.4f}")

In [None]:
draw_dot(o)
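
A quick way to sanity-check these gradients is to compare them against finite differences on plain floats. The sketch below assumes the same inputs, weights, and bias as the neuron above; the helper names `neuron` and `central_diff` are made up for this example. The estimates should land close to -1.5, 0.5, 1.0, and 0.0 for `x1`, `x2`, `w1`, and `w2`, because the bias is chosen so that 1 - tanh²(n) = 0.5.

```python
import math

def neuron(x1, x2, w1, w2, b):
    """Same computation as the neuron above, on plain floats."""
    return math.tanh(x1 * w1 + x2 * w2 + b)

params = {"x1": 2.0, "x2": 0.0, "w1": -3.0, "w2": 1.0, "b": 6.8813735870195432}

def central_diff(name, h=1e-5):
    """Central-difference estimate of d(output)/d(params[name])."""
    up = dict(params)
    up[name] += h
    down = dict(params)
    down[name] -= h
    return (neuron(**up) - neuron(**down)) / (2 * h)

numeric = {name: central_diff(name) for name in ["x1", "x2", "w1", "w2"]}
for name, g in numeric.items():
    print(f"  {name}.grad ~ {g:.4f}")
```

If the values printed by `o.backward()` disagree with these estimates by more than a small tolerance, one of the `_backward` closures is the likely culprit.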

---
## Part 7: Building a Neural Network (Video ~2:00-2:25)

Now let's build an actual neural network on top of our Value class!

In [None]:
import random

class Neuron:
    
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))
    
    def __call__(self, x):
        # w · x + b
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()
    
    def parameters(self):
        return self.w + [self.b]

class Layer:
    
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]
    
    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs
    
    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]

class MLP:
    
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]
    
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
    
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]

In [None]:
# Create a small MLP: 3 inputs -> 4 neurons -> 4 neurons -> 1 output
n = MLP(3, [4, 4, 1])

# Test with some input
x = [2.0, 3.0, -1.0]
output = n(x)
print(f"Output: {output}")
print(f"Number of parameters: {len(n.parameters())}")

In [None]:
%cc What does MLP(3, [4, 4, 1]) create? Draw me the network architecture.
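
The parameter count printed above can be cross-checked by hand: each neuron has one weight per input plus one bias. A minimal sketch, where `mlp_param_count` is a hypothetical helper and not part of micrograd:

```python
def mlp_param_count(nin, nouts):
    """Count weights and biases for fully connected layers of the given sizes."""
    sizes = [nin] + nouts
    per_layer = [(fan_in + 1) * fan_out  # +1 for each neuron's bias
                 for fan_in, fan_out in zip(sizes, sizes[1:])]
    return per_layer, sum(per_layer)

per_layer, total = mlp_param_count(3, [4, 4, 1])
print(per_layer, total)  # [16, 20, 5] 41
```

So `MLP(3, [4, 4, 1])` has three layers with 16, 20, and 5 parameters, which should match `len(n.parameters())`.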

---
## Part 8: Training! (Video ~2:10-2:25)

In [None]:
# Training data
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]  # desired targets

In [None]:
# Training loop!
n = MLP(3, [4, 4, 1])

for k in range(100):
    
    # Forward pass
    ypred = [n(x) for x in xs]
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
    
    # Backward pass
    for p in n.parameters():
        p.grad = 0.0  # Zero gradients!
    loss.backward()
    
    # Update
    for p in n.parameters():
        p.data += -0.1 * p.grad
    
    if k % 10 == 0:
        print(f"Step {k}: loss = {loss.data:.4f}")

In [None]:
# Check predictions
print("Final predictions:")
for x, y in zip(xs, ys):
    pred = n(x)
    print(f"  Input: {x} -> Pred: {pred.data:.4f}, Target: {y}")

In [None]:
%cc Walk me through one iteration of the training loop. What happens in forward pass, backward pass, and update?
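
The three phases are easier to see on a model small enough to differentiate by hand. The sketch below assumes a single weight, a single training pair, and a `tanh` output; these values are illustrative and not from the lecture. The step size matches the 0.1 used in the loop above.

```python
import math

x, y = 2.0, 1.0   # one (input, target) pair, chosen for illustration
w = 0.1           # initial weight
lr = 0.1          # step size

def loss_fn(w):
    """Squared error of the one-weight model tanh(w * x)."""
    return (math.tanh(w * x) - y) ** 2

# Forward pass: compute the prediction and the loss
pred = math.tanh(w * x)
loss_before = loss_fn(w)

# Backward pass (chain rule by hand): dL/dw = 2 (pred - y) * (1 - pred^2) * x
grad = 2 * (pred - y) * (1 - pred**2) * x

# Update: step against the gradient
w += -lr * grad
loss_after = loss_fn(w)

print(f"loss before = {loss_before:.4f}, grad = {grad:.4f}, loss after = {loss_after:.4f}")
```

The training loop above does the same thing, except that `loss.backward()` computes the gradient for every parameter automatically and the gradients are reset to zero before each backward pass.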

---
## 🎉 Congratulations!

You just built:
1. An autograd engine (Value class)
2. A neural network library (Neuron, Layer, MLP)
3. A training loop with gradient descent

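The same two ideas, reverse-mode automatic differentiation over a computation graph and gradient descent on the resulting gradients, sit at the core of modern deep learning frameworks such as PyTorch.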

In [None]:
%cc What should I focus on understanding deeply before moving to the next lecture (makemore)?