# 02 - Derivatives and the Chain Rule

Understanding derivatives and the chain rule is essential for training neural networks, including LLMs. Every weight update in an LLM is powered by these concepts!

In this notebook, you'll:
- Compute derivatives of basic functions
- Implement and differentiate activation functions
- Apply the chain rule (the backbone of backpropagation)
- See how gradients flow through a mini neural network


## 🧮 Scalar Derivatives

Derivatives tell us how a function changes as its input changes. In neural networks, this tells us how to adjust weights to reduce loss.

### Task:
- Implement the derivative of a simple scalar function $f(x) = x^2 + 3x + 2$.
- Verify the result numerically with a central finite difference.

**LLM/NN Context:**
- Every parameter update in an LLM is based on derivatives of the loss with respect to that parameter.

In [None]:
import numpy as np

def f(x):
    return x**2 + 3*x + 2

# Analytical derivative
def df_dx(x):
    return 2*x + 3

# Numerical derivative (central finite difference)
def numerical_derivative(f, x, eps=1e-5):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 5.0
print('Analytical:', df_dx(x))
print('Numerical:', numerical_derivative(f, x))
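
To see why the check above uses a symmetric (central) difference, here is a minimal sketch comparing it with a one-sided (forward) difference at a few step sizes. The helper names `forward_difference` and `central_difference` are illustrative and not part of the notebook's own code.

```python
def f(x):
    return x**2 + 3*x + 2

def df_dx(x):
    return 2*x + 3

# One-sided estimate: truncation error shrinks linearly with eps.
def forward_difference(f, x, eps):
    return (f(x + eps) - f(x)) / eps

# Symmetric estimate: truncation error shrinks quadratically with eps.
def central_difference(f, x, eps):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 5.0
exact = df_dx(x)
for eps in (1e-1, 1e-3, 1e-5):
    fwd_err = abs(forward_difference(f, x, eps) - exact)
    ctr_err = abs(central_difference(f, x, eps) - exact)
    print(f"eps={eps:g}  forward error={fwd_err:.2e}  central error={ctr_err:.2e}")
```

For this particular quadratic, the forward-difference error is roughly `eps` itself. The central-difference truncation error vanishes because the third derivative is zero, so only floating-point rounding remains.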

## 🔁 Derivatives of Activation Functions

Activation functions introduce non-linearity in neural networks. Their derivatives are needed for backpropagation.

### Task:
- Implement the sigmoid function and its derivative
- Implement ReLU and its derivative

**LLM/NN Context:**
- Transformers use activation functions (like GELU, ReLU) in every feedforward block. Their derivatives are used in every backward pass.

In [None]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu(x):
    return np.maximum(0, x)

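def relu_derivative(x):
    # ReLU is not differentiable at x = 0; this uses the common convention of 0 there.
    # np.asarray lets the function accept plain Python floats as well as arrays.
    return (np.asarray(x) > 0).astype(float)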

x_vals = np.array([-2.0, 0.0, 2.0])
print('Sigmoid derivative:', sigmoid_derivative(x_vals))
print('ReLU derivative:', relu_derivative(x_vals))
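
The analytical derivatives above can be checked against a numerical estimate. The sketch below redefines the same functions so it runs on its own; `central_difference` is an illustrative helper. The test points deliberately avoid `x = 0`, where ReLU has a kink.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (np.asarray(x) > 0).astype(float)

def central_difference(f, x, eps=1e-5):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Points away from 0, where both activations are smooth.
x_vals = np.array([-2.0, -0.5, 0.5, 2.0])
sig_ok = np.allclose(sigmoid_derivative(x_vals), central_difference(sigmoid, x_vals))
relu_ok = np.allclose(relu_derivative(x_vals), central_difference(relu, x_vals))
print('Sigmoid derivative matches numerical estimate:', sig_ok)
print('ReLU derivative matches numerical estimate:', relu_ok)

# At the kink the symmetric estimate averages the two one-sided slopes (0 and 1).
print('Central difference of ReLU at 0:', central_difference(relu, 0.0))
```

At `x = 0` the numerical estimate lands halfway between the two slopes, while `relu_derivative` returns 0 by convention. A gradient check at that point will therefore disagree even when the implementation is fine.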

## 🔗 Chain Rule

The chain rule lets us compute derivatives of composite functions. This is the backbone of backpropagation in neural networks and LLMs.

### Example:
If $f(x) = g(h(x))$, then $\frac{df}{dx} = \frac{dg}{dh} \cdot \frac{dh}{dx}$, where $\frac{dg}{dh}$ is evaluated at $h(x)$.

### Task:
- Let $h(x) = x^2$, $g(h) = 3h + 1$
- Write $f(x) = g(h(x))$, then compute $df/dx$ using the chain rule.

**LLM/NN Context:**
- Every layer in an LLM is a function of the previous layer. The chain rule lets us propagate gradients backward through all layers.

In [None]:
def h(x):
    return x**2

def g(h_val):
    return 3*h_val + 1

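# Note: f and df_dx below replace the definitions from the first section.
def f(x):
    return g(h(x))

def df_dx(x):
    dh_dx = 2*x      # derivative of the inner function h(x) = x^2
    dg_dh = 3        # derivative of the outer function g(h) = 3h + 1 (constant in h)
    return dg_dh * dh_dx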

x = 4.0
print('f(x):', f(x))
print('df/dx (chain rule):', df_dx(x))
print('df/dx (numerical):', numerical_derivative(f, x))
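
The same idea extends to longer chains: run the functions forward while remembering each one's input, then multiply the local derivatives from last to first. The following is a minimal sketch of that pattern. The `layers` list, the `forward`/`backward` helpers, and the extra `sin` stage are illustrative assumptions, not part of the notebook's own code.

```python
import math

# Each stage is a pair: (function, derivative of that function).
layers = [
    (lambda x: x**2, lambda x: 2 * x),      # h(x) = x^2
    (lambda h: 3 * h + 1, lambda h: 3.0),   # g(h) = 3h + 1
    (math.sin, math.cos),                   # hypothetical third stage
]

def forward(x, layers):
    """Apply each stage in order, recording the input every stage received."""
    inputs = []
    for fn, _ in layers:
        inputs.append(x)
        x = fn(x)
    return x, inputs

def backward(inputs, layers):
    """Multiply local derivatives from the last stage back to the first."""
    grad = 1.0
    for (_, dfn), inp in zip(reversed(layers), reversed(inputs)):
        grad *= dfn(inp)
    return grad

def central_difference(f, x, eps=1e-5):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Two stages reproduce the example above: df/dx = 3 * 2x.
x = 4.0
_, inputs_two = forward(x, layers[:2])
grad_two = backward(inputs_two, layers[:2])
print('two-stage gradient at x=4:', grad_two)

# Three stages, compared against a numerical estimate.
x3 = 0.5
_, inputs_three = forward(x3, layers)
grad_three = backward(inputs_three, layers)
numeric_three = central_difference(lambda v: forward(v, layers)[0], x3)
print('three-stage gradient (chain rule):', grad_three)
print('three-stage gradient (numerical): ', numeric_three)
```

Storing the inputs during the forward pass is the key design choice here: each local derivative has to be evaluated at the value its stage actually saw.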

## 🧠 Chain Rule in a Mini Neural Network

Let’s apply this to a small feedforward computation, using the same names as the code below:
- $z_1 = W_1 x + b_1$
- $a_1 = \text{ReLU}(z_1)$
- $y = W_2 a_1 + b_2$

### Task:
- Do a forward pass with sample data.
- Manually compute all gradients using the chain rule.

**LLM/NN Context:**
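- The feedforward sublayer of a transformer block has this same shape: a linear layer, an activation, then another linear layer. Its backward pass follows the same steps, using the derivative of whichever activation the model uses.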
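
As a reference for the manual gradients, here is the chain rule written out for this network. It uses the names from the code cell ($W_1$, $b_1$, $W_2$, $b_2$ for `W1`, `b1`, `W2`, `b2`) and assumes a scalar loss $L$ whose gradient $\frac{\partial L}{\partial y}$ is already known. The symbol $\odot$ denotes element-wise multiplication.

$$
\begin{aligned}
\frac{\partial L}{\partial W_2} &= \frac{\partial L}{\partial y}\, a_1^\top, &
\frac{\partial L}{\partial b_2} &= \frac{\partial L}{\partial y}, \\
\frac{\partial L}{\partial a_1} &= W_2^\top \frac{\partial L}{\partial y}, &
\frac{\partial L}{\partial z_1} &= \frac{\partial L}{\partial a_1} \odot \text{ReLU}'(z_1), \\
\frac{\partial L}{\partial W_1} &= \frac{\partial L}{\partial z_1}\, x^\top, &
\frac{\partial L}{\partial b_1} &= \frac{\partial L}{\partial z_1}.
\end{aligned}
$$

Each line reuses a gradient computed on an earlier line, which is why the backward pass runs from the output toward the input.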

In [None]:
# Sample values
x = np.array([[1.0], [2.0]])      # input (2x1)
W1 = np.array([[1.0, -1.0]])      # weights (1x2)
b1 = np.array([[0.5]])            # bias (1x1)

# Forward pass
z1 = np.dot(W1, x) + b1           # (1x1)
a1 = relu(z1)                     # (1x1)

W2 = np.array([[2.0]])            # weights (1x1)
b2 = np.array([[1.0]])            # bias (1x1)
y = np.dot(W2, a1) + b2           # (1x1)

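# Backward pass (manual gradients)
# Note: with these sample values z1 = 1*1 + (-1)*2 + 0.5 = -0.5, so the ReLU is
# inactive (a1 = 0) and dW2, dW1, and db1 all come out as zero.
dy = np.array([[1.0]])            # ∂L/∂y, taking the loss to be L = y
dW2 = np.dot(dy, a1.T)            # ∂L/∂W2
db2 = dy                          # ∂L/∂b2
da1 = np.dot(W2.T, dy)            # ∂L/∂a1
dz1 = da1 * relu_derivative(z1)   # ∂L/∂z1
dW1 = np.dot(dz1, x.T)            # ∂L/∂W1
db1 = dz1                         # ∂L/∂b1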

print('dW2:', dW2)
print('db2:', db2)
print('dW1:', dW1)
print('db1:', db1)
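
With the sample input above, $z_1 = -0.5$, so the ReLU blocks the gradient and every gradient that passes through it is zero. The sketch below wraps the same forward and backward steps in functions and compares `dW1` against a finite-difference estimate, once for the original input and once for a hypothetical input (`x_live`, chosen here only so that $z_1 > 0$). The function names are illustrative, and the loss is taken to be $L = y$ as in the cell above.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (np.asarray(x) > 0).astype(float)

def forward(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1
    a1 = relu(z1)
    y = W2 @ a1 + b2
    return z1, a1, y

def backward(x, W2, z1, a1):
    dy = np.array([[1.0]])            # loss L = y, so dL/dy = 1
    dW2 = dy @ a1.T
    db2 = dy
    da1 = W2.T @ dy
    dz1 = da1 * relu_derivative(z1)
    dW1 = dz1 @ x.T
    db1 = dz1
    return dW1, db1, dW2, db2

def numerical_grad_W1(x, W1, b1, W2, b2, eps=1e-5):
    """Central-difference estimate of dy/dW1, one entry at a time."""
    grad = np.zeros_like(W1)
    for idx in np.ndindex(*W1.shape):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        y_plus = forward(x, W_plus, b1, W2, b2)[2].item()
        y_minus = forward(x, W_minus, b1, W2, b2)[2].item()
        grad[idx] = (y_plus - y_minus) / (2 * eps)
    return grad

W1 = np.array([[1.0, -1.0]])
b1 = np.array([[0.5]])
W2 = np.array([[2.0]])
b2 = np.array([[1.0]])

x_dead = np.array([[1.0], [2.0]])   # original input: z1 = -0.5, ReLU inactive
x_live = np.array([[2.0], [1.0]])   # hypothetical input: z1 = 1.5, ReLU active

for name, x in (('original input', x_dead), ('hypothetical input', x_live)):
    z1, a1, y = forward(x, W1, b1, W2, b2)
    dW1, db1, dW2, db2 = backward(x, W2, z1, a1)
    dW1_num = numerical_grad_W1(x, W1, b1, W2, b2)
    print(f'{name}: z1={z1.item():+.1f}  dW1={dW1}  numerical={dW1_num}')

# Keep both results around for inspection.
dW1_dead = backward(x_dead, W2, *forward(x_dead, W1, b1, W2, b2)[:2])[0]
dW1_live = backward(x_live, W2, *forward(x_live, W1, b1, W2, b2)[:2])[0]
```

Perturbing one weight at a time is slow but makes a useful sanity check, because it depends only on the forward pass and not on any hand-derived gradient formula.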

## 🧠 Final Summary: Why Derivatives and the Chain Rule Matter for LLMs

- Derivatives and the chain rule are the foundation of backpropagation, which powers learning in all neural networks, including LLMs.
- Every parameter update in an LLM is based on gradients computed using these rules.
- Understanding how gradients flow through each layer will help you debug and design your own models.

**Next:** In the next notebook, you'll use these gradients to implement and train a simple neural network from scratch!