# Math Foundations for Neural Networks

This notebook covers the essential mathematics needed to understand neural networks:
1. **Linear Algebra** - Vectors, matrices, and operations
2. **Calculus** - Derivatives, partial derivatives, and the chain rule
3. **Probability Basics** - Distributions and expectations

We'll use NumPy to demonstrate these concepts with code.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

## 1. Linear Algebra Fundamentals

Neural networks are built on linear algebra operations. Understanding vectors and matrices is crucial.

### 1.1 Vectors

A vector is an ordered list of numbers. In neural networks, vectors represent:
- Input features (e.g., pixel values, word embeddings)
- Neuron activations
- Weight updates

In [None]:
# Creating vectors in NumPy
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])

print(f"Vector v1: {v1}")
print(f"Vector v2: {v2}")
print(f"Shape: {v1.shape}")

### 1.2 Vector Operations

**Element-wise addition and multiplication:**

In [None]:
# Element-wise addition
v_add = v1 + v2
print(f"v1 + v2 = {v_add}")

# Element-wise multiplication (Hadamard product)
v_mult = v1 * v2
print(f"v1 * v2 (element-wise) = {v_mult}")

### 1.3 Dot Product (Inner Product)

The dot product is fundamental to neural networks. It's used in:
- Computing neuron activations: $z = w \cdot x + b$
- Attention mechanisms
- Similarity measures

**Formula:** $v_1 \cdot v_2 = \sum_{i=1}^{n} v_{1i} \cdot v_{2i}$

In [None]:
# Dot product
dot_product = np.dot(v1, v2)
print(f"v1 · v2 = {dot_product}")
print(f"Manual calculation: {v1[0]*v2[0]} + {v1[1]*v2[1]} + {v1[2]*v2[2]} = {dot_product}")

# Alternative notation
dot_product_alt = v1 @ v2
print(f"Using @ operator: {dot_product_alt}")
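
Since the dot product underlies the similarity measures listed at the start of this section, here is a minimal sketch of cosine similarity built on it. The helper name `cosine_similarity` is an illustrative assumption and is not defined elsewhere in this notebook.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (hypothetical helper)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])

sim_12 = cosine_similarity(v1, v2)
sim_scaled = cosine_similarity(v1, 2 * v1)                        # same direction, different length
sim_orth = cosine_similarity(np.array([1, 0]), np.array([0, 1]))  # perpendicular vectors

print(f"cos(v1, v2)        = {sim_12:.4f}")
print(f"cos(v1, 2*v1)      = {sim_scaled:.4f}")
print(f"cos([1,0], [0,1])  = {sim_orth:.4f}")
```

Dividing by the vector lengths removes the effect of magnitude, so only the direction of the two vectors is compared.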

### 1.4 Matrices

Matrices are 2D arrays of numbers. In neural networks:
- **Weight matrices** connect layers
- **Data matrices** store batches of examples
- **Gradient matrices** store derivatives

In [None]:
# Creating matrices
M1 = np.array([[1, 2, 3],
               [4, 5, 6]])

M2 = np.array([[7, 8],
               [9, 10],
               [11, 12]])

print("Matrix M1 (2x3):")
print(M1)
print(f"Shape: {M1.shape}\n")

print("Matrix M2 (3x2):")
print(M2)
print(f"Shape: {M2.shape}")

### 1.5 Matrix Multiplication

Matrix multiplication is the core operation in neural networks!

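**Rule:** For matrices $A_{m \times n}$ and $B_{n \times p}$, the product $C = AB$ has shape $m \times p$. The inner dimensions must match: the number of columns of $A$ must equal the number of rows of $B$.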

Each element: $C_{ij} = \sum_{k=1}^{n} A_{ik} \cdot B_{kj}$

In [None]:
# Matrix multiplication
M3 = M1 @ M2  # or np.dot(M1, M2)

print(f"M1 @ M2:")
print(M3)
print(f"Shape: {M3.shape}")

# Verify dimensions work
print(f"\nDimension check: ({M1.shape[0]}x{M1.shape[1]}) @ ({M2.shape[0]}x{M2.shape[1]}) = ({M3.shape[0]}x{M3.shape[1]})")
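
To make the summation formula concrete, the sketch below spells out $C_{ij} = \sum_k A_{ik} B_{kj}$ with explicit loops and compares the result with `@`. The function name `matmul_loops` is hypothetical, and the loop version is for illustration only; `@` is the practical choice.

```python
import numpy as np

def matmul_loops(A, B):
    """Matrix product via the element-wise definition (illustrative, slow)."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError(f"Inner dimensions differ: {n} vs {n_b}")
    C = np.zeros((m, p))
    for i in range(m):          # row of A
        for j in range(p):      # column of B
            for k in range(n):  # sum over the shared dimension
                C[i, j] += A[i, k] * B[k, j]
    return C

M1 = np.array([[1, 2, 3],
               [4, 5, 6]])
M2 = np.array([[7, 8],
               [9, 10],
               [11, 12]])

C_loops = matmul_loops(M1, M2)
print(C_loops)
print(f"Matches M1 @ M2: {np.array_equal(C_loops, M1 @ M2)}")
```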

### 1.6 Matrix-Vector Multiplication (Neural Network Layer)

This is exactly how a neural network layer computes its output:

$y = Wx + b$

Where:
- $W$ is the weight matrix
- $x$ is the input vector
- $b$ is the bias vector
- $y$ is the output vector

In [None]:
# Example: 3 inputs, 2 neurons
W = np.array([[0.5, 0.2, 0.1],   # weights for neuron 1
              [0.3, 0.4, 0.6]])  # weights for neuron 2

x = np.array([1.0, 2.0, 3.0])    # input features
b = np.array([0.1, 0.2])         # biases

# Compute layer output
y = W @ x + b

print(f"Input x: {x}")
print(f"Weight matrix W:\n{W}")
print(f"Bias b: {b}")
print(f"\nOutput y = Wx + b: {y}")
print(f"\nThis is a 2-neuron layer processing 3 input features!")
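
The same layer can process several inputs at once, which is where data matrices come in. As a sketch, assume a made-up batch `X` with one example per row; the values are arbitrary. Multiplying by `W.T` applies the layer to every row, and NumPy broadcasting adds the bias to each one.

```python
import numpy as np

W = np.array([[0.5, 0.2, 0.1],   # weights for neuron 1
              [0.3, 0.4, 0.6]])  # weights for neuron 2
b = np.array([0.1, 0.2])         # biases

# Hypothetical batch: 4 examples, 3 features each
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 0.0],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])

# (4x3) @ (3x2) -> (4x2); b is broadcast across the 4 rows
Y = X @ W.T + b

print(f"Batch output shape: {Y.shape}")
print(Y)
print(f"Row 0 matches the single-vector result: {np.allclose(Y[0], W @ X[0] + b)}")
```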

### 1.7 Transpose

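The transpose swaps rows and columns, so an $m \times n$ matrix becomes $n \times m$. It matters for backpropagation because gradients flow backward through a layer $y = Wx + b$ via $W^T$.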

Notation: $A^T$ or $A'$

In [None]:
A = np.array([[1, 2, 3],
              [4, 5, 6]])

A_T = A.T

print(f"Original A {A.shape}:")
print(A)
print(f"\nTranspose A^T {A_T.shape}:")
print(A_T)
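
One place the transpose shows up in backpropagation: for a layer $y = Wx + b$, the gradient of a loss with respect to the input is $W^T$ times the gradient with respect to the output. The sketch below checks this numerically. The loss $L = \frac{1}{2}\sum_i y_i^2$ is an arbitrary choice made only for this illustration.

```python
import numpy as np

W = np.array([[0.5, 0.2, 0.1],
              [0.3, 0.4, 0.6]])
x = np.array([1.0, 2.0, 3.0])
b = np.array([0.1, 0.2])

def loss(x_in):
    """Illustrative loss: half the sum of squared layer outputs."""
    y = W @ x_in + b
    return 0.5 * np.sum(y ** 2)

# Analytical gradient
y = W @ x + b
dL_dy = y             # derivative of 0.5 * sum(y^2) with respect to y
dL_dx = W.T @ dL_dy   # W^T maps the 2-element output gradient back to 3 inputs

# Finite-difference check, one input component at a time
h = 1e-6
dL_dx_num = np.zeros_like(x)
for i in range(x.size):
    step = np.zeros_like(x)
    step[i] = h
    dL_dx_num[i] = (loss(x + step) - loss(x - step)) / (2 * h)

print(f"dL/dx via W^T:        {dL_dx}")
print(f"dL/dx via differences: {dL_dx_num}")
```

Note the shapes: `W` is 2x3 and maps 3 inputs to 2 outputs, while `W.T` is 3x2 and maps the 2 output gradients back to 3 input gradients.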

## 2. Calculus for Neural Networks

Calculus helps us understand how small changes in inputs affect outputs. This is essential for:
- **Training neural networks** (gradient descent)
- **Backpropagation** (computing gradients)
- **Optimization** (finding minima)

### 2.1 Derivatives

The derivative measures how a function changes as its input changes.

**Definition:** $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$

**Geometric interpretation:** Slope of the tangent line

In [None]:
# Example: f(x) = x^2
# Derivative: f'(x) = 2x

def f(x):
    return x**2

def f_prime(x):
    return 2*x

# Numerical approximation of the derivative (forward difference):
# the limit definition evaluated at a small but finite h
def numerical_derivative(f, x, h=1e-5):
    return (f(x + h) - f(x)) / h

x = 3.0
print(f"Function value at x={x}: f({x}) = {f(x)}")
print(f"Analytical derivative: f'({x}) = {f_prime(x)}")
print(f"Numerical derivative: f'({x}) ≈ {numerical_derivative(f, x)}")
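
The one-sided formula from the definition is not the only way to approximate a derivative. As a sketch for comparison, the central difference $\frac{f(x+h) - f(x-h)}{2h}$ is usually more accurate for the same $h$. The function names below are illustrative assumptions.

```python
def f(x):
    return x ** 2

def forward_difference(f, x, h=1e-5):
    """One-sided estimate, mirrors the limit definition."""
    return (f(x + h) - f(x)) / h

def central_difference(f, x, h=1e-5):
    """Symmetric estimate using points on both sides of x."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 3.0
exact = 2 * x0  # analytical derivative of x^2

err_forward = abs(forward_difference(f, x0) - exact)
err_central = abs(central_difference(f, x0) - exact)

print(f"Forward difference error: {err_forward:.2e}")
print(f"Central difference error: {err_central:.2e}")
```

For $f(x) = x^2$ the forward difference works out to $2x + h$, so its error is about $h$, while the central difference cancels that term and is limited mainly by floating-point rounding.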

In [None]:
# Visualize the function and its derivative
x_vals = np.linspace(-5, 5, 100)
y_vals = f(x_vals)
dy_vals = f_prime(x_vals)

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(x_vals, y_vals, label='f(x) = x²', linewidth=2)
plt.grid(True, alpha=0.3)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Function f(x) = x²')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(x_vals, dy_vals, label="f'(x) = 2x", color='orange', linewidth=2)
plt.grid(True, alpha=0.3)
plt.xlabel('x')
plt.ylabel("f'(x)")
plt.title("Derivative f'(x) = 2x")
plt.legend()

plt.tight_layout()
plt.show()

### 2.2 Common Derivatives (Reference)

These appear constantly in neural networks:

| Function | Derivative |
|----------|------------|
| $f(x) = c$ (constant) | $f'(x) = 0$ |
| $f(x) = x$ | $f'(x) = 1$ |
| $f(x) = x^n$ | $f'(x) = nx^{n-1}$ |
| $f(x) = e^x$ | $f'(x) = e^x$ |
| $f(x) = \ln(x)$ | $f'(x) = \frac{1}{x}$ |
| $f(x) = \sin(x)$ | $f'(x) = \cos(x)$ |

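The table entries can be spot-checked numerically. The sketch below compares each analytical derivative with a central-difference estimate at one point; the test point `x0 = 1.5`, the constant `5.0`, and the exponent `n = 3` are arbitrary choices for illustration.

```python
import numpy as np

def central_difference(f, x, h=1e-5):
    """Symmetric finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.5  # arbitrary test point (positive, so ln(x) is defined)

# (label, function, analytical derivative)
table = [
    ("c = 5",  lambda x: 5.0,    lambda x: 0.0),
    ("x",      lambda x: x,      lambda x: 1.0),
    ("x^3",    lambda x: x ** 3, lambda x: 3 * x ** 2),
    ("e^x",    np.exp,           np.exp),
    ("ln(x)",  np.log,           lambda x: 1 / x),
    ("sin(x)", np.sin,           np.cos),
]

max_error = 0.0
for label, func, deriv in table:
    numeric = central_difference(func, x0)
    analytic = deriv(x0)
    max_error = max(max_error, abs(numeric - analytic))
    print(f"{label:>7}: analytic = {analytic:.6f}, numeric = {numeric:.6f}")

print(f"Largest discrepancy: {max_error:.2e}")
```
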
### 2.3 The Chain Rule (Most Important for Backpropagation!)

The chain rule tells us how to differentiate composite functions:

**If** $y = f(g(x))$, **then** $\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$

This is the mathematical foundation of backpropagation!

In [None]:
# Example: y = (2x + 1)^3
# Let g(x) = 2x + 1, so f(g) = g^3
# dy/dx = dy/dg * dg/dx = 3g^2 * 2 = 6(2x + 1)^2

def composite_function(x):
    """y = (2x + 1)^3"""
    return (2*x + 1)**3

def composite_derivative(x):
    """dy/dx = 6(2x + 1)^2"""
    return 6 * (2*x + 1)**2

x = 2.0
print(f"Analytical derivative at x={x}: {composite_derivative(x)}")
print(f"Numerical derivative at x={x}: {numerical_derivative(composite_function, x)}")

# Breaking it down with chain rule
g = 2*x + 1  # inner function
dg_dx = 2    # derivative of inner function
df_dg = 3 * g**2  # derivative of outer function
dy_dx = df_dg * dg_dx  # chain rule

print(f"\nChain rule breakdown:")
print(f"  g = 2x + 1 = {g}")
print(f"  dg/dx = {dg_dx}")
print(f"  df/dg = 3g² = {df_dg}")
print(f"  dy/dx = df/dg * dg/dx = {dy_dx}")
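
To hint at how the chain rule turns into backpropagation, here is a minimal sketch for a single linear neuron with a squared-error loss, $L = (wx + b - t)^2$. All values are hypothetical and the setup is deliberately simpler than a real network.

```python
# Hypothetical values chosen for illustration
w, b = 0.5, 0.1        # weight and bias
x, target = 2.0, 1.0   # input and desired output

def loss(w, b):
    """Squared error of a single linear neuron."""
    return (w * x + b - target) ** 2

# Forward pass, keeping the intermediate values
z = w * x + b
error = z - target
L = error ** 2

# Backward pass: multiply local derivatives along the chain
dL_derror = 2 * error  # d(error^2)/d(error)
derror_dz = 1.0        # d(z - target)/dz
dz_dw = x              # d(w*x + b)/dw
dz_db = 1.0            # d(w*x + b)/db

dL_dw = dL_derror * derror_dz * dz_dw
dL_db = dL_derror * derror_dz * dz_db

# Finite-difference check of both gradients
h = 1e-6
num_dw = (loss(w + h, b) - loss(w - h, b)) / (2 * h)
num_db = (loss(w, b + h) - loss(w, b - h)) / (2 * h)

print(f"dL/dw: chain rule = {dL_dw:.6f}, numerical = {num_dw:.6f}")
print(f"dL/db: chain rule = {dL_db:.6f}, numerical = {num_db:.6f}")
```

The factor `dL_derror * derror_dz` is shared by both gradients, which is the kind of reuse that makes backpropagation efficient.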

### 2.4 Partial Derivatives

When a function has multiple inputs, we compute the **partial derivative** with respect to each input, treating the others as constants.

**Example:** $f(x, y) = x^2 + xy + y^2$

- $\frac{\partial f}{\partial x} = 2x + y$
- $\frac{\partial f}{\partial y} = x + 2y$

In [None]:
def f_xy(x, y):
    """f(x,y) = x^2 + xy + y^2"""
    return x**2 + x*y + y**2

def df_dx(x, y):
    """∂f/∂x = 2x + y"""
    return 2*x + y

def df_dy(x, y):
    """∂f/∂y = x + 2y"""
    return x + 2*y

x, y = 3.0, 4.0
print(f"f({x}, {y}) = {f_xy(x, y)}")
print(f"\nPartial derivatives:")
print(f"  ∂f/∂x = {df_dx(x, y)}")
print(f"  ∂f/∂y = {df_dy(x, y)}")

# Numerical verification
h = 1e-5
numerical_dx = (f_xy(x+h, y) - f_xy(x, y)) / h
numerical_dy = (f_xy(x, y+h) - f_xy(x, y)) / h

print(f"\nNumerical verification:")
print(f"  ∂f/∂x ≈ {numerical_dx}")
print(f"  ∂f/∂y ≈ {numerical_dy}")

### 2.5 Gradients

The **gradient** is a vector of all partial derivatives:

$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$

The gradient points in the direction of steepest increase, and its length gives the rate of increase in that direction!

In [None]:
# Gradient of f(x,y) = x^2 + xy + y^2
def gradient(x, y):
    """Returns gradient vector [∂f/∂x, ∂f/∂y]"""
    return np.array([df_dx(x, y), df_dy(x, y)])

grad = gradient(3.0, 4.0)
print(f"Gradient at (3, 4): {grad}")
print(f"This vector points in the direction of steepest increase!")

In [None]:
# Visualize the gradient field
x = np.linspace(-3, 3, 20)
y = np.linspace(-3, 3, 20)
X, Y = np.meshgrid(x, y)

# Compute function values
Z = f_xy(X, Y)

# Compute gradient at each point
U = df_dx(X, Y)  # ∂f/∂x
V = df_dy(X, Y)  # ∂f/∂y

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
contour = plt.contour(X, Y, Z, levels=20)
plt.clabel(contour, inline=True, fontsize=8)
plt.title('Function f(x,y) = x² + xy + y²')
plt.xlabel('x')
plt.ylabel('y')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.contour(X, Y, Z, levels=20, alpha=0.3)
plt.quiver(X, Y, U, V, alpha=0.6)
plt.title('Gradient Field (arrows point uphill)')
plt.xlabel('x')
plt.ylabel('y')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Notice: Gradients point away from the minimum at (0,0) toward higher values!")
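
Because the gradient points uphill, stepping in the opposite direction moves toward the minimum. The sketch below runs plain gradient descent on $f(x, y) = x^2 + xy + y^2$. The starting point, the learning rate of `0.1`, and the 50 steps are assumptions picked for this example, not recommended settings.

```python
import numpy as np

def f_xy(x, y):
    """f(x,y) = x^2 + xy + y^2"""
    return x ** 2 + x * y + y ** 2

def gradient(x, y):
    """Gradient vector [2x + y, x + 2y]"""
    return np.array([2 * x + y, x + 2 * y])

point = np.array([3.0, 4.0])  # starting point
learning_rate = 0.1           # step size (assumed)

history = [f_xy(*point)]
for step in range(50):
    point = point - learning_rate * gradient(*point)  # move against the gradient
    history.append(f_xy(*point))

print(f"Start value: {history[0]:.4f}")
print(f"Final value: {history[-1]:.6f}")
print(f"Final point: {point}")
```

With this step size the function value decreases at every step and the point approaches the minimum at (0, 0).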

## 3. Probability Basics

Neural networks use probability for:
- **Classification** (softmax outputs are probabilities)
- **Loss functions** (cross-entropy)
- **Regularization** (dropout)
- **Uncertainty estimation**

### 3.1 Probability Distributions

A probability distribution assigns probabilities to different outcomes.

**Key properties:**
- All probabilities are between 0 and 1: $0 \leq P(x) \leq 1$
- All probabilities sum to 1: $\sum P(x) = 1$

In [None]:
# Example: Probability distribution over 3 classes
class_probs = np.array([0.7, 0.2, 0.1])

print(f"Class probabilities: {class_probs}")
# Format the sum: floating-point rounding can make the raw value print as 0.9999999999999999
print(f"Sum of probabilities: {class_probs.sum():.4f} (must equal 1)")
print(f"All between 0 and 1: {np.all((class_probs >= 0) & (class_probs <= 1))}")

# Visualize
plt.figure(figsize=(8, 4))
plt.bar(['Class 0', 'Class 1', 'Class 2'], class_probs)
plt.ylabel('Probability')
plt.title('Probability Distribution Over Classes')
plt.ylim(0, 1)
plt.grid(True, alpha=0.3, axis='y')
plt.show()

### 3.2 Expected Value (Mean)

The expected value is the average outcome weighted by probability:

$E[X] = \sum_{i} x_i \cdot P(x_i)$

In [None]:
# Example: Expected value of a die roll
outcomes = np.array([1, 2, 3, 4, 5, 6])
probabilities = np.full(6, 1/6)  # fair die: each face has probability 1/6

expected_value = np.sum(outcomes * probabilities)
print(f"Expected value of a fair die: {expected_value:.2f}")

# Verify with simulation
rolls = np.random.randint(1, 7, size=10000)
print(f"Average from 10,000 simulated rolls: {rolls.mean():.2f}")
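
As a rough illustration of why the simulated average lands near the expected value, the sketch below repeats the simulation with increasing numbers of rolls. It uses `np.random.default_rng`, NumPy's newer random generator interface, with an arbitrary seed; the sample sizes are also arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
expected_value = 3.5            # expected value of a fair die

sample_sizes = [10, 100, 1_000, 100_000]
sample_means = []
for n in sample_sizes:
    rolls = rng.integers(1, 7, size=n)  # upper bound is exclusive, so faces are 1..6
    sample_means.append(rolls.mean())
    print(f"n = {n:>7}: sample mean = {sample_means[-1]:.4f}, "
          f"gap to 3.5 = {abs(sample_means[-1] - expected_value):.4f}")
```

Small samples can be noticeably off, while large samples tend to sit very close to 3.5.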

### 3.3 Softmax Function (Converting to Probabilities)

Neural network outputs are often arbitrary real numbers (logits). Softmax converts them to probabilities:

$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$

This ensures:
- All outputs are positive
- All outputs sum to 1
- The relative ordering of the logits is preserved

In [None]:
def softmax(z):
    """Compute softmax values for array z"""
    exp_z = np.exp(z - np.max(z))  # subtract max for numerical stability
    return exp_z / exp_z.sum()

# Example: Neural network outputs (logits)
logits = np.array([2.0, 1.0, 0.1])

probs = softmax(logits)

print(f"Raw logits: {logits}")
print(f"After softmax: {probs}")
print(f"Sum of probabilities: {probs.sum()}")

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.bar(['Class 0', 'Class 1', 'Class 2'], logits)
ax1.set_ylabel('Value')
ax1.set_title('Raw Logits (network outputs)')
ax1.grid(True, alpha=0.3, axis='y')

ax2.bar(['Class 0', 'Class 1', 'Class 2'], probs)
ax2.set_ylabel('Probability')
ax2.set_title('After Softmax')
ax2.set_ylim(0, 1)
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()
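
The comment about subtracting the maximum deserves a closer look. As a sketch, the code below compares a naive softmax with the shifted version on deliberately large logits. The names `softmax_naive` and `softmax_stable` and the logit values are assumptions for this illustration.

```python
import numpy as np

def softmax_naive(z):
    """Direct translation of the formula; overflows for large logits."""
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def softmax_stable(z):
    """Shift by the max first; the result is mathematically identical."""
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

large_logits = np.array([1000.0, 999.0, 998.1])

# Silence the overflow warnings the naive version triggers
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(large_logits)  # exp(1000) overflows to inf, inf/inf gives nan

stable = softmax_stable(large_logits)

print(f"Naive softmax:  {naive}")
print(f"Stable softmax: {stable}")
print(f"Stable sum:     {stable.sum():.4f}")
```

Subtracting a constant from every logit multiplies the numerator and denominator by the same factor, so the probabilities are unchanged while the largest exponent becomes $e^0 = 1$.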

## Summary

You now have the mathematical foundation for neural networks!

### Key Takeaways:

**Linear Algebra:**
- Vectors and matrices represent data and parameters
- Matrix multiplication computes layer outputs: $y = Wx + b$
- Transpose is crucial for backpropagation

**Calculus:**
- Derivatives measure sensitivity to changes
- **Chain rule** is the foundation of backpropagation
- Gradients point in the direction of steepest increase
- We use negative gradients for gradient descent (going downhill)

**Probability:**
- Softmax converts outputs to probabilities
- Expected value computes weighted averages
- Probability distributions must sum to 1

### Next Steps:
In the next notebook, we'll use these concepts to build our first neural network!

## Practice Exercises

Try these exercises to reinforce your understanding:

In [None]:
# Exercise 1: Compute the dot product of these vectors
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
# Your code here


In [None]:
# Exercise 2: Create a 3x4 weight matrix and multiply it with a 4-element input vector
# Then add a bias vector of size 3
# Your code here


In [None]:
# Exercise 3: Implement a function to compute the numerical derivative
# of f(x) = x^3 - 2x^2 + 1 at x = 2
# Then compare with the analytical derivative: f'(x) = 3x^2 - 4x
# Your code here


In [None]:
# Exercise 4: Apply softmax to these logits and verify they sum to 1
logits = np.array([3.2, 1.3, 0.2, 0.8])
# Your code here
