# Neural Network Basics

In this notebook, we'll build our understanding of neural networks from the ground up:
1. **The Perceptron** - The simplest neural unit
2. **Activation Functions** - Adding non-linearity
3. **Forward Propagation** - Computing predictions
4. **Loss Functions** - Measuring errors

We'll implement everything in NumPy first to understand the mechanics, then see how PyTorch simplifies it.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

np.random.seed(42)
torch.manual_seed(42)

## 1. The Perceptron: The Building Block

A **perceptron** (or neuron) is the fundamental unit of a neural network.

### How it works:
1. Takes multiple inputs: $x_1, x_2, ..., x_n$
2. Multiplies each by a weight: $w_1, w_2, ..., w_n$
3. Sums them with a bias: $z = w_1x_1 + w_2x_2 + ... + w_nx_n + b$
4. Applies an activation function: $a = \sigma(z)$

### Vector notation:
$z = w^T x + b$ (or $z = w \cdot x + b$)

$a = \sigma(z)$

In [None]:
class Perceptron:
    """A single perceptron/neuron"""
    
    def __init__(self, n_inputs):
        """Initialize with random weights and bias"""
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = np.random.randn() * 0.1
    
    def forward(self, x):
        """Compute the pre-activation z for input x (no activation applied yet)"""
        # Linear combination: z = w·x + b
        z = np.dot(self.weights, x) + self.bias
        return z
    
# Example: perceptron with 3 inputs
perceptron = Perceptron(n_inputs=3)
x = np.array([1.0, 2.0, 3.0])

print(f"Input: {x}")
print(f"Weights: {perceptron.weights}")
print(f"Bias: {perceptron.bias}")
print(f"Output (z): {perceptron.forward(x)}")
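
The four steps listed at the start of this section can be traced by hand for one small case. The sketch below uses hand-picked weights and bias (an assumption for illustration, not the random values printed above) and takes sigmoid as the activation $\sigma$.

```python
import numpy as np

# Hand-picked values for illustration (not the random weights used above)
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2, 0.1])
b = 0.3

# Steps 2-3: multiply each input by its weight, then sum and add the bias
products = w * x                   # elementwise: [0.5, -0.4, 0.3]
z_manual = products.sum() + b      # 0.5 - 0.4 + 0.3 + 0.3 = 0.7

# The same computation in vector notation: z = w . x + b
z_dot = np.dot(w, x) + b

# Step 4: apply an activation function (sigmoid chosen here as an example)
a = 1.0 / (1.0 + np.exp(-z_dot))

print(f"Products w_i * x_i: {products}")
print(f"z (manual sum):  {z_manual:.4f}")
print(f"z (dot product): {z_dot:.4f}")
print(f"a = sigmoid(z):  {a:.4f}")
```

The manual sum and the dot product agree, which is all the vector notation $z = w^T x + b$ is saying.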

## 2. Activation Functions

**Why do we need activation functions?**

Without activation functions, stacking multiple layers collapses into a single linear (affine) transformation:
- $y = W_2(W_1x + b_1) + b_2 = (W_2W_1)x + (W_2b_1 + b_2) = W'x + b'$

This is equivalent to a single layer! Activation functions add **non-linearity**, enabling networks to learn complex patterns.
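
The collapse can be checked numerically. This is a minimal sketch with arbitrary random weights (the shapes and the seed are assumptions chosen for illustration): two stacked layers without activation give the same output as one layer with $W' = W_2 W_1$ and $b' = W_2 b_1 + b_2$, while inserting a ReLU between them breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with no activation: 3 inputs -> 4 hidden -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
X = rng.normal(size=(3, 100))     # 100 samples, features x samples

# Apply the two layers one after the other
y_stacked = W2 @ (W1 @ X + b1[:, None]) + b2[:, None]

# Collapse them into a single layer: W' = W2 W1, b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
y_single = W_prime @ X + b_prime[:, None]

linear_matches = np.allclose(y_stacked, y_single)

# Same two layers, but with a ReLU in between
y_relu = W2 @ np.maximum(0, W1 @ X + b1[:, None]) + b2[:, None]
relu_matches = np.allclose(y_relu, y_single)

print(f"Stacked linear layers equal one layer: {linear_matches}")
print(f"With ReLU in between, still equal:     {relu_matches}")
```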

### Common Activation Functions:

### 2.1 Sigmoid

**Formula:** $\sigma(z) = \frac{1}{1 + e^{-z}}$

**Properties:**
- Output range: (0, 1)
- Smooth, differentiable
- Used for binary classification
- **Problem:** Vanishing gradients for large |z|

**Derivative:** $\sigma'(z) = \sigma(z)(1 - \sigma(z))$

In [None]:
def sigmoid(z):
    """Sigmoid activation function"""
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of sigmoid"""
    s = sigmoid(z)
    return s * (1 - s)

# Visualize
z = np.linspace(-10, 10, 200)
y = sigmoid(z)
dy = sigmoid_derivative(z)

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(z, y, linewidth=2)
plt.grid(True, alpha=0.3)
plt.xlabel('z')
plt.ylabel('σ(z)')
plt.title('Sigmoid Function')
plt.axhline(y=0.5, color='r', linestyle='--', alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(z, dy, linewidth=2, color='orange')
plt.grid(True, alpha=0.3)
plt.xlabel('z')
plt.ylabel("σ'(z)")
plt.title('Sigmoid Derivative (notice vanishing at extremes)')

plt.tight_layout()
plt.show()
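
One way to gain confidence in the derivative formula $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ is to compare it with a central finite difference. The sketch below redefines the two functions so it runs on its own; the step size `h` and the test points are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

z_points = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

# Central finite difference: (f(z + h) - f(z - h)) / (2h)
h = 1e-5
numeric = (sigmoid(z_points + h) - sigmoid(z_points - h)) / (2 * h)
analytic = sigmoid_derivative(z_points)

for zi, num, ana in zip(z_points, numeric, analytic):
    print(f"z = {zi:6.1f}   numeric = {num:.6f}   analytic = {ana:.6f}")

print(f"Largest difference: {np.max(np.abs(numeric - analytic)):.2e}")
```

The derivative peaks at 0.25 for $z = 0$ and is close to zero at $z = \pm 10$, which is the vanishing gradient visible in the right-hand plot.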

### 2.2 Tanh (Hyperbolic Tangent)

**Formula:** $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$

**Properties:**
- Output range: (-1, 1)
- Zero-centered: outputs are balanced around 0, which is often easier to optimize than sigmoid
- Still has vanishing gradient problem

**Derivative:** $\tanh'(z) = 1 - \tanh^2(z)$

In [None]:
def tanh(z):
    """Tanh activation function"""
    return np.tanh(z)

def tanh_derivative(z):
    """Derivative of tanh"""
    return 1 - np.tanh(z)**2

# Visualize
y_tanh = tanh(z)
dy_tanh = tanh_derivative(z)

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(z, y_tanh, linewidth=2, color='green')
plt.grid(True, alpha=0.3)
plt.xlabel('z')
plt.ylabel('tanh(z)')
plt.title('Tanh Function')
plt.axhline(y=0, color='r', linestyle='--', alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(z, dy_tanh, linewidth=2, color='orange')
plt.grid(True, alpha=0.3)
plt.xlabel('z')
plt.ylabel("tanh'(z)")
plt.title('Tanh Derivative')

plt.tight_layout()
plt.show()
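
Tanh and sigmoid are closely related: $\tanh(z) = 2\sigma(2z) - 1$, so tanh is a sigmoid rescaled to the range (-1, 1). The short check below verifies this identity on a few points and compares the peak slopes at $z = 0$; the test points are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z_points = np.linspace(-5, 5, 11)

# Identity: tanh(z) = 2 * sigmoid(2z) - 1
tanh_from_sigmoid = 2 * sigmoid(2 * z_points) - 1
identity_holds = np.allclose(np.tanh(z_points), tanh_from_sigmoid)

# Peak slopes, both reached at z = 0
sigmoid_slope_at_0 = sigmoid(0.0) * (1 - sigmoid(0.0))
tanh_slope_at_0 = 1 - np.tanh(0.0) ** 2

print(f"tanh(z) == 2*sigmoid(2z) - 1: {identity_holds}")
print(f"Peak sigmoid slope: {sigmoid_slope_at_0}")
print(f"Peak tanh slope:    {tanh_slope_at_0}")
```

The larger peak slope means tanh passes stronger gradients near zero, although both functions still saturate for large $|z|$.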

### 2.3 ReLU (Rectified Linear Unit) - Most Popular!

**Formula:** $\text{ReLU}(z) = \max(0, z)$

**Properties:**
- Output range: [0, ∞)
- Computationally efficient
- No vanishing gradient for positive values
- **Default choice** for hidden layers
- **Problem:** "Dead ReLU" (neurons that always output 0)

**Derivative:**
$\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}$

In [None]:
def relu(z):
    """ReLU activation function"""
    return np.maximum(0, z)

def relu_derivative(z):
    """Derivative of ReLU"""
    return (z > 0).astype(float)

# Visualize
y_relu = relu(z)
dy_relu = relu_derivative(z)

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(z, y_relu, linewidth=2, color='red')
plt.grid(True, alpha=0.3)
plt.xlabel('z')
plt.ylabel('ReLU(z)')
plt.title('ReLU Function')

plt.subplot(1, 2, 2)
plt.plot(z, dy_relu, linewidth=2, color='orange')
plt.grid(True, alpha=0.3)
plt.xlabel('z')
plt.ylabel("ReLU'(z)")
plt.title('ReLU Derivative')
plt.ylim(-0.1, 1.1)

plt.tight_layout()
plt.show()
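
The "dead ReLU" problem can be illustrated with a single neuron. In this sketch the weights, the random inputs, and the bias of -10 are all hypothetical values chosen to make the effect obvious: once the pre-activation is negative for every input, both the output and the derivative are zero everywhere, so gradient-based updates can no longer change the neuron.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 samples with 3 features each, drawn from a standard normal
X = rng.normal(size=(3, 1000))
w = np.array([0.5, -0.2, 0.1])

def active_fraction(w, b, X):
    """Fraction of samples for which ReLU'(z) = 1, i.e. z > 0."""
    z = w @ X + b
    return float((z > 0).mean())

b_healthy = 0.0
b_dead = -10.0   # hypothetical bias pushed far into the negative range

frac_healthy = active_fraction(w, b_healthy, X)
frac_dead = active_fraction(w, b_dead, X)

print(f"Healthy neuron, fraction of inputs with gradient 1: {frac_healthy:.2f}")
print(f"Dead neuron, fraction of inputs with gradient 1:    {frac_dead:.2f}")
```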

### 2.4 Comparing All Activation Functions

In [None]:
# Compare all three
plt.figure(figsize=(14, 5))

plt.subplot(1, 2, 1)
plt.plot(z, sigmoid(z), label='Sigmoid', linewidth=2)
plt.plot(z, tanh(z), label='Tanh', linewidth=2)
plt.plot(z, relu(z), label='ReLU', linewidth=2)
plt.grid(True, alpha=0.3)
plt.xlabel('z')
plt.ylabel('Activation')
plt.title('Activation Functions')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(z, sigmoid_derivative(z), label='Sigmoid', linewidth=2)
plt.plot(z, tanh_derivative(z), label='Tanh', linewidth=2)
plt.plot(z, relu_derivative(z), label='ReLU', linewidth=2)
plt.grid(True, alpha=0.3)
plt.xlabel('z')
plt.ylabel('Derivative')
plt.title('Derivatives (gradients)')
plt.legend()

plt.tight_layout()
plt.show()

print("Key observation: ReLU's derivative is exactly 1 for z > 0 (no vanishing gradient for active units) and 0 for z <= 0.")

## 3. Forward Propagation

Forward propagation is the process of computing the network's output given an input.

### For a single layer:
1. **Linear transformation:** $Z = WX + b$
2. **Activation:** $A = \sigma(Z)$

### For multiple layers:
1. $Z^{[1]} = W^{[1]}X + b^{[1]}$
2. $A^{[1]} = \sigma(Z^{[1]})$
3. $Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$
4. $A^{[2]} = \sigma(Z^{[2]})$
5. And so on...

### 3.1 Single Neuron Forward Pass - NumPy

In [None]:
# Example: Single neuron with ReLU activation
def forward_neuron(x, w, b, activation='relu'):
    """
    Forward pass for a single neuron
    
    Args:
        x: input vector
        w: weight vector
        b: bias (scalar)
        activation: activation function name
    
    Returns:
        z: pre-activation
        a: post-activation
    """
    # Linear transformation
    z = np.dot(w, x) + b
    
    # Apply activation
    if activation == 'relu':
        a = relu(z)
    elif activation == 'sigmoid':
        a = sigmoid(z)
    elif activation == 'tanh':
        a = tanh(z)
    else:
        a = z  # linear
    
    return z, a

# Test it
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2, 0.1])
b = 0.3

z, a = forward_neuron(x, w, b, activation='relu')
print(f"Input: {x}")
print(f"Weights: {w}")
print(f"Bias: {b}")
print(f"Pre-activation (z): {z:.4f}")
print(f"Post-activation (a): {a:.4f}")

### 3.2 Layer Forward Pass - NumPy

A layer has multiple neurons. We can compute all outputs in parallel using matrix multiplication!

In [None]:
def forward_layer(X, W, b, activation='relu'):
    """
    Forward pass for a layer
    
    Args:
        X: input matrix (features x samples)
        W: weight matrix (neurons x features)
        b: bias vector (neurons,)
        activation: activation function
    
    Returns:
        Z: pre-activation
        A: post-activation
    """
    # Linear transformation: Z = WX + b
    Z = W @ X + b.reshape(-1, 1)  # broadcast bias
    
    # Apply activation
    if activation == 'relu':
        A = relu(Z)
    elif activation == 'sigmoid':
        A = sigmoid(Z)
    elif activation == 'tanh':
        A = tanh(Z)
    else:
        A = Z
    
    return Z, A

# Example: 3 inputs, 4 neurons, 2 samples
X = np.array([[1.0, 2.0],    # sample 1 and 2 for feature 1
              [2.0, 3.0],    # sample 1 and 2 for feature 2
              [3.0, 4.0]])   # sample 1 and 2 for feature 3

W = np.random.randn(4, 3) * 0.1  # 4 neurons, 3 inputs each
b = np.random.randn(4) * 0.1     # 4 biases

Z, A = forward_layer(X, W, b, activation='relu')

print(f"Input shape: {X.shape} (3 features, 2 samples)")
print(f"Weight shape: {W.shape} (4 neurons, 3 inputs)")
print(f"Bias shape: {b.shape}")
print(f"\nOutput shape: {A.shape} (4 neurons, 2 samples)")
print(f"\nOutput values:\n{A}")
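
To see that the matrix form is just many single-neuron computations done at once, the sketch below compares `W @ X + b` with an explicit loop over neurons and samples. The random weights and seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 4.0]])           # 3 features x 2 samples
W = rng.normal(size=(4, 3)) * 0.1    # 4 neurons x 3 features
b = rng.normal(size=4) * 0.1         # one bias per neuron

# Vectorized: all neurons and all samples at once
Z_matrix = W @ X + b.reshape(-1, 1)

# Explicit loops: one neuron and one sample at a time
Z_loop = np.zeros((4, 2))
for j in range(4):        # neurons (rows of W)
    for i in range(2):    # samples (columns of X)
        Z_loop[j, i] = np.dot(W[j], X[:, i]) + b[j]

print(f"Loop and matrix results match: {np.allclose(Z_matrix, Z_loop)}")
```

Row `j` of `W` holds the weights of neuron `j`, and `b.reshape(-1, 1)` turns the bias into a column so that broadcasting adds the same bias to every sample.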

### 3.3 Multi-Layer Forward Pass - NumPy

In [None]:
def forward_network(X, parameters):
    """
    Forward propagation through multiple layers
    
    Args:
        X: input (features x samples)
        parameters: dict with W1, b1, W2, b2, ...
    
    Returns:
        cache: dict storing all intermediate values
    """
    cache = {'A0': X}
    A = X
    L = len(parameters) // 2  # number of layers
    
    # Forward through hidden layers with ReLU
    for l in range(1, L):
        W = parameters[f'W{l}']
        b = parameters[f'b{l}']
        Z = W @ A + b.reshape(-1, 1)
        A = relu(Z)
        cache[f'Z{l}'] = Z
        cache[f'A{l}'] = A
    
    # Output layer (no activation for now)
    W = parameters[f'W{L}']
    b = parameters[f'b{L}']
    Z = W @ A + b.reshape(-1, 1)
    A = Z  # linear output
    cache[f'Z{L}'] = Z
    cache[f'A{L}'] = A
    
    return cache

# Example: 3 -> 4 -> 2 network
parameters = {
    'W1': np.random.randn(4, 3) * 0.1,
    'b1': np.random.randn(4) * 0.1,
    'W2': np.random.randn(2, 4) * 0.1,
    'b2': np.random.randn(2) * 0.1,
}

X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 4.0]])

cache = forward_network(X, parameters)

print("Network architecture: 3 inputs -> 4 hidden -> 2 outputs")
print(f"\nInput (A0): shape {cache['A0'].shape}")
print(cache['A0'])
print(f"\nHidden layer (A1): shape {cache['A1'].shape}")
print(cache['A1'])
print(f"\nOutput (A2): shape {cache['A2'].shape}")
print(cache['A2'])
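
The same loop works for any depth as long as the weight shapes chain correctly: layer $l$ needs a weight matrix of shape (units in layer $l$, units in layer $l-1$). The sketch below uses a hypothetical helper, `init_parameters`, which is not part of this notebook, to build parameters from a list of layer sizes and then traces the activation shapes.

```python
import numpy as np

def init_parameters(layer_sizes, scale=0.1, seed=0):
    """Hypothetical helper: build W1, b1, W2, b2, ... for the given layer sizes."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_sizes)):
        n_in, n_out = layer_sizes[l - 1], layer_sizes[l]
        parameters[f"W{l}"] = rng.normal(size=(n_out, n_in)) * scale
        parameters[f"b{l}"] = np.zeros(n_out)
    return parameters

layer_sizes = [3, 5, 4, 2]   # 3 inputs -> 5 hidden -> 4 hidden -> 2 outputs
parameters = init_parameters(layer_sizes)

# Trace shapes through the network (ReLU on hidden layers, linear output)
A = np.ones((3, 6))          # 6 samples, laid out as features x samples
n_layers = len(parameters) // 2
shapes = [A.shape]
for l in range(1, n_layers + 1):
    Z = parameters[f"W{l}"] @ A + parameters[f"b{l}"].reshape(-1, 1)
    A = np.maximum(0, Z) if l < n_layers else Z
    shapes.append(A.shape)

print("Activation shapes, input to output:", shapes)
```

The number of samples (the second dimension) stays fixed while the first dimension follows the layer sizes.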

## 4. Loss Functions

Loss functions measure how wrong our predictions are. During training, we minimize the loss.

### Common Loss Functions:

### 4.1 Mean Squared Error (MSE) - Regression

**Formula:** $L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Used for regression tasks (predicting continuous values).

In [None]:
def mse_loss(y_true, y_pred):
    """Mean Squared Error"""
    return np.mean((y_true - y_pred)**2)

# Example
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 2.3, 2.8])

loss = mse_loss(y_true, y_pred)
print(f"True values: {y_true}")
print(f"Predictions: {y_pred}")
print(f"MSE Loss: {loss:.4f}")
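
Breaking the MSE formula into its steps for the same three values shows where the number comes from. This is only a worked version of the formula above, with intermediate variable names chosen for illustration.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 2.3, 2.8])

errors = y_true - y_pred              # approximately [-0.1, -0.3, 0.2]
squared = errors ** 2                 # approximately [0.01, 0.09, 0.04]
mse = squared.sum() / len(y_true)     # 0.14 / 3

print(f"Errors:         {errors}")
print(f"Squared errors: {squared}")
print(f"MSE:            {mse:.4f}")
```

Because the errors are squared, the error of 0.3 contributes nine times as much as the error of 0.1, so MSE penalizes large mistakes disproportionately.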

### 4.2 Binary Cross-Entropy - Binary Classification

**Formula:** $L = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]$

Used when classifying into 2 classes (0 or 1).

In [None]:
def binary_cross_entropy(y_true, y_pred):
    """Binary Cross-Entropy Loss"""
    # Clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Example
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.2])

loss = binary_cross_entropy(y_true, y_pred)
print(f"True labels: {y_true}")
print(f"Predicted probabilities: {y_pred}")
print(f"Binary Cross-Entropy: {loss:.4f}")

# Compare with bad predictions
y_pred_bad = np.array([0.3, 0.6, 0.4, 0.2, 0.8])
loss_bad = binary_cross_entropy(y_true, y_pred_bad)
print(f"\nBad predictions: {y_pred_bad}")
print(f"Binary Cross-Entropy: {loss_bad:.4f} (higher is worse!)")
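
For each sample only one of the two terms in the formula is non-zero, so the loss is the negative log of the probability assigned to the correct label. The sketch below makes that explicit for the "good" predictions; `p_correct` and `per_sample` are names introduced here for illustration.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.2])

# Probability the model assigned to the correct label of each sample:
# y_pred when the label is 1, and 1 - y_pred when the label is 0
p_correct = np.where(y_true == 1, y_pred, 1 - y_pred)

# Each sample contributes -log(p_correct); the loss is the mean
per_sample = -np.log(p_correct)
bce = per_sample.mean()

print(f"Probability of the correct label: {p_correct}")
print(f"Per-sample loss: {np.round(per_sample, 4)}")
print(f"Binary Cross-Entropy: {bce:.4f}")
```

A confident correct prediction (probability near 1) costs almost nothing, while the loss grows without bound as the probability of the correct label approaches 0.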

### 4.3 Categorical Cross-Entropy - Multi-class Classification

**Formula:** $L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$

Used when classifying into multiple classes (e.g., cat, dog, bird).

In [None]:
def categorical_cross_entropy(y_true, y_pred):
    """
    Categorical Cross-Entropy Loss
    
    Args:
        y_true: one-hot encoded true labels (samples x classes)
        y_pred: predicted probabilities (samples x classes)
    """
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Example: 3 classes, 4 samples
y_true = np.array([[1, 0, 0],  # class 0
                   [0, 1, 0],  # class 1
                   [0, 0, 1],  # class 2
                   [1, 0, 0]]) # class 0

y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.2, 0.6],
                   [0.8, 0.1, 0.1]])

loss = categorical_cross_entropy(y_true, y_pred)
print(f"True labels (one-hot):\n{y_true}")
print(f"\nPredicted probabilities:\n{y_pred}")
print(f"\nCategorical Cross-Entropy: {loss:.4f}")
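
With one-hot labels, the inner sum over classes keeps only the term for the true class, so the loss reduces to the mean of $-\log$ of the probability predicted for that class. The sketch below checks this by indexing the predictions with integer class ids; the variable names are illustrative.

```python
import numpy as np

y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])

y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.2, 0.6],
                   [0.8, 0.1, 0.1]])

# Recover integer class ids from the one-hot rows
class_ids = y_true.argmax(axis=1)

# Pick the predicted probability of the true class for each sample
p_true_class = y_pred[np.arange(len(class_ids)), class_ids]

loss_indexed = -np.mean(np.log(p_true_class))
loss_one_hot = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

print(f"Class ids: {class_ids}")
print(f"Probability of the true class: {p_true_class}")
print(f"Loss via indexing: {loss_indexed:.4f}")
print(f"Loss via one-hot:  {loss_one_hot:.4f}")
```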

## 5. Putting It All Together: A Complete Example

In [None]:
# Generate simple dataset: XOR problem
# XOR is not linearly separable, needs a hidden layer!
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]]).T  # transpose to make it (features x samples)

y = np.array([[0],
              [1],
              [1],
              [0]]).T

print("XOR Dataset:")
print(f"Inputs:\n{X.T}")
print(f"Targets:\n{y.T}")

# Initialize small network: 2 -> 4 -> 1
np.random.seed(42)
params = {
    'W1': np.random.randn(4, 2) * 0.5,
    'b1': np.zeros(4),
    'W2': np.random.randn(1, 4) * 0.5,
    'b2': np.zeros(1),
}

# Forward pass
Z1 = params['W1'] @ X + params['b1'].reshape(-1, 1)
A1 = relu(Z1)
Z2 = params['W2'] @ A1 + params['b2'].reshape(-1, 1)
A2 = sigmoid(Z2)  # sigmoid for binary output

print(f"\nHidden layer activations (A1):\n{A1}")
print(f"\nOutput predictions (A2):\n{A2}")
print(f"\nLoss (before training): {binary_cross_entropy(y, A2):.4f}")
print("\nNote: Random weights give poor predictions. We'll learn to train this in the next notebook!")
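
To see that a hidden layer really is enough for XOR, here is a sketch with hand-picked weights. These values are an illustrative assumption, not what training would necessarily find, and the output layer is linear rather than sigmoid to keep the arithmetic exact. Two ReLU units compute $h_1 = \max(0, x_1 + x_2)$ and $h_2 = \max(0, x_1 + x_2 - 1)$, and the output is $h_1 - 2h_2$.

```python
import numpy as np

X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]]).T             # features x samples
y = np.array([[0, 1, 1, 0]])

# Hand-picked weights (illustrative, not learned)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0, -2.0]])
b2 = np.array([0.0])

A1 = np.maximum(0, W1 @ X + b1.reshape(-1, 1))   # ReLU hidden layer
out = W2 @ A1 + b2.reshape(-1, 1)                # linear output

print(f"Hidden activations:\n{A1}")
print(f"Network output: {out}")
print(f"Matches XOR targets: {np.allclose(out, y)}")
```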

## 6. PyTorch Introduction

PyTorch makes all of this much easier! Let's see how:

### 6.1 PyTorch Tensors (like NumPy arrays, but GPU-enabled)

In [None]:
# NumPy array
x_np = np.array([1.0, 2.0, 3.0])
print(f"NumPy array: {x_np}")

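# Create a PyTorch tensor directly from a Python list (default dtype: float32)
x_torch = torch.tensor([1.0, 2.0, 3.0])
print(f"PyTorch tensor: {x_torch}")

# Or convert from NumPy (shares memory with x_np and keeps its float64 dtype)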
x_torch2 = torch.from_numpy(x_np)
print(f"From NumPy: {x_torch2}")

# Convert back to NumPy
x_back = x_torch.numpy()
print(f"Back to NumPy: {x_back}")
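
Two details of these conversions are easy to miss: `torch.from_numpy` shares memory with the source array, while `torch.tensor` copies the data, and a tensor built from a Python list of floats defaults to `float32` whereas a NumPy float array is `float64`. The sketch below demonstrates both; the variable names are illustrative.

```python
import numpy as np
import torch

x_np = np.array([1.0, 2.0, 3.0])

shared = torch.from_numpy(x_np)             # shares memory with x_np
copied = torch.tensor(x_np)                 # independent copy of the data
from_list = torch.tensor([1.0, 2.0, 3.0])   # built from a Python list

# Modify the NumPy array in place
x_np[0] = 100.0

print(f"shared after edit: {shared}")    # reflects the change
print(f"copied after edit: {copied}")    # unchanged
print(f"dtypes: shared={shared.dtype}, from_list={from_list.dtype}")
```

Mixing `float64` and `float32` tensors in one operation can raise dtype errors, so it is common to convert NumPy data with `.float()` before feeding it to a model.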

### 6.2 Built-in Activation Functions

In [None]:
z_torch = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

# ReLU
relu_torch = torch.relu(z_torch)
print(f"Input: {z_torch}")
print(f"ReLU: {relu_torch}")

# Sigmoid
sigmoid_torch = torch.sigmoid(z_torch)
print(f"Sigmoid: {sigmoid_torch}")

# Tanh
tanh_torch = torch.tanh(z_torch)
print(f"Tanh: {tanh_torch}")

### 6.3 Building a Neural Network Layer in PyTorch

In [None]:
# Create a linear layer: 3 inputs -> 4 outputs
layer = nn.Linear(in_features=3, out_features=4)

print(f"Weight shape: {layer.weight.shape}")
print(f"Bias shape: {layer.bias.shape}")

# Forward pass
x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

output = layer(x)
print(f"\nInput shape: {x.shape} (2 samples, 3 features)")
print(f"Output shape: {output.shape} (2 samples, 4 neurons)")
print(f"Output:\n{output}")
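
Note the layout difference from the NumPy sections: there the input was (features x samples) and the layer computed $WX + b$, whereas `nn.Linear` expects (samples x features) and computes $xW^T + b$. The sketch below checks this against the layer's own parameters; the seed is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(in_features=3, out_features=4)
x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])    # samples x features

with torch.no_grad():
    out_layer = layer(x)
    # layer.weight has shape (out_features, in_features) = (4, 3)
    out_manual = x @ layer.weight.T + layer.bias

matches = torch.allclose(out_layer, out_manual, atol=1e-6)
print(f"nn.Linear equals x @ W.T + b: {matches}")
```

The weight matrix itself has the same (neurons x features) shape as in the NumPy code; only the data layout is transposed.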

### 6.4 Building a Complete Network in PyTorch

In [None]:
class SimpleNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 4)  # 2 inputs -> 4 hidden
        self.layer2 = nn.Linear(4, 1)  # 4 hidden -> 1 output
    
    def forward(self, x):
        x = torch.relu(self.layer1(x))  # hidden layer with ReLU
        x = torch.sigmoid(self.layer2(x))  # output with sigmoid
        return x

# Create network
model = SimpleNetwork()
print(model)

# Test with XOR data
X_torch = torch.tensor([[0., 0.],
                        [0., 1.],
                        [1., 0.],
                        [1., 1.]])

y_torch = torch.tensor([[0.],
                        [1.],
                        [1.],
                        [0.]])

# Forward pass
predictions = model(X_torch)
print(f"\nPredictions (untrained):\n{predictions}")

# Compute loss
loss_fn = nn.BCELoss()  # Binary Cross-Entropy
loss = loss_fn(predictions, y_torch)
print(f"\nLoss: {loss.item():.4f}")

## Summary

### What We Learned:

1. **Perceptron/Neuron**: The basic unit
   - Linear combination: $z = w^Tx + b$
   - Activation function: $a = \sigma(z)$

2. **Activation Functions**:
   - **Sigmoid**: Smooth, 0-1 range, has vanishing gradient
   - **Tanh**: Smooth, -1 to 1, zero-centered
   - **ReLU**: Most popular, simple, no vanishing gradient for positive inputs (but units can "die")

3. **Forward Propagation**:
   - Pass data through layers sequentially
   - Each layer: linear transform + activation
   - Cache intermediate values for backprop

4. **Loss Functions**:
   - **MSE**: For regression
   - **Binary Cross-Entropy**: For binary classification
   - **Categorical Cross-Entropy**: For multi-class

5. **PyTorch**:
   - Tensors (GPU-enabled arrays)
   - Built-in layers (nn.Linear)
   - Built-in activations and losses
   - Much cleaner code!

### Next Steps:
In the next notebook, we'll build complete feedforward networks in both NumPy and PyTorch!

## Practice Exercises

In [None]:
# Exercise 1: Implement LeakyReLU
# LeakyReLU(z) = max(alpha*z, z), with alpha = 0.01 by default
# This mitigates the "dead ReLU" problem by giving negative inputs a small non-zero slope
def leaky_relu(z, alpha=0.01):
    # Your code here
    pass

# Test it
# z_test = np.array([-2, -1, 0, 1, 2])
# print(leaky_relu(z_test))

In [None]:
# Exercise 2: Compute forward pass for a 3->5->2 network manually
# Use your own random weights and ReLU activation
# Your code here


In [None]:
# Exercise 3: Build a network in PyTorch with layer sizes 2->8->4->1 (three nn.Linear layers)
# Use ReLU for hidden layers and sigmoid for output
# Your code here
