# HPXPy Neural Network with Backpropagation

This notebook implements a single-hidden-layer neural network using HPXPy, demonstrating:
1. **Forward propagation** through the hidden layer with sigmoid activation
2. **Backpropagation** for gradient computation via the chain rule
3. The **data-parallel SGD** pattern used in modern deep learning frameworks

## Network Architecture

```
Input Layer (n features)
    ↓ W_hidden, b_hidden
Hidden Layer (h neurons) + Sigmoid
    ↓ W_output, b_output  
Output Layer (m outputs) + Sigmoid
```

## Key Operations

**Forward Pass:**
- hidden = σ(X @ W_h + b_h)
- output = σ(hidden @ W_o + b_o)

where σ(z) = 1/(1 + e^(-z))

**Backward Pass (Chain Rule):**
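- δ_output = (y - output) * σ'(output)
- δ_hidden = (δ_output @ W_o^T) * σ'(hidden)
- ∇W_o = hidden^T @ δ_output
- ∇W_h = X^T @ δ_hidden

Here σ' is evaluated from the activation a = σ(z), so σ'(z) = a * (1 - a) for a = output or a = hidden.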
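
The sign convention in these formulas can be checked with a short derivation. As a sketch, assume the loss $L = \tfrac{1}{2}\sum (y - \hat{y})^2$ (an assumption chosen so that constants cancel; the MSE reported by the notebook differs from it only by a positive constant factor), with $\hat{y}$ the output, $a_h$ the hidden activation, and $z_o$, $z_h$ the pre-activations of the two layers:

$$
\begin{aligned}
\frac{\partial L}{\partial z_o} &= -(y - \hat{y}) \odot \hat{y} \odot (1 - \hat{y}) = -\delta_{\text{output}} \\
\frac{\partial L}{\partial W_o} &= a_h^{T}\,\frac{\partial L}{\partial z_o} = -\nabla W_o \\
\frac{\partial L}{\partial z_h} &= \left(\frac{\partial L}{\partial z_o}\,W_o^{T}\right) \odot a_h \odot (1 - a_h) = -\delta_{\text{hidden}} \\
\frac{\partial L}{\partial W_h} &= X^{T}\,\frac{\partial L}{\partial z_h} = -\nabla W_h
\end{aligned}
$$

Under this assumption, the quantities written ∇W_o and ∇W_h are the negatives of the true loss gradients, so adding them to the weights (scaled by the learning rate) is a gradient-descent step.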

**Distributed Pattern:** Data parallelism
- Distribute mini-batches across nodes
- Each node computes local gradients
- All-reduce to aggregate gradients
- Synchronous weight updates

**Inspiration:** Based on a Phylanx example, originally from Analytics Vidhya

In [None]:
import time
import numpy as np
import hpxpy as hpx

hpx.init(num_threads=4)

## Generate Training Data

In [None]:
# Configuration
input_neurons = 4
hidden_neurons = 3
output_neurons = 1
learning_rate = 0.1
num_iterations = 200

print("Network Configuration:")
print(f"  Input neurons: {input_neurons}")
print(f"  Hidden neurons: {hidden_neurons}")
print(f"  Output neurons: {output_neurons}")
print(f"  Learning rate: {learning_rate}")
print(f"  Iterations: {num_iterations}")

# Training data: a toy binary problem with 3 samples and 4 features
X = np.array([[1, 0, 1, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1]], dtype=np.float64)

y = np.array([[1],
              [1],
              [0]], dtype=np.float64)

print("\nTraining Data:")
print(f"  Samples: {X.shape[0]}")
print(f"  Features: {X.shape[1]}")
print(f"  X:\n{X}")
print(f"  y:\n{y.ravel()}")

## Neural Network Implementation

In [None]:
def sigmoid(z):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(a):
    """Derivative of sigmoid, given the activation a = σ(z): σ'(z) = a * (1 - a)."""
    return a * (1.0 - a)

def train_neural_network(X, y, hidden_neurons, learning_rate, num_iterations):
    """
    Train a single-hidden-layer neural network.
    
    Args:
        X: Input features (samples × features)
        y: Target outputs (samples × outputs)
        hidden_neurons: Number of hidden layer neurons
        learning_rate: Learning rate for gradient descent
        num_iterations: Number of training iterations
    
    Returns:
        W_hidden, b_hidden, W_output, b_output: Trained weights and biases
        final_output: Predictions on X after training
        losses: MSE recorded at each iteration
    """
    num_samples, num_features = X.shape
    num_outputs = y.shape[1]
    
    # Initialize weights and biases randomly
    np.random.seed(0)
    W_hidden = np.random.uniform(size=(num_features, hidden_neurons))
    b_hidden = np.random.uniform(size=(1, hidden_neurons))
    W_output = np.random.uniform(size=(hidden_neurons, num_outputs))
    b_output = np.random.uniform(size=(1, num_outputs))
    
    print("\nInitialized weights:")
    print(f"  W_hidden: {W_hidden.shape}")
    print(f"  b_hidden: {b_hidden.shape}")
    print(f"  W_output: {W_output.shape}")
    print(f"  b_output: {b_output.shape}")
    
    losses = []
    start_time = time.perf_counter()
    
    for iteration in range(num_iterations):
        # === FORWARD PASS ===
        # Hidden layer
        hidden_input = X @ W_hidden + b_hidden
        hidden_activation = sigmoid(hidden_input)
        
        # Output layer
        output_input = hidden_activation @ W_output + b_output
        output = sigmoid(output_input)
        
        # === COMPUTE LOSS ===
        error = y - output
        loss = np.mean(error ** 2)
        losses.append(loss)
        
        # === BACKWARD PASS (Backpropagation) ===
        # Output layer gradients
        delta_output = error * sigmoid_derivative(output)
        
        # Hidden layer gradients (chain rule)
        error_hidden = delta_output @ W_output.T
        delta_hidden = error_hidden * sigmoid_derivative(hidden_activation)
        
        # === COMPUTE WEIGHT GRADIENTS ===
        grad_W_output = hidden_activation.T @ delta_output
        grad_b_output = np.sum(delta_output, axis=0, keepdims=True)
        
        grad_W_hidden = X.T @ delta_hidden
        grad_b_hidden = np.sum(delta_hidden, axis=0, keepdims=True)
        
        # === UPDATE WEIGHTS ===
        # error = y - output, so the grad_* terms are negative loss gradients;
        # adding them performs gradient descent on the loss.
        W_output += learning_rate * grad_W_output
        b_output += learning_rate * grad_b_output
        W_hidden += learning_rate * grad_W_hidden
        b_hidden += learning_rate * grad_b_hidden
        
        # Print progress
        if (iteration + 1) % 50 == 0:
            print(f"  Iteration {iteration+1}/{num_iterations}: Loss={loss:.6f}")
    
    total_time = time.perf_counter() - start_time
    print(f"\nTraining complete in {total_time*1000:.1f}ms")
    
    # Final predictions
    hidden_activation = sigmoid(X @ W_hidden + b_hidden)
    final_output = sigmoid(hidden_activation @ W_output + b_output)
    
    return W_hidden, b_hidden, W_output, b_output, final_output, losses

# Train the network
print("\nTraining Neural Network...")
W_h, b_h, W_o, b_o, predictions, losses = train_neural_network(
    X, y, hidden_neurons, learning_rate, num_iterations
)
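
A common way to validate a hand-written backward pass is a finite-difference gradient check. The following is a minimal sketch, not part of the original notebook: the helper names (`half_sse_loss`, `backprop_updates`, `numerical_gradients`) are hypothetical, and the loss is assumed to be half the sum of squared errors so that the update directions from the training loop should equal the negative numerical gradients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def half_sse_loss(params, X, y):
    """Assumed loss: 0.5 * sum((y - output)^2) for a one-hidden-layer sigmoid net."""
    W_h, b_h, W_o, b_o = params
    hidden = sigmoid(X @ W_h + b_h)
    output = sigmoid(hidden @ W_o + b_o)
    return 0.5 * np.sum((y - output) ** 2)

def backprop_updates(params, X, y):
    """Update directions with the same formulas as the training loop."""
    W_h, b_h, W_o, b_o = params
    hidden = sigmoid(X @ W_h + b_h)
    output = sigmoid(hidden @ W_o + b_o)
    delta_output = (y - output) * output * (1.0 - output)
    delta_hidden = (delta_output @ W_o.T) * hidden * (1.0 - hidden)
    return [X.T @ delta_hidden,
            delta_hidden.sum(axis=0, keepdims=True),
            hidden.T @ delta_output,
            delta_output.sum(axis=0, keepdims=True)]

def numerical_gradients(params, X, y, eps=1e-6):
    """Central finite differences of the loss for every parameter entry."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            plus = half_sse_loss(params, X, y)
            p[idx] = old - eps
            minus = half_sse_loss(params, X, y)
            p[idx] = old  # restore the original value
            g[idx] = (plus - minus) / (2 * eps)
        grads.append(g)
    return grads

X_chk = np.array([[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1]], dtype=np.float64)
y_chk = np.array([[1], [1], [0]], dtype=np.float64)
rng = np.random.default_rng(0)
params = [rng.uniform(size=(4, 3)), rng.uniform(size=(1, 3)),
          rng.uniform(size=(3, 1)), rng.uniform(size=(1, 1))]

updates = backprop_updates(params, X_chk, y_chk)
numeric = numerical_gradients(params, X_chk, y_chk)
# Updates are negative gradients, so update + gradient should be close to zero.
max_diff = max(np.max(np.abs(u + g)) for u, g in zip(updates, numeric))
print(f"max |update + dL/dparam| = {max_diff:.2e}")
```

If the check passes, `max_diff` is tiny compared with the gradient magnitudes; a large value would point to a sign or transpose error in the backward pass.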

## Evaluate Results

In [None]:
print("\nFinal Predictions vs Ground Truth:")
print(f"{'Sample':<8} {'Target':<10} {'Predicted':<12} {'Error':<10}")
print("-" * 45)
for i in range(len(X)):
    target = y[i, 0]
    pred = predictions[i, 0]
    error = abs(target - pred)
    print(f"{i:<8} {target:<10.4f} {pred:<12.6f} {error:<10.6f}")

final_loss = np.mean((y - predictions) ** 2)
print(f"\nFinal MSE: {final_loss:.6f}")
print(f"Initial Loss: {losses[0]:.6f}")
print(f"Loss Reduction: {(losses[0] - final_loss) / losses[0] * 100:.1f}%")

## Visualize Training Progress

In [None]:
try:
    import matplotlib.pyplot as plt
    
    plt.figure(figsize=(10, 5))
    plt.plot(losses, linewidth=2)
    plt.xlabel('Iteration')
    plt.ylabel('Loss (MSE)')
    plt.title('Neural Network Training Progress')
    plt.grid(True, alpha=0.3)
    plt.yscale('log')
    plt.show()
    
except ImportError:
    print("matplotlib not available for visualization")

## Distributed Neural Network Training: Data Parallelism

### How Modern Deep Learning Frameworks Distribute Training

Neural networks use **data parallelism** for distributed training:

```python
# Pseudocode: initialize_weights, my_batch_indices, and all_reduce are placeholders.
# Each node holds a full copy of the model (weights and biases)
W_hidden_local, b_hidden_local, W_output_local, b_output_local = initialize_weights()

for epoch in range(num_epochs):
    # Split mini-batch across nodes
    X_local = X[my_batch_indices]  # Each node: batch_size / num_nodes samples
    y_local = y[my_batch_indices]
    
    # === LOCAL COMPUTATION (No communication) ===
    # Forward pass
    hidden = sigmoid(X_local @ W_hidden_local + b_hidden_local)
    output = sigmoid(hidden @ W_output_local + b_output_local)
    
    # Backward pass (uses the local weight copy)
    delta_output = (y_local - output) * sigmoid_derivative(output)
    delta_hidden = (delta_output @ W_output_local.T) * sigmoid_derivative(hidden)
    
    # Compute local gradients
    grad_W_output_local = hidden.T @ delta_output
    grad_W_hidden_local = X_local.T @ delta_hidden
    
    # === GLOBAL SYNCHRONIZATION (All-reduce) ===
    grad_W_output = all_reduce(grad_W_output_local, op='sum') / num_nodes
    grad_W_hidden = all_reduce(grad_W_hidden_local, op='sum') / num_nodes
    
    # === LOCAL WEIGHT UPDATE ===
    # Bias gradients are reduced and applied the same way (omitted for brevity)
    W_output_local += learning_rate * grad_W_output
    W_hidden_local += learning_rate * grad_W_hidden
```
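
The pattern above can be simulated in a single process to see why it is equivalent to full-batch training. This is a minimal NumPy sketch, not an HPXPy API: `simulated_all_reduce_sum` is a hypothetical stand-in for a real collective, and the data and weights are random values made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_gradients(X, y, Wh, bh, Wo, bo):
    """Weight update directions for one shard (same formulas as the training loop)."""
    hidden = sigmoid(X @ Wh + bh)
    output = sigmoid(hidden @ Wo + bo)
    delta_output = (y - output) * output * (1.0 - output)
    delta_hidden = (delta_output @ Wo.T) * hidden * (1.0 - hidden)
    return hidden.T @ delta_output, X.T @ delta_hidden

def simulated_all_reduce_sum(arrays):
    """Hypothetical stand-in for an all-reduce: sums per-worker arrays locally."""
    return np.sum(arrays, axis=0)

rng = np.random.default_rng(0)
num_workers = 4
X_all = rng.uniform(size=(16, 4))
y_all = rng.integers(0, 2, size=(16, 1)).astype(np.float64)
Wh, bh = rng.uniform(size=(4, 3)), rng.uniform(size=(1, 3))
Wo, bo = rng.uniform(size=(3, 1)), rng.uniform(size=(1, 1))

# Each simulated worker gets an equal slice of the batch.
X_shards = np.array_split(X_all, num_workers)
y_shards = np.array_split(y_all, num_workers)
per_worker = [local_gradients(Xs, ys, Wh, bh, Wo, bo)
              for Xs, ys in zip(X_shards, y_shards)]

grad_Wo_sum = simulated_all_reduce_sum([g[0] for g in per_worker])
grad_Wh_sum = simulated_all_reduce_sum([g[1] for g in per_worker])

# Reference: the same gradients computed on the whole batch at once.
full_Wo, full_Wh = local_gradients(X_all, y_all, Wh, bh, Wo, bo)
print(np.allclose(grad_Wo_sum, full_Wo), np.allclose(grad_Wh_sum, full_Wh))
```

Because these gradients are sums over samples, the summed shard gradients match the full-batch gradients. Dividing by `num_nodes`, as the pseudocode does, only rescales the effective learning rate.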

### Communication Analysis

**Per iteration:**
- All-reduce gradients: total size of all weight matrices and bias vectors
- For a network (n_in, n_hidden, n_out): (n_in × n_hidden + n_hidden + n_hidden × n_out + n_out) floats, e.g. 19 for the 4-3-1 network
- Communication time: O(model_size / bandwidth)

**Communication efficiency improves with:**
1. Larger batch sizes (more local work per sync)
2. Fewer parameters (smaller all-reduce)
3. Faster interconnect (higher bandwidth)
4. Gradient compression techniques
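
To make the per-iteration volume concrete, the sketch below counts parameters for a single-hidden-layer network and converts the count to a gradient payload size. The function names are hypothetical, and 8 bytes per value assumes `float64` gradients as used in this notebook; the actual traffic on the wire depends on the all-reduce algorithm and is not modeled here.

```python
def parameter_count(n_in, n_hidden, n_out, include_biases=True):
    """Number of trainable values in an (n_in, n_hidden, n_out) network."""
    weights = n_in * n_hidden + n_hidden * n_out
    biases = n_hidden + n_out if include_biases else 0
    return weights + biases

def gradient_payload_bytes(num_params, bytes_per_value=8):
    """Size of one full gradient, assuming float64 values by default."""
    return num_params * bytes_per_value

for name, shape in [("Toy", (4, 3, 1)), ("Small", (100, 50, 10))]:
    n = parameter_count(*shape)
    print(f"{name}: {n} parameters, {gradient_payload_bytes(n)} bytes per gradient")
```

With biases included, the toy network has 19 parameters and the 100-50-10 network has 5,560.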

### Scaling Projection

| Model | Parameters | Nodes | Batch/Node | Compute Time | Comm Time | Speedup |
|-------|------------|-------|------------|--------------|-----------|----------|
| Toy (4-3-1) | 19 | 1 | 32 | 10ms | 0ms | 1x |
| Toy (4-3-1) | 19 | 8 | 4 | 1.25ms | 0.01ms | 8x |
| Small (100-50-10) | 5,560 | 16 | 64 | 50ms | 0.5ms | 15.7x |
| Medium (1K-512-100) | 564K | 64 | 128 | 200ms | 5ms | 62x |
| Large (10K-4K-1K) | 44M | 256 | 256 | 1s | 50ms | 244x |
| ResNet-50 | 25M | 512 | 64 | 300ms | 20ms | 480x |
| GPT-3 | 175B | 10K | 4 | 5s | 200ms | 9,600x |
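
The source does not state how the Speedup column was derived. One simple model that is consistent with several rows is to scale the node count by the fraction of each iteration spent computing rather than communicating. The sketch below assumes that model (no compute/communication overlap); the function name is hypothetical, and the table's illustrative figures need not follow it exactly.

```python
def projected_speedup(num_nodes, compute_ms, comm_ms):
    """Assumed model: ideal speedup scaled by the compute fraction of an iteration."""
    efficiency = compute_ms / (compute_ms + comm_ms)
    return num_nodes * efficiency

rows = [("Medium", 64, 200, 5), ("Large", 256, 1000, 50), ("ResNet-50", 512, 300, 20)]
for name, nodes, compute_ms, comm_ms in rows:
    print(f"{name}: {projected_speedup(nodes, compute_ms, comm_ms):.1f}x")
```

Under this model, efficiency drops as the communication time grows relative to the compute time, which is why larger per-node batches help.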

### Real-World Deep Learning Training

**ImageNet (Computer Vision):**
- Dataset: 1.3M images, 1000 classes
- Model: ResNet-50 (25M parameters)
- Setup: 8× NVIDIA V100 GPUs
- Training time: ~1 hour (vs 7 hours on 1 GPU)

**BERT (NLP):**
- Dataset: 3.3B words (Wikipedia + BookCorpus)
- Model: 340M parameters
- Setup: 64× TPU v3 chips
- Training time: 4 days

**GPT-3 (Large Language Model):**
- Dataset: 570GB of filtered text (from ~45TB of raw Common Crawl data)
- Model: 175B parameters
- Setup: 10,000+ NVIDIA V100 GPUs
- Training time: ~2 weeks
- Cost: ~$4.6M

### Key Takeaways

1. **Data parallelism scales near-linearly** until communication becomes the bottleneck
2. **Critical ratio:** Computation time / Communication time > 10 for good efficiency
3. **Techniques for scaling:**
   - Synchronous SGD: All nodes sync every iteration (high accuracy, slower)
   - Asynchronous SGD: Nodes update independently (faster, can diverge)
   - Mixed precision: FP16 for forward/backward, FP32 for weights
   - Gradient accumulation: Multiple forward/backward before sync
4. **HPX advantage:** Asynchronous communication enables overlap of compute and network transfer
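
The mixed-precision item above keeps an FP32 copy of the weights because small updates can vanish in FP16. The following NumPy sketch illustrates the effect with a made-up update value (`update` is a hypothetical stand-in for learning rate times gradient); it is an illustration of the rounding behavior, not a mixed-precision training implementation.

```python
import numpy as np

update = 1e-4  # hypothetical learning_rate * gradient

# FP16 has about 3 decimal digits near 1.0, so the update rounds away.
w16 = np.float16(1.0) + np.float16(update)
# FP32 has about 7 decimal digits near 1.0, so the update is retained.
w32 = np.float32(1.0) + np.float32(update)

lost_in_fp16 = bool(w16 == np.float16(1.0))
kept_in_fp32 = bool(w32 > np.float32(1.0))
print(f"update lost in FP16: {lost_in_fp16}, kept in FP32: {kept_in_fp32}")
```

Accumulating updates in FP32 avoids this loss while the forward and backward passes can still run in FP16.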

In [None]:
hpx.finalize()
print("Neural network demo complete!")