# HPXPy Scalability Demonstration

This notebook benchmarks HPXPy against NumPy on three workloads to illustrate its parallel performance.

**Note:** This notebook runs with a single, fixed thread count. To measure thread scaling (1, 2, 4, 8 threads), run the `scalability_demo.py` script instead; it spawns a separate process for each HPX thread count.
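
The process-per-thread-count approach mentioned in the note can be sketched as follows. This is not the contents of `scalability_demo.py`: the `DEMO_NUM_THREADS` environment variable and the child program are hypothetical stand-ins, and a real child would call `hpx.init(num_threads=...)` and run a benchmark instead of echoing the value.

```python
import os
import subprocess
import sys

# Hypothetical child program: a real one would initialize HPX with the
# requested thread count and run a benchmark. This one only echoes the value.
CHILD_CODE = "import os; print(os.environ['DEMO_NUM_THREADS'])"

def run_with_threads(num_threads):
    """Run the child in a fresh process and return its stdout, stripped."""
    env = dict(os.environ, DEMO_NUM_THREADS=str(num_threads))
    completed = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env, capture_output=True, text=True, check=True,
    )
    return completed.stdout.strip()

# One fresh process per thread count, matching the counts named in the note.
results = {n: run_with_threads(n) for n in (1, 2, 4, 8)}
print(results)
```

Giving each configuration its own process means every run starts its own runtime with the thread count it was asked for, so the timings do not influence each other.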

## What We Test

1. **Monte Carlo Pi** - Mixed workload: random generation, operators, comparison, reduction
2. **Element-wise Operations** - Pure compute: sqrt, exp, sin, power
3. **Reductions** - Memory-bound: sum, prod, min, max

In [None]:
import time
import numpy as np
import hpxpy as hpx

hpx.init(num_threads=4)
print(f"Running with {hpx.num_threads()} HPX threads")

## Benchmark 1: Monte Carlo Pi Estimation

In [None]:
def monte_carlo_pi_hpxpy(n_samples):
    """Monte Carlo Pi - demonstrates operators, random, reduction."""
    hpx.random.seed(42)
    
    x = hpx.random.uniform(0, 1, size=n_samples)
    y = hpx.random.uniform(0, 1, size=n_samples)
    distances_squared = x**2 + y**2
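    inside_mask = distances_squared <= 1
    # Convert the boolean mask to float64 via a NumPy round trip so hpx.sum can
    # count the hits. The round trip is serial and is included in the timing.
    inside_float = hpx.from_numpy(inside_mask.to_numpy().astype(float), copy=True)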
    inside_count = hpx.sum(inside_float)
    return 4 * inside_count / n_samples

def monte_carlo_pi_numpy(n_samples):
    """NumPy reference."""
    np.random.seed(42)
    x = np.random.uniform(0, 1, n_samples)
    y = np.random.uniform(0, 1, n_samples)
    inside = np.sum(x**2 + y**2 <= 1)
    return 4 * inside / n_samples

n_samples = 50_000_000

print("=" * 60)
print(f"Monte Carlo Pi Estimation ({n_samples:,} samples)")
print("=" * 60)

# Warm up HPXPy with a small run so one-time startup costs are not timed
_ = monte_carlo_pi_hpxpy(1000)

# NumPy
start = time.perf_counter()
pi_np = monte_carlo_pi_numpy(n_samples)
np_time = time.perf_counter() - start

# HPXPy
start = time.perf_counter()
pi_hpx = monte_carlo_pi_hpxpy(n_samples)
hpx_time = time.perf_counter() - start

print(f"\n{'Method':>10} | {'Time (s)':>10} | {'Pi Estimate':>12}")
print("-" * 40)
print(f"{'NumPy':>10} | {np_time:>10.4f} | {pi_np:>12.8f}")
print(f"{'HPXPy':>10} | {hpx_time:>10.4f} | {pi_hpx:>12.8f}")
print(f"\nSpeedup: {np_time/hpx_time:.2f}x")
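
The NumPy and HPXPy versions draw from different random number generators, so the two estimates in the table will not match digit for digit even with the same seed. A rough way to judge whether a difference is expected is the standard error of the estimator. The sketch below uses only the standard library and assumes independent uniform samples; the helper name `pi_estimate_std_error` is illustrative, not part of HPXPy.

```python
import math

def pi_estimate_std_error(n_samples):
    """Standard error of 4 * (hits / n) when each hit has probability pi/4."""
    p = math.pi / 4  # chance that a uniform point lands inside the quarter circle
    return 4 * math.sqrt(p * (1 - p) / n_samples)

n_samples = 50_000_000
std_error = pi_estimate_std_error(n_samples)
print(f"Expected standard error at {n_samples:,} samples: {std_error:.2e}")
```

At 50 million samples the standard error is about 2.3e-4, so estimates that agree with pi to three decimal places are consistent with a correct implementation.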

## Benchmark 2: Element-wise Operations

In [None]:
n_elements = 100_000_000

print("=" * 60)
print(f"Element-wise Operations ({n_elements:,} elements)")
print("Operations: sqrt, exp, sin, power, sum")
print("=" * 60)

# Create data
np_arr = np.arange(n_elements, dtype=np.float64)
hpx_arr = hpx.from_numpy(np_arr)

# NumPy
start = time.perf_counter()
result = np.sqrt(np_arr + 1)
result = np.exp(result * 0.001)
result = np.sin(result)
result = result ** 2
_ = np.sum(result)
np_time = time.perf_counter() - start

# HPXPy
start = time.perf_counter()
result = hpx.sqrt(hpx_arr + 1)
result = hpx.exp(result * 0.001)
result = hpx.sin(result)
result = result ** 2
_ = hpx.sum(result)
hpx_time = time.perf_counter() - start

print(f"\nNumPy:  {np_time:.4f} s")
print(f"HPXPy:  {hpx_time:.4f} s")
print(f"Speedup: {np_time/hpx_time:.2f}x")
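
Each benchmark in this notebook times a single run, which makes the numbers sensitive to system noise. A common remedy is to repeat the measurement and keep the fastest time. The helper below is a minimal sketch of that idea; `best_of` is a hypothetical name, and the workload shown is a standard-library stand-in rather than an HPXPy call.

```python
import time

def best_of(fn, repeats=5):
    """Call fn() `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload; in the notebook, fn could wrap the NumPy or HPXPy pipeline.
elapsed = best_of(lambda: sum(i * i for i in range(100_000)))
print(f"best of 5: {elapsed:.6f} s")
```

Taking the minimum rather than the mean discards runs that were slowed by unrelated activity on the machine.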

## Benchmark 3: Reduction Operations

In [None]:
n_elements = 50_000_000

print("=" * 60)
print(f"Reduction Operations ({n_elements:,} elements)")
print("Operations: sum, prod (small), min, max")
print("=" * 60)

# Create data
np_arr = np.random.randn(n_elements)
hpx_arr = hpx.from_numpy(np_arr)

n_iterations = 10

# NumPy
start = time.perf_counter()
for _ in range(n_iterations):
    _ = np.sum(np_arr)
    _ = np.prod(np_arr[:1000])  # Small slice: the product over the full array would underflow to 0
    _ = np.min(np_arr)
    _ = np.max(np_arr)
np_time = time.perf_counter() - start

# HPXPy
small_hpx = hpx.from_numpy(np_arr[:1000])
start = time.perf_counter()
for _ in range(n_iterations):
    _ = hpx.sum(hpx_arr)
    _ = hpx.prod(small_hpx)
    _ = hpx.min(hpx_arr)
    _ = hpx.max(hpx_arr)
hpx_time = time.perf_counter() - start

print(f"\nNumPy:  {np_time:.4f} s ({n_iterations} iterations)")
print(f"HPXPy:  {hpx_time:.4f} s ({n_iterations} iterations)")
print(f"Speedup: {np_time/hpx_time:.2f}x")
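
Because reductions are memory-bound, elapsed time is easier to interpret as effective memory bandwidth. The sketch below converts a timing into GB/s under the assumption that `sum`, `min`, and `max` each read the full float64 array once per iteration. The 1.5 s figure is a hypothetical placeholder, not a measured result; substitute `np_time` or `hpx_time` from the cell above.

```python
def effective_bandwidth_gb_s(n_elements, bytes_per_element, full_passes, elapsed_s):
    """Bytes read from the array divided by elapsed time, in GB/s (1 GB = 1e9 bytes)."""
    total_bytes = n_elements * bytes_per_element * full_passes
    return total_bytes / elapsed_s / 1e9

# sum, min and max each read the whole array once per iteration;
# the 1000-element prod is negligible and ignored here.
passes = 3 * 10        # 3 full-array reductions x 10 iterations
example_elapsed = 1.5  # hypothetical timing in seconds, NOT a measured result
bandwidth = effective_bandwidth_gb_s(50_000_000, 8, passes, example_elapsed)
print(f"{bandwidth:.1f} GB/s")
```

If the computed figure is already close to the machine's memory bandwidth, adding threads is unlikely to speed up these reductions much further.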

## Summary

HPXPy uses HPX's parallel execution policies, which automatically distribute work across the available threads. The observed speedup depends on:

1. **Array size** - Larger arrays amortize scheduling overhead and benefit more from parallelism
2. **Operation type** - Compute-intensive operations (Benchmark 2) scale better than memory-bound ones
3. **Memory bandwidth** - Memory-bound operations such as reductions (Benchmark 3) stop scaling once memory bandwidth is saturated
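
The effect of array size can be illustrated with a toy cost model: serial time grows linearly with the element count, while parallel time pays a fixed overhead plus the serial time divided by the thread count. The constants below (1 ns per element, 50 µs of overhead) are hypothetical values chosen for illustration, not measurements of HPXPy.

```python
def modeled_speedup(n_elements, num_threads, per_element_s=1e-9, overhead_s=50e-6):
    """Toy model: serial time n*c versus parallel time overhead + n*c/threads."""
    serial = n_elements * per_element_s
    parallel = overhead_s + serial / num_threads
    return serial / parallel

for n in (10_000, 1_000_000, 100_000_000):
    print(f"n={n:>11,}  modeled speedup on 4 threads: {modeled_speedup(n, 4):.2f}x")
```

In this model small arrays run slower in parallel than serially because the fixed overhead dominates, while large arrays approach the ideal speedup equal to the thread count. The model ignores memory bandwidth limits, which cap the speedup of memory-bound operations well below that ideal.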

### Key Performance Factors

| Factor | Impact |
|--------|--------|
| SIMD vectorization | Significant speedup on element-wise ops |
| GIL release | Python threads can execute C++ in parallel |
| HPX parallel policies | Work stealing and load balancing |
| Memory locality | Better cache utilization |

### Distributed Parallelism

Distributed parallelism (multiple processes or nodes) uses distributed arrays backed by HPX's Active Global Address Space (AGAS) to extend these benefits across multiple machines.

In [None]:
hpx.finalize()
print("Demo complete!")