# HPXPy Distributed Reduction Demo

This notebook demonstrates how collective operations enable distributed computing patterns using the **SPMD (Single Program, Multiple Data)** execution model.

This notebook runs in single-locality mode, so it demonstrates the API and the pattern rather than real distribution. In multi-locality mode, the same code would run on every locality, potentially across multiple nodes.

In [None]:
import time
import numpy as np
import hpxpy as hpx

hpx.init(num_threads=4)

## Locality Configuration

In [None]:
num_localities = hpx.collectives.get_num_localities()
locality_id = hpx.collectives.get_locality_id()

print("Locality Configuration:")
print(f"  Number of localities: {num_localities}")
print(f"  This locality ID: {locality_id}")
print(f"  HPX threads: {hpx.num_threads()}")

## Demo 1: Distributed Sum (All-Reduce)

In a real distributed scenario:
- Each locality would have a different portion of the data
- `all_reduce` combines all local sums into a global sum

In [None]:
# Simulate local data (in multi-locality, each would have different data)
np.random.seed(42 + locality_id)  # Different seed per locality
local_data = np.random.randn(1000000)
local_arr = hpx.from_numpy(local_data)

# Compute local sum
local_sum = float(hpx.sum(local_arr))
print(f"Local sum (locality {locality_id}): {local_sum:.4f}")

# All-reduce to get global sum
local_sum_arr = hpx.from_numpy(np.array([local_sum]))
global_sum_arr = hpx.all_reduce(local_sum_arr, op='sum')
global_sum = float(global_sum_arr.to_numpy()[0])

print(f"Global sum (all localities): {global_sum:.4f}")

if num_localities == 1:
    print("\nNote: In single-locality mode, global_sum == local_sum")
    print("With N localities, this would sum contributions from all N")
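
Because a single locality cannot show the reduction combining anything, the following sketch models the same steps for several localities inside one process. It uses plain Python rather than `hpxpy`, and `simulate_all_reduce_sum` is a hypothetical helper written for this illustration, not part of any library.

```python
import math
import random


def simulate_all_reduce_sum(num_localities, n_per_locality, base_seed=42):
    """Model all_reduce(op='sum') in one process (plain Python, not hpxpy)."""
    local_sums = []
    all_values = []
    for locality_id in range(num_localities):
        # Mirrors the notebook: a different seed per locality.
        rng = random.Random(base_seed + locality_id)
        local_data = [rng.gauss(0.0, 1.0) for _ in range(n_per_locality)]
        local_sums.append(math.fsum(local_data))
        all_values.extend(local_data)

    # The reduction: every locality ends up holding the same combined value.
    global_sum = math.fsum(local_sums)
    results = [global_sum] * num_localities
    # Also return the sum over all data as a reference for comparison.
    return local_sums, results, math.fsum(all_values)


local_sums, results, reference = simulate_all_reduce_sum(4, 1000)
for locality_id, local_sum in enumerate(local_sums):
    print(f"locality {locality_id}: local={local_sum:.4f} "
          f"global={results[locality_id]:.4f}")
```

Each simulated locality reports a different local sum but the same global value, which matches the sum over the full data set up to floating-point rounding.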

## Demo 2: Parameter Broadcast

The root locality (ID 0) computes the parameters, then broadcasts them to all other localities.

In [None]:
if locality_id == 0:
    # Only root does this computation
    params = np.array([0.01, 0.99, 42.0])  # learning_rate, momentum, seed
    print(f"Root computed parameters: {params}")
else:
    params = np.zeros(3)  # Other localities wait for broadcast

params_arr = hpx.from_numpy(params)
params_arr = hpx.broadcast(params_arr, root=0)
received_params = params_arr.to_numpy()

print(f"Locality {locality_id} received: {received_params}")

## Demo 3: Gather Local Statistics

Each locality computes its local statistics, which are then gathered on the root locality.

In [None]:
# Each locality computes local statistics
local_mean = float(hpx.mean(local_arr))
local_std = float(hpx.std(local_arr))
local_stats = np.array([local_mean, local_std])

print(f"Locality {locality_id} stats: mean={local_mean:.4f}, std={local_std:.4f}")

local_stats_arr = hpx.from_numpy(local_stats)
all_stats = hpx.gather(local_stats_arr, root=0)

if locality_id == 0:
    print(f"Root gathered {len(all_stats)} locality stats:")
    for i, stats in enumerate(all_stats):
        print(f"  From locality {i}: mean={stats[0]:.4f}, std={stats[1]:.4f}")
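
Once the root holds one `(mean, std)` pair per locality, it can combine them into global statistics without seeing the raw data. The sketch below shows one way to do that in plain Python. It assumes each standard deviation is a population standard deviation (`ddof=0`), which is an assumption about `hpx.std`, and it also needs the element count per locality, which the demo above does not gather. `combine_stats` is a hypothetical helper.

```python
import math
import random
import statistics


def combine_stats(stats, counts):
    """Combine per-locality (mean, std) pairs into a global (mean, std).

    Assumes each std is a population standard deviation (ddof=0).
    """
    total = sum(counts)
    global_mean = sum(n * m for (m, _), n in zip(stats, counts)) / total
    # E[x^2] on each locality is std^2 + mean^2; weight by element count.
    mean_sq = sum(n * (s * s + m * m) for (m, s), n in zip(stats, counts)) / total
    global_var = max(mean_sq - global_mean ** 2, 0.0)  # guard against rounding
    return global_mean, math.sqrt(global_var)


# Stand-in for what the root would hold after a gather: one pair per locality.
rng = random.Random(0)
chunks = [
    [rng.gauss(mu, 1.0) for _ in range(n)]
    for mu, n in [(0.0, 500), (2.0, 1500), (-1.0, 1000)]
]
gathered = [(statistics.fmean(c), statistics.pstdev(c)) for c in chunks]
counts = [len(c) for c in chunks]

global_mean, global_std = combine_stats(gathered, counts)
print(f"global mean={global_mean:.4f}, std={global_std:.4f}")
```

When every locality holds the same number of elements, as in this notebook, the weights cancel and the global mean is simply the mean of the local means.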

## Demo 4: Barrier Synchronization

A barrier blocks each locality until every locality has reached it.

In [None]:
print(f"Locality {locality_id} reaching barrier...")
start = time.perf_counter()
hpx.barrier("demo_barrier")
elapsed = time.perf_counter() - start
print(f"Locality {locality_id} passed barrier in {elapsed*1000:.3f} ms")
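
With one locality the barrier returns immediately, so the blocking behaviour is not visible. As an analogy only, the sketch below uses threads and the standard library's `threading.Barrier` in place of localities and `hpx.barrier`. It illustrates the semantics and says nothing about how HPX implements them.

```python
import threading
import time

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
arrived = []  # worker ids, in the order they reached the barrier
passed = []   # (worker_id, number of arrivals visible after the barrier)
lock = threading.Lock()


def worker(worker_id):
    time.sleep(0.01 * worker_id)  # workers arrive at different times
    with lock:
        arrived.append(worker_id)
    barrier.wait()  # blocks until all NUM_WORKERS threads have called wait()
    with lock:
        passed.append((worker_id, len(arrived)))


threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for worker_id, seen in sorted(passed):
    print(f"worker {worker_id} passed the barrier after {seen} arrivals")
```

Every worker sees all four arrivals once it is past the barrier, including the one that arrived first and had to wait the longest.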

## Distributed Computing Pattern

The SPMD pattern demonstrated above enables:

### 1. Data Parallelism

```
┌────────────────┐  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐
│   Locality 0   │  │   Locality 1   │  │   Locality 2   │  │   Locality 3   │
│  Data[0:N/4]   │  │ Data[N/4:N/2]  │  │ Data[N/2:3N/4] │  │  Data[3N/4:N]  │
└────────────────┘  └────────────────┘  └────────────────┘  └────────────────┘
        ↓                   ↓                   ↓                   ↓
┌────────────────┐  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐
│   Local Comp   │  │   Local Comp   │  │   Local Comp   │  │   Local Comp   │
└────────────────┘  └────────────────┘  └────────────────┘  └────────────────┘
        ↓                   ↓                   ↓                   ↓
┌────────────────────────────────────────────────────────────────────────────┐
│                                 ALL-REDUCE                                 │
│                           Combine local results                            │
└────────────────────────────────────────────────────────────────────────────┘
```
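
The slices in the diagram assume that N divides evenly by the number of localities. A minimal sketch of how each locality could compute its own half-open range when it does not divide evenly follows; `partition_bounds` is a hypothetical helper, not an `hpxpy` function.

```python
def partition_bounds(n, num_localities, locality_id):
    """Return the half-open range [start, stop) owned by one locality."""
    base, extra = divmod(n, num_localities)
    # The first `extra` localities take one additional element each.
    start = locality_id * base + min(locality_id, extra)
    stop = start + base + (1 if locality_id < extra else 0)
    return start, stop


n = 10
for locality_id in range(4):
    start, stop = partition_bounds(n, 4, locality_id)
    print(f"locality {locality_id}: data[{start}:{stop}]")
```

For `n = 10` and four localities this yields the ranges `[0:3]`, `[3:6]`, `[6:8]`, and `[8:10]`, so the ranges are contiguous and differ in size by at most one element.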

### 2. Collective Operations

| Operation | Description |
|-----------|-------------|
| `all_reduce` | Combine values, result on all localities |
| `broadcast` | Send from one to all |
| `gather` | Collect from all to one |
| `scatter` | Distribute from one to all |
| `barrier` | Synchronize all localities |
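
To make the table concrete, here is a small reference model of the data movement in plain Python, where a list index stands for a locality ID. The `model_*` functions are hypothetical names for this illustration. In particular, returning `None` on non-root localities for `gather` is a convention of this model, not a statement about what `hpxpy` returns.

```python
from functools import reduce
import operator


def model_all_reduce(values, op=operator.add):
    """Every locality ends up with the reduction of all values."""
    combined = reduce(op, values)
    return [combined] * len(values)


def model_broadcast(values, root=0):
    """Every locality ends up with the root's value."""
    return [values[root]] * len(values)


def model_gather(values, root=0):
    """Only the root receives the list of all values (None elsewhere)."""
    return [list(values) if i == root else None for i in range(len(values))]


def model_scatter(chunks_on_root, num_localities):
    """The root's chunks are dealt out, one chunk per locality."""
    if len(chunks_on_root) != num_localities:
        raise ValueError("need exactly one chunk per locality")
    return list(chunks_on_root)


per_locality = [1, 2, 3, 4]  # index = locality id
print(model_all_reduce(per_locality))        # [10, 10, 10, 10]
print(model_broadcast(per_locality, root=2)) # [3, 3, 3, 3]
print(model_gather(per_locality)[0])         # [1, 2, 3, 4]
print(model_scatter([[0, 1], [2, 3], [4, 5], [6, 7]], 4)[3])  # [6, 7]
```

Read this way, `scatter` followed by `gather` is a round trip: the root hands out chunks and later collects one result per locality in ID order.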

### 3. Use Cases

- **Machine Learning**: Distributed gradient descent
- **Scientific Computing**: Domain decomposition
- **Data Analytics**: MapReduce patterns
- **Simulation**: Parallel time stepping
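
As an illustration of the first use case, the sketch below simulates data-parallel gradient descent in one process: each simulated locality computes a gradient on its own shard, the gradients are averaged (the step an all-reduce would perform), and every locality applies the same update. The model, data, and learning rate are made up for the example, and `local_gradient` is a hypothetical helper.

```python
import random


def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


rng = random.Random(0)
true_w = 3.0
# Four equally sized shards, one per simulated locality.
shards = [
    [(x, true_w * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(200))]
    for _ in range(4)
]

w, learning_rate = 0.0, 0.5
for step in range(100):
    grads = [local_gradient(w, shard) for shard in shards]  # local computation
    avg_grad = sum(grads) / len(grads)  # all-reduce (sum), then divide by count
    w -= learning_rate * avg_grad       # identical update on every locality

print(f"learned w = {w:.4f}")
```

Because the shards are equally sized, averaging the local gradients gives the same result as computing the gradient over the full data set, so the simulated localities stay in lockstep.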

### Running Multi-Locality (future)

```bash
mpirun -n 4 python script.py --hpx:threads=8
# or
srun -n 4 python script.py --hpx:threads=8
```

In [None]:
hpx.finalize()
print("Demo complete!")