PERFORMANCE REDESIGN: Batch-Oriented HPC API
CRITICAL INSIGHT: The current Fortran implementation is ALREADY HPC-optimized, with excellent Structure-of-Arrays (SoA) data structures. The task is to create batch-oriented APIs that expose this performance properly.
Current Implementation Analysis ✅
EXCELLENT HPC FOUNDATIONS:
! Already Structure of Arrays (SoA) - ideal for performance
real(dp), dimension(:,:), allocatable :: zstart, zend   ! (5, ntestpart)
real(dp), dimension(:), allocatable :: times_lost       ! (ntestpart)

! Already OpenMP parallelized - ideal for CPU parallelization
!$omp parallel firstprivate(norb)
!$omp do
do i = 1, ntestpart
  call trace_orbit(norb, i)  ! Cache-friendly particle-level parallelization
end do
!$omp end do
!$omp end parallel
MEMORY LAYOUT ANALYSIS:
- ✅ SoA Pattern: zstart(1:5, 1:ntestpart) stores coordinates as columns
- ✅ Cache Performance: column access zstart(:, ipart) is optimal
- ✅ GPU-Ready: contiguous memory, minimal pointer chasing
- ✅ Batch Scale: designed for millions of particles
- ✅ OpenMP Optimized: thread-per-particle parallelization
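The cache claim is easy to check from Python: in a Fortran-ordered (5, n) array, one particle's coordinate column is a contiguous memory slice, so zero-copy NumPy views preserve the same access pattern the Fortran loop relies on. A minimal sketch (array names here mirror the Fortran ones for illustration):

```python
import numpy as np

# Fortran-ordered (5, n_particles) array mirrors the zstart SoA layout
n_particles = 1_000
zstart = np.asfortranarray(np.zeros((5, n_particles)))

# One particle's coordinates occupy consecutive memory: a single cache line
column = zstart[:, 42]
print(column.flags['C_CONTIGUOUS'])  # True: the 5 coords are adjacent

# The same slice on a C-ordered copy is strided, not contiguous
zstart_c = np.ascontiguousarray(zstart)
print(zstart_c[:, 42].flags['C_CONTIGUOUS'])  # False: stride spans a full row
```

This is why the document can claim "GPU-ready" without changing the layout: contiguous columns transfer and coalesce cleanly.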
THE REAL ISSUE: The current pysimple bindings expose a low-level procedural interface instead of leveraging this excellent batch architecture.
Batch-Oriented API Design Strategy
1. GPU-Future Batch Operations
import simple

# LEVERAGE EXISTING: zstart(5, n_particles) SoA layout
particles = simple.ParticleBatch(n_particles=100_000)
particles.initialize_surface(vmec_file='wout.nc', s=0.9)

# LEVERAGE EXISTING: OpenMP-parallelized trace_orbit loop
results = simple.trace_orbits(
    particles=particles,
    tmax=1000.0,
    integrator='symplectic',  # Uses existing optimized integrators
    backend='openmp',         # Future: 'cuda', 'opencl'
)

# DIRECT ACCESS: existing SoA result arrays
final_pos = results.positions    # Shape: (5, n_particles) - maps to zend
loss_times = results.loss_times  # Shape: (n_particles,) - maps to times_lost
2. Memory-Efficient Batch Processing
# STREAM PROCESSING: handle memory constraints
for batch in simple.ParticleBatch.stream_from_file('particles.h5', batch_size=50_000):
    results = simple.trace_orbits(batch, tmax=100.0)
    results.save_hdf5('output.h5', append=True)
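The batching pattern behind the proposed stream_from_file can be sketched in plain NumPy: yield zero-copy SoA views of bounded width, so peak memory stays at one batch regardless of total particle count. (A real implementation would read chunks from HDF5 rather than slice an in-memory array; iter_batches is an illustrative helper, not existing API.)

```python
import numpy as np

def iter_batches(zstart: np.ndarray, batch_size: int):
    """Yield zero-copy SoA views of shape (5, <=batch_size).

    Sketch of the streaming pattern only: a production version would
    pull each chunk from disk instead of slicing a resident array.
    """
    n = zstart.shape[1]
    for start in range(0, n, batch_size):
        yield zstart[:, start:start + batch_size]  # view, no copy

zstart = np.asfortranarray(np.zeros((5, 120_000)))
sizes = [b.shape[1] for b in iter_batches(zstart, 50_000)]
print(sizes)  # [50000, 50000, 20000]
```

Because the views alias the parent array, tracing a batch in place also writes results back without an extra gather step.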
3. Performance-First Configuration
# EXPOSE EXISTING: high-performance configuration options
config = simple.Config(
    integrator='symplectic_midpoint',  # Maps to existing integmode
    openmp_threads=16,                 # Controls existing OpenMP
    memory_layout='soa',               # Explicit about existing layout
    particle_buffer_size=1_000_000,    # Maps to existing ntestpart
)
Implementation Strategy
Phase 1: Batch API Foundation (2 weeks)
import numpy as np

# Core batch data structures that wrap existing Fortran arrays
class ParticleBatch:
    def __init__(self, n_particles: int):
        # Wraps existing zstart(5, n_particles) allocation
        self._fortran_zstart = allocate_particle_arrays(n_particles)

    def initialize_surface(self, vmec_file: str, s: float):
        # Uses existing samplers.f90 routines
        sample_surface_fieldline(self._fortran_zstart, vmec_file, s)

    @property
    def positions(self) -> np.ndarray:
        # Zero-copy view of the underlying Fortran SoA array
        return self._fortran_zstart  # Shape: (5, n_particles)

# Batch processing function that drives the existing main loop
def trace_orbits(particles: ParticleBatch, **kwargs) -> BatchResults:
    # Calls the existing OpenMP-parallelized simple_main.run()
    return BatchResults(particles._fortran_zstart)
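The zero-copy claim in positions is worth guarding with an explicit check: np.shares_memory makes it cheap to verify that the property returns a view rather than a copy. A self-contained sketch, with a NumPy allocation standing in for the f2py-exposed Fortran array:

```python
import numpy as np

class ParticleBatch:
    """Stand-in sketch: a Fortran-ordered NumPy array plays the role
    of the f2py-exposed zstart(5, n_particles) allocation."""

    def __init__(self, n_particles: int):
        self._fortran_zstart = np.zeros((5, n_particles), order='F')

    @property
    def positions(self) -> np.ndarray:
        # Returning the array itself (never a copy) keeps Python overhead
        # at view-creation cost, independent of particle count
        return self._fortran_zstart

batch = ParticleBatch(10_000)
print(np.shares_memory(batch.positions, batch._fortran_zstart))  # True
print(batch.positions.flags['F_CONTIGUOUS'])                     # True
```

A shares-memory assertion like this belongs in the test suite for Phase 1, since an accidental copy would silently break the "<5% overhead" budget.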
Phase 2: GPU Preparation (1 week)
# GPU-ready data movement (future CuPy/JAX integration)
class GPUParticleBatch(ParticleBatch):
    def to_gpu(self) -> 'GPUParticleBatch':
        # Leverages the existing SoA layout for an efficient GPU transfer
        self._gpu_positions = cupy.asarray(self.positions)
        return self

    def from_gpu(self) -> ParticleBatch:
        # Efficient GPU->CPU transfer maintaining the SoA layout
        # (from_array: sketch constructor wrapping an existing SoA array)
        return ParticleBatch.from_array(cupy.asnumpy(self._gpu_positions))
Phase 3: Advanced Batch Operations (1 week)
from typing import Dict

# Advanced batch analytics leveraging SoA performance
class BatchResults:
    @property
    def confinement_statistics(self) -> Dict[str, float]:
        # Vectorized operations on the existing times_lost array
        return analyze_confinement_vectorized(self.loss_times)

    def plot_poincare_batch(self, n_sample: int = 1000):
        # Efficient visualization of a particle subset
        sample_idx = np.random.choice(self.n_particles, n_sample, replace=False)
        plot_trajectories(self.positions[:, sample_idx])  # SoA slicing
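analyze_confinement_vectorized is only named above; one plausible shape for it is a handful of NumPy reductions over times_lost. The convention assumed below (particles confined for the whole run carry loss_times == tmax) is illustrative; any other sentinel would change only the mask:

```python
from typing import Dict
import numpy as np

def analyze_confinement_vectorized(loss_times: np.ndarray,
                                   tmax: float) -> Dict[str, float]:
    """Sketch of batch confinement statistics.

    Assumes, for illustration, that particles confined for the full run
    have loss_times == tmax; a negative or NaN sentinel would change
    only the mask below.
    """
    lost = loss_times < tmax  # boolean mask, one vectorized pass
    return {
        'loss_fraction': float(np.mean(lost)),
        'mean_loss_time': float(np.mean(loss_times[lost])) if lost.any() else tmax,
    }

times_lost = np.array([0.25, 1.0, 1.0, 0.75])  # tmax = 1.0
print(analyze_confinement_vectorized(times_lost, tmax=1.0))
# {'loss_fraction': 0.5, 'mean_loss_time': 0.5}
```

Because these are whole-array reductions, the cost stays a single pass over times_lost even at millions of particles, exactly the workload the SoA layout was built for.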
Performance Requirements
1. Zero Data Structure Changes
- ✅ Preserve existing zstart(5, ntestpart) SoA layout
- ✅ Maintain existing OpenMP parallelization patterns
- ✅ Keep existing integrator performance optimizations
- ✅ No changes to core simulation kernels
2. API Performance Targets
- <5% Python overhead vs direct Fortran execution
- Zero-copy array access where possible
- Memory-efficient streaming for large particle counts
- GPU-ready data layouts (already achieved with SoA)
3. Scalability Validation
- Test with 1M+ particles (leverages existing capability)
- OpenMP scaling verification (uses existing parallelization)
- Memory usage profiling (existing SoA is optimal)
- Cache performance analysis (existing column access is optimal)
Success Metrics
- HPC Performance: Python API maintains >95% of Fortran performance
- Batch Operations: Natural APIs for processing millions of particles
- GPU Readiness: Data structures optimized for future GPU porting
- Memory Efficiency: Leverage existing SoA patterns for optimal cache usage
- Developer Experience: Clean batch-oriented APIs that expose HPC capabilities
Architecture Benefits
- Preserve Excellence: Maintains existing high-performance Fortran core
- Expose Power: Makes HPC capabilities easily accessible via Python
- Future-Proof: SoA layout perfect for GPU acceleration
- No Regressions: Zero impact on existing simple.x performance
- Scalable Design: APIs designed around million-particle use cases
This approach recognizes and preserves the excellent HPC engineering already present in SIMPLE while creating modern APIs that properly expose this performance to Python users.