
Clean up Python API #90

@krystophny

Description

PERFORMANCE REDESIGN: Batch-Oriented HPC API

CRITICAL INSIGHT: The current Fortran implementation is ALREADY HPC/SoA-optimized with excellent data structures. The task is to create batch-oriented APIs that expose this performance properly.

Current Implementation Analysis ✅

EXCELLENT HPC FOUNDATIONS:

! Already Structure of Arrays (SoA) - PERFECT for performance
real(dp), dimension(:,:), allocatable :: zstart, zend  ! (5, ntestpart)
real(dp), dimension(:), allocatable :: times_lost      ! (ntestpart)

! Already OpenMP optimized - PERFECT for CPU parallelization
!$omp parallel firstprivate(norb)
!$omp do
do i = 1, ntestpart
  call trace_orbit(norb, i)  ! Cache-friendly particle-level parallelization
end do
!$omp end do
!$omp end parallel

MEMORY LAYOUT ANALYSIS:

  • SoA Pattern: zstart(1:5, 1:ntestpart) - coordinates as columns
  • Cache Performance: Column access zstart(:, ipart) is optimal
  • GPU-Ready: Contiguous memory, minimal pointer chasing
  • Batch Scale: Designed for millions of particles
  • OpenMP Optimized: Thread-per-particle parallelization
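
To make the cache argument concrete, here is a small NumPy illustration (not part of SIMPLE) of why Fortran-ordered (5, n) storage makes per-particle access contiguous:

import numpy as np

n = 1_000_000
zstart = np.zeros((5, n), order='F')        # mirrors Fortran zstart(5, ntestpart)

particle = zstart[:, 42]                    # one particle's 5 coordinates
assert particle.flags.c_contiguous          # contiguous in memory: cache-friendly
assert np.shares_memory(particle, zstart)   # a view, not a copy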

THE REAL ISSUE: The current pysimple exposes a low-level procedural interface instead of leveraging this excellent batch architecture.

Batch-Oriented API Design Strategy

1. GPU-Future Batch Operations

import simple

# LEVERAGE EXISTING: zstart(5, n_particles) SoA layout
particles = simple.ParticleBatch(n_particles=100_000)
particles.initialize_surface(vmec_file='wout.nc', s=0.9)

# LEVERAGE EXISTING: OpenMP parallelized trace_orbit loop  
results = simple.trace_orbits(
    particles=particles,
    tmax=1000.0,
    integrator='symplectic',  # Uses existing optimized integrators
    backend='openmp'          # Future: 'cuda', 'opencl'
)

# DIRECT ACCESS: Existing SoA result arrays
final_pos = results.positions     # Shape: (5, n_particles) - maps to zend
loss_times = results.loss_times   # Shape: (n_particles,) - maps to times_lost

2. Memory-Efficient Batch Processing

# STREAM PROCESSING: Handle memory constraints
for batch in simple.ParticleBatch.stream_from_file('particles.h5', batch_size=50_000):
    results = simple.trace_orbits(batch, tmax=100.0)
    results.save_hdf5('output.h5', append=True)
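
A minimal sketch of how stream_from_file could work as a generator, assuming particle data is stored as an HDF5 dataset named "zstart" with shape (5, N); the dataset name and return type are illustrative, not a final API:

import h5py
import numpy as np

def stream_from_file(path: str, batch_size: int):
    """Yield Fortran-ordered (5, batch_size) position blocks from HDF5."""
    with h5py.File(path, 'r') as f:
        zstart = f['zstart']                 # assumed dataset, shape (5, N)
        n_total = zstart.shape[1]
        for start in range(0, n_total, batch_size):
            stop = min(start + batch_size, n_total)
            # Copy each slice into Fortran order so column access stays fast
            yield np.asfortranarray(zstart[:, start:stop])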

3. Performance-First Configuration

# EXPOSE EXISTING: High-performance configuration options
config = simple.Config(
    integrator='symplectic_midpoint',    # Maps to existing integmode
    openmp_threads=16,                   # Controls existing OpenMP
    memory_layout='soa',                 # Explicit about existing layout
    particle_buffer_size=1_000_000       # Maps to existing ntestpart
)
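
One way to realize this is a plain dataclass; the sketch below uses the field names from the snippet above and is a hypothetical shape for the config object, not an existing class:

import os
from dataclasses import dataclass

@dataclass
class Config:
    integrator: str = 'symplectic_midpoint'  # would map to existing integmode
    openmp_threads: int = 16                 # would control existing OpenMP
    memory_layout: str = 'soa'               # documents the existing layout
    particle_buffer_size: int = 1_000_000    # would map to existing ntestpart

    def apply(self) -> None:
        # Setting OMP_NUM_THREADS before the Fortran library creates its
        # thread pool is the standard way to control OpenMP from Python.
        os.environ['OMP_NUM_THREADS'] = str(self.openmp_threads)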

Implementation Strategy

Phase 1: Batch API Foundation (2 weeks)

# Core batch data structures that wrap existing Fortran arrays
import numpy as np

class ParticleBatch:
    def __init__(self, n_particles: int):
        # Wraps the existing zstart(5, n_particles) allocation;
        # allocate_particle_arrays is a thin binding layer to be written
        self._fortran_zstart = allocate_particle_arrays(n_particles)

    def initialize_surface(self, vmec_file: str, s: float):
        # Uses existing samplers.f90 functionality via the binding layer
        sample_surface_fieldline(self._fortran_zstart, vmec_file, s)

    @property
    def positions(self) -> np.ndarray:
        # Zero-copy view of the underlying Fortran SoA array
        return self._fortran_zstart  # Shape: (5, n_particles)

# Batch processing function that reuses the existing main loop
def trace_orbits(particles: ParticleBatch, **kwargs) -> 'BatchResults':
    # Calls the existing OpenMP-parallelized simple_main.run()
    return BatchResults(particles._fortran_zstart)
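
To keep the sketch self-contained until the real Fortran binding exists, allocate_particle_arrays can be stubbed as below; this placeholder is hypothetical, and the real version would return a view of the module-level zstart array rather than a fresh allocation:

import numpy as np

def allocate_particle_arrays(n_particles: int) -> np.ndarray:
    # Placeholder: a Fortran-ordered (5, n) array so column slices
    # zstart[:, i] are contiguous, matching the Fortran layout
    return np.zeros((5, n_particles), order='F')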

Phase 2: GPU Preparation (1 week)

# GPU-ready data movement (future CuPy/JAX integration)
import cupy  # future optional dependency

class GPUParticleBatch(ParticleBatch):
    def to_gpu(self) -> 'GPUParticleBatch':
        # The existing SoA layout transfers to the GPU as one
        # contiguous block, no repacking required
        self._gpu_positions = cupy.asarray(self.positions)
        return self

    def from_gpu(self) -> ParticleBatch:
        # Efficient GPU->CPU transfer maintaining the SoA layout
        self._fortran_zstart = cupy.asnumpy(self._gpu_positions)
        return self

Phase 3: Advanced Batch Operations (1 week)

# Advanced batch analytics leveraging SoA performance
from typing import Dict
import numpy as np

class BatchResults:
    @property
    def confinement_statistics(self) -> Dict[str, float]:
        # Vectorized operations on the existing times_lost array
        return analyze_confinement_vectorized(self.loss_times)

    def plot_poincare_batch(self, n_sample: int = 1000):
        # Efficient visualization of a subset of particles
        sample_idx = np.random.choice(self.n_particles, n_sample, replace=False)
        plot_trajectories(self.positions[:, sample_idx])  # SoA slicing
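
A minimal sketch of what analyze_confinement_vectorized could look like, assuming the convention that particles confined for the full run carry times_lost = tmax; the function name comes from the snippet above and the convention should be adjusted to the actual one in SIMPLE:

from typing import Dict
import numpy as np

def analyze_confinement_vectorized(loss_times: np.ndarray,
                                   tmax: float = 1000.0) -> Dict[str, float]:
    lost = loss_times < tmax                  # boolean mask, no Python loop
    return {
        'loss_fraction': float(np.mean(lost)),
        'mean_loss_time': float(np.mean(loss_times[lost])) if lost.any() else tmax,
        'median_loss_time': float(np.median(loss_times[lost])) if lost.any() else tmax,
    }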

Performance Requirements

1. Zero Data Structure Changes

  • ✅ Preserve existing zstart(5, ntestpart) SoA layout
  • ✅ Maintain existing OpenMP parallelization patterns
  • ✅ Keep existing integrator performance optimizations
  • ✅ No changes to core simulation kernels

2. API Performance Targets

  • <5% Python overhead vs direct Fortran execution
  • Zero-copy array access where possible (a quick check is sketched after this list)
  • Memory-efficient streaming for large particle counts
  • GPU-ready data layouts (already achieved with SoA)
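
The zero-copy property is easy to verify in tests; this sketch uses the Phase 1 placeholder ParticleBatch and standard NumPy checks:

import numpy as np

batch = ParticleBatch(n_particles=1000)
view = batch.positions

# Zero-copy: the property must return a view, not a copy
assert np.shares_memory(view, batch._fortran_zstart)

# Fortran order: a single particle's coordinates are contiguous
assert view.flags.f_contiguous
assert view[:, 0].flags.c_contiguous  # column slice is a contiguous vector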

3. Scalability Validation

  • Test with 1M+ particles (leverages existing capability)
  • OpenMP scaling verification (a timing harness is sketched after this list)
  • Memory usage profiling (the existing SoA layout is already compact)
  • Cache performance analysis (existing column access is optimal)
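
A scaling harness only needs to vary OMP_NUM_THREADS between runs; the binary path below is a placeholder for the existing simple.x executable or a Python entry point:

import os
import subprocess
import time

# Hypothetical harness: run the existing binary at several thread
# counts and record wall-clock time to check OpenMP scaling
for n_threads in (1, 2, 4, 8, 16):
    env = dict(os.environ, OMP_NUM_THREADS=str(n_threads))
    t0 = time.perf_counter()
    subprocess.run(['./simple.x'], env=env, check=True)
    print(f'{n_threads:2d} threads: {time.perf_counter() - t0:.1f} s')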

Success Metrics

  1. HPC Performance: Python API maintains >95% of Fortran performance
  2. Batch Operations: Natural APIs for processing millions of particles
  3. GPU Readiness: Data structures optimized for future GPU porting
  4. Memory Efficiency: Leverage existing SoA patterns for optimal cache usage
  5. Developer Experience: Clean batch-oriented APIs that expose HPC capabilities

Architecture Benefits

  1. Preserve Excellence: Maintains existing high-performance Fortran core
  2. Expose Power: Makes HPC capabilities easily accessible via Python
  3. Future-Proof: SoA layout perfect for GPU acceleration
  4. No Regressions: Zero impact on existing simple.x performance
  5. Scalable Design: APIs designed around million-particle use cases

This approach recognizes and preserves the excellent HPC engineering already present in SIMPLE while creating modern APIs that properly expose this performance to Python users.
