PERFORMANCE REDESIGN: Batch-Oriented HPC API
CRITICAL INSIGHT: The current Fortran implementation is ALREADY HPC-optimized, with excellent Structure-of-Arrays (SoA) data structures. The task is to create batch-oriented APIs that expose this performance properly.
Current Implementation Analysis ✅
EXCELLENT HPC FOUNDATIONS:
! Already Structure of Arrays (SoA) - ideal for performance
real(dp), dimension(:,:), allocatable :: zstart, zend   ! (5, ntestpart)
real(dp), dimension(:), allocatable :: times_lost       ! (ntestpart)

! Already OpenMP parallelized - ideal for CPU parallelization
!$omp parallel firstprivate(norb)
!$omp do
do i = 1, ntestpart
  call trace_orbit(norb, i)  ! Cache-friendly particle-level parallelization
end do
!$omp end do
!$omp end parallel
MEMORY LAYOUT ANALYSIS:
- ✅ SoA Pattern: zstart(1:5, 1:ntestpart) stores coordinates as columns
- ✅ Cache Performance: column access zstart(:, ipart) is optimal
- ✅ GPU-Ready: contiguous memory, minimal pointer chasing
- ✅ Batch Scale: designed for millions of particles
- ✅ OpenMP Optimized: thread-per-particle parallelization
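The cache claim is easy to check from Python: in a Fortran-ordered (5, n) array, one particle's coordinate column is a contiguous memory slice, so zero-copy NumPy views preserve the same access pattern the Fortran loop relies on. A minimal sketch (array names here mirror the Fortran ones for illustration):

```python
import numpy as np

# Fortran-ordered (5, n_particles) array mirrors the zstart SoA layout
n_particles = 1_000
zstart = np.asfortranarray(np.zeros((5, n_particles)))

# One particle's coordinates occupy consecutive memory: a single cache line
column = zstart[:, 42]
print(column.flags['C_CONTIGUOUS'])  # True: the 5 coords are adjacent

# The same slice on a C-ordered copy is strided, not contiguous
zstart_c = np.ascontiguousarray(zstart)
print(zstart_c[:, 42].flags['C_CONTIGUOUS'])  # False: stride spans a full row
```

This is why the document can claim "GPU-ready" without changing the layout: contiguous columns transfer and coalesce cleanly.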
THE REAL ISSUE: The current pysimple bindings expose a low-level procedural interface instead of leveraging this excellent batch architecture.
Batch-Oriented API Design Strategy
1. GPU-Future Batch Operations
import simple

# LEVERAGE EXISTING: zstart(5, n_particles) SoA layout
particles = simple.ParticleBatch(n_particles=100_000)
particles.initialize_surface(vmec_file='wout.nc', s=0.9)

# LEVERAGE EXISTING: OpenMP-parallelized trace_orbit loop
results = simple.trace_orbits(
    particles=particles,
    tmax=1000.0,
    integrator='symplectic',  # Uses existing optimized integrators
    backend='openmp',         # Future: 'cuda', 'opencl'
)

# DIRECT ACCESS: existing SoA result arrays
final_pos = results.positions    # Shape: (5, n_particles) - maps to zend
loss_times = results.loss_times  # Shape: (n_particles,) - maps to times_lost
2. Memory-Efficient Batch Processing
# STREAM PROCESSING: handle memory constraints
for batch in simple.ParticleBatch.stream_from_file('particles.h5', batch_size=50_000):
    results = simple.trace_orbits(batch, tmax=100.0)
    results.save_hdf5('output.h5', append=True)
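The batching pattern behind the proposed stream_from_file can be sketched in plain NumPy: yield zero-copy SoA views of bounded width, so peak memory stays at one batch regardless of total particle count. (A real implementation would read chunks from HDF5 rather than slice an in-memory array; iter_batches is an illustrative helper, not existing API.)

```python
import numpy as np

def iter_batches(zstart: np.ndarray, batch_size: int):
    """Yield zero-copy SoA views of shape (5, <=batch_size).

    Sketch of the streaming pattern only: a production version would
    pull each chunk from disk instead of slicing a resident array.
    """
    n = zstart.shape[1]
    for start in range(0, n, batch_size):
        yield zstart[:, start:start + batch_size]  # view, no copy

zstart = np.asfortranarray(np.zeros((5, 120_000)))
sizes = [b.shape[1] for b in iter_batches(zstart, 50_000)]
print(sizes)  # [50000, 50000, 20000]
```

Because the views alias the parent array, tracing a batch in place also writes results back without an extra gather step.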
3. Performance-First Configuration
# EXPOSE EXISTING: high-performance configuration options
config = simple.Config(
    integrator='symplectic_midpoint',  # Maps to existing integmode
    openmp_threads=16,                 # Controls existing OpenMP
    memory_layout='soa',               # Explicit about existing layout
    particle_buffer_size=1_000_000,    # Maps to existing ntestpart
)
Implementation Strategy
Phase 1: Batch API Foundation (2 weeks)
import numpy as np

# Core batch data structures that wrap existing Fortran arrays
class ParticleBatch:
    def __init__(self, n_particles: int):
        # Wraps existing zstart(5, n_particles) allocation
        self._fortran_zstart = allocate_particle_arrays(n_particles)

    def initialize_surface(self, vmec_file: str, s: float):
        # Uses existing samplers.f90 routines
        sample_surface_fieldline(self._fortran_zstart, vmec_file, s)

    @property
    def positions(self) -> np.ndarray:
        # Zero-copy view of the underlying Fortran SoA array
        return self._fortran_zstart  # Shape: (5, n_particles)

# Batch processing function that drives the existing main loop
def trace_orbits(particles: ParticleBatch, **kwargs) -> BatchResults:
    # Calls the existing OpenMP-parallelized simple_main.run()
    return BatchResults(particles._fortran_zstart)
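The zero-copy claim in positions is worth guarding with an explicit check: np.shares_memory makes it cheap to verify that the property returns a view rather than a copy. A self-contained sketch, with a NumPy allocation standing in for the f2py-exposed Fortran array:

```python
import numpy as np

class ParticleBatch:
    """Stand-in sketch: a Fortran-ordered NumPy array plays the role
    of the f2py-exposed zstart(5, n_particles) allocation."""

    def __init__(self, n_particles: int):
        self._fortran_zstart = np.zeros((5, n_particles), order='F')

    @property
    def positions(self) -> np.ndarray:
        # Returning the array itself (never a copy) keeps Python overhead
        # at view-creation cost, independent of particle count
        return self._fortran_zstart

batch = ParticleBatch(10_000)
print(np.shares_memory(batch.positions, batch._fortran_zstart))  # True
print(batch.positions.flags['F_CONTIGUOUS'])                     # True
```

A shares-memory assertion like this belongs in the test suite for Phase 1, since an accidental copy would silently break the "<5% overhead" budget.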
Phase 2: GPU Preparation (1 week)
# GPU-ready data movement (future CuPy/JAX integration)
class GPUParticleBatch(ParticleBatch):
    def to_gpu(self) -> 'GPUParticleBatch':
        # Leverages the existing SoA layout for an efficient GPU transfer
        self._gpu_positions = cupy.asarray(self.positions)
        return self

    def from_gpu(self) -> ParticleBatch:
        # Efficient GPU->CPU transfer maintaining the SoA layout
        # (from_array: sketch constructor wrapping an existing SoA array)
        return ParticleBatch.from_array(cupy.asnumpy(self._gpu_positions))
Phase 3: Advanced Batch Operations (1 week)
from typing import Dict

# Advanced batch analytics leveraging SoA performance
class BatchResults:
    @property
    def confinement_statistics(self) -> Dict[str, float]:
        # Vectorized operations on the existing times_lost array
        return analyze_confinement_vectorized(self.loss_times)

    def plot_poincare_batch(self, n_sample: int = 1000):
        # Efficient visualization of a particle subset
        sample_idx = np.random.choice(self.n_particles, n_sample, replace=False)
        plot_trajectories(self.positions[:, sample_idx])  # SoA slicing
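analyze_confinement_vectorized is only named above; one plausible shape for it is a handful of NumPy reductions over times_lost. The convention assumed below (particles confined for the whole run carry loss_times == tmax) is illustrative; any other sentinel would change only the mask:

```python
from typing import Dict
import numpy as np

def analyze_confinement_vectorized(loss_times: np.ndarray,
                                   tmax: float) -> Dict[str, float]:
    """Sketch of batch confinement statistics.

    Assumes, for illustration, that particles confined for the full run
    have loss_times == tmax; a negative or NaN sentinel would change
    only the mask below.
    """
    lost = loss_times < tmax  # boolean mask, one vectorized pass
    return {
        'loss_fraction': float(np.mean(lost)),
        'mean_loss_time': float(np.mean(loss_times[lost])) if lost.any() else tmax,
    }

times_lost = np.array([0.25, 1.0, 1.0, 0.75])  # tmax = 1.0
print(analyze_confinement_vectorized(times_lost, tmax=1.0))
# {'loss_fraction': 0.5, 'mean_loss_time': 0.5}
```

Because these are whole-array reductions, the cost stays a single pass over times_lost even at millions of particles, exactly the workload the SoA layout was built for.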
Performance Requirements
1. Zero Data Structure Changes
- ✅ Preserve existing zstart(5, ntestpart) SoA layout
- ✅ Maintain existing OpenMP parallelization patterns
- ✅ Keep existing integrator performance optimizations
- ✅ No changes to core simulation kernels
2. API Performance Targets
- <5% Python overhead vs direct Fortran execution
- Zero-copy array access where possible
- Memory-efficient streaming for large particle counts
- GPU-ready data layouts (already achieved with SoA)
3. Scalability Validation
- Test with 1M+ particles (leverages existing capability)
- OpenMP scaling verification (uses existing parallelization)
- Memory usage profiling (existing SoA is optimal)
- Cache performance analysis (existing column access is optimal)
Success Metrics
- HPC Performance: Python API maintains >95% of Fortran performance
- Batch Operations: Natural APIs for processing millions of particles
- GPU Readiness: Data structures optimized for future GPU porting
- Memory Efficiency: Leverage existing SoA patterns for optimal cache usage
- Developer Experience: Clean batch-oriented APIs that expose HPC capabilities
Architecture Benefits
- Preserve Excellence: Maintains existing high-performance Fortran core
- Expose Power: Makes HPC capabilities easily accessible via Python
- Future-Proof: SoA layout perfect for GPU acceleration
- No Regressions: Zero impact on existing simple.x performance
- Scalable Design: APIs designed around million-particle use cases
This approach recognizes and preserves the excellent HPC engineering already present in SIMPLE while creating modern APIs that properly expose this performance to Python users.