GPU spline accuracy degrades for small σ values

## Summary

When using `SplinedPotential` with the GPU backend (f32), potentials with small σ values (e.g. σ=2 Å for coarse-grained titratable site beads) produce inaccurate results at long separations, while the reference backend (exact f64) works correctly.

## Context

In [cif2top](https://github.com/mlund/cif2top), coarse-grained protein models have two classes of beads:
- **Backbone beads**: σ = 4.5–6.78 Å (Calvados 3 model)
- **Titratable site beads**: σ = 2.0 Å (point-charge-like, minimal LJ)

When used with [duello](https://github.com/mlund/duello), the GPU backend fails to produce correct energies (doesn't decay to zero at long separations), while the `reference` backend works as expected.

## Analysis

The issue stems from f32 precision loss in the spline pipeline:

1. **Wide grid span**: `lower_cutoff = σ × 0.6`, so for σ=2, r_min = 1.2 Å. With a typical cutoff of 20–50 Å, the spline grid spans a 17–42× range, vs 5–12× for backbone beads.

2. **f64 → f32 coefficient truncation**: Spline coefficients are computed in f64 but cast to f32 for GPU upload (`gpu.rs:pack_coeffs`). Near the steep LJ wall (r ≈ 1.2 Å, u ≈ 1500 kJ/mol), coefficients have large magnitudes. After f32 truncation, Horner evaluation accumulates rounding errors that propagate into the tail region.

3. **PowerLaw2 grid concentration**: With the default `PowerLaw2` grid, points are concentrated near r_min, leaving the tail region (where energies should be near-zero) with fewer, wider polynomial intervals — amplifying f32 residual errors.

## Potential fixes

- **Per-pair validation**: After f32 conversion, evaluate the splined potential at a few tail points and warn if the residual exceeds a threshold (e.g. 0.01 kJ/mol).
- **Adaptive grid density**: For pairs with large r_max/r_min ratios, automatically increase `n_points` or switch grid type.
- **f32 coefficient rescaling**: Normalize coefficients per interval to reduce dynamic range before GPU upload, then rescale in the shader.
- **Separate short/long-range splines**: Split into two grids at the WCA crossover (2^(1/6)σ) to keep coefficient magnitudes bounded in each region.

## Reproduction

```bash
# Generate coarse-grained topology with σ=2 titratable sites
cif2top 4lzt.cif convert -o t.xyz --top topology.yaml

# GPU backend — incorrect long-range behavior
duello scan --mol1 t.xyz --mol2 t.xyz --top topology.yaml \
  --rmin 24 --rmax 40 --dr 1.0 -r 1.0 --backend gpu

# Reference backend — correct (decays to zero)
duello scan --mol1 t.xyz --mol2 t.xyz --top topology.yaml \
  --rmin 24 --rmax 40 --dr 1.0 -r 1.0 --backend reference
```

## Environment

- interatomic 0.5.0 (from git)
- macOS (Apple Silicon)
- Default `SplineConfig` (2000 points, PowerLaw2, shift_energy=true)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU spline accuracy degrades for small σ values #7

Summary

Context

Analysis

Potential fixes

Reproduction

Environment

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

GPU spline accuracy degrades for small σ values #7

Description

Summary

Context

Analysis

Potential fixes

Reproduction

Environment

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions