Improve numerical stability of variance calculation in envelope #107
…m_values() The computational variance formula E[X²] - E[X]² suffers from catastrophic floating-point cancellation when capacity values are large or nearly identical. This produced silently wrong stdev values (e.g., 41 million instead of 0 for identical values) or complex numbers when the computed variance went negative. Replace with the numerically stable two-pass formula sum((x - mean)²) / n, iterating over the frequency map for efficiency with duplicate values. https://claude.ai/code/session_01BH7FXdY35eRtf98jo8kQiG
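The cancellation described in this commit message can be reproduced with a few lines of Python. This is a minimal sketch, not the project's code: the function names `naive_variance` and `two_pass_variance` and the sample values are assumptions chosen for illustration.

```python
def naive_variance(values):
    """Population variance via E[X^2] - E[X]^2 (prone to cancellation)."""
    n = len(values)
    mean = sum(values) / n
    mean_sq = sum(v * v for v in values) / n
    return mean_sq - mean * mean


def two_pass_variance(values):
    """Population variance via sum((x - mean)^2) / n."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


# Large offset, small spread: the exact population variance is 22.5.
values = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
print(two_pass_variance(values))  # 22.5
print(naive_variance(values))     # off by a large margin; can even be negative
```

The squares are near 1e18, where adjacent doubles are 128 apart, so the naive result cannot land on 22.5; the deviations in the two-pass version are small integers and are computed exactly.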
Pull request overview
Refactors the capacity variance/standard deviation computation in CapacityEnvelope.from_values() to reduce catastrophic cancellation risk for large-magnitude inputs, while leveraging the existing frequency map representation used for Monte Carlo outputs.
Changes:
- Replaced the computational variance formula (`E[X²] - (E[X])²`) with a deviation-based computation over the frequency map.
- Removed the `sum_squares` single-pass accumulation and added a second pass over unique values to compute variance.
    # First pass: build frequency map and compute mean
    frequencies = {}
PR description mentions refactoring Envelope.from_values(), but the code change is in CapacityEnvelope.from_values(). Please update the description (or code) to match the actual API being modified to avoid confusion for reviewers and future readers.
    # Second pass over unique values: compute variance using the
    # numerically stable formula sum((x - mean)^2) / n.
    # Iterating over the frequency map is efficient when there are
    # many duplicate values (common in Monte Carlo results).
    variance_sum = 0.0
    for value, count in frequencies.items():
        diff = value - mean_capacity
        variance_sum += count * diff * diff
    stdev_capacity = (variance_sum / n) ** 0.5
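For readers without the full diff, the two passes might fit together roughly as follows. This is a standalone sketch under stated assumptions: the function name `capacity_stats`, the use of `collections.Counter`, the running `total`, and the empty-input check are illustrative and are not taken from `CapacityEnvelope.from_values()`.

```python
from collections import Counter
import math


def capacity_stats(values):
    """Return (mean, population stdev) for a non-empty sequence of numbers."""
    n = len(values)
    if n == 0:
        # Assumption: the real method may handle empty input differently.
        raise ValueError("values must be non-empty")

    # First pass: build the frequency map and accumulate the total.
    frequencies = Counter()
    total = 0.0
    for value in values:
        frequencies[value] += 1
        total += value
    mean_capacity = total / n

    # Second pass over unique values only: sum of squared deviations,
    # weighted by how often each value occurred.
    variance_sum = 0.0
    for value, count in frequencies.items():
        diff = value - mean_capacity
        variance_sum += count * diff * diff

    return mean_capacity, math.sqrt(variance_sum / n)


mean, stdev = capacity_stats([5e9] * 1000)
print(mean, stdev)  # 5000000000.0 0.0
```

Because `variance_sum` is a sum of non-negative terms, the square root never sees a negative argument, which rules out the complex results mentioned in the commit message.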
The variance/stdev computation was changed to a new two-pass algorithm, but there are no unit tests asserting stdev_capacity correctness or demonstrating improved numerical stability (e.g., large-magnitude values that cause catastrophic cancellation in the old formula, and duplicate-heavy inputs). Please add tests that validate mean_capacity/stdev_capacity for representative cases and edge cases.
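Tests along the lines requested here could look like the sketch below. Since the module path of `CapacityEnvelope` is not shown in this thread, the sketch exercises a hypothetical stand-in, `population_stdev`, that mirrors the two-pass logic; in the real test suite the assertions would target `stdev_capacity` on the envelope returned by `from_values()`. The reference value comes from `statistics.pstdev` in the standard library.

```python
import math
import statistics


def population_stdev(values):
    # Hypothetical stand-in for the stdev logic under test.
    n = len(values)
    mean = sum(values) / n
    frequencies = {}
    for v in values:
        frequencies[v] = frequencies.get(v, 0) + 1
    variance_sum = sum(c * (v - mean) ** 2 for v, c in frequencies.items())
    return math.sqrt(variance_sum / n)


def test_identical_large_values_have_zero_stdev():
    assert population_stdev([1e12] * 500) == 0.0


def test_large_offset_matches_reference():
    values = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
    expected = statistics.pstdev(values)
    assert math.isclose(population_stdev(values), expected, rel_tol=1e-12)


def test_duplicate_heavy_input():
    # 900 copies of 10 and 100 copies of 20: mean 11, variance 9, stdev 3.
    values = [10.0] * 900 + [20.0] * 100
    assert math.isclose(population_stdev(values), 3.0, rel_tol=1e-12)
```

The functions use plain `assert` statements, so they run under pytest or can be called directly.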
Summary
Refactored the variance calculation in `Envelope.from_values()` to use a numerically stable two-pass algorithm instead of the computational formula, improving accuracy for edge cases with extreme values.
Changes
- Replaced the computational formula with the two-pass formula `sum((x - mean)²) / n`
- Removed the `sum_squares` accumulation, which can suffer from catastrophic cancellation when values are large
Implementation Details
The new approach trades a second iteration over unique values for significantly improved numerical stability. This is particularly beneficial when:
- Values are large (`E[X²]` can dominate and lose precision)
The change maintains the same computational complexity while providing better accuracy for edge cases.
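As a brief aside on why the two formulas behave differently: they are algebraically identical and differ only in how rounding error enters. The symbols below ($\mu$ for the mean, $\sigma^2$ for the population variance, $\varepsilon$ for the double-precision unit roundoff, about $1.1 \times 10^{-16}$) are introduced here for illustration.

$$
\sigma^2 \;=\; \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2 \;=\; \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; \mu^2
$$

In the right-hand form, both terms are close to $\mu^2$ when $|\mu| \gg \sigma$, and each carries a rounding error on the order of $\varepsilon\,\mu^2$. For $\mu \approx 10^9$ that is roughly $10^2$, which can exceed $\sigma^2$ itself. In the left-hand form, the subtraction happens before squaring, so the error scales with the deviations rather than with $\mu^2$.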
https://claude.ai/code/session_01BH7FXdY35eRtf98jo8kQiG
Note
Low Risk
Small, localized change to a statistical calculation; main risk is minor numeric/behavioral drift in reported `stdev_capacity` for some datasets.
Overview
Updates `CapacityEnvelope.from_values()` variance/stddev computation to a numerically stable two-pass approach (`sum((x-mean)^2)/n`) instead of `E[X^2]-(E[X])^2`, removing the `sum_squares` accumulator.
The second pass iterates over the computed `frequencies` map (unique values) to keep performance reasonable when Monte Carlo outputs contain many duplicates, while leaving the envelope output fields unchanged.
Written by Cursor Bugbot for commit c6ade7e.