Optional fastmath optimizations via env var #290

Merged: 10 commits into numbagg:main on Feb 29, 2024

Conversation

@frazane (Contributor) commented Feb 29, 2024

With this PR, fastmath optimizations can be optionally enabled via a NUMBAGG_FASTMATH environment variable. A warning is always issued when it is used. Importantly, we don't simply pass fastmath=True: we specify a set of flags that should not result in unsafe behavior, only in possibly reduced precision. The "no nans" and "no infs" fastmath flags are not used. See also the discussion in #287.

In my benchmarks I only observe a 2x performance improvement on nansum and nanmean for double-precision floats (in micro-benchmarks I saw 4x for single precision), smaller improvements on nanstd and nanvar, and no noticeable difference, or even slightly worse results, on all other aggregations.
Closes #287
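
For illustration, here is a minimal sketch of how an opt-in like this can be wired up with numba's set-of-flags `fastmath` interface. The flag names are numba's documented LLVM fastmath flags, but the exact set and wiring numbagg uses live in this PR's diff, so `_FASTMATH_FLAGS`, `_fastmath`, and `nansum_1d` below are illustrative assumptions, not numbagg's actual code:

```python
import os
import warnings

import numba

# Illustrative flag set (assumption): LLVM fastmath flags minus "nnan"/"ninf",
# so NaN and inf semantics stay correct and only precision-affecting
# transforms (reassociation, contraction, approximations) are allowed.
_FASTMATH_FLAGS = {"nsz", "arcp", "contract", "afn", "reassoc"}


def _fastmath():
    if os.environ.get("NUMBAGG_FASTMATH", "").lower() in ("1", "true"):
        warnings.warn("NUMBAGG_FASTMATH enabled: results may lose precision.")
        return _FASTMATH_FLAGS
    return False  # numba treats False as "no fastmath"


@numba.njit(fastmath=_fastmath())
def nansum_1d(a):  # hypothetical kernel, not numbagg's actual implementation
    total = 0.0
    for x in a:
        if x == x:  # false only for NaN, so NaNs are skipped
            total += x
    return total
```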

Tests

Tests are passing, including those for correctness, when NUMBAGG_FASTMATH=true, which should indicate it's safe to use. This also includes tests on large arrays (1,000,000 elements).
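
As a rough sketch of what such a correctness test can look like (assumptions: `numbagg.nansum` compared against `np.nansum` on a 1,000,000-element array, with a dtype-dependent tolerance since fastmath may reorder the summation; this is not the repo's actual test code):

```python
import numpy as np
import pytest

import numbagg  # assumes NUMBAGG_FASTMATH=true was set before import


@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_nansum_matches_numpy(dtype):
    rng = np.random.default_rng(0)
    a = rng.random(1_000_000).astype(dtype)  # positive values: sum is far from 0
    a[rng.integers(0, a.size, size=1_000)] = np.nan  # sprinkle NaNs
    # fastmath may reassociate the sum, so allow a dtype-dependent tolerance
    rtol = 1e-3 if dtype == np.float32 else 1e-7
    np.testing.assert_allclose(numbagg.nansum(a), np.nansum(a), rtol=rtol)
```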

Benchmark: Linux system with 8 skylake-avx512 CPUs

NUMBAGG_FASTMATH=true:

| func | 1D pandas | 1D bottleneck | 1D numpy | 2D pandas | 2D bottleneck | 2D numpy |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| bfill | 1.80x | 1.17x | n/a | 21.54x | 7.70x | n/a |
| ffill | 2.12x | 1.59x | n/a | 25.84x | 9.85x | n/a |
| group_nanall | 1.49x | n/a | n/a | 12.51x | n/a | n/a |
| group_nanany | 1.43x | n/a | n/a | 13.05x | n/a | n/a |
| group_nanargmax | 1.28x | n/a | n/a | 10.04x | n/a | n/a |
| group_nanargmin | 1.22x | n/a | n/a | 9.61x | n/a | n/a |
| group_nancount | 1.26x | n/a | n/a | 8.21x | n/a | n/a |
| group_nanfirst | 1.34x | n/a | n/a | 19.43x | n/a | n/a |
| group_nanlast | 1.20x | n/a | n/a | 8.27x | n/a | n/a |
| group_nanmax | 1.21x | n/a | n/a | 8.14x | n/a | n/a |
| group_nanmean | 1.38x | n/a | n/a | 13.98x | n/a | n/a |
| group_nanmin | 1.27x | n/a | n/a | 8.08x | n/a | n/a |
| group_nanprod | 1.21x | n/a | n/a | 7.62x | n/a | n/a |
| group_nanstd | 1.35x | n/a | n/a | 12.17x | n/a | n/a |
| group_nansum_of_squares | 1.65x | n/a | n/a | 21.51x | n/a | n/a |
| group_nansum | 1.41x | n/a | n/a | 13.73x | n/a | n/a |
| group_nanvar | 1.36x | n/a | n/a | 13.44x | n/a | n/a |
| move_corr | 28.32x | n/a | n/a | 148.17x | n/a | n/a |
| move_cov | 23.53x | n/a | n/a | 121.59x | n/a | n/a |
| move_exp_nancorr | 12.59x | n/a | n/a | 78.87x | n/a | n/a |
| move_exp_nancount | 3.98x | n/a | n/a | 20.27x | n/a | n/a |
| move_exp_nancov | 11.46x | n/a | n/a | 78.58x | n/a | n/a |
| move_exp_nanmean | 3.61x | n/a | n/a | 23.49x | n/a | n/a |
| move_exp_nanstd | 2.71x | n/a | n/a | 21.07x | n/a | n/a |
| move_exp_nansum | 3.23x | n/a | n/a | 21.76x | n/a | n/a |
| move_exp_nanvar | 2.99x | n/a | n/a | 20.68x | n/a | n/a |
| move_mean | 5.42x | 0.89x | n/a | 28.66x | 5.97x | n/a |
| move_std | 6.39x | 0.94x | n/a | 36.55x | 6.95x | n/a |
| move_sum | 5.17x | 0.87x | n/a | 27.45x | 6.04x | n/a |
| move_var | 5.97x | 1.01x | n/a | 34.26x | 7.14x | n/a |
| nanargmax[^5] | 6.92x | 0.91x | n/a | 5.27x | 0.91x | n/a |
| nanargmin[^5] | 7.01x | 0.90x | n/a | 5.67x | 0.92x | n/a |
| nancount | 1.84x | n/a | 1.61x | 22.48x | n/a | 13.43x |
| nanmax[^5] | 0.66x | 0.64x | 0.31x | 1.10x | 0.69x | 0.33x |
| nanmean | 8.23x | 2.17x | 9.52x | 77.79x | 17.48x | 79.58x |
| nanmin[^5] | 0.66x | 0.68x | 0.30x | 1.00x | 0.64x | 0.30x |
| nanquantile | 0.75x | n/a | 0.61x | 4.99x | n/a | 4.83x |
| nanstd | 1.55x | 1.47x | 5.47x | 13.99x | 10.81x | 44.85x |
| nansum | 7.40x | 1.95x | 8.32x | 80.83x | 18.30x | 73.13x |
| nanvar | 1.53x | 1.39x | 5.33x | 12.86x | 10.35x | 41.29x |

NUMBAGG_FASTMATH=false:

| func | 1D pandas | 1D bottleneck | 1D numpy | 2D pandas | 2D bottleneck | 2D numpy |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| bfill | 1.72x | 1.18x | n/a | 21.78x | 7.71x | n/a |
| ffill | 2.13x | 1.57x | n/a | 27.47x | 9.52x | n/a |
| group_nanall | 1.42x | n/a | n/a | 12.80x | n/a | n/a |
| group_nanany | 1.43x | n/a | n/a | 12.96x | n/a | n/a |
| group_nanargmax | 1.34x | n/a | n/a | 10.83x | n/a | n/a |
| group_nanargmin | 1.23x | n/a | n/a | 10.53x | n/a | n/a |
| group_nancount | 1.25x | n/a | n/a | 8.06x | n/a | n/a |
| group_nanfirst | 1.37x | n/a | n/a | 21.19x | n/a | n/a |
| group_nanlast | 1.26x | n/a | n/a | 9.57x | n/a | n/a |
| group_nanmax | 1.25x | n/a | n/a | 8.83x | n/a | n/a |
| group_nanmean | 1.38x | n/a | n/a | 15.23x | n/a | n/a |
| group_nanmin | 1.24x | n/a | n/a | 8.90x | n/a | n/a |
| group_nanprod | 1.30x | n/a | n/a | 8.98x | n/a | n/a |
| group_nanstd | 1.36x | n/a | n/a | 13.15x | n/a | n/a |
| group_nansum_of_squares | 1.57x | n/a | n/a | 21.60x | n/a | n/a |
| group_nansum | 1.35x | n/a | n/a | 15.07x | n/a | n/a |
| group_nanvar | 1.39x | n/a | n/a | 12.77x | n/a | n/a |
| move_corr | 24.69x | n/a | n/a | 138.69x | n/a | n/a |
| move_cov | 22.53x | n/a | n/a | 129.15x | n/a | n/a |
| move_exp_nancorr | 11.09x | n/a | n/a | 75.82x | n/a | n/a |
| move_exp_nancount | 3.95x | n/a | n/a | 21.67x | n/a | n/a |
| move_exp_nancov | 10.25x | n/a | n/a | 78.33x | n/a | n/a |
| move_exp_nanmean | 2.93x | n/a | n/a | 23.15x | n/a | n/a |
| move_exp_nanstd | 2.62x | n/a | n/a | 19.62x | n/a | n/a |
| move_exp_nansum | 3.14x | n/a | n/a | 23.23x | n/a | n/a |
| move_exp_nanvar | 2.75x | n/a | n/a | 21.07x | n/a | n/a |
| move_mean | 5.56x | 0.94x | n/a | 31.32x | 6.43x | n/a |
| move_std | 6.70x | 0.94x | n/a | 37.33x | 6.85x | n/a |
| move_sum | 5.14x | 0.92x | n/a | 28.64x | 6.39x | n/a |
| move_var | 5.94x | 1.00x | n/a | 36.89x | 7.15x | n/a |
| nanargmax[^5] | 6.12x | 0.90x | n/a | 5.42x | 0.86x | n/a |
| nanargmin[^5] | 6.85x | 0.77x | n/a | 5.55x | 0.77x | n/a |
| nancount | 1.88x | n/a | 1.48x | 23.93x | n/a | 12.99x |
| nanmax[^5] | 0.71x | 0.72x | 0.32x | 1.01x | 0.68x | 0.30x |
| nanmean | 4.76x | 1.29x | 5.49x | 36.56x | 8.55x | 39.01x |
| nanmin[^5] | 0.71x | 0.70x | 0.32x | 1.07x | 0.70x | 0.32x |
| nanquantile | 0.76x | n/a | 0.63x | 5.40x | n/a | 5.55x |
| nanstd | 1.31x | 1.31x | 4.82x | 10.66x | 8.99x | 35.21x |
| nansum | 4.88x | 1.40x | 5.49x | 38.33x | 8.25x | 34.83x |
| nanvar | 1.40x | 1.38x | 4.73x | 10.77x | 9.56x | 33.81x |

@max-sixty (Collaborator) left a comment

Awesome, very cool, thank you!

Surprising that the benchmarks don't do better, tbh. Do we know whether the initial results in #287 were an anomaly? I thought this might make up for #256 (though I note that your results on Intel are 0.30x vs 0.11x on my ARM...)

I was trying to think through whether it's possible to change the setting at runtime, so that we could add it as a parameter in the benchmarks rather than running the benchmark script twice.
I think it's possible, but not easy; it would require something like:

```python
@property
def target(self):
    if self._target_cpu:
        return "cpu"
    else:
        if _is_in_unsafe_thread_pool():
            logger.debug(
                "Numbagg detected that we're in a thread pool with workqueue threading. "
                "As a result, we're turning off parallel support to ensure numba doesn't abort. "
                "This will result in lower performance on parallelizable arrays on multi-core systems. "
                "To enable parallel support, run outside a multithreading context, or install TBB or OpenMP. "
                "Numbagg won't re-check on every call — restart your python session to reset the check. "
                "For more details, check out https://numba.readthedocs.io/en/stable/developer/threading_implementation.html#caveats"
            )
            self._target_cpu = True
            return "cpu"
        else:
            return "parallel"
```
and then clearing the cache with something like:

```python
@pytest.fixture
def clear_numba_cache(func):
    func.gufunc.cache_clear()
    yield
```

which would then mean the order of the tests matters (or we clear after every function, which would be materially slower to run).
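
For completeness, a sketch of the clear-after-every-test variant mentioned above, assuming (as in the snippet above) that each wrapped function under test is provided by a `func` fixture and exposes `gufunc.cache_clear()`:

```python
import pytest


@pytest.fixture(autouse=True)
def clear_numba_cache(func):
    yield  # run the test body first...
    func.gufunc.cache_clear()  # ...then drop the compiled gufunc afterwards
```

This keeps tests order-independent, at the cost of recompiling for every test.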

How do tests do with the flag enabled?

I'll prospectively merge so we can test it some more.

Thank you very much @frazane!

@max-sixty merged commit 2eb10fa into numbagg:main on Feb 29, 2024 (7 checks passed)
@frazane (Contributor, Author) commented Mar 1, 2024

@max-sixty purely guessing, but I think the problem with nanmin and similar aggregations is in LLVM itself. Apparently bottleneck compiles with clang, which is also based on LLVM, whereas numpy uses gcc. That would explain why both numbagg and bottleneck do worse than numpy.

@max-sixty (Collaborator) commented

Overall I think this is an interesting area to explore, but the perf gains aren't that high or widespread. So let's leave this in and see whether we can find any that are. If there are cases where it's 5x faster, that totally changes the calculus on whether we try to promote this path for users...
