Optional fastmath optimizations via env var #290

Merged: 10 commits into numbagg:main on Feb 29, 2024

Conversation

@frazane (Contributor) commented Feb 29, 2024

With this PR, fastmath optimizations can be optionally enabled via a NUMBAGG_FASTMATH environment variable. A warning is always issued when it is used. Importantly, we don't simply pass fastmath=True: we specify a set of flags that should not result in unsafe behavior, only in possibly reduced precision. The "no nans" and "no infs" fastmath flags are not used. See also the discussion in #287.

In my benchmarks I only observe a 2x performance improvement on nansum and nanmean for double-precision floats (in micro-benchmarks I saw 4x for single precision), smaller improvements on nanstd and nanvar, and no noticeable difference, or even slightly worse results, on all other aggregations.
Closes #287
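
For illustration, here is a minimal sketch of how an opt-in like this can be wired up with numba's set-of-flags `fastmath` interface. The flag names are numba's documented LLVM fastmath flags, but the exact set and wiring numbagg uses live in this PR's diff, so `_FASTMATH_FLAGS`, `_fastmath`, and `nansum_1d` below are illustrative assumptions, not numbagg's actual code:

```python
import os
import warnings

import numba

# Illustrative flag set (assumption): LLVM fastmath flags minus "nnan"/"ninf",
# so NaN and inf semantics stay correct and only precision-affecting
# transforms (reassociation, contraction, approximations) are allowed.
_FASTMATH_FLAGS = {"nsz", "arcp", "contract", "afn", "reassoc"}


def _fastmath():
    if os.environ.get("NUMBAGG_FASTMATH", "").lower() in ("1", "true"):
        warnings.warn("NUMBAGG_FASTMATH enabled: results may lose precision.")
        return _FASTMATH_FLAGS
    return False  # numba treats False as "no fastmath"


@numba.njit(fastmath=_fastmath())
def nansum_1d(a):  # hypothetical kernel, not numbagg's actual implementation
    total = 0.0
    for x in a:
        if x == x:  # false only for NaN, so NaNs are skipped
            total += x
    return total
```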

Tests

Tests are passing, including those for correctness, when NUMBAGG_FASTMATH=true, which should indicate it's safe to use. This also includes tests on large arrays (1,000,000 elements).
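
As a rough sketch of what such a correctness test can look like (assumptions: `numbagg.nansum` compared against `np.nansum` on a 1,000,000-element array, with a dtype-dependent tolerance since fastmath may reorder the summation; this is not the repo's actual test code):

```python
import numpy as np
import pytest

import numbagg  # assumes NUMBAGG_FASTMATH=true was set before import


@pytest.mark.parametrize("dtype", [np.float32, np.float64])
def test_nansum_matches_numpy(dtype):
    rng = np.random.default_rng(0)
    a = rng.random(1_000_000).astype(dtype)  # positive values: sum is far from 0
    a[rng.integers(0, a.size, size=1_000)] = np.nan  # sprinkle NaNs
    # fastmath may reassociate the sum, so allow a dtype-dependent tolerance
    rtol = 1e-3 if dtype == np.float32 else 1e-7
    np.testing.assert_allclose(numbagg.nansum(a), np.nansum(a), rtol=rtol)
```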

Benchmark: Linux system with 8 skylake-avx512 CPUs

NUMBAGG_FASTMATH=true:

| func | 1D pandas | 1D bottleneck | 1D numpy | 2D pandas | 2D bottleneck | 2D numpy |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| bfill | 1.80x | 1.17x | n/a | 21.54x | 7.70x | n/a |
| ffill | 2.12x | 1.59x | n/a | 25.84x | 9.85x | n/a |
| group_nanall | 1.49x | n/a | n/a | 12.51x | n/a | n/a |
| group_nanany | 1.43x | n/a | n/a | 13.05x | n/a | n/a |
| group_nanargmax | 1.28x | n/a | n/a | 10.04x | n/a | n/a |
| group_nanargmin | 1.22x | n/a | n/a | 9.61x | n/a | n/a |
| group_nancount | 1.26x | n/a | n/a | 8.21x | n/a | n/a |
| group_nanfirst | 1.34x | n/a | n/a | 19.43x | n/a | n/a |
| group_nanlast | 1.20x | n/a | n/a | 8.27x | n/a | n/a |
| group_nanmax | 1.21x | n/a | n/a | 8.14x | n/a | n/a |
| group_nanmean | 1.38x | n/a | n/a | 13.98x | n/a | n/a |
| group_nanmin | 1.27x | n/a | n/a | 8.08x | n/a | n/a |
| group_nanprod | 1.21x | n/a | n/a | 7.62x | n/a | n/a |
| group_nanstd | 1.35x | n/a | n/a | 12.17x | n/a | n/a |
| group_nansum_of_squares | 1.65x | n/a | n/a | 21.51x | n/a | n/a |
| group_nansum | 1.41x | n/a | n/a | 13.73x | n/a | n/a |
| group_nanvar | 1.36x | n/a | n/a | 13.44x | n/a | n/a |
| move_corr | 28.32x | n/a | n/a | 148.17x | n/a | n/a |
| move_cov | 23.53x | n/a | n/a | 121.59x | n/a | n/a |
| move_exp_nancorr | 12.59x | n/a | n/a | 78.87x | n/a | n/a |
| move_exp_nancount | 3.98x | n/a | n/a | 20.27x | n/a | n/a |
| move_exp_nancov | 11.46x | n/a | n/a | 78.58x | n/a | n/a |
| move_exp_nanmean | 3.61x | n/a | n/a | 23.49x | n/a | n/a |
| move_exp_nanstd | 2.71x | n/a | n/a | 21.07x | n/a | n/a |
| move_exp_nansum | 3.23x | n/a | n/a | 21.76x | n/a | n/a |
| move_exp_nanvar | 2.99x | n/a | n/a | 20.68x | n/a | n/a |
| move_mean | 5.42x | 0.89x | n/a | 28.66x | 5.97x | n/a |
| move_std | 6.39x | 0.94x | n/a | 36.55x | 6.95x | n/a |
| move_sum | 5.17x | 0.87x | n/a | 27.45x | 6.04x | n/a |
| move_var | 5.97x | 1.01x | n/a | 34.26x | 7.14x | n/a |
| nanargmax[^5] | 6.92x | 0.91x | n/a | 5.27x | 0.91x | n/a |
| nanargmin[^5] | 7.01x | 0.90x | n/a | 5.67x | 0.92x | n/a |
| nancount | 1.84x | n/a | 1.61x | 22.48x | n/a | 13.43x |
| nanmax[^5] | 0.66x | 0.64x | 0.31x | 1.10x | 0.69x | 0.33x |
| nanmean | 8.23x | 2.17x | 9.52x | 77.79x | 17.48x | 79.58x |
| nanmin[^5] | 0.66x | 0.68x | 0.30x | 1.00x | 0.64x | 0.30x |
| nanquantile | 0.75x | n/a | 0.61x | 4.99x | n/a | 4.83x |
| nanstd | 1.55x | 1.47x | 5.47x | 13.99x | 10.81x | 44.85x |
| nansum | 7.40x | 1.95x | 8.32x | 80.83x | 18.30x | 73.13x |
| nanvar | 1.53x | 1.39x | 5.33x | 12.86x | 10.35x | 41.29x |

NUMBAGG_FASTMATH=false:

| func | 1D pandas | 1D bottleneck | 1D numpy | 2D pandas | 2D bottleneck | 2D numpy |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| bfill | 1.72x | 1.18x | n/a | 21.78x | 7.71x | n/a |
| ffill | 2.13x | 1.57x | n/a | 27.47x | 9.52x | n/a |
| group_nanall | 1.42x | n/a | n/a | 12.80x | n/a | n/a |
| group_nanany | 1.43x | n/a | n/a | 12.96x | n/a | n/a |
| group_nanargmax | 1.34x | n/a | n/a | 10.83x | n/a | n/a |
| group_nanargmin | 1.23x | n/a | n/a | 10.53x | n/a | n/a |
| group_nancount | 1.25x | n/a | n/a | 8.06x | n/a | n/a |
| group_nanfirst | 1.37x | n/a | n/a | 21.19x | n/a | n/a |
| group_nanlast | 1.26x | n/a | n/a | 9.57x | n/a | n/a |
| group_nanmax | 1.25x | n/a | n/a | 8.83x | n/a | n/a |
| group_nanmean | 1.38x | n/a | n/a | 15.23x | n/a | n/a |
| group_nanmin | 1.24x | n/a | n/a | 8.90x | n/a | n/a |
| group_nanprod | 1.30x | n/a | n/a | 8.98x | n/a | n/a |
| group_nanstd | 1.36x | n/a | n/a | 13.15x | n/a | n/a |
| group_nansum_of_squares | 1.57x | n/a | n/a | 21.60x | n/a | n/a |
| group_nansum | 1.35x | n/a | n/a | 15.07x | n/a | n/a |
| group_nanvar | 1.39x | n/a | n/a | 12.77x | n/a | n/a |
| move_corr | 24.69x | n/a | n/a | 138.69x | n/a | n/a |
| move_cov | 22.53x | n/a | n/a | 129.15x | n/a | n/a |
| move_exp_nancorr | 11.09x | n/a | n/a | 75.82x | n/a | n/a |
| move_exp_nancount | 3.95x | n/a | n/a | 21.67x | n/a | n/a |
| move_exp_nancov | 10.25x | n/a | n/a | 78.33x | n/a | n/a |
| move_exp_nanmean | 2.93x | n/a | n/a | 23.15x | n/a | n/a |
| move_exp_nanstd | 2.62x | n/a | n/a | 19.62x | n/a | n/a |
| move_exp_nansum | 3.14x | n/a | n/a | 23.23x | n/a | n/a |
| move_exp_nanvar | 2.75x | n/a | n/a | 21.07x | n/a | n/a |
| move_mean | 5.56x | 0.94x | n/a | 31.32x | 6.43x | n/a |
| move_std | 6.70x | 0.94x | n/a | 37.33x | 6.85x | n/a |
| move_sum | 5.14x | 0.92x | n/a | 28.64x | 6.39x | n/a |
| move_var | 5.94x | 1.00x | n/a | 36.89x | 7.15x | n/a |
| nanargmax[^5] | 6.12x | 0.90x | n/a | 5.42x | 0.86x | n/a |
| nanargmin[^5] | 6.85x | 0.77x | n/a | 5.55x | 0.77x | n/a |
| nancount | 1.88x | n/a | 1.48x | 23.93x | n/a | 12.99x |
| nanmax[^5] | 0.71x | 0.72x | 0.32x | 1.01x | 0.68x | 0.30x |
| nanmean | 4.76x | 1.29x | 5.49x | 36.56x | 8.55x | 39.01x |
| nanmin[^5] | 0.71x | 0.70x | 0.32x | 1.07x | 0.70x | 0.32x |
| nanquantile | 0.76x | n/a | 0.63x | 5.40x | n/a | 5.55x |
| nanstd | 1.31x | 1.31x | 4.82x | 10.66x | 8.99x | 35.21x |
| nansum | 4.88x | 1.40x | 5.49x | 38.33x | 8.25x | 34.83x |
| nanvar | 1.40x | 1.38x | 4.73x | 10.77x | 9.56x | 33.81x |

@max-sixty (Collaborator) left a comment

Awesome, very cool, thank you!

Surprising that the benchmarks don't do better, tbh. Do we know whether the initial results in #287 were an anomaly? I thought this might make up for #256 (though I note that your results on Intel are 0.30x vs 0.11x on my ARM...)

I was trying to think through whether it's possible to change the setting at runtime, so that we could add it as a parameter in the benchmarks rather than running the benchmark script twice.
I think it's possible, but not easy; it would require something like:

```python
@property
def target(self):
    if self._target_cpu:
        return "cpu"
    else:
        if _is_in_unsafe_thread_pool():
            logger.debug(
                "Numbagg detected that we're in a thread pool with workqueue threading. "
                "As a result, we're turning off parallel support to ensure numba doesn't abort. "
                "This will result in lower performance on parallelizable arrays on multi-core systems. "
                "To enable parallel support, run outside a multithreading context, or install TBB or OpenMP. "
                "Numbagg won't re-check on every call — restart your python session to reset the check. "
                "For more details, check out https://numba.readthedocs.io/en/stable/developer/threading_implementation.html#caveats"
            )
            self._target_cpu = True
            return "cpu"
        else:
            return "parallel"
```
and then clearing the cache with something like:

```python
@pytest.fixture
def clear_numba_cache(func):
    func.gufunc.cache_clear()
    yield
```

which would then mean the order of the tests matters (or we clear after every function, which would be materially slower to run).
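
For completeness, a sketch of the clear-after-every-test variant mentioned above, assuming (as in the snippet above) that each wrapped function under test is provided by a `func` fixture and exposes `gufunc.cache_clear()`:

```python
import pytest


@pytest.fixture(autouse=True)
def clear_numba_cache(func):
    yield  # run the test body first...
    func.gufunc.cache_clear()  # ...then drop the compiled gufunc afterwards
```

This keeps tests order-independent, at the cost of recompiling for every test.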

How do tests do with the flag enabled?

I'll prospectively merge so we can test it some more.

Thank you very much @frazane!

@max-sixty merged commit 2eb10fa into numbagg:main on Feb 29, 2024 (7 checks passed)
@frazane (Contributor, Author) commented Mar 1, 2024

@max-sixty purely guessing, but I think the problem with nanmin and similar aggregations is in LLVM itself. Apparently bottleneck compiles with clang, which is also based on LLVM, whereas numpy uses gcc. That would explain why both numbagg and bottleneck do worse than numpy.

@max-sixty (Collaborator) commented

Overall I think this is an interesting area to explore, but the perf gains aren't that high or widespread. So let's leave this in and see whether we can find any that are. If there are cases where it's 5x faster, that totally changes the calculus on whether we try to promote this path for users...
