
ENH/BUG: optimisations and fixes for np.unique with nan values #23288

Open · wants to merge 7 commits into main from jiwaszki/np_unique_nan_bug
Conversation

jiwaszki

BUG: fix for np.unique with nan values (See #23286)

Fix np.unique, which returns multiple NaNs despite equal_nan=True when axis is not None.

According to the ongoing discussion in #23286, there should be a way to fix this behaviour with a new function/ufunc. Let's treat this PR as a hot-fix.

Summary of changes:

  • Scan the incoming input array with np.isnan(ar).any() to determine whether it contains any NaN values. If not, redirect the function to the existing "fallback" path in the if-else body (see the sketch after this list).
  • Keep the previous optimization that relies on axis=None and sorting properties. Leave comments for future developers.
  • Allow dtypes of kind V (np.void) to be processed in _unique1d().
  • Add tests covering the problematic multiple-NaN cases.
  • Refactor the benchmark and parametrise it to cover 1-D/2-D cases with the axis set to None/0.
    Usage: python runtests.py --bench-compare main bench_lib.Unique
  • Performance numbers are really close to each other and tests are passing once again. The only significant differences appear in cases with axis set to 0 or higher (which is currently the price of the working NaN detection).
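
A minimal sketch of that NaN scan and fallback (hypothetical helper and names, not the PR's exact code):

import numpy as np

def _unique_with_nan_check(ar, equal_nan=True):
    # Only float/complex dtypes can hold NaN; everything else (and the
    # NaN-free case) goes straight to the existing fallback path.
    if not equal_nan or ar.dtype.kind not in "fc" or not np.isnan(ar).any():
        return np.unique(ar)  # the existing "fallback" branch
    # ... otherwise take the NaN-aware branch introduced in this PR ...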

Main branch:

[ 75.00%] · For numpy commit e141d907 <main> (round 2/2):
[ 87.50%] ··· bench_lib.Unique.time_unique_1D    ok
[ 87.50%] ··· ============ ============= ============ ============ ============ ============= ============ ============= ============ ============= ============
              --                                                                    percent_nans / axis                                                         
              ------------ -------------------------------------------------------------------------------------------------------------------------------------
               array_size     0 / None      0 / 0      0.1 / None    0.1 / 0      2.0 / None    2.0 / 0     50.0 / None    50.0 / 0    90.0 / None    90.0 / 0  
              ============ ============= ============ ============ ============ ============= ============ ============= ============ ============= ============
                   64       3.31±0.08μs   26.0±0.8μs   3.57±0.7μs   26.2±0.6μs   3.25±0.03μs   25.7±0.6μs    5.21±0.1μs   23.7±0.4μs   5.16±0.09μs   19.8±0.2μs 
                  300       4.96±0.05μs    81.4±3μs    5.06±0.1μs    83.6±2μs     7.13±0.2μs    88.8±2μs     7.15±0.1μs    79.3±2μs     6.44±0.2μs   60.3±0.9μs 
                  512        7.32±0.2μs    161±1μs     9.28±0.5μs    151±2μs     8.85±0.08μs    146±5μs     8.11±0.02μs    127±2μs     7.42±0.09μs    92.9±1μs  
              ============ ============= ============ ============ ============ ============= ============ ============= ============ ============= ============

This PR:

[ 50.00%] · For numpy commit 7ef3abf3 <jiwaszki/np_unique_nan_bug> (round 2/2):
[ 62.50%] ··· bench_lib.Unique.time_unique_1D    ok
[ 62.50%] ··· ============ ============= ============ ============ ============ ============ ============ ============= ========== ============= ===========
              --                                                                  percent_nans / axis                                                       
              ------------ ---------------------------------------------------------------------------------------------------------------------------------
               array_size     0 / None      0 / 0      0.1 / None    0.1 / 0     2.0 / None    2.0 / 0     50.0 / None   50.0 / 0   90.0 / None    90.0 / 0 
              ============ ============= ============ ============ ============ ============ ============ ============= ========== ============= ===========
                   64        3.46±0.5μs    27.5±2μs    3.45±0.1μs   27.8±0.8μs   3.42±0.1μs   27.3±0.5μs    5.01±0.1μs   36.4±7μs   5.19±0.03μs    25.0±1μs 
                  300       5.15±0.06μs   83.0±0.9μs    5.57±3μs     93.4±3μs    8.02±0.3μs    104±30μs     7.26±0.8μs   86.6±3μs    6.17±0.2μs   84.2±20μs 
                  512        7.75±0.2μs    154±2μs     8.94±0.6μs    152±8μs     8.98±0.5μs    158±6μs      8.60±0.1μs   134±7μs     7.28±0.5μs    99.2±3μs 
              ============ ============= ============ ============ ============ ============ ============ ============= ========== ============= ===========

TODO:
There is one more "early return" case which I am considering. If the input array is already 1-D, the axis value does not affect the output shape: it can only be set to None or 0, and axis 0 is swapped with itself:

>>> a = np.array([0., np.nan, 2., np.nan, 2., 1., 0., 1., 2., 0.])
>>> np.unique(a, equal_nan=True, axis=None)
array([ 0.,  1.,  2., nan])
>>> a = np.array([0., np.nan, 2., np.nan, 2., 1., 0., 1., 2., 0.])
>>> np.unique(a, equal_nan=True, axis=0)
array([ 0.,  1.,  2., nan])
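
A minimal sketch of that early return (hypothetical dispatch code assuming NumPy's private _unique1d helper, not the PR's exact implementation):

import numpy as np
from numpy.lib.arraysetops import _unique1d  # private helper, assumed here

def _unique_dispatch(ar, axis=None, **kwargs):
    ar = np.asanyarray(ar)
    # For a 1-D input, axis can only be None or 0, and moving axis 0 onto
    # itself followed by flattening is a no-op, so both cases reduce to the
    # plain 1-D code path.
    if axis is None or ar.ndim == 1:
        return _unique1d(ar.ravel(), **kwargs)
    # ... general axis handling ...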

@MatteoRaso
Contributor

Can you add a test to see if this also works for masked arrays? If possible, it'd be nice if this PR could close #23281 as well.

@jiwaszki
Author

> Can you add a test to see if this also works for masked arrays? If possible, it'd be nice if this PR could close #23281 as well.

Sure thing! But first, regarding the problems with the axis argument: currently, the function seems to be broken when axis is combined with anything else. Example:

>>> import numpy as np
>>> import numpy.ma as ma
>>> a = ma.array([1,2,3,4,2,3,1,4], mask=[0,0,1,1,0,0,0,1])
>>> 
>>> out = np.unique(a, return_index=True)
>>> print(type(out), out)
<class 'tuple'> (masked_array(data=[1, 2, 3, --],
             mask=[False, False, False,  True],
       fill_value=999999), array([0, 1, 5, 2]))
>>> 
>>> out = np.unique(a, axis=0, return_index=True)
>>> print(type(out), out)
<class 'tuple'> (array([1, 2, 3, 4]), array([0, 1, 2, 3]))

It will take me some extra time to figure out how to fix this while keeping performance "as-is".

Thanks for the heads up!

@seberg
Member

seberg commented Mar 6, 2023

I have my doubts that this works. The NaN can be in only one place (i.e. the first position), in which case it isn't even sorted to the end (as in np.unique([[np.nan, 0], [1, 2]], axis=0); the value of axis doesn't really matter).

It is indeed interesting whether this helps the masked-array code paths, but the masked-array fix likely lies not here but in np.ma itself (it is fine if np.unique does the wrong thing when given a masked array).

@jiwaszki
Author

@seberg you are right! I am back with another proposal. Some highlights:

  • A highly efficient solution for 2-D arrays with non-structured dtypes, plus an additional finding for the 1-D case.
  • All test cases fixed and new ones added.

I was scratching my head a lot; this "NaN issue" seemed unsolvable in any (efficient) way. Then I found an interesting solution by @rjeb in #20896, which was also reviewed by you. It was a game-changer for me.

Fixing the bug:

Generally, I applied a similar idea. For structured dtypes it descends recursively to check every element correctly. The basic building block, checking for NaNs, is done like this:

a != b and not (np.isnan(a) and np.isnan(b))

@rjeb's solution was not perfect in the "nested" scenario. Now it works in all existing test cases. For structured dtypes it might not be the most performant implementation... but hey! It works, looks clean, and solves the NaN issue. :)
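
A minimal sketch of that building block applied recursively (an assumed shape of the helper, not the PR's exact code; phrased here as "equal", the negation of the expression above):

import numpy as np

def _nan_equal(a, b):
    # Two scalar fields match if they compare equal, or if both are NaN.
    try:
        return bool(a == b) or bool(np.isnan(a) and np.isnan(b))
    except TypeError:
        # np.isnan is undefined for non-numeric kinds; plain equality then.
        return bool(a == b)

def _elements_equal(a, b):
    # Structured (np.void) elements are compared field by field, recursing
    # into nested structured fields.
    if isinstance(a, np.void):
        return all(_elements_equal(a[name], b[name])
                   for name in (a.dtype.names or ()))
    return _nan_equal(a, b)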

Optimising solution:

While working on this case I found two really nice optimisations to apply. I left a lot of comments for future developers. This is a fairly advanced implementation, so it is best to read this description alongside the code.

Based on some properties (such as dtype, array shape, and axis), there are cases which allow the required computations to be simplified. "Consolidating" the array always turns it into a 2-D structured-like ndarray, which is not good for performance. Thus, depending on the function input:

  • The array is 1-D and axis is 0. "Consolidation" was turning it into a 2-D array with a single entry per row, then flattening it and applying all the computations... why bother? If the array is 1-D and axis=0, the result is the same as with axis=None: axis 0 is moved onto itself and the array is flattened eventually. Skipping this saves a lot of time and essentially matches the axis=None case.
  • The array is 2-D and not structured (a "simple" dtype). If this is fulfilled, proceed with an optimised version of _unique1d called _unique2d. The new function has the same preprocessing step that switches the axes and sorts correctly. The whole trick is to reconstruct the array so that optimised NumPy calls can operate on slices. Now the cool part: with this optimisation the bottleneck of the unique call moves to flattening/sorting the array, and the previous check/loop can be rewritten simply as (worked example below):

_nans = np.isnan(aux)
mask[1:] = np.any((aux[1:] != aux[:-1]) & ~(_nans[1:] & _nans[:-1]), axis=1)
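
For illustration, a self-contained run of that mask computation on a small, already-sorted 2-D array (the set-up lines are assumed; in the PR, aux is the sorted array and mask selects the unique rows):

import numpy as np

aux = np.array([[0., 1.],
                [0., 1.],
                [0., np.nan],
                [0., np.nan]])        # rows sorted, duplicates adjacent

mask = np.empty(aux.shape[0], dtype=bool)
mask[0] = True
_nans = np.isnan(aux)
# A row is "new" if any column differs from the previous row, ignoring
# columns where both entries are NaN.
mask[1:] = np.any((aux[1:] != aux[:-1]) & ~(_nans[1:] & _nans[:-1]), axis=1)
print(aux[mask])                      # [[ 0.  1.]
                                      #  [ 0. nan]]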

Let the numbers speak for themselves:

       before           after         ratio
     [e141d907]       [7e7cd9f5]
     <main>           <jiwaszki/np_unique_nan_bug>
# These three are not significant at all, right? 
+     2.89±0.01μs      3.11±0.03μs     1.07  bench_lib.Unique.time_unique_1D(64, 0.1, None)
+     2.88±0.02μs      3.08±0.01μs     1.07  bench_lib.Unique.time_unique_1D(64, 0, None)
+     2.93±0.01μs      3.08±0.02μs     1.05  bench_lib.Unique.time_unique_1D(64, 2.0, None)
# 2-D optimisation effects:
-         157±3μs          143±4μs     0.91  bench_lib.Unique.time_unique_2D(64, 0, None)
-         111±6ms         70.2±9ms     0.63  bench_lib.Unique.time_unique_2D(2048, 90.0, 0)
-         108±7ms         64.6±8ms     0.60  bench_lib.Unique.time_unique_2D(2048, 2.0, 0)
-     1.35±0.01ms          811±2μs     0.60  bench_lib.Unique.time_unique_2D(300, 90.0, 0)
-      21.2±0.2ms       12.5±0.3ms     0.59  bench_lib.Unique.time_unique_2D(1024, 90.0, 0)
-     4.98±0.04ms      2.90±0.02ms     0.58  bench_lib.Unique.time_unique_2D(512, 90.0, 0)
-         121±6ms         69.3±5ms     0.57  bench_lib.Unique.time_unique_2D(2048, 0.1, 0)
-         183±1μs          103±1μs     0.56  bench_lib.Unique.time_unique_2D(64, 90.0, 0)
-      19.8±0.1ms      11.2±0.09ms     0.56  bench_lib.Unique.time_unique_2D(1024, 2.0, 0)
-      20.0±0.3ms      11.2±0.09ms     0.56  bench_lib.Unique.time_unique_2D(1024, 0, 0)
-      20.2±0.2ms      11.2±0.03ms     0.56  bench_lib.Unique.time_unique_2D(1024, 50.0, 0)
-     19.9±0.07ms       11.0±0.2ms     0.55  bench_lib.Unique.time_unique_2D(1024, 0.1, 0)
-     4.35±0.03ms       2.34±0.2ms     0.54  bench_lib.Unique.time_unique_2D(512, 2.0, 0)
-     4.46±0.05ms      2.38±0.02ms     0.53  bench_lib.Unique.time_unique_2D(512, 50.0, 0)
-     4.31±0.01ms      2.30±0.05ms     0.53  bench_lib.Unique.time_unique_2D(512, 0.1, 0)
-     4.35±0.04ms      2.29±0.01ms     0.53  bench_lib.Unique.time_unique_2D(512, 0, 0)
-         125±9ms         64.1±5ms     0.51  bench_lib.Unique.time_unique_2D(2048, 50.0, 0)
-     1.10±0.01ms        562±0.9μs     0.51  bench_lib.Unique.time_unique_2D(300, 50.0, 0)
-        1.07±0ms          537±5μs     0.50  bench_lib.Unique.time_unique_2D(300, 2.0, 0)
-         119±2ms         59.1±3ms     0.49  bench_lib.Unique.time_unique_2D(2048, 0, 0)
-        1.07±0ms          528±6μs     0.49  bench_lib.Unique.time_unique_2D(300, 0.1, 0)
-        1.07±0ms          525±1μs     0.49  bench_lib.Unique.time_unique_2D(300, 0, 0)
-         151±2μs         71.3±1μs     0.47  bench_lib.Unique.time_unique_2D(64, 50.0, 0)
-         145±2μs       65.9±0.4μs     0.46  bench_lib.Unique.time_unique_2D(64, 2.0, 0)
-       144±0.5μs       65.3±0.4μs     0.45  bench_lib.Unique.time_unique_2D(64, 0.1, 0)
-       145±0.4μs       65.6±0.6μs     0.45  bench_lib.Unique.time_unique_2D(64, 0, 0)
# 1-D optimisation effects:
-      17.5±0.5μs      4.44±0.02μs     0.25  bench_lib.Unique.time_unique_1D(64, 90.0, 0)
-      20.6±0.3μs      4.52±0.02μs     0.22  bench_lib.Unique.time_unique_1D(64, 50.0, 0)
-      22.5±0.4μs      3.11±0.03μs     0.14  bench_lib.Unique.time_unique_1D(64, 2.0, 0)
-      22.4±0.5μs      3.09±0.01μs     0.14  bench_lib.Unique.time_unique_1D(64, 0, 0)
-      22.5±0.3μs      3.10±0.02μs     0.14  bench_lib.Unique.time_unique_1D(64, 0.1, 0)
-        52.9±2μs      5.68±0.01μs     0.11  bench_lib.Unique.time_unique_1D(300, 90.0, 0)
-        69.9±2μs      6.22±0.03μs     0.09  bench_lib.Unique.time_unique_1D(300, 50.0, 0)
-      77.4±0.2μs      6.46±0.02μs     0.08  bench_lib.Unique.time_unique_1D(300, 2.0, 0)
-      81.4±0.9μs      6.70±0.03μs     0.08  bench_lib.Unique.time_unique_1D(512, 90.0, 0)
-       112±0.3μs      7.46±0.05μs     0.07  bench_lib.Unique.time_unique_1D(512, 50.0, 0)
-        72.5±3μs      4.68±0.02μs     0.06  bench_lib.Unique.time_unique_1D(300, 0, 0)
-      73.0±0.5μs      4.69±0.02μs     0.06  bench_lib.Unique.time_unique_1D(300, 0.1, 0)
-       131±0.8μs      8.17±0.03μs     0.06  bench_lib.Unique.time_unique_1D(512, 0.1, 0)
-       131±0.7μs       8.15±0.2μs     0.06  bench_lib.Unique.time_unique_1D(512, 2.0, 0)
-         593±3μs       33.5±0.8μs     0.06  bench_lib.Unique.time_unique_1D(2048, 2.0, 0)
-       171±0.4μs      9.62±0.04μs     0.06  bench_lib.Unique.time_unique_1D(1024, 90.0, 0)
-         139±4μs      6.71±0.01μs     0.05  bench_lib.Unique.time_unique_1D(512, 0, 0)
-         234±1μs      11.0±0.08μs     0.05  bench_lib.Unique.time_unique_1D(1024, 50.0, 0)
-         273±5μs      12.1±0.01μs     0.04  bench_lib.Unique.time_unique_1D(1024, 0.1, 0)
-         273±5μs       12.0±0.1μs     0.04  bench_lib.Unique.time_unique_1D(1024, 2.0, 0)
-         377±6μs      16.2±0.06μs     0.04  bench_lib.Unique.time_unique_1D(2048, 90.0, 0)
-         500±2μs         21.3±3μs     0.04  bench_lib.Unique.time_unique_1D(2048, 50.0, 0)
-         579±3μs       23.4±0.7μs     0.04  bench_lib.Unique.time_unique_1D(2048, 0.1, 0)
-         275±3μs       10.7±0.2μs     0.04  bench_lib.Unique.time_unique_1D(1024, 0, 0)
-         602±3μs       22.9±0.8μs     0.04  bench_lib.Unique.time_unique_1D(2048, 0, 0)

Thanks in advance for another review!

@jiwaszki jiwaszki changed the title BUG: fix for np.unique with nan values ENH/BUG: optimisations and fixes for np.unique with nan values Mar 15, 2023
@jiwaszki
Author

@seberg ping

@jiwaszki
Author

@seberg could you review it again? Thanks!

@jiwaszki jiwaszki force-pushed the jiwaszki/np_unique_nan_bug branch from d9c2a7b to 67f12f3 on May 15, 2023 19:26
@seberg
Member

seberg left a comment

This is quite complicated. To make it easier for me to get started: do I get it right that the approach is to use the structured-dtype trick for sorting only?
If we undo that view immediately after sorting, do we then apply the same NaN-sanitizing logic, just with the right .any()/.all() calls sprinkled in to account for the extra axis?

In that case, I suspect you can simplify the branching a bit: just view the old dtype (if it was already structured, no harm done).

(I almost wonder whether np.lexsort might work just as well for that sorting trick.)
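
For reference, a small sketch of the np.lexsort idea (illustrative only, not part of this PR):

import numpy as np

a = np.array([[1., 2.],
              [0., 3.],
              [1., 0.]])
# lexsort treats the *last* key as primary, so pass the columns in reverse
# order to sort rows by column 0, then column 1, ...
order = np.lexsort(a.T[::-1])
sorted_rows = a[order]                # [[0., 3.], [1., 0.], [1., 2.]]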

# Then this optimization might be used to speed up unique computation along
# axis equal to either 0 or 1.
if (orig_dim == 2 and len(orig_dtype) < 2 and
        all(i > 0 for i in orig_shape)):
@seberg
Member

I don't understand why orig_ndim matters. Isn't the array always flattened along all but one dim?

Status: Awaiting a code review
3 participants