Bitmask Backed MaskedArray #54506

Closed
wants to merge 142 commits

Conversation

WillAyd
Member

@WillAyd commented Aug 11, 2023

POC using vendored nanoarrow

@@ -70,19 +70,6 @@ repos:
- id: fix-encoding-pragma
args: [--remove]
- id: trailing-whitespace
- repo: https://github.com/cpplint/cpplint
Member Author

nanoarrow and cpplint have a ton of conflicts. Generally I'm not sure how useful cpplint is for us, so I just removed the check.


We should fix this in nanoarrow! It's true that cpplint does poorly for C; however, lots of projects use it and will run into this problem.

pandas/core/arrays/masked.py (outdated; resolved)
@@ -112,7 +113,7 @@ class BaseMaskedArray(OpsMixin, ExtensionArray):
_internal_fill_value: Scalar
# our underlying data and mask are each ndarrays
_data: np.ndarray
_mask: npt.NDArray[np.bool_]
_mask: BitMaskArray
Member

Maybe we can make this a property and override the getter?

Member Author

What are you thinking we would override the getter with? Are you thinking of special logic, or still just treating it like a normal attribute?

Member

I was thinking that you can make _mask a property with a getter that calls to_numpy on an internal _bitmask attribute where we'll store the actual mask.
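
A minimal sketch of that suggestion, assuming the BitmaskArray from this branch (the _bitmask attribute name is just an illustration from this thread):

class BaseMaskedArray(OpsMixin, ExtensionArray):
    _data: np.ndarray
    _bitmask: BitmaskArray  # hypothetical attribute holding the actual bitmask

    @property
    def _mask(self) -> npt.NDArray[np.bool_]:
        # callers that expect the old byte mask get an ndarray materialized on demand
        return self._bitmask.to_numpy()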

Member Author

Ah I see. The problem is that sometimes we want to access the actual bitmask object (i.e. to forward to another constructor for reference counting). The to_numpy calls are only required when we are implicitly using NumPy functionality in the code (granted, this is pretty often).

Member Author

In theory it would be great to eliminate to_numpy() calls. __or__ is a good example: it would be preferable to do that operation bitwise between two buffers when both are bitmasks of the same length.

In practice I was having trouble properly exposing that through the Cython classes, so I'm hoping to leave it as a future optimization.
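
For reference, the byte-level version of that optimization is simple outside of Cython; a rough sketch over two equal-length bitmask buffers (bitmask_or is a hypothetical helper, not part of this PR):

import numpy as np

def bitmask_or(a: bytes, b: bytes) -> bytes:
    # OR the raw mask bytes directly, skipping the boolean-ndarray round trip;
    # both buffers must describe the same number of bits
    assert len(a) == len(b)
    return (np.frombuffer(a, dtype=np.uint8) | np.frombuffer(b, dtype=np.uint8)).tobytes()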

Member

We might need to keep _mask as a property that returns the NumPy byte mask (this is somewhat semi-public), as Thomas mentioned, and add another attribute for the new bitmask (so essentially where you now changed ._mask to ._mask.to_numpy(), it would stay _mask, and anywhere you now use just ._mask would become something like ._bitmask).

See #34873 for some context on this

Member Author

Good to know - I'm going to hold off on this until the end, maybe as a follow-up depending on how much of a change it requires. It is definitely feasible to do as you've suggested; I just want to balance that against a huge initial diff.

Member Author

I'm a bit conflicted on this. The big downside to having ._mask call to_numpy() is that it is a really expensive conversion, so it can cause performance regressions unless you really root it out. Assuming there aren't a ton of downstream libraries that have assumed ._mask is an ndarray, maybe we can coordinate with them to update to ._mask.to_numpy() and maybe even ._mask.as_bitmask().

Member Author

I think what we could do for now is ask any downstream libraries that have been using this to switch to isna(), which returns a NumPy array today and will continue to do so with this PR.
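
For example, isna() on a masked array already hands back a plain NumPy boolean array:

>>> import pandas as pd
>>> arr = pd.array([1, None, 3], dtype="Int64")
>>> pd.isna(arr)
array([False,  True, False])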

@WillAyd
Member Author

WillAyd commented Aug 14, 2023

Benchmarks

$ asv continuous upstream/main HEAD
      before           after         ratio
     [4f1a086f]       [173b4cbe]
     <main>           <bitmask-backed>
+        1.23±0μs         66.8±1μs    54.35  array.IntegerArray.time_constructor
+        707±10ns      1.41±0.04μs     2.00  array.BooleanArray.time_constructor
+         957±4μs      1.70±0.02ms     1.78  frame_ctor.FromScalar.time_frame_from_scalar_ea_float64
+         978±6μs      1.68±0.01ms     1.71  frame_ctor.FromScalar.time_frame_from_scalar_ea_float64_na
+      8.72±0.2μs      11.9±0.09μs     1.37  boolean.TimeLogicalOps.time_xor_scalar
+      8.96±0.2μs       11.9±0.3μs     1.33  boolean.TimeLogicalOps.time_or_array
+      8.60±0.3μs       11.4±0.2μs     1.33  boolean.TimeLogicalOps.time_and_array
+     2.19±0.03ms       2.92±0.5ms     1.33  multiindex_object.SetOperations.time_operation('monotonic', 'ea_int', 'intersection', False)
+      13.7±0.2ms       17.7±0.1ms     1.29  index_object.SetOperations.time_operation('monotonic', 'date_string', 'intersection')
+     1.26±0.07ms      1.53±0.02ms     1.21  index_object.SetOperations.time_operation('non_monotonic', 'ea_int', 'intersection')
+        297±30μs         368±30μs     1.24  series_methods.NanOps.time_func('sum', 1000000, 'int8')
+     1.24±0.01ms      1.48±0.03ms     1.19  index_object.SetOperations.time_operation('non_monotonic', 'ea_int', 'intersection')
+     2.47±0.02ms      2.92±0.03ms     1.18  reshape.ReshapeMaskedArrayDtype.time_transpose('Int64')
+      45.2±0.3μs       52.5±0.6μs     1.16  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.uint64'>, 12, 'outside')
+     2.51±0.01ms      2.89±0.05ms     1.15  reshape.ReshapeMaskedArrayDtype.time_transpose('Float64')
+        76.8±1μs         88.0±2μs     1.15  algos.isin.IsinWithRandomFloat.time_isin(<class 'numpy.float64'>, 2000, 'outside')
+        225±30μs          257±4μs     1.14  algos.isin.IsIn.time_isin_categorical('string[python]')
+         195±4ns          223±5ns     1.14  indexing_engines.MaskedNumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.MaskedFloat64Engine'>, 'Float64'), 'monotonic_decr', True, 2000000)
+      8.03±0.2ms      9.13±0.08ms     1.14  frame_methods.Equals.time_frame_nonunique_unequal
+     2.05±0.03ms      2.33±0.04ms     1.14  index_object.SetOperations.time_operation('non_monotonic', 'ea_int', 'union')
+         478±3μs          539±1μs     1.13  series_methods.NanOps.time_func('min', 1000000, 'boolean')
+        95.3±1μs          108±2μs     1.13  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.uint64'>, 13, 'outside')
+        46.2±1μs       51.8±0.5μs     1.12  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.int64'>, 12, 'outside')
+      11.0±0.5μs       12.3±0.9μs     1.12  algorithms.SortIntegerArray.time_argsort(1000)
+       860±100ns        963±100ns     1.12  index_cached_properties.IndexCache.time_shape('TimedeltaIndex')
+         463±5μs          515±5μs     1.11  series_methods.NanOps.time_func('max', 1000000, 'boolean')
+        56.3±2μs       62.5±0.8μs     1.11  algos.isin.IsinWithArangeSorted.time_isin(<class 'numpy.float64'>, 2000)
+      8.60±0.2ms       9.51±0.2ms     1.11  frame_methods.Equals.time_frame_nonunique_equal
+     1.12±0.01ms      1.25±0.03ms     1.11  series_methods.Fillna.time_bfill('Int64')
+     2.07±0.02ms       2.30±0.1ms     1.11  multiindex_object.SetOperations.time_operation('non_monotonic', 'datetime', 'union', False)
+         196±1ns          217±3ns     1.11  indexing_engines.NumericEngineIndexing.time_get_loc_near_middle((<class 'pandas._libs.index.Float32Engine'>, <class 'numpy.float32'>), 'monotonic_decr', True, 2000000)
+        31.6±1μs       34.8±0.2μs     1.10  algos.isin.IsIn.time_isin_mismatched_dtype('Int64')
+     1.15±0.02ms      1.26±0.01ms     1.10  series_methods.Fillna.time_bfill('Float64')
+      4.13±0.05s       4.55±0.02s     1.10  stat_ops.FrameOps.time_op('kurt', 'Int64', 1)
+     2.71±0.03ms      2.98±0.02ms     1.10  indexing_engines.MaskedNumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.MaskedInt16Engine'>, 'Int16'), 'monotonic_incr', True, 2000000)
-     2.45±0.02ms      2.22±0.01ms     0.91  stat_ops.FrameOps.time_op('sem', 'Int64', None)
-      13.5±0.6ms       12.3±0.1ms     0.91  categoricals.Constructor.time_regular
-     2.43±0.02ms      2.18±0.01ms     0.90  stat_ops.FrameOps.time_op('sem', 'Int64', 0)
-      16.3±0.6μs       14.8±0.2μs     0.90  series_methods.NanOps.time_func('std', 1000, 'int32')
-      2.82±0.4ms      2.52±0.05ms     0.89  series_methods.NanOps.time_func('var', 1000000, 'int32')
-      9.94±0.4ms       8.81±0.2ms     0.89  series_methods.NanOps.time_func('sem', 1000000, 'float64')
-      93.7±0.8μs       83.7±0.4μs     0.89  join_merge.ConcatIndexDtype.time_concat_series('Int64', 'non_monotonic', 0, True)
-      93.8±0.5μs       83.6±0.7μs     0.89  join_merge.ConcatIndexDtype.time_concat_series('Int64', 'non_monotonic', 0, False)
-      1.22±0.3μs       1.08±0.1μs     0.88  index_cached_properties.IndexCache.time_values('CategoricalIndex')
-      11.9±0.3ms      10.1±0.08ms     0.84  frame_methods.Fillna.time_fillna(False, 'Float64')
-         520±50μs          433±3μs     0.83  series_methods.NanOps.time_func('sum', 1000000, 'float64')
-        728±50μs          596±7μs     0.82  series_methods.NanOps.time_func('sum', 1000000, 'Int64')
-      4.90±0.3ms       4.03±0.2ms     0.82  indexing.MultiIndexing.time_loc_slice_plus_null_slice(False)
-      12.1±0.4ms      9.92±0.07ms     0.82  frame_methods.Fillna.time_fillna(False, 'Int64')
-         374±3μs          307±2μs     0.82  timeseries.DatetimeIndex.time_unique('repeated')
-     1.21±0.01ms          981±8μs     0.81  frame_methods.Fillna.time_bfill(True, 'Int64')
-        815±70ns         661±60ns     0.81  index_cached_properties.IndexCache.time_values('TimedeltaIndex')
-     1.20±0.01ms          972±7μs     0.81  frame_methods.Fillna.time_bfill(True, 'Float64')
-     11.6±0.03ms      9.01±0.08ms     0.78  indexing.InsertColumns.time_insert_middle
-      3.68±0.3ms      2.85±0.05ms     0.77  series_methods.NanOps.time_func('var', 1000000, 'int64')
-     3.91±0.02μs      3.03±0.02μs     0.78  series_methods.Any.time_any(1000, 'slow', 'boolean')
-     3.99±0.06μs      3.06±0.04μs     0.77  series_methods.All.time_all(1000, 'fast', 'boolean')
-            458M             349M     0.76  array.BooleanArrayMem.peakmem_array
-     4.74±0.09ms      3.28±0.03ms     0.69  series_methods.Fillna.time_fillna('Float64')
-     4.73±0.03ms      3.19±0.01ms     0.67  series_methods.Fillna.time_fillna('Int64')
-         324±6μs       29.9±0.4μs     0.09  series_methods.All.time_all(1000000, 'slow', 'boolean')
-         309±3μs       10.8±0.1μs     0.03  series_methods.All.time_all(1000000, 'fast', 'boolean')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@@ -43,7 +44,7 @@ def quantile_compat(

def quantile_with_mask(
values: np.ndarray,
mask: npt.NDArray[np.bool_],
mask: npt.NDArray[np.bool_] | BitmaskArray,
Member

no viable way to avoid this?

Member Author

The typing has to change somehow since we aren't passing an ndarray. Instead of a Union, I think a more generic type would declare this as accepting anything that implements the buffer protocol.

If that were done as a precursor it would really cut down on the diff here, and arguably align better with the memoryview declarations we have in the Cython extensions (at least most of the time).

Member

As above, I'd really prefer the caller to be responsible for the conversion.

@jbrockmendel
Member

can the BitMaskArray support 2D?

@WillAyd
Member Author

WillAyd commented Aug 31, 2023

Yea, 2D is working with this. The "support" is limited to storing the bits in a C-style contiguous fashion. Slicing and non-axis=0 operations right now require a to_numpy() call, but otherwise this works with 2D natively.

@WillAyd
Member Author

WillAyd commented Aug 31, 2023

To illustrate:

>>> from pandas._libs.arrays import BitmaskArray
>>> import numpy as np
>>> arr = np.array([True, True, False, False]).reshape(2, -1)
>>> bma = BitmaskArray(arr)
>>> bma.any()
True
>>> bma.all()
False
>>> bma.shape
(2, 2)
>>> bma[0]
array([ True,  True])
>>> bma[1]
array([False, False])
>>> bma[:, 0]
array([ True, False])
>>> bma[:, 1]
array([ True, False])
>>> "{0:08b}".format(bma.bytes[0])
'00000011'
>>> bma.to_numpy()
array([[ True,  True],
       [False, False]])

This is all handled via BitmaskArray.__getitem__ - the BitmaskArray has a few cases it can fast-path without converting to NumPy (e.g. 1D scalar indexing, a full slice), but for more complicated indexers it generally converts to NumPy and returns that as a fallback.
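
Roughly, the dispatch looks like this (an illustrative Python sketch of the behavior described above, not the actual Cython implementation; _get_bit and copy are stand-ins for internal helpers):

def __getitem__(self, key):
    if self.ndim == 1 and isinstance(key, int):
        return self._get_bit(key)        # fastpath: read a single bit
    if isinstance(key, slice) and key == slice(None):
        return self.copy()               # fastpath: full slice copies the whole buffer
    return self.to_numpy()[key]          # fallback: defer to NumPy indexing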

@@ -424,7 +425,7 @@ def nunique_ints(values: ArrayLike) -> int:
return result


def unique_with_mask(values, mask: npt.NDArray[np.bool_] | None = None):
def unique_with_mask(values, mask: npt.NDArray[np.bool_] | BitmaskArray | None = None):
Member

haven't looked that closely, but my kneejerk both here and in array_algos is that the caller should be responsible for converting the BitmaskArray (or do something at a higher level that makes the conversion unnecessary)

Member Author

@WillAyd commented Sep 1, 2023

Yea, I went back and forth on that design. What is nice about keeping this very local to the issue is that it is clearer where you would need to refactor in the future to make the BitmaskArray work better.

In this case the only reason the conversion is needed is the astype("bool") call on L453. If that didn't exist we could probably avoid this isinstance check. If we moved the astype higher up the call stack it would be less clear why it is needed.
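
Concretely, the pattern being discussed is just a localized conversion at the call site (a sketch of the idea, not the exact diff):

if isinstance(mask, BitmaskArray):
    # convert right next to the ndarray-specific astype("bool") that needs it,
    # so a future refactor can see exactly what would make this unnecessary
    mask = mask.to_numpy()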

@seberg
Contributor

seberg commented Sep 5, 2023

Both PEP 3118 and the NumPy array interface mention the ability to work with bits, but they have different documented syntax (Nt versus tN where N is the number of bits) and I'm not clear that they were ever implemented? Maybe something @scoder or @seberg would know about?

I am not aware that NumPy actually supports this in any useful way at least. memoryview() doesn't care if a format is unknown. I would say the buffer protocol docs are more authoritative but it might be good to see if there is prior art that actually uses this.

@WillAyd
Member Author

WillAyd commented Sep 5, 2023

From searching through https://discuss.python.org/t/question-pep-3118-format-strings-and-the-buffer-protocol/31264/8 I found a link to python/cpython#47382 (comment) which I think confirms that bit support isn't implemented in CPython

Of course not a blocker to this PR but something to consider in the future that would make this even more useful

@WillAyd mentioned this pull request Sep 6, 2023
pandas/_libs/lib.pyx (outdated; resolved)
@@ -229,7 +240,7 @@ def fillna(
if method is not None:
func = missing.get_fill_func(method, ndim=self.ndim)
npvalues = self._data.T
new_mask = mask.T
new_mask = mask.to_numpy().T
Member

can we avoid this conversion? maybe feasible-but-out-of-scope?

Member Author

Are you asking about avoiding the T even in the current code? Or implementing support for transposition in the BitmaskArray directly?

Member

Or implementing support for transposition in the BitmaskArray directly?

That one

Member Author

The two options for that are:

  1. Physically transpose the layout of the bits in memory
  2. Keep the bits the same and just use state management to resolve indexes to a different address when transposed

I don't know which of these NumPy does. Maybe a combination of both. If we want option 1, we also need to find or develop a good algorithm for it.

It's possible; it just requires some dedicated time on that topic.
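
For what it's worth, NumPy's plain transpose takes the second route: .T returns a view over the same buffer with the strides swapped, so no data physically moves:

>>> import numpy as np
>>> arr = np.arange(6, dtype=np.int64).reshape(2, 3)
>>> arr.strides
(24, 8)
>>> arr.T.strides
(8, 24)
>>> np.shares_memory(arr, arr.T)
True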

@@ -286,9 +286,11 @@ def test_ensure_copied_data(self, index):
tm.assert_numpy_array_equal(
index._values._data, result._values._data, check_same="same"
)
assert np.shares_memory(index._values._mask, result._values._mask)
assert index._values._mask is result._values._mask
Member

there's a little-used tm.shares_memory that was intended to allow implementing this kind of check for non-ndarrays

Member Author

We could look at that, but I don't know if nanoarrow defines what the value of bits should be when the bitmask doesn't end on a word boundary. For example, an array with a length of 9 will take 2 total bytes of memory; I am not sure if nanoarrow specifies what bits 10-16 are, so doing a memcmp on that region might not work. @paleolimbot probably knows best though.


Sorry I missed this! nanoarrow definitely does not define anything about what the value of those bits should be, but where possible I believe it prefers to zero them (because it causes problems with valgrind if it does not). From the Arrow format:

Array slots which are null are not required to have a particular value; any “masked” memory can have any value and need not be zeroed, though implementations frequently choose to zero memory for null values.
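
So a straight memcmp over the buffers is only safe if both sides zero their padding; an equality check would more likely need to mask the trailing partial byte. A minimal sketch of such a helper (hypothetical, assuming Arrow's least-significant-bit-first layout):

def bitmasks_equal(a: bytes, b: bytes, nbits: int) -> bool:
    nfull, rem = divmod(nbits, 8)
    if a[:nfull] != b[:nfull]:
        return False
    if rem:
        keep = (1 << rem) - 1  # only the low `rem` bits of the last byte are meaningful
        return (a[nfull] & keep) == (b[nfull] & keep)
    return True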

@@ -286,9 +286,11 @@ def test_ensure_copied_data(self, index):
tm.assert_numpy_array_equal(
index._values._data, result._values._data, check_same="same"
)
assert np.shares_memory(index._values._mask, result._values._mask)
assert index._values._mask is result._values._mask
tm.assert_numpy_array_equal(
Member

should just have tm.assert_bitmask_equal?

@pytest.mark.xfail(
not np_version_gte1p24,
reason="Changed NumPy behavior for >1D non-tuple sequence indexing",
strict=False,
Member

is this flaky?

Member Author

Yea I couldn't reproduce locally but saw it on CI. Next push I'll take this off and we can see if it shows back up

@@ -231,15 +232,15 @@ class IntpHashTable(HashTable): ...
def duplicated(
values: np.ndarray,
keep: Literal["last", "first", False] = ...,
mask: npt.NDArray[np.bool_] | None = ...,
mask: npt.NDArray[np.bool_] | BitmaskArray | None = ...,
Member Author

For typing annotations like this we could also do a precursor to change mask and other arguments that are typed as memoryviews in Cython to typing_extensions.Buffer; I think that would be more technically correct.

Python 3.12 will have collections.abc.Buffer built in. See also PEP 688 https://peps.python.org/pep-0688/#equivalent-for-older-python-versions

I think the only loss that way is that I don't see typing.Buffer as being generic, so AFAICT you can't do typing.Buffer[bool] yet. But it seems logical that will get implemented in Python at some point.
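
A sketch of what such a stub signature could look like, using typing_extensions.Buffer as the PEP 688 backport (this is an assumption about a possible precursor, not what the PR currently does):

from typing import Literal

import numpy as np
import numpy.typing as npt
from typing_extensions import Buffer  # collections.abc.Buffer on Python 3.12+

def duplicated(
    values: np.ndarray,
    keep: Literal["last", "first", False] = "first",
    mask: Buffer | None = None,
) -> npt.NDArray[np.bool_]: ...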

@WillAyd
Member Author

WillAyd commented Oct 23, 2023

Closing for now, but I think this is feasible if we ever wanted to come back to it

Labels: Arrow (pyarrow functionality), Build (library building on various platforms), NA - MaskedArrays (related to pd.NA and nullable extension arrays)