Bitmask Backed MaskedArray #54506

Closed
wants to merge 142 commits

Conversation

WillAyd
Member

@WillAyd commented Aug 11, 2023

POC using vendored nanoarrow

@@ -70,19 +70,6 @@ repos:
- id: fix-encoding-pragma
args: [--remove]
- id: trailing-whitespace
- repo: https://github.com/cpplint/cpplint
Member Author

nanoarrow and cpplint have a ton of conflicts. Generally I'm not sure how useful cpplint is for us, so I just removed the check.


We should fix this in nanoarrow! It's true that cpplint does poorly for C; however, lots of projects use it and will run into this problem.

pandas/core/arrays/masked.py (outdated; resolved)
@@ -112,7 +113,7 @@ class BaseMaskedArray(OpsMixin, ExtensionArray):
_internal_fill_value: Scalar
# our underlying data and mask are each ndarrays
_data: np.ndarray
_mask: npt.NDArray[np.bool_]
_mask: BitMaskArray
Member

Maybe we can make this a property and override the getter?

Member Author

What are you thinking we would override the getter with? Are you thinking of special logic, or still just treating it like a normal attribute?

Member

I was thinking that you can make _mask a property with a getter that calls to_numpy on an internal _bitmask attribute where we'll store the actual mask.
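
A minimal sketch of that suggestion, assuming the BitmaskArray from this branch (the _bitmask attribute name is just an illustration from this thread):

class BaseMaskedArray(OpsMixin, ExtensionArray):
    _data: np.ndarray
    _bitmask: BitmaskArray  # hypothetical attribute holding the actual bitmask

    @property
    def _mask(self) -> npt.NDArray[np.bool_]:
        # callers that expect the old byte mask get an ndarray materialized on demand
        return self._bitmask.to_numpy()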

Member Author

Ah I see. The problem is that sometimes we want to access the actual bitmask object (i.e. to forward to another constructor for reference counting). The to_numpy calls are only required when we are implicitly using NumPy functionality in the code (granted, this is pretty often).

Member Author

In theory it would be great to eliminate to_numpy() calls. __or__ is a good example: it would be preferable to do that operation bitwise between two buffers when both are bitmasks of the same length.

In practice I was having trouble properly exposing that through the Cython classes, so I'm hoping to leave it as a future optimization.
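
For reference, the byte-level version of that optimization is simple outside of Cython; a rough sketch over two equal-length bitmask buffers (bitmask_or is a hypothetical helper, not part of this PR):

import numpy as np

def bitmask_or(a: bytes, b: bytes) -> bytes:
    # OR the raw mask bytes directly, skipping the boolean-ndarray round trip;
    # both buffers must describe the same number of bits
    assert len(a) == len(b)
    return (np.frombuffer(a, dtype=np.uint8) | np.frombuffer(b, dtype=np.uint8)).tobytes()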

Member

We might need to keep _mask as a property that returns the NumPy byte mask (this is somewhat semi-public), as Thomas mentioned, and add another attribute for the new bitmask (so essentially where you now changed ._mask to ._mask.to_numpy(), it would stay _mask, and anywhere you now use just ._mask would become something like ._bitmask).

See #34873 for some context on this

Member Author

Good to know - I'm going to hold off on this until the end, maybe as a follow-up depending on how much of a change it requires. It is definitely feasible to do as you've suggested; I just want to balance that against a huge initial diff.

Member Author

I'm a bit conflicted on this. The big downside to having ._mask call to_numpy() is that it is a really expensive conversion, so it can cause performance regressions unless you really root it out. Assuming there aren't a ton of downstream libraries that have assumed ._mask is an ndarray, maybe we can coordinate with them to update to ._mask.to_numpy() and maybe even ._mask.as_bitmask().

Member Author

I think what we could do for now is ask any downstream libraries that have been using this to switch to isna(), which returns a NumPy array today and will continue to do so with this PR.
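
For example, isna() on a masked array already hands back a plain NumPy boolean array:

>>> import pandas as pd
>>> arr = pd.array([1, None, 3], dtype="Int64")
>>> pd.isna(arr)
array([False,  True, False])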

@WillAyd
Member Author

WillAyd commented Aug 14, 2023

Benchmarks

$ asv continuous upstream/main HEAD
      before           after         ratio
     [4f1a086f]       [173b4cbe]
     <main>           <bitmask-backed>
+        1.23±0μs         66.8±1μs    54.35  array.IntegerArray.time_constructor
+        707±10ns      1.41±0.04μs     2.00  array.BooleanArray.time_constructor
+         957±4μs      1.70±0.02ms     1.78  frame_ctor.FromScalar.time_frame_from_scalar_ea_float64
+         978±6μs      1.68±0.01ms     1.71  frame_ctor.FromScalar.time_frame_from_scalar_ea_float64_na
+      8.72±0.2μs      11.9±0.09μs     1.37  boolean.TimeLogicalOps.time_xor_scalar
+      8.96±0.2μs       11.9±0.3μs     1.33  boolean.TimeLogicalOps.time_or_array
+      8.60±0.3μs       11.4±0.2μs     1.33  boolean.TimeLogicalOps.time_and_array
+     2.19±0.03ms       2.92±0.5ms     1.33  multiindex_object.SetOperations.time_operation('monotonic', 'ea_int', 'intersection', False)
+      13.7±0.2ms       17.7±0.1ms     1.29  index_object.SetOperations.time_operation('monotonic', 'date_string', 'intersection')
+     1.26±0.07ms      1.53±0.02ms     1.21  index_object.SetOperations.time_operation('non_monotonic', 'ea_int', 'intersection')
+        297±30μs         368±30μs     1.24  series_methods.NanOps.time_func('sum', 1000000, 'int8')
+     1.24±0.01ms      1.48±0.03ms     1.19  index_object.SetOperations.time_operation('non_monotonic', 'ea_int', 'intersection')
+     2.47±0.02ms      2.92±0.03ms     1.18  reshape.ReshapeMaskedArrayDtype.time_transpose('Int64')
+      45.2±0.3μs       52.5±0.6μs     1.16  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.uint64'>, 12, 'outside')
+     2.51±0.01ms      2.89±0.05ms     1.15  reshape.ReshapeMaskedArrayDtype.time_transpose('Float64')
+        76.8±1μs         88.0±2μs     1.15  algos.isin.IsinWithRandomFloat.time_isin(<class 'numpy.float64'>, 2000, 'outside')
+        225±30μs          257±4μs     1.14  algos.isin.IsIn.time_isin_categorical('string[python]')
+         195±4ns          223±5ns     1.14  indexing_engines.MaskedNumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.MaskedFloat64Engine'>, 'Float64'), 'monotonic_decr', True, 2000000)
+      8.03±0.2ms      9.13±0.08ms     1.14  frame_methods.Equals.time_frame_nonunique_unequal
+     2.05±0.03ms      2.33±0.04ms     1.14  index_object.SetOperations.time_operation('non_monotonic', 'ea_int', 'union')
+         478±3μs          539±1μs     1.13  series_methods.NanOps.time_func('min', 1000000, 'boolean')
+        95.3±1μs          108±2μs     1.13  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.uint64'>, 13, 'outside')
+        46.2±1μs       51.8±0.5μs     1.12  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.int64'>, 12, 'outside')
+      11.0±0.5μs       12.3±0.9μs     1.12  algorithms.SortIntegerArray.time_argsort(1000)
+       860±100ns        963±100ns     1.12  index_cached_properties.IndexCache.time_shape('TimedeltaIndex')
+         463±5μs          515±5μs     1.11  series_methods.NanOps.time_func('max', 1000000, 'boolean')
+        56.3±2μs       62.5±0.8μs     1.11  algos.isin.IsinWithArangeSorted.time_isin(<class 'numpy.float64'>, 2000)
+      8.60±0.2ms       9.51±0.2ms     1.11  frame_methods.Equals.time_frame_nonunique_equal
+     1.12±0.01ms      1.25±0.03ms     1.11  series_methods.Fillna.time_bfill('Int64')
+     2.07±0.02ms       2.30±0.1ms     1.11  multiindex_object.SetOperations.time_operation('non_monotonic', 'datetime', 'union', False)
+         196±1ns          217±3ns     1.11  indexing_engines.NumericEngineIndexing.time_get_loc_near_middle((<class 'pandas._libs.index.Float32Engine'>, <class 'numpy.float32'>), 'monotonic_decr', True, 2000000)
+        31.6±1μs       34.8±0.2μs     1.10  algos.isin.IsIn.time_isin_mismatched_dtype('Int64')
+     1.15±0.02ms      1.26±0.01ms     1.10  series_methods.Fillna.time_bfill('Float64')
+      4.13±0.05s       4.55±0.02s     1.10  stat_ops.FrameOps.time_op('kurt', 'Int64', 1)
+     2.71±0.03ms      2.98±0.02ms     1.10  indexing_engines.MaskedNumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.MaskedInt16Engine'>, 'Int16'), 'monotonic_incr', True, 2000000)
-     2.45±0.02ms      2.22±0.01ms     0.91  stat_ops.FrameOps.time_op('sem', 'Int64', None)
-      13.5±0.6ms       12.3±0.1ms     0.91  categoricals.Constructor.time_regular
-     2.43±0.02ms      2.18±0.01ms     0.90  stat_ops.FrameOps.time_op('sem', 'Int64', 0)
-      16.3±0.6μs       14.8±0.2μs     0.90  series_methods.NanOps.time_func('std', 1000, 'int32')
-      2.82±0.4ms      2.52±0.05ms     0.89  series_methods.NanOps.time_func('var', 1000000, 'int32')
-      9.94±0.4ms       8.81±0.2ms     0.89  series_methods.NanOps.time_func('sem', 1000000, 'float64')
-      93.7±0.8μs       83.7±0.4μs     0.89  join_merge.ConcatIndexDtype.time_concat_series('Int64', 'non_monotonic', 0, True)
-      93.8±0.5μs       83.6±0.7μs     0.89  join_merge.ConcatIndexDtype.time_concat_series('Int64', 'non_monotonic', 0, False)
-      1.22±0.3μs       1.08±0.1μs     0.88  index_cached_properties.IndexCache.time_values('CategoricalIndex')
-      11.9±0.3ms      10.1±0.08ms     0.84  frame_methods.Fillna.time_fillna(False, 'Float64')
-         520±50μs          433±3μs     0.83  series_methods.NanOps.time_func('sum', 1000000, 'float64')
-        728±50μs          596±7μs     0.82  series_methods.NanOps.time_func('sum', 1000000, 'Int64')
-      4.90±0.3ms       4.03±0.2ms     0.82  indexing.MultiIndexing.time_loc_slice_plus_null_slice(False)
-      12.1±0.4ms      9.92±0.07ms     0.82  frame_methods.Fillna.time_fillna(False, 'Int64')
-         374±3μs          307±2μs     0.82  timeseries.DatetimeIndex.time_unique('repeated')
-     1.21±0.01ms          981±8μs     0.81  frame_methods.Fillna.time_bfill(True, 'Int64')
-        815±70ns         661±60ns     0.81  index_cached_properties.IndexCache.time_values('TimedeltaIndex')
-     1.20±0.01ms          972±7μs     0.81  frame_methods.Fillna.time_bfill(True, 'Float64')
-     11.6±0.03ms      9.01±0.08ms     0.78  indexing.InsertColumns.time_insert_middle
-      3.68±0.3ms      2.85±0.05ms     0.77  series_methods.NanOps.time_func('var', 1000000, 'int64')
-     3.91±0.02μs      3.03±0.02μs     0.78  series_methods.Any.time_any(1000, 'slow', 'boolean')
-     3.99±0.06μs      3.06±0.04μs     0.77  series_methods.All.time_all(1000, 'fast', 'boolean')
-            458M             349M     0.76  array.BooleanArrayMem.peakmem_array
-     4.74±0.09ms      3.28±0.03ms     0.69  series_methods.Fillna.time_fillna('Float64')
-     4.73±0.03ms      3.19±0.01ms     0.67  series_methods.Fillna.time_fillna('Int64')
-         324±6μs       29.9±0.4μs     0.09  series_methods.All.time_all(1000000, 'slow', 'boolean')
-         309±3μs       10.8±0.1μs     0.03  series_methods.All.time_all(1000000, 'fast', 'boolean')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@@ -43,7 +44,7 @@ def quantile_compat(

def quantile_with_mask(
values: np.ndarray,
mask: npt.NDArray[np.bool_],
mask: npt.NDArray[np.bool_] | BitmaskArray,
Member

no viable way to avoid this?

Member Author

The typing has to change somehow since we aren't passing an ndarray. Instead of a Union, I think a more generic type would declare this as accepting anything that implements the buffer protocol.

If that were done as a precursor it would really cut down on the diff here, and arguably align better with the memoryview declarations we have in the Cython extensions (at least most of the time).

Member

As above, I'd really prefer the caller to be responsible for the conversion.

@jbrockmendel
Member

can the BitMaskArray support 2D?

@WillAyd
Member Author

WillAyd commented Aug 31, 2023

Yea, 2D is working with this. The "support" is limited to storing the bits in a C-style contiguous fashion. Slicing and non-axis=0 operations right now require a to_numpy() call, but otherwise this works with 2D natively.

@WillAyd
Member Author

WillAyd commented Aug 31, 2023

To illustrate:

>>> from pandas._libs.arrays import BitmaskArray
>>> import numpy as np
>>> arr = np.array([True, True, False, False]).reshape(2, -1)
>>> bma = BitmaskArray(arr)
>>> bma.any()
True
>>> bma.all()
False
>>> bma.shape
(2, 2)
>>> bma[0]
array([ True,  True])
>>> bma[1]
array([False, False])
>>> bma[:, 0]
array([ True, False])
>>> bma[:, 1]
array([ True, False])
>>> "{0:08b}".format(bma.bytes[0])
'00000011'
>>> bma.to_numpy()
array([[ True,  True],
       [False, False]])

This is all handled via BitmaskArray.__getitem__ - the BitmaskArray has a few cases it can fast-path without converting to NumPy (e.g. 1D scalar indexing, a full slice), but for more complicated indexers it generally converts to NumPy and returns that as a fallback.
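
Roughly, the dispatch looks like this (an illustrative Python sketch of the behavior described above, not the actual Cython implementation; _get_bit and copy are stand-ins for internal helpers):

def __getitem__(self, key):
    if self.ndim == 1 and isinstance(key, int):
        return self._get_bit(key)        # fastpath: read a single bit
    if isinstance(key, slice) and key == slice(None):
        return self.copy()               # fastpath: full slice copies the whole buffer
    return self.to_numpy()[key]          # fallback: defer to NumPy indexing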

@@ -424,7 +425,7 @@ def nunique_ints(values: ArrayLike) -> int:
return result


def unique_with_mask(values, mask: npt.NDArray[np.bool_] | None = None):
def unique_with_mask(values, mask: npt.NDArray[np.bool_] | BitmaskArray | None = None):
Member

haven't looked that closely, but my kneejerk both here and in array_algos is that the caller should be responsible for converting the BitmaskArray (or do something at a higher level that makes the conversion unnecessary)

Member Author

@WillAyd commented Sep 1, 2023

Yea, I went back and forth on that design. What is nice about keeping this very local to the issue is that it is clearer where you would need to refactor in the future to make the BitmaskArray work better.

In this case the only reason the conversion is needed is the astype("bool") call on L453. If that didn't exist we could probably avoid this isinstance check. If we moved the astype higher up the call stack it would be less clear why it is needed.
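
Concretely, the pattern being discussed is just a localized conversion at the call site (a sketch of the idea, not the exact diff):

if isinstance(mask, BitmaskArray):
    # convert right next to the ndarray-specific astype("bool") that needs it,
    # so a future refactor can see exactly what would make this unnecessary
    mask = mask.to_numpy()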

@seberg
Contributor

seberg commented Sep 5, 2023

Both PEP 3118 and the NumPy array interface mention the ability to work with bits, but they have different documented syntax (Nt versus tN where N is the number of bits) and I'm not clear that they were ever implemented? Maybe something @scoder or @seberg would know about?

I am not aware that NumPy actually supports this in any useful way at least. memoryview() doesn't care if a format is unknown. I would say the buffer protocol docs are more authoritative but it might be good to see if there is prior art that actually uses this.

@WillAyd
Member Author

WillAyd commented Sep 5, 2023

From searching through https://discuss.python.org/t/question-pep-3118-format-strings-and-the-buffer-protocol/31264/8 I found a link to python/cpython#47382 (comment) which I think confirms that bit support isn't implemented in CPython

Of course not a blocker to this PR but something to consider in the future that would make this even more useful

@WillAyd mentioned this pull request Sep 6, 2023
pandas/_libs/lib.pyx (outdated; resolved)
@@ -229,7 +240,7 @@ def fillna(
if method is not None:
func = missing.get_fill_func(method, ndim=self.ndim)
npvalues = self._data.T
new_mask = mask.T
new_mask = mask.to_numpy().T
Member

can we avoid this conversion? maybe feasible-but-out-of-scope?

Member Author

Are you asking about avoiding the T even in the current code? Or implementing support for transposition in the BitmaskArray directly?

Member

Or implementing support for transposition in the BitmaskArray directly?

That one

Member Author

The two options for that are:

  1. Physically transpose the layout of the bits in memory
  2. Keep the bits the same and just use state management to resolve indexes to a different address when transposed

I don't know which of these NumPy does. Maybe a combination of both. If we want option 1, we also need to find or develop a good algorithm for it.

It's possible; it just requires some dedicated time on that topic.
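
For what it's worth, NumPy's plain transpose takes the second route: .T returns a view over the same buffer with the strides swapped, so no data physically moves:

>>> import numpy as np
>>> arr = np.arange(6, dtype=np.int64).reshape(2, 3)
>>> arr.strides
(24, 8)
>>> arr.T.strides
(8, 24)
>>> np.shares_memory(arr, arr.T)
True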

@@ -286,9 +286,11 @@ def test_ensure_copied_data(self, index):
tm.assert_numpy_array_equal(
index._values._data, result._values._data, check_same="same"
)
assert np.shares_memory(index._values._mask, result._values._mask)
assert index._values._mask is result._values._mask
Member

there's a little-used tm.shares_memory that was intended to allow implementing this kind of check for non-ndarrays

Member Author

We could look at that, but I don't know if nanoarrow defines what the value of bits should be when the bitmask doesn't end on a word boundary. For example, an array with a length of 9 will take 2 total bytes of memory; I am not sure if nanoarrow specifies what bits 10-16 are, so doing a memcmp on that region might not work. @paleolimbot probably knows best though.


Sorry I missed this! nanoarrow definitely does not define anything about what the value of those bits should be, but where possible I believe it prefers to zero them (because it causes problems with valgrind if it does not). From the Arrow format:

Array slots which are null are not required to have a particular value; any “masked” memory can have any value and need not be zeroed, though implementations frequently choose to zero memory for null values.
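
So a straight memcmp over the buffers is only safe if both sides zero their padding; an equality check would more likely need to mask the trailing partial byte. A minimal sketch of such a helper (hypothetical, assuming Arrow's least-significant-bit-first layout):

def bitmasks_equal(a: bytes, b: bytes, nbits: int) -> bool:
    nfull, rem = divmod(nbits, 8)
    if a[:nfull] != b[:nfull]:
        return False
    if rem:
        keep = (1 << rem) - 1  # only the low `rem` bits of the last byte are meaningful
        return (a[nfull] & keep) == (b[nfull] & keep)
    return True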

@@ -286,9 +286,11 @@ def test_ensure_copied_data(self, index):
tm.assert_numpy_array_equal(
index._values._data, result._values._data, check_same="same"
)
assert np.shares_memory(index._values._mask, result._values._mask)
assert index._values._mask is result._values._mask
tm.assert_numpy_array_equal(
Member

should just have tm.assert_bitmask_equal?

@pytest.mark.xfail(
not np_version_gte1p24,
reason="Changed NumPy behavior for >1D non-tuple sequence indexing",
strict=False,
Member

is this flaky?

Member Author

Yea I couldn't reproduce locally but saw it on CI. Next push I'll take this off and we can see if it shows back up

@@ -231,15 +232,15 @@ class IntpHashTable(HashTable): ...
def duplicated(
values: np.ndarray,
keep: Literal["last", "first", False] = ...,
mask: npt.NDArray[np.bool_] | None = ...,
mask: npt.NDArray[np.bool_] | BitmaskArray | None = ...,
Member Author

For typing annotations like this we could also do a precursor to change mask and other arguments that are typed as memoryviews in Cython to typing_extensions.Buffer; I think that would be more technically correct.

Python 3.12 will have collections.abc.Buffer built in. See also PEP 688 https://peps.python.org/pep-0688/#equivalent-for-older-python-versions

I think the only loss that way is that I don't see typing.Buffer as being generic, so AFAICT you can't do typing.Buffer[bool] yet. But it seems logical that will get implemented in Python at some point.
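
A sketch of what such a stub signature could look like, using typing_extensions.Buffer as the PEP 688 backport (this is an assumption about a possible precursor, not what the PR currently does):

from typing import Literal

import numpy as np
import numpy.typing as npt
from typing_extensions import Buffer  # collections.abc.Buffer on Python 3.12+

def duplicated(
    values: np.ndarray,
    keep: Literal["last", "first", False] = "first",
    mask: Buffer | None = None,
) -> npt.NDArray[np.bool_]: ...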

@WillAyd
Member Author

WillAyd commented Oct 23, 2023

Closing for now, but I think this is feasible if we ever wanted to come back to it

Labels: Arrow (pyarrow functionality), Build (library building on various platforms), NA - MaskedArrays (related to pd.NA and nullable extension arrays)