Bitmask Backed MaskedArray #54506
Conversation
.pre-commit-config.yaml
Outdated
@@ -70,19 +70,6 @@ repos:
    - id: fix-encoding-pragma
      args: [--remove]
    - id: trailing-whitespace
- repo: https://github.com/cpplint/cpplint
nanoarrow and cpplint have a ton of conflicts. Generally I'm not sure how useful cpplint is for us, so I just removed the check.
We should fix this in nanoarrow! It's true that cpplint does poorly for C; however, lots of projects use it and will run into this problem.
pandas/core/arrays/masked.py
Outdated
@@ -112,7 +113,7 @@ class BaseMaskedArray(OpsMixin, ExtensionArray):
    _internal_fill_value: Scalar
    # our underlying data and mask are each ndarrays
    _data: np.ndarray
    _mask: npt.NDArray[np.bool_]
    _mask: BitMaskArray
Maybe we can make this a property and override the getter?
What are you thinking we override the getter with? Are you thinking of special logic, or still just treating it like a normal attribute?
I was thinking that you can make _mask a property with a getter that calls to_numpy on an internal _bitmask attribute where we'll store the actual mask.
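A rough sketch of that shape (the class and attribute names here are hypothetical, not the PR's actual code):

import numpy as np
import numpy.typing as npt

class SketchMaskedArray:
    # the real mask storage; "BitmaskArray" stands in for the new bitmask type
    _bitmask: "BitmaskArray"

    @property
    def _mask(self) -> npt.NDArray[np.bool_]:
        # keep the existing byte-mask interface by materializing on access
        return self._bitmask.to_numpy()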
Ah I see. The problem is sometimes we want to access the actual bitmask object (i.e. to forward to another constructor for reference counting). The to_numpy calls are only required when we are implicitly using numpy functionality in the code (granted, this is pretty often).
In theory it would be great to eliminate to_numpy() calls, with __or__ being a good example: it would be preferable to do that bitwise between two buffers if they are both bitmasks with the same length.
In practice I was having trouble properly exposing that through the Cython classes, so I am hoping to leave it as a future optimization.
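For reference, the buffer-level operation itself is simple once both masks have the same length; a rough illustration with plain NumPy on packed bytes (an assumption-laden sketch, not the Cython path discussed here):

import numpy as np

def bitmask_or(a: bytes, b: bytes) -> bytes:
    # OR two equal-length packed bitmasks byte by byte; any padding bits past
    # the logical length are ORed too but never read back
    assert len(a) == len(b)
    left = np.frombuffer(a, dtype=np.uint8)
    right = np.frombuffer(b, dtype=np.uint8)
    return np.bitwise_or(left, right).tobytes()

# e.g. [True, True, False, False] OR [False, True, True, False]
x = np.packbits([1, 1, 0, 0], bitorder="little").tobytes()
y = np.packbits([0, 1, 1, 0], bitorder="little").tobytes()
result = np.unpackbits(
    np.frombuffer(bitmask_or(x, y), dtype=np.uint8), count=4, bitorder="little"
).astype(bool)
# result: [ True  True  True False]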
We might need to keep _mask as a property that returns the numpy byte mask (this is somewhat semi-public), as Thomas mentioned, and add another attribute for the new bitmask (so essentially where you now changed ._mask to ._mask.to_numpy(), it would stay _mask, and anywhere you now use just ._mask, that would become something like ._bitmask).
See #34873 for some context on this
Good to know - I'm going to hold off on this until the end, maybe as a follow-up depending on how much of a change it requires. It is definitely feasible to do as you've suggested; I just want to balance that against a huge initial diff.
I'm a bit conflicted on this. The big downside to keeping ._mask calling to_numpy() is that that is a really expensive conversion, so it can cause performance regressions unless you really root it out. Assuming it is not a ton of downstream libraries that have assumed ._mask is an ndarray, maybe we can coordinate with them to update to ._mask.to_numpy() and maybe even ._mask.as_bitmask().
I think what we could do for now is ask any downstream libraries that have been using this to switch to isna(), which has returned and will continue to return a NumPy array with this PR.
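For downstream code, that switch would look roughly like this (a sketch; the Int64 example is only for illustration):

import pandas as pd

arr = pd.array([1, None, 3], dtype="Int64")

# before: reaching into the private byte mask
missing = arr._mask

# after: the public API, which returns a NumPy bool array both before and after this PR
missing = arr.isna()  # array([False,  True, False])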
Benchmarks

$ asv continuous upstream/main HEAD
before after ratio
[4f1a086f] [173b4cbe]
<main> <bitmask-backed>
+ 1.23±0μs 66.8±1μs 54.35 array.IntegerArray.time_constructor
+ 707±10ns 1.41±0.04μs 2.00 array.BooleanArray.time_constructor
+ 957±4μs 1.70±0.02ms 1.78 frame_ctor.FromScalar.time_frame_from_scalar_ea_float64
+ 978±6μs 1.68±0.01ms 1.71 frame_ctor.FromScalar.time_frame_from_scalar_ea_float64_na
+ 8.72±0.2μs 11.9±0.09μs 1.37 boolean.TimeLogicalOps.time_xor_scalar
+ 8.96±0.2μs 11.9±0.3μs 1.33 boolean.TimeLogicalOps.time_or_array
+ 8.60±0.3μs 11.4±0.2μs 1.33 boolean.TimeLogicalOps.time_and_array
+ 2.19±0.03ms 2.92±0.5ms 1.33 multiindex_object.SetOperations.time_operation('monotonic', 'ea_int', 'intersection', False)
+ 13.7±0.2ms 17.7±0.1ms 1.29 index_object.SetOperations.time_operation('monotonic', 'date_string', 'intersection')
+ 1.26±0.07ms 1.53±0.02ms 1.21 index_object.SetOperations.time_operation('non_monotonic', 'ea_int', 'intersection')
+ 297±30μs 368±30μs 1.24 series_methods.NanOps.time_func('sum', 1000000, 'int8')
+ 1.24±0.01ms 1.48±0.03ms 1.19 index_object.SetOperations.time_operation('non_monotonic', 'ea_int', 'intersection')
+ 2.47±0.02ms 2.92±0.03ms 1.18 reshape.ReshapeMaskedArrayDtype.time_transpose('Int64')
+ 45.2±0.3μs 52.5±0.6μs 1.16 algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.uint64'>, 12, 'outside')
+ 2.51±0.01ms 2.89±0.05ms 1.15 reshape.ReshapeMaskedArrayDtype.time_transpose('Float64')
+ 76.8±1μs 88.0±2μs 1.15 algos.isin.IsinWithRandomFloat.time_isin(<class 'numpy.float64'>, 2000, 'outside')
+ 225±30μs 257±4μs 1.14 algos.isin.IsIn.time_isin_categorical('string[python]')
+ 195±4ns 223±5ns 1.14 indexing_engines.MaskedNumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.MaskedFloat64Engine'>, 'Float64'), 'monotonic_decr', True, 2000000)
+ 8.03±0.2ms 9.13±0.08ms 1.14 frame_methods.Equals.time_frame_nonunique_unequal
+ 2.05±0.03ms 2.33±0.04ms 1.14 index_object.SetOperations.time_operation('non_monotonic', 'ea_int', 'union')
+ 478±3μs 539±1μs 1.13 series_methods.NanOps.time_func('min', 1000000, 'boolean')
+ 95.3±1μs 108±2μs 1.13 algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.uint64'>, 13, 'outside')
+ 46.2±1μs 51.8±0.5μs 1.12 algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.int64'>, 12, 'outside')
+ 11.0±0.5μs 12.3±0.9μs 1.12 algorithms.SortIntegerArray.time_argsort(1000)
+ 860±100ns 963±100ns 1.12 index_cached_properties.IndexCache.time_shape('TimedeltaIndex')
+ 463±5μs 515±5μs 1.11 series_methods.NanOps.time_func('max', 1000000, 'boolean')
+ 56.3±2μs 62.5±0.8μs 1.11 algos.isin.IsinWithArangeSorted.time_isin(<class 'numpy.float64'>, 2000)
+ 8.60±0.2ms 9.51±0.2ms 1.11 frame_methods.Equals.time_frame_nonunique_equal
+ 1.12±0.01ms 1.25±0.03ms 1.11 series_methods.Fillna.time_bfill('Int64')
+ 2.07±0.02ms 2.30±0.1ms 1.11 multiindex_object.SetOperations.time_operation('non_monotonic', 'datetime', 'union', False)
+ 196±1ns 217±3ns 1.11 indexing_engines.NumericEngineIndexing.time_get_loc_near_middle((<class 'pandas._libs.index.Float32Engine'>, <class 'numpy.float32'>), 'monotonic_decr', True, 2000000)
+ 31.6±1μs 34.8±0.2μs 1.10 algos.isin.IsIn.time_isin_mismatched_dtype('Int64')
+ 1.15±0.02ms 1.26±0.01ms 1.10 series_methods.Fillna.time_bfill('Float64')
+ 4.13±0.05s 4.55±0.02s 1.10 stat_ops.FrameOps.time_op('kurt', 'Int64', 1)
+ 2.71±0.03ms 2.98±0.02ms 1.10 indexing_engines.MaskedNumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.MaskedInt16Engine'>, 'Int16'), 'monotonic_incr', True, 2000000)
- 2.45±0.02ms 2.22±0.01ms 0.91 stat_ops.FrameOps.time_op('sem', 'Int64', None)
- 13.5±0.6ms 12.3±0.1ms 0.91 categoricals.Constructor.time_regular
- 2.43±0.02ms 2.18±0.01ms 0.90 stat_ops.FrameOps.time_op('sem', 'Int64', 0)
- 16.3±0.6μs 14.8±0.2μs 0.90 series_methods.NanOps.time_func('std', 1000, 'int32')
- 2.82±0.4ms 2.52±0.05ms 0.89 series_methods.NanOps.time_func('var', 1000000, 'int32')
- 9.94±0.4ms 8.81±0.2ms 0.89 series_methods.NanOps.time_func('sem', 1000000, 'float64')
- 93.7±0.8μs 83.7±0.4μs 0.89 join_merge.ConcatIndexDtype.time_concat_series('Int64', 'non_monotonic', 0, True)
- 93.8±0.5μs 83.6±0.7μs 0.89 join_merge.ConcatIndexDtype.time_concat_series('Int64', 'non_monotonic', 0, False)
- 1.22±0.3μs 1.08±0.1μs 0.88 index_cached_properties.IndexCache.time_values('CategoricalIndex')
- 11.9±0.3ms 10.1±0.08ms 0.84 frame_methods.Fillna.time_fillna(False, 'Float64')
- 520±50μs 433±3μs 0.83 series_methods.NanOps.time_func('sum', 1000000, 'float64')
- 728±50μs 596±7μs 0.82 series_methods.NanOps.time_func('sum', 1000000, 'Int64')
- 4.90±0.3ms 4.03±0.2ms 0.82 indexing.MultiIndexing.time_loc_slice_plus_null_slice(False)
- 12.1±0.4ms 9.92±0.07ms 0.82 frame_methods.Fillna.time_fillna(False, 'Int64')
- 374±3μs 307±2μs 0.82 timeseries.DatetimeIndex.time_unique('repeated')
- 1.21±0.01ms 981±8μs 0.81 frame_methods.Fillna.time_bfill(True, 'Int64')
- 815±70ns 661±60ns 0.81 index_cached_properties.IndexCache.time_values('TimedeltaIndex')
- 1.20±0.01ms 972±7μs 0.81 frame_methods.Fillna.time_bfill(True, 'Float64')
- 11.6±0.03ms 9.01±0.08ms 0.78 indexing.InsertColumns.time_insert_middle
- 3.68±0.3ms 2.85±0.05ms 0.77 series_methods.NanOps.time_func('var', 1000000, 'int64')
- 3.91±0.02μs 3.03±0.02μs 0.78 series_methods.Any.time_any(1000, 'slow', 'boolean')
- 3.99±0.06μs 3.06±0.04μs 0.77 series_methods.All.time_all(1000, 'fast', 'boolean')
- 458M 349M 0.76 array.BooleanArrayMem.peakmem_array
- 4.74±0.09ms 3.28±0.03ms 0.69 series_methods.Fillna.time_fillna('Float64')
- 4.73±0.03ms 3.19±0.01ms 0.67 series_methods.Fillna.time_fillna('Int64')
- 324±6μs 29.9±0.4μs 0.09 series_methods.All.time_all(1000000, 'slow', 'boolean')
- 309±3μs 10.8±0.1μs 0.03 series_methods.All.time_all(1000000, 'fast', 'boolean')
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
@@ -43,7 +44,7 @@ def quantile_compat(


def quantile_with_mask(
    values: np.ndarray,
    mask: npt.NDArray[np.bool_],
    mask: npt.NDArray[np.bool_] | BitmaskArray,
no viable way to avoid this?
The typing has to get changed somehow since we aren't passing an ndarray. Instead of a Union I think a more generic type would declare this as accepting anything that implements the buffer protocol.
If that were done as a precursor it would really cut down on the diff here, and arguably align better with the memoryview declarations we have in the Cython extensions (at least most of the time).
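What that more generic acceptance could look like on the Python side (a sketch under those assumptions, not what this PR does):

import numpy as np

def as_byte_view(buf) -> np.ndarray:
    # accept any buffer-protocol exporter (an ndarray, a bytes object, or a
    # hypothetical bitmask type exposing its packed bytes) and view it as uint8
    return np.frombuffer(buf, dtype=np.uint8)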
As above, I'd really prefer the caller to be responsible for conversion.
Can the BitMaskArray support 2D?
Yea 2D is working with this. The "support" is limited to storing the bits in a C-style contiguous fashion. Slicing and non axis=0 operations right now require a to_numpy() conversion.
To illustrate:

>>> from pandas._libs.arrays import BitmaskArray
>>> import numpy as np
>>> arr = np.array([True, True, False, False]).reshape(2, -1)
>>> bma = BitmaskArray(arr)
>>> bma.any()
True
>>> bma.all()
False
>>> bma.shape
(2, 2)
>>> bma[0]
array([ True, True])
>>> bma[1]
array([False, False])
>>> bma[:, 0]
array([ True, False])
>>> bma[:, 1]
array([ True, False])
>>> "{0:08b}".format(bma.bytes[0])
'00000011'
>>> bma.to_numpy()
array([[ True, True],
       [False, False]])

This is all handled via
@@ -424,7 +425,7 @@ def nunique_ints(values: ArrayLike) -> int:
    return result


def unique_with_mask(values, mask: npt.NDArray[np.bool_] | None = None):
def unique_with_mask(values, mask: npt.NDArray[np.bool_] | BitmaskArray | None = None):
haven't looked that closely, but my kneejerk both here and in array_algos is that the caller should be responsible for converting the BitmaskArray (or do something at a higher level that makes the conversion unnecessary)
Yea I went back and forth with that design. What is nice about keeping this very local to the issue is that it is clearer where you would need to refactor in the future to make the BitmaskArray work better.
In this case the only reason the conversion is needed is because of the astype("bool") call on L453. If that didn't exist we could probably avoid this isinstance check. If we moved the astype higher up the call stack it would be less clear why it is needed.
I am not aware that NumPy actually supports this in any useful way, at least.
From searching through https://discuss.python.org/t/question-pep-3118-format-strings-and-the-buffer-protocol/31264/8 I found a link to python/cpython#47382 (comment), which I think confirms that bit support isn't implemented in CPython. Of course not a blocker to this PR, but something to consider in the future that would make this even more useful.
@@ -229,7 +240,7 @@ def fillna(
        if method is not None:
            func = missing.get_fill_func(method, ndim=self.ndim)
            npvalues = self._data.T
            new_mask = mask.T
            new_mask = mask.to_numpy().T
can we avoid this conversion? maybe feasible-but-out-of-scope?
Are you asking about avoiding the T even in the current code? Or implementing support for transposition in the BitmaskArray directly?
Or implementing support for transposition in the BitmaskArray directly?
That one
The two options for that are:
- Physically transpose the layout of the bits in memory
- Keep the bits the same and just use state management to resolve indexes to a different address when transposed
I don't know which of these NumPy does. Maybe a combination of both. If we want point 1, we also need to find / develop a good algorithm for it
It's possible; it just requires some dedicated time on that topic.
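A sketch of option 2 in pure Python (a hypothetical helper, assuming LSB-first bit order as in the REPL example above):

def get_bit(buf: bytes, row: int, col: int, shape: tuple[int, int],
            transposed: bool = False) -> bool:
    # bits are stored C-contiguously for the original shape; a transposed view
    # only changes how (row, col) resolves to a flat bit index
    nrows, ncols = shape
    flat = col * ncols + row if transposed else row * ncols + col
    return bool(buf[flat // 8] & (1 << (flat % 8)))

# mask [[True, True], [False, False]] packs (LSB-first) to the byte 0b00000011
buf = bytes([0b00000011])
get_bit(buf, 0, 1, (2, 2), transposed=True)  # False, i.e. transposed[0, 1] == original[1, 0]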
@@ -286,9 +286,11 @@ def test_ensure_copied_data(self, index):
        tm.assert_numpy_array_equal(
            index._values._data, result._values._data, check_same="same"
        )
        assert np.shares_memory(index._values._mask, result._values._mask)
        assert index._values._mask is result._values._mask
there's a little-used tm.shares_memory that was intended to allow implementing this kind of check for non-ndarrays
We could look at that, but I don't know if nanoarrow defines what the value of bits should be when the bitmask doesn't end on a word boundary. For example, in an array with a length of 9, you will have 2 total bytes of memory; I am not sure if nanoarrow specifies what bits 10-16 are, so doing a memcmp on that region might not work. @paleolimbot probably knows best though.
Sorry I missed this! nanoarrow definitely does not define anything about what the value of those bits should be, but where possible I believe it prefers to zero them (because it causes problems with valgrind if it does not). From the Arrow format:
Array slots which are null are not required to have a particular value; any “masked” memory can have any value and need not be zeroed, though implementations frequently choose to zero memory for null values.
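So a bitmask equality helper would need to mask out those padding bits in the final byte rather than compare raw memory; roughly like this (a hypothetical helper, not an existing tm function):

def bitmasks_equal(a: bytes, b: bytes, nbits: int) -> bool:
    # compare only the first nbits; trailing padding bits in the last byte are
    # undefined per the Arrow format, so zero them out before comparing
    nbytes = (nbits + 7) // 8
    left, right = bytearray(a[:nbytes]), bytearray(b[:nbytes])
    if nbits % 8:
        keep = (1 << (nbits % 8)) - 1  # low-bit mask covering the valid bits
        left[-1] &= keep
        right[-1] &= keep
    return left == right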
@@ -286,9 +286,11 @@ def test_ensure_copied_data(self, index):
        tm.assert_numpy_array_equal(
            index._values._data, result._values._data, check_same="same"
        )
        assert np.shares_memory(index._values._mask, result._values._mask)
        assert index._values._mask is result._values._mask
        tm.assert_numpy_array_equal(
should just have tm.assert_bitmask_equal?
@pytest.mark.xfail(
    not np_version_gte1p24,
    reason="Changed NumPy behavior for >1D non-tuple sequence indexing",
    strict=False,
is this flaky?
Yea I couldn't reproduce locally but saw it on CI. Next push I'll take this off and we can see if it shows back up
Here's the error this creates on the min versions CI
https://github.com/pandas-dev/pandas/actions/runs/6200248114/job/16834535267?pr=54506#step:8:181
@@ -231,15 +232,15 @@ class IntpHashTable(HashTable): ...
def duplicated(
    values: np.ndarray,
    keep: Literal["last", "first", False] = ...,
    mask: npt.NDArray[np.bool_] | None = ...,
    mask: npt.NDArray[np.bool_] | BitmaskArray | None = ...,
For typing annotations like this we could also do a precursor to change mask and other arguments that are typed as memoryviews in Cython to be typing_extensions.Buffer; I think that would be more technically correct.
Python 3.12 will have collections.abc.Buffer built in. See also PEP 688 https://peps.python.org/pep-0688/#equivalent-for-older-python-versions
I think the only loss that way is that I don't see typing.Buffer as being Generic, so AFAICT you can't do typing.Buffer[bool] yet. But it seems logical that will get implemented in Python at some point.
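A sketch of what that annotation could look like (assuming typing_extensions >= 4.6 for Buffer; since Buffer[bool] does not exist, the element type is lost):

from __future__ import annotations

import numpy as np
from typing_extensions import Buffer  # collections.abc.Buffer on Python 3.12+

def duplicated(values: np.ndarray, mask: Buffer | None = None) -> np.ndarray:
    # any buffer exporter works for mask: a NumPy bool array, bytes, or a
    # bitmask type exposing its packed bytes via the buffer protocol
    ...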
Closing for now, but I think this is feasible if we ever wanted to come back to it.
POC using vendored nanoarrow