ENH: 2D support for MaskedArray #38992

jbrockmendel · 2021-01-06T02:55:49Z

This doesn't in any way use the 2D support, but opens up the option of incrementally fleshing out the tests.

jreback

looks fine to me.

pandas/core/arrays/_mixins.py

simonjayhawkins · 2021-01-06T15:30:38Z

pandas/core/arrays/masked.py

@@ -80,6 +80,8 @@ class BaseMaskedArray(OpsMixin, ExtensionArray):

    # The value used to fill '_data' to avoid upcasting
    _internal_fill_value: Scalar
+    _data: np.ndarray
+    _mask: np.ndarray[Any, bool]


np.ndarray in numpy 1.20 is not generic so although mypy is happy with type parameters on Any, this will raise errors when we transition to numpy types.

removed the [Any, bool] part of this. is there an approximate calendar for the transition to numpy types?

I think these are in 1.20 (the release is blocked on an arrow update atm)

> is there an approximate calendar for the transition to numpy types?

I updated to 1.20.0rc2 locally and there were no changes to the mypy errors. (I've not yet checked the commit history to see if there were any changes.)

Once 1.20 is released, and we pick it up on CI we will see the errors in #36092. (I'll merge master and make a start on updating now)

I assume that we will pin numpy in ci while we discuss how to sort out the errors. (once we know what the status is with the released numpy)
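For readers following the annotation question in this thread, here is a minimal sketch of the two spellings being discussed: a bare `np.ndarray` for `_data`, and a dtype-parametrized annotation for `_mask`. It assumes NumPy 1.21 or later, where `numpy.typing.NDArray` is available. `TinyMasked` is a hypothetical stand-in for illustration only, not pandas' `BaseMaskedArray`.

```python
from __future__ import annotations

import numpy as np
import numpy.typing as npt


class TinyMasked:
    """Hypothetical data-plus-mask container (not the pandas class)."""

    # Bare annotation, as kept in this PR after dropping the type parameters.
    _data: np.ndarray
    # Parametrized spelling; assumes numpy.typing.NDArray (NumPy >= 1.21).
    _mask: npt.NDArray[np.bool_]

    def __init__(self, data, mask) -> None:
        data = np.asarray(data)
        mask = np.asarray(mask, dtype=bool)
        if data.shape != mask.shape:
            raise ValueError("data and mask must have the same shape")
        self._data = data
        self._mask = mask

    @property
    def shape(self) -> tuple[int, ...]:
        return self._data.shape

    def reshape(self, *shape: int) -> TinyMasked:
        # Data and mask are reshaped together so they stay aligned.
        return TinyMasked(self._data.reshape(*shape), self._mask.reshape(*shape))


two_d = TinyMasked([1, 2, 3, 4], [True, False, False, False]).reshape(2, 2)
```

The point of the sketch is only that the mask carries a boolean dtype in its annotation while the data array stays unparametrized.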


jorisvandenbossche

Such a significant architectural change shouldn't be merged without prior discussion

jreback · 2021-01-06T18:35:57Z

> Such a significant architectural change shouldn't be merged without prior discussion

sure, what are your concerns

jorisvandenbossche · 2021-01-08T22:16:23Z

The ExtensionArrays have been 1D from the start (for around 3 years now), and the same holds for MaskedArray specifically. So if Brock proposes to change that, I think it is first of all up to the proposer to come up with arguments for changing this.

There has been discussion before about 1D vs 2D extension arrays, for sure (although I don't think any of that discussion resulted in a clear decision to merge a PR like this). But specifically for MaskedArray, we didn't have any discussion about this, AFAIK. I think that at least requires some discussion about whether we want this or not?

The masked arrays have been explicitly designed to be 1D, of course also because currently ExtensionArrays in general are 1D, but in addition, there are several ideas for future improvements (#30435), exploring bitmask instead of boolean mask (#31293), optionally using arrow under the hood like we are doing for the string array, more efficient zero-copy conversion with arrow, nested data types (eg #35176). Those are all not impossible with 2D arrays, but IMO will be much easier with 1D arrays, and thus requires some consideration.

It's also not fully clear to me where this PR is heading. At the moment it is adding quite some complexity for something that is not yet used. What's the plan for how to actually use it in pandas? Do we want to give the masked arrays 2D capabilities (so this ability can be used for certain operations), but keep storing them as 1D in DataFrames? Or do we want to change the ExtensionBlock to a consolidated 2D Block? But only for masked arrays, or for all ExtensionArrays? What about arrays that cannot easily be 2D (eg nested array)? What's the idea for externally defined ExtensionArrays? ...

jbrockmendel · 2021-01-09T01:19:01Z

The reason to do this is roughly the same reason why we're moving forward with ArrayManager: so that we can see if actually using this is something we want to do longer-term.

jbrockmendel · 2021-01-16T03:42:53Z

gentle ping, plenty more tests where these came from

jreback · 2021-01-20T22:59:29Z

ok I am +1 on merging this. I agree with @jbrockmendel's reasoning here. We don't really know where we are going to ultimately go, e.g. ArrayManager or simplified BlockManager. We need more support & performance testing to see. Sure I'd like to see a unified approach, but we have advocates for both and would rather not inhibit experimentation.

cc @jorisvandenbossche

jorisvandenbossche · 2021-01-21T23:17:25Z

> The reason to do this is roughly the same reason why we're moving forward with ArrayManager: so that we can see if actually using this is something we want to do longer-term.

For the simplified non-consolidating BlockManager, I started with a description of arguments for it, we had an extensive discussion about it, with several people expressing their interest for it, and with the main question mark being performance. At which point we need a proof of concept to test things.

As far as I know, we have had no such discussion about 2D masked arrays.

jbrockmendel · 2021-01-21T23:18:26Z

> As far as I know, we have had no such discussion about 2D masked arrays.

We've had the same discussion about 2D EAs repeatedly.

jreback · 2021-01-21T23:46:04Z

@jorisvandenbossche do you have actual concrete objections to merging this? We are allowing ArrayManager on an experimental basis, I don't see how this is any different.

jorisvandenbossche · 2021-01-22T14:31:35Z

I think my longer comment above (#38992 (comment)) already includes some concrete concerns. Reformulating them:

- We have had many discussions about 2D ExtensionArrays, yes (mostly in 2019, see eg EA: support basic 2D operations #27142 and linked PRs and the mailing list discussion at https://mail.python.org/pipermail/pandas-dev/2019-June/000983.html), but AFAIK those discussions have not yet led to a consensus or compromise in favor of 2D EAs (if I recall correctly, the use of 2D arrays for datetime ops was discussed then as a compromise).
- Fully supporting 2D ExtensionArrays is a big change, that requires a more detailed proposal and discussion IMO. And we already have the existing consolidating BlockManager to know how internals with 2D arrays would work (and to compare the ArrayManager with).
- IMO this PR is missing context on how we would actually use this in pandas. I think we should at least have some idea about that before merging this (I asked several questions above (eg do we want to make ExtensionBlock 2D? What does this mean for other EAs? ...), to which no response has been given)
- The POC for the ArrayManager is mostly independent from the existing code (eg the BlockManager didn't become any more complex due to merging it), while this is profoundly changing the existing MaskedArrays, making it more difficult to further improve them as 1D arrays (there is still a lot of work to make them fully feature-complete to start with, and I mentioned several possible additional enhancements above)

(having a call about this might help resolve some of those discussion points?)

jbrockmendel · 2021-01-22T16:43:12Z

I'm tired of repeating myself. At the sprint in 2019 we (including Wes) agreed to move forward with 2D EA support for experimentation. Since then the only thing we've learned is that you are more willing to repeat the same arguments over and over again than I am, and everyone else makes the entirely reasonable decision to tune it out.

> there is still a lot of work to make them fully feature-complete to start with

I would dearly like to see that, which I see as part and parcel of the fix-many-xfails that I brought up on last week's call. But I don't see any effort towards making them happen, or any reason why they are mutually exclusive with this experimentation.


jbrockmendel · 2021-08-03T19:35:49Z

gentle ping; this would simplify a corner case in #33036

jreback · 2021-08-04T02:22:20Z

ok tests are failing.

i suppose it makes sense to support both 1d and 2d kernels on things. it does lead to some code duplication, but performance can be great if we don't need to operate column-by-column all the time. However the codebase is mostly 1d currently, with some efforts to add 2d kernels.

does this concur with your thinking?
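To make the 1d-versus-2d kernel trade-off mentioned here concrete, a rough sketch follows. It compares a masked `any` reduction done column by column with the same reduction done on the whole 2D block in one vectorized call. Both function names are hypothetical and written only for this illustration; neither is taken from pandas.

```python
import numpy as np


def masked_any_columnwise(values, mask):
    """Call a 1D reduction once per column (hypothetical helper)."""
    out = np.empty(values.shape[1], dtype=bool)
    for j in range(values.shape[1]):
        col, col_mask = values[:, j], mask[:, j]
        # Drop the masked entries of this column, then reduce what is left.
        out[j] = bool(col[~col_mask].any())
    return out


def masked_any_blockwise(values, mask):
    """Reduce the whole 2D block at once (hypothetical helper)."""
    # Masked entries are treated as False, so they cannot make a column True.
    return (values.astype(bool) & ~mask).any(axis=0)


values = np.array([[0, 1, 3], [0, 0, 2]])
mask = np.array([[False, True, False], [False, False, True]])

by_column = masked_any_columnwise(values, mask)
by_block = masked_any_blockwise(values, mask)
```

Both paths give the same answer; the blockwise version avoids the Python-level loop over columns, which is where the potential performance gain comes from, at the cost of maintaining a second kernel.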

jbrockmendel · 2021-08-04T15:52:27Z

> ok tests are failing.

Fixed

> [...] does this concur with your thinking?

Was this part of the comment supposed to go in #42841? Will answer there.

simonjayhawkins · 2021-10-06T11:19:32Z

@jbrockmendel needs rebase

pandas/core/arrays/boolean.py

jreback · 2021-10-06T12:42:39Z

pandas/core/arrays/masked.py

@@ -115,6 +117,9 @@ class BaseMaskedArray(OpsMixin, ExtensionArray):

    # The value used to fill '_data' to avoid upcasting
    _internal_fill_value: Scalar
+    _data: np.ndarray


add a comment about these


jreback

lgtm. great to see how to proceed.

jreback · 2021-10-16T17:56:36Z

pandas/core/array_algos/masked_reductions.py

+    skipna: bool = True,
+    axis: Optional[int] = None,
+):
+    return _minmax(np.max, values=values, mask=mask, skipna=skipna, axis=axis)


are there doc strings here? if so can you update (can be followup as well)
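As a rough illustration of what an `axis`-aware masked reduction like the one in this hunk has to do, here is a sketch using plain NumPy. It is not the pandas `_minmax` implementation: the function name, the fill strategy, and the `(result, result_mask)` return convention are all assumptions made for this example.

```python
import numpy as np


def masked_max_sketch(values, mask, skipna=True, axis=None):
    """Max over `axis`, ignoring entries where `mask` is True (hypothetical)."""
    # Replace masked entries with the smallest value of the dtype so that
    # they can never win the max.
    if values.dtype.kind == "f":
        low = values.dtype.type(-np.inf)
    else:
        low = np.iinfo(values.dtype).min
    filled = np.where(mask, low, values)
    result = filled.max(axis=axis)

    if skipna:
        # The result is missing only where every entry along the axis is missing.
        result_mask = mask.all(axis=axis)
    else:
        # Without skipna, a single missing entry makes the result missing.
        result_mask = mask.any(axis=axis)
    return result, result_mask


values = np.array([[1, 5], [7, 2]])
mask = np.array([[False, True], [False, False]])

col_max, col_mask = masked_max_sketch(values, mask, axis=0)
row_max, row_mask = masked_max_sketch(values, mask, axis=1)
_, strict_mask = masked_max_sketch(values, mask, skipna=False, axis=0)
```

With `axis=None` the same function reduces over the whole block and returns scalars, which is one reason a single 2D-capable kernel can serve both the 1D and 2D cases.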

jreback · 2021-10-16T17:58:11Z

pandas/tests/extension/base/dim2.py

@@ -194,9 +199,23 @@ def test_reductions_2d_axis0(self, data, method, request):
            if method in ["sum", "prod"] and data.dtype.kind in ["i", "u"]:
                # FIXME: kludge
                if data.dtype.kind == "i":
-                    dtype = pd.Int64Dtype()
+                    if is_platform_windows() or not IS64:


oh yeah followup with this to make nicer

jreback · 2021-10-16T17:58:58Z

certainly fine for testing things out. thanks @jbrockmendel

a couple of followups

simonjayhawkins · 2021-10-19T10:35:31Z

pandas/_libs/algos.pyx

                    continue
                fill_count += 1
                values[j, i] = val
+                mask[j, i] = False


mask should be previous mask... see pad_inplace #39953

    import numpy as np
    import pandas as pd

    dtype = pd.Int64Dtype()
    data_missing = pd.array([pd.NA, 1], dtype=dtype)
    arr = data_missing.repeat(4).reshape(4, 2)
    result = arr.fillna(method="pad")
    print(result)
    expected = data_missing.fillna(method="pad").repeat(4).reshape(4, 2)
    print(expected)

    <IntegerArray>
    [
    [<NA>, <NA>],
    [1, 1],
    [1, 1],
    [1, 1]
    ]
    Shape: (4, 2), dtype: Int64
    <IntegerArray>
    [
    [<NA>, <NA>],
    [<NA>, <NA>],
    [1, 1],
    [1, 1]
    ]
    Shape: (4, 2), dtype: Int64

so fixing this on my numba branch results in...

    @numba.njit
    def _pad_2d_inplace(values, mask, limit=None):
        if values.shape[1]:
            K, N = values.shape
            if limit is None:
                for j in range(K):
                    val, prev_mask = values[j, 0], mask[j, 0]
                    for i in range(N):
                        if mask[j, i]:
                            values[j, i], mask[j, i] = val, prev_mask
                        else:
                            val, prev_mask = values[j, i], mask[j, i]
            else:
                for j in range(K):
                    fill_count = 0
                    val, prev_mask = values[j, 0], mask[j, 0]
                    for i in range(N):
                        if mask[j, i]:
                            if fill_count >= limit:
                                continue
                            fill_count += 1
                            values[j, i], mask[j, i] = val, prev_mask
                        else:
                            fill_count = 0
                            val, prev_mask = values[j, i], mask[j, i]

I have some duplication here, but there is a perf improvement for the common case of no limit. The duplication can probably be mitigated by reshaping a 1d array and removing the 1d version, pad_inplace.

I might look into a variant using 2 loops: the first to find the first non-missing value, and the second to fill without tracking the previous mask. Then we could just do mask[j, i] = False.

One more thing while in the neighborhood, I think for i in range(N) should probably be for i in range(1, N)?
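The two suggestions in this comment (carry the previous mask forward instead of always writing False, and start the inner loop at 1) can be checked with a small pure-Python sketch. This is not the Cython kernel from `algos.pyx` nor the numba version quoted above; `pad_2d_inplace_sketch` is a hypothetical name, and the single-branch handling of `limit` is a simplification made for this example.

```python
import numpy as np


def pad_2d_inplace_sketch(values, mask, limit=None):
    """Forward-fill along the second axis in place (hypothetical helper)."""
    K, N = values.shape
    if N == 0:
        return
    for j in range(K):
        fill_count = 0
        # Track both the last value and whether that value was itself missing.
        val, prev_mask = values[j, 0], mask[j, 0]
        # Position 0 has nothing before it, so start at 1.
        for i in range(1, N):
            if mask[j, i]:
                if limit is not None and fill_count >= limit:
                    continue
                fill_count += 1
                # Propagate the previous mask: leading missing values stay missing.
                values[j, i], mask[j, i] = val, prev_mask
            else:
                fill_count = 0
                val, prev_mask = values[j, i], mask[j, i]


values = np.array([[0, 0, 1, 1], [5, 0, 0, 0]])
mask = np.array([[True, True, False, False], [False, True, True, True]])
pad_2d_inplace_sketch(values, mask)

limited_values = np.array([[5, 0, 0, 0]])
limited_mask = np.array([[False, True, True, True]])
pad_2d_inplace_sketch(limited_values, limited_mask, limit=1)
```

In the first row the two leading missing entries have no earlier valid value, so they remain masked, which matches the `expected` output in the reproducer above. In the second row the single valid value is carried across the row.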

simonjayhawkins · 2021-10-19T11:58:26Z

pandas/_libs/algos.pyx

@@ -656,10 +656,11 @@ def pad_2d_inplace(numeric_object_t[:, :] values, const uint8_t[:, :] mask, limi
        val = values[j, 0]
        for i in range(N):
            if mask[j, i]:
-                if fill_count >= lim:
+                if fill_count >= lim or i == 0:


see #38992 (comment)

jbrockmendel · 2022-04-15T20:39:23Z

@simonjayhawkins addressing your post-merge comments has been on my todo list for a while, looks likely to fall off. Do they merit their own Issue?

ENH: 2D support for MaskedArray

ae51cff

jreback approved these changes Jan 6, 2021


jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jan 6, 2021

jreback added this to the 1.3 milestone Jan 6, 2021

simonjayhawkins reviewed Jan 6, 2021


jbrockmendel added 2 commits January 6, 2021 08:11

Merge branch 'master' of https://github.com/pandas-dev/pandas into en…

f608792

…h-masked-2d

remove Any part of _mask annotation

125606b

jorisvandenbossche requested changes Jan 6, 2021


xfail for ArrowStringArray

dd5dbbe

jbrockmendel mentioned this pull request Jan 20, 2021

BUG: setting dt64 values into Series[int] incorrectly casting dt64->int #39266

Merged

4 tasks

jbrockmendel mentioned this pull request Feb 1, 2021

CI/TST: update exception message, xfail #39546

Merged

jbrockmendel added 9 commits February 3, 2021 10:03

Merge branch 'master' of https://github.com/pandas-dev/pandas into en…

577826c

…h-masked-2d

absolute import

17f63d4

Merge branch 'master' of https://github.com/pandas-dev/pandas into en…

33b2d78

…h-masked-2d

Merge branch 'master' of https://github.com/pandas-dev/pandas into en…

a2bd7b1

…h-masked-2d

TST: reductions with axis

3f14fa3

Merge branch 'master' of https://github.com/pandas-dev/pandas into en…

6600588

…h-masked-2d

Merge branch 'master' into enh-masked-2d

560279c

np_version_under1p17 compat

553038c

Merge branch 'master' into enh-masked-2d

b2a26bf

jbrockmendel mentioned this pull request May 24, 2021

Support for pandas Extension Arrays pydata/xarray#5287

Closed

Merge branch 'master' into enh-masked-2d

8f315bc

fix broken tests

21cf578

jbrockmendel mentioned this pull request Aug 4, 2021

PERF: GroupBy.any/all operate blockwise instead of column-wise #42841

Merged

4 tasks

jbrockmendel added 3 commits September 28, 2021 15:27

Merge branch 'master' into enh-masked-2d

6f215d6

Merge branch 'master' into enh-masked-2d

e68c797

Merge branch 'master' into enh-masked-2d

17dd19a

jreback added this to the 1.4 milestone Oct 6, 2021

jreback requested changes Oct 6, 2021


jbrockmendel added 10 commits October 6, 2021 09:05

Merge branch 'master' into enh-masked-2d

93d65eb

comment

3bfe60c

Merge branch 'master' into enh-masked-2d

5c28d69

Merge branch 'master' of https://github.com/pandas-dev/pandas into en…

92d710b

…h-masked-2d

troubleshoot windows build

5b014c1

Merge branch 'master' into enh-masked-2d

db76ca0

Merge branch 'master' into enh-masked-2d

8148fcd

troubleshoot 32bit builds

15a533f

troubleshoot 32bit builds

7a7601e

troubleshoot 32 bit builds

7c6baaf

jreback approved these changes Oct 16, 2021


jreback merged commit 4d9b6f7 into pandas-dev:master Oct 16, 2021

jbrockmendel deleted the enh-masked-2d branch October 18, 2021 15:02

simonjayhawkins reviewed Oct 19, 2021
