Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ArrayManager] Array version of fillna logic #41104

Closed

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Apr 22, 2021

xref #39146

I was looking into getting rid of the Block fallback (apply_with_block) for fillna in ArrayManager. For that, I started with putting the bits and pieces from the several BlockManager methods (the different fillna implementations, maybe_downcast) together in a function that would do the same for plain arrays as input, to see what this would give.

The draft result of that can be seen in this PR. This seems to pass all tests for the ArrayManager, so should have the same behaviour as BlockManager.fillna. I only hoped that we could then also use this in the BlockManager to have a shared implementation (as we did for eg Block.astype and others). However, that doesn't seem so easy in this case .. So not yet sure how to best proceed here.

@jorisvandenbossche jorisvandenbossche added Refactor Internal refactoring of code Internals Related to non-user accessible pandas implementation labels Apr 22, 2021
@@ -964,3 +974,107 @@ def _rolling_window(a: np.ndarray, window: int):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)


def _can_hold_element(values, element: Any) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a version of this that makes the existing can_hold_element more accurate instead of implementing another function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not fully sure what I was thinking in April / if that was already the case, but so right now the existing can_hold_element actually fully covers what I needed here.

It only doesn't do extract_array compared to Block._can_hold_element, so kept that here for now. But probably that could be moved to can_hold_element as well (will check later).

# We support filling a DatetimeTZ with a `value` whose timezone
# is different by coercing to object.
# TODO: don't special-case td64
values = values.astype(object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK if you want to keep this copy/pasted version, but i think using coerce_to_target_dtype would be more Technically Correct here

values, value, limit=limit, inplace=True, downcast=downcast
)

values = values if inplace else values.copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think EA.fillna is never inplace, so this may be making an unnecessary copy (or not being inplace even when the user specifies it) (applicable to status quo too)

@jbrockmendel
Copy link
Member

However, that doesn't seem so easy in this case .. So not yet sure how to best proceed here.

Some exposition on ideas I've been looking at for this problem (for Block methods in general, not specific to fillna)

  • Instead of defining fillna_array (or more generally foo_array) to return a single array, we could define it to return a list of arrays.

    • For AM this would always return a length-1 list.
    • For BM it would return
      • a) a length-1 list if there was no splitting,
      • b) a length-N list of 1-column arrays of it fully split (equiv split_and_operate or maybe_split), or
      • c) a length-2 list for where_array.
        This where_array case is annoying, could be made nicer if we changed Block.where to fully split instead of split in two pieces. This would entail slightly more aggressive downcasting (breaks 1 test, though tentatively looks like this is surfacing an unrelated bug).
    • Block.convert is low hanging fruit for this approach.
  • 2/3 uses of @maybe_split are on ObjectBlock methods. If we could get _downcast_2d to only split for object dtype (tried it; a few tests break bc they downcast floats less aggressively), it would become easier to share some of functions (though exactly how escapes me ATM, i just remember trying this a couple weeks ago with this being the motivation)

  • DEPR: Deprecate downcast keyword for fillna #40988 might make things marginally easier (though if theres a deprecation cycle, it will take too long to really be useful for this)

  • The complexity reduction of ArrayManager mostly comes from getting rid of BlockManager, not from getting rid of Block. We could just learn to live with dispatching to blocks for some things.

    • One disadvantage of the recent push to share methods across Block subclasses is that we end up repeating dtype checks. I doubt it makes a big difference, but haven't measured.

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label May 23, 2021
@simonjayhawkins
Copy link
Member

can you address @jbrockmendel comments

@jorisvandenbossche
Copy link
Member Author

Started looking at this again, for now only simplified the can_hold_element usage. Still need to look into / respond to the other comments.

@jorisvandenbossche
Copy link
Member Author

  • Instead of defining fillna_array (or more generally foo_array) to return a single array, we could define it to return a list of arrays.

I now shared the implementation for the ExtensionBlock (subclasses), which was essentially just the datetime64 special casing for allowing object fallback (but good to have this in one place). But for sharing the generic Block implementation something like the above would indeed be needed. This splitting into multiple arrays is very much block specific, though, and personally I would prefer to keep that logic on the Blocks itself.

return values.fillna(value, limit=limit)


def fillna_array(values, value, limit=None, inplace: bool = False, downcast=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be used for Block.fillna too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't directly see an easy way to do that. The Block.fillna additionally has the "split" logic, which I personally don't want to put in here. And it's not directly straightforward to split a part of Block.fillna.

I could for example let this function return some sentinel like NotImplemented or None to signal to Block.fillna that the filling hasn't been done yet, and it should split the block and call fillna again. In which case most of the splitting logic stays in the internals. I could try something like this if you want, but it's also a bit of a hack.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Block.fillna additionally has the "split" logic, which I personally don't want to put in here

100% agree

I've been looking at simplifying the downcasting in a way that would make the splitting much easier to separate the splitting from non-splitting (not just for fillna) (xref #44241, #40988). It would be an API change so wouldn't be implement in time to simplify this (and your putmask PR), but would eventually allow for some cleanup.

ignored), except for datetime64 in which case fallback to object dtype
is currently allowed.
"""
if not _can_hold_element(values, value) and values.dtype.kind == "M":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this would change behavior for e.g. Period or Interval?

(which would make us more consistent across dtypes, which i would like to see, just needs to be intentional)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we don't have tests for it in any case, since the tests are passing.

Now, checking for period, it seems to raise for both of array and Series:

In [5]: arr = pd.period_range("2012", periods=3).array

In [6]: arr[0] = None

In [8]: arr.fillna("?")
...
TypeError: value should be a 'Period', 'NaT', or array of those. Got 'str' instead.

In [9]: pd.Series(arr).fillna("?")
...
TypeError: value should be a 'Period', 'NaT', or array of those. Got 'str' instead.

while indeed for DatetimeArray, we coerce to object for Series:

In [10]: arr = pd.date_range("2012", periods=3).array

In [11]: arr[0] = None

In [12]: arr.fillna("?")
..
TypeError: value should be a 'Timestamp', 'NaT', or array of those. Got 'str' instead.

In [13]: pd.Series(arr).fillna("?")
Out[13]: 
0                      ?
1    2012-01-02 00:00:00
2    2012-01-03 00:00:00
dtype: object

return can_hold_element(values, element)


def fillna_ea_array(values, value, limit=None, inplace: bool = False, downcast=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values: ExtensionArray?

@jbrockmendel
Copy link
Member

I increasingly think that this type of change should be left until right before BlockManager is ripped out. Until then, this just duplicates code and makes it easy for behavior to accidentally drift apart (xref #40988, #45153)

@jreback
Copy link
Contributor

jreback commented Jan 16, 2022

@jorisvandenbossche if you can rebase

@simonjayhawkins
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ArrayManager Internals Related to non-user accessible pandas implementation Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants