Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame.diff(axis=1) with mixed (or EA) dtypes #32995

Merged
merged 12 commits into from Apr 10, 2020

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel jbrockmendel commented Mar 25, 2020

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can just raise NotImplemented here. I suspect a mix of floats and bool will be somewhat common.

Can we instead skip doing things blockwise for axis=1 and do it columnwise instead? Something like

In [39]: df = pd.DataFrame({'a': [1,1,2,2],'b': [1, 2, 3, 4.], 'c': [True, False, False, True]})

In [40]: import toolz

In [41]: pd.concat([a - b for a, b in toolz.sliding_window(2, (df.iloc[:, i] for i in range(len(df.columns))))], axis=1)
Out[41]:
     0    1
0  0.0  0.0
1 -1.0  2.0
2 -1.0  3.0
3 -2.0  3.0

(without using toolz, and including the all-NA columns and fixing the column names). We could perhaps only do that when nblocks > 1, to preserve the performance in the homogenous case.

@jbrockmendel
Copy link
Member Author

Can we instead skip doing things blockwise for axis=1 and do it columnwise instead?

That would also allow us to avoid consolidating, which would be nice.

@jbrockmendel
Copy link
Member Author

updated to operate column-wise.

DatetimeTZBlock.diff could have its NotImplementedError case removed in a follow-up

@@ -6667,6 +6667,30 @@ def diff(self, periods: int = 1, axis: Axis = 0) -> "DataFrame":
5 NaN NaN NaN
"""
bm_axis = self._get_block_manager_axis(axis)
self._consolidate_inplace()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not pretty. why are we not simply transposing and calling .diff()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we not simply transposing and calling .diff()?

I'd be fine with that, but it is potentially costly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally think the column-wise approach could be fine

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be co-located with BlockManager.diff. but to be honest I think transposing is just fine here. The block type is already inferred and handled. This is just adding a lot of complexity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you really want to do this column based approach them move this with the other .diff methods (I actually prefer just a transpose here, its totally fine for now)

@jbrockmendel
Copy link
Member Author

Do we have anything approaching consensus here? I'm OK with either column-wise or transpose

@TomAugspurger
Copy link
Contributor

I'm pretty strongly in favor of columnwise when there's more than 1 block. While I don't think sparse support should ever be a primary motivator, the difference in performance of a columnwise diff vs. a .T.diff().T for something like

In [6]: a = pd.arrays.SparseArray([1] + [0] * 1000)

In [7]: df = pd.DataFrame({"A": a, "B": a})

will be huge.

@@ -6667,6 +6667,30 @@ def diff(self, periods: int = 1, axis: Axis = 0) -> "DataFrame":
5 NaN NaN NaN
"""
bm_axis = self._get_block_manager_axis(axis)
self._consolidate_inplace()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you really want to do this column based approach them move this with the other .diff methods (I actually prefer just a transpose here, its totally fine for now)

@jbrockmendel
Copy link
Member Author

@jreback is column-wise vs transpose a deal breaker? if not, i think this is ready

@jreback
Copy link
Contributor

jreback commented Apr 10, 2020

if you really want to do columnwise can you move impl to another location as indicated

the code is pretty complex and should be separated from inline in the function

@jreback jreback added this to the 1.1 milestone Apr 10, 2020
@jreback jreback added the Numeric Operations Arithmetic, Comparison, and Logical operations label Apr 10, 2020
@jreback jreback merged commit a142ad7 into pandas-dev:master Apr 10, 2020
@jreback
Copy link
Contributor

jreback commented Apr 10, 2020

thanks, IIRC there might be some issues this closes if you'd have a look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants