IntegerNA Support for DataFrame.diff() #24171

Closed
ededovic opened this issue Dec 9, 2018 · 3 comments · Fixed by #34889
Labels: ExtensionArray, good first issue, Needs Tests, Reshaping

@ededovic
ededovic commented Dec 9, 2018

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

# Three integer columns
df = pd.DataFrame({'a': np.arange(1,6),
                   'b': np.arange(1,6)+2,
                   'c': np.arange(1,6)**2})
print(f"all columns are int: \n{df}")
# Setting a single NaN upcasts column b to float
df.loc[2,'b'] = np.nan
print(f"if one value in column b is set to NA, then the whole column is set to float:\n{df}")
print(f'column b is float, and column a is int, then there is no result for anything for column b-a:\n{df.diff(axis=1)}')
# Workaround: cast every column to float before diffing
print(f'if the whole dataframe is cast to float, then diff works:\n{df.astype(float).diff(axis=1)}')

Problem description

If, for example, we set (2, b) to NA in a dataframe of integers, the entire column b becomes float. When we then call df.diff(axis=1), the b-a column is all NaN, seemingly because float - int gives NaN.
In addition, c-b is never computed; c-a is computed instead, since only columns c and a are integers.
Both results are unexpected: in regular Python, float - int is a float, and returning c-a instead of c-b in column c conflicts with the definition of diff with periods=1.

The workaround is to cast the entire dataframe to float, df.astype(float).diff(axis=1), which works. I think this should be fixed in pandas, or at least a warning should be given.

Here it is step by step. A simple dataframe of integers:

   a  b   c
0  1  3   1
1  2  4   4
2  3  5   9
3  4  6  16
4  5  7  25

If one value in column b is set to NA, then the whole column is set to float:

   a    b   c
0  1  3.0   1
1  2  4.0   4
2  3  NaN   9
3  4  6.0  16
4  5  7.0  25

When we attempt df.diff(axis=1), column b-a is all NaN. Moreover, c-a is computed instead of c-b: the difference is only taken between columns of the same type. For example, in row 4, 25 - 5 = 20, which means c-a was computed instead of c-b:

    a   b     c
0 NaN NaN   0.0
1 NaN NaN   2.0
2 NaN NaN   6.0
3 NaN NaN  12.0
4 NaN NaN  20.0

If the whole dataframe is cast to float, then diff works as expected:

    a    b     c
0 NaN  2.0  -2.0
1 NaN  2.0   0.0
2 NaN  NaN   NaN
3 NaN  2.0  10.0
4 NaN  2.0  18.0
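
As a cross-check of the expected result above, the column-wise difference with periods=1 can be written out by hand as "each column minus the column to its left". The following is only a sketch under assumptions: column b is built as float from the start (rather than assigning NaN into an integer column), and the variable names `as_float`, `expected` and `workaround` are illustrative.

```python
import numpy as np
import pandas as pd

# Same data as in the report, with column b already float because of the NaN.
df = pd.DataFrame({
    "a": np.arange(1, 6),
    "b": [3.0, 4.0, np.nan, 6.0, 7.0],
    "c": np.arange(1, 6) ** 2,
})

# Cast everything to float so all columns share one dtype.
as_float = df.astype(float)

# Definition of diff(axis=1) with periods=1: subtract the column to the left.
# shift(1, axis=1) moves every column one position to the right.
expected = as_float - as_float.shift(1, axis=1)

# The workaround from the report, for comparison.
workaround = as_float.diff(axis=1)

print(expected)
print(expected.equals(workaround))
```

On a frame that is float throughout, the hand-written subtraction and diff(axis=1) should agree, with column a all NaN, column b equal to b-a, and column c equal to c-b.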

pandas version 0.23.4. This issue appears in prior versions as well.

@WillAyd
Member

WillAyd commented Dec 11, 2018

This isn't a pandas bug per se. You simply can't store NA values mixed with integers in NumPy land.

We are working on allowing that with the IntegerNA work being done, though it doesn't look like this is supported at the moment:

In [34]: df = pd.DataFrame({'a': np.arange(1,6), 
    ...:                    'b': np.arange(1,6)+2, 
    ...:                    'c': np.arange(1,6)**2}, dtype='Int64') 
In [35]: df.iloc[1,1] = np.nan 
In [36]: df.diff() 

TypeError: data type not understood

I've repurposed the title to reflect that
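
To illustrate the storage limitation described in this comment, here is a small sketch contrasting a NumPy-backed Series with one using the nullable `Int64` extension dtype. The variable names are illustrative, and the example only shows how missing values are stored, not how diff behaves.

```python
import numpy as np
import pandas as pd

# NumPy-backed: integers mixed with NaN are stored as float64.
numpy_backed = pd.Series([1, 2, np.nan])
print(numpy_backed.dtype)

# Nullable integer extension dtype: the missing value is kept as <NA>
# and the remaining values stay integers.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)
print(nullable.isna().tolist())
```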

WillAyd added the Algos and ExtensionArray labels on Dec 11, 2018
WillAyd changed the title from "pandas.DataFrame.diff(axis=1) does not work when columns are a mix of int/float" to "IntegerNA Support for DataFrame.diff()" on Dec 11, 2018
@mroeschke
Member

Looks to be working on master now. Could use a test

In [35]: pd.__version__
Out[35]: '1.1.0.dev0+1558.g4f698ec16'

In [36]: df = pd.DataFrame({'a': np.arange(1,6),
    ...:                    'b': np.arange(1,6)+2,
    ...:                    'c': np.arange(1,6)**2}, dtype='Int64')
    ...: df.iloc[1,1] = np.nan
    ...: df.diff()
Out[36]:
      a     b     c
0  <NA>  <NA>  <NA>
1     1  <NA>     3
2     1  <NA>     5
3     1     1     7
4     1     1     9
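
A regression test along the lines requested here might look like the sketch below. It is not the test that was eventually merged; the function name `test_diff_keeps_int64_with_missing_values` is hypothetical, and the expected values are taken from the output shown above. It assumes a pandas version in which diff supports `Int64` columns (1.1 or later, per this thread).

```python
import numpy as np
import pandas as pd


def test_diff_keeps_int64_with_missing_values():
    # Hypothetical test name; mirrors the session in the comment above.
    df = pd.DataFrame(
        {
            "a": np.arange(1, 6),
            "b": np.arange(1, 6) + 2,
            "c": np.arange(1, 6) ** 2,
        },
        dtype="Int64",
    )
    # Assigning NaN into an Int64 column stores it as <NA>.
    df.iloc[1, 1] = np.nan

    result = df.diff()

    # Row-wise differences; any difference involving <NA> is <NA>.
    expected = pd.DataFrame(
        {
            "a": [pd.NA, 1, 1, 1, 1],
            "b": [pd.NA, pd.NA, pd.NA, 1, 1],
            "c": [pd.NA, 3, 5, 7, 9],
        },
        dtype="Int64",
    )
    pd.testing.assert_frame_equal(result, expected)
```

The function can be collected by pytest or called directly; it raises an AssertionError if the values or dtypes of the result differ from the expected frame.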

mroeschke added the good first issue and Needs Tests labels and removed the Algos and ExtensionArray labels on May 13, 2020
@Marvzinc
Contributor

Take

Marvzinc added a commit to Marvzinc/pandas that referenced this issue Jun 20, 2020
jreback added the ExtensionArray and Reshaping labels on Jun 20, 2020
jreback added this to the 1.1 milestone on Jun 20, 2020