IntegerNA Support for DataFrame.diff() #24171
Labels
ExtensionArray
Extending pandas with custom dtypes or arrays.
good first issue
Needs Tests
Unit test(s) needed to prevent regressions
Reshaping
Concat, Merge/Join, Stack/Unstack, Explode
This isn't a pandas bug per se. You simply can't store NA values mixed with integers in NumPy land. We are working on allowing that with the IntegerNA work being done, though it doesn't look like this is supported at the moment:
In [34]: df = pd.DataFrame({'a': np.arange(1,6),
    ...:                    'b': np.arange(1,6)+2,
    ...:                    'c': np.arange(1,6)**2}, dtype='Int64')
In [35]: df.iloc[1,1] = np.nan
In [36]: df.diff()
TypeError: data type not understood
I've repurposed the title to reflect that.
WillAyd
added
Algos
Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff
ExtensionArray
Extending pandas with custom dtypes or arrays.
labels
Dec 11, 2018
WillAyd
changed the title
pandas.DataFrame.diff(axis=1) does not work when columns are a mix of int/float
IntegerNA Support for DataFrame.diff()
Dec 11, 2018
Looks to be working on master now. Could use a test.
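A regression test for this could look roughly like the sketch below. It reuses the values from the session quoted earlier in the thread; building the frame with `astype("Int64")` and marking the missing cell with `pd.NA` are assumptions made for portability, not the form the original report used, and the variable names are illustrative only.

```python
import pandas as pd

# Values match the earlier session: a = 1..5, b = a + 2, c = a ** 2.
# Assumption: convert with astype("Int64") instead of the dtype= constructor argument.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [3, 4, 5, 6, 7],
    "c": [1, 4, 9, 16, 25],
}).astype("Int64")

# Put a missing value in row 1 of column b; the nullable dtype keeps the column integer.
df.iloc[1, 1] = pd.NA

# Row-wise difference (default axis=0, periods=1).
result = df.diff()
print(result)
```

On a pandas version where the fix is in place, the first row and the rows touching the missing cell should come back as missing, and every other entry should be the plain integer difference from the row above.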
mroeschke
added
good first issue
Needs Tests
Unit test(s) needed to prevent regressions
and removed
Algos
Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff
ExtensionArray
Extending pandas with custom dtypes or arrays.
labels
May 13, 2020
Take
Marvzinc
added a commit
to Marvzinc/pandas
that referenced
this issue
Jun 20, 2020
jreback
added
ExtensionArray
Extending pandas with custom dtypes or arrays.
Reshaping
Concat, Merge/Join, Stack/Unstack, Explode
labels
Jun 20, 2020
Code Sample, a copy-pastable example if possible
Problem description
If, for example, we set (2,b) to NA in a dataframe of integers, the entire column b becomes float. When we then call df.diff(axis=1), column b-a is all NaN because float - int gives NaN.
In addition, c-b is never computed; c-a is computed instead, because only columns c and a are integers.
Both results are unexpected: in regular Python, float - int is a float, and reporting c-a in column c instead of c-b conflicts with the definition of diff with periods=1.
The workaround is to cast the entire dataframe to float:
df.astype(float).diff(axis=1)
This works. I think this should be fixed in pandas, or at least a warning should be given.
Here it is step by step. A simple dataframe of integers:
If one value in column b is set to NA, then the whole column is set to float:
When we attempt df.diff(axis=1), column b-a gives all NaN. Moreover, c-a is computed instead of c-b: the difference is taken only between columns of the same dtype. For example, in row 4, 25 - 5 = 20, which means c-a was done instead of c-b:
If the whole dataframe is cast to float, then diff works as expected:
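The original code sample and its output did not survive in this copy of the report. As a minimal sketch of the setup and the float-cast workaround, the frame below assumes the same values as the maintainer's session (a = 1..5, b = a + 2, c = a squared) and places the NaN in row 1 of column b; the exact cell used in the report is not known, so that position is an assumption. The sketch only shows the workaround and makes no claim about what an uncast `df.diff(axis=1)` returns on any particular pandas version.

```python
import numpy as np
import pandas as pd

# Assumed data: a and c stay integer, the NaN makes column b float64.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [3, np.nan, 5, 6, 7],
    "c": [1, 4, 9, 16, 25],
})
print(df.dtypes)  # shows the mixed int/float column dtypes

# Workaround from the report: cast everything to float, then diff across columns.
workaround = df.astype(float).diff(axis=1)
print(workaround)
```

After the cast, column a is all NaN (it has no left neighbour), column b holds b - a, and column c holds c - b, so row 4 gives 25 - 7 = 18 rather than the 20 described above.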
pandas version 0.23.4. This issue appears in prior versions as well.