PDEP-11: Change default of dropna to False #53094

rhshadrach · 2023-05-05T02:28:47Z

cc @pandas-dev/pandas-core

jbrockmendel · 2023-05-05T17:12:49Z

Will need to give this some thought. The "even further" possibility is to deprecate the argument entirely and tell users to do obj.whatever().dropna() to get the old behavior

web/pandas/pdeps/0011-dropna-default.md

attack68 · 2023-05-05T17:16:47Z

Will need to give this some thought. The "even further" possibility is to deprecate the argument entirely and tell users to do obj.whatever().dropna() to get the old behavior

Wouldn't this not work for certain functions. If you sum over values with na and then drop you get nothing, whilst if the sum function knows to drop you will return a value. This would be important in the groupby.

rhshadrach · 2023-05-06T12:10:27Z

@attack68 (#53094 (comment)): Consider also obj.dropna(...).whatever()

The "even further" possibility is to deprecate the argument entirely and tell users to do obj.whatever().dropna() to get the old behavior

@jbrockmendel - I also think further deprecation might be desirable for some of these methods. However they would then act as dropna=False, and deprecating an argument and changing the behavior to the non-default option is not a great user experience. So I think we should first change the default and then deprecate if desired. But even if we don't deprecate the argument, I would like to see the default change. So these are independent.

attack68 · 2023-05-06T14:08:18Z

My point being that this doesn't work:

If you want to get the answer 3, you can currently do df.sum(skipna=True).sum(), where the dropna is within the method. Or, in this trivial case you could wrangle it, df.unstack().dropna().sum(), but that might not work generally?

phofl · 2023-05-06T14:45:37Z

This should take the skipna keyword into consideration, right now we are ignoring NAs consistently, this PDEP would make things kind of inconsistent (not that this is a bad thing, but it should at least get mentioned)

rhshadrach · 2023-05-07T13:05:14Z

@attack68 - skipna is not part of the PDEP, I'll make this more clear in an update. And I don't think anyone is thinking it should be deprecated in methods like sum.

Also, deprecating any dropna is not part of the PDEP.

attack68 · 2023-05-07T13:19:31Z

I am generally in support of this PDEP. I think NA information is valuable in many processes and should not be dropped by default.
I understand skipna is not part of this PDEP and nothing is being deprecated, but it is worth mentioning as @phofl says.

I've just raised the points to try to discover if there are any processes / calculation chains that might be made more difficult by the change, including any user defined functions and/or in groupbys.

Dr-Irv · 2023-05-09T14:01:02Z

web/pandas/pdeps/0011-dropna-default.md

+## Timeline
+
+If accepted, the current `dropna` default would be deprecated as part of pandas
+2.x and this deprecation would be enforced in pandas 3.0.


How would users find out about this deprecation? I'm concerned it will create noisy messages. For example, if you were to do df[["a", "b"]].groupby("a").sum(), would you always get a deprecation message? Would you only get a message if the result would change because the column "a" had NA values?

So can you be more specific about how the deprecation would work?

Added. A warning would only be emitted when dropna is unspecified and an NA value is encountered.

rhshadrach · 2023-05-09T21:00:33Z

Thanks @jbrockmendel @attack68 @mroeschke @phofl @Dr-Irv - I've updated the PDEP based on your feedback.

topper-123 · 2023-05-11T15:56:14Z

I'm +1 on this change.

Do we have a rough timeline for Pandas 3.0? I'm a bit worried about noise from this when working interactively, could we consider implementing the deprecation late in the v2.x release cycle rather than early? Or is the 2.x cycle so short that timing this will not matter?

mroeschke · 2023-05-11T16:13:36Z

I am +1 on this proposal as well.

Do we have a rough timeline for Pandas 3.0?

We would like to release 3.0 in 1 year from 2.0, so April 3rd 2024 ideally. 2.1 should be released in August 2.1 where this deprecation could be introduced

rhshadrach · 2023-05-27T13:06:12Z

@mroeschke: Is there then not going to be a pandas 2.2?

@topper-123: I'm happy to wait as long as possible in 2.x. The deprecation will be straightforward.

MarcoGorelli

very much in favour, I've done value_counts in the past and thought "oh, no missing values, all good", only to later be bitten by nonsense forecasts

+1 (sorry it's taken me some time to get round to this)

jorisvandenbossche · 2023-05-30T07:49:18Z

Thanks for the PDEP!

I am +1 on the proposal for the groupby-related methods (the variations of groupby and value_counts).

I am less sure that we should also do this for other methods, or at least I think those merit some specific case-by-case discussion (AFAIK the linked previous discussions were all about the groupby/value_counts-related cases?)

mode() is a tricky one, because on the one hand it is an aggregation (and all our other aggregations skip NAs by default, although the keyword name here is dropna and not skipna), while on the other hand it is very much related to "counting values" (and for value_counts I think it makes sense to change the default to dropna=False)
- Some related discussion about this at [C++][Python] Allow pyarrow.compute.mode to include null count apache/arrow#34173
- There wasn't much discussion about the keyword name (skipna vs dropna) when this was introduced (#20779), but from the issue it seems it was modeled after value_counts
- It might be worth checking some other data tools what their behaviour is for "mode" with nulls.
nunique() is a bit similar as mode(), since it is also an aggregation function, but also related to counts. But for example, SQL COUNT(DISTINCT col) also skips nulls.
pivot_table(): it seems that the dropna keyword is used for multiple things? It is passed through to the underlying groupby call (and here it of course makes sense to follow the change in default), but it is also used to post-process the result and to calculate the margins (the docstring says: "If True, rows with a NaN value in any column will be omitted before computing margins."). Have we evaluated what the impact of the change is here?
For stack() I was going to say that I thought this was only about dropping NAs that were introduced because of the reshaping operations (and not NAs that already existed before), but it seems to be a bit more complicated than just that. The last example in the docstring illustrates the effect of dropna, and it's also dropping at least one pre-existing NA value. In that case, it seems this is a good change to preserve those original NAs in the output.
But if you are stacking multiple levels at once, you can write an example where the NAs that you get with dropna=False are actually new NAs, and for this case I am not sure this is necessarily the best behaviour:
```
>>> df = pd.DataFrame(np.arange(6).reshape(2,3), columns=pd.MultiIndex.from_tuples([('A','x'), ('A','y'), ('B','z')], names=['Upper', 'Lower']))
>>> df
Upper  A     B
Lower  x  y  z
0      0  1  2
1      3  4  5
>>> df.stack(level=[0,1])
   Upper  Lower
0  A      x        0.0
          y        1.0
   B      z        2.0
1  A      x        3.0
          y        4.0
   B      z        5.0
dtype: float64

>>> df.stack(level=[0,1], dropna=False)
   Upper  Lower
0  A      x        0.0
          y        1.0
          z        NaN
   B      x        NaN
          y        NaN
          z        2.0
1  A      x        3.0
          y        4.0
          z        NaN
   B      x        NaN
          y        NaN
          z        5.0
dtype: float64
```
- There has been some discussion on this topic several years ago, see API/ENH: df.stack(level=[0,1], dropna=False) behavior #8851, ENH/API: DataFrame.stack() support for level=None, sequentially=True/False, and NaN level values. #9023 (comment) (and comments below that)
- I think I would be in favor of changing dropna=True to dropna=False, if this would only about the pre-existing NAs, and not about not dropping the NAs introduced by product-wise combinations of MultiIndex levels (I would say that even dropna=True should only drop those introduced NAs, and not the pre-existing ones, in which case we could keep the current default, but effectively get the desired default behaviour)

mroeschke · 2023-05-30T18:04:35Z

@mroeschke: Is there then not going to be a pandas 2.2?

I think there should be a 2.2 release in December with the every-4-months for a minor release

mroeschke

+1 on this proposal

lithomas1 · 2023-05-30T19:42:39Z

web/pandas/pdeps/0011-dropna-default.md

+SeriesGroupBy.value_counts
+DataFrameGroupBy.nunique
+DataFrameGroupBy.value_counts
+```


I think you might be missing a couple functions here.
This is the complete list (made using the keyword inspector script I posted on slack a while back).

<class 'pandas.core.arrays.categorical.Categorical'>.value_counts <class 'pandas.core.indexes.category.CategoricalIndex'>.nunique <class 'pandas.core.indexes.category.CategoricalIndex'>.value_counts <class 'pandas.core.frame.DataFrame'>.groupby <class 'pandas.core.frame.DataFrame'>.mode <class 'pandas.core.frame.DataFrame'>.nunique <class 'pandas.core.frame.DataFrame'>.pivot_table <class 'pandas.core.frame.DataFrame'>.stack <class 'pandas.core.frame.DataFrame'>.to_hdf <class 'pandas.core.frame.DataFrame'>.value_counts <class 'pandas.core.indexes.datetimes.DatetimeIndex'>.nunique <class 'pandas.core.indexes.datetimes.DatetimeIndex'>.value_counts <class 'pandas.io.pytables.HDFStore'>.append <class 'pandas.io.pytables.HDFStore'>.append_to_multiple <class 'pandas.io.pytables.HDFStore'>.put <class 'pandas.core.indexes.base.Index'>.nunique <class 'pandas.core.indexes.base.Index'>.value_counts <class 'pandas.core.indexes.interval.IntervalIndex'>.nunique <class 'pandas.core.indexes.interval.IntervalIndex'>.value_counts <class 'pandas.core.indexes.multi.MultiIndex'>.nunique <class 'pandas.core.indexes.multi.MultiIndex'>.value_counts <class 'pandas.core.indexes.period.PeriodIndex'>.nunique <class 'pandas.core.indexes.period.PeriodIndex'>.value_counts <class 'pandas.core.indexes.range.RangeIndex'>.nunique <class 'pandas.core.indexes.range.RangeIndex'>.value_counts <class 'pandas.core.series.Series'>.groupby <class 'pandas.core.series.Series'>.mode <class 'pandas.core.series.Series'>.nunique <class 'pandas.core.series.Series'>.to_hdf <class 'pandas.core.series.Series'>.value_counts <class 'pandas.core.indexes.timedeltas.TimedeltaIndex'>.nunique <class 'pandas.core.indexes.timedeltas.TimedeltaIndex'>.value_counts crosstab lreshape pivot_table value_counts

I think the missing ones are
crosstab, lreshape, HDFStore.put|append|append_to_multiple.

jorisvandenbossche

To make it explicit with all the people starting to approve: I am +1 on changing it for groupby/value_counts, but -1 on the proposal as it is currently written.
I think all the other group of methods (other than groupby/value_counts) need a separate motivation (and maybe discussion), see my comment at #53094 (comment)

rhshadrach · 2023-06-06T10:56:10Z

Thanks @jorisvandenbossche - for now I'd like to move discussions on the behavior of dropna for stack (#53515) and pivot_table (#53521) to those issues as I think they will require more lengthy discussion. If at the end they do not sufficiently address your concerns, we can bring them back here.

I am less sure that we should also do this for other methods, or at least I think those merit some specific case-by-case discussion

Emphasis mine. Certainly no objection to such discussions, but I do think there is high value in having a consistent default for dropna across pandas. It leads to less surprises and is less taxing on our users to have a consistent default.

mode() is a tricky one, because on the one hand it is an aggregation... It might be worth checking some other data tools what their behaviour is for "mode" with nulls.

I agree this could either have dropna defaulting to False or skipna defaulting to True. I'm okay either way here and can see arguments for both. At the end of the day I think we just have to make a call as there doesn't seem to be a standard. I'd lean toward changing to skipna so as not to change default behavior but have consistency in default arguments. Other languages/libraries:

Those that do not implement mode: R, SQL, GSL (C/C++)
Those that implement mode and include nulls by default: polars, SciPy
Those that implement mode and skip nulls: pyarrow, pyspark, scikit-learn

nunique() is a bit similar as mode(), since it is also an aggregation function, but also related to counts. But for example, SQL COUNT(DISTINCT col) also skips nulls.

pandas impelments count and this skips NA values (with no option to not skip). That is the method I think should be compared to SQL COUNT, not nunique. In addition, I think it is important that ser.nunique() == len(ser.unique()) and that is currently not the case.

Thanks @lithomas1! I indeed missed checking the top namespace - will add them to the PDEP. Some thoughts on each:

crosstab: The dropna argument is only used via passing to pivot_table, so I think we can combine this with the discussion on pivot_table (DEPR: Some dropna behaviors in DataFrame.pivot_table #53521).
lreshape: The docs here on dropna are wrong; this is equivalent to calling DataFrame.dropna(how='any') prior to the call to lreshape. I think we can deprecate dropna here outright.
HDFStore.put|append|append_to_multiple: These already default to dropna=False.

github-actions · 2023-07-07T00:06:23Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

github-actions · 2023-08-07T00:06:06Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

simonjayhawkins · 2024-02-06T17:39:10Z

@rhshadrach what's the status here? was this announced to solicit feedback from the community?

rhshadrach · 2024-02-07T03:13:27Z

Thanks for the ping here @simonjayhawkins - I've been meaning to post an update. I put the brakes on this from the feedback in #53094 (comment); namely the behavior of dropna in pivot_table and crosstab. I tried to summarize these issues in #53521.

In addition, I think we should decide to have mode with the default dropna=False or skipna=True. I've argued for the latter, but would be happy with either option. Finally, some opposition was voiced against nunique defaulting to dropna=False; I've argued against this in #53094 (comment) but have not heard if others find my arguments compelling.

In my mind, there are two ways forward:

change the default to dropna in all methods except for pivot_table and crosstab and clean those methods up in the future; or
wait to change the default of dropna until pivot_table and crosstab are fixed.

I am okay with either way, but would lean toward the first.

cc @jorisvandenbossche

PDEP-11: Change default of dropna to False

7c5f8c7

rhshadrach added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Deprecate Functionality to remove in pandas PDEP pandas enhancement proposal labels May 5, 2023

PR #, fixups

e45bfeb

mroeschke reviewed May 5, 2023

View reviewed changes

web/pandas/pdeps/0011-dropna-default.md Show resolved Hide resolved

Dr-Irv requested changes May 9, 2023

View reviewed changes

rhshadrach added 2 commits May 9, 2023 16:46

Merge branch 'main' of https://github.com/pandas-dev/pandas into pdep11

6984c7e

Update from feedback

e1194a5

MarcoGorelli approved these changes May 30, 2023

View reviewed changes

mroeschke approved these changes May 30, 2023

View reviewed changes

WillAyd approved these changes May 30, 2023

View reviewed changes

lithomas1 reviewed May 30, 2023

View reviewed changes

attack68 approved these changes May 30, 2023

View reviewed changes

jorisvandenbossche requested changes May 30, 2023

View reviewed changes

rhshadrach mentioned this pull request Jun 3, 2023

DEPR/BUG: DataFrame.stack including all null rows when stacking multiple levels #53515

Closed

github-actions bot added the Stale label Jul 7, 2023

rhshadrach removed the Stale label Jul 7, 2023

github-actions bot added the Stale label Aug 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PDEP-11: Change default of dropna to False #53094

PDEP-11: Change default of dropna to False #53094

rhshadrach commented May 5, 2023

jbrockmendel commented May 5, 2023

attack68 commented May 5, 2023

rhshadrach commented May 6, 2023 •

edited

attack68 commented May 6, 2023 •

edited

phofl commented May 6, 2023

rhshadrach commented May 7, 2023

attack68 commented May 7, 2023

Dr-Irv May 9, 2023

rhshadrach May 9, 2023

rhshadrach commented May 9, 2023

topper-123 commented May 11, 2023

mroeschke commented May 11, 2023

rhshadrach commented May 27, 2023

MarcoGorelli left a comment

jorisvandenbossche commented May 30, 2023 •

edited

mroeschke commented May 30, 2023

mroeschke left a comment

lithomas1 May 30, 2023

jorisvandenbossche left a comment •

edited

rhshadrach commented Jun 6, 2023 •

edited

github-actions bot commented Jul 7, 2023

github-actions bot commented Aug 7, 2023

simonjayhawkins commented Feb 6, 2024

rhshadrach commented Feb 7, 2024

PDEP-11: Change default of dropna to False #53094

Are you sure you want to change the base?

PDEP-11: Change default of dropna to False #53094

Conversation

rhshadrach commented May 5, 2023

jbrockmendel commented May 5, 2023

attack68 commented May 5, 2023

rhshadrach commented May 6, 2023 • edited

attack68 commented May 6, 2023 • edited

phofl commented May 6, 2023

rhshadrach commented May 7, 2023

attack68 commented May 7, 2023

Dr-Irv May 9, 2023

Choose a reason for hiding this comment

rhshadrach May 9, 2023

Choose a reason for hiding this comment

rhshadrach commented May 9, 2023

topper-123 commented May 11, 2023

mroeschke commented May 11, 2023

rhshadrach commented May 27, 2023

MarcoGorelli left a comment

Choose a reason for hiding this comment

jorisvandenbossche commented May 30, 2023 • edited

mroeschke commented May 30, 2023

mroeschke left a comment

Choose a reason for hiding this comment

lithomas1 May 30, 2023

Choose a reason for hiding this comment

jorisvandenbossche left a comment • edited

Choose a reason for hiding this comment

rhshadrach commented Jun 6, 2023 • edited

github-actions bot commented Jul 7, 2023

github-actions bot commented Aug 7, 2023

simonjayhawkins commented Feb 6, 2024

rhshadrach commented Feb 7, 2024

rhshadrach commented May 6, 2023 •

edited

attack68 commented May 6, 2023 •

edited

jorisvandenbossche commented May 30, 2023 •

edited

jorisvandenbossche left a comment •

edited

rhshadrach commented Jun 6, 2023 •

edited