Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR/DOC: pd.concat(axis=1) should special case DatetimeIndex to sort even when sort=False #57006

Closed
1 of 3 tasks
torext opened this issue Jan 22, 2024 · 4 comments · Fixed by #57250
Closed
1 of 3 tasks
Labels
Datetime Datetime data dtype Index Related to the Index class or subclasses Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@torext
Copy link
Contributor

torext commented Jan 22, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

For a very long time now it used to be the default behaviour that calling pd.concat(..., axis=1) on DataFrames with DateTimeIndex would result in a DataFrame with sorted index. This is very reasonable default behaviour and pretty much what one wants in this case the vast majority of times.

With #54769 this was considered a bug simply because it wasn't documented correctly, and now the default behaviour, without any warning, was suddenly changed. This means likely code breakages across many users' libraries that were written based on the (up until now true) assumption that concatenating timeseries-like DataFrames/Series would give another timeseries-like object, i.e. something with a clearly sorted DateTimeIndex. Making sort=False work in the DateTimeIndex case is of course something that should be ensured, as it is something that might come in handy in some rare cases, but having that the default I don't think is a good idea at all, given the opposite has been the default for a long time and frankly makes more sense.

Feature Description

Change the default back to sort=True in the case of DataFrames/Series with DateTimeIndex and update the documentation to more accurately reflect this original default behaviour.

Alternative Solutions

Many users suddenly (and without clear warning) need to put sort=True in all the places they call pd.concat on timeseries-like Pandas objects.

Additional Context

No response

@torext torext added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 22, 2024
@enzbus
Copy link

enzbus commented Jan 22, 2024

Thank you, this just caused my unit tests to fail. I agree on giving a deprecation warning and/or reverting to the previous behavior.

@jorisvandenbossche jorisvandenbossche added this to the 2.2.1 milestone Jan 22, 2024
enzbus added a commit to cvxgrp/cvxportfolio that referenced this issue Jan 23, 2024
This revision fixes #128: a regression pandas-dev/pandas#57006
and a few warnings introduced by pandas 2.2.0. There are no other
code changes.
@lithomas1 lithomas1 added Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode Index Related to the Index class or subclasses Datetime Datetime data dtype and removed Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 24, 2024
@lithomas1
Copy link
Member

This seems reasonable to me.

@lukemanley
Copy link
Member

It sounds like #54769 should be reverted in 2.2.1 and documented instead due to long-standing behavior.

The long-standing behavior does have a few inconsistencies that motivated #54769 that may be worth noting:

  1. pd.concat(axis=1, how="outer") special cases DatetimeIndex to always sort (ignores sort parameter). All other dtypes and index types do NOT sort when sort=False (default) including other time-like dtypes such as pyarrow timestamp/date/time and PeriodIndex.

  2. pd.concat(axis=1, how="outer") always sorts for DatetimeIndex, but pd.concat(axis=1, how="inner") only sorts if sort=True

  3. I can see the argument for outer joining two sorted (monotonic) indexes via pd.concat(axis=1) and expecting a monotonic output by default. I dont know the full history, but that may be the reason pd.merge and DataFrame.join ALWAYS sort outer joins (for all dtypes, not just DatetimeIndex). The behavior of pd.concat does not necessarily need to align with merge/join but it might be more consistent and less confusing than maintaining/documenting these kind of exception cases.

@lukemanley lukemanley changed the title ENH: Revert GH 54769 and fix documentation instead REGR/DOC: pd.concat(axis=1) should special case DatetimeIndex to sort even when sort=False Jan 28, 2024
@torext
Copy link
Contributor Author

torext commented Jan 28, 2024

3. I can see the argument for outer joining two sorted (monotonic) indexes via pd.concat(axis=1) and expecting a monotonic output by default. I dont know the full history, but that may be the reason `pd.merge` and `DataFrame.join` ALWAYS sort outer joins (for all dtypes, not just DatetimeIndex). The behavior of `pd.concat` does not necessarily need to align with merge/join but it might be more consistent and less confusing than maintaining/documenting these kind of exception cases.

I like this suggestion a lot. I primarily work with and write libraries for timeseries-like data, whether that be DataFrames or Series, and I can't actually imagine a single scenario where I wouldn't wan't a monotonic output from an outer join induced on the other axis by pd.concat(axis=1). In fact, even for an inner join I can't think of such a scenario, given the input indices were monotonic.

A general argument for why maybe having the indices sorted by default could go as follows: if one cares about the sorting of indices in a given context, then the input indices in that context will presumably be monotonic and it will be desirable to keep the output monotonic as well. If instead one is in a situation where one doesn't care about the sorting (since e.g. it's categorical "labeling"-type of data that's stored in the index), then one won't be too fussed if the output ends up sorted by default; the context and code around it presumably is written in an order-agnostic way in this case and having the additional sorting step thus at most has a performance impact (which can be mitigated with sort=False).

Ironically enough, .groupby(), which I feel is more often used with unsorted categorical-type data in the index/grouper, has sort=True by default, which I often have to end up overriding since I'm grouping by some categories for display-purposes and I want to keep the implicit original order intact. Cases where .groupby() is used with a DateTimeIndex-like grouper are rare, since for those cases often there are more adapted approaches like .resample()etc. already implemented. And to top off the irony, having sort=False in those cases would probably anyways give the desired output assuming the input indices were monotonic (think grouping a monotonic daily frequency DateTimeIndex by month with sort=Falseand applying some aggregation function).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Index Related to the Index class or subclasses Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants