REGR/DOC: pd.concat(axis=1) should special case DatetimeIndex to sort even when sort=False #57006

torext · 2024-01-22T14:14:16Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

For a very long time now it used to be the default behaviour that calling pd.concat(..., axis=1) on DataFrames with DateTimeIndex would result in a DataFrame with sorted index. This is very reasonable default behaviour and pretty much what one wants in this case the vast majority of times.

With #54769 this was considered a bug simply because it wasn't documented correctly, and now the default behaviour, without any warning, was suddenly changed. This means likely code breakages across many users' libraries that were written based on the (up until now true) assumption that concatenating timeseries-like DataFrames/Series would give another timeseries-like object, i.e. something with a clearly sorted DateTimeIndex. Making sort=False work in the DateTimeIndex case is of course something that should be ensured, as it is something that might come in handy in some rare cases, but having that the default I don't think is a good idea at all, given the opposite has been the default for a long time and frankly makes more sense.

Feature Description

Change the default back to sort=True in the case of DataFrames/Series with DateTimeIndex and update the documentation to more accurately reflect this original default behaviour.

Alternative Solutions

Many users suddenly (and without clear warning) need to put sort=True in all the places they call pd.concat on timeseries-like Pandas objects.

Additional Context

No response

The text was updated successfully, but these errors were encountered:

enzbus · 2024-01-22T16:11:59Z

Thank you, this just caused my unit tests to fail. I agree on giving a deprecation warning and/or reverting to the previous behavior.

This revision fixes #128: a regression pandas-dev/pandas#57006 and a few warnings introduced by pandas 2.2.0. There are no other code changes.

lithomas1 · 2024-01-24T16:30:34Z

This seems reasonable to me.

lukemanley · 2024-01-28T11:28:50Z

It sounds like #54769 should be reverted in 2.2.1 and documented instead due to long-standing behavior.

The long-standing behavior does have a few inconsistencies that motivated #54769 that may be worth noting:

pd.concat(axis=1, how="outer") special cases DatetimeIndex to always sort (ignores sort parameter). All other dtypes and index types do NOT sort when sort=False (default) including other time-like dtypes such as pyarrow timestamp/date/time and PeriodIndex.
pd.concat(axis=1, how="outer") always sorts for DatetimeIndex, but pd.concat(axis=1, how="inner") only sorts if sort=True
I can see the argument for outer joining two sorted (monotonic) indexes via pd.concat(axis=1) and expecting a monotonic output by default. I dont know the full history, but that may be the reason pd.merge and DataFrame.join ALWAYS sort outer joins (for all dtypes, not just DatetimeIndex). The behavior of pd.concat does not necessarily need to align with merge/join but it might be more consistent and less confusing than maintaining/documenting these kind of exception cases.

torext · 2024-01-28T12:41:22Z

3. I can see the argument for outer joining two sorted (monotonic) indexes via pd.concat(axis=1) and expecting a monotonic output by default. I dont know the full history, but that may be the reason `pd.merge` and `DataFrame.join` ALWAYS sort outer joins (for all dtypes, not just DatetimeIndex). The behavior of `pd.concat` does not necessarily need to align with merge/join but it might be more consistent and less confusing than maintaining/documenting these kind of exception cases.

I like this suggestion a lot. I primarily work with and write libraries for timeseries-like data, whether that be DataFrames or Series, and I can't actually imagine a single scenario where I wouldn't wan't a monotonic output from an outer join induced on the other axis by pd.concat(axis=1). In fact, even for an inner join I can't think of such a scenario, given the input indices were monotonic.

A general argument for why maybe having the indices sorted by default could go as follows: if one cares about the sorting of indices in a given context, then the input indices in that context will presumably be monotonic and it will be desirable to keep the output monotonic as well. If instead one is in a situation where one doesn't care about the sorting (since e.g. it's categorical "labeling"-type of data that's stored in the index), then one won't be too fussed if the output ends up sorted by default; the context and code around it presumably is written in an order-agnostic way in this case and having the additional sorting step thus at most has a performance impact (which can be mitigated with sort=False).

Ironically enough, .groupby(), which I feel is more often used with unsorted categorical-type data in the index/grouper, has sort=True by default, which I often have to end up overriding since I'm grouping by some categories for display-purposes and I want to keep the implicit original order intact. Cases where .groupby() is used with a DateTimeIndex-like grouper are rare, since for those cases often there are more adapted approaches like .resample()etc. already implemented. And to top off the irony, having sort=False in those cases would probably anyways give the desired output assuming the input indices were monotonic (think grouping a monotonic daily frequency DateTimeIndex by month with sort=Falseand applying some aggregation function).

torext added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 22, 2024

enzbus mentioned this issue Jan 22, 2024

Pandas 2.2.0 causes issues cvxgrp/cvxportfolio#128

Closed

jorisvandenbossche added this to the 2.2.1 milestone Jan 22, 2024

enzbus added a commit to cvxgrp/cvxportfolio that referenced this issue Jan 23, 2024

version 1.1.1

5eebc9d

This revision fixes #128: a regression pandas-dev/pandas#57006 and a few warnings introduced by pandas 2.2.0. There are no other code changes.

lukemanley changed the title ~~ENH: Revert GH 54769 and fix documentation instead~~ REGR/DOC: pd.concat(axis=1) should special case DatetimeIndex to sort even when sort=False Jan 28, 2024

lukemanley mentioned this issue Feb 4, 2024

REGR/DOC: pd.concat should special case DatetimeIndex to sort even when sort=False #57250

Merged

5 tasks

phofl closed this as completed in #57250 Feb 9, 2024

lukemanley mentioned this issue Feb 10, 2024

DEPR: pd.concat special cases DatetimeIndex to sort even when sort=False #57335

Open

raoulcollenteur mentioned this issue Apr 3, 2024

Add bokeh plots and deprecate contributions_pie pastas/pastas#715

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

REGR/DOC: pd.concat(axis=1) should special case DatetimeIndex to sort even when sort=False #57006

REGR/DOC: pd.concat(axis=1) should special case DatetimeIndex to sort even when sort=False #57006

torext commented Jan 22, 2024

enzbus commented Jan 22, 2024

lithomas1 commented Jan 24, 2024

lukemanley commented Jan 28, 2024

torext commented Jan 28, 2024

REGR/DOC: pd.concat(axis=1) should special case DatetimeIndex to sort even when sort=False #57006

REGR/DOC: pd.concat(axis=1) should special case DatetimeIndex to sort even when sort=False #57006

Comments

torext commented Jan 22, 2024

Feature Type

Problem Description

Feature Description

Alternative Solutions

Additional Context

enzbus commented Jan 22, 2024

lithomas1 commented Jan 24, 2024

lukemanley commented Jan 28, 2024

torext commented Jan 28, 2024