BUG: concat column order behaviors changes after 1.4 #47127

Closed
2 of 3 tasks
Yikun opened this issue May 26, 2022 · 2 comments · Fixed by #47206
Labels
Bug Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode

@Yikun
Contributor

Yikun commented May 26, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

After 1.4:
>>> import pandas as pd
>>> pdf = pd.DataFrame({"A": [0, 2, 4], "B": [1, 3, 5], "C": [6, 7, 8]})
>>> pd.concat([pdf, pdf["C"], pdf["A"]], ignore_index=True, join='outer', sort=True)
     A    B    C    0
0  0.0  1.0  6.0  NaN
1  2.0  3.0  7.0  NaN
2  4.0  5.0  8.0  NaN
3  NaN  NaN  NaN  6.0
4  NaN  NaN  NaN  7.0
5  NaN  NaN  NaN  8.0
6  NaN  NaN  NaN  0.0
7  NaN  NaN  NaN  2.0
8  NaN  NaN  NaN  4.0


Before 1.4:
>>> import pandas as pd
>>> pdf = pd.DataFrame({"A": [0, 2, 4], "B": [1, 3, 5], "C": [6, 7, 8]})
>>> pd.concat([pdf, pdf["C"], pdf["A"]], ignore_index=True, join='outer', sort=True)
     0    A    B    C
0  NaN  0.0  1.0  6.0
1  NaN  2.0  3.0  7.0
2  NaN  4.0  5.0  8.0
3  6.0  NaN  NaN  NaN
4  7.0  NaN  NaN  NaN
5  8.0  NaN  NaN  NaN
6  0.0  NaN  NaN  NaN
7  2.0  NaN  NaN  NaN
8  4.0  NaN  NaN  NaN

Issue Description

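The column order produced by pd.concat with sort=True changed in pandas 1.4: the columns are no longer sorted (A, B, C, 0 instead of 0, A, B, C), so sort=True now gives the same result as sort=False.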

Expected Behavior

>>> pd.concat([pdf, pdf["C"], pdf["A"]], ignore_index=True, join='outer', sort=False)
     A    B    C    0
0  0.0  1.0  6.0  NaN
1  2.0  3.0  7.0  NaN
2  4.0  5.0  8.0  NaN
3  NaN  NaN  NaN  6.0
4  NaN  NaN  NaN  7.0
5  NaN  NaN  NaN  8.0
6  NaN  NaN  NaN  0.0
7  NaN  NaN  NaN  2.0
8  NaN  NaN  NaN  4.0
>>> pd.concat([pdf, pdf["C"], pdf["A"]], ignore_index=True, join='outer', sort=True)
     0    A    B    C
0  NaN  0.0  1.0  6.0
1  NaN  2.0  3.0  7.0
2  NaN  4.0  5.0  8.0
3  6.0  NaN  NaN  NaN
4  7.0  NaN  NaN  NaN
5  8.0  NaN  NaN  NaN
6  0.0  NaN  NaN  NaN
7  2.0  NaN  NaN  NaN
8  4.0  NaN  NaN  NaN
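
For code that has to produce the same column order on both affected and unaffected pandas versions, one option is to concatenate with sort=False and order the columns explicitly afterwards. The following is a minimal sketch under that assumption; `concat_with_stable_columns` is a hypothetical helper introduced here for illustration, not part of pandas.

```python
import pandas as pd


def concat_with_stable_columns(objs, sort_columns=True):
    """Concatenate row-wise, then order the columns explicitly.

    Hypothetical helper: it does not rely on the `sort` argument of
    `pd.concat`, so the column order is decided by this function alone.
    """
    result = pd.concat(objs, ignore_index=True, join="outer", sort=False)
    if sort_columns:
        # Non-string labels (such as 0) come first, then string labels;
        # each group is sorted on its own, so int and str are never compared.
        ordered = sorted(result.columns, key=lambda c: (isinstance(c, str), c))
        result = result[ordered]
    return result


pdf = pd.DataFrame({"A": [0, 2, 4], "B": [1, 3, 5], "C": [6, 7, 8]})
out = concat_with_stable_columns([pdf, pdf["C"], pdf["A"]])
print(list(out.columns))
```

On the versions discussed in this issue the two Series end up in a column labeled 0, so the helper should place that column before A, B and C, which matches the pre-1.4 output shown above.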

Installed Versions

1.4+

@Yikun Yikun added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 26, 2022
@Yikun
Contributor Author

Yikun commented May 28, 2022

Related first commit: 01b8d2a

@CloseChoice @jreback

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue May 28, 2022
@simonjayhawkins simonjayhawkins added this to the 1.4.3 milestone May 28, 2022
@simonjayhawkins simonjayhawkins added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 28, 2022
@simonjayhawkins
Member

Thanks @Yikun for the report.

Related first commit: 01b8d2a

In #43833, _get_combined_index in pandas/core/indexes/api.py was changed from index = union_indexes(indexes, sort=sort) to index = union_indexes(indexes, sort=False):

(Pdb) a
indexes = [Index(['A', 'B', 'C'], dtype='object'), Int64Index([0], dtype='int64'), Int64Index([0], dtype='int64')]
intersect = False
sort = True
copy = True
(Pdb) union_indexes(indexes, sort=False)
Index(['A', 'B', 'C', 0], dtype='object')
(Pdb) union_indexes(indexes, sort=True)
Index([0, 'A', 'B', 'C'], dtype='object')
(Pdb) 

and the code a few lines later

    if sort:
        try:
            index = index.sort_values()
        except TypeError:
            pass

index.sort_values() raises TypeError: '<' not supported between instances of 'int' and 'str', and the except clause swallows it, so the index is left unsorted. This code path neither uses _sort_mixed from pandas/core/algorithms.py nor applies the same sort logic as union_indexes(indexes, sort=True).
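
The failure mode can be reproduced without pandas. The following is a minimal sketch in plain Python, not pandas' actual implementation: `sort_or_give_up` mimics the quoted try/except pattern, and `sort_mixed` is a hypothetical fallback that orders numbers before strings, which is the ordering that union_indexes(indexes, sort=True) shows in the Pdb session above.

```python
def sort_or_give_up(labels, sort=True):
    """Mimic the quoted `if sort: try ... except TypeError: pass` pattern."""
    labels = list(labels)
    if sort:
        try:
            labels = sorted(labels)
        except TypeError:
            # int and str labels cannot be compared with '<', so the
            # original order is kept and no error reaches the caller.
            pass
    return labels


def sort_mixed(labels):
    """Hypothetical fallback: non-strings first, then strings, each sorted."""
    numbers = sorted(x for x in labels if not isinstance(x, str))
    strings = sorted(x for x in labels if isinstance(x, str))
    return numbers + strings


labels = ["A", "B", "C", 0]
print(sort_or_give_up(labels))  # ['A', 'B', 'C', 0]
print(sort_mixed(labels))       # [0, 'A', 'B', 'C']
```

The first call shows why sort=True appears to be ignored: the sort is attempted, fails on the mixed labels, and the unsorted order is returned as if nothing had happened.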

dotsdl added a commit to alchemistry/alchemlyb that referenced this issue Jun 30, 2022
Previous `pandas` behavior prior to 1.4.3 [did not sort numeric column
names](pandas-dev/pandas#47127), but this now
occurs. We don't sort within other parsers, so switching this flag to be
consistent with previous behavior. There is no clear reason sorting is
necessary here.
orbeckst pushed a commit to alchemistry/alchemlyb that referenced this issue Jun 30, 2022
* Fix #200
* Set namd parser column sorting to False

  Previous `pandas` behavior prior to 1.4.3 [did not sort numeric column
  names](pandas-dev/pandas#47127), but this now
  occurs. We don't sort within other parsers, so switching this flag to be
  consistent with previous behavior. There is no clear reason sorting is
  necessary here.

Co-authored-by: David Dotson <dotsdl@gmail.com>
HyukjinKwon pushed a commit to apache/spark that referenced this issue Jul 19, 2022
…low 1.4.3 behavior

### What changes were proposed in this pull request?

Respect Series.concat sort parameter when `num_series == 1` to follow 1.4.3 behavior.

### Why are the changes needed?
In #36711, we followed the pandas 1.4.2 behavior and respected the Series.concat sort parameter, except in the `num_series == 1` case.

[pandas 1.4.3](https://github.com/pandas-dev/pandas/releases/tag/v1.4.3) fixes the issue pandas-dev/pandas#47127. The `num_series == 1` bug is also fixed there, so this PR follows the pandas 1.4.3 behavior.

### Does this PR introduce _any_ user-facing change?
Yes, we already cover this case in:
https://github.com/apache/spark/blob/master/python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst
```
In Spark 3.4, the Series.concat sort parameter will be respected to follow pandas 1.4 behaviors.
```

### How was this patch tested?
- CI passed
- test_concat_index_axis passed with pandas 1.3.5, 1.4.2, 1.4.3.

Closes #37217 from Yikun/SPARK-39807.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>