New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: concat unwantedly sorts DataFrame column names if they differ #4588

Closed
smcinerney opened this Issue Aug 16, 2013 · 35 comments

Comments

Projects
None yet
@smcinerney

smcinerney commented Aug 16, 2013

When concat'ing DataFrames, the column names get alphanumerically sorted if there are any differences between them. If they're identical across DataFrames, they don't get sorted.
This sort is undocumented and unwanted. Certainly the default behavior should be no-sort. EDIT: the standard order as in SQL would be: columns from df1 (same order as in df1), columns (uniquely) from df2 (less the common columns) (same order as in df2). Example:

df4a = DataFrame(columns=['C','B','D','A'], data=np.random.randn(3,4))
df4b = DataFrame(columns=['C','B','D','A'], data=np.random.randn(3,4))
df5  = DataFrame(columns=['C','B','E','D','A'], data=np.random.randn(3,5))

print "Cols unsorted:", concat([df4a,df4b])
# Cols unsorted:           C         B         D         A

print "Cols sorted", concat([df4a,df5])
# Cols sorted           A         B         C         D         E
``'
@hayd

This comment has been minimized.

Contributor

hayd commented Aug 18, 2013

Looking at this briefly I think this stems from Index.intersection, whose docstring states:

Form the intersection of two Index objects. Sortedness of the result is not guaranteed

Not sure in which cases they appear/are sorted, but the case when the columns are equal (in your first one) is special cased to return the same result...

@jtratner

This comment has been minimized.

Contributor

jtratner commented Aug 18, 2013

@smcierney what order would you expect instead?

@jerryzhujian9

This comment has been minimized.

jerryzhujian9 commented Nov 25, 2013

I found the auto sort was a bit annoying too (well, I should say depends on your purpose), because I was trying to concat a frame to an empty frame in a loop (like append an element to a list). Then I realized my column order changed. This change also applies to index, if you are concatenating along axis=1.

In a case similar to that of @smcinerney , I expect the final order of CBDAE. E shows up last because the order CBDA shows up first when concatenating.

Therefore I wrote a "hack" (kinda silly though)

sorted = pd.concat(frameList, axis=axis, join=join, join_axes=join_axes, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False)

if join_axes:
    return sorted
elif sort:
    return sorted
else:
    # expand all original orders in each frame
    sourceOrder = []
    for frame in frameList:
        sourceOrder.extend(frame.Columns()) if axis == 0 else sourceOrder.extend(frame.Indices())
    sortedOrder = sorted.Columns() if axis == 0 else sorted.Indices()

    positions = []
    positionsSorted = []
    for i in sortedOrder:
        positions.append(sourceOrder.index(i))
        positionsSorted.append(sourceOrder.index(i))
    positionsSorted.sort()

    unsortedOrder = []
    for i in positionsSorted:
        unsortedOrder.append(sortedOrder[positions.index(i)])

    return sorted.ReorderCols(unsortedOrder) if axis == 0 else sorted.ReorderRows(unsortedOrder)

The function is included in my personal module called kungfu! Anyone can adopt the above algorithm, or have a look at my module at https://github.com/jerryzhujian9/kungfu

Finally, I greatly appreciate the work of the development team for this great module!

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 18, 2014

@asteppke

This comment has been minimized.

asteppke commented May 28, 2014

This behavior is indeed quite unexpected and I also stumbled over it.

 >>> df = pd.DataFrame()

>>> df['b'] = [1,2,3]
>>> df['c'] = [1,2,3]
>>> df['a'] = [1,2,3]
>>> print(df)
   b  c  a
0  1  1  1
1  2  2  2
2  3  3  3

[3 rows x 3 columns]
>>> df2 = pd.DataFrame({'a':[4,5]})
>>> df3 = pd.concat([df, df2])

Naively one would expect that the order of columns is preserved. Instead the columns are sorted:

>>> print(df3)
   a   b   c
0  1   1   1
1  2   2   2
2  3   3   3
0  4 NaN NaN
1  5 NaN NaN

[5 rows x 3 columns]

This can be corrected by reindexing with the original columns as follows:

>>> df4 = df3.reindex_axis(df.columns, axis=1)
>>> print(df4)
    b   c  a
0   1   1  1
1   2   2  2
2   3   3  3
0 NaN NaN  4
1 NaN NaN  5

[5 rows x 3 columns]

Still it seems counter-intuitive that this automatic sorting takes place and cannot be disabled as far as I know.

@zadacka

This comment has been minimized.

zadacka commented Nov 19, 2014

I've just come across this too.

new_data = pd.concat([churn_data, numerical_data])

Produced a DataFrame:

     churn  Var1  Var10  Var100  Var101 
0      -1   NaN    NaN     NaN     NaN     
1      -1   NaN    NaN     NaN     NaN

It would seem more natural for the numerical DataFrame to be concatenated without being sorted first!!

@jreback

This comment has been minimized.

Contributor

jreback commented Nov 19, 2014

well, this is a bit of work to fix. but pull requests accepted!

@rasbt

This comment has been minimized.

rasbt commented Jan 13, 2015

Just stumbled upon this same issue when I was concatenating DataFrames. It's a little bit annoying if you don't know about this issue, but actually there is a quick remedy:

say dfs is a list of DataFrames you want to concatenate, you can just take the the original column order and feed it back in:

df = pd.concat(dfs, axis=0)
df = df[dfs[0].columns]
@max-sixty

This comment has been minimized.

Contributor

max-sixty commented Feb 5, 2015

I believe append causes the same behavior, FYI

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

@tompicking

This comment has been minimized.

tompicking commented Mar 26, 2015

It's the default behaviour across the board. For example, if you apply a function, f, to a groupby() that returns a varying number of columns, the concatenation taking place behind the scene also auto-sorts the columns.

df.groupby(some_ts).apply(f)

Likely because the known order of the columns is open to interpretation.

However, this also happens for MultiIndices and all hierarchies in MultiIndices. So you can concat dataframes that agree on level0 columns and all bar one level1 columns, and all levels of the MultiIndices will be autosorted because of one mismatch within one level0 column. I don't imagine that is desirable.

I'd love to help, but unfortunately fixing this issue is beyond my ability. Thanks for the hard work all.

@vitalyisaev2

This comment has been minimized.

vitalyisaev2 commented Apr 25, 2015

+1 for this feature

@ashishsingal1

This comment has been minimized.

Contributor

ashishsingal1 commented May 30, 2015

Agreed, +1. Unexpected sorting happens all the time for me.

@scyllagist

This comment has been minimized.

scyllagist commented Aug 1, 2015

+1, this was an unpleasant surprise!

@Zenadix

This comment has been minimized.

Zenadix commented Aug 27, 2015

+1, I hate having the columns sorted after every append.

@summerela

This comment has been minimized.

summerela commented Sep 2, 2015

+1 from me, as well.

Because even if I did want to manually re-order after a concat, when I try to print out the 60 + column names and positions in my dataframe:

 for id, value in enumerate(df.columns):
      print id, value

All 60+ columns are output in alphabetical order, not their actual position in the data frame.

So that means that after ever concat, I have to manually type out a list of 60 columns to reorder. Ouch.

While I'm here, does anyone have a way to print out column name and position that I'm missing?

@jmwoloso

This comment has been minimized.

Contributor

jmwoloso commented Jan 20, 2016

+1 for this feature, just ran across the same deal myself.

@summerela Get the column index and then re-index your new dataframe using the original column index

# assuming you have two dataframes, `df_train` & `df_test` (with the same columns) 
# that you want to concatenate

# get the columns from one of them
all_columns = df_train.columns

# concatenate them
df_concat = pd.concat([df_train,
                       df_test])

# finally, re-index the new dataframe using the original column index
df_concat = df_concat.ix[:, all_columns]

Conversely, if you need to re-index a smaller subset of columns, you could use this function I made. It can operate with relative indices as well. For example, if you wanted to move a column to the end of a dataframe, but you aren't sure how many columns may remain after prior processing steps in your script (maybe you're dropping zero-variance columns, for instance), you could pass a relative index position to new_indices --> new_indices = [-1] and it will take care of the rest.

def reindex_columns(dframe=None, columns=None, new_indices=None):
    """
    Reorders the columns of a dataframe as specified by
    `reorder_indices`. Values of `columns` should align with their
    respective values in `new_indices`.

    `dframe`: pandas dataframe.

    `columns`: list,pandas.core.index.Index, or numpy array; columns to
    reindex.

    `reorder_indices`: list of integers or numpy array; indices
    corresponding to where each column should be inserted during
    re-indexing.
    """
    print("Re-indexing columns.")
    try:
        df = dframe.copy()

        # ensure parameters are of correct type and length
        assert isinstance(columns, (pd.core.index.Index,
                                    list,
                                    np.array)),\
        "`columns` must be of type `pandas.core.index.Index` or `list`"

        assert isinstance(new_indices,
                          list),\
        "`reorder_indices` must be of type `list`"

        assert len(columns) == len(new_indices),\
        "Length of `columns` and `reorder_indices` must be equal"

        # check for negative values in `new_indices`
        if any(idx < 0 for idx in new_indices):

            # get a list of the negative values
            negatives = [value for value
                         in new_indices
                         if value < 0]

            # find the index location for each negative value in
            # `new_indices`
            negative_idx_locations = [new_indices.index(negative)
                                      for negative in negatives]

            # zip the lists
            negative_zipped = list(zip(negative_idx_locations,
                                       negatives))

            # replace the negatives in `new_indices` with their
            # absolute position in the index
            for idx, negative in negative_zipped:
                new_indices[idx] = df.columns.get_loc(df.columns[
                                                          negative])

        # re-order the index now
        # get all columns
        all_columns = df.columns

        # drop the columns that need to be re-indexed
        all_columns = all_columns.drop(columns)

        # now re-insert them at the specified locations
        zipped_columns = list(zip(new_indices,
                                  columns))

        for idx, column in zipped_columns:
            all_columns = all_columns.insert(idx,
                                             column)
        # re-index the dataframe
        df = df.ix[:, all_columns]

        print("Successfully re-indexed dataframe.")

    except Exception as e:
        print(e)
        print("Could not re-index columns. Something went wrong.")

    return df

Edit: Usage would look like the following:

# move 'Column_1' to the end, move 'Column_2' to the beginning
df = reindex_columns(dframe=df,
                     columns=['Column_1', 'Column_2'],
                     new_indices=[-1, 0])

@jreback jreback modified the milestones: 0.18.1, Next Major Release Mar 12, 2016

@patricktokeeffe

This comment has been minimized.

Contributor

patricktokeeffe commented Mar 23, 2016

I encountered this (with 0.13.1) from an edge case not mentioned: combining dataframes each containing unique columns. A naive re-assignment of column names didn't work:

dat = pd.concat([out_dust, in_dust, in_air, out_air])
dat.columns = [out_dust.columns + in_dust.columns + in_air.columns + out_air.columns]

The columns still get sorted. Using lists intermediately resolved things, though:

Edit: I spoke too soon..


Follow-up: fwiw, column order can be preserved with chained .join calls on singular objects:

df1.join([df2, df3]) # sorts columns
df1.join(df2).join(df3) # column order retained

@jreback jreback modified the milestones: 0.18.1, 0.18.2 Apr 26, 2016

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Mar 30, 2018

Any updates on the status of this thread?

Still open.

Procrastination seems to be quite an issue here...

Procrastination? We have a lot of open issues. If you want to ensure this gets fixed in the next release then you're best bet is to put together a PR. Let us know if you need help getting started.

@smcinerney

This comment has been minimized.

smcinerney commented Mar 31, 2018

@jtratner: in case it wasn't obvious from my examples at the top, I'd expect the order to be:

  • shared columns, unsorted
  • columns unique to df1, unsorted (i.e. in order in which they occur in df1)
  • columns unique to df2, unsorted (i.e. in order in which they occur in df2)

This is what you get in other packages or languages, e.g. SQL. There should never be an unwanted automatic sort. If the user wants to sort the column names, let them do that manually.

@MikeTam1021

This comment has been minimized.

MikeTam1021 commented Mar 31, 2018

Hey guys, 2 things. 1) Welcome to pandas! I suggest just using more python native types like dictionaries. Stop trying to turn python (or any language) into SQL. 2) This is not technically a bug. It is just an unwanted effect of the code. You can easily overcome it outside the context of the package, and that is what I would think is the correct answer unless someone here takes it upon themselves.

@jmwoloso

This comment has been minimized.

Contributor

jmwoloso commented Mar 31, 2018

@MikeTam1021 please explain how to overcome this outside the context of the package. Thanks.

@MikeTam1021

This comment has been minimized.

MikeTam1021 commented Mar 31, 2018

I’m pretty sure that’s exactly what people in this thread have been discussing. I see lots of good solutions above that should work.

@h-vetinari

This comment has been minimized.

Contributor

h-vetinari commented Apr 1, 2018

@MikeTam1021 It's not about turning pandas into SQL (heaven forbid!), but I couldn't agree more with:

There should never be an unwanted automatic sort. If the user wants to sort the column names, let them do that manually.

Concatenating DataFrames should have the same effect as "writing them next to each other", and that implicit sort definitely violates the principle of least astonishment.

@MikeTam1021

This comment has been minimized.

MikeTam1021 commented Apr 1, 2018

I agree. It shouldn’t. It also assumes an order to the columns, which is SQLish, and not pure computer science. You should really know where you’re data is.

I hardly use pandas anymore after discovering this and many other issues. It has made me a better programmer.

@armant

This comment has been minimized.

armant commented Apr 4, 2018

+1 on this

brycepg added a commit to brycepg/pandas that referenced this issue Apr 5, 2018

Stop concat from attempting from sorting result columns
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588

brycepg added a commit to brycepg/pandas that referenced this issue Apr 5, 2018

Stop concat from attempting to sort mismatched columns by default
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588

brycepg added a commit to brycepg/pandas that referenced this issue Apr 5, 2018

Stop concat from attempting to sort mismatched columns by default
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588

brycepg added a commit to brycepg/pandas that referenced this issue Apr 5, 2018

Stop concat from attempting to sort mismatched columns by default
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588

brycepg added a commit to brycepg/pandas that referenced this issue Apr 5, 2018

Stop concat from attempting to sort mismatched columns by default
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588

brycepg added a commit to brycepg/pandas that referenced this issue Apr 5, 2018

Stop concat from attempting to sort mismatched columns by default
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588

brycepg added a commit to brycepg/pandas that referenced this issue Apr 5, 2018

Stop concat from attempting to sort mismatched columns by default
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close #4588

brycepg added a commit to brycepg/pandas that referenced this issue Apr 5, 2018

Stop concat from attempting to sort mismatched columns by default
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Apr 5, 2018

@bcucek

This comment has been minimized.

bcucek commented Apr 18, 2018

This works for me:

cols = list(df1)+list(df2)
df1 = pd.concat([df1, df2])
df1 = df1.loc[:, cols]

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Apr 30, 2018

Stop concat from attempting to sort mismatched columns by default
Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Close pandas-dev#4588

TomAugspurger added a commit that referenced this issue May 1, 2018

Stop concat from attempting to sort mismatched columns by default (#2…
…0613)

* Stop concat from attempting to sort mismatched columns by default

Preserve column order upon concatenation to obey
least astonishment principle.

Allow old behavior to be enabled by adding a boolean switch to
concat and DataFrame.append, mismatch_sort, which is by default disabled.

Closes #4588
@DavidEscott

This comment has been minimized.

DavidEscott commented Oct 4, 2018

I have to bitch about how this patch is rolled out. You have simultaneously changed the function signature of concat AND introduced a warning about the usage. All within the same commit.

The problem with that is that we use pandas on multiple servers and cannot guarantee that all servers have the exact same version of pandas at all times. So now we have less technical users seeing warnings from programs they have never seen before, and are uncertain if the warning is a sign of a problem.

I can readily identify WHERE the warning is coming from, but I can't add either of the suggested options because that would break the program on any server running an older version of pandas.

It would have been preferable if you put the sorting capability in to 0.23, and added the warning to some later version. I know its a pain, but it's rather obnoxious to assume that the users can immediately update all deployments to the latest code.

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Oct 4, 2018

@DavidEscott

This comment has been minimized.

DavidEscott commented Oct 4, 2018

@TomAugspurger There are a multitude of ways that we on our side can deal with this. Certainly filtering warnings is one. Its not great because the mechanics of warnings filters are a bit ugly...

  1. I would have to add the filter to multiple programs
  2. Not a great way to specify a specific warning to filter:
  • I can filter by module and lineno, but that isn't a stable reference,
  • I can filter by module and FutureWarning but then I wouldn't get any warnings at all from pandas and would be surprised by other changes,
  • or I can filter by your long multi-line message
  1. And then remember to take that filter out when everything is upgraded and it no longer matters.

In any case the deficiencies in the warnings module are certainly not something I can put at the foot of the pandas team.

Nor is it your fault that we have an older server we can't easily upgrade, so that would be the other thing I can do (just upgrade all the damn deployments). Ultimately, I recognize that I have to do that and that it is my responsibility to try and keep our deployments close together.

It just seems a bit bizarre to me that you were so concerned about a possible change in user visible end behavior that you added this sort option to what was previously an underspecified API, and yet have simultaneously thrown a warning at the programmer... both the warning and the proposed change in sort behavior constitute "user visible behavior" in my book, just of different severities.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment