API: DataFrame setitem: setting columns with a DataFrame RHS doesn't align column names? #46974

jorisvandenbossche · 2022-05-09T08:06:33Z

The case being considered here is when setting multiple columns into a DataFrame (using __setitem__, df[[..]] = ..), using a DataFrame right-hand-side value. So a simple, unambiguous example is:

>>> df1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])
>>> df2 = pd.DataFrame(np.arange(6).reshape(3, 2) * 2, columns=['a', 'b'])
>>> df1[['a', 'b']] = df2
>>> df1
   a   b
0  0   2
1  4   6
2  8  10

However, we are setting the multiple columns column-by-column in order, ignoring potential misaligned column names:

>>> df1[['a', 'b']] = df2[['b', 'a']]
>>> df1
    a  b
0   2  0
1   6  4
2  10  8

I think this is "expected" behaviour, meaning that it seems to be intentional and long-standing. I personally find it surprising, though, especially because using loc instead of plain setitem, i.e. df1.loc[:, ['a', 'b']] = df2[['b', 'a']], does align the column names:

>>> df1.loc[:, ['a', 'b']] = df2[['b', 'a']]
>>> df1
   a   b
0  0   2
1  4   6
2  8  10
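As a side note for anyone hitting this today, the intent can be made explicit on the right-hand side regardless of how the question below gets resolved. This is only a minimal sketch based on the behaviour shown above; the names key, swapped, by_name and by_position are made up for illustration:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])
df2 = pd.DataFrame(np.arange(6).reshape(3, 2) * 2, columns=["a", "b"])

key = ["a", "b"]
swapped = df2[["b", "a"]]  # same data, columns in a different order

# Name-based intent: select the RHS columns in key order before assigning,
# so the column-by-column setting pairs up matching names.
by_name = df1.copy()
by_name[key] = swapped[key]

# Positional intent: drop the labels entirely, so nobody reading the code
# expects the column names to be aligned.
by_position = df1.copy()
by_position[key] = swapped.to_numpy()
```

Here by_name ends up with the values of df2 under the matching names, while by_position takes the columns of swapped in the order they appear.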

I didn't directly find an issue about this, only a PR that touched the code that handles this but in case of duplicate columns (#39403), and a comment at https://github.com/pandas-dev/pandas/pull/39341/files#r563895152 about column names being irrelevant for setitem (cc @phofl @jbrockmendel)

But because we ignore alignment of the column names and then still do the setting by name (and not by position):

pandas/pandas/core/frame.py

Lines 3747 to 3750 in dd6869f

        if isinstance(value, DataFrame):
            check_key_length(self.columns, key, value)
            for k1, k2 in zip(key, value.columns):
                self[k1] = value[k2]

you get inconsistent results with duplicate column names.

For example, in this case the second column of df2 (named "a") is set to both "b" columns of df1:

>>> df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'b'])
>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['b', 'a', 'c'])
>>> df1[['a', 'b']] = df2
>>> df1
    a   b   b
0   0   2   2
1   6   8   8
2  12  14  14

On the other hand, if I change the column names in df2 to also contain duplicates, then depending on their exact order you get either an error or a "working" example:

>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['b', 'a', 'b'])
>>> df1[['a', 'b']] = df2
...
ValueError: Columns must be same length as key

>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['b', 'a', 'a'])
>>> df1[['a', 'b']] = df2
>>> df1
    a   b   b
0   0   2   4
1   6   8  10
2  12  14  16

And if the order of the column names matches exactly, the columns are set "correctly" as well:

>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['a', 'b', 'b'])
>>> df1[['a', 'b']] = df2
>>> df1
    a   b   b
0   0   2   4
1   6   8  10
2  12  14  16

So in general, in those examples, the column names do matter.

General questions:

Are we OK with __setitem__ (df[key] = value) with a DataFrame value ignoring the value's column names (i.e. not aligning key and value.columns)? And are we OK with this being different from .loc[]?
If we keep the current behaviour, should we set those columns by position instead of by column name, so that you don't get such inconsistent results with duplicate column names either?
(But how do we change this? It's a breaking change. Maybe we should deprecate/disallow such setitem with duplicate column names?)
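To make the "by position" option a bit more concrete, here is a rough sketch of what purely positional setting could look like from the outside. This is not what pandas does internally; set_columns_by_position is a hypothetical helper, and pairing the i-th selected column with the i-th column of the value is just one possible reading of "by position":

```python
import numpy as np
import pandas as pd


def set_columns_by_position(df, key, value):
    """Hypothetical helper: return a copy of df in which the columns selected
    by key are replaced by the columns of value, paired purely by position."""
    # Resolve the labels in key to integer positions. A duplicated label
    # contributes one position per matching column.
    positions = df.columns.get_indexer_for(key)
    if (positions == -1).any():
        raise KeyError("some labels in key are not columns of df")
    if len(positions) != value.shape[1]:
        raise ValueError("Columns must be same length as key")

    out = df.copy()
    for i, pos in enumerate(positions):
        # value's column names are never looked at, only its column order.
        out.iloc[:, int(pos)] = value.iloc[:, i].to_numpy()
    return out


df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["a", "b", "b"])
df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=["b", "a", "c"])
result = set_columns_by_position(df1, ["a", "b"], df2)
```

With this reading, the names in df2 (including duplicates) have no effect on the result: each of the three selected columns of df1 receives the df2 column at the same position.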


jbrockmendel · 2022-05-09T22:28:06Z

maybe we should deprecate/disallow such setitem with duplicate column names?

yah, it sounds like the "correct" behavior is ambiguous/non-obvious, so seems like a good candidate to deprecate/disallow.
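For illustration only, disallowing could amount to a check along these lines. strict_setitem is a hypothetical user-level wrapper, not an existing or proposed pandas API, and it assumes key is a list of column labels:

```python
import numpy as np
import pandas as pd


def strict_setitem(df, key, value):
    """Hypothetical wrapper around df[key] = value that refuses a DataFrame
    value whenever duplicate column names are involved on either side."""
    if isinstance(value, pd.DataFrame):
        # Columns of df that the key selects (assumes key is a list of labels).
        target = df.columns[df.columns.isin(key)]
        if not target.is_unique or not value.columns.is_unique:
            raise ValueError(
                "setting a DataFrame value with duplicate column names is ambiguous"
            )
    df[key] = value
```

A real deprecation would presumably warn first instead of raising right away.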

phofl · 2022-07-07T08:02:17Z

I agree with @jbrockmendel, the difference in behavior is expected but the inconsistency should be avoided.

jorisvandenbossche added the Indexing and API Design labels May 9, 2022

simonjayhawkins mentioned this issue May 25, 2022

BUG or DOC: NaNs generated when enlarge DataFrame with loc[:, cols] with a DataFrame RHS #47112

Closed
