BUG: DataFrame.drop() fails when columns= is given as tuple #43978

Open
2 of 3 tasks
JBGreisman opened this issue Oct 11, 2021 · 6 comments
Labels: Bug; Needs Discussion (requires discussion from core team before further action); Nested Data (data where the values are collections: lists, sets, dicts, objects, etc.)

Comments

@JBGreisman
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['A', 'B', 'C', 'D'])

print(df.drop(columns=list(["A", "B"])))     # <-- Works
print(df.drop(columns=np.array(["A", "B"]))) # <-- Works
print(df.drop(columns=tuple(["A", "B"])))    # <-- Fails with KeyError

Issue Description

The documentation says that the columns= argument of DataFrame.drop() can take a single label or list-like, but it fails when given a tuple with more than one column name. The method works when the exact same column labels are provided as a list or a np.ndarray.

Just an observation -- a tuple with len()==1 does seem to work successfully here.

Expected Behavior

For the above example, df.drop(columns=("A", "B")) should produce the same output as columns=["A", "B"] or columns=np.array(["A", "B"]), resulting in the following DataFrame:

    C   D
0   2   3
1   6   7
2  10  11
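
A minimal sketch of a workaround, assuming the labels happen to be held in a tuple: converting the tuple to a list before the call avoids the tuple handling altogether. The names `cols` and `result` are illustrative and not part of the original report.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=["A", "B", "C", "D"])

cols = ("A", "B")                     # labels stored in a tuple (illustrative)
result = df.drop(columns=list(cols))  # a list is read as a sequence of labels
print(result)
```

This only changes how the labels are passed; it does not change how DataFrame.drop() interprets tuples.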

Installed Versions

INSTALLED VERSIONS

commit : 73c6825
python : 3.8.11.final.0
python-bits : 64
OS : Darwin
OS-release : 17.7.0
Version : Darwin Kernel Version 17.7.0: Fri Oct 30 13:34:27 PDT 2020; root:xnu-4570.71.82.8~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.0.1
setuptools : 58.0.4
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : 4.2.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.28.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@JBGreisman JBGreisman added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 11, 2021
@erfannariman
Member

take

@erfannariman
Member

erfannariman commented Oct 11, 2021

As already described in the draft PR, the ambiguity lies in the fact that we can use tuples to select across multiple levels, while they can also be used as column names:

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=[("A", "B"), ("C", "D")])
print(df.drop(columns=("A", "B")))

   (C, D)
0       1
1       3
2       5

Or MultiIndex:

idx = pd.MultiIndex.from_tuples([("A", "B"), ("C", "D")])
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=idx)

print(df)
   A  C
   B  D
0  0  1
1  2  3
2  4  5


df.drop(columns=("A", "B"))
   C
   D
0  1
1  3
2  5

So it's not clear to me whether this is expected behaviour or a bug.

@attack68
Contributor

attack68 commented Oct 12, 2021

In another issue I have proposed prohibiting tuple-sequence input, since a tuple has the ambiguous possibility of an individual label. If a list is used there is no ambiguity. This has greater implications in the .loc method, for example, but it would be useful to be consistent here also. Therefore, I believe this should fail if the input is given as a tuple, and a tuple label does not exist.

Can't find the exact post, but this is similar: #42329 (comment)

@erfannariman
Member

> In another issue I have proposed prohibiting tuple-sequence input

Would tuples as column names also fall under this?

   (C, D)
0       1
1       3
2       5

@attack68
Contributor

> > In another issue I have proposed prohibiting tuple-sequence input
>
> Would tuples as column names also fall under this?
>
>    (C, D)
> 0       1
> 1       3
> 2       5

Index labels are allowed to be tuples; they are not allowed to be lists.

So:

  • [(0,1)] is a list of one label, which is (0,1).
  • (0,1) is a single index label, which may ambiguously be a single-level identifier or two level identifiers.
  • (0, 1) is not a sequence of the two labels 0,1 and is not equivalent to [0, 1].

If you permit (0, 1) as a sequence of two labels, it opens up a host of unnecessary problems when it is more explicit, anyway, to use lists.
In my opinion.
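
The distinction above can be sketched with a frame whose columns contain both plain labels and a tuple label. The frame `mixed` is a hypothetical example built for illustration, not something from the thread, and the sketch assumes a flat (non-MultiIndex) column index.

```python
import pandas as pd

# Hypothetical frame: two plain labels plus one tuple label.
mixed = pd.DataFrame([[1, 2, 3]], columns=["A", "B", ("A", "B")])

# A list of two labels drops the columns "A" and "B".
two_labels = mixed.drop(columns=["A", "B"])

# A list holding one tuple drops the single column labelled ("A", "B").
one_label = mixed.drop(columns=[("A", "B")])

# A bare tuple is read as that same single label, not as a sequence.
bare_tuple = mixed.drop(columns=("A", "B"))

print(list(two_labels.columns))
print(list(one_label.columns))
print(list(bare_tuple.columns))
```

With lists, the reading is fixed by the container; with a bare tuple, the reader has to know that pandas treats it as one label.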

@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 16, 2021
@jreback
Contributor

jreback commented Oct 21, 2021

tuples are for MultiIndexes; yes, if it's unambiguous you could just listify it, but -1 on changing here as it's not worth it.
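
For readers who want the "listify when unambiguous" behaviour on their own side, one possible sketch is a small wrapper. `drop_columns_listify` is a hypothetical helper, not a pandas function, and it assumes that a tuple which is not itself a column label was meant as a sequence of labels.

```python
import numpy as np
import pandas as pd


def drop_columns_listify(df, cols):
    """Hypothetical helper (not a pandas API): treat a tuple as a list of
    labels only when the tuple is not itself a column label."""
    if isinstance(cols, tuple) and cols not in df.columns:
        cols = list(cols)
    return df.drop(columns=cols)


flat = pd.DataFrame(np.arange(12).reshape(3, 4),
                    columns=["A", "B", "C", "D"])
tupled = pd.DataFrame(np.arange(6).reshape(3, 2),
                      columns=[("A", "B"), ("C", "D")])

# The tuple is not a label of `flat`, so it is converted to a list.
print(drop_columns_listify(flat, ("A", "B")))

# The tuple is a label of `tupled`, so it is passed through unchanged.
print(drop_columns_listify(tupled, ("A", "B")))
```

When the tuple is both a label and a valid sequence of labels, this sketch keeps the existing single-label reading, so the ambiguity discussed above is resolved by convention rather than removed.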
