ENH: KeyError from missing column should list available columns #50076

Open
1 of 3 tasks
janosh opened this issue Dec 5, 2022 · 6 comments
Labels
Enhancement, Error Reporting, Needs Discussion

Comments

@janosh
Contributor

janosh commented Dec 5, 2022

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

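import pandas as pd

# Small frame with columns A-D; stands in for pd.util.testing.makeMixedDataFrame(),
# which newer pandas releases no longer provide.
df = pd.DataFrame({"A": [0.0, 1.0], "B": [0, 1], "C": ["foo1", "foo2"], "D": [True, False]})

print(f'{list(df)=}')
# list(df)=['A', 'B', 'C', 'D']

df[['foo']]
# KeyError: "['foo'] not in index"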

Feature Description

The error message would be more helpful if it listed available columns:

df[['foo']]
>>> KeyError: "['foo'] not in columns=['A', 'B', 'C', 'D']"

Alternative Solutions

n/a

Additional Context

No response

@janosh added the Enhancement and Needs Triage labels on Dec 5, 2022
@rhshadrach
Member

rhshadrach commented Dec 5, 2022

pandas DataFrames can have a massive number of columns, so we would either (a) flood stdout or (b) need to truncate the output. Even when there are few columns, we'd need to worry about the repr of the individual column labels being long themselves.
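
To make option (b) concrete, here is a minimal sketch of bounded truncation using the standard library's reprlib module. The helper name summarize_columns and its default limits are assumptions chosen for illustration; nothing like this exists in pandas.

```python
import reprlib


def summarize_columns(columns, max_items=6, max_name_len=30):
    """Return a bounded-length repr of a collection of column labels.

    Hypothetical helper, not part of pandas.
    """
    limiter = reprlib.Repr()
    limiter.maxlist = max_items        # show at most this many labels, then "..."
    limiter.maxstring = max_name_len   # shorten the repr of long string labels
    return limiter.repr(list(columns))
```

With limits like these the message size stays bounded no matter how wide the frame is, at the cost of possibly hiding the column the user was looking for.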

@rhshadrach added the Error Reporting label on Dec 5, 2022
@janosh
Contributor Author

janosh commented Dec 5, 2022

We could check the length of the joined column names first:

if len(', '.join(df)) < some_threshold:
    raise KeyError(f"['foo'] not in columns={', '.join(df)}")

and make some_threshold configurable via, say, pd.options.key_errors.max_col_list_len.
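
A self-contained sketch of that length check, written as a plain function so it can be tried without touching pandas internals. The function name and the max_len parameter are hypothetical (max_len stands in for the proposed option, which does not exist), and the fallback message is copied from the one quoted at the top of this issue.

```python
def missing_columns_message(missing, columns, max_len=80):
    """Build a KeyError message, listing columns only when the list is short.

    Hypothetical helper; max_len plays the role of the proposed option.
    """
    # repr() each label so that non-string labels (ints, tuples) also work
    listing = ", ".join(map(repr, columns))
    if len(listing) < max_len:
        return f"{list(missing)} not in columns=[{listing}]"
    # Too long to show: fall back to the short message
    return f"{list(missing)} not in index"
```

Calling missing_columns_message(['foo'], ['A', 'B', 'C', 'D']) gives the message proposed in the feature description, while a wide frame falls back to the short form.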

@mroeschke
Member

Yeah, as is I would be -0.5 on including this, given @rhshadrach's concerns:

  1. In an interactive environment, one can quickly access df.columns or df.index after the error
  2. In a script/process, I guess it could be useful in the traceback with error logging infrastructure but may be too verbose more often than not

@janosh
Contributor Author

janosh commented Dec 5, 2022

> but may be too verbose more often than not

More often than not column count and name lengths should be manageable, no?

> In a script/process, I guess it could be useful in the traceback with error logging infrastructure

Exactly, that's my use case! When a job fails and I only see it several hours later in a workflow with a dozen different dataframes, it can be hard to determine which data access is failing and how to fix it. I usually have to rerun the script interactively and print column names to determine the fix.
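
Until something like this lands in pandas, a user-side wrapper can cover the batch-job case. This is only a sketch: select_columns and its name argument are hypothetical, and it relies on nothing beyond df.columns and df[columns].

```python
def select_columns(df, columns, name="DataFrame"):
    """Return df[columns], raising a KeyError that lists the available columns.

    Hypothetical user-side helper, not a pandas API. `name` is a free-form
    label so the log shows which of several frames the failed access hit.
    """
    available = list(df.columns)
    missing = [col for col in columns if col not in available]
    if missing:
        raise KeyError(f"{missing} not in columns of {name}: {available}")
    return df[columns]
```

For example, select_columns(df, ['foo'], name='orders') would fail with a message naming both the frame and its actual columns, so the traceback alone is enough to find the fix.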

@rhshadrach added the Needs Discussion label and removed the Needs Triage label on Dec 8, 2022
@kostyafarber
Contributor

kostyafarber commented Dec 9, 2022

Hey, I'd like to work on this. Do we want to go ahead with making these changes?

Or are we not fully sold on this idea yet?

@phofl
Member

phofl commented Dec 9, 2022

This needs more discussion first.

I am also leaning more towards no. We don't want to have a million options.


5 participants