
ENH: read_csv: Warn when dtype or converters contain keys that do not exist in the data #39467

Open
malinkallen opened this issue Jan 29, 2021 · 5 comments
Labels
Dtype Conversions (Unexpected or buggy dtype conversions), Enhancement, Error Reporting (Incorrect or improved errors from pandas), IO CSV (read_csv, to_csv)

Comments

@malinkallen

Is your feature request related to a problem?

As part of a research project, two co-authors and I have been mining GitHub for Jupyter notebooks and studying a number of frequent function calls in those notebooks. It appears that users sometimes call read_csv with dtype specified as a dict containing keys that are not present in the data set. When this happens, these keys are silently ignored. To take a simple example, one may write

df = pd.read_csv("my_data.csv", dtype={"a":float})

even though there is no column named "a" in "my_data.csv". The key-value pair "a": float is silently ignored.

This means that a user who misspells a column header, tries to change the data type of a column that is filtered out (e.g. using usecols), or refers to a non-existent column that they believe exists will not become aware of the mistake, and the data will not be converted as expected. If read_csv instead issued a warning, the user could easily detect the mistake and fix it immediately. As a bonus, such a warning could also help remedy other bugs. For example, if a user tries to change the data type of the values in a column that is filtered out, the user most likely believes they are filtering out different columns than they actually are. A warning from read_csv would indicate that something is wrong, and the user could find a bug that might otherwise go undetected or take a long time to debug.

We have noticed the same behavior when dicts passed as converters contain keys that are missing from the data set. For the same reasons as above, I think issuing a warning in these cases would be helpful to users.
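For reference, a minimal reproduction of the behaviour described above (the file contents and the column names "x", "y", "a" and "b" are invented for illustration):

import io
import pandas as pd

buf = io.StringIO("x,y\n1,2\n3,4\n")

# "a" is not a column in the data: its entry is dropped silently, with
# no warning or error, and only "x" is converted.
df = pd.read_csv(buf, dtype={"a": float, "x": float})
print(df.dtypes)  # x is float64, y stays int64; nothing mentions "a"

# The same holds for converters: "b" does not exist, so the callable is never used.
df = pd.read_csv(io.StringIO("x,y\n1,2\n"), converters={"b": str.strip})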

Describe the solution you'd like

read_csv should issue a warning when dicts passed as values for dtype or converters contain keys that are not present in the data set.
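A rough sketch of what such a check could look like (the helper name and its placement in the parser are hypothetical, not pandas' actual implementation):

import warnings

def _warn_unused_keys(specified, columns, argname):
    # specified: the user-supplied dtype/converters dict
    # columns: the column names actually parsed from the file
    unused = set(specified) - set(columns)
    if unused:
        warnings.warn(
            f"Keys in '{argname}' refer to columns not present in the data: "
            f"{sorted(unused)}",
            UserWarning,
            stacklevel=2,
        )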

API breaking implications

If only a warning is issued, I don't see how this could break anything.

Describe alternatives you've considered

An alternative could be to issue an error instead of a warning, but that would risk breaking existing code.

@malinkallen malinkallen added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 29, 2021
@Liam3851
Contributor

Not sure what others think, but I don't think this is desirable. Sometimes people read a variety of files in related formats and use the same set of dtypes/converters for all of them. E.g. all my columns called "time" might want to have the same dtype, all the columns called "price" a different dtype, etc.; when reading a file that doesn't contain the time column (say, with just daily data and no time data) I can still use the same set of dtypes/converters. This would warn against that practice for unclear reasons.
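To make that workflow concrete, here is a sketch of the shared-mapping pattern described above (file contents and column names invented for illustration):

import io
import pandas as pd

# One mapping reused for every related feed.
DTYPES = {"time": str, "price": float, "size": float}

intraday = io.StringIO("time,price,size\n09:30,10.5,100\n")
daily = io.StringIO("price,size\n10.7,250\n")  # no "time" column in the daily file

# Both reads work today; under the proposal the second one would warn
# because the "time" key matches no column.
df_intraday = pd.read_csv(intraday, dtype=DTYPES)
df_daily = pd.read_csv(daily, dtype=DTYPES)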

@attack68
Contributor

I can see arguments from both sides.

Without any statistics, I'm going to assume that the majority of the time one imports a single set of data that is predominantly known. In that case, a dtype warning is preferable for the user, as per @malinkallen.

However, for the more mechanical process of processing multiple data sets, an option to suppress these warnings may be preferable, as @Liam3851 points out.

Pandas has recently changed the behaviour of .loc to specifically raise KeyError if a key is not in the index. This clearly falls in the same category, with each column being a key in the horizontal index. However, there is a difference between 'constructing a DataFrame from source' and 'indexing a constructed DataFrame', so perhaps an error is a step too far (and not backwards compatible), whereas an (optionally suppressible) warning is the appropriate way to go?
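For reference, a small illustration of the .loc behaviour referred to above (list-based label selection with a missing key), shown only to clarify the analogy:

import pandas as pd

df = pd.DataFrame({"x": [1, 2]})

# Selecting with a list that contains a missing label raises KeyError
# instead of silently reindexing, unlike read_csv's handling of unused dtype keys.
try:
    df.loc[:, ["x", "a"]]
except KeyError as err:
    print(err)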

@Liam3851
Contributor

I don't think the analogy to 'key not in index' is apt. The prior behavior was that if you used .loc with a key not in the index, pandas did an implicit reindex and added the key you missed. So, better explicit than implicit: if you want to reindex and enlarge the DataFrame, do a reindex.

However, if I specify extra keys in my dtype kwarg dict, it makes absolutely no difference to the output: it doesn't add an additional empty column. If it did, I would have a different opinion.

But I also disagree with the assumption that the most common case is reading a single CSV that is unrelated to any other CSVs. One of pandas' greatest strengths is merging and munging multiple datasets (especially datasets at different frequencies). When doing so, I think it is in fact good design practice for people working with interrelated datasets to use a single dict of dtypes for all of them, containing all the columns they're going to see. The more common failure case I've seen, rather than a typo in a column name in the dtypes, is that a dtype that needs specification remains unspecified when merging DataFrames from multiple CSVs. This can lead to subtle, hard-to-find bugs when a column the DataFrames are merged on ends up with different dtypes.
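A small sketch of that failure mode (column names invented): the same key column read with and without an explicit dtype ends up with different types, which then breaks a later merge.

import io
import pandas as pd

DTYPES = {"id": str}  # shared mapping that pins the type of the merge key

a = pd.read_csv(io.StringIO("id,px\n007,1.5\n"))                # dtype omitted: id becomes int64 (7)
b = pd.read_csv(io.StringIO("id,qty\n007,10\n"), dtype=DTYPES)  # id stays the string "007"

print(a["id"].dtype, b["id"].dtype)  # int64 vs object
# Merging a and b on "id" now misbehaves because the key dtypes differ.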

This proposal would basically by default warn people that they really should be creating a separate dict of dtypes for every CSV file they read, which it seems to me is actually an antipattern for the merge/munge use case.

@attack68
Contributor

You raise some good points; there are clearly arguments for both sides. If warnings were implemented as a pandas option, the decision would then be between 'warn by default' and 'suppress the warning by default'. This seems to have an asymmetric impact:

With 'warn by default', any keys that refer to columns not found in the CSV will trigger a warning. In your workflow these warnings would appear at first, but could be easily suppressed (particularly if guidance is included in the warning text), satisfying your preference. They would also satisfy @malinkallen's use case by alerting users to issues that would otherwise remain unseen.

With 'suppress the warning by default', your use case is immediately satisfied, but the other case gains nothing, since the potential errors in scope go undetected. It is also unlikely that a casual user would become aware of the option and understand it well enough to activate it; it might end up so little used that it hardly justifies its own existence.

So, IMO, if a warning were to be implemented, I would prefer it to be on by default.

@jreback
Contributor

jreback commented Jan 30, 2021

what we actually need is a per-option control so this can error / raise / ignore

but this would require some effort

something like

Option(converters=...., errors='raise') or whatever

it could be backwards compatible and not super complicated
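One hypothetical shape such an API could take (the name Option and its signature are illustrative only, not an existing pandas interface):

class Option:
    # Pairs a dtype/converters mapping with a policy for keys that match no column.
    def __init__(self, mapping, errors="ignore"):
        # errors: "ignore" (current behaviour), "warn", or "raise"
        self.mapping = mapping
        self.errors = errors

# Hypothetical usage:
# df = pd.read_csv("my_data.csv", dtype=Option({"a": float}, errors="raise"))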

@lithomas1 lithomas1 added IO CSV read_csv, to_csv and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 2, 2021
@lithomas1 lithomas1 added this to the Contributions Welcome milestone Feb 2, 2021
@lithomas1 lithomas1 added Dtype Conversions Unexpected or buggy dtype conversions Error Reporting Incorrect or improved errors from pandas labels Feb 25, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022