
ENH: read_csv: Warn when dtype or converters contain keys that do not exist in the data #39467

Open
malinkallen opened this issue Jan 29, 2021 · 5 comments
Labels
Dtype Conversions (Unexpected or buggy dtype conversions), Enhancement, Error Reporting (Incorrect or improved errors from pandas), IO CSV (read_csv, to_csv)

Comments

@malinkallen

Is your feature request related to a problem?

As part of a research project, two co-authors and I have been mining GitHub for Jupyter notebooks and studying a number of frequent function calls in those notebooks. It appears that users sometimes call read_csv with dtype specified as a dict containing keys that are not present in the data set. When this happens, these keys are silently ignored. To take a simple example, one may write

df = pd.read_csv("my_data.csv", dtype={"a":float})

even though there is no column named "a" in "my_data.csv". The key-value pair "a": float is silently ignored.

This means that a user who misspells a column header, tries to change the data type of a column that is filtered out (e.g. using usecols), or refers to a non-existent column that they believe exists will not become aware of the mistake, and the data will not be converted as expected. If read_csv instead issued a warning, the user could easily detect the mistake and fix it immediately. As a bonus, such a warning could also help remedy other bugs. For example, if a user tries to change the data type of the values in a column that is filtered out, the user most likely believes they are filtering out different columns than they actually are. A warning from read_csv would indicate that something is wrong, and the user could find a bug that might otherwise go undetected or take a long time to debug.

We have noticed the same behavior when dicts passed as converters contain keys that are missing from the data set. For the same reasons as above, I think issuing a warning in these cases would be helpful to users.
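For reference, a minimal reproduction of the behaviour described above (the file contents and the column names "x", "y", "a" and "b" are invented for illustration):

import io
import pandas as pd

buf = io.StringIO("x,y\n1,2\n3,4\n")

# "a" is not a column in the data: its entry is dropped silently, with
# no warning or error, and only "x" is converted.
df = pd.read_csv(buf, dtype={"a": float, "x": float})
print(df.dtypes)  # x is float64, y stays int64; nothing mentions "a"

# The same holds for converters: "b" does not exist, so the callable is never used.
df = pd.read_csv(io.StringIO("x,y\n1,2\n"), converters={"b": str.strip})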

Describe the solution you'd like

read_csv should issue a warning when dicts passed as values for dtype or converters contain keys that are not present in the data set.
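A rough sketch of what such a check could look like (the helper name and its placement in the parser are hypothetical, not pandas' actual implementation):

import warnings

def _warn_unused_keys(specified, columns, argname):
    # specified: the user-supplied dtype/converters dict
    # columns: the column names actually parsed from the file
    unused = set(specified) - set(columns)
    if unused:
        warnings.warn(
            f"Keys in '{argname}' refer to columns not present in the data: "
            f"{sorted(unused)}",
            UserWarning,
            stacklevel=2,
        )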

API breaking implications

If only a warning is issued, I don't see how this could break anything.

Describe alternatives you've considered

An alternative could be to issue an error instead of a warning, but that would risk breaking existing code.

@malinkallen malinkallen added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 29, 2021
@Liam3851
Contributor

Not sure what others think, but I don't think this is desirable. Sometimes people read a variety of files in related formats and use the same set of dtypes/converters for all of them. E.g. all my columns called "time" might want to have the same dtype, all the columns called "price" a different dtype, etc.; when reading a file that doesn't contain the time column (say, with just daily data and no time data) I can still use the same set of dtypes/converters. This would warn against that practice for unclear reasons.
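To make that workflow concrete, here is a sketch of the shared-mapping pattern described above (file contents and column names invented for illustration):

import io
import pandas as pd

# One mapping reused for every related feed.
DTYPES = {"time": str, "price": float, "size": float}

intraday = io.StringIO("time,price,size\n09:30,10.5,100\n")
daily = io.StringIO("price,size\n10.7,250\n")  # no "time" column in the daily file

# Both reads work today; under the proposal the second one would warn
# because the "time" key matches no column.
df_intraday = pd.read_csv(intraday, dtype=DTYPES)
df_daily = pd.read_csv(daily, dtype=DTYPES)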

@attack68
Contributor

I can see arguments from both sides.

Without any statistics, I'm going to assume that the majority of the time one imports a single set of data that is predominantly known. In that case, a dtype warning is preferable for the user, as per @malinkallen.

However, for the more mechanical process of processing multiple data sets, an option to suppress these warnings may be preferable, as @Liam3851 points out.

Pandas has recently changed the behaviour of .loc to specifically raise KeyError if a key is not in the index. This clearly falls in the same category, with each column being a key in the horizontal index. However, there is a difference between 'constructing a DataFrame from source' and 'indexing a constructed DataFrame', so perhaps an error is a step too far (and not backwards compatible), whereas an (optionally suppressible) warning is the appropriate way to go?
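For reference, a small illustration of the .loc behaviour referred to above (list-based label selection with a missing key), shown only to clarify the analogy:

import pandas as pd

df = pd.DataFrame({"x": [1, 2]})

# Selecting with a list that contains a missing label raises KeyError
# instead of silently reindexing, unlike read_csv's handling of unused dtype keys.
try:
    df.loc[:, ["x", "a"]]
except KeyError as err:
    print(err)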

@Liam3851
Contributor

I don't think the analogy to 'key not in index' is apt. The prior behavior was that if you used .loc with a key not in the index, pandas did an implicit reindex and added the key you missed. So, better explicit than implicit: if you want to reindex and enlarge the DataFrame, do a reindex.

However, if I specify extra keys in my dtype kwarg dict, it makes absolutely no difference to the output: it doesn't add an additional empty column. If it did, I would have a different opinion.

But I also disagree with the assumption that the most common case is reading a single CSV that is unrelated to any other CSVs. One of pandas' greatest strengths is merging and munging multiple datasets (especially datasets at different frequencies). When doing so, I think it is in fact good design practice for people working with interrelated datasets to use a single dict of dtypes for all of them, containing all the columns they're going to see. The more common failure case I've seen, rather than a typo in a column name in the dtypes, is that a dtype that needs specification remains unspecified when merging DataFrames from multiple CSVs. This can lead to subtle, hard-to-find bugs when a column the DataFrames are merged on ends up with different dtypes.
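A small sketch of that failure mode (column names invented): the same key column read with and without an explicit dtype ends up with different types, which then breaks a later merge.

import io
import pandas as pd

DTYPES = {"id": str}  # shared mapping that pins the type of the merge key

a = pd.read_csv(io.StringIO("id,px\n007,1.5\n"))                # dtype omitted: id becomes int64 (7)
b = pd.read_csv(io.StringIO("id,qty\n007,10\n"), dtype=DTYPES)  # id stays the string "007"

print(a["id"].dtype, b["id"].dtype)  # int64 vs object
# Merging a and b on "id" now misbehaves because the key dtypes differ.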

This proposal would basically by default warn people that they really should be creating a separate dict of dtypes for every CSV file they read, which it seems to me is actually an antipattern for the merge/munge use case.

@attack68
Contributor

You raise some good points; there are clearly arguments for both sides. If warnings were implemented as a pandas option, the decision would then be between 'warn by default' and 'suppress the warning by default'. This seems to have an asymmetric impact:

With 'warn by default', any keys that refer to columns not found in the CSV will trigger a warning. In your workflow these warnings would appear at first, but could be easily suppressed (particularly if guidance is included in the warning text), satisfying your preference. They would also satisfy @malinkallen's use case by alerting users to issues that would otherwise remain unseen.

With 'suppress the warning by default', your use case is immediately satisfied, but the other case gains nothing, since the potential errors in scope go undetected. It is also unlikely that a casual user would become aware of the option and understand it well enough to activate it; it might end up so little used that it hardly justifies its own existence.

So, IMO, if a warning were to be implemented, I would prefer it to be on by default.

@jreback
Contributor

jreback commented Jan 30, 2021

what we actually need is a per-option control so this can error / raise / ignore

but this would require some effort

something like

Option(converters=...., errors='raise') or whatever

it could be backwards compatible and not super complicated
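One hypothetical shape such an API could take (the name Option and its signature are illustrative only, not an existing pandas interface):

class Option:
    # Pairs a dtype/converters mapping with a policy for keys that match no column.
    def __init__(self, mapping, errors="ignore"):
        # errors: "ignore" (current behaviour), "warn", or "raise"
        self.mapping = mapping
        self.errors = errors

# Hypothetical usage:
# df = pd.read_csv("my_data.csv", dtype=Option({"a": float}, errors="raise"))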

@lithomas1 lithomas1 added IO CSV read_csv, to_csv and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 2, 2021
@lithomas1 lithomas1 added this to the Contributions Welcome milestone Feb 2, 2021
@lithomas1 lithomas1 added Dtype Conversions Unexpected or buggy dtype conversions Error Reporting Incorrect or improved errors from pandas labels Feb 25, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022