ENH: read_csv: Warn when dtype or converters contain keys that do not exist in the data #39467
Comments
Not sure what others think, but I don't think this is desirable. Sometimes people read a variety of files with related formats and use the same set of dtypes/converters for all of them. E.g., all my columns called "time" might want the same dtype, all the columns called "price" a different dtype, etc.; when reading a file that doesn't contain the time column (say, with just daily data and no intraday data) I can still use the same set of dtypes/converters. This proposal would warn against that practice for unclear reasons.
I can see arguments from both sides. Without any statistics, I'm going to assume that the majority of the time one imports a single set of data that is predominantly known. In that case, a dtype warning is preferable for the user, as per @malinkallen. However, for the more mechanical process of processing multiple files, an option to suppress these warnings may be favourable. @Liam3851, pandas has recently changed the ability of
I don't think the analogy to that recent change holds. If I specify extra dtypes in my `dtype` kwarg dict, it makes absolutely no difference to the output: it doesn't add an additional empty column. If it did, I would have a different opinion.

I also disagree with the assumption that the most common case is reading a single CSV that is unrelated to any other CSVs. One of pandas' greatest strengths is merging and munging multiple datasets (especially datasets at different frequencies). When doing so, I think it is in fact good design practice for people working with interrelated datasets to use a single dict of dtypes, containing all the columns they're going to see, for all of them.

The more common failure case I've seen, rather than a typo in a column name in the dtypes, is that a dtype that needs specification remains unspecified when merging DataFrames from multiple CSVs. This can lead to subtle, hard-to-find bugs when the columns being merged on end up having different dtypes. This proposal would, by default, warn people that they really should be creating a separate dict of dtypes for every CSV file they read, which seems to me to actually be an antipattern for the merge/munge use case.
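To illustrate the failure mode described above, here is a small sketch (the file contents and column names are made up for illustration; in-memory CSVs stand in for files on disk):

```python
import io
import pandas as pd

# Two CSVs sharing an "id" column. Without an explicit dtype, pandas
# infers int64 for the first file, silently dropping the zero-padding.
daily = pd.read_csv(io.StringIO("id,price\n001,10.5\n002,11.0\n"))
intraday = pd.read_csv(io.StringIO("id,time,price\n001,09:30,10.4\n"),
                       dtype={"id": str, "time": str})

# The merge keys now have different dtypes (int64 vs. object), so a
# later merge on "id" will misbehave or raise instead of matching rows.
print(daily["id"].dtype, intraday["id"].dtype)  # int64 object
```

Sharing one dtype dict across both reads would have kept `"id"` as a string in both frames.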
You raise some good points; there are clearly arguments for both. If warnings were implemented behind an option, there are two possible defaults.

If 'warn by default', then any dtype/converter keys not found in the CSV will trigger a warning. In your workflow these warnings would appear at first, but could be easily suppressed (particularly if guidance were included in the warning message), satisfying your preference. They would also satisfy @malinkallen's use case by alerting users to issues that would otherwise remain unseen.

If 'no warning by default', then your use case is immediately satisfied, but the benefit is lost, since the potential errors in scope are not detected. It is also unlikely that a casual user would become aware of the option and understand enough to activate it; it might become so little used as to preclude its own existence.

So, IMO, if a warning were to be implemented, I would prefer it to be on by default.
What we actually need is a per-option control, so you can error / warn / ignore, but this would require some effort. Something like `Option(converters=..., errors='raise')` or whatever. It could be backwards compatible and not super complicated.
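A minimal sketch of what such a control could look like, implemented here as a user-side wrapper rather than inside pandas (the `errors` parameter and the `read_csv_checked` name are hypothetical, not pandas API):

```python
import warnings
import pandas as pd


def read_csv_checked(path, dtype=None, converters=None, errors="warn", **kwargs):
    """Hypothetical wrapper: flag dtype/converters keys absent from the data.

    errors: 'raise', 'warn', or 'ignore' (the default pandas behavior).
    """
    df = pd.read_csv(path, dtype=dtype, converters=converters, **kwargs)
    for name, mapping in (("dtype", dtype), ("converters", converters)):
        if isinstance(mapping, dict):
            missing = set(mapping) - set(df.columns)
            if missing and errors == "raise":
                raise ValueError(f"{name} keys not in data: {sorted(missing)}")
            if missing and errors == "warn":
                warnings.warn(f"{name} keys not in data: {sorted(missing)}")
    return df
```

A real implementation inside the parser could check the header before parsing; this sketch checks after the fact, which is enough to show the three-way `errors` behavior.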
Is your feature request related to a problem?
As part of a research project, two co-authors and I have been mining GitHub for Jupyter notebooks and studying a number of frequent function calls in the notebooks. It appears that users sometimes call `read_csv`, specifying `dtype` as a dict containing keys that are not present in the data set. When this happens, these keys are silently ignored. To take a simple example, one may write `df = pd.read_csv("my_data.csv", dtype={"a": float})` although there is no column named "a" in "my_data.csv". The key-value pair `"a"`-`float` is silently ignored.
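The silent behavior is easy to reproduce (using an in-memory CSV in place of "my_data.csv"):

```python
import io
import pandas as pd

csv = io.StringIO("b,c\n1,2\n3,4\n")  # note: no column "a"
df = pd.read_csv(csv, dtype={"a": float})  # "a" is silently ignored

print(df.dtypes)          # b and c stay int64; no warning, no extra column
print("a" in df.columns)  # False

# The same happens when the column exists but is filtered out via usecols:
df2 = pd.read_csv(io.StringIO("a,b\n1,2\n"), usecols=["b"], dtype={"a": float})
print(df2.columns.tolist())  # ['b']
```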
This means that a user who misspells a column header, or tries to change the data type of a column that is filtered out (e.g. using `usecols`), or of a non-existing column that the user thinks exists, will not become aware of this, and the data will not be converted as expected. If, instead, `read_csv` issued a warning, the user could easily detect the mistake and fix it immediately. As a bonus, such a warning could also help remedy other bugs. For example, if a user tries to change the data type of the values in a column that is filtered out, the user most likely thinks that he/she is filtering out different columns than he/she actually is. A warning from `read_csv` would indicate that something is wrong, and the user could find a bug that might otherwise go undetected or take time to debug.

We have noticed the same behavior when dicts passed as `converters` contain keys that are missing in the data set. For the same reasons as above, I think issuing a warning in these cases would be helpful for users.

Describe the solution you'd like
`read_csv` should issue a warning when dicts passed as values for `dtype` or `converters` contain keys that are not present in the data set.

API breaking implications
If only a warning is issued, I don't see how this could break anything.
Describe alternatives you've considered
An alternative could be to issue an error instead of a warning, but that would risk breaking existing code.