API: Warn about dups in names for read_csv #17346

gfyoung · 2017-08-26T12:53:58Z

Title is self-explanatory.

codecov · 2017-08-26T13:32:07Z

Codecov Report

Merging #17346 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17346      +/-   ##
==========================================
- Coverage   91.26%   91.24%   -0.02%     
==========================================
  Files         163      163              
  Lines       49776    49783       +7     
==========================================
- Hits        45426    45424       -2     
- Misses       4350     4359       +9

Flag	Coverage Δ
#multiple	`89.04% <100%> (ø)`	⬆️
#single	`40.29% <57.14%> (-0.06%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/parsers.py	`95.48% <100%> (+0.02%)`	⬆️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/frame.py	`97.77% <0%> (-0.1%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d43aba8...9fcffa7. Read the comment docs.

jreback · 2017-08-29T13:04:29Z

pandas/io/parsers.py

+    counts = {}
+    warn_dups = False
+
+    for name in names:


just use set intersection here

How so? This is a fail-early method, which is why I chose it.

simply check for len(names) != len(set(names)). much more idiomatic

Fair enough. Done.
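
For reference, the two approaches discussed above can be sketched side by side. This is a minimal illustration rather than the code from the PR; the function names `has_dups_loop` and `has_dups_set` are hypothetical, and both assume the names are hashable.

```python
def has_dups_loop(names):
    """Fail-early check: stop at the first repeated name."""
    seen = set()
    for name in names:
        if name in seen:
            return True
        seen.add(name)
    return False


def has_dups_set(names):
    """Idiomatic check: a set drops repeats, so the lengths differ."""
    return len(names) != len(set(names))


names = ["a", "b", "a"]
print(has_dups_loop(names), has_dups_set(names))  # True True
```

The loop can return before scanning the whole list, while the set comparison always builds the full set; for the short lists typically passed as column names the difference is unlikely to matter.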

jreback · 2017-08-29T13:05:39Z

pandas/io/parsers.py

+        counts[name] = True
+
+    if warn_dups:
+        msg = ("Duplicate names specified. This "


so are we deprecating this? then this should be a FutureWarning.

Fair enough. Done.

jreback · 2017-08-30T10:09:37Z

pandas/io/parsers.py

@@ -406,6 +438,10 @@ def _read(filepath_or_buffer, kwds):
    chunksize = _validate_integer('chunksize', kwds.get('chunksize', None), 1)
    nrows = _validate_integer('nrows', kwds.get('nrows', None))

+    # Check for duplicates in names.
+    names = kwds.get("names", None)
+    _check_dup_names(names)


call this _validate_names and have it return names, so it's a similar pattern to the other validators

gfyoung · 2017-08-31T00:32:54Z

@jreback : All comments addressed, and tests are green. PTAL

gfyoung · 2017-09-01T15:05:02Z

@jreback @jorisvandenbossche : ping

jreback

typo in whatsnew, otherwise lgtm. make sure this is on the deprecation list as well

jreback · 2017-09-01T16:41:54Z

doc/source/whatsnew/v0.21.0.txt

@@ -283,6 +283,7 @@ Other API Changes
 - The Categorical constructor no longer accepts a scalar for the ``categories`` keyword. (:issue:`16022`)
 - Accessing a non-existent attribute on a closed :class:`~pandas.HDFStore` will now
  raise an ``AttributeError`` rather than a ``ClosedFileError`` (:issue:`16301`)
+- :func:`read_csv` now issues a ``UserWarning`` if the ``names`` parameter contains duplicates (:issue:`17095`)


should be FutureWarning

Doh! My bad for not catching that. Fixed.

Changing back to UserWarning in light of later discussion.

gfyoung · 2017-09-01T16:51:47Z

Fixed typo and added to deprecation list. Will merge on green then unless told otherwise.

jreback · 2017-09-01T17:00:32Z

@gfyoung wait for @jorisvandenbossche to comment (as I'm not sure if he commented here). IIRC he made a comment that having duplicate names is ok.

gfyoung · 2017-09-01T17:08:59Z

Sure thing. FWIW, @jorisvandenbossche agreed with your suggestion, see his comment here

@jorisvandenbossche : Any comments on this PR?

gfyoung · 2017-09-05T15:05:17Z

@jorisvandenbossche : ping if there are any additional comments

gfyoung · 2017-09-08T16:18:51Z

@jreback : It's been a week, and I haven't heard anything from @jorisvandenbossche . Still wait, or can we merge this PR?

TomAugspurger · 2017-09-08T16:34:32Z

I think Joris is off on holiday. I believe he's back next week.

gfyoung · 2017-09-08T16:35:32Z

@TomAugspurger : Ah! I had a feeling that that was the case (I remember seeing an email about that). I'll wait then until he gets back.

gfyoung · 2017-09-13T09:40:17Z

@jorisvandenbossche : friendly ping

jorisvandenbossche · 2017-09-13T09:47:50Z

Can you remind me of the rationale for deprecating this?
Is it because we cannot actually handle it well?

Because I actually had a use case where this proved useful: I had a non-informative column every other column, gave them all the same name in names, and then dropped that single name. But this specific case can of course easily be solved differently, by giving names like 'dummy1', 'dummy2', ... and then removing all columns that start with 'dummy'.
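
The workaround described in this comment could be sketched as follows. The helper `with_placeholders` is hypothetical and only builds the list of names; the pandas call is mentioned in a comment rather than executed.

```python
def with_placeholders(real_names, prefix="dummy"):
    """Interleave each real name with a uniquely numbered placeholder."""
    names = []
    for i, name in enumerate(real_names, start=1):
        names.append(name)
        names.append("{}{}".format(prefix, i))
    return names


names = with_placeholders(["x", "y"])
keep = [n for n in names if not n.startswith("dummy")]
print(names)  # ['x', 'dummy1', 'y', 'dummy2']
print(keep)   # ['x', 'y']

# With pandas, `names` could be passed as read_csv(..., names=names) and the
# placeholder columns dropped afterwards, or skipped up front via usecols=keep.
```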

gfyoung · 2017-09-13T09:55:14Z

@jorisvandenbossche : Here is what you said back in July here.

Essentially, we are deprecating this behavior because names is a user-specified parameter, and passing in duplicate names deliberately only encourages buggy behavior.

gfyoung · 2017-09-18T08:12:14Z

@jorisvandenbossche : Any further thoughts on this?

jorisvandenbossche · 2017-09-18T13:33:21Z

Sorry for the slow response.

So maybe a more general question: is it our intention to fix mangle_dupe_cols=True at some point? (as currently only the default False actually works). If we do, I see no good reason to disallow duplicates in passed names vs names in the csv file.

gfyoung · 2017-09-18T15:49:57Z

@jorisvandenbossche : No worries! I think you meant the other way around. mangle_dupe_cols=True is fully supported at this point. It is mangle_dupe_cols=False that we disabled.

The reason for us discouraging duplicates in names is because duplicates are generally more error-prone. Contrary to duplicates in a file, names is a deliberate choice.

@jreback : Thoughts?

jorisvandenbossche · 2017-09-18T15:55:52Z

I think you meant the other way around

yes, of course ... :-)

Contrary to duplicates in a file, names is a deliberate choice.

That's true. But there are many other ways to deliberately make a dataframe with duplicate columns which we don't disallow anyway.

To be clear, in general I am all for a restricted scope of capabilities/possibilities. But in this case, limiting the abilities of names does not actually reduce code complexity, but increases it: the duplicates are already perfectly handled by the code, so we are introducing a special case. Therefore I was wondering whether this is actually needed (the user will see that the names are mangled anyhow).
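
To make "mangled" concrete, here is a small stand-alone sketch of the kind of renaming being discussed. It assumes the `a`, `a.1`, `a.2` suffix pattern and is not the parser's actual code path; `mangle_dupes` is a hypothetical name.

```python
def mangle_dupes(names):
    """Append '.1', '.2', ... to repeated names, keeping the first as-is."""
    counts = {}
    result = []
    for name in names:
        seen = counts.get(name, 0)
        result.append(name if seen == 0 else "{}.{}".format(name, seen))
        counts[name] = seen + 1
    return result


print(mangle_dupes(["a", "b", "a", "a"]))  # ['a', 'b', 'a.1', 'a.2']
```

This simple version does not guard against a generated name such as `a.1` colliding with a name the user already supplied.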

gfyoung · 2017-09-18T16:03:49Z

@jorisvandenbossche : True that we'll see them mangled anyhow, but why the need to add complexity to just handle them in the first place? I added the handling for duplicate names in an earlier PR because it was a bug, not to enhance support.

If the user really wants to have duplicate names, they can set them themselves after reading in the file, but I don't know if we want to actively encourage setting duplicate names on a read-in DataFrame.

jorisvandenbossche · 2017-09-18T16:12:02Z

Ah, I assumed that the mangling of names and of the names from the header of the file was done using the same code path? That's not the case?
(to state it another way: once we remove the deprecation in this PR, can we actually remove more code than is added in this PR?)

gfyoung · 2017-09-18T16:25:25Z

See #17095 : it's not a ridiculous amount of new logic that I added, but new logic nonetheless 😄 Also, see @jreback comment in that PR here

gfyoung · 2017-09-21T18:13:21Z

@jorisvandenbossche : Any updates on this?

jreback · 2017-09-21T20:06:21Z

so the only reason we have mangle_dupe_cols is to support duplicates in the first place. I think I stated we should deprecate that argument entirely, then I would allow duplicates in names but show a UserWarning if names contains duplicates.

So pretty much allow what is happening today but with a UserWarning and reducing the path complexity a bit (removing mangle_dupe_cols).

It's not an error to have duplicates in names, but I guess we can't disallow it entirely.

gfyoung · 2017-09-21T20:24:04Z

@jreback : I don't recall you saying this before. In addition, I think there has been user interest in not mangling in cases when the CSV file itself contains dupe names. That being said, if we think making the warning less harsh is a good idea, I can do that.

gfyoung · 2017-09-23T18:10:54Z

@jreback : I made it issue a UserWarning instead. PTAL.
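
With a `UserWarning`, callers can still choose how strictly to treat duplicates through the standard `warnings` module. The sketch below uses a hypothetical stand-in, `read_with_names`, instead of the real reader, and a placeholder message.

```python
import warnings


def read_with_names(names):
    """Stand-in for a reader that warns on duplicate names (hypothetical)."""
    if len(names) != len(set(names)):
        warnings.warn("Duplicate names specified.", UserWarning)
    return names


# Record the warning instead of letting it print to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    read_with_names(["a", "a"])
print([w.category.__name__ for w in caught])  # ['UserWarning']

# Escalate to an exception for callers who want duplicates rejected.
with warnings.catch_warnings():
    warnings.simplefilter("error", UserWarning)
    try:
        read_with_names(["a", "a"])
    except UserWarning as exc:
        print("rejected:", exc)
```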

jreback · 2017-09-23T18:36:53Z

lgtm. you might want to note in the doc-string the same warning.

gfyoung · 2017-09-23T18:37:35Z

lgtm. you might want to note in the doc-string the same warning.

Sounds good. I'll quickly add that.

gfyoung · 2017-09-23T21:26:55Z

@jreback : All is green. PTAL.

jreback

lgtm modulo small comment

jreback · 2017-09-24T00:35:09Z

pandas/io/parsers.py

+    Check if the `names` parameter contains duplicates.
+
+    Currently, this function issues a warning if that is the case. In the
+    future, we will raise an error.


doc string needs updating

Good catch. Fixed.

xref pandas-devgh-17095.

gfyoung · 2017-09-24T04:35:13Z

@jreback : All is green. PTAL.

jreback · 2017-09-24T13:13:50Z

thanks @gfyoung I think fine for now, we can always revisit if needed.

xref pandas-devgh-17095.

gfyoung added the API Design, Error Reporting (Incorrect or improved errors from pandas), and IO CSV (read_csv, to_csv) labels Aug 26, 2017

gfyoung added this to the 0.21.0 milestone Aug 26, 2017

jreback requested changes Aug 29, 2017


gfyoung force-pushed the dup-names-warn branch 2 times, most recently from 1497183 to b1a7a4a Compare August 30, 2017 08:00

jreback requested changes Aug 30, 2017


gfyoung force-pushed the dup-names-warn branch 2 times, most recently from d75e1fb to 869e363 Compare August 30, 2017 15:14

jreback approved these changes Sep 1, 2017


jsexauer mentioned this pull request Sep 1, 2017

DEPR: Clean up list of deprecations from prior versions #6581

Closed

1 task

gfyoung force-pushed the dup-names-warn branch 2 times, most recently from 6378850 to 2ada940 Compare September 23, 2017 09:24

gfyoung force-pushed the dup-names-warn branch from 2ada940 to 3446a5f Compare September 23, 2017 18:39

jreback approved these changes Sep 24, 2017


API: Warn about dups in names for read_csv

9fcffa7

xref pandas-devgh-17095.

gfyoung force-pushed the dup-names-warn branch from 3446a5f to 9fcffa7 Compare September 24, 2017 00:43

jreback merged commit 1f51271 into pandas-dev:master Sep 24, 2017

gfyoung deleted the dup-names-warn branch September 25, 2017 00:59

alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017

API: Warn about dups in names for read_csv (pandas-dev#17346)

dcaf488

xref pandas-devgh-17095.

No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017

API: Warn about dups in names for read_csv (pandas-dev#17346)

4a48ad9

xref pandas-devgh-17095.

jreback mentioned this pull request Nov 21, 2019

DEPR: deprecations log for removed issues #13777

Closed
