Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Warn about dups in names for read_csv #17346

Merged
merged 1 commit into from Sep 24, 2017

Conversation

@gfyoung
Copy link
Member

commented Aug 26, 2017

Title is self-explanatory.

xref #17095.

@gfyoung gfyoung added this to the 0.21.0 milestone Aug 26, 2017

@codecov

This comment has been minimized.

Copy link

commented Aug 26, 2017

Codecov Report

Merging #17346 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17346      +/-   ##
==========================================
- Coverage   91.26%   91.24%   -0.02%     
==========================================
  Files         163      163              
  Lines       49776    49783       +7     
==========================================
- Hits        45426    45424       -2     
- Misses       4350     4359       +9
Flag Coverage Δ
#multiple 89.04% <100%> (ø) ⬆️
#single 40.29% <57.14%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.48% <100%> (+0.02%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.77% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d43aba8...9fcffa7. Read the comment docs.

counts = {}
warn_dups = False

for name in names:

This comment has been minimized.

Copy link
@jreback

jreback Aug 29, 2017

Contributor

just use set intersection here

This comment has been minimized.

Copy link
@gfyoung

gfyoung Aug 29, 2017

Author Member

How so? This is a fail-early method, which is why I chose it.

This comment has been minimized.

Copy link
@jreback

jreback Aug 30, 2017

Contributor

simply check for len(names) !+ len(set(names)). much more idiomatic

This comment has been minimized.

Copy link
@gfyoung

gfyoung Aug 30, 2017

Author Member

Fair enough. Done.

counts[name] = True

if warn_dups:
msg = ("Duplicate names specified. This "

This comment has been minimized.

Copy link
@jreback

jreback Aug 29, 2017

Contributor

so are we deprecating this? then this should be a FutureWarning.

This comment has been minimized.

Copy link
@gfyoung

gfyoung Aug 29, 2017

Author Member

Fair enough. Done.

@gfyoung gfyoung force-pushed the forking-repos:dup-names-warn branch 2 times, most recently from 1497183 to b1a7a4a Aug 29, 2017

@@ -406,6 +438,10 @@ def _read(filepath_or_buffer, kwds):
chunksize = _validate_integer('chunksize', kwds.get('chunksize', None), 1)
nrows = _validate_integer('nrows', kwds.get('nrows', None))

# Check for duplicates in names.
names = kwds.get("names", None)
_check_dup_names(names)

This comment has been minimized.

Copy link
@jreback

jreback Aug 30, 2017

Contributor

call this _validate_names and have it return names, so its a similar patter to the other validators

This comment has been minimized.

Copy link
@gfyoung

gfyoung Aug 30, 2017

Author Member

Done.

@gfyoung gfyoung force-pushed the forking-repos:dup-names-warn branch 2 times, most recently from d75e1fb to 869e363 Aug 30, 2017

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Aug 31, 2017

@jreback : All comments addressed, and tests are green. PTAL

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 1, 2017

@jreback

jreback approved these changes Sep 1, 2017

Copy link
Contributor

left a comment

typo in whatsnew, otherwise lgtm. make sure this is on the deprecation list as well

@@ -283,6 +283,7 @@ Other API Changes
- The Categorical constructor no longer accepts a scalar for the ``categories`` keyword. (:issue:`16022`)
- Accessing a non-existent attribute on a closed :class:`~pandas.HDFStore` will now
raise an ``AttributeError`` rather than a ``ClosedFileError`` (:issue:`16301`)
- :func:`read_csv` now issues a ``UserWarning`` if the ``names`` parameter contains duplicates (:issue:`17095`)

This comment has been minimized.

Copy link
@jreback

jreback Sep 1, 2017

Contributor

should be FutureWarning

This comment has been minimized.

Copy link
@gfyoung

gfyoung Sep 1, 2017

Author Member

Doh! My bad for not catching that. Fixed.

This comment has been minimized.

Copy link
@gfyoung

gfyoung Sep 23, 2017

Author Member

Changing back to UserWarning in light of later discussion.

@jsexauer jsexauer referenced this pull request Sep 1, 2017

Open

DEPR: deprecations from prior versions #6581

0 of 89 tasks complete
@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 1, 2017

Fixed typo and added to deprecation list. Will merge on green then unless told otherwise.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Sep 1, 2017

@gfyoung wait for @jorisvandenbossche comment (as not sure if he commented here). IIRC a comment he made that having duplicate names is ok .

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 1, 2017

Sure thing. FWIW, @jorisvandenbossche agreed with your suggestion, see his comment here

@jorisvandenbossche : Any comments on this PR?

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 5, 2017

@jorisvandenbossche : ping if there any additional comments

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 8, 2017

@jreback : It's been a week, and I haven't heard anything from @jorisvandenbossche . Still wait, or can we merge this PR?

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Sep 8, 2017

I think Joris is off on holiday. I believe he's back next week.

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 8, 2017

@TomAugspurger : Ah! I had a feeling that that was the case (I remember seeing an email about that). I'll wait then until he gets back.

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 13, 2017

@jorisvandenbossche : friendly ping

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Sep 13, 2017

Can you remember me the rationale for deprecating this?
Is it because we cannot actually handle it well?

Because I actually previously had a usecase where this proved useful (I had a non-informative column every other column, gave it the same name in names and then dropped the single name. But this specific case can of course easily be solved differently, by giving names like 'dummy1', 'dummy2', .. and then removing all columns that start with 'dummy')

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 13, 2017

@jorisvandenbossche : Here is what you said back in July here.

Essentially, we are deprecating this behavior because names is a user-specified parameter, and passing in duplicate names deliberately only encourages buggy behavior.

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 18, 2017

@jorisvandenbossche : Any further thoughts on this?

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Sep 18, 2017

Sorry for the slow response.

So maybe a more general question: is it our intention to once fix mangle_dupe_cols=True ? (as currently actually only the default False works). If we do, I see no good reason to disallow duplicates in passed names vs names in the csv file.

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 18, 2017

@jorisvandenbossche : No worries! I think you meant the other way around. mangle_dupe_cols=True is fully supported at this point. It is mangle_dupe_cols=False that we disabled.

The reason for us discouraging duplicates in names is because duplicates are generally more error-prone. Contrary to duplicates in a file, names is a deliberate choice.

@jreback : Thoughts?

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Sep 18, 2017

I think you meant the other way around

yes, of course ... :-)

Contrary to duplicates in a file, names is a deliberate choice.

That's true. But there are many other ways to deliberately make a dataframe with duplicate columns which we don't disallow anyway.

To be clear, in general I am all for a restricted scope of capabilities/possibilties. But in this case, limiting the abilities of name does not actually reduce code complexity, but increases it. As the duplicates are already perfectly handled by the code, so we are introducing a special case. Therefore I was wondering whether this is actually needed (user will see that the names are mangled anyhow).

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 18, 2017

@jorisvandenbossche : True that we'll see them mangled anyhow, but why the need to add complexity to just handle them in the first place? I added the handling for duplicate names in an earlier PR because it was a bug, not to enhance support.

If the user really wants to have duplicate names, they can set it themselves and reading in the file, but I don't know if we want to actively encourage setting duplicate names to a read-in DataFrame.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Sep 18, 2017

Ah, I assumed that the mangling of names or the names from the header of the file was done using the same code path? That's not the case ?
(to state it another way: once we remove the deprecation in this PR, we can actually remove more code than is added in this PR?)

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 18, 2017

See #17095 : it's a not a ridiculous amount of new logic that I added, but new logic nonetheless 😄 Also, see @jreback comment in that PR here

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 21, 2017

@jorisvandenbossche : Any updates on this?

@jreback

This comment has been minimized.

Copy link
Contributor

commented Sep 21, 2017

so the only reason we have mangle_dupe_columns is to support duplicates in the first place. I think I stated we should deprecate that argument entirely, then I would allow duplicates in names but show a UserWarning if names contains duplicates.

So pretty much allow what is happening today but with a UserWarning and reducing the path complexity a bit (removing mangle_dupe_columes).

Its not an error to have duplicates in names but I guess can't disallow it entirely.

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 21, 2017

@jreback : I don't recall you saying this before. In addition, I think there has been user interest and not mangling in cases when the CSV file itself contains dupe names. That being said, if we think making the warning less harsh is a good idea, I can do that.

@gfyoung gfyoung force-pushed the forking-repos:dup-names-warn branch 2 times, most recently from 6378850 to 2ada940 Sep 23, 2017

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 23, 2017

@jreback : I made it issue a UserWarning instead. PTAL.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Sep 23, 2017

lgtm. you might want to note in the doc-string the same warning.

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 23, 2017

lgtm. you might want to note in the doc-string the same warning.

Sounds good. I'll quickly add that.

@gfyoung gfyoung force-pushed the forking-repos:dup-names-warn branch from 2ada940 to 3446a5f Sep 23, 2017

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 23, 2017

@jreback : All is green. PTAL.

@jreback
Copy link
Contributor

left a comment

lgtm modulo small comment

Check if the `names` parameter contains duplicates.
Currently, this function issues a warning if that is the case. In the
future, we will raise an error.

This comment has been minimized.

Copy link
@jreback

jreback Sep 24, 2017

Contributor

doc string needs updating

This comment has been minimized.

Copy link
@gfyoung

gfyoung Sep 24, 2017

Author Member

Good catch. Fixed.

@gfyoung gfyoung force-pushed the forking-repos:dup-names-warn branch from 3446a5f to 9fcffa7 Sep 24, 2017

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Sep 24, 2017

@jreback : All is green. PTAL.

@jreback jreback merged commit 1f51271 into pandas-dev:master Sep 24, 2017

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented Sep 24, 2017

thanks @gfyoung I think fine for now, we can always revisit if needed.

@gfyoung gfyoung deleted the forking-repos:dup-names-warn branch Sep 25, 2017

alanbato added a commit to alanbato/pandas that referenced this pull request Nov 10, 2017

No-Stream added a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.