Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add set_index to Series #22225

Closed
wants to merge 18 commits into from

Conversation

h-vetinari
Copy link
Contributor

@h-vetinari h-vetinari commented Aug 6, 2018

  • closes ENH: set_index for Series #21684
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • examples for series.set_index
  • whatsnew entry

Following #21684 (comment), I unified the method to core/generic.py. If this is approved in principle, I'll write some tests (both for DF/Series) and the whatsnew. On that note, it seems to me that df.set_index - particularly its kwargs - are basically untested. The only tests I found are:

  • pandas/tests/frame/test_indexing.TestDataFrameIndexingDatetimeWithTZ.test_set_reset
  • pandas/tests/frame/test_indexing.TestDataFrameIndexingUInt64.test_set_reset

both of which don't test any of the kwargs.

@jreback
Copy link
Contributor

jreback commented Aug 6, 2018

look in alter_axes

there are a ton of tests

@codecov
Copy link

codecov bot commented Aug 7, 2018

Codecov Report

Merging #22225 into master will decrease coverage by <.01%.
The diff coverage is 97.14%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22225      +/-   ##
==========================================
- Coverage   92.38%   92.37%   -0.01%     
==========================================
  Files         166      166              
  Lines       52395    52398       +3     
==========================================
+ Hits        48403    48404       +1     
- Misses       3992     3994       +2
Flag Coverage Δ
#multiple 90.8% <97.14%> (-0.01%) ⬇️
#single 43.03% <60%> (+0.02%) ⬆️
Impacted Files Coverage Δ
pandas/core/frame.py 96.85% <100%> (-0.08%) ⬇️
pandas/core/generic.py 96.69% <100%> (+0.07%) ⬆️
pandas/core/series.py 93.73% <90%> (+0.17%) ⬆️
pandas/core/internals/construction.py 95.93% <0%> (-0.75%) ⬇️
pandas/core/arrays/datetimes.py 97.68% <0%> (-0.34%) ⬇️
pandas/core/arrays/timedeltas.py 88.09% <0%> (-0.1%) ⬇️
pandas/util/testing.py 88% <0%> (-0.1%) ⬇️
pandas/io/formats/html.py 99.34% <0%> (-0.03%) ⬇️
pandas/core/indexes/datetimelike.py 98.52% <0%> (-0.01%) ⬇️
pandas/core/arrays/period.py 98.52% <0%> (-0.01%) ⬇️
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cb31b2b...7f35d2c. Read the comment docs.

@gfyoung gfyoung added Enhancement Indexing Related to indexing on series/frames, not to indexes themselves labels Aug 7, 2018
@h-vetinari
Copy link
Contributor Author

@jreback I cleaned up the tests for df.set_index before I moved to the ones for series.set_index. Since I'm guessing that this would be preferred as separate PRs, here's just the cleanup for df, without any changes to functionality (except some better warnings): #22236

@pep8speaks
Copy link

pep8speaks commented Sep 18, 2018

Hello @h-vetinari! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 06, 2019 at 14:40 Hours UTC

@h-vetinari h-vetinari force-pushed the series_set_index branch 4 times, most recently from 7355d16 to aadf50b Compare September 18, 2018 09:00
@h-vetinari h-vetinari changed the title ENH: Add set_index to Series (WIP) ENH: Add set_index to Series Sep 18, 2018
@h-vetinari h-vetinari force-pushed the series_set_index branch 6 times, most recently from eeffe68 to 3edbaea Compare September 18, 2018 11:04
@h-vetinari
Copy link
Contributor Author

Rebased this on #22236 #22526 and split of the changes from #22486 into a separate commit.

If you want to review this without #22486, just follow this link: https://github.com/pandas-dev/pandas/pull/22225/commits/3edbaeae7731885bf418c5ac50d0a9bcf82c0ed6

@h-vetinari
Copy link
Contributor Author

@gfyoung @jreback @WillAyd @TomAugspurger
Could someone please retrigger the failed travis job? Crazy waiting times on Circle/Appveyor currently.

@h-vetinari h-vetinari force-pushed the series_set_index branch 2 times, most recently from 8ab863b to 949e699 Compare October 19, 2018 14:44
@h-vetinari
Copy link
Contributor Author

h-vetinari commented Oct 19, 2018

Now that all the preps (#22236 #22526 #22486) are out of the way, can finally tackle this seemingly simple PR - adding set_index to Series. I moved the code from frame.py to generic.py, as mentioned in the issue by @gfyoung.

It's also green (Circle is behind but already passed before the last commit for linting)

3 2013 7 84
4 2014 10 31
"""
# parameter keys is checked in Series.set_index / DataFrame.set_index!
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not duplicating the - different - checks from Series.set_index and DataFrame.set_index here - is there any chance that someone calls NDFrame.set_index?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no

@h-vetinari
Copy link
Contributor Author

@jreback @gfyoung
ping :)

Copy link
Member

@gfyoung gfyoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

cc @jreback

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks generally ok, need to fix the duplicated doc-strings

@@ -3980,12 +3991,7 @@ def set_index(self, keys, drop=True, append=False, inplace=False,
2 2014 4 40
3 2013 7 84
4 2014 10 31

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the doc-test run for this doc-string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes


if not inplace:
return frame
vi = verify_integrity
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use abbreviations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was just for the line length, but ok, re-broke the line differently

pandas/core/generic.py Outdated Show resolved Hide resolved
3 2013 7 84
4 2014 10 31
"""
# parameter keys is checked in Series.set_index / DataFrame.set_index!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no

def set_index(self, arrays, append=False, inplace=False,
verify_integrity=False):
"""
Set the Series index (row labels) using one or more columns.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about the doc-string

@@ -1075,6 +1075,87 @@ def _set_value(self, label, value, takeable=False):
return self
_set_value.__doc__ = set_value.__doc__

def set_index(self, arrays, append=False, inplace=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't change the argument name, it is called keys currently, leave it that way

Copy link
Contributor Author

@h-vetinari h-vetinari Oct 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The argument currently does not exist - it makes sense for DataFrame to have it named keys, since the main application (arguably) is setting the index to a given column key, but it would be very confusing for Series.

The signatures differ already anyway (no drop for Series), so I'm strongly -1 on naming this parameter keys.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that doesn't make any sense, you are conflating 2 different things. change then name to match.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function signatures of subclasses should match the parent. We should also accept drop in the Series version, even if it's not used..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That, to me, would be an argument to not share the implementation in generic.py. A parameter named keys makes zero sense for series (and neither does drop for that matter - or rather it would be very confusing).

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Jan 4, 2019

@TomAugspurger @jorisvandenbossche
This would be ready (modulo a new docstring warning). Any comment before the cut-off?

@TomAugspurger
Copy link
Contributor

Apologies for the delay. I think I'm +1 to the idea of this. Taking a look at the diff now.

def set_index(self, arrays, append=False, inplace=False,
verify_integrity=False):

if not isinstance(arrays, list) or all(is_scalar(x) for x in arrays):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What case does the first condition handle here? I'm having trouble finding a case where Series.set_index([foo]) but Series.set_index(foo) doesn't (or wouldn't without the []).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger
The first condition makes sure the second one can execute (i.e. iteration). If it is not true and we end up wrapping it in a list, it will fail in the next step.

The point that is crucial though is that only lists are affected by this (and not all list-likes), because NDFrame.set_index would try to interpret a list of scalars as a list of keys.

In any case, I added some comments to clarify this behaviour.

@TomAugspurger
Copy link
Contributor

I can't tell if the doc build error from https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=6464 is real, or whether it's a bug in the linter. I can take a look later if you want.

@h-vetinari
Copy link
Contributor Author

@TomAugspurger
The docstring validation errors were real - some problems with the templating + dedent, plus some plain old oversights that were not yet being tested before. Should hopefully be passing now (locally it is).

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an explanatory comment.

def set_index(self, arrays, append=False, inplace=False,
verify_integrity=False):

if not isinstance(arrays, list) or all(is_scalar(x) for x in arrays):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger
The first condition makes sure the second one can execute (i.e. iteration). If it is not true and we end up wrapping it in a list, it will fail in the next step.

The point that is crucial though is that only lists are affected by this (and not all list-likes), because NDFrame.set_index would try to interpret a list of scalars as a list of keys.

In any case, I added some comments to clarify this behaviour.

@h-vetinari
Copy link
Contributor Author

@datapythonista
There's an error in the docstring validation here that I don't understand, even though Series.set_index and DataFrame.set_index pass validation for me locally (the two docstrings affected by this PR).

My only guess would be that NDFrame.set_index is being validated as well, and crashing on the templates. How would you recommend to solve/circumvent that?

As a side note, it would be awesome to see in the log exactly which docstring was causing an error...


See Also
--------
%(other_klass)s.set_index: Method adapted for %(other_klass)s.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your validation error is that there is no space before the colon on this line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd
If that would cause the error then the [Series|DataFrame].set_index docstrings should fail too, shouldn't they? Locally they pass, but let's hope you're right. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yea. So it does fail locally on NDFrame.set_index with the same error as shown in Azure logs

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see if this works.


See Also
--------
%(other_klass)s.set_index: Method adapted for %(other_klass)s.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd
If that would cause the error then the [Series|DataFrame].set_index docstrings should fail too, shouldn't they? Locally they pass, but let's hope you're right. :)

@datapythonista
Copy link
Member

@h-vetinari the error here is from numpydoc, not from our script. We use it to parse the parts of the docstring, and it's failing to parse it here. I guess there are different numpydoc versions in the CI and in your localhost, if it doesn't fail locally.

You can create an issue to capture numpydoc parser errors, and reraise the exception adding the failing docstring, that would be great. It's the first time I see it failing, that's why it's not implemented.

@WillAyd
Copy link
Member

WillAyd commented Jan 5, 2019

I think the error is because the docstring is actually created against NDFrame. AFACIT other items would add that docstring to _shared_docs which would then be referenced in the frame and series modules. You might just need to refactor that to get this to pass

@jreback
Copy link
Contributor

jreback commented Jan 6, 2019

@h-vetinari I know you keep updating this, but I am -1 on this for all of the reasons I have stated above. This completely changes the semantics of DataFrame.set_index by allowing thru ambiguous cases of arrays. Furthermore the semantics of Series are just plain weird, which only accept an array. This is going to lead to lots of confusion down the road. We have set_axis for exactly this purpose, now you are conflating them.

So unless you remove this, and I don't now if this is possible for Series, then this is a no-go.

@h-vetinari
Copy link
Contributor Author

@jreback

This completely changes the semantics of DataFrame.set_index by allowing thru ambiguous cases of arrays.

I know you're very busy, but I get the feeling you don't read what I've repeated time and time again in this thread: this PR changes absolutely nothing about the behaviour of DataFrame.set_index - as evidenced by the lack of changes in tests/frame/test_alter_axes.py.

You have also not commented on #24046, which I opened specifically to discuss your opposition to the actual / current / existing capabilities of DataFrame.set_index.

Furthermore the semantics of Series are just plain weird, which only accept an array

The only difference between Series.set_index and DataFrame.set_index in this PR is that the former does not allow column keys - which simply do not make sense for Series.

We have set_axis for exactly this purpose, now you are conflating them.

I have responded to this in detail above and in #24046. I'm not conflating them.

So unless you remove this, and I don't now if this is possible for Series, then this is a no-go.

I don't know what you're referring to by "this", can you please elaborate?

There are two core devs who have approved this PR (not counting @WillAyd's change request since it was only about docstring validation), and you're objecting on grounds that do not reflect the actual state of affairs - so IMO you're quite off-base in claiming this PR is so outlandish.

@jreback
Copy link
Contributor

jreback commented Jan 6, 2019

I know you're very busy, but I get the feeling you don't read what I've repeated time and time again in this thread: this PR changes absolutely nothing about the behaviour of DataFrame.set_index - as evidenced by the lack of changes in tests/frame/test_alter_axes.py.

I have never said I have a problem with what DataFrame.set_index does.

I will repeat once again. Series.set_index() has completely different semantics than DataFrame.set_index() which accepts keys (which are column names / levels). NOT an array of values. I am not sure are getting this. There for Series.set_index() is imply not possible the way you have done it here.

The fact that DataFrame.set_index() happens to accept in a particular case an array proves the point here, it IS ambiguous and has been tried several times before. It is NOT correct.

I have pointed you to .set_axis() which does exactly what you are proposing for Series.set_index(). The problem is that the name .set_index() takes keys and that is violated for Series, as are there is no such thing as keys.

@jreback jreback closed this Jan 6, 2019
@h-vetinari
Copy link
Contributor Author

I strongly object to how you're handling this - what does closing this PR achieve? How do you so easily overrule approving reviews (cc @gfyoung @TomAugspurger) and an ongoing discussion?

@jreback This completely changes the semantics of DataFrame.set_index ...

@h-vetinari this PR changes absolutely nothing about the behaviour of DataFrame.set_index

@jreback I have never said I have a problem with what DataFrame.set_index does.
[...] Series.set_index() has completely different semantics than DataFrame.set_index() which accepts keys (which are column names / levels). NOT an array of values.

You are simply wrong about that - it does easily and readily accept arrays of values (and this is a fundamental aspect of its purpose of setting the index).

@jreback
Copy link
Contributor

jreback commented Jan 6, 2019

@h-vetinari you are not listening. If you want to raise an issue and comment there, feel free.

@pandas-dev pandas-dev locked as resolved and limited conversation to collaborators Jan 6, 2019
@pandas-dev pandas-dev unlocked this conversation Jan 7, 2019
@h-vetinari
Copy link
Contributor Author

Quoting myself from this comment:

@h-vetinari: After not getting a response in #24697, I'd like to ask again how to proceed here. Deprecating lists within lists would have IMO been the most elegant solution, and was the only complaint against adding Series.set_index. Now that the decision has been taken to live with this ambiguity, can we please reopen #22225?

Tagging the participants of #24046 and #24697: @jreback @gfyoung @WillAyd @jorisvandenbossche @TomAugspurger @toobaz @jbrockmendel

@h-vetinari
Copy link
Contributor Author

@jreback @gfyoung @WillAyd @jorisvandenbossche @TomAugspurger
Please reopen this PR.

@ghost ghost mentioned this pull request Jul 31, 2019
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: set_index for Series
7 participants