Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR/API: disallow lists within list for set_index #24697

Closed
wants to merge 23 commits into from

Conversation

h-vetinari
Copy link
Contributor

@h-vetinari h-vetinari commented Jan 10, 2019

I wanted to add this before the RC gets cut for two reasons.
On the one hand, the docs have silently deprecated arrays in df.set_index compare master:

keys : label or list of label

Name or names of the columns that will be used as the index.

and 0.23.4:

keys : column label or list of column labels / arrays

This is IMO a breaking change. I'd wait for the outcome of the discussion of #24046, but I feel that could easily fall under the table before the RC, so I wanted to provide a worked-out implementation.

Equally importantly, #22486 added capabilities for lots of list-likes within a list to df.set_index, and has not seen a release yet. Therefore, deprecation would be much easier now, than after 0.24.0rc.

@jreback @TomAugspurger @jorisvandenbossche @toobaz @WillAyd

@codecov
Copy link

codecov bot commented Jan 10, 2019

Codecov Report

Merging #24697 into master will decrease coverage by 49.31%.
The diff coverage is 50%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #24697       +/-   ##
===========================================
- Coverage   92.38%   43.06%   -49.32%     
===========================================
  Files         166      166               
  Lines       52310    52309        -1     
===========================================
- Hits        48326    22527    -25799     
- Misses       3984    29782    +25798
Flag Coverage Δ
#multiple ?
#single 43.06% <50%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/frame.py 35.82% <50%> (-61.11%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.15%) ⬇️
... and 124 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d56db9d...ed0de1f. Read the comment docs.

@codecov
Copy link

codecov bot commented Jan 10, 2019

Codecov Report

Merging #24697 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24697      +/-   ##
==========================================
+ Coverage   91.75%   91.75%   +<.01%     
==========================================
  Files         173      173              
  Lines       52960    52966       +6     
==========================================
+ Hits        48595    48601       +6     
  Misses       4365     4365
Flag Coverage Δ
#multiple 90.33% <100%> (ø) ⬆️
#single 41.71% <50%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/frame.py 96.86% <100%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 42b4c97...0c65876. Read the comment docs.

@jorisvandenbossche
Copy link
Member

On the one hand, the docs have silently deprecated arrays in df.set_index compare master:

I don't think we silently deprecated anything. There was a DOC PR (#22775) updating the docstring, and I think it was rather an oversight of a contributor that didn't know that behaviour and which was not catched during review.

And that is a good catch! So I think the documentation change in this PR is certainly welcome, and the code changes require some discussion. Would you want to do the docstring changes in a separate PR that can be merged more quickly?

@jorisvandenbossche
Copy link
Member

@h-vetinari and to give you some expectations: I think most of us are rather swamped with trying to get the rc out, so I am not sure we will have time to finalize the set_index related discussions before the RC

@h-vetinari
Copy link
Contributor Author

@jorisvandenbossche

I don't think we silently deprecated anything. There was a DOC PR (#22775) updating the docstring, and I think it was rather an oversight of a contributor that didn't know that behaviour and which was not catched during review.

There was not accusation, I used "silent deprecation" to mean that this functionality is suddenly not mentioned anymore.

@h-vetinari and to give you some expectations: I think most of us are rather swamped with trying to get the rc out, so I am not sure we will have time to finalize the set_index related discussions before the RC

Which is fair enough. That's why I put up this PR (even though I don't really have the time), to let you judge the trade-off (e.g. currently adding more list-likes due to #22486, vs. the ambiguity we're talking about in #24046).

@toobaz
Copy link
Member

toobaz commented Jan 10, 2019

While I understand @jorisvandenbossche 's concern...

On the one hand, the docs have silently deprecated arrays in df.set_index compare master:

... I think on aligning the docs to the functionalities we can all agree, and

#22486 added capabilities for lots of list-likes within a list to df.set_index, and has not seen a release yet

I think this is an important argument even leaving aside #24046 , because I see (sorry for not noticing before) that tuples are among these "list-likes", which I think is just wrong. We discussed in several other issues the fact that in pandas we want to distinguish tuples - which are either keys of a MultiIndex, or pairs of keys for the two axes of a DataFrame - from lists - which are collections.

So I think this PR really deserves to get considered for the release. Will review in detail later today.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Jan 10, 2019

@toobaz: because I see (sorry for not noticing before) that tuples are among these "list-likes", which I think is just wrong.

I cautioned about the tuple case twice in this thread, but it was an explicit review request to use is_list_like and not instance-checks for specific classes (to the point that I had to implement #23065, because I strictly didn't want to allow sets, which would have been broken from the start).

In any case, I agree with you, hence this PR (as well as #24688).

@toobaz
Copy link
Member

toobaz commented Jan 10, 2019

it was an explicit review request to use is_list_like and not instance-checks for specific classes

I see... in a similar case I resolved to

32ee973#diff-1e79abbbdd150d4771b91ea60a4e1cc7R2701

... but I agree this is annoying, so see #24702 .

@h-vetinari
Copy link
Contributor Author

I saw @TomAugspurger mentioned cutting the RC soon. In the unlikely case that someone still wants to consider this on short notice, feel free to commit into this PR - I'm offline the next 6-7h.

- :meth:`DataFrame.set_index` now allows all one-dimensional list-likes, raises a ``TypeError`` for incorrect types,
has an improved ``KeyError`` message, and will not fail on duplicate column names with ``drop=True``. (:issue:`22484`)
- :meth:`DataFrame.set_index` now gives a better (and less frequent) KeyError, raises a ``ValueError`` for incorrect types,
and will not fail on duplicate column names with ``drop=True``. (:issue:`22484`)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reverting the added list-likes from #22486, what's left are the three points from #22484 that the former was meant to address.

@gfyoung gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves Deprecate Functionality to remove in pandas DataFrame DataFrame data structure labels Jan 11, 2019
@h-vetinari
Copy link
Contributor Author

Well done everyone on getting the RC ready and out the door!

Was wondering if this PR should still be considered for 0.24 nevertheless? Seems unlikely, but not impossible, considering e.g. what @jreback wrote in #24060:

ok the RC, yeah small things are ok. If we have big then maybe need RC2

@TomAugspurger @jorisvandenbossche @jreback @toobaz @WillAyd

@jreback
Copy link
Contributor

jreback commented Jan 14, 2019

@h-vetinari would not be averse to: 1) improving the doc-string / examples 2) raising better errors on existing behavior, but the impl here does too many things.

Basically the non-controversial things would be fine.

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Jan 14, 2019

@jreback: would not be averse to: 1) improving the doc-string / examples [...]

Split off #24762.

[...] 2) raising better errors on existing behavior, but the impl here does too many things.

The chances here are as advertised, and really minimal. The only things worth mentioning are:

-         if not isinstance(keys, list):
-             keys = [keys]
+         if (is_scalar(keys) or isinstance(keys, tuple)
+                 or isinstance(keys, (ABCIndexClass, ABCSeries, np.ndarray))):
+             # make sure we have a container of keys/arrays we can iterate over
+             # tuples can appear as valid column keys!
+             keys = [keys]
+         elif not isinstance(keys, list):
+             raise ValueError(err_msg)

which replaces indiscriminate list-wrapping with a very reasonable list of allowed types, and

+         depr_warn = False
          for col in keys:
              if (is_scalar(col) or isinstance(col, tuple)) and col in self:
                  # if col is a valid column key, everything is fine
                  continue
              elif is_scalar(col) and col not in self:
                  # tuples that are not keys are not considered missing,
                  # but illegal (see below)
                  missing.append(col)
-             elif (not is_list_like(col, allow_sets=False)
-                   or getattr(col, 'ndim', 1) > 1):
-                 raise TypeError(...)
+             elif isinstance(col, list):
+                 depr_warn = True
+             elif (not isinstance(col, (ABCIndexClass, ABCSeries, np.ndarray))
+                   or getattr(col, 'ndim', 1) > 1):
+                 raise ValueError(err_msg)

          if missing:
              raise KeyError('{}'.format(missing))
+         if depr_warn:
+             msg = ('passing lists within a list to the parameter "keys" is '
+                    'deprecated and will be removed in a future version.')
+             warnings.warn(msg, FutureWarning, stacklevel=2)

which does the actual deprecation and restricts to the allowed types (the TypeError we added for #22486 is I think inconsistent with how this is usually handled via ValueError elsewhere).

The rest is just doc changes, removing some test cases added by #22486 from the test parametrization, and catching the deprecation warnings in the tests (without changing the tests themselves).

@TomAugspurger
Copy link
Contributor

Sorry, I haven't been able to keep up. Are either of the following being deprecated?

In [8]: df.set_index([['a', 'b', 'c']])
Out[8]:
   A  B
a  1  4
b  2  5
c  3  6

In [9]: df.set_index(['A', ['a', 'b', 'c']])
Out[9]:
     B
A
1 a  4
2 b  5
3 c  6

what's our recommendation to users doing this (the message should tell them).

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Jan 14, 2019

@TomAugspurger
Both those cases are indeed being deprecated in this PR. As mentioned above lists are hardly ever used as an array (outside of toy example), as in most real-world use cases, the required data is already residing in Series/Index/ndarrays anyway. Furthermore, deprecating lists inside lists removed the ambiguity of list_of_scalar (keys vs values), and sacrifices no essential functionality as it's easy to wrap a list in an array.

I've adapted the warning message to give a clearer example of that.

@h-vetinari
Copy link
Contributor Author

@TomAugspurger @jorisvandenbossche @toobaz @jreback
friendly ping

@jreback
Copy link
Contributor

jreback commented Jan 16, 2019

not for 0.24.0

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Jan 20, 2019

@jreback @TomAugspurger @toobaz
Not in a rush with this one in particular, but the diff after #24762 is now much smaller/simpler.

@h-vetinari
Copy link
Contributor Author

@TomAugspurger @jorisvandenbossche @toobaz @jreback
Coming back to this after #25085 and #24984 are in.

@jorisvandenbossche
Copy link
Member

@h-vetinari To make sure there is a common understanding: this PR deprecates the usage of lists for the case of array-like of values (so not for the list of labels/array-likes).
So meaning: df.set_index(['A', 'B']) (list of labels) or df.set_index([np.array(['a', 'b', 'c']]) (list of array-likes) are still allowed, but df.set_index([['a', 'b', 'c']]) (list of array-likes where the array like is a list) would be deprecated?

Is there any ambiguous case this is solving? Or is it only with the intent to make it less complex/confusing for users?

@h-vetinari
Copy link
Contributor Author

@jorisvandenbossche: @h-vetinari To make sure there is a common understanding: this PR deprecates the usage of lists for the case of array-like of values (so not for the list of labels/array-likes).

That's correct.

Is there any ambiguous case this is solving? Or is it only with the intent to make it less complex/confusing for users?

One of the main points of discussion about the capabilities of df.set_index (e.g. in #24046) is that @jreback @WillAyd are strongly opposed to the fundamental ambiguity of df.set_index(['a', 'b', 'c']) (see e.g. here or OP of #24046), because lists are overloaded as an array-like and a container for combining several index columns.

This PR tries to break the stalemate between longstanding capabilities (including mixing column labels and arrays since before v0.12.) and this ambiguity, by deprecating the array-like role of lists, which can functionally be easily replaced by Series/Index/np.ndarray.

This also improves the situation with the confusingly-similar-yet-totally-different
df.set_index(['a', 'b', 'c']) vs. df.set_index([['a', 'b', 'c']]).

@@ -79,6 +79,7 @@ Deprecations
~~~~~~~~~~~~

- Deprecated the `M (months)` and `Y (year)` `units` parameter of :func: `pandas.to_timedelta`, :func: `pandas.Timedelta` and :func: `pandas.TimedeltaIndex` (:issue:`16344`)
- :meth:`DataFrame.set_index` has deprecated using lists of values *within* lists. It remains possible to pass array-likes, both directly and within a list.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would help here is a sub-section note that shows a good example of this.

@h-vetinari
Copy link
Contributor Author

@jreback: what would help here is a sub-section note that shows a good example of this.

I'm assuming you'll find the section I added too long, but it's easier to shorten than to add.

@h-vetinari
Copy link
Contributor Author

@jreback
Now green after the doc changes.

@h-vetinari
Copy link
Contributor Author

@jreback
PTAL at the doc section you requested

@jorisvandenbossche @TomAugspurger @gfyoung
Feel free to chime in as well. :)

@jreback
Copy link
Contributor

jreback commented Mar 10, 2019

after some discussion in real space. This is introducing more of a special case. This is not desired as adds user complexitly w/o a corresponding improvement. Thanks for the try here. This is a pretty tricky area of api and don't want to make it more confusing.

@jreback jreback closed this Mar 10, 2019
@h-vetinari
Copy link
Contributor Author

@jreback @jorisvandenbossche @TomAugspurger
This is unfortunate - would have liked to make the case for it at the dev-chat. This point was also the principal (only?) complaint about introducing Series.set_index (#22225) with the same capabilities. How do we proceed there? (reminder: you closed that PR unilaterally, but the need of such a method for Series was not in question.)

PS. Lists might seem like a special case here (although #22264 did exactly the same for .str.cat), but it's IMO very important to start deprecating lists-as-arrays (as opposed to lists-as-collections) throughout the code base. That ambiguity is at the heart of a lot of issues.

@h-vetinari h-vetinari mentioned this pull request May 31, 2019
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DataFrame DataFrame data structure Deprecate Functionality to remove in pandas Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: capabilities of df.set_index
6 participants