API: capabilities of df.set_index #24046

h-vetinari · 2018-12-02T13:11:10Z

This is coming out of a discussion that has stalled #22225 (which is about adding .set_index to Series, see #21684). The discussion has shifted away from what capabilities a putative Series.set_index should have, towards what capabilities df.set_index currently has.

The main issue (for @jreback) is that df.set_index takes arrays:

@jreback: There were several attempts to have DataFrame.set_index take an array as well, but these never got off the ground.

@h-vetinari: I'm not sure when, but they certainly did get off the ground:

>>> import pandas as pd
>>> import numpy as np
>>> pd.__version__
'0.23.4'
>>>
>>> df = pd.DataFrame(np.random.randint(0, 10, (4, 4)), columns=list('abcd'))
>>> df.set_index(['a',          # label
...               df.index,     # Index
...               df.b ** 2,    # Series
...               df.b.values,  # ndarray
...               list('ABCD'), # list
...               'c'])         # label again
              b  d
a   b      c
0 0 0  2 A 1  0  2
8 1 1  4 B 4  1  4
3 2 25 5 C 8  5  5
0 3 9  7 D 2  3  7

Further on:

@jreback: @h-vetinari you are confusing the purpose of .set_axis. [...] The problem with .set_index on a DataFrame with an array is that it technically can work with an array and not keys. (meaning its not unambiguous)

I don't think I am confusing them. If I want to set the .index-attribute of a Series/DataFrame, then using .set_index is the most reasonable name by far. If anything, set_axis should be a superset of set_index (and a putative set_columns), that just switches between the two based on the axis-kwarg.

More than that, the current capabilities of df.set_index are a proper superset of df.set_axis(axis=0)**, in that it's possible to fill keys with only Series/Index/ndarray/list etc.:

>>> df.set_index(pd.Index(df.a))  # same result as Series directly below
>>> df.set_index(df.a) 
   a  b  c  d
a
0  0  0  1  2
8  8  1  4  4
3  3  5  8  5
0  0  3  2  7
>>> df.set_index(df.a.values)  # same result as list directly below
>>> df.set_index([[0, 8, 3, 0]])
   a  b  c  d
0  0  0  1  2
8  8  1  4  4
3  3  5  8  5
0  0  3  2  7

** there is one caveat, in that lists (and only lists; out of all containers) need to be wrapped in another list, i.e. df.set_index([[0, 8, 3, 0]]) instead of df.set_index([0, 8, 3, 0]). This is the heart of the ambiguity that @jreback mentioned above (because a list is interpreted as a list of column keys).
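The caveat can be reproduced with a small frame. This is a minimal sketch, assuming a pandas version in which `set_index` still accepts array-likes; the frame and variable names are illustrative and not taken from the thread.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 8, 3, 0], "b": [0, 1, 5, 3]})

# A flat list is read as a list of column keys.
by_keys = df.set_index(["a", "b"])

# A nested list is read as one array of index values.
by_values = df.set_index([[0, 8, 3, 0]])

# A flat list of values is looked up as column keys and fails,
# because 0, 8 and 3 are not column labels of this frame.
try:
    df.set_index([0, 8, 3, 0])
    flat_list_error = None
except KeyError as exc:
    flat_list_error = exc
```

The nested-list call keeps both columns, because nothing was consumed from the frame, while the key-based call moves `a` and `b` into the index.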

Summing up:

- set_index is the most natural name for setting the .index-attribute
- df.set_index should be able to process list-likes (as it currently does; this is the source of the ambiguity of the list case).
- df.set_axis should be able to do everything that df.set_index does, and just switch between operating on index/columns based on the axis-kwarg (after all, index and columns are the two axes of a DF).
  - it could be considered to add a method set_columns on a DataFrame
  - The axis-kwarg of set_axis should just switch between the behaviour of set_index (i.e. dealing with keys and array-likes) and set_columns.
- Series.set_index should support the same signature as df.set_index, with the exception of the drop-keyword (which only makes sense for column labels).
- For Series, the set_index and set_axis methods should be exactly the same.

Since I can't tag @pandas-dev/pandas-core, here are a few individual tags: @jreback @TomAugspurger @jorisvandenbossche @gfyoung @WillAyd @jbrockmendel @jschendel @toobaz.

EDIT: Forgot to add an xref from @jreback:

@h-vetinari we had quite some discussion about this: #14829
and never reached resolution. This is an API question.

In that issue, there's discussion largely around .rename, and how to make that method more consistent. Also discussed was potentially introducing .relabel, as well as .set_columns.


jbrockmendel · 2018-12-02T17:39:22Z

@h-vetinari should list('ABC') in the first example be list('ABCD')? If not then I am confused in several directions.

h-vetinari · 2018-12-02T17:42:20Z

@jbrockmendel
That was indeed an artefact from merging together several things from the other thread to make this issue...

h-vetinari · 2019-01-02T16:47:36Z

@jreback Any comments here or in #22225?

h-vetinari · 2019-01-06T17:47:43Z

@jreback
I am honestly stunned by you closing #22225 and then locking it after I objected. So much for my motivation to work on some big PRs today.

@h-vetinari you are not listening. If you want to raise an issue or comment feel free.

I've opened this issue here for exactly this purpose (discussing your objections to existing capabilities of DataFrame.set_index) over a month ago.

toobaz · 2019-01-07T05:39:56Z

@h-vetinari While I think locking #22225 was an unnecessary move from @jreback, you have to realize that the "overruling approving reviews" thing is not a good argument to raise in such a discussion. True, in pandas we look for consensus among devs, but in the end we prefer to reach it by arguing rather than by any form of proper vote. So that comment was not very useful, to use a euphemism.

Back to the topic of the discussion: if I understand correctly, @jreback is arguing that df.set_index is already ambiguous enough to make it a bad idea to sponsor its use when passing anything but keys (which would be the only sensible use in Series.set_index); and at the same time, you are suggesting that

For Series, the set_index and set_axis methods should be exactly the same.

If this summary of the discussion is correct, then I think I am also against introducing Series.set_index.

And actually, I'm probably in favor of deprecating df.set_index to pass actual values, if df.set_axis is able to fulfill exactly the same task (didn't check).

I understand your argument about df.set_index being what one expects to use to set the .index (to some values)... but set_axis is simply too long established to be removed, and I find duplication in the API a worse problem than the maybe sub-optimal naming. Or in other words, I think

The axis-kwarg of set_axis should just switch between the behaviour of set_index (i.e. dealing with keys and array-likes) and set_columns.

is a bad idea. "There should be one - and preferably only one - obvious way to do it."

EDIT: By the way: sorry for not reacting before to the ping - busy period.

jorisvandenbossche · 2019-01-07T11:29:21Z

@h-vetinari In my opinion, the locking was not the best way to handle the discussion in the PR, so sorry about that. In the meantime, Jeff has unlocked the conversation there, but let's continue the discussion here.

On the topic: given the behaviour of DataFrame.set_index (supporting setting a full array-like in addition to a list of column names), I personally don't have problems with adding a similar behaviour for Series.set_index.

I'm probably in favor of deprecating df.set_index to pass actual values, if df.set_axis is able to fulfill exactly the same task .....
.... but set_axis is simply too long established to be removed, and I find duplication in the API a worse problem than the maybe sub-optimal naming

@toobaz I personally have almost never seen someone use set_axis (and never used it myself), but I do regularly see people use set_index with a full Series/array (but that's subjective of course).
So personally, I would rather deprecate set_axis and only keep set_index (but: set_axis also has the ability to set the columns, which set_index cannot do, so it is not fully duplicative and therefore probably cannot easily be deprecated).

TomAugspurger · 2019-01-07T12:36:54Z

Agreed that locking was not appropriate.

On the issue itself, to me it's pretty clear that Series.set_index(sequence) is a limiting case of DataFrame.set_index(Sequence[sequence])

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"A": [1, 2, 3]})

In [3]: df.set_index([['a', 'b', 'c']])
Out[3]:
   A
a  1
b  2
c  3

Since In[3] works, I would expect that

In [4]: df.A.set_index([['a', 'b', 'c']])

would work as well.
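To make the limiting-case argument concrete, the sketch below sets the same labels on a one-column frame and on that column taken as a Series. It assumes a recent pandas in which `set_axis` returns a new object; `Series.set_index` is the method under discussion and is not called here.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})
labels = ["a", "b", "c"]

# DataFrame: a nested list is accepted as index values.
frame_result = df.set_index([labels])

# Series: the same operation is currently spelled set_axis.
series_result = df["A"].set_axis(labels)

# Both objects end up with the same index.
same_index = frame_result.index.equals(series_result.index)
```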

WillAyd · 2019-01-07T16:09:11Z

I don't think continuing to post here or in the associated PR is an effective use of anyone's time. Why don't we just add it as a discussion point to the next dev chat?

TomAugspurger · 2019-01-07T16:12:02Z

The implementation is fine, and deserves to go in 0.24.0 if we can agree on the desired behavior. No need to delay I think.


WillAyd · 2019-01-07T16:15:31Z

There isn't agreement on the desired behavior, which is why I suggest moving to a separate forum. I don't think it's something we need to push into 0.24 at the end here either

TomAugspurger · 2019-01-07T16:19:28Z

I really don't think this is a difficult decision though, is it? Do we want Series.set_index to accept arrays like DataFrame.set_index or not? Joris and I are I think +1, or at least ambivalent. Jeff seems to think that DataFrame.set_index doesn't accept arrays (e.g.
#22225 (comment))

DataFrame.set_index() which accepts keys (which are column names / levels). NOT an array of values.

which isn't correct, as shown in
#24046 (comment).

jorisvandenbossche · 2019-01-07T16:20:48Z

I agree that if it seems difficult to find an agreement, that discussing this on the next dev chat can be more productive / effective.
But, until now, there hasn't been much discussion on the PR, apart from between Jeff and h-vetinari (other people commented on the PR, but didn't really get involved in the API discussion, apart from Tom agreeing with h-vetinari). So I would first still be interested in hearing what other people think about it, as I personally don't really see much reason to object to it.

toobaz · 2019-01-07T16:23:10Z

On the issue itself, to me it's pretty clear that Series.set_index(sequence) is a limiting case of DataFrame.set_index(Sequence[sequence])

@TomAugspurger We all agree (I think) that it makes sense for df.set_index([a_list_of_labels]) to work. I think @jreback makes a good point however that there is no obvious reason (except parameter ambiguity) for df.set_index(a_list_of_labels) not to work (since df.set_index(Series(a_list_of_labels)) does), and that this causes a potential confusion that df.set_axis doesn't. Then maybe we can live with it... but let's admit this is not ideal.

One alternative (which I don't particularly like) is what (I think) .groupby(a_list) does, i.e., trying to find elements of a_list in the axis, and fallback to considering them as values otherwise.

WillAyd · 2019-01-07T16:23:22Z

I am -1 due to ambiguity. I don't know what the desired behavior of the following is:

df = pd.DataFrame(np.ones((3,3)))
df.set_index([1, 0, 2])

TomAugspurger · 2019-01-07T16:27:11Z

@toobaz can you show an example of df.set_index(a_list_of_labels) vs. [a_list_of_labels]? I don't think that #22225 is changing that at all.

@WillAyd that ambiguity exists today, and is unchanged by #22225. I don't think anyone has proposed deprecating that behavior.

jorisvandenbossche · 2019-01-07T16:27:51Z

I am -1 due to ambiguity. I don't know what the desired behavior of the following is:

That is not fully the discussion. As that is about a DataFrame, and that behaviour is already defined (it first prefers column names).
The question is rather what pd.Series([0, 0, 0]).set_index([1, 0, 2]) should do, which is much less ambiguous.
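The DataFrame case can be checked directly. This is a minimal sketch, assuming the `set_index` behaviour described above (column keys are tried first); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((3, 3)))  # columns are 0, 1, 2

# Every element is a column label, so the list is read as keys:
# all three columns move into a MultiIndex.
as_keys = df.set_index([1, 0, 2])

# Wrapping the list makes it a single array of index values.
as_values = df.set_index([[1, 0, 2]])

# A Series has no columns, so only the "values" reading applies.
# Series.set_axis is used here as the existing spelling.
s = pd.Series([0, 0, 0]).set_axis([1, 0, 2])
```

With the flat list, the frame is left with no columns at all, which is why the example reads as ambiguous to someone who meant the list as values.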

Given the confusion and the talking past each other, it might be good if someone attempts to write a well-illustrated and complete summary of the actual discussion.

TomAugspurger · 2019-01-07T16:29:28Z

Is
#24046 (comment) a good summary? Make Series.set_index the limiting case of DataFrame.set_index? Any confusion points there?

toobaz · 2019-01-07T16:31:21Z

@toobaz can you show an example of df.set_index(a_list_of_labels) vs. [a_list_of_labels]? I don't think that #22225 is changing that at all.

No, it's not. But as already stated, if df.set_index(values_rather_than_keys) is a regrettable legacy causing ambiguity in the API, we'd rather not enhance its usage by paralleling it with Series.set_index, which would do only that (which is already done by Series.set_axis). I actually suggested deprecating it... which might not be our final decision, but is certainly related to #22225 .

TomAugspurger · 2019-01-07T16:32:55Z

@toobaz my apologies, I missed the paragraph where you suggested deprecating non-label values in DataFrame.set_index. Indeed, if we want to deprecate that then we should not go forward in #22225.

TomAugspurger · 2019-01-07T16:42:02Z

On deprecating passing values, rather than column labels, to DataFrame.set_index: I don't think we should deprecate that. While there is ambiguity, as noted in @WillAyd's example in
#24046 (comment), I think it's quite useful to pass a mix of labels and arrays.

In [11]: df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

In [12]: df.set_index(["A", [1, 2, 3]])
Out[12]:
     B
A
1 1  4
2 2  5
3 3  6

Without that, I think you'd have some ugly

In [18]: df.set_axis(pd.MultiIndex.from_arrays([df.A, [1, 2, 3]]), inplace=False).drop(['A'], axis=1)
Out[18]:
     B
A
1 1  4
2 2  5
3 3  6
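The same comparison can be sketched for a newer pandas in which `set_axis` has no `inplace` argument. This is an illustration of the equivalence, assuming `set_axis(..., axis=0)` returns a new frame; it is not a transcript of a session from the thread.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# One call: a column label mixed with a list of values.
mixed = df.set_index(["A", [1, 2, 3]])

# The same result spelled out with set_axis: build the MultiIndex
# by hand, attach it, then drop the column that was consumed.
levels = pd.MultiIndex.from_arrays([df["A"], [1, 2, 3]])
explicit = df.set_axis(levels, axis=0).drop(columns="A")
```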

jreback · 2019-01-07T16:45:50Z

you could just raise / warn on ambiguity
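One possible reading of "warn on ambiguity" is sketched below. Both helpers are hypothetical and not part of pandas; the rule used (a flat list whose items are all column labels and whose length equals the number of rows) is an assumption about what would count as ambiguous.

```python
import warnings

import pandas as pd


def is_ambiguous(df, keys):
    """Hypothetical check: could `keys` be read as labels AND as values?"""
    if not isinstance(keys, list) or len(keys) != len(df):
        return False
    try:
        return all(k in df.columns for k in keys)
    except TypeError:  # unhashable items such as lists or arrays
        return False


def set_index_checked(df, keys):
    """Hypothetical wrapper: warn, then defer to DataFrame.set_index."""
    if is_ambiguous(df, keys):
        warnings.warn("keys match column labels and the row count",
                      UserWarning, stacklevel=2)
    return df.set_index(keys)


df = pd.DataFrame([[1.0, 1.0, 1.0]] * 3)  # columns are 0, 1, 2
flat_is_ambiguous = is_ambiguous(df, [1, 0, 2])
nested_is_ambiguous = is_ambiguous(df, [[1, 0, 2]])
```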

jreback · 2019-01-07T16:46:39Z

this needs coupling with possibly deprecating set_axis as well
because passing values is not documented in any way

jorisvandenbossche · 2019-01-07T16:49:01Z

this needs coupling with possibly deprecating set_axis as well

I would personally be happy to get rid of set_axis, but I think the main problem there is that we don't have an alternative for df.set_axis(['a', 'b'], axis=1) (setting column names) ?
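The gap being described is the method form of setting column names. This is a minimal sketch, assuming a recent pandas in which `set_axis` returns a new frame; the data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# Method form: usable inside a chain.
chained = (
    df.set_axis(["a", "b"], axis=1)
      .assign(total=lambda d: d["a"] + d["b"])
)

# Attribute form: mutates the object, so it needs its own statement.
assigned = df.copy()
assigned.columns = ["a", "b"]
```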

toobaz · 2019-01-07T16:57:55Z

this needs coupling with possibly deprecating set_axis as well

Deprecating the one method that works as expected?!

I would personally be happy to get rid of set_axis, but I think the main problem there is that we don't have an alternative for df.set_axis(['a', 'b'], axis=1) (setting column names) ?

The problem is also for axis=0... (unless you know about the nested list trick)

ghost · 2019-07-25T04:38:03Z

Sorry, I removed the new content and also most of the old.

Without arguing for one position over another, here's the state of things as seen from the user's point of view who looks to the documentation for guidance:

# grep -R '\.index' doc/source
whatsnew/v0.14.0.rst
158:    df_multi.index = tuple_ind
164:    df_multi.index = mi

whatsnew/v0.16.0.rst
98:   s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),

user_guide/indexing.rst
1706:   data.index = index

getting_started/10min.rst
653:   ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

user_guide/io.rst
1893:   dfj2.index = pd.date_range('20130101', periods=5)
2952:   df.index = df.index.set_names(['lvl1', 'lvl2'])

getting_started/basics.rst
233:   dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'),

user_guide/timeseries.rst
2096:   ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

user_guide/sparse.rst
306:   s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),

At the same time, all example uses of set_index are for the column case and there are 0 usage examples of set_axis which appears only in the auto-generated reference.

I think it's fair to say that the docs currently advocate for always setting the index directly and give no indication that setting a series index in a method-chain is supported. I care about this more than I care about which competing alternative wins out.
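For contrast, the two styles look roughly like this on a Series. This is a sketch, assuming a recent pandas in which `Series.set_axis` returns a new object; the data is made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30])
dates = pd.date_range("2013-01-01", periods=3)

# Style shown in the docs: assign to the attribute, mutating in place.
assigned = s.copy()
assigned.index = dates

# Chainable style: set_axis returns a new Series mid-chain.
chained = s.mul(2).set_axis(dates).sort_index(ascending=False)
```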

My personal preference is still to keep set_index as it is and add its equivalent to Series. And in any case to update the documentation accordingly.

toobaz · 2019-07-25T06:16:00Z

I'm very sorry, but it is removing core functionality

You're not trying to understand.

toobaz · 2019-07-25T06:17:17Z

give up the array-like interpretation

I'm pretty sure this will never pass.

h-vetinari · 2019-07-25T07:37:13Z

@h-vetinari: I'm very sorry, but it is removing core functionality

@toobaz: You're not trying to understand.

I understand that, to you, the functionality is not removed because it lives on in set_axis. We just disagree about what constitutes good API design.

@h-vetinari: give up the array-like interpretation

@toobaz: I'm pretty sure this will never pass.

It doesn't have to be everywhere or at the same time, but removing such overloaded interpretations would greatly simplify the API surface as well as code maintainability. I'm saying it could be a goal to strive for, like striving to have tuples always be MI-keys (and was an argument why "special-casing" in option 3 can be desirable).

@pilkibun: [...] and there are 0 usage examples of set_axis which appears only in the auto-generated reference.

@toobaz, this is part of the reason why I consider shifting the array-capability from set_index to set_axis as a removal of functionality - it is not nearly as intuitive or wide-spread as set_index.

toobaz · 2019-07-25T07:56:55Z

We just disagree about what constitutes good API design.

We mostly disagree on what constitutes productive interaction. And from my part, this is the last "metacomment" in this discussion.

removing such overloaded interpretations would greatly simplify the API surface as well as code maintainability

I totally agree it would simplify the code, but we don't want our users to pay the price. And it would be particularly ironic to start doing so because we want to overload the interpretation of the argument to set_index.

So among the options you listed, only 4 and 5 are feasible. To be honest I don't like 5, but this might be subjective. Do you agree that a deprecation message which tells our users "please replace set_index with set_axis" will serve our users at least as well as a deprecation message which tells our users "please replace set_index with set_index(arrays=.)"?

h-vetinari · 2019-07-25T08:16:34Z

I totally agree it would simplify the code, but we don't want our users to pay the price. And it would be particularly ironic to start doing so because we want to overload the interpretation of the argument to set_index.

I agree that users should not pay the price. I also don't want the overloaded interpretation in set_index, but would tackle it differently.

Do you agree that a deprecation message which tells our users "please replace set_index with set_axis" will serve our users at least as well as a deprecation message which tells our users "please replace set_index with set_index(arrays=.)"?

As a deprecation, yes. After the removal, not so much (because set_index will still be the more obvious choice, also for arrays).

So among the options you listed, only 4 and 5 are feasible.

I think 3 and 5 are feasible. The overlap is 5, and I'd be happy to support that.

We mostly disagree on what constitutes productive interaction.

Indeed, and we both have our failings there. Thanks for taking the time/energy to stick with it.

toobaz · 2019-07-25T10:10:20Z

because set_index will still be the more obvious choice, also for arrays

Why "obvious"? For the name "set_index" vs. "set_axis"? Or because you like everything in the same method?
(either way, I think it's just a matter of having clear documentation)

I think 3 and 5 are feasible.

3 was so far ruled out by both me and @jreback - not very useful to cite it if you don't have new arguments. I guess we never discussed 5 much, so I'll be happy to mention it in the next live chat. I still think we will end up choosing 4.

h-vetinari · 2019-07-25T18:09:38Z

Why "obvious"? For the name "set_index" vs. "set_axis"?

That plays a large role, yes, because we're set-ting the index-attribute, and because having an intuitive API is crucial. Documentation does not help with intuitiveness.

3 was so far ruled out by both me and @jreback - not very useful to cite it if you don't have new arguments.

I have enough arguments**, but so far you have hardly responded to them, except by appeal to authority. 4 or 5 is your opinion, 3 or 5 is mine. Of course, you're in a much better position to enforce your ideas, but that does not improve the strength of your argument.

Still, we have already found some sort of minimal common ground with 5. If your preference for 4 over 5 is not too large to be overcome at all, I'd be happy to submit a PR that implements 5.

** I realised I haven't even fully articulated one of the most important arguments against 4: saying that arrays can only be used in set_axis does not solve the lack of a set_index-method for Series! The rest of my main arguments are recapped below the fold:

- that "arrays-in-set_index" was a specific enhancement and that this is core functionality not least due to its intuitiveness and centrality of indexes to pandas
- that axis/index are two sides of the same coin (and not two duplicate coins), and that having different behaviours between set_index and set_axis is further detriment to intuitiveness
- that deprecating list-as-array (vs. list-as-collection) is a goal to strive for, much more than a special case

toobaz · 2019-07-26T08:26:50Z

appeal to authority

This is a way to waste the time of both of us. So far, the only effect in this discussion of me being a core dev is that I am patiently replying to your rants, instead of doing more productive things.

But since you later repeated your arguments in a tidy way, I will for the last time assume you're trying to be constructive, and repeat my objections to your arguments. But neither your arguments nor my replies are new - and my replies are not just mine, other devs contributed to this and related discussions. If you want to just discuss the points below over and over, let's do this in a live chat so at least we waste less time.

that "arrays-in-set_index" was a specific enhancement

The fact that an ability is added later on does not mean it cannot be deprecated (actually the opposite).

and that this is core functionality not least due to its intuitiveness and centrality of indexes to pandas

"setting index with arrays" is a functionality (not so sure it is even a "core" one - setting from columns is much more frequent in the code I see), having it in set_index is just a matter of organizing the API. axis is not a strange term in pandas, it is one of the core concepts, so while index might be even more immediate, we are definitely not hiding the functionality by keeping it in set_axis.

that axis/index are two sides of the same coin (and not two duplicate coins), and that having different behaviours between set_index and set_axis is further detriment to intuitiveness

Your argument would suggest we want to make them almost-duplicates (apart from the axis= argument), and this is precisely something any good API should avoid. Vice-versa, separating the functionalities allows for less ambiguity (no, "ambiguity" is not an argument against option 5, but having a single method do widely different things based on arguments is not a favor to our users).

that deprecating list-as-array (vs. list-as-collection) is a goal to strive for, much more than a special case

As already clearly stated: you are right this would simplify our life, but we don't want our users to pay for it. We always allowed users to skip the step of explicitly creating a vectorized object to feed our vectorized objects, and I see no reason why we should change our mind now. As for terminology, 2D arrays are collections of 1D arrays, which are collections of elements, so lists of lists make perfect sense.

saying that arrays can only be used in set_axis does not solve the lack of a set_index-method for Series

There is nothing to "solve" if the set_index does something which is completely useless for Series - we have set_axis that does the useful part. And we have discussed this some time ago. This said, if we do find that some sort of Series.set_index is useful for compatibility (like the MultiIndex methods which were backported to flat Indexes as idempotent), I will have no general objections. But it will be a result of, not an argument for, our decision about DataFrame.set_index, and it is OT here.

h-vetinari · 2019-07-26T09:12:15Z

So far, the only effect in this discussion of me being a core dev is that I am patiently replying to your rants, instead of doing more productive things. [...] I will for the last time assume you're trying to be constructive, [...]

I appreciate the time you took, and thanks for that. Indeed I was trying to have a productive discussion, and what seemed like rants to you was my response to the perception that you dismissed my points out of hand (until your last reply).

Your argument would suggest we want to make them almost-duplicates (apart from the axis= argument), and this is precisely something any good API should avoid.

The point is that the concepts axis and index are "almost-duplicates" already, and so their methods should reflect that. Thinking of the methods as duplicate is having things backwards, because the concept comes before the method.

We always allowed users to skip the step of explicitly creating a vectorized object to feed our vectorized objects, and I see no reason why we should change our mind now.

Except maybe if it is the cause of the hotly-contested ambiguity of list_of_scalar that has stalled this whole discussion for about a year. It's clear that there are no great choices here, but leaving users to decipher the difference between set_index([0, 1, 2]) and set_index([[0, 1, 2]]) is not amazing either.

Anyway, thanks for taking the time to respond. I'll note that you didn't react to my attempt at extending an olive branch though:

@h-vetinari: Still, we have already found some sort of minimal common ground with 5. If your preference for 4 over 5 is not too large to be overcome at all, I'd be happy to submit a PR that implements 5.

toobaz · 2019-07-26T09:32:53Z

The point is that the concepts axis and index are "almost-duplicates" already, and so their methods should reflect that

In API design, "adherence to language" is only one of the many arguments - and an argument which is not new in this discussion. And for sure the fact that we have (almost-)duplicates in language doesn't mean we want (almost-)duplicates in the API.

the hotly-contested ambiguity of list_of_scalar that has stalled this whole discussion for about a year

You seem to think that this problem is important because it stalled for a year - maybe it stalled for a year because there were more important things ;-)

It's clear that there are no great choices here, but leaving users to decipher the difference between set_index([0, 1, 2]) and set_index([[0, 1, 2]]) is not amazing either.

Flat lists and nested lists should be clearly different objects to our users.

I'll note that you didn't react to my attempt at extending an olive branch though

I did. I said I don't like 5, I said why, but I also said I'm happy to discuss at our next live chat.

Now, unless there are new arguments, I suggest we wait for that.

h-vetinari · 2019-07-26T09:44:00Z

And for sure the fact that we have (almost-)duplicates in language doesn't mean we want (almost-)duplicates in the API.

Then the consequence should be deprecating set_axis, but not having very different methods for almost duplicate concepts.

[...] but I also said I'm happy to discuss at our next live chat.

Is there a date/time already?

toobaz · 2019-07-26T10:17:36Z

Then the consequence should be deprecating set_axis, but not having very different methods for almost duplicate concepts.

Setting the index from columns and from data are two operations (I guess this is what you mean by "concepts") which are different enough (we have seen) as to raise the need to avoid ambiguity. It can be done through different functions, or different args, or by defining special cases users should be aware of, but we definitely agree they are distinct. Which of these solutions we pick is mostly not a matter of linguistics.

Is there a date/time already?

No

toobaz · 2019-07-26T10:21:09Z

No

(but just write me privately if you want to set up a chat with me)

wesm · 2019-07-26T15:14:59Z

I'd like to point out that the tone of this thread makes me a bit uncomfortable. As a reminder, this project has a code of conduct

https://github.com/pandas-dev/pandas-governance/blob/master/code-of-conduct.md

In such discussions, I think we (both maintainers and contributors) need to stick to facts and technical arguments and leave feelings and editorial comments out of the process. There is a risk in technical arguments to stoop to emotive conjugation (https://en.wikipedia.org/wiki/Emotive_conjugation) in describing others' actions.

In general my understanding is that this project operates on the basis of consensus-based decision making -- when there is no consensus about a change, the default option is probably to do nothing. In theory as the BDFL I can help settle disagreements, but I would prefer not to except in truly exceptional circumstances.

I question whether GitHub issues was the appropriate venue for this discussion compared with some form of RFC / design document.

h-vetinari · 2019-07-26T16:35:42Z

@wesm
Thanks for taking the time to respond here, although I regret that the reason was due to discomfort.

I have striven to avoid any emotionally charged words, but don't claim that I always succeeded. I believe all participants truly want the best for the combination of user- & maintainer-base, but such impassioned arguments take a lot of time and energy (which, I presume, is the reason why many participants have stopped joining the discussion).

I do object to the way some things were handled in this whole episode, but will not dwell on that. My main reasons for not resigning from this discussion are that I don't want the case dismissed without a fair hearing/counter-argument (even if I'm not core-dev), and that I feel the picture is much less one-sided, even on the dev-side, than the impression that the last few comments might give.

I question whether GitHub issues was the appropriate venue for this discussion compared with some form of RFC / design document.

I'd be happy to participate in another format, but didn't know a better way than through an issue here.

PS. Thanks for the link about emotive conjugation. "How would I describe myself in their shoes" will be an excellent self-check before speaking/posting.

toobaz · 2019-10-09T18:08:23Z

In the last dev chat, which was held just a few minutes ago, this issue was discussed and there was clear consensus for option 4, that is, "deprecate using set_index with arrays, and point to set_axis instead".
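What option 4 would mean for user code can be sketched as follows. Only the one-line description quoted above comes from the thread; the frame and array are illustrative, and the sketch assumes a pandas version in which both spellings are accepted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 8, 3, 0], "b": [0, 1, 5, 3]})
values = np.array([10, 20, 30, 40])

# Spelling that option 4 would deprecate: an array passed to set_index.
old_style = df.set_index(values)

# Spelling that the deprecation message would point to.
new_style = df.set_axis(values, axis=0)

# Column keys are not affected by option 4.
by_key = df.set_index("a")
```

For a single array, the two spellings give frames that compare equal; the difference only shows up when labels and arrays are mixed in one call.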

(Related to the "tone" of the discussion, I definitely try to keep emotionally charged language away from my comments, but should I ever fail to do so, I welcome being explicitly reprimanded - ideally in private. I'm not a native speaker, and in any case I definitely won't be offended by any such rebuke. On the other hand, language aside, I think that when a discussion takes more of our time and energy than it is worth, there is nothing wrong in stating it, even if it is not a "fact or technical argument" on pandas itself.)

jorisvandenbossche · 2019-10-09T21:00:39Z

@toobaz can you give some reasoning why this is the clear preference?
I would need to go through this long thread to understand it better, but maybe some arguments were summarized on the call?

h-vetinari · 2019-10-10T06:18:04Z

@toobaz
Thanks for seeing this through. Although I would have liked to participate in the dev chat about this, and although I find the decision suboptimal, I guess any decision is better than no decision at this point. Time permitting, I'll look into a PR that deprecates arrays from set_index and outputs a nice warning to use set_axis.

@jorisvandenbossche
The lack of documentation and transparency (note: not an accusation, that's just the way it has been so far) in such cases is why I'm thinking about a pandas version of PEPs/NEPs (#28568). I have been swamped recently and couldn't respond on that issue, but I will pick it up again.

toobaz · 2019-10-10T07:10:38Z

@toobaz can you give some reasoning why this is the clear preference?

To be honest, it was more of a recap of the reasoning already laid out here than any new argument. Some devs (e.g., @WillAyd) were already clearly in favour of option 4.

The only new thing was a proposal (by @TomAugspurger if I recall correctly) to deprecate set_index entirely, replacing with two methods set_index_keys and set_index_values, a clean, but more disruptive, solution. But in the end, consensus on 4 was reached pretty quickly.
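To make the distinction between the two behaviours concrete, here is a sketch of how they could be separated. The names `set_index_keys` and `set_index_values` come from the call, but the free-function form, signatures, and error handling below are my own assumptions and not an actual or planned pandas API.

```python
import pandas as pd


def set_index_keys(df, keys):
    """Hypothetical: accept only existing column labels as the new index."""
    key_list = keys if isinstance(keys, list) else [keys]
    missing = [k for k in key_list if k not in df.columns]
    if missing:
        raise KeyError(f"not column labels: {missing}")
    return df.set_index(key_list)


def set_index_values(df, values):
    """Hypothetical: use the given values as row labels, keep all columns."""
    return df.set_axis(pd.Index(values), axis=0)
```

Under this split, a list is never ambiguous: `set_index_keys` always reads it as column labels, and `set_index_values` always reads it as the new index values.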

@h-vetinari you probably already know, but just in case: the call was publicly announced on the [pydata] mailing list. In any case, again, if there had been new arguments I would have written them here. It is hard to see the 76 comments here + others in related issues as a "lack of documentation and transparency".

I'm thinking about a pandas version of PEPs/NEPs (#28568).

Will reply there.

h-vetinari · 2019-10-10T07:50:15Z

@toobaz
I didn't know about the announcement, thanks for the info.

@toobaz: It is hard to see the 76 comments here + others in related issues as a "lack of documentation and transparency".

I do not consider dispersed discussion in several threads and comments as appropriate documentation (again: not as a criticism of you or the other devs, but rather of the current process). I added a comment about this in #28568.

TomAugspurger · 2019-10-10T12:36:24Z

@h-vetinari The calls and meeting notes are public.

to deprecate set_index entirely

Not quite: I was just choosing different names to highlight the different behavior. Clearly set_index should stay :)

toobaz · 2019-10-10T15:37:46Z

Not quite: I was just choosing different names to highlight the different behavior. Clearly set_index should stay :)

OK, thanks for the clarification ;-)
