API: capabilities of df.set_index #24046

Open
2 of 7 tasks
h-vetinari opened this issue Dec 2, 2018 · 80 comments

Comments

@h-vetinari
Contributor

h-vetinari commented Dec 2, 2018

This is coming out of a discussion that has stalled #22225 (which is about adding .set_index to Series, see #21684). The discussion has shifted away from what capabilities a putative Series.set_index should have, towards what capabilities df.set_index currently has.

The main issue (for @jreback) is that df.set_index takes arrays:

@jreback: There were several attempts to have DataFrame.set_index take an array as well, but these never got off the ground.

@h-vetinari: I'm not sure when, but they certainly did get off the ground:

>>> import pandas as pd
>>> import numpy as np
>>> pd.__version__
'0.23.4'
>>>
>>> df = pd.DataFrame(np.random.randint(0, 10, (4, 4)), columns=list('abcd'))
>>> df.set_index(['a',          # label
...               df.index,     # Index
...               df.b ** 2,    # Series
...               df.b.values,  # ndarray
...               list('ABCD'), # list
...               'c'])         # label again
              b  d
a   b      c
0 0 0  2 A 1  0  2
8 1 1  4 B 4  1  4
3 2 25 5 C 8  5  5
0 3 9  7 D 2  3  7

Further on:

@jreback: @h-vetinari you are confusing the purpose of .set_axis. [...] The problem with .set_index on a DataFrame with an array is that it technically can work with an array and not keys. (meaning it's not unambiguous)

I don't think I am confusing them. If I want to set the .index-attribute of a Series/DataFrame, then using .set_index is the most reasonable name by far. If anything, set_axis should be a superset of set_index (and a putative set_columns), that just switches between the two based on the axis-kwarg.

More than that, the current capabilities of df.set_index are a proper superset of df.set_axis(axis=0)**, in that it's possible to fill keys with only Series/Index/ndarray/list etc.:

>>> df.set_index(pd.Index(df.a))  # same result as Series directly below
>>> df.set_index(df.a) 
   a  b  c  d
a
0  0  0  1  2
8  8  1  4  4
3  3  5  8  5
0  0  3  2  7
>>> df.set_index(df.a.values)  # same result as list directly below
>>> df.set_index([[0, 8, 3, 0]])
   a  b  c  d
0  0  0  1  2
8  8  1  4  4
3  3  5  8  5
0  0  3  2  7

** there is one caveat, in that lists (and only lists; out of all containers) need to be wrapped in another list, i.e. df.set_index([[0, 8, 3, 0]]) instead of df.set_index([0, 8, 3, 0]). This is the heart of the ambiguity that @jreback mentioned above (because a list is interpreted as a list of column keys).
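The caveat can be reproduced with a minimal sketch, assuming a recent pandas release. The small frame below is illustrative and is not the random data from the session above.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 8, 3, 0], "b": [0, 1, 5, 3]})

# A flat list is read as a list of column keys.
by_keys = df.set_index(["a", "b"])

# A nested list is read as one array of index values; no column is consumed.
by_values = df.set_index([[0, 8, 3, 0]])

# A flat list of values is looked up as column keys, which fails here
# because none of 0, 8, 3 are column labels.
try:
    df.set_index([0, 8, 3, 0])
    flat_list_error = None
except KeyError as exc:
    flat_list_error = exc
```

The same four numbers therefore mean "index values" or "column labels" depending only on the nesting level.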

Summing up:

  • set_index is the most natural name for setting the .index-attribute
  • df.set_index should be able to process list-likes (as it currently does; this is the source of the ambiguity of the list case).
  • df.set_axis should be able to do everything that df.set_index does, and just switch between operating on index/columns based on the axis-kwarg (after all, index and columns are the two axes of a DF).
    • it could be considered to add a method set_columns on a DataFrame
    • The axis-kwarg of set_axis should just switch between the behaviour of set_index (i.e. dealing with keys and array-likes) and set_columns.
  • Series.set_index should support the same signature as df.set_index, with the exception of the drop-keyword (which only makes sense for column labels).
  • For Series, the set_index and set_axis methods should be exactly the same.
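For comparison, here is a small sketch of where the two methods overlap today and where only set_axis applies. It assumes a pandas version in which set_axis returns a new object by default; the frame and label values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 8, 3, 0], "b": [0, 1, 5, 3]})
values = [10, 20, 30, 40]

# Setting the row labels from values: both methods can do this.
via_set_index = df.set_index(pd.Index(values))
via_set_axis = df.set_axis(values, axis=0)

# Setting the column labels: only set_axis has an axis argument for this.
renamed_columns = df.set_axis(["x", "y"], axis=1)
```

Only the first pair overlaps; set_index has no counterpart for the axis=1 call, which is why a set_columns method is floated in the list above.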

Since I can't tag @pandas-dev/pandas-core, here are a few individual tags: @jreback @TomAugspurger @jorisvandenbossche @gfyoung @WillAyd @jbrockmendel @jschendel @toobaz.

EDIT: Forgot to add an xref from @jreback:

@h-vetinari we had quite some discussion about this: #14829
and never reached resolution. This is an API question.

In that issue, there's discussion largely around .rename, and how to make that method more consistent. Also discussed was potentially introducing .relabel, as well as .set_columns.

@jbrockmendel
Member

@h-vetinari should list('ABC') in the first example be list('ABCD')? If not then I am confused in several directions.

@h-vetinari
Contributor Author

@jbrockmendel
That was indeed an artefact from merging together several things from the other thread to make this issue...

@gfyoung added the Indexing, API Design, and DataFrame labels Dec 2, 2018
@h-vetinari
Contributor Author

@jreback Any comments here or in #22225?

@h-vetinari
Contributor Author

h-vetinari commented Jan 6, 2019

@jreback
I am honestly stunned by you closing #22225 and then locking it after I objected. So much for my motivation to work on some big PRs today.

@h-vetinari you are not listening. If you want to raise an issue or comment feel free.

I've opened this issue here for exactly this purpose (discussing your objections to existing capabilities of DataFrame.set_index) over a month ago.

@toobaz
Member

toobaz commented Jan 7, 2019

@h-vetinari While I think locking #22225 was an unnecessary move from @jreback, you have to realize that the "overruling approving reviews" thing is not a good argument to raise in such a discussion. True, in pandas we look for consensus among devs, but in the end we prefer to reach it by arguing rather than by any form of proper vote. So that comment was not very useful, to use a euphemism.

Back to the topic of the discussion: if I understand correctly, @jreback is arguing that df.set_index is already ambiguous enough to make it a bad idea to promote its use when passing anything but keys (which would be the only sensible use in Series.set_index); and at the same time, you are suggesting that

[ ] For Series, the set_index and set_axis methods should be exactly the same.

If this summary of the discussion is correct, then I think I am also against introducing Series.set_index.

And actually, I'm probably in favor of deprecating df.set_index to pass actual values, if df.set_axis is able to fulfill exactly the same task (didn't check).

I understand your argument about df.set_index being what one expects to use to set the .index (to some values)... but set_axis is simply too long established to be removed, and I find duplication in the API a worse problem than the maybe sub-optimal naming. Or in other words, I think

The axis-kwarg of set_axis should just switch between the behaviour of set_index (i.e. dealing with keys and array-likes) and set_columns.

is a bad idea. "There should be one - and preferably only one - obvious way to do it."

EDIT: By the way: sorry for not reacting before to the ping - busy period.

@jorisvandenbossche
Member

@h-vetinari In my opinion, the locking was not the best way to handle the discussion in the PR, so sorry about that. In the meantime, Jeff has unlocked the conversation there, but let's continue the discussion here.


On the topic: given the behaviour of DataFrame.set_index (supporting setting a full array-like in addition to a list of column names), I personally don't have problems with adding similar behaviour for Series.set_index.

I'm probably in favor of deprecating df.set_index to pass actual values, if df.set_axis is able to fulfill exactly the same task .....
.... but set_axis is simply too long established to be removed, and I find duplication in the API a worse problem than the maybe sub-optimal naming

@toobaz I have personally almost never seen someone use set_axis (and never used it myself), but I do regularly see people use set_index with a full Series/array (but that's subjective of course).
So personally, I would rather deprecate set_axis and only keep set_index (but: set_axis also has the ability to set the columns, which set_index cannot do, so it is not fully duplicative and therefore probably cannot easily be deprecated).

@TomAugspurger
Contributor

Agreed that locking was not appropriate.


On the issue itself, to me it's pretty clear that Series.set_index(sequence) is a limiting case of DataFrame.set_index(Sequence[sequence])

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"A": [1, 2, 3]})

In [3]: df.set_index([['a', 'b', 'c']])
Out[3]:
   A
a  1
b  2
c  3

Since In[3] works, I would expect that

In [4]: df.A.set_index([['a', 'b', 'c']])

would work as well.
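Series.set_index is the method being proposed here, so it cannot be shown directly. As a rough sketch of the intended result, assuming a recent pandas release, the existing Series.set_axis gives the Series counterpart of In [3]:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})
labels = ["a", "b", "c"]

# DataFrame case from In [3]: the nested list is one array of index values.
frame_result = df.set_index([labels])

# Series case, spelled with the method that exists today.
series_result = df["A"].set_axis(labels)
```

Both objects end up with the same index, which is the sense in which the Series case is a limiting case of the DataFrame one.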

@WillAyd
Member

WillAyd commented Jan 7, 2019

I don't think continuing to post here or in the associated PR is an effective use of anyone's time. Why don't we just add it as a discussion point to the next dev chat?

@TomAugspurger
Contributor

TomAugspurger commented Jan 7, 2019 via email

@WillAyd
Member

WillAyd commented Jan 7, 2019

There isn't agreement on the desired behavior, which is why I suggest moving to a separate forum. I don't think it's something we need to push into 0.24 at the end here either.

@TomAugspurger
Contributor

I really don't think this is a difficult decision though, is it? Do we want Series.set_index to accept arrays like DataFrame.set_index or not? Joris and I are, I think, +1, or at least ambivalent. Jeff seems to think that DataFrame.set_index doesn't accept arrays (e.g. #22225 (comment)):

DataFrame.set_index() which accepts keys (which are column names / levels). NOT an array of values.

which isn't correct, as shown in #24046 (comment).

@jorisvandenbossche
Member

I agree that if it seems difficult to find an agreement, that discussing this on the next dev chat can be more productive / effective.
But, until now, there hasn't been much discussion on the PR, apart from between Jeff and h-vetinari (other people commented on the PR, but didn't really get involved in the API discussion, apart from Tom agreeing with h-vetinari). So I would first still be interested in hearing what other people think about it, as I personally don't really see much reason to object to it.

@toobaz
Member

toobaz commented Jan 7, 2019

On the issue itself, to me it's pretty clear that Series.set_index(sequence) is a limiting case of DataFrame.set_index(Sequence[sequence])

@TomAugspurger We all agree (I think) that it makes sense for df.set_index([a_list_of_labels]) to work. I think @jreback makes a good point however that there is no obvious reason (except parameter ambiguity) for df.set_index(a_list_of_labels) not to work (since df.set_index(Series(a_list_of_labels)) does), and that this causes a potential confusion that df.set_axis doesn't. Then maybe we can live with it... but let's admit this is not ideal.

One alternative (which I don't particularly like) is what (I think) .groupby(a_list) does, i.e., trying to find elements of a_list in the axis, and fallback to considering them as values otherwise.
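A rough sketch of what such a "keys first, values as fallback" rule could look like. The helper name set_index_with_fallback is hypothetical and is not part of pandas; it only illustrates the lookup order described above, assuming a recent pandas release.

```python
import pandas as pd


def set_index_with_fallback(df, items):
    """Hypothetical helper: read `items` as column keys when every item is
    a column label, otherwise read `items` as the new index values."""
    if all(item in df.columns for item in items):
        return df.set_index(list(items))
    return df.set_axis(list(items), axis=0)


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
as_keys = set_index_with_fallback(df, ["a"])           # "a" is a column
as_values = set_index_with_fallback(df, [10, 20, 30])  # not column labels
```

The drawback is that the result depends on the data: values that happen to coincide with column labels silently switch the interpretation.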

@WillAyd
Member

WillAyd commented Jan 7, 2019

I am -1 due to ambiguity. I don't know what the desired behavior of the following is:

df = pd.DataFrame(np.ones((3,3)))
df.set_index([1, 0, 2])

@TomAugspurger
Contributor

@toobaz can you show an example of df.set_index(a_list_of_labels) vs. [a_list_of_labels]? I don't think that #22225 is changing that at all.

@WillAyd that ambiguity exists today, and is unchanged by #22225. I don't think anyone has proposed deprecating that behavior.

@jorisvandenbossche
Member

I am -1 due to ambiguity. I don't know what the desired behavior of the following is:

That is not fully the discussion. As that is about a DataFrame, and that behaviour is already defined (it first prefers column names).
The question is rather what pd.Series([0, 0, 0]).set_index([1, 0, 2]) should do, which is much less ambiguous.
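For reference, a short sketch of how the DataFrame example resolves, assuming a recent pandas release. It uses the same 3x3 frame of ones as above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((3, 3)))  # column labels are the integers 0, 1, 2

# Every list element matches a column label, so the list is read as keys:
# all three columns move into a three-level MultiIndex.
keyed = df.set_index([1, 0, 2])

# Wrapping the list in another list makes it one array of index values,
# and all three columns stay in place.
valued = df.set_index([[1, 0, 2]])
```

A Series has no columns to look keys up in, so only the second reading would be available there.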

Given the confusion and talking past each other, it might be good if someone attempts to make a well-illustrated and complete summary of the actual discussion.

@TomAugspurger
Contributor

Is #24046 (comment) a good summary? Make Series.set_index the limiting case of DataFrame.set_index? Any confusion points there?

@toobaz
Member

toobaz commented Jan 7, 2019

@toobaz can you show an example of df.set_index(a_list_of_labels) vs. [a_list_of_labels]? I don't think that #22225 is changing that at all.

No, it's not. But as already stated, if df.set_index(values_rather_than_keys) is a regrettable legacy causing ambiguity in the API, we'd rather not enhance its usage by paralleling it with Series.set_index, which would do only that (which is already done by Series.set_axis). I actually suggested deprecating it... which might not be our final decision, but is certainly related to #22225 .

@TomAugspurger
Contributor

TomAugspurger commented Jan 7, 2019

@toobaz my apologies, I missed the paragraph where you suggested deprecating non-labels as values in DataFrame.set_index. Indeed, if we want to deprecate that then we should not go forward with #22225.

@TomAugspurger
Contributor

On deprecating passing values, rather than column labels, to DataFrame.set_index: I don't think we should deprecate that. While there is ambiguity, as noted in @WillAyd's example in #24046 (comment), I think it's quite useful to pass a mix of labels and arrays.

In [11]: df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

In [12]: df.set_index(["A", [1, 2, 3]])
Out[12]:
     B
A
1 1  4
2 2  5
3 3  6

Without that, I think you'd have some ugly

In [18]: df.set_axis(pd.MultiIndex.from_arrays([df.A, [1, 2, 3]]), inplace=False).drop(['A'], axis=1)
Out[18]:
     B
A
1 1  4
2 2  5
3 3  6
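The two spellings above can be checked against each other with a short sketch. It assumes a recent pandas release in which set_axis returns a new object by default, so the inplace=False argument from In [18] is left out.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
extra = [1, 2, 3]

# One call: a column label mixed with an array of values.
mixed = df.set_index(["A", extra])

# The longer route: build the MultiIndex by hand, set it, then drop "A".
index = pd.MultiIndex.from_arrays([df["A"], extra])
spelled_out = df.set_axis(index, axis=0).drop(columns="A")
```

Both results carry a two-level index with "A" as the first level and leave only column B in the frame.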

@jreback
Contributor

jreback commented Jan 7, 2019

you could just raise / warn on ambiguity

@jreback
Contributor

jreback commented Jan 7, 2019

this needs coupling with possibly deprecating set_axis as well
because passing values is not documented in any way

@jorisvandenbossche
Member

this needs coupling with possibly deprecating set_axis as well

I would personally be happy to get rid of set_axis, but I think the main problem there is that we don't have an alternative for df.set_axis(['a', 'b'], axis=1) (setting column names) ?

@toobaz
Member

toobaz commented Jan 7, 2019

this needs coupling with possibly deprecating set_axis as well

Deprecating the one method that works as expected?!

I would personally be happy to get rid of set_axis, but I think the main problem there is that we don't have an alternative for df.set_axis(['a', 'b'], axis=1) (setting column names) ?

The problem is also for axis=0... (unless you know about the nested list trick)

@ghost

ghost commented Jul 25, 2019

Sorry, I removed the new content and also most of the old.

Without arguing for one position over another, here's the state of things as seen from the user's point of view who looks to the documentation for guidance:

# grep -R '\.index' doc/source
whatsnew/v0.14.0.rst
158:    df_multi.index = tuple_ind
164:    df_multi.index = mi

whatsnew/v0.16.0.rst
98:   s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),

user_guide/indexing.rst
1706:   data.index = index

getting_started/10min.rst
653:   ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

user_guide/io.rst
1893:   dfj2.index = pd.date_range('20130101', periods=5)
2952:   df.index = df.index.set_names(['lvl1', 'lvl2'])

getting_started/basics.rst
233:   dfmi.index = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'),

user_guide/timeseries.rst
2096:   ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9

user_guide/sparse.rst
306:   s.index = pd.MultiIndex.from_tuples([(1, 2, 'a', 0),

At the same time, all example uses of set_index are for the column case and there are 0 usage examples of set_axis which appears only in the auto-generated reference.

I think it's fair to say that the docs currently advocate always setting the index directly and give no indication that setting a Series index in a method chain is supported. I care about this more than I care about which competing alternative wins out.
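To illustrate the difference between the two styles, here is a small sketch with made-up data, assuming a recent pandas release. Attribute assignment mutates the object and needs its own statement, while set_axis returns a new Series and can sit inside a chain.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

# Style shown in the docs: assign to the attribute (mutates `assigned`).
assigned = s.copy()
assigned.index = ["x", "y", "z"]

# Method-chain style: set_axis returns a new Series, so further calls
# can follow directly and `s` itself is left untouched.
chained = s.set_axis(["x", "y", "z"]).mul(2)
```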

My personal preference is still to keep set_index as it is and add its equivalent to Series. And in any case to update the documentation accordingly.

@toobaz
Member

toobaz commented Jul 25, 2019

I'm very sorry, but it is removing core functionality

You're not trying to understand.

@toobaz
Member

toobaz commented Jul 25, 2019

give up the array-like interpretation

I'm pretty sure this will never pass.

@h-vetinari
Contributor Author

@h-vetinari: I'm very sorry, but it is removing core functionality

@toobaz: You're not trying to understand.

I understand that, to you, the functionality is not removed because it lives on in set_axis. We just disagree about what constitutes good API design.

@h-vetinari: give up the array-like interpretation

@toobaz: I'm pretty sure this will never pass.

It doesn't have to be everywhere or at the same time, but removing such overloaded interpretations would greatly simplify the API surface as well as code maintainability. I'm saying it could be a goal to strive for, like striving to have tuples always be MI-keys (and was an argument why "special-casing" in option 3 can be desirable).

@pilkibun: [...] and there are 0 usage examples of set_axis which appears only in the auto-generated reference.

@toobaz, this is part of the reason why I consider shifting the array-capability from set_index to set_axis as a removal of functionality - it is not nearly as intuitive or wide-spread as set_index.

@toobaz
Member

toobaz commented Jul 25, 2019

We just disagree about what constitutes good API design.

We mostly disagree on what constitutes productive interaction. And from my part, this is the last "metacomment" in this discussion.

removing such overloaded interpretations would greatly simplify the API surface as well as code maintainability

I totally agree it would simplify the code, but we don't want our users to pay the price. And it would be particularly ironic to start doing so because we want to overload the interpretation of the argument to set_index.

So among the options you listed, only 4 and 5 are feasible. To be honest I don't like 5, but this might be subjective. Do you agree that a deprecation message which tells our users "please replace set_index with set_axis" will serve our users at least as well as a deprecation message which tells our users "please replace set_index with set_index(arrays=.)"?

@h-vetinari
Contributor Author

I totally agree it would simplify the code, but we don't want our users to pay the price. And it would be particularly ironic to start doing so because we want to overload the interpretation of the argument to set_index.

I agree that users should not pay the price. I also don't want the overloaded interpretation in set_index, but would tackle it differently.

Do you agree that a deprecation message which tells our users "please replace set_index with set_axis" will serve our users at least as well as a deprecation message which tells our users "please replace set_index with set_index(arrays=.)"?

As a deprecation, yes. After the removal, not so much (because set_index will still be the more obvious choice, also for arrays).

So among the options you listed, only 4 and 5 are feasible.

I think 3 and 5 are feasible. The overlap is 5, and I'd be happy to support that.

We mostly disagree on what constitutes productive interaction.

Indeed, and we both have our failings there. Thanks for taking the time/energy to stick with it.

@toobaz
Member

toobaz commented Jul 25, 2019

because set_index will still be the more obvious choice, also for arrays

Why "obvious"? For the name "set_index" vs. "set_axis"? Or because you like everything in the same method?
(either way, I think it's just a matter of having clear documentation)

I think 3 and 5 are feasible.

3 was so far ruled out by both me and @jreback - not very useful to cite it if you don't have new arguments. I guess we never discussed much 5, so I'll be happy to mention it in the next live chat. I still think we will end up choosing 4.

@h-vetinari
Contributor Author

Why "obvious"? For the name "set_index" vs. "set_axis"?

That plays a large role, yes, because we're set-ting the index-attribute, and because having an intuitive API is crucial. Documentation does not help with intuitiveness.

3 was so far ruled out by both me and @jreback - not very useful to cite it if you don't have new arguments.

I have enough arguments**, but so far you have hardly responded to them, except by appeal to authority. 4 or 5 is your opinion, 3 or 5 is mine. Of course, you're in a much better position to enforce your ideas, but that does not improve the strength of your argument.

Still, we have already found some sort of minimal common ground with 5. If your preference for 4 over 5 is not too large to be overcome at all, I'd be happy to submit a PR that implements 5.

** I realised I haven't even fully articulated one of the most important arguments against 4: saying that arrays can only be used in set_axis does not solve the lack of a set_index-method for Series! The rest of my main arguments is recapped below the fold:

  • that "arrays-in-set_index" was a specific enhancement and that this is core functionality not least due to its intuitiveness and centrality of indexes to pandas
  • that axis/index are two sides of the same coin (and not two duplicate coins), and that having different behaviours between set_index and set_axis is further detriment to intuitiveness
  • that deprecating list-as-array (vs. list-as-collection) is a goal to strive for, much more than a special case

@toobaz
Member

toobaz commented Jul 26, 2019

appeal to authority

This is a way to waste the time of both of us. So far, the only effect in this discussion of me being a core dev is that I am patiently replying to your rants, rather than doing more productive things.

But since you later repeated your arguments in a tidy way, I will for the last time assume you're trying to be constructive, and repeat my objections to your arguments. But neither your arguments nor my replies are new - and my replies are not just mine, other devs contributed to this and related discussions. If you want to just discuss the points below over and over, let's do this in a live chat so at least we waste less time.

that "arrays-in-set_index" was a specific enhancement

The fact that an ability is added later on does not mean it cannot be deprecated (actually the opposite).

and that this is core functionality not least due to its intuitiveness and centrality of indexes to pandas

"setting index with arrays" is a functionality (not so sure it is even a "core" one - setting from columns is much more frequent in the code I see), having it in set_index is just a matter of organizing the API. axis is not a strange term in pandas, it is one of the core concepts, so while index might be even more immediate, we are definitely not hiding the functionality by keeping it in set_axis.

that axis/index are two sides of the same coin (and not two duplicate coins), and that having different behaviours between set_index and set_axis is further detriment to intuitiveness

Your argument would suggest we want to make them almost-duplicates (apart from the axis= argument), and this is precisely something any good API should avoid. Vice-versa, separating the functionalities allows for less ambiguity (no, "ambiguity" is not an argument against option 5, but having a single method do widely different things based on arguments is not a favor to our users).

that deprecating list-as-array (vs. list-as-collection) is a goal to strive for, much more than a special case

As already clearly stated: you are right this would simplify our life, but we don't want our users to pay for it. We always allowed users to skip the step of explicitly creating a vectorized object to feed our vectorized objects, and I see no reason why we should change our mind now. As for terminology, 2D arrays are collections of 1D arrays, which are collections of elements, so lists of lists make perfect sense.

saying that arrays can only be used in set_axis does not solve the lack of a set_index-method for Series

There is nothing to "solve" if the set_index does something which is completely useless for Series - we have set_axis that does the useful part. And we have discussed this some time ago. This said, if we do find that some sort of Series.set_index is useful for compatibility (like the MultiIndex methods which were backported to flat Indexes as idempotent), I will have no general objections. But it will be a result of, not an argument for, our decision about DataFrame.set_index, and it is OT here.

@h-vetinari
Contributor Author

h-vetinari commented Jul 26, 2019

So far, the only effect in this discussion of me being a core dev is that I am patiently replying to your rants, rather than doing more productive things. [...] I will for the last time assume you're trying to be constructive, [...]

I appreciate the time you took, and thanks for that. Indeed I was trying to have a productive discussion, and what seemed like rants to you was my response to the perception that you dismissed my points out of hand (until your last reply).

Your argument would suggest we want to make them almost-duplicates (apart from the axis= argument), and this is precisely something any good API should avoid.

The point is that the concepts axis and index are "almost-duplicates" already, and so their methods should reflect that. Thinking of the methods as duplicate is having things backwards, because the concept comes before the method.

We always allowed users to skip the step of explicitly creating a vectorized object to feed our vectorized objects, and I see no reason why we should change our mind now.

Except maybe if it is the cause of the hotly-contested ambiguity of list_of_scalar that has stalled this whole discussion for about a year. It's clear that there are no great choices here, but leaving users to decipher the difference between set_index([0, 1, 2]) and set_index([[0, 1, 2]]) is not amazing either.


Anyway, thanks for taking the time to respond. I'll note that you didn't react to my attempt at extending an olive branch though:

@h-vetinari: Still, we have already found some sort of minimal common ground with 5. If your preference for 4 over 5 is not too large to be overcome at all, I'd be happy to submit a PR that implements 5.

@toobaz
Member

toobaz commented Jul 26, 2019

The point is that the concepts axis and index are "almost-duplicates" already, and so their methods should reflect that

In API design, "adherence to language" is only one of the many arguments - and an argument which is not new in this discussion. And for sure the fact that we have (almost-)duplicates in language doesn't mean we want (almost-)duplicates in the API.

the hotly-contested ambiguity of list_of_scalar that has stalled this whole discussion for about a year

You seem to think that this problem is important because it stalled for a year - maybe it stalled for a year because there were more important things ;-)

It's clear that there are no great choices here, but leaving users to decipher the difference between set_index([0, 1, 2]) and set_index([[0, 1, 2]]) is not amazing either.

Flat lists and nested lists should be clearly different objects to our users.

I'll note that you didn't react to my attempt at extending an olive branch though

I did. I said I don't like 5, I said why, but I also said I'm happy to discuss at our next live chat.

Now, unless there are new arguments, I suggest we wait for that.

@h-vetinari
Contributor Author

And for sure the fact that we have (almost-)duplicates in language doesn't mean we want (almost-)duplicates in the API.

Then the consequence should be deprecating set_axis, but not having very different methods for almost duplicate concepts.

[...] but I also said I'm happy to discuss at our next live chat.

Is there a date/time already?

@toobaz
Member

toobaz commented Jul 26, 2019

Then the consequence should be deprecating set_axis, but not having very different methods for almost duplicate concepts.

Setting the index from columns and from data are two operations (I guess this is what you mean by "concepts") which are different enough (we have seen) as to raise the need to avoid ambiguity. It can be done through different functions, or different args, or by defining special cases users should be aware of, but we definitely agree they are distinct. Which of these solutions we pick is mostly not a matter of linguistics.

Is there a date/time already?

No

@toobaz
Member

toobaz commented Jul 26, 2019

No

(but just write me privately if you want to set up a chat with me)

@wesm
Member

wesm commented Jul 26, 2019

I'd like to point out that the tone of this thread makes me a bit uncomfortable. As a reminder, this project has a code of conduct

https://github.com/pandas-dev/pandas-governance/blob/master/code-of-conduct.md

In such discussions, I think we (both maintainers and contributors) need to stick to facts and technical arguments and leave feelings and editorial comments out of the process. There is a risk in technical arguments to stoop to emotive conjugation (https://en.wikipedia.org/wiki/Emotive_conjugation) in describing others' actions.

In general my understanding is that this project operates on the basis of consensus-based decision making -- when there is no consensus about a change, the default option is probably to do nothing. In theory as the BDFL I can help settle disagreements, but I would prefer not to except in truly exceptional circumstances.

I question whether GitHub issues was the appropriate venue for this discussion compared with some form of RFC / design document.

@h-vetinari
Contributor Author

h-vetinari commented Jul 26, 2019

@wesm
Thanks for taking the time to respond here, although I regret that the reason was due to discomfort.

I have striven to avoid any emotionally charged words, but don't claim that I always succeeded. I believe all participants truly want the best for the combination of user- & maintainer-base, but such impassioned arguments take a lot of time and energy (which, I presume, is the reason why many participants no longer join the discussion).

I do object to the way some things were handled in this whole episode, but will not dwell on that. My main reasons for not resigning from this discussion are that I don't want the case dismissed without a fair hearing/counter-argument (even if I'm not core-dev), and that I feel the picture is much less one-sided, even on the dev side, than the last few comments might suggest.

I question whether GitHub issues was the appropriate venue for this discussion compared with some form of RFC / design document.

I'd be happy to participate in another format, but didn't know a better way than through an issue here.

PS. Thanks for the link about emotive conjugation. "How would I describe myself in their shoes" will be an excellent self-check before speaking/posting.

@toobaz
Member

toobaz commented Oct 9, 2019

In the last dev chat, which was held just a few minutes ago, this issue was discussed and there was clear consensus for option 4, that is, "deprecate using set_index with arrays, and point to set_axis instead".
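As a rough illustration of what option 4 would mean from the user's side, here is a sketch of a wrapper that warns when values rather than labels are passed. The name set_index_keys_only, the warning text, and the set of types treated as arrays are all assumptions made for this example. This is not pandas' implementation, and nothing in this thread states that such a deprecation was released.

```python
import warnings

import numpy as np
import pandas as pd

# Types treated as "arrays of values" in this sketch (an assumption).
ARRAY_LIKE = (pd.Index, pd.Series, np.ndarray, list)


def set_index_keys_only(df, keys, **kwargs):
    """Hypothetical wrapper: warn when `keys` contains values, not labels."""
    items = keys if isinstance(keys, list) else [keys]
    if any(isinstance(item, ARRAY_LIKE) for item in items):
        warnings.warn(
            "passing arrays to set_index is deprecated in this sketch; "
            "use set_axis(values, axis=0) instead",
            FutureWarning,
            stacklevel=2,
        )
    return df.set_index(keys, **kwargs)


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    by_label = set_index_keys_only(df, "a")                 # label: silent
    by_array = set_index_keys_only(df, df["b"].to_numpy())  # array: warns
sketch_warnings = [w for w in caught if "set_axis" in str(w.message)]
```

Under this reading, a flat list of labels stays silent and only array-like items trigger the pointer to set_axis.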

(Related to the "tone" of the discussion, I definitely try to keep emotionally charged language away from my comments, but whenever should I fail to do so, I welcome being explicitly reprimanded - ideally in private. I'm not a native speaker, and in any case I definitely won't be offended by any such rebuke. On the other hand, language aside, I think that when a discussion takes more of our time and energy than it is worth, there is nothing wrong in stating it, even if it is not a "fact or technical argument" on pandas itself.)

@jorisvandenbossche
Member

@toobaz can you give some reasoning why this is the clear preference?
I would need to go through this long thread to understand it better, but maybe some arguments were summarized on the call?

@h-vetinari
Contributor Author

@toobaz
Thanks for seeing this through. Although I would have liked to participate in the dev chat about this, and although I find the decision suboptimal, I guess any decision is better than no decision at this point. Time permitting, I'll look into a PR that deprecates arrays from set_index and outputs a nice warning to use set_axis.

@jorisvandenbossche
The lack of documentation and transparency (note: not an accusation, that's just the way it has been so far) in such cases is why I'm thinking about a pandas version of PEPs/NEPs (#28568). I have been swamped recently and couldn't respond on that issue, but I will pick it up again.

@toobaz
Member

toobaz commented Oct 10, 2019

@toobaz can you give some reasoning why this is the clear preference?

To be honest, it was more a recap of the reasoning already laid out here than any new argument. Some devs (e.g., @WillAyd) were already clearly in favour of option 4.

The only new thing was a proposal (by @TomAugspurger if I recall correctly) to deprecate set_index entirely, replacing with two methods set_index_keys and set_index_values, a clean, but more disruptive, solution. But in the end, consensus on 4 was reached pretty quickly.

@h-vetinari you probably already know, but just in case: the call was publicly announced on the [pydata] mailing list. In any case, again, if there had been new arguments I would have written them here. It is hard to see the 76 comments here + others in related issues as a "lack of documentation and transparency".

I'm thinking about a pandas version of PEPs/NEPs (#28568).

Will reply there.

@h-vetinari
Contributor Author

@toobaz
I didn't know about the announcement, thanks for the info.

@toobaz: It is hard to see the 76 comments here + others in related issues as a "lack of documentation and transparency".

I do not consider dispersed discussion in several threads and comments as appropriate documentation (again: not as a criticism of you or the other devs, but rather of the current process). I added a comment about this in #28568.

@TomAugspurger
Contributor

@h-vetinari The calls and meeting notes are public.

to deprecate set_index entirely

Not quite: I was just choosing different names to highlight the different behavior. Clearly set_index should stay :)

@toobaz
Member

toobaz commented Oct 10, 2019

Not quite: I was just choosing different names to highlight the different behavior. Clearly set_index should stay :)

OK, thanks for the clarification ;-)

@mroeschke removed the API Design, Indexing, and DataFrame labels Jun 23, 2021

10 participants