Index.unique() should always return an Index object of the same type #13395
Comments
shoyer
added Indexing Difficulty Novice Effort Low
labels
Jun 8, 2016
|
At the moment, I think DatetimeIndex is rather the exception, as most seem to return a numpy array (and CategoricalIndex a Categorical):
|
|
this is a dupe of #4126 |
jreback
closed this
Jun 8, 2016
jreback
reopened this
Jun 8, 2016
|
closing the other one actually. |
jreback
added Compat Difficulty Advanced Effort Medium and removed Difficulty Novice Effort Low
labels
Jun 8, 2016
jreback
added this to the
Next Major Release
milestone
Jun 8, 2016
jreback
added Difficulty Intermediate and removed Difficulty Advanced
labels
Jun 8, 2016
|
|
yeah I don't think we ever changed |
jreback
modified the milestone: 0.19.0, Next Major Release
Jun 8, 2016
|
One reason not to change This definitely needs to go in a major release because it will break some user code. |
This was referenced Jul 5, 2016
|
In the PR of @sinhrks, it is now proposed to return an Index of the same type for both Index and Series. While for Index it seems logical to always return an Index of the same type, I am not very enthusiastic about |
|
I don't agree. You are then giving meaning to the index of the Series that you are returning when it doesn't have any meaning (the ordering actually does have meaning, but that is true in either case), so returning an Index is the correct action here. |
|
Of course this boils down to not having a good array-like container that can hold all pandas-supported types (Index is such a container, and can be used for that, but IMO users do not see it that way; to users it is the labels of the index/columns of a DataFrame/Series). Options:
|
|
I disagree, Index IS the container object and is most appropriate. Series is plain confusing |
|
i think it's natural that
|
I agree. Returning an index for For Series.unique, I don't think we have any good options prior to pandas 2.0. I would stick with returning numpy arrays for now. |
you seem to be against natural things and seem to want pandas to be like numpy |
shoyer
referenced
this issue
Aug 13, 2016
Closed
DOC: Design drafts to assist with next-gen pandas internals discussion #13944
You misunderstand me. This is about what feels consistent with the current version of pandas:
|
|
ok, I'll change my opinion here. I can see |
|
As I mentioned in #13944, in pandas 2.0, I think the logical type for the return value of |
|
@shoyer yes, and if pandas 2.0 was around the corner and we DIDN'T have a 1.0 I would agree. However, we very, very rarely expose raw ndarrays to the user ATM. Aside from |
|
My opinion is that we should not introduce any breaking changes in 1.0 that
|
|
In the current pandas, I would vote for returning a Series, although that is also not ideal. But I agree with @shoyer that if we change it again for 2.0, it is maybe not very beneficial to change this for 1.0 as well. To be clear, we should take care that "but we will change that for 2.0" does not become a reason to stop making needed changes now. But, in this case, I personally don't think the return value of
* we could also return an object array of timestamps for that specific case |
|
ok, for 0.19.0 we need to change Ok, so the only question then is to make If it should eventually return a But these are just way too many ifs. This needs to be resolved asap. @wesm why don't you weigh in here. |
jreback
modified the milestone: 0.19.0, 0.20.0
Aug 18, 2016
|
Just read through this. In pandas 2.0 Several problems with
In [10]: s = pd.Series([1,2,3,4] * 4)
In [11]: unique_vals = s.unique()
In [12]: from pandas.util.testing import rands
In [13]: df = pd.DataFrame({'uniques': unique_vals}, index=[rands(10) for i in range(len(unique_vals))])
In [14]: df
Out[14]:
            uniques
mB2LJrlOw5        1
qPF14xkGNl        2
0nE5HHGM0d        3
AbQEAYpYmW        4
If the unique values are a Series instead:
In [18]: unique_vals = pd.Series(unique_vals)
In [19]: df = pd.DataFrame({'uniques': unique_vals}, index=[rands(10) for i in range(len(unique_vals))])
In [20]: df
Out[20]:
            uniques
LiQZXm6K5V      NaN
B8HABWAK2o      NaN
4hIrDH3Ue0      NaN
JpaO9iMWTP      NaN
Contrived as this may be, it is enough of a concern to make me -0 on this and very nearly -1.
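For readers who want to reproduce the hazard shown in the session above, here is a minimal sketch with fixed labels in place of the random strings. The names `labels`, `df_from_array`, and `df_from_series` are made up for illustration, and the behaviour described is that of the DataFrame constructor aligning a Series on its index while taking an ndarray positionally.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed labels standing in for the random strings in the session.
labels = ["w", "x", "y", "z"]

s = pd.Series([1, 2, 3, 4] * 4)
unique_arr = np.asarray(s.unique())  # label-naive array of the unique values

# An ndarray is placed positionally under the new index.
df_from_array = pd.DataFrame({"uniques": unique_arr}, index=labels)

# A Series carries its own 0..3 index, so the constructor reindexes it
# against `labels`; nothing matches and every value becomes missing.
unique_ser = pd.Series(unique_arr)
df_from_series = pd.DataFrame({"uniques": unique_ser}, index=labels)

print(df_from_array)
print(df_from_series)
```

This is why user code written against an ndarray return value could silently change if `unique` started returning a Series.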
I agree it sort of stinks that we have both ndarray and non-ndarray (e.g. categorical) return values for |
|
(I agree that Index.unique should always return an Index) |
|
I'm +1 to leave One issue related to returning |
|
Nice example from @wesm of how Series vs. array could break code, so let's not do that (although the reindexing behaviour of the constructors is maybe also a point for discussion ...). For the issue in #13565 (return value for unique of a tz-aware Series), options are:
I would go for the first or the second, but I have no real preference. |
|
This was discussed here in the original issue.
Originally I had this returning a DTI; however, it was suggested that numpy compat was more important here. But it is quite simple to just return an So changing to 2) IMHO is the best; we should also change |
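Two of the return shapes mentioned in this thread for tz-aware data (an object array of timestamps, and a DatetimeIndex) can be compared with a small sketch. This is not a statement of what `Series.unique` itself returns in any given pandas version; both results below are built explicitly, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# A tz-aware Series with one duplicated timestamp.
s = pd.Series(
    pd.to_datetime(["2016-01-01", "2016-01-01", "2016-01-02"]).tz_localize("US/Eastern")
)

# Shape A: a numpy object array holding tz-aware Timestamp objects.
as_objects = s.astype(object).unique()

# Shape B: a DatetimeIndex (DTI), which keeps the timezone in its dtype.
as_index = pd.DatetimeIndex(s).unique()

print(type(as_objects).__name__, as_objects.dtype)
print(type(as_index).__name__, as_index.dtype)
```

The object array stays numpy-compatible at the cost of a generic dtype, while the DatetimeIndex preserves the timezone-aware dtype.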
|
@jorisvandenbossche the conforming / reindexing behavior of the DataFrame ctor is a super valuable feature (and one of the very earliest ones from pandas 0.1) in my experience (it also saves a ctor-then-reindex step which results in an extra sweep of the data and copy). You can pass in a bunch of irregularly indexed data and "pluck" out the data that matches a particular "master" index that you have set. The alternative is to pass in label-naive arrays, which seems like an acceptable compromise |
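The "pluck out the data that matches a master index" behaviour described above can be sketched as follows. The series names, values, and the `master` index are invented for illustration.

```python
import pandas as pd

# Hypothetical "master" index that the resulting frame should conform to.
master = pd.Index(["b", "c", "d"])

# Two irregularly indexed inputs.
prices = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
volumes = pd.Series([10, 20, 30], index=["c", "d", "e"])

# The constructor aligns each Series to `master` in one step:
# labels outside `master` are dropped, and labels missing from an input become NaN.
df = pd.DataFrame({"prices": prices, "volumes": volumes}, index=master)
print(df)
```

Passing `prices.to_numpy()` instead would skip the alignment entirely, which is the "label-naive arrays" alternative mentioned in the comment (and would require the lengths to match the index).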
|
@wesm getting a bit off topic ... but: reindexing is undoubtedly a very valuable operation, I am personally just not sure if it should be the behaviour of the default constructor (it could also be a dedicated method). It also leads to surprises and bugs/ambiguous behaviour. I recall some discussion in #9237, where it was not clear which should happen first (reindex based on given columns, or determine index values based on passed objects), and with Series objects with names resulting in an empty frame when specifying |
|
I see. We should discuss this separately, it seems like there was a consistency issue in the treatment of a single Series versus a dict of Series (i.e. determining a row index from the input series prior to selecting only the columns in |
shoyer commented Jun 8, 2016
This should also be noted in the docstring for the method.
Currently, it sometimes returns numpy arrays:
Most of the work here is probably writing comprehensive tests to check each index type.
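A rough sketch of what such a per-type check might look like is below. The helper `unique_return_types` and the list of sample indexes are hypothetical, not part of pandas or of the referenced PR, and what gets printed depends on the installed pandas version.

```python
import pandas as pd


def unique_return_types(indexes):
    """Map each index's class name to the class name returned by its .unique()."""
    return {type(idx).__name__: type(idx.unique()).__name__ for idx in indexes}


# Hypothetical sample covering several index types, each with a duplicate value.
sample_indexes = [
    pd.Index(["a", "b", "a"]),
    pd.DatetimeIndex(["2016-06-08", "2016-06-08", "2016-06-09"]),
    pd.CategoricalIndex(["x", "y", "x"]),
    pd.TimedeltaIndex(["1 day", "1 day", "2 days"]),
    pd.PeriodIndex(["2016-06", "2016-06", "2016-07"], freq="M"),
]

results = unique_return_types(sample_indexes)
for index_type, returned_type in results.items():
    # The proposal in this issue is that these two names should always match.
    print(f"{index_type}.unique() -> {returned_type}")
```

A real test suite would assert that the two names match for every index type rather than print them.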
xref: https://github.com/pydata/pandas/pull/13361/files/17209f92330c5e949934aec9dea039b35faf6e40#r66179418