Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: map() on Index returns an Index, not array #14506

Closed
wants to merge 11 commits into from

Conversation

nateyoder
Copy link
Contributor

@nateyoder nateyoder commented Oct 26, 2016

This is a follow on to #12798.

@nateyoder
Copy link
Contributor Author

nateyoder commented Oct 26, 2016

Note that the following changes result in some additional changes to the API which should potentially be addressed so I would appreciate suggestions.

  • MultiIndex maps to Index: because mapping a function can reduce or increase the dimensionality of the index I am returning an Index from these operations rather than inspecting the data to determine whether the output should potentially be a MultiIndex
  • Categorical and CategoricalIndex: When a categorical cannot be returned returns an Index of the appropriate type rather than an np.ndarray

@jreback
Copy link
Contributor

jreback commented Oct 26, 2016

@nateyoder use ensure_index to sniff for Index/Multi_index. You can simply return a CategoricalIndex (rather than a plain Categorical).

This will need to be in 0.20.0


API changes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.20.0

@@ -943,7 +943,7 @@ def map(self, mapper):

Returns
-------
applied : Categorical or np.ndarray.
applied : Categorical or Index.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this is ok

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -766,6 +766,26 @@ def test_sub(self):
self.assertRaises(TypeError, lambda: idx - idx.tolist())
self.assertRaises(TypeError, lambda: idx.tolist() - idx)

def test_map_identity_mapping(self):
for name, cur_index in self.indices.items():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment with the appropriate github issue numbers (don't go crazy but where appropriate)


def test_map_that_returns_tuples_creates_index_not_multi_index(self):
boolean_index = tm.makeIntIndex(3).map(lambda x: (x, x == 1))
expected = Index([(0, False), (1, True), (2, False)],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use ensure_index and this will be a MI

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves API Design Dtype Conversions Unexpected or buggy dtype conversions labels Oct 26, 2016
@nateyoder
Copy link
Contributor Author

nateyoder commented Oct 26, 2016

@jreback I may have misunderstood what you were saying but I gave using _ensure_index a shot with the below code but it didn't seem to do anything. I believe this may be because the map returns a np.ndarray rather than a list which would then be converted to a MultiIndex? Should I go ahead an put a tolist in there despite the performance drawbacks or am I totally barking up the wrong tree?

_ensure_index(Index(self._arrmap(self.values, mapper), **attributes))

@jreback
Copy link
Contributor

jreback commented Oct 26, 2016

hmm, actually it IS prob worth introspecting this and if it is ndim==1, then its no problem, if its is an array-like of tuples, then you can pass to MultiIndex.from_tuples directly.

@nateyoder
Copy link
Contributor Author

Because the outputs of _arrmap were actually always contained in a 1d np.ndarray I actually checked for the presence of a tuple. Let me know if you'd prefer something else.

@codecov-io
Copy link

codecov-io commented Oct 27, 2016

Current coverage is 85.31% (diff: 100%)

Merging #14506 into master will increase coverage by <.01%

@@             master     #14506   diff @@
==========================================
  Files           144        144          
  Lines         51016      51022     +6   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43522      43528     +6   
  Misses         7494       7494          
  Partials          0          0          

Powered by Codecov. Last update 3ba2cff...95e4440

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For DatetimeIndex (and probably timedelta and period as well), the return value is still an array if the result is not a datetime anymore (eg with .map(lambda x: 1) or .map(lambda x: x.hour)

applied : array
applied : Index
The output of the mapping function applied to the index.
If the function returns a tuple a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing some content

expected = Index([(0, ), (1, ), (2, )])
self.assert_index_equal(boolean_index, expected)

def test_map_that_reduces_multi_index_to_single_index_returns_index(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you shorten the test names a bit?
(or else just include them in one single test, then you can add the long name/explanation as a comment)

@@ -38,6 +38,25 @@ Other enhancements
Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- ``map`` on an ``Index`` now returns an ``Index``, not an array (:issue:`12766`)
.. ipython:: python
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some sphinx sytax comments (sphinx is rather picky ..):

  • blank line above .. ipython ...
  • Can you also indent this line (and all lines below) with two spaces? (to be at the level of the content of the bullet point) -> so everything will be part of this bullet point

@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Oct 27, 2016
@nateyoder
Copy link
Contributor Author

Great catch on the DatetimeIndex! Sorry I missed that. I'll try to update the PR today.

@nateyoder
Copy link
Contributor Author

Unfortunately when making this change I ended up causing two other tests (series.test_apply.TestSeriesApply.test_apply_datetimetz and series.test_apply.TestSeriesMap.test_map_datetimetz) to fail because they became np.int64 instead of np.int32. Is this a significant failure that I should try to work around or should I change the dtype in the tests?

@nateyoder
Copy link
Contributor Author

Just wanted to check and see if there were any additional actions you would like taken on this PR? It appeared to me as though the AppVeyor failure was not due to my changes but am not familiar with AppVeyor.

@@ -38,8 +38,27 @@ Other enhancements
Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _whatsnew_0200.api:
- ``map`` on an ``Index`` now returns an ``Index``, not an array (:issue:`12766`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make a new sub-section

def test_map_identity_mapping(self):
# GH 12766
for name, cur_index in self.indices.items():
self.assert_index_equal(cur_index, cur_index.map(lambda x: x))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we just use assert_index_equal (not the self), same routine, but more consistent

Copy link
Contributor Author

@nateyoder nateyoder Nov 12, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I changed the ones related to my commit but left the other ones in test_base.py alone. Let me know if you'd like them changed as well.

exp = pd.Categorical(list('ababc'), categories=list('cba'),
ordered=True)
tm.assert_categorical_equal(result, exp)
exp = pd.CategoricalIndex(list('ababc'), categories=list('cba'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

show an example like this in the whatsnew as well (e.g. CategoryIndex.map -> CI rather than Category now)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jreback. Just wanted to touchbase here since I added this example and then you asked me to remove in another comment below. Did I add it in the wrong place or did you just decide it was a little overkill? Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's too long for the what's new; so need to pare it down

Copy link
Contributor Author

@nateyoder nateyoder Dec 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 sounds good. I shortened it up like you suggested. Let me know if you'd like any other changes.

@@ -124,7 +124,7 @@ def test_apply_datetimetz(self):

# change dtype
result = s.apply(lambda x: x.hour)
exp = pd.Series(list(range(24)) + [0], name='XX', dtype=np.int32)
exp = pd.Series(list(range(24)) + [0], name='XX', dtype=np.int64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I don't think this should have changed, these are normally int32s

any idea? (also this might not be the same on windows)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true it previously gave int32, but is there any reason for that? We almost always use int64 as the default integer size, and also currently Timestamp.hour gives you back int64 when you replace the map with an explicit loop:

In [20]: dtidx = pd.date_range(start='2012-01-01', periods=4)

In [29]: dtidx.map(lambda x: x.hour)
Out[29]: array([0, 0, 0, 0], dtype=int32)

In [30]: np.array([x.hour for x in dtidx])
Out[30]: array([0, 0, 0, 0])

In [31]: np.array([x.hour for x in dtidx]).dtype
Out[31]: dtype('int64')

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably the result of using asobject below

In [37]: dtidx.map(lambda x: x.hour)
Out[37]: array([0, 0, 0, 0], dtype=int32)

In [36]: dtidx.asobject.map(lambda x: x.hour)
Out[36]: array([0, 0, 0, 0], dtype=int64)
``

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is actually an implementation detail (as they r stored as int32)

we can change but that should be separate PR

ideally just like to preserve here

raise TypeError
return result
except Exception:
return _algos.arrmap_object(self.asobject.values, f)
return self.asobject.map(f)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is better, using .values can change dtypes (esp timezones)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might actually close another issue...

@@ -104,7 +104,7 @@ def test_dateindex_conversion(self):
for freq in ('B', 'L', 'S'):
dateindex = tm.makeDateIndex(k=10, freq=freq)
rs = self.dtc.convert(dateindex, None, None)
xp = converter.dates.date2num(dateindex._mpl_repr())
xp = Index(converter.dates.date2num(dateindex._mpl_repr()))
tm.assert_almost_equal(rs, xp, decimals)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compare as index

@@ -1513,8 +1513,8 @@ def test_map(self):

f = lambda x: x.days
result = rng.map(f)
exp = np.array([f(x) for x in rng], dtype=np.int64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might show some of these examples in the whatsnew as well (e.g. period/timedelta)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jreback
Copy link
Contributor

jreback commented Nov 12, 2016

@nateyoder sorry for the delay. this looks pretty good. couple of comments. Pls have a look at the dtype changes w.r.t. .apply(lambda x: x.hour); these should be int32.

@sinhrks thoughts

@nateyoder
Copy link
Contributor Author

Hi @jreback. Sorry for my confusing message. I understand that it is the current behavior of indices to ALWAYS return int64 but if you would like to be able to return int32 rather than int64 types from map operations on Indices as you had said two weeks ago I believe that would have to change. Would you like me to try to make those changes or are you alright with map operations always returning int64 if an integer type is returned? Apologies again for the confusion.

@jreback
Copy link
Contributor

jreback commented Nov 27, 2016

not sure what you are referring

@jorisvandenbossche
Copy link
Member

@jreback I think the confusion is this: you ask that the return dtype should be int32 (as before: return type of an int32 array), but for an Index this can never be the case as even an int32 array will result in a int64 index.

So for Index.map, the resulting dtype has to change in this PR.

The question remains of course if it is possible to keep the dtype of Series.map/apply. But, this is already a bit inconsistent at the moment:

In [45]: dtidx = pd.date_range("2012-01-01", freq='H', periods=5)

In [46]: dtidx.map(lambda x: x.hour)
Out[46]: array([0, 1, 2, 3, 4], dtype=int32)

In [47]: dtidx2 = dtidx.tz_localize("Europe/Brussels")

In [48]: dtidx2.map(lambda x: x.hour)
Out[48]: array([0, 1, 2, 3, 4], dtype=int32)

In [49]: pd.Series(dtidx).map(lambda x: x.hour)
Out[49]: 
0    0
1    1
2    2
3    3
4    4
dtype: int64

In [50]: pd.Series(dtidx2).map(lambda x: x.hour)
Out[50]: 
0    0
1    1
2    2
3    3
4    4
dtype: int32

So for a series it gives in64 for datetimes, but int32 for timezone aware datetimes. So personally I wouldn't mind of all those case become int64 (for Index.map it will be in any case, only leaving tz aware Series to return int32)

@jreback
Copy link
Contributor

jreback commented Nov 28, 2016

@nateyoder @jorisvandenbossche response is good (I dont think this is a big deal that we r actually returning int32 from the datetime computations - can evaluate that on another issue)

I think my original comment was why we needed to change a test from int32 to int64; since we are always returning an Index which by definition have restricted dtypes the test needs to change

so ought to document this with an example of that (in whatsnew)

@nateyoder
Copy link
Contributor Author

Sounds good. I updated the whatsnew and believe I have no addressed all of the comments.

@@ -12,7 +12,7 @@ Highlights include:

Check the :ref:`API Changes <whatsnew_0200.api_breaking>` and :ref:`deprecations <whatsnew_0200.deprecations>` before updating.

.. contents:: What's new in v0.19.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is already fixed in master, so you may need to rebase

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. just shorten up the whatsnew examples a bit. If you want refresh the docs a bit (with some examples), you can search where we use Index.map (I don't know if its even shown) in indexing.rst or advanced.rst. Can do this in a followup PR (or here if you want)

Map on Index types now return other Index types
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ``map`` on an ``Index`` now returns an ``Index``, not an array (:issue:`12766`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array -> np.array

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or 'numpy array'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


.. ipython:: python

mi = MultiIndex.from_tuples([(1, 2), (2, 4)])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put this with where you define the Index as well. You can then put the previous/new for Multi with Index



- ``map`` on an ``CategoricalIndex`` now returns a ``CategoricalIndex``, not a Categorical

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can skip this example (the CategoricalIndex) and below

applied : array
applied : Index
The output of the mapping function applied to the index.
If the function returns a tuple with more than one element
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

say the output Index type will be inferred

if isinstance(result, np.ndarray):
self._shallow_copy(result)

if not isinstance(result, Index):
raise TypeError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is caught in an outer scope and so is not seen. but can you add an informative message to this TypeError

@jorisvandenbossche
Copy link
Member

@nateyoder Could you update for the latest comments of @jreback? Would be nice to get this in!

@nateyoder
Copy link
Contributor Author

@jorisvandenbossche Sorry for the delay. Let me know if you notice anything else that needs to be updated.

Thanks I've enjoyed my first contribution!

@@ -91,8 +91,75 @@ Other enhancements
Backwards incompatible API changes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _whatsnew_0200.api:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs a ref tag as well
.. _whatsnew.api_breaking.index_map

mi.map(lambda x: x[0])


- ``map`` on a Series withe datetime64 values may return int64 dtypes rather than int32
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

withe -> with

@jreback
Copy link
Contributor

jreback commented Dec 16, 2016

lgtm. just a small doc change.

@jreback jreback closed this in 6f4e36a Dec 16, 2016
@jreback
Copy link
Contributor

jreback commented Dec 16, 2016

thanks!

great effort!

@jorisvandenbossche
Copy link
Member

@nateyoder Thanks a lot!

ischurov pushed a commit to ischurov/pandas that referenced this pull request Dec 19, 2016
closes pandas-dev#12766
closes pandas-dev#12798

This is a follow on to pandas-dev#12798.

Author: Nate Yoder <nate@whistle.com>

Closes pandas-dev#14506 from nateyoder/index_map_index and squashes the following commits:

95e4440 [Nate Yoder] fix typo and add ref tag in whatsnew
b36e83c [Nate Yoder] update whatsnew, fix documentation
4635e6a [Nate Yoder] compare as index
a17ddab [Nate Yoder] Fix unused import and docstrings per pep8radius docformatter; change other uses of assert_index_equal to testing instead os self
ab168e7 [Nate Yoder] Update whatsnew and add git PR to tests to denote changes
504c2a2 [Nate Yoder] Fix tests that weren't run by PyCharm
23c133d [Nate Yoder] Update tests to match dtype int64
07b772a [Nate Yoder] use the numpy results if we can to avoid repeating the computation just to create the object
a110be9 [Nate Yoder] make map on time tseries indices return index if dtype of output is not a tseries; sphinx changes; fix docstring
a596744 [Nate Yoder] introspect results from map so that if the output array has tuples we create a multiindex instead of an index
5fc66c3 [Nate Yoder] make map return an index if it operates on an index, multi index, or categorical index; map on a categorical will either return a categorical or an index (rather than a numpy array)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Dtype Conversions Unexpected or buggy dtype conversions Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: Index.map should return Index rather than array
4 participants