Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: Deprecate convert parameter in take #17352

Merged
merged 1 commit into from
Oct 1, 2017

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Aug 27, 2017

xref #16948.

The parameter is not respected, nor is it a parameter in many 'take' implementations.

cc @jorisvandenbossche

@gfyoung gfyoung added the Deprecate Functionality to remove in pandas label Aug 27, 2017
@gfyoung gfyoung force-pushed the take-convert-depr branch 2 times, most recently from 9e0bcad to 4471360 Compare August 28, 2017 04:07
@codecov
Copy link

codecov bot commented Aug 28, 2017

Codecov Report

Merging #17352 into master will decrease coverage by 0.01%.
The diff coverage is 86.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17352      +/-   ##
==========================================
- Coverage   91.03%   91.02%   -0.02%     
==========================================
  Files         162      162              
  Lines       49567    49577      +10     
==========================================
+ Hits        45125    45126       +1     
- Misses       4442     4451       +9
Flag Coverage Δ
#multiple 88.8% <86.66%> (ø) ⬆️
#single 40.23% <23.33%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby.py 92.21% <100%> (ø) ⬆️
pandas/core/sparse/series.py 95.13% <100%> (+0.04%) ⬆️
pandas/core/series.py 95.01% <100%> (ø) ⬆️
pandas/core/indexing.py 93.94% <75%> (ø) ⬆️
pandas/core/frame.py 97.72% <75%> (-0.1%) ⬇️
pandas/core/generic.py 92% <87.5%> (+0.01%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 473a7f3...4471360. Read the comment docs.

@codecov
Copy link

codecov bot commented Aug 28, 2017

Codecov Report

Merging #17352 into master will decrease coverage by 0.01%.
The diff coverage is 92.45%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17352      +/-   ##
==========================================
- Coverage   91.27%   91.25%   -0.02%     
==========================================
  Files         163      163              
  Lines       49766    49785      +19     
==========================================
+ Hits        45422    45433      +11     
- Misses       4344     4352       +8
Flag Coverage Δ
#multiple 89.05% <92.45%> (ø) ⬆️
#single 40.34% <60.37%> (-0.05%) ⬇️
Impacted Files Coverage Δ
pandas/core/sparse/series.py 95.15% <100%> (+0.05%) ⬆️
pandas/core/series.py 95.04% <100%> (+0.1%) ⬆️
pandas/core/groupby.py 92.23% <100%> (ø) ⬆️
pandas/core/frame.py 97.73% <75%> (-0.1%) ⬇️
pandas/core/indexing.py 93.94% <75%> (ø) ⬆️
pandas/core/generic.py 92.12% <96%> (+0.04%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 030e374...9325f21. Read the comment docs.

return self.data.take(self.sort_idx, axis=self.axis, convert=False)
with warnings.catch_warnings():
warnings.simplefilter("ignore")
return self.data.take(self.sort_idx, axis=self.axis, convert=False)
Copy link
Member Author

@gfyoung gfyoung Aug 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback @jorisvandenbossche : This is a tricky one for me here. On the one hand, I want to remove convert=False, but because axis of the groupby object might be invalid when calling Series.take, this is not possible for me to do (when True, it re-indexes the value according to the length of the slice along the index).

On the other hand, this catch_warnings(), while an easy way to get things to pass for now, is not a solution I want to merge. Suggestions on how to handle this problem (or is it best that we just expose convert for all relevant implementations, though I'm not a fan).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, might have found the answer myself 😄

@gfyoung gfyoung force-pushed the take-convert-depr branch 2 times, most recently from 78f0c4e to bfa4bf6 Compare August 28, 2017 05:02
@@ -3354,7 +3354,7 @@ def dropna(self, axis=0, how='any', thresh=None, subset=None,
else:
raise TypeError('must specify how or thresh')

result = self.take(mask.nonzero()[0], axis=axis, convert=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these internal uses should be removed. We're guaranteed the indexer is inbounds, so it's an unnecessary performance hit to check it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by this? DataFrame.take doesn't even respect the convert parameter. Thus, this will have no effect on anything. 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this example doesn't matter b/c it's on a frame, but does with the usages of _data.take

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And Series.take

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's a bounds check anymore in Series.take or _data.take because I removed all of them for that same reason. We have so many bounds checks all over the place that removing just one of them is okay now.

if kwargs:
nv.validate_take(tuple(), kwargs)

# check/convert indicies here
if convert:
indices = maybe_convert_indices(indices, len(self._get_axis(axis)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this is safe to remove - with non numpy types, we call to an internal take impl that doesn't bounds check. Can you try pd.Series([1,2,3], dtype='category').take([-1, -2]) on your branch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so that breaks:

2    NaN
1    1.0
dtype: category
Categories (3, int64): [1, 2, 3]

That being said, that call works without dtype='category', meaning we do handle negative indices correctly in some cases already. I might as well add that handling for category too.

@jreback
Copy link
Contributor

jreback commented Aug 29, 2017

So I originally added this because I wanted the ability to a) convert neg to positive, and b) bounds check; these are not the default because of perf (IOW internally we often know that we don't need either of these things, but from a public perspetive you do need this).

These shouldn't have been on the public NDFrame though (so good to remove). So the question is, do we need a ._take() to do these things? (and .take() would just be a wrapper)?

@chris-b1
Copy link
Contributor

I'd be in favor of a _take with options to be unsafe. It's not like boundschecking is that expensive, but it's such a core operation I hate to give up any performance.

@gfyoung
Copy link
Member Author

gfyoung commented Aug 29, 2017

I suppose that in the name of performance, that's fair. I think I would just need to implement this solely for DataFrame and Series right?

@chris-b1
Copy link
Contributor

Yes, that's correct. BlockManager and SingleBlockManager will need to keep their take implementation, but I would keep it unprefixed since they are already internal.

@gfyoung
Copy link
Member Author

gfyoung commented Aug 29, 2017

@chris-b1 : Sorry, somehow your answer only confused me more. Yes or no, do I need to implement _take for just Series and DataFrame ?

@chris-b1
Copy link
Contributor

chris-b1 commented Aug 29, 2017 via email

@gfyoung
Copy link
Member Author

gfyoung commented Aug 30, 2017

@chris-b1 : Alright, here goes. Let's see what CI has to say about my implementation.

@gfyoung
Copy link
Member Author

gfyoung commented Aug 30, 2017

@chris-b1 @jreback : Successfully implemented _take with all green. PTAL.

@@ -2033,7 +2033,7 @@ def _ixs(self, i, axis=0):
return self.loc[:, lab_slice]
else:
if isinstance(label, Index):
return self.take(i, axis=1, convert=True)
return self.take(i, axis=1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

every reference to an internal take should be to ._take (this is pretty much all of them)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really needed?
Can't we keep take and only use _take where it is needed? IMO it just makes it more complex to understand the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @jorisvandenbossche . I'm not sure I really understand what you mean by "reference to an internal take" since the original references are to a public function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have now 2 functions to do the same thing , a public and private version

it is much more clear to simply use _take internally in all cases

the public function is just an interface

otherwise a future reader will not know which to use - recipe for confusion

pls change all uses to _take

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be confusion anyhow, because then people will wonder what the difference is with pubkic take, and in majority of the cases, there is no difference. IMO that is more clear by using the public function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is AFAIK only one class where negative indexing isn't handled properly, and that is Categorical. Makes we wonder whether we should revisit the option of adding supporting given this disagreement about design though (@chris-b1 thoughts?)

Copy link
Member Author

@gfyoung gfyoung Sep 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me boil it down to this (in reference to my previous comment): besides potential test failures, is there reason why we call take_1d in pandas.core.categorical instead of just calling self._codes.take(...) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls change all references to _take

.take is ONLY a wrapper function and should not be used anywhere in the code base
this eliminates any possibility of confusion and is inline with the existing style
mixing usages is cause for trouble

your other references are simply not relevant here and are again purely internal uses (take_1d)
if u want to bring them up in another issue by all means

let's keep the scope to changing refs to the NDFrame .take changes

Copy link
Member Author

@gfyoung gfyoung Sep 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there were multiple places where I had to use ._take(), then I think I would be more inclined to agree with you @jreback . However, at this point, there is only one place where this had to happen, and if necessary, I can easily "clear up" any confusion by adding a comment.

IMO one usage isn't convincing enough for me to convert all of other invocations (both in NDFrame and by any other pandas class to use ._take()). I think I still have to side with @jorisvandenbossche on this one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung if this were a new public function, and you wrote a private impl _take as well. Then you certainly would use the private impl everywhere. I don't see a difference here. Pls make the change.

@@ -2058,6 +2058,25 @@ def __delitem__(self, key):
except KeyError:
pass

def _take(self, indices, axis=0, convert=True, is_copy=False):
"""
The internal version of `self.take` that will house the `convert`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more descriptive doc-string with parameters and such

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, done.

result._set_is_copy(self)

return result

def take(self, indices, axis=0, convert=True, is_copy=True, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also deprecate is_copy; that is another internal parameter; its only necessary for _take

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair point. If you don't mind, I'll do this in a subsequent PR once the infrastructure for _take has been merged in.


See also
--------
numpy.ndarray.take
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have this? now that it is generic is shouldn't be needed

Copy link
Member Author

@gfyoung gfyoung Aug 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have what exactly? You commented on a not-so clear part of the diff.

@@ -604,14 +604,38 @@ def sparse_reindex(self, new_index):

def take(self, indices, axis=0, convert=True, *args, **kwargs):
"""
Sparse-compatible version of ndarray.take
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use a shared doc string for all public .take()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


self._consolidate_inplace()
new_data = self._data.take(indices,
axis=self._get_block_manager_axis(axis),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass through convert

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@gfyoung gfyoung force-pushed the take-convert-depr branch 2 times, most recently from e0b209b to 35812d0 Compare September 23, 2017 10:08
@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

so this still needs to change virtually all uses of .take() -> ._take(). We want to limit the exposure to user-facing .take() in any code that is not explicitly user facing.

if convert:
indices = maybe_convert_indices(indices, len(self._get_axis(axis)))

indices = _ensure_platform_int(indices)
new_index = self.index.take(indices)
new_values = self._values.take(indices)
return (self._constructor(new_values, index=new_index, fastpath=True)
.__finalize__(self))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason you can't just call the super method here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ._constructor ? I was trying to preserve existing (working) code as much as possible when making these changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no the entire routine
give it s try

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://travis-ci.org/pandas-dev/pandas/builds/279118227

I gave it a try, and there were a bunch of failures.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, not really sure why it shoulnt' work. its almost the same code

@gfyoung
Copy link
Member Author

gfyoung commented Sep 23, 2017

https://travis-ci.org/pandas-dev/pandas/jobs/279023089

@jreback : Not sure why doing this conversion to _.take() is causing test failures.

@gfyoung gfyoung force-pushed the take-convert-depr branch 9 times, most recently from 889fb83 to ec299bc Compare October 1, 2017 05:05
xref pandas-devgh-16948. The parameter is not respected,
nor is it a parameter in many 'take' implementations.
@gfyoung
Copy link
Member Author

gfyoung commented Oct 1, 2017

@jreback : I've done all of the converting (without breaking tests). PTAL.

@jreback jreback added this to the 0.21.0 milestone Oct 1, 2017
@jreback jreback merged commit 458c1dc into pandas-dev:master Oct 1, 2017
@jreback
Copy link
Contributor

jreback commented Oct 1, 2017

thanks @gfyoung !

@gfyoung gfyoung deleted the take-convert-depr branch October 1, 2017 20:32
ghost pushed a commit to reef-technologies/pandas that referenced this pull request Oct 2, 2017
* 'master' of github.com:pandas-dev/pandas: (188 commits)
  Separate out _convert_datetime_to_tsobject (pandas-dev#17715)
  DOC: remove whatsnew note for xref pandas-dev#17131
  BUG: Regression in .loc accepting a boolean Index as an indexer (pandas-dev#17738)
  DEPR: Deprecate cdate_range and merge into bdate_range (pandas-dev#17691)
  CLN: replace %s syntax with .format in pandas.core: categorical, common, config, config_init (pandas-dev#17735)
  Fixed the memory usage explanation of categorical in gotchas from O(nm) to O(n+m) (pandas-dev#17736)
  TST: add backward compat for offset testing for pickles (pandas-dev#17733)
  remove unused time conversion funcs (pandas-dev#17711)
  DEPR: Deprecate convert parameter in take (pandas-dev#17352)
  BUG:Time Grouper bug fix when applied for list groupers (pandas-dev#17587)
  BUG: Fix some PeriodIndex resampling issues (pandas-dev#16153)
  BUG: Fix unexpected sort in groupby (pandas-dev#17621)
  DOC: Fixed typo in documentation for 'pandas.DataFrame.replace' (pandas-dev#17731)
  BUG: Fix series rename called with str altering name rather index (GH17407) (pandas-dev#17654)
  DOC: Add examples for MultiIndex.get_locs + cleanups (pandas-dev#17675)
  Doc improvements for IntervalIndex and Interval (pandas-dev#17714)
  BUG: DataFrame sort_values and multiple "by" columns fails to order NaT correctly
  Last of the timezones funcs (pandas-dev#17669)
  Add missing file to _pyxfiles, delete commented-out (pandas-dev#17712)
  update imports of DateParseError, remove unused imports from tslib (pandas-dev#17713)
  ...
@jreback
Copy link
Contributor

jreback commented Oct 3, 2017

pandas/tests/sparse/test_series.py::TestSparseSeries::()::test_numpy_take
/home/travis/miniconda3/envs/pandas/lib/python3.5/site-packages/numpy/core/fromnumeric.py:123: FutureWarning: The 'convert' parameter is deprecated and will be removed in a future version.
return take(indices, axis, out, mode)

coming up on the 3.5 build of master

seems we r not suppressing somewhere (or prob needs to change to use _take maybe)

can u take care of @gfyoung

@gfyoung
Copy link
Member Author

gfyoung commented Oct 3, 2017

@jreback : The cause of the warning is that an older version of numpy passes in None to convert by default when calling np.take (see here).

Thus, this is a compatibility issue, as it disappears in later versions of numpy (I actually worked on fixing fromnumeric compatibility issues at some point).

I'll put a check in the test to expect that warning for older versions of numpy, as checking if not convert is pretty idiomatic, and I wouldn't want to change it just because an older version of numpy is issuing a warning when we call .take() in a non-recommended way.

gfyoung added a commit to forking-repos/pandas that referenced this pull request Oct 3, 2017
gfyoung added a commit to forking-repos/pandas that referenced this pull request Oct 3, 2017
jreback pushed a commit that referenced this pull request Oct 3, 2017
ghost pushed a commit to reef-technologies/pandas that referenced this pull request Oct 16, 2017
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
xref pandas-devgh-16948. The parameter is not respected,
nor is it a parameter in many 'take' implementations.
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
xref pandas-devgh-16948. The parameter is not respected,
nor is it a parameter in many 'take' implementations.
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants