Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: deprecate .get_value and .set_value for Series, DataFrame, Panel & Sparse #17739

Merged
merged 1 commit into from
Oct 5, 2017

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Oct 2, 2017

closes #15269

@jreback jreback added Deprecate Functionality to remove in pandas Indexing Related to indexing on series/frames, not to indexes themselves labels Oct 2, 2017
@jreback jreback added this to the 0.21.0 milestone Oct 2, 2017
@jreback
Copy link
Contributor Author

jreback commented Oct 2, 2017

@@ -460,7 +497,35 @@ def set_value(self, index, col, value, takeable=False):
-------
frame : DataFrame
"""
dense = self.to_dense().set_value(index, col, value, takeable=takeable)
warnings.warn("get_value is deprecated and will be removed "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this one be "set_value is deprecated..." instead of "get_value"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks...i fixed the whole doc-string duplication mess as well.

@jorisvandenbossche
Copy link
Member

I am not fully convinced this is needed. As @wesm said in the issue, they were originally meant as fast access methods, and although performance has degraded over time, they are still considerably faster as any alternative.

I agree in the large majority of the use case you should not (or can avoid) having to set / get individual values, but why not keep them for those few cases you cannot otherwise?

@jreback
Copy link
Contributor Author

jreback commented Oct 2, 2017

well we have more than 1 way to do things (12 ways in fact). This is just a bad API design. Not to mention that these barely have any tests / attention, and they break in lots of cases (as they don't handle validation / dtypes propertly).

this is a no-brainer.

@jreback
Copy link
Contributor Author

jreback commented Oct 2, 2017

.iat/.at are properly supported and pretty close in perf

@codecov
Copy link

codecov bot commented Oct 2, 2017

Codecov Report

Merging #17739 into master will decrease coverage by 0.01%.
The diff coverage is 96.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17739      +/-   ##
==========================================
- Coverage   91.24%   91.22%   -0.02%     
==========================================
  Files         163      163              
  Lines       49918    49957      +39     
==========================================
+ Hits        45546    45574      +28     
- Misses       4372     4383      +11
Flag Coverage Δ
#multiple 89.03% <96.55%> (ø) ⬆️
#single 40.25% <39.65%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/sparse/frame.py 94.78% <100%> (+0.09%) ⬆️
pandas/core/frame.py 97.74% <100%> (-0.09%) ⬇️
pandas/core/indexing.py 93% <100%> (ø) ⬆️
pandas/core/sparse/series.py 95.26% <100%> (+0.11%) ⬆️
pandas/core/panel.py 96.94% <100%> (+0.03%) ⬆️
pandas/core/series.py 94.89% <77.77%> (-0.16%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 37860a5...ec7b086. Read the comment docs.

@jreback
Copy link
Contributor Author

jreback commented Oct 3, 2017

any final comments?

@@ -666,6 +666,7 @@ Deprecations
- ``cdate_range`` has been deprecated in favor of :func:`bdate_range`, which has gained ``weekmask`` and ``holidays`` parameters for building custom frequency date ranges. See the :ref:`documentation <timeseries.custom-freq-ranges>` for more details (:issue:`17596`)
- passing ``categories`` or ``ordered`` kwargs to :func:`Series.astype` is deprecated, in favor of passing a :ref:`CategoricalDtype <whatsnew_0210.enhancements.categorical_dtype>` (:issue:`17636`)
- Passing a non-existant column in ``.to_excel(..., columns=)`` is deprecated and will raise a ``KeyError`` in the future (:issue:`17295`)
- ``.get_value`` and ``.set_value`` on ``Series``, ``DataFrame``, ``Panel``, ``SparseSeries``, and ``SparseDataFrame`` are deprecated in favor of using ``.iat[]`` or ``.at[]`` accessors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be an issue number here?

Unrelated, but if you're going to make a change, there's also a typo on the line above this one that could be fixed: "non-existant" -> "non-existent". Can submit a fix myself if there's no need to add an issue number and changes don't need to be made.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, both fixed

@jorisvandenbossche
Copy link
Member

I already stated my reservations about the need for this.
So ideally it would be good to hear some more opinions from other devs (because if most are OK with this, it's fine for me too, I don't care about it too much)

@TomAugspurger
Copy link
Contributor

I feel the same as Joris, and will defer to others on this.

@wesm
Copy link
Member

wesm commented Oct 4, 2017

As we have already capitulated long ago on caring about microperformance (https://mail.python.org/pipermail/pandas-dev/2016-March/000499.html) I don't have strong feelings. If it does not cause us a maintenance hardship (maybe it does...), and the performance is better, why not keep them?

@jreback
Copy link
Contributor Author

jreback commented Oct 4, 2017

perf is almost exactly the same as .iat/.at; they are in fact wrappers around the internal versions of .set_value/.get_value. but absolutely this is an API/maintenance burden. too much surface area.

@jorisvandenbossche
Copy link
Member

Since they are indeed rather thin wrappers around the internal set/get_value, I don't think they are actually that much of a maintenance burden. But it is true it is an API 'burden' (they add to the long list of available functions that users have to cope with).
So it's a bit a trade-off. But they are still a factor of around 3x faster as .at. So given that I would tend to keep them (although I never use them myself).

@jreback
Copy link
Contributor Author

jreback commented Oct 4, 2017

you are completely missing the point.

They duplicate APIs. They are not tested in the least and they are very buggy. The entire point of .iat/.at is that they follow a common path and ARE tested with all dtypes.

We are not going to keep around buggy code that has duplicate API's. This such a maintance burden. We need to actively be removing code, not preserving untested, non-documented duplicative API's.

I personally would rather have correct code.

@wesm
Copy link
Member

wesm commented Oct 4, 2017

I wouldn't say he / I are missing the point but we have a slightly different perspective =) I am +0 with removing the APIs in the interest of simplification and maintainability.

@jorisvandenbossche
Copy link
Member

If you read my previous post, I think you can see that I am not completely missing the point. I agree that it is duplicate API. But there is still a trade-off between duplicate api vs speed / breaking people's code.
But I do not agree they are not tested. They are tested by the many indexing tests that use them under the hood.

But to the point: by further looking at it, I am certainly fine with the idea of deprecating those functions. As I would also like to clean up the API, and iat/at are I think nicer.
But, at/iat are actually also only very small wrappers around get_value/set_value. They also do almost as little data validation as get_value/set_value (eg the bug report about PeriodIndex with get_value (#15268), you have exactly the same using at). And I think in general this limited validation is actually the feature of those functions/methods (although the fact it cannot handle a Period on a PeriodIndex could be seen as a bug for at).
And the little bit extra validation they do (the reason they are slower) is IMO not that useful. So I would like to hear if you would be OK with making at/iat a bit faster (closer to get_value/set_value).

Eg, if you look at the at implementation for getting, it does the following validation:

  • check that it is a tuple -> otherwise raise ValueError
  • on integer axis, check that key is integer -> otherwise ValueError
  • on non-integer axis, check that key is not integer -> otherwise raise ValueError

But in all those case, when not doing this check, you would get a KeyError otherwise with the get_value that is used under the hood. But those checks make up for a big part of the reason that at is 3x as slow as get_value.
Are those checks necessary? They provide a little bit a more informative error message (compared to just KeyError). And dropping the checks would be an API change from ValueError to KeyError. But IMO they are not really important, and I would be in favor to drop them, making at closer in speed to get_value.

Would you be with such a change?

(and to repeat from above, I am OK with the idea of removing get_value/set_value)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor doc comment, for the rest no code comments!

@jreback
Copy link
Contributor Author

jreback commented Oct 5, 2017

Eg, if you look at the at implementation for getting, it does the following validation:

check that it is a tuple -> otherwise raise ValueError
on integer axis, check that key is integer -> otherwise ValueError
on non-integer axis, check that key is not integer -> otherwise raise ValueError
But in all those case, when not doing this check, you would get a KeyError otherwise with the get_value that is used under the hood. But those checks make up for a big part of the reason that at is 3x as slow as get_value.
Are those checks necessary? They provide a little bit a more informative error message (compared to just KeyError). And dropping the checks would be an API change from ValueError to KeyError. But IMO they are not really important, and I would be in favor to drop them, making at closer in speed to get_value.

Would you be with such a change?

I guess my overarching point is that we have a framework that we can/should test .iat/at more. So for sure testing is +1. We already have some scalar fast paths that least to these routines, but some of those changes are certainly possible/welcome. Obviously not averse to speeding things up.

Things done in the unified indexing context are always welcome.

didn't see any doc comments left, can you point.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, forgot to push the button. Here is the minor doc comment.

@@ -1918,6 +1919,8 @@ def get_value(self, index, col, takeable=False):
"""
Quickly retrieve single value at passed column and index

.. deprecated:: 0.21.0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add here the alternative like in the deprecations message? (the "use '.at[]' or '.iat[]' instead")

(and the same for the other docstrings)

@jreback jreback merged commit 3fae8dd into pandas-dev:master Oct 5, 2017
ghost pushed a commit to reef-technologies/pandas that referenced this pull request Oct 16, 2017
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DEPR: set_value / get_value
5 participants