-
-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: raise/warn SettingWithCopy when chained assignment is detected #5390
Conversation
Want! How would you detect, that a copy was expected? Like first assigning the loc result and then changing it? |
nope just setting a flag on the returned slices is enough when they r copied (2 cases r covered) |
I'm strongly opposed to this - what if you want to do some transformation on a row and then apply it to multiple other rows? It's far too opinionated and it only covers certain cases, and it doesn't work if you do fancy indexing:
It also doesn't work in this case:
|
@jtratner you are missing the point here, the last case DOES work. In general if its a single-dtype it WILL work. This addresses the problem when it DOESN't work and you don't realize that. The problem from a user perspective is that certain operations produce a copy that if you then try to set are actually a copy of the original (this most often happens when you take a cross section that must be converted from a view to a copy) |
I definitely don't understand the intricacies of what makes copies with numpy and pandas. That said, What I don't like about this is I can't do this: df = DataFrame({"A": [1, 2, 3, 4, 5], "B": [3.125, 4.12, 3.1, 6.2, 7.]})
row = df.loc[0]
row["A"] = 0
df - row But I can do it if it's the same dtype. Yes, it's blocking a certain kind of mistake, but how the heck do I make it so that I can do something with row? copy it? It also means that you can do changed assignment on columns, but can't on rows. (i.e. My biggest problem is that it make some Series totally unusable, simply because they were generated by loc. |
Also, in that example I gave, you aren't blocking the mutation of row: In [10]: df = DataFrame({"A": [1, 2, 3, 4, 5], "B": [3.125, 4.12, 3.1, 6.2, 7.]})
In [11]: row = df.loc[0]
In [12]: row["A"] = 0
---------------------------------------------------------------------------
SettingWithCopy Traceback (most recent call last)
<ipython-input-12-1d82f0d440e7> in <module>()
----> 1 row["A"] = 0
../pandas/core/series.py in __setitem__(self, key, value)
574 def __setitem__(self, key, value):
575 try:
--> 576 self._set_with_engine(key, value)
577 return
578 except (SettingWithCopy):
../pandas/core/series.py in _set_with_engine(self, key, value)
627 self.index._engine.set_value(values, key, value)
628 if getattr(self,'_copy',False):
--> 629 raise SettingWithCopy("trying to set value on a copy")
630 return
631 except KeyError:
SettingWithCopy: trying to set value on a copy
In [13]: df - row
Out[13]:
A B
0 1 0.000
1 2 0.995
2 3 -0.025
3 4 3.075
4 5 3.875
In [14]: row
Out[14]:
A 0.000
B 3.125
Name: 0, dtype: float64 |
This definitely needs to be for 0.14. It's a significant change! |
How about a compromise on this:
|
yeh...maybe a warning is ok....(3 can be done via warnings module) the issue is that sometimes it work and we allow it, but sometimes it won't. I can in theory |
Yeah, I know you can do it via warnings, but if you want to be able to say "If you're starting out with pandas, we recommend you set 'raise_on_setting_copy' to True", which then alters warnings on the callback. |
ok....option is easy |
I'm personally fine with a warning and letting users fiddle, your call on that. Maybe also performance test? |
no perf hit...this is can an attrib check.. you example above, FYI is technically chained assignment. The bottom line is a cross-section supposed to be a view or not? (e.g. maybe it should NEVER be), even it it COULD be |
@jtratner @JanSchulz ok...new behavior is at top...controlled by the option, default is to warn |
Much happier with this. Do you maybe want to add a note that it won't necessarily catch all chained assignment? |
well...hoping that we do actually catch them and just have some false positives :) if you have a case that you don't think we catch...... |
@wesm ? |
@jtratner ok with putting in? |
Could we change |
sure |
def _check_setitem_copy(self): | ||
""" validate if we are doing a settitem on a chained copy """ | ||
if getattr(self,'_is_copy',False): | ||
value = pd.get_option('chained_assignment') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe make this a global that gets set on the validator? get_option is actually expensive - nearly 100us.
@jtratner fixed up.....just made a fast access (internal) function |
@@ -512,6 +512,13 @@ def _get_root(key): | |||
cursor = cursor[p] | |||
return cursor, path[-1] | |||
|
|||
def _get_option_fast(key): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great, needed to do this anyway.
I thought the version added would be confusing actually because this section was their before |
got it - that makes a lot of sense. |
It would be nice if there would be API way to remove the warning/raising ("I know what I do") from that object:
It would be even nicer if pandas would be more deterministic, eg. always return a "not view" when the value is assigned but always throw an error when a copy is returned in between two slice operations ("if I assign I want a copy, if I don't assign I want a View"?). |
Just set the config option to None. the problem is that the underlying decision about whether or not to copy is made by numpy rather than pandas. |
@JanSchulz being more deterministic is an interesting idea, and could be done in these cases (especially in the case of xs, which I think should always copy), which is the main abuse of the chained setting. In any event, any more comments; I think this is a nice warning (which is easy to turn off). |
@JanSchulz the basic rule is this if mixed type then will copy, if a single dtype will provide a view if numpy does (IOW not sure if this is a copy or not, it still may be). |
@jreback What a deterministic rule :-) I still find it madness... :-) import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [3, 4, 3, 6, 7]})
df2 = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [3., 4., 3., 6., 7.]})
df.loc[0]["A"] = 0
df2.loc[0]["A"] = 0
print df.ix[0,"A"]
0
print df2.ix[0,"A"]
1
df["A"].loc[0] = 2
df2["A"].loc[0] = 2
print df.ix[0,"A"]
2
print df2.ix[0,"A"]
2 This even more so: import numpy as np
dfb = pd.DataFrame({'a' : ['one', 'one', 'two','three', 'two', 'one', 'six'],'c' : np.arange(7)})
print dfb.ix[0,"c"]
0
dfb['c'][dfb.a.str.startswith('o')] = 42
print dfb.ix[0,"c"]
42
dfb[dfb.a.str.startswith('o')]['c'] = 43
print dfb.ix[0,"c"]
42
dfb.loc[dfb.a.str.startswith('o'),'c'] = 44
print dfb.ix[0,"c"]
44 I see that a view can sometimes be more memory efficient, but this will be (or is) a nightmare for people who come from R, SPSS or excel and have never seen numpy. A way to make it more deterministic would be to add a wrapper around all indexer which ensure a view (say for example df.vix) or a copy (df.cix). Then you can at least be explicit what you want to do. Or an option that all default indexer use a "copy or error"-mode and only df.vix and so on a "view or error"-mode. So I think the basic rules are: |
(3) --> make a copy.
|
@JanSchulz you rules are sensible....maybe want to expand the chained assignement section even more! I think this PR will at the very least alert you that you might be doing something you should think about. Its not hard to eliminate view on cross-section/slicing. I don't think there is any particular benefit to it, unless your data is huge and you are taking a viewable slice (and that's prob why it was done in the first place). |
how about we just make ix optionally callable?
def __call__(self, force_copy=False):
self.force_copy=force_copy
return self And we could consider adding a config option. Seems like a much simpler and explicit way to handle it than new indexers. |
Or if we want to cache the ix object, then something like: def __call__(self, force_copy=False):
if self.force_copy == force_copy:
new_self = self
else:
from copy import copy
new_self = copy(self)
new_self.force_copy =force_copy
return new_self |
hmm that's interesting could pass parameters that way; in fact that's the reason u need xs explicitly (eg say u want to pass the level or axis parameter) |
just seems like it's relatively natural and means we don't have to do more overloading. Main thing is making sure ix is a thin wrapper so we don't have to deal with cacheing issues. |
ix (and friends) are created each time an indexing operation happens (and it's thin really just the obj reference) |
@jtratner any further comments? |
Looks good to me - thanks for making those changes |
API: raise/warn SettingWithCopy when chained assignment is detected
I've never really worked with warnings before, but is there a way to print the line number causing the warning when its displayed? I've got some of these warnings and I'm trying to figure out what's causing them. |
not sure you can also set the option to raise then catch it in the debugger so it will stop there |
Yes, it's simple to do and we should have bumped up the stacklevel on this. @TomAugspurger check out #5581 where I submitted a PR to fix this. |
Default is to warn
You can turn it off
Or set to raise