Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: raise/warn SettingWithCopy when chained assignment is detected #5390

Merged
merged 1 commit into from Nov 6, 2013

Conversation

@jreback
Copy link
Contributor

commented Oct 30, 2013

In [1]: df = DataFrame({"A": [1, 2, 3, 4, 5], "B": [3.125, 4.12, 3.1, 6.2, 7.]})

In [2]: row = df.loc[0]

Default is to warn

In [3]: row["A"] = 0
pandas/core/generic.py:1001: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  warnings.warn(t,SettingWithCopyWarning)

In [4]: row
Out[4]: 
A    0.000
B    3.125
Name: 0, dtype: float64

In [5]: df
Out[5]: 
   A      B
0  1  3.125
1  2  4.120
2  3  3.100
3  4  6.200
4  5  7.000

You can turn it off

In [6]: pd.set_option('chained',None)

In [7]: row["A"] = 0

In [8]: row
Out[8]: 
A    0.000
B    3.125
Name: 0, dtype: float64

In [9]: df
Out[9]: 
   A      B
0  1  3.125
1  2  4.120
2  3  3.100
3  4  6.200
4  5  7.000

Or set to raise

In [10]: row["A"] = 0
SettingWithCopyError: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
@jankatins

This comment has been minimized.

Copy link
Contributor

commented Oct 30, 2013

Want!

How would you detect, that a copy was expected? Like first assigning the loc result and then changing it?

@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Oct 30, 2013

nope just setting a flag on the returned slices is enough when they r copied (2 cases r covered)

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Oct 30, 2013

I'm strongly opposed to this - what if you want to do some transformation on a row and then apply it to multiple other rows? It's far too opinionated and it only covers certain cases, and it doesn't work if you do fancy indexing:

In [25]: l = df.loc[[0, 1]]

In [26]: l
Out[26]:
   A  B
0  0  2
1  1  3

In [27]: l["A"] = 5

In [28]: df
Out[28]:
   A  B
0  0  2
1  1  3

In [29]: l
Out[29]:
   A  B
0  5  2
1  5  3

It also doesn't work in this case:

In [3]: df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [5, 4, 3, 2, 1]}, index=['a', 'b', 'c', 'd', 'e'])

In [4]: df
Out[4]:
   A  B
a  1  5
b  2  4
c  3  3
d  4  2
e  5  1

In [5]: df.loc['a']["A"] = 5
@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Oct 30, 2013

@jtratner you are missing the point here, the last case DOES work. In general if its a single-dtype it WILL work.

This addresses the problem when it DOESN't work and you don't realize that. The problem from a user perspective is that certain operations produce a copy that if you then try to set are actually a copy of the original (this most often happens when you take a cross section that must be converted from a view to a copy)

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Oct 31, 2013

I definitely don't understand the intricacies of what makes copies with numpy and pandas. That said, What I don't like about this is I can't do this:

df = DataFrame({"A": [1, 2, 3, 4, 5], "B": [3.125, 4.12, 3.1, 6.2, 7.]})
row = df.loc[0]
row["A"] = 0
df - row

But I can do it if it's the same dtype. Yes, it's blocking a certain kind of mistake, but how the heck do I make it so that I can do something with row? copy it?

It also means that you can do changed assignment on columns, but can't on rows. (i.e. df["A"][0] works).


My biggest problem is that it make some Series totally unusable, simply because they were generated by loc.

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Oct 31, 2013

Also, in that example I gave, you aren't blocking the mutation of row:

In [10]: df = DataFrame({"A": [1, 2, 3, 4, 5], "B": [3.125, 4.12, 3.1, 6.2, 7.]})

In [11]: row = df.loc[0]

In [12]: row["A"] = 0
---------------------------------------------------------------------------
SettingWithCopy                           Traceback (most recent call last)
<ipython-input-12-1d82f0d440e7> in <module>()
----> 1 row["A"] = 0

../pandas/core/series.py in __setitem__(self, key, value)
    574     def __setitem__(self, key, value):
    575         try:
--> 576             self._set_with_engine(key, value)
    577             return
    578         except (SettingWithCopy):

../pandas/core/series.py in _set_with_engine(self, key, value)
    627             self.index._engine.set_value(values, key, value)
    628             if getattr(self,'_copy',False):
--> 629                 raise SettingWithCopy("trying to set value on a copy")
    630             return
    631         except KeyError:

SettingWithCopy: trying to set value on a copy

In [13]: df - row
Out[13]:
   A      B
0  1  0.000
1  2  0.995
2  3 -0.025
3  4  3.075
4  5  3.875

In [14]: row
Out[14]:
A    0.000
B    3.125
Name: 0, dtype: float64
@jtratner

This comment has been minimized.

Copy link
Contributor

commented Oct 31, 2013

This definitely needs to be for 0.14. It's a significant change!

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Oct 31, 2013

How about a compromise on this:

  1. Make it into a warning
  2. Change the warning to mention chained assignment: "Trying to set a value on a copy, which may mean that you aren't setting the value on the object you want. This often occurs because of chained assignment."
  3. Offer config option to add a filter for that warning that causes it to raise.
@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Oct 31, 2013

yeh...maybe a warning is ok....(3 can be done via warnings module)

the issue is that sometimes it work and we allow it, but sometimes it won't. I can in theory
actually make it work (as I can figure out the error), but I think its a bit complicated, because
I there may be some hard to detect cases, IOW, this its possible this can be a false positive.

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Oct 31, 2013

Yeah, I know you can do it via warnings, but if you want to be able to say "If you're starting out with pandas, we recommend you set 'raise_on_setting_copy' to True", which then alters warnings on the callback.

@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Oct 31, 2013

ok....option is easy

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Oct 31, 2013

I'm personally fine with a warning and letting users fiddle, your call on that. Maybe also performance test?

@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Oct 31, 2013

no perf hit...this is can an attrib check..

you example above, FYI is technically chained assignment. The bottom line is a cross-section supposed to be a view or not? (e.g. maybe it should NEVER be), even it it COULD be

@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Oct 31, 2013

@jtratner @JanSchulz ok...new behavior is at top...controlled by the option, default is to warn

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Oct 31, 2013

Much happier with this. Do you maybe want to add a note that it won't necessarily catch all chained assignment?

@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Oct 31, 2013

well...hoping that we do actually catch them and just have some false positives :)

if you have a case that you don't think we catch......

@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Oct 31, 2013

@wesm ?

@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Nov 1, 2013

@jtratner ok with putting in?

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Nov 1, 2013

Could we change _copy to _is_copy and set a default of False on NDFrame?

@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Nov 1, 2013

sure

@jtratner
jtratner reviewed Nov 2, 2013
View changes
pandas/core/generic.py Outdated
def _check_setitem_copy(self):
""" validate if we are doing a settitem on a chained copy """
if getattr(self,'_is_copy',False):
value = pd.get_option('chained_assignment')

This comment has been minimized.

Copy link
@jtratner

jtratner Nov 2, 2013

Contributor

maybe make this a global that gets set on the validator? get_option is actually expensive - nearly 100us.

@jtratner
jtratner reviewed Nov 2, 2013
View changes
pandas/core/generic.py Outdated
@@ -988,6 +990,17 @@ def _set_item(self, key, value):
self._data.set(key, value)
self._clear_item_cache()

def _check_setitem_copy(self):
""" validate if we are doing a settitem on a chained copy """
if getattr(self,'_is_copy',False):

This comment has been minimized.

Copy link
@jtratner

jtratner Nov 2, 2013

Contributor

and now this can just be if self._is_copy.

@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Nov 2, 2013

@jtratner fixed up.....just made a fast access (internal) function

@@ -512,6 +512,13 @@ def _get_root(key):
cursor = cursor[p]
return cursor, path[-1]

def _get_option_fast(key):

This comment has been minimized.

Copy link
@jtratner

jtratner Nov 2, 2013

Contributor

great, needed to do this anyway.

# warnings
pd.set_option('chained_assignment','warn')
df = DataFrame({'A':['aaa','bbb','ccc'],'B':[1,2,3]})
df.loc[0]['A'] = 111

This comment has been minimized.

Copy link
@jtratner

jtratner Nov 2, 2013

Contributor

so don't you need to catch the warning here?

self.assert_(not df._is_copy)

df['A'][0] = -5
df['A'][1] = -6

This comment has been minimized.

Copy link
@jtratner

jtratner Nov 2, 2013

Contributor

so the logic of this is that it's okay to do 'chained-ish' assignment on Series, but can't do fancy indexing on copies?

In chained expressions, the order may determine whether a copy is returned or not:
In chained expressions, the order may determine whether a copy is returned or not.
If an expression will set values on a copy of a slice, then a ``SettingWithCopy``
exception will be raised (this raise/warn behavior is new starting in 0.13.0)

This comment has been minimized.

Copy link
@jtratner

jtratner Nov 2, 2013

Contributor

versionadded?

@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Nov 2, 2013

I thought the version added would be confusing actually because this section was their before

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Nov 2, 2013

got it - that makes a lot of sense.

@jankatins

This comment has been minimized.

Copy link
Contributor

commented Nov 2, 2013

It would be nice if there would be API way to remove the warning/raising ("I know what I do") from that object:

df = DataFrame({"A": [1, 2, 3, 4, 5], "B": [3.125, 4.12, 3.1, 6.2, 7.]})
row = df.loc[0]
# Here some magic...
row["A"] = 0
# No error/warning

It would be even nicer if pandas would be more deterministic, eg. always return a "not view" when the value is assigned but always throw an error when a copy is returned in between two slice operations ("if I assign I want a copy, if I don't assign I want a View"?).

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Nov 2, 2013

Just set the config option to None.

the problem is that the underlying decision about whether or not to copy is made by numpy rather than pandas.

@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Nov 2, 2013

@JanSchulz being more deterministic is an interesting idea, and could be done in these cases (especially in the case of xs, which I think should always copy), which is the main abuse of the chained setting.

In any event, any more comments; I think this is a nice warning (which is easy to turn off).

@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Nov 2, 2013

@JanSchulz the basic rule is this

if mixed type then will copy, if a single dtype will provide a view if numpy does (IOW not sure if this is a copy or not, it still may be).

@jankatins

This comment has been minimized.

Copy link
Contributor

commented Nov 3, 2013

@jreback What a deterministic rule :-)

I still find it madness... :-)

import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [3, 4, 3, 6, 7]})
df2 = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [3., 4., 3., 6., 7.]})
df.loc[0]["A"] = 0
df2.loc[0]["A"] = 0
print df.ix[0,"A"]
0
print df2.ix[0,"A"]
1
df["A"].loc[0] = 2
df2["A"].loc[0] = 2
print df.ix[0,"A"]
2
print df2.ix[0,"A"]
2

This even more so:

import numpy as np
dfb = pd.DataFrame({'a' : ['one', 'one', 'two','three', 'two', 'one', 'six'],'c' : np.arange(7)})
print dfb.ix[0,"c"]
0
dfb['c'][dfb.a.str.startswith('o')] = 42
print dfb.ix[0,"c"]
42
dfb[dfb.a.str.startswith('o')]['c'] = 43
print dfb.ix[0,"c"]
42
dfb.loc[dfb.a.str.startswith('o'),'c'] = 44
print dfb.ix[0,"c"]
44

I see that a view can sometimes be more memory efficient, but this will be (or is) a nightmare for people who come from R, SPSS or excel and have never seen numpy.

A way to make it more deterministic would be to add a wrapper around all indexer which ensure a view (say for example df.vix) or a copy (df.cix). Then you can at least be explicit what you want to do. Or an option that all default indexer use a "copy or error"-mode and only df.vix and so on a "view or error"-mode.

So I think the basic rules are:
1.) do not use chained assignments: replace all df{|.ix|.iloc|.iloc}[column][rows] by df.loc[row,col] (if col and row are both location based) or df.iloc[row, col] (if both are integer location based) or df.ix[row, col] (if mixed) and you are save.
2.) if you must, first select columns, then rows (to not get a mixed dtype)
3.) if you assign a value to a slice, which you don't want to show up in the original dataframe, make sure that this slice is a copy by ... [don't know how...]

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Nov 3, 2013

(3) --> make a copy.

s = df.ix[a, b].copy()

@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Nov 4, 2013

@JanSchulz you rules are sensible....maybe want to expand the chained assignement section even more!

I think this PR will at the very least alert you that you might be doing something you should think about.

Its not hard to eliminate view on cross-section/slicing. I don't think there is any particular benefit to it, unless your data is huge and you are taking a viewable slice (and that's prob why it was done in the first place).

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Nov 4, 2013

how about we just make ix optionally callable?

df.ix(force_copy=True)[a, b]

def __call__(self, force_copy=False):
     self.force_copy=force_copy
     return self

And we could consider adding a config option. Seems like a much simpler and explicit way to handle it than new indexers.

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Nov 4, 2013

Or if we want to cache the ix object, then something like:

def __call__(self, force_copy=False):
     if self.force_copy == force_copy:
         new_self = self
     else:
         from copy import copy
         new_self = copy(self)
         new_self.force_copy =force_copy
     return new_self
@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Nov 4, 2013

hmm that's interesting

could pass parameters that way; in fact that's the reason u need xs explicitly (eg say u want to pass the level or axis parameter)

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Nov 4, 2013

just seems like it's relatively natural and means we don't have to do more overloading. Main thing is making sure ix is a thin wrapper so we don't have to deal with cacheing issues.

@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Nov 4, 2013

ix (and friends) are created each time an indexing operation happens (and it's thin really just the obj reference)

@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Nov 6, 2013

@jtratner any further comments?

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Nov 6, 2013

Looks good to me - thanks for making those changes

jreback added a commit that referenced this pull request Nov 6, 2013
API: raise/warn SettingWithCopy when chained assignment is detected
@jreback jreback merged commit 0d5c06f into pandas-dev:master Nov 6, 2013
@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Nov 24, 2013

I've never really worked with warnings before, but is there a way to print the line number causing the warning when its displayed? I've got some of these warnings and I'm trying to figure out what's causing them.

@jreback

This comment has been minimized.

Copy link
Contributor Author

commented Nov 24, 2013

not sure

you can also set the option to raise then catch it in the debugger so it will stop there

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Nov 24, 2013

Yes, it's simple to do and we should have bumped up the stacklevel on this.

@TomAugspurger check out #5581 where I submitted a PR to fix this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.