Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: raise/warn SettingWithCopy when chained assignment is detected #5390

Merged
merged 1 commit into from
Nov 6, 2013

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Oct 30, 2013

In [1]: df = DataFrame({"A": [1, 2, 3, 4, 5], "B": [3.125, 4.12, 3.1, 6.2, 7.]})

In [2]: row = df.loc[0]

Default is to warn

In [3]: row["A"] = 0
pandas/core/generic.py:1001: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  warnings.warn(t,SettingWithCopyWarning)

In [4]: row
Out[4]: 
A    0.000
B    3.125
Name: 0, dtype: float64

In [5]: df
Out[5]: 
   A      B
0  1  3.125
1  2  4.120
2  3  3.100
3  4  6.200
4  5  7.000

You can turn it off

In [6]: pd.set_option('chained',None)

In [7]: row["A"] = 0

In [8]: row
Out[8]: 
A    0.000
B    3.125
Name: 0, dtype: float64

In [9]: df
Out[9]: 
   A      B
0  1  3.125
1  2  4.120
2  3  3.100
3  4  6.200
4  5  7.000

Or set to raise

In [10]: row["A"] = 0
SettingWithCopyError: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead

@jankatins
Copy link
Contributor

Want!

How would you detect, that a copy was expected? Like first assigning the loc result and then changing it?

@jreback
Copy link
Contributor Author

jreback commented Oct 30, 2013

nope just setting a flag on the returned slices is enough when they r copied (2 cases r covered)

@jtratner
Copy link
Contributor

I'm strongly opposed to this - what if you want to do some transformation on a row and then apply it to multiple other rows? It's far too opinionated and it only covers certain cases, and it doesn't work if you do fancy indexing:

In [25]: l = df.loc[[0, 1]]

In [26]: l
Out[26]:
   A  B
0  0  2
1  1  3

In [27]: l["A"] = 5

In [28]: df
Out[28]:
   A  B
0  0  2
1  1  3

In [29]: l
Out[29]:
   A  B
0  5  2
1  5  3

It also doesn't work in this case:

In [3]: df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [5, 4, 3, 2, 1]}, index=['a', 'b', 'c', 'd', 'e'])

In [4]: df
Out[4]:
   A  B
a  1  5
b  2  4
c  3  3
d  4  2
e  5  1

In [5]: df.loc['a']["A"] = 5

@jreback
Copy link
Contributor Author

jreback commented Oct 30, 2013

@jtratner you are missing the point here, the last case DOES work. In general if its a single-dtype it WILL work.

This addresses the problem when it DOESN't work and you don't realize that. The problem from a user perspective is that certain operations produce a copy that if you then try to set are actually a copy of the original (this most often happens when you take a cross section that must be converted from a view to a copy)

@jtratner
Copy link
Contributor

I definitely don't understand the intricacies of what makes copies with numpy and pandas. That said, What I don't like about this is I can't do this:

df = DataFrame({"A": [1, 2, 3, 4, 5], "B": [3.125, 4.12, 3.1, 6.2, 7.]})
row = df.loc[0]
row["A"] = 0
df - row

But I can do it if it's the same dtype. Yes, it's blocking a certain kind of mistake, but how the heck do I make it so that I can do something with row? copy it?

It also means that you can do changed assignment on columns, but can't on rows. (i.e. df["A"][0] works).


My biggest problem is that it make some Series totally unusable, simply because they were generated by loc.

@jtratner
Copy link
Contributor

Also, in that example I gave, you aren't blocking the mutation of row:

In [10]: df = DataFrame({"A": [1, 2, 3, 4, 5], "B": [3.125, 4.12, 3.1, 6.2, 7.]})

In [11]: row = df.loc[0]

In [12]: row["A"] = 0
---------------------------------------------------------------------------
SettingWithCopy                           Traceback (most recent call last)
<ipython-input-12-1d82f0d440e7> in <module>()
----> 1 row["A"] = 0

../pandas/core/series.py in __setitem__(self, key, value)
    574     def __setitem__(self, key, value):
    575         try:
--> 576             self._set_with_engine(key, value)
    577             return
    578         except (SettingWithCopy):

../pandas/core/series.py in _set_with_engine(self, key, value)
    627             self.index._engine.set_value(values, key, value)
    628             if getattr(self,'_copy',False):
--> 629                 raise SettingWithCopy("trying to set value on a copy")
    630             return
    631         except KeyError:

SettingWithCopy: trying to set value on a copy

In [13]: df - row
Out[13]:
   A      B
0  1  0.000
1  2  0.995
2  3 -0.025
3  4  3.075
4  5  3.875

In [14]: row
Out[14]:
A    0.000
B    3.125
Name: 0, dtype: float64

@jtratner
Copy link
Contributor

This definitely needs to be for 0.14. It's a significant change!

@jtratner
Copy link
Contributor

How about a compromise on this:

  1. Make it into a warning
  2. Change the warning to mention chained assignment: "Trying to set a value on a copy, which may mean that you aren't setting the value on the object you want. This often occurs because of chained assignment."
  3. Offer config option to add a filter for that warning that causes it to raise.

@jreback
Copy link
Contributor Author

jreback commented Oct 31, 2013

yeh...maybe a warning is ok....(3 can be done via warnings module)

the issue is that sometimes it work and we allow it, but sometimes it won't. I can in theory
actually make it work (as I can figure out the error), but I think its a bit complicated, because
I there may be some hard to detect cases, IOW, this its possible this can be a false positive.

@jtratner
Copy link
Contributor

Yeah, I know you can do it via warnings, but if you want to be able to say "If you're starting out with pandas, we recommend you set 'raise_on_setting_copy' to True", which then alters warnings on the callback.

@jreback
Copy link
Contributor Author

jreback commented Oct 31, 2013

ok....option is easy

@jtratner
Copy link
Contributor

I'm personally fine with a warning and letting users fiddle, your call on that. Maybe also performance test?

@jreback
Copy link
Contributor Author

jreback commented Oct 31, 2013

no perf hit...this is can an attrib check..

you example above, FYI is technically chained assignment. The bottom line is a cross-section supposed to be a view or not? (e.g. maybe it should NEVER be), even it it COULD be

@jreback
Copy link
Contributor Author

jreback commented Oct 31, 2013

@jtratner @JanSchulz ok...new behavior is at top...controlled by the option, default is to warn

@jtratner
Copy link
Contributor

Much happier with this. Do you maybe want to add a note that it won't necessarily catch all chained assignment?

@jreback
Copy link
Contributor Author

jreback commented Oct 31, 2013

well...hoping that we do actually catch them and just have some false positives :)

if you have a case that you don't think we catch......

@jreback
Copy link
Contributor Author

jreback commented Oct 31, 2013

@wesm ?

@jreback
Copy link
Contributor Author

jreback commented Nov 1, 2013

@jtratner ok with putting in?

@jtratner
Copy link
Contributor

jtratner commented Nov 1, 2013

Could we change _copy to _is_copy and set a default of False on NDFrame?

@jreback
Copy link
Contributor Author

jreback commented Nov 1, 2013

sure

def _check_setitem_copy(self):
""" validate if we are doing a settitem on a chained copy """
if getattr(self,'_is_copy',False):
value = pd.get_option('chained_assignment')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe make this a global that gets set on the validator? get_option is actually expensive - nearly 100us.

@jreback
Copy link
Contributor Author

jreback commented Nov 2, 2013

@jtratner fixed up.....just made a fast access (internal) function

@@ -512,6 +512,13 @@ def _get_root(key):
cursor = cursor[p]
return cursor, path[-1]

def _get_option_fast(key):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great, needed to do this anyway.

@jreback
Copy link
Contributor Author

jreback commented Nov 2, 2013

I thought the version added would be confusing actually because this section was their before

@jtratner
Copy link
Contributor

jtratner commented Nov 2, 2013

got it - that makes a lot of sense.

@jankatins
Copy link
Contributor

It would be nice if there would be API way to remove the warning/raising ("I know what I do") from that object:

df = DataFrame({"A": [1, 2, 3, 4, 5], "B": [3.125, 4.12, 3.1, 6.2, 7.]})
row = df.loc[0]
# Here some magic...
row["A"] = 0
# No error/warning

It would be even nicer if pandas would be more deterministic, eg. always return a "not view" when the value is assigned but always throw an error when a copy is returned in between two slice operations ("if I assign I want a copy, if I don't assign I want a View"?).

@jtratner
Copy link
Contributor

jtratner commented Nov 2, 2013

Just set the config option to None.

the problem is that the underlying decision about whether or not to copy is made by numpy rather than pandas.

@jreback
Copy link
Contributor Author

jreback commented Nov 2, 2013

@JanSchulz being more deterministic is an interesting idea, and could be done in these cases (especially in the case of xs, which I think should always copy), which is the main abuse of the chained setting.

In any event, any more comments; I think this is a nice warning (which is easy to turn off).

@jreback
Copy link
Contributor Author

jreback commented Nov 2, 2013

@JanSchulz the basic rule is this

if mixed type then will copy, if a single dtype will provide a view if numpy does (IOW not sure if this is a copy or not, it still may be).

@jankatins
Copy link
Contributor

@jreback What a deterministic rule :-)

I still find it madness... :-)

import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [3, 4, 3, 6, 7]})
df2 = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [3., 4., 3., 6., 7.]})
df.loc[0]["A"] = 0
df2.loc[0]["A"] = 0
print df.ix[0,"A"]
0
print df2.ix[0,"A"]
1
df["A"].loc[0] = 2
df2["A"].loc[0] = 2
print df.ix[0,"A"]
2
print df2.ix[0,"A"]
2

This even more so:

import numpy as np
dfb = pd.DataFrame({'a' : ['one', 'one', 'two','three', 'two', 'one', 'six'],'c' : np.arange(7)})
print dfb.ix[0,"c"]
0
dfb['c'][dfb.a.str.startswith('o')] = 42
print dfb.ix[0,"c"]
42
dfb[dfb.a.str.startswith('o')]['c'] = 43
print dfb.ix[0,"c"]
42
dfb.loc[dfb.a.str.startswith('o'),'c'] = 44
print dfb.ix[0,"c"]
44

I see that a view can sometimes be more memory efficient, but this will be (or is) a nightmare for people who come from R, SPSS or excel and have never seen numpy.

A way to make it more deterministic would be to add a wrapper around all indexer which ensure a view (say for example df.vix) or a copy (df.cix). Then you can at least be explicit what you want to do. Or an option that all default indexer use a "copy or error"-mode and only df.vix and so on a "view or error"-mode.

So I think the basic rules are:
1.) do not use chained assignments: replace all df{|.ix|.iloc|.iloc}[column][rows] by df.loc[row,col] (if col and row are both location based) or df.iloc[row, col] (if both are integer location based) or df.ix[row, col] (if mixed) and you are save.
2.) if you must, first select columns, then rows (to not get a mixed dtype)
3.) if you assign a value to a slice, which you don't want to show up in the original dataframe, make sure that this slice is a copy by ... [don't know how...]

@jtratner
Copy link
Contributor

jtratner commented Nov 3, 2013

(3) --> make a copy.

s = df.ix[a, b].copy()

@jreback
Copy link
Contributor Author

jreback commented Nov 4, 2013

@JanSchulz you rules are sensible....maybe want to expand the chained assignement section even more!

I think this PR will at the very least alert you that you might be doing something you should think about.

Its not hard to eliminate view on cross-section/slicing. I don't think there is any particular benefit to it, unless your data is huge and you are taking a viewable slice (and that's prob why it was done in the first place).

@jtratner
Copy link
Contributor

jtratner commented Nov 4, 2013

how about we just make ix optionally callable?

df.ix(force_copy=True)[a, b]

def __call__(self, force_copy=False):
     self.force_copy=force_copy
     return self

And we could consider adding a config option. Seems like a much simpler and explicit way to handle it than new indexers.

@jtratner
Copy link
Contributor

jtratner commented Nov 4, 2013

Or if we want to cache the ix object, then something like:

def __call__(self, force_copy=False):
     if self.force_copy == force_copy:
         new_self = self
     else:
         from copy import copy
         new_self = copy(self)
         new_self.force_copy =force_copy
     return new_self

@jreback
Copy link
Contributor Author

jreback commented Nov 4, 2013

hmm that's interesting

could pass parameters that way; in fact that's the reason u need xs explicitly (eg say u want to pass the level or axis parameter)

@jtratner
Copy link
Contributor

jtratner commented Nov 4, 2013

just seems like it's relatively natural and means we don't have to do more overloading. Main thing is making sure ix is a thin wrapper so we don't have to deal with cacheing issues.

@jreback
Copy link
Contributor Author

jreback commented Nov 4, 2013

ix (and friends) are created each time an indexing operation happens (and it's thin really just the obj reference)

@jreback
Copy link
Contributor Author

jreback commented Nov 6, 2013

@jtratner any further comments?

@jtratner
Copy link
Contributor

jtratner commented Nov 6, 2013

Looks good to me - thanks for making those changes

jreback added a commit that referenced this pull request Nov 6, 2013
API: raise/warn SettingWithCopy when chained assignment is detected
@jreback jreback merged commit 0d5c06f into pandas-dev:master Nov 6, 2013
@TomAugspurger
Copy link
Contributor

I've never really worked with warnings before, but is there a way to print the line number causing the warning when its displayed? I've got some of these warnings and I'm trying to figure out what's causing them.

@jreback
Copy link
Contributor Author

jreback commented Nov 24, 2013

not sure

you can also set the option to raise then catch it in the debugger so it will stop there

@jtratner
Copy link
Contributor

Yes, it's simple to do and we should have bumped up the stacklevel on this.

@TomAugspurger check out #5581 where I submitted a PR to fix this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants