Improve clarity around when SettingWithCopyWarning can be ignored (if ever?) #8730

Closed
maxgrenderjones opened this issue Nov 4, 2014 · 14 comments · Fixed by #9659

Comments

@maxgrenderjones
Contributor

commented Nov 4, 2014

I [Edit: thought I got] a SettingWithCopyWarning when running the following code:

frame[columnone][frame[columntwo]>x]=y

The docs seem to imply that I can safely ignore this error.

# passed via reference (will stay)
In [273]: dfb['c'][dfb.a.str.startswith('o')] = 42

Is it therefore generally true that I can always ignore errors when the command I'm running is of the form:

frame[columnames][booleanconditiononframe]=x

I use this all the time, and I think it always works. If that's true, it would be great if it was specifically called out in a 'you can ignore this warning when...' section. If that's not true (e.g. it works for scalar x / numeric x but not for series x), it would be great if that were called out too.

Sorry if this seems trivial, but I'm trying to explain this to a colleague who's new to pandas and he's confused, and it looks like even old hands can be confused when confronted with this warning (see #6757).

[Edit: looking at his code again, I suspect the SettingWithCopy warning was from a different line of code - (I wish warnings made it clear where they were from). All the same, it would still be great if the docs could be clear if there are circumstances where you always get a reference rather than a copy]
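
On the wish that warnings made their origin clear: one general-purpose approach (a sketch using only Python's standard `warnings` and `traceback` modules, not anything pandas-specific) is to escalate warnings to exceptions while running the suspect code, so that a full traceback shows which call emitted the warning. The names `noisy_step` and `find_warning_origin` are hypothetical, and `noisy_step` is only a stand-in for whatever library call is warning in your own code. Note this finds the line that emits the warning, which is not necessarily the line that caused it.

```python
import traceback
import warnings


def noisy_step():
    # Hypothetical stand-in for a library call that emits a warning.
    warnings.warn("something looks off", UserWarning)
    return 1


def find_warning_origin(func):
    """Run func with warnings escalated to exceptions.

    Returns the formatted traceback if a warning was raised, else None.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("error")  # every warning becomes an exception
        try:
            func()
        except Warning:
            return traceback.format_exc()
    return None


report = find_warning_origin(noisy_step)
print(report)  # the traceback names noisy_step as the emitting frame
```

The same effect is available for a whole script by running it with `python -W error`.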

@jreback

Contributor

commented Nov 4, 2014

There are no cases that I am aware of where you should actually ignore this warning. Your code may still work; that's why it's a warning. It has to do with whether what you are modifying is actually a copy or a view. In general, if it's a single-dtyped frame it will mostly work, BUT NOT ALWAYS. And that's the rub. With certain types of indexing it will never work; with others it will. You are really playing with fire.

This warning is treated as an actual error throughout the entire pandas test suite. You should really never do this.

Use

df.loc[row_indexer,col_indexer] = value

and it will ALWAYS work, no matter what you are doing.

Otherwise you are just holding a gun, maybe shooting yourself when you least expect it.
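
As a concrete illustration of the `.loc` form recommended above, here is a minimal sketch that rewrites the chained pattern `frame[columnone][frame[columntwo]>x]=y` from the original question as a single assignment. The column names `"one"` and `"two"` and the values of `x` and `y` are made up for the example.

```python
import pandas as pd

frame = pd.DataFrame({"one": [1, 2, 3, 4], "two": [10, 20, 30, 40]})

# Where column "two" exceeds x, set column "one" to y.
# A single .loc call selects rows and column together, so the
# assignment is applied to `frame` itself rather than to an
# intermediate object.
x, y = 15, 0
frame.loc[frame["two"] > x, "one"] = y

print(frame["one"].tolist())  # [1, 0, 0, 0]
```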

@jreback

Contributor

commented Nov 4, 2014

If you think the docs are unclear, pls submit a pull-request to clarify. It is important from a user's perspective that they are very clear.

That example shows a case where chained indexing DOES WORK, but it does not imply that you should use it. (again, if it's unclear from reading, pls lmk / submit a PR to fix the docs).

@ischwabacher

Contributor

commented Nov 5, 2014

The reason you would ignore the warning is that AFAIK pandas can't tell the difference between

# WRONG
frame[columnone][frame[columntwo]>x] = y  # this warns
# frame is unsafe here

and

# WORKS (but using .ix is still better)
temp = frame[columnone]
temp[frame[columntwo]>x] = y  # this warns
# frame is unsafe here, but temp is safe
frame = temp
# now frame is safe

@jreback

Contributor

commented Nov 5, 2014

no, these are treated the same

there is an is_copy flag that is set on the created frames, and it propagates

however, they can act differently if the reference variable goes out of scope

that said, in a single-dtype case you can get away with this

but my recommendation still holds: always heed the warning and do not use potentially unsafe constructs

@maxgrenderjones

Contributor Author

commented Nov 18, 2014

Following up on this, I just ran:

dataframe.loc[dataframe[colname]==colvalue, newcolname]=1

and got

C:\Anaconda\lib\site-packages\pandas\core\indexing.py:245: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = np.nan

Isn't this what I'm supposed to do? (irrespective of the warning - it does appear to have done the right thing on my ~50k row dataframe)

(Once I can write an example of how to do this in a way that always works, I will submit a PR as this is super unclear to me and I bet others too)

@jreback

Contributor

commented Nov 18, 2014

well, pls post the output of show_versions(). It's possible it's from a prior statement (and just got checked here; if that is the case, it's a bug). You are modifying something that is a view of something else. You will have to show a self-contained example.

@maxgrenderjones

Contributor Author

commented Nov 18, 2014

Sorry for the lack of info before - assuming my use of .loc ought to work, I will see if I can construct an end-to-end example that triggers the warning

In [2]: pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.15.1
nose: 1.3.3
Cython: 0.20.1
numpy: 1.9.1
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.1.0
sphinx: 1.2.2
patsy: 0.3.0
dateutil: 1.5
pytz: 2014.9
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: 0.6.2.None
psycopg2: None

@maxgrenderjones

Contributor Author

commented Nov 18, 2014

Answer - my code looks like this:

dataframe=loadDataFrame()
dataframe=dataframe[colnames] # don't need all the columns
... code code code...
dataframe.loc[dataframe[colname]==colvalue, newcolname]=1 # triggers warning

Replacing the offending line (the column selection on the second line) with

dataframe=dataframe.drop([col for col in dataframe.columns if col not in colnames], axis=1)

means I don't get the warning

So, seems like the documentation 'fix' in this case (or rather helpful pointer) is that the warning can be triggered by an action that takes place some way away from the line that raises the warning - thanks for your help!

@jreback

Contributor

commented Nov 18, 2014

well the warning is exactly correct.

you are taking a 'view' of the dataframe (e.g. the df = df[colnames])
This won't trigger a copy

when you set a value via loc you are then effectively setting BOTH the original and the new one. This is what the warning is guarding against, namely propagation of the view.

it is simple enough to write

df = df[colnames_i_care_about].copy()

which is the proper idiom here. This actually accomplishes quite a bit: it provides you with a new, smaller object, the old one gets cleaned up, and there are no copy warnings.

or you don't even need that at all

just directly

df.loc[row_indexer,column] = value

this ONLY sets that 1 column so the rest don't matter

Further, you can also pass usecols=.... when you are reading/loading the frame from disk.
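
A rough sketch of the options described above, using made-up column names (`a`, `b`, `c`, `flag`) and an in-memory CSV in place of a real file: take an explicit `.copy()` after selecting columns, or read only the needed columns with the `usecols` parameter of `pd.read_csv` and assign with `.loc` directly.

```python
import io

import pandas as pd

full = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Option 1: select the columns you care about and copy explicitly,
# so `small` is its own object with no tie to `full`.
small = full[["a", "b"]].copy()
small.loc[small["a"] == 2, "flag"] = 1

# Option 2: read only the needed columns in the first place,
# then assign with .loc on that frame.
csv_text = "a,b,c\n1,4,7\n2,5,8\n3,6,9\n"
loaded = pd.read_csv(io.StringIO(csv_text), usecols=["a", "b"])
loaded.loc[loaded["a"] == 2, "flag"] = 1

print(list(full.columns))    # the original frame has no "flag" column
print(list(loaded.columns))
```

Rows that do not match the condition get a missing value in the new `flag` column, because `.loc` creates the column on assignment.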

@adamklein

Contributor

commented Mar 12, 2015

One of my users has code similar to this:

df1 = DataFrame({'x': Series(['a','b','c']), 'y': Series(['d','e','f'])})
df2 = df1[['x']]
df2['y'] = ['g', 'h', 'i']

And gets the warning. However, in the code base, lines 2 and 3 are quite far removed. So it's almost impossible to realize the warning stems from line 2, (because the warning happens on execution of line 3, far away). I don't know if anything can be done about this, but it can be a head scratcher to track down the source of the warnings.

(Or, even why there's a warning here, without understanding the subtle semantics of copy vs view)
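
For reference, a minimal sketch of the same three lines with an explicit `.copy()` at the point of selection, which is the idiom recommended earlier in this thread. The warning capture is only there to check that nothing with "SettingWithCopy" in its category name is emitted; it matches on the name so as not to assume where pandas exposes that class.

```python
import warnings

import pandas as pd

df1 = pd.DataFrame({"x": ["a", "b", "c"], "y": ["d", "e", "f"]})

# Copy at the point of selection, so later code that adds columns
# to df2 is unambiguously working on an independent object.
df2 = df1[["x"]].copy()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df2["y"] = ["g", "h", "i"]

copy_warnings = [w for w in caught if "SettingWithCopy" in w.category.__name__]
print(len(copy_warnings))
print(df1["y"].tolist())  # df1 keeps its original "y" values
```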

@jreback

Contributor

commented Mar 15, 2015

@adamklein

The reason the above triggers the warning is the soln that is for this case below is triggered. Basically you reindex a frame to another frame, which 'remembers' that this happened (via the .is_copy attribute). Then if a chained indexing operation happens the warning can be triggered.

In [3]: df = pd.DataFrame({'a': list(range(4)), 'b': list('ab..'), 'c': ['a', 'b', np.nan, 'd']})

In [4]: mask = pd.isnull(df.c)

In [5]: df[['c']][mask] = df[['b']][mask]
/Users/jreback/anaconda/bin/ipython:1: SettingWithCopyWarning: 

What you are doing in your example and what I am showing are different, e.g. you are adding a column, while the above is a masked indexing on the rows. But the only way to detect this is to set .is_copy in the first place, otherwise the logic becomes crazy complex.

Now I can turn off this case, so that what you are doing will not trigger the warning, but then my example above will no longer warn either.

I suspect that what you are doing is much more common, so this would eliminate this false positive, at the cost of NOT warning when it should for one type of chained indexing.

As you can guess detecting chained indexing is actually quite complex.

@jreback

Contributor

commented Mar 15, 2015

ok, see #9659; this should fix @adamklein's issue.

@kay1793


commented Mar 15, 2015

The reason the above triggers the warning is the soln that is for this case below is triggered.

wow I bet that triumph of grammer crashes NLTK.

@adamklein

Contributor

commented Mar 31, 2015

:)
