spurious SettingWithCopyWarning #5597

Closed
jseabold opened this issue Nov 27, 2013 · 19 comments · Fixed by #5584
Labels: API Design, Bug, Indexing (related to indexing on series/frames, not to indexes themselves)

Comments

@jseabold
Contributor

I'm getting spurious warnings on some old code that I'm running with new pandas. You can replicate by doing something like this (you have to take a subset of the data first, that's the key)

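import warnings
from string import ascii_letters as letters  # string.letters exists only on Python 2

import numpy as np  # needed by random_text() below
import pandas as pd
from pandas.core.common import SettingWithCopyWarning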
warnings.simplefilter('error', SettingWithCopyWarning)

def random_text(nobs=100):
    df = []
    for i in range(nobs):
        idx = np.random.randint(len(letters), size=2)
        idx.sort()
        df.append([letters[idx[0]:idx[1]]])

    return pd.DataFrame(df, columns=['letters'])

df = random_text(100000)

df = df.ix[df.letters.apply(lambda x : len(x) > 10)]
df['letters'] = df['letters'].apply(str.lower)
@jreback
Contributor

jreback commented Nov 27, 2013

This is not spurious, but exactly as intended. You may view it as a false positive, but the whole point of this warning is to try to avoid false negatives.

The ix is implicitly doing a take, which returns a copy of the data. When you then assign new values to a column of that data, the 'original' frame is unchanged (which I believe is your intent). The warning is indicating that you are in effect assigning to a cross-section of the original, which MAY or MAY NOT be a copy (in the case of a single dtype it is 'usually' not a copy). In this case it is, though.

The warning can be turned off by doing any of the following, or simply setting pd.set_option('chained_assignment',None)

df = random_text(100000)
indexer = df.letters.apply(lambda x : len(x) > 10)
df = df.ix[indexer].copy()
df['letters'] = df['letters'].apply(str.lower)

df = random_text(100000)
indexer = df.letters.apply(lambda x : len(x) > 10)
df = df.ix[indexer]
df.loc[:,'letters'] = df['letters'].apply(str.lower)

df = random_text(100000)
indexer = df.letters.apply(lambda x : len(x) > 10)
df = df.ix[indexer]._setitem_copy(False)
df['letters'] = df['letters'].apply(str.lower)
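
On pandas versions where `.ix` is no longer available (it was deprecated in 0.20 and removed in 1.0), the first workaround above might be sketched with a boolean mask and `.loc`. The small frame and its values are made up for illustration and stand in for `random_text()`; only the select-then-`.copy()` pattern comes from the thread.

```python
import pandas as pd

# Hypothetical stand-in for random_text(); the values are illustrative only.
df = pd.DataFrame({"letters": ["ABCDEFGHIJKLM", "abc", "NOPQRSTUVWXYZ"]})

# Boolean mask: keep rows whose string is longer than 10 characters.
mask = df["letters"].str.len() > 10

# Take an explicit copy so later assignments clearly target a new object.
subset = df.loc[mask].copy()
subset["letters"] = subset["letters"].str.lower()

print(subset)
```

Binding the result to a new name (`subset`) rather than rebinding `df` is only a readability choice here; it keeps the original frame around so that it is easy to check it was left untouched.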

@jseabold
Contributor Author

I see. Hmm, well I know I've copied the data... Same thing, but more explicitly, perhaps.

new_df = df.ix[df.letters.apply(lambda x : len(x) > 10)]
new_df.reset_index(drop=True, inplace=True)
del df
new_df['letters'] = new_df['letters'].apply(str.lower)

The warning can be turned off by doing any of the following, or simply setting pd.set_option('chained_assignment',None)

Ok. I guess I'll set this because I don't care for this level of hand-holding. In numpy, you just know what you're doing copy vs. view or explicitly ask for a copy when you want to be sure.

What else does 'chained_assignment' handle? It sounds delightfully mysterious.
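
For the numpy side of the comparison made in this comment, here is a minimal sketch of the view-versus-copy rule as it applies to a plain array: basic slicing yields a view, while boolean-mask indexing yields a copy. The array and the variable names are illustrative only.

```python
import numpy as np

a = np.arange(6)

# Basic slicing returns a view: writes go through to the original array.
view = a[1:4]
view[0] = 100

# Boolean-mask (advanced) indexing returns a copy: writes stay local.
copy = a[a > 3]
copy[0] = -1

shares_view = np.shares_memory(a, view)
shares_copy = np.shares_memory(a, copy)
print(a, shares_view, shares_copy)
```

The row selection in the original report uses a boolean mask, which is the copying kind of indexing in this sketch.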

@jseabold
Contributor Author

Hmm, maybe I do need it, I guess, if the pandas view vs. copy distinction is not as readily determined as numpy's. I am very wary of your "it's 'usually' not a copy" (or "you're usually not producing garbage results" elsewhere, possibly silently). Sounds like I have to live with this noise for quite common operations. Plus the warning only occurs the first time by default. This is a pretty ugly state of affairs IMO. Meh.
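
The "only occurs the first time by default" behavior comes from Python's `warnings` filters rather than from pandas. A minimal sketch, using a hypothetical `DemoCopyWarning` class as a stand-in for `SettingWithCopyWarning` so that it does not depend on any particular pandas version:

```python
import warnings


class DemoCopyWarning(Warning):
    """Hypothetical stand-in for pandas' SettingWithCopyWarning."""


def risky_assignment():
    # Every call warns from the same source line.
    warnings.warn("value set on a copy", DemoCopyWarning)


def count_warnings(action, calls=3):
    # Record warnings instead of printing them, under the given filter action.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action, DemoCopyWarning)
        for _ in range(calls):
            risky_assignment()
    return len(caught)


once = count_warnings("default")   # first occurrence per location only
every = count_warnings("always")   # every occurrence

print(once, every)
```

Passing `"error"` as the action, as the original report does, turns the first occurrence into an exception instead.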

@jreback
Contributor

jreback commented Nov 27, 2013

agreed...the warning is really meant for new users for the most part

see the docs

the problem is people try to do this:

df[column][row] = ...., which doesn't ALWAYS work (in a single dtyped case it does, but not for multiple dtypes)

The problem is that detecting this is quite difficult: the above, for example, yields 2 separate (and unrelated) calls, a __getitem__ followed by a __setitem__, with no way of determining that they belong together. The example you give here is unfortunately a side-effect.

another possibility (though still not pretty) is to allow

df.ix(copy=False)[indexer] = ....
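
Why the chained `df[column][row] = ...` form mentioned in this comment is hard to detect can be seen without pandas at all. The toy `Recorder` class below is hypothetical and only logs which special methods Python calls: the chained form is a `__getitem__` on the original object followed by a `__setitem__` on whatever that returned.

```python
class Recorder:
    """Toy container that logs which special methods Python invokes."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __getitem__(self, key):
        self.log.append((self.name, "__getitem__", key))
        # Return a brand-new intermediate object, as a copying index would.
        return Recorder("intermediate", self.log)

    def __setitem__(self, key, value):
        self.log.append((self.name, "__setitem__", key))


log = []
frame = Recorder("frame", log)

frame["column"]["row"] = 1   # chained: two independent calls
frame["column"] = 1          # direct: one call on the original object

for entry in log:
    print(entry)
```

In the chained case the assignment lands on the intermediate object, and the original never sees a `__setitem__` call, so it has no opportunity to notice the write.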

@jreback
Contributor

jreback commented Nov 27, 2013

the thing is it has always been there, this warning just makes it known.

@jseabold
Contributor Author

Makes sense. I had to get out of the df[][] = ... habit early on.

@aldanor
Contributor

aldanor commented Nov 27, 2013

@jseabold @jreback Could the issue be reopened please?

I'm getting millions of warnings too, in code which I know works as intended. I mean, come on, isn't it a little too much?

Something as trivial as

>>> df = pd.DataFrame({'a': [1]}).dropna()
>>> df['a'] += 1

results in

.../pandas/core/generic.py:1029: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  warnings.warn(t, SettingWithCopyWarning)

Do you think this kind of code could be misinterpreted in any possible way? Every snippet like this would effectively yield a false positive.

I consider the snippet above legit pandas code; what would you expect noobs to do here to avoid the warning? Dig into pd.set_option? Not too user-friendly. Do a .copy() after each operation, or use an ix(...) mentioned above? That's just extra boilerplate and not pretty at all :(

TL;DR I understand the need to warn the noobs, but at the same time this would yield a thousand times more false positives than the true ones.

Edit: I understand this has been partially addressed by #5584 for setting entire columns. I believe, though, that there are many more particular cases where a warning can be assured to be a false positive :/

@jreback
Contributor

jreback commented Nov 27, 2013

@aldanor I updated #5584; there WAS a spurious indication when the operation is doing nothing except creating a new object.

pls give a try with that PR and lmk.

I think this warning is useful (you can always turn it off of course), but there are some false positives; just trying to remove the ones I can (as you can see above it's not always possible).

If you have more cases, pls post here

@jreback jreback reopened this Nov 27, 2013
@dsm054
Contributor

dsm054 commented Nov 27, 2013

Count me among the pandas users who find the current state of affairs very frustrating.

We can't have a warning being standard if you follow recommended practice, which means I can no longer recommend people do the obvious thing even if it's clear it's going to work. As a result I now have to write .loc everywhere, even where it's manifestly not necessary.

Admittedly the only way I can think to get around this offhand is to have df["a"] return a different Series view object each time, so that it could grow a _chained_count value which can be incremented.

@jreback
Contributor

jreback commented Nov 27, 2013

@dsm054 can you give an example of where 'standard practice' actually triggers this? This shouldn't be triggering very often (that's the intent anyhow).

In the entire test suite it DOES NOT trigger (except on purpose), so I must be missing some cases (after #5584)

@jreback
Contributor

jreback commented Nov 29, 2013

@dsm054 any other warnings you are getting that look spurious?

@jseabold
Contributor Author

I just noticed that the statsmodels test suite is littered with these warnings, but no failures. I've updated our code to ignore them, but surely this is spurious? If it's setting on a copy I'd assume the original to be unchanged.

[~/statsmodels/statsmodels-skipper/statsmodels/stats/tests/]
[1]: pd.version.version
[1]: '0.13.1-105-g8119991'

[~/statsmodels/statsmodels-skipper/statsmodels/stats/tests/]
[2]: table = pd.DataFrame(np.zeros((3,2)), columns=['A','B'])

[~/statsmodels/statsmodels-skipper/statsmodels/stats/tests/]
[3]: table.ix[2]['A'] = 3
/usr/local/bin/ipython:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
#!/usr/bin/python

[~/statsmodels/statsmodels-skipper/statsmodels/stats/tests/]
[4]: table
[4]: 
   A  B
0  0  0
1  0  0
2  3  0

[3 rows x 2 columns]

@jreback
Contributor

jreback commented Feb 13, 2014

@jseabold

This is THE specific case that SettingWithCopy is addressing.

you should NOT do

table.ix[2]['A'] = 3 and instead do table.loc[2,'A'] = 3

It WILL work in a single-dtyped case, as numpy almost always (and that's the rub) will give you a view. But it will not work in a mixed dtype case.

You can turn them off, but the point is to use the correct semantics. It is not a bug but a warning to catch a potential error.

The setting will proceed, but you may be setting a copy. So if it succeeds, you had a view and the warning was spurious. But you cannot be sure; better to set using the loc/ix indexer (in its full form).
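
A minimal sketch of the single-call form recommended here, applied to a single-dtype frame like the one in the report and to a mixed-dtype frame. The mixed frame and its values are hypothetical, added only to show that the same form applies there.

```python
import numpy as np
import pandas as pd

# Single-dtype frame, as in the example above.
table = pd.DataFrame(np.zeros((3, 2)), columns=["A", "B"])
table.loc[2, "A"] = 3

# Mixed-dtype frame (hypothetical data): the same single-call form applies.
mixed = pd.DataFrame({"A": [0.0, 0.0, 0.0], "B": ["x", "y", "z"]})
mixed.loc[2, "A"] = 3

print(table)
print(mixed)
```

Because row and column are passed to one indexer call, the assignment is a single `__setitem__` on the frame itself, so there is no intermediate object that might turn out to be a copy.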

@jseabold
Contributor Author

What version was loc introduced in?

If I read what you're saying then I could use this right?

table.ix[2, 'A'] = 3

I don't get a warning here, and I know what the dtype of table will be always.

@jreback
Contributor

jreback commented Feb 13, 2014

you can use .ix too (I just always use .loc)

that is the 'correct' expression to ensure it will always work

@jreback
Contributor

jreback commented Feb 13, 2014

IIRC .loc/.iloc were introduced in 0.11

@jseabold
Contributor Author

Ok thanks. That's what I thought. We are still supporting old pandas unfortunately, because we are still supporting old(er) numpy.

@apiszcz

apiszcz commented Mar 7, 2015

I am getting the warning even when using .loc:
df.loc[:,'f1']=f2['tf']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

FYI:
(Pdb) pd.version
'0.15.2'

@jreback
Contributor

jreback commented Mar 7, 2015

that can generate a warning as well. simply use df['f1'] = f2['tf']
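
A minimal sketch of the whole-column assignment suggested here. The thread does not show how `df` and `f2` were built, so everything below is an assumption: `parent`, the `keep` column, and the values are hypothetical, and `df` is assumed to be a filtered subset of another frame, which is the situation discussed earlier in this issue.

```python
import pandas as pd

# Hypothetical data; `parent`, `keep` and the values are illustrative only.
parent = pd.DataFrame({"f1": [1, 2, 3, 4], "keep": [True, False, True, True]})
f2 = pd.DataFrame({"tf": [10, 20, 30, 40]})

# If df comes from filtering another frame, take an explicit copy first.
df = parent.loc[parent["keep"]].copy()

# Whole-column assignment; the Series is aligned on the index labels.
df["f1"] = f2["tf"]

print(df)
```

Note that the assignment aligns on index labels, so rows of `f2` whose labels are absent from `df` are simply not used.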
