BUG? Normalize modifies pandas Series inplace #5427

Closed
TomAugspurger opened this issue Nov 7, 2015 · 12 comments
@TomAugspurger (Contributor)

Not sure if this is a bug, but it was surprising.

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.colors as colors

In [24]: s = pd.Series(np.random.randn(10))

In [25]: norm = colors.Normalize(s.min() + .2, s.max() - .2)

In [26]: s
Out[26]:
0   -0.767090
1   -0.236357
2    0.031890
3    1.339422
4    0.994139
5   -1.481244
6   -0.253904
7   -0.032531
8    0.777265
9   -0.315098
dtype: float64

# normalizing the numpy array doesn't modify it
In [27]: norm(s.values)
Out[27]:
masked_array(data = [ 0.21240206  0.43165251  0.54246791  1.08262188  0.93998197 -0.08262188
  0.42440363  0.51585521  0.85038937  0.39912391],
             mask = False,
       fill_value = 1e+20)

In [28]: s
Out[28]:
0   -0.767090
1   -0.236357
2    0.031890
3    1.339422
4    0.994139
5   -1.481244
6   -0.253904
7   -0.032531
8    0.777265
9   -0.315098
dtype: float64

In [29]: norm(s)  # normalize series...
Out[29]:
masked_array(data = [ 0.21240206  0.43165251  0.54246791  1.08262188  0.93998197 -0.08262188
  0.42440363  0.51585521  0.85038937  0.39912391],
             mask = False,
       fill_value = 1e+20)

In [30]: s  # does modify.
Out[30]:
0    0.212402
1    0.431653
2    0.542468
3    1.082622
4    0.939982
5   -0.082622
6    0.424404
7    0.515855
8    0.850389
9    0.399124
dtype: float64

In [43]: matplotlib.__version__
Out[43]: '1.5.0'

Strangely, if s is an integer series (s = pd.Series(range(10))), it is not modified. FWIW, a list of floats isn't modified either.
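
This is consistent with how np.ma.asarray behaves: it only shares memory with its input when no dtype conversion is needed. A quick check (illustrative; the float dtype here is an assumption about what the normalization path uses):

import numpy as np
import pandas as pd

s_float = pd.Series(np.random.randn(3))
s_int = pd.Series(range(3))

# float64 Series: asarray can reuse the existing buffer, so memory is shared
np.may_share_memory(s_float.values, np.ma.asarray(s_float))           # True
# int64 Series: converting to float forces a fresh buffer, so no sharing
np.may_share_memory(s_int.values, np.ma.asarray(s_int, dtype=float))  # False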

@mdboom (Member) commented Nov 7, 2015

I'd say it's a bug. There have been similar surprises elsewhere in matplotlib over the years and I think the general consensus is to not mutate passed-in arrays.

@efiring (Member) commented Nov 7, 2015

It looks like the problem is in the process_value method. Removing the if isinstance(value, np.ndarray): in front of the copy operation would probably fix it, but there might be a better way.
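
To make that concrete, the pattern being described looks roughly like this (an illustrative sketch, not the actual matplotlib source):

import numpy as np

def process_value(value):
    # Sketch of the suspect logic: only ndarray inputs get copied.
    is_scalar = not np.iterable(value)
    if is_scalar:
        value = [value]
    if isinstance(value, np.ndarray):
        value = value.copy()             # plain arrays are protected here...
    result = np.ma.asarray(value, dtype=np.float64)
    # ...but a pandas Series is not an ndarray, so it skips the copy, and
    # np.ma.asarray can wrap its buffer directly; later in-place math then
    # mutates the caller's Series.
    return result, is_scalar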

@tacaswell tacaswell added this to the Next bugfix release (1.5.1) milestone Nov 7, 2015
@tacaswell tacaswell self-assigned this Dec 21, 2015
@tacaswell (Member)

The issue here is that ma.asarray shares memory with the Series:

In [24]: import numpy.ma as ma

In [25]: tt = pd.Series(np.random.randn(10))

In [26]: mt = ma.asarray(tt)

In [27]: mt[0] = 0

In [31]: tt
Out[31]: 
0    0.000000
1    0.367954
2    1.454730
3   -0.322231
4    2.203396
5   -2.031685
6   -0.129424
7    0.099887
8   -0.063291
9    0.123325
dtype: float64

Further down (see Normalize.__call__) we work on result in place anyway. Ideally we would want a way to check whether the ma.asarray triggered a copy, but I am not sure how to do that in a way that works with both arrays and Series.
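
A minimal reproduction of that in-place step (mirroring the shape of the code in Normalize.__call__, not quoting it):

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
result = np.ma.asarray(s)   # float64 already, so the buffer is shared
resdat = result.data        # the underlying ndarray view
resdat -= 1.0               # in-place shift, as in the vmin subtraction...
resdat /= 2.0               # ...and the (vmax - vmin) scaling
print(s.tolist())           # [0.0, 0.5, 1.0] -- the Series was mutated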

@TomAugspurger (Contributor, author)

Does checking np.may_share_memory(tt, mt) do it? And if they do share memory, tt = tt.copy() (or whatever the equivalent in the function is).
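
In process_value terms that guard would be something like this (a sketch under the same assumptions as above):

import numpy as np

def process_value(value):
    # Sketch: copy only when the masked array ended up sharing the input's buffer.
    is_scalar = not np.iterable(value)
    if is_scalar:
        value = [value]
    result = np.ma.asarray(value, dtype=np.float64)
    if np.may_share_memory(result, value):
        result = result.copy()           # detach from the caller's data
    return result, is_scalar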

@tacaswell (Member)

Oh hey, that 'just works' 👍

@efiring (Member) commented Dec 22, 2015

It looks like that works. I was worried about overhead, but maybe it's OK.

In [10]: tt = pd.Series(np.random.randn(1000000))

In [11]: mt = np.ma.asarray(tt)

In [12]: np.may_share_memory(tt, mt)
Out[12]: True

In [13]: %timeit np.may_share_memory(tt, mt)
The slowest run took 6.42 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 7.73 µs per loop

@efiring (Member) commented Dec 22, 2015

Is may_share_memory available in the earliest numpy that we support? The source files mem_overlap.c and .h have very short histories.

@tacaswell (Member)

I was also digging into that, trying to sort out whether lists are re-converted to arrays for this check, or whether it has a 'this does not support the buffer protocol, stop it' cut-out mode.

@efiring (Member) commented Dec 22, 2015

Nope:

In [18]: xx = [1.1] * 10000000

In [19]: %timeit np.may_share_memory(xx, mt)
1 loops, best of 3: 279 ms per loop

@tacaswell (Member)

The original version went into numpy in e122f0798a686a41616908e071531790e84e5414 (2007), but it looks like the version that is compatible with pandas is newer: fae8369fba4a683783280b044915b11bbca07a44 (2013), which is ~numpy 1.8.

I think the two options are:

  1. unconditionally copy (bad perf for lists, a wash for ndarrays, works for Series); see the sketch below
  2. change an element and see what happens, but that opens us up to chaos due to pandas doing index-based __getitem__ instead of positional.
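
For concreteness, option 1 amounts to replacing the asarray call with an unconditional copy (a sketch, not necessarily the shipped fix):

import numpy as np

def process_value(value):
    # Sketch of option 1: always build a fresh buffer, whatever the input.
    is_scalar = not np.iterable(value)
    if is_scalar:
        value = [value]
    result = np.ma.array(value, dtype=np.float64, copy=True)
    return result, is_scalar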

@efiring (Member) commented Dec 22, 2015

My original suggestion was option 1, and I think it still looks good. I am not concerned about the performance for lists; they are not likely to be long enough that it matters. There might be other array-like beasts for which we end up with an unnecessary copy, but I don't think it is worth worrying about that now.

@tacaswell (Member)

@efiring Done. That rabbit hole almost looked promising (it was not).
