BUG? Normalize modifies pandas Series inplace #5427

Closed
TomAugspurger opened this issue Nov 7, 2015 · 12 comments
@TomAugspurger (Contributor)

Not sure if this is a bug, but it was surprising.

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.colors as colors

In [24]: s = pd.Series(np.random.randn(10))

In [25]: norm = colors.Normalize(s.min() + .2, s.max() - .2)

In [26]: s
Out[26]:
0   -0.767090
1   -0.236357
2    0.031890
3    1.339422
4    0.994139
5   -1.481244
6   -0.253904
7   -0.032531
8    0.777265
9   -0.315098
dtype: float64

# normalizing the numpy array doesn't modify it
In [27]: norm(s.values)
Out[27]:
masked_array(data = [ 0.21240206  0.43165251  0.54246791  1.08262188  0.93998197 -0.08262188
  0.42440363  0.51585521  0.85038937  0.39912391],
             mask = False,
       fill_value = 1e+20)

In [28]: s
Out[28]:
0   -0.767090
1   -0.236357
2    0.031890
3    1.339422
4    0.994139
5   -1.481244
6   -0.253904
7   -0.032531
8    0.777265
9   -0.315098
dtype: float64

In [29]: norm(s)  # normalize series...
Out[29]:
masked_array(data = [ 0.21240206  0.43165251  0.54246791  1.08262188  0.93998197 -0.08262188
  0.42440363  0.51585521  0.85038937  0.39912391],
             mask = False,
       fill_value = 1e+20)

In [30]: s  # does modify.
Out[30]:
0    0.212402
1    0.431653
2    0.542468
3    1.082622
4    0.939982
5   -0.082622
6    0.424404
7    0.515855
8    0.850389
9    0.399124
dtype: float64

In [43]: matplotlib.__version__
Out[43]: '1.5.0'

Strangely, if s is an integer series (s = pd.Series(range(10))), it is not modified. FWIW, a list of floats isn't modified either.
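
This is consistent with how np.ma.asarray behaves: it only shares memory with its input when no dtype conversion is needed. A quick check (illustrative; the float dtype here is an assumption about what the normalization path uses):

import numpy as np
import pandas as pd

s_float = pd.Series(np.random.randn(3))
s_int = pd.Series(range(3))

# float64 Series: asarray can reuse the existing buffer, so memory is shared
np.may_share_memory(s_float.values, np.ma.asarray(s_float))           # True
# int64 Series: converting to float forces a fresh buffer, so no sharing
np.may_share_memory(s_int.values, np.ma.asarray(s_int, dtype=float))  # False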

@mdboom (Member) commented Nov 7, 2015

I'd say it's a bug. There have been similar surprises elsewhere in matplotlib over the years and I think the general consensus is to not mutate passed-in arrays.

@efiring (Member) commented Nov 7, 2015

It looks like the problem is in the process_value method. Removing the if isinstance(value, np.ndarray): in front of the copy operation would probably fix it, but there might be a better way.
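
To make that concrete, the pattern being described looks roughly like this (an illustrative sketch, not the actual matplotlib source):

import numpy as np

def process_value(value):
    # Sketch of the suspect logic: only ndarray inputs get copied.
    is_scalar = not np.iterable(value)
    if is_scalar:
        value = [value]
    if isinstance(value, np.ndarray):
        value = value.copy()             # plain arrays are protected here...
    result = np.ma.asarray(value, dtype=np.float64)
    # ...but a pandas Series is not an ndarray, so it skips the copy, and
    # np.ma.asarray can wrap its buffer directly; later in-place math then
    # mutates the caller's Series.
    return result, is_scalar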

@tacaswell tacaswell added this to the Next bugfix release (1.5.1) milestone Nov 7, 2015
@tacaswell tacaswell self-assigned this Dec 21, 2015
@tacaswell (Member)

The issue here is that ma.asarray shares memory with the Series:

In [24]: import numpy.ma as ma

In [25]: tt = pd.Series(np.random.randn(10))

In [26]: mt = ma.asarray(tt)

In [27]: mt[0] = 0

In [31]: tt
Out[31]: 
0    0.000000
1    0.367954
2    1.454730
3   -0.322231
4    2.203396
5   -2.031685
6   -0.129424
7    0.099887
8   -0.063291
9    0.123325
dtype: float64

Further down (see Normalize.__call__) we work on result in place anyway. Ideally we would want a way to check whether the ma.asarray triggered a copy, but I am not sure how to do that in a way that works with both arrays and Series.
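
A minimal reproduction of that in-place step (mirroring the shape of the code in Normalize.__call__, not quoting it):

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
result = np.ma.asarray(s)   # float64 already, so the buffer is shared
resdat = result.data        # the underlying ndarray view
resdat -= 1.0               # in-place shift, as in the vmin subtraction...
resdat /= 2.0               # ...and the (vmax - vmin) scaling
print(s.tolist())           # [0.0, 0.5, 1.0] -- the Series was mutated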

@TomAugspurger (Contributor, author)

Does checking np.may_share_memory(tt, mt) do it? And if they do share memory, tt = tt.copy() (or whatever the equivalent in the function is).
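
In process_value terms that guard would be something like this (a sketch under the same assumptions as above):

import numpy as np

def process_value(value):
    # Sketch: copy only when the masked array ended up sharing the input's buffer.
    is_scalar = not np.iterable(value)
    if is_scalar:
        value = [value]
    result = np.ma.asarray(value, dtype=np.float64)
    if np.may_share_memory(result, value):
        result = result.copy()           # detach from the caller's data
    return result, is_scalar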

@tacaswell (Member)

Oh hey, that 'just works' 👍

@efiring (Member) commented Dec 22, 2015

It looks like that works. I was worried about overhead, but maybe it's OK.

In [10]: tt = pd.Series(np.random.randn(1000000))

In [11]: mt = np.ma.asarray(tt)

In [12]: np.may_share_memory(tt, mt)
Out[12]: True

In [13]: %timeit np.may_share_memory(tt, mt)
The slowest run took 6.42 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 7.73 µs per loop

@efiring (Member) commented Dec 22, 2015

Is may_share_memory available in the earliest numpy that we support? The source files mem_overlap.c and .h have very short histories.

@tacaswell (Member)

I was also digging into that, trying to sort out whether lists are re-converted to arrays for this check, or whether it has a 'this does not support the buffer protocol, stop it' cut-out mode.

@efiring (Member) commented Dec 22, 2015

Nope:

In [18]: xx = [1.1] * 10000000

In [19]: %timeit np.may_share_memory(xx, mt)
1 loops, best of 3: 279 ms per loop

@tacaswell (Member)

The original version went into numpy in e122f0798a686a41616908e071531790e84e5414 (2007), but it looks like the version that is compatible with pandas is newer: fae8369fba4a683783280b044915b11bbca07a44 (2013), which is ~numpy 1.8.

I think the two options are:

  1. unconditionally copy (bad perf for lists, a wash for ndarrays, works for Series); see the sketch below
  2. change an element and see what happens, but that opens us up to chaos due to pandas doing index-based __getitem__ instead of positional.
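
For concreteness, option 1 amounts to replacing the asarray call with an unconditional copy (a sketch, not necessarily the shipped fix):

import numpy as np

def process_value(value):
    # Sketch of option 1: always build a fresh buffer, whatever the input.
    is_scalar = not np.iterable(value)
    if is_scalar:
        value = [value]
    result = np.ma.array(value, dtype=np.float64, copy=True)
    return result, is_scalar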

@efiring (Member) commented Dec 22, 2015

My original suggestion was option 1, and I think it still looks good. I am not concerned about the performance for lists; they are not likely to be long enough that it matters. There might be other array-like beasts for which we end up with an unnecessary copy, but I don't think it is worth worrying about that now.

@tacaswell (Member)

@efiring Done. That rabbit hole almost looked promising (it was not).
