BUG: DataFrame.replace is broken in 0.12-dev #4115

Closed
vfilimonov opened this Issue Jul 3, 2013 · 18 comments

Contributor

vfilimonov commented Jul 3, 2013

DataFrame.replace seems to be broken in the 0.12 development version.

This code:

import pandas as pd

df = pd.DataFrame({'Type': ['Q', 'T', 'Q', 'Q', 'T', 'T'], 'tmp': 2})
print df
print df.replace({'Type': {'Q': 0, 'T': 1}})

returns the correct result in 0.11.0:

  Type  tmp
0    Q    2
1    T    2
2    Q    2
3    Q    2
4    T    2
5    T    2
  Type  tmp
0    0    2
1    1    2
2    0    2
3    0    2
4    1    2
5    1    2

but in 0.12.0.dev-795004d it does not (note the wrong values at rows 3 and 4):

  Type  tmp
0    Q    2
1    T    2
2    Q    2
3    Q    2
4    T    2
5    T    2
   Type  tmp
0     0    2
1     1    2
2     0    2
3     1    2
4     0    2
5     1    2

Series.replace works fine, so a workaround for this bug is the following:

df['Type'] = df['Type'].replace({'Q': 0, 'T': 1})
Contributor

jreback commented Jul 3, 2013

@cpcloud take a look?

Member

cpcloud commented Jul 3, 2013

sure

while i'm fixing it you can also do

df.replace({'Type': {'Q': 0}}).replace({'Type': {'T': 1}})
Member

cpcloud commented Jul 3, 2013

ah i remember this being a bit tricky. it was the last change i made before submitting the original PR. @vfilimonov thanks for the report

Member

cpcloud commented Jul 3, 2013

it would be cool to look at the pandas devs' "reaction time" (time to submit a PR after a bug report)... i wonder if the gh api has such information

Contributor

vfilimonov commented Jul 3, 2013

@cpcloud, thank you very much for taking care!

Aside: the Series variant seems to be much faster when there are many columns of different types. When the number of columns is small, the difference is not that large.

import numpy as np
import pandas as pd

N = 1999999
df = pd.DataFrame({'Type':['T' if x>0.5 else 'Q' for x in np.random.rand(N)],
                   'tmp1':np.random.rand(N),'tmp2':np.random.rand(N),
                   'tmp3':'tmp','tmp4':np.random.rand(N),
                   'tmp5':np.random.rand(N),'tmp6':2,
                   'tmp7':np.random.rand(N),'tmp8':3,
                   'tmp9':pd.Timestamp('2000-01-01'),'tmp10':np.random.rand(N)})
df2 = df.copy()

%time df['Type']=df['Type'].replace({'Q': 0, 'T': 1})
%time df2=df2.replace({'Type': {'Q': 0}}).replace({'Type': {'T': 1}})

CPU times: user 0.16 s, sys: 0.01 s, total: 0.17 s
Wall time: 0.17 s
CPU times: user 0.73 s, sys: 0.18 s, total: 0.91 s
Wall time: 0.91 s

cpcloud was assigned Jul 3, 2013

Member

cpcloud commented Jul 3, 2013

Series is doing something a bit different... it uses a lot of cythonized internals to do the work, at the expense of being slightly less general, e.g., it doesn't accept regex. This will most likely change in 0.13 when Series becomes an NDFrame, since all of the replace machinery happens at the Block level.

Contributor

vfilimonov commented Jul 3, 2013

The reaction time is really incredible. This is the second time I've submitted a report and you guys fixed the bug within 30 minutes! Thanks a lot.

Member

cpcloud commented Jul 3, 2013

@vfilimonov thanks for the kind words! we appreciate it.

cpcloud closed this in #4117 Jul 3, 2013

Contributor

jtratner commented Jul 3, 2013

@cpcloud honestly it seems like it would be pretty easy to change the Series internals... the only issue is that it might be a bit slower...

Member

cpcloud commented Jul 3, 2013

i agree w/ u. just commenting on why it's a bit faster: it uses masking via Cython whereas i'm using np.vectorize. a while back @jreback and i compared timings using pandas' cythonized vectorized replacement and it was actually slower than vectorize, but not by much.

Contributor

jtratner commented Jul 3, 2013

@cpcloud if pandas dropped the vectorize call, it could use nearly the same functions for arithmetic for Series and Frame (and probably even for Panel).

Contributor

jtratner commented Jul 3, 2013

@cpcloud that said, having code in 3 places is not a big deal...performance is more important.

Member

cpcloud commented Jul 3, 2013

there are only 3 places where vectorize is used in pandas:

./pandas/core/internals.py:
  830 :         f = np.vectorize(re_replacer, otypes=[self.dtype])
./pandas/io/pytables.py:
 3383 :         f = np.vectorize(lambda x: x.encode(encoding), otypes=[np.object])
 3402 :         f = np.vectorize(lambda x: x.decode(encoding),otypes=[np.object])

are you referring to use of things like map_infer_types?

Contributor

jreback commented Jul 3, 2013

keep in mind this is all going to change once the series refactor lands.
replace is a bit tricky, but the underlying methods will all be the same (as what frame uses now)

Member

cpcloud commented Jul 3, 2013

yes, we should wait for the motherlode

Contributor

jreback commented Jul 3, 2013

@cpcloud the last 2 are because I saw you using it!
and I didn't feel like dealing with strings/unicode in cython

Member

cpcloud commented Jul 3, 2013

ha! str and unicode are super annoying in Cython when it comes to py2/3 issues and using char*.

Member

cpcloud commented Jul 3, 2013

which is really just because using char* for unicode is asking for trouble

cpcloud was unassigned by wesm Oct 12, 2016
