col.replace(dict) takes too much memory #6697

jankatins · 2014-03-24T11:06:01Z

I've code which run fine about ~10 month ago and now fails due to running out of memory. The code basically does this:

groups = [[1,2,3][222,333,444, 555], ...]  # from a 1.4MB json file
# about 30k such groups, results in ~70k replacements
for replace_list in groups:
        replacement = sorted(replace_list)[0]
        for replace_id in replace_list:
            if replace_id == replacement:
                continue
            replace_dict[replace_id] = replacement
# len(_col) == 974757
_col = df[column].astype("int64")
_col.replace(replace_dict, inplace=True)

I've now split the replacement_dict into 2k chunks and and this works (takes about 20 seconds for each chunk).

The text was updated successfully, but these errors were encountered:

jreback · 2014-03-24T11:13:29Z

can you provide the file, pandas, numpy version that it was better on and what you are trying on now?

also, try wi/o the in-place operation

jankatins · 2014-03-24T12:15:22Z

The replacement file is no problem, but the dataframe is internal and about 400MB csv unziped :-(

Now I do it similar to this:

groups = [[1,2,3][222,333,444, 555], ...]  # from a 1.4MB json file
# about 30k such groups, results in ~70k replacements
count = 0
_col = df[column].astype("int64")
replace_dict = {}
for replace_list in groups:
        count +=1
        replacement = sorted(replace_list)[0]
        for replace_id in replace_list:
            if replace_id == replacement:
                continue
            replace_dict[replace_id] = replacement
        if count > 2000:
            _col = _col.replace(replace_dict)
            replace_dict = {}
            count = 0
_col = _col.replace(replace_dict) # for the last chunk

This works (takes about 11min) and 4.7GB of RAM are in use.

pandas._np_version
Out[3]: '1.8.0'

pandas.__version__
Out[4]: '0.13.1-489-g7ffa655'

I don't remember the numpy/pandas version from 10 month ago (The replacement file was generated on 17.05.2013, so during that week it worked) :-( It was probably whatever was on pandas master and the latest numpy release (I compile pandas from head and install numpy from cgolhlkes installers).

jreback · 2014-03-24T12:26:21Z

@cpcloud ?

dsm054 · 2014-03-24T15:09:37Z

Something seems strange here.

@JanSchulz: how long does _col.apply(lambda x: replace_dict.get(x, x)) take?

jankatins · 2014-03-24T15:36:37Z

WTF:

replacements_by_XXX.json (16:35:42):  groups: 31612 replacements: 77846 =>  1.36s
replacements_by_XXXY.json (16:35:43):  groups: 19576 replacements: 33845 =>  1.03s
replacements_manual.json (16:35:44):  groups: 16 replacements: 17 =>  864ms
Wall time: 3.26 s

This took 20min with the above "2k chunks" code.

jreback · 2014-03-24T15:48:53Z

@JanSchulz I don't understand your problem exactly, but instead of replacing, why wouldn't you use these as indexers.

e.g. say you are replacing string to numbers; just construct a list (or series) of the strings, then
do s.take(numbers) (or s.iloc)?

will be orders of magnitude faster

dsm054 · 2014-03-24T15:54:03Z

@JanSchulz: are those the numbers you get when you use the apply/get combination?

@jreback: I think the issue is that the following doesn't feel right at all:

>>> to_replace = {i: 10**7+i for i in range(10**6)}
>>> series_length = 10**3
>>> col = pd.Series(np.random.randint(0, 10**6, series_length))
>>> 
>>> %time slow = col.replace(to_replace)
CPU times: user 30 s, sys: 0 ns, total: 30 s
Wall time: 30 s
>>> %time fast = col.apply(lambda x: to_replace.get(x,x))
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 1.53 ms
>>> 
>>> (slow == fast).all()
True

We should be getting O(n) values * O(1) lookup, making this very fast, and we're not. [Okay, I haven't varied the parameters enough to pick up the actual scaling -- could just be a really big constant, from what I've shown.]

jreback · 2014-03-24T15:59:20Z

hmm.....seems doing lots of work inside....

if you do this:

df = DataFrame({ 0 : col })
df.replace({ 0 : to_replace })

much faster....

prob a bug

dsm054 · 2014-03-24T16:01:07Z

I don't actually trust {0: to_replace} at the moment; see the discussion in #6689.

cpcloud · 2014-03-24T16:07:06Z

this is very weird ... looking now

cpcloud · 2014-03-24T16:38:53Z

prun of @dsm054's example shows that BlockManager.replace_list is to blame...figuring out the line ... lprun 1.0b3 breaks with newest ipython 😢

jankatins · 2014-03-24T17:46:43Z

@jreback each row in my data set has a "owner" and but sometimes (actually quite a lot of times) one owner has received several IDs and in that step I want to merge them into one. So in the final dataframe, the ID has to be set to the right (or at least to only one) value, so that I can df.groupby("id").

The shorter numbers are from the apply/get (2 seconds vs 20 mins from "chunked replace").

jankatins · 2014-03-24T17:49:22Z

Anyway: the initial problem is not so much the time it takes but that it now takes much more memory than in last year with ~0.11.

dsm054 · 2014-03-24T17:59:50Z

I'd bet a shiny penny (now a collector's item in Canada) that the time and memory issues aren't independent.

cpcloud · 2014-03-24T18:24:31Z

Time seems to be linear in the number of keys

if __name__ == '__main__':
    import time
    import numpy as np
    from pandas import Series
    start = 1
    stop = 10 ** 6 + 1
    step = 10 ** 5
    sizes = np.arange(start, stop, step)
    to_rep = {n: {i: 10 ** 5 + i for i in range(n)} for n in sizes}
    col = Series(np.random.randint(0, 10 ** 6, size=10 ** 3))

    times = Series(dtype=float, index=to_rep.keys())

    for k, v in to_rep.iteritems():
        tic = time.time()
        col.replace(v)
        toc = time.time() - tic
        times[k] = toc

    times.plot(ax=ax)

dsm054 · 2014-03-24T18:35:11Z

I think you can see the linearity in the code. Suppressing irrelevant detail, the structure looks like

for blk in self.blocks:
     for i, (s, d) in enumerate(zip(src_list, dest_list)):
           result = blk.replace(args)

and so it calls a replace (or a putmask on the non-object branch) for every key, value pair. It looks like we're trying to leverage numpy to speed things up. That might be good in the few-keyval long array regime, but it's going to be beaten by a simple listcomp + dict lookup pretty quickly in others.

cpcloud · 2014-03-24T18:37:02Z

Yep .... was just looking at that block can't really get too deep into it right now can look this evening.

jreback · 2014-04-06T15:40:30Z

@cpcloud any luck with this?

cpcloud · 2014-04-06T15:44:28Z

I haven't really looked at this. Getting the dtype issue fixed from #6689 more important i think ... i'm actually working on the that right now

cpcloud · 2014-04-06T20:38:20Z

I've just done %memit between v0.11.0 and master and I can't see any kind of memory usage difference between these two commits.

Commits between the date provided by @JanSchulz and the v0.11.0 tag date (2013-04-22):

git log --oneline --no-merges --after={2013-04-22} --before={2013-05-17} --name-only

shows

f9eea30 RLS: Version 0.11
RELEASE.rst
doc/source/v0.11.0.txt
setup.py

f24b923 BUG: respect passed chunksize in read_csv when using get_chunk function. close #3406
RELEASE.rst
pandas/io/parsers.py
pandas/io/tests/test_parsers.py

d385d25 DOC: Adding parameters to frequencies, offsets (issue #2916)
pandas/tseries/frequencies.py
pandas/tseries/offsets.py

so it looks like nothing was touched regarding DataFrame.replace() (e.g., making it faster)

memit numero uno

v0.11.0

In [5]: pd.version.version
Out[5]: '0.11.0'

In [6]: paste
from pandas import Series
from numpy.random import randint


def rep_test():
    s = Series(randint(10 ** 6, size=1000))
    to_rep = {i: 10 ** 5 + i for i in range(10 ** 6)}
    return s.replace(to_rep)
## -- End pasted text --

In [7]: %load_ext memory_profiler

In [8]: %memit rep_test()
peak memory: 1511.13 MiB, increment: 1422.84 MiB

y dos

In [3]: pd.version.version
Out[4]: '0.13.1-594-ge84efe5'

In [4]: %memit rep_test()
peak memory: 1381.62 MiB, increment: 1287.60 MiB

cpcloud · 2014-04-06T20:43:13Z

Wow changing rep_test from this

def rep_test():
    s = Series(randint(10 ** 6, size=1e3))
    to_rep = {i: 10 ** 5 + i for i in range(10 ** 6)}
    return s.replace(to_rep)

to

def rep_test():
    s = Series(randint(10 ** 6, size=1e6))
    to_rep = {i: 10 ** 5 + i for i in range(7 * 10 ** 4)}
    return s.replace(to_rep)

crashes the interpreter. Back to the drawing board.

cpcloud · 2014-04-06T21:03:40Z

ah i see the issue ... replace with dicts is now using replace list ... before the ndarray refactor Series had its own replace method that was different than the generic one. some nice stuff in the series replace method i may move some of that old code into current replace

cpcloud · 2014-04-19T13:28:09Z

@JanSchulz can you provide a sample dataset that is a faithful "reproduction" of the one you're using? Meaning, enough info so that I can create a fake frame to test the issue with. I still get huge memory usage for replace even in v0.11.0 so i'm not sure where your issue lies. The issue is that when there's an overlap between keys and values a mask is built up for each replacement value and stored in a dict which is where the bulk of the memory usage is coming from, this is how it worked in v0.11.0, so i'm not sure why you weren't seeing this before

cpcloud · 2014-04-19T13:52:03Z

@JanSchulz also are you using the exact same dataset to test this as when you used this with pandas 0.11?

jankatins · 2014-04-24T12:18:35Z

The sample is basically this:

record, authorid
1, 1111111111
2, 2222222222
3, 3333333333
4, 4444444444
5, 5555555555
6, 6666666666
7, 1111111111 # a second entry of author 111111111
....[about 1mio of such lines]...

The replacements would then replace 1111111111 with 2222222222 or 4444444444 and 5555555555 with 6666666666 ("merge two or more authors into one"), but leave some authorids alone. I've about 70k such replacements. The author IDs were int64.

The data is the same, only the mergelist was slightly different (found about 20 "manual" merges, the automatic merges were 70k).

cpcloud · 2014-04-26T19:03:54Z

@JanSchulz can you try your replacement on pandas version 0.11.0 and report back to let me know if there's a difference? I'm having trouble isolating the memory issue without the exact dataset.

Also, it would be great if you could come up with a reproducible example that doesn't depend on your dataset. I can't reproduce the changes in memory usage (i.e., this has been a memory hog since at least v0.11.0) with the examples I've shown above. Without such an example, I'm essentially shooting in the dark.

cpcloud · 2014-04-26T19:05:00Z

@jreback can i push to 0.15? i can't repro this

chedatomasz · 2020-01-24T18:03:46Z

Has there been any progress on this? I feel like it could be a great improvement, as - to the best of my knowledge - there is currently no fast way to replace a DataFrame column based on a dict, and that is a fairly common use case

mmcdermott · 2021-08-27T19:59:06Z

Any update on this? It would be greatly appreciated.

jreback · 2021-08-27T20:08:41Z

@mmcdermott this is a very old issue - check whether this still exists in master - if not investigations welcome

mmcdermott · 2021-08-27T20:19:07Z

I found this issue because my code was OOM when I tried to run df[col].replace(replace_dict), where replace_dict is a very large dictionary. As suggested by this thread, I replaced this code with df[col].apply(lambda x: replace_dict.get(x, x)) and it works fine, so the issue is still present.

DCunnama · 2021-10-01T09:34:04Z

I can confirm this is still happening, just ran into this

mmcdermott · 2021-10-01T15:44:24Z

The "apply" solution works fine for me to resolve this issue, but someone also recently pointed me to the map function: https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html which I've been told (but haven't tested myself) doesn't have the memory leak/footprint issue of replace with a large dict, in case it helps others.

DCunnama · 2021-10-02T08:10:51Z

Thanks, the "apply" solution works for me too, I will give the map function a try too.

jbrockmendel · 2021-12-18T06:51:15Z

Best guess based on having spent a lot of time in the relevant code recently (but not having done any profiling specific to this issue):

replace_dict.items() ends up getting passed to Block.replace_list. There we pre-compute (I think the pre-compute bit was an optimization more recent than this issue) a mask for each key in replace_dict. So we end up with len(replace_dict) ndarrays of block.values.size bytes each.

In the OP replace_dict had about 70k entries and len(df) was 974757 then we're looking at about 68GB worth of ndarrays.

Option 1) avoid that pre-allocation at potential speed hit
Option 2) when len(replace_dict) >> len(values) go down a different path that just iterates over values once doing a dict lookup for each one.

iKolt · 2022-11-14T16:10:49Z

This code works fine:

df['col'] = df['col'].map(replacing_dict).combine_first(df['col'])
or
df['col'] = df['col'].map(replacing_dict).fillna(df['col'])

Usage of fillna is faster for small Series (checked some atrificial cases - 2x speedup) than combine_first, but combine_first is more self-commenting for me and the difference in speed is fading than Series is ~10K lines or more

lithomas1 · 2023-01-13T22:42:54Z

Taking a look at this.

jreback added Missing-data labels Mar 24, 2014

jreback added this to the 0.14.0 milestone Mar 24, 2014

jreback assigned cpcloud Mar 24, 2014

cpcloud changed the title ~~col.replace(dict) takes to much memory~~ col.replace(dict) takes too much memory Apr 17, 2014

jreback modified the milestones: 0.15.0, 0.14.0 Apr 27, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015

aldanor mentioned this issue Mar 17, 2016

.replace() with a dict about 60x slower than .map().fillna() #12657

Closed

jreback mentioned this issue Mar 30, 2016

BUG/API: Fix .replace dtype conversion rules #12747

Closed

jbrockmendel added the replace replace method label Sep 20, 2020

mroeschke removed the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Apr 11, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

jbrockmendel mentioned this issue Feb 20, 2023

PERF: Use generator expression for Blocks.replace_list #50778

Merged

5 tasks

col.replace(dict) takes too much memory #6697

col.replace(dict) takes too much memory #6697

Comments

jankatins commented Mar 24, 2014

jreback commented Mar 24, 2014

jankatins commented Mar 24, 2014

jreback commented Mar 24, 2014

dsm054 commented Mar 24, 2014

jankatins commented Mar 24, 2014

jreback commented Mar 24, 2014

dsm054 commented Mar 24, 2014

jreback commented Mar 24, 2014

dsm054 commented Mar 24, 2014

cpcloud commented Mar 24, 2014

cpcloud commented Mar 24, 2014

jankatins commented Mar 24, 2014

jankatins commented Mar 24, 2014

dsm054 commented Mar 24, 2014

cpcloud commented Mar 24, 2014

dsm054 commented Mar 24, 2014

cpcloud commented Mar 24, 2014

jreback commented Apr 6, 2014

cpcloud commented Apr 6, 2014

cpcloud commented Apr 6, 2014

cpcloud commented Apr 6, 2014

cpcloud commented Apr 6, 2014

cpcloud commented Apr 19, 2014

cpcloud commented Apr 19, 2014

jankatins commented Apr 24, 2014

cpcloud commented Apr 26, 2014

cpcloud commented Apr 26, 2014

chedatomasz commented Jan 24, 2020

mmcdermott commented Aug 27, 2021

jreback commented Aug 27, 2021

mmcdermott commented Aug 27, 2021

DCunnama commented Oct 1, 2021

mmcdermott commented Oct 1, 2021

DCunnama commented Oct 2, 2021

jbrockmendel commented Dec 18, 2021

iKolt commented Nov 14, 2022

lithomas1 commented Jan 13, 2023