
col.replace(dict) takes too much memory #6697

Open
jankatins opened this issue Mar 24, 2014 · 37 comments
Labels: Performance (memory or execution speed), replace (replace method)

Comments

@jankatins
Contributor

I've got code which ran fine about ~10 months ago and now fails due to running out of memory. The code basically does this:

groups = [[1, 2, 3], [222, 333, 444, 555], ...]  # from a 1.4MB json file
# about 30k such groups, results in ~70k replacements
replace_dict = {}
for replace_list in groups:
    replacement = sorted(replace_list)[0]  # merge each group into its smallest ID
    for replace_id in replace_list:
        if replace_id == replacement:
            continue
        replace_dict[replace_id] = replacement
# len(_col) == 974757
_col = df[column].astype("int64")
_col.replace(replace_dict, inplace=True)

I've now split the replace_dict into 2k chunks and this works (it takes about 20 seconds per chunk).

@jreback
Contributor

jreback commented Mar 24, 2014

can you provide the file, and the pandas/numpy versions that it was better on and what you are trying now?

also, try without the in-place operation

@jankatins
Contributor Author

The replacement file is no problem, but the dataframe is internal and about 400MB as unzipped CSV :-(

Now I do it similarly to this:

groups = [[1, 2, 3], [222, 333, 444, 555], ...]  # from a 1.4MB json file
# about 30k such groups, results in ~70k replacements
count = 0
_col = df[column].astype("int64")
replace_dict = {}
for replace_list in groups:
    count += 1
    replacement = sorted(replace_list)[0]
    for replace_id in replace_list:
        if replace_id == replacement:
            continue
        replace_dict[replace_id] = replacement
    if count > 2000:
        _col = _col.replace(replace_dict)
        replace_dict = {}
        count = 0
_col = _col.replace(replace_dict)  # for the last chunk

This works (it takes about 11 min), with 4.7GB of RAM in use.

pandas._np_version
Out[3]: '1.8.0'

pandas.__version__
Out[4]: '0.13.1-489-g7ffa655'

I don't remember the numpy/pandas versions from 10 months ago (the replacement file was generated on 2013-05-17, so it worked during that week) :-( It was probably whatever was on pandas master plus the latest numpy release (I compile pandas from head and install numpy from Gohlke's installers).

@jreback
Contributor

jreback commented Mar 24, 2014

@cpcloud ?

@dsm054
Contributor

dsm054 commented Mar 24, 2014

Something seems strange here.

@JanSchulz: how long does _col.apply(lambda x: replace_dict.get(x, x)) take?

@jankatins
Contributor Author

WTF:

replacements_by_XXX.json (16:35:42):  groups: 31612 replacements: 77846 =>  1.36s
replacements_by_XXXY.json (16:35:43):  groups: 19576 replacements: 33845 =>  1.03s
replacements_manual.json (16:35:44):  groups: 16 replacements: 17 =>  864ms
Wall time: 3.26 s

This took 20min with the above "2k chunks" code.

@jreback
Contributor

jreback commented Mar 24, 2014

@JanSchulz I don't understand your problem exactly, but instead of replacing, why wouldn't you use these as indexers?

e.g., say you are replacing strings with numbers; just construct a list (or Series) of the strings, then do s.take(numbers) (or s.iloc).

will be orders of magnitude faster
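
A minimal sketch of that indexer idea (hypothetical names and sizes; it assumes the values have first been turned into small non-negative integer codes, e.g. via pd.factorize, which the raw author IDs in this issue are not):

import numpy as np
import pandas as pd

# Once the IDs are factorized into codes 0..k-1, the whole replacement
# becomes a single vectorized gather instead of per-key scans.
codes = pd.Series(np.random.randint(0, 5, size=1_000_000))   # positions into `lookup`
lookup = pd.Series([10, 11, 12, 13, 14])                      # lookup[i] = replacement for code i

translated = lookup.take(codes.to_numpy()).reset_index(drop=True)
# For arbitrary IDs (like the author IDs here), codes could come from
# codes, uniques = pd.factorize(df["authorid"]).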

@dsm054
Contributor

dsm054 commented Mar 24, 2014

@JanSchulz: are those the numbers you get when you use the apply/get combination?

@jreback: I think the issue is that the following doesn't feel right at all:

>>> to_replace = {i: 10**7+i for i in range(10**6)}
>>> series_length = 10**3
>>> col = pd.Series(np.random.randint(0, 10**6, series_length))
>>> 
>>> %time slow = col.replace(to_replace)
CPU times: user 30 s, sys: 0 ns, total: 30 s
Wall time: 30 s
>>> %time fast = col.apply(lambda x: to_replace.get(x,x))
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 1.53 ms
>>> 
>>> (slow == fast).all()
True

We should be getting O(n) values * O(1) lookup, making this very fast, and we're not. [Okay, I haven't varied the parameters enough to pick up the actual scaling -- could just be a really big constant, from what I've shown.]

@jreback
Contributor

jreback commented Mar 24, 2014

hmm... it seems to be doing lots of work inside...

if you do this:

df = pd.DataFrame({0: col})       # using col and to_replace from the example above
df.replace({0: to_replace})

much faster....

prob a bug

@dsm054
Contributor

dsm054 commented Mar 24, 2014

I don't actually trust {0: to_replace} at the moment; see the discussion in #6689.

@jreback jreback added this to the 0.14.0 milestone Mar 24, 2014
@cpcloud
Member

cpcloud commented Mar 24, 2014

this is very weird ... looking now

@cpcloud
Member

cpcloud commented Mar 24, 2014

A prun of @dsm054's example shows that BlockManager.replace_list is to blame... figuring out the line... lprun 1.0b3 breaks with the newest IPython 😢
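
For reference, a plain-cProfile version of that profiling step (a sketch with made-up sizes, using a smaller dict so it finishes quickly; not the exact commands used here):

import cProfile

import numpy as np
import pandas as pd

to_replace = {i: 10 ** 7 + i for i in range(10 ** 5)}
col = pd.Series(np.random.randint(0, 10 ** 6, size=10 ** 3))

# Sort by cumulative time to see how much lands under BlockManager.replace_list.
cProfile.runctx("col.replace(to_replace)", globals(), locals(), sort="cumulative")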

@jankatins
Contributor Author

@jreback each row in my data set has an "owner", but sometimes (actually quite a lot of times) one owner has received several IDs, and in that step I want to merge them into one. So in the final dataframe the ID has to be set to the right (or at least to only one) value, so that I can df.groupby("id").

The shorter numbers are from the apply/get approach (2 seconds vs. 20 min for the "chunked replace").

@jankatins
Contributor Author

Anyway: the initial problem is not so much the time it takes but that it now takes much more memory than it did last year with ~0.11.

@dsm054
Contributor

dsm054 commented Mar 24, 2014

I'd bet a shiny penny (now a collector's item in Canada) that the time and memory issues aren't independent.

@cpcloud
Member

cpcloud commented Mar 24, 2014

Time seems to be linear in the number of keys

if __name__ == '__main__':
    import time
    import matplotlib.pyplot as plt
    import numpy as np
    from pandas import Series

    start = 1
    stop = 10 ** 6 + 1
    step = 10 ** 5
    sizes = np.arange(start, stop, step)
    to_rep = {n: {i: 10 ** 5 + i for i in range(n)} for n in sizes}
    col = Series(np.random.randint(0, 10 ** 6, size=10 ** 3))

    times = Series(dtype=float, index=list(to_rep.keys()))

    # time col.replace() against increasingly large replacement dicts
    for k, v in to_rep.items():
        tic = time.time()
        col.replace(v)
        toc = time.time() - tic
        times[k] = toc

    fig, ax = plt.subplots()
    times.plot(ax=ax)
    plt.show()

[plot: col.replace() time vs. number of replacement keys, roughly linear]

@dsm054
Copy link
Contributor

dsm054 commented Mar 24, 2014

I think you can see the linearity in the code. Suppressing irrelevant detail, the structure looks like

for blk in self.blocks:
    for i, (s, d) in enumerate(zip(src_list, dest_list)):
        result = blk.replace(s, d, ...)

and so it calls a replace (or a putmask on the non-object branch) for every key/value pair. It looks like we're trying to leverage numpy to speed things up. That might be good in the few-keys, long-array regime, but it's going to be beaten by a simple listcomp + dict lookup pretty quickly in other regimes.
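
For concreteness, a listcomp + dict lookup version of the example above (the same thing the apply/get one-liner does, just spelled out):

import numpy as np
import pandas as pd

to_replace = {i: 10 ** 7 + i for i in range(10 ** 6)}
col = pd.Series(np.random.randint(0, 10 ** 6, size=10 ** 3))

# One pass over the values with an O(1) dict lookup each,
# independent of how many keys to_replace has.
replaced = pd.Series([to_replace.get(x, x) for x in col.to_numpy()], index=col.index)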

@cpcloud
Member

cpcloud commented Mar 24, 2014

Yep... was just looking at that block; can't really get too deep into it right now, can look this evening.

@jreback
Contributor

jreback commented Apr 6, 2014

@cpcloud any luck with this?

@cpcloud
Member

cpcloud commented Apr 6, 2014

I haven't really looked at this. Getting the dtype issue from #6689 fixed is more important, I think ... I'm actually working on that right now.

@cpcloud
Member

cpcloud commented Apr 6, 2014

I've just done %memit between v0.11.0 and master and I can't see any kind of memory usage difference between these two commits.

Commits between the date provided by @JanSchulz and the v0.11.0 tag date (2013-04-22):

git log --oneline --no-merges --after={2013-04-22} --before={2013-05-17} --name-only

shows

f9eea30 RLS: Version 0.11
RELEASE.rst
doc/source/v0.11.0.txt
setup.py

f24b923 BUG: respect passed chunksize in read_csv when using get_chunk function. close #3406
RELEASE.rst
pandas/io/parsers.py
pandas/io/tests/test_parsers.py

d385d25 DOC: Adding parameters to frequencies, offsets (issue #2916)
pandas/tseries/frequencies.py
pandas/tseries/offsets.py

so it looks like nothing was touched regarding DataFrame.replace() (e.g., making it faster)

memit run number one

v0.11.0

In [5]: pd.version.version
Out[5]: '0.11.0'

In [6]: paste
from pandas import Series
from numpy.random import randint


def rep_test():
    s = Series(randint(10 ** 6, size=1000))
    to_rep = {i: 10 ** 5 + i for i in range(10 ** 6)}
    return s.replace(to_rep)
## -- End pasted text --

In [7]: %load_ext memory_profiler

In [8]: %memit rep_test()
peak memory: 1511.13 MiB, increment: 1422.84 MiB

and run number two

In [3]: pd.version.version
Out[4]: '0.13.1-594-ge84efe5'

In [4]: %memit rep_test()
peak memory: 1381.62 MiB, increment: 1287.60 MiB

@cpcloud
Member

cpcloud commented Apr 6, 2014

Wow changing rep_test from this

def rep_test():
    s = Series(randint(10 ** 6, size=10 ** 3))
    to_rep = {i: 10 ** 5 + i for i in range(10 ** 6)}
    return s.replace(to_rep)

to

def rep_test():
    s = Series(randint(10 ** 6, size=10 ** 6))
    to_rep = {i: 10 ** 5 + i for i in range(7 * 10 ** 4)}
    return s.replace(to_rep)

crashes the interpreter. Back to the drawing board.

@cpcloud
Member

cpcloud commented Apr 6, 2014

Ah, I see the issue ... replace with dicts is now using replace_list ... before the ndarray refactor, Series had its own replace method that was different from the generic one. There's some nice stuff in the Series replace method; I may move some of that old code into the current replace.

@cpcloud cpcloud changed the title col.replace(dict) takes to much memory col.replace(dict) takes too much memory Apr 17, 2014
@cpcloud
Member

cpcloud commented Apr 19, 2014

@JanSchulz can you provide a sample dataset that is a faithful "reproduction" of the one you're using? Meaning, enough info so that I can create a fake frame to test the issue with. I still get huge memory usage for replace even in v0.11.0, so I'm not sure where your issue lies. The problem is that when there's an overlap between keys and values, a mask is built up for each replacement value and stored in a dict, which is where the bulk of the memory usage is coming from. This is how it worked in v0.11.0 too, so I'm not sure why you weren't seeing it before.
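
A back-of-the-envelope sketch of why those per-key masks dominate, using the rough numbers from the original report (not the actual internals):

import numpy as np

n_rows = 974_757   # len(_col) in the original report
n_keys = 70_000    # approximate number of replacements

# One boolean mask of len(values) bytes per replacement key, roughly what
# building `values == key` for every key would allocate.
mask_bytes = n_rows * np.dtype(np.bool_).itemsize
print(f"~{n_keys * mask_bytes / 2 ** 30:.1f} GiB if all masks are kept alive")  # ~63.5 GiB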

@cpcloud
Member

cpcloud commented Apr 19, 2014

@JanSchulz also, are you using the exact same dataset to test this as when you used it with pandas 0.11?

@jankatins
Contributor Author

The sample is basically this:

record, authorid
1, 1111111111
2, 2222222222
3, 3333333333
4, 4444444444
5, 5555555555
6, 6666666666
7, 1111111111 # a second entry of author 1111111111
... [about 1 million such lines] ...

The replacements would then replace 1111111111 with 2222222222 or 4444444444, and 5555555555 with 6666666666 ("merge two or more authors into one"), but leave some author IDs alone. I have about 70k such replacements. The author IDs were int64.

The data is the same; only the merge list was slightly different (about 20 "manual" merges were found; the automatic merges were ~70k).
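
For what it's worth, a synthetic frame along those lines (made-up sizes and IDs, only meant to mimic the shape described above):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

n_rows = 1_000_000
author_ids = rng.integers(10 ** 9, 10 ** 10, size=200_000)   # 10-digit int64 IDs
df = pd.DataFrame({
    "record": np.arange(n_rows),
    "authorid": rng.choice(author_ids, size=n_rows),         # authors repeat across records
})

# ~70k "merge this author ID into that one" replacements between existing IDs
merge_from = rng.choice(author_ids, size=70_000, replace=False)
merge_to = rng.choice(author_ids, size=70_000)
replace_dict = dict(zip(merge_from, merge_to))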

@cpcloud
Member

cpcloud commented Apr 26, 2014

@JanSchulz can you try your replacement on pandas version 0.11.0 and report back to let me know if there's a difference? I'm having trouble isolating the memory issue without the exact dataset.

Also, it would be great if you could come up with a reproducible example that doesn't depend on your dataset. I can't reproduce the changes in memory usage (i.e., this has been a memory hog since at least v0.11.0) with the examples I've shown above. Without such an example, I'm essentially shooting in the dark.

@cpcloud
Member

cpcloud commented Apr 26, 2014

@jreback can I push to 0.15? I can't repro this.

@chedatomasz

Has there been any progress on this? I feel like it could be a great improvement, as, to the best of my knowledge, there is currently no fast way to replace a DataFrame column based on a dict, and that is a fairly common use case.

@jbrockmendel jbrockmendel added the replace replace method label Sep 20, 2020
@mroeschke mroeschke removed the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Apr 11, 2021
@mmcdermott

Any update on this? It would be greatly appreciated.

@jreback
Contributor

jreback commented Aug 27, 2021

@mmcdermott this is a very old issue; check whether this still exists in master; if not, investigations welcome.

@mmcdermott

I found this issue because my code ran out of memory when I tried to run df[col].replace(replace_dict), where replace_dict is a very large dictionary. As suggested in this thread, I replaced that code with df[col].apply(lambda x: replace_dict.get(x, x)) and it works fine, so the issue is still present.

@DCunnama

DCunnama commented Oct 1, 2021

I can confirm this is still happening; I just ran into it.

@mmcdermott

The "apply" solution works fine for me to resolve this issue, but someone also recently pointed me to the map function (https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html), which I've been told (but haven't tested myself) doesn't have the memory footprint issue of replace with a large dict, in case it helps others.
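
One note on the semantics: Series.map with a dict maps missing keys to NaN, so to mimic replace you either map a callable with a default or fill back from the original afterwards. A minimal sketch:

import pandas as pd

s = pd.Series([1, 2, 3, 4])
d = {2: 20, 4: 40}

out_a = s.map(lambda x: d.get(x, x))          # keeps unmapped values and the int64 dtype
out_b = s.map(d).fillna(s).astype(s.dtype)    # dict map, then fill the NaN gaps from s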

@DCunnama

DCunnama commented Oct 2, 2021

Thanks, the "apply" solution works for me too; I will give the map function a try as well.

@jbrockmendel
Member

Best guess based on having spent a lot of time in the relevant code recently (but not having done any profiling specific to this issue):

replace_dict.items() ends up getting passed to Block.replace_list. There we pre-compute (I think the pre-compute bit was an optimization more recent than this issue) a mask for each key in replace_dict. So we end up with len(replace_dict) ndarrays of block.values.size bytes each.

In the OP, replace_dict had about 70k entries and len(df) was 974757, so we're looking at about 68GB worth of ndarrays.

Option 1) avoid that pre-allocation, at a potential speed hit
Option 2) when len(replace_dict) >> len(values), go down a different path that just iterates over the values once, doing a dict lookup for each one (see the sketch below).
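
A rough sketch of what Option 2's dispatch could look like (hypothetical helper, not pandas internals; the threshold is arbitrary):

import pandas as pd

def replace_with_dict(s: pd.Series, mapping: dict, threshold: int = 10) -> pd.Series:
    """Pick a strategy based on the size of the mapping relative to the data."""
    if len(mapping) * threshold > len(s):
        # Many keys relative to rows: one pass with an O(1) dict lookup per value.
        return s.map(lambda x: mapping.get(x, x))
    # Few keys: the existing vectorized per-key path is fine.
    return s.replace(mapping)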

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@iKolt

iKolt commented Nov 14, 2022

This code works fine:

df['col'] = df['col'].map(replacing_dict).combine_first(df['col'])
# or
df['col'] = df['col'].map(replacing_dict).fillna(df['col'])

Using fillna is faster than combine_first for small Series (I checked some artificial cases: ~2x speedup), but combine_first is more self-documenting for me, and the speed difference fades once the Series is ~10K lines or more.
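
A quick way to sanity-check that on a given size (a timeit sketch with made-up data; numbers will vary):

import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 1000, size=10_000))
d = {i: -i for i in range(0, 1000, 2)}   # replace every even value

t_fillna = timeit.timeit(lambda: s.map(d).fillna(s), number=100)
t_combine = timeit.timeit(lambda: s.map(d).combine_first(s), number=100)
print(f"fillna: {t_fillna:.3f}s  combine_first: {t_combine:.3f}s")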

@lithomas1
Member

Taking a look at this.
