col.replace(dict) takes too much memory #6697
Can you provide the file and the pandas/numpy versions that it was better on, and what you are trying now? Also, try w/o the in-place operation.
The replacement file is no problem, but the dataframe is internal and about 400MB of CSV unzipped :-( Now I do it similarly to this:
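A minimal sketch of that chunked approach (the column name and the tiny stand-in data are made up; the real dict has ~70k entries):

```python
import pandas as pd

# dummy stand-ins for the real (internal) data
df = pd.DataFrame({'id': [1, 2, 3, 2, 1]})
replacement_dict = {1: 10, 2: 20, 3: 30}

def chunked(d, size):
    """Yield successive sub-dicts with at most `size` items each."""
    items = list(d.items())
    for start in range(0, len(items), size):
        yield dict(items[start:start + size])

for chunk in chunked(replacement_dict, 2000):  # 2k keys per chunk
    df['id'] = df['id'].replace(chunk)
```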
This works (takes about 11min) and 4.7GB of RAM are in use.
I don't remember the numpy/pandas versions from 10 months ago (the replacement file was generated on 17.05.2013, so during that week it worked) :-( It was probably whatever was on pandas master and the latest numpy release (I compile pandas from head and install numpy from Christoph Gohlke's installers).
@cpcloud?
Something seems strange here. @JanSchulz: how long does a single replace call take on your data?
WTF:
This took 20min with the above "2k chunks" code.
@JanSchulz I don't understand your problem exactly, but instead of replacing, why wouldn't you use these as indexers? E.g. say you are replacing strings with numbers: just construct a list (or Series) of the strings and index with it; that will be orders of magnitude faster.
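If I read the indexer suggestion right, one way to do that would be something like this (my interpretation, with toy data; it assumes every value in the column has an entry in the mapping):

```python
import pandas as pd

col = pd.Series(['a', 'b', 'c', 'b', 'a'])
mapping = {'a': 1, 'b': 2, 'c': 3}

# Turn the dict into a Series keyed by the old values and do a single
# vectorized label lookup instead of a per-key replace.
lookup = pd.Series(mapping)
replaced = pd.Series(lookup.loc[col.to_numpy()].to_numpy(), index=col.index)
```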
@JanSchulz: are those the numbers you get when you use the chunked code? @jreback: I think the issue is that the following doesn't feel right at all:
We should be getting O(n) values * O(1) lookup, making this very fast, and we're not. [Okay, I haven't varied the parameters enough to pick up the actual scaling -- could just be a really big constant, from what I've shown.]
hmm..... seems to be doing lots of work inside.... if you do this:
much faster.... prob a bug
I don't actually trust
this is very weird ... looking now
@jreback each row in my data set has an "owner", but sometimes (actually quite a lot of the time) one owner has received several IDs, and in that step I want to merge them into one. So in the final dataframe, the ID has to be set to the right (or at least to only one) value, so that I can group on it. The shorter numbers are from the chunked runs.
Anyway: the initial problem is not so much the time it takes, but that it now takes much more memory than it did last year with ~0.11.
I'd bet a shiny penny (now a collector's item in Canada) that the time and memory issues aren't independent.
Time seems to be linear in the number of keys:

```python
if __name__ == '__main__':
    import time

    import matplotlib.pyplot as plt
    import numpy as np
    from pandas import Series

    start = 1
    stop = 10 ** 6 + 1
    step = 10 ** 5
    sizes = np.arange(start, stop, step)
    to_rep = {n: {i: 10 ** 5 + i for i in range(n)} for n in sizes}
    col = Series(np.random.randint(0, 10 ** 6, size=10 ** 3))
    times = Series(dtype=float, index=list(to_rep.keys()))

    for k, v in to_rep.items():  # iteritems() in the original (Python 2)
        tic = time.time()
        col.replace(v)
        toc = time.time() - tic
        times[k] = toc

    fig, ax = plt.subplots()  # ax was not defined in the original snippet
    times.plot(ax=ax)
    plt.show()
```
I think you can see the linearity in the code. Suppressing irrelevant detail, the structure is a loop over the replacement dict, and so it calls a replace (or a putmask on the non-object branch) for every key/value pair. It looks like we're trying to leverage numpy to speed things up. That might be good in the few-keys, long-array regime, but it's going to be beaten by a simple listcomp + dict lookup pretty quickly in other regimes.
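For comparison, the simple listcomp + dict lookup alternative would be roughly this (a sketch, not the actual internal code):

```python
import pandas as pd

col = pd.Series([5, 1, 3, 1, 5])
to_rep = {1: 100001, 3: 100003}

# One pass over the values: O(len(col)) dict lookups, independent of the
# number of keys in `to_rep`.
replaced = pd.Series([to_rep.get(v, v) for v in col], index=col.index)
```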
Yep ... was just looking at that block. Can't really get too deep into it right now; can look this evening.
@cpcloud any luck with this?
I haven't really looked at this. Getting the
I've just checked what changed. Commits between the v0.11.0 release and the date provided by @JanSchulz,

```
git log --oneline --no-merges --after={2013-04-22} --before={2013-05-17} --name-only
```

show that nothing was touched regarding the relevant code. `%memit` numero uno:

y dos:
Wow, changing

```python
from numpy.random import randint
from pandas import Series

def rep_test():
    s = Series(randint(10 ** 6, size=10 ** 3))   # size was written as 1e3
    to_rep = {i: 10 ** 5 + i for i in range(10 ** 6)}
    return s.replace(to_rep)
```

to

```python
def rep_test():
    s = Series(randint(10 ** 6, size=10 ** 6))   # size was written as 1e6
    to_rep = {i: 10 ** 5 + i for i in range(7 * 10 ** 4)}
    return s.replace(to_rep)
```

crashes the interpreter. Back to the drawing board.
ah i see the issue ... replace with dicts is now using the list-replace path, which it wasn't before the ndarray refactor
@JanSchulz can you provide a sample dataset that is a faithful "reproduction" of the one you're using? Meaning, enough info so that I can create a fake frame to test the issue with. I still get huge memory usage for replace even in v0.11.0, so I'm not sure where your issue lies. The issue is that when there's an overlap between keys and values, a mask is built up for each replacement value and stored in a dict, which is where the bulk of the memory usage is coming from. This is how it worked in v0.11.0, so I'm not sure why you weren't seeing this before.
@JanSchulz also, are you using the exact same dataset to test this as when you used this with pandas 0.11?
The sample is basically this:
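Since the real data is internal, a fake frame of roughly the same shape could be built like this (the column name, value ranges, and merge rule are all made up; only the sizes come from the thread):

```python
import numpy as np
import pandas as pd

# made-up column name and random data, just matching the rough shape
n_rows = 974_757
df = pd.DataFrame({'owner_id': np.random.randint(0, 500_000, size=n_rows)})

# ~70k "automatic" merges: each duplicate ID maps to the ID it merges into
old_ids = np.random.choice(500_000, size=70_000, replace=False)
mergelist = {int(old): int(old) % 10_000 for old in old_ids}
```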
The replacements would then replace the old IDs with the merged ones. The data is the same; only the mergelist was slightly different (I found about 20 "manual" merges; the automatic merges were 70k).
@JanSchulz can you try your replacement on pandas version 0.11.0 and report back to let me know if there's a difference? I'm having trouble isolating the memory issue without the exact dataset. Also, it would be great if you could come up with a reproducible example that doesn't depend on your dataset. I can't reproduce the changes in memory usage (i.e., this has been a memory hog since at least v0.11.0) with the examples I've shown above. Without such an example, I'm essentially shooting in the dark.
@jreback can i push to 0.15? i can't repro this
Has there been any progress on this? I feel like it could be a great improvement, as, to the best of my knowledge, there is currently no fast way to replace a DataFrame column based on a dict, and that is a fairly common use case.
Any update on this? It would be greatly appreciated.
@mmcdermott this is a very old issue - check whether this still exists in master - if not, investigations welcome
I found this issue because my code was OOMing when I tried to run a replace with a large dict.
I can confirm this is still happening, just ran into this.
The "apply" solution works fine for me to resolve this issue, but someone also recently pointed me to the |
Thanks, the "apply" solution works for me too, I will give the map function a try too. |
Best guess based on having spent a lot of time in the relevant code recently (but not having done any profiling specific to this issue):
In the OP, replace_dict had about 70k entries and len(df) was 974757; if a boolean mask the length of the frame is allocated per replacement key, we're looking at about 68GB worth of ndarrays. Option 1) avoid that pre-allocation at a potential speed hit.
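A back-of-the-envelope check of that figure, assuming one 1-byte boolean mask of length len(df) per replacement key:

```python
n_keys = 70_000      # entries in replace_dict (from the OP)
n_rows = 974_757     # len(df) (from the OP)

# one np.bool_ mask (1 byte per element) of length len(df) per key
total_bytes = n_keys * n_rows
print(total_bytes / 1e9)  # ~68.2 GB
```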
This code works fine:
Usage of
Taking a look at this. |
I've got code which ran fine about ~10 months ago and now fails due to running out of memory. The code basically does this:
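Roughly this (a reconstruction with stand-in names and tiny dummy data; the real column and dict come from the internal dataset):

```python
import pandas as pd

# stand-ins: the real frame is ~400 MB of CSV and the dict has ~70k entries
df = pd.DataFrame({'id': [1, 2, 3, 2, 1]})
replacement_dict = {2: 1, 3: 1}

df['id'] = df['id'].replace(replacement_dict)  # the step that runs out of memory
```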
I've now split the replacement_dict into 2k chunks and this works (takes about 20 seconds for each chunk).