rbind.fill still quadratic with huge factors (>1000 levels) #206
At a guess, it's in the unification of factor levels. When you take a 1-element slice of a factor with N elements and N levels, To skip that step there'd have to be some way for rbind.fill to notice or
On Thu, Mar 13, 2014 at 3:37 PM, Kirill Müller notifications@github.com wrote:
Autodetection would need to happen in constant time (per row), of course. I think Is there a way to determine the address of the data for an R object (without diving into C/C++)? Address equality implies equality, and by finding duplicate addresses first (and ignoring the corresponding objects) we avoid the NxN runtime. A perhaps simpler variant would be to call On the other hand, my earlier variant combines N 1-element factors with 1 level each, and performance seems to degrade, too. Could this be a different issue?
Unfortunately it looks like factor levels are deep-copied, so `identical` won't work (and it's O(N^2) before it gets to rbind.fill anyway).
But the behavior or
Also, the behavior has changed, at least in R-devel: http://r.789695.n4.nabble.com/Deep-copy-of-factor-levels-tt4686956.html . @crowding: Would you be willing to take another look?
Interesting. Presuming that pointer comparison works, the other thing that's probably contributing is the calls to
I'd prefer starting with a special case for the most common usage in I think that matching a vector to another vector is O(N+M), in this case we wouldn't have to reinvent the wheel. Presumably, this would happen when creating a factor only after the data is complete. Would that be possible without mixing up too much code? It looks like there's already a faster version of
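To illustrate the idea of creating the factor only after the data is complete, here is a minimal base-R sketch. The helper name `combine_factors` is hypothetical and not part of plyr; it is an assumption made for illustration, not how rbind.fill is implemented. It also differs from a true level union in that unused levels are dropped and levels are ordered by first appearance.

```r
# Hypothetical helper (not part of plyr): combine a list of factors by
# collecting their values as character vectors and building the factor
# once at the end, instead of unifying levels piece by piece.
combine_factors <- function(factors) {
  # Convert each piece to character; the cost depends on the number of
  # elements in the piece, not on how many levels it carries.
  values <- unlist(lapply(factors, as.character), use.names = FALSE)
  # unique() makes a single pass over the complete data.
  lev <- unique(values)
  # factor() matches the values against the levels once.
  factor(values, levels = lev)
}

# N one-element factors with one level each, as in the earlier variant
n <- 1000
pieces <- lapply(seq_len(n), function(i) factor(paste0("level", i)))
combined <- combine_factors(pieces)
```

Whether dropping unused levels is acceptable depends on the caller; keeping them would need a separate pass over the level vectors themselves.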
Closing this since it should be better in dplyr.
Binding data frames with columns that are factors with many levels seems to take quadratic run time. See http://rpubs.com/krlmlr/14321 (and http://rpubs.com/krlmlr/huge_factors for an earlier variant).
@crowding: Any idea what might cause this?
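For readers without access to the linked pages, a rough sketch of how such a measurement could be set up follows. This is not the code behind the RPubs links; the helper `make_pieces` and the chosen sizes are assumptions for illustration. The timing loop only runs if plyr is installed, and no particular timings are claimed.

```r
# Hypothetical reproduction sketch: build n one-row data frames, each
# holding a 1-element slice of a factor that has n levels.
make_pieces <- function(n) {
  f <- factor(paste0("level", seq_len(n)))
  # f[i] keeps all n levels, so every piece carries the full level set
  lapply(seq_len(n), function(i) data.frame(x = f[i]))
}

# Time rbind.fill for growing n; if the run time roughly quadruples when
# n doubles, that suggests quadratic behavior.
if (requireNamespace("plyr", quietly = TRUE)) {
  for (n in c(250, 500, 1000)) {
    pieces <- make_pieces(n)
    elapsed <- system.time(plyr::rbind.fill(pieces))[["elapsed"]]
    cat(n, "pieces:", elapsed, "seconds\n")
  }
}
```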