
rbind.fill still quadratic with huge factors (>1000 levels) #206

Closed
krlmlr opened this issue Mar 13, 2014 · 8 comments

krlmlr commented Mar 13, 2014

Binding data frames with columns that are factors with many levels seems to take quadratic run time. See http://rpubs.com/krlmlr/14321 (and http://rpubs.com/krlmlr/huge_factors for an earlier variant).

@crowding: Any idea what might cause this?
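
For reference, a minimal sketch (not part of the original report; the sizes are arbitrary) of the kind of input that triggers the slowdown:

library(plyr)

# Illustrative setup: N one-row data frames whose factor column still carries
# all N levels, as produced by slicing a large factor.
n <- 2000
f <- factor(seq_len(n))                                          # n levels
pieces <- lapply(seq_len(n), function(i) data.frame(x = f[i]))   # each slice keeps all n levels
system.time(rbind.fill(pieces))    # run time grows roughly quadratically in n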

crowding (Contributor) commented

At a guess, it's in the unification of factor levels.

When you take a 1-element slice of a factor with N elements and N levels, you get a factor with 1 element and N levels. When it gets to rbind.fill, it's trying to recombine N 1-element factors which each have N levels, and it does this by taking the union of all N^2 levels.

To skip that step there'd have to be some way for rbind.fill to notice or be told that all the factor levels are the same.
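
A quick illustration of that point (not from the thread):

f <- factor(letters)   # 26 elements, 26 levels
nlevels(f[1])          # still 26: a one-element slice keeps every level
# With N such slices, unifying the level sets touches on the order of N * N level strings.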


krlmlr commented Mar 13, 2014

Autodetection would need to happen in constant time (per row), of course. I think identical can do that: http://rpubs.com/krlmlr/identical . I'm not sure how to tell rbind.fill that all factor levels are the same, so autodetection is perhaps safer.

Is there a way to determine the address of the data for an R object (without diving into C/C++)? Address equality implies equality, and by finding duplicate addresses first (and ignoring the corresponding objects) we avoid the NxN runtime. A perhaps simpler variant would be to call identical on each pair of consecutive elements (like diff).
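
A rough sketch of that diff-like check, assuming the factor columns are available as a list (the helper name is made up):

# Hypothetical helper: TRUE if consecutive pieces share identical level sets,
# using N - 1 calls to identical() instead of an N x N comparison.
consecutive_levels_identical <- function(factors) {
  if (length(factors) < 2) return(TRUE)
  all(vapply(seq_len(length(factors) - 1),
             function(i) identical(levels(factors[[i]]),
                                   levels(factors[[i + 1]])),
             logical(1)))
}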

On the other hand, my earlier variant combines N 1-element factors with 1 level each, and performance seems to degrade, too. Could this be a different issue?

crowding (Contributor) commented

Unfortunately it looks like factor levels are deep-copied, so identical won't work (and it's O(N^2) before it gets to rbind.fill anyway):

> x <- factor(c("a", "b", "c"))
> x1 <- x[1]
> x2 <- x[2]
> .Internal(inspect(x1))
@13773b598 13 INTSXP g0c1 [OBJ,NAM(2),ATT] (len=1, tl=0) 1
ATTRIB:
  @11af23c98 02 LISTSXP g0c0 [] 
    TAG: @10080bbc8 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000] "levels" (has value)
    @1247b8848 16 STRSXP g0c3 [NAM(1)] (len=3, tl=0)
      @1009b6fa8 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "a"
      @10085d038 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "b"
      @10081d068 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "c"
    TAG: @10080bf48 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000] "class" (has value)
    @13773b6e8 16 STRSXP g0c1 [NAM(1)] (len=1, tl=0)
      @1008f4328 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "factor"
> .Internal(inspect(x2))
@13773aaf8 13 INTSXP g0c1 [OBJ,NAM(2),ATT] (len=1, tl=0) 2
ATTRIB:
  @11aeec070 02 LISTSXP g0c0 [] 
    TAG: @10080bbc8 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000] "levels" (has value)
    @1247c19b8 16 STRSXP g0c3 [NAM(1)] (len=3, tl=0)
      @1009b6fa8 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "a"
      @10085d038 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "b"
      @10081d068 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "c"
    TAG: @10080bf48 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000] "class" (has value)
    @13773ab88 16 STRSXP g0c1 [NAM(1)] (len=1, tl=0)
      @1008f4328 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "factor"


krlmlr commented Mar 14, 2014

But the behavior of a_ply (i.e., adply without rbind.fill) seems to be linear: http://rpubs.com/krlmlr/huge_factors_force_discard . I don't know how to interpret the output of inspect -- I do see that @1247b8848 <> @1247c19b8, but somehow splitter_a seems to keep the splitting time linear here.


krlmlr commented Mar 17, 2014

Also, the behavior has changed, at least in R-devel: http://r.789695.n4.nabble.com/Deep-copy-of-factor-levels-tt4686956.html . @crowding: Would you be willing to take another look?

crowding (Contributor) commented

Interesting. Presuming that pointer comparison works, the other thing that's probably contributing is the calls to match -- matching a single element to an N-level factor is O(N). Possibly one could compile the unified factor levels into a heap and write a replacement in C that would bring it down to O(log N). Or just have a special case for the input being identical to the output.
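
A sketch of that special case (names are illustrative; this assumes the per-piece level sets have already been verified to be identical):

# If every piece uses the same level set, the integer codes can be concatenated
# directly and no per-piece match() against the combined levels is needed.
combine_same_levels <- function(factors) {
  lev <- levels(factors[[1]])
  codes <- unlist(lapply(factors, as.integer), use.names = FALSE)
  structure(codes, levels = lev, class = "factor")
}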


krlmlr commented Mar 17, 2014

I'd prefer starting with a special case for the most common usage in ddply or adply. How would you do the pointer comparison in R?

I think that matching a vector to another vector is O(N+M); in that case we wouldn't have to reinvent the wheel. Presumably, this would mean creating the factor only after the data is complete. Would that be possible without mixing up too much code?
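
A sketch of that idea, assuming rbind.fill collected the raw values first and built the combined factor only at the end (the function name is made up):

# One factor() call at the end: a single match of all values against the
# combined level set, instead of one match per piece.
combine_then_factor <- function(factors) {
  values <- unlist(lapply(factors, as.character), use.names = FALSE)
  lev <- unique(unlist(lapply(factors, levels), use.names = FALSE))
  factor(values, levels = lev)
}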

It looks like there's already a faster version of rbind.fill in the dplyr package, and eventually plyr should be able to use that.


hadley commented Mar 30, 2015

Closing this since it should be better in dplyr.

hadley closed this as completed Mar 30, 2015