
rbind.fill still quadratic with huge factors (>1000 levels) #206

Closed
krlmlr opened this issue Mar 13, 2014 · 8 comments

krlmlr commented Mar 13, 2014

Binding data frames with columns that are factors with many levels seems to take quadratic run time. See http://rpubs.com/krlmlr/14321 (and http://rpubs.com/krlmlr/huge_factors for an earlier variant).

@crowding: Any idea what might cause this?
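
For reference, a minimal sketch (not part of the original report; the sizes are arbitrary) of the kind of input that triggers the slowdown:

library(plyr)

# Illustrative setup: N one-row data frames whose factor column still carries
# all N levels, as produced by slicing a large factor.
n <- 2000
f <- factor(seq_len(n))                                          # n levels
pieces <- lapply(seq_len(n), function(i) data.frame(x = f[i]))   # each slice keeps all n levels
system.time(rbind.fill(pieces))    # run time grows roughly quadratically in n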

crowding (Contributor) commented

At a guess, it's in the unification of factor levels.

When you take a 1-element slice of a factor with N elements and N levels, you get a factor with 1 element and N levels. When it gets to rbind.fill, it's trying to recombine N 1-element factors which each have N levels, and it does this by taking the union of all N^2 levels.

To skip that step there'd have to be some way for rbind.fill to notice or be told that all the factor levels are the same.
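
A quick illustration of that point (not from the thread):

f <- factor(letters)   # 26 elements, 26 levels
nlevels(f[1])          # still 26: a one-element slice keeps every level
# With N such slices, unifying the level sets touches on the order of N * N level strings.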


krlmlr commented Mar 13, 2014

Autodetection would need to happen in constant time (per row), of course. I think identical can do that: http://rpubs.com/krlmlr/identical . I'm not sure how to tell rbind.fill that all factor levels are the same, so autodetection is perhaps safer.

Is there a way to determine the address of the data for an R object (without diving into C/C++)? Address equality implies equality, and by finding duplicate addresses first (and ignoring the corresponding objects) we avoid the NxN runtime. A perhaps simpler variant would be to call identical on each pair of consecutive elements (like diff).
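
A rough sketch of that diff-like check, assuming the factor columns are available as a list (the helper name is made up):

# Hypothetical helper: TRUE if consecutive pieces share identical level sets,
# using N - 1 calls to identical() instead of an N x N comparison.
consecutive_levels_identical <- function(factors) {
  if (length(factors) < 2) return(TRUE)
  all(vapply(seq_len(length(factors) - 1),
             function(i) identical(levels(factors[[i]]),
                                   levels(factors[[i + 1]])),
             logical(1)))
}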

On the other hand, my earlier variant combines N 1-element factors with 1 level each, and performance seems to degrade, too. Could this be a different issue?

crowding (Contributor) commented

Unfortunately it looks like factor levels are deep-copied, so identical won't work (and it's O(N^2) before it gets to rbind.fill anyway):

> x <- factor(c("a", "b", "c"))
> x1 <- x[1]
> x2 <- x[2]
> .Internal(inspect(x1))
@13773b598 13 INTSXP g0c1 [OBJ,NAM(2),ATT] (len=1, tl=0) 1
ATTRIB:
  @11af23c98 02 LISTSXP g0c0 [] 
    TAG: @10080bbc8 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000] "levels" (has value)
    @1247b8848 16 STRSXP g0c3 [NAM(1)] (len=3, tl=0)
      @1009b6fa8 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "a"
      @10085d038 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "b"
      @10081d068 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "c"
    TAG: @10080bf48 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000] "class" (has value)
    @13773b6e8 16 STRSXP g0c1 [NAM(1)] (len=1, tl=0)
      @1008f4328 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "factor"
> .Internal(inspect(x2))
@13773aaf8 13 INTSXP g0c1 [OBJ,NAM(2),ATT] (len=1, tl=0) 2
ATTRIB:
  @11aeec070 02 LISTSXP g0c0 [] 
    TAG: @10080bbc8 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000] "levels" (has value)
    @1247c19b8 16 STRSXP g0c3 [NAM(1)] (len=3, tl=0)
      @1009b6fa8 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "a"
      @10085d038 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "b"
      @10081d068 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "c"
    TAG: @10080bf48 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000] "class" (has value)
    @13773ab88 16 STRSXP g0c1 [NAM(1)] (len=1, tl=0)
      @1008f4328 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "factor"


krlmlr commented Mar 14, 2014

But the behavior of a_ply (i.e., adply without rbind.fill) seems to be linear: http://rpubs.com/krlmlr/huge_factors_force_discard . I don't know how to interpret the output of inspect -- I do see that @1247b8848 <> @1247c19b8, but somehow splitter_a seems to keep the splitting time linear here.


krlmlr commented Mar 17, 2014

Also, the behavior has changed, at least in R-devel: http://r.789695.n4.nabble.com/Deep-copy-of-factor-levels-tt4686956.html . @crowding: Would you be willing to take another look?

crowding (Contributor) commented

Interesting. Presuming that pointer comparison works, the other thing that's probably contributing is the calls to match -- matching a single element to an N-level factor is O(N). Possibly one could compile the unified factor levels into a heap and write a replacement in C that would bring it down to O(log N). Or just have a special case for the input being identical to the output.
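
A sketch of that special case (names are illustrative; this assumes the per-piece level sets have already been verified to be identical):

# If every piece uses the same level set, the integer codes can be concatenated
# directly and no per-piece match() against the combined levels is needed.
combine_same_levels <- function(factors) {
  lev <- levels(factors[[1]])
  codes <- unlist(lapply(factors, as.integer), use.names = FALSE)
  structure(codes, levels = lev, class = "factor")
}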


krlmlr commented Mar 17, 2014

I'd prefer starting with a special case for the most common usage in ddply or adply. How would you do the pointer comparison in R?

I think that matching a vector to another vector is O(N+M); in that case we wouldn't have to reinvent the wheel. Presumably, this would mean creating the factor only after the data is complete. Would that be possible without mixing up too much code?
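
A sketch of that idea, assuming rbind.fill collected the raw values first and built the combined factor only at the end (the function name is made up):

# One factor() call at the end: a single match of all values against the
# combined level set, instead of one match per piece.
combine_then_factor <- function(factors) {
  values <- unlist(lapply(factors, as.character), use.names = FALSE)
  lev <- unique(unlist(lapply(factors, levels), use.names = FALSE))
  factor(values, levels = lev)
}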

It looks like there's already a faster version of rbind.fill in the dplyr package, and eventually plyr should be able to use that.


hadley commented Mar 30, 2015

Closing this since it should be better in dplyr.

hadley closed this as completed Mar 30, 2015