df module data.frame assembly is too slow #59

mschubert · 2015-09-08T11:03:30Z

With hpc_2.0, running more than 1M function calls with very short runtimes and collecting their results is no problem: my tests finished in under 2 hours.

However, the subsequent data.frame assembly takes over 24 hours. This needs to be quicker to be really useful.


mschubert · 2015-09-08T18:25:30Z

The critical part is in data_frame/call:

Rprof()
    names(result) = 1:length(result)
    index$rep = add_rep

    if (!result_only) {
        rownames(index) = as.character(1:nrow(index))
        result = lapply(names(result), function(i) {
            if (is.null(names(result[[1]])))
                c(as.list(index[i,,drop=FALSE]), result=as.list(result[[i]]))
            else
                c(as.list(index[i,,drop=FALSE]), as.list(result[[i]]))
        })
    }
    if (tidy)
        result = dplyr::rbind_all(lapply(result, as.data.frame))
Rprof(NULL)

With the following profiling results

> summaryRprof()
$by.self
                           self.time self.pct total.time total.pct
".Call"                       372.52    46.50     376.00     46.94
"pmatch"                      223.60    27.91     225.74     28.18
"as.list"                      90.30    11.27     327.62     40.90
"match"                        14.82     1.85      39.98      4.99
"deparse"                      12.38     1.55      51.58      6.44
"data.frame"                    7.96     0.99      93.26     11.64

$by.total
                           total.time total.pct self.time self.pct
"<Anonymous>"                  801.06    100.00      1.96     0.24
"lapply"                       424.96     53.05      0.88     0.11
"FUN"                          424.68     53.01      1.50     0.19
".Call"                        376.00     46.94    372.52    46.50
"as.list"                      327.62     40.90     90.30    11.27
"["                            235.76     29.43      0.42     0.05
"[.data.frame"                 235.34     29.38      4.58     0.57
"pmatch"                       225.74     28.18    223.60    27.91
"as.data.frame.list"            97.26     12.14      0.36     0.04

800 s runtime with 145,000 rows (4 columns) total.

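The bottlenecks seem to be as.list and dplyr's .Call in rbind_all. Most of the as.list total time appears to come from the row-wise index[i,,drop=FALSE] subsetting evaluated inside it, where pmatch accounts for 28% of the runtime.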
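The profile attributes nearly all of the [.data.frame time to pmatch, which is what base R uses to resolve character row names, so every index[i,,drop=FALSE] lookup scans the row names again. A minimal sketch of one way to avoid the per-row lookup, assuming every result is a named list of scalars with the same fields (the unnamed case from the snippet above is not covered). The index and result objects here are hypothetical stand-ins, and this is not necessarily what the module ended up doing:

```r
# Hypothetical stand-ins for the objects used in data_frame/call
index  = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
result = list(list(value = 10), list(value = 20), list(value = 30))

# Slow pattern: one character-indexed row lookup and one data.frame per result
rownames(index) = as.character(1:nrow(index))
slow = do.call(rbind, lapply(rownames(index), function(i)
    as.data.frame(c(as.list(index[i,,drop=FALSE]), result[[as.integer(i)]]),
                  stringsAsFactors = FALSE)))

# Alternative: build each result column once, then bind all columns in one step
fields = names(result[[1]])
cols = lapply(setNames(fields, fields), function(f)
    unlist(lapply(result, `[[`, f), use.names = FALSE))
fast = cbind(index, as.data.frame(cols, stringsAsFactors = FALSE))
```

Both versions hold the same columns, but the second never subsets index by row name and creates a single data.frame instead of one per call.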

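With 14,500 rows, .Call takes < 15% of the runtime (< 2 s), versus 372.52 s with 145,000 rows. A 10-fold increase in rows makes it more than 180 times slower, so this does not seem to be O(n) at all.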
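As a rough check of that scaling, the two measurements can be turned into an empirical exponent k for t ~ n^k. This is only a two-point estimate, and it uses 2 s as an upper bound for the 14,500-row timing, so the true exponent would be at least as large. The helper name scaling_exponent is made up for this sketch:

```r
# Empirical scaling exponent k in t ~ n^k from two (n, t) measurements
scaling_exponent = function(n1, t1, n2, t2) log(t2 / t1) / log(n2 / n1)

# .Call self time: < 2 s at 14,500 rows (upper bound used), 372.52 s at 145,000 rows
k = scaling_exponent(14500, 2, 145000, 372.52)
print(round(k, 2))
```

An exponent above 2 would mean the rbind_all step grows faster than quadratically over this range, which fits the observation that it is negligible for small inputs and dominant for large ones.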

mschubert · 2015-09-09T13:10:33Z

ref: tidyverse/dplyr#1396

mschubert · 2015-10-15T09:25:42Z

hpc-internal issues solved in db81467, rest dependent on upstream

mschubert added the enhancement label Sep 8, 2015

mschubert self-assigned this Sep 8, 2015

mschubert added the bug and in progress labels Sep 9, 2015

mschubert closed this as completed Oct 15, 2015

mschubert removed the in progress label Oct 15, 2015
