-
Notifications
You must be signed in to change notification settings - Fork 0
/
benchmarking.Rmd
334 lines (295 loc) · 13.1 KB
/
benchmarking.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
---
title: "Benchmarking gigs against other software packages"
description: >
A comparison of gigs' performance against other growth analysis packages.
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
dpi = 120, message = FALSE, warning = FALSE,
dev.args = list(png = list(type = "cairo"))
)
# Set ggplot theming for consistent plots
ggplot2::theme_set(ggplot2::theme_bw())
```
```{r srr-tags, eval = FALSE, echo = FALSE}
#' @srrstats {1.5, 1.6} This article contains code for both performance and
#' accuracy claims when comparing this package with other software packages.
```
```{r load_timings_rda, echo = FALSE}
load(file = file.path("benchmarking.rda"))
```
# Rationale
This short article describes the performance of gigs relative to a
non-exhaustive group of R and non-R packages designed to implement child growth
standards.
| Package name | Language | On CRAN? |
|------------------------------------------------------------------------------|:---------|:---------|
| [anthro](https://cran.r-project.org/web/packages/anthro/index.html) | R | Yes |
| [childsds](https://cran.r-project.org/web/packages/childsds/index.html) | R | Yes |
| [gigs](https://www.github.com/lshtm-gigs/gigs/) | R | No |
| [ki-tools/growthstandards](https://www.github.com/ki-tools/growthstandards/) | R | No |
| [gigs](https://www.github.com/lshtm-gigs/gigs-stata/) | Stata | No |
| [zanthro](https://journals.sagepub.com/doi/epdf/10.1177/1536867X1301300211) | Stata | No |
Thus far, there is no comprehensive benchmark comparing these different
packages. This short article will compare the speed of each package from 1 to
100000 inputs, checking how fast each package can convert weight values to
z-scores in the WHO Child Growth standards.
We have performed these benchmarks on a Windows 10 system running a Ryzen 7
3700X processor and 16GB of DDR4 RAM, using R version 4.3.2. The Stata
benchmarks have been run on the same system in Stata 18.0 (revision 15 May
2023), using the [benchmark](https://github.com/mcaceresb/stata-benchmark) package for Stata.
## Set up benchmark dataset
The benchmarking and package comparisons will all use the same 100,000-row
dataset comprised of data which can be used to convert between z-scores
and centiles in the WHO weight-for-age standard. The z-scores, `x` variable
and sexes for each row are generated randomly with a pre-specified seed. As gigs
is validated against the published growth curve data from the WHO, we use it to
generate weight values in kg for each observation. This dataset can then be used
to compare speed and accuracy in conversion between values and z-scores/centiles
for the packages mentioned above.
```{r who_gs_dataset, echo = TRUE}
who_gs_dataset <- function(n, acronym = "wfa", seed = 154237890) {
withr::with_seed(
seed,
code = {
# Random z-scores around 0
z <- rnorm(n = n)
# X variables are non-discrete but within bounds of the desired acronym
xvars <- gigs::who_gs_coeffs[[acronym]][[1]][, 1]
x <- sample(xvars, size = n, replace = TRUE)
x_jitter <- c(runif(n = 5, min = -1, max = 1), 0)
x <- x + sample(x_jitter, size = n, replace = TRUE)
x <- pmax(pmin(x, max(xvars)), min(xvars))
# Sexes randomly sampled from male and female
sex <- sample(c("M", "F"), size = n, replace = TRUE)
}
)
acronym <- rep_len(acronym, n)
out <- data.frame(z = z, x = x, sex = sex, acronym = acronym) |>
dplyr::mutate(
y = gigs::who_gs_zscore2value(z = z, x = x, sex = sex, acronym = acronym)
)
purrr::set_names(out, c("z", "x", "sex", "acronym", "y"))
}
# Generate 100,000-row dataset
bench_dataset <- who_gs_dataset(n = 100000)
```
The first 10 rows of this dataset look like this:
```{r show_bench_dataset1, eval = FALSE}
bench_dataset[1:10, ]
```
```{r show_bench_dataset2, echo = FALSE}
knitr::kable(bench_dataset[1:10, ], align = "ccccc")
```
# Benchmark code
The `mbench_pkg()` function is used to benchmark each package over a range of
input sizes. Each call to it produces a tabular output containing the
median time required for `pkg_expr` to operate on the
data, ranging from 1 to 100,000 inputs.
```{r iteration}
mbench_pkg <- function(pkg_expr, pkg_name) {
n_inputs <- c(1, 10, 100, 500, 1000, 5000, 10000, 25000, 50000, 75000, 100000)
purrr::map_dfr(.x = n_inputs,
.f = \(no_of_inputs) {
dataset <- bench_dataset[seq_len(no_of_inputs), ]
mbench <- microbenchmark::microbenchmark(
dplyr::mutate(dataset, test = eval(pkg_expr)),
times = 25
)
c(n_inputs = no_of_inputs,
median_time = summary(mbench)$median,
time_units = attr(summary(mbench), which = "unit"))
}) |>
dplyr::mutate(n_inputs = as.integer(n_inputs),
median_time = as.numeric(median_time),
package = pkg_name)
}
```
## `anthro`
```{r bench_anthro, eval = FALSE}
anthro_timings <- mbench_pkg(
pkg_expr = quote(anthro::anthro_zscores(sex = sex,
age = x,
weight = y)$zwei),
pkg_name = paste("R: anthro", packageVersion(pkg = "anthro"))
)
```
## `childsds`
```{r bench_childsds, eval = FALSE}
childsds_timings <- mbench_pkg(
pkg_expr = quote(childsds::sds(value = y,
age = x / 365.25,
sex = sex,
male = "M",
female = "F",
item = "weight",
ref = childsds::who.ref)),
pkg_name = paste("R: childsds", packageVersion(pkg = "childsds"))
)
```
## `growthstandards`
```{r bench_growthstandards, eval = FALSE}
growthstandards_timings <- mbench_pkg(
pkg_expr = quote(
growthstandards::who_value2zscore(y = y, x = x,
sex = ifelse(sex == "M", "Male",
"Female"),
y_var = "wtkg")
),
pkg_name = paste("R: growthstandards",
packageVersion(pkg = "growthstandards"))
)
```
## `gigs`
```{r bench_gigs, eval = FALSE}
gigs_timings <- mbench_pkg(
pkg_expr = quote(gigs::who_gs_value2zscore(y = y, x = x, sex = sex,
acronym = acronym)),
pkg_name = paste("R: gigs", packageVersion(pkg = "gigs"))
)
```
## Stata
```{r, eval = FALSE}
# Save .dta file equivalent of benchmarking table. This can be used to benchmark
# Stata packages.
haven::write_dta(data = bench_dataset,
path = file.path("exclude", "statabench", "bench_dataset.dta"))
```
In Stata, the commands are run inside a do-file which utilises the
[benchmark](https://www.github.com/mcaceresb/stata-benchmark) package for Stata.
This script essentially does the same as `mbench_pkg()`, but for the
[Stata gigs package](https://www.github.com/lshtm-gigs/gigs-stata/) and the
zanthro package.
```{stata stata_bench_code, eval = FALSE}
// This is Stata code
foreach i in 1 10 100 500 1000 5000 10000 25000 50000 75000 100000 {
use "benchmarking/bench_dataset.dta", clear
qui drop if _n > `i'
di "Number of inputs: `i'"
bench, reps(25) restore last: ///
qui egen double z_gigs = who_gs(y, "wfa", "v2z"), ///
xvar(x) sex(sex) sexcode(m=M, f=F)
}
foreach i in 1 10 100 500 1000 5000 10000 25000 50000 75000 100000 {
use "benchmarking/bench_dataset.dta", clear
qui drop if _n > `i'
di "Number of inputs: `i'"
bench, reps(25) restore last: ///
qui egen z_anthro = zanthro(y, wa, WHO), xvar(x) gender(sex) ///
gencode(male=M, female=F) ageunit(day)
}
```
```{r bind_stata_timings, echo = FALSE, eval = FALSE}
possible_n <- c(1, 10, 100, 500, 1000, 5000, 10000, seq(25000, 100000, 25000))
gigs_0_4_0 <- c(0.009, 0.008, 0.009, 0.010, 0.012, 0.028, 0.047, 0.104, 0.198, 0.296, 0.405)
zbmicat_1_0_2 <- c(0.007, 0.008, 0.009, 0.017, 0.027, 0.105, 0.203, 0.479,
0.951, 1.4450, 2.046)
lens <- lengths(list(gigs_0_4_0, zbmicat_1_0_2))
n_inputs <- unlist(lapply(X = lens, FUN = \(x) possible_n[seq_len(x)]))
stata_timings <- data.frame(n_inputs = n_inputs,
median_time = c(gigs_0_4_0, zbmicat_1_0_2),
time_units = "seconds",
package = c(rep("Stata: gigs 0.4.0", lens[1]),
rep("Stata: zanthro 1.0.2", lens[2])))
rm(gigs_0_4_0, zbmicat_1_0_2, lens, n_inputs, possible_n)
```
The outputs from this script give a table of timings that look like this:
```{r show_stata}
stata_timings
```
## Plotting
Only by plotting the various timings tables can we see the trends for each
package:
```{r plot_timings, echo = FALSE}
dplyr::bind_rows(anthro_timings,
childsds_timings,
growthstandards_timings,
gigs_timings,
stata_timings) |>
dplyr::mutate(median_ms = dplyr::case_when(
time_units == "nanoseconds" ~ median_time / 1000000,
time_units == "microseconds" ~ median_time / 1000,
time_units == "milliseconds" ~ median_time,
time_units == "seconds" ~ median_time * 1000.,
)) |>
ggplot2::ggplot(ggplot2::aes(x = n_inputs,
y = median_ms,
colour = package)) +
ggplot2::geom_line() +
ggplot2::geom_point(colour = "black") +
ggplot2::labs(x = "Number of inputs",
y = "Median time taken (milliseconds)",
colour = "Package") +
ggsci::scale_colour_lancet()
```
On the whole, `anthro` is by far the slowest R package, taking around
`r round(anthro_timings$median_time[11], digits = 2)` seconds to process 100,000
inputs. This is in part because `anthro` computes results in every WHO Child
Growth standard each time `anthro::anthro_zscores()` is called, but also due to
a slower implementation of the WHO LMS process conversion than the other
packages.
Next slowest is the Stata package `zanthro`, which takes around
`r round(stata_timings$median_time[22], digits = 2)` seconds to compute results
in a single WHO standard. About 4 times faster than `zanthro` is `gigs` for
Stata 0.4.0, which scales more efficiently than `zanthro` and takes around
`r round(stata_timings$median_time[11], digits = 2)` seconds to convert 100,000
measurements to z-scores.
Leading the pack are three R implementations: `growthstandards`, `gigs`, and
`childsds`. The `growthstandards` package was the fastest at
~`r round(growthstandards_timings$median_time[11])` ms for 100,000 inputs,
followed by `gigs` (~`r round(gigs_timings$median_time[11])` ms) and `childsds`
(~`r round(childsds_timings$median_time[11])` ms).
# Package output similarity
The packages also differ slightly in how they convert values to
centiles/z-scores, which can affect the z-scores computed by each package even
when inputs are the same. In our testing of the WHO standards, we found that the
tested packages mostly agreed with each other, but that `childsds` did not
correctly perform z-scoring in the WHO standards.
This is because the WHO Child Growth standards constrain z-scores in the outer
tails (i.e. past 3 z-scores), as there were fewer data available at these
extreme values. More information on this constraining procedure can be found in
the reports referenced in the `gigs::who_gs_value2zscore()` documentation. As
`childsds` does not perform this constraining procedure, extreme values are
computed incorrectly:
```{r discrepancies, eval = FALSE}
discrepancies <- data.frame(z = c(-3.03, -2.97, 2.97, 3.03),
age_days = 0,
sex = "M") |>
dplyr::mutate(
weight_kg = gigs::who_gs_wfa_zscore2value(z, age_days, sex),
# GIGS z-score
z_gigs = gigs::who_gs_wfa_value2zscore(weight_kg, age_days, sex),
# growthstandards z-score
z_growthstandards = growthstandards::who_wtkg2zscore(
age_days, weight_kg, "Male"
),
# childsds z-score
z_childsds = childsds::sds(
value = weight_kg, age = age_days / 365.25,
sex = sex, male = "M", female = "F",
item = "weight", ref = childsds::who.ref
)
)
```
When we look at these z-scores, you can see that both `growthstandards` and
`gigs` correctly apply the constraining procedure; `childsds` does not. From
looking at the `anthro` source code, they also apply the constraining procedure.
```{r discrepancies_kable, echo = FALSE}
knitr::kable(discrepancies, align = "ccccc")
```
# Session information
```{r session_info1, echo = TRUE, eval = FALSE}
sessionInfo()
```
```{r session_info2, echo = FALSE, eval = FALSE}
session_info <- sessionInfo()
```
```{r session_info3, echo = FALSE, eval = TRUE}
session_info
```
[//]: # (Code to save/load all timings tables in a separate .rda file)
```{r save_benchmarking_rda, eval = FALSE, echo = FALSE}
save(file = file.path("vignettes/articles/benchmarking.rda"),
anthro_timings, childsds_timings, growthstandards_timings, gigs_timings,
stata_timings, discrepancies, session_info)
```