---
title: "Big IPUMS Data"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Big IPUMS Data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
Browsing for IPUMS data can be a little like grocery shopping when
you're hungry---you show up to grab a couple things, but everything
looks so good that you end up with an overflowing cart.[^1]
Unfortunately, this can lead to extracts so large that they don't fit in
your computer's memory.
[^1]: Bonus joke: why is the IPUMS website better than any grocery
store? More free samples.
If you've got an extract that's too big, both the IPUMS website and the
ipumsr package have tools to help. There are four basic strategies:
1. Get more memory
2. Reduce the size of your extract
3. Read data in "chunks" or "yields"
4. Store data in a database
ipumsr can't do much for you when it comes to option 1, but it can help
facilitate some of the other options.
## Setup
The examples in this vignette will rely on a few helpful packages.
If you haven't already installed them, you can do so with:
```{r, eval=FALSE}
# To run the full vignette, you'll also need the following packages. If they
# aren't installed already, do so with:
install.packages("biglm")
install.packages("DBI")
install.packages("RSQLite")
install.packages("dbplyr")
```
```{r, include=FALSE}
installed_biglm <- requireNamespace("biglm")
installed_db_pkgs <- requireNamespace("DBI") &
requireNamespace("RSQLite") &
requireNamespace("dbplyr")
```
```{r, echo=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r, eval=TRUE, message=FALSE}
library(ipumsr)
library(dplyr)
```
<!-- TODO: These data could be updated: certain weighting variable codes -->
<!-- have changed -->
## Option 1: Trade money for convenience
If you need to work with a dataset that's too big for your RAM, the
simplest option is to get more space. If upgrading your hardware isn't
an option, paying for a cloud service like Amazon or Microsoft Azure may
be worth considering. Here are guides for using R on
[Amazon](https://aws.amazon.com/blogs/opensource/getting-started-with-r-on-amazon-web-services/)
and [Microsoft
Azure](https://www.jumpingrivers.com/blog/hosting-rstudio-server-on-azure/).
Of course, this option isn't feasible for most users. In that case, you'll
likely need to change the data used in the analysis or the processing
pipeline itself.
## Option 2: Reduce extract size
### Remove unused data
The easiest way to reduce the size of your extract is to drop unused
samples and variables. This can be done through the extract interface
for the specific IPUMS project you're using or within R using the IPUMS
API (for projects that are supported).
If using the API, simply update your extract definition code to exclude
the specifications that you no longer need. Then, resubmit the extract request
and download the new files.
See the [introduction to the IPUMS API](ipums-api.html) for more information
about making extract requests from ipumsr.
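Using the API, the revision workflow can be sketched as follows. This is an illustrative example, not a recipe tied to a particular project: the sample ID and variable names below are placeholders for your own specifications, and an IPUMS API key must already be registered (e.g. with `set_ipums_api_key()`).

```r
library(ipumsr)

# Redefine the extract with only the samples and variables you still need.
# The collection, sample, and variable names here are illustrative.
smaller_extract <- define_extract_micro(
  "cps",
  description = "Trimmed extract with fewer samples and variables",
  samples = "cps2019_03s",
  variables = c("AGE", "SEX", "EMPSTAT")
)

# Submit the revised request, wait for it to complete, and download the files
submitted <- submit_extract(smaller_extract)
completed <- wait_for_extract(submitted)
ddi_path <- download_extract(completed)
```

The downloaded DDI file can then be passed to `read_ipums_micro()` as usual.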
### Select cases {#select-cases}
For microdata projects, another good option for reducing extract size is
to select only those cases that are relevant to your research question,
producing an extract containing only data for a particular subset of values
for a given variable.
If you're using the IPUMS API, you can use `var_spec()` to specify case
selections for a variable in an extract definition. For instance, the following
would produce an extract only including records for married women:
```{r}
define_extract_micro(
"usa",
description = "2013 ACS Data for Married Women",
samples = "us2013a",
variables = list(
var_spec("MARST", case_selections = "1"),
var_spec("SEX", case_selections = "2")
)
)
```
If you're using the online interface, the **Select Cases** option will be
available on the last page before submitting an extract request.
### Use a sampled subset of the data
Yet another option (also only for microdata projects) is to take a
random subsample of the data before producing your extract.
Sampled data are not available via the IPUMS API, but you can take a subsample
using the **Customize Sample Size** option in the online interface. This also
appears on the final page before submitting an extract request.
If you've already submitted the extract, you can click the **REVISE** link
on the *Download or Revise Extracts* page to access these features and
produce a new data extract.
## Option 3: Process the data in pieces
ipumsr provides two related options for reading data sources in
increments:
- *Chunked* functions allow you to specify a function that will be
called on each chunk of data as it is read in as well as how you
would like the chunks to be combined at the end. These functions use
the readr
[framework](https://readr.tidyverse.org/reference/read_delim_chunked.html)
for reading chunked data.
- *Yielded* functions allow more flexibility by returning control to
the user between the loading of each piece of data. These functions
are unique to ipumsr and fixed-width data.
### Reading chunked data
Use `read_ipums_micro_chunked()` and `read_ipums_micro_list_chunked()`
to read data in chunks. These are analogous to the standard
`read_ipums_micro()` and `read_ipums_micro_list()` functions, but allow
you to specify a function that will be applied to each data chunk and
control how the results from these chunks are combined.
Below, we'll use chunking to outline solutions to two common use cases
for IPUMS data: tabulation and regression.
First, we'll load our example data. *Note that we have down-sampled the
data in this example for storage reasons; none of the output "results"
reflected in this vignette should be considered legitimate!*
```{r}
cps_ddi_file <- ipums_example("cps_00097.xml")
```
#### Chunked tabulation {#chunked-tab}
Imagine we wanted to find the percent of people in the workforce grouped
by their self-reported health. Since our example extract is small enough
to fit in memory, we could load the full dataset with
`read_ipums_micro()`, use `lbl_relabel()` to relabel the `EMPSTAT` variable
into a binary variable, and count the people in each group.
```{r}
read_ipums_micro(cps_ddi_file, verbose = FALSE) %>%
mutate(
HEALTH = as_factor(HEALTH),
AT_WORK = as_factor(
lbl_relabel(
EMPSTAT,
lbl(1, "Yes") ~ .lbl == "At work",
lbl(0, "No") ~ .lbl != "At work"
)
)
) %>%
group_by(HEALTH, AT_WORK) %>%
summarize(n = n(), .groups = "drop")
```
For the sake of this example, let's imagine we can only store 1,000 rows
in memory at a time. In this case, we need to use a `chunked` function,
tabulate for each chunk, and then calculate the counts across all of the
chunks.
The `chunked` functions will apply a user-defined callback function to
each chunk. The callback takes two arguments: `x`, which represents the
data contained in a given chunk, and `pos`, which represents the
position of the chunk, expressed as the line in the input file at which
the chunk starts. Generally you will only need to use `x`, but the
callback must always take both arguments.
In this case, the callback will implement the same processing steps that
we demonstrated above:
```{r}
cb_function <- function(x, pos) {
x %>%
mutate(
HEALTH = as_factor(HEALTH),
AT_WORK = as_factor(
lbl_relabel(
EMPSTAT,
lbl(1, "Yes") ~ .lbl == "At work",
lbl(0, "No") ~ .lbl != "At work"
)
)
) %>%
group_by(HEALTH, AT_WORK) %>%
summarize(n = n(), .groups = "drop")
}
```
Next, we need to create a callback object, which will determine how we
want to combine the ultimate results for each chunk. ipumsr provides
three main types of callback objects that preserve variable metadata:
- `IpumsDataFrameCallback` combines the results from each chunk by
row-binding them into a single data frame
- `IpumsListCallback` returns a list with one item per chunk
containing the results for that chunk. Use this when you don't want
to (or can't) immediately combine the results.
- `IpumsSideEffectCallback` does not return any results. Use this when
your callback function is intended only for its side effects (for
instance, if you are saving the results for each chunk to disk).
(ipumsr also provides a fourth callback for running linear regression
models, discussed [below](#chunked-reg).)
In this case, we want to row-bind the data frames returned by
`cb_function()`, so we use `IpumsDataFrameCallback`.
Callback objects are `{R6}`
objects, but you don't need to be familiar with R6 to use them.[^2] To
initialize a callback object, simply use `$new()`:
[^2]: If you're interested in learning more about R6, check out Hadley
Wickham's [Advanced R](https://adv-r.hadley.nz/r6.html?q=R6#r6)
book, which is available for free online.
```{r}
cb <- IpumsDataFrameCallback$new(cb_function)
```
At this point, we're ready to load the data in chunks. We use
`read_ipums_micro_chunked()` to specify the callback and chunk size:
```{r}
chunked_tabulations <- read_ipums_micro_chunked(
cps_ddi_file,
callback = cb,
chunk_size = 1000,
verbose = FALSE
)
chunked_tabulations
```
Now we have a data frame with the counts by health and work status
within each chunk. To get the full table, we just need to sum by health
and work status one more time:
```{r}
chunked_tabulations %>%
group_by(HEALTH, AT_WORK) %>%
summarize(n = sum(n), .groups = "drop")
```
#### Chunked regression {#chunked-reg}
With the biglm package, it is possible to use R to perform a regression
on data that is too large to store in memory all at once. The ipumsr
package provides another callback designed to make this simple:
`IpumsBiglmCallback`.
In this example, we'll conduct a regression with total hours worked
(`AHRSWORKT`) as the outcome and age (`AGE`) and self-reported health
(`HEALTH`) as predictors. (Note that this is intended as a code
demonstration, so we ignore many complexities that should be addressed
in real analyses.)
If we were running the analysis on our full dataset, we'd first load our
data and prepare the variables in our analysis for use in the model:
```{r}
data <- read_ipums_micro(cps_ddi_file, verbose = FALSE) %>%
mutate(
HEALTH = as_factor(HEALTH),
AHRSWORKT = lbl_na_if(AHRSWORKT, ~ .lbl == "NIU (Not in universe)"),
AT_WORK = as_factor(
lbl_relabel(
EMPSTAT,
lbl(1, "Yes") ~ .lbl == "At work",
lbl(0, "No") ~ .lbl != "At work"
)
)
) %>%
filter(AT_WORK == "Yes")
```
Then, we'd provide our model formula and data to `lm`:
```{r}
model <- lm(AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, data = data)
summary(model)
```
To do the same regression, but with only 1,000 rows loaded at a time, we
work in a similar manner.
First we make an `IpumsBiglmCallback` callback object. We provide the
model formula as well as the code used to process the data before
running the regression:
```{r, eval=installed_biglm}
library(biglm)
```
```{r, eval=installed_biglm}
biglm_cb <- IpumsBiglmCallback$new(
model = AHRSWORKT ~ AGE + I(AGE^2) + HEALTH,
prep = function(x, pos) {
x %>%
mutate(
HEALTH = as_factor(HEALTH),
AHRSWORKT = lbl_na_if(AHRSWORKT, ~ .lbl == "NIU (Not in universe)"),
AT_WORK = as_factor(
lbl_relabel(
EMPSTAT,
lbl(1, "Yes") ~ .lbl == "At work",
lbl(0, "No") ~ .lbl != "At work"
)
)
) %>%
filter(AT_WORK == "Yes")
}
)
```
And then we read the data using `read_ipums_micro_chunked()`, passing
the callback that we just made.
```{r, eval=installed_biglm}
chunked_model <- read_ipums_micro_chunked(
cps_ddi_file,
callback = biglm_cb,
chunk_size = 1000,
verbose = FALSE
)
summary(chunked_model)
```
### Reading yielded data
In addition to chunked reading, ipumsr also provides the similar but
more flexible "yielded" reading.
`read_ipums_micro_yield()` and `read_ipums_micro_list_yield()` grant you
more freedom in determining what R code to run between chunks and
include the ability to have multiple files open at once. Additionally,
yields are compatible with the `bigglm` function from biglm, which
allows you to run glm models on data larger than memory.
The downside to this greater control is that yields have an API that is
unique to IPUMS data and the way they work is unusual for R code.
#### Yielded tabulation
We'll compare the `yield` and `chunked` functions by conducting the same
[tabulation example](#chunked-tab) from above using yields.
First, we create the yield object with the function
`read_ipums_micro_yield()`:
```{r}
data <- read_ipums_micro_yield(cps_ddi_file, verbose = FALSE)
```
This function returns an `R6` object which contains methods for reading
the data. The most important method is the `yield()` method which will
return `n` rows of data:
```{r}
# Return the first 10 rows of data
data$yield(10)
```
Note that the row position in the data is stored in the object, so
running the same code again will produce *different* rows of data:
```{r}
# Return the next 10 rows of data
data$yield(10)
```
Use `cur_pos` to get the current position in the data file:
```{r}
data$cur_pos
```
The `is_done()` method tells us whether we have read the entire file
yet:
```{r}
data$is_done()
```
In preparation for our actual example, we'll use `reset()` to reset to
the beginning of the data:
```{r}
data$reset()
```
Using `yield()` and `is_done()`, we can set up our processing pipeline.
First, we create an empty placeholder tibble to store our results:
```{r}
yield_results <- tibble(
HEALTH = factor(levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
AT_WORK = factor(levels = c("No", "Yes")),
n = integer(0)
)
```
Then, we iterate through the data, yielding 1,000 rows at a time and
processing the results as we did in the chunked example. The iteration
will end when we've finished reading the entire file.
```{r}
while (!data$is_done()) {
# Yield new data and process
new <- data$yield(n = 1000) %>%
mutate(
HEALTH = as_factor(HEALTH),
AT_WORK = as_factor(
lbl_relabel(
EMPSTAT,
lbl(1, "Yes") ~ .lbl == "At work",
lbl(0, "No") ~ .lbl != "At work"
)
)
) %>%
group_by(HEALTH, AT_WORK) %>%
summarize(n = n(), .groups = "drop")
# Combine the new yield with the previously processed yields
yield_results <- bind_rows(yield_results, new) %>%
group_by(HEALTH, AT_WORK) %>%
summarize(n = sum(n), .groups = "drop")
}
yield_results
```
#### Yielded GLM regression
One of the major benefits of yielded reading over chunked reading is
that it is compatible with the GLM functions from biglm, allowing for
more complicated models.
To run a logistic regression, we first need to reset our yield object
from the previous example:
```{r}
data$reset()
```
Next, we make a function that takes a single argument, `reset`. When
`reset` is `TRUE`, the function resets the data to the beginning; this
interface is required by `bigglm()` from biglm.

To create this function, we use the `reset()` method from the yield
object:
```{r}
get_model_data <- function(reset) {
if (reset) {
data$reset()
} else {
yield <- data$yield(n = 1000)
if (is.null(yield)) {
return(yield)
}
yield %>%
mutate(
HEALTH = as_factor(HEALTH),
WORK30PLUS = lbl_na_if(AHRSWORKT, ~ .lbl == "NIU (Not in universe)") >= 30,
AT_WORK = as_factor(
lbl_relabel(
EMPSTAT,
lbl(1, "Yes") ~ .lbl == "At work",
lbl(0, "No") ~ .lbl != "At work"
)
)
) %>%
filter(AT_WORK == "Yes")
}
}
```
Finally we feed this function and a model specification to the
`bigglm()` function:
```{r, eval=installed_biglm}
results <- bigglm(
WORK30PLUS ~ AGE + I(AGE^2) + HEALTH,
family = binomial(link = "logit"),
data = get_model_data
)
summary(results)
```
## Option 4: Use a database {#database}
Storing your data in a database is another way to work with data that
cannot fit into memory as a data frame. If you have access to a database
on a remote machine, you can easily select and use parts of the data for
your analysis. Even a database on your own machine can store data on
your hard drive rather than in memory, allowing you to load only the
parts you need into R.
There are many different kinds of databases, each with their own
benefits and drawbacks, and the database you choose to use will be
specific to your use case. However, once you've chosen a database, there
will be two general steps:
1. Importing data into the database
2. Connecting the database to R
R has several tools that support database integration, including
`{DBI}`, `{dbplyr}`, `{sparklyr}`, `{bigrquery}`, and others.
In this example,
we'll use `{RSQLite}` to load the data into an in-memory database. (We use
RSQLite because it is easy to set up, but it is likely not efficient
enough to fully resolve issues with large IPUMS data, so it may be wise
to consider an alternative in practice.)
### Importing data into the database
For rectangular extracts, it is likely simplest to load your data into
the database in CSV format, which is widely supported. If you are
working with a hierarchical extract (or your database software doesn't
support CSV format), then you can use an ipumsr `chunked` function to
load the data into a database without needing to store the entire
dataset in R.
See the [IPUMS data reading vignette](ipums-read.html#hierarchical-extracts)
for more about rectangular vs. hierarchical extracts.
```{r, eval=installed_db_pkgs, results="hide"}
library(DBI)
library(RSQLite)
# Connect to database
con <- dbConnect(SQLite(), dbname = ":memory:")
# Load file metadata
ddi <- read_ipums_ddi(cps_ddi_file)
# Write data to database in chunks
read_ipums_micro_chunked(
ddi,
readr::SideEffectChunkCallback$new(
function(x, pos) {
if (pos == 1) {
dbWriteTable(con, "cps", x)
} else {
dbWriteTable(con, "cps", x, row.names = FALSE, append = TRUE)
}
}
),
chunk_size = 1000,
verbose = FALSE
)
```
### Connecting to a database with dbplyr
There are a variety of ways to access your data once it is stored in the
database. In this example, we use dbplyr. For more details about dbplyr,
see `vignette("dbplyr", package = "dbplyr")`.
To run a simple query for `AGE`, we can use the same syntax we would use
with dplyr:
```{r, eval=installed_db_pkgs}
example <- tbl(con, "cps")
example %>%
  filter(AGE > 25)
```
dbplyr shows us a nice preview of the first rows of the result of our
query, but the data still exist only in the database. You can use
`dplyr::collect()` to load the full results of the query into the
current R session. However, this would omit the variable metadata
attached to IPUMS data, since the database doesn't store this metadata:
```{r, eval=installed_db_pkgs}
data <- example %>%
  filter(AGE > 25) %>%
collect()
# Variable metadata is missing
ipums_val_labels(data$MONTH)
```
Instead, use `ipums_collect()`, which uses a provided `ipums_ddi` object
to reattach the metadata while loading into the R environment:
```{r, eval=installed_db_pkgs}
data <- example %>%
  filter(AGE > 25) %>%
ipums_collect(ddi)
ipums_val_labels(data$MONTH)
```
See the [value labels vignette](value-labels.html) for more about variable
metadata in IPUMS data.
## Learning more
Big data isn't just a problem for IPUMS users, so there are many R
resources available.
See the documentation for the packages mentioned in the
[database](#database) section for more information about those
options.
For some past blog posts and articles on the topic, see the following:
- [*Big Data in
R*](https://github.com/smooney27/RWorkshopSER2024/blob/main/EPIC_R_BigData.pdf) -
Part of Stephen Mooney's EPIC: Epidemiologic Analysis Using R, June
2015 class
- [*Statistical Analysis with Open-Source R and RStudio on Amazon
EMR*](https://aws.amazon.com/blogs/big-data/statistical-analysis-with-open-source-r-and-rstudio-on-amazon-emr/) -
Markus Schmidberger on the AWS Big Data Blog
- [*Hosting RStudio Server on
Azure*](https://www.jumpingrivers.com/blog/hosting-rstudio-server-on-azure/) -
Colin Gillespie's blog post on using Rstudio on Azure
- [*Improving DBI: A
Retrospect*](https://www.r-consortium.org/blog/2017/05/15/improving-dbi-a-retrospect) -
Kirill Müller's report on the R Consortium grant to improve database
support in R