
As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.


We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.


Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.


Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.


Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.


This Code of Conduct is adapted from the Contributor Covenant (http://contributor-covenant.org), version 1.0.0, available at http://contributor-covenant.org/version/1/0/0/


Thank you for considering improving this project! By participating, you agree to abide by the code of conduct.


Issues (Reporting a problem or suggestion)


If you’ve experienced a problem with the package, or have a suggestion for it, please post it on the issues tab. This space is meant for questions directly related to the R package, so questions related to your specific extract may be better answered via email (but don’t worry about making a mistake; we know it is tough to tell the difference).


Since our extracts are such large files, posting minimal reproducible examples may be difficult. Therefore, it will be most helpful if you can provide as much detail about your problem as possible including the code and error message, the project the extract is from, the variables you have selected, file type, etc. We’ll do our best to answer your question.
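
When describing a problem, it often helps to include the versions of R and ipumsr in use. The following is a minimal sketch using base R only; the `report_lines` and `ipumsr_version` names are illustrative and not part of ipumsr.

```r
# Collect version details that are commonly pasted into an issue report.
# `report_lines` is an illustrative name, not part of ipumsr.
ipumsr_version <- if (requireNamespace("ipumsr", quietly = TRUE)) {
  as.character(utils::packageVersion("ipumsr"))
} else {
  "not installed"
}

report_lines <- c(
  paste("R version:", R.version.string),
  paste("ipumsr version:", ipumsr_version),
  paste("Platform:", R.version$platform)
)

# Print one item per line, ready to copy into the issue
cat(report_lines, sep = "\n")
```

The printed lines can be pasted alongside the code and error message described above.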


Pull Requests (Making changes to the package)


We appreciate pull requests that follow these guidelines:

  1. Make sure that tests pass (and add new ones if possible).

  2. Do your best to conform to the code style of the package, currently based on the tidyverse style guide. See the styler package to easily catch stylistic errors.

  3. Please add your name and affiliation to the NOTICE.txt file.

  4. Summarize your changes in the NEWS.md file.
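
The first two guidelines can be checked locally before a pull request is opened. The sketch below assumes the devtools and styler packages are installed; `check_before_pr()` is a hypothetical helper written for this example and is not part of ipumsr.

```r
# Hypothetical helper (not part of ipumsr): run the test suite and restyle
# the package source before opening a pull request.
# Assumes the devtools and styler packages are installed.
check_before_pr <- function(pkg = ".") {
  for (dep in c("devtools", "styler")) {
    if (!requireNamespace(dep, quietly = TRUE)) {
      stop("Package '", dep, "' is required for this check.", call. = FALSE)
    }
  }
  devtools::test(pkg)     # run the package's tests
  styler::style_pkg(pkg)  # restyle source files (tidyverse style by default)
  invisible(pkg)
}

# Usage, from the root of your local clone:
# check_before_pr()
```

Note that `styler::style_pkg()` rewrites files in place, so it is worth reviewing the resulting diff before committing.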

Basics of Pull Requests


If you’ve never worked on an R package before, the book R Packages by Hadley Wickham is a great resource for learning the mechanics of building an R package and contributing to R packages on GitHub. Additionally, here’s a great primer on Git and GitHub specifically.


In the meantime, here’s a quick step-by-step guide on contributing to this project using RStudio:

  1. If you don’t already have RStudio and Git installed, you can download them here and here.

  2. Fork this repo (top right corner button on the GitHub website).

  3. Clone the repo from RStudio’s toolbar: File > New Project > From Version Control > https://github.com/*YOUR_USER_NAME*/ipumsr/.

  4. Make changes to your local copy.

  5. Commit your changes and push them to the GitHub website using RStudio’s Git pane (push using the green up arrow).

  6. Submit a pull request, selecting the “compare across forks” option. Please include a short message summarizing your changes.

This vignette details the options available for requesting data from IPUMS microdata projects via the IPUMS API.


If you haven’t yet learned the basics of the IPUMS API workflow, you may want to start with the IPUMS API introduction. The code below assumes you have registered and set up your API key as described there.


Supported microdata collections


IPUMS provides several data collections that are classified as microdata. Currently, the following microdata collections are supported by the IPUMS API (shown with the codes used to refer to them in ipumsr):

  • IPUMS USA ("usa")
  • IPUMS CPS ("cps")
  • IPUMS International ("ipumsi")

API support will continue to be added for more collections in the future. See the API documentation for more information on upcoming additions to the API.


In addition to microdata projects, the IPUMS API also supports IPUMS NHGIS data. For details about obtaining IPUMS NHGIS data using ipumsr, see the NHGIS-specific vignette.


Before getting started, we’ll load ipumsr and dplyr, which will be helpful for this demo:

library(ipumsr)
library(dplyr)

Basic IPUMS microdata concepts


Every microdata extract definition must contain a set of requested samples and variables.


In an IPUMS microdata collection, a sample refers to a distinct combination of records and variables. A record is a set of values that describe the characteristics of a single unit of measurement (e.g. a single person or a single household), and variables define the characteristics that were measured.


A single sample can contain multiple record types (e.g. person records, household records, activity records, and more), each of which corresponds to a different unit of measurement.


Note that our usage of the term “sample” does not correspond perfectly to the statistical sense of a subset of individuals from a population. Many IPUMS samples are samples in the statistical sense, but some are “full-count” samples, meaning they contain all individuals in a population.


IPUMS microdata metadata (forthcoming)


Of course, to request samples and variables, we have to know the codes that the API uses to refer to them. For samples, the IPUMS API uses special codes that don’t appear in the web-based extract builder. For variables, the API uses the same variable names that appear on the web.


While the IPUMS API does not yet provide a comprehensive set of metadata endpoints for IPUMS microdata collections, users can use the get_sample_info() function to identify the codes used to refer to specific samples when communicating with the API.

cps_samps <- get_sample_info("cps")

head(cps_samps)
#> # A tibble: 6 × 2
#>   name        description
#>   <chr>       <chr>
#> 1 cps1962_03s IPUMS-CPS, ASEC 1962
#> 2 cps1963_03s IPUMS-CPS, ASEC 1963
#> 3 cps1964_03s IPUMS-CPS, ASEC 1964
#> 4 cps1965_03s IPUMS-CPS, ASEC 1965
#> 5 cps1966_03s IPUMS-CPS, ASEC 1966
#> 6 cps1967_03s IPUMS-CPS, ASEC 1967

The values listed in the name column correspond to the code that you would use to request that sample when creating an extract definition to be submitted to the IPUMS API.


We can use basic functions from dplyr to filter the metadata to samples of interest. For instance, to find all IPUMS International samples for Mexico, we could do the following:

ipumsi_samps <- get_sample_info("ipumsi")

ipumsi_samps %>%
  filter(grepl("Mexico", description))
#> # A tibble: 70 × 2
#>    name    description
#>    <chr>   <chr>
#>  1 mx1960a Mexico 1960
#>  2 mx1970a Mexico 1970
#>  3 mx1990a Mexico 1990
#>  4 mx1995a Mexico 1995
#>  5 mx2000a Mexico 2000
#>  6 mx2005a Mexico 2005
#>  7 mx2010a Mexico 2010
#>  8 mx2015a Mexico 2015
#>  9 mx2005h Mexico 2005 Q1 LFS
#> 10 mx2005i Mexico 2005 Q2 LFS
#> # ℹ 60 more rows
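
Once the samples of interest are identified, their codes can be pulled out of the name column and passed to the samples argument of an extract definition. The sketch below uses base R and a hand-typed data frame holding four of the rows shown above, so that it runs without an API key. The choice to exclude rows whose description contains "LFS" is only an illustration.

```r
# A few rows copied from the output above; in practice this would be the
# full result of get_sample_info("ipumsi").
ipumsi_samps <- data.frame(
  name = c("mx1960a", "mx1970a", "mx1990a", "mx2005h"),
  description = c("Mexico 1960", "Mexico 1970", "Mexico 1990", "Mexico 2005 Q1 LFS")
)

# Flag the rows whose description mentions "LFS", then keep the codes
# of the remaining samples
is_lfs <- grepl("LFS", ipumsi_samps$description)
mx_codes <- ipumsi_samps$name[!is_lfs]

mx_codes
#> [1] "mx1960a" "mx1970a" "mx1990a"
```

The resulting character vector has the form expected by the samples argument.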

IPUMS intends to add support for accessing variable metadata via API in the future. Until then, use the web-based extract builder for a given collection to find variable names and availability by sample. See the IPUMS API documentation for links to the extract builder for each microdata collection with API support.


Alternatively, if you have made an extract previously through the web interface, you can use get_extract_info() to identify the variable names it includes. See the IPUMS API introduction for more details.


Defining an IPUMS microdata extract request


Each IPUMS collection has its own extract definition function that is used to specify the parameters of a new extract request from scratch. These functions take the form define_extract_*(). For microdata collections, we have:

  • define_extract_usa()
  • define_extract_cps()
  • define_extract_ipumsi()

When you define an extract request, you can specify the data to be included in the extract and indicate the desired format and layout.


While each microdata collection has its own extract definition function, each uses the same syntax. The examples in this vignette use multiple collections, but the syntax they demonstrate can be applied to all of the supported microdata collections.


A simple extract definition needs only to contain the names of the samples and variables to include in the request:

cps_ext <- define_extract_cps(
  description = "Example CPS extract",
  samples = c("cps2018_03s", "cps2019_03s"),
  variables = c("AGE", "SEX", "RACE", "STATEFIP")
)

cps_ext
#> Unsubmitted IPUMS CPS extract
#> Description: Example CPS extract
#> Samples: (2 total) cps2018_03s, cps2019_03s
#> Variables: (4 total) AGE, SEX, RACE, STATEFIP

This produces an ipums_extract object containing the extract request specifications that is ready to be submitted to the IPUMS API.


When you request a variable in your extract definition, the resulting data extract will include that variable for all requested samples where it is available. If you request a variable that is not available for any requested samples, the IPUMS API will throw an informative error when you try to submit your request.


Beyond just specifying samples and variables, there are several additional options available to refine the data requested in a microdata extract request.


Detailed variable specifications


The IPUMS API supports several detailed specification options that can be applied to individual variables in an extract request: case selections, attached characteristics, and data quality flags.


Before we describe each of these options in depth, we’ll introduce the syntax used to add them to your extract definition.


Syntax


To add any of these options to a variable, we need to introduce the var_spec() helper function.


var_spec() bundles all the selections for a given variable together into a single object (in this case, a var_spec object):

var <- var_spec("SEX", case_selections = "2")

str(var)
#> List of 3
#>  $ name               : chr "SEX"
#>  $ case_selections    : chr "2"
#>  $ case_selection_type: chr "general"
#>  - attr(*, "class")= chr [1:3] "var_spec" "ipums_spec" "list"

To include this specification in our extract, we simply provide it to the variables argument of our extract definition. When multiple variables are included, pass a list of var_spec objects:

define_extract_cps(
  description = "Case selection example",
  samples = c("cps2018_03s", "cps2019_03s"),
  variables = list(
    var_spec("SEX", case_selections = "2"),
    var_spec("AGE", attached_characteristics = "head")
  )
)
#> Unsubmitted IPUMS CPS extract
#> Description: Case selection example
#> Samples: (2 total) cps2018_03s, cps2019_03s
#> Variables: (2 total) SEX, AGE

In fact, if you investigate our original extract object from above, you’ll notice that the variables have automatically been converted to var_spec objects, even though they were provided as character vectors:

str(cps_ext$variables)
#> List of 4
#>  $ AGE     :List of 1
#>   ..$ name: chr "AGE"
#>   ..- attr(*, "class")= chr [1:3] "var_spec" "ipums_spec" "list"
#>  $ SEX     :List of 1
#>   ..$ name: chr "SEX"
#>   ..- attr(*, "class")= chr [1:3] "var_spec" "ipums_spec" "list"
#>  $ RACE    :List of 1
#>   ..$ name: chr "RACE"
#>   ..- attr(*, "class")= chr [1:3] "var_spec" "ipums_spec" "list"
#>  $ STATEFIP:List of 1
#>   ..$ name: chr "STATEFIP"
#>   ..- attr(*, "class")= chr [1:3] "var_spec" "ipums_spec" "list"

So, a var_spec object with no additional specifications will produce the default data for a given variable. That is, the following are equivalent:

define_extract_cps(
  description = "Example CPS extract",
  samples = "cps2018_03s",
  variables = "AGE"
)

define_extract_cps(
  description = "Example CPS extract",
  samples = "cps2018_03s",
  variables = var_spec("AGE")
)

Because all specified variables are converted to var_spec objects, you can also pass a list where some elements are var_spec objects and some are just variable names. This is convenient when you only have detailed specifications for a subset of variables:

define_extract_cps(
  description = "Case selection example",
  samples = c("cps2018_03s", "cps2019_03s"),
  variables = list(
    var_spec("SEX", case_selections = "2"),
    "AGE"
  )
)
#> Unsubmitted IPUMS CPS extract
#> Description: Case selection example
#> Samples: (2 total) cps2018_03s, cps2019_03s
#> Variables: (2 total) SEX, AGE

(Samples are also converted to their own samp_spec objects, but as there currently aren’t any additional specifications available for samples, there is no reason to use anything other than a character vector in the samples argument.)


Now that we’ve covered the basic syntax for including detailed variable specifications, we can describe the available options in more depth.


Case selections


Case selections allow us to limit the data to those records that match a particular value on the specified variable.


For instance, the following specification would indicate that only records with a value of "27" (Minnesota) or "19" (Iowa) for the variable "STATEFIP" should be included:

var <- var_spec("STATEFIP", case_selections = c("27", "19"))

Some variables have versions with both general and detailed coding schemes. By default, case selections are interpreted to refer to the general codes:

var$case_selection_type
#> [1] "general"

For variables with detailed versions, you can also select on the detailed codes.


For instance, the IPUMS USA variable RACE is available in both general and detailed versions. If you wanted to limit your extract to persons identifying as “Two major races”, you could do so by specifying a case selection of "8". However, if you wanted to limit your extract to only persons identifying as “White and Chinese” or “White and Japanese”, you would need to specify detailed codes "811" and "812".


To include case selections for detailed codes, set case_selection_type = "detailed":

# General case selection is the default
var_spec("RACE", case_selections = "8")
#> $name
#> [1] "RACE"
#> $case_selections
#> [1] "8"
#> $case_selection_type
#> [1] "general"
#> attr(,"class")
#> [1] "var_spec"   "ipums_spec" "list"

# For detailed case selection, change the `case_selection_type`
var_spec(
  "RACE",
  case_selections = c("811", "812"),
  case_selection_type = "detailed"
)
#> $name
#> [1] "RACE"
#> $case_selections
#> [1] "811" "812"
#> $case_selection_type
#> [1] "detailed"
#> attr(,"class")
#> [1] "var_spec"   "ipums_spec" "list"

As noted above, IPUMS intends to add support for accessing variable metadata via API in the future, such that users will be able to query variable coding schemes right from their R sessions. Until then, use the IPUMS web interface for a given collection to find general and detailed variable codes for the purposes of case selection. See the IPUMS API documentation for relevant links.


By default, case selection on person-level variables produces a data file that includes only those individuals who match the specified values for the specified variables. It’s also possible to use case selection to include matching individuals and all other members of their households, using the case_select_who parameter.


The case_select_who parameter must be the same for all case selections in an extract, and thus is set at the extract level rather than the var_spec level. To include all household members of matching individuals, set case_select_who = "households" in the extract definition:

define_extract_usa(
  description = "Household level case selection",
  samples = "us2021a",
  variables = var_spec("RACE", case_selections = "8"),
  case_select_who = "households"
)
#> Unsubmitted IPUMS USA extract
#> Description: Household level case selection
#> Samples: (1 total) us2021a
#> Variables: (1 total) RACE
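
To make the difference between the two selection modes concrete, the sketch below mimics them locally in base R. The records are made up for illustration and are not IPUMS data, and the SERIAL (household identifier) and PERNUM (person number) column names are assumptions of this example. The actual selection is carried out by the IPUMS extract system, not in your R session.

```r
# Made-up person records for illustration only (not IPUMS data).
# SERIAL identifies the household; RACE uses the general codes.
persons <- data.frame(
  SERIAL = c(1, 1, 2, 2, 3),
  PERNUM = c(1, 2, 1, 2, 1),
  RACE   = c(8, 1, 1, 1, 8)
)

# Default behavior: keep only the individuals who match
individuals <- persons[persons$RACE == 8, ]

# case_select_who = "households": keep everyone in a household that
# contains at least one matching individual
matching_households <- unique(persons$SERIAL[persons$RACE == 8])
households <- persons[persons$SERIAL %in% matching_households, ]

nrow(individuals)
#> [1] 2
nrow(households)
#> [1] 3
```

In this toy data, household 1 contributes one matching person under the default mode but both of its members under household-level selection.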

Attached characteristics


IPUMS allows users to create variables that reflect the characteristics of other household members. To do so, use the attached_characteristics argument of var_spec().


For instance, to attach the spouse’s SEX value to a record:

var_spec("SEX", attached_characteristics = "spouse")
#> $name
#> [1] "SEX"
#> $attached_characteristics
#> [1] "spouse"
#> attr(,"class")
#> [1] "var_spec"   "ipums_spec" "list"

This will add a new variable (in this case, SEX_SP) to the output data that will contain the sex of a person’s spouse (if no such record exists, the value will be 0).


Multiple attached characteristics can be attached for a single variable:

var_spec("AGE", attached_characteristics = c("mother", "father"))
#> $name
#> [1] "AGE"
#> $attached_characteristics
#> [1] "mother" "father"
#> attr(,"class")
#> [1] "var_spec"   "ipums_spec" "list"

Acceptable values are "spouse", "mother", "father", and "head".


Data quality flags


Some IPUMS variables have been edited for missing, illegible, and inconsistent values. Data quality flags indicate which values are edited or allocated.


To include data quality flags for an individual variable, use the data_quality_flags argument to var_spec():

var_spec("RACE", data_quality_flags = TRUE)
#> $name
#> [1] "RACE"
#> $data_quality_flags
#> [1] TRUE
#> attr(,"class")
#> [1] "var_spec"   "ipums_spec" "list"

This will produce a new variable (QRACE) containing the data quality flag for the given variable.


To add data quality flags for all variables that have them, set data_quality_flags = TRUE in your extract definition directly:

usa_ext <- define_extract_usa(
  description = "Data quality flags",
  samples = "us2021a",
  variables = list(
    var_spec("RACE", case_selections = "8"),
    var_spec("AGE")
  ),
  data_quality_flags = TRUE
)

Each data quality flag corresponds to one or more variables, and the codes for each flag vary based on the sample. See the documentation for the IPUMS collection of interest for more information about data quality flag codes.
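
The three kinds of detailed specification can also appear together in one extract definition. The sketch below assumes that the options may be combined on a single variable and that the ipumsr version described in this vignette (where define_extract_usa() is available) is installed. The object names `race_spec`, `age_spec`, and `combined_ext` are illustrative.

```r
library(ipumsr)

# Detailed case selection plus a data quality flag on RACE
race_spec <- var_spec(
  "RACE",
  case_selections = c("811", "812"),
  case_selection_type = "detailed",
  data_quality_flags = TRUE
)

# Attach the household head's AGE to each record
age_spec <- var_spec("AGE", attached_characteristics = "head")

# Mix var_spec objects with a plain variable name in one definition
combined_ext <- define_extract_usa(
  description = "Combined variable specifications",
  samples = "us2021a",
  variables = list(race_spec, age_spec, "SEX")
)

names(combined_ext$variables)
```

Defining the extract does not contact the API, so a combination that the API does not support would only surface as an error on submission.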


Data structure and file format


By default, microdata extract definitions will request data in a rectangular structure and fixed-width file format.


Rectangular data are data where only person records are included, and any household-level variables are converted to person-level variables by copying the values from the associated household record onto all household members.
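
The copying step described above amounts to joining household records onto person records. The base R sketch below illustrates the idea with made-up records (not IPUMS data); the SERIAL and PERNUM identifier columns are assumptions of this example.

```r
# Made-up household and person records for illustration only.
households <- data.frame(
  SERIAL   = c(1, 2),
  STATEFIP = c(27, 19)    # household-level variable
)
persons <- data.frame(
  SERIAL = c(1, 1, 2),
  PERNUM = c(1, 2, 1),
  AGE    = c(34, 36, 52)  # person-level variable
)

# Copy each household's values onto all of its members:
# one row per person, with STATEFIP repeated within a household
rectangular <- merge(persons, households, by = "SERIAL")

rectangular
```

Both members of household 1 end up with the same STATEFIP value, which is what a rectangular extract delivers.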


To instead create a hierarchical extract, which includes separate records for households and persons, set data_structure = "hierarchical" in your extract definition.


See the IPUMS data reading vignette for more information about loading hierarchical data into R.


To request a file format other than fixed-width, adjust the data_format argument. Note that while you can request data in a variety of formats (Stata, SPSS, etc.), ipumsr’s read_ipums_micro() function only supports fixed-width and csv files.
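
As a sketch of how these two arguments might be set, the definitions below request a hierarchical extract and, separately, a rectangular csv extract. The "csv" value for data_format is an assumption based on the file types mentioned above, so check the function documentation for the accepted values. The object names are illustrative.

```r
library(ipumsr)

# Hierarchical structure, keeping the default fixed-width file format
hier_ext <- define_extract_usa(
  description = "Hierarchical extract",
  samples = "us2021a",
  variables = c("AGE", "SEX"),
  data_structure = "hierarchical"
)

# Rectangular structure (the default), requested as csv.
# The "csv" value is an assumption; see ?define_extract_usa.
csv_ext <- define_extract_usa(
  description = "Rectangular csv extract",
  samples = "us2021a",
  variables = c("AGE", "SEX"),
  data_format = "csv"
)
```

The two options are kept in separate definitions here because this vignette does not say whether every structure and format combination is accepted.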


Next steps


Once you have defined an extract request, you can submit the extract for processing:

usa_ext_submitted <- submit_extract(usa_ext)

The workflow for submitting and monitoring an extract request and downloading its files when complete is described in the IPUMS API introduction.
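
For orientation, the full round trip can be sketched as a single function. This is only an outline: it requires a registered IPUMS API key, `run_extract()` is an illustrative name that is not part of ipumsr, and the IPUMS API introduction remains the reference for the details of each step.

```r
# Sketch of the round trip for a microdata extract definition.
# Requires the ipumsr package and a registered IPUMS API key.
# `run_extract` is an illustrative name, not part of ipumsr.
run_extract <- function(extract, download_dir = tempdir()) {
  submitted <- ipumsr::submit_extract(extract)    # send the definition to the API
  ready <- ipumsr::wait_for_extract(submitted)    # wait until processing finishes
  ddi_path <- ipumsr::download_extract(ready, download_dir = download_dir)
  ipumsr::read_ipums_micro(ddi_path)              # load the downloaded data into R
}

# Usage (not run here, since it contacts the IPUMS API):
# usa_data <- run_extract(usa_ext)
```

Because the last step uses read_ipums_micro(), this outline only applies to extracts requested in fixed-width or csv format.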


This vignette details the options available for requesting IPUMS NHGIS data and metadata via the IPUMS API.


If you haven’t yet learned the basics of the IPUMS API workflow, you may want to start with the IPUMS API introduction. The code below assumes you have registered and set up your API key as described there.


In addition to NHGIS, the IPUMS API also supports several microdata projects. For details about obtaining IPUMS microdata using ipumsr, see the microdata-specific vignette.


Before getting started, we’ll load ipumsr and some helpful packages for this demo:

library(ipumsr)
library(dplyr)
library(purrr)

Basic IPUMS NHGIS concepts


IPUMS NHGIS supports 3 main types of data products: datasets, time series tables, and shapefiles.

  • A dataset contains a collection of data tables that each correspond to a particular tabulated summary statistic. A dataset is distinguished by the years, geographic levels, and topics that it covers. For instance, 2021 1-year data from the American Community Survey (ACS) is encapsulated in a single dataset. In other cases, a single census product will be split into multiple datasets.

  • A time series table is a longitudinal data source that links comparable statistics from multiple U.S. censuses in a single bundle. A table is composed of one or more related time series, each of which describes a single summary statistic measured at multiple times for a given geographic level.

  • A shapefile (or GIS file) contains geographic data for a given geographic level and year. Typically, these files are composed of polygon geometries containing the boundaries of census reporting areas.


IPUMS NHGIS metadata


Of course, to make a request for any of these data sources, we have to know the codes that the API uses to refer to them. Fortunately, we can browse the metadata for all available IPUMS NHGIS data sources with get_metadata_nhgis().


Users can view summary metadata for all available data sources of a given data type, or detailed metadata for a specific data source by name.


Summary metadata


To see a summary of all available sources for a given data product type, use the type argument. This returns a data frame containing the available datasets, data tables, time series tables, or shapefiles.

ds <- get_metadata_nhgis(type = "datasets")

head(ds)
#> # A tibble: 6 × 4
#>   name      group       description                              sequence
#>   <chr>     <chr>       <chr>                                       <int>
#> 1 1790_cPop 1790 Census Population Data [US, States & Counties]       101
#> 2 1800_cPop 1800 Census Population Data [US, States & Counties]       201
#> 3 1810_cPop 1810 Census Population Data [US, States & Counties]       301
#> 4 1820_cPop 1820 Census Population Data [US, States & Counties]       401
#> 5 1830_cPop 1830 Census Population Data [US, States & Counties]       501
#> 6 1840_cAg  1840 Census Agriculture Data [US, States & Counties]      601

We can use basic functions from dplyr to filter the metadata to those records of interest. For instance, if we wanted to find all the data sources related to agriculture from the 1900 Census, we could filter on group and description:

ds %>%
  filter(
    group == "1900 Census",
    grepl("Agriculture", description)
  )
#> # A tibble: 2 × 4
#>   name       group       description                                    sequence
#>   <chr>      <chr>       <chr>                                             <int>
#> 1 1900_cAg   1900 Census Agriculture Data [US, States & Counties]           1401
#> 2 1900_cPHAM 1900 Census Population, Housing, Agriculture & Manufactur…     1403

The values listed in the name column correspond to the code that you would use to request that dataset when creating an extract definition to be submitted to the IPUMS API.


Similarly, for time series tables:

tst <- get_metadata_nhgis("time_series_tables")

While some of the metadata fields are consistent across different data types, some, like geographic_integration, are specific to time series tables:

head(tst)
#> # A tibble: 6 × 7
#>   name  description         geographic_integration sequence time_series years
#>   <chr> <chr>               <chr>                     <dbl> <list>      <list>
#> 1 A00   Total Population    Nominal                    100. <tibble>    <tibble>
#> 2 AV0   Total Population    Nominal                    100. <tibble>    <tibble>
#> 3 B78   Total Population    Nominal                    100. <tibble>    <tibble>
#> 4 CL8   Total Population    Standardized to 2010       100. <tibble>    <tibble>
#> 5 A57   Persons by Urban/R… Nominal                    101. <tibble>    <tibble>
#> 6 A59   Persons by Urban/R… Nominal                    101. <tibble>    <tibble>
#> # ℹ 1 more variable: geog_levels <list>

Note that for time series tables, some metadata fields are stored in list columns, where each entry is itself a data frame:

tst$years[[1]]
#> # A tibble: 24 × 3
#>    name  description sequence
#>    <chr> <chr>          <int>
#>  1 1790  1790               1
#>  2 1800  1800               2
#>  3 1810  1810               3
#>  4 1820  1820               4
#>  5 1830  1830               5
#>  6 1840  1840               6
#>  7 1850  1850               7
#>  8 1860  1860               8
#>  9 1870  1870              12
#> 10 1880  1880              22
#> # ℹ 14 more rows

tst$geog_levels[[1]]
#> # A tibble: 2 × 3
#>   name   description   sequence
#>   <chr>  <chr>            <int>
#> 1 state  State                4
#> 2 county State--County       25

To filter on these columns, we can use map_lgl() from purrr. For instance, to find all time series tables that include data from a particular year:

# Iterate over each `years` entry, identifying whether that entry
# contains "1840" in its `name` column.
tst %>%
  filter(map_lgl(years, ~ "1840" %in% .x$name))
#> # A tibble: 2 × 7
#>   name  description        geographic_integration sequence time_series years
#>   <chr> <chr>              <chr>                     <dbl> <list>      <list>
#> 1 A00   Total Population   Nominal                    100. <tibble>    <tibble>
#> 2 A08   Persons by Sex [2] Nominal                    102. <tibble>    <tibble>
#> # ℹ 1 more variable: geog_levels <list>

For more details on working with nested data frames, see this tidyr article.
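
The same kind of filter can be written without purrr. The sketch below uses base R's `vapply()` on a toy data frame, so that it runs without an API key. The table names T1 to T3 and their years are made up for illustration and do not correspond to real NHGIS tables.

```r
# Toy stand-in for the time series table metadata: made-up table names,
# with a `years` list column whose entries are data frames.
tst_toy <- data.frame(name = c("T1", "T2", "T3"))
tst_toy$years <- list(
  data.frame(name = c("1830", "1840")),
  data.frame(name = c("1990", "2000")),
  data.frame(name = c("1840", "1850"))
)

# One TRUE/FALSE per row: does this table's `years` entry include 1840?
has_1840 <- vapply(tst_toy$years, function(y) "1840" %in% y$name, logical(1))

tst_toy$name[has_1840]
#> [1] "T1" "T3"
```

As with `map_lgl()`, the function passed to `vapply()` is applied to each entry of the list column and must return a single logical value.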


Detailed metadata


Once we have identified a data source of interest, we can find out more about its detailed options by providing its name to the corresponding argument of get_metadata_nhgis():

cAg_meta <- get_metadata_nhgis(dataset = "1900_cAg")

This provides a comprehensive list of the possible specifications for the input data source. For instance, for the 1900_cAg dataset, we have 66 tables to choose from, and 3 possible geographic levels:

cAg_meta$data_tables
#> # A tibble: 66 × 4
#>    name  nhgis_code description                           sequence
#>    <chr> <chr>      <chr>                                    <int>
#>  1 NT1   AWS        Total Population                             1
#>  2 NT2   AW3        Number of Farms                              2
#>  3 NT3   AXE        Average Farm Size                            3
#>  4 NT4   AXP        Farm Acreage                                 4
#>  5 NT5   AXZ        Farm Management                              5
#>  6 NT6   AYA        Race of Farmer                               6
#>  7 NT7   AYJ        Race of Farmer by Detailed Management        7
#>  8 NT8   AYK        Number of Farms                              8
#>  9 NT9   AYL        Farms with Buildings                         9
#> 10 NT10  AWT        Acres of Farmland                           10
#> # ℹ 56 more rows

cAg_meta$geog_levels
#> # A tibble: 3 × 4
#>   name   description   has_geog_extent_selection sequence
#>   <chr>  <chr>         <lgl>                        <int>
#> 1 nation Nation        FALSE                            1
#> 2 state  State         FALSE                            4
#> 3 county State--County FALSE                           25

You can also get detailed metadata for an individual data table. Since data tables belong to specific datasets, both need to be specified to identify a data table:

get_metadata_nhgis(dataset = "1900_cAg", data_table = "NT2")
#> $name
#> [1] "NT2"
#> $description
#> [1] "Number of Farms"
#> $universe
#> [1] "Farms"
#> $nhgis_code
#> [1] "AW3"
#> $sequence
#> [1] 2
#> $dataset_name
#> [1] "1900_cAg"
#> $variables
#> # A tibble: 1 × 2
#>   description nhgis_code
#>   <chr>       <chr>
#> 1 Total       AW3001

Note that the name element is the one that contains the codes used for interacting with the IPUMS API. The nhgis_code element refers to the prefix attached to individual variables in the output data, and the API will throw an error if you use it in an extract definition. For more details on interpreting each of the provided metadata elements, see the documentation for get_metadata_nhgis().


Now that we have identified some of our options, we can go ahead and define an extract request to submit to the IPUMS API.


Defining an IPUMS NHGIS extract request


To create an extract definition containing the specifications for a specific set of IPUMS NHGIS data, use define_extract_nhgis().


When you define an extract request, you can specify the data to be included in the extract and indicate the desired format and layout.


Basic extract definitions


Let’s say we’re interested in getting state-level data on the number of farms and their average size from the 1900_cAg dataset that we identified above. As we can see in the metadata, these data are contained in tables NT2 and NT3:

cAg_meta$data_tables
#> # A tibble: 66 × 4
#>    name  nhgis_code description                           sequence
#>    <chr> <chr>      <chr>                                    <int>
#>  1 NT1   AWS        Total Population                             1
#>  2 NT2   AW3        Number of Farms                              2
#>  3 NT3   AXE        Average Farm Size                            3
#>  4 NT4   AXP        Farm Acreage                                 4
#>  5 NT5   AXZ        Farm Management                              5
#>  6 NT6   AYA        Race of Farmer                               6
#>  7 NT7   AYJ        Race of Farmer by Detailed Management        7
#>  8 NT8   AYK        Number of Farms                              8
#>  9 NT9   AYL        Farms with Buildings                         9
#> 10 NT10  AWT        Acres of Farmland                           10
#> # ℹ 56 more rows

Dataset specifications


To request these data, we need to make an explicit dataset specification. All datasets must be associated with a selection of data tables and geographic levels. We can use the ds_spec() helper function to specify our selections for these parameters. ds_spec() bundles all the selections for a given dataset together into a single object (in this case, a ds_spec object):

dataset <- ds_spec(
  "1900_cAg",
  data_tables = c("NT2", "NT3"),
  geog_levels = "state"
)

str(dataset)
#> List of 3
#>  $ name       : chr "1900_cAg"
#>  $ data_tables: chr [1:2] "NT2" "NT3"
#>  $ geog_levels: chr "state"
#>  - attr(*, "class")= chr [1:3] "ds_spec" "ipums_spec" "list"

This dataset specification can then be provided to the extract definition:

nhgis_ext <- define_extract_nhgis(
  description = "Example farm data in 1900",
  datasets = dataset
)

nhgis_ext
#> Unsubmitted IPUMS NHGIS extract
#> Description: Example farm data in 1900
#> Dataset: 1900_cAg
#>   Tables: NT2, NT3
#>   Geog Levels: state

Dataset specifications can also include selections for years and breakdown_values, but these are not available for all datasets.


Time series table specifications


Similarly, to make a request for time series tables, use the tst_spec() helper. This makes a tst_spec object containing a time series table specification.


Time series tables do not contain individual data tables, but do require a geographic level selection, and allow an optional selection of years:

-define_extract_nhgis(
-  description = "Example time series table request",
-  time_series_tables = tst_spec(
-    "CW3",
-    geog_levels = c("county", "tract"),
-    years = c("1990", "2000")
-  )
-)
-#> Unsubmitted IPUMS NHGIS extract 
-#> Description: Example time series table request
-#> Time Series Table: CW3
-#>   Geog Levels: county, tract
-#>   Years: 1990, 2000

Shapefile specifications


Shapefiles don’t have any additional specification options, and therefore can be requested simply by providing their names:

-define_extract_nhgis(
-  description = "Example shapefiles request",
-  shapefiles = c("us_county_2021_tl2021", "us_county_2020_tl2020")
-)
-#> Unsubmitted IPUMS NHGIS extract 
-#> Description: Example shapefiles request
-#> Shapefiles: us_county_2021_tl2021, us_county_2020_tl2020

Invalid specifications


An attempt to define an extract that does not have all the required specifications for a given dataset or time series table will throw an error:

-define_extract_nhgis(
-  description = "Invalid extract",
-  datasets = ds_spec("1900_STF1", data_tables = "NP1")
-)
-#> Error in `validate_ipums_extract()`:
-#> ! Invalid `ds_spec` specification:
-#>  `geog_levels` must not contain missing values.

Note that it is still possible to make invalid extract requests (for instance, by requesting a dataset or data table that doesn’t exist). This kind of issue will be caught upon submission to the API, not upon the creation of the extract definition.
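
In a script, it can be useful to capture the local validation error rather than let it halt execution. The following is a minimal sketch that reuses the invalid definition shown above and wraps it in base R's tryCatch(); the variable names result and definition_failed are illustrative only.

```r
library(ipumsr)

# Attempt to define an extract whose dataset lacks `geog_levels`.
# tryCatch() returns the error condition instead of stopping the script.
result <- tryCatch(
  define_extract_nhgis(
    description = "Invalid extract",
    datasets = ds_spec("1900_STF1", data_tables = "NP1")
  ),
  error = function(e) e
)

# TRUE if the definition was rejected before anything was sent to the API
definition_failed <- inherits(result, "error")

if (definition_failed) {
  message("Definition rejected locally: ", conditionMessage(result))
}
```

Only structural problems (such as the missing geographic level) are detected at this stage; whether the named dataset and table actually exist is still checked only on submission.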


More complicated extract definitions


It’s possible to request data for multiple datasets (or time series tables) in a single extract definition. To do so, pass a list of ds_spec or tst_spec objects to define_extract_nhgis():

-define_extract_nhgis(
-  description = "Slightly more complicated extract request",
-  datasets = list(
-    ds_spec("2018_ACS1", "B01001", "state"),
-    ds_spec("2019_ACS1", "B01001", "state")
-  ),
-  shapefiles = c("us_state_2018_tl2018", "us_state_2019_tl2019")
-)
-#> Unsubmitted IPUMS NHGIS extract 
-#> Description: Slightly more complicated extract request
-#> Dataset: 2018_ACS1
-#>   Tables: B01001
-#>   Geog Levels: state
-#> Dataset: 2019_ACS1
-#>   Tables: B01001
-#>   Geog Levels: state
-#> Shapefiles: us_state_2018_tl2018, us_state_2019_tl2019

For extracts with multiple datasets or time series tables, it may be easier to generate the specifications independently before creating your extract request object. You can quickly create multiple ds_spec objects by iterating across the specifications you want to include. Here, we use purrr to do so, but you could also use a for loop:

-ds_names <- c("2019_ACS1", "2018_ACS1")
-tables <- c("B01001", "B01002")
-geogs <- c("county", "state")
-# For each dataset to include, create a specification with the
-# data tables and geog levels indicated above
-datasets <- purrr::map(
-  ds_names,
-  ~ ds_spec(name = .x, data_tables = tables, geog_levels = geogs)
-)
-nhgis_ext <- define_extract_nhgis(
-  description = "Slightly more complicated extract request",
-  datasets = datasets
-)
-nhgis_ext
-#> Unsubmitted IPUMS NHGIS extract 
-#> Description: Slightly more complicated extract request
-#> Dataset: 2019_ACS1
-#>   Tables: B01001, B01002
-#>   Geog Levels: county, state
-#> Dataset: 2018_ACS1
-#>   Tables: B01001, B01002
-#>   Geog Levels: county, state

This workflow also makes it easy to quickly update the specifications in the future. For instance, to add the 2017 ACS 1-year data to the extract definition above, you’d only need to add "2017_ACS1" to the ds_names variable. The iteration would automatically add the selected tables and geog levels for the new dataset. (This workflow works particularly well for ACS datasets, which often have the same data table names across datasets.)
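
For readers who prefer not to depend on purrr, the same iteration can be written as a plain for loop. This is a sketch of the alternative mentioned above, using the same dataset, table, and geographic level names as the purrr example, with "2017_ACS1" added to show the update step.

```r
library(ipumsr)

ds_names <- c("2019_ACS1", "2018_ACS1", "2017_ACS1")
tables <- c("B01001", "B01002")
geogs <- c("county", "state")

# Pre-allocate a list with one slot per dataset, then fill each slot
# with a specification that shares the same tables and geog levels
datasets <- vector("list", length(ds_names))

for (i in seq_along(ds_names)) {
  datasets[[i]] <- ds_spec(
    name = ds_names[i],
    data_tables = tables,
    geog_levels = geogs
  )
}

nhgis_ext <- define_extract_nhgis(
  description = "Slightly more complicated extract request",
  datasets = datasets
)
```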


Data layout and file format


IPUMS NHGIS extract definitions also support additional options to modify the layout and format of the extract’s resulting data files.


For extracts that contain time series tables, the tst_layout argument indicates how the longitudinal data should be organized.


For extracts that contain datasets with multiple breakdowns or data types, use the breakdown_and_data_type_layout argument to specify a layout. This is most common for data sources that contain both estimates and margins of error, like the ACS.


File formats can be specified with the data_format argument. IPUMS NHGIS currently distributes files in CSV and fixed-width format.


See the documentation for define_extract_nhgis() for more details on these options.
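
As an illustration, the sketch below sets the layout and format for a time series table request. The specific values "time_by_row_layout" and "csv_header" are assumptions based on the option names used by the NHGIS API; confirm the accepted values in the define_extract_nhgis() documentation for your installed version.

```r
library(ipumsr)

# Time series table request with an explicit layout and file format.
# The layout and format strings are assumed values; check
# ?define_extract_nhgis for the options your version accepts.
tst_ext <- define_extract_nhgis(
  description = "Time series table with explicit layout",
  time_series_tables = tst_spec("CW3", geog_levels = "state"),
  tst_layout = "time_by_row_layout",
  data_format = "csv_header"
)

tst_ext$tst_layout
tst_ext$data_format
```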


Next steps


Once you have defined an extract request, you can submit the extract for processing:

-nhgis_ext_submitted <- submit_extract(nhgis_ext)

The workflow for submitting and monitoring an extract request and downloading its files when complete is described in the IPUMS API introduction.

- - - - -
- - - -

The IPUMS API provides two asset types, both of which are supported by ipumsr:

  • IPUMS extract endpoints can be used to submit extract requests for processing and download completed extract files.
  • IPUMS metadata endpoints can be used to discover and explore available IPUMS data as well as retrieve codes, names, and other extract parameters necessary to form extract requests.

Use of the IPUMS API enables the adoption of a programmatic workflow that can help users to:

  • Precisely recreate the specifications of previous extract requests, making analysis scripts reproducible and self-contained
  • Save extract request definitions that can be shared with others without violating IPUMS conditions
  • Integrate the extract download process with functions to load data into R
  • Quickly identify and explore available IPUMS data sources

The basic workflow for interacting with the IPUMS API is as follows:

  1. Define the parameters of an extract request
  2. Submit the extract request to the IPUMS API
  3. Wait for an extract to complete
  4. Download a completed extract

Before getting started, we’ll load the necessary packages for the examples in this vignette:

- -

API availability


IPUMS extract support is currently available via API for the following collections:

  • IPUMS USA
  • IPUMS CPS
  • IPUMS International
  • IPUMS NHGIS

Note that this support only includes data available via a collection’s extract engine. Many collections provide additional data via direct download, but these products are not supported by the IPUMS API.


IPUMS metadata support is currently available via API for the following collections:

  • -

API support will continue to be added for more collections in the future. You can check general API availability for all IPUMS collections with ipums_data_collections().

-#> # A tibble: 14 × 4
-#>    collection_name     collection_type code_for_api api_support
-#>    <chr>               <chr>           <chr>        <lgl>      
-#>  1 IPUMS USA           microdata       usa          TRUE       
-#>  2 IPUMS CPS           microdata       cps          TRUE       
-#>  3 IPUMS International microdata       ipumsi       TRUE       
-#>  4 IPUMS NHGIS         aggregate data  nhgis        TRUE       
-#>  5 IPUMS IHGIS         aggregate data  ihgis        FALSE      
-#>  6 IPUMS ATUS          microdata       atus         FALSE      
-#>  7 IPUMS AHTUS         microdata       ahtus        FALSE      
-#>  8 IPUMS MTUS          microdata       mtus         FALSE      
-#>  9 IPUMS DHS           microdata       dhs          FALSE      
-#> 10 IPUMS PMA           microdata       pma          FALSE      
-#> 11 IPUMS MICS          microdata       mics         FALSE      
-#> 12 IPUMS NHIS          microdata       nhis         FALSE      
-#> 13 IPUMS MEPS          microdata       meps         FALSE      
-#> 14 IPUMS Higher Ed     microdata       highered     FALSE
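
To list only the collections that currently report API support, the table can be subset on its api_support column. This is a small sketch using base R subsetting; it assumes the column names shown in the output above.

```r
library(ipumsr)

collections <- ipums_data_collections()

# Keep only the rows where `api_support` is TRUE
supported <- collections[collections$api_support, ]

# Codes that can be used to identify a collection in API functions
supported$code_for_api
```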

Note that the tools in ipumsr may not necessarily support all the functionality currently supported by the IPUMS API. See the API documentation for more information about its latest features.


Set up your API key


To interact with the IPUMS API, you’ll need to register for access with the IPUMS project you’ll be using. If you have not yet registered, you can find links to register for each of the API-supported IPUMS collections below:

- -

Once you’re registered, you’ll be able to create an API key.


By default, ipumsr API functions assume that your key is stored in the IPUMS_API_KEY environment variable. You can also provide your key directly to these functions, but storing it in an environment variable saves you some typing and helps prevent you from inadvertently sharing your key with others (for instance, on GitHub).


You can save your API key to the IPUMS_API_KEY environment variable with set_ipums_api_key(). To save your key for use in future sessions, set save = TRUE. This will add your API key to your .Renviron file in your user home directory.

-# Save key in .Renviron for use across sessions
-set_ipums_api_key("paste-your-key-here", save = TRUE)

The rest of this vignette assumes you have obtained an API key and stored it in the IPUMS_API_KEY environment variable.
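
Before running API code, it can help to confirm that the key is visible to the current R session. The helper below is a hypothetical convenience function (it is not part of ipumsr) that only checks whether the IPUMS_API_KEY environment variable is non-empty; it does not verify that the key is valid.

```r
# Hypothetical helper: TRUE if IPUMS_API_KEY is set to a non-empty value
has_ipums_api_key <- function() {
  nzchar(Sys.getenv("IPUMS_API_KEY"))
}

if (!has_ipums_api_key()) {
  message("IPUMS_API_KEY is not set; see set_ipums_api_key().")
}
```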


Define an extract request


Each IPUMS collection has its own extract definition function that is used to specify the parameters of a new extract request from scratch. These functions take the form define_extract_*():

- -

When you define an extract request, you can specify the data to be included in the extract and indicate the desired format and layout.


For instance, the following defines a simple IPUMS USA extract request for the AGE, SEX, RACE, STATEFIP, and MARST variables from the 2018 and 2019 American Community Survey (ACS):

-usa_ext_def <- define_extract_usa(
-  description = "USA extract for API vignette",
-  samples = c("us2018a", "us2019a"),
-  variables = c("AGE", "SEX", "RACE", "STATEFIP", "MARST")
-)
-usa_ext_def
-#> Unsubmitted IPUMS USA extract 
-#> Description: USA extract for API vignette
-#> Samples: (2 total) us2018a, us2019a
-#> Variables: (5 total) AGE, SEX, RACE, STATEFIP, MARST

The exact extract definition options vary across collections, but all collections can be used with the same general workflow. For more details on the available extract definition options, see the associated microdata and NHGIS vignettes.


For the purposes of demonstrating the overall workflow, we will continue to work with the sample IPUMS USA extract definition created above.


Extract request objects


define_extract_*() functions always produce an ipums_extract object, which can be handled by other API functions (see ?ipums_extract). Furthermore, these objects will have a subclass for the particular collection with which they are associated.

-class(usa_ext_def)
-#> [1] "usa_extract"   "micro_extract" "ipums_extract" "list"

Many of the specifications for a given extract request object can be accessed by indexing the object:

-names(usa_ext_def$samples)
-#> [1] "us2018a" "us2019a"
-names(usa_ext_def$variables)
-#> [1] "AGE"      "SEX"      "RACE"     "STATEFIP" "MARST"
-usa_ext_def$data_format
-#> [1] "fixed_width"

ipums_extract objects also contain information about the extract request’s processing status and its assigned extract number, which serves as an identifier for the extract request. Since this extract request is still unsubmitted, it has no request number:

-usa_ext_def$status
-#> [1] "unsubmitted"
-usa_ext_def$number
-#> [1] NA

To obtain the data requested in the extract definition, we must first submit it to the IPUMS API for processing.


Submit an extract request


To submit an extract definition, use submit_extract().


If no errors are detected in the extract definition, a submitted extract request will be returned with its assigned number and status. Storing the returned object can be useful for checking the extract request’s status later.

-usa_ext_submitted <- submit_extract(usa_ext_def)
-#> Successfully submitted IPUMS USA extract number 348

The extract number will be stored in the returned object:

-usa_ext_submitted$number
-#> [1] 348
-usa_ext_submitted$status
-#> [1] "queued"

Note that some fields of a submitted extract may be automatically updated by the API upon submission. For instance, for microdata extracts, additional preselected variables may be added to the extract even if they weren’t specified explicitly in the extract definition.

-names(usa_ext_submitted$variables)
-#>  [1] "YEAR"     "SAMPLE"   "SERIAL"   "CBSERIAL" "HHWT"     "CLUSTER" 
-#>  [7] "STATEFIP" "STRATA"   "GQ"       "PERNUM"   "PERWT"    "SEX"     
-#> [13] "AGE"      "MARST"    "RACE"

If you forget to store the updated extract object returned by submit_extract(), you can use the get_last_extract_info() helper to request the information for your most recent extract request for a given collection:

-usa_ext_submitted <- get_last_extract_info("usa")
-usa_ext_submitted$number
-#> [1] 348

Wait for an extract request to complete


It may take some time for the IPUMS servers to process your extract request. You can ensure that an extract has finished processing before you attempt to download its files by using wait_for_extract(). This polls the API regularly until processing has completed (by default, each interval increases by 10 seconds). It then returns an ipums_extract object containing the completed extract definition.

-usa_ext_complete <- wait_for_extract(usa_ext_submitted)
-#> Checking extract status...
-#> Waiting 10 seconds...
-#> Checking extract status...
-#> IPUMS USA extract 348 is ready to download.
-usa_ext_complete$status
-#> [1] "completed"
-# `download_links` should be populated if the extract is ready for download
-names(usa_ext_complete$download_links)
-#> [1] "r_command_file"     "basic_codebook"     "data"              
-#> [4] "stata_command_file" "sas_command_file"   "spss_command_file" 
-#> [7] "ddi_codebook"

Note that wait_for_extract() will tie up your R session until your extract is ready to download. While this is fine in a strictly programmatic workflow, it may be frustrating when working interactively, especially for large extracts or when the IPUMS servers are busy.
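
To make the polling behavior concrete, here is a generic base R sketch of the pattern: check, wait, lengthen the wait, and repeat until ready or until a time limit is reached. It is not the implementation used by wait_for_extract(); the function poll_until_ready() and its arguments are hypothetical, and the readiness check is mocked so the example runs without an API key.

```r
# Hypothetical polling helper (not part of ipumsr).
# `is_ready` is any zero-argument function returning TRUE or FALSE.
# `sleep` is injectable so the loop can be exercised without real waiting.
poll_until_ready <- function(is_ready,
                             initial_delay = 10,
                             increment = 10,
                             timeout = 3600,
                             sleep = Sys.sleep) {
  delay <- initial_delay
  waited <- 0

  repeat {
    if (is_ready()) {
      return(TRUE)
    }
    # Give up if the next wait would exceed the overall time limit
    if (waited + delay > timeout) {
      return(FALSE)
    }
    sleep(delay)
    waited <- waited + delay
    delay <- delay + increment # each interval grows by a fixed step
  }
}

# Mock readiness check that succeeds on the third call
checks <- 0
mock_is_ready <- function() {
  checks <<- checks + 1
  checks >= 3
}

ready <- poll_until_ready(mock_is_ready, sleep = function(seconds) invisible(NULL))
```

In a real session, the readiness check could be a wrapper such as function() is_extract_ready(usa_ext_submitted), which is the manual check described next.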


In these cases, you can manually check whether an extract is ready for download with is_extract_ready(). As long as this returns TRUE, you should be able to download your extract’s files.

-is_extract_ready(usa_ext_submitted)
-#> [1] TRUE

For a more detailed status check, provide the extract’s collection and number to get_extract_info(). This returns an ipums_extract object reflecting the requested extract definition with the most current status. The status of a submitted extract will be one of "queued", "started", "produced", "canceled", "failed", or "completed".

-usa_ext_submitted <- get_extract_info(usa_ext_submitted)
-usa_ext_submitted$status
-#> [1] "completed"

Note that extracts are removed from the IPUMS servers after a set period of time (72 hours for microdata collections, 2 weeks for IPUMS NHGIS). Therefore, an extract that has a "completed" status may still be unavailable for download.


is_extract_ready() will alert you if the extract has expired and needs to be resubmitted. Simply use submit_extract() to resubmit an extract request. Note that this will produce a new extract (with a new extract number), even if the extract definition is identical.


Download an extract


Once your extract has finished processing, use download_extract() to download the extract’s data files to your local machine. This will return the path to the downloaded file(s) required to load the data into R.


For microdata collections, this will be the path to the DDI codebook (.xml) file, which can be used to read the associated data (contained in a .dat.gz file).


For NHGIS, this will be a path to the .zip archive containing the requested data files and/or shapefiles.

-# By default, downloads to your current working directory
-filepath <- download_extract(usa_ext_submitted)

The files produced by download_extract() can be passed directly into the reader functions provided by ipumsr. For instance, for microdata projects:

-ddi <- read_ipums_ddi(filepath)
-micro_data <- read_ipums_micro(ddi)

If instead you’re working with an NHGIS extract, use read_nhgis() or read_ipums_sf().


See the associated vignette for more information about loading IPUMS data into R.


Get info on past extracts


To retrieve the definition corresponding to a particular extract, provide its collection and number to get_extract_info(). These can be provided either as a single string of the form "collection:number" or as a length-2 vector: c(collection, number). Several other API functions support this syntax as well.
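
The two identifier forms carry the same information. The sketch below shows one way they could be normalized into a collection and a number; parse_extract_id() is a hypothetical helper written for illustration and is not how ipumsr parses identifiers internally.

```r
# Hypothetical helper: accept "collection:number" or c(collection, number)
# and return the two parts in a consistent structure
parse_extract_id <- function(extract) {
  if (length(extract) == 1) {
    parts <- strsplit(as.character(extract), ":", fixed = TRUE)[[1]]
  } else {
    # c("usa", 47) is coerced to a character vector: c("usa", "47")
    parts <- as.character(extract)
  }

  stopifnot(length(parts) == 2)

  list(collection = parts[1], number = as.integer(parts[2]))
}

from_string <- parse_extract_id("usa:47")
from_vector <- parse_extract_id(c("usa", 47))
```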

-usa_ext <- get_extract_info("usa:47")
-# Alternatively:
-usa_ext <- get_extract_info(c("usa", 47))
-usa_ext
-#> Submitted IPUMS USA extract number 47
-#> Description: Test extract
-#> Samples: (1 total) us2017b

If you know you made a specific extract definition in the past, but you can’t remember the exact number, you can use get_extract_history() to peruse your recent extract requests for a particular collection.


By default, this returns your 10 most recent extract requests as a list of ipums_extract objects. You can adjust how many requests to retrieve with the how_many argument:

-usa_extracts <- get_extract_history("usa", how_many = 3)
-usa_extracts
-#> [[1]]
-#> Submitted IPUMS USA extract number 348
-#> Description: USA extract for API vignette
-#> Samples: (2 total) us2018a, us2019a
-#> Variables: (15 total) YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, CLUSTER,...
-#> [[2]]
-#> Submitted IPUMS USA extract number 347
-#> Description: Data from long ago
-#> Samples: (1 total) us1880a
-#> Variables: (12 total) YEAR, SAMPLE, SERIAL, HHWT, CLUSTER, STRATA, G...
-#> [[3]]
-#> Submitted IPUMS USA extract number 346
-#> Description: Data from 2017 PRCS
-#> Samples: (1 total) us2017b
-#> Variables: (9 total) YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, GQ, PERNU...

Because this is a list of ipums_extract objects, you can operate on them with the API functions that have been introduced already.

-#> [1] TRUE

You can also iterate through your extract history to find extracts with particular characteristics. For instance, we can use purrr::keep() to find all extracts that contain a certain variable or are ready for download:

-purrr::keep(usa_extracts, ~ "MARST" %in% names(.x$variables))
-#> [[1]]
-#> Submitted IPUMS USA extract number 348
-#> Description: USA extract for API vignette
-#> Samples: (2 total) us2018a, us2019a
-#> Variables: (15 total) YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, CLUSTER,...
-purrr::keep(usa_extracts, is_extract_ready)
-#> [[1]]
-#> Submitted IPUMS USA extract number 348
-#> Description: USA extract for API vignette
-#> Samples: (2 total) us2018a, us2019a
-#> Variables: (15 total) YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, CLUSTER,...
-#> [[2]]
-#> Submitted IPUMS USA extract number 347
-#> Description: Data from long ago
-#> Samples: (1 total) us1880a
-#> Variables: (12 total) YEAR, SAMPLE, SERIAL, HHWT, CLUSTER, STRATA, G...
-#> [[3]]
-#> Submitted IPUMS USA extract number 346
-#> Description: Data from 2017 PRCS
-#> Samples: (1 total) us2017b
-#> Variables: (9 total) YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, GQ, PERNU...

Or we can use the purrr::map() family to browse certain values:

-purrr::map_chr(usa_extracts, ~ .x$description)
-#> [1] "USA extract for API vignette" "Data from long ago"          
-#> [3] "Data from 2017 PRCS"
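
The same filtering and browsing can be done with base R's Filter() and vapply(). The sketch below uses a mock history made of plain lists so that it runs without API access; the mock objects are stand-ins that only mimic the description and variables fields used above, not real ipums_extract objects.

```r
# Mock stand-ins for extract history entries (not real ipums_extract objects)
mock_history <- list(
  list(
    number = 348,
    description = "USA extract for API vignette",
    variables = list(AGE = list(), SEX = list(), MARST = list())
  ),
  list(
    number = 347,
    description = "Data from long ago",
    variables = list(YEAR = list(), SERIAL = list())
  ),
  list(
    number = 346,
    description = "Data from 2017 PRCS",
    variables = list(YEAR = list(), GQ = list())
  )
)

# Base R equivalent of purrr::keep(): retain entries containing MARST
with_marst <- Filter(
  function(ext) "MARST" %in% names(ext$variables),
  mock_history
)

# Base R equivalent of purrr::map_chr(): pull out each description
descriptions <- vapply(mock_history, function(ext) ext$description, character(1))
```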

If you regularly use only a single IPUMS collection, you can save yourself some typing by setting that collection as your default. set_ipums_default_collection() will save a specified collection to the value of the IPUMS_DEFAULT_COLLECTION environment variable. If you have a default collection set, API functions will use that collection in all requests, assuming no other collection is specified.

-set_ipums_default_collection("usa") # Set `save = TRUE` to store across sessions
-# Check the default collection:
-Sys.getenv("IPUMS_DEFAULT_COLLECTION")
-#> [1] "usa"
-# Most recent USA extract:
-usa_last <- get_last_extract_info()
-# Request info on extract request "usa:10"
-usa_ext_10 <- get_extract_info(10)
-# You can still request other collections as usual:
-cps_ext_10 <- get_extract_info("cps:10")

Share an extract definition


One exciting feature enabled by the IPUMS API is the ability to share a standardized extract definition with other IPUMS users so that they can create an identical extract request themselves. The terms of use for most IPUMS collections prohibit the public redistribution of IPUMS data, but don’t prohibit the sharing of data extract definitions.


ipumsr facilitates this type of sharing with save_extract_as_json() and define_extract_from_json(), which write ipums_extract objects to, and read them from, a standardized JSON-formatted file.

-usa_ext_10 <- get_extract_info("usa:10")
-save_extract_as_json(usa_ext_10, file = "usa_extract_10.json")

At this point, you can send usa_extract_10.json to another user to allow them to create a duplicate ipums_extract object, which they can load and submit to the API themselves (as long as they have API access).

-clone_of_usa_ext_10 <- define_extract_from_json("usa_extract_10.json")
-usa_ext_10_resubmitted <- submit_extract(clone_of_usa_ext_10)

Note that the code in the previous chunk assumes that the file is saved in the current working directory. If it’s saved somewhere else, replace "usa_extract_10.json" with the full path to the file.
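
The round trip can also be tried locally without submitting anything. This sketch assumes that an unsubmitted definition can be written to JSON in the same way as a submitted one; it saves to a temporary file and checks that the restored definition lists the same variables.

```r
library(ipumsr)

original <- define_extract_usa(
  description = "USA extract for API vignette",
  samples = c("us2018a", "us2019a"),
  variables = c("AGE", "SEX", "RACE", "STATEFIP", "MARST")
)

# Write to a temporary file rather than the working directory
json_path <- tempfile(fileext = ".json")
save_extract_as_json(original, file = json_path)

# Recreate the definition from the file, as a collaborator would
restored <- define_extract_from_json(json_path)

same_variables <- identical(
  names(original$variables),
  names(restored$variables)
)
```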


Revise a previous extract request


Occasionally, you may want to modify an existing extract definition (e.g. to update an analysis with new data). The easiest way to do so is to add the new specifications to the define_extract_*() code that produced the original extract definition. This is why we highly recommend that you save this code somewhere where it can be accessed and updated in the future.


However, there are cases where the original extract definition code does not exist (e.g. if the extract was created using the online IPUMS extract system). In this case, the best approach is to view the extract definition with get_extract_info() and create a new extract definition (using a define_extract_*() function) that reproduces that definition along with the desired modifications. While this may be a bit tedious for complex extract definitions, it is a one-time investment that will make any future updates to the extract definition much easier.


Previously, we encouraged users to use the helpers add_to_extract() and remove_from_extract() when modifying extracts. We now encourage you to rewrite extract definitions instead, because doing so improves reproducibility: extract definition code will always be clearer and more stable if it is written explicitly, rather than based only on an old extract number. These two functions may be retired in the future.


Putting it all together


The core API functions in ipumsr are compatible with one another such that they can be combined into a single pipeline that requests, downloads, and reads your extract data into an R data frame:

-usa_data <- define_extract_usa(
-  "USA extract for API vignette",
-  c("us2018a", "us2019a"),
-  c("AGE", "SEX", "RACE", "STATEFIP")
-) %>%
-  submit_extract() %>%
-  wait_for_extract() %>%
-  download_extract() %>%
-  read_ipums_micro()

Note that for NHGIS extracts that contain both data and shapefiles, a single file will need to be selected before reading, as download_extract() will return the path to each file. For instance, for a hypothetical nhgis_extract that contains both tabular and spatial data:

-nhgis_data <- download_extract(nhgis_extract) %>%
-  purrr::pluck("data") %>% # Select only the tabular data file to read
-  read_nhgis()

Not only does this API workflow allow you to obtain IPUMS data without ever leaving your R environment, but it also allows you to retain a reproducible record of your process. This makes it much easier to document your workflow, collaborate with other researchers, and update your analysis in the future.

- - - - -
- - - -

Browsing for IPUMS data can be a little like grocery shopping when you’re hungry: you show up to grab a couple things, but everything looks so good that you end up with an overflowing cart.1 Unfortunately, this can lead to extracts so large that they don’t fit in your computer’s memory.


If you’ve got an extract that’s too big, both the IPUMS website and the ipumsr package have tools to help. There are four basic strategies:

  1. Get more memory
  2. Reduce the size of your extract
  3. Read data in “chunks” or “yields”
  4. Store data in a database

ipumsr can’t do much for you when it comes to option 1, but it can help facilitate some of the other options.


Setup


The examples in this vignette will rely on a few helpful packages. If you haven’t already installed them, you can do so with:

-# To run the full vignette, you'll also need the following packages. If they
-# aren't installed already, do so with:
- - - -

Option 1: Trade money for convenience


If you need to work with a dataset that’s too big for your RAM, the simplest option is to get more space. If upgrading your hardware isn’t an option, paying for a cloud service like Amazon or Microsoft Azure may be worth considering. Here are guides for using R on Amazon and Microsoft Azure.


Of course, this option isn’t feasible for most users. In that case, you may need to change either the data used in the analysis or the processing pipeline.


Option 2: Reduce extract size


Remove unused data


The easiest way to reduce the size of your extract is to drop unused samples and variables. This can be done through the extract interface for the specific IPUMS project you’re using or within R using the IPUMS API (for projects that are supported).


If using the API, simply update your extract definition code to exclude the specifications that you no longer need. Then, resubmit the extract request and download the new files.


See the introduction to the IPUMS API for more information about making extract requests from ipumsr.


Select cases


For microdata projects, another good option for reducing extract size is to select only those cases that are relevant to your research question, producing an extract containing only data for a particular subset of values for a given variable.


If you’re using the IPUMS API, you can use var_spec() to specify case selections for a variable in an extract definition. For instance, the following would produce an extract only including records for married women:

-define_extract_usa(
-  description = "2013 ACS Data for Married Women",
-  samples = "us2013a",
-  variables = list(
-    var_spec("MARST", case_selections = "1"),
-    var_spec("SEX", case_selections = "2")
-  )
-)
-#> Unsubmitted IPUMS USA extract 
-#> Description: 2013 ACS Data for Married Women
-#> Samples: (1 total) us2013a
-#> Variables: (2 total) MARST, SEX

If you’re using the online interface, the Select Cases option will be available on the last page before submitting an extract request.


Use a sampled subset of the data


Yet another option (also only for microdata projects) is to take a random subsample of the data before producing your extract.


Sampled data is not available via the IPUMS API, but you can use the Customize Sample Size option in the online interface to do so. This also appears on the final page before submitting an extract request.


If you’ve already submitted the extract, you can click the REVISE link on the Download or Revise Extracts page to access these features and produce a new data extract.


Option 3: Process the data in pieces


ipumsr provides two related options for reading data sources in increments:

  • Chunked functions allow you to specify a function that will be called on each chunk of data as it is read in as well as how you would like the chunks to be combined at the end. These functions use the readr framework for reading chunked data.
  • Yielded functions allow more flexibility by returning control to the user between the loading of each piece of data. These functions are unique to ipumsr and fixed-width data.

Reading chunked data


Use read_ipums_micro_chunked() and read_ipums_micro_list_chunked() to read data in chunks. These are analogous to the standard read_ipums_micro() and read_ipums_micro_list() functions, but allow you to specify a function that will be applied to each data chunk and control how the results from these chunks are combined.


Below, we’ll use chunking to outline solutions to three common use-cases for IPUMS data: tabulation, regression and case selection.


First, we’ll load our example data. Note that we have down-sampled the data in this example for storage reasons; none of the output “results” reflected in this vignette should be considered legitimate!

-cps_ddi_file <- ipums_example("cps_00097.xml")

Chunked tabulation


Imagine we wanted to find the percent of people in the workforce grouped by their self-reported health. Since our example extract is small enough to fit in memory, we could load the full dataset with read_ipums_micro(), use lbl_relabel() to relabel the EMPSTAT variable into a binary variable, and count the people in each group.

-read_ipums_micro(cps_ddi_file, verbose = FALSE) %>%
-  mutate(
-    HEALTH = as_factor(HEALTH),
-    AT_WORK = as_factor(
-      lbl_relabel(
-        EMPSTAT,
-        lbl(1, "Yes") ~ .lbl == "At work",
-        lbl(0, "No") ~ .lbl != "At work"
-      )
-    )
-  ) %>%
-  group_by(HEALTH, AT_WORK) %>%
-  summarize(n = n(), .groups = "drop")
-#> # A tibble: 10 × 3
-#>    HEALTH    AT_WORK     n
-#>    <fct>     <fct>   <int>
-#>  1 Excellent No       4055
-#>  2 Excellent Yes      2900
-#>  3 Very good No       3133
-#>  4 Very good Yes      3371
-#>  5 Good      No       2480
-#>  6 Good      Yes      2178
-#>  7 Fair      No       1123
-#>  8 Fair      Yes       443
-#>  9 Poor      No        603
-#> 10 Poor      Yes        65

For the sake of this example, let’s imagine we can only store 1,000 rows in memory at a time. In this case, we need to use a chunked function, tabulate for each chunk, and then calculate the counts across all of the chunks.


The chunked functions will apply a user-defined callback function to each chunk. The callback takes two arguments: x, which represents the data contained in a given chunk, and pos, which represents the position of the chunk, expressed as the line in the input file at which the chunk starts. Generally you will only need to use x, but the callback must always take both arguments.
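
To see what the two callback arguments contain, the sketch below uses a callback that ignores the data values and records only the starting position and row count of each chunk. It uses IpumsListCallback (described later in this section) to collect one result per chunk; the function name describe_chunk is illustrative.

```r
library(ipumsr)

cps_ddi_file <- ipums_example("cps_00097.xml")

# Callback that reports where each chunk starts (`pos`) and how many
# rows it holds (`x`), without keeping the data itself
describe_chunk <- function(x, pos) {
  data.frame(start = pos, rows = nrow(x))
}

chunk_list <- read_ipums_micro_chunked(
  cps_ddi_file,
  callback = IpumsListCallback$new(describe_chunk),
  chunk_size = 1000,
  verbose = FALSE
)

# One row per chunk
chunk_info <- do.call(rbind, chunk_list)
head(chunk_info)
```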


In this case, the callback will implement the same processing steps that we demonstrated above:

-cb_function <- function(x, pos) {
-  x %>%
-    mutate(
-      HEALTH = as_factor(HEALTH),
-      AT_WORK = as_factor(
-        lbl_relabel(
-          EMPSTAT,
-          lbl(1, "Yes") ~ .lbl == "At work",
-          lbl(0, "No") ~ .lbl != "At work"
-        )
-      )
-    ) %>%
-    group_by(HEALTH, AT_WORK) %>%
-    summarize(n = n(), .groups = "drop")
-}

Next, we need to create a callback object, which will determine how we want to combine the ultimate results for each chunk. ipumsr provides three main types of callback objects that preserve variable metadata:

  • IpumsDataFrameCallback combines the results from each chunk by row binding them together.
  • IpumsListCallback returns a list with one item per chunk containing the results for that chunk. Use this when you don’t want to (or can’t) immediately combine the results.
  • IpumsSideEffectCallback does not return any results. Use this when your callback function is intended only for its side effects (for instance, if you are saving the results for each chunk to disk).

(ipumsr also provides a fourth callback used for running linear regression models, discussed below.)
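
As an example of the side-effect case, the sketch below writes each chunk to its own file in a temporary directory instead of returning it. The file naming scheme and the use of saveRDS() are choices made for this illustration, not a convention of ipumsr.

```r
library(ipumsr)

cps_ddi_file <- ipums_example("cps_00097.xml")

# Directory to hold one file per chunk
out_dir <- tempfile("cps_chunks_")
dir.create(out_dir)

# Callback used only for its side effect: saving the chunk to disk.
# The starting position makes each file name unique.
save_chunk <- function(x, pos) {
  file_name <- sprintf("chunk_%07d.rds", as.integer(pos))
  saveRDS(x, file.path(out_dir, file_name))
}

read_ipums_micro_chunked(
  cps_ddi_file,
  callback = IpumsSideEffectCallback$new(save_chunk),
  chunk_size = 1000,
  verbose = FALSE
)

chunk_files <- list.files(out_dir, full.names = TRUE)
length(chunk_files)
```

Each saved file can later be read back with readRDS() and processed on its own, so that only one chunk needs to be in memory at a time.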


In this case, we want to row-bind the data frames returned by cb_function(), so we use IpumsDataFrameCallback.


Callback objects are R6 objects, but you don’t need to be familiar with R6 to use them.2 To initialize a callback object, simply use $new():

-cb <- IpumsDataFrameCallback$new(cb_function)

At this point, we’re ready to load the data in chunks. We use read_ipums_micro_chunked() to specify the callback and chunk size:

-chunked_tabulations <- read_ipums_micro_chunked(
-  cps_ddi_file,
-  callback = cb,
-  chunk_size = 1000,
-  verbose = FALSE
-)
-chunked_tabulations
-#> # A tibble: 209 × 3
-#>    HEALTH    AT_WORK     n
-#>    <fct>     <fct>   <int>
-#>  1 Excellent No        183
-#>  2 Excellent Yes       147
-#>  3 Very good No        134
-#>  4 Very good Yes       217
-#>  5 Good      No        111
-#>  6 Good      Yes       105
-#>  7 Fair      No         53
-#>  8 Fair      Yes        22
-#>  9 Poor      No         27
-#> 10 Poor      Yes         1
-#> # ℹ 199 more rows

Now we have a data frame with the counts by health and work status within each chunk. To get the full table, we just need to sum by health and work status one more time:

-chunked_tabulations %>%
-  group_by(HEALTH, AT_WORK) %>%
-  summarize(n = sum(n), .groups = "drop")
-#> # A tibble: 10 × 3
-#>    HEALTH    AT_WORK     n
-#>    <fct>     <fct>   <int>
-#>  1 Excellent No       4055
-#>  2 Excellent Yes      2900
-#>  3 Very good No       3133
-#>  4 Very good Yes      3371
-#>  5 Good      No       2480
-#>  6 Good      Yes      2178
-#>  7 Fair      No       1123
-#>  8 Fair      Yes       443
-#>  9 Poor      No        603
-#> 10 Poor      Yes        65

Chunked regression


With the biglm package, it is possible to use R to perform a regression on data that is too large to store in memory all at once. The ipumsr package provides another callback designed to make this simple: IpumsBiglmCallback.


In this example, we’ll conduct a regression with total hours worked (AHRSWORKT) as the outcome and age (AGE) and self-reported health (HEALTH) as predictors. (Note that this is intended as a code demonstration, so we ignore many complexities that should be addressed in real analyses.)


If we were running the analysis on our full dataset, we’d first load our data and prepare the variables in our analysis for use in the model:

-data <- read_ipums_micro(cps_ddi_file, verbose = FALSE) %>%
-  mutate(
-    HEALTH = as_factor(HEALTH),
-    AHRSWORKT = lbl_na_if(AHRSWORKT, ~ .lbl == "NIU (Not in universe)"),
-    AT_WORK = as_factor(
-      lbl_relabel(
-        EMPSTAT,
-        lbl(1, "Yes") ~ .lbl == "At work",
-        lbl(0, "No") ~ .lbl != "At work"
-      )
-    )
-  ) %>%
-  filter(AT_WORK == "Yes")

Then, we’d provide our model formula and data to lm:

-model <- lm(AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, data = data)
-summary(model)
-#> Call:
-#> lm(formula = AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, data = data)
-#> Residuals:
-#>     Min      1Q  Median      3Q     Max 
-#> -41.217  -4.734  -0.077   5.957  63.994 
-#> Coefficients:
-#>                   Estimate Std. Error t value Pr(>|t|)    
-#> (Intercept)      5.2440289  1.1823985   4.435 9.31e-06 ***
-#> AGE              1.5868169  0.0573268  27.680  < 2e-16 ***
-#> I(AGE^2)        -0.0170043  0.0006568 -25.888  < 2e-16 ***
-#> HEALTHVery good -0.2550306  0.3276759  -0.778 0.436412    
-#> HEALTHGood      -0.9637395  0.3704123  -2.602 0.009289 ** 
-#> HEALTHFair      -3.8899430  0.6629725  -5.867 4.58e-09 ***
-#> HEALTHPoor      -5.7597200  1.6197136  -3.556 0.000378 ***
-#> ---
-#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-#> Residual standard error: 12.88 on 8950 degrees of freedom
-#> Multiple R-squared:  0.08711,    Adjusted R-squared:  0.0865 
-#> F-statistic: 142.3 on 6 and 8950 DF,  p-value: < 2.2e-16

To do the same regression, but with only 1,000 rows loaded at a time, we work in a similar manner.


First we make an IpumsBiglmCallback callback object. We provide the model formula as well as the code used to process the data before running the regression:

-library(biglm)
-#> Loading required package: DBI
-biglm_cb <- IpumsBiglmCallback$new(
-  model = AHRSWORKT ~ AGE + I(AGE^2) + HEALTH,
-  prep = function(x, pos) {
-    x %>%
-      mutate(
-        HEALTH = as_factor(HEALTH),
-        AHRSWORKT = lbl_na_if(AHRSWORKT, ~ .lbl == "NIU (Not in universe)"),
-        AT_WORK = as_factor(
-          lbl_relabel(
-            EMPSTAT,
-            lbl(1, "Yes") ~ .lbl == "At work",
-            lbl(0, "No") ~ .lbl != "At work"
-          )
-        )
-      ) %>%
-      filter(AT_WORK == "Yes")
-  }
-)

And then we read the data using read_ipums_micro_chunked(), passing the callback that we just made.

-chunked_model <- read_ipums_micro_chunked(
-  cps_ddi_file,
-  callback = biglm_cb,
-  chunk_size = 1000,
-  verbose = FALSE
-)
-summary(chunked_model)
-#> Large data regression model: biglm(AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, data, ...)
-#> Sample size =  8957 
-#>                    Coef    (95%     CI)     SE      p
-#> (Intercept)      5.2440  2.8792  7.6088 1.1824 0.0000
-#> AGE              1.5868  1.4722  1.7015 0.0573 0.0000
-#> I(AGE^2)        -0.0170 -0.0183 -0.0157 0.0007 0.0000
-#> HEALTHVery good -0.2550 -0.9104  0.4003 0.3277 0.4364
-#> HEALTHGood      -0.9637 -1.7046 -0.2229 0.3704 0.0093
-#> HEALTHFair      -3.8899 -5.2159 -2.5640 0.6630 0.0000
-#> HEALTHPoor      -5.7597 -8.9991 -2.5203 1.6197 0.0004

Reading yielded data


In addition to chunked reading, ipumsr also provides the similar but more flexible “yielded” reading.


read_ipums_micro_yield() and read_ipums_micro_list_yield() grant you more freedom in determining what R code to run between chunks and include the ability to have multiple files open at once. Additionally, yields are compatible with the bigglm function from biglm, which allows you to run glm models on data larger than memory.


The downside to this greater control is that yields have an API that is unique to IPUMS data and the way they work is unusual for R code.


Yielded tabulation


We’ll compare the yield and chunked functions by conducting the same tabulation example from above using yields.


First, we create the yield object with the function read_ipums_micro_yield():

-data <- read_ipums_micro_yield(cps_ddi_file, verbose = FALSE)

This function returns an R6 object which contains methods for reading the data. The most important method is the yield() method which will return n rows of data:

-# Return the first 10 rows of data
-data$yield(n = 10)
-#> # A tibble: 10 × 14
-#>    <dbl>  <dbl> <int+lb>   <dbl> <int+lb>   <dbl> <int+lb>  <dbl>   <dbl>  <dbl>
-#>  1  2011     33 3 [Marc… 2.01e13 1 [ASEC]    308. 1 [No]        1 2.01e13   308.
-#>  2  2011     33 3 [Marc… 2.01e13 1 [ASEC]    308. 1 [No]        2 2.01e13   217.
-#>  3  2011     33 3 [Marc… 2.01e13 1 [ASEC]    308. 1 [No]        3 2.01e13   249.
-#>  4  2011     46 3 [Marc… 2.01e13 1 [ASEC]    266. 1 [No]        1 2.01e13   266.
-#>  5  2011     46 3 [Marc… 2.01e13 1 [ASEC]    266. 1 [No]        2 2.01e13   266.
-#>  6  2011     46 3 [Marc… 2.01e13 1 [ASEC]    266. 1 [No]        3 2.01e13   265.
-#>  7  2011     46 3 [Marc… 2.01e13 1 [ASEC]    266. 1 [No]        4 2.01e13   296.
-#>  8  2011     64 3 [Marc… 2.01e13 1 [ASEC]    241. 1 [No]        1 2.01e13   241.
-#>  9  2011     64 3 [Marc… 2.01e13 1 [ASEC]    241. 1 [No]        2 2.01e13   241.
-#> 10  2011     64 3 [Marc… 2.01e13 1 [ASEC]    241. 1 [No]        3 2.01e13   278.
-#> # ℹ 4 more variables: AGE <int+lbl>, EMPSTAT <int+lbl>, AHRSWORKT <dbl+lbl>,
-#> #   HEALTH <int+lbl>

Note that the row position in the data is stored in the object, so running the same code again will produce different rows of data:

-# Return the next 10 rows of data
-data$yield(n = 10)
-#> # A tibble: 10 × 14
-#>    <dbl>  <dbl> <int+lb>   <dbl> <int+lb>   <dbl> <int+lb>  <dbl>   <dbl>  <dbl>
-#>  1  2011     82 3 [Marc… 0       1 [ASEC]    373. 1 [No]        1 0         373.
-#>  2  2011     82 3 [Marc… 0       1 [ASEC]    373. 1 [No]        2 0         373.
-#>  3  2011     82 3 [Marc… 0       1 [ASEC]    373. 1 [No]        3 0         326.
-#>  4  2011     86 3 [Marc… 2.01e13 1 [ASEC]    554. 1 [No]        1 2.01e13   554.
-#>  5  2011    104 3 [Marc… 2.01e13 1 [ASEC]    543. 1 [No]        1 2.01e13   543.
-#>  6  2011    104 3 [Marc… 2.01e13 1 [ASEC]    543. 1 [No]        2 2.01e13   543.
-#>  7  2011    106 3 [Marc… 2.01e13 1 [ASEC]    543. 1 [No]        1 2.01e13   543.
-#>  8  2011    137 3 [Marc… 2.01e13 1 [ASEC]    271. 1 [No]        1 2.01e13   271.
-#>  9  2011    137 3 [Marc… 2.01e13 1 [ASEC]    271. 1 [No]        2 2.01e13   271.
-#> 10  2011    137 3 [Marc… 2.01e13 1 [ASEC]    271. 1 [No]        3 2.01e13   365.
-#> # ℹ 4 more variables: AGE <int+lbl>, EMPSTAT <int+lbl>, AHRSWORKT <dbl+lbl>,
-#> #   HEALTH <int+lbl>

Use cur_pos to get the current position in the data file:

-data$cur_pos
-#> [1] 21

The is_done() method tells us whether we have read the entire file yet:

-data$is_done()
-#> [1] FALSE

In preparation for our actual example, we’ll use reset() to reset to the beginning of the data:

-data$reset()

Using yield() and is_done(), we can set up our processing pipeline. First, we create an empty placeholder tibble to store our results:

-yield_results <- tibble(
-  HEALTH = factor(levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
-  AT_WORK = factor(levels = c("No", "Yes")),
-  n = integer(0)
-)

Then, we iterate through the data, yielding 1,000 rows at a time and processing the results as we did in the chunked example. The iteration will end when we’ve finished reading the entire file.

-while (!data$is_done()) {
-  # Yield new data and process
-  new <- data$yield(n = 1000) %>%
-    mutate(
-      HEALTH = as_factor(HEALTH),
-      AT_WORK = as_factor(
-        lbl_relabel(
-          EMPSTAT,
-          lbl(1, "Yes") ~ .lbl == "At work",
-          lbl(0, "No") ~ .lbl != "At work"
-        )
-      )
-    ) %>%
-    group_by(HEALTH, AT_WORK) %>%
-    summarize(n = n(), .groups = "drop")
-  # Combine the new yield with the previously processed yields
-  yield_results <- bind_rows(yield_results, new) %>%
-    group_by(HEALTH, AT_WORK) %>%
-    summarize(n = sum(n), .groups = "drop")
-}
-yield_results
-#> # A tibble: 10 × 3
-#>    HEALTH    AT_WORK     n
-#>    <fct>     <fct>   <int>
-#>  1 Excellent No       4055
-#>  2 Excellent Yes      2900
-#>  3 Very good No       3133
-#>  4 Very good Yes      3371
-#>  5 Good      No       2480
-#>  6 Good      Yes      2178
-#>  7 Fair      No       1123
-#>  8 Fair      Yes       443
-#>  9 Poor      No        603
-#> 10 Poor      Yes        65

Yielded GLM regression


One of the major benefits of yielded reading over chunked reading is that it is compatible with the GLM functions from biglm, allowing for the use of more complicated models.


To run a logistic regression, we first need to reset our yield object from the previous example:

-data$reset()

Next we make a function that takes a single argument: reset. When reset is TRUE, it resets the data to the beginning. This is dictated by bigglm from biglm.


To create this function, we use the reset() method from the yield object:

-get_model_data <- function(reset) {
-  if (reset) {
-    data$reset()
-  } else {
-    yield <- data$yield(n = 1000)
-    if (is.null(yield)) {
-      return(yield)
-    }
-    yield %>%
-      mutate(
-        HEALTH = as_factor(HEALTH),
-        WORK30PLUS = lbl_na_if(AHRSWORKT, ~ .lbl == "NIU (Not in universe)") >= 30,
-        AT_WORK = as_factor(
-          lbl_relabel(
-            EMPSTAT,
-            lbl(1, "Yes") ~ .lbl == "At work",
-            lbl(0, "No") ~ .lbl != "At work"
-          )
-        )
-      ) %>%
-      filter(AT_WORK == "Yes")
-  }
-}

Finally we feed this function and a model specification to the bigglm() function:

-results <- bigglm(
-  WORK30PLUS ~ AGE + I(AGE^2) + HEALTH,
-  family = binomial(link = "logit"),
-  data = get_model_data
-)
-summary(results)
-#> Large data regression model: bigglm(WORK30PLUS ~ AGE + I(AGE^2) + HEALTH, family = binomial(link = "logit"), 
-#>     data = get_model_data)
-#> Sample size =  8957 
-#>                    Coef    (95%     CI)     SE      p
-#> (Intercept)     -4.0021 -4.4297 -3.5744 0.2138 0.0000
-#> AGE              0.2714  0.2498  0.2930 0.0108 0.0000
-#> I(AGE^2)        -0.0029 -0.0032 -0.0027 0.0001 0.0000
-#> HEALTHVery good  0.0038 -0.1346  0.1423 0.0692 0.9557
-#> HEALTHGood      -0.1129 -0.2685  0.0426 0.0778 0.1465
-#> HEALTHFair      -0.6637 -0.9160 -0.4115 0.1261 0.0000
-#> HEALTHPoor      -0.7879 -1.3697 -0.2062 0.2909 0.0068
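The reset-function contract that bigglm() expects can also be tried without any IPUMS files. The sketch below is an illustration under stated assumptions: it uses the built-in mtcars data, and make_chunk_reader is a hypothetical helper, not part of ipumsr or biglm. It mimics a yield object by handing out a few rows at a time and returning NULL once the data are exhausted.

```r
library(biglm)

# Hypothetical helper: builds a data function in the form bigglm() expects
make_chunk_reader <- function(df, chunk_size) {
  pos <- 0
  function(reset = FALSE) {
    if (reset) {
      # Go back to the start of the data
      pos <<- 0
      return(NULL)
    }
    if (pos >= nrow(df)) {
      # No more data: signal the end of a pass with NULL
      return(NULL)
    }
    rows <- (pos + 1):min(pos + chunk_size, nrow(df))
    pos <<- max(rows)
    df[rows, , drop = FALSE]
  }
}

reader <- make_chunk_reader(mtcars, chunk_size = 8)

# Gaussian family, so the result can be compared against lm()
fit <- bigglm(mpg ~ wt + hp, data = reader, family = gaussian())
full_fit <- lm(mpg ~ wt + hp, data = mtcars)

cbind(chunked = coef(fit), full = coef(full_fit))
```

bigglm() may pass over the data several times, which is why the function must be able to rewind when reset is TRUE.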

Option 4: Use a database


Storing your data in a database is another way to work with data that cannot fit into memory as a data frame. If you have access to a database on a remote machine, then you can easily select and use parts of the data for your analysis. Even databases on your own machine may provide more efficient data storage or use your hard drive, enabling the data to be loaded into R.


There are many different kinds of databases, each with their own benefits and drawbacks, and the database you choose to use will be specific to your use case. However, once you’ve chosen a database, there will be two general steps:

  1. Importing data into the database
  2. Connecting the database to R

R has several tools that support database integration, including DBI, dbplyr, sparklyr, bigrquery, and others. In this example, we’ll use RSQLite to load the data into an in-memory database. (We use RSQLite because it is easy to set up, but it is likely not efficient enough to fully resolve issues with large IPUMS data, so it may be wise to consider an alternative in practice.)


Importing data into the database


For rectangular extracts, it is likely simplest to load your data into the database in CSV format, which is widely supported. If you are working with a hierarchical extract (or your database software doesn’t support CSV format), then you can use an ipumsr chunked function to load the data into a database without needing to store the entire dataset in R.


See the IPUMS data reading vignette for more about rectangular vs. hierarchical extracts.

-library(DBI)
-library(RSQLite)
-# Connect to database
-con <- dbConnect(SQLite(), path = ":memory:")
-# Load file metadata
-ddi <- read_ipums_ddi(cps_ddi_file)
-# Write data to database in chunks
-read_ipums_micro_chunked(
-  ddi,
-  readr::SideEffectChunkCallback$new(
-    function(x, pos) {
-      # The first chunk creates the table; later chunks are appended
-      if (pos == 1) {
-        dbWriteTable(con, "cps", x)
-      } else {
-        dbWriteTable(con, "cps", x, row.names = FALSE, append = TRUE)
-      }
-    }
-  ),
-  chunk_size = 1000,
-  verbose = FALSE
-)
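The same create-then-append pattern can be tried on made-up data before pointing it at a real extract. This is only a sketch: the toy data frame and table name are hypothetical, and the connection passes ":memory:" as the database name, which is an assumption about how you might want to configure RSQLite rather than a description of the code above.

```r
library(DBI)

# In-memory SQLite database (assumes the RSQLite package is installed)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy data standing in for a large extract
toy <- data.frame(id = 1:25, value = (1:25) * 2)
chunk_size <- 10

for (pos in seq(1, nrow(toy), by = chunk_size)) {
  rows <- pos:min(pos + chunk_size - 1, nrow(toy))
  # The first chunk creates the table; later chunks are appended to it
  dbWriteTable(con, "toy", toy[rows, ], append = pos > 1, row.names = FALSE)
}

# Confirm that every chunk arrived
n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM toy")$n
n_rows

dbDisconnect(con)
```

Only one chunk is held in R at a time, so memory use is bounded by chunk_size rather than by the size of the full table.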

Connecting to a database with dbplyr


There are a variety of ways to access your data once it is stored in the database. In this example, we use dbplyr. For more details about dbplyr, see vignette("dbplyr", package = "dbplyr").


To run a simple query for AGE, we can use the same syntax we would use with dplyr:

-example <- tbl(con, "cps")
-example %>%
-  filter("AGE" > 25)
-#> # Source:   SQL [?? x 14]
-#> # Database: sqlite 3.43.2 []
-#>    <dbl>  <dbl> <int>   <dbl>    <int>   <dbl>    <int>  <dbl>   <dbl>  <dbl>
-#>  1  2011     33     3 2.01e13        1    308.        1      1 2.01e13   308.
-#>  2  2011     33     3 2.01e13        1    308.        1      2 2.01e13   217.
-#>  3  2011     33     3 2.01e13        1    308.        1      3 2.01e13   249.
-#>  4  2011     46     3 2.01e13        1    266.        1      1 2.01e13   266.
-#>  5  2011     46     3 2.01e13        1    266.        1      2 2.01e13   266.
-#>  6  2011     46     3 2.01e13        1    266.        1      3 2.01e13   265.
-#>  7  2011     46     3 2.01e13        1    266.        1      4 2.01e13   296.
-#>  8  2011     64     3 2.01e13        1    241.        1      1 2.01e13   241.
-#>  9  2011     64     3 2.01e13        1    241.        1      2 2.01e13   241.
-#> 10  2011     64     3 2.01e13        1    241.        1      3 2.01e13   278.
-#> # ℹ more rows
-#> # ℹ 4 more variables: AGE <int>, EMPSTAT <int>, AHRSWORKT <dbl>, HEALTH <int>

dbplyr shows us a nice preview of the first rows of the result of our query, but the data still exist only in the database. You can use dplyr::collect() to load the full results of the query into the current R session. However, this would omit the variable metadata attached to IPUMS data, since the database doesn’t store this metadata:

-data <- example %>%
-  filter("AGE" > 25) %>%
-  collect()
-# Variable metadata is missing
-ipums_val_labels(data$MONTH)
-#> # A tibble: 0 × 2
-#> # ℹ 2 variables: val <dbl>, lbl <chr>

Instead, use ipums_collect(), which uses a provided ipums_ddi object to reattach the metadata while loading into the R environment:

-data <- example %>%
-  filter("AGE" > 25) %>%
-  ipums_collect(ddi)
-ipums_val_labels(data$MONTH)
-#> # A tibble: 12 × 2
-#>      val lbl      
-#>    <int> <chr>    
-#>  1     1 January  
-#>  2     2 February 
-#>  3     3 March    
-#>  4     4 April    
-#>  5     5 May      
-#>  6     6 June     
-#>  7     7 July     
-#>  8     8 August   
-#>  9     9 September
-#> 10    10 October  
-#> 11    11 November 
-#> 12    12 December

See the value labels vignette for more about variable metadata in IPUMS data.


Learning more


Big data isn’t just a problem for IPUMS users, so there are many R resources available.


See the documentation for the packages mentioned in the databases section for more information about those options.


For some past blog posts and articles on the topic, see the following:

- -
- -
Once you have downloaded an IPUMS extract, the next step is to load its data into R for analysis.


For more information about IPUMS data and how to generate and download a data extract, see the introduction to IPUMS data.


IPUMS extract structure


IPUMS extracts will be organized slightly differently for different IPUMS projects. In general, all projects will provide multiple files in a data extract. The files most relevant to ipumsr are:

  • The metadata file containing information about the variables included in the extract data
  • One or more data files, depending on the project and specifications in the extract

Both of these files are necessary to properly load data into R. The data files contain the actual data values to be loaded, but because these are often in fixed-width format, the metadata files are required to correctly parse the data on load.


Even for .csv files, the metadata file allows for the addition of contextual variable information to the loaded data. This makes it much easier to interpret the values in the data variables and effectively use them in your data processing pipeline. See the value labels vignette for more information on working with these labels.


Reading microdata extracts


Microdata extracts typically provide their metadata in a DDI (.xml) file separate from the compressed data (.dat.gz) files.


Provide the path to the DDI file to read_ipums_micro() to directly load its associated data file into R.

-# Example data
-cps_ddi_file <- ipums_example("cps_00157.xml")
-cps_data <- read_ipums_micro(cps_ddi_file)
-head(cps_data)
-#> # A tibble: 6 × 8
-#>   <dbl>  <dbl> <int+lbl>   <dbl> <int+lbl>       <dbl>  <dbl> <dbl+lbl>         
-#> 1  1962     80 3 [March]   1476. 55 [Wisconsin]      1  1476.      4883         
-#> 2  1962     80 3 [March]   1476. 55 [Wisconsin]      2  1471.      5800         
-#> 3  1962     80 3 [March]   1476. 55 [Wisconsin]      3  1579. 999999998 [Missin…
-#> 4  1962     82 3 [March]   1598. 27 [Minnesota]      1  1598.     14015         
-#> 5  1962     83 3 [March]   1707. 27 [Minnesota]      1  1707.     16552         
-#> 6  1962     84 3 [March]   1790. 27 [Minnesota]      1  1790.      6375

Note that you provide the path to the DDI file, not the data file. This is because ipumsr needs to find both the DDI and data files to read in your data, and the DDI file includes the name of the data file, whereas the data file contains only the raw data.


The loaded data have been parsed correctly and include variable metadata in each column. For a summary of the column contents, use ipums_var_info():

-ipums_var_info(cps_data)
-#> # A tibble: 8 × 4
-#>   var_name var_label                                         var_desc val_labels
-#>   <chr>    <chr>                                             <chr>    <list>    
-#> 1 YEAR     Survey year                                       "YEAR r… <tibble>  
-#> 2 SERIAL   Household serial number                           "SERIAL… <tibble>  
-#> 3 MONTH    Month                                             "MONTH … <tibble>  
-#> 4 ASECWTH  Annual Social and Economic Supplement Household … "ASECWT… <tibble>  
-#> 5 STATEFIP State (FIPS code)                                 "STATEF… <tibble>  
-#> 6 PERNUM   Person number in sample unit                      "PERNUM… <tibble>  
-#> 7 ASECWT   Annual Social and Economic Supplement Weight      "ASECWT… <tibble>  
-#> 8 INCTOT   Total personal income                             "INCTOT… <tibble>

This information is also attached to specific columns. You can obtain it with attributes() or by using ipumsr helpers:

-attributes(cps_data$MONTH)
-#> $labels
-#>   January  February     March     April       May      June      July    August 
-#>         1         2         3         4         5         6         7         8 
-#> September   October  November  December 
-#>         9        10        11        12 
-#> $class
-#> [1] "haven_labelled" "vctrs_vctr"     "integer"       
-#> $label
-#> [1] "Month"
-#> $var_desc
-#> [1] "MONTH indicates the calendar month of the CPS interview."
-ipums_val_labels(cps_data$MONTH)
-#> # A tibble: 12 × 2
-#>      val lbl      
-#>    <int> <chr>    
-#>  1     1 January  
-#>  2     2 February 
-#>  3     3 March    
-#>  4     4 April    
-#>  5     5 May      
-#>  6     6 June     
-#>  7     7 July     
-#>  8     8 August   
-#>  9     9 September
-#> 10    10 October  
-#> 11    11 November 
-#> 12    12 December

While this is the most straightforward way to load microdata, it’s often advantageous to independently load the DDI file into an ipums_ddi object containing the metadata:

-cps_ddi <- read_ipums_ddi(cps_ddi_file)
-cps_ddi
-#> An IPUMS DDI for IPUMS CPS with 8 variables
-#> Extract 'cps_00157.dat' created on 2023-07-10
-#> User notes:  User-provided description: Reproducing cps00006

This is because many common data processing functions have the side effect of removing these attributes:

-# This doesn't actually change the data...
-cps_data2 <- cps_data %>%
-  mutate(MONTH = ifelse(TRUE, MONTH, MONTH))
-# but removes attributes!
-ipums_val_labels(cps_data2$MONTH)
-#> # A tibble: 0 × 2
-#> # ℹ 2 variables: val <dbl>, lbl <chr>

In this case, you can always use the separate DDI as a metadata reference:

-ipums_val_labels(cps_ddi, var = MONTH)
-#> # A tibble: 12 × 2
-#>      val lbl      
-#>    <dbl> <chr>    
-#>  1     1 January  
-#>  2     2 February 
-#>  3     3 March    
-#>  4     4 April    
-#>  5     5 May      
-#>  6     6 June     
-#>  7     7 July     
-#>  8     8 August   
-#>  9     9 September
-#> 10    10 October  
-#> 11    11 November 
-#> 12    12 December

Or even reattach the metadata, assuming the variable names still match those in the DDI:

-cps_data2 <- set_ipums_var_attributes(cps_data2, cps_ddi)
-ipums_val_labels(cps_data2$MONTH)
-#> # A tibble: 12 × 2
-#>      val lbl      
-#>    <int> <chr>    
-#>  1     1 January  
-#>  2     2 February 
-#>  3     3 March    
-#>  4     4 April    
-#>  5     5 May      
-#>  6     6 June     
-#>  7     7 July     
-#>  8     8 August   
-#>  9     9 September
-#> 10    10 October  
-#> 11    11 November 
-#> 12    12 December
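The attribute loss described in this section is not specific to ipumsr. As a minimal base R sketch (the month vector below is made-up data, not taken from the CPS extract, and its "label" attribute merely stands in for IPUMS metadata), ifelse() returns a plain vector, and the attributes can be copied back from the original afterwards:

```r
# Toy vector with a metadata-style attribute
month <- structure(c(3L, 3L, 4L), label = "Month")

# ifelse() builds its result from the test vector, so attributes
# carried by the `yes` and `no` arguments are dropped
rebuilt <- ifelse(rep(TRUE, 3), month, month)
attr(rebuilt, "label")

# Copy the attributes back from the original vector
restored <- rebuilt
attributes(restored) <- attributes(month)
attr(restored, "label")
```

Keeping an untouched reference copy of the metadata, as the separate DDI object does for IPUMS data, is what makes this kind of repair possible.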

Hierarchical extracts


IPUMS microdata can come in either rectangular or hierarchical format.


Rectangular data are transformed such that every row of data represents the same type of record. For instance, each row will represent a person record, and all household-level information for that person will be included in the same row. (This is the case for cps_data shown in the example above.)


Hierarchical data have records of different types interspersed in a single file. For instance, a household record will be included in its own row followed by the person records associated with that household.


Hierarchical data can be loaded in list format or long format. read_ipums_micro() will read in long format:

-cps_hier_ddi <- read_ipums_ddi(ipums_example("cps_00159.xml"))
-read_ipums_micro(cps_hier_ddi)
-#> Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
-#> # A tibble: 11,053 × 9
-#>    <chr+lbl>  <dbl>  <dbl> <int+lb>   <dbl> <int+lb>  <dbl>  <dbl> <dbl+lbl>    
-#>  1 H [Househ…  1962     80  3 [Mar…   1476. 55 [Wis…     NA    NA  NA           
-#>  2 P [Person…  1962     80 NA           NA  NA            1  1476.  4.88e3      
-#>  3 P [Person…  1962     80 NA           NA  NA            2  1471.  5.8 e3      
-#>  4 P [Person…  1962     80 NA           NA  NA            3  1579.  1.00e9 [Mis…
-#>  5 H [Househ…  1962     82  3 [Mar…   1598. 27 [Min…     NA    NA  NA           
-#>  6 P [Person…  1962     82 NA           NA  NA            1  1598.  1.40e4      
-#>  7 H [Househ…  1962     83  3 [Mar…   1707. 27 [Min…     NA    NA  NA           
-#>  8 P [Person…  1962     83 NA           NA  NA            1  1707.  1.66e4      
-#>  9 H [Househ…  1962     84  3 [Mar…   1790. 27 [Min…     NA    NA  NA           
-#> 10 P [Person…  1962     84 NA           NA  NA            1  1790.  6.38e3      
-#> # ℹ 11,043 more rows

The long format consists of a single tibble that includes rows with varying record types. In this example, some rows have a record type of “Household” and others have a record type of “Person”. Variables that do not apply to a particular record type will be filled with NA in rows of that record type.


To read data in list format, use read_ipums_micro_list(). This function returns a list where each element contains all the records for a given record type:

-read_ipums_micro_list(cps_hier_ddi)
-#> Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
-#> # A tibble: 3,385 × 6
-#>    <chr+lbl>            <dbl>  <dbl> <int+lbl>   <dbl> <int+lbl>     
-#>  1 H [Household Record]  1962     80 3 [March]   1476. 55 [Wisconsin]
-#>  2 H [Household Record]  1962     82 3 [March]   1598. 27 [Minnesota]
-#>  3 H [Household Record]  1962     83 3 [March]   1707. 27 [Minnesota]
-#>  4 H [Household Record]  1962     84 3 [March]   1790. 27 [Minnesota]
-#>  5 H [Household Record]  1962    107 3 [March]   4355. 19 [Iowa]     
-#>  6 H [Household Record]  1962    108 3 [March]   1479. 19 [Iowa]     
-#>  7 H [Household Record]  1962    122 3 [March]   3603. 27 [Minnesota]
-#>  8 H [Household Record]  1962    124 3 [March]   4104. 55 [Wisconsin]
-#>  9 H [Household Record]  1962    125 3 [March]   2182. 55 [Wisconsin]
-#> 10 H [Household Record]  1962    126 3 [March]   1826. 55 [Wisconsin]
-#> # ℹ 3,375 more rows
-#> # A tibble: 7,668 × 6
-#>    RECTYPE            YEAR SERIAL PERNUM ASECWT INCTOT                          
-#>    <chr+lbl>         <dbl>  <dbl>  <dbl>  <dbl> <dbl+lbl>                       
-#>  1 P [Person Record]  1962     80      1  1476.      4883                       
-#>  2 P [Person Record]  1962     80      2  1471.      5800                       
-#>  3 P [Person Record]  1962     80      3  1579. 999999998 [Missing. (1962-1964 …
-#>  4 P [Person Record]  1962     82      1  1598.     14015                       
-#>  5 P [Person Record]  1962     83      1  1707.     16552                       
-#>  6 P [Person Record]  1962     84      1  1790.      6375                       
-#>  7 P [Person Record]  1962    107      1  4355. 999999999 [N.I.U.]              
-#>  8 P [Person Record]  1962    107      2  1386.         0                       
-#>  9 P [Person Record]  1962    107      3  1629.       600                       
-#> 10 P [Person Record]  1962    107      4  1432. 999999999 [N.I.U.]              
-#> # ℹ 7,658 more rows
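As a hedged sketch of working with the list format, the elements can be inspected by name and counted separately. The expectation that the elements are called HOUSEHOLD and PERSON is an assumption based on the record types in this example extract, so the code prints the names rather than relying on them.

```r
library(ipumsr)

cps_list <- read_ipums_micro_list(
  ipums_example("cps_00159.xml"),
  verbose = FALSE
)

# Inspect which record types were returned (assumed: HOUSEHOLD and PERSON)
names(cps_list)

# Each element is its own tibble, so row counts differ by record type
record_counts <- vapply(cps_list, nrow, integer(1))
record_counts
```

Together the per-type row counts should add up to the number of rows in the long format, since both formats hold the same records.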

read_ipums_micro() and read_ipums_micro_list() also support partial loading by selecting only a subset of columns or a limited number of rows. See the documentation for more details about other options.
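A minimal sketch of partial loading, using the vars and n_max arguments of read_ipums_micro(). The choice of YEAR and INCTOT and the 100-row limit are arbitrary and only for illustration:

```r
library(ipumsr)

cps_ddi_file <- ipums_example("cps_00157.xml")

# Load only two columns and the first 100 records
cps_subset <- read_ipums_micro(
  cps_ddi_file,
  vars = c(YEAR, INCTOT),
  n_max = 100,
  verbose = FALSE
)

dim(cps_subset)
names(cps_subset)
```

Restricting columns and rows at read time avoids loading data that would be discarded immediately afterwards.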


Reading IPUMS NHGIS extracts


Unlike microdata projects, NHGIS extracts provide their data and metadata files bundled into a single .zip archive. read_nhgis() anticipates this structure and can read data files directly from this file without the need to manually extract the files:

-nhgis_ex1 <- ipums_example("nhgis0972_csv.zip")
-nhgis_data <- read_nhgis(nhgis_ex1)
-#> Use of data from NHGIS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
-#> Rows: 71 Columns: 25
-#> ── Column specification ────────────────────────────────────────────────────────
-#> Delimiter: ","
-#> dbl (13): YEAR, MSA_CMSAA, INTPTLAT, INTPTLNG, PSADC, D6Z001, D6Z002, D6Z003...
-#>  Use `spec()` to retrieve the full column specification for this data.
-#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
-nhgis_data
-#> # A tibble: 71 × 25
-#>    <chr>   <dbl> <chr>  <chr> <lgl>         <dbl> <chr>     <chr> <lgl>   <lgl> 
-#>  1 G0080    1990 OH     28    NA             1692 Akron, O… 0080  NA      NA    
-#>  2 G0360    1990 CA     49    NA             4472 Anaheim-… 0360  NA      NA    
-#>  3 G0440    1990 MI     35    NA             2162 Ann Arbo… 0440  NA      NA    
-#>  4 G0620    1990 IL     14    NA             1602 Aurora--… 0620  NA      NA    
-#>  5 G0845    1990 PA     78    NA             6282 Beaver C… 0845  NA      NA    
-#>  6 G0875    1990 NJ     70    NA             5602 Bergen--… 0875  NA      NA    
-#>  7 G1120    1990 MA     07    NA             1122 Boston, … 1120  NA      NA    
-#>  8 G1125    1990 CO     34    NA             2082 Boulder-… 1125  NA      NA    
-#>  9 G1145    1990 TX     42    NA             3362 Brazoria… 1145  NA      NA    
-#> 10 G1160    1990 CT     70    NA             5602 Bridgepo… 1160  NA      NA    
-#> # ℹ 61 more rows
-#> # ℹ 15 more variables: AREALAND <chr>, AREAWAT <chr>, ANPSADPI <chr>,
-#> #   FUNCSTAT <chr>, INTPTLAT <dbl>, INTPTLNG <dbl>, PSADC <dbl>, D6Z001 <dbl>,
-#> #   D6Z002 <dbl>, D6Z003 <dbl>, D6Z004 <dbl>, D6Z005 <dbl>, D6Z006 <dbl>,
-#> #   D6Z007 <dbl>, D6Z008 <dbl>

Like microdata extracts, the data include variable-level metadata, where available:

-attributes(nhgis_data$D6Z001)
-#> $label
-#> [1] "Total area: 1989 to March 1990"
-#> $var_desc
-#> [1] "Table D6Z: Year Structure Built (Universe: Housing Units)"

However, variable metadata for NHGIS data are slightly different than those provided by microdata products. First, they come from a .txt codebook file rather than an .xml DDI file. Codebooks can still be loaded into an ipums_ddi object, but fields that do not apply to aggregate data will be empty. In general, NHGIS codebooks provide only variable labels and descriptions, along with citation information.

-nhgis_cb <- read_nhgis_codebook(nhgis_ex1)
-# Most useful metadata for NHGIS is for variable labels:
-ipums_var_info(nhgis_cb) %>%
-  select(var_name, var_label, var_desc)
-#> # A tibble: 25 × 3
-#>    var_name  var_label                                                  var_desc
-#>    <chr>     <chr>                                                      <chr>   
-#>  1 GISJOIN   GIS Join Match Code                                        ""      
-#>  2 YEAR      Data File Year                                             ""      
-#>  3 STUSAB    State/US Abbreviation                                      ""      
-#>  4 CMSA      Consolidated Metropolitan Statistical Area                 ""      
-#>  5 DIVISIONA Division Code                                              ""      
-#>  6 MSA_CMSAA Metropolitan Statistical Area/Consolidated Metropolitan S… ""      
-#>  7 PMSA      Primary Metropolitan Statistical Area Name                 ""      
-#>  8 PMSAA     Primary Metropolitan Statistical Area Code                 ""      
-#>  9 REGIONA   Region Code                                                ""      
-#> 10 STATEA    State Code                                                 ""      
-#> # ℹ 15 more rows

By design, NHGIS codebooks are human-readable, and it may be easier to interpret their contents in raw format. To view the codebook itself without converting to an ipums_ddi object, set raw = TRUE.

-nhgis_cb <- read_nhgis_codebook(nhgis_ex1, raw = TRUE)
-cat(nhgis_cb[1:20], sep = "\n")
-#> --------------------------------------------------------------------------------
-#> Codebook for NHGIS data file 'nhgis0972_ds135_1990_pmsa'
-#> --------------------------------------------------------------------------------
-#> Contents
-#>     - Data Summary
-#>     - Data Dictionary
-#>     - Citation and Use
-#> Additional documentation on NHGIS data sources is available at: 
-#>     https://www.nhgis.org/documentation/tabular-data 
-#> --------------------------------------------------------------------------------
-#> Data Summary
-#> --------------------------------------------------------------------------------
-#> Year:             1990
-#> Geographic level: Consolidated Metropolitan Statistical Area--Primary Metropolitan Statistical Area
-#> Dataset:          1990 Census: SSTF 9 - Housing Characteristics of New Units
-#>    NHGIS code:    1990_SSTF09
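Because the raw codebook is a plain character vector with one element per line, base R string tools are enough to pull out individual fields. The following is a minimal sketch: cb_field() is a hypothetical helper (not part of ipumsr), and the input lines are copied from the excerpt shown above rather than read from a file. In practice the vector could come from read_nhgis_codebook(nhgis_ex1, raw = TRUE).

```r
# Lines copied from the "Data Summary" excerpt shown above
cb_lines <- c(
  "Year:             1990",
  "Dataset:          1990 Census: SSTF 9 - Housing Characteristics of New Units",
  "   NHGIS code:    1990_SSTF09"
)

# Hypothetical helper: return the value of a "Field: value" line
cb_field <- function(lines, field) {
  # Find lines that begin (after optional whitespace) with the field name
  hit <- grep(paste0("^\\s*", field, ":"), lines, value = TRUE)
  # Drop everything up to the first colon, then trim surrounding whitespace
  trimws(sub("^\\s*[^:]+:", "", hit))
}

cb_field(cb_lines, "Year")
cb_field(cb_lines, "NHGIS code")
```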

Handling multiple files


For more complicated NHGIS extracts that include data from multiple data sources, the provided .zip archive will contain multiple codebook and data files.


You can view the files contained in an extract to determine if this is the case:

-nhgis_ex2 <- ipums_example("nhgis0731_csv.zip")
-ipums_list_files(nhgis_ex2)
-#> # A tibble: 2 × 2
-#>   type  file                                          
-#>   <chr> <chr>                                         
-#> 1 data  nhgis0731_csv/nhgis0731_ds239_20185_nation.csv
-#> 2 data  nhgis0731_csv/nhgis0731_ts_nominal_state.csv

In these cases, you can use the file_select argument to indicate which file to load. file_select supports most features of the tidyselect selection language. (See ?selection_language for documentation of the features supported in ipumsr.)

-nhgis_data2 <- read_nhgis(nhgis_ex2, file_select = contains("nation"))
-nhgis_data3 <- read_nhgis(nhgis_ex2, file_select = contains("ts_nominal_state"))
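To see why these two patterns each pick out exactly one file, it can help to reproduce the matching by hand. The sketch below is a base R approximation of a case-insensitive substring match, which is how tidyselect's contains() behaves by default. match_contains() is a hypothetical helper written for illustration; it is not how ipumsr resolves file_select internally.

```r
# File names as listed for this extract
files <- c(
  "nhgis0731_csv/nhgis0731_ds239_20185_nation.csv",
  "nhgis0731_csv/nhgis0731_ts_nominal_state.csv"
)

# Hypothetical helper: case-insensitive literal substring match
match_contains <- function(pattern, x) {
  x[grepl(tolower(pattern), tolower(x), fixed = TRUE)]
}

match_contains("nation", files)           # matches only the nation-level file
match_contains("ts_nominal_state", files) # matches only the time series file
match_contains("nhgis", files)            # matches both files, so it is ambiguous
```

A pattern that matches more than one file does not identify a single file to load, so it is worth choosing a substring that is unique within the extract.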

The matching codebook should automatically be loaded and attached to the data:

-attributes(nhgis_data2$AJWBE001)
-#> $label
-#> [1] "Estimates: Total"
-#> $var_desc
-#> [1] "Table AJWB: Sex by Age (Universe: Total population)"
-attributes(nhgis_data3$A00AA1790)
-#> $label
-#> [1] "1790: Persons: Total"
-#> $var_desc
-#> [1] "Table A00: Total Population"

(If for some reason the codebook is not loaded correctly, you can load it separately with read_nhgis_codebook(), which also accepts a file_select specification.)


file_select also accepts the full path or the index of the file to load:

-# Match by file name
-read_nhgis(nhgis_ex2, file_select = "nhgis0731_csv/nhgis0731_ds239_20185_nation.csv")
-# Match first file in extract
-read_nhgis(nhgis_ex2, file_select = 1)

NHGIS data formats


CSV data


NHGIS data are most easily handled in .csv format. read_nhgis() uses readr::read_csv() to handle the generation of column type specifications. If the guessed specifications are incorrect, you can use the col_types argument to adjust them. This is most likely to be necessary for columns that contain geographic codes that are stored as numeric values:

-# Convert MSA codes to character format
-read_nhgis(
-  nhgis_ex1,
-  col_types = c(MSA_CMSAA = "c"),
-  verbose = FALSE
-)
-#> # A tibble: 71 × 25
-#>    GISJOIN  YEAR STUSAB CMSA  DIVISIONA MSA_CMSAA PMSA      PMSAA REGIONA STATEA
-#>    <chr>   <dbl> <chr>  <chr> <lgl>     <chr>     <chr>     <chr> <lgl>   <lgl> 
-#>  1 G0080    1990 OH     28    NA        1692      Akron, O… 0080  NA      NA    
-#>  2 G0360    1990 CA     49    NA        4472      Anaheim-… 0360  NA      NA    
-#>  3 G0440    1990 MI     35    NA        2162      Ann Arbo… 0440  NA      NA    
-#>  4 G0620    1990 IL     14    NA        1602      Aurora--… 0620  NA      NA    
-#>  5 G0845    1990 PA     78    NA        6282      Beaver C… 0845  NA      NA    
-#>  6 G0875    1990 NJ     70    NA        5602      Bergen--… 0875  NA      NA    
-#>  7 G1120    1990 MA     07    NA        1122      Boston, … 1120  NA      NA    
-#>  8 G1125    1990 CO     34    NA        2082      Boulder-… 1125  NA      NA    
-#>  9 G1145    1990 TX     42    NA        3362      Brazoria… 1145  NA      NA    
-#> 10 G1160    1990 CT     70    NA        5602      Bridgepo… 1160  NA      NA    
-#> # ℹ 61 more rows
-#> # ℹ 15 more variables: AREALAND <chr>, AREAWAT <chr>, ANPSADPI <chr>,
-#> #   FUNCSTAT <chr>, INTPTLAT <dbl>, INTPTLNG <dbl>, PSADC <dbl>, D6Z001 <dbl>,
-#> #   D6Z002 <dbl>, D6Z003 <dbl>, D6Z004 <dbl>, D6Z005 <dbl>, D6Z006 <dbl>,
-#> #   D6Z007 <dbl>, D6Z008 <dbl>
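The reason character storage matters for geographic codes is that a numeric column cannot keep leading zeros, so codes no longer match their original form after a round trip. The base R sketch below illustrates this with a few PMSAA values from the output above. The padding width of 4 used for the repair is an assumption that fits these particular codes.

```r
# A few PMSAA codes as they appear in the data above
code_chr <- c("0080", "0360", "1120")

# Storing the codes as numbers drops the leading zeros
code_num <- as.numeric(code_chr)
as.character(code_num)

# The round trip no longer reproduces the original codes
identical(as.character(code_num), code_chr)

# If codes were already read as numeric, they can be re-padded,
# assuming the fixed width (here 4) is known
sprintf("%04d", as.integer(code_num))
```

Reading the column as character in the first place avoids having to know the intended width afterwards.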

Fixed-width data


read_nhgis() also handles NHGIS files provided in fixed-width format:

