![](http://tech.popdata.org/ipumsr/logo.png)
As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.

We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.

Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned with this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.

This Code of Conduct is adapted from the Contributor Covenant (http://contributor-covenant.org), version 1.0.0, available at http://contributor-covenant.org/version/1/0/0/
Thank you for considering improving this project! By participating, you agree to abide by the code of conduct.

If you've experienced a problem with the package, or have a suggestion for it, please post it on the issues tab. This space is meant for questions directly related to the R package, so questions related to your specific extract may be better answered via email to ipums@umn.edu (but don't worry about making a mistake, we know it is tough to tell the difference).

Since our extracts are such large files, posting minimal reproducible examples may be difficult. Therefore, it will be most helpful if you can provide as much detail about your problem as possible, including the code and error message, the project the extract is from, the variables you have selected, the file type, etc. We'll do our best to answer your question.
We appreciate pull requests that follow these guidelines:

- Make sure that tests pass (and add new ones if possible).
- Do your best to conform to the code style of the package, currently based on the tidyverse style guide. See the styler package to easily catch stylistic errors.
- Please add your name and affiliation to the NOTICE.txt file.
- Summarize your changes in the NEWS.md file.
If you've never worked on an R package before, the book R Packages by Hadley Wickham is a great resource for learning the mechanics of building an R package and contributing to R packages on GitHub. Additionally, here's a great primer on git and GitHub specifically.
In the meantime, here's a quick step-by-step guide on contributing to this project using RStudio:

1. If you don't already have RStudio and Git installed, you can download them here and here.
2. Fork this repo (top right corner button on the GitHub website).
3. Clone the repo from RStudio's toolbar: File > New Project > From Version Control > https://github.com/*YOUR_USER_NAME*/ipumsr/.
4. Make changes to your local copy.
5. Commit your changes and push them to the GitHub website using RStudio's Git pane (push using the green up arrow).
6. Submit a pull request, selecting the "compare across forks" option. Please include a short message summarizing your changes.
This vignette details the options available for requesting data from IPUMS microdata projects via the IPUMS API.

If you haven't yet learned the basics of the IPUMS API workflow, you may want to start with the IPUMS API introduction. The code below assumes you have registered and set up your API key as described there.

IPUMS provides several data collections that are classified as microdata. Currently, the following microdata collections are supported by the IPUMS API (shown with the codes used to refer to them in ipumsr):
-"usa"
)"cps"
)"ipumsi"
API support will continue to be added for more collections in the future. See the API documentation for more information on upcoming additions to the API.

In addition to microdata projects, the IPUMS API also supports IPUMS NHGIS data. For details about obtaining IPUMS NHGIS data using ipumsr, see the NHGIS-specific vignette.

Before getting started, we'll load ipumsr and dplyr, which will be helpful for this demo:
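```r
library(ipumsr)
library(dplyr)
```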
Every microdata extract definition must contain a set of requested samples and variables.

In an IPUMS microdata collection, a sample refers to a distinct combination of records and variables. A record is a set of values that describe the characteristics of a single unit of measurement (e.g., a single person or a single household), and variables define the characteristics that were measured.

A single sample can contain multiple record types (e.g., person records, household records, activity records, and more), each of which corresponds to a different unit of measurement.

Note that our usage of the term "sample" does not correspond perfectly to the statistical sense of a subset of individuals from a population. Many IPUMS samples are samples in the statistical sense, but some are "full-count" samples, meaning they contain all individuals in a population.

Of course, to request samples and variables, we have to know the codes that the API uses to refer to them. For samples, the IPUMS API uses special codes that don't appear in the web-based extract builder. For variables, the API uses the same variable names that appear on the web.
While the IPUMS API does not yet provide a comprehensive set of metadata endpoints for IPUMS microdata collections, users can use the `get_sample_info()` function to identify the codes used to refer to specific samples when communicating with the API.
```r
cps_samps <- get_sample_info("cps")

head(cps_samps)
#> # A tibble: 6 × 2
#>   name        description
#>   <chr>       <chr>
#> 1 cps1962_03s IPUMS-CPS, ASEC 1962
#> 2 cps1963_03s IPUMS-CPS, ASEC 1963
#> 3 cps1964_03s IPUMS-CPS, ASEC 1964
#> 4 cps1965_03s IPUMS-CPS, ASEC 1965
#> 5 cps1966_03s IPUMS-CPS, ASEC 1966
#> 6 cps1967_03s IPUMS-CPS, ASEC 1967
```
The values listed in the `name` column correspond to the code that you would use to request that sample when creating an extract definition to be submitted to the IPUMS API.
We can use basic functions from dplyr to filter the metadata to samples of interest. For instance, to find all IPUMS International samples for Mexico, we could do the following:

```r
ipumsi_samps <- get_sample_info("ipumsi")

ipumsi_samps %>%
  filter(grepl("Mexico", description))
#> # A tibble: 70 × 2
#>    name    description
#>    <chr>   <chr>
#>  1 mx1960a Mexico 1960
#>  2 mx1970a Mexico 1970
#>  3 mx1990a Mexico 1990
#>  4 mx1995a Mexico 1995
#>  5 mx2000a Mexico 2000
#>  6 mx2005a Mexico 2005
#>  7 mx2010a Mexico 2010
#>  8 mx2015a Mexico 2015
#>  9 mx2005h Mexico 2005 Q1 LFS
#> 10 mx2005i Mexico 2005 Q2 LFS
#> # ℹ 60 more rows
```
IPUMS intends to add support for accessing variable metadata via API in the future. Until then, use the web-based extract builder for a given collection to find variable names and availability by sample. See the IPUMS API documentation for links to the extract builder for each microdata collection with API support.

Alternatively, if you have made an extract previously through the web interface, you can use `get_extract_info()` to identify the variable names it includes. See the IPUMS API introduction for more details.
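For example, a minimal sketch, assuming you have at least one prior IPUMS USA extract (the extract number here is hypothetical):

```r
# Retrieve the definition of a previously submitted extract
prior_ext <- get_extract_info("usa:1")

# The names of the variables it includes
names(prior_ext$variables)
```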
Each IPUMS collection has its own extract definition function that is used to specify the parameters of a new extract request from scratch. These functions take the form `define_extract_*()`. For microdata collections, we have:

- `define_extract_usa()`
- `define_extract_cps()`
- `define_extract_ipumsi()`
When you define an extract request, you can specify the data to be included in the extract and indicate the desired format and layout.

While each microdata collection has its own extract definition function, each uses the same syntax. The examples in this vignette use multiple collections, but the syntax they demonstrate can be applied to all of the supported microdata collections.

A simple extract definition needs only to contain the names of the samples and variables to include in the request:

```r
cps_ext <- define_extract_cps(
  description = "Example CPS extract",
  samples = c("cps2018_03s", "cps2019_03s"),
  variables = c("AGE", "SEX", "RACE", "STATEFIP")
)

cps_ext
#> Unsubmitted IPUMS CPS extract
#> Description: Example CPS extract
#>
#> Samples: (2 total) cps2018_03s, cps2019_03s
#> Variables: (4 total) AGE, SEX, RACE, STATEFIP
```
This produces an `ipums_extract` object containing the extract request specifications that is ready to be submitted to the IPUMS API.
When you request a variable in your extract definition, the resulting data extract will include that variable for all requested samples where it is available. If you request a variable that is not available for any requested samples, the IPUMS API will throw an informative error when you try to submit your request.

Beyond specifying samples and variables, there are several additional options available to refine the data requested in a microdata extract request.

The IPUMS API supports several detailed specification options that can be applied to individual variables in an extract request: case selections, attached characteristics, and data quality flags.

Before we describe each of these options in depth, we'll introduce the syntax used to add them to your extract definition.

To add any of these options to a variable, we need to introduce the `var_spec()` helper function. `var_spec()` bundles all the selections for a given variable together into a single object (in this case, a `var_spec` object):
```r
var <- var_spec("SEX", case_selections = "2")

str(var)
#> List of 3
#>  $ name               : chr "SEX"
#>  $ case_selections    : chr "2"
#>  $ case_selection_type: chr "general"
#>  - attr(*, "class")= chr [1:3] "var_spec" "ipums_spec" "list"
```
To include this specification in our extract, we simply provide it to the `variables` argument of our extract definition. When multiple variables are included, pass a `list` of `var_spec` objects:

```r
define_extract_cps(
  description = "Case selection example",
  samples = c("cps2018_03s", "cps2019_03s"),
  variables = list(
    var_spec("SEX", case_selections = "2"),
    var_spec("AGE", attached_characteristics = "head")
  )
)
#> Unsubmitted IPUMS CPS extract
#> Description: Case selection example
#>
#> Samples: (2 total) cps2018_03s, cps2019_03s
#> Variables: (2 total) SEX, AGE
```
In fact, if you investigate our original extract object from above, you'll notice that the variables have automatically been converted to `var_spec` objects, even though they were provided as character vectors:

```r
str(cps_ext$variables)
#> List of 4
#>  $ AGE     :List of 1
#>   ..$ name: chr "AGE"
#>   ..- attr(*, "class")= chr [1:3] "var_spec" "ipums_spec" "list"
#>  $ SEX     :List of 1
#>   ..$ name: chr "SEX"
#>   ..- attr(*, "class")= chr [1:3] "var_spec" "ipums_spec" "list"
#>  $ RACE    :List of 1
#>   ..$ name: chr "RACE"
#>   ..- attr(*, "class")= chr [1:3] "var_spec" "ipums_spec" "list"
#>  $ STATEFIP:List of 1
#>   ..$ name: chr "STATEFIP"
#>   ..- attr(*, "class")= chr [1:3] "var_spec" "ipums_spec" "list"
```
So, a `var_spec` object with no additional specifications will produce the default data for a given variable. That is, the following are equivalent:

```r
define_extract_cps(
  description = "Example CPS extract",
  samples = "cps2018_03s",
  variables = "AGE"
)

define_extract_cps(
  description = "Example CPS extract",
  samples = "cps2018_03s",
  variables = var_spec("AGE")
)
```
Because all specified variables are converted to `var_spec` objects, you can also pass a list where some elements are `var_spec` objects and some are just variable names. This is convenient when you only have detailed specifications for a subset of variables:

```r
define_extract_cps(
  description = "Case selection example",
  samples = c("cps2018_03s", "cps2019_03s"),
  variables = list(
    var_spec("SEX", case_selections = "2"),
    "AGE"
  )
)
#> Unsubmitted IPUMS CPS extract
#> Description: Case selection example
#>
#> Samples: (2 total) cps2018_03s, cps2019_03s
#> Variables: (2 total) SEX, AGE
```
(Samples are also converted to their own `samp_spec` objects, but as there currently aren't any additional specifications available for samples, there is no reason to use anything other than a character vector in the `samples` argument.)
Now that we've covered the basic syntax for including detailed variable specifications, we can describe the available options in more depth.

Case selections allow us to limit the data to those records that match a particular value on the specified variable.

For instance, the following specification would indicate that only records with a value of `"27"` (Minnesota) or `"19"` (Iowa) for the variable `"STATEFIP"` should be included:
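```r
# Only include records from Minnesota ("27") or Iowa ("19")
var <- var_spec("STATEFIP", case_selections = c("27", "19"))
```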
Some variables have versions with both general and detailed coding schemes. By default, case selections are interpreted to refer to the general codes:

```r
var$case_selection_type
#> [1] "general"
```
For variables with detailed versions, you can also select on the detailed codes.

For instance, the IPUMS USA variable RACE is available in both general and detailed versions. If you wanted to limit your extract to persons identifying as "Two major races", you could do so by specifying a case selection of `"8"`. However, if you wanted to limit your extract to only persons identifying as "White and Chinese" or "White and Japanese", you would need to specify detailed codes `"811"` and `"812"`.

To include case selections for detailed codes, set `case_selection_type = "detailed"`:

```r
# General case selection is the default
var_spec("RACE", case_selections = "8")
#> $name
#> [1] "RACE"
#>
#> $case_selections
#> [1] "8"
#>
#> $case_selection_type
#> [1] "general"
#>
#> attr(,"class")
#> [1] "var_spec"   "ipums_spec" "list"

# For detailed case selection, change the `case_selection_type`
var_spec(
  "RACE",
  case_selections = c("811", "812"),
  case_selection_type = "detailed"
)
#> $name
#> [1] "RACE"
#>
#> $case_selections
#> [1] "811" "812"
#>
#> $case_selection_type
#> [1] "detailed"
#>
#> attr(,"class")
#> [1] "var_spec"   "ipums_spec" "list"
```
As noted above, IPUMS intends to add support for accessing variable metadata via API in the future, such that users will be able to query variable coding schemes right from their R sessions. Until then, use the IPUMS web interface for a given collection to find general and detailed variable codes for the purposes of case selection. See the IPUMS API documentation for relevant links.

By default, case selection on person-level variables produces a data file that includes only those individuals who match the specified values for the specified variables. It's also possible to use case selection to include matching individuals and all other members of their households, using the `case_select_who` parameter.
The `case_select_who` parameter must be the same for all case selections in an extract, and thus is set at the extract level rather than the `var_spec` level. To include all household members of matching individuals, set `case_select_who = "households"` in the extract definition:

```r
define_extract_usa(
  description = "Household level case selection",
  samples = "us2021a",
  variables = var_spec("RACE", case_selections = "8"),
  case_select_who = "households"
)
#> Unsubmitted IPUMS USA extract
#> Description: Household level case selection
#>
#> Samples: (1 total) us2021a
#> Variables: (1 total) RACE
```
IPUMS allows users to create variables that reflect the characteristics of other household members. To do so, use the `attached_characteristics` argument of `var_spec()`.

For instance, to attach the spouse's SEX value to a record:

```r
var_spec("SEX", attached_characteristics = "spouse")
#> $name
#> [1] "SEX"
#>
#> $attached_characteristics
#> [1] "spouse"
#>
#> attr(,"class")
#> [1] "var_spec"   "ipums_spec" "list"
```
This will add a new variable (in this case, `SEX_SP`) to the output data that will contain the sex of a person's spouse (if no such record exists, the value will be 0).

Multiple characteristics can be attached for a single variable:

```r
var_spec("AGE", attached_characteristics = c("mother", "father"))
#> $name
#> [1] "AGE"
#>
#> $attached_characteristics
#> [1] "mother" "father"
#>
#> attr(,"class")
#> [1] "var_spec"   "ipums_spec" "list"
```
Acceptable values are `"spouse"`, `"mother"`, `"father"`, and `"head"`.
Some variables in the IPUMS have been edited for missing, illegible, and inconsistent values. Data quality flags indicate which values are edited or allocated.

To include data quality flags for an individual variable, use the `data_quality_flags` argument to `var_spec()`:

```r
var_spec("RACE", data_quality_flags = TRUE)
#> $name
#> [1] "RACE"
#>
#> $data_quality_flags
#> [1] TRUE
#>
#> attr(,"class")
#> [1] "var_spec"   "ipums_spec" "list"
```
This will produce a new variable (`QRACE`) containing the data quality flag for the given variable.

To add data quality flags for all variables that have them, set `data_quality_flags = TRUE` in your extract definition directly:

```r
usa_ext <- define_extract_usa(
  description = "Data quality flags",
  samples = "us2021a",
  variables = list(
    var_spec("RACE", case_selections = "8"),
    var_spec("AGE")
  ),
  data_quality_flags = TRUE
)
```
Each data quality flag corresponds to one or more variables, and the codes for each flag vary based on the sample. See the documentation for the IPUMS collection of interest for more information about data quality flag codes.

By default, microdata extract definitions will request data in a rectangular structure and fixed-width file format.

Rectangular data are data where only person records are included, and any household-level variables are converted to person-level variables by copying the values from the associated household record onto all household members.

To instead create a hierarchical extract, which includes separate records for households and persons, set `data_structure = "hierarchical"` in your extract definition.

See the IPUMS data reading vignette for more information about loading hierarchical data into R.
To request a file format other than fixed-width, adjust the `data_format` argument. Note that while you can request data in a variety of formats (Stata, SPSS, etc.), ipumsr's `read_ipums_micro()` function only supports fixed-width and csv files.
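As a minimal sketch combining both options (the sample and variables here are reused from the examples above):

```r
define_extract_cps(
  description = "Hierarchical CSV extract",
  samples = "cps2019_03s",
  variables = c("AGE", "SEX"),
  data_structure = "hierarchical", # household and person records kept separate
  data_format = "csv"              # request csv instead of fixed-width
)
```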
Once you have defined an extract request, you can submit the extract for processing:

```r
usa_ext_submitted <- submit_extract(usa_ext)
```
The workflow for submitting and monitoring an extract request and downloading its files when complete is described in the IPUMS API introduction.

This vignette details the options available for requesting IPUMS NHGIS data and metadata via the IPUMS API.

If you haven't yet learned the basics of the IPUMS API workflow, you may want to start with the IPUMS API introduction. The code below assumes you have registered and set up your API key as described there.

In addition to NHGIS, the IPUMS API also supports several microdata projects. For details about obtaining IPUMS microdata using ipumsr, see the microdata-specific vignette.

Before getting started, we'll load ipumsr and some helpful packages for this demo:
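```r
library(ipumsr)
library(dplyr)
# purrr is assumed here; it is used for the list-column examples below
library(purrr)
```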
IPUMS NHGIS supports 3 main types of data products: datasets, time series tables, and shapefiles.

A dataset contains a collection of data tables that each correspond to a particular tabulated summary statistic. A dataset is distinguished by the years, geographic levels, and topics that it covers. For instance, 2021 1-year data from the American Community Survey (ACS) is encapsulated in a single dataset. In other cases, a single census product will be split into multiple datasets.

A time series table is a longitudinal data source that links comparable statistics from multiple U.S. censuses in a single bundle. A table is comprised of one or more related time series, each of which describes a single summary statistic measured at multiple times for a given geographic level.

A shapefile (or GIS file) contains geographic data for a given geographic level and year. Typically, these files are composed of polygon geometries containing the boundaries of census reporting areas.
Of course, to make a request for any of these data sources, we have to know the codes that the API uses to refer to them. Fortunately, we can browse the metadata for all available IPUMS NHGIS data sources with `get_metadata_nhgis()`.

Users can view summary metadata for all available data sources of a given data type, or detailed metadata for a specific data source by name.

To see a summary of all available sources for a given data product type, use the `type` argument. This returns a data frame containing the available datasets, data tables, time series tables, or shapefiles.

```r
ds <- get_metadata_nhgis(type = "datasets")

head(ds)
#> # A tibble: 6 × 4
#>   name      group       description                                  sequence
#>   <chr>     <chr>       <chr>                                           <int>
#> 1 1790_cPop 1790 Census Population Data [US, States & Counties]           101
#> 2 1800_cPop 1800 Census Population Data [US, States & Counties]           201
#> 3 1810_cPop 1810 Census Population Data [US, States & Counties]           301
#> 4 1820_cPop 1820 Census Population Data [US, States & Counties]           401
#> 5 1830_cPop 1830 Census Population Data [US, States & Counties]           501
#> 6 1840_cAg  1840 Census Agriculture Data [US, States & Counties]          601
```
We can use basic functions from dplyr to filter the metadata to those records of interest. For instance, if we wanted to find all the data sources related to agriculture from the 1900 Census, we could filter on `group` and `description`:

```r
ds %>%
  filter(
    group == "1900 Census",
    grepl("Agriculture", description)
  )
#> # A tibble: 2 × 4
#>   name       group       description                                                sequence
#>   <chr>      <chr>       <chr>                                                         <int>
#> 1 1900_cAg   1900 Census Agriculture Data [US, States & Counties]                       1401
#> 2 1900_cPHAM 1900 Census Population, Housing, Agriculture & Manufactur…                 1403
```
The values listed in the `name` column correspond to the code that you would use to request that dataset when creating an extract definition to be submitted to the IPUMS API.

Similarly, for time series tables:

```r
tst <- get_metadata_nhgis("time_series_tables")
```
While some of the metadata fields are consistent across different data types, some, like `geographic_integration`, are specific to time series tables:

```r
head(tst)
#> # A tibble: 6 × 7
#>   name  description         geographic_integration sequence time_series years
#>   <chr> <chr>               <chr>                     <dbl> <list>      <list>
#> 1 A00   Total Population    Nominal                    100. <tibble>    <tibble>
#> 2 AV0   Total Population    Nominal                    100. <tibble>    <tibble>
#> 3 B78   Total Population    Nominal                    100. <tibble>    <tibble>
#> 4 CL8   Total Population    Standardized to 2010       100. <tibble>    <tibble>
#> 5 A57   Persons by Urban/R… Nominal                    101. <tibble>    <tibble>
#> 6 A59   Persons by Urban/R… Nominal                    101. <tibble>    <tibble>
#> # ℹ 1 more variable: geog_levels <list>
```
Note that for time series tables, some metadata fields are stored in list columns, where each entry is itself a data frame:

```r
tst$years[[1]]
#> # A tibble: 24 × 3
#>    name  description sequence
#>    <chr> <chr>          <int>
#>  1 1790  1790               1
#>  2 1800  1800               2
#>  3 1810  1810               3
#>  4 1820  1820               4
#>  5 1830  1830               5
#>  6 1840  1840               6
#>  7 1850  1850               7
#>  8 1860  1860               8
#>  9 1870  1870              12
#> 10 1880  1880              22
#> # ℹ 14 more rows

tst$geog_levels[[1]]
#> # A tibble: 2 × 3
#>   name   description   sequence
#>   <chr>  <chr>            <int>
#> 1 state  State                4
#> 2 county State--County       25
```
To filter on these columns, we can use `map_lgl()` from purrr. For instance, to find all time series tables that include data from a particular year:

```r
# Iterate over each `years` entry, identifying whether that entry
# contains "1840" in its `name` column.
tst %>%
  filter(map_lgl(years, ~ "1840" %in% .x$name))
#> # A tibble: 2 × 7
#>   name  description        geographic_integration sequence time_series years
#>   <chr> <chr>              <chr>                     <dbl> <list>      <list>
#> 1 A00   Total Population   Nominal                    100. <tibble>    <tibble>
#> 2 A08   Persons by Sex [2] Nominal                    102. <tibble>    <tibble>
#> # ℹ 1 more variable: geog_levels <list>
```
For more details on working with nested data frames, see this tidyr article.

Once we have identified a data source of interest, we can find out more about its detailed options by providing its name to the corresponding argument of `get_metadata_nhgis()`:

```r
cAg_meta <- get_metadata_nhgis(dataset = "1900_cAg")
```
This provides a comprehensive list of the possible specifications for the input data source. For instance, for the `1900_cAg` dataset, we have 66 tables to choose from, and 3 possible geographic levels:

```r
cAg_meta$data_tables
#> # A tibble: 66 × 4
#>    name  nhgis_code description                           sequence
#>    <chr> <chr>      <chr>                                    <int>
#>  1 NT1   AWS        Total Population                             1
#>  2 NT2   AW3        Number of Farms                              2
#>  3 NT3   AXE        Average Farm Size                            3
#>  4 NT4   AXP        Farm Acreage                                 4
#>  5 NT5   AXZ        Farm Management                              5
#>  6 NT6   AYA        Race of Farmer                               6
#>  7 NT7   AYJ        Race of Farmer by Detailed Management        7
#>  8 NT8   AYK        Number of Farms                              8
#>  9 NT9   AYL        Farms with Buildings                         9
#> 10 NT10  AWT        Acres of Farmland                           10
#> # ℹ 56 more rows

cAg_meta$geog_levels
#> # A tibble: 3 × 4
#>   name   description   has_geog_extent_selection sequence
#>   <chr>  <chr>         <lgl>                        <int>
#> 1 nation Nation        FALSE                            1
#> 2 state  State         FALSE                            4
#> 3 county State--County FALSE                           25
```
You can also get detailed metadata for an individual data table. Since data tables belong to specific datasets, both need to be specified to identify a data table:

```r
get_metadata_nhgis(dataset = "1900_cAg", data_table = "NT2")
#> $name
#> [1] "NT2"
#>
#> $description
#> [1] "Number of Farms"
#>
#> $universe
#> [1] "Farms"
#>
#> $nhgis_code
#> [1] "AW3"
#>
#> $sequence
#> [1] 2
#>
#> $dataset_name
#> [1] "1900_cAg"
#>
#> $variables
#> # A tibble: 1 × 2
#>   description nhgis_code
#>   <chr>       <chr>
#> 1 Total       AW3001
```
Note that the `name` element is the one that contains the codes used for interacting with the IPUMS API. The `nhgis_code` element refers to the prefix attached to individual variables in the output data, and the API will throw an error if you use it in an extract definition. For more details on interpreting each of the provided metadata elements, see the documentation for `get_metadata_nhgis()`.
Now that we have identified some of our options, we can go ahead and define an extract request to submit to the IPUMS API.

To create an extract definition containing the specifications for a specific set of IPUMS NHGIS data, use `define_extract_nhgis()`.
When you define an extract request, you can specify the data to be included in the extract and indicate the desired format and layout.

Let's say we're interested in getting state-level data on the total population and number of farms from the `1900_cAg` dataset that we identified above. As we can see in the metadata, these data are contained in tables `NT1` and `NT2`:
```r
cAg_meta$data_tables
#> # A tibble: 66 × 4
#>    name  nhgis_code description                           sequence
#>    <chr> <chr>      <chr>                                    <int>
#>  1 NT1   AWS        Total Population                             1
#>  2 NT2   AW3        Number of Farms                              2
#>  3 NT3   AXE        Average Farm Size                            3
#>  4 NT4   AXP        Farm Acreage                                 4
#>  5 NT5   AXZ        Farm Management                              5
#>  6 NT6   AYA        Race of Farmer                               6
#>  7 NT7   AYJ        Race of Farmer by Detailed Management        7
#>  8 NT8   AYK        Number of Farms                              8
#>  9 NT9   AYL        Farms with Buildings                         9
#> 10 NT10  AWT        Acres of Farmland                           10
#> # ℹ 56 more rows
```
To request these data, we need to make an explicit dataset specification. All datasets must be associated with a selection of data tables and geographic levels. We can use the `ds_spec()` helper function to specify our selections for these parameters. `ds_spec()` bundles all the selections for a given dataset together into a single object (in this case, a `ds_spec` object):

```r
dataset <- ds_spec(
  "1900_cAg",
  data_tables = c("NT1", "NT2"),
  geog_levels = "state"
)

str(dataset)
#> List of 3
#>  $ name       : chr "1900_cAg"
#>  $ data_tables: chr [1:2] "NT1" "NT2"
#>  $ geog_levels: chr "state"
#>  - attr(*, "class")= chr [1:3] "ds_spec" "ipums_spec" "list"
```
This dataset specification can then be provided to the extract definition:

```r
nhgis_ext <- define_extract_nhgis(
  description = "Example farm data in 1900",
  datasets = dataset
)

nhgis_ext
#> Unsubmitted IPUMS NHGIS extract
#> Description: Example farm data in 1900
#>
#> Dataset: 1900_cAg
#>   Tables: NT1, NT2
#>   Geog Levels: state
```
Dataset specifications can also include selections for `years` and `breakdown_values`, but these are not available for all datasets.

Similarly, to make a request for time series tables, use the `tst_spec()` helper. This makes a `tst_spec` object containing a time series table specification.

Time series tables do not contain individual data tables, but do require a geographic level selection, and allow an optional selection of years:

```r
define_extract_nhgis(
  description = "Example time series table request",
  time_series_tables = tst_spec(
    "CW3",
    geog_levels = c("county", "tract"),
    years = c("1990", "2000")
  )
)
#> Unsubmitted IPUMS NHGIS extract
#> Description: Example time series table request
#>
#> Time Series Table: CW3
#>   Geog Levels: county, tract
#>   Years: 1990, 2000
```
Shapefiles don't have any additional specification options, and therefore can be requested simply by providing their names:

```r
define_extract_nhgis(
  description = "Example shapefiles request",
  shapefiles = c("us_county_2021_tl2021", "us_county_2020_tl2020")
)
#> Unsubmitted IPUMS NHGIS extract
#> Description: Example shapefiles request
#>
#> Shapefiles: us_county_2021_tl2021, us_county_2020_tl2020
```
An attempt to define an extract that does not have all the required specifications for a given dataset or time series table will throw an error:

```r
define_extract_nhgis(
  description = "Invalid extract",
  datasets = ds_spec("1900_STF1", data_tables = "NP1")
)
#> Error in `validate_ipums_extract()`:
#> ! Invalid `ds_spec` specification:
#> ✖ `geog_levels` must not contain missing values.
```
Note that it is still possible to make invalid extract requests (for instance, by requesting a dataset or data table that doesn't exist). This kind of issue will be caught upon submission to the API, not upon the creation of the extract definition.

It's possible to request data for multiple datasets (or time series tables) in a single extract definition. To do so, pass a `list` of `ds_spec` or `tst_spec` objects in `define_extract_nhgis()`:

```r
define_extract_nhgis(
  description = "Slightly more complicated extract request",
  datasets = list(
    ds_spec("2018_ACS1", "B01001", "state"),
    ds_spec("2019_ACS1", "B01001", "state")
  ),
  shapefiles = c("us_state_2018_tl2018", "us_state_2019_tl2019")
)
#> Unsubmitted IPUMS NHGIS extract
#> Description: Slightly more complicated extract request
#>
#> Dataset: 2018_ACS1
#>   Tables: B01001
#>   Geog Levels: state
#>
#> Dataset: 2019_ACS1
#>   Tables: B01001
#>   Geog Levels: state
#>
#> Shapefiles: us_state_2018_tl2018, us_state_2019_tl2019
```
For extracts with multiple datasets or time series tables, it may be easier to generate the specifications independently before creating your extract request object. You can quickly create multiple `ds_spec` objects by iterating across the specifications you want to include. Here, we use purrr to do so, but you could also use a `for` loop:

```r
ds_names <- c("2019_ACS1", "2018_ACS1")
tables <- c("B01001", "B01002")
geogs <- c("county", "state")

# For each dataset to include, create a specification with the
# data tables and geog levels indicated above
datasets <- purrr::map(
  ds_names,
  ~ ds_spec(name = .x, data_tables = tables, geog_levels = geogs)
)

nhgis_ext <- define_extract_nhgis(
  description = "Slightly more complicated extract request",
  datasets = datasets
)

nhgis_ext
#> Unsubmitted IPUMS NHGIS extract
#> Description: Slightly more complicated extract request
#>
#> Dataset: 2019_ACS1
#>   Tables: B01001, B01002
#>   Geog Levels: county, state
#>
#> Dataset: 2018_ACS1
#>   Tables: B01001, B01002
#>   Geog Levels: county, state
```
This workflow also makes it easy to quickly update the specifications in the future. For instance, to add the 2017 ACS 1-year data to the extract definition above, you'd only need to add `"2017_ACS1"` to the `ds_names` variable, as shown below. The iteration would automatically add the selected tables and geog levels for the new dataset. (This workflow works particularly well for ACS datasets, which often have the same data table names across datasets.)
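```r
# Only `ds_names` changes; the mapped specification picks up the new dataset
ds_names <- c("2019_ACS1", "2018_ACS1", "2017_ACS1")

datasets <- purrr::map(
  ds_names,
  ~ ds_spec(name = .x, data_tables = tables, geog_levels = geogs)
)
```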
IPUMS NHGIS extract definitions also support additional options to modify the layout and format of the extract's resulting data files.

For extracts that contain time series tables, the `tst_layout` argument indicates how the longitudinal data should be organized.

For extracts that contain datasets with multiple breakdowns or data types, use the `breakdown_and_data_type_layout` argument to specify a layout. This is most common for data sources that contain both estimates and margins of error, like the ACS.

File formats can be specified with the `data_format` argument. IPUMS NHGIS currently distributes files in csv and fixed-width format.

See the documentation for `define_extract_nhgis()` for more details on these options.
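As a rough sketch of how these options fit together (the particular argument values here are illustrative assumptions; see the function documentation for the full set of accepted values):

```r
define_extract_nhgis(
  description = "Layout and format options",
  time_series_tables = tst_spec("CW3", geog_levels = "state"),
  tst_layout = "time_by_row_layout", # one row per time period
  data_format = "csv_header"         # csv with an extra descriptive header row
)
```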
Once you have defined an extract request, you can submit the extract for processing:

```r
nhgis_ext_submitted <- submit_extract(nhgis_ext)
```
The workflow for submitting and monitoring an extract request and downloading its files when complete is described in the IPUMS API introduction.

The IPUMS API provides two asset types, both of which are supported by ipumsr:

- IPUMS extracts: submit extract requests and download their data files
- IPUMS metadata: browse the data sources available in a given collection

Use of the IPUMS API enables the adoption of a programmatic workflow that can help users to document their work, collaborate with other researchers, and update their analyses in the future.

The basic workflow for interacting with the IPUMS API is as follows:

1. Define the parameters of an extract request
2. Submit the extract request to the IPUMS API
3. Wait for the extract to complete
4. Download the extract's data files
5. Read the data into R

Before getting started, we'll load the necessary packages for the examples in this vignette:
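```r
library(ipumsr)
# dplyr is assumed here; later examples use the pipe, and purrr is called
# via the purrr:: namespace
library(dplyr)
```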
IPUMS extract support is currently available via API for the following collections: IPUMS USA, IPUMS CPS, IPUMS International, and IPUMS NHGIS.

Note that this support only includes data available via a collection's extract engine. Many collections provide additional data via direct download, but these products are not supported by the IPUMS API.

IPUMS metadata support is currently available via API for IPUMS NHGIS.

API support will continue to be added for more collections in the future. You can check general API availability for all IPUMS collections with `ipums_data_collections()`.
```r
ipums_data_collections()
#> # A tibble: 14 × 4
#>    collection_name     collection_type code_for_api api_support
#>    <chr>               <chr>           <chr>        <lgl>
#>  1 IPUMS USA           microdata       usa          TRUE
#>  2 IPUMS CPS           microdata       cps          TRUE
#>  3 IPUMS International microdata       ipumsi       TRUE
#>  4 IPUMS NHGIS         aggregate data  nhgis        TRUE
#>  5 IPUMS IHGIS         aggregate data  ihgis        FALSE
#>  6 IPUMS ATUS          microdata       atus         FALSE
#>  7 IPUMS AHTUS         microdata       ahtus        FALSE
#>  8 IPUMS MTUS          microdata       mtus         FALSE
#>  9 IPUMS DHS           microdata       dhs          FALSE
#> 10 IPUMS PMA           microdata       pma          FALSE
#> 11 IPUMS MICS          microdata       mics         FALSE
#> 12 IPUMS NHIS          microdata       nhis         FALSE
#> 13 IPUMS MEPS          microdata       meps         FALSE
#> 14 IPUMS Higher Ed     microdata       highered     FALSE
```
Note that the tools in ipumsr may not necessarily support all the functionality currently supported by the IPUMS API. See the API documentation for more information about its latest features.

To interact with the IPUMS API, you'll need to register for access with the IPUMS project you'll be using. If you have not yet registered, you can find links to register for each of the API-supported IPUMS collections in the IPUMS API documentation.

Once you're registered, you'll be able to create an API key.
By default, ipumsr API functions assume that your key is stored in the `IPUMS_API_KEY` environment variable. You can also provide your key directly to these functions, but storing it in an environment variable saves you some typing and helps prevent you from inadvertently sharing your key with others (for instance, on GitHub).
You can save your API key to the `IPUMS_API_KEY` environment variable with `set_ipums_api_key()`. To save your key for use in future sessions, set `save = TRUE`. This will add your API key to your `.Renviron` file in your user home directory.

```r
# Save key in .Renviron for use across sessions
set_ipums_api_key("paste-your-key-here", save = TRUE)
```
The rest of this vignette assumes you have obtained an API key and stored it in the `IPUMS_API_KEY` environment variable.

Each IPUMS collection has its own extract definition function that is used to specify the parameters of a new extract request from scratch. These functions take the form `define_extract_*()`:

- `define_extract_usa()`
- `define_extract_cps()`
- `define_extract_ipumsi()`
- `define_extract_nhgis()`
When you define an extract request, you can specify the data to be included in the extract and indicate the desired format and layout.

For instance, the following defines a simple IPUMS USA extract request for the `AGE`, `SEX`, `RACE`, `STATEFIP`, and `MARST` variables from the 2018 and 2019 American Community Survey (ACS):

```r
usa_ext_def <- define_extract_usa(
  description = "USA extract for API vignette",
  samples = c("us2018a", "us2019a"),
  variables = c("AGE", "SEX", "RACE", "STATEFIP", "MARST")
)

usa_ext_def
#> Unsubmitted IPUMS USA extract
#> Description: USA extract for API vignette
#>
#> Samples: (2 total) us2018a, us2019a
#> Variables: (5 total) AGE, SEX, RACE, STATEFIP, MARST
```
The exact extract definition options vary across collections, but all collections can be used with the same general workflow. For more details on the available extract definition options, see the associated microdata and NHGIS vignettes.

For the purposes of demonstrating the overall workflow, we will continue to work with the sample IPUMS USA extract definition created above.

`define_extract_*()` functions always produce an `ipums_extract` object, which can be handled by other API functions (see `?ipums_extract`). Furthermore, these objects will have a subclass for the particular collection with which they are associated.

```r
class(usa_ext_def)
#> [1] "usa_extract"   "micro_extract" "ipums_extract" "list"
```
Many of the specifications for a given extract request object can be accessed by indexing the object:

```r
names(usa_ext_def$samples)
#> [1] "us2018a" "us2019a"

names(usa_ext_def$variables)
#> [1] "AGE"      "SEX"      "RACE"     "STATEFIP" "MARST"

usa_ext_def$data_format
#> [1] "fixed_width"
```
`ipums_extract` objects also contain information about the extract request's processing status and its assigned extract number, which serves as an identifier for the extract request. Since this extract request is still unsubmitted, it has no request number:

```r
usa_ext_def$status
#> [1] "unsubmitted"

usa_ext_def$number
#> [1] NA
```
To obtain the data requested in the extract definition, we must first submit it to the IPUMS API for processing.

To submit an extract definition, use `submit_extract()`.

If no errors are detected in the extract definition, a submitted extract request will be returned with its assigned number and status. Storing the returned object can be useful for checking the extract request's status later.

```r
usa_ext_submitted <- submit_extract(usa_ext_def)
#> Successfully submitted IPUMS USA extract number 348
```
The extract number will be stored in the returned object:

```r
usa_ext_submitted$number
#> [1] 348

usa_ext_submitted$status
#> [1] "queued"
```
Note that some fields of a submitted extract may be automatically updated by the API upon submission. For instance, for microdata extracts, additional preselected variables may be added to the extract even if they weren't specified explicitly in the extract definition.

```r
names(usa_ext_submitted$variables)
#>  [1] "YEAR"     "SAMPLE"   "SERIAL"   "CBSERIAL" "HHWT"     "CLUSTER"
#>  [7] "STATEFIP" "STRATA"   "GQ"       "PERNUM"   "PERWT"    "SEX"
#> [13] "AGE"      "MARST"    "RACE"
```
If you forget to store the updated extract object returned by `submit_extract()`, you can use the `get_last_extract_info()` helper to request the information for your most recent extract request for a given collection:

```r
usa_ext_submitted <- get_last_extract_info("usa")

usa_ext_submitted$number
#> [1] 348
```
It may take some time for the IPUMS servers to process your extract request. You can ensure that an extract has finished processing before you attempt to download its files by using `wait_for_extract()`. This polls the API regularly until processing has completed (by default, each interval increases by 10 seconds). It then returns an `ipums_extract` object containing the completed extract definition.

```r
usa_ext_complete <- wait_for_extract(usa_ext_submitted)
#> Checking extract status...
#> Waiting 10 seconds...
#> Checking extract status...
#> IPUMS USA extract 348 is ready to download.

usa_ext_complete$status
#> [1] "completed"

# `download_links` should be populated if the extract is ready for download
names(usa_ext_complete$download_links)
#> [1] "r_command_file"     "basic_codebook"     "data"
#> [4] "stata_command_file" "sas_command_file"   "spss_command_file"
#> [7] "ddi_codebook"
```
Note that `wait_for_extract()` will tie up your R session until your extract is ready to download. While this is fine in a strictly programmatic workflow, it may be frustrating when working interactively, especially for large extracts or when the IPUMS servers are busy.

In these cases, you can manually check whether an extract is ready for download with `is_extract_ready()`. As long as this returns `TRUE`, you should be able to download your extract's files.

```r
is_extract_ready(usa_ext_submitted)
#> [1] TRUE
```
For a more detailed status check, provide the extract's collection and number to `get_extract_info()`. This returns an `ipums_extract` object reflecting the requested extract definition with the most current status. The `status` of a submitted extract will be one of `"queued"`, `"started"`, `"produced"`, `"canceled"`, `"failed"`, or `"completed"`.

```r
usa_ext_submitted <- get_extract_info(usa_ext_submitted)

usa_ext_submitted$status
#> [1] "completed"
```
Note that extracts are removed from the IPUMS servers after a set period of time (72 hours for microdata collections, 2 weeks for IPUMS NHGIS). Therefore, an extract that has a `"completed"` status may still be unavailable for download.

`is_extract_ready()` will alert you if the extract has expired and needs to be resubmitted. Simply use `submit_extract()` to resubmit an extract request, as in the sketch below. Note that this will produce a new extract (with a new extract number), even if the extract definition is identical.
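```r
# Resubmitting an expired extract creates a new extract with a new number;
# the extract definition itself is unchanged
usa_ext_resubmitted <- submit_extract(usa_ext_submitted)
```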
Once your extract has finished processing, use `download_extract()` to download the extract's data files to your local machine. This will return the path to the downloaded file(s) required to load the data into R.

For microdata collections, this will be the path to the DDI codebook (.xml) file, which can be used to read the associated data (contained in a .dat.gz file).

For NHGIS, this will be a path to the .zip archive containing the requested data files and/or shapefiles.

```r
# By default, downloads to your current working directory
filepath <- download_extract(usa_ext_submitted)
```
The files produced by `download_extract()` can be passed directly into the reader functions provided by ipumsr. For instance, for microdata projects:

```r
ddi <- read_ipums_ddi(filepath)
micro_data <- read_ipums_micro(ddi)
```

If instead you're working with an NHGIS extract, use `read_nhgis()` or `read_ipums_sf()`.
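A minimal sketch for a hypothetical NHGIS extract object (`nhgis_ext_submitted`) containing both tabular and spatial data; the `"data"` element name matches the pipeline example later in this vignette, while the `"shape"` element name is an assumption:

```r
nhgis_files <- download_extract(nhgis_ext_submitted)

# Read the tabular data and the spatial data from the downloaded archives
nhgis_data <- read_nhgis(nhgis_files["data"])
nhgis_shape <- read_ipums_sf(nhgis_files["shape"])
```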
See the associated vignette for more information about loading IPUMS data into R.

To retrieve the definition corresponding to a particular extract, provide its collection and number to `get_extract_info()`. These can be provided either as a single string of the form `"collection:number"` or as a length-2 vector: `c(collection, number)`. Several other API functions support this syntax as well.
```r
usa_ext <- get_extract_info("usa:47")

# Alternatively:
usa_ext <- get_extract_info(c("usa", 47))

usa_ext
#> Submitted IPUMS USA extract number 47
#> Description: Test extract
#>
#> Samples: (1 total) us2017b
#> Variables: (8 total) YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, GQ, PERNUM, PERWT
```
If you know you made a specific extract definition in the past, but you can't remember the exact number, you can use `get_extract_history()` to peruse your recent extract requests for a particular collection.

By default, this returns your 10 most recent extract requests as a list of `ipums_extract` objects. You can adjust how many requests to retrieve with the `how_many` argument:
```r
usa_extracts <- get_extract_history("usa", how_many = 3)

usa_extracts
#> [[1]]
#> Submitted IPUMS USA extract number 348
#> Description: USA extract for API vignette
#>
#> Samples: (2 total) us2018a, us2019a
#> Variables: (15 total) YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, CLUSTER,...
#>
#> [[2]]
#> Submitted IPUMS USA extract number 347
#> Description: Data from long ago
#>
#> Samples: (1 total) us1880a
#> Variables: (12 total) YEAR, SAMPLE, SERIAL, HHWT, CLUSTER, STRATA, G...
#>
#> [[3]]
#> Submitted IPUMS USA extract number 346
#> Description: Data from 2017 PRCS
#>
#> Samples: (1 total) us2017b
#> Variables: (9 total) YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, GQ, PERNU...
```
Because this is a list of `ipums_extract` objects, you can operate on them with the API functions that have been introduced already.

```r
is_extract_ready(usa_extracts[[2]])
#> [1] TRUE
```
You can also iterate through your extract history to find extracts with particular characteristics. For instance, we can use `purrr::keep()` to find all extracts that contain a certain variable or are ready for download:

```r
purrr::keep(usa_extracts, ~ "MARST" %in% names(.x$variables))
#> [[1]]
#> Submitted IPUMS USA extract number 348
#> Description: USA extract for API vignette
#>
#> Samples: (2 total) us2018a, us2019a
#> Variables: (15 total) YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, CLUSTER,...

purrr::keep(usa_extracts, is_extract_ready)
#> [[1]]
#> Submitted IPUMS USA extract number 348
#> Description: USA extract for API vignette
#>
#> Samples: (2 total) us2018a, us2019a
#> Variables: (15 total) YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, CLUSTER,...
#>
#> [[2]]
#> Submitted IPUMS USA extract number 347
#> Description: Data from long ago
#>
#> Samples: (1 total) us1880a
#> Variables: (12 total) YEAR, SAMPLE, SERIAL, HHWT, CLUSTER, STRATA, G...
#>
#> [[3]]
#> Submitted IPUMS USA extract number 346
#> Description: Data from 2017 PRCS
#>
#> Samples: (1 total) us2017b
#> Variables: (9 total) YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, GQ, PERNU...
```
Or we can use the `purrr::map()` family to browse certain values:

```r
purrr::map_chr(usa_extracts, ~ .x$description)
#> [1] "USA extract for API vignette" "Data from long ago"
#> [3] "Data from 2017 PRCS"
```
If you regularly use only a single IPUMS collection, you can save yourself some typing by setting that collection as your default. `set_ipums_default_collection()` will save a specified collection to the value of the `IPUMS_DEFAULT_COLLECTION` environment variable. If you have a default collection set, API functions will use that collection in all requests, assuming no other collection is specified.

```r
set_ipums_default_collection("usa") # Set `save = TRUE` to store across sessions

# Check the default collection:
Sys.getenv("IPUMS_DEFAULT_COLLECTION")
#> [1] "usa"

# Most recent USA extract:
usa_last <- get_last_extract_info()

# Request info on extract request "usa:10"
usa_ext_10 <- get_extract_info(10)

# You can still request other collections as usual:
cps_ext_10 <- get_extract_info("cps:10")
```
One exciting feature enabled by the IPUMS API is the ability to share a standardized extract definition with other IPUMS users so that they can create an identical extract request themselves. The terms of use for most IPUMS collections prohibit the public redistribution of IPUMS data, but don't prohibit the sharing of data extract definitions.

ipumsr facilitates this type of sharing with `save_extract_as_json()` and `define_extract_from_json()`, which read and write `ipums_extract` objects to and from a standardized JSON-formatted file.

```r
usa_ext_10 <- get_extract_info("usa:10")
save_extract_as_json(usa_ext_10, file = "usa_extract_10.json")
```
At this point, you can send `usa_extract_10.json` to another user to allow them to create a duplicate `ipums_extract` object, which they can load and submit to the API themselves (as long as they have API access).

```r
clone_of_usa_ext_10 <- define_extract_from_json("usa_extract_10.json")
usa_ext_10_resubmitted <- submit_extract(clone_of_usa_ext_10)
```
Note that the code in the previous chunk assumes that the file is saved in the current working directory. If it's saved somewhere else, replace `"usa_extract_10.json"` with the full path to the file.
Occasionally, you may want to modify an existing extract definition (e.g., to update an analysis with new data). The easiest way to do so is to add the new specifications to the `define_extract_*()` code that produced the original extract definition. This is why we highly recommend that you save this code somewhere where it can be accessed and updated in the future.

However, there are cases where the original extract definition code does not exist (e.g., if the extract was created using the online IPUMS extract system). In this case, the best approach is to view the extract definition with `get_extract_info()` and create a new extract definition (using a `define_extract_*()` function) that reproduces that definition along with the desired modifications. While this may be a bit tedious for complex extract definitions, it is a one-time investment that will make any future updates to the extract definition much easier.
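For example, a minimal sketch that reproduces a hypothetical extract (number 10) while adding one new sample:

```r
# Inspect the existing definition
old_ext <- get_extract_info("usa:10")

# Rewrite the definition explicitly, adding a new sample to the originals
updated_ext <- define_extract_usa(
  description = paste(old_ext$description, "(updated)"),
  samples = c(names(old_ext$samples), "us2021a"),
  variables = names(old_ext$variables)
)
```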
Previously, we encouraged users to use the helpers `add_to_extract()` and `remove_from_extract()` when modifying extracts. We now encourage you to re-write extract definitions instead, because doing so improves reproducibility: extract definition code will always be more clear and stable if it is written explicitly, rather than based only on an old extract number. These two functions may be retired in the future.
The core API functions in ipumsr are compatible with one another such that they can be combined into a single pipeline that requests, downloads, and reads your extract data into an R data frame:

```r
usa_data <- define_extract_usa(
  "USA extract for API vignette",
  c("us2018a", "us2019a"),
  c("AGE", "SEX", "RACE", "STATEFIP")
) %>%
  submit_extract() %>%
  wait_for_extract() %>%
  download_extract() %>%
  read_ipums_micro()
```
Note that for NHGIS extracts that contain both data and shapefiles, a single file will need to be selected before reading, as `download_extract()` will return the path to each file. For instance, for a hypothetical `nhgis_extract` that contains both tabular and spatial data:

```r
nhgis_data <- download_extract(nhgis_extract) %>%
  purrr::pluck("data") %>% # Select only the tabular data file to read
  read_nhgis()
```
Not only does this API workflow allow you to obtain IPUMS data without ever leaving your R environment, but it also allows you to retain a reproducible record of your process. This makes it much easier to document your workflow, collaborate with other researchers, and update your analysis in the future.

Browsing for IPUMS data can be a little like grocery shopping when you're hungry—you show up to grab a couple things, but everything looks so good that you end up with an overflowing cart.1 Unfortunately, this can lead to extracts so large that they don't fit in your computer's memory.

If you've got an extract that's too big, both the IPUMS website and the ipumsr package have tools to help. There are four basic strategies:

1. Get more memory
2. Reduce the size of your extract
3. Read the data in increments (chunks or yields)
4. Store the data in a database and access it from R

ipumsr can't do much for you when it comes to option 1, but it can help facilitate some of the other options.
The examples in this vignette will rely on a few helpful packages. If you haven't already installed them, you can do so with:

```r
# To run the full vignette, you'll also need the following packages. If they
# aren't installed already, do so with:
install.packages("biglm")
install.packages("DBI")
install.packages("RSQLite")
install.packages("dbplyr")
```
If you need to work with a dataset that's too big for your RAM, the simplest option is to get more space. If upgrading your hardware isn't an option, paying for a cloud service like Amazon or Microsoft Azure may be worth considering. Here are guides for using R on Amazon and Microsoft Azure.

Of course, this option isn't feasible for most users—in this case, updates to the data being used in the analysis or the processing pipeline may be required.
The easiest way to reduce the size of your extract is to drop unused samples and variables. This can be done through the extract interface for the specific IPUMS project you're using or within R using the IPUMS API (for projects that are supported).

If using the API, simply update your extract definition code to exclude the specifications that you no longer need. Then, resubmit the extract request and download the new files, as in the sketch below.

See the introduction to the IPUMS API for more information about making extract requests from ipumsr.
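A minimal sketch of this revise-and-resubmit cycle (the samples and variables here are illustrative):

```r
# Slimmed-down re-definition of a previously larger extract request
smaller_ext <- define_extract_usa(
  description = "Trimmed-down USA extract",
  samples = "us2019a",                     # dropped a second sample
  variables = c("AGE", "SEX", "STATEFIP")  # dropped unused variables
)

filepath <- smaller_ext %>%
  submit_extract() %>%
  wait_for_extract() %>%
  download_extract()
```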
For microdata projects, another good option for reducing extract size is to select only those cases that are relevant to your research question, producing an extract containing only data for a particular subset of values for a given variable.

If you're using the IPUMS API, you can use `var_spec()` to specify case selections for a variable in an extract definition. For instance, the following would produce an extract only including records for married women:
```r
define_extract_usa(
  description = "2013 ACS Data for Married Women",
  samples = "us2013a",
  variables = list(
    var_spec("MARST", case_selections = "1"),
    var_spec("SEX", case_selections = "2")
  )
)
#> Unsubmitted IPUMS USA extract
#> Description: 2013 ACS Data for Married Women
#>
#> Samples: (1 total) us2013a
#> Variables: (2 total) MARST, SEX
```
If you're using the online interface, the Select Cases option will be available on the last page before submitting an extract request.

Yet another option (also only for microdata projects) is to take a random subsample of the data before producing your extract.

Sampled data is not available via the IPUMS API, but you can use the Customize Sample Size option in the online interface to do so. This also appears on the final page before submitting an extract request.

If you've already submitted the extract, you can click the REVISE link on the Download or Revise Extracts page to access these features and produce a new data extract.

ipumsr provides two related options for reading data sources in increments:

- Chunked reading, with `read_ipums_micro_chunked()` and `read_ipums_micro_list_chunked()`
- Yielded reading, with `read_ipums_micro_yield()` and `read_ipums_micro_list_yield()` (a sketch appears at the end of the tabulation example below)
Use `read_ipums_micro_chunked()` and `read_ipums_micro_list_chunked()` to read data in chunks. These are analogous to the standard `read_ipums_micro()` and `read_ipums_micro_list()` functions, but allow you to specify a function that will be applied to each data chunk and control how the results from these chunks are combined.
Below, we'll use chunking to outline solutions to three common use-cases for IPUMS data: tabulation, regression and case selection.

First, we'll load our example data. Note that we have down-sampled the data in this example for storage reasons; none of the output "results" reflected in this vignette should be considered legitimate!

```r
cps_ddi_file <- ipums_example("cps_00097.xml")
```
Imagine we wanted to find the percent of people in the workforce grouped by their self-reported health. Since our example extract is small enough to fit in memory, we could load the full dataset with `read_ipums_micro()`, use `lbl_relabel()` to relabel the `EMPSTAT` variable into a binary variable, and count the people in each group.
```r
read_ipums_micro(cps_ddi_file, verbose = FALSE) %>%
  mutate(
    HEALTH = as_factor(HEALTH),
    AT_WORK = as_factor(
      lbl_relabel(
        EMPSTAT,
        lbl(1, "Yes") ~ .lbl == "At work",
        lbl(0, "No") ~ .lbl != "At work"
      )
    )
  ) %>%
  group_by(HEALTH, AT_WORK) %>%
  summarize(n = n(), .groups = "drop")
#> # A tibble: 10 × 3
#>    HEALTH    AT_WORK     n
#>    <fct>     <fct>   <int>
#>  1 Excellent No       4055
#>  2 Excellent Yes      2900
#>  3 Very good No       3133
#>  4 Very good Yes      3371
#>  5 Good      No       2480
#>  6 Good      Yes      2178
#>  7 Fair      No       1123
#>  8 Fair      Yes       443
#>  9 Poor      No        603
#> 10 Poor      Yes        65
```
For the sake of this example, let’s imagine we can only store 1,000
-rows in memory at a time. In this case, we need to use a
-chunked
function, tabulate for each chunk, and then
-calculate the counts across all of the chunks.
The chunked
functions will apply a user-defined callback
-function to each chunk. The callback takes two arguments:
-x
, which represents the data contained in a given chunk,
-and pos
, which represents the position of the chunk,
-expressed as the line in the input file at which the chunk starts.
-Generally you will only need to use x
, but the callback
-must always take both arguments.
In this case, the callback will implement the same processing steps -that we demonstrated above:
-
-cb_function <- function(x, pos) {
- x %>%
- mutate(
- HEALTH = as_factor(HEALTH),
- AT_WORK = as_factor(
- lbl_relabel(
- EMPSTAT,
- lbl(1, "Yes") ~ .lbl == "At work",
- lbl(0, "No") ~ .lbl != "At work"
- )
- )
- ) %>%
- group_by(HEALTH, AT_WORK) %>%
- summarize(n = n(), .groups = "drop")
-}
Next, we need to create a callback object, which determines how the results from each chunk are combined. ipumsr provides three main types of callback objects that preserve variable metadata:
-IpumsDataFrameCallback combines the results from each chunk by row-binding them into a single data frame.
-IpumsListCallback returns a list with one item per chunk containing the results for that chunk. Use this when you don't want to (or can't) immediately combine the results.
-IpumsSideEffectCallback does not return any results. Use this when your callback function is intended only for its side effects (for instance, if you are saving the results for each chunk to disk).
-(ipumsr also provides a fourth callback, used for running linear regression models, discussed below.)
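-For example, a minimal sketch (not from the original example files) of a side-effect callback that saves each chunk to disk:
-
-# Hypothetical file names: each chunk is written to its own .rds file
-# rather than returned to R
-cb_save <- IpumsSideEffectCallback$new(function(x, pos) {
-  saveRDS(x, sprintf("cps_chunk_starting_at_row_%d.rds", pos))
-})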
-In this case, we want to row-bind the data frames returned by
-cb_function()
, so we use
-IpumsDataFrameCallback
.
Callback objects are R6 objects, but you don’t need to
-be familiar with R6 to use them. To initialize a callback object, simply use
-$new()
:
-cb <- IpumsDataFrameCallback$new(cb_function)
At this point, we’re ready to load the data in chunks. We use
-read_ipums_micro_chunked()
to specify the callback and
-chunk size:
-chunked_tabulations <- read_ipums_micro_chunked(
- cps_ddi_file,
- callback = cb,
- chunk_size = 1000,
- verbose = FALSE
-)
-
-chunked_tabulations
-#> # A tibble: 209 × 3
-#> HEALTH AT_WORK n
-#> <fct> <fct> <int>
-#> 1 Excellent No 183
-#> 2 Excellent Yes 147
-#> 3 Very good No 134
-#> 4 Very good Yes 217
-#> 5 Good No 111
-#> 6 Good Yes 105
-#> 7 Fair No 53
-#> 8 Fair Yes 22
-#> 9 Poor No 27
-#> 10 Poor Yes 1
-#> # ℹ 199 more rows
Now we have a data frame with the counts by health and work status -within each chunk. To get the full table, we just need to sum by health -and work status one more time:
-
-chunked_tabulations %>%
- group_by(HEALTH, AT_WORK) %>%
- summarize(n = sum(n), .groups = "drop")
-#> # A tibble: 10 × 3
-#> HEALTH AT_WORK n
-#> <fct> <fct> <int>
-#> 1 Excellent No 4055
-#> 2 Excellent Yes 2900
-#> 3 Very good No 3133
-#> 4 Very good Yes 3371
-#> 5 Good No 2480
-#> 6 Good Yes 2178
-#> 7 Fair No 1123
-#> 8 Fair Yes 443
-#> 9 Poor No 603
-#> 10 Poor Yes 65
With the biglm package, it is possible to use R to perform a
-regression on data that is too large to store in memory all at once. The
-ipumsr package provides another callback designed to make this simple:
-IpumsBiglmCallback
.
In this example, we’ll conduct a regression with total hours worked
-(AHRSWORKT
) as the outcome and age (AGE
) and
-self-reported health (HEALTH
) as predictors. (Note that
-this is intended as a code demonstration, so we ignore many complexities
-that should be addressed in real analyses.)
If we were running the analysis on our full dataset, we’d first load -our data and prepare the variables in our analysis for use in the -model:
-
-data <- read_ipums_micro(cps_ddi_file, verbose = FALSE) %>%
- mutate(
- HEALTH = as_factor(HEALTH),
- AHRSWORKT = lbl_na_if(AHRSWORKT, ~ .lbl == "NIU (Not in universe)"),
- AT_WORK = as_factor(
- lbl_relabel(
- EMPSTAT,
- lbl(1, "Yes") ~ .lbl == "At work",
- lbl(0, "No") ~ .lbl != "At work"
- )
- )
- ) %>%
- filter(AT_WORK == "Yes")
Then, we’d provide our model formula and data to lm
:
-model <- lm(AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, data = data)
-summary(model)
-#>
-#> Call:
-#> lm(formula = AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, data = data)
-#>
-#> Residuals:
-#> Min 1Q Median 3Q Max
-#> -41.217 -4.734 -0.077 5.957 63.994
-#>
-#> Coefficients:
-#> Estimate Std. Error t value Pr(>|t|)
-#> (Intercept) 5.2440289 1.1823985 4.435 9.31e-06 ***
-#> AGE 1.5868169 0.0573268 27.680 < 2e-16 ***
-#> I(AGE^2) -0.0170043 0.0006568 -25.888 < 2e-16 ***
-#> HEALTHVery good -0.2550306 0.3276759 -0.778 0.436412
-#> HEALTHGood -0.9637395 0.3704123 -2.602 0.009289 **
-#> HEALTHFair -3.8899430 0.6629725 -5.867 4.58e-09 ***
-#> HEALTHPoor -5.7597200 1.6197136 -3.556 0.000378 ***
-#> ---
-#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
-#>
-#> Residual standard error: 12.88 on 8950 degrees of freedom
-#> Multiple R-squared: 0.08711, Adjusted R-squared: 0.0865
-#> F-statistic: 142.3 on 6 and 8950 DF, p-value: < 2.2e-16
To do the same regression, but with only 1,000 rows loaded at a time, -we work in a similar manner.
-First we make an IpumsBiglmCallback
callback object. We
-provide the model formula as well as the code used to process the data
-before running the regression:
-library(biglm)
-#> Loading required package: DBI
-
-biglm_cb <- IpumsBiglmCallback$new(
- model = AHRSWORKT ~ AGE + I(AGE^2) + HEALTH,
- prep = function(x, pos) {
- x %>%
- mutate(
- HEALTH = as_factor(HEALTH),
- AHRSWORKT = lbl_na_if(AHRSWORKT, ~ .lbl == "NIU (Not in universe)"),
- AT_WORK = as_factor(
- lbl_relabel(
- EMPSTAT,
- lbl(1, "Yes") ~ .lbl == "At work",
- lbl(0, "No") ~ .lbl != "At work"
- )
- )
- ) %>%
- filter(AT_WORK == "Yes")
- }
-)
And then we read the data using
-read_ipums_micro_chunked()
, passing the callback that we
-just made.
-chunked_model <- read_ipums_micro_chunked(
- cps_ddi_file,
- callback = biglm_cb,
- chunk_size = 1000,
- verbose = FALSE
-)
-
-summary(chunked_model)
-#> Large data regression model: biglm(AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, data, ...)
-#> Sample size = 8957
-#> Coef (95% CI) SE p
-#> (Intercept) 5.2440 2.8792 7.6088 1.1824 0.0000
-#> AGE 1.5868 1.4722 1.7015 0.0573 0.0000
-#> I(AGE^2) -0.0170 -0.0183 -0.0157 0.0007 0.0000
-#> HEALTHVery good -0.2550 -0.9104 0.4003 0.3277 0.4364
-#> HEALTHGood -0.9637 -1.7046 -0.2229 0.3704 0.0093
-#> HEALTHFair -3.8899 -5.2159 -2.5640 0.6630 0.0000
-#> HEALTHPoor -5.7597 -8.9991 -2.5203 1.6197 0.0004
In addition to chunked reading, ipumsr also provides the similar but -more flexible “yielded” reading.
-read_ipums_micro_yield()
and
-read_ipums_micro_list_yield()
grant you more freedom in
-determining what R code to run between chunks and include the ability to
-have multiple files open at once. Additionally, yields are compatible
-with the bigglm
function from biglm, which allows you to
-run glm models on data larger than memory.
The downside to this greater control is that yields have an API that -is unique to IPUMS data and the way they work is unusual for R code.
-We’ll compare the yield
and chunked
-functions by conducting the same tabulation
-example from above using yields.
First, we create the yield object with the function
-read_ipums_micro_yield()
:
-data <- read_ipums_micro_yield(cps_ddi_file, verbose = FALSE)
This function returns an R6
object which contains
-methods for reading the data. The most important method is the
-yield()
method which will return n
rows of
-data:
-# Return the first 10 rows of data
-data$yield(10)
-#> # A tibble: 10 × 14
-#> YEAR SERIAL MONTH CPSID ASECFLAG ASECWTH FOODSTMP PERNUM CPSIDP ASECWT
-#> <dbl> <dbl> <int+lb> <dbl> <int+lb> <dbl> <int+lb> <dbl> <dbl> <dbl>
-#> 1 2011 33 3 [Marc… 2.01e13 1 [ASEC] 308. 1 [No] 1 2.01e13 308.
-#> 2 2011 33 3 [Marc… 2.01e13 1 [ASEC] 308. 1 [No] 2 2.01e13 217.
-#> 3 2011 33 3 [Marc… 2.01e13 1 [ASEC] 308. 1 [No] 3 2.01e13 249.
-#> 4 2011 46 3 [Marc… 2.01e13 1 [ASEC] 266. 1 [No] 1 2.01e13 266.
-#> 5 2011 46 3 [Marc… 2.01e13 1 [ASEC] 266. 1 [No] 2 2.01e13 266.
-#> 6 2011 46 3 [Marc… 2.01e13 1 [ASEC] 266. 1 [No] 3 2.01e13 265.
-#> 7 2011 46 3 [Marc… 2.01e13 1 [ASEC] 266. 1 [No] 4 2.01e13 296.
-#> 8 2011 64 3 [Marc… 2.01e13 1 [ASEC] 241. 1 [No] 1 2.01e13 241.
-#> 9 2011 64 3 [Marc… 2.01e13 1 [ASEC] 241. 1 [No] 2 2.01e13 241.
-#> 10 2011 64 3 [Marc… 2.01e13 1 [ASEC] 241. 1 [No] 3 2.01e13 278.
-#> # ℹ 4 more variables: AGE <int+lbl>, EMPSTAT <int+lbl>, AHRSWORKT <dbl+lbl>,
-#> # HEALTH <int+lbl>
Note that the row position in the data is stored in the object, so -running the same code again will produce different rows of -data:
-
-# Return the next 10 rows of data
-data$yield(10)
-#> # A tibble: 10 × 14
-#> YEAR SERIAL MONTH CPSID ASECFLAG ASECWTH FOODSTMP PERNUM CPSIDP ASECWT
-#> <dbl> <dbl> <int+lb> <dbl> <int+lb> <dbl> <int+lb> <dbl> <dbl> <dbl>
-#> 1 2011 82 3 [Marc… 0 1 [ASEC] 373. 1 [No] 1 0 373.
-#> 2 2011 82 3 [Marc… 0 1 [ASEC] 373. 1 [No] 2 0 373.
-#> 3 2011 82 3 [Marc… 0 1 [ASEC] 373. 1 [No] 3 0 326.
-#> 4 2011 86 3 [Marc… 2.01e13 1 [ASEC] 554. 1 [No] 1 2.01e13 554.
-#> 5 2011 104 3 [Marc… 2.01e13 1 [ASEC] 543. 1 [No] 1 2.01e13 543.
-#> 6 2011 104 3 [Marc… 2.01e13 1 [ASEC] 543. 1 [No] 2 2.01e13 543.
-#> 7 2011 106 3 [Marc… 2.01e13 1 [ASEC] 543. 1 [No] 1 2.01e13 543.
-#> 8 2011 137 3 [Marc… 2.01e13 1 [ASEC] 271. 1 [No] 1 2.01e13 271.
-#> 9 2011 137 3 [Marc… 2.01e13 1 [ASEC] 271. 1 [No] 2 2.01e13 271.
-#> 10 2011 137 3 [Marc… 2.01e13 1 [ASEC] 271. 1 [No] 3 2.01e13 365.
-#> # ℹ 4 more variables: AGE <int+lbl>, EMPSTAT <int+lbl>, AHRSWORKT <dbl+lbl>,
-#> # HEALTH <int+lbl>
Use cur_pos
to get the current position in the data
-file:
-data$cur_pos
-#> [1] 21
The is_done()
method tells us whether we have read the
-entire file yet:
-data$is_done()
-#> [1] FALSE
In preparation for our actual example, we’ll use reset()
-to reset to the beginning of the data:
-data$reset()
Using yield()
and is_done()
, we can set up
-our processing pipeline. First, we create an empty placeholder tibble to
-store our results:
-yield_results <- tibble(
- HEALTH = factor(levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
- AT_WORK = factor(levels = c("No", "Yes")),
- n = integer(0)
-)
Then, we iterate through the data, yielding 1,000 rows at a time and -processing the results as we did in the chunked example. The iteration -will end when we’ve finished reading the entire file.
-
-while (!data$is_done()) {
- # Yield new data and process
- new <- data$yield(n = 1000) %>%
- mutate(
- HEALTH = as_factor(HEALTH),
- AT_WORK = as_factor(
- lbl_relabel(
- EMPSTAT,
- lbl(1, "Yes") ~ .lbl == "At work",
- lbl(0, "No") ~ .lbl != "At work"
- )
- )
- ) %>%
- group_by(HEALTH, AT_WORK) %>%
- summarize(n = n(), .groups = "drop")
-
- # Combine the new yield with the previously processed yields
- yield_results <- bind_rows(yield_results, new) %>%
- group_by(HEALTH, AT_WORK) %>%
- summarize(n = sum(n), .groups = "drop")
-}
-
-yield_results
-#> # A tibble: 10 × 3
-#> HEALTH AT_WORK n
-#> <fct> <fct> <int>
-#> 1 Excellent No 4055
-#> 2 Excellent Yes 2900
-#> 3 Very good No 3133
-#> 4 Very good Yes 3371
-#> 5 Good No 2480
-#> 6 Good Yes 2178
-#> 7 Fair No 1123
-#> 8 Fair Yes 443
-#> 9 Poor No 603
-#> 10 Poor Yes 65
One of the major benefits of the yielded reading over chunked reading -is that it is compatible with the GLM functions from biglm, allowing for -the use of more complicated models.
-To run a logistic regression, we first need to reset our yield object -from the previous example:
-
-data$reset()
Next we make a function that takes a single argument:
-reset
. When reset
is TRUE
, it
-resets the data to the beginning. This is dictated by
-bigglm
from biglm.
To create this function, we use the reset()
method
-from the yield object:
-get_model_data <- function(reset) {
- if (reset) {
- data$reset()
- } else {
- yield <- data$yield(n = 1000)
-
- if (is.null(yield)) {
- return(yield)
- }
-
- yield %>%
- mutate(
- HEALTH = as_factor(HEALTH),
- WORK30PLUS = lbl_na_if(AHRSWORKT, ~ .lbl == "NIU (Not in universe)") >= 30,
- AT_WORK = as_factor(
- lbl_relabel(
- EMPSTAT,
- lbl(1, "Yes") ~ .lbl == "At work",
- lbl(0, "No") ~ .lbl != "At work"
- )
- )
- ) %>%
- filter(AT_WORK == "Yes")
- }
-}
Finally we feed this function and a model specification to the
-bigglm()
function:
-results <- bigglm(
- WORK30PLUS ~ AGE + I(AGE^2) + HEALTH,
- family = binomial(link = "logit"),
- data = get_model_data
-)
-
-summary(results)
-#> Large data regression model: bigglm(WORK30PLUS ~ AGE + I(AGE^2) + HEALTH, family = binomial(link = "logit"),
-#> data = get_model_data)
-#> Sample size = 8957
-#> Coef (95% CI) SE p
-#> (Intercept) -4.0021 -4.4297 -3.5744 0.2138 0.0000
-#> AGE 0.2714 0.2498 0.2930 0.0108 0.0000
-#> I(AGE^2) -0.0029 -0.0032 -0.0027 0.0001 0.0000
-#> HEALTHVery good 0.0038 -0.1346 0.1423 0.0692 0.9557
-#> HEALTHGood -0.1129 -0.2685 0.0426 0.0778 0.1465
-#> HEALTHFair -0.6637 -0.9160 -0.4115 0.1261 0.0000
-#> HEALTHPoor -0.7879 -1.3697 -0.2062 0.2909 0.0068
Storing your data in a database is another way to work with data that cannot fit into memory as a data frame. If you have access to a database on a remote machine, then you can easily select and use parts of the data for your analysis. Even a database on your own machine can help: it stores the data on your hard drive rather than in memory, so only the results of your queries need to be loaded into R.
-There are many different kinds of databases, each with their own benefits and drawbacks, and the database you choose to use will be specific to your use case. However, once you've chosen a database, there will be two general steps: loading your IPUMS data into the database, and then connecting to and querying the database from R.
-R has several tools that support database integration, including -DBI, dbplyr, sparklyr, -bigrquery, and others. In this example, we’ll use -RSQLite to load the data into an in-memory database. (We -use RSQLite because it is easy to set up, but it is likely not efficient -enough to fully resolve issues with large IPUMS data, so it may be wise -to consider an alternative in practice.)
-For rectangular extracts, it is likely simplest to load your data
-into the database in CSV format, which is widely supported. If you are
-working with a hierarchical extract (or your database software doesn’t
-support CSV format), then you can use an ipumsr chunked
-function to load the data into a database without needing to store the
-entire dataset in R.
See the IPUMS data -reading vignette for more about rectangular vs. hierarchical -extracts.
-
-library(DBI)
-library(RSQLite)
-
-# Connect to database
-con <- dbConnect(SQLite(), dbname = ":memory:")
-
-# Load file metadata
-ddi <- read_ipums_ddi(cps_ddi_file)
-
-# Write data to database in chunks
-read_ipums_micro_chunked(
- ddi,
- readr::SideEffectChunkCallback$new(
- function(x, pos) {
- if (pos == 1) {
- dbWriteTable(con, "cps", x)
- } else {
- dbWriteTable(con, "cps", x, row.names = FALSE, append = TRUE)
- }
- }
- ),
- chunk_size = 1000,
- verbose = FALSE
-)
There are a variety of ways to access your data once it is stored in
-the database. In this example, we use dbplyr. For more details about
-dbplyr, see vignette("dbplyr", package = "dbplyr")
.
To run a simple query for AGE
, we can use the same
-syntax we would use with dplyr:
-example <- tbl(con, "cps")
-
-example %>%
- filter(AGE > 25)
-#> # Source: SQL [?? x 14]
-#> # Database: sqlite 3.43.2 []
-#> YEAR SERIAL MONTH CPSID ASECFLAG ASECWTH FOODSTMP PERNUM CPSIDP ASECWT
-#> <dbl> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl>
-#> 1 2011 33 3 2.01e13 1 308. 1 1 2.01e13 308.
-#> 2 2011 33 3 2.01e13 1 308. 1 2 2.01e13 217.
-#> 3 2011 33 3 2.01e13 1 308. 1 3 2.01e13 249.
-#> 4 2011 46 3 2.01e13 1 266. 1 1 2.01e13 266.
-#> 5 2011 46 3 2.01e13 1 266. 1 2 2.01e13 266.
-#> 6 2011 46 3 2.01e13 1 266. 1 3 2.01e13 265.
-#> 7 2011 46 3 2.01e13 1 266. 1 4 2.01e13 296.
-#> 8 2011 64 3 2.01e13 1 241. 1 1 2.01e13 241.
-#> 9 2011 64 3 2.01e13 1 241. 1 2 2.01e13 241.
-#> 10 2011 64 3 2.01e13 1 241. 1 3 2.01e13 278.
-#> # ℹ more rows
-#> # ℹ 4 more variables: AGE <int>, EMPSTAT <int>, AHRSWORKT <dbl>, HEALTH <int>
dbplyr shows us a nice preview of the first rows of the result of our
-query, but the data still exist only in the database. You can use
-dplyr::collect()
to load the full results of the query into
-the current R session. However, this would omit the variable metadata
-attached to IPUMS data, since the database doesn’t store this
-metadata:
-data <- example %>%
- filter(AGE > 25) %>%
- collect()
-
-# Variable metadata is missing
-ipums_val_labels(data$MONTH)
-#> # A tibble: 0 × 2
-#> # ℹ 2 variables: val <dbl>, lbl <chr>
Instead, use ipums_collect()
, which uses a provided
-ipums_ddi
object to reattach the metadata while loading
-into the R environment:
-data <- example %>%
- filter(AGE > 25) %>%
- ipums_collect(ddi)
-
-ipums_val_labels(data$MONTH)
-#> # A tibble: 12 × 2
-#> val lbl
-#> <int> <chr>
-#> 1 1 January
-#> 2 2 February
-#> 3 3 March
-#> 4 4 April
-#> 5 5 May
-#> 6 6 June
-#> 7 7 July
-#> 8 8 August
-#> 9 9 September
-#> 10 10 October
-#> 11 11 November
-#> 12 12 December
See the value labels vignette for more about variable metadata in IPUMS data.
-Big data isn’t just a problem for IPUMS users, so there are many R -resources available.
-See the documentation for the packages mentioned in the databases section for more information about those -options.
-Past blog posts and articles on this topic offer additional strategies for working with big data in R.
-Once you have downloaded an IPUMS extract, the next step is to load -its data into R for analysis.
-For more information about IPUMS data and how to generate and -download a data extract, see the introduction -to IPUMS data.
-IPUMS extracts will be organized slightly differently for different IPUMS projects. In general, all projects will provide multiple files in a data extract. The files most relevant to ipumsr are:
-The data files themselves, typically in fixed-width (.dat) or comma-delimited (.csv) format
-The metadata files that describe the data: DDI (.xml) files for microdata projects, and .txt or .csv codebooks for aggregate data projects
-Both of these files are necessary to properly load data into R. -Obviously, the data files contain the actual data values to be loaded. -But because these are often in fixed-width format, the metadata files -are required to correctly parse the data on load.
-Even for .csv files, the metadata file allows for the addition of -contextual variable information to the loaded data. This makes it much -easier to interpret the values in the data variables and effectively use -them in your data processing pipeline. See the value labels vignette for more information -on working with these labels.
-Microdata extracts typically provide their metadata in a DDI (.xml) -file separate from the compressed data (.dat.gz) files.
-Provide the path to the DDI file to read_ipums_micro()
-to directly load its associated data file into R.
-library(ipumsr)
-library(dplyr)
-
-# Example data
-cps_ddi_file <- ipums_example("cps_00157.xml")
-
-cps_data <- read_ipums_micro(cps_ddi_file)
-
-head(cps_data)
-#> # A tibble: 6 × 8
-#> YEAR SERIAL MONTH ASECWTH STATEFIP PERNUM ASECWT INCTOT
-#> <dbl> <dbl> <int+lbl> <dbl> <int+lbl> <dbl> <dbl> <dbl+lbl>
-#> 1 1962 80 3 [March] 1476. 55 [Wisconsin] 1 1476. 4883
-#> 2 1962 80 3 [March] 1476. 55 [Wisconsin] 2 1471. 5800
-#> 3 1962 80 3 [March] 1476. 55 [Wisconsin] 3 1579. 999999998 [Missin…
-#> 4 1962 82 3 [March] 1598. 27 [Minnesota] 1 1598. 14015
-#> 5 1962 83 3 [March] 1707. 27 [Minnesota] 1 1707. 16552
-#> 6 1962 84 3 [March] 1790. 27 [Minnesota] 1 1790. 6375
Note that you provide the path to the DDI file, not the data -file. This is because ipumsr needs to find both the DDI and data files -to read in your data, and the DDI file includes the name of the data -file, whereas the data file contains only the raw data.
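-If the data file has since been moved or renamed, you can point to it explicitly with the data_file argument. A quick sketch with a hypothetical path:
-
-# Hypothetical path: override the data file location recorded in the DDI
-cps_data <- read_ipums_micro(
-  cps_ddi_file,
-  data_file = "path/to/cps_00157.dat.gz"
-)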
-The loaded data have been parsed correctly and include variable
-metadata in each column. For a summary of the column contents, use
-ipums_var_info()
:
-ipums_var_info(cps_data)
-#> # A tibble: 8 × 4
-#> var_name var_label var_desc val_labels
-#> <chr> <chr> <chr> <list>
-#> 1 YEAR Survey year "YEAR r… <tibble>
-#> 2 SERIAL Household serial number "SERIAL… <tibble>
-#> 3 MONTH Month "MONTH … <tibble>
-#> 4 ASECWTH Annual Social and Economic Supplement Household … "ASECWT… <tibble>
-#> 5 STATEFIP State (FIPS code) "STATEF… <tibble>
-#> 6 PERNUM Person number in sample unit "PERNUM… <tibble>
-#> 7 ASECWT Annual Social and Economic Supplement Weight "ASECWT… <tibble>
-#> 8 INCTOT Total personal income "INCTOT… <tibble>
This information is also attached to specific columns. You can obtain
-it with attributes()
or by using ipumsr helpers:
-attributes(cps_data$MONTH)
-#> $labels
-#> January February March April May June July August
-#> 1 2 3 4 5 6 7 8
-#> September October November December
-#> 9 10 11 12
-#>
-#> $class
-#> [1] "haven_labelled" "vctrs_vctr" "integer"
-#>
-#> $label
-#> [1] "Month"
-#>
-#> $var_desc
-#> [1] "MONTH indicates the calendar month of the CPS interview."
-
-ipums_val_labels(cps_data$MONTH)
-#> # A tibble: 12 × 2
-#> val lbl
-#> <int> <chr>
-#> 1 1 January
-#> 2 2 February
-#> 3 3 March
-#> 4 4 April
-#> 5 5 May
-#> 6 6 June
-#> 7 7 July
-#> 8 8 August
-#> 9 9 September
-#> 10 10 October
-#> 11 11 November
-#> 12 12 December
While this is the most straightforward way to load microdata, it’s
-often advantageous to independently load the DDI file into an
-ipums_ddi
object containing the metadata:
-cps_ddi <- read_ipums_ddi(cps_ddi_file)
-
-cps_ddi
-#> An IPUMS DDI for IPUMS CPS with 8 variables
-#> Extract 'cps_00157.dat' created on 2023-07-10
-#> User notes: User-provided description: Reproducing cps00006
This is because many common data processing functions have the -side-effect of removing these attributes:
-
-# This doesn't actually change the data...
-cps_data2 <- cps_data %>%
- mutate(MONTH = ifelse(TRUE, MONTH, MONTH))
-
-# but removes attributes!
-ipums_val_labels(cps_data2$MONTH)
-#> # A tibble: 0 × 2
-#> # ℹ 2 variables: val <dbl>, lbl <chr>
In this case, you can always use the separate DDI as a metadata -reference:
-
-ipums_val_labels(cps_ddi, var = MONTH)
-#> # A tibble: 12 × 2
-#> val lbl
-#> <dbl> <chr>
-#> 1 1 January
-#> 2 2 February
-#> 3 3 March
-#> 4 4 April
-#> 5 5 May
-#> 6 6 June
-#> 7 7 July
-#> 8 8 August
-#> 9 9 September
-#> 10 10 October
-#> 11 11 November
-#> 12 12 December
Or even reattach the metadata, assuming the variable names still -match those in the DDI:
-
-cps_data2 <- set_ipums_var_attributes(cps_data2, cps_ddi)
-
-ipums_val_labels(cps_data2$MONTH)
-#> # A tibble: 12 × 2
-#> val lbl
-#> <int> <chr>
-#> 1 1 January
-#> 2 2 February
-#> 3 3 March
-#> 4 4 April
-#> 5 5 May
-#> 6 6 June
-#> 7 7 July
-#> 8 8 August
-#> 9 9 September
-#> 10 10 October
-#> 11 11 November
-#> 12 12 December
IPUMS microdata can come in either rectangular or -hierarchical format.
-Rectangular data are transformed such that every row of data
-represents the same type of record. For instance, each row will
-represent a person record, and all household-level information for that
-person will be included in the same row. (This is the case for
-cps_data
shown in the example above.)
Hierarchical data have records of different types interspersed in a -single file. For instance, a household record will be included in its -own row followed by the person records associated with that -household.
-Hierarchical data can be loaded in list format or long format.
-read_ipums_micro()
will read in long format:
-cps_hier_ddi <- read_ipums_ddi(ipums_example("cps_00159.xml"))
-
-read_ipums_micro(cps_hier_ddi)
-#> Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
-#> # A tibble: 11,053 × 9
-#> RECTYPE YEAR SERIAL MONTH ASECWTH STATEFIP PERNUM ASECWT INCTOT
-#> <chr+lbl> <dbl> <dbl> <int+lb> <dbl> <int+lb> <dbl> <dbl> <dbl+lbl>
-#> 1 H [Househ… 1962 80 3 [Mar… 1476. 55 [Wis… NA NA NA
-#> 2 P [Person… 1962 80 NA NA NA 1 1476. 4.88e3
-#> 3 P [Person… 1962 80 NA NA NA 2 1471. 5.8 e3
-#> 4 P [Person… 1962 80 NA NA NA 3 1579. 1.00e9 [Mis…
-#> 5 H [Househ… 1962 82 3 [Mar… 1598. 27 [Min… NA NA NA
-#> 6 P [Person… 1962 82 NA NA NA 1 1598. 1.40e4
-#> 7 H [Househ… 1962 83 3 [Mar… 1707. 27 [Min… NA NA NA
-#> 8 P [Person… 1962 83 NA NA NA 1 1707. 1.66e4
-#> 9 H [Househ… 1962 84 3 [Mar… 1790. 27 [Min… NA NA NA
-#> 10 P [Person… 1962 84 NA NA NA 1 1790. 6.38e3
-#> # ℹ 11,043 more rows
The long format consists of a single tibble
-that includes rows with varying record types. In this example, some rows
-have a record type of “Household” and others have a record type of
-“Person”. Variables that do not apply to a particular record type will
-be filled with NA
in rows of that record type.
To read data in list format, use
-read_ipums_micro_list()
. This function returns a list where
-each element contains all the records for a given record type:
-read_ipums_micro_list(cps_hier_ddi)
-#> Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
-#> $HOUSEHOLD
-#> # A tibble: 3,385 × 6
-#> RECTYPE YEAR SERIAL MONTH ASECWTH STATEFIP
-#> <chr+lbl> <dbl> <dbl> <int+lbl> <dbl> <int+lbl>
-#> 1 H [Household Record] 1962 80 3 [March] 1476. 55 [Wisconsin]
-#> 2 H [Household Record] 1962 82 3 [March] 1598. 27 [Minnesota]
-#> 3 H [Household Record] 1962 83 3 [March] 1707. 27 [Minnesota]
-#> 4 H [Household Record] 1962 84 3 [March] 1790. 27 [Minnesota]
-#> 5 H [Household Record] 1962 107 3 [March] 4355. 19 [Iowa]
-#> 6 H [Household Record] 1962 108 3 [March] 1479. 19 [Iowa]
-#> 7 H [Household Record] 1962 122 3 [March] 3603. 27 [Minnesota]
-#> 8 H [Household Record] 1962 124 3 [March] 4104. 55 [Wisconsin]
-#> 9 H [Household Record] 1962 125 3 [March] 2182. 55 [Wisconsin]
-#> 10 H [Household Record] 1962 126 3 [March] 1826. 55 [Wisconsin]
-#> # ℹ 3,375 more rows
-#>
-#> $PERSON
-#> # A tibble: 7,668 × 6
-#> RECTYPE YEAR SERIAL PERNUM ASECWT INCTOT
-#> <chr+lbl> <dbl> <dbl> <dbl> <dbl> <dbl+lbl>
-#> 1 P [Person Record] 1962 80 1 1476. 4883
-#> 2 P [Person Record] 1962 80 2 1471. 5800
-#> 3 P [Person Record] 1962 80 3 1579. 999999998 [Missing. (1962-1964 …
-#> 4 P [Person Record] 1962 82 1 1598. 14015
-#> 5 P [Person Record] 1962 83 1 1707. 16552
-#> 6 P [Person Record] 1962 84 1 1790. 6375
-#> 7 P [Person Record] 1962 107 1 4355. 999999999 [N.I.U.]
-#> 8 P [Person Record] 1962 107 2 1386. 0
-#> 9 P [Person Record] 1962 107 3 1629. 600
-#> 10 P [Person Record] 1962 107 4 1432. 999999999 [N.I.U.]
-#> # ℹ 7,658 more rows
read_ipums_micro()
and
-read_ipums_micro_list()
also support partial loading by
-selecting only a subset of columns or a limited number of rows. See the
-documentation for more details about other options.
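-For instance, a small sketch of partial loading using the vars and n_max arguments:
-
-# Read only three variables and the first 100 rows
-cps_subset <- read_ipums_micro(
-  cps_ddi_file,
-  vars = c(YEAR, MONTH, INCTOT),
-  n_max = 100,
-  verbose = FALSE
-)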
Unlike microdata projects, NHGIS extracts provide their data and
-metadata files bundled into a single .zip archive.
-read_nhgis()
anticipates this structure and can read data
-files directly from this file without the need to manually extract the
-files:
-nhgis_ex1 <- ipums_example("nhgis0972_csv.zip")
-
-nhgis_data <- read_nhgis(nhgis_ex1)
-#> Use of data from NHGIS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
-#> Rows: 71 Columns: 25
-#> ── Column specification ────────────────────────────────────────────────────────
-#> Delimiter: ","
-#> chr (9): GISJOIN, STUSAB, CMSA, PMSA, PMSAA, AREALAND, AREAWAT, ANPSADPI, F...
-#> dbl (13): YEAR, MSA_CMSAA, INTPTLAT, INTPTLNG, PSADC, D6Z001, D6Z002, D6Z003...
-#> lgl (3): DIVISIONA, REGIONA, STATEA
-#>
-#> ℹ Use `spec()` to retrieve the full column specification for this data.
-#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
-
-nhgis_data
-#> # A tibble: 71 × 25
-#> GISJOIN YEAR STUSAB CMSA DIVISIONA MSA_CMSAA PMSA PMSAA REGIONA STATEA
-#> <chr> <dbl> <chr> <chr> <lgl> <dbl> <chr> <chr> <lgl> <lgl>
-#> 1 G0080 1990 OH 28 NA 1692 Akron, O… 0080 NA NA
-#> 2 G0360 1990 CA 49 NA 4472 Anaheim-… 0360 NA NA
-#> 3 G0440 1990 MI 35 NA 2162 Ann Arbo… 0440 NA NA
-#> 4 G0620 1990 IL 14 NA 1602 Aurora--… 0620 NA NA
-#> 5 G0845 1990 PA 78 NA 6282 Beaver C… 0845 NA NA
-#> 6 G0875 1990 NJ 70 NA 5602 Bergen--… 0875 NA NA
-#> 7 G1120 1990 MA 07 NA 1122 Boston, … 1120 NA NA
-#> 8 G1125 1990 CO 34 NA 2082 Boulder-… 1125 NA NA
-#> 9 G1145 1990 TX 42 NA 3362 Brazoria… 1145 NA NA
-#> 10 G1160 1990 CT 70 NA 5602 Bridgepo… 1160 NA NA
-#> # ℹ 61 more rows
-#> # ℹ 15 more variables: AREALAND <chr>, AREAWAT <chr>, ANPSADPI <chr>,
-#> # FUNCSTAT <chr>, INTPTLAT <dbl>, INTPTLNG <dbl>, PSADC <dbl>, D6Z001 <dbl>,
-#> # D6Z002 <dbl>, D6Z003 <dbl>, D6Z004 <dbl>, D6Z005 <dbl>, D6Z006 <dbl>,
-#> # D6Z007 <dbl>, D6Z008 <dbl>
Like microdata extracts, the data include variable-level metadata, -where available:
-
-attributes(nhgis_data$D6Z001)
-#> $label
-#> [1] "Total area: 1989 to March 1990"
-#>
-#> $var_desc
-#> [1] "Table D6Z: Year Structure Built (Universe: Housing Units)"
However, variable metadata for NHGIS data are slightly different than
-those provided by microdata products. First, they come from a .txt
-codebook file rather than an .xml DDI file. Codebooks can still be
-loaded into an ipums_ddi
object, but fields that do not
-apply to aggregate data will be empty. In general, NHGIS codebooks
-provide only variable labels and descriptions, along with citation
-information.
-nhgis_cb <- read_nhgis_codebook(nhgis_ex1)
-
-# Most useful metadata for NHGIS is for variable labels:
-ipums_var_info(nhgis_cb) %>%
- select(var_name, var_label, var_desc)
-#> # A tibble: 25 × 3
-#> var_name var_label var_desc
-#> <chr> <chr> <chr>
-#> 1 GISJOIN GIS Join Match Code ""
-#> 2 YEAR Data File Year ""
-#> 3 STUSAB State/US Abbreviation ""
-#> 4 CMSA Consolidated Metropolitan Statistical Area ""
-#> 5 DIVISIONA Division Code ""
-#> 6 MSA_CMSAA Metropolitan Statistical Area/Consolidated Metropolitan S… ""
-#> 7 PMSA Primary Metropolitan Statistical Area Name ""
-#> 8 PMSAA Primary Metropolitan Statistical Area Code ""
-#> 9 REGIONA Region Code ""
-#> 10 STATEA State Code ""
-#> # ℹ 15 more rows
By design, NHGIS codebooks are human-readable, and it may be easier
-to interpret their contents in raw format. To view the codebook itself
-without converting to an ipums_ddi
object, set
-raw = TRUE
.
-nhgis_cb <- read_nhgis_codebook(nhgis_ex1, raw = TRUE)
-
-cat(nhgis_cb[1:20], sep = "\n")
-#> --------------------------------------------------------------------------------
-#> Codebook for NHGIS data file 'nhgis0972_ds135_1990_pmsa'
-#> --------------------------------------------------------------------------------
-#>
-#> Contents
-#> - Data Summary
-#> - Data Dictionary
-#> - Citation and Use
-#>
-#> Additional documentation on NHGIS data sources is available at:
-#> https://www.nhgis.org/documentation/tabular-data
-#>
-#> --------------------------------------------------------------------------------
-#> Data Summary
-#> --------------------------------------------------------------------------------
-#>
-#> Year: 1990
-#> Geographic level: Consolidated Metropolitan Statistical Area--Primary Metropolitan Statistical Area
-#> Dataset: 1990 Census: SSTF 9 - Housing Characteristics of New Units
-#> NHGIS code: 1990_SSTF09
For more complicated NHGIS extracts that include data from multiple -data sources, the provided .zip archive will contain multiple codebook -and data files.
-You can view the files contained in an extract to determine if this -is the case:
-
-nhgis_ex2 <- ipums_example("nhgis0731_csv.zip")
-
-ipums_list_files(nhgis_ex2)
-#> # A tibble: 2 × 2
-#> type file
-#> <chr> <chr>
-#> 1 data nhgis0731_csv/nhgis0731_ds239_20185_nation.csv
-#> 2 data nhgis0731_csv/nhgis0731_ts_nominal_state.csv
In these cases, you can use the file_select
argument to
-indicate which file to load. file_select
supports most
-features of the tidyselect
-selection language. (See ?selection_language
for
-documentation of the features supported in ipumsr.)
-nhgis_data2 <- read_nhgis(nhgis_ex2, file_select = contains("nation"))
-nhgis_data3 <- read_nhgis(nhgis_ex2, file_select = contains("ts_nominal_state"))
The matching codebook should automatically be loaded and attached to -the data:
-
-attributes(nhgis_data2$AJWBE001)
-#> $label
-#> [1] "Estimates: Total"
-#>
-#> $var_desc
-#> [1] "Table AJWB: Sex by Age (Universe: Total population)"
-
-attributes(nhgis_data3$A00AA1790)
-#> $label
-#> [1] "1790: Persons: Total"
-#>
-#> $var_desc
-#> [1] "Table A00: Total Population"
(If for some reason the codebook is not loaded correctly, you can
-load it separately with read_nhgis_codebook()
, which also
-accepts a file_select
specification.)
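-For example, a minimal sketch that loads the nation-level codebook on its own:
-
-nation_cb <- read_nhgis_codebook(nhgis_ex2, file_select = contains("nation"))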
file_select
also accepts the full path or the index of
-the file to load:
-# Match by file name
-read_nhgis(nhgis_ex2, file_select = "nhgis0731_csv/nhgis0731_ds239_20185_nation.csv")
-
-# Match first file in extract
-read_nhgis(nhgis_ex2, file_select = 1)
NHGIS data are most easily handled in .csv format.
-read_nhgis()
uses readr::read_csv()
to handle
-the generation of column type specifications. If the guessed
-specifications are incorrect, you can use the col_types
-argument to adjust. This is most likely to occur for columns that
-contain geographic codes that are stored as numeric values:
-# Convert MSA codes to character format
-read_nhgis(
- nhgis_ex1,
- col_types = c(MSA_CMSAA = "c"),
- verbose = FALSE
-)
-#> # A tibble: 71 × 25
-#> GISJOIN YEAR STUSAB CMSA DIVISIONA MSA_CMSAA PMSA PMSAA REGIONA STATEA
-#> <chr> <dbl> <chr> <chr> <lgl> <chr> <chr> <chr> <lgl> <lgl>
-#> 1 G0080 1990 OH 28 NA 1692 Akron, O… 0080 NA NA
-#> 2 G0360 1990 CA 49 NA 4472 Anaheim-… 0360 NA NA
-#> 3 G0440 1990 MI 35 NA 2162 Ann Arbo… 0440 NA NA
-#> 4 G0620 1990 IL 14 NA 1602 Aurora--… 0620 NA NA
-#> 5 G0845 1990 PA 78 NA 6282 Beaver C… 0845 NA NA
-#> 6 G0875 1990 NJ 70 NA 5602 Bergen--… 0875 NA NA
-#> 7 G1120 1990 MA 07 NA 1122 Boston, … 1120 NA NA
-#> 8 G1125 1990 CO 34 NA 2082 Boulder-… 1125 NA NA
-#> 9 G1145 1990 TX 42 NA 3362 Brazoria… 1145 NA NA
-#> 10 G1160 1990 CT 70 NA 5602 Bridgepo… 1160 NA NA
-#> # ℹ 61 more rows
-#> # ℹ 15 more variables: AREALAND <chr>, AREAWAT <chr>, ANPSADPI <chr>,
-#> # FUNCSTAT <chr>, INTPTLAT <dbl>, INTPTLNG <dbl>, PSADC <dbl>, D6Z001 <dbl>,
-#> # D6Z002 <dbl>, D6Z003 <dbl>, D6Z004 <dbl>, D6Z005 <dbl>, D6Z006 <dbl>,
-#> # D6Z007 <dbl>, D6Z008 <dbl>
read_nhgis()
also handles NHGIS files provided in
-fixed-width format:
-nhgis_fwf <- ipums_example("nhgis0730_fixed.zip")
-
-nhgis_fwf_data <- read_nhgis(nhgis_fwf, file_select = matches("ts_nominal"))
-#> Use of data from NHGIS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
-#> Rows: 84 Columns: 28
-#> ── Column specification ────────────────────────────────────────────────────────
-#>
-#> chr (4): GISJOIN, STATE, STATEFP, STATENH
-#> dbl (24): A00AA1790, A00AA1800, A00AA1810, A00AA1820, A00AA1830, A00AA1840, ...
-#>
-#> ℹ Use `spec()` to retrieve the full column specification for this data.
-#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
-
-nhgis_fwf_data
-#> # A tibble: 84 × 28
-#> GISJOIN STATE STATEFP STATENH A00AA1790 A00AA1800 A00AA1810 A00AA1820
-#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
-#> 1 G010 Alabama 01 010 NA NA NA 127901
-#> 2 G020 Alaska 02 020 NA NA NA NA
-#> 3 G025 Alaska Terri… NA 025 NA NA NA NA
-#> 4 G040 Arizona 04 040 NA NA NA NA
-#> 5 G045 Arizona Terr… NA 045 NA NA NA NA
-#> 6 G050 Arkansas 05 050 NA NA NA NA
-#> 7 G055 Arkansas Ter… NA 055 NA NA NA 14273
-#> 8 G060 California 06 060 NA NA NA NA
-#> 9 G080 Colorado 08 080 NA NA NA NA
-#> 10 G085 Colorado Ter… NA 085 NA NA NA NA
-#> # ℹ 74 more rows
-#> # ℹ 20 more variables: A00AA1830 <dbl>, A00AA1840 <dbl>, A00AA1850 <dbl>,
-#> # A00AA1860 <dbl>, A00AA1870 <dbl>, A00AA1880 <dbl>, A00AA1890 <dbl>,
-#> # A00AA1900 <dbl>, A00AA1910 <dbl>, A00AA1920 <dbl>, A00AA1930 <dbl>,
-#> # A00AA1940 <dbl>, A00AA1950 <dbl>, A00AA1960 <dbl>, A00AA1970 <dbl>,
-#> # A00AA1980 <dbl>, A00AA1990 <dbl>, A00AA2000 <dbl>, A00AA2010 <dbl>,
-#> # A00AA2020 <dbl>
The correct parsing of NHGIS fixed-width files is driven by the -column parsing information contained in the .do file provided in the -.zip archive. This contains information not only about column positions -and data types, but also implicit decimals in the data.
-If you no longer have access to the .do file, it is best to resubmit
-and/or re-download the extract (you may also consider converting to .csv
-format in the process). If you have moved the .do file, provide its file
-path to the do_file
argument to use its column parsing
-information.
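-For example, a sketch with hypothetical file paths:
-
-# Point read_nhgis() at the fixed-width data and its relocated .do file
-read_nhgis(
-  "fixed-width-data/nhgis0730_ts_nominal_state.dat",
-  do_file = "parsing-files/nhgis0730_ts_nominal_state.do"
-)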
Note that unlike read_ipums_micro()
, fixed-width files
-for NHGIS are still handled by providing the path to the data
-file, not the metadata file (i.e. you cannot provide an
-ipums_ddi
object to the data_file
argument of
-read_nhgis()
). This is for syntactical consistency with the
-loading of NHGIS .csv files.
IPUMS distributes spatial data for several projects.
-Use read_ipums_sf()
to load spatial data from any of
-these sources as an sf
object from sf.
read_ipums_sf()
also supports the loading of spatial
-files within .zip archives and the file_select
syntax for
-file selection when multiple internal files are present.
-nhgis_shp_file <- ipums_example("nhgis0972_shape_small.zip")
-
-shp_data <- read_ipums_sf(nhgis_shp_file)
-
-head(shp_data)
-#> Simple feature collection with 6 features and 8 fields
-#> Geometry type: MULTIPOLYGON
-#> Dimension: XY
-#> Bounding box: xmin: -129888.4 ymin: -967051.1 xmax: 1948770 ymax: 751282.5
-#> Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic
-#> # A tibble: 6 × 9
-#> PMSA MSACMSA ALTCMSA GISJOIN GISJOIN2 SHAPE_AREA SHAPE_LEN GISJOIN3
-#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
-#> 1 3280 3282 41 G3280 3280 2840869482. 320921. G32823280
-#> 2 5760 5602 70 G5760 5760 237428573. 126226. G56025760
-#> 3 1145 3362 42 G1145 1145 3730749183. 489789. G33621145
-#> 4 1920 1922 31 G1920 1920 12068105590. 543164. G19221920
-#> 5 0080 1692 28 G0080 0080 2401347006. 218892. G16920080
-#> 6 1640 1642 21 G1640 1640 5608404797. 415671. G16421640
-#> # ℹ 1 more variable: geometry <MULTIPOLYGON [m]>
These data can then be joined to associated tabular data. To preserve
-IPUMS attributes from the tabular data used in the join, use an
-ipums_shape_*_join()
function:
-joined_data <- ipums_shape_left_join(
- nhgis_data,
- shp_data,
- by = "GISJOIN"
-)
-
-attributes(joined_data$MSA_CMSAA)
-#> $label
-#> [1] "Metropolitan Statistical Area/Consolidated Metropolitan Statistical Area Code"
-#>
-#> $var_desc
-#> [1] ""
For NHGIS data, the join code typically corresponds to the
-GISJOIN
variable. However, for microdata projects, the
-variable name used for a geographic level in the tabular data may differ
-from that in the spatial data. Consult the documentation and metadata
-for these files to identify the correct join columns and use the
-by
argument to join on these columns.
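-For example, a hedged sketch with hypothetical object and column names, supposing the tabular data store state codes in STATEFIP while the shapefile uses STATEFP:
-
-# Hypothetical join columns; consult your files' documentation for the real names
-joined_micro <- ipums_shape_left_join(
-  tabular_data,
-  spatial_data,
-  by = c("STATEFIP" = "STATEFP")
-)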
Once joined, data include both statistical and spatial information -along with the variable metadata.
-Longitudinal analysis of geographic data is complicated by the fact that geographic boundaries shift over time. IPUMS therefore provides multiple types of spatial data, reflecting boundary definitions from different points in time.
-Furthermore, some NHGIS time series tables have been standardized -such that the statistics have been adjusted to apply to a year-specific -geographical boundary.
-When using spatial data, it is important to consult the -project-specific documentation to ensure you are using the most -appropriate boundaries for your research question and the data included -in your analysis. As always, documentation for the IPUMS project you’re -working with should explain the different options available.
-This article provides an overview of how to find, request, download, -and read IPUMS data into R. For a general introduction to IPUMS and -ipumsr, see the ipumsr home -page.
-IPUMS data are free, but do require registration. New users can -register with a particular IPUMS project by clicking the -Register link at the top right of the project -website.
-Users obtain IPUMS data by creating and submitting an extract -request. This specifies which data to include in the resulting -extract (or data extract). IPUMS servers process each -submitted extract request, and when complete, users can download the -extract containing the requested data.
-Extracts typically contain both data and metadata files. Data files -typically come as fixed-width (.dat) files or comma-delimited (.csv) -files. Metadata files contain information about the data file and its -contents, including variable descriptions and parsing instructions for -fixed-width data files. IPUMS microdata projects provide metadata in DDI -(.xml) files. Aggregate data projects provide metadata in either .txt or -.csv formats.
-Users can submit extract requests and download extracts via either -the IPUMS website or the IPUMS API. -ipumsr provides a set of client tools to interface with the API. Note -that only certain -IPUMS projects are currently supported by the IPUMS API.
-To create a new extract request via an IPUMS project website (e.g. IPUMS CPS), navigate to the -extract interface for that project by clicking Select -Data in the heading of the project website.
-The project’s extract interface allows you to explore what’s -available, find documentation about data concepts and sources, and -specify the data you’d like to download. The data selection parameters -will differ across projects; see each project’s documentation for more -details on the available options.
-If you’ve never created an extract for the project you’re interested -in, a good way to learn the basics is to watch a project-specific video -on creating extracts hosted on the IPUMS Tutorials -page.
-Once your extract is ready, click the green Download button to download the data file. Then, right-click the DDI link in the Codebook column, and select Save Link As….
-Note that some browsers may display different text, but there should -be an option to download the DDI file as .xml. (For instance, on Safari, -select Download Linked File As….) For ipumsr to read -the metadata, you must save the file in .xml format, -not .html format.
-Aggregate data projects include data and metadata together in a -single .zip archive. To download them, simply click on the green -Tables button (for tabular data) and/or GIS -Files button (for spatial boundary or location data) in the -Download Data column.
-Users can also create and submit extract requests within R by using -ipumsr functions that interface with the IPUMS API. The IPUMS API -currently supports access to the extract system for certain -IPUMS collections.
-ipumsr provides an interface to the IPUMS extract system via the IPUMS API for several collections, including IPUMS USA, IPUMS CPS, and IPUMS NHGIS.
-ipumsr provides access to comprehensive metadata via the IPUMS API for IPUMS NHGIS.
-Users can query NHGIS metadata to explore available data when -specifying NHGIS extract requests.
-A listing of available samples is also provided for the supported microdata collections.
-Increased access to metadata for these projects is in progress. -Currently, creating extract requests for these projects requires using -the corresponding project websites to find samples and variables of -interest and obtain their API identifiers for use in R extract -definitions.
-Once you have identified the data you would like to request, the -workflow for requesting and downloading data via API is -straightforward.
-First, define the parameters of your extract. The available extract -definition options will differ by IPUMS data collection. See the microdata API request and NHGIS API request vignettes for more -details on defining an extract.
-
-cps_extract_request <- define_extract_cps(
- description = "2018-2019 CPS Data",
- samples = c("cps2018_05s", "cps2019_05s"),
- variables = c("SEX", "AGE", "YEAR")
-)
-
-nhgis_extract_request <- define_extract_nhgis(
- description = "NHGIS Data via IPUMS API",
- datasets = ds_spec(
- "1990_STF1",
- data_tables = c("NP1", "NP2", "NP3"),
- geog_levels = "state"
- )
-)
Next, submit your extract definition. After waiting for it to -complete, you can download the files directly to your local machine -without ever having to leave R:
-
-submitted_extract <- submit_extract(cps_extract_request)
-downloadable_extract <- wait_for_extract(submitted_extract)
-data_files <- download_extract(downloadable_extract)
You can also get the specifications of your previous extract -requests, even if they weren’t made with the API:
-
-past_extracts <- get_extract_history("nhgis")
See the introduction to the IPUMS API -for more details about how to use ipumsr to interact with the IPUMS -API.
-Once you have downloaded an extract, you can load the data into R
-with the family of read_*() functions in ipumsr. These functions expand on those provided in readr in two ways:
-They use an extract's metadata files to parse its data files correctly on load, which is essential for fixed-width files.
-They attach variable metadata (variable labels, variable descriptions, and value labels) to the loaded data.
File loading is covered in depth in the reading IPUMS data vignette.
-For microdata files, use the read_ipums_micro_*()
family
-with the DDI (.xml) metadata file for your extract:
-cps_file <- ipums_example("cps_00157.xml")
-cps_data <- read_ipums_micro(cps_file)
-#> Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
-
-head(cps_data)
-#> # A tibble: 6 × 8
-#> YEAR SERIAL MONTH ASECWTH STATEFIP PERNUM ASECWT INCTOT
-#> <dbl> <dbl> <int+lbl> <dbl> <int+lbl> <dbl> <dbl> <dbl+lbl>
-#> 1 1962 80 3 [March] 1476. 55 [Wisconsin] 1 1476. 4883
-#> 2 1962 80 3 [March] 1476. 55 [Wisconsin] 2 1471. 5800
-#> 3 1962 80 3 [March] 1476. 55 [Wisconsin] 3 1579. 999999998 [Missin…
-#> 4 1962 82 3 [March] 1598. 27 [Minnesota] 1 1598. 14015
-#> 5 1962 83 3 [March] 1707. 27 [Minnesota] 1 1707. 16552
-#> 6 1962 84 3 [March] 1790. 27 [Minnesota] 1 1790. 6375
For NHGIS files, use read_nhgis()
:
-nhgis_file <- ipums_example("nhgis0972_csv.zip")
-nhgis_data <- read_nhgis(nhgis_file, verbose = FALSE)
-
-head(nhgis_data)
-#> # A tibble: 6 × 25
-#> GISJOIN YEAR STUSAB CMSA DIVISIONA MSA_CMSAA PMSA PMSAA REGIONA STATEA
-#> <chr> <dbl> <chr> <chr> <lgl> <dbl> <chr> <chr> <lgl> <lgl>
-#> 1 G0080 1990 OH 28 NA 1692 Akron, OH… 0080 NA NA
-#> 2 G0360 1990 CA 49 NA 4472 Anaheim--… 0360 NA NA
-#> 3 G0440 1990 MI 35 NA 2162 Ann Arbor… 0440 NA NA
-#> 4 G0620 1990 IL 14 NA 1602 Aurora--E… 0620 NA NA
-#> 5 G0845 1990 PA 78 NA 6282 Beaver Co… 0845 NA NA
-#> 6 G0875 1990 NJ 70 NA 5602 Bergen--P… 0875 NA NA
-#> # ℹ 15 more variables: AREALAND <chr>, AREAWAT <chr>, ANPSADPI <chr>,
-#> # FUNCSTAT <chr>, INTPTLAT <dbl>, INTPTLNG <dbl>, PSADC <dbl>, D6Z001 <dbl>,
-#> # D6Z002 <dbl>, D6Z003 <dbl>, D6Z004 <dbl>, D6Z005 <dbl>, D6Z006 <dbl>,
-#> # D6Z007 <dbl>, D6Z008 <dbl>
ipumsr also supports the reading of IPUMS shapefiles (spatial
-boundary and location files) into the sf
format provided by
-the sf package:
-shp_file <- ipums_example("nhgis0972_shape_small.zip")
-nhgis_shp <- read_ipums_sf(shp_file)
-
-head(nhgis_shp)
-#> Simple feature collection with 6 features and 8 fields
-#> Geometry type: MULTIPOLYGON
-#> Dimension: XY
-#> Bounding box: xmin: -129888.4 ymin: -967051.1 xmax: 1948770 ymax: 751282.5
-#> Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic
-#> # A tibble: 6 × 9
-#> PMSA MSACMSA ALTCMSA GISJOIN GISJOIN2 SHAPE_AREA SHAPE_LEN GISJOIN3
-#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
-#> 1 3280 3282 41 G3280 3280 2840869482. 320921. G32823280
-#> 2 5760 5602 70 G5760 5760 237428573. 126226. G56025760
-#> 3 1145 3362 42 G1145 1145 3730749183. 489789. G33621145
-#> 4 1920 1922 31 G1920 1920 12068105590. 543164. G19221920
-#> 5 0080 1692 28 G0080 0080 2401347006. 218892. G16920080
-#> 6 1640 1642 21 G1640 1640 5608404797. 415671. G16421640
-#> # ℹ 1 more variable: geometry <MULTIPOLYGON [m]>
ipumsr is primarily designed to read data produced by the IPUMS -extract system. However, IPUMS does distribute other files, often -available via direct download. In many cases, these can be loaded with -ipumsr. Otherwise, these files can likely be handled by existing data -reading packages like readr (for delimited files) or -haven (for Stata, SPSS, or SAS files).
-Load a file’s metadata with read_ipums_ddi()
(for
-microdata projects) and read_nhgis_codebook()
(for NHGIS).
-These provide file- and variable-level metadata for a given data source,
-which can be used to interpret the data contents.
-cps_meta <- read_ipums_ddi(cps_file)
-nhgis_meta <- read_nhgis_codebook(nhgis_file)
Summarize the variable metadata for a dataset using
-ipums_var_info()
:
-ipums_var_info(cps_meta)
-#> # A tibble: 8 × 10
-#> var_name var_label var_desc val_labels code_instr start end imp_decim
-#> <chr> <chr> <chr> <list> <chr> <dbl> <dbl> <dbl>
-#> 1 YEAR Survey year "YEAR r… <tibble> "YEAR is … 1 4 0
-#> 2 SERIAL Household seria… "SERIAL… <tibble> "SERIAL i… 5 9 0
-#> 3 MONTH Month "MONTH … <tibble> NA 10 11 0
-#> 4 ASECWTH Annual Social a… "ASECWT… <tibble> "ASECWTH … 12 22 4
-#> 5 STATEFIP State (FIPS cod… "STATEF… <tibble> NA 23 24 0
-#> 6 PERNUM Person number i… "PERNUM… <tibble> "PERNUM i… 25 26 0
-#> 7 ASECWT Annual Social a… "ASECWT… <tibble> "ASECWT i… 27 37 4
-#> 8 INCTOT Total personal … "INCTOT… <tibble> "99999999… 38 46 0
-#> # ℹ 2 more variables: var_type <chr>, rectypes <lgl>
You can also get contextual details for specific variables:
-
-ipums_var_desc(cps_data$INCTOT)
-#> [1] "INCTOT indicates each respondent's total pre-tax personal income or losses from all sources for the previous calendar year. Amounts are expressed as they were reported to the interviewer; users must adjust for inflation using Consumer Price Index adjustment factors."
-
-ipums_val_labels(cps_data$STATEFIP)
-#> # A tibble: 75 × 2
-#> val lbl
-#> <int> <chr>
-#> 1 1 Alabama
-#> 2 2 Alaska
-#> 3 4 Arizona
-#> 4 5 Arkansas
-#> 5 6 California
-#> 6 8 Colorado
-#> 7 9 Connecticut
-#> 8 10 Delaware
-#> 9 11 District of Columbia
-#> 10 12 Florida
-#> # ℹ 65 more rows
ipumsr also provides a family of lbl_*()
functions to
-assist in accessing and manipulating the value-level metadata included
-in IPUMS data. This allows for value labels to be incorporated into the
-data processing pipeline. For instance:
-# Remove labels for values that do not appear in the data
-cps_data$STATEFIP <- lbl_clean(cps_data$STATEFIP)
-
-ipums_val_labels(cps_data$STATEFIP)
-#> # A tibble: 5 × 2
-#> val lbl
-#> <int> <chr>
-#> 1 19 Iowa
-#> 2 27 Minnesota
-#> 3 38 North Dakota
-#> 4 46 South Dakota
-#> 5 55 Wisconsin
-# Combine North and South Dakota into a single value/label pair
-cps_data$STATEFIP <- lbl_relabel(
- cps_data$STATEFIP,
- lbl("38_46", "Dakotas") ~ grepl("Dakota", .lbl)
-)
-
-ipums_val_labels(cps_data$STATEFIP)
-#> # A tibble: 4 × 2
-#> val lbl
-#> <chr> <chr>
-#> 1 19 Iowa
-#> 2 27 Minnesota
-#> 3 38_46 Dakotas
-#> 4 55 Wisconsin
See the value labels vignette for -more details.
-IPUMS data come with three primary types of variable-level -metadata:
-Variable labels are succinct labels -that serve as human-readable variable names (in contrast to more -esoteric column names).
Variable descriptions are extended text -descriptions of the contents of a variable. These provide more -information about what a given variable measures.
Value labels link particular data
-values to more meaningful text labels. For instance, the
-HEALTH
variable may have data values including
-1
and 2
, but these are actually stand-ins for
-“Excellent” and “Very good” health. This mapping would be contained in a
-value-label pair that includes a value and its associated
-label.
The rest of this article will focus on value labels; for more about
-variable labels and descriptions, see ?ipums_var_info
.
ipumsr uses the labelled
-class from the haven package to handle value labels.
You can see this in the column data types when loading IPUMS data.
-Note that <int+lbl>
appears below MONTH
-and ASECFLAG
:
-library(ipumsr)
-
-ddi <- read_ipums_ddi(ipums_example("cps_00160.xml"))
-cps <- read_ipums_micro(ddi, verbose = FALSE)
-
-cps[, 1:5]
-#> # A tibble: 10,883 × 5
-#> YEAR SERIAL MONTH CPSID ASECFLAG
-#> <dbl> <dbl> <int+lbl> <dbl> <int+lbl>
-#> 1 2016 24138 3 [March] 2.02e13 1 [ASEC]
-#> 2 2016 24139 3 [March] 2.02e13 1 [ASEC]
-#> 3 2016 24139 3 [March] 2.02e13 1 [ASEC]
-#> 4 2016 24140 3 [March] 2.02e13 1 [ASEC]
-#> 5 2016 24140 3 [March] 2.02e13 1 [ASEC]
-#> 6 2016 24140 3 [March] 2.02e13 1 [ASEC]
-#> 7 2016 24141 3 [March] 2.02e13 1 [ASEC]
-#> 8 2016 24142 3 [March] 2.02e13 1 [ASEC]
-#> 9 2016 24142 3 [March] 2.02e13 1 [ASEC]
-#> 10 2016 24142 3 [March] 2.02e13 1 [ASEC]
-#> # ℹ 10,873 more rows
This indicates that the data contained in these columns are integers
-but include value labels. You can use the function
-is.labelled()
to determine if a variable is indeed
-labelled:
-is.labelled(cps$STATEFIP)
-#> [1] TRUE
Some of the labels are actually printed inline alongside their data -values, but it can be easier to see them by isolating them:
-
-# Labels print when accessing the column
-head(cps$MONTH)
-#> <labelled<integer>[6]>: Month
-#> [1] 3 3 3 3 3 3
-#>
-#> Labels:
-#> value label
-#> 1 January
-#> 2 February
-#> 3 March
-#> 4 April
-#> 5 May
-#> 6 June
-#> 7 July
-#> 8 August
-#> 9 September
-#> 10 October
-#> 11 November
-#> 12 December
-
-# Get labels alone
-ipums_val_labels(cps$MONTH)
-#> # A tibble: 12 × 2
-#> val lbl
-#> <int> <chr>
-#> 1 1 January
-#> 2 2 February
-#> 3 3 March
-#> 4 4 April
-#> 5 5 May
-#> 6 6 June
-#> 7 7 July
-#> 8 8 August
-#> 9 9 September
-#> 10 10 October
-#> 11 11 November
-#> 12 12 December
labelled vs. factor
-Base R already supports the linking of numeric data to categories
-using its factor
data type. While factors may be more
-familiar, they were designed to support efficient calculations in linear
-models, not as a human-readable labeling system for interpreting and
-processing data.
Compared to factors, labelled
vectors have two main
-properties that make them more suitable for working with IPUMS data:
-Labels can be attached to any values; they do not need to form a sequence of integers starting at 1.
-Only some values need to be labelled; labelled and unlabelled values can coexist in a single vector.
Consider the case of the AGE
variable. For many IPUMS
-products, AGE
provides a person’s age in years, but certain
-special values have other interpretations:
-head(cps$AGE)
-#> <labelled<integer>[6]>: Age
-#> [1] 54 54 52 38 15 38
-#>
-#> Labels:
-#> value label
-#> 0 Under 1 year
-#> 90 90 (90+, 1988-2002)
-#> 99 99+
As you can see, the 0 value represents all ages less than 1, and the
-90 and 99 values actually represent ranges of ages. Coercing
-AGE
to a factor would convert all values of 0 to 1, because
-factors always assign values starting at 1:
-cps$AGE_FACTOR <- as_factor(cps$AGE)
-
-age0_factor <- cps[cps$AGE == 0, ]$AGE_FACTOR
-
-# The levels look the same
-unique(age0_factor)
-#> [1] Under 1 year
-#> 84 Levels: Under 1 year 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... 99+
-
-# But the values have changed
-unique(as.numeric(age0_factor))
-#> [1] 1
Additionally, because not all values exist in the data, high values like 85, 90, and 99 have been mapped to lower values:
-
-age85_factor <- cps[cps$AGE == 85, ]$AGE_FACTOR
-
-unique(as.numeric(age85_factor))
-#> [1] 82
These different representations lead to inconsistencies in calculated -values:
-
-mean(cps$AGE)
-#> [1] 35.0226
-
-mean(as.numeric(cps$AGE_FACTOR))
-#> [1] 35.94836
labelled variables
-While labelled
variables provide the benefits described
-above, they also present challenges.
For example, you may have noticed that both of the means -calculated above are suspect:
-AGE_FACTOR
, the values have been
-remapped during conversion and several are inconsistent with the
-original data.AGE
, we have considered all people over
-90 to be exactly 90, and all people over 99 to be exactly
-99—labelled
variables don’t ensure that calculations are
-correct any more than factors do!Furthermore, many R functions ignore value labels or even actively -remove them from the data:
-
-ipums_val_labels(cps$HEALTH)
-#> # A tibble: 5 × 2
-#> val lbl
-#> <int> <chr>
-#> 1 1 Excellent
-#> 2 2 Very good
-#> 3 3 Good
-#> 4 4 Fair
-#> 5 5 Poor
-
-HEALTH2 <- ifelse(cps$HEALTH > 3, 3, cps$HEALTH)
-ipums_val_labels(HEALTH2)
-#> # A tibble: 0 × 2
-#> # ℹ 2 variables: val <dbl>, lbl <chr>
So, labelled
vectors are not intended for use throughout
-the entire analysis process. Instead, they should be used during the
-initial data preparation process to convert raw data into values that
-are more meaningful. These can then be converted to other variable types
-(often factors) for analysis.
Unfortunately, this isn’t a process that can typically be automated, -as it depends primarily on the research questions the data will be used -to address. However, ipumsr provides several functions to manipulate -value labels to make this process easier.
-Use as_factor() once labels have the correct categories and need no further manipulation. For instance, MONTH already has sensible categories, so we can convert it to a factor right away:
-ipums_val_labels(cps$MONTH)
-#> # A tibble: 12 × 2
-#> val lbl
-#> <int> <chr>
-#> 1 1 January
-#> 2 2 February
-#> 3 3 March
-#> 4 4 April
-#> 5 5 May
-#> 6 6 June
-#> 7 7 July
-#> 8 8 August
-#> 9 9 September
-#> 10 10 October
-#> 11 11 November
-#> 12 12 December
-
-cps$MONTH <- as_factor(cps$MONTH)
-as_factor() can also convert all labelled variables in a data frame to factors at once. If you prefer to work with factors, you can do this conversion immediately after loading the data, and then prepare these variables using the techniques you would use for factors.
-cps <- as_factor(cps)
-
-# ... further preparation of variables as factors
-If you prefer to handle these variables in labelled format, you can use the lbl_* helpers first, then call as_factor() on the entire data frame.
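-For example (a minimal sketch using MIGRATE1’s “Unknown” label, which is shown later in this article):
-
-# Recode with a lbl_* helper first, then convert to a factor
-migrate_f <- as_factor(lbl_na_if(cps$MIGRATE1, ~ .lbl == "Unknown"))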
-Some variables may be more appropriate to use as numeric values rather than factors. In these cases, you can simply remove the labels with zap_labels().
-INCTOT, which measures personal income, fits this description:
-inctot_num <- zap_labels(cps$INCTOT)
-
-typeof(inctot_num)
-#> [1] "double"
-
-ipums_val_labels(inctot_num)
-#> # A tibble: 0 × 2
-#> # ℹ 2 variables: val <dbl>, lbl <chr>
-Note that labelled values are not generally intended to be interpreted as numeric values, so zap_labels() should only be used after labels have been properly handled. For example, in INCTOT, labelled values used to identify missing values are encoded with large numbers:
-ipums_val_labels(cps$INCTOT)
-#> # A tibble: 2 × 2
-#> val lbl
-#> <dbl> <chr>
-#> 1 999999998 Missing. (1962-1964 only)
-#> 2 999999999 N.I.U.
-Treating these as legitimate observations will significantly skew any calculations with this variable if they are not first converted to NA.
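-A minimal sketch of that cleanup, using the lbl_na_if() helper introduced in the next section:
-
-# Recode both missing-data codes to NA before treating INCTOT as numeric
-inctot_clean <- zap_labels(lbl_na_if(cps$INCTOT, ~ .val >= 999999998))
-mean(inctot_clean, na.rm = TRUE)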
-Many IPUMS variables use labelled values to identify missing data. This allows for more detail about why certain observations were missing than would be available were values loaded as NA.
-As we saw with INCTOT, value labels were used to identify two types of missing data: those that are legitimately missing and those that are not in the universe of observations.
-ipums_val_labels(cps$INCTOT)
-#> # A tibble: 2 × 2
-#> val lbl
-#> <dbl> <chr>
-#> 1 999999998 Missing. (1962-1964 only)
-#> 2 999999999 N.I.U.
-To convert one or both of these labelled values to NA, use lbl_na_if(). You must supply a function to handle the conversion; it should take a value-label pair as input and return TRUE for those pairs whose values should be converted to NA.
-Several lbl_* helper functions, including lbl_na_if(), require a user-defined function to handle the recoding of value-label pairs. ipumsr provides a syntax to easily reference the values and labels in this user-defined function:
-The .val argument references the values
-The .lbl argument references the labels
-For instance, to convert all values equal to 999999999 to NA, we can provide a function that uses the .val argument:
-# Convert to NA using a function that returns TRUE for all labelled values equal to 999999999
-inctot_na <- lbl_na_if(
- cps$INCTOT,
- function(.val, .lbl) .val == 999999999
-)
-
-# All 999999999 values have been converted to NA
-any(inctot_na == 999999999, na.rm = TRUE)
-#> [1] FALSE
-
-# And the label has been removed:
-ipums_val_labels(inctot_na)
-#> # A tibble: 1 × 2
-#> val lbl
-#> <dbl> <chr>
-#> 1 999999998 Missing. (1962-1964 only)
-We could achieve the same result by referencing the labels themselves:
-
-# Convert to NA for labels that contain "N.I.U."
-inctot_na2 <- lbl_na_if(
- cps$INCTOT,
- function(.val, .lbl) grepl("N.I.U.", .lbl)
-)
-
-# Same result
-all(inctot_na2 == inctot_na, na.rm = TRUE)
-#> [1] TRUE
You can also specify the function using a one-sided formula:
-
-lbl_na_if(cps$INCTOT, ~ .val == 999999999)
-Note that .val only refers to labelled values—unlabelled values are not affected:
-x <- lbl_na_if(cps$INCTOT, ~ .val >= 0)
-
-# Unlabelled values greater than the cutoff are still present:
-length(which(x > 0))
-#> [1] 7501
-To convert unlabelled values to NA, use dplyr::na_if() instead.
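-For instance (a hedged sketch; treating zero income as missing is purely illustrative):
-
-# Remove the labels first, then convert unlabelled zeros to NA
-inctot_zero_na <- dplyr::na_if(zap_labels(cps$INCTOT), 0)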
-lbl_relabel() can be used to create new value-label pairs, often to recombine existing labels into more general categories. It takes a two-sided formula to handle the relabeling:
-On the left-hand side, use the lbl() helper to define a new value-label pair.
-On the right-hand side, provide a function that returns TRUE for those value-label pairs that should be relabelled with the new value-label pair from the left-hand side.
-The function again uses the .val and .lbl syntax mentioned above to refer to values and labels, respectively.
-For instance, we could reclassify the categories in MIGRATE1 such that all migration within a state is captured in a single category:
-ipums_val_labels(cps$MIGRATE1)
-#> # A tibble: 8 × 2
-#> val lbl
-#> <int> <chr>
-#> 1 0 NIU
-#> 2 1 Same house
-#> 3 2 Different house, place not reported
-#> 4 3 Moved within county
-#> 5 4 Moved within state, different county
-#> 6 5 Moved between states
-#> 7 6 Abroad
-#> 8 9 Unknown
-
-cps$MIGRATE1 <- lbl_relabel(
- cps$MIGRATE1,
- lbl(0, "NIU / Missing / Unknown") ~ .val %in% c(0, 2, 9),
- lbl(1, "Stayed in state") ~ .val %in% c(1, 3, 4)
-)
-
-ipums_val_labels(cps$MIGRATE1)
-#> # A tibble: 4 × 2
-#> val lbl
-#> <dbl> <chr>
-#> 1 0 NIU / Missing / Unknown
-#> 2 1 Stayed in state
-#> 3 5 Moved between states
-#> 4 6 Abroad
-Many IPUMS variables include detailed labels that are grouped together into more general categories. These are often encoded with multi-digit values, where the starting digit refers to the larger category.
-For instance, the EDUC variable contains categories for individual grades as well as categories for multiple grade groups:
-head(ipums_val_labels(cps$EDUC), 15)
-#> # A tibble: 15 × 2
-#> val lbl
-#> <int> <chr>
-#> 1 0 NIU or no schooling
-#> 2 1 NIU or blank
-#> 3 2 None or preschool
-#> 4 10 Grades 1, 2, 3, or 4
-#> 5 11 Grade 1
-#> 6 12 Grade 2
-#> 7 13 Grade 3
-#> 8 14 Grade 4
-#> 9 20 Grades 5 or 6
-#> 10 21 Grade 5
-#> 11 22 Grade 6
-#> 12 30 Grades 7 or 8
-#> 13 31 Grade 7
-#> 14 32 Grade 8
-#> 15 40 Grade 9
-You could use lbl_relabel() to collapse the detailed categories into the more general ones, but you would have to define new value labels for all the categories. Instead, you could use lbl_collapse().
-lbl_collapse() uses a function that takes .val and .lbl arguments and returns the new value each input value should be assigned to. The label of the lowest original value is used for each collapsed group. To group by the tens digit, use the integer division operator %/%:
-# %/% refers to integer division, which divides but discards the remainder
-10 %/% 10
-#> [1] 1
-11 %/% 10
-#> [1] 1
-
-# Convert to groups by tens digit
-cps$EDUC2 <- lbl_collapse(cps$EDUC, ~ .val %/% 10)
-
-ipums_val_labels(cps$EDUC2)
-#> # A tibble: 14 × 2
-#> val lbl
-#> <dbl> <chr>
-#> 1 0 NIU or no schooling
-#> 2 1 Grades 1, 2, 3, or 4
-#> 3 2 Grades 5 or 6
-#> 4 3 Grades 7 or 8
-#> 5 4 Grade 9
-#> 6 5 Grade 10
-#> 7 6 Grade 11
-#> 8 7 Grade 12
-#> 9 8 1 year of college
-#> 10 9 2 years of college
-#> 11 10 3 years of college
-#> 12 11 4 years of college
-#> 13 12 5+ years of college
-#> 14 99 Missing/Unknown
-It is always worth checking that the new labels make sense based on your research question. For instance, in the above example, both "12th grade, no diploma" and "High school diploma or equivalent" are collapsed to a single group as they both have values in the 70s. This may be suitable for your purposes, but for more control, it is best to use lbl_relabel().
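-One quick check (a minimal sketch): list the detailed labels that will share a collapsed group before collapsing them:
-
-# Which detailed EDUC categories share the tens digit 7?
-subset(ipums_val_labels(cps$EDUC), val %/% 10 == 7)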
-Note that lbl_relabel() and lbl_collapse() only operate on labelled values, and are therefore designed for use with fully labelled vectors. That is, if you attempt to relabel a vector that has some unlabelled values, they will be converted to NA.
-To avoid this, you can add labels for all values using lbl_add_vals() before relabeling (see below). In general, this shouldn’t be necessary, as most partially-labelled vectors only include labels with ancillary information, like missing value indicators. These can typically be handled by other helpers, like lbl_na_if(), without requiring relabeling.
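-A toy sketch of the pitfall described above (per that behavior, the unlabelled values 1 and 2 come back as NA):
-
-# Partially labelled vector: 1 and 2 carry no labels
-y <- haven::labelled(c(1, 2, 9), c(Unknown = 9))
-lbl_relabel(y, lbl(99, "Missing") ~ .lbl == "Unknown")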
-Some variables may contain labels for values that don’t appear in the data. These unused labels still appear as levels in factor representations of the variable, so it is often beneficial to remove them with lbl_clean():
-ipums_val_labels(cps$STATEFIP)
-#> # A tibble: 75 × 2
-#> val lbl
-#> <int> <chr>
-#> 1 1 Alabama
-#> 2 2 Alaska
-#> 3 4 Arizona
-#> 4 5 Arkansas
-#> 5 6 California
-#> 6 8 Colorado
-#> 7 9 Connecticut
-#> 8 10 Delaware
-#> 9 11 District of Columbia
-#> 10 12 Florida
-#> # ℹ 65 more rows
-
-ipums_val_labels(lbl_clean(cps$STATEFIP))
-#> # A tibble: 5 × 2
-#> val lbl
-#> <int> <chr>
-#> 1 19 Iowa
-#> 2 27 Minnesota
-#> 3 38 North Dakota
-#> 4 46 South Dakota
-#> 5 55 Wisconsin
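-Cleaning before conversion keeps the unused labels from becoming factor levels:
-
-# All 75 labels become levels without cleaning; only the 5 observed states remain after
-nlevels(as_factor(cps$STATEFIP))
-nlevels(as_factor(lbl_clean(cps$STATEFIP)))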
-As mentioned above, value labels are intended to be used as an intermediate data structure for preparing newly-imported data. As such, you’re not likely to need to add new labels, but if you do, use lbl_add(), lbl_add_vals(), or lbl_define().
-lbl_add() takes an arbitrary number of lbl() placeholders that will be added to a given labelled vector:
-x <- haven::labelled(
- c(100, 200, 105, 990, 999, 230),
- c(`Unknown` = 990, NIU = 999)
-)
-
-lbl_add(
- x,
- lbl(100, "$100"),
- lbl(105, "$105"),
- lbl(200, "$200"),
- lbl(230, "$230")
-)
-#> <labelled<double>[6]>
-#> [1] 100 200 105 990 999 230
-#>
-#> Labels:
-#> value label
-#> 100 $100
-#> 105 $105
-#> 200 $200
-#> 230 $230
-#> 990 Unknown
-#> 999 NIU
-lbl_add_vals() adds labels for all unlabelled values in a labelled vector with an optional labeller function. (This can be useful if you wish to operate on a partially labelled vector with a function that requires labelled input, like lbl_relabel().)
-# `.` refers to each unlabelled value being labelled
-lbl_add_vals(x, ~ paste0("$", .))
-#> <labelled<double>[6]>
-#> [1] 100 200 105 990 999 230
-#>
-#> Labels:
-#> value label
-#> 100 $100
-#> 105 $105
-#> 200 $200
-#> 230 $230
-#> 990 Unknown
-#> 999 NIU
-lbl_define() makes a labelled vector out of an unlabelled one. Use the same syntax as is used for lbl_relabel() to define new labels based on the unlabelled values:
-age <- c(10, 12, 16, 18, 20, 22, 25, 27)
-
-# Group age values into two label groups.
-# Values not captured by the right hand side functions remain unlabelled
-lbl_define(
- age,
- lbl(1, "Pre-college age") ~ .val < 18,
- lbl(2, "College age") ~ .val >= 18 & .val <= 22
-)
-#> <labelled<double>[8]>
-#> [1] 1 1 1 2 2 2 25 27
-#>
-#> Labels:
-#> value label
-#> 1 Pre-college age
-#> 2 College age
-Once all labelled variables have been appropriately converted to factors or numeric values, the data can move forward in the processing pipeline.
-The haven package, which underlies ipumsr’s handling of value labels, provides more details on the labelled class. See vignette("semantics", package = "haven").
-The labelled package provides other methods for manipulating value labels, some of which overlap those provided by ipumsr.
-The questionr package includes functions for exploring labelled variables. In particular, the functions describe(), freq(), and lookfor() all print information about a variable to the console using its value labels.
-Finally, the foreign and prettyR packages don’t use the labelled class, but provide similar functionality for handling value labels, which could be adapted for use with labelled vectors.