---
output: github_document
editor_options:
  chunk_output_type: console
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  message = FALSE,
  warning = FALSE
)
devtools::load_all()
```
# healthdatacsv <img src='man/figures/logo.png' align="right" height="139" />
<!-- badges: start -->
[![Lifecycle: maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing)
[![Travis build status](https://travis-ci.org/iecastro/healthdatacsv.svg?branch=master)](https://travis-ci.org/iecastro/healthdatacsv)
[![Codecov test coverage](https://codecov.io/gh/iecastro/healthdatacsv/branch/master/graph/badge.svg)](https://codecov.io/gh/iecastro/healthdatacsv?branch=master)
[![peer-review](https://badges.ropensci.org/358_status.svg)](https://github.com/ropensci/software-review/issues/358)
[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/iecastro/healthdatacsv?branch=master&svg=true)](https://ci.appveyor.com/project/iecastro/healthdatacsv)
<!-- badges: end -->
**healthdatacsv** lets you query the [healthdata.gov API](https://healthdata.gov/node/25341) catalog and returns tidy data frames. The package focuses on the `data.json` endpoint and downloads datasets that are available as file downloads, namely in CSV format.
## Installation
You can install the development version of **healthdatacsv** from [GitHub](https://github.com/) with:
``` {r, eval = FALSE}
install.packages("remotes")
remotes::install_github("iecastro/healthdatacsv")
```
## Examples
```{r}
library(healthdatacsv)
```
Basic examples that show how to use **healthdatacsv**:
- Querying the API based on keywords:
```{r example}
fetch_catalog(keyword = "alcohol|drugs")
```
- Querying the API and downloading data for a single product:
```{r}
fetch_catalog("Centers for Disease Control and Prevention",
              keyword = "built environment") %>%
  fetch_data()
```
- Querying the API and downloading data for multiple data products. The data are nested in a list-column:
```{r}
library(dplyr) # for table manipulation verbs
# query catalog
fetch_catalog(keyword = "alcohol")
# fetch data
fetch_catalog(keyword = "alcohol") %>%
  dplyr::slice(1) %>% # keep only the first row of the catalog
  fetch_data()
```
`fetch_data()` wraps `vroom()` from the [vroom](https://vroom.r-lib.org/) package, which reads relatively large delimited files quickly:
```{r}
# PRAMS data
# CDC surveillance for reproductive health
# query, filter, and fetch
prams <- fetch_catalog(keyword = "reproductive health") %>%
  mutate(year = readr::parse_number(product)) %>%
  filter(year >= 2010) %>%
  arrange(year) %>%
  fetch_data()

prams %>%
  select(product, data_tbl)
```
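The `mutate()` step above relies on `readr::parse_number()` to pull the year out of each product title. Here is a minimal sketch of that step on its own, with no API call. The `catalog_sketch` object and its titles are made up for illustration and are not actual catalog output:

```{r}
library(dplyr)

# Hypothetical product titles, for illustration only
catalog_sketch <- tibble::tibble(
  product = c(
    "Survey Data for 2011",
    "Survey Data for 2009",
    "Survey Data for 2010"
  )
)

# Extract the first number in each title, then filter and sort on it
recent <- catalog_sketch %>%
  mutate(year = readr::parse_number(product)) %>%
  filter(year >= 2010) %>%
  arrange(year)

recent
```

Note that `parse_number()` returns the first number it finds, so this approach assumes the year is the first number in the title.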
## Basic workflow
Say you're interested in searching for CDC datasets related to the built environment. You can use **healthdatacsv** to fetch the catalog of available data products:
```{r}
cdc_built_env <- fetch_catalog("Centers for Disease Control and Prevention",
                               keyword = "built environment")
cdc_built_env
```
In this case, there is only one product available. To learn more about the dataset, you can simply `pull` the description:
```{r}
cdc_built_env %>%
  dplyr::pull(description)
```
This dataset covers state legislation on nutrition, physical activity, and obesity during 2001-2017. It includes only enacted legislation.
To download the data, pass the catalog object to the `fetch_data()` function. Since there is only one dataset to download, you can `unnest` in the same pipe. *If the catalog has more than one product that you'd like to keep, it is recommended to unnest each product separately.* If the catalog consists of several time points of the same dataset, you can unnest in one pipe, provided all column names are the same.
```{r}
data_raw <- cdc_built_env %>%
  fetch_data() %>%
  tidyr::unnest(data_tbl)

data_raw
```
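As a sketch of unnesting each product separately, the toy example below builds a small nested tibble by hand instead of calling the API. The `nested` object, its product names, and its values are made up for illustration. Only the `product` and `data_tbl` column names follow the output shown above:

```{r}
library(dplyr)

# Toy stand-in for the output of fetch_data(); all values are made up
nested <- tibble::tibble(
  product = c("Product A", "Product B"),
  data_tbl = list(
    tibble::tibble(state = c("NY", "CA"), n_bills = c(3, 5)),
    tibble::tibble(year = c(2015, 2016), rate = c(0.1, 0.2))
  )
)

# Unnest one product at a time so unrelated columns are not combined
product_a <- nested %>%
  filter(product == "Product A") %>%
  tidyr::unnest(data_tbl)

product_b <- nested %>%
  filter(product == "Product B") %>%
  tidyr::unnest(data_tbl)

product_a
```

Keeping each product in its own data frame avoids a single wide table in which columns from unrelated datasets sit side by side.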
### Some helper functions
`get_agencies()` queries the catalog and downloads the names of the agencies listed in it. You can enter a partial name or the initials of an agency of interest to check whether it has data cataloged in the API. The argument relies on a [regular expression](https://stringr.tidyverse.org/articles/regular-expressions.html) to detect string matches.
```{r}
get_agencies("NIH|CDC|FDA")
get_agencies("Institute|Drug")
```
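To illustrate how the `|` alternation in these patterns works, here is a small sketch on a hand-written vector. It assumes matching behaves like `stringr::str_detect()`. The actual internals of `get_agencies()` may differ, and the `agencies` vector is illustrative only:

```{r}
# Illustrative agency names, written by hand rather than fetched from the API
agencies <- c(
  "National Institutes of Health (NIH)",
  "Centers for Disease Control and Prevention",
  "Food and Drug Administration"
)

# "|" means "or": keep names that match either "Institute" or "Drug"
matched <- agencies[stringr::str_detect(agencies, "Institute|Drug")]

matched
```

Matching is case sensitive by default, so a pattern such as `"institute"` would not match in this sketch.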
The resulting data frame depends on which agency names match the string supplied. To pull all cataloged agencies, call `get_agencies()` with its default argument:
```{r}
get_agencies()
```
`get_keywords()` extracts all cataloged keywords. The function accepts a full *agency* name (in proper title case) as its argument; you can obtain the name with `get_agencies()`.
```{r}
get_keywords("National Institutes of Health (NIH)")
```
Alternatively, you can run this function with no argument to get a data frame of all cataloged keywords and agencies.
#### Info
Development of this package was partly supported by a research grant from the National Institute on Alcohol Abuse and Alcoholism - NIH Grant #R34AA026745. This product is neither endorsed nor certified by healthdata.gov or NIH/NIAAA.
Please note that the 'healthdatacsv' project is released with a [Contributor Code of Conduct](.github/CODE_OF_CONDUCT.md). By contributing to this project, you agree to abide by its terms.