README.Rmd

---
output: github_document
editor_options: 
  chunk_output_type: console
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  message = FALSE,
  warning = FALSE
)

devtools::load_all()

```

# healthdatacsv <img src='man/figures/logo.png' align="right" height="139" />

<!-- badges: start -->
[![Lifecycle: maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing)
[![Travis build status](https://travis-ci.org/iecastro/healthdatacsv.svg?branch=master)](https://travis-ci.org/iecastro/healthdatacsv)
[![Codecov test coverage](https://codecov.io/gh/iecastro/healthdatacsv/branch/master/graph/badge.svg)](https://codecov.io/gh/iecastro/healthdatacsv?branch=master)
[![peer-review](https://badges.ropensci.org/358_status.svg)](https://github.com/ropensci/software-review/issues/358)
[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/iecastro/healthdatacsv?branch=master&svg=true)](https://ci.appveyor.com/project/iecastro/healthdatacsv)
<!-- badges: end -->

**healthdatacsv** allows users to query the [healthdata.gov API](https://healthdata.gov/node/25341) catalog and return tidy data frames. This package focuses on the data.json endpoint and will download datasets if available via file download, namely, in csv format.  

## Installation

You can install the development version of **healthdatacsv** from [GitHub](https://github.com/) with:

``` {r, eval = FALSE}
install.packages("remotes")
remotes::install_github("iecastro/healthdatacsv")

```

## Examples

```{r}
library(healthdatacsv)
```

Basic examples which show you how to use **healthdatacsv**:  

- Querying API based on keywords:  

```{r example}
fetch_catalog(keyword = "alcohol|drugs")

```

- Querying API and downloading data for single product:  

```{r}
fetch_catalog("Centers for Disease Control and Prevention",
                               keyword = "built environment") %>% 
  fetch_data()

```

- Querying API and downloading data for multiple data products.  Data will be nested in a list-column:  

```{r} 
library(dplyr) # for table manipulation verbs

# query catalog
fetch_catalog(keyword = "alcohol")

# fetch data
fetch_catalog(keyword = "alcohol") %>% 
  dplyr::slice(1) %>% # dplyr 
  fetch_data()

```

`fetch_data()` wraps the [vroom](https://vroom.r-lib.org/) function, which helps quickly read relatively large delimited files:  

```{r}
# PRAMS data
# CDC surveillance for reproductive health
# query, filter, and fetch
prams <- fetch_catalog(keyword = "reproductive health") %>% 
  mutate(year = readr::parse_number(product)) %>% 
  filter(year >= 2010) %>% 
  arrange(year) %>% 
  fetch_data()

prams %>% 
  select(product, data_tbl)

```


## Basic workflow

Say you're interested in searching for CDC datasets related to the built environment.  You can use **healthdatacsv** to fetch the catalog of available data products:

```{r}
cdc_built_env <- fetch_catalog("Centers for Disease Control and Prevention",
                               keyword = "built environment")

cdc_built_env

```

In this case, there is only one product available.  To learn more about the dataset, you can simply `pull` the description:

```{r}
cdc_built_env %>%
  dplyr::pull(description)

```

This dataset relates to state legislation on nutrition, physical activity, and obesity during 2001-2017.  Data only includes enacted legislation.

To download the data, we pass the catalog object to the `fetch_data` function.  Since there is only one dataset to download, we can `unnest` in the same pipe.  *If the catalog has more than one product that you'd like to keep, it is recommended to unnest each product separately.* If the catalog consists of several time points of the same dataset, you could unnest in one pipe, given all column names are the same. 


```{r}
data_raw <- cdc_built_env %>%
  fetch_data() %>% #> 
  tidyr::unnest(data_tbl)

data_raw
```


### Some helper functions

`get_agencies()` will query and download names of agencies listed in the catalog. You can enter partial name or initials of agency of interest to check if it has data cataloged in the API. Argument relies on [regular expression](https://stringr.tidyverse.org/articles/regular-expressions.html) to detect string matches.

```{r}

get_agencies("NIH|CDC|FDA")

get_agencies("Institute|Drug")

```

The resulting dataframe will depend on matches to the string supplied. To pull all agencies cataloged, simply use `get_agencies` with the default argument:

```{r}
get_agencies()
```

`get_keywords()` will extract all keywords cataloged. The function accepts as argument a full *agency* name (in proper title case) which can be obtained with `get_agencies`.  

```{r}
get_keywords("National Institutes of Health (NIH)")

```

Conversely, this function can be run with no argument for a data frame of all keywords and agencies cataloged.

#### Info

Development of this package partly supported by a research grant from the National Institute on Alcohol Abuse and Alcoholism - NIH Grant #R34AA026745. This product is not endorsed nor certified by either healthdata.gov or NIH/NIAAA.  

Please note that the 'healthdatacsv' project is released with a [Contributor Code of Conduct](.github/CODE_OF_CONDUCT.md). By contributing to this project, you agree to abide by its terms.