Permalink
Switch branches/tags
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
55 lines (46 sloc) 1.92 KB
---
title: "Data cleaning"
output: github_document
---
I explicitly use this package to teach data cleaning, so have refactored my old cleaning code into several scripts. I also include them as compiled Markdown reports. Caveat: these are realistic cleaning scripts! Not the highly polished ones people write with 20/20 hindsight :) I wouldn't necessarily clean it the same way again (and I would download more recent data!), but at this point there is great value in reproducing the data I've been using for ~5 years.
Cleaning history
* 2010: The first time I documented cleaning this dataset. I started with
delimited files I exported from Excel. Not present in this repo.
* 2014: I re-cleaned the data and (mostly) forced myself to pull it straight
out of the spreadsheets. Used the `gdata` package. It was kind of painful, due to encoding and other issues. See the scripts in this state in [v0.1.0](https://github.com/jennybc/gapminder/tree/v0.1.0/data-raw).
* 2015: I revisited the cleaning and switched to `readxl`. This was much less painful. Present day.
```{r results='asis', echo = FALSE, warning = FALSE}
library(tidyverse)
library(stringr)
library(knitr)
library(here)
x <- tibble(fls = list.files(here("data-raw"))) %>%
mutate(fls_basename = basename(fls)) %>%
separate(fls_basename, c("script", "slug", "ext"), "[_\\.]")
x <- x %>%
filter(script %>% str_detect("^[0-9]+"),
ext %>% str_detect("R|r|md|tsv")) %>%
select(-slug)
y <- x %>%
group_by(script) %>%
nest()
collapse_md_links <- function(x) {
x %>% {
paste0("[", ., "](", ., ")")
} %>%
paste(collapse = ", ")
}
jfun <- function(z) {
tibble(
r_script = z$fls[z$ext == "R"] %>% collapse_md_links(),
notebook = z$fls[z$ext == "md"] %>% collapse_md_links(),
tsv = z$fls[z$ext == "tsv"] %>% collapse_md_links()
)
}
y$data %>%
map_df(jfun) %>%
kable()
```
```{r eval = FALSE, echo = FALSE}
devtools::session_info()
```