Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
616 lines (390 sloc) 9.43 KB
---
title: "Data Analysis in R"
subtitle: "Data Cleaning"
author: "Michael E DeWitt Jr"
date: "2018-09-20 (updated: `r Sys.Date()`)"
output:
xaringan::moon_reader:
df_print: tibble
css: ["mtheme_max.css", "fonts_mtheme_max.css"]
self_contained: false
lib_dir: libs
nature:
ratio: '16:9'
highlightLanguage: R
countIncrementalSlides: false
---
```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(rows.print=5)
library(knitr)
library(tidyverse)
```
background-image: url(https://images.officeworld.com/products/500/272885.JPG)
# Cleaning is always a chore...
???
Image credit: [Office World](https://images.officeworld.com/products/500/272885.JPG)
---
background-image: url(https://secure.i.telegraph.co.uk/multimedia/archive/02064/Sir-James-Dyson_2064492b.jpg)
# But wouldn't you rather be a Dyson?
???
Image credit: [Daily Telegraph](https://secure.i.telegraph.co.uk/multimedia/archive/02064/Sir-James-Dyson_2064492b.jpg)
---
background-image: url(https://amp.businessinsider.com/images/5b9144b8e199f336008b5aff-960-719.jpg)
background-size: 70%
# Or even better...
---
class: inverse, center, middle
# Get Started
---
# Getting Data in...
.pull-left[
The first part of the process in uploading data into R.
But the biggest question is what kind of data do you have?
]
.pull-right[
```{r echo=FALSE}
include_graphics("https://i.stack.imgur.com/SSDlE.jpg")
```
]
---
# Some Common Data Formats
- Delimited Files (fixed width, comma, tab, ...)
- Excel
- SPSS
- STATA
- Semi- structured (JSON, XML)
---
# Enter `readr`, part of the `tidyverse`
`readr` is a package in the `tidyverse`
--
Designed with interfaces to handle most files
--
Has some nice defaults that make it easier than some base packages
---
# Starting with CSVs
## Syntax
```{r eval=FALSE}
data <- read_csv("path_to_csv.csv")
```
--
## Options
- Specify column names
- Where data start
- Column types
---
# Other delimiters
## Tab Delimited Files
```{r eval=FALSE}
data <- read_tsv("path_to_csv.txt")
```
## Any other delimiter
```{r eval=FALSE}
data <- read_delim("path_to_csv.txt", delim = ";")
```
---
# Now to `haven`
The `haven` package can be used to read more proprietary data formats into R
## SAS
```{r eval=FALSE}
my_sas <- read_sas("myfile.sas7bdat")
```
## SPSS
```{r eval=FALSE}
my_sas <- read_spss("myfile.sav")
```
## STATA
```{r eval=FALSE}
my_sas <- read_dta("myfile.dta")
```
---
# Excel using `readxl`
And of course the most ubiquitous data form...
```{r eval=FALSE}
my_excel <-read_excel("my_excel_file.xlsx")
```
--
- Specify cell ranges
- Name columns
- etc
---
# And other formats
Json with `jsonlite`
XML with `xml2` or `XML` _(my preference is `xml2`)_
---
class: inverse, center, middle
# Now for cleaning
---
# Tidy data
Most of the `tidyverse` relies on the ["tidy"](https://vita.had.co.nz/papers/tidy-data.pdf) paradigm of data
--
One observation per row, one attribute per column
--
Once data is in the format, visualisation and modeling becomes easier
---
# Wide to Narrow and Back Again
We will use the `tidyr` package to help move from wide data to narrow data to move to the tidy format
--
## Key functions
### `spread`
Narrow to Wide
###`gather`
Wide to narrow
---
# `Spread`
#### Syntax
```{r eval=FALSE}
spread(data = data,
key = "What You Want to Be Columns",
value = "The value you want in the rows",
fill = "What you want to appear if there are no values")
```
```{r echo=FALSE}
iris_narrow <- iris %>%
group_by(Species) %>%
summarise(mu_petal = mean(Petal.Length),
mu_sepal = mean(Sepal.Width))
```
--
#### Example
```{r echo=FALSE}
# From http://stackoverflow.com/questions/1181060
stocks <- tibble(
time = as.Date('2010-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
)
stocks <- gather(stocks, stock, price, -time)
```
.pull-left[
```{r}
stocks
```
]
.pull-right[
```{r}
stocks %>%
spread(stock, price)
```
]
---
# `Gather`
#### Syntax
```{r eval=FALSE}
gather(data = data,
key = "new name of column",
value = "what you want to call value",
-"what you don't want grouped")
```
```{r echo=FALSE}
mini_iris <- iris[c(1, 51, 101), ]
```
--
#### Example
.pull-left[
```{r}
mini_iris
```
]
.pull-right[
```{r}
mini_iris %>%
gather(attribute, measurement, -Species)
```
]
---
class: inverse, center, middle
# Now Let's Practice `spread` and `gather`
---
# Subset with `select`
Suppose you want to retain only a few columns.
This can be done with the `select` command from `dplyr`
```{r rows.print = 5}
iris %>%
select(Sepal.Length, Petal.Width)
```
---
# Helpers for `select`
**`starts_with("something")`** selects those columns that start with a specified string
**`contains("something")`** selects those columns that have a string *anywhere* in the column name
**`ends_with("something")`** selects those columns that end with a specified string
**`everything()`** selects everything else that was not explicitly selected
--
```{r echo=FALSE}
iris <- as_tibble(iris)
```
```{r rows.print = 5}
iris %>%
select(starts_with("Sepal"))
```
---
# `unite` and `separate`
**`unite`** is a function that allows you to join (paste together) two columns
```{r echo=FALSE}
sample_df <- tribble(
~col1, ~col2, ~ col3,
"A", 1, "X",
"B", 2, "Y",
"C",3, "Z",
)
```
#### Syntax
```{r eval=FALSE}
unite(data = data, col = "new_column_name", sep = "separator",
remove = "T/F if you want to drop the columns", -column_you_dont_want_joined)
```
--
#### Example
.pull-left[
```{r}
sample_df
```
]
.pull-right[
```{r}
sample_df %>%
unite(col = "united", sep = "_",
remove = FALSE, -col3)
```
]
---
# `unite` and `separate`
**`separate`** is a function that allows you to split a column into multiple columns
```{r echo=FALSE}
sample_df <- tribble(
~united, ~ col3,
"A_1", "X",
"B_2", "Y",
"C_3", "Z",
)
```
#### Syntax
```{r eval=FALSE}
separate(data = data, col = "what to separate", into = "new columns",
sep = " what to separate by", remove = "T/F")
```
--
#### Example
.pull-left[
```{r}
sample_df
```
]
.pull-right[
```{r}
sample_df %>%
separate(col = united, into = c("col1", "col2"),
sep = "_", remove = TRUE)
```
]
---
class: inverse, center, middle
# Now Let's Practice `select`, `gather` and `unite`
---
# Using `filter`
`filter` can be used to filter values to those ones that you would like
#### Syntax
```{r eval=FALSE}
filter(data = data, condition to test)
```
--
#### Example
.pull-left[
```{r rows.print=5}
table(iris[["Species"]])
```
]
.pull-right[
```{r rows.print = 5}
iris %>%
filter(Species == "setosa") %>%
select(Species, contains("Sepal"))
```
]
---
# Logical Operators
- **==** for testing equivalence
- **!=** not equal to
- **>** / **<** greater than or less than
- **>=** / **<=** greater than or less than or equal to
- **&**/ **|** and and or allows for combining conditions (e.g. Species == "Setosa" & Sepal.Length >5)
---
# `rename` for changing column names
**`rename`** can be used
#### Syntax
```{r eval=FALSE}
rename(data = data, new_name = old_name)
```
--
#### Example
.pull-left[
```{r rows.print=5}
iris
```
]
.pull-right[
```{r rows.print = 5}
iris %>%
rename(sepal_length = Sepal.Length) %>%
select(sepal_length, ends_with("Width"))
```
]
---
#`mutate` to add new columns
**mutate** can be used to create a _derrived_ or _calculated_ column
#### Syntax
```{r eval=FALSE}
mutate(data = data, new_column = operation you want to do)
```
--
#### Example
.pull-left[
```{r rows.print=5}
mtcars
```
]
.pull-right[
```{r rows.print = 5}
mtcars %>%
mutate(mpg_per_wt = mpg / wt) %>%
select(mpg, wt, mpg_per_wt)
```
]
---
# Group Operations
`group_by` allows for group-wise observations
You can then feed these groups to the `sumarise/summarize` function to do group-wise calculations
--
```{r}
iris %>%
group_by(Species) %>%
summarise(mu_length = mean(Sepal.Length),
med_width = median(Sepal.Width))
```
---
# Add more groups with `group_by`
```{r}
mtcars %>%
group_by(cyl, am) %>%
summarise(avg_mpg = mean(mpg),
n = n(),
min_wt = min(wt),
max_wt = max(wt)) %>%
mutate(perc_of_group = n/sum(n)) %>%
ungroup() %>% # Needed to remove grouping
mutate(perc_total = n/sum(n))
```
---
# To Recap
**`gather`** and **`spread`** to transform our data between wide and long forms
**`select`** to subset columns
**`filter`** to filter data to a specific condition
**`unite`** and **`separate`** to combine and separate columns
**`rename`** to rename our columns
**`mutate`** to add new derrived or calculated columns
**`group_by`** for group-wise operations
**`summarise`** for summary functions on grouped variables
---
class: inverse, center, middle
# Thanks!
You can’t perform that action at this time.