# R Data Cleaning Cheat Sheet

In [1]:
require(tidyverse)

Loading required package: tidyverse

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [2]:
data <- read_csv("scores.csv")

Rows: 4 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): participant_id, group
dbl (4): condition, test1, test2, test3

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


## Handling missing data sentinels

Sometimes, special values (like -999) called *sentinels* are used to indicate missing data. The `na_if` function can be used to replace these values with `NA` to indicate that they are missing.

In [3]:
data

participant_id,group,condition,test1,test2,test3
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
1,exp,1,6,9,-999
1,exp,2,4,8,9
2,con,1,9,10,8
2,con,2,7,9,7


To replace missing data sentinels in the `test3` column, use `na_if` inside `mutate`.

In [4]:
data |>
  mutate(test3 = na_if(test3, -999))

participant_id,group,condition,test1,test2,test3
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
1,exp,1,6,9,NA
1,exp,2,4,8,9.0
2,con,1,9,10,8.0
2,con,2,7,9,7.0
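
As a minimal sketch of what `na_if` does on its own, the example below applies it to a plain numeric vector. The vector `scores` is hypothetical and is not part of `scores.csv`.

```r
library(dplyr)

# Hypothetical scores, with -999 used as the missing data sentinel
scores <- c(6, -999, 8, 7)

# na_if() returns the vector with every value equal to -999 replaced by NA
cleaned <- na_if(scores, -999)
cleaned
```

Values that do not equal the sentinel are returned unchanged.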


To replace missing data sentinels in multiple columns, use `across` to apply the `na_if` function to each column. The `~` indicates that the code after it (here, `na_if(., -999)`) should be treated as a function, which will be applied to each of the indicated columns. The `.` is a placeholder for the column being evaluated.

In [5]:
data |>
  mutate(across(c(test1, test2, test3), ~ na_if(., -999)))

participant_id,group,condition,test1,test2,test3
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
1,exp,1,6,9,NA
1,exp,2,4,8,9.0
2,con,1,9,10,8.0
2,con,2,7,9,7.0
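
The same replacement can be written with a tidyselect helper instead of listing each column by name. The sketch below assumes the test columns are the only ones whose names start with `test`, and it rebuilds the example data inline as `scores` (a hypothetical stand-in for `scores.csv`) so that it runs on its own.

```r
library(dplyr)

# Inline stand-in for scores.csv, using the values shown above
scores <- tibble(
  participant_id = c("1", "1", "2", "2"),
  group = c("exp", "exp", "con", "con"),
  condition = c(1, 2, 1, 2),
  test1 = c(6, 4, 9, 7),
  test2 = c(9, 8, 10, 9),
  test3 = c(-999, 9, 8, 7)
)

# starts_with("test") selects test1, test2, and test3.
# Inside the lambda, .x refers to the current column (equivalent to .)
cleaned <- scores |>
  mutate(across(starts_with("test"), ~ na_if(.x, -999)))
cleaned
```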


Alternatively, we can set the `na` argument when calling `read_csv` to convert those special values to `NA` immediately, in all columns. Here, setting `show_col_types = FALSE` suppresses the messages that are usually shown when using `read_csv`.

In [6]:
data <- read_csv("scores.csv", na = "-999", show_col_types = FALSE)
data

participant_id,group,condition,test1,test2,test3
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
1,exp,1,6,9,NA
1,exp,2,4,8,9.0
2,con,1,9,10,8.0
2,con,2,7,9,7.0
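
One caveat: passing `na` replaces the default value, `c("", "NA")`, rather than adding to it. The sketch below keeps the defaults and adds the sentinel. It reads literal CSV text wrapped in `I()` (a hypothetical stand-in for `scores.csv`, with made-up rows) so that it runs without the file.

```r
library(readr)

# Literal CSV text standing in for scores.csv; I() marks it as data, not a path
csv_text <- I("participant_id,group,test3\n1,exp,-999\n2,con,\n3,con,8\n")

# Treat empty fields, "NA", and the -999 sentinel as missing
scores <- read_csv(
  csv_text,
  na = c("", "NA", "-999"),
  show_col_types = FALSE
)
scores
```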


## Recoding variables

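Use `recode` or `case_match` to change the values of variables and `as.numeric` or `as.character` to change the data type of variables. As of dplyr 1.1.0, `recode` is superseded by `case_match`, so `case_match` is preferred in new code.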

In [7]:
data |>
  mutate(group = recode(group, "exp" = "Experimental", "con" = "Control"))

participant_id,group,condition,test1,test2,test3
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
1,Experimental,1,6,9,NA
1,Experimental,2,4,8,9.0
2,Control,1,9,10,8.0
2,Control,2,7,9,7.0


In [8]:
data |>
  mutate(group = case_match(group, "exp" ~ "Experimental", "con" ~ "Control"))

participant_id,group,condition,test1,test2,test3
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
1,Experimental,1,6,9,NA
1,Experimental,2,4,8,9.0
2,Control,1,9,10,8.0
2,Control,2,7,9,7.0
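
Values that match none of the `case_match` patterns become `NA`. The sketch below uses the `.default` argument to keep unmatched values as they are. The vector `group` and its `"pilot"` value are hypothetical.

```r
library(dplyr)

# Hypothetical group codes, including one that has no mapping
group <- c("exp", "con", "pilot")

# Without .default, the unmatched "pilot" would become NA;
# .default = group keeps the original value instead
labels <- case_match(
  group,
  "exp" ~ "Experimental",
  "con" ~ "Control",
  .default = group
)
labels
```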


To change a variable into a different data type, use a type conversion function such as `as.numeric` or `as.character`.

In [9]:
data |>
  mutate(condition = as.character(condition))

participant_id,group,condition,test1,test2,test3
<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
1,exp,1,6,9,NA
1,exp,2,4,8,9.0
2,con,1,9,10,8.0
2,con,2,7,9,7.0
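
Conversion in the other direction, from character to numeric, can introduce missing values. A minimal sketch, assuming a hypothetical character vector `raw` that contains one entry that is not a number:

```r
# Hypothetical character vector; "n/a" is not a valid number
raw <- c("1", "2", "n/a")

# as.numeric() converts what it can and returns NA (with a warning) for the rest.
# suppressWarnings() is used here only to keep the example output quiet.
converted <- suppressWarnings(as.numeric(raw))
converted
```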


To map numeric codes to descriptive strings (character vectors in R), use `case_match`.

In [10]:
data |>
  mutate(condition = case_match(condition, 1 ~ "target", 2 ~ "lure"))

participant_id,group,condition,test1,test2,test3
<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>
1,exp,target,6,9,NA
1,exp,lure,4,8,9.0
2,con,target,9,10,8.0
2,con,lure,7,9,7.0


## Working with missing data

Missing data should be marked as `NA` in data frames. They can then easily be excluded from calculations.

To count the number of `NA` values in each column, use `summarise_all` (which applies some summarizing function to all columns) with `~ sum(is.na(.))`, which sums the count of `NA` values for each column.

In [11]:
summarise_all(data, ~ sum(is.na(.)))

participant_id,group,condition,test1,test2,test3
<int>,<int>,<int>,<int>,<int>,<int>
0,0,0,0,0,1
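
`summarise_all` is superseded in current dplyr. The same count can be written with `summarise` and `across`. A minimal sketch, using a small hypothetical tibble `scores` in place of `data`:

```r
library(dplyr)

# Hypothetical data with one missing value in test3
scores <- tibble(
  test1 = c(6, 4, 9, 7),
  test3 = c(NA, 9, 8, 7)
)

# everything() selects all columns; sum(is.na(.x)) counts the NAs in each one
na_counts <- scores |>
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts
```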


By default, calculations do not ignore `NA` values, so the mean of a column that contains an `NA` is itself `NA`.

In [12]:
data |> 
  summarise(across(c(test1, test2, test3), mean))

test1,test2,test3
<dbl>,<dbl>,<dbl>
6.5,9,NA


Calculations can be set to ignore `NA` values by setting `na.rm` to `TRUE`.

In [13]:
data |>
  summarise(across(c(test1, test2, test3), ~ mean(., na.rm = TRUE)))

test1,test2,test3
<dbl>,<dbl>,<dbl>
6.5,9,8


Alternatively, we can replace `NA` values with some other value using `replace_na`.

In [14]:
data |>
  mutate(across(c(test1, test2, test3), ~ replace_na(., 0)))

participant_id,group,condition,test1,test2,test3
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
1,exp,1,6,9,0
1,exp,2,4,8,9
2,con,1,9,10,8
2,con,2,7,9,7
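
`replace_na` can also be called on a whole data frame with a named list, which allows a different replacement for each column. A sketch with hypothetical data (the `"unknown"` label is made up for illustration):

```r
library(tidyr)

# Hypothetical data with a missing group label and a missing score
scores <- tibble(
  group = c("exp", NA, "con"),
  test3 = c(NA, 9, 8)
)

# Each list element names a column and gives the value to use for its NAs
filled <- scores |>
  replace_na(list(group = "unknown", test3 = 0))
filled
```

Whether 0 is a sensible replacement depends on the analysis, since it will change summary statistics such as the mean.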


Use `drop_na` to completely exclude rows that contain `NA` in any column.

In [15]:
data |>
  drop_na()

participant_id,group,condition,test1,test2,test3
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
1,exp,2,4,8,9
2,con,1,9,10,8
2,con,2,7,9,7
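
`drop_na` also accepts column names, in which case only those columns are checked for `NA`. A sketch with a small hypothetical tibble:

```r
library(tidyr)

# Hypothetical data: one NA in test1 and one in test3
scores <- tibble(
  test1 = c(6, NA, 9, 7),
  test3 = c(NA, 9, 8, 7)
)

# Only rows with NA in test3 are dropped; the row with NA in test1 is kept
kept <- scores |>
  drop_na(test3)
kept
```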
