forked from stephaniehicks/jhustatcomputing2022
-
Notifications
You must be signed in to change notification settings - Fork 4
/
index.qmd
704 lines (499 loc) · 22.2 KB
/
index.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
---
title: "22 - Factors"
author:
- name: Leonardo Collado Torres
url: http://lcolladotor.github.io/
affiliations:
- id: libd
name: Lieber Institute for Brain Development
url: https://libd.org/
- id: jhsph
name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics
url: https://publichealth.jhu.edu/departments/biostatistics
description: "An introduction to working categorial variables using factors in R"
categories: [module 5, week 7, tidyverse, factors, categorial variables]
---
*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the `r paste0("[GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/", basename(dirname(getwd())), "/", basename(getwd()), "/index.qmd)")`.*
# Pre-lecture materials
### Read ahead
::: callout-note
## Read ahead
**Before class, you can prepare by reading the following materials:**
1. [Wrangling Categorical Data in R](https://peerj.com/preprints/3163) by Amelia McNamara, Nicholas J Horton
2. <https://swcarpentry.github.io/r-novice-inflammation/12-supp-factors>
3. <https://forcats.tidyverse.org>
:::
### Acknowledgements
Material for this lecture was borrowed and adopted from
- [Wrangling Categorical Data in R](https://peerj.com/preprints/3163) by Amelia McNamara, Nicholas J Horton
- <https://r4ds.had.co.nz/factors>
# Learning objectives
::: callout-note
# Learning objectives
**At the end of this lesson you will:**
- How to create factors and some challenges working with them in base R
- An introduction to the `forcats` package in the `tidyverse` to work with **cat**egorical variables in R
:::
# Introduction
**Factors** are used for working with **categorical variables**, or variables that have a fixed and known set of possible values (income bracket, U.S. state, political affiliation).
Factors are **useful when**:
- You want to **include categorical variables in regression models**
- You want to **plot categorical data** (e.g. want to map categorical variables to aesthetic attributes)
- You want to **display character vectors in a non-alphabetical order**
::: callout-tip
### Example
Imagine that you have a variable that records month:
```{r}
x <- c("Dec", "Apr", "Jan", "Mar")
```
Using a string to record this variable has two problems:
1. There are only twelve possible months, and there's nothing saving you from typos:
```{r}
x_typo <- c("Dec", "Apr", "Jam", "Mar")
```
2. It doesn't sort in a useful way:
```{r}
sort(x)
```
:::
## Factor basics
You can fix both of these problems with a **factor**.
To create a factor you must start by creating a list of the valid **levels**:
```{r}
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
```
Now we can create a factor with the `factor()` function defining the `levels` argument:
```{r}
y <- factor(x, levels = month_levels)
y
```
We can see what happens if we try to **sort the factor**:
```{r}
sort(y)
```
We can also check the **attributes of the factor**:
```{r}
attributes(y)
```
If you want to access the set of levels directly, you can do so with `levels()`:
```{r}
levels(y)
```
::: callout-tip
### Note
Any values not in the level will be silently converted to NA:
```{r}
y_typo <- factor(x_typo, levels = month_levels)
y_typo
```
:::
## Challenges working with categorical data
Working with categorical data can really helpful in many situations, but it also be challenging.
For example,
1. What if the **original data source** for where the categorical data is getting ingested **changes**?
- If a domain expert is providing spreadsheet data at regular intervals, code that worked on the initial data may not generate an error message, but could silently produce incorrect results.
2. What if a **new level** of a categorical data is added in an updated dataset?
3. When categorical data are coded with numerical values, it can be easy to **break the relationship between category numbers and category labels** without realizing it, thus losing the information encoded in a variable.
- Let's consider an example of this below.
::: callout-tip
### Example
Consider a set of decades,
```{r}
#| message: false
library(tidyverse)
x1_original <- c(10, 10, 10, 50, 60, 20, 20, 40)
x1_factor <- factor(x1_original)
attributes(x1_factor)
tibble(x1_original, x1_factor) %>%
mutate(x1_numeric = as.numeric(x1_factor))
```
Instead of creating a new variable with a numeric version of the value of the factor variable `x1_factor`, the **variable loses the original numerical categories** and **creates a factor number** (i.e., 10 is mapped to 1, 20 is mapped to 2, and 40 is mapped to 3, etc).
:::
This **result is unexpected** because `base::as.numeric()` is intended to recover numeric information by coercing a character variable.
::: callout-tip
### Example
Compare the following:
```{r}
#| warning: true
as.numeric(c("hello"))
as.numeric(factor(c("hello")))
```
In the first example, R does not how to convert the character string to a numeric, so it returns a `NA`.
In the second example, it creates factor numbers and orders them according to an alphabetical order. Here is another example of this behavior:
```{r}
#| warning: true
as.numeric(factor(c("hello", "goodbye")))
```
:::
This behavior of the `factor()` function feels unexpected at best.
Another example of **unexpected behavior** is how the function will **silently make a missing value** because the values in the data and the levels do not match.
```{r}
factor("a", levels = "c")
```
The **unfortunate behavior of factors in R** has led to an online movement against the default behavior of many data import functions to make factors out of any variable composed as strings.
The tidyverse is part of this movement, with functions from the `readr` package defaulting to leaving strings as-is. Others used to chose to add `options(stringAsFactors = FALSE)` into their start up commands to override R's default of `stringsAsFactors = TRUE` in functions such as `read.table()`. However, that is no longer needed in recent versions of R as the default has become `stringsAsFactors = FALSE` as documented on the official R blog: <https://blog.r-project.org/2020/02/16/stringsasfactors/>.
## Factors when modeling data
So if factors are so troublesome, what's the point of them in the first place?
Factors are **still necessary for some data analytic tasks**. The most salient case is in **statistical modeling**.
When you pass a factor variable into `lm()` or `glm()`, R automatically creates indicator (or more colloquially 'dummy') variables for each of the levels and picks one as a reference group.
For simple cases, this behavior can also be **achieved with a character vector**.
However, to choose **which level to use as a reference level** or to order classes, factors must be used.
::: callout-tip
### Example
Consider a vector of character strings with three income levels:
```{r}
income_level <- c(
rep("low", 10),
rep("medium", 10),
rep("high", 10)
)
income_level
```
Here, it **might make sense to use the lowest income level (low) as the reference** class so that all the other coefficients can be interpreted in comparison to it.
However, R would use **high as the reference** by default because 'h' comes before 'l' in the alphabet.
```{r}
x <- factor(income_level)
x
y <- rnorm(30) # generate some random obs from a normal dist
lm(y ~ x)
```
:::
## Memory req for factors and character strings
Consider a large character string such as `income_level` corresponding to a categorical variable.
```{r}
income_level <- c(
rep("low", 10000),
rep("medium", 10000),
rep("high", 10000)
)
```
In early versions of R, storing categorical data as a factor variable was considerably more efficient than storing the same data as strings, because factor variables only store the factor labels once.
However, R now uses a global string pool, so each unique string is only stored once, which means storage is now less of an issue.
```{r}
format(object.size(income_level), units = "Kb") # size of the character string
format(object.size(factor(income_level)), units = "Kb") # size of the factor
```
## Summary
Factors can be really useful in many data analytic tasks, but the base R functions to work with factors can lead to some unexpected behavior that can catch new R users.
Let's introduce a package to make wrangling factors easier.
# `forcats`
Next, we will introduce the `forcats` package, which is part of the core `tidyverse`, but can also be loaded directly
```{r}
library("forcats")
```
It provides tools for dealing with **cat**egorical variables (and it's an anagram of factors!) using a wide range of helpers for working with factors.
## General Social Survey
For the rest of this lecture, we are going to use the `gss_cat` dataset that is installed when you load `forcats`.
It's a sample of data from the [General Social Survey](https://gss.norc.org), a long-running US survey conducted by the independent research organization NORC at the University of Chicago.
The **survey has thousands of questions**, so in `gss_cat`.
I have selected a handful that will illustrate some common challenges you will encounter when working with factors.
```{r}
gss_cat
```
::: callout-tip
### Pro-tip
Since this dataset is provided by a package, you can get more information about the variables with `?gss_cat`.
:::
When factors are stored in a `tibble`, you cannot see their levels so easily. One way to view them is with `count()`:
```{r}
gss_cat %>%
count(race)
```
Or with a bar chart using the `geom_bar()` geom:
```{r}
#| fig-alt: >
#| A bar chart showing the distribution of race. There are ~2000
#| records with race "Other", 3000 with race "Black", and other
#| 15,000 with race "White".
gss_cat %>%
ggplot(aes(x = race)) +
geom_bar()
```
::: callout-tip
### Important
When **working with factors**, the **two most common operations** are
1. Changing the **order** of the levels
2. Changing the **values** of the levels
:::
Those operations are described in the sections below.
## Modifying factor order
It's often useful to **change the order of the factor levels** in a visualization.
Let's explore the `relig` (religion) factor:
```{r}
gss_cat %>%
count(relig)
```
We see there are 15 categories in the `gss_cat` dataset.
```{r}
attributes(gss_cat$relig)
```
The first level is "No answer" followed by "Don't know", and so on.
Imagine you want to explore the average number of hours spent watching TV (`tvhours`) per day across religions (`relig`):
```{r}
#| fig-alt: >
#| A scatterplot of with tvhours on the x-axis and religion on the y-axis.
#| The y-axis is ordered seemingly aribtrarily making it hard to get
#| any sense of overall pattern.
relig_summary <- gss_cat %>%
group_by(relig) %>%
summarise(
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
relig_summary %>%
ggplot(aes(x = tvhours, y = relig)) +
geom_point()
```
The y-axis lists the levels of the `relig` factor in the order of the levels.
However, it is **hard to read this plot** because **there's no overall pattern**.
### `fct_reorder`
We can improve it by **reordering the levels** of `relig` using `fct_reorder()`. `fct_reorder(.f, .x, .fun)` takes three arguments:
- `.f`, the factor whose levels you want to modify.
- `.x`, a numeric vector that you want to use to reorder the levels.
- Optionally, `.fun`, a function that's used if there are multiple values of `x` for each value of `f`. The default value is `median`.
```{r}
#| fig-alt: >
#| The same scatterplot as above, but now the religion is displayed in
#| increasing order of tvhours. "Other eastern" has the fewest tvhours
#| under 2, and "Don't know" has the highest (over 5).
relig_summary %>%
ggplot(aes(
x = tvhours,
y = fct_reorder(.f = relig, .x = tvhours)
)) +
geom_point()
```
**Reordering** religion makes it **much easier to see** that people in the "Don't know" category watch much more TV, and Hinduism & Other Eastern religions watch much less.
As you start making more complicated transformations, I recommend moving them out of `aes()` and into a separate `mutate()` step.
::: callout-tip
### Example
You could rewrite the plot above as:
```{r}
relig_summary %>%
mutate(relig = fct_reorder(relig, tvhours)) %>%
ggplot(aes(x = tvhours, y = relig)) +
geom_point()
```
:::
::: callout-tip
### Another example
What if we create a similar plot looking at how average age varies across reported income level?
```{r}
#| fig-alt: >
#| A scatterplot with age on the x-axis and income on the y-axis. Income
#| has been reordered in order of average age which doesn't make much
#| sense. One section of the y-axis goes from $6000-6999, then <$1000,
#| then $8000-9999.
rincome_summary <-
gss_cat %>%
group_by(rincome) %>%
summarise(
age = mean(age, na.rm = TRUE),
n = n()
)
## Original rincome order
rincome_summary %>%
ggplot(aes(x = age, y = rincome)) +
geom_point()
## rincome re-ordered by age's values
rincome_summary %>%
ggplot(aes(x = age, y = fct_reorder(.f = rincome, .x = age))) +
geom_point()
```
Here, arbitrarily reordering the levels isn't a good idea! That's because `rincome` already has a principled order that we shouldn't mess with.
:::
::: callout-tip
### Pro-tip
Reserve `fct_reorder()` for factors whose levels are arbitrarily ordered.
:::
::: callout-note
### Question
Let's practice `fct_reorder()`. Using the `palmerpenguins` dataset,
1. Calculate the average `bill_length_mm` for each species
2. Create a scatter plot showing the average for each species.\
3. Go back and reorder the factor `species` based on the average bill length from largest to smallest.
4. Now order it from smallest to largest
```{r}
library(palmerpenguins)
penguins
## Try it out
```
:::
### `fct_relevel`
However, it does make sense to pull "Not applicable" to the front with the other special levels.
You can use `fct_relevel()`.
It takes a factor, `f`, and then any number of levels that you want to move to the front of the line.
```{r}
#| fig-alt: >
#| The same scatterplot but now "Not Applicable" is displayed at the
#| bottom of the y-axis. Generally there is a positive association
#| between income and age, and the income band with the highest average
#| age is "Not applicable".
rincome_summary %>%
ggplot(aes(age, fct_relevel(rincome, "Not applicable"))) +
geom_point()
```
::: callout-tip
### Note
Any levels not mentioned in `fct_relevel` will be left in their existing order.
:::
Another type of reordering is useful when you are coloring the lines on a plot. `fct_reorder2(f, x, y)` reorders the factor `f` by the `y` values associated with the largest `x` values.
This makes the plot easier to read because the colors of the line at the far right of the plot will line up with the legend.
```{r}
#| layout-ncol: 2
#| fig-width: 4
#| fig-height: 2
#| fig-alt: >
#| - A line plot with age on the x-axis and proportion on the y-axis.
#| There is one line for each category of marital status: no answer,
#| never married, separated, divorced, widowed, and married. It is
#| a little hard to read the plot because the order of the legend is
#| unrelated to the lines on the plot.
#| - Rearranging the legend makes the plot easier to read because the
#| legend colours now match the order of the lines on the far right
#| of the plot. You can see some unsuprising patterns: the proportion
#| never marred decreases with age, married forms an upside down U
#| shape, and widowed starts off low but increases steeply after age
#| 60.
by_age <-
gss_cat %>%
filter(!is.na(age)) %>%
count(age, marital) %>%
group_by(age) %>%
mutate(prop = n / sum(n))
by_age %>%
ggplot(aes(age, prop, colour = marital)) +
geom_line(na.rm = TRUE)
by_age %>%
ggplot(aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
geom_line() +
labs(colour = "marital")
```
### `fct_infreq`
Finally, for bar plots, you can use `fct_infreq()` to order levels in decreasing frequency: this is the simplest type of reordering because it doesn't need any extra variables. Combine it with `fct_rev()` if you want them in increasing frequency so that in the bar plot largest values are on the right, not the left.
```{r}
#| fig-alt: >
#| A bar char of marital status ordered in from least to most common:
#| no answer (~0), separated (~1,000), widowed (~2,000), divorced
#| (~3,000), never married (~5,000), married (~10,000).
gss_cat %>%
mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
ggplot(aes(marital)) +
geom_bar()
```
## Modifying factor levels
More powerful than changing the orders of the levels is changing their values. This allows you to clarify labels for publication, and collapse levels for high-level displays.
### `fct_recode`
The most general and powerful tool is `fct_recode()`. It allows you to recode, or change, the value of each level. For example, take the `gss_cat$partyid`:
```{r}
gss_cat %>%
count(partyid)
```
The **levels are terse and inconsistent**.
Let's tweak them to be longer and use a parallel construction.
Like most rename and recoding functions in the tidyverse:
- the **new values go on the left**
- the **old values go on the right**
```{r}
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat"
)) %>%
count(partyid)
```
::: callout-tip
### Note
`fct_recode()` will leave the levels that aren't explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn't exist.
:::
To combine groups, you can assign multiple old levels to the same new level:
```{r}
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat",
"Other" = "No answer",
"Other" = "Don't know",
"Other" = "Other party"
)) %>%
count(partyid)
```
Use this technique with care: if you group together categories that are truly different you will end up with misleading results.
### `fct_collapse`
If you want to collapse a lot of levels, `fct_collapse()` is a useful variant of `fct_recode()`.
For **each new variable**, you can **provide a vector of old levels**:
```{r}
gss_cat %>%
mutate(partyid = fct_collapse(partyid,
"other" = c("No answer", "Don't know", "Other party"),
"rep" = c("Strong republican", "Not str republican"),
"ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
"dem" = c("Not str democrat", "Strong democrat")
)) %>%
count(partyid)
```
### `fct_lump_*`
Sometimes you **just want to lump together the small groups** to make a plot or table simpler.
That's the **job of the `fct_lump_*()` family of functions**.
`fct_lump_lowfreq()` is a simple starting point that progressively lumps the smallest groups categories into "Other", always keeping "Other" as the smallest category.
```{r}
gss_cat %>%
mutate(relig = fct_lump_lowfreq(relig)) %>%
count(relig)
```
In this case it's not very helpful: it is true that the majority of Americans in this survey are Protestant, but we'd probably like to see some more details!
Instead, we can use the `fct_lump_n()` to **specify that we want exactly 10 groups**:
```{r}
gss_cat %>%
mutate(relig = fct_lump_n(relig, n = 10)) %>%
count(relig, sort = TRUE) %>%
print(n = Inf)
```
Read the documentation to learn about `fct_lump_min()` and `fct_lump_prop()` which are useful in other cases.
## Ordered factors
There's a **special type of factor** that needs to be mentioned briefly: ordered factors.
**Ordered factors**, created with `ordered()`, imply a strict ordering and equal distance between levels:
The **first level** is "less than" the **second level** by the same amount that the second level is "less than" the **third level**, and so on...
You can recognize them when printing because they use `<` between the factor levels:
```{r}
ordered(c("a", "b", "c"))
```
However, in practice, `ordered()` factors **behave very similarly to regular factors**.
# Post-lecture materials
### Final Questions
Here are some post-lecture questions to help you think about the material discussed.
::: callout-note
### Questions
1. Explore the distribution of `rincome` (reported income). What makes the default bar chart hard to understand? How could you improve the plot?
2. What is the most common `relig` in this survey? What's the most common `partyid`?
3. Which `relig` does `denom` (denomination) apply to? How can you find out with a table? How can you find out with a visualization?
4. There are some suspiciously high numbers in `tvhours`. Is the mean a good summary?
5. For each factor in `gss_cat` identify whether the order of the levels is arbitrary or principled.
6. Why did moving "Not applicable" to the front of the levels move it to the bottom of the plot?
7. How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
8. How could you collapse `rincome` into a small set of categories?
9. Notice there are 9 groups (excluding other) in the `fct_lump` example above. Why not 10? (Hint: type `?fct_lump`, and find the default for the argument `other_level` is "Other".)
:::
### Additional Resources
::: callout-tip
- <https://r4ds.had.co.nz/factors>
- [Wrangling Categorical Data in R](https://peerj.com/preprints/3163) by Amelia McNamara, Nicholas J Horton
- <https://swcarpentry.github.io/r-novice-inflammation/12-supp-factors>
- <https://forcats.tidyverse.org>
:::
# R session information
```{r}
options(width = 120)
sessioninfo::session_info()
```