---
title: A Vocabulary of Marks
layout: post
output:
  md_document:
    preserve_yaml: true
---
_Encodings available in ggplot2._
[Recording](https://mediaspace.wisc.edu/media/Week+1+-+3A+A+Vocabulary+of+Marks/1_ke20cnja), [Code](https://github.com/krisrs1128/stat679_code/blob/main/notes/week1-3.Rmd)
```{r, echo = FALSE}
library(knitr)
opts_knit$set(base_dir = "/", base.url = "/")
opts_chunk$set(
  warning = FALSE,
  message = FALSE,
  fig.path = "stat679_notes/assets/week1-3/"
)
```
```{r}
library(tidyverse)
library(scales)
theme_set(theme_minimal())
```
1. The choice of encodings influences (1) the types of comparisons that a visualization suggests and (2) the accuracy of the conclusions that readers leave with. With this in mind, it’s in our best interest to build a rich vocabulary of potential visual encodings. The more kinds of marks and encodings that are at your fingertips, the better your chances are that you’ll arrive at a configuration that helps you achieve your purpose.
2. **Point marks** can encode data fields using their x and y positions, color,
size, and shape. Below, each mark is a country, and we’re using shape and the y
position to distinguish between country clusters.
{% highlight R %}
```{r, eval = FALSE}
gapminder <- read_csv("https://uwmadison.box.com/shared/static/dyz0qohqvgake2ghm4ngupbltkzpqb7t.csv", col_types = cols()) %>%
  mutate(cluster = as.factor(cluster)) # specify that cluster is nominal
gap2000 <- gapminder %>%
  filter(year == 2000) # keep only year 2000
ggplot(gap2000) +
  geom_point(aes(x = fertility, y = cluster, shape = cluster))
```
{% endhighlight %}
```{r, echo = FALSE}
gapminder <- read_csv("https://uwmadison.box.com/shared/static/dyz0qohqvgake2ghm4ngupbltkzpqb7t.csv", col_types = cols()) %>%
  mutate(cluster = as.factor(cluster)) # specify that cluster is nominal
gap2000 <- gapminder %>%
  filter(year == 2000) # keep only year 2000
ggplot(gap2000) +
  geom_point(aes(x = fertility, y = cluster, shape = cluster))
```
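The plot above uses only two of the channels a point mark offers. As a minimal sketch of using several at once, the chunk below maps position, color, shape, and size in a single layer. The data frame `toy` and its columns are made up for illustration and are not part of the gapminder data.

```{r}
library(ggplot2)

# hypothetical toy data, invented only for this illustration
toy <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(2, 4, 3, 5),
  group = factor(c("a", "a", "b", "b")),
  weight = c(10, 20, 30, 40)
)

# x / y position, color, shape, and size each encode a field
p_points <- ggplot(toy) +
  geom_point(aes(x = x, y = y, color = group, shape = group, size = weight))
p_points
```

Encoding `group` with both color and shape is redundant on purpose: the two channels reinforce each other, which can help when the figure is printed in grayscale.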
3. **Bar marks** let us associate a continuous field with a nominal one.
```{r}
ggplot(gap2000) +
  geom_col(aes(country, pop))
```
4. This plot can be improved. The grid lines and tick marks associated with each bar are distracting, and the axis labels all run over one another. We resolve this by changing the theme and turning the bars on their side.^[An alternative is to rotate the labels by 90 degrees. I prefer to turn the whole plot because this way readers don’t have to tilt their heads to read the country names.]
```{r, fig.height = 8.5}
ggplot(gap2000) +
  geom_col(aes(pop, country)) +
  theme(
    panel.grid.major.y = element_blank(), # remove horizontal grid lines
    axis.ticks = element_blank() # remove tick marks
  )
```
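For comparison, here is a rough sketch of the alternative mentioned in the footnote, which keeps the bars vertical and rotates the x-axis labels instead. The data frame `toy_pop` and its country names are invented for illustration.

```{r}
library(ggplot2)

# hypothetical toy data, invented only for this illustration
toy_pop <- data.frame(
  country = c("Aland", "Borduria", "Carpania"),
  pop = c(5e6, 2e7, 8e6)
)

# keep vertical bars, but rotate the country labels by 90 degrees
p_rotated <- ggplot(toy_pop) +
  geom_col(aes(country, pop)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
p_rotated
```

With only three bars the rotation is unnecessary, but it shows the mechanics: `angle` turns the text, while `hjust` and `vjust` control how the rotated labels line up with their ticks.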
5. To make comparisons between countries with similar populations easier, we can
order them by population (alphabetical ordering is not that meaningful). To
compare clusters, we can color in the bars.
```{r, fig.height = 8.5}
ggplot(gap2000) +
  geom_col(aes(pop, reorder(country, pop), fill = cluster)) + # geom_col() is geom_bar(stat = "identity")
  theme(
    axis.ticks = element_blank(),
    panel.grid.major.y = element_blank()
  )
```
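The ordering comes from `reorder()`, which converts its first argument into a factor whose levels are sorted by the second. A small sketch on made-up vectors (`toy_countries` and `toy_sizes` are hypothetical names) makes this visible without any plotting:

```{r}
# hypothetical toy vectors, invented only for this illustration
toy_countries <- c("A", "B", "C")
toy_sizes <- c(300, 100, 200)

# levels are sorted by increasing size
ordered_countries <- reorder(toy_countries, toy_sizes)
levels(ordered_countries)

# negate the second argument to sort in decreasing order instead
levels(reorder(toy_countries, -toy_sizes))
```

ggplot2 draws the first level of a discrete y-axis at the bottom, so sorting by increasing population puts the largest country at the top of the plot.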
6. We’ve been spending a lot of time on this plot. This is because I want to emphasize that a visualization is not something we get just by memorizing some magic (programming) incantation. Instead, it is something worth critically engaging with and refining, much as we would refine an essay or speech. Philosophy aside, there are still a few points in this figure that need to be improved:
* The axis titles are not meaningful.
* There is a strange gap between the left-hand edge of the plot and the start of the bars.
* I would also prefer if the bars were exactly touching one another, without the small vertical gap.
* The scientific notation for population size is unnecessarily technical.
* The color scheme is a bit boring...
7. I’ve addressed each issue in the block below. Can you tell which piece of
code makes which change? Try removing different components to verify your
guesses.
```{r, fig.height = 8.5}
cols <- c("#80BFA2", "#7EB6D9", "#3E428C", "#D98BB6", "#BF2E21", "#F23A29")
ggplot(gap2000) +
  geom_col(
    aes(pop, reorder(country, pop), fill = cluster),
    width = 1
  ) +
  scale_x_continuous(labels = label_number(scale_cut = cut_short_scale()), expand = c(0, 0, 0.1, 0.1)) +
  scale_fill_manual(values = cols) +
  labs(x = "Population", y = "Country", fill = "Country Group", color = "Country Group") +
  theme(
    axis.ticks = element_blank(),
    panel.grid.major.y = element_blank()
  )
```
8. **Segment marks**. In the plot above, each bar is anchored at 0. Instead, we could have each bar encode two continuous values, a left and a right endpoint. To illustrate, let’s compare the minimum and maximum life expectancies within each country cluster. We’ll need to create a new data.frame with just the summary information. For this, we `group_by` each cluster, so that a `summarise` call finds the minimum and maximum life expectancies restricted to each cluster.
```{r}
# find summary statistics
life_ranges <- gap2000 %>%
  group_by(cluster) %>%
  summarise(
    min_life = min(life_expect),
    max_life = max(life_expect)
  )
ggplot(life_ranges) +
  geom_segment(
    aes(min_life, reorder(cluster, max_life), xend = max_life, yend = cluster, col = cluster),
    size = 5
  ) +
  scale_color_manual(values = cols) +
  labs(x = "Minimum and Maximum Life Expectancy", col = "Country Group", y = "Country Group") +
  xlim(0, 85) # otherwise would only range from 42 to 82
```
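If the `group_by` and `summarise` combination is new to you, it may help to run it on a table small enough to check by hand. The tibble `toy_life` below is made up for illustration; only the column names mirror the gapminder data.

```{r}
library(dplyr)

# hypothetical toy data, invented only for this illustration
toy_life <- tibble(
  cluster = c("a", "a", "b", "b", "b"),
  life_expect = c(50, 70, 60, 65, 80)
)

# one output row per cluster, holding that cluster's extremes
toy_ranges <- toy_life %>%
  group_by(cluster) %>%
  summarise(
    min_life = min(life_expect),
    max_life = max(life_expect)
  )
toy_ranges
```

Five input rows collapse to two output rows: cluster "a" spans 50 to 70, and cluster "b" spans 60 to 80. Each of those rows becomes one segment in a plot like the one above.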
9. **Line marks** are useful for comparing changes. Our eyes naturally focus on rates of change when we see lines. Below, we’ll plot the fertility over time, colored in by country cluster. The group argument is useful for ensuring each country gets its own line; if we removed it, ggplot2 would become confused by the fact that the same x (year) values are associated with multiple y’s (fertility rates).
```{r}
ggplot(gapminder) +
  geom_line(
    aes(year, fertility, col = cluster, group = country),
    alpha = 0.7, size = 0.9
  ) +
  scale_x_continuous(expand = c(0, 0)) + # same trick of removing gap
  scale_color_manual(values = cols)
```
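To see what the group argument does, here is a small sketch with two made-up countries in the same cluster. The data frame `toy_ts` is hypothetical. Without `group`, ggplot2 groups by the only discrete variable it sees (`cluster`) and draws a single path through both countries.

```{r}
library(ggplot2)

# hypothetical toy data: two countries, three years each, one cluster
toy_ts <- data.frame(
  year = rep(c(2000, 2005, 2010), times = 2),
  fertility = c(3.0, 2.8, 2.5, 5.0, 4.6, 4.1),
  country = rep(c("A", "B"), each = 3),
  cluster = factor("1")
)

# no group: one zigzag path that jumps between the two countries
p_zigzag <- ggplot(toy_ts) +
  geom_line(aes(year, fertility, col = cluster))
p_zigzag

# group = country: one clean line per country
p_grouped <- ggplot(toy_ts) +
  geom_line(aes(year, fertility, col = cluster, group = country))
p_grouped
```

The first plot shows the sawtooth pattern that is the usual sign of a missing `group` aesthetic.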
10. **Area marks** have a flavor of both bar and line marks. The filled area supports absolute comparisons, while the changes in shape suggest derivatives.
```{r}
population_sums <- gapminder %>%
  group_by(year, cluster) %>%
  summarise(total_pop = sum(pop))
ggplot(population_sums) +
  geom_area(aes(year, total_pop, fill = cluster)) +
  scale_y_continuous(expand = c(0, 0, .1, .1), labels = label_number(scale_cut = cut_short_scale())) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_manual(values = cols)
```
11. Just as with bar marks, we don’t necessarily need to anchor the y-axis at 0. For example, here the bottom and top of each area mark are given by the 30% and 70% quantiles of population within each country cluster.
```{r}
population_ranges <- gapminder %>%
  group_by(year, cluster) %>%
  summarise(min_pop = quantile(pop, 0.3), max_pop = quantile(pop, 0.7))
ggplot(population_ranges) +
  geom_ribbon(
    aes(x = year, ymin = min_pop, ymax = max_pop, fill = cluster),
    alpha = 0.8
  ) +
  scale_y_continuous(expand = c(0, 0, .1, .1), labels = label_number(scale_cut = cut_short_scale())) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_fill_manual(values = cols)
```
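As a quick check on what the ribbon edges represent, here is `quantile` applied to a made-up vector (`toy_values` is a hypothetical name). With R's default interpolation rule, the 30% and 70% quantiles of the integers 1 through 11 fall exactly on the 4th and 8th values.

```{r}
# hypothetical toy vector, invented only for this illustration
toy_values <- 1:11

# 30% of values lie below the first number, 70% below the second
toy_q <- quantile(toy_values, c(0.3, 0.7))
toy_q
```

Applied within each year and cluster, these two numbers become `ymin` and `ymax`, so each ribbon covers the middle 40% of country populations in its cluster.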