# Two continuous variables
In this chapter, we will look at techniques that explore the relationships between two continuous variables.
## Scatterplot
### Basics and implications
For the following example, we use data set `SpeedSki`.
```{r,fig.width=4.8, fig.height=3.6}
library(GDAdata)
library(ggplot2)
ggplot(SpeedSki, aes(Year, Speed)) +
geom_point() +
labs(x = "Birth year", y = "Speed achieved (km/hr)") +
ggtitle("Skiers by birth year and speed achieved")
```
In our example, we simply use `geom_point()` on the variables `Year` and `Speed` to create the scatterplot. We want to see whether there is a relationship between a skier's birth year and the speed he or she achieves. From the graph, no such relationship is apparent. Overall, scatterplots are very useful for understanding the correlation (or lack thereof) between variables: they give a good idea of whether a relationship exists and whether it is positive or negative. However, don’t mistake correlation in a scatterplot for causation!
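To back up the visual impression with a number, one option is to compute a correlation coefficient. The following is a minimal sketch, not part of the original example: it assumes the Pearson coefficient is an acceptable summary (it only measures linear association), and the name `year_speed_cor` is chosen here for illustration.

```{r}
library(GDAdata)

# Pearson correlation between birth year and speed achieved.
# use = "complete.obs" drops any rows with missing values first.
year_speed_cor <- cor(SpeedSki$Year, SpeedSki$Speed, use = "complete.obs")
year_speed_cor
```

A value close to 0 would be consistent with the lack of a visible linear trend in the scatterplot.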
### Overplotting
In some situations a scatterplot suffers from overplotting because so many points overlap. Consider the following example from class. To save time, we randomly sample 20% of the data in advance.
```{r,fig.width=4.8, fig.height=3.6}
library(dplyr)
library(ggplot2movies)
sample <- slice_sample(movies, prop = 0.2)
ggplot(sample,aes(x=votes,y=rating)) +
geom_point() +
ggtitle("Votes vs. rating") +
theme_classic()
```
To create better visuals, we can use:
* Alpha blending - `alpha=...`
* Open circles - `pch=21`
* Smaller circles - `size=...` or `shape="."`
```{r,fig.height=8}
library(gridExtra)
f1 <- ggplot(sample,aes(x=votes,y=rating)) +
geom_point(alpha=0.3) +
theme_classic() +
ggtitle("Alpha blending")
f2 <- ggplot(sample,aes(x=votes,y=rating)) +
geom_point(pch = 21) +
theme_classic() +
ggtitle("Open circle")
f3 <- ggplot(sample,aes(x=votes,y=rating)) +
geom_point(size=0.5) +
theme_classic() +
ggtitle("Smaller circle")
grid.arrange(f1, f2, f3,nrow = 3)
```
Other methods that directly deal with the data:
* Randomly sample data - as shown in the overplotting example above using `slice_sample()`
* Subset - split data into bins using `ntile(votes, 10)`
* Remove outliers
* Transform to log scale
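Two of the data-based methods above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the example from class: the seed, the object names (`movies_sample`, `movies_binned`, `p_log`, `p_bins`), and the choice to facet by bin are all made up here.

```{r,fig.height=8}
library(dplyr)
library(ggplot2)
library(ggplot2movies)

set.seed(1)  # arbitrary seed so the random sample is reproducible
movies_sample <- slice_sample(movies, prop = 0.2)

# Transform to log scale: spread out the heavily right-skewed vote counts
p_log <- ggplot(movies_sample, aes(x = votes, y = rating)) +
  geom_point(alpha = 0.3, size = 0.5) +
  scale_x_log10() +
  theme_classic() +
  ggtitle("Votes (log scale) vs. rating")

# Subset: split the data into 10 roughly equal-sized bins by number of votes
movies_binned <- movies_sample |>
  mutate(votes_bin = ntile(votes, 10))

p_bins <- ggplot(movies_binned, aes(x = votes, y = rating)) +
  geom_point(alpha = 0.3, size = 0.5) +
  facet_wrap(~votes_bin, scales = "free_x") +
  theme_classic() +
  ggtitle("Votes vs. rating, by decile of votes")

p_log
p_bins
```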
### Interactive scatterplot
You can create an interactive scatterplot using `plotly`. In the following example, we take a 1% sample of the movie data set and keep only movies released after 2000 to present a clearer visual. We plot votes vs. rating and color the points by release year. In this graph:
* You can hover over a point to see the title of the movie
* You can double click on a year in the legend to isolate that year
* You can zoom into a certain part of the graph to better understand the data points
```{r,fig.width=6, fig.height=4.5}
library(plotly)
sample2 <- slice_sample(movies,prop=0.01) |>
filter(year > 2000)
plot_ly(sample2, x = ~votes, y = ~rating,
color = ~as.factor(year), text= ~title,
hoverinfo = 'text')
```
### Modifications
#### Contour lines
Contour lines give a sense of the density of the data at a glance.
For these contour maps, we will use the `SpeedSki` dataset.
Contour lines can be added to the plot using `geom_density_2d()`. They work best when combined with other layers.
```{r,fig.width=4.8, fig.height=3.6}
ggplot(SpeedSki, aes(Year, Speed)) +
geom_density_2d(bins=5) +
geom_point() +
ggtitle("Scatter plot with contour line")
```
You can use `bins` to control the number of contour bins.
#### Scatterplot matrices
If you want to compare multiple parameters to each other, consider using a scatterplot matrix. This will allow you to show many comparisons in a compact and efficient manner.
For these scatterplot matrices, we use the `movies` dataset from the `ggplot2movies` package.
By default, the base R `plot()` function creates a scatterplot matrix when given a data frame with multiple variables:
```{r}
sample3 <- slice_sample(movies,prop=0.01) #sample data
splomvar <- sample3 |>
dplyr::select(length, budget, votes, rating, year)
plot(splomvar)
```
While this is quite useful for personal exploration of a dataset, it is **not** recommended for presentation purposes. Something called the [Hermann grid illusion](https://en.wikipedia.org/wiki/Grid_illusion){target="_blank"} makes this plot very difficult to examine.
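One possible alternative for presentation is the `splom()` function from the `lattice` package, which draws the matrix with a different default layout. The sketch below is a suggestion, not something taken from the text above: the object names (`splom_data`, `splom_plot`) and the point settings are assumptions, and whether the result reads better is a judgment call.

```{r}
library(dplyr)
library(ggplot2movies)
library(lattice)

# Same kind of 1% sample and variable selection as in the base R example
splom_data <- slice_sample(movies, prop = 0.01) |>
  dplyr::select(length, budget, votes, rating, year)

# Small, solid points keep the individual panels readable
splom_plot <- splom(splom_data, pch = 16, cex = 0.5)
splom_plot
```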
## Heatmaps
### Basics and implications
In the following example, we still use the `SpeedSki` data set.
```{r,fig.width=4.8, fig.height=3.6}
ggplot(SpeedSki, aes(Year, Speed)) +
geom_bin2d()
```
To create a heatmap, simply substitute `geom_point()` with `geom_bin2d()`. Generally, heatmaps are like a combination of scatterplots and histograms: they allow you to compare different parameters while also seeing their relative distributions.
### Modifications
For the following section, we introduce some variations on heatmaps.
#### Change number of bins / binwidth
By default, `geom_bin2d()` uses 30 bins in each direction. Similar to a histogram, we can change the number of bins or the binwidth.
```{r,fig.width=4, fig.height=3}
ggplot(SpeedSki, aes(Year, Speed)) +
geom_bin2d(binwidth = c(5,5)) +
ggtitle("Changing binwidth")
```
Notice that we specify the binwidth for both the x and y axes.
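The number of bins can be set instead of the binwidth. A minimal sketch, assuming 10 bins per axis (an arbitrary value chosen for illustration; `p_bins10` is a name introduced here):

```{r,fig.width=4, fig.height=3}
library(GDAdata)
library(ggplot2)

# bins sets the number of bins along each axis instead of their width
p_bins10 <- ggplot(SpeedSki, aes(Year, Speed)) +
  geom_bin2d(bins = 10) +
  ggtitle("Changing number of bins")
p_bins10
```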
#### Combine with a scatterplot
```{r,fig.width=4, fig.height=3}
ggplot(SpeedSki, aes(Year, Speed)) +
geom_bin2d(binwidth = c(10, 10), alpha = .4) +
geom_point(size = 2) +
ggtitle("Combined with scatterplot")
```
#### Change color scale
You can change the continuous color scale:
```{r,fig.width=4, fig.height=3}
ggplot(SpeedSki, aes(Year, Speed)) +
geom_bin2d() +
ggtitle("Changing color scale") +
scale_fill_viridis_c()
```
#### Hex heatmap
One alternative is a hex heatmap, which you can create using `geom_hex()`:
```{r,fig.width=4, fig.height=3}
ggplot(SpeedSki, aes(Year, Speed)) +
geom_hex(binwidth = c(10,10)) +
ggtitle("Hex heatmap")
```
#### Alternative approach to color
If you look at all the previous examples, you might notice that lighter bins correspond to higher counts, which is somewhat counter-intuitive. The following example suggests an alternative approach to the color scale, in which denser bins get the more saturated color.
```{r,fig.width=6, fig.height=4}
ggplot(SpeedSki, aes(Year, Speed)) +
geom_hex(bins=12) +
scale_fill_gradient(low = "grey", high = "purple") +
theme_classic(18) +
ggtitle("Alternative approach to color")
```
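Another way to get a similar effect, offered here as a sketch rather than as the chapter's recommendation, is to keep the viridis palette and reverse its direction so that bins with more points are drawn darker. The name `p_reversed` and the choice of 12 bins are assumptions carried over for illustration.

```{r,fig.width=6, fig.height=4}
library(GDAdata)
library(ggplot2)

# direction = -1 reverses the viridis palette: high counts become dark
p_reversed <- ggplot(SpeedSki, aes(Year, Speed)) +
  geom_bin2d(bins = 12) +
  scale_fill_viridis_c(direction = -1) +
  theme_classic(18) +
  ggtitle("Reversed viridis scale")
p_reversed
```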