# Multidimensional categorical variables
In this chapter, we focus on multivariate categorical data. Note that a multivariate plot is not the same as a multiple variable plot: the former is used for analyses with multiple outcomes.
## Barcharts
Bar charts are used to display the frequencies of multidimensional categorical variables. The next few plots show different kinds of bar charts.
### Stacked bar chart
```{r}
#| fig-width: 4
#| fig-height: 3
library(tidyverse)
cases <- read.csv("data/icecream.csv") |>
  mutate(Age = fct_relevel(Age, "young"))
icecreamcolors <- c("#ff99ff", "#cc9966") # pink, coffee
ggplot(cases, aes(x = Age, fill = Favorite)) +
  geom_bar() +
  scale_fill_manual(values = icecreamcolors)
```
### Grouped bar chart
Use ``position = "dodge"`` to create a grouped bar chart:
```{r,fig.width=4, fig.height=3}
ggplot(cases, aes(x = Age, fill = Favorite)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = icecreamcolors)
```
### Grouped bar chart with facets
```{r,fig.width=4, fig.height=3}
ggplot(cases, aes(x = Age)) +
  geom_bar() +
  facet_wrap(~Favorite)
```
### Grouped barchart with three categorical variables
```{r,fig.width=4, fig.height=3}
counts3 <- cases |>
  group_by(Age, Favorite, Music) |>
  summarize(Freq = n()) |>
  ungroup() |>
  complete(Age, Favorite, Music, fill = list(Freq = 0))
ggplot(counts3, aes(x = Favorite, y = Freq, fill = Music)) +
  geom_col(position = "dodge") +
  facet_wrap(~Age)
```
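The `complete()` step in the chunk above fills in combinations of `Age`, `Favorite`, and `Music` that never occur in the data with a frequency of 0. Without it, `geom_col(position = "dodge")` has no row to draw for a missing combination, and the remaining bar in that group expands to the full width. A minimal sketch on made-up data (the `toy` data frame and its values are hypothetical, not taken from `icecream.csv`):

```{r}
library(dplyr)
library(tidyr)

# Hypothetical toy data: no "old" respondent picked "coffee"
toy <- data.frame(Age = c("young", "young", "old"),
                  Favorite = c("bubble gum", "coffee", "bubble gum"))

toy_counts <- toy |>
  count(Age, Favorite, name = "Freq") |>            # one row per observed combination
  complete(Age, Favorite, fill = list(Freq = 0))    # add unobserved combinations with Freq = 0
toy_counts
```

The result has one row for every `Age` and `Favorite` pair, including the unobserved old/coffee pair with `Freq` equal to 0.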
## Chi square test of independence
In this section, we show how to use the chi-square test to check whether two categorical variables are independent.
We will use the following example: Are older Americans more interested in local news than younger Americans? The data come from [here](https://www.journalism.org/2019/08/14/methodology-local-news-demographics/).
```{r,fig.width=4.8, fig.height=3.6}
local <- data.frame(Age = c("18-29", "30-49", "50-64", "65+"),
                    Freq = c(2851, 9967, 11163, 10911)) |>
  mutate(Followers = round(Freq*c(.15, .28, .38, .42)),
         Nonfollowers = Freq - Followers) |>
  select(-Freq)
knitr::kable(local[,1:2])
```
The hypotheses for the chi-square test are:
Null hypothesis: Age and tendency to follow local news are independent
Alternative hypothesis: Age and tendency to follow local news are NOT independent
```{r,fig.width=4.8, fig.height=3.6}
localmat <- as.matrix(local[,2:3])
rownames(localmat) <- local$Age
X <- chisq.test(localmat, correct = FALSE)
X$observed
X$expected
X
```
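Comparing the observed counts with the expected counts shows large discrepancies, and the p-value is very small, so we reject the null hypothesis: age and tendency to follow local news are not independent. Next, we move on to mosaic plots, which visualize such associations.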
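The expected counts reported by `chisq.test()` can be reproduced by hand. Under independence, each expected cell count is (row total × column total) / grand total, and the test statistic sums (observed − expected)² / expected over all cells. A minimal sketch that rebuilds the same counts so the chunk is self-contained (the names `obs`, `expected`, `stat`, `dof`, and `pval` are illustrative):

```{r}
# Rebuild the observed table from the group sizes and follower shares used above
n <- c(2851, 9967, 11163, 10911)
followers <- round(n * c(.15, .28, .38, .42))
obs <- cbind(Followers = followers, Nonfollowers = n - followers)
rownames(obs) <- c("18-29", "30-49", "50-64", "65+")

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic and p-value with (rows - 1) * (cols - 1) degrees of freedom
stat <- sum((obs - expected)^2 / expected)
dof <- (nrow(obs) - 1) * (ncol(obs) - 1)
pval <- pchisq(stat, df = dof, lower.tail = FALSE)

# These values should agree with chisq.test(obs, correct = FALSE)
c(statistic = stat, df = dof, p.value = pval)
```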
## Mosaic plots
Mosaic plots are used for visualizing data from two or more qualitative variables to show their proportions or associations.
### Mosaic plot with one variable
```{r,fig.width=4.8, fig.height=3.6}
library(grid)
icecream <- read.csv("data/MusicIcecream.csv") |>
  mutate(Age = fct_relevel(Age, "young"))
icecreamcolors <- c("#ff99ff", "#cc9966")
counts2 <- icecream |>
  group_by(Age, Favorite) |>
  summarize(Freq = sum(Freq))
vcd::mosaic(~Age, direction = "v", counts2)
```
### Mosaic plot with two variables
```{r,fig.width=4.8, fig.height=3.6}
vcd::mosaic(Favorite ~ Age, counts2, direction = c("v", "h"),
            highlighting_fill = icecreamcolors)
```
### Mosaic plot with three variables(Best practice)
Here are some best-practice criteria for mosaic plots:
>The dependent variable is split last and split horizontally
>
>Fill is set to the dependent variable
>
>Other variables are split vertically
>
>The most important level of the dependent variable is closest to the x-axis and darkest (or the most noticeable shade)
>
```{r, fig.width=4.8, fig.height=3.6}
vcd::mosaic(Favorite ~ Age + Music, counts3,
            direction = c("v", "v", "h"),
            highlighting_fill = icecreamcolors)
```
### Mosaic pairs plot
Use the ``pairs`` method for class ``table`` to plot a matrix of pairwise mosaic plots:
```{r, fig.width=4.8, fig.height=3.6}
pairs(table(cases[,2:4]), highlighting = 2)
```
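The `pairs` method needs a `table` object. The chunk above builds one from case-form data (one row per respondent) with `table()`. When the data are instead in frequency form, with one row per combination and a count column, `xtabs()` does the same job. A minimal sketch, assuming a made-up frequency-form data frame (`toy` and its counts are hypothetical):

```{r}
library(vcd)

# Hypothetical frequency-form data: one row per combination plus a count
toy <- expand.grid(Age = c("young", "old"),
                   Favorite = c("bubble gum", "coffee"),
                   Music = c("classical", "rock"))
toy$Freq <- c(1, 5, 3, 10, 20, 8, 25, 28)

# xtabs() sums Freq within each cell and returns an object that inherits from "table"
tab <- xtabs(Freq ~ Age + Favorite + Music, data = toy)
class(tab)

# Matrix of pairwise mosaic plots, dispatched to the table method from vcd
pairs(tab)
```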
### Mosaic plots: spine plot
A spine plot is a mosaic plot with straight, parallel cuts in one dimension (“spines”) and only one variable cutting in the other direction.
```{r,fig.width=4.8, fig.height=3.6}
library(vcdExtra)
library(forcats)
foodorder <- Alligator |> group_by(food) |> summarize(Freq = sum(count)) |>
  arrange(Freq) |> pull(food)
ally <- Alligator |>
  rename(Freq = count) |>
  mutate(size = fct_relevel(size, "small"),
         food = factor(food, levels = foodorder),
         food = fct_relevel(food, "other"))
vcd::mosaic(food ~ sex + size, ally,
            direction = c("v", "v", "h"),
            highlighting_fill = RColorBrewer::brewer.pal(5, "Accent"))
```
### Mosaic plot: tree map
A treemap is a filled rectangular plot representing hierarchical data (the fill color does not necessarily represent frequency count).
```{r,fig.width=7.2, fig.height=4.8}
library(treemap)
data(GNI2014)
treemap::treemap(GNI2014,
                 index = c("continent", "iso3"),
                 vSize = "population",
                 vColor = "GNI",
                 type = "value",
                 format.legend = list(scientific = FALSE, big.mark = " "))
```
## Diverging stacked bar chart
This type of chart works well with Likert data, or any ordinal data with categories that span two opposing poles. The code below uses the `likert()` function from the **HH** package.
```{r}
library(HH)
gdata <- read_csv("data/gender.csv")
HH::likert(Group ~ ., gdata, positive.order = TRUE,
           col = likertColorBrewer(3, ReferenceZero = NULL,
                                   BrewerPaletteName = "BrBG"),
           main = "% saying the country __ when \n it comes to giving women equal rights with men",
           xlab = "percent", ylab = "")
```
## Diverging stacked bar chart (with faceting)
Use `|` to condition (facet) on factor levels:
```{r}
gdata$Section <- c("Overall", "Gender", "Gender", "Party", "Party")
gdata <- gdata |> dplyr::select(Section, Group, everything())
# sort facets manually
gdata <- gdata |> mutate(Section = factor(Section,
                                          levels = c("Party", "Gender", "Overall")))
likert(Group ~ . | Section,
       data = gdata,
       scales = list(y = list(relation = "free")), # equivalent to scales = "free_y"
       layout = c(1, 3), # controls position of subplots
       positive.order = TRUE,
       col = likertColorBrewer(3, ReferenceZero = NULL,
                               BrewerPaletteName = "BrBG"),
       main = "% saying the country __ when \n it comes to giving women equal rights with men",
       xlab = "percent",
       ylab = NULL)
```
Reference: R. Heiberger and N. Robbins, [Design of Diverging Stacked Bar Charts for Likert Scales and Other Applications](https://www.jstatsoft.org/article/view/v057i05)
## Alluvial diagrams
Alluvial diagrams are typically used to represent how flows in a network structure change over time or between different levels.
The following plot shows the essential components of alluvial plots used in the naming schemes and documentation (axis, alluvium, stratum, lode):
<center>
{width=75%}
</center>
### ggalluvial
```{r,fig.width=4.8, fig.height=3.6}
library(ggalluvial)
df2 <- data.frame(Class1 = c("Stats", "Math", "Stats", "Math", "Stats", "Math", "Stats", "Math"),
                  Class2 = c("French", "French", "Art", "Art", "French", "French", "Art", "Art"),
                  Class3 = c("Gym", "Gym", "Gym", "Gym", "Lunch", "Lunch", "Lunch", "Lunch"),
                  Freq = c(20, 3, 40, 5, 10, 2, 5, 15))
ggplot(df2, aes(axis1 = Class1, axis2 = Class2, axis3 = Class3, y = Freq)) +
  geom_alluvium(color = "black") +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = paste(after_stat(stratum), "\n", after_stat(count)))) +
  scale_x_discrete(limits = c("Class1", "Class2", "Class3"))
```
You can choose to color the alluvium by different variables, for example, the first variable ``Class1`` here:
```{r,fig.width=4.8, fig.height=3.6}
ggplot(df2, aes(axis1 = Class1, axis2 = Class2, axis3 = Class3, y = Freq)) +
  geom_alluvium(aes(fill = Class1), width = 1/12) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = paste(after_stat(stratum), "\n", after_stat(count)))) +
  scale_x_discrete(limits = c("Class1", "Class2", "Class3"))
```
### geom_flow
Another way of plotting alluvial diagrams is using ``geom_flow`` rather than ``geom_alluvium``:
```{r,fig.width=4.8, fig.height=3.6}
ggplot(df2, aes(axis1 = Class1, axis2 = Class2, axis3 = Class3, y = Freq)) +
  geom_flow(aes(fill = Class1), width = 1/12) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = paste(after_stat(stratum), "\n", after_stat(count)))) +
  scale_x_discrete(limits = c("Class1", "Class2", "Class3"))
```
With ``geom_flow``, all Math students taking Art are grouped together, and the same holds for Stats students. This makes the graph much clearer than with ``geom_alluvium``, since fewer flows cross between the axes.
## Heat map
In addition to the heat maps introduced in ``Chapter 9.2 Heatmaps``, this section demonstrates a special case in which both x and y are categorical. Here the heat map can be seen as a clustered bar chart, and a predefined theme is used to show the density more clearly.
```{r,fig.width=7.2, fig.height=4.8}
library(vcdExtra)
library(dplyr)
theme_heat <- theme_classic() +
  theme(axis.line = element_blank(),
        axis.ticks = element_blank())
orderedclasses <- c("Farm", "LoM", "UpM", "LoNM", "UpNM")
mydata <- Yamaguchi87
mydata$Son <- factor(mydata$Son, levels = orderedclasses)
mydata$Father <- factor(mydata$Father,
                        levels = orderedclasses)
mydata3 <- mydata |> group_by(Country, Father) |>
  mutate(Total = sum(Freq)) |> ungroup()
ggplot(mydata3, aes(x = Father, y = Son)) +
  geom_tile(aes(fill = (Freq/Total)), color = "white") +
  coord_fixed() +
  scale_fill_gradient2(low = "black", mid = "white",
                       high = "red", midpoint = .2) +
  facet_wrap(~Country) + theme_heat
```
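The fill in the chunk above is `Freq/Total`, where `Total` is summed within each `Country` and `Father` group. Each column of tiles therefore shows the distribution of sons' classes conditional on the father's class, and the proportions in a column sum to 1. A minimal sketch of that calculation on made-up counts (the `toy` data frame, its class labels, and the `Prop` column are hypothetical, not the `Yamaguchi87` data):

```{r}
library(dplyr)

# Hypothetical toy counts in frequency form
toy <- data.frame(Father = c("A", "A", "B", "B"),
                  Son    = c("A", "B", "A", "B"),
                  Freq   = c(30, 10, 5, 15))

toy_prop <- toy |>
  group_by(Father) |>
  mutate(Total = sum(Freq),        # number of sons for each father class
         Prop = Freq / Total) |>   # share of sons in each class, given the father's class
  ungroup()
toy_prop
```

Normalizing within the father's class keeps large classes from dominating the color scale, so the tiles compare mobility patterns rather than raw class sizes.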