-
Notifications
You must be signed in to change notification settings - Fork 1
/
lmermultimember_intro.Rmd
338 lines (237 loc) · 18.5 KB
/
lmermultimember_intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
---
title: "Introduction to lmerMultiMember"
output: rmarkdown::html_vignette
author: "Jeroen van Paridon"
date: "2022-10-27"
vignette: >
%\VignetteIndexEntry{Introduction to lmerMultiMember}
\usepackage[utf8]{inputenc}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: inline
---
### Introduction to multiple membership models
This vignette will walk you through an `lmerMultiMember` analysis of a dataset that is included with the package. It contains examples for specifying and fitting an `lmerMultiMember` model and examples for using the package's helper functions, including functions for creating multiple membership indicator matrices, creating Bradley-Terry/adversarial indicator matrices, and creating interaction/nested multiple membership matrices. Basic understanding of the generalized linear mixed-effects model (GLMM) is assumed; if you would like to learn more about GLMMs you can refer to the `lme4` vignettes.
```{r setup, message = FALSE}
# load a few packages that we will be using a lot
library(lmerMultiMember) # for dataset and modeling
library(kableExtra) # for displaying tabular data
library(dplyr) # for data manipulation
library(ggplot2) # for general plotting stuff
# there are a few packages used in this vignette that are not loaded here
# we will just call their functions using the :: operator as needed
```
### Doubles tennis data
In this vignette we'll try to identify the best male doubles tennis player of the 2010s.
This is a tricky question, because when a pair of players plays another pair of players, determining the individual contribution of each player to the outcome is not straightforward. However, we can use a multiple membership model to get a measure of each player's individual contribution to a team's doubles success.
Tennis scores are a somewhat inconvenient outcome to model, because the number of points scored against an opponent is not a good measure of player performance. Men's doubles matches are played in a best-of-three or best-of-five format, meaning that the first team to two or three _sets_ wins. To win a _set_, a team has to win six _games_, and lead by at least two games. To win a _game_, a team has to score four _points_, with at least a two point lead.[^1]
[^1]: This nested system of best-of-N and minimal-advantage rules means that while scoring zero points against an opponent signifies bad performance, scoring a very large number of points does not signify that a team was much better than their opponent; a large number of points scored in a match means that both teams had difficulty gaining the minimal point advantage necessary to win games and sets, a sign that both teams were in fact quite evenly matched. The optimal score, then, occurs when a team is much stronger than their opponent and they only need just enough points to quickly win each game.
Because modeling the score is a somewhat thorny problem, for this tutorial we will oversimplify a little bit and instead model match wins and losses, which we can do with a simple binomial GLMM (Generalized Linear Mixed-effects Model).
We can write this model out as follows:
$$
\operatorname{logit}(P(win)) = X\beta + Zu
$$
Where $X$ is the (dense) matrix of fixed effects predictors and $\beta$ is the vector of fixed effects coefficients. $Z$ is the (sparse) matrix of random effects indicators and $u$ is the vector of random effects (and we're using a logit-link to convert to probabilities).
In a traditional mixed-effects model, the sparse indicator matrix $Z$ for each random effect would have exactly 1 indicator per observation, and look like this:
$$
Z =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
\end{bmatrix}
$$
The indicator matrix for a multiple membership model can have multiple indicators per case:
$$
Z =
\begin{bmatrix}
1 & 1 & 0 & 0 \\
0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 \\
1 & 0 & 0 & 1 \\
\end{bmatrix}
$$
More generally, we can specify arbitrary indicator or weight matrices for the random effects, including fractional weights and negative weights/indicators (a class of model known as Bradley-Terry models):
$$
Z =
\begin{bmatrix}
1 & -1 & 0 & 0 \\
0 & 1 & -1 & 0 \\
0 & 0 & 1 & -1 \\
-1 & 0 & 0 & 1 \\
\end{bmatrix}
$$
In this vignette we'll use all three kinds of indicator matrices to see how best to model our doubles tennis dataset.
We have results for ATP men's doubles tennis matches from 2010 to 2019, compiled by Jeff Sackmann. This dataset is included with `lmerMultiMember`, so you if you have this package installed you can load it yourself and play around with it.
```{r doubles_data}
# load ATP doubles data from package
data("atp_doubles", package = "lmerMultiMember")
# display the first 1000 matches in the ATP doubles dataset
atp_doubles %>%
head(1000) %>%
kable() %>%
kable_styling(bootstrap_options = c("hover", "condensed"), font_size = 16) %>%
scroll_box(height = "250px")
```
---
### Modeling player strength for players on team 1
To figure out which player is strongest, we'll model player effects using multiple membership random intercepts. `lmerMultiMember` provides a helper function that generates an indicator matrix from columns of a dataframe. To illustrate what we mean by that, we'll generate the indicator matrix `Wt` for the players on team 1 in each of the first five matches in our dataset and plot it.
```{r indicator_matrix}
# generate sparse matrix for small bit of the dataset
Wt <- weights_from_columns(atp_doubles[1:5, c("team1_player1_name",
"team1_player2_name")])
# display matrix plot of the indicator matrix
matplot <- plot_membership_matrix(Wt)
update(matplot, xlab = "Team 1 players", ylab = "Match",
at = c(-1.001, .5, 1.001), col.regions = c("white", "coral1"))
```
As you can see, the indicator matrix has 1s (filled squares) for the players in a match, and 0s (empty space) everywhere else. There are two players in team 1 in each match, so for each row/match in the matrix, you'll find two filled squares.
Next, we generate the indicator matrix `Wp` for the players in team 1 across _all_ matches in our dataset and fit a GLMM to predict match winners from that indicator matrix. We explicitly set the global intercept to 0, because there shouldn't be any systematic advantage for team 1 over team 2 when we model the whole dataset.[^2]
[^2]: Which team is team 1 and which is team 2 is randomly assigned and entirely arbitrary in this dataset.
```{r doubles_model}
# generate sparse indicator matrix for whole dataset
Wp <- weights_from_columns(atp_doubles[, c("team1_player1_name",
"team1_player2_name")])
# fit GLMM
m1 <- glmer(team1_win ~ 0 + (1 | team1_player),
family = "binomial", memberships = list(team1_player = Wp),
data = atp_doubles)
# display model summary
summary(m1)
```
The model summary tells us that the model fitted is a multiple membership model, and the line at the bottom shows us that the minimum, mean, and maximum number of team 1 players per match was 2, as expected.
To view the strongest players according to this model, we can extract the random intercepts and plot them. These intercepts will be on a log odds ratio scale, but they're relatively easy to interpret for our purposes: The strongest player will have the largest intercept value.
```{r doubles_plot, dpi = 300}
# extract random effects
m1_ranefs <- broom.mixed::tidy(m1, effects = "ran_vals", conf.int = TRUE) %>%
.[.$group == "team1_player", ] %>%
.[order(.$estimate, decreasing = TRUE), ]
# plot top 10
m1_ranefs[1:10, ] %>%
ggplot(aes(x = estimate, y = factor(level, level = level))) +
geom_point(size = 3, color = "coral1") +
geom_linerange(aes(xmin = conf.low, xmax = conf.high),
size = 1, color = "coral1") +
theme_bw() + xlab("") + ylab("") +
ggtitle("Increase in log(odds ratio) associated with player") +
theme(axis.ticks = element_blank(),
axis.text.x = element_text(size = 12, angle = 60, hjust = 1.0),
axis.text.y = element_text(size = 12),
plot.title = element_text(size = 14),
rect = element_rect(fill = "transparent")) +
scale_y_discrete(limits = rev) +
geom_vline(xintercept = 0, color = "coral2", size = 1.0)
```
A simple sanity check for our model is to see whether any of the players in this top 10 are actually accomplished doubles players. And they are! Bob and Mike Bryan are widely considered to be the best men's doubles team of all time. While by the end of the 2010s they were no longer in their prime, the model puts them in the top 10 of 2010s men's doubles players and that's a good sign that we're capturing player variance correctly.
### Modeling player strength for players on both teams
One important drawback of how we've modeled player effects in our model is that we're only considering team 1 players, not team 2. Tennis matches are not played against a random replacement-level opponent, they're played against a real team that can vary in strength. That means that a losing to Bob and Mike Bryan should probably not harm your personal player strength estimate as much as losing against a random other team would, but our modeling approach does not account for this. (Of course another drawback of this approach is that we're essentially ignoring half of the available data about player strength.)
To account for team 2 player strength, we _could_ enter team 2 as another random intercept, but it's preferable to integrate team 2 player strength estimates with our team 1 player strength estimates. Many of the players in our dataset occur as both members of team 1 and members of team 2, so we'll want to make sure we share that information for player strength estimates across our estimates for both teams.
One way to pool player strength estimates across both teams is to model a single random intercept per player, putting a negative indicator/weight on that random intercept when a player is on team 2 (i.e. that player is working against team 1 winning, which is the event for which we are modeling the probability!). This pools information about player strength across players' appearances on both teams, which should shrink the uncertainty around our estimates.
```{r improved_indicator_matrix}
# generate indicator matrices for both teams for part of the dataset
Wp1 <- weights_from_columns(atp_doubles[1:5, c("team1_player1_name",
"team1_player2_name")])
Wp2 <- weights_from_columns(atp_doubles[1:5, c("team2_player1_name",
"team2_player2_name")])
# create a Bradley-Terry/adversarial indicator matrix
Wp <- bradleyterry_from_sparse(Wp1, Wp2)
# display matrix plot of the indicator matrix
matplot <- plot_membership_matrix(Wp)
cmap <- colorRampPalette(c("#6495ed", "white", "coral1"))(100)
matplot$legend$right$args$key$col <- cmap
update(matplot, xlab = "Players", ylab = "Match",
at = seq(-1.001, 1.001, length.out = 100),
col.regions = cmap)
```
This is a small part of our new indicator matrix. Positive indicators are still colored orange (_coral_, actually), but the new negative indicators are colored blue. There are two teams of two players, so four filled in squares per row/match, 2 orange and 2 blue squares. When we use this new indicator matrix in an `lmerMultiMember` model, the model becomes a type of Bradley-Terry model, a class of models that predicts win probabilities of competing items drawn from a single pool (players, in our case).
```{r improved_doubles_model}
# generate sparse indicator matrices for both teams
Wp1 <- weights_from_columns(atp_doubles[, c("team1_player1_name",
"team1_player2_name")])
Wp2 <- weights_from_columns(atp_doubles[, c("team2_player1_name",
"team2_player2_name")])
# create a Bradley-Terry/adversarial indicator matrix
Wp <- bradleyterry_from_sparse(Wp1, Wp2)
# fit GLMM
m2 <- glmer(team1_win ~ 0 + (1 | player),
family = "binomial", memberships = list(player = Wp),
data = atp_doubles)
# display model summary
summary(m2)
```
You may have noticed that the multiple membership summary line at the bottom of the summary now indicates a min, mean, and max of 0 for the observations. That is because in a Bradley-Terry style adversarial effects model, the indicators for team 1 and team 2 cancel each other out (i.e. the indicators for team 1 are +1, whereas the indicators for team 2 are -1). This looks a little funny in the summary, but it's actually quite easy to tell if you specified your Bradley-Terry model correctly because the indicators should always sum to zero!
```{r improved_doubles_plot, dpi = 300}
# extract random effects
m2_ranefs <- broom.mixed::tidy(m2, effects = "ran_vals", conf.int = TRUE) %>%
.[.$group == "player", ] %>%
.[order(.$estimate, decreasing = TRUE), ]
# plot top 10
m2_ranefs[1:10, ] %>%
ggplot(aes(x = estimate, y = factor(level, level = level))) +
geom_point(size = 3, color = "coral1") +
geom_linerange(aes(xmin = conf.low, xmax = conf.high),
size = 1, color = "coral1") +
theme_bw() + xlab("") + ylab("") +
ggtitle("Increase in log(odds ratio) associated with player") +
theme(axis.ticks = element_blank(),
axis.text.x = element_text(size = 12, angle = 60, hjust = 1.0),
axis.text.y = element_text(size = 12),
plot.title = element_text(size = 14),
rect = element_rect(fill = "transparent")) +
scale_y_discrete(limits = rev) +
geom_vline(xintercept = 0, color = "coral2", size = 1.0)
```
As you can see, modeling team 2 player variability in conjunction with team 1 player variability using a Bradley-Terry style model has shrunk the uncertainty around our estimates of player strength, reflecting a higher level of confidence in these estimates.
We can confirm the intuition that this model fits better by comparing the log-likelihoods of both models.
```{r lr_test}
lmtest::lrtest(m1, m2)
anova(m1, m2)
```
Model 2 makes better within-sample predictions, as demonstrated by the lower log-likelihood, but most importantly it does so with the same number of parameters (the `Df` column in the summary shows `0`, for no difference in degrees of freedom). We did not add any random effects to model 1 to produce model 2, we are simply using the same random intercepts more sensibly.
### Using nested/interaction multiple membership to find the player with the strongest year of the 2010s
Players don't maintain a single level of performance across their whole career. They tend to start off weaker, then peak, and then fall off and eventually retire. Some players retire soon after their peak, while others keep playing at a lower level of performance for many years. Our model does not account for this change in player performance over time, but we can add the year in which a match was played to our random effects structure. This will let us identify which players had the strongest years of play during the 2010s.
We'll use `Matrix::fac2sparse()` to generate the indicator matrix for the years, and then use the `lmerMultiMember::interaction_weights()` helper function to create a sparse indicator matrix for the interaction of the year and player random effects.
```{r interaction_model}
# get year in which tournament was played
atp_doubles$year <- substr(atp_doubles$tourney_date, 1, 4)
# create sparse matrix from years
Wy <- Matrix::fac2sparse(atp_doubles$year)
# generate sparse interaction matrix from sparse player and year matrices
Wpy <- interaction_weights(Wp, Wy)
# fit model
m3 <- glmer(team1_win ~ 0 + (1 | playerXyear),
family = "binomial", memberships = list(playerXyear = Wpy),
data = atp_doubles)
# print model summary
summary(m3)
```
```{r interaction_plot, dpi = 300}
# extract random effects
m3_ranefs <- broom.mixed::tidy(m3, effects = "ran_vals", conf.int = TRUE) %>%
.[.$group == "playerXyear", ] %>%
.[order(.$estimate, decreasing = TRUE), ] %>%
mutate(level = gsub("\\.", " in ", level))
# plot top 10
m3_ranefs[1:10, ] %>%
ggplot(aes(x = estimate, y = factor(level, level = level))) +
geom_point(size = 3, color = "coral1") +
geom_linerange(aes(xmin = conf.low, xmax = conf.high),
size = 1, color = "coral1") +
theme_bw() + xlab("") + ylab("") +
ggtitle("Increase in log(odds ratio) associated with player") +
theme(axis.ticks = element_blank(),
axis.text.x = element_text(size = 12, angle = 60, hjust = 1.0),
axis.text.y = element_text(size = 12),
plot.title = element_text(size = 14),
rect = element_rect(fill = "transparent")) +
scale_y_discrete(limits = rev) +
geom_vline(xintercept = 0, color = "coral2", size = 1.0)
```
When we plot the data like this, we can see that the player with the single strongest year of tennis during the 2010s was Marcelo Melo in 2015, with the Bryans' 2013 performance a close second.[^3] These estimates pass simple sanity checks: Marcelo Melo, for example, reached the number 1 spot on the ATP men's doubles ranking in 2015, which fits with it being a successful season. Likewise, the Bryans were ranked number 1 on the ATP men's doubles ranking in 2013.
[^3]: It may seems strange to compare these estimates across years, given that 2013 Bob Bryan never actually played 2015 Marcelo Melo. However, as a measure of how dominant a player was relative to every other player that year, it is still interesting to compare random intercepts across years. If you object to the "strongest year" phrasing, you can choose to read this as "most dominant season".
The uncertainty intervals on these estimates are larger than the intervals on the whole-decade strength estimates for each player. The previous model had ten times the observations/matches per player, so this increase in uncertainty is to be expected.
### Some final notes
This vignette is intended as a template and tutorial for using `lmerMultiMember`. It is not intended to be a definitive analysis of tennis player performance. There are many other factors one might want to consider when modeling tennis matches, but that's obviously beyond the scope of this vignette.
If you have any questions or notes, feel free to reach out to the [authors](https://jvparidon.github.io/lmerMultiMember/authors.html). If you find an error or bug, please file an [issue](https://github.com/jvparidon/lmerMultiMember/issues/new) on the Github repository.
If you want to get some practice with this package, one exercise we suggest would be to try modifying the code for the last model to identify the strongest player on each surface (clay, grass, hardcourt) instead of the strongest player by year.