-
Notifications
You must be signed in to change notification settings - Fork 7
/
causal.qmd
499 lines (348 loc) · 34.3 KB
/
causal.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
# Causal Modeling {#sec-causal}
:::{.content-visible when-format='html'}
> All those causal effects will be lost in time, like tears in rain... without adequate counterfactual considerations.
> ~ Roy Batty (paraphrased)
:::
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-causal-dag
#| fig-cap: A causal DAG
library(ggdag)
# using more R-like syntax to create the same DAG
tidy_ggdag <- dagify(
target ~ x + z2 + w2 + w1,
x ~ z1 + w1 + w2,
z1 ~ w1 + v,
z2 ~ w2 + v,
w1 ~ ~w2, # bidirected path
exposure = 'x',
outcome = 'target'
) %>%
tidy_dagitty()
# ggdag(tidy_ggdag, label_col = okabe_ito_colors[1], text_color = 'black')
tidy_ggdag |>
ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
geom_dag_edges(
edge_width = 1,
edge_color = 'grey92',
arrow_directed = grid::arrow(length = grid::unit(10, 'pt'), type = 'closed')
) +
geom_dag_point(aes(), show.legend = FALSE, color = okabe_ito[1], size = 25) +
geom_dag_text(size = 5) +
theme_void()
# theme_dag() +
# theme_clean()
ggsave('img/causal-dag.svg', width = 8, height = 6)
```
:::
:::{.content-visible when-format='pdf'}
![](img/causal-dag.svg){width=50% #fig-causal-dag}
:::
Causal inference is a very important topic in machine learning and statistics, and it is also a very difficult one to understand well, or consistently, because *not everyone agrees on how to define a cause in the first place*. Our focus here is merely practical- we just want to show how it can play out in the modeling landscape from a very high level overview. This is such a rabbit hole that we won't be able to go into much detail, but we will try to give you a sense of the landscape, and some of the key ideas.
## Key ideas
- No model can tell you whether a relationship is causal or not. Causality is inferred, not proven, based on the available evidence.
- The exact same models would be used for similar data settings to answer a causal question or a purely predictive question. The primary difference is in the interpretation of the results.
- Experimental design, such as randomized control trials, are the gold standard for causal inference. But the gold standard is often not practical, and not without its limitation even when it is.
- Causal inference is often done with observational data, which is often the only option, and that's okay.
- Counterfactual thinking is at the heart of causal inference, but is useful for all modeling contexts.
- Several models exist which are typically employed to answer a more causal-oriented question. These include structural equation models, graphical models, uplift modeling, and more.
- Interactions are the norm, if not the reality. Causal inference generally regards a single effect. If the normal setting is that such an effect would always vary depending on other features, you should question why you want to aggregate your results to a single 'effect', since that effect would be potentially misleading.
### Why it matters
Often we need a precise statement about the feature-target relationship, not just whether there is some relationship. For example, we might want to know whether a drug works well, or whether showing an advertisement results in a certain amount of new sales. Whether or not random assignment was used, we generally need to know whether the effect is real, and the size of the effect, and often, the uncertainty in that estimate. Causal modeling is, like machine learning, more of an approach than a specific model, and that approach may involve the design or implementing models we've already seen in a different way to answer the key question. Without more precision in our understanding, we could miss the effect, or overstate it, and make bad decisions as a result.
### Good to know
This section is pretty high level, and we are not going to go into much detail here so even just some understanding of correlation and modeling would likely be enough.
## Classic Experimental Design
<!-- TODO: PLACEHOLDER until we can get some sort of coin flip random assignment pic. The current one, while fine will not translate to the print/bw -->
```{r}
#| eval: false
#| echo: false
#| label: r-random-assignment
#|
# Load the necessary library
# Create a data frame with random assignments
set.seed(123)
N = 250
df <- data.frame(
x = rnorm(250, sd = .1),
y = rnorm(250, sd = .1),
group = sample(c('A', 'B'), 250, replace = TRUE),
col_group = sample(rainbow(5), 250, replace = TRUE)
)
dot_size = 3
# Plot the data as a single cluster
p_all = ggplot(df, aes(x, y)) +
geom_point(
aes(color = col_group),
fill = 'white',
pch=21,
size = dot_size,
show.legend = FALSE,
alpha = 1
) +
labs(
title = 'Random Assignment to Groups A and B',
) +
theme_void() +
theme(
plot.title = element_text(size = 30, hjust = 0.5, margin = margin(.5, .5, .5, .5, 'cm'))
)
# p_all
# Plot the data separated into groups A and B
p_groups = ggplot(df, aes(x, y)) +
geom_point(
aes(color = col_group),
fill = 'white',
# border = 5,
pch=21,
size = dot_size,
show.legend = FALSE,
alpha = 1
) +
facet_wrap(~ group) +
labs(
# title = 'Random Assignment to Groups A and B',
) +
theme_void() +
theme(
strip.text = element_text(size = 36, margin = margin(.5, .5, .5, .5, 'cm')),
panel.border = element_rect(color = 'gray25', fill=NA),
panel.spacing = unit(2, 'cm'),
plot.margin = margin(1, 1, 1, 1, 'cm')
)
library(waffle)
p_waff = df |>
count(group, col_group) |>
ggplot( aes(values=n, fill = col_group)) +
geom_waffle(n_rows = 10, size = .5, color = 'white', show.legend = FALSE) +
facet_wrap(~group) +
theme_void() +
theme(
strip.text = element_text(size = 36, margin = margin(.5, .5, .5, .5, 'cm')),
# panel.border = element_rect(color = 'black', fill=NA),
panel.spacing = unit(2, 'cm'),
plot.margin = margin(1, 1, 1, 1, 'cm')
)
design <- '
##11##
222222
333333
'
(p_all /
p_groups/
p_waff) +
plot_layout(width = c(1, 2, 2), design = design) &
scale_color_manual(values = okabe_ito, aesthetics = c('fill', 'color'))
ggsave(
'img/causal-random-assignment.svg',
width = 9,
height = 12,
dpi = 300
)
```
![Random Assignment](img/causal-random-assignment.svg){width=40% #fig-random-assignment}
Many are familiar with the basic idea of an experiment, where we have a **treatment** group and a **control** group, and we want to measure the difference between the two groups. The 'treatment' could regard a new drug, a marketing campaign, or a new app's feature. If we randomly assign our observational units to the two groups, say, one that gets the new app feature and the other doesn't, we can be more confident that the two groups are essentially the same aside from the treatment, and that any difference in the outcome, for example, customer satisfaction with the app, is due to the treatment.
This is the basic idea behind a **randomized control trial**, or **RCT**. We can randomly assign the groups in a variety of ways, but you can think of it as flipping a coin, and assigning each sample to the treatment when the coin comes up on one side, and to the control when it comes up on the other. The idea is that the only difference between the two groups is the treatment, and so any difference in the outcome can be attributed to the treatment. This is visualized in @fig-random-assignment, where the colors represent different groups, and the groups are essentially the same aside from the treatment.
Many of those who have taken a statistics course have been exposed to the simple t-test to determine whether two groups are different. The t-test tells us whether the difference in means between the two groups is *statistically* significant. However, it definitely *does not* tell us whether the treatment itself caused the difference, whether the effect is large, nor whether the effect is real, or even if the treatment is a good idea to do in the first place. It just tells us whether the two groups are statistically different.
Turns out, a t-test is just a linear regression model. It's a special case of linear regression where there is only one independent variable, and it is a categorical variable with two levels. The coefficient from the linear regression would tell you the mean difference of the outcome between the two groups. Under the same conditions, the t-statistic from the linear regression and the t-test would have identical statistical results.
Analysis of variance, or **ANOVA**, allows the t-test to be extended to more than two groups, and multiple features, and is also commonly employed to analyze the results of experimental design settings. But ANOVA is still just a linear regression. Even when we get into more complicated design settings such as repeated measures and mixed design, it's still just a linear regression, we'd just be using mixed models (@sec-lm-extend-mixed-models).
If using a linear regression didn't suggest any notion of causality to you before, it certainly shouldn't now either. The model is *identical* whether there was an experimental design with random assignment or not. The only difference is that the data was collected in a different way, and the theoretical assumptions and motivations are different. Even the statistical assumptions are the same whether you use random assignment or the there are one or more groups, or whether the treatment is continuous or categorical.
Experimental design[^exprand] can give us more confidence in the causal explanation of model results, whatever model is used, and this is why we like to use it when we can. It helps us control for the unobserved factors that might otherwise be influencing the results. If we can be fairly certain the observations are essentially the same *except* for the treatment, then we can be more confident that the treatment is the cause of the difference, and we can be more confident in the causal interpretation of the results. But it doesn't change the model itself, and the results of a model don't prove a causal relationship by themselves. Your experimental study will also be limited by the quality of the data, and the population it generalizes to. Even with strong design and modeling, if care isn't taken in the modeling process to even assess the generalization of the results (@sec-ml-generalization), you may find they don't hold up[^explimits].
[^exprand]: Note that experimental design is not just any setting that uses random assignment, but more generally how we introduce *control* in the sample settings.
[^explimits]: Many experimental design settings involve sometimes very small samples due to the cost of the treatment implementation and other reasons. This often limits exploration of more complex relationships (e.g. interactions), and it is relatively rare to see any assessment of performance generalization. It would probably worry many to know how many experimental results are based on p-values with small data, and this is the part of the problem seen with the [replication crisis](https://en.wikipedia.org/wiki/Replication_crisis) in science.
:::{.callout-note title='A/B Testing' collapse='true'}
**A/B testing** is just marketing-speak for a project focused on comparing two groups. It implies randomized assignment, but you'd have to understand the context to know if that is actually the case.
:::
## Natural Experiments
```{r}
#| eval: false
#| echo: false
deaths = read_csv('misc/covid_new_deaths_per_million.csv', col_select= matches('date|United States$')) |>
rename(deaths_per_mi = `United States`)
vax = read_csv('misc/us_covid_vaccinations.csv')
dim(deaths)
dim(vax)
deaths |> print(n=100)
# min(deaths$date)
# length(seq.Date(as_date('2020-01-05'), as_date('2022-12-31'), by = 'day'))
covid_us = full_join(
vax, deaths |> select(date, deaths_per_mi) |> filter(deaths_per_mi>0), by = 'date'
)
covid_us |>
select(date, people_vaccinated, deaths_per_mi) |>
filter(date < '2022-01-01') |>
drop_na() |>
# mutate(total_vaccinations = total_vaccinations - dplyr::lag(total_vaccinations)) |>
mutate(people_vaccinated_million = people_vaccinated/1e6) |>
select(-people_vaccinated) |>
pivot_longer(-date, names_to = 'variable', values_to = 'value') |>
filter(variable != 'people_vax_million') |>
mutate(variable = if_else(variable == 'deaths_per_mi', '# deaths per million', '# vaccinated in millions')) |>
# group_by(variable) |>
# mutate(scale_value = scale(value)) |>
ggplot(aes(x = date, y = value)) +
# geom_point(aes(color = variable)) +
geom_line(aes(color = variable, linetype = variable)) +
geom_smooth(aes(color = variable, linetype = variable), se=FALSE, method = 'gam', formula = y ~ s(x, bs='cs', k = 25)) +
scale_x_date(date_breaks = '1 month', labels = label_date_short()) +
see::scale_color_okabeito() +
# geom_smooth(method = 'lm') +
labs(
x = '',
y = '',
caption = 'Data from Our World in Data: https://github.com/owid/',
title = 'Covid Vaccinations and Deaths in the US'
) +
theme(
legend.direction = 'vertical',
)
ggsave(
'img/causal-covid-vax-deaths.svg',
width = 8,
height = 6
)
```
```{r}
#| eval: false
#| echo: false
#| label: r-causal-covid-derivs
# just eda
library(mgcv)
library(gratia)
model_data = covid_us |>
select(date, people_vaccinated, deaths_per_mi) |>
filter(date < '2022-01-01') |>
drop_na() |>
# mutate(total_vaccinations = total_vaccinations - dplyr::lag(total_vaccinations)) |>
mutate(people_vaccinated_million = people_vaccinated/1e6)
# Model death data
death_model <- gam(deaths_per_mi ~ s(date), data = model_data |> mutate(date = as.numeric(date)))
death_derivatives <- derivatives(death_model, term = 's(date)')
# Model vaccination data
vax_model <- gam(people_vaccinated_million ~ s(date), data = model_data |> mutate(date = as.numeric(date)))
vax_derivatives <- derivatives(vax_model, term = 's(date)')
# Plot death derivatives
# draw(death_derivatives)
# draw(vax_derivatives)
vax_derivatives |>
select(date = data, vax_d = derivative) |>
left_join(death_derivatives |> select(date = data, death_d = derivative), by = 'date') |>
# arrange(date) |>
mutate(
date = as_date(date),
# vax_d = dplyr::lag(vax_d, 4)
) |>
pivot_longer(-date, names_to = 'variable', values_to = 'value') |>
ggplot(aes(x = date, y = value)) +
geom_line(aes(color = variable, linetype = variable)) +
scale_x_date(date_breaks = '1 month', labels = label_date_short()) +
see::scale_color_okabeito() +
labs(
x = '',
y = '',
)
# ggplot(aes(x = vax_d, y = death_d)) +
# geom_point()
```
![Covid Vaccinations and Deaths in the US](img/causal-covid-vax-deaths.svg){width=50% #fig-covid-vax-deaths}
As we noted, random assignment or a formal experiment is not always possible or practical to implement. But sometimes we get to do it anyway, or at least we can get pretty close! Sometimes, the world gives us a **natural experiment**, where the assignment to the groups is essentially random, or where there is clear break before and after some event occurs, such that we examine the change as we would in pre-post design.
For example, the covid pandemic allowed us to potentially examine vaccination effects, governmental policy effects, the effectiveness remote work, and more. This was not a tightly controlled experiment, but it's something we can treat very similar to an experiment, and we can compare the differences in various outcomes before and after the pandemic to see what changes took place.
## Causal Inference
Reasoning about causality is a very old topic, philosophically dating back millennia, and more formally [hundreds of years](https://plato.stanford.edu/entries/causation-medieval/). Random assignment is a relatively new idea, say [150 years old](https://plato.stanford.edu/entries/peirce/), but was posited even before Wright, Fisher, and Neyman and the 20th century rise of statistics. But with stats and random assignment we had a way to start using models to help us reason about causal relationships. [Pearl and others](https://muse.jhu.edu/pub/56/article/867087/summary) came along to provide a perspective from computer science, and things have been progressing along. We were actually using programming approaches to do causal inference back in the 1970s even! Economists got into the game too (e.g., Heckman), and eventually most scientific academic disciplines were well acquainted with causal inference in some fashion.
Now we can use recently developed modeling approaches to help us reason about causal relationships, which can be both a blessing and a curse. Our models can be more complex, and we can use more data, which can potentially give us more confidence in our conclusions. But we can still be easily fooled by our models, as well as by ourselves. So we'll need to be careful in how we go about things, but let's see what some of our options are!
## Models for Causal Inference
Any model can be used to answer a causal question, and which one you use will depend on the data setting and the question you are asking. The following covers a few models that might be seen in various academic and professional settings.
### Linear Regression
Yep, [linear regression](https://matheusfacure.github.io/python-causality-handbook/05-The-Unreasonable-Effectiveness-of-Linear-Regression.html). The old standby is possibly the mostly widely used model for causal inference, historically speaking and even today. We've seen linear regression as a graphical model @fig-graph-lm, and in that sense, it can serve as the starting point for structural equation models and related models that we'll talk about next that many consider to be true causal models. It can also be used as a baseline model for other more complex causal model approaches. Linear regression can potentially tell us for any particular feature, what that feature's relationship with the target is, holding the other features constant. This **ceteris paribus** interpretation - 'all else being equal' - already gets us into a causal mindset.
However, your standard linear model doesn't care where the data came from or what the underlying structure *should* be. It only does what you ask of it, and will tell you about group differences whether they come from a randomized experiment or not. For example, if you don't include features that would have a say in how the treatment comes about (**counfounders**), you could get a biased estimate of the effect[^biasedcause]. Basic linear regression also cannot tell you whether X is the effect of Y or vice versa. As such, linear regression by itself cannot save us from the difficulties of causal inference, nor really be considered a causal model. But it can be useful as a starting point in conjunction with other approaches.
[^biasedcause]: A reminder that a conclusion of 'no effect' is also a causal statement, and can be just as biased as any other statement. Also, you can come to the same *practical* conclusion with a biased estimate as with an unbiased one.
:::{.callout type='info' title='Weighting and Sampling Methods' collapse='true'}
Common techniques for traditional statistical models used for causal inference include a variety of **weighting** or **sampling** methods. These methods are used to adjust the data so that the 'treatment' groups are more similar, and its effect can be more accurately estimated. Sampling methods include techniques such as stratification and matching, which focus on the selection of the sample as a means to balance treatment and control groups. Weighting methods include inverse probability weighting and propensity score weighting, which focus on adjusting the weights of the observations to make the groups more similar.
These methods are not models themselves, and potentially can be used with just about any model that attempts to estimate the effect of a treatment.
:::
### Graphical Models & Structural Equation Models
**Graphical and Structural Equation Models (SEM)** are flexible approach to regression and classification (see @fig-causal-dag), and have one of the longest histories of formal statistical modeling, dating back over a century[^wright]. They are widely employed in the social sciences, and are often used to model both observed *and* latent variables (@sec-data-latent), with either serving as features or targets. They are also used to model causal relationships, to the point that historically they were even called 'causal graphical models' or 'causal structural models'. SEMs are a special case of **graphical models**, which are common tools in computer science and non-social science disciplines.
[^wright]: [Wright](https://en.wikipedia.org/wiki/Sewall_Wright) is credited with coming up with what would be called **path analysis** in the 1920s, which is a precursor to and part of SEM.
Formal graphical models like SEM provide a much richer set of tools for controlling various confounding, interaction, and indirect effects than simpler linear models. For this reason, they can very useful for causal inference. Unfortunately for those looking for causal effects, the basic input for SEM is a correlation matrix, and the basic output is a correlation matrix. Insert your favorite modeling quote here - you know which one! The point is that SEM, like linear regression, can no more tell you whether a relationship is causal than the linear regression or t-test could[^sembias].
[^sembias]: Your authors have to admit some bias here. We've spent a lot of our past dealing with SEMs, and almost every application we saw had too little data and too little generalization, and were grossly overfit. Many SEM programs even added multiple ways to overfit the data even further, and it is difficult to trust the results reported in many papers that used them. But that's not the fault of SEM in general- it can be a useful tool when used correctly, and it can help answer causal questions, but it is not a magic bullet, and it doesn't make anyone look fancier by using it.
:::{.callout-note title='Causal Language' collapse='true'}
It's often been suggested that we keep certain phrasing (e.g. feature X has an *effect* on target Y) only for the causal model setting. But the model we use can only tell us that the data is consistent with the effect we're trying to understand, not that it actually exists. In everyday language, we often use causal language whenever we think the relationship is or should be causal, and that's fine, and we think that's okay in a modeling context too, as long as you are clear about the limits of your generalizability.
:::
### Counterfactual Thinking
<!-- TODO: could use some viz here -->
When we think about causality, we really out to think about **counterfactuals**. What would have happened if I had done something different? What would have happened if I had done something sooner rather than later? What would have happened if I had done nothing at all? It's natural to question our own actions in this way, but we can think like this in a modeling context too. In terms of our treatment effect example, we can summarize counterfactual thinking as:
> the question is not whether there is a difference between A and B but whether there would still be a difference if A *was* B and B *was* A.
This is the essence of counterfactual thinking. It's not about whether there is a difference between two groups, but whether there would still be a difference if those in one group had actually been treated differently. In this sense, we are concerned with the **potential outcomes** of the treatment, however defined.
Here is a more concrete example:
- Roy is shown ad A, and buys the product.
- Pris is shown ad B, and does not buy the product.
What are we to make of this? Which ad is better? **A** seems to be, but maybe Pris wouldn't have bought the product if shown that ad either, and maybe Roy would have bought the product if shown ad **B** too! With counterfactual thinking, we are concerned with the potential outcomes of the treatment, which in this case is whether or not to show the ad.
Let's say ad A is the new one, i.e., our treatment group, and B is the status quo ad, our control group. Without randomization, our real question can't be answered by a simple test of whether means or predictions are different among the two groups, as this estimate would be biased if the groups are already different in some way to start with. The real effect is whether, for those who saw ad A, what the difference in the outcome would be if they hadn't seen it.
From a prediction stand point, we can get an initial estimate straightforwardly. We demonstrated this before in @sec-knowing-counterfactual-predictions, but can revisit it briefly here. For those in the treatment, we can just plug in their feature values with treatment set to ad A. Then we just make a prediction with treatment set to ad B[^slearn].
[^slearn]: This is basically the **S-Learner** approach to meta-learning, which we'll discuss in a bit. It is generally too weak
:::{.panel-tabset}
##### Python
```{python}
#| eval: false
#| label: py-demo-counterfactual
model.predict(X.assign(treatment = 'A')) -
model.predict(X.assign(treatment = 'B'))
```
##### R
```{r}
#| eval: false
#| label: r-demo-counterfactual
predict(model, X |> mutate(treatment = 'A')) -
predict(model, X |> mutate(treatment = 'B'))
```
:::
With counterfactual thinking explicitly in mind, we can see that the difference in predictions is the difference in the potential outcomes of the treatment. This is a very simple demo to illustrate how easy it is to start getting some counterfactual results from our models. But it's typically not quite that simple in practice, and there are many ways to get this estimate wrong as well. As in other circumstances, the data, or our assumptions about the problem can potentially lead us astray. Assuming those aspects of our modeling endeavor are in order, this is one way to get an estimate of the causal effect of the treatment.
### Uplift Modeling
The counterfactual prediction we just did provides a result that can be called the **uplift** or **gain** from the treatment, especially when compared to a baseline metric. **Uplift modeling** is a general term applied to models where counterfactual thinking is at the forefront, especially in a marketing context. Uplift modeling is not a specific model per se, but any model that is used to answer a question about the potential outcomes of a treatment. The key question is what is the gain, or uplift, in applying a treatment vs. not? Typically any statistical model can be used to answer this question, and often the model is a classification model, whether Roy both the product or not.
It is common in uplift modeling to distinguish certain types of individuals or instances, and we think it's useful to extend this to other modeling contexts as well. In the context of our previous example they are:
- **Sure things**: those who would buy the product whether or not shown the ad.
- **Lost causes**: those who would not buy the product whether or not shown the ad.
- **Sleeping dogs**: those who would buy the product if not shown the ad, but not if they are shown the ad. Also referred to as the 'Do not disturb' group!
- **Persuadables**: those who would buy the product if shown the ad, but not if not shown the ad.
We can generalize beyond the marketing context to just think about response to any treatment we might be interested in. It's worthwhile to think about which aspects of your data could correspond to these groups. One of the additional goals in uplift modeling is to identify persuadables for additional treatment efforts, and to avoid wasting money on the lost causes. But to get there, we have to think causally first!
:::{.callout type='tip' title='Uplift Modeling in R and Python' collapse='true'}
There are more widely used tools for uplift modeling and meta-learners in Python than in R, but there are some options in R as well. In Python you can check out [causalml](https://causalml.readthedocs.io/en/latest/index.html) and [sci-kit uplift](https://www.uplift-modeling.com/en/v0.5.1/index.html) for some nice tutorials and documentation.
:::
### Meta-Learning
[Meta-learners](https://arxiv.org/pdf/1706.03461.pdf) are used in machine learning contexts to assess potentially causal relationships between some treatment and outcome. The core model can actually be any kind you might want to use, but in which extra steps are taken to assess the causal relationship. The most common types of meta-learners are:
- **S-learner** - **s**ingle model for both groups; predict the (counterfactual) difference as when all observations are treated vs when all are not, similar to our code demo above.
- **T-learner** - **t**wo models, one for each treatment group; predict the difference as when all observations are treated vs when all are not for both models, and take the difference
- **X-learner** - a more complicated modification to the T-learner also using a multi-step approach.
<!-- TODO: this doesn't really fit- **Transformed outcome**: More of a data processing step. We transform the target so that it becomes a regression problem in which the prediction is the difference in the potential outcomes. This simplifies the problem to a single model, and can be quite effective in what would normally be a classification problem. -->
Some additional variants of these models exist, and they can be used in a variety of settings, not just uplift modeling. The key idea is to use the model to predict the potential outcomes of the treatment, and then to take the difference between the two predictions as the causal effect.
:::{.callout-note title='Meta-Learners vs. Meta-Analysis' collapse='true'}
Meta-learners are not to be confused with **meta-analysis**, which is also related to understanding causal effects. Meta-analysis attempts to combine the results of multiple *studies* to get a better estimate of the true effect. The studies are typically conducted by different researchers and in different settings. The term **meta-learning** has also been used to refer to what is more commonly called **ensemble learning**, the approach used in random forests and boosting. It is also used by other people that don't bother to look things up before naming their technical terms.
:::
### Others Models Used for Causal Inference
Note that there are many models that would fall under the umbrella of causal inference, and several that are discipline specific, but really are only a special application of some of the ones we've already seen. A few you might come across:
- Marginal structural models
- Instrumental variables and two-stage least squares
- Propensity score matching/weighting
- Regression discontinuity design
- Mediation/moderation analysis
- Meta-analysis
- Bayesian networks
In general, any modeling technique might be employed as part of a causal modeling approach. To actually makes causal statements, you'll generally need to ensure that the assumptions for those claims are tenable.
## Wrapping Up
We've been pretty loose in our presentation here, and glossed over many details with causal modeling. Our main goal is to give you some idea of the domain, but more so the models used and things to think about when you want to answer a causal question with your data.
Models used in statistical analysis and machine learning are not causal models, but when we take a causal model from the realm of ideas and apply it to the real world, a causal model becomes a statistical/ML model with more assumptions, and with additional steps taken to address those assumptions[^assume]. These assumptions are required in order to make stronger causal statements, but neither the assumptions, data, nor model can prove that the underlying theory is causally correct. Things like random assignment, sampling, a complex model and good data can possibly help the situation, but they can't save you from a fundamental misunderstanding of the problem, or data that may still be consistent with that misunderstanding. Nothing about employing a causal model inherently makes better predictions either.
Causal modeling is hard, and most of the difficulty lies outside of the realm of models and data. The model implemented reflects the causal theory, which can be a correct or incorrect idea about how the world works. In the end, the main thing is that when we want to make causal statements, we'll make do with what data we have, and be careful that we rule out some of the other obvious explanations and issues. The better we can control the setting, or the better we can do things from a modeling standpoint, the more confident we can be in making causal claims. Causal modeling is an exercise in reasoning, which makes it such an interesting endeavor.
[^assume]: Gentle reminder that making an assumption does not mean the assumption is correct, or even provable.
### The Common Thread
Engaging in causal modeling may not even require you to learn any new models, but you will typically have to do more to be able to make causal statements. The key is to think about the problem in a different way, and to be more careful about the assumptions you are making. You may need to do more to ensure that your data and model are consistent with the assumptions you are making.
### Choose Your Own Adventure
From here you might revisit some of the previous models and think about how you might use them to answer a causal question. You might also look into some of the other models we've mentioned here, and see how they are used in practice via the additional resources below.
### Additional Resources
We have only scratched the surface here, and there is a lot more to learn. Here are some resources to get you started:
- [Causal Inference in R](https://www.r-causal.org/) @barrett_causal_2024
- [Causal Inference The Mixtape](https://mixtape.scunning.com/) @cunningham_causal_2023
- [Causal Inference for the Brave and True](https://matheusfacure.github.io/python-causality-handbook/) @facure_alves_causal_2022
- [Applied Causal Inference Powered by ML and AI](https://arxiv.org/abs/2403.02467) @chernozhukov_applied_2024
- [The C-Word](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5888052/) @hernan_c-word_2018
## Exercise
If you look into causal modeling, you'll find mention of problematic covariates such as [colliders](https://en.wikipedia.org/wiki/Collider_(statistics)) or [confounders](https://en.wikipedia.org/wiki/Confounding). Can you think of a way to determine if something is a collider or confounder that would not involve a statistical approach or model?