# Bayesian Causal Inference {#ch:causality}
<!------- TO DO ---------
- why no build with tufte?
------------------------->
```{r stopper, eval = FALSE, cache = FALSE, include = FALSE}
knitr::knit_exit()
```
```{r knitr-03-causality, include = FALSE, cache = FALSE}
source(here::here("assets-bookdown", "knitr-helpers.R"))
```
<!-- who knows if this works on next latex build -->
$\renewcommand{\ind}[0]{\perp \!\!\! \perp}$
$\renewcommand{\doop}[1]{\mathit{do}\left(#1\right)}$
$\renewcommand{\diff}[1]{\, \mathrm{d}#1}$
$\renewcommand{\E}[1]{\mathbb{E}\left[#1\right]}$
$\renewcommand{\p}[1]{p\left(#1\right)}$
```{r r-03-causality, cache = FALSE}
library("knitr")
library("here")
library("ggdag")
library("magrittr")
library("tidyverse")
library("ggplot2")
library("latex2exp")
library("scales")
library("patchwork")
library("broom")
library("brms")
library("tidybayes")
library("ggforce")
```
<!------- TO DO ---------
This might need to be chapter 2!!!!
- The model has a discussion of priors that might feel out of place
without first couching it in a viewpoint of Bayes for the project.
- Is there only one viewpoint? No?
- The model is a "pragmatic Bayes" (MCMC for fitting, structural priors),
whereas the causal inf stuff is mixed in its view.
Sometimes its "normalizing Bayes" (information for sensitivity testing),
other times is merely "structural" WIPs.
Maybe one thing to articulate (to myself, to others) is a position on how
structural priors make us think about properties of Bayes estimators.
(in the RDD case, the OLS model is consistent but we aren't in asymptopia.
How else do we evaluate the properties of the Bayes estimator
when we include structural information?)
------------------------->
<!------- TO DO ---------
- is the intro too specific?
- does the project follow through on the main promises?
- Have an abstract and then an outline of the chapter?
- missing right now: connect to political science!
------------------------->
I use the estimates of district-party public ideology from Chapter \@ref(ch:model) to conduct two causal studies later in this project that, like the ideal point model, use a Bayesian modeling framework.
While Bayesian methods are commonplace in ideal point modeling, the approach is almost entirely absent from causal inference work in political science.
The purpose of this chapter is to orient the reader toward a Bayesian framework for causal inference.
The discussion in this chapter highlights three primary contributions of Bayesian modeling for this project.
First, I argue that causal inference is best understood as a problem of posterior predictive inference.
Causal models are models for missing data: what we would observe if a treatment variable were set to a different value.
Bayesian causal inference describes the plausible values of unobserved potential outcomes—or, more generally, the probability distribution of any causal estimand—given the data.
This is how researchers think about causal inference, even if implicitly, almost all of the time.
Second, the Bayesian framework is a coherent method for quantifying uncertainty, which has several benefits for this thesis.
District-party ideology, a key variable in this project, is not fully observed.
It is only estimated up to a probability distribution using the measurement model in Chapter \@ref(ch:model).
Estimates of the causal effects of district-party ideology therefore combine two sources of uncertainty: statistical uncertainty about counterfactual data, and measurement uncertainty about the observed values of district-party ideology before causal interventions.
Bayesian analysis quantifies uncertainty in causal effects as if they were any other posterior quantity: by marginalizing the posterior distribution over all uncertain model parameters.
This unified method of uncertainty quantification is also valuable for multi-stage causal analyses and flexible models with many correlated parameters, both of which appear in the causal analyses to follow.
Third, prior information often improves the estimation of causal effects.
The empirical analyses in this project use priors for regularization: penalizing the complexity of a flexible model to guard against overfitting.
Overfitting is a common concern in the search for heterogeneous treatment effects, where exploring interactions or nonlinearities increases the number of potential false positive findings.
Priors can encode other types of prior information, including structural information about possible data and modeling assumptions.
I show how these sorts of priors can improve the precision of causal estimates and clarify how estimates are sensitive to prior assumptions.
This chapter unpacks these issues according to the following outline.
I begin by reviewing the notation and terminology for causal modeling in empirical research, where data and causal estimands are posed in terms of "potential outcomes" or "counterfactual" observations.
I then describe a Bayesian reinterpretation of these models, which uses probability distributions to quantify uncertainty about causal effects and counterfactual data.
Bayesian methods are not heavily used in political science, so I spend much of the chapter explaining what a Bayesian approach to causal inference means with theoretical and practical justifications: how priors are inescapable for many causal claims, how priors provide valuable structure to improve the estimation of causal effects, and practical advice for constructing and evaluating Bayesian causal models.
I provide examples of Bayesian causal modeling by replicating and extending published studies in political science: a Bayesian regression discontinuity analysis that uses priors to improve the precision and credibility of causal estimates, and a Bayesian meta-analysis that uses priors to highlight the consequences of modeling assumptions.
## Overview of Key Concepts
### Causal models {#sec:causal-inf}
As an area of scientific development, _causal inference_ refers to the formal modeling of causal effects, the assumptions required to identify causal effects, and research designs that make those assumptions plausible.
Scientific disciplines, especially social sciences, have long been interested in substantiating causal claims using data, but the rigorous definition of the full causal model and identifying assumptions are what distinguish the current causal inference movement from other informal approaches.
This section reviews causal inference by breaking it into a three-part hierarchy: causal models, causal identification, and statistical estimation.
The first level of the causal inference hierarchy is the _causal model_.
The causal model is an omniscient view of a causal system that defines its mathematical first principles.
The dominant modeling approach to causal inference in political science is rooted in a model of _potential outcomes_ [@rubin:1974:potential-outcomes; @rubin:2005:potential-outcomes].
This "Rubin model" formalizes the concept of a causal effect by first defining a space of potential outcomes.
The outcome variable $Y$ for unit $i$ is a function of a treatment variable $A$.
"Treatment" refers only to a causal factor of interest, regardless of whether the treatment is randomly assigned.^[
Some causal inference literatures refer to treatments as "exposures," which may feel more broadly applicable to settings beyond experiments. For this project, I make no distinction between treatments and exposures.
]
Considering a binary treatment assignment where $A = 1$ represents treatment and $A = 0$ represents control, unit $i$'s outcome under treatment is represented as $Y_{i}(A_{i} = 1)$ or $Y_{i}(1)$, and the outcome under control would be $Y_{i}(A_{i} = 0)$ or $Y_{i}(0)$.
Expressing $Y$ in terms of hypothetical values of $A$ allows the causal model to describe, with formal exactitude, the entire space of possible outcomes that result from treatment assignment as well as the causal effects of treatment.
The treatment effect for an individual unit, denoted $\tau_{i}$, is the difference in potential outcomes when changing the treatment $A_{i}$.
\begin{align}
\tau_{i} &= Y_{i}(A_{i} = 1) - Y_{i}(A_{i} = 0)
(\#eq:tau-i)
\end{align}
This formulation generalizes to multi-valued treatments as well.
If $\tau_{i}$ equals any value other than $0$, then $A_{i}$ has a causal effect on $Y_{i}$.
Defining the causal model in terms of unit-level effects provides an exact, minimal definition of a causal effect: $A$ affects $Y$ if the treatment has a nonzero effect _for any unit_.
A causal model may describe more complex features of a causal system, such as whether a unit complies with its treatment assignment, whether the unit's potential outcome depends on other variables, and so on.
Although the causal model perfectly describes a causal system, the model is only a hypothetical device.
Because a unit can receive only one treatment, the researcher can observe only one outcome per unit.
This renders the causal effect $\tau_{i}$ unidentifiable from data.
This is the fundamental problem of causal inference: no causal effect can ever be directly observed.
Causal effects can only be _inferred_ by layering on additional assumptions [@holland:1986:causal-inf].
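To make the fundamental problem concrete, the toy simulation below constructs the omniscient potential-outcomes table and then reveals only the outcome under the treatment each unit actually received. Every data-generating value here is a hypothetical choice for illustration; nothing in the observed columns alone recovers the unit-level effects.
```{r po-table-sketch}
# An omniscient view simulates BOTH potential outcomes for each unit;
# the researcher observes only the outcome under the realized treatment.
set.seed(1)
po_table <- tibble(
  y0 = rnorm(5),                        # potential outcome under control
  y1 = y0 + 1,                          # potential outcome under treatment (tau_i = 1)
  a  = rbinom(5, size = 1, prob = 0.5), # realized treatment assignment
  y  = ifelse(a == 1, y1, y0)           # the only outcome ever observed
)
```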
_Causal identification assumptions_ are the second level of the causal inference hierarchy.
Identification assumptions specify the conditions under which counterfactual data can be inferred from observed data [@keele:2015:causal-inf].
The implications of identification assumptions are typically posed in terms of _expectations_ about potential outcomes that average over units, $\E{Y_{i}\left(A_{i}\right)}$, instead of unit-level potential outcomes.
This is because it requires fewer assumptions to identify aggregate causal effects than to identify individual potential outcomes.
Aggregate-level causal effects, defined in terms of expectations over potential outcomes, are typically known as _causal estimands_.
Example estimands include average treatment effects, conditional average treatment effects, local average treatment effects, and so on.
The final layer of the causal inference hierarchy is _statistical estimation_.
Identification assumptions describe minimally sufficient conditions for _nonparametric_ identification of causal estimands [@keele:2015:causal-inf].
Causal estimands are infinite-data expectations in perfectly defined covariate strata.
Real data are often less convenient, with noisily estimated averages and continuous covariates whose strata often must be modeled in some way to make causal estimation feasible.
There is no guarantee that linear regression models, or any parametric models, will correctly model the data and recover causal effects, so causal methodologists often seek methods that minimize additional statistical assumptions.
This hierarchy is helpful for organizing this chapter because it clarifies why researchers use certain research designs or statistical approaches to overcome particular problems with their data.
Statistical assumptions can undermine identification assumptions [@blackwell-olson:2020:interactions; @goodman-bacon:2018:DiD-timing; @hahn-et-al:2018:regularization-confounding], which is why causal inference scholars tend to promote estimation strategies that rely on as few additional assumptions as possible [@keele:2015:causal-inf].
One way to avoid these assumptions is to use research designs that eliminate confounding "by design" rather than through statistical adjustment, such as randomized experiments, instrumental variables, regression discontinuity, and difference-in-differences [for instance, @angrist-pischke:2008:mostly-harmless].
Research projects without those designs must invoke "selection on observables"—the statistical approach that assumes that confounders are controlled—although many methodological advancements in matching, semi-parametric models, and machine learning allow researchers to relax functional form assumptions in their statistical models [@sekhon:2009:opiates; @ratkovic-tingley:2017-direct-estimation; @hill:2011:bart; @samii-et-al:2016:retrospective-causal-inference-ML].
Causal inference is not synonymous with the "agnostic" statistical approach [e.g. @aronow-miller:2019:agnostic-statistics], but it is animated by a similar motivation to identify statistical methods that rely on as few fragile assumptions as possible.
<!-- This dissertation will employ machine learning methods, in particular Bayesian neural networks (BNNs), to estimate regression functions that rely less on exact, reduced-form model specification choices. -->
<!------- TO DO ---------
- do I do ML? Do I do "semi-parametric" splines?
------------------------->
The three-part hierarchy is also useful because it clarifies where my contributions around Bayesian causal estimation will be focused.
As I discuss below, the "easiest way in" for Bayesian methods is through statistical estimation (level 3) because some causal estimation methods are convenient to implement using Bayesian technologies [@imbens-rubin:1997:bayes-compliance; @ornstein-duck-mayr:2020:GP-RDD].
I push this further by arguing that Bayesian analysis changes the interpretation of the causal model (level 1) by specifying probability distributions over the space of potential outcomes.
This probability distribution allows the researcher to say which causal effects and counterfactual data are _more plausible than others_, which is a desirable property of statistical inference that is not available through conventional inference methods.
The Bayesian approach also has the power to extend the meaning of identification assumptions (level 2) by construing them also as probabilistic rather than fixed features of a causal analysis [@oganisian-roy:2020:bayes-estimation].
### Bayesian inference {#sec:bayes-inf}
Bayesian inference is a contentious and misunderstood topic in empirical political science, so it is important to establish some foundations and intuitions before melding it with causal modeling.
This section introduces Bayesian methods by skipping past the common descriptions that are often unhelpful and confusing—subjective probability, prior "beliefs," the posterior is proportional to the prior times the likelihood—and instead describes an "inside view" of Bayesian analysis on its own terms [@mcelreath:2017:decolonized-bayes].
Bayesian analysis uses conditional probability to conduct statistical inference: what is the probability distribution of unknown model quantities, conditional on the observed model quantities?
It begins with a joint probability distribution for all variables in a model.
In most cases these variables are denoted as data $\mathbf{y}$ and parameters $\boldsymbol{\pi}$, but in Bayesian analysis, the distinction between data and parameters has only to do with which variables are observed or unobserved.^[
The semantic distinction between "data" and "parameter" is often sloppier in practice than many researchers would like to think.
Many statistical analyses use aggregate estimates of lower-level processes as if they were known, such as per-capita income or the percentage of women who vote for the Democratic presidential candidate.
These quantities are not knowable from finite data, and instead behave like random variables in that their values could differ under repeated sampling, so it might make sense to view their "true values" as parameters.
From a Bayesian point of view, these are meaningless semantics, since both data and parameters are merely random variables modeled with probability distributions.
The Bayesian view has a similar spirit to the @blackwell-et-al:2017:measurement-error view of measurement uncertainty, where "measurement error" falls on a spectrum between fully observed data and missing data.
]
\begin{align}
\p{\mathbf{y}, \boldsymbol{\pi}} \equiv \p{\mathbf{y} \cap \boldsymbol{\pi}}
(\#eq:joint-model)
\end{align}
The joint probability model represents the multitude of ways that the variables could be configured in the world.
Conditioning on observed variables rules out many configurations of the unobserved variables, leaving behind only the unobserved variables that are consistent with observed data.
\begin{align}
\p{\boldsymbol{\pi} \mid \mathbf{y}} &=
\frac{\p{\mathbf{y} \cap \boldsymbol{\pi}}}{\p{\mathbf{y}}}
(\#eq:condition-joint-model)
\end{align}
From this perspective, Bayesian analysis is "just counting" [@mcelreath:2020:bayes-counting]—counting the number of model configurations that remain after conditioning on known information.
Bayes' Theorem is an expression for this conditioning process based on a particular factorization of the joint model,
\begin{align}
\begin{split}
\p{\mathbf{y}, \boldsymbol{\pi}} &= \p{\mathbf{y} \mid \boldsymbol{\pi}}\p{\boldsymbol{\pi}} \\
\p{\boldsymbol{\pi} \mid \mathbf{y}} &=
\frac{
\p{\mathbf{y} \mid \boldsymbol{\pi}}\p{\boldsymbol{\pi}}
}{
\p{\mathbf{y}}
}
\end{split}
(\#eq:bayes)
\end{align}
which reveals how researchers commonly interface with Bayesian analysis: specifying a model for data conditional on parameters, $\p{\mathbf{y} \mid \boldsymbol{\pi}}$,
and a model for the marginal distribution of parameters, $\p{\boldsymbol{\pi}}$.
These models are often called the "likelihood" and "prior distribution."
The controversy surrounding Bayesian analysis arises from different perspectives about which constructs we choose to describe using probabilities.
Researchers routinely model data given parameters, but many feel that modeling the marginal distribution of parameters is unscientific.
This is because the marginal parameter distribution often represents "prior information" about which parameter values are plausible without observing the data.
The inside view demystifies priors by acknowledging that a prior and a likelihood are fundamentally the same thing: using a probability distribution to quantify uncertainty about the value of a yet-unseen variable [@mcelreath:2017:decolonized-bayes].
Likelihoods, in turn, are priors for data: assumptions that relate observed data to unobserved variables [@lemm:1996:priors-generalization].
Using likelihoods to learn from data presents a similar epistemic problem as the fundamental problem of causal inference: assumptions are required to draw any inferences at all.
Bayesian updating, from the inside view, means considering a multitude of possible model configurations and pruning the configurations based on their consistency with the observed data.
The prior model, $p(\mathbf{y} \mid \boldsymbol{\pi})p(\boldsymbol{\pi})$, describes an overly broad set of possible configurations between data and parameters.
These configurations include a distribution of possible data given parameters, $p(\mathbf{y} \mid \boldsymbol{\pi})$, and a distribution of possible parameters, $p(\boldsymbol{\pi})$.
Bayesian updating decides which configurations are more plausible based on how likely it would be to observe our data under those configurations.
The plausibility of a parameter value—its posterior probability—is greater if the observed data are more likely to occur under that parameter value versus some other value.
In turn, the posterior distribution downweights parameter values that are implausible or inconsistent with the data [@mcelreath:2020:rethinking-2, chapter 2].
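To make this counting logic concrete, the minimal grid-approximation sketch below enumerates candidate values of a success probability $\pi$, weights each candidate by how likely a hypothetical dataset (7 successes in 10 binomial trials) would be under that value, and renormalizes. The $\mathrm{Beta}(2, 2)$ prior and the data are arbitrary choices for illustration.
```{r grid-updating-sketch}
# Bayesian updating as "counting": enumerate candidate configurations,
# weight each by its consistency with the observed data, and renormalize
# so the plausibilities sum to one.
grid_update <-
  tibble(
    pi = seq(0, 1, by = 0.01),                    # candidate parameter values
    prior = dbeta(pi, 2, 2),                      # plausibility before data
    likelihood = dbinom(7, size = 10, prob = pi)  # plausibility of the data
  ) %>%
  mutate(posterior = prior * likelihood / sum(prior * likelihood))
```
Candidate values of $\pi$ near the observed success rate retain the most posterior plausibility, while values inconsistent with the data are downweighted toward zero.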
This is an important distinction from non-Bayesian statistical inference as conventionally performed in political science, which has no comparable notion of "plausible parameters given the data."
As it connects to causal inference, this means that discussing "plausible causal effects" is not possible without a probability distribution over causal effects.
The mission in the remainder of this chapter is to establish a framework for causal inference in terms of plausible effects and plausible counterfactuals.
## Probabilistic Potential Outcomes Model {#sec:intro-bcm}
Having reviewed the basics of causal models and Bayesian inference, we now turn to a framework for Bayesian causal modeling.
The distinguishing feature of a Bayesian causal model is that the elemental units of the model, the potential outcomes, are given probability distributions.
This probability distribution reflects available causal information that exists outside the current dataset.
Bayesian inference proceeds by updating our information about causal effects and counterfactual potential outcomes in light of the observed data.
This section introduces this modeling framework at a high level, provides a probabilistic interpretation and notation for potential outcomes modeling, and describes how the Bayesian framework affects the "hierarchy of causal inference."
As with other causal models, we begin at the unit level.
Unit $i$ receives a treatment $A_{i} = a$, with potential outcomes $Y_{i}\left(A_{i} = a\right)$.
Suppose a binary treatment case where $A_{i}$ can take values $0$ or $1$, so the unit-level causal effect is $\tau_{i} = Y_{i}\left(1\right) - Y_{i}\left(0\right)$.
Although $\tau_{i}$ is unidentified, it is possible to estimate population-level causal quantities by invoking identification assumptions.
For instance, the conditional average treatment effect at $X_{i} = x$, $\bar{\tau}(1, 0, X = x) = \E{Y_{i}(1) - Y_{i}(0) \mid X_{i} = x}$, can be estimated from observed data assuming no hidden treatments, no interference, conditional ignorability, and positive treatment assignment probability [@rubin:2005:potential-outcomes].
Suppressing the unit index $i$,
\begin{align}
\begin{split}
\bar{\tau}(1, 0, x)
&= \E{Y(A = 1) - Y(A = 0) \mid X = x} \\
&= \E{Y(A = 1) \mid X = x} - \E{Y(A = 0) \mid X = x} \\
&= \E{Y \mid A = 1, X = x} - \E{Y \mid A = 0, X = x}
\end{split}
(\#eq:cate-proof)
\end{align}
where the second line follows from the linearity of expectations, and the third line follows from the identification assumptions: conditional ignorability allows conditioning on the observed treatment in place of hypothetical interventions, and the absence of hidden treatments equates the observed outcome with the potential outcome under the treatment actually received.
The identification assumptions connect _causal estimands_ and what I will call _observable estimands_.
Causal estimands are the true causal quantities, but they are unobservable because they are stated as contrasts of potential outcomes.
Observable estimands are the observable analogs of causal estimands and are equivalent to causal estimands only if identification assumptions hold.
Other literature refers to observable estimands as "nonparametric estimators" [@keele:2015:causal-inf], but I steer clear of this language because the distinction between observable estimands and estimators is important for understanding the contributions of the Bayesian causal approach.
The transition to a Bayesian probabilistic model begins with an acknowledgment that no estimate of the observable estimand, $\E{Y \mid A = a, X = x}$, will be exact.
The assumptions identify causal effects only in an infinite data regime where the observable estimand is known exactly.
Inference about causal effects from finite samples, however, requires further statistical assumptions that link the observable estimand to an estimator or model.
<!------- TO DO ---------
- redo with an expectation-level model
------------------------->
Let $f(A_{i}, X_{i}, \boldsymbol{\pi}) + \epsilon_{i}$ be a model for $Y_{i}$ consisting of a function $f(\cdot)$ of treatment $A_{i}$, covariates $X_{i}$, and parameters $\boldsymbol{\pi}$, and an error term $\epsilon_{i}$ where $\E{\epsilon_{i}} = 0$.
This setup is similar to any modeling assumption that appears in observational causal inference to link an estimator to the observable estimand, including parametric models for covariate adjustment, propensity models, matching, and more [@acharya-blackwell-sen:2016:direct-effects; @sekhon:2009:opiates].
<!------- TO DO ---------
- cites for parametric adjustment, propensity, matching
- robust
- ensembles in causal inf
------------------------->
This implies a model for the CATE that differences the modeled outcome over the treatment.
\begin{align}
\begin{split}
\bar{\tau}(1, 0, x)
&= \E{Y \mid A = 1, X = x} - \E{Y \mid A = 0, X = x} \\
&=
\E{
f(A_{i} = 1, X_{i} = x, \boldsymbol{\pi}) -
f(A_{i} = 0, X_{i} = x, \boldsymbol{\pi})
}
\end{split}
(\#eq:cate-f)
\end{align}
The Bayesian approach, inspired largely by @rubin:1978:bayesian, constructs $f()$ as a joint model for data and parameters: $\p{Y, \boldsymbol{\pi}} = \p{Y \mid f(A, X, \boldsymbol{\pi})}\p{\boldsymbol{\pi}}$.
The data are distributed conditional on the model prediction $f()$, which is a function of parameters $\boldsymbol{\pi}$.
The parameters also have a prior distribution $\p{\boldsymbol{\pi}}$, or a distribution marginal of the data.
These models for data and parameters are added statistical assumptions on top of causal identification assumptions.
The data model is similar to any estimation approach that uses a probability model for errors (e.g. any MLE method or OLS with Normal errors).
The parameter model has no analog in OLS or unpenalized MLE, but this added statistical assumption will be leveraged as a major benefit as we explore Bayesian causal estimation below.
The joint generative model is sufficient to characterize the probability distribution for the conditional average treatment effect as defined in Equation \@ref(eq:cate-f),
\begin{align}
p(\bar{\tau}(1, 0, x))
&= \int p\left[
f(A = 1, X = x, \boldsymbol{\pi})
- f(A = 0, X = x, \boldsymbol{\pi}) \mid
\boldsymbol{\pi}
\right]
\p{\boldsymbol{\pi}}
\diff{\boldsymbol{\pi}}
(\#eq:prior-cate)
\end{align}
which is the probability distribution of model contrasts for $A = 1$ versus $A = 0$.
Integrating over $\boldsymbol{\pi}$ in Equation \@ref(eq:prior-cate) marginalizes the distribution with respect to the uncertain parameters.
Because the marginalized parameters are distributed according to the prior $\p{\boldsymbol{\pi}}$, the expression in Equation \@ref(eq:prior-cate) represents a prior distribution for the CATE.
This is an inherent feature of the Bayesian approach: probability distributions of causal quantities even before data are observed.
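To show what this implied prior looks like in practice, the sketch below simulates the CATE prior for a hypothetical linear model $f(A, X, \boldsymbol{\pi}) = \pi_{1} + \pi_{2}A + \pi_{3}X$ with independent standard Normal priors on the parameters; none of these choices come from the applications in this project.
```{r prior-cate-sketch}
# Simulate the implied prior for the CATE: draw parameters from their
# priors, then difference the model predictions under A = 1 versus A = 0.
n_sims <- 5000
prior_cate <-
  tibble(
    pi_1 = rnorm(n_sims),  # intercept
    pi_2 = rnorm(n_sims),  # treatment coefficient
    pi_3 = rnorm(n_sims),  # covariate coefficient
    tau  = (pi_1 + pi_2 * 1 + pi_3 * 0.5) -
           (pi_1 + pi_2 * 0 + pi_3 * 0.5)  # contrast at X = 0.5
  )
```
For this additive model the implied CATE prior collapses to the prior on $\pi_{2}$; with interactions or a nonlinear $f(\cdot)$, the implied prior generally must be simulated rather than read off a single coefficient.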
Conditioning on the observed data returns the posterior distribution for the CATE,
\begin{align}
p(\bar{\tau}(1, 0, x) \mid Y)
&= \int
p\left[
f(A = 1, X = x, \boldsymbol{\pi}) - f(A = 0, X = x, \boldsymbol{\pi})
\mid
\boldsymbol{\pi}, Y
\right]
\p{\boldsymbol{\pi} \mid Y}
\diff{\boldsymbol{\pi}}
(\#eq:post-cate)
\end{align}
which marginalizes over the parameters after conditioning on the data.
Causal models, at their core, are models for counterfactual data.
Because Bayesian models are _generative_ models for parameters and data, they contain all of the machinery required to directly quantify counterfactual potential outcomes using probability distributions.
Bayesian causal models facilitate probabilistic causal inference at the unit level by generating posterior distributions for counterfactual observations.
To see this in action, we start by acknowledging that we can use any joint model to generate a predictive distribution for data $Y$ from fixed model parameters [@mcelreath:2020:bayes-counting].
Denote these generated observations as $\tilde{Y}$ to distinguish them from the data observed $Y$.
If we average this predictive distribution $\p{\tilde{Y} \mid \boldsymbol{\pi}}$ over the prior distribution of parameters, we obtain a "prior predictive distribution"---the distribution of data we would expect under the prior [@gelman-et-al:2013:BDA].
\begin{align}
\p{\tilde{Y} \mid A = a, X = x} &= \int \p{\tilde{Y} \mid A = a, X = x, \boldsymbol{\pi}}\p{\boldsymbol{\pi}} \diff{\boldsymbol{\pi}}
(\#eq:prior-predictive)
\end{align}
If we condition on the observed data before generating new observations, this is called a "posterior predictive distribution"—the distribution of data that we expect from the posterior parameters.
\begin{align}
\p{\tilde{Y} \mid Y, A = a, X = x} &= \int \p{\tilde{Y} \mid A = a, X = x, \boldsymbol{\pi}}\p{\boldsymbol{\pi} \mid Y} \diff{\boldsymbol{\pi}}
(\#eq:post-predictive)
\end{align}
These predictive distributions are the basis for out-of-sample inference in any Bayesian generative model.^[
Simulations of this sort are possible under any likelihood-based model that specifies a probability distribution for the data.
Bayesian predictive distributions include the additional step of marginalizing over the parameter distribution instead of conditioning on fixed parameters.
This makes Bayesian predictive distributions a more complete accounting of statistical and epistemic sources of uncertainty.
]
Invoking the causal identification assumptions, we generate a predictive distribution for counterfactual data as well by setting the treatment $A$ to some other value $A = a'$.
Denote these counterfactual predictions $\tilde{Y}_{i}'$, which I subscript $i$ to show that this model implies a probability distribution for individual data points as well as aggregate treatment effects.
The posterior predictive distribution for counterfactual data is
\begin{align}
\p{\tilde{Y}_{i}' \mid Y, A_{i} = a', X_{i} = x} &= \int \p{\tilde{Y}_{i}' \mid A_{i} = a', X_{i} = x, \boldsymbol{\pi}}\p{\boldsymbol{\pi} \mid Y} \diff{\boldsymbol{\pi}},
(\#eq:counterfactual-predictive)
\end{align}
which is sustained by causal identification assumptions as well as a distributional assumption for data given parameters.^[
One notable feature of these predictive distributions is that they condition on a known treatment status $A = a$ or $A = a'$.
Another way to consider the prior distribution for a potential outcome is to marginalize over the treatment status, which is itself a random variable whose value is unknown prior to observing any data.
This is the approach laid out by @rubin:1978:bayesian, and while it is more general, it is also more abstract and less directly useful for the applications in this project.
]
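In practice, drawing from Equation \@ref(eq:counterfactual-predictive) requires nothing beyond the fitted model's posterior predictive machinery. The sketch below assumes a hypothetical `brms` fit named `fit`, a data frame `observed_df`, and a binary treatment column `a`; `posterior_predict()` is the standard `brms` interface for predictive draws.
```{r counterfactual-predictive-sketch, eval = FALSE}
# Posterior predictive draws for counterfactual data: set each unit's
# treatment to the value it did not receive, then predict.
counterfactual_df <- observed_df %>%
  mutate(a = 1 - a)  # flip the binary treatment
# Rows index posterior draws; columns index units.
y_tilde_cf <- posterior_predict(fit, newdata = counterfactual_df)
```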
Bayesian causal models can be so summarized: if a causal model defines a space of potential outcomes, then a Bayesian causal model gives potential outcomes a probabilistic representation.
Probability densities over potential outcomes are defined in the prior and in the posterior, and they can be defined all the way to the unit level if the generative model contains a probability distribution for unit data.^[
Some modeling approaches can estimate average causal effects with group-level statistics only, eliding the unit-level model altogether.
This can weaken the model's dependence on parametric assumptions for units, falling back onto more dependable parametric assumptions for the statistics, e.g. the Central Limit Theorem for group means.
A model of this type will naturally stop short of defining probability distributions for counterfactual units, but it does define probability distributions for counterfactual means.
In some cases, such as binary outcome data, means in each group are sufficient statistics for the raw data, so the unit level model is implied by the group-level model.
]
In short, the Bayesian view of causal inference is a _missing data model for counterfactual means or counterfactual observations_—a view that is at least as old as @rubin:1978:bayesian.^[
In more general modeling contexts beyond causal inference, @jackman:2000:bayes-missing-data makes a similar argument that all estimates, inferences, and goodness-of-fit statistics can be unified as functions of missing data, with Bayesian posterior sampling as a natural way to describe our information about these functions.
]
Bayesian methods for causal inference have appeared in political science only sporadically in the decades since [e.g. @horiuchi-et-al:2007:experimental-design; @green-et-al:2016:lawn-signs; @ornstein-duck-mayr:2020:GP-RDD].
### Why Bayesian causal modeling? {#sec:why-bcm}
<!------- TO DO ---------
- fix this section: it feels unfocused and rambling
------------------------->
A Bayesian view of causal inference is possible, but why is it valuable?
This section describes several benefits that are related to this project, although other projects could certainly find other benefits.
In short: Bayesian methods facilitate direct inference about plausible causal effects without notions of repeated sampling, which makes them valuable for the observational data often used in political science.
Probability distributions provide a convenient interface for incorporating uncertainty in multi-stage estimation routines, data with measurement error, and flexible models with many correlated parameters and regularization, all of which appear in subsequent chapters.
The Bayesian causal approach is sensible for causal inference because it facilitates direct, probabilistic inference about treatment effects given the data: which effect sizes are more likely or less likely than others.
While $p$-values and confidence intervals are often misused to make probabilistic statements about parameters, the posterior distribution and posterior intervals actually enable the researcher to state the probability of substantive treatment effects, negligible effects [@rainey:2014:negligible], and more.
Positive statements about plausible causal effects are a natural way to discuss the results of causal research: "the world probably works in this way, given the evidence."^[
Rubin writes, in the context of causal inference, that "a posterior distribution with clearly stated prior distributions is the most natural way to summarize evidence for a scientific question" [@rubin:2005:potential-outcomes, p. 327].
]
This language requires probability distributions over parameters.
Non-Bayesian methods, meanwhile, conduct inference about the plausibility of _data_ given fixed parameters; inference about parameters is indirect and requires an additional layer of decision theory.
Non-Bayesian inference can be awkward as a result—for instance, using a $p$-value to claim that data are inconsistent with a null hypothesis that the researcher never thought was credible to begin with [@gill:1999:NHST].
Restated more formally, the non-Bayesian researcher routinely conducts inference by estimating $\p{\mathbf{y} \mid \text{Null Hypothesis}}$, when they are usually more interested in $\p{\text{Alternative Hypothesis} \mid \mathbf{y}}$.
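Computationally, such direct probability statements are one-line summaries of posterior draws. In the hypothetical sketch below, the model object `fit`, the coefficient name `b_treatment`, and the threshold of $0.1$ are all placeholders.
```{r posterior-probability-sketch, eval = FALSE}
# Direct probability statements about a treatment effect from a brms fit.
draws <- as_draws_df(fit)           # one row per posterior draw
mean(draws$b_treatment > 0.1)       # Pr(substantive positive effect | data)
mean(abs(draws$b_treatment) < 0.1)  # Pr(negligible effect | data)
```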
Probabilistic inference for parameters is especially valuable when the observed data represent the entire population, which is common for observational causal inference in political science.
Historical data cannot be resampled from a broader population, so estimators cannot inherit their statistical properties from their sampling distributions [@western-jackman:1994:comparative-bayes].^[
Causal researchers sometimes invoke a "design-based" uncertainty framework, where randomness in treatment assignment is a source of uncertainty instead of population resampling [@keele-et-al:2012:RI; @abadie-et-al:2020:design-uncertainty].
This approach is uncommon except among researchers on the cutting edge of "agnostic" statistical practices [@aronow-miller:2019:agnostic-statistics].
]
The foundations of uncertainty in Bayesian inference, meanwhile, are probability distributions that represent imperfect pre-data information about the generative processes underlying the variables in a model.
Whether this imperfect information corresponds to sampling randomness or other epistemic uncertainty can be subsumed in the Bayesian framework [@rubin:1978:bayesian].^[
Bayesian statisticians remain interested in the frequency properties of their methods such as interval coverage [@rubin:1984:bayes-frequency], which partially motivates an interest in "objective Bayesian inference" [@berger:2006:objective-bayes; @fienberg:2006:objective-bayes-comments].
]
It is common for advocates of Bayesian inference to celebrate the fact that the posterior distribution quantifies uncertainty in all random variables simultaneously, but this is especially useful for causal methods that entail multiple estimation steps.
Multi-stage procedures require estimates from one stage to serve as inputs to later stages, propagating additional uncertainty into the later-stage estimates.
These multi-stage procedures are common in causal inference and include instrumental variables, propensity score weighting, synthetic control, and other structural models [@angrist-pischke:2008:mostly-harmless; @imai-et-al:2011:black-mox-mediation; @acharya-blackwell-sen:2016:direct-effects; @xu:2017:synthetic-control; @blackwell-glynn:2018:causal-TSCS].
Bayesian methods combine all estimation stages into one joint model, so a Bayesian treatment effect estimate will naturally reflect uncertainty in all model stages by marginalizing the posterior distribution over the "design stage" parameters [@mccandless-et-al:2009:bayesian-pscore; @liao:2019:bayesian-causal-inference; @zigler-dominici:2014:propensity-uncertainty].
Joint modeling is similar to "uncertainty propagation" approaches that use numerical methods to simulate early-stage uncertainty in later-stage models.
Unlike uncertainty propagation, however, a fully Bayesian model treats early-stage estimates as priors for later-stage estimates, so all uncertain parameters are updated using full information from all stages of the model [@liao-zigler:2020:two-stage-bayes; @zigler:2016:bayes-propensity; @zigler-et-al:2013:feedback-propensity].
The combined modeling approach is important for this project because the key independent variable, district-party public ideology, is an uncertain estimate from a measurement model.
Estimates for the causal effect of district-party ideology therefore contain two sources of error: statistical uncertainty about the causal effect itself, and measurement error in the underlying data.
Building a combined model to estimate ideal points and causal effects simultaneously would be logistically overwhelming, but the full model can be approximated by drawing ideal points from a prior in the causal analyses.
This is a method that I implement in Chapter \@ref(ch:positioning).
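The sketch below outlines this approximation in the spirit of multiple imputation: refit the causal model under each of $K$ posterior draws of the ideal points and pool the resulting effect draws. All object names (`ideology_draws`, `causal_df`, `fit`, `b_ideology`) are hypothetical placeholders; `brms` also provides `brm_multiple()` for this style of pooled fitting.
```{r ideal-point-uncertainty-sketch, eval = FALSE}
# Treat K posterior draws of the measured ideal points as a prior:
# refit the causal model under each draw and pool the posterior draws
# of the treatment effect across the refits.
effect_draws <-
  map(seq_len(K), function(k) {
    data_k <- mutate(causal_df, ideology = ideology_draws[, k])
    fit_k  <- update(fit, newdata = data_k)  # refit under the k-th draw
    as_draws_df(fit_k)$b_ideology
  }) %>%
  unlist()
```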
One final justification for Bayesian causal modeling is that prior information is everywhere.
This is a longer discussion that I untangle in Section \@ref(sec:causal-priors), but to preview, priors matter for the way researchers think about their modeling decisions, and they affect the inferences that researchers draw from data, even if they wish to avoid explicit Bayesian thinking about their analyses.
### Bayesian modeling and the hierarchy of causal inference {#sec:bayes-hierarchy}
This section interprets the Bayesian causal inference framework in light of the "hierarchy of causal inference" described in Section \@ref(sec:causal-inf).
The hierarchy helps us account for the ways that Bayesian methods have already been invoked for causal inference in political science and in other fields, and it helps us understand how the Bayesian statistical paradigm reinterprets causal inference more broadly.
To review, the hierarchy consists of three parts:
1. The causal model: definition of potential outcome space, causal estimands expressed in terms of potential outcomes.
2. Identification assumptions: linkage from causal estimands expressed using potential outcomes to observable estimands expressed using observed data.
3. Estimation: Methods for estimating observable estimands with finite data.
We began our discussion of the Bayesian causal model above by considering a Bayesian model for an observable estimand.
Bayes was invoked as "mere estimation," level 3 of the hierarchy.
As only an estimation method, a Bayesian estimator (such as a posterior expectation value) doesn't obviously change the meaning of the observable estimand or the causal estimand.
We can evaluate the Bayesian model for its bias and variance like any other estimator.
"Mere estimation" is where many Bayesian causal approaches appear in political science and other fields.
The estimation benefits of Bayes tend to fall into three categories: priors provide practical stabilization or regularization, posterior distributions are convenient quantifications of uncertainty, or MCMC provides a tractable way to fit a complex model.
"Mere estimation" regards Bayesian inference as practically valuable but theoretically unnecessary insofar researchers might prefer non-Bayesian solutions to the same problems.
Examples include the use of Bayesian Additive Regression Trees [or BART, @chipman-et-al:2010:BART; @hill:2011:bart] for heterogeneous treatment effects [@green-kern:2012:bart], Gaussian processes for smooth functions in regression discontinuity [@ornstein-duck-mayr:2020:GP-RDD], and augmented LASSO estimators [@tibshirani:1996:lasso; @park-casella:2008:bayesian-lasso] for sparse regression methods [@ratkovic-tingley:2017-direct-estimation; @ratkovic-tingley:2017:sparse-lasso-plus].^[
A recent, notable example from economics is @meager:2019:microcredit, who uses Bayesian random effects meta-analysis to aggregate evidence from micro-credit experiments.
See @rubin:1981:eight-schools for an introduction to Bayesian meta-analysis of experiments.
]
These methods use priors to regularize richly parameterized models and MCMC for estimation, but the theoretical implications of Bayesian causal estimation are not a major focus.
What does it mean for Bayesian estimation to have theoretical implications for causal inference?
This brings our focus to level one of the causal inference hierarchy: the model of potential outcomes.
Any estimation method that invokes Bayesian tools requires a prior for model parameters, which implies prior densities on causal estimands.
If the joint model contains a unit-level data model as well, which is the case for most regression approaches, then unit potential outcomes also have prior probability densities: some potential outcomes are more plausible than others, even before seeing data.
This is a decisive theoretical departure from a non-Bayesian approach to causal modeling, where potential outcomes and causal effects are merely defined.
The benefit of this departure is the ability to specify posterior distributions for unobserved potential outcomes directly, which a few recent methodology papers in political science have invoked for missing data due to noncompliance [@horiuchi-et-al:2007:experimental-design], synthetic control estimation [@carlson:2020:GP-synth], and regression settings [@ratkovic-tingley:2017-direct-estimation].
But these papers do not highlight the fact that these methods also imply _priors_ for counterfactuals.
As a result, skeptical applied researchers have little guidance for understanding what it means to have priors on counterfactual data, theoretically or practically.
I discuss priors in more detail in Section \@ref(sec:causal-priors).
Priors do not have to be an inconvenience.
There are many scenarios where priors can relax assumptions, building robustness checks directly into a statistical model.
This is how Bayesian inference affects layer two of the hierarchy: identification assumptions.
By their nature, identification assumptions can never be validated by consulting the data, so most causal inference research projects simply condition the analysis on the identification assumptions holding [see @hartman-hidalgo:2018:equivalence-tests for a hypothesis testing approach to identification assumptions].
A Bayesian model can relax these assumptions by instead posing them as priors that reflect the researcher's reasonable expectations about the remaining biases in a research design [@oganisian-roy:2020:bayes-estimation].
This generalizes the notion of "sensitivity tests" for measuring the robustness of causal inferences to violated identification assumptions [e.g. @imai-et-al:2011:black-mox-mediation; @acharya-blackwell-sen:2016:direct-effects]
by marginalizing over these sensitivity parameters instead of conditioning on fixed values.
One recent political science example of this approach is @leavitt:2020:bayes-did, who frames the parallel trends assumption in a difference-in-differences design as a prior over unobserved trends.
This introduces an additional layer of "epistemic uncertainty" into Bayesian causal inference that is ordinarily assumed to be zero.
For a more general discussion of identification assumptions as priors, see @oganisian-roy:2020:bayes-estimation.
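A minimal sketch of this idea: suppose posterior draws `effect_draws` identify the causal effect only up to an additive bias $\delta$ induced by a violated assumption. Rather than conditioning on $\delta = 0$, we give $\delta$ a prior and marginalize over it by pairing each effect draw with a bias draw. Both `effect_draws` and the $\mathrm{Normal}(0, 0.1)$ bias prior are hypothetical.
```{r sensitivity-prior-sketch, eval = FALSE}
# An identification assumption as a prior: marginalize over a bias term
# instead of assuming it away.
delta <- rnorm(length(effect_draws), mean = 0, sd = 0.1)  # prior on the bias
adjusted_draws <- effect_draws - delta  # marginalizes over the violation
quantile(adjusted_draws, c(0.05, 0.5, 0.95))
```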
<!-- bayesian viewpoints -->
<!-- @rubin:1978:bayesian -->
<!-- @rubin:2005:potential-outcomes -->
<!-- @baldi-shahbaba:2019:bayesian-causality -->
<!-- meta-analysis -->
<!-- @rubin:1981:eight-schools -->
<!-- @meager:2019:microcredit -->
<!-- @green-et-al:2016:lawn-signs -->
<!-- compliance -->
<!-- @imbens-rubin:1997:bayes-compliance -->
<!-- @horiuchi-et-al:2007:experimental-design -->
<!-- BART for heterogeneity -->
<!-- @green-kern:2012:bart -->
<!-- @guess-coppock:2018:backlash -->
<!-- direct counterfactuals w/ incidental Bayes -->
<!-- @ratkovic-tingley:2017-direct-estimation -->
<!-- @carlson:2020:GP-synth -->
<!-- rdd -->
<!-- @ornstein-duck-mayr:2020:GP-RDD -->
<!-- @branson-at-al:2019:bayes-rdd -->
<!-- @chib-jacobi:2016:bayes-rdd -->
<!-- ? -->
<!-- @lattimore-rohde:2019:do-calc-bayes-rule -->
## Understanding Priors in Causal Inference {#sec:causal-priors}
<!------- TO DO ---------
- from earlier section:
It is important to note at the outset that this "inside view" of Bayesian modeling, and its implications for causal inference, are coherent even if using uninformative prior distributions for parameters.
This is how Bayesian methods tend to appear in political science to date, with noninformative priors that exist primarily to facilitate Bayesian computation for difficult estimation problems.
The infamy of Bayesian methods, however, is owed to the ability of the researcher to specify "informative" priors that concentrate probability density on model configurations that are thought to be more plausible even before data are analyzed.
There are many modeling scenarios where this concentration of probability delivers results that are almost unthinkable without prior structure: multilevel models that allocate variance to different layers of hierarchy, highly parameterized models with correlated parameters such as spline regression, and sparse regressions where regularizing priors are used to shrink coefficients and preserve degrees of freedom to overcome the "curse of dimensionality" [@bishop:2006:pattern-rec; @gelman-et-al:2013:BDA].
At the same time, many researchers are skeptical of Bayesian methods because supplying a model with non-data information can be spun as data falsification [@garcia-perez:2019:bayes-data-falsification].
As I elaborate in Section \@ref(sec:causal-priors), it is a mistake to equate prior "flatness" with prior "uninformativeness," and there are many legitimate sources of prior information that have nothing to do with subjective beliefs.
------------------------->
The distinguishing feature of Bayesian analysis that attracts most of its controversy is the prior distribution over model parameters.
At the same time, priors also deliver most of the benefits of Bayesian modeling.
This section unravels several common confusions about priors as they relate to modeling in general and especially for causal inference.
What do priors do, and how can they be used responsibly?
I have two goals with this section.
The first is to undermine the view that flat priors are sensible default choices.
I explain how flat priors are not always uninformative and how uninformative priors are not always flat.
The second goal is to provide guidance for specifying weakly informative priors that improve causal estimation without introducing unreasonably strong assumptions.
### Information, belief, and data falsification
<!------- TO DO ---------
- figure this out
DATA FALSIFICATION digression
------------------------->
Bayesian analysis is often characterized as overly subjective.
If priors are a way for researchers to insert their "beliefs" into a statistical analysis, what is the point of data?
Some have argued that Bayesian analysis with informative priors is analytically equivalent to "data falsification" because priors and data influence the posterior distribution through the same mechanism: adding information to the log posterior distribution [@garcia-perez:2019:bayes-data-falsification].
This hesitation can be eased with two lines of thought.
First, it is helpful to think of priors as _information_, not "belief."
A prior is any assumption that brings probabilistic information into a model.
This is not unique to Bayesian models, since all likelihood functions represent probabilistic assumptions about data as well.
Second, this project regards priors as pragmatic devices.
Priors are "belief functions" only in the sense that they represent the support for a parameter value _within a model_, but the researcher is "morally certain" that the model is wrong, so their degree of belief in a prior is actually zero [@gelman-shalizi:2013:bayes-philosophy, p. 19--20].
Priors, like other model assumptions, represent reasonable approximations that impose structure on the information obtained from data.
Researchers often care about other pragmatic consequences of priors such as their frequency properties, specifying noninformative priors for optimal learning from data [@rubin:1984:bayes-frequency; @berger:2006:objective-bayes; @fienberg:2006:objective-bayes-comments], and workflow practices for model-building that are not strictly consistent with Bayesian theory [@gelman-shalizi:2013:bayes-philosophy; @betancourt:2018:workflow-blog; @gabry-et-al:2019:visualization].
### Flatness is a relative, not an absolute, property of priors {#sec:prior-flatness}
Researchers commonly turn to Bayesian methods to solve an inconvenient estimation problem but would like to avoid the difficulty of specifying priors.
It is common for these researchers to err toward "flat" or "diffuse" priors that assign equal or nearly equal probability to all parameter values.
This feels "least biased" because the Bayesian model most-closely represents an unpenalized maximum likelihood model, with which the researcher is more familiar.
One common Bayesian argument against flat priors is that they understate the researcher's actual prior information—an argument that is both obvious and uncompelling.
Instead, I argue that flat priors often lead researchers to misunderstand what their models actually say.
If a parameter has a flat prior, functions of that parameter are not guaranteed to have a flat prior.
Furthermore, flat priors are only flat with respect to a particular parameterization of a model.
If the parameterization changes, a flat prior in the new parameterization will not represent the same prior information.
In general, the researcher must understand the data's functional dependence on the parameters in order to understand the consequences of so-called uninformative priors.
```{r normal-factoring}
normal_mu <- 3
normal_sd <- 2
implied_normal <-
tibble(
x = seq(-10, 10, .01),
pi = dnorm(x),
h = dnorm(x, mean = normal_mu, sd = normal_sd)
) %>%
pivot_longer(cols = -x, values_to = "density", names_to = "param") %>%
print()
```
```{r plot-normal-factoring}
ggplot(implied_normal) +
aes(x = x, color = param, fill = param) +
geom_ribbon(
aes(ymin = 0, ymax = density),
alpha = 0.7,
outline.type = "upper"
) +
labs(
x = NULL,
y = NULL,
title = "Priors and Implied Priors",
subtitle = "Functions of parameters have implied prior density"
) +
annotate(
geom = "text",
x = 0 + normal_mu + normal_sd,
y = 0.9 * (filter(implied_normal, param == "h") %$% max(density)),
label = TeX("Transformed parameter: $h(\\pi)$"),
hjust = 0
) +
annotate(
geom = "text",
x = 0 + 1,
y = 0.9 * (filter(implied_normal, param == "pi") %$% max(density)),
label = TeX("Original parameter: $\\pi$"),
hjust = 0
) +
coord_cartesian(xlim = c(-3, 10)) +
scale_fill_manual(values = c(primary, tertiary)) +
scale_color_manual(values = c(primary, tertiary)) +
scale_y_continuous(breaks = NULL) +
scale_x_continuous(breaks = seq(-10, 10, 2)) +
theme(legend.position = "none")
```
To understand the consequences of prior choices, it is essential to understand the _implied prior_.
Suppose we have a parameter $\pi$ and a function of that parameter, $h(\pi)$.
If $\pi$ has a prior density, then $h(\pi)$ has an implied prior density, which is affected by the density of $\pi$ and the function $h(\cdot)$.
Consider a simple example where $\pi$ is distributed $\mathrm{Normal}\left( 0, 1 \right)$, and $h(\pi) = `r normal_mu` + `r normal_sd`\pi$.
The implied prior for $h(\pi)$ is $\mathrm{Normal}\left( `r normal_mu`, `r normal_sd` \right)$, which is shown in Figure \@ref(fig:plot-normal-factoring).
Importantly, the prior density at a particular value of $\pi$ is almost certainly not equal to the implied density of $h(\pi)$ at the corresponding value: functions of parameters carry prior densities, but those densities generally differ from the density of the original parameters.
```{r plot-normal-factoring, include = TRUE, fig.scap = "Implied prior density for a function of a parameters.", fig.cap = "If a parameter has a density, a function of the parameter also has a density that is almost always unequal to the original density.", fig.height = 5, out.width = "70%"}
```
```{r beta-binom}
# Simulate the implied prior over X ~ Binom(N, alpha) under three priors:
# a flat Beta(1, 1) on alpha, a wide Normal(0, 10) on logit(alpha), and a
# standard Logistic(0, 1) on logit(alpha)
n_bb <- 1e5
set.seed(9348)
tibble(
alpha_beta = rbeta(n_bb, 1, 1),
eta_normal = rnorm(n_bb, 0, 10),
alpha_normal = plogis(eta_normal),
eta_logistic = rlogis(n_bb, 0, 1),
alpha_logistic = plogis(eta_logistic)
) %>%
mutate(
across(
starts_with("alpha"),
.fns = list(x = ~ rbinom(n(), size = n_bb, prob = .))
)
) %>%
pivot_longer(
cols = ends_with("x"),
names_to = "term",
values_to = "value"
) %>%
ggplot() +
aes(x = value) +
geom_histogram(
fill = primary,
boundary = 0,
alpha = 0.8
) +
facet_wrap(
~ fct_relevel(term, "alpha_beta_x", "alpha_normal_x"),
labeller = as_labeller(
c(
"alpha_beta_x" = "α ~ Beta(1, 1)",
"alpha_normal_x" = "logit(α) ~ Normal(0, 10)",
"alpha_logistic_x" = "logit(α) ~ Logistic(0, 1)"
)
)
) +
scale_x_continuous(
breaks = seq(0, n_bb, length.out = 3),
labels = c("0", "N / 2", "N")
) +
ggeasy::easy_remove_legend() +
ggeasy::easy_remove_y_axis() +
labs(
title = "Prior Flatness ≠ Prior Vagueness",
subtitle = "How transformations of parameter space affect implied priors",
x = "Random variable X ~ Binom(N, α)",
y = NULL
)
```
Implied priors help us understand the unintended consequences of flat priors by highlighting circumstances where flat priors, believed to be reasonable and conservative, imply problematic prior distributions over the data [@seaman-et-al:2012:noninformative-priors].
Consider a binomial random variable that counts $X$ successes out of $N$ independent trials with success probability $\alpha$.
We represent prior ignorance about $\alpha$ using a flat $\mathrm{Beta}\left(1, 1\right)$ density.
Now consider the identical model but reparameterized as a logit model, which is a common model for estimating $\alpha$ with covariates.
The logit model introduces the parameter $\eta$, the logit-scale equivalent of $\alpha$.
\begin{align}
\begin{split}
X &\sim \text{Binom}(N, \alpha) \\
\text{logit}(\alpha) &= \eta
\end{split}
(\#eq:logit-model)
\end{align}
How do we put a prior on $\eta$ that represents diffuse information about $X$?
We follow a default instinct and give $\eta$ a "vague" prior with a large standard deviation: $\mathrm{Normal}(0, 10)$.
If we take both of these models and generate `r comma(n_bb)` prior simulations for $X \sim \mathrm{Binom}(N, \alpha)$, depicted as histograms in Figure \@ref(fig:beta-binom), the implied priors for $X$ do not resemble one another at all.
The first panel shows the implied prior for $X$ when $\alpha$ has a flat Beta prior, resulting in a distribution for $X$ that is also flat.
The middle panel shows the implied prior for $X$ when $\eta$ has a wide Normal prior, resulting in a prior that concentrates $X$ at very small and very large values.
This is because the $\mathrm{Normal}\left(0, 10\right)$ prior places most probability density on $\eta$ values that represent unreasonably small or large $\alpha$ values.
Only a narrow range of logit-scale values maps to probabilities that we routinely encounter in political science: logit values between $-3$ and $3$ correspond to probabilities between $`r plogis(-3) %>% round(3)`$ and $`r plogis(3) %>% round(3)`$.
In order to obtain a flat prior for $\alpha$ using a logit model, we would actually use the prior $\eta \sim \mathrm{Logistic}\left(0, 1\right)$, shown in the third panel of Figure \@ref(fig:beta-binom).^[
The standard Logistic prior creates a flat density on the probability scale because the logit model uses the cumulative Logistic distribution function to convert logit values to probabilities.
This same intuition holds for a probit model: a $\text{Normal}\left(0, 1\right)$ prior on the probit scale represents a flat prior on the probability scale.
]
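The footnote's probability-integral-transform logic is easy to verify numerically; here is a minimal check, independent of the figure code above:
```{r pit-sketch}
# Pushing Logistic(0, 1) draws through the inverse logit (the Logistic CDF)
# yields draws that are flat on (0, 1); the same holds for Normal(0, 1)
# draws pushed through the Normal CDF.
quantile(plogis(rlogis(1e5)), probs = c(.1, .5, .9))  # near 0.1, 0.5, 0.9
quantile(pnorm(rnorm(1e5)), probs = c(.1, .5, .9))    # near 0.1, 0.5, 0.9
```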
```{r beta-binom, include = TRUE, fig.width = 9, fig.height = 4, out.width = "100%", fig.scap = "How parameterization affects priors: binomial case.", fig.cap = "How parameterization affects priors. Transforming a model likelihood requires a commensurate transformation in the prior in order to produce the same data."}
```
What general lessons can we draw from these exercises?
It is a mistake to assume that the _shape_ of a prior represents its informativeness.
The relationship between shape and informativeness is contingent on the functions that map priors over parameters to implied priors over data.^[
Bayesian jargon would say that flat priors are not "invariant to reparameterization" of the likelihood.
Understanding invariant priors is an animating motive for so-called "objective Bayes" methods.
More on this in Section \@ref(sec:bayes-how-to).
]
These mismatches between prior shape and prior information can be exaggerated by nonlinear functions that compress and stretch probability mass from one space to the next.
In other words, the model matters.
This is a general principle of Bayesian model-building that is essential for understanding Bayesian causal inference: "the prior can often only be understood in the context of the likelihood" [@gelman-et-al:2017:prior-likelihood].
We should be prepared to encounter models where flat priors for parameters yield data with highly informative prior distributions.^[
This has affected Bayesian causal inference in political science already: for instance, @horiuchi-et-al:2007:experimental-design model treatment propensity with a probit model where coefficients are given Normal priors with variance of $100$.
]
We should also be prepared to encounter models where non-informative priors over data are achieved using non-flat priors for parameters.
As it relates to causal inference, this means that the prior that "lets the data speak" may not be the flattest prior.
### Priors and model parameterization {#sec:prior-parameterization}
Researchers have prior information about the _world_, but they must specify prior distributions in a model.
The parameter space of the model may not obviously represent the natural scale of prior information, making it challenging to specify priors.
We have also seen that transforming a parameter space can change the shape of a prior.
The contingency of priors on parameter spaces is initially inconvenient, but researchers can use it to their advantage when building a model for causal effects.
If a prior is challenging to specify in one parameter space, the researcher can (and should) reparameterize the problem in order to specify priors in a more convenient parameter space.
This section briefly discusses three points: what parameterization is, how it affects prior distributions, and how the researcher can use it to their advantage.
All Bayesian models feature a model of data, $\p{\mathbf{y} \mid \boldsymbol{\pi}}$.
Like all models, the data model could be written in multiple equivalent ways according to different parameterizations.
For example, the Binomial likelihood model above could be parameterized in terms of the probability parameter $\alpha$ or the log-odds parameter $\eta$.
Because these are equivalent parameterizations, the likelihood of any data point is identical whether we evaluate it at $\alpha$ or at $\eta = \text{logit}\left(\alpha\right)$.
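A trivial numerical check makes this equivalence concrete; this sketch uses arbitrary example values:
```{r likelihood-equivalence-sketch}
# The Binomial likelihood is unchanged whether we parameterize by the
# probability alpha or by the log-odds eta = logit(alpha)
alpha <- 0.3
eta <- qlogis(alpha)                      # logit(alpha)
dbinom(4, size = 10, prob = alpha)        # likelihood evaluated via alpha
dbinom(4, size = 10, prob = plogis(eta))  # identical value via eta
```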
These reparameterizations are ubiquitous in statistics and computing, and researchers often leverage them to expedite analytic or computational tasks.^[
A relatable example: OLS routines typically do not calculate $\hat{\beta} = \left(X^{\intercal}X\right)^{-1}X^{\intercal}Y$ directly.
Instead, they solve a reparameterization of the same problem (typically via the QR decomposition) that returns equivalent results but is more numerically stable to compute.
]
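To make the footnote's example concrete, here is a minimal sketch (with hypothetical simulated data) showing that the normal equations and the QR route return the same coefficients:
```{r qr-sketch}
# Least squares via the normal equations versus via the QR decomposition
# (the route base R's lm() takes internally)
set.seed(1)
X_toy <- cbind(1, rnorm(100))
y_toy <- X_toy %*% c(2, 3) + rnorm(100)
beta_normal_eq <- solve(t(X_toy) %*% X_toy) %*% t(X_toy) %*% y_toy
beta_qr <- qr.solve(X_toy, y_toy)
all.equal(as.numeric(beta_normal_eq), as.numeric(beta_qr))  # TRUE
```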
Parameterization is important for Bayesian statistics because it affects which parameters are available for prior specification.
Parameters that are difficult to understand in one parameterization may be easier to understand in an equivalent parameterization.
These parameterizations are not only opportunities for researchers to understand their models better; they can also reveal the consequences of particular prior choices, helping the researcher select priors that better represent the desired amount of prior information.
```{r diff-means-example}
# Simulate flat Uniform(0, 1) priors on two group means and the implied
# prior for their difference
means_data <-
tibble(
r = 1:100000
) %>%
mutate(
m1 = rbeta(n(), 1, 1),
diff = m1 - rbeta(n(), 1, 1)
) %>%
pivot_longer(
cols = -r,
names_to = "param",
values_to = "value"
) %>%
print()
```
```{r plot-diff-means-example}
ggplot(means_data) +
aes(x = value) +
facet_wrap(
~ fct_rev(param),
scales = "free_x",
labeller =
c("m1" = "Group means: flat on (0,1) interval",
"diff" = "Implied prior on difference in means"
) %>%
as_labeller()
) +
geom_histogram(
fill = primary,
boundary = 0,
binwidth = .02,
alpha = 0.9
) +
geom_blank(aes(y = 2400)) +
ggeasy::easy_remove_y_axis() +
labs(
title = "Prior Distribution for Difference in Means",
subtitle = "Histogram of prior draws",
y = NULL,
x = "",
caption = "(Note: horizontal axes not fixed across panels)"
)
```
Consider a randomized experiment with a binary outcome measure $y_{i}$ and a binary treatment assignment $z_{i}$.
Suppose that the causal estimand of interest is a difference in means, $\bar{y}_{z = 1} - \bar{y}_{z = 0}$, which is commonly estimated using a linear probability model.
We can parameterize the model in two equivalent ways.
First is a conventional regression setup,
\begin{align}
y_{i} &= \alpha + \beta z_{i} + \epsilon_{i}
(\#eq:diff-means-example)
\end{align}
where $\alpha$ is the control group mean, $\alpha + \beta$ is the treatment group mean, $\beta$ represents the difference in means, and $\epsilon_{i}$ is an error term for unit $i$.
I refer to this parameterization as the "treatment-effect" parameterization, because it contains the effect parameter $\beta$.
This parameterization is common even for experimental settings because it can be estimated using standard regression software.
This parameterization is unappealing for Bayesian inference, however, because it presents a challenge for prior specification.
It is difficult to specify a prior for $\beta$ such that the implied prior for the treatment group mean, $\alpha + \beta$, matches the prior for the control group mean, $\alpha$.
How do you set a prior on a difference in means, anyway?
The researcher can instead use parameterization to their advantage, setting up a model that estimates the mean for each experimental group directly.
Letting the mean in group $z$ be $\mu_{z}$, this parameterization would be
\begin{align}
y_{i} &= \mu_{z[i]} + \epsilon_{i}
(\#eq:separate-means-example)
\end{align}
where the difference in means is calculated as $\mu_{1} - \mu_{0}$.
I refer to this parameterization as the "difference-in-means" parameterization.
The difference-in-means parameterization is equivalent in the likelihood to the treatment-effect parameterization, but it is much simpler to place equivalent priors on each group mean when the model is parameterized directly in terms of those means.
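The likelihood equivalence of the two parameterizations is easy to see with ordinary least squares; this sketch, using hypothetical simulated data, shows that the difference in fitted group means equals the fitted slope:
```{r group-means-sketch}
# Treatment-effect versus group-means parameterization of the same model:
# the difference in fitted group means equals the slope coefficient
set.seed(123)
z_sim <- rbinom(200, 1, 0.5)
y_sim <- rbinom(200, 1, 0.4 + 0.2 * z_sim)
fit_effect <- lm(y_sim ~ z_sim)              # alpha + beta * z
fit_means <- lm(y_sim ~ 0 + factor(z_sim))   # mu_0 and mu_1 directly
unname(coef(fit_means)[2] - coef(fit_means)[1])  # equals coef(fit_effect)[["z_sim"]]
```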
This example is also enlightening because it highlights an instance where flat priors have unanticipated consequences.
Suppose that we use the difference-in-means parameterization and specify flat priors on each group mean: $\mu_{z} \sim \text{Uniform}(0, 1)$.
The left panel of Figure \@ref(fig:plot-diff-means-example) shows a histogram of prior simulations for the group mean, which is simply flat on the $[0, 1]$ interval.
If we specify flat priors on both group means, what is the implied prior for the difference in means?
The right panel of Figure \@ref(fig:plot-diff-means-example) shows that the difference in means will actually have a triangular prior distribution with a mode at $0$.
The prior is still non-informative (it results from flat priors on each group mean), but a non-informative prior is not always a flat prior.
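The triangular shape also has a simple analytic justification: the difference $\delta = \mu_{1} - \mu_{0}$ of two independent $\mathrm{Uniform}(0, 1)$ variables has a density given by the convolution of the two flat densities,
\begin{align*}
p(\delta) = \int_{0}^{1} p_{\mu_{1}}(\mu) \, p_{\mu_{0}}(\mu - \delta) \diff \mu = 1 - |\delta|, \qquad \delta \in [-1, 1],
\end{align*}
which peaks at $\delta = 0$ and falls linearly toward $\pm 1$.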
This example is important because it highlights how a researcher's default tendencies—the treatment-effect parameterization and an impulse toward flat priors—can be incompatible when we actually examine the consequences of these modeling choices.
A flat prior on the treatment effect creates non-equivalent priors for the group means, but equivalent (flat) priors on the group means create a non-flat prior on the treatment effect.
When a researcher confronts a modeling scenario, it is not enough to simply assume that a flat prior will return an optimal "data-driven" result, because the actual informativeness of a flat prior depends on the parameterization of the model and the functions being calculated with the model parameters.
```{r plot-diff-means-example, include = TRUE, fig.height = 5, fig.width = 8, out.width = "100%", fig.scap = "Prior distribution for a difference in means.", fig.cap = "Prior distributions for a difference in means. If group means have a flat prior (left), the difference in means has a triangular prior with mode at zero (right). Note that $x$-axes are not fixed across panels."}
```
<!-- kill data -->
```{r}
rm(means_data)
```
### Principled and pragmatic approaches to prior specification {#sec:bayes-how-to}
How do we construct sensible priors if prior flatness is not always sensible and sensible priors are not always flat?
This section offers productive guidance for specifying priors and discusses the appropriateness of different prior strategies for causal inference applications.
I emphasize the use of "weakly informative" priors and discuss some heuristic rules that can guide prior choices in many scenarios.
A Bayesian approach to causal inference does not mean using magical priors to somehow de-confound a treatment variable.
Causal identification is a matter of research design, and simply asserting a prior belief that the treatment assignment is ignorable would be a misunderstanding of the role of priors.
Priors provide structure for a model to learn from data, so data are still of paramount importance.
It is helpful to imagine different approaches to prior specification as lying on a spectrum from least informative to most informative.
The least informative priors might be regarded as "nuisance" priors that are specified for no other purpose than to express uncertainty with a posterior distribution [@gelman-et-al:2017:prior-likelihood].
We have already discussed how prior flatness is a misleading heuristic for choosing a non-informative prior.
Statisticians in the "objective Bayes" tradition have developed several rule-based strategies for specifying non-informative priors without researcher intervention [@kass-wasserman:1996:formal-priors], most notably Jeffreys priors and reference priors.
The rules that determine these priors have complex information-theoretic justifications—
Jeffreys priors [@jeffreys:1946:invariant-prior; @jeffreys:1998:theory-of] are defined in relation to the Fisher information matrix with the goal of extracting as much unobstructed information from the data as possible;
reference priors [@bernardo:1979:reference-prior] maximize the expected KL divergence of the posterior distribution from the prior, i.e., they select the prior that maximizes the "amount learned" from the data.
An important property of these approaches is that they are invariant to the model parameterization—if model $1$ and model $2$ are equivalent reparameterizations of one another, then the objective prior for model $1$ yields the same posterior distribution as the objective prior for model $2$.
Objective priors are principled and general representations of prior ignorance, which makes them superior to flat priors for this purpose.
Objective priors may still be undesirable because, like flat priors, they may still misrepresent the amount of information that the researcher has about model parameters, and they can imply priors for data that the researcher finds objectionable.^[
For instance, the Jeffreys prior for a Binomial probability is $\mathrm{Beta}\left(1/2, 1/2\right)$, which produces data with an implied prior similar to the middle panel of Figure \@ref(fig:beta-binom) above.
]
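The footnote's claim is quick to check by simulation; here is a minimal sketch with a hypothetical $N = 100$:
```{r jeffreys-sketch}
# Implied prior over X under the Binomial Jeffreys prior, Beta(1/2, 1/2):
# more prior mass near 0 and N than the flat Beta(1, 1) implies
alpha_jeffreys <- rbeta(1e4, 0.5, 0.5)
x_jeffreys <- rbinom(1e4, size = 100, prob = alpha_jeffreys)
mean(x_jeffreys <= 5 | x_jeffreys >= 95)  # roughly 0.29; Beta(1, 1) gives roughly 0.12
```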
On the other end of the spectrum are informative priors.
These priors are more often used to represent substantive beliefs or specific information about model parameters.
Fully informative priors concentrate probability mass in narrower regions of the parameter space, excluding other regions that remain plausible.
These priors are commonly used for regularizing estimates and stabilizing weakly- or non-identified quantities, which are more common in measurement or predictive modeling than in inferential modeling.
Fully informative priors may be undesirable for causal inference because the bias–variance trade-off may be too great, a situation that @hahn-et-al:2018:regularization-confounding describe as "regularization-induced confounding."
Regularization-induced confounding occurs when regularization shrinks the adjustments for confounders (for example, the coefficients on control variables), under-adjusting for confounding in a way that can severely bias the treatment effect estimate even if all confounders are observed.^[
In high-dimensional causal inference problems, regularization is necessary to estimate sparse signals and prevent overfitting [@samii-et-al:2016:retrospective-causal-inference-ML].
In these settings, regularization-induced confounding can be ameliorated by modeling the treatment separately and estimating the treatment effect by residualizing the observed treatment against the predicted treatment.
This routine, sometimes called "Neyman orthogonalization," facilitates unbiased causal inference in the presence of strong, high-dimensional confounding [@chernozhukov-et-al:2018:double-ML; @hahn-et-al:2018:regularization-confounding; @hahn-et-al:2020:bayesian-causal-forests; @ratkovic:2019:rehabilitating-regression].
]
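The residualization idea in the footnote can be illustrated in its simplest linear (Frisch–Waugh–Lovell) form; this sketch uses hypothetical simulated data and ordinary least squares, not the full double-ML machinery:
```{r orthogonalization-sketch}
# Regress the outcome and the treatment on the confounder, then regress
# residuals on residuals to recover the treatment effect (true value: 1.5)
set.seed(42)
x_sim <- rnorm(500)                            # confounder
z_sim <- 0.8 * x_sim + rnorm(500)              # treatment depends on confounder
y_sim <- 1.5 * z_sim + 2 * x_sim + rnorm(500)  # outcome
y_res <- resid(lm(y_sim ~ x_sim))
z_res <- resid(lm(z_sim ~ x_sim))
coef(lm(y_res ~ z_res))[["z_res"]]             # close to 1.5
```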
In between non-informative and fully-informative priors is a region where Bayesian approaches to causal inference will be both sensible and efficacious.
Bayesian researchers refer to this neighborhood as "weakly informative priors" [@gelman-et-al:2008:weak-logit; @gelman-et-al:2017:prior-likelihood].
Weakly informative priors provide some regularization of parameter estimates but still allow the model to be surprised by data.
Weak information is more commonly thought of as "downweighting" unreasonable parameter values rather than "upweighting" the researcher's subjective beliefs.
Weak information can take many forms, but I highlight four sources of weak information that will always be available to the researcher: structural information about data and parameters, the likelihood model itself, the number of predictor variables, and the tail behavior of a log prior density.
One source of weak information is "structural" information about the mathematical properties of model constructs [@gelman-et-al:2017:prior-likelihood].
For example, probabilities are bounded in the $[0, 1]$ interval, correlations in $[-1, 1]$, and variances in $[0, \infty)$.
These constraints sound trivial, but they are often consequential.
Linear probability models (LPMs), for example, are often preferred over logistic models in experimental settings because of their similarity to a difference-in-means $t$-test, but LPM estimates can sometimes escape the $[0, 1]$ interval.
Even if a point estimate doesn't escape its structural bounds, structural priors can improve the precision of an estimate by removing invalid parameter values from the posterior distribution.
I present such an example below in Section \@ref(sec:rdd).
A related source of weak information is the likelihood model itself.
If we know the scale of the outcome data (and we ordinarily do), the likelihood provides prior information by defining the data's functional dependence on parameters.
That was an important feature of the prior for the IRT model in Chapter \@ref(ch:model): knowing that the probit model maps quantile values to probabilities using the Normal CDF provides a lot of information about which quantile values are plausible to obtain in the model.
This principle generalizes to other models with other link functions, including linear models with an identity link.
```{r scale-p}
# One coefficient (p1) versus the sum of five (p5), each coefficient given a
# standard Normal prior: the prior spread of the sum grows with the number
# of coefficients (see the discussion below)
tibble(
r = 1:10000
) %>%
mutate(
p1 = rnorm(n()),
p5 = rnorm(n()) + rnorm(n()) + rnorm(n()) + rnorm(n()) + rnorm(n())
) %>%
pivot_longer(
cols = c(p1, p5),
names_to = "dim",
values_to = "sim"
) %>%
ggplot(aes(x = sim)) +
facet_wrap(~ dim, nrow = 1) +
geom_histogram(bins = 100)
```
A third source of weak information is the number of predictors in a model.
Bayesian statisticians generally give regression coefficients tighter priors as the number of coefficients increases [@simpson-et-al:2017:PC-prior].
To understand the intuition for this decision, recall that the mathematical structure of a regression is a weighted sum of the predictors.
As the number of predictors increases, the weight on each predictor ought to decrease on average.
If a model includes additional predictors without adjusting the prior, the regression function's prior distribution will grow in variance because each coefficient adds another random variable to the regression function.
Scaling the prior with the number of predictors counteracts this inflation of prior variance; with $K$ standardized predictors, for example, giving each coefficient an independent $\mathrm{Normal}\left(0, \sigma / \sqrt{K}\right)$ prior holds the prior variance of the linear predictor at roughly $\sigma^2$ as $K$ grows.
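A short simulation confirms the arithmetic; this is a minimal sketch assuming independent coefficients:
```{r prior-scaling-sketch}
# Prior spread of a sum of K coefficients: with unscaled Normal(0, 1) priors
# the sum's prior sd grows like sqrt(K); scaling each sd by 1 / sqrt(K)
# holds the sum's prior sd near 1
K <- 25
sum_unscaled <- replicate(1e4, sum(rnorm(K, sd = 1)))
sum_scaled <- replicate(1e4, sum(rnorm(K, sd = 1 / sqrt(K))))
sd(sum_unscaled)  # roughly sqrt(25) = 5
sd(sum_scaled)    # roughly 1
```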
```{r reg-priors}
# Densities and log densities for four common mean-zero prior families,
# evaluated over a shared grid
df_t <- 5
reg_priors <- tibble(
x = seq(-5, 5, .01),
Laplace = extraDistr::dlaplace(x, mu = 0, sigma = 1),
Cauchy = dcauchy(x, 0, 1),
`T` = dt(x, df = df_t),
Normal = dnorm(x, 0, 1)
) %>%
pivot_longer(
cols = -x,
names_to = "family",
values_to = "Density"
) %>%
mutate(
`Log Density` = log(Density),
family = case_when(
family == "T" ~ str_glue("T(df = {df_t})"),
TRUE ~ family
),
family = fct_relevel(family, "Normal", str_glue("T(df = {df_t})"), "Cauchy", "Laplace")
) %>%
print()
```
```{r gg-reg-priors}
plot_dens <- ggplot(reg_priors) +
aes(x = x, y = Density) +
facet_wrap(~ family, nrow = 1) +
geom_line() +
labs(x = NULL)
plot_log_dens <- ggplot(reg_priors) +
aes(x = x, y = `Log Density`) +
facet_wrap(~ family, nrow = 1) +
geom_line() +
labs(x = NULL) +
coord_cartesian(ylim = c(-4.5, -0.5))
```
```{r plot-reg-priors}
(plot_dens / plot_log_dens) +
plot_annotation(
title = "Comparison of Mean-Zero Priors",
subtitle = "Regularization properties of the log density"
)
```
One final tactic for specifying weak priors is to have foreknowledge of the tail behavior of certain families of prior distributions.
All priors will regularize estimates, but some priors regularize more aggressively than others even if they look similar.
Knowing a prior's tail behavior can guide which prior to choose.
Tail behavior is especially important when budgeting for the possibility that the data contradict the prior.
Prior distributions with heavy tails do not regularize extreme estimates as strongly, allowing the data to overcome the prior more easily.
Priors with thinner tails decay more rapidly and regularize estimates much more strongly in the tails.
The tail behavior of a selection of prior distributions is highlighted in Figure \@ref(fig:plot-reg-priors).
The figure compares the density and log density of Normal, T (`r df_t` degrees of freedom), and Cauchy distributions, along with a Laplace distribution that serves as a conceptual stand-in for a sparsity-inducing prior.
The log scale is helpful for visualizing the impact of a prior's "penalty" on an estimate, where lower values of log density indicate a greater prior penalty on that parameter value.
The log scale representation of the prior highlights how prior densities that look similar to one another—such as the Normal, T5, and Cauchy—can differ dramatically in their practical performance.
Normal distributions have a quadratic log density, which is a Bayesian analog to the "L2" ridge penalty^[
Indeed, the maximum _a posteriori_ estimate from a Normal prior is equivalent to the maximum likelihood estimate using an L2 norm penalty [@bishop:2006:pattern-rec].
]
and regularizes more strongly than the gentler T5 and Cauchy priors.
The Laplace distribution regularizes more strongly than the other priors near the mode because its log density has a kink at zero rather than gently approaching its maximum.
Despite this aggressive behavior near the mode, the Laplace prior does not regularize as strongly as the Normal prior in its tails, which is why the Laplace prior is often chosen as a prior for sparse regression.^[
Like the analogy between the Normal prior and L2 regularization, the MAP estimate under a Laplace prior is equivalent to the maximum likelihood estimate using L1 (absolute-value norm) regularization [@park-casella:2008:bayesian-lasso; @bishop:2006:pattern-rec].
More work on sparsity-promoting priors finds that horseshoe priors have more flexibility to control sparsity near zero and non-regularized signal detection farther from zero [@carvalho-et-al:2010:horseshoe-prior; @piironen-vehtari:2017:horseshoe-hyperprior; @piironen-vehtari:2017:horseshoe-sparse-vs-reg].
]
Comparing prior _log_ densities in this way is generally helpful for deciding between prior families based on their regularizing behaviors.
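The penalty interpretation can also be read off numerically: the drop in log density between a typical value and a tail value is the extra penalty the prior charges for the tail value.
```{r tail-penalty-sketch}
# Log-density drop between x = 0 and x = 4 for two of the priors in the
# figure: the Normal charges a much steeper tail penalty than the Cauchy
dnorm(0, log = TRUE) - dnorm(4, log = TRUE)      # 8 (quadratic penalty)
dcauchy(0, log = TRUE) - dcauchy(4, log = TRUE)  # about 2.8 (gentle penalty)
```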
```{r plot-reg-priors, include = TRUE, out.width = "100%", fig.height = 7, fig.width = 10, fig.scap = "Density and log density for common mean-zero distributions.", fig.cap = "Density and log density for common mean-zero distributions. Log densities highlight the different tail behaviors of similar-looking distributions."}
```
If a researcher is ever in doubt about the consequences of their prior choices, a principled Bayesian workflow contains several important model-checking tools [@gabry-et-al:2019:visualization; @betancourt:2018:workflow-blog].
Most important are prior predictive simulation and model-checking with fake data.
Prior predictive simulation (sometimes called prior predictive "checking" or prior "pushforward" simulation) consists of drawing a sample of parameters from the joint prior and using those parameters to simulate data.
This produces a prior predictive distribution for the data, $\int \p{\mathbf{y} \mid \boldsymbol{\pi}}\p{\boldsymbol{\pi}}\diff \boldsymbol{\pi}$.
The distribution of simulated data should be only weakly informative—the draws are concentrated near the region of possible outcome values, but the distribution should be broader than the marginal distribution of the observed data.
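For the Binomial model from earlier in this section, prior predictive simulation takes two lines; here is a minimal sketch with a hypothetical $N = 100$:
```{r prior-predictive-sketch}
# Draw parameters from the prior, then push each draw through the
# likelihood to obtain prior predictive draws of the data
alpha_draws <- rbeta(1e4, 1, 1)                              # prior draws
x_prior_pred <- rbinom(1e4, size = 100, prob = alpha_draws)  # simulated data
```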
Different features of a model can be stress-tested by fitting the model with simulated data.
Fitting the model to fake data is helpful for exposing and correcting undesirable features of a model while avoiding unnecessary "looks" at the real data, which compromise both Bayesian coherence and the validity of frequentist $p$-values.^[