
```{r}
#| echo: false
#| message: false
#| warning: false
options(digits = 4)
library(lme4)
library(lmerTest)
library(pander)
library(ggplot2)
library(optmatch)
library(RItools)
```

## Outline
1. Review of propensity scores and matching
2. Separating PS and matching
3. Introduction to `optmatch`
4. Optimal and full matching
5. Speeding up matching
6. Final example

# Review of Matching

## Overlap & Balance #1

- Overlap: Are variables observed over the same range for both groups?

```{r}
#| echo: false
#| fig-align: center
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100, 3)
d <- data.frame(x = c(x1, x2), group = factor(rep(letters[1:2], each = 100)))
ggplot(d, aes(x, fill = group)) +
  geom_density(alpha = 0.2) +
  theme(text = element_text(size = 20))
```

## Overlap & Balance #2

- Balance: Are the distributions of variables the same for both groups?

```{r}
#| echo: false
#| fig-align: center
x1 <- rbeta(100, 2, 1)
x2 <- rbeta(100, 1, 2)
d <- data.frame(x = c(x1, x2), group = factor(rep(letters[1:2], each = 100)))
ggplot(d, aes(x, fill = group)) +
  geom_density(alpha = 0.2) +
  theme(text = element_text(size = 20))
```

## Overlap & Balance #3

- Matching is very good at addressing *partial* overlap.
- Propensity score matching is good at addressing imbalance in theory.
  - In practice, it may not be.
- Weighting or regression may be better when imbalance is the only problem.

## Overlap & Balance #4

- Overlap is easy to observe and quantify, and comes for "free" with matching,
  so it is usually not a focus.
- Balance is moderately hard to observe and quantify (especially in higher
  dimensions) and will be the focus of matching.
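
One common way to put a number on balance for a single covariate is the standardized mean difference. The sketch below uses simulated data, not data from these slides; `std_diff()` is a hypothetical helper, and the pooled-SD denominator is only one of several conventions.

```r
# Hypothetical helper: standardized mean difference of x between two groups.
std_diff <- function(x, group) {
  g <- split(x, group)                                # g[[1]] = group 0, g[[2]] = group 1
  pooled_sd <- sqrt((var(g[[1]]) + var(g[[2]])) / 2)  # assumed convention
  (mean(g[[2]]) - mean(g[[1]])) / pooled_sd
}

set.seed(1)
group <- rep(c(0, 1), each = 100)
x_balanced   <- rnorm(200)                # same distribution in both groups
x_imbalanced <- rnorm(200, mean = group)  # group 1 shifted up by 1

std_diff(x_balanced, group)    # should be near 0
std_diff(x_imbalanced, group)  # should be near 1
```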

## Love plot

```{r}
#| echo: false
#| message: false
data(nuclearplants)
plot(balanceTest(pr ~ cost + t1 + t2 + ne + ct + bw + cum.n + pt,
                 data = nuclearplants)) +
  theme(legend.position = "none", text = element_text(size = 20)) +
  geom_point(size=5)
```

## Distance

- How "far away" are a pair of observations?
    - Easy for 1 dimension, less clear for more dimensions.
- Can define distance however you want
- Common choices:
    - Euclidean
    - Mahalanobis
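
A small sketch of how the two common choices differ, using made-up values. Euclidean distance is dominated by whichever covariate has the largest scale, while Mahalanobis distance rescales by the covariance of the covariates.

```r
# Made-up data: two covariates on very different scales.
X <- cbind(age    = c(30, 35, 40, 45, 50),
           income = c(20000, 52000, 41000, 88000, 60000))
a <- X[1, ]
b <- X[2, ]

# Euclidean distance: almost entirely driven by income.
euclid <- sqrt(sum((a - b)^2))

# Mahalanobis distance: rescaled by the covariance matrix of the covariates.
# stats::mahalanobis() returns the *squared* distance, hence the sqrt().
mahal <- sqrt(mahalanobis(a, center = b, cov = cov(X)))

c(euclidean = euclid, mahalanobis = mahal)
```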

## Measuring match quality

- Formal goal: Balance
    - Harder to define
    - Is often multi-dimensional
- Instead, measure total distance within a matched set
- Goal: Minimize distance in matched sets
- Strong correlation between balance and distance

## Propensity Score Matching {.smaller}

- Matching on more than one covariate is challenging.

$$
    ps_i(\mathbf{X}_i) = P(Z_i = 1 \mid \mathbf{X}_i)
$$

- Probability of an observation being in group 1 (usually treatment) given their
  characteristics.
- Propensity scores can be thought of as dimension reduction.
- Scores are usually estimated by logistic regression predicting group
  membership.
    - "Kitchen sink"
    - Potentially exclude variables associated with treatment but **not**
      outcome. [@austin2011introduction, pp. 414-415]
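
A minimal sketch of estimating propensity scores by logistic regression. The data are simulated and the variable names (`x1`, `x2`, `z`) are placeholders, not variables from the later example.

```r
set.seed(42)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# Simulated group membership that depends on both covariates.
dat$z <- rbinom(n, 1, plogis(-0.5 + 0.8 * dat$x1 - 0.4 * dat$x2))

# Logistic regression predicting group membership from the covariates.
ps_model <- glm(z ~ x1 + x2, data = dat, family = binomial)

# Estimated propensity score: fitted probability of being in group 1.
dat$ps <- predict(ps_model, type = "response")
summary(dat$ps)
```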

## Main Benefit of PS matching

- If we can ...
    1. Observe the true propensity score,
    2. Have an infinitely large sample, and
    3. Pair observations that have identical propensity scores,
- ... then we get covariate balance on observed *and unobserved* covariates!
- Of course, these conditions never hold exactly, so balance is not perfect,
  but it is better than nothing.

# Separating Matching and Propensity Scores

## Matching without PS

- With a small number of covariates relative to the sample size, and good
  overlap on those covariates, we can match on them directly.
- Treat PS as "just another" variable.

## PS without Matching

- Weighting
- Subsetting
- Stratification
- Treating PS as a predictor in regression
- ...
- Each has its own pros and cons.
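
Two of these alternatives, weighting and stratification, sketched on simulated data. The inverse-probability weights and the choice of five strata are illustrative assumptions, not recommendations from these slides.

```r
set.seed(7)
n <- 300
x <- rnorm(n)
z <- rbinom(n, 1, plogis(0.7 * x))
ps <- fitted(glm(z ~ x, family = binomial))

# Weighting: inverse probability of the group actually observed.
w <- ifelse(z == 1, 1 / ps, 1 / (1 - ps))

# Stratification: five strata cut at the quintiles of the propensity score.
stratum <- cut(ps, breaks = quantile(ps, probs = seq(0, 1, 0.2)),
               include.lowest = TRUE, labels = FALSE)

# Compare the raw and weighted between-group gaps in the mean of x.
raw_gap <- mean(x[z == 1]) - mean(x[z == 0])
wtd_gap <- weighted.mean(x[z == 1], w[z == 1]) -
  weighted.mean(x[z == 0], w[z == 0])
c(raw = raw_gap, weighted = wtd_gap)
table(stratum, z)
```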

# Introduction to `optmatch` and `RItools`

## `optmatch`

::: {style="font-size: 70%;"}

[@hansen2006optimal; @optmatch]

:::

- Package for optimal full matching (both terms to be defined).
- Simplest form:

```{r}
#| echo: false
d <- data.frame(group = c(0,0,0,0,0,0,0,0,1,1,1),
                x1 = c(2,4,3,5,6,3,4,1,0,1,3),
                x2 = c(3,2,1,2,3,1,2,4,6,4,7))
op <- options()
options(str = strOptions(vec.len = 6))
```

```{r}
#| message: false
#install.packages("optmatch")
library(optmatch)
str(d)
d$match <- pairmatch(group ~ x1 + x2, data = d)
```

```{r}
#| echo: false
options(op)
```


----

- Generates a `factor` variable identifying match membership, or `NA`.

```{r}
d$match
summary(d$match)
```

----


- Can be used in analysis, for example (not executed):

```{r}
#| eval: false
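# Fixed effect for each matched set
lm(y ~ group + x1 + x2 + match, data = d)
# Drop unmatched observations, then ignore the matched-set structure
lm(y ~ group + x1 + x2, data = d[!is.na(d$match), ])
# Random intercept for each matched set
lmer(y ~ group + x1 + x2 + (1|match), data = d)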
```

- Note that all of the above still used `x1` and `x2` *even though we matched on
  them*.
    - Balance vs overlap
    - Doubly robust

## `RItools`

::: {style="font-size: 70%;"}

[@RItools]

:::


- A collection of useful functions for randomization inference.
- Most useful here for its `balanceTest` function.
    - It tests balance, and `plot`ting its result produces a Love plot.

----

```{r}
#| message: false
#install.packages("RItools")
library(RItools)
baltest <- balanceTest(group ~ x1 + x2, data = d)
baltest
```

- Null hypothesis is balance - we *don't* want to reject!

----

```{r}
#| eval: false
plot(baltest)
```

```{r}
#| echo: false
#| message: false
#| fig-align: center
plot(baltest) +
  theme(text = element_text(size=20), legend.position = "none") +
  geom_point(size = 5)
```

----

- Now compare the unmatched data to the matched data.

```{r}
#| code-overflow: scroll
baltest2 <- balanceTest(group ~ x1 + x2 + strata(match), data = d)
print(baltest2, horizontal = FALSE)
```

----

```{r}
#| eval: false
plot(baltest2)
```

```{r}
#| echo: false
#| message: false
#| fig-align: center
plot(baltest2) + geom_point(size = 5) +
  theme(legend.position = "bottom", text = element_text(size=20))
```

# Optimal and Full matching

## j:k notation

- Pair-matched data is 1:1 - a single treatment member to a single control.
- 1:2 means each treatment member is matched to two controls.
- 3:1 means three treatment members share a single control.
- The optimal match will have all matched sets of size j:k where either j
  = 1 or k = 1.

## Optimal matching

- Greedy matching matches well on the first few observations but can suffer
  later on; accepting a sub-optimal early match may improve later matches.
    - The result depends on data order: shuffle the data and the match can
      change.
- An optimal match is equivalent to considering **all** possible matching
  structures.
- Optimal matching is a harder problem, but its total distance is never worse
  than a greedy match's.
- The `optmatch` package uses optimal matching.
- Distrust any method which uses greedy matching!
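
A toy comparison of the two approaches, using made-up values and absolute difference as the distance. The optimal match is found by brute force over both possible pairings, which only works because the problem is tiny; it is not meant to reflect how `optmatch` solves the problem internally.

```r
treated  <- c(4, 2)
controls <- c(2.5, 7)
dmat <- abs(outer(treated, controls, "-"))  # rows = treated, cols = controls

# Greedy: each treated unit, in data order, takes its closest unused control.
available <- seq_along(controls)
greedy <- integer(length(treated))
for (i in seq_along(treated)) {
  pick <- available[which.min(dmat[i, available])]
  greedy[i] <- pick
  available <- setdiff(available, pick)
}
greedy_total <- sum(dmat[cbind(seq_along(treated), greedy)])

# Optimal: evaluate every possible pairing and keep the smallest total.
pairings <- list(c(1, 2), c(2, 1))
totals <- vapply(pairings,
                 function(p) sum(dmat[cbind(seq_along(treated), p)]),
                 numeric(1))
optimal_total <- min(totals)

c(greedy = greedy_total, optimal = optimal_total)  # 6.5 vs 3.5
```

Greedy gives the first treated unit (4) its closest control (2.5), leaving 7 for the second treated unit (2). The optimal match accepts a worse first pair to get a much better second one.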

## 1:k or j:1 matching

- Optimal pair matching is useful if the group sizes are close - easy to
  interpret.
- Not good if one group is much larger.
    - Loses data.
- 1:k or j:1 matching loses less data.

## 1:k or j:1 matching in `optmatch`

- `controls` argument to `pairmatch` - # of controls per treated unit
    - 2, 3, 4, etc for 1:k.
    - 1/2, 1/3, 1/4, etc for j:1.

```{r}
d$match2 <- pairmatch(group ~ x1 + x2,
                      controls = 2, data = d)
d$match2
summary(d$match2)
```

----

```{r}
baltest3 <- balanceTest(group ~ x1 + x2 + strata(match2) +
                          strata(match),
                     data = d)
print(baltest3, horizontal = FALSE)
```

----

```{r}
#| eval: false
plot(baltest3)
```
```{r}
#| echo: false
#| message: false
#| fig-align: center
plot(baltest3) + geom_point(size = 5) +
  theme(legend.position = "bottom", text = element_text(size=20))
```

## Impossible matches fail

Restrictions that cannot be met produce an error. Here 1:3 matching would need
9 controls for the 3 treated units, but `d` has only 8:

```{r}
#| error: true
pairmatch(group ~ x1 + x2, data = d, controls = 3)
```

## Issues with 1:k or j:1 matching

- Even with 1:k, can still lose $n_c - k \cdot n_t$ controls.
    - E.g., 10 treatment, 67 controls, 1:6 matching loses
      $67 - 6 \cdot 10 = 7$ controls.
- Can force bad matches just to meet the goal. Flexibility might help.

## An example

:::: {.columns}

::: {.column width="50%"}
Treatment | Control
:--------:|:------:
5         |  4
10        |  5
          || 6
          || 11
:::

::: {.column width="50%"}
:::

::::

## An example

:::: {.columns}

::: {.column width="50%"}
Treatment | Control
:--------:|:------:
5         |  4
10        |  5
          || 6
          || 11
:::

::: {.column width="50%"}
1:2 matching:

Treatment | Control
:--------:|:------:
5         |  4, 5
10        |  6, 11

- 6 is a poor match for 10.
:::

::::

## An example

:::: {.columns}

::: {.column width="50%"}
Treatment | Control
:--------:|:------:
5         |  4
10        |  5
          || 6
          || 11
:::

::: {.column width="50%"}
Flexible matched set sizes:

Treatment | Control  | Size
:--------:|:--------:|:---:
5         |  4, 5, 6 | 1:3
10        |  11      | 1:1
:::

::::
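
Taking absolute difference as the distance (an assumption; the example does not name one), the total within-set distance of the two structures is:

$$
\begin{aligned}
\text{1:2 matching:} \quad & |5-4| + |5-5| + |10-6| + |10-11| = 1 + 0 + 4 + 1 = 6 \\
\text{Flexible sizes:} \quad & |5-4| + |5-5| + |5-6| + |10-11| = 1 + 0 + 1 + 1 = 3
\end{aligned}
$$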


## Full matching

- Allow 1:k or j:1 where k or j can vary per matched set.

```{r}
d$fullmatch1 <- fullmatch(group ~ x1 + x2, data = d)
summary(d$fullmatch1)
```

- Guaranteed to find the best possible match (smallest total distance).
- The best match may not be useful, e.g. sets of size 99:1 and 1:99.
    - Very lopsided sets also lower power (effective sample size, ESS).
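
A sketch of why lopsided sets cost power, assuming ESS is defined as the sum over matched sets of the harmonic mean of the treated and control counts. `ess_harmonic()` is a hypothetical helper; see `?effectiveSampleSize` for the definition `optmatch` actually uses.

```r
# Hypothetical helper: sum over matched sets of the harmonic mean of the
# number of treated (n_t) and control (n_c) units in each set.
ess_harmonic <- function(n_t, n_c) sum(2 * n_t * n_c / (n_t + n_c))

# 100 treated and 100 controls as 100 pairs (1:1):
ess_pairs <- ess_harmonic(rep(1, 100), rep(1, 100))  # 100

# The same 200 units as one 99:1 set and one 1:99 set:
ess_lopsided <- ess_harmonic(c(99, 1), c(1, 99))     # 3.96

c(pairs = ess_pairs, lopsided = ess_lopsided)
```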

----

- Instead we set constraints, e.g. all sets between 2:1 and 1:3.

```{r}
d$fullmatch2 <- fullmatch(group ~ x1 + x2,
                          max.controls = 3,
                          min.controls = 1/2,
                          data = d)
summary(d$fullmatch2)
```

----

```{r}
baltest4 <- balanceTest(group ~ x1 + x2 + strata(fullmatch2) +
                    strata(fullmatch1) + strata(match2) +
                       strata(match), data = d)
print(baltest4, horizontal = FALSE)
```

----

```{r}
#| eval: false
plot(baltest4)
```
```{r}
#| echo: false
#| message: false
#| fig-align: center
plot(baltest4) + geom_point(size = 5) +
  theme(legend.position = "bottom", text = element_text(size=20))
```

# Speeding up matching

## Distance Matrix

<center>
```{r}
#| echo: false
#| panel: center
dist <- matrix(c(1, 4, 0, Inf, 2,
                 0, 5, 1, 5, Inf,
                 6, 1, 4, 4, 3), nrow = 3, byrow = FALSE)
rownames(dist) <- c("t1", "t2", "t3")
colnames(dist) <- c("c1", "c2", "c3", "c4", "c5")
dist
```
</center>

- Each entry is the distance between one treatment member (row) and one
  control (column).
- `0` means "identical"; `Inf` means the pair may never be matched.
- The more `Inf` entries, the faster matching will run.

----

- Built into `optmatch` as `match_on`, with the same formula notation as
  `pairmatch`.

```{r}
#| results: "hide"
m1 <- match_on(group ~ x1 + x2, data = d)
m1
```

```{r}
#| echo: false
as.matrix(m1)
```

- How do we induce `Inf` into the distance matrix?

## Exact Matching

```{r}
#| echo: false
d <- d[,1:3]
d$category <- c(0,1,1,0,0,1,0,1,1,0,1)
op <- options()
options(str = strOptions(vec.len = 6))
```

```{r}
str(d)
```

```{r}
#| echo: false
options(op)
```

```{r}
em <- exactMatch(group ~ category, data = d)
as.matrix(em)
```

- More useful than just adding a lot of `Inf`'s...

----

- ... because each sub-problem can be considered its own matching problem.

```{r}
em
```

----

- We can combine distance matrices

```{r}
m1 + em
```

- This works for distance matrices of any format (from `match_on`, `exactMatch`
  or `caliper` [which we'll see in a few slides]).

----

- Can be done in a single step as well:

```{r}
m2 <- match_on(group ~ x1 + x2, data = d, within = em)
m2
```

- This should be slightly faster than two separate calls.

----

- Now, perform matching directly on this distance.

```{r}
emmatch <- fullmatch(m2, data = d)
summary(emmatch)
emmatch
```

## Calipering

- For any pairs with large distances, we may want to either
    - Ensure they never match.
        - (Overall match may suffer, but no individual match is terrible.)
    - Speed up calculations by not checking them.

----

```{r}
as.matrix(m1)
c1 <- caliper(m1, width = 4)
c1
m1 + c1
```

## Calipering one dimension

- Sometimes you might want to caliper only one dimension rather than overall
  distance - e.g. only `x1`, not the combination of `x1` and `x2`.
- Two-step process:
    1. Generate a distance matrix for `x1` and caliper it (creating a matrix of
       `0`'s and `Inf`'s).
    2. Generate the distance matrix for `x1` and `x2` and add it to the
       caliper'd matrix from step 1.
- Slightly different result than calipering overall distance.
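
The two steps, sketched in base R on made-up data so the arithmetic is visible. The Euclidean distance and the caliper width of 1 are arbitrary choices for illustration.

```r
# Made-up data: 2 treated units, 3 controls.
trt  <- data.frame(x1 = c(1, 3), x2 = c(4, 7))
ctrl <- data.frame(x1 = c(2, 4, 1), x2 = c(3, 2, 4))

# Step 1: distance on x1 only, calipered to 0 (allowed) or Inf (forbidden).
d_x1   <- abs(outer(trt$x1, ctrl$x1, "-"))
cal_x1 <- ifelse(d_x1 <= 1, 0, Inf)

# Step 2: distance on x1 and x2 together, plus the caliper from step 1.
d_all   <- sqrt(outer(trt$x1, ctrl$x1, "-")^2 + outer(trt$x2, ctrl$x2, "-")^2)
d_final <- d_all + cal_x1
d_final
```

With `optmatch` objects the same idea would be written roughly as `match_on(group ~ x1 + x2, data = d) + caliper(match_on(group ~ x1, data = d), width = 1)`.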

## Combining it all

```{r}
mm <- match_on(group ~ x1 + x2, data = d,
               within = em, caliper = 4)
mm
```

- Looks like some unmatchable controls.

```{r}
summary(fullmatch(mm, data = d))
```

## Discussion {.smaller}

- 3-way trade-off between balance, speed and effective sample size (power).
    - More constraints on j:k lead to more power but slower matching.
        - Too few constraints can limit usefulness.
        - May throw away a lot of data, negating the power gain.
    - More exact matching or calipering leads to faster but less balanced
      matches.
        - Too many restrictions can reduce power.
        - Too few restrictions can limit usefulness.
    - Dropping observations leads to better-balanced matches with lower power.

# Example

## ECLS data

- Early Childhood Longitudinal Study (ECLS)

```{r}
#| echo: false
ecls <- read.csv("ecls/ecls.csv")
ecls <- ecls[complete.cases(ecls),]
```
```{r}
dim(ecls)
```

- Treatment is attending a Catholic school (vs. a public school).

```{r}
table(ecls$catholic)
```

## Simulation Results 1

```{r}
#| echo: false
#| cache: true
eclscomplete <- ecls[complete.cases(ecls),]

full_form <- catholic ~ race_white + race_black + race_hispanic + race_asian +
  p5numpla + p5hmage + p5hdage + w3momscr + w3dadscr +
  w3income + w3povrty + p5fstamp

save <- matrix(nrow = 8, ncol = 3)
colnames(save) <- c("Model",
                    "Time",
                    "ESS")

# Model 1: Pairmatching
s1 <- system.time(pm1 <- pairmatch(full_form, data = eclscomplete))
save[1,] <- c("Pair match",
              s1[1],
              effectiveSampleSize(pm1))

# Model 2: Pairmatching 1:2
s2 <- system.time(pm2 <- pairmatch(full_form, controls = 2, data = eclscomplete))
save[2,] <- c("Pair match (1:2)",
              s2[1],
              effectiveSampleSize(pm2))

# Model 3: Unrestricted fullmatch
s3 <- system.time(fm1 <- fullmatch(full_form, data = eclscomplete))
save[3,] <- c("Full match (unres.)",
              s3[1],
              effectiveSampleSize(fm1))

# Model 4: Restricted fullmatch
s4 <- system.time(fm2 <- fullmatch(full_form, data = eclscomplete,
                                   min.controls = 1/10, max.controls = 10))
save[4,] <- c("Full match (restr.)",
              s4[1],
              effectiveSampleSize(fm2))

# Model 5: Fullmatch with ExactMatch
s5 <- system.time({
  em <- exactMatch(catholic ~ race, data = eclscomplete)
  dist <- match_on(full_form, within = em, data = eclscomplete)
  fm3 <- fullmatch(dist, data = eclscomplete)
})
save[5,] <- c("Full + Exact",
              s5[1],
              effectiveSampleSize(fm3))

# Model 6: Propensity score pair match
s6 <- system.time({
  psmod <- glm(full_form, data = eclscomplete, family = binomial)
  ps <- predict(psmod, type = "response")
  psm1 <- pairmatch(catholic ~ ps, data = eclscomplete)
})
save[6,] <- c("PS Pair",
              s6[1],
              effectiveSampleSize(psm1))

# Model 7: PS Full match
s7 <- system.time({
  psmod <- glm(full_form, data = eclscomplete, family = binomial)
  ps <- predict(psmod, type = "response")
  psm2 <- fullmatch(catholic ~ ps, data = eclscomplete)
})
save[7,] <- c("PS Full",
              s7[1],
              effectiveSampleSize(psm2))

# Model 8: PS Full + Exact
s8 <- system.time({
  em <- exactMatch(catholic ~ race, data = eclscomplete)
  psmod <- glm(full_form, data = eclscomplete, family = binomial)
  ps <- predict(psmod, type = "response")
  psm3 <- fullmatch(catholic ~ ps, data = eclscomplete, within = em)
})
save[8,] <- c("PSM + Exact",
              s8[1],
              effectiveSampleSize(psm3))
```

```{r}
#| echo: false
save <- as.data.frame(save, stringsAsFactors = FALSE)
save$Time <- as.numeric(save$Time)
save$ESS <- as.numeric(save$ESS)
knitr::kable(save, digits = 2)
```

## Simulation Results 2

```{r}
#| echo: false
#| message: false
#| warning: false
labels <- c("Unmatched", "Pair", "1:2 Pair")
plot(balanceTest(update(full_form,
                        . ~ . + strata(pm1) + strata(pm2)),
                 data = eclscomplete), absolute = TRUE,
     xlab = "Absolute Standardized Differences") +
  geom_point(size = 5) +
  scale_color_discrete(name = "", labels = labels) +
  scale_shape_discrete(name = "", labels = labels) +
  theme(legend.position = "bottom", text = element_text(size=20))
```

## Simulation Results 3

```{r}
#| echo: false
#| message: false
#| warning: false
labels <- c("Unmatched", "Fullmatch", "Fullmatch, Restrictions",
            "Fullmatch, Exactmatch")
plot(balanceTest(update(full_form,
                        . ~ . + strata(fm1) + strata(fm2) + strata(fm3)),
                 data = eclscomplete), absolute = TRUE,
     xlab = "Absolute Standardized Differences") +
  geom_point(size = 5) +
  scale_color_discrete(name = "", labels = labels) +
  scale_shape_discrete(name = "", labels = labels) +
  theme(legend.position = "bottom", text = element_text(size=20)) +
  guides(color=guide_legend(nrow = 2,byrow = TRUE))
```

## Simulation Results 4

```{r}
#| echo: false
#| message: false
#| warning: false
labels <- c("Unmatched", "PSM Pair", "PSM Full",
            "PSM Full, Exactmatch")
plot(balanceTest(update(full_form,
                        . ~ . + strata(psm1) + strata(psm2) + strata(psm3)),
                 data = eclscomplete), absolute = TRUE,
     xlab = "Absolute Standardized Differences") +
  geom_point(size = 5) +
  scale_color_discrete(name = "", labels = labels) +
  scale_shape_discrete(name = "", labels = labels) +
  theme(legend.position = "bottom", text = element_text(size=20)) +
  guides(color=guide_legend(nrow = 2,byrow = TRUE))

```

## Simulation Results 5 {.smaller}


```{r}
#| echo: false
#| message: false
resp_form <- c5r2mtsc ~ catholic + race_white + race_black + race_hispanic +
  race_asian + p5numpla + p5hmage + p5hdage + w3momscr + w3dadscr +
  I(w3income/10^3) + w3povrty + p5fstamp
save <- rbind(c(0, NA, NA), save)
save[1,1] <- "Unmatched"


results <- rbind(summary(lm(resp_form,
                          data = eclscomplete))$coeff["catholic", c(1, 4)],
               summary(lmer(update(resp_form, . ~ . + (1 | pm1)),
                            data = eclscomplete))$coeff["catholic", c(1, 5)],
               summary(lmer(update(resp_form, . ~ . + (1 | pm2)),
                            data = eclscomplete))$coeff["catholic", c(1, 5)],
               summary(lmer(update(resp_form, . ~ . + (1 | fm1)),
                            data = eclscomplete))$coeff["catholic", c(1, 5)],
               summary(lmer(update(resp_form, . ~ . + (1 | fm2)),
                            data = eclscomplete))$coeff["catholic", c(1, 5)],
               summary(lmer(update(resp_form, . ~ . + (1 | fm3)),
                            data = eclscomplete))$coeff["catholic", c(1, 5)],
               summary(lmer(update(resp_form, . ~ . + (1 | psm1)),
                            data = eclscomplete))$coeff["catholic", c(1, 5)],
               summary(lmer(update(resp_form, . ~ . + (1 | psm2)),
                            data = eclscomplete))$coeff["catholic", c(1, 5)],
               summary(lmer(update(resp_form, . ~ . + (1 | psm3)),
                            data = eclscomplete))$coeff["catholic", c(1, 5)])
colnames(results) <- c("Est. Coef", "p-val")

save$"Est. Coef" <- results[, 1]
save$"p-value" <- ifelse(results[, "p-val"] < .001, "<.001",
                         sprintf("%.3f", results[, "p-val"]))


knitr::kable(save, digits = 2)
```

# Other Matching Methods

- **quickmatch** - Generalized full matching; fast, and handles more than two
  groups
- **cem** - Coarsened exact matching; bins continuous variables and then uses
  exact matching
- The **MatchIt** package in R handles these and other forms of matching
    - Including **optmatch**! (But the actual **optmatch** package allows more
      flexibility).

# References