# Chapter 18: Variable Selection and High-Dimensional Data

This chapter summarizes the problems of incorrect variable selection in causal analyses and outlines some practical guidance.

## 18.1 The different goals of variable selection

Variable selection for prediction is fundamentally different from variable selection for causal analyses. In prediction, we only want to make better predictions: we do not care whether the variables are confounders, as long as they improve predictive strength. For example, prior hospitalization may help predict future heart failure, but we would not suggest that hospitals stop admitting people in order to prevent heart failure.

## 18.2 Variables that induce or amplify bias

Controlling for a collider induces bias. Unfortunately, even when given the temporal ordering of $A$, $L$, and $Y$, we cannot determine from the data whether or not $A$ affects $L$. Thus the decision to adjust for $L$ must be based on information outside of the data.
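To make collider bias concrete, here is a minimal simulation sketch. The data-generating process is an assumption chosen for illustration (it is not taken from the chapter): $A$ has no effect on $Y$, and $L$ is a common effect of both. The helper `ols_slope` is a hypothetical name introduced here.

```python
import numpy as np

def ols_slope(y, *covariates):
    """Return the OLS coefficient on the first covariate (intercept included)."""
    X = np.column_stack([np.ones_like(y), *covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

rng = np.random.default_rng(0)
n = 200_000

# Assumed structure: A -> L <- Y, with no arrow from A to Y.
A = rng.normal(size=n)            # treatment, independent of Y
Y = rng.normal(size=n)            # outcome, true effect of A is zero
L = A + Y + rng.normal(size=n)    # collider: common effect of A and Y

unadjusted = ols_slope(Y, A)      # population value: 0
adjusted = ols_slope(Y, A, L)     # population value: -0.5

print(f"coefficient on A, unadjusted:     {unadjusted:.3f}")
print(f"coefficient on A, adjusted for L: {adjusted:.3f}")
```

Under this assumed model, the unadjusted coefficient is close to the true null effect, while adjusting for $L$ produces a clearly nonzero coefficient. Nothing in the observed data alone signals that the adjusted analysis is the biased one.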

A related problem is $Z$-bias: controlling for an instrument $Z$ can *amplify* the bias due to unobserved confounding. This phenomenon is referred to as *bias amplification*.
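Bias amplification can be illustrated with a small simulation. This is a sketch under an assumed linear model (not from the chapter): $Z$ affects only $A$, $U$ is an unmeasured confounder of $A$ and $Y$, and the true effect of $A$ on $Y$ is zero. The helper `ols_slope` is a hypothetical name.

```python
import numpy as np

def ols_slope(y, *covariates):
    """Return the OLS coefficient on the first covariate (intercept included)."""
    X = np.column_stack([np.ones_like(y), *covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

rng = np.random.default_rng(1)
n = 200_000

Z = rng.normal(size=n)                 # instrument: affects A only
U = rng.normal(size=n)                 # unmeasured confounder
A = Z + U + rng.normal(size=n)         # treatment
Y = U + rng.normal(size=n)             # outcome; true effect of A is zero

# Since the true effect is zero, each coefficient below equals the bias.
bias_without_z = ols_slope(Y, A)       # population value: 1/3
bias_with_z = ols_slope(Y, A, Z)       # population value: 1/2

print(f"bias without adjusting for Z: {bias_without_z:.3f}")
print(f"bias after adjusting for Z:   {bias_with_z:.3f}")
```

In this assumed setup, adjusting for $Z$ removes variation in $A$ that is unrelated to $U$, so the remaining variation in $A$ is more strongly confounded and the bias grows.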

## 18.3 Causal inference and machine learning

Let's say that controlling for a set of variables $X$ will not induce or amplify bias.

The next problem is high dimensionality: $X$ may contain many variables, or multiple continuous variables.

We can try using prediction-oriented algorithms such as lasso and ridge regression. However, by themselves they do not suffice to adequately adjust for confounding in high-dimensional settings. These algorithms must be used in conjunction with doubly robust estimators, with two modifications:
- sample splitting
- cross-fitting

These modifications are necessary if we hope to construct valid 95% Wald confidence intervals (i.e., intervals that trap the causal parameter of interest at least 95% of the time).

## 18.4 Doubly robust machine learning estimators

Through sample splitting and cross-fitting, we can combine doubly robust estimation and machine learning to obtain causal effect estimates which have known statistical properties and which use all the available data.
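The idea can be sketched as a cross-fitted augmented inverse probability weighted (AIPW) estimator of the average causal effect. This is an illustrative sketch, not the chapter's implementation: to stay self-contained, a plain logistic regression and a ridge regression stand in for the machine learning algorithms, and all function names, the simulated data, and the propensity clipping bounds are assumptions. Nuisance models are fit on one part of the sample and evaluated on the other, the roles are then swapped, and the per-subject contributions are averaged so that all data are used.

```python
import numpy as np

def design(X):
    """Prepend an intercept column."""
    return np.column_stack([np.ones(len(X)), X])

def fit_logistic(X, a, n_iter=25, ridge=1e-6):
    """Logistic regression by Newton's method (stand-in for an ML classifier)."""
    Xd = design(X)
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        grad = Xd.T @ (a - p) - ridge * beta
        hess = (Xd * (p * (1 - p))[:, None]).T @ Xd + ridge * np.eye(Xd.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def fit_ridge(X, y, lam=1.0):
    """Ridge regression in closed form (stand-in for an ML regressor)."""
    Xd = design(X)
    penalty = lam * np.eye(Xd.shape[1])
    penalty[0, 0] = 0.0  # do not penalize the intercept
    return np.linalg.solve(Xd.T @ Xd + penalty, Xd.T @ y)

def cross_fit_aipw(X, a, y, n_folds=2, seed=0):
    """Cross-fitted AIPW estimate of E[Y^1] - E[Y^0] with a 95% Wald interval."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = rng.permutation(n) % n_folds   # balanced random fold labels
    phi = np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        # Fit nuisance models on the training part only.
        ps_beta = fit_logistic(X[train], a[train])
        m1_beta = fit_ridge(X[train & (a == 1)], y[train & (a == 1)])
        m0_beta = fit_ridge(X[train & (a == 0)], y[train & (a == 0)])
        # Evaluate them on the held-out part.
        Xt, at, yt = design(X[test]), a[test], y[test]
        ps = np.clip(1.0 / (1.0 + np.exp(-Xt @ ps_beta)), 0.01, 0.99)
        m1, m0 = Xt @ m1_beta, Xt @ m0_beta
        phi[test] = m1 - m0 + at * (yt - m1) / ps - (1 - at) * (yt - m0) / (1 - ps)
    est = phi.mean()
    se = phi.std(ddof=1) / np.sqrt(n)
    return est, (est - 1.96 * se, est + 1.96 * se)

# Simulated data (assumed for illustration); the true average effect is 2.
rng = np.random.default_rng(42)
n = 4000
X = rng.normal(size=(n, 5))
a = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 * X[:, 0] - 0.5 * X[:, 1]))))
y = 2.0 * a + X[:, 0] + X[:, 1] + rng.normal(size=n)

estimate, ci = cross_fit_aipw(X, a, y)
print(f"estimate: {estimate:.3f}, 95% Wald CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

In practice the two stand-in learners would be replaced by the chosen machine learning algorithms. The splitting logic stays the same: no subject's contribution is computed from nuisance models that were fit using that subject's own data.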

An active research area is the development of procedures to detect whether the bias of doubly robust split-sample estimators is of the order of, or larger than, the standard error.

## 18.5 Variable selection is a difficult problem

Combining causal inference methods with machine learning algorithms for confounder selection can, under certain conditions, result in correct statistical inference. However, doubly robust machine learning does not solve all our problems, for at least three reasons.

First, the available subject-matter knowledge may be insufficient to identify all important confounders.

Second, implementing doubly robust machine learning is difficult and computationally expensive.

Third, there is no guarantee that the variance of the resulting estimate will be small enough for meaningful causal inference.

