# (Pearl, 2018) The Seven Tools of Causal Inference with Reflections on Machine Learning

*Some quick notes from Pearl's brief summary of the current innovation in Causal Inference, from his point of view.*

https://ftp.cs.ucla.edu/pub/stat_ser/r481.pdf

The dramatic success in machine learning has led to increasing expectations of autonomous systems that exhibit human-level intelligence.

However, several fundamental obstacles remain:
1. *adaptability* or *robustness*: current systems lack the capability of recognizing or reacting to new circumstances they have not been specifically programmed for.
2. *explainability*: machine learning models remain mostly black boxes.
3. understanding the *cause-effect connections*, which is, in the author's stated opinion, a necessary ingredient for achieving human-level intelligence. Once solved, this would allow machines to answer "What if?" questions such as "What if I make it happen?" and "What if I had acted differently?".

Next, the author describes a 3-level hierarchy that restricts and governs inferences in causal reasoning. Then, the author summarizes how traditional impediments are circumvented using modern tools of causal inference (7 of them!).



## The Three Layer Causal Hierarchy

The 3-layer classification shows which kinds of questions each level is capable of answering. The levels are:
1. *Association* $P(y|x)$: purely statistical relationships. The hallmark of current ML methods.
2. *Intervention* $P(y|do(x),z)$: e.g., "What will happen if we double the price?". Such quantities can be estimated experimentally from randomized trials or analytically using causal Bayesian networks.
3. *Counterfactual* $P(y_x|x',y')$: e.g., "What if I had acted differently?", thus necessitating retrospective reasoning.

Counterfactuals are at the top because they subsume interventional and associational questions.

The expression $P(y_x|x',y')$ stands for "the probability that event $Y=y$ would be observed had $X$ been $x$, given that we actually observed $X$ to be $x'$ and $Y$ to be $y'$". For example, the probability that Joe's salary would be $y$ had he finished college, given that his actual salary is $y'$ and that he had only two years of college. Such sentences can be computed only when we possess functional or Structural Equation models, or properties of such models.





## The Seven Tools of Causal Inference (or What You Can Do with a Causal Model That You Could Not Do Without?)

The author notes that questions involving words such as "cause", "attributed to", and "preventing" are all causal questions, and that until recently science gave us no means even to articulate them, let alone answer them. The author further claims that only a few decades ago scientists were unable to write down a mathematical equation for the obvious fact that "mud does not cause rain".

In the past 3 decades, a mathematical language has been developed for managing causes and effects, accompanied by a set of tools that turn causal analysis into a mathematical game, like solving algebraic equations.

The author calls the mathematical framework that led to this transformation "Structural Causal Models" (SCM). It consists of 3 parts:
1. Graphical models
2. Structural Equations
3. Counterfactual and Interventional logic

Graphical models serve as a language for representing what we know about the world, counterfactuals help us articulate what we want to know, while structural equations serve to tie the two together in a solid semantics.

In addition, there is an "inference engine" that takes as input:
- Query
- Assumptions (i.e., graphical model)
- Data

and outputs:
- Estimand ($E_S$)
- Estimate ($\hat E_S$)
- Fit Indices ($F$)

For example, let's assume that the query is the causal effect of $X$ on $Y$,
$$Q=P(Y|do(X)),$$
and that the assumptions encode a confounder $Z$ of $X$ and $Y$.

Further, let the data be sampled at random from a joint distribution $P(X,Y,Z)$. The estimand ($E_S$) will be the formula
$$E_S=\sum_Z P(Y|X,Z)P(Z),$$
which defines a procedure of estimation.

The actual Estimate $\hat E_S$ can be produced by any number of techniques that produce a consistent estimate of $E_S$ from finite samples of $P(X,Y,Z)$.

Finally, the Fit index will be NULL. In other words, after examining the structure of the graph, the engine should conclude that the assumptions encoded do not have any testable implications. Therefore, the veracity of the resultant estimate must lean entirely on the assumptions encoded in the graph.
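
To make the confounder example concrete, here is a minimal sketch that evaluates the estimand $\sum_Z P(Y|X,Z)P(Z)$ exactly on a small binary model and compares it with the purely associational $P(Y|X)$. All probabilities and the function names (`joint_table`, `prob`, `associational`, `backdoor_estimand`) are made up for illustration and do not come from the paper.

```python
from itertools import product

# Hypothetical binary model with Z -> X, Z -> Y and X -> Y (numbers are invented).
P_Z1 = 0.5                                   # P(Z=1)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}              # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (0, 1): 0.5,   # P(Y=1 | X=x, Z=z)
                 (1, 0): 0.3, (1, 1): 0.7}


def bern(p_one, value):
    """Probability that a binary variable with P(1)=p_one takes `value`."""
    return p_one if value == 1 else 1 - p_one


def joint_table():
    """Observational joint P(X, Y, Z), keyed by (x, y, z)."""
    return {
        (x, y, z): bern(P_Z1, z)
        * bern(P_X1_GIVEN_Z[z], x)
        * bern(P_Y1_GIVEN_XZ[(x, z)], y)
        for x, y, z in product((0, 1), repeat=3)
    }


def prob(table, **fixed):
    """Probability of the event given by keyword arguments x=, y=, z=."""
    return sum(
        p for (x, y, z), p in table.items()
        if all({"x": x, "y": y, "z": z}[k] == v for k, v in fixed.items())
    )


def associational(table, x):
    """Level 1 quantity: P(Y=1 | X=x)."""
    return prob(table, x=x, y=1) / prob(table, x=x)


def backdoor_estimand(table, x):
    """Level 2 quantity via the estimand: sum_z P(Y=1 | X=x, Z=z) P(Z=z)."""
    return sum(
        prob(table, x=x, y=1, z=z) / prob(table, x=x, z=z) * prob(table, z=z)
        for z in (0, 1)
    )


TABLE = joint_table()
print(associational(TABLE, 1), backdoor_estimand(TABLE, 1))
```

With these invented numbers, $P(Y=1|X=1)$ is 0.62 while the adjusted quantity is 0.50, so conditioning alone overstates the effect of $X$. Replacing the exact table with empirical frequencies from a sample would give one possible plug-in estimate $\hat E_S$.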

Efficient and complete algorithms have been developed to decide identifiability and produce estimands for a variety of counterfactual queries and a variety of data types.

## Tool 1: Encoding Causal Assumptions - Transparency and Testability

Transparency enables analysts to discern whether the assumptions encoded are plausible, or whether additional assumptions are warranted.

Testability permits us to determine whether the assumptions encoded are compatible with the available data and identify those that need repair.

Transparency is achieved through graphs, and testability is facilitated through a graphical criterion called $d$-separation.
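
As an illustration of how $d$-separation can be checked mechanically, the sketch below uses one standard equivalent formulation: restrict the graph to the ancestors of the variables involved, "moralize" it (connect co-parents and drop edge directions), remove the conditioning set, and test whether any path remains. The function names and the two toy graphs are assumptions made for this example, not something taken from the paper.

```python
def ancestors_of(parents, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def d_separated(parents, xs, ys, zs):
    """True if `zs` d-separates `xs` from `ys` in the DAG {node: [parents]}."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    keep = ancestors_of(parents, xs | ys | zs)

    # Moralize the ancestral graph: link each child to its parents and
    # link every pair of parents that share a child, ignoring direction.
    adjacent = {node: set() for node in keep}
    for child in keep:
        pars = list(parents.get(child, ()))
        for i, p in enumerate(pars):
            adjacent[child].add(p)
            adjacent[p].add(child)
            for q in pars[i + 1:]:
                adjacent[p].add(q)
                adjacent[q].add(p)

    # Search from xs for any node in ys without passing through zs.
    seen, stack = set(), [x for x in xs if x not in zs]
    while stack:
        node = stack.pop()
        if node in ys:
            return False
        if node not in seen:
            seen.add(node)
            stack.extend(n for n in adjacent[node] if n not in zs)
    return True


# Toy graphs: a fork X <- Z -> Y and a collider X -> C <- Y.
fork = {"X": ["Z"], "Y": ["Z"]}
collider = {"C": ["X", "Y"]}

print(d_separated(fork, {"X"}, {"Y"}, {"Z"}))       # conditioning on the common cause
print(d_separated(collider, {"X"}, {"Y"}, {"C"}))   # conditioning on the common effect
```

Each `True` answer corresponds to a conditional independence that the graph implies and that could, in principle, be tested against data. In the fork, conditioning on `Z` separates `X` and `Y`; in the collider, conditioning on `C` connects them.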

## Tool 2: $do$-calculus and the control of confounding

Deconfounding has been demystified through a graphical criterion called "back-door". When the back-door criterion does not hold, the do-calculus is available, which predicts the effect of policy interventions whenever feasible.
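
As one well-known illustration of a case where no observed back-door set exists, consider the "front-door" setting (an example chosen here, not spelled out in these notes): the confounder of $X$ and $Y$ is unobserved, but an observed variable $Z$ lies on every directed path from $X$ to $Y$ and is itself unconfounded with $X$. Note that $Z$ plays the role of a mediator here, not of the confounder used earlier. Under those assumptions, the do-calculus yields
$$P(y|do(x))=\sum_z P(z|x)\sum_{x'} P(y|x',z)P(x'),$$
which involves only observational quantities.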

## Tool 3: The Algorithmization of Counterfactuals

Counterfactual reasoning can now be formalized within the graphical representation. Every structural equation model determines the truth value of every counterfactual sentence.
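
A minimal sketch of how a structural equation determines a counterfactual, using the usual three steps of abduction, action, and prediction. It reuses the salary example, but the linear equation and its coefficients (`BASE_SALARY`, `GAIN_PER_YEAR`) are purely hypothetical and chosen only to keep the arithmetic simple.

```python
# Hypothetical structural equation: salary = BASE_SALARY + GAIN_PER_YEAR * years + u,
# where u collects everything specific to the individual.
BASE_SALARY = 30_000
GAIN_PER_YEAR = 5_000


def counterfactual_salary(years_observed, salary_observed, years_counterfactual):
    """Salary the same individual would have had with a different education."""
    # 1. Abduction: recover the individual's noise term from the observation.
    u = salary_observed - BASE_SALARY - GAIN_PER_YEAR * years_observed
    # 2. Action: replace the observed years of college with the hypothetical value.
    years = years_counterfactual
    # 3. Prediction: re-evaluate the equation with the same noise term.
    return BASE_SALARY + GAIN_PER_YEAR * years + u


# Joe had two years of college and earns 50,000; what if he had finished four?
print(counterfactual_salary(2, 50_000, 4))
```

The key design point is that the noise term `u` is held fixed across the actual and hypothetical worlds, which is what distinguishes a counterfactual from a plain intervention on a new individual.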

## Tool 4: Mediation Analysis and the Assessment of Direct and Indirect Effects

A typical query answerable by this analysis is: What fraction of the effect of $X$ on $Y$ is mediated by variable $Z$?
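
To show what such a query looks like computationally, here is a sketch on a hypothetical linear, noise-free model with $X \to Z \to Y$ and $X \to Y$. The coefficients `A`, `B`, `C` are invented, and the effects are computed as natural direct and indirect effects by holding the mediator at the value it would take under a different treatment.

```python
# Hypothetical linear structural equations (coefficients are invented):
#   Z = A * X            (effect of X on the mediator)
#   Y = C * X + B * Z    (direct effect C, mediator effect B)
A, B, C = 2.0, 0.5, 1.0


def z_of(x):
    """Value of the mediator Z when X is set to x."""
    return A * x


def y_of(x, z):
    """Value of Y when X is set to x and Z is set to z."""
    return C * x + B * z


def effects(x0=0.0, x1=1.0):
    """Total, natural direct and natural indirect effect of moving X from x0 to x1."""
    baseline = y_of(x0, z_of(x0))
    total = y_of(x1, z_of(x1)) - baseline
    direct = y_of(x1, z_of(x0)) - baseline     # change X, freeze Z at its x0 value
    indirect = y_of(x0, z_of(x1)) - baseline   # freeze X, let Z respond to x1
    return total, direct, indirect


total, direct, indirect = effects()
print("fraction mediated:", indirect / total)
```

In this linear setting the indirect effect equals `A * B` and the total effect equals `C + A * B`, so the fraction mediated has a simple closed form; in nonlinear models the direct and indirect parts need not add up to the total.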

## Tool 5: Adaptability, External Validity and Sample Selection Bias

A machine trained in one environment cannot be expected to perform well when environmental conditions change, unless the changes are localized and identified. This problem has given rise to research areas such as "domain adaptation", "transfer learning", "life-long learning", and "explainable AI". Addressing it inherently requires a causal model.

## Tool 6: Recovering from Missing Data

Using causal models of the missingness process we can now formalize the conditions under which causal and probabilistic relationships can be recovered from incomplete data and, whenever the conditions are satisfied, produce a consistent estimate of the desired relationship.
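
A small sketch of one such recoverability condition, under assumptions made up for this example: $X$ is always observed, $Y$ is sometimes missing, and the missingness indicator $R$ depends only on $X$. In that case $P(Y)$ can be recovered as $\sum_x P(Y|X=x,R=1)P(X=x)$, whereas simply dropping incomplete rows gives a biased answer. All numbers and names below are hypothetical.

```python
from itertools import product

# Hypothetical model: X -> Y and X -> R, where R=1 means Y was recorded.
P_X1 = 0.5                           # P(X=1)
P_Y1_GIVEN_X = {0: 0.2, 1: 0.8}      # P(Y=1 | X=x)
P_R1_GIVEN_X = {0: 0.9, 1: 0.3}      # P(R=1 | X=x): Y is often missing when X=1


def bern(p_one, value):
    """Probability that a binary variable with P(1)=p_one takes `value`."""
    return p_one if value == 1 else 1 - p_one


# Full joint P(X, Y, R), keyed by (x, y, r).
JOINT = {
    (x, y, r): bern(P_X1, x) * bern(P_Y1_GIVEN_X[x], y) * bern(P_R1_GIVEN_X[x], r)
    for x, y, r in product((0, 1), repeat=3)
}


def prob(**fixed):
    """Probability of the event given by keyword arguments x=, y=, r=."""
    return sum(
        p for (x, y, r), p in JOINT.items()
        if all({"x": x, "y": y, "r": r}[k] == v for k, v in fixed.items())
    )


# Ground truth, available here only because the model is fully specified.
true_p_y1 = prob(y=1)

# Complete-case analysis: use only rows where Y was recorded.
complete_case = prob(y=1, r=1) / prob(r=1)

# Recovery formula: reweight the observed rows by the distribution of X.
recovered = sum(
    prob(x=x, y=1, r=1) / prob(x=x, r=1) * prob(x=x) for x in (0, 1)
)

print(true_p_y1, complete_case, recovered)
```

With these invented numbers the complete-case value is 0.35 while both the true and the recovered value are 0.50. The recovery works only because the assumed graph makes $Y$ independent of $R$ given $X$; if missingness depended on $Y$ itself, this formula would not apply.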

## Tool 7: Causal Discovery

A broader field of causality in which causal graphs are recovered from data (whenever possible), enabling the identification and estimation of causal effects.