# Chapter 2: Graphical Models and their applications

Reading material is [here](http://bayes.cs.ucla.edu/PRIMER/primer-ch2.pdf)

I don't think the title is great; for me this chapter was more of an "anatomy of DAGs", while chapters 3 & 4 were more about the applications. The critical part for me was $\S$2.4 on $d$-separation, although here I think the emphasis should be on **blocking paths**, and $d$-separation is just one application of that. We will see other, more important, applications in chapter 3 when using the adjustment formulas.

The material in $\S$2.5 was okay (testing a causal model), but this material is covered again -- and with more rigor _and_ practical application -- in $\S$3.3 (page 65, where it is made as a comment at the end of a long section).

## Conditional independence

$\S$1.3.4 defined independent and conditionally independent events. It is worth revisiting these concepts again. 

In many probability texts, the intersection $A \cap B$ is used, which makes sense for subsets of outcomes $A$ and $B$. Where we have random variables $A$ and $B$, instead of using subsets like $(A = a)\cap(B=b)$, I will use the notation $(A=a, B=b)$.


### Independent

There are two ways we traditionally think of events $A$ and $B$ being independent. One is the simple
$$P(A, B) = P(A)P(B)$$
where we have taken Pearl's $(A, B)$ for "A and B" over the more set theoretic $A \cap B$, as we generally will be using random variables.


A more natural definition is
$$P(A | B) = P(A)$$
i.e. knowing $B$ doesn't change the probabilities of $A$. In my view, this is more directly connected to the name.

<details><summary>Proof of equivalence</summary>
<p>

To see these definitions are equivalent, let's show the first implies the second. We have for any events
$$P(A | B) = P(A, B) / P(B) \Rightarrow P(A, B) = P(A | B) P(B)$$
If we have the first statement, $P(A,B) = P(A)P(B)$,
$$P(A)P(B) = P(A | B) P(B)$$
Provided $P(B) \neq 0$ we can cancel the $P(B)$ from each side, and get $P(A) = P(A|B)$ as promised. If $P(B) = 0$, then $P(A|B)$ is generally not sensible (e.g. "what is the probability it is sunny outside given that I rolled an 8 on my normal 6-sided die"). So I guess I should change the definition to
$$P(A=a | B=b) = P(A=a), \text{ for all values $b$ which can occur, means $A$ and $B$ are independent}$$

To show the second implies the first, let's start with $P(A|B) = P(A)$. Then
$$P(A, B) = P(A|B)P(B) = P(A)P(B)$$
QED


</p>
</details>



### Conditionally independent

We say _A and B are conditionally independent given C_ if
$$P(A | B, C) = P(A | C)$$

Here are two consequences
1. This is a symmetric relationship, so $P(B| A,C) = P(B|C)$ if the above holds
2. If $A$, $B$, and $C$ are all (mutually) independent, then $A$ and $B$ are still conditionally independent given $C$

The two properties are pretty intuitive, given the name. Proofs in collapsible sections below.

<details><summary>Proof of 1 (symmetric)</summary>
<p>


To show the first one, for any events $A$, $B$, and $C$ we have
$$P(A, B, C) = P(A | B, C)P(B, C) = P(B | A, C)P(A,C)$$
If $A$ and $B$ are independent conditional on $C$, we get
$$P(A | C)P(B, C) = P(B| A,C)P(A, C)$$
We know $P(A, C) = P(A | C)P(C)$, so
$$P(A|C)P(B,C) = P(B|A, C) P(A|C) P(C)$$
Canceling
$$P(B,C) = P(B|A, C)P(C)$$
We also have $P(B, C) = P(B|C)P(C)$, and we can cancel $P(C)$ from both sides, leaving
$$P(B|C) = P(B| A, C)$$
which is what we intended to show. (Note that wherever I canceled things, I implicitly assumed those quantities were not zero; like the previous case, this means I am assuming that I don't condition on things that cannot happen.)

</p>
</details>
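
<details><summary>Proof of 2 (sketch, assuming mutual independence)</summary>
<p>

A short sketch of the second property. I am reading "all independent" as _mutually_ independent (my assumption; pairwise independence alone is not enough), so that $P(A, B, C) = P(A)P(B)P(C)$, $P(B, C) = P(B)P(C)$ and $P(A, C) = P(A)P(C)$. Then, provided $P(B, C) \neq 0$,
$$P(A | B, C) = \frac{P(A, B, C)}{P(B, C)} = \frac{P(A)P(B)P(C)}{P(B)P(C)} = P(A) = \frac{P(A)P(C)}{P(C)} = P(A | C)$$
The first and last expressions are the definition of conditional independence given $C$.

</p>
</details>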

### Counter-intuitive facts about conditional independence

If $A$ and $B$ are independent, it does **not** follow that $A$ and $B$ are independent conditional on $C$!

By conditioning, we can relate two independent events. This is the "conditioning on collider" problem we will see later. Often this can be thought of as a "selection bias" -- the choice of conditioning on $C$ gives us a subpopulation, and that correlates $A$ and $B$ in the subpopulation, even though they are uncorrelated "in nature".

Simple toy example: 
- If I roll two six-sided dice, the outcomes are uncorrelated. 
- Call die 1's outcome $A$, and die 2's outcome $B$. $A$ and $B$ are independent. If I tell you what $A$ is, it doesn't help you determine what $B$ is (and vice-versa).
- In games, we are often interested in the total, $C = A + B$. It is clear $C$ depends on $A$ and $B$.
- If we condition on $C$, $A$ and $B$ are perfectly anti-correlated! e.g. If I tell you that "I rolled a 7" (C=7), knowing $A=3$ tells you $B=4$, even though $A$ and $B$ are independent.
- $A$ and $B$ are _not_ conditionally independent on $C$.
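
The dice example is small enough to check by brute-force counting over the 36 equally likely outcomes. This is just a sketch; the helper `prob` is my own name, not something from Pearl.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (A, B) outcomes for two fair six-sided dice
outcomes = list(product(range(1, 7), repeat=2))

def prob(event, given=lambda a, b: True):
    """P(event | given), computed by counting equally likely outcomes."""
    kept = [(a, b) for a, b in outcomes if given(a, b)]
    hits = [(a, b) for a, b in kept if event(a, b)]
    return Fraction(len(hits), len(kept))

b_is_4 = lambda a, b: b == 4

p_b4 = prob(b_is_4)                                    # no information
p_b4_given_a3 = prob(b_is_4, lambda a, b: a == 3)      # A tells us nothing
p_b4_given_c7 = prob(b_is_4, lambda a, b: a + b == 7)  # condition on the sum
p_b4_given_a3_c7 = prob(b_is_4, lambda a, b: a == 3 and a + b == 7)

print(p_b4, p_b4_given_a3)              # equal: A and B are independent
print(p_b4_given_c7, p_b4_given_a3_c7)  # differ: dependent once C is known
```

The first pair of numbers agree, so learning $A$ alone does nothing. The second pair do not: once $C=7$ is known, learning $A=3$ pins down $B$.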

This is an example of poor naming: saying $A$ and $B$ are independent suggests a more powerful statement than $A$ and $B$ are conditionally independent given $C$, which isn't actually true.

More serious examples:
- [Berkson's paradox](https://en.wikipedia.org/wiki/Berkson%27s_paradox): originally the observation that diabetes and cholecystitis were negatively correlated _within hospital in-patients_. In general, this would be a problem with any two random diseases. People in the in-patient population have some ailment, so someone without diabetes in this population is _more_ likely to have other diseases than the general population (which is what causes them to go to the hospital)
- Has application to survivor bias in survey completion
- Also known as the "explaining away" effect: if an outcome can be achieved multiple ways, knowing it wasn't achieved one way makes the other ways more likely.

## Anatomy of a DAG

We will label all our sections as "common graph term" / "common statistical effect". This breaks down the common patterns that we see in DAGs.

In [1]:
%%javascript
require.config({
    paths: { 
        d3: 'https://d3js.org/d3.v5.min'
    }
});


In [2]:
from dag import draw_dag


### Chains / mediation

A _chain_ is a unidirectional path of events e.g. $A \rightarrow B \rightarrow C \rightarrow D \rightarrow ...$. i.e. events that are connected with arrows flowing from beginning to end.

In a chain:
1. Neighboring events on the chain are __not__ independent
2. Any pair of events are _generally_ not independent

There are some contrived ("pathological") cases where $A$ and $C$ could end up being independent if there are other sources of noise.

If we control for a node in a chain, nodes on the left of the controlled node are conditionally independent of nodes on the right of the controlled node, provided the chain is the only path connecting the two nodes.


#### Example 1
Given $A \rightarrow B \rightarrow C \rightarrow D$, without controlling for anything, $A$ and $D$ (likely) are not independent. If we control for $C$, then $A$ and $D$ are independent. We could write
$$A \rightarrow B \rightarrow (C) \rightarrow D$$

We would still have 
* $A$ and $B$ are **not** independent (they are neighbors)
* $A$ and $B$ are **not** conditionally independent on C.
* $B$ and $C$ are **not** independent (they are neighbors)
* $B$ and $C$ **are** (trivially) conditionally independent on C (because $C$ just takes a single value!)
* $B$ and $D$ are **not** independent (they are connected by a chain)
* $B$ and $D$ **are** conditionally independent on C (controlling $C$ blocks information about $B$ reaching $D$).
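
To make these bullet points concrete, here is a minimal sketch that builds an exact joint distribution for a binary chain and checks a few of the claims. The numbers are made up for illustration: $A$ is a fair coin and each node copies its parent except for a 1-in-5 chance of flipping (`FLIP` is my assumption, not from the book).

```python
from fractions import Fraction
from itertools import product

FLIP = Fraction(1, 5)  # assumed chance that a node differs from its parent

def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) for the chain A -> B -> C -> D."""
    p = Fraction(1, 2)  # A is a fair coin
    for parent, child in ((a, b), (b, c), (c, d)):
        p *= FLIP if parent != child else 1 - FLIP
    return p

def prob(event, given=lambda *v: True):
    """P(event | given), summing the exact joint over all 16 worlds."""
    worlds = [v for v in product((0, 1), repeat=4) if given(*v)]
    total = sum(joint(*v) for v in worlds)
    return sum(joint(*v) for v in worlds if event(*v)) / total

d_is_1 = lambda a, b, c, d: d == 1

p_d = prob(d_is_1)
p_d_given_a = prob(d_is_1, lambda a, b, c, d: a == 1)
p_d_given_c = prob(d_is_1, lambda a, b, c, d: c == 1)
p_d_given_c_and_a = prob(d_is_1, lambda a, b, c, d: c == 1 and a == 1)
p_d_given_c_and_b = prob(d_is_1, lambda a, b, c, d: c == 1 and b == 1)

print(p_d, p_d_given_a)  # differ: A and D are not independent
print(p_d_given_c, p_d_given_c_and_a, p_d_given_c_and_b)  # all equal
```

Once $C$ is fixed, adding $A$ or $B$ to the conditioning set leaves the probability of $D$ unchanged, which is the "controlling $C$ blocks the chain" statement.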

#### Intuition:

1. Intervening anywhere along the chain will have an effect on the final node in the chain, but will make everything before the final manipulation irrelevant
2. Even though $A$ does not directly affect $D$, we still have an intuitive sense that $A$ "morally" affects $D$, but it does so via $B$ and $C$. Sometimes $B$ and $C$ are called "mediators" for this reason.

### Forks / confounding

A fork is when a variable influences two children: $C_1 \leftarrow P \rightarrow C_2$. The "parent" $P$ is a common cause of both its children, and is commonly referred to as a _confounder_ -- $C_1$ and $C_2$ are not independent, but only because they are driven by a common factor. 

Popular examples are ice-cream sales and shark attacks both increasing as temperature increases, or global temperatures increasing as the number of pirates decreases (driven by different factors, but both trending over time: temperatures going up and piracy being curtailed).

If $C_1 \leftarrow P \rightarrow C_2$ is the only connection between $C_1$ and $C_2$:
- $C_1$ and $C_2$ are not independent
- $C_1$ and $C_2$ are conditionally independent on $P$

#### Intuition

1. In _observational_ data $C_1$ and $C_2$ are correlated (via $P$), but trying to intervene and change $C_1$ directly without changing $P$ doesn't affect $C_2$. e.g. closing all ice-cream shops doesn't decrease the number of shark attacks, even though there is a positive correlation seen in observational data. Intervening can destroy the previous positive correlation.
2. This is the most well-known statistical fallacy, and is the one that leads us to "control for everything" just in case.
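
Here is a minimal sketch of the ice-cream / shark-attack fork with exact probabilities. All the numbers are hypothetical (I made them up so the arithmetic is clean): $P$ is "hot day", $C_1$ is "high ice-cream sales", $C_2$ is "shark attack". The `do_c1` argument is my own way of mimicking an intervention, by making $C_1$ ignore its parent.

```python
from fractions import Fraction
from itertools import product

# Hypothetical numbers, chosen only for illustration
P_HOT = Fraction(1, 2)                          # P(P=1)
P_C1 = {1: Fraction(9, 10), 0: Fraction(1, 5)}  # P(C1=1 | P=p)
P_C2 = {1: Fraction(4, 5), 0: Fraction(1, 10)}  # P(C2=1 | P=p)

def bern(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1 - p_one

def joint(p, c1, c2, do_c1=None):
    """P(P=p, C1=c1, C2=c2); if do_c1 is set, C1 is forced and ignores P."""
    if do_c1 is None:
        p_c1 = bern(P_C1[p], c1)
    else:
        p_c1 = 1 if c1 == do_c1 else 0
    return bern(P_HOT, p) * p_c1 * bern(P_C2[p], c2)

def prob(event, given=lambda p, c1, c2: True, do_c1=None):
    worlds = [w for w in product((0, 1), repeat=3) if given(*w)]
    total = sum(joint(*w, do_c1=do_c1) for w in worlds)
    return sum(joint(*w, do_c1=do_c1) for w in worlds if event(*w)) / total

attack = lambda p, c1, c2: c2 == 1

p_attack = prob(attack)
p_attack_low_sales = prob(attack, lambda p, c1, c2: c1 == 0)
p_attack_hot = prob(attack, lambda p, c1, c2: p == 1)
p_attack_hot_low_sales = prob(attack, lambda p, c1, c2: p == 1 and c1 == 0)
p_attack_do_closed = prob(attack, do_c1=0)

print(p_attack, p_attack_low_sales)          # differ: correlated in observation
print(p_attack_hot, p_attack_hot_low_sales)  # equal: independent given P
print(p_attack_do_closed)                    # same as p_attack: no causal effect
```

Observing low sales lowers the chance of an attack (it hints at a cold day), but forcing sales to be low leaves the attack rate where it was.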

### Colliders / selection or survivor bias

A collider is when causal information comes into a node from opposing directions, such as $A \rightarrow C \leftarrow B$. If there are no other paths between $A$ and $B$:
- $A$ and $B$ are independent
- $A$ and $B$ are not conditionally independent on $C$

Examples were given in the section about "counter-intuitive facts about conditional independence":

- $(\text{roll 1}) \rightarrow (\text{sum}) \leftarrow (\text{roll 2})$: two independent rolls are correlated if I know their sum
- $(\text{diabetes}) \rightarrow (\text{go to hospital}) \leftarrow (\text{other diseases})$: other diseases are more negatively correlated with diabetes than in the general population, because (most) people in the in-patient population have at least one medical issue

#### Intuition

We generally don't want to condition on colliders (although chapter 3 does show a couple of interesting counter-examples). Colliders block information flow on their own.

Most of us know about the problems with selection bias (which is a common way colliders show up, as in the in-patient example).

## $\S$2.4 path blocking and $d$-separation

We have looked at graphs with a single path and discussed what to control for, and what not to. What should we do with a graph that has multiple paths, such as the one below?

In [3]:
positions = {'T': [300, 30], 'Z': [100, 180], 'W': [200, 250], 'U': [200, 350], 'X':[300, 180], 'Y': [500, 180]}
draw_dag({'T': ['Z', 'Y'],
          'Z': ['W'],
          'W': ['U'],
          'X': ['Y', 'W']}, positions=positions)


Pearl shows each node with unknown noise terms entering each node, I am going to neglect these and say that each circle is a random variable (and is capable of generating its own noise).

Let's call the set of nodes we condition on $\mathcal{C}$. We want to be able to answer the question of whether two random variables in this graph are conditionally independent on $\mathcal{C}$ (i.e. if we control for everything in $\mathcal{C}$, will our nodes be independent, or dependent?).

Let's look at the nodes $Z$ and $Y$ in the DAG above and ask: if we don't condition on anything, are they independent? Intuitively, the answer is "no". Ignoring all the other complicated pieces of the graph, we can tell $T$ is a _confounder_ for $Z$ and $Y$, so changes in $T$ are likely to correlate $Z$ and $Y$.

### Defining $d$-separation

#### Paths

A (simple) **path** between two nodes $Z$ and $Y$ is any sequence of nodes
- starting at $Z$,
- ending at $Y$
- any two adjacent nodes in the sequence are connected by an edge in the DAG (without regard to direction)
- no node is repeated

The part about disregarding direction is probably the hardest piece to follow =). In the example above, there are two paths from Z to Y:
1. Z to T to Y
2. Z to W to X to Y

The "no repeated node" rule is just there to eliminate stepping back and forward (e.g. Z to T to Z to T to Z to T to Y)

#### Blocked paths

A path $p$ is blocked by the controlling nodes $\mathcal{C}$ if either:
1. There is a chain or a fork in the path whose middle node is in $\mathcal{C}$
2. There is a collider $E \rightarrow B \leftarrow F$ in the path and $B$ is not controlled for ($B \not\in \mathcal{C}$), and no descendant of $B$ is controlled for.

#### $d$-separation

Two nodes are $d$-separated if every (simple) path between them is blocked.


### Back to our example

In the example above, if we don't control for anything, are $Z$ and $Y$ $d$-separated? We need to check the two paths
1. $Z$ to $T$ to $Y$
2. $Z$ to $W$ to $X$ to $Y$

Path 1 is not blocked: we have a fork $Z \leftarrow T \rightarrow Y$ where $T$ is not controlled. This matches our intuition earlier that $T$ is a confounder.

Path 2 _is_ blocked: we have a collider at $W$, and $W$ is not controlled for.

Since we have at least one path that isn't blocked, $Z$ and $Y$ are not $d$-separated. i.e. they will not be independent.

##### Control for just T?

What happens if we just control for $T$ (i.e. $\mathcal{C} = \{T\}$)?

Path 1 is blocked (controlling for the parent of a fork blocks the fork).

Path 2 is still blocked by the collider $W$.

All paths are blocked! So $Z$ and $Y$ are $d$-separated, conditional on $T$. Another way of saying this is $Z$ and $Y$ are conditionally independent on $T$.

##### Control for $T$ and $W$

The _only_ way of blocking path 1 is for $T$ to be controlled for.

If we control for $W$, now we have "unblocked" path 2, so now information can flow through $W$ to $X$. We see that there is nothing stopping the information from flowing from $X$ to $Y$, so the entire path is unblocked.

If we control for $T$ and $W$, $Z$ and $Y$ are _not_ d-separated (i.e. they are conditionally dependent on one another)

##### Overview

| Control for  | Are $Z$ and $Y$ d-separated? | Comments | 
| --- | --- | --- |
| T | Y | First example above, must control T to block path 1 |
| T, W | N | Second example, controlling W 'unblocks' path 2 |
| T, U | N | Controlling U (a descendant of W) has the same effect as controlling W |
| T, U, W | N | Same as above |
| T, W, X | Y | Controlling W unblocks the collider, but the path is blocked again at X |
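
The table can be reproduced mechanically. Below is a minimal sketch that enumerates the simple paths and applies the two blocking rules literally. The helper names (`descendants`, `simple_paths`, `is_blocked`, `d_separated`) are my own, and brute-force path enumeration is only sensible for small graphs like this one. The `dag` dict uses the same parent-to-children format that is passed to `draw_dag`.

```python
def descendants(dag, node):
    """All nodes reachable from `node` by following arrows."""
    seen, stack = set(), [node]
    while stack:
        for child in dag.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def simple_paths(dag, start, end):
    """Yield every simple path from start to end, ignoring edge direction."""
    neighbours = {}
    for parent, children in dag.items():
        for child in children:
            neighbours.setdefault(parent, set()).add(child)
            neighbours.setdefault(child, set()).add(parent)

    def walk(path):
        if path[-1] == end:
            yield path
            return
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in path:  # no repeated nodes
                yield from walk(path + [nxt])

    yield from walk([start])

def is_blocked(dag, path, controlled):
    """Apply the chain/fork rule and the collider rule to each inner node."""
    for left, mid, right in zip(path, path[1:], path[2:]):
        is_collider = mid in dag.get(left, []) and mid in dag.get(right, [])
        if is_collider:
            # A collider blocks unless it, or one of its descendants, is controlled
            if mid not in controlled and not (descendants(dag, mid) & controlled):
                return True
        elif mid in controlled:
            # A controlled chain or fork node blocks the path
            return True
    return False

def d_separated(dag, a, b, controlled):
    controlled = set(controlled)
    return all(is_blocked(dag, p, controlled) for p in simple_paths(dag, a, b))

dag = {'T': ['Z', 'Y'], 'Z': ['W'], 'W': ['U'], 'X': ['Y', 'W']}
for controlled in [set(), {'T'}, {'T', 'W'}, {'T', 'U'}, {'T', 'U', 'W'}, {'T', 'W', 'X'}]:
    label = ', '.join(sorted(controlled)) or '(nothing)'
    print(f"{label:10} d-separated: {d_separated(dag, 'Z', 'Y', controlled)}")
```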


## Why do we care

$d$-separation makes a testable claim about the graph we claim models the phenomena. In the DAG

In [4]:
draw_dag({'T': ['Z', 'Y'],
          'Z': ['W'],
          'W': ['U'],
          'X': ['Y', 'W']},
         positions=positions)


controlling for $T$ tells us that within levels of $T$ we should see $Y$ and $Z$ as independent, i.e.
$$P(Z | Y, T) = P(Z | T)$$
This is a testable implication, just given the raw data. i.e. we do have some sanity checks that our DAG is a reasonable description of reality.

#### Python check!
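
A minimal sketch of what such a check could look like, using simulated data rather than real data. I am assuming a linear model with unit-variance Gaussian noise and all coefficients equal to 1 (my choice, not Pearl's), and using partial correlation as a stand-in for conditional independence. That substitution is only valid for linear-Gaussian models like this one. Only the standard library is used; `pearson` and `partial_corr` are my own helpers.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, controls):
    """Correlation of x and y after linearly removing each control column,
    using the recursive formula for partial correlations."""
    if not controls:
        return pearson(x, y)
    *rest, c = controls
    r_xy = partial_corr(x, y, rest)
    r_xc = partial_corr(x, c, rest)
    r_yc = partial_corr(y, c, rest)
    return (r_xy - r_xc * r_yc) / math.sqrt((1 - r_xc ** 2) * (1 - r_yc ** 2))

rng = random.Random(0)
n = 20_000

def noise():
    return [rng.gauss(0, 1) for _ in range(n)]

# Simulate the DAG: T -> Z, T -> Y, Z -> W, X -> W, X -> Y, W -> U
T, X = noise(), noise()
Z = [t + e for t, e in zip(T, noise())]
Y = [t + x + e for t, x, e in zip(T, X, noise())]
W = [z + x + e for z, x, e in zip(Z, X, noise())]
U = [w + e for w, e in zip(W, noise())]

r_none = partial_corr(Z, Y, [])
r_t = partial_corr(Z, Y, [T])
r_tw = partial_corr(Z, Y, [T, W])
r_twx = partial_corr(Z, Y, [T, W, X])

print(f"no controls: {r_none:+.3f}")  # clearly non-zero (T confounds)
print(f"given T:     {r_t:+.3f}")     # near zero
print(f"given T, W:  {r_tw:+.3f}")    # non-zero again (collider opened)
print(f"given T,W,X: {r_twx:+.3f}")   # near zero (blocked again at X)
```

With real data we would run the same kind of test on the measured columns; a partial correlation far from zero given $T$ would be evidence against this DAG.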

More important, however, is the idea of blocking paths. When thinking about interventions, we will want to block all paths that have an arrow pointing _into_ the node we are changing, because our intervention will break that influence. Even though this is in Chapter 3, I think it is one of the most important applications.

## Intervention ($\S$ 3.1 and 3.2)

If we look at the DAG above, and you wanted to change $Y$, if you were offered an expensive change to your procedures that changed $Z$, without affecting $T$, would you take it? 

The answer should be "no": changing $Z$ makes downstream changes to $W$ and $U$, but never to $Y$ -- by directly modifying $Z$ we are breaking the previous relationship between $T$ and $Z$. Any arrows into $Z$ become irrelevant, because we have set $Z$ directly.

We can see that $Z$ has no effect just by looking at the DAG directly. If you enabled me to spend money to change $X$ or $T$, then I should consider it.

Let's go back to the Simpson's paradox case from the first chapter and look at interventions there. As a reminder, the data is


| Sex        | Drug A recoveries | Drug B  recoveries |  Total |
| ---------- | ----------------- | ------------------ | ------ |
| **Male**   | 74 out of 80 (93%)| 234 out of 270 (87%)| 350   |
| **Female** | 193 out of 270 (71%)| 54 out of 80 (68%)| 350 | 
| **Total**  | 267 out of 350 (76%)| 288 out of 350 (82%)| 700 |  

In [5]:
# DAG
draw_dag({'drug': ['recovery'], 'sex': ['drug', 'recovery']})


i.e. `sex` is a confounder, but `drug` also has a direct effect as well.

We want to know what would happen if we lived in a world where we gave everyone Drug A (instead of letting sex influence the choice). The only actual data we have to hand are the observed rates in the table above. The rules we use for intervening, if there is only a single intervention $X=x$, are

0. We delete all incoming edges into $X$, because we are setting its value by hand. This is the modified graph.
1. We are now dealing with a modified probability distribution $P_m$, the probability distribution that would exist if we had instead forced the intervention. It is described by the modified DAG.
2. Probabilities that don't include $X$, and are not conditioned on $X$, are the same in the modified and original distribution (i.e. we only affect probabilities where $X$ enters the picture -- we don't care how we got those values)
3. This is the trickiest one: the backdoor criterion. In this case it claims that 
  $$P_m(\text{recovery}|D, \text{SEX}) = P(\text{recovery}|D, \text{SEX})$$
  i.e. if we know the drug taken and the sex of the patient, the effect is the same regardless of whether the drug was imposed (intervention) or assigned by some other means (such as rationing by gender).

Let's first look at the observational data: what is P(recover|drug A) - P(recover | drug B)? We have
$$P(\text{recover}|\text{drug A}) = 267/350\approx 0.76, \quad\quad P(\text{recover}|\text{drug B}) = 288/350\approx 0.82$$
so the observed difference in treatment is $0.76-0.82 = -0.06$. i.e. a randomly drawn person from the data set has a 6pp higher chance of recovering if they were on drug B. As discussed last time, this is because drug B is largely made up of male patients, and male patients have a better chance of recovering on either drug. It doesn't tell us what to expect if we forced everyone onto drug A, vs forcing everyone onto drug B.

To answer the second question, we need the modified probabilities. The modified DAG is

In [6]:
# delete all arrows into "drug", as we set that by hand
draw_dag({'drug': ['recovery'], 'sex': ['recovery']})


We have
\begin{align*}
P_m(\text{recover} | D) &= \sum_{\text{sex}\in\{M,F\}} P_m(\text{recover} | D, \,\text{sex}) P_m(\text{sex}|D)&\text{[Law of total probability, valid for any prob distribution]}\\
&=  \sum_{\text{sex}\in\{M,F\}} P_m(\text{recover} | D, \,\text{sex}) P_m(\text{sex})&\text{[In modified graph, sex is independent of D]}\\
&= \sum_{\text{sex}\in\{M,F\}} P_m(\text{recover} | D, \,\text{sex}) P(\text{sex}) & \text{[P(sex) doesn't use the intervening variable]}\\
&= \sum_{\text{sex}\in\{M,F\}} P(\text{recover} | D, \,\text{sex}) P(\text{sex}) &\text{[Backdoor argument, doesn't matter how sex and D are set]}
\end{align*}



Later, in chapter 3, we make the backdoor argument more generalizable. If we buy into this, however, we get the modified distribution on the left, and only observational probabilities on the right. If we forced everyone to take drug A:

\begin{align*}
P_m(\text{recover}|D=a) &= P(\text{recover}|D=a,\,\,S=M)P(M) + P(\text{recover}|D=a,\,\,S=F)P(F)\\
&= (0.93)(0.5) + (0.71)(0.5)\\
&= 0.82
\end{align*}
A similar calculation gives $P_m(\text{recover}|D=b) = 0.775$, and we see the difference between a world where everyone took drug A and everyone took drug B is $0.82 - 0.775 = 0.045$; i.e. about 4.5pp more people would recover in a world where everyone took drug A (about 4.9pp if the unrounded recovery rates are used).

i.e. this calculation told us to calculate the average rate by sex, then take the straight average of the two sexes.
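
The same calculation as a short sketch, working directly from the counts in the table. The function names are mine. Because it uses the unrounded recovery rates, the adjusted difference comes out slightly different from a hand calculation with rounded percentages.

```python
# (recovered, total) counts from the table, keyed by (sex, drug)
counts = {
    ("M", "A"): (74, 80), ("M", "B"): (234, 270),
    ("F", "A"): (193, 270), ("F", "B"): (54, 80),
}
sexes, drugs = ("M", "F"), ("A", "B")
n_total = sum(total for _, total in counts.values())

def p_sex(sex):
    """P(sex): share of all patients with this sex."""
    return sum(counts[sex, d][1] for d in drugs) / n_total

def observed(drug):
    """P(recover | drug): pooled recovery rate, as seen in the raw data."""
    recovered = sum(counts[s, drug][0] for s in sexes)
    total = sum(counts[s, drug][1] for s in sexes)
    return recovered / total

def adjusted(drug):
    """P_m(recover | drug) = sum over sex of P(recover | drug, sex) * P(sex)."""
    return sum(counts[s, drug][0] / counts[s, drug][1] * p_sex(s) for s in sexes)

observed_diff = observed("A") - observed("B")
adjusted_diff = adjusted("A") - adjusted("B")
print(f"observed A - B: {observed_diff:+.3f}")
print(f"adjusted A - B: {adjusted_diff:+.3f}")
```

The sign flips between the two lines: drug B looks better in the pooled data, drug A looks better once we average within each sex.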

### What does the blood pressure example look like?

In the blood pressure example, we had the same numerical values, but concluded the right thing to do was to ignore blood pressure and pool everyone. Specifically we had the table


| post-treatment BP  | Drug A recoveries | Drug B  recoveries |  Total |
| ---------- | ----------------- | ------------------ | ------ |
| **Low**   | 74 out of 80 (93%)| 234 out of 270 (87%)| 350   |
| **High** | 193 out of 270 (71%)| 54 out of 80 (68%)| 350 | 
| **Total**  | 267 out of 350 (76%)| 288 out of 350 (82%)| 700 |  

and the DAG

In [7]:
draw_dag({'drug': ['BP', 'recovery'], 'BP':['recovery']})


What is $P_m(\text{recovery}|D)$, where we force everyone to take the drug A or B? 

- There are no arrows coming into `drug`, so the modified DAG is the same.
- $P_m(BP|D)=P(BP|D)$ as it doesn't matter how $D$ got its value. This edge didn't exist in the sex/drug case.
- $P_m(\text{recovery} | BP, D) = P(\text{recovery}|BP, D)$, as recovery only depends on the values of $D$ and $BP$, not how they got there. This is the same argument as the sex / drug case.

Okay, so we have
\begin{align*}
P_m(\text{recovery}|D) &= \sum_{\text{BP}}P_m(\text{recovery}|D,\, BP)P_m(BP | D) &\text{[Law of total probability]}\\
&= \sum_{\text{BP}}P(\text{recovery}|D,\, BP)P(BP | D) &\text{[Bullet points 2 and 3]}\\
&= P(\text{recovery}|D) &\text{[Law of total probability]}
\end{align*}
i.e. for this DAG, the observed difference in drug effects is the same as if we intervened (-6pp). 
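
As a quick numerical sketch of that last step, weighting the within-BP recovery rates by $P(BP | D)$ just gives back the pooled rate. The counts come from the table above; the function names are mine.

```python
# (recovered, total) counts from the table, keyed by (post-treatment BP, drug)
counts = {
    ("low", "A"): (74, 80), ("high", "A"): (193, 270),
    ("low", "B"): (234, 270), ("high", "B"): (54, 80),
}
bps = ("low", "high")

def n_on(drug):
    """Number of patients who took this drug."""
    return sum(counts[bp, drug][1] for bp in bps)

def observed(drug):
    """P(recovery | D): pooled recovery rate for this drug."""
    return sum(counts[bp, drug][0] for bp in bps) / n_on(drug)

def p_m_recovery(drug):
    """Sum over BP of P(recovery | D, BP) * P(BP | D)."""
    return sum(
        (counts[bp, drug][0] / counts[bp, drug][1])  # P(recovery | D, BP)
        * (counts[bp, drug][1] / n_on(drug))         # P(BP | D)
        for bp in bps
    )

for drug in ("A", "B"):
    print(drug, round(observed(drug), 4), round(p_m_recovery(drug), 4))
```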

### What just happened?

In the sex/drug case, we didn't have $P_m(\text{sex}|D)$ being the same as $P(\text{sex}|D)$. In the observational data set, your sex had an influence over which drug you were given. In the intervention case, everyone was given the same drug. That leads to a difference between the modified (intervention) and observational distributions.

In the blood pressure case, the BP was set by the drug that you took, regardless of how that drug was assigned. The distribution would be the same for both mechanisms.

## Making the backdoor criterion feel more formal

It may feel a little ad-hoc which probabilities in the modified distribution are the same, and which ones are different. If we have an intervention on a variable $X$, probabilities that don't involve $X$ _and_ are not conditioned on $X$ are the same in the modified and unmodified case.

A simplified version of the backdoor criterion is, if we intervene to set $X=x$
$$P_m(Y | X=x, Z) = P(Y| X=x, Z) \text{ if $Y$ and $X$ are conditionally independent after (i) removing edges FROM X and (ii) controlling for Z}$$
i.e. this is one place where we can ignore the difference between "setting" and "observing".

To gain intuition about why this is the case, consider how in an observational setting we could have $X$ and $Y$ be related:
1. Changes in X could change Y directly
2. Changes in X could change other things, that could change Y
3. Changes in confounders could change X and Y together, giving a non-causal contribution.

If we intervene, paths 1 and 2 are still valid ways for changes in X to change Y, but contributions from paths of type 3 will differ, because we have stopped $X$ from being influenced by other things. So we want to identify if we have blocked all possible paths that contribute to 3.

Let's search for the dangerous paths and make sure they are blocked. All paths of type 1 and 2 start with edges coming from X. We want those effects, and they are not the dangerous paths we are looking for, so throw those away for now (we only want to identify dangerous paths). In this modified graph, are $X$ and $Y$ $d$-separated, after controlling for $Z$? If the answer is yes, then it makes no difference how $X$ gets its value as 
$$P(Y | X, Z) = P(Y | Z)$$
by definition of $d$-separation. If the answer is no, then changing $X$ will change $Y$ along these "unwanted" paths -- i.e. some of the type 3 information will leak, and the "intervention" probabilities and the "observation" probabilities will differ.

Note that our backdoor formula actually works for a _set_ of nodes $Z$, not just a single node. Let's go through a complex-ish DAG and see how to apply it.

In [8]:
positions = {'sex': [50,50], 'weight': [300,50], 'drug': [50,300], 'recovery': [300,300]}

draw_dag({'weight': ['recovery'], 
          'drug': ['recovery', 'weight'],
          'sex': ['drug', 'weight']}, 
         positions=positions)
    


Should we condition on sex, or can we just use observational data to determine the drug effectiveness?

First we eliminate the actual effects we are interested in (recall we are looking for nuisance paths). This gives us

In [9]:
draw_dag({'weight': ['recovery'], 
          'drug': [],
          'sex': ['drug', 'weight']}, 
         positions=positions)


We see that if nothing is controlled for, drug is not independent of anything on this graph, so
$$P_m(\text{anything} | \text{drug}) \neq P(\text{anything} | \text{drug})$$
In particular, we would expect the distribution $P_m(\text{recovery}|\text{drug})$ based on intervention to _not_ match the observed distribution. 

On the other hand, controlling either `sex` or `weight` blocks this path, so we have
$$P_m(\text{recovery}|\text{drug}, \text{sex}) = P(\text{recovery}|\text{drug}, \text{sex})$$
and
$$P_m(\text{recovery}|\text{drug}, \text{weight}) = P(\text{recovery}|\text{drug}, \text{weight})$$
and
$$P_m(\text{recovery}|\text{drug}, \text{sex}, \text{weight}) = P(\text{recovery}|\text{drug}, \text{sex}, \text{weight})$$


### Example (Problem 3.3.1 in Pearl)

1. List all of the sets of variables that satisfy the backdoor criterion to determine the causal
effect of X on Y.
2. List all of the minimal sets of variables that satisfy the backdoor criterion to determine the causal effect of X on Y (i.e., any set of variables such that, if you removed any one of the variables from the set, it would no longer meet the criterion).
3. List all minimal sets of variables that need be measured in order to identify the effect of D on Y. Repeat, for the effect of {W, D} on Y

In [13]:
COL0, COL1, COL2 = 100, 300, 500
ROW0, ROW1, ROW2 = 100, 300, 500

positions = {'B': [COL0, ROW0], 'C': [COL2, ROW0],
             'A': [COL0, ROW1], 'Z': [COL1, ROW1], 'D':[COL2, ROW1],
             'X': [COL0, ROW2], 'W': [COL1, ROW2], 'Y':[COL2, ROW2]}
draw_dag({'A': ['X'], 'B': ['A', 'Z'], 'C': ['D', 'Z'],
          'D': ['Y'], 'W': ['Y'], 'X' :['W'], 'Z': ['X','Y']}, positions=positions)


To answer 1, let's isolate the dangerous paths and find sets of variables that block the remaining paths

In [14]:
draw_dag({'A': ['X'], 'B': ['A', 'Z'], 'C': ['D', 'Z'],
          'D': ['Y'], 'W': ['Y'], 'X' :[], 'Z': ['X','Y']}, positions=positions)


The paths from $X$ to $Y$:
- Path 1: X to A to B to Z to Y
- Path 2: X to A to B to Z to C to D to Y
- Path 3: X to Z to Y
- Path 4: X to Z to C to D to Y

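Path 3 is blocked if and only if we control for Z. So Z is always in the conditioned set.

Controlling for Z also blocks Path 1 and Path 4, because Z is the middle of a chain on each of them ($B \rightarrow Z \rightarrow Y$ on Path 1, and $C \rightarrow Z \rightarrow X$ on Path 4).

Path 2 is the troublesome one: Z is a collider on it ($B \rightarrow Z \leftarrow C$), so controlling for Z (which we have to do) unblocks it. Path 2 is blocked again if at least one of A, B, C or D is conditioned on.

W is a descendant of X, and the full backdoor criterion excludes descendants of X from the conditioned set, so W never appears.

Blocking sets are therefore Z together with any non-empty subset of {A, B, C, D}:
- {Z, A}
- {Z, B}
- {Z, C}
- {Z, D}
- {Z, A, B}
- {Z, A, C}
- {Z, A, D}
- {Z, B, C}
- {Z, B, D}
- {Z, C, D}
- {Z, A, B, C}
- {Z, A, B, D}
- {Z, A, C, D}
- {Z, B, C, D}
- {Z, A, B, C, D}

Minimal sets are 
- {Z, A}
- {Z, B}
- {Z, C}
- {Z, D}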
