In [1]:
%%javascript
require.config({
    paths: { 
        d3: 'https://d3js.org/d3.v5.min'
    }
});

<IPython.core.display.Javascript object>

In [2]:
from IPython.display import display, Javascript, HTML
import json

# Load the stylesheet and the D3-based tree-drawing module
display(HTML(filename='tree.css.html'))
display(Javascript(filename="tree.js"))

<IPython.core.display.Javascript object>

In [3]:
def draw_network(data, width=600, height=400):
    """Uses D3 to draw a graph
    
    Parameters:
    data: List of Lists
          Each list contains links of the form [parent_node_label, child_node_label]
          
    Is unable to draw nodes not connected to any other edges
    """
    display(Javascript("""
        (function(element){
            require(['tree'], function(tree) {
                tree(element.get(0), %s, %d, %d);
            });
        })(element);
    """ % (json.dumps(data), width, height)))
    

def draw_dag(parents_to_children, width=600, height=400):
    """Draws a DAG
    
    Parameters:
    parents_to_children: Dictionary
                         Keys are nodes, values are a list of children
    """
    network_data = []
    for parent, children in parents_to_children.items():
        links = [[parent, child] for child in children]
        network_data.extend(links)
    draw_network(network_data, width, height)
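
To make the conversion inside `draw_dag` concrete, here is a minimal sketch of the same flattening step without any D3 drawing. The helper name `dag_to_edges` is hypothetical and only used for illustration.

```python
def dag_to_edges(parents_to_children):
    """Flatten {parent: [children]} into a list of [parent, child] pairs."""
    return [[parent, child]
            for parent, children in parents_to_children.items()
            for child in children]


example_edges = dag_to_edges(
    {'A': ['B', 'C'], 'B': [], 'C': ['D', 'E'], 'D': [], 'E': []})
print(example_edges)
```

A node only shows up in the edge list if it is the parent or the child of some edge, which is why an isolated node cannot be drawn.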

In [4]:
# Example of usage
draw_dag({'A': ['B', 'C'], 
          'B': [], 
          'C': ['D', 'E'], 
          'D': [], 
          'E': []})

<IPython.core.display.Javascript object>

## Start causal inference / data science imports

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression

%matplotlib inline

Let's look at an example from DAGitty's [Learn Simpson's Paradox](http://dagitty.net/learn/simpson/) page.

This is the machine they have with 2 levels. By convention:
- `X` is the "treatment". This is the thing we are looking to directly change and intervene on.
- `Y` is the "effect"
- `Z1` to `Z5` are features of our model that we have measured
- `U` is an unknown (unmeasured) quantity, so we cannot control for it

Information only flows along the arrows. If we know what the value of a variable is, we don't care _how_ it got that value.

The DAG layout is generated randomly; it is probably clearest if you rearrange it as
```
      Z1 

U     Z2     Z3

      Z4     Z5

X             Y
```
to see what the arrows are doing.

In [6]:
draw_dag({
    'Z1': ['Z3', 'U'],
    'Z3': ['Z5', 'Z2'],
    'Z5': ['Z4', 'Y'],
    'U': ['Z4', 'X', 'Z2'],
    'Z4': [],
    'Z2': [],
    'X': ['Y'],
    'Y': []})

<IPython.core.display.Javascript object>

We see the only arrow _from_ `X` is to `Y`. This is the _direct effect_ of `X` on `Y` (which is what we will try to measure). We are going to simulate this as
$$\Delta Y = 2 \Delta X$$
in our data generation code below.

Note that if `X` gets large because of random variation, or because we intervene and make it large, then `Y` will change by twice the change in `X`.

If `X` gets large because `U` is large (and `U` affects `X`), then `Y` will typically change by a different amount: `U` shares the ancestor `Z1` with `Z3` and `Z5`, so a large `U` tends to come with a large `Z5`, which also affects `Y`.
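
The difference between observing a large `X` and setting `X` ourselves can be sketched with a small simulation. This is an illustration only: it assumes the linear mechanism with normal noise used in the data generation code later in this notebook (noise scale 2.2, coefficient 10 on `Z5`), and the names `simulate`, `slope`, and `do_x` are hypothetical. `Z2` and `Z4` are left out because they affect neither `X` nor `Y`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sd = 100_000, 2.2


def simulate(rng, n, sd, do_x=None):
    """Draw (X, Y); if do_x is given, X is set by hand instead of by U."""
    z1 = rng.normal(scale=sd, size=n)
    z3 = rng.normal(scale=sd, size=n) + z1
    z5 = rng.normal(scale=sd, size=n) + z3
    u = rng.normal(scale=sd, size=n) + z1
    x = rng.normal(scale=sd, size=n) + u if do_x is None else do_x
    y = rng.normal(scale=sd, size=n) + 2 * x + 10 * z5
    return x, y


def slope(x, y):
    """Slope of the simple regression of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


x_obs, y_obs = simulate(rng, n, sd)
x_do, y_do = simulate(rng, n, sd, do_x=rng.normal(scale=sd, size=n))

observational_slope = slope(x_obs, y_obs)    # inflated by the path through U
interventional_slope = slope(x_do, y_do)     # close to the direct effect of 2
print(observational_slope, interventional_slope)
```

When `X` is set independently of `U`, the slope should land near 2; when `X` is only observed, it comes out noticeably larger.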

In some DAGs there will be an _indirect_ effect of `X` on `Y`, where `X` affects something that affects something that ... that affects `Y`. In this case, there are no indirect effects of `X` on `Y`.

In this DAG:
| Node | Direct Effect on Y | Indirect Effect on Y |
| --- | --- | --- |
| X | Yes | No |
| Z1 | No | Yes |
| Z2 | No | No |
| Z3 | No | Yes |
| Z4 | No | No |
| Z5 | Yes | No |
| U | No | Yes |

It is possible to have a DAG where a node has a direct effect _and_ an indirect effect, but there are no examples in the DAG above. If there were an edge from `Z3` to `Y` in this DAG, then `Z3` would have both a direct and an indirect effect.
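
The table above can be reproduced mechanically from the DAG dictionary. The sketch below is one way to do it in plain Python; the helper names `reaches` and `effect_types` are hypothetical. A node has a direct effect if `Y` is one of its children, and an indirect effect if some other child of the node leads to `Y`.

```python
dag = {
    'Z1': ['Z3', 'U'], 'Z3': ['Z5', 'Z2'], 'Z5': ['Z4', 'Y'],
    'U': ['Z4', 'X', 'Z2'], 'Z4': [], 'Z2': [], 'X': ['Y'], 'Y': [],
}


def reaches(dag, start, target):
    """True if a directed path with at least one edge leads from start to target."""
    stack, seen = list(dag.get(start, [])), set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(dag.get(node, []))
    return False


def effect_types(dag, node, target='Y'):
    """Return (has_direct_effect, has_indirect_effect) of node on target."""
    children = dag.get(node, [])
    direct = target in children
    indirect = any(reaches(dag, child, target)
                   for child in children if child != target)
    return direct, indirect


effects = {node: effect_types(dag, node) for node in dag if node != 'Y'}
for node, (direct, indirect) in sorted(effects.items()):
    print(node, 'direct' if direct else '-', 'indirect' if indirect else '-')
```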

### Meaning of the labels

Here is an example of a data generation process that follows this DAG. A variable can only be defined in terms of newly generated random values (in this case, always drawn from a normal distribution) and the random variables that are its parents.

e.g. 
```
Z5 = F(np.random.<something> , Z3)
```
can only happen because `Z3` is a parent of `Z5` in the DAG. We cannot have `Z5 = F(np.random.<something>, Z3, Z1)` because `Z1` is not a parent of `Z5`.

**Simplification**

Here we have assumed that all random number generation processes are normal, and all functions are linear. These are not necessary assumptions for causal reasoning.

In [7]:
def simpson_three_level_data_generation(samples, stddev, direct_causal_effect):
    """Simulate the DAG above with linear effects and normal noise.

    `U` is generated but left out of the returned frame because it is unobserved.
    """
    Z1 = np.random.normal(scale=stddev, size=samples)
    Z3 = np.random.normal(scale=stddev, size=samples) + Z1
    Z5 = np.random.normal(scale=stddev, size=samples) + Z3
    U  = np.random.normal(scale=stddev, size=samples) + Z1
    Z4 = np.random.normal(scale=stddev, size=samples) + Z5 + U
    Z2 = np.random.normal(scale=stddev, size=samples) + Z3 + U
    X  = np.random.normal(scale=stddev, size=samples) + U
    Y  = np.random.normal(scale=stddev, size=samples) + direct_causal_effect*X + 10*Z5
    return pd.DataFrame({
        'X': X,
        'Z1': Z1,
        'Z2': Z2,
        'Z3': Z3,
        'Z4': Z4,
        'Z5': Z5,
        'Y': Y})

In [8]:
# Example of using regression, controlling for everything
data = simpson_three_level_data_generation(10000, 2.2, direct_causal_effect = 2)
features = data.drop('Y', axis=1)
target = data.Y

lr = LinearRegression().fit(features, target)

dict(zip(features.columns, lr.coef_))

{'X': 2.0091631376790082,
 'Z1': 0.0016256258107629543,
 'Z2': -0.004632465092770681,
 'Z3': 0.006305195305137504,
 'Z4': -0.009620455253360334,
 'Z5': 10.013350686446367}

Controlling for everything, we haven't done too poorly at recovering the causal effect of `X` on `Y` (about 2.01 against a true value of 2). Now let's not control for anything:

In [9]:
features = data[['X']]
target = data.Y

lr = LinearRegression().fit(features, target)

dict(zip(features.columns, lr.coef_))

{'X': 5.297741133292554}
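
A rough check of where a value near 5.3 comes from, assuming (as in the generating code) that every noise term $\varepsilon$ is independent with the same variance $\sigma^2$. The slope of a simple regression of `Y` on `X` is $\operatorname{Cov}(X, Y) / \operatorname{Var}(X)$, and substituting the generating equations gives:

$$
\begin{aligned}
X &= \varepsilon_X + \varepsilon_U + Z_1, \qquad Z_5 = \varepsilon_5 + \varepsilon_3 + Z_1, \qquad Y = \varepsilon_Y + 2X + 10 Z_5 \\
\operatorname{Var}(X) &= 3\sigma^2, \qquad \operatorname{Cov}(X, Z_5) = \operatorname{Var}(Z_1) = \sigma^2 \\
\frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)} &= \frac{2 \cdot 3\sigma^2 + 10\sigma^2}{3\sigma^2} = \frac{16}{3} \approx 5.33
\end{aligned}
$$

The extra $10/3$ on top of the direct effect of 2 is the part of the association that flows through the shared ancestor `Z1`.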

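It maybe isn't surprising that controlling for nothing overstates the effect: the coefficient on `X` also absorbs the association between `X` and `Y` that runs through `U`, `Z1`, `Z3`, and `Z5`. Let's try just controlling for `Z1`: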

In [10]:
features = data[['X', 'Z1']]
target = data.Y

lr = LinearRegression().fit(features, target)

dict(zip(features.columns, lr.coef_))

{'X': 1.9805796621256535, 'Z1': 9.883481408270647}

Doing better. In fact, we can make a function to control for different subsets of variables:

In [11]:
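def get_coefficients(df, feature_names=None, target_name='Y'):
    """Regress `target_name` on `feature_names` and return the fitted coefficients.

    With `feature_names=None`, every column other than the target is used.
    Returns a one-row DataFrame with one column per feature.
    """
    if feature_names is None:
        feature_names = [col for col in df.columns if col != target_name]
    features = df[feature_names]
    target = df[target_name]
    lr = LinearRegression().fit(features, target)
    coef = dict(zip(feature_names, lr.coef_))
    return pd.Series(coef).to_frame().T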

Let's redo the previous calculations. Recall that by construction the causal effect of `X` on `Y` is 2:

In [12]:
get_coefficients(data)

|   | X | Z1 | Z2 | Z3 | Z4 | Z5 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2.009163 | 0.001626 | -0.004632 | 0.006305 | -0.00962 | 10.013351 |


In [13]:
get_coefficients(data, ['X'])

|   | X |
| --- | --- |
| 0 | 5.297741 |


In [14]:
get_coefficients(data, ['X', 'Z1'])

|   | X | Z1 |
| --- | --- | --- |
| 0 | 1.98058 | 9.883481 |


Let's look at some other conditions:

In [15]:
get_coefficients(data, ['X', 'Z1', 'Z3', 'Z5'])

|   | X | Z1 | Z3 | Z5 |
| --- | --- | --- | --- | --- |
| 0 | 2.002017 | -0.005255 | 0.001396 | 10.003829 |


We get a _great_ answer simply by controlling for `Z5`:

In [16]:
get_coefficients(data, ['X', 'Z5'])

|   | X | Z5 |
| --- | --- | --- |
| 0 | 2.000857 | 10.003389 |


If we control for `Z1`, `Z2`, and `Z3` we get a good estimate:

In [17]:
get_coefficients(data, ['X', 'Z1', 'Z2', 'Z3'])

|   | X | Z1 | Z2 | Z3 |
| --- | --- | --- | --- | --- |
| 0 | 2.061902 | 0.096563 | -0.036475 | 9.846974 |


Note that more conditioning can make things worse. Suppose we decide to add `Z4` "just to be safe". Our estimate of the (direct) effect of `X` on `Y` gets _worse_:

In [18]:
get_coefficients(data, ['X', 'Z1', 'Z2', 'Z3', 'Z4'])

|   | X | Z1 | Z2 | Z3 | Z4 |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.602703 | -1.322824 | -1.420422 | 7.019978 | 4.26782 |
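
A possible explanation, read off the DAG: `Z4` is a common child (a collider) of `U` and `Z5`, so conditioning on it links `X` (through `U`) to `Y` (through `Z5`) along a path that was otherwise closed. The toy sketch below illustrates the mechanism on three hypothetical variables `a`, `b`, and `c` that are not part of the notebook's data; `ols_coefficients` is likewise a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
a = rng.normal(size=n)            # plays the role of one cause
b = rng.normal(size=n)            # independent of a by construction
c = a + b + rng.normal(size=n)    # common child (collider) of a and b


def ols_coefficients(features, target):
    """Least-squares fit with an intercept; returns the non-intercept coefficients."""
    design = np.column_stack([np.ones(len(target))] + list(features))
    solution, *_ = np.linalg.lstsq(design, target, rcond=None)
    return solution[1:]


coef_without_c = ols_coefficients([a], b)     # a tells us nothing about b
coef_with_c = ols_coefficients([a, c], b)     # with c included, a looks "harmful"
print(coef_without_c, coef_with_c)
```

Even though `a` and `b` are independent, adding their common child to the regression produces a clearly negative coefficient on `a`.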


## A more concrete example: Salary and Experience

In many community colleges, the relationship between salary and experience is given by a table. Here we replace the table with a simple formula: the amount you get paid (in some units) is 3 times your years of experience plus twice your education level (0 = high school, 1 = bachelor's, 2 = graduate degree).

Each education level is assumed to take 4 more years of school than the one before it.

In addition, someone might have taken a year or two as gap years, not gaining experience nor an education.

Let's put this together as a data generation exercise:

In [19]:
N = 3000

education = np.random.choice([0, 1, 2], p=[0.05, 0.75, 0.2], size=N)
experience = np.random.binomial(25, p=0.8, size=N)
gap_years = np.random.choice([0, 1, 2], p=[0.2, 0.7, 0.1], size=N)

age = 18 + 4*education + experience + gap_years

salary = 3*experience + 2*education

df = pd.DataFrame({
    'education': education, 'experience': experience, 'gap': gap_years, 'age': age, 'salary': salary
})

Each feature is a node in a DAG. If the equation for a feature uses the values of other features, those features must be its parents in the DAG.
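
That rule can be applied mechanically. The sketch below lists the inputs of each equation from the data generation cell and inverts them into the `{parent: [children]}` form that `draw_dag` expects; the names `children_from_parents` and `equation_inputs` are hypothetical.

```python
def children_from_parents(parents):
    """Invert {node: [parents]} into {parent: [children]}."""
    children = {}
    for node, node_parents in parents.items():
        children.setdefault(node, [])
        for parent in node_parents:
            children.setdefault(parent, []).append(node)
    return children


# Right-hand-side variables of each equation in the data generation cell
equation_inputs = {
    'age': ['education', 'experience', 'gap_years'],
    'salary': ['experience', 'education'],
}
salary_dag = children_from_parents(equation_inputs)
print(salary_dag)
```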

The corresponding DAG is:

In [20]:
draw_dag({'education': ['salary', 'age'],
          'experience': ['age', 'salary'],
          'gap_years': ['age'],
          'salary': []})

<IPython.core.display.Javascript object>

By rearranging the DAG, you should be able to see that `salary` only depends on `education` and `experience` (directly), and that the other variables are not relevant. Taking our data:

In [21]:
get_coefficients(df, ['education', 'experience'], 'salary')

|   | education | experience |
| --- | --- | --- |
| 0 | 2.0 | 3.0 |


Naturally we get the answer exactly right! What if we (incorrectly) control for age? Does controlling for an irrelevant variable matter?

In [22]:
get_coefficients(df, ['education', 'experience', 'age'], 'salary')

|   | education | experience | age |
| --- | --- | --- | --- |
| 0 | 2.0 | 3.0 | 1.877894e-16 |


In this case, no. What about gap year?

In [23]:
get_coefficients(df, ['education', 'experience', 'gap'], 'salary')

|   | education | experience | gap |
| --- | --- | --- | --- |
| 0 | 2.0 | 3.0 | 2.740494e-16 |


Great! It seems to make no difference.

What if we control for "everything" (i.e. both `age` and `gap` together, rather than each separately)?

In [24]:
get_coefficients(df, ['education', 'experience', 'gap', 'age'], 'salary')

|   | education | experience | gap | age |
| --- | --- | --- | --- | --- |
| 0 | -0.315789 | 2.421053 | -0.578947 | 0.578947 |


Oh no!! We now see a _negative_ coefficient on education, even though by construction more education raises salary!
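
One way to see what happened: `age` is an exact linear combination of `education`, `experience`, and `gap`, so once all four are included the regression no longer has a unique solution. The sketch below regenerates data the same way (with its own random generator, so the rows differ from `df`) and uses NumPy only. It assumes the solver returns the minimum-norm least-squares solution on centred data, which appears consistent with the coefficients shown above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3000
education = rng.choice([0, 1, 2], p=[0.05, 0.75, 0.2], size=n)
experience = rng.binomial(25, p=0.8, size=n)
gap = rng.choice([0, 1, 2], p=[0.2, 0.7, 0.1], size=n)
age = 18 + 4 * education + experience + gap
salary = 3 * experience + 2 * education

features = np.column_stack([education, experience, gap, age]).astype(float)
centered = features - features.mean(axis=0)

# Four columns, but only three independent directions
rank = np.linalg.matrix_rank(centered)

# Minimum-norm least-squares solution on the centred data
coef, *_ = np.linalg.lstsq(centered, salary - salary.mean(), rcond=None)

# Every exact solution has the form (2 - 4t, 3 - t, -t, t);
# its norm is smallest at t = 11/19.
t = 11 / 19
min_norm = np.array([2 - 4 * t, 3 - t, -t, t])
print(rank, coef, min_norm)
```

All of these solutions predict `salary` equally well, so the individual coefficients stop being interpretable; only combinations such as the `education` coefficient plus 4 times the `age` coefficient (which equals 2) are pinned down.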