[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ksachdeva/rethinking-tensorflow-probability/blob/master/notebooks/06_the_haunted_dag_and_the_causal_terror.ipynb)

# Chapter 6 - The Haunted DAG and The Causal Terror (Work in Progress)

## Imports and utility functions

In [None]:
# Install packages that are not installed in colab
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

if IN_COLAB:
    %tensorflow_version 2.X
    
    !pip install --upgrade daft
    !pip install causalgraphicalmodels
    !pip install watermark
    !pip install arviz
    !pip install tensorflow_probability==0.9.0

In [None]:
%load_ext watermark

In [None]:
from functools import partial

# Core
import numpy as np
import arviz as az
import pandas as pd
import tensorflow as tf
import tensorflow_probability as tfp
import scipy.stats as stats
from scipy.interpolate import griddata
from scipy.interpolate import BSpline

# visualization 
import daft
import matplotlib.pyplot as plt
from causalgraphicalmodels import CausalGraphicalModel


# aliases
tfd = tfp.distributions
Root = tfd.JointDistributionCoroutine.Root

In [None]:
%watermark -p numpy,tensorflow,tensorflow_probability,arviz,scipy,pandas,daft,causalgraphicalmodels

In [None]:
# config of various plotting libraries
%config InlineBackend.figure_format = 'retina'
az.style.use('arviz-darkgrid')

### TensorFlow MCMC sampling helpers


In [None]:
# Helper functions to run MCMC Sampling
def trace_to_arviz(trace, sample_stats=None):
    posterior = {k: v.numpy() for k, v in trace.items()}
    if sample_stats:
        sample_stats = {k: v.numpy().T for k, v in sample_stats.items()}
    return az.from_dict(posterior=posterior, sample_stats=sample_stats)

@tf.function
def run_hmc_chain(log_posterior, inits=dict(), step_size=1, burn_in=2000, num_samples=5000):
    
  adaptation_steps = int(0.5 * burn_in)
  
  hmc = tfp.mcmc.SimpleStepSizeAdaptation(
      # The actual HMC is very simple to define
      tfp.mcmc.HamiltonianMonteCarlo(
          target_log_prob_fn = log_posterior,    # Log Posterior goes here
          num_leapfrog_steps = 3,
          step_size = step_size                  # initial step size (adapted during burn-in)
      ),
      num_adaptation_steps = adaptation_steps
  )
    
  results = tfp.mcmc.sample_chain(
      num_results = num_samples,
      num_burnin_steps = burn_in,
      current_state = list(inits.values()),      # pass the state parts as a list of tensors
      kernel = hmc,
      trace_fn=None
  )
  return results

def sample_mcmc_hmc(log_posterior, inits=dict(), step_size=0.1, burn_in=1000, num_samples=1000):
    results = run_hmc_chain(log_posterior, inits, step_size, burn_in, num_samples)
    # convert results to numpy arrays for easy processing later on
    numpy_results = list(map(lambda r : r.numpy(), results))
    posterior = dict(zip(list(inits.keys()), numpy_results))    
    return posterior

def sample_from_posterior(jdc, observed_data, params, step_size=0.1):    
    """ Helper function to sample from the posterior distribution given the observed value"""
    
    # we get the sample from our model definition
    # that will act as the init state for the chain
    samples = jdc.sample()    
    samples_without_outcome = samples[:-1]
    
    # build a dictionary using the params
    init_state = dict(zip(params, samples_without_outcome))

    target_log_prob_fn = lambda *x: jdc.log_prob(x + (observed_data,))

    return sample_mcmc_hmc(target_log_prob_fn,inits=init_state,step_size=step_size)
    
def squeeze_post(post):
    return {k : v.squeeze() for k, v in post.items()}

### DATASET URLS

In [None]:
# You could change the base url to a local dir or to remote raw GitHub content
_BASE_URL = "https://raw.githubusercontent.com/rmcelreath/rethinking/Experimental/data"

WAFFLE_DIVORCE_DATASET_PATH = f"{_BASE_URL}/WaffleDivorce.csv"
MILK_DATASET_PATH = f"{_BASE_URL}/milk.csv"
HOWELL_DATASET_PATH = f"{_BASE_URL}/Howell1.csv"

## Code 6.1


**Berkson's paradox** - A *false* observation of a *negative* correlation between two positive traits.

Members of a population that have some positive trait tend to lack a second one, even though:

* The traits may be unrelated

* Or they may even be positively related.

e.g. restaurants in good locations tend to have bad food, even though location and food quality are not correlated.

The author gives **Berkson's paradox** another name: the **selection-distortion effect**.

The gist of the idea is that when a sample is selected on a combination of two (or more) variables, the relationship between those variables is different after selection than it was before.


Because of this, the author suggests that we should always be cautious about adding more predictor variables to a regression, as doing so may introduce statistical selection within the model. This phenomenon is called **COLLIDER BIAS**.

He suggests looking at the **causal model** as the remedy.

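There are three types of hazard when we add more predictor variables:

* Multicollinearity
* Post-treatment bias
* Collider bias

Collider - the name comes from physics, where a collider is a machine in which two beams of particles are made to collide. In a causal graph, a collider is a variable that two arrows point into, i.e. a common effect of two causes.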


In [None]:
# A simulated example to demonstrate selection-distortion

_SEED = 1914

N = 200
p = 0.1  # proportion to select

# uncorrelated newsworthiness & trustworthiness
# (a single SeedStream is used so that each draw gets its own seed)
seed = tfp.util.SeedStream(_SEED, salt="")
nw = tfd.Normal(loc=0., scale=1.).sample(N, seed=seed()).numpy()
tw = tfd.Normal(loc=0., scale=1.).sample(N, seed=seed()).numpy()

# select top 10% of combined scores
s = nw + tw  # total score
q = np.quantile(s, 1 - p)  # top 10% threshold

selected = s >= q  # boolean mask of the selected rows

np.corrcoef(tw[selected], nw[selected])[0, 1]
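
To make the distortion visible, it helps to compare the correlation in the full sample with the correlation among the selected rows. The following is a minimal NumPy-only sketch of that comparison. The names `rng`, `n`, `top` and `keep`, and the larger sample size of 2000, are assumptions chosen for illustration (they are not part of the cell above), and a NumPy generator stands in for the TFP distributions.

```python
import numpy as np

rng = np.random.default_rng(1914)  # illustrative seed
n = 2000    # larger than the N = 200 above, to reduce sampling noise
top = 0.1   # proportion to select

# Two independent standard-normal scores, so the population correlation is zero.
nw = rng.normal(size=n)  # newsworthiness
tw = rng.normal(size=n)  # trustworthiness

# Keep only the rows whose combined score lies in the top 10%.
total = nw + tw
keep = total >= np.quantile(total, 1 - top)

corr_all = np.corrcoef(tw, nw)[0, 1]
corr_selected = np.corrcoef(tw[keep], nw[keep])[0, 1]

print(f"correlation, full sample  : {corr_all:+.3f}")
print(f"correlation, selected rows: {corr_selected:+.3f}")
```

The full-sample value should sit near zero, while the value among the selected rows should be clearly negative: selecting on the sum induces a negative association between its parts.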

## Code 6.2


**Multicollinearity** - when the predictor variables are strongly correlated.

When this happens, the posterior distribution will seem to suggest that none of the variables is reliably associated with the outcome, even if all of them are in fact strongly associated with it. The author says this is a consequence of how multiple regression works.


TODO - Investigate the above point, i.e. how multiple regression leads to this.

The author notes that the model will still predict well; it is the individual coefficients that become difficult to interpret.
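
One way to see why multiple regression behaves like this is to look at the sampling covariance of the coefficients in ordinary least squares. The sketch below is only an analogy: it uses least squares rather than the Bayesian models fitted in this notebook, and a NumPy generator with hypothetical names (`rng`, `X`, `cov`). When two predictors carry almost the same information, the data pin down the sum of their coefficients but not how that sum is split between them.

```python
import numpy as np

rng = np.random.default_rng(909)  # illustrative seed
n = 100

# Heights and two nearly identical leg lengths (proportion of height + small error).
height = rng.normal(10.0, 2.0, size=n)
leg_prop = rng.uniform(0.4, 0.5, size=n)
leg_left = leg_prop * height + rng.normal(0.0, 0.02, size=n)
leg_right = leg_prop * height + rng.normal(0.0, 0.02, size=n)

# Least-squares fit of height on an intercept and both legs.
X = np.column_stack([np.ones(n), leg_left, leg_right])
beta, *_ = np.linalg.lstsq(X, height, rcond=None)

# Classical OLS covariance of the coefficient estimates.
resid = height - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov))

# Correlation between the two leg coefficients, and the standard error of their sum.
corr_lr = cov[1, 2] / (se[1] * se[2])
se_sum = np.sqrt(cov[1, 1] + cov[2, 2] + 2 * cov[1, 2])

print(f"std. error of left-leg slope : {se[1]:.3f}")
print(f"std. error of right-leg slope: {se[2]:.3f}")
print(f"correlation of the two slopes: {corr_lr:+.4f}")
print(f"std. error of their sum      : {se_sum:.3f}")
```

The two slopes should come out almost perfectly negatively correlated, with individual standard errors far larger than the standard error of their sum.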

Here we again create an artificial dataset, this time relating height to the lengths of the left and right legs as predictor variables.


In [None]:
_SEED = 909

N = 100

def generate_height_leg_data():
    seed = tfp.util.SeedStream(_SEED, salt="leg_exp")
    height = tfd.Normal(loc=10., scale=2.).sample(N, seed=seed()).numpy()
    leg_prop = tfd.Uniform(low=0.4, high=0.5).sample(N, seed=seed()).numpy()

    # left & right leg as proportion + error
    leg_left = leg_prop * height + tfd.Normal(loc=0, scale=0.02).sample(N, seed=seed()).numpy()

    leg_right = leg_prop * height + tfd.Normal(loc=0, scale=0.02).sample(N, seed=seed()).numpy()

    # build a dataframe using above
    d = pd.DataFrame({
        "height" : height,
        "leg_left" : leg_left,
        "leg_right" : leg_right
    })

    return d

d = generate_height_leg_data()

d

## Code 6.3 (not working !)

In [None]:
def model_6_1(leg_left_data, leg_right_data):
    def _generator():
      alpha = yield Root(tfd.Sample(tfd.Normal(loc=10., scale=100., name="alpha"), sample_shape=1))
      betaL = yield Root(tfd.Sample(tfd.Normal(loc=2., scale=10., name="betaL"), sample_shape=1))
      betaR = yield Root(tfd.Sample(tfd.Normal(loc=2., scale=10., name="betaR"), sample_shape=1))
      sigma = yield Root(tfd.Sample(tfd.Exponential(rate=1., name="sigma"), sample_shape=1))
    
      mu =  alpha + betaL * leg_left_data + betaR * leg_right_data
        
      height = yield tfd.Independent(tfd.Normal(loc=mu, scale=sigma, name="height"), reinterpreted_batch_ndims=1)

    return tfd.JointDistributionCoroutine(_generator, validate_args=True)    
    
jdc_6_1 = model_6_1(d.leg_left.values, d.leg_right.values)

posterior_6_1 = sample_from_posterior(
                    jdc_6_1, 
                    observed_data=d.height.values, 
                    params=['alpha', 'betaL', 'betaR', 'sigma'])

az_trace = az.from_dict(posterior=posterior_6_1, sample_stats=None)

az.summary(az_trace)

In [None]:
np_post = squeeze_post(posterior_6_1)

az.plot_forest(np_post, credible_interval=0.89);

## Code 6.5  (not working !)

In [None]:
az.plot_pair(np_post, var_names=["betaR", "betaL"], plot_kwargs={"alpha": 0.1});

## Code 6.6  (not working !)

In [None]:
sum_blbr = np_post["betaL"] + np_post["betaR"]
az.plot_kde(sum_blbr, label="sum of bl and br");

## Code 6.8

**Multicollinear milk**

In [None]:
d = pd.read_csv(MILK_DATASET_PATH, sep=";")

d["K"] = d["kcal.per.g"].pipe(lambda x: (x - x.mean()) / x.std())
d["F"] = d["perc.fat"].pipe(lambda x: (x - x.mean()) / x.std())
d["L"] = d["perc.lactose"].pipe(lambda x: (x - x.mean()) / x.std())

## Code 6.9



In [None]:
# We build two models here.

# KCal is regressed on perc.fat

def model_6_3(per_fat):
    def _generator():
      alpha = yield Root(tfd.Sample(tfd.Normal(loc=0., scale=0.2, name="alpha"), sample_shape=1))
      betaF = yield Root(tfd.Sample(tfd.Normal(loc=0., scale=0.5, name="betaF"), sample_shape=1))      
      sigma = yield Root(tfd.Sample(tfd.Exponential(rate=1., name="sigma"), sample_shape=1))
    
      mu =  alpha + betaF * per_fat
        
      K = yield tfd.Independent(tfd.Normal(loc=mu, scale=sigma, name="K"), reinterpreted_batch_ndims=1)

    return tfd.JointDistributionCoroutine(_generator, validate_args=True)    
    
jdc_6_3 = model_6_3(d.F.values)

posterior_6_3 = sample_from_posterior(
                    jdc_6_3, 
                    observed_data=d.K.values, 
                    params=['alpha', 'betaF', 'sigma'])
                            
                            
# KCal is regressed on perc.lactose

def model_6_4(per_lac):
    def _generator():
      alpha = yield Root(tfd.Sample(tfd.Normal(loc=0., scale=0.2, name="alpha"), sample_shape=1))
      betaL = yield Root(tfd.Sample(tfd.Normal(loc=0., scale=0.5, name="betaL"), sample_shape=1))      
      sigma = yield Root(tfd.Sample(tfd.Exponential(rate=1., name="sigma"), sample_shape=1))
    
      mu =  alpha + betaL * per_lac
        
      K = yield tfd.Independent(tfd.Normal(loc=mu, scale=sigma, name="K"), reinterpreted_batch_ndims=1)

    return tfd.JointDistributionCoroutine(_generator, validate_args=True)    
    
jdc_6_4 = model_6_4(d.L.values)

posterior_6_4 = sample_from_posterior(
                    jdc_6_4, 
                    observed_data=d.K.values, 
                    params=['alpha', 'betaL', 'sigma'])


az_trace_6_3 = az.from_dict(posterior=posterior_6_3, sample_stats=None)
az_trace_6_4 = az.from_dict(posterior=posterior_6_4, sample_stats=None)

print("="*20)
print(az.summary(az_trace_6_3))
print("="*20)

print("="*20)
print(az.summary(az_trace_6_4))
print("="*20)

The posteriors for betaF & betaL are almost mirror images of each other (0.857 vs -0.898).

This seems to imply that both predictors have a strong association with the outcome. In the next section we will see what happens when we include both of them in the same regression model.

## Code 6.10

In [None]:
def model_6_5(per_fat, per_lac):
    def _generator():
      alpha = yield Root(tfd.Sample(tfd.Normal(loc=0., scale=0.2, name="alpha"), sample_shape=1))
      betaF = yield Root(tfd.Sample(tfd.Normal(loc=0., scale=0.5, name="betaF"), sample_shape=1))      
      betaL = yield Root(tfd.Sample(tfd.Normal(loc=0., scale=0.5, name="betaL"), sample_shape=1))      
      sigma = yield Root(tfd.Sample(tfd.Exponential(rate=1., name="sigma"), sample_shape=1))
    
      mu =  alpha + betaF * per_fat + betaL * per_lac
        
      K = yield tfd.Independent(tfd.Normal(loc=mu, scale=sigma, name="K"), reinterpreted_batch_ndims=1)

    return tfd.JointDistributionCoroutine(_generator, validate_args=True)    
    
jdc_6_5 = model_6_5(d.F.values, d.L.values)

posterior_6_5 = sample_from_posterior(
                    jdc_6_5, 
                    observed_data=d.K.values, 
                    params=['alpha', 'betaF', 'betaL', 'sigma'])

az_trace_6_5 = az.from_dict(posterior=posterior_6_5, sample_stats=None)
az.summary(az_trace_6_5)

Notice how the standard deviations of the posteriors for betaF & betaL have jumped up significantly compared with the single-predictor models.

## Code 6.11



In [None]:
az.plot_pair(d[["kcal.per.g", "perc.fat", "perc.lactose"]].to_dict("list"));

* perc.fat is positively related to the outcome

* perc.lactose is negatively related to the outcome

* perc.fat & perc.lactose are negatively correlated with each other

Either of them helps in predicting KCal, but neither helps much once we already know the other.

The author notes that people often inspect pairwise correlations before fitting a model and drop predictors that are highly correlated. He calls this a mistake: pairwise correlations are not the problem.

He argues that it is the conditional associations - not the correlations - that matter.

Associations within the data alone are not enough to decide what to do.

He encourages us to look at the problem causally and points to the existence of an unobserved variable that influences both predictors. The resulting difficulty is a case of **non-identifiability**: a parameter is non-identifiable when the structure of the data & the model does not make it possible to estimate its value.

## Code 6.12  (TODO)

Simulating collinearity using Milk dataset
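
As a placeholder until this cell is written, here is a hedged sketch of the idea using synthetic data instead of the milk dataset. Everything in it is an assumption made for illustration: the function name `mean_slope_std_error`, the stand-in variables `fat` and `kcal`, the sample size, and the use of ordinary least squares in place of the notebook's TFP models. It generates a second predictor whose correlation with `fat` is roughly `r` and reports how the standard error of the `fat` slope grows as `r` approaches 1.

```python
import numpy as np

def mean_slope_std_error(r, n=30, n_reps=200, seed=0):
    """Average OLS standard error of the `fat` slope when a second
    predictor correlated with `fat` at roughly `r` is also included."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_reps):
        # Synthetic stand-ins for a standardized predictor and outcome.
        fat = rng.normal(size=n)
        kcal = 0.8 * fat + rng.normal(scale=0.5, size=n)
        # New predictor with mean r * fat and matching residual variance.
        x = rng.normal(loc=r * fat, scale=np.sqrt((1 - r**2) * fat.var()))
        X = np.column_stack([np.ones(n), fat, x])
        beta, *_ = np.linalg.lstsq(X, kcal, rcond=None)
        resid = kcal - X @ beta
        sigma2 = resid @ resid / (n - X.shape[1])
        cov = sigma2 * np.linalg.inv(X.T @ X)
        errors.append(np.sqrt(cov[1, 1]))  # std. error of the `fat` slope
    return float(np.mean(errors))

for r in (0.0, 0.5, 0.9, 0.99):
    print(f"r = {r:4.2f}  mean std. error of fat slope = {mean_slope_std_error(r):.3f}")
```

The standard error should stay modest for small `r` and rise steeply as the two predictors become nearly interchangeable.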

## Code 6.13

**Omitted variable bias** - Problems that arise from **not** including predictor variables

**Post-treatment bias** - Problems that arise from including **improper** predictor variables, in particular variables that are themselves consequences of the treatment
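
A small simulation can illustrate post-treatment bias. The setup below is a hypothetical plant-growth experiment: a soil treatment improves growth only by suppressing a fungus, so the fungus indicator is a consequence of the treatment. All names, effect sizes, the sample size, and the use of least squares (rather than the notebook's TFP models) are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(71)  # illustrative seed
n = 1000  # large sample so the contrast is not drowned in noise

h0 = rng.normal(10.0, 2.0, size=n)         # initial height
treatment = np.repeat([0.0, 1.0], n // 2)  # half of the plants are treated

# Treatment lowers the probability of fungus from 0.5 to 0.1.
fungus = rng.binomial(1, 0.5 - 0.4 * treatment).astype(float)

# Fungus, not treatment, directly reduces growth.
h1 = h0 + rng.normal(5.0 - 3.0 * fungus, 1.0)
growth = h1 - h0

def ols(predictors, y):
    """Least-squares coefficients, with the intercept in position 0."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Treatment effect without and with the post-treatment variable.
effect_total = ols([treatment], growth)[1]
effect_given_fungus = ols([treatment, fungus], growth)[1]

print(f"treatment slope, fungus omitted : {effect_total:+.3f}")
print(f"treatment slope, fungus included: {effect_given_fungus:+.3f}")
```

With fungus omitted, the treatment slope should be clearly positive. Once fungus is included, the slope should shrink towards zero, because the model is then asking what treatment adds when fungus status is already known.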