# Building Causal Graphical Models

Reminder: we are using tools to infer what the data generating process (DGP) may look like. A single data set can be consistent with many different DGPs, so this is a hard game to play. This chapter focuses on representing the DGP graphically.

## 3.1.1 Transportation Case Study

In [1]:
# 3.2 Building the transportation DAG in pgmpy

from pgmpy.models import DiscreteBayesianNetwork
model = DiscreteBayesianNetwork(
    [
        ('Age', 'Education'),
        ('Gender', 'Education'),
        ('Education', 'Occupation'),
        ('Education', 'Residence'),
        ('Occupation', 'Transportation'),
        ('Residence', 'Transportation')
    ]
)

model

<pgmpy.models.DiscreteBayesianNetwork.DiscreteBayesianNetwork at 0x1357327e0>

## 3.1.3 DAGs as Communication Tools

- Examples in which you've used DAGs for communication
- Discuss DAG limitations as a reminder
    - "How" vs "What"
- Logic Gates visualize "How"
    

DAGs represent and codify causal assumptions that can be used in further computation. They also encode time, in that a causal arrow assumes forward movement in time.

The author identifies ways of getting around the "acyclic" restriction in DAGs without having to relax it. The example given is to unpack a cycle into discrete time steps and track its path over time.
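
A minimal sketch of that unrolling idea, using only the Python standard library. The price/demand feedback loop and the helper names `is_acyclic` and `unroll` are hypothetical illustrations, not taken from the book. Each variable is copied once per time step and every arrow points from step t to step t + 1, so the unrolled graph cannot loop back on itself.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(edges):
    """Return True if the directed graph given as (cause, effect) pairs has no cycle."""
    sorter = TopologicalSorter()
    for cause, effect in edges:
        sorter.add(effect, cause)  # the effect depends on its cause
    try:
        sorter.prepare()  # raises CycleError when a cycle exists
        return True
    except CycleError:
        return False


def unroll(edges, steps):
    """Copy each variable per time step; arrows go from step t to step t + 1."""
    return [
        (f"{cause}_t{t}", f"{effect}_t{t + 1}")
        for t in range(steps)
        for cause, effect in edges
    ]


# Hypothetical feedback loop: price affects demand, and demand affects price.
feedback = [("Price", "Demand"), ("Demand", "Price")]
unrolled = unroll(feedback, steps=3)

print(is_acyclic(feedback))  # False
print(is_acyclic(unrolled))  # True
```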

## Linking Causality to Conditional Independence

The author highlights how a causal DAG gives a much simpler structure for formulating the joint probability, because the causal pathways carry assumptions about conditional independence.
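
As a worked illustration, the general factorization implied by a DAG is written out below, followed by its form for the transportation DAG defined earlier in these notes. The single-letter abbreviations (A for Age, G for Gender, E for Education, O for Occupation, R for Residence, T for Transportation) and the notation pa(x) for the parents of x are assumptions made here for brevity.

```latex
\begin{align*}
P(x_1, \dots, x_n) &= \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{pa}(x_i)\bigr) \\
P(A, G, E, O, R, T) &= P(A)\, P(G)\, P(E \mid A, G)\, P(O \mid E)\, P(R \mid E)\, P(T \mid O, R)
\end{align*}
```

Each factor conditions only on a node's direct parents, which is where the simplification comes from.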

Specifically, in the genealogy example, the author demonstrates that the two parents' blood types contain enough information to explain a child's blood type without knowing the grandparents' blood types.

This characteristic, where direct "parent" properties supersede "grandparent" properties, is called the "causal Markov property".
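
The property can be checked numerically on a toy example. The sketch below is a simplification rather than the book's blood-type model: it assumes a single chain Grandparent -> Parent -> Child of binary variables with made-up probabilities. It builds the joint distribution from the three factors and confirms that, once the parent is known, also conditioning on the grandparent does not change the child's probability.

```python
import math
from itertools import product

# Hypothetical numbers: each variable is 1 if the person carries some allele, else 0.
p_g = {0: 0.7, 1: 0.3}
p_p_given_g = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # indexed [g][p]
p_c_given_p = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # indexed [p][c]

# Joint distribution over (grandparent, parent, child) assembled from the factors.
joint = {
    (g, p, c): p_g[g] * p_p_given_g[g][p] * p_c_given_p[p][c]
    for g, p, c in product((0, 1), repeat=3)
}


def prob_child_given(p, g=None):
    """P(Child = 1 | Parent = p), optionally also conditioning on Grandparent = g."""
    keep = [k for k in joint if k[1] == p and (g is None or k[0] == g)]
    total = sum(joint[k] for k in keep)
    return sum(joint[k] for k in keep if k[2] == 1) / total


for p in (0, 1):
    without_g = prob_child_given(p)
    same_for_all_g = all(
        math.isclose(without_g, prob_child_given(p, g=g)) for g in (0, 1)
    )
    print(p, same_for_all_g)
```

Both lines should report that the grandparent adds no information once the parent is fixed.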



## Scaffolding for Causal Machine Learning Models

Using the DAG as scaffolding enables the further goal of building causal machine learning models, which can be used for prediction and causal inference.

After factoring the joint distribution under the assumption of causal conditional independence, the resulting factors are also called causal Markov kernels.
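
To see how much the factorization saves, the sketch below counts free parameters for the transportation DAG in two ways: as one full joint table, and as one kernel per node. The state counts are those pgmpy reports for the fitted CPDs later in these notes (3 states for Age and Transportation, 2 for the rest). The helper name `kernel_free_parameters` is hypothetical.

```python
from math import prod

# Number of states per variable, as reported by the fitted CPDs in these notes.
states = {
    "Age": 3, "Gender": 2, "Education": 2,
    "Occupation": 2, "Residence": 2, "Transportation": 3,
}

# Parents of each node in the transportation DAG.
parents = {
    "Age": [],
    "Gender": [],
    "Education": ["Age", "Gender"],
    "Occupation": ["Education"],
    "Residence": ["Education"],
    "Transportation": ["Occupation", "Residence"],
}


def kernel_free_parameters(node):
    """Free parameters in P(node | parents): (states - 1) per parent configuration."""
    parent_configs = prod(states[p] for p in parents[node])  # 1 when there are no parents
    return (states[node] - 1) * parent_configs


full_joint = prod(states.values()) - 1  # one big table, minus the sum-to-one constraint
factored = sum(kernel_free_parameters(node) for node in states)

print(full_joint)  # 143
print(factored)    # 21
```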



### Labeling Causal Abstractions

Labels are by nature fluid, so it's important to keep them consistent. The author cited the example of race, whose definition has changed over time.

"In machine learning, we're often encouraged to blindly label data and not think about the DGP"



## Training a Model on a Causal DAG

In [2]:
import polars as pl
url = 'https://raw.githubusercontent.com/altdeep/causalML/master/datasets/transportation_survey.csv'
df = pl.read_csv(url)
df.columns = ["Age", "Gender", "Education", "Occupation", "Residence", "Transportation"]

In [3]:
# There are people in this world who use data without looking at it. Don't let it be you!
# in notebooks, "sample" can give you a better idea what's in a dataframe than "head"

df.sample(10)

Age    Gender  Education  Occupation  Residence  Transportation
str    str     str        str         str        str
old    M       high       emp         big        other
adult  F       high       emp         big        car
adult  F       high       emp         small      car
adult  M       high       emp         big        car
old    M       high       emp         big        car
young  M       high       emp         small      car
young  F       uni        emp         big        car
old    M       high       emp         big        car
young  F       uni        emp         big        car
young  F       high       emp         big        car


In [12]:
# Learning Parameters for the causal Markov kernels
model.fit(df.to_pandas())  # polars wins on convenience, pandas wins on compatibility
causal_markov_kernels = model.get_cpds()
print(causal_markov_kernels)

INFO:pgmpy: Datatype (N=numerical, C=Categorical Unordered, O=Categorical Ordered) inferred from data: 
 {'Age': 'C', 'Gender': 'C', 'Education': 'C', 'Occupation': 'C', 'Residence': 'C', 'Transportation': 'C'}


[<TabularCPD representing P(Age:3) at 0x16b3bb410>, <TabularCPD representing P(Education:2 | Age:3, Gender:2) at 0x16b1a7d10>, <TabularCPD representing P(Gender:2) at 0x16b1a7da0>, <TabularCPD representing P(Occupation:2 | Education:2) at 0x16b1a5880>, <TabularCPD representing P(Residence:2 | Education:2) at 0x16b1a7290>, <TabularCPD representing P(Transportation:3 | Occupation:2, Residence:2) at 0x16b1a6ae0>]


In [5]:
cmk_T_max_likelihood = causal_markov_kernels[-1]
cmk_T_max_likelihood.to_dataframe().T  # this is MUCH better than print(cmk_T)

Occupation           emp                self      
Residence            big     small       big small
Transportation                                    
car             0.703431  0.524390  0.444444   1.0
other           0.134804  0.085366  0.333333   0.0
train           0.161765  0.390244  0.222222   0.0


### Different Techniques for Parameter Learning

"Maximum Likelihood" seeks the parameter values that maximize the likelihood of the data used to train the model. For categorical data, this reduces to the proportions observed in the data.

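A minimal sketch of the "proportions" idea, using only the standard library. The observations are hypothetical and stand in for the Transportation values within one (Occupation, Residence) group. The function `mle_categorical` is an illustration, not pgmpy's implementation.

```python
from collections import Counter


def mle_categorical(observations):
    """Maximum likelihood estimate of a categorical distribution: relative frequencies."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Hypothetical observations of Transportation within a single parent configuration.
observed = ["car", "car", "train", "car", "other", "car", "train", "car"]

print(mle_categorical(observed))
# {'car': 0.625, 'train': 0.25, 'other': 0.125}
```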
Dirichlet conjugate priors allow us to compute the posterior distributions with only simple arithmetic. This is built into pgmpy.
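
The "simple math" can be sketched as adding a pseudo-count to every observed count before normalizing. The function below is an illustration rather than pgmpy's code. The counts are inferred from the probability tables in these notes, not read from the data file: 1 self-employed respondent in a small city, and 4, 3, and 2 in big cities. With one pseudo-count per cell, mirroring `pseudo_counts=1` in the pgmpy cell that follows, the results appear consistent with the Bayesian table reported there.

```python
def dirichlet_posterior_mean(counts, pseudo_count=1.0):
    """Posterior mean of a categorical distribution under a symmetric Dirichlet prior."""
    total = sum(counts.values()) + pseudo_count * len(counts)
    return {value: (n + pseudo_count) / total for value, n in counts.items()}


# Counts inferred from the tables in these notes (an assumption, not raw data).
self_small = {"car": 1, "other": 0, "train": 0}
self_big = {"car": 4, "other": 3, "train": 2}

print(dirichlet_posterior_mean(self_small))
# {'car': 0.5, 'other': 0.25, 'train': 0.25}
print(dirichlet_posterior_mean(self_big))
```

With a single observation, maximum likelihood puts all the probability on "car"; the prior pulls that estimate back toward uniform.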

In [6]:
from pgmpy.estimators import BayesianEstimator

model.fit(
    df.to_pandas(),
    estimator=BayesianEstimator,
    prior_type="dirichlet",
    pseudo_counts=1  # the parameters of the Dirichlet prior
)

causal_markov_kernels = model.get_cpds()  # extract causal markov kernels - conditional probability distributions
cmk_T_bayesian = causal_markov_kernels[-1]  # We're after probability of Transportation type given Occupation and Residence
cmk_T_bayesian.to_dataframe().T  # Again, this transpose view mirrors what you see in the book, but as a df rather than string

INFO:pgmpy: Datatype (N=numerical, C=Categorical Unordered, O=Categorical Ordered) inferred from data: 
 {'Age': 'C', 'Gender': 'C', 'Education': 'C', 'Occupation': 'C', 'Residence': 'C', 'Transportation': 'C'}


Occupation           emp                self      
Residence            big     small       big small
Transportation                                    
car             0.700730  0.517647  0.416667  0.50
other           0.136253  0.094118  0.333333  0.25
train           0.163017  0.388235  0.250000  0.25


In [11]:
# Bayesians and Frequentists can argue here

print("Max Likelihood Estimation Approach") 
print("Recall from page 83 that this is looking at proportions as they appear in the data")
print(cmk_T_max_likelihood.to_dataframe().T)
print("\nBayesian Estimation Approach")
print("Recall from page 84 that this approach uses a prior distribution, which helps avoid extremes such as the MLE approach saying that all entrepreneurs in small cities use cars (rightmost column)")
print(cmk_T_bayesian.to_dataframe().T)


Max Likelihood Estimation Approach
Recall from page 83 that this is looking at proportions as they appear in the data
Occupation           emp                self      
Residence            big     small       big small
Transportation                                    
car             0.703431  0.524390  0.444444   1.0
other           0.134804  0.085366  0.333333   0.0
train           0.161765  0.390244  0.222222   0.0

Bayesian Estimation Approach
Recall from page 84 that this approach uses a prior distribution, which helps avoid extremes such as the MLE approach saying that all entrepreneurs in small cities use cars (rightmost column)
Occupation           emp                self      
Residence            big     small       big small
Transportation                                    
car             0.700730  0.517647  0.416667  0.50
other           0.136253  0.094118  0.333333  0.25
train           0.163017  0.388235  0.250000  0.25


### Other Techniques for Parameter Estimation

The author reminds us that the DAG exists separately from the algorithms and methods we use to estimate causal effects. Whether age is measured in seconds or days, or whether we use a neural net or logistic regression to model our assumptions, is separate from the DAG itself.

### Latent Variables

Remember that we are trying to understand a data generating process (DGP), not merely report on data. The data, then, are clues and context about what the DGP may be.

This means we have to consider "latent variables": characteristics that are not directly observed in the data but can be inferred from it.

In [None]:
from pgmpy.models import DiscreteBayesianNetwork
from pgmpy.estimators import ExpectationMaximization as EM
url = "https://raw.githubusercontent.com/altdeep/causalML/master/datasets"
