When we applied machine-learning algorithms to the Titanic data in Lesson 9 on machine-learning pitfalls, they concluded that being female was associated with a higher survival rate. That higher rate is due, in large part, to the fact that female passengers were treated differently by the crew and given first access to the lifeboats. The correlation by itself does not tell us whether female passengers were more likely to survive, all other things being equal. Can a causal analysis provide a more nuanced perspective on the data? Apply the `dowhy` library to the data, using “Sex” as the treatment and “Survived” as the outcome. Look at the average treatment effect and the causal estimate. What do you conclude about the impact of sex on survival?

Below we install `dowhy`, a Python library by Microsoft. It can learn causal graphs from data and carry out do-calculus derivations.

We also import some additional libraries we'll be using.

In [1]:
# install the dowhy library
 
!pip install dowhy
 
# import required libraries
 
import os, sys
sys.path.append(os.path.abspath("../../"))
import dowhy
from dowhy import CausalModel
import pandas as pd
import numpy as np

Collecting dowhy
  Downloading https://files.pythonhosted.org/packages/fd/4b/3811cebed496bdd4ba5193d21d715cc24d94883a1c6483f6025a831ae89c/dowhy-0.4-py3-none-any.whl (97kB)
Collecting pydot>=1.4
  Downloading https://files.pythonhosted.org/packages/33/d1/b1479a770f66d962f545c2101630ce1d5592d90cb4f083d38862e93d16d2/pydot-1.4.1-py2.py3-none-any.whl
Collecting sympy>=1.4
  Downloading https://files.pythonhosted.org/packages/1e/ed/4b0576282597e87e7cf3be33fa4f738d5974471f9b59a55b3730c3877530/sympy-1.6.1-py3-none-any.whl (5.8MB)
Installing collected packages: pydot, sympy, dowhy
  Found existing installation: pydot 1.3.0
    Uninstalling pydot-1.3.0:
      Successfully uninstalled pydot-1.3.0
  Found existing installation: sympy 1.1.1
    Uninstalling sympy-1.1.1:
      Successfully uninstalled sympy-1.1.1
Successfully installed dowhy-0.4 pydot-1.4.1 sympy-1.6.1


We download and process the `data` below.

In [42]:
# read the file without a header row, then name the columns ourselves
data = pd.read_csv("https://github.com/mlittmancs/great_courses_ml/raw/master/ship.csv", header=None)
col = ['PassengerId','Survived','Pclass','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked']
data.columns = col
# get rid of title row, which was read in as a data row
data = data[data.PassengerId != "PassengerId"]
# store the outcome as an integer (0 or 1)
data = data.astype({"Survived": int})
# boolean treatment indicator: True for female passengers
data["SexB"] = data.Sex == 'female'

data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SexB
1,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S,False
2,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C,True
3,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S,True
4,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S,True
5,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S,False
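
One side effect of reading the file with `header=None` is worth checking: because the title row is parsed as data, pandas keeps every column as text, including `Age` and `Fare` (note that `Age` prints as `22`, not `22.0`, in the table above). The sketch below uses a three-row excerpt of that table, held in a hypothetical `csv_text` string instead of the real download, to compare that approach with letting pandas parse the header row itself. It is an alternative to consider, not the preprocessing that the results in this notebook are based on.

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype

# Three rows copied from the table above (a subset of columns); csv_text is a
# hypothetical stand-in for ship.csv so the example runs without a download.
csv_text = (
    "PassengerId,Survived,Pclass,Sex,Age,Fare\n"
    "1,0,3,male,22,7.25\n"
    "2,1,1,female,38,71.2833\n"
    "3,1,3,female,26,7.925\n"
)

# header=None: the title row becomes a data row, so no column is numeric.
as_text = pd.read_csv(io.StringIO(csv_text), header=None)
print("numeric columns with header=None:",
      [c for c in as_text.columns if is_numeric_dtype(as_text[c])])

# Default header handling: pandas uses the title row as column names
# and infers numeric types for Age and Fare.
typed = pd.read_csv(io.StringIO(csv_text))
typed["SexB"] = typed["Sex"] == "female"
print(typed.dtypes)
```

Changing the column types may change the estimates reported below, since the estimator would then see `Age` and `Fare` as numbers and not as text.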


We create a `CausalModel` from the `data`, naming `SexB` as the `treatment`, `Survived` as the `outcome`, and several other columns as `common_causes`. We then call `identify_effect` on the `model` to derive a causal effect.

Note that when you run this routine, the code reminds you that it’s making some educated guesses about the way that unobserved confounders can impact the model. The `dowhy` software steers users away from taking the results at face value and toward looking more closely at possible causal effects.

In [43]:
# Create a causal model from the data, treating the listed columns as common causes.
 
model=CausalModel(
        data = data,
        treatment='SexB',
        outcome='Survived',
        common_causes=['Age', 'SibSp', 'Parch', 'Fare', 'Embarked'])
 
# Identify the causal effect
identified_estimand = model.identify_effect()

INFO:dowhy.causal_graph:If this is observed data (not from a randomized experiment), there might always be missing confounders. Adding a node named "Unobserved Confounders" to reflect this.
INFO:dowhy.causal_model:Model to find the causal effect of treatment ['SexB'] on outcome ['Survived']
INFO:dowhy.causal_identifier:Common causes of treatment and outcome:['Embarked', 'SibSp', 'Parch', 'Fare', 'U', 'Age']


WARN: Do you want to continue by ignoring any unobserved confounders? (use proceed_when_unidentifiable=True to disable this prompt) [y/n] y


INFO:dowhy.causal_identifier:Instrumental variables for treatment and outcome:[]
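
The log above describes the graph that the model assumes: every listed common cause, plus an unobserved confounder `U`, points at both the treatment and the outcome, and the treatment points at the outcome. The following plain-Python sketch writes out that edge list for illustration only. The `edges` list and the other variable names are ours; the code does not call `dowhy` or reproduce its internal representation.

```python
# Names taken from the CausalModel call above.
treatment = "SexB"
outcome = "Survived"
common_causes = ["Age", "SibSp", "Parch", "Fare", "Embarked"]

# "U" is the unobserved-confounder node mentioned in the log output.
confounders = common_causes + ["U"]

# The treatment affects the outcome, and each confounder affects both.
edges = [(treatment, outcome)]
for cause in confounders:
    edges.append((cause, treatment))
    edges.append((cause, outcome))

for source, target in edges:
    print(f"{source} -> {target}")
```

Listing the edges makes the modeling assumptions explicit; for instance, the graph treats `Fare` as a cause of both `SexB` and `Survived`.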


We estimate the effect of the treatment on the outcome in two ways. The first is the “average treatment effect” or `ATE`, computed here as a simple difference in mean outcomes between the two groups. That’s a correlational measure of treatment and outcomes.

To estimate the average treatment effect, we separate the instances where the treatment was given, `data_1`, from the instances where the treatment was NOT given, `data_0`. The `ATE` is the difference between the mean outcomes in these two sets.


In [45]:
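# passengers with the treatment (female) and without it (male)
data_1 = data[data["SexB"]]
data_0 = data[~data["SexB"]]
# difference in mean survival between the two groups
print("ATE", np.mean(data_1["Survived"]) - np.mean(data_0["Survived"]))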

ATE 0.5531300709799203
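
As a sanity check, the same number can be reproduced by hand from group counts. The counts below are an assumption: they come from the commonly distributed 891-passenger Titanic training set (314 women, 233 of whom survived; 577 men, 109 of whom survived) and are not read from `data`. If `ship.csv` differs from that set, the arithmetic will not match the value printed above.

```python
# Assumed counts from the standard 891-passenger Titanic training set.
female_total, female_survived = 314, 233
male_total, male_survived = 577, 109

# Survival rate within each group.
female_rate = female_survived / female_total
male_rate = male_survived / male_total

# Difference in means, the quantity this lesson calls the ATE.
ate = female_rate - male_rate
print("female survival rate:", round(female_rate, 4))
print("male survival rate:  ", round(male_rate, 4))
print("difference (ATE):    ", round(ate, 4))
```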


We also use `dowhy`’s `estimate_effect` function to characterize the treatment effect more precisely. We use the `backdoor.propensity_score_weighting` method, which adjusts for the listed common causes to re-assess how much of the difference in survival is really attributable to the treatment and not to other factors.

In [46]:
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.propensity_score_weighting")
 
print("Causal Estimate is " + str(estimate.value))

INFO:dowhy.causal_estimator:INFO: Using Propensity Score Weighting Estimator
INFO:dowhy.causal_estimator:b: Survived~SexB+Embarked+SibSp+Parch+Fare+Age
  y = column_or_1d(y, warn=True)
INFO:numexpr.utils:NumExpr defaulting to 2 threads.


Causal Estimate is 0.3753219400107669
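
To make the idea of propensity score weighting concrete, here is a minimal sketch on synthetic data. It does not use the Titanic table and is not `dowhy`’s internal code. Everything in it is an assumption chosen for illustration: a single binary confounder `z`, a true effect of 0.3, propensities estimated by simple group frequencies, and a plain inverse-propensity-weighted difference of means.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Synthetic setup (all values are illustrative assumptions).
z = rng.integers(0, 2, size=n)                    # binary confounder
t = rng.random(n) < np.where(z == 1, 0.8, 0.2)    # treatment is likelier when z == 1
true_effect = 0.3
p_outcome = 0.1 + 0.4 * z + true_effect * t       # z also raises the outcome rate
y = (rng.random(n) < p_outcome).astype(float)

# Naive estimate: difference in mean outcomes between treated and untreated.
naive = y[t].mean() - y[~t].mean()

# Propensity score e(z) = P(t = 1 | z), estimated by the treated share in each group.
e = np.where(z == 1, t[z == 1].mean(), t[z == 0].mean())

# Inverse propensity weighting: treated units are weighted by 1/e,
# untreated units by 1/(1 - e), which balances z across the two groups.
ipw = np.mean(t * y / e) - np.mean((~t) * y / (1 - e))

print("true effect:", true_effect)
print("naive difference in means:", round(naive, 3))
print("propensity-weighted estimate:", round(ipw, 3))
```

In this simulation the naive difference overstates the effect because `z` raises both the chance of treatment and the chance of a positive outcome, while the weighted estimate lands near the chosen true effect. The gap between the `ATE` and the causal estimate in the Titanic analysis reflects the same kind of adjustment, subject to the warning about unobserved confounders.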
