
# Topics

* What are causal effects
* Fundamental Problem of Causal Inference
* Assumptions necessary to identify causal effects
* Matching techniques
* Assessing balance
* Sensitivity analysis to determine the impact of violations of assumptions on conclusions



# Brief History
![](history.png)



# Basic Definitions and Notations
* Treatment: $A \in \{0,1\}$

* Observed Outcome: $Y \in \{0,1\}$

* Potential Outcomes: $Y^0, Y^1$

* Counterfactual:  
  If the observed outcome is $Y^a$, the counterfactual outcome is $Y^{1-a}$

* Causal Effect: treatment has a causal effect if $Y^1 \neq Y^0$


# Fundamental Problem of Causal Inference

* It is impossible to observe both potential outcomes for the same unit at the same time.

* However, with certain assumptions we can estimate population-level (average) causal effects.


# Average Causal Effect

* $E(Y^1-Y^0)=-0.1$

* In a population of 1000 people, 100 fewer would experience the outcome if everyone were treated ($A=1$) than if no one were treated ($A=0$)
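
The arithmetic can be checked with a small sketch. The event counts below (200 events if everyone is treated, 300 if no one is) are hypothetical, chosen only so that the difference matches $-0.1$; in real data both potential outcomes are never observed for the same person.

```r
# Hypothetical population of 1000 people where BOTH potential outcomes are known
# (only possible in a made-up example, never in real data)
n  <- 1000
y1 <- rep(c(1, 0), c(200, n - 200))  # outcome if everyone were treated: 200 events
y0 <- rep(c(1, 0), c(300, n - 300))  # outcome if no one were treated: 300 events

ace <- mean(y1 - y0)                 # average causal effect E(Y^1 - Y^0)
ace                                  # -0.1
sum(y1) - sum(y0)                    # -100, i.e. 100 fewer events under treatment
```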



# Conditioning vs. Setting

In general 

$E(Y^1-Y^0) \neq E(Y|A=1)-E(Y|A=0)$

$E(Y|A=1)$ is the mean of $Y$ in the subpopulation that actually had $A=1$ (conditioning)

$E(Y^1)$ is the mean of $Y$ if the whole population were treated with $A=1$ (setting)
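
A minimal sketch of the difference, using a made-up population in which a binary covariate `X` (think "sicker patients") raises both the chance of treatment and the risk of the outcome. All counts and the helper `make_cell` are assumptions for illustration, not data from these notes.

```r
# Build one (X, A) cell of a hypothetical population with known potential outcomes
make_cell <- function(x, a, n, n_y1, n_y0) {
  data.frame(X = x, A = a,
             Y1 = rep(c(1, 0), c(n_y1, n - n_y1)),  # outcome if treated
             Y0 = rep(c(1, 0), c(n_y0, n - n_y0)))  # outcome if untreated
}

pop <- rbind(
  make_cell(1, 1, 400, 200, 240),  # sicker, mostly treated
  make_cell(1, 0, 100,  50,  60),
  make_cell(0, 1, 100,  10,  20),  # healthier, mostly untreated
  make_cell(0, 0, 400,  40,  80)
)

# Observed outcome: the potential outcome under the treatment actually received
pop$Y <- ifelse(pop$A == 1, pop$Y1, pop$Y0)

setting      <- mean(pop$Y1) - mean(pop$Y0)                         # E(Y^1 - Y^0)
conditioning <- mean(pop$Y[pop$A == 1]) - mean(pop$Y[pop$A == 0])   # E(Y|A=1) - E(Y|A=0)

setting       # -0.1: treatment lowers risk
conditioning  # 0.14: the naive comparison suggests it raises risk
```

The two quantities even have opposite signs here, because the treated group contains more of the high-risk `X = 1` people.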


# Other Causal Effects

$E(Y^1)/E(Y^0)$: causal relative risk

$E(Y^1-Y^0|A=1)$: causal effect of treatment on the treated

$E(Y^1-Y^0|V=v)$: average causal effect in the subpopulation with covariate $V=v$ (heterogeneous treatment effects)


# Causal Assumptions 
* SUTVA (stable unit treatment value assumption)
  * No interference: 
    * Units do not interfere with each other 
    * Treatment assignment of one unit does not affect the outcome of another unit
    * Spillover and contagion are other terms for interference
  * One version of treatment

* Consistency: $Y=Y^a$ if $A=a$, for all $a$

* Ignorability: $Y^0,Y^1\perp \!\!\! \perp A|X$. Among people with the same values of $X$, we can think of treatment $A$ as being randomly assigned.

* Positivity: for every set of values for $X$, treatment assignment was not deterministic:
$P(A=a|X=x)>0$ for all $a$ and $x$
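
Positivity can be screened empirically by cross-tabulating covariate strata against treatment and looking for empty cells. The sketch below uses a tiny made-up data frame; the names `obs`, `counts` and `has_violation` are illustrative assumptions.

```r
# Hypothetical data: every "old" person was treated, so P(A = 0 | X = old) is estimated as 0
obs <- data.frame(
  X = c("young", "young", "young", "old", "old", "old"),
  A = c(1, 0, 1, 1, 1, 1)
)

counts  <- table(obs$X, obs$A)            # rows: strata of X, columns: treatment levels
p_treat <- prop.table(counts, margin = 1) # empirical P(A = a | X = x)
p_treat

# Any empty cell is an empirical positivity violation
has_violation <- any(counts == 0)
has_violation
```

An empty cell in a sample does not prove the true probability is zero, but no data-based comparison is possible within that stratum.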


# Observed Data

$E(Y|A=a,X=x)$

$= E(Y^a | A=a,X=x)$ (consistency)

$= E(Y^a | X=x)$ (ignorability)


# Mean Potential Outcome by standardization

$E(Y^a) = \sum_x E(Y|A=a,X=x)P(X=x)$

Problem: $X$ may contain many variables, and many combinations of their values will have few or no observations, so $E(Y|A=a,X=x)$ cannot be estimated in those strata.
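
A minimal sketch of standardization over a single binary covariate, in base R. The observed data and the helpers `cell` and `standardize` are hypothetical, built only to illustrate the formula.

```r
# Build one (X, A) cell of hypothetical OBSERVED data: n people, `events` of them with Y = 1
cell <- function(x, a, n, events) {
  data.frame(X = x, A = a, Y = rep(c(1, 0), c(events, n - events)))
}
obs <- rbind(cell(1, 1, 400, 200), cell(1, 0, 100, 60),
             cell(0, 1, 100,  10), cell(0, 0, 400, 80))

# E(Y^a) = sum over x of E(Y | A = a, X = x) * P(X = x)
standardize <- function(data, a) {
  total <- 0
  for (x in sort(unique(data$X))) {
    p_x  <- mean(data$X == x)                         # P(X = x)
    m_ax <- mean(data$Y[data$A == a & data$X == x])   # E(Y | A = a, X = x)
    total <- total + m_ax * p_x
  }
  total
}

ey1 <- standardize(obs, 1)   # 0.3
ey0 <- standardize(obs, 0)   # 0.4
ey1 - ey0                    # -0.1
```

If some combination of `a` and `x` had no observations, `m_ax` would be `NaN`, which is the sparse-strata problem in code form.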


# Confounding

Randomization gets rid of confounding, because randomly assigned treatment cannot depend on any confounder.


# Matching (one covariate, multiple covariates)

* fine balance
* stochastic balance
* one to one
* one to many
* variable (sometimes one to one, sometimes one to many)
* Distance
  * Mahalanobis
  * robust version (uses ranks instead of raw values to limit the influence of outliers)
* Greedy matching (R package MatchIt)
  * Caliper (maximum acceptable distance)
* Optimal matching (R packages optmatch, rcbalance)
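
To make the greedy idea concrete, here is a minimal base R sketch of one-to-one nearest-neighbor matching without replacement on a single covariate, with an optional caliper. The function `greedy_match` and the ages are hypothetical; it uses absolute difference as the distance, whereas with several covariates a Mahalanobis distance would take its place. It is not how any of the packages above are implemented.

```r
# Hypothetical ages for treated and control units
treated <- c(t1 = 30, t2 = 45, t3 = 62)
control <- c(c1 = 28, c2 = 33, c3 = 47, c4 = 80, c5 = 51)

greedy_match <- function(treated, control, caliper = Inf) {
  available <- rep(TRUE, length(control))          # controls not yet used
  match_idx <- rep(NA_integer_, length(treated))
  for (i in seq_along(treated)) {
    d <- abs(unname(treated[i]) - unname(control)) # distance to every control
    d[!available] <- Inf                           # used controls cannot be reused
    j <- which.min(d)                              # closest remaining control
    if (is.finite(d[j]) && d[j] <= caliper) {      # accept only within the caliper
      match_idx[i] <- j
      available[j] <- FALSE
    }
  }
  data.frame(treated = names(treated),
             control = names(control)[match_idx],
             stringsAsFactors = FALSE)
}

m_all <- greedy_match(treated, control)                # t1-c1, t2-c3, t3-c5
m_cal <- greedy_match(treated, control, caliper = 10)  # t3 is left unmatched (distance 11)
m_all
m_cal
```

Greedy matching takes the treated units in order and never revisits a choice, so the result can depend on that order; optimal matching instead minimizes the total distance over all pairs.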

# Assessing Balance

* Standardized mean differences (SMD): are the covariate means similar across groups?

  $$smd=\frac{ \bar{X}_{treatment} -  \bar{X}_{control}}{ \sqrt{  \frac{S^2_{treatment}+S^2_{control}}{2} }  }$$
  
  * <0.1 (in absolute value) indicates adequate balance
  * 0.1-0.2 is not too alarming
  * \>0.2 indicates serious imbalance
   
* Table 1 (prematching and post-matching balance is compared) + SMD plot
  ![](accessing_balance.png)

* Hypothesis test and p-values (test for differences in means for each covariate -- two sample t-test) 
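
The SMD formula above is straightforward to compute by hand. A minimal sketch with made-up ages; the function name `smd` and the data are assumptions for illustration.

```r
# Standardized mean difference for one covariate x, with group coded 1 = treated, 0 = control
smd <- function(x, group) {
  x_t <- x[group == 1]
  x_c <- x[group == 0]
  # difference in means divided by the pooled standard deviation
  (mean(x_t) - mean(x_c)) / sqrt((var(x_t) + var(x_c)) / 2)
}

# Hypothetical data: treated mean 60, control mean 50, both variances 100
age   <- c(50, 60, 70, 40, 50, 60)
treat <- c(1, 1, 1, 0, 0, 0)

smd(age, treat)   # 1, far above the 0.2 threshold: serious imbalance
```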




# Analyze Data After Matching

* Randomization test (permutation test, exact test): R function mcnemar.test for paired binary outcomes, or a paired t.test for continuous outcomes
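
For matched pairs with a binary outcome, only the discordant pairs (where the two members of a pair have different outcomes) carry information about the treatment effect. The sketch below runs McNemar's test and its exact binomial counterpart on a hypothetical table of 100 pairs; the counts are made up for illustration.

```r
# Hypothetical paired outcomes: rows = treated outcome (0/1), columns = control outcome (0/1)
pairs_tab <- matrix(c(40,  5,    # first column:  control outcome 0
                      15, 40),   # second column: control outcome 1
                    nrow = 2,
                    dimnames = list(treated = c("0", "1"), control = c("0", "1")))

# McNemar's chi-squared test (continuity correction is the default)
res <- mcnemar.test(pairs_tab)
res$statistic                     # (|15 - 5| - 1)^2 / (15 + 5) = 4.05

# Exact version: under the null, each discordant pair is equally likely to go either way
n_10  <- pairs_tab["1", "0"]      # treated had the event, control did not
n_01  <- pairs_tab["0", "1"]      # control had the event, treated did not
exact <- binom.test(n_01, n_01 + n_10, p = 0.5)
exact$p.value
```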




# Sensitivity analysis

* Possible hidden bias



# Propensity Score




# Inverse Probability of Treatment Weighting (IPTW)






# Example I
- Preparing data
- Matching
- Outcome analysis

### Preparing data

In [None]:
#Install and load package
#install.packages("tableone")
#install.packages("Matching")
#install.packages("MatchIt")
library(tableone)
library(Matching)

#Load and view data
load(url("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/rhc.sav"))
rhc[1:10,1:6]

* **swang1**: Treatment variable
* **cat1**: Primary disease category
* **meanbp1**: Mean blood pressure
* **sex, age, death**

In [None]:
#Expand categorical variables into 0/1 indicator variables

#create a data set with just these variables, for simplicity
ARF<-as.numeric(rhc$cat1=='ARF')
CHF<-as.numeric(rhc$cat1=='CHF')
Cirr<-as.numeric(rhc$cat1=='Cirrhosis')
colcan<-as.numeric(rhc$cat1=='Colon Cancer')
Coma<-as.numeric(rhc$cat1=='Coma')
COPD<-as.numeric(rhc$cat1=='COPD')
lungcan<-as.numeric(rhc$cat1=='Lung Cancer')
MOSF<-as.numeric(rhc$cat1=='MOSF w/Malignancy')
sepsis<-as.numeric(rhc$cat1=='MOSF w/Sepsis')
female<-as.numeric(rhc$sex=='Female')
died<-as.numeric(rhc$death=='Yes')
age<-rhc$age
treatment<-as.numeric(rhc$swang1=='RHC')
meanbp1<-rhc$meanbp1
aps<-rhc$aps1

#new dataset (COPD is left out, so it acts as the reference disease category)
mydata<-cbind(ARF,CHF,Cirr,colcan,Coma,lungcan,MOSF,sepsis,
              age,female,meanbp1,aps, treatment,died)
mydata<-data.frame(mydata)

head(mydata)

In [None]:
#Define the covariates we will use (shorter list than you would use in practice)
xvars<-c("ARF","CHF","Cirr","colcan","Coma","lungcan","MOSF","sepsis",
         "age","female","meanbp1")

In [None]:
#Before matching

#look at a table 1
table1<- CreateTableOne(vars=xvars,strata="treatment", data=mydata, test=FALSE)

## include standardized mean difference (SMD)
print(table1,smd=TRUE)

### Matching

In [None]:
#Match by greedy matching

greedymatch <- Match(Tr=treatment,M=1,X=mydata[xvars],replace=FALSE)
matched<-mydata[unlist(greedymatch[c("index.treated","index.control")]), ]

#get table 1 for matched data with standardized differences
matchedtab1<-CreateTableOne(vars=xvars, 
                            strata ="treatment", 
                            data=matched, 
                            test = FALSE)

In [None]:
print(matchedtab1, smd = TRUE)

### Outcome analysis

In [None]:
#Outcome analysis by T-test


#outcome analysis
y_trt<-matched$died[matched$treatment==1]
y_con<-matched$died[matched$treatment==0]

#pairwise difference
diffy <- y_trt-y_con

#paired t-test
t.test(diffy)

In [None]:
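#Outcome analysis by McNemar's chi-squared test
#cross-tabulate the paired outcomes (treated vs. matched control)
pair_table <- table(y_trt,y_con)
pair_table
#pass the table directly instead of retyping its counts
#(the original run hard-coded them as c(973,513,395,303))
mcnemar.test(pair_table)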

# Example II Matching with propensity score

In [None]:
#Use propensity score for matching

#fit a propensity score model. logistic regression
psmodel<-glm(treatment~ARF+CHF+Cirr+colcan+Coma+lungcan+MOSF+
               sepsis+age+female+meanbp1+aps,
    family=binomial(),data=mydata)

#show coefficients etc
summary(psmodel)

#create propensity score
pscore<-psmodel$fitted.values


In [None]:

#Do greedy matching on logit(PS) with a caliper
#(Match interprets caliper in standard deviation units, here 0.2 SD of logit(PS))

logit <- function(p) {log(p)-log(1-p)}
psmatch<-Match(Tr=mydata$treatment,
               M=1,
               X=logit(pscore),
               replace=FALSE,
               caliper=.2)
matched<-mydata[unlist(psmatch[c("index.treated",
                                 "index.control")]), ]
xvars<-c("ARF","CHF","Cirr","colcan","Coma",
         "lungcan","MOSF","sepsis",
         "age","female","meanbp1")

In [None]:
#Get standardized differences


matchedtab1<-CreateTableOne(vars=xvars, 
                            strata ="treatment", 
                            data=matched, 
                            test = FALSE)
print(matchedtab1, smd = TRUE)

In [None]:
#Outcome Analysis by T-test

y_trt<-matched$died[matched$treatment==1]
y_con<-matched$died[matched$treatment==0]

#pairwise difference
diffy<-y_trt-y_con

#paired t-test
t.test(diffy)

# Example III PSM (MatchIt)

In [None]:
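library(MatchIt)
#lalonde ships with MatchIt; the formulas below assume the older layout with
#separate black and hispan indicator columns (newer releases use a race factor)
data(lalonde)
head(lalonde)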

In [None]:
m.out <- matchit(treat ~ age + educ + black + hispan + nodegree + married + re74 + re75, 
                 method = "nearest", data = lalonde, caliper=0.2)

In [None]:
summary(m.out)

In [None]:
plot(m.out, type='hist')

In [None]:
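#install.packages("Zelig")
library(Zelig)
#least squares regression of re78 on treatment and covariates, fit on the matched data
z.out <- zelig(re78 ~ treat + age + educ + black + hispan + nodegree + married + re74 + re75, 
               data = match.data(m.out), model = "ls")
#covariate profiles with treatment set to 0 and to 1
x.out <- setx(z.out, treat=0)
x1.out <- setx(z.out, treat=1)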

In [None]:
s.out <- sim(z.out, x = x.out, x1 = x1.out)
summary(s.out)