---
title: "Doubly Robust Estimation Notebook"
date: "2025-07-17"
excerpt: "Doubly Robust Estimation Notebook"
category: "Causal Inference"
tags: ["Causal Inference"]
---

[Doubly Robust Estimation](https://matheusfacure.github.io/python-causality-handbook/12-Doubly-Robust-Estimation.html)

In [3]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
from matplotlib import style
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression, LinearRegression

%matplotlib inline

style.use("fivethirtyeight")
pd.set_option("display.max_columns", None)

In [5]:
# A seminar was run to improve students' mindset; the question is how much effect it had.
# The data are simulated.

# Seminar attendance -> intervention
# Student's expectation of success -> success_expect

data = pd.read_csv("./data/learning_mindset.csv")
data.sample(5, random_state=5)

Unnamed: 0,schoolid,intervention,achievement_score,success_expect,ethnicity,gender,frst_in_family,school_urbanicity,school_mindset,school_achievement,school_ethnic_minority,school_poverty,school_size
259,73,1,1.480828,5,1,2,0,1,-0.462945,0.652608,-0.515202,-0.169849,0.173954
3435,76,0,-0.987277,5,13,1,1,4,0.334544,0.648586,-1.310927,0.224077,-0.426757
9963,4,0,-0.15234,5,2,2,1,0,-2.289636,0.190797,0.875012,-0.724801,0.761781
4488,67,0,0.358336,6,14,1,0,4,-1.115337,1.053089,0.315755,0.054586,1.862187
2637,16,1,1.36092,6,4,1,0,1,-0.538975,1.433826,-0.033161,-0.982274,1.591641


In [7]:
# Check whether students with a higher success expectation (success_expect) attend the seminar (intervention) at a higher rate

data.groupby("success_expect")["intervention"].agg(["mean", "count"])

success_expect,mean,count
1,0.271739,92
2,0.265957,188
3,0.294118,476
4,0.271617,1064
5,0.31107,3803
6,0.354287,3802
7,0.362319,966


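Student characteristics influence whether a student attends the seminar, and the same characteristics also influence how well the student does afterwards, so they act as confounding variables.   
In other words, simply comparing outcomes between attendees and non-attendees is not a reliable way to estimate the seminar's effect, because these other student characteristics are also driving the difference.   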
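
To make the confounding argument concrete, here is a minimal simulated sketch. It does not use the mindset data: the variable `motivation`, the effect size `true_effect`, and the data-generating process are all assumptions invented for illustration. A single hidden factor raises both attendance and the outcome, so the naive difference in means overstates the effect, while a regression that includes the confounder recovers something close to it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
true_effect = 0.5  # assumed seminar effect, chosen for illustration

# Hypothetical confounder: drives both attendance and the outcome
motivation = rng.normal(size=n)
p_attend = 1 / (1 + np.exp(-motivation))   # more motivated -> more likely to attend
attended = rng.binomial(1, p_attend)
outcome = true_effect * attended + 1.0 * motivation + rng.normal(size=n)

# Naive estimate: raw difference in mean outcome between the two groups
naive = outcome[attended == 1].mean() - outcome[attended == 0].mean()

# Adjusted estimate: OLS of outcome on [intercept, attended, motivation]
X = np.column_stack([np.ones(n), attended, motivation])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
adjusted = coef[1]

print(f"true effect:       {true_effect:.3f}")
print(f"naive difference:  {naive:.3f}")
print(f"adjusted estimate: {adjusted:.3f}")
```

In this toy setup the adjustment works only because the confounder is observed and the outcome model is correctly specified; neither is guaranteed with real data.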

In [9]:
# Use regression to remove the bias and estimate the seminar's effect
# To do that, convert the categorical features to dummy variables

categ = ["ethnicity", "gender", "school_urbanicity"]
cont = ["school_mindset", "school_achievement", "school_ethnic_minority", "school_poverty", "school_size"]

data_with_categ = pd.concat([
    data.drop(columns=categ), # dataset without the categorical features
    pd.get_dummies(data[categ], columns=categ, drop_first=False) # categorical features converted to dummies
], axis=1)

print(data_with_categ.shape)

(10391, 32)
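
The column count after encoding follows from simple arithmetic: each categorical column is dropped and replaced by one dummy column per distinct level. The sketch below shows this on a tiny made-up frame (the values in `toy` are invented; only the column names mirror the dataset).

```python
import pandas as pd

# Toy frame with two categorical columns and one continuous column (values are made up)
toy = pd.DataFrame({
    "gender": [1, 2, 1],
    "school_urbanicity": [0, 4, 1],
    "school_size": [0.17, -0.43, 0.76],
})
toy_categ = ["gender", "school_urbanicity"]

toy_encoded = pd.concat([
    toy.drop(columns=toy_categ),                                       # continuous part
    pd.get_dummies(toy[toy_categ], columns=toy_categ, drop_first=False),  # one column per level
], axis=1)

# 1 continuous column + 2 gender levels + 3 urbanicity levels = 6 columns
print(toy_encoded.shape)
print(list(toy_encoded.columns))
```

With `drop_first=False` every level keeps its own column, so the dummies for each feature sum to one in every row. If the encoded frame is later fed to a linear model with an intercept, that makes the dummy columns perfectly collinear with the intercept; the fitted predictions are unaffected, but the individual dummy coefficients should not be read as uniquely determined.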
