# Assignment 4 - Part 2: Causal Forest (R)

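This R notebook estimates heterogeneous treatment effects of a randomized cash transfer program that encourages medical check-ups. The forest step uses a standard random forest with the treatment indicator `T` included as a feature, and takes the difference between predictions under `T = 1` and `T = 0` as the estimated effect.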

In [None]:
# Load necessary libraries
library(randomForest)
library(rpart)
library(rpart.plot)
library(ggplot2)
library(dplyr)
library(tidyr)
library(reshape2)

# Set random seed for reproducibility
set.seed(123)

## Load and Prepare Data

In [None]:
# Load the dataset
column_names <- c('age', 'sex', 'cp', 'restbp', 'chol', 'fbs', 'restecg', 
                  'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'hd')

df <- read.csv('../input/processed.cleveland.data', 
               header = FALSE,
               col.names = column_names,
               na.strings = '?')

# Remove missing values
df <- na.omit(df)

cat("Dataset shape:", dim(df), "\n")
head(df)

## (0.5 points) Create binary treatment variable T

In [None]:
# Create binary treatment variable with random assignment
set.seed(123)
df$T <- rbinom(nrow(df), 1, 0.5)

cat("Treatment distribution:\n")
table(df$T)
cat("\nProportion treated:", mean(df$T), "\n")
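
Because assignment is random, covariates should be balanced across the two arms. A minimal sketch of such a balance check follows. It uses simulated stand-in data rather than the Cleveland file, and the data frame `toy` and helper `smd` are hypothetical names introduced only for illustration.

```r
# Simulated stand-in for the patient data (hypothetical values)
set.seed(1)
n <- 1000
toy <- data.frame(
  age    = rnorm(n, 54, 9),
  restbp = rnorm(n, 131, 17),
  T      = rbinom(n, 1, 0.5)
)

# Standardized mean difference: (mean_treated - mean_control) / pooled SD
smd <- function(x, treat) {
  m1 <- mean(x[treat == 1])
  m0 <- mean(x[treat == 0])
  s  <- sqrt((var(x[treat == 1]) + var(x[treat == 0])) / 2)
  (m1 - m0) / s
}

balance <- sapply(toy[c("age", "restbp")], smd, treat = toy$T)
print(round(balance, 3))
```

Values near zero suggest balance; a common rule of thumb treats absolute values below 0.1 as acceptable.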

## (1 point) Create outcome variable Y

In [None]:
# Create outcome variable Y
# Y = (1 + 0.05*age + 0.3*sex + 0.2*restbp) * T + 0.5*oldpeak + epsilon
# epsilon ~ N(0, 1)

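# Use a different seed from the treatment draw so that epsilon is not
# generated from the same random number stream as T
set.seed(456)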
epsilon <- rnorm(nrow(df), 0, 1)

df$Y <- (1 + 0.05 * df$age + 0.3 * df$sex + 0.2 * df$restbp) * df$T + 
        0.5 * df$oldpeak + epsilon

cat("Outcome variable Y statistics:\n")
summary(df$Y)

# Visualize Y distribution by treatment group
df$T_label <- ifelse(df$T == 0, "Control (T=0)", "Treated (T=1)")

p <- ggplot(df, aes(x = Y, fill = T_label)) +
  geom_histogram(alpha = 0.5, bins = 30, position = "identity") +
  labs(title = "Distribution of Outcome Variable by Treatment Group",
       x = "Y (Health Improvement)",
       y = "Frequency",
       fill = "Treatment Group") +
  theme_minimal() +
  theme(text = element_text(size = 12))

ggsave('../output/outcome_distribution_R.png', p, width = 10, height = 6, dpi = 300)
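
The outcome equation implies a known treatment effect for every patient, which serves as a benchmark for the estimates that follow. Writing $X$ for the covariates and assuming $\varepsilon$ has mean zero and is independent of $T$, the `oldpeak` term and the noise cancel in the difference:

$$
\begin{aligned}
\tau(X) &= E[Y \mid T = 1, X] - E[Y \mid T = 0, X] \\
        &= 1 + 0.05\,\text{age} + 0.3\,\text{sex} + 0.2\,\text{restbp}, \\
\text{ATE} &= E[\tau(X)] = 1 + 0.05\,E[\text{age}] + 0.3\,E[\text{sex}] + 0.2\,E[\text{restbp}].
\end{aligned}
$$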

## (1 point) Calculate treatment effect using OLS

In [None]:
# Estimate treatment effect using OLS regression
# Simple model: Y ~ T
model_simple <- lm(Y ~ T, data = df)
cat("Simple OLS Model (Y ~ T):\n")
summary(model_simple)
cat("\nAverage Treatment Effect (ATE):", coef(model_simple)["T"], "\n")

In [None]:
# More complete model with covariates
model_full <- lm(Y ~ T + age + sex + restbp + oldpeak, data = df)
cat("\nFull OLS Model with Covariates:\n")
summary(model_full)
cat("\nAverage Treatment Effect (ATE) with controls:", coef(model_full)["T"], "\n")
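
The models above report a single coefficient on `T`. A minimal sketch of how OLS can also recover the heterogeneity adds interactions between `T` and the covariates that enter the effect. The data frame `sim` and its covariate distributions are illustrative assumptions, not the Cleveland data.

```r
# Simulated stand-in data following the notebook's outcome equation
set.seed(42)
n <- 2000
sim <- data.frame(
  age     = rnorm(n, 54, 9),
  sex     = rbinom(n, 1, 0.7),
  restbp  = rnorm(n, 131, 17),
  oldpeak = rexp(n, 1),
  T       = rbinom(n, 1, 0.5)
)
sim$Y <- (1 + 0.05 * sim$age + 0.3 * sim$sex + 0.2 * sim$restbp) * sim$T +
  0.5 * sim$oldpeak + rnorm(n)

# Interacting T with the covariates lets the effect of T vary with them
model_int <- lm(Y ~ T * (age + sex + restbp) + oldpeak, data = sim)
int_coefs <- coef(model_int)[c("T", "T:age", "T:sex", "T:restbp")]
print(round(int_coefs, 3))
```

The interaction coefficients estimate how the effect changes per unit of each covariate; in this simulation they should land close to 0.05, 0.3, and 0.2.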

## (2 points) Use Random Forest to estimate causal effects

In [None]:
# Prepare features for Random Forest
feature_cols <- c('age', 'sex', 'cp', 'restbp', 'chol', 'fbs', 'restecg', 
                  'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'T')

X_rf <- df[, feature_cols]
y_rf <- df$Y

# Train Random Forest model
set.seed(123)
rf_model <- randomForest(x = X_rf, y = y_rf, 
                         ntree = 100, 
                         maxnodes = 50,
                         nodesize = 10)

cat("Random Forest model trained successfully\n")
cat("% Variance explained:", rf_model$rsq[length(rf_model$rsq)] * 100, "%\n")

In [None]:
# Estimate individual treatment effects using Random Forest
# Create counterfactual datasets
X_treated <- X_rf
X_treated$T <- 1

X_control <- X_rf
X_control$T <- 0

# Predict outcomes under treatment and control
y_pred_treated <- predict(rf_model, X_treated)
y_pred_control <- predict(rf_model, X_control)

# Calculate Conditional Average Treatment Effect (CATE)
df$CATE <- y_pred_treated - y_pred_control

cat("Conditional Average Treatment Effect (CATE) statistics:\n")
summary(df$CATE)

# Visualize CATE distribution
p <- ggplot(df, aes(x = CATE)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  geom_vline(xintercept = mean(df$CATE), color = "red", linetype = "dashed",
             linewidth = 1) +
  annotate("text", x = mean(df$CATE) * 1.2, y = Inf, vjust = 2,
           label = paste("Mean CATE =", round(mean(df$CATE), 4)), color = "red") +
  labs(title = "Distribution of Estimated Treatment Effects",
       x = "Conditional Average Treatment Effect (CATE)",
       y = "Frequency") +
  theme_minimal() +
  theme(text = element_text(size = 12))

ggsave('../output/cate_distribution_R.png', p, width = 10, height = 6, dpi = 300)
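
The counterfactual-prediction step above does not depend on the model being a forest, and because the outcome is simulated, the estimates can be checked against the true effect. The sketch below wraps the step in a hypothetical helper `s_learner_cate` and, as an assumption made only to keep the example free of extra packages, uses `lm()` on simulated data in place of `randomForest()`.

```r
# Generic version of the step: predict under T = 1 and T = 0, take the difference.
# Works with any fitted model whose predict() method accepts `newdata`.
s_learner_cate <- function(fit, X, treat_col = "T") {
  X1 <- X; X1[[treat_col]] <- 1
  X0 <- X; X0[[treat_col]] <- 0
  as.numeric(predict(fit, newdata = X1) - predict(fit, newdata = X0))
}

# Simulated stand-in data with a known effect tau
set.seed(7)
n <- 1500
sim <- data.frame(age = rnorm(n, 54, 9), restbp = rnorm(n, 131, 17),
                  T = rbinom(n, 1, 0.5))
sim$tau <- 1 + 0.05 * sim$age + 0.2 * sim$restbp
sim$Y   <- sim$tau * sim$T + rnorm(n)

# lm() is only a lightweight stand-in for the random forest
fit <- lm(Y ~ T * (age + restbp), data = sim)
cate_hat <- s_learner_cate(fit, sim[c("age", "restbp", "T")])

# Compare the estimates with the known truth
rmse <- sqrt(mean((cate_hat - sim$tau)^2))
cat("RMSE vs. true effect:", round(rmse, 3), "\n")
cat("Correlation with true effect:", round(cor(cate_hat, sim$tau), 3), "\n")
```

In the notebook, the same comparison could be made between `df$CATE` and the effect implied by the outcome equation.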

## (2 points) Plot representative tree capturing heterogeneous treatment effects

In [None]:
# Train a single decision tree (maxdepth = 2) on the outcome Y, with T as a feature,
# to visualize how predicted outcomes differ by treatment and covariates
tree_model <- rpart(Y ~ ., data = data.frame(X_rf, Y = y_rf),
                    control = rpart.control(maxdepth = 2, minsplit = 10))

# Plot the tree
png('../output/representative_tree_R.png', width = 1400, height = 800)
rpart.plot(tree_model, 
           main = "Representative Decision Tree (max_depth=2) for Heterogeneous Treatment Effects",
           box.palette = "RdBu", shadow.col = "gray", cex = 0.8)
dev.off()

cat("Tree interpretation:\n")
cat("This tree is fit to the outcome Y with T as a feature, so its leaves show predicted outcomes.\n")
cat("Covariate splits within the treated branch, if any, indicate features associated with differences in treatment response.\n")

**Interpretation:**

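The representative tree (`maxdepth = 2`) is fit to the outcome `Y` with `T` included as a feature, so it describes predicted outcomes rather than treatment effects directly. Heterogeneity shows up where the tree splits on a covariate within the treated branch:

1. **Root Split:** The first split is on the feature that most reduces outcome variance. Because the simulated treatment effect is large relative to the noise, this is expected to be `T`.
2. **Subsequent Splits:** A covariate split below the `T = 1` branch indicates that the treated outcome, and hence the treatment effect, varies with that covariate.
3. **Leaf Nodes:** Each leaf is a subgroup with a similar predicted outcome; the value shown is the mean of `Y` in that leaf.
4. **Heterogeneity:** Comparing treated leaves with control leaves gives a coarse picture of which subpopulations benefit more from the treatment.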
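
A tree can also be fit to the effect itself, so that every split is a split on effect size. The sketch below is one way to do this under stated assumptions: the covariates are simulated, and the column `effect` is computed from the outcome equation. In the notebook, `df$CATE` would take its place.

```r
library(rpart)

# Simulated stand-in covariates and the effect implied by the outcome equation
set.seed(11)
n <- 1000
sim <- data.frame(
  age    = rnorm(n, 54, 9),
  sex    = rbinom(n, 1, 0.7),
  restbp = rnorm(n, 131, 17)
)
sim$effect <- 1 + 0.05 * sim$age + 0.3 * sim$sex + 0.2 * sim$restbp

# Depth-2 tree on the effect (not on Y)
effect_tree <- rpart(effect ~ age + sex + restbp, data = sim,
                     control = rpart.control(maxdepth = 2, minsplit = 10))
root_var <- as.character(effect_tree$frame$var[1])
cat("Root split variable:", root_var, "\n")
print(effect_tree)
```

With this design the leaves are subgroups ranked by effect size, which matches the heading's goal more directly than a tree on `Y`.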

## (1.5 points) Compute and visualize feature importances

In [None]:
# Get feature importances from Random Forest
# (IncNodePurity, because the forest was fit without importance = TRUE)
importances <- importance(rf_model)
feature_importance_df <- data.frame(
  feature = rownames(importances),
  importance = importances[, 1]
)
feature_importance_df <- feature_importance_df[order(-feature_importance_df$importance), ]

cat("Feature Importances:\n")
print(feature_importance_df)

# Plot feature importances (levels reversed so the most important feature is at the top)
feature_importance_df$feature <- factor(feature_importance_df$feature,
                                        levels = rev(feature_importance_df$feature))

p <- ggplot(feature_importance_df, aes(x = importance, y = feature)) +
  geom_col(fill = "steelblue") +
  labs(title = "Feature Importances from Random Forest Model",
       x = "Importance (IncNodePurity)",
       y = "Feature") +
  theme_minimal() +
  theme(text = element_text(size = 12))

ggsave('../output/feature_importances_R.png', p, width = 10, height = 8, dpi = 300)
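
Node-purity importance is specific to tree ensembles. A model-agnostic alternative is permutation importance: shuffle one feature and measure how much the prediction error grows. The following is a minimal base-R sketch, assuming simulated stand-in data and using `lm()` in place of the forest so that it runs without extra packages; the helper `perm_importance` is a hypothetical name.

```r
# Permutation importance: increase in MSE after shuffling one feature
perm_importance <- function(fit, data, y, features) {
  base_mse <- mean((y - predict(fit, newdata = data))^2)
  sapply(features, function(f) {
    shuffled <- data
    shuffled[[f]] <- sample(shuffled[[f]])
    mean((y - predict(fit, newdata = shuffled))^2) - base_mse
  })
}

set.seed(3)
n <- 1000
sim <- data.frame(age = rnorm(n, 54, 9), chol = rnorm(n, 246, 50),
                  T = rbinom(n, 1, 0.5))
sim$Y <- (1 + 0.05 * sim$age) * sim$T + rnorm(n)   # chol is pure noise here

fit <- lm(Y ~ T * age + chol, data = sim)
imp <- perm_importance(fit, sim, sim$Y, c("age", "chol", "T"))
print(round(sort(imp, decreasing = TRUE), 3))
```

Because the function only needs `predict()`, the same helper could be applied to a fitted forest.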

## (2 points) Plot distribution of standardized covariates by predicted treatment effect terciles

In [None]:
# Standardize all covariates
covariate_cols <- c('age', 'sex', 'cp', 'restbp', 'chol', 'fbs', 'restecg', 
                    'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal')

df_standardized <- df
df_standardized[covariate_cols] <- scale(df[covariate_cols])

cat("Covariates standardized successfully\n")

In [None]:
# Divide CATE into terciles
df_standardized$tercile <- cut(df_standardized$CATE, 
                               breaks = quantile(df_standardized$CATE, probs = c(0, 1/3, 2/3, 1)),
                               labels = c('Low', 'Medium', 'High'),
                               include.lowest = TRUE)

cat("CATE tercile distribution:\n")
table(df_standardized$tercile)
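
`cut()` with quantile breaks stops with a "'breaks' are not unique" error when enough predictions share the same value for two quantiles to coincide. A rank-based sketch that always yields three nearly equal groups is shown below; the helper `rank_terciles` is a hypothetical name, and the toy vector is made up to trigger the tie case.

```r
# Assign terciles from ranks so ties never produce duplicate break points
rank_terciles <- function(x, labels = c("Low", "Medium", "High")) {
  r   <- rank(x, ties.method = "first")            # unique ranks 1..n
  grp <- ceiling(length(labels) * r / length(x))   # group index 1..3
  factor(labels[grp], levels = labels)
}

# Toy effects with heavy ties: the quantile breaks would be 0, 0, 0, 2
x <- c(rep(0, 8), 1, 2)
terc <- rank_terciles(x)
print(table(terc))
```

The trade-off is that tied values are split across groups by their order in the data, so group boundaries among ties are arbitrary.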

In [None]:
# Compute mean of each covariate within each tercile
tercile_means <- df_standardized %>%
  group_by(tercile) %>%
  summarise(across(all_of(covariate_cols), mean)) %>%
  as.data.frame()

cat("Mean standardized covariates by tercile:\n")
print(tercile_means)

# Prepare data for heatmap
tercile_means_long <- melt(tercile_means, id.vars = "tercile")
colnames(tercile_means_long) <- c("tercile", "covariate", "value")

# Create heatmap
p <- ggplot(tercile_means_long, aes(x = covariate, y = tercile, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), size = 3) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", 
                       midpoint = 0, name = "Standardized\nMean") +
  labs(title = "Distribution of Standardized Covariates by Predicted Treatment Effect Terciles",
       x = "Covariates",
       y = "Treatment Effect Tercile") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        text = element_text(size = 12))

ggsave('../output/tercile_heatmap_R.png', p, width = 12, height = 8, dpi = 300)

**Interpretation:**

The heatmap shows how patient characteristics differ across treatment effect terciles:

1. **Red cells** (positive values): indicate that patients in this tercile have above-average values for that covariate.
2. **Blue cells** (negative values): indicate that patients in this tercile have below-average values for that covariate.
3. **White cells** (near zero): indicate that patients in this tercile have average values for that covariate.

This visualization helps identify which patient characteristics are associated with higher or lower treatment effects, revealing heterogeneity in treatment response.

## Summary

This analysis demonstrates:

1. **Randomized Treatment Assignment:** Simulated random assignment to a cash transfer program with probability 0.5.
2. **Outcome Generation:** Simulated an outcome whose treatment effect varies with age, sex, and resting blood pressure.
3. **OLS Estimation:** Estimated average treatment effects using regression.
4. **Random Forest for Causal Inference:** Used machine learning to estimate heterogeneous treatment effects.
5. **Visualization:** Identified key features and patient subgroups with different treatment responses.

By construction, the true treatment effect is 1 + 0.05 × age + 0.3 × sex + 0.2 × restbp, so age, sex, and resting blood pressure are the covariates that drive treatment response. The estimated CATEs, feature importances, and tercile heatmap should be read against this known benchmark.