# **Goal**: Hard-code ONE oracle run
For one query, compute one ATE number.

ATE = Average Treatment Effect

In [42]:
from pathlib import Path
import pandas as pd
from tabpfn import TabPFNClassifier
import numpy as np
from sklearn.linear_model import LogisticRegression

### Load dataset CSV.

I selected the alphabetically first real dataset, `4Cleaned base data.csv`.

It appears to focus on social support and poverty status among (mostly older) individuals/households.
The columns suggest a study of how economic, emotional, and care support relate to single- and multidimensional poverty, controlling for demographics and household context.

In [33]:
example_data_path = Path("..") / "data" / "real_data" / "4Cleaned base data.csv"

df = pd.read_csv(example_data_path)

display(df.head())

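| Unnamed: 0 | id | economicsupport | emotionalsupport | caresupport | economicpoverty | healthpoverty | rightspoverty | spiritualpoverty | multidimensionalpoverty | gender | ... | squareofage | maritalstatus | householdregistration | totalnumberofchildren | numberofboys | proportionofboys | region | socialsecurity | totalhouseholdincome | numberofpeoplelivingtogether |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11001618 | 7.600903 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 10201 | 0 | 1 | 3 | 2 | 0.666667 | 1 | 0 | 10.463103 | 3 |
| 1 | 11001918 | 8.987197 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 5041 | 0 | 1 | 2 | 0 | 0.0 | 0 | 1 | 10.700995 | 0 |
| 2 | 11002118 | 8.699514 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 6400 | 0 | 1 | 3 | 2 | 0.666667 | 1 | 1 | 10.645425 | 0 |
| 3 | 11002618 | 0.0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 6561 | 0 | 1 | 2 | 1 | 0.5 | 0 | 1 | 11.395132 | 3 |
| 4 | 11002718 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 5929 | 1 | 1 | 2 | 1 | 0.5 | 1 | 1 | 11.395132 | 3 |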


### Manually define treatment T, outcome Y, and covariates / confounders X.

Oracle question: What is the causal effect of emotional support on multidimensional poverty?

**Why this choice?**
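- Treatment is binary, so the ATE is a simple contrast between T = 1 and T = 0
- Outcome is binary, so a classifier's predicted probability can serve as the outcome model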

**Notes about causal language**:
- Treatment T: What happens if we set T = 1 vs T = 0?
- Outcome Y: How does Y change when we intervene on T?
- Covariates / Confounders X: What do we need to condition on so that T is "as good as randomized"?
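
The three notes above can be written compactly in potential-outcome notation. This is the standard textbook formulation rather than anything specific to this dataset. The symbols $Y(t)$ (the outcome a unit would have under $T = t$) and $\hat{\mu}(x, t)$ (a fitted model for $P(Y = 1 \mid X = x, T = t)$) are notation introduced here for illustration. The second line holds only under the assumptions that X contains all confounders and that both treatment values occur at every covariate value.

$$
\begin{aligned}
\mathrm{ATE} &= \mathbb{E}\big[Y(1) - Y(0)\big] \\
&= \mathbb{E}_X\big[\,\mathbb{E}[Y \mid T = 1, X] - \mathbb{E}[Y \mid T = 0, X]\,\big]
\quad \text{if } \big(Y(0), Y(1)\big) \perp\!\!\!\perp T \mid X \\
\widehat{\mathrm{ATE}} &= \frac{1}{N} \sum_{i=1}^{N} \big[\hat{\mu}(x_i, 1) - \hat{\mu}(x_i, 0)\big]
\end{aligned}
$$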

In [34]:
# Define treatment
T_col = "emotionalsupport"

# Define outcome
Y_col = "multidimensionalpoverty"

# Define covariates / confounders that likely affect both support and poverty
X_cols = [
    "age",
    "gender",
    "maritalstatus",
    "region",
    "totalhouseholdincome",
    "socialsecurity",
]

### Sanity check

In [35]:
T = df[T_col].to_numpy()
Y = df[Y_col].to_numpy()
X = df[X_cols].to_numpy()

print("N = ", len(df))
print("Treatment prevalence = ", T.mean())
print("Outcome prevalence = ", Y.mean())

N =  8061
Treatment prevalence =  0.8227267088450565
Outcome prevalence =  0.1568043667038829


**Interpretation**
- Treatment prevalence:
  - ~82% have emotionalsupport = 1
  - ~18% have emotionalsupport = 0
- Outcome prevalence:
  - ~15.7% are in multidimensional poverty
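
The prevalences above are only meaningful if T and Y are really coded as 0/1 and X has no missing values. A minimal sketch of such a check follows; `check_binary_inputs` is a hypothetical helper, and the toy arrays stand in for the real `T`, `Y`, and `X`.

```python
import numpy as np


def check_binary_inputs(T, Y, X):
    """Return per-arm counts after verifying T and Y are binary and X has no NaNs."""
    T, Y, X = np.asarray(T), np.asarray(Y), np.asarray(X, dtype=float)
    for name, arr in (("T", T), ("Y", Y)):
        values = set(np.unique(arr).tolist())
        if not values <= {0, 1}:
            raise ValueError(f"{name} is not binary: found values {sorted(values)}")
    if np.isnan(X).any():
        raise ValueError("X contains missing values")
    return {"n_treated": int((T == 1).sum()), "n_control": int((T == 0).sum())}


# Toy arrays for illustration only (not the real dataset).
T_toy = np.array([1, 1, 1, 0, 1, 0])
Y_toy = np.array([0, 0, 1, 1, 0, 0])
X_toy = np.array([[70.0, 1], [65.0, 0], [80.0, 1], [72.0, 0], [68.0, 1], [75.0, 0]])

counts = check_binary_inputs(T_toy, Y_toy, X_toy)
print(counts)
```

The per-arm counts matter because the untreated group is the smaller one here, so the T = 0 predictions rest on fewer observations.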

### Fit TabPFN as an outcome model

Feature matrix = [X, T], i.e. explicitly include treatment as a feature.

In [40]:
X_with_T = np.column_stack([X, T])

model = TabPFNClassifier()
model.fit(X_with_T, Y)

### Compute the ATE (causal part)

In [41]:
# Potential outcome under T = 0
X0 = np.column_stack([X, np.zeros(len(X))])
Y0_hat = model.predict_proba(X0)[:, 1]

# Potential outcome under T = 1
X1 = np.column_stack([X, np.ones(len(X))])
Y1_hat = model.predict_proba(X1)[:, 1]

ATE = (Y1_hat - Y0_hat).mean()

print("Oracle ATE (TabPFN): ", ATE)

Oracle ATE (TabPFN):  -0.02862272


**Interpretation of the ATE**

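ATE ≈ -0.029

Assuming X captures every confounder of T and Y, having emotional support lowers the estimated probability of multidimensional poverty by ~2.9 percentage points, averaged over the covariate values observed in the sample. If relevant confounders are missing from X, this number is only an adjusted association.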
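
The two counterfactual predictions above follow a pattern that works for any classifier with a scikit-learn style `fit` / `predict_proba` interface. The sketch below wraps it in a hypothetical helper, `plug_in_ate`, and runs it on synthetic data where treatment is known to lower the outcome probability. The data-generating process and all `_sim` names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def plug_in_ate(model, X, T, Y):
    """Fit model on [X, T], then average predicted P(Y=1) under T=1 minus T=0."""
    X = np.asarray(X, dtype=float)
    model.fit(np.column_stack([X, T]), Y)
    p0 = model.predict_proba(np.column_stack([X, np.zeros(len(X))]))[:, 1]
    p1 = model.predict_proba(np.column_stack([X, np.ones(len(X))]))[:, 1]
    return float(np.mean(p1 - p0))


# Synthetic data for illustration only: treatment lowers the outcome probability.
rng = np.random.default_rng(0)
n = 2000
X_sim = rng.normal(size=(n, 2))
T_sim = rng.binomial(1, 1 / (1 + np.exp(-0.5 * X_sim[:, 0])))
logit = -1.0 + X_sim[:, 0] - 1.0 * T_sim
Y_sim = rng.binomial(1, 1 / (1 + np.exp(-logit)))

ate_sim = plug_in_ate(LogisticRegression(max_iter=1000), X_sim, T_sim, Y_sim)
print("Plug-in ATE on synthetic data:", ate_sim)
```

With the notebook's variables the equivalent call would be `plug_in_ate(TabPFNClassifier(), X, T, Y)`, since the cells above use the same `fit` and `predict_proba` methods.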

### Get a baseline ATE to compare against TabPFN

This checks whether the TabPFN ATE is meaningful or whether any simple model would give the same number.

In [43]:
lr = LogisticRegression(max_iter=1000)
lr.fit(X_with_T, Y)

Y0_lr = lr.predict_proba(X0)[:, 1]
Y1_lr = lr.predict_proba(X1)[:, 1]

ATE_lr = (Y1_lr - Y0_lr).mean()

print("Oracle ATE (Logistic Regression):", ATE_lr)

Oracle ATE (Logistic Regression): -0.03798557956607574


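Both estimates are negative, so TabPFN agrees with a classical method like logistic regression on the direction of the effect. The magnitudes differ by about 0.9 percentage points (-0.029 vs. -0.038), and without an uncertainty estimate for either number it is unclear whether that gap is meaningful.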
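
One more reference point that needs no model is the unadjusted difference in outcome rates between the two groups. The sketch below uses a hypothetical helper, `naive_difference`, and toy arrays rather than the real data. Running it on the notebook's `T` and `Y` and comparing with the two adjusted estimates would show how much adjusting for X changes the answer.

```python
import numpy as np


def naive_difference(T, Y):
    """Unadjusted contrast: mean outcome among treated minus mean among untreated."""
    T, Y = np.asarray(T), np.asarray(Y, dtype=float)
    return float(Y[T == 1].mean() - Y[T == 0].mean())


# Toy arrays for illustration only (not the real dataset).
T_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0])
Y_toy = np.array([0, 0, 0, 1, 0, 1, 1, 0])

print(naive_difference(T_toy, Y_toy))  # 0.25 - 0.50 = -0.25
```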