#### 1. How does linear regression allow for the estimation of the average treatment effect (ATE) in causal inference studies, and what role does the dummy variable play in this estimation?
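* Linear regression estimates the ATE by including a dummy variable that equals 1 for treated units and 0 for untreated units. With no other regressors, the coefficient on this dummy is the difference in mean outcomes between the two groups. It can be read as the ATE only if treatment is as good as randomly assigned, either unconditionally or after conditioning on the controls included in the model.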
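
To make the role of the dummy concrete, the following is a minimal sketch on simulated data (the sample size, the true effect of 2.0, and all variable names are assumptions for illustration). It checks that the OLS coefficient on a treatment dummy equals the difference in group means when no other regressors are present.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical randomized treatment and outcome with a true effect of 2.0.
treated = rng.integers(0, 2, size=n)
outcome = 10.0 + 2.0 * treated + rng.normal(0.0, 1.0, size=n)

# Design matrix: intercept column plus the treatment dummy.
X = np.column_stack([np.ones(n), treated])
intercept, dummy_coef = np.linalg.lstsq(X, outcome, rcond=None)[0]

# Difference in group means, computed directly.
diff_in_means = outcome[treated == 1].mean() - outcome[treated == 0].mean()

print(f"dummy coefficient:   {dummy_coef:.4f}")
print(f"difference in means: {diff_in_means:.4f}")
```

The two printed numbers agree up to floating-point error, and because assignment is random in this simulation both should land near the assumed effect of 2.0.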

#### 2. Why is it necessary to control for potential confounders when using linear regression to estimate causal relationships, and how does including these confounders in the regression model help mitigate bias?
* A confounder affects both the treatment and the outcome, so leaving it out of the model loads part of its influence onto the treatment coefficient. Including confounders as regressors compares treated and untreated units at the same confounder values, which removes that spurious component. The resulting estimate has a causal reading only if all relevant confounders are observed and the linear specification is adequate.
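
A small simulation can illustrate the point. The sketch below assumes a single confounder that drives both treatment uptake and the outcome, with a true treatment effect of 1.0; the data-generating process and the helper `ols` are hypothetical, not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical confounder that raises both treatment uptake and the outcome.
confounder = rng.normal(0.0, 1.0, size=n)
treated = (confounder + rng.normal(0.0, 1.0, size=n) > 0).astype(float)
true_effect = 1.0
outcome = true_effect * treated + 2.0 * confounder + rng.normal(0.0, 1.0, size=n)

def ols(columns, y):
    """Return OLS coefficients for an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    return np.linalg.lstsq(X, y, rcond=None)[0]

naive = ols([treated], outcome)[1]                 # confounder omitted
adjusted = ols([treated, confounder], outcome)[1]  # confounder included

print(f"naive estimate:    {naive:.3f}")
print(f"adjusted estimate: {adjusted:.3f}")
```

In this setup the naive estimate is pushed well above the true effect, while the adjusted estimate should sit close to 1.0.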

#### 3. What is the significance of the "partialling out" interpretation in multivariate linear regression, especially in the context of causal inference, and how does it relate to the ceteris paribus condition?
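* "Partialling out" refers to the Frisch-Waugh-Lovell result: the coefficient on a regressor in a multivariate regression equals the coefficient obtained by first removing the linear influence of the other regressors from both the outcome and that regressor, then regressing the residualized outcome on the residualized regressor. This is the regression counterpart of the ceteris paribus condition: the coefficient describes how the outcome moves with one variable while the others are held fixed, which is the comparison causal inference requires.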
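
The equivalence can be checked numerically. This is a minimal sketch on simulated data (the coefficients 1.5 and -0.8 and the helper `residualize` are assumptions for illustration): it compares the coefficient on `x1` from the full regression with the one from the residual-on-residual regression.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical data: x1 is the variable of interest, x2 a correlated control.
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)
y = 1.5 * x1 - 0.8 * x2 + rng.normal(size=n)

def residualize(target, control):
    """Residuals from regressing target on an intercept and control."""
    Z = np.column_stack([np.ones(len(target)), control])
    return target - Z @ np.linalg.lstsq(Z, target, rcond=None)[0]

# Full multivariate regression of y on an intercept, x1, and x2.
X = np.column_stack([np.ones(n), x1, x2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Partialling out: strip x2 from both y and x1, then regress residual on residual.
y_res = residualize(y, x2)
x1_res = residualize(x1, x2)
beta_partial = (x1_res @ y_res) / (x1_res @ x1_res)

print(f"full regression:  {beta_full:.6f}")
print(f"partialled out:   {beta_partial:.6f}")
```

The two coefficients match up to floating-point error, regardless of the random draws.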

#### 4. How does a randomized controlled trial (RCT) differ from regression analysis in addressing confounding bias, and why might an RCT not always be feasible in real-world research scenarios?
* An RCT addresses confounding by design: random assignment makes the treated and control groups comparable in expectation on both observed and unobserved characteristics. Regression adjusts statistically only for the confounders that are measured and included, so it remains exposed to unobserved confounding and to model misspecification. RCTs are often not feasible in real-world settings due to ethical, practical, or financial constraints.
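
One way to see the difference is to compare covariate balance under the two assignment mechanisms. The sketch below is illustrative only: the covariate, the self-selection rule, and the helper `mean_gap` are assumptions, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Hypothetical pre-treatment covariate, e.g. a standardized motivation score.
covariate = rng.normal(size=n)

# Randomized assignment: a coin flip that ignores the covariate.
randomized = rng.integers(0, 2, size=n)

# Self-selection: units with a higher covariate are more likely to opt in.
self_selected = (covariate + rng.normal(size=n) > 0).astype(int)

def mean_gap(x, group):
    """Difference in the covariate mean between treated and untreated units."""
    return x[group == 1].mean() - x[group == 0].mean()

gap_randomized = mean_gap(covariate, randomized)
gap_self_selected = mean_gap(covariate, self_selected)

print(f"covariate gap, randomized:    {gap_randomized:.3f}")
print(f"covariate gap, self-selected: {gap_self_selected:.3f}")
```

Under randomization the gap should be close to zero, whereas under self-selection the treated group differs systematically before any treatment is applied.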

#### 5. Why is the concept of omitted variable bias (OVB) crucial in the context of linear regression for causal inference, and how can causal graphs help in understanding the potential sources of bias in estimating causal effects?
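* OVB arises when a variable that affects the outcome and is correlated with an included regressor is left out of the model. The included regressor then absorbs part of the omitted variable's effect, making its coefficient biased and inconsistent. Causal graphs help identify potential sources of OVB by making the assumed relationships among variables explicit, which shows which variables open a non-causal path between treatment and outcome and therefore belong in the regression as controls.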
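
For a single omitted variable, the bias has a simple form: the short-regression coefficient equals the long-regression coefficient plus the omitted variable's coefficient times the slope from regressing the omitted variable on the included one. The following sketch verifies this identity on simulated data; the data-generating process and the helper `ols` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2_000

# Hypothetical data: `omitted` affects the outcome and is correlated with `treatment`.
omitted = rng.normal(size=n)
treatment = 0.5 * omitted + rng.normal(size=n)
outcome = 1.0 * treatment + 2.0 * omitted + rng.normal(size=n)

def ols(columns, y):
    """Return OLS coefficients for an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Long regression: both variables included.
beta_long, gamma = ols([treatment, omitted], outcome)[1:]
# Short regression: the second variable is left out.
beta_short = ols([treatment], outcome)[1]
# Auxiliary regression of the omitted variable on the included one.
delta = ols([treatment], omitted)[1]

bias = gamma * delta
print(f"short coefficient:       {beta_short:.4f}")
print(f"long coefficient + bias: {beta_long + bias:.4f}")
```

The identity holds exactly in the sample, so the two printed values agree up to floating-point error.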

In [1]:
import pandas as pd
import numpy as np

np.random.seed(42)


data = pd.DataFrame({
    'age': np.random.randint(18, 65, size=100),  # Age of individuals
    'income': np.random.normal(50000, 15000, size=100),  # Annual income with some variation
    'education_years': np.random.randint(12, 21, size=100),  # Years of education
    'health_index': np.random.uniform(0, 1, size=100) * 100,  # Health index from 0 to 100
    'employed': np.random.choice([0, 1], size=100)  # Employment status (0 = unemployed, 1 = employed)
})

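# 'employed' doubles as the treatment indicator: rows with employed == 1
# receive a simulated income gain drawn from N(5000, 2000); other rows are unchanged.
data['income_post_training'] = data['income'] + np.where(data['employed'] == 1, np.random.normal(5000, 2000, size=100), 0)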

data.head()

|   | age | income | education_years | health_index | employed | income_post_training |
|---|---|---|---|---|---|---|
| 0 | 56 | 59544.576625 | 20 | 41.978086 | 1 | 63859.398889 |
| 1 | 46 | 36399.189971 | 15 | 25.620694 | 1 | 41897.819055 |
| 2 | 32 | 57140.638811 | 12 | 61.151371 | 0 | 57140.638811 |
| 3 | 60 | 69554.919026 | 13 | 8.159418 | 0 | 69554.919026 |
| 4 | 25 | 53173.805185 | 12 | 0.518486 | 1 | 59951.773773 |


In [9]:
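# Calculate the Average Treatment Effect (ATE) of the training program on income.
# Note: this is the mean observed change over all rows; rows with employed == 0 contribute a change of 0.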
data['income_change'] = data['income_post_training'] - data['income']
ATE = data['income_change'].mean()
print(f"Average Treatment Effect (ATE): {ATE}")

Average Treatment Effect (ATE): 2031.2960155456576
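
Because untreated rows have a change of exactly zero, the full-sample mean is the treated-only mean scaled by the share of treated rows. The sketch below illustrates this on a synthetic stand-in for the notebook's data (it uses its own seed and draws, so its numbers will not match the output above; all variable names are assumptions).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100

# Synthetic stand-in for the notebook's setup (not the same random draws).
treated = rng.integers(0, 2, size=n)
gain = np.where(treated == 1, rng.normal(5000, 2000, size=n), 0.0)

mean_change_all = gain.mean()                    # averages in the zeros
mean_change_treated = gain[treated == 1].mean()  # change among treated rows only
share_treated = treated.mean()

print(f"mean change, all rows:     {mean_change_all:.1f}")
print(f"mean change, treated rows: {mean_change_treated:.1f}")
print(f"share treated:             {share_treated:.2f}")
```

With roughly half the rows treated, the full-sample figure comes out at roughly half the simulated gain, which is worth keeping in mind when comparing it with a regression coefficient on the treatment dummy.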


In [3]:
# What is the average income for those who were employed vs. those who were not?
average_income_employed = data[data['employed'] == 1]['income'].mean()
average_income_unemployed = data[data['employed'] == 0]['income'].mean()

print(f"Average Income for Employed: {average_income_employed}")
print(f"Average Income for Unemployed: {average_income_unemployed}")

Average Income for Employed: 49844.36398365272
Average Income for Unemployed: 51957.36250416066


In [5]:
# Does higher education correlate with higher income in this dataset?
correlation_education_income = data['education_years'].corr(data['income'])
correlation_education_income

0.0733377547304343

In [6]:
# Compare the mean health index of employed and unemployed individuals.
# This computes the raw difference in means only; no significance test is run here.
average_health_employed = data[data['employed'] == 1]['health_index'].mean()
average_health_unemployed = data[data['employed'] == 0]['health_index'].mean()

health_index_difference = average_health_employed - average_health_unemployed
print(f"Health Index Difference: {health_index_difference}")

Health Index Difference: -9.723033033929681
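
Whether a gap of this size is statistically distinguishable from zero needs a test, not just a difference. One option is Welch's t statistic, sketched below with plain NumPy on hypothetical samples (the helper `welch_t` and the simulated groups are assumptions; `scipy.stats.ttest_ind` with `equal_var=False` performs the same test and also returns a p-value).

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Squared standard errors of each group mean.
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t, df

rng = np.random.default_rng(6)
# Hypothetical health indices drawn from the same uniform distribution,
# so any gap between the groups is sampling noise by construction.
group_a = rng.uniform(0, 100, size=50)
group_b = rng.uniform(0, 100, size=50)

t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

As a rough guide, an absolute t value below about 2 at these sample sizes is consistent with no difference between the groups.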


In [8]:
# Predict post-training income using a linear regression model with 'age',
# 'education_years', 'health_index', and 'employed' as predictors.
# What is the coefficient for 'employed', and how does it relate to the ATE calculated above?

from sklearn.linear_model import LinearRegression

features = ['age', 'education_years', 'health_index', 'employed']
target = 'income_post_training'

model = LinearRegression()
model.fit(data[features], data[target])

coefficients = model.coef_
employed_coefficient = coefficients[features.index('employed')]
print(f"Coefficient for 'employed': {employed_coefficient}")

Coefficient for 'employed': 1905.3979115265333
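
The model above does not include baseline `income`, which accounts for most of the variation in post-training income. As a sketch of how much this matters, the block below refits the treatment coefficient with and without the baseline on a synthetic stand-in for the notebook's data (own seed and draws, so the numbers will not match the output above; `treatment_coef` is a hypothetical helper).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100

# Synthetic stand-in mirroring the notebook's setup (not the same draws).
income = rng.normal(50000, 15000, size=n)
treated = rng.integers(0, 2, size=n)
income_post = income + np.where(treated == 1, rng.normal(5000, 2000, size=n), 0.0)

def treatment_coef(controls):
    """OLS coefficient on the treatment dummy, given extra control columns."""
    X = np.column_stack([np.ones(n), treated, *controls])
    return np.linalg.lstsq(X, income_post, rcond=None)[0][1]

coef_without_baseline = treatment_coef([])
coef_with_baseline = treatment_coef([income])

print(f"without baseline income: {coef_without_baseline:.1f}")
print(f"with baseline income:    {coef_with_baseline:.1f}")
```

With only 100 rows and an income standard deviation of 15000, the estimate without the baseline can land far from the simulated gain of 5000 purely by chance; controlling for baseline income removes most of that noise.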
