# Impact of a policy intervention on educational outcomes

### Observational study

### Project Idea: 
Investigate the impact of a policy intervention on educational outcomes using observational data from different regions. 

### Methodology: 
Use propensity score matching or regression adjustment to address selection bias and confounding factors. 



# Process

Since this is an observational study, we'll need to account for potential confounding factors and selection bias.
Here's a suggested approach using propensity score matching:
1.	Estimate Propensity Scores: Use a machine learning classifier (e.g., logistic regression or random forest) to estimate the propensity scores, which represent the probability of receiving the treatment given observed covariates (e.g., age, gender, SES, region).
2.	Matching: Match each treated individual (from the treatment group) with one or more control individuals (from the control group) based on their propensity scores. The goal is for treated and control individuals to be balanced with respect to observed covariates, which the next step checks.
3.	Assess Balance: Evaluate the balance of covariates between the treated and control groups after matching. Common methods include computing standardized mean differences or conducting hypothesis tests.
4.	Estimate Causal Effect: Estimate the causal effect of the policy intervention on educational outcomes using the matched sample. This can be done using various methods, such as comparing means or using regression models adjusted for covariates.
5.	Sensitivity Analysis: Conduct sensitivity analysis to assess the robustness of the results to potential hidden bias or unobserved confounding. This may involve using different matching methods, trimming or weighting observations, or examining the effect of varying model specifications.
6.	Interpretation and Reporting: Interpret the estimated causal effect in the context of the research question and dataset. Report the findings, including estimates of uncertainty (e.g., confidence intervals) and any limitations or assumptions made during the analysis.

# Generate data set

The Python code below generates a synthetic dataset for this project with student characteristics (age, gender, socioeconomic status (SES)), region, treatment status, and an educational outcome (exam score).

In this simplified setup, treatment is assigned at random with probability 0.5, and the baseline exam score is the binary class label returned by make_classification, so neither depends on the student characteristics. A treatment effect of 5 points is then added to the scores of the treatment group to simulate the impact of the policy intervention. Finally, the dataset is saved to a CSV file named 'synthetic_data.csv'. You can modify the parameters and features as needed for your specific analysis.


In [19]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data for student characteristics
n_students = 1000
n_regions = 5

# Generate student characteristics: age, gender, socioeconomic status (SES), and region
age = np.random.randint(6, 18, n_students)
gender = np.random.choice(['Male', 'Female'], n_students)
ses = np.random.choice(['Low', 'Medium', 'High'], n_students)
region = np.random.randint(1, n_regions + 1, n_students)

# Generate treatment status: 0 for control, 1 for treatment
treatment = np.random.choice([0, 1], n_students, p=[0.5, 0.5])

# Generate baseline educational outcomes: exam scores
# The baseline is the binary class label from make_classification; it is generated
# independently of the student characteristics above, and the feature matrix X is not used
X, y = make_classification(n_samples=n_students, n_features=5, n_informative=3,
                           n_classes=2, random_state=42)

# Create a DataFrame to store the synthetic data
data = pd.DataFrame({
    'Age': age,
    'Gender': gender,
    'SES': ses,
    'Region': region,
    'Treatment': treatment,
    'Exam_Score': y,
})

# Create dummy variables for categorical variables (e.g., Gender, SES)
data = pd.get_dummies(data, columns=['Gender', 'SES'], drop_first=True)

# Split data into treatment and control groups
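# .copy() makes each group an independent DataFrame, so the in-place update below
# modifies real data rather than a view (avoids SettingWithCopyWarning)
treatment_group = data[data['Treatment'] == 1].copy()
control_group = data[data['Treatment'] == 0].copy()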

# Define treatment effect
treatment_effect = 5

# Add treatment effect to the treatment group
treatment_group['Exam_Score'] += treatment_effect

# Concatenate treatment and control groups
synthetic_data = pd.concat([treatment_group, control_group])

# Shuffle the data
synthetic_data = synthetic_data.sample(frac=1).reset_index(drop=True)

# Display the first few rows of the synthetic dataset
print(synthetic_data.head())

# Save the synthetic dataset to a CSV file
synthetic_data.to_csv('synthetic_data.csv', index=False)


   Age  Region  Treatment  Exam_Score  Gender_Male  SES_Low  SES_Medium
0   17       4          1           6            0        0           0
1   11       5          0           1            1        1           0
2   16       1          1           5            0        0           0
3   15       5          1           6            0        1           0
4   12       2          1           5            1        1           0
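In the generator above, treatment is assigned at random, so the covariates do not confound the treatment effect. To give propensity score matching something to correct, treatment assignment can be made to depend on the covariates. The following is a minimal sketch and not part of the original notebook; the coefficients, the 60-point baseline, and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Hypothetical covariates, loosely mirroring the columns above
age = rng.integers(6, 18, n)
ses_low = rng.integers(0, 2, n)  # 1 = low SES

# Treatment probability depends on the covariates through a logistic model
# (coefficients are arbitrary illustration values)
logit = -0.5 + 1.5 * ses_low - 0.05 * (age - 12)
p_treat = 1.0 / (1.0 + np.exp(-logit))
treatment = rng.binomial(1, p_treat)

# The outcome depends on the same covariates plus a true effect of 5 points
true_effect = 5.0
score = (60 + 1.5 * (age - 12) - 8.0 * ses_low
         + true_effect * treatment + rng.normal(0, 5, n))

df = pd.DataFrame({"Age": age, "SES_Low": ses_low,
                   "Treatment": treatment, "Exam_Score": score})

# Low-SES and younger students are more likely to be treated and also tend
# to score lower, so the naive difference in means understates the effect
naive_diff = (df.loc[df["Treatment"] == 1, "Exam_Score"].mean()
              - df.loc[df["Treatment"] == 0, "Exam_Score"].mean())
print(f"True effect: {true_effect}, naive difference: {naive_diff:.2f}")
```

With data generated this way, the gap between the naive difference and the true effect is the selection bias that matching is meant to reduce.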


# Estimate Propensity Scores: 

The next step uses the generated synthetic dataset to estimate the causal effect of the policy intervention on educational outcomes, starting with the propensity scores.

Use a machine learning classifier (e.g., logistic regression or random forest) to estimate the propensity scores, which represent the probability of receiving the treatment given observed covariates (e.g., age, gender, SES, region).

The following Python code fits a logistic regression on the covariates of the synthetic dataset and extracts each student's estimated probability of treatment:

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import classification_report

# Step 1: Estimate propensity scores
X = synthetic_data.drop(['Treatment', 'Exam_Score'], axis=1)
y = synthetic_data['Treatment']
propensity_model = LogisticRegression()
propensity_model.fit(X, y)
propensity_scores = propensity_model.predict_proba(X)[:, 1]

In [10]:
propensity_scores

array([0.49384555, 0.49722452, 0.5390293 , 0.48236591, 0.54654847,
       0.5454578 , 0.51812046, 0.45926863, 0.5454578 , 0.54019844,
       0.50661297, 0.52231186, 0.56214774, 0.53503456, 0.52021789,
       0.46106793, 0.56008131, 0.48755449, 0.50973609, 0.47158228,
       0.47189938, 0.50351775, 0.51610161, 0.45372572, 0.45372572,
       0.47778577, 0.44847817, 0.51392925, 0.53386361, 0.53485668,
       0.5495396 , 0.56214774, 0.49683261, 0.50661297, 0.52557591,
       0.55277961, 0.52767047, 0.49512687, 0.45788846, 0.49932226,
       0.49355862, 0.45580632, 0.51610161, 0.56008131, 0.46205709,
       0.49185314, 0.50973609, 0.48655738, 0.45489243, 0.48215575,
       0.48236591, 0.49715072, 0.47158228, 0.54121783, 0.49885654,
       0.52557865, 0.44865774, 0.44847817, 0.50863647, 0.51812046,
       0.5262848 , 0.46949179, 0.50032005, 0.53694364, 0.51073345,
       0.46740238, 0.50514967, 0.49893034, 0.47179477, 0.4791729 ,
       0.45372572, 0.50561534, 0.50763895, 0.4946612 , 0.52557
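The scores printed above cluster close to 0.5, which is what one would expect when treatment is assigned independently of the covariates. Before matching, it is worth checking that the treated and control groups cover a common range of propensity scores. The helper below is a minimal sketch of such a check; the function name and the toy values are hypothetical and are not taken from the notebook's output.

```python
import numpy as np
import pandas as pd

def propensity_overlap(scores, treatment):
    """Summarize propensity scores by group and return the common-support interval."""
    ps = pd.Series(np.asarray(scores, dtype=float))
    t = pd.Series(np.asarray(treatment)).astype(int)
    summary = ps.groupby(t).describe()[["min", "mean", "max"]]
    # Common support: the range of scores covered by both groups
    lower = max(ps[t == 0].min(), ps[t == 1].min())
    upper = min(ps[t == 0].max(), ps[t == 1].max())
    return summary, (lower, upper)

# Toy example with made-up scores
toy_scores = [0.20, 0.35, 0.50, 0.40, 0.55, 0.80]
toy_treatment = [0, 0, 0, 1, 1, 1]
summary, (lower, upper) = propensity_overlap(toy_scores, toy_treatment)
print(summary)
print(f"Common support: [{lower:.2f}, {upper:.2f}]")
```

Units whose scores fall outside the common-support interval have no comparable counterpart in the other group and are candidates for exclusion.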

# Matching: 

Match each treated individual (from the treatment group) with one or more control individuals (from the control group) based on their propensity scores. The goal is for treated and control individuals to be balanced with respect to observed covariates, which is checked in the balance assessment step.

In [11]:
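# Step 2: Matching
# Match on the estimated propensity score (not the raw covariates), 1:1 with replacement
control_indices = synthetic_data[synthetic_data['Treatment'] == 0].index
treated_indices = synthetic_data[synthetic_data['Treatment'] == 1].index
ps = pd.Series(propensity_scores, index=synthetic_data.index)
nn = NearestNeighbors(n_neighbors=1)
nn.fit(ps.loc[control_indices].to_numpy().reshape(-1, 1))
distances, indices = nn.kneighbors(ps.loc[treated_indices].to_numpy().reshape(-1, 1))
# kneighbors returns positions within the control subset; map them back to DataFrame labels
matched_control_indices = control_indices[indices.reshape(-1)]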
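Nearest-neighbor matching accepts the closest control no matter how far away it is. A common refinement is a caliper, which discards matches whose propensity scores differ by more than a chosen tolerance. The following is a minimal sketch under that assumption; the function name, the default caliper of 0.05, and the toy scores are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def match_with_caliper(ps_treated, ps_control, caliper=0.05):
    """1:1 nearest-neighbor matching (with replacement) on the propensity score.

    Returns the positions of treated units that found a match and the
    positions of their matched controls, both relative to the input arrays.
    """
    ps_treated = np.asarray(ps_treated, dtype=float).reshape(-1, 1)
    ps_control = np.asarray(ps_control, dtype=float).reshape(-1, 1)
    nn = NearestNeighbors(n_neighbors=1).fit(ps_control)
    distances, positions = nn.kneighbors(ps_treated)
    keep = distances.ravel() <= caliper  # drop matches that are too far apart
    return np.flatnonzero(keep), positions.ravel()[keep]

# The third treated unit (0.90) has no control within 0.05 and is dropped
treated_pos, control_pos = match_with_caliper([0.30, 0.52, 0.90], [0.28, 0.50, 0.55])
print(treated_pos, control_pos)
```

A tighter caliper improves match quality but leaves more treated units unmatched, so the estimate then applies only to the treated units that were retained.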

# Assess Balance: 

Evaluate the balance of covariates between the treated and control groups after matching. Common methods include computing standardized mean differences or conducting hypothesis tests.

In [12]:
# Step 3: Assess Balance
# Standardized mean difference (SMD) per covariate between treated units and their
# matched controls; an absolute SMD below 0.1 is a common rule of thumb for good balance
matched_data = synthetic_data.loc[matched_control_indices]
treated_data = synthetic_data.loc[treated_indices]
covariates = ['Age', 'Gender_Male', 'SES_Low', 'SES_Medium', 'Region']
treated_cov = treated_data[covariates].astype(float)
matched_cov = matched_data[covariates].astype(float)
pooled_sd = np.sqrt((treated_cov.var() + matched_cov.var()) / 2)
smd = (treated_cov.mean() - matched_cov.mean()) / pooled_sd

print("Balance Assessment (standardized mean differences):")
print(smd)



# Estimate Causal Effect: 

Estimate the causal effect of the policy intervention on educational outcomes using the matched sample. This can be done using various methods, such as comparing means or using regression models adjusted for covariates.

In [13]:
# Step 4: Estimate Causal Effect
# Effect on the treated: mean treated outcome minus mean outcome of the matched controls
treated_mean = synthetic_data.loc[treated_indices, 'Exam_Score'].mean()
matched_control_mean = matched_data['Exam_Score'].mean()
treatment_effect_estimate = treated_mean - matched_control_mean
print("Estimated Causal Effect (Propensity Score Matching):", treatment_effect_estimate)
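A point estimate is more useful with a measure of uncertainty. One simple option, sketched below, is to treat each treated unit and its matched control as a pair and build a normal-approximation confidence interval from the pairwise differences. This is a rough sketch: it ignores the uncertainty from estimating the propensity scores and from reusing controls when matching with replacement. The function name and the toy outcomes are hypothetical.

```python
import numpy as np

def att_with_ci(treated_outcomes, matched_control_outcomes, z=1.96):
    """Mean of pairwise differences with a normal-approximation 95% interval."""
    diffs = (np.asarray(treated_outcomes, dtype=float)
             - np.asarray(matched_control_outcomes, dtype=float))
    att = diffs.mean()
    se = diffs.std(ddof=1) / np.sqrt(len(diffs))  # standard error of the mean difference
    return att, (att - z * se, att + z * se)

# Toy outcomes for four matched pairs (made-up numbers)
att, (low, high) = att_with_ci([72, 65, 80, 70], [66, 61, 73, 66])
print(f"ATT = {att:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

For the notebook's data, the two inputs would be the exam scores of the treated students and of their matched controls, in the same order.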


# Sensitivity Analysis: 

Conduct sensitivity analysis to assess the robustness of the results to potential hidden bias or unobserved confounding. This may involve using different matching methods, trimming or weighting observations, or examining the effect of varying model specifications.

Sensitivity analysis is crucial for assessing the robustness of the estimated causal effect to potential biases or assumptions made during the analysis. Here are some approaches to conducting sensitivity analysis in the context of propensity score matching:
1.	Varying Matching Algorithm: Try using different matching algorithms (e.g., nearest neighbor matching, kernel matching, or Mahalanobis distance matching) to see if the results are consistent across methods.
2.	Trimming: Apply trimming to exclude extreme propensity score values or observations with poor matches. This can help assess the sensitivity of the results to the choice of trimming threshold.
3.	Propensity Score Model Specification: Explore different specifications for the propensity score model (e.g., including higher-order terms, interaction terms, or different functional forms) to see if the results are sensitive to model specification.
4.	Testing Balance: Conduct hypothesis tests or visually inspect the balance of covariates between the treated and control groups after matching. If covariate balance is not achieved, explore alternative methods for achieving balance or consider adjusting for additional covariates.
5.	Checking Common Support: Ensure that there is sufficient overlap in the propensity score distributions between the treated and control groups. If not, consider restricting the analysis to regions of common support or exploring alternative methods such as propensity score weighting.
6.	Assessing Hidden Bias: Perform sensitivity analysis to assess the potential impact of unobserved confounding or hidden bias on the estimated causal effect. This may involve using sensitivity analysis techniques such as Rosenbaum bounds or bias amplification tests.
7.	Multiple Imputation: If there are missing values in the covariates, consider using multiple imputation techniques to impute missing values and assess the sensitivity of the results to different imputation methods.

In this example, we use a different matching algorithm (Radius Neighbors matching) and compare the estimated causal effect to assess sensitivity to the choice of matching method. You can further explore other sensitivity analysis techniques and adapt them based on the specific characteristics of your dataset and research question.

Here's a Python code snippet demonstrating sensitivity analysis using different matching algorithms:


In [20]:
# Step 5: Sensitivity Analysis
# Radius matching: each treated unit is compared with all controls whose propensity
# score lies within `radius` of its own (the radius value here is illustrative)
ps = pd.Series(propensity_scores, index=synthetic_data.index)
radius_nn = NearestNeighbors(radius=0.01)
radius_nn.fit(ps.loc[control_indices].to_numpy().reshape(-1, 1))
matched_positions = radius_nn.radius_neighbors(
    ps.loc[treated_indices].to_numpy().reshape(-1, 1), return_distance=False)

# Average the matched controls' scores per treated unit; skip treated units without a match
control_scores = synthetic_data.loc[control_indices, 'Exam_Score'].to_numpy()
treated_scores = synthetic_data.loc[treated_indices, 'Exam_Score'].to_numpy()
pair_differences = [treated_scores[i] - control_scores[pos].mean()
                    for i, pos in enumerate(matched_positions) if len(pos) > 0]

# Estimate causal effect using Radius Neighbors matching
treatment_effect_estimate_radius = np.mean(pair_differences)
print("Estimated Causal Effect (Radius Neighbors Matching):", treatment_effect_estimate_radius)

The printed "Estimated Causal Effect (Radius Neighbors Matching)" is the estimated effect of the policy intervention on exam scores under radius matching: the average difference between each treated student's score and the mean score of the control students matched to them.

Because the synthetic data adds a treatment effect of 5 points to every treated student's score, a sound estimate should be positive and close to 5. A negative estimate here would not indicate that the policy harms educational outcomes; it would point to a problem in the analysis, such as subtracting the groups in the wrong order or selecting the matched rows incorrectly. Comparing this estimate with the nearest-neighbor estimate shows how sensitive the result is to the choice of matching method.

You can interpret this estimated causal effect in the context of your research question and dataset to draw meaningful conclusions about the effectiveness of the policy intervention.
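Trimming, listed among the sensitivity checks above, can be explored in a similar way. The sketch below drops observations with extreme propensity scores at several thresholds and recomputes a simple difference in mean outcomes each time. It is only an illustration: in practice the matching step would be rerun on each trimmed sample. The function name, column names, thresholds, and toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def effect_under_trimming(df, ps_col, treat_col, outcome_col,
                          thresholds=(0.0, 0.05, 0.10)):
    """Recompute the difference in mean outcomes after trimming extreme scores."""
    rows = []
    for alpha in thresholds:
        # Keep only units with a propensity score inside [alpha, 1 - alpha]
        kept = df[(df[ps_col] >= alpha) & (df[ps_col] <= 1 - alpha)]
        treated = kept.loc[kept[treat_col] == 1, outcome_col]
        control = kept.loc[kept[treat_col] == 0, outcome_col]
        rows.append({"alpha": alpha, "n_kept": len(kept),
                     "diff_in_means": treated.mean() - control.mean()})
    return pd.DataFrame(rows)

toy = pd.DataFrame({
    "ps":         [0.02, 0.30, 0.45, 0.55, 0.70, 0.98],
    "Treatment":  [0, 0, 1, 0, 1, 1],
    "Exam_Score": [50, 60, 70, 62, 72, 90],
})
trim_table = effect_under_trimming(toy, "ps", "Treatment", "Exam_Score")
print(trim_table)
```

If the estimate changes substantially as the threshold tightens, the result depends heavily on units with extreme scores and should be reported with that caveat.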
