# Impact of an Online Learning Platform on Student Performance

### Quasi-experiment

### Project Idea: 
Examine the impact of a new online learning platform on student performance using data on student demographics, prior academic performance, and outcomes.

### Methodology: 
Generate data for student attributes and outcomes. Utilize Nearest Neighbor matching to pair treated students with control students. Calculate mean differences for each feature between treated and control groups. Analyze mean outcomes and estimate the Average Treatment Effect (ATE) of the online learning platform.

# Create data set

This code will generate a synthetic dataset containing information about student demographics, prior academic performance, and final exam scores. You can modify the parameters and distributions as needed to better reflect your specific research context.

In [2]:
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Number of students
n_students = 1000

# Generate student demographics
demographics = pd.DataFrame({
    'Student_ID': range(1, n_students + 1),
    'Gender': np.random.choice(['Male', 'Female'], size=n_students),
    'Age': np.random.randint(18, 25, size=n_students),
    'Ethnicity': np.random.choice(['White', 'Black', 'Hispanic', 'Asian'], size=n_students),
    'Socioeconomic_Status': np.random.choice(['Low', 'Medium', 'High'], size=n_students)
})

# Generate prior academic performance
prior_performance = pd.DataFrame({
    'Student_ID': range(1, n_students + 1),
    'High_School_GPA': np.random.uniform(2.0, 4.0, size=n_students),
    'SAT_Score': np.random.randint(800, 1600, size=n_students)
})

# Generate outcomes (e.g., exam scores)
outcomes = pd.DataFrame({
    'Student_ID': range(1, n_students + 1),
    'Final_Exam_Score': np.random.randint(50, 100, size=n_students)
})

# Combine datasets
data = pd.merge(demographics, prior_performance, on='Student_ID')
data = pd.merge(data, outcomes, on='Student_ID')

# Display the first few rows of the dataset
print(data.head())

   Student_ID  Gender  Age Ethnicity Socioeconomic_Status  High_School_GPA  \
0           1    Male   23     Black                  Low         2.127282   
1           2  Female   24     Black               Medium         3.662747   
2           3    Male   18     White                 High         3.197957   
3           4    Male   18  Hispanic                 High         2.229866   
4           5    Male   18  Hispanic               Medium         2.187715   

   SAT_Score  Final_Exam_Score  
0       1153                81  
1       1298                91  
2        940                67  
3       1557                64  
4        892                89  


In [3]:
data.describe()

| | Student_ID | Age | High_School_GPA | SAT_Score | Final_Exam_Score |
|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 500.5 | 21.005 | 2.986954 | 1203.412 | 74.819 |
| std | 288.819436 | 2.040377 | 0.576381 | 230.07113 | 14.218453 |
| min | 1.0 | 18.0 | 2.000023 | 800.0 | 50.0 |
| 25% | 250.75 | 19.0 | 2.516871 | 1005.75 | 63.0 |
| 50% | 500.5 | 21.0 | 2.983421 | 1203.5 | 74.0 |
| 75% | 750.25 | 23.0 | 3.473738 | 1405.25 | 88.0 |
| max | 1000.0 | 24.0 | 3.995642 | 1599.0 | 99.0 |


Now that we have our synthetic dataset, we can proceed with implementing the nearest neighbor matching approach to investigate the impact of the new online learning platform on student performance.

Here's an outline of the next steps:
1.	Preprocess the data: Ensure that the dataset is prepared for matching by encoding categorical variables and scaling numerical variables if necessary.
2.	Implement nearest neighbor matching: Use the sklearn.neighbors.NearestNeighbors class to identify nearest neighbors for each treated student based on their covariates.
3.	Perform matching: Match treated students with control students based on nearest neighbors.
4.	Assess balance: Evaluate the balance of covariates between the treated and control groups to ensure that they are comparable after matching.
5.	Analyze outcomes: Compare the outcomes (e.g., final exam scores) between the matched treated and control groups to estimate the impact of the new online learning platform.


# Preprocess data

Let's start by preprocessing the data so that it is ready for matching. The code below standardizes the numerical features and one-hot encodes the categorical features, dropping the first level of each categorical variable to avoid redundant columns. The result is a feature matrix X and an outcome vector y.


In [4]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Define numerical and categorical features
numerical_features = ['Age', 'High_School_GPA', 'SAT_Score']
categorical_features = ['Gender', 'Ethnicity', 'Socioeconomic_Status']

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ])

# Apply preprocessing pipeline to the data
X = preprocessor.fit_transform(data.drop(columns=['Student_ID', 'Final_Exam_Score']))
y = data['Final_Exam_Score'].values

# Display the preprocessed feature matrix and target vector
print(X)
print(y)

[[ 0.97824975 -1.49224731 -0.21922448 ...  0.          1.
   0.        ]
 [ 1.4686005   1.17306397  0.41133075 ...  0.          0.
   1.        ]
 [-1.47350401  0.36626609 -1.14548837 ...  1.          0.
   0.        ]
 ...
 [-0.00245175  0.70458447 -1.74125435 ...  1.          1.
   0.        ]
 [ 1.4686005   0.72442951 -1.3542239  ...  0.          0.
   0.        ]
 [-1.47350401 -1.11712338 -1.73255704 ...  0.          1.
   0.        ]]
[81 91 67 64 89 81 58 86 59 69 55 97 52 91 60 92 66 81 53 53 66 90 68 54
 81 71 55 84 71 50 69 71 97 60 62 91 52 58 61 72 63 82 67 98 90 58 72 59
 68 96 54 96 72 79 74 77 83 94 97 96 74 59 63 91 95 79 90 74 63 84 97 83
 59 84 60 60 70 70 71 94 56 66 51 61 91 67 60 65 98 67 84 75 65 55 93 92
 96 96 97 91 61 62 53 88 51 75 76 76 64 57 56 96 95 89 61 62 92 66 66 85
 70 57 54 75 65 85 77 78 96 61 86 79 72 65 63 59 56 94 86 78 52 64 85 96
 95 68 62 91 70 86 76 78 69 74 55 85 51 71 81 65 67 54 85 79 94 68 72 83
 62 92 78 85 71 96 85 58 88 68 87 55 96 64 69

# Nearest Neighbors

Now that we have preprocessed the data, we can proceed with implementing nearest neighbor matching to match treated students with control students based on their covariates.

We'll use the sklearn.neighbors.NearestNeighbors class to find the nearest neighbors for each treated student. Then, we'll use these nearest neighbors to perform matching.

In the code below:
•	We set n_neighbors = 1, so each student is matched with a single nearest neighbor.
•	We initialize a NearestNeighbors object with n_neighbors + 1 neighbors and fit it to the preprocessed data (X). The extra neighbor is needed because each query point is normally returned as its own closest neighbor, at distance 0.
•	We call the kneighbors method on X[:n_students]. Because n_students equals the number of rows, this queries every student: the dataset has no treatment indicator, so every student plays the role of a "treated" unit and the candidate matches are all other students. The indices array holds, for each student, its own index followed by the index of its nearest neighbor.


In [5]:
from sklearn.neighbors import NearestNeighbors

# Number of nearest neighbors to consider for matching
n_neighbors = 1  # Each treated student will be matched with their nearest neighbor

# Initialize NearestNeighbors object
nn = NearestNeighbors(n_neighbors=n_neighbors + 1)  # Include itself in the neighbors

# Fit NearestNeighbors model to the preprocessed data
nn.fit(X)

# Find nearest neighbors for treated students
distances, indices = nn.kneighbors(X[:n_students], n_neighbors=n_neighbors + 1)
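
When the data include a treatment indicator, a common approach is to fit the neighbor search on control units only and query it with the treated units, so that every returned neighbor is a control. A minimal sketch, assuming a hypothetical boolean `treated_mask` and randomly generated covariates `X_toy` (neither appears in the notebook):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical standardized covariates and a hypothetical treatment indicator
X_toy = rng.normal(size=(200, 3))
treated_mask = rng.random(200) < 0.4

X_treated = X_toy[treated_mask]
X_control = X_toy[~treated_mask]

# Fit on controls only, so every neighbor returned is a control unit
nn_control = NearestNeighbors(n_neighbors=1).fit(X_control)
match_distances, positions = nn_control.kneighbors(X_treated)

# Map positions within X_control back to row indices of X_toy
treated_rows = np.flatnonzero(treated_mask)
control_rows = np.flatnonzero(~treated_mask)
matched_control_rows = control_rows[positions[:, 0]]

print("Treated students:", len(treated_rows))
print("Distinct controls used:", len(np.unique(matched_control_rows)))
```

With this split there is no need to request an extra neighbor, since a treated unit can never be returned as its own match.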


# Matching

Now that we have identified the nearest neighbors, we can perform the matching. Each student is paired with their nearest neighbor, and that neighbor serves as the student's matched control. (The neighbors come from the same pool of students, since no separate control group is defined in the data.)

In the code below:
•	We initialize lists to store the indices of matched treated and control students.
•	We loop over the students and, for each one, take the second column of indices (the first column is the student itself) as the matched control.
•	We extract the features and outcomes for the matched treated and control students.
•	We print the indices of the matched treated and control students for verification.

After matching, we have pairs of treated and control students ready for further analysis.


In [6]:
# Initialize lists to store matched pairs
matched_treated_indices = []
matched_control_indices = []

# Match treated students with their nearest neighbor from the control group
for treated_index, control_index in zip(range(n_students), indices[:, 1:]):
    matched_treated_indices.append(treated_index)
    matched_control_indices.append(control_index[0])

# Extract matched features and outcomes
matched_X_treated = X[matched_treated_indices]
matched_X_control = X[matched_control_indices]
matched_y_treated = y[matched_treated_indices]
matched_y_control = y[matched_control_indices]

# Display the indices of matched treated and control students
print("Indices of matched treated students:", matched_treated_indices)
print("Indices of matched control students:", matched_control_indices)


Indices of matched treated students: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 21
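
Because each student's nearest neighbor is found independently, the same student can be selected as the match for several others (matching with replacement). A quick way to check how often controls are reused is sketched below; `example_control_indices` is a small hypothetical array standing in for `matched_control_indices`.

```python
import numpy as np

# Hypothetical matched control indices, one per treated student
example_control_indices = np.array([7, 3, 7, 9, 3, 7])

# Count how many times each control index appears among the matches
unique_controls, times_used = np.unique(example_control_indices, return_counts=True)
reuse = dict(zip(unique_controls.tolist(), times_used.tolist()))

print(reuse)  # {3: 2, 7: 3, 9: 1}
print("Distinct controls:", len(unique_controls))
```

Heavy reuse of a few controls reduces the effective sample size of the control group, which is worth keeping in mind when interpreting the estimate.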

# Assess Balance

Now that we have matched treated and control students, we need to assess the balance of covariates between the matched groups to ensure that they are comparable. This step is crucial to ensure that any differences in outcomes between the treated and control groups can be attributed to the treatment effect rather than confounding variables.

We can assess balance by comparing the distributions of covariates between the treated and control groups. Commonly used metrics include standardized mean differences (SMDs) and visual inspection of histograms or density plots.
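
For reference, the SMD of a single covariate, in the pooled-variance form that the standardized_mean_difference function in this notebook computes, can be written as:

$$
\mathrm{SMD} = \frac{\bar{x}_{T} - \bar{x}_{C}}{\sqrt{\left(s_{T}^{2} + s_{C}^{2}\right)/2}}
$$

where $\bar{x}_{T}$ and $\bar{x}_{C}$ are the covariate means in the treated and control groups, and $s_{T}^{2}$ and $s_{C}^{2}$ are the corresponding variances.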

The code below assesses balance using standardized mean differences:
•	We calculate the standardized mean differences separately for numerical and categorical features.
•	For numerical features, we iterate through preprocessor.transformers_[0][2], which holds the list of numerical column names.
•	For categorical features, we iterate through preprocessor.named_transformers_['cat'].get_feature_names_out() and add the index offset cat_features_idx_start, because the one-hot columns follow the numerical columns in X.
•	When displaying the results, we concatenate the two lists of standardized mean differences and print each value alongside its feature name.


In [7]:
# standardized mean differences
def standardized_mean_difference(group1, group2):
    """Calculate the standardized mean difference between two groups."""
    diff = np.mean(group1) - np.mean(group2)
    pooled_std = np.sqrt((np.var(group1) + np.var(group2)) / 2)
    return diff / pooled_std

# Calculate standardized mean differences for numerical features
smds_numerical = []
for feature_idx, feature_name in enumerate(preprocessor.transformers_[0][2]):
    smd = standardized_mean_difference(matched_X_treated[:, feature_idx], matched_X_control[:, feature_idx])
    smds_numerical.append(smd)

# Calculate standardized mean differences for categorical features
smds_categorical = []
cat_features_idx_start = len(preprocessor.transformers_[0][2])
for feature_idx, feature_name in enumerate(preprocessor.named_transformers_['cat'].get_feature_names_out()):
    smd = standardized_mean_difference(matched_X_treated[:, cat_features_idx_start + feature_idx], matched_X_control[:, cat_features_idx_start + feature_idx])
    smds_categorical.append(smd)


After calculating the standardized mean differences, we can inspect them to determine whether balance has been achieved. An absolute standardized mean difference below 0.1 is generally considered indicative of good balance.

In [8]:
# Display standardized mean differences for each covariate
for feature_name, smd in zip(preprocessor.transformers_[0][2] + list(preprocessor.named_transformers_['cat'].get_feature_names_out()), smds_numerical + smds_categorical):
    print(f"Standardized mean difference for {feature_name}: {smd}")

Standardized mean difference for Age: -0.0009885783299738143
Standardized mean difference for High_School_GPA: -0.011039412870220177
Standardized mean difference for SAT_Score: 0.003567050344670093
Standardized mean difference for Gender_Male: -0.00200036209831266
Standardized mean difference for Ethnicity_Black: 0.0022869856430335796
Standardized mean difference for Ethnicity_Hispanic: -0.0023567640287536166
Standardized mean difference for Ethnicity_White: -0.0022755646821052775
Standardized mean difference for Socioeconomic_Status_Low: -0.006349074822465663
Standardized mean difference for Socioeconomic_Status_Medium: -0.0021000642261963134


# Interpreting the SMDs

•	An SMD close to 0 indicates good balance between the matched treated and control groups for that covariate.
•	An SMD with absolute value greater than 0.1 (that is, above 0.1 or below -0.1) suggests potential imbalance that warrants further investigation.

Based on the SMDs calculated above:
1.	Age: SMD is close to 0 (-0.00099), indicating good balance.
2.	High_School_GPA: SMD is close to 0 (-0.011), indicating good balance.
3.	SAT_Score: SMD is close to 0 (0.00357), indicating good balance.
4.	Gender_Male: SMD is close to 0 (-0.002), indicating good balance.
5.	Ethnicity_Black: SMD is close to 0 (0.00229), indicating good balance.
6.	Ethnicity_Hispanic: SMD is close to 0 (-0.00236), indicating good balance.
7.	Ethnicity_White: SMD is close to 0 (-0.00228), indicating good balance.
8.	Socioeconomic_Status_Low: SMD is close to 0 (-0.00635), indicating good balance.
9.	Socioeconomic_Status_Medium: SMD is close to 0 (-0.0021), indicating good balance.

Overall, the SMDs indicate good balance between the matched treated and control groups for all covariates: every absolute value is well below 0.1. Bear in mind that the treated group here is the full sample and the matched controls are drawn from that same sample, so near-zero SMDs are to be expected. With a genuine treatment indicator, balance would need to be checked between distinct treated and control students before concluding that matching has reduced confounding bias.
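
With many covariates it can be convenient to flag imbalanced ones programmatically rather than by eye. A small sketch follows; the values in `smd_values` are hypothetical (the `SAT_Score` entry is deliberately set above the threshold), and in the notebook they would come from smds_numerical + smds_categorical.

```python
import pandas as pd

# Hypothetical SMDs keyed by covariate name
smd_values = pd.Series({
    "Age": -0.001,
    "High_School_GPA": -0.011,
    "SAT_Score": 0.15,
})

# Flag covariates whose absolute SMD exceeds the conventional 0.1 threshold
threshold = 0.1
imbalanced = smd_values[smd_values.abs() > threshold]

print(imbalanced)
```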


# Estimate the treatment effect

Now that we have assessed the balance between the matched treated and control groups, we can proceed with analyzing outcomes to estimate the treatment effect.

To do this, we'll compare the outcomes (dependent variable) between the matched treated and control groups. Commonly used methods for estimating treatment effects include calculating the average treatment effect (ATE), average treatment effect on the treated (ATT), or conducting hypothesis tests.

In the code below:
•	We calculate the mean outcome for the matched treated and control groups using np.mean().
•	We calculate the treatment effect as the difference between the mean outcome for the treated group and the mean outcome for the control group. Because controls are matched to treated students, this difference is, strictly speaking, an estimate of the average treatment effect on the treated (ATT); the code labels it ATE.
•	We print the mean outcomes and the calculated treatment effect.

After running this code, we have the mean outcomes for both groups and the estimated treatment effect.


In [9]:
# Calculate the mean outcome for the matched treated and control groups
mean_outcome_treated = np.mean(matched_y_treated)
mean_outcome_control = np.mean(matched_y_control)

# Calculate the treatment effect (ATE)
ate = mean_outcome_treated - mean_outcome_control

# Print the results
print("Mean outcome for treated group:", mean_outcome_treated)
print("Mean outcome for control group:", mean_outcome_control)
print("Average Treatment Effect (ATE):", ate)


Mean outcome for treated group: 74.819
Mean outcome for control group: 74.795
Average Treatment Effect (ATE): 0.02400000000000091


With the mean outcomes for the treated and control groups and an estimated Average Treatment Effect (ATE) of approximately 0.024, we can make the following observations:
1.	Negligible Treatment Effect: The ATE of 0.024 points is very small relative to the spread of the outcome (Final_Exam_Score has a standard deviation of about 14.2 and ranges from 50 to 99). No hypothesis test or confidence interval was computed, so statistical significance cannot be claimed.
2.	Similar Mean Outcomes: The mean outcome for the treated group (74.819) is only marginally higher than that of the control group (74.795). This is what we would expect here, since the exam scores were generated at random, independently of the covariates and of any treatment.
3.	Practical Significance: A difference of this size is unlikely to matter in practice. With real data, the practical implications would depend on the context of the study and the scale of the outcome variable.
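
To attach some uncertainty to an estimate like this, one simple option is a normal-approximation confidence interval on the paired differences. The sketch below uses hypothetical outcome arrays `y_treated` and `y_control` drawn from the same distribution (so there is no real effect); it is an illustration under those assumptions, not a result from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical matched outcomes: both groups drawn from the same distribution
y_treated = rng.integers(50, 100, size=1000).astype(float)
y_control = rng.integers(50, 100, size=1000).astype(float)

# Paired differences, their mean, and the standard error of that mean
diffs = y_treated - y_control
estimate = diffs.mean()
std_error = diffs.std(ddof=1) / np.sqrt(len(diffs))

# Approximate 95% confidence interval (normal approximation)
ci_low = estimate - 1.96 * std_error
ci_high = estimate + 1.96 * std_error

print(f"Estimate: {estimate:.3f}, 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

This treats the matched pairs as independent. When controls are reused across pairs, the standard error computed this way is likely too small, so the interval should be read as a rough guide only.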

Additional analyses to explore could include:
•	Subgroup Analysis: Investigate whether the treatment effect varies across different subgroups of the population. This can help identify any differential effects of the treatment based on demographic or other characteristics.
•	Sensitivity Analysis: Perform sensitivity analysis to assess the robustness of the treatment effect estimate to different modeling assumptions or methodological choices. This can enhance the credibility of the findings and provide insights into the stability of the results.
•	Longitudinal Analysis: If the data is longitudinal, consider analyzing the trajectory of outcomes over time to understand the dynamics of the treatment effect and its persistence or attenuation over time.
•	Causal Mediation Analysis: Explore potential mechanisms through which the treatment affects the outcome by conducting causal mediation analysis. This can help uncover the underlying pathways through which the treatment operates to produce its effects.
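
As an illustration of the subgroup analysis mentioned above, the matched pairs can be grouped by a covariate and the mean difference computed within each group. The `pairs` DataFrame and its column names are hypothetical; in practice it would be built from the matched treated and control indices.

```python
import pandas as pd

# Hypothetical matched pairs: one row per treated student,
# with the matched control's score alongside
pairs = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "treated_score": [80, 75, 90, 60],
    "control_score": [78, 70, 85, 66],
})

# Within-pair difference, then its mean and count per subgroup
pairs["diff"] = pairs["treated_score"] - pairs["control_score"]
subgroup_effects = pairs.groupby("Gender")["diff"].agg(["mean", "count"])

print(subgroup_effects)
```

Subgroup estimates rest on fewer pairs than the overall estimate, so they are noisier and should be interpreted with corresponding caution.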

These additional analyses can provide a deeper understanding of the treatment effect and its implications, helping to inform decision-making and further research efforts. 