# Effectiveness of a new traffic regulation 

### Quasi-experiment
In a quasi-experiment, researchers do not have complete control over the assignment of subjects to treatment and control groups, but they still attempt to emulate some aspects of an experimental design. In this case, the researchers are evaluating the effectiveness of a new traffic regulation on accident rates using data from before and after the policy implementation.

### Project Idea: 
Evaluate the effectiveness of a new traffic regulation on accident rates using data from before and after the policy implementation in different regions. 

### Methodology: 
Apply the Difference-in-Differences (DiD) approach to estimate the treatment effect of the new traffic regulation. DiD compares changes in outcomes (accident rates) over time between a treatment group (regions where the regulation was implemented) and a control group (regions where it was not). Subtracting the control group's before-after change from the treatment group's before-after change removes time trends common to both groups and fixed differences between regions. The remaining difference can be read as the causal effect of the policy, provided the two groups would have followed parallel trends in its absence.
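As a compact statement of this comparison, the DiD estimate can be written in terms of four group means, and equivalently as the interaction coefficient in a regression. The notation is an assumption introduced here for illustration: $\bar{Y}_{g,p}$ is the mean accident rate in group $g$ (T for treatment, C for control) during period $p$, and $\mathrm{Treated}_r$ and $\mathrm{Post}_t$ are 0/1 indicators for region $r$ and year $t$.

```latex
\begin{aligned}
\hat{\delta}_{\mathrm{DiD}}
  &= \left(\bar{Y}_{T,\mathrm{post}} - \bar{Y}_{T,\mathrm{pre}}\right)
   - \left(\bar{Y}_{C,\mathrm{post}} - \bar{Y}_{C,\mathrm{pre}}\right) \\
Y_{rt}
  &= \alpha + \beta\,\mathrm{Treated}_r + \gamma\,\mathrm{Post}_t
   + \delta\,(\mathrm{Treated}_r \times \mathrm{Post}_t) + \varepsilon_{rt}
\end{aligned}
```

In the regression form, $\beta$ captures the fixed difference between groups, $\gamma$ the common time shift, and $\delta$ the DiD effect.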

### Data Collection: 
Data would be collected on accident rates before and after the policy implementation in both treatment and control regions. Additionally, data on other relevant factors such as traffic volume, road conditions, and weather patterns may also be collected to control for potential confounding variables.
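One possible layout for such data is a region-by-year panel with one row per region and year. The sketch below is only illustrative: the column names (Treated, Traffic_Volume), the placeholder values, and the helper `is_balanced` are assumptions, not part of the original project.

```python
import pandas as pd

# Hypothetical column names and placeholder values, for illustration only.
panel = pd.DataFrame({
    "Region": ["A", "A", "B", "B"],
    "Year": [2014, 2015, 2014, 2015],
    "Treated": [1, 1, 0, 0],            # 1 = region received the regulation
    "Accident_Rate": [10.2, 8.1, 9.7, 9.5],
    "Traffic_Volume": [1200, 1250, 900, 910],  # example covariate
})

def is_balanced(df, unit="Region", time="Year"):
    """Return True if every unit is observed exactly once in every period."""
    counts = df.groupby([unit, time]).size().unstack(fill_value=0)
    return bool((counts == 1).all().all())

print(is_balanced(panel))
```

Checking that the panel is balanced before estimation helps catch missing region-years that would otherwise distort the before-after comparison.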

### Causal Effect Estimation: 
By applying the DiD approach, researchers aim to estimate the causal effect of the new traffic regulation on accident rates while controlling for time trends and regional differences. This approach helps to provide more robust estimates of the policy's effectiveness compared to simple before-after comparisons.
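To make the contrast with a simple before-after comparison concrete, here is a small worked example. The four group means are hypothetical numbers chosen for easy arithmetic, and the function names are assumptions introduced for this sketch.

```python
# Hypothetical group means (accident rates); illustrative values only.
means = {
    ("treated", "pre"): 10.0,
    ("treated", "post"): 7.0,
    ("control", "pre"): 9.0,
    ("control", "post"): 8.0,
}

def before_after(means, group):
    """Change in the mean outcome from the pre period to the post period."""
    return means[(group, "post")] - means[(group, "pre")]

def did(means):
    """Treated group's change minus the control group's change."""
    return before_after(means, "treated") - before_after(means, "control")

naive_estimate = before_after(means, "treated")  # ignores the common time trend
did_estimate = did(means)                        # removes the common time trend

print("Before-after (treated only):", naive_estimate)
print("Difference-in-differences:  ", did_estimate)
```

In this example the treated group's rate falls by 3 units, but 1 unit of that decline also appears in the control group, so the DiD estimate attributes only 2 units to the policy.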

# Create data set

To create a dataset for evaluating the effectiveness of a new traffic regulation on accident rates using the Difference-in-Differences (DiD) approach, we'll generate synthetic data representing accident rates before and after the policy implementation in different regions. The dataset will include variables such as region identifiers, time periods, accident rates, and potentially other relevant factors.

The Python code that generates the synthetic dataset appears after the following notes.

In this synthetic dataset:

•	Region represents the region identifier.

•	Year represents the year of observation.

•	Accident_Rate represents the accident rate in each region for each year.

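This dataset simulates accident rates before and after the policy implementation in five regions. Note that the generator lowers the mean accident rate from 10 to 8 in every region after 2014, so all regions are effectively treated and the data contain no true control group. This limits what the DiD analysis can show, because any contrast between regions reflects random noise rather than a policy effect.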


In [3]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data for different regions and time periods
n_regions = 5
n_years_before = 5
n_years_after = 5

# Generate region identifiers
regions = ['Region' + str(i) for i in range(1, n_regions + 1)]

# Generate years before and after policy implementation
years_before = np.arange(2010, 2010 + n_years_before)
years_after = np.arange(2010 + n_years_before, 2010 + n_years_before + n_years_after)

# Generate accident rates before and after policy implementation for each region
data = []
for region in regions:
    for year in years_before:
        # Simulate accident rates before policy implementation
        accident_rate = np.random.normal(loc=10, scale=2, size=1)[0]
        data.append([region, year, accident_rate])
    for year in years_after:
        # Simulate accident rates after policy implementation (with potential decrease)
        accident_rate = np.random.normal(loc=8, scale=2, size=1)[0]
        data.append([region, year, accident_rate])

# Create a DataFrame to store the synthetic dataset
columns = ['Region', 'Year', 'Accident_Rate']
accident_data = pd.DataFrame(data, columns=columns)

# Display the first few rows of the synthetic dataset
print(accident_data.head())

# Save the synthetic dataset to a CSV file
accident_data.to_csv('synthetic_accident_data.csv', index=False)


    Region  Year  Accident_Rate
0  Region1  2010      10.993428
1  Region1  2011       9.723471
2  Region1  2012      11.295377
3  Region1  2013      13.046060
4  Region1  2014       9.531693
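Because DiD needs regions that did not receive the regulation, a variant of the generator with an explicit control group may be useful. The following is a minimal sketch under stated assumptions: the treated set (Region1 and Region2), the column names Treated and Post, the effect size of -2, and the variable name `did_data` are all hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

regions = [f"Region{i}" for i in range(1, 6)]
treated_regions = {"Region1", "Region2"}  # assumed treatment assignment
years = range(2010, 2020)
policy_year = 2015                        # first post-policy year
true_effect = -2.0                        # assumed effect on the accident rate

rows = []
for region in regions:
    treated = int(region in treated_regions)
    for year in years:
        post = int(year >= policy_year)
        # Only treated regions in the post period receive the effect.
        mean_rate = 10 + true_effect * treated * post
        rows.append([region, year, treated, post, rng.normal(mean_rate, 2)])

did_data = pd.DataFrame(
    rows, columns=["Region", "Year", "Treated", "Post", "Accident_Rate"]
)
print(did_data.head())
```

With this layout, the product of Treated and Post marks exactly the observations exposed to the regulation.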


# Difference-in-Differences (DiD)

To evaluate the effectiveness of a new traffic regulation on accident rates using the Difference-in-Differences (DiD) approach in Python, we'll follow these steps:

1.	Prepare the dataset by loading the synthetic accident data.

2.	Define the treatment and control groups based on regions and time periods.

3.	Calculate the DiD estimator to estimate the treatment effect.

4.	Perform statistical tests to assess the significance of the estimated treatment effect.

The Python code for these steps follows the notes below.

In this code:

•	We load the synthetic accident data containing accident rates, regions, and years.

•	We create a binary indicator variable (Post_Treatment) to identify the post-treatment period (years after 2014).

•	We one-hot encode the Region variable, dropping Region1 so that it serves as the reference category.

•	We create a DiD interaction term by multiplying Post_Treatment by the Region2 dummy (Region_Region2). This treats Region2 as the treated region and the remaining regions as controls.

•	We specify a regression model where the dependent variable is the accident rate, and the independent variables are the post-treatment indicator (Post_Treatment), the region dummies, and the DiD interaction term (DiD).

•	We fit an Ordinary Least Squares (OLS) regression model using the statsmodels library.

•	Finally, we print the regression results to examine the estimated coefficients, standard errors, t-statistics, and p-values.

By running this code, we can estimate the treatment effect of the new traffic regulation on accident rates while accounting for time trends and regional differences using the DiD approach. The regression results will provide insights into the effectiveness of the policy intervention in reducing accident rates.
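For comparison, the textbook two-group DiD regression can be sketched with the statsmodels formula interface. This is not the notebook's own specification: the data are generated inside the block, and the Treated and Post columns, the treated set (Region1 and Region2), and the effect size are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical panel: Region1-2 treated, Region3-5 control, policy starts in 2015.
rows = []
for i in range(1, 6):
    treated = int(i <= 2)
    for year in range(2010, 2020):
        post = int(year >= 2015)
        rate = 10 - 2.0 * treated * post + rng.normal(0, 2)
        rows.append((f"Region{i}", year, treated, post, rate))
df = pd.DataFrame(rows, columns=["Region", "Year", "Treated", "Post", "Accident_Rate"])

# "Treated * Post" expands to Treated + Post + Treated:Post.
results = smf.ols("Accident_Rate ~ Treated * Post", data=df).fit()
did_coef = results.params["Treated:Post"]

# The same number computed directly from the four group means.
m = df.groupby(["Treated", "Post"])["Accident_Rate"].mean()
did_means = (m.loc[(1, 1)] - m.loc[(1, 0)]) - (m.loc[(0, 1)] - m.loc[(0, 0)])

print(f"DiD from regression:  {did_coef:.4f}")
print(f"DiD from group means: {did_means:.4f}")
```

The two printed values agree because this model has one parameter per group-period cell, so the interaction coefficient reproduces the difference of differences in means.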


In [5]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Load the synthetic accident data
accident_data = pd.read_csv('synthetic_accident_data.csv')

# Create a treatment indicator variable for the post-treatment period
accident_data['Post_Treatment'] = (accident_data['Year'] > 2014).astype(int)

# The DiD interaction term is built further below, after Region has been
# one-hot encoded; multiplying Post_Treatment by the string-valued Region
# column would not produce a usable numeric regressor.

# Specify the regression model
# Include only relevant variables
X = accident_data[['Post_Treatment', 'Region']]  

# One-hot encoding for Region variable
X = pd.get_dummies(X, columns=['Region'], drop_first=True)  

# Calculate the DiD interaction term, treating Region2 as the treated region
# (Region1 is the reference category dropped by drop_first=True)
X['DiD'] = X['Post_Treatment'] * X['Region_Region2']  

# Add a constant term
X = sm.add_constant(X)  

# Convert y to numeric
y = pd.to_numeric(accident_data['Accident_Rate'], errors='coerce')

y

0     10.993428
1      9.723471
2     11.295377
3     13.046060
4      9.531693
5      7.531726
6     11.158426
7      9.534869
8      7.061051
9      9.085120
10     9.073165
11     9.068540
12    10.483925
13     6.173440
14     6.550164
15     6.875425
16     5.974338
17     8.628495
18     6.183952
19     5.175393
20    12.931298
21     9.548447
22    10.135056
23     7.150504
24     8.911235
25     8.221845
26     5.698013
27     8.751396
28     6.798723
29     7.416613
30     8.796587
31    13.704556
32     9.973006
33     7.884578
34    11.645090
35     5.558313
36     8.417727
37     4.080660
38     5.343628
39     8.393722
40    11.476933
41    10.342737
42     9.768703
43     9.397793
44     7.042956
45     6.560312
46     7.078722
47    10.114244
48     8.687237
49     4.473920
Name: Accident_Rate, dtype: float64

In [6]:
# Drop rows with missing values
X = X.dropna()
y = y[X.index]  # Ensure y is aligned with X after dropping rows

X

   const  Post_Treatment  Region_Region2  Region_Region3  Region_Region4  Region_Region5  DiD
0    1.0               0           False           False           False           False    0
1    1.0               0           False           False           False           False    0
2    1.0               0           False           False           False           False    0
3    1.0               0           False           False           False           False    0
4    1.0               0           False           False           False           False    0
5    1.0               1           False           False           False           False    0
6    1.0               1           False           False           False           False    0
7    1.0               1           False           False           False           False    0
8    1.0               1           False           False           False           False    0
9    1.0               1           False           False           False           False    0


# Check data types

In [10]:
# Check the data types of X and y
print(X.dtypes)
print(y.dtype)

# Convert X to numeric
X = X.apply(pd.to_numeric, errors='coerce')

# Check for any remaining non-numeric data in X
print(X.select_dtypes(include=['object']))

# Convert y to numeric
y = pd.to_numeric(y, errors='coerce')

# Check for any remaining missing values in y
print(y.isnull().sum())

# Drop rows with missing values
X = X.dropna()

# Ensure y is aligned with X after dropping rows
y = y[X.index]  

y

const             float64
Post_Treatment      int32
Region_Region2       bool
Region_Region3       bool
Region_Region4       bool
Region_Region5       bool
DiD                 int32
dtype: object
float64
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
0



# Convert data types

In this code:

•	We convert the boolean columns (Region_Region2, Region_Region3, Region_Region4, Region_Region5) to integers using the astype(int) method, which maps True to 1 and False to 0.

This adjustment ensures that all variables in the X matrix are numeric before the model is fitted.


In [None]:
# Convert boolean columns to integer
X['Region_Region2'] = X['Region_Region2'].astype(int)
X['Region_Region3'] = X['Region_Region3'].astype(int)
X['Region_Region4'] = X['Region_Region4'].astype(int)
X['Region_Region5'] = X['Region_Region5'].astype(int)
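The four conversions above can also be avoided at encoding time. A minimal sketch, assuming a small hypothetical frame with the same column names: `pd.get_dummies` accepts a `dtype` argument, so the dummy columns can be created as integers directly.

```python
import pandas as pd

# Hypothetical four-row frame using the notebook's column names.
df = pd.DataFrame({
    "Post_Treatment": [0, 1, 0, 1],
    "Region": ["Region1", "Region1", "Region2", "Region2"],
})

# dtype=int yields 0/1 integer dummies instead of booleans.
X_int = pd.get_dummies(df, columns=["Region"], drop_first=True, dtype=int)
print(X_int.dtypes)
```

This keeps the design matrix numeric from the start, so no separate conversion step is needed before fitting.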

# DiD regression model

In this code:

•	We fit the DiD regression model by Ordinary Least Squares, using the all-numeric X matrix and the y variable.

•	Finally, we print the regression results.

In [11]:
# Fit the DiD regression model
model = sm.OLS(y, X)
results = model.fit()

# Print regression results
print(results.summary())


                            OLS Regression Results                            
Dep. Variable:          Accident_Rate   R-squared:                       0.448
Model:                            OLS   Adj. R-squared:                  0.371
Method:                 Least Squares   F-statistic:                     5.820
Date:                Wed, 03 Apr 2024   Prob (F-statistic):           0.000167
Time:                        20:00:26   Log-Likelihood:                -95.770
No. Observations:                  50   AIC:                             205.5
Df Residuals:                      43   BIC:                             218.9
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const             11.2295      0.626     17.

# Interpretation

To interpret and communicate the results of the DiD regression analysis to a general audience, we can follow these steps:
1.	Overall Model Fit:
•	R-squared: The R-squared value (0.448) indicates that approximately 44.8% of the variability in the accident rate is explained by the variables included in the model. This suggests that the model provides a moderate level of explanation for the variation in accident rates.
2.	Coefficients:
•	Post_Treatment: The coefficient for the Post_Treatment variable (-2.6667) indicates that, in the regions other than Region2 (where the DiD term is zero), the accident rate is on average about 2.67 units lower after 2014 than before. On its own this is a before-after comparison, not a causal estimate.
•	Region Variables (Region_Region2 to Region_Region5): These coefficients represent differences in accident rates relative to the reference region (Region1). For example, the coefficient for Region_Region2 (-2.9596) suggests that Region2's accident rate was about 2.96 units below the Region1 baseline. Because the model also includes the DiD term for Region2, this difference refers to the pre-treatment period.
•	DiD: The coefficient for the DiD interaction term (0.9643) is the DiD estimate: the difference between Region2's before-after change and the before-after change in the other regions. Its positive sign means that the accident rate in Region2 fell by about 0.96 units less than elsewhere. It is not statistically significant at the conventional significance level (p > 0.05), so the data give no evidence that Region2's change differed from that of the other regions. This is expected here, because the synthetic data apply the same reduction to every region.
3.	Statistical Significance:
•	P-values: The p-values associated with each coefficient indicate the statistical significance of the estimated effects. A p-value less than the chosen significance level (e.g., 0.05) suggests that the effect is statistically significant. In this case, the Post_Treatment variable and some of the Region variables have statistically significant effects on the accident rate.
4.	Confidence Intervals:
•	Confidence intervals: The 95% confidence intervals provide a range of values within which we can be reasonably confident that the true population parameter lies. For example, the confidence interval for the Post_Treatment coefficient (-3.796, -1.537) indicates that we are 95% confident that the true post-period change lies within this interval. The interval excludes zero, which is consistent with the coefficient being statistically significant at the 5% level.
5.	Model Assumptions:
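•	The causal interpretation of a DiD estimate rests on the parallel trends assumption: without the regulation, accident rates in the treated and control regions would have changed by the same amount. The model also assumes that no other unobserved factors affected the groups differently at the time of the policy, and that the relationships between the variables and the accident rate are linear and additive.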
When communicating these results to a general audience, it's important to use plain language and avoid technical jargon. Additionally, providing visual aids such as charts or graphs can help illustrate the key findings and make the interpretation more accessible.


# Follow up

For the project of evaluating the effectiveness of a new traffic regulation on accident rates using the Difference-in-Differences (DiD) approach, there might be further steps to consider depending on the context and objectives of the analysis:
1.	Sensitivity Analysis: Perform sensitivity analysis to assess the robustness of the results to different model specifications, control variables, or assumptions.
2.	Additional Controls: Consider including additional control variables in the regression model to account for other factors that could affect accident rates, such as road conditions, weather, population density, etc.
3.	Subgroup Analysis: Explore whether the treatment effect varies across different subgroups (e.g., age groups, time periods, types of accidents) by conducting subgroup analyses.
4.	Time Trend Analysis: Investigate the presence of time trends in accident rates before and after the treatment implementation to assess whether the treatment effect is consistent over time.
5.	Policy Implications: Discuss the policy implications of the findings and provide recommendations for policymakers based on the estimated treatment effects.
6.	Communication: Prepare a clear and concise report summarizing the analysis, key findings, and conclusions in a format accessible to policymakers and stakeholders.
7.	Peer Review: Consider submitting the analysis for peer review to ensure the robustness and validity of the findings.
Overall, while the regression analysis provides valuable insights into the impact of the traffic regulation on accident rates, further analysis and interpretation may be necessary to fully understand the implications and make informed decisions.
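As one way to start on the time trend analysis suggested above, the pre-policy gap between treated and control groups can be inspected year by year. This is a rough sketch on hypothetical, noise-free data constructed so that trends are parallel; the group labels, values, and variable names are assumptions, and with real data the slope would be judged against its sampling variability.

```python
import numpy as np
import pandas as pd

policy_year = 2015  # matches the notebook's cutoff (post period = years after 2014)

# Hypothetical pre-period data: both groups rise by 0.1 per year and the
# treated group starts 1.0 higher, so trends are parallel by construction.
rows = []
for year in range(2010, policy_year):
    rows.append(("treated", year, 11.0 + 0.1 * (year - 2010)))
    rows.append(("control", year, 10.0 + 0.1 * (year - 2010)))
pre = pd.DataFrame(rows, columns=["Group", "Year", "Accident_Rate"])

# Mean outcome per year and group, then the treated-minus-control gap.
means = pre.pivot_table(index="Year", columns="Group", values="Accident_Rate")
gap = means["treated"] - means["control"]

# Slope of the gap over centered years; a slope near zero is consistent
# with parallel pre-trends, while a clear drift would cast doubt on them.
years_centered = gap.index.to_numpy() - gap.index.to_numpy().mean()
gap_slope = np.polyfit(years_centered, gap.to_numpy(), deg=1)[0]
print(f"Slope of the pre-period gap: {gap_slope:.4f}")
```

A stable gap before the policy does not prove that trends would have stayed parallel afterward, but a drifting gap is a warning sign for the DiD estimate.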
