# Load Packages, Functions, and Data

In [30]:
# load packages 
library(data.table)
library(foreign)
library(lmtest)
library(sandwich)
library(multiwayvcov)
library(stargazer)

In [31]:
# read in our data
d <- read.csv("..\\data\\Slow_Kids_Data_Collection_Blocks.csv", header = TRUE)

# Data Exploration

In [32]:
#Explore the data
summary(d)

     Speed         Direction     Treatment         SameSide     
 Min.   :15.00   S      :150   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:24.00   N      :113   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :27.00   w      :106   Median :1.0000   Median :1.0000  
 Mean   :27.05   e      : 85   Mean   :0.5093   Mean   :0.5222  
 3rd Qu.:30.00   W      : 69   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :44.00   E      : 65   Max.   :1.0000   Max.   :1.0000  
                 (Other):111                                    
    Children        Pedestrians         Hour      Visibility  Location
 Min.   :0.00000   Min.   :0.000   Min.   :1400   Clear:699   A:111   
 1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:1500               B:191   
 Median :0.00000   Median :0.000   Median :1520               C: 48   
 Mean   :0.01431   Mean   :0.186   Mean   :1577               D:104   
 3rd Qu.:0.00000   3rd Qu.:0.000   3rd Qu.:1700               E:159   
 Max.   :1.00000   Max.   :1.000   Max.   :1940       

# Covariate Balance Check

In [33]:
# check if our randomization worked
no_cov <-lm(d$Treatment ~ 1, data = d)
cov_check <- lm(d$Treatment ~ 1 + d$Direction2 + d$Block + d$SameSide + d$Children + d$Pedestrians + as.factor(d$DayofWeek))

summary(cov_check)


Call:
lm(formula = d$Treatment ~ 1 + d$Direction2 + d$Block + d$SameSide + 
    d$Children + d$Pedestrians + as.factor(d$DayofWeek))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8734 -0.4660  0.1266  0.4401  0.8742 

Coefficients: (4 not defined because of singularities)
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              0.42047    0.09765   4.306 1.91e-05 ***
d$Direction2n            0.10468    0.12162   0.861  0.38968    
d$Direction2s            0.06581    0.12144   0.542  0.58802    
d$Direction2w            0.11724    0.05613   2.089  0.03711 *  
d$BlockCA-C-0717-1940   -0.02038    0.14537  -0.140  0.88857    
d$BlockCA-D-0718-1710    0.01175    0.13156   0.089  0.92886    
d$BlockCA-D-0718-1740    0.06508    0.11418   0.570  0.56889    
d$BlockCA-D-0718-1810   -0.25277    0.13187  -1.917  0.05569 .  
d$BlockCA-D-0718-1840    0.03152    0.11993   0.263  0.79278    
d$BlockCA-D-0718-1910   -0.03574    0.13795  -0.259  0.79563    
d

In [34]:
anova(no_cov, cov_check, test = "LRT")

Res.Df,RSS,Df,Sum of Sq,Pr(>Chi)
698,174.6896,,,
676,153.7438,22.0,20.94572,1.503258e-10


It looks like certain blocks may indicate an imbalance in randomization. What's going on in these blocks?

In [35]:
# check distribution of treatment vs control for significant blocks
table(d$Treatment[d$Block == "NJ-E-0722-1520"])
table(d$Treatment[d$Block == "NJ-E-0722-1540"])
table(d$Treatment[d$Block == "NJ-F-0722-1620"])
table(d$Treatment[d$Block == "NJ-F-0722-1640"])


 0  1 
26  5 


 0  1 
 6 27 


 0  1 
 5 22 


 0  1 
 8 26 

While it's true that these blocks are pretty imbalanced, because we are using clusters, which should take care of accounting for the differences in std dev, this is likely not an issue.

Note that our direction West also has the same issue, but directions are also loosely linked to each location (and thereby each block), so the same justification should work.

# Randomization Check

In [36]:
# check if our randomization worked by trying to predict a variable we think should be random
rand_check <- lm(d$SameSide ~ d$Treatment + d$Direction2 + d$Block + d$Pedestrians + d$Children + as.factor(d$DayofWeek))

summary(rand_check)


Call:
lm(formula = d$SameSide ~ d$Treatment + d$Direction2 + d$Block + 
    d$Pedestrians + d$Children + as.factor(d$DayofWeek))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.8700 -0.4509  0.2230  0.3742  1.0212 

Coefficients: (4 not defined because of singularities)
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)              0.282983   0.097868   2.891  0.00396 ** 
d$Treatment             -0.023128   0.038254  -0.605  0.54566    
d$Direction2n            0.204919   0.120803   1.696  0.09029 .  
d$Direction2s            0.342777   0.120121   2.854  0.00446 ** 
d$Direction2w            0.456812   0.053200   8.587  < 2e-16 ***
d$BlockCA-C-0717-1940   -0.206757   0.144406  -1.432  0.15267    
d$BlockCA-D-0718-1710   -0.037011   0.130878  -0.283  0.77743    
d$BlockCA-D-0718-1740    0.071817   0.113585   0.632  0.52742    
d$BlockCA-D-0718-1810    0.104384   0.131493   0.794  0.42757    
d$BlockCA-D-0718-1840    0.120097   0.119233   1.007  0.3141

Based on our randomization effect, it looks like our treatment variable was truly randomized as it could not predict a variable that should be random (SameSide). Note that we do get statistically significant prediction by using direction, but this is expected as these two variables are related to each other (i.e. the sign was placed on a particular side of the street, so one direction of travel will be on the same side). We also ran the check again without the direction variable, which confirmed that we cannot predict this variable with our treatment variable.

In [37]:
rand_check2 <- lm(d$SameSide ~ d$Treatment + d$Block + d$Pedestrians + d$Children + as.factor(d$DayofWeek))

summary(rand_check2)


Call:
lm(formula = d$SameSide ~ d$Treatment + d$Block + d$Pedestrians + 
    d$Children + as.factor(d$DayofWeek))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6871 -0.5188  0.3210  0.4603  0.7036 

Coefficients: (3 not defined because of singularities)
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)              0.506810   0.099845   5.076 4.99e-07 ***
d$Treatment             -0.001377   0.040295  -0.034    0.973    
d$BlockCA-C-0717-1940   -0.136669   0.152388  -0.897    0.370    
d$BlockCA-D-0718-1710    0.001922   0.152887   0.013    0.990    
d$BlockCA-D-0718-1740    0.124339   0.134891   0.922    0.357    
d$BlockCA-D-0718-1810    0.172228   0.152920   1.126    0.260    
d$BlockCA-D-0718-1840    0.181659   0.140343   1.294    0.196    
d$BlockCA-D-0718-1910    0.136890   0.157981   0.867    0.387    
d$BlockNJ-E-0722-1500    0.001158   0.118736   0.010    0.992    
d$BlockNJ-E-0722-1520   -0.022282   0.132141  -0.169    0.866    
d$BlockNJ

# Cluster Analysis with Blocks

In [38]:
fit <- lm(d$Speed ~ d$Treatment + d$Direction2 + d$Block + d$SameSide + d$Children + d$Pedestrians + as.factor(d$DayofWeek), d)
fit$cluster.vcov <- cluster.vcov(fit, ~ d$Cluster)

fit_cl <- coeftest(fit, fit$cluster.vcov)
fit_cl


t test of coefficients:

                      Estimate Std. Error t value  Pr(>|t|)    
(Intercept)           28.42843    0.89032 31.9307 < 2.2e-16 ***
d$Treatment           -1.55878    0.30999 -5.0285 6.342e-07 ***
d$Direction2n         -0.86971    1.13816 -0.7641   0.44505    
d$Direction2s         -1.11976    1.25625 -0.8913   0.37306    
d$Direction2w         -0.85460    0.62848 -1.3598   0.17435    
d$BlockCA-C-0717-1940 -0.82048    0.81719 -1.0040   0.31573    
d$BlockCA-D-0718-1710  1.73614    1.64797  1.0535   0.29249    
d$BlockCA-D-0718-1740  2.09897    0.84853  2.4737   0.01362 *  
d$BlockCA-D-0718-1810  1.94486    1.09092  1.7828   0.07507 .  
d$BlockCA-D-0718-1840  0.63291    1.32914  0.4762   0.63410    
d$BlockCA-D-0718-1910  0.46190    1.34813  0.3426   0.73199    
d$BlockNJ-E-0722-1500 -0.53066    1.22745 -0.4323   0.66564    
d$BlockNJ-E-0722-1520  2.19191    0.97330  2.2520   0.02464 *  
d$BlockNJ-E-0722-1540  1.55489    0.79330  1.9600   0.05040 .  
d$BlockNJ-E-07

It looks like speed decreases with a sign, but decreases almost 2x more when children & pedestrians are present. Let's look at the interaction between children & pedestrians and the treatment.

In [39]:
fit_int <- lm(d$Speed ~ d$Treatment*d$Children*d$Pedestrians + d$Direction2 + d$Block + d$SameSide + as.factor(d$DayofWeek), d)
fit_int$cluster.vcov <- cluster.vcov(fit_int, ~ d$Cluster)

fit_int_cl <- coeftest(fit_int, fit_int$cluster.vcov)
fit_int_cl


t test of coefficients:

                           Estimate Std. Error t value  Pr(>|t|)    
(Intercept)               28.551608   1.100830 25.9364 < 2.2e-16 ***
d$Treatment               -1.228569   0.369475 -3.3252 0.0009316 ***
d$Children                -4.173981   1.233789 -3.3831 0.0007584 ***
d$Pedestrians             -0.533554   0.707197 -0.7545 0.4508353    
d$Direction2n             -1.104755   1.352450 -0.8169 0.4143003    
d$Direction2s             -1.365719   1.475442 -0.9256 0.3549683    
d$Direction2w             -0.839789   0.620264 -1.3539 0.1762155    
d$BlockCA-C-0717-1940     -0.962331   1.038360 -0.9268 0.3543730    
d$BlockCA-D-0718-1710      1.613995   1.650825  0.9777 0.3285788    
d$BlockCA-D-0718-1740      2.144174   0.966898  2.2176 0.0269166 *  
d$BlockCA-D-0718-1810      2.094070   1.211016  1.7292 0.0842348 .  
d$BlockCA-D-0718-1840      0.367202   1.367316  0.2686 0.7883531    
d$BlockCA-D-0718-1910      0.495236   1.459575  0.3393 0.7344887    
d$BlockN

In [40]:
#compare whether model with interaction vs without are different
anova(fit, fit_int, test = "LRT")

Res.Df,RSS,Df,Sum of Sq,Pr(>Chi)
675,10671.91,,,
673,10589.15,2.0,82.75846,0.07208698


Interestingly, it looks like there is not a significant heterogenous treatment effect due to the presence of children and pedestrians, and our treatment!

One other possible interaction could be with the area that we measured data in. This is generally accounted for with our blocks, but we might see bigger or smaller treatment effects by location. Let's take a look at that. Note: since we determined above that there is not an interaction between the presence of children or pedetrians with the treatment, we are going back to accounting for those as separate factors.

In [41]:
fit_city <- lm(d$Speed ~ d$Treatment*d$CityID + d$Children + d$Pedestrians + d$Direction2 + d$Block + d$SameSide + as.factor(d$DayofWeek), d)
fit_city$cluster.vcov <- cluster.vcov(fit_loc, ~ d$Cluster)

fit_city_cl <- coeftest(fit_city, fit_city$cluster.vcov)
fit_city_cl


t test of coefficients:

                      Estimate Std. Error t value  Pr(>|t|)    
(Intercept)           27.86753    0.52663 52.9172 < 2.2e-16 ***
d$Treatment           -0.53245    0.43043 -1.2370 0.2165110    
d$Children            -2.74078    0.72966 -3.7563 0.0001874 ***
d$Pedestrians         -1.44699    0.62787 -2.3046 0.0214929 *  
d$Direction2n         -1.09164    0.55448 -1.9688 0.0493910 *  
d$Direction2w         -0.83062    0.62843 -1.3217 0.1867055    
d$BlockCA-C-0717-1940 -0.69474    0.72550 -0.9576 0.3386040    
d$BlockCA-D-0718-1710  2.00449    1.34781  1.4872 0.1374256    
d$BlockCA-D-0718-1740  2.34403    0.73467  3.1906 0.0014858 ** 
d$BlockCA-D-0718-1810  2.49921    0.87509  2.8560 0.0044230 ** 
d$BlockCA-D-0718-1840  0.91132    1.57962  0.5769 0.5641825    
d$BlockNJ-E-0722-1500  0.13448    0.95513  0.1408 0.8880708    
d$BlockNJ-E-0722-1520  2.76406    0.72028  3.8375 0.0001360 ***
d$BlockNJ-E-0722-1540  2.64164    0.91404  2.8901 0.0039755 ** 
d$BlockNJ-F-07

In [42]:
fit_loc <- lm(d$Speed ~ d$Treatment*d$Location + d$Children + d$Pedestrians + d$Direction2 + d$Block + d$SameSide + as.factor(d$DayofWeek), d)
fit_loc$cluster.vcov <- cluster.vcov(fit_loc, ~ d$Cluster)

fit_loc_cl <- coeftest(fit_loc, fit_loc$cluster.vcov)
fit_loc_cl


t test of coefficients:

                        Estimate Std. Error t value  Pr(>|t|)    
(Intercept)             27.93929    0.52663 53.0534 < 2.2e-16 ***
d$Treatment             -2.82297    0.43043 -6.5585 1.087e-10 ***
d$LocationB              0.37721    0.66375  0.5683 0.5700206    
d$LocationC             -0.15980    0.81494 -0.1961 0.8446009    
d$LocationD             -0.55449    0.92283 -0.6009 0.5481378    
d$LocationE              1.20505    0.76920  1.5666 0.1176728    
d$LocationF              0.18182    1.13242  0.1606 0.8724889    
d$Children              -2.59938    0.72966 -3.5625 0.0003934 ***
d$Pedestrians           -1.42607    0.62787 -2.2713 0.0234470 *  
d$Direction2n            0.25608    0.55448  0.4618 0.6443421    
d$Direction2w           -0.88549    0.62843 -1.4090 0.1592859    
d$BlockCA-C-0717-1940   -0.66205    0.72550 -0.9125 0.3618125    
d$BlockCA-D-0718-1710    1.20450    1.34781  0.8937 0.3718195    
d$BlockCA-D-0718-1740    1.54792    0.73467  2.107

In [43]:
#for city, compare whether model with interaction vs without are different
anova(fit, fit_city, test = "LRT")

#for location, compare whether model with interaction vs without are different
anova(fit, fit_loc, test = "LRT")

Res.Df,RSS,Df,Sum of Sq,Pr(>Chi)
675,10671.91,,,
673,10607.21,2.0,64.69776,0.1284197


Res.Df,RSS,Df,Sum of Sq,Pr(>Chi)
675,10671.91,,,
670,10534.27,5.0,137.645,0.1192664


# Percent change analysis

We are going to compare our data against the [Federal Highway Administration Data](https://safety.fhwa.dot.gov/speedmgt/ref_mats/eng_count/2014/reducing_speed.cfm), which uses the percentage reduction as a metric. So we need to calculate our % change (between treatment and control) for mean speed.

In [44]:
# Calculate the means of the treatment and test control group
aggregate(d$Speed, list(d$Treatment), mean)

Group.1,x
0,27.79883
1,26.33708
