# Missing data robustness check models

We want to confirm that missing data are not driving the change we see in the permutation test results. To test this, we fit OLS models of the peak-hour share of daily occupancy and check how the post-lockdown estimate changes once we control for the proportion of each sensor-day that was imputed.
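The logic of the check can be illustrated with a small sketch. It uses made-up, noise-free toy data (not the notebook's dataset) and plain least squares from `LinearAlgebra` in place of `reg`. All variable names here are hypothetical. When imputation is correlated with both the period and the outcome, leaving it out biases the period coefficient, and adding it as a control recovers the value used to build the data.

```julia
using LinearAlgebra

# Made-up toy data, not drawn from the notebook's dataset.
post    = [0, 0, 0, 0, 1, 1, 1, 1]   # 1 = post-lockdown observation
imputed = [1, 1, 0, 0, 1, 0, 0, 0]   # 1 = heavily imputed observation

# Outcome built without noise, so the true coefficients are known:
# intercept 10, post-lockdown effect -0.5, imputation effect +0.4.
y = 10 .- 0.5 .* post .+ 0.4 .* imputed

X_short = hcat(ones(length(y)), post)            # intercept + period only
X_full  = hcat(ones(length(y)), post, imputed)   # adds the missingness control

β_short = X_short \ y   # least-squares fit without the control
β_full  = X_full \ y    # least-squares fit with the control

println("post-lockdown, no control:   ", round(β_short[2], digits=3))
println("post-lockdown, with control: ", round(β_full[2], digits=3))
```

In this toy setup the uncontrolled estimate is about -0.6, while the controlled estimate is -0.5, the value used to construct `y`. If the post-lockdown coefficient in the real models stays negative and of similar size after the control is added, missingness alone is unlikely to explain the change.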

In [1]:
using KFactors, DataFrames, CSV, Pipe, RegressionTables, CategoricalArrays, StatsBase, FixedEffectModels

In [2]:
raw_data = KFactors.read_data("../data/peaks_merged.parquet");

┌ Info: Removing sensors days with peak-hour occ above 99th percentile (22.83%)
└ @ KFactors C:\Users\mwbc\git\peak-spreading\src\computation.jl:65


In [3]:
data = KFactors.create_test_data(raw_data, KFactors.Periods.SPRING_2022, min_complete=0.0);

In [4]:
data = data[data.period .∈ Ref(Set([:prepandemic, :postlockdown])), :];

In [5]:
meta = CSV.read("../data/sensor_meta_geo.csv", DataFrame)
leftjoin!(data, select(meta, Not([:Latitude, :Longitude, :urban, :District, :Lanes])), on=:station=>:ID);

In [6]:
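# A sensor-day with all 288 periods imputed contains no observed data
data.entirely_imputed = data.periods_imputed .== 288
# Indicator for the post-lockdown period (prepandemic is the reference)
data.postlockdown = data.period .== :postlockdown
# Share of the day's 288 periods that were imputed
data.proportion_imputed = data.periods_imputed ./ 288
# Rescale the peak-hour occupancy share from a proportion to a percentage
data.peak_hour_occ_pct = data.peak_hour_occ .* 100;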

In [7]:
mod1 = reg(data, @formula(peak_hour_occ_pct~postlockdown), Vcov.cluster(:station))

                                Linear Model                                
Number of obs:                4569230   Degrees of freedom:                 1
R2:                             0.008   R2 Adjusted:                    0.008
F-Stat:                       1178.74   p-value:                        0.000
peak_hour_occ_pct |  Estimate Std.Error  t value Pr(>|t|) Lower 95% Upper 95%
-----------------------------------------------------------------------------
postlockdown      | -0.740239 0.0215607 -34.3328    0.000 -0.782504 -0.697975
(Intercept)       |   9.90211 0.0297216  333.162    0.000   9.84384   9.96037


In [8]:
data.proportion_imputed_cat = @pipe map(data.proportion_imputed) do x
    x == 0 && return "0.0"
    x < 0.1 && return "(0, 0.1)"
    x < 0.2 && return "[0.1, 0.2)"
    x < 0.3 && return "[0.2, 0.3)"
    x < 0.4 && return "[0.3, 0.4)"
    x < 0.5 && return "[0.4, 0.5)"
    x < 0.6 && return "[0.5, 0.6)"
    x < 0.7 && return "[0.6, 0.7)"
    x < 0.8 && return "[0.7, 0.8)"
    x < 0.9 && return "[0.8, 0.9)"
    x < 1 && return "[0.9, 1.0)"
    x == 1 && return "1.0"
    error("Unknown x value $x")
end |> CategoricalArray(_,
    levels=["0.0", "(0, 0.1)", "[0.1, 0.2)", "[0.2, 0.3)", "[0.3, 0.4)", "[0.4, 0.5)", "[0.5, 0.6)", "[0.6, 0.7)", "[0.7, 0.8)", "[0.8, 0.9)", "[0.9, 1.0)", "1.0"],
    ordered=true)
flexible_missing = reg(data, @formula(peak_hour_occ_pct~postlockdown+proportion_imputed_cat), Vcov.cluster(:station))

                                          Linear Model                                          
Number of obs:                          4569230   Degrees of freedom:                          12
R2:                                       0.070   R2 Adjusted:                              0.070
F-Stat:                                  243.49   p-value:                                  0.000
peak_hour_occ_pct                  |   Estimate Std.Error   t value Pr(>|t|) Lower 95%  Upper 95%
-------------------------------------------------------------------------------------------------
postlockdown                       |  -0.544839  0.020054  -27.1686    0.000  -0.58415  -0.505528
proportion_imputed_cat: (0, 0.1)   |   0.381112 0.0425143   8.96432    0.000  0.297773   0.464451
proportion_imputed_cat: [0.1, 0.2) |   0.310491 0.0973225   3.19033    0.001  0.119713   0.501269
proportion_imputed_cat: [0.2, 0.3) |   0.242865 0.0900066    2.6983    0.007 0.0664281   0.419302
proportion_imputed_ca
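
The binning in cell In [8] uses a chain of floating-point comparisons. As an alternative sketch, the same labels can be derived from the integer count of imputed periods, which avoids floating-point edge cases at the bin boundaries. This assumes `periods_imputed` holds integer counts out of 288. `imputed_bin` and `PERIODS_PER_DAY` are hypothetical names and are not part of KFactors.

```julia
# Assumption: 288 periods per sensor-day, as implied by `periods_imputed ./ 288`.
const PERIODS_PER_DAY = 288

# Hypothetical helper: map a count of imputed periods to the bin labels
# used for `proportion_imputed_cat`.
function imputed_bin(periods_imputed::Integer; total::Integer=PERIODS_PER_DAY)
    0 <= periods_imputed <= total ||
        error("Unknown periods_imputed value $periods_imputed")
    periods_imputed == 0 && return "0.0"
    periods_imputed == total && return "1.0"
    # Integer division gives floor(10 * proportion) exactly: 0 through 9.
    decile = (10 * periods_imputed) ÷ total
    decile == 0 && return "(0, 0.1)"
    return "[$(decile / 10), $((decile + 1) / 10))"
end

# Example usage on a few counts
for k in (0, 1, 29, 144, 287, 288)
    println(k, " => ", imputed_bin(k))
end
```

The resulting strings could be passed to `CategoricalArray` with the same `levels` vector as in the original cell.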

In [9]:
data.flow_per_lane = data.total_flow ./ data.Lanes ./ 1000 ./ 12;

In [10]:
with_flow = reg(data, @formula(peak_hour_occ_pct~postlockdown+proportion_imputed_cat+flow_per_lane+flow_per_lane^2), Vcov.cluster(:station))

                                           Linear Model                                           
Number of obs:                           4569230  Degrees of freedom:                           14
R2:                                        0.078  R2 Adjusted:                               0.078
F-Stat:                                  212.634  p-value:                                   0.000
peak_hour_occ_pct                  |   Estimate Std.Error   t value Pr(>|t|)  Lower 95%  Upper 95%
--------------------------------------------------------------------------------------------------
postlockdown                       |  -0.593988 0.0213503   -27.821    0.000   -0.63584  -0.552135
proportion_imputed_cat: (0, 0.1)   |   0.374614 0.0403254   9.28977    0.000   0.295565   0.453662
proportion_imputed_cat: [0.1, 0.2) |   0.298034  0.090865   3.27996    0.001   0.119914   0.476153
proportion_imputed_cat: [0.2, 0.3) |   0.219239 0.0836626   2.62051    0.009  0.0552375   0.383239
proportion

### Adding sensor fixed effects

In [11]:
data.station = CategoricalArray(data.station)
fixed_effects = reg(data, @formula(peak_hour_occ_pct~postlockdown+proportion_imputed_cat+flow_per_lane+flow_per_lane^2+fe(station)), Vcov.cluster(:station))



                                       Fixed Effect Model                                       
Number of obs:                          4569230  Degrees of freedom:                          13
R2:                                       0.569  R2 Adjusted:                              0.569
F-Stat:                                   330.2  p-value:                                  0.000
R2 within:                                0.069  Iterations:                                   1
peak_hour_occ_pct                  |   Estimate Std.Error  t value Pr(>|t|) Lower 95%  Upper 95%
------------------------------------------------------------------------------------------------
postlockdown                       |  -0.634976 0.0195715 -32.4439    0.000 -0.673342  -0.596611
proportion_imputed_cat: (0, 0.1)   |   0.120602 0.0109612  11.0027    0.000 0.0991154   0.142089
proportion_imputed_cat: [0.1, 0.2) |   0.256908 0.0452655  5.67557    0.000  0.168175    0.34564
proportion_imputed_cat: [0.2, 
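
The `fe(station)` term gives each sensor its own intercept, so the post-lockdown coefficient is identified only from variation within sensors. The sketch below illustrates this with made-up toy data and plain least squares. It relies on the standard result that a regression with one dummy per group gives the same slope as a regression on group-demeaned variables. It is not a description of how FixedEffectModels computes its estimates, and all names are hypothetical.

```julia
using LinearAlgebra, Statistics

# Made-up toy panel: two sensors observed three times each.
station = [1, 1, 1, 2, 2, 2]
post    = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]
y       = [10.0, 10.2, 9.4, 7.0, 6.5, 6.3]

# Approach 1: one dummy column per sensor (no separate intercept).
X_dummy = hcat(post, station .== 1, station .== 2)
β_dummy = X_dummy \ y            # β_dummy[1] is the post-lockdown slope

# Approach 2: subtract each sensor's own mean, then fit a slope with no intercept.
demean(v) = v .- [mean(v[station .== s]) for s in station]
post_w, y_w = demean(post), demean(y)
β_within = dot(post_w, y_w) / dot(post_w, post_w)

println("dummy-variable slope: ", β_dummy[1])
println("within slope:         ", β_within)
```

The two slopes agree up to floating-point error, which is why column (4) of the regression table reports no intercept: it is absorbed into the sensor effects.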

In [12]:
labels = Dict(
    "postlockdown" => "Post-lockdown",
    "proportion_imputed_cat: (0, 0.1)" => "Proportion imputed: 0% (exclusive)--10%",
    "proportion_imputed_cat: [0.1, 0.2)" => "Proportion imputed: 10%--20%",
    "proportion_imputed_cat: [0.2, 0.3)" => "Proportion imputed: 20%--30%",
    "proportion_imputed_cat: [0.3, 0.4)" => "Proportion imputed: 30%--40%",
    "proportion_imputed_cat: [0.4, 0.5)" => "Proportion imputed: 40%--50%",
    "proportion_imputed_cat: [0.5, 0.6)" => "Proportion imputed: 50%--60%",
    "proportion_imputed_cat: [0.6, 0.7)" => "Proportion imputed: 60%--70%",
    "proportion_imputed_cat: [0.7, 0.8)" => "Proportion imputed: 70%--80%",
    "proportion_imputed_cat: [0.8, 0.9)" => "Proportion imputed: 80%--90%",
    "proportion_imputed_cat: [0.9, 1.0)" => "Proportion imputed: 90%--100% (exclusive)",
    "proportion_imputed_cat: 1.0" => "Proportion imputed: 100%",
    "flow_per_lane" => "Average flow per lane (thousands of vehicles/hour)",
    "flow_per_lane ^ 2" => "Average flow per lane (thousands of vehicles/hour) squared",
    "peak_hour_occ_pct" => "Percent of daily occupancy in peak hour",
    "station" => "Sensor fixed effects"
)
regtable(mod1, flexible_missing, with_flow, fixed_effects, labels=labels)


----------------------------------------------------------------------------------------------------------
                                                                Percent of daily occupancy in peak hour   
                                                             ---------------------------------------------
                                                                   (1)         (2)         (3)         (4)
----------------------------------------------------------------------------------------------------------
(Intercept)                                                   9.902***   10.506***   11.072***            
                                                               (0.030)     (0.039)     (0.159)            
Post-lockdown                                                -0.740***   -0.545***   -0.594***   -0.635***
                                                               (0.022)     (0.020)     (0.021)     (0.020)
Proportion imputed: 0% (exclusive)--

In [13]:
regtable(mod1, flexible_missing, with_flow, fixed_effects, labels=labels, renderSettings=latexOutput(), print_estimator_section=false, estimformat="%0.2f", statisticformat="%0.2f")

\begin{tabular}{lrrrr}
\toprule
                                                           & \multicolumn{4}{c}{Percent of daily occupancy in peak hour} \\ 
\cmidrule(lr){2-5} 
                                                           &       (1) &       (2) &       (3) &                     (4) \\ 
\midrule
(Intercept)                                                &   9.90*** &  10.51*** &  11.07*** &                         \\ 
                                                           &    (0.03) &    (0.04) &    (0.16) &                         \\ 
Post-lockdown                                              &  -0.74*** &  -0.54*** &  -0.59*** &                -0.63*** \\ 
                                                           &    (0.02) &    (0.02) &    (0.02) &                  (0.02) \\ 
Proportion imputed: 0% (exclusive)--10%                    &           &   0.38*** &   0.37*** &                 0.12*** \\ 
                                                           &    