# Missing data robustness check models

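We want to confirm that missing (imputed) data is not driving the change we see in the permutation test results. To test this, we fit OLS models of the percent of daily occupancy that falls in the peak hour on a post-lockdown indicator, then check how that estimate moves once we control for the proportion of each sensor-day that was imputed.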
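
As a reading aid, the richest specification fitted in this notebook can be sketched as a single equation. The notation is ours and is not taken from the analysis code: $s$ indexes sensors, $d$ indexes days, $y_{sd}$ is the percent of daily occupancy in the peak hour, $m_{sd}$ is the proportion of the day that was imputed, $B_k$ are the imputation bins (with zero imputation as the omitted base category), $f_{sd}$ is flow per lane, and $\alpha_s$ is a sensor fixed effect.

$$
y_{sd} = \beta\,\mathrm{Post}_{sd} + \sum_{k} \gamma_k\,\mathbf{1}[m_{sd} \in B_k] + \delta_1 f_{sd} + \delta_2 f_{sd}^2 + \alpha_s + \varepsilon_{sd}
$$

The simpler models drop terms from the right-hand side and replace $\alpha_s$ with a common intercept. In every model, standard errors are clustered by sensor. The quantity of interest is $\beta$: if missingness were responsible for the post-lockdown change, $\beta$ should shrink toward zero as the controls are added.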

In [1]:
using KFactors, DataFrames, CSV, Pipe, RegressionTables, CategoricalArrays, StatsBase, FixedEffectModels

In [2]:
raw_data = KFactors.read_data("../data/peaks_merged.parquet");

┌ Info: Removing sensors days with peak-hour occ above 99th percentile (22.90%)
└ @ KFactors /Users/mwbc/git/peak-spreading/src/computation.jl:65


In [3]:
data = KFactors.create_test_data(raw_data, KFactors.Periods.SPRING_2022, min_complete=0.0);

In [4]:
data = data[data.period .∈ Ref(Set([:prepandemic, :postlockdown])), :];

In [5]:
meta = CSV.read("../data/sensor_meta_geo.csv", DataFrame)
leftjoin!(data, select(meta, Not([:Latitude, :Longitude, :urban, :District, :Lanes])), on=:station=>:ID);

In [33]:
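# 288 = 24 hours × 12, i.e. the number of periods in a day assuming 5-minute sensor periods
data.entirely_imputed = data.periods_imputed .== 288
data.postlockdown = data.period .== :postlockdown
data.proportion_imputed = data.periods_imputed ./ 288
# Rescale the peak-hour share of daily occupancy from a proportion to a percentage
data.peak_hour_occ_pct = data.peak_hour_occ .* 100;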

In [34]:
mod1 = reg(data, @formula(peak_hour_occ_pct~postlockdown), Vcov.cluster(:station))

                                Linear Model                                
Number of obs:                1555184   Degrees of freedom:                 1
R2:                             0.007   R2 Adjusted:                    0.007
F-Stat:                       815.391   p-value:                        0.000
peak_hour_occ_pct |  Estimate Std.Error  t value Pr(>|t|) Lower 95% Upper 95%
-----------------------------------------------------------------------------
postlockdown      | -0.670835 0.0234927 -28.5551    0.000 -0.716887 -0.624783
(Intercept)       |   10.0339 0.0303377  330.741    0.000   9.97445   10.0934


In [35]:
data.proportion_imputed_cat = @pipe map(data.proportion_imputed) do x
    x == 0 && return "0.0"
    x < 0.1 && return "(0, 0.1)"
    x < 0.2 && return "[0.1, 0.2)"
    x < 0.3 && return "[0.2, 0.3)"
    x < 0.4 && return "[0.3, 0.4)"
    x < 0.5 && return "[0.4, 0.5)"
    x < 0.6 && return "[0.5, 0.6)"
    x < 0.7 && return "[0.6, 0.7)"
    x < 0.8 && return "[0.7, 0.8)"
    x < 0.9 && return "[0.8, 0.9)"
    x < 1 && return "[0.9, 1.0)"
    x == 1 && return "1.0"
    error("Unknown x value $x")
end |> CategoricalArray(_,
    levels=["0.0", "(0, 0.1)", "[0.1, 0.2)", "[0.2, 0.3)", "[0.3, 0.4)", "[0.4, 0.5)", "[0.5, 0.6)", "[0.6, 0.7)", "[0.7, 0.8)", "[0.8, 0.9)", "[0.9, 1.0)", "1.0"],
    ordered=true)
flexible_missing = reg(data, @formula(peak_hour_occ_pct~postlockdown+proportion_imputed_cat), Vcov.cluster(:station))

                                         Linear Model                                         
Number of obs:                         1555184   Degrees of freedom:                         12
R2:                                      0.074   R2 Adjusted:                             0.074
F-Stat:                                225.883   p-value:                                 0.000
peak_hour_occ_pct                  |  Estimate Std.Error  t value Pr(>|t|)  Lower 95% Upper 95%
-----------------------------------------------------------------------------------------------
postlockdown                       | -0.511678 0.0217236  -23.554    0.000  -0.554262 -0.469094
proportion_imputed_cat: (0, 0.1)   |  0.376096 0.0418493  8.98693    0.000   0.294061  0.458132
proportion_imputed_cat: [0.1, 0.2) |  0.219737 0.0895422    2.454    0.014  0.0442102  0.395264
proportion_imputed_cat: [0.2, 0.3) |  0.360738 0.0900093  4.00779    0.000   0.184296  0.537181
proportion_imputed_cat: [0.3, 0.4) |  0.1
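
The chain of comparisons used to bin `proportion_imputed` can also be written as a lookup against a list of bin edges. The sketch below is a hypothetical alternative, not the code used in this analysis: `UPPER_EDGES` and `imputed_bin` are names introduced here for illustration. It uses only Base Julia and is meant to produce the same labels as the cell above, with 0 and 1 kept as their own categories.

```julia
# Upper edges of the half-open bins (hypothetical helper constant)
const UPPER_EDGES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

# Map a proportion in [0, 1] to its bin label (hypothetical helper)
function imputed_bin(x::Real)
    0 <= x <= 1 || error("Unknown x value $x")
    x == 0 && return "0.0"                  # no imputation: its own category
    x == 1 && return "1.0"                  # entirely imputed: its own category
    k = findfirst(>(x), UPPER_EDGES)        # first edge strictly above x
    k == 1 && return "(0, 0.1)"             # lowest bin excludes zero
    return "[$(UPPER_EDGES[k - 1]), $(UPPER_EDGES[k]))"
end

# Example: proportions for 0, 12, 144, 287 and 288 imputed periods out of 288
example_labels = imputed_bin.([0, 12, 144, 287, 288] ./ 288)
```

Keeping the edges in one place makes it harder for the labels and the thresholds to drift apart if the bin width is ever changed.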

In [36]:
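# Flow per lane, rescaled to the units used in the regression table labels (thousands of vehicles per hour)
data.flow_per_lane = data.total_flow ./ data.Lanes ./ 1000 ./ 12;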

In [38]:
with_flow = reg(data, @formula(peak_hour_occ_pct~postlockdown+proportion_imputed_cat+flow_per_lane+flow_per_lane^2), Vcov.cluster(:station))

                                           Linear Model                                           
Number of obs:                           1555184  Degrees of freedom:                           14
R2:                                        0.082  R2 Adjusted:                               0.082
F-Stat:                                  196.093  p-value:                                   0.000
peak_hour_occ_pct                  |   Estimate Std.Error   t value Pr(>|t|)  Lower 95%  Upper 95%
--------------------------------------------------------------------------------------------------
postlockdown                       |  -0.562491 0.0229306  -24.5301    0.000  -0.607441   -0.51754
proportion_imputed_cat: (0, 0.1)   |    0.35257 0.0400679   8.79933    0.000   0.274027   0.431114
proportion_imputed_cat: [0.1, 0.2) |   0.203461 0.0841758   2.41709    0.016  0.0384536   0.368468
proportion_imputed_cat: [0.2, 0.3) |   0.337694 0.0845164    3.9956    0.000   0.172019   0.503369
proportion

### Adding sensor fixed effects
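
Sensor fixed effects absorb any time-invariant differences between stations, so the remaining coefficients are identified from variation within each sensor. The toy sketch below illustrates the idea with made-up numbers and Base Julia plus the `Statistics` standard library. It is not how `FixedEffectModels` is implemented internally, and `demean_by_group` and `ols_slope` are hypothetical helpers introduced here.

```julia
using Statistics

# Subtract each group's mean from its observations (the "within" transformation)
function demean_by_group(values::AbstractVector{<:Real}, groups::AbstractVector)
    out = float.(values)
    for g in unique(groups)
        idx = findall(==(g), groups)
        out[idx] .-= mean(values[idx])
    end
    return out
end

# Slope from a one-regressor OLS fit with an intercept
ols_slope(x, y) = sum((x .- mean(x)) .* (y .- mean(y))) / sum((x .- mean(x)) .^ 2)

# Made-up data: within each station y rises by 2 per unit of x,
# but station B has both higher x and a lower baseline.
station = ["A", "A", "A", "B", "B", "B"]
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 12.0, 14.0, 6.0, 8.0, 10.0]

# Pooled slope mixes between-station and within-station variation
pooled_slope = ols_slope(x, y)

# Within slope uses only deviations from each station's own mean
within_slope = ols_slope(demean_by_group(x, station), demean_by_group(y, station))
```

In this toy data the pooled slope is negative while the within-station slope is 2, which is why comparing the models with and without `fe(station)` is informative when sensors differ systematically in how much of their data is imputed.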

In [39]:
data.station = CategoricalArray(data.station)
fixed_effects = reg(data, @formula(peak_hour_occ_pct~postlockdown+proportion_imputed_cat+flow_per_lane+flow_per_lane^2+fe(station)), Vcov.cluster(:station))

                                       Fixed Effect Model                                       
Number of obs:                          1555184   Degrees of freedom:                          13
R2:                                       0.576   R2 Adjusted:                              0.576
F-Stat:                                 281.327   p-value:                                  0.000
R2 within:                                0.071   Iterations:                                   1
peak_hour_occ_pct                  |    Estimate Std.Error   t value Pr(>|t|) Lower 95% Upper 95%
-------------------------------------------------------------------------------------------------
postlockdown                       |   -0.580613  0.021336  -27.2128    0.000 -0.622437 -0.538788
proportion_imputed_cat: (0, 0.1)   |    0.138317 0.0123348   11.2135    0.000  0.114137  0.162496
proportion_imputed_cat: [0.1, 0.2) |    0.233023  0.042136   5.53026    0.000  0.150425  0.315621
proportion_imputed_ca

In [44]:
labels = Dict(
    "postlockdown" => "Post-lockdown",
    "proportion_imputed_cat: (0, 0.1)" => "Proportion imputed: 0% (exclusive)--10%",
    "proportion_imputed_cat: [0.1, 0.2)" => "Proportion imputed: 10%--20%",
    "proportion_imputed_cat: [0.2, 0.3)" => "Proportion imputed: 20%--30%",
    "proportion_imputed_cat: [0.3, 0.4)" => "Proportion imputed: 30%--40%",
    "proportion_imputed_cat: [0.4, 0.5)" => "Proportion imputed: 40%--50%",
    "proportion_imputed_cat: [0.5, 0.6)" => "Proportion imputed: 50%--60%",
    "proportion_imputed_cat: [0.6, 0.7)" => "Proportion imputed: 60%--70%",
    "proportion_imputed_cat: [0.7, 0.8)" => "Proportion imputed: 70%--80%",
    "proportion_imputed_cat: [0.8, 0.9)" => "Proportion imputed: 80%--90%",
    "proportion_imputed_cat: [0.9, 1.0)" => "Proportion imputed: 90%--100% (exclusive)",
    "proportion_imputed_cat: 1.0" => "Proportion imputed: 100%",
    "flow_per_lane" => "Average flow per lane (thousands of vehicles/hour)",
    "flow_per_lane ^ 2" => "Average flow per lane (thousands of vehicles/hour) squared",
    "peak_hour_occ_pct" => "Percent of daily occupancy in peak hour",
    "station" => "Sensor fixed effects"
)
regtable(mod1, flexible_missing, with_flow, fixed_effects, labels=labels)


----------------------------------------------------------------------------------------------------------
                                                                Percent of daily occupancy in peak hour   
                                                             ---------------------------------------------
                                                                   (1)         (2)         (3)         (4)
----------------------------------------------------------------------------------------------------------
(Intercept)                                                  10.034***   10.683***   11.192***            
                                                               (0.030)     (0.040)     (0.175)            
Post-lockdown                                                -0.671***   -0.512***   -0.562***   -0.581***
                                                               (0.023)     (0.022)     (0.023)     (0.021)
Proportion imputed: 0% (exclusive)--

In [45]:
regtable(mod1, flexible_missing, with_flow, fixed_effects, labels=labels, renderSettings=latexOutput(), print_estimator_section=false, estimformat="%0.2f", statisticformat="%0.2f")

\begin{tabular}{lrrrr}
\toprule
                                                           & \multicolumn{4}{c}{Percent of daily occupancy in peak hour} \\ 
\cmidrule(lr){2-5} 
                                                           &       (1) &       (2) &       (3) &                     (4) \\ 
\midrule
(Intercept)                                                &  10.03*** &  10.68*** &  11.19*** &                         \\ 
                                                           &    (0.03) &    (0.04) &    (0.17) &                         \\ 
Post-lockdown                                              &  -0.67*** &  -0.51*** &  -0.56*** &                -0.58*** \\ 
                                                           &    (0.02) &    (0.02) &    (0.02) &                  (0.02) \\ 
Proportion imputed: 0% (exclusive)--10%                    &           &   0.38*** &   0.35*** &                 0.14*** \\ 
                                                           &    