# Sensor shifts

Make sure sensor metadata (freeway, direction, lane count, and location) is stable over time. Most sensors are stable, but some are not. Write out a file of the stable sensors so we can drop the rest.

TODO: one file from d03 is empty, but this shouldn't matter: it's the last one in 2015, there's a new one in March 2016, and we're not using any data from before March 2016.

In [None]:
using CSV, DataFrames, DataFramesMeta, Logging, ProgressMeter, Geodesy, Dates, StatsBase

In [None]:
files = filter(readdir("../data/meta/")) do fn
    !isnothing(match(r"^d.*_text_meta_.*\.txt", fn))
end
nothing

## Figure out which dates to read

We want to read all metadata files from 2016 or later, plus the last file dated on or before 2016-01-01, so we have valid metadata for the entire analysis period. We stop reading at 2022-08-19, so that anyone rerunning this with newer metadata does not see additional sensors drop out.

(Why August 19 instead of 18? Several sensors moved or changed the very next day after our analysis window, and whatever caused that change may already have affected them before August 19. Also, that's what the metadata we downloaded at the same time as the data showed.)

The PeMS site lists some earlier metadata files as extending into this period as well, but I think that's an error: metadata files seem to contain all sensors, so each should supersede the last. I have an email in to PeMS to confirm. For now we ignore those files.

In [None]:
dates_by_district = Dict{String, Vector{Date}}()

for file in files
    # capture groups: district number (leading zero stripped), year, month, day
    parsed = match(r"^d0?([1-9][0-9]?)_text_meta_([0-9]{4})_([0-9]{2})_([0-9]{2})\.txt", file)
    if !haskey(dates_by_district, parsed[1])
        dates_by_district[parsed[1]] = Date[]
    end
    date = Date(parse(Int64, parsed[2]), parse(Int64, parsed[3]), parse(Int64, parsed[4]))
    push!(dates_by_district[parsed[1]], date)
end
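
The filename parsing used above can be pulled into a small helper. This is only a sketch: `parse_meta_filename` is a hypothetical name that does not appear in the notebook, the example filenames are made up, and it assumes every metadata file follows the `dNN_text_meta_YYYY_MM_DD.txt` pattern.

```julia
using Dates

# Hypothetical helper (not part of the notebook): split a PeMS metadata
# filename into a district string and a Date. Returns `nothing` when the
# name does not match the assumed pattern.
function parse_meta_filename(fn::AbstractString)
    m = match(r"^d0?([1-9][0-9]?)_text_meta_([0-9]{4})_([0-9]{2})_([0-9]{2})\.txt$", fn)
    m === nothing && return nothing
    year, month, day = parse.(Int, (m[2], m[3], m[4]))
    return (district = String(m[1]), date = Date(year, month, day))
end

# Made-up filenames, for illustration only
parse_meta_filename("d03_text_meta_2015_12_04.txt")  # district "3"
parse_meta_filename("d12_text_meta_2016_03_15.txt")  # district "12"
parse_meta_filename("notes.txt")                     # nothing
```

Returning `nothing` for non-matching names makes it easy to skip stray files instead of hitting an error when indexing a failed match.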

In [None]:
dates_to_retain_by_district = Dict{String, Set{Date}}()

for (district, dates) in pairs(dates_by_district)
    # retain the last file dated on or before 2016-01-01 and all later files
    last_date_before_2016 = Date(1970, 1, 1)
    
    for date in dates
        if date <= Date(2016, 1, 1) && date > last_date_before_2016
            last_date_before_2016 = date
        end
    end
    
    dates_to_retain_by_district[district] = Set(filter(d -> d >= last_date_before_2016, dates))
end
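
To see the retention rule on a small case, here is a standalone sketch. The function name `dates_to_retain`, the `cutoff` keyword, and the toy dates are assumptions made for illustration; the logic mirrors the loop above, including keeping everything when no file falls on or before the cutoff.

```julia
using Dates

# Hypothetical standalone version of the retention rule: keep the latest
# date on or before `cutoff`, plus every later date.
function dates_to_retain(dates; cutoff = Date(2016, 1, 1))
    earlier = filter(d -> d <= cutoff, dates)
    # no file on or before the cutoff: keep everything
    isempty(earlier) && return Set(dates)
    anchor = maximum(earlier)
    return Set(filter(d -> d >= anchor, dates))
end

# Toy dates, not taken from the real metadata listing
toy_dates = [Date(2014, 5, 1), Date(2015, 11, 1), Date(2016, 3, 1), Date(2017, 1, 1)]
retained = dates_to_retain(toy_dates)
# 2014-05-01 is dropped because 2015-11-01 is a later file that still precedes the cutoff
```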

In [None]:
all_meta = vcat(skipmissing(map(files) do file
        parsed = match(r"^d0?([1-9][0-9]?)_text_meta_([0-9]{4})_([0-9]{2})_([0-9]{2})\.txt", file)
        date = Date(parse(Int64, parsed[2]), parse(Int64, parsed[3]), parse(Int64, parsed[4]))

        if !in(date, dates_to_retain_by_district[parsed[1]])
            return missing
        else
            data = CSV.read(joinpath("../data/meta", file), DataFrame;
                types=Dict(:Longitude=>Union{Missing,Float64}), validate=false)
            if ncol(data) == 0 && nrow(data) == 0
                @warn "file $file is empty, skipping"
                return missing
            end
            select!(data, [:ID, :Fwy, :Dir, :Latitude, :Longitude, :District, :Lanes, :County])
            data[!, :date] .= date
            return data
        end
    end)...)
nothing

In [None]:
all_meta = all_meta[all_meta.date .<= Date(2022, 8, 19), :]

## Compute station-level statistics

For each station, make sure that freeway, direction, and number of lanes never change across the retained metadata files, and that the largest distance between any two of its recorded locations is less than 100 meters.
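
For intuition about the 100 meter threshold, here is a dependency-free sketch of the location check. It is not what the notebook runs: it swaps Geodesy's `euclidean_distance` for a haversine great-circle distance on a sphere of assumed radius 6,371 km, which should differ only negligibly at this scale. `haversine_m` and `max_pairwise_shift` are hypothetical names, and the coordinates are made up.

```julia
# Great-circle distance in meters between two lat/lon points (degrees).
# Assumption: spherical Earth with mean radius 6_371_000 m.
function haversine_m(lat1, lon1, lat2, lon2; radius = 6_371_000.0)
    φ1, φ2 = deg2rad(lat1), deg2rad(lat2)
    Δφ = φ2 - φ1
    Δλ = deg2rad(lon2 - lon1)
    a = sin(Δφ / 2)^2 + cos(φ1) * cos(φ2) * sin(Δλ / 2)^2
    return 2 * radius * asin(min(1.0, sqrt(a)))
end

# Largest distance between any two complete (lat, lon) records.
# Records with either coordinate missing are ignored.
function max_pairwise_shift(lats, lons)
    pts = [(lat, lon) for (lat, lon) in zip(lats, lons)
           if !ismissing(lat) && !ismissing(lon)]
    worst = 0.0
    for i in eachindex(pts), j in (i + 1):lastindex(pts)
        worst = max(worst, haversine_m(pts[i]..., pts[j]...))
    end
    return worst
end

# Made-up records: 0.001 degrees of latitude is roughly 111 m,
# so this station would fail a 100 m threshold.
shift = max_pairwise_shift([34.0, missing, 34.001], [-118.0, -118.0, -118.0])
shift < 100
```

A station with only one complete record gets a shift of 0.0 and passes the location check trivially.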

In [None]:
function max_shift(lats, lons)
    @assert length(lats) == length(lons)
    # largest pairwise distance (meters) between any two recorded locations
    max_dist = 0.0
    for i in 1:length(lats)
        # skip records where either coordinate is missing
        if ismissing(lats[i]) || ismissing(lons[i]) continue end
        pos_i = LLA(lats[i], lons[i], 0)
        # distance is symmetric, so only compare with later records
        for j in (i + 1):length(lons)
            if ismissing(lats[j]) || ismissing(lons[j]) continue end
            pos_j = LLA(lats[j], lons[j], 0)
            dist = euclidean_distance(pos_i, pos_j)
            if dist > max_dist
                max_dist = dist
            end
        end
    end
    return max_dist
end

last_nonmissing(x) = first(skipmissing(reverse(x)))

function last_nonmissing(lats, lons)
    for i in length(lats):-1:1
        if !ismissing(lats[i]) && !ismissing(lons[i])
            return (Latitude=lats[i], Longitude=lons[i])
        end
    end
    return (Latitude=missing, Longitude=missing)
end
    
station_stats = combine(groupby(all_meta, :ID),
    :Fwy => (x -> length(unique(x)) == 1) => :fwy_stable,
    :Dir => (x -> length(unique(x)) == 1) => :dir_stable,
    :Lanes => (x -> length(unique(x)) == 1) => :lanes_stable,
    [:Latitude, :Longitude] => max_shift => :max_shift_meters,
    # Save representative values so we have them for every sensor.
    # This file will be used to look up sensor lat/lons in the final dataset.
    # A sensor may be absent from any single metadata file, so take the last
    # non-missing value across all retained files.
    [:Latitude, :Longitude] => last_nonmissing => [:Latitude, :Longitude],
    :Fwy => last_nonmissing => :Fwy,
    :Dir => last_nonmissing => :Dir,
    :District => last_nonmissing => :District,
    :Lanes => last_nonmissing => :Lanes,
    :County => last_nonmissing => :County
)

In [None]:
mean(station_stats.fwy_stable)

In [None]:
mean(station_stats.dir_stable)

In [None]:
mean(station_stats.lanes_stable)

In [None]:
mean(station_stats.max_shift_meters .< 100)

In [None]:
mean(
    station_stats.fwy_stable .&
    station_stats.dir_stable .&
    station_stats.lanes_stable .&
    (station_stats.max_shift_meters .< 100)
    )

In [None]:
station_stats.ID[ismissing.(station_stats.Latitude)]

## Extract metadata for good sensors

This will be used to filter the sensor data to exclude the sensors that are unstable.

In [None]:
good_sensor_meta = station_stats[station_stats.fwy_stable .&
    station_stats.dir_stable .&
    station_stats.lanes_stable .&
    (station_stats.max_shift_meters .< 100) .&
    (.!ismissing.(station_stats.Latitude)), :]

In [None]:
CSV.write("../data/good_sensors.csv", good_sensor_meta)
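
Downstream, the filtering step might look roughly like the sketch below. Everything here is a stand-in: in practice `good_ids` would come from the `ID` column of `good_sensors.csv`, and the `station` and `flow` field names and all values are invented for illustration rather than taken from the real sensor data.

```julia
# Hypothetical set of stable sensor IDs (would be read from good_sensors.csv)
good_ids = Set([400001, 400002])

# Hypothetical sensor observations; field names and values are made up
observations = [
    (station = 400001, flow = 120),
    (station = 400003, flow = 95),
    (station = 400002, flow = 143),
]

# Keep only observations from sensors that passed the stability checks
kept = filter(obs -> obs.station in good_ids, observations)
```

Using a `Set` keeps each membership test constant-time, which matters when the sensor data has many more rows than there are sensors.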

In [None]:
all_meta[all_meta.ID .== 415657 .&& all_meta.date .> Date(2022,1,1), [:Lanes, :Dir]]