In [1]:
from pathlib import Path

import polars as pl

# Research Question: What is the distribution of vehicle model years in the target population?

Note that the target population includes vehicles that are driven in Utah County but not registered there.  Our estimate therefore adds information beyond what is publicly available from government registration records.

## Strategy
New vehicles are still being sold for the 2024, 2025, and 2026 model years, but not for model year 2023.  Our registration data covers vehicles registered from 2024 through February 17, 2025.  Registrations of the newer model years made between February 2025 and March 2025 may therefore be missing from our dataset.  We adjust the registration counts for these newer model years using the count for model year 2023 as a reference.

Assume that if the registration data extended to February 17, 2026, there would be as many registrations expiring for model year 2024 as there currently are for model year 2023.  The proportion of this period elapsed at the time of data collection is approximately 1/12.

Assume that if the registration data extended to February 17, 2027, there would be as many registrations expiring for model year 2025 as there currently are for model year 2023.  The proportion of this period elapsed at the time of data collection is approximately 1/24.

Assume that if the registration data extended to February 17, 2028, there would be as many registrations expiring for model year 2026 as there currently are for model year 2023.  The proportion of this period elapsed at the time of data collection is approximately 1/36.
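
One way to write the adjustment implied by these three assumptions, consistent with the calculation applied to the counts in the code cells of this notebook, is shown here.  The symbols are introduced only for illustration: $n_y$ is the observed registration count for model year $y$, and $\hat{n}_y$ is the adjusted count.

$$
\hat{n}_y = n_y + \frac{n_{2023} - n_y}{12\,(y - 2023)}, \qquad y \in \{2024, 2025, 2026\}
$$

For example, with $n_{2023} = 31266$ and $n_{2024} = 27037$, the adjusted count for model year 2024 is $27037 + 4229/12 \approx 27389$ after truncating to an integer.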

Use the registration counts as the concentration parameters of a Dirichlet prior.  Because the Dirichlet distribution is [conjugate to the categorical and multinomial distributions](https://en.wikipedia.org/wiki/Dirichlet_distribution#Conjugate_to_categorical_or_multinomial), these concentration parameters act as pseudocounts that are added to our observed counts.  The summed counts are then the concentration parameters of the posterior Dirichlet distribution of vehicle model years in Utah County.

In [2]:
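# Path to the raw registration counts by model year.
source = Path("..", "raw_data", "registrations", "registrations.csv")
reg = pl.scan_csv(source=source)

reg = (reg
    .with_columns(
        # The counts are read as strings containing commas;
        # strip the commas before casting to integers.
        pl.col("num_registrations").str.replace_all(",", "").cast(pl.Int64).alias("num_registrations")
    )
    # Materialize once, then return to a lazy frame for the later cells.
    .collect()
    .lazy()
)

reg.collect().tail()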

model_year,num_registrations
i64,i64
2022,30312
2023,31266
2024,27037
2025,5830
2026,8


In [6]:
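# Reference count: registrations for model year 2023, the most recent
# model year that is no longer sold new.
reg_2023 = (reg
    .filter(pl.col("model_year") == 2023)
    .select("num_registrations")
    .collect()
    .item()
)

reg_2 = (reg
    .with_columns(
        # For model years after 2023, add a fraction of the gap to the 2023 count:
        # 1/12 for 2024, 1/24 for 2025, and 1/36 for 2026.
        pl.when(pl.col("model_year") > 2023)
        .then(pl.col("num_registrations") + (reg_2023 - pl.col("num_registrations")) / (12 * (pl.col("model_year") - 2023)))
        .otherwise(pl.col("num_registrations"))
        # The division yields floats; casting truncates them back to integers.
        .cast(pl.Int64)
        .alias("num_registrations")
    )
)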

reg_2.tail().collect()

model_year,num_registrations
i64,i64
2022,30312
2023,31266
2024,27389
2025,6889
2026,876
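
The final step of the strategy, adding pseudocounts to observed counts, could be sketched as follows.  This is a minimal illustration rather than the analysis itself: it uses only the last five adjusted counts shown above, and the `observed_counts` values are hypothetical placeholders, not real survey data.  The variable names are likewise assumptions made for this example.

```python
import numpy as np

# Adjusted registration counts for model years 2022-2026 (taken from the
# table above); these serve as the Dirichlet prior concentration parameters.
model_years = np.array([2022, 2023, 2024, 2025, 2026])
prior_alpha = np.array([30312, 31266, 27389, 6889, 876], dtype=float)

# HYPOTHETICAL observed vehicle counts per model year (placeholder values).
observed_counts = np.array([12, 15, 10, 3, 0], dtype=float)

# Conjugate update: posterior concentration = prior pseudocounts + observed counts.
posterior_alpha = prior_alpha + observed_counts

# Posterior mean of the model-year proportions.
posterior_mean = posterior_alpha / posterior_alpha.sum()

# Draw samples from the posterior to express uncertainty in the proportions.
rng = np.random.default_rng(seed=0)
samples = rng.dirichlet(posterior_alpha, size=1000)

for year, share in zip(model_years, posterior_mean):
    print(f"{year}: {share:.4f}")
```

With prior pseudocounts this large, a small number of observations barely moves the posterior, so the relative weight given to the registration counts is a modeling choice worth checking.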
