**Conditional on weekday/weekend, how does the distribution of counts change over the years?**

How do we model the bike count given:
- Day of the week
- Month of the year
- Year
- Hourly counts

Monthly count is the sum of the sampled daily counts, each conditional on the month and whether the day is a weekday or a weekend.

Either:
- Daily count is the sum of the sampled hourly counts, where the hourly counts are conditional on the month and whether the day is a weekday or a weekend, or
- Daily count is the sum of the hourly counts, conditional on the proportion of null count hours
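The first option (hourly -> daily -> monthly) can be sketched as a small simulation. This is a minimal sketch, not the model itself: the rates in `hourly_rate` are made-up placeholders, and `sample_poisson`, `sample_daily_count` and `sample_monthly_counts` are hypothetical helper names.

```python
import calendar
import datetime
import math
import random


def hourly_rate(hour: int, month: int, is_weekend: bool) -> float:
    """Placeholder rate (bikes per hour); the numbers are assumptions, not fitted."""
    seasonal = 1.2 if month in (12, 1, 2) else 1.0
    if is_weekend:
        base = 6.0 if 9 <= hour <= 17 else 1.0
    else:
        base = 40.0 if hour in (7, 8, 16, 17) else 4.0
    return seasonal * base


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_daily_count(day: datetime.date, rng: random.Random) -> int:
    """Daily count = sum of 24 sampled hourly counts."""
    is_weekend = day.weekday() >= 5  # Python: Monday=0 ... Sunday=6
    return sum(
        sample_poisson(hourly_rate(hour, day.month, is_weekend), rng)
        for hour in range(24)
    )


def sample_monthly_counts(year: int, month: int, rng: random.Random) -> list[int]:
    """One sampled daily count per calendar day of the month."""
    n_days = calendar.monthrange(year, month)[1]
    return [
        sample_daily_count(datetime.date(year, month, d), rng)
        for d in range(1, n_days + 1)
    ]


rng = random.Random(0)
daily = sample_monthly_counts(2023, 5, rng)
monthly_total = sum(daily)  # monthly count = sum of the sampled daily counts
print(len(daily), monthly_total)
```

Repeating the last three lines many times would give a simulated distribution of monthly totals to compare against the observed ones.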


Determine whether there's a general trend amongst (for example) Mondays in May at 8am. If there is, we'd likely see that there's a general grouping around a non-zero figure, and then a cluster of zeros where the null values are.

- For Monday 7am in May, there is a big difference between pre-2022 and 2023: counts jump from 0-10 to 35-45.
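A check like this could be sketched per slice (weekday, month, hour), reporting by year how many readings are zero and where the non-zero ones sit. The records below are invented purely for illustration, and `summarise_slice` is a hypothetical helper, not part of `bimodal`.

```python
import datetime
import statistics
from collections import defaultdict

# Made-up (timestamp, count) records; None marks a null reading.
records = [
    (datetime.datetime(2021, 5, 3, 8), 6),
    (datetime.datetime(2021, 5, 10, 8), 0),
    (datetime.datetime(2021, 5, 17, 8), 9),
    (datetime.datetime(2021, 5, 24, 8), None),
    (datetime.datetime(2023, 5, 1, 8), 38),
    (datetime.datetime(2023, 5, 8, 8), 0),
    (datetime.datetime(2023, 5, 15, 8), 44),
    (datetime.datetime(2023, 5, 16, 8), 50),  # a Tuesday, excluded by the filter
]


def summarise_slice(records, weekday, month, hour):
    """Per-year zero share and median of the non-zero counts for one slice."""
    by_year = defaultdict(list)
    for ts, count in records:
        if count is None:
            continue  # skip null readings
        if ts.weekday() == weekday and ts.month == month and ts.hour == hour:
            by_year[ts.year].append(count)

    summary = {}
    for year, counts in sorted(by_year.items()):
        nonzero = [c for c in counts if c > 0]
        summary[year] = {
            "n": len(counts),
            "zero_share": (len(counts) - len(nonzero)) / len(counts),
            "median_nonzero": statistics.median(nonzero) if nonzero else None,
        }
    return summary


# Python's datetime uses Monday=0, unlike polars' dt.weekday() (Monday=1).
summary = summarise_slice(records, weekday=0, month=5, hour=8)
for year, stats in summary.items():
    print(year, stats)
```

A shift in `median_nonzero` between years with a stable `zero_share` would suggest a real change in level rather than a change in how often the counter reports nothing.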

So perhaps we could model per-hour counts first. I suspect these follow a Poisson distribution. We may then build this into a hierarchical model, using the per-hour counts to drive the daily -> monthly models.

Per-hour counts depend on:
- Year
- Time of year
- Day of week

Split quite broadly:
- Weekday vs. weekend
- Month
- Time of day (hour)
- Year

The per-hour count then depends on:
- The hour
- Whether or not it's the weekend
- The month
- The year

Model this, ignoring any null rows. 
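As a first, non-hierarchical pass, each (year, month, weekend, hour) cell could get its own Poisson rate. The maximum-likelihood estimate of a Poisson rate is the sample mean, so the sketch below just averages the non-null counts per cell. The records are invented and `fit_cell_rates` is a hypothetical name.

```python
import datetime
from collections import defaultdict

# Made-up (timestamp, count) records; None marks a null row.
records = [
    (datetime.datetime(2023, 5, 1, 8), 40),   # Monday
    (datetime.datetime(2023, 5, 2, 8), 44),   # Tuesday
    (datetime.datetime(2023, 5, 3, 8), None),  # null row, ignored
    (datetime.datetime(2023, 5, 6, 8), 5),    # Saturday
    (datetime.datetime(2023, 5, 7, 8), 7),    # Sunday
]


def fit_cell_rates(records):
    """Poisson MLE (the mean count) per (year, month, is_weekend, hour) cell."""
    cells = defaultdict(list)
    for ts, count in records:
        if count is None:
            continue  # ignore null rows
        is_weekend = ts.weekday() >= 5  # Python: Monday=0 ... Sunday=6
        cells[(ts.year, ts.month, is_weekend, ts.hour)].append(count)
    return {key: sum(counts) / len(counts) for key, counts in cells.items()}


rates = fit_cell_rates(records)
for key, lam in sorted(rates.items()):
    print(key, lam)
```

Cells with few observations will have noisy estimates, which is one motivation for pooling them through the hierarchical model.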

**Model plan**

Hourly count is sampled from a Poisson distribution with rate lambda.

log(lambda) = mx

m ~ Normal or something. This is the "at best conditions" count (across all hours/months/weather)

m may itself be a linear model, conditional on the year. This could include a step factor for infrastructure improvements.
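One possible shape for such a yearly model, as a minimal sketch: a linear trend plus a one-off step from a given year onwards. The function name, `base_year`, `step_year` and all parameter values are assumptions for illustration.

```python
def best_conditions_level(
    year: int,
    intercept: float,
    slope: float,
    step: float,
    step_year: int,
    base_year: int = 2019,
) -> float:
    """Linear trend in year, plus `step` added for every year >= step_year."""
    trend = intercept + slope * (year - base_year)
    return trend + (step if year >= step_year else 0.0)


# Hypothetical parameters: gentle growth, with a jump when infrastructure opens.
params = dict(intercept=3.0, slope=0.05, step=0.8, step_year=2022)
levels = {year: best_conditions_level(year, **params) for year in range(2019, 2024)}
for year, level in levels.items():
    print(year, round(level, 2))
```

The year-on-year change is `slope` everywhere except across `step_year`, where it is `slope + step`.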

log(x) = a + b. x should be between 0 and 1, so a + b should be at most 0.

a is the "time of day factor". Depends on the time of day. Should be high at peak hours.

b is the "time of year factor". Month or season. High at peak months, low at low months.

---

Must also include the weekday/weekend variable. This affects m and a. May also affect b.
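Putting the pieces of the plan together, the rate for one hour could be computed as below. This follows the two equations exactly as written (log(lambda) = m * x, log(x) = a + b), with m and a depending on weekday/weekend. Every numeric value and function name here is a placeholder assumption, not an estimate.

```python
import math

# Hypothetical values for illustration; none are fitted to the counter data.
# m is keyed by is_weekend; exp(m) is the rate when x = 1 ("best conditions").
M = {False: math.log(40.0), True: math.log(10.0)}


def time_of_day_factor(hour: int, is_weekend: bool) -> float:
    """a: 0 at peak hours, negative elsewhere; the peak hours differ on weekends."""
    peak_hours = (10, 11, 12, 13, 14) if is_weekend else (7, 8, 16, 17)
    return 0.0 if hour in peak_hours else -1.0


def time_of_year_factor(month: int) -> float:
    """b: 0 in the assumed peak months, negative in the rest."""
    return 0.0 if month in (1, 2, 3) else -0.3


def hourly_lambda(hour: int, month: int, is_weekend: bool) -> float:
    a = time_of_day_factor(hour, is_weekend)
    b = time_of_year_factor(month)
    x = math.exp(a + b)                 # log(x) = a + b, so 0 < x <= 1
    return math.exp(M[is_weekend] * x)  # log(lambda) = m * x


peak = hourly_lambda(hour=8, month=2, is_weekend=False)
off_peak = hourly_lambda(hour=3, month=7, is_weekend=False)
weekend_peak = hourly_lambda(hour=12, month=2, is_weekend=True)
print(peak, off_peak, weekend_peak)
```

One property worth noting: with m positive and x in (0, 1], m * x stays positive, so lambda as written can never drop below 1, even at 3am in the quietest month.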

In [1]:
24 * 2 * 12 * 6

3456

In [2]:
import warnings
warnings.filterwarnings("ignore", "is_categorical_dtype")
warnings.filterwarnings("ignore", "use_inf_as_na")

In [3]:
import polars as pl
import sqlite3
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

from bimodal.cli import YEAR_FILES, load_and_clean_raw

ImportError: cannot import name 'YEAR_FILES' from 'bimodal.cli' (/Users/henry/Documents/personal/bimodal/python/src/bimodal/cli.py)

In [None]:
RAW_DATA_PATH = Path("../data/raw/counter/")

In [None]:
data = load_and_clean_raw([RAW_DATA_PATH / yf for yf in YEAR_FILES.values()])

In [None]:
x = (
    data.with_columns(
        pl.col("record_time").str.to_datetime() + datetime.timedelta(hours=1)
    )
    # .filter(
    #     (pl.col("site_name") == "Basin Reserve") &
    #     # (pl.col("record_time").dt.month() > 5) &
    #     (pl.col("record_time").dt.month() < 9) &
    #     # (pl.col("record_time").dt.weekday() == 1) &
    #     (pl.col("record_time").dt.hour() == 10)
    # )
    .with_columns(
        pl.col("record_time").dt.year().alias("year"),
        pl.col("record_time").dt.month().alias("month"),
        pl.col("record_time").dt.weekday().alias("weekday")
    )
)

In [None]:
x.filter(
    (pl.col("count_incoming").is_not_null()) & 
    (pl.col("count_outgoing").is_not_null())
).sort(by=pl.col("record_time"))

In [None]:
x.filter(pl.col("record_time").dt.year() == 2022)

In [None]:
sns.set_theme()
# sns.set(rc={'figure.figsize':(10,6)})

sns.catplot(
    x.to_pandas(),
    x="year",
    y="count_incoming",
    row="weekday",
    col="month",
    # kind="box",
    height=4,
    aspect=2,
);

In [None]:
x = (
    data.with_columns(
        pl.col("record_time").str.to_datetime()
    )
    .filter((pl.col("site_name") == "Adelaide Road") & (pl.col("record_time").dt.year() > 2018))
    .sort("record_time")
    .group_by_dynamic("record_time", every="1w")  # named groupby_dynamic before polars 0.19
    .agg(
        pl.col("count_incoming").sum(),
        pl.col("count_outgoing").sum(),
    )
    .with_columns(
        pl.col("record_time").dt.year().alias("year"),
        pl.col("record_time").dt.week().alias("week"),
        # pl.col("record_time").dt.day().alias("day"),
        # pl.col("record_time").dt.ordinal_day().alias("ordinal_day"),
        # Abbreviated month name, e.g. "Jan".
        pl.col("record_time").dt.strftime("%b").alias("month"),
        (
            pl.col("record_time").dt.weekday() < 6
        ).alias("weekday")
    )
)

In [None]:
data.with_columns(
    pl.col("record_time").str.to_datetime()
).filter(
    (pl.col("record_time").dt.year() == 2021) &
    (pl.col("record_time").dt.month() == 8) &
    (pl.col("site_name") == "Basin Reserve")
)

In [None]:
c = sns.color_palette("crest", n_colors=6)
type(c)

In [None]:
sns.displot(
    x.filter(
        (pl.col("month") == "Aug") &
        (pl.col("weekday") == True) & 
        (pl.col("year") == 2020)
    ).to_pandas(),
    x="count_incoming",
    binwidth=10
)

In [None]:
x.filter(
        (pl.col("month") == "Aug") &
        (pl.col("weekday") == True) &
        (pl.col("year") == 2021)
    )

In [None]:
sns.set_theme()
sns.set(rc={'figure.figsize':(10,6)})

sns.catplot(
    x.filter(
        (pl.col("month") == "Aug") &
        (pl.col("weekday") == True)
    ).to_pandas(),
    y="count_incoming",
    x="year",
    # hue="day",  # x has no "day" column: it is commented out where x is built
)

In [None]:
sns.set_theme()
sns.set(rc={'figure.figsize':(10,6)})

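g = sns.relplot(
    x.to_pandas(),
    x="week",
    y="count_incoming",
    # hue="year",
    kind="line",
    palette=c,
    height=4,
    aspect=2,
    row="year"
)

# Apply the labels and y-limit to every facet; plt.xlabel/plt.ylim only touch the last one.
g.set_axis_labels("Week", "Count")
g.set(ylim=(0, 2100))
# A figure-level title keeps the per-row "year = ..." facet titles intact.
g.figure.suptitle("Bike count: Adelaide Road (Incoming)", y=1.02)

plt.show()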

In [None]:
sns.set_theme()
sns.set(rc={'figure.figsize':(10,6)})

sns.relplot(
    x.filter(pl.col("count_incoming") > 0).to_pandas(), 
    x="month", 
    y="count_incoming", 
    hue="year", 
    kind="line", 
    palette=c,
    height=4,
    aspect=2,
).set(title="Bike count: Adelaide Road (Incoming)");  # x is filtered to Adelaide Road above

plt.xlabel("Month")
plt.ylabel("Count")
plt.ylim(0,6000)


plt.show();

In [None]:
sns.relplot(x.filter(pl.col("count_outgoing") > 0).to_pandas(), x="month", y="count_outgoing", hue="year", kind="line", palette=c);

In [None]:
plt.plot((
    data.with_columns(
        pl.col("record_time").str.to_datetime()
    )
    .filter(pl.col("site_name") == "Basin Reserve")
    .sort("record_time")
    .group_by_dynamic("record_time", every="1mo")  # named groupby_dynamic before polars 0.19
    .agg(
        pl.col("count_incoming").sum(),
        pl.col("count_outgoing").sum(),
    )
)["count_outgoing"][45:])