# Ignore daily correlation. Focus on single-directional travel, and model dispersion of bikes across hours.

1. Given the day of the week, how many bikes do I expect to see?
2. How would these bikes be distributed over the course of the day?

## Findings

- Conjugate priors help sampling a lot
- Sampling the daily count first, then the hourly counts from that total, works
- Checked the correlation between 8am and 9am counts: it is much stronger than when we modelled with Gaussian processes, though the real data is still much more strongly correlated.

## Next steps

- Real data 
- Further simulation with weather data

## Weather factors

We want to allow the daily count to change w.r.t. the weather independently of the hourly patterns. For example, the weather may reduce the total number of bikes but not change when people bike (cold temperature), or it may change the hourly distribution with limited impact on daily counts (a very short period of heavy rain).

I also suspect that the morning and evening patterns could differ. For example, if the weather turns bad at midday, people may still cycle in but not cycle home. But if the weather starts bad, people may leave their bikes at home altogether.

- Daily count
    - Gamma regression (log link), linear model of weather factors
    - Include cross-effects, e.g. windy + rainy, rainy + cold
    - Include "weather in the morning", since this may be a good predictor of daily travel
    - Don't worry about whole-day weather. Can likely ignore anything outside of the 6am-7pm range
- Hourly count
    - We have our Dirichlet level-one distribution, from which we sample level-two distributions for each day separately
    - Add intermediary step which modifies the level-one distribution in line with the weather effects of the day
        - Linear model of weather data produces per-hour parameters (24) for each day
        - (Send these parameters through a logit link)?
        - Multiply the level-one distribution by the parameters, and rescale so sum = 1
        - Treat the result as the new level-one distribution for that day
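
The hourly adjustment steps above can be sketched in plain NumPy. This is a minimal illustration rather than the planned model: the rain feature, the intercept, and the coefficient `beta_rain` are hypothetical values, and the inverse-logit is only one candidate for the link questioned above.

```python
import numpy as np

# Level-one hourly distribution (same shape as used in the simulation in this notebook)
base = np.array([1, 1, 0.5, 0.5, 0.5, 2, 8, 26, 30, 28, 18, 11,
                 12, 8, 6, 6, 12, 16, 15, 8, 4, 2, 1, 1], dtype=float)
base /= base.sum()

# Hypothetical weather feature: rainfall (mm) per hour, heavy rain from 4pm to 6pm
rain = np.zeros(24)
rain[16:19] = 4.0

# Hypothetical linear model producing one parameter per hour
intercept = 2.0
beta_rain = -0.5
eta = intercept + beta_rain * rain

# Inverse-logit maps the linear predictor to a multiplier in (0, 1)
multiplier = 1.0 / (1.0 + np.exp(-eta))

# Multiply the level-one distribution by the multipliers and rescale so sum = 1
adjusted = base * multiplier
adjusted /= adjusted.sum()

print(f"share at 5pm: {base[17]:.3f} -> {adjusted[17]:.3f}")
print(f"share at 8am: {base[8]:.3f} -> {adjusted[8]:.3f}")
```

Because the result is renormalised, only the relative size of the multipliers matters. A uniform drop across all hours leaves the hourly shape unchanged, so it would have to enter through the daily count model instead.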

In [None]:
import datetime

import matplotlib.pyplot as plt
import seaborn as sns

import polars as pl
import numpy as np
import scipy

import pymc as pm
import arviz as az

import warnings

sns.set(rc={'figure.figsize':(17,11)})

warnings.filterwarnings("ignore")

## Simulate data

1. Number of bikes per day
2. Distribution over the course of the day

Focusing on weekdays for now, assuming the same distribution on each day.
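
Written out, the generative process used in the simulation cell is the following, where $d$ indexes days, $\boldsymbol{\pi}$ is the normalised `daily_dist`, $c$ is `conc`, and $\mu$, $\alpha$ are `mu_daily` and `alpha_daily` (the symbols are introduced here for illustration):

$$
\begin{aligned}
N_d &\sim \mathrm{NegBin}(\mu, \alpha), & \mathrm{Var}(N_d) &= \mu + \frac{\mu^2}{\alpha} \\
\boldsymbol{p}_d &\sim \mathrm{Dirichlet}(c\,\boldsymbol{\pi}) \\
\boldsymbol{y}_d \mid N_d, \boldsymbol{p}_d &\sim \mathrm{Multinomial}(N_d, \boldsymbol{p}_d)
\end{aligned}
$$

With $\mu = 300$ and $\alpha = 8$ the daily total has variance $300 + 300^2 / 8 = 11550$, a standard deviation of roughly 107 bikes.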

In [None]:
days = 1000

hours = [str(i) for i in range(24)]
day_ids = np.arange(days)

# day-level
alpha_daily = 8
mu_daily = 300

# hour-level
daily_dist = np.array([1, 1, 0.5, 0.5, 0.5, 2, 8, 26, 30, 28, 18, 11, 12, 8, 6, 6, 12, 16, 15, 8, 4, 2, 1, 1])
daily_dist = daily_dist / daily_dist.sum()
conc = 300

# sample
daily_count = pm.NegativeBinomial.dist(alpha=alpha_daily, mu=mu_daily, size=days).eval()
day_dist_samples = pm.Dirichlet.dist(conc * daily_dist, size=days).eval()
observed_counts = pm.Multinomial.dist(n=daily_count, p=day_dist_samples).eval()



In [None]:
mean_count = observed_counts.mean(axis=0)
std_count = observed_counts.std(axis=0)

plt.plot(np.arange(24), mean_count, label="Mean")
plt.fill_between(np.arange(24), (mean_count - std_count).clip(min=0), mean_count + std_count, alpha=0.3, label="1 Stdev")
plt.title("Simulated hourly count")
plt.legend()
plt.show()

## Model simulated data (just daily counts)

This is a submodel of the later model. It treats the daily total count as a draw from a negative binomial (gamma-Poisson) distribution.

Priors are placed on the log scale: `log_mu_d` and `log_alpha_d` are normally distributed and then exponentiated, with hyperparameters chosen to reflect the range of believable values.
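
To see what these priors imply on the natural scale, the central 95% prior intervals can be computed by exponentiating the normal quantiles. A small sketch using only the hyperparameters from the model cell that follows; the helper name `lognormal_interval` is introduced here for illustration.

```python
import math


def lognormal_interval(mu, sigma, z=1.96):
    """Lower bound, median, and upper bound of exp(X) for X ~ Normal(mu, sigma)."""
    return math.exp(mu - z * sigma), math.exp(mu), math.exp(mu + z * sigma)


# Hyperparameters of log_mu_d and log_alpha_d
mu_lo, mu_med, mu_hi = lognormal_interval(5.65, 0.525)
alpha_lo, alpha_med, alpha_hi = lognormal_interval(1.875, 0.35)

print(f"mu_d:    {mu_lo:.0f} to {mu_hi:.0f} (median {mu_med:.0f})")
print(f"alpha_d: {alpha_lo:.1f} to {alpha_hi:.1f} (median {alpha_med:.1f})")
```

The simulated values (`mu_daily = 300`, `alpha_daily = 8`) both fall inside these intervals.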

In [None]:
coords = {"days": day_ids}

with pm.Model(coords=coords) as m_daily:
    log_mu_d = pm.Normal("log_mu_d", mu=5.65, sigma=0.525)
    mu_d = pm.Deterministic("mu_d", np.exp(log_mu_d))
    
    log_alpha_d = pm.Normal("log_alpha_d", 1.875, 0.35)
    alpha_d = pm.Deterministic("alpha_d", np.exp(log_alpha_d))
    
    day_count = pm.NegativeBinomial("day_count", alpha=alpha_d, mu=mu_d, observed=observed_counts.sum(axis=1))
    
    trace_daily = pm.sample_prior_predictive(1000)
    trace_daily.extend(pm.sample(5000, tune=3000))
    trace_daily.extend(pm.sample_posterior_predictive(trace_daily))

In [None]:
pm.model_to_graphviz(m_daily)

In [None]:
trace_daily

**Observed (0) vs prior predictive (1)**

In [None]:
sns.displot([daily_count, trace_daily.prior_predictive.day_count.to_numpy()[0,:,0]]);

## Viewing trace results

In [None]:
az.summary(trace_daily)

In [None]:
az.plot_trace(trace_daily)
plt.tight_layout()

**Observed (0) vs posterior predictive (1)**

In [None]:
sns.displot(np.array([daily_count, trace_daily.posterior_predictive.day_count.to_numpy()[0,:days,0]]).T);

## Model simulated data (with hourly counts)

Swaps the daily count submodel's parameter priors for conjugate priors to fix sampling issues. Defines a multi-level Dirichlet distribution over the hours of the day, sampling a unique hourly distribution for each observed day.
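
Note that this model uses PyMC's `(n, p)` parameterisation of the negative binomial rather than `(mu, alpha)` as in the daily-only model. The two are related by `n = alpha` and `p = alpha / (mu + alpha)`. The sketch below converts between them so the two sets of priors can be compared; the helper names are illustrative and not part of PyMC.

```python
def mu_alpha_to_n_p(mu, alpha):
    """Convert the (mu, alpha) negative binomial parameters to (n, p)."""
    return alpha, alpha / (mu + alpha)


def n_p_to_mu_alpha(n, p):
    """Convert the (n, p) negative binomial parameters to (mu, alpha)."""
    return n * (1 - p) / p, n


# Simulated values from the data-generation cell
n, p = mu_alpha_to_n_p(300, 8)
mu, alpha = n_p_to_mu_alpha(n, p)  # round trip back to (300, 8)

# Values implied at the prior means: Gamma(20, 0.05) has mean 20 / 0.05,
# Beta(10, 10) has mean 0.5. This is not the prior mean of mu itself.
prior_mu, prior_alpha = n_p_to_mu_alpha(20 / 0.05, 0.5)

print(f"n={n}, p={p:.4f}")
print(f"at prior means: mu={prior_mu:.0f}, alpha={prior_alpha:.0f}")
```

At the prior means this corresponds to a daily mean of 400 with `alpha = 400`, compared with the simulated values of 300 and 8.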

In [None]:
coords = {"hour": hours}

with pm.Model(coords=coords) as m_hourly:
    observations = pm.MutableData("observations", observed_counts, dims=("day", "hour"))
    
    n_d = pm.Gamma("n_d", alpha=20, beta=0.05)
    p_d = pm.Beta("p_d", alpha=10, beta=10)
    day_count = pm.NegativeBinomial("day_count", n=n_d, p=p_d, observed=observations.sum(axis=1), dims="day")
    
    # Sample hourly counts from daily
    frac = pm.Dirichlet("frac", a=np.ones(24), dims="hour")
    conc = pm.LogNormal("conc", mu=1, sigma=1)
    hour_count = pm.DirichletMultinomial("hour_count", n=day_count, a=(frac*conc), observed=observations, dims=("day", "hour"))
    
    trace_hourly = pm.sample_prior_predictive(1000)
    trace_hourly.extend(pm.sample(5000, tune=3000))
    trace_hourly.extend(pm.sample_posterior_predictive(trace_hourly))

In [None]:
pm.model_to_graphviz(m_hourly)

## Model 2 trace

In [None]:
trace_hourly

In [None]:
az.plot_trace(trace_hourly)
plt.tight_layout()

In [None]:
az.plot_trace(trace_hourly, compact=False, var_names="frac")
plt.tight_layout()

In [None]:
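az.plot_forest(trace_hourly, var_names=["frac"])
# Overlay the true simulated fractions; the forest rows run top to bottom while
# the y-ticks run bottom to top, hence the reversed order.
for y_tick, frac_j in zip(plt.gca().get_yticks(), reversed(daily_dist)):
    plt.vlines(frac_j, ymin=y_tick - 0.45, ymax=y_tick + 0.45, color="black", linestyle="--")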

In [None]:
cmap = plt.get_cmap("tab10")

fig, axs = plt.subplots(24, 1, sharex=True, sharey=True, figsize=(10, 30))
for j, ax in enumerate(axs):
    c = cmap(j % cmap.N)  # tab10 has only 10 colours, so cycle through them
    ax.hist(
        trace_hourly.posterior_predictive.hour_count.sel(hour=hours[j]).values.flatten(),
        bins=np.arange(75),
        histtype="step",
        color=c,
        density=True,
        label="Post.Pred.",
    )
    ax.hist(
        (trace_hourly.observed_data.hour_count.sel(hour=hours[j]).values.flatten()),
        bins=np.arange(75),
        color=c,
        density=True,
        alpha=0.25,
        label="Observed",
    )
    ax.annotate(
        f"{hours[j]}",
        xy=(0.96, 0.9),
        xycoords="axes fraction",
        ha="right",
        va="top",
        color=c,
    )

axs[-1].legend(loc="upper center", fontsize=10)
axs[-1].set_xlabel("Count")
axs[-1].set_yticks([0, 0.5, 1.0])
axs[-1].set_ylim(0, 0.6)
plt.tight_layout();

## Generating samples for real data

Data taken from the 2022 Thorndon Quay outgoing counts on weekdays (Monday to Friday).

Results show that the model fits the general distribution well but rarely generates the very low counts seen in the data, likely due to the absence of predictors in the model (e.g. weather effects).

**Load data**

In [None]:
real_data_df = (
    pl.read_parquet(
        "../data/counter_data.parquet"
    ).with_columns(
        pl.col("weekday") - 1,
        pl.col("count_outgoing").fill_null(0),
        (pl.col("record_time").dt.date() - datetime.date(year=2022, month=1, day=1)).dt.days().alias("day_in_year")
    ).filter(
        (pl.col("year") == 2022) &
        (pl.col("site_name") == "Thorndon Quay") &
        (pl.col("weekday") < 5)
    )
)


In [None]:
real_observations = []

for i,group in real_data_df.group_by(pl.col("day_in_year"), maintain_order=True):
    real_observations.append(group.select(pl.col("count_outgoing")).to_numpy().flatten())
    
real_observations = np.array(real_observations)

In [None]:
with m_hourly:
    pm.set_data(
        {"observations": real_observations}
    )
    real_trace = pm.sample(5000, tune=3000)
    real_trace.extend(pm.sample_posterior_predictive(real_trace))

In [None]:
az.plot_trace(real_trace)
plt.tight_layout()

## Inspecting similarity of 8am/9am correlation with observed

This was very poor in the GP model, where there was no significant correlation between the two. In this model the correlation is apparent and is visually similar to the observed.

Note that the observation values go right down to near-zero, but posterior predictions don't. This is likely due to effects of weather, holidays, and sensor error which are not currently modelled.
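
The visual comparison can be backed by a number. The sketch below computes the Pearson correlation between two hour columns of a `(day, hour)` count array. It runs on synthetic counts drawn with NumPy, using the simulation's hourly shape and daily-count parameters but without the per-day Dirichlet step, not on the Thorndon Quay data. `hour_correlation` is a hypothetical helper.

```python
import numpy as np


def hour_correlation(counts, hour_a, hour_b):
    """Pearson correlation between two hour columns of a (day, hour) array."""
    return np.corrcoef(counts[:, hour_a], counts[:, hour_b])[0, 1]


rng = np.random.default_rng(0)

# Hourly shape from the simulation section, normalised to sum to 1
shape = np.array([1, 1, 0.5, 0.5, 0.5, 2, 8, 26, 30, 28, 18, 11,
                  12, 8, 6, 6, 12, 16, 15, 8, 4, 2, 1, 1], dtype=float)
shape /= shape.sum()

# Daily totals: negative binomial with mean 300 and alpha 8,
# expressed as n = alpha, p = alpha / (mu + alpha)
totals = rng.negative_binomial(8, 8 / (300 + 8), size=500)

# Split each daily total across the 24 hours
counts = np.array([rng.multinomial(n, shape) for n in totals])

r = hour_correlation(counts, 8, 9)
print(f"8am vs 9am correlation: {r:.2f}")
```

The same function could be applied to `real_observations` and to a single posterior predictive draw arranged as `(day, hour)` to compare the two correlations directly.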

In [None]:
plt.scatter(
    real_trace.posterior_predictive.hour_count.sel(hour="8").to_numpy()[0,:260,0],
    real_trace.posterior_predictive.hour_count.sel(hour="9").to_numpy()[0,:260,0],
    s=100,
    alpha=0.8,
    label="Posterior prediction",
)
plt.scatter(
    real_trace.observed_data.hour_count.sel(hour="8").to_numpy(),
    real_trace.observed_data.hour_count.sel(hour="9").to_numpy(),
    s=100,
    alpha=0.8,
    label="Observed",
)

plt.xlabel("Count at 8am")
plt.ylabel("Count at 9am")

plt.xlim(0,225)
plt.ylim(0,250)
plt.legend()
plt.title("8am vs 9am counts, observations and posterior predictions");