# Basic Example of the Pogit Model

This notebook shows you how to fit a Pogit model with one covariate for the event generating process $\lambda$ and one covariate for the reporting rate $p$.

After understanding the basic modeling setup in this notebook, see the `Regularizer-And-Constraint-Demos` notebook for examples of regularization and constraints that can improve the fit to $p$ and $\lambda$. 

See `Road-Injuries-Tutorial` for an example of these methods applied to realistic data with additional covariates, and to see the effect of overdispersion and model misspecification on the model fit. The road injuries tutorial also addresses modeling data where each observation has a different sample size, in which case the model must include an offset term.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from xspline import XSpline
from regmod.data import Data
from regmod.variable import Variable, SplineVariable
from regmod.prior import SplineUniformPrior, SplineGaussianPrior, LinearGaussianPrior
from regmod.models import PogitModel
from regmod.utils import SplineSpecs
from regmod.optimizer import scipy_optimize

In [None]:
# Global plotting parameters
plt.rc('font', size=16) #controls default text size
plt.rc('axes', titlesize=20) #fontsize of the title
plt.rc('axes', labelsize=16) #fontsize of the x and y labels
plt.rc('xtick', labelsize=14) #fontsize of the x tick labels
plt.rc('ytick', labelsize=14) #fontsize of the y tick labels
plt.rc('legend', fontsize=14) #fontsize of the legend

## Generate Data

Generate data according to $\mathrm{logit}(p) = \sin(2\pi x_0)$ and $\lambda = 15 + \exp(\cos(2\pi x_1))$, with $x_0, x_1 \sim \mathrm{Uniform}(0, 1)$.

In this notation, $x_0$ and $x_1$ are covariates. $n$ is the total number of events (observed and unobserved) so that

$$n \sim \mathrm{Poisson}(\lambda)$$

while $y$ is the number of observed events

$$y \sim \mathrm{Binomial}(n, p)$$

Our goal is to use $y$, $x_0$ and $x_1$ to infer the true reporting probability $p$ and true rate $\lambda$.
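
A Poisson count thinned by independent reporting is again Poisson, so the two-stage process above implies $y \sim \mathrm{Poisson}(\lambda p)$ marginally. The short simulation below is a sketch that checks this numerically; the values of `lam` and `p` are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, p = 16.0, 0.6            # illustrative values, not taken from the notebook
num_sim = 200_000

n_sim = rng.poisson(lam, size=num_sim)   # total events (observed and unobserved)
y_sim = rng.binomial(n_sim, p)           # observed events

# For a Poisson(lam * p) variable, the mean and the variance both equal lam * p
print(y_sim.mean(), y_sim.var(), lam * p)
```

Both sample moments should land close to $\lambda p$, which is the quantity the observations inform directly.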

In [None]:
np.random.seed(123)
NUM_OBS = 500

In [None]:
def get_true_p(x):
    return 1.0/(1.0 + np.exp(-np.sin(x*2.0*np.pi)))

def get_true_lam(x):
    return 15.0 + np.exp(np.cos(x*2.0*np.pi))

In [None]:
def generateData():
    x0 = np.random.rand(NUM_OBS)
    x1 = np.random.rand(NUM_OBS)

    true_p = get_true_p(x0)
    true_lam = get_true_lam(x1)
    
    n = np.random.poisson(true_lam)
    y = np.random.binomial(n=n, p=true_p)
    
    return x0, x1, y, n

In [None]:
x0, x1, y, n = generateData()

### Plot the data

Because the model has two covariates, the expected observation cannot be drawn as a single curve against either one.
Instead, we scatter plot the expected value under the true parameters, $\mu = \lambda p$, for each data point alongside the observations.

In [None]:
x = np.linspace(0, 1, 100)
fig, ax = plt.subplots(2, 2, figsize=(10*2, 5*2), sharex=True)
ax[0, 0].plot(x, get_true_p(x), color="#DC143C", linestyle="--")
ax[0, 0].set_ylabel("true p")
ax[0, 0].set_title("Data and Generating Model", loc="left", size=20)

ax[0, 1].plot(x, get_true_lam(x), color="#DC143C", linestyle="--")
ax[0, 1].set_ylabel(r'true $\lambda$')

ax[1, 0].scatter(x0, y, marker=".", color="gray")
ax[1, 0].scatter(x0, get_true_p(x0)*get_true_lam(x1), marker=".", color="#DC143C")
ax[1, 0].set_xlabel("x0")
ax[1, 0].set_ylabel(r'observation $\mu=\lambda p$')

ax[1, 1].scatter(x1, y, marker=".", color="gray")
ax[1, 1].scatter(x1, get_true_p(x0)*get_true_lam(x1), marker=".", color="#DC143C")
ax[1, 1].set_xlabel("x1")
ax[1, 1].set_ylabel(r'observation $\mu=\lambda p$')
plt.show()

## Fit the Pogit model to the observations

To construct the model, we need to create

* a data object
* variables and parameters
* a model object that assembles the information from the data and variables

We then fit the model and use it to predict.

### Data object

* Load the data frame (since the data here are synthetic, we create it)
* Create the data object by passing in the data frame and specifying the corresponding columns; for more details, check the [docstring](https://github.com/ihmeuw-msca/regmod/blob/develop/src/regmod/data.py#L13)

Important columns for the Pogit model are

* `col_obs`: observations in count space
* `col_covs`: covariates used to model $p$ and $\lambda$
* `col_offset`: offset column for the $\lambda$ parameter; `log_population` is a common choice
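
As a rough illustration of the offset idea, the sketch below assumes a log link for $\lambda$, so that the offset is added to the linear predictor. Under that assumption, a `log_population` offset scales a per-person rate by the population size. The data frame `df_demo`, the `population` column, and the value of `eta` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with a per-observation population size
df_demo = pd.DataFrame({"population": [1_000, 5_000, 20_000]})
df_demo["log_population"] = np.log(df_demo["population"])

eta = -6.0  # hypothetical linear predictor (log rate per person)

# With a log link, adding the offset multiplies the per-person rate by population
lam_with_offset = np.exp(df_demo["log_population"] + eta)
lam_scaled = df_demo["population"] * np.exp(eta)
print(np.allclose(lam_with_offset, lam_scaled))
```

This example has no offset because every observation shares the same sample size.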

In [None]:
df = pd.DataFrame({"y": y, "x0": x0, "x1": x1})
data = Data(col_obs="y", col_covs=["x0", "x1"], df=df)

### Variables and parameters

Here we use $x_0$ to model $p$ and $x_1$ to model $\lambda$. The choice of which covariate models which parameter usually comes from prior knowledge.
We model both $p$ and $\lambda$ with third-degree splines, with two interior knots for $p$ and one interior knot for $\lambda$.

To declare variables, we use the `Variable` or `SplineVariable` class: `Variable` is for a regular variable, and
`SplineVariable` is for a variable modeled with a spline.
In this case we use splines for both $x_0$ and $x_1$.
To specify the spline settings, we use the `SplineSpecs` class, which takes the knots and degree settings. For more details, check the [docstring](https://github.com/ihmeuw-msca/regmod/blob/develop/src/regmod/utils.py#L74) of the class.
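
To give a feel for what a third-degree spline with these knots looks like, the sketch below builds a cubic B-spline basis with `scipy.interpolate.BSpline`. This is an illustration only: it uses SciPy rather than `xspline`, and the design matrix that `regmod` builds internally may differ (for example, in how it treats the intercept).

```python
import numpy as np
from scipy.interpolate import BSpline

degree = 3
knots = np.array([0.0, 0.25, 0.75, 1.0])   # same breakpoints as used for x0

# Clamped knot vector: repeat each boundary knot `degree` extra times
t = np.concatenate([np.repeat(knots[0], degree), knots, np.repeat(knots[-1], degree)])
num_basis = len(t) - degree - 1             # interior knots + degree + 1

# Evaluating with an identity coefficient matrix returns every basis function
x_grid = np.linspace(0.0, 0.99, 100)
basis = BSpline(t, np.eye(num_basis), degree)(x_grid)

print(basis.shape)
```

Each row of `basis` sums to one, and a spline curve is a weighted sum of the columns. With two interior knots and degree three, this construction has six basis functions.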

In [None]:
var0 = SplineVariable(name="x0",
                      spline_specs=SplineSpecs(knots=np.array([0.0, 0.25, 0.75, 1.0]),
                                               knots_type="abs",
                                               degree=3))

var1 = SplineVariable(name="x1",
                      spline_specs=SplineSpecs(knots=np.array([0.0, 0.5, 1.0]),
                                               knots_type="abs",
                                               degree=3))

Next we create the parameter specification.
The Pogit model has two parameters, `p` and `lam`.

In [None]:
param_specs = {"p": {"variables": [var0]},
               "lam": {"variables": [var1]}}

### Model object

Here we assemble data and parameter information to create the model object.
For more details please check the [docstring](https://github.com/ihmeuw-msca/regmod/blob/develop/src/regmod/models/model.py#L16).

In [None]:
model = PogitModel(data, param_specs=param_specs)

Fit the model using `scipy_optimize`.

In [None]:
result = scipy_optimize(model)
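
To make the fitting step less of a black box, here is a minimal sketch of the kind of objective being minimized. It is not `regmod`'s implementation. It assumes a logit link for $p$, a log link for $\lambda$, and simple linear predictors instead of splines, and it uses the fact that $y \sim \mathrm{Poisson}(\lambda p)$ marginally. All names and coefficient values below are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(1)
num_obs = 500
x0_demo = rng.uniform(size=num_obs)
x1_demo = rng.uniform(size=num_obs)

# Illustrative linear predictors (not the spline model used in this notebook)
true_coefs = np.array([0.5, 1.0, 2.5, 0.5])
design_p = np.column_stack([np.ones(num_obs), x0_demo])
design_lam = np.column_stack([np.ones(num_obs), x1_demo])

def get_mu(coefs):
    """Expected observed count mu = p * lam under logit and log links."""
    p = expit(design_p @ coefs[:2])
    lam = np.exp(design_lam @ coefs[2:])
    return p * lam

# Drawing y from Poisson(mu) is equivalent to the two-stage Poisson-binomial process
y_demo = rng.poisson(get_mu(true_coefs))

def negative_log_likelihood(coefs):
    """Poisson negative log-likelihood of y given mu, dropping the log(y!) constant."""
    mu = get_mu(coefs)
    return np.sum(mu - y_demo * np.log(mu))

# Start from p = 0.5 and lam = 2 * mean(y), so that mu starts near the data mean
start = np.array([0.0, 0.0, np.log(2.0 * y_demo.mean()), 0.0])
fit = minimize(negative_log_likelihood, start, method="L-BFGS-B")
print(fit.x, fit.fun)
```

Because only the product $\lambda p$ enters the likelihood, the levels of $p$ and $\lambda$ are hard to separate from the data alone, so the fitted coefficients in this sketch need not match `true_coefs` closely even when the fitted $\mu$ is good.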

Extract the coefficients for each parameter.

In [None]:
coefs_p, coefs_lam = model.split_coefs(result["coefs"])

### Predict and plot against true parameter

Evaluate the fitted parameters on a grid of covariate values.

In [None]:
df_pred = pd.DataFrame({"x0": x, "x1": x})
data_pred = Data(col_covs=["x0", "x1"], df=df_pred)

df_pred["p"] = model.params[0].get_param(coefs_p, data_pred)
df_pred["lam"] = model.params[1].get_param(coefs_lam, data_pred)

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10*2, 5))
ax[0].plot(df_pred.x0, df_pred.p, color="#008080", label="Model fit")
ax[0].plot(df_pred.x0, get_true_p(df_pred.x0), color="#DC143C", label="True p", linestyle="--")
ax[0].set_xlabel("x0")
ax[0].set_ylabel("p")
ax[0].set_title("Pogit Model", loc="left")
ax[0].legend()

ax[1].plot(df_pred.x1, df_pred.lam, color="#008080", label="Model fit")
ax[1].plot(df_pred.x1, get_true_lam(df_pred.x1), color="#DC143C", label=r'True $\lambda$', linestyle="--")
ax[1].set_xlabel("x1")
ax[1].set_ylabel(r'$\lambda$')
ax[1].legend()
plt.show()

### Predict and plot against training

You can also predict directly on the data used for fitting to check the overall fit.

In [None]:
df_fit = data.df

df_fit["p"] = model.params[0].get_param(coefs_p, data)
df_fit["lam"] = model.params[1].get_param(coefs_lam, data)

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10*2, 5))

for i, cov in enumerate(["x0", "x1"]):
    ax[i].scatter(df_fit[cov], df_fit.y, color="gray", marker=".", label="data")
    ax[i].scatter(df_fit[cov], df_fit.p*df_fit.lam, color="#008080", marker="x", label="pred")
    ax[i].scatter(df_fit[cov], get_true_p(df_fit.x0)*get_true_lam(df_fit.x1), color="#DC143C", marker=".", label="true")
    ax[i].set_xlabel(cov)
    ax[i].set_ylabel(r'observation $\mu=\lambda p$')
    ax[i].legend()
plt.show()

### Quantify uncertainty via 1,000 draws

Our model fitting process produces both coefficient estimates and the covariance matrix of those estimates, via sandwich estimation. We will take 1,000 draws of coefficients according to this covariance matrix and use them to quantify uncertainty.
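
For readers unfamiliar with sandwich estimation, the sketch below shows the general idea on a plain log-link Poisson regression. It is an illustration under simplifying assumptions and not necessarily how `regmod` computes its covariance. The "bread" is the inverse Hessian of the negative log-likelihood, and the "meat" is the sum of outer products of the per-observation scores. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
num_obs = 2_000
X = np.column_stack([np.ones(num_obs), rng.uniform(size=num_obs)])
beta_true = np.array([1.0, 0.5])           # illustrative values
y_demo = rng.poisson(np.exp(X @ beta_true))

# Fit a log-link Poisson regression with a few Newton steps
beta = np.array([np.log(y_demo.mean()), 0.0])
for _ in range(25):
    mu = np.exp(X @ beta)
    gradient = X.T @ (mu - y_demo)
    hessian = X.T @ (X * mu[:, None])
    beta = beta - np.linalg.solve(hessian, gradient)

mu = np.exp(X @ beta)
hessian = X.T @ (X * mu[:, None])          # Hessian of the negative log-likelihood
scores = X * (y_demo - mu)[:, None]        # per-observation score vectors
meat = scores.T @ scores                   # sum of score outer products
bread_inv = np.linalg.inv(hessian)

vcov_sandwich = bread_inv @ meat @ bread_inv
vcov_model = bread_inv                     # model-based covariance, for comparison

print(np.sqrt(np.diag(vcov_sandwich)), np.sqrt(np.diag(vcov_model)))
```

When the model is correctly specified, as in this toy example, the two covariance estimates should be similar; they diverge under overdispersion or misspecification.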

We first sample coefficients from a multivariate normal distribution
whose mean is the point estimate `result["coefs"]` and whose covariance is the estimated covariance matrix `result["vcov"]`.

In [None]:
num_draws = 1000
coefs_samples = np.random.multivariate_normal(mean=result["coefs"], cov=result["vcov"], size=num_draws)

We then split the coefficients sample into samples for $p$ and $\lambda$.

In [None]:
coefs_samples = list(map(model.split_coefs, coefs_samples))

Create draws for $p$ and $\lambda$.

In [None]:
draws = [[], []]
for coefs in coefs_samples:
    for i, coef in enumerate(coefs):
        draws[i].append(model.params[i].get_param(coef, data_pred))

for i in range(2):
    draws[i] = np.vstack(draws[i])

Plot the uncertainty interval against the true parameters.

In [None]:
# Generate 1-alpha confidence intervals
alpha = 0.05
lb, ub = 0.5*alpha, 1 - 0.5*alpha

fig, ax = plt.subplots(1, 2, figsize=(10*2, 5))

ax[0].plot(df_pred.x0, df_pred.p, color="#008080", label="Model fit")
ax[0].plot(df_pred.x0, get_true_p(df_pred.x0), color="#DC143C", label="True p", linestyle="--")
ax[0].fill_between(df_pred.x0,
                   np.quantile(draws[0], lb, axis=0),
                   np.quantile(draws[0], ub, axis=0),
                   color="#008080", alpha=0.2)
ax[0].set_xlabel("x0")
ax[0].set_ylabel("p")
ax[0].set_title("Pogit Model", loc="left")
ax[0].legend()

ax[1].plot(df_pred.x1, df_pred.lam, color="#008080", label="Model fit")
ax[1].plot(df_pred.x1, get_true_lam(df_pred.x1), color="#DC143C", label=r'True $\lambda$', linestyle="--")
ax[1].fill_between(df_pred.x1,
                   np.quantile(draws[1], lb, axis=0),
                   np.quantile(draws[1], ub, axis=0),
                   color="#008080", alpha=0.2)
ax[1].set_xlabel("x1")
ax[1].set_ylabel(r'$\lambda$')
ax[1].legend()
plt.show()