## Week 7 Lecture 3 - Ordered logits

McElreath's lectures for today: https://www.youtube.com/watch?v=p7g-CgGCS34

McElreath's lectures for the whole book are available here: https://github.com/rmcelreath/statrethinking_winter2019

An R/Stan repo of code is available here: https://vincentarelbundock.github.io/rethinking2/

Dustin Stansbury has some lovely PyMC Code available here: https://github.com/dustinstansbury/statistical-rethinking-2023

You are encouraged to work through both of these versions to reinforce what we're doing in class.

In [None]:
# Import python packages
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import scipy as sp 
import random as rd
import pdb
import pymc as pm
import patsy
import arviz as az
import networkx as nx
from matplotlib import pyplot as plt
import dataframe_image as dfi
import pytensor as pyt


# Helper functions
def stdize(x):
    return (x-np.mean(x))/np.std(x)


def indexall(L):
    poo = []
    for p in L:
        if not p in poo:
            poo.append(p)
    Ix = np.array([poo.index(p) for p in L])
    return poo,Ix

def indexall_(L):
    Il, Ll = pd.factorize(L, sort=True)
    return Ll, Il

# Ordered categorical variables

Among the most challenging models to get your head around at first are those for ordered categorical variables, which can show up as either the response or a covariate. The categories come in a fixed order, but the steps between them may or may not be evenly spaced. Wacky stuff.

To get into this, we're going to have to cover a set of experiments used by moral philosophers to evaluate a set of unconscious principles that explain why people vary in their judgements (relevant for things like juries):

1. **The action principle** - harm caused by action is morally worse than the same harm caused by omission
2. **The intention principle** - harm intended as the means to an end is morally worse than the same harm foreseen as a side effect of an end
3. **The contact principle** - harm caused by physical contact is morally worse than the same harm without physical contact

These things may seem abstract, but think about something like the US drone program for item (3), where families of terrorists are killed without physical contact. Would it be worse if individual soldiers went into a targeted house and killed everyone? Philosophy has real-world consequences.

Experiments are designed as stories with one or more of these principles and we'll look at data comparing two stories, from Dennis and Evan.

Dennis' story:

> Standing by the railroad tracks, Dennis sees an empty, out-of-control boxcar about to hit five people. Next to Dennis is a lever that can be pulled, sending the boxcar down a side track and away from the five people. But pulling the lever will also lower the railing on a footbridge spanning the side track, causing one person to fall off the footbridge and onto the side track, where he will be hit by the boxcar. If Dennis pulls the lever the boxcar will switch tracks and not hit the five people, and the one person to fall and be hit by the boxcar. If Dennis does not pull the lever the boxcar will continue down the tracks and hit five people, and the one person will remain safe above the side track.


Evan's story:

> Standing by the railroad tracks, Evan sees an empty, out-of-control boxcar about to hit five people. Next to Evan is a lever that can be pulled, lowering the railing on a footbridge that spans the main track, and causing one person to fall off the footbridge and onto the main track, where he will be hit by the boxcar. The boxcar will slow down because of the one person, therefore preventing the five from being hit. If Evan pulls the lever the one person will fall and be hit by the boxcar, and therefore the boxcar will slow down and not hit the five people. If Evan does not pull the lever the boxcar will continue down the tracks and hit the five people, and the one person will remain safe above the main track.

Reading through these, you may have a different sense of what you would do in each case, or not, depending on how the action principle (both stories) and intention principle (Evan's story) weigh on your conscience.

We have data relating to these stories, so let's load it in


In [None]:
# Import trolley data
tdata = pd.read_csv('Trolley.csv',sep=';')
dfi.export(tdata.head(), 'tdata.jpg')
tdata.head()

The data include responses from 331 individuals (`id`) to various trolley problems (`case`). We're interested in their `response`, an ordered integer from 1 (never permissible) to 7 (always permissible) rating whether the actor in the story should intervene.

A first question is how should we even display such data? Here are three alternatives

In [None]:
_, (ax0, ax1, ax2) = plt.subplots(1, 3, figsize=(17, 6))

# Pandas plotting goodness for a histogram
tdata.response.value_counts().sort_index().plot(kind='bar',ax=ax0)
ax0.set_xlabel("", fontsize=15);
ax0.set_ylabel("Frequency", fontsize=15);

# Pandas plotting goodness for a cumulative proportion
tdata.response.value_counts().sort_index().cumsum().div(tdata.shape[0]).plot(marker='o',ax=ax1)
ax1.set_xlim(0.9, 7.1);
ax1.set_xlabel("Response", fontsize=15)
ax1.set_ylabel("Cumulative proportion", fontsize=15)

# Pandas plotting goodness for log-odds
tdata.response.value_counts().sort_index().cumsum().iloc[:-1].div(tdata.shape[0]).apply(lambda p:np.log(p/(1.-p))).plot(marker='o',ax=ax2)
ax2.set_xlim(0.9, 7);
ax2.set_xlabel("", fontsize=15)
ax2.set_ylabel("log(Cumulative-odds)", fontsize=15)
plt.savefig('cum-odds.jpg',dpi=300);

From left to right we have: a histogram of the responses, the cumulative proportion of responses from each category, and the log of the cumulative odds of each category. I'll walk you through the panels from left to right, using pandas notation. The first quantity is the count of each response category

In [None]:
# Counts per response
tdata.response.value_counts()

Pandas indexing is on the left, keeping track of which response category these totals represent. Next we want to order them

In [None]:
# Ordered counts per response
tdata.response.value_counts().sort_index()

This is the left hand panel of the figure above. Next we want to calculate the cumulative proportions for each response - this is the ordering that we're concerned about. The first step there is to calculate the cumulative sum of each response

In [None]:
# Cumulative sum of responses (in order)
tdata.response.value_counts().sort_index().cumsum()

And then calculate the proportion represented by each sum by dividing by the total number of responses

In [None]:
# Cumulative proportion of responses (in order)
tdata.response.value_counts().sort_index().cumsum().div(tdata.shape[0])

This is the middle panel of the figure above. Lastly we want to plot the log(cumulative-odds) of each category. Why? Well because it is the cumulative analogue of the log-odds we used in the binomial model. Remember that the logit is the log-odds and so the cumulative logit is the log(cumulative-odds); these transformations are what constrain our model to the probability scale when it comes to the response.

To go from cumulative proportions to cumulative logit, we need the cumulative proportion for the first six responses only. Remember, the category proportions have to sum to one, so the cumulative proportion for the last category is always 1 and its log-odds are infinite; we get that category for free. So first we can get the cumulative proportions for the first six categories by using the `.iloc` indexer to drop the last category (`[:-1]`)

In [None]:
# Cumulative proportion of first six responses (in order)
tdata.response.value_counts().sort_index().cumsum().iloc[:-1].div(tdata.shape[0])

The last thing we need to do is calculate the log of the cumulative odds. Recall the equation for the logit:

$$
logit(p) = log(\frac{p}{1-p})
$$

The equation for the cumulative logit is exactly this, but where $p=P(y_i \le k)$

$$
logit(P(y_i \le k)) = log(\frac{P(y_i \le k)}{1-P(y_i \le k)})
$$

We can use the `.apply` function in pandas to apply this logit to the cumulative probabilities we calculated in the step above

In [None]:
tdata.response.value_counts().sort_index().cumsum().iloc[:-1].div(tdata.shape[0]).apply(lambda p:np.log(p/(1.-p)))
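
As a toy check of the steps above, here is the same chain written out in plain NumPy. The counts are made up for a 5-point response (they are not the trolley data), so treat this as a sketch of the arithmetic rather than a result.

```python
import numpy as np

# Hypothetical counts for a 5-point ordered response (not the trolley data)
counts = np.array([10, 20, 40, 20, 10])

# Cumulative proportions P(y <= k); the last one is always exactly 1
cum_prop = counts.cumsum() / counts.sum()

# Drop the last category before taking the logit, since logit(1) is infinite
q = cum_prop[:-1]
log_cum_odds = np.log(q / (1 - q))

print(cum_prop)
print(log_cum_odds)
```

With K categories there are only K-1 finite log-cumulative-odds, which is why the last category is dropped.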

What do we get by doing this? Well we've now mapped the cumulative probabilities onto the log-odds scale and - this is the miracle here - in doing so we can still get the probability of each category by subtracting the cumulative probability of the previous category. Incidentally the genius behind [this idea](http://www.stat.uchicago.edu/~pmcc/pubs/paper2.pdf) is [Peter McCullagh](http://www.stat.uchicago.edu/~pmcc/) - who came up with it when he was 28...

To further see this, we can plot how subtracting these cumulative probabilities stacks up

In [None]:
# Store cumulative proportions
tmp = tdata.response.value_counts().sort_index().cumsum().div(tdata.shape[0])
# Grab cumulative proportion values
cumsum = tmp.values
ncat = len(cumsum)
# Pandas plotting goodness
tmp.plot(marker='o', label='P(y<=k)')
# Draw horizontal lines at cumulative proportions
[plt.axhline(p,c='grey',alpha=0.5) for p in cumsum]
# Draw vertical lines from previous probability to current probability
[plt.plot((i+1,i+1),(cumsum[i-1],cumsum[i]), c='red', zorder=0) for i in range(1,ncat)]
# Draw first line
plt.plot((1,1), (0,cumsum[0]), c='red', zorder=0, label='P(y=k)')
# Label length of red lines
[plt.text(i+1.1, cumsum[i-1]+0.03, str(round(cumsum[i]-cumsum[i-1],2)), c='red') for i in range(1,ncat)]
# Label length of first red line
plt.text(1.1, 0.03, str(round(cumsum[0],2)), c='red')
plt.legend()
plt.xlim(0.9, 7.1)
plt.ylim(0., 1.01)
plt.xlabel("Response", fontsize=15)
plt.ylabel("Cumulative proportion", fontsize=15)
plt.savefig('cum-props.jpg',dpi=300);
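
The subtraction drawn as red lines can also be written out numerically. This is a minimal sketch that starts from hypothetical log-cumulative-odds (invented values, five categories) and works back to the probability of each category.

```python
import numpy as np

# Hypothetical log-cumulative-odds for a 5-category response
log_cum_odds = np.array([-2.0, -0.5, 0.5, 2.0])

# The inverse logit returns the cumulative probabilities P(y <= k)
q = 1 / (1 + np.exp(-log_cum_odds))

# Pad with 0 below and 1 above, then difference to get P(y = k)
p = np.diff(np.concatenate(([0.0], q, [1.0])))

print(q.round(3))
print(p.round(3), p.sum())
```

Padding with 0 and 1 handles the first and last categories in the same step as the ones in between.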

Now that the cumulative logit is hopefully clear, we can develop a statistical model to estimate these ordered probabilities. While PyMC and Stan both have Ordered-logit distributions available, it's important to recognize what's going on under the hood. What's happening is really this

$$
\begin{align}
R_i \sim & Cat(p)\\
p_1 = & q_1 \\
p_k = & q_k-q_{k-1} \text{   for } K>k>1 \\
p_K = & 1-q_{K-1} \\
logit(q_k) = & \kappa_k - \phi_i \\
\phi_i = & \text{  linear model} \\
\kappa_k \sim & N(0, 1.5)
\end{align}
$$

What this really says is that the probabilities for each category ($p_k$) come via subtraction from the cumulative probabilities ($q_k$), and that the link to our linear model is just the logit for that cumulative probability. Interestingly each category below the maximum has its own intercept, $\kappa_k$, *from which* we subtract the rest of our linear model. These intercepts are what is referred to as the **cutpoints**. Why by subtraction? Because adjustments made by the linear model should occur relative to the maximum. So **if we decrease the log(cumulative-odds) of every outcome k below the maximum, this shifts probability toward higher outcome values**. What this will do is give us higher values when the parameters in $\phi_i$ are positive. Re-read that bold sentence a few times until it makes sense. It's a pivotal idea.
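
To make the bold sentence concrete, here is a small sketch of the model equations above in NumPy. The cutpoints and the values of $\phi$ are hypothetical, and `ordered_logit_probs` is an illustrative helper written for this note, not a PyMC function.

```python
import numpy as np

def ordered_logit_probs(kappa, phi):
    """Category probabilities from cutpoints kappa and linear predictor phi."""
    # logit(q_k) = kappa_k - phi
    q = 1 / (1 + np.exp(-(np.asarray(kappa) - phi)))
    # p_1 = q_1, p_k = q_k - q_{k-1}, p_K = 1 - q_{K-1}
    return np.diff(np.concatenate(([0.0], q, [1.0])))

kappa = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])  # hypothetical cutpoints
levels = np.arange(1, 8)                            # responses 1..7

for phi in (-1.0, 0.0, 1.0):
    p = ordered_logit_probs(kappa, phi)
    print(phi, p.round(3), round(float((levels * p).sum()), 2))
```

A positive `phi` lowers every cumulative log-odds, so less probability sits at or below each cutpoint and the average response goes up; a negative `phi` does the opposite.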

Ok, much easier than doing all this by hand is to use the Ordered-logit in PyMC, which for a model with no covariates is

In [None]:
# Responses (with -1 because Python)
R = tdata.response.values-1

In [None]:
# Number of unique classes
nK = len(np.unique(R))

In [None]:
# Labels for the nK-1 cutpoints (the top category needs no cutpoint)
COORDS = {'responses':np.unique(R)[:-1].astype(str)}

In [None]:
with pm.Model(coords=COORDS) as OLo:
    # Cutpoints (nK-1 of them) - note the use of transform here, and the need to initialize with some ordered values
    κ = pm.Normal('Cutpoint', 0, 1.5, 
                  transform=pm.distributions.transforms.ordered,
                  initval=np.arange(nK-1)-2.5,
                  dims='responses')
    
    # Likelihood
    Yi = pm.OrderedLogistic('Yi', 0, κ, observed=R)

In [None]:
with OLo:
    trace_o = pm.sample(1000)

In [None]:
tmp = pm.summary(trace_o)
dfi.export(tmp.head(), 'cutpoints.jpg')
tmp

Et voila! We have cutpoints! So what does this mean really? Well to have a look at these we need to convert these back to the cumulative probability scale

In [None]:
def invlogit(x):
    return np.exp(x)/(1+np.exp(x))

In [None]:
cum_prob = invlogit(trace_o.posterior['Cutpoint'].mean(('draw','chain')))
# 95% interval taken across both chains and draws (axes 0 and 1)
cum_prob_l95 = invlogit(np.quantile(trace_o.posterior['Cutpoint'].values, 0.025, axis=(0, 1)))
cum_prob_u95 = invlogit(np.quantile(trace_o.posterior['Cutpoint'].values, 0.975, axis=(0, 1)))
cum_prob

Which is the same as the cumulative proportions we calculated from the raw data

In [None]:
# Store cumulative proportions
tmp = tdata.response.value_counts().sort_index().cumsum().div(tdata.shape[0])
# Grab cumulative proportion values
cumsum = tmp.values
ncat = len(cumsum)
# Pandas plotting goodness
tmp.plot(marker='o', label='P(y<=k)')
# Draw 95% posterior intervals
[plt.plot((i+1.1,i+1.1),(cum_prob_l95[i],cum_prob_u95[i]), c='black', zorder=0) for i in range(ncat-1)]
# Draw horizontal lines at cumulative proportions
[plt.axhline(p,c='grey',alpha=0.5) for p in cumsum]
# Draw vertical lines from previous probability to current probability
[plt.plot((i+1,i+1),(cumsum[i-1],cumsum[i]), c='red', zorder=0) for i in range(1,ncat)]
# Draw first line
plt.plot((1,1), (0,cumsum[0]), c='red', zorder=0, label='P(y=k)')
# Label length of red lines
[plt.text(i+1.1, cumsum[i-1]+0.03, str(round(cumsum[i]-cumsum[i-1],2)), c='red') for i in range(1,ncat)]
# Label length of first red line
plt.text(1.1, 0.03, str(round(cumsum[0],2)), c='red')
plt.legend()
plt.xlim(0.9, 7.1)
plt.ylim(0., 1.01)
plt.xlabel("Response", fontsize=15)
plt.ylabel("Cumulative proportion", fontsize=15)
plt.savefig('cum-props_post.jpg',dpi=300);

but now with associated uncertainties around the probabilities. Incidentally this gives some pseudo-statistical people the shits; I once had a haughty reviewer state '*you can't have uncertainty about probabilities because they're probabilities*'. They didn't win the day.

With this in hand, we can - by subtraction - get the estimated probabilities for each category:

In [None]:
k_prob = np.array([cum_prob[0]]+[cum_prob[i]-cum_prob[i-1] for i in range(1,ncat-1)]+[1-cum_prob[-1]])
k_prob
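
The subtraction above works on posterior means. If you also want uncertainty on each category probability, one option is to do the subtraction draw by draw and summarize afterwards. The sketch below uses simulated draws as a stand-in for `trace_o.posterior['Cutpoint'].values`; the centre values and spread are invented, not taken from the fitted model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for posterior cutpoint draws with shape (chain, draw, cutpoint)
centre = np.array([-1.9, -1.3, -0.7, 0.2, 0.9, 1.8])  # hypothetical values
draws = centre + rng.normal(0, 0.03, size=(4, 1000, 6))
draws = np.sort(draws, axis=-1)  # keep each draw's cutpoints ordered

# Cumulative probabilities for every draw
q = 1 / (1 + np.exp(-draws))

# Pad each draw with 0 and 1, then difference along the cutpoint axis
zeros = np.zeros(q.shape[:-1] + (1,))
ones = np.ones(q.shape[:-1] + (1,))
p = np.diff(np.concatenate([zeros, q, ones], axis=-1), axis=-1)

# Summaries across chains and draws
p_mean = p.mean(axis=(0, 1))
p_lo, p_hi = np.quantile(p, [0.025, 0.975], axis=(0, 1))
print(p_mean.round(3))
```

Because every draw sums to one, the summaries stay coherent with each other, which is not guaranteed if you difference the interval endpoints directly.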

Ok, this is all fine and dandy, but how can we add covariates to these kinds of models, to make inferences about things we really care about? In the case of these trolley experiments, there are stories and response scores, and each story contains one or two of the three principles: Action, Intention, and Contact

We saw the covariate model before in Bayes notation, now in PyMC form

In [None]:
# PyTensor wrapper for later use in plotting - transforms variables into PyTensor shared tensor objects
from pytensor import shared

# Covariates
C = shared(tdata.contact.values)
I = shared(tdata.intention.values)
A = shared(tdata.action.values)

In [None]:
with pm.Model() as OLo_c:
    # Cutpoints - note the use of transform here, and the need to initialize with some ordered values
    κ = pm.Normal('Cutpoint', 0, 1.5, 
                  transform=pm.distributions.transforms.ordered, 
                  shape=nK-1, 
                  initval=np.arange(nK-1)-2.5)
    
    # Elements present in story
    βC = pm.Normal('C', 0, 1.5)
    βA = pm.Normal('A', 0, 1.5)
    βI = pm.Normal('I', 0, 1.5)
    βIC = pm.Normal('IC', 0, 1.5)
    βIA = pm.Normal('IA', 0, 1.5)
    
    # Linear model
    μ = βC*C+βA*A+βI*I+βIC*I*C+βIA*I*A
    
    # Likelihood
    Yi = pm.OrderedLogistic('Yi', μ, κ, observed=R)

In [None]:
with OLo_c:
    trace_c = pm.sample(1000)

In [None]:
pm.summary(trace_c)

In [None]:
# Plot covariate effects
pm.plot_forest(trace_c,var_names=['C','I','A','IC','IA'])
plt.axvline(0)
plt.savefig('cov_post.jpg',dpi=300);

Which are all negative - so less permissible. We **subtract** from the cutpoints, so the more negative things are the more they shift density into lower categories. To see this we can look at the effect of the biggest covariate, `IC`, on the transformed probabilities:

In [None]:
TP = trace_c.posterior
cum_prob_C = invlogit(TP['Cutpoint'].mean(('chain','draw'))-(TP['C'].mean(('chain','draw'))))
cum_prob_IC = (invlogit(TP['Cutpoint'].mean(('chain','draw'))-(TP['I'].mean(('chain','draw'))
                +TP['C'].mean(('chain','draw'))+TP['IC'].mean(('chain','draw')))) )

In [None]:
# Grab cumulative proportion values
tmp2 = pd.Series(np.append(cum_prob_C,1))
tmp = pd.Series(np.append(cum_prob_IC,1))
cumsum = cum_prob_IC
ncat = len(cumsum)
# Pandas plotting goodness
tmp.plot(marker='o', label='P(y<=k) IC')
tmp2.plot(marker='o', label='P(y<=k) C')
# Draw horizontal lines at cumulative proportions
[plt.axhline(p,c='grey',alpha=0.5) for p in cumsum]
# Draw vertical lines from previous probability to current probability
[plt.plot((i,i),(cumsum[i-1],cumsum[i]), c='red', zorder=0) for i in range(1,ncat)]
# Draw first line
plt.plot((0.01,0.01), (0,cumsum[0]), c='red', zorder=0, label='P(y=k)')
# Draw last line
plt.plot((ncat,ncat), (cumsum[ncat-1],1), c='red', zorder=0)
# Label length of red lines
[plt.text(i, cumsum[i-1]+0.03, str(np.round(cumsum[i]-cumsum[i-1],2).values), c='red') for i in range(1,ncat)]
plt.text(ncat, cumsum[ncat-1]+0.03, str(np.round(1-cumsum[ncat-1],2).values), c='red')
# Label length of first red line
plt.text(.1, 0.03, str(np.round(cumsum[0],2).values), c='red')
plt.legend()
plt.xlim(0, 8)
plt.ylim(0., 1.01)
plt.xlabel("Response", fontsize=15)
plt.ylabel("Cumulative proportion", fontsize=15)
plt.savefig('covar_sum.jpg',dpi=300);
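
To see how the main effects and interactions combine in the linear model, here is a small sketch that evaluates it for a few story types. The coefficients and cutpoints are hypothetical, chosen only to be negative like the estimates discussed above, and `phi` and `mean_response` are illustrative helpers.

```python
import numpy as np

# Hypothetical coefficients (negative, as in the discussion above)
b = {"A": -0.5, "I": -0.3, "C": -0.3, "IA": -0.4, "IC": -1.2}
kappa = np.array([-2.5, -1.5, -0.5, 0.5, 1.5, 2.5])  # hypothetical cutpoints

def phi(A, I, C):
    """Linear model with intention-by-action and intention-by-contact terms."""
    return b["A"]*A + b["I"]*I + b["C"]*C + b["IA"]*I*A + b["IC"]*I*C

def mean_response(phi_value):
    """Expected response (1..7) under the ordered logit for one phi."""
    q = 1 / (1 + np.exp(-(kappa - phi_value)))
    p = np.diff(np.concatenate(([0.0], q, [1.0])))
    return float((np.arange(1, 8) * p).sum())

# A few (action, intention, contact) combinations
for story in [(0, 0, 0), (1, 0, 0), (0, 0, 1), (0, 1, 1)]:
    print(story, round(phi(*story), 2), round(mean_response(phi(*story)), 2))
```

The interaction term only switches on when both of its indicators are 1, which is why the intention-plus-contact story gets a much lower `phi` than either principle alone.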

Note: I had to re-initialize the Jupyter notebook at this stage to run this model - I have no idea why but will try to sort it out and re-post.

# Ordered categorical covariates

While ordered categories can be handled on the response side, they can also occur in potential covariates. In the trolley data, we see there is an ordered list of educational attainment

In [None]:
tdata.head()

in the `edu` column. Looking at the column we can see the values

In [None]:
tdata.edu.unique()

which need to be ordered in some way. We can do this using list indexing in Python

In [None]:
# Education order
Edu = ['Elementary School','Middle School','Some High School','High School Graduate',
       'Some College',"Bachelor's Degree","Master's Degree",'Graduate Degree']
# Number of education categories
nedu = len(Edu)
# Create new column with integer order
tdata['edu_score'] = [Edu.index(e) for e in tdata.edu.values]
# Take a look
tdata.head()
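
An alternative to list indexing, sketched here on a made-up toy series rather than the trolley data, is an ordered pandas `Categorical`. Its integer `.codes` follow the order you give in `categories`, and any label missing from that list is coded as -1 instead of raising an error.

```python
import pandas as pd

# Shortened, illustrative ordering (the real list has eight levels)
order = ['Elementary School', 'Middle School', 'Some High School']

# Toy data standing in for tdata.edu
edu = pd.Series(['Middle School', 'Elementary School',
                 'Some High School', 'Middle School'])

# Ordered categorical: codes are positions in `order`
cat = pd.Categorical(edu, categories=order, ordered=True)
codes = cat.codes
print(list(codes))

# A label that is not in `order` becomes code -1
missing_code = pd.Categorical(['PhD'], categories=order, ordered=True).codes[0]
print(missing_code)
```

Checking for -1 before fitting is a cheap way to catch misspelled or unexpected education labels.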

Our `edu_score` variable is now in order, and what we want to estimate is the incremental effect of each step in the education ladder on the morality of the various stories. The way to do this is to make additive effects building on previous education effects. So with elementary corresponding to the intercept, we can show for the next three levels of education:

$$
\begin{align}
\phi_1 = & \delta_1 \\
\phi_2 = & \delta_1 + \delta_2\\
\phi_3 = & \delta_1 + \delta_2 + \delta_3
\end{align}
$$

For the highest education level, which includes all seven increments, the shorthand notation is

$$
\phi_i = \sum^{7}_{j=1}\delta_j
$$

Now the real trick in all this is to have these $\delta$ parameters be some fraction of a maximum education effect ($\beta_E$), which we can write as

$$
\phi_i = \beta_E\sum^{E_i-1}_{j=0}\delta_j
$$

Which, with $\delta_0=0$ for the first category, makes its contribution $\beta_E \delta_0=0$. The last bit here is to define priors, which is the easiest part: we get to use a Dirichlet, which is a multivariate extension of the Beta distribution. The Beta distribution is a distribution for two probabilities that sum to one and the Dirichlet is the same, but generalized to $n$ probabilities.  So to write out our whole model

$$
\begin{align}
R_i \sim & OrderedLogit(\phi_i, \kappa_k) \\
\phi_i = & \beta_E\sum^{E_i-1}_{j=0}\delta_j + \beta_A A + \beta_I I + \beta_C C\\
\kappa_k \sim & N(0, 1.5)\\
\beta_A,\beta_I, \beta_C, \beta_E \sim & N(0, 1)\\
\delta \sim & Dirichlet(\alpha) \\
\alpha = & [2,2,2,...,2]
\end{align}
$$
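
Before the PyMC version, it can help to see the education term on its own. This sketch draws one set of $\delta$ values from a Dirichlet(2, ..., 2) with NumPy, prepends $\delta_0=0$, and takes the cumulative sum. The value of `beta_E` and the education scores in `E` are hypothetical, with scores zero-based like `edu_score`.

```python
import numpy as np

rng = np.random.default_rng(7)
n_levels = 8  # eight education levels, so seven increments

# One draw of the increments; they are positive and sum to one
delta = rng.dirichlet(np.full(n_levels - 1, 2.0))

# Prepend delta_0 = 0 and accumulate: 0 at the lowest level, 1 at the highest
delta_cum = np.cumsum(np.concatenate(([0.0], delta)))

beta_E = -0.3                # hypothetical maximum education effect
E = np.array([0, 3, 7, 5])   # hypothetical zero-based education scores

# Education contribution to phi for each individual
phi_edu = beta_E * delta_cum[E]
print(delta_cum.round(3))
print(phi_edu.round(3))
```

Because the increments sum to one, someone at the top level receives the full `beta_E`, and everyone else receives the fraction accumulated up to their level.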

And we can implement all this in PyMC as

In [None]:
# Education attainment
E = shared(tdata.edu_score.values)
nedu

In [None]:
Edu[1:]

In [None]:
with pm.Model(coords={'EDU':Edu}) as OLo_oc:
    # Cutpoints - note the use of transform here, and the need to initialize with some ordered values
    κ = pm.Normal('Cutpoint', 0, 1.5, 
                  transform=pm.distributions.transforms.ordered, 
                  shape=nK-1, 
                  initval=np.arange(nK-1)-2.5)

    # Education mixture - weak flat priors
    δ = pm.Dirichlet('edu_mix',[2]*(nedu-1))
    # Cumulative
    δ0 = [0.0]
    δedu = pm.Deterministic('δedu', pm.math.concatenate([δ0, δ]),dims='EDU')
    # For each level of education, the cumulative delta_E, phi_i
    cδedu = pm.Deterministic('δedu_cumsum', δedu.cumsum(), dims='EDU')
    
    # Maximum education effect
    βE = pm.Normal('E', 0, 1)
    
    # Elements present in story
    βC = pm.Normal('C', 0, 1)
    βA = pm.Normal('A', 0, 1)
    βI = pm.Normal('I', 0, 1)
    
    # Linear model
    μ = βE*cδedu[E]+βC*C+βA*A+βI*I
    
    # Likelihood
    Yi = pm.OrderedLogistic('Yi', μ, κ, observed=R)

In [None]:
OLo_oc.point_logps()

In [None]:
with OLo_oc:
    trace_oc = pm.sample(1000)

In [None]:
# Plot covariate effects
pm.plot_forest(trace_oc,var_names=['C','I','A','E', 'δedu_cumsum'])
plt.axvline(0)
plt.savefig('cutco_post.jpg',dpi=300);