# How TorchCox works

In [1]:
import pandas as pd
import torch
from torch import nn
from torch import optim
import numpy as np

torch.autograd.set_detect_anomaly(True)

<torch.autograd.anomaly_mode.set_detect_anomaly at 0x7feb2c669940>

The first step is to get our survival data into the right format, which is the staircase encoding described in `notebooks/Staircase_encoding.ipynb`.

In [2]:
valdf = pd.DataFrame({'id':['Bob','Sally','James','Ann'], 'time':[1,3,6,10], 'status':[1,1,0,1], 'smoke':[1,0,0,1]})
valdf

| | id | time | status | smoke |
|---|---|---|---|---|
| 0 | Bob | 1 | 1 | 1 |
| 1 | Sally | 3 | 1 | 0 |
| 2 | James | 6 | 0 | 0 |
| 3 | Ann | 10 | 1 | 1 |


In [3]:
tname = 'time'
Xnames = ['smoke']
dname = 'status'

**There is one difference between the snippet below and what is described in `notebooks/Staircase_encoding.ipynb`: the padding value here is a large negative number instead of zero.**  

The reason will become clear shortly. In a nutshell, we use tensor-wide operations to compute the likelihood, and the padding values must not affect that calculation.  

The top row of each slice, which contributes to the numerator of the Cox likelihood, is never padding and so is unaffected. The denominator, however, is computed from the full risk set (all rows in a front slice of the tensor), so padding could change the result, which would be a serious problem.  

As we will see, the denominator of the Cox likelihood involves a `logsumexp()` function. Provided the padded rows' linear predictor (the padding value times $\beta$) ends up strongly negative, which holds for the positive $\beta$ fitted below, `exp()` underflows to zero for those rows. The zeros are then fed into a `sum()` where they do not affect the result, and voilà, the padding does not affect the computation!
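To make the underflow argument concrete, here is a minimal plain-Python sketch (standard library only, not part of TorchCox). The `logsumexp` helper and the variable names are illustrative assumptions; the values mimic the `smoke` column of the second slice of the tensor shown further down, with one padded row.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats (illustrative helper)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

PAD = -1e3  # same padding value as used in _padToMatch2d

# smoke values for Sally, Ann, James, followed by one padded row
real_rows = [0.0, 1.0, 0.0]
padded_rows = real_rows + [PAD]

# A positive coefficient makes PAD * beta strongly negative
beta = math.log(2) / 2
without_pad = logsumexp([x * beta for x in real_rows])
with_pad = logsumexp([x * beta for x in padded_rows])

# At beta = 0 every linear predictor is 0, so the padded row counts like a real one
with_pad_at_zero = logsumexp([x * 0.0 for x in padded_rows])

print(without_pad, with_pad, with_pad_at_zero)
```

In this sketch the two results agree for the positive coefficient. At $\beta = 0$, however, the padded row contributes `exp(0) = 1` just like a real row, so the padding only drops out once the padded linear predictor is sufficiently negative.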

In [4]:
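def _padToMatch2d(inputtens, targetshape):
    # Start from a tensor filled with a large negative padding value (not zeros)
    target = torch.full(targetshape, fill_value=-1e3)
    # Copy the real rows into the top-left corner; the remaining rows stay as padding
    target[:inputtens.shape[0], :inputtens.shape[1]] = inputtens
    return target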

inputdf = valdf[[tname,dname,*Xnames]].sort_values([dname,tname], ascending=[False,True])

tensin = torch.from_numpy(inputdf[[tname,dname,*Xnames]].values)

#Get unique event times
tensin_events = torch.unique(tensin[tensin[:,1]==1, 0])

#For each unique event stack another matrix with event at the top, and all at risk entries below
tensor = torch.stack([_padToMatch2d(tensin[tensin[:,0] >= eventtime, :], tensin.shape) for eventtime in tensin_events])

#Make sure the top row in each unique event time slice is an event
assert all(tensor[:,0,1] == 1)

tensor

tensor([[[    1.,     1.,     1.],
         [    3.,     1.,     0.],
         [   10.,     1.,     1.],
         [    6.,     0.,     0.]],

        [[    3.,     1.,     0.],
         [   10.,     1.,     1.],
         [    6.,     0.,     0.],
         [-1000., -1000., -1000.]],

        [[   10.,     1.,     1.],
         [-1000., -1000., -1000.],
         [-1000., -1000., -1000.],
         [-1000., -1000., -1000.]]])
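The slices above are the risk sets at each unique event time. As a rough cross-check, the same sets can be built with plain Python lists; this is only an illustrative sketch (the `subjects` list and `risk_sets` dictionary are names introduced here, not TorchCox internals), and the order within each set follows the original data frame rather than the sorted tensor.

```python
# (id, time, status, smoke) for the four subjects in valdf
subjects = [
    ("Bob", 1, 1, 1),
    ("Sally", 3, 1, 0),
    ("James", 6, 0, 0),
    ("Ann", 10, 1, 1),
]

# Unique times at which an event (status == 1) occurred, in increasing order
event_times = sorted({time for _, time, status, _ in subjects if status == 1})

# Risk set at each event time: everyone whose time is >= that event time
risk_sets = {
    t: [name for name, time, _, _ in subjects if time >= t]
    for t in event_times
}

for t, names in risk_sets.items():
    print(t, names)
```

Each risk set has as many members as there are non-padding rows in the corresponding slice: four at time 1, three at time 3, and one at time 10.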

We will need a couple of extra quantities computed from the tensor which are related to how one can deal with _tied event times_ in the Cox model.  

We use the _Breslow method_ here to deal with those, which involves summing over the covariates of entries at tied event times, and raising the denominator of the likelihood to the power of the number of tied events.

The Cox partial likelihood is a product over the unique event times $t_i$. Each factor is $\exp(X_i\beta)$ for the subject experiencing the event at that time, divided by the sum of the same quantity over all subjects at risk at that time, $\sum_{j:\, t_j \geq t_i} \exp(X_j\beta)$:    
$$\mathcal{L}(\beta \;|\; X) = \prod_{t_i} \frac{\exp(X_i\beta)}{\sum_{j:\, t_j \geq t_i} \exp(X_j\beta)}$$

In the presence of tied event times, the Breslow method of dealing with these gives a slightly modified likelihood,
$$\mathcal{L}_B(\beta \;|\; X) = \prod_{t_i} \frac{\exp\left(\sum_{k: t_k=t_i} X_k\beta\right)}{\left[\sum_{j:\, t_j \geq t_i} \exp(X_j\beta)\right]^{d_i}}$$
where $d_i$ is the number of tied events at time $t_i$.

The next cell computes the ingredients needed for the Breslow-method Cox likelihood:  
- `num_tied` is $d_i$, the number of tied events at each unique event time  
- `event_tens` is the sum $\sum_{k: t_k=t_i} X_k$ which goes into the numerator

In [5]:
tiecountdf = inputdf.loc[inputdf[dname]==1,:].groupby([tname]).size().reset_index(name='tiecount')
num_tied = torch.from_numpy(tiecountdf.tiecount.values).int()

#One actually has to sum over the covariates which have a tied event time in the Breslow correction method!
#See page 33 here: https://www.math.ucsd.edu/~rxu/math284/slect5.pdf
event_tens = torch.stack([tensor[i, :num_tied[i], 2:].sum(dim=0) for i in range(tensor.shape[0])])

#Drop time and status columns as no longer required
tensor = tensor[:,:,2:]
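The example dataset has no tied event times, so every entry of `num_tied` is 1. To illustrate what the two quantities look like when ties are present, here is a small plain-Python sketch on a hypothetical dataset with two events at time 3 (the data and the dictionary names `tie_counts` and `event_sums` are assumptions made for this illustration).

```python
from collections import defaultdict

# Hypothetical rows (time, status, smoke) with a tie: two events at time 3
rows = [(1, 1, 1), (3, 1, 0), (3, 1, 1), (6, 0, 0), (10, 1, 1)]

tie_counts = defaultdict(int)    # d_i: number of events at each event time
event_sums = defaultdict(float)  # sum of the covariate over the tied events

for time, status, smoke in rows:
    if status == 1:  # censored rows contribute to neither quantity
        tie_counts[time] += 1
        event_sums[time] += smoke

print(dict(tie_counts))
print(dict(event_sums))
```

At time 3 the count is 2 and the covariate sum is 0 + 1 = 1, which is what the Breslow numerator and the exponent $d_i$ in the denominator would use.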

Consider the logarithm of the (Breslow method) Cox partial likelihood above:  
$$\log \mathcal{L}_B(\beta \;|\; X) = \sum_{t_i} \left[ \sum_{k: t_k=t_i} X_k\beta \;\;-\;\; d_i \log\left(\sum_{j:\, t_j \geq t_i} \exp(X_j\beta)\right) \right]$$
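Before turning to tensors, the log-likelihood can be evaluated directly from its definition with loops. The following is a minimal sketch for a single covariate; `breslow_loglik` is a helper written for this illustration only, and it works on the raw rows, so no padding is involved.

```python
import math

# (time, status, smoke) for Bob, Sally, James, Ann
rows = [(1, 1, 1), (3, 1, 0), (6, 0, 0), (10, 1, 1)]

def breslow_loglik(rows, beta):
    """Breslow log partial likelihood for one covariate (illustrative helper)."""
    total = 0.0
    for t in sorted({time for time, status, _ in rows if status == 1}):
        # Covariates of the subjects with an event at time t (more than one if tied)
        tied = [x for time, status, x in rows if time == t and status == 1]
        # Covariates of everyone still at risk at time t
        at_risk = [x for time, _, x in rows if time >= t]
        log_denominator = math.log(sum(math.exp(x * beta) for x in at_risk))
        total += sum(tied) * beta - len(tied) * log_denominator
    return total

beta_hat = math.log(2) / 2
ll_hat = breslow_loglik(rows, beta_hat)
print(breslow_loglik(rows, 0.0), ll_hat)
```

At $\beta = 0$ this gives $-\log(4 \cdot 3 \cdot 1) = -\log 12$, and the value at $\log(2)/2$ is higher than at nearby points, consistent with it being the maximum.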

We can now compute the (Breslow method) Cox partial log-likelihood in the cell below:  
- `loss_event` gives the first term, the log of the numerator: $\sum_{k: t_k=t_i} X_k\beta$
- `XB` corresponds to $X_j\beta$ in the second term 
- `loss_atrisk` is then the second term of the log-likelihood (the log of the former denominator, multiplied by $-d_i$)


This function then returns the negative log-likelihood!

In [6]:
def get_loss(tensor, event_tens, num_tied, beta):
    loss_event = torch.einsum('ik,k->i', event_tens, beta)

    XB = torch.einsum('ijk,k->ij', tensor, beta)
    loss_atrisk = -num_tied*torch.logsumexp(XB, dim=1)

    loss = torch.sum(loss_event + loss_atrisk)

    return -loss

Set up the optimisation: initialise the $\beta$ values and select the optimiser and learning rate.

In [7]:
beta = nn.Parameter(torch.zeros(len(Xnames))).float()

optimizer = optim.LBFGS([beta], lr=1)


def closure():
    optimizer.zero_grad()
    loss = get_loss(tensor, event_tens, num_tied, beta) #compute the loss
    loss.backward() #compute the derivative of the loss
    return loss

optimizer.step(closure)

tensor(4.1589, grad_fn=<NegBackward>)

In [8]:
print(beta.detach().numpy()) 

[0.34657338]


**And that is indeed the correct value for $\beta$!** The maximum likelihood estimate for this simple dataset is $\beta = \log(2)/2$.

In [9]:
np.log(2)/2

0.34657359027997264
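For readers who want to see where $\log(2)/2$ comes from, the estimate can be derived by hand for this dataset. This is a worked sketch; there are no tied event times here, so the Breslow and standard likelihoods coincide. The three event times contribute
$$\log \mathcal{L}(\beta) = \underbrace{\beta - \log\left(2e^{\beta} + 2\right)}_{t=1\ \text{(Bob)}} \;+\; \underbrace{0 - \log\left(2 + e^{\beta}\right)}_{t=3\ \text{(Sally)}} \;+\; \underbrace{\beta - \log\left(e^{\beta}\right)}_{t=10\ \text{(Ann)}}$$
where the last term is identically zero. Setting the derivative to zero gives
$$1 - \frac{e^{\beta}}{1 + e^{\beta}} - \frac{e^{\beta}}{2 + e^{\beta}} = 0 \;\;\Longrightarrow\;\; \frac{1}{1 + e^{\beta}} = \frac{e^{\beta}}{2 + e^{\beta}} \;\;\Longrightarrow\;\; e^{2\beta} = 2 \;\;\Longrightarrow\;\; \beta = \frac{\log 2}{2}$$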

Don't believe me that this is the correct answer? See `notebooks/Validation.ipynb` ;)
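A quick independent cross-check is also possible without PyTorch. The sketch below swaps in plain Newton-Raphson iterations on the score function instead of L-BFGS; `score_and_information` and `beta_newton` are names introduced for this illustration, and the code assumes a single covariate.

```python
import math

# (time, status, smoke) for Bob, Sally, James, Ann
rows = [(1, 1, 1), (3, 1, 0), (6, 0, 0), (10, 1, 1)]

def score_and_information(rows, beta):
    """First derivative and negative second derivative of the Breslow log-likelihood."""
    score, info = 0.0, 0.0
    for t in sorted({time for time, status, _ in rows if status == 1}):
        tied = [x for time, status, x in rows if time == t and status == 1]
        at_risk = [x for time, _, x in rows if time >= t]
        weights = [math.exp(x * beta) for x in at_risk]
        s0 = sum(weights)
        s1 = sum(x * w for x, w in zip(at_risk, weights))
        s2 = sum(x * x * w for x, w in zip(at_risk, weights))
        mean = s1 / s0  # risk-set weighted mean of the covariate
        score += sum(tied) - len(tied) * mean
        info += len(tied) * (s2 / s0 - mean ** 2)
    return score, info

beta_newton = 0.0
for _ in range(20):
    score, info = score_and_information(rows, beta_newton)
    step = score / info
    beta_newton += step
    if abs(step) < 1e-12:  # converged
        break

print(beta_newton)
```

Starting from zero, the iterations settle on the same value as $\log(2)/2$ to within floating-point tolerance.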

The above is exactly what is in the `TorchCox()` class in `torchcox/TorchCox.py`, and constitutes the entire fit procedure. You now understand exactly how it works, and it is verifiably correct.