# <i>Survival Analysis I: </i>
## <i>Introduction to Survival Analysis in Python</i>

## Applications

<p>Survival analysis studies survival times and the factors that influence them; it is a set of statistical methods for investigating the expected amount of time before an event of interest occurs. In medicine, it refers to patient survival times; in engineering, to time-to-failure; and in economics, to duration analysis. In business, it can be applied to assess outcomes such as customer and employee retention.</p>

## Objectives
The goal of a survival analysis is to estimate the survival distribution (i.e. function), compare two or more survival curves, and/or assess the effects of a number of factors on survival. It is a type of regression problem, but differs in that parts of the training data are censored, and can only be partially observed.

## Censoring

<p>Censoring occurs when we have only partial information about an individual's survival time, for example because contact with the subject was lost, or because the subject had not experienced the event of interest by the close of the study. Ignoring censoring biases the estimates: treating censored durations as event times, or discarding censored subjects altogether, generally leads to underestimated survival probabilities.</p>

<p>Right-censoring is the most common case, and means the true survival duration is greater than the observed duration. Left-censoring means the event is known only to have happened before some observed time, so information is missing at the start of the timeline; a typical example is an illness whose start date is not known. A third case is interval censoring, in which the event is known only to have occurred within some time interval.</p>
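Right-censoring can be illustrated with a small simulation. The sketch below assumes exponentially distributed true survival times and a fixed study end (`study_end`); the distribution, the numbers, and all variable names are illustrative assumptions and are not taken from any dataset used in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true survival times (unknown in practice), mean of 10 time units.
true_times = rng.exponential(scale=10.0, size=1000)

# Assumed study end: subjects still event-free at this point are right-censored.
study_end = 12.0

# What is actually recorded: the smaller of the true time and the censoring time.
observed = np.minimum(true_times, study_end)

# Event flag: True if the event was seen, False if the subject was censored.
event_observed = true_times <= study_end

print("fraction censored: ", 1 - event_observed.mean())
print("mean true time:    ", true_times.mean())
print("mean observed time:", observed.mean())
```

For a censored subject the recorded duration is only a lower bound on the true survival time, which is why survival data is stored as a pair (duration, event flag) rather than as a single number.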

## Basic Quantities

<p>Two key ways of specifying the survival curve are the survival function and the hazard function.</p>


<h3>The Survival Function</h3>

<p>Given a survival curve, we can compute quantities such as the expected remaining lifetime as a function of current age. $S(t)$ represents the probability of surviving past time $t$, where $T$ denotes the (random) survival time.</p>

<h3>$S(t) = Pr(T \gt t)$</h3>

<p>It has the properties of $S(0) = 1$ and $S(\infty) = 0$. Its relation to the non-negative and non-decreasing cumulative hazard function $H(t)$ is:</p>

<h3>$S(t) = \exp(-H(t))$</h3>

<p>If we were able to observe a survival time for every subject, the estimate $\hat{S}(t)$ would simply be the number of subjects who survived beyond time $t$, divided by the total number of subjects. In the presence of censoring, however, that numerator is unknown, because we cannot tell whether a subject censored before $t$ went on to survive beyond $t$.</p>
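For the uncensored case, this estimate is a one-liner. The sketch below uses a made-up list of five fully observed survival times; the function name `empirical_survival` is our own and does not come from either package.

```python
import numpy as np

def empirical_survival(times, t):
    """Fraction of subjects whose survival time exceeds t (no censoring)."""
    times = np.asarray(times, dtype=float)
    return np.mean(times > t)

# Hypothetical, fully observed survival times.
times = [2, 4, 4, 7, 10]

print(empirical_survival(times, 0))  # 1.0: all five subjects survive past 0
print(empirical_survival(times, 4))  # 0.4: two of five survive past 4
```

This simple ratio no longer works once some of the times are censored, which is what motivates the estimators discussed later in the series.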



<h3>The Hazard Function</h3>

<p>The survival function is often defined in terms of the hazard function, which is the instantaneous rate of failure: the probability, per unit of time, that a subject fails in a short interval just after time $t$, given that they have survived up to time $t$. Because it is a rate rather than a probability, it can exceed 1. The hazard function is related to the probability density function $f(t)$ and the survival function $S(t)$ by:</p>

<h3>$h(t) = \frac{f(t)}{S(t)}$</h3>

<p>In other words, the hazard at time $t$ is the probability that an event occurs in the neighborhood of time $t$, divided by the probability that the subject is alive at time $t$. The cumulative hazard function is the area under the hazard function up to time $t$, i.e.:</p>

<h3>$H(t) = \int_0^t h(u) ~du$</h3>

<p>The hazard function is the derivative of the cumulative hazard function:</p>

<h3>$h(t) = \frac{dH(t)}{dt}$</h3>

<p>The survival function may be defined in terms of the hazard function by:</p>

<h3>$S(t) = \exp \left( - \int_0^t h(u) ~du \right) = \exp(-H(t))$</h3>
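These relationships can be checked numerically. The sketch below assumes a Weibull distribution with arbitrarily chosen parameters (`shape`, `scale`), integrates its hazard with a hand-written trapezoidal rule to get the cumulative hazard, and compares the resulting survival function with the known closed form. The distribution and parameter values are illustrative assumptions only.

```python
import numpy as np

shape, scale = 1.5, 10.0          # illustrative Weibull parameters (assumed)
t = np.linspace(0.0, 30.0, 3001)  # time grid

# Hazard function h(t) of a Weibull distribution.
h = (shape / scale) * (t / scale) ** (shape - 1)

# Cumulative hazard H(t): running trapezoidal integral of h from 0 to t.
H = np.concatenate([[0.0], np.cumsum(0.5 * (h[1:] + h[:-1]) * np.diff(t))])

# Survival function obtained from the cumulative hazard.
S_numeric = np.exp(-H)

# Closed-form Weibull survival function, for comparison.
S_exact = np.exp(-((t / scale) ** shape))

print("largest absolute difference:", np.max(np.abs(S_numeric - S_exact)))
```

The difference is limited to the discretisation error of the integration, and `S_numeric` starts at 1 and never increases, as a survival function should.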

# Packages Available in Python

Python has packages <i>lifelines</i> and <i>scikit-survival</i>, which are easy to install and to use, and have extensive documentation.

- https://lifelines.readthedocs.io/en/latest/index.html

- https://scikit-survival.readthedocs.io/en/stable/index.html


Both packages provide numerous datasets to practice on:

- https://lifelines.readthedocs.io/en/latest/lifelines.datasets.html

- https://scikit-survival.readthedocs.io/en/stable/api/datasets.html

We will work with a breast cancer dataset that happens to be available in both packages.

# Loading the Breast Cancer Dataset in sksurv

In scikit-survival (sksurv), the loading function returns the dataset already partitioned into $X$ and $y$ values. The $y$ values contain two pieces of information: the observed time, and a boolean status flag (named `cens` in this dataset) that is `True` when the event was observed at that time and `False` when the time is right-censored.

In [None]:
# !pip install --upgrade scikit-survival

In [4]:
import pandas as pd
import numpy as np
from sksurv.datasets import load_gbsg2

data_x, data_y = load_gbsg2()
print(type(data_x), np.shape(data_x))
print(type(data_y), np.shape(data_y))
print(type(pd.DataFrame(data_y)), np.shape(pd.DataFrame(data_y)))

<class 'pandas.core.frame.DataFrame'> (686, 8)
<class 'numpy.ndarray'> (686,)
<class 'pandas.core.frame.DataFrame'> (686, 2)


Notice that the $y$ portion of the dataset is returned as a one-dimensional NumPy structured array (one record per subject), whereas the $X$ portion is returned as a dataframe. We can convert $y$ to a dataframe if desired.
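The array layout explains the shape `(686,)` seen above: each element is a record with named fields rather than a row of columns. The sketch below builds a small, made-up stand-in with the same field names (`cens`, `time`) to show how the fields can be accessed directly; the three records are hypothetical and are not rows of the real dataset.

```python
import numpy as np

# Hypothetical stand-in for the structured array returned by sksurv,
# using the same field names ('cens', 'time'). The values are made up.
y = np.array(
    [(True, 1814.0), (True, 2018.0), (False, 1233.0)],
    dtype=[("cens", "?"), ("time", "<f8")],
)

print(y.shape)                # (3,): one record per subject
print(y.dtype.names)          # field names of the structured array
print(y["time"])              # the time field as a plain float array
print(y["cens"])              # the boolean status flag
print(y["time"][y["cens"]])   # times of the records whose flag is True
```

Field access like this avoids the round trip through a dataframe when only one of the two fields is needed.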

In [5]:
data_x.head()

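|   | age | estrec | horTh | menostat | pnodes | progrec | tgrade | tsize |
|---|---|---|---|---|---|---|---|---|
| 0 | 70.0 | 66.0 | no | Post | 3.0 | 48.0 | II | 21.0 |
| 1 | 56.0 | 77.0 | yes | Post | 7.0 | 61.0 | II | 12.0 |
| 2 | 58.0 | 271.0 | yes | Post | 9.0 | 52.0 | II | 35.0 |
| 3 | 59.0 | 29.0 | yes | Post | 4.0 | 60.0 | II | 17.0 |
| 4 | 73.0 | 65.0 | no | Post | 1.0 | 26.0 | II | 35.0 |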


In [6]:
data_y = pd.DataFrame(data_y)
data_y.head()

|   | cens | time |
|---|---|---|
| 0 | True | 1814.0 |
| 1 | True | 2018.0 |
| 2 | True | 712.0 |
| 3 | True | 1807.0 |
| 4 | True | 772.0 |


# Loading the Breast Cancer Dataset in Lifelines

In [7]:
# !pip install --upgrade lifelines

In lifelines, the dataset is returned as a single dataframe, and it is up to the user to specify which columns contain the $y$ values.

In [8]:
import lifelines
data = lifelines.datasets.load_gbsg2()
print(np.shape(data))

(686, 10)


In [9]:
data.head()

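|   | horTh | age | menostat | tsize | tgrade | pnodes | progrec | estrec | time | cens |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | no | 70 | Post | 21 | II | 3 | 48 | 66 | 1814 | 1 |
| 1 | yes | 56 | Post | 12 | II | 7 | 61 | 77 | 2018 | 1 |
| 2 | yes | 58 | Post | 35 | II | 9 | 52 | 271 | 712 | 1 |
| 3 | yes | 59 | Post | 17 | II | 4 | 60 | 29 | 1807 | 1 |
| 4 | no | 73 | Post | 35 | II | 1 | 26 | 65 | 772 | 1 |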


In [10]:
# Split by column name rather than position: 'time' and 'cens' hold the outcome.
target_cols = ["time", "cens"]
data_x = data.drop(columns=target_cols)
data_y = data[target_cols]
data_x.head()

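|   | horTh | age | menostat | tsize | tgrade | pnodes | progrec | estrec |
|---|---|---|---|---|---|---|---|---|
| 0 | no | 70 | Post | 21 | II | 3 | 48 | 66 |
| 1 | yes | 56 | Post | 12 | II | 7 | 61 | 77 |
| 2 | yes | 58 | Post | 35 | II | 9 | 52 | 271 |
| 3 | yes | 59 | Post | 17 | II | 4 | 60 | 29 |
| 4 | no | 73 | Post | 35 | II | 1 | 26 | 65 |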


In [11]:
data_y.head()

|   | time | cens |
|---|---|---|
| 0 | 1814 | 1 |
| 1 | 2018 | 1 |
| 2 | 712 | 1 |
| 3 | 1807 | 1 |
| 4 | 772 | 1 |


# Types of Statistical Methods

<p>There are three types of statistical methods used to estimate survival and hazard functions:</p>
    <ol>
        <li>Non-Parametric (e.g. Kaplan-Meier estimator)</li>
        <li>Semi-Parametric (e.g. Cox proportional hazards model)</li>
        <li>Parametric (e.g. Weibull model, Exponential model)</li>
    </ol>

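<p>Parametric methods assume that survival times follow a distribution of a particular shape, determined by the parameters of its density function. Non-parametric methods make no such assumption and obtain an empirical estimate of the survival function from the data. Semi-parametric methods combine the two: the Cox model, for example, leaves the baseline hazard unspecified while describing the effect of covariates with parameters.</p>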
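To make the non-parametric idea concrete, here is a minimal hand-written sketch of the Kaplan-Meier product-limit estimate on a made-up sample of five subjects. The function name `kaplan_meier` and the data are our own illustrations; in practice one would use the fitters provided by lifelines or scikit-survival rather than this code.

```python
import numpy as np

def kaplan_meier(durations, events):
    """Return the distinct event times and the estimated survival just after each."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(durations[events])  # sorted distinct event times
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)          # still under observation at t
        died = np.sum((durations == t) & events)  # events occurring exactly at t
        s *= 1.0 - died / at_risk                 # product-limit update
        survival.append(s)
    return event_times, np.array(survival)

# Hypothetical sample: event flag 1 = event observed, 0 = right-censored.
durations = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]

times, surv = kaplan_meier(durations, events)
for t, s in zip(times, surv):
    print(f"S({t:g}) = {s:.2f}")
```

Censored subjects never trigger a drop in the curve, but they do count in the at-risk set up to their censoring time, which is how the estimator uses partial information instead of discarding it.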

<p>Subsequent articles in this series, which draw largely on the lifelines documentation, discuss each type of method in turn.</p>

# References

1. Lifelines Documentation. (n.d.). Lifelines. https://lifelines.readthedocs.io/en/latest/
2. Moore, D. F. (2016). Applied Survival Analysis Using R. Springer.
3. Scikit-Survival Documentation. (n.d.). https://scikit-survival.readthedocs.io/en/stable/user_guide/index.html
4. Udemy. (2019, April 1). Survival Analysis in R [Video]. Udemy. https://www.udemy.com/course/survival-analysis-in-r/
5. Survival Analysis Intuition and Implementation in Python. (2019, January 6). Towards Data Science. https://towardsdatascience.com/survival-analysis-intuition-implementation-in-python-504fde4fcf8e
6. Lewinson, E. (2020, August 17). Introduction to Survival Analysis: The Kaplan-Meier estimator. Towards Data Science. https://towardsdatascience.com/introduction-to-survival-analysis-the-kaplan-meier-estimator-94ec5812a97a
7. Lewinson, E. (2020, July 23). The Cox Proportional Hazards Model. Towards Data Science. https://towardsdatascience.com/the-cox-proportional-hazards-model-35e60e554d8f
8. Lewinson, E. (2020, August 23). Introduction to Survival Analysis: The Nelson-Aalen estimator. Towards Data Science. https://towardsdatascience.com/introduction-to-survival-analysis-the-nelson-aalen-estimator-9780c63d549d