# Survival analysis

## Aim

To learn how to compute Kaplan-Meier survival curves and test for a difference in the survival probabilities for different groups.

In [None]:
library(tidyverse)

## Reading in the dataset and identifying relevant variables

Throughout this session we will be analysing a dataset from a cohort study of 318 men carried out in Trinidad.
To read in the dataset, type:

In [None]:
library(haven)

In [None]:
trinidad_df <- read_dta("Data_files-20211113/Trinidad.dta")

In [None]:
head(trinidad_df)

To find out how many deaths from any cause occurred during the follow-up period, type:

In [None]:
library(gmodels)
CrossTable(trinidad_df$death)

There were a total of 88 deaths (28%) from any cause.

To find out how many men entered the study with CHD, type:

In [None]:
CrossTable(trinidad_df$chdstart)

38 men entered the study with CHD. Note: chdstart was recorded for only 290/318 men, i.e. there are 28 missing values.
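Missing values matter for the later steps because, under R's default `na.action`, model-fitting functions such as `survfit` drop rows with a missing grouping variable. As a minimal sketch, missing values can also be counted in base R. The vector `chd_example` is hypothetical and only stands in for a column such as `chdstart`:

```r
# Hypothetical 0/1 indicator with missing entries (not the Trinidad data)
chd_example <- c(0, 1, NA, 0, 0, NA, 1, 0)

# Count the missing values directly
n_missing <- sum(is.na(chd_example))

# Tabulate the values, with NA shown as its own category when present
counts <- table(chd_example, useNA = "ifany")

print(n_missing)
print(counts)
```

The same two calls could be applied to `trinidad_df$chdstart` to confirm the number of missing values reported above.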

#### To analyse data from cohort studies focussing on the time to an event, we will use the Kaplan-Meier method. It gives a graphical description of the cohort's survival pattern, built from each individual's follow-up time and whether that follow-up ended in an event or in censoring.
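For reference, the Kaplan-Meier estimate is usually written in the product-limit form below. The notation is introduced here for illustration and is not taken from the practical: $t_j$ are the distinct times at which events occur, $d_j$ is the number of events at $t_j$, and $n_j$ is the number of individuals still at risk just before $t_j$.

```latex
\hat{S}(t) = \prod_{j:\, t_j \le t} \left( 1 - \frac{d_j}{n_j} \right)
```

Censored individuals contribute to $n_j$ up to the time they are censored and are removed from the risk set afterwards.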

## Kaplan-Meier survival probabilities

To obtain Kaplan-Meier estimates of the survival probabilities, we must first define the follow-up time for each individual, taking care to specify when individuals first became at risk. Here follow-up runs from `timein` to `timeout` and is expressed in years. Type:

In [None]:
trinidad_df_2 <- trinidad_df %>%
    # Follow-up in years; columns are referenced directly inside mutate()
    mutate(followup_time = as.numeric(difftime(timeout,
                                               timein,
                                               units = "days")) / 365.25)
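The conversion from dates to years of follow-up can be checked on a small example before it is applied to the full dataset. The sketch below uses a hypothetical data frame, `example_dates`, with made-up entry and exit dates, and assumes both columns are stored as `Date` values:

```r
# Hypothetical entry and exit dates for three individuals (illustration only)
example_dates <- data.frame(
  timein  = as.Date(c("1977-03-01", "1977-06-15", "1978-01-10")),
  timeout = as.Date(c("1987-03-01", "1980-06-15", "1995-01-10"))
)

# Follow-up in days, then converted to years
followup_days <- as.numeric(difftime(example_dates$timeout,
                                     example_dates$timein,
                                     units = "days"))
followup_years <- followup_days / 365.25

print(round(followup_years, 2))

# Sanity check: nobody should leave the study before entering it
stopifnot(all(followup_years > 0))
```

A similar check on `followup_time` in `trinidad_df_2` (for example with `summary()`) would flag negative or implausibly long follow-up before any curves are fitted.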

To produce a Kaplan-Meier plot for the survival of men with heart disease at entry to the study, type:

In [None]:
library(survival)

In [None]:
library(survminer)

In [None]:
survfit(Surv(time = followup_time, 
             event = death) ~ chdstart, 
        data = trinidad_df_2 %>% 
            # Keep only men with heart disease at entry
            filter(chdstart == 1)) %>%
    ggsurvplot(conf.int = FALSE)

Notice that this plot is a step function: the estimate drops at each time an event occurs and stays constant in between. Censored observations do not produce a step; they only reduce the number at risk at later event times. Approximately what is the survival probability at 10 years for men who entered the study with heart disease?

##### From the plot we can see this is about 0.55.
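Reading a value off the plot is approximate. As a sketch, `summary()` on a `survfit` object with the `times` argument returns the estimate at a chosen time. The data frame `toy` below is hypothetical and is not the Trinidad cohort. It is small enough that the estimate can be checked by hand:

```r
library(survival)

# Hypothetical follow-up times (years) and death indicators
# (1 = died, 0 = censored); illustration only
toy <- data.frame(
  followup_time = c(2, 3, 3, 5, 7, 8, 11, 12, 12, 14),
  death         = c(1, 0, 1, 1, 0, 1,  0,  1,  0,  0)
)

# Single Kaplan-Meier curve for the whole toy sample
fit <- survfit(Surv(time = followup_time, event = death) ~ 1, data = toy)

# Estimated survival probability at 10 years
at_10 <- summary(fit, times = 10)
print(at_10$surv)

# Hand calculation: deaths occur at 2, 3, 5 and 8 years, with
# 10, 9, 7 and 5 individuals at risk just before each of them
by_hand <- (1 - 1/10) * (1 - 1/9) * (1 - 1/7) * (1 - 1/5)
print(by_hand)
```

In the notebook, the same `summary(..., times = 10)` call could be applied to the `survfit` object fitted to the men with heart disease at entry, instead of piping it straight into `ggsurvplot`.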

To produce the Kaplan-Meier survival curves for men with and without heart disease at entry to the study, we must specify `chdstart` as the stratifying variable. Type:

In [None]:
survfit(Surv(time = followup_time, 
             event = death) ~ chdstart, 
        data = trinidad_df_2) %>%
    ggsurvplot(conf.int = FALSE)

The two survival curves are presented on the same plot and show that the men who entered the study without heart disease had higher cumulative survival probabilities. A formal statistical test of the difference between these curves is the **log-rank test**.

## Log-rank test

The log-rank test is used to test the null hypothesis of no difference between two survival curves. In `R` we use the `survdiff` function from the `survival` package and put the grouping variable on the right-hand side of the formula. So, to compare the survival probabilities of men with and without heart disease at entry to the study, type:

In [None]:
survdiff(Surv(time = followup_time,
              event = death) ~ chdstart,
         data = trinidad_df_2)

The log-rank test provides strong evidence against the null hypothesis (P=0.01) and we can conclude that cumulative survival differs between the two groups of men. However, the log-rank test does not enable us to quantify the difference in survival probabilities.
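The printed output of `survdiff` rounds the P-value. As a sketch, it can be recomputed from the chi-squared statistic stored in the returned object, and the observed and expected numbers of deaths show which group fared worse. The data frame `toy` and its `group` variable are hypothetical and stand in for the notebook's data:

```r
library(survival)

# Hypothetical two-group data (illustration only, not the Trinidad cohort)
toy <- data.frame(
  followup_time = c(2, 3, 5, 8, 12, 14, 1, 2, 4, 6, 7, 9),
  death         = c(1, 0, 1, 1,  0,  0, 1, 1, 1, 1, 0, 1),
  group         = rep(c(0, 1), each = 6)
)

lr <- survdiff(Surv(time = followup_time, event = death) ~ group, data = toy)

# P-value from the chi-squared statistic on (number of groups - 1)
# degrees of freedom
df_lr <- length(lr$n) - 1
p_value <- 1 - pchisq(lr$chisq, df = df_lr)

# Observed and expected deaths per group indicate the direction
# of any difference
print(data.frame(observed = lr$obs, expected = lr$exp))
print(p_value)
```

A group with more observed than expected deaths has poorer survival than the null hypothesis would predict, but this comparison still does not give an effect estimate.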

# Review exercise

#### 1) Obtain Kaplan-Meier survival curves for current smokers and current non-smokers. Is the survival probability for current smokers always less than that for current non-smokers?

In [None]:
survfit(Surv(time = followup_time, 
             event = death) ~ current, 
        data = trinidad_df_2) %>%
    ggsurvplot(conf.int = FALSE)

Survival probabilities are similar in the two groups for the first 4 years; after that they are higher for current non-smokers.

#### 2) Use a log-rank test to test the difference in the survival curves you produced in 1. Interpret your output.

In [None]:
survdiff(Surv(time = followup_time,
              event = death) ~ current,
         data = trinidad_df_2)

There is moderate evidence of a difference in survival between the two groups.