# Simulation of a clinical trial cohort to demonstrate trial design
Basic analysis to create simple (non-censored) survival data with a constant baseline hazard (exponential survival times) and hazard ratios for various risk factors. The data can be used to demonstrate confounding, randomisation and selection bias.

Patient factors to consider will be
* Age
* Sex
* Performance status
* Cancer stage
* GTV Volume
* Chemotherapy regimen

We will consider lung cancer and take hazard ratio (HR) estimates from cardiac toxicity papers (10.1016/j.ejca.2017.07.053). We also add an HR for response to the new radiotherapy regimen.
* Age HR - 1.01/year
* Sex HR - (ref female), 1.2 (male)
* Performance status HR - (ref 0), 1.3 (1), 1.55 (2), 1.82 (3+)
* Cancer stage HR - (ref 1), 1.3 (2), 1.5 (3), 2.0 (4)
* GTV Volume HR - 1.01 / cm3
* Chemotherapy regimen HR - (ref RT alone), 1.2 (Sequential), 1.4 (Concurrent)
* Treatment effect HR - 1.3


Load libraries

In [11]:
library(survival)
library(ggplot2)

Set basic parameters


*   Number of patients (nbrPats)
*   Hazards for different factors (hr_X)
*   Baseline hazard (h_0 - constant over time, so the cumulative baseline hazard is linear and survival times are exponential)



In [5]:
nbrPats = 1000
hr_age=1.01
hr_sex=1.2
hr_ps1=1.3
hr_ps2=1.55
hr_ps3=1.82
hr_stage2=1.3
hr_stage3=1.5
hr_stage4=2.0
hr_gtv=1.01
hr_chemoSeq=1.2
hr_chemoCon=1.4
hr_treat=1.3
h_0=0.1
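
As a quick check on these parameters, the sketch below converts the hazard ratios to log hazard ratios (the coefficients that enter a prognostic index) and computes the median survival implied by the constant baseline hazard. The vector names `hr` and `beta` are introduced here for illustration only. With h_0 = 0.1 the baseline median is log(2) / 0.1, about 6.9 time units, in whatever unit h_0 is expressed in.

```r
# Hazard ratios from the parameter cell, repeated so this sketch runs on its own
hr = c(age = 1.01, sex = 1.2, ps1 = 1.3, ps2 = 1.55, ps3 = 1.82,
       stage2 = 1.3, stage3 = 1.5, stage4 = 2.0, gtv = 1.01,
       chemoSeq = 1.2, chemoCon = 1.4, treat = 1.3)
h_0 = 0.1

# Log hazard ratios are the coefficients used in the prognostic index
beta = log(hr)

# With a constant hazard, S(t) = exp(-h_0 * t), so median survival is log(2) / h_0
median_baseline = log(2) / h_0

print(round(beta, 4))
print(median_baseline)
```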

Define the distribution of characteristics across the patient population

Start with weighted random assignment approximating distributions seen in Christie data
* Age - normal distribution with mean 70 and sd 10
* PS - sampled with weights 0.15, 0.45, 0.4, 0.1 (these sum to 1.1; `sample()` rescales them to sum to 1)
* stage - sampled with weights 0.5, 0.125, 0.3, 0.075
* sex - sampled with weights 0.6 (male), 0.4 (female)
* log gtv - normal distribution with mean 3 and sd 1 (gtv in cm3)

These weights will need to be tuned to give reasonable distributions, and the sampling will need to include some correlations between factors (e.g. stage and GTV).

In [30]:
# Performance status (0, 1, 2, 3+); sample() rescales the weights to sum to 1
ps=sample(c(0,1,2,3), nbrPats, replace=TRUE, prob=c(0.15, 0.45, 0.4, 0.1))
patients = data.frame(ps=ps)
# Age in years
age=rnorm(nbrPats,mean=70,sd=10)
patients$age=age
# Cancer stage (1 to 4)
stage=sample(c(1,2,3,4), nbrPats, replace=TRUE, prob=c(0.5, 0.125, 0.3, 0.075))
patients$stage=stage
# Sex: 60% male, 40% female
sex=sample(c('male','female'), nbrPats, replace=TRUE, prob=c(0.6, 0.4))
patients$sex=sex
# GTV in cm3, log-normally distributed
logGtv = rnorm(nbrPats,mean=3, sd=1)
patients$gtv=exp(logGtv)
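
The note above mentions that correlations such as stage and GTV still need to be added. One possible way to do this, shown as a minimal sketch, is to shift the mean of log GTV upwards with stage. The name `shiftPerStage` and its value 0.4 are hypothetical and purely illustrative, not estimated from any data.

```r
set.seed(1)  # fixed seed so the sketch is reproducible
nbrPats = 1000
stage = sample(c(1,2,3,4), nbrPats, replace=TRUE, prob=c(0.5, 0.125, 0.3, 0.075))

# Hypothetical shift in mean log-GTV per stage step (illustrative value only)
shiftPerStage = 0.4
logGtv = rnorm(nbrPats, mean = 3 + shiftPerStage * (stage - 1), sd = 1)
gtv = exp(logGtv)

# Median GTV by stage, to inspect how GTV changes with stage
print(tapply(gtv, stage, median))
# Rank correlation between stage and GTV
print(cor(stage, gtv, method = "spearman"))
```

Stage 1 keeps the original mean of 3, so the marginal GTV distribution for early-stage patients is unchanged.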

In [None]:
# Check the simulated distributions of each characteristic
ggplot(patients,aes(x=ps)) +
  geom_bar(stat='count')
ggplot(patients,aes(x=age)) +
  geom_histogram(bins=30)
ggplot(patients,aes(x=stage)) +
  geom_bar(stat='count')
ggplot(patients,aes(x=gtv)) +
  geom_histogram(bins=30)

Create a time vector covering the period we are interested in (3 years, in days)

In [3]:
time=seq(0,365*3)

Calculate cumulative baseline hazard

In [4]:
H_0 = h_0 * time
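
The baseline survival curve follows directly from the cumulative baseline hazard as S_0(t) = exp(-H_0(t)). The sketch below repeats the definitions above so it runs on its own; the name `S_0` is introduced here for illustration. Note that if h_0 = 0.1 is read as a hazard per day, baseline survival falls below 5% by t = 30, so the unit of h_0 may need revisiting.

```r
h_0 = 0.1
time = seq(0, 365*3)
H_0 = h_0 * time

# Baseline survival from the cumulative hazard: S_0(t) = exp(-H_0(t))
S_0 = exp(-H_0)

# For a constant hazard this is the exponential survival function
S_exp = pexp(time, rate = h_0, lower.tail = FALSE)
print(all.equal(S_0, S_exp))

print(head(data.frame(time, H_0, S_0)))
```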

Create an array of cumulative hazards as a function of time for different prognostic indices

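Cumulative hazard for patient i: H_i(t) = H_0(t) * exp(PI_i)

PI_i = sum over factors j of B_j * x_ij = prognostic index, where B_j = log(HR_j) and x_ij is the value of factor j for patient i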
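
Because the baseline hazard is constant, scaling it by exp(prognostic index) gives each patient a constant hazard, so non-censored survival times can be drawn by inverse-transform sampling. The following is a minimal sketch of that idea, not a definitive implementation. The column names `progIndex`, `hazard` and `survTime` are introduced here for illustration, age is assumed to be centred at 70 so that the baseline refers to a 70-year-old, and chemotherapy and treatment are left out because they have not been assigned yet.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hazard ratios and baseline hazard from the parameter cell
hr_age=1.01; hr_sex=1.2; hr_gtv=1.01; h_0=0.1
hr_ps=c(1, 1.3, 1.55, 1.82)     # PS 0 (ref), 1, 2, 3+
hr_stage=c(1, 1.3, 1.5, 2.0)    # stage 1 (ref), 2, 3, 4

# Small illustrative cohort (same sampling scheme as above, fewer patients)
n = 5
patients = data.frame(
  ps    = sample(0:3, n, replace=TRUE, prob=c(0.15, 0.45, 0.4, 0.1)),
  age   = rnorm(n, mean=70, sd=10),
  stage = sample(1:4, n, replace=TRUE, prob=c(0.5, 0.125, 0.3, 0.075)),
  sex   = sample(c('male','female'), n, replace=TRUE, prob=c(0.6, 0.4)),
  gtv   = exp(rnorm(n, mean=3, sd=1))
)

# Prognostic index: sum of log(HR) * covariate value
# Assumption: age is centred at 70 years
patients$progIndex = log(hr_age) * (patients$age - 70) +
  log(hr_sex) * (patients$sex == 'male') +
  log(hr_ps[patients$ps + 1]) +
  log(hr_stage[patients$stage]) +
  log(hr_gtv) * patients$gtv

# Patient-specific constant hazard, then inverse-transform sampling:
# if U ~ Uniform(0,1), T = -log(U) / hazard is exponential with that hazard
patients$hazard = h_0 * exp(patients$progIndex)
patients$survTime = -log(runif(n)) / patients$hazard

print(patients)
```

Since the data are non-censored, these times could then be passed to `Surv()` from the survival package with an event indicator of 1 for every patient.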
