# 05 - Simple Cox regression - Categorical variable

## Data

Source of data: Mayo Clinic trial in primary biliary cirrhosis (PBC), conducted between 1974 and 1984.

Data set: pbc.csv

In [1]:
library(readr)
pbc <- read_csv("data/pbc.csv",
                 show_col_types = FALSE)
head(pbc)


id,time,status,trt,age,sex,ascites,hepato,spiders,edema,bili,chol,albumin,copper,alk.phos,ast,trig,platelet,protime,stage
<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,400,2,1,58.76523,f,1,1,1,1.0,14.5,261,2.6,156,1718.0,137.95,172,190,12.2,4
2,4500,0,1,56.44627,f,0,1,1,0.0,1.1,302,4.14,54,7394.8,113.52,88,221,10.6,3
3,1012,2,1,70.07255,m,0,0,0,0.5,1.4,176,3.48,210,516.0,96.1,55,151,12.0,4
4,1925,2,1,54.74059,f,0,1,1,0.5,1.8,244,2.54,64,6121.8,60.63,92,183,10.3,4
5,1504,1,2,38.10541,f,0,1,1,0.0,3.4,279,3.53,143,671.0,113.15,72,136,10.9,3
6,2503,2,2,66.25873,f,0,1,0,0.0,0.8,248,3.98,50,944.0,93.0,63,.,11.0,3


## SAS program snippet

The following SAS code will be executed.

## R chunk

Packages will be loaded in the chunk where they are first needed.

A similar R program might look like this. It uses the coxph() function from the survival package.

The tidy() function from the broom package formats the output as a tibble for easier processing.

In [10]:
library(tidyverse)
library(broom)
library(survival)
pbc1 <- pbc %>% select(time, status, sex) %>% na.omit()
pbc1$status <- ifelse(pbc1$status == 0, 0, 1) # recode of status, all events equal 1, censored equal 0
my_cox <- coxph(formula = Surv(time, status) ~ sex, data = pbc1)
tidy(my_cox)

term,estimate,std.error,statistic,p.value
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
sexf,-0.361347,0.2088782,-1.729941,0.08364085


## Results

The output is divided into blocks to explain it and to reproduce it afterwards in the different languages.

### Block 1
![Block 1](img_screenshots_sex/block_1.png)

Row 1 refers to the dataset which was used in this procedure.

Row 2 gives the response variable or dependent variable for the Cox regression.

Row 3 gives the censoring variable.

Row 4 gives the censoring values.

Row 5 informs about the handling of ties. The default methods of the different statistical programs might differ.
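The difference in defaults can be made explicit in R. coxph() uses the Efron approximation unless told otherwise, whereas PROC PHREG in SAS defaults to the Breslow method, so setting ties = "breslow" should bring the R estimates closer to the SAS output. The following is a minimal sketch; it assumes that the pbc data shipped with the survival package matches data/pbc.csv, and the object names d, fit_efron and fit_breslow are illustrative only.

```r
library(survival)

# Assumption: survival::pbc stands in for data/pbc.csv
d <- survival::pbc[, c("time", "status", "sex")]
d$status <- ifelse(d$status == 0, 0, 1)  # any event = 1, censored = 0

# R default: Efron approximation for tied event times
fit_efron <- coxph(Surv(time, status) ~ sex, data = d)

# Breslow approximation, the default in SAS PROC PHREG
fit_breslow <- coxph(Surv(time, status) ~ sex, data = d, ties = "breslow")

# Method actually used by each fit, and the resulting coefficients
fit_efron$method
fit_breslow$method
cbind(efron = coef(fit_efron), breslow = coef(fit_breslow))
```

With few tied event times the two sets of estimates should differ only slightly.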

The number of observations used might be less than the number of observations read.
SAS performs a listwise deletion (complete case analysis) if missing values are present.

### R chunk for reproduction

In [11]:
nrow(pbc1)
nobs(my_cox) # Number of events

### Block 2
![Block 2](img_screenshots_sex/block_2.png)

The coding of the categorical variable is listed here.

This coding might differ from the coding in other statistical programming languages.

### R chunk for reproduction

In [16]:
table(pbc1$sex)
contrasts(as.factor(pbc1$sex))


  m   f 
 44 374 

  f
m 0
f 1


Note the difference between the coding in SAS and in R.

As a consequence, the estimated effect of sex points in the opposite direction: the sign of the coefficient is reversed.
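The effect of the reference level can be checked directly by fitting the model twice with different reference categories. This is a minimal sketch, assuming that the pbc data shipped with the survival package matches data/pbc.csv; the names d, d_m, d_f, fit_m and fit_f are illustrative.

```r
library(survival)

# Assumption: survival::pbc stands in for data/pbc.csv
d <- survival::pbc[, c("time", "status", "sex")]
d$status <- ifelse(d$status == 0, 0, 1)  # any event = 1, censored = 0

# Same data, two different reference levels for sex
d_m <- transform(d, sex = relevel(factor(sex), ref = "m"))
d_f <- transform(d, sex = relevel(factor(sex), ref = "f"))

fit_m <- coxph(Surv(time, status) ~ sex, data = d_m)  # coefficient "sexf": f versus m
fit_f <- coxph(Surv(time, status) ~ sex, data = d_f)  # coefficient "sexm": m versus f

# Coefficients have opposite signs; hazard ratios are reciprocals of each other
c(coef(fit_m), coef(fit_f))
c(exp(coef(fit_m)), exp(coef(fit_f)))
```

To match the SAS output, choose the reference level shown in the SAS coding table.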

### Block 3
![Block 3](img_screenshots_sex/block_3.png)

This block presents the number of events, the number of censored observations, and the percentage of censored observations.

It also states whether the model converged.

### R chunk for reproduction

In [None]:
#Todo
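One possible way to fill this chunk is to read the counts from the fitted coxph object. This is a minimal sketch, assuming that the pbc data shipped with the survival package matches data/pbc.csv; the names d, fit and block_3 are illustrative. coxph() does not return an explicit convergence flag as far as this sketch assumes; it warns if it runs out of iterations, and the iter component reports how many iterations were used.

```r
library(survival)

# Assumption: survival::pbc stands in for data/pbc.csv
d <- survival::pbc[, c("time", "status", "sex")]
d$status <- ifelse(d$status == 0, 0, 1)  # any event = 1, censored = 0
fit <- coxph(Surv(time, status) ~ sex, data = d)

n_total    <- fit$n       # observations used in the fit
n_event    <- fit$nevent  # number of events
n_censored <- n_total - n_event

block_3 <- data.frame(
  Total            = n_total,
  Event            = n_event,
  Censored         = n_censored,
  Percent_Censored = round(100 * n_censored / n_total, 2)
)
block_3

# Number of Newton-Raphson iterations used (default limit: coxph.control()$iter.max)
fit$iter
```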

### Block 4
![Block 4](img_screenshots_sex/block_4.png)

The model fit is described by the following statistics:
-  AIC (Akaike Information Criterion): Smaller is better.
-  SBC (Schwarz Bayesian (Information) Criterion): Smaller is better.
-  -2 Log L (negative two times the log-likelihood)
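The three statistics can also be derived by hand from the log-likelihood stored in the fitted object. This is a minimal sketch, assuming that the pbc data shipped with the survival package matches data/pbc.csv; the names d, fit, m2ll and aic_by_hand are illustrative. The loglik component holds two values, for the model without and with covariates.

```r
library(survival)

# Assumption: survival::pbc stands in for data/pbc.csv
d <- survival::pbc[, c("time", "status", "sex")]
d$status <- ifelse(d$status == 0, 0, 1)  # any event = 1, censored = 0
fit <- coxph(Surv(time, status) ~ sex, data = d)

# -2 Log L for the model without and with covariates
m2ll <- -2 * fit$loglik
names(m2ll) <- c("without_covariates", "with_covariates")
m2ll

# AIC = -2 Log L + 2 * number of estimated coefficients
p <- length(coef(fit))
aic_by_hand <- unname(m2ll["with_covariates"]) + 2 * p

c(AIC_by_hand = aic_by_hand, AIC = AIC(fit), BIC = BIC(fit))
```

In the glance() output of this document the BIC is consistent with a penalty of log(number of events) rather than log(number of observations), which is worth keeping in mind when comparing it with the SBC reported by SAS.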


### R chunk for reproduction

In [17]:
glance(my_cox)

n,nevent,statistic.log,p.value.log,statistic.sc,p.value.sc,statistic.wald,p.value.wald,statistic.robust,p.value.robust,r.squared,r.squared.max,concordance,std.error.concordance,logLik,AIC,BIC,nobs
<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
418,186,2.748578,0.09734096,3.024981,0.08199126,2.99,0.08364085,,,0.006553974,0.9919921,0.5182891,0.01261923,-1007.537,2017.075,2020.301,418


### Block 5
![Block 5](img_screenshots_sex/block_5.png)

These global tests test the null hypothesis that all regression coefficients are zero.

Three different chi-square tests are reported: the likelihood ratio test, the score test, and the Wald test.

### R chunk for reproduction

In [None]:
#Todo
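One possible way to fill this chunk is to collect the three global tests from summary(). This is a minimal sketch, assuming that the pbc data shipped with the survival package matches data/pbc.csv; the names d, fit, s and global_tests are illustrative.

```r
library(survival)

# Assumption: survival::pbc stands in for data/pbc.csv
d <- survival::pbc[, c("time", "status", "sex")]
d$status <- ifelse(d$status == 0, 0, 1)  # any event = 1, censored = 0
fit <- coxph(Surv(time, status) ~ sex, data = d)

s <- summary(fit)

# Each component is a named vector: test statistic, degrees of freedom, p-value
global_tests <- rbind(
  "Likelihood Ratio" = s$logtest,
  "Score"            = s$sctest,
  "Wald"             = s$waldtest
)
global_tests
```

Note that summary() rounds the Wald statistic to two decimals; the unrounded value is stored in fit$wald.test.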

### Block 6
![Block 6](img_screenshots_sex/block_6.png)

This block provides a global test of the effect of each categorical variable.

The global chi-square value is equal to the level chi-square value in the next block if the categorical variable is binary.
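For a binary variable this equality can be verified in R: tidy() and summary() report a z statistic, and its square is the Wald chi-square with one degree of freedom. This is a minimal sketch, assuming that the pbc data shipped with the survival package matches data/pbc.csv; the names d, fit, z, chi_sq and p_value are illustrative.

```r
library(survival)

# Assumption: survival::pbc stands in for data/pbc.csv
d <- survival::pbc[, c("time", "status", "sex")]
d$status <- ifelse(d$status == 0, 0, 1)  # any event = 1, censored = 0
fit <- coxph(Surv(time, status) ~ sex, data = d)

# z statistic of the single level, as reported by tidy() and summary()
z <- coef(fit) / sqrt(diag(vcov(fit)))

# Squared z = Wald chi-square with 1 degree of freedom
chi_sq  <- z^2
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

# Compare with the overall Wald statistic stored in the fit
c(level_chi_sq = unname(chi_sq), global_wald = fit$wald.test, p_value = unname(p_value))
```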

### R chunk for reproduction

In [12]:
tidy(my_cox)

term,estimate,std.error,statistic,p.value
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
sexf,-0.361347,0.2088782,-1.729941,0.08364085


### Block 7
![Block 7](img_screenshots_sex/block_7.png)

Column 1 "Parameter" lists the parameter in the model.

Column 2 "" gives the level of the Parameter.

Column 3 "DF" gives the degrees of freedom for every parameter.

Column 4 "Estimate" lists the regression estimate for every parameter, given that the other parameters are held constant.

Column 5 "Standard Error" gives the standard errors of the individual regression coefficients.

Column 6 "Chi-Square" tests the null hypothesis that the regression coefficient is zero given that the other predictors are in the model.

Column 7 "Pr > ChiSq" gives the p-value for the Chi-Square statistic.

Column 8 "Hazard Ratio" gives the hazard ratio, i.e. the exponentiated coefficient.
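The columns described above can be assembled into one table in R. This is a minimal sketch, assuming that the pbc data shipped with the survival package matches data/pbc.csv; the names d, fit, est, se, ci and block_7 are illustrative, and the 95% confidence limits for the hazard ratio are an addition that may not be part of the SAS block.

```r
library(survival)

# Assumption: survival::pbc stands in for data/pbc.csv
d <- survival::pbc[, c("time", "status", "sex")]
d$status <- ifelse(d$status == 0, 0, 1)  # any event = 1, censored = 0
fit <- coxph(Surv(time, status) ~ sex, data = d)

est <- coef(fit)              # regression estimates (log hazard ratios)
se  <- sqrt(diag(vcov(fit)))  # standard errors
ci  <- exp(confint(fit))      # 95% Wald confidence limits on the hazard ratio scale

block_7 <- data.frame(
  Parameter   = names(est),
  DF          = 1,
  Estimate    = unname(est),
  StdErr      = unname(se),
  ChiSq       = unname((est / se)^2),
  PrChiSq     = unname(pchisq((est / se)^2, df = 1, lower.tail = FALSE)),
  HazardRatio = unname(exp(est)),
  Lower95     = ci[, 1],
  Upper95     = ci[, 2],
  row.names   = NULL
)
block_7
```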

In [13]:
summary(my_cox)

Call:
coxph(formula = Surv(time, status) ~ sex, data = pbc1)

  n= 418, number of events= 186 

        coef exp(coef) se(coef)     z Pr(>|z|)  
sexf -0.3613    0.6967   0.2089 -1.73   0.0836 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     exp(coef) exp(-coef) lower .95 upper .95
sexf    0.6967      1.435    0.4627     1.049

Concordance= 0.518  (se = 0.013 )
Likelihood ratio test= 2.75  on 1 df,   p=0.1
Wald test            = 2.99  on 1 df,   p=0.08
Score (logrank) test = 3.02  on 1 df,   p=0.08
