# Survival Model for Non-Small Cell Lung Cancer
***
## Introduction

Lung cancer is the leading cause of cancer-related deaths worldwide, with an estimated 154,050 deaths in the US alone in 2018. I want to build a predictive model of one-year survival after diagnosis with non-small cell lung cancer (NSCLC) using both clinical and genomic data. An accurate prognosis of life expectancy is highly valued by patients, their families, and healthcare professionals. It helps determine the course of treatment and significantly aids end-of-life decision making.
<br/><br/>
The simulated dataset was provided by the US Department of Veterans Affairs. I will split the dataset into train, validation, and test sets. The survival model will be a Cox proportional hazards regression with elastic net regularization. For model evaluation, I have chosen to use both the concordance index and the average partial log-likelihood.
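
As a reference for how the concordance index is defined, the following is a minimal sketch and not the implementation used for evaluation in this notebook. The function name `harrell_c_index` and the toy inputs are hypothetical, and pairs with tied follow-up times are simply skipped, which is a simplification.

```python
import numpy as np


def harrell_c_index(durations, events, risk_scores):
    """Fraction of comparable patient pairs that the risk scores order correctly."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    risk = np.asarray(risk_scores, dtype=float)

    concordant = 0.0
    comparable = 0
    for i in range(len(durations)):
        if not events[i]:
            # A censored patient cannot be the earlier member of a comparable pair.
            continue
        for j in range(len(durations)):
            if durations[i] < durations[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # higher risk, shorter survival
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: follow-up in months, event flag (1 = dead), and a made-up risk score.
print(harrell_c_index([9, 13, 19, 36], [1, 1, 0, 1], [4.0, 3.0, 2.0, 1.0]))
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and 0.0 means every pair is ranked in reverse.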

### Data Description

The file clinical.csv contains clinical data on each patient. Its columns are as follows:
1.	ID: Unique identifier for the patient.
2.	Outcome: Whether the patient is alive or dead at the follow-up time.
3.	Survival.Months: The follow-up time in months.
4.	Age: The patient’s age in years at diagnosis.
5.	Primary.Site: Location of primary tumor.
6.	Histology: Tumor histology.
7.	Stage: Stage at diagnosis.
8.	Grade: Tumor grade.
9.	Num.Primaries: Number of primary tumors.
10.	Tumor.Size: Size of the tumor at diagnosis.
11.	T: Tumor Stage.
12.	N: Number of metastases to lymph nodes.
13.	M: Number of distant metastases.
14.	Radiation: Whether radiation took place (5) or not (0).
15.	Num.Mutations: The total number of mutations found in the tumor.
16.	Num.Mutated.Genes: The total number of genes with mutation.

The file genomics.csv contains information as to which genes were found to have a mutation in each patient’s tumor sequencing data. Only genes with a mutation are listed.
1.	ID: Unique identifier for the patient.
2.	Gene: The name of the gene.

## Data Wrangling

All necessary packages are imported, and display options and plotting styles are set.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedShuffleSplit
from lifelines import CoxPHFitter
from lifelines.utils.sklearn_adapter import sklearn_adapter
from lifelines.calibration import survival_probability_calibration
from fancyimpute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from mpl_toolkits.mplot3d import Axes3D
from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import percentileofscore

In [2]:
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 100)
pd.options.display.multi_sparse = False

In [3]:
sns.set(style='whitegrid', palette="deep", font_scale=1.1, rc={"figure.figsize": [8, 5]})

<br/>The clinical data is imported as a pandas DataFrame.

In [4]:
df_c = pd.read_csv('clinical.csv', index_col=0)

In [5]:
df_c.head()

ID,Outcome,Survival.Months,Age,Grade,Num.Primaries,T,N,M,Radiation,Stage,Primary.Site,Histology,Tumor.Size,Num.Mutated.Genes,Num.Mutations
1,Alive,9.0,67,4,0,UNK,2.0,,0,IV,Left Lower Lobe,Squamous cell carcinoma,1.4,8,8
2,Dead,19.0,73,2,0,UNK,2.0,0.0,5,IV,Right Upper Lobe,Adenocarcinoma,,2,2
3,Dead,13.0,72,3,0,2,2.0,0.0,0,IIIA,Right Upper Lobe,Adenocarcinoma,1.5,1,1
4,Dead,15.0,69,9,1,1a,0.0,1.0,0,IA,Right Upper Lobe,Adenocarcinoma,,4,4
5,Dead,10.0,76,9,0,UNK,,,0,IIIA,Left Hilar,Large-cell carcinoma,,3,3


The clinical DataFrame contains 190 rows (patients) and 15 columns (clinical characteristics). The target variables are Outcome (Survival status at follow-up) and Survival.Months (Follow-up time in months), while the remaining 13 columns are features for prediction.

In [6]:
df_c.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 190 entries, 1 to 190
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Outcome            190 non-null    object 
 1   Survival.Months    190 non-null    float64
 2   Age                190 non-null    int64  
 3   Grade              190 non-null    int64  
 4   Num.Primaries      190 non-null    int64  
 5   T                  190 non-null    object 
 6   N                  125 non-null    float64
 7   M                  94 non-null     float64
 8   Radiation          190 non-null    int64  
 9   Stage              190 non-null    object 
 10  Primary.Site       190 non-null    object 
 11  Histology          190 non-null    object 
 12  Tumor.Size         98 non-null     float64
 13  Num.Mutated.Genes  190 non-null    int64  
 14  Num.Mutations      190 non-null    int64  
dtypes: float64(4), int64(6), object(5)
memory usage: 23.8+ KB


Several features contain missing values. N (# of metastases to lymph nodes), M (# of distant metastases) and Tumor.Size (Tumor size at diagnosis) all contain np.NaN values. Meanwhile, T (Tumor stage) contains 'UNK' values, and Grade (Tumor grade) contains 9's.<br/><br/>
There are 96 missing values in Grade, 62 in T, 65 in N, 96 in M, and 92 in Tumor.Size.<br/><br/>
Some features contain errors. Stage (Stage at diagnosis) contains the value '1B' which is meant to be 'IB', while Primary.Site (Location of primary tumor) contains the value 'Righ Upper Lobe' which is meant to be 'Right Upper Lobe'.
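
The counts above combine true NaN values with sentinel codes. A minimal sketch of that tally on a toy frame follows; the frame `toy` and the dictionary `sentinels` are hypothetical stand-ins, not objects from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the three ways missingness is encoded in clinical.csv.
toy = pd.DataFrame({
    "Grade": [4, 2, 9, 9],
    "T": ["UNK", "2", "1a", "UNK"],
    "N": [2.0, np.nan, 0.0, np.nan],
})

# Sentinel code per column, as described in the text.
sentinels = {"Grade": 9, "T": "UNK"}

missing_counts = {}
for col in toy.columns:
    n_missing = int(toy[col].isna().sum())          # real NaN values
    if col in sentinels:
        n_missing += int((toy[col] == sentinels[col]).sum())  # coded as a sentinel
    missing_counts[col] = n_missing

print(missing_counts)
```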

In [7]:
# Unique values of each column in df_c
for col in df_c:
    print(col + ':', sorted(df_c[col].unique()))

Outcome: ['Alive', 'Dead']
Survival.Months: [9.0, 9.5, 10.0, 11.0, 13.0, 15.0, 16.0, 18.0, 19.0, 22.0, 23.0, 24.0, 26.0, 29.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 46.0, 50.0, 71.0]
Age: [56, 59, 60, 62, 63, 67, 68, 69, 70, 71, 72, 73, 74, 76, 77, 78, 80, 82, 83, 84]
Grade: [2, 3, 4, 9]
Num.Primaries: [0, 1]
T: ['1', '1a', '1b', '2', '2a', '2b', '3', '4', 'UNK']
N: [0.0, 2.0, nan, 1.0, 3.0]
M: [nan, 0.0, 1.0]
Radiation: [0, 5]
Stage: ['1B', 'IA', 'IB', 'IIA', 'IIB', 'IIIA', 'IIIB', 'IV', 'IVB']
Primary.Site: ['Both Lung', 'Left Hilar', 'Left Lower Lobe', 'Left Upper Lobe', 'Righ Upper Lobe', 'Right Hilar', 'Right Lower Lobe', 'Right Middle Lobe', 'Right Upper Lobe']
Histology: ['Adenocarcinoma', 'Large-cell carcinoma', 'Squamous cell carcinoma']
Tumor.Size: [1.4, nan, 1.0, 1.5, 1.6, 1.8, 1.9, 2.0, 2.5, 3.5, 3.6, 4.0, 4.4, 5.3, 5.4, 5.5, 8.0, 8.5, 9.0, 10.0]
Num.Mutated.Genes: [0, 1, 2, 3, 4, 5, 6, 7, 8]
Num.Mutations: [0, 1, 2, 3, 4, 5, 6, 7, 8]


In [8]:
# Value counts for each column in df_c
for col in df_c:
    print(df_c[col].value_counts())

Dead     150
Alive     40
Name: Outcome, dtype: int64
11.0    27
10.0    23
13.0    21
36.0    18
32.0    11
38.0     9
16.0     8
33.0     8
15.0     7
22.0     7
9.0      7
19.0     6
35.0     6
23.0     6
34.0     4
29.0     3
9.5      3
42.0     3
18.0     2
39.0     2
71.0     2
46.0     1
50.0     1
37.0     1
41.0     1
40.0     1
26.0     1
24.0     1
Name: Survival.Months, dtype: int64
62    27
76    26
67    22
72    20
71    14
73    13
69    13
70    10
63     8
82     7
74     7
56     5
83     4
68     4
77     4
80     2
78     1
60     1
59     1
84     1
Name: Age, dtype: int64
9    96
4    43
2    29
3    22
Name: Grade, dtype: int64
0    147
1     43
Name: Num.Primaries, dtype: int64
UNK    62
3      38
1a     26
4      23
2a     16
2      12
2b     10
1b      2
1       1
Name: T, dtype: int64
2.0    58
0.0    52
1.0     9
3.0     6
Name: N, dtype: int64
0.0    86
1.0     8
Name: M, dtype: int64
0    127
5     63
Name: Radiation, dtype: int64
IV      45
IIIA    43
IA

All missing values are changed to np.NaN, and the erroneous values are corrected. Also, for the binary variables Outcome and Radiation (whether the patient received radiation), I changed the values to [0, 1].

In [9]:
# Create a dictionary to find and replace values
dic_encode = {'Grade': {9: np.nan},
             'Radiation': {5: 1},
             'Stage': {'1B': 'IB'},
             'T': {'UNK': np.nan},
             'Primary.Site': {'Righ Upper Lobe': 'Right Upper Lobe'},
             'Outcome': {'Alive': 0, 'Dead': 1}
            }

df_c.replace(dic_encode, inplace=True)
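
To illustrate why a nested dictionary is used here, the sketch below shows that each inner mapping applies only to its own column. The toy frame is hypothetical, and the 9 placed in Radiation is made up purely to show that it is left untouched.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Grade": [9, 2, 9],
    "Radiation": [5, 0, 9],  # the 9 here is invented for the demonstration
})

# Outer keys are column names; inner dicts map old values to new values.
cleaned = toy.replace({"Grade": {9: np.nan}, "Radiation": {5: 1}})

print(cleaned)
```

Only the 9's in Grade become NaN, so a sentinel code in one column cannot accidentally erase a legitimate value in another.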

Cancer stages range from 1 to 4, and each stage contains sub-stages A and B. For the variables T and Stage, several values do not include a sub-stage. For consistency, I decided to remove the sub-stage information. I also converted the values of Stage to numbers, because it is an ordinal variable and numeric codes will allow me to examine its correlations during initial exploration.

In [10]:
# Create a dictionary to find and replace values
dic_encode_2 = {'T': {'1': 1,'1a': 1,'1b': 1,'2': 2,'2a': 2,'2b': 2, '3': 3,'4': 4},
              'Stage': {'IA': 1,'IB': 1,'IIA': 2,'IIB': 2,'IIIA': 3,'IIIB': 3, 'IV': 4, 'IVB': 4}
             }

df_c.replace(dic_encode_2, inplace=True)

In [11]:
df_c.describe()

,Outcome,Survival.Months,Age,Grade,Num.Primaries,T,N,M,Radiation,Stage,Tumor.Size,Num.Mutated.Genes,Num.Mutations
count,190.0,190.0,190.0,94.0,190.0,128.0,125.0,94.0,190.0,190.0,98.0,190.0,190.0
mean,0.789474,22.186842,70.173684,3.148936,0.226316,2.429688,1.144,0.085106,0.331579,2.910526,4.494898,2.684211,3.084211
std,0.40876,12.42014,6.146909,0.867048,0.419551,1.032416,1.029438,0.280536,0.472024,1.087395,3.050988,1.460327,1.697575
min,0.0,9.0,56.0,2.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
25%,1.0,11.0,67.0,2.0,0.0,2.0,0.0,0.0,0.0,2.0,2.0,2.0,2.0
50%,1.0,16.0,71.0,3.0,0.0,2.0,2.0,0.0,0.0,3.0,3.6,3.0,3.0
75%,1.0,34.0,74.0,4.0,0.0,3.0,2.0,0.0,1.0,4.0,8.0,3.0,4.0
max,1.0,71.0,84.0,4.0,1.0,4.0,3.0,1.0,1.0,4.0,10.0,8.0,8.0


After looking at the value frequencies, I have decided to combine the values 'Right Middle Lobe' and 'Both Lung' within Primary.Site into a new value named 'Other'. The value 'Other' will have 8 occurrences, which will pose fewer problems when splitting the dataset into train, validation and test sets.

In [12]:
# Create a dictionary to find and replace values
dic_encode_3 = {'Primary.Site': {'Right Middle Lobe': 'Other','Both Lung': 'Other'}}

df_c.replace(dic_encode_3, inplace=True)
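
The two categories above were merged by name. As an alternative sketch, rare categories can be collapsed with a frequency threshold; the threshold `min_count` and the toy series are assumptions and are not taken from the dataset.

```python
import pandas as pd

sites = pd.Series(
    ["Right Upper Lobe"] * 6
    + ["Left Hilar"] * 5
    + ["Both Lung"] * 2
    + ["Right Middle Lobe"] * 1
)

min_count = 5  # hypothetical cut-off for what counts as "rare"

counts = sites.value_counts()
rare = counts[counts < min_count].index

# Keep frequent labels as they are and relabel the rare ones as 'Other'.
collapsed = sites.where(~sites.isin(rare), "Other")

print(collapsed.value_counts())
```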

The genomic data is now imported as a pandas DataFrame.

In [13]:
df_g = pd.read_csv('genomics.csv', index_col=0)

In [14]:
df_g.head()

ID,Gene
1,AKT1
158,AKT1
88,ALK_Col1
132,ALK_Col1
18,ALK_Col2


The genomic DataFrame has 510 rows and one column, but it needs to be changed to have the same structure as the clinical DataFrame.

In [15]:
df_g.shape

(510, 1)

I now change the genomic DataFrame so that each patient's ID is only listed once on the index, while the names of the genes are the column labels. The values of the restructured DataFrame are 1's and 0's indicating whether or not a patient has a mutation of a given gene.

In [16]:
df_g['Value'] = 1
df_g = df_g.pivot_table(index='ID', columns='Gene', aggfunc=lambda x: int(x.any()), fill_value=0)
df_g.columns = df_g.columns.droplevel()
df_g.columns.name = None
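
For comparison, the same long-to-wide reshaping can be sketched with pd.crosstab, which avoids the helper Value column. This is an alternative under the assumption that ID and Gene are ordinary columns; the toy frame `long_df` is hypothetical.

```python
import pandas as pd

# Long format: one row per (patient, mutated gene); patient 3 lists KRAS_Col1 twice.
long_df = pd.DataFrame({
    "ID": [1, 1, 2, 3, 3, 3],
    "Gene": ["AKT1", "TP53_Col1", "TP53_Col1", "KRAS_Col1", "KRAS_Col1", "TSC2"],
})

# crosstab counts occurrences; clipping at 1 turns counts into 0/1 indicators.
wide = pd.crosstab(long_df["ID"], long_df["Gene"]).clip(upper=1)
wide.columns.name = None

print(wide)
```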

In [17]:
df_g.head()

ID,AKT1,ALK_Col1,ALK_Col2,APC,ATM_Col1,ATM_Col2,BRAF,CCND2,CDKN2A,CTNNB1,DNMT3A,EGFR,ERBB3,ERBB4,ESR1,FBXW7,FGFR1,FGFR3,FLT4,FOXL2,GNAS,HNF1A,KRAS_Col1,KRAS_Col2,MAP2K2,MET,MLH_Col2,MSH2,MSH6,NF_Col1,NF_Col2,NF_Col3,NF_Col5,NOTCH1,NTRK1,PDGFRB,PIK3CA,PIK3CB,POLD_Col2,PTCH1,PTEN,RB1,SMARCA4,SMARCB1,SMO,STK11,TERT,TP53_Col1,TP53_Col2,TSC2
1,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1


The genomic DataFrame contains 184 rows (patients) and 50 columns (genes). Not all 190 patients from the clinical dataset are included, because the clinical dataset listed 6 patients as having no gene mutations.

In [18]:
df_g.shape

(184, 50)

In [19]:
# The unique values in all of df_g
print(pd.unique(df_g.values.ravel()))

[1 0]


In [20]:
# The frequency of each gene mutation
freq_g = df_g.sum().sort_values(ascending=False)
freq_g.name = 'Frequency'
freq_g

TP53_Col1    117
KRAS_Col1     55
CDKN2A        45
TSC2          31
MSH2          30
STK11         23
APC           19
PIK3CB        11
NF_Col2       10
TERT          10
SMARCB1        9
MET            9
SMO            8
FBXW7          8
TP53_Col2      8
GNAS           7
MSH6           7
NF_Col3        7
PTEN           7
NTRK1          7
PIK3CA         7
EGFR           6
NF_Col1        5
PDGFRB         5
POLD_Col2      5
FGFR1          4
CTNNB1         4
RB1            4
PTCH1          4
ATM_Col1       4
NOTCH1         4
DNMT3A         3
CCND2          2
ALK_Col2       2
ALK_Col1       2
ERBB4          2
AKT1           2
FGFR3          2
FLT4           2
FOXL2          2
NF_Col5        2
ERBB3          1
ESR1           1
HNF1A          1
BRAF           1
ATM_Col2       1
KRAS_Col2      1
MLH_Col2       1
SMARCA4        1
MAP2K2         1
Name: Frequency, dtype: int64

Now the clinical and genomic DataFrames are combined horizontally into one.

In [21]:
df = df_c.join(df_g, how='outer')

In [22]:
df.head()

ID,Outcome,Survival.Months,Age,Grade,Num.Primaries,T,N,M,Radiation,Stage,Primary.Site,Histology,Tumor.Size,Num.Mutated.Genes,Num.Mutations,AKT1,ALK_Col1,ALK_Col2,APC,ATM_Col1,ATM_Col2,BRAF,CCND2,CDKN2A,CTNNB1,DNMT3A,EGFR,ERBB3,ERBB4,ESR1,FBXW7,FGFR1,FGFR3,FLT4,FOXL2,GNAS,HNF1A,KRAS_Col1,KRAS_Col2,MAP2K2,MET,MLH_Col2,MSH2,MSH6,NF_Col1,NF_Col2,NF_Col3,NF_Col5,NOTCH1,NTRK1,PDGFRB,PIK3CA,PIK3CB,POLD_Col2,PTCH1,PTEN,RB1,SMARCA4,SMARCB1,SMO,STK11,TERT,TP53_Col1,TP53_Col2,TSC2
1,0,9.0,67,4.0,0,,2.0,,0,4,Left Lower Lobe,Squamous cell carcinoma,1.4,8,8,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
2,1,19.0,73,2.0,0,,2.0,0.0,1,4,Right Upper Lobe,Adenocarcinoma,,2,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
3,1,13.0,72,3.0,0,2.0,2.0,0.0,0,3,Right Upper Lobe,Adenocarcinoma,1.5,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,1,15.0,69,,1,1.0,0.0,1.0,0,1,Right Upper Lobe,Adenocarcinoma,,4,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5,1,10.0,76,,0,,,,0,3,Left Hilar,Large-cell carcinoma,,3,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0


The np.NaN values of the 6 patients with no genetic mutations are filled with zeros.

In [23]:
df.loc[:,'AKT1':] = df.loc[:,'AKT1':].fillna(0)
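
The outer join followed by fillna turns the gene indicators into floats, as the 1.0 and 0.0 values in the output above show. One possible alternative, sketched here on hypothetical toy objects, is to reindex the gene matrix on the clinical IDs with a fill value, which keeps the integer dtype.

```python
import pandas as pd

# Patient IDs from the clinical table; patients 2 and 4 have no listed mutations.
clinical_ids = pd.Index([1, 2, 3, 4], name="ID")
genes = pd.DataFrame(
    {"AKT1": [1, 0], "TSC2": [0, 1]},
    index=pd.Index([1, 3], name="ID"),
)

# Add the missing patients as all-zero rows without introducing NaN.
genes_full = genes.reindex(clinical_ids, fill_value=0)

print(genes_full)
print(genes_full.dtypes)
```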

The full dataset now has 190 rows (patients) and 65 columns (clinical and genomic characteristics).

In [24]:
df.shape

(190, 65)

There are missing values in the following columns:

In [25]:
missing_cols = df.columns[df.isna().any()].tolist()
missing_cols

['Grade', 'T', 'N', 'M', 'Tumor.Size']

Before imputing missing values and performing feature selection, the dataset will be split into train, validation and test sets to prevent data leakage. I will use stratification so that the train, validation and test sets are more likely to be representative of the dataset as a whole. I want each subset to have a similar distribution of values for both Outcome and Survival.Months.<br/><br/>
I put the values of Survival.Months into 3 separate bins, with the bin edges at the tertiles (roughly the 33rd and 67th percentiles). Then I combine the original labels of the Outcome values ('Alive' and 'Dead') with the new bin labels of the Survival.Months values. This places each patient in one of 6 distinct groups (Alive-1, Alive-2, Alive-3, Dead-1, Dead-2, and Dead-3). Next, StratifiedShuffleSplit is used to create random train, validation and test indices stratified on these newly created groups. The train, validation and test sets are later created with these indices. The train-validation-test split is 70-15-15 percent.

In [44]:
# Create 6 groups for the patients so that they can be stratified by both Outcome and Survival.Months
months = df.loc[:, 'Survival.Months']
bins = np.array([min(months), months.quantile(1/3), months.quantile(2/3), max(months) + 1])
months_binned = np.digitize(months, bins)
months_binned = pd.Series(months_binned).astype(str)
outcome = df['Outcome'].replace({0: 'Alive', 1: 'Dead'})
patients_grouped = outcome + '-' + months_binned.values

In [45]:
# Distribution of groups
patients_grouped.value_counts()

Dead-2     52
Dead-1     50
Dead-3     48
Alive-3    20
Alive-2    10
Alive-1    10
Name: Outcome, dtype: int64
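
To make the binning step concrete, here is a small sketch of how np.digitize assigns tertile groups. The toy follow-up times are made up; the edge construction mirrors the cell above, including the `+ 1` on the upper edge so that the longest follow-up still falls in group 3.

```python
import numpy as np

months = np.array([9, 10, 11, 13, 16, 22, 33, 36, 71], dtype=float)

# Edges: minimum, the two tertile cut points, and one past the maximum.
edges = np.array([
    months.min(),
    np.quantile(months, 1 / 3),
    np.quantile(months, 2 / 3),
    months.max() + 1,
])

# With the default right=False, group i holds values with edges[i-1] <= x < edges[i].
groups = np.digitize(months, edges)

print(groups)
```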

In [46]:
# Create stratified train, validation and test set indices using StratifiedShuffleSplit twice
sss1 = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=88)

for train_index, val_test_index in sss1.split(patients_grouped, patients_grouped):
    pg_val_test = patients_grouped.iloc[val_test_index]
    # Convert positional indices to patient ID labels
    train_index = patients_grouped.index[train_index]

sss2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=88)

for val_index, test_index in sss2.split(pg_val_test, pg_val_test):
    pg_val = pg_val_test.iloc[val_index]
    pg_test = pg_val_test.iloc[test_index]
    val_index = pg_val.index
    test_index = pg_test.index
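
The two-stage split can be sketched on toy data as follows. The strata, ID range and variable names are hypothetical; the point is that StratifiedShuffleSplit returns positions, which are mapped back to ID labels through the index before being used with .loc.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical strata: 6 groups of 10 patients, labelled by patient IDs 101..160.
ids = pd.Index(range(101, 161), name="ID")
strata = pd.Series(np.repeat(["A1", "A2", "A3", "D1", "D2", "D3"], 10), index=ids)

# Stage 1: 70% train, 30% held out.
first = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=88)
train_pos, rest_pos = next(first.split(strata, strata))
train_ids = strata.index[train_pos]
rest = strata.iloc[rest_pos]

# Stage 2: split the held-out 30% evenly into validation and test.
second = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=88)
val_pos, test_pos = next(second.split(rest, rest))
val_ids = rest.index[val_pos]
test_ids = rest.index[test_pos]

print(len(train_ids), len(val_ids), len(test_ids))
```

Working with index labels throughout means the split does not depend on the IDs happening to start at 1 and run without gaps.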

The dataset has 3 ordinal features (Grade, T, and Stage) and 2 categorical features (Primary.Site and Histology), and I will create dummy variables for these 5 features before I begin modelling.<br/><br/>
However, there are still missing values within several features (Grade, T, N, M, and Tumor.Size). I will use iterative multivariate feature imputation to fill in the missing values. The imputer will model each feature with missing values as a function of the other features, and then use that estimate for imputation. All of the features must have numerical values before I begin imputation. Neither of the categorical features has missing values, nor does the ordinal variable Stage.<br/><br/>
So, I will first create dummy variables for these 3 features (Stage, Primary.Site, and Histology). I will drop the first category/unique value of each feature, since only k - 1 indicator variables are needed to represent the k categories of a categorical variable.

In [47]:
# Convert 3 features into dummy variables
df = pd.get_dummies(df, columns=['Stage','Primary.Site','Histology'], drop_first=True).copy()
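
As a small illustration of the k - 1 encoding, the sketch below encodes a toy Stage column with four levels into three indicator columns. The frame is hypothetical, and `dtype=int` is passed only so the output shows 0/1 regardless of pandas version.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [67, 73, 72, 69], "Stage": [1, 2, 3, 4]})

# drop_first=True removes the indicator for the first level (Stage 1),
# which becomes the reference category: all three indicators are 0.
encoded = pd.get_dummies(toy, columns=["Stage"], drop_first=True, dtype=int)

print(encoded)
```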

The train, validation and test sets are created with the indices generated previously.

In [48]:
train = df.loc[train_index]
val = df.loc[val_index]
test = df.loc[test_index]

I will keep the target variables (Outcome and Survival.Months) separate while performing imputation, so that the imputed test data is not influenced by this information. The training set is divided into y_train (containing just Outcome and Survival.Months) and X_train which contains the predictor variables. The validation and test sets are split similarly.

In [49]:
X_train = train.drop(['Outcome','Survival.Months'],axis=1)
y_train = train.loc[:, ['Outcome','Survival.Months']]

In [50]:
X_val = val.drop(['Outcome','Survival.Months'],axis=1)
y_val = val.loc[:, ['Outcome','Survival.Months']]

In [51]:
X_test = test.drop(['Outcome','Survival.Months'],axis=1)
y_test = test.loc[:, ['Outcome','Survival.Months']]

The imputer is initialized with a specified random seed for reproducibility. It is then fit on X_train. The imputer then transforms/imputes X_train and the result is assigned to X_train_imputed. The imputer remains fitted on X_train, and the same process of transformation and assignment takes place for both the validation and test sets.

In [52]:
imputer = IterativeImputer(random_state=88)

X_train_imputed = X_train.copy()
X_val_imputed = X_val.copy()
X_test_imputed = X_test.copy()

imputer.fit(X_train)

X_train_imputed.iloc[:, :] = imputer.transform(X_train)

X_val_imputed.iloc[:, :] = imputer.transform(X_val)

X_test_imputed.iloc[:, :] = imputer.transform(X_test)
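
The fit-on-train, transform-everywhere pattern can be sketched on synthetic data as below. This assumes scikit-learn's own IterativeImputer is an acceptable stand-in for the fancyimpute import used in this notebook; the synthetic frame and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

# Synthetic features: "b" is strongly related to "a", "c" is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(40, 1))
data = pd.DataFrame(
    np.hstack([base, 2 * base + rng.normal(scale=0.1, size=(40, 1)), rng.normal(size=(40, 1))]),
    columns=["a", "b", "c"],
)
data.iloc[::4, 1] = np.nan  # knock out every fourth value of "b"

X_tr, X_te = data.iloc[:30], data.iloc[30:]

imp = IterativeImputer(random_state=88)
imp.fit(X_tr)  # the imputation model is learned from the training rows only

# transform returns a NumPy array, so index and columns are restored explicitly.
X_tr_filled = pd.DataFrame(imp.transform(X_tr), index=X_tr.index, columns=X_tr.columns)
X_te_filled = pd.DataFrame(imp.transform(X_te), index=X_te.index, columns=X_te.columns)

print(X_te_filled.isna().sum())
```

Because the held-out rows are never passed to fit, their values cannot influence the imputation model.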

y_train and X_train_imputed are joined to form the imputed training set, train_imputed. The imputed validation and test sets are formed in the same way.<br/><br/>
The full imputed data set, df_imputed, is formed by concatenating train_imputed, val_imputed and test_imputed.

In [53]:
train_imputed = y_train.join(X_train_imputed, how='left')
val_imputed = y_val.join(X_val_imputed, how='left')
test_imputed = y_test.join(X_test_imputed, how='left')

# The rows of the resulting DataFrame are sorted
df_imputed = pd.concat([train_imputed, val_imputed, test_imputed]).sort_index()

The imputer imputes values with several decimals, so the imputed values need to be rounded. The values of Tumor.Size are rounded to one decimal, while the remaining imputed features are rounded to integers.

In [54]:
for col in ['Grade','T','N','M']:
    df_imputed.loc[:, col] = df_imputed[col].round()
    
df_imputed.loc[:, 'Tumor.Size'] = df_imputed['Tumor.Size'].round(1)
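
Rounding alone does not guarantee that an imputed value lands on a category that exists in the data. The imputed values shown below do stay in range, so this notebook only rounds; as an optional safeguard, a clipping step could be sketched like this, with made-up numbers.

```python
import numpy as np

observed = np.array([0.0, 1.0, 2.0, 3.0])        # categories seen in the raw data
imputed = np.array([-0.6, 0.6, 1.49, 2.4, 3.7])  # hypothetical raw imputer output

rounded = np.round(imputed)

# Pull any value that rounded outside the observed range back to the nearest bound.
clipped = np.clip(rounded, observed.min(), observed.max())

print(rounded)
print(clipped)
```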

The value frequencies of the features with missing values are shown both before and after imputation.

In [55]:
for col in missing_cols:
    print('Initial:')
    print(df[col].value_counts(), '\n')
    print('Post-imputation:')
    print(df_imputed[col].value_counts(), '\n')

Initial:
4.0    43
2.0    29
3.0    22
Name: Grade, dtype: int64 

Post-imputation:
3.0    84
4.0    65
2.0    41
Name: Grade, dtype: int64 

Initial:
3.0    38
2.0    38
1.0    29
4.0    23
Name: T, dtype: int64 

Post-imputation:
2.0    100
3.0     38
1.0     29
4.0     23
Name: T, dtype: int64 

Initial:
2.0    58
0.0    52
1.0     9
3.0     6
Name: N, dtype: int64 

Post-imputation:
2.0    68
1.0    64
0.0    52
3.0     6
Name: N, dtype: int64 

Initial:
0.0    86
1.0     8
Name: M, dtype: int64 

Post-imputation:
0.0    182
1.0      8
Name: M, dtype: int64 

Initial:
2.0     20
1.5     13
9.0     10
3.6      9
10.0     8
4.0      7
5.5      6
8.5      6
3.5      2
8.0      2
1.8      2
1.4      2
1.9      2
5.4      2
5.3      2
2.5      2
1.0      1
4.4      1
1.6      1
Name: Tumor.Size, dtype: int64 

Post-imputation:
4.6     92
2.0     20
1.5     13
9.0     10
3.6      9
10.0     8
4.0      7
5.5      6
8.5      6
3.5      2
8.0      2
1.8      2
1.4      2
1.9      2
5.4     

In [56]:
# Change the value type of 2 columns
df_imputed.loc[:, ['Grade','T']] = df_imputed.loc[:, ['Grade','T']].astype(np.int64)

Two ordinal features originally contained missing values, which have now been imputed. I will now create dummy variables for these features (Grade and T), dropping the first unique value of each feature.

In [57]:
df_imputed = pd.get_dummies(df_imputed, columns=['Grade','T'], drop_first=True)

# Change all value types to float
df_imputed = df_imputed.astype(np.float64)

The train, validation and test sets are taken from the current version of the dataset, df_imputed, by using their indices.

In [58]:
train = df_imputed.loc[train.index].copy()
val = df_imputed.loc[val.index].copy()
test = df_imputed.loc[test.index].copy()

There are now 76 variables in the dataset due to the creation of several dummy/indicator variables.

In [59]:
train.shape[1]

76

In [60]:
df_imputed.head()

ID,Outcome,Survival.Months,Age,Num.Primaries,N,M,Radiation,Tumor.Size,Num.Mutated.Genes,Num.Mutations,AKT1,ALK_Col1,ALK_Col2,APC,ATM_Col1,ATM_Col2,BRAF,CCND2,CDKN2A,CTNNB1,DNMT3A,EGFR,ERBB3,ERBB4,ESR1,FBXW7,FGFR1,FGFR3,FLT4,FOXL2,GNAS,HNF1A,KRAS_Col1,KRAS_Col2,MAP2K2,MET,MLH_Col2,MSH2,MSH6,NF_Col1,NF_Col2,NF_Col3,NF_Col5,NOTCH1,NTRK1,PDGFRB,PIK3CA,PIK3CB,POLD_Col2,PTCH1,PTEN,RB1,SMARCA4,SMARCB1,SMO,STK11,TERT,TP53_Col1,TP53_Col2,TSC2,Stage_2,Stage_3,Stage_4,Primary.Site_Left Lower Lobe,Primary.Site_Left Upper Lobe,Primary.Site_Other,Primary.Site_Right Hilar,Primary.Site_Right Lower Lobe,Primary.Site_Right Upper Lobe,Histology_Large-cell carcinoma,Histology_Squamous cell carcinoma,Grade_3,Grade_4,T_2,T_3,T_4
1,0.0,9.0,67.0,0.0,2.0,0.0,0.0,1.4,8.0,8.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
2,1.0,19.0,73.0,0.0,2.0,0.0,1.0,4.6,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,1.0,13.0,72.0,0.0,2.0,0.0,0.0,1.5,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
4,1.0,15.0,69.0,1.0,0.0,1.0,0.0,4.6,4.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
5,1.0,10.0,76.0,0.0,1.0,0.0,0.0,4.6,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0


In [61]:
df_imputed.describe()

,Outcome,Survival.Months,Age,Num.Primaries,N,M,Radiation,Tumor.Size,Num.Mutated.Genes,Num.Mutations,AKT1,ALK_Col1,ALK_Col2,APC,ATM_Col1,ATM_Col2,BRAF,CCND2,CDKN2A,CTNNB1,DNMT3A,EGFR,ERBB3,ERBB4,ESR1,FBXW7,FGFR1,FGFR3,FLT4,FOXL2,GNAS,HNF1A,KRAS_Col1,KRAS_Col2,MAP2K2,MET,MLH_Col2,MSH2,MSH6,NF_Col1,NF_Col2,NF_Col3,NF_Col5,NOTCH1,NTRK1,PDGFRB,PIK3CA,PIK3CB,POLD_Col2,PTCH1,PTEN,RB1,SMARCA4,SMARCB1,SMO,STK11,TERT,TP53_Col1,TP53_Col2,TSC2,Stage_2,Stage_3,Stage_4,Primary.Site_Left Lower Lobe,Primary.Site_Left Upper Lobe,Primary.Site_Other,Primary.Site_Right Hilar,Primary.Site_Right Lower Lobe,Primary.Site_Right Upper Lobe,Histology_Large-cell carcinoma,Histology_Squamous cell carcinoma,Grade_3,Grade_4,T_2,T_3,T_4
count,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0,190.0
mean,0.789474,22.186842,70.173684,0.226316,1.147368,0.042105,0.331579,4.545789,2.684211,3.084211,0.010526,0.010526,0.010526,0.1,0.021053,0.005263,0.005263,0.010526,0.236842,0.021053,0.015789,0.031579,0.005263,0.010526,0.005263,0.042105,0.021053,0.010526,0.010526,0.010526,0.036842,0.005263,0.289474,0.005263,0.005263,0.047368,0.005263,0.157895,0.036842,0.026316,0.052632,0.036842,0.010526,0.021053,0.036842,0.026316,0.036842,0.057895,0.026316,0.021053,0.036842,0.021053,0.005263,0.047368,0.042105,0.121053,0.052632,0.615789,0.042105,0.163158,0.1,0.352632,0.368421,0.089474,0.110526,0.042105,0.173684,0.131579,0.289474,0.142105,0.405263,0.442105,0.342105,0.526316,0.2,0.121053
std,0.40876,12.42014,6.146909,0.419551,0.860275,0.20136,0.472024,2.186359,1.460327,1.697575,0.102326,0.102326,0.102326,0.300793,0.143939,0.072548,0.072548,0.102326,0.426268,0.143939,0.12499,0.175338,0.072548,0.102326,0.072548,0.20136,0.143939,0.102326,0.102326,0.102326,0.188872,0.072548,0.454716,0.072548,0.072548,0.212987,0.072548,0.365606,0.188872,0.160496,0.223887,0.188872,0.102326,0.143939,0.188872,0.160496,0.188872,0.234161,0.160496,0.143939,0.188872,0.143939,0.072548,0.212987,0.20136,0.32705,0.223887,0.487693,0.20136,0.370486,0.300793,0.479052,0.483651,0.28618,0.314373,0.20136,0.379839,0.338926,0.454716,0.350081,0.49224,0.497949,0.475668,0.500626,0.401057,0.32705
min,0.0,9.0,56.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,11.0,67.0,0.0,0.0,0.0,0.0,3.6,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,16.0,71.0,0.0,1.0,0.0,0.0,4.6,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
75%,1.0,34.0,74.0,0.0,2.0,0.0,1.0,4.6,3.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0
max,1.0,71.0,84.0,1.0,3.0,1.0,1.0,10.0,8.0,8.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
