---

## Cox Proportional Hazards and Random Survival Forests: Estimating the Effect of Covariates (Features) on In-Hospital Mortality in ICU Patients

This notebook extends our survival analysis on the PhysioNet Challenge data, focusing on the advanced modeling of in-hospital mortality risk for patients in Intensive Care Units (ICU). Building upon the foundation of the Kaplan-Meier estimator, we now introduce Cox Proportional Hazards and Random Survival Forests models to examine the impact of various covariates on survival times and to make individual-level predictions.

### Approach
We will first fit a Cox Proportional Hazards model, which lets us evaluate the effect of different factors (age, ICU unit, medical conditions, etc.) on the hazard, i.e. the instantaneous risk of an event (here, death). Keep in mind that **the Cox model assumes the effect of each factor on the hazard remains constant over time.**
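For reference, the standard form of the Cox model is written out here. The symbols ($h_0$ for the baseline hazard, $\beta$ for the coefficient vector, $x$ for a patient's covariates) are notation introduced for this note, not taken from the notebook's code:

$$
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^\top x\right)
$$

For two patients $a$ and $b$, the baseline hazard cancels in the ratio, which is why the hazard ratio does not depend on $t$:

$$
\frac{h(t \mid x_a)}{h(t \mid x_b)} = \exp\!\left(\beta^\top (x_a - x_b)\right)
$$

In particular, a one-unit increase in covariate $j$, with all other covariates held fixed, multiplies the hazard by $\exp(\beta_j)$.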

Then, we will explore Random Survival Forests, which, unlike the Cox model, do not assume proportional hazards and can capture non-linear relationships between the risk factors and the survival times.

Steps
### Cox Proportional Hazards Model:

1. **Model Fitting**: Apply the Cox model to understand the effects of covariates on survival.
2. **Assumption Verification**: Check the proportional hazards assumption integral to the model.
3. **Coefficient Interpretation**: Interpret the model coefficients to understand hazard ratios.
4. **Model Validation**: Validate the model's predictive performance using appropriate metrics.


### Random Survival Forests:

1. **Model Fitting and Tuning**: Fit the Random Survival Forest model and optimize its hyperparameters.
2. **Variable Importance**: Determine the importance of various covariates in predicting survival.
3. **Survival Prediction**: Predict survival probabilities for individual patients.
4. **Model Evaluation**: Evaluate the model's performance and compare it to the Cox model.
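A common way to compare the two models on equal footing is Harrell's concordance index, which only needs each model's risk ranking. The sketch below is a simplified pure-Python version meant to show the idea. The function name `harrell_c_index` is hypothetical, and pairs with tied durations are simply skipped, so its values can differ slightly from library implementations.

```python
def harrell_c_index(durations, risk_scores, events):
    """Fraction of comparable patient pairs whose predicted risks are ordered correctly.

    durations:   observed time for each patient
    risk_scores: model output where a higher score means higher predicted risk
    events:      1 if the event (death) was observed, 0 if censored
    """
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's observed time.
            if events[i] == 1 and durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier death had the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied predictions count as half
    return concordant / comparable


# Toy example: four patients, the third one censored.
durations = [2, 4, 6, 8]
events = [1, 1, 0, 1]
perfect = harrell_c_index(durations, [0.9, 0.7, 0.4, 0.1], events)
reversed_ranking = harrell_c_index(durations, [0.1, 0.4, 0.7, 0.9], events)
uninformative = harrell_c_index(durations, [0.5, 0.5, 0.5, 0.5], events)
print(perfect, reversed_ranking, uninformative)
```

Applying the same function to the risk scores of both models on the same held-out patients gives directly comparable numbers.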

Again, the objective is to gain a nuanced understanding of the factors influencing patient survival in the ICU, enabling more informed medical decision-making and improving care strategies through precise risk stratification and personalized survival predictions.

---

## 0. Data Ingestion
We will be using the [PhysioNet Challenge 2012 dataset](https://physionet.org/content/challenge-2012/1.0.0/), which contains data from 4,000 ICU patients. The dataset is provided as a gzip-compressed CSV file, which we will read into a pandas DataFrame for further analysis.

In [5]:
import pickle

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from IPython.display import display, Markdown
from lifelines import CoxPHFitter, KaplanMeierFitter, NelsonAalenFitter
from lifelines.plotting import add_at_risk_counts
from lifelines.statistics import logrank_test
from lifelines.utils import concordance_index as cindex
from sklearn.model_selection import train_test_split


In [18]:
ICU_df = pd.read_csv('PhysionetChallenge2012-set-a.csv.gz', compression='gzip')

In [19]:
ICU_df.shape

(4000, 120)

## 1. Model Fitting: Apply the Cox model to understand the effects of features (covariates) on survival


### 1.1. Data Preprocessing: Preparing the dataset for Cox analysis

Identifying relevant features:
- General descriptors: 'Age', 'Gender', 'Height', 'ICUType' (represented in the data by the indicator columns 'CCU', 'CSRU', 'SICU'), 'Weight'
- Outcome-related features: 'SAPS-I', 'SOFA'
- Top variables from Random Forest (RF) feature importance: 'GCS_last', 'Length_of_stay', 'BUN_last', 'GCS_median', 'GCS_highest', 'BUN_first', 'WBC_last', 'Bilirubin_first', 'PaO2_first'
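The covariate list in the next cell uses 'CCU', 'CSRU' and 'SICU' instead of 'ICUType', which suggests the unit type was one-hot encoded upstream with the medical ICU as the reference category. A minimal sketch of that encoding on a toy frame follows. The code-to-name mapping follows the PhysioNet 2012 variable description. The toy data and the names `toy`, `icu_names` and `dummies` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical). ICUType codes per the PhysioNet 2012 description:
# 1 = Coronary Care Unit, 2 = Cardiac Surgery Recovery Unit,
# 3 = Medical ICU, 4 = Surgical ICU
toy = pd.DataFrame({"ICUType": [1, 2, 3, 4, 3]})
icu_names = {1: "CCU", 2: "CSRU", 3: "MICU", 4: "SICU"}

# One indicator column per unit type.
dummies = pd.get_dummies(toy["ICUType"].map(icu_names)).astype(int)

# Drop one category so the indicators are not perfectly collinear;
# MICU then acts as the reference level in the Cox model.
dummies = dummies.drop(columns="MICU")
print(dummies)
```

With this encoding, each coefficient for 'CCU', 'CSRU' or 'SICU' is interpreted relative to medical ICU patients.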

In [21]:
# Selecting the specified covariates
selected_covariates = [
    'Age', 'Gender', 'Height', 'CCU', 'CSRU', 'SICU', 'Weight',
    'SAPS-I', 'SOFA', 'GCS_last', 'Length_of_stay', 'BUN_last', 
    'GCS_median', 'GCS_highest', 'BUN_first', 'WBC_last', 
    'Bilirubin_first', 'PaO2_first'
]

In [22]:
# Checking for missing values in the selected covariates
missing_values = ICU_df[selected_covariates].isnull().sum()

missing_values

Age                   0
Gender                3
Height             1894
CCU                   0
CSRU                  0
SICU                  0
Weight              331
SAPS-I                0
SOFA                  0
GCS_last             64
Length_of_stay        0
BUN_last             64
GCS_median           64
GCS_highest          64
BUN_first            64
WBC_last             92
Bilirubin_first    2282
PaO2_first          977
dtype: int64
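Dividing these counts by the 4,000 rows makes the amount of missing data easier to judge. The sketch below reuses the non-zero counts printed above and flags columns above a cutoff. The 20% cutoff is an assumption chosen for illustration, since the notebook does not state the rule it used.

```python
import pandas as pd

n_patients = 4000  # number of rows in ICU_df

# Non-zero missing counts copied from the output above.
missing_counts = pd.Series({
    "Gender": 3, "Height": 1894, "Weight": 331, "GCS_last": 64,
    "BUN_last": 64, "GCS_median": 64, "GCS_highest": 64, "BUN_first": 64,
    "WBC_last": 92, "Bilirubin_first": 2282, "PaO2_first": 977,
})

# In the notebook this is equivalent to ICU_df[selected_covariates].isnull().mean()
missing_fraction = missing_counts / n_patients

threshold = 0.20  # assumed cutoff, not taken from the notebook
high_missing = missing_fraction[missing_fraction > threshold].index.tolist()
print(missing_fraction.round(3))
print(high_missing)
```

Under this assumed cutoff, the flagged columns are 'Height', 'Bilirubin_first' and 'PaO2_first', while 'Weight' (about 8% missing) is kept.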

In [23]:
# Updating the list of covariates by removing variables with high missingness
# (Height, Bilirubin_first, PaO2_first)
selected_covariates = [
    'Age', 'Gender', 'Weight', 'CCU', 'CSRU', 'SICU',
    'SAPS-I', 'SOFA', 'GCS_last', 'Length_of_stay', 'BUN_last', 
    'GCS_median', 'GCS_highest', 'BUN_first', 'WBC_last'
]

# Creating a new DataFrame with the selected covariates
filtered_data = ICU_df[selected_covariates]

# Dropping rows with any missing values in the remaining covariates
filtered_data = filtered_data.dropna()

# Checking the shape of the data after handling missing values
data_shape_after = filtered_data.shape

data_shape_after


(3588, 15)

## 3. Kaplan-Meier Estimation
Implementing the Kaplan-Meier method to estimate survival probabilities over time for the ICU patients.
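To make the estimator concrete, here is a minimal pure-Python sketch of the product-limit calculation: at each observed event time, the running survival probability is multiplied by the fraction of at-risk patients who survive that time. The function name `kaplan_meier` and the toy data are hypothetical. For real analyses, the `KaplanMeierFitter` imported above does this and also provides confidence intervals.

```python
def kaplan_meier(durations, events):
    """Return {event_time: S(t)} using the product-limit estimator.

    durations: observed time for each patient
    events:    1 if death was observed at that time, 0 if censored
    """
    survival = 1.0
    curve = {}
    for t in sorted(set(durations)):
        # Patients still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Observed deaths exactly at time t (censored patients do not count).
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve[t] = survival
    return curve


# Toy example: five patients, two of them censored (at t=2 and t=4).
km_curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
print(km_curve)
```

Censored patients lower the at-risk count at later times without producing a drop in the curve, which is why estimates near the end of follow-up rest on few patients.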

## 4. Survival Curve Plotting: Visualizing the survival curve to interpret how survival probabilities change over time

## 5. Result Interpretation: Discussing the implications of the survival curve in the context of in-hospital mortality in ICU settings
Interpreting the Kaplan-Meier survival plot to understand what it indicates about the patient cohort:

1. **Early Decline**: What factors might contribute to early mortality in ICU patients (e.g. severity of illness, emergency admissions)?
2. **Plateau Phase**: What factors may help explain the plateau in the survival curve (e.g. patients with a better long-term prognosis)?
3. **Long-term Survivors**: What does it indicate when the survival curve becomes flat (e.g. patients who survived for a long duration)?
4. **Censoring**: What is the impact of censored data on the survival estimates, especially towards the end of the study period?
5. **Clinical Implications**: **SO WHAT!?** How can these survival probabilities inform clinical practice, patient counseling, and healthcare policy?