# Survival / Time-to-Event Template – Cox Models & Basics

This notebook is a template for **time-to-event** problems, also called **survival analysis**:

- Time until churn  
- Time until injury  
- Time until hardware failure  
- Time until some event of interest  

Key twist vs standard regression:

- Each row has:
  - `duration` (how long we observed)  
  - `event` (1 if event happened, 0 if censored / not yet)  
- Some samples are **censored**: the event was not seen within the observation window, so `duration` is only a lower bound on the true time-to-event.
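
As a rough illustration of where censoring comes from, the sketch below derives `duration` and `event` from raw start and end dates. The column names `start_date` and `end_date`, the cutoff date, and the choice of days as the time unit are assumptions made for this example; they are not part of the input format this template expects.

```python
import pandas as pd

# Hypothetical raw records: end_date is missing when the event has not been seen yet.
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "start_date": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-03-01"]),
    "end_date": pd.to_datetime(["2023-01-31", None, "2023-06-29"]),
})

# Assumed end of the observation window.
cutoff = pd.Timestamp("2023-06-01")

# The event counts only if it happened on or before the cutoff.
raw["event"] = (raw["end_date"].notna() & (raw["end_date"] <= cutoff)).astype(int)

# Observed rows run to their end date; censored rows run to the cutoff.
observed_until = raw["end_date"].where(raw["event"] == 1, cutoff)
raw["duration"] = (observed_until - raw["start_date"]).dt.days

print(raw[["id", "duration", "event"]])
```

Row 3 does have an end date, but it falls after the cutoff, so within this observation window it is treated as censored just like row 2.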

We use the **lifelines** library (if installed) for Kaplan–Meier curves and Cox models.


In [None]:
# ========== 1. Imports & Config (Survival Analysis) ==========

from pathlib import Path

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["figure.dpi"] = 100

# Try to import lifelines (for survival models)
try:
    from lifelines import KaplanMeierFitter, CoxPHFitter
    LIFELINES_AVAILABLE = True
except ImportError:
    LIFELINES_AVAILABLE = False
    print("lifelines is not installed. Install it via 'pip install lifelines' to use survival models.")

# ---- Config ----
DATA_DIR = Path("../input")
DATA_FILE = "survival_data.csv"   # must contain duration & event columns

DURATION_COL = "duration"
EVENT_COL = "event"               # 1 if event occurred, 0 if censored
ID_COL = "id"                     # optional identifier

RANDOM_STATE = 42


In [None]:
# ========== 2. Load Data & Basic Checks ==========

def load_data(data_dir: Path = DATA_DIR, data_file: str = DATA_FILE) -> pd.DataFrame:
    path = data_dir / data_file
    if not path.exists():
        raise FileNotFoundError(f"Data file not found: {path}")
    df = pd.read_csv(path)
    print("Data shape:", df.shape)
    display(df.head())
    return df


df = load_data()

# Fail early, and name exactly which required columns are missing.
missing_cols = [c for c in (DURATION_COL, EVENT_COL) if c not in df.columns]
if missing_cols:
    raise ValueError(f"Expected columns not in dataframe: {missing_cols}")

print(df[[DURATION_COL, EVENT_COL]].describe(include="all"))
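
Beyond `describe`, a few survival-specific checks can be worth running before any model is fit. The helper below is a minimal sketch; `check_survival_columns` is a hypothetical name introduced here, and the toy frame only exists to show the output.

```python
import pandas as pd

def check_survival_columns(frame, duration_col="duration", event_col="event"):
    """Return basic data-quality counts for a survival dataset (hypothetical helper)."""
    durations = frame[duration_col]
    events = frame[event_col]
    return {
        "n_rows": len(frame),
        "n_missing": int(durations.isna().sum() + events.isna().sum()),
        "n_negative_durations": int((durations < 0).sum()),
        # Anything other than 0 or 1 in the event column is suspicious.
        "n_invalid_events": int((~events.dropna().isin([0, 1])).sum()),
        "censoring_rate": float((events == 0).mean()),
    }

# Toy data with one negative duration and one invalid event code.
toy = pd.DataFrame({"duration": [5.0, 12.0, -1.0, 7.0], "event": [1, 0, 1, 2]})
report = check_survival_columns(toy)
print(report)
```

On the real data this would be called as `check_survival_columns(df, DURATION_COL, EVENT_COL)`. A very high censoring rate is not an error, but it does mean few events are available to estimate the model.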


### 3️⃣ Kaplan–Meier Curves (Non-Parametric Survival)

The **survival function** is:

- `S(t) = P(T > t)` – probability event has **not** occurred by time `t`

Kaplan–Meier gives a non-parametric estimate of this curve.  
Useful for visualizing time-to-event distribution and comparing groups.
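
To make the estimate less of a black box, here is a small hand-rolled version of the product-limit formula `S(t) = prod(1 - d_i / n_i)`, where `d_i` is the number of events at time `t_i` and `n_i` the number still at risk just before it. This is an illustrative sketch on toy data, not a replacement for `KaplanMeierFitter` (it has no confidence intervals and is quadratic in the number of rows).

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time via the product-limit formula."""
    event_times = sorted({t for t, e in zip(durations, events) if e == 1})
    survival = 1.0
    curve = []
    for t in event_times:
        # n_i: subjects whose observed time is at least t (censored ties still count as at risk).
        at_risk = sum(1 for d in durations if d >= t)
        # d_i: events observed exactly at t.
        n_events = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        survival *= 1.0 - n_events / at_risk
        curve.append((t, survival))
    return curve

# Toy data: five subjects, two of them censored (event = 0).
toy_durations = [2, 3, 3, 5, 8]
toy_events = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(toy_durations, toy_events)
for t, s in km_curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Censored subjects never cause a drop in the curve themselves; they only shrink the at-risk count for later event times.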


In [None]:
if LIFELINES_AVAILABLE:
    kmf = KaplanMeierFitter()

    T = df[DURATION_COL]
    E = df[EVENT_COL]

    kmf.fit(T, event_observed=E, label="overall")
    kmf.plot()
    plt.title("Overall survival function (Kaplan–Meier)")
    plt.xlabel("Time")
    plt.ylabel("Survival probability S(t)")
    plt.show()
else:
    print("Skipping Kaplan–Meier plots: lifelines not available.")


You can also stratify by a categorical feature (e.g., position, customer segment)  
to see if survival curves differ between groups.
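
Eyeballing curves can be misleading, so a formal comparison is often paired with the plot. The sketch below implements a two-group log-rank test by hand to show what is being computed; `logrank_two_groups` is a hypothetical helper and the data is made up. In practice, lifelines provides `logrank_test` in `lifelines.statistics`, which would be the more usual choice.

```python
import math

def logrank_two_groups(durations_a, events_a, durations_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value) with 1 degree of freedom."""
    pooled = [(t, e, "a") for t, e in zip(durations_a, events_a)]
    pooled += [(t, e, "b") for t, e in zip(durations_b, events_b)]
    event_times = sorted({t for t, e, _ in pooled if e == 1})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for dur, _, _ in pooled if dur >= t)                  # at risk, both groups
        n_a = sum(1 for dur, _, g in pooled if dur >= t and g == "a")   # at risk, group A
        d = sum(1 for dur, e, _ in pooled if dur == t and e == 1)       # events, both groups
        d_a = sum(1 for dur, e, g in pooled if dur == t and e == 1 and g == "a")
        # Under the null, group A is expected to get its at-risk share of the events.
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0  # no information to separate the groups
    statistic = observed_minus_expected ** 2 / variance
    p_value = math.erfc(math.sqrt(statistic / 2))  # chi-square tail probability for 1 df
    return statistic, p_value

# Toy data: group A fails early, group B fails late, no censoring.
stat, p = logrank_two_groups([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
print(f"chi2={stat:.3f}, p={p:.4f}")
```

A small p-value suggests the groups' survival curves differ, but with groups this small the chi-square approximation is rough at best.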


In [None]:
# Example: stratify by a categorical column (edit STRATA_COL)
STRATA_COL = None  # e.g., "group" or "segment"

if LIFELINES_AVAILABLE and STRATA_COL is not None and STRATA_COL in df.columns:
    fig, ax = plt.subplots()
    for level, sub in df.groupby(STRATA_COL):
        kmf = KaplanMeierFitter()
        kmf.fit(sub[DURATION_COL], event_observed=sub[EVENT_COL], label=str(level))
        # Pass the axes explicitly so every group is drawn on the same plot.
        kmf.plot_survival_function(ax=ax, ci_show=False)
    ax.set_title(f"Survival by {STRATA_COL}")
    ax.set_xlabel("Time")
    ax.set_ylabel("Survival probability S(t)")
    plt.show()


### 4️⃣ Cox Proportional Hazards Model

The **Cox model** is the standard workhorse:

- Models the hazard as `h(t | x) = h0(t) * exp(beta^T x)`, where `h0(t)` is a baseline hazard shared by all individuals  
- Provides a hazard ratio `exp(beta_j)` per feature and a **risk score** `exp(beta^T x)` per individual

Use Cox as a **baseline** for structured survival problems.
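
A short worked example may help with reading the coefficients. The numbers below are made up for illustration (they do not come from any fitted model), and `relative_hazard` is a hypothetical helper. The point is that the baseline `h0(t)` cancels when two individuals are compared, which is what "proportional hazards" means.

```python
import math

# Hypothetical coefficients (log hazard ratios); not from a real fit.
beta = {"age_decades": 0.40, "treated": -0.70}

def relative_hazard(x, coefficients):
    """exp(beta^T x): the covariate-dependent part of the Cox hazard."""
    return math.exp(sum(coefficients[name] * x[name] for name in coefficients))

# Per-feature hazard ratios: the multiplicative change in hazard per one-unit increase.
hazard_ratios = {name: round(math.exp(b), 3) for name, b in beta.items()}
print("Hazard ratios:", hazard_ratios)

person_a = {"age_decades": 6.0, "treated": 1}
person_b = {"age_decades": 5.0, "treated": 0}

# h0(t) cancels in the ratio, so the same ratio holds at every time t.
ratio = relative_hazard(person_a, beta) / relative_hazard(person_b, beta)
print(f"Hazard of A relative to B: {ratio:.3f}")
```

Here A is a decade older (raising the hazard) but treated (lowering it), and the two effects combine to `exp(0.40 - 0.70) = exp(-0.30)`.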


In [None]:
if LIFELINES_AVAILABLE:
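    # Exclude the ID, the targets, and any risk score left over from an earlier run of this notebook.
    exclude_cols = [ID_COL, DURATION_COL, EVENT_COL, "cox_risk_score"]
    exclude_cols = [c for c in exclude_cols if c in df.columns]

    feature_cols = [c for c in df.columns if c not in exclude_cols]
    print("Feature columns for Cox model:", feature_cols)

    # CoxPHFitter expects numeric columns without missing values:
    # one-hot encode categoricals (dropping one level each) and drop incomplete rows.
    df_cox = pd.get_dummies(df[[DURATION_COL, EVENT_COL] + feature_cols], drop_first=True, dtype=float)
    df_cox = df_cox.dropna()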

    cph = CoxPHFitter()
    cph.fit(df_cox, duration_col=DURATION_COL, event_col=EVENT_COL)
    cph.print_summary()

    cph.plot()
    plt.title("Cox model coefficients (log hazard ratios)")
    plt.show()
else:
    print("Skipping Cox modeling: lifelines not available.")


### 5️⃣ Risk Scores & Evaluation

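From Cox we obtain a **risk score** per individual (the partial hazard `exp(beta^T x)`, a relative quantity with no time unit):

- Higher score → higher hazard → event expected sooner
- Lower score → lower hazard → event expected later

We evaluate via the **concordance index (C-index)**: the share of comparable pairs that the model ranks in the right order.  
0.5 = random ranking, 1.0 = perfect ranking. The value computed in the next cell is on the training data, so treat it as optimistic.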
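
For intuition, the C-index can be computed by hand on a toy example. The function below is a simplified sketch of Harrell's C (`harrell_c_index` is a hypothetical name): a pair is comparable when the subject with the shorter time had an observed event, and pairs with tied durations are skipped. lifelines' own implementation handles ties and large inputs more carefully.

```python
def harrell_c_index(durations, events, risk_scores):
    """Share of comparable pairs in which the higher-risk subject has the shorter time."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        for j in range(n):
            # Comparable: subject i had an observed event strictly before subject j's time.
            if events[i] == 1 and durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied risk scores count as half
    return concordant / comparable

# Toy data: the third subject is censored at t=6.
toy_durations = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.4, 0.5, 0.1]

c = harrell_c_index(toy_durations, toy_events, toy_risk)
print(f"C-index: {c:.2f}")
```

Of the five comparable pairs, only one is ranked the wrong way round: the subject with the event at `t=4` has a lower score than the subject censored at `t=6`.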


In [None]:
if LIFELINES_AVAILABLE:
    c_index = cph.concordance_index_
    print(f"C-index (training): {c_index:.4f}")

    df["cox_risk_score"] = cph.predict_partial_hazard(df_cox)
    display(df[["cox_risk_score", DURATION_COL, EVENT_COL]].head())
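
The C-index printed above is measured on the same rows the model was fit on. A sketch of a time-based split follows, assuming the data has a column recording when each subject entered observation. The column name `start_date` and the helper `time_based_split` are hypothetical; the template's input file is not required to have such a column.

```python
import pandas as pd

def time_based_split(frame, time_col, train_fraction=0.8):
    """Split rows so every training row starts no later than every validation row."""
    ordered = frame.sort_values(time_col, kind="stable").reset_index(drop=True)
    n_train = int(len(ordered) * train_fraction)
    return ordered.iloc[:n_train], ordered.iloc[n_train:]

# Toy data with a hypothetical entry date per subject.
toy = pd.DataFrame({
    "start_date": pd.to_datetime(
        ["2022-03-01", "2021-01-15", "2023-07-01", "2021-09-30", "2022-11-20"]
    ),
    "duration": [200, 480, 35, 310, 90],
    "event": [1, 1, 0, 1, 0],
})

train, valid = time_based_split(toy, "start_date")
print(len(train), len(valid))
print(train["start_date"].max() <= valid["start_date"].min())
```

With lifelines installed, one option is to fit `CoxPHFitter` on `train` only and evaluate with `cph.score(valid, scoring_method="concordance_index")`. For a strict setup, training durations would also need to be censored at the split date, since events after that date would not have been known at training time; that step is left out of this sketch.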


### 6️⃣ What to Do Next?

- Use `cox_risk_score` as a feature in other ML models.  
- Create risk groups (e.g., low / med / high) via quantiles of the risk score.  
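- Validate on a **time-based** validation split to avoid leakage; the C-index reported earlier is a training-set figure.  
- For more advanced work, check the proportional hazards assumption (e.g., with `cph.check_assumptions(df_cox)`) and consider extensions such as stratified or time-varying Cox models.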
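
As one way to build the risk groups mentioned above, the sketch below buckets scores into tertiles with `pd.qcut`. The scores and events are made-up stand-ins for the `cox_risk_score` and event columns, and three groups is an arbitrary choice.

```python
import pandas as pd

# Hypothetical risk scores standing in for the `cox_risk_score` column.
scores = pd.DataFrame({
    "cox_risk_score": [0.2, 0.5, 0.9, 1.1, 1.4, 2.0, 2.7, 3.5, 4.1],
    "event": [0, 0, 0, 1, 0, 1, 1, 1, 1],
})

# Tertiles: each group receives roughly a third of the rows.
scores["risk_group"] = pd.qcut(
    scores["cox_risk_score"], q=3, labels=["low", "med", "high"]
)

# Event rate per group; it should rise from low to high if the score ranks risk well.
summary = scores.groupby("risk_group", observed=True)["event"].agg(["count", "mean"])
print(summary)
```

The raw event rate ignores follow-up time, so it is only a quick check. Plotting one Kaplan–Meier curve per `risk_group` gives a fairer picture.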

You can save the dataframe with risk scores:


In [None]:
OUTPUT_DIR = Path("./survival_outputs")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

df.to_csv(OUTPUT_DIR / "survival_with_risk_scores.csv", index=False)
print("Saved survival_with_risk_scores.csv")