# Colorectal Cancer Survival Prediction Project
## Introduction
Colorectal cancer (CRC) is one of the leading causes of cancer-related deaths worldwide. Predicting survival outcomes from patient demographics, medical history, lifestyle factors, and treatment details can support clinicians in decision-making and help identify high-risk groups. Machine learning models offer the potential to uncover hidden patterns in patient data and improve prognostic accuracy compared to traditional statistical approaches.



In [92]:
import os
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.inspection import permutation_importance
import joblib


# Objective
- The primary goal of this project is to predict the survival status of colorectal cancer patients using clinical, demographic, and lifestyle features.
- Specifically, the project aims to:
  1. Explore and preprocess patient data.
  2. Identify key risk factors and predictors of survival.
  3. Train classification models to predict survival status.
  4. Evaluate model performance using accuracy and classification metrics.


Steps implemented:
1. Load dataset
2. Exploratory Data Analysis (basic prints + plots)
3. Data preprocessing (imputation, one-hot encoding, scaling)
4. Train / test split (stratified)
5. Model training (Logistic Regression)
6. Evaluation (accuracy, classification report, ROC AUC, confusion matrix)
7. Feature importance (Logistic coef + permutation importance)
8. Save artifacts (trained pipeline, feature importance CSVs, plots)

In [94]:
# ----------------------
# Configuration
# ----------------------
DATA_PATH = "/Users/nikhilreddyponnala/Desktop/Data Analytics/Third Project/Colorectal Cancer Risk & Survival Data/Dataset/colorectal_cancer_prediction.csv"
TARGET_COL = "Survival_Status"
ID_COL = "Patient_ID"
RANDOM_STATE = 42
TEST_SIZE = 0.2
ARTIFACT_DIR = Path("artifacts")
ARTIFACT_DIR.mkdir(exist_ok=True)
FIG_DIR = ARTIFACT_DIR / "figures"
FIG_DIR.mkdir(exist_ok=True)


# Drawbacks/Limitations
- The dataset may not be fully representative (regional or hospital bias).
- Missing clinical details (tumor genetics, comorbidities) could reduce predictive power.
- Logistic regression assumes a linear relationship between the predictors and the log-odds of the outcome.
- Survival was treated as a binary classification problem (survived vs. deceased), whereas real survival data is time-to-event and is better suited to survival analysis methods (Cox regression, Kaplan-Meier, etc.).
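
To make the last limitation concrete, here is a minimal sketch of a Kaplan-Meier estimator written with NumPy only. The function name `kaplan_meier` and the follow-up values are hypothetical toy inputs, since the dataset's columns do not include a survival follow-up time. The sketch only illustrates what a time-to-event analysis takes as input and returns.

```python
import numpy as np

def kaplan_meier(durations, events):
    """Return (event_times, survival_probabilities) for right-censored data.

    durations: follow-up time per patient
    events: 1 if death was observed at that time, 0 if the patient was censored
    """
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(durations[events])  # sorted distinct death times
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)            # still under observation just before t
        deaths = np.sum((durations == t) & events)  # deaths observed exactly at t
        s *= 1.0 - deaths / at_risk
        survival.append(s)
    return event_times, np.array(survival)

# Toy data (hypothetical): months of follow-up and whether death was observed
toy_months = [1, 2, 3, 4, 5]
toy_event = [1, 0, 1, 1, 0]
times, surv = kaplan_meier(toy_months, toy_event)
for t, s in zip(times, surv):
    print(f"t={t:.0f}  S(t)={s:.3f}")
```

Unlike the binary label, censored patients (event flag 0) still contribute to the at-risk counts up to their last follow-up, which is the information a plain classifier discards.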

In [96]:
# ----------------------
# Helper functions
# ----------------------

def ensure_binary_target(series):
    """Ensure the target is binary and encoded as 0/1. If target has two non-numeric
    values, map them deterministically to 0 and 1 and return the mapping."""
    vals = series.dropna().unique()
    if len(vals) != 2:
        raise ValueError(f"Target column must be binary (2 unique values). Found: {vals}")
    # If already numeric 0/1, return as-is
    if set(map(str, vals)) <= {"0", "1"}:
        return series.astype(int), {"0":0, "1":1}
    # Otherwise map sorted string representation -> 0/1 for determinism
    sorted_vals = sorted(map(str, vals))
    mapping = {sorted_vals[0]: 0, sorted_vals[1]: 1}
    mapped = series.astype(str).map(mapping).astype(int)
    print(f"Mapping target values: {mapping}")
    return mapped, mapping


def get_feature_names_from_preprocessor(preprocessor, num_cols, cat_cols):
    """Return a list of feature names after ColumnTransformer preprocessing.
    Assumes the 'cat' transformer is a Pipeline with a step named 'encoder'
    that exposes get_feature_names_out (OneHotEncoder).
    """
    features = []
    # numeric features appear as-is
    if num_cols:
        features.extend(num_cols)
    # categorical features are expanded by the OneHotEncoder
    if cat_cols:
        cat_pipeline = preprocessor.named_transformers_.get('cat')
        if cat_pipeline is None:
            raise RuntimeError("No 'cat' transformer found in preprocessor")
        ohe = cat_pipeline.named_steps.get('encoder')
        if ohe is None:
            raise RuntimeError("No 'encoder' step found in categorical pipeline")
        cat_features = list(ohe.get_feature_names_out(cat_cols))
        features.extend(cat_features)
    return features


In [98]:
# ----------------------
# 1) Load data
# ----------------------
print("Loading data from:", DATA_PATH)
if not Path(DATA_PATH).exists():
    raise FileNotFoundError(f"File not found: {DATA_PATH}")

df = pd.read_csv(DATA_PATH)
print("Dataset shape:", df.shape)
print("Columns:", df.columns.tolist())

# Basic preview
print(df.head())


Loading data from: /Users/nikhilreddyponnala/Desktop/Data Analytics/Third Project/Colorectal Cancer Risk & Survival Data/Dataset/colorectal_cancer_prediction.csv
Dataset shape: (89945, 30)
Columns: ['Patient_ID', 'Age', 'Gender', 'Race', 'Region', 'Urban_or_Rural', 'Socioeconomic_Status', 'Family_History', 'Previous_Cancer_History', 'Stage_at_Diagnosis', 'Tumor_Aggressiveness', 'Colonoscopy_Access', 'Screening_Regularity', 'Diet_Type', 'BMI', 'Physical_Activity_Level', 'Smoking_Status', 'Alcohol_Consumption', 'Red_Meat_Consumption', 'Fiber_Consumption', 'Insurance_Coverage', 'Time_to_Diagnosis', 'Treatment_Access', 'Chemotherapy_Received', 'Radiotherapy_Received', 'Surgery_Received', 'Follow_Up_Adherence', 'Survival_Status', 'Recurrence', 'Time_to_Recurrence']
   Patient_ID  Age  Gender   Race         Region Urban_or_Rural  \
0           1   71    Male  Other         Europe          Urban   
1           2   34  Female  Black  North America          Urban   
2           3   80  Female  

In [100]:
# ----------------------
# 2) Exploratory Data Analysis (light)
# ----------------------
print("\n--- Basic Info ---")
print(df.info())
print("\n--- Numeric summary ---")
print(df.select_dtypes(include=[np.number]).describe().T)
print("\n--- Missing values (top 20) ---")
print(df.isnull().sum().sort_values(ascending=False).head(20))

# Plot target distribution
plt.figure(figsize=(6,4))
sns.countplot(x=TARGET_COL, data=df)
plt.title('Target distribution: ' + TARGET_COL)
plt.tight_layout()
plt.savefig(FIG_DIR / 'target_distribution.png')
plt.close()

# Numeric distributions (first 12 numeric features)
num_cols_all = df.select_dtypes(include=[np.number]).columns.tolist()
if ID_COL in num_cols_all:
    num_cols_all.remove(ID_COL)

if num_cols_all:
    nplots = min(12, len(num_cols_all))
    df[num_cols_all[:nplots]].hist(figsize=(12,8), bins=30)
    plt.suptitle('Numeric feature distributions (sample)')
    plt.tight_layout()
    plt.savefig(FIG_DIR / 'numeric_distributions.png')
    plt.close()

# Correlation heatmap (numeric only)
if len(num_cols_all) >= 2:
    corr = df[num_cols_all].corr()
    plt.figure(figsize=(10,8))
    sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
    plt.title('Correlation (numeric features)')
    plt.tight_layout()
    plt.savefig(FIG_DIR / 'correlation_heatmap.png')
    plt.close()



--- Basic Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89945 entries, 0 to 89944
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Patient_ID               89945 non-null  int64  
 1   Age                      89945 non-null  int64  
 2   Gender                   89945 non-null  object 
 3   Race                     89945 non-null  object 
 4   Region                   89945 non-null  object 
 5   Urban_or_Rural           89945 non-null  object 
 6   Socioeconomic_Status     89945 non-null  object 
 7   Family_History           89945 non-null  object 
 8   Previous_Cancer_History  89945 non-null  object 
 9   Stage_at_Diagnosis       89945 non-null  object 
 10  Tumor_Aggressiveness     89945 non-null  object 
 11  Colonoscopy_Access       89945 non-null  object 
 12  Screening_Regularity     89945 non-null  object 
 13  Diet_Type                89945 non-null  object 
 14  BM

In [102]:
# ----------------------
# 3) Data Preprocessing
# ----------------------
# Drop ID if present
if ID_COL in df.columns:
    df = df.drop(columns=[ID_COL])

# Ensure target is binary 0/1
if TARGET_COL not in df.columns:
    raise KeyError(f"Target column not found: {TARGET_COL}")

df[TARGET_COL], target_mapping = ensure_binary_target(df[TARGET_COL])

# Features/target split
X = df.drop(columns=[TARGET_COL]).copy()
y = df[TARGET_COL].copy()

# Identify categorical & numeric columns
cat_cols = X.select_dtypes(include=['object','category']).columns.tolist()
num_cols = X.select_dtypes(include=[np.number]).columns.tolist()
print("Categorical columns:", cat_cols)
print("Numeric columns:", num_cols)

# Build preprocessing pipelines
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, num_cols),
    ('cat', categorical_transformer, cat_cols)
])


Mapping target values: {'Deceased': 0, 'Survived': 1}
Categorical columns: ['Gender', 'Race', 'Region', 'Urban_or_Rural', 'Socioeconomic_Status', 'Family_History', 'Previous_Cancer_History', 'Stage_at_Diagnosis', 'Tumor_Aggressiveness', 'Colonoscopy_Access', 'Screening_Regularity', 'Diet_Type', 'Physical_Activity_Level', 'Smoking_Status', 'Alcohol_Consumption', 'Red_Meat_Consumption', 'Fiber_Consumption', 'Insurance_Coverage', 'Time_to_Diagnosis', 'Treatment_Access', 'Chemotherapy_Received', 'Radiotherapy_Received', 'Surgery_Received', 'Follow_Up_Adherence', 'Recurrence']
Numeric columns: ['Age', 'BMI', 'Time_to_Recurrence']


In [104]:
# ----------------------
# 4) Train / Test split
# ----------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, stratify=y, random_state=RANDOM_STATE
)
print("Train shape:", X_train.shape, " | Test shape:", X_test.shape)


Train shape: (71956, 28)  | Test shape: (17989, 28)


# Models Used

1. Logistic Regression
   - Baseline interpretable model.
   - Provides coefficients to understand feature influence.

2. (Optional Extension: Random Forest / Gradient Boosting)
   - Can handle nonlinear interactions.
   - Often improves prediction accuracy but is less interpretable.
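
The difference between the two model families can be sketched on synthetic data. The example below builds a hypothetical XOR-style interaction that a linear decision boundary cannot capture, then compares Logistic Regression with a Random Forest. It is only an illustration of the "nonlinear interactions" point and says nothing about whether a Random Forest would help on the CRC dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): the label depends on an interaction of two features
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(2000, 2))
y_toy = ((X_toy[:, 0] > 0) ^ (X_toy[:, 1] > 0)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Linear baseline vs. tree ensemble on the same split
lr_acc = LogisticRegression(max_iter=2000).fit(X_tr, y_tr).score(X_te, y_te)
rf_acc = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr).score(X_te, y_te)

print(f"Logistic Regression accuracy: {lr_acc:.3f}")
print(f"Random Forest accuracy:       {rf_acc:.3f}")
```

In the project pipeline, the same swap would amount to replacing the `'classifier'` step while keeping the preprocessor unchanged.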

# Outcomes
- Built a clean ML pipeline with preprocessing (imputation, encoding, scaling).

- On the held-out test set (17,989 patients), the Logistic Regression model achieved:
  Accuracy: 0.506 and ROC AUC: 0.508, which is close to chance level.
  Recall of about 0.50 for both classes, but precision of only 0.26 for the
  deceased class (0) compared with 0.75 for the survived class (1).

  Largest positive coefficients (all below 0.05):
  Race_Other, Region_Latin America, and Red_Meat_Consumption_Low. Stage at diagnosis
  and age do not appear among the ten largest positive coefficients.

- With near-chance performance and small coefficients, the feature importance analysis
  does not support singling out any clinical or lifestyle factor as a significant
  contributor in this baseline.



In [106]:
# ----------------------
# 5) Model Training (Logistic Regression)
# ----------------------
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=2000, class_weight='balanced', random_state=RANDOM_STATE))
])

print("Fitting Logistic Regression pipeline...")
clf.fit(X_train, y_train)
print("Model fitted.")


Fitting Logistic Regression pipeline...
Model fitted.


In [108]:
# ----------------------
# 6) Evaluation
# ----------------------
# Predictions
y_pred = clf.predict(X_test)

y_proba = None
try:
    y_proba = clf.predict_proba(X_test)[:,1]
except Exception:
    y_proba = None

# Metrics
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# ROC AUC if probabilities available
if y_proba is not None:
    try:
        roc = roc_auc_score(y_test, y_proba)
        print(f"ROC AUC: {roc:.4f}")
    except Exception:
        pass

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", cm)
fig, ax = plt.subplots(figsize=(6,5))
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0,1]).plot(ax=ax)
plt.title('Confusion Matrix')
plt.tight_layout()
plt.savefig(FIG_DIR / 'confusion_matrix.png')
plt.close()


Accuracy: 0.5060

Classification Report:
               precision    recall  f1-score   support

           0       0.26      0.50      0.34      4521
           1       0.75      0.51      0.61     13468

    accuracy                           0.51     17989
   macro avg       0.50      0.51      0.47     17989
weighted avg       0.63      0.51      0.54     17989

ROC AUC: 0.5081
Confusion matrix:
 [[2282 2239]
 [6647 6821]]
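
The figures in the classification report can be re-derived from the confusion matrix, which helps put the 0.506 accuracy in context. The sketch below uses only the four counts printed above. The majority-class baseline is an extra reference point added here for comparison and was not part of the original run.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true class, columns = predicted class)
cm = np.array([[2282, 2239],
               [6647, 6821]])

total = cm.sum()
accuracy = np.trace(cm) / total                   # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)             # per true class
precision = np.diag(cm) / cm.sum(axis=0)          # per predicted class
balanced_accuracy = recall.mean()                 # average of the two recalls
majority_baseline = cm.sum(axis=1).max() / total  # always predicting the larger class

print(f"accuracy:          {accuracy:.4f}")
print(f"recall (0, 1):     {recall.round(2)}")
print(f"precision (0, 1):  {precision.round(2)}")
print(f"balanced accuracy: {balanced_accuracy:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

Class 1 accounts for 13,468 of the 17,989 test rows, so always predicting class 1 would score roughly 0.75 accuracy. The model was fitted with `class_weight='balanced'`, which trades raw accuracy for equal attention to both classes, so balanced accuracy and ROC AUC (both near 0.5) are the more informative numbers here.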


In [110]:
# ----------------------
# 7) Feature importance / selection
# ----------------------
print("\nComputing feature names after preprocessing...")
preproc_fitted = clf.named_steps['preprocessor']
feature_names = get_feature_names_from_preprocessor(preproc_fitted, num_cols, cat_cols)
print(f"Total features after preprocessing: {len(feature_names)}")

# Logistic regression coefficients
log_coef = clf.named_steps['classifier'].coef_[0]
if len(log_coef) != len(feature_names):
    print("Warning: coefficient length does not match feature names. Skipping coef-based ranking.")
else:
    coef_df = pd.DataFrame({'feature': feature_names, 'coef': log_coef})
    coef_df['abs_coef'] = coef_df['coef'].abs()
    coef_df = coef_df.sort_values('coef', ascending=False)
    coef_df.to_csv(ARTIFACT_DIR / 'logreg_coefficients.csv', index=False)
    print("Top positive predictors (LogReg):")
    print(coef_df.head(10))
    print("Top negative predictors (LogReg):")
    print(coef_df.tail(10))

# Permutation importance (model-agnostic)
print("\nRunning permutation importance (this may take a moment)...")
try:
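    # The full pipeline is scored on raw inputs, so importances align with the
    # original X_test columns, not with the one-hot encoded feature names.
    perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=RANDOM_STATE, n_jobs=-1)
    perm_df = pd.DataFrame({'feature': X_test.columns, 'importance_mean': perm.importances_mean, 'importance_std': perm.importances_std})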
    perm_df = perm_df.sort_values('importance_mean', ascending=False).reset_index(drop=True)
    perm_df.to_csv(ARTIFACT_DIR / 'permutation_importance.csv', index=False)
    print("Top features by permutation importance:")
    print(perm_df.head(15))
except Exception as e:
    print("Permutation importance failed:", e)




Computing feature names after preprocessing...
Total features after preprocessing: 70
Top positive predictors (LogReg):
                        feature      coef  abs_coef
8                    Race_Other  0.042711  0.042711
13         Region_Latin America  0.032537  0.032537
49     Red_Meat_Consumption_Low  0.023444  0.023444
30  Tumor_Aggressiveness_Medium  0.021618  0.021618
6                    Race_Black  0.016991  0.016991
65         Surgery_Received_Yes  0.016879  0.016879
4                   Gender_Male  0.015394  0.015394
62     Radiotherapy_Received_No  0.013598  0.013598
60     Chemotherapy_Received_No  0.013382  0.013382
19  Socioeconomic_Status_Middle  0.012712  0.012712
Top negative predictors (LogReg):
                      feature      coef  abs_coef
61  Chemotherapy_Received_Yes -0.012482  0.012482
38          Diet_Type_Western -0.012528  0.012528
63  Radiotherapy_Received_Yes -0.012698  0.012698
3               Gender_Female -0.014494  0.014494
18   Socioeconomic_Stat
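
One detail worth illustrating: when `permutation_importance` is given a full pipeline, it shuffles the raw input columns, so it returns one score per original column, while the logistic coefficients have one entry per encoded feature. The sketch below shows the two lengths on a small synthetic frame. The column names, data, and outcome rule are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (hypothetical): the outcome is driven only by 'stage'
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "age": rng.normal(60, 10, n),
    "stage": rng.choice(["I", "II", "III", "IV"], n),
})
y_toy = toy["stage"].isin(["III", "IV"]).astype(int)

toy_pre = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["stage"]),
])
toy_pipe = Pipeline([
    ("preprocessor", toy_pre),
    ("classifier", LogisticRegression(max_iter=2000)),
])
toy_pipe.fit(toy, y_toy)

# Coefficients live in the encoded space: 1 numeric + 4 one-hot columns
n_encoded = toy_pipe.named_steps["classifier"].coef_.shape[1]

# Permutation importance works on the raw columns passed to the pipeline
toy_perm = permutation_importance(toy_pipe, toy, y_toy, n_repeats=5, random_state=0)
toy_scores = pd.Series(toy_perm.importances_mean, index=toy.columns)

print("encoded features:", n_encoded)
print("permutation scores:", len(toy_perm.importances_mean))
print(toy_scores)
```

Labeling permutation scores with the raw column names therefore gives one interpretable number per original variable, such as a single score for a categorical column rather than one per category.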

In [111]:
# ----------------------
# 8) Save artifacts
# ----------------------
model_path = ARTIFACT_DIR / 'crc_logreg_pipeline.joblib'
joblib.dump(clf, model_path)
print(f"Saved trained pipeline to: {model_path}")

print("Saved artifacts in:", ARTIFACT_DIR)

# Optional: save test predictions
out_df = X_test.copy()
out_df['y_true'] = y_test.values
out_df['y_pred'] = y_pred
if y_proba is not None:
    out_df['y_proba'] = y_proba
out_df.to_csv(ARTIFACT_DIR / 'test_predictions.csv', index=False)
print("Saved test predictions to artifacts/test_predictions.csv")

print("\nDone. Review the artifacts (model, CSVs, figures) in the 'artifacts' folder.")


Saved trained pipeline to: artifacts/crc_logreg_pipeline.joblib
Saved artifacts in: artifacts
Saved test predictions to artifacts/test_predictions.csv

Done. Review the artifacts (model, CSVs, figures) in the 'artifacts' folder.
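
A saved pipeline can later be restored with `joblib.load`. The round trip below is a minimal sketch on synthetic data with a temporary directory, so it runs without the project's artifacts. The toy pipeline and file name are hypothetical stand-ins for `crc_logreg_pipeline.joblib`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data (hypothetical)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
toy_pipeline.fit(X_toy, y_toy)

# Save to a temporary location, then load it back
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_pipeline.joblib"
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

same = np.array_equal(toy_pipeline.predict(X_toy), restored.predict(X_toy))
print("Predictions identical after reload:", same)
```

For the real artifact, the restored pipeline expects a DataFrame with the same feature columns used for training, and it should be loaded with a scikit-learn version compatible with the one used to save it.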


# Conclusion
- This project built an end-to-end machine learning pipeline for predicting colorectal cancer survival outcomes. Logistic Regression provided an interpretable baseline, but on this dataset it performed close to chance (accuracy 0.506, ROC AUC 0.508), so the results do not yet confirm stage, treatment, or lifestyle factors as reliable predictors.

## Future work could include:
- Using ensemble models (Random Forest, XGBoost) for improved accuracy.
- Applying time-to-event survival models for more clinically relevant predictions.
- Incorporating larger and more diverse patient datasets.

Ultimately, such predictive models can act as decision-support tools to guide clinicians and patients toward personalized cancer management strategies.

