
# Early Breast Cancer Risk Stratification – Hyperparameter Tuning and Additional Algorithms

This notebook extends the previous analyses by incorporating hyperparameter tuning via cross‑validation and experimenting with an additional algorithm, CatBoost, which is well suited to categorical data. We also discuss the limitations of modelling aggregated data and issues related to fairness and interpretability. The dataset and variable descriptions remain the same as before.




## Overview

We use the same dataset of 150 k rows representing a 10 % sample of the original risk‑factor cohort. Variables encode demographic, reproductive and lifestyle factors using integer codes (see previous section for details). Each row summarises a group of women with identical covariate profiles and a corresponding `count` indicating how many individuals the row represents.
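
To make the role of `count` concrete, the sketch below uses a small made‑up table (the `age_group` column and all values are illustrative assumptions, not taken from the dataset). It checks that a count‑weighted prevalence equals the prevalence obtained by expanding each row into `count` individual records, and that ignoring `count` gives a different answer.

```python
import numpy as np
import pandas as pd

# Hypothetical aggregated rows: each row stands for `count` women with the same profile.
toy = pd.DataFrame({
    "age_group": [1, 1, 2, 2],
    "breast_cancer_history": [0, 1, 0, 1],
    "count": [40, 2, 30, 8],
})

# Prevalence computed directly from the aggregated table, weighting rows by `count`.
weighted_prevalence = np.average(toy["breast_cancer_history"], weights=toy["count"])

# The same quantity after expanding every row into `count` individual records.
expanded = toy.loc[toy.index.repeat(toy["count"])].reset_index(drop=True)
expanded_prevalence = expanded["breast_cancer_history"].mean()

# Ignoring `count` treats every profile as a single woman.
unweighted_prevalence = toy["breast_cancer_history"].mean()

print("weighted:  ", weighted_prevalence)
print("expanded:  ", expanded_prevalence)
print("unweighted:", unweighted_prevalence)
```

This equivalence is why `count` is passed as `sample_weight` when fitting the models below instead of materialising one record per woman.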

The target variable `breast_cancer_history` is binary (0 = No, 1 = Yes; rows with code 9 are removed before modelling).



In [None]:

# ## 1. Load data and define mappings
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import confusion_matrix, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns

# Read dataset
df = pd.read_csv('/home/oai/share/sample_10percent.csv')

# Remove rows with unknown target
df = df[df['breast_cancer_history'] != 9].copy()

# Define categorical and numerical columns
categorical_cols = [col for col in df.columns if col not in ['year', 'count', 'breast_cancer_history']]
numerical_cols = ['year']

# Sample weights and target
sample_weights = df['count'].values
y = df['breast_cancer_history']
X = df.drop(columns=['breast_cancer_history', 'count'])

# Preprocessor for all models
preprocessor = ColumnTransformer([
    ('categorical', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ('numeric', 'passthrough', numerical_cols)
])

# Split data
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, sample_weights, test_size=0.2, random_state=42, stratify=y
)

# Print basic shapes
print('Train shape:', X_train.shape)
print('Test shape:', X_test.shape)


In [None]:

# ## 2. Hyperparameter tuning – Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Define pipeline
lr_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000, class_weight='balanced'))
])

# Define hyperparameter grid for C (inverse regularisation strength) and penalty
param_grid_lr = {
    'classifier__C': [0.01, 0.1, 1, 10],
    'classifier__penalty': ['l2'],
    'classifier__solver': ['lbfgs']
}

# Set up GridSearchCV with 3‑fold cross‑validation; scoring by F1 because of imbalance
grid_lr = GridSearchCV(lr_pipeline, param_grid_lr, cv=3, scoring='f1', n_jobs=-1, verbose=0)

# Fit grid search
grid_lr.fit(X_train, y_train, classifier__sample_weight=w_train)

print('Best hyperparameters for Logistic Regression:', grid_lr.best_params_)

# Evaluate the best model on the test set
best_lr = grid_lr.best_estimator_

y_prob_lr = best_lr.predict_proba(X_test)[:,1]
y_pred_lr = (y_prob_lr >= 0.5).astype(int)

acc_lr = accuracy_score(y_test, y_pred_lr)
prec_lr = precision_score(y_test, y_pred_lr)
recall_lr = recall_score(y_test, y_pred_lr)
f1_lr = f1_score(y_test, y_pred_lr)
roc_auc_lr = roc_auc_score(y_test, y_prob_lr)

print('Tuned Logistic Regression performance:')
print('  Accuracy:', acc_lr)
print('  Precision:', prec_lr)
print('  Recall:', recall_lr)
print('  F1 score:', f1_lr)
print('  ROC AUC:', roc_auc_lr)


In [None]:

# ## 3. Hyperparameter tuning – Random Forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Pipeline
rf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(class_weight='balanced', random_state=42))
])

# Hyperparameter distribution
param_distributions_rf = {
    'classifier__n_estimators': [200, 300, 400, 500],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4],
    # 'auto' was removed in scikit-learn 1.3 (for classifiers it was an alias of 'sqrt')
    'classifier__max_features': ['sqrt', 'log2']
}

# Randomised search with 3‑fold cross‑validation; limit iterations for efficiency
rand_rf = RandomizedSearchCV(rf_pipeline, param_distributions_rf, n_iter=10, cv=3, scoring='f1', random_state=42, n_jobs=-1)

# Fit search
rand_rf.fit(X_train, y_train, classifier__sample_weight=w_train)

print('Best hyperparameters for Random Forest:', rand_rf.best_params_)

# Evaluate on test set
best_rf = rand_rf.best_estimator_
y_prob_rf = best_rf.predict_proba(X_test)[:,1]
y_pred_rf = (y_prob_rf >= 0.5).astype(int)

acc_rf = accuracy_score(y_test, y_pred_rf)
prec_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)
roc_auc_rf = roc_auc_score(y_test, y_prob_rf)

print('Tuned Random Forest performance:')
print('  Accuracy:', acc_rf)
print('  Precision:', prec_rf)
print('  Recall:', recall_rf)
print('  F1 score:', f1_rf)
print('  ROC AUC:', roc_auc_rf)


In [None]:

# ## 4. Hyperparameter tuning – XGBoost

try:
    from xgboost import XGBClassifier
    from sklearn.model_selection import RandomizedSearchCV
    use_xgb = True
except ImportError:
    use_xgb = False

if use_xgb:
    xgb_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', XGBClassifier(eval_metric='logloss', random_state=42))
    ])

    # Parameter distributions
    param_dist_xgb = {
        'classifier__n_estimators': [300, 500, 700],
        'classifier__learning_rate': [0.01, 0.05, 0.1],
        'classifier__max_depth': [3, 4, 5, 6],
        'classifier__subsample': [0.6, 0.8, 1.0],
        'classifier__colsample_bytree': [0.6, 0.8, 1.0]
    }

    rand_xgb = RandomizedSearchCV(xgb_pipeline, param_dist_xgb, n_iter=10, cv=3, scoring='f1', random_state=42, n_jobs=-1)
    
    rand_xgb.fit(X_train, y_train, classifier__sample_weight=w_train)

    print('Best hyperparameters for XGBoost:', rand_xgb.best_params_)
    
    best_xgb = rand_xgb.best_estimator_
    y_prob_xgb = best_xgb.predict_proba(X_test)[:,1]
    y_pred_xgb = (y_prob_xgb >= 0.5).astype(int)
    
    acc_xgb = accuracy_score(y_test, y_pred_xgb)
    prec_xgb = precision_score(y_test, y_pred_xgb)
    recall_xgb = recall_score(y_test, y_pred_xgb)
    f1_xgb = f1_score(y_test, y_pred_xgb)
    roc_auc_xgb = roc_auc_score(y_test, y_prob_xgb)

    print('Tuned XGBoost performance:')
    print('  Accuracy:', acc_xgb)
    print('  Precision:', prec_xgb)
    print('  Recall:', recall_xgb)
    print('  F1 score:', f1_xgb)
    print('  ROC AUC:', roc_auc_xgb)
else:
    print('XGBoost library not available. Consider installing xgboost to run this section.')


In [None]:

# ## 5. Exploring CatBoost

# CatBoost is designed for categorical data and can handle categorical features natively without extensive preprocessing.
# We attempt to import CatBoost; if unavailable, this section is skipped.

try:
    from catboost import CatBoostClassifier
    use_cat = True
except ImportError:
    use_cat = False

if use_cat:
    # For CatBoost we can keep categorical columns as integer codes (CatBoost handles them internally).
    # The hyperparameters below are fixed; no cross‑validated search is run in this section.
    # Indices must refer to the feature matrix X_train, not df, which still contains the target and `count`.
    cat_features_indices = [X_train.columns.get_loc(col) for col in categorical_cols]  # indices of categorical features

    cat_model = CatBoostClassifier(
        iterations=500,
        learning_rate=0.05,
        depth=6,
        loss_function='Logloss',
        eval_metric='AUC',
        random_seed=42,
        verbose=False
    )

    # Fit the model with sample weights
    cat_model.fit(X_train, y_train, cat_features=cat_features_indices, sample_weight=w_train)

    y_prob_cat = cat_model.predict_proba(X_test)[:,1]
    y_pred_cat = (y_prob_cat >= 0.5).astype(int)

    acc_cat = accuracy_score(y_test, y_pred_cat)
    prec_cat = precision_score(y_test, y_pred_cat)
    recall_cat = recall_score(y_test, y_pred_cat)
    f1_cat = f1_score(y_test, y_pred_cat)
    roc_auc_cat = roc_auc_score(y_test, y_prob_cat)

    print('CatBoost performance:')
    print('  Accuracy:', acc_cat)
    print('  Precision:', prec_cat)
    print('  Recall:', recall_cat)
    print('  F1 score:', f1_cat)
    print('  ROC AUC:', roc_auc_cat)
else:
    print('CatBoost library not available. Consider installing catboost to run this section.')



## 6. Discussion: Data aggregation, fairness and interpretability

### Aggregated data

The dataset used here is aggregated: each row represents multiple women with identical risk profiles, and the `count` column indicates the number of individuals in that group. While weighting rows by `count` helps approximate the distribution of the original population, aggregation can mask within‑group variability and potentially distort estimates. For instance, two women in the same row may still differ in unobserved factors that influence risk. Note also that the test metrics reported above are computed per row rather than per woman, because `count` is passed to the models during fitting but not to the metric functions. Future work should explore modelling the full individual‑level dataset or simulating synthetic individuals to better capture heterogeneity.
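
As a rough illustration of how row‑level and person‑level evaluation can differ, the sketch below defines a hypothetical helper, `weighted_metrics` (not part of the notebook), that forwards group sizes to the scikit‑learn metric functions through their `sample_weight` argument. The labels, scores and counts are made up.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def weighted_metrics(y_true, y_prob, weights, threshold=0.5):
    """Count-weighted classification metrics (hypothetical helper)."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred, sample_weight=weights),
        "precision": precision_score(y_true, y_pred, sample_weight=weights, zero_division=0),
        "recall": recall_score(y_true, y_pred, sample_weight=weights, zero_division=0),
        "f1": f1_score(y_true, y_pred, sample_weight=weights, zero_division=0),
        "roc_auc": roc_auc_score(y_true, y_prob, sample_weight=weights),
    }

# Made-up labels, scores and group sizes, for illustration only.
y_true = np.array([1, 1, 0, 0])
y_prob = np.array([0.9, 0.3, 0.6, 0.1])
counts = np.array([1, 9, 1, 9])

row_level = weighted_metrics(y_true, y_prob, np.ones_like(counts))  # every row counts once
person_level = weighted_metrics(y_true, y_prob, counts)             # every woman counts once

print("row-level recall:   ", row_level["recall"])
print("person-level recall:", person_level["recall"])
```

In this notebook the equivalent call would be along the lines of `weighted_metrics(y_test, y_prob_lr, w_test)`. Whether row‑level or person‑level metrics are more appropriate depends on the question being asked.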

### Fairness considerations

Predictive models deployed in healthcare must be scrutinised for fairness across demographic groups. For example, if a model under‑predicts risk for a particular race or age group, it could exacerbate disparities in screening and outcomes. Here, we used class weighting (in the logistic regression and random forest models) to mitigate class imbalance and sample weights to reflect group sizes, but neither addresses fairness directly. Further analyses, such as calculating group‑specific sensitivity and specificity or applying fairness metrics (e.g. equal opportunity), would be necessary. Additional variables such as socioeconomic status and access to healthcare, which are not available in this dataset, could also influence fairness.
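
The group‑specific check mentioned above could be sketched as follows. The helper `group_rates`, the column names `group`, `y_true` and `y_pred`, and all values are assumptions made for illustration. In the notebook the grouping column would be one of the demographic variables and the predictions would come from a fitted model.

```python
import numpy as np
import pandas as pd

def group_rates(frame, group_col, y_col="y_true", pred_col="y_pred", weight_col="count"):
    """Count-weighted sensitivity and specificity per group (hypothetical helper)."""
    rows = []
    for group, g in frame.groupby(group_col):
        w = g[weight_col].to_numpy(dtype=float)
        y = g[y_col].to_numpy()
        p = g[pred_col].to_numpy()
        pos, neg = w[y == 1].sum(), w[y == 0].sum()
        tp = w[(y == 1) & (p == 1)].sum()
        tn = w[(y == 0) & (p == 0)].sum()
        rows.append({
            group_col: group,
            "sensitivity": tp / pos if pos > 0 else np.nan,
            "specificity": tn / neg if neg > 0 else np.nan,
        })
    return pd.DataFrame(rows).set_index(group_col)

# Made-up predictions for two groups; values are illustrative only.
toy = pd.DataFrame({
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "y_true": [1, 1, 0, 0, 1, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 1, 1, 0, 1],
    "count":  [8, 2, 15, 5, 6, 4, 18, 2],
})

rates = group_rates(toy, "group")

# Equal opportunity compares true-positive rates (sensitivity) across groups.
equal_opportunity_gap = rates["sensitivity"].max() - rates["sensitivity"].min()

print(rates)
print("Equal opportunity gap:", equal_opportunity_gap)
```

A gap near zero suggests that women who do have a history of breast cancer are flagged at similar rates in each group. What size of gap is acceptable is a clinical and ethical judgement, not a statistical one.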

### Interpretability

Understanding how each feature influences predictions is critical for trust and adoption in clinical settings. Feature importances from the ensemble models indicate which factors contribute most to predictions overall. For more nuanced interpretation, one might use methods such as SHAP (SHapley Additive exPlanations) to quantify the contribution of each feature to individual predictions. Moreover, simple logistic regression models, even where their predictive performance is lower, provide coefficients that are easier to interpret and can be reported alongside more complex models.

### External validation

Validating the models on external datasets or prospective cohorts is essential to assess generalisability. Since we do not have access to external data here, we relied on cross‑validation for hyperparameter tuning and on a held‑out 20 % test split for evaluation. Future work should test the trained models on independent cohorts from different geographic or clinical settings. This will help determine whether the models retain their predictive performance and fairness properties beyond the current dataset.

