
# Early Breast Cancer Risk Stratification – Extended Analysis

This notebook extends the earlier exploratory work with a more detailed analysis and modelling of early breast cancer risk using non‑invasive demographic and lifestyle data.  We perform exploratory data analysis (EDA) with additional visualisations, examine missing values coded as `9`, and train several machine‑learning models (Logistic Regression, Random Forest, and XGBoost), evaluating each with classification metrics and ROC curves.

All variables are coded numerically in the raw dataset.  We first map these codes to human‑readable categories for clarity and use sample weights derived from the **count** column to reflect the number of women represented by each row.




## Data dictionary

A summary of key variables and their coding, as defined by the Risk Factor Dataset:

- **age_group_5_years**: Age grouped in 5‑year intervals. Codes 1–13 correspond to 18–29, 30–34, 35–39, 40–44, 45–49, 50–54, 55–59, 60–64, 65–69, 70–74, 75–79, 80–84, and ≥85 respectively.
- **race_eth**: Race/ethnicity coded as 1 = Non‑Hispanic white, 2 = Non‑Hispanic black, 3 = Asian/Pacific Islander, 4 = Native American, 5 = Hispanic, 6 = Other/mixed, 9 = Unknown.
- **first_degree_hx**: History of breast cancer in a first‑degree relative (0 = No, 1 = Yes, 9 = Unknown).
- **age_menarche**: Age at menarche (0 = ≥14, 1 = 12–13, 2 = <12, 9 = Unknown).
- **age_first_birth**: Age at first birth (0 = <20, 1 = 20–24, 2 = 25–29, 3 = ≥30, 4 = Nulliparous, 9 = Unknown).
- **BIRADS_breast_density**: Breast density (1 = Almost entirely fat, 2 = Scattered fibroglandular, 3 = Heterogeneously dense, 4 = Extremely dense, 9 = Unknown or different measurement).
- **current_hrt**: Use of hormone replacement therapy (0 = No, 1 = Yes, 9 = Unknown).
- **menopaus**: Menopausal status (1 = Pre/peri‑menopause, 2 = Post‑menopause, 3 = Surgical menopause, 9 = Unknown).
- **bmi_group**: Body mass index group (1 = 10–24.99, 2 = 25–29.99, 3 = 30–34.99, 4 = ≥35, 9 = Unknown).
- **biophx**: Previous breast biopsy or aspiration (0 = No, 1 = Yes, 9 = Unknown).
- **breast_cancer_history**: Target variable indicating prior breast‑cancer diagnosis (0 = No, 1 = Yes, 9 = Unknown).
- **count**: The number of women represented by each row; used as a sample weight.
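
Because code `9` means "Unknown" in most variables but is a valid age band (65–69) in `age_group_5_years`, any handling of unknowns has to be column‑aware. The following is a minimal sketch on a toy frame that borrows the dictionary's column names; the helper `mask_unknowns` and the toy values are illustrative assumptions, not part of the dataset or the original notebook.

```python
import pandas as pd

# Columns where the data dictionary defines 9 as "Unknown".
UNKNOWN_CODE_COLS = [
    'race_eth', 'first_degree_hx', 'age_menarche', 'age_first_birth',
    'BIRADS_breast_density', 'current_hrt', 'menopaus', 'bmi_group',
    'biophx', 'breast_cancer_history',
]

def mask_unknowns(frame, cols=UNKNOWN_CODE_COLS, code=9):
    """Return a copy with `code` replaced by NaN, only in columns where it means Unknown."""
    out = frame.copy()
    present = [c for c in cols if c in out.columns]
    out[present] = out[present].mask(out[present].eq(code))
    return out

# Toy rows (made-up values) to show which 9s are treated as missing.
toy = pd.DataFrame({
    'age_group_5_years': [9, 3, 9],   # 9 is the 65–69 band here, not missing
    'bmi_group': [1, 9, 2],
    'current_hrt': [9, 0, 1],
    'count': [9, 4, 2],               # a frequency; 9 is a real value
})
masked = mask_unknowns(toy)
print(masked.isna().sum())
```

Only `bmi_group` and `current_hrt` gain a missing value in this toy example; the age band and the row frequency are left untouched.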



In [None]:

# ## 1. Import libraries and load the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter

# Configure display options
pd.set_option('display.max_columns', None)

# Read the 10% sample dataset
df = pd.read_csv('/home/oai/share/sample_10percent.csv')

# Display the first five rows to understand the raw data structure
df.head()


In [None]:

# ## 2. Inspect data structure and missing values

# Display general information about the dataframe
print('Shape of dataframe:', df.shape)
print('\nData types and non‑null counts:')
df.info()

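# Count occurrences of the unknown code 9, restricted to the columns where the
# data dictionary defines 9 as "Unknown". In age_group_5_years code 9 is the
# valid 65–69 band, and 'count' is a frequency, so both are excluded.
unknown_cols = ['race_eth', 'first_degree_hx', 'age_menarche', 'age_first_birth',
                'BIRADS_breast_density', 'current_hrt', 'menopaus', 'bmi_group',
                'biophx', 'breast_cancer_history']
unknown_counts = df[unknown_cols].eq(9).sum()
print('\nUnknown (code = 9) counts per column:')
print(unknown_counts)

# Calculate the percentage of rows with an unknown code per variable
unknown_percent = (unknown_counts / df.shape[0]) * 100
unknown_percent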


In [None]:

# ## 3. Map numeric codes to categories

# Define mapping dictionaries for each categorical variable
age_group_map = {1:'18–29', 2:'30–34', 3:'35–39', 4:'40–44', 5:'45–49', 6:'50–54', 7:'55–59', 8:'60–64', 9:'65–69', 10:'70–74', 11:'75–79', 12:'80–84', 13:'≥85'}
race_map = {1:'Non‑Hispanic white', 2:'Non‑Hispanic black', 3:'Asian/Pacific Islander', 4:'Native American', 5:'Hispanic', 6:'Other/mixed', 9:'Unknown'}
first_degree_map = {0:'No', 1:'Yes', 9:'Unknown'}
menarche_map = {0:'≥14', 1:'12–13', 2:'<12', 9:'Unknown'}
first_birth_map = {0:'<20', 1:'20–24', 2:'25–29', 3:'≥30', 4:'Nulliparous', 9:'Unknown'}
density_map = {1:'Almost entirely fat', 2:'Scattered fibroglandular', 3:'Heterogeneously dense', 4:'Extremely dense', 9:'Unknown'}
hrt_map = {0:'No', 1:'Yes', 9:'Unknown'}
menopause_map = {1:'Pre/peri‑menopause', 2:'Post‑menopause', 3:'Surgical menopause', 9:'Unknown'}
bmi_map = {1:'10–24.99', 2:'25–29.99', 3:'30–34.99', 4:'≥35', 9:'Unknown'}
biopsy_map = {0:'No', 1:'Yes', 9:'Unknown'}
cancer_map = {0:'No', 1:'Yes', 9:'Unknown'}

# Apply mappings to create a human‑readable DataFrame
df_mapped = df.copy()
df_mapped['age_group_5_years'] = df_mapped['age_group_5_years'].map(age_group_map)
df_mapped['race_eth'] = df_mapped['race_eth'].map(race_map)
df_mapped['first_degree_hx'] = df_mapped['first_degree_hx'].map(first_degree_map)
df_mapped['age_menarche'] = df_mapped['age_menarche'].map(menarche_map)
df_mapped['age_first_birth'] = df_mapped['age_first_birth'].map(first_birth_map)
df_mapped['BIRADS_breast_density'] = df_mapped['BIRADS_breast_density'].map(density_map)
df_mapped['current_hrt'] = df_mapped['current_hrt'].map(hrt_map)
df_mapped['menopaus'] = df_mapped['menopaus'].map(menopause_map)
df_mapped['bmi_group'] = df_mapped['bmi_group'].map(bmi_map)
df_mapped['biophx'] = df_mapped['biophx'].map(biopsy_map)
df_mapped['breast_cancer_history'] = df_mapped['breast_cancer_history'].map(cancer_map)

# Display the first few mapped rows
df_mapped.head()
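
The repeated `.map` calls above can also be driven from a single dictionary of mappings. The sketch below is one possible arrangement, not the notebook's method: `apply_code_maps` is a hypothetical helper, and it additionally stores each column as an ordered categorical so that plots keep the code order rather than alphabetical order.

```python
import pandas as pd

# Two of the notebook's mappings, repeated so the sketch is self-contained.
bmi_map = {1: '10–24.99', 2: '25–29.99', 3: '30–34.99', 4: '≥35', 9: 'Unknown'}
hrt_map = {0: 'No', 1: 'Yes', 9: 'Unknown'}
code_maps = {'bmi_group': bmi_map, 'current_hrt': hrt_map}

def apply_code_maps(frame, maps):
    """Map coded columns to labels, stored as ordered categoricals in code order."""
    out = frame.copy()
    for col, mapping in maps.items():
        labels = [mapping[k] for k in sorted(mapping)]
        out[col] = pd.Categorical(out[col].map(mapping), categories=labels, ordered=True)
    return out

# Toy rows (made-up values); the 7 is deliberately not a valid BMI code.
toy = pd.DataFrame({'bmi_group': [2, 9, 1, 7], 'current_hrt': [0, 1, 9, 0]})
toy_mapped = apply_code_maps(toy, code_maps)
print(toy_mapped)

# Codes absent from a mapping become NaN, which makes stray codes easy to spot.
print(toy_mapped['bmi_group'].isna().sum())
```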


In [None]:

# ## 4. Exploratory Data Analysis (EDA) – Distribution of Variables

# Visualise the distribution of the target variable (weighted by count)
target_counts = df.groupby('breast_cancer_history')['count'].sum().rename(index={0:'No',1:'Yes',9:'Unknown'})
plt.figure(figsize=(5,4))
sns.barplot(x=target_counts.index, y=target_counts.values, palette='Set2')
plt.title('Prior breast cancer history (weighted by count)')
plt.xlabel('Breast cancer history')
plt.ylabel('Number of women')
plt.show()

# Distribution of race/ethnicity
race_counts = df.groupby('race_eth')['count'].sum().rename(index=race_map)
plt.figure(figsize=(8,4))
sns.barplot(x=race_counts.index, y=race_counts.values, palette='Set3')
plt.title('Race/Ethnicity distribution (weighted by count)')
plt.xlabel('Race/ethnicity')
plt.ylabel('Number of women')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# BMI group distribution
bmi_counts = df.groupby('bmi_group')['count'].sum().rename(index=bmi_map)
plt.figure(figsize=(6,4))
sns.barplot(x=bmi_counts.index, y=bmi_counts.values, palette='coolwarm')
plt.title('BMI group distribution (weighted by count)')
plt.xlabel('BMI group')
plt.ylabel('Number of women')
plt.show()

# Breast density distribution
density_counts = df.groupby('BIRADS_breast_density')['count'].sum().rename(index=density_map)
plt.figure(figsize=(6,4))
sns.barplot(x=density_counts.index, y=density_counts.values, palette='YlGnBu')
plt.title('Breast density distribution (weighted by count)')
plt.xlabel('BI‑RADS density')
plt.ylabel('Number of women')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Age group versus BMI group heatmap (weighted counts)
heat_df = df.pivot_table(values='count', index='age_group_5_years', columns='bmi_group', aggfunc='sum')
plt.figure(figsize=(8,6))
sns.heatmap(heat_df, annot=True, fmt='.0f', cmap='Blues')
plt.title('Heatmap of age group vs BMI group (counts)')
plt.xlabel('BMI group code')
plt.ylabel('Age group code')
plt.show()
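
The plots above show how many women fall into each category. A natural follow‑up is the weighted share of women with a prior diagnosis within each category. The sketch below illustrates that calculation on made‑up aggregated rows; only the column names come from the notebook.

```python
import pandas as pd

# Toy aggregated rows using the notebook's column names; the values are made up.
toy = pd.DataFrame({
    'bmi_group':             [1, 1, 2, 2, 2, 9],
    'breast_cancer_history': [0, 1, 0, 1, 9, 0],
    'count':                 [90, 10, 60, 20, 5, 15],
})

# Keep rows with a known outcome, then weight each row by the women it represents.
known = toy[toy['breast_cancer_history'] != 9]
cases = known['count'].where(known['breast_cancer_history'] == 1, 0)
weighted_rate = (
    cases.groupby(known['bmi_group']).sum()
    / known.groupby('bmi_group')['count'].sum()
)
print(weighted_rate)
```

In the toy data, BMI group 1 has 10 cases among 100 women and group 2 has 20 among 80, so the weighted rates are 0.10 and 0.25. An unweighted mean over rows would report 0.5 for both.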


In [None]:

# ## 5. Visualise missing values (unknown codes)

# Bar chart of proportion of unknown entries per feature
plt.figure(figsize=(8,4))
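# Count code 9 only where it means "Unknown"; in age_group_5_years it is the 65–69 band
unknown_cols = ['race_eth', 'first_degree_hx', 'age_menarche', 'age_first_birth',
                'BIRADS_breast_density', 'current_hrt', 'menopaus', 'bmi_group',
                'biophx', 'breast_cancer_history']
unknown_percent = df[unknown_cols].eq(9).mean() * 100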
unknown_percent_sorted = unknown_percent.sort_values(ascending=False)
sns.barplot(x=unknown_percent_sorted.index, y=unknown_percent_sorted.values, palette='magma')
plt.title('Proportion of unknown codes per variable (%)')
plt.ylabel('Unknown proportion (%)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# Determine if any rows have unknown target values (should be removed before modelling)
print('Number of rows with unknown cancer history:', (df['breast_cancer_history'] == 9).sum())


In [None]:

# ## 6. Preprocessing for modelling

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, classification_report, confusion_matrix

# Remove rows with unknown target
df_model = df[df['breast_cancer_history'] != 9].copy()

# Separate target and predictors
y = df_model['breast_cancer_history']
X = df_model.drop(columns=['breast_cancer_history'])

# Preserve sample weights
weights = X['count'].values

# Drop the count column from features
X = X.drop(columns=['count'])

# Identify categorical and numerical columns
categorical_cols = [c for c in X.columns if c != 'year']
numerical_cols = ['year']

# Preprocessing: one‑hot encode categorical variables, pass through numerical variables
preprocessor = ColumnTransformer([
    ('categorical', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    ('numeric', 'passthrough', numerical_cols)
])

# Split data into training and testing sets (stratify by target)
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, weights, test_size=0.2, random_state=42, stratify=y
)

# Print shapes
print('Training set:', X_train.shape)
print('Test set:', X_test.shape)


In [None]:

# ## 7. Baseline model – Logistic Regression

from sklearn.linear_model import LogisticRegression

# Build the pipeline
log_reg_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=500, class_weight='balanced'))
])

# Fit the model with sample weights
log_reg_pipeline.fit(X_train, y_train, classifier__sample_weight=w_train)

# Predict probabilities and labels
y_prob_lr = log_reg_pipeline.predict_proba(X_test)[:,1]
y_pred_lr = (y_prob_lr >= 0.5).astype(int)

# Evaluate
acc_lr = accuracy_score(y_test, y_pred_lr)
prec_lr = precision_score(y_test, y_pred_lr)
recall_lr = recall_score(y_test, y_pred_lr)
f1_lr = f1_score(y_test, y_pred_lr)
roc_auc_lr = roc_auc_score(y_test, y_prob_lr)

print('Logistic Regression performance:')
print('  Accuracy:', acc_lr)
print('  Precision:', prec_lr)
print('  Recall:', recall_lr)
print('  F1 score:', f1_lr)
print('  ROC AUC:', roc_auc_lr)

# Confusion matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Greens', xticklabels=['Pred No','Pred Yes'], yticklabels=['True No','True Yes'])
plt.title('Confusion Matrix – Logistic Regression')
plt.show()

# Plot ROC curve
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_prob_lr)
plt.figure(figsize=(6,4))
plt.plot(fpr_lr, tpr_lr, label=f'Logistic Regression (AUC = {roc_auc_lr:.3f})')
plt.plot([0,1],[0,1],'--', color='grey')
plt.title('ROC curve – Logistic Regression')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
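
The metrics above treat every test row equally, although each row stands for `count` women. scikit‑learn's metric functions accept a `sample_weight` argument, so a weighted evaluation is possible. The sketch below uses made‑up arrays as stand‑ins for `y_test`, the predicted probabilities, and `w_test`; it is an illustration of the option, not a result from this dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Made-up stand-ins for y_test, predicted probabilities and w_test.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.2, 0.7, 0.4, 0.9])
w = np.array([50, 10, 5, 35])
y_pred = (y_prob >= 0.5).astype(int)

# Per-row accuracy: two of four rows are classified correctly.
unweighted_acc = accuracy_score(y_true, y_pred)
# Per-woman accuracy: the correctly classified rows carry 85 of 100 women.
weighted_acc = accuracy_score(y_true, y_pred, sample_weight=w)
weighted_recall = recall_score(y_true, y_pred, sample_weight=w)
weighted_auc = roc_auc_score(y_true, y_prob, sample_weight=w)

print('Unweighted accuracy:', unweighted_acc)
print('Weighted accuracy:  ', weighted_acc)
print('Weighted recall:    ', weighted_recall)
print('Weighted ROC AUC:   ', weighted_auc)
```

With aggregated data the two views can differ substantially, as the toy accuracies of 0.5 and 0.85 show.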


In [None]:

# ## 8. Ensemble model – Random Forest

from sklearn.ensemble import RandomForestClassifier

rf_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=300, random_state=42, class_weight='balanced'))
])

# Fit model with sample weights
rf_pipeline.fit(X_train, y_train, classifier__sample_weight=w_train)

# Predict probabilities and labels
y_prob_rf = rf_pipeline.predict_proba(X_test)[:,1]
y_pred_rf = (y_prob_rf >= 0.5).astype(int)

# Evaluate
acc_rf = accuracy_score(y_test, y_pred_rf)
prec_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)
roc_auc_rf = roc_auc_score(y_test, y_prob_rf)

print('Random Forest performance:')
print('  Accuracy:', acc_rf)
print('  Precision:', prec_rf)
print('  Recall:', recall_rf)
print('  F1 score:', f1_rf)
print('  ROC AUC:', roc_auc_rf)

# Confusion matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Purples', xticklabels=['Pred No','Pred Yes'], yticklabels=['True No','True Yes'])
plt.title('Confusion Matrix – Random Forest')
plt.show()

# ROC curve
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_prob_rf)
plt.figure(figsize=(6,4))
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {roc_auc_rf:.3f})')
plt.plot([0,1],[0,1],'--', color='grey')
plt.title('ROC curve – Random Forest')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()

# Feature importance analysis
# Extract encoded feature names
onehot = rf_pipeline.named_steps['preprocessor'].transformers_[0][1]
encoded_cols = list(onehot.get_feature_names_out(categorical_cols)) + numerical_cols
importances = rf_pipeline.named_steps['classifier'].feature_importances_

feat_importances = pd.DataFrame({'feature': encoded_cols, 'importance': importances}).sort_values(by='importance', ascending=False)

print('Top 20 important features (Random Forest):')
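# A bare expression is only displayed when it is the last line of a cell, so print explicitly
print(feat_importances.head(20))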

plt.figure(figsize=(8,6))
sns.barplot(data=feat_importances.head(20), x='importance', y='feature', palette='viridis')
plt.title('Top 20 feature importances – Random Forest')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
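
One‑hot encoding spreads each variable's importance over several indicator columns, which makes a variable‑level ranking hard to read from the chart. A common convenience, shown here as a sketch rather than as the notebook's method, is to sum the importances back to the original variable. The importance values and the helper `source_variable` are made up; the feature names follow the `<variable>_<code>` pattern produced by `OneHotEncoder`.

```python
import pandas as pd

# Made-up importances for encoded columns, plus the passthrough 'year' column.
toy_importances = pd.DataFrame({
    'feature': ['age_group_5_years_7', 'age_group_5_years_8', 'bmi_group_1',
                'bmi_group_9', 'first_degree_hx_1', 'year'],
    'importance': [0.20, 0.15, 0.10, 0.05, 0.30, 0.20],
})
source_cols = ['age_group_5_years', 'bmi_group', 'first_degree_hx']  # stand-in for categorical_cols

def source_variable(encoded_name, categorical):
    """Return the original column an encoded feature came from."""
    for col in categorical:
        if encoded_name.startswith(col + '_'):
            return col
    return encoded_name  # passthrough columns keep their own name

toy_importances['variable'] = toy_importances['feature'].apply(
    lambda name: source_variable(name, source_cols)
)
by_variable = (
    toy_importances.groupby('variable')['importance'].sum().sort_values(ascending=False)
)
print(by_variable)
```

The prefix match assumes that no column name is itself a prefix of another column name followed by an underscore, which holds for the variables in the data dictionary.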


In [None]:

# ## 9. Boosting model – XGBoost

# Try to import XGBoost; if unavailable, fall back to GradientBoosting
try:
    from xgboost import XGBClassifier
    use_xgb = True
except ImportError:
    from sklearn.ensemble import GradientBoostingClassifier
    use_xgb = False

if use_xgb:
    xgb_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', XGBClassifier(
            n_estimators=500,
            learning_rate=0.05,
            max_depth=4,
            subsample=0.8,
            colsample_bytree=0.8,
            eval_metric='logloss',
            random_state=42
        ))
    ])
    
    xgb_pipeline.fit(X_train, y_train, classifier__sample_weight=w_train)
    
    y_prob_xgb = xgb_pipeline.predict_proba(X_test)[:,1]
    y_pred_xgb = (y_prob_xgb >= 0.5).astype(int)
    
    acc_xgb = accuracy_score(y_test, y_pred_xgb)
    prec_xgb = precision_score(y_test, y_pred_xgb)
    recall_xgb = recall_score(y_test, y_pred_xgb)
    f1_xgb = f1_score(y_test, y_pred_xgb)
    roc_auc_xgb = roc_auc_score(y_test, y_prob_xgb)
    
    print('XGBoost performance:')
    print('  Accuracy:', acc_xgb)
    print('  Precision:', prec_xgb)
    print('  Recall:', recall_xgb)
    print('  F1 score:', f1_xgb)
    print('  ROC AUC:', roc_auc_xgb)
    
    # Confusion matrix
    cm_xgb = confusion_matrix(y_test, y_pred_xgb)
    sns.heatmap(cm_xgb, annot=True, fmt='d', cmap='Reds', xticklabels=['Pred No','Pred Yes'], yticklabels=['True No','True Yes'])
    plt.title('Confusion Matrix – XGBoost')
    plt.show()
    
    # ROC curve
    fpr_xgb, tpr_xgb, _ = roc_curve(y_test, y_prob_xgb)
    plt.figure(figsize=(6,4))
    plt.plot(fpr_xgb, tpr_xgb, label=f'XGBoost (AUC = {roc_auc_xgb:.3f})')
    plt.plot([0,1],[0,1],'--', color='grey')
    plt.title('ROC curve – XGBoost')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.legend()
    plt.show()
    
    # Feature importance
    importances_xgb = xgb_pipeline.named_steps['classifier'].feature_importances_
    feat_importances_xgb = pd.DataFrame({'feature': encoded_cols, 'importance': importances_xgb}).sort_values(by='importance', ascending=False)
    
    print('Top 20 important features (XGBoost):')
    print(feat_importances_xgb.head(20))
    
    plt.figure(figsize=(8,6))
    sns.barplot(data=feat_importances_xgb.head(20), x='importance', y='feature', palette='rocket')
    plt.title('Top 20 feature importances – XGBoost')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.show()
else:
    # Fallback to GradientBoostingClassifier
    gbt_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', GradientBoostingClassifier())
    ])
    
    gbt_pipeline.fit(X_train, y_train, classifier__sample_weight=w_train)
    
    y_prob_gbt = gbt_pipeline.predict_proba(X_test)[:,1]
    y_pred_gbt = (y_prob_gbt >= 0.5).astype(int)
    
    acc_gbt = accuracy_score(y_test, y_pred_gbt)
    prec_gbt = precision_score(y_test, y_pred_gbt)
    recall_gbt = recall_score(y_test, y_pred_gbt)
    f1_gbt = f1_score(y_test, y_pred_gbt)
    roc_auc_gbt = roc_auc_score(y_test, y_prob_gbt)
    
    print('Gradient Boosting performance:')
    print('  Accuracy:', acc_gbt)
    print('  Precision:', prec_gbt)
    print('  Recall:', recall_gbt)
    print('  F1 score:', f1_gbt)
    print('  ROC AUC:', roc_auc_gbt)
    
    # Confusion matrix
    cm_gbt = confusion_matrix(y_test, y_pred_gbt)
    sns.heatmap(cm_gbt, annot=True, fmt='d', cmap='Oranges', xticklabels=['Pred No','Pred Yes'], yticklabels=['True No','True Yes'])
    plt.title('Confusion Matrix – Gradient Boosting')
    plt.show()
    
    # ROC curve
    fpr_gbt, tpr_gbt, _ = roc_curve(y_test, y_prob_gbt)
    plt.figure(figsize=(6,4))
    plt.plot(fpr_gbt, tpr_gbt, label=f'Gradient Boosting (AUC = {roc_auc_gbt:.3f})')
    plt.plot([0,1],[0,1],'--', color='grey')
    plt.title('ROC curve – Gradient Boosting')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.legend()
    plt.show()
    
    # Feature importance
    importances_gbt = gbt_pipeline.named_steps['classifier'].feature_importances_
    feat_importances_gbt = pd.DataFrame({'feature': encoded_cols, 'importance': importances_gbt}).sort_values(by='importance', ascending=False)
    
    print('Top 20 important features (Gradient Boosting):')
    print(feat_importances_gbt.head(20))
    
    plt.figure(figsize=(8,6))
    sns.barplot(data=feat_importances_gbt.head(20), x='importance', y='feature', palette='crest')
    plt.title('Top 20 feature importances – Gradient Boosting')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.show()
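
The three modelling cells repeat the same metric calculations. A small helper could collect them into one comparison table. The sketch below is an assumption about how that might look: `summarise_model` is a hypothetical name, and the labels and scores are made up rather than taken from the fitted models.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def summarise_model(name, y_true, y_prob, threshold=0.5):
    """Collect the notebook's five metrics for one model into a dict."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        'model': name,
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, zero_division=0),
        'recall': recall_score(y_true, y_pred, zero_division=0),
        'f1': f1_score(y_true, y_pred, zero_division=0),
        'roc_auc': roc_auc_score(y_true, y_prob),
    }

# Made-up labels and scores standing in for y_test and each model's predict_proba output.
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = {
    'model_a': np.array([0.1, 0.6, 0.3, 0.4, 0.8, 0.9]),
    'model_b': np.array([0.2, 0.3, 0.1, 0.7, 0.6, 0.9]),
}
comparison = pd.DataFrame(
    [summarise_model(name, y_true, prob) for name, prob in scores.items()]
).set_index('model')
print(comparison.round(3))
```

In the notebook, the same helper could be called with `y_test` and `y_prob_lr`, `y_prob_rf`, and `y_prob_xgb` (or `y_prob_gbt`) to place all models side by side.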



## 10. Summary and next steps

In this extended analysis we carried out EDA and built several models to predict prior breast cancer history from demographic and lifestyle factors.  The EDA revealed class imbalance and varying proportions of unknown values across variables.  We visualised the distributions of the major risk factors and the proportion of unknown entries per variable.  All models were trained with sample weights so that each row counts for the number of women it represents; the test metrics, by contrast, are computed per row without weights.

Key observations:

- **Missing values**: Some variables contain a notable proportion of unknown entries (coded as 9).  Future work could explore imputation techniques or sensitivity analyses to gauge the impact of these unknowns.
- **Model performance**: Ensemble methods (Random Forest and XGBoost) achieved higher ROC AUC and F1 scores than the baseline Logistic Regression, which suggests that non‑linear interactions among variables improve predictive power.  Feature importance analyses consistently highlighted age group, BMI group, breast density, and family history as influential factors.
- **ROC curves**: The ROC curves provide a visual summary of model performance across different thresholds, and the AUC values help compare models quantitatively.

Further improvements could include hyperparameter tuning via cross‑validation, exploring additional algorithms (e.g. CatBoost), and validating the models on external datasets or prospective cohorts.  Care should also be taken to communicate the limitations of modelling aggregated data and to ensure fairness and interpretability when applying such models in healthcare settings.
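
As a starting point for the cross‑validated tuning mentioned above, the sketch below runs a small grid search over a pipeline while passing sample weights to the classifier. It uses synthetic data from `make_classification` and an arbitrary grid over `C`; these are assumptions for illustration, and in the notebook the inputs would be `X_train`, `y_train`, and `w_train` with the existing `preprocessor`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with made-up integer weights.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
w_demo = rng.integers(1, 20, size=len(y_demo))

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=500)),
])
search = GridSearchCV(
    pipe,
    param_grid={'classifier__C': [0.01, 0.1, 1.0, 10.0]},
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)

# The weights are split per fold together with X and y and used for fitting only;
# the 'roc_auc' score on each validation fold is unweighted in this setup.
search.fit(X_demo, y_demo, classifier__sample_weight=w_demo)
print(search.best_params_, round(search.best_score_, 3))
```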
