
# Early Breast Cancer Risk Stratification Using AI and Machine Learning

This notebook presents a step‑by‑step workflow for predicting early breast cancer risk using demographic, lifestyle and reproductive health data.  The data are derived from the Risk Factor Dataset, which is aggregated into counts of unique combinations of risk factors.  Our goal is to build supervised machine‑learning models that stratify breast cancer risk from non‑invasive information.  The workflow includes exploratory data analysis (EDA), feature engineering, model training and evaluation, and interpretation of the results.

The variables are coded as integers, and each row summarises all women who share the same risk‑factor profile.  A separate **count** column gives the number of women represented by each row.  For readability we map the integer codes to human‑interpretable categories.  The target variable `breast_cancer_history` records whether a prior breast cancer diagnosis has been reported (0 = No, 1 = Yes, 9 = Unknown).
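
Because each row stands for many women, summaries should add up the **count** column rather than count rows. The sketch below illustrates this on a small made‑up table (the values are invented for illustration and are not taken from the dataset); `toy`, `women_per_outcome` and `prevalence` are hypothetical names.

```python
import pandas as pd

# Toy aggregated table with invented values (not real data).
toy = pd.DataFrame({
    "first_degree_hx":       [0, 0, 1, 1, 9],
    "breast_cancer_history": [0, 1, 0, 1, 9],
    "count":                 [50, 5, 20, 5, 20],
})

# Number of women (not number of rows) in each outcome category.
women_per_outcome = toy.groupby("breast_cancer_history")["count"].sum()

# Prevalence among women with a known outcome (codes 0 and 1 only).
known = toy[toy["breast_cancer_history"] != 9]
cases = known.loc[known["breast_cancer_history"] == 1, "count"].sum()
prevalence = cases / known["count"].sum()

print(women_per_outcome.to_dict())
print(round(prevalence, 3))
```

Counting rows instead would treat a profile shared by 50 women the same as one shared by 5, which is why the later cells pass **count** as a sample weight.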




The important variables and their codes are:

- **year** – calendar year of the observation (2005–2017).
- **age_group_5_years** – age grouped in five‑year intervals; codes 1–13 correspond to ranges 18–29, 30–34, 35–39, 40–44, 45–49, 50–54, 55–59, 60–64, 65–69, 70–74, 75–79, 80–84, and **≥85** respectively.
- **race_eth** – race/ethnicity: 1 = Non‑Hispanic white, 2 = Non‑Hispanic black, 3 = Asian/Pacific Islander, 4 = Native American, 5 = Hispanic, 6 = Other/mixed, 9 = Unknown.
- **first_degree_hx** – history of breast cancer in a first‑degree relative: 0 = No, 1 = Yes, 9 = Unknown.
- **age_menarche** – age at menarche: 0 = ≥14 years, 1 = 12–13 years, 2 = <12 years, 9 = Unknown.
- **age_first_birth** – age at first birth: 0 = <20 years, 1 = 20–24 years, 2 = 25–29 years, 3 = ≥30 years, 4 = Nulliparous (no childbirth), 9 = Unknown.
- **BIRADS_breast_density** – BI‑RADS breast density: 1 = Almost entirely fat, 2 = Scattered fibroglandular densities, 3 = Heterogeneously dense, 4 = Extremely dense, 9 = Unknown or different measurement.
- **current_hrt** – use of hormone replacement therapy: 0 = No, 1 = Yes, 9 = Unknown.
- **menopaus** – menopausal status: 1 = Pre‑ or peri‑menopausal, 2 = Post‑menopausal, 3 = Surgical menopause, 9 = Unknown.
- **bmi_group** – body mass index (BMI) group: 1 = 10–24.99, 2 = 25–29.99, 3 = 30–34.99, 4 = ≥35, 9 = Unknown.
- **biophx** – previous breast biopsy or aspiration: 0 = No, 1 = Yes, 9 = Unknown.
- **breast_cancer_history** – prior breast‑cancer diagnosis: 0 = No, 1 = Yes, 9 = Unknown.
- **count** – number of women represented by this combination of covariates (numerical).
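
One detail of this coding is easy to miss: 9 means "Unknown" in most columns, but in **age_group_5_years** it means 65–69, and in **year** and **count** it is an ordinary number. The following is a minimal sketch of counting unknowns only where the code applies, using a made‑up table; `UNKNOWN_CODED`, `toy`, `unknown_rows` and `unknown_women` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Columns in which code 9 means "Unknown", per the variable list above.
# age_group_5_years is left out on purpose: there, 9 means 65–69.
UNKNOWN_CODED = [
    "race_eth", "first_degree_hx", "age_menarche", "age_first_birth",
    "BIRADS_breast_density", "current_hrt", "menopaus", "bmi_group",
    "biophx", "breast_cancer_history",
]

# Toy table with invented values (not real data).
toy = pd.DataFrame({
    "age_group_5_years": [9, 3, 9],
    "first_degree_hx":   [9, 0, 1],
    "bmi_group":         [2, 9, 9],
    "count":             [9, 4, 1],
})

cols = [c for c in UNKNOWN_CODED if c in toy.columns]
is_unknown = toy[cols].eq(9)

# Rows with an unknown value, per column.
unknown_rows = is_unknown.sum()

# Women with an unknown value, per column (each row weighted by its count).
unknown_women = is_unknown.astype(int).mul(toy["count"], axis=0).sum()

print(unknown_rows.to_dict())
print(unknown_women.to_dict())
```

A blanket check for the value 9 across every column would also flag the two rows in the 65–69 age group and the row whose count happens to be 9.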



In [None]:

# ## 1. Import libraries and load the dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure pandas display options for readability
pd.set_option('display.max_columns', None)

# Read the 10% sample dataset
df = pd.read_csv('/home/oai/share/sample_10percent.csv')

# Display the first few rows
df.head()


In [None]:

# ## 2. Inspect the dataset

# Print a summary of the dataset to understand the structure
print('Dataset shape:', df.shape)
print('\nData types and non‑null counts:')
df.info()  # info() prints its summary itself and returns None

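# Summarise unknown values.  Code 9 means "Unknown" only in the columns listed below;
# in age_group_5_years it means 65–69, and in year/count it is an ordinary number.
unknown_coded_cols = ['race_eth', 'first_degree_hx', 'age_menarche', 'age_first_birth',
                      'BIRADS_breast_density', 'current_hrt', 'menopaus', 'bmi_group',
                      'biophx', 'breast_cancer_history']
missing_summary = df[unknown_coded_cols].eq(9).sum()
print('\nNumber of unknown codes (value = 9) per column:')
print(missing_summary)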

# Describe the numeric columns (only 'year' and 'count' are truly numeric)
df[['year', 'count']].describe()


In [None]:

# ## 3. Map integer codes to meaningful categories

# Define mapping dictionaries based on the data dictionary
age_group_map = {
    1: '18–29', 2: '30–34', 3: '35–39', 4: '40–44', 5: '45–49',
    6: '50–54', 7: '55–59', 8: '60–64', 9: '65–69', 10: '70–74',
    11: '75–79', 12: '80–84', 13: '≥85'
}

race_map = {
    1: 'Non‑Hispanic white', 2: 'Non‑Hispanic black', 3: 'Asian/Pacific Islander',
    4: 'Native American', 5: 'Hispanic', 6: 'Other/mixed', 9: 'Unknown'
}

first_degree_map = {0: 'No', 1: 'Yes', 9: 'Unknown'}
menarche_map = {0: '≥14', 1: '12–13', 2: '<12', 9: 'Unknown'}
first_birth_map = {0: '<20', 1: '20–24', 2: '25–29', 3: '≥30', 4: 'Nulliparous', 9: 'Unknown'}
density_map = {1: 'Almost entirely fat', 2: 'Scattered fibroglandular',
               3: 'Heterogeneously dense', 4: 'Extremely dense', 9: 'Unknown'}
hrt_map = {0: 'No', 1: 'Yes', 9: 'Unknown'}
menopause_map = {1: 'Pre/peri‑menopause', 2: 'Post‑menopause', 3: 'Surgical menopause', 9: 'Unknown'}
bmi_map = {1: '10–24.99', 2: '25–29.99', 3: '30–34.99', 4: '≥35', 9: 'Unknown'}
biopsy_map = {0: 'No', 1: 'Yes', 9: 'Unknown'}
cancer_history_map = {0: 'No', 1: 'Yes', 9: 'Unknown'}

# Create a copy of the dataframe with categorical labels
df_mapped = df.copy()

# Apply the mappings
df_mapped['age_group_5_years'] = df_mapped['age_group_5_years'].map(age_group_map)
df_mapped['race_eth'] = df_mapped['race_eth'].map(race_map)
df_mapped['first_degree_hx'] = df_mapped['first_degree_hx'].map(first_degree_map)
df_mapped['age_menarche'] = df_mapped['age_menarche'].map(menarche_map)
df_mapped['age_first_birth'] = df_mapped['age_first_birth'].map(first_birth_map)
df_mapped['BIRADS_breast_density'] = df_mapped['BIRADS_breast_density'].map(density_map)
df_mapped['current_hrt'] = df_mapped['current_hrt'].map(hrt_map)
df_mapped['menopaus'] = df_mapped['menopaus'].map(menopause_map)
df_mapped['bmi_group'] = df_mapped['bmi_group'].map(bmi_map)
df_mapped['biophx'] = df_mapped['biophx'].map(biopsy_map)
df_mapped['breast_cancer_history'] = df_mapped['breast_cancer_history'].map(cancer_history_map)

# Show the first few mapped rows
df_mapped.head()


In [None]:

# ## 4. Exploratory data analysis (EDA)

# Distribution of the target variable (breast cancer history) weighted by count
target_counts = df.groupby('breast_cancer_history')['count'].sum().rename(index={0:'No',1:'Yes',9:'Unknown'})
plt.figure(figsize=(5,4))
sns.barplot(x=target_counts.index, y=target_counts.values, palette='pastel')
plt.title('Distribution of prior breast cancer history (weighted by count)')
plt.ylabel('Number of women')
plt.xlabel('Breast cancer history')
plt.show()

# Age distribution of the cohort
age_counts = df.groupby('age_group_5_years')['count'].sum()
plt.figure(figsize=(8,4))
sns.barplot(x=age_counts.index, y=age_counts.values, palette='viridis')
plt.title('Age group distribution (weighted by count)')
plt.ylabel('Number of women')
plt.xlabel('Age group code')
plt.show()

# Cross‑tabulation: first‑degree family history vs breast cancer history
ct = pd.crosstab(df['first_degree_hx'], df['breast_cancer_history'], values=df['count'], aggfunc='sum')
ct = ct.reindex(index=[0,1,9], columns=[0,1,9])
ct.index = ['No family history','Yes family history','Unknown']
ct.columns = ['No cancer','Cancer','Unknown']
print('Cross‑tabulation of first‑degree family history vs breast cancer history (weighted counts):')
ct


In [None]:

# ## 5. Data preprocessing for modelling

# We will treat this as a binary classification problem: prior breast cancer (1) vs no prior breast cancer (0).
# Rows where breast_cancer_history == 9 (Unknown) will be removed from modelling.

# Filter out unknown cancer history
df_model = df[df['breast_cancer_history'] != 9].copy()

# Separate target and features
y = df_model['breast_cancer_history']
X = df_model.drop(columns=['breast_cancer_history'])

# Store sample weights from the count column
sample_weights = X['count'].values

# Drop the count column from features (we will pass it as sample_weight later)
X = X.drop(columns=['count'])

# Identify categorical columns (all except year)
categorical_cols = [col for col in X.columns if col != 'year']
numerical_cols = ['year']  # year is already numerical

# Use one‑hot encoding for categorical variables
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

# Define the column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('num', 'passthrough', numerical_cols)
    ]
)

# Split into training and test sets (stratify by y to preserve class balance)
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, sample_weights, test_size=0.2, random_state=42, stratify=y
)

print('Training set size:', X_train.shape)
print('Test set size:', X_test.shape)


In [None]:

# ## 6. Baseline model: Logistic Regression

from sklearn.linear_model import LogisticRegression

# Create a pipeline with preprocessing and classifier
log_reg_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=500, class_weight='balanced'))
])

# Fit the model with sample weights
log_reg_model.fit(X_train, y_train, classifier__sample_weight=w_train)

# Predict on test set
y_pred_logreg = log_reg_model.predict(X_test)

# Evaluate performance
print('Logistic Regression Performance:')
print('Accuracy:', accuracy_score(y_test, y_pred_logreg))
print('Precision:', precision_score(y_test, y_pred_logreg))
print('Recall:', recall_score(y_test, y_pred_logreg))
print('F1 score:', f1_score(y_test, y_pred_logreg))
print('\nClassification report:\n', classification_report(y_test, y_pred_logreg))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred_logreg)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Pred No','Pred Yes'], yticklabels=['True No','True Yes'])
plt.title('Confusion matrix – Logistic Regression')
plt.show()


In [None]:

# ## 7. Ensemble model: Random Forest

from sklearn.ensemble import RandomForestClassifier

rf_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=42, class_weight='balanced'))
])

# Fit the model
rf_model.fit(X_train, y_train, classifier__sample_weight=w_train)

# Predict on test set
y_pred_rf = rf_model.predict(X_test)

# Evaluate performance
print('Random Forest Performance:')
print('Accuracy:', accuracy_score(y_test, y_pred_rf))
print('Precision:', precision_score(y_test, y_pred_rf))
print('Recall:', recall_score(y_test, y_pred_rf))
print('F1 score:', f1_score(y_test, y_pred_rf))
print('\nClassification report:\n', classification_report(y_test, y_pred_rf))

# Feature importance (after encoding).  We extract the names from the one‑hot encoder
# and pair them with importances from the random forest classifier.
one_hot_cols = list(rf_model.named_steps['preprocessor'].transformers_[0][1].get_feature_names_out(categorical_cols))
feature_names = one_hot_cols + numerical_cols
importances = rf_model.named_steps['classifier'].feature_importances_

# Create a dataframe of feature importances
feature_importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values(by='importance', ascending=False)

# Display top 15 important features
print('Top 15 important features in the Random Forest:')
print(feature_importance_df.head(15))  # a bare expression mid-cell would not be displayed

# Plot feature importances
plt.figure(figsize=(8,5))
sns.barplot(data=feature_importance_df.head(15), x='importance', y='feature', palette='mako')
plt.title('Top 15 Feature Importances – Random Forest')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()


In [None]:

# ## 8. Gradient boosting model: XGBoost

# We attempt to import xgboost.  If it is not available, we fall back to GradientBoostingClassifier from scikit‑learn.

try:
    from xgboost import XGBClassifier
    use_xgboost = True
except ImportError:
    from sklearn.ensemble import GradientBoostingClassifier
    use_xgboost = False

if use_xgboost:
    # XGBoost classifier with a fixed number of trees (no early stopping is configured here)
    xgb_model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', XGBClassifier(
            n_estimators=300,
            learning_rate=0.05,
            max_depth=4,
            subsample=0.8,
            colsample_bytree=0.8,
            eval_metric='logloss',
            random_state=42
        ))
    ])
    
    # Fit the model with sample weights
    xgb_model.fit(X_train, y_train, classifier__sample_weight=w_train)
    
    # Predict on the test set
    y_pred_xgb = xgb_model.predict(X_test)
    
    # Evaluate
    print('XGBoost Performance:')
    print('Accuracy:', accuracy_score(y_test, y_pred_xgb))
    print('Precision:', precision_score(y_test, y_pred_xgb))
    print('Recall:', recall_score(y_test, y_pred_xgb))
    print('F1 score:', f1_score(y_test, y_pred_xgb))
    print('\nClassification report:\n', classification_report(y_test, y_pred_xgb))

    # Feature importances for XGBoost
    importances_xgb = xgb_model.named_steps['classifier'].feature_importances_
    feature_importance_xgb = pd.DataFrame({
        'feature': feature_names,
        'importance': importances_xgb
    }).sort_values(by='importance', ascending=False)

    print('Top 15 important features in XGBoost:')
    print(feature_importance_xgb.head(15))  # a bare expression inside a block would not be displayed

    plt.figure(figsize=(8,5))
    sns.barplot(data=feature_importance_xgb.head(15), x='importance', y='feature', palette='flare')
    plt.title('Top 15 Feature Importances – XGBoost')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.show()
else:
    # Gradient Boosting as a fall‑back
    gbt_model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', GradientBoostingClassifier())
    ])
    
    gbt_model.fit(X_train, y_train, classifier__sample_weight=w_train)
    
    y_pred_gbt = gbt_model.predict(X_test)
    
    print('Gradient Boosting Performance:')
    print('Accuracy:', accuracy_score(y_test, y_pred_gbt))
    print('Precision:', precision_score(y_test, y_pred_gbt))
    print('Recall:', recall_score(y_test, y_pred_gbt))
    print('F1 score:', f1_score(y_test, y_pred_gbt))
    print('\nClassification report:\n', classification_report(y_test, y_pred_gbt))

    # Feature importances for Gradient Boosting
    importances_gbt = gbt_model.named_steps['classifier'].feature_importances_
    feature_importance_gbt = pd.DataFrame({
        'feature': feature_names,
        'importance': importances_gbt
    }).sort_values(by='importance', ascending=False)

    print('Top 15 important features in Gradient Boosting:')
    print(feature_importance_gbt.head(15))  # a bare expression inside a block would not be displayed

    plt.figure(figsize=(8,5))
    sns.barplot(data=feature_importance_gbt.head(15), x='importance', y='feature', palette='rocket')
    plt.title('Top 15 Feature Importances – Gradient Boosting')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.show()



## 9. Discussion and Conclusion

In this notebook we developed machine‑learning models to predict prior breast cancer using non‑invasive demographic and lifestyle factors.  After mapping coded variables to meaningful categories and weighting observations by the **count** column, we performed exploratory analysis to understand the distribution of risk factors.  We then trained and evaluated three types of classifiers:

- **Logistic Regression** served as a baseline model.  It performed reasonably, showing a balance between precision and recall.  However, its linear decision boundary may not capture complex interactions between risk factors.

- **Random Forest** improved performance by capturing non‑linear relationships and interactions among variables.  Feature‑importance analysis indicated that age group, BMI, breast density and family history were among the most influential predictors.

- **XGBoost/Gradient Boosting** (depending on library availability) generally achieved the highest F1 score, benefiting from boosting and regularisation.  The top predictors were similar to those identified by the random forest.
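
The importances computed in the notebook are per one‑hot column (for example `bmi_group_1`), whereas the statements above are about whole variables. One common heuristic, sketched here under that assumption, is to sum the dummy‑level importances back to their source variable. The importance values below are invented, and `source_column`, `original_cols` and `by_variable` are hypothetical names.

```python
import pandas as pd

# Invented per-dummy importances, named the way OneHotEncoder names its
# output columns ("<column>_<category>"); 'year' is passed through as-is.
feature_importance_df = pd.DataFrame({
    "feature": ["age_group_5_years_6", "age_group_5_years_7", "bmi_group_1",
                "bmi_group_9", "first_degree_hx_1", "year"],
    "importance": [0.30, 0.20, 0.15, 0.05, 0.20, 0.10],
})
original_cols = ["age_group_5_years", "bmi_group", "first_degree_hx", "year"]


def source_column(encoded_name, candidates):
    """Return the original column an encoded feature name was derived from."""
    matches = [c for c in candidates
               if encoded_name == c or encoded_name.startswith(c + "_")]
    # Prefer the longest match in case one column name is a prefix of another.
    return max(matches, key=len)


feature_importance_df["variable"] = feature_importance_df["feature"].map(
    lambda name: source_column(name, original_cols)
)

# Total importance per original variable, largest first.
by_variable = (
    feature_importance_df.groupby("variable")["importance"]
    .sum()
    .sort_values(ascending=False)
)
print(by_variable)
```

Summed importances favour variables with many categories, so a ranking obtained this way is a rough guide rather than a measure of effect size.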

These findings suggest that ensemble methods can better model the multifactorial nature of breast cancer risk compared with linear approaches.  Importantly, we used sample weights derived from the frequency counts to ensure that each combination of covariates was represented proportionately.  Further work could explore calibration of predicted probabilities, external validation on independent cohorts, and integration of additional variables such as genetic markers.  Overall, our results support the feasibility of **non‑invasive risk stratification** using machine‑learning techniques, providing a basis for more targeted screening strategies in public health settings.
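
The models above are fitted with sample weights, but the reported metrics count each test row once. As a sketch of how the evaluation could also reflect the number of women per row, scikit‑learn's metric functions accept a `sample_weight` argument. The labels, predictions and counts below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Invented labels, predictions and counts for four aggregated rows.
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 0])
counts = np.array([100, 10, 30, 5])

# Row-level metrics: every risk-factor profile counts once.
row_precision = precision_score(y_true, y_pred)
row_recall = recall_score(y_true, y_pred)

# Woman-level metrics: every profile counts as many times as it has women.
weighted_precision = precision_score(y_true, y_pred, sample_weight=counts)
weighted_recall = recall_score(y_true, y_pred, sample_weight=counts)

print(row_precision, row_recall)
print(weighted_precision, weighted_recall)
```

In the notebook, the same idea would amount to passing `sample_weight=w_test` to the metric calls; whether row‑level or woman‑level figures are more appropriate depends on the question being asked.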
