# Data Preparation

In [1]:
# Load Data
import pandas as pd

df = pd.read_csv("risk_factors_cervical_cancer.csv")

print(df.shape)
print(df.dtypes)
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'risk_factors_cervical_cancer.csv'

In [None]:
# Standardize missing values

import numpy as np

# Replace '?' with NaN
df = df.replace("?", np.nan)

# Convert to numeric where possible
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Missingness summary
missing_pct = (df.isna().mean() * 100).sort_values(ascending=False)
missing_table = pd.DataFrame({
    "feature": missing_pct.index,
    "missing_percent": missing_pct.values.round(1)
})

missing_table.head(15)


In [None]:
# Target distribution

import matplotlib.pyplot as plt

target = "Biopsy"

counts = df[target].value_counts()
perc = (counts / len(df) * 100).round(2)

biopsy_table = pd.DataFrame({
    "Count": counts,
    "Percentage (%)": perc
})

print(biopsy_table)

counts.plot(kind="bar")
plt.title("Biopsy outcome counts")
plt.xlabel("Biopsy")
plt.ylabel("Count")
plt.show()


In [None]:
# Descriptive statistics for key numeric variables

key_numeric = [
    "Age",
    "Number of sexual partners",
    "First sexual intercourse",
    "Num of pregnancies",
    "Smokes (years)",
    "Smokes (packs/year)",
    "Hormonal Contraceptives (years)",
    "IUD (years)",
    "STDs (number)",
    "STDs: Number of diagnosis"
]

key_numeric = [c for c in key_numeric if c in df.columns]

desc = df[key_numeric].describe().T
desc["missing"] = df[key_numeric].isna().sum()
desc["missing_pct"] = (desc["missing"] / len(df) * 100).round(1)

desc = desc[["count","missing","missing_pct","mean","std","min","25%","50%","75%","max"]].round(2)
desc


In [None]:
# Age Distribution Histogram

import matplotlib.pyplot as plt

plt.figure(figsize=(7,4))
plt.hist(df["Age"].dropna(), bins=20)
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Distribution of Age")
plt.tight_layout()
plt.show()


In [None]:
# Outlier Detection Boxplot

vars_box = [
    "Number of sexual partners",
    "Num of pregnancies",
    "Smokes (packs/year)",
    "Hormonal Contraceptives (years)"
]

plt.figure(figsize=(9,4))
plt.boxplot([df[v].dropna() for v in vars_box], tick_labels=vars_box, vert=False)
plt.title("Boxplots for selected numeric variables")
plt.tight_layout()
plt.show()


# Model Evaluation

In [None]:
# Drop features:

# Columns that represent diagnostic results or post-screening outcomes
leakage_cols = [
    "Hinselmann",
    "Schiller",
    "Citology",
    "Dx",
    "Dx:Cancer",
    "Dx:CIN",
    "Dx:HPV"
]

# Drop leakage features
df_prep = df.drop(columns=leakage_cols)
print("Dropped leakage features:")
print(leakage_cols)
print("")

# Drop features with more than 90% missing data
missing_threshold = 0.90
high_missing_cols = df_prep.columns[df_prep.isna().mean() > missing_threshold]

print("Dropped due to high missingness:")
print(high_missing_cols.tolist())
print("")

df_prep = df_prep.drop(columns=high_missing_cols)
print("Remaining Shape:", df_prep.shape)



Variables representing diagnostic outcomes or screening test results (including Hinselmann, Schiller, Citology, and cancer diagnosis indicators) were excluded prior to modeling. These variables are strongly correlated with the biopsy outcome but reflect downstream clinical decisions rather than true risk factors. Including them would introduce information leakage and artificially inflate model performance. Removing these features ensures that the model learns from upstream risk characteristics rather than proxy diagnostic signals.

Features with more than 90% missing values were removed from the dataset. Variables related to the timing of STD diagnoses fell into this category, indicating that they were rarely recorded. Retaining such features would significantly reduce usable sample size or require speculative imputation. Excluding these variables improves data reliability and model stability.

In [None]:
# Prepare for imputation and scaling by identifying feature types

# Separate features and target
X = df_prep.drop(columns=[target])
y = df_prep[target]

# Identify numeric and binary features
binary_features = [c for c in X.columns if X[c].dropna().isin([0,1]).all()]
numeric_features = [c for c in X.columns if c not in binary_features]

print("Binary features:", binary_features)
print("Numeric features:", numeric_features)


Predictor variables were grouped into binary and continuous numeric features. Binary indicators represent the presence or absence of behaviors or conditions, while numeric features capture intensity or duration of exposure. This separation allows for appropriate preprocessing strategies, including median imputation and scaling for numeric variables while preserving the interpretability of binary indicators.

In [None]:
# Impute missing values

from sklearn.impute import SimpleImputer

# Impute numeric features with median
num_imputer = SimpleImputer(strategy="median")
X[numeric_features] = num_imputer.fit_transform(X[numeric_features])

# Impute binary features with most frequent value
bin_imputer = SimpleImputer(strategy="most_frequent")
X[binary_features] = bin_imputer.fit_transform(X[binary_features])


# Check remaining missing values
print("Remaining missing values:", X.isna().sum().sum())


To preserve sample size, missing values were imputed rather than dropping incomplete observations. Median imputation was applied to numeric variables because the median is less sensitive than the mean to skewed distributions and outliers. Binary variables were imputed using the most frequent value, which keeps them strictly 0/1. After imputation, no missing values remained in the feature set.
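
To make the rationale for the median concrete, the following minimal sketch compares the fill value that mean and median imputation would produce for a right-skewed variable with missing entries. The values and the variable name `packs_per_year` are made up for illustration and are not taken from the dataset.

```python
import statistics

# Hypothetical right-skewed values with missing entries (None); not real data
packs_per_year = [0.0, 0.0, 0.5, 1.0, None, 2.0, None, 37.0]

observed = [v for v in packs_per_year if v is not None]

mean_fill = statistics.mean(observed)      # pulled upward by the single large value
median_fill = statistics.median(observed)  # stays near the bulk of the data

# Fill the gaps with the median, leaving observed values untouched
imputed = [median_fill if v is None else v for v in packs_per_year]

print("mean fill:", round(mean_fill, 2))
print("median fill:", median_fill)
print("imputed:", imputed)
```

With a single extreme value in the sample, the mean-based fill lands far from most observations, while the median-based fill does not.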

In [None]:
# Feature scaling

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X[numeric_features] = scaler.fit_transform(X[numeric_features])


Continuous numeric variables were standardized using z-score normalization to ensure that features with larger numeric ranges did not disproportionately influence model training. Binary variables were not scaled to preserve their interpretability. Scaling was performed after imputation to ensure valid transformations.
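
As a rough illustration of the transformation, the sketch below standardizes a small list of made-up ages with the z-score formula `z = (x - mean) / std`. It uses the population standard deviation, which is what `StandardScaler` uses by default; the values themselves are hypothetical.

```python
import statistics

ages = [18, 25, 30, 35, 52]  # hypothetical values, not from the dataset

mu = statistics.mean(ages)
sigma = statistics.pstdev(ages)  # population std (ddof=0), as StandardScaler uses

# Each value is expressed as a number of standard deviations from the mean
z_scores = [(a - mu) / sigma for a in ages]

print("mean:", mu, "std:", round(sigma, 3))
print("z-scores:", [round(z, 3) for z in z_scores])
```

After the transformation the values have mean 0 and unit standard deviation, regardless of the original units.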

In [None]:
# Train-test split with stratification

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y
)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)


**Train–Test Split**

The dataset was split into training (75%) and testing (25%) subsets using stratified sampling to preserve the class distribution of the Biopsy outcome. Stratification is particularly important given the strong class imbalance and ensures that both sets contain representative positive and negative cases.
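
Note that in the cells above the imputers and the scaler are fitted on the full feature matrix before the split, so test-set statistics influence the preprocessing. One common way to avoid this is to split first and wrap the preprocessing in a scikit-learn `Pipeline`, so that medians, modes, and scaling parameters are learned from the training rows only. The sketch below illustrates the idea on synthetic data and is not the notebook's actual workflow; the column names, the synthetic generator, and the helper name `build_pipeline` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline(numeric_cols, binary_cols):
    """Hypothetical helper: preprocessing plus class-weighted logistic regression."""
    numeric_steps = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric_steps, numeric_cols),
        ("bin", SimpleImputer(strategy="most_frequent"), binary_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=2000, class_weight="balanced")),
    ])


# Synthetic stand-in data (assumed column names, about 10% positives, some NaNs)
rng = np.random.default_rng(42)
n = 400
demo = pd.DataFrame({
    "age": rng.normal(30, 8, n),
    "years_exposure": rng.exponential(2.0, n),
    "flag": rng.integers(0, 2, n).astype(float),
})
demo_y = (rng.random(n) < 0.10).astype(int)
demo.loc[rng.random(n) < 0.15, "years_exposure"] = np.nan
demo.loc[rng.random(n) < 0.10, "flag"] = np.nan

# Split BEFORE any statistics are computed
Xtr, Xte, ytr, yte = train_test_split(
    demo, demo_y, test_size=0.25, random_state=42, stratify=demo_y
)

pipe = build_pipeline(["age", "years_exposure"], ["flag"])
pipe.fit(Xtr, ytr)             # imputers and scaler see training rows only
test_pred = pipe.predict(Xte)  # test rows are transformed with training statistics
print("Predictions shape:", test_pred.shape)
```

The same `pipe` object can be passed to `cross_val_score`, in which case the preprocessing is refit inside each fold.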

In [None]:
from sklearn.metrics import (
    confusion_matrix, classification_report,
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)

def evaluate_model(name, model, X_test, y_test):
    """Print accuracy, precision, recall, F1, ROC-AUC (if available), and the confusion matrix."""
    y_pred = model.predict(X_test)

    # Some models support predict_proba
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_proba)
    else:
        y_proba = None
        auc = None

    print(f"\n==== {name} ====")
    print("Accuracy:", round(accuracy_score(y_test, y_pred), 4))
    print("Precision:", round(precision_score(y_test, y_pred, zero_division=0), 4))
    print("Recall:", round(recall_score(y_test, y_pred, zero_division=0), 4))
    print("F1:", round(f1_score(y_test, y_pred, zero_division=0), 4))
    if auc is not None:
        print("ROC-AUC:", round(auc, 4))

    print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("\nClassification Report:\n", classification_report(y_test, y_pred, zero_division=0))


In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=2000, random_state=42)
lr.fit(X_train, y_train)

evaluate_model("Logistic Regression (Unweighted)", lr, X_test, y_test)


**Logistic Regression (Unweighted)**

The unweighted logistic regression model achieved an accuracy of 93.5%, but failed to identify any positive biopsy cases. Precision, recall, and F1-score for the positive class were all 0.00, indicating that the model predicted all observations as belonging to the negative class. Despite this, the model achieved a ROC-AUC of 0.62, suggesting some underlying discriminatory signal that is not reflected at the default classification threshold.

These results demonstrate that accuracy alone is misleading in the presence of class imbalance. Although the predicted probabilities carry some ranking signal (ROC-AUC of 0.62), none of them reaches the default 0.5 decision threshold, so no positive cases are predicted. This highlights the importance of explicitly addressing imbalance rather than relying on default model settings.
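
The arithmetic behind this can be sketched with the confusion-matrix counts of a classifier that predicts the negative class for every observation. The counts below are assumptions chosen to be consistent with the reported 93.5% accuracy and the 14 positive test cases mentioned later in this section; they are not copied from the notebook output.

```python
# Assumed counts: 14 positives, 201 negatives, every case predicted negative
tp, fn = 0, 14
tn, fp = 201, 0

total = tp + fn + tn + fp
accuracy = (tp + tn) / total
recall = tp / (tp + fn)
# Precision is undefined with no positive predictions; report 0.0,
# mirroring what zero_division=0 does in scikit-learn's metrics
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0

print(f"accuracy={accuracy:.3f} recall={recall:.2f} precision={precision:.2f}")
```

Under these assumed counts, accuracy simply equals the share of negative cases, so it says nothing about the model's ability to detect positives.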

In [None]:
lr_bal = LogisticRegression(max_iter=2000, class_weight="balanced", random_state=42)
lr_bal.fit(X_train, y_train)

evaluate_model("Logistic Regression (Class-Weighted)", lr_bal, X_test, y_test)


**Logistic Regression (Class-Weighted)**

Applying class weights substantially altered model behavior. Accuracy decreased to 75.4%, while recall for the positive class increased to 0.50, indicating that half of biopsy-positive cases were correctly identified. Precision remained low (0.13), resulting in an F1-score of 0.21. The ROC-AUC (0.62) remained similar to the unweighted model.

Class weighting successfully shifted the model toward identifying positive cases, confirming that imbalance was suppressing recall in the unweighted model. The tradeoff between recall and precision is expected in medical screening contexts, where sensitivity is often prioritized. These results suggest that logistic regression can capture meaningful risk patterns, but threshold selection and cost-sensitive tuning are necessary for practical use.
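
For reference, scikit-learn documents the `class_weight="balanced"` heuristic as `n_samples / (n_classes * count_of_class)`. The sketch below applies that formula to hypothetical training counts (603 negatives and 41 positives, chosen only for illustration) to show how strongly errors on the positive class are up-weighted. The helper name `balanced_class_weights` is an assumption, not a library function.

```python
def balanced_class_weights(class_counts):
    """Weights following n_samples / (n_classes * count) for each class."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# Hypothetical training-set counts, for illustration only
weights = balanced_class_weights({0: 603, 1: 41})
ratio = weights[1] / weights[0]

print({k: round(v, 3) for k, v in weights.items()})
print("positive/negative weight ratio:", round(ratio, 2))
```

The weight ratio equals the negative-to-positive count ratio, which is the same quantity passed to XGBoost as `scale_pos_weight` later in this section.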

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

recall_scores = cross_val_score(
    lr_bal, X, y, cv=cv, scoring="recall"
)

recall_scores, recall_scores.mean()


**Stratified Cross-Validation (Model Stability)**

To obtain a more reliable estimate of model performance and reduce sensitivity to a single train–test split, stratified 5-fold cross-validation was performed using the class-weighted logistic regression model. Stratification preserves the proportion of positive biopsy outcomes in each fold, which is essential given the severe class imbalance.

**Results (Recall Across Folds)**

0.36, 0.18, 0.18, 0.45, 0.36

The mean cross-validated recall was **0.31**.

These results indicate that recall varies considerably across folds (0.18 to 0.45), which is expected in a small dataset with relatively few positive cases. The mean of 0.31 is also lower than the 0.50 recall observed on the single hold-out split, suggesting that the hold-out estimate was somewhat optimistic. Even so, the model identifies some positive biopsy cases in every fold, whereas the unweighted model identified none on the hold-out set, which supports the conclusion that class weighting improves sensitivity. The variability also highlights the importance of reporting fold-based performance estimates and motivates future refinement through threshold tuning or resampling methods to stabilize minority-class detection.
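
The spread across folds can be summarized alongside the mean. The short sketch below uses the five rounded recall values reported above and Python's standard library; because the inputs are rounded, the results are approximate.

```python
import statistics

fold_recalls = [0.36, 0.18, 0.18, 0.45, 0.36]  # rounded values reported above

mean_recall = statistics.mean(fold_recalls)
std_recall = statistics.stdev(fold_recalls)  # sample standard deviation across folds

print(f"recall = {mean_recall:.2f} +/- {std_recall:.2f} "
      f"(min {min(fold_recalls):.2f}, max {max(fold_recalls):.2f})")
```

Reporting the standard deviation or the range next to the mean makes it clear how much a single split can over- or under-state recall.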

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=42, max_depth=5)
tree.fit(X_train, y_train)

evaluate_model("Decision Tree (max_depth=5)", tree, X_test, y_test)


**Decision Tree (max depth = 5)**

The decision tree achieved an accuracy of 91.2% but, like the unweighted logistic regression, failed to identify any positive biopsy cases. Recall and precision for the positive class were 0.00, and ROC-AUC dropped to 0.45, slightly below the 0.50 expected from random ranking.

Despite its ability to model non-linear relationships, the decision tree predicted the majority class almost exclusively, and the few positive predictions it made were incorrect. This suggests that a shallow tree trained without class weighting is insufficient to capture subtle risk patterns in a highly imbalanced clinical dataset, especially when positive cases are rare.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,
    random_state=42,
    class_weight="balanced_subsample"
)
rf.fit(X_train, y_train)

evaluate_model("Random Forest (Balanced)", rf, X_test, y_test)


**Random Forest (Class-Balanced)**

The class-balanced random forest achieved 93.5% accuracy and the highest ROC-AUC among tested models (0.70). However, recall for the positive class remained low (0.07), with only one positive case correctly identified. Precision for positive predictions was 0.50, meaning that only two observations were predicted positive and one of them was correct.

The random forest demonstrated stronger overall discrimination than other models, as reflected by ROC-AUC, but still struggled to identify positive cases at the default threshold. This suggests that while the model captures meaningful signal, threshold adjustment or alternative imbalance-handling strategies are required to translate probability separation into clinically useful predictions.
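
One possible form of threshold adjustment is sketched below: given predicted probabilities and true labels, pick the highest threshold whose recall meets a target. The probabilities, the labels, the target recall of 0.75, and the helper names `recall_precision` and `pick_threshold` are all hypothetical; in practice the threshold would be chosen on validation data rather than on the test set.

```python
def recall_precision(y_true, y_prob, threshold):
    """Recall and precision when predicting positive for prob >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


def pick_threshold(y_true, y_prob, target_recall):
    """Highest candidate threshold whose recall reaches target_recall, else None."""
    for t in sorted(set(y_prob), reverse=True):
        recall, _ = recall_precision(y_true, y_prob, t)
        if recall >= target_recall:
            return t
    return None


# Hypothetical validation labels and predicted probabilities
y_val = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
p_val = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.22, 0.30, 0.35, 0.40, 0.55]

chosen = pick_threshold(y_val, p_val, target_recall=0.75)
recall, precision = recall_precision(y_val, p_val, chosen)
print(f"threshold={chosen} recall={recall:.2f} precision={precision:.2f}")
```

On this toy data the default 0.5 threshold would catch only one of the four positives, whereas the lowered threshold catches three at the cost of one false positive.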

In [None]:
from xgboost import XGBClassifier

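# Compute imbalance ratio (count each class explicitly instead of
# relying on the ordering returned by value_counts)
neg = (y_train == 0).sum()
pos = (y_train == 1).sum()
scale_pos_weight = neg / pos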

xgb = XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=scale_pos_weight,
    eval_metric="logloss",
    random_state=42
)

xgb.fit(X_train, y_train)

evaluate_model("XGBoost (Class-Weighted)", xgb, X_test, y_test)


**XGBoost Classifier (Class-Weighted)**

The class-weighted XGBoost model achieved an overall accuracy of 85.6% and a ROC-AUC of 0.54. Precision for the positive Biopsy class was 0.10, recall was 0.14, and the resulting F1-score was 0.11. The confusion matrix shows that the model correctly identified 2 of the 14 positive biopsy cases, while producing more false positives than the unweighted logistic regression, decision tree, and random forest.

Although XGBoost is capable of modeling complex non-linear relationships, its minority-class detection in this setting was weaker than that of class-weighted logistic regression (recall of 0.14 versus 0.50). Its ROC-AUC of 0.54, below the 0.62 of logistic regression, indicates no additional discriminatory power over the simpler model. This outcome may reflect the small number of positive cases, high feature sparsity, and remaining noise in clinical history variables. These results indicate that model complexity alone is insufficient to overcome severe class imbalance and data limitations without further tuning or alternative imbalance-handling strategies.