# **Breast Cancer Wisconsin (Diagnostic)**
Donated on 10/31/1995

**Import the libraries used for data handling, preprocessing, model training, and evaluation.**

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, precision_score,
    recall_score, f1_score, confusion_matrix
)


In [2]:
df = pd.read_csv("wdbc.data", header=None)
df.head()


|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| - | - | - | - | - | - | - | - | - | - | - | --- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


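The ID column (column 0) is dropped because it carries no predictive value. The diagnosis (column 1) is encoded as 1 for malignant (M) and 0 for benign (B), so malignant is the positive class in every metric below.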

In [3]:
# Drop ID column
df.drop(columns=[0], inplace=True)

# Encode target labels
df[1] = df[1].map({"M": 1, "B": 0})

X = df.drop(columns=[1])
y = df[1]


A stratified 80/20 split preserves the malignant–benign class ratio in both the training and test sets.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
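A quick way to confirm that stratification worked is to compare the malignant fraction in each subset. The sketch below is self-contained and makes one assumption: it loads the copy of this dataset bundled with scikit-learn (`load_breast_cancer`) instead of reading `wdbc.data`. That copy encodes malignant as 0, so the labels are flipped to match the encoding used in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: the bundled scikit-learn copy stands in for wdbc.data.
data = load_breast_cancer()
X = data.data
y = 1 - data.target  # flip labels so that 1 = malignant, 0 = benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With 0/1 labels, the mean is the fraction of malignant samples.
overall_rate = y.mean()
train_rate = y_train.mean()
test_rate = y_test.mean()

print(f"Malignant fraction overall: {overall_rate:.3f}")
print(f"Malignant fraction train:   {train_rate:.3f}")
print(f"Malignant fraction test:    {test_rate:.3f}")
```

The three fractions should agree closely. Without `stratify=y`, they can drift apart by chance, which matters on a dataset of only a few hundred samples.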


# **Logistic Regression (Baseline Model)**

In [5]:
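# Fit the scaler on the training data only, then reuse its statistics
# on the test data so no test information leaks into preprocessing.
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)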

log_model = LogisticRegression(max_iter=500)
log_model.fit(X_train_scaled, y_train)

y_train_pred_log = log_model.predict(X_train_scaled)
y_test_pred_log = log_model.predict(X_test_scaled)
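The fit-on-train, transform-on-test pattern above is easy to get wrong once cross-validation is involved, because the scaler must be refit inside every fold. One common way to handle this is to wrap both steps in a scikit-learn pipeline. The sketch below is not part of the original analysis: it assumes the bundled `load_breast_cancer` copy of the data, and the choice of 5 folds and recall as the scoring metric is illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled scikit-learn copy stands in for wdbc.data.
data = load_breast_cancer()
X = data.data
y = 1 - data.target  # flip labels so that 1 = malignant

# The pipeline refits the scaler on the training part of each fold,
# so the held-out part never influences the scaling statistics.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="recall")

print("Recall per fold:", scores.round(3))
print("Mean recall:", scores.mean().round(3))
```

Averaging over folds also gives a steadier estimate than a single 114-sample test set, where one misclassified case shifts recall by more than two percentage points.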


## **Evaluating Logistic Regression**

In [6]:
print("Logistic Regression Results")

train_error_log = 1 - accuracy_score(y_train, y_train_pred_log)
test_error_log = 1 - accuracy_score(y_test, y_test_pred_log)

print("Train Error:", train_error_log)
print("Test Error:", test_error_log)

print("Accuracy:", accuracy_score(y_test, y_test_pred_log))
print("Precision:", precision_score(y_test, y_test_pred_log))
print("Recall:", recall_score(y_test, y_test_pred_log))
print("F1-score:", f1_score(y_test, y_test_pred_log))

print("Confusion Matrix:")
confusion_matrix(y_test, y_test_pred_log)


Logistic Regression Results
Train Error: 0.01318681318681314
Test Error: 0.03508771929824561
Accuracy: 0.9649122807017544
Precision: 0.975
Recall: 0.9285714285714286
F1-score: 0.9512195121951219
Confusion Matrix:


array([[71,  1],
       [ 3, 39]])
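The reported metrics can be recomputed by hand from this confusion matrix, which makes it clear what each one measures. scikit-learn lays the matrix out with true classes as rows and predicted classes as columns, so with labels 0 (benign) and 1 (malignant) the cells read `[[TN, FP], [FN, TP]]`. A short check:

```python
# Cells of the Logistic Regression confusion matrix shown above.
tn, fp = 71, 1   # true benign:    71 correct, 1 flagged as malignant
fn, tp = 3, 39   # true malignant: 3 missed,   39 detected

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of predicted malignant, how many are malignant
recall = tp / (tp + fn)      # of actual malignant, how many were detected
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # 0.9649
print(f"Precision: {precision:.4f}")  # 0.9750
print(f"Recall:    {recall:.4f}")     # 0.9286
print(f"F1-score:  {f1:.4f}")         # 0.9512
```

The three false negatives are what pull recall below precision: the model rarely raises a false alarm, but it misses 3 of the 42 malignant cases in the test set.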

# **Decision Tree (Non-Linear Model)**

In [7]:
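# Trees split on per-feature thresholds, so the unscaled features are used.
# No depth limit is set, so the tree can grow until every leaf is pure.
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)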

y_train_pred_tree = tree_model.predict(X_train)
y_test_pred_tree = tree_model.predict(X_test)


## **Evaluating Decision Tree**

In [8]:
print("Decision Tree Results")

train_error_tree = 1 - accuracy_score(y_train, y_train_pred_tree)
test_error_tree = 1 - accuracy_score(y_test, y_test_pred_tree)

print("Train Error:", train_error_tree)
print("Test Error:", test_error_tree)

print("Accuracy:", accuracy_score(y_test, y_test_pred_tree))
print("Precision:", precision_score(y_test, y_test_pred_tree))
print("Recall:", recall_score(y_test, y_test_pred_tree))
print("F1-score:", f1_score(y_test, y_test_pred_tree))

print("Confusion Matrix:")
confusion_matrix(y_test, y_test_pred_tree)


Decision Tree Results
Train Error: 0.0
Test Error: 0.07017543859649122
Accuracy: 0.9298245614035088
Precision: 0.9047619047619048
Recall: 0.9047619047619048
F1-score: 0.9047619047619048
Confusion Matrix:


array([[68,  4],
       [ 4, 38]])

**Conclusion: the Decision Tree has zero training error but a test error of about 7%, which indicates overfitting.**

| Model               | Train Error | Test Error | Behavior            |
| ------------------- | ----------- | ---------- | ------------------- |
| Logistic Regression | 0.013       | 0.035      | Good generalization |
| Decision Tree       | 0.000       | 0.070      | Overfitting         |


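Logistic Regression exhibits a small generalization gap (about 1.3% training error versus 3.5% test error), indicating low variance and stable performance.
The Decision Tree fits the training data perfectly but makes twice as many test errors as Logistic Regression (8 versus 4 out of 114), so it generalizes worse to unseen samples.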
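A standard way to reduce this kind of overfitting is to limit how far the tree can grow. The sketch below compares an unconstrained tree with one capped at `max_depth=3`. It is illustrative only: the depth of 3 is an arbitrary choice rather than a tuned value, and the data comes from the bundled `load_breast_cancer` copy rather than `wdbc.data`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled scikit-learn copy stands in for wdbc.data.
data = load_breast_cancer()
X = data.data
y = 1 - data.target  # flip labels so that 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

results = {}
for depth in (None, 3):  # None = grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    results[depth] = {
        "train_error": 1 - tree.score(X_train, y_train),
        "test_error": 1 - tree.score(X_test, y_test),
        "actual_depth": tree.get_depth(),
    }

for depth, r in results.items():
    print(
        f"max_depth={depth}: depth={r['actual_depth']}, "
        f"train error={r['train_error']:.3f}, test error={r['test_error']:.3f}"
    )
```

The capped tree gives up some training accuracy. Whether that buys a lower test error on this split has to be read off the output, and a fair choice of depth would be made with cross-validation on the training set rather than by looking at the test set.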

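# **ML Issues**

**Feature Scaling:** essential for Logistic Regression, whose regularized solver is sensitive to feature magnitudes. The Decision Tree is trained on unscaled features because threshold splits are unaffected by scaling.

**Data Leakage:** avoided by splitting first and fitting the scaler on the training set only. The test set is transformed with the training-set statistics.

**Class Imbalance:** the classes are mildly imbalanced (42 malignant versus 72 benign samples in the test set). Recall on the malignant class is the priority metric, because a false negative is a missed cancer.

**Feature Correlation:** many tumor features are highly correlated (for example radius, perimeter, and area), which makes individual Logistic Regression coefficients harder to interpret.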
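The feature-correlation point can be checked directly by listing feature pairs whose absolute Pearson correlation exceeds a threshold. The sketch below is a rough diagnostic, not part of the original analysis: the 0.9 cutoff is an arbitrary choice, and the data and feature names come from the bundled `load_breast_cancer` copy of the dataset.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: the bundled scikit-learn copy stands in for wdbc.data.
data = load_breast_cancer()
X = data.data
names = [str(n) for n in data.feature_names]

# Absolute Pearson correlation between every pair of feature columns.
corr = np.abs(np.corrcoef(X, rowvar=False))

# Visit each unordered pair once via the upper triangle (diagonal excluded).
rows, cols = np.triu_indices(len(names), k=1)
THRESHOLD = 0.9  # illustrative cutoff
high_pairs = sorted(
    (
        (names[i], names[j], float(corr[i, j]))
        for i, j in zip(rows, cols)
        if corr[i, j] > THRESHOLD
    ),
    key=lambda pair: pair[2],
    reverse=True,
)

print(f"{len(high_pairs)} of {len(rows)} feature pairs have |r| > {THRESHOLD}")
for a, b, r in high_pairs[:5]:
    print(f"{a} ~ {b}: {r:.3f}")
```

Size-related measurements such as radius, perimeter, and area appear near the top of this list, which is expected because they are geometrically tied to one another.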


