## Introduction
**This analysis reframes logistic regression from a pure prediction task to a decision support problem. The objective is not to maximize accuracy, but to minimize decision harm by accounting for asymmetric error costs when flagging patients for further diagnostic evaluation.**


## STEP 1️⃣ Load the Data  
*Clean, Explicit, Reproducible*


In [28]:
import pandas as pd
import numpy as np

# Adjust filename if needed
file_path = "~/Downloads/BCdata.csv"

df = pd.read_csv(file_path)

# Basic sanity check
df.head()


Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


## STEP 2️⃣ Inspect Data Quality  
*Is this data fit for decision modeling?*

In [29]:
df.shape


(569, 33)

In [30]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

In [31]:
df.isnull().sum().sort_values(ascending=False)

Unnamed: 32                569
compactness_se               0
fractal_dimension_worst      0
symmetry_worst               0
concave points_worst         0
concavity_worst              0
compactness_worst            0
smoothness_worst             0
area_worst                   0
perimeter_worst              0
texture_worst                0
radius_worst                 0
fractal_dimension_se         0
symmetry_se                  0
concave points_se            0
concavity_se                 0
id                           0
diagnosis                    0
area_se                      0
perimeter_se                 0
texture_se                   0
radius_se                    0
fractal_dimension_mean       0
symmetry_mean                0
concave points_mean          0
concavity_mean               0
compactness_mean             0
smoothness_mean              0
area_mean                    0
perimeter_mean               0
texture_mean                 0
radius_mean                  0
smoothne

In [32]:
df = df.drop(columns=["Unnamed: 32"], errors="ignore")

## STEP 3️⃣ Define the Decision Target  
*What decision does the model support?*



In [33]:
df["diagnosis"].value_counts()


diagnosis
B    357
M    212
Name: count, dtype: int64

In [34]:
# Decision framing:
# 1 = Malignant (high-risk, action required)
# 0 = Benign (low-risk)

df["target"] = (df["diagnosis"] == "M").astype(int)

df["target"].value_counts(normalize=True)


target
0    0.627417
1    0.372583
Name: proportion, dtype: float64

## STEP 4️⃣ Feature Matrix  
*Minimal preprocessing by design*



In [35]:
# Remove identifier and target-related columns
df_model = df.drop(columns=["id", "diagnosis"], errors="ignore")

# Drop rows with missing values (conservative, transparent choice)
df_model = df_model.dropna()

df_model.shape



(569, 31)

In [36]:
X = df_model.drop(columns=["target"])
y = df_model["target"]

X.shape, y.shape


((569, 30), (569,))

Preprocessing note:
Row-wise exclusion (`dropna`) is applied as a safeguard, but it removes nothing here: once the empty `Unnamed: 32` column is dropped, no missing values remain and all 569 rows are kept. No imputation assumptions are introduced, which preserves interpretability and keeps the focus on decision modeling.

## STEP 5️⃣ Train–Test Split  
*Preserving class balance*



In [37]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    stratify=y,
    random_state=42
)

y_train.mean(), y_test.mean()



(0.3732394366197183, 0.3706293706293706)
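
As a quick arithmetic check on the stratified split, the two proportions above can be reproduced from counts reported elsewhere in this notebook: 569 rows with 212 malignant cases overall, and a test set of 143 rows containing 53 malignant cases (the row sums of the test-set confusion matrices in later steps). The variable names below are illustrative only.

```python
# Counts taken from outputs in this notebook
n_total, n_malignant = 569, 212      # df.shape and diagnosis value counts
n_test, n_test_malignant = 143, 53   # row sums of the test-set confusion matrix

# Whatever is not in the test set is in the training set
n_train = n_total - n_test
n_train_malignant = n_malignant - n_test_malignant

train_rate = n_train_malignant / n_train
test_rate = n_test_malignant / n_test
overall_rate = n_malignant / n_total

print(f"train={train_rate:.4f}  test={test_rate:.4f}  overall={overall_rate:.4f}")
```

Both splits stay within a fraction of a percentage point of the overall malignant rate, which is what `stratify=y` is meant to achieve.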

## STEP 6️⃣ Baseline Logistic Regression  
*Prediction without decision-awareness*



In [38]:
from sklearn.linear_model import LogisticRegression

baseline_model = LogisticRegression(
    max_iter=1000,
    solver="liblinear"
)

baseline_model.fit(X_train, y_train)



In [39]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred_default = baseline_model.predict(X_test)

accuracy_score(y_test, y_pred_default), confusion_matrix(y_test, y_pred_default)


(0.9440559440559441,
 array([[89,  1],
        [ 7, 46]], dtype=int64))
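
Accuracy alone hides where the errors fall. The sketch below derives sensitivity and specificity directly from the confusion matrix printed above (rows are true classes, columns are predicted classes, following scikit-learn's convention). The variable names are illustrative.

```python
# Baseline confusion matrix at the default 0.5 threshold, copied from the output above
tn, fp = 89, 1   # true benign: correctly cleared / wrongly flagged
fn, tp = 7, 46   # true malignant: missed / correctly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # share of malignant cases that were flagged
specificity = tn / (tn + fp)   # share of benign cases that were cleared

print(f"accuracy={accuracy:.3f}  sensitivity={sensitivity:.3f}  specificity={specificity:.3f}")
```

With these counts, 7 of the 53 malignant cases in the test set are missed even though accuracy is about 94%.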

## STEP 7️⃣ Introduce Decision Costs  
*Asymmetric error consequences*



In [40]:
COST_FN = 10   # Missing a malignant case
COST_FP = 1    # Unnecessary follow-up
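
If the predicted probabilities were perfectly calibrated (an assumption, not something verified in this notebook), the cost ratio alone would imply a decision threshold. Flagging a patient with malignancy probability p has expected cost (1 - p) * COST_FP, while not flagging has expected cost p * COST_FN, so flagging is cheaper whenever p >= COST_FP / (COST_FP + COST_FN). The helper `cost_implied_threshold` is a hypothetical name used for illustration.

```python
def cost_implied_threshold(cost_fn: float, cost_fp: float) -> float:
    """Probability above which flagging has lower expected cost than not flagging.

    Assumes calibrated probabilities and zero cost for correct decisions.
    """
    if cost_fn <= 0 or cost_fp <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


COST_FN = 10   # Missing a malignant case
COST_FP = 1    # Unnecessary follow-up

theoretical_threshold = cost_implied_threshold(COST_FN, COST_FP)
print(round(theoretical_threshold, 4))  # 1/11, roughly 0.0909
```

This gives a reference point only. The empirical sweep in Step 8 is a data-driven alternative that does not depend on the calibration assumption.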


## STEP 8️⃣ Threshold Optimization  
*Choosing actions, not probabilities*



In [41]:
y_proba = baseline_model.predict_proba(X_test)[:, 1]

thresholds = np.linspace(0, 1, 101)
results = []

for t in thresholds:
    y_pred = (y_proba >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    total_cost = COST_FN * fn + COST_FP * fp
    
    results.append({
        "threshold": t,
        "false_positives": fp,
        "false_negatives": fn,
        "total_cost": total_cost
    })

results_df = pd.DataFrame(results)
results_df.sort_values("total_cost").head()


Unnamed: 0,threshold,false_positives,false_negatives,total_cost
11,0.11,6,0,6
12,0.12,6,0,6
10,0.1,6,0,6
9,0.09,6,0,6
8,0.08,7,0,7


In [42]:
optimal_threshold = results_df.loc[
    results_df["total_cost"].idxmin(), "threshold"
]

optimal_threshold


0.09
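
`idxmin` returns the first row that attains the minimum, so 0.09 is reported even though the table above shows several thresholds tied at a total cost of 6. The sketch below uses only the five rows printed above to list the full set of tied thresholds. The names are illustrative.

```python
# (threshold, total_cost) pairs copied from the five rows shown above
rows = [(0.08, 7), (0.09, 6), (0.10, 6), (0.11, 6), (0.12, 6)]

min_cost = min(cost for _, cost in rows)
tied = sorted(t for t, cost in rows if cost == min_cost)

print(f"min cost {min_cost} for thresholds {tied[0]:.2f} to {tied[-1]:.2f}")
```

Every threshold in that range gives the same test-set cost, so the choice among them is a judgment call rather than something this data settles.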

## STEP 9️⃣ Decision-Aware Evaluation  
*Why accuracy is insufficient*



In [43]:
y_pred_opt = (y_proba >= optimal_threshold).astype(int)
confusion_matrix(y_test, y_pred_opt)


array([[84,  6],
       [ 0, 53]], dtype=int64)
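
To put the two operating points on the same scale, the sketch below applies the cost definition from Step 7 to both confusion matrices reported in this notebook. `total_cost` is an illustrative helper, not part of scikit-learn.

```python
COST_FN = 10   # Missing a malignant case
COST_FP = 1    # Unnecessary follow-up


def total_cost(matrix, cost_fn=COST_FN, cost_fp=COST_FP):
    """Total decision cost for a 2x2 matrix laid out as [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = matrix
    return cost_fn * fn + cost_fp * fp


default_matrix = [[89, 1], [7, 46]]   # threshold 0.5
optimal_matrix = [[84, 6], [0, 53]]   # threshold 0.09

for name, m in [("default", default_matrix), ("cost-optimal", optimal_matrix)]:
    n = sum(map(sum, m))
    correct = m[0][0] + m[1][1]
    print(f"{name:>12}: cost={total_cost(m):>3}  accuracy={correct / n:.3f}")
```

On this test set the total cost falls from 71 to 6. Accuracy happens to rise as well (137 of 143 correct), although nothing guarantees that in general.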

Key insight:
The cost-minimizing threshold (0.09) is far below the default of 0.5.
Minimizing total decision cost on the test set removes all 7 false negatives at the price of 5 additional false positives (6 instead of 1), an appropriate trade-off given the clinical context.
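
One point worth noting about the procedure: the threshold was tuned and evaluated on the same test set, which tends to make the reported cost optimistic. Below is a minimal sketch of one alternative, choosing the threshold on a separate validation split and scoring it once on untouched data. The data is synthetic, and `decision_cost` and `pick_threshold` are hypothetical helpers, not functions from this notebook or from scikit-learn.

```python
import numpy as np

COST_FN, COST_FP = 10, 1


def decision_cost(y_true, y_proba, threshold):
    """Total cost of flagging every case with probability >= threshold."""
    y_pred = y_proba >= threshold
    fn = np.sum((y_true == 1) & ~y_pred)
    fp = np.sum((y_true == 0) & y_pred)
    return int(COST_FN * fn + COST_FP * fp)


def pick_threshold(y_true, y_proba, grid=None):
    """Return the grid threshold with the lowest cost (first one on ties)."""
    if grid is None:
        grid = np.linspace(0, 1, 101)
    costs = [decision_cost(y_true, y_proba, t) for t in grid]
    return float(grid[int(np.argmin(costs))])


# Synthetic stand-in for model scores: positives tend to score higher
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=600)
proba = np.clip(0.35 * y + rng.normal(0.3, 0.15, size=600), 0, 1)

# First 400 rows play the role of validation data, the rest of evaluation data
y_val, p_val = y[:400], proba[:400]
y_eval, p_eval = y[400:], proba[400:]

threshold = pick_threshold(y_val, p_val)               # tuned on validation only
eval_cost = decision_cost(y_eval, p_eval, threshold)   # reported on untouched data
print(threshold, eval_cost)
```

With only 569 rows in the real dataset, cross-validated probabilities on the training set might be a more data-efficient way to get the same separation.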

## STEP 🔟 Key Takeaways  
*What changes when ML supports decisions?*

- Prediction alone does not create value; decisions do

- Accuracy obscures asymmetric error consequences

- Threshold selection often matters more than model choice

- Simple, interpretable models can be highly effective when aligned with decision objectives