# 2. CLASSIFICATION (Heart_Disease) – class weighting + selection of important features

This notebook is an improved version of the classification that accounts for:
- **class imbalance** (Heart_Disease = Yes is the rare class) → we use **weighting (sample_weight)**,
- **feature importance** (selecting the top N features) → we reduce dimensionality and runtime,
- we then rebuild **≥ 5 models** (including logistic regression) and evaluate them with **10-fold CV**.

> Goal: a classification that is more useful in a real workflow (higher sensitivity/F1) with a shorter runtime.


## 0) Settings

In [1]:
CSV_PATH = "CVD_cleaned.csv"
TARGET = "Heart_Disease"

TEST_SIZE = 0.20
RANDOM_STATE = 42
N_SPLITS = 10

TOP_N_FEATURES = 7  # choose 4–7, as suggested by a colleague

SAVE_FIGS = True
FIG_DIR = "figures/2_classification_weighted"
OUT_TABLE = "2_3_cv_tabela_klasifikacija_weighted.csv"
OUT_IMPORTANCE = "2_2_feature_importance.csv"


## 1) Data import

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv(CSV_PATH)
df.shape, df.columns.tolist()




((308854, 19),
 ['General_Health',
  'Checkup',
  'Exercise',
  'Heart_Disease',
  'Skin_Cancer',
  'Other_Cancer',
  'Depression',
  'Diabetes',
  'Arthritis',
  'Sex',
  'Age_Category',
  'Height_(cm)',
  'Weight_(kg)',
  'BMI',
  'Smoking_History',
  'Alcohol_Consumption',
  'Fruit_Consumption',
  'Green_Vegetables_Consumption',
  'FriedPotato_Consumption'])

## 2) Preprocess (numeric encoding) – following the approach in `data.py`

Instead of one-hot encoding every categorical column, we use **manual encoding**:
- `General_Health`: ordinal 1–5
- `Age_Category`: approximate age (interval midpoint)
- binary variables: Yes/No → 1/0
- `Sex`: Male/Female → 1/0
- `Diabetes`: collapsed to 0/1 (pre-diabetes/borderline → 0, diabetes only during pregnancy → 1)
- `Checkup`: Within the past year → 1, otherwise 0

Advantage: **fewer features**, faster models, and simpler feature importance.


In [3]:
def general_health_to_numeric(df):
    mapping = {"Poor": 1, "Fair": 2, "Good": 3, "Very Good": 4, "Excellent": 5}
    out = df.copy()
    out["General_Health"] = out["General_Health"].map(mapping)
    return out

def age_category_to_numeric(df):
    age_map = {
        "18-24": 21, "25-29": 27, "30-34": 32, "35-39": 37,
        "40-44": 42, "45-49": 47, "50-54": 52, "55-59": 57,
        "60-64": 62, "65-69": 67, "70-74": 72, "75-79": 77, "80+": 82
    }
    out = df.copy()
    out["Age_Category"] = out["Age_Category"].map(age_map)
    return out

def binary_columns_to_numeric(df):
    out = df.copy()

    # plain Yes/No columns
    binary_cols = [
        "Heart_Disease",
        "Smoking_History",
        "Exercise",
        "Skin_Cancer",
        "Other_Cancer",
        "Depression",
        "Arthritis",
    ]
    for col in binary_cols:
        out[col] = out[col].map({"Yes": 1, "No": 0})

    # Sex
    out["Sex"] = out["Sex"].map({"Male": 1, "Female": 0})

    # Diabetes has more than two categories → map to 0/1
    diabetes_map = {
        "No": 0,
        "Yes": 1,
        "No, pre-diabetes or borderline diabetes": 0,
        "Yes, but female told only during pregnancy": 1,
    }
    out["Diabetes"] = out["Diabetes"].map(diabetes_map)

    return out

def checkup_to_numeric(df):
    out = df.copy()
    out["Checkup"] = out["Checkup"].map({
        "Within the past year": 1,
        "Within the past 2 years": 0,
        "Within the past 5 years": 0,
        "5 or more years ago": 0,
        "Never": 0
    })
    return out

def preprocess_numeric(df):
    out = df.copy()
    out = general_health_to_numeric(out)
    out = age_category_to_numeric(out)
    out = binary_columns_to_numeric(out)
    out = checkup_to_numeric(out)
    return out

df_p = preprocess_numeric(df)

# check whether the mapping left any NaN values
na_counts = df_p.isna().sum().sort_values(ascending=False)
na_counts.head(10)


General_Health                  0
Age_Category                    0
Green_Vegetables_Consumption    0
Fruit_Consumption               0
Alcohol_Consumption             0
Smoking_History                 0
BMI                             0
Weight_(kg)                     0
Height_(cm)                     0
Sex                             0
dtype: int64

In [4]:
# If the mapping had produced NaN (unexpected categories), show them explicitly:
problem_cols = na_counts[na_counts > 0]
problem_cols


Series([], dtype: int64)

If the category encoding had introduced missing values, the mapping would need to be extended.  
For this dataset, however, we expect no missing values after encoding.
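
If missing values did appear, a quick way to find which raw categories the mapping misses is to compare the distinct values of a column against the mapping keys. The sketch below is only an illustration: `unmapped_values` and the toy column are hypothetical names, not part of the notebook or of `data.py`.

```python
import pandas as pd

def unmapped_values(series: pd.Series, mapping: dict) -> list:
    """Return the distinct non-null raw values that `mapping` does not cover."""
    raw = series.dropna().unique()
    return sorted(v for v in raw if v not in mapping)

# Hypothetical toy column containing one unexpected category
toy = pd.Series(["Yes", "No", "Unknown", "Yes"])
missing = unmapped_values(toy, {"Yes": 1, "No": 0})
print(missing)
```

Running such a check on the raw `df` before `preprocess_numeric` would list exactly the categories that need a new mapping entry.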


In [5]:
# target + feature matrix
y = df_p[TARGET].astype(int)
X = df_p.drop(columns=[TARGET])

# check that all features are numeric
X.dtypes.value_counts()


int64      11
float64     7
Name: count, dtype: int64

## 2.1 Split 80/20 (stratified) + class imbalance

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=y
)

print("Train:", X_train.shape, " Test:", X_test.shape)
print("Share of positives (Yes=1): train =", y_train.mean().round(4), ", test =", y_test.mean().round(4))


Train: (247083, 18)  Test: (61771, 18)
Share of positives (Yes=1): train = 0.0809 , test = 0.0808


## Class weighting (sample_weight)

Instead of oversampling/undersampling we use **weighting**:
- examples from the minority class receive a larger weight,
- the models are trained so that “Yes” contributes more to the loss function.

This is exactly what the professor recommended.


In [7]:
from sklearn.utils.class_weight import compute_sample_weight

# overall summary of the weights (on train)
sw_train = compute_sample_weight(class_weight="balanced", y=y_train)
pd.Series(sw_train).describe()


count    247083.000000
mean          1.000000
std           1.537561
min           0.543982
25%           0.543982
50%           0.543982
75%           0.543982
max           6.184187
dtype: float64
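
The two distinct weights in the summary follow from the "balanced" rule, which assigns each class the weight `n_samples / (n_classes * count_of_class)`. With a positive share of roughly 0.0809, that gives about 1 / (2 · 0.0809) ≈ 6.18 for "Yes" and 1 / (2 · 0.9191) ≈ 0.544 for "No". The sketch below reproduces the rule on hypothetical toy labels (`y_toy` is not notebook data) and compares it with `compute_sample_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical toy labels: 9 negatives, 1 positive
y_toy = np.array([0] * 9 + [1])

# "balanced" rule: weight_c = n_samples / (n_classes * count_c)
counts = np.bincount(y_toy)
class_w = len(y_toy) / (len(counts) * counts)

# Look up the class weight for every sample
manual = class_w[y_toy]

sklearn_w = compute_sample_weight(class_weight="balanced", y=y_toy)
print(class_w, manual.mean())
```

Because each class ends up with the same total weight, the mean sample weight is 1, which matches the `mean` row in the output above.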

## 2.2 Feature importance (2 methods) + selecting the TOP N

We use 2 fast methods:
1) **ExtraTrees** feature importance (impurity-based)  
2) **LogisticRegression (L1)** absolute coefficients (if possible; otherwise L2 as a fallback)

We then normalize both and form a “consensus” (the mean).


In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

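# ExtraTrees importance
# Note: class_weight="balanced" is multiplied with the sample_weight passed to fit(),
# so the minority class is effectively up-weighted twice in this model.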
et_imp_model = ExtraTreesClassifier(
    n_estimators=250,
    max_depth=12,
    min_samples_split=4,
    min_samples_leaf=2,
    max_features="sqrt",
    class_weight="balanced",
    n_jobs=-1,
    random_state=RANDOM_STATE
)

et_imp_model.fit(X_train, y_train, sample_weight=sw_train)
imp_et = pd.Series(et_imp_model.feature_importances_, index=X_train.columns, name="ExtraTrees_importance")

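# Logistic (L1) importance (requires scaling)
# Note: L1 requires the saga or liblinear solver. Try saga, then liblinear, then fall back to L2.
# class_weight="balanced" is multiplied with lr__sample_weight (double weighting of the minority class).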
lr_for_imp = None
for solver, penalty in [("saga", "l1"), ("liblinear", "l1")]:
    try:
        lr_for_imp = Pipeline(steps=[
            ("scaler", StandardScaler()),
            ("lr", LogisticRegression(
                penalty=penalty, solver=solver, C=1.0,
                max_iter=3000, class_weight="balanced",
                random_state=RANDOM_STATE
            ))
        ])
        lr_for_imp.fit(X_train, y_train, lr__sample_weight=sw_train)
        break
    except Exception:
        lr_for_imp = None

if lr_for_imp is None:
    # fallback: L2
    lr_for_imp = Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("lr", LogisticRegression(
            penalty="l2", solver="lbfgs", C=1.0,
            max_iter=2000, class_weight="balanced",
            random_state=RANDOM_STATE
        ))
    ])
    lr_for_imp.fit(X_train, y_train, lr__sample_weight=sw_train)

coef = lr_for_imp.named_steps["lr"].coef_.ravel()
imp_lr = pd.Series(np.abs(coef), index=X_train.columns, name="LogReg_abs_coef")

# normalization and mean
imp = pd.concat([imp_et, imp_lr], axis=1)
imp["ExtraTrees_norm"] = imp["ExtraTrees_importance"] / (imp["ExtraTrees_importance"].sum() + 1e-12)
imp["LogReg_norm"] = imp["LogReg_abs_coef"] / (imp["LogReg_abs_coef"].sum() + 1e-12)
imp["Importance_mean"] = (imp["ExtraTrees_norm"] + imp["LogReg_norm"]) / 2

imp_sorted = imp.sort_values("Importance_mean", ascending=False)
imp_sorted.head(15)


| Feature | ExtraTrees_importance | LogReg_abs_coef | ExtraTrees_norm | LogReg_norm | Importance_mean |
|---|---|---|---|---|---|
| Age_Category | 0.375644 | 1.010738 | 0.375644 | 0.277869 | 0.326756 |
| General_Health | 0.166776 | 0.604183 | 0.166776 | 0.1661 | 0.166438 |
| Sex | 0.053933 | 0.462745 | 0.053933 | 0.127216 | 0.090575 |
| Arthritis | 0.081963 | 0.184574 | 0.081963 | 0.050742 | 0.066353 |
| Smoking_History | 0.046837 | 0.218339 | 0.046837 | 0.060025 | 0.053431 |
| Diabetes | 0.049317 | 0.208183 | 0.049317 | 0.057233 | 0.053275 |
| Checkup | 0.050803 | 0.140279 | 0.050803 | 0.038565 | 0.044684 |
| Weight_(kg) | 0.01548 | 0.189893 | 0.01548 | 0.052205 | 0.033843 |
| BMI | 0.013785 | 0.18979 | 0.013785 | 0.052176 | 0.032981 |
| Depression | 0.020962 | 0.154984 | 0.020962 | 0.042608 | 0.031785 |


In [9]:
# select the TOP N
top_features = imp_sorted.head(TOP_N_FEATURES).index.tolist()
top_features


['Age_Category',
 'General_Health',
 'Sex',
 'Arthritis',
 'Smoking_History',
 'Diabetes',
 'Checkup']

In [10]:
# export the importance table
imp_sorted.reset_index().rename(columns={"index":"Feature"}).to_csv(OUT_IMPORTANCE, index=False)
OUT_IMPORTANCE


'2_2_feature_importance.csv'

## 2.2.1 Quick plot: TOP important features

In [11]:
import matplotlib.pyplot as plt
from pathlib import Path

if SAVE_FIGS:
    Path(FIG_DIR).mkdir(parents=True, exist_ok=True)

plt.figure()
plt.bar(imp_sorted.head(TOP_N_FEATURES).index, imp_sorted.head(TOP_N_FEATURES)["Importance_mean"].values)
plt.title(f"TOP {TOP_N_FEATURES} important features (mean of 2 methods)")
plt.ylabel("Importance_mean")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
if SAVE_FIGS:
    out = Path(FIG_DIR) / "feature_importance_top.png"
    plt.savefig(out, dpi=200)
    plt.close()
    str(out)
else:
    plt.show()


## 2.3 Modeling with the selected TOP features (faster + more focused)

We build **≥ 5** models and run **10-fold CV**.  
All models receive `sample_weight` (balanced) in every fold.
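
A minimal sketch of per-fold weighting follows. It assumes synthetic data from `make_classification` and a reduced 5-fold split, and every name in it (`X_demo`, `y_demo`, `pipe`) is illustrative. It shows two points: the weights are recomputed from the training part of each fold only, and a `Pipeline` receives them through the `<step>__sample_weight` fit parameter of its final step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data (assumption: roughly 10% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, weights=[0.9, 0.1], random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000)),
])
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

class_totals = []
for tr_idx, va_idx in skf.split(X_demo, y_demo):
    y_tr = y_demo[tr_idx]
    # Weights come from the training fold only, never from validation labels
    sw = compute_sample_weight(class_weight="balanced", y=y_tr)
    # Route the weights to the final step named "lr"
    pipe.fit(X_demo[tr_idx], y_tr, lr__sample_weight=sw)
    # Total weight per class within this training fold
    class_totals.append((sw[y_tr == 0].sum(), sw[y_tr == 1].sum()))

print(class_totals[0])
```

In every fold both classes carry the same total weight, which is the effect the "balanced" setting is meant to achieve.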


In [12]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (
    roc_auc_score, accuracy_score, recall_score, precision_score, f1_score, confusion_matrix
)
from sklearn.base import clone
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

Xtr = X_train[top_features].copy()
ytr = y_train.copy()

# 5 models (required: LogisticRegression)
models = {}

models["LogisticRegression"] = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(
        penalty="l2",
        C=2.0,
        solver="lbfgs",
        max_iter=2000,
        class_weight=None,   # we use sample_weight instead
        random_state=RANDOM_STATE
    ))
])

models["LinearSVC"] = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("svm", LinearSVC(
        C=1.0,
        class_weight=None,
        max_iter=7000,
        random_state=RANDOM_STATE
    ))
])

models["SGDClassifier"] = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("sgd", SGDClassifier(
        loss="log_loss",
        penalty="elasticnet",
        alpha=1e-4,
        l1_ratio=0.15,
        max_iter=3000,
        tol=1e-3,
        random_state=RANDOM_STATE
    ))
])

models["ExtraTrees"] = ExtraTreesClassifier(
    n_estimators=300,
    max_depth=14,
    min_samples_split=4,
    min_samples_leaf=2,
    max_features="sqrt",
    n_jobs=-1,
    random_state=RANDOM_STATE
)

models["HistGradientBoosting"] = HistGradientBoostingClassifier(
    learning_rate=0.08,
    max_depth=4,
    max_iter=300,
    min_samples_leaf=30,
    l2_regularization=0.0,
    random_state=RANDOM_STATE
)

list(models.keys())


['LogisticRegression',
 'LinearSVC',
 'SGDClassifier',
 'ExtraTrees',
 'HistGradientBoosting']

In [13]:
def get_score(fitted_model, X):
    # score for AUC: predict_proba if available, otherwise decision_function
    if hasattr(fitted_model, "predict_proba"):
        return fitted_model.predict_proba(X)[:, 1]
    if hasattr(fitted_model, "decision_function"):
        return fitted_model.decision_function(X)
    return fitted_model.predict(X)

def compute_metrics(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    specificity = tn / (tn + fp) if (tn + fp) > 0 else np.nan
    npv = tn / (tn + fn) if (tn + fn) > 0 else np.nan

    return {
        "AUC": roc_auc_score(y_true, y_score) if len(np.unique(y_true)) == 2 else np.nan,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Sensitivity": recall_score(y_true, y_pred, pos_label=1),
        "Specificity": specificity,
        "PPV": precision_score(y_true, y_pred, pos_label=1, zero_division=0),
        "NPV": npv,
        "F1": f1_score(y_true, y_pred, pos_label=1, zero_division=0),
    }

def cv_evaluate_weighted(models, X, y, n_splits=10, random_state=42):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

    summaries = {}
    all_folds = {}

    for name, est in models.items():
        fold_metrics = {m: [] for m in ["AUC","Accuracy","Sensitivity","Specificity","PPV","NPV","F1"]}

        for tr_idx, va_idx in skf.split(X, y):
            X_tr, X_va = X.iloc[tr_idx], X.iloc[va_idx]
            y_tr, y_va = y.iloc[tr_idx], y.iloc[va_idx]

            sw = compute_sample_weight(class_weight="balanced", y=y_tr)

            model = clone(est)
            # fit with sample_weight (Pipeline: pass <step>__sample_weight to the final step)
            if isinstance(model, Pipeline):
                last_step_name = model.steps[-1][0]
                fit_params = {f"{last_step_name}__sample_weight": sw}
                model.fit(X_tr, y_tr, **fit_params)
            else:
                model.fit(X_tr, y_tr, sample_weight=sw)

            y_pred = model.predict(X_va)
            y_score = get_score(model, X_va)

            mets = compute_metrics(y_va, y_pred, y_score)
            for k, v in mets.items():
                fold_metrics[k].append(v)

        all_folds[name] = fold_metrics
        summaries[name] = {f"{k}_mean": float(np.mean(v)) for k, v in fold_metrics.items()}
        summaries[name].update({f"{k}_sd": float(np.std(v, ddof=1)) for k, v in fold_metrics.items()})

    return all_folds, summaries

all_fold_metrics, summaries = cv_evaluate_weighted(models, Xtr, ytr, n_splits=N_SPLITS, random_state=RANDOM_STATE)
summaries


{'LogisticRegression': {'AUC_mean': 0.8323187862778874,
  'Accuracy_mean': 0.7326849709508102,
  'Sensitivity_mean': 0.7867046565844763,
  'Specificity_mean': 0.727933192909408,
  'PPV_mean': 0.20278655649248659,
  'NPV_mean': 0.974872880869307,
  'F1_mean': 0.3224522202705623,
  'AUC_sd': 0.006628305826886657,
  'Accuracy_sd': 0.003021994292805601,
  'Sensitivity_sd': 0.011865951810262163,
  'Specificity_sd': 0.0027645820878608035,
  'PPV_sd': 0.003380530447134452,
  'NPV_sd': 0.0014002497133395117,
  'F1_sd': 0.005185253809231648},
 'LinearSVC': {'AUC_mean': 0.8323274686158731,
  'Accuracy_mean': 0.7277959350023525,
  'Sensitivity_mean': 0.7933623157458911,
  'Specificity_mean': 0.7220284721306515,
  'PPV_mean': 0.2006887267285235,
  'NPV_mean': 0.9754430773641545,
  'F1_mean': 0.3203409049006057,
  'AUC_sd': 0.006649403439660119,
  'Accuracy_sd': 0.0032244377831756707,
  'Sensitivity_sd': 0.011526624178460499,
  'Specificity_sd': 0.00292171713643291,
  'PPV_sd': 0.003437242505530579

### Results table (mean ± SD, 10-fold CV)

Note: weighting and top-feature selection can improve **Sensitivity/F1** and reduce runtime.


In [14]:
def fmt(mean, sd, nd=3):
    return f"{mean:.{nd}f} ± {sd:.{nd}f}"

rows = []
for name, s in summaries.items():
    rows.append({
        "Model": name,
        "AUC_mean": s["AUC_mean"],
        "F1_mean": s["F1_mean"],
        "AUC (mean±SD)": fmt(s["AUC_mean"], s["AUC_sd"], 3),
        "Accuracy (mean±SD)": fmt(s["Accuracy_mean"], s["Accuracy_sd"], 3),
        "Sensitivity (mean±SD)": fmt(s["Sensitivity_mean"], s["Sensitivity_sd"], 3),
        "Specificity (mean±SD)": fmt(s["Specificity_mean"], s["Specificity_sd"], 3),
        "PPV (mean±SD)": fmt(s["PPV_mean"], s["PPV_sd"], 3),
        "NPV (mean±SD)": fmt(s["NPV_mean"], s["NPV_sd"], 3),
        "F1 (mean±SD)": fmt(s["F1_mean"], s["F1_sd"], 3),
        # get_params() works the same way for a Pipeline and a plain estimator
        "Parametri": str(models[name].get_params()),
        "Komentar": "",
        "Izbor": ""
    })

cv_table = pd.DataFrame(rows).sort_values("AUC_mean", ascending=False).reset_index(drop=True)

# TOP 3 by AUC (switch to F1_mean if you want to focus on 'Yes')
top3 = cv_table.head(3)["Model"].tolist()
cv_table["Izbor"] = cv_table["Model"].apply(lambda m: "DA" if m in top3 else "")

cv_table


|  | Model | AUC_mean | F1_mean | AUC (mean±SD) | Accuracy (mean±SD) | Sensitivity (mean±SD) | Specificity (mean±SD) | PPV (mean±SD) | NPV (mean±SD) | F1 (mean±SD) | Parametri | Komentar | Izbor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HistGradientBoosting | 0.833757 | 0.317254 | 0.834 ± 0.007 | 0.718 ± 0.004 | 0.810 ± 0.011 | 0.710 ± 0.004 | 0.197 ± 0.003 | 0.977 ± 0.001 | 0.317 ± 0.005 | {'categorical_features': 'from_dtype', 'class_... |  | DA |
| 1 | LinearSVC | 0.832327 | 0.320341 | 0.832 ± 0.007 | 0.728 ± 0.003 | 0.793 ± 0.012 | 0.722 ± 0.003 | 0.201 ± 0.003 | 0.975 ± 0.001 | 0.320 ± 0.005 | {'memory': None, 'steps': [('scaler', Standard... |  | DA |
| 2 | LogisticRegression | 0.832319 | 0.322452 | 0.832 ± 0.007 | 0.733 ± 0.003 | 0.787 ± 0.012 | 0.728 ± 0.003 | 0.203 ± 0.003 | 0.975 ± 0.001 | 0.322 ± 0.005 | {'memory': None, 'steps': [('scaler', Standard... |  | DA |
| 3 | SGDClassifier | 0.830187 | 0.320664 | 0.830 ± 0.007 | 0.731 ± 0.016 | 0.785 ± 0.026 | 0.726 ± 0.019 | 0.202 ± 0.007 | 0.975 ± 0.003 | 0.321 ± 0.007 | {'memory': None, 'steps': [('scaler', Standard... |  |  |
| 4 | ExtraTrees | 0.823872 | 0.312798 | 0.824 ± 0.007 | 0.716 ± 0.003 | 0.799 ± 0.013 | 0.709 ± 0.003 | 0.194 ± 0.003 | 0.976 ± 0.001 | 0.313 ± 0.004 | {'bootstrap': False, 'ccp_alpha': 0.0, 'class_... |  |  |
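
The low PPV (about 0.20) next to a high NPV (about 0.975) follows largely from the low prevalence of the positive class. As a rough check, the sketch below recomputes both values from sensitivity, specificity and prevalence using the standard confusion-matrix identities. The helper `ppv_npv` is a hypothetical name, and the inputs are the rounded LogisticRegression values from the table together with the training-set positive share of 0.0809.

```python
def ppv_npv(sensitivity: float, specificity: float, prevalence: float):
    """Derive PPV and NPV from sensitivity, specificity and prevalence."""
    # Expected cell shares of the confusion matrix (fractions of all samples)
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)

# Rounded LogisticRegression values from the table; prevalence from the train split
ppv, npv = ppv_npv(sensitivity=0.787, specificity=0.728, prevalence=0.0809)
print(round(ppv, 3), round(npv, 3))
```

The result lands close to the tabulated PPV and NPV, so the modest F1 mostly reflects the roughly 8% prevalence rather than a defect in any particular model.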


In [15]:
# export the table
cv_table.to_csv(OUT_TABLE, index=False)
OUT_TABLE


'2_3_cv_tabela_klasifikacija_weighted.csv'

### Plots: AUC and F1 (mean ± SD)

In [16]:
from pathlib import Path
import matplotlib.pyplot as plt

def plot_metric_bar(metric_key, title, filename):
    means = []
    sds = []
    names = []
    for name in cv_table["Model"].tolist():
        means.append(summaries[name][f"{metric_key}_mean"])
        sds.append(summaries[name][f"{metric_key}_sd"])
        names.append(name)

    order = np.argsort(means)[::-1]
    names = [names[i] for i in order]
    means = [means[i] for i in order]
    sds = [sds[i] for i in order]

    plt.figure()
    plt.bar(names, means, yerr=sds, capsize=4)
    plt.title(title)
    plt.ylabel(metric_key)
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    if SAVE_FIGS:
        out = Path(FIG_DIR) / filename
        plt.savefig(out, dpi=200)
        plt.close()
        return str(out)
    else:
        plt.show()
        return None

auc_fig = plot_metric_bar("AUC", "Model comparison (weighted, TOP features): AUC", "cv_auc_weighted.png")
f1_fig  = plot_metric_bar("F1",  "Model comparison (weighted, TOP features): F1",  "cv_f1_weighted.png")
auc_fig, f1_fig


('figures\\2_classification_weighted\\cv_auc_weighted.png',
 'figures\\2_classification_weighted\\cv_f1_weighted.png')

## Short summary (what we expect)

- Class weighting usually **increases Sensitivity** (we catch more “Yes” cases) and often **F1** as well.
- Selecting the top features:
  - reduces runtime,
  - makes the models more interpretable,
  - can improve generalization (less noise).
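
To illustrate the first point, the sketch below fits the same logistic regression with and without balanced sample weights and compares sensitivity (recall on the positive class). It uses synthetic data from `make_classification`, not the CVD dataset, so the numbers only indicate the direction of the effect; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data (assumption: about 8% positives, similar to the notebook)
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=8, n_informative=4,
    weights=[0.92, 0.08], random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

# Same model, fitted once without and once with balanced sample weights
plain = LogisticRegression(max_iter=1000).fit(X_a, y_a)
sw = compute_sample_weight(class_weight="balanced", y=y_a)
weighted = LogisticRegression(max_iter=1000).fit(X_a, y_a, sample_weight=sw)

recall_plain = recall_score(y_b, plain.predict(X_b))
recall_weighted = recall_score(y_b, weighted.predict(X_b))
print(recall_plain, recall_weighted)
```

The gain in sensitivity comes at the cost of more false positives, which is why precision and specificity should be read alongside it.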

If needed, the next step can be **item 3** (selecting the 3 best models + test set evaluation + ROC + confusion matrix).
