# Classification: Building and Evaluating Models (Part 2)
- **Target (Y):** `Heart_Disease` (binary: `Yes/No`)
- **2.1** split into a training set (80%) and a test set (20%)
- **2.2** build **at least 5** different classification models (with explicitly set hyperparameters; no defaults)
- **2.3** evaluate with **stratified 10-fold cross-validation** on the training set and compute the metrics:
  - **AUC, Accuracy, Sensitivity, Specificity, PPV, NPV, F1**
- results are presented in a **combined table (mean ± SD)** and **graphically**


## 0) Settings

In [1]:
CSV_PATH = "CVD_cleaned.csv"
TARGET = "Heart_Disease"

TEST_SIZE = 0.20
RANDOM_STATE = 42
N_SPLITS = 10

SAVE_FIGS = True
FIG_DIR = "figures/2_classification"


## 1) Data import

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv(CSV_PATH)
df.shape, df.columns.tolist()




((308854, 19),
 ['General_Health',
  'Checkup',
  'Exercise',
  'Heart_Disease',
  'Skin_Cancer',
  'Other_Cancer',
  'Depression',
  'Diabetes',
  'Arthritis',
  'Sex',
  'Age_Category',
  'Height_(cm)',
  'Weight_(kg)',
  'BMI',
  'Smoking_History',
  'Alcohol_Consumption',
  'Fruit_Consumption',
  'Green_Vegetables_Consumption',
  'FriedPotato_Consumption'])

In [3]:
# Target variable: map Yes/No -> 1/0
y_raw = df[TARGET].astype(str)
y = y_raw.map({"No": 0, "Yes": 1})

# Feature matrix
X = df.drop(columns=[TARGET])

# quick check: class counts and number of unmapped (NaN) labels
y.value_counts(dropna=False), y.isna().sum()


(Heart_Disease
 0    283883
 1     24971
 Name: count, dtype: int64,
 0)

## 2.1 Split into training (80%) and test (20%) sets

Because the data are not time-dependent, we use a **random split**.
For classification we pass **stratify=y** so that the class ratio is preserved in both sets.
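
As a minimal sketch of what `stratify=y` does, the snippet below splits a small synthetic label vector with an 8% positive rate and compares the positive rate in both parts. The data and the names `X_demo` and `y_demo` are assumptions made for illustration; they are not the CVD dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 1000 samples, 8% positives (not the real dataset)
y_demo = np.array([1] * 80 + [0] * 920)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.20,
    random_state=42,
    stratify=y_demo,  # keep the class ratio in both parts
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate - train: {train_rate:.4f}, test: {test_rate:.4f}")
```

Without `stratify`, the two rates could drift apart by chance, which matters most when the positive class is rare.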


In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=y
)

print("Train:", X_train.shape, " Test:", X_test.shape)
print("Class 1 (Yes) - train:", y_train.mean().round(4), " test:", y_test.mean().round(4))


Train: (247083, 18)  Test: (61771, 18)
Class 1 (Yes) - train: 0.0809  test: 0.0808


## 2.2 Data preparation (preprocessing)

- Numeric variables: standardization (`StandardScaler`)
- Categorical variables: one-hot encoding (`OneHotEncoder`)

We combine both in a `ColumnTransformer` and use it inside a `Pipeline`, so that preprocessing is fit only on the training part of each CV fold (no data leakage).
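
The leakage-free setup described above can be sketched on toy data as follows. The DataFrame, its column names (`bmi`, `sex`) and the labels are hypothetical, and `cross_val_score` is used here only for brevity; the notebook itself runs an explicit fold loop.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic, illustrative data (column names are hypothetical)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "bmi": rng.normal(27, 4, n),
    "sex": rng.choice(["Female", "Male"], n),
})
# Label loosely tied to bmi so the model has something to learn
y_demo = (demo["bmi"] + rng.normal(0, 4, n) > 29).astype(int)

prep = ColumnTransformer([
    ("num", StandardScaler(), ["bmi"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])
pipe = Pipeline([("preprocess", prep), ("model", LogisticRegression(max_iter=200))])

# cross_val_score clones and refits the whole pipeline per fold, so the scaler
# and the encoder only ever see that fold's training part
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, demo, y_demo, cv=cv, scoring="roc_auc")
print(scores.round(3))
```

Scaling the full table before splitting would instead let validation rows influence the mean and variance used for training.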


In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

num_cols = X_train.select_dtypes(include="number").columns.tolist()
cat_cols = [c for c in X_train.columns if c not in num_cols]

# OneHotEncoder: `sparse_output` (scikit-learn >= 1.2) replaced the older `sparse` argument
try:
    ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=True)
except TypeError:
    ohe = OneHotEncoder(handle_unknown="ignore", sparse=True)

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", ohe, cat_cols),
    ],
    remainder="drop"
)

num_cols, cat_cols[:5], len(cat_cols)


(['Height_(cm)',
  'Weight_(kg)',
  'BMI',
  'Alcohol_Consumption',
  'Fruit_Consumption',
  'Green_Vegetables_Consumption',
  'FriedPotato_Consumption'],
 ['General_Health', 'Checkup', 'Exercise', 'Skin_Cancer', 'Other_Cancer'],
 11)

## 2.2 Models (5 algorithms, no default hyperparameters)

We use 5 different algorithms:
1. **Logistic regression** (required by the assignment)
2. **SGDClassifier (logistic loss)**
3. **LinearSVC**
4. **Random Forest**
5. **Extra Trees (Extremely Randomized Trees)**


In [6]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier


In [7]:
models = {}

# 1) Logistic regression (required by the assignment; parameters set explicitly)
models["LogisticRegression"] = LogisticRegression(
    penalty=None,
    solver="lbfgs",
    max_iter=500,
    random_state=RANDOM_STATE
)

# 2) SGD (log-loss): fast on large datasets
models["SGDClassifier"] = SGDClassifier(
    loss="log_loss",
    penalty="elasticnet",
    alpha=1e-4,
    l1_ratio=0.15,
    max_iter=2000,
    tol=1e-3,
    class_weight="balanced",
    random_state=RANDOM_STATE
)

# 3) Linear SVC (provides decision_function, so AUC can still be computed)
models["LinearSVC"] = LinearSVC(
    C=0.8,
    class_weight="balanced",
    max_iter=5000,
    random_state=RANDOM_STATE
)

# 4) Random Forest
models["RandomForest"] = RandomForestClassifier(
    n_estimators=300,
    max_depth=14,
    min_samples_split=4,
    min_samples_leaf=2,
    max_features="sqrt",
    class_weight="balanced_subsample",
    n_jobs=-1,
    random_state=RANDOM_STATE
)

# 5) Extra Trees (stand-in for gradient boosting; needs no extra packages)
models["ExtraTrees"] = ExtraTreesClassifier(
    n_estimators=600,
    max_depth=16,
    min_samples_split=4,
    min_samples_leaf=2,
    max_features="sqrt",
    bootstrap=False,
    class_weight="balanced",
    n_jobs=-1,
    random_state=RANDOM_STATE
)

list(models.keys())


['LogisticRegression',
 'SGDClassifier',
 'LinearSVC',
 'RandomForest',
 'ExtraTrees']
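
`LinearSVC` has no `predict_proba`, so its AUC is computed from `decision_function` margins. The sketch below illustrates why that is acceptable: AUC depends only on how the scores rank the samples. The pairwise function `auc_pairwise`, the labels and the margins are all hypothetical and written for illustration; the notebook itself relies on `roc_auc_score`.

```python
import math

def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical decision_function-style margins (any real numbers, not probabilities)
y_demo = [0, 0, 1, 0, 1, 1]
margins = [-2.0, -0.5, -0.1, 0.3, 0.8, 2.5]

# A sigmoid is strictly increasing, so it changes the values but not the ranking
probs = [1.0 / (1.0 + math.exp(-m)) for m in margins]

auc_margins = auc_pairwise(y_demo, margins)
auc_probs = auc_pairwise(y_demo, probs)
print(auc_margins, auc_probs)
```

The threshold-based metrics (accuracy, sensitivity, and so on) do not share this property: they depend on where the decision cut-off lies.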

## 2.3 10-fold cross-validation + metrics

Metrics (per the assignment):
- **AUC**
- **Accuracy**
- **Sensitivity** (TPR = recall for the positive class)
- **Specificity** (TNR)
- **PPV** (precision)
- **NPV**
- **F1**

All metrics are computed on the validation folds (10-fold), and we report **mean ± SD** across folds.
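
To make the definitions concrete, here is a small worked example that derives the threshold-based metrics from the four confusion-matrix counts. The function name `metrics_from_counts` and the counts are hypothetical (they are not results from this notebook), and the sketch assumes every denominator is non-zero.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Threshold-based metrics from the four confusion-matrix counts."""
    return {
        "Accuracy": (tp + tn) / (tp + fp + fn + tn),
        "Sensitivity": tp / (tp + fn),   # TPR, recall of the positive class
        "Specificity": tn / (tn + fp),   # TNR
        "PPV": tp / (tp + fp),           # precision
        "NPV": tn / (tn + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical counts for one validation fold (not taken from the notebook's results)
demo = metrics_from_counts(tp=40, fp=30, fn=10, tn=120)
for name, value in demo.items():
    print(f"{name}: {value:.3f}")
```

AUC is not included because it is computed from continuous scores rather than from thresholded predictions.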


In [8]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (
    roc_auc_score, accuracy_score, recall_score, precision_score, f1_score, confusion_matrix
)
import matplotlib.pyplot as plt
from pathlib import Path

def get_score(model, X):
    """Return a score for AUC: predict_proba if available, otherwise decision_function."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    if hasattr(model, "decision_function"):
        return model.decision_function(X)
    # fallback (not ideal for AUC, but avoids a crash)
    return model.predict(X)

def compute_metrics(y_true, y_pred, y_score):
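    # labels=[0, 1] guarantees a 2x2 matrix even if a fold's predictions contain a single class
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()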
    specificity = tn / (tn + fp) if (tn + fp) > 0 else np.nan
    npv = tn / (tn + fn) if (tn + fn) > 0 else np.nan

    return {
        "AUC": roc_auc_score(y_true, y_score) if len(np.unique(y_true)) == 2 else np.nan,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Sensitivity": recall_score(y_true, y_pred, pos_label=1),
        "Specificity": specificity,
        "PPV": precision_score(y_true, y_pred, pos_label=1, zero_division=0),
        "NPV": npv,
        "F1": f1_score(y_true, y_pred, pos_label=1, zero_division=0),
    }

def cv_evaluate(name, estimator, X_train, y_train, preprocess, n_splits=10, random_state=42):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

    fold_metrics = {m: [] for m in ["AUC","Accuracy","Sensitivity","Specificity","PPV","NPV","F1"]}

    for fold, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train), start=1):
        X_tr, X_va = X_train.iloc[tr_idx], X_train.iloc[va_idx]
        y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]

        pipe = Pipeline(steps=[("preprocess", preprocess), ("model", estimator)])
        pipe.fit(X_tr, y_tr)

        y_pred = pipe.predict(X_va)
        y_score = get_score(pipe, X_va)

        mets = compute_metrics(y_va, y_pred, y_score)
        for k, v in mets.items():
            fold_metrics[k].append(v)

    # summary: mean ± SD across folds (sample SD, ddof=1)
    summary = {}
    for k, vals in fold_metrics.items():
        vals = np.array(vals, dtype=float)
        summary[f"{k}_mean"] = float(np.nanmean(vals))
        summary[f"{k}_sd"] = float(np.nanstd(vals, ddof=1))
    return fold_metrics, summary

# output directory for figures
if SAVE_FIGS:
    Path(FIG_DIR).mkdir(parents=True, exist_ok=True)


In [9]:
all_fold_metrics = {}
summaries = {}

for name, est in models.items():
    print(f"Evaluating: {name}")
    folds, summary = cv_evaluate(name, est, X_train, y_train, preprocess, n_splits=N_SPLITS, random_state=RANDOM_STATE)
    all_fold_metrics[name] = folds
    summaries[name] = summary

summaries


Evaluating: LogisticRegression
Evaluating: SGDClassifier
Evaluating: LinearSVC
Evaluating: RandomForest
Evaluating: ExtraTrees


{'LogisticRegression': {'AUC_mean': 0.8338515313831852,
  'AUC_sd': 0.006792811718131441,
  'Accuracy_mean': 0.9193267069190476,
  'Accuracy_sd': 0.0006309307803403747,
  'Sensitivity_mean': 0.061170960645171954,
  'Sensitivity_sd': 0.005520567724076357,
  'Specificity_mean': 0.9948129954310166,
  'Specificity_sd': 0.0005282935466586343,
  'PPV_mean': 0.509319860933525,
  'PPV_sd': 0.03244334780445818,
  'NPV_mean': 0.9233499813910999,
  'NPV_sd': 0.00042298367418707307,
  'F1_mean': 0.10917816700653504,
  'F1_sd': 0.009261980564733461},
 'SGDClassifier': {'AUC_mean': 0.8320242197780319,
  'AUC_sd': 0.00633088938336287,
  'Accuracy_mean': 0.7286823984916007,
  'Accuracy_sd': 0.01793051781620956,
  'Sensitivity_mean': 0.790055704176886,
  'Sensitivity_sd': 0.023515928533297686,
  'Specificity_mean': 0.7232833376773508,
  'Specificity_sd': 0.02123678324645637,
  'PPV_mean': 0.2012156229883888,
  'PPV_sd': 0.00854155852135961,
  'NPV_mean': 0.9751497260012462,
  'NPV_sd': 0.00214384996675

## 2.3 Combined table (mean ± SD)

The table contains the required metrics (mean ± SD across the 10 folds).  
In addition:
- **AIC/BIC** are computed only for logistic regression (for the other models we enter `—`).
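
As a minimal illustration of the mean ± SD entries, the snippet below summarises ten per-fold values using the sample standard deviation (dividing by n - 1), which corresponds to `ddof=1` in NumPy. The fold values are hypothetical, not this notebook's actual per-fold scores.

```python
import statistics

# Hypothetical per-fold AUC values (10 folds); not the notebook's actual fold scores
fold_auc = [0.83, 0.84, 0.82, 0.85, 0.83, 0.84, 0.83, 0.82, 0.84, 0.84]

mean_auc = statistics.mean(fold_auc)
sd_auc = statistics.stdev(fold_auc)  # sample SD (divides by n - 1)

summary = f"{mean_auc:.3f} ± {sd_auc:.3f}"
print(summary)
```

With only 10 folds, the sample SD is noticeably larger than the population SD, so it is worth stating which one a table reports.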


In [10]:
def fmt(mean, sd, nd=3):
    return f"{mean:.{nd}f} ± {sd:.{nd}f}"

rows = []
for name, s in summaries.items():
    rows.append({
        "Model": name,
        "Tip": name,
        # numeric means (so sorting works correctly)
        "AUC_mean": s["AUC_mean"],
        "F1_mean": s["F1_mean"],
        # formatted strings for the report
        "AUC (mean±SD)": fmt(s["AUC_mean"], s["AUC_sd"], 3),
        "Accuracy (mean±SD)": fmt(s["Accuracy_mean"], s["Accuracy_sd"], 3),
        "Sensitivity (mean±SD)": fmt(s["Sensitivity_mean"], s["Sensitivity_sd"], 3),
        "Specificity (mean±SD)": fmt(s["Specificity_mean"], s["Specificity_sd"], 3),
        "PPV (mean±SD)": fmt(s["PPV_mean"], s["PPV_sd"], 3),
        "NPV (mean±SD)": fmt(s["NPV_mean"], s["NPV_sd"], 3),
        "F1 (mean±SD)": fmt(s["F1_mean"], s["F1_sd"], 3),
        "Parametri": str(models[name].get_params()),
        "AIC/BIC": "—",
        "Komentar": "",
        "Izbor": ""
    })

cv_table = pd.DataFrame(rows)

# TOP 3 by AUC_mean (switch to F1_mean if that matters more for your use case)
top3 = cv_table.sort_values("AUC_mean", ascending=False).head(3)["Model"].tolist()
cv_table["Izbor"] = cv_table["Model"].apply(lambda m: "DA" if m in top3 else "")

# sort by AUC_mean; the helper columns (AUC_mean, F1_mean) can be dropped before reporting if not wanted
cv_table = cv_table.sort_values("AUC_mean", ascending=False).reset_index(drop=True)
cv_table


| | Model | Tip | AUC_mean | F1_mean | AUC (mean±SD) | Accuracy (mean±SD) | Sensitivity (mean±SD) | Specificity (mean±SD) | PPV (mean±SD) | NPV (mean±SD) | F1 (mean±SD) | Parametri | AIC/BIC | Komentar | Izbor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LogisticRegression | LogisticRegression | 0.833852 | 0.109178 | 0.834 ± 0.007 | 0.919 ± 0.001 | 0.061 ± 0.006 | 0.995 ± 0.001 | 0.509 ± 0.032 | 0.923 ± 0.000 | 0.109 ± 0.009 | {'C': 1.0, 'class_weight': None, 'dual': False... | — | | DA |
| 1 | LinearSVC | LinearSVC | 0.833827 | 0.322538 | 0.834 ± 0.007 | 0.732 ± 0.003 | 0.790 ± 0.012 | 0.726 ± 0.003 | 0.203 ± 0.003 | 0.975 ± 0.001 | 0.323 ± 0.005 | {'C': 0.8, 'class_weight': 'balanced', 'dual':... | — | | DA |
| 2 | SGDClassifier | SGDClassifier | 0.832024 | 0.320538 | 0.832 ± 0.006 | 0.729 ± 0.018 | 0.790 ± 0.024 | 0.723 ± 0.021 | 0.201 ± 0.009 | 0.975 ± 0.002 | 0.321 ± 0.010 | {'alpha': 0.0001, 'average': False, 'class_wei... | — | | DA |
| 3 | ExtraTrees | ExtraTrees | 0.826269 | 0.329297 | 0.826 ± 0.007 | 0.756 ± 0.003 | 0.740 ± 0.010 | 0.758 ± 0.003 | 0.212 ± 0.004 | 0.971 ± 0.001 | 0.329 ± 0.006 | {'bootstrap': False, 'ccp_alpha': 0.0, 'class_... | — | | |
| 4 | RandomForest | RandomForest | 0.825708 | 0.338902 | 0.826 ± 0.007 | 0.781 ± 0.004 | 0.695 ± 0.011 | 0.788 ± 0.003 | 0.224 ± 0.005 | 0.967 ± 0.001 | 0.339 ± 0.006 | {'bootstrap': True, 'ccp_alpha': 0.0, 'class_w... | — | | |


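## (Optional) AIC/BIC for logistic regression

AIC/BIC are meaningful for **(unpenalized) logistic regression**.  
We compute them on the **full training set (80%)**.

- the log-likelihood is computed from the predicted probabilities
- number of parameters `k` = number of features after one-hot encoding + intercept
- because the encoder keeps every category level, some one-hot columns are linearly dependent on the intercept, so `k` somewhat overstates the number of free parameters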
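
The two criteria follow directly from the log-likelihood: AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. As a small consistency sketch, the snippet below backs out the log-likelihood implied by the AIC this notebook reports for its run (k = 49, n = 247083) and recomputes both criteria. The helper `aic_bic` is a hypothetical name introduced here for illustration.

```python
import math

def aic_bic(log_likelihood, k, n):
    """AIC = 2k - 2*ln(L), BIC = k*ln(n) - 2*ln(L)."""
    aic = 2 * k - 2 * log_likelihood
    bic = k * math.log(n) - 2 * log_likelihood
    return aic, bic

# Values reported by this notebook's run: k = 49 parameters, n = 247083 training rows
k, n = 49, 247083
reported_aic = 109856.51576810092

# Back out the log-likelihood implied by the reported AIC, then recompute both criteria
ll = (2 * k - reported_aic) / 2
aic, bic = aic_bic(ll, k, n)
print(f"ll={ll:.1f}, AIC={aic:.1f}, BIC={bic:.1f}")
```

BIC penalises each parameter by ln n instead of 2, so with this many rows it is always the larger of the two.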


In [11]:
# fit the logistic regression on the full training set
lr_pipe = Pipeline(steps=[("preprocess", preprocess), ("model", models["LogisticRegression"])])
lr_pipe.fit(X_train, y_train)

# predicted probabilities on the training set
p = lr_pipe.predict_proba(X_train)[:,1]
# log-likelihood (sum over all training samples)
eps = 1e-15
p_clip = np.clip(p, eps, 1-eps)
ll = float(np.sum(y_train*np.log(p_clip) + (1-y_train)*np.log(1-p_clip)))

# number of parameters
feat_names = lr_pipe.named_steps["preprocess"].get_feature_names_out()
k = len(feat_names) + 1  # + intercept
n = len(y_train)

AIC = 2*k - 2*ll
BIC = k*np.log(n) - 2*ll
AIC, BIC, k, n


(109856.51576810092, 110366.97226808699, 49, 247083)

In [12]:
# write AIC/BIC into the table (if they were computed)
if AIC is not None and BIC is not None:
    cv_table.loc[cv_table["Model"]=="LogisticRegression","AIC/BIC"] = f"AIC={AIC:.1f}, BIC={BIC:.1f}"
else:
    cv_table.loc[cv_table["Model"]=="LogisticRegression","AIC/BIC"] = "— (AIC/BIC not computed)"
cv_table


| | Model | Tip | AUC_mean | F1_mean | AUC (mean±SD) | Accuracy (mean±SD) | Sensitivity (mean±SD) | Specificity (mean±SD) | PPV (mean±SD) | NPV (mean±SD) | F1 (mean±SD) | Parametri | AIC/BIC | Komentar | Izbor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LogisticRegression | LogisticRegression | 0.833852 | 0.109178 | 0.834 ± 0.007 | 0.919 ± 0.001 | 0.061 ± 0.006 | 0.995 ± 0.001 | 0.509 ± 0.032 | 0.923 ± 0.000 | 0.109 ± 0.009 | {'C': 1.0, 'class_weight': None, 'dual': False... | AIC=109856.5, BIC=110367.0 | | DA |
| 1 | LinearSVC | LinearSVC | 0.833827 | 0.322538 | 0.834 ± 0.007 | 0.732 ± 0.003 | 0.790 ± 0.012 | 0.726 ± 0.003 | 0.203 ± 0.003 | 0.975 ± 0.001 | 0.323 ± 0.005 | {'C': 0.8, 'class_weight': 'balanced', 'dual':... | — | | DA |
| 2 | SGDClassifier | SGDClassifier | 0.832024 | 0.320538 | 0.832 ± 0.006 | 0.729 ± 0.018 | 0.790 ± 0.024 | 0.723 ± 0.021 | 0.201 ± 0.009 | 0.975 ± 0.002 | 0.321 ± 0.010 | {'alpha': 0.0001, 'average': False, 'class_wei... | — | | DA |
| 3 | ExtraTrees | ExtraTrees | 0.826269 | 0.329297 | 0.826 ± 0.007 | 0.756 ± 0.003 | 0.740 ± 0.010 | 0.758 ± 0.003 | 0.212 ± 0.004 | 0.971 ± 0.001 | 0.329 ± 0.006 | {'bootstrap': False, 'ccp_alpha': 0.0, 'class_... | — | | |
| 4 | RandomForest | RandomForest | 0.825708 | 0.338902 | 0.826 ± 0.007 | 0.781 ± 0.004 | 0.695 ± 0.011 | 0.788 ± 0.003 | 0.224 ± 0.005 | 0.967 ± 0.001 | 0.339 ± 0.006 | {'bootstrap': True, 'ccp_alpha': 0.0, 'class_w... | — | | |


## 2.3 Graphical presentation of the results

Below we plot a comparison of the models for:
- **AUC (mean ± SD)**
- **F1 (mean ± SD)**


In [13]:
def plot_metric_bar(metric_key, title, filename):
    means = []
    sds = []
    names = []
    for name in models.keys():
        means.append(summaries[name][f"{metric_key}_mean"])
        sds.append(summaries[name][f"{metric_key}_sd"])
        names.append(name)

    order = np.argsort(means)[::-1]
    names = [names[i] for i in order]
    means = [means[i] for i in order]
    sds = [sds[i] for i in order]

    plt.figure()
    plt.bar(names, means, yerr=sds, capsize=4)
    plt.title(title)
    plt.ylabel(metric_key)
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    if SAVE_FIGS:
        out = Path(FIG_DIR) / filename
        plt.savefig(out, dpi=200)
        plt.close()
        return str(out)
    else:
        plt.show()
        return None

auc_fig = plot_metric_bar("AUC", "Model comparison (10-fold CV): AUC", "cv_auc.png")
f1_fig  = plot_metric_bar("F1",  "Model comparison (10-fold CV): F1",  "cv_f1.png")
auc_fig, f1_fig


('figures\\2_classification\\cv_auc.png',
 'figures\\2_classification\\cv_f1.png')

## Exporting the results

- table: `2_3_cv_tabela_klasifikacija.csv`
- figures: `figures/2_classification/cv_auc.png`, `figures/2_classification/cv_f1.png`


In [14]:
OUT_TABLE = "2_3_cv_tabela_klasifikacija.csv"
cv_table.to_csv(OUT_TABLE, index=False)
OUT_TABLE


'2_3_cv_tabela_klasifikacija.csv'