# Classification (Part 2): Building and Evaluating Models

**Target (Y):** `Heart_Disease` (binary: `Yes/No`)

This notebook covers the following items:
- **2.1** Split into a training (80%) and a test set (20%) (*stratified*)
- **2.2** Build **at least 5** different models (**including logistic regression**) with **explicitly set hyperparameters**
- **2.3** **10-fold stratified** cross-validation on the training set, with the metrics:
  - **AUC, Accuracy, Sensitivity, Specificity, PPV, NPV, F1**
- Results: **table (mean ± SD)** + **plots (AUC and F1)**

> Compatibility fix: in some versions of `scikit-learn`, `penalty="none"` is invalid.  
> We therefore use `penalty=None` (no regularization) and include a built-in fallback.


## 0) Settings

In [1]:
CSV_PATH = "CVD_cleaned.csv"
TARGET = "Heart_Disease"

TEST_SIZE = 0.20
RANDOM_STATE = 42
N_SPLITS = 10

SAVE_FIGS = True
FIG_DIR = "figures/2_classification"


## 0.1 Package versions (to check compatibility)

In [2]:
import sklearn, pandas as pd, numpy as np
print("scikit-learn:", sklearn.__version__)
print("pandas:", pd.__version__)
print("numpy:", np.__version__)




scikit-learn: 1.6.1
pandas: 2.2.2
numpy: 1.26.4


### (Optional) If something is missing from your environment

In a **Jupyter** cell you can run the following (and then **restart the kernel**):

```python
import sys
!{sys.executable} -m pip install -U scikit-learn pandas numpy matplotlib
```

In **CMD/PowerShell** (inside the correct venv/conda environment):
- `pip install -U scikit-learn pandas numpy matplotlib`


## 1) Loading the data

In [3]:
df = pd.read_csv(CSV_PATH)
df.shape, df.columns.tolist()


((308854, 19),
 ['General_Health',
  'Checkup',
  'Exercise',
  'Heart_Disease',
  'Skin_Cancer',
  'Other_Cancer',
  'Depression',
  'Diabetes',
  'Arthritis',
  'Sex',
  'Age_Category',
  'Height_(cm)',
  'Weight_(kg)',
  'BMI',
  'Smoking_History',
  'Alcohol_Consumption',
  'Fruit_Consumption',
  'Green_Vegetables_Consumption',
  'FriedPotato_Consumption'])

In [4]:
# Target variable: map Yes/No -> 1/0
y_raw = df[TARGET].astype(str)
y = y_raw.map({"No": 0, "Yes": 1})

X = df.drop(columns=[TARGET])

# "\n" keeps the line break inside the string literal.
print("Classes:\n", y.value_counts())
print("Missing values in y:", y.isna().sum())


## 2.1 Split into training (80%) and test (20%) sets

Because the data are not time-dependent, we use a **random split**.
For classification we pass **stratify=y** so that both parts keep the original class ratio.


In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=y
)

print("Train:", X_train.shape, " Test:", X_test.shape)
print("Share of positives (Yes=1) - train:", y_train.mean().round(4), " test:", y_test.mean().round(4))
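
The effect of `stratify` in the 80/20 split can be checked on a toy example. The sketch below uses synthetic labels with 10% positives (made-up data, not the CVD dataset) and compares a stratified with a plain random split; all `*_demo`, `*_s` and `*_r` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (10% positives); purely illustrative data.
rng = np.random.default_rng(0)
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = rng.normal(size=(len(y_demo), 3))

# Stratified split: class proportions are preserved in both parts.
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)
# Plain random split: proportions can drift between the two parts.
_, _, y_tr_r, y_te_r = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

print("stratified   positive rate (train, test):", y_tr_s.mean(), y_te_s.mean())
print("unstratified positive rate (train, test):", y_tr_r.mean(), y_te_r.mean())
```

With a rare positive class, the stratified variant keeps the test-set prevalence equal to the training prevalence, which makes prevalence-sensitive metrics such as PPV and NPV comparable between the two sets.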

## 2.2 Preprocessing (no leakage)

- Numeric: `StandardScaler`
- Categorical: `OneHotEncoder`

We use `ColumnTransformer` + `Pipeline` so that the preprocessing is refit inside every CV fold, on that fold's training part only.


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

num_cols = X_train.select_dtypes(include="number").columns.tolist()
cat_cols = [c for c in X_train.columns if c not in num_cols]

# OneHotEncoder: compatibility (older sklearn versions use sparse= instead of sparse_output=)
try:
    ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=True)
except TypeError:
    ohe = OneHotEncoder(handle_unknown="ignore", sparse=True)

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", ohe, cat_cols),
    ],
    remainder="drop"
)

print("Numeric:", len(num_cols), " Categorical:", len(cat_cols))


## 2.2 Models (5 algorithms), no default hyperparameters

Models:
1. **LogisticRegression** (required), *compatible with different sklearn versions*
2. **SGDClassifier (log-loss)**
3. **LinearSVC**
4. **RandomForest**
5. **ExtraTrees**
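
One way to verify that a model really departs from the defaults is to compare its parameters with those of a freshly constructed instance of the same class. The helper `non_default_params` below is a hypothetical illustration, not part of this notebook or of scikit-learn. It assumes the estimator class can be constructed without arguments.

```python
from sklearn.linear_model import LogisticRegression


def non_default_params(estimator):
    """Return the hyperparameters whose values differ from the class defaults.

    Hypothetical helper for illustration only.
    """
    defaults = type(estimator)().get_params()  # parameters of a default instance
    current = estimator.get_params()
    return {k: v for k, v in current.items() if defaults.get(k) != v}


demo_est = LogisticRegression(C=0.7, max_iter=500, tol=1e-5)
changed = non_default_params(demo_est)
print(changed)
```

Applied to each entry of a model dictionary, such a check gives a compact list of the tuned hyperparameters for the report.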


In [None]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier


In [None]:
models = {}

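# 1) Logistic Regression (no regularization): penalty=None; non-default: max_iter, tol
# scikit-learn validates hyperparameters in fit(), not in the constructor, so a
# try/except around the constructor never reaches the fallback. We branch on the
# installed version instead (penalty=None is accepted from scikit-learn 1.2 on).
sk_version = tuple(int(part) for part in sklearn.__version__.split(".")[:2])
if sk_version >= (1, 2):
    models["LogisticRegression"] = LogisticRegression(
        penalty=None,
        solver="lbfgs",
        max_iter=500,
        tol=1e-5,
    )
else:
    # Fallback: L2 with a non-default C
    print("Fallback for LogisticRegression: scikit-learn", sklearn.__version__, "does not accept penalty=None")
    models["LogisticRegression"] = LogisticRegression(
        penalty="l2",
        C=0.7,
        solver="lbfgs",
        max_iter=500,
        tol=1e-5,
    )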

# 2) SGD (log-loss), non-default hyperparameters
models["SGDClassifier"] = SGDClassifier(
    loss="log_loss",
    penalty="elasticnet",
    alpha=1e-4,
    l1_ratio=0.15,
    max_iter=2000,
    tol=1e-3,
    class_weight="balanced",
    random_state=RANDOM_STATE
)

# 3) Linear SVC, non-default C, balanced class weights
models["LinearSVC"] = LinearSVC(
    C=0.8,
    class_weight="balanced",
    max_iter=5000,
    random_state=RANDOM_STATE
)

# 4) Random Forest, non-default hyperparameters
models["RandomForest"] = RandomForestClassifier(
    n_estimators=250,
    max_depth=14,
    min_samples_split=4,
    min_samples_leaf=2,
    max_features="sqrt",
    class_weight="balanced_subsample",
    n_jobs=-1,
    random_state=RANDOM_STATE
)

# 5) Extra Trees, non-default hyperparameters
models["ExtraTrees"] = ExtraTreesClassifier(
    n_estimators=400,
    max_depth=16,
    min_samples_split=4,
    min_samples_leaf=2,
    max_features="sqrt",
    class_weight="balanced",
    n_jobs=-1,
    random_state=RANDOM_STATE
)

list(models.keys())


## 2.3 10-fold CV + metrics

Metrics:
- **AUC**
- **Accuracy**
- **Sensitivity** (TPR/recall for the positive class)
- **Specificity** (TNR)
- **PPV** (precision)
- **NPV**
- **F1**
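
As a worked example of the threshold-based metrics listed above, the sketch below computes them from a small hand-made confusion matrix (ten invented labels chosen so that TP=3, FN=1, FP=2, TN=4). AUC is left out because it needs scores rather than hard predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Ten hand-made labels, chosen so that TP=3, FN=1, FP=2, TN=4.
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

# labels=[0, 1] fixes the layout to [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)                        # 3 / 4
specificity = tn / (tn + fp)                        # 4 / 6
ppv = tp / (tp + fp)                                # 3 / 5
npv = tn / (tn + fn)                                # 4 / 5
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 7 / 10
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)    # harmonic mean of PPV and sensitivity

print(sensitivity, specificity, ppv, npv, accuracy, f1)
```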


In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, accuracy_score, recall_score, precision_score, f1_score, confusion_matrix
import matplotlib.pyplot as plt
from pathlib import Path

def get_score(model, X):
    # For AUC: use predict_proba if available, otherwise decision_function
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    if hasattr(model, "decision_function"):
        return model.decision_function(X)
    return model.predict(X)

def compute_metrics(y_true, y_pred, y_score):
    # labels=[0, 1] guarantees a 2x2 matrix even if a fold contains a single class
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    specificity = tn / (tn + fp) if (tn + fp) > 0 else np.nan
    npv = tn / (tn + fn) if (tn + fn) > 0 else np.nan

    return {
        "AUC": roc_auc_score(y_true, y_score) if len(np.unique(y_true)) == 2 else np.nan,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Sensitivity": recall_score(y_true, y_pred, pos_label=1),
        "Specificity": specificity,
        "PPV": precision_score(y_true, y_pred, pos_label=1, zero_division=0),
        "NPV": npv,
        "F1": f1_score(y_true, y_pred, pos_label=1, zero_division=0),
    }

def cv_evaluate(estimator, X_train, y_train, preprocess, n_splits=10, random_state=42):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    fold_metrics = {m: [] for m in ["AUC","Accuracy","Sensitivity","Specificity","PPV","NPV","F1"]}

    for tr_idx, va_idx in skf.split(X_train, y_train):
        X_tr, X_va = X_train.iloc[tr_idx], X_train.iloc[va_idx]
        y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]

        pipe = Pipeline(steps=[("preprocess", preprocess), ("model", estimator)])
        pipe.fit(X_tr, y_tr)

        y_pred = pipe.predict(X_va)
        y_score = get_score(pipe, X_va)

        mets = compute_metrics(y_va, y_pred, y_score)
        for k, v in mets.items():
            fold_metrics[k].append(v)

    summary = {}
    for k, vals in fold_metrics.items():
        vals = np.array(vals, dtype=float)
        summary[f"{k}_mean"] = float(np.nanmean(vals))
        summary[f"{k}_sd"] = float(np.nanstd(vals, ddof=1))
    return fold_metrics, summary

if SAVE_FIGS:
    Path(FIG_DIR).mkdir(parents=True, exist_ok=True)
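
`get_score` falls back to `decision_function` for models without `predict_proba` (here `LinearSVC`). That is acceptable for AUC because AUC depends only on the ranking of the scores, not on their scale. The sketch below illustrates this on synthetic scores; all names and data are invented for the example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_rank_demo = rng.integers(0, 2, size=200)
# Unbounded scores, similar in spirit to decision_function output.
margin = rng.normal(size=200) + 1.5 * y_rank_demo
# The same ranking squashed into (0, 1) by a strictly increasing function.
prob_like = 1.0 / (1.0 + np.exp(-margin))

auc_margin = roc_auc_score(y_rank_demo, margin)
auc_prob = roc_auc_score(y_rank_demo, prob_like)
print(auc_margin, auc_prob)
```

Both calls give the same AUC, so margins and probabilities can sit side by side in one AUC comparison. The squashed values are not calibrated probabilities, though, and should not be read as such.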


In [None]:
all_fold_metrics = {}
summaries = {}

for name, est in models.items():
    print("Evaluating:", name)
    folds, summary = cv_evaluate(est, X_train, y_train, preprocess, n_splits=N_SPLITS, random_state=RANDOM_STATE)
    all_fold_metrics[name] = folds
    summaries[name] = summary

summaries
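
As a possible cross-check of the manual loop, the same seven metrics can be obtained with scikit-learn's `cross_validate` and a dictionary of scorers. This is a sketch on synthetic data, not the notebook's method: `X_syn`, `y_syn`, `pipe_syn` and the single `StandardScaler` step are stand-ins for `X_train`, `y_train` and the real preprocessing. Specificity and NPV are expressed as recall and precision of the negative class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (about 20% positives); illustrative only.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

scoring = {
    "AUC": "roc_auc",
    "Accuracy": "accuracy",
    "Sensitivity": make_scorer(recall_score, pos_label=1),
    "Specificity": make_scorer(recall_score, pos_label=0),  # recall of the negative class
    "PPV": make_scorer(precision_score, pos_label=1, zero_division=0),
    "NPV": make_scorer(precision_score, pos_label=0, zero_division=0),  # precision of the negative class
    "F1": "f1",
}

pipe_syn = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression(max_iter=500))])
cv_syn = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
res = cross_validate(pipe_syn, X_syn, y_syn, cv=cv_syn, scoring=scoring)

for metric_name in scoring:
    vals = res[f"test_{metric_name}"]
    print(f"{metric_name}: {vals.mean():.3f} ± {vals.std(ddof=1):.3f}")
```

`cross_validate` clones the pipeline for every fold, so no fitted state is shared between folds.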


## 2.3 Summary table (mean ± SD)

- The table reports the required metrics as **mean ± SD** over the 10 folds.
- We also add an **AIC/BIC** column (computed only when the logistic regression is unregularized).


In [None]:
def fmt(mean, sd, nd=3):
    return f"{mean:.{nd}f} ± {sd:.{nd}f}"

rows = []
for name, s in summaries.items():
    rows.append({
        "Model": name,
        "Tip": name,
        # numeric means (for sorting)
        "AUC_mean": s["AUC_mean"],
        "F1_mean": s["F1_mean"],

        # display strings for the report
        "AUC (mean±SD)": fmt(s["AUC_mean"], s["AUC_sd"], 3),
        "Accuracy (mean±SD)": fmt(s["Accuracy_mean"], s["Accuracy_sd"], 3),
        "Sensitivity (mean±SD)": fmt(s["Sensitivity_mean"], s["Sensitivity_sd"], 3),
        "Specificity (mean±SD)": fmt(s["Specificity_mean"], s["Specificity_sd"], 3),
        "PPV (mean±SD)": fmt(s["PPV_mean"], s["PPV_sd"], 3),
        "NPV (mean±SD)": fmt(s["NPV_mean"], s["NPV_sd"], 3),
        "F1 (mean±SD)": fmt(s["F1_mean"], s["F1_sd"], 3),

        "Parametri": str(models[name].get_params()),
        "AIC/BIC": "—",
        "Komentar": "",
        "Izbor": ""
    })

cv_table = pd.DataFrame(rows)

# Top 3 by AUC_mean
top3 = cv_table.sort_values("AUC_mean", ascending=False).head(3)["Model"].tolist()
cv_table["Izbor"] = cv_table["Model"].apply(lambda m: "DA" if m in top3 else "")

# Sort by AUC_mean
cv_table = cv_table.sort_values("AUC_mean", ascending=False).reset_index(drop=True)
cv_table


## (Optional) AIC/BIC for logistic regression

AIC/BIC are meaningful for **unregularized logistic regression** (MLE), because both are defined from the maximized log-likelihood.
If the installed `sklearn` version forces us onto the regularized variant, we leave the AIC/BIC cell at its placeholder value.
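
For reference, the two criteria are AIC = 2k − 2·ln L and BIC = k·ln(n) − 2·ln L, where ln L is the log-likelihood, k the number of fitted parameters and n the number of observations. The sketch below evaluates them on four hand-made predictions. The probabilities and `k_params = 2` are assumptions for illustration only.

```python
import numpy as np

# Hand-made example: 4 observations and predicted probabilities of the positive class.
y_obs = np.array([1, 0, 1, 0])
p_hat = np.array([0.8, 0.2, 0.6, 0.4])
k_params = 2            # assumed: one coefficient plus the intercept
n_obs = len(y_obs)

# Bernoulli log-likelihood of the observed labels under the predicted probabilities.
log_lik = float(np.sum(y_obs * np.log(p_hat) + (1 - y_obs) * np.log(1 - p_hat)))

aic = 2 * k_params - 2 * log_lik
bic = k_params * np.log(n_obs) - 2 * log_lik
print(log_lik, aic, bic)
```

Lower values indicate a better trade-off between fit and complexity. The values are comparable only between models fitted by maximum likelihood on the same data.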


In [None]:
from sklearn.metrics import log_loss

lr_params = models["LogisticRegression"].get_params()
pen = lr_params.get("penalty", None)

if pen is None:
    lr_pipe = Pipeline(steps=[("preprocess", preprocess), ("model", models["LogisticRegression"])])
    lr_pipe.fit(X_train, y_train)

    p = lr_pipe.predict_proba(X_train)[:, 1]
    eps = 1e-15
    p_clip = np.clip(p, eps, 1-eps)

    ll = float(np.sum(y_train*np.log(p_clip) + (1-y_train)*np.log(1-p_clip)))

    feat_names = lr_pipe.named_steps["preprocess"].get_feature_names_out()
    k = len(feat_names) + 1  # intercept
    n = len(y_train)

    AIC = 2*k - 2*ll
    BIC = k*np.log(n) - 2*ll

    cv_table.loc[cv_table["Model"]=="LogisticRegression","AIC/BIC"] = f"AIC={AIC:.1f}, BIC={BIC:.1f}"
    print("AIC/BIC computed.")
else:
    print(f"AIC/BIC skipped (penalty={pen!r}).")

cv_table


## 2.3 Plots of the results (AUC and F1)

We plot a comparison of the models with error bars (**SD**).


In [None]:
def plot_metric_bar(metric_key, title, filename):
    means = []
    sds = []
    names = []

    for name in models.keys():
        means.append(summaries[name][f"{metric_key}_mean"])
        sds.append(summaries[name][f"{metric_key}_sd"])
        names.append(name)

    order = np.argsort(means)[::-1]
    names = [names[i] for i in order]
    means = [means[i] for i in order]
    sds = [sds[i] for i in order]

    plt.figure()
    plt.bar(names, means, yerr=sds, capsize=4)
    plt.title(title)
    plt.ylabel(metric_key)
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()

    if SAVE_FIGS:
        out = Path(FIG_DIR) / filename
        plt.savefig(out, dpi=200)
        plt.close()
        return str(out)
    else:
        plt.show()
        return None

auc_fig = plot_metric_bar("AUC", "Model comparison (10-fold CV): AUC", "cv_auc.png")
f1_fig  = plot_metric_bar("F1",  "Model comparison (10-fold CV): F1",  "cv_f1.png")
auc_fig, f1_fig


## Exporting the results

In [None]:
OUT_TABLE = "2_3_cv_tabela_klasifikacija.csv"
cv_table.to_csv(OUT_TABLE, index=False)
OUT_TABLE
