# Project Description

## Motivation


<i><b>For Jan</b>: Insert business value</i>

## Data Source

<i><b>For Jan</b>: Insert write up</i>

Note: https://www.sciencedirect.com/science/article/abs/pii/S0379073824001944

## Main Problem

<i><b>For Jan</b>: Insert main problem</i>

Sample: What models can be recommended that provides the highest accuracy depending on the resolution level?

## Limitations

In the study "YHP: Y-chromosome Haplogroup Predictor for predicting male lineages based on Y-STRs", the researchers classified the different haplogroups into 18 resolutions, wherein each resolution was used to train and test the different machine learning models. Grouping the haplogroups into resolutions requires further research to ensure correctness of classification. 

With this in mind, this study no longer classified the haplogroups into resolution. Instead, the entire data set was utilized in training and testing machine learning models.

# Methodology

Step 1. Identify the Business Problem

Step 2. Identify the Machine Learning Task

Step 3. Identify Key Evaluation Metrics

Step 4. Build and Test Machine Learning Models

## 1. Identify the Business Problem

<i><b>For Jan</b>: Rephrase motivation and main problem</i>

## 2. Identify the Machine Learning Task

What will the machine learning model do?
- Goal is to predict the class label (i.e. haplogroup) choice from a predefined list of states (i.e. 27 Y-STRs)

Classification Problem
- Input: Y-STRs (e.g Column DYS576, Column DYS627)
- Output: Haplogroups (i.e. Column haplogroup)

Since this is a classification problem, the following models will be utilized.
1. KNN
2. LDA
3. Gaussian Naive Bayes
4. Decision Tree
5. Random Forest
6. Gradient Boosting

For KNN, scaling will be applied during the data preprocessing to help with faster convergence, equal feature contribution, and improved performance [2][3].

Note that Logistic Regression (L1, L2) will not be used because of the assumption of linearity between the dependent variable and the independent variables [4]. Given that the dataset has overlapping classes as seen in 4.2 EDA, it will be difficult to establish the linearity between the target and the features.

SVM will also not be used because the dataset has overlapping classes [5]. As an example, plotting two of the features (i.e. DYS627 and DYS576) show overlaps between the four haplogroups (i.e. R1a1a1b2a2, O2a2b1a1a1, O2a2a1, O2a2b1a2a1) as seen in 4.2 EDA

## 3. Identify Key Evaluation Metrics

<i><b>For Jan</b>: What evaluation metric will we use? If we will use Accuracy, explain why we will use Accuracy as the evalutation metric.

We also need to look for any industry benchmarks on Accuracy. Otherwise, we can proceed to using PCC.</i>

Evaluation Metrics: Classification
- Accuracy: use when the goal is to minimize the overall error state
- Precision: use when the cost of false positives is high
- Recall: use when the cost of false negatives is high
- F1-score: use if you want to optimize precision and recall at the same time

### PCC for Benchmark

## 4. Build and Test Machine Learning Models

In [1]:
import numpy as np
import pandas as pd
import math
import time
import re
import seaborn as sns
import matplotlib.pyplot as plt
from collections import defaultdict
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split, ShuffleSplit, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import classification_report, f1_score, accuracy_score, ConfusionMatrixDisplay, confusion_matrix, matthews_corrcoef, balanced_accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from collections import Counter
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

### Loading Data

In [2]:
# Step 1. Load dataset
df = pd.read_excel('Supplemental Processed Data Set.xlsx', sheet_name='S Table 1', skiprows=1)
# Step 2. Fill NaN values
df = df.ffill()
# Step 3. Split haplotype into separate columns
df = pd.concat([df, df['haplotype'].str.replace('[', '').str.replace(']', '').str.split(',', expand=True)], axis=1)
YSTRs = {0: "DYS576", 1: "DYS389 I", 2: "DYS635", 3: "DYS389 II", 4: "DYS627", 5: "DYS460", 6: "DYS458",
                 7: "DYS19", 8: "Y-GATA-H4", 9: "DYS448", 10: "DYS391", 11: "DYS456", 12: "DYS390", 13: "DYS438", 
                 14: "DYS392", 15: "DYS518", 16: "DYS570", 17: "DYS437", 18: "DYS385a", 19: "DYS385b", 20: "DYS449", 
                 21: "DYS393", 22: "DYS439", 23: "DYS481", 24: "DYS576a", 25: "DYS576b", 26: "DYS533"
}

df = df.rename(columns=YSTRs)
df = df.drop(columns=['haplotype'])
df

Unnamed: 0,haplogroup,number_of_haplotypes,total_frequency,sampleID,population,frequency,DYS576,DYS389 I,DYS635,DYS389 II,...,DYS437,DYS385a,DYS385b,DYS449,DYS393,DYS439,DYS481,DYS576a,DYS576b,DYS533
0,C2b1a1a,4.0,1.0,HLM100,Hulunbuir[Mongolian],1.0,19.0,14.0,22.0,31.0,...,14.0,11.0,19.0,30.0,14.0,12.0,24.0,36.0,39.0,12.0
1,C2b1a1a,4.0,1.0,HHM158,Hohhot[Mongolian],1.0,19.0,14.0,22.0,30.0,...,14.0,11.0,17.0,30.0,14.0,14.0,24.0,39.0,39.0,12.0
2,C2b1a1a,4.0,1.0,ODM030,Ordos[Mongolian],1.0,18.0,14.0,21.0,31.0,...,14.0,11.0,19.0,30.0,14.0,12.0,23.0,37.0,38.0,12.0
3,C2b1a1a,4.0,1.0,HLM178,Hulunbuir[Mongolian],1.0,19.0,14.0,22.0,30.0,...,14.0,11.0,17.0,30.0,14.0,14.0,24.0,39.0,39.0,12.0
4,O2a2b1a1a1a4a1,6.0,1.0,HHM088,Hohhot[Mongolian],1.0,18.0,12.0,20.0,29.0,...,16.0,14.0,18.0,32.0,11.0,13.0,23.0,35.0,37.0,11.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4059,O2a1c1a1a1,14.0,1.0,HaiN153(Han),Han,1.0,20.0,12.0,20.0,28.0,...,14.0,13.0,13.0,31.0,13.0,11.0,25.0,37.0,40.0,11.0
4060,O2a1c1a1a1,14.0,1.0,GD-16(Han),Han,1.0,18.0,12.0,21.0,28.0,...,14.0,12.0,19.0,31.0,12.0,12.0,28.0,36.0,38.0,11.0
4061,O2a1c1a1a1,14.0,1.0,JX-82(Han),Han,1.0,19.0,12.0,21.0,28.0,...,14.0,12.0,19.0,33.0,12.0,12.0,26.0,36.0,39.0,11.0
4062,O2a1c1a1a1,14.0,1.0,HaiN139(Han),Han,1.0,16.0,14.0,21.0,29.0,...,14.0,12.0,18.0,29.0,14.0,12.0,23.0,37.0,39.0,11.0


FINAL MODEL


Replaced StratifiedShuffleSplit with Stratified K-Fold

In [5]:
# ======================================================
# 1. PREPARE DATA
# ======================================================
hap_col = df.columns[0]
df["major_haplogroup"] = df[hap_col].str.extract(r"^([A-Z])")

X = df.iloc[:, 6:].apply(pd.to_numeric, errors="coerce")
y_major = df["major_haplogroup"]

# ======================================================
# 2. SPLIT DATA (Singleton-safe)
# ======================================================
mask = y_major.value_counts()[y_major].values > 1
X_filtered, y_filtered = X[mask], y_major[mask]

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
for train_idx, test_idx in sss.split(X_filtered, y_filtered):
    X_train, X_test = X_filtered.iloc[train_idx], X_filtered.iloc[test_idx]
    y_train, y_test = y_filtered.iloc[train_idx], y_filtered.iloc[test_idx]

# Add singleton classes to training set
X_train = pd.concat([X_train, X[~mask]])
y_train = pd.concat([y_train, y_major[~mask]])

print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")

# ======================================================
# 3. DEFINE MODEL + PARAMETER GRID
# ======================================================
rf_base = RandomForestClassifier(class_weight="balanced", n_jobs=-1, random_state=42)

param_grid = {
    "n_estimators": [200, 300, 500],
    "max_depth": [6, 8, 10, None],
    "min_samples_split": [5, 10, 20],
    "min_samples_leaf": [1, 2, 5, 10],
    "max_features": ["sqrt", "log2"],
}

# ======================================================
# 4. STAGE 1: HYPERPARAMETER TUNING (once)
# ======================================================
print("\n===== TUNING RANDOM FOREST (MAJOR HAPLOGROUPS) =====")

grid = GridSearchCV(
    estimator=rf_base,
    param_grid=param_grid,
    scoring="f1_macro",
    cv=3,
    n_jobs=-1,
    verbose=1,
)
grid.fit(X_train, y_train)

best_params = grid.best_params_
print(f"\nBest Hyperparameters: {best_params}")

best_rf = RandomForestClassifier(**best_params, class_weight="balanced", n_jobs=-1, random_state=42)

Train size: 3048, Test size: 1016

===== TUNING RANDOM FOREST (MAJOR HAPLOGROUPS) =====
Fitting 3 folds for each of 288 candidates, totalling 864 fits

Best Hyperparameters: {'max_depth': 8, 'max_features': 'sqrt', 'min_samples_leaf': 5, 'min_samples_split': 5, 'n_estimators': 200}


In [6]:
# ======================================================
# 5. STAGE 2: STRATIFIED K-FOLD EVALUATION (major haplogroups)
# ======================================================
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_results = []

for fold, (train_idx, test_idx) in enumerate(skf.split(X_train, y_train), 1):
    X_tr, X_te = X_train.iloc[train_idx], X_train.iloc[test_idx]
    y_tr, y_te = y_train.iloc[train_idx], y_train.iloc[test_idx]

    best_rf.fit(X_tr, y_tr)
    y_pred_train = best_rf.predict(X_tr)
    y_pred_test = best_rf.predict(X_te)

    # Compute metrics
    f1_train = f1_score(y_tr, y_pred_train, average="macro", zero_division=0)
    f1_test = f1_score(y_te, y_pred_test, average="macro", zero_division=0)
    acc_train = accuracy_score(y_tr, y_pred_train)
    acc_test = accuracy_score(y_te, y_pred_test)

    fold_results.append({
        "Fold": fold,
        "Train_Acc": acc_train,
        "Test_Acc": acc_test,
        "Train_F1_macro": f1_train,
        "Test_F1_macro": f1_test,
        "Delta_F1_macro": f1_train - f1_test,
    })

major_results_df = pd.DataFrame(fold_results)
print("\n===== STRATIFIED K-FOLD RESULTS (MAJOR HAPLOGROUPS) =====")
display(major_results_df)
print(f"\nMean F1_macro (Test): {major_results_df['Test_F1_macro'].mean():.3f}")

# ======================================================
# 6. SUBCLADE COMPARISON (PER MAJOR HAPLOGROUP)
# ======================================================
sub_results = []

for hap in sorted(df["major_haplogroup"].dropna().unique()):
    sub_df = df[df["major_haplogroup"] == hap].copy()
    sub_X = sub_df.iloc[:, 6:].apply(pd.to_numeric, errors="coerce")
    sub_y = sub_df[hap_col]

    valid_counts = sub_y.value_counts()
    valid_classes = valid_counts[valid_counts > 1].index
    sub_X = sub_X[sub_y.isin(valid_classes)]
    sub_y = sub_y[sub_y.isin(valid_classes)]

    if sub_y.nunique() < 3 or len(sub_y) < 20:
        print(f"\nSkipping {hap}: only {len(sub_y)} valid samples, {sub_y.nunique()} subclades.")
        continue

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    fold_metrics = []

    for fold, (train_idx, test_idx) in enumerate(skf.split(sub_X, sub_y), 1):
        X_tr, X_te = sub_X.iloc[train_idx], sub_X.iloc[test_idx]
        y_tr, y_te = sub_y.iloc[train_idx], sub_y.iloc[test_idx]

        best_rf.fit(X_tr, y_tr)
        y_pred_train = best_rf.predict(X_tr)
        y_pred_test = best_rf.predict(X_te)

        f1_train = f1_score(y_tr, y_pred_train, average="macro", zero_division=0)
        f1_test = f1_score(y_te, y_pred_test, average="macro", zero_division=0)
        acc_train = accuracy_score(y_tr, y_pred_train)
        acc_test = accuracy_score(y_te, y_pred_test)

        fold_metrics.append({
            "Clade": hap,
            "Fold": fold,
            "Train_Acc": acc_train,
            "Test_Acc": acc_test,
            "Train_F1_macro": f1_train,
            "Test_F1_macro": f1_test,
            "Delta_F1_macro": f1_train - f1_test,
            "n_classes": sub_y.nunique(),
            "n_samples": len(sub_y),
        })

    sub_results.extend(fold_metrics)

# ======================================================
# 7. SUMMARIZE RESULTS
# ======================================================
sub_results_df = pd.DataFrame(sub_results)

print("\n================= SUMMARY: MAJOR HAPLOGROUPS =================")
display(major_results_df)

print("\n================= SUMMARY: SUBCLADES =================")
display(sub_results_df.sort_values(["Clade", "Test_F1_macro"], ascending=[True, False]))

print("\nHierarchical Stratified K-Fold Evaluation Complete.")


===== STRATIFIED K-FOLD RESULTS (MAJOR HAPLOGROUPS) =====


Unnamed: 0,Fold,Train_Acc,Test_Acc,Train_F1_macro,Test_F1_macro,Delta_F1_macro
0,1,0.979081,0.97541,0.969378,0.86823,0.101148
1,2,0.983183,0.970492,0.977095,0.947657,0.029438
2,3,0.979491,0.978689,0.942868,0.899141,0.043727
3,4,0.98155,0.968801,0.981444,0.830567,0.150877
4,5,0.9836,0.967159,0.987865,0.913153,0.074712



Mean F1_macro (Test): 0.892

Skipping F: only 25 valid samples, 1 subclades.

Skipping G: only 18 valid samples, 5 subclades.

Skipping H: only 17 valid samples, 2 subclades.

Skipping I: only 12 valid samples, 1 subclades.

Skipping L: only 19 valid samples, 2 subclades.

Skipping P: only 4 valid samples, 1 subclades.



Unnamed: 0,Fold,Train_Acc,Test_Acc,Train_F1_macro,Test_F1_macro,Delta_F1_macro
0,1,0.979081,0.97541,0.969378,0.86823,0.101148
1,2,0.983183,0.970492,0.977095,0.947657,0.029438
2,3,0.979491,0.978689,0.942868,0.899141,0.043727
3,4,0.98155,0.968801,0.981444,0.830567,0.150877
4,5,0.9836,0.967159,0.987865,0.913153,0.074712





Unnamed: 0,Clade,Fold,Train_Acc,Test_Acc,Train_F1_macro,Test_F1_macro,Delta_F1_macro,n_classes,n_samples
4,C,5,0.848624,0.733945,0.812464,0.574812,0.237652,29,545
0,C,1,0.821101,0.706422,0.777088,0.530203,0.246885,29,545
1,C,2,0.78211,0.678899,0.755286,0.489479,0.265807,29,545
3,C,4,0.821101,0.623853,0.777157,0.46217,0.314987,29,545
2,C,3,0.818807,0.587156,0.784741,0.393442,0.391299,29,545
5,D,1,0.888889,0.796296,0.874669,0.534223,0.340446,21,270
9,D,5,0.902778,0.740741,0.908348,0.531627,0.376721,21,270
7,D,3,0.87963,0.777778,0.884191,0.518166,0.366025,21,270
6,D,2,0.87963,0.685185,0.838317,0.483818,0.354499,21,270
8,D,4,0.898148,0.685185,0.86208,0.403432,0.458648,21,270



Hierarchical Stratified K-Fold Evaluation Complete.


# Write-ups

Mutation-Aware Feature Engineering for Y-STR Classification

## 1. Conceptual Overview

In Y-STR haplogroup classification, each individual is represented by allelic repeat counts across multiple loci (e.g., DYS390, DYS391). However, these raw allele counts alone lack biological context â€” not all loci mutate at the same rate, and not all variation carries the same information about ancestry.

The following engineered features are designed to:

- Encode mutation rate heterogeneity among loci  
- Capture intra-haplogroup cohesion and distance to population-level centroids  
- Quantify global allele distributional statistics (mean, variance, skewness)  
- Normalize across loci for robust input to machine learning models such as Random Forests or Gradient Boosting

---

## 2. Haplogroup Centroid Features

### `l1_centroid`

For each sample $i$ belonging to haplogroup $h$, we compute:

$$
d_{L1}(i,h) = \sum_{j \in L} |x_{ij} - \mu_{hj}|
$$

where $\mu_{hj}$ is the mean allele count of haplogroup $h$ at locus $j$.

This feature measures how typical or atypical a Y-STR profile is relative to its haplogroup centroid â€” a measure of intra-clade genetic distance using the L1 norm (Manhattan distance).

- Low values indicate samples tightly clustered within the haplogroup  
- High values may indicate rare or boundary haplotypes

---

### `weighted_l1_centroid`

A variant of the above where loci are weighted according to mutation rates:

$$
d_{L1,w}(i,h) = \sum_{j \in L} w_j |x_{ij} - \mu_{hj}|
$$

Weights reflect mutation dynamics:

- Fast loci (e.g., DYS570, DYS576) mutate quickly and contribute less phylogenetic stability  
- Slow loci (e.g., DYS437, DYS389I/II) are evolutionarily conservative and have higher discriminative power

This feature incorporates region-specific mutation rates (East Asian context) to encode evolutionary priors into the model.

---

## 3. Global Allelic Distribution Features

| Feature | Definition | Interpretation |
|----------|-------------|----------------|
| `allele_sum` | $\sum_j x_{ij}$ | Proxy for total allelic load (total repeat count) |
| `allele_mean` | $\frac{1}{L} \sum_j x_{ij}$ | Average repeat count across loci |
| `allele_var` | $\text{Var}(x_{ij})$ | Dispersion of allele counts, reflecting within-sample heterogeneity |
| `allele_std` | $\sqrt{\text{Var}(x_{ij})}$ | Standard deviation, used for normalization and skew calculations |
| `allele_median` | $\text{Median}(x_{ij})$ | Robust measure of central tendency |

These features describe the overall distribution of allelic values within each STR profile.

---

## 4. Distributional Shape Descriptors

### `median_deviation`

$$
\text{median\_deviation} = \text{median} - \text{mean}
$$

Measures central tendency bias â€” positive values indicate right-skew (heavier upper tail), negative indicate left-skew.

### `skewness_proxy`

$$
\text{skewness\_proxy} = \frac{\text{mean} - \text{median}}{\text{std} + 10^{-9}}
$$

An approximation of Pearsonâ€™s second skewness coefficient, normalized by standard deviation.  
Detects asymmetric allele distributions that may correlate with unusual mutation histories.

---

## 5. Mutation-Category-Specific Variation

### `fast_var`

Variance computed only across fast-mutating loci.  
High values indicate strong intra-haplogroup microvariation.

### `slow_stability`

$$
\text{slow\_stability} = - \sum_{j \in L_{\text{slow}}} |x_{ij} - \mu_j|
$$

Quantifies stability for slow loci.  
Since deviations are negated, less negative values imply higher stability, serving as a proxy for phylogenetic conservatism.

---

## 6. Inter-Category Ratios and Dispersion Indices

### `fast_slow_ratio`

$$
R_{FS} = \frac{\overline{x}_{\text{fast}}}{\overline{x}_{\text{slow}} + \epsilon}
$$

Captures the relative expansion of fast vs. slow loci.  
High ratios suggest recent diversification or mutation accumulation.

### `heterogeneity_index`

$$
H = \frac{\text{allele\_std}}{\text{allele\_mean} + \epsilon}
$$

A normalized measure of intra-profile variability, analogous to a coefficient of variation.

---

## 7. Locus-Pair Differential Features

Certain loci (e.g., DYS385a/b, DYS389I/II) occur as duplicated regions on the Y chromosome and evolve in tandem.

| Feature | Definition | Interpretation |
|----------|-------------|----------------|
| `DYS385a_DYS385b_diff` | $DYS385a - DYS385b$ | Intra-pair divergence, useful for distinguishing subclades |
| `DYS389I_DYS389II_diff` | $DYS389I - DYS389II$ | Divergence between the two DYS389 loci, informative for deeper branching |

These localized features highlight pairwise mutational asymmetry, useful for differentiating closely related haplotypes.

---

## 8. Weighted Summary Statistics

### `weighted_mean` and `weighted_var`

Each locus is weighted by its mutation rate $w_j$:

$$
\text{weighted\_mean} = \frac{1}{L}\sum_j w_j x_{ij}
$$

$$
\text{weighted\_var} = \frac{1}{L}\sum_j w_j (x_{ij} - \text{weighted\_mean})^2
$$

These capture mutation-rate-aware global properties of each STR profile, emphasizing stable markers while accounting for variability at faster loci.

---

## 9. Normalization for Machine Learning

All continuous features are standardized using:

$$
z = \frac{x - \mu}{\sigma}
$$

via `StandardScaler`. This ensures comparability among features with different biological scales (e.g., raw counts vs. distances), improving model stability and performance.

---

## 10. Summary of Engineered Features

| Category | Features | Purpose |
|-----------|-----------|---------|
| Centroid distance | `l1_centroid`, `weighted_l1_centroid` | Within-haplogroup compactness; evolutionary weighting |
| Global stats | `allele_sum`, `allele_mean`, `allele_var`, `allele_std`, `allele_median` | Describe overall allele distribution |
| Shape metrics | `median_deviation`, `skewness_proxy` | Capture asymmetry and mutation bias |
| Mutation-specific | `fast_var`, `slow_stability`, `fast_slow_ratio` | Encode mutation-rate heterogeneity |
| Dispersion ratio | `heterogeneity_index` | Normalized variability across loci |
| Locus differentials | `DYS385a_DYS385b_diff`, `DYS389I_DYS389II_diff` | Localized mutational divergence |
| Weighted summaries | `weighted_mean`, `weighted_var` | Mutation-rate-adjusted global descriptors |
| Scaled features | All above normalized | Model robustness and comparability |

---

## 11. Why These Features Matter

This feature set integrates population genetics theory with statistical feature design:

- Reduces within-class variance while preserving between-class separability  
- Incorporates biologically meaningful mutation-rate priors  
- Provides robustness to missing or noisy STRs via aggregation  
- Produces interpretable predictors (e.g., centroid distance as biological typicality)

Overall, this design bridges population genetics and machine learning, yielding a biologically interpretable feature space for Y-STR haplogroup classification.


Model Evaluation Metrics for Y-STR Haplogroup Classification

## 1. Overview

To evaluate the performance of classifiers trained on Y-STR data (e.g., Random Forest, LightGBM), we employ a suite of **multi-class performance metrics** designed to capture accuracy, balance, and generalization across unevenly distributed haplogroups.

The dataset exhibits **hierarchical class structure** (major clades â†’ subclades) and **class imbalance** (some haplogroups have far fewer samples).  
Therefore, standard accuracy alone is insufficient â€” metrics like **macro-averaged F1**, **balanced accuracy**, and **Matthews Correlation Coefficient (MCC)** provide a more robust evaluation.

---

## 2. Accuracy

**Definition:**

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

In a multi-class context:

$$
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of samples}}
$$

Accuracy provides an overall measure of correctness, but it can be **misleading under class imbalance** â€” high accuracy can occur even if minority haplogroups are misclassified entirely.

**Interpretation:**
- High accuracy means most samples are correctly classified across all classes.
- However, accuracy does not account for per-class representation.

---

## 3. Precision, Recall, and F1-Score

These metrics are computed **per class** and then averaged. They are fundamental to understanding trade-offs between **false positives** and **false negatives**.

### Precision

$$
\text{Precision}_k = \frac{TP_k}{TP_k + FP_k}
$$

Measures how many predicted members of class $k$ are actually correct.

- High precision â†’ few false positives  
- In Y-STR terms: when the model predicts haplogroup *R1a*, itâ€™s often correct.

### Recall (Sensitivity)

$$
\text{Recall}_k = \frac{TP_k}{TP_k + FN_k}
$$

Measures how many true members of class $k$ were successfully identified.

- High recall â†’ few false negatives  
- In Y-STR terms: most true *R1a* individuals are captured by the model.

### F1-Score

The **harmonic mean** of precision and recall:

$$
F1_k = 2 \cdot \frac{\text{Precision}_k \cdot \text{Recall}_k}{\text{Precision}_k + \text{Recall}_k}
$$

The harmonic mean penalizes extreme imbalance (e.g., perfect precision but poor recall).

---

## 4. Macro-Averaged F1 (Used in This Project)

To obtain a single number summarizing performance across all haplogroups:

$$
F1_{\text{macro}} = \frac{1}{K} \sum_{k=1}^K F1_k
$$

- **Each class contributes equally**, regardless of frequency.  
- This is crucial for **imbalanced datasets**, ensuring that rare haplogroups (e.g., *G*, *L*) are weighted equally to dominant ones (e.g., *O*, *R*).

**Interpretation:**
- A high macro-F1 implies balanced performance across both common and rare haplogroups.
- A large gap between training and test macro-F1 indicates **overfitting** (model memorizing rare haplogroups).

---

## 5. Balanced Accuracy

Balanced accuracy adjusts for class imbalance by averaging recall over all classes:

$$
\text{Balanced Accuracy} = \frac{1}{K} \sum_{k=1}^{K} \frac{TP_k}{TP_k + FN_k}
$$

Equivalent to **macro-averaged recall**.  
This metric ensures that each haplogroup contributes equally to the score, even if its sample size is small.

**Interpretation:**
- High balanced accuracy indicates the model performs fairly across classes.
- Useful in biological datasets where some lineages are underrepresented.

---

## 6. Matthews Correlation Coefficient (MCC)

MCC provides a **single correlation measure** between observed and predicted classifications:

$$
MCC = \frac{TP \times TN - FP \times FN}
{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
$$

For multi-class problems, MCC generalizes to:

$$
MCC = \frac{c \times s - \sum_k p_k t_k}
{\sqrt{(c^2 - \sum_k p_k^2)(c^2 - \sum_k t_k^2)}}
$$

where:
- $c$ = total samples  
- $t_k$ = true count for class $k$  
- $p_k$ = predicted count for class $k$  
- $s$ = total number of correct predictions

**Interpretation:**
- $MCC = 1$ â†’ perfect prediction  
- $MCC = 0$ â†’ random guessing  
- $MCC < 0$ â†’ inverse or systematically wrong predictions  

MCC is robust to class imbalance and is widely used in **genomic classification tasks**.

---

## 7. Confusion Matrix

The **confusion matrix** visualizes how predictions distribute across classes.

For $K$ classes, it is a $K \times K$ matrix $C$, where:

$$
C_{ij} = \text{number of samples with true class } i \text{ predicted as } j
$$

- Diagonal entries ($C_{ii}$): correctly predicted samples  
- Off-diagonal entries: misclassifications

**Interpretation:**
- Blocks of confusion between related haplogroups (e.g., *O1a* vs *O1b*) often reflect biological proximity.
- The matrix provides a valuable tool for understanding systematic misclassification patterns.

---

## 8. Overfitting Analysis (Train vs Test)

To assess overfitting, compare train vs test macro-F1:

$$
\Delta F1 = F1_{\text{train}} - F1_{\text{test}}
$$

- Small $\Delta F1$ â†’ model generalizes well  
- Large $\Delta F1$ â†’ potential overfitting (especially in rare classes)

In your experiment, this difference is reported as:

- `RF_Overfit` = RF train F1 - test F1  
- `LGBM_Overfit` = LGBM train F1 - test F1

---

## 9. Summary Table

| Metric | Purpose | Robust to Imbalance | Interpretation |
|:--------|:---------|:--------------------|:----------------|
| Accuracy | Overall correctness | No | Sensitive to dominant classes |
| Precision | Confidence in positive predictions | Yes (per class) | Penalizes false positives |
| Recall | Completeness of detection | Yes (per class) | Penalizes false negatives |
| F1-score | Balance of precision & recall | Yes | Unified detection performance |
| Macro-F1 | Equal weighting of all classes | Yes | Key for imbalanced haplogroups |
| Balanced Accuracy | Equalized recall across classes | Yes | Stability across uneven distributions |
| MCC | Correlation between true & predicted | Yes | Most informative single metric |
| Confusion Matrix | Diagnostic visualization | â€” | Shows specific misclassifications |

---

## 10. Practical Summary

- **Macro-F1** and **MCC** are your most informative metrics for biological interpretability.  
- **Accuracy** alone is insufficient in the presence of rare haplogroups.  
- **Confusion matrices** reveal where the classifier struggles â€” often within phylogenetically close subclades.  
- **Trainâ€“test F1 gaps** quantify generalization capacity and reveal overfitting tendencies.

This combination of metrics provides a comprehensive, statistically balanced view of model performance on hierarchical, imbalanced, and biologically meaningful Y-STR datasets.
