# Project Description

## Motivation


<i><b>For Jan</b>: Insert business value</i>

## Data Source

<i><b>For Jan</b>: Insert write up</i>

Note: https://www.sciencedirect.com/science/article/abs/pii/S0379073824001944

## Main Problem

<i><b>For Jan</b>: Insert main problem</i>

Sample: What models can be recommended that provides the highest accuracy depending on the resolution level?

## Limitations

In the study "YHP: Y-chromosome Haplogroup Predictor for predicting male lineages based on Y-STRs", the researchers classified the different haplogroups into 18 resolutions, wherein each resolution was used to train and test the different machine learning models. Grouping the haplogroups into resolutions requires further research to ensure correctness of classification. 

With this in mind, this study no longer classified the haplogroups into resolution. Instead, the entire data set was utilized in training and testing machine learning models.

# Methodology

Step 1. Identify the Business Problem

Step 2. Identify the Machine Learning Task

Step 3. Identify Key Evaluation Metrics

Step 4. Build and Test Machine Learning Models

## 1. Identify the Business Problem

<i><b>For Jan</b>: Rephrase motivation and main problem</i>

## 2. Identify the Machine Learning Task

What will the machine learning model do?
- Goal is to predict the class label (i.e. haplogroup) choice from a predefined list of states (i.e. 27 Y-STRs)

Classification Problem
- Input: Y-STRs (e.g Column DYS576, Column DYS627)
- Output: Haplogroups (i.e. Column haplogroup)

Since this is a classification problem, the following models will be utilized.
1. KNN
2. LDA
3. Gaussian Naive Bayes
4. Decision Tree
5. Random Forest
6. Gradient Boosting

For KNN, scaling will be applied during the data preprocessing to help with faster convergence, equal feature contribution, and improved performance [2][3].

Note that Logistic Regression (L1, L2) will not be used because of the assumption of linearity between the dependent variable and the independent variables [4]. Given that the dataset has overlapping classes as seen in 4.2 EDA, it will be difficult to establish the linearity between the target and the features.

SVM will also not be used because the dataset has overlapping classes [5]. As an example, plotting two of the features (i.e. DYS627 and DYS576) show overlaps between the four haplogroups (i.e. R1a1a1b2a2, O2a2b1a1a1, O2a2a1, O2a2b1a2a1) as seen in 4.2 EDA

## 3. Identify Key Evaluation Metrics

<i><b>For Jan</b>: What evaluation metric will we use? If we will use Accuracy, explain why we will use Accuracy as the evalutation metric.

We also need to look for any industry benchmarks on Accuracy. Otherwise, we can proceed to using PCC.</i>

Evaluation Metrics: Classification
- Accuracy: use when the goal is to minimize the overall error state
- Precision: use when the cost of false positives is high
- Recall: use when the cost of false negatives is high
- F1-score: use if you want to optimize precision and recall at the same time

### PCC for Benchmark

## 4. Build and Test Machine Learning Models

In [1]:
import numpy as np
import pandas as pd
import math
import time
import re
import seaborn as sns
import matplotlib.pyplot as plt
from collections import defaultdict
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split, ShuffleSplit, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import classification_report, f1_score, accuracy_score, ConfusionMatrixDisplay, confusion_matrix, precision_score, recall_score, balanced_accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from collections import Counter
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

### Loading Data

In [2]:
# Step 1. Load dataset
df = pd.read_excel('Supplemental Processed Data Set.xlsx', sheet_name='S Table 1', skiprows=1)
# Step 2. Fill NaN values
df = df.ffill()
# Step 3. Split haplotype into separate columns
df = pd.concat([df, df['haplotype'].str.replace('[', '').str.replace(']', '').str.split(',', expand=True)], axis=1)
YSTRs = {0: "DYS576", 1: "DYS389 I", 2: "DYS635", 3: "DYS389 II", 4: "DYS627", 5: "DYS460", 6: "DYS458",
                 7: "DYS19", 8: "Y-GATA-H4", 9: "DYS448", 10: "DYS391", 11: "DYS456", 12: "DYS390", 13: "DYS438", 
                 14: "DYS392", 15: "DYS518", 16: "DYS570", 17: "DYS437", 18: "DYS385a", 19: "DYS385b", 20: "DYS449", 
                 21: "DYS393", 22: "DYS439", 23: "DYS481", 24: "DYS576a", 25: "DYS576b", 26: "DYS533"
}

df = df.rename(columns=YSTRs)
df = df.drop(columns=['haplotype'])
df

Unnamed: 0,haplogroup,number_of_haplotypes,total_frequency,sampleID,population,frequency,DYS576,DYS389 I,DYS635,DYS389 II,...,DYS437,DYS385a,DYS385b,DYS449,DYS393,DYS439,DYS481,DYS576a,DYS576b,DYS533
0,C2b1a1a,4.0,1.0,HLM100,Hulunbuir[Mongolian],1.0,19.0,14.0,22.0,31.0,...,14.0,11.0,19.0,30.0,14.0,12.0,24.0,36.0,39.0,12.0
1,C2b1a1a,4.0,1.0,HHM158,Hohhot[Mongolian],1.0,19.0,14.0,22.0,30.0,...,14.0,11.0,17.0,30.0,14.0,14.0,24.0,39.0,39.0,12.0
2,C2b1a1a,4.0,1.0,ODM030,Ordos[Mongolian],1.0,18.0,14.0,21.0,31.0,...,14.0,11.0,19.0,30.0,14.0,12.0,23.0,37.0,38.0,12.0
3,C2b1a1a,4.0,1.0,HLM178,Hulunbuir[Mongolian],1.0,19.0,14.0,22.0,30.0,...,14.0,11.0,17.0,30.0,14.0,14.0,24.0,39.0,39.0,12.0
4,O2a2b1a1a1a4a1,6.0,1.0,HHM088,Hohhot[Mongolian],1.0,18.0,12.0,20.0,29.0,...,16.0,14.0,18.0,32.0,11.0,13.0,23.0,35.0,37.0,11.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4059,O2a1c1a1a1,14.0,1.0,HaiN153(Han),Han,1.0,20.0,12.0,20.0,28.0,...,14.0,13.0,13.0,31.0,13.0,11.0,25.0,37.0,40.0,11.0
4060,O2a1c1a1a1,14.0,1.0,GD-16(Han),Han,1.0,18.0,12.0,21.0,28.0,...,14.0,12.0,19.0,31.0,12.0,12.0,28.0,36.0,38.0,11.0
4061,O2a1c1a1a1,14.0,1.0,JX-82(Han),Han,1.0,19.0,12.0,21.0,28.0,...,14.0,12.0,19.0,33.0,12.0,12.0,26.0,36.0,39.0,11.0
4062,O2a1c1a1a1,14.0,1.0,HaiN139(Han),Han,1.0,16.0,14.0,21.0,29.0,...,14.0,12.0,18.0,29.0,14.0,12.0,23.0,37.0,39.0,11.0


FINAL MODEL


In [3]:
# ======================================================
# 1. PREPARE DATA
# ======================================================
hap_col = df.columns[0]
df["major_haplogroup"] = df[hap_col].str.extract(r"^([A-Z])")

X = df.iloc[:, 6:].apply(pd.to_numeric, errors="coerce")
y_major = df["major_haplogroup"]

# ======================================================
# 2. SPLIT DATA (Singleton-safe)
# ======================================================
mask = y_major.value_counts()[y_major].values > 1
X_filtered, y_filtered = X[mask], y_major[mask]

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
for train_idx, test_idx in sss.split(X_filtered, y_filtered):
    X_train, X_test = X_filtered.iloc[train_idx], X_filtered.iloc[test_idx]
    y_train, y_test = y_filtered.iloc[train_idx], y_filtered.iloc[test_idx]

# Add singleton classes to training set
X_train = pd.concat([X_train, X[~mask]])
y_train = pd.concat([y_train, y_major[~mask]])

print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")

# ======================================================
# 3. DEFINE MODEL + PARAMETER GRID
# ======================================================
rf_base = RandomForestClassifier(class_weight="balanced", n_jobs=-1, random_state=42)

param_grid = {
    "n_estimators": [200, 300, 500],
    "max_depth": [6, 8, 10, None],
    "min_samples_split": [5, 10, 20],
    "min_samples_leaf": [1, 2, 5, 10],
    "max_features": ["sqrt", "log2"],
}

# ======================================================
# 4. STAGE 1: HYPERPARAMETER TUNING (Stratified K-Fold)
# ======================================================
print("\n===== TUNING RANDOM FOREST (MAJOR HAPLOGROUPS) =====")

# Define stratified K-Fold for tuning
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    estimator=rf_base,
    param_grid=param_grid,
    scoring="f1_macro",
    cv=cv_strategy,
    n_jobs=-1,
    verbose=1,
)

grid.fit(X_train, y_train)

best_params = grid.best_params_
print(f"\nBest Hyperparameters: {best_params}")

best_rf = RandomForestClassifier(
    **best_params,
    class_weight="balanced",
    n_jobs=-1,
    random_state=42,
)

Train size: 3048, Test size: 1016

===== TUNING RANDOM FOREST (MAJOR HAPLOGROUPS) =====
Fitting 5 folds for each of 288 candidates, totalling 1440 fits

Best Hyperparameters: {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 20, 'n_estimators': 200}


In [4]:
# ======================================================
# 5. STAGE 2: STRATIFIED K-FOLD EVALUATION (major haplogroups)
# ======================================================
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_results = []

for fold, (train_idx, test_idx) in enumerate(skf.split(X_train, y_train), 1):
    X_tr, X_te = X_train.iloc[train_idx], X_train.iloc[test_idx]
    y_tr, y_te = y_train.iloc[train_idx], y_train.iloc[test_idx]

    best_rf.fit(X_tr, y_tr)
    y_pred_train = best_rf.predict(X_tr)
    y_pred_test = best_rf.predict(X_te)

    # Compute metrics
    f1_train = f1_score(y_tr, y_pred_train, average="macro", zero_division=0)
    f1_test = f1_score(y_te, y_pred_test, average="macro", zero_division=0)
    acc_train = accuracy_score(y_tr, y_pred_train)
    acc_test = accuracy_score(y_te, y_pred_test)

    fold_results.append({
        "Fold": fold,
        "Train_Acc": round(acc_train, 4),
        "Test_Acc": round(acc_test, 4),
        "Train_F1_macro": round(f1_train, 4),
        "Test_F1_macro": round(f1_test, 4),
        "Delta_F1_macro": round(f1_train - f1_test, 4),
    })

major_results_df = pd.DataFrame(fold_results)
print("\n===== STRATIFIED K-FOLD RESULTS (MAJOR HAPLOGROUPS) =====")
display(major_results_df)
print(f"\nMean F1_macro (Test): {major_results_df['Test_F1_macro'].mean():.4f}")

# ======================================================
# 6. SUBCLADE COMPARISON (PER MAJOR HAPLOGROUP)
# ======================================================
sub_results = []

for hap in sorted(df["major_haplogroup"].dropna().unique()):
    sub_df = df[df["major_haplogroup"] == hap].copy()
    sub_X = sub_df.iloc[:, 6:].apply(pd.to_numeric, errors="coerce")
    sub_y = sub_df[hap_col]

    valid_counts = sub_y.value_counts()
    valid_classes = valid_counts[valid_counts > 1].index
    sub_X = sub_X[sub_y.isin(valid_classes)]
    sub_y = sub_y[sub_y.isin(valid_classes)]

    if sub_y.nunique() < 3 or len(sub_y) < 20:
        print(f"\nSkipping {hap}: only {len(sub_y)} valid samples, {sub_y.nunique()} subclades.")
        continue

    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    fold_metrics = []

    for fold, (train_idx, test_idx) in enumerate(skf.split(sub_X, sub_y), 1):
        X_tr, X_te = sub_X.iloc[train_idx], sub_X.iloc[test_idx]
        y_tr, y_te = sub_y.iloc[train_idx], sub_y.iloc[test_idx]

        best_rf.fit(X_tr, y_tr)
        y_pred_train = best_rf.predict(X_tr)
        y_pred_test = best_rf.predict(X_te)

        f1_train = f1_score(y_tr, y_pred_train, average="macro", zero_division=0)
        f1_test = f1_score(y_te, y_pred_test, average="macro", zero_division=0)
        acc_train = accuracy_score(y_tr, y_pred_train)
        acc_test = accuracy_score(y_te, y_pred_test)

        fold_metrics.append({
            "Clade": hap,
            "Fold": fold,
            "Train_Acc": round(acc_train, 4),
            "Test_Acc": round(acc_test, 4),
            "Train_F1_macro": round(f1_train, 4),
            "Test_F1_macro": round(f1_test, 4),
            "Delta_F1_macro": round(f1_train - f1_test, 4),
            "n_classes": sub_y.nunique(),
            "n_samples": len(sub_y),
        })

    sub_results.extend(fold_metrics)

# ======================================================
# 7. SUMMARIZE RESULTS
# ======================================================
sub_results_df = pd.DataFrame(sub_results)

print("\n================= SUMMARY: MAJOR HAPLOGROUPS =================")
display(major_results_df)

print("\n================= SUMMARY: SUBCLADES =================")
display(sub_results_df.sort_values(["Clade", "Test_F1_macro"], ascending=[True, False]))

print("\nHierarchical Stratified K-Fold Evaluation Complete.")


===== STRATIFIED K-FOLD RESULTS (MAJOR HAPLOGROUPS) =====


Unnamed: 0,Fold,Train_Acc,Test_Acc,Train_F1_macro,Test_F1_macro,Delta_F1_macro
0,1,0.9815,0.9754,0.972,0.8786,0.0934
1,2,0.984,0.9738,0.9775,0.9658,0.0117
2,3,0.9803,0.982,0.9389,0.9019,0.0369
3,4,0.984,0.9737,0.9853,0.9258,0.0595
4,5,0.9852,0.9655,0.9884,0.8907,0.0976



Mean F1_macro (Test): 0.9126

Skipping F: only 25 valid samples, 1 subclades.

Skipping G: only 18 valid samples, 5 subclades.

Skipping H: only 17 valid samples, 2 subclades.

Skipping I: only 12 valid samples, 1 subclades.

Skipping L: only 19 valid samples, 2 subclades.

Skipping P: only 4 valid samples, 1 subclades.



Unnamed: 0,Fold,Train_Acc,Test_Acc,Train_F1_macro,Test_F1_macro,Delta_F1_macro
0,1,0.9815,0.9754,0.972,0.8786,0.0934
1,2,0.984,0.9738,0.9775,0.9658,0.0117
2,3,0.9803,0.982,0.9389,0.9019,0.0369
3,4,0.984,0.9737,0.9853,0.9258,0.0595
4,5,0.9852,0.9655,0.9884,0.8907,0.0976





Unnamed: 0,Clade,Fold,Train_Acc,Test_Acc,Train_F1_macro,Test_F1_macro,Delta_F1_macro,n_classes,n_samples
0,C,1,0.844,0.7248,0.812,0.5703,0.2417,29,545
4,C,5,0.8509,0.7339,0.8034,0.5489,0.2544,29,545
3,C,4,0.8394,0.6697,0.8013,0.4979,0.3033,29,545
1,C,2,0.8417,0.6514,0.8108,0.4635,0.3473,29,545
2,C,3,0.8417,0.633,0.8087,0.4256,0.3831,29,545
7,D,3,0.8843,0.7407,0.8713,0.5089,0.3624,21,270
5,D,1,0.8935,0.7593,0.8823,0.4871,0.3952,21,270
9,D,5,0.8889,0.7593,0.8776,0.4848,0.3928,21,270
8,D,4,0.9028,0.6852,0.8779,0.4361,0.4418,21,270
6,D,2,0.8519,0.6667,0.7912,0.4198,0.3715,21,270



Hierarchical Stratified K-Fold Evaluation Complete.
