Binary Classifier Models for 103 Features

With 103 input features, a wide range of binary classification models is available; the right choice depends on your priorities (training speed, interpretability, predictive performance).

🔢 Linear Models

    Logistic Regression

        Fast, interpretable.

        Good when the classes are roughly linearly separable and the features are not highly correlated.

        Works well even with many features if regularization is used (L1 or L2).
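
A minimal sketch of the regularization point above, assuming scikit-learn is available. The data is a synthetic stand-in generated with `make_classification` (not a real dataset), and the `C` values are arbitrary illustrations rather than tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 103 features, 20 of them informative.
X, y = make_classification(
    n_samples=1000, n_features=103, n_informative=20,
    n_redundant=10, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# L2 is scikit-learn's default penalty; L1 needs a solver that supports it.
l2_model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
l1_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", C=0.1, solver="liblinear"),
)

l2_acc = l2_model.fit(X_train, y_train).score(X_test, y_test)
l1_acc = l1_model.fit(X_train, y_train).score(X_test, y_test)

# L1 can drive coefficients exactly to zero, which acts as feature selection.
l1_coef = l1_model.named_steps["logisticregression"].coef_
n_zero = int((l1_coef == 0).sum())
print(f"L2 accuracy: {l2_acc:.3f}")
print(f"L1 accuracy: {l1_acc:.3f} ({n_zero} of 103 coefficients are zero)")
```

Scaling the features first matters here because the penalty treats all coefficients alike, so features on larger scales would otherwise be penalized less.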

🌳 Tree-Based Models

    Decision Tree

        Easy to interpret, but prone to overfitting on high-dimensional data.

    Random Forest

        Handles many features well and reduces overfitting by averaging many decorrelated trees.

    Gradient Boosting (e.g., XGBoost, LightGBM, CatBoost)

        Among the best performers for tabular data.

        Provides feature-importance scores and tolerates irrelevant features.

        Often a top pick in Kaggle competitions.

📈 Linear + Kernel Methods

    Support Vector Machine (SVM)

        Linear kernel: efficient for many features.

        RBF kernel: can model non-linear boundaries, but is slower to train and sensitive to feature scaling and noise.
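
A short sketch comparing the two kernels, assuming scikit-learn. The data is synthetic and the hyperparameters (`C`, `gamma`) are untuned placeholders; both models are wrapped in a pipeline so that scaling is fit only on the training folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in data (assumption): 103 features, 20 informative.
X, y = make_classification(
    n_samples=600, n_features=103, n_informative=20,
    n_clusters_per_class=1, random_state=0,
)

# Both variants are scale-sensitive, so standardize inside the pipeline.
linear_svm = make_pipeline(StandardScaler(), LinearSVC(C=0.1, max_iter=5000))
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

linear_scores = cross_val_score(linear_svm, X, y, cv=5)
rbf_scores = cross_val_score(rbf_svm, X, y, cv=5)

print(f"Linear SVM accuracy: {linear_scores.mean():.3f}")
print(f"RBF SVM accuracy:    {rbf_scores.mean():.3f}")
```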

🧠 Neural Networks

    MLPClassifier (Multi-layer Perceptron)

        Available in scikit-learn as MLPClassifier; larger or custom networks can be built with PyTorch or TensorFlow.

        Needs feature scaling and careful hyperparameter tuning.

        Performs better with large amounts of training data.
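
As a rough illustration, a small scikit-learn MLP with scaling and early stopping might look like the following. The layer sizes, `alpha`, and the synthetic data are assumptions chosen for the sketch, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 103 features, 20 informative.
X, y = make_classification(
    n_samples=2000, n_features=103, n_informative=20,
    n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

mlp = make_pipeline(
    StandardScaler(),  # MLPs train poorly on unscaled inputs
    MLPClassifier(
        hidden_layer_sizes=(64, 32),
        alpha=1e-3,            # L2 penalty on the weights
        early_stopping=True,   # hold out 10% of the training data for validation
        max_iter=300,
        random_state=0,
    ),
)
mlp.fit(X_train, y_train)

mlp_acc = mlp.score(X_test, y_test)
n_epochs = mlp.named_steps["mlpclassifier"].n_iter_
print(f"MLP accuracy: {mlp_acc:.3f} after {n_epochs} epochs")
```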

📉 Naive Bayes

    GaussianNB / BernoulliNB

        Very fast and simple.

        Assumes the features are conditionally independent given the class. This is not ideal with many correlated features, but it can still be surprisingly effective.
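
A quick sketch of both variants, assuming scikit-learn. The synthetic data deliberately includes redundant (correlated) features, which violates the independence assumption; `binarize=0.0` is an arbitrary threshold chosen so that `BernoulliNB` can be applied to continuous inputs.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Synthetic stand-in data (assumption): 20 informative features plus
# 30 redundant ones that are linear combinations of the informative ones.
X, y = make_classification(
    n_samples=1000, n_features=103, n_informative=20,
    n_redundant=30, n_clusters_per_class=1, random_state=0,
)

# GaussianNB models each feature as a per-class Gaussian.
gnb_scores = cross_val_score(GaussianNB(), X, y, cv=5)
# BernoulliNB expects binary features, so threshold each feature at 0.
bnb_scores = cross_val_score(BernoulliNB(binarize=0.0), X, y, cv=5)

print(f"GaussianNB accuracy:  {gnb_scores.mean():.3f}")
print(f"BernoulliNB accuracy: {bnb_scores.mean():.3f}")
```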

🧪 When to Use What?

| Scenario | Recommended model(s) |
| --- | --- |
| Quick baseline | Logistic Regression, Random Forest |
| High accuracy needed | XGBoost, LightGBM |
| High-dimensional + linear | Linear SVM, Logistic Regression |
| Many irrelevant/noisy features | Random Forest, XGBoost |
| Want feature importances | Random Forest, Logistic Regression (L1), XGBoost |
| Limited training data | Naive Bayes, Logistic Regression |
| Large dataset + compute | Neural Network, XGBoost |

You can start with Logistic Regression for a baseline, and move to Random Forest or XGBoost for stronger performance.
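
The baseline-then-upgrade workflow can be sketched as a cross-validated comparison. The candidate names and the synthetic data are assumptions for illustration; an XGBoost or LightGBM classifier could be added to the dictionary in the same way if those libraries are installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 103 features, 15 informative.
X, y = make_classification(
    n_samples=800, n_features=103, n_informative=15, random_state=0
)

candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Use the same folds for every model so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

for name, auc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean ROC AUC = {auc:.3f}")
```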

# Multi-label Classification Metrics: Macro vs. Micro Averages

## 🧪 Yeast Dataset Overview

- **Samples**: 2,417  
- **Features**: 103  
- **Labels**: 14 functional classes  
- **Multi-label nature**: Each sample may belong to multiple classes simultaneously.

This dataset is ideal for studying multi-label classification metrics like **sensitivity (recall)** and **specificity**, especially for comparing **macro** and **micro** averages.
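
scikit-learn has no dedicated specificity function, so one way to obtain macro- and micro-averaged specificity is from per-label confusion matrices. The sketch below uses a tiny hand-made label matrix (hypothetical, not the yeast data) so that the arithmetic can be checked by hand.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix, recall_score

# Toy multi-label data (assumption): 6 samples, 3 labels, indicator format.
y_true = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 0],
                   [1, 0, 1], [0, 1, 0], [1, 0, 0]])
y_pred = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 0],
                   [1, 0, 0], [1, 1, 0], [1, 0, 0]])

# One 2x2 matrix per label, laid out as [[TN, FP], [FN, TP]].
mcm = multilabel_confusion_matrix(y_true, y_pred)
tn, fp = mcm[:, 0, 0], mcm[:, 0, 1]
fn, tp = mcm[:, 1, 0], mcm[:, 1, 1]

# Macro: average the per-label rates. Micro: pool the counts, then divide.
spec_macro = (tn / (tn + fp)).mean()
spec_micro = tn.sum() / (tn.sum() + fp.sum())
sens_macro = (tp / (tp + fn)).mean()
sens_micro = tp.sum() / (tp.sum() + fn.sum())

print(f"Specificity  macro: {spec_macro:.3f}  micro: {spec_micro:.3f}")
print(f"Sensitivity  macro: {sens_macro:.3f}  micro: {sens_micro:.3f}")
# Specificity  macro: 0.667  micro: 0.900
# Sensitivity  macro: 0.500  micro: 0.750
```

The sensitivity values computed this way should match `recall_score` with `average='macro'` and `average='micro'`. Note that the per-label division assumes every label has at least one negative and one positive sample.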

---

## 📊 Why Use the Yeast Dataset?

- **Label Imbalance**: Some classes are more prevalent than others, so it helps demonstrate how macro and micro averages respond to imbalanced data.
- **Real-world Multi-label Structure**: The data comes from a real task, and multi-label problems of this kind are common in domains like bioinformatics and text classification.
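
To make the imbalance point concrete, here is a small constructed example (hypothetical labels, not the yeast data): one frequent label that is always recovered and one rare label that is mostly missed.

```python
import numpy as np
from sklearn.metrics import recall_score

# Constructed data (assumption): 100 samples, 2 labels.
y_true = np.zeros((100, 2), dtype=int)
y_true[:90, 0] = 1    # frequent label: 90 positives
y_true[90:, 1] = 1    # rare label: 10 positives

y_pred = np.zeros_like(y_true)
y_pred[:90, 0] = 1    # all 90 frequent positives are found
y_pred[90:92, 1] = 1  # only 2 of the 10 rare positives are found

per_label = recall_score(y_true, y_pred, average=None)
macro = recall_score(y_true, y_pred, average="macro")
micro = recall_score(y_true, y_pred, average="micro")

print("Per-label recall:", per_label)   # [1.0, 0.2]
print(f"Macro recall: {macro:.2f}")     # (1.0 + 0.2) / 2 = 0.60
print(f"Micro recall: {micro:.2f}")     # (90 + 2) / (90 + 10) = 0.92
```

Macro averaging weights each label equally, so the poorly predicted rare label pulls it down to 0.60. Micro averaging pools all decisions, so the frequent label dominates and the score stays at 0.92.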

---

## 🔗 Dataset Links

- **Scikit-learn Example**:  
  [Yeast Classifier Chain](https://scikit-learn.org/stable/auto_examples/multioutput/plot_classifier_chain_yeast.html)

- **Kaggle Notebook**:  
  [Multi-label Classification with Scikit-learn](https://www.kaggle.com/code/roccoli/multi-label-classification-with-sklearn)

---

## 🧮 Code Example: Macro vs. Micro Recall

```python
from sklearn.metrics import recall_score

# y_true and y_pred are assumed to be binary indicator matrices of shape
# (n_samples, n_labels), e.g. (2417, 14) for the full yeast dataset.
recall_macro = recall_score(y_true, y_pred, average='macro')
recall_micro = recall_score(y_true, y_pred, average='micro')

print(f"Macro-average Recall: {recall_macro:.3f}")
print(f"Micro-average Recall: {recall_micro:.3f}")
```