# OmicSelector2: Basic Feature Selection Workflow

This notebook demonstrates the basic feature selection workflow in OmicSelector2.

**Learning Objectives:**
- Generate and split synthetic biomarker data
- Apply different feature selection methods
- Compare feature selection results
- Visualize selected features

**Prerequisites:**
```bash
pip install omicselector2
```

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import OmicSelector2 components
from omicselector2.features.classical.lasso import LassoSelector
from omicselector2.features.classical.elastic_net import ElasticNetSelector
from omicselector2.features.classical.random_forest import RandomForestSelector
from omicselector2.features.classical.mrmr import MRMRSelector
from omicselector2.features.classical.boruta import BorutaSelector
from omicselector2.features.ensemble import EnsembleSelector

# Set random seed for reproducibility
np.random.seed(42)

# Configure visualization
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Generate Synthetic Biomarker Data

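For this tutorial, we'll create synthetic gene expression data: 200 samples by 1,000 genes drawn from a standard normal distribution, with a binary outcome determined by the sum of the first 20 genes. Because the informative genes are known in advance, the selected features can be checked against ground truth.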

In [None]:
def generate_biomarker_data(n_samples=200, n_features=1000, n_informative=20):
    """
    Generate synthetic gene expression data for biomarker discovery.
    
    Parameters:
    -----------
    n_samples : int
        Number of patient samples
    n_features : int
        Total number of genes (features)
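    n_informative : int
        Number of truly informative genes

    Returns:
    --------
    X : pd.DataFrame
        Expression matrix of shape (n_samples, n_features)
    y : pd.Series
        Binary outcome (1 if the informative-gene sum exceeds its median)
    """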
    # Generate gene expression matrix
    X = pd.DataFrame(
        np.random.randn(n_samples, n_features),
        columns=[f"GENE_{i:04d}" for i in range(n_features)]
    )
    
    # Create binary outcome (e.g., responder vs non-responder)
    # Only the first n_informative genes influence the outcome
    linear_combo = X.iloc[:, :n_informative].sum(axis=1)
    y = pd.Series(
        (linear_combo > linear_combo.median()).astype(int),
        name="response"
    )
    
    return X, y

# Generate data
X_full, y = generate_biomarker_data(n_samples=200, n_features=1000, n_informative=20)

print(f"Dataset shape: {X_full.shape}")
print(f"Class distribution: {y.value_counts().to_dict()}")
print(f"\nFirst few genes: {X_full.columns[:5].tolist()}")

## 2. Split Data into Train and Test Sets

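We split the data before any feature selection so that the test set never influences which features are chosen. Selecting features on the full dataset would leak test information and inflate the performance estimates in Section 7.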

In [None]:
from sklearn.model_selection import train_test_split

# Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X_full, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"\nTraining class distribution: {y_train.value_counts().to_dict()}")
print(f"Test class distribution: {y_test.value_counts().to_dict()}")

## 3. Apply Different Feature Selection Methods

### 3.1 Lasso (L1 Regularization)

In [None]:
# Lasso feature selection
lasso = LassoSelector(
    alpha=0.01,
    task='classification',
    n_features_to_select=50,
    random_state=42
)

lasso.fit(X_train, y_train)

print(f"Lasso selected {len(lasso.selected_features_)} features")
print(f"Top 10 features: {lasso.selected_features_[:10]}")

### 3.2 Elastic Net (L1 + L2 Regularization)

In [None]:
# Elastic Net feature selection
elastic_net = ElasticNetSelector(
    alpha=0.01,
    l1_ratio=0.7,  # 70% L1, 30% L2
    task='classification',
    n_features_to_select=50,
    random_state=42
)

elastic_net.fit(X_train, y_train)

print(f"Elastic Net selected {len(elastic_net.selected_features_)} features")
print(f"Top 10 features: {elastic_net.selected_features_[:10]}")

### 3.3 Random Forest Variable Importance

In [None]:
# Random Forest feature selection
rf = RandomForestSelector(
    n_estimators=100,
    task='classification',
    n_features_to_select=50,
    random_state=42
)

rf.fit(X_train, y_train)

print(f"Random Forest selected {len(rf.selected_features_)} features")
print(f"Top 10 features: {rf.selected_features_[:10]}")

### 3.4 mRMR (Minimum Redundancy Maximum Relevance)

In [None]:
# mRMR feature selection
mrmr = MRMRSelector(
    n_features_to_select=50,
    relevance='f_classif',
    redundancy='pearson',
    random_state=42
)

mrmr.fit(X_train, y_train)

print(f"mRMR selected {len(mrmr.selected_features_)} features")
print(f"Top 10 features: {mrmr.selected_features_[:10]}")
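
To make the idea behind mRMR concrete, here is a minimal pure-Python sketch of greedy selection. It is not OmicSelector2's implementation. It scores relevance as the absolute Pearson correlation with the label (a stand-in for the `f_classif` relevance used above) and redundancy as the mean absolute Pearson correlation with the features already selected. The helper names `pearson` and `greedy_mrmr` and the toy data are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def greedy_mrmr(columns, target, k):
    """Pick k feature names, each maximizing relevance minus mean redundancy.

    columns: dict mapping feature name -> list of values
    target:  list of class labels (0/1)
    """
    relevance = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    selected = []
    remaining = set(columns)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                abs(pearson(columns[name], columns[s])) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        # Iterate in sorted order so ties are broken deterministically
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a" and "b" both drive the label; "a_copy" nearly duplicates "a"
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(200)]
b = [rng.gauss(0, 1) for _ in range(200)]
toy_columns = {
    "a": a,
    "a_copy": [x + 0.01 * rng.gauss(0, 1) for x in a],
    "b": b,
    "noise": [rng.gauss(0, 1) for _ in range(200)],
}
toy_target = [1 if x + y > 0 else 0 for x, y in zip(a, b)]

mrmr_pick = greedy_mrmr(toy_columns, toy_target, k=2)
print(mrmr_pick)
```

A pure relevance ranking would tend to pick both `a` and `a_copy`. The redundancy penalty is what steers the second pick toward a feature that adds new information.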

## 4. Compare Feature Selection Results

Let's compare the top 20 features from each method and identify consensus features that appear in every list.

In [None]:
# Compare the top 20 features per method (assumes selected_features_ is ordered by rank)
methods_features = {
    'Lasso': set(lasso.selected_features_[:20]),
    'Elastic Net': set(elastic_net.selected_features_[:20]),
    'Random Forest': set(rf.selected_features_[:20]),
    'mRMR': set(mrmr.selected_features_[:20])
}

# Find consensus features (present in the top 20 of every method)
consensus_features = set.intersection(*methods_features.values())

print(f"Number of consensus features (in the top 20 of all {len(methods_features)} methods): {len(consensus_features)}")
print(f"Consensus features: {sorted(consensus_features)}")

# Count how many times each feature was selected
from collections import Counter

all_features = []
for features in methods_features.values():
    all_features.extend(features)

feature_counts = Counter(all_features)
print("\nTop 10 most frequently selected features:")
for feature, count in feature_counts.most_common(10):
    print(f"  {feature}: in the top 20 of {count}/{len(methods_features)} methods")
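
Counting votes shows which genes recur, but not how similar two methods are overall. The sketch below adds two simple comparisons: the pairwise Jaccard index between feature sets, and the fraction of known informative genes each method recovers. The second is possible here only because the synthetic data defines the informative genes as the first `n_informative` columns. The function names and the `demo_sets` values are hypothetical stand-ins for `methods_features`.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_jaccard(feature_sets):
    """Return {(method_1, method_2): jaccard} for every pair of methods."""
    return {
        (m1, m2): jaccard(feature_sets[m1], feature_sets[m2])
        for m1, m2 in combinations(feature_sets, 2)
    }


def recovery_rate(selected, informative):
    """Fraction of the known informative features that appear in `selected`."""
    informative = set(informative)
    return len(informative & set(selected)) / len(informative)


# Hypothetical feature sets standing in for methods_features
demo_sets = {
    "Lasso": {"GENE_0000", "GENE_0001", "GENE_0002", "GENE_0500"},
    "Random Forest": {"GENE_0000", "GENE_0001", "GENE_0003", "GENE_0700"},
    "mRMR": {"GENE_0000", "GENE_0002", "GENE_0003", "GENE_0004"},
}
# Ground truth in the same naming scheme as the synthetic data (5 genes for brevity)
true_genes = [f"GENE_{i:04d}" for i in range(5)]

overlaps = pairwise_jaccard(demo_sets)
recovery = {name: recovery_rate(feats, true_genes) for name, feats in demo_sets.items()}

for pair, value in overlaps.items():
    print(pair, round(value, 2))
print(recovery)
```

In the notebook itself, the same helpers could be called with `methods_features` and `X_full.columns[:20]` as the ground truth.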

## 5. Ensemble Feature Selection

Use ensemble voting to combine results from multiple methods.

In [None]:
# Create ensemble selector
ensemble = EnsembleSelector(
    selectors=[lasso, elastic_net, rf, mrmr],
    strategy='majority_voting',
    threshold=0.5,  # Feature must be selected by >50% of methods
    n_features_to_select=30
)

ensemble.fit(X_train, y_train)

print(f"Ensemble selected {len(ensemble.selected_features_)} features")
print("\nTop 20 ensemble features:")
for i, feature in enumerate(ensemble.selected_features_[:20], 1):
    print(f"  {i}. {feature}")
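
The voting rule can be sketched in a few lines of plain Python. This illustrates majority voting in general and is not `EnsembleSelector`'s actual code: it assumes a strict "more than `threshold`" rule (matching the comment above) and breaks ties alphabetically, both of which the library may handle differently. The function `majority_vote` and the `demo_lists` values are hypothetical.

```python
from collections import Counter


def majority_vote(feature_lists, threshold=0.5, n_features_to_select=None):
    """Keep features chosen by more than `threshold` of the methods.

    Results are ordered by vote count (descending), then alphabetically.
    """
    votes = Counter()
    for features in feature_lists:
        votes.update(set(features))  # each method votes at most once per feature
    n_methods = len(feature_lists)
    kept = [f for f, c in votes.items() if c / n_methods > threshold]
    kept.sort(key=lambda f: (-votes[f], f))
    if n_features_to_select is not None:
        kept = kept[:n_features_to_select]
    return kept


# Hypothetical per-method selections
demo_lists = [
    ["GENE_0001", "GENE_0002", "GENE_0003"],
    ["GENE_0001", "GENE_0002", "GENE_0900"],
    ["GENE_0001", "GENE_0003", "GENE_0500"],
    ["GENE_0001", "GENE_0002", "GENE_0700"],
]
voted = majority_vote(demo_lists, threshold=0.5)
print(voted)
```

With four methods and a strict threshold of 0.5, a feature needs at least three votes. `GENE_0003` has exactly two, so it is dropped.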

## 6. Visualize Feature Importance

Plot the top features by their importance scores.

In [None]:
# Get feature importance from Random Forest
result = rf.get_result()
feature_importance_df = result.to_dataframe()

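# Plot top 20 features (sort explicitly in case the result is not already ranked)
plt.figure(figsize=(10, 8))
top_20 = feature_importance_df.sort_values('importance', ascending=False).head(20)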

plt.barh(range(len(top_20)), top_20['importance'])
plt.yticks(range(len(top_20)), top_20['feature'])
plt.xlabel('Feature Importance (Random Forest)')
plt.ylabel('Gene')
plt.title('Top 20 Features by Random Forest Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

## 7. Evaluate Selected Features

Train a simple classifier using the ensemble-selected features and evaluate performance.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, roc_curve

# Transform data using ensemble-selected features
X_train_selected = ensemble.transform(X_train)
X_test_selected = ensemble.transform(X_test)

# Train classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_selected, y_train)

# Predictions
y_pred = clf.predict(X_test_selected)
y_pred_proba = clf.predict_proba(X_test_selected)[:, 1]

# Evaluation
print("Classification Report:")
print(classification_report(y_test, y_pred))

auc = roc_auc_score(y_test, y_pred_proba)
print(f"\nAUC-ROC Score: {auc:.3f}")

# Plot ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Ensemble Feature Set')
plt.legend()
plt.grid(True)
plt.show()
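
For readers new to the metric, AUC-ROC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following pure-Python sketch computes it by direct pair counting. It is meant only to illustrate the definition: `roc_auc_score` remains the practical choice, and the name `auc_by_pairs` and the demo values are hypothetical.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities
demo_labels = [0, 0, 1, 1, 0, 1]
demo_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
demo_auc = auc_by_pairs(demo_labels, demo_scores)
print(f"{demo_auc:.3f}")
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the value is 8/9. An AUC of 0.5 corresponds to the dashed "Random Classifier" diagonal in the plot.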

## Summary

In this notebook, you learned:

1. **How to apply multiple feature selection methods** (Lasso, Elastic Net, Random Forest, mRMR)
2. **How to compare results** across different methods
3. **How to use ensemble voting** to create robust feature sets
4. **How to visualize** feature importance
5. **How to evaluate** the selected biomarker signature

**Next Steps:**
- Try with real biomarker data
- Experiment with different methods and parameters
- Explore cross-validation and hyperparameter tuning (see next notebook)
- Learn about signature benchmarking and stability selection