|
# 🧩 Feature-wise Mixture of Experts

> Specialized processing for heterogeneous tabular features

Feature-wise Mixture of Experts (MoE) is a powerful technique that applies different processing strategies to different features based on their characteristics. This approach allows for more specialized handling of each feature, improving model performance on complex, heterogeneous datasets.

## 🔍 Quick Overview

Feature MoE works by routing each feature through a set of specialized "expert" networks. Each expert can specialize in processing specific feature patterns or distributions, and a router determines which experts should handle each feature. This enables your model to handle complex, multi-modal data more effectively.

## 🚀 Basic Usage

Enable Feature MoE with just one parameter:

```python
from kdp import PreprocessingModel, FeatureType

# Define features
features = {
    "age": FeatureType.FLOAT_NORMALIZED,
    "income": FeatureType.FLOAT_RESCALED,
    "occupation": FeatureType.STRING_CATEGORICAL,
    "purchase_history": FeatureType.FLOAT_ARRAY,
}

# Create preprocessor with Feature MoE
preprocessor = PreprocessingModel(
    path_data="data.csv",
    features_specs=features,
    use_feature_moe=True,       # Turn on the magic
    feature_moe_num_experts=4,  # Four specialized experts
    feature_moe_expert_dim=64   # Size of expert representations
)

# Build and use
result = preprocessor.build_preprocessor()
model = result["model"]
```

## 🧩 How Feature MoE Works

KDP's Feature MoE uses a "divide and conquer" approach with smart routing:

1. **Expert Networks**: Each expert is a specialized neural network that processes features in its own unique way.
2. **Router Network**: Determines which experts should process each feature.
3. **Adaptive Weighting**: Features can use multiple experts with different weights.
4. **Residual Connections**: Preserve the original feature information while adding expert insights.
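
To make the routing concrete, here is a minimal NumPy sketch of the mechanism. This is not KDP's internal implementation: the sizes and names are illustrative, and each expert is reduced to a single linear map for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only (not KDP defaults)
num_features, dim = 6, 16    # one 16-dimensional embedding per feature
num_experts, top_k = 4, 2    # top_k plays the role of feature_moe_sparsity

# Per-feature embeddings (in KDP these come from the per-feature preprocessing)
x = rng.normal(size=(num_features, dim))

# Each expert is just a random linear map here; real experts are small MLPs
expert_w = rng.normal(size=(num_experts, dim, dim)) * 0.1
router_w = rng.normal(size=(dim, num_experts)) * 0.1

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# 1. Router network: score every feature against every expert
gates = softmax(x @ router_w)                       # (num_features, num_experts)

# 2. Sparsity: keep only the top-k gate weights per feature and renormalize
top_idx = np.argsort(gates, axis=-1)[:, -top_k:]
sparse_gates = np.zeros_like(gates)
np.put_along_axis(sparse_gates, top_idx,
                  np.take_along_axis(gates, top_idx, axis=-1), axis=-1)
sparse_gates /= sparse_gates.sum(axis=-1, keepdims=True)

# 3. Adaptive weighting: every expert transforms every feature,
#    and the outputs are blended with the (sparse) gate weights
expert_out = np.einsum("fd,edh->feh", x, expert_w)  # (features, experts, dim)
mixed = np.einsum("feh,fe->fh", expert_out, sparse_gates)

# 4. Residual connection: original feature information is preserved
output = x + mixed
print(output.shape)  # (6, 16)
```

With `feature_moe_routing="learned"` the gate weights are trained end to end, while `"predefined"` fixes the gate matrix from your feature-to-expert assignments.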

## ⚙️ Configuration Options

Customize Feature MoE behavior with these parameters:

```python
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_num_experts=5,          # More experts for complex signals
    feature_moe_expert_dim=96,          # Larger dimension for subtle patterns
    feature_moe_hidden_dims=[128, 64],  # Expert network architecture
    feature_moe_routing="learned",      # How to assign experts
    feature_moe_sparsity=2,             # Use top-2 experts per feature
)
```

### Routing Types

You can choose between two routing methods:

**1. Learned Routing**: The model learns which experts to use for each feature during training.

```python
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_routing="learned",
    feature_moe_sparsity=2,  # Use top 2 experts per feature
)
```

**2. Predefined Routing**: You specify which experts should handle each feature.

```python
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_routing="predefined",
    feature_moe_assignments={
        "age": 0,               # Expert 0 for age
        "income": 1,            # Expert 1 for income
        "occupation": 2,        # Expert 2 for occupation
        "purchase_history": 3,  # Expert 3 for purchase history
    }
)
```

### Key Configuration Parameters

| Parameter | Description | Default | Recommended Range |
|-----------|-------------|---------|-------------------|
| `feature_moe_num_experts` | Number of specialists | 4 | 3-5 for most tasks, 6-8 for very complex data |
| `feature_moe_expert_dim` | Size of expert output | 64 | Larger (96-128) for complex patterns |
| `feature_moe_routing` | How to assign experts | "learned" | "learned" for automatic, "predefined" for control |
| `feature_moe_sparsity` | Use only top k experts | 2 | 1-3 (lower = faster, higher = more accurate) |
| `feature_moe_hidden_dims` | Expert network size | [64, 32] | Deeper for complex relationships |

## 💡 Pro Tips for Feature MoE

1. **Group Similar Features**: Assign related features to the same expert for consistent processing:

```python
# Route each feature group (demographics, financial, product, temporal)
# to its own expert
feature_groups = {
    "age": 0, "gender": 0, "location": 0,          # Demographics
    "income": 1, "credit_score": 1, "balance": 1,  # Financial
    "item_id": 2, "brand": 2, "category": 2,       # Product
    "timestamp": 3, "day_of_week": 3, "month": 3   # Temporal
}

# Apply grouping
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_routing="predefined",
    feature_moe_assignments=feature_groups
)
```

2. **Visualize Expert Assignments**: Examine which experts handle which features:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# After training, check which experts handle each feature
preprocessor_model = result["model"]
feature_moe_layer = [layer for layer in preprocessor_model.layers
                     if "feature_moe" in layer.name][0]

# Get expert assignments
assignments = feature_moe_layer.get_expert_assignments()

# Build a (features x experts) weight matrix from the assignments
expert_matrix = np.zeros((len(assignments), preprocessor.feature_moe_num_experts))

for i, feature_name in enumerate(assignments.keys()):
    assignment = assignments[feature_name]
    if isinstance(assignment, int):
        # Predefined routing: a single expert index per feature
        expert_matrix[i, assignment] = 1.0
    else:
        # Learned routing: a mapping of expert index -> weight
        for expert_idx, weight in assignment.items():
            expert_matrix[i, expert_idx] = weight

# Visualize assignments as a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(expert_matrix,
            xticklabels=[f"Expert {i}" for i in
                         range(preprocessor.feature_moe_num_experts)],
            yticklabels=list(assignments.keys()),
            cmap="YlGnBu")
plt.title("Feature to Expert Assignments")
plt.tight_layout()
plt.show()
```

3. **Progressive Training**: Start with frozen experts, then fine-tune:

```python
# Start with frozen experts
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_freeze_experts=True  # Start with frozen experts
)

# Train for a few epochs, then unfreeze experts
# ...training code...

# Unfreeze experts for fine-tuning
preprocessor.feature_moe_freeze_experts = False
# ...continue training...
```
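
If you want to see what those two phases might look like in a plain Keras training loop, here is a hedged sketch. It assumes the preprocessor output is a Keras model whose MoE layers have `"feature_moe"` in their names (as in the visualization tip above), and that `train_ds` is your `tf.data.Dataset` of `(features, labels)`; the sigmoid head and losses are placeholders for your own task.

```python
import tensorflow as tf

preprocessor_model = result["model"]  # from preprocessor.build_preprocessor()

# Hypothetical downstream head on top of the preprocessed features
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(preprocessor_model.output)
model = tf.keras.Model(inputs=preprocessor_model.input, outputs=outputs)

def set_moe_trainable(trainable: bool) -> None:
    """Freeze or unfreeze every layer whose name marks it as part of Feature MoE."""
    for layer in model.layers:
        if "feature_moe" in layer.name:
            layer.trainable = trainable

# Phase 1: train the rest of the model while the experts stay frozen
set_moe_trainable(False)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(train_ds, epochs=3)

# Phase 2: unfreeze the experts and fine-tune with a smaller learning rate
set_moe_trainable(True)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="binary_crossentropy")
model.fit(train_ds, epochs=5)
```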

## 🔍 When to Use Feature MoE

Feature MoE is particularly effective in these scenarios:

1. **Heterogeneous Features**: When your features have very different statistical properties.

```python
# Diverse feature types benefit from specialized processing
preprocessor = PreprocessingModel(
    features_specs={
        "user_id": FeatureType.STRING_HASHED,             # Categorical
        "text_review": FeatureType.TEXT,                  # Text
        "purchase_amount": FeatureType.FLOAT_NORMALIZED,  # Numerical
        "purchase_date": FeatureType.DATE,                # Temporal
    },
    use_feature_moe=True,
)
```

2. **Complex Multi-Modal Data**: When features come from different sources or modalities.

```python
# Features from different sources
preprocessor = PreprocessingModel(
    features_specs={
        # User features
        "user_age": FeatureType.FLOAT_NORMALIZED,
        "user_interests": FeatureType.STRING_ARRAY,

        # Item features
        "item_price": FeatureType.FLOAT_RESCALED,
        "item_category": FeatureType.STRING_CATEGORICAL,

        # Interaction features
        "view_count": FeatureType.INT_NORMALIZED,
        "cart_add_timestamp": FeatureType.DATE,
    },
    use_feature_moe=True,
)
```

3. **Transfer Learning**: When adapting a model to new features.

```python
# Use domain-specific experts for different feature groups
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_num_experts=3,  # One expert per domain
)
```
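
If the original and new feature domains are known ahead of time, one way to make that split explicit is to combine this with the predefined routing shown earlier; the feature names below are purely illustrative.

```python
# Keep original-domain features on their existing experts and give
# newly added features a dedicated expert (feature names are examples only)
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_num_experts=3,
    feature_moe_routing="predefined",
    feature_moe_assignments={
        "age": 0, "income": 0,      # original numeric features
        "occupation": 1,            # original categorical features
        "new_engagement_score": 2,  # new-domain features get their own expert
    },
)
```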

## 📚 Related Topics

- [Distribution-Aware Encoding](distribution-aware-encoding.md) - Another way to handle complex feature distributions
- [Advanced Numerical Embeddings](numerical-embeddings.md) - Special handling for numerical features
- [Tabular Attention](tabular-attention.md) - Alternative approach for feature interactions
- [Feature Selection](../optimization/feature-selection.md) - Complement MoE with feature selection
- [Complex Examples](../examples/complex-examples.md) - See MoE in action on complex datasets