|
# 🧩 Feature-wise Mixture of Experts

> Specialized processing for heterogeneous tabular features

Feature-wise Mixture of Experts (MoE) is a powerful technique that applies different processing strategies to different features based on their characteristics. This approach allows for more specialized handling of each feature, improving model performance on complex, heterogeneous datasets.

## 🔍 Quick Overview

Feature MoE works by routing each feature through a set of specialized "expert" networks. Each expert can specialize in processing specific feature patterns or distributions, and a router determines which experts should handle each feature. This enables your model to handle complex, multi-modal data more effectively.

## 🚀 Basic Usage

Enable Feature MoE with just one parameter:

```python
from kdp import PreprocessingModel, FeatureType

# Define features
features = {
    "age": FeatureType.FLOAT_NORMALIZED,
    "income": FeatureType.FLOAT_RESCALED,
    "occupation": FeatureType.STRING_CATEGORICAL,
    "purchase_history": FeatureType.FLOAT_ARRAY,
}

# Create preprocessor with Feature MoE
preprocessor = PreprocessingModel(
    path_data="data.csv",
    features_specs=features,
    use_feature_moe=True,       # Turn on the magic
    feature_moe_num_experts=4,  # Four specialized experts
    feature_moe_expert_dim=64   # Size of expert representations
)

# Build and use
result = preprocessor.build_preprocessor()
model = result["model"]
```

## 🧩 How Feature MoE Works

KDP's Feature MoE uses a "divide and conquer" approach with smart routing:

1. **Expert Networks**: Each expert is a specialized neural network that processes features in its own unique way.
2. **Router Network**: Determines which experts should process each feature.
3. **Adaptive Weighting**: Features can use multiple experts with different weights.
4. **Residual Connections**: Preserve the original feature information while adding expert insights.
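
To make the routing concrete, here is a minimal NumPy sketch of the mechanism. This is not KDP's internal implementation: the sizes and names are illustrative, and each expert is reduced to a single linear map for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only (not KDP defaults)
num_features, dim = 6, 16    # one 16-dimensional embedding per feature
num_experts, top_k = 4, 2    # top_k plays the role of feature_moe_sparsity

# Per-feature embeddings (in KDP these come from the per-feature preprocessing)
x = rng.normal(size=(num_features, dim))

# Each expert is just a random linear map here; real experts are small MLPs
expert_w = rng.normal(size=(num_experts, dim, dim)) * 0.1
router_w = rng.normal(size=(dim, num_experts)) * 0.1

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# 1. Router network: score every feature against every expert
gates = softmax(x @ router_w)                       # (num_features, num_experts)

# 2. Sparsity: keep only the top-k gate weights per feature and renormalize
top_idx = np.argsort(gates, axis=-1)[:, -top_k:]
sparse_gates = np.zeros_like(gates)
np.put_along_axis(sparse_gates, top_idx,
                  np.take_along_axis(gates, top_idx, axis=-1), axis=-1)
sparse_gates /= sparse_gates.sum(axis=-1, keepdims=True)

# 3. Adaptive weighting: every expert transforms every feature,
#    and the outputs are blended with the (sparse) gate weights
expert_out = np.einsum("fd,edh->feh", x, expert_w)  # (features, experts, dim)
mixed = np.einsum("feh,fe->fh", expert_out, sparse_gates)

# 4. Residual connection: original feature information is preserved
output = x + mixed
print(output.shape)  # (6, 16)
```

With `feature_moe_routing="learned"` the gate weights are trained end to end, while `"predefined"` fixes the gate matrix from your feature-to-expert assignments.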

## ⚙️ Configuration Options

Customize Feature MoE behavior with these parameters:

```python
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_num_experts=5,          # More experts for complex signals
    feature_moe_expert_dim=96,          # Larger dimension for subtle patterns
    feature_moe_hidden_dims=[128, 64],  # Expert network architecture
    feature_moe_routing="learned",      # How to assign experts
    feature_moe_sparsity=2,             # Use top-2 experts per feature
)
```

### Routing Types

You can choose between two routing methods:

**1. Learned Routing**: The model learns which experts to use for each feature during training.

```python
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_routing="learned",
    feature_moe_sparsity=2,  # Use top 2 experts per feature
)
```

**2. Predefined Routing**: You specify which experts should handle each feature.

```python
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_routing="predefined",
    feature_moe_assignments={
        "age": 0,               # Expert 0 for age
        "income": 1,            # Expert 1 for income
        "occupation": 2,        # Expert 2 for occupation
        "purchase_history": 3,  # Expert 3 for purchase history
    }
)
```

### Key Configuration Parameters

| Parameter | Description | Default | Recommended Range |
|-----------|-------------|---------|-------------------|
| `feature_moe_num_experts` | Number of specialists | 4 | 3-5 for most tasks, 6-8 for very complex data |
| `feature_moe_expert_dim` | Size of expert output | 64 | Larger (96-128) for complex patterns |
| `feature_moe_routing` | How to assign experts | "learned" | "learned" for automatic, "predefined" for control |
| `feature_moe_sparsity` | Use only top k experts | 2 | 1-3 (lower = faster, higher = more accurate) |
| `feature_moe_hidden_dims` | Expert network size | [64, 32] | Deeper for complex relationships |

## 💡 Pro Tips for Feature MoE

1. **Group Similar Features**: Assign related features to the same expert for consistent processing:

```python
# Route each feature group (demographics, financial, product, temporal)
# to its own expert
feature_groups = {
    "age": 0, "gender": 0, "location": 0,          # Demographics
    "income": 1, "credit_score": 1, "balance": 1,  # Financial
    "item_id": 2, "brand": 2, "category": 2,       # Product
    "timestamp": 3, "day_of_week": 3, "month": 3   # Temporal
}

# Apply grouping
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_routing="predefined",
    feature_moe_assignments=feature_groups
)
```

2. **Visualize Expert Assignments**: Examine which experts handle which features:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# After training, check which experts handle each feature
preprocessor_model = result["model"]
feature_moe_layer = [layer for layer in preprocessor_model.layers
                     if "feature_moe" in layer.name][0]

# Get expert assignments
assignments = feature_moe_layer.get_expert_assignments()

# Build a (features x experts) weight matrix from the assignments
expert_matrix = np.zeros((len(assignments), preprocessor.feature_moe_num_experts))

for i, feature_name in enumerate(assignments.keys()):
    assignment = assignments[feature_name]
    if isinstance(assignment, int):
        # Predefined routing: a single expert index per feature
        expert_matrix[i, assignment] = 1.0
    else:
        # Learned routing: a mapping of expert index -> weight
        for expert_idx, weight in assignment.items():
            expert_matrix[i, expert_idx] = weight

# Visualize assignments as a heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(expert_matrix,
            xticklabels=[f"Expert {i}" for i in
                         range(preprocessor.feature_moe_num_experts)],
            yticklabels=list(assignments.keys()),
            cmap="YlGnBu")
plt.title("Feature to Expert Assignments")
plt.tight_layout()
plt.show()
```

3. **Progressive Training**: Start with frozen experts, then fine-tune:

```python
# Start with frozen experts
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_freeze_experts=True  # Start with frozen experts
)

# Train for a few epochs, then unfreeze experts
# ...training code...

# Unfreeze experts for fine-tuning
preprocessor.feature_moe_freeze_experts = False
# ...continue training...
```
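
If you want to see what those two phases might look like in a plain Keras training loop, here is a hedged sketch. It assumes the preprocessor output is a Keras model whose MoE layers have `"feature_moe"` in their names (as in the visualization tip above), and that `train_ds` is your `tf.data.Dataset` of `(features, labels)`; the sigmoid head and losses are placeholders for your own task.

```python
import tensorflow as tf

preprocessor_model = result["model"]  # from preprocessor.build_preprocessor()

# Hypothetical downstream head on top of the preprocessed features
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(preprocessor_model.output)
model = tf.keras.Model(inputs=preprocessor_model.input, outputs=outputs)

def set_moe_trainable(trainable: bool) -> None:
    """Freeze or unfreeze every layer whose name marks it as part of Feature MoE."""
    for layer in model.layers:
        if "feature_moe" in layer.name:
            layer.trainable = trainable

# Phase 1: train the rest of the model while the experts stay frozen
set_moe_trainable(False)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(train_ds, epochs=3)

# Phase 2: unfreeze the experts and fine-tune with a smaller learning rate
set_moe_trainable(True)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="binary_crossentropy")
model.fit(train_ds, epochs=5)
```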

## 🔍 When to Use Feature MoE

Feature MoE is particularly effective in these scenarios:

1. **Heterogeneous Features**: When your features have very different statistical properties.

```python
# Diverse feature types benefit from specialized processing
preprocessor = PreprocessingModel(
    features_specs={
        "user_id": FeatureType.STRING_HASHED,             # Categorical
        "text_review": FeatureType.TEXT,                  # Text
        "purchase_amount": FeatureType.FLOAT_NORMALIZED,  # Numerical
        "purchase_date": FeatureType.DATE,                # Temporal
    },
    use_feature_moe=True,
)
```

2. **Complex Multi-Modal Data**: When features come from different sources or modalities.

```python
# Features from different sources
preprocessor = PreprocessingModel(
    features_specs={
        # User features
        "user_age": FeatureType.FLOAT_NORMALIZED,
        "user_interests": FeatureType.STRING_ARRAY,

        # Item features
        "item_price": FeatureType.FLOAT_RESCALED,
        "item_category": FeatureType.STRING_CATEGORICAL,

        # Interaction features
        "view_count": FeatureType.INT_NORMALIZED,
        "cart_add_timestamp": FeatureType.DATE,
    },
    use_feature_moe=True,
)
```

3. **Transfer Learning**: When adapting a model to new features.

```python
# Use domain-specific experts for different feature groups
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_num_experts=3,  # One expert per domain
)
```
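
If the original and new feature domains are known ahead of time, one way to make that split explicit is to combine this with the predefined routing shown earlier; the feature names below are purely illustrative.

```python
# Keep original-domain features on their existing experts and give
# newly added features a dedicated expert (feature names are examples only)
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_num_experts=3,
    feature_moe_routing="predefined",
    feature_moe_assignments={
        "age": 0, "income": 0,      # original numeric features
        "occupation": 1,            # original categorical features
        "new_engagement_score": 2,  # new-domain features get their own expert
    },
)
```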

## 📚 Related Topics

- [Distribution-Aware Encoding](distribution-aware-encoding.md) - Another way to handle complex feature distributions
- [Advanced Numerical Embeddings](numerical-embeddings.md) - Special handling for numerical features
- [Tabular Attention](tabular-attention.md) - Alternative approach for feature interactions
- [Feature Selection](../optimization/feature-selection.md) - Complement MoE with feature selection
- [Complex Examples](../examples/complex-examples.md) - See MoE in action on complex datasets