
Commit fac7806

feat(KDP): adding MoE feature and tests
1 parent fdaa101 commit fac7806

11 files changed: +2126 −58 lines changed

docs/advanced/feature-moe.md

Lines changed: 236 additions & 0 deletions
@@ -0,0 +1,236 @@
# 🧩 Feature-wise Mixture of Experts

> Specialized processing for heterogeneous tabular features

Feature-wise Mixture of Experts (MoE) is a powerful technique that applies different processing strategies to different features based on their characteristics. This approach allows for more specialized handling of each feature, improving model performance on complex, heterogeneous datasets.

## 🔍 Quick Overview

Feature MoE works by routing each feature through a set of specialized "expert" networks. Each expert can specialize in processing specific feature patterns or distributions, and a router determines which experts should handle each feature. This enables your model to handle complex, multi-modal data more effectively.

## 🚀 Basic Usage

Enable Feature MoE with just one parameter:

```python
from kdp import PreprocessingModel, FeatureType

# Define features
features = {
    "age": FeatureType.FLOAT_NORMALIZED,
    "income": FeatureType.FLOAT_RESCALED,
    "occupation": FeatureType.STRING_CATEGORICAL,
    "purchase_history": FeatureType.FLOAT_ARRAY,
}

# Create preprocessor with Feature MoE
preprocessor = PreprocessingModel(
    path_data="data.csv",
    features_specs=features,
    use_feature_moe=True,        # Turn on the magic
    feature_moe_num_experts=4,   # Four specialized experts
    feature_moe_expert_dim=64    # Size of expert representations
)

# Build and use
result = preprocessor.build_preprocessor()
model = result["model"]
```

## 🧩 How Feature MoE Works

KDP's Feature MoE uses a "divide and conquer" approach with smart routing:

![Feature MoE Architecture](imgs/feature_moe_architecture.png)

1. **Expert Networks**: Each expert is a specialized neural network that processes features in its own unique way.
2. **Router Network**: Determines which experts should process each feature.
3. **Adaptive Weighting**: Features can use multiple experts with different weights.
4. **Residual Connections**: Preserve the original feature information while adding expert insights.
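
To make the flow concrete, here is a minimal, self-contained sketch of the soft-routing computation in plain TensorFlow. It is illustrative only (not KDP's internal implementation), and the shapes and variable names are assumptions, but it follows the four steps above: route, apply experts, combine, and add the residual.

```python
import tensorflow as tf

# Hypothetical sizes: 4 features embedded into 64 dims, 4 experts of the same width
batch, num_features, feature_dim = 32, 4, 64
num_experts, expert_dim = 4, 64

features = tf.random.normal((batch, num_features, feature_dim))

# 1. Router network: per-feature softmax weights over the experts
router = tf.keras.layers.Dense(num_experts)
routing_weights = tf.nn.softmax(router(features), axis=-1)           # (batch, features, experts)

# 2. Expert networks: each expert transforms every feature independently
experts = [tf.keras.layers.Dense(expert_dim, activation="relu") for _ in range(num_experts)]
expert_outputs = tf.stack([expert(features) for expert in experts], axis=-2)  # (batch, features, experts, dim)

# 3. Adaptive weighting: weighted sum of the expert outputs per feature
combined = tf.reduce_sum(expert_outputs * routing_weights[..., tf.newaxis], axis=-2)

# 4. Residual connection: keep the original feature information
outputs = combined + features   # valid here because expert_dim == feature_dim
```

In KDP you do not build this by hand; setting `use_feature_moe=True` wires the equivalent layer into the preprocessing model. The sketch simply shows why `feature_moe_num_experts` and `feature_moe_expert_dim` control the capacity of the block, and where the routing weights come from.
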
## ⚙️ Configuration Options

Customize Feature MoE behavior with these parameters:

```python
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_num_experts=5,          # More experts for complex signals
    feature_moe_expert_dim=96,          # Larger dimension for subtle patterns
    feature_moe_hidden_dims=[128, 64],  # Expert network architecture
    feature_moe_routing="learned",      # How to assign experts
    feature_moe_sparsity=2,             # Use top-2 experts per feature
)
```

### Routing Types

You can choose between two routing methods:

**1. Learned Routing**: The model learns which experts to use for each feature during training.

```python
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_routing="learned",
    feature_moe_sparsity=2,  # Use top 2 experts per feature
)
```
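
The `feature_moe_sparsity` setting keeps only the `k` highest-weighted experts for each feature and renormalizes their weights, which saves compute when many experts are available. The snippet below is a small illustration of that idea in TensorFlow (an assumption about the behaviour, not KDP's actual routing code):

```python
import tensorflow as tf

# Example router output for a single feature over four experts
routing_weights = tf.constant([[0.05, 0.55, 0.10, 0.30]])
k = 2  # corresponds to feature_moe_sparsity=2

# Zero out everything below the k-th largest weight, then renormalize
top_values, _ = tf.math.top_k(routing_weights, k=k)
threshold = top_values[..., -1:]                        # k-th largest weight per row
mask = tf.cast(routing_weights >= threshold, tf.float32)
sparse_weights = routing_weights * mask
sparse_weights /= tf.reduce_sum(sparse_weights, axis=-1, keepdims=True)

print(sparse_weights.numpy())  # approximately [[0.0, 0.647, 0.0, 0.353]]
```

Only experts 1 and 3 keep non-zero weight for this feature; the others are skipped, which is the trade-off described in the parameter table below (lower values are faster, higher values are more accurate).
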
**2. Predefined Routing**: You specify which experts should handle each feature.

```python
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_routing="predefined",
    feature_moe_assignments={
        "age": 0,               # Expert 0 for age
        "income": 1,            # Expert 1 for income
        "occupation": 2,        # Expert 2 for occupation
        "purchase_history": 3   # Expert 3 for purchase history
    }
)
```

### Key Configuration Parameters

| Parameter | Description | Default | Recommended Range |
|-----------|-------------|---------|-------------------|
| `feature_moe_num_experts` | Number of specialists | 4 | 3-5 for most tasks, 6-8 for very complex data |
| `feature_moe_expert_dim` | Size of expert output | 64 | Larger (96-128) for complex patterns |
| `feature_moe_routing` | How to assign experts | "learned" | "learned" for automatic, "predefined" for control |
| `feature_moe_sparsity` | Use only top k experts | 2 | 1-3 (lower = faster, higher = more accurate) |
| `feature_moe_hidden_dims` | Expert network size | [64, 32] | Deeper for complex relationships |

## 💡 Pro Tips for Feature MoE

1. **Group Similar Features**: Assign related features to the same expert for consistent processing:

```python
# Group demographic features to expert 0, financial to expert 1
feature_groups = {
    "age": 0, "gender": 0, "location": 0,           # Demographics
    "income": 1, "credit_score": 1, "balance": 1,   # Financial
    "item_id": 2, "brand": 2, "category": 2,        # Product
    "timestamp": 3, "day_of_week": 3, "month": 3    # Temporal
}

# Apply grouping
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_routing="predefined",
    feature_moe_assignments=feature_groups
)
```

2. **Visualize Expert Assignments**: Examine which experts handle which features:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# After training, check which experts handle each feature
preprocessor_model = result["model"]
feature_moe_layer = [layer for layer in preprocessor_model.layers
                     if "feature_moe" in layer.name][0]

# Get expert assignments
assignments = feature_moe_layer.get_expert_assignments()

# Visualize assignments
plt.figure(figsize=(10, 6))
expert_matrix = np.zeros((len(assignments), preprocessor.feature_moe_num_experts))

for i, feature_name in enumerate(assignments.keys()):
    assignment = assignments[feature_name]
    if isinstance(assignment, int):
        expert_matrix[i, assignment] = 1.0
    else:
        for expert_idx, weight in assignment.items():
            expert_matrix[i, expert_idx] = weight

sns.heatmap(expert_matrix,
            xticklabels=[f"Expert {i}" for i in range(preprocessor.feature_moe_num_experts)],
            yticklabels=list(assignments.keys()),
            cmap="YlGnBu")
plt.title("Feature to Expert Assignments")
plt.tight_layout()
plt.show()
```

3. **Progressive Training**: Start with frozen experts, then fine-tune:

```python
# Start with frozen experts
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_freeze_experts=True  # Start with frozen experts
)

# Train for a few epochs, then unfreeze experts
# ...training code...

# Unfreeze experts for fine-tuning
preprocessor.feature_moe_freeze_experts = False
# ...continue training...
```

## 🔍 When to Use Feature MoE

Feature MoE is particularly effective in these scenarios:

1. **Heterogeneous Features**: When your features have very different statistical properties.

```python
# Diverse feature types benefit from specialized processing
preprocessor = PreprocessingModel(
    features_specs={
        "user_id": FeatureType.STRING_HASHED,             # Categorical
        "text_review": FeatureType.TEXT,                  # Text
        "purchase_amount": FeatureType.FLOAT_NORMALIZED,  # Numerical
        "purchase_date": FeatureType.DATE,                # Temporal
    },
    use_feature_moe=True,
)
```

2. **Complex Multi-Modal Data**: When features come from different sources or modalities.

```python
# Features from different sources
preprocessor = PreprocessingModel(
    features_specs={
        # User features
        "user_age": FeatureType.FLOAT_NORMALIZED,
        "user_interests": FeatureType.STRING_ARRAY,

        # Item features
        "item_price": FeatureType.FLOAT_RESCALED,
        "item_category": FeatureType.STRING_CATEGORICAL,

        # Interaction features
        "view_count": FeatureType.INT_NORMALIZED,
        "cart_add_timestamp": FeatureType.DATE,
    },
    use_feature_moe=True,
)
```

3. **Transfer Learning**: When adapting a model to new features.

```python
# Use domain-specific experts for different feature groups
preprocessor = PreprocessingModel(
    use_feature_moe=True,
    feature_moe_num_experts=3,  # One expert per domain
)
```

## 📚 Related Topics

- [Distribution-Aware Encoding](distribution-aware-encoding.md) - Another way to handle complex feature distributions
- [Advanced Numerical Embeddings](numerical-embeddings.md) - Special handling for numerical features
- [Tabular Attention](tabular-attention.md) - Alternative approach for feature interactions
- [Feature Selection](../optimization/feature-selection.md) - Complement MoE with feature selection
- [Complex Examples](../examples/complex-examples.md) - See MoE in action on complex datasets

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
graph TD
    subgraph "Feature-wise Mixture of Experts"
        F1[Feature 1] --> Stack[Feature Stack]
        F2[Feature 2] --> Stack
        F3[Feature 3] --> Stack
        F4[Feature 4] --> Stack

        Stack --> Router[Router Network]

        subgraph "Expert Networks"
            E1[Expert 1]
            E2[Expert 2]
            E3[Expert 3]
            E4[Expert 4]
        end

        Router -->|Routing Weights| Weights[Expert Weights]
        Stack --> E1
        Stack --> E2
        Stack --> E3
        Stack --> E4

        E1 --> Combine[Weighted Combination]
        E2 --> Combine
        E3 --> Combine
        E4 --> Combine
        Weights --> Combine

        Combine --> Unstack[Feature Unstack]

        Unstack --> OF1[Enhanced Feature 1]
        Unstack --> OF2[Enhanced Feature 2]
        Unstack --> OF3[Enhanced Feature 3]
        Unstack --> OF4[Enhanced Feature 4]
    end

    classDef feature fill:#b5e3d8,stroke:#333,stroke-width:1px
    classDef expert fill:#ffcda8,stroke:#333,stroke-width:1px
    classDef router fill:#a8c5e8,stroke:#333,stroke-width:1px
    classDef enhanced fill:#d5a8e8,stroke:#333,stroke-width:1px

    class F1,F2,F3,F4 feature
    class E1,E2,E3,E4 expert
    class Router,Weights router
    class OF1,OF2,OF3,OF4 enhanced

docs/assets/js/fix-image-paths.js

Lines changed: 2 additions & 0 deletions
@@ -32,6 +32,7 @@ document.addEventListener('DOMContentLoaded', function() {
     'auto_configuration.md',
     'complex_examples.md',
     'integrations.md',
+    'feature_moe.md',
     'transformer_blocks.md',
     'contributing.md'
   ];
@@ -46,6 +47,7 @@ document.addEventListener('DOMContentLoaded', function() {
     'optimization/auto-configuration.html',
     'examples/complex-examples.html',
     'integrations/overview.html',
+    'advanced/feature-moe.html',
     'advanced/transformer-blocks.html',
     'contributing/overview.html'
   ];

docs/index.md

Lines changed: 6 additions & 1 deletion
@@ -34,6 +34,7 @@ KDP is a high-performance preprocessing library for tabular data built on Tensor
 <ul>
 <li><a href="advanced/distribution-aware-encoding.md">Distribution-Aware Encoding</a></li>
 <li><a href="advanced/tabular-attention.md">Tabular Attention</a></li>
+<li><a href="advanced/feature-moe.md">Feature-wise Mixture of Experts</a></li>
 <li><a href="advanced/feature-selection.md">Feature Selection</a></li>
 <li><a href="advanced/numerical-embeddings.md">Advanced Numerical Embeddings</a></li>
 <li><a href="advanced/transformer-blocks.md">Transformer Blocks</a></li>
@@ -84,6 +85,7 @@ KDP is a high-performance preprocessing library for tabular data built on Tensor
 <ul>
 <li>✅ Smart distribution detection</li>
 <li>✅ Neural feature interactions</li>
+<li>✅ Feature-wise Mixture of Experts</li>
 <li>✅ Memory-efficient processing</li>
 <li>✅ Single-pass optimization</li>
 <li>✅ Production-ready scaling</li>
@@ -97,6 +99,7 @@ KDP is a high-performance preprocessing library for tabular data built on Tensor
 |-----------|---------------------|----------------|
 | Complex Distributions | Fixed binning strategies | 📊 **Distribution-Aware Encoding** that adapts to your specific data |
 | Interaction Discovery | Manual feature crosses | 👁️ **Tabular Attention** that automatically finds important relationships |
+| Heterogeneous Features | Uniform processing | 🧩 **Feature-wise Mixture of Experts** that specializes processing per feature |
 | Feature Importance | Post-hoc analysis | 🎯 **Built-in Feature Selection** during training |
 | Performance at Scale | Memory issues with large datasets | **Optimized Processing Pipeline** with batching and caching |

@@ -118,7 +121,9 @@ preprocessor = PreprocessingModel(
     path_data="data.csv",
     features_specs=features,
     use_distribution_aware=True,  # Smart distribution handling
-    tabular_attention=True        # Automatic feature interactions
+    tabular_attention=True,       # Automatic feature interactions
+    use_feature_moe=True,         # Specialized processing per feature
+    feature_moe_num_experts=4     # Number of specialized experts
 )

 # Build and use