Commit 00e75d6

feat(KDP): adding auto config / recommender
1 parent a4be43f commit 00e75d6

9 files changed: +2037 -1 lines changed

docs/auto_configuration.md

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
# 🧙‍♂️ Automatic Model Configuration

KDP includes a model configuration recommendation system that analyzes your dataset's statistics and suggests preprocessing strategies suited to each feature.

## 🔍 Overview

The automatic model configuration system uses statistical analysis to:

1. **Detect feature distributions** - Identifies the underlying distribution pattern for each feature
2. **Recommend transformations** - Suggests appropriate preprocessing layers based on detected patterns
3. **Optimize global settings** - Recommends global parameters for improved model performance
4. **Generate code** - Provides ready-to-use Python code implementing the recommendations

## 🛠️ How It Works

The system works in two main phases:

### 1. Statistics Collection

First, the `DatasetStatistics` class analyzes your dataset to compute various statistical properties:

- **Numerical features**: Mean, variance, distribution shape metrics (estimated skewness/kurtosis)
- **Categorical features**: Vocabulary size, cardinality, unique values
- **Text features**: Vocabulary statistics, average sequence length
- **Date features**: Cyclical patterns, temporal variance
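To make the numerical metrics above concrete, here is a minimal standard-library sketch of how mean, variance, skewness and kurtosis can be computed from central moments. The function name `numeric_summary` is hypothetical and this is not KDP's implementation; `DatasetStatistics` may estimate these values differently (for example, incrementally over batches).

```python
import math
from statistics import fmean


def numeric_summary(values):
    """Return mean, variance, skewness and (non-excess) kurtosis of a sample.

    Illustrative only: uses population moments (dividing by n).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = fmean(values)
    # Second, third and fourth central moments.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        # Constant column: shape metrics are undefined.
        return {"mean": mean, "variance": 0.0,
                "skewness": math.nan, "kurtosis": math.nan}
    return {
        "mean": mean,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,  # equals 3 for a normal distribution
    }


if __name__ == "__main__":
    print(numeric_summary([1, 2, 3, 4, 5]))
```

The kurtosis here is the non-excess form, which matches the "Kurtosis ≈ 3" convention used for normal data in the detection table of this document.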

### 2. Configuration Recommendation

Then, the `ModelAdvisor` analyzes these statistics to recommend:

- **Feature-specific transformations**: Based on the detected distribution type
- **Advanced encoding options**: Such as distribution-aware encoding for complex distributions
- **Attention mechanisms**: Tabular attention or multi-resolution attention when appropriate
- **Global parameters**: Overall architecture suggestions based on the feature mix
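As a rough illustration of what "feature-specific transformations" can mean in practice, the sketch below implements four of the simpler transformations named in this document (standard normalization, min-max scaling, logarithmic transformation, trigonometric features) in plain Python and dispatches on a distribution label. The label strings other than `"normal"`, the `TRANSFORMS` table, and all function names are assumptions made for this example; KDP applies its transformations through preprocessing layers, not through these functions.

```python
import math


def z_score(values):
    """Standard normalization: zero mean, unit (population) variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std if std else 0.0 for v in values]


def min_max(values):
    """Min-max scaling into the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def log_transform(values):
    """Logarithmic transformation; log1p keeps zero well-defined.

    Inputs are assumed to be non-negative.
    """
    return [math.log1p(v) for v in values]


def cyclical(values, period):
    """Trigonometric (sin, cos) features for a periodic quantity."""
    return [
        (math.sin(2 * math.pi * v / period), math.cos(2 * math.pi * v / period))
        for v in values
    ]


# Hypothetical label-to-transformation table for this sketch.
TRANSFORMS = {"normal": z_score, "uniform": min_max, "log_normal": log_transform}


def apply_recommended(values, detected_distribution):
    """Apply the transformation registered for a detected distribution label."""
    try:
        transform = TRANSFORMS[detected_distribution]
    except KeyError:
        raise KeyError(f"no transformation registered for {detected_distribution!r}")
    return transform(values)


if __name__ == "__main__":
    print(apply_recommended([0, 5, 10], "uniform"))
    print(cyclical([0, 6, 12, 18], period=24))
```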

## 🚀 Using the Configuration Advisor

### Method 1: Using the Python API

```python
from kdp.stats import DatasetStatistics
from kdp.processor import PreprocessingModel

# `features_specs` is your feature specification, defined elsewhere.
# Omit the argument to have feature types inferred from the data.

# Initialize statistics calculator
stats_calculator = DatasetStatistics(
    path_data="data/my_dataset.csv",
    features_specs=features_specs,  # Optional, will be inferred if not provided
)

# Calculate statistics
stats = stats_calculator.main()

# Generate recommendations
recommendations = stats_calculator.recommend_model_configuration()

# Use the recommendations to build a model
# You can directly use the generated code snippet or access specific recommendations
print(recommendations["code_snippet"])
```

### Method 2: Using the Command-Line Tool

KDP provides a command-line tool to analyze datasets and generate recommendations:

```bash
python scripts/analyze_dataset.py --data path/to/data.csv --output recommendations.json
```

Options:
- `--data`, `-d`: Path to CSV data file or directory (required)
- `--output`, `-o`: Path to save recommendations (default: recommendations.json)
- `--stats`, `-s`: Path to save/load feature statistics (default: features_stats.json)
- `--batch-size`, `-b`: Batch size for processing (default: 50000)
- `--overwrite`, `-w`: Overwrite existing statistics file
- `--feature-types`, `-f`: JSON file specifying feature types (optional)
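For readers who want to see how the options listed above fit together, the following is a hypothetical `argparse` declaration that mirrors the documented flags and defaults. It is a sketch for illustration only and is not taken from `scripts/analyze_dataset.py`, whose actual parser may be structured differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Analyze a dataset and generate configuration recommendations."
    )
    parser.add_argument("--data", "-d", required=True,
                        help="Path to CSV data file or directory")
    parser.add_argument("--output", "-o", default="recommendations.json",
                        help="Path to save recommendations")
    parser.add_argument("--stats", "-s", default="features_stats.json",
                        help="Path to save/load feature statistics")
    parser.add_argument("--batch-size", "-b", type=int, default=50000,
                        help="Batch size for processing")
    parser.add_argument("--overwrite", "-w", action="store_true",
                        help="Overwrite existing statistics file")
    parser.add_argument("--feature-types", "-f", default=None,
                        help="JSON file specifying feature types")
    return parser


if __name__ == "__main__":
    # Parse a sample command line instead of sys.argv for demonstration.
    args = build_parser().parse_args(["-d", "path/to/data.csv", "-b", "10000"])
    print(args)
```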

## 🔮 Distribution Detection

The system can detect and recommend specific configurations for various distribution types:

| Distribution Type | Detection Criteria | Recommended Transformation |
|-------------------|-------------------|----------------------------|
| Normal | Skewness ≈ 0, Kurtosis ≈ 3 | Standard normalization |
| Heavy-tailed | Kurtosis > 4 | Distribution-aware encoding |
| Multimodal | Multiple peaks in histogram | Distribution-aware encoding |
| Uniform | Even distribution | Min-max scaling |
| Exponential | Positive, right-skewed | Distribution-aware encoding |
| Log-normal | Very skewed, positive | Logarithmic transformation |
| Discrete | Few unique values | Rank-based encoding |
| Periodic | Cyclic patterns | Trigonometric features |
| Sparse | Many zeros | Special zero handling |
| Beta | Bounded between 0 and 1 | Beta CDF transformation |
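The table's criteria can be read as an ordered set of rules over per-feature statistics. The sketch below shows one way such rules might be expressed. Only the `Kurtosis > 4` threshold comes from the table; every other threshold (what counts as "many zeros", "few unique values", "very skewed"), the rule ordering, the label strings, and the `NumericStats` and `detect_distribution` names are assumptions for this example and do not describe `ModelAdvisor`'s real logic. Multimodal, uniform and periodic detection need histogram or autocorrelation information and are left out.

```python
from dataclasses import dataclass


@dataclass
class NumericStats:
    """Per-feature summary statistics (hypothetical container)."""
    skewness: float
    kurtosis: float       # non-excess; 3.0 for a normal distribution
    minimum: float
    maximum: float
    zero_fraction: float  # share of values equal to zero
    n_unique: int


def detect_distribution(s: NumericStats) -> str:
    """Map summary statistics to a distribution label using assumed thresholds."""
    if s.zero_fraction > 0.5:                      # assumed: "many zeros"
        return "sparse"
    if s.n_unique <= 10:                           # assumed: "few unique values"
        return "discrete"
    if s.minimum >= 0.0 and s.maximum <= 1.0:      # bounded between 0 and 1
        return "beta"
    if s.minimum > 0.0 and s.skewness > 2.0:       # assumed: "very skewed"
        return "log_normal"
    if s.minimum >= 0.0 and s.skewness > 0.5:      # assumed: "right-skewed"
        return "exponential"
    if s.kurtosis > 4.0:                           # threshold from the table
        return "heavy_tailed"
    if abs(s.skewness) < 0.5 and abs(s.kurtosis - 3.0) < 0.5:
        return "normal"
    return "unknown"


# Recommended transformations, copied from the table above.
RECOMMENDED = {
    "normal": "Standard normalization",
    "heavy_tailed": "Distribution-aware encoding",
    "exponential": "Distribution-aware encoding",
    "log_normal": "Logarithmic transformation",
    "discrete": "Rank-based encoding",
    "sparse": "Special zero handling",
    "beta": "Beta CDF transformation",
}


if __name__ == "__main__":
    stats = NumericStats(skewness=0.1, kurtosis=6.0, minimum=-50.0,
                         maximum=50.0, zero_fraction=0.0, n_unique=1000)
    label = detect_distribution(stats)
    print(label, "->", RECOMMENDED.get(label, "no recommendation"))
```

Rule order matters in a scheme like this: exponential and log-normal data also have high kurtosis, so the skewness checks run before the heavy-tailed check.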

## 🔄 Recommendation Output

The recommendation output includes:

1. **Feature-specific recommendations**:
```json
{
  "feature_name": {
    "feature_type": "NumericalFeature",
    "preprocessing": ["FLOAT_NORMALIZED"],
    "config": {"normalization": "z_score"},
    "detected_distribution": "normal",
    "notes": ["Normal distribution detected, standard normalization recommended"]
  }
}
```

2. **Global configuration recommendations**:
```json
{
  "output_mode": "CONCAT",
  "use_distribution_aware": true,
  "tabular_attention": true,
  "tabular_attention_heads": 4,
  "tabular_attention_placement": "multi_resolution",
  "notes": ["Mixed feature types detected, recommending multi-resolution attention"]
}
```

3. **Ready-to-use code snippet** implementing all recommendations
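A recommendation dictionary shaped like the JSON examples above can be turned into a quick human-readable summary. In the sketch below, the top-level keys `"features"` and `"global_config"` are assumptions made for illustration (the source only confirms a `"code_snippet"` key), so they are exposed as parameters that can be adjusted to the real output. The `summarize` function itself is hypothetical.

```python
def summarize(recommendations, features_key="features", global_key="global_config"):
    """Return one summary line per feature and per global setting.

    `features_key` and `global_key` are assumed names; change them to match
    the structure actually returned by the advisor.
    """
    lines = []
    for name, rec in sorted(recommendations.get(features_key, {}).items()):
        steps = ", ".join(rec.get("preprocessing", []))
        distribution = rec.get("detected_distribution", "unknown")
        lines.append(f"{name}: {distribution} -> {steps}")
    global_cfg = recommendations.get(global_key, {})
    for key in sorted(global_cfg):
        if key != "notes":  # notes are free text, skipped in the summary
            lines.append(f"[global] {key} = {global_cfg[key]}")
    return lines


if __name__ == "__main__":
    # Sample data copied from the JSON examples in this document.
    sample = {
        "features": {
            "feature_name": {
                "feature_type": "NumericalFeature",
                "preprocessing": ["FLOAT_NORMALIZED"],
                "config": {"normalization": "z_score"},
                "detected_distribution": "normal",
            }
        },
        "global_config": {
            "output_mode": "CONCAT",
            "tabular_attention": True,
            "tabular_attention_heads": 4,
        },
    }
    for line in summarize(sample):
        print(line)
```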

## 🔧 Fine-tuning Recommendations

The automatic recommendations are a starting point; you may want to fine-tune them based on your domain knowledge:

1. **Feature selection**: Remove or combine features based on their importance
2. **Distribution overrides**: Manually specify distribution types for certain features
3. **Parameter tuning**: Adjust hyperparameters like embedding dimensions or number of attention heads

You can customize the generated code snippet to incorporate domain-specific knowledge while still relying on automatic distribution detection and configuration.
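One simple way to layer manual choices over generated recommendations is a recursive dictionary merge, sketched below. The `apply_overrides` helper is hypothetical and not part of KDP; the keys in the usage example are taken from the JSON examples in this document.

```python
import copy


def apply_overrides(recommended, overrides):
    """Return a copy of `recommended` with `overrides` merged in recursively.

    Nested dictionaries are merged key by key; any other value is replaced.
    The inputs are left unmodified.
    """
    merged = copy.deepcopy(recommended)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    recommended = {
        "tabular_attention": True,
        "tabular_attention_heads": 4,
        "feature_name": {
            "detected_distribution": "normal",
            "config": {"normalization": "z_score"},
        },
    }
    # Domain knowledge: use more heads and treat the feature as log-normal.
    tuned = apply_overrides(recommended, {
        "tabular_attention_heads": 8,
        "feature_name": {"detected_distribution": "log_normal"},
    })
    print(tuned)
```

Merging into a copy keeps the original recommendations available for comparison after tuning.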
