# 🧙‍♂️ Automatic Model Configuration

KDP includes a powerful model configuration recommendation system that analyzes your dataset's statistics and suggests optimal preprocessing strategies for each feature.

## 🔍 Overview

The automatic model configuration system leverages statistical analysis to:

1. **Detect feature distributions** - Identifies the underlying distribution pattern for each feature
2. **Recommend transformations** - Suggests appropriate preprocessing layers based on detected patterns
3. **Optimize global settings** - Recommends global parameters for improved model performance
4. **Generate code** - Provides ready-to-use Python code implementing the recommendations

## 🛠️ How It Works

The system works in two main phases:

### 1. Statistics Collection

First, the `DatasetStatistics` class analyzes your dataset to compute various statistical properties (an illustrative example of the resulting summary follows this list):

- **Numerical features**: Mean, variance, distribution shape metrics (estimated skewness/kurtosis)
- **Categorical features**: Vocabulary size, cardinality, unique values
- **Text features**: Vocabulary statistics, average sequence length
- **Date features**: Cyclical patterns, temporal variance
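
The exact schema of the computed statistics depends on your KDP version, so treat the snippet below as an illustrative sketch of the kind of per-feature summary to expect; the field names and example values are assumptions chosen for readability, not taken from the library.

```python
# Illustrative only: field names below are assumptions, not the exact
# schema produced by DatasetStatistics or features_stats.json.
example_feature_stats = {
    "age": {                      # numerical feature
        "mean": 42.3,
        "variance": 310.7,
        "skewness": 0.1,          # estimated shape metrics
        "kurtosis": 2.9,
    },
    "country": {                  # categorical feature
        "vocab_size": 27,
        "cardinality": 27,
    },
    "review_text": {              # text feature
        "vocab_size": 12000,
        "avg_sequence_length": 54,
    },
}
```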

### 2. Configuration Recommendation

Then, the `ModelAdvisor` analyzes these statistics to recommend (see the conceptual sketch after this list):

- **Feature-specific transformations**: Based on the detected distribution type
- **Advanced encoding options**: Such as distribution-aware encoding for complex distributions
- **Attention mechanisms**: Tabular attention or multi-resolution attention when appropriate
- **Global parameters**: Overall architecture suggestions based on the feature mix
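
For intuition, the per-feature step can be thought of as a lookup from detected distribution to preprocessing strategy. The sketch below is a conceptual illustration, not `ModelAdvisor`'s actual implementation; the string labels mirror examples used elsewhere on this page (such as `FLOAT_NORMALIZED`) but should be treated as placeholders.

```python
# Conceptual illustration only -- not ModelAdvisor's actual implementation.
# The preprocessing labels are placeholders mirroring examples on this page.
DISTRIBUTION_TO_PREPROCESSING = {
    "normal": "FLOAT_NORMALIZED",        # standard z-score normalization
    "uniform": "FLOAT_RESCALED",         # min-max scaling
    "log_normal": "LOG_TRANSFORM",       # logarithmic transformation
    "heavy_tailed": "DISTRIBUTION_AWARE",
    "multimodal": "DISTRIBUTION_AWARE",
}

def recommend_preprocessing(detected_distribution: str) -> str:
    """Map a detected distribution to a preprocessing strategy."""
    # Fall back to distribution-aware encoding for anything unusual.
    return DISTRIBUTION_TO_PREPROCESSING.get(detected_distribution, "DISTRIBUTION_AWARE")
```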

## 🚀 Using the Configuration Advisor

### Method 1: Using the Python API

```python
from kdp.stats import DatasetStatistics
from kdp.processor import PreprocessingModel

# Initialize the statistics calculator.
# features_specs is an optional mapping of feature names to feature types;
# omit the argument to let KDP infer the types from the data.
stats_calculator = DatasetStatistics(
    path_data="data/my_dataset.csv",
    features_specs=features_specs,
)

# Calculate statistics
stats = stats_calculator.main()

# Generate recommendations
recommendations = stats_calculator.recommend_model_configuration()

# Use the recommendations to build a model: you can use the generated
# code snippet directly or access specific recommendations.
print(recommendations["code_snippet"])
```
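
If you want to go beyond printing the snippet, the recommended global settings can be applied when constructing the preprocessor. The sketch below assumes the recommendations dictionary exposes a `global_config` entry whose keys map one-to-one onto `PreprocessingModel` arguments of the same name; the generated `code_snippet` remains the authoritative version for your KDP release.

```python
# Sketch only: assumes a "global_config" key and that its entries map
# directly onto PreprocessingModel arguments of the same name.
global_config = recommendations.get("global_config", {})

preprocessor = PreprocessingModel(
    path_data="data/my_dataset.csv",
    features_specs=features_specs,
    use_distribution_aware=global_config.get("use_distribution_aware", True),
    tabular_attention=global_config.get("tabular_attention", False),
    tabular_attention_heads=global_config.get("tabular_attention_heads", 4),
)
result = preprocessor.build_preprocessor()
```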

### Method 2: Using the Command-Line Tool

KDP provides a command-line tool to analyze datasets and generate recommendations:

```bash
python scripts/analyze_dataset.py --data path/to/data.csv --output recommendations.json
```

Options:
- `--data`, `-d`: Path to a CSV data file or directory (required)
- `--output`, `-o`: Path to save recommendations (default: `recommendations.json`)
- `--stats`, `-s`: Path to save/load feature statistics (default: `features_stats.json`)
- `--batch-size`, `-b`: Batch size for processing (default: 50000)
- `--overwrite`, `-w`: Overwrite the existing statistics file
- `--feature-types`, `-f`: JSON file specifying feature types (optional; an example is sketched below)
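
The schema expected by `--feature-types` is not spelled out here; a plain name-to-type mapping along the lines below is a reasonable starting point, with the type strings treated as assumptions to check against what `analyze_dataset.py` actually accepts. The feature names are hypothetical.

```python
# Write a hypothetical feature-types file; the accepted type strings are an
# assumption -- verify what analyze_dataset.py expects in your version.
import json

feature_types = {
    "age": "numerical",
    "country": "categorical",
    "review_text": "text",
    "signup_date": "date",
}

with open("feature_types.json", "w") as f:
    json.dump(feature_types, f, indent=2)
```

Pass the resulting file with `--feature-types feature_types.json`.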

## 🔮 Distribution Detection

The system can detect and recommend specific configurations for various distribution types (an illustrative implementation of these heuristics follows the table):

| Distribution Type | Detection Criteria | Recommended Transformation |
|-------------------|-------------------|----------------------------|
| Normal | Skewness ≈ 0, Kurtosis ≈ 3 | Standard normalization |
| Heavy-tailed | Kurtosis > 4 | Distribution-aware encoding |
| Multimodal | Multiple peaks in histogram | Distribution-aware encoding |
| Uniform | Even distribution | Min-max scaling |
| Exponential | Positive, right-skewed | Distribution-aware encoding |
| Log-normal | Very skewed, positive | Logarithmic transformation |
| Discrete | Few unique values | Rank-based encoding |
| Periodic | Cyclic patterns | Trigonometric features |
| Sparse | Many zeros | Special zero handling |
| Beta | Bounded between 0 and 1 | Beta CDF transformation |
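
These criteria boil down to simple moment-based heuristics. The function below is an illustrative re-implementation of the table's logic using estimated skewness and kurtosis; it is not the detector KDP ships, the thresholds are indicative only, and the histogram-based cases (multimodal, periodic) are omitted for brevity.

```python
import numpy as np
from scipy import stats

def detect_distribution(values: np.ndarray) -> str:
    """Illustrative moment-based detector following the table above.

    Not KDP's internal detector: thresholds are indicative only, and the
    histogram-based cases (multimodal, periodic) are omitted.
    """
    values = np.asarray(values, dtype=float)
    skew = stats.skew(values)
    kurt = stats.kurtosis(values, fisher=False)  # normal distribution -> ~3

    if np.mean(values == 0) > 0.5:
        return "sparse"        # many zeros
    if len(np.unique(values)) < 20:
        return "discrete"      # few unique values
    if values.min() >= 0.0 and values.max() <= 1.0:
        return "beta"          # bounded between 0 and 1
    if abs(skew) < 0.5 and abs(kurt - 3.0) < 1.0:
        return "normal"        # skewness ~ 0, kurtosis ~ 3
    if abs(skew) < 0.5 and kurt < 2.5:
        return "uniform"       # flat: low kurtosis, little skew
    if skew > 3.0 and values.min() >= 0.0:
        return "log_normal"    # very skewed, positive
    if skew > 0.5 and values.min() >= 0.0:
        return "exponential"   # positive, right-skewed
    if kurt > 4.0:
        return "heavy_tailed"  # fat tails without strong positive skew
    return "unknown"           # fall back to distribution-aware encoding
```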

## 🔄 Recommendation Output

The recommendation output includes:

1. **Feature-specific recommendations**:
   ```json
   {
     "feature_name": {
       "feature_type": "NumericalFeature",
       "preprocessing": ["FLOAT_NORMALIZED"],
       "config": {"normalization": "z_score"},
       "detected_distribution": "normal",
       "notes": ["Normal distribution detected, standard normalization recommended"]
     }
   }
   ```

2. **Global configuration recommendations**:
   ```json
   {
     "output_mode": "CONCAT",
     "use_distribution_aware": true,
     "tabular_attention": true,
     "tabular_attention_heads": 4,
     "tabular_attention_placement": "multi_resolution",
     "notes": ["Mixed feature types detected, recommending multi-resolution attention"]
   }
   ```

3. **Ready-to-use code snippet** implementing all recommendations (an example of accessing these fields programmatically follows)
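
Programmatically, the same information is available on the dictionary returned by `recommend_model_configuration()`. Apart from `code_snippet`, the key names used below are assumptions inferred from the JSON structures above, so confirm them with `recommendations.keys()` on your own output.

```python
# Key names other than "code_snippet" are assumptions inferred from the
# structures shown above -- print recommendations.keys() to confirm.
for name, rec in recommendations.get("features", {}).items():
    print(name, rec.get("detected_distribution"), rec.get("preprocessing"))

print(recommendations.get("global_config", {}).get("notes"))

# The generated snippet can also be saved for hand-editing:
with open("recommended_model.py", "w") as f:
    f.write(recommendations["code_snippet"])
```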

## 🔧 Fine-tuning Recommendations

While the automatic recommendations provide an excellent starting point, you may want to fine-tune them based on your domain knowledge:

1. **Feature selection**: Remove or combine features based on their importance
2. **Distribution overrides**: Manually specify distribution types for certain features
3. **Parameter tuning**: Adjust hyperparameters such as embedding dimensions or the number of attention heads

You can easily customize the generated code snippet to incorporate your domain-specific knowledge while still leveraging the power of automatic distribution detection and configuration.
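
For example, a distribution override can be as simple as editing one feature's entry before regenerating or hand-editing the model code. The feature name `income`, the dictionary keys, and the `DISTRIBUTION_AWARE` label in the sketch below are all hypothetical placeholders mirroring the structures shown earlier.

```python
# Hypothetical override: force distribution-aware encoding for "income".
# Feature name, key names, and the preprocessing label are placeholders.
income_rec = recommendations["features"]["income"]
income_rec["detected_distribution"] = "heavy_tailed"
income_rec["preprocessing"] = ["DISTRIBUTION_AWARE"]
income_rec["notes"].append("Manually overridden based on domain knowledge")
```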