Commit 55ae7ce

docs(KDP): adding new styling
1 parent 0c932b0 commit 55ae7ce

4 files changed: +356 -382 lines changed

docs/advanced/custom-preprocessing.md

Lines changed: 82 additions & 0 deletions
@@ -689,6 +689,88 @@ KDP offers multiple approaches to custom preprocessing, from simple layer additi
5. 📝 **Document Your Approach**: Document why custom preprocessing was necessary
6. 🔁 **Ensure Reproducibility**: Make sure custom preprocessing is deterministic

## 🤖 Auto-Configuration Script

KDP provides an auto-configuration script that analyzes your dataset and recommends preprocessing configurations. It helps you get started quickly by automatically detecting feature types and suggesting preprocessing steps for each feature.

### 🚀 Basic Usage

```python
from kdp import auto_configure

# Analyze your dataset and get recommendations
config = auto_configure(
    data_path="your_data.csv",
    batch_size=50000,
    save_stats=True
)

# Review the recommendations
print(config["recommendations"])  # Feature-specific recommendations
print(config["code_snippet"])  # Ready-to-use code
```

### 📊 What It Analyzes

The auto-configuration script examines:

- 🔍 **Data Distributions**: Identifies patterns in numerical data
- 📈 **Feature Statistics**: Calculates mean, variance, skewness, etc.
- 🎯 **Value Ranges**: Detects min/max values and outliers
- 🔄 **Value Patterns**: Distinguishes between discrete and continuous values
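
To make the checks listed above concrete, here is a minimal sketch of how such per-column statistics could be computed for one numeric feature. It is a plain-Python illustration, not KDP's actual implementation: the helper name `summarize_column`, the 1.5 × IQR outlier rule, and the `max_discrete_values` threshold are all assumptions chosen for the example.

```python
import statistics


def summarize_column(values, max_discrete_values=20):
    """Illustrative per-column summary (hypothetical helper, not part of KDP).

    Expects a list of at least two numeric values.
    """
    n = len(values)
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=mean)
    std = variance ** 0.5

    # Population skewness: the third standardized moment (0.0 for a constant column).
    if std > 0:
        skewness = sum(((v - mean) / std) ** 3 for v in values) / n
    else:
        skewness = 0.0

    # Flag outliers with the common 1.5 * IQR rule (an assumed heuristic).
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]

    # Treat the column as discrete when it has few distinct, integer-valued entries.
    distinct = set(values)
    is_discrete = len(distinct) <= max_discrete_values and all(
        float(v).is_integer() for v in distinct
    )

    return {
        "mean": mean,
        "variance": variance,
        "skewness": skewness,
        "min": min(values),
        "max": max(values),
        "outliers": outliers,
        "is_discrete": is_discrete,
    }


stats = summarize_column([10, 11, 12, 13, 14, 15, 16, 17, 18, 1000])
print(stats["outliers"], stats["is_discrete"])
```

A strongly positive skewness together with a long upper tail of outliers is the kind of signal that could lead a tool to report a distribution such as `log_normal`, though the exact rules KDP applies are not documented in this section.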
721+
722+
### 🛠️ Command Line Interface
723+
724+
You can also use the script from the command line:
725+
726+
```bash
python -m kdp.scripts.analyze_dataset \
    --data your_data.csv \
    --output recommendations.json \
    --stats features_stats.json \
    --batch-size 50000
```

### 📝 Example Output

The script generates a report with the following structure:

```python
{
    "recommendations": {
        "income": {
            "feature_type": "NumericalFeature",
            "preprocessing": ["NORMALIZATION"],
            "detected_distribution": "log_normal",
            "config": {
                "embedding_dim": 16,
                "num_bins": 20
            }
        },
        "age": {
            "feature_type": "NumericalFeature",
            "preprocessing": ["NORMALIZATION"],
            "detected_distribution": "normal",
            "config": {
                "embedding_dim": 8,
                "num_bins": 10
            }
        }
    },
    "code_snippet": "# Generated code implementing the recommendations",
    "statistics": {
        # Detailed feature statistics
    }
}
```
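
A report of this shape is easier to review as one line per feature. The sketch below assumes the report is a plain dictionary with exactly the keys shown in the example; `summarize_recommendations` is a hypothetical helper written for illustration and is not part of KDP.

```python
def summarize_recommendations(report):
    """Return one summary line per feature (hypothetical helper, not part of KDP)."""
    lines = []
    for name, rec in sorted(report.get("recommendations", {}).items()):
        feature_type = rec.get("feature_type", "unknown")
        distribution = rec.get("detected_distribution", "unknown")
        steps = ", ".join(rec.get("preprocessing", [])) or "none"
        lines.append(f"{name}: {feature_type} | {distribution} | {steps}")
    return lines


# Sample data mirroring the example report structure.
report = {
    "recommendations": {
        "income": {
            "feature_type": "NumericalFeature",
            "preprocessing": ["NORMALIZATION"],
            "detected_distribution": "log_normal",
            "config": {"embedding_dim": 16, "num_bins": 20},
        },
        "age": {
            "feature_type": "NumericalFeature",
            "preprocessing": ["NORMALIZATION"],
            "detected_distribution": "normal",
            "config": {"embedding_dim": 8, "num_bins": 10},
        },
    },
    "code_snippet": "# Generated code implementing the recommendations",
    "statistics": {},
}

for line in summarize_recommendations(report):
    print(line)
```

The `.get` defaults keep the helper from failing if a feature entry omits a key, which may matter if the real report differs from the example.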

### 💡 Pro Tips for Auto-Configuration

1. **Review Before Implementing**: Always review the recommendations before applying them
2. **Combine with Domain Knowledge**: Use the recommendations alongside your expertise
3. **Update When Data Changes**: Rerun the analysis when your data distribution changes
4. **Customize as Needed**: Modify the generated code to match your specific requirements
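
One way to act on tips 1 and 3 is to compare the recommendations from two runs and review only the features that changed. The following sketch assumes both reports are dictionaries with a top-level `recommendations` key, as in the example output; `changed_features` is a hypothetical helper, not a KDP function.

```python
def changed_features(old_report, new_report):
    """Map each feature name to its (old, new) recommendation where the runs differ.

    Hypothetical helper for illustration; a missing feature appears as None.
    """
    old = old_report.get("recommendations", {})
    new = new_report.get("recommendations", {})
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


# Two small in-memory reports standing in for an earlier and a later run.
previous = {
    "recommendations": {
        "age": {"detected_distribution": "normal"},
        "income": {"detected_distribution": "log_normal"},
    }
}
current = {
    "recommendations": {
        "age": {"detected_distribution": "uniform"},
        "income": {"detected_distribution": "log_normal"},
        "tenure": {"detected_distribution": "normal"},
    }
}

for name, (before, after) in changed_features(previous, current).items():
    print(name, before, "->", after)
```

If the file written by `--output` is JSON with this same structure (an assumption, since its schema is not shown here), the two reports could be loaded with `json.load` before the comparison.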

## ⚠️ Limitations and Considerations

- 💾 Custom preprocessing layers must be compatible with TensorFlow's serialization
