Commit d0ef7b7

docs(KDP): revamping entire docs
1 parent 7b76a99 commit d0ef7b7

169 files changed, +5797 -2975 lines changed

README.md

Lines changed: 81 additions & 72 deletions
@@ -1,107 +1,116 @@
-# 🌟 Welcome to Keras Data Processor (KDP) - Preprocessing Power with TensorFlow Keras 🌟
+# 🌟 Keras Data Processor (KDP) - Powerful Data Preprocessing for TensorFlow
 
 <p align="center">
 <img src="docs/kdp_logo.png" width="350"/>
 </p>
 
-**Welcome to the Future of Data Preprocessing!**
+**Transform your raw data into ML-ready features with just a few lines of code!**
 
-Diving into the world of machine learning and data science, we often find ourselves tangled in the preprocessing jungle.
-Worry no more! Introducing a state-of-the-art data preprocessing model based on TensorFlow Keras and the innovative use of Keras preprocessing layers.
+KDP provides a state-of-the-art preprocessing system built on TensorFlow Keras. It handles everything from feature normalization to advanced embedding techniques, making your ML pipelines faster, more robust, and easier to maintain.
 
-Say goodbye to tedious data preparation tasks and hello to streamlined, efficient, and scalable data pipelines. Whether you're a seasoned data scientist or just starting out, this tool is designed to supercharge your ML workflows, making them more robust and faster than ever!
+## ✨ Key Features
 
-## 🔑 Key Features:
+- 🚀 **Efficient Single-Pass Processing**: Process all features in one go, dramatically faster than alternatives
+- 🧠 **Distribution-Aware Encoding**: Automatically detects and optimally handles different data distributions
+- 👁️ **Tabular Attention**: Captures complex feature interactions for better model performance
+- 🔍 **Feature Selection**: Automatically identifies and focuses on the most important features
+- 🔄 **Feature-wise Mixture of Experts**: Specialized processing for different feature types
+- 📦 **Production-Ready**: Deploy your preprocessing along with your model as a single unit
 
-- Automatic and scalable features statistics extraction: Automatically infer the feature tatistics from your data, saving you time and efforts.
-
-- Customizable Preprocessing Pipelines: Tailor your preprocessing steps with ease, choosing from a wide range of options for numeric, categorical, and even complex feature crosses.
-
-- Scalability and Efficiency: Designed for performance, handling large datasets with ease thanks to TensorFlow's powerful backend.
-
-- Easy Integration: Seamlessly fits into your TensorFlow Keras models (as first layers of the mode), making it a breeze to go from raw data to trained model faster than ever.
-
-## 🚀 Getting started:
-
-We use poetry for handling dependencies so you will need to install it first.
-Then you can install the dependencies by running:
-
-To install dependencies:
+## 🚀 Quick Installation
 
 ```bash
-poetry install
-```
-
-or to enter a dedicated env directly:
+# Using pip
+pip install keras-data-processor
 
-```bash
-poetry shell
+# Using Poetry
+poetry add keras-data-processor
 ```
 
-Then you can simply configure your preprocessor:
-
-## 🛠️ Building Preprocessor:
+## 📋 Simple Example
 
 ```python
-from kdp import PreprocessingModel
-from kdp import FeatureType
+from kdp import PreprocessingModel, FeatureType
 
-# DEFINING FEATURES PROCESSORS
+# Define your features
 features_specs = {
-    # ======= NUMERICAL Features =========================
-    "feat1": FeatureType.FLOAT_NORMALIZED,
-    "feat2": FeatureType.FLOAT_RESCALED,
-    # ======= CATEGORICAL Features ========================
-    "feat3": FeatureType.STRING_CATEGORICAL,
-    "feat4": FeatureType.INTEGER_CATEGORICAL,
-    # ======= TEXT Features ========================
-    "feat5": FeatureType.TEXT,
+    "age": FeatureType.FLOAT_NORMALIZED,
+    "income": FeatureType.FLOAT_RESCALED,
+    "occupation": FeatureType.STRING_CATEGORICAL,
+    "description": FeatureType.TEXT
 }
 
-# INSTANTIATE THE PREPROCESSING MODEL with your data
-ppr = PreprocessingModel(
+# Create and build the preprocessor
+preprocessor = PreprocessingModel(
     path_data="data/my_data.csv",
     features_specs=features_specs,
+    # Enable advanced features
+    use_distribution_aware=True,
+    tabular_attention=True
 )
-# construct the preprocessing pipelines
-ppr.build_preprocessor()
-```
+result = preprocessor.build_preprocessor()
+model = result["model"]
 
-This wil output:
-
-```JS
-{
-    'model': <Functional name=preprocessor, built=True>,
-    'inputs': {
-        'feat1': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=feat1>,
-        'feat2': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=feat2>,
-        'feat3': <KerasTensor shape=(None, 1), dtype=string, sparse=None, name=feat3>,
-        'feat4': <KerasTensor shape=(None, 1), dtype=int32, sparse=None, name=feat4>,
-        'feat5': <KerasTensor shape=(None, 1), dtype=string, sparse=None, name=feat5>
-    },
-    'signature': {
-        'feat1': TensorSpec(shape=(None, 1), dtype=tf.float32, name='feat1'),
-        'feat2': TensorSpec(shape=(None, 1), dtype=tf.float32, name='feat2'),
-        'feat3': TensorSpec(shape=(None, 1), dtype=tf.string, name='feat3'),
-        'feat4': TensorSpec(shape=(None, 1), dtype=tf.int32, name='feat4'),
-        'feat5': TensorSpec(shape=(None, 1), dtype=tf.string, name='feat5')
-    },
-    'output_dims': 45
-}
+# Use the preprocessor with your data
+processed_features = model(input_data)
 ```
 
-This will result in the following preprocessing steps:
+## 📚 Comprehensive Documentation
+
+We've built an extensive documentation system to help you get the most from KDP:
+
+### Core Guides
+
+- [🚀 Quick Start Guide](docs/quick_start.md) - Get up and running in minutes
+- [📊 Feature Processing](docs/features.md) - Learn about all supported feature types
+- [🧙‍♂️ Auto-Configuration](docs/auto_configuration.md) - Let KDP configure itself for your data
+
+### Advanced Topics
+
+- [📈 Distribution-Aware Encoding](docs/distribution_aware_encoder.md) - Smart handling of different distributions
+- [👁️ Tabular Attention](docs/tabular_attention.md) - Capture complex feature interactions
+- [🔢 Advanced Numerical Embeddings](docs/advanced_numerical_embeddings.md) - Rich representations for numbers
+- [🤖 Transformer Blocks](docs/transformer_blocks.md) - Apply transformer architecture to tabular data
+- [🎯 Feature Selection](docs/feature_selection.md) - Focus on what matters in your data
+- [🧠 Feature-wise Mixture of Experts](docs/feature_moe.md) - Specialized processing per feature
+
+### Integration & Performance
+
+- [🔗 Integration Guide](docs/integrations.md) - Use KDP with existing ML pipelines
+- [🚀 Tabular Optimization](docs/tabular_optimization.md) - Supercharge your preprocessing
+- [📈 Performance Tips](docs/complex_examples.md) - Handling large datasets efficiently
+
+### Background & Resources
+
+- [💡 Motivation](docs/motivation.md) - Why we built KDP
+- [🤝 Contributing](docs/contributing.md) - Help improve KDP
+
+## 🖼️ Model Architecture
+
+Your preprocessing pipeline is built as a Keras model that can be used independently or as the first layer of any model:
 
 <p align="center">
 <img src="docs/imgs/Model_Architecture.png" width="800"/>
 </p>
 
+## 📊 Performance
 
-**This preprocessing model can be used independentyly or as the first layer of any Keras model.
-This means you can ship your model with the preprocessing pipeline (built-in) as a single entity and deploy it with ease using Tesnorflow Serving.**
+KDP outperforms alternative preprocessing approaches, especially as data size increases:
 
-```python
+<p align="center">
+<img src="docs/imgs/time_vs_nr_data.png" width="400"/>
+<img src="docs/imgs/time_vs_nr_features.png" width="400"/>
+</p>
+
+## 🤝 Contributing
+
+We welcome contributions! Please check out our [Contributing Guide](docs/contributing.md) for guidelines on how to proceed.
+
+## 📄 License
+
+This project is licensed under the MIT License - see the LICENSE file for details.
 
-## 🔍 Dive Deeper:
+## 🙏 Acknowledgments
 
-Explore the detailed documentation to leverage the full potential of this preprocessing tool. Learn about customizing feature crosses, bucketization strategies, embedding sizes, and much more to truly tailor your preprocessing pipeline to your project's needs.
+- The TensorFlow and Keras teams for their amazing work
+- All contributors who help make KDP better
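
The new Simple Example calls `model(input_data)` without showing what `input_data` looks like. The signature printed in the old README suggests one `(None, 1)` tensor per feature, keyed by feature name. Below is a minimal sketch, assuming that layout still holds after this commit, of shaping CSV rows into per-feature columns with only the standard library. The helper `rows_to_feature_columns`, the `NUMERIC_FEATURES` set, and the sample rows are hypothetical and not part of KDP.

```python
import csv
import io

# Hypothetical sample standing in for data/my_data.csv.
SAMPLE_CSV = """age,income,occupation,description
34,52000,engineer,builds data pipelines
51,61000,teacher,teaches high school physics
"""

# Features parsed as floats, mirroring the FLOAT_* entries in features_specs.
NUMERIC_FEATURES = {"age", "income"}


def rows_to_feature_columns(csv_text, numeric_features):
    """Return {feature_name: [[value], ...]}: one (batch, 1) column per feature."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, raw in row.items():
            value = float(raw) if name in numeric_features else raw
            # Wrap each value in a list so the column has shape (batch, 1).
            columns[name].append([value])
    return columns


input_data = rows_to_feature_columns(SAMPLE_CSV, NUMERIC_FEATURES)
print(input_data["age"])
print(input_data["occupation"])
```

Each column could then be converted to a tensor, for example with `tf.constant(input_data["age"], dtype=tf.float32)`, before being passed to the model. Whether KDP expects exactly this layout is not confirmed by the commit itself.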
