|
1 | | -# 🌟 Welcome to Keras Data Processor (KDP) - Preprocessing Power with TensorFlow Keras 🌟 |
| 1 | +# 🌟 Keras Data Processor (KDP) - Powerful Data Preprocessing for TensorFlow |
2 | 2 |
|
3 | 3 | <p align="center"> |
4 | 4 | <img src="docs/kdp_logo.png" width="350"/> |
5 | 5 | </p> |
6 | 6 |
|
7 | | -**Welcome to the Future of Data Preprocessing!** |
| 7 | +**Transform your raw data into ML-ready features with just a few lines of code!** |
8 | 8 |
|
9 | | -Diving into the world of machine learning and data science, we often find ourselves tangled in the preprocessing jungle. |
10 | | -Worry no more! Introducing a state-of-the-art data preprocessing model based on TensorFlow Keras and the innovative use of Keras preprocessing layers. |
| 9 | +KDP provides a state-of-the-art preprocessing system built on TensorFlow Keras. It handles everything from feature normalization to advanced embedding techniques, making your ML pipelines faster, more robust, and easier to maintain. |
11 | 10 |
|
12 | | -Say goodbye to tedious data preparation tasks and hello to streamlined, efficient, and scalable data pipelines. Whether you're a seasoned data scientist or just starting out, this tool is designed to supercharge your ML workflows, making them more robust and faster than ever! |
| 11 | +## ✨ Key Features |
13 | 12 |
|
14 | | -## 🔑 Key Features: |
| 13 | +- 🚀 **Efficient Single-Pass Processing**: Process all features in one go, dramatically faster than alternatives |
| 14 | +- 🧠 **Distribution-Aware Encoding**: Automatically detects and optimally handles different data distributions |
| 15 | +- 👁️ **Tabular Attention**: Captures complex feature interactions for better model performance |
| 16 | +- 🔍 **Feature Selection**: Automatically identifies and focuses on the most important features |
| 17 | +- 🔄 **Feature-wise Mixture of Experts**: Specialized processing for different feature types |
| 18 | +- 📦 **Production-Ready**: Deploy your preprocessing along with your model as a single unit |
15 | 19 |
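The "single-pass" idea in the feature list can be illustrated independently of KDP. The sketch below is not KDP's implementation. It is a standard-library illustration, under the assumption that normalization needs a per-feature mean and variance, of how statistics for several numeric features can be accumulated in one pass over the rows using Welford's online algorithm. The names `RunningStats` and `collect_stats` and the sample data are hypothetical.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running mean/variance for one feature (Welford's online algorithm)."""

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 if no value has been seen yet.
        return self.m2 / self.count if self.count else 0.0


def collect_stats(rows, numeric_features):
    """Accumulate stats for every listed feature in a single pass over rows."""
    stats = {name: RunningStats() for name in numeric_features}
    for row in rows:
        for name in numeric_features:
            stats[name].update(float(row[name]))
    return stats


# Hypothetical in-memory CSV standing in for a real data file.
sample_csv = io.StringIO("age,income\n20,1000\n30,2000\n40,3000\n")
stats = collect_stats(csv.DictReader(sample_csv), ["age", "income"])
```

Because each row is read once and every feature is updated in the same loop, the cost of reading the data does not grow with the number of features, which is the general motivation for single-pass designs.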
|
16 | | -- Automatic and scalable features statistics extraction: Automatically infer the feature tatistics from your data, saving you time and efforts. |
17 | | - |
18 | | -- Customizable Preprocessing Pipelines: Tailor your preprocessing steps with ease, choosing from a wide range of options for numeric, categorical, and even complex feature crosses. |
19 | | - |
20 | | -- Scalability and Efficiency: Designed for performance, handling large datasets with ease thanks to TensorFlow's powerful backend. |
21 | | - |
22 | | -- Easy Integration: Seamlessly fits into your TensorFlow Keras models (as first layers of the mode), making it a breeze to go from raw data to trained model faster than ever. |
23 | | - |
24 | | -## 🚀 Getting started: |
25 | | - |
26 | | -We use poetry for handling dependencies so you will need to install it first. |
27 | | -Then you can install the dependencies by running: |
28 | | - |
29 | | -To install dependencies: |
| 20 | +## 🚀 Quick Installation |
30 | 21 |
|
31 | 22 | ```bash |
32 | | -poetry install |
33 | | -``` |
34 | | - |
35 | | -or to enter a dedicated env directly: |
| 23 | +# Using pip |
| 24 | +pip install keras-data-processor |
36 | 25 |
|
37 | | -```bash |
38 | | -poetry shell |
| 26 | +# Using Poetry |
| 27 | +poetry add keras-data-processor |
39 | 28 | ``` |
40 | 29 |
|
41 | | -Then you can simply configure your preprocessor: |
42 | | - |
43 | | -## 🛠️ Building Preprocessor: |
| 30 | +## 📋 Simple Example |
44 | 31 |
|
45 | 32 | ```python |
46 | | -from kdp import PreprocessingModel |
47 | | -from kdp import FeatureType |
| 33 | +from kdp import PreprocessingModel, FeatureType |
48 | 34 |
|
49 | | -# DEFINING FEATURES PROCESSORS |
| 35 | +# Define your features |
50 | 36 | features_specs = { |
51 | | - # ======= NUMERICAL Features ========================= |
52 | | - "feat1": FeatureType.FLOAT_NORMALIZED, |
53 | | - "feat2": FeatureType.FLOAT_RESCALED, |
54 | | - # ======= CATEGORICAL Features ======================== |
55 | | - "feat3": FeatureType.STRING_CATEGORICAL, |
56 | | - "feat4": FeatureType.INTEGER_CATEGORICAL, |
57 | | - # ======= TEXT Features ======================== |
58 | | - "feat5": FeatureType.TEXT, |
| 37 | + "age": FeatureType.FLOAT_NORMALIZED, |
| 38 | + "income": FeatureType.FLOAT_RESCALED, |
| 39 | + "occupation": FeatureType.STRING_CATEGORICAL, |
| 40 | + "description": FeatureType.TEXT |
59 | 41 | } |
60 | 42 |
|
61 | | -# INSTANTIATE THE PREPROCESSING MODEL with your data |
62 | | -ppr = PreprocessingModel( |
| 43 | +# Create and build the preprocessor |
| 44 | +preprocessor = PreprocessingModel( |
63 | 45 | path_data="data/my_data.csv", |
64 | 46 | features_specs=features_specs, |
| 47 | + # Enable advanced features |
| 48 | + use_distribution_aware=True, |
| 49 | + tabular_attention=True |
65 | 50 | ) |
66 | | -# construct the preprocessing pipelines |
67 | | -ppr.build_preprocessor() |
68 | | -``` |
| 51 | +result = preprocessor.build_preprocessor() |
| 52 | +model = result["model"] |
69 | 53 |
|
70 | | -This wil output: |
71 | | - |
72 | | -```JS |
73 | | -{ |
74 | | -'model': <Functional name=preprocessor, built=True>, |
75 | | -'inputs': { |
76 | | - 'feat1': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=feat1>, |
77 | | - 'feat2': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=feat2>, |
78 | | - 'feat3': <KerasTensor shape=(None, 1), dtype=string, sparse=None, name=feat3>, |
79 | | - 'feat4': <KerasTensor shape=(None, 1), dtype=int32, sparse=None, name=feat4>, |
80 | | - 'feat5': <KerasTensor shape=(None, 1), dtype=string, sparse=None, name=feat5> |
81 | | - }, |
82 | | -'signature': { |
83 | | - 'feat1': TensorSpec(shape=(None, 1), dtype=tf.float32, name='feat1'), |
84 | | - 'feat2': TensorSpec(shape=(None, 1), dtype=tf.float32, name='feat2'), |
85 | | - 'feat3': TensorSpec(shape=(None, 1), dtype=tf.string, name='feat3'), |
86 | | - 'feat4': TensorSpec(shape=(None, 1), dtype=tf.int32, name='feat4'), |
87 | | - 'feat5': TensorSpec(shape=(None, 1), dtype=tf.string, name='feat5') |
88 | | - }, |
89 | | -'output_dims': 45 |
90 | | -} |
| 54 | +# Use the preprocessor with your data (input_data: a dict mapping each feature name to a batch of values) |
| 55 | +processed_features = model(input_data) |
91 | 56 | ``` |
92 | 57 |
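The example above leaves `input_data` undefined. The signature printed by the previous README version lists one entry per feature, each of shape `(None, 1)`, so one plausible input is a dictionary keyed by feature name whose values are column vectors. The sketch below builds such a dictionary from CSV rows using only the standard library. `make_batch`, `FEATURE_CASTS`, and the sample rows are hypothetical, and whether KDP accepts nested lists directly or requires tensors is an assumption to check against the documentation.

```python
import csv
import io

# Feature names mirror the README example; the dtype mapping is an assumption.
FEATURE_CASTS = {
    "age": float,
    "income": float,
    "occupation": str,
    "description": str,
}


def make_batch(rows, casts=FEATURE_CASTS):
    """Turn an iterable of row dicts into {feature: [[v1], [v2], ...]}.

    Each value is wrapped in a one-element list so that every column
    has shape (batch_size, 1).
    """
    batch = {name: [] for name in casts}
    for row in rows:
        for name, cast in casts.items():
            batch[name].append([cast(row[name])])
    return batch


# Hypothetical rows standing in for data/my_data.csv.
sample_csv = io.StringIO(
    "age,income,occupation,description\n"
    "34,52000,engineer,builds data pipelines\n"
    "51,61000,teacher,teaches high school math\n"
)
input_data = make_batch(csv.DictReader(sample_csv))
```

With TensorFlow installed, each column could then be wrapped with `tf.constant` before calling `model(input_data)`.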
|
93 | | -This will result in the following preprocessing steps: |
| 58 | +## 📚 Comprehensive Documentation |
| 59 | + |
| 60 | +We've built an extensive documentation system to help you get the most from KDP: |
| 61 | + |
| 62 | +### Core Guides |
| 63 | + |
| 64 | +- [🚀 Quick Start Guide](docs/quick_start.md) - Get up and running in minutes |
| 65 | +- [📊 Feature Processing](docs/features.md) - Learn about all supported feature types |
| 66 | +- [🧙‍♂️ Auto-Configuration](docs/auto_configuration.md) - Let KDP configure itself for your data |
| 67 | + |
| 68 | +### Advanced Topics |
| 69 | + |
| 70 | +- [📈 Distribution-Aware Encoding](docs/distribution_aware_encoder.md) - Smart handling of different distributions |
| 71 | +- [👁️ Tabular Attention](docs/tabular_attention.md) - Capture complex feature interactions |
| 72 | +- [🔢 Advanced Numerical Embeddings](docs/advanced_numerical_embeddings.md) - Rich representations for numbers |
| 73 | +- [🤖 Transformer Blocks](docs/transformer_blocks.md) - Apply transformer architecture to tabular data |
| 74 | +- [🎯 Feature Selection](docs/feature_selection.md) - Focus on what matters in your data |
| 75 | +- [🧠 Feature-wise Mixture of Experts](docs/feature_moe.md) - Specialized processing per feature |
| 76 | + |
| 77 | +### Integration & Performance |
| 78 | + |
| 79 | +- [🔗 Integration Guide](docs/integrations.md) - Use KDP with existing ML pipelines |
| 80 | +- [🚀 Tabular Optimization](docs/tabular_optimization.md) - Supercharge your preprocessing |
| 81 | +- [📈 Performance Tips](docs/complex_examples.md) - Handling large datasets efficiently |
| 82 | + |
| 83 | +### Background & Resources |
| 84 | + |
| 85 | +- [💡 Motivation](docs/motivation.md) - Why we built KDP |
| 86 | +- [🤝 Contributing](docs/contributing.md) - Help improve KDP |
| 87 | + |
| 88 | +## 🖼️ Model Architecture |
| 89 | + |
| 90 | +Your preprocessing pipeline is built as a Keras model that can be used independently or as the first layer of any model: |
94 | 91 |
|
95 | 92 | <p align="center"> |
96 | 93 | <img src="docs/imgs/Model_Architecture.png" width="800"/> |
97 | 94 | </p> |
98 | 95 |
|
| 96 | +## 📊 Performance |
99 | 97 |
|
100 | | -**This preprocessing model can be used independentyly or as the first layer of any Keras model. |
101 | | -This means you can ship your model with the preprocessing pipeline (built-in) as a single entity and deploy it with ease using Tesnorflow Serving.** |
| 98 | +KDP outperforms alternative preprocessing approaches in processing time, especially as the number of data points and features increases: |
102 | 99 |
|
103 | | -```python |
| 100 | +<p align="center"> |
| 101 | + <img src="docs/imgs/time_vs_nr_data.png" width="400"/> |
| 102 | + <img src="docs/imgs/time_vs_nr_features.png" width="400"/> |
| 103 | +</p> |
| 104 | + |
| 105 | +## 🤝 Contributing |
| 106 | + |
| 107 | +We welcome contributions! Please check out our [Contributing Guide](docs/contributing.md) for guidelines on how to proceed. |
| 108 | + |
| 109 | +## 📄 License |
| 110 | + |
| 111 | +This project is licensed under the MIT License - see the LICENSE file for details. |
104 | 112 |
|
105 | | -## 🔍 Dive Deeper: |
| 113 | +## 🙏 Acknowledgments |
106 | 114 |
|
107 | | -Explore the detailed documentation to leverage the full potential of this preprocessing tool. Learn about customizing feature crosses, bucketization strategies, embedding sizes, and much more to truly tailor your preprocessing pipeline to your project's needs. |
| 115 | +- The TensorFlow and Keras teams for their amazing work |
| 116 | +- All contributors who help make KDP better |