docs(KDP): added docs

Gandalfdore · Gandalfdore · commit d8da4c50613f · 2025-02-20T14:27:04.000+02:00
diff --git a/docs/advanced_numerical_embeddings.md b/docs/advanced_numerical_embeddings.md
@@ -0,0 +1,95 @@
+# Advanced Numerical Embeddings in KDP
+
+Keras Data Processor (KDP) now provides advanced numerical embedding techniques to better capture complex numerical relationships in your data. This release introduces two embedding approaches:
+
+---
+
+## AdvancedNumericalEmbedding
+
+**Purpose:**
+Processes individual numerical features with tailored embedding layers. This layer performs adaptive binning, applies MLP transformations per feature, and can incorporate dropout and batch normalization.
+
+**Key Parameters:**
+- **`embedding_dim`**: Dimension for each feature's embedding.
+- **`mlp_hidden_units`**: Number of hidden units in the MLP applied to each feature.
+- **`num_bins`**: Number of bins used for discretizing continuous inputs.
+- **`init_min` and `init_max`**: Initialization boundaries for binning.
+- **`dropout_rate`**: Dropout rate for regularization.
+- **`use_batch_norm`**: Flag to apply batch normalization.
+
+**Usage Example:**
+```python
+from kdp.custom_layers import AdvancedNumericalEmbedding
+import tensorflow as tf
+
+layer = AdvancedNumericalEmbedding(
+    embedding_dim=8,
+    mlp_hidden_units=16,
+    num_bins=10,
+    init_min=[-3.0, -2.0, -4.0],
+    init_max=[3.0, 2.0, 4.0],
+    dropout_rate=0.1,
+    use_batch_norm=True,
+)
+
+# Input shape: (batch_size, num_features)
+x = tf.random.normal((32, 3))
+# Output shape: (32, 3, 8)
+output = layer(x, training=False)
+```
+
+---
+
+## GlobalAdvancedNumericalEmbedding
+
+**Purpose:**
+Combines a set of numerical features into a single, compact representation. It does so by applying an internal advanced numerical embedding on the concatenated input and then performing a global pooling over all features.
+
+**Key Parameters (prefixed with `global_`):**
+- **`global_embedding_dim`**: Global embedding dimension (final pooled vector size).
+- **`global_mlp_hidden_units`**: Hidden units in the global MLP.
+- **`global_num_bins`**: Number of bins for discretization.
+- **`global_init_min` and `global_init_max`**: Global initialization boundaries.
+- **`global_dropout_rate`**: Dropout rate.
+- **`global_use_batch_norm`**: Whether to apply batch normalization.
+- **`global_pooling`**: Pooling method to use ("average" or "max").
+
+**Usage Example:**
+```python
+from kdp.custom_layers import GlobalAdvancedNumericalEmbedding
+import tensorflow as tf
+
+global_layer = GlobalAdvancedNumericalEmbedding(
+    global_embedding_dim=8,
+    global_mlp_hidden_units=16,
+    global_num_bins=10,
+    global_init_min=[-3.0, -2.0],
+    global_init_max=[3.0, 2.0],
+    global_dropout_rate=0.1,
+    global_use_batch_norm=True,
+    global_pooling="average"
+)
+
+# Input shape: (batch_size, num_features); two features, matching the two-element init bounds above
+x = tf.random.normal((32, 2))
+# Global output shape: (32, 8)
+global_output = global_layer(x, training=False)
+```
+
+---
+
+## When to Use Which?
+
+- **AdvancedNumericalEmbedding:**
+  Use this when you need to process each numerical feature individually, preserving their distinct characteristics via per-feature embeddings.
+
+- **GlobalAdvancedNumericalEmbedding:**
+  Choose this option when you want to merge multiple numerical features into a unified global embedding using a pooling mechanism. This is particularly useful when the overall interaction across features is more important than the individual feature details.
+
+## Advanced Configuration
+
+Both layers offer additional parameters to fine-tune the embedding process. You can adjust dropout rates, batch normalization, and binning strategies to best suit your data. For more detailed information, please refer to the API documentation.
+
+---
+
+This document highlights the key differences and usage examples for the new advanced numerical embeddings available in KDP.
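
The `global_pooling` option documented above collapses per-feature embeddings into one vector per sample. The following is a minimal, dependency-free sketch of that reduction for a single sample, not KDP's actual implementation; the helper name `pool_features` is hypothetical, and plain lists stand in for tensors.

```python
from statistics import fmean


def pool_features(embeddings, mode="average"):
    """Reduce per-feature embeddings (num_features x dim) to one vector (dim).

    Hypothetical helper for illustration only; KDP performs the equivalent
    reduction on tensors inside GlobalAdvancedNumericalEmbedding.
    """
    if mode not in ("average", "max"):
        raise ValueError(f"unsupported pooling mode: {mode!r}")
    # Transpose so each tuple holds one embedding dimension across all features.
    columns = list(zip(*embeddings))
    if mode == "average":
        return [fmean(col) for col in columns]
    return [max(col) for col in columns]


# Three features, each embedded into two dimensions.
sample = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
avg = pool_features(sample, "average")
mx = pool_features(sample, "max")
```

Average pooling keeps a contribution from every feature, while max pooling keeps only the strongest activation per dimension, which is the trade-off to weigh when choosing between `"average"` and `"max"`.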
diff --git a/docs/complex_example.md b/docs/complex_example.md
@@ -123,6 +123,27 @@ ppr = PreprocessingModel(
     # Distribution aware configuration
     use_distribution_aware=True, # here we activate the distribution aware encoder
     distribution_aware_bins=1000, # thats the default value, but you can change it for finer data
+
+    # Add advanced numerical embedding
+    use_advanced_numerical_embedding=True,
+    embedding_dim=32,  # Match embedding size with categorical features
+    mlp_hidden_units=16,
+    num_bins=10,
+    init_min=-3.0,
+    init_max=3.0,
+    dropout_rate=0.1,
+    use_batch_norm=True,
+
+    # Add global numerical embedding
+    use_global_numerical_embedding=True,
+    global_embedding_dim=32,  # Match embedding dimensions
+    global_mlp_hidden_units=16,
+    global_num_bins=10,
+    global_init_min=-3.0,
+    global_init_max=3.0,
+    global_dropout_rate=0.1,
+    global_use_batch_norm=True,
+    global_pooling="average",
 )
 
 # Build the preprocessor
diff --git a/docs/example_usages.md b/docs/example_usages.md
@@ -362,3 +362,66 @@ feature_importances = ppr.get_feature_importances()
 ```
 Here is the plot of the model:
 ![Complex Model](imgs/numerical_example_model_with_distribution_aware.png)
+
+
+## Example 5: Numerical features with numerical embedding
+
+Numerical embedding maps each numerical feature into a higher-dimensional space.
+This can help capture non-linear relationships within and between numerical features.
+
+```python
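+import tensorflow as tf
+
+from kdp.custom_layers import DistributionAwareEncoder  # assumed location; adjust if it lives elsewhere
+from kdp.features import NumericalFeature, FeatureType
+from kdp.processor import PreprocessingModel, OutputModeOptions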
+
+
+# Define features
+features = {
+    "basic_float": NumericalFeature(
+        name="basic_float",
+        feature_type=FeatureType.FLOAT,
+    ),
+
+    "rescaled_float": NumericalFeature(
+        name="rescaled_float",
+        feature_type=FeatureType.FLOAT_RESCALED,
+        scale=2.0,
+    ),
+
+    "custom_float": NumericalFeature(
+        name="custom_float",
+        feature_type=FeatureType.FLOAT,
+        preprocessors=[
+            tf.keras.layers.Rescaling,
+            tf.keras.layers.Normalization,
+            DistributionAwareEncoder,
+        ],
+    ),
+}
+
+# Now we can create a preprocessing model with the features
+ppr = PreprocessingModel(
+    path_data="sample_data.csv",
+    features_specs=features,
+    features_stats_path="features_stats.json",
+    overwrite_stats=True,
+
+    # Add numerical embedding
+    # Use advanced numerical embedding for individual features
+    use_advanced_numerical_embedding=True,
+    # Use global numerical embedding for all features
+    use_global_numerical_embedding=True,
+
+    output_mode=OutputModeOptions.CONCAT,
+)
+
+# Build the preprocessor
+result = ppr.build_preprocessor()
+
+# Transform data using direct model prediction
+# (test_batch is assumed to be a batch of inputs matching the feature specs above)
+transformed_data = ppr.model.predict(test_batch)
+
+# Get feature importances
+feature_importances = ppr.get_feature_importances()
+```
+Here is the plot of the model:
+![Complex Model](imgs/numerical_example_model_with_advanced_numerical_embedding.png)
diff --git a/docs/imgs/complex_example.png b/docs/imgs/complex_example.png
diff --git a/docs/imgs/numerical_example_model_with_advanced_numerical_embedding.png b/docs/imgs/numerical_example_model_with_advanced_numerical_embedding.png
diff --git a/kdp/custom_layers.py b/kdp/custom_layers.py
@@ -2122,6 +2122,9 @@ def call(self, inputs: tf.Tensor, training: bool = False) -> tf.Tensor:
         # Combine branches via a per-feature, per-dimension gate.
         gate = tf.nn.sigmoid(self.gate)  # (num_features, embedding_dim)
         output = gate * cont + (1 - gate) * disc  # (batch, num_features, embedding_dim)
+        # If only one feature is provided, squeeze the features axis.
+        if self.num_features == 1:
+            return tf.squeeze(output, axis=1)  # New shape: (batch, embedding_dim)
         return output
 
     def get_config(self):
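
The gating and single-feature squeeze in the `call` hunk above can be illustrated with a small sketch. It works on one sample with plain Python lists instead of tensors, so it omits the batch axis; the function name `blend_branches` and its arguments are hypothetical and only mirror the names `cont`, `disc`, and `gate` from the diff.

```python
import math


def blend_branches(cont, disc, gate_logits):
    """Blend continuous and discrete branch outputs for one sample.

    All three arguments are (num_features x embedding_dim) nested lists.
    Hypothetical illustration of the gate in the diff, not the KDP code.
    """
    output = []
    for c_row, d_row, g_row in zip(cont, disc, gate_logits):
        row = []
        for c, d, g in zip(c_row, d_row, g_row):
            gate = 1.0 / (1.0 + math.exp(-g))  # sigmoid, as in tf.nn.sigmoid
            row.append(gate * c + (1.0 - gate) * d)
        output.append(row)
    # Mirror the new behaviour: with a single feature, drop the features axis.
    if len(output) == 1:
        return output[0]
    return output


# One feature: the result is a flat vector of length embedding_dim.
single = blend_branches([[1.0, 0.0]], [[3.0, 4.0]], [[0.0, 0.0]])

# Two features: the features axis is kept.
multi = blend_branches(
    [[1.0, 0.0], [2.0, 2.0]],
    [[3.0, 4.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.0, 0.0]],
)
```

A gate logit of zero weights both branches equally, so each output value is the midpoint of the two branch values. Callers that rely on a three-dimensional output should note that the single-feature case now returns one dimension fewer.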