Commit d8da4c5

docs(KDP): added docs
1 parent 5cb4e8e commit d8da4c5

6 files changed

+182
-0
lines changed

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
+# Advanced Numerical Embeddings in KDP
+
+Keras Data Processor (KDP) now provides advanced numerical embedding techniques to better capture complex numerical relationships in your data. This release introduces two embedding approaches:
+
+---
+
+## AdvancedNumericalEmbedding
+
+**Purpose:**
+Processes individual numerical features with tailored embedding layers. This layer performs adaptive binning, applies MLP transformations per feature, and can incorporate dropout and batch normalization.
+
+**Key Parameters:**
+- **`embedding_dim`**: Dimension for each feature's embedding.
+- **`mlp_hidden_units`**: Number of hidden units in the MLP applied to each feature.
+- **`num_bins`**: Number of bins used for discretizing continuous inputs.
+- **`init_min` and `init_max`**: Initialization boundaries for binning.
+- **`dropout_rate`**: Dropout rate for regularization.
+- **`use_batch_norm`**: Flag to apply batch normalization.
+
+**Usage Example:**
+```python
+from kdp.custom_layers import AdvancedNumericalEmbedding
+import tensorflow as tf
+
+layer = AdvancedNumericalEmbedding(
+    embedding_dim=8,
+    mlp_hidden_units=16,
+    num_bins=10,
+    init_min=[-3.0, -2.0, -4.0],
+    init_max=[3.0, 2.0, 4.0],
+    dropout_rate=0.1,
+    use_batch_norm=True,
+)
+
+# Input shape: (batch_size, num_features)
+x = tf.random.normal((32, 3))
+# Output shape: (32, 3, 8)
+output = layer(x, training=False)
+```
+
+---
+
+## GlobalAdvancedNumericalEmbedding
+
+**Purpose:**
+Combines a set of numerical features into a single, compact representation. It does so by applying an internal advanced numerical embedding on the concatenated input and then performing a global pooling over all features.
+
+**Key Parameters (prefixed with `global_`):**
+- **`global_embedding_dim`**: Global embedding dimension (final pooled vector size).
+- **`global_mlp_hidden_units`**: Hidden units in the global MLP.
+- **`global_num_bins`**: Number of bins for discretization.
+- **`global_init_min` and `global_init_max`**: Global initialization boundaries.
+- **`global_dropout_rate`**: Dropout rate.
+- **`global_use_batch_norm`**: Whether to apply batch normalization.
+- **`global_pooling`**: Pooling method to use ("average" or "max").
+
+**Usage Example:**
+```python
+from kdp.custom_layers import GlobalAdvancedNumericalEmbedding
+import tensorflow as tf
+
+global_layer = GlobalAdvancedNumericalEmbedding(
+    global_embedding_dim=8,
+    global_mlp_hidden_units=16,
+    global_num_bins=10,
+    global_init_min=[-3.0, -2.0],
+    global_init_max=[3.0, 2.0],
+    global_dropout_rate=0.1,
+    global_use_batch_norm=True,
+    global_pooling="average"
+)
+
+# Input shape: (batch_size, num_features)
+x = tf.random.normal((32, 3))
+# Global output shape: (32, 8)
+global_output = global_layer(x, training=False)
+```
+
+---
+
+## When to Use Which?
+
+- **AdvancedNumericalEmbedding:**
+Use this when you need to process each numerical feature individually, preserving their distinct characteristics via per-feature embeddings.
+
+- **GlobalAdvancedNumericalEmbedding:**
+Choose this option when you want to merge multiple numerical features into a unified global embedding using a pooling mechanism. This is particularly useful when the overall interaction across features is more important than the individual feature details.
+
+## Advanced Configuration
+
+Both layers offer additional parameters to fine-tune the embedding process. You can adjust dropout rates, batch normalization, and binning strategies to best suit your data. For more detailed information, please refer to the API documentation.
+
+---
+
+This document highlights the key differences and usage examples for the new advanced numerical embeddings available in KDP.
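
The layers documented in this file rest on two ideas: discretizing each numerical feature into bins, and (for the global variant) pooling per-feature embeddings into one vector. The following is a minimal pure-Python sketch of those two steps, not KDP's implementation. It assumes fixed equal-width bins between `init_min` and `init_max` (the layer is described as adaptive, so its real boundaries are presumably learned), and the helper names `bin_index` and `pool_embeddings` are hypothetical.

```python
from typing import List, Sequence


def bin_index(value: float, lo: float, hi: float, num_bins: int) -> int:
    """Return the equal-width bin of `value` over [lo, hi], clamping outliers."""
    if num_bins < 1 or hi <= lo:
        raise ValueError("need num_bins >= 1 and hi > lo")
    width = (hi - lo) / num_bins
    idx = int((value - lo) // width)
    return max(0, min(num_bins - 1, idx))


def pool_embeddings(
    per_feature: Sequence[Sequence[float]], mode: str = "average"
) -> List[float]:
    """Reduce (num_features, embedding_dim) to (embedding_dim,)."""
    columns = list(zip(*per_feature))
    if mode == "average":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode!r}")


# Three features, each with its own range, as in the per-feature usage example.
init_min = [-3.0, -2.0, -4.0]
init_max = [3.0, 2.0, 4.0]
row = [0.5, -1.5, 10.0]  # the last value lies outside its range and is clamped
bins = [bin_index(v, lo, hi, 10) for v, lo, hi in zip(row, init_min, init_max)]

# Two features with 2-dimensional embeddings pooled into a single vector.
pooled_avg = pool_embeddings([[1.0, 4.0], [3.0, 2.0]], "average")
pooled_max = pool_embeddings([[1.0, 4.0], [3.0, 2.0]], "max")
```

One detail worth checking in the global usage example: it passes two boundary values (`global_init_min=[-3.0, -2.0]`) for a three-feature input, whereas the per-feature example passes one boundary per feature. If boundaries are meant to be per feature, those lengths should probably match.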

docs/complex_example.md

Lines changed: 21 additions & 0 deletions
@@ -123,6 +123,27 @@ ppr = PreprocessingModel(
     # Distribution aware configuration
     use_distribution_aware=True, # here we activate the distribution aware encoder
     distribution_aware_bins=1000, # thats the default value, but you can change it for finer data
+
+    # Add advanced numerical embedding
+    use_advanced_numerical_embedding=True,
+    embedding_dim=32, # Match embedding size with categorical features
+    mlp_hidden_units=16,
+    num_bins=10,
+    init_min=-3.0,
+    init_max=3.0,
+    dropout_rate=0.1,
+    use_batch_norm=True,
+
+    # Add global numerical embedding
+    use_global_numerical_embedding=True,
+    global_embedding_dim=32, # Match embedding dimensions
+    global_mlp_hidden_units=16,
+    global_num_bins=10,
+    global_init_min=-3.0,
+    global_init_max=3.0,
+    global_dropout_rate=0.1,
+    global_use_batch_norm=True,
+    global_pooling="average",
 )

 # Build the preprocessor
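
This configuration passes scalar `init_min`/`init_max` values, while the standalone layer example elsewhere in this commit passes one value per feature. The diff does not show how KDP reconciles the two forms. One plausible reading is that a scalar applies to every feature; the hypothetical helper below (`as_per_feature` is not a KDP function) sketches that broadcasting under exactly that assumption.

```python
from typing import List, Sequence, Union

Number = Union[int, float]


def as_per_feature(
    value: Union[Number, Sequence[Number]], num_features: int
) -> List[float]:
    """Expand a scalar to one value per feature, or validate a sequence's length."""
    if isinstance(value, (int, float)):
        return [float(value)] * num_features
    values = [float(v) for v in value]
    if len(values) != num_features:
        raise ValueError(f"expected {num_features} values, got {len(values)}")
    return values


# Scalar form, as in the PreprocessingModel configuration.
scalar_bounds = as_per_feature(-3.0, 3)
# List form, as in the standalone layer example.
list_bounds = as_per_feature([-3.0, -2.0, -4.0], 3)
```

Validating the length up front gives a clear error when a boundary list and the feature count disagree, instead of a shape error deeper in the model.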

docs/example_usages.md

Lines changed: 63 additions & 0 deletions
@@ -362,3 +362,66 @@ feature_importances = ppr.get_feature_importances()
 ```
 Here is the plot of the model:
 ![Complex Model](imgs/numerical_example_model_with_distribution_aware.png)
+
+
+## Example 5: Numerical features with numerical embedding
+
+Numerical embedding is a technique that allows us to embed numerical features into a higher dimensional space.
+This can be useful for capturing non-linear relationships within/between numerical feature/s.
+
+```python
+from kdp.features import NumericalFeature, FeatureType
+from kdp.processor import PreprocessingModel, OutputModeOptions
+
+
+# Define features
+features = {
+    "basic_float": NumericalFeature(
+        name="basic_float",
+        feature_type=FeatureType.FLOAT,
+    ),
+
+    "rescaled_float": NumericalFeature(
+        name="rescaled_float",
+        feature_type=FeatureType.FLOAT_RESCALED,
+        scale=2.0,
+    ),
+
+    "custom_float": NumericalFeature(
+        name="custom_float",
+        feature_type=FeatureType.FLOAT,
+        preprocessors=[
+            tf.keras.layers.Rescaling,
+            tf.keras.layers.Normalization,
+            DistributionAwareEncoder,
+        ],
+    ),
+}
+
+# Now we can create a preprocessing model with the features
+ppr = PreprocessingModel(
+    path_data="sample_data.csv",
+    features_specs=features,
+    features_stats_path="features_stats.json",
+    overwrite_stats=True,
+
+    # Add numerical embedding
+    # Use advanced numerical embedding for individual features
+    use_advanced_numerical_embedding=True,
+    # Use global numerical embedding for all features
+    use_global_numerical_embedding=True,
+
+    output_mode=OutputModeOptions.CONCAT,
+)
+
+# Build the preprocessor
+result = ppr.build_preprocessor()
+
+# Transform data using direct model prediction
+transformed_data = ppr.model.predict(test_batch)
+
+# Get feature importances
+feature_importances = ppr.get_feature_importances()
+```
+Here is the plot of the model:
+![Complex Model](imgs/numerical_example_model_with_advanced_numerical_embedding.png)
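
Example 5 reads `sample_data.csv`, which the snippet does not create. A minimal standard-library sketch for producing a compatible file follows. It assumes the CSV needs one column per feature name (`basic_float`, `rescaled_float`, `custom_float`) and that random Gaussian values are acceptable placeholder data; the helper `write_sample_csv` is hypothetical and not part of KDP. Note also that the snippet uses `tf`, `DistributionAwareEncoder`, and `test_batch` without importing or defining them, so those would need to be supplied before it runs.

```python
import csv
import io
import random
from typing import TextIO

FEATURE_NAMES = ["basic_float", "rescaled_float", "custom_float"]


def write_sample_csv(out: TextIO, n_rows: int, seed: int = 0) -> None:
    """Write a header plus n_rows of Gaussian floats, one column per feature."""
    rng = random.Random(seed)  # seeded so repeated runs give the same file
    writer = csv.writer(out)
    writer.writerow(FEATURE_NAMES)
    for _ in range(n_rows):
        writer.writerow([f"{rng.gauss(0.0, 1.0):.4f}" for _ in FEATURE_NAMES])


# Build the CSV in memory first so the result can be inspected.
buffer = io.StringIO()
write_sample_csv(buffer, n_rows=5)
sample_lines = buffer.getvalue().splitlines()
```

To write the file on disk instead, open it with `open("sample_data.csv", "w", newline="")` and pass the handle to `write_sample_csv`.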

docs/imgs/complex_example.png

Binary file, 442 KB.

A second binary file (name not shown in this view), 181 KB.

kdp/custom_layers.py

Lines changed: 3 additions & 0 deletions
@@ -2122,6 +2122,9 @@ def call(self, inputs: tf.Tensor, training: bool = False) -> tf.Tensor:
         # Combine branches via a per-feature, per-dimension gate.
         gate = tf.nn.sigmoid(self.gate) # (num_features, embedding_dim)
         output = gate * cont + (1 - gate) * disc # (batch, num_features, embedding_dim)
+        # If only one feature is provided, squeeze the features axis.
+        if self.num_features == 1:
+            return tf.squeeze(output, axis=1) # New shape: (batch, embedding_dim)
         return output

     def get_config(self):
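
This hunk changes the rank of the layer's output when only one feature is present. Below is a small pure-Python sketch of the two steps it touches: the sigmoid-gated mix of the continuous and discrete branches, and the resulting output shape. The function names are hypothetical, and scalars stand in for tensors.

```python
import math
from typing import Tuple


def gated_mix(gate_logit: float, cont: float, disc: float) -> float:
    """Blend two branch outputs: sigmoid(g) * cont + (1 - sigmoid(g)) * disc."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return gate * cont + (1.0 - gate) * disc


def output_shape(batch: int, num_features: int, embedding_dim: int) -> Tuple[int, ...]:
    """Shape after the change: the feature axis is dropped for a single feature."""
    if num_features == 1:
        return (batch, embedding_dim)
    return (batch, num_features, embedding_dim)


mixed = gated_mix(0.0, cont=2.0, disc=4.0)  # a logit of 0 weights both branches equally
single = output_shape(32, 1, 8)
multi = output_shape(32, 3, 8)
```

Because the rank now depends on `num_features`, downstream code that indexes the feature axis would need to handle both the rank-2 and rank-3 cases.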
