Commit 1916c40

docs(KDP): updating DistributionEncoder docs
1 parent 00e75d6 commit 1916c40

1 file changed: +150 -98 lines changed

docs/distribution_aware_encoder.md

Lines changed: 150 additions & 98 deletions
@@ -1,204 +1,256 @@
11
# Distribution-Aware Encoder
22

33
## Overview
4-
The **Distribution-Aware Encoder** is an advanced preprocessing layer that automatically detects and handles various types of data distributions. It leverages TensorFlow Probability (tfp) for accurate modeling and applies specialized transformations while preserving the statistical properties of the data.
54

6-
## Features
5+
The **Distribution-Aware Encoder** is an advanced preprocessing layer that automatically detects and handles various types of data distributions. It applies specialized transformations to improve model performance while preserving the statistical properties of the data. Built on pure TensorFlow operations without dependencies on TensorFlow Probability, it's lightweight and easy to deploy.
6+
7+
## Key Features
8+
9+
### 1. Automatic Distribution Detection
10+
- Uses statistical moments (mean, variance, skewness, kurtosis) to identify distribution types
11+
- Employs histogram analysis for multimodality detection
12+
- Performs autocorrelation analysis for periodic pattern detection
13+
- Adapts to data characteristics during training
14+
15+
### 2. Intelligent Transformations
16+
- Applies distribution-specific transformations automatically
17+
- Handles 16 different distribution types with specialized approaches
18+
- Adds Fourier features (sin/cos) for periodic data
19+
- Special handling for sparse data and zero values
20+
21+
### 3. Flexible Output Options
22+
- Optional projection to fixed embedding dimension
23+
- Distribution-specific embeddings can be added to outputs
24+
- Automatic feature expansion for periodic data
25+
26+
### 4. Production-Ready Implementation
27+
- Graph mode compatible for TensorFlow's static graph execution
28+
- No dependencies on TensorFlow Probability for easier deployment
29+
- Serialization support for model saving and loading
30+
31+
## Distribution Types Supported
32+
33+
The encoder automatically detects and handles these distribution types:
734

8-
### Distribution Types Supported
935
1. **Normal Distribution**
1036
- For standard normally distributed data
11-
- Handled via z-score normalization
12-
- Detection: Kurtosis ≈ 3.0, Skewness ≈ 0
37+
- Detection: Skewness < 0.5, Kurtosis ≈ 3.0
1338

1439
2. **Heavy-Tailed Distribution**
1540
- For data with heavier tails than normal
16-
- Handled via Student's t-distribution
17-
- Detection: Kurtosis > 3.5
41+
- Detection: Kurtosis > 4.0
1842

1943
3. **Multimodal Distribution**
2044
- For data with multiple peaks
21-
- Handled via Gaussian Mixture Models
22-
- Detection: KDE-based peak detection
45+
- Detection: Multiple significant peaks in histogram
2346

2447
4. **Uniform Distribution**
25-
- For evenly distributed data
26-
- Handled via min-max scaling
27-
- Detection: Kurtosis ≈ 1.8
48+
- For evenly distributed data between bounds
49+
- Detection: Bounded between 0 and 1
2850

2951
5. **Exponential Distribution**
3052
- For data with exponential decay
31-
- Handled via rate-based transformation
32-
- Detection: Skewness ≈ 2.0
53+
- Detection: Positive values with skewness > 1.0
3354

3455
6. **Log-Normal Distribution**
3556
- For data that is normal after log transform
36-
- Handled via logarithmic transformation
37-
- Detection: Log-transformed kurtosis ≈ 3.0
57+
- Detection: Positive values with skewness > 2.0
3858

3959
7. **Discrete Distribution**
4060
- For data with finite distinct values
41-
- Handled via rank-based normalization
42-
- Detection: Unique values analysis
61+
- Detection: Low unique value ratio (< 0.1)
4362

4463
8. **Periodic Distribution**
4564
- For data with cyclic patterns
46-
- Handled via Fourier features (sin/cos)
47-
- Detection: Peak spacing analysis
65+
- Detection: Significant peaks in autocorrelation
4866

4967
9. **Sparse Distribution**
5068
- For data with many zeros
51-
- Handled via separate zero/non-zero transformations
52-
- Detection: Zero ratio analysis
69+
- Detection: Zero ratio > 0.5
5370

5471
10. **Beta Distribution**
55-
- For bounded data between 0 and 1
56-
- Handled via beta CDF transformation
57-
- Detection: Value range and shape analysis
72+
- For bounded data between 0 and 1 with shape parameters
73+
- Detection: Bounded between 0 and 1 with skewness > 0.5
5874

5975
11. **Gamma Distribution**
6076
- For positive, right-skewed data
61-
- Handled via gamma CDF transformation
62-
- Detection: Positive support and skewness
77+
- Detection: Positive values with mild skewness (> 0.5)
6378

6479
12. **Poisson Distribution**
6580
- For count data
66-
- Handled via rate parameter estimation
67-
- Detection: Integer values and variance≈mean
81+
- Handled implicitly through other transformations
6882

69-
14. **Cauchy Distribution**
83+
13. **Cauchy Distribution**
7084
- For extremely heavy-tailed data
71-
- Handled via robust location-scale estimation
72-
- Detection: Undefined moments
85+
- Detection: Very high kurtosis (> 10.0)
7386

74-
15. **Zero-Inflated Distribution**
87+
14. **Zero-Inflated Distribution**
7588
- For data with excess zeros
76-
- Handled via mixture model approach
77-
- Detection: Zero proportion analysis
89+
- Detection: Moderate zero ratio (0.3-0.5)
90+
91+
15. **Bounded Distribution**
92+
- For data with known bounds
93+
- Handled implicitly through other transformations
94+
95+
16. **Ordinal Distribution**
96+
- For ordered categorical data
97+
- Handled similarly to discrete distributions
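To make the thresholds in the list above concrete, here is a minimal rule-based sketch in plain Python. It is not the encoder's implementation: the docs describe scoring each type and picking the best match, whereas this sketch simply checks rules in a fixed order. The function name `guess_distribution`, the `stats` dictionary keys, the check order, the use of `abs(skewness)` for the normal case, and the kurtosis tolerance of 1.0 are all assumptions made for illustration. Multimodal and periodic detection are omitted because they need histogram and autocorrelation analysis rather than summary statistics.

```python
def guess_distribution(stats):
    """Map summary statistics to a distribution label (illustrative only).

    `stats` is a hypothetical dict with the keys: skewness, kurtosis,
    zero_ratio, unique_ratio, min, max. The order of the checks below
    is an assumption, not the library's scoring logic.
    """
    skew = stats["skewness"]
    kurt = stats["kurtosis"]
    positive = stats["min"] > 0
    unit_bounded = stats["min"] >= 0.0 and stats["max"] <= 1.0

    # Zero-heavy data first: sparse (> 0.5) vs. zero-inflated (0.3-0.5)
    if stats["zero_ratio"] > 0.5:
        return "sparse"
    if 0.3 <= stats["zero_ratio"] <= 0.5:
        return "zero_inflated"
    # Few distinct values relative to sample size
    if stats["unique_ratio"] < 0.1:
        return "discrete"
    # Extremely heavy tails
    if kurt > 10.0:
        return "cauchy"
    # Data bounded in [0, 1]: skewed -> beta, otherwise uniform
    if unit_bounded:
        return "beta" if skew > 0.5 else "uniform"
    # Positive, right-skewed families, strongest skew first
    if positive and skew > 2.0:
        return "log_normal"
    if positive and skew > 1.0:
        return "exponential"
    if positive and skew > 0.5:
        return "gamma"
    if kurt > 4.0:
        return "heavy_tailed"
    # Tolerance of 1.0 around kurtosis 3.0 is an assumed value
    if abs(skew) < 0.5 and abs(kurt - 3.0) < 1.0:
        return "normal"
    return "unknown"
```

Checking the strongest skewness threshold first matters here, because any log-normal candidate (skewness > 2.0) would also satisfy the exponential and gamma conditions.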
7898

7999
## Usage
80100

81101
### Basic Usage
82102

83-
The Distribution-Aware Encoder works seamlessly (and only) with numerical features. Enable it by setting `use_distribution_aware=True` in the `PreprocessingModel`.
103+
The Distribution-Aware Encoder works seamlessly with numerical features. Enable it by setting `use_distribution_aware=True` in the `PreprocessingModel`.
84104

85105
```python
86106
from kdp.processor import PreprocessingModel
87107
from kdp.features import NumericalFeature
88108

89109
# Define features
90110
features = {
91-
# Numerical features
92111
"feature1": NumericalFeature(),
93112
"feature2": NumericalFeature(),
94-
# etc ..
113+
# etc.
95114
}
96115

97-
# Initialize the model
98-
model = PreprocessingModel( # here
116+
# Initialize the model with distribution-aware encoding
117+
model = PreprocessingModel(
99118
features=features,
100119
use_distribution_aware=True
101120
)
102121
```
103122

104-
### Manual Usage
123+
### Manual Usage with Specific Distribution
124+
125+
You can specify a preferred distribution type for specific features:
105126

106127
```python
107128
from kdp.processor import PreprocessingModel
108129
from kdp.features import NumericalFeature, FeatureType
130+
from kdp.layers.distribution_aware_encoder_layer import DistributionType
109131

110-
# Define features
132+
# Define features with specific distribution preferences
111133
features = {
112-
# Numerical features
113134
"feature1": NumericalFeature(
114135
name="feature1",
115136
feature_type=FeatureType.FLOAT_NORMALIZED
116137
),
117138
"feature2": NumericalFeature(
118139
name="feature2",
119140
feature_type=FeatureType.FLOAT_RESCALED,
120-
prefered_distribution="log_normal" # here we could specify a prefered distribution (normal, periodic, etc)
141+
prefered_distribution=DistributionType.LOG_NORMAL # Specify preferred distribution
121142
)
122-
# etc ..
143+
# etc.
123144
}
124145

125146
# Initialize the model
126-
model = PreprocessingModel( # here
147+
model = PreprocessingModel(
127148
features=features,
128-
use_distribution_aware=True,
129-
distribution_aware_bins=1000, # 1000 is the default value, but you can change it for finer data
149+
use_distribution_aware=True
130150
)
131151
```
132152

133-
### Advanced Configuration
153+
### Direct Layer Usage
154+
155+
You can also use the layer directly in your Keras models:
156+
134157
```python
135-
encoder = DistributionAwareEncoder(
136-
num_bins=1000,
137-
epsilon=1e-6,
138-
detect_periodicity=True,
139-
handle_sparsity=True,
140-
adaptive_binning=True,
141-
mixture_components=3,
142-
trainable=True
143-
)
158+
import tensorflow as tf
159+
from kdp.layers import DistributionAwareEncoder
160+
161+
# Creating a model with automatic distribution detection
162+
inputs = tf.keras.Input(shape=(10,))
163+
encoded = DistributionAwareEncoder(embedding_dim=16)(inputs)
164+
outputs = tf.keras.layers.Dense(1)(encoded)
165+
model = tf.keras.Model(inputs, outputs)
166+
167+
# Save and load model with custom objects
168+
model.save("my_model.keras")
169+
custom_objects = DistributionAwareEncoder.get_custom_objects()
170+
loaded_model = tf.keras.models.load_model("my_model.keras", custom_objects=custom_objects)
144171
```
145172

146173
## Parameters
147174

148175
| Parameter | Type | Default | Description |
149176
|-----------|------|---------|-------------|
150-
| num_bins | int | 1000 | Number of bins for quantile encoding |
151-
| epsilon | float | 1e-6 | Small value for numerical stability |
152-
| detect_periodicity | bool | True | Enable periodic pattern detection |
153-
| handle_sparsity | bool | True | Enable special handling for sparse data |
154-
| adaptive_binning | bool | True | Enable adaptive bin boundaries |
155-
| mixture_components | int | 3 | Number of components for mixture models |
156-
| trainable | bool | True | Whether parameters are trainable |
157-
| prefered_distribution | DistributionType | None | Manually specify distribution type |
177+
| embedding_dim | int or None | None | Output dimension for feature projection. If specified, a Dense layer projects the transformed features to this dimension. |
178+
| epsilon | float | 1e-6 | Small value to prevent numerical issues. |
179+
| detect_periodicity | bool | True | If True, checks for and handles periodic patterns by adding sin/cos features. |
180+
| handle_sparsity | bool | True | If True, applies special handling for sparse data (many zeros). |
181+
| auto_detect | bool | True | If True, automatically detects distribution type during training. |
182+
| distribution_type | str | "unknown" | Specific distribution type to use if auto_detect is False. |
183+
| transform_type | str | "auto" | Type of transformation to apply via DistributionTransformLayer. |
184+
| add_distribution_embedding | bool | False | If True, adds a learned embedding for the detected distribution type. |
185+
| trainable | bool | True | Whether the layer is trainable. |
158186

159-
## Key Features
187+
## Output Dimensions
160188

161-
### 1. Automatic Distribution Detection
162-
- Uses statistical moments and tests
163-
- Employs KDE for multimodality detection
164-
- Handles mixed distributions via ensemble approach
189+
The output dimensions depend on the configuration:
165190

166-
### 2. Adaptive Transformations
167-
- Learns optimal parameters during training
168-
- Adjusts to data distribution changes
169-
- Handles complex periodic patterns
191+
- **Base case**: Same shape as input
192+
- **With periodic features**: Input dimension × 3 (original + sin + cos features)
193+
- **With embedding_dim**: (batch_size, embedding_dim)
194+
- **With add_distribution_embedding**: Output has 8 additional dimensions
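The rules above can be combined into a small helper for sanity-checking shapes. This is only a sketch: the docs do not say how the options interact, so the order used here (periodic expansion first, then projection to `embedding_dim`, then the 8 embedding dimensions appended last) is an assumption, and `expected_output_dim` is a hypothetical name, not part of KDP.

```python
def expected_output_dim(input_dim, periodic=False, embedding_dim=None,
                        add_distribution_embedding=False):
    """Estimate the last output dimension (illustrative, assumed ordering)."""
    # Periodic data: original + sin + cos features
    dim = input_dim * 3 if periodic else input_dim
    # Assumption: a projection replaces the feature dimension entirely
    if embedding_dim is not None:
        dim = embedding_dim
    # Assumption: the distribution embedding is concatenated after projection
    if add_distribution_embedding:
        dim += 8
    return dim


# Example: 10 input features, as in the direct layer usage snippet
print(expected_output_dim(10))                    # base case
print(expected_output_dim(10, periodic=True))     # with periodic features
print(expected_output_dim(10, embedding_dim=16))  # with projection
```

When in doubt, inspecting the layer's actual output shape on a sample batch is more reliable than this estimate.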
170195

171-
### 3. Fourier Feature Generation
172-
- Automatic frequency detection
173-
- Multiple harmonic components
174-
- Phase-aware transformations
196+
## Implementation Details
175197

176-
### 4. Robust Handling
177-
- Special treatment for zeros
178-
- Outlier-resistant transformations
179-
- Numerical stability safeguards
198+
### 1. Distribution Detection Process
180199

181-
## Implementation Details
200+
The encoder uses statistical moments and specialized tests to detect the distribution type:
201+
202+
```python
203+
# Calculate basic statistics
204+
mean = tf.reduce_mean(x)
205+
variance = tf.math.reduce_variance(x)
206+
std = tf.sqrt(variance + epsilon)
207+
208+
# Standardize for higher moments
209+
x_std = (x - mean) / (std + epsilon)
210+
211+
# Calculate skewness and kurtosis
212+
skewness = tf.reduce_mean(tf.pow(x_std, 3))
213+
kurtosis = tf.reduce_mean(tf.pow(x_std, 4))
214+
215+
# Check for zeros and sparsity
216+
zero_ratio = tf.reduce_mean(tf.cast(tf.abs(x) < epsilon, tf.float32))
217+
218+
# Check for discreteness
219+
unique_ratio = tf.size(tf.unique(tf.reshape(x, [-1]))[0]) / tf.size(x)
220+
221+
# Score each distribution type and select the best match
222+
```
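For readers who want to try these statistics without TensorFlow, the same quantities can be computed in plain Python. The sketch below mirrors the snippet above step by step; `summary_stats` is a hypothetical helper name, and it uses the population variance (dividing by the number of values), which is what `tf.math.reduce_variance` computes.

```python
import math


def summary_stats(values, epsilon=1e-6):
    """Compute the moments and ratios used for distribution detection."""
    n = len(values)
    mean = sum(values) / n
    # Population variance, matching tf.math.reduce_variance
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance + epsilon)

    # Standardize before taking the third and fourth moments
    standardized = [(v - mean) / (std + epsilon) for v in values]
    skewness = sum(z ** 3 for z in standardized) / n
    kurtosis = sum(z ** 4 for z in standardized) / n  # about 3.0 for normal data

    # Fraction of (near-)zero values and fraction of distinct values
    zero_ratio = sum(1 for v in values if abs(v) < epsilon) / n
    unique_ratio = len(set(values)) / n

    return {
        "mean": mean,
        "variance": variance,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "zero_ratio": zero_ratio,
        "unique_ratio": unique_ratio,
    }


stats = summary_stats([-2.0, -1.0, 0.0, 1.0, 2.0])
print(stats["skewness"], stats["kurtosis"], stats["zero_ratio"])
```

Note that the kurtosis here is the raw fourth standardized moment, not excess kurtosis, which is why the normal-distribution reference value is 3.0 rather than 0.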
223+
224+
### 2. Periodic Data Handling
225+
226+
For data with detected periodicity, the encoder adds Fourier features:
182227

183-
### 1. Periodic Data Handling
184228
```python
185229
# Normalize to [-π, π] range
186-
normalized = inputs * π / scale
230+
normalized = (x - mean) / (std + epsilon) * π
231+
187232
# Generate Fourier features
188-
features = [
189-
sin(freq * normalized + phase),
190-
cos(freq * normalized + phase)
191-
]
192-
# Add harmonics if multimodal
193-
if is_multimodal:
194-
for h in [2, 3, 4]:
195-
features.extend([
196-
sin(h * freq * normalized + phase),
197-
cos(h * freq * normalized + phase)
198-
])
233+
sin_feature = tf.sin(frequency * normalized + phase)
234+
cos_feature = tf.cos(frequency * normalized + phase)
235+
236+
# Combine with original data
237+
transformed = tf.concat([x, sin_feature, cos_feature], axis=-1)
199238
```
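The fragment above leaves `frequency`, `phase`, and `π` undefined. A self-contained version in plain Python might look like the following sketch, where `fourier_features` is a hypothetical helper and the default `frequency=1.0` and `phase=0.0` are assumed values rather than the layer's learned or configured parameters. One caveat: standardizing and multiplying by π maps values within one standard deviation of the mean to [-π, π], but values further out fall beyond that range.

```python
import math


def fourier_features(values, frequency=1.0, phase=0.0, epsilon=1e-6):
    """Return [original, sin, cos] for each value (illustrative sketch)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n + epsilon)

    rows = []
    for v in values:
        # Standardize, then scale by pi before applying sin/cos
        normalized = (v - mean) / (std + epsilon) * math.pi
        angle = frequency * normalized + phase
        # Keep the original value alongside the two Fourier features
        rows.append([v, math.sin(angle), math.cos(angle)])
    return rows


# Each input value expands to three output features
for row in fourier_features([0.0, 6.0, 12.0, 18.0]):
    print(row)
```

This is where the factor of 3 in the output dimension for periodic data comes from: one original column plus one sine and one cosine column per input feature.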
200239

201-
### 2. Distribution Detection
240+
### 3. Model Serialization
241+
242+
When saving models containing the DistributionAwareEncoder:
243+
244+
```python
245+
from kdp.layers import DistributionAwareEncoder, get_custom_objects
246+
247+
# Save the model
248+
model.save("my_model.keras")
249+
250+
# Load the model with custom objects
251+
custom_objects = get_custom_objects()
252+
loaded_model = tf.keras.models.load_model("my_model.keras", custom_objects=custom_objects)
253+
```
202254
```python
203255
# Statistical moments
204256
mean = tf.reduce_mean(inputs)
