|
1 | 1 | # Distribution-Aware Encoder |
2 | 2 |
|
3 | 3 | ## Overview |
4 | | -The **Distribution-Aware Encoder** is an advanced preprocessing layer that automatically detects and handles various types of data distributions. It leverages TensorFlow Probability (tfp) for accurate modeling and applies specialized transformations while preserving the statistical properties of the data. |
5 | 4 |
|
6 | | -## Features |
| 5 | +The **Distribution-Aware Encoder** is an advanced preprocessing layer that automatically detects and handles various types of data distributions. It applies specialized transformations to improve model performance while preserving the statistical properties of the data. Built on pure TensorFlow operations without dependencies on TensorFlow Probability, it's lightweight and easy to deploy. |
| 6 | + |
| 7 | +## Key Features |
| 8 | + |
| 9 | +### 1. Automatic Distribution Detection |
| 10 | +- Uses statistical moments (mean, variance, skewness, kurtosis) to identify distribution types |
| 11 | +- Employs histogram analysis for multimodality detection |
| 12 | +- Performs autocorrelation analysis for periodic pattern detection |
| 13 | +- Adapts to data characteristics during training |
| 14 | + |
| 15 | +### 2. Intelligent Transformations |
| 16 | +- Applies distribution-specific transformations automatically |
| 17 | +- Handles 16 different distribution types with specialized approaches |
| 18 | +- Adds Fourier features (sin/cos) for periodic data |
| 19 | +- Special handling for sparse data and zero values |
| 20 | + |
| 21 | +### 3. Flexible Output Options |
| 22 | +- Optional projection to fixed embedding dimension |
| 23 | +- Distribution-specific embeddings can be added to outputs |
| 24 | +- Automatic feature expansion for periodic data |
| 25 | + |
| 26 | +### 4. Production-Ready Implementation |
| 27 | +- Graph mode compatible for TensorFlow's static graph execution |
| 28 | +- No dependencies on TensorFlow Probability for easier deployment |
| 29 | +- Serialization support for model saving and loading |
| 30 | + |
| 31 | +## Distribution Types Supported |
| 32 | + |
| 33 | +The encoder automatically detects and handles these distribution types: |
7 | 34 |
|
8 | | -### Distribution Types Supported |
9 | 35 | 1. **Normal Distribution** |
10 | 36 | - For standard normally distributed data |
11 | | - - Handled via z-score normalization |
12 | | - - Detection: Kurtosis ≈ 3.0, Skewness ≈ 0 |
| 37 | + - Detection: Skewness < 0.5, Kurtosis ≈ 3.0 |
13 | 38 |
|
14 | 39 | 2. **Heavy-Tailed Distribution** |
15 | 40 | - For data with heavier tails than normal |
16 | | - - Handled via Student's t-distribution |
17 | | - - Detection: Kurtosis > 3.5 |
| 41 | + - Detection: Kurtosis > 4.0 |
18 | 42 |
|
19 | 43 | 3. **Multimodal Distribution** |
20 | 44 | - For data with multiple peaks |
21 | | - - Handled via Gaussian Mixture Models |
22 | | - - Detection: KDE-based peak detection |
| 45 | + - Detection: Multiple significant peaks in histogram |
23 | 46 |
|
24 | 47 | 4. **Uniform Distribution** |
25 | | - - For evenly distributed data |
26 | | - - Handled via min-max scaling |
27 | | - - Detection: Kurtosis ≈ 1.8 |
| 48 | + - For evenly distributed data between bounds |
| 49 | + - Detection: Bounded between 0 and 1 |
28 | 50 |
|
29 | 51 | 5. **Exponential Distribution** |
30 | 52 | - For data with exponential decay |
31 | | - - Handled via rate-based transformation |
32 | | - - Detection: Skewness ≈ 2.0 |
| 53 | + - Detection: Positive values with skewness > 1.0 |
33 | 54 |
|
34 | 55 | 6. **Log-Normal Distribution** |
35 | 56 | - For data that is normal after log transform |
36 | | - - Handled via logarithmic transformation |
37 | | - - Detection: Log-transformed kurtosis ≈ 3.0 |
| 57 | + - Detection: Positive values with skewness > 2.0 |
38 | 58 |
|
39 | 59 | 7. **Discrete Distribution** |
40 | 60 | - For data with finite distinct values |
41 | | - - Handled via rank-based normalization |
42 | | - - Detection: Unique values analysis |
| 61 | + - Detection: Low unique value ratio (< 0.1) |
43 | 62 |
|
44 | 63 | 8. **Periodic Distribution** |
45 | 64 | - For data with cyclic patterns |
46 | | - - Handled via Fourier features (sin/cos) |
47 | | - - Detection: Peak spacing analysis |
| 65 | + - Detection: Significant peaks in autocorrelation |
48 | 66 |
|
49 | 67 | 9. **Sparse Distribution** |
50 | 68 | - For data with many zeros |
51 | | - - Handled via separate zero/non-zero transformations |
52 | | - - Detection: Zero ratio analysis |
| 69 | + - Detection: Zero ratio > 0.5 |
53 | 70 |
|
54 | 71 | 10. **Beta Distribution** |
55 | | - - For bounded data between 0 and 1 |
56 | | - - Handled via beta CDF transformation |
57 | | - - Detection: Value range and shape analysis |
| 72 | + - For bounded data between 0 and 1 with shape parameters |
| 73 | + - Detection: Bounded between 0 and 1 with skewness > 0.5 |
58 | 74 |
|
59 | 75 | 11. **Gamma Distribution** |
60 | 76 | - For positive, right-skewed data |
61 | | - - Handled via gamma CDF transformation |
62 | | - - Detection: Positive support and skewness |
| 77 | + - Detection: Positive values with mild skewness (> 0.5) |
63 | 78 |
|
64 | 79 | 12. **Poisson Distribution** |
65 | 80 | - For count data |
66 | | - - Handled via rate parameter estimation |
67 | | - - Detection: Integer values and variance≈mean |
| 81 | + - Handled implicitly through other transformations |
68 | 82 |
|
69 | | -14. **Cauchy Distribution** |
| 83 | +13. **Cauchy Distribution** |
70 | 84 | - For extremely heavy-tailed data |
71 | | - - Handled via robust location-scale estimation |
72 | | - - Detection: Undefined moments |
| 85 | + - Detection: Very high kurtosis (> 10.0) |
73 | 86 |
|
74 | | -15. **Zero-Inflated Distribution** |
| 87 | +14. **Zero-Inflated Distribution** |
75 | 88 | - For data with excess zeros |
76 | | - - Handled via mixture model approach |
77 | | - - Detection: Zero proportion analysis |
| 89 | + - Detection: Moderate zero ratio (0.3-0.5) |
| 90 | + |
| 91 | +15. **Bounded Distribution** |
| 92 | + - For data with known bounds |
| 93 | + - Handled implicitly through other transformations |
| 94 | + |
| 95 | +16. **Ordinal Distribution** |
| 96 | + - For ordered categorical data |
| 97 | + - Handled similarly to discrete distributions |
78 | 98 |
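The detection thresholds listed above can be read as a rule table. The following is a minimal sketch of that reading in plain Python, not the layer's actual scoring logic: the function name `guess_distribution`, the label strings, the order of the checks, and the 0.5 tolerance around kurtosis 3.0 are all assumptions made for illustration.

```python
def guess_distribution(skewness, kurtosis, zero_ratio, unique_ratio, lo, hi):
    """Map summary statistics to a label using the documented thresholds.

    Hypothetical helper: the precedence of the rules is an assumption,
    since the list above does not say which rule wins on overlap.
    """
    # Zero-heavy data is checked first so zeros do not distort later rules.
    if zero_ratio > 0.5:
        return "sparse"
    if 0.3 <= zero_ratio <= 0.5:
        return "zero_inflated"
    # Few distinct values relative to the sample size.
    if unique_ratio < 0.1:
        return "discrete"
    # Data bounded in [0, 1]: beta if skewed, otherwise uniform.
    if lo >= 0.0 and hi <= 1.0:
        return "beta" if abs(skewness) > 0.5 else "uniform"
    # Strictly positive, right-skewed data, from strongest skew to mildest.
    if lo > 0.0:
        if skewness > 2.0:
            return "log_normal"
        if skewness > 1.0:
            return "exponential"
        if skewness > 0.5:
            return "gamma"
    # Tail weight, using non-excess kurtosis (a normal sample gives about 3.0).
    if kurtosis > 10.0:
        return "cauchy"
    if kurtosis > 4.0:
        return "heavy_tailed"
    if abs(skewness) < 0.5 and abs(kurtosis - 3.0) < 0.5:
        return "normal"
    return "unknown"
```

For example, `guess_distribution(2.5, 12.0, 0.0, 1.0, 0.1, 50.0)` falls into the positive, strongly right-skewed branch and returns `"log_normal"`. Multimodal and periodic detection are left out because they need the histogram and autocorrelation rather than summary statistics.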
|
79 | 99 | ## Usage |
80 | 100 |
|
81 | 101 | ### Basic Usage |
82 | 102 |
|
83 | | -The Distribution-Aware Encoder works seamlessly (and only) with numerical features. Enable it by setting `use_distribution_aware=True` in the `PreprocessingModel`. |
| 103 | +The Distribution-Aware Encoder works seamlessly with numerical features. Enable it by setting `use_distribution_aware=True` in the `PreprocessingModel`. |
84 | 104 |
|
85 | 105 | ```python |
86 | 106 | from kdp.processor import PreprocessingModel |
87 | 107 | from kdp.features import NumericalFeature |
88 | 108 |
|
89 | 109 | # Define features |
90 | 110 | features = { |
91 | | - # Numerical features |
92 | 111 | "feature1": NumericalFeature(), |
93 | 112 | "feature2": NumericalFeature(), |
94 | | - # etc .. |
| 113 | + # etc. |
95 | 114 | } |
96 | 115 |
|
97 | | -# Initialize the model |
98 | | -model = PreprocessingModel( # here |
| 116 | +# Initialize the model with distribution-aware encoding |
| 117 | +model = PreprocessingModel( |
99 | 118 | features=features, |
100 | 119 | use_distribution_aware=True |
101 | 120 | ) |
102 | 121 | ``` |
103 | 122 |
|
104 | | -### Manual Usage |
| 123 | +### Manual Usage with Specific Distribution |
| 124 | + |
| 125 | +You can specify a preferred distribution type for specific features: |
105 | 126 |
|
106 | 127 | ```python |
107 | 128 | from kdp.processor import PreprocessingModel |
108 | 129 | from kdp.features import NumericalFeature, FeatureType |
| 130 | +from kdp.layers.distribution_aware_encoder_layer import DistributionType |
109 | 131 |
|
110 | | -# Define features |
| 132 | +# Define features with specific distribution preferences |
111 | 133 | features = { |
112 | | - # Numerical features |
113 | 134 | "feature1": NumericalFeature( |
114 | 135 | name="feature1", |
115 | 136 | feature_type=FeatureType.FLOAT_NORMALIZED |
116 | 137 | ), |
117 | 138 | "feature2": NumericalFeature( |
118 | 139 | name="feature2", |
119 | 140 | feature_type=FeatureType.FLOAT_RESCALED, |
120 | | - prefered_distribution="log_normal" # here we could specify a prefered distribution (normal, periodic, etc) |
| 141 | + prefered_distribution=DistributionType.LOG_NORMAL # Specify preferred distribution |
121 | 142 | ) |
122 | | - # etc .. |
| 143 | + # etc. |
123 | 144 | } |
124 | 145 |
|
125 | 146 | # Initialize the model |
126 | | -model = PreprocessingModel( # here |
| 147 | +model = PreprocessingModel( |
127 | 148 | features=features, |
128 | | - use_distribution_aware=True, |
129 | | - distribution_aware_bins=1000, # 1000 is the default value, but you can change it for finer data |
| 149 | + use_distribution_aware=True |
130 | 150 | ) |
131 | 151 | ``` |
132 | 152 |
|
133 | | -### Advanced Configuration |
| 153 | +### Direct Layer Usage |
| 154 | + |
| 155 | +You can also use the layer directly in your Keras models: |
| 156 | + |
134 | 157 | ```python |
135 | | -encoder = DistributionAwareEncoder( |
136 | | - num_bins=1000, |
137 | | - epsilon=1e-6, |
138 | | - detect_periodicity=True, |
139 | | - handle_sparsity=True, |
140 | | - adaptive_binning=True, |
141 | | - mixture_components=3, |
142 | | - trainable=True |
143 | | -) |
| 158 | +import tensorflow as tf |
| 159 | +from kdp.layers import DistributionAwareEncoder |
| 160 | + |
| 161 | +# Creating a model with automatic distribution detection |
| 162 | +inputs = tf.keras.Input(shape=(10,)) |
| 163 | +encoded = DistributionAwareEncoder(embedding_dim=16)(inputs) |
| 164 | +outputs = tf.keras.layers.Dense(1)(encoded) |
| 165 | +model = tf.keras.Model(inputs, outputs) |
| 166 | + |
| 167 | +# Save and load model with custom objects |
| 168 | +model.save("my_model.keras") |
| 169 | +custom_objects = DistributionAwareEncoder.get_custom_objects() |
| 170 | +loaded_model = tf.keras.models.load_model("my_model.keras", custom_objects=custom_objects) |
144 | 171 | ``` |
145 | 172 |
|
146 | 173 | ## Parameters |
147 | 174 |
|
148 | 175 | | Parameter | Type | Default | Description | |
149 | 176 | |-----------|------|---------|-------------| |
150 | | -| num_bins | int | 1000 | Number of bins for quantile encoding | |
151 | | -| epsilon | float | 1e-6 | Small value for numerical stability | |
152 | | -| detect_periodicity | bool | True | Enable periodic pattern detection | |
153 | | -| handle_sparsity | bool | True | Enable special handling for sparse data | |
154 | | -| adaptive_binning | bool | True | Enable adaptive bin boundaries | |
155 | | -| mixture_components | int | 3 | Number of components for mixture models | |
156 | | -| trainable | bool | True | Whether parameters are trainable | |
157 | | -| prefered_distribution | DistributionType | None | Manually specify distribution type | |
| 177 | +| embedding_dim | int or None | None | Output dimension for feature projection. If specified, a Dense layer projects the transformed features to this dimension. | |
| 178 | +| epsilon | float | 1e-6 | Small value to prevent numerical issues. | |
| 179 | +| detect_periodicity | bool | True | If True, checks for and handles periodic patterns by adding sin/cos features. | |
| 180 | +| handle_sparsity | bool | True | If True, applies special handling for sparse data (many zeros). | |
| 181 | +| auto_detect | bool | True | If True, automatically detects distribution type during training. | |
| 182 | +| distribution_type | str | "unknown" | Specific distribution type to use if auto_detect is False. | |
| 183 | +| transform_type | str | "auto" | Type of transformation to apply via DistributionTransformLayer. | |
| 184 | +| add_distribution_embedding | bool | False | If True, adds a learned embedding for the detected distribution type. | |
| 185 | +| trainable | bool | True | Whether the layer is trainable. | |
158 | 186 |
|
159 | | -## Key Features |
| 187 | +## Output Dimensions |
160 | 188 |
|
161 | | -### 1. Automatic Distribution Detection |
162 | | -- Uses statistical moments and tests |
163 | | -- Employs KDE for multimodality detection |
164 | | -- Handles mixed distributions via ensemble approach |
| 189 | +The output dimensions depend on the configuration: |
165 | 190 |
|
166 | | -### 2. Adaptive Transformations |
167 | | -- Learns optimal parameters during training |
168 | | -- Adjusts to data distribution changes |
169 | | -- Handles complex periodic patterns |
| 191 | +- **Base case**: Same shape as input |
| 192 | +- **With periodic features**: Input dimension × 3 (original + sin + cos features) |
| 193 | +- **With embedding_dim**: (batch_size, embedding_dim) |
| 194 | +- **With distribution_embedding**: Output has 8 additional dimensions |
170 | 195 |
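One possible reading of how these cases combine is sketched below. The helper `output_last_dim` is hypothetical, and the composition order is an assumption (periodic expansion first, then the optional projection, then the distribution embedding appended last); the list above does not confirm how the options interact.

```python
def output_last_dim(input_dim, periodic=False, embedding_dim=None,
                    add_distribution_embedding=False):
    """Estimate the size of the last output axis for a given configuration.

    Hypothetical helper; the order in which the options are applied
    is assumed, not taken from the layer's code.
    """
    # Periodic data keeps the original values and adds sin and cos copies.
    dim = input_dim * 3 if periodic else input_dim
    # A projection, when requested, replaces the feature width entirely.
    if embedding_dim is not None:
        dim = embedding_dim
    # The distribution embedding contributes 8 extra dimensions.
    if add_distribution_embedding:
        dim += 8
    return dim
```

Under these assumptions, a 10-feature input with periodic features and no projection yields 30 output features, while the same input with `embedding_dim=16` yields 16.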
|
171 | | -### 3. Fourier Feature Generation |
172 | | -- Automatic frequency detection |
173 | | -- Multiple harmonic components |
174 | | -- Phase-aware transformations |
| 196 | +## Implementation Details |
175 | 197 |
|
176 | | -### 4. Robust Handling |
177 | | -- Special treatment for zeros |
178 | | -- Outlier-resistant transformations |
179 | | -- Numerical stability safeguards |
| 198 | +### 1. Distribution Detection Process |
180 | 199 |
|
181 | | -## Implementation Details |
| 200 | +The encoder uses statistical moments and specialized tests to detect the distribution type: |
| 201 | + |
| 202 | +```python |
| 203 | +# Calculate basic statistics |
| 204 | +mean = tf.reduce_mean(x) |
| 205 | +variance = tf.math.reduce_variance(x) |
| 206 | +std = tf.sqrt(variance + epsilon) |
| 207 | + |
| 208 | +# Standardize for higher moments |
| 209 | +x_std = (x - mean) / (std + epsilon) |
| 210 | + |
| 211 | +# Calculate skewness and kurtosis |
| 212 | +skewness = tf.reduce_mean(tf.pow(x_std, 3)) |
| 213 | +kurtosis = tf.reduce_mean(tf.pow(x_std, 4)) |
| 214 | + |
| 215 | +# Check for zeros and sparsity |
| 216 | +zero_ratio = tf.reduce_mean(tf.cast(tf.abs(x) < epsilon, tf.float32)) |
| 217 | + |
| 218 | +# Check for discreteness |
| 219 | +unique_ratio = tf.size(tf.unique(tf.reshape(x, [-1]))[0]) / tf.size(x) |
| 220 | + |
| 221 | +# Score each distribution type and select the best match |
| 222 | +``` |
| 223 | + |
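To make the moment calculation concrete, here is a small standard-library sketch of the same arithmetic on plain Python lists. The function name `standardized_moments` is an assumption for illustration. It divides by `n` (population variance), which matches what `tf.math.reduce_variance` computes, and it returns non-excess kurtosis, so a normal sample gives a value near 3.0 rather than 0.

```python
import math


def standardized_moments(values, epsilon=1e-6):
    """Return (skewness, kurtosis) of a list of floats.

    Mirrors the TensorFlow snippet above: epsilon is added both inside
    the square root and to the denominator for numerical stability.
    """
    n = len(values)
    mean = sum(values) / n
    # Population variance: mean of squared deviations.
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance + epsilon)
    # Standardize, then average the third and fourth powers.
    z = [(v - mean) / (std + epsilon) for v in values]
    skewness = sum(t ** 3 for t in z) / n
    kurtosis = sum(t ** 4 for t in z) / n
    return skewness, kurtosis


# A symmetric sample has skewness close to 0.
sym_skew, sym_kurt = standardized_moments([-2.0, -1.0, 0.0, 1.0, 2.0])

# One large value pulls the right tail out and makes skewness positive.
right_skew, _ = standardized_moments([1.0, 1.0, 1.0, 1.0, 10.0])
```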
| 224 | +### 2. Periodic Data Handling |
| 225 | + |
| 226 | +For data with detected periodicity, the encoder adds Fourier features: |
182 | 227 |
|
183 | | -### 1. Periodic Data Handling |
184 | 228 | ```python |
185 | 229 | # Normalize to [-π, π] range |
186 | | -normalized = inputs * π / scale |
| 230 | +normalized = (x - mean) / (std + epsilon) * math.pi  # needs `import math`; z-scores scaled by π, so values beyond one std fall outside [-π, π] |
| 231 | + |
187 | 232 | # Generate Fourier features |
188 | | -features = [ |
189 | | - sin(freq * normalized + phase), |
190 | | - cos(freq * normalized + phase) |
191 | | -] |
192 | | -# Add harmonics if multimodal |
193 | | -if is_multimodal: |
194 | | - for h in [2, 3, 4]: |
195 | | - features.extend([ |
196 | | - sin(h * freq * normalized + phase), |
197 | | - cos(h * freq * normalized + phase) |
198 | | - ]) |
| 233 | +sin_feature = tf.sin(frequency * normalized + phase) |
| 234 | +cos_feature = tf.cos(frequency * normalized + phase) |
| 235 | + |
| 236 | +# Combine with original data |
| 237 | +transformed = tf.concat([x, sin_feature, cos_feature], axis=-1) |
199 | 238 | ``` |
200 | 239 |
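The periodicity check itself is described only as "significant peaks in autocorrelation". A minimal sketch of that idea in plain Python follows. The function names, the 0.5 significance threshold, and the local-peak rule are assumptions for illustration, not the layer's implementation.

```python
import math


def autocorrelation(values, lag):
    """Autocorrelation of a sequence at one lag, normalized by lag-0 energy."""
    n = len(values)
    mean = sum(values) / n
    energy = sum((v - mean) ** 2 for v in values)
    if energy == 0:
        return 0.0  # a constant signal carries no periodic structure
    cov = sum((values[i] - mean) * (values[i + lag] - mean)
              for i in range(n - lag))
    return cov / energy


def dominant_period(values, max_lag, threshold=0.5):
    """Return the lag of the strongest local autocorrelation peak, or None.

    A lag counts as a peak when its autocorrelation exceeds `threshold`
    and is at least as large as both neighbouring lags.
    """
    acf = [autocorrelation(values, lag) for lag in range(max_lag + 1)]
    best = None
    for lag in range(1, max_lag):
        is_peak = acf[lag] >= acf[lag - 1] and acf[lag] >= acf[lag + 1]
        if acf[lag] > threshold and is_peak:
            if best is None or acf[lag] > acf[best]:
                best = lag
    return best


# A sine wave that repeats every 20 samples.
signal = [math.sin(2 * math.pi * i / 20) for i in range(200)]
period = dominant_period(signal, max_lag=50)
```

A detected period could then be used to choose `frequency` in the Fourier features above. How the layer actually derives `frequency` and `phase` is not stated in this document.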
|
201 | | -### 2. Distribution Detection |
| 240 | +### 3. Model Serialization |
| 241 | + |
| 242 | +When saving models containing the DistributionAwareEncoder: |
| 243 | + |
| 244 | +```python |
| 245 | +from kdp.layers import DistributionAwareEncoder, get_custom_objects |
| 246 | + |
| 247 | +# Save the model |
| 248 | +model.save("my_model.keras") |
| 249 | + |
| 250 | +# Load the model with custom objects |
| 251 | +custom_objects = get_custom_objects() |
| 252 | +loaded_model = tf.keras.models.load_model("my_model.keras", custom_objects=custom_objects) |
| 253 | +``` |
202 | 254 | ```python |
203 | 255 | # Statistical moments |
204 | 256 | mean = tf.reduce_mean(inputs) |
|