Commit dd728ef

docs(KDP): added docs
1 parent 62f0dba commit dd728ef

6 files changed

+123
-53
lines changed

docs/complex_example.md

Lines changed: 5 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -24,6 +24,7 @@ features = {
2424
"quantity": NumericalFeature(
2525
name="quantity",
2626
feature_type=FeatureType.FLOAT_RESCALED,
27+
prefered_distribution="poisson",  # optionally specify a preferred distribution (normal, periodic, etc.)
2728
),
2829

2930
# Categorical features
@@ -118,6 +119,10 @@ ppr = PreprocessingModel(
118119
feature_selection_placement="all_features", # Choose between (all_features|numeric|categorical)
119120
feature_selection_units=32,
120121
feature_selection_dropout=0.15,
122+
123+
# Distribution aware configuration
124+
use_distribution_aware=True, # here we activate the distribution aware encoder
125+
distribution_aware_bins=1000,  # 1000 is the default value, but you can change it for finer data
121126
)
122127

123128
# Build the preprocessor

docs/distribution_aware_encoder.md

Lines changed: 46 additions & 53 deletions
@@ -66,11 +66,6 @@ The Distribution-Aware Encoder is an advanced preprocessing layer that automatic
6666
- Handled via rate parameter estimation
6767
- Detection: Integer values and variance≈mean
6868

69-
13. **Weibull Distribution**
70-
- For lifetime/failure data
71-
- Handled via Weibull CDF
72-
- Detection: Shape and scale analysis
73-
7469
14. **Cauchy Distribution**
7570
- For extremely heavy-tailed data
7671
- Handled via robust location-scale estimation
@@ -81,29 +76,61 @@ The Distribution-Aware Encoder is an advanced preprocessing layer that automatic
8176
- Handled via mixture model approach
8277
- Detection: Zero proportion analysis
8378

84-
16. **Bounded Distribution**
85-
- For data with known bounds
86-
- Handled via scaled beta transformation
87-
- Detection: Value range analysis
88-
89-
17. **Ordinal Distribution**
90-
- For ordered categorical data
91-
- Handled via learned mapping
92-
- Detection: Discrete ordered values
93-
9479
## Usage
9580

9681
### Basic Usage
82+
83+
Distribution-aware encoding applies only to numerical features.
84+
9785
```python
9886
from kdp.processor import PreprocessingModel
99-
100-
preprocessor = PreprocessingModel(
101-
features_stats=stats,
102-
features_specs=specs,
87+
from kdp.features import NumericalFeature
88+
89+
# Define features
90+
features = {
91+
# Numerical features
92+
"feature1": NumericalFeature(),
93+
"feature2": NumericalFeature(),
94+
# etc ..
95+
}
96+
97+
# Initialize the model
98+
model = PreprocessingModel(
99+
features=features,
103100
use_distribution_aware=True
104101
)
105102
```
106103

104+
### Manual Usage
105+
106+
```python
107+
from kdp.processor import PreprocessingModel
108+
from kdp.features import NumericalFeature, FeatureType
109+
110+
# Define features
111+
features = {
112+
# Numerical features
114+
"feature1": NumericalFeature(
115+
name="feature1",
116+
feature_type=FeatureType.FLOAT_NORMALIZED
117+
),
118+
"feature2": NumericalFeature(
119+
name="feature2",
120+
feature_type=FeatureType.FLOAT_RESCALED,
121+
prefered_distribution="log_normal",  # optionally specify a preferred distribution (normal, periodic, etc.)
122+
)
123+
# etc ..
124+
}
125+
126+
# Initialize the model
127+
model = PreprocessingModel(
128+
features=features,
129+
use_distribution_aware=True,
130+
distribution_aware_bins=1000, # 1000 is the default value, but you can change it for finer data
131+
)
132+
```
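
As a rough illustration of why a larger `distribution_aware_bins` value can help with finer data, the sketch below bins values with plain equal-width bins and counts how many distinct bins a set of close values occupies. This is not KDP's internal binning, which the docs here do not describe; `bin_index` and `distinct_bins` are hypothetical helper names used only for this example.

```python
def bin_index(value, low, high, n_bins):
    """Map a value in [low, high] to an equal-width bin index in [0, n_bins - 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    # Clamp so out-of-range values fall into the edge bins.
    clamped = min(max(value, low), high)
    idx = int((clamped - low) / (high - low) * n_bins)
    # A value equal to `high` belongs to the last bin.
    return min(idx, n_bins - 1)


def distinct_bins(values, low, high, n_bins):
    """Count how many different bins the values occupy."""
    return len({bin_index(v, low, high, n_bins) for v in values})


# Three values that differ only in the third decimal, plus two spread-out ones.
values = [0.1015, 0.1025, 0.1035, 0.55, 0.95]

coarse = distinct_bins(values, 0.0, 1.0, 10)    # close values share one bin
fine = distinct_bins(values, 0.0, 1.0, 1000)    # close values are separated
print(coarse, fine)
```

With 10 bins the three close values collapse into a single bin, while 1000 bins keep them apart. A higher bin count costs more resolution bookkeeping, so the default is usually a sensible starting point.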
133+
107134
### Advanced Configuration
108135
```python
109136
encoder = DistributionAwareEncoder(
@@ -272,40 +299,6 @@ The DistributionAwareEncoder is integrated into the numeric feature processing p
272299
- Enable caching for repeated processing
273300
- Adjust mixture components based on data
274301

275-
## Example Use Cases
276-
277-
### 1. Financial Data
278-
```python
279-
# Handle heavy-tailed return distributions
280-
preprocessor = PreprocessingModel(
281-
use_distribution_aware=True,
282-
handle_sparsity=False,
283-
mixture_components=2
284-
)
285-
```
286-
287-
### 2. Temporal Data
288-
```python
289-
# Handle periodic patterns
290-
preprocessor = PreprocessingModel(
291-
use_distribution_aware=True,
292-
detect_periodicity=True,
293-
adaptive_binning=True
294-
)
295-
```
296-
297-
### 3. Sparse Features
298-
```python
299-
# Handle sparse categorical data
300-
preprocessor = PreprocessingModel(
301-
use_distribution_aware=True,
302-
handle_sparsity=True,
303-
mixture_components=1
304-
)
305-
```
306-
307-
## Monitoring and Debugging
308-
309302
### Distribution Detection
310303
```python
311304
# Access distribution information

docs/example_usages.md

Lines changed: 71 additions & 0 deletions
@@ -288,3 +288,74 @@ print("Feature importances:", feature_importances)
288288

289289
Here is the plot of the model:
290290
![Complex Model](imgs/complex_model.png)
291+
292+
293+
## Example 4: Numerical features with distribution aware encoder
294+
295+
The distribution-aware encoder normally works well in automatic mode once `use_distribution_aware=True` is set.
296+
However, you can also set the preferred distribution manually for each numerical feature via `prefered_distribution`.
297+
298+
```python
299+
from kdp.features import NumericalFeature, FeatureType
300+
from kdp.processor import PreprocessingModel, OutputModeOptions
301+
302+
# Define features
303+
features = {
304+
# 1. Default automatic distribution detection
305+
"basic_float": NumericalFeature(
306+
name="basic_float",
307+
feature_type=FeatureType.FLOAT,
308+
),
309+
310+
# 2. Manually setting a gamma distribution
311+
"rescaled_float": NumericalFeature(
312+
name="rescaled_float",
313+
feature_type=FeatureType.FLOAT_RESCALED,
314+
scale=2.0,
315+
prefered_distribution="gamma"
316+
),
317+
# 3. Custom preprocessing pipeline with a custom set normal distribution
318+
"custom_float": NumericalFeature(
319+
name="custom_float",
320+
feature_type=FeatureType.FLOAT,
321+
preprocessors=[
322+
tf.keras.layers.Rescaling,
323+
tf.keras.layers.Normalization,
324+
],
325+
bin_boundaries=[0.0, 1.0, 2.0],
326+
mean=0.0,
327+
variance=1.0,
328+
scale=4.0,
329+
prefered_distribution="normal"
330+
),
331+
}
332+
333+
# Now we can create a preprocessing model with the features
334+
ppr = PreprocessingModel(
335+
path_data="sample_data.csv",
336+
features_specs=features,
337+
features_stats_path="features_stats.json",
338+
overwrite_stats=True,
339+
output_mode=OutputModeOptions.CONCAT,
340+
341+
# Add feature selection to get the most important features
342+
feature_selection_placement="numeric", # Choose between (all_features|numeric|categorical)
343+
344+
# Add tabular attention to check for feature interactions
345+
tabular_attention=True,
346+
347+
# Add distribution aware encoder
348+
use_distribution_aware=True
349+
)
350+
351+
# Build the preprocessor
352+
result = ppr.build_preprocessor()
353+
354+
# Transform data using direct model prediction
355+
transformed_data = ppr.model.predict(test_batch)
356+
357+
# Get feature importances
358+
feature_importances = ppr.get_feature_importances()
359+
```
360+
Here is the plot of the model:
361+
![Complex Model](imgs/numerical_example_model_with_distribution_aware.png)
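
Before setting `prefered_distribution` by hand, it can help to check the data against the detection rules listed in the encoder docs, for example "integer values and variance≈mean" for Poisson and "zero proportion analysis" for zero-inflated data. The sketch below is a minimal standalone version of those two checks using only the standard library. It is not KDP's detection code; `looks_poisson`, `zero_fraction`, and the `rel_tol` threshold are assumptions made for illustration.

```python
from statistics import mean, pvariance


def looks_poisson(values, rel_tol=0.5):
    """Heuristic: non-negative integers whose variance is close to their mean."""
    if not values:
        return False
    # Poisson data are counts: reject negatives and non-integer values.
    if any(v < 0 or float(v) != int(v) for v in values):
        return False
    m = mean(values)
    if m == 0:
        return False
    # rel_tol is an arbitrary illustrative threshold, not a KDP setting.
    return abs(pvariance(values) - m) <= rel_tol * m


def zero_fraction(values):
    """Share of exact zeros, a simple signal for zero-inflated data."""
    return sum(1 for v in values if v == 0) / len(values)


counts = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]   # variance is close to the mean
prices = [10.5, 11.2, 9.8]                 # not integer-valued
bursty = [0, 0, 0, 0, 50, 100]             # variance far above the mean

print(looks_poisson(counts), looks_poisson(prices), looks_poisson(bursty))
print(zero_fraction(bursty))
```

If a feature passes a check like this, passing `prefered_distribution="poisson"` to its `NumericalFeature` is a reasonable manual override; otherwise automatic detection is likely the safer choice.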

docs/imgs/complex_model.png

73.7 KB
232 KB

docs/quick_start.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,7 @@ model = PreprocessingModel(
3131
features=features,
3232
tabular_attention=True, # Enable attention mechanism
3333
feature_selection=True,  # Enable feature selection
34+
use_distribution_aware=True,  # Enable the distribution-aware encoder
3435
)
3536
```
3637
