Commit 91d4f85

docs(KDP): improving documentation
1 parent fa903e4 · commit 91d4f85

File tree: 6 files changed (+550, −507 lines)

Makefile

Lines changed: 1 addition & 1 deletion
````diff
@@ -90,7 +90,7 @@ deploy_doc:
 .PHONY: serve_doc
 ## Test MkDocs based documentation locally.
 serve_doc:
-	mkdocs serve
+	poetry run mkdocs serve
 
 # ------------------------------------
 # Clean All
````

docs/feature_selection.md

Lines changed: 75 additions & 50 deletions
````diff
@@ -1,47 +1,45 @@
-# Feature Selection in Keras Data Processor
+# 🎯 Feature Selection in KDP
 
-The Keras Data Processor includes a sophisticated feature selection mechanism based on the Gated Residual Variable Selection Network (GRVSN) architecture. This document explains the components, usage, and benefits of this feature.
+## 📚 Overview
 
-## Overview
+KDP includes a sophisticated feature selection mechanism based on the Gated Residual Variable Selection Network (GRVSN) architecture. This powerful system automatically learns and selects the most important features in your data.
 
-The feature selection mechanism uses a combination of gated units and residual networks to automatically learn the importance of different features in your data. It can be applied to both numeric and categorical features, either independently or together.
+## 🧩 Core Components
 
-## Components
+### 1. 🔀 GatedLinearUnit
 
-### 1. GatedLinearUnit
-
-The `GatedLinearUnit` is the basic building block that implements a gated activation function:
+The foundation of our feature selection system:
 
 ```python
 gl = GatedLinearUnit(units=64)
 x = tf.random.normal((32, 100))
 y = gl(x)
 ```
 
-Key features:
-- Applies a linear transformation followed by a sigmoid gate
-- Selectively filters input data based on learned weights
-- Helps control information flow through the network
+**Key Features:**
+* 🔄 Applies linear transformation with sigmoid gate
+* 🎛️ Selectively filters input data
+* 🔍 Controls information flow through the network
 
-### 2. GatedResidualNetwork
+### 2. 🏗️ GatedResidualNetwork
 
-The `GatedResidualNetwork` combines gated linear units with residual connections:
+Combines gated units with residual connections:
 
 ```python
 grn = GatedResidualNetwork(units=64, dropout_rate=0.2)
 x = tf.random.normal((32, 100))
 y = grn(x)
 ```
 
-Key features:
-- Uses ELU activation for non-linearity
-- Includes dropout for regularization
-- Adds residual connections to help with gradient flow
-- Applies layer normalization for stability
+**Key Features:**
+* Uses ELU activation for non-linearity
+* 🎲 Includes dropout for regularization
+* 🔄 Adds residual connections for better gradient flow
+* 📊 Applies layer normalization for stability
 
-### 3. VariableSelection
+### 3. 🎯 VariableSelection
 
-The `VariableSelection` layer is the main feature selection component:
+The main feature selection component:
 
 ```python
 vs = VariableSelection(nr_features=3, units=64, dropout_rate=0.2)
@@ -51,17 +49,17 @@ x3 = tf.random.normal((32, 300))
 selected_features, weights = vs([x1, x2, x3])
 ```
 
-Key features:
-- Processes each feature independently using GRNs
-- Calculates feature importance weights using softmax
-- Returns both selected features and their weights
-- Supports different input dimensions for each feature
+**Key Features:**
+* 🔄 Independent GRN processing for each feature
+* ⚖️ Calculates feature importance weights via softmax
+* 📊 Returns both selected features and their weights
+* 🔧 Supports varying input dimensions per feature
 
-## Usage in Preprocessing Model
+## 💻 Usage Guide
 
 ### Configuration
 
-Configure feature selection in your preprocessing model:
+Set up feature selection in your preprocessing model:
 
 ```python
 model = PreprocessingModel(
@@ -72,18 +70,20 @@ model = PreprocessingModel(
 )
 ```
 
-### Placement Options
+### 🎯 Placement Options
 
-The `FeatureSelectionPlacementOptions` enum provides several options for where to apply feature selection:
+Choose where to apply feature selection using `FeatureSelectionPlacementOptions`:
 
-1. `NONE`: Disable feature selection
-2. `NUMERIC`: Apply only to numeric features
-3. `CATEGORICAL`: Apply only to categorical features
-4. `ALL_FEATURES`: Apply to all features
+| Option | Description |
+|--------|-------------|
+| `NONE` | Disable feature selection |
+| `NUMERIC` | Apply to numeric features only |
+| `CATEGORICAL` | Apply to categorical features only |
+| `ALL_FEATURES` | Apply to all features |
 
-### Accessing Feature Weights
+### 📊 Accessing Feature Weights
 
-After processing, feature weights are available in the `processed_features` dictionary:
+Monitor feature importance after processing:
 
 ```python
 # Process your data
@@ -92,25 +92,51 @@ processed = model.transform(data)
 # Access feature weights
 numeric_weights = processed["numeric_feature_weights"]
 categorical_weights = processed["categorical_feature_weights"]
+
+# Print feature importance
+for feature_name in features:
+    weights = processed_data[f"{feature_name}_weights"]
+    print(f"Feature {feature_name} importance: {weights.mean()}")
 ```
 
-## Benefits
+## 🌟 Benefits
+
+1. **🤖 Automatic Feature Selection**
+   * Learns feature importance automatically
+   * Adapts to your specific dataset
+   * Reduces manual feature engineering
 
-1. **Automatic Feature Selection**: The model learns which features are most important for your task.
-2. **Interpretability**: Feature weights provide insights into feature importance.
-3. **Improved Performance**: By focusing on relevant features, the model can achieve better performance.
-4. **Regularization**: Dropout and residual connections help prevent overfitting.
-5. **Flexibility**: Can be applied to different feature types and combinations.
+2. **📊 Interpretability**
+   * Clear feature importance weights
+   * Insights into model decisions
+   * Easy to explain to stakeholders
 
-## Integration with Other Features
+3. **⚡ Improved Performance**
+   * Focuses on relevant features
+   * Reduces noise in the data
+   * Better model convergence
 
-The feature selection mechanism integrates seamlessly with other preprocessing components:
+## 🔧 Best Practices
 
-1. **Transformer Blocks**: Can be used before or after transformer blocks
-2. **Tabular Attention**: Complements tabular attention by focusing on important features
-3. **Custom Preprocessors**: Works with any custom preprocessing steps
+### Hyperparameter Tuning
 
-## Example
+* 🎯 Start with default values
+* 📈 Adjust based on validation performance
+* 🔄 Monitor feature importance stability
+
+### Performance Optimization
+
+* ⚡ Use appropriate batch sizes
+* 🎲 Adjust dropout rates as needed
+* 📊 Monitor memory usage
+
+## 📚 References
+
+* [GRVSN Paper](https://arxiv.org/abs/xxxx.xxxxx)
+* [Feature Selection in Deep Learning](https://arxiv.org/abs/xxxx.xxxxx)
+* [KDP Documentation](https://kdp.readthedocs.io)
+
+## 📚 Example
 
 Here's a complete example of using feature selection:
 
@@ -153,7 +179,7 @@ for feature_name in features:
     print(f"Feature {feature_name} importance: {weights.mean()}")
 ```
 
-## Testing
+## 📊 Testing
 
 The feature selection components include comprehensive unit tests that verify:
 
@@ -167,4 +193,3 @@ The feature selection components include comprehensive unit tests that verify:
 Run the tests using:
 ```bash
 python -m pytest test/test_feature_selection.py -v
-```
````
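For readers skimming the diff, the snippet below assembles the `VariableSelection` call documented above into a self-contained sketch. The import path, the synthetic input shapes, and the shape comments are assumptions for illustration only; they are not part of the commit.

```python
# Minimal sketch of the VariableSelection usage documented in the diff above.
# Assumption: the import path below is illustrative; check the kdp package
# for the module that actually exposes VariableSelection.
import tensorflow as tf

from kdp.custom_layers import VariableSelection  # hypothetical import path

# Three synthetic feature tensors with different widths (batch size 32).
x1 = tf.random.normal((32, 100))
x2 = tf.random.normal((32, 200))
x3 = tf.random.normal((32, 300))

# One GRN per feature; a softmax over features yields importance weights.
vs = VariableSelection(nr_features=3, units=64, dropout_rate=0.2)
selected_features, weights = vs([x1, x2, x3])

print(selected_features.shape)  # combined representation projected to `units`
print(weights.shape)            # per-feature importance weights
```

The same pattern extends to any number of inputs, as long as `nr_features` matches the length of the list passed to the layer.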

docs/tabular_attention.md

Lines changed: 49 additions & 42 deletions
````diff
@@ -1,23 +1,26 @@
-# Tabular Attention in KDP
+# 🎯 Tabular Attention in KDP
 
-The KDP package includes powerful attention mechanisms for tabular data:
-1. Standard TabularAttention for uniform feature processing
-2. MultiResolutionTabularAttention for type-specific feature processing
+## 📚 Overview
 
-## Overview
+KDP includes powerful attention mechanisms for tabular data processing:
 
-### Standard TabularAttention
+1. 🔄 **Standard TabularAttention**: Uniform feature processing
+2. 🎛️ **MultiResolutionTabularAttention**: Type-specific feature processing
+
+### 🔄 Standard TabularAttention
 The TabularAttention layer applies attention uniformly across all features, capturing:
-- Dependencies between features for each sample
-- Dependencies between samples for each feature
 
-### MultiResolutionTabularAttention
-The MultiResolutionTabularAttention layer implements a hierarchical attention mechanism that processes different feature types appropriately:
-1. **Numerical Features**: Full-resolution attention that preserves precise numerical relationships
-2. **Categorical Features**: Embedding-based attention that captures categorical patterns
-3. **Cross-Feature Attention**: Hierarchical attention between numerical and categorical features
+* 🔗 Dependencies between features for each sample
+* 📊 Dependencies between samples for each feature
+
+### 🎛️ MultiResolutionTabularAttention
+The MultiResolutionTabularAttention implements a hierarchical attention mechanism:
+
+* 📈 **Numerical Features**: Full-resolution attention preserving precise numerical relationships
+* 🏷️ **Categorical Features**: Embedding-based attention capturing categorical patterns
+* 🔄 **Cross-Feature Attention**: Hierarchical attention between numerical and categorical features
 
-## Usage
+## 💻 Usage Examples
 
 ### Standard TabularAttention
 
@@ -72,49 +75,53 @@ model = PreprocessingModel(
 
 ![Multi-Resolution TabularAttention](imgs/attention_example_multi_resolution.png)
 
-## Configuration Options
+## ⚙️ Configuration Options
 
-### Common Options
-- `tabular_attention` (bool): Enable/disable attention mechanisms
-- `tabular_attention_heads` (int): Number of attention heads
-- `tabular_attention_dim` (int): Dimension of the attention model
-- `tabular_attention_dropout` (float): Dropout rate for regularization
+### Core Parameters
 
-### Placement Options
-- `tabular_attention_placement` (str):
-    - `ALL_FEATURES`: Apply uniform attention to all features
-    - `NUMERIC`: Apply only to numeric features
-    - `CATEGORICAL`: Apply only to categorical features
-    - `MULTI_RESOLUTION`: Use type-specific attention mechanisms
-    - `NONE`: Disable attention
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `tabular_attention` | bool | Enable/disable attention mechanisms |
+| `tabular_attention_heads` | int | Number of attention heads |
+| `tabular_attention_dim` | int | Dimension of the attention model |
+| `tabular_attention_dropout` | float | Dropout rate for regularization |
 
-### Multi-Resolution Specific Options
-- `tabular_attention_embedding_dim` (int): Dimension for categorical embeddings in multi-resolution mode
+### 🎯 Placement Options
+Choose where to apply attention using `tabular_attention_placement`:
 
-## How It Works
+* `ALL_FEATURES`: Apply uniform attention to all features
+* `NUMERIC`: Apply only to numeric features
+* `CATEGORICAL`: Apply only to categorical features
+* `MULTI_RESOLUTION`: Use type-specific attention mechanisms
+* `NONE`: Disable attention
 
-### Standard TabularAttention
-1. **Self-Attention**: Applied uniformly across all features
-2. **Layer Normalization**: Stabilizes learning
-3. **Feed-forward Network**: Processes attention outputs
+### 🎛️ Multi-Resolution Settings
+* `tabular_attention_embedding_dim`: Dimension for categorical embeddings in multi-resolution mode
+
+## 🔍 How It Works
+
+### Standard TabularAttention Architecture
+1. 🔄 **Self-Attention**: Applied uniformly across all features
+2. 📊 **Layer Normalization**: Stabilizes learning
+3. 🧮 **Feed-forward Network**: Processes attention outputs
 
-### MultiResolutionTabularAttention
-1. **Numerical Processing**:
+### MultiResolutionTabularAttention Architecture
+1. 📈 **Numerical Processing**:
    - Full-resolution self-attention
   - Preserves numerical precision
   - Captures complex numerical relationships
 
-2. **Categorical Processing**:
+2. 🏷️ **Categorical Processing**:
   - Embedding-based attention
   - Lower-dimensional representations
  - Captures categorical patterns efficiently
 
-3. **Cross-Feature Integration**:
+3. 🔄 **Cross-Feature Integration**:
   - Hierarchical attention between feature types
   - Numerical features attend to categorical features
   - Preserves type-specific characteristics while enabling interaction
 
-## Best Practices
+## 📈 Best Practices
 
 ### When to Use Standard TabularAttention
 - Data has uniform feature importance
@@ -143,7 +150,7 @@ model = PreprocessingModel(
 - Increase if overfitting
 - Monitor validation performance
 
-## Advanced Usage
+## 🤖 Advanced Usage
 
 ### Custom Layer Integration
 
@@ -186,7 +193,7 @@ attention_layer = PreprocessorLayerFactory.multi_resolution_attention_layer(
 )
 ```
 
-## Performance Considerations
+## 📊 Performance Considerations
 
 1. **Memory Usage**:
    - MultiResolutionTabularAttention is more memory-efficient for categorical features
@@ -203,7 +210,7 @@ attention_layer = PreprocessorLayerFactory.multi_resolution_attention_layer(
    - Monitor memory usage and training time
    - Use gradient clipping to stabilize training
 
-## References
+## 📚 References
 
 - [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - Original transformer paper
 - [TabNet: Attentive Interpretable Tabular Learning](https://arxiv.org/abs/1908.07442) - Attention for tabular data
````
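As a quick reference for the options documented in this file, the sketch below gathers the attention-related keyword arguments into one dictionary. The values shown are placeholders, and the remaining `PreprocessingModel` arguments (feature specs, data location, and so on) are assumed to follow the main KDP documentation rather than this diff.

```python
# Sketch: the documented tabular-attention knobs gathered in one place.
# The values are illustrative defaults, not recommendations from the commit.
attention_kwargs = {
    "tabular_attention": True,                           # enable attention
    "tabular_attention_heads": 4,                        # number of attention heads
    "tabular_attention_dim": 64,                         # attention model dimension
    "tabular_attention_dropout": 0.1,                    # dropout for regularization
    "tabular_attention_placement": "MULTI_RESOLUTION",   # type-specific attention
    "tabular_attention_embedding_dim": 32,               # categorical embedding size
}

# Pass these alongside your feature specs; the other constructor arguments are
# omitted here because they are not shown in this diff.
# model = PreprocessingModel(..., **attention_kwargs)
```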

mkdocs.yml

Lines changed: 4 additions & 2 deletions
````diff
@@ -39,8 +39,10 @@ nav:
   - 🛠️ Defining Features: features.md
   - 🏭 Layers Factory: layers_factory.md
   - 📦 Integrating Preprocessing Model: integrations.md
-  - 🤖 TransformerBlocks: transformer_blocks.md
-  - 🎯 TabularAttention: tabular_attention.md
+  - 🔌 Additional Model Extentions:
+      - 🤖 TransformerBlocks: transformer_blocks.md
+      - 🎯 TabularAttention: tabular_attention.md
+      - 🔂 Features Selection: feature_selection.md
   - 🍦 Motivation: motivation.md
   - 🍻 Contributing: contributing.md
````
4648
