A comprehensive, dataset-agnostic machine learning framework for rapid development and deployment of classification pipelines on tabular data, with advanced DevOps tooling built in.
- Overview
- Key Features
- Architecture
- Installation
- Quick Start
- Configuration
- Supported Models & Metrics
- Use Cases
- Performance
- Contributing
- Documentation
- License
Efficient Classifier is an enterprise-grade machine learning framework designed to accelerate the development lifecycle from data preprocessing to model deployment. Built with scalability and reproducibility in mind, it provides a unified interface for experimenting with multiple classification pipelines while maintaining rigorous tracking of experiments and results.
The framework supports both binary and multiclass classification tasks and has been extensively validated on real-world datasets, including the CCCS-CIC-AndMal-2020 cybersecurity dataset where it achieved 92% F1-score performance.
Our framework has been applied to cutting-edge cybersecurity research:
- Research Paper - CCCS-CIC-AndMal-2020 Analysis
- Complete Results - Plots, logs, and execution history
- Technical Report - Methodology and findings
- EDA Notebook - Exploratory data analysis
- Presentation - Project overview
- Multi-pipeline orchestration with customizable architectures
- Zero-boilerplate configuration through YAML
- Automated hyperparameter optimization (Grid, Random, Bayesian)
- One-command execution from data to deployment
- Comprehensive residual analysis and confusion matrices
- LIME-based feature importance with permutation testing
- Model calibration with reliability diagrams
- Cross-validation with stratified sampling
- Real-time training progress monitoring
- Slack bot integration for real-time notifications
- Automated DAG visualization of pipeline architectures
- Model serialization with joblib/pickle support
- Comprehensive logging and experiment tracking
- Built-in testing framework integration
- Multithreaded processing where parallelization is beneficial
- Memory-efficient data handling for large datasets
- Optimized feature selection algorithms (Boruta, L1 regularization)
- Smart caching mechanisms for repeated operations
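As an illustration of the stratified cross-validation feature listed above, here is a minimal scikit-learn sketch; the dataset and model are placeholders, not the framework's defaults:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Small multiclass dataset, used purely for illustration
X, y = load_iris(return_X_y=True)

# Stratified sampling keeps class proportions equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
mean_score = scores.mean()
```

Each of the five scores is computed on a held-out fold whose class balance matches the full dataset.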
The framework follows a modular, stage-based architecture:
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Data Loading   │ -> │  Preprocessing   │ -> │ Feature Analysis│
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                        |
                                                        v
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│     DevOps      │ <- │   Optimization   │ <- │    Modeling     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
Stage | Feature Name | Description |
---|---|---|
Dataset | Data splitting analysis | Splits the data and analyzes standard-error (SE) variation as the split ratio changes |
Dataset | Data splitting plots | After splitting, stores plots of the feature distributions in each set to verify that the data distribution is consistent across sets |
Data preprocessing | Feature encoding | Encodes (one-hot) features in a given dataset |
Data preprocessing | Missing values | Finds missing values based on a predefined set of missing data placeholders |
Data preprocessing | Duplicates | Finds and deletes duplicate observations |
Data preprocessing | Outlier detection | Finds outliers (based on different procedures: percentile, IQR) and curates them (based on different procedures: clipping, winsorization, elimination) |
Data preprocessing | Bound checking | Checks values stay within predefined range in order to detect possible measurement errors |
Data preprocessing | Feature scaling | Scales features (uses: robust scaler, minmax, standard) |
Data preprocessing | Class imbalance | Addresses class imbalance issues (uses: ADASYN or SMOTE) |
Feature analysis | Manual feature selection | Mutual information, low variance, multicollinearity and PCA |
Feature analysis | Automatic feature selection | L1, Boruta |
Feature analysis | Feature engineering | dataset-specific |
Modelling | Support for diverse models | Neural nets, linear classifiers, heuristic classifiers (e.g., majority class), ensembles, stacking, gradient boosting machines, instance-based models and more (see full model support list below) |
Modelling | Cross-model comparison | Assesses each pipeline's models against all others across several metrics (see full metrics support list below) |
Modelling | Intra-model comparison | Assesses each pipeline's models against their own training predictions across several metrics (see full metrics support list below) |
Modelling | Per-epoch progress | Shows a per-epoch progress graph. Only available for neural networks |
Modelling | Residual analysis | Plots confusion matrices and stores and plots residuals from the misclassifications |
Modelling | Feature importance | Plots feature importance based on LIME, feature permutations and tree-based feature importance |
Modelling | Reliability diagram | Plots reliability diagrams for each model across all classes |
Modelling | Model calibration | Calibrates classifiers (uses: isotonic or sigmoid) |
Modelling | Model weights | Allows for objective function to consider weights per class |
Modelling | Model optimization | Optimize models (uses: grid, random or bayesian search) |
Modelling | Model serialization | Serializes full model or pipeline objects (uses: joblib or pickle). Deserialization is also supported |
DevOps | SlackBot | Sends real-time logs and plots from pipeline executions |
DevOps | Results CSV | Each unique pipeline run is stored with the features used, model name, model hyperparameters, timestamp and more |
DevOps | DAG visualization | Visualizes every pipeline's architecture and results in an automatically generated DAG plot |
DevOps | Testing | Supports test-driven development |
DevOps | Code bot reviewers | This repo uses CodeRabbit to summarize pull requests and integrate them at a faster pace |
System-wide | Customizable | Adjust configurations.yaml for your dataset and get started in minutes |
System-wide | Efficient | Multithreading is used in every part of the system where sequential execution is not required |
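The outlier-detection row above mentions IQR-based curation via clipping. As a minimal illustration (not the framework's own implementation, and using the conventional k = 1.5 fence), this can be sketched in NumPy:

```python
import numpy as np

def iqr_clip(x, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return np.clip(x, q1 - k * iqr, q3 + k * iqr)

data = np.array([1.0, 2.0, 2.5, 3.0, 100.0])  # 100.0 is an obvious outlier
clipped = iqr_clip(data)  # the outlier is pulled down to the upper fence
```

Winsorization and elimination follow the same fence computation but replace or drop the offending values instead of clipping them.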
pip install efficient-classifier
git clone https://github.com/javidsegura/efficient-classifier.git
cd efficient-classifier
pip install -r requirements.txt
For Slack bot integration, create a .env file:
SLACK_BOT_TOKEN=your_bot_token
SLACK_SIGNING_SECRET=your_signing_secret
SLACK_APP_TOKEN=your_app_token
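A hypothetical helper (not part of the library) for reading those Slack credentials from the environment and failing fast when one is missing:

```python
import os

def load_slack_config():
    """Read Slack credentials from the environment; raise if any are missing.

    The variable names match the .env file above. This helper is an
    illustration only, not a function exposed by efficient-classifier.
    """
    keys = ["SLACK_BOT_TOKEN", "SLACK_SIGNING_SECRET", "SLACK_APP_TOKEN"]
    missing = [k for k in keys if not os.environ.get(k)]
    if missing:
        raise RuntimeError(f"Missing Slack credentials: {missing}")
    return {k: os.environ[k] for k in keys}
```

Failing at startup with an explicit list of missing variables is usually easier to debug than a Slack API error deep inside a pipeline run.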
from efficient_classifier.root import run_pipeline
import yaml

PATH = "configurations.yaml"
with open(PATH) as f:
    variables = yaml.load(f, Loader=yaml.FullLoader)

if __name__ == "__main__":
    run_pipeline(variables=variables)
- Configure dataset-specific cleaning in pipeline_runner.py:
def _clean_dataset_set_up_dataset_specific(self, df):
    # Your custom preprocessing logic
    return cleaned_df
- Implement feature engineering in featureAnalysis_runner.py:
def _run_feature_engineering_dataset_specific(self, df):
    # Your custom feature engineering
    return engineered_df
- Update boundary conditions in bound_config.py for data validation.
- After adjusting the data cleaning and feature engineering code, you will most likely still need to adjust the following parameters:
- class weights
- bound checking (bound_config.py)
- features to encode
- data preprocessing modes
- the optimization space (parameters commented as 'MAJOR IMPACT IN PERFORMANCE' significantly increase computational costs over the long run)
- If you want to create a new pipeline, you also need to adjust the following:
- data preprocessing
- models to include/exclude
- the grid space (both in configurations.yaml and in modelling_runner_states_in.py)
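To get a feel for what extending a grid space means in practice, here is a generic scikit-learn grid-search sketch; the model and parameter grid are illustrative, not the framework's defaults:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Hypothetical grid: every combination of these values is evaluated with 3-fold CV
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
best = search.best_params_  # best-scoring combination
```

Each extra value in the grid multiplies the number of fits, which is exactly why the 'MAJOR IMPACT IN PERFORMANCE' parameters deserve care.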
The framework uses a comprehensive YAML configuration system. Key configuration sections:
general:
  pipelines_names: ["baseline", "advanced", "ensemble"]
  max_plots_per_function: 10 # Control visualization output

phase_runners:
  dataset_runners:
    split_df:
      p: .85 # Expected accuracy
      step: 0.05 # Granularity of split analysis
    encoding:
      y_column: "target" # Target variable name
  modelling_runner:
    class_weights:
      weights: {0: 1.0, 1: 2.0} # Handle class imbalance
    models_to_include:
      baseline: ["Random Forest", "Logistic Regression"]
      advanced: ["XGBoost", "Neural Network"]
    optimization:
      method: "bayesian" # grid, random, or bayesian
      cv_folds: 5
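Assuming the nesting shown above, individual settings can be read from the parsed YAML as plain nested dictionaries; this sketch uses PyYAML's safe_load on an inline fragment:

```python
import yaml

# Inline fragment mirroring part of configurations.yaml (assumed nesting)
raw = """
general:
  pipelines_names: ["baseline", "advanced"]
phase_runners:
  modelling_runner:
    optimization:
      method: "bayesian"
      cv_folds: 5
"""
config = yaml.safe_load(raw)

# Nested keys are reached by chained dictionary lookups
method = config["phase_runners"]["modelling_runner"]["optimization"]["method"]
folds = config["phase_runners"]["modelling_runner"]["optimization"]["cv_folds"]
```

safe_load is preferable whenever the configuration file could come from an untrusted source.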
For complete configuration options, see the detailed documentation.
Tree-Based Algorithms:
- Random Forest, Decision Trees, Gradient Boosting
- XGBoost, LightGBM, CatBoost
- AdaBoost with configurable base estimators
Linear Models:
- Logistic Regression, Ridge Classifier
- Linear/Non-linear SVM, SGD Classifier
- Elastic Net with L1/L2 regularization
Advanced Methods:
- Feed-Forward Neural Networks
- Ensemble Stacking (meta-learning)
- K-Nearest Neighbors, Gaussian Naive Bayes
Baseline Models:
- Majority Class Classifier for benchmarking
- Classification Accuracy - Overall correctness
- Precision, Recall, F1-Score - Class-specific performance
- Cohen's Kappa - Inter-rater reliability
- Weighted Accuracy - Class-imbalance adjusted accuracy
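All of these metrics are available in scikit-learn; a minimal sketch on a toy multiclass example (weighted accuracy is represented here by scikit-learn's balanced_accuracy_score, which may differ from the framework's exact definition):

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             cohen_kappa_score, precision_recall_fscore_support)

# Toy 3-class ground truth and predictions
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]

acc = accuracy_score(y_true, y_pred)                      # overall correctness
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)     # per-class, macro-averaged
kappa = cohen_kappa_score(y_true, y_pred)                 # agreement beyond chance
bal_acc = balanced_accuracy_score(y_true, y_pred)         # mean per-class recall
```

Macro averaging weights every class equally, which matters on imbalanced datasets like CCCS-CIC-AndMal-2020.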
To add a new metric, see the detailed Q&A.
Extend model support by modifying modelling_runner.py:
def _model_initializers(self):
    models = {
        # Existing models...
        "Custom Model": YourCustomClassifier(
            param1=self.config['custom_param']
        )
    }
    return models
Our flagship application demonstrates the framework's capabilities in cybersecurity:
- Dataset: CCCS-CIC-AndMal-2020 (Android malware detection)
- Performance: 92% F1-score with Random Forest + Stacking ensemble
- Scale: 50,000+ samples with 145 features
- Deployment: Production-ready model with 15ms inference time
Key Results:
- Outperformed baseline approaches by 23%
- Achieved 92% F1-score with Random Forest + Stacking ensemble
Titanic Survival Prediction: View Results
- 81.3% accuracy with ensemble methods
- Comprehensive feature engineering pipeline
- 5 minute set-up
Iris Classification: View Results
- 100% accuracy across all pipeline configurations
- Validation of multi-class capabilities
- <5 minute set-up
Dataset | Samples | Features | Best Model | F1-Score | Training Time |
---|---|---|---|---|---|
CCCS-CIC-AndMal-2020 | 50K+ | 125 | Random Forest | 92.0% | 45 min |
Titanic | 891 | 12 | Stacking Ensemble | 81.3% | 2 min |
Iris | 150 | 4 | Neural Network | 100% | 30 sec |
Once you have selected the best model from the post-tuning assessment, you can serialize it (with either joblib or pickle). The serialized model contains the results and the sklearn model object at each stage. In production, use the model_sklearn from the assessment attribute of the serialized model, then call model.predict(X) for hard predictions or model.predict_proba(X) for soft predictions.
If you serialize a model or a pipeline you can have:
- Trained sklearn estimator objects
- Complete preprocessing pipelines
- Feature engineering transformations
- Model metadata and performance metrics
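A minimal joblib round-trip illustrating the serialize/deserialize workflow described above; the model and dataset are placeholders, not the framework's own serialization wrapper:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

joblib.dump(model, "model.joblib")        # serialize to disk
restored = joblib.load("model.joblib")    # deserialize in production
hard = restored.predict(X[:5])            # hard predictions
probs = restored.predict_proba(X[:5])     # soft predictions (class probabilities)
```

joblib is generally preferred over raw pickle for sklearn objects because it handles large NumPy arrays more efficiently.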
Slack bot provides real-time updates on training progress, model performance, and system alerts.
Automatically generated DAG visualization showing pipeline architecture, data flow, and performance metrics.
- Multi-label Classification Support
- Cyclical Feature Encoding for temporal data
- Cloud Deployment Integration (AWS, GCP, Azure)
- Docker Containerization for production deployment
- Advanced AutoML Capabilities with neural architecture search
- Missing Value Handling: Assumes preprocessed data (manual handling required)
- Grid Search Configuration: Complex setup process for new parameter spaces
- Stacking Visualization: Not included in DAG visualization
- Per-Pipeline Feature Selection: Currently uses unified feature selection
Operations marked as 'MAJOR IMPACT IN PERFORMANCE' in configuration:
- Bayesian optimization with large parameter spaces
- Neural network training with extensive architectures
- Cross-validation with high fold counts
- Feature selection on high-dimensional datasets
We welcome contributions from the community! Please follow these guidelines:
- Fork the repository
- Create a feature branch:
git checkout -b feature/amazing-feature
- Commit changes:
git commit -m 'Add amazing feature'
- Push to branch:
git push origin feature/amazing-feature
- Open a Pull Request
- Follow PEP 8 style guidelines
- Include comprehensive docstrings
- Add unit tests for new features
- Update documentation for API changes
Please include:
- Description of changes and motivation
- Testing performed and results
- Breaking Changes if applicable
- Documentation updates
- Library Architecture - Design decisions and implementation details
- API Reference - Complete function and class documentation
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
If you use this framework in your research, please cite:
@software{efficient_classifier_2025,
  title={Efficient Classifier: A Dataset-Agnostic ML Framework},
  author={Javier D. and Caterina B. and Juan A. and Federica C. and Irina I. and Juliette J.},
  year={2025},
  url={https://github.com/javidsegura/efficient-classifier}
}
- Built with scikit-learn, XGBoost, and other open-source ML libraries
- Validated on datasets from the Canadian Centre for Cyber Security
- Community contributors and beta testers
Ready to accelerate your ML workflow? Install via pip install efficient-classifier and check out our Quick Start Guide.