A comprehensive, dataset-agnostic machine learning framework for rapid development and deployment of classification pipelines on tabular data, with advanced DevOps tooling built in.
- Overview
- Key Features
- Architecture
- Installation
- Quick Start
- Configuration
- Supported Models & Metrics
- Use Cases
- Performance
- Contributing
- Documentation
- License
Efficient Classifier is an enterprise-grade machine learning framework designed to accelerate the development lifecycle from data preprocessing to model deployment. Built with scalability and reproducibility in mind, it provides a unified interface for experimenting with multiple classification pipelines while maintaining rigorous tracking of experiments and results.
The framework supports both binary and multiclass classification tasks and has been extensively validated on real-world datasets, including the CCCS-CIC-AndMal-2020 cybersecurity dataset where it achieved 92% F1-score performance.
Our framework has been applied to cutting-edge cybersecurity research:
- Research Paper - CCCS-CIC-AndMal-2020 Analysis
- Complete Results - Plots, logs, and execution history
- Technical Report - Methodology and findings
- EDA Notebook - Exploratory data analysis
- Presentation - Project overview
- Multi-pipeline orchestration with customizable architectures
- Zero-boilerplate configuration through YAML
- Automated hyperparameter optimization (Grid, Random, Bayesian)
- One-command execution from data to deployment
- Comprehensive residual analysis and confusion matrices
- LIME-based feature importance with permutation testing
- Model calibration with reliability diagrams
- Cross-validation with stratified sampling
- Real-time training progress monitoring
- Slack bot integration for real-time notifications
- Automated DAG visualization of pipeline architectures
- Model serialization with joblib/pickle support
- Comprehensive logging and experiment tracking
- Built-in testing framework integration
- Multithreaded processing where parallelization is beneficial
- Memory-efficient data handling for large datasets
- Optimized feature selection algorithms (Boruta, L1 regularization)
- Smart caching mechanisms for repeated operations
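As an illustration of the stratified cross-validation feature listed above, here is a minimal scikit-learn sketch; the dataset and model are placeholders, not the framework's defaults:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Small multiclass dataset, used purely for illustration
X, y = load_iris(return_X_y=True)

# Stratified sampling keeps class proportions equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
mean_score = scores.mean()
```

Each of the five scores is computed on a held-out fold whose class balance matches the full dataset.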
The framework follows a modular, stage-based architecture:
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Data Loading   │ -> │  Preprocessing   │ -> │ Feature Analysis│
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                                        |
                                                        v
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│     DevOps      │ <- │   Optimization   │ <- │    Modeling     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
Stage | Feature Name | Description |
---|---|---|
Dataset | Data splitting analysis | Splits the data and analyzes standard-error (SE) variation as the split ratio changes |
Dataset | Data splitting plots | After splitting, stores plots of the feature distributions in each set to verify that the data distribution is consistent across sets |
Data preprocessing | Feature encoding | Encodes (one-hot) features in a given dataset |
Data preprocessing | Missing values | Finds missing values based on a predefined set of missing data placeholders |
Data preprocessing | Duplicates | Finds and deletes duplicate observations |
Data preprocessing | Outlier detection | Finds outliers (based on different procedures: percentile, IQR) and curates them (based on different procedures: clipping, winsorization, elimination) |
Data preprocessing | Bound checking | Checks values stay within predefined range in order to detect possible measurement errors |
Data preprocessing | Feature scaling | Scales features (uses: robust scaler, minmax, standard) |
Data preprocessing | Class imbalance | Addresses class imbalance issues (uses: ADASYN or SMOTE) |
Feature analysis | Manual feature selection | Mutual information, low variance, multicollinearity and PCA |
Feature analysis | Automatic feature selection | L1, Boruta |
Feature analysis | Feature engineering | dataset-specific |
Modelling | Support for diverse models | Neural nets, linear classifiers, heuristic classifiers (e.g., majority class), ensembles, stacking, gradient boosting machines, instance-based models and more (see full model support list below) |
Modelling | Cross-model comparison | Assesses each pipeline's models against all others across several metrics (see full metrics support list below) |
Modelling | Intra-model comparison | Assesses each pipeline's models against their own training predictions across several metrics (see full metrics support list below) |
Modelling | Per-epoch progress | Shows a per-epoch progress graph. Only available for neural networks |
Modelling | Residual analysis | Plots confusion matrices and stores and plots residuals from the misclassifications |
Modelling | Feature importance | Plots feature importance based on LIME, feature permutations and tree-based feature importance |
Modelling | Reliability diagram | Plots reliability diagrams for each model across all classes |
Modelling | Model calibration | Calibrates classifiers (uses: isotonic or sigmoid) |
Modelling | Model weights | Allows for objective function to consider weights per class |
Modelling | Model optimization | Optimize models (uses: grid, random or bayesian search) |
Modelling | Model serialization | Serializes full model or pipeline objects (uses: joblib or pickle). Deserialization is also supported |
DevOps | SlackBot | Sends real-time logs and plots from pipeline executions |
DevOps | Results CSV | Each unique pipeline run is stored with the features used, model name, model hyperparameters, timestamp and more |
DevOps | DAG visualization | Visualizes every pipeline's architecture and results in an automatically generated DAG plot |
DevOps | Testing | Supports test-driven development |
DevOps | Code bot reviewers | This repo uses CodeRabbit to summarize pull requests and integrate them at a faster pace |
System-wide | Customizable | Adjust configurations.yaml for your dataset and get started in minutes |
System-wide | Efficient | Multithreading is used in every part of the system where sequential execution is not required |
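The outlier-detection row above mentions IQR-based curation via clipping. As a minimal illustration (not the framework's own implementation, and using the conventional k = 1.5 fence), this can be sketched in NumPy:

```python
import numpy as np

def iqr_clip(x, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return np.clip(x, q1 - k * iqr, q3 + k * iqr)

data = np.array([1.0, 2.0, 2.5, 3.0, 100.0])  # 100.0 is an obvious outlier
clipped = iqr_clip(data)  # the outlier is pulled down to the upper fence
```

Winsorization and elimination follow the same fence computation but replace or drop the offending values instead of clipping them.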
pip install efficient-classifier
git clone https://github.com/javidsegura/efficient-classifier.git
cd efficient-classifier
pip install -r requirements.txt
For Slack bot integration, create a .env file:
SLACK_BOT_TOKEN=your_bot_token
SLACK_SIGNING_SECRET=your_signing_secret
SLACK_APP_TOKEN=your_app_token
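A hypothetical helper (not part of the library) for reading those Slack credentials from the environment and failing fast when one is missing:

```python
import os

def load_slack_config():
    """Read Slack credentials from the environment; raise if any are missing.

    The variable names match the .env file above. This helper is an
    illustration only, not a function exposed by efficient-classifier.
    """
    keys = ["SLACK_BOT_TOKEN", "SLACK_SIGNING_SECRET", "SLACK_APP_TOKEN"]
    missing = [k for k in keys if not os.environ.get(k)]
    if missing:
        raise RuntimeError(f"Missing Slack credentials: {missing}")
    return {k: os.environ[k] for k in keys}
```

Failing at startup with an explicit list of missing variables is usually easier to debug than a Slack API error deep inside a pipeline run.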
from efficient_classifier.root import run_pipeline
import yaml

PATH = "configurations.yaml"
with open(PATH) as f:
    variables = yaml.load(f, Loader=yaml.FullLoader)

if __name__ == "__main__":
    run_pipeline(variables=variables)
- Configure dataset-specific cleaning in pipeline_runner.py:
def _clean_dataset_set_up_dataset_specific(self, df):
    # Your custom preprocessing logic
    return cleaned_df
- Implement feature engineering in featureAnalysis_runner.py:
def _run_feature_engineering_dataset_specific(self, df):
    # Your custom feature engineering
    return engineered_df
- Update boundary conditions in bound_config.py for data validation.
- After adjusting the data cleaning and feature engineering code, you will most likely still need to adjust the following parameters:
- class weights
- bound checking (bound_config.py)
- features to encode
- data preprocessing modes
- the optimization space (parameters commented as 'MAJOR IMPACT IN PERFORMANCE' significantly increase computational costs over the long run)
- If you want to create a new pipeline, you also need to adjust the following:
- data preprocessing
- models to include/exclude
- the grid space (both in configurations.yaml and in modelling_runner_states_in.py)
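To get a feel for what extending a grid space means in practice, here is a generic scikit-learn grid-search sketch; the model and parameter grid are illustrative, not the framework's defaults:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Hypothetical grid: every combination of these values is evaluated with 3-fold CV
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
best = search.best_params_  # best-scoring combination
```

Each extra value in the grid multiplies the number of fits, which is exactly why the 'MAJOR IMPACT IN PERFORMANCE' parameters deserve care.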
The framework uses a comprehensive YAML configuration system. Key configuration sections:
general:
  pipelines_names: ["baseline", "advanced", "ensemble"]
  max_plots_per_function: 10 # Control visualization output

phase_runners:
  dataset_runners:
    split_df:
      p: .85 # Expected accuracy
      step: 0.05 # Granularity of split analysis
    encoding:
      y_column: "target" # Target variable name
  modelling_runner:
    class_weights:
      weights: {0: 1.0, 1: 2.0} # Handle class imbalance
    models_to_include:
      baseline: ["Random Forest", "Logistic Regression"]
      advanced: ["XGBoost", "Neural Network"]
    optimization:
      method: "bayesian" # grid, random, or bayesian
      cv_folds: 5
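Assuming the nesting shown above, individual settings can be read from the parsed YAML as plain nested dictionaries; this sketch uses PyYAML's safe_load on an inline fragment:

```python
import yaml

# Inline fragment mirroring part of configurations.yaml (assumed nesting)
raw = """
general:
  pipelines_names: ["baseline", "advanced"]
phase_runners:
  modelling_runner:
    optimization:
      method: "bayesian"
      cv_folds: 5
"""
config = yaml.safe_load(raw)

# Nested keys are reached by chained dictionary lookups
method = config["phase_runners"]["modelling_runner"]["optimization"]["method"]
folds = config["phase_runners"]["modelling_runner"]["optimization"]["cv_folds"]
```

safe_load is preferable whenever the configuration file could come from an untrusted source.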
For complete configuration options, see the detailed documentation.
Tree-Based Algorithms:
- Random Forest, Decision Trees, Gradient Boosting
- XGBoost, LightGBM, CatBoost
- AdaBoost with configurable base estimators
Linear Models:
- Logistic Regression, Ridge Classifier
- Linear/Non-linear SVM, SGD Classifier
- Elastic Net with L1/L2 regularization
Advanced Methods:
- Feed-Forward Neural Networks
- Ensemble Stacking (meta-learning)
- K-Nearest Neighbors, Gaussian Naive Bayes
Baseline Models:
- Majority Class Classifier for benchmarking
- Classification Accuracy - Overall correctness
- Precision, Recall, F1-Score - Class-specific performance
- Cohen's Kappa - Inter-rater reliability
- Weighted Accuracy - Class-imbalance adjusted accuracy
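All of these metrics are available in scikit-learn; a minimal sketch on a toy multiclass example (weighted accuracy is represented here by scikit-learn's balanced_accuracy_score, which may differ from the framework's exact definition):

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             cohen_kappa_score, precision_recall_fscore_support)

# Toy 3-class ground truth and predictions
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]

acc = accuracy_score(y_true, y_pred)                      # overall correctness
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)     # per-class, macro-averaged
kappa = cohen_kappa_score(y_true, y_pred)                 # agreement beyond chance
bal_acc = balanced_accuracy_score(y_true, y_pred)         # mean per-class recall
```

Macro averaging weights every class equally, which matters on imbalanced datasets like CCCS-CIC-AndMal-2020.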
To add a new metric, see the detailed Q&A.
Extend model support by modifying modelling_runner.py:
def _model_initializers(self):
    models = {
        # Existing models...
        "Custom Model": YourCustomClassifier(
            param1=self.config['custom_param']
        )
    }
    return models
Our flagship application demonstrates the framework's capabilities in cybersecurity:
- Dataset: CCCS-CIC-AndMal-2020 (Android malware detection)
- Performance: 92% F1-score with Random Forest + Stacking ensemble
- Scale: 50,000+ samples with 145 features
- Deployment: Production-ready model with 15ms inference time
Key Results:
- Outperformed baseline approaches by 23%
- Achieved 92% F1-score with Random Forest + Stacking ensemble
Titanic Survival Prediction: View Results
- 81.3% accuracy with ensemble methods
- Comprehensive feature engineering pipeline
- 5 minute set-up
Iris Classification: View Results
- 100% accuracy across all pipeline configurations
- Validation of multi-class capabilities
- <5 minute set-up
Dataset | Samples | Features | Best Model | F1-Score | Training Time |
---|---|---|---|---|---|
CCCS-CIC-AndMal-2020 | 50K+ | 125 | Random Forest | 92.0% | 45 min |
Titanic | 891 | 12 | Stacking Ensemble | 81.3% | 2 min |
Iris | 150 | 4 | Neural Network | 100% | 30 sec |
Once you have selected the best model from the post-tuning assessment, you can serialize it (with either joblib or pickle). The serialized model contains the results and the sklearn model object at each stage. In production, use the model_sklearn from the assessment attribute of the serialized model, then call model.predict(X) for hard predictions or model.predict_proba(X) for soft predictions.
If you serialize a model or a pipeline you can have:
- Trained sklearn estimator objects
- Complete preprocessing pipelines
- Feature engineering transformations
- Model metadata and performance metrics
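A minimal joblib round-trip illustrating the serialize/deserialize workflow described above; the model and dataset are placeholders, not the framework's own serialization wrapper:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

joblib.dump(model, "model.joblib")        # serialize to disk
restored = joblib.load("model.joblib")    # deserialize in production
hard = restored.predict(X[:5])            # hard predictions
probs = restored.predict_proba(X[:5])     # soft predictions (class probabilities)
```

joblib is generally preferred over raw pickle for sklearn objects because it handles large NumPy arrays more efficiently.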
Slack bot provides real-time updates on training progress, model performance, and system alerts.
Automatically generated DAG visualization showing pipeline architecture, data flow, and performance metrics.
- Multi-label Classification Support
- Cyclical Feature Encoding for temporal data
- Cloud Deployment Integration (AWS, GCP, Azure)
- Docker Containerization for production deployment
- Advanced AutoML Capabilities with neural architecture search
- Missing Value Handling: Assumes preprocessed data (manual handling required)
- Grid Search Configuration: Complex setup process for new parameter spaces
- Stacking Visualization: Not included in DAG visualization
- Per-Pipeline Feature Selection: Currently uses unified feature selection
Operations marked as 'MAJOR IMPACT IN PERFORMANCE' in configuration:
- Bayesian optimization with large parameter spaces
- Neural network training with extensive architectures
- Cross-validation with high fold counts
- Feature selection on high-dimensional datasets
We welcome contributions from the community! Please follow these guidelines:
- Fork the repository
- Create a feature branch:
git checkout -b feature/amazing-feature
- Commit changes:
git commit -m 'Add amazing feature'
- Push to branch:
git push origin feature/amazing-feature
- Open a Pull Request
- Follow PEP 8 style guidelines
- Include comprehensive docstrings
- Add unit tests for new features
- Update documentation for API changes
Please include:
- Description of changes and motivation
- Testing performed and results
- Breaking Changes if applicable
- Documentation updates
- Library Architecture - Design decisions and implementation details
- API Reference - Complete function and class documentation
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
If you use this framework in your research, please cite:
@software{efficient_classifier_2025,
  title={Efficient Classifier: A Dataset-Agnostic ML Framework},
  author={Javier D. and Caterina B. and Juan A. and Federica C. and Irina I. and Juliette J.},
  year={2025},
  url={https://github.com/javidsegura/efficient-classifier}
}
- Built with scikit-learn, XGBoost, and other open-source ML libraries
- Validated on datasets from the Canadian Centre for Cyber Security
- Community contributors and beta testers
Ready to accelerate your ML workflow? Install via pip install efficient-classifier and check out our Quick Start Guide.