🧾 Catching the Rare: Ensemble and Linear Models for Imbalanced Network Intrusion Detection

This repository contains the code and analysis accompanying the research paper
“Catching the Rare: Ensemble and Linear Models for Imbalanced Network Intrusion Detection.”

The paper is published on Zenodo (Version 2.0): https://doi.org/10.5281/zenodo.18408605


Overview

Network intrusion detection is challenged by extreme class imbalance, where rare attack instances are easily overshadowed by benign traffic. This project investigates supervised machine learning approaches for binary intrusion detection, with an explicit focus on detecting minority (attack) classes rather than optimizing overall accuracy.

While the study is inspired by attack behaviors commonly analyzed in honeypot environments, it does not involve live honeypot deployment. Experiments are conducted using the publicly available CICIDS 2017 dataset, generated in a controlled environment that simulates realistic benign and malicious network traffic.

Core research focus:

  • Linear vs ensemble models under class imbalance
  • Precision–recall analysis over raw accuracy
  • Feature importance and interpretability
  • Transparent discussion of methodological limitations
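
As a rough illustration of what precision–recall analysis looks at, the sketch below sweeps a decision threshold over classifier scores and reports precision and recall for the attack class at each step. It uses only the Python standard library; the labels and scores are invented toy values (not results from the paper), and `pr_points` is a hypothetical helper name, not a function from the notebook.

```python
def pr_points(y_true, scores):
    """Return (threshold, precision, recall) for the attack class (label 1),
    one tuple per distinct score, from the strictest threshold down."""
    total_pos = sum(y_true)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # Every flow scoring at or above the threshold is flagged as an attack.
        flagged = [t for t, s in zip(y_true, scores) if s >= threshold]
        tp = sum(flagged)
        precision = tp / len(flagged)  # at least one flow is always flagged
        recall = tp / total_pos if total_pos else 0.0
        points.append((threshold, precision, recall))
    return points


# Toy labels and scores, invented for illustration (1 = attack).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
scores = [0.05, 0.1, 0.15, 0.2, 0.3, 0.35, 0.4, 0.55, 0.6, 0.9]

for threshold, precision, recall in pr_points(y_true, scores):
    print(f"t={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

Lowering the threshold catches more attacks (higher recall) at the cost of more false alarms (lower precision), a trade-off that a single accuracy figure does not show.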

Goal:

To improve early intrusion detection by identifying deceptive attack vectors through data-driven insights.


⚙️ Setup Instructions

Clone the repo:

git clone https://github.com/harddikk/Catching-the-Rare.git
cd Catching-the-Rare

Install dependencies:

pip install -r requirements.txt

Dataset

This project uses the CICIDS 2017 intrusion detection dataset released by the Canadian Institute for Cybersecurity.

Due to size and licensing constraints, the dataset is not included in this repository.

To reproduce the experiments:

  1. Download CICIDS 2017 from the official source
  2. Place the extracted CSV files in a directory named CICIDS2017/ at the project root.
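
Once the files are in place, a quick check of the class balance can confirm the download. The sketch below is a standard-library illustration, not the notebook's loading code. It assumes each CSV has a label column named `Label` (header names may carry stray whitespace) and that benign rows are marked `BENIGN`; both are assumptions about the file layout and may need adjusting. `count_binary_labels` is a hypothetical helper name.

```python
import csv
from collections import Counter
from pathlib import Path


def count_binary_labels(data_dir="CICIDS2017", label_col="Label", benign="BENIGN"):
    """Count benign (0) vs attack (1) rows across every CSV in data_dir.

    label_col and benign are assumptions about the CSV layout; adjust them
    to match the files you downloaded.
    """
    counts = Counter()
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8", errors="replace") as fh:
            reader = csv.DictReader(fh)
            # Header names may carry stray whitespace, so normalize them.
            reader.fieldnames = [name.strip() for name in (reader.fieldnames or [])]
            for row in reader:
                label = (row.get(label_col) or "").strip()
                if not label:
                    continue  # skip rows without a label
                counts[0 if label == benign else 1] += 1
    return counts
```

Calling `count_binary_labels()` from the project root returns a `Counter` whose key `1` holds the number of attack rows and key `0` the number of benign rows.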


🗂 Project Structure

Catching-the-Rare/
├── honeypot_intrusion_detection.ipynb      # Main Jupyter Notebook
├── requirements.txt                        # Python dependencies
├── LICENSE                                 # CC BY 4.0 License
├── README.md                               # This file
├── CICIDS2017/                             # Dataset folder (not included)
└── results/                                # Folder for selected plots/images
    ├── confusion_matrix_rf.png
    └── feature_importance_rf.png

🚀 Running the Notebook

Open honeypot_intrusion_detection.ipynb in Jupyter Notebook or VS Code and execute all cells sequentially.
Results (plots, metrics, and confusion matrices) are displayed inline.


🧩 Techniques Used

  • Data preprocessing and cleaning
  • Stratified sampling under class imbalance
  • Feature scaling with leakage-aware pipelines
  • Supervised ML models:
    • Logistic Regression (linear baseline)
    • Random Forest (ensemble)
    • XGBoost (ensemble)
  • Evaluation using imbalance-aware metrics:
    • Precision–Recall curves
    • ROC curves
    • Balanced accuracy
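
Two of the techniques above, stratified sampling and leakage-aware scaling, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the notebook's implementation: the data is a toy one-feature set, and `stratified_split`, `fit_scaler`, and `transform` are hypothetical helper names. In scikit-learn, `train_test_split(..., stratify=y)` and a `Pipeline` containing `StandardScaler` serve the same purposes.

```python
import random
from statistics import mean, pstdev


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split at the same ratio."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one sample of every class in the test split.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def fit_scaler(column):
    """Learn mean and standard deviation from training values only."""
    mu = mean(column)
    sigma = pstdev(column) or 1.0  # guard against constant features
    return mu, sigma


def transform(column, mu, sigma):
    """Standardize values using statistics learned elsewhere."""
    return [(x - mu) / sigma for x in column]


# Toy data (assumed): one feature, 95 benign rows and 5 attack rows.
labels = [0] * 95 + [1] * 5
feature = [float(i) for i in range(100)]

train_idx, test_idx = stratified_split(labels)

# Fit on the training rows only, then reuse those statistics on the test rows,
# so no information from the test split leaks into preprocessing.
mu, sigma = fit_scaler([feature[i] for i in train_idx])
train_scaled = transform([feature[i] for i in train_idx], mu, sigma)
test_scaled = transform([feature[i] for i in test_idx], mu, sigma)

print(len(train_idx), len(test_idx))     # 80 20
print(sum(labels[i] for i in test_idx))  # 1 attack row in the test split
```

Splitting per class keeps the rare attack rows represented in both splits, and fitting the scaler on training rows alone is what makes the preprocessing leakage-aware.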

📈 Results

Model performance metrics:

| Model | Accuracy | Precision (weighted) | Recall (weighted) | F1-Score (weighted) |
|---|---|---|---|---|
| Logistic Regression | 0.9416 | 0.9489 | 0.9416 | 0.9435 |
| Random Forest | 0.9986 | 0.9985 | 0.9986 | 0.9985 |
| XGBoost | 0.9991 | 0.9991 | 0.9991 | 0.9991 |

Note: Metrics are calculated on the test split of the dataset.

Selected plots:

  • Confusion Matrix (RF): results/confusion_matrix_rf.png
  • Feature Importance - Top 10 (RF): results/feature_importance_rf.png

Note: Accuracy is reported for completeness but is not emphasized due to severe class imbalance. Confusion matrices for other models and feature importance for XGBoost are available in the notebook.
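
To make the caveat concrete, the sketch below shows how support-weighted averages can stay high while attack recall is poor. The confusion matrix is hypothetical and is not taken from the paper or the notebook; the helper names are likewise illustrative.

```python
def per_class_recall(tn, fp, fn, tp):
    """Recall for benign (class 0) and attack (class 1) from a 2x2 confusion matrix."""
    recall_benign = tn / (tn + fp)
    recall_attack = tp / (tp + fn)
    return recall_benign, recall_attack


def weighted_recall(tn, fp, fn, tp):
    """Support-weighted average of per-class recall (equals overall accuracy)."""
    r0, r1 = per_class_recall(tn, fp, fn, tp)
    n0, n1 = tn + fp, fn + tp
    return (r0 * n0 + r1 * n1) / (n0 + n1)


def balanced_accuracy(tn, fp, fn, tp):
    """Unweighted mean of per-class recall."""
    r0, r1 = per_class_recall(tn, fp, fn, tp)
    return (r0 + r1) / 2


# Hypothetical confusion matrix (not from the paper):
# 9,900 benign flows, 100 attacks, half of the attacks missed.
tn, fp, fn, tp = 9890, 10, 50, 50

print(round(weighted_recall(tn, fp, fn, tp), 4))      # 0.994
print(round(per_class_recall(tn, fp, fn, tp)[1], 4))  # 0.5
print(round(balanced_accuracy(tn, fp, fn, tp), 4))    # 0.7495
```

Because the benign class dominates the weights, the weighted figure barely moves when attacks are missed, which is why per-class and balanced metrics are worth reading alongside the table.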


📚 Citation

If you use or reference this work, please cite:

Tiwari, Hardik. (2026). Catching the Rare: Ensemble and Linear Models for Imbalanced Network Intrusion Detection. Zenodo. https://doi.org/10.5281/zenodo.18408605

📄 License

This project is licensed under the Creative Commons Attribution 4.0 International License.
You’re free to use, modify, and share it — just give proper credit.


👨‍💻 Author

Hardik Tiwari
High school researcher passionate about AI, cybersecurity, and system-level innovation.
Connect on LinkedIn 🚀
