## Project Pipeline Tutorial

This notebook walks through the full processing pipeline of DUNE, step by step.

1. Data Generation
    - Prepare the data for the machine learning (ML) model, which targets hybrid packet- and flow-level classification.
2. Unconstrained Model Analysis
    - Train an unconstrained ML model and extract the relationships between input features and output variables as per-class feature importance (PCFI).
3. Model Partitioning
    - Partition a tree-based classification model into smaller sub-problems by formulating the task as a set partitioning problem (SPP).
4. Cluster Analysis
    - Analyze the performance of each sub-model trained on the smaller sub-tasks into which the unconstrained ML model has been partitioned.
5. Model Sequencing
    - Determine an optimal execution sequence of the sub-models by formulating the problem as a Travelling Salesman Problem (TSP) and solving it with Integer Linear Programming (ILP).


- All generated files (outputs) are stored in the `tutorial_output` folder and are not removed by the pipeline, allowing for possible reuse or future reference.

- Please follow the instructions carefully through the pipeline! :)

In [1]:
import os
import configparser
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')
# Get the path of the tutorial notebook
main_path = !pwd
main_path = main_path[0]
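
The `!pwd` line relies on IPython shell syntax, so it only works inside a notebook. If the setup cell is ever run as a plain Python script, a minimal equivalent could look like the sketch below. It assumes the working directory is the folder that contains this tutorial.

```python
from pathlib import Path

# Plain-Python stand-in for the IPython-only `main_path = !pwd` idiom.
# Assumption: the current working directory is the tutorial's folder.
main_path = str(Path.cwd())
print(main_path)
```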

## Data generation

---
The following steps turn a large unconstrained model into an ordered set of sub-models that can be deployed across distributed switches. For these steps to run smoothly, please generate your data in advance in a format compatible with our code.

Please follow the instructions provided in the README to prepare your data accordingly.

## Unconstrained model analysis

---
- Running the unconstrained model analysis over the full search space defined in `run_unconstrained_model_analysis.py` takes a long time. Once a full run has identified the best model, you can restrict the search space to that model's hyperparameters (maximum depth, maximum number of trees) to speed up the analysis. The reduced search spaces are defined as follows:

-   ToN-IoT --> `max_depth_list = list(range(24, 25))`, `n_trees_list = list(range(17, 18))`

-   UNSW --> `max_depth_list = list(range(23, 24))`, `n_trees_list = list(range(32, 33))`

-  Also set `results_dir_path` to `./tutorial_output/unconstrained_model_analysis_results` so that the results are stored in that folder.

- Before running `run_unconstrained_model_analysis.py`, please also fill in the other required fields in `params.ini`.
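
As a quick sanity check, the sketch below shows that each reduced search space above contains a single (max depth, number of trees) pair, and how `results_dir_path` could be set programmatically. The helper `set_use_case_fields`, the demo file, and its section layout are assumptions for illustration. They mirror the `DEFAULT`/`use_case` pattern used by the other config cells in this notebook and may not match the actual `params.ini` of the analysis.

```python
import configparser
import tempfile
from itertools import product
from pathlib import Path

# Single-point search spaces quoted above (one depth, one tree count each).
SEARCH_SPACES = {
    "ToN-IoT": {"max_depth_list": list(range(24, 25)), "n_trees_list": list(range(17, 18))},
    "UNSW": {"max_depth_list": list(range(23, 24)), "n_trees_list": list(range(32, 33))},
}


def grid(space):
    """Return every (max_depth, n_trees) pair in the search space."""
    return list(product(space["max_depth_list"], space["n_trees_list"]))


def set_use_case_fields(config_path, updates):
    """Update fields in the section named by DEFAULT.use_case and save the file."""
    config = configparser.ConfigParser()
    config.read(config_path)
    section = config["DEFAULT"]["use_case"]
    for key, value in updates.items():
        config[section][key] = value
    with open(config_path, "w") as configfile:
        config.write(configfile)
    return section


# Demo on a throwaway file; this section and field layout is hypothetical.
demo_path = Path(tempfile.mkdtemp()) / "params.ini"
demo_path.write_text("[DEFAULT]\nuse_case = TON-IOT\n\n[TON-IOT]\nresults_dir_path = old\n")
set_use_case_fields(
    demo_path,
    {"results_dir_path": "./tutorial_output/unconstrained_model_analysis_results"},
)

for name, space in SEARCH_SPACES.items():
    print(name, grid(space))
```
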
#### output
- `run_unconstrained_model_analysis.py` will generate two files:
    - `importance_weights.csv`: Contains the importance of every feature for each class, derived from the best unconstrained model.
    - `score_per_cluster_per_class_df.csv`: Contains the per-class scores derived from the best unconstrained model.
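
Because the later steps read these files, it can help to confirm that they exist before moving on. The sketch below is one way to do that. `missing_outputs` is a hypothetical helper that is not part of the repository, and the demo uses a temporary folder with a placeholder file name instead of the real results directory.

```python
import tempfile
from pathlib import Path


def missing_outputs(results_dir, expected_names):
    """Return the expected file names that are not present in results_dir."""
    results_dir = Path(results_dir)
    return [name for name in expected_names if not (results_dir / name).is_file()]


# Demo on a temporary folder that contains only one of the two requested files.
# "not_generated.csv" is a placeholder name used only for this illustration.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "importance_weights.csv").touch()
print(missing_outputs(demo_dir, ["importance_weights.csv", "not_generated.csv"]))
```

In the tutorial, `results_dir` would point to the `perf_results` folder under `./tutorial_output/unconstrained_model_analysis_results`, and an empty list would mean that all expected files are present.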

In [2]:
# Install the requirements for unconstrained model analysis and create a folder to store the output files
!pip3 install -r unconstrained_model_analysis/requirements.txt
os.makedirs(main_path + '/tutorial_output/unconstrained_model_analysis_results', exist_ok=True)

Defaulting to user installation because normal site-packages is not writeable


In [3]:
# Run unconstrained model analysis
!python3 -m unconstrained_model_analysis.src.run_unconstrained_model_analysis

['normal', 'scanning', 'ddos', 'injection', 'password', 'xss', 'ransomware']
INFO:TON-IOT:Will use 6 cores. Starting pool...
INFO:TON-IOT:Starting analysis of: npoint 2
INFO:TON-IOT:Starting analysis of: npoint 3
INFO:TON-IOT:Starting analysis of: npoint 4
INFO:TON-IOT:File tutorial_output/unconstrained_model_analysis_results/TON-IOT_models_2pkts.csv is present. To overwrite existing files pass force=True when running the analysis
INFO:TON-IOT:Finished analyzing n=2, Results at: tutorial_output/unconstrained_model_analysis_results
INFO:TON-IOT:File tutorial_output/unconstrained_model_analysis_results/TON-IOT_models_3pkts.csv is present. To overwrite existing files pass force=True when running the analysis
INFO:TON-IOT:Finished analyzing n=3, Results at: tutorial_output/unconstrained_model_analysis_results
INFO:TON-IOT:File tutorial_output/unconstrained_model_analysis_results/TON-IOT_models_4pkts.csv is present. To overwrite existing files pass force=True when running the analysis
INFO:

## Model partitioning

---
- Before running `model_partitioning.py`, please run the following cell to set `weights_file` and `f1_file` in `spp_params.ini` to the paths of `importance_weights.csv` and `score_per_cluster_per_class_df.csv`, respectively. Both files are located in `./tutorial_output/unconstrained_model_analysis_results/perf_results`.
- Alternatively, you can modify the related fields manually.
#### output
- `model_partitioning.py` will generate one file:
    - `[use_case]_SPP_solution_Level_[best_level]_[time].csv`: Shows the blocks in a partition with classes and features chosen for each block. You can find it under `./model_partitioning/src`.
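
To make the idea of a set partitioning problem concrete, here is a minimal brute-force sketch: pick candidate blocks so that every class is covered exactly once at minimum total cost. It only illustrates the problem type and is not the formulation or solver used by `model_partitioning.py`. The candidate blocks and their integer costs are hypothetical.

```python
from itertools import combinations


def solve_spp(universe, candidates):
    """Brute-force set partitioning.

    Choose candidate blocks so that every element of `universe` is covered
    exactly once, minimizing the summed block cost.
    """
    universe = frozenset(universe)
    names = sorted(candidates)
    best_cost, best_blocks = None, None
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            covered = [cls for name in combo for cls in candidates[name][0]]
            # Valid partition: covers the universe with no class repeated.
            if len(covered) == len(universe) and frozenset(covered) == universe:
                cost = sum(candidates[name][1] for name in combo)
                if best_cost is None or cost < best_cost:
                    best_cost, best_blocks = cost, combo
    return best_cost, best_blocks


# Hypothetical candidate blocks: name -> (classes in the block, cost).
classes = ["normal", "scanning", "ddos", "injection"]
candidates = {
    "A": ({"scanning", "injection"}, 5),
    "B": ({"ddos"}, 6),
    "C": ({"normal"}, 4),
    "D": ({"ddos", "normal"}, 9),
    "E": ({"scanning"}, 3),
    "F": ({"injection", "ddos"}, 4),
}

print(solve_spp(classes, candidates))
```

Enumerating all combinations is only feasible for a handful of candidate blocks, which is why practical instances are handed to an optimization solver instead.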

In [4]:
# Load the config file, modify the related field, and save the changes back to the file
config_file_path = main_path + '/model_partitioning/src/spp_params.ini.example'

config = configparser.ConfigParser()
config.read(config_file_path)

config[config['DEFAULT']['use_case']]['weights_file'] = main_path + '/tutorial_output/unconstrained_model_analysis_results/perf_results/importance_weights.csv'
config[config['DEFAULT']['use_case']]['f1_file'] = main_path + '/tutorial_output/unconstrained_model_analysis_results/perf_results/score_per_cluster_per_class_df.csv'

with open(config_file_path, 'w') as configfile:
    config.write(configfile)

In [5]:
# Install the requirements for model partitioning and set the current directory
!pip3 install -r model_partitioning/requirements.txt
os.chdir(main_path + '/model_partitioning/src')

Defaulting to user installation because normal site-packages is not writeable
Collecting numpy==1.24.4
  Using cached numpy-1.24.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting pandas==2.0.3
  Using cached pandas-2.0.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Collecting pytz==2024.1
  Using cached pytz-2024.1-py2.py3-none-any.whl (505 kB)
Collecting scipy==1.10.1
  Using cached scipy-1.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
Collecting six==1.16.0
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting tzdata==2024.1
  Using cached tzdata-2024.1-py2.py3-none-any.whl (345 kB)
Installing collected packages: pytz, tzdata, six, numpy, scipy, pandas
  Attempting uninstall: pytz
    Found existing installation: pytz 2025.1
    Uninstalling pytz-2025.1:
      Successfully uninstalled pytz-2025.1
  Attempting uninstall: tzdata
    Found existing installation: tzdata 2025.1
    Uninstalling

In [6]:
# Run the model partitioning to obtain blocks via SPP
!python3 model_partitioning.py

+----+-----------+---------------------------+------------------------------------------------------------------------------------------------------+
|    |   Cluster | Class List                | Feature List                                                                                         |
|----+-----------+---------------------------+------------------------------------------------------------------------------------------------------|
|  0 |         0 | ['scanning', 'injection'] | ['dstport', 'ip.len', 'Max Packet Length', 'Packet Length Total', 'srcport']                         |
|  1 |         1 | ['ddos']                  | ['dstport', 'ip.len', 'Max Packet Length', 'Packet Length Total', 'udp.length', 'Min Packet Length'] |
|  2 |         2 | ['normal', 'ransomware']  | ['dstport', 'ip.len', 'Max Packet Length', 'Packet Length Total', 'ip.ttl', 'tcp.window_size_value'] |
|  3 |         3 | ['password']              | ['dstport', 'Max Packet Length', 'Packet Length Total

## Cluster analysis

---
- Before running `run_cluster_analysis.py`, please run the following cell to automatically set `cluster_data_file_path` and `results_dir_path` in `params.ini` to the path of the SPP solution file and the directory in which to store the analysis results, respectively.
- Alternatively, you can modify the related fields manually.
#### output
- `run_cluster_analysis.py` will generate one file:
    - `.csv`: Shows the best model chosen for each block after the analysis across the given search space.

In [7]:
# Locate the SPP solution file generated by model partitioning (the first CSV found in the folder)
spp_solution = str(list(Path(main_path + "/model_partitioning/src").glob("*.csv"))[0])
# Load the config file, modify the related field, and save the changes back to the file
config = configparser.ConfigParser()
config_file_path = main_path + '/cluster_analysis/src/params.ini'
config.read(config_file_path)

config[config['DEFAULT']['use_case']]['cluster_data_file_path'] = spp_solution
config[config['DEFAULT']['use_case']]['results_dir_path'] = main_path + '/tutorial_output/cluster_analysis_results'

with open(config_file_path, 'w') as configfile:
    config.write(configfile)
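
The cell above takes the first `*.csv` file found in `model_partitioning/src`, which may not be the intended one if several SPP runs have left solution files there. A possible alternative, sketched below, selects the most recently modified file whose name follows the `[use_case]_SPP_solution_Level_[best_level]_[time].csv` pattern. `latest_spp_solution` is a hypothetical helper, and the demo file names are made up.

```python
import os
import tempfile
from pathlib import Path


def latest_spp_solution(directory, pattern="*_SPP_solution_*.csv"):
    """Return the most recently modified SPP solution file in `directory`."""
    matches = list(Path(directory).glob(pattern))
    if not matches:
        raise FileNotFoundError(f"No file matching {pattern!r} in {directory}")
    return max(matches, key=lambda path: path.stat().st_mtime)


# Demo with two empty, hypothetically named files and explicit timestamps.
demo_dir = Path(tempfile.mkdtemp())
older = demo_dir / "TON-IOT_SPP_solution_Level_1_run1.csv"
newer = demo_dir / "TON-IOT_SPP_solution_Level_2_run2.csv"
for path, mtime in ((older, 1_000_000_000), (newer, 1_000_000_100)):
    path.touch()
    os.utime(path, (mtime, mtime))

print(latest_spp_solution(demo_dir).name)
```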

In [8]:
# Create a folder to store the results and set the current directory
os.makedirs(main_path + '/tutorial_output/cluster_analysis_results', exist_ok=True)
os.chdir(main_path + '/cluster_analysis/src')

In [9]:
# Run cluster analysis
!python3 run_cluster_analysis.py

INFO:TON-IOT:Will use 8 cores. Starting pool...
INFO:TON-IOT:Starting analysis of: Cluster id: 0, npoint 2
INFO:TON-IOT:Starting analysis of: Cluster id: 1, npoint 2
INFO:TON-IOT:Starting analysis of: Cluster id: 2, npoint 2
INFO:TON-IOT:Starting analysis of: Cluster id: 3, npoint 2
INFO:TON-IOT:Starting analysis of: Cluster id: 4, npoint 2
INFO:TON-IOT:Starting analysis of: Cluster id: 0, npoint 3
INFO:TON-IOT:Starting analysis of: Cluster id: 2, npoint 3
INFO:TON-IOT:Starting analysis of: Cluster id: 1, npoint 3
INFO:TON-IOT:File /home/beyzabutun/distributed_in_band/tutorial_output/cluster_analysis_results/TON-IOT_models_2pkts_Cluster1.csv is present. To overwrite existing files pass force=True when running the analysis
INFO:TON-IOT:File /home/beyzabutun/distributed_in_band/tutorial_output/cluster_analysis_results/TON-IOT_models_2pkts_Cluster0.csv is present. To overwrite existing files pass force=True when running the analysis
INFO:TON-IOT:Finished analyzing n=2, Cluster=1. Results 

## Model sequencing

---
- Before running `model_sequencing.py`, please run the following cell to automatically set `best_models_per_cluster_path` and `results_dir_path` in `params.ini` to the path of the cluster analysis results file and the directory in which to store the results, respectively.
- Alternatively, you can modify the related fields manually.
#### output

- `model_sequencing.py` prints the optimal sequence in which the blocks should be executed.
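
For intuition about the sequencing step, the sketch below solves a tiny Travelling Salesman Problem by brute force: it tries every ordering that starts at block 0 and keeps the cheapest closed tour. This is a stand-in for illustration only. `model_sequencing.py` solves an ILP formulation, and the cost matrix here is hypothetical and not derived from the pipeline's results.

```python
from itertools import permutations


def brute_force_tsp(cost):
    """Return (total cost, tour) of the cheapest closed tour starting at node 0.

    cost[i][j] is the cost of visiting node j immediately after node i.
    """
    n = len(cost)
    best_cost, best_tour = None, None
    for perm in permutations(range(1, n)):
        tour = (0,) + perm
        # Sum consecutive legs, including the leg back to the start node.
        total = sum(cost[tour[i]][tour[(i + 1) % n]] for i in range(n))
        if best_cost is None or total < best_cost:
            best_cost, best_tour = total, tour
    return best_cost, best_tour


# Hypothetical asymmetric cost matrix for four blocks.
cost = [
    [0, 2, 9, 10],
    [1, 0, 6, 4],
    [15, 7, 0, 8],
    [6, 3, 12, 0],
]

print(brute_force_tsp(cost))
```

Brute force checks `(n - 1)!` orderings, so it is only practical for a few blocks. An ILP solver scales further by pruning the search space.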

In [10]:
# Set the current directory, load the config file, modify the related fields, and save the changes back to the file
os.chdir(main_path + '/model_sequencing/src')
config = configparser.ConfigParser()
config_file_path = main_path + '/model_sequencing/src/params.ini'
config.read(config_file_path)

config[config['DEFAULT']['use_case']]['best_models_per_cluster_path'] = main_path + '/tutorial_output/cluster_analysis_results/perf_results/cluster_info_df.csv'
config[config['DEFAULT']['use_case']]['results_dir_path'] = main_path + '/tutorial_output/model_sequencing_results'

with open(config_file_path, 'w') as configfile:
    config.write(configfile)

In [11]:
# Generate a folder to store the results and install the requirements for model sequencing
os.makedirs(main_path + '/tutorial_output/model_sequencing_results', exist_ok=True)
!pip3 install -r ../requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Collecting numpy==2.2.2
  Using cached numpy-2.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.4 MB)
Collecting pandas==2.2.3
  Using cached pandas-2.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)
Collecting pytz==2025.1
  Using cached pytz-2025.1-py2.py3-none-any.whl (507 kB)
Collecting scipy==1.15.1
  Using cached scipy-1.15.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (40.6 MB)
Collecting six==1.17.0
  Using cached six-1.17.0-py2.py3-none-any.whl (11 kB)
Collecting tzdata==2025.1
  Using cached tzdata-2025.1-py2.py3-none-any.whl (346 kB)
Installing collected packages: pytz, tzdata, six, numpy, scipy, pandas
  Attempting uninstall: pytz
    Found existing installation: pytz 2024.1
    Uninstalling pytz-2024.1:
      Successfully uninstalled pytz-2024.1
  Attempting uninstall: tzdata
    Found existing installation: tzdata 2024.1
    Uninstalling t

In [12]:
# Run model sequencing
!python3 model_sequencing.py

INFO:TON-IOT:The analysis starts...
INFO:TON-IOT:Normalized confusion matrix is: 
INFO:TON-IOT:
           0          1          2          3          4
0  71.430913   1.607479   0.120404  29.846975   9.746733
1  19.715202  97.459923   5.059577   5.797520   3.436185
2   4.737300   0.286037  94.762352   0.035711   0.527144
3   0.068491   0.013702   0.000000  62.110956   1.501412
4   4.048094   0.632859   0.057667   2.208838  84.788527
Read LP format model from file /tmp/tmphej2njh0.pyomo.lp
Reading time = 0.00 seconds
x1: 45 rows, 31 columns, 88 nonzeros
Gurobi Optimizer version 12.0.1 build v12.0.1rc0 (linux64 - "Ubuntu 22.04.5 LTS")

CPU model: AMD EPYC 7402P 24-Core Processor, instruction set [SSE2|AVX|AVX2]
Thread count: 24 physical cores, 48 logical processors, using up to 24 threads

Optimize a model with 45 rows, 31 columns and 88 nonzeros
Model fingerprint: 0x26efcc68
Variable types: 1 continuous, 30 integer (25 binary)
Coefficient statistics:
  Matrix range     [1e+00, 5e+00]
 