This repository contains all the scripts needed to reproduce the results published in the paper: "Obfuscation Revealed: Electromagnetic obfuscated malware classification".
.
├── README.md
├── requirements.txt
├── run_dl_on_selected_bandwidth.sh  #> script to run the DL for all scenarios on
│                                    #  the (full) testing dataset (available on zenodo)
│                                    #  using pre-computed models
├── run_dl_on_reduced_dataset.sh     #> script to run the training on
│                                    #  a reduced dataset (350 samples per
│                                    #  executable, available on zenodo)
├── run_ml_on_reduced_dataset.sh     #> script to run the end-to-end analysis on
│                                    #  a reduced dataset (350 samples per
│                                    #  executable, available on zenodo)
├── run_ml_on_selected_bandwidth.sh  #> script to run the ML classification for
│                                    #  all scenarios on the pre-computed testing
│                                    #  dataset (available on zenodo)
├── update_lists.sh                  #> script to update the location of the traces
│                                    #  in the lists
│
├── ml_analysis
│   ├── evaluate.py  #> code for the LDA + {NB, SVM} on the
│   │                #  reduced dataset (raw_data_reduced_dataset)
│   ├── NB.py        #> Naïve Bayes with pre-trained model
│   │                #  (traces_selected_bandwidth)
│   ├── SVM.py       #> support vector machine with pre-trained model
│   │                #  (traces_selected_bandwidth)
│   ├── log-evaluation_reduced_dataset.txt    #> output log file for the ML evaluation
│   │                                         #  on the reduced dataset
│   └── log-evaluation_selected_bandwidth.txt #> output log file for the ML evaluation
│                                             #  using the pre-computed models
│
├── dl_analysis
│   ├── evaluate.py  #> code to run MLP and CNN predictions using pre-trained models
│   ├── training.py  #> code to train MLP and CNN and store the models
│   │                #  according to the best validation accuracy
│   ├── evaluation_log_DL.txt                 #> output log file with the accuracies obtained
│   │                                         #  on the testing dataset
│   ├── training_log_reduced_dataset_mlp.txt  #> output log file with the validation accuracies
│   │                                         #  on the reduced dataset for the MLP neural
│   │                                         #  network over all scenarios and bandwidths
│   └── training_log_reduced_dataset_cnn.txt  #> output log file with the validation accuracies
│                                             #  on the reduced dataset for the CNN neural
│                                             #  network over all scenarios and bandwidths
│
├── list_selected_bandwidth  #> lists of the files used for training,
│   │                        #  validating and testing (all in one file)
│   │                        #  for each scenario (but only the testing
│   │                        #  data are available). Lists associated with
│   │                        #  the selected bandwidth dataset
│   ├── files_lists_tagmap=executable_classification.npy
│   ├── files_lists_tagmap=novelty_classification.npy
│   ├── files_lists_tagmap=packer_identification.npy
│   ├── files_lists_tagmap=virtualization_identification.npy
│   ├── files_lists_tagmap=family_classification.npy
│   ├── files_lists_tagmap=obfuscation_classification.npy
│   └── files_lists_tagmap=type_classification.npy
│
│
├── list_reduced_dataset  #> lists of the files used for training,
│   │                     #  validating and testing (all in one file)
│   │                     #  for each scenario. Lists associated with
│   │                     #  the reduced dataset
│   ├── files_lists_tagmap=executable_classification.npy
│   ├── files_lists_tagmap=novelty_classification.npy
│   ├── files_lists_tagmap=packer_identification.npy
│   ├── files_lists_tagmap=virtualization_identification.npy
│   ├── files_lists_tagmap=family_classification.npy
│   ├── files_lists_tagmap=obfuscation_classification.npy
│   └── files_lists_tagmap=type_classification.npy
│
├── pre-processings  #> code used to preprocess the raw traces so that
    │                #  the evaluations can be run
    ├── list_manipulation.py   #> split traces into {learning, testing, validating}
    │                          #  sets
    ├── accumulator.py         #> compute the sum and the sum of squares (to
    │                          #  be able to quickly recompute the NICVs)
    ├── nicv.py                #> compute the NICVs
    ├── corr.py                #> compute Pearson coefficients (alternative to
    │                          #  the NICV)
    ├── displayer.py           #> display NICVs, correlations, traces, ...
    ├── signal_processing.py   #> some signal processing (STFT, ...)
    ├── bandwidth_extractor.py #> extract bandwidths, based on the NICV results,
    │                          #  and create a new dataset
    └── tagmaps                #> all tagmaps used to label the data
        │                      #  (used to create the lists)
        ├── executable_classification.csv
        ├── family_classification.csv
        ├── novelties_classification.csv
        ├── obfuscation_classification.csv
        ├── packer_identification.csv
        ├── type_classification.csv
        └── virtualization_identification.csv
To be able to run the analyses you need to install Python 3.6 and the required packages:
pip install -r requirements.txt
The testing dataset (spectrograms) used in the paper can be downloaded from the following website:
https://zenodo.org/record/5414107
To update, inside the lists, the location of the data you previously downloaded, run the script update_lists.sh:
./update_lists.sh [directory where the lists are stored] [directory where the downloaded spectrograms are stored]
This must be applied to the directories list_selected_bandwidth and list_reduced_dataset, respectively associated with the datasets traces_selected_bandwidth.zip and raw_data_reduced_dataset.zip.
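For example, assuming the zenodo archives were extracted under /data (a hypothetical location), the two calls would look like:
./update_lists.sh list_selected_bandwidth /data/traces_selected_bandwidth
./update_lists.sh list_reduced_dataset /data/raw_data_reduced_dataset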
To run all the machine learning experiments, you can use the scripts run_ml_on_reduced_dataset.sh and run_ml_on_selected_bandwidth.sh:
./run_ml_on_selected_bandwidth.sh [directory where the lists are stored] [directory where the models are stored] [directory where the accumulated data is stored (precomputed in pretrained_models/ACC)]
The results are stored in the file ml_analysis/log-evaluation_selected_bandwidth.txt.
./run_ml_on_reduced_dataset.sh
The results are stored in the file ml_analysis/log-evaluation_reduced_dataset.txt.
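For example, with hypothetical local directories (pretrained_models being the directory holding the pre-computed models):
./run_ml_on_selected_bandwidth.sh list_selected_bandwidth pretrained_models pretrained_models/ACC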
The directory ml_analysis contains the code needed for the classification by Machine Learning (ML).
usage: evaluate.py [-h]
[--lists PATH_LISTS]
[--mean_size MEAN_SIZES]
[--log-file LOG_FILE]
[--acc PATH_ACC]
[--nb_of_bandwidth NB_OF_BANDWIDTH]
[--time_limit TIME_LIMIT]
[--metric METRIC]
optional arguments:
-h, --help show this help message and exit
--lists PATH_LISTS Absolute path to a file containing the lists
--mean_size MEAN_SIZES Size of each mean
--log-file LOG_FILE Absolute path to the file to save results
--acc PATH_ACC Absolute path of the accumulators directory
--nb_of_bandwidth NB_OF_BANDWIDTH Number of bandwidths to extract
--time_limit TIME_LIMIT Percentage of time to keep (from the beginning)
--metric METRIC Metric used to select bandwidths: {nicv, corr}_{mean, max}
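As a sketch, a possible invocation (all paths and values below are hypothetical):
python3 ml_analysis/evaluate.py --lists /data/list_reduced_dataset/files_lists_tagmap=executable_classification.npy --acc pretrained_models/ACC --nb_of_bandwidth 5 --metric nicv_max --log-file ml_analysis/log-evaluation_reduced_dataset.txt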
usage: NB.py [-h]
[--lists PATH_LISTS]
[--model_lda MODEL_LDA]
[--model_nb MODEL_NB]
[--mean_size MEAN_SIZES]
[--log-file LOG_FILE]
[--time_limit TIME_LIMIT]
[--acc PATH_ACC]
optional arguments:
-h, --help show this help message and exit
--lists PATH_LISTS Absolute path to a file containing the lists
--model_lda MODEL_LDA Absolute path to the file where the LDA model has been previously saved
--model_nb MODEL_NB Absolute path to the file where the NB model has been previously saved
--mean_size MEAN_SIZES Size of each mean
--log-file LOG_FILE Absolute path to the file to save results
--time_limit TIME_LIMIT Percentage of time to keep (from the beginning)
--acc PATH_ACC Absolute path of the accumulators directory
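As a sketch, a possible invocation (the paths, and in particular the model file locations, are purely illustrative):
python3 ml_analysis/NB.py --lists /data/list_selected_bandwidth/files_lists_tagmap=executable_classification.npy --model_lda /data/models/lda_executable_classification --model_nb /data/models/nb_executable_classification --acc pretrained_models/ACC --log-file ml_analysis/log-evaluation_selected_bandwidth.txt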
usage: read_logs.py [-h]
[--path PATH]
[--plot PATH_TO_PLOT]
optional arguments:
-h, --help show this help message and exit
--path PATH Absolute path to the log file
--plot PATH_TO_PLOT Absolute path to save the plot
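As a sketch, assuming read_logs.py resides in ml_analysis alongside the other ML scripts (output path hypothetical):
python3 ml_analysis/read_logs.py --path ml_analysis/log-evaluation_reduced_dataset.txt --plot /tmp/accuracies.png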
usage: SVM.py [-h]
[--lists PATH_LISTS]
[--model_lda MODEL_LDA]
[--model_svm MODEL_SVM]
[--mean_size MEAN_SIZES]
[--log-file LOG_FILE]
[--time_limit TIME_LIMIT]
[--acc PATH_ACC]
optional arguments:
-h, --help show this help message and exit
--lists PATH_LISTS Absolute path to a file containing the lists
--model_lda MODEL_LDA Absolute path to the file where the LDA model has been previously saved
--model_svm MODEL_SVM Absolute path to the file where the SVM model has been previously saved
--mean_size MEAN_SIZES Size of each mean
--log-file LOG_FILE Absolute path to the file to save results
--time_limit TIME_LIMIT Percentage of time to keep (from the beginning)
--acc PATH_ACC Absolute path of the accumulators directory
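As a sketch, a possible invocation (the paths, and in particular the model file locations, are purely illustrative):
python3 ml_analysis/SVM.py --lists /data/list_selected_bandwidth/files_lists_tagmap=executable_classification.npy --model_lda /data/models/lda_executable_classification --model_svm /data/models/svm_executable_classification --acc pretrained_models/ACC --log-file ml_analysis/log-evaluation_selected_bandwidth.txt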
The folder dl_analysis contains a script for prediction (evaluate.py) and one for training (training.py) the CNN and MLP network models used.
Script to run the prediction on a testing dataset using pre-trained models.
usage: evaluate.py [-h]
[--lists PATH_LISTS]
[--acc PATH_ACC]
[--band NB_OF_BANDWIDTH]
[--model h5-file containing precomputed model]
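As a sketch, a possible invocation (paths hypothetical; the model filename follows the convention described further below):
python3 dl_analysis/evaluate.py --lists /data/list_selected_bandwidth/files_lists_tagmap=executable_classification.npy --acc pretrained_models/ACC --band 5 --model pretrained_models/CNN/executable_classification.h5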
Script to run the training for our mlp or cnn model on a training and validation dataset and store the trained models.
usage: training.py [-h]
[--lists PATH_LISTS]
[--acc PATH_ACC]
[--band NB_OF_BANDWIDTH]
[--epochs number of epochs]
[--batch batch size]
[--arch neural network architecture {cnn, mlp}]
[--save filename to store model (h5 file)]
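As a sketch, a possible invocation (paths and training parameters hypothetical; the output name follows the convention described further below):
python3 dl_analysis/training.py --lists /data/list_reduced_dataset/files_lists_tagmap=executable_classification.npy --acc pretrained_models/ACC --band 5 --epochs 100 --batch 100 --arch cnn --save CNN_executable_classification_band_5.h5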
To run all the deep learning experiments on the testing dataset (downloaded from zenodo) using pre-trained models, you can use the script run_dl_on_selected_bandwidth.sh:
./run_dl_on_selected_bandwidth.sh [directory where the lists are stored] [directory where the models are stored] [directory where the accumulated data is stored (precomputed in pretrained_models/ACC)]
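For example, with hypothetical local directories:
./run_dl_on_selected_bandwidth.sh list_selected_bandwidth pretrained_models pretrained_models/ACC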
To train and store pre-trained models for the MLP and CNN architectures using the reduced dataset (downloaded from zenodo), you can use the script run_dl_on_reduced_dataset.sh:
./run_dl_on_reduced_dataset.sh [directory where the lists are stored] [directory where the accumulated data is stored (precomputed in pretrained_models/ACC)] [DL architecture {cnn or mlp}] [number of epochs (e.g. 100)] [batch size (e.g. 100)]
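For example, to train the CNN for 100 epochs with batches of 100 (directories hypothetical):
./run_dl_on_reduced_dataset.sh list_reduced_dataset pretrained_models/ACC cnn 100 100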
Pre-trained models will be stored as {MLP,CNN}_{name of the scenario}_band_{the amount of selected bandwidth}.h5 and can be used with the script evaluate.py, or automatically with run_dl_on_selected_bandwidth.sh. Note that run_dl_on_selected_bandwidth.sh expects models with a filename {name of the scenario}.h5 in the subfolders MLP and CNN. To obtain these, simply select the bandwidth that achieved the highest validation accuracy, shorten the filename to the scenario name, and store the file in the corresponding subfolder. Validation accuracies for all scenarios and bandwidths are stored in training_log_reduced_dataset_{cnn,mlp}.txt.
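As a sketch of this renaming step (the band number below is hypothetical):
mkdir -p CNN
cp CNN_executable_classification_band_5.h5 CNN/executable_classification.h5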
Once the traces have been acquired, and before being able to run the evaluations, some pre-processing is needed. The required pre-processing steps are already written in the scripts listed above.
usage: accumulator.py [-h]
[--lists PATH_LISTS]
[--output OUTPUT_PATH]
[--no_stft]
[--freq FREQ]
[--window WINDOW]
[--overlap OVERLAP]
[--core CORE]
[--duration DURATION]
[--device DEVICE]
optional arguments:
-h, --help show this help message and exit
--lists PATH_LISTS Absolute path of the lists (cf. list_manipulation.py -- using a main list will help)
--output OUTPUT_PATH Absolute path of the output directory
--no_stft If no STFT needs to be applied to the listed data
--freq FREQ Frequency of the acquisition in Hz
--window WINDOW Window size for STFT
--overlap OVERLAP Overlap size for STFT
--core CORE Number of cores to use for the multithreaded accumulation
--duration DURATION To fix the duration of the input traces (padded if the input is shorter, cut otherwise)
--device DEVICE Used device under test
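As a sketch, a possible invocation (all paths and acquisition parameters below are hypothetical):
python3 pre-processings/accumulator.py --lists /data/main_lists.npy --output /data/ACC --freq 2000000 --window 2048 --overlap 1024 --core 4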
usage: bandwidth_extractor.py [-h]
[--acc PATH_ACC]
[--lists LISTS [LISTS ...]]
[--plot PATH_TO_PLOT]
[--nb_of_bandwidth NB_OF_BANDWIDTH]
[--log-level LOG_LEVEL]
[--output_traces PATH_OUTPUT_TRACES]
[--output_lists PATH_OUTPUT_LISTS]
[--freq FREQ]
[--window WINDOW]
[--overlap OVERLAP]
[--device DEVICE]
[--metric METRIC]
[--core CORE]
[--duration DURATION]
optional arguments:
-h, --help show this help message and exit
--acc PATH_ACC Absolute path of the accumulators directory
--lists LISTS [LISTS ...] Absolute path to all the lists (for each scenario). /!\ The data in the first one must contain all traces.
--plot PATH_TO_PLOT Absolute path to a file to save the plot
--nb_of_bandwidth NB_OF_BANDWIDTH Number of bandwidths to extract (-1 means that all bandwidths will be kept)
--log-level LOG_LEVEL Configure the logging level: DEBUG|INFO|WARNING|ERROR|FATAL
--output_traces PATH_OUTPUT_TRACES Absolute path to the directory where the traces will be saved
--output_lists PATH_OUTPUT_LISTS Absolute path to the files where the new lists will be saved
--freq FREQ Frequency of the acquisition in Hz
--window WINDOW Window size for STFT
--overlap OVERLAP Overlap size for STFT
--device DEVICE Used device under test
--metric METRIC Metric to use for the PoI selection: {nicv, corr}_{mean, max}
--core CORE Number of cores to use for multithreading
--duration DURATION To fix the duration of the input traces (padded if the input is shorter, cut otherwise)
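As a sketch, a possible invocation (all paths and parameters below are hypothetical; remember that the first list must contain all traces):
python3 pre-processings/bandwidth_extractor.py --acc /data/ACC --lists /data/main_lists.npy /data/files_lists_tagmap=executable_classification.npy --nb_of_bandwidth 5 --metric nicv_max --output_traces /data/traces_selected_bandwidth --output_lists /data/new_lists --freq 2000000 --window 2048 --overlap 1024 --core 4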
usage: corr.py [-h]
[--acc PATH_ACC]
[--lists PATH_LISTS]
[--plot PATH_TO_PLOT]
[--scale SCALE]
[--bandwidth_nb BANDWIDTH_NB]
[--metric METRIC]
[--log-level LOG_LEVEL]
optional arguments:
-h, --help show this help message and exit
--acc PATH_ACC Absolute path of the accumulators directory
--lists PATH_LISTS Absolute path to a file containing the main lists
--plot PATH_TO_PLOT Absolute path to the file where to save the plot (/!\ '.png' expected at the end of the filename)
--scale SCALE Scale of the plot: normal|log
--bandwidth_nb BANDWIDTH_NB Display the number of selected bandwidths (by default no bandwidth is selected)
--metric METRIC Metric used to select bandwidths: {corr}_{mean, max}
--log-level LOG_LEVEL Configure the logging level: DEBUG|INFO|WARNING|ERROR|FATAL
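As a sketch, a possible invocation (paths hypothetical):
python3 pre-processings/corr.py --acc /data/ACC --lists /data/main_lists.npy --plot /tmp/corr.png --scale log --metric corr_max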
usage: displayer.py [-h]
[--display_trace PATH_TRACE]
[--display_lists PATH_LISTS]
[--list_idx LIST_IDX]
[--metric METRIC]
[--extension EXTENSION]
[--path_save PATH_SAVE]
optional arguments:
-h, --help show this help message and exit
--display_trace PATH_TRACE Absolute path to the trace to display
--display_lists PATH_LISTS Absolute path to the list to display
--list_idx LIST_IDX which list to display (all = -1, learning: 0, validating: 1, testing: 2)
--metric METRIC Metric applied for displaying a set (mean, std, means, stds)
--extension EXTENSION Extension of the raw traces
--path_save PATH_SAVE Absolute path to save the figure (if None, the figure is displayed in a pop-up)
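As a sketch, displaying the mean of the learning set of a list (paths hypothetical):
python3 pre-processings/displayer.py --display_lists /data/main_lists.npy --list_idx 0 --metric mean --path_save /tmp/learning_set.png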
usage: list_manipulation.py [-h]
[--raw PATH_RAW]
[--tagmap PATH_TAGMAP]
[--save PATH_SAVE]
[--main-lists PATH_MAIN_LISTS]
[--extension EXTENSION]
[--log-level LOG_LEVEL]
[--lists PATH_LISTS]
[--new_dir PATH_NEW_DIR]
[--nb_of_traces_per_label NB_OF_TRACES_PER_LABEL]
optional arguments:
-h, --help show this help message and exit
--raw PATH_RAW Absolute path to the raw data directory
--tagmap PATH_TAGMAP Absolute path to a file containing the tag map
--save PATH_SAVE Absolute path to a file to save the lists
--main-lists PATH_MAIN_LISTS Absolute path to a file containing the main lists
--extension EXTENSION Extension of the raw traces
--log-level LOG_LEVEL Configure the logging level: DEBUG|INFO|WARNING|ERROR|FATAL
--lists PATH_LISTS Absolute path to a file containing lists
--new_dir PATH_NEW_DIR Absolute path to the raw data, to change in a given file lists
--nb_of_traces_per_label NB_OF_TRACES_PER_LABEL Number of traces to keep per label
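As a sketch, creating the lists for one scenario from the raw data (paths hypothetical; the tagmap comes from the tagmaps directory listed above):
python3 pre-processings/list_manipulation.py --raw /data/raw_traces --tagmap pre-processings/tagmaps/executable_classification.csv --save /data/files_lists_tagmap=executable_classification.npy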
usage: nicv.py [-h]
[--acc PATH_ACC]
[--lists PATH_LISTS]
[--plot PATH_TO_PLOT]
[--scale SCALE]
[--time_limit TIME_LIMIT]
[--bandwidth_nb BANDWIDTH_NB]
[--metric METRIC]
[--log-level LOG_LEVEL]
optional arguments:
-h, --help show this help message and exit
--acc PATH_ACC Absolute path of the accumulators directory
--lists PATH_LISTS Absolute path to a file containing the main lists
--plot PATH_TO_PLOT Absolute path to save the plot
--scale SCALE Scale of the plot: normal|log
--time_limit TIME_LIMIT Percentage of time to keep (from the beginning)
--bandwidth_nb BANDWIDTH_NB Display the number of selected bandwidths (by default no bandwidth is selected)
--metric METRIC Metric used to select bandwidths: {nicv}_{mean, max}
--log-level LOG_LEVEL Configure the logging level: DEBUG|INFO|WARNING|ERROR|FATAL
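As a sketch, a possible invocation (paths hypothetical):
python3 pre-processings/nicv.py --acc /data/ACC --lists /data/main_lists.npy --plot /tmp/nicv.png --scale log --bandwidth_nb 5 --metric nicv_max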
usage: signal_processing.py [-h]
[--input INPUT]
[--dev DEVICE]
[--output OUTPUT]
[--freq FREQ] [--window WINDOW] [--overlap OVERLAP]
optional arguments:
-h, --help show this help message and exit
--input INPUT Absolute path to a raw trace
--dev DEVICE Type of file as input (pico|hackrf|i)
--output OUTPUT Absolute path to file where to save the axis
--freq FREQ Frequency of the acquisition in Hz
--window WINDOW Window size for STFT
--overlap OVERLAP Overlap size for STFT
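As a sketch, computing the STFT of a single raw trace (all paths and parameters below are hypothetical):
python3 pre-processings/signal_processing.py --input /data/trace_0001 --dev pico --freq 2000000 --window 2048 --overlap 1024 --output /tmp/stft_axis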