Repository containing code for our AISec23 paper:
Theo Chow, Zeliang Kan, Lorenz Linhardt, Lorenzo Cavallaro, Daniel Arp, and Fabio Pierazzi, Drift Forensics of Malware Classifiers, In Proc. of the ACM Workshop on Artificial Intelligence and Security (AISec), 2023
If you use this repository in your own research, please cite our AISec23 paper as follows:
@inproceedings{chow2023driftforensics,
title = {Drift Forensics of Malware Classifiers},
author = {Chow, Theo and Kan, Zeliang and Linhardt, Lorenz and Cavallaro, Lorenzo and Arp, Daniel and Pierazzi, Fabio},
booktitle = {Proc. of the {ACM} Workshop on Artificial Intelligence and Security ({AISec})},
year = {2023},
}
A link to the dataset can be found here
This project requires Python 3, along with the statistical learning stack of NumPy, SciPy, and scikit-learn, as well as secml.
First, install the package dependencies listed in requirements.txt:
pip install -r requirements.txt
To reproduce the paper results, run
python paper_results.py
First, load the desired dataset to obtain the feature matrix X, labels y, timestamps t, family labels f, feature names feature_names, and MD5 hashes md5.
PATH = "../Datasets/extended-features/"
X, y, t, f, feature_names, md5 = load_transcend(f"{PATH}extended-features-X-updated.json",
f"{PATH}extended-features-y-updated.json",
f"{PATH}extended-features-meta-updated.json",
f"{PATH}meta_info_file.tsv")
Reduce the feature space to a manageable size and save the selected feature indices as a .pkl file:
X, feature_names = util.feature_reduction(X, y, feature_names, "pkl_files/feature_index_1000.pkl", feature_size=1000)
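For intuition, a reduction of this kind is commonly done by keeping the features with the largest absolute weights of a linear model. The sketch below illustrates that idea; it is an assumption about how util.feature_reduction might work, not its actual implementation, and reduce_features is a hypothetical helper:
import pickle
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical sketch: train a linear SVM, keep the `feature_size` features
# with the largest absolute weights, and pickle the selected indices for reuse.
# The real util.feature_reduction may use a different selection criterion.
def reduce_features(X, y, feature_names, out_path, feature_size=1000):
    clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
    idx = np.argsort(np.abs(clf.coef_.ravel()))[-feature_size:]
    with open(out_path, "wb") as fh:
        pickle.dump(idx, fh)
    return X[:, idx], [feature_names[i] for i in idx]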
Put the data into a dataset class; this gives us flexibility when selecting samples. Currently there are two main functions in the dataset class: splitting the dataset into time-aware splits for analysis, and finding occurrences of features in the dataset.
dataset = Dataset(X, y, t, f, feature_names, md5)
Look up feature IDs by name:
ids = dataset.get_feature_id_from_name("android")
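Under the hood, a lookup like this plausibly reduces to a substring match over feature_names; the sketch below is an assumption about the internals of get_feature_id_from_name, not the repository's actual implementation:
# Hypothetical sketch: return the indices of all features whose
# name contains the query substring.
def feature_ids_matching(feature_names, query):
    return [i for i, name in enumerate(feature_names) if query in name]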
Find samples in the given families that contain (contains=True) the specified feature IDs, restricted to a given year and month:
dataset.sample_select_from_feature_id(families=['Dowgin','Dnotua','Kuguo','Airpush','Revmob'],ids=ids,contains=True, year=2015, month=1)
Split the dataset and return time-aware indices for training and testing:
train, test = dataset.time_aware_split_index('month', 6, 1)
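Conceptually, a time-aware split trains on an initial window and tests on the months that follow. The sketch below shows the idea for a 6-month training window and a 1-month test window; it is an assumption about time_aware_split_index rather than its exact behaviour (the real method may, for instance, return a sequence of rolling monthly test sets):
import numpy as np

# Hypothetical sketch: train on the first 6 months of data,
# test on the month that follows.
def time_aware_split(t, train_months=6, test_months=1):
    months = np.asarray(t, dtype='datetime64[M]')   # truncate timestamps to months
    start = months.min()
    train_end = start + train_months                # month arithmetic on datetime64[M]
    test_end = train_end + test_months
    train_idx = np.where(months < train_end)[0]
    test_idx = np.where((months >= train_end) & (months < test_end))[0]
    return train_idx, test_idx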
The Analysis class runs the experiments outlined in the paper. Currently, there are three main experiments: base, half, and snoop. The results are logged to a MySQL database and saved to a .pkl file. By default, a directory named pkl_files needs to be created.
analyse = Analysis(X, y, t, f, feature_names, train, test)
training_family = ['Dowgin','Dnotua','Kuguo','Airpush','Revmob']
testing_family = ['Dowgin','Dnotua','Kuguo','Airpush','Revmob']
analyse.run(training_family=training_family, testing_family=testing_family,experiment='snoop', dataset='Transcend')
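The same call works for the other experiments; for example, to run all three in sequence with the same family setup:
# Run the base, half, and snoop experiments back to back.
for experiment in ['base', 'half', 'snoop']:
    analyse.run(training_family=training_family,
                testing_family=testing_family,
                experiment=experiment,
                dataset='Transcend')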
To visualise the results, first load the data in question. The ResultsLoader() class provides an easy way of accessing saved experiments.
training_families = ['Dowgin','Dnotua','Kuguo','Airpush','Revmob']
testing_families = ['Dowgin','Dnotua','Kuguo','Airpush','Revmob']
ResultsLoader().query_database_for_ID('half',training_families,testing_families,'Transcend')
Load the desired data using the IDs returned by ResultsLoader():
result1 = ResultsLoader().load_file_from_id(5)
result2 = ResultsLoader().load_file_from_id(6)
For performance, distribution, and difference plots:
Viz(result1,result2).plot_performance_distribution()
Viz(result1,result2).plot_single('difference')