This is the README file for the reproduction package of the paper: "Automatic Data Labeling for Software Vulnerability Prediction Models: How Far Are We?".
The package contains the following artefacts:
- Data:
  - Contains the code of the 1,582 human-labeled vulnerable files, 3,391 auto/D2A-labeled vulnerable files, and 73,042 non-vulnerable files, which can be used to train file-level vulnerability prediction models.
  - Contains mappings between the data and the data splits used for the 5-round evaluation, as described in Section 4.4.1 of the paper.
- D2A Validation: contains the Excel file (`D2A-labeled VFCs_Validation.xlsx`) of the manual validation of the auto/D2A-labeled vulnerable files.
- Code: contains the source code we used in our work to answer Research Questions (RQs) 2 and 3. Note that we set up the code to run on a computing cluster managed by Slurm. Therefore, most of the code must be submitted as bash scripts (.sh files), but the code can also be run locally by executing the Python files directly (see the submission example below).
The Data folder is large, so it is not included in the GitHub repository. Please download it from this link instead.
Before running any code, please install all the required Python packages using the following command: `pip install -r requirements.txt`
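For example, a minimal setup sketch (using a virtual environment is our suggestion, not a requirement of the package):

```bash
# Create and activate an isolated environment (optional), then install the dependencies.
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```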
- For RQ2, train and evaluate models by running `evaluate_models_openssl_rq2.sh` and `evaluate_models_ffmpeg_rq2.sh` (Slurm scripts).
- For RQ3, train and evaluate models by running `evaluate_models_cl_openssl_rq3.sh` and `evaluate_models_cl_ffmpeg_rq3.sh` (Confident Learning), `evaluate_models_cr_openssl_rq3.sh` and `evaluate_models_cr_ffmpeg_rq3.sh` (Centroid-based Removal), and `evaluate_models_dr_openssl_rq3.sh` and `evaluate_models_dr_ffmpeg_rq3.sh` (Domain-specific Removal).
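On a Slurm cluster, these scripts can be submitted as batch jobs; a minimal sketch, assuming you run it from the Code directory and that your cluster's default partition and resource limits are sufficient:

```bash
# Submit the RQ2 scripts as Slurm batch jobs.
sbatch evaluate_models_openssl_rq2.sh
sbatch evaluate_models_ffmpeg_rq2.sh

# Submit the RQ3 noise-reduction variants (Confident Learning shown here;
# the cr_* and dr_* scripts are submitted the same way).
sbatch evaluate_models_cl_openssl_rq3.sh
sbatch evaluate_models_cl_ffmpeg_rq3.sh

# For a local run without Slurm, execute the Python entry point invoked
# inside the corresponding .sh script directly (see the script for its
# exact name and arguments).
```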
Note that after these training/evaluation scripts finish, they generate output folders, i.e., `ml_results_rq2/` and `ml_results_rq3/`, containing the results (.csv files) for the RQ2 and RQ3 models, respectively. These .csv result files can then be used for analysis and comparison as described in the paper.
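To quickly check that the runs completed, the generated result files can be listed and previewed from the shell; a minimal sketch, assuming the output folders are created in the directory from which the scripts were run:

```bash
# List the generated result folders and preview the first lines of the RQ2 results.
ls ml_results_rq2/ ml_results_rq3/
head -n 5 ml_results_rq2/*.csv
```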
The file `ML-wise_Results.xlsx` contains the results of RQ2 and RQ3 for each Machine Learning (ML) algorithm based on the main MCC metric. Each result of an ML algorithm is aggregated across all feature extractors and all five test folds.