An Extensible Automated Benchmarking Framework for Static Analysis Tool Evaluation Using Open-Source Code Commits
For the paper, please see An Empirical Study of Static Analysis Tools for Secure Code Review.
This automated framework is developed to facilitate large-scale experiments of C/C++ Static Analysis Tools (SASTs) on a target set of open-source code commits (i.e., GitHub commits). The framework consists of two major parts: the central database and the runners. Each runner is responsible for running an experiment trial in the pipeline, i.e., executing a selected SAST on a set of target commits and storing the warnings or reports in a folder on the host machine. The central database collects the execution results (i.e., success, failed, checkedout_failed, or timeout) and start-end timestamps of each commit from each trial.
In the current version, five widely-used C/C++ SASTs are integrated: CodeChecker, CodeQL, Cppcheck, Flawfinder, and Infer. Users can freely add new SASTs of choice or modify the existing framework components by following the instructions in this section. Technically, the framework should also support SASTs and target commits of other programming languages (including commits hosted on other platforms), as long as the commit is accessible via a URL and the SAST can run from the command line. However, this has neither been implemented nor tested. Users are encouraged to review the known limitations listed in this section prior to using the framework.
With this framework, users can conduct various SAST experiment trials on a set of target commits and gather the results for further analyses, such as the effectiveness of SASTs in detecting security issues early in the development process. All elements of the framework operate in a Docker environment for portability and scalability. To modify the runner's resources, such as allocated memory or CPU cores, see this section. To run a trial, users must prepare the list of target commits as described in this section, then start the trial following the instructions in this section.
We aim to improve this framework into a complete SAST benchmark for evaluating the performance of SAST-under-test on the target commits that contain certain issues. To do so, we are working toward accomplishing the following tasks.
- Define a standard format for test oracle of the target commits i.e., a) buggy files, buggy functions, or buggy lines in target commits and b) expected issue types such as CWE number
- Attach the analysis extension to systematically analyze and compare SAST performance from multiple trials, i.e., detection effectiveness and computation time
- Implement a reporting module that automatically translates and visualizes the analysis results
Build Central Database Image
Build Runner Image (Manage Target Commit Dependencies)
=======================
Start Central Database --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------->
Start Runner 1 -- ^
\ |
--> Install SAST -- |
\ Execution Result
--> Read Target Commits & Loop -- |
\ |
--> Clone Project & Checkout Commit -- |
^ \ |
|   --> Pre-compilation (if required) / Execute --> --> Store warning files output/[trial_name]
| |
--------------------------------------------------------------------------------------
Start Runner 2 ....
The framework was tested on macOS 14.1.1 (M1 chip) and Ubuntu 20.04. Make sure that the following software is installed and working properly:
- Docker Engine version 24.0.6 or newer
- Or Docker Desktop version 4.25.1 or newer (this includes Docker Engine)
The runner and central database use the following Docker base images:
- Runner: ubuntu:22.04
- Central Database: postgres (latest version)
The status of each execution is stored in the central PostgreSQL database. To start the central database container, run the following command in root folder:
docker-compose -f docker/database.yml up -d

Note that the database is independent of the runners; it can support multiple runners simultaneously. To view the executed transactions in the database, use the following command to access the psql console:
sudo docker exec -it auto-tool-investigation-db psql -U postgres

Then, enter the default password for the PostgreSQL database: postgresql. Select the target database using the following command before running SQL queries.
\c auto_tool_investigation_db
select * from execution_log;

The list of target code commits for SAST analysis is stored in a CSV file in the following format.
project,cve,cwe,hash
memcached/memcached,CVE-2010-1152,707,32f382b605b4565bddfae5c9d082af3bfd30cf02
php/php-src,CVE-2016-3132,"710,664",f0168baecfb27e104e46fe914e5b5b6507ba8214
Column definition:
- project: GitHub project identifier e.g., git/git or php/php-src
- cve: CVE number of the commit, can be a dummy value if not available
- cwe: CWE number that shows the vulnerability of commit, can be a dummy value if not available
- hash: a valid commit identifier (commit SHA) that can be checked out
The sample target commit input file can be found here.
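The format above can be read with Python's csv module, which correctly handles the quoted multi-CWE field; here is a minimal standalone sketch (the clone URL construction is an assumption about how commits would be fetched):

```python
import csv
import io

# Sample rows in the target-commit format described above
sample = """project,cve,cwe,hash
memcached/memcached,CVE-2010-1152,707,32f382b605b4565bddfae5c9d082af3bfd30cf02
php/php-src,CVE-2016-3132,"710,664",f0168baecfb27e104e46fe914e5b5b6507ba8214
"""

# DictReader keeps the quoted multi-CWE value ("710,664") as one field
rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    clone_url = f"https://github.com/{row['project']}.git"  # commit must be reachable via a URL
    cwes = row["cwe"].split(",")  # a commit may map to several CWE numbers
    print(clone_url, row["hash"][:7], cwes)
```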
In the root directory, run the following command with five parameters:
bash ./start-execution.sh [runner_type] [tool_name] [instance_name] [input_csv_in_data_ref_folder] [record_data:yes|no]

- runner_type: select the type of runner that is suitable for the selected SAST.
  - custom: has all the dependencies (see this section) installed to support SASTs that require compilation of the test subject
  - clean: has only fundamental dependencies to run SASTs that do not require compilation
- tool_name: the name of a SAST integrated in the framework; by default, five SASTs are available: codechecker, codeql, cppcheck, infer, and flawfinder.
- instance_name: a unique name that identifies the trial. This name will be used in the central database and as the result folder name.
- input_csv_in_data_ref_folder: file name of a CSV file (with .csv extension) in folder data-ref that contains the list of target commits (see this section).
- record_data:yes|no: whether to record execution results in the database. In the case of no, the runner operates without communicating with the central database.
For example:
# CodeQL trial with database record
bash ./start-execution.sh custom codeql codeql-trial1 test-selected-commit.csv yes
# Flawfinder trial without database record
bash ./start-execution.sh clean flawfinder flawfinder-trial1 test-selected-commit.csv no

The SAST warnings or outputs from each trial are stored in the folder ./output/[instance_name], in the original format that each tool produces, which can be configured following the instructions in this section.
To monitor the runner, use following command to print real-time logs:
docker container logs -f [instance_name]

The framework is designed for flexibility. Users can modify various components of the framework to meet specific requirements.
Allocated resources for the runner can be modified in start-execution.sh in the following command:
docker run -d --network=host \
--cpus=4 \
--name $INSTANCE_NAME \
--log-driver json-file --log-opt max-size=5m --log-opt max-file=10 \
  ...

The available parameters can be seen here.
For the runner of type custom, which supports SASTs that compile the target commits during execution, the required dependencies of the target commits must be installed in the container to enable the compilation process. These dependencies are listed in the Dockerfile at ./script/tool/Dockerfile-Base.
Users can add or remove dependencies in the following command:
RUN apt-get -y update && \
apt-get -y install --no-install-recommends sudo \
apt-utils \
build-essential \
openssl \
clang \
clang-tidy \
git \
autoconf \
libgnutls28-dev \
...
Delete the Docker containers (existing runners) and image; then rebuild the image (or simply start a new trial) to let the changes take effect.
Currently, the framework only supports target commits that use the GNU Automake build process. For each commit, ./script/tool/pre-compile.sh is run prior to SAST execution if the selected SAST requires compilation. This script can be modified to cover other build processes such as CMake or Bazel.
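The script-selection logic just described can be sketched in Python. This is an illustrative rendering only; the actual logic lives in the shell script ./script/tool/pre-compile.sh, and details such as check=False are assumptions:

```python
import os
import subprocess

def pick_bootstrap_script(project_dir: str):
    """Return the first build-configuration generator present,
    mirroring the order described above."""
    for name in ("autogen.sh", "build.sh", "bootstrap.sh"):
        candidate = os.path.join(project_dir, name)
        if os.path.isfile(candidate):
            return candidate
    return None

def pre_compile(project_dir: str) -> None:
    """Generate build configuration files, then run configure (GNU Automake flow)."""
    script = pick_bootstrap_script(project_dir)
    if script is not None:
        subprocess.run(["bash", script], cwd=project_dir, check=False)
    configure = os.path.join(project_dir, "configure")
    if os.path.isfile(configure):
        subprocess.run(["bash", configure], cwd=project_dir, check=False)
```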
All shell commands in the pipeline are executed through the function run_command in ./script/util.py. Modify the following call to set a new time limit:
run(..., timeout=60 * 60 * 5, ...)

By default, the time limit of a command is 5 hours. Note that the pre-compilation process may also cause a timeout failure if it takes longer than the limit.
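How a timeout surfaces can be illustrated with Python's subprocess module directly; this standalone sketch mirrors the behavior described above, though the real run_command wrapper lives in ./script/util.py:

```python
import subprocess

TIME_LIMIT = 60 * 60 * 5  # 5 hours, matching the default above

def run_with_limit(cmd, timeout=TIME_LIMIT):
    """Run a command; return 'timeout' when the limit is exceeded,
    otherwise the command's return code."""
    try:
        completed = subprocess.run(cmd, timeout=timeout)
        return completed.returncode
    except subprocess.TimeoutExpired:
        # would be recorded as a 'timeout' execution result in the central database
        return "timeout"

# Example: a command that sleeps longer than a 1-second limit
result = run_with_limit(["sleep", "2"], timeout=1)
```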
The framework can recognize an integrated SAST by looking up the folders inside ./script/tool/. The most convenient approach to add a new SAST is to duplicate the template folder ./script/tool/sat_template and rename it for the new tool. Each SAST has two required components: entrypoint.sh (for installing the tool inside the runner and starting the pipeline) and handle.py (for pipeline interactions such as the execution commands and pre/post execution commands).
The entrypoint script is the first bash script executed inside the runner when the container is started. It prepares the SAST inside the container by installing the selected version of the SAST in the runner and ensuring that the SAST can be launched from any path. As the entrypoint is called by start-execution.sh, it should accept two parameters: 1) instance_name and 2) input_file. The sample structure of an entrypoint can be seen in ./script/tool/sat_template/entrypoint.sh.
The tool handle manages how the pipeline interacts with each tool. To streamline the interactions, each tool handle is a child class of the base class in script/tool/ToolClass.py. See the following snippet for the docstrings of the mandatory functions.
def check_readiness(self) -> None:
"""
Future work
Params: None
Returns: None
"""
...
def get_tool_type(self) -> str:
"""
Constant function to indicate the type of tool
Params: None
Returns: string "SAST"
"""
...
def is_compilation_required(self) -> bool:
"""
Inform pipeline whether the pre-analysis step is needed
Params: None
Returns: boolean
"""
...
def get_supported_languages(self) -> list:
"""
Tool's information on the supported languages
Params: None
Returns: list
"""
...
def get_result_location(self) -> str:
"""
Future work
Params: None
Returns: string
"""
...
def count_result(self, output_filename: str) -> int:
"""
Called after the execution to count the warnings on each commit for a quick execution summary
Params: output_filename of the commit as specified by the pipeline
Returns: integer
"""
...
def get_pre_analysis_commands(self) -> list:
"""
Mandatory commands that must be run before the actual SAST execution i.e., the pre-compilation script
Params: None
Returns: list
"""
...
def does_analysis_use_shell(self) -> bool:
"""
Passing Shell flag to python subprocess command https://docs.python.org/3/library/subprocess.html#subprocess.run
Params: None
Returns: bool
"""
...
def get_analysis_commands(self, output_filename: str) -> list:
"""
Main SAST execution commands, either a single or multiple sets of commands
Params: output_filename of the commit as specified by the pipeline
Returns: list
"""
...
def get_expected_analysis_commands_return_codes(self) -> list:
"""
Normally a successful execution returns code 0, but some SASTs may return different codes
Params: None
Returns: list of return codes that are considered successful for the command; when the list is empty, the pipeline expects return code 0
"""
...
def get_post_analysis_commands(self, output_filename: str) -> list:
"""
Final commands that should be run after the main SAST execution, for example, deleting unnecessary outputs that may take up disk space. Note that the pipeline already takes care of the commit checkout process; this function should not manage commit cleanup.
Params: output_filename of the commit as specified by the pipeline
Returns: list
"""
...
def get_transaction_result(self, output_filename: str) -> list:
"""
Get the list of warnings in a common format, containing the essential information i.e., location_hash, location_file, location_start_line, location_start_column, location_end_line, location_end_column, warning_rule_id, warning_rule_name, warning_message, warning_weakness, and warning_severity. This function should read and extract information from the warning file and prepare SASTResult objects with all necessary information, especially warning_weakness and warning_severity, mapped.
Params: output_filename of the commit as specified by the pipeline
Returns: list of objects in SASTResult class
"""
...

When a new handle is created, it must be added to ./script/tool/factory.py so that the pipeline can instantiate the tool handle correctly.
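To make the interface concrete, here is a hypothetical minimal handle for a fictional tool named demotool; ToolBase is a stub standing in for the real base class in script/tool/ToolClass.py, and every command and file name is illustrative:

```python
class ToolBase:
    """Stub standing in for the framework's ToolClass base class."""

class DemoToolHandle(ToolBase):
    """Hypothetical handle for a tool that needs no compilation."""

    def get_tool_type(self) -> str:
        return "SAST"

    def is_compilation_required(self) -> bool:
        return False

    def get_supported_languages(self) -> list:
        return ["C", "C++"]

    def get_analysis_commands(self, output_filename: str) -> list:
        # a single set of commands, run inside the subject folder
        return [["demotool", ".", f"--output=../output/{output_filename}.txt"]]

    def get_expected_analysis_commands_return_codes(self) -> list:
        return []  # empty: the pipeline expects return code 0

handle = DemoToolHandle()
```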
The tool handle (./script/tool/tool_name/handle.py) controls the parameters used in SAST execution, which differ across SASTs, for instance the build commands, the output format and location, and the activated checking rules. These parameters of the existing tools can be modified; see the following examples:
Single step (Cppcheck):
def get_analysis_commands(self, output_filename: str) -> list:
# these commands are to be called inside subject folder
return [
[
"cppcheck",
"-j 1",
".",
"--xml",
f"--output-file=../output/{output_filename}.xml",
],
    ]

Multiple steps (CodeQL):
def get_analysis_commands(self, output_filename: str) -> list:
# these commands are to be called inside subject folder
return [
[
"codeql",
"database",
"create",
"../temp",
"--language=cpp",
"--command=make -j1 -i", # create database with make command
"--source-root=./",
],
[
"codeql",
"database",
"analyze",
"--format=sarif-latest", # set output format
f"--output=../output/{output_filename}.sarif",
"../temp",
"codeql/cpp-queries:codeql-suites/cpp-lgtm-full.qls", # select query suite
],
    ]

Note that, by default, the framework executes tools with a single job (-j 1) and does not activate or deactivate any rules beyond what each tool initially enables.
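As an illustration of how a handle's count_result might inspect a tool's native output, the following standalone sketch counts warnings in a Cppcheck XML report (the element layout assumes Cppcheck's XML version 2 format; verify against your tool's actual output):

```python
import xml.etree.ElementTree as ET

# A trimmed example of Cppcheck's XML (version 2) report
sample_report = """<results version="2">
  <cppcheck version="2.9"/>
  <errors>
    <error id="nullPointer" severity="error" msg="Null pointer dereference"/>
    <error id="uninitvar" severity="error" msg="Uninitialized variable: x"/>
  </errors>
</results>"""

def count_warnings(report_text: str) -> int:
    """Count warnings in a Cppcheck XML report, one per <error> element."""
    root = ET.fromstring(report_text)
    return len(root.findall("./errors/error"))

print(count_warnings(sample_report))
```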
Despite being designed for large-scale experiments, the framework has some noteworthy technical limitations that users should understand.
- Runner's Resources: Some SASTs may terminate execution abruptly if the allocated resources, e.g., disk space on the host machine or memory, are insufficient. The resource requirements depend on the SAST settings; for example, CodeQL has a soft limit on memory. It is recommended that users allocate sufficient resources and monitor the execution in order to adjust the resource allocation when issues occur.
- Potentially Incompatible SAST Configurations on Certain Systems: In particular, we encountered errors when running Infer with more than one job (-j2, -j3, ...) on macOS (M1 and M2). This issue does not occur on other host systems such as Ubuntu. Therefore, all SASTs in our framework run with -j1 by default to avoid this issue. Please see this section to modify the configuration, if necessary. Additionally, there may be configuration issues in other tools that we have not yet discovered.
- Unique Pre-Compilation Steps: Our pre-compilation step (see this section) supports most C/C++ projects that use the typical build process. Specifically, the script executes either autogen.sh, build.sh, or bootstrap.sh to generate build configuration files, then executes configure to prepare for the build process. This pre-compilation process should be modified if the target commits follow alternative approaches, such as having the build configuration generation file under another name.
- Unlisted Target Commit Dependencies: Although many frequently used packages are installed in the runner's base image, some target commits from other projects may require packages that are not included. Users should monitor the execution and review the error messages when an execution fails during the compilation or pre-compilation steps. Target commit dependencies can be managed following the instructions in this section.
- Flawfinder Source Code Encoding: Flawfinder requires all source files to be encoded in UTF-8. Thus, we used the ftfy Python package to convert source code files to UTF-8.
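The conversion can be sketched with the standard library alone; note that the framework itself uses ftfy (whose fix_text additionally repairs mojibake), and the Latin-1 fallback policy here is an assumption for illustration:

```python
from pathlib import Path

def reencode_to_utf8(path: Path) -> None:
    """Rewrite a source file as UTF-8. Files that are not valid UTF-8
    are decoded as Latin-1 (which accepts any byte sequence) first."""
    raw = path.read_bytes()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("latin-1")  # ftfy would additionally repair mojibake here
    path.write_bytes(text.encode("utf-8"))
```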
We gratefully acknowledge the support of our colleagues who reported these limitations and provided valuable feedback to improve this framework.
Please consider citing this publication if you use the dataset or code from this repository.
@inproceedings{10.1145/3650212.3680313,
author = {Charoenwet, Wachiraphan and Thongtanunam, Patanamon and Pham, Van-Thuan and Treude, Christoph},
title = {An Empirical Study of Static Analysis Tools for Secure Code Review},
year = {2024},
isbn = {9798400706127},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3650212.3680313},
doi = {10.1145/3650212.3680313},
booktitle = {Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis},
pages = {691–703},
keywords = {Code Changes Prioritization, Code Review, Static Application Security Testing Tool},
}