D2A Dataset and Generation Pipeline

This repository hosts the tools that generated the D2A dataset available on IBM Data Asset eXchange, and welcomes contributions to them.

Introduction

D2A is a differential-analysis-based approach that labels issues reported by inter-procedural static analyzers as either more likely to be true positives or more likely to be false positives. Our goal is to generate a large labeled dataset that can be used by machine learning approaches to code understanding and vulnerability detection.

Why D2A?

Because programs can exhibit diverse behaviors, training machine learning models for code understanding and vulnerability detection requires large datasets. However, according to a recent survey, the lack of good real-world datasets has become a major barrier for this field. Many existing works constructed their own datasets based on different criteria and may release only partial datasets. The following table summarizes the characteristics of a few popular datasets for software vulnerability detection tasks.

Because no oracle exists, there is no perfect dataset that is both large enough and 100% correctly labeled for AI-based vulnerability detection tasks. Datasets generated from manual reviews generally have better-quality labels, but by their nature they are usually not large enough for model training. On the other hand, although the quality of the D2A dataset is bounded by the capability of static analysis, D2A can produce large datasets with better labels than those labeled solely by static analysis, and can complement existing manually labeled datasets.

Differential Analysis and D2A Dataset Generation Pipeline

Intuition

For projects with commit histories, we assume some commits are code changes that fix bugs. We run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that got fixed by the commit. If we analyze a large number of consecutive version pairs and aggregate the results, some issues found in a before-commit version never disappear in an after-commit version. We say they are not very likely to be real bugs because they were never fixed. Then, we de-duplicate the issues found in all versions and adjust their classifications according to the commit history. Finally, we label the issues that are very likely to be real bugs as positives and the remaining ones as negatives.
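The core of this intuition can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the pipeline's actual implementation: issues are represented as already de-duplicated hashable identifiers, `label_issues` and the sample data are hypothetical names, and the later adjustment based on commit history is omitted.

```python
def label_issues(version_pairs):
    """Label each issue 1 if it disappeared after some commit, else 0.

    version_pairs: iterable of (before_issues, after_issues) tuples, where
    each element is a set of hashable issue identifiers. This representation
    is an assumption made for the sketch.
    """
    labels = {}
    for before, after in version_pairs:
        for issue in before:
            disappeared = issue not in after
            # Once an issue has been seen to disappear, it stays positive.
            if disappeared or labels.get(issue) == 1:
                labels[issue] = 1
            else:
                labels[issue] = 0
    return labels


# Hypothetical data: issue "A" disappears after the first commit,
# while "B" and "C" are never fixed.
pairs = [
    ({"A", "B"}, {"B"}),
    ({"B", "C"}, {"B", "C"}),
]
labels = label_issues(pairs)
print(labels)
```

In this toy run, "A" ends up labeled 1 (likely a real bug that was fixed) and "B" and "C" end up labeled 0.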

Components

The following figure shows the overview of the D2A dataset generation pipeline.

Commit Message Analysis (scripts/infer_pipeline/commit_msg_analyzer) analyzes the commit messages and identifies the commits that are more likely to refer to vulnerability fixes.
Pairwise Static Analysis (scripts/infer_pipeline) runs the analyzer on the before-commit and after-commit versions for the commit hashes selected in the previous step.
Auto-labeler (scripts/auto_labeler) merges the analysis results for all selected commit versions and labels each issue based on differential logic and commit history heuristics.
Function Extractor (scripts/dataset_generator) extracts the bodies of the functions involved in the trace.
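As a rough illustration of what the pairwise step operates on, the sketch below turns a list of selected commit hashes into (before, after) revision pairs. It assumes the before-commit version is the commit's first parent, written `<hash>^` in git revision syntax; how the actual scripts resolve versions is not described here, and `version_pairs` and the hashes are hypothetical.

```python
def version_pairs(commit_hashes):
    """Return a (before, after) git revision pair for each selected commit.

    Assumption for this sketch: the before-commit version is the first
    parent of the commit, which git spells as "<hash>^".
    """
    return [(f"{h}^", h) for h in commit_hashes]


# Placeholder hashes, not taken from any real project.
selected = ["abc1234", "def5678"]
pairs = version_pairs(selected)
for before, after in pairs:
    print(f"analyze {before} and {after}")
```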

Sample Types

There are two types of samples:

Samples based on static analyzer outputs. Such samples are reported by the static analyzer, so they all include analyzer outputs such as bug traces. We extract the functions mentioned in the trace together with other information. We use "label_source": "auto_labeler" to denote such samples. The labels can be 0 (e.g., auto-labeler_0.json) or 1 (e.g., auto-labeler_1.json) according to the auto-labeler. Please refer to Sec. III-C in the D2A paper for details.
Samples from the fixed versions. Such samples are not reported by the analyzer, so they are not generated directly from static analysis outputs and do not contain them. Instead, given samples with positive auto-labeler labels (i.e., label_source == "auto_labeler" && label == 1) found in the before-fix version, we extract the corresponding functions in the after-fix version and label them 0 (e.g., after_fix_0.json). We use "label_source": "after_fix_extractor" to denote such samples. More information can be found in Sec. III-D in the D2A paper.
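The two sample types can be told apart by the "label_source" and "label" fields named above. The following is a minimal sketch, assuming each sample has already been loaded as a Python dict; only these two fields come from the description above, real samples carry more fields, and `group_samples` and the toy records are hypothetical.

```python
def group_samples(samples):
    """Group samples by (label_source, label)."""
    groups = {}
    for sample in samples:
        key = (sample["label_source"], sample["label"])
        groups.setdefault(key, []).append(sample)
    return groups


# Toy records containing only the two fields discussed above.
samples = [
    {"label_source": "auto_labeler", "label": 1},
    {"label_source": "auto_labeler", "label": 0},
    {"label_source": "after_fix_extractor", "label": 0},
]
groups = group_samples(samples)

# Positive auto-labeler samples are the ones whose after-fix
# counterparts are extracted and labeled 0.
positives = groups.get(("auto_labeler", 1), [])
print(len(positives))
```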

Downloading the D2A Dataset and the Splits

The D2A dataset and the global splits can be downloaded from IBM Data Asset eXchange.

The latest version: v1.0.0.

Sample Description and Dataset Stats

Details can be found in Sample Description and Dataset Stats.

Using the Dataset

Please refer to Dataset Usage Examples for details.

D2A Leaderboard

Leaderboard

More details

Annotating More Projects

Please refer to Running the Dataset Generation Pipeline.

D2A Paper and Citation

Paper

D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis

[arXiv] [ICSE-SEIP'21]

Citation

Please cite the following paper if the D2A dataset or generation pipeline is useful for your research.

@inproceedings{D2A,
  author = {Zheng, Yunhui and Pujar, Saurabh and Lewis, Burn and Buratti, Luca and Epstein, Edward and Yang, Bo and Laredo, Jim and Morari, Alessandro and Su, Zhong},
  title = {D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis},
  year = {2021},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  series = {ICSE-SEIP '21},
  booktitle = {Proceedings of the ACM/IEEE 43rd International Conference on Software Engineering: Software Engineering in Practice}
}
