GumTreeDiff/datasets

Diff datasets

A collection of diff datasets. It contains:

  • GitHub Java is a Java dataset containing 1000 commits from 10 popular projects.
  • GitHub Python is a Python dataset containing 1000 commits from 10 popular projects.
  • Defects4J is a Java dataset of bug fixes used in the program repair community.
  • BugsInPy is a Python dataset of bug fixes used in the program repair community.

All datasets share the same layout. The before folders contain the files as they were before the modification, and the after folders contain the files after it. Inside each before and after folder there is one folder per project, which in turn contains one folder per commit. Commit folder names are identical in the before and after folders. The unparsable folder contains the commits from the datasets above for which at least one of the files could not be parsed.
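
The before/after layout can be walked with a few lines of Python to get file pairs ready for diffing. The following is a minimal sketch, not one of the provided scripts: the function name `iter_file_pairs` is hypothetical, and it assumes that a modified file has the same relative path under before and after, which the description above does not state explicitly.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_file_pairs(dataset_root: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (before_file, after_file) pairs for one dataset.

    Expected layout:
        <dataset_root>/before/<project>/<commit>/...
        <dataset_root>/after/<project>/<commit>/...
    Assumption (not confirmed by the README): a modified file keeps the
    same relative path in the before and after trees.
    """
    before_root = dataset_root / "before"
    after_root = dataset_root / "after"
    for before_file in sorted(before_root.rglob("*")):
        if not before_file.is_file():
            continue  # skip project and commit directories
        # Map the path onto the after tree (same project, same commit).
        after_file = after_root / before_file.relative_to(before_root)
        if after_file.is_file():
            yield before_file, after_file
```

Calling `iter_file_pairs(Path("path/to/dataset"))` on one dataset folder yields one pair per file present on both sides; each pair can then be handed to a diff tool. Files that exist on only one side are skipped in this sketch.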

The Python scripts used to produce the datasets are also provided.
