Code and datasets for our paper "Learning to Describe Solutions for Bug Reports Based on Developer Discussions" at Findings of ACL 2022.
If you find this work useful, please consider citing our paper:
```
@inproceedings{PanthaplackelETAL22BugReportDescription,
  author = {Panthaplackel, Sheena and Li, Junyi Jessy and Gligoric, Milos and Mooney, Raymond J.},
  title = {Learning to Describe Solutions for Bug Reports Based on Developer Discussions},
  booktitle = {Findings of ACL (Association for Computational Linguistics)},
  pages = {2935--2952},
  year = {2022},
}
```
Note that the code and data can only be used for research purposes. Running this code requires `torch==1.4.0+cu92` and `fairseq==0.10.2`; other libraries may need to be installed as well. This code base borrows code from the PLBART and Fairseq repositories.
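For reference, one possible way to install the pinned versions in a fresh Python environment (a sketch; the CUDA 9.2 wheel index shown here is an assumption and may need to match your setup):

```
pip install torch==1.4.0+cu92 -f https://download.pytorch.org/whl/torch_stable.html
pip install fairseq==0.10.2
```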
## Getting Started
- Clone the repository.
- Edit `constants.py` by specifying the `ROOT_DIR` (a sketch of the completed file appears after this list).
- Install sentencepiece and specify the path to `spm_encode` as `SPM_PATH` in `constants.py`.
- Download `sentencepiece.bpe.model` and `dict.txt` from here and specify the paths for `SENTENCE_PIECE_MODEL_PATH` and `PLBART_DICT` respectively in `constants.py`.
- Follow directions here to download `plbart-base.pt` and specify the path for `PLBART_CHECKPOINT` in `constants.py`.
- Download our dataset and saved models from here. Note that the primary dataset is located in `public_bug_report_data`.
- Create a directory for writing processed data, which will be referred to as `[PROCESSED_DATA_DIR]` in later steps.
- Create a directory for writing predicted output, which will be referred to as `[OUTPUT_DIR]` in later steps.
- Create a directory for writing a new model, which will be referred to as `[MODEL_DIR]` in later steps.
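For orientation, here is a minimal sketch of what a completed `constants.py` might look like. The constant names come from the steps above; every path shown is a placeholder, and the actual file may contain additional settings:

```python
# constants.py -- illustrative values only; replace each path with your own.
ROOT_DIR = "/home/user/bug-report-solutions"                 # repository root
SPM_PATH = "/home/user/sentencepiece/build/src/spm_encode"   # sentencepiece encoder binary
SENTENCE_PIECE_MODEL_PATH = "/home/user/plbart/sentencepiece.bpe.model"
PLBART_DICT = "/home/user/plbart/dict.txt"
PLBART_CHECKPOINT = "/home/user/plbart/plbart-base.pt"
```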
The commands below correspond to running training and inference on the full dataset. To use the filtered dataset, simply append the `--filtered` flag to any command.
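For example, inference with the finetuned PLBART generation model (from the next section) on the filtered dataset would be:

```
cd generation_models/
python3 plbart.py --test_mode --processed_data_dir=[PROCESSED_DATA_DIR] --model_dir=finetuned_plbart_generation/ --output_dir=[OUTPUT_DIR] --filtered
```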
## Running PLBART Generation Model
- To run inference on the finetuned PLBART generation model, run the following:
```
cd generation_models/
python3 plbart.py --test_mode --processed_data_dir=[PROCESSED_DATA_DIR] --model_dir=finetuned_plbart_generation/ --output_dir=[OUTPUT_DIR]
```
- To instead finetune the original PLBART checkpoint and then run inference, run the following commands:
```
cd generation_models/
python3 plbart.py --processed_data_dir=[PROCESSED_DATA_DIR] --model_dir=[MODEL_DIR]
python3 plbart.py --test_mode --processed_data_dir=[PROCESSED_DATA_DIR] --model_dir=[MODEL_DIR] --output_dir=[OUTPUT_DIR]
```
## Running Transformer Generation Models
- To train and evaluate a transformer-based seq2seq model (with a pointer network), run the following commands:
```
cd generation_models/
python3 transformer_seq2seq.py --model_path=[MODEL_DIR]/model.pkl.gz
python3 transformer_seq2seq.py --test_mode --model_path=[MODEL_DIR]/model.pkl.gz
```
- To train and evaluate a hierarchical transformer-based seq2seq model (with a pointer network), run the following commands:
```
cd generation_models/
python3 transformer_seq2seq.py --hierarchical --model_path=[MODEL_DIR]/hier_model.pkl.gz
python3 transformer_seq2seq.py --hierarchical --test_mode --model_path=[MODEL_DIR]/hier_model.pkl.gz
```
## Running Pipelined Combined System
- To run inference on the already finetuned system, run the following command:
```
cd combined_systems/
python3 pipelined_model.py --class_model_dir=finetuned_plbart_classification/ --gen_model_dir=finetuned_plbart_generation/ --output_dir=[OUTPUT_DIR] --processed_data_dir=[PROCESSED_DATA_DIR] --test_mode
```
- To instead finetune the original PLBART checkpoint, first finetune a generation model using Step #2 in the section titled "Running PLBART Generation Model." You can then train the classifier and run inference using the following commands:
```
cd combined_systems/
python3 pipelined_model.py --processed_data_dir=[PROCESSED_DATA_DIR] --class_model_dir=[MODEL_DIR] --gen_model_dir=[PATH TO SAVED GENERATION MODEL]
python3 pipelined_model.py --processed_data_dir=[PROCESSED_DATA_DIR] --class_model_dir=[MODEL_DIR] --gen_model_dir=[PATH TO SAVED GENERATION MODEL] --output_dir=[OUTPUT_DIR] --test_mode
```
## Running Jointly Trained Combined System
- To run inference on the already finetuned system, run the following command:
```
cd combined_systems/
python3 jointly_trained_model.py --processed_data_dir=[PROCESSED_DATA_DIR] --model_dir=finetuned_plbart_joint/ --output_dir=[OUTPUT_DIR] --test_mode
```
- To instead finetune the original PLBART checkpoint and run inference, use the following commands:
```
cd combined_systems/
python3 jointly_trained_model.py --processed_data_dir=[PROCESSED_DATA_DIR] --model_dir=[MODEL_DIR]
python3 jointly_trained_model.py --processed_data_dir=[PROCESSED_DATA_DIR] --model_dir=[MODEL_DIR] --output_dir=[OUTPUT_DIR] --test_mode
```
## Supplementary Data
We have provided additional data in the `supplementary_data` directory. In our work, we only consider in-lined code snippets and exclude longer code snippets, which are marked with markdown tags. However, this information is included in the raw data at `supplementary_data/bugs/single_code_change` (see the `anonymized_raw_text` field). Next, although we only consider bug-related issue reports and those associated with a single commit message/PR title, we have included the raw data for non-bug reports (`supplementary_data/nonbugs/`) and for reports with multiple sets of code changes/descriptions (`supplementary_data/bugs/multi_code_change` and `supplementary_data/nonbugs/multi_code_change`).
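As a quick way to inspect these records, here is a hedged sketch that assumes the supplementary files are stored as JSON Lines (one JSON object per line) with the `anonymized_raw_text` field mentioned above; the file name `data.jsonl` is hypothetical, so adjust the path and parsing to match the actual layout of your download:

```python
import json
import os

# Hypothetical path and file name; replace with the actual file in the download.
path = os.path.join("supplementary_data", "bugs", "single_code_change", "data.jsonl")

with open(path) as f:
    for line in f:
        record = json.loads(line)
        # `anonymized_raw_text` retains the markdown-tagged code blocks
        # that the primary dataset excludes.
        print(record.get("anonymized_raw_text", "")[:200])
        break  # inspect just the first record
```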