We introduce FixEval, a dataset for competitive programming bug fixing along with a comprehensive test suite, and show the necessity of execution-based evaluation compared to suboptimal match-based evaluation metrics such as BLEU, CodeBLEU, Syntax Match, and Exact Match.

Abstract:

Source code repositories consist of large codebases, often containing error-prone programs. The increasing complexity of software has led to a drastic rise in the time and cost of identifying and fixing these defects. Various methods exist to automatically generate fixes for buggy code. However, due to the large combinatorial space of possible solutions for a particular bug, there are not many tools and datasets available to evaluate generated code effectively. In this work, we introduce FixEval, a benchmark comprising buggy code submissions to competitive programming problems and their respective fixes. We introduce a rich test suite to evaluate and assess the correctness of model-generated program fixes. We consider two Transformer language models pretrained on programming languages as our baselines, and compare them using match-based and execution-based evaluation metrics. Our experiments show that match-based metrics do not reflect model-generated program fixes accurately, while execution-based methods evaluate programs through all cases and scenarios specifically designed for that solution. Therefore, we believe FixEval provides a step towards real-world automatic bug fixing and model-generated code evaluation.


Folder Structure


├── codet5
│   ├── run.sh 
│   ├── configs.py
│   ├── models.py
│   ├── run_gen.py
│   └── ...
│ 
├── plbart
│   ├── run.sh 
│   ├── configs.py
│   ├── models.py
│   ├── run_gen.py
│   └── ...
│
├── data
│   ├── java
│   │    ├──jsons
│   │    ├──processed
│   ├── python
│   │    ├──jsons
│   │    ├──processed
│   ├── atcoder_test_cases
│   └── processed.json
│
├── third_party
│   ├── apex
│   ├── fairseq
│   ├── tree-sitter-cpp
│   ├── tree-sitter-java
│   └── tree-sitter-python
│
├── evaluation
│   ├── CodeBLEU 
│   ├── codegen 
│   ├── bleu.py
│   ├── compile.py
│   ├── compute_ca.py
│   ├── evaluator.py
│   ├── execution_evaluation_TC_arc_MP.py
│   └── ...
│
└── src
    ├── 01_preprocessing.ipynb
    ├── make_submission_list_json.py
    ├── process_json.py
    ├── deduplication.py
    ├── generate_eval_files.py
    ├── merge.py
    ├── split.py
    └── ...

Dataset

All data for reproducing the results is available here:

https://drive.google.com/drive/folders/1dzuHuouuWzlFCy1CMj9DYG9JGraEay27?usp=sharing

Run the following commands in the root folder.

Download Project CodeNet Dataset (Skip this if you want to run from our preprocessed files)

Run this command to download the whole CodeNet dataset (an archive of around 8 GB) into the root directory and decompress it.

wget https://dax-cdn.cdn.appdomain.cloud/dax-project-codenet/1.0.0/Project_CodeNet.tar.gz
tar -xf Project_CodeNet.tar.gz

Download CodeNet Metadata

Run this command to download the CodeNet metadata (a 281 MB archive) into the root directory and decompress it.

wget https://dax-cdn.cdn.appdomain.cloud/dax-project-codenet/1.0.0/Project_CodeNet_metadata.tar.gz
tar -xf Project_CodeNet_metadata.tar.gz

Download Test Cases

Make the data folder, which stores the test cases along with the Java and Python data files, and cd into it before running the commands below (they finish with cd ../ to return to the root):

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1AInTHzaZqym7WsT1B7yc8nZy7dA3ovPf' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1AInTHzaZqym7WsT1B7yc8nZy7dA3ovPf" -O atcoder_test_cases.zip && rm -rf /tmp/cookies.txt
unzip atcoder_test_cases.zip
cd ../

Installation

The preferred installation method is to run this command (You may need to change the bash file to update the environment names, etc.):

bash install_env.sh

Another method is to run the following (You may need to manually add some libraries):

conda env create -n python36 -f src/environment.yml
conda activate python36

All the commands below assume that you installed everything in this environment correctly and activated the environment.

Pre-processing (Skip this if you want to run from our preprocessed files)

src/make_submission_list_json.py parses the problem submission information, the problem list CSV, and the folder of actual submission files to create an initial json file, processed.json, with the following format:

  • processed is a dictionary keyed by user_id, so processed.keys() lists all users.
  • processed['user_id'] is a dictionary keyed by the problem_id's attempted by that user.
  • processed['user_id']['problem_id'] is a list of tuples, one per submission, of the form (submission_id, date, language, original_language, filename_ext, status).
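
For illustration, the resulting structure looks roughly like this (all identifiers and values below are hypothetical):

processed = {
    "u123456": {                                    # user_id
        "p00001": [                                 # problem_id attempted by this user
            # one tuple per submission:
            # (submission_id, date, language, original_language, filename_ext, status)
            ("s000000001", "1480000000", "Python", "Python (3.4.3)", "py", "Wrong Answer"),
            ("s000000002", "1480000100", "Python", "Python (3.4.3)", "py", "Accepted"),
        ],
    },
}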

To create this file, use the following script (you may need to change the path information):

cd src
python make_submission_list_json.py
cd ../

If any file is missing, such as "my_languages.so", please check the folder linked below; if it is not there, please create an issue and we will make it available as soon as possible.

https://drive.google.com/drive/folders/1dzuHuouuWzlFCy1CMj9DYG9JGraEay27?usp=sharing

Create Language Specific Data (Skip this part if you just want to download our version)

We use the processed.json file to create the training data chunk by chunk (10k datapoints per file) and store the chunks in the data folder for each programming language. The following command preprocesses both the Java and Python data and stores it in json format under data/{language}/jsons/.

cd src
python process_json.py
cd ../
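
For intuition, the chunking amounts to something like the following sketch (folder and file names are illustrative; process_json.py is the authoritative implementation):

import json
import os

CHUNK_SIZE = 10_000  # 10k records per output file, as described above

def write_chunks(records, out_dir):
    """Write a list of per-submission records into numbered json chunk files."""
    os.makedirs(out_dir, exist_ok=True)
    for i in range(0, len(records), CHUNK_SIZE):
        chunk = records[i:i + CHUNK_SIZE]
        with open(os.path.join(out_dir, f"chunk_{i // CHUNK_SIZE}.json"), "w") as f:
            json.dump(chunk, f)

# e.g. write_chunks(python_records, "data/python/jsons")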

Alternatively, you can download the processed.json file, which is the root file for all data generation and processing:

cd data/
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1gxZYObARqJytI9gf6gEX-CZhCpc4JPE6' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1gxZYObARqJytI9gf6gEX-CZhCpc4JPE6" -O processed.zip && rm -rf /tmp/cookies.txt
unzip processed.zip
cd ../

Split The Data (Skip this if you want to continue from our preprocessed files)

split.py merges all the json chunks, deduplicates them using a Jaccard similarity function, and splits the data into train-valid-test sets with an 80-10-10 ratio. The split is done at the problem level so that datapoints for a single problem never appear in multiple splits, such as train and test. During the split, we also maintain the condition that test cases are available for all datapoints in the valid and test sets, so that execution-based evaluation can be performed on both.

cd src
python split.py 
python split.py --lang py --src_file ../data/Python/jsons/ --src_dir ../data/Python/processed/ --out_dir ../data/Python/processed/
cd ../
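
Conceptually, the deduplication and the problem-level split work along the lines of the sketch below (a simplified illustration; the similarity threshold is hypothetical, and split.py is the actual implementation, which also enforces the test-case availability constraint for valid and test):

import random

def jaccard(a_tokens, b_tokens):
    """Jaccard similarity between two token sets."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / max(len(a | b), 1)

def is_near_duplicate(code, kept_codes, threshold=0.9):
    """Flag a submission whose token set overlaps heavily with an already kept one."""
    return any(jaccard(code.split(), other.split()) >= threshold for other in kept_codes)

def problem_level_split(problem_ids, seed=1234):
    """80-10-10 split over problems, so no problem appears in two splits."""
    problems = sorted(problem_ids)
    random.Random(seed).shuffle(problems)
    n = len(problems)
    train = set(problems[: int(0.8 * n)])
    valid = set(problems[int(0.8 * n): int(0.9 * n)])
    test = set(problems[int(0.9 * n):])
    return train, valid, test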

Download Preprocessed Data

Run the following commands if you want to download the processed data and train:

Download and unzip our preprocessed Java dataset

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1vsuUrJ2j86EYGb2WWQatqsqJ-V8Sl6en' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1vsuUrJ2j86EYGb2WWQatqsqJ-V8Sl6en" -O java.zip && rm -rf /tmp/cookies.txt
unzip java.zip

Download and unzip our preprocessed Python dataset

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1rjjYW8SB8f5Hr34ig84OKpNYOzdt03Ar' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1rjjYW8SB8f5Hr34ig84OKpNYOzdt03Ar" -O python.zip && rm -rf /tmp/cookies.txt
unzip python.zip

After successful completion, this step yields 4 datasets:

  • Java buggy code to Java fixed code (data/java/processed/)
  • Java buggy code with verdict information to Java fixed code (data/java/processed_with_verdict/)
  • Python buggy code to Python fixed code (data/python/processed/)
  • Python buggy code with verdict information to Python fixed code (data/python/processed_with_verdict/)

Each of these 4 directories contains:

  • {train, test, valid}.jsonl files containing all the information for the datapoints, which also lets us always revert to the original dataset (see the loading sketch after this list)
  • {train, test, valid}.{language-language}.id files, where language is in the set [java, python]
  • 6 raw text files used for training: {src, tgt}_{train, test, valid}.{language-language}.language
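
Loading a split is straightforward. A minimal sketch (the path is an example, and the exact field names inside each record are whatever the preprocessing scripts emit):

import json

def load_split(path):
    """Read a {train, valid, test}.jsonl file into a list of datapoint dicts."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

datapoints = load_split("data/java/processed/test.jsonl")
print(len(datapoints), "datapoints; fields:", sorted(datapoints[0].keys()))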

Training and Evaluation

Training the model and evaluating on the dataset

A GPU is required to run the experiments.

To use our open-sourced pretrained models and data files, download plbart.zip or codeT5.zip from the link below and verify the results using the same procedure.

https://drive.google.com/drive/folders/1dzuHuouuWzlFCy1CMj9DYG9JGraEay27?usp=sharing

Then go to the corresponding model folder and run the run.sh script. More instructions are given later on this page.

cd plbart/
./run.sh

To run the codet5 model, go to the codet5 folder and use the run.sh script. This will also evaluate the model on match-based metrics (BLEU, CodeBLEU, Syntax Match, Dataflow Match, etc.). Some changes are required before executing the run.sh script:

  • Change the source and target languages on lines 14-15 to one of ['java', 'python'].
  • Change path_2_data at the end of line 22 to the folder name containing the processed or processed_with_verdict data.
  • Change line 27 so that the model and cached-data save directory is consistent with the data; for example, append "_with_verdict" if the associated data path contains "_with_verdict".
  • To run only the evaluation, comment out the train function call at the bottom of the run.sh file.

Each run.sh file has a similar structure:

./run.sh GPU_ID SRC_LANGUAGE TARGET_LANGUAGE DATA_SOURCE WITH_VERDICT

GPU_ID is the ID of the GPU to use; for a single GPU, pass "0".
SRC_LANGUAGE and TARGET_LANGUAGE are the same for a single run; both can be either "java" or "python".
DATA_SOURCE is the name of the folder holding the preprocessed data, e.g. "codenet" if the stored preprocessed data folder is named "codenet".
WITH_VERDICT is either "true" or "false", depending on whether you want the verdict information included in the input.

cd codet5/
nohup ./run.sh 0 java java codenet false #Executes the Java dataset with one GPU and without verdict information
nohup ./run.sh 0 java java codenet true #Executes the Java dataset with one GPU and verdict information
nohup ./run.sh 0 python python codenet false #Executes the Python dataset with one GPU and without verdict information
nohup ./run.sh 0 python python codenet true #Executes the Python dataset with one GPU and verdict information

Similarly, to train and evaluate the plbart model, go to the plbart folder from the root directory and run the following:

cd plbart/
nohup ./run.sh 0 java java codenet false
nohup ./run.sh 0 java java codenet true
nohup ./run.sh 0 python python codenet false
nohup ./run.sh 0 python python codenet true

The run.sh script for each of the models contains 3 functions:

  • train -> Trains the specific model, saves checkpoints, and logs all the necessary metrics.
  • evaluate -> Loads a pretrained model (usually the checkpoint-best-ppl checkpoint) and computes all metrics except the execution-based evaluation with pass@k accuracy.
  • generate -> Loads a pretrained model (usually the checkpoint-best-ppl checkpoint) and generates a json file with the predictions from the loaded model.

Evaluation

Evaluate on Execution

This part is not included in the usual evaluation because system-specific changes are required to run it efficiently.

First, run the commands below. They create an additional split named eval in each of the 4 core data folders (data/{language}/{processed, processed_with_verdict}). The eval split is similar to train, valid, and test but smaller. The main difference is that {train, valid, test} are produced by our split method and together cover all datapoints, whereas eval is sampled from the test datapoints by generate_eval_files.py. It keeps the data distribution similar to the test file but at a smaller scale (500 datapoints in our case), which keeps runtime and computational cost in check, since we need to generate multiple candidate fixes and run each of them against many test cases to compute pass@k accuracy.

cd src/
python generate_eval_files.py 
python generate_eval_files.py --with_verdict True
python generate_eval_files.py --lang python
python generate_eval_files.py --with_verdict True --lang python
cd ../
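
Conceptually, the eval split is a uniform random sample of the test datapoints, which preserves the test distribution in expectation. A simplified sketch (the file names and seed are illustrative; generate_eval_files.py is the authoritative implementation):

import random

def sample_eval_split(test_path, eval_path, size=500, seed=1234):
    """Sample `size` datapoints from the test split to form the smaller eval split."""
    with open(test_path) as f:
        lines = [line for line in f if line.strip()]
    sampled = random.Random(seed).sample(lines, min(size, len(lines)))
    with open(eval_path, "w") as f:
        f.writelines(sampled)

# e.g. sample_eval_split("data/java/processed/test.jsonl", "data/java/processed/eval.jsonl")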

Let's generate the file with the model predictions

Go to the specific model folder and execute the run.sh command with only the generate function uncommented and save_dir, path_2_data, and languages set to the correct versions. For example:

cd plbart/
./run.sh

To use our open sourced pretrained models, download plbart.zip or codeT5.zip from the link below and verify the results using the same procedure.

https://drive.google.com/drive/folders/1dzuHuouuWzlFCy1CMj9DYG9JGraEay27?usp=sharing

Preprocess the generated files, which contain all tokenized and detokenized source, target, and prediction text

First, we need to create a self-contained json with all the versions necessary to detokenize the code and execute it. We split this step out explicitly because it is not possible to install all the libraries required to tokenize the Java and Python programs on the ARC (Advanced Research Computing) supercomputer at Virginia Tech. We therefore run this step elsewhere and create the resulting json file, which can then be used to generate results.

cd src/
python merge.py --references data/java/processed/generation.json --language java
python merge.py --references data/java/processed_with_verdict/generation.json --language java
python merge.py --references data/python/processed/generation.json --language python
python merge.py --references data/python/processed_with_verdict/generation.json --language python
cd ../

These commands create 4 json files. You may want to change the output file names to keep them clearly distinguishable.

Finally, let's run the code to execute and evaluate

First, we expect the test cases folder and the "problem_list.csv" file to be in the root directory. So let's copy those:

cp -r data/atcoder_test_cases atcoder_test_cases
cp Project_CodeNet/metadata/problem_list.csv problem_list.csv 

Now, let's run the execute and evaluate methods:

python evaluation/execution_evaluation_TC_arc_MP.py --references test_python2python_with_verdict_output.jsonl --language python --test_cases atcoder_test_cases --problem_list problem_list.csv
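
At its core, execution-based evaluation runs each generated fix against the problem's test cases and compares outputs. A minimal single-candidate sketch for Python submissions (the timeout and output normalization are illustrative; execution_evaluation_TC_arc_MP.py additionally handles Java compilation, multiprocessing, and verdict bookkeeping):

import subprocess
import sys

def passes_test_case(candidate_path, input_text, expected_output, timeout=5):
    """Run a candidate Python fix on one test case and compare normalized stdout."""
    try:
        result = subprocess.run(
            [sys.executable, candidate_path],
            input=input_text, capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == expected_output.strip()

def test_case_average(candidate_path, test_cases):
    """Fraction of (input, expected_output) pairs the candidate passes."""
    passed = sum(passes_test_case(candidate_path, i, o) for i, o in test_cases)
    return passed / max(len(test_cases), 1)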

To run on ARC, we provide a batch file for use on Slurm clusters; you may need to change the credentials in it.

sbatch batch_run.sh

The previous commands create a json file that contains all the fields necessary for visualizing the results and computing pass@k accuracy.
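
For reference, pass@k is commonly estimated with the unbiased estimator of Chen et al. (2021): given n generated candidates for a problem, of which c pass all test cases, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. A small sketch, assuming that estimator:

from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n candidates, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over problems; `results` maps problem_id -> (n, c)."""
    return sum(pass_at_k(n, c, k) for n, c in results.values()) / max(len(results), 1)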

Use results.py to get the results

We can use results.py to generate the results. The resulting json can also be used in the src/01_preprocessing.ipynb notebook for visualization.

Benchmarks

Match-based metrics

We evaluate the models' performances on the test set in terms of Compilation Accuracy (CA), BLEU, Syntax Match (SM), Dataflow Match (DM), CodeBLEU (CB), and Exact Match (EM). We report the model performances below.

Method      Language  Verdict  BLEU   EM    SM     DM     CB     CA
Naive Copy  Java      No       80.28  0.03  84.22  53.64  75.43  89.93
Naive Copy  Python    No       68.55  0.73  70.12  60.51  68.47  96.56
PLBART      Java      No       58.49  0.45  66.92  43.08  57.23  31.36
PLBART      Java      Yes      59.84  1.46  68.01  44.99  58.62  33.04
PLBART      Python    No       61.89  2.32  64.32  48.81  61.13  91.16
PLBART      Python    Yes      62.25  2.46  63.31  49.73  62.21  92.21
CodeT5      Java      No       62.31  2.96  74.01  52.30  63.37  63.03
CodeT5      Java      Yes      62.54  2.45  73.93  53.29  63.71  64.23
CodeT5      Python    No       64.92  2.74  68.79  56.21  63.53  92.80
CodeT5      Python    Yes      64.67  2.97  68.45  56.04  63.28  92.70
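
As a toy illustration of why match-based metrics can be misleading, the two hypothetical fixes below are functionally identical (same output on every input) yet share few tokens, so Exact Match is 0 and token-overlap metrics penalize the second one even though it is just as correct:

# Two functionally equivalent "fixes": both print the sum of two integers read from stdin.
fix_a = "a, b = map(int, input().split())\nprint(a + b)"
fix_b = "values = input().split()\nprint(int(values[0]) + int(values[1]))"

print("Exact Match:", fix_a == fix_b)  # False
tokens_a, tokens_b = set(fix_a.split()), set(fix_b.split())
print("Token overlap:", len(tokens_a & tokens_b) / len(tokens_a | tokens_b))  # low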

Execution-based metrics

We also evaluate our models using pass@k and test case average (TCA@k). The benchmark results are shown below:

Language  Verdict  pass@1  pass@3  pass@5  pass@10  TCA@1  TCA@3  TCA@5  TCA@10
Java      No       8.65    15.62   19.63   24.44    41.00  34.00  32.70  29.60
Java      Yes      10.94   18.77   22.66   27.96    44.99  38.80  35.87  32.90
Python    No       6.86    13.07   16.27   20.51    50.20  41.20  38.50  35.20
Python    Yes      7.32    13.94   17.47   22.63    58.75  41.16  38.37  34.88

License

MIT License

Copyright (c) 2022 Md. Mahim Anjum Haque

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

Citation

@article{haque2022fixeval,
  title={FixEval: Execution-based Evaluation of Program Fixes for Competitive Programming Problems},
  author={Haque, Md Mahim Anjum and Ahmad, Wasi Uddin and Lourentzou, Ismini and Brown, Chris},
  journal={arXiv preprint arXiv:2206.07796},
  year={2022}
}
