SPT-Code

Requirements

Minimal requirements

The minimal requirements are listed in requirements.txt.

Additional requirements

If you need to reprocess the raw dataset, or use your own dataset, then you will also need to install the following packages.

tree_sitter==0.19.0
antlr4-python3-runtime==4.9.2

In addition, antlr4 needs to be installed; see the installation guidance here.

If you encounter errors about my-languages.so when preprocessing the dataset, please run sources/data/asts/build_lib.py first.
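Before preprocessing, it may help to confirm that the two pinned packages listed above are installed at the expected versions. The following is a minimal sketch rather than part of the repository; the helper name check_pins is hypothetical, and it relies only on the standard importlib.metadata module.

```python
from importlib import metadata

# Pins copied from the package list above.
PINNED = {
    "tree_sitter": "0.19.0",
    "antlr4-python3-runtime": "4.9.2",
}


def check_pins(pins, lookup=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied.

    check_pins is a hypothetical helper, not something SPT-Code ships.
    """
    problems = {}
    for name, expected in pins.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # the package is not installed at all
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means both pins are satisfied. The lookup parameter exists only so the function can be exercised without touching the real environment.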

Datasets and Tokenizers

We provide pre-processed datasets, saved as pickle binary files, which can be loaded directly as instances.

The pre-processed datasets can be downloaded here: (OneDrive, iCloud, Google Drive). Put the downloaded dataset pickle file into {dataset_root}/dataset_saved/ (defaults to .../dataset/dataset_saved); the program will detect and use it automatically.
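To inspect a downloaded pickle file outside of main.py, something like the following sketch could be used. The function name load_saved_dataset and the file name passed to it are hypothetical; only the dataset_saved directory layout comes from the description above.

```python
import pickle
from pathlib import Path


def load_saved_dataset(dataset_root, file_name):
    """Load one pre-processed dataset pickle from {dataset_root}/dataset_saved/.

    Hypothetical helper: the actual file names depend on what you downloaded.
    """
    path = Path(dataset_root) / "dataset_saved" / file_name
    if not path.is_file():
        raise FileNotFoundError(f"No saved dataset at {path}")
    with path.open("rb") as f:
        # Only unpickle files from a source you trust: pickle can execute code.
        return pickle.load(f)
```

Because the files store class instances, unpickling them presumably requires the project's own classes to be importable, for example by running the script from the sources directory.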

It is also possible to use a custom dataset, simply by placing it in the specified location according to the relevant settings in the source code, or by modifying the corresponding dataset loading script in the source code. The dataset loading code is located in the sources/data/data.py and sources/data/data_utils.py files.

Pre-trained Tokenizers and Models

Custom tokenizers (which we call "vocab") can be downloaded here: (OneDrive, iCloud, Google Drive). Extract them into a directory, then either pass that directory to main.py through the trained_vocab argument or place the files in {dataset_root}/vocab_saved (defaults to .../dataset/vocab_saved).

You may pre-train SPT-Code yourself. We also provide pre-trained models, available here. Extract them into a directory, then specify that directory with the trained_model argument, in the same way as for the tokenizers.

Runs

Run main.py to pre-train, fine-tune, or test. All arguments are defined in args.py; specify whichever you need.

Some example scripts follow.

# pre-training
python main.py \
--do-pre-train \
--pre-train-tasks cap,mass,mng \
--batch-size 64 \
--eval-batch-size 64 \
--cuda-visible-devices 0,1,2,3 \
--fp16 \
--model-name pre_train

# summarization on pre-trained model and vocab
python main.py \
--do-fine-tune \
--task summarization \
--summarization-language java \
--model-name summarization_java \
--trained_vocab '../pre_trained/vocabs/' \
--trained_model '../pre_trained/models/all/'

# bug fixing without pre-training
python main.py \
--do-fine-tune \
--train-from-scratch \
--task bug_fix \
--bug_fix_scale medium

# only test on translation
python main.py \
--only-test \
--task translation \
--translation-source-language java \
--translation-target-language c_sharp \
--trained_vocab '../pre_trained/vocabs/' \
--trained_model '../outputs/translation_java_c_sharp_20210826_052653/models/'
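When several such runs need to be launched in sequence (for instance one fine-tuning run per language), the command line can be assembled in Python instead of being copied by hand. This is only a sketch: build_command is a hypothetical helper, and the flags and values are taken unchanged from the summarization example above.

```python
import shlex
import sys


def build_command(options, script="main.py"):
    """Turn {flag: value} into an argv list; a value of True means a bare switch.

    Hypothetical helper, not part of SPT-Code.
    """
    argv = [sys.executable, script]
    for flag, value in options.items():
        argv.append(f"--{flag}")
        if value is not True:
            argv.append(str(value))
    return argv


# Flags and values copied from the summarization example above.
summarization = build_command({
    "do-fine-tune": True,
    "task": "summarization",
    "summarization-language": "java",
    "model-name": "summarization_java",
    "trained_vocab": "../pre_trained/vocabs/",
    "trained_model": "../pre_trained/models/all/",
})

# Print the command so it can be reviewed before anything is launched.
print(shlex.join(summarization))
```

To actually start the run, the list can be passed to subprocess.run(summarization, check=True) from the directory that contains main.py.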

Distributed training

Installation

Distributed training for SPT-Code requires Hugging Face's accelerate package. Distributed CPU training additionally requires ipex (Intel Extension for PyTorch), Intel oneCCL, and an MPI installation.

pip install accelerate

Refer to this link for ipex and oneCCL installation.

Command for training

For 4-node distributed training, first create a hostfile that lists the IP addresses of all 4 nodes, one node per line.

x.x.x.x # node1
y.y.y.y # node2
z.z.z.z # node3
w.w.w.w # node4
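For larger clusters, the hostfile can be generated rather than typed. The sketch below assumes nothing beyond the one-address-per-line layout shown above; write_hostfile is a hypothetical helper.

```python
from pathlib import Path


def write_hostfile(path, hosts):
    """Write one host per line, rejecting empty and duplicate entries.

    Returns the number of hosts written. Hypothetical helper, not part of SPT-Code.
    """
    cleaned = [h.strip() for h in hosts]
    if not cleaned or any(not h for h in cleaned):
        raise ValueError("every host entry must be non-empty")
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("duplicate host in list")
    Path(path).write_text("\n".join(cleaned) + "\n")
    return len(cleaned)
```

For example, write_hostfile("hostfile", ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]) writes four lines and returns 4 (the addresses are documentation placeholders). With one process per node, the returned count is the value to pass to mpirun as -n.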

Then run accelerate config to set up a config file for accelerate. Refer to this link. Answer Yes to IPEX, distributed training, and CPU.

With this setup, the following command starts 4-node distributed CPU training with the oneCCL backend. Each node runs one process.

$ cd sources
$ mpirun -launcher ssh -verbose -genv I_MPI_DEBUG 4 -genv OMP_NUM_THREADS 112 -f <hostfile> \
-n 4 -ppn 1 accelerate launch --config_file <accelerate_config_file> \
main.py --do-pre-train --pre-train-tasks cap,mass,mng --batch-size 64 \
--eval-batch-size 64 --model-name pre_train_distcpu_4node_c_small \
--logging-steps=10 --do-dist-cpu-training=True --use-ipex=True \
--dataset-root=<dataset>
