A PyTorch implementation of GrammarT5.

A PyTorch Implementation of "GrammarT5: Grammar-Integrated Pre-trained Encoder-Decoder Neural Model for Code"

Introduction

Pre-trained models for code have exhibited promising performance across various code-related tasks, such as code summarization, code completion, code translation, and bug detection. These accomplishments have substantially contributed to the advancement of AI-assisted programming and developer tools. However, despite their success, the majority of current models still represent code as a token sequence in the fine-tuning phase, which may not adequately capture the essence of the underlying code structure.

In this work, we propose GrammarT5, a grammar-integrated encoder-decoder pre-trained model for code. GrammarT5 employs a novel grammar-integrated representation, Tokenized Grammar Rule List (TGRL), for code. TGRL is constructed based on the grammar rule list utilized in syntax-guided code generation and integrates syntax information with code tokens within an appropriate input length. Furthermore, we suggest attaching language flags to help GrammarT5 differentiate between grammar rules of various programming languages. Finally, we introduce three novel pre-training objectives—Edge Prediction (EP), Identifier Prediction (IP), and Sub-Tree Prediction (STP)—for GrammarT5 to learn syntax from TGRL.
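
To make the grammar-rule-list idea concrete, below is a minimal sketch of such a representation. It is not the exact TGRL construction from the paper: Python's built-in ast module stands in for the grammar GrammarT5 actually uses, and the rule names, language-flag token, and tokenization are illustrative assumptions.

    # A minimal sketch of a grammar-rule-list style representation, not the exact
    # TGRL construction from the paper: Python's built-in `ast` module stands in
    # for GrammarT5's grammar, and the rule/flag naming is illustrative.
    import ast

    LANG_FLAG = "<python>"  # hypothetical language-flag token

    def rule_list(code):
        """Flatten a parse tree into a preorder list of grammar rules and tokens."""
        rules = [LANG_FLAG]

        def visit(node):
            children = list(ast.iter_child_nodes(node))
            if children:
                # One rule per non-terminal: parent -> child node types.
                rhs = " ".join(type(c).__name__ for c in children)
                rules.append(type(node).__name__ + " -> " + rhs)
            # Concrete tokens (identifiers, constants) attached to this node.
            for _field, value in ast.iter_fields(node):
                if isinstance(value, (str, int, float)) and not isinstance(value, bool):
                    rules.append(type(node).__name__ + " -> " + str(value))
            for c in children:
                visit(c)

        visit(ast.parse(code))
        return rules

    print(rule_list("def add(a, b):\n    return a + b"))

In the paper's setting, a list like this is flattened into the model's input sequence so that syntax information and code tokens share one representation within a manageable input length.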

Experiments were conducted on five code-related tasks using ten datasets, demonstrating that GrammarT5 achieves state-of-the-art performance on all tasks in comparison to models of the same scale. Additionally, the paper illustrates that the proposed pre-training objectives and language flags can enhance GrammarT5's ability to better capture code syntax and semantics.

Dataset

Train set

The raw training data comes from CodeSearchNet (https://zenodo.org/record/7857872).
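
The sketch below shows one way to read such data, assuming the download contains gzipped JSON Lines shards with "code" and "docstring" fields, which is the usual CodeSearchNet layout; the file path is hypothetical.

    # A minimal loader sketch, assuming the download contains gzipped JSON Lines
    # shards with "code" and "docstring" fields (the usual CodeSearchNet layout).
    # The path below is hypothetical; see the repository scripts for the real one.
    import gzip
    import json

    def load_examples(path):
        """Yield (code, docstring) pairs from one .jsonl.gz shard."""
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                yield record["code"], record.get("docstring", "")

    # for code, doc in load_examples("python/train_0.jsonl.gz"):
    #     ...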

Test set

Usage

Pre-trained Model

We will publish our pre-trained models after the paper is accepted.

Fine-tuning Model

The task can be one of the following: ['django', 'concode', 'codetrans', 'repair', 'assert', 'conala', 'test', 'repairme', 'transj2c', 'transc2j', 'commentjava', 'commentpython', 'mbpp', 'searchadv', 'searchcos']

sh run.sh

The fine-tuned model is saved as checkModel[task].
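
For reference, a checkpoint saved this way can typically be restored as a standard PyTorch checkpoint, as in the sketch below; the exact file name and how the model object is constructed are defined by run.sh and eval.sh, so the path and names here are assumptions.

    # A minimal loading sketch, assuming run.sh writes a standard PyTorch
    # checkpoint named checkModel[task]. The path is hypothetical; consult
    # run.sh / eval.sh in this repository for the exact saving/loading code.
    import torch

    checkpoint = torch.load("checkModelconcode", map_location="cpu")  # checkModel[task] for task "concode"
    # model.load_state_dict(checkpoint)  # `model` must be built as in the training script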

Testing Model

sh eval.sh

Dependencies

  • Python 3.7
  • PyTorch 1.12
  • transformers 2.26
  • Java 8
  • docker
  • nvidia-docker
