Automatic summarization of source code using a neural network, based on the Universal Transformer architecture.
Check out the live demo!
The Transformer architecture for sequence-to-sequence modeling is composed of an Encoder and a Decoder. Each consists of a stack of layers, and each layer has a self-attention block and a feed-forward block. The Decoder layers additionally have an encoder-decoder attention block, which attends to the processed input as well as the currently generated output.
The Universal Transformer architecture shares a single layer's weights across the entire Encoder, and likewise across the Decoder. This reduces the size of the model and improves accuracy on many tasks, including algorithmic ones (e.g. interpreting source code).
I implemented this project using TensorFlow.
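As a rough illustration of the weight sharing, here is a minimal Keras sketch (not the code in this repository; the hyperparameters and layer structure are assumptions):

```python
import tensorflow as tf

class UniversalEncoder(tf.keras.layers.Layer):
    """Applies one shared encoder layer repeatedly, instead of a stack of distinct layers."""

    def __init__(self, num_steps=4, d_model=256, num_heads=4, dff=1024):
        super().__init__()
        self.num_steps = num_steps
        # A single self-attention block and feed-forward block, reused at every step.
        self.attention = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, x, mask=None):
        # The same weights are applied num_steps times, unlike a vanilla Transformer,
        # which allocates fresh weights for every layer in the stack.
        for _ in range(self.num_steps):
            attn = self.attention(query=x, value=x, key=x, attention_mask=mask)
            x = self.norm1(x + attn)
            x = self.norm2(x + self.ffn(x))
        return x
```

Because the same block is applied at every step, the parameter count is independent of the recurrence depth `num_steps`.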
- Download the dataset and create the SentencePiece tokenizers
- Run the training script `train.py`, providing the parameters `num_epochs`, `model_path` (path to the model, which contains a `transformer_description.json` file with the necessary attributes), and `dataset_path` (ordinarily `data/leclair_java`); an example invocation is sketched after this list
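For example, a training run might look like the following; treat the exact flag syntax, epoch count, and model path as assumptions, since they depend on `train.py`'s argument parser:

```bash
python train.py --num_epochs 20 --model_path models/java_summ_ut_4 --dataset_path data/leclair_java
```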
If you trained a model yourself, you can run an interactive demo by running `translation_transformer.py` with the arguments `model_path` and `dataset_path`.
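For example (again, the exact argument syntax and paths are assumptions):

```bash
python translation_transformer.py --model_path models/java_summ_ut_4 --dataset_path data/leclair_java
```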
There is a pretrained model in the TFLite format at `models/java_summ_ut_4.tflite`, which you can try by running `server.py`. Alternatively, serve it with Gunicorn via `gunicorn server:code_summarization_server`.
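If you just want to poke at the pretrained TFLite model directly, the standard `tf.lite.Interpreter` API works; the exact input/output tensor shapes depend on how the model was exported, so the sketch below only loads the model and prints them:

```python
import tensorflow as tf

# Load the bundled pretrained model and inspect its signature.
interpreter = tf.lite.Interpreter(model_path="models/java_summ_ut_4.tflite")
interpreter.allocate_tensors()

print(interpreter.get_input_details())   # shapes/dtypes of the expected input tensors
print(interpreter.get_output_details())  # shapes/dtypes of the produced output tensors
```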