CLAWSAT: Towards Both Robust and Accurate Code Models

Code repository for the SANER2023 paper CLAWSAT: Towards Both Robust and Accurate Code Models.

Link to paper - https://arxiv.org/abs/2211.11711

Slides -

Citation

@article{jia2022clawsat,
  title={CLAWSAT: Towards Both Robust and Accurate Code Models},
  author={Jia, Jinghan and Srikant, Shashank and Mitrovska, Tamara and Gan, Chuang and Chang, Shiyu and Liu, Sijia and O'Reilly, Una-May},
  journal={arXiv preprint arXiv:2211.11711},
  year={2022}
}

Abstract

We integrate contrastive learning (CL) with adversarial learning to co-optimize the robustness and accuracy of code models. Different from existing works, we show that code obfuscation, a standard code transformation operation, provides novel means to generate complementary `views' of a code that enable us to achieve both robust and accurate code models. To the best of our knowledge, this is the first systematic study to explore and exploit the robustness and accuracy benefits of (multi-view) code obfuscations in code models. Specifically, we first adopt adversarial codes as robustness-promoting views in CL at the self-supervised pre-training phase. This yields improved robustness and transferability for downstream tasks. Next, at the supervised fine-tuning stage, we show that adversarial training with a proper temporally-staggered schedule of adversarial code generation can further improve robustness and accuracy of the pre-trained code model. Built on the above two modules, we develop CLAWSAT, a novel self-supervised learning (SSL) framework for code by integrating CL with adversarial views (CLAW) with staggered adversarial training (SAT). On evaluating three downstream tasks across Python and Java, we show that CLAWSAT consistently yields the best robustness and accuracy (e.g. 11% in robustness and 6% in accuracy on the code summarization task in Python). We additionally demonstrate the effectiveness of adversarial learning in CLAW by analyzing the characteristics of the loss landscape and interpretability of the pre-trained models.

Authors

Jinghan Jia, Shashank Srikant,Tamara Mitrovska, Chuang Gan, Shiyu Chang, Sijia Liu, and Una-May O’Reilly.

If you face issues running this codebase, please open an issue on this repository, and mention as much information to reproduce your issue, including the exact command you have run, the configurations that you have used, the output you see, etc. See posts like these which describe how to communicate problems effectively via Github issues.

To discuss any other details on the method we introduce, contact Jinghan (jiajingh@msu.edu) and Shashank (shash@mit.edu)

Setting up training pipeline

See the readme from our ICLR 2021 work for details on setting up the basic training pipeline.

Commands

Download and normalize datasets

make download-datasets
make normalize-datasets

Create the transformed datasets

make apply-transforms-sri-py150
make apply-transforms-csn-python
make extract-transformed-tokens

Train a normal seq2seq model for 10 epochs on sri/py150

bash experiments/normal_seq2seq_train.sh

Run adversarial training and testing on sri/py150 for 5 epochs

bash experiments/normal_adv_train.sh

Get the augmented sri/py150 datasets with random and adversarial views

bash scripts/augment.sh

Pretrain a seq2seq encoder on a sri/py150 augmented dataset, finetune the encoder on sri/py150, and test the final model on normal and adversarial datasets.

bash experiments/finetune_and_test_0.sh

Pretrain a seq2seq encoder on sri/py150 and run adversasrial training starting from the pretrained model.

bash experiments/pretrain_adv_train.sh

Directory Structure

The instructions that follow have been adopted from Ramakrishnan et al.'s codebase. In this repository, we have the following directories:

`./datasets`

Note: the datasets are all much too large to be included in this GitHub repo. This is simply the structure as it would exist on disk once our framework is setup.

./datasets
  + ./raw            # The four datasets in "raw" form
  + ./normalized     # The four datasets in the "normalized" JSON-lines representation 
  + ./preprocess
    + ./tokens       # The four datasets in a representation suitable for token-level models
    + ./ast-paths    # The four datasets in a representation suitable for code2seq
  + ./transformed    # The four datasets transformed via our code-transformation framework 
    + ./normalized   # Transformed datasets normalized back into the JSON-lines representation
    + ./preprocessed # Transformed datasets preprocessed into:
      + ./tokens     # ... a representation suitable for token-level models
      + ./ast-paths  # ... a representation suitable for code2seq
  + ./adversarial    # Datasets in the format < source, target, tranformed-variant #1, #2, ..., #K >
    + ./tokens       # ... in a token-level representation
    + ./ast-paths    # ... in an ast-paths representation

`./models`

We have two Machine Learning on Code models. Both of them are trained on the Code Summarization task. The seq2seq model has been modified to incorporate our attack formulation, and includes an adversarial training loop. The branch pytorch-code2seq implements our attack formulation on code2seq. This is work in progress.

./models
  + ./pytorch-seq2seq   # seq2seq model implementation
  + ./contracode   # contracode encoder, including transformer encoder and lstm encoder.
  + ./pytorch-seq2seq-code-completion # seq2seq model for code completion
  + ./Transformer # Transformer for code summarization.

`./results`

This directory stores results that are small-enough to be checked into GitHub. This is automatically generated once the codebase is set up.

`./scripts`

In this directory there are a large number of scripts for doing various chores related to running and maintaing this code transformation infrastructure.

`./tasks`

This directory houses the implementations of various pieces of our core framework:

./tasks
  + ./astor-apply-transforms
  + ./depth-k-test-seq2seq
  + ./download-c2s-dataset
  + ./download-csn-dataset
  + ./extract-adv-dataset-c2s
  + ./extract-adv-dataset-tokens
  + ./generate-baselines
  + ./integrated-gradients-seq2seq
  + ./normalize-raw-dataset
  + ./preprocess-dataset-c2s
  + ./preprocess-dataset-tokens
  + ./spoon-apply-transforms
  + ./test-model-*
  + ./train-model-*
  + ./pretain-contracode*
  + ./adv-pretrain-controcode*
  + ./adv-train-model-seq2seq-online*
  + ./finetune-contracode*

`./vendor`

This directory contains dependencies in the form of git submodukes.

`Makefile`

We have one overarching Makefile that can be used to drive a number of the data generation, training, testing, adn evaluation tasks.

download-datasets                    (DS-1) Downloads all prerequisite datasets
normalize-datasets                   (DS-2) Normalizes all downloaded datasets
extract-ast-paths                    (DS-3) Generate preprocessed data in a form usable by code2seq style models. 
extract-tokens                       (DS-3) Generate preprocessed data in a form usable by seq2seq style models. 
apply-transforms-c2s-java-med        (DS-4) Apply our suite of transforms to code2seq's java-med dataset.
apply-transforms-c2s-java-small      (DS-4) Apply our suite of transforms to code2seq's java-small dataset.
apply-transforms-csn-java            (DS-4) Apply our suite of transforms to CodeSearchNet's java dataset.
apply-transforms-csn-python          (DS-4) Apply our suite of transforms to CodeSearchNet's python dataset.
apply-transforms-sri-py150           (DS-4) Apply our suite of transforms to SRI Lab's py150k dataset.
extract-transformed-ast-paths        (DS-6) Extract preprocessed representations (ast-paths) from our transfromed (normalized) datasets 
extract-transformed-tokens           (DS-6) Extract preprocessed representations (tokens) from our transfromed (normalized) datasets 
extract-adv-datasets-tokens          (DS-7) Extract preprocessed adversarial datasets (representations: tokens)
docker-cleanup                       (MISC) Cleans up old and out-of-sync Docker images.
submodules                           (MISC) Ensures that submodules are setup.
help                                 (MISC) This help.
test-model-seq2seq                   (TEST) Tests the seq2seq model on a selected dataset.
test-finetuned-model                 (TEST) Tests the finetuned seq2seq model on a selected dataset.
train-model-seq2seq                  (TRAIN) Trains the seq2seq model on a selected dataset.
adv-pretrain-contracode              (TRAIN) Robustness-aware pretrain the seq2seq encoder.
adv-pretrain-contracode-transformer  (TRAIN) Robustness-aware pretrain the transformer encoder.
finetune-contracode                  (TRAIN) Finetune the pretrain encoder on downstream summarization task.
finetune-contracode-code-completion  (TRAIN) Finetune the pretrain encoder on downstream code completion task.
finetune-contracode-code-clone       (TRAIN) Finetune the pretrain encoder on downstream code clone detection task.
finetune-contracode-code-transformer (TRAIN) Finetune the pretrain encoder on downstream code clone detection task.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.history		.history
experiments		experiments
models		models
scripts		scripts
tasks		tasks
vendor		vendor
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md

License

OPTML-Group/CLAW-SAT

Folders and files

Latest commit

History

Repository files navigation

CLAWSAT: Towards Both Robust and Accurate Code Models

Abstract

Authors

Setting up training pipeline

Commands

Directory Structure

./datasets

./models

./results

./scripts

./tasks

./vendor

Makefile

About

Resources

License

Stars

Watchers

Forks

Languages