
THIS IS A DEPRECATED REPOSITORY. PLEASE REFER TO THIS LINK.

Adversarial Examples for Models of Code - Code2vec

An adversary for Code2vec, a neural network for learning distributed representations of code. This is an official implementation of the approach described in:

Noam Yefet, Uri Alon and Eran Yahav, "Adversarial Examples for Models of Code", 2019 https://arxiv.org/abs/1910.07517

This is a TensorFlow implementation, designed to be easy to use in research and for experimenting with new ideas for attacks in machine learning on code tasks. Contributions are welcome.

Table of Contents

  • Requirements
  • Quickstart
  • Manually examine adversarial examples
  • Defense
  • Configuration

Requirements

On Ubuntu:

  • Python3. To check if you have it:

python3 --version

  • TensorFlow - version 1.13 or newer (install). To check TensorFlow version:

python3 -c 'import tensorflow as tf; print(tf.__version__)'
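
If you prefer to check both requirements at once, here is a minimal sketch (assuming a standard TensorFlow install):

# Minimal requirements check for this repository (Python 3, TensorFlow >= 1.13).
import sys
import tensorflow as tf

assert sys.version_info >= (3,), "Python 3 is required"
major, minor = (int(x) for x in tf.__version__.split(".")[:2])
assert (major, minor) >= (1, 13), "TensorFlow 1.13+ required, found " + tf.__version__
print("OK: Python %d.%d, TensorFlow %s" % (sys.version_info[0], sys.version_info[1], tf.__version__))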

Quickstart

Step 0: Cloning this repository

git clone https://github.com/noamyft/code2vec.git
cd code2vec

Step 1: Creating a new dataset from Java sources

In order to have a preprocessed dataset to attack the network on, you can either download our preprocessed dataset or create a new dataset of your own.

Download our preprocessed dataset (compressed: 200 MB, extracted: 1 GB)

We provide a preprocessed dataset (based on Uri Alon's Java-large dataset).

First, download the preprocessed dataset archive below into the directory created earlier.

Then extract it:

tar -xvzf java_large_adversarial_data.tar.gz

This will create a directory named "data" with all the relevant data for the model and the adversary.
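
If you want to sanity-check the extraction, each line of a .c2v file holds one example: the method's name followed by its contexts, where every context is a comma-separated source_token,path,target_token triple (assuming the standard code2vec preprocessing format):

# Peek at the first example in the extracted test set (standard .c2v layout assumed).
with open("data/java_large_adversarial/java_large_adversarial.test.c2v") as f:
    fields = f.readline().split()
method_name, contexts = fields[0], fields[1:]
print("method:", method_name)
print("number of contexts:", len(contexts))
print("first context (token,path,token):", contexts[0])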

Step 2: Downloading a trained model

We provide a code2vec model trained on the Java-large dataset (thanks to Uri Alon). Trainable model (3.5 GB):

wget https://code2vec.s3.amazonaws.com/model/java-large-model.tar.gz
tar -xvzf java-large-model.tar.gz

You can also train your own model; see the original Code2vec repository.
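
After extraction you can verify that the checkpoint sits where Step 3 expects it. saved_model_iter3 is a checkpoint prefix, which standard TensorFlow 1.x saving expands into several files (an assumption about this archive's layout):

# Check that the pretrained checkpoint was extracted to the expected prefix.
import glob
files = glob.glob("models/java-large/saved_model_iter3*")
if files:
    print("Found checkpoint files:", files)
else:
    print("No checkpoint at models/java-large/saved_model_iter3 - re-check the extraction")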

Step 3: Running the adversary on the trained model

Once you have downloaded the preprocessed dataset and the pretrained model, you can run the adversary on the model by running:

  • for the VarName attack:
python3 code2vec.py --load models/java-large/saved_model_iter3 --load_dict data/java_large_adversarial/java-large --test data/java_large_adversarial/java_large_adversarial.test.c2v --test_adversarial --adversarial_type targeted --adversarial_target add
  • for the DeadCode attack:
python3 code2vec.py --load models/java-large/saved_model_iter3 --load_dict data/java_large_adversarial/java-large --test data/java_large_adversarial/java_large_adversarial_with_deadcode.test.c2v --test_adversarial --adversarial_type nontargeted --adversarial_deadcode --adversarial_target "merge|from"

Where:

  • --load - the path to the pretrained model.
  • --load_dict - the path to the preprocessed dictionary.
  • --adversarial_deadcode - use the DeadCode attack (note: you should also specify the path to the deadcode dataset).
  • --adversarial_type - targeted or nontargeted.
  • --adversarial_target - the desired target name(s) for the "targeted" type, separated by '|' (e.g. "merge|from").

You can also control the depth and width of the adversary's BFS search by setting the --adversarial_depth and --adversarial_topk parameters, respectively (both default to 2); the sketch below illustrates how these two parameters bound the search.
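
To make those two parameters concrete, here is a schematic sketch (our own illustration, not the repository's actual code) of a gradient-guided BFS over variable renames. propose and is_adversarial are hypothetical callbacks standing in for the model-specific steps: gradient-based ranking of replacement names, and checking whether the model's prediction was flipped.

# Schematic BFS over renames; `propose` and `is_adversarial` are hypothetical.
def bfs_attack(program, propose, is_adversarial, depth=2, topk=2):
    frontier = [program]
    for _ in range(depth):                       # --adversarial_depth levels
        next_frontier = []
        for candidate in frontier:
            # keep only the topk best renames per node (--adversarial_topk)
            for mutated in propose(candidate)[:topk]:
                if is_adversarial(mutated):      # prediction flipped?
                    return mutated
                next_frontier.append(mutated)
        frontier = next_frontier                 # descend one level
    return None                                  # search exhausted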

Manually examine adversarial examples

You can run the examples we provided in the paper on Code2vec's online demo, available at https://code2vec.org/.

  • You can copy&paste the sort example from here

  • You can type the following code into each example to get a prediction of sort:

int introsorter = 0;

Defense

You can run the Outlier Detection defense by adding the --guard_input flag with a threshold to either of the following (a sketch of one possible outlier check follows the examples):

  • regular evaluation, e.g.:
python3 code2vec.py --load models/java-large/saved_model_iter3 --test data/java_large_adversarial/java_large_adversarial.test.c2v --guard_input 2.7
  • adversarial evaluation, e.g.:
python3 code2vec.py --load models/java-large/saved_model_iter3 --load_dict data/java_large_adversarial/java-large --test data/java_large_adversarial/java_large_adversarial.test.c2v --test_adversarial --adversarial_type targeted --adversarial_target add --guard_input 2.7
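
As a rough illustration of what an embedding-space outlier defense can look like (a sketch under our own assumptions; the exact statistic this repository computes may differ), one option is to flag input tokens whose embedding lies unusually far from the centroid of the vocabulary embeddings, using the --guard_input value as a z-score threshold:

# Illustrative outlier check; `embeddings` is the model's token-embedding matrix
# and `token_ids` indexes the tokens of one input. The statistic (z-score of the
# distance to the embedding centroid) is our own assumption, not the repo's code.
import numpy as np

def outlier_tokens(embeddings, token_ids, threshold=2.7):
    centroid = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    z = (dists - dists.mean()) / dists.std()
    return [t for t in token_ids if z[t] > threshold]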

Configuration

You can change hyper-parameters by editing the file config.py. Here are some of the parameters and their descriptions:

config.MAX_WORDS_FROM_VOCAB_FOR_ADVERSARIAL = 100000

The vocabulary size of the adversary.

config.ADVERSARIAL_MINI_BATCH_SIZE = 256

The batch size for the adversary's gradient steps.

config.TEST_BATCH_SIZE = config.BATCH_SIZE = 1024

The batch size during evaluation. This affects only evaluation speed and memory consumption; it does not affect the results.

config.READING_BATCH_SIZE = 1300 * 4

The batch size of reading text lines to the queue that feeds examples to the network during training.

config.NUM_BATCHING_THREADS = 2

The number of threads enqueuing examples.

config.BATCH_QUEUE_SIZE = 300000

Max number of elements in the feeding queue.

config.DATA_NUM_CONTEXTS = 200

The number of contexts in a single example, as was created in preprocessing.

config.MAX_CONTEXTS = 200

The number of contexts to use in each example.

config.WORDS_VOCAB_SIZE = 1301136

The max size of the token vocabulary.

config.TARGET_VOCAB_SIZE = 261245

The max size of the target words vocabulary.

config.PATHS_VOCAB_SIZE = 911417

The max size of the path vocabulary.

config.EMBEDDINGS_SIZE = 128

Embedding size for tokens and paths.
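
These settings are ordinary Python assignments, so experimenting is just a matter of editing the values in config.py; for example, a lighter-weight adversarial run might look like this (illustrative values, not recommendations):

# Example edits in config.py for a lighter adversarial run (illustrative values).
config.MAX_WORDS_FROM_VOCAB_FOR_ADVERSARIAL = 20000  # smaller adversary vocabulary
config.ADVERSARIAL_MINI_BATCH_SIZE = 64              # smaller gradient batches
config.TEST_BATCH_SIZE = config.BATCH_SIZE = 256     # lower evaluation memory use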
