# Overview of DeepBugs replication
This notebook follows the README on the project's GitHub page.
* All commands are called from the main directory.
* Python code (most of the implementation) and JavaScript code (for extracting data from .js files) are in the `/python` and `/javascript` directories.
* All data to learn from, e.g., .js files, is expected to be in the `/data` directory.
* All generated data, e.g., intermediate representations, is written into the main directory. It is recommended to move these files into separate directories.
* All generated data files have a timestamp as part of the file name. Below, all files are used with `*`. When running commands multiple times, make sure to use the most recent files.
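
Because every run adds new timestamped files, a small helper can pick the most recent match for a pattern instead of passing `*` blindly. The following is a minimal sketch; `newest_match` is a hypothetical helper, not part of DeepBugs, and it assumes that a file's modification time tracks the timestamp in its name.

```python
import glob
import os


def newest_match(pattern):
    """Return the most recently modified path matching a glob pattern, or None."""
    matches = glob.glob(pattern)
    if not matches:
        return None
    # Assumption: modification time reflects the timestamp in the file name.
    return max(matches, key=os.path.getmtime)


# Example: look up the latest extraction output in the main directory.
print(newest_match("calls_*.json"))
```

The returned path can then be passed to a command in place of the `*` pattern when only the latest file is wanted.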

# Clone github for DeepBugs
I forked the project to my own GitHub account in order to make corrections to the code.
Certain Keras imports are deprecated and cause errors.

* Deprecated imports

`from tensorflow.python.keras.models import Sequential`

`from tensorflow.python.keras.layers.core import Dense, Dropout`


* Corrected import statements

`from keras.models import Sequential`

`from keras.layers import Dense, Dropout`
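
If the same imports appear in several files, the substitution can be scripted. This is only a sketch of one way to do it: `fix_keras_imports` and the `REPLACEMENTS` table are hypothetical names, and the mapping covers only the two import paths listed above.

```python
import re

# Deprecated module paths mapped to their replacements, as listed above.
REPLACEMENTS = {
    "tensorflow.python.keras.models": "keras.models",
    "tensorflow.python.keras.layers.core": "keras.layers",
}


def fix_keras_imports(source):
    """Rewrite 'from <old module> import ...' lines in Python source text."""
    for old, new in REPLACEMENTS.items():
        pattern = r"^(\s*from\s+)" + re.escape(old) + r"(\s+import\b)"
        source = re.sub(pattern, r"\g<1>" + new + r"\g<2>", source,
                        flags=re.MULTILINE)
    return source


example = "from tensorflow.python.keras.models import Sequential\n"
print(fix_keras_imports(example))
```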

In [None]:
! git clone -b working https://github.com/livingdan/DeepBugs_replication

In [None]:
# move contents of DeepBugs to main directory
! mv DeepBugs_replication/* .

In [None]:
# Install the npm dependencies: acorn, estraverse, walk-sync
! npm install acorn
! npm install estraverse
! npm install walk-sync

# 1. Download Training and Testing datasets
### Two options for using training/testing data
* The full corpus can be downloaded [here](http://www.srl.inf.ethz.ch/js150.php) and is expected to be stored in `data/js/programs_all`. It consists of 100,000 training files, listed in `data/js/programs_training.txt`, and 50,000 files for validation, listed in `data/js/programs_eval.txt`.
* This repository contains only a very small subset of the corpus. It is stored in `data/js/programs_50`. Training and validation files for the small corpus are listed in `data/js/programs_50_training.txt` and `data/js/programs_50_eval.txt`.

In [None]:
! gdown http://files.srl.inf.ethz.ch/data/js_dataset.tar.gz

In [None]:
! tar -xzf js_dataset.tar.gz

In [None]:
! mkdir data/js/programs_all
! tar -xzf data.tar.gz -C data/js/programs_all


In [None]:
! mv data/js/programs_all/data/* data/js/programs_all

# 2. Learning a Bug Detector
Creating a bug detector consists of two main steps:

1. Extract positive (i.e., likely correct) and negative (i.e., likely buggy) training examples from code.

2. Train a classifier to distinguish correct from incorrect code examples.

This replication focuses on the swapped-argument bug detector:
* The `SwappedArgs` bug detector looks for accidentally swapped arguments of a function call, e.g., calling `setPoint(y,x)` instead of `setPoint(x,y)`.
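
To make the positive/negative distinction concrete, the sketch below derives a likely-buggy example from a likely-correct call by swapping its two arguments. The record layout (`callee`, `arguments`) and the function `make_swapped_example` are illustrative assumptions, not the schema of the `calls_*.json` files.

```python
def make_swapped_example(call):
    """Return a copy of a two-argument call record with its arguments swapped."""
    args = call["arguments"]
    if len(args) != 2:
        raise ValueError("expected exactly two arguments")
    swapped = dict(call)  # shallow copy so the original record is untouched
    swapped["arguments"] = [args[1], args[0]]
    return swapped


# Hypothetical record for setPoint(x, y), treated as likely correct.
positive = {"callee": "setPoint", "arguments": ["x", "y"]}
# Swapped counterpart setPoint(y, x), treated as likely buggy.
negative = make_swapped_example(positive)
print(negative)
```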

## 2.1 Extract positive and negative training examples

`node javascript/extractFromJS.js calls --parallel 4 data/js/programs_50_training.txt data/js/programs_50`

  * The `--parallel` argument sets the number of processes to run.
  * `programs_50_training.txt` contains files to include (one file per line). To extract data for validation, run the command with `data/js/programs_50_eval.txt`.
  * The last argument is a directory that gets recursively scanned for .js files, considering only files listed in the file provided as the second argument.
  * The command produces `calls_*.json` files, which contain data suitable for the `SwappedArgs` bug detector. For the other two bug detectors, replace `calls` with `binOps` in the above command.
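
The file-selection behavior described above (scan a directory recursively, keep only listed .js files) can be sketched in Python as follows. This is not the logic of `extractFromJS.js` itself; `select_js_files` is a hypothetical helper, and it assumes the list file contains paths written exactly as they appear when walking from the given directory.

```python
import os


def select_js_files(list_file, root_dir):
    """Return .js files under root_dir whose paths are listed in list_file."""
    with open(list_file) as f:
        wanted = {line.strip() for line in f if line.strip()}
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Keep a file only if it is a .js file and appears in the list.
            if name.endswith(".js") and path in wanted:
                selected.append(path)
    return sorted(selected)
```

A call such as `select_js_files("data/js/programs_50_training.txt", "data/js/programs_50")` would then preview which files an extraction run is expected to consider, provided the path assumption holds.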

### Full Corpus data set
On Google Colab, the full training set is too large to process on the free tier.

In [None]:
# Using full dataset training data
! node javascript/extractFromJS.js calls --parallel 4 data/js/programs_training.txt data/js/programs_all
! mkdir -p training/
! mv calls_* training/

In [None]:
# Using full dataset evaluation data
! node javascript/extractFromJS.js calls --parallel 4 data/js/programs_eval.txt data/js/programs_all
! mkdir -p eval/
! mv calls_* eval/

### Subset of dataset
A subset of 50 files, split between training and evaluation.

In [None]:
# Extract training data
! node javascript/extractFromJS.js calls --parallel 4 data/js/programs_50_training.txt data/js/programs_50
! mkdir -p training/
! mv calls_* training/

In [None]:
# Extract evaluation data
! node javascript/extractFromJS.js calls --parallel 4 data/js/programs_50_eval.txt data/js/programs_50
! mkdir -p eval/
! mv calls_* eval/

## 2.2 Train a classifier to identify bugs
1. Train and validate the classifier

`python3 python/BugLearnAndValidate.py --pattern SwappedArgs --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --training_data calls_xx*.json --validation_data calls_yy*.json`

  * The first argument selects the bug pattern.
  * The next three arguments are vector representations for tokens (here: identifiers and literals), for types, and for AST node types. These files are provided in the repository.
  * The remaining arguments are two lists of .json files. They contain the training and validation data extracted in Step 1.
  * After learning the bug detector, the command measures accuracy and recall w.r.t. seeded bugs and writes a list of potential bugs in the unmodified validation code (see `poss_anomalies.txt`).

2. Train a classifier for later use

`python3 python/BugLearn.py --pattern SwappedArgs --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --training_data calls_xx*.json`

  * Optionally, pass `--out some/dir` to set the output directory for the trained model.
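
For reference, accuracy and recall with respect to seeded bugs can be computed as in the sketch below. It illustrates the two metrics only and is not the code in `BugLearnAndValidate.py`; the function name, the label convention (1 = seeded bug, 0 = original code) and the default threshold of 0.5 are assumptions.

```python
def accuracy_and_recall(scores, labels, threshold=0.5):
    """Compute accuracy and recall for predicted bug probabilities.

    scores: predicted probability that each example is buggy.
    labels: 1 for a seeded bug, 0 for original code (assumed convention).
    """
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    true_pos = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    actual_pos = sum(labels)
    accuracy = correct / len(labels) if labels else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Two of four predictions are right; one of two seeded bugs is found.
print(accuracy_and_recall([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1]))
```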

In [None]:
# Train and validate the classifier
! python3 python/BugLearnAndValidate.py --pattern SwappedArgs --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --training_data training/calls_*.json --validation_data eval/calls_*.json

In [None]:
# Train a classifier for later use
! python3 python/BugLearn.py --pattern SwappedArgs --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --training_data training/calls_*.json --out bug_detection_model/

# 3. Finding Bugs
Finding bugs in one or more source files consists of these two steps:
1. Extract code pieces
2. Use a trained classifier to identify bugs

In [None]:
# Extract code pieces from the files in a directory
#! node javascript/extractFromJS.js calls --files <list of files>
! node javascript/extractFromJS.js calls --files data/js/programs_50/*.js
! mkdir -p find_bugs/
! mv calls_* find_bugs/

Use a trained classifier to identify bugs

`python3 python/BugFind.py --pattern SwappedArgs --threshold 0.95 --model some/dir --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --testing_data calls_xx*.json`

  * The first argument selects the bug pattern.
  * 0.95 is the threshold for reporting bugs; higher means fewer warnings of higher certainty.
  * `--model` sets the directory to load a trained model from.
  * The next three arguments are vector representations for tokens (here: identifiers and literals), for types, and for AST node types. These files are provided in the repository.
  * The remaining argument is a list of .json files. They contain the data extracted in Step 1.
  * The command examines every code piece and writes a list of potential bugs, each with its probability of being incorrect.
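
The effect of the threshold can be sketched as a simple filter over predicted probabilities. This is an illustration rather than the code in `BugFind.py`; `report_warnings`, the `(location, probability)` pair layout, and the inclusive comparison are assumptions.

```python
def report_warnings(predictions, threshold=0.95):
    """Keep predictions at or above the threshold, most suspicious first.

    predictions: list of (location, probability_of_bug) pairs (assumed layout).
    """
    flagged = [(loc, p) for loc, p in predictions if p >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical predictions for three call sites.
sample = [("a.js:3", 0.99), ("b.js:7", 0.50), ("c.js:1", 0.96)]
print(report_warnings(sample))
```

Raising the threshold shrinks the list: with `threshold=0.97`, only the first call site in this sample would be reported.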

In [None]:
# Use a trained classifier to identify bugs
! python3 python/BugFind.py --pattern SwappedArgs --threshold 0.95 --model bug_detection_model/ --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --testing_data find_bugs/calls_*.json

# 4. Embeddings for Identifiers

The bug detector above relies on a vector representation for identifier names and literals. The easiest way to use the framework is with the shipped `token_to_vector.json` file. Alternatively, you can learn the embeddings via Word2Vec as follows:

1. Extract identifiers and tokens:

`node javascript/extractFromJS.js tokens --parallel 4 data/js/programs_50_training.txt data/js/programs_50`

  * The command produces `tokens_*.json` files.
  
2. Encode identifiers and literals with context into arrays of numbers (for faster reading during learning):
  
  `python3 python/TokensToTopTokens.py tokens_*.json`
  
  * The arguments are the files just created.
  * The command produces `encoded_tokens_*.json` files and a file `token_to_number_*.json` that assigns a number to each identifier and literal.

3. Learn embeddings for identifiers and literals:
  
  `python3 python/EmbeddingLearnerWord2Vec.py token_to_number_*.json encoded_tokens_*.json`

  * The arguments are the files just created.
  * The command produces a file `token_to_vector_*.json`.
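
Step 2, assigning a number to each token and encoding token sequences as arrays of numbers, can be sketched as below. This is a simplified illustration and not the implementation in `TokensToTopTokens.py`; the function names, the `max_tokens` cutoff, and the `-1` marker for unknown tokens are assumptions.

```python
from collections import Counter


def build_token_to_number(token_sequences, max_tokens=10000):
    """Assign numbers to the most frequent tokens, most frequent first."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    most_common = [tok for tok, _ in counts.most_common(max_tokens)]
    return {tok: idx for idx, tok in enumerate(most_common)}


def encode(token_sequences, token_to_number, unknown=-1):
    """Replace each token by its number; unseen tokens map to `unknown`."""
    return [[token_to_number.get(tok, unknown) for tok in seq]
            for seq in token_sequences]


# Hypothetical token sequences extracted from two files.
sequences = [["x", "y", "x"], ["x", "1"]]
mapping = build_token_to_number(sequences, max_tokens=1)
print(mapping)
print(encode(sequences, mapping))
```

Encoding tokens as integers keeps the intermediate files compact, which matches the stated goal of faster reading during learning.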

In [None]:
# Extract identifiers and tokens
! node javascript/extractFromJS.js tokens --parallel 4 data/js/programs_50_training.txt data/js/programs_50

In [None]:
# Encode identifiers and literals with context into arrays of numbers
! python3 python/TokensToTopTokens.py tokens_*.json

In [None]:
# Learn embeddings for identifiers and literals
! python3 python/EmbeddingLearnerWord2Vec.py token_to_number_*.json encoded_tokens_*.json