# Overview of the DeepBugs replication and comparison with a CNN architecture for the Incorrect Binary Operator bug pattern

## Using a Google Colab Pro high-RAM machine with a T4 GPU

Following the README on the GitHub page:
* All commands are called from the main directory.
* Python code (most of the implementation) and JavaScript code (for extracting data from .js files) are in the `/python` and `/javascript` directories.
* All data to learn from, e.g., .js files are expected to be in the `/data` directory.
* All data that is generated, e.g., intermediate representations, are written into the main directory. It is recommended to move them into separate directories.
* All generated data files have a timestamp as part of the file name. Below, all files are used with `*`. When running commands multiple times, make sure to use the most recent files.
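
Picking the most recent timestamped file by hand is easy to get wrong after several runs. The helper below is a minimal sketch of how that choice could be automated; the function name `latest_timestamped` is hypothetical, and it assumes the timestamp is the last run of digits in the file name.

```python
import glob
import os
import re


def latest_timestamped(pattern):
    """Return the file matching `pattern` with the largest timestamp, or None."""

    def timestamp(path):
        # Assumption: the timestamp is the last run of digits in the file name,
        # e.g. tokens_1713539473688.json -> 1713539473688.
        digits = re.findall(r"\d+", os.path.basename(path))
        return int(digits[-1]) if digits else -1

    matches = glob.glob(pattern)
    return max(matches, key=timestamp) if matches else None
```

For example, `latest_timestamped("token_to_number_*.json")` would return the newest encoding file in the main directory, or `None` if no file matches.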

# Clone the DeepBugs repository from GitHub
I forked the project to my own GitHub account in order to make corrections to the code.
Certain Keras imports are deprecated and cause errors.

* Deprecated imports

`from tensorflow.python.keras.models import Sequential`

`from tensorflow.python.keras.layers.core import Dense, Dropout`


* Corrected imports

`from keras.models import Sequential`

`from keras.layers import Dense, Dropout`
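
In the fork these imports were corrected by hand. As a rough sketch, the same rewrite could be applied to a source string with two regular expressions; `modernize_keras_imports` is a hypothetical helper, not part of DeepBugs, and it only covers the two import paths listed above.

```python
import re

# Deprecated private module paths and their public Keras replacements.
# The more specific pattern comes first so it is not shadowed by the general one.
_REPLACEMENTS = [
    (r"\btensorflow\.python\.keras\.layers\.core\b", "keras.layers"),
    (r"\btensorflow\.python\.keras\b", "keras"),
]


def modernize_keras_imports(source):
    """Return `source` with the deprecated Keras module paths rewritten."""
    for pattern, replacement in _REPLACEMENTS:
        source = re.sub(pattern, replacement, source)
    return source
```

Applying the function to text that already uses the public paths leaves it unchanged, so it is safe to run more than once.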

In [None]:
! git clone -b working https://github.com/livingdan/DeepBugs_replication

Cloning into 'DeepBugs_replication'...
remote: Enumerating objects: 507, done.
remote: Counting objects: 100% (135/135), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 507 (delta 90), reused 120 (delta 84), pack-reused 372
Receiving objects: 100% (507/507), 313.28 MiB | 50.30 MiB/s, done.
Resolving deltas: 100% (285/285), done.
Updating files: 100% (129/129), done.


In [None]:
# move contents of DeepBugs to main directory
! mv DeepBugs_replication/* .

In [None]:
# Install the npm dependencies: acorn, estraverse, walk-sync
# npm is a package manager for the JavaScript programming language
! npm install acorn
! npm install estraverse
! npm install walk-sync

npm WARN saveError ENOENT: no such file or directory, open '/content/package.json'
npm notice created a lockfile as package-lock.json. You should commit this file.

# 1. Download Training and Testing datasets
* The full corpus can be downloaded [here](http://www.srl.inf.ethz.ch/js150.php) and is expected to be stored in `data/js/programs_all`. It consists of 100,000 training files, listed in `data/js/programs_training.txt`, and 50,000 files for validation, listed in `data/js/programs_eval.txt`.
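
After unpacking, it can be worth checking that the files named in the two list files are actually present. The sketch below counts the entries of a list file and reports the ones that are missing. `check_file_list` is a hypothetical helper, and whether the listed paths are relative to the main directory or to another root is an assumption, which is why `root` is a parameter.

```python
import os


def check_file_list(list_path, root="."):
    """Return (number of listed files, listed paths missing under `root`)."""
    with open(list_path, "r", encoding="utf-8") as f:
        # One path per line; blank lines are ignored.
        listed = [line.strip() for line in f if line.strip()]
    missing = [p for p in listed if not os.path.isfile(os.path.join(root, p))]
    return len(listed), missing
```

Calling `check_file_list("data/js/programs_training.txt")` once the corpus is in place should report an empty `missing` list if the assumed root is right.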


In [None]:
! gdown http://files.srl.inf.ethz.ch/data/js_dataset.tar.gz

Downloading...
From: http://files.srl.inf.ethz.ch/data/js_dataset.tar.gz
To: /content/js_dataset.tar.gz
100% 2.63G/2.63G [02:15<00:00, 19.5MB/s]


In [None]:
# Extract the downloaded archive
! tar -xzf js_dataset.tar.gz

In [None]:
# Create the target directory and unpack the data there
! mkdir data/js/programs_all
! tar -xzf data.tar.gz -C data/js/programs_all

In [None]:
# Move the unpacked files up into data/js/programs_all
! mv data/js/programs_all/data/* data/js/programs_all

# 2. Embeddings for Identifiers using Word2Vec
-- -
## Optional: instead of learning your own embeddings, you can use my embeddings in `token_to_vector_Replication.json`

The bug detectors rely on a vector representation for identifier names and literals. To use the framework, the easiest option is the shipped `token_to_vector.json` file. Alternatively, you can learn the embeddings via Word2Vec as follows:

1. Extract identifiers and tokens:

`node javascript/extractFromJS.js tokens --parallel 4 data/js/programs_50_training.txt data/js/programs_50`

  * The command produces `tokens_*.json` files.
  
2. Encode identifiers and literals with context into arrays of numbers (for faster reading during learning):
  
  `python3 python/TokensToTopTokens.py tokens_*.json`
  
  * The arguments are the files created in the previous step.
  * The command produces `encoded_tokens_*.json` files and a file `token_to_number_*.json` that assigns a number to each identifier and literal.

3. Learn embeddings for identifiers and literals:
  
  `python3 python/EmbeddingLearnerWord2Vec.py token_to_number_*.json encoded_tokens_*.json`

  * The arguments are the files created in the previous step.
  * The command produces a file `token_to_vector_*.json`.
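
To make step 2 more concrete, the sketch below shows one common way to encode tokens as numbers: keep the most frequent tokens and map everything else to a shared "unknown" id. It illustrates the idea only and is not the code of `TokensToTopTokens.py`; the function names, the `UNK` placeholder, and the vocabulary size are assumptions.

```python
from collections import Counter


def build_token_to_number(token_sequences, max_tokens=10000, unknown="UNK"):
    """Assign ids to the most frequent tokens; id 0 is reserved for unknowns."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    token_to_number = {unknown: 0}
    for token, _ in counts.most_common(max_tokens - 1):
        if token not in token_to_number:
            token_to_number[token] = len(token_to_number)
    return token_to_number


def encode(token_sequences, token_to_number, unknown="UNK"):
    """Replace every token by its id, falling back to the unknown id."""
    unk = token_to_number[unknown]
    return [[token_to_number.get(tok, unk) for tok in seq] for seq in token_sequences]
```

Encoding once up front means the embedding learner reads compact integer arrays instead of repeatedly hashing strings, which is the "faster reading during learning" mentioned in step 2.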

In [None]:
# Extract identifiers and tokens from the whole dataset
!node javascript/extractFromJS.js tokens --parallel 4 data/js/programs_all.txt data/js/programs_all

In [None]:
# move tokens to own directory
!mkdir tokens/
!mv tokens_*.json tokens/

In [None]:
# Encode identifiers and literals with context into arrays of numbers for training
! python3 python/TokensToTopTokens.py tokens/tokens_*.json

In [None]:
# move encoded tokens to own directory
!mkdir encoded_tokens/
!mv encoded_tokens_*.json encoded_tokens/
!mv token_to_number_*.json encoded_tokens/

In [None]:
# Learn embeddings for identifiers and literals
! python3 python/EmbeddingLearnerWord2Vec.py encoded_tokens/token_to_number_*.json encoded_tokens/encoded_tokens_*.json

In [None]:
# move token_to_vector file to own directory
!mkdir token_to_vector/
!mv token_to_vector_1*.json token_to_vector/


# 3. Learning a Bug Detector
Creating a bug detector consists of two main steps:

1. Extract positive (i.e., likely correct) and negative (i.e., likely buggy) training examples from code.

2. Train a classifier to distinguish correct from incorrect code examples.

This replication addresses the incorrect binary operator bug detector (`BinOperator`). The README illustrates the extraction step with the swapped-argument detector:
* The `SwappedArgs` bug detector looks for accidentally swapped arguments of a function call, e.g., calling `setPoint(y,x)` instead of `setPoint(x,y)`.
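
Negative examples are derived from existing code by applying a small mutation. The sketch below shows the idea for swapped arguments and for binary operators. The dictionary keys (`arguments`, `op`), the operator list, and the label convention (0 for original code, 1 for seeded bug) are assumptions for illustration, not the DeepBugs data format.

```python
import random

# A subset of JavaScript binary operators, chosen here for illustration.
BINARY_OPERATORS = ["+", "-", "*", "/", "%", "==", "!=", "===", "!==", "<", "<=", ">", ">="]


def swap_arguments(call):
    """Negative example for a two-argument call: setPoint(x, y) -> setPoint(y, x)."""
    buggy = dict(call)
    buggy["arguments"] = list(reversed(call["arguments"]))
    return buggy


def replace_operator(bin_op, rng):
    """Negative example for a binary operation: swap in a different operator."""
    alternatives = [op for op in BINARY_OPERATORS if op != bin_op["op"]]
    buggy = dict(bin_op)
    buggy["op"] = rng.choice(alternatives)
    return buggy


def make_labeled_examples(bin_ops, seed=0):
    """Pair every original (label 0) with one seeded bug (label 1)."""
    rng = random.Random(seed)
    labeled = []
    for bin_op in bin_ops:
        labeled.append((bin_op, 0))
        labeled.append((replace_operator(bin_op, rng), 1))
    return labeled
```

Because every original contributes exactly one mutated copy, the resulting training set is balanced between the two classes.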

## 3.1 Extract positive and negative training examples

## Optional: instead of running the extraction on the entire dataset, you can unzip my previously extracted data

`node javascript/extractFromJS.js calls --parallel 4 data/js/programs_50_training.txt data/js/programs_50`

  * The `--parallel` argument sets the number of processes to run.
  * `programs_50_training.txt` contains files to include (one file per line). To extract data for validation, run the command with `data/js/programs_50_eval.txt`.
  * The last argument is a directory that gets recursively scanned for .js files, considering only files listed in the file provided as the second argument.
  * The command produces `calls_*.json` files, which contain data suitable for the `SwappedArgs` bug detector. For the other two bug detectors, replace `calls` with `binOps` in the above command.
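
Since the extraction writes several timestamped files, a quick way to inspect the result is to load all of them and count the examples. The sketch below assumes that each file holds a single JSON array of examples; `load_examples` is a hypothetical helper.

```python
import glob
import json


def load_examples(pattern):
    """Concatenate the JSON arrays of all files matching `pattern`."""
    examples = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "r", encoding="utf-8") as f:
            # Assumption: each file contains one JSON array of examples.
            examples.extend(json.load(f))
    return examples
```

For instance, `len(load_examples("training/binOps_*.json"))` would give the number of extracted binary operations once the files are in a `training/` directory.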

### Option 1: Unzip previously extracted binOps to speed up training

In [None]:
# unzip previously extracted binops to speed up process for training/eval
!unzip binop/training.zip
!unzip binop/eval.zip

Archive:  binop/training.zip
   creating: training/
  inflating: training/binOps_1713539473688.json  
  inflating: training/binOps_1713538745093.json  
  inflating: training/binOps_1713538592712.json  
  inflating: training/binOps_1713539539287.json  
  inflating: training/binOps_1713539564607.json  
  inflating: training/binOps_1713539559688.json  
  inflating: training/binOps_1713539511294.json  
  inflating: training/binOps_1713538414561.json  
  inflating: training/binOps_1713538557802.json  
  inflating: training/binOps_1713538395728.json  
  inflating: training/binOps_1713538836266.json  
  inflating: training/binOps_1713538670156.json  
  inflating: training/binOps_1713538457847.json  
  inflating: training/binOps_1713538699910.json  
  inflating: training/binOps_1713538553088.json  
  inflating: training/binOps_1713538736852.json  
  inflating: training/binOps_1713539110950.json  
  inflating: training/binOps_1713539617852.json  
  inflating: training/binOps_1713538655691.json 

### Option 2: Full Corpus data set
On Google Colab, the full training set is too large to use on the free tier.

In [None]:
# Cut the training dataset by 40%, as loading the full dataset for the model causes an out-of-memory error.
# Note: this overwrites programs_training.txt, so re-running the cell shrinks the list again.
import random

with open("data/js/programs_training.txt", "r") as f:
    lines = f.readlines()

print(len(lines))
# Shuffle the list of lines so that a random subset is removed
random.shuffle(lines)

# Calculate the number of lines to remove (40%)
num_lines_to_remove = int(len(lines) * 0.4)

# Drop the last num_lines_to_remove lines of the shuffled list
lines = lines[:len(lines) - num_lines_to_remove]

print(len(lines))
# Write the remaining lines back to the file
with open("data/js/programs_training.txt", "w") as f:
    f.writelines(lines)
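
The reduction can also be written as a pure function with a fixed seed, so that the same subset is selected on every run and the original list file can be left untouched. This is a sketch of that variant; `subsample_lines`, the `drop_fraction` parameter, and the default seed are hypothetical choices.

```python
import random


def subsample_lines(lines, drop_fraction, seed=42):
    """Return a random subset of `lines` with `drop_fraction` of them removed."""
    rng = random.Random(seed)  # local generator: the result is reproducible
    num_to_remove = int(len(lines) * drop_fraction)
    return rng.sample(lines, len(lines) - num_to_remove)
```

The returned list could then be written to a separate file and passed to `extractFromJS.js` in place of the full list.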

In [None]:
# Extract binary operations from the reduced training set
! node javascript/extractFromJS.js binOps --parallel 4 data/js/programs_training.txt data/js/programs_all
! mkdir training/
! mv binOps_* training/

Streaming output truncated to the last 5000 lines.
Reading data/js/programs_all/Automattic/wpcom.js/examples/server/index.js

Added binary operations. Total now: 5749

Considered binary operations: 2 out of 2 (100%)

Reading data/js/programs_all/Automattic/wpcom.js/lib/category.js

Reading data/js/programs_all/Alfresco/Aikau/aikau/src/test/resources/alfresco/renderers/InlineEditPropertyLinkTest.js

Added binary operations. Total now: 1755

Considered binary operations: 0 out of 0 (NaN%)

Added binary operations. Total now: 5754

Considered binary operations: 5 out of 13 (38%)

Reading data/js/programs_all/Alfresco/Aikau/aikau/src/test/resources/alfresco/renderers/PropertyLinkTest.js

Added binary operations. Total now: 1758
Considered binary operations: 3 out of 3 (100%)

Reading data/js/programs_all/Alfresco/Aikau/aikau/src/test/resources/alfresco/renderers/PropertyTest.js

Added binary operations. Total now: 1770
Considered binary operations: 12 out of 12 (100%)

Readin

In [None]:
# Cut the eval dataset by 15%, as loading the full dataset for the model causes an out-of-memory error.
# Note: this overwrites programs_eval.txt, so re-running the cell shrinks the list again.
import random

with open("data/js/programs_eval.txt", "r") as f:
    lines = f.readlines()

print(len(lines))
# Shuffle the list of lines so that a random subset is removed
random.shuffle(lines)

# Calculate the number of lines to remove (15%)
num_lines_to_remove = int(len(lines) * 0.15)

# Drop the last num_lines_to_remove lines of the shuffled list
lines = lines[:len(lines) - num_lines_to_remove]

print(len(lines))
# Write the remaining lines back to the file
with open("data/js/programs_eval.txt", "w") as f:
    f.writelines(lines)

In [None]:
# Extract binary operations from the reduced evaluation set
! node javascript/extractFromJS.js binOps --parallel 4 data/js/programs_eval.txt data/js/programs_all
! mkdir eval/
! mv binOps_* eval/

Streaming output truncated to the last 5000 lines.

Considered binary operations: 27 out of 45 (60%)

Reading data/js/programs_all/BufferSpace/NaughtySquirrel/Classes/Information/Environments/Footprints.js

Added binary operations. Total now: 1796

Considered binary operations: 12 out of 21 (57%)

Reading data/js/programs_all/BufferSpace/NaughtySquirrel/Classes/Information/Environments/LightCircle.js

Added binary operations. Total now: 1806

Considered binary operations: 10 out of 17 (59%)

Reading data/js/programs_all/BufferSpace/NaughtySquirrel/Classes/Panels/ScoreItem.js

Added binary operations. Total now: 1812

Considered binary operations: 6 out of 8 (75%)

Reading data/js/programs_all/BufferSpace/NaughtySquirrel/Classes/Scenes/About.js

Added binary operations. Total now: 1819

Considered binary operations: 7 out of 7 (100%)

Reading data/js/programs_all/BukGet/api/server.js

Added binary operations. Total now: 1889

Considered binary operations: 70 out of 75 (93%

## 3.2 Train a classifier to identify bugs
1. Train and validate the classifier

`python3 python/BugLearnAndValidate.py --pattern BinOperator --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --training_data binOps_xx*.json --validation_data binOps_yy*.json`

  * The first argument selects the bug pattern.
  * The next three arguments are vector representations for tokens (here: identifiers and literals), for types, and for AST node types. These files are provided in the repository.
  * The remaining arguments are two lists of .json files. They contain the training and validation data extracted in Step 1.
  * After learning the bug detector, the command measures accuracy and recall w.r.t. seeded bugs and writes a list of potential bugs in the unmodified validation code (see `poss_anomalies.txt`).
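
As a reminder of what accuracy and recall w.r.t. seeded bugs mean, the sketch below computes them from predicted probabilities and 0/1 labels. It is a generic illustration, not the evaluation code of `BugLearnAndValidate.py`; the function name, the 0.5 default threshold, and the label convention (1 for seeded bug) are assumptions.

```python
def evaluate(probabilities, labels, threshold=0.5):
    """Return accuracy, precision and recall for the 'buggy' class (label 1)."""
    tp = fp = tn = fn = 0
    for p, y in zip(probabilities, labels):
        predicted_buggy = p >= threshold
        if predicted_buggy and y == 1:
            tp += 1
        elif predicted_buggy and y == 0:
            fp += 1
        elif not predicted_buggy and y == 0:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

Recall here is the fraction of seeded bugs that the classifier flags, while accuracy also counts correctly accepted original code.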

### Using my previously learned embeddings (`token_to_vector_Replication.json`) to speed up the notebook

In [None]:
# Train and validate the classifier
! python3 python/BugLearnAndValidate.py --pattern BinOperator --token_emb token_to_vector_Replication.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --training_data training/binOps_*.json --validation_data eval/binOps_*.json

2024-04-20 18:48:03.286485: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-20 18:48:03.286534: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-20 18:48:03.287904: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-20 18:48:03.295116: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
BugDetection started with ['python/BugLearnAndValidat

In [None]:
# Train and validate the CNN classifier
! python3 python/BugLearnAndValidateCNN.py --pattern BinOperator --token_emb token_to_vector_Replication.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --training_data training/binOps_*.json --validation_data eval/binOps_*.json

2024-04-20 19:11:46.193723: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-20 19:11:46.193784: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-20 19:11:46.195137: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-20 19:11:46.202283: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
BugDetection started with ['python/BugLearnAndValidat

# 4. Finding Bugs
Finding bugs in one or more source files consists of these two steps:
1. Extract code pieces
2. Use a trained classifier to identify bugs

In [None]:
# Extract code pieces from the files in a directory
#! node javascript/extractFromJS.js calls --files <list of files>
! node javascript/extractFromJS.js binOps --files data/js/programs_50/*.js
! mkdir find_bugs/
! mv binOps_* find_bugs/

Use a trained classifier to identify bugs

`python3 python/BugFind.py --pattern BinOperator --threshold 0.95 --model some/dir --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --testing_data binOps_xx*.json`

  * The first argument selects the bug pattern.
  * 0.95 is the threshold for reporting bugs; higher means fewer warnings of higher certainty.
  * --model sets the directory to load a trained model from.
  * The next three arguments are vector representations for tokens (here: identifiers and literals), for types, and for AST node types. These files are provided in the repository.
  * The remaining argument is a list of .json files. They contain the data extracted in Step 1.
  * The command examines every code piece and writes a list of potential bugs, each with its probability of being incorrect.
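
The effect of the threshold can be sketched in a few lines: only code pieces whose predicted probability of being incorrect reaches the threshold are reported, most suspicious first. `report_potential_bugs` and the `(location, probability)` tuple format are hypothetical, and whether `BugFind.py` compares strictly or inclusively is an assumption.

```python
def report_potential_bugs(scored_pieces, threshold=0.95):
    """Keep pieces whose probability of being incorrect reaches the threshold.

    `scored_pieces` is a list of (location, probability) tuples; the result is
    sorted from most to least suspicious.
    """
    flagged = [(loc, p) for loc, p in scored_pieces if p >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

Raising the threshold shortens the list to the predictions the model is most certain about, which matches the "fewer warnings of higher certainty" behaviour described above.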

In [None]:
# Use the trained classifier on the code pieces extracted into find_bugs/
! python3 python/BugFind.py --pattern BinOperator --threshold 0.95 --model bug_detection_model_*/ --token_emb token_to_vector.json --type_emb type_to_vector.json --node_emb node_type_to_vector.json --testing_data find_bugs/binOps_*.json