In [4]:
!gdown https://drive.google.com/uc?id=1w7KZ5ugVfcz4RGsY2feSalH9xpa_JFLh


Downloading...
From: https://drive.google.com/uc?id=1w7KZ5ugVfcz4RGsY2feSalH9xpa_JFLh
To: C:\Users\RASMITHA\DeepBugs-master\DeepBugs_data.tar.gz



In [5]:
!tar -xzf DeepBugs_data.tar.gz

The data are function calls extracted from open-source JavaScript code. Let's read the JSON data into our Python-based learning code:

In [6]:
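import os
import json
import numpy as np
from keras.models import Sequential
# Import from `keras.layers`; the `keras.layers.core` module path is version-specific.
from keras.layers import Dense, Dropout
# Absolute path to the extracted call data (machine-specific; the cells below use the relative path instead).
path = os.path.abspath("C:/Users/RASMITHA/DeepBugs-master/DeepBugs_data/calls")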

In [7]:
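# Each file in DeepBugs_data/calls is expected to hold a JSON list of call records.
calls_dir = os.path.join("DeepBugs_data", "calls")
calls = []
for file_name in os.listdir(calls_dir):
  with open(os.path.join(calls_dir, file_name), encoding='utf-8') as fp:
    calls.extend(json.load(fp))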

print(f"Have read {len(calls)} function calls")

Have read 28005 function calls


We'll also use a pre-trained embedding of code tokens. It's a Word2Vec model trained on tokenized JavaScript code.

In [8]:
with open("DeepBugs_data/token_to_vector.json") as fp:
  type_to_vector = json.load(fp)

print(f"Have loaded {len(type_to_vector)} token embeddings.")

Have loaded 9930 token embeddings.
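
The embeddings are what turn each call into a model input: the vectors of the callee and its two arguments are concatenated, and swapping the two argument vectors yields an artificial buggy variant of the same call. The following is a minimal sketch of that idea on toy data. The two-dimensional vectors, the `toy_embeddings` and `toy_call` values, and the `make_pair` helper are made up for illustration; only the `callee` and `arguments` field names come from the notebook.

```python
# Toy stand-in for the loaded embeddings (made-up 2-dimensional vectors).
toy_embeddings = {
    "ID:setTimeout": [0.1, 0.2],
    "ID:fn": [0.3, 0.4],
    "ID:delay": [0.5, 0.6],
}

def make_pair(call, embeddings):
    """Return (x_correct, x_buggy) for a two-argument call, or None if it cannot be encoded."""
    tokens = [call["callee"], *call["arguments"][:2]]
    if len(tokens) < 3 or any(token not in embeddings for token in tokens):
        return None
    callee_vec, arg1_vec, arg2_vec = (embeddings[token] for token in tokens)
    # `+` on Python lists concatenates, so each input is 3x the embedding length.
    x_correct = callee_vec + arg1_vec + arg2_vec
    x_buggy = callee_vec + arg2_vec + arg1_vec  # same call with swapped arguments
    return x_correct, x_buggy

# Hypothetical record with the two fields the notebook reads.
toy_call = {"callee": "ID:setTimeout", "arguments": ["ID:fn", "ID:delay"]}
x_correct, x_buggy = make_pair(toy_call, toy_embeddings)
print(x_correct)
print(x_buggy)
```

Calls with a token that has no embedding are skipped rather than encoded with a placeholder, which is why fewer examples than calls reach the model.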


In [9]:
xs = []   # Inputs given to the model: Each element is
          #   the vector representation of a function call.
ys = []   # Outputs expected from the model: For each
          #   call, predict the probability that it's buggy.

for call in calls:
  if (call["callee"] in type_to_vector and
      call["arguments"][0] in type_to_vector and
      call["arguments"][1] in type_to_vector):
    callee_vec = type_to_vector[call["callee"]]
    arg1_vec = type_to_vector[call["arguments"][0]]
    arg2_vec = type_to_vector[call["arguments"][1]]

    # Positive, i.e., correct example: arguments in their original order.
    # `+` concatenates the Python lists; it does not add them element-wise.
    x_correct = callee_vec + arg1_vec + arg2_vec
    # Negative, i.e., buggy example: the same call with swapped arguments.
    x_buggy = callee_vec + arg2_vec + arg1_vec

    xs.append(x_correct)
    ys.append(0)  # Label 0: not buggy
    xs.append(x_buggy)
    ys.append(1)  # Label 1: buggy

# Split into training and validation data
nb_training = int(0.9*len(xs))
xs_training = np.array(xs[:nb_training])
ys_training = np.array(ys[:nb_training])
xs_validation = np.array(xs[nb_training:])
ys_validation = np.array(ys[nb_training:])

print(f"{len(xs_training)} training examples")
print(f"{len(xs_validation)} validation examples")

21592 training examples
2400 validation examples
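
The split above takes the first 90% of the examples in the order the files were read. One alternative, not used in this notebook, is to shuffle the calls first while keeping each correct/buggy pair on the same side of the split. The sketch below assumes the pairs sit at positions (0, 1), (2, 3), and so on, as the loop above produces them; the `split_pairs` helper, its parameters, and the toy data are hypothetical.

```python
import numpy as np

def split_pairs(xs, ys, training_fraction=0.9, seed=0):
    """Shuffle (correct, buggy) pairs and split them into training and validation sets.

    Assumes xs/ys hold pairs at positions (0, 1), (2, 3), ...; a trailing
    unpaired row, if any, is dropped.
    """
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    nb_pairs = len(xs) // 2
    order = np.random.default_rng(seed).permutation(nb_pairs)
    nb_training_pairs = int(training_fraction * nb_pairs)

    def rows(pair_ids):
        # Expand pair indices into the two row indices each pair occupies.
        return np.sort(np.concatenate([2 * pair_ids, 2 * pair_ids + 1]))

    train_rows = rows(order[:nb_training_pairs])
    valid_rows = rows(order[nb_training_pairs:])
    return xs[train_rows], ys[train_rows], xs[valid_rows], ys[valid_rows]

# Toy data: 10 pairs of 2-dimensional vectors (made-up values).
toy_xs = [[i, i] for i in range(20)]
toy_ys = [0, 1] * 10
x_tr, y_tr, x_va, y_va = split_pairs(toy_xs, toy_ys)
print(len(x_tr), len(x_va))
```

Keeping both variants of a call together means the validation set never contains the swapped version of a call the model has already seen in its original order.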


In [11]:
x_length = len(xs[0])  # Length of three concatenated token embeddings
model = Sequential()
model.add(Dropout(0.2, input_shape=(x_length,)))
# The input shape is inferred from the previous layer, so Dense needs no input_dim.
model.add(Dense(200, activation="relu", kernel_initializer='normal'))
model.add(Dropout(0.2))
# A single sigmoid unit outputs the probability that the call is buggy.
model.add(Dense(1, activation="sigmoid", kernel_initializer='normal'))

model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit(xs_training, ys_training, batch_size=100, epochs=5, verbose=1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1ff63245f40>

In [12]:
validation_stats = model.evaluate(xs_validation, ys_validation)
print(f"Validation accuracy: {validation_stats[1]}")

Validation accuracy: 0.7620833516120911


## Using the Learned Bug Detection Model

Once trained, the model can be queried with a given function call. In a full implementation, the model would reason about calls extracted from JavaScript code. Here, we simply give the callee and the arguments as strings:

In [17]:
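# Function call: setTimeout(delay, fn)
# The arguments are swapped on purpose; the correct order is setTimeout(fn, delay).
callee = "ID:setTimeout"  # Prefix "ID:" indicates that the token is an identifier.
arg1 = "ID:delay"
arg2 = "ID:fn"

# Encode the call the same way as the training examples: callee, then both arguments.
x = type_to_vector[callee] + type_to_vector[arg1] + type_to_vector[arg2]
x_query = np.array([x])  # Batch of one; avoids overwriting the training list `xs`

buggy_probabilities = model.predict(x_query)
print(f"Call is buggy with probability {buggy_probabilities[0][0]:.4f}")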

Call is buggy with probability 0.9495
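
The query above raises a `KeyError` if any of the three tokens has no embedding. A small wrapper can handle that case and apply a decision threshold to the predicted probability. This is a sketch only: `check_call`, its `threshold` parameter, and the toy stand-ins below are hypothetical, and `toy_predict` merely imitates the shape of what `model.predict` returns so the example runs without the trained model.

```python
import numpy as np

def check_call(predict_fn, embeddings, callee, arg1, arg2, threshold=0.5):
    """Return (probability, flagged) for one call, or None if a token has no embedding."""
    tokens = (callee, arg1, arg2)
    if any(token not in embeddings for token in tokens):
        return None
    x = embeddings[callee] + embeddings[arg1] + embeddings[arg2]  # list concatenation
    probability = float(predict_fn(np.array([x]))[0][0])
    return probability, probability >= threshold

# Stand-ins so the sketch runs on its own (made-up values, not the trained model).
toy_embeddings = {"ID:setTimeout": [0.0], "ID:fn": [1.0], "ID:delay": [2.0]}

def toy_predict(batch):
    # Pretend a call is buggy when the first argument's value exceeds the second's.
    return np.array([[1.0 if row[1] > row[2] else 0.0] for row in batch])

swapped = check_call(toy_predict, toy_embeddings, "ID:setTimeout", "ID:delay", "ID:fn")
correct = check_call(toy_predict, toy_embeddings, "ID:setTimeout", "ID:fn", "ID:delay")
unknown = check_call(toy_predict, toy_embeddings, "ID:setTimeout", "ID:fn", "ID:missing")
print(swapped, correct, unknown)
```

In the notebook itself, the same helper would be called as `check_call(model.predict, type_to_vector, "ID:setTimeout", "ID:delay", "ID:fn")`.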
