# CVE vs. Non-CVE Prediction - Deep Learning with Bi-directional GRUs 

In [1]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!ls "/content/drive/My Drive"

'Colab Notebooks'   dl_model.h5   Xtest_norm.pkl    ytest_labels.pkl
 dl_model2.h5	    file.txt	  Xtrain_norm.pkl   ytrain_labels.pkl


In [3]:
import os
import time
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
import math
import dill
from sklearn.model_selection import train_test_split
from sklearn import metrics

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, CuDNNGRU, Conv1D, CuDNNLSTM
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

Using TensorFlow backend.


# Data Retrieval

The following are the normalized (pre-processed) issue/PR descriptions and their corresponding labels, which I had pickled earlier.
- This is the full dataset we have, including both positives and negatives
- class label 0 - potentially non-security-related data, not flagged by the regexes
- class label 1 - potentially security-related data, flagged by the regexes
- class label 2 - security- and CVE-related data, which we mapped manually

In [4]:
with open('/content/drive/My Drive/Xtrain_norm.pkl', 'rb') as f:
    X_train = []
    # The file holds several pickled chunks; read until the stream is exhausted
    while True:
        try:
            X_train.extend(dill.load(f))
        except EOFError:  # catch only end-of-file so real errors still surface
            print('EOF reached')
            break

with open('/content/drive/My Drive/Xtest_norm.pkl', 'rb') as f:
    X_test = []
    while True:
        try:
            X_test.extend(dill.load(f))
        except EOFError:
            print('EOF reached')
            break
            
with open('/content/drive/My Drive/ytrain_labels.pkl', 'rb') as f:
    y_train = dill.load(f)
    
with open('/content/drive/My Drive/ytest_labels.pkl', 'rb') as f:
    y_test = dill.load(f)
    
len(X_train), len(X_test), len(y_train), len(y_test)

EOF reached
EOF reached


(481390, 120348, 481390, 120348)
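
The loading loop above relies on the file containing several objects pickled back to back. A minimal, self-contained sketch of that pattern follows; it uses the standard-library `pickle` in place of `dill` (an assumption made so the example has no extra dependency), and `load_all_chunks` and the in-memory buffer are hypothetical names used only for illustration.

```python
import io
import pickle


def load_all_chunks(fileobj):
    """Concatenate every pickled list stored back to back in fileobj."""
    items = []
    while True:
        try:
            items.extend(pickle.load(fileobj))
        except EOFError:
            # Raised once the stream has no further pickled objects
            break
    return items


# Hypothetical in-memory file holding two separately dumped chunks
buffer = io.BytesIO()
pickle.dump(["doc a", "doc b"], buffer)
pickle.dump(["doc c"], buffer)
buffer.seek(0)

docs = load_all_chunks(buffer)
print(docs)
```

Catching only `EOFError` keeps a corrupt file or a missing class from being mistaken for a normal end of file.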

- To reduce the class imbalance, the following code filters out the negative (non-security-related, label 0) data.
- We model only the security data, i.e. CVEs vs. non-CVEs.

In [5]:
train_positives = []
y_train_positives = []
for doc, label in zip(X_train, y_train):
    if label != 0:
        train_positives.append(doc)
        y_train_positives.append(label)

test_positives = []
y_test_positives = []
for doc, label in zip(X_test, y_test):
    if label != 0:
        test_positives.append(doc)
        y_test_positives.append(label)
        
len(train_positives), len(y_train_positives), len(test_positives), len(y_test_positives)

(67400, 67400, 16851, 16851)

# Data Preparation

In [6]:
X_train, X_val, y_train, y_val = train_test_split(train_positives, y_train_positives, test_size=0.1, random_state=42)
X_test, y_test = test_positives, y_test_positives
len(X_train), len(X_val), len(X_test), len(y_train), len(y_val), len(y_test)

(60660, 6740, 16851, 60660, 6740, 16851)

In [0]:
## some config values 
embed_size = 300 # size of each word vector
max_features = 300000 # how many unique words to use (i.e. number of rows in the embedding matrix)
maxlen = 1000 # max number of words of a doc to use

In [0]:
## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(X_train)
train_X = tokenizer.texts_to_sequences(X_train)
val_X = tokenizer.texts_to_sequences(X_val)
test_X = tokenizer.texts_to_sequences(X_test)

In [0]:
## Pad the sentences 
train_X = pad_sequences(train_X, maxlen=maxlen)
val_X = pad_sequences(val_X, maxlen=maxlen)
test_X = pad_sequences(test_X, maxlen=maxlen)
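
With its default arguments, `pad_sequences` pads short sequences with zeros on the left and, for sequences longer than `maxlen`, keeps the last `maxlen` tokens. The NumPy sketch below mimics that default behaviour on toy data; `pad_pre` is a hypothetical helper, not part of Keras.

```python
import numpy as np


def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last maxlen tokens of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty document stays all padding
        trimmed = seq[-maxlen:]
        out[row, -len(trimmed):] = trimmed
    return out


padded = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

For this notebook it means that documents longer than 1000 tokens lose their beginning, not their end.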

In [0]:
train_y = np.array([1 if item==2 else 0 for item in y_train])
val_y = np.array([1 if item==2 else 0 for item in y_val])
test_y = np.array([1 if item==2 else 0 for item in y_test])

In [11]:
print('Sample Data:')
print(X_train[:3])
print(train_X[:3, :])
print(train_y[:3])

Sample Data:
['fix regression introduced during previous scanner hardening module workdir is deleted when cleaning root module work dir when they are nested', 'utf passwords in url do not work summary curl s url enganacao data method help works usr bin env python2 import requests import json payload method help url enganacao requests post url headers content type text plain charset utf data json dumps payload also works but use usr bin env python3 instead and it does not work anymore requests messes things up expected result work as curl and requests in python2 do actual result requests in python3 messes with encoding reproduction steps python3 import requests import json payload method help url http user1 enganacao requests post url headers content type text plain charset utf data json dumps payload system information python', 'add module to dump gnome keyring network passwords this pr adds a new post module that will dump network passwords from gnome keyring on linux systems this lev

We build a `class_weight` dictionary that tells the model to give more weight to each CVE-related issue. The weights computed below show that the class imbalance is still severe; we definitely need more positive (CVE-related) data, and every record helps.

In [24]:
from sklearn.utils import class_weight

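# Keyword arguments work on old and new scikit-learn (recent releases reject positional ones)
class_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                  classes=np.unique(train_y),
                                                  y=train_y)
class_weights = dict(enumerate(class_weights))
# Manual adjustment on top of the balanced weights:
# halve the majority (non-CVE) weight and quadruple the minority (CVE) weight
class_weights[0] /= 2 #0.1  #0.05 | 1
class_weights[1] *= 4 #300   #650 | 100
class_weights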

{0: 0.25183081751606634, 1: 275.1020408163265}
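
For reference, the `'balanced'` heuristic assigns each class the weight `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula with NumPy on a toy label vector (the labels are made up for illustration, not taken from this dataset) and then applies the same manual scaling as the cell above.

```python
import numpy as np


def balanced_class_weights(labels):
    """n_samples / (n_classes * count) for integer labels 0..k-1."""
    labels = np.asarray(labels)
    counts = np.bincount(labels)
    return labels.size / (counts.size * counts)


# Toy data: nine negatives and one positive
toy_y = [0] * 9 + [1]
weights = dict(enumerate(balanced_class_weights(toy_y)))

# Same manual adjustment as in the notebook cell
weights[0] /= 2
weights[1] *= 4
print(weights)
```

The rarer a class is, the larger its weight, so a single misclassified positive costs as much as many misclassified negatives.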

In [0]:
del model  # discard the model from a previous run (raises NameError on a fresh kernel)

# Model Training

In [26]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size)(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(rate=0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 1000)              0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 1000, 300)         90000000  
_________________________________________________________________
bidirectional_2 (Bidirection (None, 1000, 128)         140544    
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 16)                2064      
_________________________________________________________________
dropout_2 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 17        
Total para
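
The per-layer parameter counts listed in the summary can be checked by hand. The sketch below assumes the CuDNN-style GRU layout (three gates, each with an input kernel, a recurrent kernel, and two bias vectors), which is consistent with the 140544 figure shown.

```python
max_features, embed_size, units = 300_000, 300, 64

# Embedding: one embed_size-dimensional vector per vocabulary row
embedding = max_features * embed_size

# GRU, one direction: 3 gates x (input kernel + recurrent kernel + 2 biases)
gru_one_direction = 3 * (embed_size * units + units * units + 2 * units)
bidirectional = 2 * gru_one_direction

# Dense layers: inputs * outputs + one bias per output
dense_16 = (2 * units) * 16 + 16
dense_1 = 16 * 1 + 1

print(embedding, bidirectional, dense_16, dense_1)
```

Almost all parameters sit in the embedding matrix, which is trained from scratch on about 60k documents.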

In [34]:
model.fit(train_X, train_y, batch_size=512, epochs=20, initial_epoch=10,
          class_weight=class_weights, validation_data=(val_X, val_y))

Train on 60660 samples, validate on 6740 samples
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f086250ed68>

# Model Evaluation

In [35]:
pred_y = model.predict([test_X], batch_size=512, verbose=1)



In [0]:
pred_y = pred_y.ravel()
pred_y = [1 if prob > 0.5 else 0 for prob in pred_y]

In [0]:
from sklearn.metrics import confusion_matrix, classification_report

In [38]:
confusion_matrix(y_true=test_y, y_pred=pred_y)
# we can get more recall at the cost of more false positives
#array([[15377,  1367],
#       [   50,    57]])

array([[16501,   243],
       [   77,    30]])
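
Besides changing the class weights, the recall versus false-positive trade-off can also be explored by moving the 0.5 decision threshold. This is a different knob from the one used in this notebook, shown here only as a sketch on made-up probabilities; `confusion_at_threshold` is a hypothetical helper that follows the same layout as `confusion_matrix` (rows are true classes, columns are predictions).

```python
import numpy as np


def confusion_at_threshold(y_true, probs, threshold):
    """Return [[TN, FP], [FN, TP]] for predictions probs > threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(probs) > threshold).astype(int)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    return np.array([[tn, fp], [fn, tp]])


# Toy labels and predicted probabilities, for illustration only
toy_y = [0, 0, 0, 0, 1, 1]
toy_probs = [0.1, 0.2, 0.4, 0.6, 0.3, 0.9]

cm_default = confusion_at_threshold(toy_y, toy_probs, 0.5)
cm_lower = confusion_at_threshold(toy_y, toy_probs, 0.25)
print(cm_default)
print(cm_lower)
```

Lowering the threshold recovers the missed positive in the toy data but adds a false positive, the same kind of trade-off the comment above describes.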

In [39]:
print(classification_report(y_true=test_y, y_pred=pred_y))

              precision    recall  f1-score   support

           0       1.00      0.99      0.99     16744
           1       0.11      0.28      0.16       107

   micro avg       0.98      0.98      0.98     16851
   macro avg       0.55      0.63      0.57     16851
weighted avg       0.99      0.98      0.99     16851
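
The class 1 row of the report follows directly from the confusion matrix above. A short worked check, using the four counts from that matrix:

```python
# Counts from the confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 16501, 243, 77, 30

precision = tp / (tp + fp)   # share of predicted CVEs that are real CVEs
recall = tp / (tp + fn)      # share of real CVEs that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```

Accuracy stays near 0.98 only because 16744 of the 16851 test documents are non-CVE; precision and recall on class 1 are the numbers to watch.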



In [0]:
model.save('/content/drive/My Drive/dl_model4.h5')

- This model uses a bidirectional GRU and tries to tackle the class imbalance with class weights
- Embeddings are trained from scratch
- We trained only on security-related data (CVE and potentially non-CVE), with the negatives filtered out completely by the regexes
- We correctly predict 30 of the 107 CVE issues in the test set (28% recall, 11% precision)
- We definitely need more positive data
- Adjusting the class weights lets us catch more CVE issues, but the false positives also increase
