# CWE AI

This notebook builds a machine learning model that predicts the category of a software vulnerability from its title. The categories are Common Weakness Enumeration (CWE) entries, each identified by a number (for example, CWE-89 for SQL injection).

The dataset (tagged_cve_cwe.tsv) was built by scraping the Snyk Database and the official NIST CVE Data Feeds. Each row contains:
- The title of the vulnerability, gathered from the Snyk Database.
- The CVE identifier of the vulnerability.
- The NVD description, kept in its own column in case you want to use it.
- The CWE that matches the CVE (when there are several, the first one is used).

In [25]:
pip install nltk scikit-learn pandas

Note: you may need to restart the kernel to use updated packages.


In [8]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import re
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/azureuser/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


The cell above imports the libraries needed for text preprocessing and for building the classification model, and downloads the NLTK stopword list.

In [9]:
def text_prepare(text):
    """Performs tokenization and simple preprocessing."""
    # Raw strings avoid invalid-escape warnings for sequences such as \[ and \|
    replace_by_space_re = re.compile(r'[/(){}\[\]\|@,;]')
    bad_symbols_re = re.compile(r'[^0-9a-z #+_]')
    stopwords_set = set(stopwords.words('english'))
    text = text.lower()
    text = replace_by_space_re.sub(' ', text)
    text = bad_symbols_re.sub('', text)
    text = ' '.join([x for x in text.split() if x and x not in stopwords_set])
    return text.strip()

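The text_prepare function cleans a piece of text in four steps: it converts the text to lowercase, replaces separator characters (slashes, brackets, braces, pipes, @, commas, and semicolons) with spaces, deletes every remaining character that is not a lowercase letter, digit, space, #, +, or _, and removes English stopwords (common words such as "the", "a", "an"). It returns the cleaned text as a single string.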
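
To make the cleaning steps concrete, here is a minimal standalone sketch of the same pipeline. It assumes a tiny hand-picked stopword set (DEMO_STOPWORDS, a hypothetical stand-in for NLTK's English list) so that it runs without downloading anything; the name text_prepare_demo and the sample title are also illustrative.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Assumption: a small stand-in for stopwords.words('english')
DEMO_STOPWORDS = {'the', 'a', 'an', 'in', 'of'}


def text_prepare_demo(text, stopwords_set=DEMO_STOPWORDS):
    """Lowercase, replace separators, drop other symbols, remove stopwords."""
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    text = BAD_SYMBOLS_RE.sub('', text)
    return ' '.join(tok for tok in text.split() if tok not in stopwords_set)


cleaned = text_prepare_demo('Cross-site Scripting (XSS) in the admin/login page')
print(cleaned)  # crosssite scripting xss admin login page
```

Note that the hyphen is deleted rather than replaced by a space, so "cross-site" becomes the single token "crosssite".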

In [10]:
def tfidf_features(X_train, X_test, vectorizer_path):
    """Performs TF-IDF transformation and dumps the model."""
    tfidf_vectorizer = TfidfVectorizer(min_df=5, max_df=0.9, ngram_range=(1, 2), token_pattern=r'(\S+)')
    X_train=tfidf_vectorizer.fit_transform(X_train)
    X_test=tfidf_vectorizer.transform(X_test)
    with open(vectorizer_path,'wb') as vectorizer_file:
        pickle.dump(tfidf_vectorizer,vectorizer_file)
    return X_train, X_test

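The tfidf_features function performs the TF-IDF transformation on the training and test data. It creates a TfidfVectorizer with the following settings: min_df=5 ignores terms that appear in fewer than 5 training documents, max_df=0.9 ignores terms that appear in more than 90% of them, ngram_range=(1, 2) uses both single tokens and pairs of adjacent tokens as features, and the token pattern treats every run of non-whitespace characters as a token. The vectorizer is fitted on the training data only and then used to transform both the training and the test data, so no information from the test set influences the vocabulary or the weights. Finally, the function saves the fitted vectorizer to a file using pickle and returns the transformed training and test data.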
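
The effect of ngram_range and the whitespace token pattern is easier to see on a toy corpus. The sketch below is an illustration, not the notebook's configuration: it assumes min_df=1 and the default max_df, because three made-up titles are far too few for min_df=5 and max_df=0.9 to leave any features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: three made-up titles standing in for the real dataset
titles = ['sql injection', 'command injection', 'sql injection login']

demo_vectorizer = TfidfVectorizer(min_df=1, ngram_range=(1, 2), token_pattern=r'(\S+)')
demo_matrix = demo_vectorizer.fit_transform(titles)

# The vocabulary holds unigrams and bigrams of whitespace-separated tokens
features = sorted(demo_vectorizer.vocabulary_)
print(features)
# ['command', 'command injection', 'injection', 'injection login',
#  'login', 'sql', 'sql injection']
print(demo_matrix.shape)  # (3, 7)
```

Each row of the matrix is one title and each column is one unigram or bigram from the vocabulary.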

In [11]:
# Read the tagged_cve_cwe dataset
with open('tagged_cve_cwe.txt', encoding='utf-8', errors='backslashreplace') as input_fd:
    # Keep a reproducible random sample of 4593 rows
    cwe_df = pd.read_csv(input_fd, sep='\t', engine='python').sample(4593, random_state=0)
cwe_df.head()

# Split data into features and labels
x = cwe_df['Defect'].values
y = cwe_df['CWE'].values

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print('Train size = {}, test size = {}'.format(len(X_train), len(X_test)))

Train size = 3674, test size = 919


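The above code reads the tab-separated dataset file, keeps a random sample of 4593 rows (with a fixed seed so the sample is reproducible), and takes the Defect column as the input text and the CWE column as the label. The train_test_split function then holds out 20% of the rows as the test set, again with a fixed seed. The last line prints the sizes of the training and test sets.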
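
The printed sizes follow directly from the sample size and the test fraction. As a small worked check, assuming the usual scikit-learn behavior of rounding the test count up and giving the remainder to the training set:

```python
import math

n_samples = 4593
test_size = 0.2

# The test count is rounded up; the training set gets the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 3674 919
```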

In [12]:
X_train_tfidf, X_test_tfidf = tfidf_features(X_train, X_test, './tfidf_vectorizer.pkl')
# Reload the saved vectorizer; guess_cwe uses this variable later
with open('./tfidf_vectorizer.pkl', 'rb') as vectorizer_file:
    vectorizer = pickle.load(vectorizer_file)
X_train_tfidf, X_test_tfidf = vectorizer.transform(X_train), vectorizer.transform(X_test)

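Now we apply the TF-IDF transformation to the training and test datasets. The tfidf_features function defined earlier fits the vectorizer on the training data, transforms both sets, and saves the vectorizer to disk. We then load the saved vectorizer and transform the data again with it. The result is the same as before, but this step confirms that the pickled file works and defines the vectorizer variable that guess_cwe uses later.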
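
The save-and-reload step relies on a pickled vectorizer behaving exactly like the original. A minimal sketch of that round trip, assuming toy titles and serializing to bytes in memory instead of to a file:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: toy titles, not the real dataset
titles = ['sql injection', 'use after free', 'double free']
fitted = TfidfVectorizer(token_pattern=r'(\S+)').fit(titles)

# Serialize to bytes and back (the notebook writes to a .pkl file instead)
restored = pickle.loads(pickle.dumps(fitted))

original_matrix = fitted.transform(['double free in parser']).toarray()
restored_matrix = restored.transform(['double free in parser']).toarray()
same = bool((original_matrix == restored_matrix).all())
print(same)  # True
```

Only unpickle files you created yourself or otherwise trust, because loading a pickle can execute arbitrary code.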

In [13]:
lr = LogisticRegression(solver='newton-cg',C=5, penalty='l2',n_jobs=-1)
cwe_classifier = OneVsRestClassifier(lr)
cwe_classifier.fit(X_train_tfidf, y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=5, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=-1, penalty='l2', random_state=None, solver='newton-cg', tol=0.0001, verbose=0, warm_start=False), n_jobs=1)

y_test_pred = cwe_classifier.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test accuracy = {}'.format(test_accuracy))

Test accuracy = 0.940152339499456


This section of the code trains a logistic regression model using the One-vs-Rest (OvR) strategy for multi-class classification. The solver is "newton-cg", a Newton conjugate-gradient method that uses second-order (curvature) information in addition to the gradient to minimize the logistic regression loss. The parameter C is the inverse of the regularization strength: smaller values regularize more strongly, so C=5 applies weaker regularization than the default of 1.0. The penalty is L2, which discourages large coefficients and helps to avoid overfitting.
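
The role of C can be checked on a toy problem by comparing the size of the learned coefficients for a small and a large value. This is a sketch under stated assumptions: the data points and the helper name coef_norm are made up for illustration, and the penalty argument is left at its default, which is L2.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumption: a tiny made-up binary problem with two features
X = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, 0.0],
              [3.0, 1.5], [4.0, 1.0], [5.0, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])


def coef_norm(C):
    """Fit an L2-regularized model and return the norm of its coefficients."""
    model = LogisticRegression(solver='newton-cg', C=C)
    model.fit(X, y)
    return float(np.linalg.norm(model.coef_))


norm_small_c = coef_norm(0.01)  # strong regularization
norm_large_c = coef_norm(5)     # weak regularization, as in this notebook
print(norm_small_c < norm_large_c)  # True
```

A larger C lets the coefficients grow to fit the training data more closely, which is the trade-off the parameter controls.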

The OneVsRestClassifier trains one binary classifier per class, each separating that class from all the others, and predicts the class whose classifier gives the highest score. Calling the fit method on the OneVsRestClassifier object with X_train_tfidf and y_train trains these classifiers on the training data.
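
The selection step can be sketched in plain Python. The example assumes made-up decision scores for three CWE classes; it turns each score into a probability with the logistic function, rescales the probabilities so they sum to 1 (as OneVsRestClassifier does for single-label problems), and picks the largest. The function names are illustrative only.

```python
import math


def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def ovr_predict(scores_by_class):
    """Pick the class with the highest per-class probability."""
    probs = {label: sigmoid(score) for label, score in scores_by_class.items()}
    total = sum(probs.values())
    normalized = {label: p / total for label, p in probs.items()}
    best = max(normalized, key=normalized.get)
    return best, normalized


# Assumption: hypothetical scores from three binary classifiers
best_class, class_probs = ovr_predict({89: 2.0, 78: -1.0, 416: -3.0})
print(best_class)  # 89
```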

Once the model is trained, the predict method of the cwe_classifier object is called on the X_test_tfidf data to generate the predicted classes for the test data. The accuracy_score function from the sklearn.metrics module computes the fraction of test samples whose predicted CWE matches the true one. The printed accuracy of about 0.94 corresponds to 864 correct predictions out of 919.
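
Accuracy is simply the share of exact matches between true and predicted labels. A minimal plain-Python equivalent, using made-up labels rather than the notebook's test set:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return matches / len(y_true)


# Assumption: five hypothetical test labels, one of them predicted wrongly
demo_true = [89, 79, 416, 400, 78]
demo_pred = [89, 79, 416, 20, 78]
demo_accuracy = accuracy(demo_true, demo_pred)
print(demo_accuracy)  # 0.8
```

Accuracy treats every sample equally, so when a few CWEs dominate the dataset it can hide poor performance on the rare ones.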

In [19]:
def guess_cwe(cve_title):
    # Prepare the input text
    cve_title_prepared = text_prepare(cve_title)
    # Transform the text into features using the pre-trained vectorizer
    features = vectorizer.transform([cve_title_prepared])
    # Make a prediction using the pre-trained classifier
    prediction_prob = cwe_classifier.predict_proba(features)
    prediction_score = max(prediction_prob[0])
    prediction_percent = prediction_score * 100
    # Return the predicted CWE
    predicted_cwe = cwe_classifier.predict(features)[0]
    print(f"The predicted score for {cve_title} (CWE-{predicted_cwe}): {prediction_percent:.2f}%")
    return predicted_cwe

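Finally, we define a function called "guess_cwe" that takes the title of a vulnerability as input and returns the predicted CWE for it. The function cleans the title with text_prepare, converts it to features with the TF-IDF vectorizer fitted during training, and asks the trained classifier for a prediction. It also prints the highest class probability as a confidence score. Note that in this notebook text_prepare is applied only here and not to the titles used for training, so the input is cleaned differently at prediction time than at training time.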
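
The score printed by guess_cwe is the largest entry of one predict_proba row, whose columns follow the order of the classifier's classes_ attribute. The sketch below shows how such a row could be paired with its labels to list the top candidates instead of only the best one. The helper top_k_classes, the labels, and the probabilities are all made up for illustration.

```python
def top_k_classes(classes, probabilities, k=3):
    """Pair each label with its probability and return the k most likely."""
    ranked = sorted(zip(classes, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Assumption: hypothetical labels and one hypothetical probability row
demo_classes = [78, 89, 400, 416]
demo_row = [0.05, 0.80, 0.10, 0.05]

top = top_k_classes(demo_classes, demo_row, k=2)
print(top)  # [(89, 0.8), (400, 0.1)]
```

With the real model, the two arguments would be cwe_classifier.classes_ and prediction_prob[0].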

In [24]:
guess_cwe('Structured Query Language Injection')
guess_cwe('SQL Injection')
guess_cwe('Double Free')
guess_cwe('Use After Free')
guess_cwe('OS CMD Injection')
guess_cwe('Denial of Service')

The predicted score for Structured Query Language Injection (CWE-89): 40.97%
The predicted score for SQL Injection (CWE-89): 94.04%
The predicted score for Double Free (CWE-415): 94.94%
The predicted score for Use After Free (CWE-416): 93.36%
The predicted score for OS CMD Injection (CWE-78): 75.20%
The predicted score for Denial of Service (CWE-400): 66.35%


400