# CVSS Score Prediction

CVSS: Common Vulnerability Scoring System

One of the thorniest problems in cybersecurity is how to prioritize work. Security teams are frequently understaffed and face an overwhelming amount of work. By predicting the severity score of a vulnerability, security teams can prioritize which issues to address first to keep their organization safe.

## Setup

In [2]:
import re
import string
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import WordPunctTokenizer
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

In [3]:
stop_words = stopwords.words('english')
word_punct_tokenizer = WordPunctTokenizer()
word_net_lemmatizer = WordNetLemmatizer()
snowball_stemmer = SnowballStemmer("english")

In [4]:
cve = pd.read_csv(r'G:\My Drive\MIDS\W207\Final Project\data\cve.csv', header=0, index_col=0).drop_duplicates().dropna()
products = pd.read_csv(r'G:\My Drive\MIDS\W207\Final Project\data\products.csv', header=0, index_col=0)
vendors = pd.read_csv(r'G:\My Drive\MIDS\W207\Final Project\data\vendors.csv', header=0, index_col=0)
vendor_product = pd.read_csv(r'G:\My Drive\MIDS\W207\Final Project\data\vendor_product.csv', header=0, index_col=0)

In [5]:
cve.mod_date = pd.to_datetime(cve.mod_date)
cve.pub_date = pd.to_datetime(cve.pub_date)

## Feature Engineering

Since we are trying to predict CVSS scores from the accompanying text data, we did some light preprocessing on the summary text and then vectorized the words using TF-IDF. Using weighted signals from the summary text helps the model capture the relationship between the severity score and the CVE. For the proof of concept, we also floored the scores from float values to whole numbers, which turns the task into a simpler multi-class problem for the models.

In [6]:
cve['access_authentication_ENCODED'] = pd.Categorical(cve['access_authentication'], categories=['NONE', 'SINGLE', 'MULTIPLE']).codes
cve['access_complexity_ENCODED'] = pd.Categorical(cve['access_complexity'], categories=['LOW', 'MEDIUM', 'HIGH']).codes
cve['access_vector_ENCODED'] = pd.Categorical(cve['access_vector'], categories=['LOCAL', 'NETWORK', 'ADJACENT_NETWORK']).codes
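cve['impact_availability_ENCODED'] = pd.Categorical(cve['impact_availability'], categories=['NONE', 'PARTIAL', 'COMPLETE']).codes
cve['impact_confidentiality_ENCODED'] = pd.Categorical(cve['impact_confidentiality'], categories=['NONE', 'PARTIAL', 'COMPLETE']).codes
cve['impact_integrity_ENCODED'] = pd.Categorical(cve['impact_integrity'], categories=['NONE', 'PARTIAL', 'COMPLETE']).codes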
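To make the encoding step concrete, the sketch below runs `pd.Categorical(...).codes` on a small made-up column. The values in `toy` are illustrative and not taken from the dataset. One behaviour worth keeping in mind is that any value missing from `categories` is encoded as -1 rather than raising an error.

```python
import pandas as pd

# Toy column (illustrative values only); "UNKNOWN" is deliberately not a listed category.
toy = pd.Series(["LOW", "HIGH", "MEDIUM", "UNKNOWN"])

# Codes follow the order of `categories`: LOW -> 0, MEDIUM -> 1, HIGH -> 2.
codes = pd.Categorical(toy, categories=["LOW", "MEDIUM", "HIGH"]).codes

print(codes.tolist())  # [0, 2, 1, -1]
```

Checking for -1 after encoding is a cheap way to spot unexpected category values in the raw data.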

In [7]:
# cve = cve.merge(pd.get_dummies(cve['access_authentication'], prefix='access_authentication'), left_index=True, right_index=True)
# cve = cve.merge(pd.get_dummies(cve['access_complexity'], prefix='access_complexity'), left_index=True, right_index=True)
# cve = cve.merge(pd.get_dummies(cve['access_vector'], prefix='access_vector'), left_index=True, right_index=True)
# cve = cve.merge(pd.get_dummies(cve['impact_availability'], prefix='impact_availability'), left_index=True, right_index=True)
# cve = cve.merge(pd.get_dummies(cve['impact_confidentiality'], 'impact_confidentiality'), left_index=True, right_index=True)
# cve = cve.merge(pd.get_dummies(cve['impact_integrity'], 'impact_integrity'), left_index=True, right_index=True)

In [8]:
# Drop the original categorical columns in a single call.
cve = cve.drop(columns=[
    'access_authentication',
    'access_complexity',
    'access_vector',
    'impact_availability',
    'impact_confidentiality',
    'impact_integrity',
])

In [9]:
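# Floor each score to a whole number so it can serve as a class label.
cve['cvss'] = cve['cvss'].apply(np.floor)
# Keep only positive scores; copy so later column assignments do not trigger SettingWithCopyWarning.
cve = cve[cve["cvss"] > 0].copy()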
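As a small illustration of the flooring step, the sketch below applies it to a handful of made-up scores (the values in `scores` are assumptions for demonstration only). It shows that nearby scores collapse into the same bucket, that anything below 1.0 is removed by the `> 0` filter, and that the resulting labels are still floats.

```python
import numpy as np
import pandas as pd

# Made-up scores for illustration only.
scores = pd.Series([0.0, 4.3, 7.5, 7.8, 10.0])

# Floor each score to its integer bucket, then drop the zero bucket as the cell above does.
buckets = np.floor(scores)
buckets = buckets[buckets > 0]

print(buckets.tolist())  # [4.0, 7.0, 7.0, 10.0]
```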

In [10]:
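# Lowercase, then replace non-word characters and digit runs with spaces.
cve["summary"] = cve["summary"].str.lower()
cve["summary"] = cve['summary'].apply(lambda x: re.sub(r"\W", ' ', x))
cve["summary"] = cve['summary'].apply(lambda x: re.sub(r"\d+", ' ', x))
# Tokenize, keep alphanumeric tokens that are not stop words, and strip any remaining punctuation.
cve['summary'] = cve['summary'].apply(lambda x: ' '.join([item.translate(str.maketrans('', '', string.punctuation)) for item in word_punct_tokenizer.tokenize(x) if item.isalnum() if item not in stop_words]))
# Lemmatize as adjective, noun, and verb. The nested loop emits every token once per POS tag, so each summary is repeated three times.
cve['summary'] = cve['summary'].apply(lambda x: ' '.join([word_net_lemmatizer.lemmatize(item, pos=tag) for tag in ('a', 'n', 'v') for item in word_punct_tokenizer.tokenize(x)]))
# Stem each token with the Snowball stemmer.
cve['summary'] = cve['summary'].apply(lambda x: ' '.join([snowball_stemmer.stem(item) for item in word_punct_tokenizer.tokenize(x)]))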
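The regex part of the cleaning above can be tried in isolation. The following is a minimal sketch using only the standard library. `basic_clean` is a hypothetical helper name, the sample sentence and the product name `FooServer` are made up, and the final whitespace collapse is an addition for readability that the notebook does not perform. Stop-word removal, lemmatization, and stemming are left out because they need the NLTK corpora.

```python
import re

def basic_clean(text):
    """Lowercase, then replace non-word characters and digit runs with spaces."""
    text = text.lower()
    text = re.sub(r"\W", " ", text)   # punctuation and other non-word characters
    text = re.sub(r"\d+", " ", text)  # digit runs such as version numbers
    return " ".join(text.split())     # collapse the repeated spaces left behind

sample = "Buffer overflow in FooServer 2.3.1 allows remote attackers to execute code."
print(basic_clean(sample))
# buffer overflow in fooserver allows remote attackers to execute code
```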

In [11]:
texts = cve['summary'].astype('str')
labels = cve['cvss']
tokenized_texts = [word_punct_tokenizer.tokenize(text) for text in texts]
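# The texts are already tokenized, so pass identity functions for tokenizing and preprocessing.
# Ignore terms that appear in fewer than 5 documents or in more than 80% of documents.
tfidf_vectorizer = TfidfVectorizer(tokenizer=lambda x: x, preprocessor=lambda x: x, min_df=5, max_df=0.8)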

In [12]:
x_train, x_test, y_train, y_test = train_test_split(tokenized_texts, labels, test_size = 0.1, stratify = labels)

In [13]:
x_train_tfidf = tfidf_vectorizer.fit_transform(x_train) 
x_test_tfidf = tfidf_vectorizer.transform(x_test)
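The vectorization pattern used above (fit on the training split only, then transform the test split) can be seen on a toy example. This is a sketch under assumptions: the token lists are made up, `identity` is a hypothetical named stand-in for the lambdas, and `token_pattern=None` is added only to silence the warning scikit-learn emits when a custom tokenizer is supplied. A named function, unlike a lambda, also allows the fitted vectorizer to be pickled.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy pre-tokenized documents (illustrative only).
train_docs = [
    ["buffer", "overflow", "remote"],
    ["sql", "inject", "remote"],
    ["buffer", "overflow", "local"],
]
test_docs = [["buffer", "overflow", "unseen"]]

def identity(tokens):
    """Return the token list unchanged so the vectorizer skips its own tokenization."""
    return tokens

vectorizer = TfidfVectorizer(tokenizer=identity, preprocessor=identity, token_pattern=None)

# The vocabulary and IDF weights are learned from the training documents only.
train_matrix = vectorizer.fit_transform(train_docs)
# Test documents are projected onto that vocabulary; unknown tokens are ignored.
test_matrix = vectorizer.transform(test_docs)

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Because "unseen" never occurs in the training documents, it has no column in either matrix, which is the behaviour that keeps information from the test split out of the features.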

## Model Training

We train several classification models to see which performs best on the data. CVSS severity scores are continuous, but once they are floored to whole numbers the task can be treated as multi-class classification over the weighted text features.

In [14]:
knn_model = KNeighborsClassifier()
knn_model.fit(x_train_tfidf, y_train)
knn_pred = knn_model.predict(x_test_tfidf)

In [15]:
decision_tree_model = DecisionTreeClassifier()
decision_tree_model.fit(x_train_tfidf, y_train)
decision_tree_pred = decision_tree_model.predict(x_test_tfidf)

In [16]:
random_forest_model = RandomForestClassifier()
random_forest_model.fit(x_train_tfidf, y_train)
random_forest_pred = random_forest_model.predict(x_test_tfidf)

In [17]:
ada_boost_model = AdaBoostClassifier()
ada_boost_model.fit(x_train_tfidf, y_train)
ada_boost_pred = ada_boost_model.predict(x_test_tfidf)

In [18]:
gradient_boost_model = GradientBoostingClassifier()
gradient_boost_model.fit(x_train_tfidf, y_train)
gradient_boost_pred = gradient_boost_model.predict(x_test_tfidf)
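The five cells above all follow the same fit-then-predict pattern, so they could also be written as a loop over a dictionary of models. The sketch below shows one way to do that. It runs on synthetic data from `make_classification` as a stand-in for the TF-IDF features, uses only three of the five models to stay short, and fixes `random_state` for repeatability. None of these choices come from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration, not the CVE features).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

models = {
    "knn": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Fit each model, predict on the held-out split, and record its micro-averaged F1.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = f1_score(y_test, predictions, average="micro")

for name, score in scores.items():
    print(f"{name} micro-F1: {score:.3f}")
```

Keeping the models in one dictionary makes it harder for the training, prediction, and scoring code to drift apart when a model is added or removed.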

In [19]:
knn_f1 = metrics.f1_score(y_test, knn_pred, average='micro')
decision_tree_f1 = metrics.f1_score(y_test, decision_tree_pred, average='micro')
random_forest_f1 = metrics.f1_score(y_test, random_forest_pred, average='micro')
ada_boost_f1 = metrics.f1_score(y_test, ada_boost_pred, average='micro')
gradient_boost_f1 = metrics.f1_score(y_test, gradient_boost_pred, average='micro')
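A note on the metric: for a single-label multi-class problem such as this one, micro-averaged F1 is equal to plain accuracy, so the scores computed above can be read as the fraction of test CVEs assigned the correct bucket. The sketch below checks this on made-up labels and also computes the macro average, which weights every class equally and can differ noticeably when classes are imbalanced.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels and predictions for a three-class, single-label problem.
y_true = [4, 4, 4, 4, 7, 9]
y_pred = [4, 4, 4, 7, 4, 9]

micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
accuracy = accuracy_score(y_true, y_pred)

print(f"micro-F1: {micro_f1:.3f}")
print(f"accuracy: {accuracy:.3f}")
print(f"macro-F1: {macro_f1:.3f}")
```

Reporting macro F1 or a per-class breakdown alongside the micro average would show whether the rarer score buckets are being predicted well.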

In [20]:
print(f'knn f1 score: {knn_f1}')
print(f'decision_tree score: {decision_tree_f1}')
print(f'random_forest score: {random_forest_f1}')
print(f'ada_boost score: {ada_boost_f1}')
print(f'gradient_boost score: {gradient_boost_f1}')

knn f1 score: 0.5627710568363388
decision_tree score: 0.615841132161607
random_forest score: 0.6835197443506049
ada_boost score: 0.46142433234421365
gradient_boost score: 0.6141291942478886


## Findings

The F1 scores show that random forest is the highest performer (about 0.68), ahead of the single decision tree and gradient boosting (both roughly 0.61 to 0.62), KNN (about 0.56), and AdaBoost (about 0.46). We can infer that decision-tree-based models perform well on the vectorized text, possibly because technical security writing follows a rigorous, fairly uniform structure with standardized naming.

## Next Steps

The next steps are to enrich the dataset with more data and improve our feature engineering. We can supplement the base dataset with data from several different sources:
* MITRE CVE DB: The data we are using is a subset of this DB, and the remaining fields could enhance the models.
* Project Zero DB: Project Zero DB tracks vulnerabilities that have been exploited or patched, with rich data that gives us other severity indicators (e.g., time to patch can indicate the difficulty of the exposure).
* MITRE ATT&CK: This categorization adds rich information beyond the CVE data, with concise categorical labels describing how vulnerabilities are used by attackers in the real world.