# Data

Data is [fetched from the API](api_.py) endpoint provided, https://services.nvd.nist.gov/rest/json/cves/1.0/.

A local [MongoDB is established](mongo_.py) and the data is upserted. 

With [mongo queries](pddf.py) the data is unraveled and split in two, one set for CVSSv2 and one for CVSSv3. Each set is then returned as a pandas DataFrame.

The `Run()` class performs all of this with default settings.

In [2]:

import pandas as pd
import numpy as np

from run import Run

#### Establish and Fill MongoDB
Connects to the API and builds successive queries, by default starting in 2014 and waiting 1s between each query.

Upserts to a local MongoDB established with default settings.


In [3]:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#NOTE: will automatically start downloading and
# try to connect to a default mongoDB client
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


#Run(collection='t').fill_mongo()


# Part 1
Explore the data to get a better understanding of the content


### Fill Pandas' dataframes
Two DataFrames are returned, one for CVSSv2 and one for CVSSv3, which are merged together using an outer join.

- Create a Data frame with one row per CVE id

In [4]:
dfs = Run(collection= 't').fill_df()
dfV2,dfV3 = dfs.dfV2,dfs.dfV3
try:
    df = pd.merge(dfV3, dfV2, how='outer', on='_id', suffixes=['_V3', '_V2'])
except Exception:
    # fall back to the local CSV snapshot and keep the result in df
    print('''ERROR: Possibly no MongoDB loaded\nCreating df from backupDB.csv''')
    df = pd.read_csv('backupDB.csv', low_memory=False)
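To make the effect of the outer join concrete, here is a minimal sketch on toy data. The frames `toy_v3` and `toy_v2`, their ids and their vector strings are made up for illustration; only the `_id` key and the `_V3` / `_V2` suffixes come from the cell above. The `indicator=True` option is an addition here, used to show which side each row came from.

```python
import pandas as pd

# Toy stand-ins for dfV3 and dfV2; ids and vector strings are invented.
toy_v3 = pd.DataFrame({"_id": ["CVE-A", "CVE-B"],
                       "vectorString": ["v3-a", "v3-b"]})
toy_v2 = pd.DataFrame({"_id": ["CVE-B", "CVE-C"],
                       "vectorString": ["v2-b", "v2-c"]})

# The outer join keeps every _id; clashing column names receive the suffixes.
toy = pd.merge(toy_v3, toy_v2, how="outer", on="_id",
               suffixes=("_V3", "_V2"), indicator=True)

# Rows found only in the right-hand (V2) frame have NaN in every *_V3 column.
only_v2 = int((toy["_merge"] == "right_only").sum())
print(toy)
print("only CVSSv2:", only_v2)
```

Counting the `right_only` rows is one way to answer the "only CVSSv2" question without chaining `isnull()` and `dropna()`.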

- How many CVEs have CVSSv3 metrics versus only CVSSv2 metrics?

In [5]:
print(f"""Total n of CVE's = {len(df)}
with CVSSv3 = {len(df.vectorString_V3.dropna())}
with CVSSv2 = {len(df.vectorString_V2.dropna())}
with just CVSSv2 = {len((df[df['vectorString_V3'].isnull()])['vectorString_V2'].dropna())}
""")

Total n of CVE's = 115262
with CVSSv3 = 100248
with CVSSv2 = 114569
with just CVSSv2 = 14420



Checking for unique values in each column to determine if categorical 

In [6]:
for i in df.columns:
    if df[i].dtype == object and len(df[i].unique()) <10:
        print (df[i].unique(), i)

[nan '3.1' '3.0'] version_V3
[nan 'NETWORK' 'LOCAL' 'PHYSICAL' 'ADJACENT_NETWORK'] attackVector
[nan 'LOW' 'HIGH'] attackComplexity
[nan 'NONE' 'LOW' 'HIGH'] privilegesRequired
[nan 'NONE' 'REQUIRED'] userInteraction
[nan 'UNCHANGED' 'CHANGED'] scope
[nan 'HIGH' 'LOW' 'NONE'] confidentialityImpact_V3
[nan 'HIGH' 'LOW' 'NONE'] integrityImpact_V3
[nan 'HIGH' 'NONE' 'LOW'] availabilityImpact_V3
[nan 'CRITICAL' 'MEDIUM' 'HIGH' 'LOW'] baseSeverity
['2.0' nan] version_V2
['NETWORK' 'LOCAL' 'ADJACENT_NETWORK' nan] accessVector
['LOW' 'MEDIUM' 'HIGH' nan] accessComplexity
['NONE' 'SINGLE' 'MULTIPLE' nan] authentication
['NONE' 'PARTIAL' 'COMPLETE' nan] confidentialityImpact_V2
['PARTIAL' 'NONE' 'COMPLETE' nan] integrityImpact_V2
['NONE' 'PARTIAL' 'COMPLETE' nan] availabilityImpact_V2
['MEDIUM' 'HIGH' 'LOW' nan] severity
[False True nan] obtainAllPrivilege
[False True nan] obtainUserPrivilege
[False True nan] obtainOtherPrivilege
[False True nan] userInteractionRequired
[nan False True] acInsuf

`attackVector` and `accessVector` are nominal categorical and need dummy variables.


In [7]:
Δ_list = [['attackVector','AV_3'], ['accessVector', 'AV_2']]
dum_add = lambda ele: pd.get_dummies(df[ele[0]],prefix=ele[1])

frames = [dum_add(i) for i in Δ_list]

df = df.drop([i[0] for i in Δ_list], axis=1)

frames.append(df)

df = pd.concat(frames,axis=1)
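As a small illustration of what the dummy encoding produces, the sketch below runs `pd.get_dummies` on a made-up `attackVector` column. The toy values are assumptions; only the column name and the `AV_3` prefix come from the cell above.

```python
import pandas as pd

# Toy column with one missing value, mimicking a CVE without CVSSv3 data.
toy = pd.DataFrame({"attackVector": ["NETWORK", "LOCAL", None, "NETWORK"]})

# One indicator column per category; the prefix keeps V3 and V2 columns apart.
dummies = pd.get_dummies(toy["attackVector"], prefix="AV_3")

# Cast to int so the result prints the same way across pandas versions.
print(dummies.astype(int))
```

Note that a missing value becomes a row of zeros rather than NaN, so rows without a CVSSv3 vector are no longer detectable from the dummy columns alone.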


The other columns are ordinal categorical (e.g. `LOW`, `MEDIUM`, `HIGH`) or boolean and can be encoded with a dict.

In [8]:
cat_dict = {'NONE': 0, 'LOW':1, 'MEDIUM':2,'HIGH':3, 'CRITICAL':4,
'PARTIAL': 1, 'COMPLETE':2,
'SINGLE' :1, 'MULTIPLE' :2,
'UNCHANGED':0 ,'CHANGED':1,
'REQUIRED':1,
False:0, True:1
}
df.replace(cat_dict, inplace=True)
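A single global dict works because no label needs two different codes. A possible alternative, sketched here under the assumption that one would rather keep the V2 and V3 scales separate, is to map each column with its own dict via `Series.map`. The names `v3_levels` and `v2_levels` are hypothetical; their values mirror `cat_dict` above.

```python
import pandas as pd

# Hypothetical per-version mappings; the codes mirror cat_dict.
v3_levels = {"NONE": 0, "LOW": 1, "HIGH": 3}
v2_levels = {"NONE": 0, "PARTIAL": 1, "COMPLETE": 2}

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "confidentialityImpact_V3": ["HIGH", "LOW", None],
    "confidentialityImpact_V2": ["COMPLETE", "PARTIAL", "NONE"],
})

# Series.map leaves missing (or unmapped) values as NaN.
encoded = pd.DataFrame({
    "confidentialityImpact_V3": toy["confidentialityImpact_V3"].map(v3_levels),
    "confidentialityImpact_V2": toy["confidentialityImpact_V2"].map(v2_levels),
})
print(encoded)
```

Unlike `replace`, `map` turns any label missing from the dict into NaN, which makes an unexpected category easy to spot.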

Creating a general index of all keys for manipulating columns, and adding the new dummy variables to it.

In [9]:
dfV2_keys,dfV3_keys = dfs.dfV2.keys(),dfs.dfV3.keys()


transform_idx = df.T.index

for i in transform_idx:
    #print(i[:4])
    if i[:4] == 'AV_3':
        dfV3_keys = dfV3_keys.append(pd.Index([i]))
    if i[:4] == 'AV_2':
        dfV2_keys = dfV2_keys.append(pd.Index([i]))

dfV3_keys = dfV3_keys.drop('attackVector')
dfV2_keys = dfV2_keys.drop('accessVector')

setDf = dfV3_keys.union(dfV2_keys)#set(dfV3_keys)


creating an index of all columns that appear in both CVSSv2 and v3

In [10]:
idxPairs = []

for i in setDf: 
    substr_i = transform_idx[transform_idx.str.startswith(i+'_')]
    if len(substr_i) >0:
        idxPairs.append(substr_i)
    

#idxPairs

Cleaning up data types

In [11]:
f = df.filter(regex='^version').columns  # columns whose name starts with "version"
df[f] = df[f].astype(float)
df = df.convert_dtypes()

- Both CVSSv2 and CVSSv3 have the same set of impact metrics, i.e. Confidentiality, Integrity and Availability, however their values are slightly different. For example, CVSSv2 uses complete (C) to represent the highest level of impact, but CVSSv3 uses high (H) instead. Is it possible to directly map from CVSSv2 impact metric values to CVSSv3?

In [12]:
for i in ['confidentialityImpact','integrityImpact','availabilityImpact']:
    print(df.corr()[i+'_V2'][i+'_V3'])

0.8024715511898207
0.7847626572053179
0.8352650731843352


As the correlation between the V2 and V3 versions of each metric is well below 1 (roughly 0.78 to 0.84), it would NOT be a good idea to directly map them.
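A correlation coefficient summarises the relationship in one number. A row-normalised cross-tabulation shows directly whether each V2 value lands on a single V3 value. The sketch below uses made-up value pairs (the column names `conf_V2` and `conf_V3` are hypothetical), so the proportions are illustrative only and say nothing about the real data.

```python
import pandas as pd

# Made-up pairs of (V2, V3) confidentiality values, for illustration only.
toy = pd.DataFrame({
    "conf_V2": ["COMPLETE", "COMPLETE", "PARTIAL", "PARTIAL", "PARTIAL", "NONE"],
    "conf_V3": ["HIGH", "HIGH", "HIGH", "LOW", "LOW", "NONE"],
})

# Each row sums to 1: the share of every V3 value within one V2 value.
table = pd.crosstab(toy["conf_V2"], toy["conf_V3"], normalize="index")
print(table.round(2))
```

A direct mapping would only be safe if every row held a single 1.0. In this toy table `PARTIAL` splits between `HIGH` and `LOW`, which is the kind of ambiguity a correlation below 1 hints at.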

# Part 2
Predict the CVSSv3 Scope metric for CVEs without a CVSSv3 vector.

- What type of learning problem is this? What is the target? 

This is a supervised binary classification problem with a single variable, `scope`, as the target.

In [13]:

from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis  import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, mean_absolute_error
from sklearn.preprocessing import StandardScaler

from pprint import pprint


Selects and combines the metrics of interest from the previous indices.

In [85]:

cvssV3 = []
cvssV2 = []
dfV3_keys_cp = dfV3_keys.drop(['_id','vectorString'])
dfV2_keys_cp = dfV2_keys.drop(['_id','vectorString'])
for i in idxPairs:
    if i[0][:-3] in dfV3_keys and i[0][:-3] != 'vectorString':
        cvssV3.append(i[0])
        dfV3_keys_cp = dfV3_keys_cp.drop([i[0][:-3]])

    
    if i[0][:-3] in dfV2_keys and i[0][:-3] != 'vectorString':
        cvssV2.append(i[1])
        dfV2_keys_cp = dfV2_keys_cp.drop([i[0][:-3]])

cvssV3 = dfV3_keys_cp.union(cvssV3)  
cvssV2 = dfV2_keys_cp.union(cvssV2)   


Copies the main df, dropping all columns of dtype object (the relevant data have dtype float, int, etc.), the column `acInsufInfo`, and all remaining rows with NaNs.

In [86]:
dfNoNan = df.loc[:,df.dtypes != 'object'].drop('acInsufInfo',axis=1).dropna()

X_pt2 = dfNoNan[cvssV2.drop('acInsufInfo')]
y_pt2 = dfNoNan['scope'].astype(int)


- How would you build the training / validation / testing dataset?

Take a random subset of the data and split into train and test sets

In [87]:
X_pt2_train, X_pt2_test, y_pt2_train, y_pt2_test = train_test_split(X_pt2, y_pt2, test_size= .33) 
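The cell above produces a train and a test set only. One way to also obtain a validation set while keeping the class balance the same in every part is to call `train_test_split` twice with `stratify`. This is a sketch on synthetic data: the 60/20/20 proportions, the random seed and the imbalanced labels are all assumptions, not settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem standing in for X_pt2 / y_pt2.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.17).astype(int)

# First cut: hold out 20% as the final test set.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Second cut: 25% of the remaining 80% becomes the validation set (20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```

Stratifying matters here because the positive class is the minority. An unstratified split can leave one part with noticeably fewer positives than another.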

The intention was to scale the data; however, every variant tried made little difference on this dataset, possibly due to the high number of categorical inputs and the low variance in the numerical data.

In [88]:
#sc = StandardScaler()
#X_pt2_train = sc.fit_transform(X_pt2_train)
#X_pt2_test = sc.transform(X_pt2_test)
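If scaling is revisited, one option is to put the scaler and the model in a single pipeline, so the scaler is fitted on the training data only and reused unchanged on the test data. The sketch below uses synthetic data and `LogisticRegression` purely as an example; neither the data nor the choice of model is taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# fit() scales with training statistics only; score() reuses them on the test set.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
scaled_lr.fit(X_train, y_train)
acc = scaled_lr.score(X_test, y_test)
print(acc)
```

Tree-based models such as random forests are insensitive to monotonic rescaling of the inputs, which may be part of why scaling changed little here.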

Ratio of the count of `scope` = 1 to the count of `scope` = 0



In [89]:
df.scope.value_counts()[1]/df.scope.value_counts()[0]

0.20283647096936755

- Which evaluation metrics would you use?

A helper function which takes a model, fits it on the training values, predicts the test values and checks the predictions against the y test values, returning a dict of all results.

NOTE: for compatibility with multioutput models, a custom `score()` function is defined later on, and the confusion matrix is set to 0 for multioutput targets.


In [90]:

def model_info(model,X,y):
    #takes model, set of X, set of y
    #returns dict
    X_train, X_test = X
    y_train, y_test = y
    print('~~~ fitting model')
    f = model.fit(X_train.values, y_train.values)
    print('~~~ predicting values')
    ŷ = model.predict(X_test.values)
    print('~~~ checking validity')
    
    try:
        sc = f.score(X_test.values, y_test.values)
    except Exception:
        # fall back to the custom multioutput score(y_test, ŷ) defined in Part 3
        sc = score(y_test.values, ŷ)
    
    m = mean_absolute_error(y_test.values, ŷ)
    c = confusion_matrix(y_test.values, ŷ) if len(y_train.shape) == 1 else 0

    dict_ = {'model': f, 'score' : sc, 'prediction': ŷ, 'MAE' : m, 'Confusion Matrix': c}
    return dict_
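Because `scope` is imbalanced, accuracy alone can look better than the model really is. As a worked example, the sketch below derives precision, recall and F1 from the confusion matrix printed for `MLPClassifier()` in this notebook's output, and compares the accuracy with the trivial baseline of always predicting 0. The variable names are illustrative.

```python
# Confusion matrix as printed by sklearn: rows are true labels, columns are predictions.
tn, fp = 26610, 624
fn, tp = 1176, 4347

total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # matches the reported 'score'
precision = tp / (tp + fp)            # how many predicted 1s are really 1
recall = tp / (tp + fn)               # how many real 1s are found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy obtained by always predicting the majority class (scope = 0).
majority_baseline = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} baseline={majority_baseline:.4f}")
```

The gap between accuracy and recall shows why recall, precision or F1 on the minority class is a useful complement to `score` for this problem.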
    

builds a dict of all the model dicts with model name as key and an index corresponding to model list location

In [91]:


def model_dict(mod_list,X,Y):
    #takes a list of models,set of X, set of y
    #returns dict
    mod_dict ={}
    idx = 0
    for i in mod_list: 
        mod_type = i.__str__()
        print(f"\n~~~~~~~~~~~~~~~~~~~~~~~~~\nWorking on {mod_type}")
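        if len(Y[0].shape) > 1:
            # NOTE: this wrapped result is overwritten by the unwrapped fit below,
            # so only the model's native multioutput result is kept
            mod_info = model_info(MultiOutputRegressor(i),X,Y)
        mod_info = model_info(i,X,Y)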
        mod_dict[mod_type] =  (idx ,mod_info)
        idx+=1
        pprint(mod_info)
        
    return mod_dict


- What simple model might be used for this problem? Could this be improved upon with a more complex solution?

The Random Forest Classifier is repeatedly the best performer here. Ensemble learning would be a good candidate for improving on the given results.
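One possible form of ensemble learning is a soft-voting ensemble that averages the predicted probabilities of several model types. This is a sketch only: the data is synthetic, and the choice of members and the `voting="soft"` setting are assumptions, not something evaluated in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the CVSSv2 features and scope.
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.83, 0.17], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0)

# Soft voting averages predict_proba across the member models.
ensemble = VotingClassifier(
    estimators=[
        ("rfc", RandomForestClassifier(random_state=0)),
        ("lr", LogisticRegression(max_iter=500)),
        ("cart", DecisionTreeClassifier(random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
acc = ensemble.score(X_test, y_test)
print(acc)
```

Whether this beats the random forest alone would need to be checked on the real data. A random forest is already an ensemble of trees, so the gain may be small.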

In [116]:

model_SVC = svm.SVC()
model_MLP = MLPClassifier()
model_LR =  LogisticRegression(max_iter= 500)
model_LDA =  LinearDiscriminantAnalysis()
model_KNN =  KNeighborsClassifier()
model_CART =  DecisionTreeClassifier()
model_RFC = RandomForestClassifier()
model_NB =  GaussianNB()

mod_list_pt2 = [
    
    #model_SVC,#slow
    model_MLP,
    model_LR,
    model_LDA,
    model_KNN,
    model_CART,
    model_RFC,
    model_NB
    ]



mod_dict_pt2 = model_dict(mod_list_pt2,[X_pt2_train,X_pt2_test], [y_pt2_train,y_pt2_test])


~~~~~~~~~~~~~~~~~~~~~~~~~
Working on MLPClassifier()
~~~ fitting model
~~~ predicting values
~~~ checking validity
{'Confusion Matrix': array([[26610,   624],
       [ 1176,  4347]]),
 'MAE': 0.054950087004304426,
 'model': MLPClassifier(),
 'prediction': array([0, 0, 0, ..., 0, 0, 0]),
 'score': 0.9450499129956956}

~~~~~~~~~~~~~~~~~~~~~~~~~
Working on LogisticRegression(max_iter=500)
~~~ fitting model
~~~ predicting values
~~~ checking validity
{'Confusion Matrix': array([[26457,   777],
       [ 1524,  3999]]),
 'MAE': 0.07024452788716916,
 'model': LogisticRegression(max_iter=500),
 'prediction': array([0, 0, 0, ..., 0, 0, 0]),
 'score': 0.9297554721128308}

~~~~~~~~~~~~~~~~~~~~~~~~~
Working on LinearDiscriminantAnalysis()
~~~ fitting model
~~~ predicting values
~~~ checking validity
{'Confusion Matrix': array([[26032,  1202],
       [ 1110,  4413]]),
 'MAE': 0.07058033397441768,
 'model': LinearDiscriminantAnalysis(),
 'prediction': array([0, 0, 0, ..., 0, 0, 0]),
 'score': 0.929

ŷ dict built from the model dict for easier access to ŷ values

In [93]:
def ŷ_test_dict(mod_dict):
    ŷtest_dict = {}
    for i in mod_dict:
        ŷ = pd.DataFrame(mod_dict[i][1]['prediction'])
        ŷtest_dict[i.__str__()] =  ŷ
    return ŷtest_dict
ŷ_pt2_test_vals = ŷ_test_dict(mod_dict_pt2)

#ŷ_pt2_test_vals

In [111]:
def best_model_chk(mod_dict):
    #takes model dict 
    #returns tuple of index and model score 
    best_mod= ('',0)
    for i in mod_dict:
        presc = mod_dict[i][1]['score']
        sc = np.average(presc) #if isinstance(presc,(int,float)) else np.average(presc)
        if sc > best_mod[1]:
            best_mod = mod_dict[i][0], mod_dict[i][1]['score']

    print(best_mod)
    return best_mod
work_model_pt2 = mod_list_pt2[best_model_chk(mod_dict_pt2)[0]]


(5, 0.9452025521262631)


Selects the rows where `scope` is NaN.

Selects all columns with v2 data except for `acInsufInfo` and drops rows with NaN.

Predicts ŷ using the best model picked by the previous function.

In [117]:
df_y_pt2 = df[df['scope'].isna()]

Xnew_pt2 = df_y_pt2[cvssV2.drop('acInsufInfo')].dropna()

index_pt2 = Xnew_pt2.index

ŷ_pt2 = pd.DataFrame(work_model_pt2.predict(Xnew_pt2.values), index = index_pt2, columns=[['scope']])

ŷ_pt2.value_counts()

(scope,)
0           12160
1            2123
dtype: int64

In [96]:
ŷ_pt2_vals = {}
for i in mod_list_pt2:
    temp = pd.DataFrame(i.predict(Xnew_pt2.values))
    ŷ_pt2_vals[i.__str__()] = temp.value_counts(),temp  # counts for this model's own predictions

In [97]:
for i in ŷ_pt2_vals:
    print(ŷ_pt2_vals[i][0][1]/ŷ_pt2_vals[i][0][0])

0.17555555555555555
0.16757949807896674
0.18590169378943872
0.17759089784813256
0.17574909450115245
0.17420256494574154
0.5985450475657527


predicted `scope` df

In [115]:
ŷ_pt2

Unnamed: 0,scope
0,0
1,0
2,0
3,0
4,0
...,...
15806,0
15807,0
15808,0
15809,0



# Part 3
Predict the CVSSv3 Confidentiality, Integrity and Availability metrics for CVEs without a CVSSv3 vector.
- What type of learning problem is this? What is the target? 


This is a multi output supervised regression problem with 3 targets, `confidentialityImpact_V3`, `integrityImpact_V3`, `availabilityImpact_V3`

each target is calculated as

In [99]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold

from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import RandomForestRegressor

model_RFR = RandomForestRegressor()

In [100]:
mod_list_pt3 = [
    
    model_RFR,
    #model_KNN,#slow
    model_CART,
    model_RFC,
    ]
    

In [101]:
dfNoNan = df.loc[:,df.dtypes != 'object'].drop('acInsufInfo',axis=1).dropna()

In [102]:
response = pd.Index(['confidentialityImpact_V3', 'integrityImpact_V3', 'availabilityImpact_V3'])
X = dfNoNan[cvssV2.drop('acInsufInfo')]
y = dfNoNan[response].astype(int)

- How would you build the training / validation / testing dataset?

In [103]:
X_pt3_train, X_pt3_test, y_pt3_train, y_pt3_test = train_test_split(X, y) 


- Which evaluation metrics would you use?

This is a really simple metric modelled on the sklearn `model.score()` function. For each target it returns 1 minus the number of correct predictions divided by total n, i.e. the per-target error rate. My implementation takes multioutput predictions and returns a list with one value per target column. It does NOT work for a single-target ŷ.

In [104]:
def score(y_test, ŷ): 
    #takes y known , ŷ
    #returns arr with len(ŷ)

    arr_1 = y_test
    arr_2 = ŷ

    if len(arr_1)!=len(arr_2):
        print(len(arr_1), len(arr_2))
        print('!!! NOT the same length !!!')
        return

    shape = arr_2.shape
 
    truth_d = {True:[0]*shape[1], False:[0]*shape[1]}

    for i in range(shape[0]):
        for j in range(shape[1]):
            truth_d[arr_1[i][j] == arr_2[i][j]][j] += 1

    return [1 - truth_d[True][i]/ shape[0] for i in range(shape[1])]
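The same per-target quantity can be computed without explicit loops. The sketch below is a NumPy equivalent under the assumption that both inputs are 2-D arrays of the same shape. The name `per_target_error` and the toy arrays are illustrative and not part of the notebook.

```python
import numpy as np

def per_target_error(y_true, y_pred):
    """Return 1 - (fraction of exact matches) for each target column."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: {y_true.shape} vs {y_pred.shape}")
    # Element-wise comparison, then the mean over rows gives accuracy per column.
    return 1 - (y_true == y_pred).mean(axis=0)

# Toy example: 4 samples, 3 targets; one miss in column 1 and one in column 2.
y_true = np.array([[0, 3, 0], [3, 3, 3], [1, 1, 0], [0, 0, 3]])
y_pred = np.array([[0, 3, 1], [3, 3, 3], [1, 0, 0], [0, 0, 3]])
errors = per_target_error(y_true, y_pred)
print(errors)
```

Raising on a shape mismatch, instead of printing and returning `None`, keeps a silent `None` from flowing into later comparisons.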




- What simple model might be used for this problem? Could this be improved upon with a more complex solution?


Here the `MultiOutputRegressor` acts as a wrapper around an estimator, fitting one regressor per target. As can be seen from the Random Forest Regressor prediction, the output is continuous.

A better approach would be to use a chained regressor, which chains the regressions together in a conditional manner, i.e. y1, y2|ŷ1, y3|(ŷ1 & ŷ2).
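The chained idea can be sketched with scikit-learn's `RegressorChain`, where each later target is fitted with the earlier targets as additional inputs. Everything below is an assumption for illustration: the synthetic features, the way the three targets are built, the chain order and the final rounding to the 0 to 3 scale.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import RegressorChain

# Synthetic ordinal-looking features (values 0..2) standing in for CVSSv2 inputs.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(300, 5)).astype(float)

# Three correlated targets clipped to the 0..3 scale used for the impact metrics.
base = X[:, 0] + X[:, 1]
Y = np.column_stack([
    np.clip(base, 0, 3),
    np.clip(base + X[:, 2] - 1, 0, 3),
    np.clip(X[:, 3] + X[:, 4], 0, 3),
])

# order=[0, 1, 2]: target 1 sees the prediction for target 0, target 2 sees both.
chain = RegressorChain(RandomForestRegressor(n_estimators=50, random_state=0),
                       order=[0, 1, 2])
chain.fit(X[:200], Y[:200])

# Continuous outputs are rounded and clipped back to the integer scale.
pred = np.clip(np.rint(chain.predict(X[200:])), 0, 3).astype(int)
print(pred[:5])
```

The order of the chain is a modelling choice. When the targets have no natural ordering, several orders can be tried and compared on validation data.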

In [105]:
mod_dict_pt3 = model_dict(mod_list_pt3,[X_pt3_train,X_pt3_test], [y_pt3_train,y_pt3_test])


~~~~~~~~~~~~~~~~~~~~~~~~~
Working on RandomForestRegressor()
~~~ fitting model
~~~ predicting values
~~~ checking validity
~~~ fitting model
~~~ predicting values
~~~ checking validity
{'Confusion Matrix': 0,
 'MAE': 0.21795856159953483,
 'model': RandomForestRegressor(),
 'prediction': array([[2.78455130e-01, 2.18375109e+00, 2.38857444e-02],
       [2.99892037e+00, 3.00000000e+00, 2.99783245e+00],
       [9.94455451e-01, 1.02390032e+00, 8.67838768e-03],
       ...,
       [8.20706054e-01, 1.22672856e+00, 8.02149324e-03],
       [0.00000000e+00, 2.35445386e-03, 3.00000000e+00],
       [2.88855498e+00, 2.90345594e+00, 2.88284513e+00]]),
 'score': 0.8655463148907462}

~~~~~~~~~~~~~~~~~~~~~~~~~
Working on DecisionTreeClassifier()
~~~ fitting model
~~~ predicting values
~~~ checking validity
~~~ fitting model
~~~ predicting values
~~~ checking validity
{'Confusion Matrix': 0,
 'MAE': 0.14012465076294864,
 'model': DecisionTreeClassifier(),
 'prediction': array([[0, 3, 0],
       [3, 3, 3]

In [106]:
ŷ_pt3_test_vals = ŷ_test_dict(mod_dict_pt3)


In [107]:
work_model_pt3 = mod_list_pt3[best_model_chk(mod_dict_pt3)[0]]

(0, 0.8655463148907462)


The plan here is the same as above: select all rows of the df that have no CVSSv3 vector, so the response variables are NaN.

Select all columns with v2 data except for `acInsufInfo` and drop rows with NaN.

Predict ŷ using the best model picked above.

In [119]:

df_y_pt3 = df[ df['vectorString_V3' ].isna()]

Xnew_pt3 = df_y_pt3[cvssV2.drop('acInsufInfo')].dropna()


index_pt3 = Xnew_pt3.index


ŷ_pt3 = pd.DataFrame(work_model_pt3.predict(Xnew_pt3.values),index = index_pt3,columns = response).round().astype(int)


In [120]:

ŷ_pt3

Unnamed: 0,confidentialityImpact_V3,integrityImpact_V3,availabilityImpact_V3
0,0,2,0
1,3,3,3
2,3,3,3
3,3,3,3
4,3,3,3
...,...,...,...
15806,3,0,0
15807,3,0,0
15808,3,3,2
15809,3,0,0
