How do support vector machines compare to nearest neighbors in terms of generalization accuracy and training time? Run a support vector machine on the malware data from Lesson 7 to see. Which kernel (linear, polynomial, sigmoid, or rbf) appears to be the best for this dataset?

In this notebook we will train classifiers to identify malware.

We load our libraries below.

In [6]:
from sklearn import neighbors
from sklearn import svm

import matplotlib.pyplot as plt

In the next code cell, we download our training and test data from GitHub.

In [2]:
!wget https://github.com/mlittmancs/great_courses_ml/raw/master/malware-test.csv
!wget https://github.com/mlittmancs/great_courses_ml/raw/master/malware-train.csv

--2020-06-28 11:09:59--  https://github.com/mlittmancs/great_courses_ml/raw/master/malware-test.csv
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/mlittmancs/great_courses_ml/master/malware-test.csv [following]
--2020-06-28 11:10:00--  https://raw.githubusercontent.com/mlittmancs/great_courses_ml/master/malware-test.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11593257 (11M) [text/plain]
Saving to: ‘malware-test.csv’


2020-06-28 11:10:00 (27.8 MB/s) - ‘malware-test.csv’ saved [11593257/11593257]

--2020-06-28 11:10:01--  https://github.com/mlittmancs/great_courses_ml/raw/master/malware-train

We next declare a `getdat` function, which reads the data and labels from a file. It loops over the lines of the file one by one, splitting each line into components at the commas. The first component is the label: the data file uses `pe-malicious` to mark the positive instances of malware, so we use `== "pe-malicious"` to turn the labels into 1/`True` for malware and 0/`False` for safe. The remaining components are the features. Since they are read from the file as strings, each one needs to be converted to a floating point number. The function returns the data and the labels for the file.

We then use this function to get the training and test data used for our model.



In [3]:
def getdat(filename):
    """Read a CSV file and return (features, labels)."""
    with open(filename, "r") as f:
        data = f.readlines()
    dat = []
    labs = []
    for line in data:
        wordline = line.split(",")
        # First column is the label: True for malware, False otherwise.
        labs.append(wordline[0] == "pe-malicious")
        # Remaining columns are numeric features stored as text.
        dat.append([float(x) for x in wordline[1:]])
    return dat, labs

traindat, trainlabs = getdat("malware-train.csv")
testdat, testlabs = getdat("malware-test.csv")
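
To make the parsing step concrete, here is a minimal sketch of how a single line becomes a label and a feature vector. The helper name `parse_line` and the sample lines are illustrative assumptions, not part of the original notebook; the real rows have many more columns, and the label `other` is made up.

```python
def parse_line(line):
    """Split one CSV line into (is_malware, features)."""
    fields = line.strip().split(",")
    # The first field is the label; "pe-malicious" marks malware.
    is_malware = fields[0] == "pe-malicious"
    # The remaining fields are numeric features stored as text.
    features = [float(x) for x in fields[1:]]
    return is_malware, features

# Made-up sample lines, for illustration only.
print(parse_line("pe-malicious,0.5,12,3.0\n"))  # (True, [0.5, 12.0, 3.0])
print(parse_line("other,1.0,0,7.25\n"))         # (False, [1.0, 0.0, 7.25])
```

Any label other than `pe-malicious` maps to `False`, which is why the comparison alone is enough to produce a binary target.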

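We define a `testscore` function, which counts the correctly classified examples in a given dataset using the current classifier `clf`. Dividing that count by the number of examples gives the accuracy.

We use `testscore` to calculate the accuracy of the nearest-neighbor model on the first 4000 training examples and on the test set. The notebook considers four values of k: 1, 5, 7, and 9. The cell below shows k = 1 with the cosine metric; the other values are obtained by changing `n_neighbors`.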

In [21]:
def testscore(dat, labs):
    # Count how many predictions of the global classifier clf match the labels.
    yhats = clf.predict(dat)
    correct = sum(yhats[i] == labs[i] for i in range(len(dat)))
    return correct

m = 4000  # number of training examples to use
clf = neighbors.KNeighborsClassifier(n_neighbors=1, metric="cosine")
clf = clf.fit(traindat[:m], trainlabs[:m])
# (training accuracy, test accuracy)
testscore(traindat[:m], trainlabs[:m])/m, testscore(testdat, testlabs)/len(testlabs)

(1.0, 0.8705)
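
Because `testscore` reads the global variable `clf`, it scores whichever model was fitted most recently. A variant that takes the predictions explicitly avoids that coupling. This is only a sketch: the name `accuracy` is hypothetical, and the toy lists stand in for real predictions and labels.

```python
def accuracy(predictions, labels):
    """Return the fraction of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy example: 3 of 4 predictions agree with the labels.
print(accuracy([True, False, True, True], [True, False, False, True]))  # 0.75
```

With a fitted classifier, this could be called as `accuracy(clf.predict(testdat), testlabs)`.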

Now we train SVMs on the same 4000 examples, varying the kernel.

In [20]:
m = 4000
for k in ['linear', 'poly', 'sigmoid', 'rbf']:
    # Train an SVM with this kernel, then report (training accuracy, test accuracy).
    clf = svm.SVC(kernel=k)
    clf = clf.fit(traindat[:m], trainlabs[:m])
    score = testscore(traindat[:m], trainlabs[:m])/m, testscore(testdat, testlabs)/len(testlabs)
    print(k, score)

linear (0.85875, 0.846)
poly (0.91, 0.861)
sigmoid (0.51275, 0.5115)
rbf (0.91125, 0.8865)
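
The opening question also asks about training time, which the cells in this notebook do not measure. One way to do so is to wrap each `fit` call in a small timer. The sketch below uses only the standard library; `timed` is a hypothetical helper, and `slow_square` is a stand-in for a real training call.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def slow_square(x):
    # Stand-in for an expensive call such as fitting a model.
    time.sleep(0.01)
    return x * x

value, seconds = timed(slow_square, 7)
print(value, f"{seconds:.3f}s")
```

In the notebook, `clf, seconds = timed(svm.SVC(kernel=k).fit, traindat[:m], trainlabs[:m])` would time each kernel, and the same pattern applies to `neighbors.KNeighborsClassifier(...).fit`. Nearest neighbors does most of its work at prediction time, so timing `clf.predict` as well gives a fairer comparison.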
