# Virus Classifier
## Code By [Siraj](https://github.com/llSourcell). Jupyter port by [Phillip Kuznetsov](https://github.com/philkuz)
### From Sirajology's [Build an Antivirus in 5 Min](https://www.youtube.com/watch?v=iLNHVwSu9EA). 

[GitHub repo](https://github.com/llSourcell/antivirus_demo). Make sure [data.csv](https://github.com/philkuz/antivirus_demo/raw/master/data.csv) is in this folder before running the notebook.

In [5]:
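import pandas as pd
import numpy as np
import pickle
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import sklearn.ensemble as ske
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import model_selection as cross_validation
from sklearn import tree, linear_model
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix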

### Import data and remove extra columns

In [8]:
data = pd.read_csv('data.csv', sep='|')
X = data.drop(['Name', 'md5', 'legitimate'], axis=1).values
y = data['legitimate']
print('Researching important features based on %i total features\n' % X.shape[1])


Researching important features based on 54 total features



### Find the most important features for classification

In [23]:
fsel = ske.ExtraTreesClassifier().fit(X, y)
model = SelectFromModel(fsel, prefit=True)
X_new = model.transform(X)
nb_features = X_new.shape[1]

In [24]:
features = []
print('%i features identified as important' % nb_features)

# Rank features by importance, highest first, and keep the top nb_features.
# The +2 offset assumes 'Name' and 'md5' are the first two columns of data.csv.
indices = np.argsort(fsel.feature_importances_)[::-1][:nb_features]
for f in range(nb_features):
    print("%d. feature %s (%f)" % (f + 1, data.columns[2+indices[f]], fsel.feature_importances_[indices[f]]))

# Store the selected feature names in their original column order.
for f in sorted(indices):
    features.append(data.columns[2+f])

12 features identified as important
1. feature Characteristics (0.161483)
2. feature DllCharacteristics (0.129093)
3. feature Machine (0.107110)
4. feature ResourcesMaxEntropy (0.081689)
5. feature SectionsMaxEntropy (0.069531)
6. feature Subsystem (0.066103)
7. feature MajorSubsystemVersion (0.051514)
8. feature ResourcesMinEntropy (0.034731)
9. feature VersionInformationSize (0.032248)
10. feature ImageBase (0.026219)
11. feature SizeOfOptionalHeader (0.020073)
12. feature MajorOperatingSystemVersion (0.019690)
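
Why 12 features and not some other number? When no `threshold` is passed, `SelectFromModel` keeps the features whose importance is at least the mean importance of the fitted tree ensemble. The sketch below checks that rule on synthetic data. `X_demo`, `y_demo`, and the other names are made up for illustration and stand in for the real `data.csv`, so the counts it prints will not match the ones above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the PE-header features (assumption: not the real data.csv).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)
selector = SelectFromModel(forest, prefit=True)
X_demo_new = selector.transform(X_demo)

# With no explicit threshold, the cut-off for a tree ensemble is the mean importance.
importances = forest.feature_importances_
kept_by_mean = importances >= importances.mean()

print("kept %d of %d features" % (X_demo_new.shape[1], X_demo.shape[1]))
print("matches mean-importance rule:",
      np.array_equal(selector.get_support(), kept_by_mean))
```

Passing an explicit `threshold` (for example `"median"`) or `max_features` to `SelectFromModel` is one way to keep more or fewer columns if the default cut-off does not suit the data.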


### Split the data into training and test sets

In [11]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X_new, y, test_size=0.2)
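
The split above is drawn at random on every run, so the scores further down will shift slightly from one run to the next. If reproducible numbers are wanted, one option is to fix `random_state` and stratify on the labels so both sets keep the same class ratio. This is a sketch on made-up data, not the notebook's own split; the seed `42` and the 30% positive rate are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows, 12 features, roughly 30% positive labels (assumption).
rng = np.random.RandomState(0)
X_demo = rng.rand(1000, 12)
y_demo = (rng.rand(1000) < 0.3).astype(int)

# stratify keeps the class ratio equal in both sets; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(X_tr.shape, X_te.shape)
print("positive rate: train %.3f, test %.3f" % (y_tr.mean(), y_te.mean()))
```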

### Train and Evaluate each algorithm

In [18]:
algorithms = {
    "DecisionTree": tree.DecisionTreeClassifier(max_depth=10), 
    "RandomForest": ske.RandomForestClassifier(n_estimators=50), 
    "GradientBoosting" : ske.GradientBoostingClassifier(n_estimators=50),
    "AdaBoost": ske.AdaBoostClassifier(n_estimators=100),
    "GNB": GaussianNB()
}

results={}
print("\nNow testing algorithms")
for algo in algorithms:
    clf = algorithms[algo]
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    print("%s : %f %%" % (algo, score*100))
    results[algo] = score

winner = max(results, key=results.get)
print('\nWinner algorithm is %s with a %f %% success' % (winner, results[winner]*100))


Now testing algorithms
GradientBoosting : 98.721478 %
GNB : 70.315103 %
DecisionTree : 98.946034 %
RandomForest : 99.188700 %
AdaBoost : 98.341181 %

Winner algorithm is RandomForest with a 99.188700 % success
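
Accuracy alone does not say which kind of mistake the classifier makes, and for an antivirus a missed malware sample is a different problem from a flagged legitimate file. The notebook imports `confusion_matrix` but never calls it; a minimal sketch of how it could be used follows. It runs on synthetic data and assumes the label encoding suggested by the `legitimate` column name (1 = legitimate, 0 = malware), so the percentages say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; assumed encoding: 1 = legitimate, 0 = malware.
X_demo, y_demo = make_classification(n_samples=1000, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

clf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, clf_demo.predict(X_te), labels=[0, 1])

# Rows are true labels, columns are predicted labels.
malware_missed = cm[0, 1]   # true malware predicted as legitimate
legit_flagged = cm[1, 0]    # true legitimate predicted as malware

print("malware predicted as legitimate: %.2f %%" % (100.0 * malware_missed / cm[0].sum()))
print("legitimate predicted as malware: %.2f %%" % (100.0 * legit_flagged / cm[1].sum()))
```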


In [22]:
print('Saving algorithm and feature list in classifier directory...')
joblib.dump(algorithms[winner], 'classifier/classifier.pkl')
# Use a context manager so the file is flushed and closed after writing.
with open('classifier/features.pkl', 'wb') as f:
    pickle.dump(features, f)
print('Saved')

Saving algorithm and feature list in classifier directory...
Saved
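
To use the saved files later, both have to be loaded back: the classifier with `joblib.load` and the feature list with `pickle.load`. The round trip might look like the sketch below. It writes to a temporary directory and uses a small synthetic model, and the `feature_%d` names are hypothetical placeholders rather than the real PE-header feature names.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model standing in for the winning classifier (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=12, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)
feature_names = ["feature_%d" % i for i in range(X_demo.shape[1])]  # hypothetical names

# Save to a temporary directory instead of classifier/ to avoid touching real files.
out_dir = tempfile.mkdtemp()
clf_path = os.path.join(out_dir, "classifier.pkl")
feat_path = os.path.join(out_dir, "features.pkl")

joblib.dump(clf_demo, clf_path)
with open(feat_path, "wb") as fh:
    pickle.dump(feature_names, fh)

# Load both artifacts back and confirm they behave like the originals.
loaded_clf = joblib.load(clf_path)
with open(feat_path, "rb") as fh:
    loaded_features = pickle.load(fh)

same = (loaded_clf.predict(X_demo) == clf_demo.predict(X_demo)).all()
print("features restored:", loaded_features == feature_names)
print("predictions identical:", bool(same))
```

Pickled scikit-learn models should only be loaded from trusted sources and, ideally, with the same scikit-learn version that saved them.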
