Project to explore the use of machine learning in identifying malware through opcode frequency
Dataset was sourced from Kaggle (url: https://www.kaggle.com/allenye66/final-opcodes)
The dataset consists of malware code obtained through VirusTotal, organized by opcode appearance and frequency. There are 9,725 observations of 1,034 variables. Only two variables are categorical: family and file. Of the 1032 numerical variables, 1031 consisted of the frequency of an opcode in each observation, and one feature described the total number of opcodes in each observation.
Prior to feature selection and reduction, the distribution of classes was plotted in order to see if any classes were under-represented in comparison to others. An imbalance of class labels was identified, with HOTBAR having significantly less occurrences than the other class labels. Therefore, HOTBAR was removed before further data processing steps were taken.
A single feature, family, was identified for use as a target multiclass. This class was converted into a factor for use in classification algorithms.
The correlation values were calculated using Pearson’s Correlation Coefficient with a cutoff of 0.8, which revealed these highly correlated features:
"lea", "xor", "shl", "totalOpcodes", "and", "other", "push", "call", "mov", "sub", "dec", "test", "inc", "in", "imul", "ret", "adc", "cmp", "jp", "jo", "jmp", "je", "jne", "xchg", "outsb", "fstp"
The identified features were removed before feature reduction exercises were performed.
Principal component analysis was used in order to reduce the dimensionality of the dataset. However, the resulting eigenvalues indicated that the variables of this data set are not strongly linked and are most likely highly independent of each other.
The data was then split into training and testing sets using a ratio of 0.8/0.2, respectively. Care was also taken to ensure that each class label was equally represented in both the testing and training data.
This tree produced an accuracy of 0.398, which was to be expected from a dataset with an incredibly large feature size. When there are too many features to select from, information gain becomes very low, and thus the accuracy of the decision tree suffers. However, analysis of the confusion matrix yields a kappa value of 0.345, indicating that the decision tree yields better performance than random guessing. As a final note, the average AUC for all class labels yielded a value of 0.7431.
It is general knowledge that Naïve Bayes is more robust towards datasets with large number of features, and thus it is reasonable to assume that it would produce better accuracy without reducing dimensionality. The initial attempt at creating a Naïve Bayes classifier resulted in a similar accuracy score to the decision tree, with an accuracy of only 0.4202, and a kappa value of 0.373. The average multi-class AUC was 0.7605.
The subsequent attempt at creating a Naïve Bayes classifier involved tuning using the base.tune() function of the e1071 package. This function attempts to tune the hyperparameters of the model by utilizing a grid search over the supplied parameter ranges. However, this function did not produce better results, and displayed the exact same accuracy as the untuned version of the model.
Similar to the Naïve Bayes algorithm, SVM models are known to adapt well to a large number of features. An SVM model was created with a radial kernel and a cost of 20. Various cost functions were explored, ranging from 10 – 50, with no appreciable difference from 20 – 50. The resulting model was comprised of 7046 support vectors. The SVM classifier produced the best results of the three individual classifiers that were explored, with an out-of-the-box accuracy of 0.626 and a kappa value of 0.5929. The multi-class average AUC was calculated to be 0.8875.
An ensemble of 100 bagged decision trees were induced in order to measure the possible performance gains versus increased required computational time. This produced a classifier that performed significantly better than a single decision tree, with an accuracy of 0.59 and a kappa of 0.555. However, this model was computationally very expensive to create. Because it is an ensemble of 100 fully formed decision trees, the computational time was approximately 100 times that of creating a decision tree. In exchange, the model gained an increased accuracy of 48.2%. This model reported a multi-class average AUC of 0.8661.
10-fold cross validation was utilized with the Naïve Bayes classifier. The training data was divided into 10 subsets, with each iteration utilizing 9 of the subsets, and the 10th being utilized for model validation. This has the effect of significantly reducing bias, while also reducing variance. This approach to ensemble learning resulted in an accuracy of 0.424, a kappa of 0.376, and a multi-class average AUC of 0.759. This is very similar to the performance of the non-cross-validated Naïve Bayes model, and implies that withholding data does not help in producing better a priori probabilities.
A random forest classifier was created using the randomForest package for R. This classifier was tuned to create a forest consisting of 500 trees, with each tree being induced using a random subset of features. This is in contrast to a bagged ensemble of decision trees, which consists of fully formed trees. The classifier will then use majority voting when performing a classification task. This particular tuned random forest exhibited an accuracy of 0.520 and a kappa of 0.4747. The multi-class average AUC was 0.814.
An ECOC variant included in the mlr package was applied, which created an ensemble of 20 binary classification decision trees (one for each class label). This particular ensemble method resulted in the worst accuracy of all the classifiers tested, including both individual and ensemble classifiers, of just 0.370 and a kappa of 0.327. The multi-class average AUC is unfortunately unknown as this point due to the classifier output being incompatible with the ROC algorithm.
| Classifier | Accuracy | Kappa | Average AUC | Training Time | Prediction Time |
|---|---|---|---|---|---|
| Decision Tree | 0.398 | 0.345 | 0.743 | 1.94 min | 0.15s |
| Naive Bayes | 0.420 | 0.373 | 0.765 | 9.83 min | 4.79s |
| SVM | 0.626 | 0.593 | 0.888 | 12.53 min | 15.94s |
| Bagged Trees | 0.59 | 0.555 | 0.866 | 22.53 min | 6.58s |
| CV Naive Bayes | 0.59 | 0.376 | 0.759 | 14.1 min | 5.11s |
| Random Forest | 0.424 | 0.475 | 0.814 | 30.54 min | 12.94s |
| ECOC Tree | 0.370 | 0.327 | N/A | 8.61 min | 3.63s |