Malware classification through opcode frequency

Project to explore the use of machine learning in identifying malware through opcode frequency

Data Description

The dataset was sourced from Kaggle: https://www.kaggle.com/allenye66/final-opcodes

The dataset consists of malware code obtained through VirusTotal, organized by opcode appearance and frequency. There are 9,725 observations of 1,034 variables. Only two variables are categorical: family and file. Of the 1,032 numerical variables, 1,031 record the frequency of a specific opcode in each observation, and one records the total number of opcodes in each observation.
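
As a minimal sketch, the data can be loaded and inspected as follows (the file name final_opcodes.csv is an assumption; substitute the actual CSV from the Kaggle download):

```r
# Load the opcode-frequency dataset; the file name is an assumption
data <- read.csv("final_opcodes.csv", stringsAsFactors = FALSE)

dim(data)         # expect 9,725 rows x 1,034 columns
str(data[, 1:5])  # inspect the first few columns (family, file, opcode counts)
```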

Data Pre-Processing

Prior to feature selection and reduction, the distribution of classes was plotted to check whether any classes were under-represented relative to the others. A class imbalance was identified: HOTBAR had significantly fewer occurrences than the other class labels, so it was removed before further data processing steps were taken.
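
A sketch of the imbalance check and removal, assuming the loading step above:

```r
# Plot observations per malware family to spot under-represented classes
counts <- sort(table(data$family), decreasing = TRUE)
barplot(counts, las = 2, cex.names = 0.7,
        main = "Observations per malware family")

# Drop the under-represented HOTBAR class
data <- data[data$family != "HOTBAR", ]
```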

Feature Selection

A single feature, family, was selected as the multiclass target. It was converted to a factor for use in the classification algorithms.
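
In R this is a one-liner:

```r
# Convert the multiclass target to a factor for the classifiers below
data$family <- as.factor(data$family)
levels(data$family)  # the remaining malware families
```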

Correlation

Pairwise correlations were calculated using Pearson's correlation coefficient; applying a cutoff of 0.8 flagged the following highly correlated features:

"lea", "xor", "shl", "totalOpcodes", "and", "other", "push", "call", "mov", "sub", "dec", "test", "inc", "in", "imul", "ret", "adc", "cmp", "jp", "jo", "jmp", "je", "jne", "xchg", "outsb", "fstp"

The identified features were removed before feature reduction exercises were performed.
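
The report does not show the code for this step; one common approach uses caret's findCorrelation, which flags one feature from each pair whose absolute correlation exceeds the cutoff:

```r
library(caret)

# Pearson correlation over the numeric predictors only
num_cols <- sapply(data, is.numeric)
cor_mat  <- cor(data[, num_cols], method = "pearson")

# Flag one feature from each pair correlated above the 0.8 cutoff
high_cor <- findCorrelation(cor_mat, cutoff = 0.8, names = TRUE)

data <- data[, !(names(data) %in% high_cor)]
```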

Feature Reduction

Principal component analysis was used to reduce the dimensionality of the dataset. However, the resulting eigenvalues showed the variance spread thinly across many components rather than concentrated in the first few, indicating that the variables of this dataset are not strongly linked and are most likely highly independent of one another, so PCA offered little reduction.
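
A sketch of the PCA step using base R's prcomp:

```r
# PCA on the remaining numeric predictors (centered and scaled)
pca <- prcomp(data[, sapply(data, is.numeric)], center = TRUE, scale. = TRUE)

# If the cumulative proportion grows slowly, variance is spread across
# many components and PCA offers little dimensionality reduction
summary(pca)$importance["Cumulative Proportion", 1:10]
```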

Training vs Testing

The data was then split into training and testing sets using an 80/20 ratio. The split was stratified so that each class label was represented in the same proportion in both sets.
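
caret's createDataPartition performs exactly this kind of stratified split (the seed value is arbitrary):

```r
library(caret)

set.seed(42)  # arbitrary seed for reproducibility
# createDataPartition samples within each class, preserving class
# proportions in both splits
idx   <- createDataPartition(data$family, p = 0.8, list = FALSE)
train <- data[idx, ]
test  <- data[-idx, ]
```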

Single Classifier Exploration

Decision Tree

This tree produced an accuracy of 0.398, which was to be expected from a dataset with such a large number of features: when there are so many features to split on, the information gain of any single split is low, and the accuracy of the decision tree suffers. However, analysis of the confusion matrix yields a kappa value of 0.345, indicating that the decision tree performs better than random guessing. Finally, the average AUC across all class labels was 0.7431.
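
A sketch using rpart for the tree and caret for the accuracy and kappa metrics (the report's exact settings are not shown):

```r
library(rpart)
library(caret)

# Single CART decision tree over all remaining features
tree_fit  <- rpart(family ~ ., data = train, method = "class")
tree_pred <- predict(tree_fit, test, type = "class")

# confusionMatrix reports both overall accuracy and Cohen's kappa
confusionMatrix(tree_pred, test$family)
```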

Naive Bayes

Naïve Bayes is generally more robust to datasets with a large number of features, so it was reasonable to expect better accuracy without reducing dimensionality. The initial Naïve Bayes classifier, however, scored similarly to the decision tree, with an accuracy of only 0.4202 and a kappa value of 0.373. The average multi-class AUC was 0.7605.
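
The e1071 package used below provides the classifier directly:

```r
library(e1071)
library(caret)

nb_fit  <- naiveBayes(family ~ ., data = train)
nb_pred <- predict(nb_fit, test)

confusionMatrix(nb_pred, test$family)
```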

The subsequent attempt involved tuning with the tune() function of the e1071 package, which performs a grid search over the supplied hyperparameter ranges. However, this did not produce better results: the tuned model displayed exactly the same accuracy as the untuned version.
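
A sketch of the grid search; the parameter grid shown (Laplace smoothing values) is an assumption, as the report does not list the ranges that were searched:

```r
library(e1071)

# Grid search over Laplace smoothing; the grid itself is an assumption
nb_tuned <- tune(naiveBayes, family ~ ., data = train,
                 ranges = list(laplace = c(0, 0.5, 1, 2)))

nb_tuned$best.parameters  # winning hyperparameters
nb_tuned$best.model       # refitted model with those parameters
```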

Support Vector Machine

Like Naïve Bayes, SVM models are known to adapt well to a large number of features. An SVM model was created with a radial kernel and a cost of 20; cost values from 10 to 50 were explored, with no appreciable difference between 20 and 50. The resulting model comprised 7,046 support vectors. The SVM classifier produced the best results of the three individual classifiers explored, with an out-of-the-box accuracy of 0.626 and a kappa value of 0.5929. The multi-class average AUC was calculated to be 0.8875.
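
The corresponding e1071 call, as a sketch:

```r
library(e1071)

# Radial-basis SVM; cost = 20 chosen after exploring values from 10 to 50
svm_fit <- svm(family ~ ., data = train, kernel = "radial", cost = 20)
svm_fit$tot.nSV  # total number of support vectors

svm_pred <- predict(svm_fit, test)
caret::confusionMatrix(svm_pred, test$family)
```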

Ensemble Classifiers

Subsampling: Bagging

An ensemble of 100 bagged decision trees was induced to measure the possible performance gain against the increased computational cost. This produced a classifier that performed significantly better than a single decision tree, with an accuracy of 0.59 and a kappa of 0.555. However, the model was computationally very expensive to create: because it is an ensemble of 100 fully grown decision trees, training took approximately 100 times as long as a single tree. In exchange, accuracy improved by 48.2% relative to the single decision tree. This model reported a multi-class average AUC of 0.8661.
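
The report does not show its bagging code; this manual bootstrap-and-vote loop is one way to reproduce the described setup (ipred::bagging is a packaged alternative):

```r
library(rpart)

set.seed(42)
n_trees <- 100
bag     <- vector("list", n_trees)

# Grow each tree on a bootstrap resample of the training data
for (i in seq_len(n_trees)) {
  boot     <- train[sample(nrow(train), replace = TRUE), ]
  bag[[i]] <- rpart(family ~ ., data = boot, method = "class")
}

# Majority vote across the 100 trees
votes    <- sapply(bag, function(m) as.character(predict(m, test, type = "class")))
bag_pred <- factor(apply(votes, 1, function(v) names(which.max(table(v)))),
                   levels = levels(train$family))

caret::confusionMatrix(bag_pred, test$family)
```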

Subsampling: Cross Validation Committees

10-fold cross validation was utilized with the Naïve Bayes classifier: the training data was divided into 10 subsets, with each iteration training on 9 of them and validating on the 10th. This primarily reduces the variance of the resulting committee. This approach to ensemble learning resulted in an accuracy of 0.424, a kappa of 0.376, and a multi-class average AUC of 0.759. This is very similar to the performance of the non-cross-validated Naïve Bayes model, implying that withholding data does not help in producing better a priori probabilities.
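
A sketch of the committee, assuming the fold models vote on the test set (the report's exact voting scheme is not shown):

```r
library(e1071)
library(caret)

set.seed(42)
folds <- createFolds(train$family, k = 10)

# Train one Naive Bayes model per fold, holding that fold out;
# the 10 models then vote as a committee
committee <- lapply(folds, function(holdout) {
  naiveBayes(family ~ ., data = train[-holdout, ])
})

votes   <- sapply(committee, function(m) as.character(predict(m, test)))
cv_pred <- factor(apply(votes, 1, function(v) names(which.max(table(v)))),
                  levels = levels(train$family))

confusionMatrix(cv_pred, test$family)
```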

Random Forest

A random forest classifier was created using the randomForest package for R, tuned to grow a forest of 500 trees. Unlike a bagged ensemble, where every feature is available at each split, each split in a random forest tree considers only a random subset of the features. The classifier then uses majority voting across the trees when performing a classification task. This tuned random forest exhibited an accuracy of 0.520 and a kappa of 0.4747. The multi-class average AUC was 0.814.
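
The corresponding call, with the classification default of sqrt(p) features per split made explicit (the tuned mtry value is not stated in the report):

```r
library(randomForest)

set.seed(42)
# 500 trees; each split samples mtry features, which is what
# distinguishes a random forest from plain bagging
rf_fit <- randomForest(family ~ ., data = train, ntree = 500,
                       mtry = floor(sqrt(ncol(train) - 1)))

rf_pred <- predict(rf_fit, test)
caret::confusionMatrix(rf_pred, test$family)
```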

ECOC-type Classifier

An ECOC variant included in the mlr package was applied, creating an ensemble of 20 binary decision trees (one for each class label). This ensemble method resulted in the worst accuracy of all the classifiers tested, individual or ensemble: just 0.370, with a kappa of 0.327. The multi-class average AUC is unknown at this point because the classifier's output is incompatible with the ROC algorithm used.
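
The exact ECOC construction is not shown in the report; mlr's multiclass wrapper with a one-vs-rest codematrix is a close equivalent of the one-tree-per-class setup described:

```r
library(mlr)

task <- makeClassifTask(data = train, target = "family")
lrn  <- makeLearner("classif.rpart")

# Reduce the multiclass problem to one binary tree per class
ecoc <- makeMulticlassWrapper(lrn, mcw.method = "onevsrest")

mod  <- train(ecoc, task)
pred <- predict(mod, newdata = test)
performance(pred, measures = list(acc, kappa))
```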

Final Results

| Classifier      | Accuracy | Kappa | Average AUC | Training Time | Prediction Time |
|-----------------|----------|-------|-------------|---------------|-----------------|
| Decision Tree   | 0.398    | 0.345 | 0.743       | 1.94 min      | 0.15 s          |
| Naive Bayes     | 0.420    | 0.373 | 0.765       | 9.83 min      | 4.79 s          |
| SVM             | 0.626    | 0.593 | 0.888       | 12.53 min     | 15.94 s         |
| Bagged Trees    | 0.590    | 0.555 | 0.866       | 22.53 min     | 6.58 s          |
| CV Naive Bayes  | 0.424    | 0.376 | 0.759       | 14.1 min      | 5.11 s          |
| Random Forest   | 0.520    | 0.475 | 0.814       | 30.54 min     | 12.94 s         |
| ECOC Tree       | 0.370    | 0.327 | N/A         | 8.61 min      | 3.63 s          |
