
MalHyStack: A hybrid stacked ensemble learning framework with feature engineering schemes for obfuscated malware analysis

Read the full paper

Abstract

Since its advent, malware has taken a heavy toll on a world that exchanges billions of pieces of data daily. Millions of people fall victim to it, and the numbers are not decreasing as the years go by. Malware comes in many varieties, of which obfuscated malware is a special kind: it usually evades detection yet is prevalent in the real world, so detecting it is essential. Although numerous works have been done in this field, most still fall short in some respects, considering the scope for exploration through recent extensions. In addition, hybrid classification models have yet to be popularized in this field. Thus, in this paper, a novel hybrid classification model named MalHyStack is proposed for detecting such obfuscated malware within the network. The proposed model incorporates a stacked ensemble learning scheme in which conventional machine learning algorithms, namely the Extremely Randomized Trees (ExtraTrees), Extreme Gradient Boosting (XgBoost), and Random Forest classifiers, form the first layer, followed by a deep learning layer in the second stage. Before the classification model is applied to malware detection, an optimal subset of features is selected using Pearson correlation analysis, which improves the accuracy of the model by more than 2% for multiclass classification. It also reduces time complexity by approximately two and three times for binary and multiclass classification, respectively. The performance of the proposed model is evaluated on the recently published balanced dataset CIC-MalMem-2022, on which the overall experimental results show superior performance compared to existing classification models.

Highlights

  • Hybrid stacking of multiple models outperforms any individual model.
  • A deep learning-based model performs better as the meta-learner in the ensemble.
  • Transforming features to a standard scale enables better processing of the data.
  • Reducing the feature set results in lower complexity and higher accuracy.
  • The Pearson correlation coefficient is used to reduce the dimensionality of the dataset.
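The correlation-based feature reduction in the last two highlights can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' exact pipeline; the 0.9 threshold is an assumed example value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic feature matrix: f2 is almost a copy of f0, so the two are
# highly correlated and one of them is redundant.
f0 = rng.normal(size=200)
df = pd.DataFrame({
    "f0": f0,
    "f1": rng.normal(size=200),
    "f2": f0 + rng.normal(scale=0.01, size=200),
})

# Absolute Pearson correlation between every pair of features.
corr = df.corr(method="pearson").abs()

# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair whose |r| exceeds the threshold.
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)

print("dropped:", to_drop)
print("kept:", list(reduced.columns))
```

With the synthetic data above, `f2` is flagged as redundant and removed, leaving `f0` and `f1`.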

Overall Flow Diagram

(Figure: overall flow diagram)

Dataset

The secondary dataset can be downloaded from the Canadian Institute for Cybersecurity datasets page.

Dataset Link

Architecture of the malware family/category detection model

(Figure: architecture of the malware family/category detection model)

Stacking

A stack ensemble, also known as stacked generalization, is a powerful machine learning technique that combines the predictions of multiple models to outperform any single model alone. Imagine it like hiring a team of experts and asking them to vote on a decision. They each bring their own perspectives and biases, but by combining their insights, you get a more accurate and reliable outcome.

Here's a simplified breakdown:

  1. Build multiple "base models" of different types (e.g., decision trees, support vector machines). These models individually learn from the data you provide.

  2. Train a "meta-model" on the output predictions of the base models. This meta-model learns how to best combine the individual predictions into a single, final prediction.

  3. Use the meta-model for actual predictions: When you have new data, you run it through the base models first. Their predictions then become the input for the meta-model, which generates the final prediction.
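The three steps above can be sketched with scikit-learn. This is a hypothetical minimal example, not the paper's implementation: logistic regression stands in for the deep-learning meta-learner, and out-of-fold predictions are used so the meta-model never trains on leaked labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Step 1: base models of different types.
base_models = [
    ExtraTreesClassifier(n_estimators=50, random_state=42),
    RandomForestClassifier(n_estimators=50, random_state=42),
]

# Step 2: build the meta-model's training set from out-of-fold
# predicted probabilities of each base model.
S_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=4, method="predict_proba")[:, 1]
    for m in base_models
])
meta = LogisticRegression().fit(S_train, y_tr)

# Step 3: at prediction time, base-model outputs feed the meta-model.
S_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])
y_pred = meta.predict(S_test)
print(f"stacked accuracy: {accuracy_score(y_te, y_pred):.3f}")
```

The out-of-fold construction in step 2 is the key design choice: fitting the meta-model on in-sample base predictions would overestimate how much to trust each base model.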

Stacking benefits:

  • Improved accuracy: Combines the strengths of different models, often leading to better performance than any single model.
  • Reduced variance: By averaging out individual model errors, stacking can make predictions more stable.
  • Flexibility: Allows using diverse models within the ensemble, leveraging their unique strengths.

However, stacking can also be:

  • More complex to implement: Requires training and tuning multiple models and the meta-model.
  • Computationally expensive: Training multiple models can be time-consuming and resource-intensive.

Overall, a stack ensemble is a valuable tool for machine learning practitioners seeking to boost model performance and achieve better results.

Here we have used Vecstack, a Python package for stacking (stacked generalization) that features a lightweight functional API and a fully scikit-learn-compatible API.

(Figure: stacked generalization scheme)

Libraries

  • Python 3.5+
  • Keras 2.1.0+
  • Tensorflow 1.10.0+

Platform

  • Google Colab

Python Script

Python script for all of the pre-processing, feature engineering, model implementation, and evaluation: Link

Citation

If you find this repository useful in your research, please cite this article as:

IEEE Style- K. S. Roy, T. Ahmed, P. B. Udas, Md. E. Karim, and S. Majumdar, “MalHyStack: A hybrid stacked ensemble learning framework with feature engineering schemes for obfuscated malware analysis,” Intelligent Systems with Applications, vol. 20, p. 200283, Nov. 2023, doi: 10.1016/j.iswa.2023.200283.

APA Style- Roy, K. S., Ahmed, T., Udas, P. B., Karim, M. E., & Majumdar, S. (2023, November). MalHyStack: A hybrid stacked ensemble learning framework with feature engineering schemes for obfuscated malware analysis. Intelligent Systems With Applications, 20, 200283. https://doi.org/10.1016/j.iswa.2023.200283

Contact-Info

Please feel free to contact us with any questions or cooperation opportunities. We will be happy to help.
