
MalHyStack: A hybrid stacked ensemble learning framework with feature engineering schemes for obfuscated malware analysis

Read the full paper

Abstract

Since its advent, malware has taken a heavy toll on a world that exchanges billions of pieces of data daily. Millions of people fall victim to it, and the numbers are not decreasing as the years go by. Malware comes in many varieties, of which obfuscated malware is a special kind: it usually evades detection yet is prevalent in the real world, so detecting it is essential. Although numerous works have been done in this field, most still fall short in some respects, considering the scope for exploration through recent extensions. In addition, hybrid classification models have yet to be popularized in this field. Thus, in this paper, a novel hybrid classification model named MalHyStack is proposed for detecting such obfuscated malware within the network. The proposed model incorporates a stacked ensemble learning scheme in which conventional machine learning algorithms, namely the Extremely Randomized Trees (ExtraTrees), Extreme Gradient Boosting (XgBoost), and Random Forest classifiers, form the first layer, followed by a deep learning layer in the second stage. Before the classification model is applied to malware detection, an optimal subset of features is selected using Pearson correlation analysis, which improves the accuracy of the model by more than 2% for multiclass classification. It also reduces time complexity by approximately two and three times for binary and multiclass classification, respectively. The performance of the proposed model is evaluated on the recently published balanced dataset CIC-MalMem-2022, on which the overall experimental results show superior performance compared to existing classification models.

Highlights

  • Hybrid stacking of multiple models outperforms any individual model.
  • A deep learning-based model performs better as the meta-learner in the ensemble.
  • Transforming features to a standard scale enables better processing of the data.
  • Reducing the feature set results in lower complexity and higher accuracy.
  • The Pearson correlation coefficient is used to reduce the dimensionality of the dataset.
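The correlation-based feature reduction in the last two highlights can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' exact pipeline; the 0.9 threshold is an assumed example value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic feature matrix: f2 is almost a copy of f0, so the two are
# highly correlated and one of them is redundant.
f0 = rng.normal(size=200)
df = pd.DataFrame({
    "f0": f0,
    "f1": rng.normal(size=200),
    "f2": f0 + rng.normal(scale=0.01, size=200),
})

# Absolute Pearson correlation between every pair of features.
corr = df.corr(method="pearson").abs()

# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair whose |r| exceeds the threshold.
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)

print("dropped:", to_drop)
print("kept:", list(reduced.columns))
```

With the synthetic data above, `f2` is flagged as redundant and removed, leaving `f0` and `f1`.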

Overall Flow Diagram

(Figure: overall flow diagram)

Dataset

The secondary dataset can be downloaded from the Canadian Institute for Cybersecurity datasets page.

Dataset Link

Architecture of the malware family/category detection model

(Figure: architecture of the malware family/category detection model)

Stacking

A stack ensemble, also known as stacked generalization, is a powerful machine learning technique that combines the predictions of multiple models to outperform any single model alone. Imagine it like hiring a team of experts and asking them to vote on a decision. They each bring their own perspectives and biases, but by combining their insights, you get a more accurate and reliable outcome.

Here's a simplified breakdown:

  1. Build multiple "base models" of different types (e.g., decision trees, support vector machines). These models individually learn from the data you provide.

  2. Train a "meta-model" on the output predictions of the base models. This meta-model learns how to best combine the individual predictions into a single, final prediction.

  3. Use the meta-model for actual predictions: When you have new data, you run it through the base models first. Their predictions then become the input for the meta-model, which generates the final prediction.
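The three steps above can be sketched with scikit-learn. This is a hypothetical minimal example, not the paper's implementation: logistic regression stands in for the deep-learning meta-learner, and out-of-fold predictions are used so the meta-model never trains on leaked labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Step 1: base models of different types.
base_models = [
    ExtraTreesClassifier(n_estimators=50, random_state=42),
    RandomForestClassifier(n_estimators=50, random_state=42),
]

# Step 2: build the meta-model's training set from out-of-fold
# predicted probabilities of each base model.
S_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=4, method="predict_proba")[:, 1]
    for m in base_models
])
meta = LogisticRegression().fit(S_train, y_tr)

# Step 3: at prediction time, base-model outputs feed the meta-model.
S_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])
y_pred = meta.predict(S_test)
print(f"stacked accuracy: {accuracy_score(y_te, y_pred):.3f}")
```

The out-of-fold construction in step 2 is the key design choice: fitting the meta-model on in-sample base predictions would overestimate how much to trust each base model.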

Stacking benefits:

  • Improved accuracy: Combines the strengths of different models, often leading to better performance than any single model.
  • Reduced variance: By averaging out individual model errors, stacking can make predictions more stable.
  • Flexibility: Allows using diverse models within the ensemble, leveraging their unique strengths.

However, stacking can also be:

  • More complex to implement: Requires training and tuning multiple models and the meta-model.
  • Computationally expensive: Training multiple models can be time-consuming and resource-intensive.

Overall, a stack ensemble is a valuable tool for machine learning practitioners seeking to boost model performance and achieve better results.

Here we have used Vecstack, a Python package for stacking (stacked generalization) that features a lightweight functional API and a fully scikit-learn-compatible API.

(Figure: stacked generalization scheme)

Libraries

  • Python 3.5+
  • Keras 2.1.0+
  • Tensorflow 1.10.0+

Platform

  • Google Colab

Python Script

Python script for all of the pre-processing, feature engineering, model implementation, and evaluation: Link

Citation

If you find this repository useful in your research, please cite this article as:

IEEE Style- K. S. Roy, T. Ahmed, P. B. Udas, Md. E. Karim, and S. Majumdar, “MalHyStack: A hybrid stacked ensemble learning framework with feature engineering schemes for obfuscated malware analysis,” Intelligent Systems with Applications, vol. 20, p. 200283, Nov. 2023, doi: 10.1016/j.iswa.2023.200283.

APA Style- Roy, K. S., Ahmed, T., Udas, P. B., Karim, M. E., & Majumdar, S. (2023, November). MalHyStack: A hybrid stacked ensemble learning framework with feature engineering schemes for obfuscated malware analysis. Intelligent Systems With Applications, 20, 200283. https://doi.org/10.1016/j.iswa.2023.200283

Contact-Info

Please feel free to contact us with any questions or cooperation opportunities. We will be happy to help.
