This project maintains the M4M package, which is used for feature extraction and conversion to the libSVM data format.

hjlee94/feature_extraction_for_malware

m4ML (malware for Machine Learning)

Introduction

This project maintains the m4ML tool, which is used for feature extraction and conversion to the libSVM data format.

Prerequisite

This tool is written in Python. The installed package versions should match the requirements below.

  • Requirement

    • python == 2.7.13
    • argparse == 1.1
    • tqdm == 4.28.1
    • numpy == 1.15.4
  • m4ML Directory Structure

    • ./conf

      • ./conf/features.conf : contains the list of features to be extracted.
      • ./conf/mergence.conf : specifies the index at which each feature starts.
      • ./conf/settings.conf : contains general settings (e.g. job number).
    • ./resource

      • ./resource/input : has no special meaning; you can simply place your data in this folder.
      • ./resource/output : feature extraction results are stored here, in a folder named after the job ID.
    • ./src

      • Contains the Python source code.
  • Input Directory Requirement
    The input directory should have the structure shown below and satisfy the following conditions:

    • There should be a single root directory.
    • Each level 2 directory represents a file extension.
    • Each level 3 directory represents a class of data (e.g. malware, benign, trojan, etc.)

Figure: input data structure
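The three-level layout described above can be checked before running the tool. The following is a minimal sketch, not part of m4ML: `iter_samples` is a hypothetical helper, and it assumes the path passed in is the single root directory (level 1), with extension directories directly beneath it.

```python
import os


def iter_samples(root):
    """Yield (extension, class_name, file_path) for root/<extension>/<class>/<file>.

    Hypothetical helper: assumes `root` is the level 1 directory.
    """
    for ext in sorted(os.listdir(root)):
        ext_dir = os.path.join(root, ext)
        if not os.path.isdir(ext_dir):
            continue  # level 2 entries are expected to be extension directories
        for cls in sorted(os.listdir(ext_dir)):
            cls_dir = os.path.join(ext_dir, cls)
            if not os.path.isdir(cls_dir):
                continue  # level 3 entries are expected to be class directories
            for name in sorted(os.listdir(cls_dir)):
                path = os.path.join(cls_dir, name)
                if os.path.isfile(path):
                    yield ext, cls, path
```

Entries that do not fit the layout (for example, a stray file directly under the root) are skipped silently here; a stricter check could raise an error instead.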

How to use

The m4ML tool performs feature extraction in 3 stages (modes).

  • Extraction : extracts features from the input data.
  • Labeling : creates a CSV file that maps each filename to a class.
  • Encoding : encodes the feature indices and merges multiple features into one file (in libSVM format).
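To illustrate what the encoding stage produces, here is a minimal sketch of merging several per-feature sparse vectors into one libSVM line (`<label> <index>:<value> ...`, with ascending indices). It is not the tool's implementation: `to_libsvm_line` is a hypothetical function, the start indices (Ngram 1, WEM 65537) are taken from the second example in this README, and the 0-based local indices are an assumption.

```python
def to_libsvm_line(label, features, start_index):
    """Merge per-feature sparse vectors into one libSVM-format line.

    features:    {feature_name: {local_index: value}}, local indices from 0 (assumed)
    start_index: {feature_name: first global index of that feature}
    """
    merged = {}
    for name, vector in features.items():
        offset = start_index[name]
        for local_idx, value in vector.items():
            if value != 0:  # libSVM is a sparse format: zero entries are omitted
                merged[offset + local_idx] = value
    # libSVM expects indices in ascending order
    pairs = " ".join("%d:%g" % (i, merged[i]) for i in sorted(merged))
    return ("%d %s" % (label, pairs)).rstrip()


start_index = {"Ngram": 1, "WEM": 65537}
features = {"Ngram": {0: 3, 257: 1}, "WEM": {0: 0.5, 2: 0.25}}
print(to_libsvm_line(1, features, start_index))
# prints: 1 1:3 258:1 65537:0.5 65539:0.25
```

The sketch does not check that the index ranges of different features stay disjoint; with real data the start indices must leave enough room for each feature's dimensionality.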

The usage is shown below. Each stage is launched by running the Python script MalProc.py.

python MalProc.py [-h] -M Mode [-j NJOBS] [-i INPUT] [-o OUTPUT] [-f FEATURE] [-s SAMPLING] [-e ENCODINGCONF] [-n NAMEENCODERPATH]

The following parameters are available.

Parameters | Description | Extraction | Labeling | Encoding
-M (--Mode) | specify the mode to launch (extraction, labeling or encoding) | | |
-j (--njobs) | specify the number of workers (CPU processors) for parallel extraction | | |
-i (--input) | specify the input directory path | | |
-o (--output) | specify the output path where results are saved | | |
-f (--feature) | if you want to extract only one feature, specify the feature name | | |
-s (--sampling) | specify the sampling ratio or the number of samples | | |
-n (--nameEncoderPath) | specify the labeling file path | | |
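The options in the table map naturally onto `argparse`. The sketch below shows one way such a command line could be declared; it is an illustration rather than the actual contents of MalProc.py. The argument types (for example, `float` for `--sampling`) are assumptions, and `-e` is left out because the table does not describe it.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (types are assumed)."""
    parser = argparse.ArgumentParser(prog="MalProc.py")
    parser.add_argument("-M", "--Mode", required=True,
                        choices=["extraction", "labeling", "encoding"],
                        help="mode to launch")
    parser.add_argument("-j", "--njobs", type=int,
                        help="number of workers for parallel extraction")
    parser.add_argument("-i", "--input", help="input directory path")
    parser.add_argument("-o", "--output", help="output path for results")
    parser.add_argument("-f", "--feature",
                        help="name of a single feature to extract")
    parser.add_argument("-s", "--sampling", type=float,
                        help="sampling ratio or number of samples")
    parser.add_argument("-n", "--nameEncoderPath", help="labeling file path")
    return parser


args = build_parser().parse_args(
    ["-M", "extraction", "-i", "/Data/dataset/exe", "-f", "Ngram"])
print("%s %s %s" % (args.Mode, args.input, args.feature))
# prints: extraction /Data/dataset/exe Ngram
```

Restricting `-M` with `choices` makes a mistyped mode fail immediately with a usage message instead of partway through a run.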

Examples

1. Extract N-gram feature

# Set up the configuration files for the job ID (e.g. job01)
# Run extraction mode
  > python MalProc.py -M extraction -i /Data/dataset/exe -f Ngram
# Run labeling mode
  > python MalProc.py -M labeling -i /Data/dataset/exe
# Run encoding mode
  > python MalProc.py -M encoding -i ../resource/output/job01

2. Extract N-gram and WEM features

# Set up the configuration files for the job ID (e.g. job02) and the feature start indices (Ngram 1, WEM 65537)
# Run extraction mode
  > python MalProc.py -M extraction -i /Data/dataset/exe
# Run labeling mode
  > python MalProc.py -M labeling -i /Data/dataset/exe
# Run encoding mode
  > python MalProc.py -M encoding -i ../resource/output/job02
