m4ML (malware for Machine Learning)

Introduction

This project maintains the m4ML tool, which extracts features and converts them to the libSVM data format.

Prerequisite

This tool is written in Python. The installed package versions should match the requirements listed below.

Requirement
- python == 2.7.13
- argparse == 1.1
- tqdm == 4.28.1
- numpy == 1.15.4
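
One way to confirm that an environment matches these pins is a small comparison helper. The following is a minimal sketch and not part of m4ML: `REQUIRED` copies the versions listed above, while `find_mismatches` and `python_version` are hypothetical helper names.

```python
import sys

# Pins copied from the requirement list above.
REQUIRED = {
    "python": "2.7.13",
    "argparse": "1.1",
    "tqdm": "4.28.1",
    "numpy": "1.15.4",
}


def find_mismatches(required, installed):
    """Return {name: (wanted, found)} for each entry whose version differs.

    `found` is None when the package is missing from `installed`.
    """
    mismatches = {}
    for name, wanted in required.items():
        found = installed.get(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


def python_version():
    """Format the running interpreter version as 'major.minor.micro'."""
    return "%d.%d.%d" % tuple(sys.version_info[:3])
```

The `installed` mapping could be filled from `python_version()` and from each module's `__version__` attribute (for example `numpy.__version__`); an empty result means every pin is satisfied.
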
m4ML Directory Structure
- ./conf
  - ./conf/features.conf : contains the list of features to extract.
  - ./conf/mergence.conf : specifies the index at which each feature starts.
  - ./conf/settings.conf : contains general settings (e.g. job number).
- ./resource
  - ./resource/input : has no special meaning; you can simply place your data in this folder.
  - ./resource/output : feature extraction results are stored in a folder named after the job-id.
- ./src
  - contains the Python source code.
Input Directory Requirement
The input directory should have the structure described below and satisfy the following conditions:
- There should be a single root directory.
- Level 2 directories represent file extensions.
- Level 3 directories represent the classes of data (e.g. malware, benign, trojan).
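
To make the layout concrete, here is a minimal sketch that walks a directory tree shaped as `root/<extension>/<class>/<file>`. It is an illustration only, not code from m4ML: `scan_input` is a hypothetical name, and treating the root itself as level 1 is an assumption drawn from the list above.

```python
import os


def scan_input(root):
    """Collect (path, extension, class) for files under root/<extension>/<class>/.

    Assumes the root itself is level 1, so level 2 is the extension
    directory and level 3 is the class directory.
    """
    records = []
    for ext in sorted(os.listdir(root)):
        ext_dir = os.path.join(root, ext)
        if not os.path.isdir(ext_dir):
            continue  # ignore stray files at level 2
        for cls in sorted(os.listdir(ext_dir)):
            cls_dir = os.path.join(ext_dir, cls)
            if not os.path.isdir(cls_dir):
                continue  # ignore stray files at level 3
            for name in sorted(os.listdir(cls_dir)):
                path = os.path.join(cls_dir, name)
                if os.path.isfile(path):
                    records.append((path, ext, cls))
    return records
```

Under this assumed layout, a file stored at `root/exe/malware/sample1` would be reported with extension `exe` and class `malware`.
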

How to use

The m4ML tool performs feature extraction in three stages (modes).

Extraction : extracts features from the input data.
Labeling : creates a CSV file that maps each filename to a class.
Encoding : encodes the feature indices and merges multiple features into one file in libSVM format.
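
The encoding stage can be pictured with a short sketch. This is not the m4ML implementation: `to_libsvm_line` is a hypothetical helper, and it assumes each feature stores sparse values under local indices counted from 0, which are shifted by that feature's start index before being written as a libSVM line (`label index:value ...`, indices ascending).

```python
def to_libsvm_line(label, features, start_index):
    """Merge per-feature sparse vectors into one libSVM-format line.

    features:    {feature_name: {local_index: value}}, local indices from 0
    start_index: {feature_name: first global index for that feature}
    """
    merged = {}
    for name, vector in features.items():
        base = start_index[name]
        for local, value in vector.items():
            merged[base + local] = value
    # libSVM expects ascending indices; zero-valued entries are omitted.
    pairs = ["%d:%g" % (idx, merged[idx])
             for idx in sorted(merged) if merged[idx] != 0]
    return " ".join([str(label)] + pairs)
```

For instance, with start indices `{"Ngram": 1, "WEM": 65537}`, the input `{"Ngram": {0: 3, 41: 1}, "WEM": {0: 0.5}}` and label `1` yield the line `1 1:3 42:1 65537:0.5`.
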

Usage is shown below. Each stage is launched by running the Python script MalProc.py.

python MalProc.py [-h] -M Mode [-j NJOBS] [-i INPUT] [-o OUTPUT] [-f FEATURE] [-s SAMPLING] [-e ENCODINGCONF] [-n NAMEENCODERPATH]

The parameters are described in the following table.

Parameter	Description	Extraction	Labeling	Encoding
-M (--Mode)	mode to launch (extraction, labeling or encoding)	○	○	○
-j (--njobs)	number of workers (CPU processors) for parallel extraction	△	Ⅹ	Ⅹ
-i (--input)	input directory path	○	○	○
-o (--output)	output path where the result is saved	Ⅹ	△	△
-f (--feature)	feature name, when only one feature is to be extracted	△	Ⅹ	Ⅹ
-s (--sampling)	sampling ratio or number of data items	Ⅹ	Ⅹ	△
-n (--nameEncoderPath)	labeling file path	Ⅹ	Ⅹ	△
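
For readers who want to see how such a command line could be declared, here is a hedged `argparse` sketch that accepts the options from the usage line. It is not taken from MalProc.py: the `choices` restriction, the integer type for `-j`, and the `dest` name for `-e` are assumptions, and `-e` is declared with its short form only because the usage line does not show a long name or a description for it.

```python
import argparse


def build_parser():
    """Build a parser that accepts the options shown in the usage line."""
    parser = argparse.ArgumentParser(prog="MalProc.py")
    parser.add_argument("-M", "--Mode", required=True,
                        choices=["extraction", "labeling", "encoding"],
                        help="mode to launch")
    # Assumption: the worker count is parsed as an integer.
    parser.add_argument("-j", "--njobs", type=int,
                        help="number of parallel workers (extraction)")
    parser.add_argument("-i", "--input", help="input directory path")
    parser.add_argument("-o", "--output", help="output path for results")
    parser.add_argument("-f", "--feature",
                        help="single feature name to extract (extraction)")
    # Kept as a string because it may be a ratio or a count.
    parser.add_argument("-s", "--sampling",
                        help="sampling ratio or number of data items (encoding)")
    # Long name and meaning of -e are undocumented; dest is a placeholder.
    parser.add_argument("-e", dest="encodingconf",
                        help="undocumented in the table above")
    parser.add_argument("-n", "--nameEncoderPath", help="labeling file path")
    return parser
```

Parsing `["-M", "extraction", "-i", "/Data/dataset/exe", "-f", "Ngram"]` with this parser gives `Mode="extraction"`, `input="/Data/dataset/exe"` and `feature="Ngram"`, with the remaining options left as `None`.
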

Examples

1. Extract N-gram feature

# Set the configuration files for the job-id (e.g. job01)
# Run extraction mode
  > python MalProc.py -M extraction -i /Data/dataset/exe -f Ngram
# Run labeling mode
  > python MalProc.py -M labeling -i /Data/dataset/exe
# Run encoding mode
  > python MalProc.py -M encoding -i ../resource/output/job01

2. Extract N-gram and WEM features

# Set the configuration files for the job-id (e.g. job02) and the feature start indices (Ngram 1, WEM 65537)
# Run extraction mode
  > python MalProc.py -M extraction -i /Data/dataset/exe
# Run labeling mode
  > python MalProc.py -M labeling -i /Data/dataset/exe
# Run encoding mode
  > python MalProc.py -M encoding -i ../resource/output/job02
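
The three commands in each example follow the same pattern, so they can be generated rather than typed by hand. The sketch below only builds the argument lists and does not run anything; `build_commands` is a hypothetical helper, not part of m4ML.

```python
def build_commands(input_dir, job_dir, feature=None):
    """Return argv lists for the extraction, labeling and encoding stages.

    input_dir: dataset directory passed to extraction and labeling
    job_dir:   job output directory passed to encoding
    feature:   optional single feature name (adds -f to extraction)
    """
    extraction = ["python", "MalProc.py", "-M", "extraction", "-i", input_dir]
    if feature is not None:
        extraction += ["-f", feature]
    labeling = ["python", "MalProc.py", "-M", "labeling", "-i", input_dir]
    encoding = ["python", "MalProc.py", "-M", "encoding", "-i", job_dir]
    return [extraction, labeling, encoding]
```

Each list could then be passed in order to `subprocess.check_call`, run from the directory that contains MalProc.py, so that a failing stage stops the sequence.
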
