<a href="https://colab.research.google.com/github/parindi/ember/blob/master/Task1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Neural Network Model on EMBER Malware Dataset:

The EMBER dataset is a collection of features from PE files that serve as a benchmark dataset for researchers. <br>
In this notebook, the EMBER-2017 v2 dataset is used which contains features from 1.1 million PE files scanned in or before 2017.

In [1]:
# Importing required modules

import pandas as pd

## Dataset Extraction:
To use the dataset in this notebook, simple download and upload didn't work as the URL to download the dataset is detected as untrused by Google. So, downloading the EMBER 2017 v2 dataset to Colab notebook using wget command and double unzipping it to get the

In [2]:
!wget https://ember.elastic.co/ember_dataset_2018_2.tar.bz2 --no-check-certificate

--2023-11-27 14:30:00--  https://ember.elastic.co/ember_dataset_2018_2.tar.bz2
Resolving ember.elastic.co (ember.elastic.co)... 34.107.161.234, 2600:1901:0:1f6d::
Connecting to ember.elastic.co (ember.elastic.co)|34.107.161.234|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1696539273 (1.6G) [application/x-bzip2]
Saving to: ‘ember_dataset_2018_2.tar.bz2’


2023-11-27 14:30:48 (34.1 MB/s) - ‘ember_dataset_2018_2.tar.bz2’ saved [1696539273/1696539273]



In [3]:
# Decompressing a .bz2 file
!bzip2 -d ember_dataset_2018_2.tar.bz2

In [4]:
# Extracting from tar file
!tar -xvf ember_dataset_2018_2.tar

ember2018/
ember2018/train_features_1.jsonl
ember2018/train_features_0.jsonl
ember2018/train_features_3.jsonl
ember2018/test_features.jsonl
ember2018/ember_model_2018.txt
ember2018/train_features_5.jsonl
ember2018/train_features_4.jsonl
ember2018/train_features_2.jsonl


All the required dataset files are extracted.

Now to work with the EMBER dataset, we need to clone its github repository whihc can be done by following code:

In [5]:
!git clone https://github.com/elastic/ember

Cloning into 'ember'...
remote: Enumerating objects: 285, done.[K
remote: Counting objects: 100% (93/93), done.[K
remote: Compressing objects: 100% (65/65), done.[K
remote: Total 285 (delta 40), reused 70 (delta 28), pack-reused 192[K
Receiving objects: 100% (285/285), 11.36 MiB | 22.38 MiB/s, done.
Resolving deltas: 100% (121/121), done.


In [6]:
!mv ember ember-master

In [7]:
!cp -r ember-master/* .

In [8]:
!pip install -r requirements_notebook.txt
!python setup.py install

Collecting jupyter>=1.0.0 (from -r requirements_notebook.txt (line 1))
  Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting vega>=2.5 (from -r requirements_notebook.txt (line 2))
  Downloading vega-4.0.0-py3-none-any.whl (3.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.1/3.1 MB[0m [31m18.6 MB/s[0m eta [36m0:00:00[0m
Collecting qtconsole (from jupyter>=1.0.0->-r requirements_notebook.txt (line 1))
  Downloading qtconsole-5.5.1-py3-none-any.whl (123 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m123.4/123.4 kB[0m [31m15.1 MB/s[0m eta [36m0:00:00[0m
Collecting ipytablewidgets<0.4.0,>=0.3.0 (from vega>=2.5->-r requirements_notebook.txt (line 2))
  Downloading ipytablewidgets-0.3.1-py2.py3-none-any.whl (190 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.2/190.2 kB[0m [31m20.5 MB/s[0m eta [36m0:00:00[0m
Collecting lz4 (from ipytablewidgets<0.4.0,>=0.3.0->vega>=2.5->-r requirements_notebook.

In [9]:
!pip install lief

Collecting lief
  Downloading lief-0.13.2-cp310-cp310-manylinux_2_24_x86_64.whl (4.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4.0/4.0 MB[0m [31m16.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: lief
Successfully installed lief-0.13.2


The LIEF project is used to extract features from PE files included in the EMBER dataset. Raw features are extracted to JSON format. Vectorized features can be produced from these raw features and saved in binary format from which they can be converted to CSV, dataframe, or any other format.

In [1]:
import ember
ember.create_vectorized_features("/content/ember2018/")
ember.create_metadata("/content/ember2018/")

Vectorizing training set


100%|██████████| 800000/800000 [44:57<00:00, 296.55it/s]


Vectorizing test set


100%|██████████| 200000/200000 [11:00<00:00, 302.95it/s]


Unnamed: 0,sha256,appeared,label,avclass,subset
0,0abb4fda7d5b13801d63bee53e5e256be43e141faa077a...,2006-12,0,,train
1,c9cafff8a596ba8a80bafb4ba8ae6f2ef3329d95b85f15...,2007-01,0,,train
2,eac8ddb4970f8af985742973d6f0e06902d42a3684d791...,2007-02,0,,train
3,7f513818bcc276c531af2e641c597744da807e21cc1160...,2007-02,0,,train
4,ca65e1c387a4cc9e7d8a8ce12bf1bcf9f534c9032b9d95...,2007-02,0,,train
...,...,...,...,...,...
999995,e033bc4967ce64bbb5cafdb234372099395185a6e0280c...,2018-12,1,zbot,test
999996,c7d16736fd905f5fbe4530670b1fe787eb12ee86536380...,2018-12,1,flystudio,test
999997,0020077cb673729209d88b603bddf56b925b18e682892a...,2018-12,0,,test
999998,1b7e7c8febabf70d1c17fe3c7abf80f33003581c380f28...,2018-12,0,,test


In [2]:
import ember
data_path = '/content/ember2018/'
emberdf = ember.read_metadata(data_path)
emberdf.head()

Unnamed: 0,sha256,appeared,label,avclass,subset
0,0abb4fda7d5b13801d63bee53e5e256be43e141faa077a...,2006-12,0,,train
1,c9cafff8a596ba8a80bafb4ba8ae6f2ef3329d95b85f15...,2007-01,0,,train
2,eac8ddb4970f8af985742973d6f0e06902d42a3684d791...,2007-02,0,,train
3,7f513818bcc276c531af2e641c597744da807e21cc1160...,2007-02,0,,train
4,ca65e1c387a4cc9e7d8a8ce12bf1bcf9f534c9032b9d95...,2007-02,0,,train


In [3]:
X_train0, y_train0, X_test0, y_test0 = ember.read_vectorized_features(data_path)



In [4]:
X_train0

memmap([[1.4676122e-02, 4.2218715e-03, 3.9226813e-03, ..., 0.0000000e+00,
         0.0000000e+00, 0.0000000e+00],
        [1.8452372e-01, 3.1307504e-02, 5.6928140e-03, ..., 4.4229600e+05,
         0.0000000e+00, 0.0000000e+00],
        [2.5173673e-01, 1.4204546e-02, 6.8414863e-03, ..., 3.7280000e+04,
         0.0000000e+00, 0.0000000e+00],
        ...,
        [1.4297070e-01, 8.6626979e-03, 4.2015705e-03, ..., 0.0000000e+00,
         0.0000000e+00, 0.0000000e+00],
        [1.4780925e-01, 6.4021470e-03, 5.1157344e-03, ..., 0.0000000e+00,
         0.0000000e+00, 0.0000000e+00],
        [1.3445158e-01, 6.8144272e-03, 5.5496283e-03, ..., 0.0000000e+00,
         0.0000000e+00, 0.0000000e+00]], dtype=float32)

In [4]:
#shape of the dataset
X_train0.shape, y_train0.shape, X_test0.shape, y_test0.shape

((800000, 2381), (800000,), (200000, 2381), (200000,))

## Data Preprocessing:

It is known that the EMBER train dataset has three sample categories, namels unlabled, benign and malicious. They are represented as -1, 0 and 1 respectively. But it can be seen that the test dataset has only benign and malicious samples. In this project, I am ignoring the unlabled samples from the train dataset for the better performance of the model.

In [5]:
import pandas as pd
# Creating dataframes of X_train & y_train
X_train0 = pd.DataFrame(X_train0)
y_train0 = pd.DataFrame(y_train0)
X_train0.shape, y_train0.shape

((800000, 2381), (800000, 1))

In [6]:
#Unique labels in the train dataset
y_train0[0].unique()

array([ 0.,  1., -1.], dtype=float32)

In [7]:
# Combining features and lables of train dataset
X_train0[2381] = y_train0[0]
X_train0.shape, y_train0.shape

((800000, 2382), (800000, 1))

In [8]:
#Checking the presence of unique lables in the combined dataframe
X_train0[2381].unique()

array([ 0.,  1., -1.], dtype=float32)

In [9]:
X_train0.shape, y_train0.shape

((800000, 2382), (800000, 1))

The dataset is huge and takes lot to time for vectorizing and creating metadata for every runtime execution. So, create pickle files for the training and testing samples to store them in the system. By downloading and storing these pickle files, one can avoid the execution of the former lines of code.

In [11]:
#Pickling the datasets
pd.DataFrame(X_train0).to_pickle("./X_train.pkl")
pd.DataFrame(y_train0).to_pickle("./y_train.pkl")
pd.DataFrame(X_test0).to_pickle("./X_test.pkl")
pd.DataFrame(y_test0).to_pickle("./y_test.pkl")

In [15]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


I faced network failure error while downloading the pickle file to store it in my system. The alternate solution for this error is to upload the pickle files to the Google Drive by executing the following code:

In [16]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [17]:
# Copying pickle files to Google Drive
!cp ./X_test.pkl ./gdrive/My\ Drive/Pickle_Files/
!cp ./y_test.pkl ./gdrive/My\ Drive/Pickle_Files/
!cp ./X_train.pkl ./gdrive/My\ Drive/Pickle_Files/
!cp ./y_train.pkl ./gdrive/My\ Drive/Pickle_Files/

In [1]:
# Extracting training data from pickle files
import pandas as pd
X_trainp = pd.read_pickle("/content/gdrive/My Drive/Pickle_Files/X_train.pkl")
y_trainp = pd.read_pickle("/content/gdrive/My Drive/Pickle_Files/y_train.pkl")

In [2]:
# Extracting testing data from pickle files
X_testp =pd.read_pickle("/content/gdrive/My Drive/Pickle_Files/X_test.pkl")
y_testp = pd.read_pickle("/content/gdrive/My Drive/Pickle_Files/y_test.pkl")

In [3]:
#Shape of the dataset
X_trainp.shape, y_trainp.shape, X_testp.shape, y_testp.shape

((800000, 2382), (800000, 1), (200000, 2381), (200000, 1))

At this point of execution, I can see that the above lines of code used most of the 12GB RAM availbale in Colab. So, even though the datasets are pickled, the RAM crashes. The alternative for this is to create HDF5 files. The h5py package is a Pythonic interface to the HDF5 binary data format.

In [8]:
import h5py

# Loading X_train data to HDF5 file
h50 = h5py.File('X_train0.h5', 'w')
h50.create_dataset('X_train0', data=X_train0)
h50.close()

In [9]:
# Loading y_train data to HDF5 file
h51 = h5py.File('y_train0.h5', 'w')
h51.create_dataset('y_train0', data=y_train0)
h51.close()

In [10]:
#Loading X_test data to HDF5 file
h52 = h5py.File('X_test0.h5', 'w')
h52.create_dataset('X_test0', data=X_test0)
h52.close()

In [11]:
#Loading y_test data to HDF5 file
h53 = h5py.File('y_test0.h5', 'w')
h53.create_dataset('y_test0', data=y_test0)
h53.close()

In [12]:
#Storing all the h5 files to GDrive
!cp ./X_train0.h5 ./gdrive/My\ Drive/Pickle_Files
!cp ./y_train0.h5 ./gdrive/My\ Drive/Pickle_Files
!cp ./X_test0.h5 ./gdrive/My\ Drive/Pickle_Files
!cp ./y_test0.h5 ./gdrive/My\ Drive/Pickle_Files

In [13]:
#reading the X_train data from h5 files
import h5py
Xh5 = h5py.File('/content/gdrive/My Drive/Pickle_Files/X_train0.h5','r')
X_train = Xh5['X_train0']
X_train.shape

(800000, 2381)

In [14]:
# Reading y_train data from h5 files
import h5py
yh5 = h5py.File('/content/gdrive/My Drive/Pickle_Files/y_train0.h5','r')
y_train = yh5['y_train0']
y_train.shape

(800000,)

In [15]:
# Reading X_test data from h5 files
import h5py
Xth5 = h5py.File('/content/gdrive/My Drive/Pickle_Files/X_test0.h5','r')
X_test = Xth5['X_test0']
X_test.shape

(200000, 2381)

In [16]:
# Reading y_test data from h5 files
import h5py
yth5 = h5py.File('/content/gdrive/My Drive/Pickle_Files/y_test0.h5','r')
y_test = yth5['y_test0']
y_test.shape

(200000,)

**The features of this dataset are scaled on different scalars and among them I picked RobustScalar to do the feature scaling.**

In [None]:
# Scaling the features inorder to improve the performance of the model
from sklearn.preprocessing import RobustScaler

rs = RobustScaler()
Xtrain_rs = rs.fit_transform(X_train)
Xtest_rs = rs.fit_transform(X_test)

In [None]:
#Loading scaled X_train data to HDF5 file
h54 = h5py.File('Xtrain_rs.h5', 'w')
h54.create_dataset('Xtrain_rs', data=Xtrain_rs)
h54.close()

#Storing the h5 files to GDrive
!cp ./Xtrain_rs.h5 ./gdrive/My\ Drive/Pickle_Files

In [None]:
#Loading scaled X_test data to HDF5 file
h55 = h5py.File('Xtest_rs.h5', 'w')
h55.create_dataset('Xtest_rs', data=Xtest_rs)
h55.close()

#Storing the h5 files to GDrive
!cp ./Xtest_rs.h5 ./gdrive/My\ Drive/Pickle_Files

In [None]:
# Reading Xtrain_rs data from h5 files
import h5py
Xrsh5 = h5py.File('/content/gdrive/My Drive/Pickle_Files/Xtrain_rs.h5','r')
Xtrain_rs = Xrsh5['Xtrain_rs']
Xtrain_rs.shape

(600000, 2381)

In [None]:
# Reading Xtest_rs data from h5 files
import h5py
Xtrsh5 = h5py.File('/content/gdrive/My Drive/Pickle_Files/Xtest_rs.h5','r')
Xtest_rs = Xtrsh5['Xtest_rs']
Xtest_rs.shape

(200000, 2381)

## Model Arcitecture & Training:

In [None]:
#Function for the model
def myModel():

    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers
    from tensorflow.keras.models import Sequential
    from keras import regularizers
    tf.compat.v1.disable_eager_execution()

    #Model architecture
    model = Sequential()
    model.add(layers.InputLayer(input_shape=(2381,)))
    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(units = 1000, activation = tf.nn.relu, activity_regularizer=regularizers.l2(0.01)))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(units = 1, activation=tf.nn.sigmoid))
    print(model.summary())

    #model compilation
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    model.save('my_model.h5')

    return model

In [None]:
model = myModel()

Using TensorFlow backend.


Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dropout (Dropout)            (None, 2381)              0         
_________________________________________________________________
dense (Dense)                (None, 1000)              2382000   
_________________________________________________________________
dropout_1 (Dropout)          (None, 1000)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 1001      
Total params: 2,383,001
Trainable params: 2,383,001
Non-trainable params: 0
_________________________________________________________________
None


In [None]:
#Training the model on 1 epoch
history = model.fit(Xtrain_rs, y_train,
                batch_size=256, shuffle="batch",
                epochs=1,
                validation_split=0.2)

Train on 480000 samples, validate on 120000 samples
Epoch 1/1


In [None]:
history = model.fit(Xtrain_rs, y_train,
                batch_size=256, shuffle="batch",
                epochs=30,
                validation_split=0.2)

Train on 480000 samples, validate on 120000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


## Model Testing:

In [None]:
# testing the model

score =model.evaluate(Xtest_rs,y_test)
print("Training accuracy:", score[1])

Training accuracy: 0.4422149956226349


Now, lets save the model for future use.

In [None]:
# Save the model
#model.save('my_model.h5')
model.save_weights('my_model_weights.h5')

#Storing the model to GDrive
!cp ./my_model.h5 ./gdrive/My\ Drive/Pickle_Files
!cp ./my_model_weights.h5 ./gdrive/My\ Drive/Pickle_Files

In [None]:
# save neural network structure to JSON (no weights)
model_json = model.to_json()
with open("mymodeljson.json", "w") as json_file:
    json_file.write(model_json)

model.save_weights("my_model-weights.h5")

The below set of code is a a function that takes a PE file as its argument, runs it through the trained model, and returns the output i.e., 1 for Malware or 0 for Benign.

In [None]:
!wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Windows-x86_64.exe

--2020-04-28 03:17:13--  https://repo.anaconda.com/archive/Anaconda3-2020.02-Windows-x86_64.exe
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.131.3, 104.16.130.3, 2606:4700::6810:8303, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.131.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 488908696 (466M) [application/octet-stream]
Saving to: ‘Anaconda3-2020.02-Windows-x86_64.exe’


2020-04-28 03:17:15 (201 MB/s) - ‘Anaconda3-2020.02-Windows-x86_64.exe’ saved [488908696/488908696]



In [None]:
def testPE(pe):
  import ember
  import numpy as np
  import tensorflow as tf
  from sklearn.preprocessing import RobustScaler
  rs = RobustScaler()

  #opening the downloaded PE file
  testpe = open(pe, "rb").read()
  #Feature extractor class of the ember project
  extract = ember.PEFeatureExtractor()
  data = extract.feature_vector(testpe) #vectorizing the extracted features
  scaled_data = rs.fit_transform([data])
  Xdata = np.reshape(scaled_data,(1, 2381))

  model = tf.keras.models.load_model('my_model.h5')
  pred = model.predict_classes(Xdata)

  return pred

In [None]:
testPE("Anaconda3-2020.02-Windows-x86_64.exe")



array([[0]], dtype=int32)

The model predicted that Anaconda PE file as Benign