# Binary LSTM classification model

In this notebook, we replicate the binary LSTM model that classifies a domain name as DGA or non-DGA, as described in the Endgame paper:

"Predicting Domain Generation Algorithms with Long Short-Term Memory Networks"
http://arxiv.org/abs/1611.00791v1


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from keras.models import Sequential, model_from_json
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.preprocessing import sequence
from keras.preprocessing import text

from tensorflow.python.client import device_lib

Using TensorFlow backend.


In [4]:
# Check if gpu can be utilized for acceleration
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 6031639707492484166
]


In [6]:
# Read the high-confidence DGA feed, the combined high- and low-confidence DGA feed, and the Cisco top-1m list
dga_df = pd.read_csv('dga-feed-high.csv', header=None, skiprows=15)
dga_all_df = pd.read_csv('dga-feed.csv', header=None, skiprows=14)
cisco_df = pd.read_csv('top-1m.csv', header=None)

In [7]:
# Display the shape and the first rows of each DataFrame
def display_df(dga_df_, cisco_df_, dga_all_df_):
    display("DGA feed sample: {}".format(dga_df_.shape))
    display(dga_df_.head())
    display("DGA feed high and low confidence sample: {}".format(dga_all_df_.shape))
    display(dga_all_df_.head())
    display("Cisco feed sample: {}".format(cisco_df_.shape))
    display(cisco_df_.head())

In [9]:
# Remove unused columns, add output label 'dga'

dga_df_slim = dga_df.drop(columns=range(1, dga_df.shape[1]))
dga_df_slim.columns = ['domain']
dga_all_df_slim = dga_all_df.drop(columns=range(1, dga_all_df.shape[1]))
dga_all_df_slim.columns = ['domain']

cisco_df_slim = cisco_df.drop(columns=[0], inplace=False)
cisco_df_slim.columns = ['domain']
dga_df_slim['dga'] = 1
dga_all_df_slim['dga'] = 1
cisco_df_slim['dga'] = 0

display_df(dga_df_slim, cisco_df_slim, dga_all_df_slim)
unified_df = pd.concat([cisco_df_slim, dga_df_slim], ignore_index=True)

'DGA feed sample: (381953, 2)'

| | domain | dga |
|---|---|---|
| 0 | plvklpgwivery.com | 1 |
| 1 | dnuxdhcgblsgy.net | 1 |
| 2 | qjlullhfkiowp.biz | 1 |
| 3 | elkidddodxdly.ru | 1 |
| 4 | rnbfwuprlwfor.org | 1 |


'DGA feed high and low confidence sample: (852819, 2)'

| | domain | dga |
|---|---|---|
| 0 | qbtdyvvoubcrakm.com | 1 |
| 1 | efsadbrxqnweigx.net | 1 |
| 2 | lfwqxtewtsmgxvy.biz | 1 |
| 3 | yjvncyagpfhswmx.ru | 1 |
| 4 | mjyjwdvmwcpwvna.org | 1 |


'Cisco feed sample: (1000000, 2)'

| | domain | dga |
|---|---|---|
| 0 | netflix.com | 0 |
| 1 | api-global.netflix.com | 0 |
| 2 | prod.netflix.com | 0 |
| 3 | push.prod.netflix.com | 0 |
| 4 | google.com | 0 |


In [10]:
# Separate input sequences (domains) and output labels (DGA 0/1), and do train/test split

X = unified_df['domain']
Y = unified_df['dga']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=23)

In [12]:
# Binary classification LSTM model

TRAIN_MODEL = False                                          # when True, the last cell saves the model to disk
max_features = 1000                                          # embedding vocabulary size (upper bound)
batch_size = 128                                             # input batch size
num_epochs = 5                                               # epochs to train

# Encode string characters to integers
encoder = text.Tokenizer(num_words=500, char_level=True)
encoder.fit_on_texts(X_train)                            # build character indices
X_train_tz = encoder.texts_to_sequences(X_train)

# Model definition - this is the core model from Endgame
model = Sequential()
model.add(Embedding(max_features, 128, input_length=75))
model.add(LSTM(128))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')

# Pad every sequence to length 75. Each sequence holds case-insensitive
# characters encoded as integers; 0 is used as the padding value
X_train_pad = sequence.pad_sequences(X_train_tz, maxlen=75)

# Train; Y_train holds the binary labels (0 = non-DGA, 1 = DGA)
model.fit(X_train_pad, Y_train, batch_size=batch_size, epochs=num_epochs)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x18a8f01c550>
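
The encoding and padding steps in the cell above can be illustrated without Keras. The following is a minimal pure-Python sketch, not the Keras implementation: the helper names (`build_char_index`, `encode`, `pad_left`) are hypothetical, ties in character frequency are broken alphabetically (an assumption made here for determinism), and the two sample domains and `maxlen=20` are chosen only to keep the output short.

```python
from collections import Counter

def build_char_index(domains):
    """Map each character to an integer >= 1, most frequent first.
    Index 0 is left free so it can serve as the padding value."""
    counts = Counter(ch for d in domains for ch in d.lower())
    # Descending frequency, then alphabetical (tie-break is an assumption)
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i + 1 for i, ch in enumerate(ordered)}

def encode(domain, char_index):
    """Lower-case the domain and replace each known character by its index."""
    return [char_index[ch] for ch in domain.lower() if ch in char_index]

def pad_left(seq, maxlen, value=0):
    """Keep at most the last `maxlen` items and left-pad with `value`."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

domains = ["google.com", "plvklpgwivery.com"]
char_index = build_char_index(domains)
padded = [pad_left(encode(d, char_index), maxlen=20) for d in domains]

for d, p in zip(domains, padded):
    print(d, p)
```

Left-padding keeps the end of the domain (including the TLD) aligned at the last time steps, which is also what `pad_sequences` does by default.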

In a typical training run on a dual-core CPU, each epoch took about 2.5 to 3 times longer than on a GPU.

In [13]:
# Evaluate on the held-out test set

X_test_pad = sequence.pad_sequences(encoder.texts_to_sequences(X_test), maxlen=75)
Y_pred = model.predict_classes(X_test_pad)
acc = accuracy_score(Y_test, Y_pred)
print("Model accuracy = {:8.3f} %".format(acc*100))

Model accuracy =   99.011 %
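
To put this accuracy in context, it helps to compare it with a trivial classifier that always predicts the majority class. The sketch below uses the row counts printed earlier (1,000,000 Cisco domains and 381,953 high-confidence DGA domains). It assumes the random 20 % test split keeps roughly the same class ratio as the full dataset, which the notebook does not verify.

```python
# Row counts taken from the DataFrame shapes printed earlier in the notebook
n_benign = 1_000_000   # Cisco domains, label 0
n_dga = 381_953        # high-confidence DGA domains, label 1

# Accuracy of always predicting the larger class
baseline_acc = max(n_benign, n_dga) / (n_benign + n_dga)
print("Majority-class baseline = {:.2f} %".format(baseline_acc * 100))
```

Under that assumption the baseline is roughly 72 %, so the model's accuracy is well above what the class imbalance alone would give.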


In [17]:
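# Evaluate on the combined high- and low-confidence DGA feed.
# Every label is 1, so accuracy here is the fraction of DGA domains detected.
X_2 = dga_all_df_slim['domain']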
Y_2 = dga_all_df_slim['dga']
X_test_pad2 = sequence.pad_sequences(encoder.texts_to_sequences(X_2), maxlen=75)
Y_pred2 = model.predict_classes(X_test_pad2)
acc = accuracy_score(Y_2, Y_pred2)
print("Model accuracy = {:8.3f} %".format(acc*100))

Model accuracy =   80.017 %
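
Because every domain in this second evaluation set is labelled as DGA, the accuracy reported here coincides with recall on the DGA class, and it says nothing about false positives. The following minimal sketch shows the equivalence on made-up labels; the helper functions are hypothetical stand-ins written for illustration, not the scikit-learn implementations.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that were predicted as positive."""
    positives = sum(1 for t in y_true if t == positive)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    return hits / positives

# Toy data: ten DGA domains (all label 1), two of them missed by the classifier
y_true = [1] * 10
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

print("accuracy =", accuracy(y_true, y_pred))
print("recall   =", recall(y_true, y_pred))
```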


In [12]:
# Save model and weights
if TRAIN_MODEL:
    model_save = model.to_json()
    with open('.\\saved_models\\binary_LSTM.json', 'w') as file:
        file.write(model_save)
    model.save_weights('.\\saved_models\\binary_LSTM.h5')
    print('MODEL SAVED TO DISK!')
else:
    print('MODEL ALREADY SAVED TO DISK.')

MODEL ALREADY SAVED TO DISK.


## Look ahead and next steps:
__1__ Look closer at the misclassified domains. Does any particular DGA category stand out? What do we need to improve?

__2__ Improve classification accuracy with a more balanced dataset, especially for multiclass classification.

__3__ Training from scratch takes significant time. We need to implement incremental model updates on batches of new domain data.

__4__ Modify the model to do multiclass classification across the various DGA categories. Do we need to trim down the categories? The dataset shows 60+ categories, and new ones may be added at any time.
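
As a starting point for step 1, the misclassifications could be grouped by DGA category. The sketch below is only an illustration under assumptions: the category column was dropped earlier in this notebook, so the `families` list, the family names, and the predictions are all made up, and `miss_rate_by_family` is a hypothetical helper.

```python
from collections import Counter

def miss_rate_by_family(families, y_true, y_pred):
    """Return {family: fraction of its domains that were misclassified}."""
    totals = Counter()
    misses = Counter()
    for fam, t, p in zip(families, y_true, y_pred):
        totals[fam] += 1
        if t != p:
            misses[fam] += 1
    return {fam: misses[fam] / totals[fam] for fam in totals}

# Made-up example: six DGA domains from three hypothetical families
families = ["familyA", "familyA", "familyB", "familyB", "familyB", "familyC"]
y_true = [1, 1, 1, 1, 1, 1]
y_pred = [1, 1, 0, 0, 1, 1]

rates = miss_rate_by_family(families, y_true, y_pred)
# List the families with the highest miss rate first
worst = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
for fam, rate in worst:
    print("{:10s} miss rate = {:.2f}".format(fam, rate))
```

Families with a high miss rate and few samples would also be natural candidates to revisit when balancing the dataset in step 2.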