# Domain Generation Algorithm (DGA) Detection

## Table of Contents
* Introduction
* Data Importing
* Data Preprocessing
* Training and Evaluation
* Inference
* Export Model to ONNX
* Conclusion

## Introduction
[Domain Generation Algorithms](https://en.wikipedia.org/wiki/Domain_generation_algorithm) (DGAs) generate large numbers of domain names that malware can use to reach its command-and-control servers. Static IP addresses and domain names are easy to block, whereas a DGA lets malware rotate through many generated domains and so evade traditional block lists. This example uses a [Gated Recurrent Unit](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21) (GRU), a type of recurrent neural network, to classify domain names as benign or DGA-generated. The implementation lets users train models on up-to-date domain names that represent both classes, and the same capability could be used in production. This notebook walks through the data science workflow for building a DGA detector.
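To make the idea concrete, here is a minimal sketch of how a DGA might derive domain names from a shared seed and the current date. The function name `toy_dga`, the hashing scheme, and the seed value are all hypothetical and chosen for illustration; they do not reproduce any real malware family.

```python
import hashlib
from datetime import date


def toy_dga(seed: str, day: date, count: int = 5, length: int = 12, tld: str = ".com"):
    """Generate `count` pseudo-random domain names for a given seed and day.

    Hypothetical scheme for illustration only.
    """
    domains = []
    for i in range(count):
        # Hash the shared seed, the date, and a counter so that both ends
        # can derive the same candidate list independently.
        digest = hashlib.sha256(f"{seed}|{day.isoformat()}|{i}".encode()).hexdigest()
        # Map each hex digit (0-15) to a letter a-p to form a domain-like label.
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:length])
        domains.append(label + tld)
    return domains


candidates = toy_dga("example-seed", date(2023, 5, 23))
for name in candidates:
    print(name)
```

Because the output depends only on the seed and the date, blocking one generated name does little: the next day's list is entirely different. This is why the notebook classifies the character patterns of a name rather than matching it against a list.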

In [1]:
import os
import cudf
import cupy as cp
import torch
import requests
import logging
import numpy as np
from datetime import datetime
from sklearn.metrics import accuracy_score, average_precision_score
from dga_detector import DGADetector
from dataloader import DataLoader
from dga_dataset import DGADataset
from utils import str2ascii
from cuml.model_selection import train_test_split

#### Enable console logging

In [2]:
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.StreamHandler()
    ]
)

#### Load Input Dataset to GPU Dataframe

In [3]:
INPUT_CSV = "../datasets/dga-training-data.csv"

In [4]:
gdf = cudf.read_csv(INPUT_CSV)

In [5]:
train_data = gdf['domain']
labels = gdf['type']

Because the dataset has only two categories, benign and DGA (malicious), the number of domain types is set to 2 (`N_DOMAIN_TYPE = 2`). The vocabulary size (`CHAR_VOCAB`) is set to 128, one entry per ASCII character. The values below for `HIDDEN_SIZE`, `N_LAYERS`, and `LR` (learning rate) gave a good balance between network size and performance on this dataset. Other datasets may need different values, found by experiment.

In [6]:
LR = 0.001
N_LAYERS = 3
CHAR_VOCAB = 128
HIDDEN_SIZE = 100
N_DOMAIN_TYPE = 2
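The vocabulary size of 128 follows from how domains are fed to the network: each character becomes its ASCII code, and every row is brought to a common width. The sketch below shows one way this could look on the CPU in plain Python. It is an assumption about the general approach, not the notebook's `str2ascii` helper (which runs on the GPU and is not shown here); `encode_domain` and `MAX_WIDTH` are hypothetical names, and the width of 100 is taken from the truncation width reported in the training log.

```python
MAX_WIDTH = 100   # assumed truncation/padding width
CHAR_VOCAB = 128  # ASCII code points 0..127


def encode_domain(domain: str, width: int = MAX_WIDTH):
    """Return (padded_codes, length) for one domain name."""
    # Truncate to the fixed width, then map characters to ASCII codes.
    codes = [ord(ch) for ch in domain[:width]]
    if any(code >= CHAR_VOCAB for code in codes):
        raise ValueError(f"non-ASCII character in {domain!r}")
    length = len(codes)
    # Right-pad with zeros so that every row has the same width.
    return codes + [0] * (width - length), length


codes, length = encode_domain("example.com")
print(codes[:length], length)
```

The true length is returned alongside the padded row because a recurrent model usually needs it to ignore the padding positions.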

#### Instantiate DGA detector
With the data loaded and the model parameters set, we create a `DGADetector` and initialize its model. Training follows in the next section.

In [7]:
dd = DGADetector(lr=LR)
dd.init_model(n_layers=N_LAYERS, char_vocab=CHAR_VOCAB, hidden_size=HIDDEN_SIZE, n_domain_type=N_DOMAIN_TYPE)

2023-05-23 15:28:42,334 [INFO] Found GPU's now setting up cuda for the model


In [8]:
EPOCHS = 25
TRAIN_SIZE = 0.7
BATCH_SIZE = 10000
MODELS_DIR = 'models'

## Training and Evaluation
Now we train and evaluate the model.

In [9]:
%%time
dd.train_model(train_data, labels, batch_size=BATCH_SIZE, epochs=EPOCHS, train_size=TRAIN_SIZE)

2023-05-23 15:28:44,362 [INFO] Initiating model training ...
2023-05-23 15:28:44,362 [INFO] Truncate domains to width: 100
2023-05-23 15:29:10,117 [INFO] Evaluating trained model ...
2023-05-23 15:29:13,494 [INFO] Test set accuracy: 409207/614179 (0.6662666746990699)

2023-05-23 15:29:38,524 [INFO] Evaluating trained model ...
2023-05-23 15:29:41,858 [INFO] Test set accuracy: 518867/614179 (0.8448139711712709)

2023-05-23 15:30:07,173 [INFO] Evaluating trained model ...
2023-05-23 15:30:10,605 [INFO] Test set accuracy: 580162/614179 (0.944613866641484)

2023-05-23 15:30:35,663 [INFO] Evaluating trained model ...
2023-05-23 15:30:39,037 [INFO] Test set accuracy: 592778/614179 (0.9651551095039068)

2023-05-23 15:31:03,868 [INFO] Evaluating trained model ...
2023-05-23 15:31:07,187 [INFO] Test set accuracy: 592414/614179 (0.9645624484067349)

2023-05-23 15:31:32,262 [INFO] Evaluating trained model ...
2023-05-23 15:31:35,653 [INFO] Test set accuracy: 586014/614179 (0.954142033511403)

202

CPU times: user 21min 7s, sys: 22.7 s, total: 21min 30s
Wall time: 11min 51s
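The `train_size` of 0.7 means 70% of the rows are used for training and the remaining 30% for the per-epoch evaluation in the log above. How `train_model` performs the split internally is not shown here, so the following is only a sketch of the general technique; `split_indices` and the seed value are hypothetical.

```python
import random


def split_indices(n_rows: int, train_size: float = 0.7, seed: int = 42):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_rows))
    # A dedicated, seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(indices)
    cut = round(n_rows * train_size)
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the held-out rows the same from run to run. If a split is repeated later without a fixed seed, some rows seen during training may land in the new test set and inflate the measured accuracy.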





### Save Model
Save the trained model checkpoint to the output directory.

In [10]:
if not os.path.exists(MODELS_DIR):
    print("Creating directory '{}'".format(MODELS_DIR))
    os.makedirs(MODELS_DIR)

now = datetime.now()
model_filename = "rnn_classifier_{}.bin".format(now.strftime("%Y-%m-%d_%H_%M_%S"))
model_filepath = os.path.join(MODELS_DIR, model_filename)
dd.save_checkpoint(model_filepath)

2023-05-23 15:40:35,640 [INFO] Pretrained model checkpoint saved to location: 'models/rnn_classifier_2023-05-23_15_40_35.bin'


## Inference

Using the model trained above, we now score a test split of the dataset to determine whether each domain is benign or likely generated by a DGA.

In [11]:
dga_detector = DGADetector()
dga_detector.load_checkpoint(model_filepath)

domain_train, domain_test, type_train, type_test = train_test_split(gdf, "type", train_size=TRAIN_SIZE)
test_df = cudf.DataFrame()
test_df["type"] = type_test.reset_index(drop=True)
test_df["domain"] = domain_test.reset_index(drop=True)

test_dataset = DGADataset(test_df, 100)
test_dataloader = DataLoader(test_dataset, batchsize=BATCH_SIZE)

pred_results = []
true_results = []
for chunk in test_dataloader.get_chunks():
    pred_results.append(list(dga_detector.predict(chunk['domain']).values_host))
    true_results.append(list(chunk['type'].values_host))
pred_results = np.concatenate(pred_results)
true_results = np.concatenate(true_results)
# scikit-learn metrics take the ground truth first, then the predictions
accuracy_score_result = accuracy_score(true_results, pred_results)

print(f'Model accuracy: {accuracy_score_result}')

2023-05-23 15:40:35,667 [INFO] Found GPU's now setting up cuda for the model


Model accuracy: 0.9907323435024643


In [12]:
average_precision = average_precision_score(true_results, pred_results)

print('Average precision score: {0:0.3f}'.format(average_precision))

Average precision score: 0.976
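Accuracy alone can hide how the model behaves on each class, so it can help to look at precision and recall as well. The sketch below computes them from hard 0/1 predictions in plain Python. Treating label 1 as the DGA class is an assumption, and the sample labels are made up for illustration. Note also that `average_precision_score` is designed for continuous scores; when it is given hard labels, as above, it summarizes a single operating point rather than a full precision-recall curve.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels for illustration (1 = DGA is an assumption).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For a detector feeding a block list, precision reflects how many flagged domains are truly malicious, while recall reflects how many malicious domains are caught.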


## Export Model to ONNX

In [13]:
def preprocess(df, pad_max_len):
    # Encode the 'domain' column; the result also carries a "len" column.
    df = str2ascii(df, 'domain')
    df = df.drop("domain", axis=1)
    seq_len_arr = df["len"].values_host
    df = df.drop("len", axis=1)
    seq_len_tensor = torch.LongTensor(seq_len_arr)

    # Right-pad the encoded rows with zeros up to pad_max_len columns.
    seq_cp = df.to_cupy()
    padded = cp.zeros((seq_cp.shape[0], pad_max_len), dtype=cp.int64)
    padded[:, :seq_cp.shape[1]] = seq_cp
    seq_tensor = torch.as_tensor(padded)

    if torch.cuda.is_available():
        seq_tensor = seq_tensor.cuda()
        seq_len_tensor = seq_len_tensor.cuda()

    return seq_tensor, seq_len_tensor

In [14]:
# Use the first 32 rows as a sample batch for tracing the model
seq_tensor, seq_lengths = preprocess(gdf[0:32], 100)
sample_model_input = (seq_tensor, seq_lengths)
model_to_export = dga_detector.get_unwrapped_model()



In [15]:
torch.onnx.export(model_to_export,                                   # model to export
                  sample_model_input,                                # sample input used to trace the model
                  "model.onnx",                                      # where to save the model
                  export_params=True,                                # store the trained parameter weights inside the model file
                  opset_version=10,                                  # the ONNX opset version to target
                  do_constant_folding=True,                          # whether to execute constant folding for optimization
                  input_names=['domains', 'seq_lengths'],            # the model's input names
                  output_names=['output'],                           # the model's output names
                  dynamic_axes={'domains': {0: 'batch_size'},        # variable length axes
                                'seq_lengths': {0: 'batch_size'},
                                'output': {0: 'batch_size'}})



## Conclusion

This DGA detection implementation lets users train their own detection models as well as use existing ones. It could also be combined with log parsing when the logs contain domain names. Data stays in GPU memory, which avoids unnecessary copies and conversions and provides a 4X speed advantage over CPU-only implementations, especially with large batch sizes.
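As a rough illustration of the log parsing idea, the sketch below pulls domain-like tokens out of a log line so they could be passed to a detector. The regular expression is a simplification rather than a full hostname validator, and the function name and sample log line are hypothetical.

```python
import re

# Simplified pattern: dot-separated labels of letters, digits, and hyphens,
# ending in an alphabetic TLD of at least two characters.
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)


def extract_domains(log_line: str):
    """Return the lowercase domain-like tokens found in one log line."""
    return [match.group(0).lower() for match in DOMAIN_RE.finditer(log_line)]


# Hypothetical DNS-style log line for illustration.
line = "2023-05-23 15:28:42 client 10.0.0.5 query: Example.COM IN A"
print(extract_domains(line))
```

Requiring an alphabetic TLD keeps IPv4 addresses such as `10.0.0.5` out of the results, so only candidate domain names reach the model.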