# Domain Generation Algorithms (DGA) 

MITRE ATT&CK: https://attack.mitre.org/techniques/T1323/

## Preamble
To defeat blacklists of hard-coded domain names, malware authors often build a domain generation algorithm (DGA) into their samples. This technique lets adversaries generate new domain names on demand and so bypass domain-based blocking.
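
As an illustration of the idea, here is a minimal toy sketch of a date-seeded generator. It is not taken from any real malware family; the function name `toy_dga`, the seed string, and the hashing scheme are all assumptions made for the example. The point is only that two parties sharing a seed and a clock can derive the same list of domains without either list being hard-coded.

```python
import hashlib
from datetime import date

def toy_dga(seed, day, count=5, tld=".com"):
    """Derive `count` pseudo-random domain names from a shared seed and a date."""
    domains = []
    for i in range(count):
        # Hypothetical scheme: hash the seed, the date and a counter together
        material = f"{seed}-{day.isoformat()}-{i}".encode()
        digest = hashlib.sha256(material).hexdigest()
        # Map the first 12 hex digits onto the letters a-p to get an alphabetic label
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:12])
        domains.append(label + tld)
    return domains

# Same seed and date always produce the same candidates; a new date produces new ones
print(toy_dga("example-seed", date(2020, 1, 1)))
```

Because the output depends only on the seed and the date, a blacklist built from yesterday's domains says nothing about today's.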

Traditional approaches are reactive: they involve reverse engineering the malware binaries to recover the algorithm, which is naturally time-consuming.

## Detection Approaches

* NXDomains
If a single host triggers a large number of NXDOMAIN (non-existent domain) responses, this could indicate malware activity. This is still a reactive approach: a threshold of NXDOMAIN responses has to be reached before any action can be taken.

* Feature Engineering
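Look at intrinsic features of the domain name itself, such as the character bigram counts used by the model later in this notebook.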

* RNN
A scalable and easy-to-use technique for detection.
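
The NXDOMAIN approach from the list above can be sketched as a per-host counter with a threshold. This is a minimal sketch only: the `(host, domain, rcode)` event layout, the function name `flag_nxdomain_hosts`, the threshold value, and the sample events are all invented for illustration and do not correspond to any particular log format.

```python
from collections import Counter

def flag_nxdomain_hosts(dns_events, threshold=3):
    """Return hosts whose NXDOMAIN response count reaches `threshold`.

    `dns_events` is an iterable of (host, queried_domain, rcode) tuples;
    this layout is a hypothetical log format used for illustration only.
    """
    nx_counts = Counter(
        host for host, _domain, rcode in dns_events if rcode == "NXDOMAIN"
    )
    return {host: n for host, n in nx_counts.items() if n >= threshold}

# Invented sample events: one noisy host, one host with a single typo
events = [
    ("10.0.0.5", "xkqjzv.example", "NXDOMAIN"),
    ("10.0.0.5", "pwmtrh.example", "NXDOMAIN"),
    ("10.0.0.5", "bnvyql.example", "NXDOMAIN"),
    ("10.0.0.7", "www.example", "NOERROR"),
    ("10.0.0.7", "wwww.example", "NXDOMAIN"),
]
print(flag_nxdomain_hosts(events))  # {'10.0.0.5': 3}
```

The sketch also shows why the approach is reactive: the host is only flagged after it has already made `threshold` failed lookups.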

## Datasets
* Cisco's Umbrella Popularity
 Cisco describes the list as follows: "The popularity list contains our most queried domains based on passive DNS usage across our Umbrella global network of more than 100 Billion requests per day with 65 million unique active users, in more than 165 countries. Unlike Alexa, the metric is not based on only browser based 'http' requests from users but rather takes into account the number of unique client IPs invoking this domain relative to the sum of all requests to all domains. In other words, our popularity ranking reflects the domain's relative internet activity agnostic to the invocation protocols and applications whereas 'site ranking' models (such as Alexa) focus on the web activity over port 80 mainly from browsers."
https://s3-us-west-1.amazonaws.com/umbrella-static/index.html


* DGA Sets from NetLab


## References
[1] - https://blog.malwarebytes.com/security-world/2016/12/explained-domain-generating-algorithm/

[2] - https://www.youtube.com/watch?v=jm7wH2G0h6c

<hr>

In [78]:
import pandas as pd
import numpy as np

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding, LSTM
import sklearn.metrics
from sklearn import feature_extraction
from sklearn.model_selection import train_test_split

## Data Ingestion & Preprocessing

In [172]:
# Ingesting Benign Domains
benignDomains_df = pd.read_csv('dataset/CISCO-top-1m.csv', names=['domain'])
benignDomains_df['class'] = 'benign'
benignDomains_df['src'] = 'cisco'

# Ingesting Malicious Domains
maliciousDomains_df = pd.read_csv('dataset/dga-domains.txt', sep='\t', usecols=[0,1], names=['src', 'domain'])
maliciousDomains_df['class'] = 'malicious'

# Merging Datasets
dataset_df = pd.concat([maliciousDomains_df[0:50], benignDomains_df[0:50]], sort=False)
dataset_df = dataset_df.sample(frac=1)
dataset_df.head()

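|    | src    | domain                 | class     |
|----|--------|------------------------|-----------|
| 2  | cisco  | api-global.netflix.com | benign    |
| 14 | cisco  | clients4.google.com    | benign    |
| 41 | cisco  | mtalk.google.com       | benign    |
| 43 | nymaim | cdfviq.com             | malicious |
| 26 | nymaim | meisvfddhoc.biz        | malicious |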


## Building the Model

In [119]:
def build_model(max_features):
    """Builds logistic regression model"""
    model = Sequential()
    model.add(Dense(1, input_dim=max_features, activation='sigmoid'))

    model.compile(loss='binary_crossentropy',
                  optimizer='adam')

    return model
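
A single `Dense(1, activation='sigmoid')` layer trained with binary cross-entropy is logistic regression: one weight per input feature, a bias, and a sigmoid on top. The sketch below spells out that computation by hand in plain Python. The weights and feature values are made up for illustration; they are not taken from a trained model.

```python
import math

def sigmoid(z):
    """Squash a real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability of the positive (malicious) class for one feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def binary_crossentropy(y_true, p):
    """Loss for one example with true label y_true (0 or 1) and prediction p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Hypothetical 3-feature example: the weighted sum is 0, so the output is 0.5
weights = [0.8, -0.4, 0.0]
p = predict_probability([1, 2, 0], weights, bias=0.0)
loss = binary_crossentropy(1, p)
print(p, loss)
```

In the notebook's model, `max_features` plays the role of `len(weights)`: there is one weight per distinct character bigram.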

In [173]:
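def run(max_epoch=5, nfolds=10, batch_size=20):
    """Run train/test on logistic regression model"""

    X = list(dataset_df['domain'])
    labels = list(dataset_df['class'])

    # Create feature vectors: counts of character bigrams per domain
    print("vectorizing data")
    ngram_vectorizer = feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(2, 2))
    count_vec = ngram_vectorizer.fit_transform(X)

    max_features = count_vec.shape[1]

    # Create standard labels: 0 = benign, 1 = malicious
    y = np.array([0 if x == 'benign' else 1 for x in labels])

    print('Count Vec: ' + str(count_vec))
    print('Y: ' + str(y.tolist()))
    print('Labels: ' + str(labels))

    best_model = None
    best_test_auc = -1.0

    for fold in range(nfolds):
        # Hold out 20% for testing; stratify so both classes appear in every split
        X_train, X_test, y_train, y_test = train_test_split(count_vec, y, test_size=0.2, stratify=y)

        model = build_model(max_features)

        # Carve a small holdout set out of the training data to monitor AUC per epoch
        X_train, X_holdout, y_train, y_holdout = train_test_split(X_train, y_train, test_size=0.05, stratify=y_train)

        best_iter = -1
        best_auc = 0.0

        for ep in range(max_epoch):
            model.fit(X_train.toarray(), y_train, batch_size=batch_size, epochs=1)

            # predict() returns the sigmoid output, i.e. the malicious-class probability
            t_probs = model.predict(X_holdout.toarray()).ravel()
            t_auc = sklearn.metrics.roc_auc_score(y_holdout, t_probs)

            if t_auc > best_auc:
                best_auc = t_auc
                best_iter = ep
            elif (ep - best_iter) > 5:
                # No longer improving...stop training this fold
                break

        # Score the trained model on this fold's test split
        test_auc = sklearn.metrics.roc_auc_score(y_test, model.predict(X_test.toarray()).ravel())
        print('Fold %d: test AUC = %f' % (fold, test_auc))

        # Keep the model from the fold that scores best on its test split
        if test_auc > best_test_auc:
            best_model, best_test_auc = model, test_auc

    print('Ready to predict')
    return best_model

testModel = run()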

vectorizing data
Count Vec:   (0, 240)	1
  (0, 64)	1
  (0, 4)	1
  (0, 365)	1
  (0, 158)	1
  (0, 191)	1
  (0, 108)	1
  (0, 322)	1
  (0, 98)	1
  (0, 219)	1
  (0, 11)	1
  (0, 185)	1
  (0, 29)	1
  (0, 40)	1
  (0, 231)	1
  (0, 194)	1
  (0, 120)	1
  (0, 1)	1
  (0, 139)	1
  (0, 258)	1
  (0, 32)	1
  (1, 88)	1
  (1, 190)	1
  (1, 236)	1
  (1, 242)	1
  :	:
  (98, 402)	1
  (98, 387)	1
  (98, 248)	1
  (98, 339)	1
  (98, 316)	1
  (98, 166)	1
  (98, 320)	1
  (98, 71)	1
  (98, 236)	1
  (98, 122)	1
  (98, 98)	1
  (98, 219)	1
  (98, 11)	1
  (99, 39)	1
  (99, 181)	1
  (99, 86)	1
  (99, 94)	1
  (99, 378)	1
  (99, 281)	1
  (99, 319)	1
  (99, 190)	1
  (99, 98)	1
  (99, 219)	1
  (99, 11)	1
  (99, 29)	1
Y: [0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1]
Labels:
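
Each `(row, column)  count` line in the `Count Vec` printout above says that the domain in that row contains the bigram assigned to that column the given number of times. As a rough illustration of what the character-bigram vectorizer is doing, here is a pure-Python sketch that builds the same kind of triples. It is a simplified stand-in, not scikit-learn's implementation; the helper names are hypothetical, and the sorted column ordering is an assumption chosen to resemble `CountVectorizer`.

```python
from collections import Counter

def char_bigrams(domain):
    """Return the overlapping two-character substrings of a domain, lowercased."""
    text = domain.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]

def fit_vocabulary(domains):
    """Assign a column index to every distinct bigram, in sorted order."""
    bigrams = sorted({bg for d in domains for bg in char_bigrams(d)})
    return {bg: col for col, bg in enumerate(bigrams)}

def to_sparse_rows(domains, vocab):
    """Yield (row, column, count) triples, one per bigram present in each domain."""
    for row, d in enumerate(domains):
        for bg, n in sorted(Counter(char_bigrams(d)).items()):
            yield (row, vocab[bg], n)

domains = ["abab.com", "cdfviq.com"]
vocab = fit_vocabulary(domains)
triples = list(to_sparse_rows(domains, vocab))

print(len(vocab))   # number of columns, the analogue of max_features
for triple in triples:
    print(triple)
```

The number of distinct bigrams across the whole training set becomes the input width of the model, which is why `max_features` is read from `count_vec.shape[1]`.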

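In [176]:
ngram_vectorizer = feature_extraction.text.CountVectorizer(analyzer='char', ngram_range=(2, 2))
count_vec = ngram_vectorizer.fit_transform(['tedsfwfkewrglkergst'])
# len() is ambiguous for sparse matrices; shape[0] gives the number of rows (domains)
print(count_vec.shape[0])
# Note: this vectorizer is fitted on a single string, so its columns do not match
# the vocabulary testModel was trained on; predicting requires the training vectorizer.
#print(testModel.predict(count_vec))

# print(testModel)

1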
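
A model trained in `run()` expects input with exactly the same columns as the training matrix. One way to get that, sketched here under the assumption that the fitted vectorizer is kept available (the notebook's `run()` does not return it), is to call `transform` on the vectorizer fitted during training instead of fitting a fresh one on the new domain. The four training domains below are taken from the sample rows shown earlier and stand in for the full `dataset_df['domain']` column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A few domains from the sample output, standing in for the training set
train_domains = [
    "api-global.netflix.com",
    "mtalk.google.com",
    "cdfviq.com",
    "meisvfddhoc.biz",
]

vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2))
train_vec = vectorizer.fit_transform(train_domains)   # learns the bigram vocabulary

# transform() reuses the learned vocabulary; bigrams never seen in training are ignored
new_vec = vectorizer.transform(['tedsfwfkewrglkergst'])

print(train_vec.shape[1] == new_vec.shape[1])  # column counts match
print(new_vec.shape[0])                        # number of rows (domains)
```

With matching column counts, `new_vec.toarray()` has the shape a model built with `max_features = train_vec.shape[1]` expects.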