### N-Grams Implementation for DGA detection

- Concept: teach the algorithm which character combinations are “expected” in a domain name.
- To do that, we collect the n-grams that appear in legitimate domains and then measure how far a candidate domain's n-grams deviate from them.
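
As a minimal sketch of this idea, the snippet below counts character bigrams over a few known-good domains and reports what fraction of a candidate's bigrams have been seen before. The helper names `char_ngrams` and `seen_fraction` and the toy domain list are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return the list of overlapping character n-grams in `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Toy stand-in for the legitimate (Alexa) domains
legit = ["facebook", "google", "apple"]
legit_counts = Counter(g for d in legit for g in char_ngrams(d, 2))

def seen_fraction(domain, counts, n=2):
    """Fraction of the domain's n-grams that also occur in `counts`."""
    grams = char_ngrams(domain, n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in counts) / len(grams)

print(seen_fraction("facebok", legit_counts))  # every bigram was seen in "facebook"
print(seen_fraction("zqwpro", legit_counts))   # no bigram was seen
```

A random-looking name shares few or no n-grams with legitimate domains, which is the signal the rest of the notebook tries to quantify.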

Reference:
1. [Stanford CS224n - Using nGrams for probabilistic language modeling](https://www.youtube.com/watch?v=dkUtavsPqNA)

The goal in probabilistic modeling

In [2]:
# Setup
import sklearn.feature_extraction
import pandas as pd
import numpy as np
import pylab
import tldextract

# IPython.core.display is deprecated as an import location; use IPython.display
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Set default pylab parameters to figure
pylab.rcParams['figure.figsize'] = (14.0, 5.0)
pylab.rcParams['axes.grid'] = True

#### Prepare legit dataset

In [3]:
alexa_dataframe = pd.read_csv('data/alexa_100k.csv', names=['rank','uri'], header=None, encoding='utf-8')

def domain_extract(uri):
    ext = tldextract.extract(uri)
    if (not ext.suffix):
        return np.nan
    else:
        return ext.domain
    
alexa_dataframe['domain'] = alexa_dataframe['uri'].apply(domain_extract)
del alexa_dataframe['rank'], alexa_dataframe['uri']
alexa_dataframe.dropna(inplace=True)
alexa_dataframe.drop_duplicates(inplace=True)
alexa_dataframe['class'] = 'legit'
# Shuffle the data (important for training/testing)
alexa_dataframe = alexa_dataframe.reindex(np.random.permutation(alexa_dataframe.index))
alexa_total = alexa_dataframe.shape[0]
print('Total Alexa domains %d' % alexa_total)
# Create a holdout set of 10% of the total alexa domains
split = int(0.1 * alexa_total)
hold_out_alexa, alexa_dataframe = alexa_dataframe[:split], alexa_dataframe[split:]
print('Number of training Alexa domains: %d' % alexa_dataframe.shape[0])
alexa_dataframe.head()

Total Alexa domains 91377
Number of training Alexa domains: 82240


| | domain | class |
|---|---|---|
| 93422 | ufs-online | legit |
| 63122 | suratdiamond | legit |
| 32013 | ivpaste | legit |
| 87578 | excelwithbusiness | legit |
| 94832 | whos | legit |


#### Prepare DGA dataset

In [4]:
# Read in the DGA domains
dga_dataframe = pd.read_csv('data/dga_domains.txt', names=['raw_domain'], header=None, encoding='utf-8')

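# We noticed that the blacklist values differ only by capitalization or by TLD (.com/.org/.info),
# so keep just the lowercased first label of each entry.
# The vectorized .str accessor also passes NaN values through instead of raising.
dga_dataframe['domain'] = dga_dataframe['raw_domain'].str.split('.').str[0].str.strip().str.lower()
del dga_dataframe['raw_domain']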

# Drop any NaNs (e.g. from blank lines) and duplicate domains
dga_dataframe = dga_dataframe.dropna()
dga_dataframe = dga_dataframe.drop_duplicates()
dga_total = dga_dataframe.shape[0]
print('Total DGA domains %d' % dga_total)

# Set the class
dga_dataframe['class'] = 'dga'

# Hold out 10%
hold_out_dga = dga_dataframe[int(dga_total*.9):]
dga_dataframe = dga_dataframe[:int(dga_total*.9)]

print('Number of training DGA domains: %d' % dga_dataframe.shape[0])
dga_dataframe.head()

Total DGA domains 2664
Number of training DGA domains: 2397


| | domain | class |
|---|---|---|
| 0 | 04055051be412eea5a61b7da8438be3d | dga |
| 1 | 1cb8a5f36f | dga |
| 2 | 30acd347397c34fc273e996b22951002 | dga |
| 3 | 336c986a284e2b3bc0f69f949cb437cb | dga |
| 5 | 40a43e61e56a5c218cf6c22aca27f7ee | dga |


In [5]:
# Concatenate the domains in a big pile!
all_domains = pd.concat([alexa_dataframe, dga_dataframe], ignore_index=True)
all_domains.head()

| | domain | class |
|---|---|---|
| 0 | ufs-online | legit |
| 1 | suratdiamond | legit |
| 2 | ivpaste | legit |
| 3 | excelwithbusiness | legit |
| 4 | whos | legit |


### Background on N-Grams model

The goal of a language model is to compute the probability of a sentence or sequence of words:  
[TODO] Update theory
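
As a sketch of the standard formulation, written here for the characters $c_1, \dots, c_m$ of a domain name rather than for words (this notation is an assumption and does not come from the notebook): the chain rule factorizes the probability of the sequence, and a bigram model approximates each factor by conditioning only on the previous character. The conditional probabilities are then estimated from counts, with $c_0$ taken to be a start-of-string marker.

$$
P(c_1, \dots, c_m) = \prod_{i=1}^{m} P(c_i \mid c_1, \dots, c_{i-1}) \approx \prod_{i=1}^{m} P(c_i \mid c_{i-1}),
\qquad
P(c_i \mid c_{i-1}) = \frac{\mathrm{count}(c_{i-1} c_i)}{\mathrm{count}(c_{i-1})}
$$

In practice the product is usually computed as a sum of logarithms to avoid numerical underflow on long sequences.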

<img src="./images/bigram_example.png" alt="drawing" width="800px"/>

Reference: 
1. [Stanford CS224n](https://www.youtube.com/watch?v=dkUtavsPqNA)
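
To make the bigram model concrete, here is a small sketch that estimates character-bigram probabilities from a toy list of domains and scores new names by log-probability. Several parts are assumptions rather than the notebook's method: the function names, the `^` start marker, the add-one (Laplace) smoothing, and the alphabet size of 37 (a-z, 0-9 and the hyphen).

```python
import math
from collections import Counter

def train_bigram_counts(domains, start="^"):
    """Count character bigrams and their left-context characters."""
    bigrams, contexts = Counter(), Counter()
    for d in domains:
        padded = start + d
        for a, b in zip(padded, padded[1:]):
            bigrams[a + b] += 1
            contexts[a] += 1
    return bigrams, contexts

def log_prob(domain, bigrams, contexts, vocab_size, start="^"):
    """Add-one smoothed log10 probability of `domain` under the bigram model."""
    padded = start + domain
    total = 0.0
    for a, b in zip(padded, padded[1:]):
        # Counter returns 0 for unseen keys, so smoothing keeps the ratio positive
        total += math.log10((bigrams[a + b] + 1) / (contexts[a] + vocab_size))
    return total

bigram_counts, context_counts = train_bigram_counts(["facebook", "google", "apple"])
for name in ["facebok", "zqwpro"]:
    lp = log_prob(name, bigram_counts, context_counts, vocab_size=37)
    # Dividing by the length makes names of different lengths comparable
    print(name, round(lp / len(name), 3))
```

The name built from familiar bigrams should receive the higher (less negative) per-character score. With only three training domains the gap is small; it would be expected to widen with a realistic corpus.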

In [43]:
from sklearn.feature_extraction.text import CountVectorizer
import operator

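def ngram_extract(series, ngrams, analyzer='char', min_df=1e-4, max_df=1.0):
    """Return {ngram: total count} over `series`, sorted by descending count."""
    cv = CountVectorizer(analyzer=analyzer, ngram_range=(ngrams, ngrams), min_df=min_df, max_df=max_df)
    counts_matrix = cv.fit_transform(series)
    # Sum each column of the document-term matrix to get corpus-wide counts
    ngrams_counts = counts_matrix.sum(axis=0).getA1()
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    ngrams_list = cv.get_feature_names_out()
    ngrams_dict = dict(sorted(zip(ngrams_list, ngrams_counts), key=operator.itemgetter(1), reverse=True))
    return ngrams_dict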

unigrams_dict = ngram_extract(alexa_dataframe['domain'], ngrams=1) 
bigrams_dict = ngram_extract(alexa_dataframe['domain'], ngrams=2)
trigrams_dict = ngram_extract(alexa_dataframe['domain'], ngrams=3)

In [49]:
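# CountVectorizer expects an iterable of documents; passing a bare string raises
# "ValueError: Iterable over raw text documents expected, string object received."
# Wrapping the single domain in a list avoids that.
holdout_trigrams = ngram_extract([hold_out_alexa['domain'].iloc[1]], ngrams=3)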

In [65]:
word = hold_out_alexa['domain'].iloc[0]
word

'566ee'

In [62]:
from nltk.util import ngrams
grams = ngrams(hold_out_alexa['domain'].iloc[0], 2)
grams_list = [''.join(g) for g in grams]

In [66]:
# Iterate over the characters directly (enumerate would yield (index, char) tuples)
for c in word:
    print(c)

5
6
6
e
e


In [None]:
#TODO: UNK processing
print('The size of trigrams before <UNK>: %d' %len(trigrams_dict))
UNK_word = []
Vocab_trigrams = []
for k in trigrams_dict:
    if trigrams_dict[k] < 3:
        UNK_word.append(k)
    else:
        Vocab_trigrams.append(k)
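
One possible way to finish the `<UNK>` step is to fold all rare n-grams into a single bucket so that unseen trigrams at test time have somewhere to map to. The sketch below runs on a toy count dictionary; the threshold of 3 mirrors the loop above, while the function name `collapse_rare` and the `<UNK>` key are hypothetical.

```python
def collapse_rare(ngram_counts, min_count=3, unk_token="<UNK>"):
    """Fold every n-gram seen fewer than `min_count` times into one bucket."""
    collapsed = {}
    unk_total = 0
    for gram, count in ngram_counts.items():
        if count < min_count:
            unk_total += count      # rare n-gram: add its mass to <UNK>
        else:
            collapsed[gram] = count  # frequent n-gram: keep as vocabulary
    if unk_total:
        collapsed[unk_token] = unk_total
    return collapsed

toy_counts = {"ing": 120, "the": 95, "zqx": 1, "qwp": 2}
print(collapse_rare(toy_counts))
```

Keeping the summed count under `<UNK>` preserves the total mass, which matters if the counts are later turned into probabilities.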

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
legit_series = pd.Series(['facebook','gooogle', 'apple'], name="domain")
cv = CountVectorizer(analyzer='char', ngram_range=(2,2))
counts_matrix = cv.fit_transform(legit_series)
ngram_score = np.log10(counts_matrix.sum(axis=0).getA1())

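# get_feature_names() was removed in scikit-learn 1.2; tolist() keeps the plain-list output
print(cv.get_feature_names_out().tolist())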
print("\nReturns a document-term matrix:")
print(counts_matrix.toarray())

print("\nSum up the values for each term and log the summed value:")
ngram_score

['ac', 'ap', 'bo', 'ce', 'eb', 'fa', 'gl', 'go', 'le', 'og', 'ok', 'oo', 'pl', 'pp']

Returns a document-term matrix:
[[1 0 1 1 1 1 0 0 0 0 1 1 0 0]
 [0 0 0 0 0 0 1 1 1 1 0 2 0 0]
 [0 1 0 0 0 0 0 0 1 0 0 0 1 1]]

Sum up the values for each term and log the summed value:


array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.30103   , 0.        ,
       0.        , 0.47712125, 0.        , 0.        ])

In [10]:
test_series = pd.Series(['faceboook','zqwpro'], name="domain")
print(cv.transform(test_series).toarray())
print(cv.transform(test_series).T.toarray())

[[1 0 1 1 1 1 0 0 0 0 1 2 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
[[1 0]
 [0 0]
 [1 0]
 [1 0]
 [1 0]
 [1 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [1 0]
 [2 0]
 [0 0]
 [0 0]]


In [11]:
print('alexa_gram feature: for each test domain, the sum of log10 legit counts over the bigrams it contains')
np.set_printoptions(threshold=np.inf)
ngram_score * cv.transform(test_series).T

alexa_gram feature: for each test domain, the sum of log10 legit counts over the bigrams it contains


array([0.95424251, 0.        ])
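
The same score can be reproduced without scikit-learn, which may make the arithmetic easier to follow. This is only a sketch: `bigrams`, `weights` and `alexa_gram_score` are illustrative names, and the three "legit" domains are the toy series used above.

```python
import math
from collections import Counter

def bigrams(text):
    """Overlapping character bigrams of `text`."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

legit = ["facebook", "gooogle", "apple"]
counts = Counter(g for d in legit for g in bigrams(d))
# Weight of each bigram: log10 of its total count across the legit domains
weights = {g: math.log10(c) for g, c in counts.items()}

def alexa_gram_score(domain):
    """Sum the weights of every bigram occurrence in `domain`; unseen bigrams add 0."""
    return sum(weights.get(g, 0.0) for g in bigrams(domain))

print(round(alexa_gram_score("faceboook"), 8))  # 0.95424251
print(alexa_gram_score("zqwpro"))               # 0.0
```

Note that a bigram seen exactly once in the legitimate set gets weight log10(1) = 0, so in this toy example only `oo` (count 3) and `le` (count 2) can contribute to a score. Here `faceboook` contains `oo` twice, giving 2 * log10(3).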

#### TF-IDF

**TF(t)** = (Number of times term t appears in a document) / (Total number of terms in the document)  
**IDF(t)** = log(Total number of documents / Number of documents with term t in it)  

The base of the logarithm only rescales the weights; the example here uses base 10.

Consider a document containing 100 words in which the word *cat* appears 3 times. The term frequency (tf) for *cat* is then 3 / 100 = 0.03. Now assume we have 10 million documents and *cat* appears in one thousand of them. The inverse document frequency (idf) is then log10(10,000,000 / 1,000) = 4. The tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
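
The arithmetic of the *cat* example can be checked with a few lines of Python. This is a sketch of the textbook formulas only; the helper names `tf` and `idf` are illustrative, and library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed, natural-log variant, so their numbers will differ.

```python
import math

def tf(term_count, total_terms):
    """Term frequency: share of the document's terms that are the given term."""
    return term_count / total_terms

def idf(total_docs, docs_with_term):
    """Inverse document frequency, using a base-10 logarithm."""
    return math.log10(total_docs / docs_with_term)

tf_cat = tf(3, 100)                # 3 occurrences in a 100-word document
idf_cat = idf(10_000_000, 1_000)   # term appears in 1,000 of 10 million documents
print(tf_cat, idf_cat, round(tf_cat * idf_cat, 6))
```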