### Jason's extra notes

- Concept: we want to teach the algorithm which character combinations are “expected” in a domain name
- We do that by counting the n-grams that appear in legitimate domains, then measuring how much a test domain's n-grams differ from them

References:
1. [Data Driven Security](http://datadrivensecurity.info/blog/posts/2014/Oct/dga-part2/)
2. [Stanford CS224n](https://www.youtube.com/watch?v=dkUtavsPqNA)

In [8]:
# Setup
import sklearn.feature_extraction
import pandas as pd
import numpy as np
import pylab

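# display and HTML are exposed by IPython.display; importing them from IPython.core.display is deprecated
from IPython.display import display, HTML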
display(HTML("<style>.container { width:100% !important; }</style>"))

# Set default pylab parameters to figure
pylab.rcParams['figure.figsize'] = (14.0, 5.0)
pylab.rcParams['axes.grid'] = True

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
legit_series = pd.Series(['facebook','gooogle', 'apple'], name="domain")
cv = CountVectorizer(analyzer='char', ngram_range=(2,2))
counts_matrix = cv.fit_transform(legit_series)
ngram_score = np.log10(counts_matrix.sum(axis=0).getA1())

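# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(cv.get_feature_names_out().tolist())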
print("\nReturns a document-term matrix:")
print(counts_matrix.toarray())

print("\nSum up the values for each term and log the summed value:")
ngram_score

['ac', 'ap', 'bo', 'ce', 'eb', 'fa', 'gl', 'go', 'le', 'og', 'ok', 'oo', 'pl', 'pp']

Returns a document-term matrix:
[[1 0 1 1 1 1 0 0 0 0 1 1 0 0]
 [0 0 0 0 0 0 1 1 1 1 0 2 0 0]
 [0 1 0 0 0 0 0 0 1 0 0 0 1 1]]

Sum up the values for each term and log the summed value:


array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.30103   , 0.        ,
       0.        , 0.47712125, 0.        , 0.        ])
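To make the counting concrete, here is a minimal pure-Python sketch of what the cell above computes, using only the standard library. The helper names `char_bigrams` and `bigram_log_scores` are illustrative and not part of scikit-learn; `CountVectorizer` also lowercases its input and returns a sparse matrix, which this sketch skips.

```python
from collections import Counter
import math

def char_bigrams(domain):
    """Return every overlapping 2-character substring of a domain."""
    return [domain[i:i + 2] for i in range(len(domain) - 1)]

def bigram_log_scores(domains):
    """Count each bigram across all domains, then take log10 of the totals."""
    totals = Counter()
    for domain in domains:
        totals.update(char_bigrams(domain))
    # Sort by bigram so the order matches the vectorizer's vocabulary order
    return {gram: math.log10(count) for gram, count in sorted(totals.items())}

legit_scores = bigram_log_scores(['facebook', 'gooogle', 'apple'])
print(legit_scores['oo'])  # log10(3): 'oo' occurs three times in total
print(legit_scores['le'])  # log10(2): once in 'gooogle', once in 'apple'
print(legit_scores['fa'])  # log10(1) = 0.0
```

Note that any bigram seen exactly once scores log10(1) = 0, so in this tiny corpus only `le` and `oo` carry weight.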

In [10]:
test_series = pd.Series(['faceboook','zqwpro'], name="domain")
print(cv.transform(test_series).toarray())
print(cv.transform(test_series).T.toarray())

[[1 0 1 1 1 1 0 0 0 0 1 2 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
[[1 0]
 [0 0]
 [1 0]
 [1 0]
 [1 0]
 [1 0]
 [0 0]
 [0 0]
 [0 0]
 [0 0]
 [1 0]
 [2 0]
 [0 0]
 [0 0]]


In [11]:
print('alexa_gram feature: for each test domain, the sum of the log10 legit-corpus counts of its ngrams that also appear in the legit domains')
np.set_printoptions(threshold=np.inf)
ngram_score * cv.transform(test_series).T

alexa_gram feature: for each test domain, the sum of the log10 legit-corpus counts of its ngrams that also appear in the legit domains


array([0.95424251, 0.        ])
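The same score can be reproduced by hand, which shows where 0.954 comes from. This is a sketch under the assumption that the feature is the dot product of the test domain's bigram counts with the log10 corpus counts, as the cells above suggest; `ngram_match_score` is a hypothetical helper name, not a library function.

```python
from collections import Counter
import math

def char_bigrams(domain):
    """Return every overlapping 2-character substring of a domain."""
    return [domain[i:i + 2] for i in range(len(domain) - 1)]

def ngram_match_score(domain, legit_domains):
    """Sum log10(corpus count) over each bigram occurrence in `domain`."""
    legit_counts = Counter()
    for legit in legit_domains:
        legit_counts.update(char_bigrams(legit))
    score = 0.0
    for gram in char_bigrams(domain):
        # Bigrams never seen in the legit corpus contribute nothing
        if gram in legit_counts:
            score += math.log10(legit_counts[gram])
    return score

legit = ['facebook', 'gooogle', 'apple']
print(ngram_match_score('faceboook', legit))  # 2 * log10(3), about 0.954
print(ngram_match_score('zqwpro', legit))     # 0.0, no shared bigrams
```

For `faceboook`, the only contribution comes from `oo`, which appears twice in the test domain and three times in the legit corpus. The other shared bigrams each appear once in the corpus and so add log10(1) = 0.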

#### TF-IDF

**TF(t)** = (Number of times term t appears in a document) / (Total number of terms in the document)  
**IDF(t)** = log_e(Total number of documents / Number of documents containing term t)  

Consider a document containing 100 words in which the word "cat" appears 3 times. The term frequency (tf) for "cat" is 3 / 100 = 0.03. Now assume we have 10 million documents and "cat" appears in one thousand of them. Using a base-10 logarithm for easy arithmetic, the inverse document frequency (idf) is log10(10,000,000 / 1,000) = 4, so the tf-idf weight is the product 0.03 * 4 = 0.12. With the natural logarithm from the definition above, idf = ln(10,000) ≈ 9.21 and the weight is about 0.28; changing the base only rescales every weight by the same constant factor.
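The arithmetic can be checked with a few lines of Python. Because the definition uses a natural log while the figure of 4 corresponds to base 10, this sketch computes both; the function names are illustrative only.

```python
import math

def term_frequency(term_count, total_terms):
    """Share of the document's terms that are the given term."""
    return term_count / total_terms

def idf_ratio(total_docs, docs_with_term):
    """Ratio inside the logarithm of the IDF definition."""
    return total_docs / docs_with_term

tf = term_frequency(3, 100)            # 0.03
ratio = idf_ratio(10_000_000, 1_000)   # 10,000

tfidf_base10 = tf * math.log10(ratio)  # idf = 4, weight about 0.12
tfidf_natural = tf * math.log(ratio)   # idf about 9.21, weight about 0.28

print(tfidf_base10)
print(tfidf_natural)
```

The two results differ by the constant factor ln(10) ≈ 2.303, so the ranking of terms is the same under either base.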