# Worksheet 5.2 - DGA Making Predictions on New Data

Once you have trained a model, any new data must go through the same pre-processing and feature engineering before it can be passed to the model for a prediction. This notebook shows how to build that pipeline and load the trained model to make predictions on new data.

In [None]:
import pandas as pd
import pickle
import numpy as np
import re

## Making a Prediction
The code below demonstrates how to go from an unknown raw domain to a prediction of whether it is a DGA domain or not. The key point is that you have to regenerate all the features and create a one-row DataFrame from them, which is then passed to the model.

### Define all the functions we used for feature engineering

In [None]:
def H_entropy (x):
    # Calculate Shannon Entropy
    prob = [ float(x.count(c)) / len(x) for c in dict.fromkeys(list(x)) ] 
    H = - sum([ p * np.log2(p) for p in prob ]) 
    return H

def firstDigitIndex(s):
    # Return the 1-based position of the first digit in s, or 0 if s has no digits
    for i, c in enumerate(s):
        if c.isdigit():
            return i + 1
    return 0

def vowel_consonant_ratio (x):
    # Calculate vowel to consonant ratio
    x = x.lower()
    vowels_pattern = re.compile('([aeiou])')
    consonants_pattern = re.compile('([b-df-hj-np-tv-z])')
    vowels = re.findall(vowels_pattern, x)
    consonants = re.findall(consonants_pattern, x)
    try:
        ratio = len(vowels) / len(consonants)
    except ZeroDivisionError:  # no consonants in the string
        ratio = 0
    return ratio

# ngrams: Implementation according to Schiavoni 2014: "Phoenix: DGA-based Botnet Tracking and Intelligence"
# http://s2lab.isg.rhul.ac.uk/papers/files/dimva2014.pdf

def ngrams(word, n):
    '''
    Extract all character ngrams of length n from a word (or from each word in a list)
    and return them as a regular Python list
    
    Input 
    word: (string) or a (list) of strings
    n: (integer) or a (list) of integers, length of ngram

    Output
    list_ngrams: (list) of ngram strings
    '''
    
    list_ngrams = []
    if isinstance(word, list):
        for w in word:
            if isinstance(n, list):
                for curr_n in n:
                    ngrams = [w[i:i+curr_n] for i in range(0,len(w)-curr_n+1)]
                    list_ngrams.extend(ngrams)
            else:
                ngrams = [w[i:i+n] for i in range(0,len(w)-n+1)]
                list_ngrams.extend(ngrams)
    else:
        if isinstance(n, list):
            for curr_n in n:
                ngrams = [word[i:i+curr_n] for i in range(0,len(word)-curr_n+1)]
                list_ngrams.extend(ngrams)
        else:
            ngrams = [word[i:i+n] for i in range(0,len(word)-n+1)]
            list_ngrams.extend(ngrams)

    return list_ngrams

def ngram_feature(word, common_dict, n):
    '''
    Takes (word) as input, splits it into (n) ngrams and then looks up and counts where
    these ngrams are found in the (common_dict). Function returns the normalized sum of all 
    ngrams that were found in the (common_dict). 
    
    For example, ngram_feature('facebook', ngram_common_dict, 2) will return 171.28
    
    Input 
    word: (str) or (list) of strings (domain in our case)
    common_dict: (dictionary) that contains the count for most common english words
    n: (int) or (list) of the ngram length example: 1,2,3
    
    Output
    feature: (float) a normalized sum of ngram count found in the common_dict
    '''

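    # Split the word into its ngrams.
    list_ngrams = ngrams(word, n)
    
    count_sum = 0
    
    for ngram in list_ngrams:
        # .get avoids a KeyError when an ngram is missing from a plain dict
        count_sum += common_dict.get(ngram, 0)
    try:
        # Normalize by the number of ngrams of length n in a single word
        feature = count_sum / (len(word) - n + 1)
    except (ZeroDivisionError, TypeError):
        # No ngrams of this length, or n was passed as a list
        feature = 0
    return feature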
    
def average_ngram_feature(word):
    
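    '''
    Takes a word (word) as input and uses the ngram_feature and ngrams functions (above) to 
    compute the normalized ngram count found in the common dictionary for 1-, 2- and 3-grams. 
    Then calculates the average of these three results.
    Note: reads the module-level ngram_common_dict, so the dictionary must be loaded first.
    
    Input
    word: (str) domain to score
    
    Output
    (float) average of the three ngram features
    '''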
    ngram_counts = []
    num_of_grams = [1,2,3]
    for n in num_of_grams:
        ngram_counts.append(ngram_feature(word, ngram_common_dict, n))
                  
    return sum(ngram_counts)/len(ngram_counts)
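
To make the simpler features concrete, here is a small worked example on the string `google`. It is a standard-library-only sketch written for illustration: the names `shannon_entropy` and `first_digit_position` are hypothetical stand-ins for `H_entropy` and `firstDigitIndex` above, not part of the worksheet code.

```python
import math
import re

def shannon_entropy(text):
    """Shannon entropy in bits per character (illustrative stand-in for H_entropy)."""
    probs = [text.count(c) / len(text) for c in set(text)]
    return -sum(p * math.log2(p) for p in probs)

def first_digit_position(text):
    """1-based position of the first digit, or 0 when there is none."""
    for i, c in enumerate(text, start=1):
        if c.isdigit():
            return i
    return 0

domain = "google"

# Same character classes as vowel_consonant_ratio above
vowels = len(re.findall(r"[aeiou]", domain))                  # o, o, e
consonants = len(re.findall(r"[b-df-hj-np-tv-z]", domain))    # g, g, l

print(round(shannon_entropy(domain), 4))   # 1.9183
print(vowels / consonants)                 # 1.0
print(first_digit_position(domain))        # 0 (no digit)
print(first_digit_position("abc1def"))     # 4
```

In `google` the letters `g` and `o` each have probability 1/3 and `l` and `e` each have probability 1/6, which gives an entropy of about 1.92 bits per character.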

## Load the dictionary of common ngrams 

In [None]:
# Source: https://github.com/first20hours/google-10000-english
with open('../data/d_common_en_words.pickle', 'rb') as f:
    ngram_common_dict = pickle.load(f)
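
The pickled dictionary maps ngrams to counts. How it was built is not shown in this worksheet, so the following is only a sketch under assumptions: it counts 1-, 2- and 3-grams over a tiny made-up word list (`toy_words`) and scores a word the way `ngram_feature` does for a single word and an integer `n`. The helper names `char_ngrams` and `ngram_score` are hypothetical.

```python
from collections import Counter

def char_ngrams(word, n):
    """All overlapping character ngrams of length n in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# Assumption: a toy word list standing in for the real list of common English words
toy_words = ["the", "then", "other"]

counts = Counter()
for w in toy_words:
    for n in (1, 2, 3):
        counts.update(char_ngrams(w, n))

def ngram_score(word, counts, n):
    """Sum of the counts of the word's ngrams, divided by the number of ngrams."""
    grams = char_ngrams(word, n)
    if not grams:
        return 0.0
    return sum(counts.get(g, 0) for g in grams) / len(grams)

print(counts["th"])                    # 3: 'th' occurs once in each toy word
print(ngram_score("the", counts, 2))   # 3.0: ('th' = 3 + 'he' = 3) / 2
print(ngram_score("xqz", counts, 2))   # 0.0: none of its bigrams were seen
```

Words made of ngrams that are frequent in the reference list score high, while strings of rarely seen ngrams score close to zero, which is what makes this feature useful for separating natural-looking domains from random-looking ones.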

## Load the model that you trained in the previous notebook

In [None]:
#load the trained model
with open('../data/dga_decision_tree.sav', 'rb') as file:
    clf = pickle.load(file)

## Define function to create the pipeline and get prediction

In [None]:
def is_dga(domain, clf, common_dict):
    # Takes a new domain string, the trained model `clf` and the dictionary of
    # common English ngram counts, and returns the model's prediction.
    # Note: `common_dict` is not used directly here; average_ngram_feature reads
    # the module-level `ngram_common_dict`.
    
    domain_features = pd.DataFrame()
    # Order of features is ['length', 'digits', 'entropy', 'vowel-cons', 'firstDigitIndex', 'ngrams']
    domain_features.loc[0,'length'] = len(domain)
    pattern = re.compile('([0-9])')
    domain_features.loc[0,'digits'] = len(re.findall(pattern, domain))
    domain_features.loc[0,'entropy'] = H_entropy(domain)
    domain_features.loc[0,'vowel-cons'] = vowel_consonant_ratio(domain)
    domain_features.loc[0,'firstDigitIndex'] = firstDigitIndex(domain)
    domain_features.loc[0,'ngrams'] = average_ngram_feature(domain)
    
    pred = clf.predict(domain_features)
    return pred[0]
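
The shape of the `predict` call can be easier to see in isolation. Assuming a scikit-learn style model, `predict` takes one row per sample with the features in a fixed order and returns one prediction per row, which is why `is_dga` returns `pred[0]`. The sketch below uses only two features and a hypothetical stand-in class, `LongOrDigitHeavy`, with a made-up rule. It is not the trained decision tree and its outputs say nothing about real DGA detection.

```python
# Assumption: a reduced, hypothetical feature set used only for this illustration
FEATURE_ORDER = ["length", "digits"]

class LongOrDigitHeavy:
    """Hypothetical stand-in for a trained model with a predict() method."""

    def predict(self, rows):
        # One prediction per input row, in the same order as the rows
        return [1 if (length > 20 or digits > 3) else 0 for length, digits in rows]

def to_row(domain):
    """Compute the features by name, then emit them in FEATURE_ORDER."""
    features = {
        "digits": sum(ch.isdigit() for ch in domain),
        "length": len(domain),
    }
    return [features[name] for name in FEATURE_ORDER]

model = LongOrDigitHeavy()

for domain in ["google", "1vxznov16031kjxneqjk1rtofi6"]:
    rows = [to_row(domain)]       # a list holding a single row
    pred = model.predict(rows)    # a list holding a single prediction
    print(domain, pred[0])
```

Looking the features up by name and emitting them in one agreed order keeps the row aligned with the columns the model was trained on, even if the features are computed in a different order.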

## Make a prediction on a domain
Get a prediction using the `is_dga` function defined above for the following domain:

```
adfajdskflajlkdsfjaksdjf;lakjsdfkajdsf8989dsf32.com
```

In [None]:
# Your code here

A few more domains to try:

- spardeingeld
- google
- blackhat.com
- 1vxznov16031kjxneqjk1rtofi6
- lthmqglxwmrwex

In [None]:
# Your code here