
In [0]:
import pandas as pd
import numpy as np

In [0]:
data = pd.read_csv("MOCK_DATA.csv")


This is our mock data: a single email field, grouped into clusters by another criterion (the `id` column). Sorting the data by `id` makes that grouping easier to see.

# There are three types of clusters in this data set. 

1. Clusters of random, unrelated email addresses.
2. Clusters of a single email address repeated.
3. Clusters of a single email address disguised with known obfuscation patterns (added digits, dots, or plus tags).

The premise of this project is that clusters of obfuscated email addresses are inherently suspicious and worth closer scrutiny. 
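As a rough illustration of the three cluster types, the sketch below builds a tiny stand-in frame. The cluster names and addresses are hypothetical and are not taken from `MOCK_DATA.csv`. Note that counting distinct raw addresses cannot tell type 1 from type 3, which is why a normalization step is needed later.

```python
import pandas as pd

# Hypothetical rows standing in for MOCK_DATA.csv; every address is illustrative only.
sample = pd.DataFrame(
    {
        "id": ["random"] * 3 + ["repeated"] * 3 + ["obfuscated"] * 3,
        "email": [
            # 1. unrelated addresses
            "alice@example.org", "bob@example.net", "carol@example.com",
            # 2. one address repeated
            "taxguy@example.com", "taxguy@example.com", "taxguy@example.com",
            # 3. one address disguised with digits, dots, and a plus tag
            "josh01@example.com", "jo.sh@example.com", "josh+taxes@example.com",
        ],
    }
)

# Count distinct raw addresses per cluster.
distinct_per_cluster = sample.groupby("id")["email"].nunique()
print(distinct_per_cluster)
```

On the raw strings, the "random" and "obfuscated" clusters both show three distinct addresses, so they look identical until the obfuscation is stripped away.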


# The "blue" cluster represents the first type of cluster.


In [0]:
data.set_index('id', inplace=True)
data.sort_values(by='id', inplace=True)
data.loc['blue']

id,email
blue,parturient.montes@molestiedapibusligula.co.uk
blue,sed@infelisNulla.co.uk
blue,nibh.Donec@magnased.org
blue,dictum.eleifend.nunc@sagittis.org
blue,turpis@eleifendvitaeerat.co.uk
blue,auctor@pedeSuspendissedui.org
blue,aliquet.lobortis@etnuncQuisque.org
blue,Sed.neque.Sed@ametorci.org
blue,quam@sapien.net
blue,nec@consequatlectussit.co.uk


# The "cerulean" cluster represents the second type of cluster. 

In [0]:
data.loc['cerulean']

id,email
cerulean,legittaxguy@taxguy.com
cerulean,legittaxguy@taxguy.com
cerulean,legittaxguy@taxguy.com
cerulean,legittaxguy@taxguy.com
cerulean,legittaxguy@taxguy.com
cerulean,legittaxguy@taxguy.com
cerulean,legittaxguy@taxguy.com


# The "mauve" cluster represents a suspicious cluster.

In [0]:
data.loc['mauve']

id,email
mauve,josh02@gmail.com
mauve,josh02@gmail.com
mauve,josh03@gmail.com
mauve,josh+taxes@gmail.com
mauve,jo.sh@gmail.com
mauve,josh01@gmail.com
mauve,josh02@gmail.com
mauve,tom@gmail.com


# In the real world, clusters may be partly mixed, so the following process measures the 'impact' of email obfuscation on each cluster.

The first step is to write a function that normalizes our email strings.


In [0]:
import re

def normalize_email(email):
    """Strip common obfuscation patterns from the part before the '@'."""
    # Split the address once into its prefix (local part) and suffix (domain)
    parts = email.split(sep='@')
    prefix = parts[0]
    suffix = '@' + parts[1]

    # Strip out digits and dots from the prefix
    prefix = re.sub(r'[.0-9]', '', prefix)

    # Strip out the plus sign and anything that follows it
    prefix = prefix.split(sep='+')[0]

    # Return the reconstituted email address
    return prefix + suffix
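A quick sanity check of the normalization rules, using addresses from the 'mauve' cluster. The function is repeated here only so the snippet runs on its own; the `examples` list is a hand-picked subset, not the full data set.

```python
import re

def normalize_email(email):
    # Same logic as the cell above, repeated so this check is self-contained.
    parts = email.split("@")
    prefix = re.sub(r"[.0-9]", "", parts[0])  # drop digits and dots
    prefix = prefix.split("+")[0]             # drop the plus tag
    return prefix + "@" + parts[1]

examples = ["josh02@gmail.com", "jo.sh@gmail.com", "josh+taxes@gmail.com", "tom@gmail.com"]
normalized = [normalize_email(e) for e in examples]
print(normalized)
# ['josh@gmail.com', 'josh@gmail.com', 'josh@gmail.com', 'tom@gmail.com']
```

The three disguised variants collapse to one address, while the unrelated `tom@gmail.com` passes through unchanged.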



Then we apply the function to the `email` column, first to preview the result and then to store it in a new column.

In [0]:
data['email'].apply(normalize_email)

id
blue        parturientmontes@molestiedapibusligula.co.uk
blue                              sed@infelisNulla.co.uk
blue                              nibhDonec@magnased.org
blue                     dictumeleifendnunc@sagittis.org
blue                      turpis@eleifendvitaeerat.co.uk
blue                       auctor@pedeSuspendissedui.org
blue                   aliquetlobortis@etnuncQuisque.org
blue                            SednequeSed@ametorci.org
blue                                     quam@sapien.net
blue                        nec@consequatlectussit.co.uk
blue                     sagittis@condimentumDonecat.edu
blue                                     enimdiam@eu.com
blue                                     nisl@lectus.net
blue          Seddiamlorem@Pellentesquehabitantmorbi.com
cerulean                          legittaxguy@taxguy.com
cerulean                          legittaxguy@taxguy.com
cerulean                          legittaxguy@taxguy.com
cerulean                    

In [0]:
data['norm_email'] = data['email'].apply(normalize_email)

Let's look at our key test cluster:

In [0]:
data.loc['mauve']

id,email,norm_email
mauve,josh02@gmail.com,josh@gmail.com
mauve,josh02@gmail.com,josh@gmail.com
mauve,josh03@gmail.com,josh@gmail.com
mauve,josh+taxes@gmail.com,josh@gmail.com
mauve,jo.sh@gmail.com,josh@gmail.com
mauve,josh01@gmail.com,josh@gmail.com
mauve,josh02@gmail.com,josh@gmail.com
mauve,tom@gmail.com,tom@gmail.com


And this is a more random cluster: normalization removes the dots, but every address stays distinct. (Our legit tax guy cluster is unaffected.)

In [0]:
data.loc['yellow']

id,email,norm_email
yellow,ultricies.dignissim.lacus@commodo.ca,ultriciesdignissimlacus@commodo.ca
yellow,augue.scelerisque@magnaPraesent.net,auguescelerisque@magnaPraesent.net
yellow,mi.lacinia.mattis@acsem.org,milaciniamattis@acsem.org
yellow,adipiscing.enim.mi@AeneanmassaInteger.edu,adipiscingenimmi@AeneanmassaInteger.edu
yellow,Vivamus@aliquetProinvelit.edu,Vivamus@aliquetProinvelit.edu
yellow,vel.faucibus.id@euismod.edu,velfaucibusid@euismod.edu
yellow,sed.turpis.nec@Crassed.edu,sedturpisnec@Crassed.edu
yellow,id@leoVivamusnibh.edu,id@leoVivamusnibh.edu
yellow,Cras.dolor.dolor@Vestibulum.co.uk,Crasdolordolor@Vestibulum.co.uk
yellow,augue.eu.tellus@Duis.co.uk,augueeutellus@Duis.co.uk


This is where we begin to use the formula for information entropy. The goal here is to determine whether the cluster is 'artificially disorganized'. 
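For reference, the quantity computed in the cells that follow is the Shannon entropy of each cluster, in bits. The symbols $H$, $n$, and $p_i$ are notation introduced here for convenience: $n$ is the number of distinct addresses in the cluster and $p_i$ is the share of the cluster's rows that hold the $i$-th distinct address.

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$

Two limiting cases help with reading the results: a cluster containing one address repeated has $H = 0$, and a cluster of $n$ rows that are all different has $H = \log_2 n$.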


In our 'mauve' cluster, we can see a common actor using specific techniques to avoid detection (or, admittedly, for legitimate business reasons, though this seems rare).



In [0]:
# 'id' is still the index here; groupby accepts index level names,
# so no reset_index() is needed before grouping.

# Probability of each observed email address within its cluster
obs_P = (data.groupby(['id','email']).size()/data.groupby(['id']).size()).rename('P').reset_index()


# Probability of each normalized email address within its cluster
norm_P = (data.groupby(['id','norm_email']).size()/data.groupby(['id']).size()).rename('P').reset_index()



In [0]:
obs_P['log_P'] = np.log2(obs_P['P']) 
obs_P['-P_log_P'] = -(obs_P['P'] * obs_P['log_P'])


In [0]:

norm_P['log_P'] = np.log2(norm_P['P']) 
norm_P['-P_log_P'] = -(norm_P['P'] * norm_P['log_P'])
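The same per-cluster quantity can be sketched more compactly with `value_counts(normalize=True)`. The helper name `cluster_entropy` is an assumption made for this example and does not appear in the notebook; the `mauve` Series is typed in by hand from the cluster shown earlier.

```python
import numpy as np
import pandas as pd

def cluster_entropy(values):
    """Shannon entropy (in bits) of the distinct values in one cluster."""
    # Share of rows held by each distinct value
    p = values.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

# The 'mauve' cluster as displayed earlier in the notebook.
mauve = pd.Series([
    "josh02@gmail.com", "josh02@gmail.com", "josh03@gmail.com",
    "josh+taxes@gmail.com", "jo.sh@gmail.com", "josh01@gmail.com",
    "josh02@gmail.com", "tom@gmail.com",
])
print(round(cluster_entropy(mauve), 6))  # 2.405639
```

Assuming `id` is still the index of `data`, something like `data.groupby('id')['email'].apply(cluster_entropy)` should give one entropy value per cluster in a single step.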

In [0]:
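# Summing -P_log_P within each cluster gives its entropy in bits.
# numeric_only=True keeps the text column out of the sum; the summed
# P and log_P columns are by-products and are not used later.
observed_entropy = obs_P.groupby(['id']).sum(numeric_only=True)
observed_entropy.sort_values('-P_log_P',ascending=False)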

id,P,log_P,-P_log_P
red,1.0,-69.486868,4.087463
violet,1.0,-69.486868,4.087463
orange,1.0,-64.0,4.0
blue,1.0,-53.302969,3.807355
green,1.0,-53.302969,3.807355
indigo,1.0,-38.053748,3.459432
yellow,1.0,-38.053748,3.459432
mauve,1.0,-16.415037,2.405639
cerulean,1.0,0.0,0.0


In [0]:
# Same aggregation for the normalized addresses
normalized_entropy = norm_P.groupby(['id']).sum(numeric_only=True)
normalized_entropy.sort_values('-P_log_P',ascending=False)

id,P,log_P,-P_log_P
red,1.0,-69.486868,4.087463
violet,1.0,-69.486868,4.087463
orange,1.0,-64.0,4.0
blue,1.0,-53.302969,3.807355
green,1.0,-53.302969,3.807355
indigo,1.0,-38.053748,3.459432
yellow,1.0,-38.053748,3.459432
mauve,1.0,-3.192645,0.543564
cerulean,1.0,0.0,0.0


In [0]:
entropy_dif = observed_entropy['-P_log_P'] - normalized_entropy['-P_log_P'] 

In [0]:
entropy_dif.sort_values(ascending=False)

id
mauve       1.862075
yellow      0.000000
violet      0.000000
red         0.000000
orange      0.000000
indigo      0.000000
green       0.000000
cerulean    0.000000
blue        0.000000
Name: -P_log_P, dtype: float64
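The 'mauve' figure can be checked by hand with only the standard library. This is a minimal sketch: `entropy_bits` is a hypothetical helper, the domains are dropped from the 'mauve' strings because they are all `gmail.com`, and the `random_*` lists are made-up addresses used to show the zero-difference case.

```python
from collections import Counter
from math import log2

def entropy_bits(items):
    """Shannon entropy (in bits) of a list of labels."""
    counts = Counter(items)
    total = len(items)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Local parts of the 'mauve' cluster, before and after normalization.
observed = ["josh02", "josh02", "josh03", "josh+taxes", "jo.sh", "josh01", "josh02", "tom"]
normalized = ["josh"] * 7 + ["tom"]

drop = entropy_bits(observed) - entropy_bits(normalized)
print(round(entropy_bits(observed), 6), round(entropy_bits(normalized), 6), round(drop, 6))
# 2.405639 0.543564 1.862075

# Hypothetical random cluster: dots are stripped, but every address stays distinct.
random_raw = ["a.b@x.org", "c.d@y.org", "e@z.org"]
random_norm = ["ab@x.org", "cd@y.org", "e@z.org"]
random_drop = entropy_bits(random_raw) - entropy_bits(random_norm)
print(random_drop)  # 0.0
```

Normalization only lowers a cluster's entropy when it merges addresses that were previously counted as different, which is why the other clusters all show a difference of zero.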

In [0]:
data.loc['mauve']

id,email,norm_email
mauve,josh02@gmail.com,josh@gmail.com
mauve,josh02@gmail.com,josh@gmail.com
mauve,josh03@gmail.com,josh@gmail.com
mauve,josh+taxes@gmail.com,josh@gmail.com
mauve,jo.sh@gmail.com,josh@gmail.com
mauve,josh01@gmail.com,josh@gmail.com
mauve,josh02@gmail.com,josh@gmail.com
mauve,tom@gmail.com,tom@gmail.com
