## Data Wrangling

This project's approach will rely mainly on lexical features derived from individual urls. Therefore, after the collected benign, phishing and malicious url lists are imported, new features will be created from the base urls.

Besides lexical features, this project will leverage www.alexa.com to determine if a url's domain exists in Alexa's Top 500 website list.

In [1]:
#import relevant modules
import re
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import datetime
from dateutil.relativedelta import relativedelta
from datetime import date
import warnings
warnings.filterwarnings('ignore')
import gc
gc.enable()

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)



### Data Sources & Import

Data Sources:

Malicious URLs:
A sample of 10,000 urls is taken from a csv record of over 600,000 malicious url links retrieved from https://urlhaus.abuse.ch. URLhaus is a project operated by abuse.ch. The project collects and shares malware URLs, to assist network administrators and security analysts in protecting their networks from cyber threats.

Phishing URLs:
A sample of 10,000 urls is taken from a csv record of over 17,000 phishing url links retrieved from http://phishtank.org/. PhishTank is a collaborative clearing house for data and information about phishing on the Web. Its url lists are available to developers to integrate anti-phishing data into their applications.

Benign URLs:
Over 25,000 urls were collected by crawling Alexa's list of the top 2500 websites. In order to help validate that each url was 'benign', each url's reputation was checked via VirusTotal. VirusTotal inspects urls with over 70 antivirus scanners and URL/domain blacklisting services, as well as other tools. Virus scans were requested in those instances where a url had no previous scans or reporting available.

In [None]:
# read in csv file of phishing urls
ph_df = pd.read_csv('phishing_urls.csv')     

# Reduce dataset to reflect only urls verified and online
ph_df = ph_df[(ph_df['verified'] == 'yes') & (ph_df['online'] == 'yes')]

# drop unnecessary features
drop = ['phish_id', 'target', 'phish_detail_url', 'submission_time', 'verification_time', 'verified', 'online']
ph_df = ph_df.drop(drop, axis=1)

# assign category value 'phishing'
ph_df['category'] = 'phishing'

# take a sample of 10,000 records
ph_df_sample = ph_df.sample(n=10000) 

In [None]:
# read in benign urls, drop unnecessary features
b_df = pd.read_csv('alexa_urls.csv')  
drop = ['scheme', 'netloc', 'path', 'params', 'query', 'fragment']
b_df = b_df.drop(drop, axis=1)

# assign category value 'benign'
b_df['category'] = 'benign'

In [None]:
# read in malicious urls
m_df = pd.read_csv('malicious_urls.csv')

# assign category value 'malicious'
m_df['category'] = 'malicious'

# take a sample of 10,000 records
m_df_sample = m_df.sample(n=10000) 

In [None]:
# concat all dataframes
df = pd.concat([b_df, ph_df_sample, m_df_sample], axis=0)

In [None]:
# save a copy
df.to_pickle('capstone2_data')

### Develop New Features

Build out new features based on lexical analysis of the url and its components: scheme, netloc, path, params, query and fragment.

In [2]:
# df = pd.read_pickle('capstone2_data')

In [3]:
# parse the urls into components: scheme, netloc, path, params, query and fragment.
from urllib.parse import urlparse

df['scheme'],df['netloc'],df['path'],df['params'],df['query'],df['fragment'] = zip(*df['url'].map(urlparse))
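
As a quick illustration of what the parsing step yields, the sketch below runs `urlparse` on a made-up URL (the address is hypothetical, chosen only so that all six components are populated). It also shows that a URL without a scheme ends up with an empty `netloc`, which is worth keeping in mind for the netloc-based features built later.

```python
from urllib.parse import urlparse

# hypothetical URL used only for illustration
sample = 'http://login.example.com/a/b.php;v=1?id=7&ref=x#top'
scheme, netloc, path, params, query, fragment = urlparse(sample)

print(scheme)    # http
print(netloc)    # login.example.com
print(path)      # /a/b.php
print(params)    # v=1
print(query)     # id=7&ref=x
print(fragment)  # top

# without a scheme, urlparse treats the whole string as a path
no_scheme = urlparse('example.com/a/b.php')
print(repr(no_scheme.netloc))  # ''
print(no_scheme.path)          # example.com/a/b.php
```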

#### URL Lexical Features

In [4]:
# build url length features
df['len_url'] = df['url'].apply(len) # url length
df['is_53'] = (df['len_url'] < 54) # is url less than 54 char
df['is_54_75'] = (df['len_url'] > 53) & (df['len_url'] < 76) # is url between 54 and 75 char
df['is_76'] = (df['len_url'] > 75) # is url greater than 75 char

In [5]:
# split url into tokens and create new features: list of url's tokens, number of url tokens, average token length 
from nltk.tokenize import WordPunctTokenizer

tokenized_url = []
len_tokenized = []

for url in df.url:
    result = WordPunctTokenizer().tokenize(url)
    tokenized_url.append(result)
    len_tokenized.append(len(result))
    
df['len_tokenized_url'] = len_tokenized
df['avg_token_len'] = df['len_url']/df['len_tokenized_url'] 
df['tokenized_url'] = tokenized_url
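
To make the tokenization concrete, here is a minimal sketch that uses a plain regular expression instead of NLTK. The pattern is an assumption: it is meant to mirror how `WordPunctTokenizer` splits text into runs of word characters and runs of punctuation. The URL and the helper name `tokenize_url` are hypothetical.

```python
import re

# assumption: this pattern mirrors NLTK's WordPunctTokenizer, which splits
# text into runs of word characters and runs of punctuation
TOKEN_PATTERN = re.compile(r'\w+|[^\w\s]+')

def tokenize_url(url):
    # return the list of alphanumeric and punctuation tokens in the url
    return TOKEN_PATTERN.findall(url)

sample = 'http://example.com/a-b'  # hypothetical URL
tokens = tokenize_url(sample)
print(tokens)
# ['http', '://', 'example', '.', 'com', '/', 'a', '-', 'b']

# average token length as defined above: url length / number of tokens
avg_token_len = len(sample) / len(tokens)
print(len(tokens))  # 9
```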

In [6]:
# create new numeric feature identifying location of last set of '//' in url
df['last_slashes'] = df.apply(lambda row: row.url.rfind('//'), axis=1)

# create feature specifying the relative position (last_slashes / url length) of the last '//' within the url
df['loc_last_slashes'] = df['last_slashes']/df['len_url']
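
The intuition behind this feature is that, in an ordinary `http` URL, the only `//` follows the scheme at index 5 (6 for `https`), so a later position hints at a second URL embedded in the first. The sketch below illustrates this with hypothetical URLs; the helper name `last_double_slash` is also an assumption. Note that `rfind` returns -1 when no `//` is present, which makes the relative position negative.

```python
def last_double_slash(url):
    # index of the last '//' in the url, or -1 if there is none
    return url.rfind('//')

plain = 'http://a.com/x'                    # hypothetical
redirect = 'http://a.com//http://evil.com'  # hypothetical

print(last_double_slash(plain))      # 5
print(last_double_slash(redirect))   # 19
print(last_double_slash('a.com/x'))  # -1
```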

In [7]:
# create features reflecting character content

def pc_upper_lower(string):
    # Based on the a-zA-Z characters in a string, 
    # calculates the percentage of uppercase and lowercase characters.
    
    alpha = uppercase = lowercase = n_letters = 0 
    if len(string) >= 1:
        for char in string:
            if char.isalpha():
                alpha += 1
                if char.isupper():
                    uppercase += 1
                if char.islower():
                    lowercase += 1
        n_letters = uppercase + lowercase
        if n_letters != 0:
            percent_upper = round(uppercase/n_letters, 3)
            percent_lower = round(lowercase/n_letters, 3)
        else:
            percent_upper = 0
            percent_lower = 0
    else:
        percent_upper = 0
        percent_lower = 0 
    return percent_upper, percent_lower


url_list = df['url'].tolist()

num = []
let = []
spec = []

percent_alpha = []
percent_num = []
percent_special = []

num_dots = []
num_at_signs = []
num_semicolons = []
num_underscores = []
num_question_marks = []
upper_percent = []
lower_percent = []

for i in url_list:
    numbers = sum(c.isdigit() for c in i)
    num.append(numbers)
    percent_num.append(numbers/len(i))
    
    letters = sum(c.isalpha() for c in i)
    let.append(letters)
    percent_alpha.append(letters/len(i))
    
    others = len(i) - numbers - letters
    spec.append(others)
    percent_special.append(others/len(i))
    
    num_dots.append(i.count('.'))
    num_at_signs.append(i.count('@'))
    num_semicolons.append(i.count(';'))
    num_underscores.append(i.count('_'))
    num_question_marks.append(i.count('?'))

    percent_upper, percent_lower = pc_upper_lower(i)
    upper_percent.append(percent_upper)
    lower_percent.append(percent_lower)

df['n_let'] = let 
df['n_num'] = num
df['n_spec'] = spec
df['pc_num'] = percent_num
df['pc_let'] = percent_alpha
df['pc_spec'] = percent_special
df['n_dots'] = num_dots
df['n_ats'] = num_at_signs
df['n_semicol'] = num_semicolons
df['num_underscores'] = num_underscores
df['num_question'] = num_question_marks
df['pc_uppercase'] = upper_percent
df['pc_lowercase'] = lower_percent
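
The same letter, digit and special-character counts are computed again for the domain, netloc, path, parameter, query and fragment sections below. One way to avoid repeating the loop is a small helper, sketched here. The name `char_profile` and its dictionary keys are hypothetical and are not part of the notebook.

```python
def char_profile(text):
    # hypothetical helper: counts and fractions of letters, digits and
    # all remaining ("special") characters in a string
    n = len(text)
    letters = sum(c.isalpha() for c in text)
    digits = sum(c.isdigit() for c in text)
    others = n - letters - digits
    if n == 0:
        # empty sections (e.g. a missing query) get zeros instead of dividing by 0
        return {'n_let': 0, 'n_num': 0, 'n_spec': 0,
                'pc_let': 0.0, 'pc_num': 0.0, 'pc_spec': 0.0}
    return {'n_let': letters, 'n_num': digits, 'n_spec': others,
            'pc_let': letters / n, 'pc_num': digits / n, 'pc_spec': others / n}

profile = char_profile('abc-123.com')
print(profile['n_let'], profile['n_num'], profile['n_spec'])  # 6 3 2
```

Returning zeros for an empty string matches how the later cells handle empty path, parameter, query and fragment sections.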

In [8]:
# calculate a Shannon entropy score for each url
# urls whose characters are more varied and more evenly distributed will have a higher score

import math

def shannon(word):
    # Shannon entropy (in bits) of the character distribution of `word`
    entropy = 0.0
    length = len(word)
    occ = {}
    for c in word:
        # count every occurrence, including the first one
        occ[c] = occ.get(c, 0) + 1

    for (k, v) in occ.items(): # changed from iteritems
        p = float(v) / float(length)
        entropy -= p * math.log(p, 2) # Log base 2
    return entropy

url_entropy_result = []

for i in url_list:
    url_entropy_result.append(shannon(i))
    
df['entropy'] = url_entropy_result
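
A simple way to sanity-check an entropy function is to run it on strings whose entropy is known in advance: a single repeated character carries 0 bits, two equally frequent characters carry 1 bit, and four equally frequent characters carry 2 bits. The sketch below is a compact reference version built on `collections.Counter`; the name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(text):
    # entropy in bits of the character distribution; 0.0 for an empty string
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

print(shannon_entropy('aaaa'))  # 0.0
print(shannon_entropy('aabb'))  # 1.0
print(shannon_entropy('abcd'))  # 2.0
```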

In [9]:
# create a masque feature: the number of letter+digit+letter substrings in a url
# create a character continuity rate feature: (length of longest letter substring + length of longest digit substring +
# length of longest special character substring) / URL length

import re

def masque_count(token):
    match = re.findall("([a-zA-Z][0-9][a-zA-Z])", token)
    match_count = len(match)
    return match_count

def longestSubstring(str):
    
    # find the longest consecutive substring of a certain type
    
    l = re.findall(r'[A-Za-z]+', str)
    d = re.findall(r'\d+', str)
    s = re.findall(r'[^a-zA-Z0-9]+', str)
    if l:  #THESE WONT CALC IF EMPTY LIST
        ll = max(l, key = len)
        max_l = len(ll)
    else:
        max_l = 0
    if d: 
        ld = max(d, key = len)
        max_d = len(ld)
    else:
        max_d = 0
    if s:
        ls = max(s, key = len)
        max_s = len(ls)
    else:
        max_s = 0
        
    char_total = (max_l + max_d + max_s)
    char_cont_rate = char_total/len(str)
    return char_cont_rate


url_masques = [] 
cont_rate = []
                    
for url in url_list:
    url_masques.append(masque_count(url))
    cont_rate.append(longestSubstring(url))

df['n_masques'] = url_masques
df['char_cont_rate'] = cont_rate
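
A small worked example may help with these two definitions. The sketch below re-implements both features in a self-contained form (the name `char_continuity_rate` and the sample strings are hypothetical). For `'abc12--x'` the longest letter, digit and special-character runs have lengths 3, 2 and 2, so the rate is 7/8. Note that `re.findall` returns non-overlapping matches, so `'a1b2c'` counts as one masque rather than two.

```python
import re

def masque_count(text):
    # number of non-overlapping letter+digit+letter substrings
    return len(re.findall(r'[a-zA-Z][0-9][a-zA-Z]', text))

def char_continuity_rate(text):
    # (longest letter run + longest digit run + longest special run) / length
    if not text:
        return 0.0
    total = 0
    for pattern in (r'[A-Za-z]+', r'\d+', r'[^a-zA-Z0-9]+'):
        runs = re.findall(pattern, text)
        total += max((len(run) for run in runs), default=0)
    return total / len(text)

print(masque_count('paypa1l'))           # 1
print(masque_count('a1b2c'))             # 1
print(char_continuity_rate('abc12--x'))  # 0.875
```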

#### Domain Lexical Features

In [10]:
# create reg_domain and domain_suffix features, by extracting second-level and top-level domain section. e.g. example.com
# identify ip addresses used in place of domain

# import tldextract to parse out true registered domain
import tldextract 

# adding a function that identifies an ip address
def is_ipv4(ip):
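    # raw string avoids invalid-escape warnings; {1,3} requires at least one
    # digit per octet so that int() below never receives an empty string
    match = re.match(r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})", ip)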
    if not match:
        return False
    quad = []
    for number in match.groups():
        quad.append(int(number))
    if quad[0] < 1:
        return False
    for number in quad:
        if number > 255 or number < 0:
            return False
    return True

reg_domain = []
domain_suffix = []

for i in url_list:
    ext = tldextract.extract(i)
    reg = '.'.join(ext[1:]) #this returns domain + suffixes or ip + .
    sub_domain = '.'.join(ext[:2])
    suffix = ext.suffix
    domain_suffix.append(suffix)
    if is_ipv4(reg): # if reg is an ip, drop the . at the end of ip
        reg = str(reg)[:-1]
    
    reg_domain.append(reg) #append list with the domain+suffix, or ip address
    
#add domain as new df feature 
df['reg_domain'] = reg_domain
df['domain_suffix'] = domain_suffix
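
For comparison, the standard library's `ipaddress` module can perform a strict IPv4 check. This is only an alternative sketch, not the method used in this notebook: unlike `is_ipv4` above, it rejects strings with trailing characters (such as the trailing dot) and it accepts a first octet of 0. The name `looks_like_ipv4` is hypothetical.

```python
import ipaddress

def looks_like_ipv4(host):
    # True only if the whole string is a valid dotted-quad IPv4 address
    try:
        ipaddress.IPv4Address(host)
    except ValueError:
        return False
    return True

print(looks_like_ipv4('192.168.0.1'))  # True
print(looks_like_ipv4('example.com'))  # False
print(looks_like_ipv4('256.1.1.1'))    # False
print(looks_like_ipv4('1.2.3'))        # False
```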


In [11]:
# create feature identifying the number of domain suffixes (top-level domains)

num_domain_suffix = []

for i in df.domain_suffix:
    if i:
        num_domain_suffix.append(i.count('.') + 1)
    else: 
        num_domain_suffix.append(0)


df['n_domain_suffix'] = num_domain_suffix

In [12]:
# use the is_ipv4 function defined above to create a new boolean feature.
# create new features based on reg_domain characters

IP = []
num = []
let = []
spec = []
percent_numbers = []
percent_chars = []
percent_others = []
domain_dots = []
domain_len = []
domain_hyphens = []
domain_ats = []
domain_masques = []
domain_entropy_result = []

for i in reg_domain:
    response = is_ipv4(i)
    IP.append(response)
    
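    # count the digits in this domain (otherwise a stale value from an earlier loop is reused)
    numbers = sum(c.isdigit() for c in i)
    num.append(numbers)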
    percent_numbers.append(numbers/len(i))
    
    letters = sum(c.isalpha() for c in i)
    let.append(letters)
    percent_chars.append(letters/len(i))
    
    others = len(i) - numbers - letters
    spec.append(others)
    percent_others.append(others/len(i))
    
    domain_dots.append(i.count('.'))
    domain_len.append(len(i))
    domain_hyphens.append(i.count('-'))
    domain_ats.append(i.count('@'))                  
                    
    domain_masques.append(masque_count(i))
    domain_entropy_result.append(shannon(i))

df['len_domain'] = domain_len
df['is_ip'] = IP #add boolean list to df
df['n_domain_num'] = num
df['n_domain_let'] = let
df['n_domain_spec'] = spec
df['pc_domain_num'] = percent_numbers
df['pc_domain_let'] = percent_chars
df['pc_domain_spec'] = percent_others
df['n_domain_dots'] = domain_dots
df['n_domain_tok'] = df['n_domain_dots'] + 1
df['avg_domain_tok_len'] = df['len_domain']/df['n_domain_tok']
df['n_domain_hyphens'] = domain_hyphens
df['n_domain_ats'] = domain_ats
df['n_domain_masques'] = domain_masques
df['domain_entropy'] = domain_entropy_result

In [13]:
# create is_top_500_domain (a boolean feature) to report if reg_domain is in Alexa's top 500 domain list

# read in top 500 websites
top_500_domains = pd.read_csv('alexa_top500.csv')

top_500_list = top_500_domains.domain.to_list()

#adding a function that identifies whether a domain is in the top 500 domain list
def is_match(domain, my_list):
    if domain in my_list:
        return True
    else:
        return False

is_top_500_domain = []

for domain in reg_domain:
    if is_ipv4(domain):
        is_top_500_domain.append(False)
    else:
        is_top_500_domain.append(is_match(domain, top_500_list))
    
df['is_top500_domain'] = is_top_500_domain # likely FALSE for suspicious/malicious urls
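
The lookup itself is a plain membership test. The sketch below shows the idea with a `set`, which gives constant-time lookups where a list is scanned on every check. The three domains are a hypothetical stand-in for the contents of `alexa_top500.csv`, and lowercasing the input is an added assumption, not something the cell above does.

```python
# hypothetical stand-in for the domains loaded from alexa_top500.csv
top_domains = {'google.com', 'youtube.com', 'wikipedia.org'}

def in_top_list(domain, top_set):
    # assumption: domain names are compared case-insensitively
    return domain.lower() in top_set

print(in_top_list('google.com', top_domains))           # True
print(in_top_list('Wikipedia.org', top_domains))        # True
print(in_top_list('g00gle-login.com', top_domains))     # False
```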

In [14]:
# create features based on url's netloc section (netloc = subdomain + domain + suffix) in comparison to existing reg_domain feature

num_netloc_dots = []
len_netloc = []
num = []
let = []
spec = []
percent_numbers = []
percent_chars = []
percent_others = [] 
netloc_masques = []
netloc_entropy_result = []
n_subs = []

netloc = df.netloc

for i in netloc:
    dots = i.count('.')
    num_netloc_dots.append(dots)
    len_netloc.append(len(i))

    numbers = sum(c.isdigit() for c in i)
    num.append(numbers)
    percent_numbers.append(numbers/len(i))
    
    letters = sum(c.isalpha() for c in i)
    let.append(letters)
    percent_chars.append(letters/len(i))
    
    others = len(i) - numbers - letters
    spec.append(others)
    percent_others.append(others/len(i))
    
    netloc_masques.append(masque_count(i))
    
    netloc_entropy_result.append(shannon(i))
    

df['n_netloc_dots'] = num_netloc_dots
df['len_netloc'] = len_netloc
df['n_netloc_num'] = num
df['n_netloc_let'] = let
df['n_netloc_spec'] = spec
df['pc_netloc_num'] = percent_numbers
df['pc_netloc_let'] = percent_chars
df['pc_netloc_spec'] = percent_others
df['n_netloc_tok'] = df['n_netloc_dots'] + 1
df['n_subdomains'] = (df['n_netloc_tok'] - df['n_domain_tok'])
df['avg_netloc_tok_len'] = df['len_netloc']/(df['n_netloc_tok'])
df['n_netloc_masques'] = netloc_masques
df['netloc_entropy'] = netloc_entropy_result
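
The subdomain count above is the difference between two dot-separated token counts. A short worked example, using hypothetical hostnames and a hypothetical helper name, shows the arithmetic: `mail.login.example.co.uk` has 5 tokens and its registered domain `example.co.uk` has 3, which leaves 2 subdomains.

```python
def count_subdomains(netloc, reg_domain):
    # tokens are the dot-separated parts; subdomains are the tokens
    # of the netloc that are not part of the registered domain
    netloc_tokens = netloc.count('.') + 1
    domain_tokens = reg_domain.count('.') + 1
    return netloc_tokens - domain_tokens

print(count_subdomains('mail.login.example.co.uk', 'example.co.uk'))  # 2
print(count_subdomains('example.com', 'example.com'))                 # 0
```

Because the count is based on dots in the raw netloc, any port or user information in the netloc is included in its last or first token rather than being stripped.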

#### Path Lexical Features

In [15]:
# create feature with list of all paths within url

path_items = []

for i in df.path:
    path_list = (re.split('/', i))
    path_list = [x for x in path_list if x != ""]
    path_items.append(path_list)
    
df['path_items'] = path_items

In [16]:
# create additional path features based on characters

path_slashes = []
path_20 = []
len_total_path = []
percent_numbers = []
percent_letters = []
percent_others = [] 
path_masques = []
path_entropy_result = []

for i in df.path:
    path_slashes.append(i.count('/'))
    path_20.append(i.count('/%20'))
    len_total_path.append(len(i))
    
    if len(i) > 0:
        numbers = sum(c.isdigit() for c in i)
        percent_numbers.append(numbers/len(i))
        letters = sum(c.isalpha() for c in i)
        percent_letters.append(letters/len(i))
        others = len(i) - numbers - letters
        percent_others.append(others/len(i))
        path_masques.append(masque_count(i))
        path_entropy_result.append(shannon(i))
    
    else:
        percent_numbers.append(0)
        percent_letters.append(0)
        percent_others.append(0)
        path_masques.append(0)
        path_entropy_result.append(0)

df['len_all_paths'] = len_total_path    
df['n_path_slashes'] = path_slashes
df['n_path_pc20'] = path_20
df['pc_path_num'] = percent_numbers
df['pc_path_let'] = percent_letters
df['pc_path_spec'] = percent_others
df['n_path_masques'] = path_masques
df['path_entropy'] = path_entropy_result

In [17]:
# create path features based on length of individual path items

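def path_lengths(list_of_paths):
    # return the shortest and longest item lengths, the number of
    # single-character items, and the total number of items
    # (local names avoid shadowing the built-in min and max)
    shortest = len(list_of_paths[0]) if list_of_paths else 0
    longest = 0
    single = 0
    for i in list_of_paths:
        if len(i) < shortest:
            shortest = len(i)
        if len(i) > longest:
            longest = len(i)
        if len(i) == 1:
            single += 1
    num_items = len(list_of_paths)
    return shortest, longest, single, num_items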

path_shortest_item_len = []
path_longest_item_len = []
num_single_char_path = []
num_path_items = []

for paths in df.path_items:
    if len(paths) > 0:
        # use names that do not overwrite the built-in min and max
        shortest, longest, single, num_items = path_lengths(paths)
        path_shortest_item_len.append(shortest)
        path_longest_item_len.append(longest)
        num_single_char_path.append(single)
        num_path_items.append(num_items)
    else:
        path_shortest_item_len.append(0)
        path_longest_item_len.append(0)
        num_single_char_path.append(0)
        num_path_items.append(0)
    
df['shortest_path_len'] = path_shortest_item_len
df['longest_path_len'] = path_longest_item_len
df['n_single_char_path'] = num_single_char_path
df['n_path_items'] = num_path_items
df['avg_path_token_len'] = (df['len_all_paths'])/(df['n_path_items'])

# switch out any 'inf' values with NAN
df.loc[~np.isfinite(df['avg_path_token_len']), 'avg_path_token_len'] = np.nan
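
To illustrate the path-item features on a single path, the sketch below computes the same quantities for one hypothetical path string. The helper name `path_item_stats` and its dictionary keys are assumptions. As in the cells above, the average is the full path length (slashes included) divided by the number of items, and it is NaN when there are no items.

```python
import math

def path_item_stats(path):
    # split on '/', drop empty pieces, then summarise the item lengths
    items = [p for p in path.split('/') if p]
    if not items:
        return {'shortest': 0, 'longest': 0, 'n_single': 0,
                'n_items': 0, 'avg': math.nan}
    lengths = [len(p) for p in items]
    return {'shortest': min(lengths),
            'longest': max(lengths),
            'n_single': sum(1 for n in lengths if n == 1),
            'n_items': len(items),
            'avg': len(path) / len(items)}

stats = path_item_stats('/a/images/logo.png')  # hypothetical path
print(stats['shortest'], stats['longest'])     # 1 8
print(stats['n_single'], stats['n_items'])     # 1 3
print(stats['avg'])                            # 6.0
```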

In [18]:
# for Path section, calculate percent of upper and lowercase letters (of total letters)

upper_percent = []
lower_percent = []

paths = df.path

for i in paths:
    percent_upper, percent_lower = pc_upper_lower(i)
    upper_percent.append(percent_upper)
    lower_percent.append(percent_lower)
    
df['pc_path_uppercase'] = upper_percent
df['pc_path_lowercase'] = lower_percent


#### Parameter Lexical Features

In [19]:
# create features based on parameter section of url

len_parameters = []
num = []
let = []
spec = []
percent_numbers = []
percent_chars = []
percent_others = [] 
params_masques = []
param_entropy_result = []

for i in df.params:
    length = len(i)
    len_parameters.append(length)
    
    if i:
        numbers = sum(c.isdigit() for c in i)
        num.append(numbers)
        percent_numbers.append(numbers/len(i))
    
        letters = sum(c.isalpha() for c in i)
        let.append(letters)
        percent_chars.append(letters/len(i))
    
        others = len(i) - numbers - letters
        spec.append(others)
        percent_others.append(others/len(i))
    
        params_masques.append(masque_count(i))
    
        param_entropy_result.append(shannon(i))
    
    else:
        num.append(0)
        let.append(0)
        spec.append(0)
        percent_numbers.append(0)
        percent_chars.append(0)
        percent_others.append(0)
        params_masques.append(0)
        param_entropy_result.append(0)
        
df['len_param'] = len_parameters
df['n_param_num'] = num
df['n_param_let'] = let
df['n_param_spec'] = spec
df['pc_param_num'] = percent_numbers
df['pc_param_let'] = percent_chars
df['pc_param_spec'] = percent_others
df['n_params_masque'] = params_masques
df['param_entropy'] = param_entropy_result

#### Query Lexical Features

In [20]:
# create features based on query section of url

queries = df['query']
num_queries = []
len_query = []
num = []
let = []
spec = []
percent_numbers = []
percent_chars = []
percent_others = [] 
queries_masques = []
queries_entropy_result = []


for i in queries:
    if len(i) > 0:
        num_queries.append(i.count(';') + 1) 
        len_query.append(len(i))
        
        numbers = sum(c.isdigit() for c in i)
        num.append(numbers)
        percent_numbers.append(numbers/len(i))
        
        letters = sum(c.isalpha() for c in i)
        let.append(letters)
        percent_chars.append(letters/len(i))
        
        others = len(i) - numbers - letters
        spec.append(others)
        percent_others.append(others/len(i))
    
        queries_masques.append(masque_count(i))
        
        queries_entropy_result.append(shannon(i))
    else:
        num_queries.append(0)
        len_query.append(0)
        num.append(0)
        let.append(0)
        spec.append(0)
        percent_numbers.append(0)
        percent_chars.append(0)
    
        percent_others.append(0)
    
        queries_masques.append(0)
    
        queries_entropy_result.append(0)
        
df['n_queries'] = num_queries
df['len_query'] = len_query
df['n_query_num'] = num
df['n_query_let'] = let
df['n_query_spec'] = spec
df['pc_query_num'] = percent_numbers
df['pc_query_let'] = percent_chars
df['pc_query_spec'] = percent_others
df['n_queries_masques'] = queries_masques
df['queries_entropy'] = queries_entropy_result
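
The cell above estimates the number of query parameters from the count of `;` separators. Many URLs separate parameters with `&` instead, so an alternative worth considering is to let the standard library parse the query string. The sketch below is only an illustration of that alternative, with a hypothetical query and helper name; it is not the method used to build the `n_queries` feature.

```python
from urllib.parse import parse_qsl

def count_query_params(query):
    # keep_blank_values=True so that a parameter with an empty value
    # (e.g. 'ref=') still counts as a parameter
    return len(parse_qsl(query, keep_blank_values=True))

q = 'id=7&name=bob&ref='  # hypothetical query string
print(count_query_params(q))   # 3
print(q.count(';') + 1)        # 1
print(count_query_params(''))  # 0
```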

#### Fragment Lexical Features

In [21]:
# create features based on fragment section

len_fragment = []
num = []
let = []
spec = []
percent_numbers = []
percent_chars = []
percent_others = [] 
frag_masques = []
frag_entropy_result = []

for i in df.fragment:
    len_frag = len(i)
    len_fragment.append(len_frag)
    
    if len_frag > 0:
        numbers = sum(c.isdigit() for c in i)
        num.append(numbers)
        percent_numbers.append(numbers/len(i))
    
        letters = sum(c.isalpha() for c in i)
        let.append(letters)
        percent_chars.append(letters/len(i))
    
        others = len(i) - numbers - letters
        spec.append(others)
        percent_others.append(others/len(i))
    
        frag_masques.append(masque_count(i))
    
        frag_entropy_result.append(shannon(i))
    
    else:
        num.append(0)
        let.append(0)
        spec.append(0)
        percent_numbers.append(0)
        percent_chars.append(0)
        percent_others.append(0)
        frag_masques.append(0)
        frag_entropy_result.append(0)
        
df['len_frag'] = len_fragment
df['n_frag_num'] = num
df['n_frag_let'] = let
df['n_frag_spec'] = spec
df['pc_frag_num'] = percent_numbers
df['pc_frag_let'] = percent_chars
df['pc_frag_spec'] = percent_others
df['n_frag_masques'] = frag_masques
df['frag_entropy'] = frag_entropy_result

#### Housekeeping

In [22]:
# save a copy of dataframe
df.to_pickle('capstone2_withfeatures')

In [None]:
df = pd.read_pickle('capstone2_withfeatures')

In [None]:
df.head(5)

In [23]:
# drop categorical features
drop = ['url', 'scheme', 'netloc', 'path', 'params', 'query', 'fragment', 'reg_domain', 'domain_suffix', 'path_items', 'tokenized_url']
df2 = df.drop(drop, axis=1)

In [24]:
# change booleans to int
df2['is_53'] = df2['is_53'].astype(int)
df2['is_54_75'] = df2['is_54_75'].astype(int)
df2['is_76'] = df2['is_76'].astype(int)
df2['is_ip'] = df2['is_ip'].astype(int)
df2['is_top500_domain'] = df2['is_top500_domain'].astype(int)

In [25]:
# save a copy of dataframe
df2.to_pickle('capstone2_final')