# Data Preparation
**1. Data Acquisition**

    * Data used here were collected from two sources, so preparation steps were done separately
    
**2. Feature Generation based on URL string**

    * amount of dots appeared in sub domain, existence of @, using IP as the domain name, existence of soft hyphen,
      double slash, number of subdomains, etc.
    * query WHOIS Server for Host-based information
    * Making queries from WHOIS Server is very time-consuming, so this step was implemented by spliting the dataset         into three portions and ran them separately.


## Import relevant packages

In [1]:
import os
import sys
import re
import matplotlib
import pandas as pd
import numpy as np
from os.path import splitext
import ipaddress as ip
import tldextract
import whois
import datetime
from urllib.parse import urlparse

## Import Dataset

In [2]:
URLdata = pd.read_csv("../data/project_data.csv", index_col=0)
URLdata = URLdata.sample(frac=1).reset_index(drop=True)
URLdata.head()

Unnamed: 0,url,label
0,http://livelywebordainedkool.cf/harwoodservice...,1
1,amazon.com.br:80/gp/redirect.html?_encoding=UT...,0
2,usaerectionrx.com,0
3,http://ciespains.com/delivery/index.htm,1
4,http://paypalsecureresetloginpage.elifelfmarle...,1


In [3]:
URLdata.shape

(47274, 2)

## Check the dataset's Class Balance

In [4]:
#print total number of urls in the dataset
print('There are ' + str(len(URLdata)) + ' urls in the dataset.')
#percentage of malicious url
spam = len(URLdata[URLdata['label']==1])
percent_spam = spam/len(URLdata)*100
print(str(round(percent_spam,2)) + '% of urls are malicious urls.')

There are 47274 urls in the dataset.
49.6% of urls are malicious urls.


## Create Features Based on Possible Criteria

### 1. Number of Dots. appeared in Sub-Domain

In [5]:
# Method to count number of dots appeared in sub-domain
def countdots(url):  
    return url.count('.')

### 2. Check if IP address is used as an alternative of the domain name in the URL
If an IP address is used as an alternative of the domain name in the URL, such as “http://125.98.3.123/fake.html”, users can be sure that someone is trying to steal their personal information. Sometimes, the IP address is even transformed into hexadecimal code as shown in the following link “http://0x58.0xCC.0xCA.0x62/2/paypal.ca/index.html”. 

In [6]:
# Is IP addr present as the hostname, let's validate

import ipaddress as ip 

def containsip(url):
    try:
        if ip.ip_address(url):
            return 1
    except:
        return 0

### 3. Count the number of Soft Hyphens(-) present
Spammers have jumped on the little-used soft hyphen (or SHY character) to fool URL filtering devices.

In [7]:
#method to count the number of presence of soft hyphens

def CountSoftHyphen(url):
    return url.count('-')


### 4. Count the number of @ present

In [8]:
#method to check the presence of @

def CountAt(url):
    return url.count('@')

### 5. Count the number of // Double Slash present
The existence of “//” within the URL path means that the user will be redirected to another website. An example of such URL’s is: “http://www.legitimate.com//http://www.phishing.com”. We examine the presence of “//”

In [9]:
def CountDSlash(url):
    return url.count('//')

### 6. Count the number of Subdir represented by /

In [10]:
def countSubDir(url):
    return url.count('/')

### 7. Count the number of Sub-Domain

In [11]:
def countSubDomain(subdomain):
    if not subdomain:
        return 0
    else:
        return len(subdomain.split('.'))

### 8. Count Queries

In [12]:
def countQueries(query):
    if not query:
        return 0
    else:
        return len(query.split('&'))

### 9. List malicious keywords and domains

In [13]:
#top 30 most suspicious TLD and words
Suspicious_TLD=['zip','cricket','link','work','party','gq','kim','country','science','tk','download','xin','gdn',
                'racing','jetzt','stream','vip','bid','ren','load','mom','party','trade','date','wang','accountants',
               'bid','ltd','men','faith']
#trend micro's top 10 malicious domains 
Suspicious_Domain=['luckytime.co.kr','mattfoll.eu.interia.pl','trafficholder.com','dl.baixaki.com.br',
                   'bembed.redtube.comr','tags.expo9.exponential.com','deepspacer.com','funad.co.kr',
                   'trafficconverter.biz', 'alegroup.info']

### 10. Get the file extension

In [14]:
def get_ext(url):
    """Return the filename extension from url, or ''."""
    
    root, ext = splitext(url)
    return ext

### 11. URL Length

In [15]:
def length(url):
    """
    Find the length of URL without https://www. or http://www. or http:// or https://
    
    Used to correct the calculation of the length of each URL
    """
    url = str(url)
    if url[:12] == 'https://www.':
        url = url[12:]
    elif url[:11] == 'http://www.':
        url = url[11:]
    elif url[:8] == 'https://':
        url = url[8:]
    elif url[:7] == 'http://':
        url = url[7:]
    else:
        url = url
    return len(url)

In [36]:
def url_format(url):
    """
    """
    url = str(url)
    if url[:8] == 'https://' or url[:7] == 'http://':
        url = url
    else:
        url = 'http://' + url
    return url 

## Create Empty Dataframe to Store Features

In [16]:
featureSet = pd.DataFrame(columns=('url','no of dots','no of hyphen','len of url','no of at',\
'no of double slash','no of subdir','no of subdomain','len of domain','no of queries','contains IP',
                                   'presence of Suspicious_TLD',\
'presence of suspicious domain','create_age(months)','expiry_age(months)',
                                   'update_age(days)','country','file extension','label'))

In [17]:
from urllib.parse import urlparse
import tldextract

In [18]:
### check whois 
def generateFeature(df):
    for i in range(len(df)):
        try:
            # check whois API for domain documentation
            ext = tldextract.extract(df["url"].iloc[i])
            domain = '.'.join(ext[1:])
            w = whois.whois(domain)
            # if above works, generate all features
            features = getFeatures(df["url"].iloc[i], df["label"].iloc[i], w)
            featureSet.loc[i] = features
        except:
            features = getFeatures2(df["url"].iloc[i], df["label"].iloc[i])  
            featureSet.loc[i] = features

In [19]:
def getFeatures(url, label, w): 
    result = []
    url = str(url_format(url))
    
    #add the url to feature set
    result.append(url)
    
    #parse the URL and extract the domain information
    path = urlparse(url)
    ext = tldextract.extract(url)
    
    #counting number of dots in subdomain    
    result.append(countdots(ext.subdomain))
    
    #checking hyphen in domain   
    result.append(CountSoftHyphen(path.netloc))
    
    #length of URL    
    result.append(length(url))
    
    #checking @ in the url    
    result.append(CountAt(path.netloc))
    
    #checking presence of double slash    
    result.append(CountDSlash(path.path))
    
    #Count number of subdir    
    result.append(countSubDir(path.path))
    
    #number of sub domain    
    result.append(countSubDomain(ext.subdomain))
    
    #length of domain name    
    result.append(len(path.netloc))
    
    #count number of queries    
    result.append(len(path.query))
    
    #Adding domain information
    
    #if IP address is being used as a URL     
    result.append(containsip(ext.domain))
    
    #presence of Suspicious_TLD
    result.append(1 if ext.suffix in Suspicious_TLD else 0)
    
    #presence of suspicious domain
    result.append(1 if '.'.join(ext[1:]) in Suspicious_Domain else 0 )
      
    #Get domain information by asking whois
    avg_month_time=365.2425/12.0
        
    #calculate creation age in months
                  
    if w.creation_date == None or type(w.creation_date) is str :
        result.append(-1)
        
    else:
        if(type(w.creation_date) is list): 
            create_date=w.creation_date[-1]
        else:
            create_date=w.creation_date

        if(type(create_date) is datetime.datetime):
            today_date=datetime.datetime.now()
            create_age_in_mon=((today_date - create_date).days)/avg_month_time
            create_age_in_mon=round(create_age_in_mon)
            result.append(create_age_in_mon)
            
        else:
            result.append(-1)
    
    #calculate expiry age in months
                  
    if(w.expiration_date==None or type(w.expiration_date) is str):
        result.append(-1)
    else:
        if(type(w.expiration_date) is list):
            expiry_date=w.expiration_date[-1]
        else:
            expiry_date=w.expiration_date
        if(type(expiry_date) is datetime.datetime):
            today_date=datetime.datetime.now()
            expiry_age_in_mon=((expiry_date - today_date).days)/avg_month_time
            expiry_age_in_mon=round(expiry_age_in_mon)

            # appending  in months Appended to the Vector
            result.append(expiry_age_in_mon)
        else:
            # expiry date error so append -1
            result.append(-1)

    #find the age of last update
                  
    if(w.updated_date==None or type(w.updated_date) is str):
        result.append(-1)
    else:
        if(type(w.updated_date) is list):
            update_date=w.updated_date[-1]
        else:
            update_date=w.updated_date
        if(type(update_date) is datetime.datetime):
            today_date=datetime.datetime.now()
            update_age_in_days=((today_date - update_date).days)
            result.append(update_age_in_days)
            # appending updated age in days Appended to the Vector
        else:
            result.append(-1)

    
    #find the country who is hosting this domain
    if(w.country == None):
        result.append("None")
    else:
        if isinstance(w.country,str):
            result.append(w['country'])
        else:
            result.append(w['country'][0])
    
    result.append(get_ext(path.path))
    result.append(str(label))
    return result

In [20]:
#URLs without Whois information
def getFeatures2(url, label): 
    result = []
    url = str(url)
    
    #add the url to feature set
    result.append(url)
    
    #parse the URL and extract the domain information
    path = urlparse(url)
    ext = tldextract.extract(url)
    
    #counting number of dots in subdomain    
    result.append(countdots(ext.subdomain))
    
    #checking hyphen in domain   
    result.append(CountSoftHyphen(path.netloc))
    
    #length of URL    
    result.append(length(url))
    
    #checking @ in the url    
    result.append(CountAt(path.netloc))
    
    #checking presence of double slash    
    result.append(CountDSlash(path.path))
    
    #Count number of subdir    
    result.append(countSubDir(path.path))
    
    #number of sub domain    
    result.append(countSubDomain(ext.subdomain))
    
    #length of domain name    
    result.append(len(path.netloc))
    
    #count number of queries    
    result.append(len(path.query))
    
    #Adding domain information
    
    #if IP address is being used as a URL     
    result.append(containsip(ext.domain))
    
    #presence of Suspicious_TLD
    result.append(1 if ext.suffix in Suspicious_TLD else 0)
    
    #presence of suspicious domain
    result.append(1 if '.'.join(ext[1:]) in Suspicious_Domain else 0 )
    
    #append default for create_age(months)country
    result.append(-1)
    
    #append default for expiry_age(months)
    result.append(-1)
    
    #append default for update_age(days)
    result.append(-1)
    
    #append default for country
    result.append('None')
    
    #append extension
    path = urlparse(url)
    
    result.append(get_ext(path.path))
    
    #append label
    result.append(str(label))
    
    return result

In [21]:
generateFeature(URLdata)

In [22]:
featureSet

Unnamed: 0,url,no of dots,no of hyphen,len of url,no of at,no of double slash,no of subdir,no of subdomain,len of domain,no of queries,contains IP,presence of Suspicious_TLD,presence of suspicious domain,create_age(months),expiry_age(months),update_age(days),country,file extension,label
0,http://livelywebordainedkool.cf/harwoodservice...,0,0,66,0,0,3,0,24,0,0,0,0,-1,-1,-1,,.php,1
1,amazon.com.br:80/gp/redirect.html?_encoding=UT...,0,0,207,0,0,2,0,0,173,0,0,0,-1,-1,-1,,.html,0
2,usaerectionrx.com,0,0,17,0,0,0,0,0,0,0,0,0,9,3,249,UA,.com,0
3,http://ciespains.com/delivery/index.htm,0,0,32,0,0,2,0,13,0,0,0,0,1,11,0,,.htm,1
4,http://paypalsecureresetloginpage.elifelfmarle...,0,0,64,0,0,1,1,44,0,0,0,0,30,6,905,TR,,1
5,msci.com,0,0,8,0,0,0,0,0,0,0,0,0,300,0,358,US,.com,0
6,cecilmarine.com,0,0,15,0,0,0,0,0,0,0,0,0,214,2,361,US,.com,0
7,mapharma61.fr/downloader/Maged/Model/Config/ss...,0,0,148,0,0,12,0,0,0,0,0,0,91,9,417,,.html,1
8,http://agencepub.co.rw/Paypal/Account-Limited/...,0,0,101,0,0,8,0,15,0,0,0,0,-1,-1,-1,,,1
9,http://alohacomcentre.com/css/,0,0,23,0,0,2,0,18,0,0,0,0,61,-1,23,SG,,1


In [23]:
featureSet.to_csv('../data/FeatureData.csv')

In [24]:
featureSet.isnull().any()

url                              False
no of dots                       False
no of hyphen                     False
len of url                       False
no of at                         False
no of double slash               False
no of subdir                     False
no of subdomain                  False
len of domain                    False
no of queries                    False
contains IP                      False
presence of Suspicious_TLD       False
presence of suspicious domain    False
create_age(months)               False
expiry_age(months)               False
update_age(days)                 False
country                          False
file extension                   False
label                            False
dtype: bool

In [25]:
featureSet.dtypes

url                              object
no of dots                       object
no of hyphen                     object
len of url                       object
no of at                         object
no of double slash               object
no of subdir                     object
no of subdomain                  object
len of domain                    object
no of queries                    object
contains IP                      object
presence of Suspicious_TLD       object
presence of suspicious domain    object
create_age(months)               object
expiry_age(months)               object
update_age(days)                 object
country                          object
file extension                   object
label                            object
dtype: object

In [26]:
features = featureSet.columns.tolist()
features.remove('url')
features.remove('country')
features.remove('file extension')
for f in features:
    featureSet[f] = featureSet[f].astype(int)
featureSet.dtypes

url                              object
no of dots                        int64
no of hyphen                      int64
len of url                        int64
no of at                          int64
no of double slash                int64
no of subdir                      int64
no of subdomain                   int64
len of domain                     int64
no of queries                     int64
contains IP                       int64
presence of Suspicious_TLD        int64
presence of suspicious domain     int64
create_age(months)                int64
expiry_age(months)                int64
update_age(days)                  int64
country                          object
file extension                   object
label                             int64
dtype: object

In [41]:
featureSet.to_csv('../data/FeatureData.csv')