# **2.B FEATURE EXTRACTION - PHISHING URL**
Phishing URLs only

#### The objective of this notebook is to collect phishing URLs, extract the features listed below, and save the result as a CSV file.

* Lexical Features
* Whois Features
* Popularity Features

#### This project was developed in Jupyter Notebook.

In [1]:
import pandas as pd
from urllib.parse import urlparse
import re
from bs4 import BeautifulSoup
import whois
import urllib.request
import time
import socket
from urllib.error import HTTPError
from datetime import datetime

In [2]:

phishing_urls = pd.read_csv("/Datasets/Dataset1/Extract-Data-Phish/verified_url.csv")


In [3]:
phishing_urls

Unnamed: 0,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,4924425,https://cnpulp.net/EserviceMain/irs/ir/index.html,http://www.phishtank.com/phish_detail.php?phis...,2017-04-03T16:19:26+00:00,yes,2017-04-03T16:21:12+00:00,yes,Internal Revenue Service
1,4924407,http://foldinati.com/ju/aba.html,http://www.phishtank.com/phish_detail.php?phis...,2017-04-03T16:00:26+00:00,yes,2017-04-03T16:53:30+00:00,yes,Other
2,4924285,http://adminsmaintenaceroutine.000webhostapp.com/,http://www.phishtank.com/phish_detail.php?phis...,2017-04-03T15:18:59+00:00,yes,2017-04-03T16:44:35+00:00,yes,Microsoft
3,4924258,http://roadmaster.com.my/wp-content/themes/log...,http://www.phishtank.com/phish_detail.php?phis...,2017-04-03T15:16:55+00:00,yes,2017-04-03T16:24:32+00:00,yes,Other
4,4924237,http://dhalander.com.br/Atendimento.Cliente01/...,http://www.phishtank.com/phish_detail.php?phis...,2017-04-03T15:14:22+00:00,yes,2017-04-03T16:53:30+00:00,yes,Other
5,4924221,http://vseservicy.ru/wp-content/plugins/sas.php,http://www.phishtank.com/phish_detail.php?phis...,2017-04-03T14:59:08+00:00,yes,2017-04-03T16:24:32+00:00,yes,Other
6,4924213,http://ci0.co.vu/css/?bsoul=/yh/en/?i=31416&am...,http://www.phishtank.com/phish_detail.php?phis...,2017-04-03T14:50:03+00:00,yes,2017-04-03T16:24:32+00:00,yes,Other
7,4924206,http://dialaduduman.com/review/images/yah/vali...,http://www.phishtank.com/phish_detail.php?phis...,2017-04-03T14:37:54+00:00,yes,2017-04-03T15:34:05+00:00,yes,Other
8,4924203,http://milwaukeecreamcitys.org/images/april.php,http://www.phishtank.com/phish_detail.php?phis...,2017-04-03T14:28:49+00:00,yes,2017-04-03T15:01:26+00:00,yes,Other
9,4924176,http://de-paypa.lhilfe-guard.net/,http://www.phishtank.com/phish_detail.php?phis...,2017-04-03T14:18:23+00:00,yes,2017-04-03T16:57:58+00:00,yes,PayPal


In [4]:

phishing_urls = pd.read_csv("/Datasets/Dataset1/Extract-Data-Phish/online-valid.csv", usecols = ["url"] )



In [5]:
phishing_urls.shape

(14858, 1)

In [6]:
phishing_urls

Unnamed: 0,url
0,http://u1047531.cp.regruhosting.ru/acces-inges...
1,http://hoysalacreations.com/wp-content/plugins...
2,http://www.accsystemprblemhelp.site/checkpoint...
3,http://www.accsystemprblemhelp.site/login_atte...
4,https://firebasestorage.googleapis.com/v0/b/so...
...,...
14853,http://bancoestado700.blogspot.com/
14854,http://www.habbocreditosparati.blogspot.com/
14855,http://creditiperhabbogratissicuro100.blogspot...
14856,http://mundovirtualhabbo.blogspot.com/2009_01_...


In [7]:
#Collecting 6,000 Phishing URLs randomly

phishurl = phishing_urls.sample(n = 6000, random_state = 12)
phishurl = phishurl.reset_index(drop=True)
phishurl

Unnamed: 0,url
0,http://confirmprofileaccount.com/
1,http://www.marreme.com/MasterAdmin/04mop.html
2,http://modsecpaststudents.com/review/
3,https://docs.google.com/forms/d/e/1FAIpQLScL6L...
4,https://oportunidadedasemana.com/americanas//?...
...,...
5995,http://message-moncompte-labonquepostale-fr.co...
5996,http://tik.info.pl/wp-includes/Requests/Except...
5997,http://hornelink.cn/linkedIn/message/linkedIn/...
5998,https://docs.google.com/forms/d/e/1FAIpQLSfyRC...


## 2.1 Lexical Features

* URL Length 
* URL Shortening Services “TinyURL”
* URL Presence of "@" Symbol
* URL Presence of special characters : _ ? = & etc
* URL Suspicious words (security sensitive words)
* URL Digit Count
* URL Protocol Count (http / https)
* URL Dot Count
* URL Hyphen Count
* Domain presence of IP Address
* Domain presence of hyphen / prefix or Suffix
* Sub Domain and Multi Sub Domains Count
* Redirecting "//" in URL (// position)
* URL presence of EXE
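
Several items in this list (digit count, protocol count, hyphen count, suspicious words, EXE presence) are plain counts or substring tests. A minimal sketch of how they could be computed follows; the function name `count_lexical_features` and the `SUSPICIOUS_WORDS` list are assumptions made for illustration and are not part of this notebook's feature set.

```python
import re

# Hypothetical word list for illustration only; tune it for a real dataset.
SUSPICIOUS_WORDS = ("login", "verify", "secure", "account", "update", "confirm")

def count_lexical_features(url):
    """Return simple count-based lexical features for one URL."""
    lowered = url.lower()
    return {
        "digit_count": sum(ch.isdigit() for ch in url),
        "hyphen_count": url.count("-"),
        "dot_count": url.count("."),
        # Number of "http://" or "https://" tokens anywhere in the URL
        "protocol_count": len(re.findall(r"https?://", lowered)),
        # How many of the listed words occur at least once in the URL
        "suspicious_words": sum(word in lowered for word in SUSPICIOUS_WORDS),
        "has_exe": int(".exe" in lowered),
    }

print(count_lexical_features("http://secure-login.example.com/update/setup.exe?id=42"))
```

Returning raw counts rather than 0/1 flags leaves the choice of threshold to the modelling step; a binary version would only need a comparison on each value.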


In [8]:
#class FeatureExtraction:
#    def __init__(url):
#        pass

# 1. Extracts the domain from the given URL, dropping a leading "www."
def getDomain(url):
    domain = urlparse(url).netloc
    if domain.startswith("www."):
        domain = domain[len("www."):]
    return domain
    
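# 2. Checks whether the host part of the URL is an IP address (Have_IP)
import ipaddress

def ip_address(url):
    try:
        # Test only the host: a full URL string is never a valid IP address
        ipaddress.ip_address(urlparse(url).hostname or "")
        ip = 1
    except ValueError:
        ip = 0
    return ip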
    
# 3.Checks the presence of @ in URL (Have_At)
def have_at_symbol(url):
    if "@" in url:
        at = 1 
    else:
        at = 0   
    return at
    
# 4.Finding the length of URL and categorizing (URL_Length)
def long_url(url):
    if len(url) < 54:
        length = 0    
    else:
        length = 1    
    return length

# 5. Counts the non-empty path segments in the URL (URL_Depth)
def getDepth(url):
    segments = urlparse(url).path.split('/')
    return sum(1 for segment in segments if segment)
        
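# 6. Checking for redirection '//' in the URL (Redirection)
def redirection(url):
    # In "http://" the '//' starts at index 5 and in "https://" at index 6,
    # so a last '//' found beyond index 7 lies outside the protocol prefix.
    pos = url.rfind('//')
    return 1 if pos > 7 else 0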
    
# 7. Existence of an "http"/"https" token in the domain part of the URL (https_Domain)
def httpDomain(url):
    domain = urlparse(url).netloc
    # "https" contains "http", so one substring test covers both tokens
    if 'http' in domain:
        return 1
    else:
        return 0

    
# 8. Checking for Shortening Services in URL (Tiny_URL)
# The search is unanchored and runs over the whole URL, so a short pattern such as
# t\.co also matches inside longer names (for example "account.com").
def shortening_service(url):
    match = re.search(r'bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
                    r'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
                    r'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
                    r'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
                    r'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
                    r'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
                    r'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|'
                    r'tr\.im|link\.zip\.net', url)
    if match:
        return 1               # phishing
    else:
        return 0               # legitimate
    
    
    
    
# 9.Checking for Prefix or Suffix Separated by (-) in the Domain (Prefix/Suffix)     
def prefix_suffix_separation(url):
    if "-" in urlparse(url).netloc:
        return 1            # phishing
    else:
        return 0            # legitimate
    
# 10. DNS Record (dns_record): set inside featureExtraction, 1 if the WHOIS lookup fails

    
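# 11. Web traffic (Web_Traffic)
# Note: Alexa Internet was shut down in 2022, so this request may no longer succeed;
# a failed request falls through to the except branch.
def web_traffic(url):
    try:
        url = urllib.parse.quote(url)
        rank = BeautifulSoup(urllib.request.urlopen("http://data.alexa.com/data?cli=10&dat=s&url=" + url).read(), "xml").find(
        "REACH")['RANK']
        rank = int(rank)
    except (TypeError, ValueError, OSError):
        # No rank in the response, or the request itself failed
        return 1
    if rank <100000:
        return 1
    else:
        return 0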
        
# 12.Survival time of domain: The difference between termination time and creation time (Domain_Age)  
def domainAge(domain_name):
    creation_date = domain_name.creation_date
    expiration_date = domain_name.expiration_date
    if (isinstance(creation_date,str) or isinstance(expiration_date,str)):
        try:
            creation_date = datetime.strptime(creation_date,'%Y-%m-%d')
            expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
        except:
            return 1
    if ((expiration_date is None) or (creation_date is None)):
        return 1
    elif ((type(expiration_date) is list) or (type(creation_date) is list)):
        return 1
    else:
        ageofdomain = abs((expiration_date - creation_date).days)
        if ((ageofdomain/30) < 6):
            age = 1
        else:
            age = 0
    return age

# 13.End time of domain: The difference between termination time and current time (Domain_End) 
def domainEnd(domain_name):
    expiration_date = domain_name.expiration_date
    if isinstance(expiration_date,str):
        try:
            expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
        except:
            return 1
    if (expiration_date is None):
        return 1
    elif (type(expiration_date) is list):
        return 1
    else:
        today = datetime.now()
        end = abs((expiration_date - today).days)
    if ((end/30) < 6):
        end = 0
    else:
        end = 1
    return end

# 14. Dot count (dot_count)
def dot_count(url):
    if url.count(".") < 3:
        return 0            # legitimate
    else:
        return 1            # suspicious (exactly 3 dots) or phishing (more than 3)
        
    
# 15. Special characters count (specialchar_count)
def specialcharCount(url):
    cnt = 0
    # Single characters only: the loop compares one character at a time
    special_characters = [';','+','_','?','=','&','[',']','/',':']
    for each_letter in url:
        if each_letter in special_characters:
            cnt = cnt + 1
    return cnt


# 16. Subdomain count (subdom_count)
def subdomCount(url):

    # separate protocol and domain then count the number of dots in domain
    
    domain = url.split("//")[-1].split("/")[0].split("www.")[-1]
    if(domain.count('.')<=1):
        return 0
    else:
        return 1
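
The Have_IP feature looks for an IP address used in place of a domain name. A minimal sketch, assuming the check is meant to apply to the host part of the URL only; `host_is_ip` is a hypothetical name and not one of the notebook's functions.

```python
import ipaddress
from urllib.parse import urlparse

def host_is_ip(url):
    """Return 1 if the URL's host is an IPv4 or IPv6 literal, else 0."""
    try:
        # .hostname drops the port, any userinfo, and IPv6 brackets
        host = urlparse(url).hostname
        ipaddress.ip_address(host or "")
        return 1
    except ValueError:
        return 0

for sample in ("http://192.168.10.5/login.php",
               "http://[2001:db8::1]:8080/index.html",
               "http://confirmprofileaccount.com/"):
    print(sample, host_is_ip(sample))
```

Passing the whole URL to `ipaddress.ip_address` would always raise `ValueError`, because the scheme and path are not part of an address; isolating the host first is what makes the test meaningful.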

In [10]:
#Function to extract features
def featureExtraction(url,label):
    
    features = []
  # Address bar based features (9)
    features.append(getDomain(url))
    features.append(ip_address(url))
    features.append(have_at_symbol(url))
    features.append(long_url(url))
    features.append(getDepth(url))
    features.append(redirection(url))
    features.append(httpDomain(url))
    features.append(shortening_service(url))
    features.append(prefix_suffix_separation(url))
  
  #Domain based features (4)
    dns = 0
    try:
        domain_name = whois.whois(urlparse(url).netloc)
    except:
        dns = 1
        
    features.append(dns)
    features.append(web_traffic(url))
    features.append(1 if dns == 1 else domainAge(domain_name))
    features.append(1 if dns == 1 else domainEnd(domain_name))
    
  # Count based lexical features (3)
    features.append(dot_count(url))
    features.append(specialcharCount(url))
    features.append(subdomCount(url))
    

    
    
    features.append(label)
    
    
    return features
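
The domain-based features depend on network lookups that can fail or hang for individual URLs. The sketch below shows one possible way to keep a long run going; `safe_feature` and `always_fails` are hypothetical helpers, and the default of 1 is an assumption that mirrors how this notebook already scores failed lookups.

```python
import socket

def safe_feature(func, *args, default=1):
    """Call func(*args); on any failure report it and return `default`."""
    try:
        return func(*args)
    except Exception as exc:  # WHOIS and HTTP clients raise many different error types
        print(f"{func.__name__} failed: {exc}")
        return default

# Applies to sockets created afterwards that do not set their own timeout,
# so a single unresponsive host cannot block the loop indefinitely.
socket.setdefaulttimeout(10)

def always_fails(url):
    # Stand-in for a network lookup that cannot resolve its host
    raise OSError("getaddrinfo failed")

print(safe_feature(always_fails, "http://example.invalid/"))
print(safe_feature(len, "http://example.com/"))
```

Catching `Exception` broadly is a deliberate trade-off here: it hides programming errors as well as network errors, so the printed message is kept to make failures visible.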

In [11]:
feature_names = ['domain', 'ip_present', 'at_present', 'url_length', 'url_depth','redirection', 
                      'https_domain', 'short_url', 'prefix/suffix', 'dns_record', 'web_traffic', 
                      'domain_age', 'domain_end', 'dot_count', 'specialchar_count','subdom_count', 'label']

label = 1

In [12]:
# Extracting the features & storing them in a list
# Lexical Features

# starting time
start_time = time.time()
print('\n')
print('Begin feature extraction for phishing dataset.... \n')

##===================================##


#Extracting the features & storing them in a list
phish_features = []
rows = len(phishurl['url'])
label = 1

for i in range(0, rows):
    url = phishurl['url'][i]
    print(i)
    print(url)
    
    
    phish_features.append(featureExtraction(url,label))

    
##===================================##

elapsed = time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time))
print('\n')
print(f"Runtime: Feature Extraction for phishing dataset took:  {elapsed}")


print('\n\n\n\n')
print("***Phishing Features")




Begin feature extraction for phishing dataset.... 

0
http://confirmprofileaccount.com/
1
http://www.marreme.com/MasterAdmin/04mop.html
2
http://modsecpaststudents.com/review/
3
https://docs.google.com/forms/d/e/1FAIpQLScL6L9TaPWaz0nJqHOBCupc-iHfQPWYVeKqZdHklbfgiVTy_Q/viewform
4
https://oportunidadedasemana.com/americanas//?samsung-un50nu7100-tv-led-50-smart-tv-4k-uhd-3hdmi-2usb-preto-nas-americanas&amp;skullid=195433942&amp;cart=MTk1NDMzOTQy
5
https://villashippingtradingpv-my.sharepoint.com/personal/vam_station_flyme_mv/_layouts/15/guestaccess.aspx?guestaccesstoken=SM5QC7f1yqReaI4g9DT2aPYu7luYAyPPMBtOrhDBKbc%3d&amp;docid=1_10c3ef135ace74ea6afaed6ad75fab3bc&amp;wdFormId=%7B9A841E3B%2D98C8%2D48BF%2DBDD0%2DF64979E54640%7D
6
https://www.ikonikcommercialgroup.com/wp-includes/ID3/783498274001/sc0/adapter2adapter.ping.html
7
https://fareast.qa/wp/onedrive/7a0f492752e89cdbc505ab0148975f3e
Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed
8
https://kassa.c
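
The runtime in this cell is formatted with `time.strftime("%H:%M:%S", time.gmtime(...))`, which keeps only the time-of-day part and therefore wraps around after 24 hours. For long extraction runs, a small sketch using `datetime.timedelta` keeps the day count; `format_elapsed` is a hypothetical helper, not part of the notebook.

```python
import time
from datetime import timedelta

def format_elapsed(seconds):
    """Format a duration in seconds without wrapping at 24 hours."""
    return str(timedelta(seconds=int(seconds)))

one_day_plus = 24 * 3600 + 3661   # 1 day, 1 hour, 1 minute, 1 second

print(time.strftime("%H:%M:%S", time.gmtime(one_day_plus)))   # 01:01:01 (the day is lost)
print(format_elapsed(one_day_plus))                           # 1 day, 1:01:01
```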

In [13]:
#Converting the list to dataframe

phishing = pd.DataFrame(phish_features, columns= feature_names)
phishing.head()

Unnamed: 0,domain,ip_present,at_present,url_length,url_depth,redirection,https_domain,short_url,prefix/suffix,dns_record,web_traffic,domain_age,domain_end,dot_count,specialchar_count,subdom_count,label
0,confirmprofileaccount.com,0,0,0,0,0,0,1,0,1,1,1,1,0,4,0,1
1,marreme.com,0,0,0,2,0,0,0,0,0,1,1,1,1,5,0,1
2,modsecpaststudents.com,0,0,0,1,0,0,0,0,0,1,1,1,0,5,0,1
3,docs.google.com,0,0,1,5,0,0,0,0,0,1,1,1,0,9,1,1
4,oportunidadedasemana.com,0,0,1,1,1,0,0,0,0,1,0,0,0,13,0,1


In [14]:
# Storing the extracted phishing URL features to a CSV file

phishing.to_csv('/Datasets/Dataset1/Creating-data/phish_updated.csv', index= False)
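
After saving, a quick read-back can confirm the header and the row count. The sketch below uses only the standard `csv` module and a temporary stand-in file so that it runs anywhere; `check_csv` and the sample rows are hypothetical, and in the notebook the path would be the CSV file written by the cell above.

```python
import csv
import os
import tempfile

def check_csv(path, expected_columns):
    """Return the number of data rows after verifying the header and row widths."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        if header != list(expected_columns):
            raise ValueError(f"unexpected header: {header}")
        count = 0
        for row in reader:
            if len(row) != len(header):
                raise ValueError(f"row {count} has {len(row)} fields")
            count += 1
    return count

# Tiny stand-in file using three of the notebook's column names
columns = ["domain", "url_length", "label"]
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="",
                                 delete=False, encoding="utf-8") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(columns)
    writer.writerows([["confirmprofileaccount.com", 0, 1], ["marreme.com", 0, 1]])

print(check_csv(tmp.name, columns))   # 2
os.remove(tmp.name)
```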
  