# **2.B FEATURE EXTRACTION - PHISHING URL**
Phishing URLs only

#### The objective of this notebook is to extract features from the collected phishing URLs and save them as a CSV file.

* Lexical Features
* Whois Features
* Popularity Features

#### This project was developed in Jupyter Notebook.

In [1]:
import pandas as pd
from urllib.parse import urlparse
import re
from bs4 import BeautifulSoup
import whois
import urllib.request
import time
import socket
from urllib.error import HTTPError
from datetime import datetime

In [2]:

phishing_urls = pd.read_csv("/Users/jillkathleen/Desktop/Phishing-Analysis-Detection/Back-End/Feature-Extraction-ntbk/FE-more-data/verified_phishtank.csv", usecols = ["url"])


In [3]:
phishing_urls

Unnamed: 0,url
0,https://smtdcer-de.com/
1,https://aumzon.wpgomr2.shop/
2,https://amaumou.26hiuuh.top/
3,https://aonzon.co.ip.0ik5w9.cn/pc/
4,https://aonzon.co.ip.0ik5w9.cn/mobile/
...,...
10448,http://gkjx168.com/images
10449,http://www.habbocreditosparati.blogspot.com/
10450,http://creditiperhabbogratissicuro100.blogspot...
10451,http://mundovirtualhabbo.blogspot.com/2009_01_...


In [4]:
#phishing = phishing_urls[['url']]

#phishing

#Randomly select data
#phishing_urls = phishing_urls.sample(n=1500, replace=False)

#Collecting 1,500 Phishing URLs randomly

phishurl = phishing_urls.sample(n = 1500, random_state = 12, replace=False)
phishurl = phishurl.reset_index(drop=True)
phishurl.head()

Unnamed: 0,url
0,https://jbshtl.secure52serv.com/receipt/secure...
1,http://lazarus.co.zw/dft/PDFFILE/index.html
2,http://drivingschoolglasgow.co.uk/bt/systemght...
3,https://rektuen.qsyljs.com/member/RakutenPC.html
4,https://hairahsweetcakes.com/d.php
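
The sampling step above relies on `random_state` to make the 1,500-row draw repeatable. A minimal sketch of that behaviour, assuming a small made-up frame (`demo_urls` and its `example-N.test` URLs are hypothetical stand-ins, not PhishTank data):

```python
import pandas as pd

# Hypothetical stand-in for the PhishTank frame used in this notebook
demo_urls = pd.DataFrame({"url": [f"http://example-{i}.test/" for i in range(100)]})

# The same random_state yields the same rows, so the sample is reproducible
sample_a = demo_urls.sample(n=10, random_state=12, replace=False).reset_index(drop=True)
sample_b = demo_urls.sample(n=10, random_state=12, replace=False).reset_index(drop=True)

same_rows = sample_a.equals(sample_b)          # identical draws
no_duplicates = sample_a["url"].is_unique      # replace=False never repeats a row
```

Changing `random_state` (or omitting it) would give a different subset on each run, which makes later results harder to compare.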


In [5]:
phishurl

Unnamed: 0,url
0,https://jbshtl.secure52serv.com/receipt/secure...
1,http://lazarus.co.zw/dft/PDFFILE/index.html
2,http://drivingschoolglasgow.co.uk/bt/systemght...
3,https://rektuen.qsyljs.com/member/RakutenPC.html
4,https://hairahsweetcakes.com/d.php
...,...
1495,https://fb-page-confirm-1000001234762781615600...
1496,https://2na.io/actualizar
1497,http://ekabel.hu/CN/en.php
1498,https://arislm.com/wo-teng/B/


## 2.1 Lexical Features

* URL Length 
* URL Shortening Services “TinyURL”
* URL Presence of "@" Symbol
* URL Presence of special characters : _ ? = & etc
* URL Suspicious words (security sensitive words)
* URL Digit Count
* URL Protocol Count (http / https)
* URL Dot Count
* URL Hyphen Count
* Domain presence of IP Address
* Domain presence of hyphen / prefix or Suffix
* Sub Domain and Multi Sub Domains Count
* Redirecting "//" in URL (// position)
* URL presence of EXE
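
Several of the listed features are plain counts or membership tests on the URL string. The following is a minimal sketch of four of them (digit count, hyphen count, suspicious words, EXE presence). The function names and the `SUSPICIOUS_WORDS` list are assumptions made for illustration and are not defined elsewhere in this notebook:

```python
from urllib.parse import urlparse

# Hypothetical word list for illustration only; not taken from this notebook
SUSPICIOUS_WORDS = ("login", "secure", "account", "update", "verify", "signin", "banking")

def digit_count(url):
    """Number of digit characters anywhere in the URL."""
    return sum(ch.isdigit() for ch in url)

def hyphen_count(url):
    """Number of '-' characters anywhere in the URL."""
    return url.count("-")

def suspicious_word_count(url):
    """How many of the words in SUSPICIOUS_WORDS occur in the URL (case-insensitive)."""
    lowered = url.lower()
    return sum(1 for word in SUSPICIOUS_WORDS if word in lowered)

def has_exe(url):
    """1 if the URL path ends in .exe, else 0."""
    return 1 if urlparse(url).path.lower().endswith(".exe") else 0

example = "http://02-billing-support.org/l"
example_counts = (digit_count(example), hyphen_count(example))
```

Raw counts like these can be kept as integers or thresholded into 0/1 flags, depending on how the downstream model is expected to use them.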


In [6]:
#class FeatureExtraction:
#    def __init__(url):
#        pass

# 1.Extracts domain from the given URL
def getDomain(url):
    domain = urlparse(url).netloc
    # Strip only a leading "www." (the dot is escaped so it matches a literal '.')
    if re.match(r"^www\.", domain):
        domain = domain[4:]
    return domain
    
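# 2.Checks for IP address in URL (Have_IP)
import ipaddress  # standard library; required by ip_address() below

def ip_address(url):
    # Test the host part only: a full URL string is never a valid IP address
    try:
        host = urlparse(url).hostname
        ipaddress.ip_address(host)
        ip = 1
    except (ValueError, TypeError):
        ip = 0
    return ip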
    
# 3.Checks the presence of @ in URL (Have_At)
def have_at_symbol(url):
    if "@" in url:
        at = 1 
    else:
        at = 0   
    return at
    
# 4.Finding the length of URL and categorizing (URL_Length)
def long_url(url):
    if len(url) < 54:
        length = 0    
    else:
        length = 1    
    return length

# 5.Gives number of '/' in URL (URL_Depth)
def getDepth(url):
    s = urlparse(url).path.split('/')
    depth = 0
    for j in range(len(s)):
        if len(s[j]) != 0:
            depth = depth+1
    return depth
        
# 6.Checking for redirection '//' in the url (Redirection)
def redirection(url):
    # The scheme separator '//' starts at index 5 (http) or 6 (https);
    # a last '//' beyond index 7 is an extra one inside the URL.
    pos = url.rfind('//')
    return 1 if pos > 7 else 0
    
# 7.Existence of “HTTPS” Token in the Domain Part of the URL (https_Domain)
def httpDomain(url):
    domain = urlparse(url).netloc
    # netloc never contains "://", so test for the bare "http"/"https" token
    if 'http' in domain.lower():
        return 1
    else:
        return 0

    
# 8. Checking for Shortening Services in URL (Tiny_URL) 
def shortening_service(url):
    match = re.search(r'bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
                    r'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
                    r'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
                    r'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
                    r'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
                    r'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
                    r'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|'
                    r'tr\.im|link\.zip\.net', url)
    if match:
        return 1               # phishing
    else:
        return 0               # legitimate
    
    
    
    
# 9.Checking for Prefix or Suffix Separated by (-) in the Domain (Prefix/Suffix)     
def prefix_suffix_separation(url):
    if "-" in urlparse(url).netloc:
        return 1            # phishing
    else:
        return 0            # legitimate
    
# 10. DNS Record 

    
# 11.Web traffic (Web_Traffic): returns 1 when no Alexa rank is found or the rank is below 100000, else 0
def web_traffic(url):
    try:
        url = urllib.parse.quote(url)
        rank = BeautifulSoup(urllib.request.urlopen("http://data.alexa.com/data?cli=10&dat=s&url=" + url).read(), "xml").find(
        "REACH")['RANK']
        rank = int(rank)
    except TypeError:
        return 1
    if rank <100000:
        return 1
    else:
        return 0
        
# 12.Survival time of domain: The difference between termination time and creation time (Domain_Age)  
def domainAge(domain_name):
    creation_date = domain_name.creation_date
    expiration_date = domain_name.expiration_date
    if (isinstance(creation_date,str) or isinstance(expiration_date,str)):
        try:
            creation_date = datetime.strptime(creation_date,'%Y-%m-%d')
            expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
        except:
            return 1
    if ((expiration_date is None) or (creation_date is None)):
        return 1
    elif ((type(expiration_date) is list) or (type(creation_date) is list)):
        return 1
    else:
        ageofdomain = abs((expiration_date - creation_date).days)
        if ((ageofdomain/30) < 6):
            age = 1
        else:
            age = 0
    return age

# 13.End time of domain: The difference between termination time and current time (Domain_End) 
def domainEnd(domain_name):
    expiration_date = domain_name.expiration_date
    if isinstance(expiration_date,str):
        try:
            expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
        except:
            return 1
    if (expiration_date is None):
        return 1
    elif (type(expiration_date) is list):
        return 1
    else:
        today = datetime.now()
        end = abs((expiration_date - today).days)
    if ((end/30) < 6):
        end = 0
    else:
        end = 1
    return end

# 14. Dot count
def dot_count(url):
    if url.count(".") < 3:
        return 0            # legitimate
    elif url.count(".") == 3:
        return 1            # suspicious
    else:
        return 1            # phishing
        
    
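# 15. Special characters count
def specialcharCount(url):
    cnt = 0
    # Single characters only: the loop compares one character at a time,
    # so a two-character entry such as '+=' could never match.
    special_characters = [';','_','?','=','&','[',']','/',':']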
    for each_letter in url:
        if each_letter in special_characters:
            cnt = cnt + 1
    return cnt


# 16. Sub-domain count (subdom_count): 1 if the domain has more than one dot, else 0
def subdomCount(url):
    # separate protocol and domain then count the number of dots in domain
    domain = url.split("//")[-1].split("/")[0].split("www.")[-1]
    if domain.count('.') <= 1:
        return 0
    else:
        return 1
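
Most of the functions above depend on how `urlparse` splits a URL into parts. A minimal sketch of that split, using one URL from this notebook's extraction log (the variable names are chosen here for illustration):

```python
from urllib.parse import urlparse

# URL taken from the extraction log in this notebook
url = "https://willingsd.no:443/FBG/"
parts = urlparse(url)

scheme = parts.scheme        # "https"
netloc = parts.netloc        # "willingsd.no:443" (keeps the port)
hostname = parts.hostname    # "willingsd.no" (port removed, lower-cased)
path = parts.path            # "/FBG/"

# Same counting rule as getDepth: non-empty path segments only
depth = len([segment for segment in path.split("/") if segment])

# Start index of the last "//"; here it is the scheme separator at index 6
last_double_slash = url.rfind("//")
```

Note that `netloc` can carry a port (and userinfo), so host-level checks are usually safer on `hostname`.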

In [7]:
#Function to extract features
def featureExtraction(url,label):
    
    features = []
    # Address bar based features (9)
    features.append(getDomain(url))
    features.append(ip_address(url))
    features.append(have_at_symbol(url))
    features.append(long_url(url))
    features.append(getDepth(url))
    features.append(redirection(url))
    features.append(httpDomain(url))
    features.append(shortening_service(url))
    features.append(prefix_suffix_separation(url))

    # Domain based features (4)
    dns = 0
    try:
        domain_name = whois.whois(urlparse(url).netloc)
    except Exception:
        # WHOIS lookup failed: treat as "no DNS record"
        dns = 1

    features.append(dns)
    features.append(web_traffic(url))
    features.append(1 if dns == 1 else domainAge(domain_name))
    features.append(1 if dns == 1 else domainEnd(domain_name))

    # Additional lexical counts (3)
    features.append(dot_count(url))
    features.append(specialcharCount(url))
    features.append(subdomCount(url))

    features.append(label)
    return features

In [8]:
feature_names = ['domain', 'ip_present', 'at_present', 'url_length', 'url_depth','redirection', 
                      'https_domain', 'short_url', 'prefix/suffix', 'dns_record', 'web_traffic', 
                      'domain_age', 'domain_end', 'dot_count', 'specialchar_count','subdom_count', 'label']

label = 1
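
The column names above must stay in the same order as the values appended by `featureExtraction`, otherwise columns are silently mislabeled. One way to guard against that, sketched here under assumptions, is to build each row as a dict and order it by the name list. `DEMO_FEATURE_NAMES` and `extract_row` are hypothetical and cover only three of the notebook's features:

```python
from urllib.parse import urlparse

# Hypothetical, shortened name list for illustration
DEMO_FEATURE_NAMES = ["domain", "at_present", "url_length", "label"]

def extract_row(url, label):
    """Build a feature row keyed by name, then order it by DEMO_FEATURE_NAMES."""
    row = {
        "domain": urlparse(url).netloc,
        "at_present": int("@" in url),
        "url_length": int(len(url) >= 54),   # same threshold as long_url above
        "label": label,
    }
    # A KeyError here means a name in the list has no matching feature
    return [row[name] for name in DEMO_FEATURE_NAMES]

demo_row = extract_row("http://bit.do/fRCSY", 1)
```

With this layout, reordering or renaming a column only requires changing the name list, and a missing feature fails loudly instead of shifting values into the wrong column.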

In [10]:
# Extracting the features & storing them in a list
# Lexical Features

# starting time
start_time = time.time()
print('\n')
print('Begin feature extraction for phishing dataset.... \n')

##===================================##


#Extracting the features & storing them in a list
phish_features = []
rows = len(phishurl['url'])
label = 1

for i in range(0, rows):
    url = phishurl['url'][i]
    print(i)
    print(url)
    
    
    phish_features.append(featureExtraction(url,label))

    
##===================================##

elapsed = time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time))
print('\n')
print(f"Runtime: Feature Extraction for phishing dataset took:  {elapsed}")


print('\n\n\n\n')
print("***Phishing Features")




Begin feature extraction for phishing dataset.... 

0
https://jbshtl.secure52serv.com/receipt/secureNetflix/07fa0b9102bc9bd33ebf592e514a1a83
1
http://lazarus.co.zw/dft/PDFFILE/index.html
2
http://drivingschoolglasgow.co.uk/bt/systemghtdate/upgjtdauth/supphjtoe/ebthl
3
https://rektuen.qsyljs.com/member/RakutenPC.html
4
https://hairahsweetcakes.com/d.php
5
https://docs.google.com/forms/d/e/1FAIpQLSdZRUbBm3NrQDZjs6q7PhUTjgMN-dM8zquPhjg9Ge31Q7BdHg/viewform
6
http://procesart.com/plugins/to/TO/authorize_client_id:1h0aw69g-xzok-drxc-4cuk-wt70lhonb5d3_rbsgp6aiuehk89vqj1027ocnxw4ftmzl5yd3v1jz5xlik9r0o6p4teyhwfcngdm7qu8b23salzce6013dkjthoqa2ib94s5gnfrmu8w7pyxv?data=Y3B0Y3NAcnR0LmNvLnph
7
https://lnkd.in/dvewNPt
8
http://bt-service.yolasite.com/
9
http://palvetif.000webhostapp.com/
10
https://urlz.fr/fIxZ
11
https://re-direct-me.com/?u=Dsjy
12
http://lindyandfriends.net/online/pdf/pdf-/pdf-/pdf
13
https://sites.google.com/view/asdfghjklhgfdsdfgh/home
14
https://sites.google.com/view/gfgherbviu

84
http://utilizzamps.com
85
https://www.banking.suncoastcreditunion.com.mfa.index.zhinbeauty.com/
86
https://www.posten-post.it/no/p/e4ce4c160ce0fb31af2818d35c3e34eb/minside/?view=login&amp;appIdKey=fcd00c0656cc490&amp;country=
87
https://rsdxpjtgqrzlacfootsbzolaqh-dot-gowa11111111.rj.r.appspot.com/
88
https://paiement.aps-lbc.tech/
Error trying to connect to socket: closing socket
89
http://albel.intnet.mu/File/
90
https://inpost-order.pl-id30345773.xyz
91
http://blog.tienghanthaybeo.com/wp-content/uploads/2021/templ/index.html
92
http://m25g.22web.org/gl
93
https://switch.com.kw/.si/
94
https://sites.google.com/site/clamingidentify008754501/
95
http://murderarchives.com.au/
96
https://12a1j.trk.elasticemail.com/tracking/click?d=vmSxXbIXG-cRkFl_AchPvjlUNFae1XLLbCtS_agx7_NNgYzANqvOAnK7hw7ut6FtAvPZMSo-s66YutdtIsjCORhgOIoRusGTYJWvLhzMWtrutyZsbscFuQAKc5VHYiYNlA2
97
https://lookdigital.co/rezz/Inv/index.php
98
http://sbi.mx/page/41/786/accountSummary.php
99
http://02-billing-support.org/l

194
http://kartaltepespor.com/common/chem/
195
https://secure-monitor.com/basic.php?k=58e4507759bd9e387a08d6d0674c3090b26dc157
196
http://www5.sndc-crad-nem-inedx.rlfrjwgf.cn/
197
https://btupgradingsupport12.mystrikingly.com/
198
http://www.rondelbarrilito.com/filedocu/xx/
199
https://clouddoc-authorize.firebaseapp.com/.xx./xx...x...xx
200
https://firebasestorage.googleapis.com/v0/b/achproject509353-i353-3ih5f-10.appspot.com/o/achbf-vye-ur-g8%252Fbv-ebry-8g%252Fbf-vye-ur-g8%252Fbv-ebry-8g%25%40FAbf-vye-ur-g8%252Fbv-ebry-8g10.html?alt=media&amp;token=cf886132-ee55-43e8-9d0f-a6dbb7ba590a#
201
https://clouddoc-authorize.firebaseapp.com/......xx.../...xx
202
http://sunbeltmembers.com/dev/images/Billing/update/amazon/eb050c42adb6dc03ca76
203
http://jifxrefamtafjyefceovjqzstf-dot-gowa11111111.rj.r.appspot.com/
204
http://unify.appbox.biz/multipage/education/assets/user/
205
https://sites.google.com/view/btbusinesssecures/btconnect
206
http://archost.net.au/mkb.hu/mkb/05d31199bf26e1efb263a6a

296
http://nonstop-ks.com/netpeak/black/excelnonauto/excelnonauto/z/enc/enc.php?email=samillind@daum.net&amp;.rand=13vqcr8bp0gud&amp;lc=1033&amp;id=64855&amp;mkt=en-us&amp;cbcxt=mai&amp;snsc=1
297
http://caracasmateriais.blogspot.com/
298
https://replug.link/6a5f2b30?userid=
299
http://amazon-co-jp.tk-help.work/
Error trying to connect to socket: closing socket
300
https://sofe-firma.firebaseapp.com/
301
http://guidizontech.com/notions/tline/index.html
302
https://goo.gl/YOzuaz
303
http://chat-whatsapp.vizvaz.com
304
https://tech-365.eu/DMqcrBLPAKPyqTdt/
305
http://timeenigma.com#ggradnigo@prepaidlegal.com
306
http://timeline.fbcom-81b4xejpx4.kets.sd/connect.html?timelimit=49e08b7aa960f359b49a312104ff9ce0
Error trying to connect to socket: closing socket
307
https://anisyohana.my/wp1/block/
Error trying to connect to socket: closing socket
308
http://www.equalchances.org/net/page/
309
http://clayheart.com/wp-content/plugins/jjkkii
310
http://ftp.lesterandco.com/palermo/sections.js
311


403
https://j6a656akodf.typeform.com/to/SOZXqark
404
http://habibassociatesbd.com/Filesty/stewart/
405
https://poczta-order.pl-id87774130.xyz
406
http://cnl.snprobbx.pbz.r.de.a2ip.ru/
407
http://reconfirmsite.click?Facebook-Security
Error trying to connect to socket: closing socket
408
http://www.servlices.runescape.com-ov.ru/
409
http://ideonpackaging-my.sharepoint.com/personal/chasem_ideonpackaging_com/_layouts/15/doc.aspx?sourcedoc={efe0bf70-91fe-4df4-9b7f-a1f8f457789a}&amp;action=default&amp;slrid=54094d9f-d083-a000-8e05-3d2cf3964fda&amp;originalpath=ahr0chm6ly9pzgvvbnbhy2thz2luzy1tes5zagfyzxbvaw50lmnvbs86bzovcc9jagfzzw0vrw5dxzrpxy1rzljobtmtac1qulhlsm9cv0xqz2jfq0gtq3lnmu5eodvftfz0dz9ydgltzt02n0o1ehhqcjewzw&amp;cid=d0584eb7-b94e-4984-b42d-e13b1f82defd
410
https://jaccsivr.vmenu.jp/
411
https://sites.google.com/view/audio-call-net/home
412
http://www.igricekonzole.com/main
413
http://webgaliciaonlinew0.mipropia.com/?i=1
414
https://bit.ly/3wb6m3I
415
https://www.forupsite.com/landing

504
https://hm.ru/kfxhxb
505
http://dtrexx.com.ng/comm
Error trying to connect to socket: closing socket
506
https://urlz.fr/ekkz
507
http://www.im-creator.com/viewer/vbid-fa0f29d5-fpsjmms8
508
https://vzrew.creatorlink.net/
509
https://franckpilier23-my-cheetah-website-copy.cheetah.builderall.com/
510
https://fnzskhxpymppkjqtzljqayrjgf-dot-gowa11111111.rj.r.appspot.com/
511
https://ebay.payment-issues-help.com/login.php?sslchannel=true&loggedin=false:sessionid=09009384925314118090
512
https://docomoidhe.duckdns.org/index.html
513
https://re-redirection-acc-id923872635122.blogspot.com/
514
https://dev-www.orlenpaczka.ce5.pl
515
https://smdc-crab.com-mom-inbox.joizyoa.cn/
516
http://unitib.com/HU/LorincMeszarosNews
517
https://firebasestorage.googleapis.com/v0/b/gsdffdwatdfwdadddadsgd.appspot.com/o/!%23%24%40%26buli%24!%40%26!%40%23%24!%26.html?alt=media&amp;token=110228a1-3566-41ef-b241-427ad3b25a9f#aaronfredricks@legalshield.com
518
http://online2banking.byethost6.com/
Error trying to

610
http://gregmounsey.com/gmwp/document-view/
611
http://www.hasadom1.com/2lmrw6m/mgn672c/
612
http://www.paypay-en.xyz/
613
http://edje.com/cmspagephotos/80-dj774h8dFY/Valid
614
https://rn3pc9bqfbv.typeform.com/to/IR5ouqS0
615
http://clouddoc-authorize.firebaseapp.com/.........
616
http://michel.hyperphp.com/
617
http://unrecognisedpayeerequest.com/lloyds
618
https://sites.google.com/view/transect/
619
http://www.mky.com/images/sa/mostroliupiu/aviamicaspoilerturce/comisvoiajoeusa/cefacacutiz/spoialaalu/mucan/fata.html?MASDDASBNASDBNSDABNSD=SDABDSANBSDABSDAVBASD6SADNMBASDMNASDBNDASBASDNASD7ASDBNASDVBDASVAVBAS8ASMNDBAMNBDASMNSDA5ASADNSBBSDANBSDANBDBASMNBD
620
https://itsmdshahin.github.io/facebook/
621
https://ahartlawnscaping.com/DC
622
http://www-cursosdigitalesmx-com.filesusr.com/html/3e0bdd_e8b0b5ae4dc3befcc395d02342c163a1.html
623
http://referral.hosannahfministry.com.ng/wp-content/themes/colibri-wp/template-parts/front-header/buttons/index.html?GHVthXseZzEERDXftGVYBHUnjIMjnKBjFCx

707
http://s.id/zRg9B
708
http://donedealprojects.com/comx1
709
https://joyarrington.com/plugins/content/contact/itiu.cm.ac.lsn/log1.php?id=a13aec5a482271437c31eff14b6e6539a13aec5a482271437c31eff14b6e6539&amp;session=a13aec5a482271437c31eff14b6e6539a13aec5a482271437c31eff14b6e6539
710
http://nabagejec1893.blogspot.sg/
711
https://amoueaom-cc-jp.aichar.cn/
712
https://bhavin0077.github.io/Netflix/
713
http://cv.kredit24.com/dau9tJdL11/nqB1G
714
https://biolinky.co/m4glacier
715
https://superbahisim1.blogspot.com/
716
https://1drv.ms/xs/s!Am4xl7RvUGyWaXod-4XmPezY4Mk?wdFormId=%7BA1C5478A-C065-4B6F-B415-C1A0973F4392%7D
717
https://savethedateeventsllc.org/dhfghf.php#aaaa@example.jp
Error trying to connect to socket: closing socket
718
https://smbc-login.com/
719
https://workprotocoles-com.webs.com/
720
https://sites.google.com/view/62374ytu/btconnecc?authuser=1
721
https://linktr.ee/PromoTitanS19/
722
http://flladv.com.br/bseconsult/officeonline/home/
723
http://ambrosecourt.com/Our/Ourtim

830
http://www.www.httpservlces.runescape.com-m.ru/
831
http://51.222.192.117/telas/Mercadolivre/
832
http://dssdsdffff.000webhostapp.com/
833
http://alfaauv.com/wp-includes/sitemaps/providers/Project_diagram/home
834
https://wxcefappmobile.com/Caixa_Acesso
835
https://creditagricole.zyrosite.com/
836
http://bitalchile.cl/
837
http://archost.net.au/mkb.hu/mkb/a.php
838
https://contactinfo-mypo.com/
839
http://u8tny.skipdns.link/wp-content/plugins/affiliatewp-signup-referrals/includes/d/
840
https://forms.office.com/Pages/ResponsePage.aspx?id=o1KPbBY4gEGRoN8DfasQ2IJzeu7mzR9PqFlgiZARhuJUN0VVVkRUMVVXUEcyT1hQMUo1TTg3WE02TC4u
841
http://saldospc.com/5Yg4qK/mic1/index.php
842
https://linktr.ee/edu1e2.com
843
https://netorg6600800-my.sharepoint.com/:x:/r/personal/ginger_gingerfountain_com/_layouts/15/WopiFrame.aspx?guestaccesstoken=gpys8ex7Ys1urrzBfeAsvlEXkODTrovMMCPn%2brSNEbs%3d&amp;docid=1_1882b07b5eb5643d2bdaa63426324ef0e&amp;wdFormId=%7B9BD54AF1%2DEE16%2D4E07%2D8D62%2D6E9B76E47512%7D&amp;

942
http://amamzon.dsybys.shop/
943
http://m.httpsservlces.runescape.com-m.ru/
944
http://refreshingsupportinglinecare.com/mazon/0f6d8
945
https://rebelution-protection-only-5887412986324688.tk/checkpoint_next.php
946
http://www.cutt.ly/dkvKq49/
947
https://betasus223.blogspot.com/
948
https://docs.google.com/forms/d/e/1FAIpQLSfvvvNddWmy-3u-AGX0bVAr5WfmPLx8bVGEF_Zdia7Ra9Llfg/viewform?usp=send_form
949
https://wzplh.app.link/e2WxrTBLm4
950
https://rewindingshop.com/components/ro/index.html#neg@cedia.org.ec
951
https://paypal-my.sharepoint.com/personal/keyu_paypal_com/Documents/Delivering%20Certainty%20Presentation/Delivering%20Certainty%20Roadmap%20Presentation%204.18.19%20v11%20(Shared).pptx
952
https://jbshtl.secure52serv.com/receipt/secureNetflix/085e4dd9bf9b42f434fbc5482c404ce6
953
http://stevemadentr.com/
954
https://tenderometer.eu/Netfliix0.secure01c.com.alerst.accouunts/secureNetflix/17e22f168e417fa387c6305917b03adb
955
https://sfo3.digitaloceanspaces.com/owx/hjfho3jbhj90234lk/S

1041
https://officialevent.way.live/edoardopolacco
Error trying to connect to socket: closing socket
1042
http://cannellandcoflooring.co.uk/administrator/templates/system/images/img/onel/
1043
http://638ca12d-ba2f-451c-8418-faf56b7de7ff.htmlcomponentservice.com/get_draft?id=638ca1_14694f4a84161543466426a12288de1a.html
1044
http://joink-grupss01.duckdns.org/
1045
https://storage.cloud.google.com/officpcpspbcncuser.appspot.com/index.htm#jbell@legalshield.com
1046
https://globalinfohost.com/landingpage/dd263864-38a2-466a-be2f-4e5ec6c5e042/hctN-cqfiFpE_oKKKLCW-nscTYxGDAC6UsniYJMrH7M
1047
https://atriumlandscape-my.sharepoint.com/personal/sue_atriumlandscape_com/_layouts/15/WopiFrame.aspx?guestaccesstoken=nrG8nkxnKyJi2axm9efeKdi62u6cuvRsxCypZFZ9jaY%3d&docid=1_10932d3dd2ac2478f833ee56388ecb767&wdFormId=%7BFAEBEC1D%2DBC38%2D42BF%2DBE94%2D47EBB62D7501%7D&action=formsubmit&cid=06548627-9647-42de-a0c7-75a424aaacde
1048
https://t.co/hAu7Jfzq6w?amp=1?trackingid=UPVqJZrM&amp;signature=newsletter
10

1138
https://www.google.com/url?q=https://duodanseclub.fr//nh/rd/logon/?email%3D%5B%5B-Email-%5D%5D&amp;source=gmail&amp;ust=1593678623293000&amp;usg=AFQjCNHq3h-kf1Tmy7IQ1nwzA8yZ6k4XMQ
1139
https://smbc-card.user.com.ccxwhj.cn/
1140
https://uc-card.com.nm1jqq.cn/pc/ucp_signin.html
1141
http://netmas.com.mx/net3/login.html
1142
http://delezhen.mashalezhen.com/wp-includes/pomo/updatenow/files1/top.html
1143
https://tinyurl.com/m6t9puyd
1144
https://amoueaom-co-jp.9769bj3.cn/
1145
http://ifuudaixcbaqhoxwbttgkptnlb-dot-gowa11111111.rj.r.appspot.com/
1146
https://willingsd.no:443/FBG/
1147
http://eclipsevpn-new.com/secure-paypal/179bdcfa38bd99fc525b36e58540cf93?dispatch=2MrBZCpk3Umyo0tN8EhJr7VsOQRnFyfFE7bSeTOuLKjR7YLFZ8&amp;email=
1148
https://sites.google.com/view/securebtweebly/bt
1149
https://l.wl.co/l?u=https://getpayment.irs.gov.account-cash.app/?imanhalal
1150
https://sandert12.blogspot.com/p/la-banque-postale.html
1151
https://secure-notofication-payment-account2347456.etfef4weasrgar

1246
https://spkitem7.life/fullino/anmeldung.php?starten=ocanbhJgrjEzpTYuHLlDfVUmOW6MKy&shufflUri?=NWarAJfFqX6puQIsElcx
Error trying to connect to socket: closing socket
1247
http://clubeamigosdopedrosegundo.com.br/list/index.php?email=abuse@brain.net.pk
1248
http://vtennis.vn/uploads/3/2/2/en/mpp/Login/complete.php
Error trying to connect to socket: closing socket
1249
http://bit.do/fRCSY
1250
http://voe.eng.br/voe.eng.br/admin/dhl-isvec-eur/
1251
https://surveyheart.com/form/5e5b8d772e417841d96ee7af#form/0
1252
http://polkastarter.group
Error trying to connect to socket: closing socket
1253
http://eclipsevpn-new.com/secure-paypal/e4401b948eef5f5bf150d2a012b47dd3/?dispatch=z81D2XO83aGfLQgNO1H0b4eVdkb1XmHbqYUwPEyAuLFFMXep1E&amp;email=
1254
http://validacionakkuntsevos.gq/Notification_help.html
Error trying to connect to socket: closing socket
1255
https://iplogger.org/2GrFj6
1256
http://postfincasse.000webhostapp.com/login/
1257
http://www.stolizaparketa.ru/wp-content/themes/twentyfift

1338
http://antaresns.com/We-transfer/login.php
1339
https://l.linklyhq.com/l/T2fA?confirmations
1340
http://albenis-kerqeli.github.io/Netflix-Homepage-html-and-css-
1341
http://vfr1.22web.org/zo
1342
http://www.gkjx168.com/images?http://us.battle.net/login/http://dyfdzx.com/js?fav.1&amp;fid=1&amp;fid.4.1252899642&amp;randu0013InboxLightaspxn.1774256418&amp;rand.13InboxLight.as&amp;ref=http:/jebvahnus.batt====
1343
http://taijishentie.com/js/index.htm?amp&amp;amp&amp;http:/us.battle.net/login/en?ref=http:/bowaovvus.battle.net/d3/en/index
1344
http://ywlrwalvihcbekfschnroffqhf-dot-goff039302032323.rj.r.appspot.com/
1345
http://andhraelec.com/reimainsteard/stewart/
1346
https://sites.google.com/view/att-managements/home
1347
https://naranja-users.auth0.com/login?state=g6Fo2SAwcWphTHdRMkdXN3dISU5oNDlCMjJuNEtRbW01Vk5YNaN0aWTZIE85RnBKbkQtMDUzM3NsNnJwVFFFOUJfSXF1QnF3RGtpo2NpZNkgNkMybGREeTJVaHRVTXRPS1pOM2ZRZVVKYlVCQXdSQTc&amp;client=6C2ldDy2UhtUMtOKZN3fQeUJbUBAwRA7&amp;protocol=oauth2&amp;red

1444
https://sen-manole.firebaseapp.com/
1445
https://629afe26.orson.website/?ts=1613476656731
Error trying to connect to socket: closing socket
1446
https://tb915hdh89.mfs.gg/1Yeasd3
1447
https://fedexvoyager.com/
1448
http://www.clenapal.com/wp-content/languages/-/
1449
https://happyhouru.com/wp-includes/random_compat/
1450
http://5454451456445.hyperphp.com/
1451
https://firebasestorage.googleapis.com/v0/b/outlook-4c0f2.appspot.com/o/books%2FWebmail.htm?alt=media&amp;token=216401d2-aba7-42f4-8fd5-9b672cade830#tiekimas@tidlo.lt
1452
http://bit.do/fRFUy
1453
https://annarborhandsonmuseum-my.sharepoint.com/personal/jklute_aahom_org/_layouts/15/WopiFrame2.aspx?sourcedoc={32b08432-df6e-45ce-b9dd-bd06a2fd8ffc}&amp;action=default&amp;originalPath=aHR0cHM6Ly9hbm5hcmJvcmhhbmRzb25tdXNldW0tbXkuc2hhcmVwb2ludC5jb20vOm86L2cvcGVyc29uYWwvamtsdXRlX2FhaG9tX29yZy9FaktFc0RKdTM4NUZ1ZDI5QnFMOWpfd0J2U3EwWjVXaG1iSnNiTTdkdVg4RDBRP3J0aW1lPTNHMWxmTUI0MTBn
1454
http://paulmitchellforcongress.com/wp-content/lang
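
The log above prints every index and URL, and a single unexpected exception would stop the whole run. A minimal sketch of an alternative loop that records failures and reports progress periodically. `extract_all` and `fake_extractor` are hypothetical names; in this notebook the extractor argument would be `featureExtraction`:

```python
import time

def extract_all(urls, extractor, label=1, report_every=100):
    """Run extractor(url, label) over urls, collecting rows and failures."""
    rows, failures = [], []
    start = time.time()
    for i, url in enumerate(urls):
        try:
            rows.append(extractor(url, label))
        except Exception as exc:
            # Keep going; record the failure for later review
            failures.append((i, url, repr(exc)))
        if (i + 1) % report_every == 0:
            print(f"{i + 1} URLs processed")
    elapsed = time.strftime("%H:%M:%S", time.gmtime(time.time() - start))
    return rows, failures, elapsed

def fake_extractor(url, label):
    """Stand-in extractor that fails on purpose for URLs containing 'bad'."""
    if "bad" in url:
        raise ValueError("simulated failure")
    return [len(url), label]

rows, failures, elapsed = extract_all(
    ["http://a.test", "http://bad.test", "http://c.test/x"], fake_extractor
)
```

The `failures` list keeps the index and URL of each skipped row, so those URLs can be retried or inspected after the run.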

In [11]:
#Converting the list to dataframe

phishurl = pd.DataFrame(phish_features, columns= feature_names)
phishurl.head()

Unnamed: 0,domain,ip_present,at_present,url_length,url_depth,redirection,https_domain,short_url,prefix/suffix,dns_record,web_traffic,domain_age,domain_end,dot_count,specialchar_count,subdom_count,label
0,jbshtl.secure52serv.com,0,0,1,3,0,0,0,0,0,0,0,0,0,6,1,1
1,lazarus.co.zw,0,0,0,3,0,0,0,0,0,1,1,1,1,6,1,1
2,drivingschoolglasgow.co.uk,0,0,1,5,0,0,0,0,0,0,0,0,0,8,1,1
3,rektuen.qsyljs.com,0,0,0,2,0,0,0,0,0,1,0,0,1,5,1,1
4,hairahsweetcakes.com,0,0,0,1,0,0,0,0,0,1,0,1,0,4,0,1


In [12]:
# Storing the extracted phishing URL features to a CSV file

phishurl.to_csv('/Users/jillkathleen/Desktop/Phishing-Analysis-Detection/Back-End/Feature-Extraction-ntbk/FE-more-data/phish_SAMPLE.csv', index= False)
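
After writing the CSV, it can be worth reading it back to confirm that the columns and values survived. A minimal round-trip sketch, assuming a tiny made-up frame and a temporary directory (`demo`, `phish_demo.csv`, and the row values are hypothetical, not this notebook's data):

```python
import os
import tempfile
import pandas as pd

# Hypothetical two-row frame using three of the notebook's column names
demo = pd.DataFrame(
    [["bit.do", 0, 1], ["smbc-login.com", 1, 1]],
    columns=["domain", "prefix/suffix", "label"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "phish_demo.csv")
    demo.to_csv(out_path, index=False)     # index=False avoids an extra unnamed column
    restored = pd.read_csv(out_path)

columns_match = list(restored.columns) == list(demo.columns)
values_match = restored.values.tolist() == demo.values.tolist()
```

Writing with `index=False`, as the cell above does, is what keeps the file from gaining an extra index column when it is read back.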
  

In [13]:
## NOTE THAT PHISHING DATASET WAS COLLECTED FROM PHISHTANK