**Extracting features from phishing and benign urls.**

> # **Intro**

From URLs we have to extract some features which will help us classify between phishing websites and legitimate one. 
The features can be classified into three types: 
1.   Address Bar Based Features
2.   Abnormal Based Features
3.   HTML and JavaScript based Features
4.   Domain based Features


Some of the featuers of phishing websites are: 

Using the IP Address

> Using IP address in place of domain name in the URL such as `“http://125.98.3.123/fake.html”` and sometimes and IP address is used hex code: 

```
“http://0x58.0xCC.0xCA.0x62/2/paypal.ca/index.html”
```

Other features are

*   Long URLs. More than 52 char length are usually phishing urls. 
*   Using tiny url services. 
* Having *@* symbol in the in URL. 
* Using `-` hypthens in the url. 

The list is long and the features extracted are based on this paper:
http://eprints.hud.ac.uk/id/eprint/24330/6/MohammadPhishing14July2015.pdf


 



In [1]:
import pandas as pd

Extracting phishing URLs

In [2]:
data0 = pd.read_csv('C:/Users/Liberty/OneDrive/Desktop/CYBER/PROJECTS/Phishing-website-detection-using-ML/Datasets/verified_online.csv')
#http://data.phishtank.com/data/online-valid.csv


In [3]:
data0.head()

Unnamed: 0,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,7265251,https://anazom.co.ip.lhpoct.shop/dR3snx1C.php?...,http://www.phishtank.com/phish_detail.php?phis...,2021-08-16T02:11:56+00:00,yes,2021-08-16T02:18:16+00:00,yes,Amazon.com
1,7265249,https://www.sprintage.it/images/login/bizmail.php,http://www.phishtank.com/phish_detail.php?phis...,2021-08-16T02:07:20+00:00,yes,2021-08-16T02:14:31+00:00,yes,Other
2,7265245,http://confirm-unverified-pplaccount.com/custo...,http://www.phishtank.com/phish_detail.php?phis...,2021-08-16T02:05:41+00:00,yes,2021-08-16T02:14:31+00:00,yes,Other
3,7265244,http://confirm-unverified-pplaccount.com/custo...,http://www.phishtank.com/phish_detail.php?phis...,2021-08-16T02:05:40+00:00,yes,2021-08-16T02:14:31+00:00,yes,Other
4,7265242,http://mail.confirm-unverified-pplaccount.com/...,http://www.phishtank.com/phish_detail.php?phis...,2021-08-16T02:05:36+00:00,yes,2021-08-16T02:14:31+00:00,yes,Other


In [4]:
data0.shape

(10784, 8)

In [5]:
#Collecting 5,000 Phishing URLs randomly
phis_url = data0.sample(n = 5000, random_state = 12).copy()
phis_url = phis_url.reset_index(drop=True)
phis_url.head()


Unnamed: 0,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,7260589,https://hzibupigbnrtqezn-dot-sunlit-center-322...,http://www.phishtank.com/phish_detail.php?phis...,2021-08-10T15:26:24+00:00,yes,2021-08-10T15:49:26+00:00,yes,Other
1,7251242,http://netttxnet.byethost3.com/,http://www.phishtank.com/phish_detail.php?phis...,2021-07-31T23:57:53+00:00,yes,2021-08-01T00:08:16+00:00,yes,Other
2,7257550,https://phx.chromeproxy.net/direct/aHR0cHM6Ly9...,http://www.phishtank.com/phish_detail.php?phis...,2021-08-07T02:08:44+00:00,yes,2021-08-07T02:19:40+00:00,yes,Other
3,6887460,https://midnightluna1.typeform.com/to/bSFSrH4T,http://www.phishtank.com/phish_detail.php?phis...,2020-12-11T23:36:32+00:00,yes,2021-01-02T06:10:46+00:00,yes,Other
4,6772545,http://caracasmateriais.blogspot.com/,http://www.phishtank.com/phish_detail.php?phis...,2020-09-16T14:49:55+00:00,yes,2020-09-16T17:09:06+00:00,yes,Other


> # **Address Bar Based Features**





> **1.1 Extracting Legitimate URLs:**

In [6]:
data1 = pd.read_csv('C:/Users/Liberty/OneDrive/Desktop/CYBER/PROJECTS/Phishing-website-detection-using-ML/Datasets/Benign_list.csv')

In [7]:
data1.columns = ['URLs']

data1.head()

Unnamed: 0,URLs
0,http://1337x.to/torrent/1110018/Blackhat-2015-...
1,http://1337x.to/torrent/1122940/Blackhat-2015-...
2,http://1337x.to/torrent/1124395/Fast-and-Furio...
3,http://1337x.to/torrent/1145504/Avengers-Age-o...
4,http://1337x.to/torrent/1160078/Avengers-age-o...


In [8]:
data1.shape

(35377, 1)

In [9]:
#Collecting 5,000 Legitimate URLs randomly
legi_url = data1.sample(n = 5000, random_state = 12).copy()
legi_url = legi_url.reset_index(drop=True)
legi_url.head()


Unnamed: 0,URLs
0,http://graphicriver.net/search?date=this-month...
1,http://ecnavi.jp/redirect/?url=http://www.cros...
2,https://hubpages.com/signin?explain=follow+Hub...
3,http://extratorrent.cc/torrent/4190536/AOMEI+B...
4,http://icicibank.com/Personal-Banking/offers/o...


In [10]:
%pip install bs4





> **1.1 Extracting Domain of the url**

In [11]:
import requests
import re

In [12]:
%pip install bs4




In [13]:
%pwd

'c:\\Users\\Liberty\\OneDrive\\Desktop\\CYBER\\PROJECTS\\Phishing-website-detection-using-ML\\ipynb files'

In [14]:
from bs4 import BeautifulSoup
from urllib.parse import urlparse


In [15]:
def getDomain(url):
  domain = urlparse(url).netloc
  if re.match(r"^www.",domain):
    domain = domain.replace("www.","")
  return domain


In [16]:
#checking if getDomain works
getDomain('https://www.youtube.com/results?search_query=extracting+url+information')

'youtube.com'

In [17]:
%pip install ipaddress




> **1.2 Extracting ipaddress from the url**

In [18]:
#checking if url contains an IP address
import ipaddress
def haveIP(url):
  try:
    ipaddress.ip_address(url)
    ip = 1
  except:
    ip = 0
  return ip

In [19]:
#checking the haveIP fun
x = '127.0.0.1'
print( haveIP(x) ) 
x = 'www.google.com'
print ( haveIP(x) ) 

1
0


> **1.3 Extracting *@* symbol from the url**




In [20]:
#Checking for @ symbol in the URL
def haveAt(url): 
  if '@' in url: 
    return 1
  else:
    return 0

In [21]:
#checking the have_at fun
x = 'www.google.com'
y = 'www.yahoo@gmail.com'
print( haveAt(x) ) 
print( haveAt(y) ) 

0
1


> **1.4 Extracting the length of url**

In [22]:
#Checking the length of the URL
def urlLength(url): 
  if len(url) < 54: 
    return 0
  else: 
    return 1

> **1.5 Extracting the tiny URL shotners from the url**

In [23]:
#Checking if the url uses tinyUrl services

short_url_services =  r"bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|" \
                      r"yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|" \
                      r"short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|" \
                      r"doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|db\.tt|" \
                      r"qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|q\.gs|is\.gd|" \
                      r"po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|x\.co|" \
                      r"prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|" \
                      r"tr\.im|link\.zip\.net"

In [24]:
def usesTinyUrl(url):      
  temp = re.search(short_url_services, url)
  if(temp):
    return 1
  else: 
    return 0

In [25]:
#checking UsesTinyUrl fun
x = 'bit.ly/19DXSk4'
y = 'www.yahoo.com'
print( usesTinyUrl(x) ) 
print( usesTinyUrl(y) ) 

1
0


> **1.6 Extracting the hyphens from the url**

In [26]:
#Checking if URL contains '-'. 
def haveHyphen(url): 
  if '-' in urlparse(url).netloc:
    return 1            # phishing
  else:
    return 0            # legitimate


In [27]:
#checking the haveHypen
print( haveHyphen('www.pay-tm.com') ) 
print( haveHyphen('www.google--pay-1.com'))

0
0


In [28]:
%pip install tldextract

Note: you may need to restart the kernel to use updated packages.


> **1.7Extracting subdomains from the url.**

In [29]:
#checking if the Url have multi sub domain
#Not working currently
import tldextract
def multiSubDomain(url):   
  x = tldextract.extract(url)
  print(x)



In [30]:
#checking multi sub domain 
x = 'https://www1.eposcard-co-jp-mcmbresreviec.s20r084.abc.def.ghi.cn/'
y = 'http://www.hud.ac.uk/students/page1.html'
print ( multiSubDomain(x) )
print ( multiSubDomain(y) ) 

ExtractResult(subdomain='www1.eposcard-co-jp-mcmbresreviec.s20r084.abc.def', domain='ghi', suffix='cn', is_private=False)
None
ExtractResult(subdomain='www', domain='hud', suffix='ac.uk', is_private=False)
None


In [31]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


> **1.8 Extracting `//` from the url**

In [32]:
#finding the depth of // from the url
def redirection(url):
  pos = url.rfind('//')
  if pos > 6:
    if pos > 7:
      return 1
    else:
      return 0
  else:
    return 0

In [33]:
#checking the redirection 
x = 'http://eprints.hud.ac.uk/id/eprint/24330/6/MohammadPhishing14July2015.pdf'
print( redirection(x) )

0


> **1.9 Extracting the HTTPs certificate age and issuer**

In [34]:
import requests
#not working correctly

response = requests.get('https://stackoverflow.com/questions/29773003/check-whether-domain-is-registered')
print(response)

<Response [200]>


> **1.10 Extracting domain registration length**

In [35]:
%pip install python-whois

Note: you may need to restart the kernel to use updated packages.


In [36]:
#finding the domani registration date from the url
import whois 
from datetime import datetime
from dateutil.relativedelta import relativedelta
def domainRegLength(url):
  try: 
    temp = whois.whois(url)      
    #print(datetime.today(), ' credate ', temp.creation_date[0])  
    return relativedelta(datetime.today(), temp.creation_date[0]).years
  except: 
    return 0

In [37]:
#checking the domdin reg len fun
x = 'www.google.com'
y = 'http://u1047531.cp.regruhosting.ru/acces-inges-20200104-t452/3facd/'
print( domainRegLength(x) )
print ( domainRegLength(y) ) 



26
0


> **1.10 Extracting hidden https token in domain**


In [38]:
#checking the existense of hidden http/https
def hiddenhttps(url): 
  domain = urlparse(url).netloc
  print(domain)
  if 'https' in domain: 
    return 1
  else: 
    return 0

#checking the working of above fun
x = 'https://open.spotify.com/playlist/3mHGpdWE9oxUjcNZJvCkBe'
y = 'http://https-www-paypal-it-webapps-mpp-home.soft-hair.com/'
z = 'https://http-www-paypal-it-webapps-mpp-home.soft-hair.com/'
hiddenhttps(z)

http-www-paypal-it-webapps-mpp-home.soft-hair.com


0

> **1.11 Extracting Depth of the URL**


In [39]:
def getDepth(url):
  s = urlparse(url).path.split('/')
  depth = 0
  for j in range(len(s)):
    if len(s[j]) != 0:
      depth = depth+1
  return depth


> #  **Domain Based Features**




> **2.1 Age of Domain**

In [40]:
#If age of domain > 12 months, the vlaue of this feature is 1 (phishing) else 0 (legitimate).
def domainAge(domain_name):
  creation_date = domain_name.creation_date
  expiration_date = domain_name.expiration_date
  if (isinstance(creation_date,str) or isinstance(expiration_date,str)):
    try:
      creation_date = datetime.strptime(creation_date,'%Y-%m-%d')
      expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
    except:
      return 1
  if ((expiration_date is None) or (creation_date is None)):
      return 1
  elif ((type(expiration_date) is list) or (type(creation_date) is list)):
      return 1
  else:
    ageofdomain = abs((expiration_date - creation_date).days)
    if ((ageofdomain/30) < 6):
      age = 1
    else:
      age = 0
  return age


> **2.2 End Period of Domain**

In [41]:
#If end period of domain > 6 months, the vlaue of this feature is 1 (phishing) else 0 (legitimate).
def domainEnd(domain_name):
  expiration_date = domain_name.expiration_date
  if isinstance(expiration_date,str):
    try:
      expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
    except:
      return 1
  if (expiration_date is None):
      return 1
  elif (type(expiration_date) is list):
      return 1
  else:
    today = datetime.now()
    end = abs((expiration_date - today).days)
    if ((end/30) < 6):
      end = 0
    else:
      end = 1
  return end


> # **HTML and Javascript based features**


> **3.1 . IFrame Redirection**

In [42]:
  #If the iframe is empty or repsonse is not found then, the value assigned to this feature is 1 (phishing) or else 0 (legitimate).

def iframe(response):
    if response == "":
        return 1
    else:           
        if re.findall(r"[<iframe>|<frameBorder>]", response.text):
            return 0
        else:
            return 1

> **3.2 . Status Bar Customization**

In [43]:
# If the response is empty or onmouseover is found then, the value assigned to this feature is 1 (phishing) or else 0 (legitimate).
def mouseOver(response): 
  if response == "" :
    return 1
  else:    
    if re.findall("<script>.+onmouseover.+</script>", response.text):
      return 1
    else:
      return 0

> **3.3 . Status Bar Customization**

In [44]:
# If the response is empty or onmouseover is not found then, the value assigned to this feature is 1 (phishing) or else 0 (legitimate).
def rightClick(response):
  if response == "":
    return 1
  else:    
    if re.findall(r"event.button ?== ?2", response.text):
      return 0
    else:
      return 1


> **3.4 . Website Forwarding**

In [45]:
# legtimate website forwards at maxx one times, phishing websites are forwarded at least 4 times.
def forwarding(response):
  if response == "":
    return 1
  else:    
    if len(response.history) <= 2:
      return 0
    else:
      return 1


> # **Computing URL Features**

In [46]:
def featureExtraction(url,label, curr):
  feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                      'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record',
                      'Domain_Age', 'Domain_End', 'iFrame', 'Mouse_Over','Right_Click', 'Web_Forwards', 'Label']
  features = []
  #Address bar based features  are working correctly    
  features.append(getDomain(url))
  features.append(haveIP(url))
  features.append(haveAt(url))
  features.append(urlLength(url))
  features.append(getDepth(url))
  features.append(redirection(url))
  features.append(hiddenhttps(url))
  features.append(usesTinyUrl(url))
  features.append(haveHyphen(url))
  
  #Domain based features  working correctly
  dns = 0
  try:
    domain_name = whois.whois(urlparse(url).netloc)
  except:
    dns = 1

  features.append(dns)
  # features.append(web_traffic(url))
  features.append(1 if dns == 1 else domainAge(domain_name))
  features.append(1 if dns == 1 else domainEnd(domain_name))
  
  # HTML & Javascript based features working correctly
  temp = ['1']*4
  temp.append(label)

  try:   
    response = requests.get(url, timeout=5 )        
    print('HTTP response code: ', response.status_code)
    if response.status_code == 200:       
      features.append(iframe(response))      
      features.append(mouseOver(response))    
      features.append(rightClick(response))    
      features.append(forwarding(response))    
      features.append(label)          
    else: 
      print('Not reachable - ', url)
      features.extend(temp)
  except:     
    print('Timeout - ', url)
    features.extend(temp)
    

  return features


>#  **4.1. Legitimate URLs:**


In [47]:

#Extracting the feautres & storing them in a list
legi_features = []
label = 0
# 1 is phishing , 0 is legitimate
for i in range(0, len(legi_url) ):
  print('i is: ', i , end = "")
  url = legi_url['URLs'][i]  
  legi_features.append(featureExtraction(url,label, i))


i is:  0graphicriver.net
HTTP response code:  200
i is:  1ecnavi.jp
HTTP response code:  200
i is:  2hubpages.com
HTTP response code:  200
i is:  3extratorrent.cc
Timeout -  http://extratorrent.cc/torrent/4190536/AOMEI+Backupper+Technician+%2B+Server+Edition+2.8.0+%2B+Patch+%2B+Key+%2B+100%25+Working.html
i is:  4icicibank.com
HTTP response code:  403
Not reachable -  http://icicibank.com/Personal-Banking/offers/offer-detail.page?id=offer-ezeego-domestic-airtravel-20141407112611060
i is:  5nypost.com
HTTP response code:  200
i is:  6kienthuc.net.vn


2024-07-26 20:51:10,663 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed


HTTP response code:  200
i is:  7thenextweb.com
HTTP response code:  403
Not reachable -  http://thenextweb.com/in/2015/04/16/india-wants-a-neutral-web-and-facebooks-internet-org-cant-be-a-part-of-it/gtm.js
i is:  8tobogo.net
Timeout -  http://tobogo.net/cdsb/board.php?board=greet&bm=view&no=5716&category=&auth=&page=1&search=&keyword=&recom=
i is:  9akhbarelyom.com
HTTP response code:  403
Not reachable -  http://akhbarelyom.com/news/newdetails/411395/1/%D9%85%D8%AD%D8%A7%D9%81%D8%B8-%D8%A7%D9%84%D8%A8%D8%AD%D9%8A%D8%B1.html
i is:  10tunein.com
HTTP response code:  200
i is:  11tune.pk
Timeout -  https://tune.pk/video/6046458/canelo-vs-kirkland-highlights-hbo-world-championship-boxing
i is:  12sfglobe.com
HTTP response code:  404
Not reachable -  http://sfglobe.com/2015/05/01/six-baltimore-police-officers-charged-in-freddie-grays-death/?src=home_feed
i is:  13mic.com
HTTP response code:  200
i is:  14thenextweb.com
HTTP response code:  403
Not reachable -  http://thenextweb.com/apps/2

2024-07-26 20:53:16,530 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed


Timeout -  http://motthegioi.vn/tai-chinh-bat-dong-san/tu-17-nguoi-mua-nha-dat-phai-cong-them-hang-loat-chi-phi-187472.html
i is:  26spankbang.com
HTTP response code:  403
Not reachable -  http://spankbang.com/4ze1/video/brunette+with+big+boobs+fucked+in+a+cellar+public+agent
i is:  27torcache.net
HTTP response code:  200
i is:  28mic.com
HTTP response code:  200
i is:  29thenextweb.com
HTTP response code:  404
Not reachable -  http://thenextweb.com/dd/2014/04/08/ux-designers-side-drawer-navigation-costing-half-user-engagement/feed/gtm.start
i is:  30emgn.com
Timeout -  http://emgn.com/movies/person-of-interest-what-to-expect-from-season-4-release-date/
i is:  31depositphotos.com
HTTP response code:  200
i is:  32serverfault.com
HTTP response code:  200
i is:  33kakaku.com
HTTP response code:  200
i is:  34indianexpress.com
HTTP response code:  200
i is:  35nypost.com
HTTP response code:  200
i is:  36distractify.com
HTTP response code:  404
Not reachable -  http://distractify.com/post

2024-07-26 20:56:07,521 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed


HTTP response code:  200
i is:  52twitter.com
HTTP response code:  200
i is:  53nesn.com
HTTP response code:  200
i is:  54fishki.net
HTTP response code:  403
Not reachable -  http://fishki.net/1526766-kapello-zajavil-o-bezumii-prezidenta-rfs.html?mode=recent
i is:  55syosetu.com
HTTP response code:  403
Not reachable -  http://syosetu.com/searchuser/search/index.php?name1st=%E3%82%80&all=1&all2=1&all3=1&all4=1&p=10
i is:  56akhbarelyom.com
HTTP response code:  403
Not reachable -  http://akhbarelyom.com/news/newdetails/411497/1/%D9%85%D8%B5%D8%B1-%D8%A7%D9%84%D8%B9%D8%B7%D8%A7%D8%A1-%D8%AA.html
i is:  57allegro.pl
Timeout -  http://allegro.pl/triumph-stringi-precious-essence-string-granat-40-i5035632976.html
i is:  58getpocket.com
HTTP response code:  200
i is:  59mylust.com
HTTP response code:  200
i is:  60censor.net.ua
HTTP response code:  403
Not reachable -  http://censor.net.ua/photo_news/335629/v_tsentre_donetska_ural_boevikov_dnr_razdavil_jiguli_2_pogibshih_narod_ne_protestuet

2024-07-26 20:57:17,776 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed


HTTP response code:  200
i is:  67uproxx.com
HTTP response code:  200
i is:  68babal.net
HTTP response code:  404
Not reachable -  http://babal.net/women/view/80/%D8%AF%D9%84%D9%91%D9%84%D9%8A-%D9%86%D9%81%D8%B3%D9%83-%D8%A3%D9%8A%D8%AA%D9%87%D8%A7-%D8%A7%D9%84%D8%A3%D9%85-%D9%85%D8%B9-%D9%87%D8%B0%D9%87-%D8%A7%D9%84%D9%86%D8%B5%D8%A7%D8%A6%D8%AD
i is:  69mic.com
HTTP response code:  200
i is:  70torcache.net
HTTP response code:  200
i is:  71paytm.com
HTTP response code:  403
Not reachable -  https://paytm.com/blog/paytm-offer-for-app-users-get-upto-rs-50-cash-back/?share=email
i is:  72motthegioi.vn


2024-07-26 20:57:37,178 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed


Timeout -  http://motthegioi.vn/suckhoe/nu-cuoi-y-hoc/nu-cuoi-suc-khoe-nhung-li-do-nen-yeu-chang-trai-bung-bu-188294.html
i is:  73torcache.net
HTTP response code:  200
i is:  74thenextweb.com
HTTP response code:  403
Not reachable -  http://thenextweb.com/asia/2014/09/26/myanmars-mobile-revolution-kicks-telenor-prepares-launch-service/gtm.start
i is:  75extratorrent.cc
Timeout -  http://extratorrent.cc/torrent/4189616/Jedi+Mind.Tricks.The.Thief.and.the.Fallen.2015.mp3.vbr.NOiR.html
i is:  76genius.com
HTTP response code:  200
i is:  77thenextweb.com
HTTP response code:  403
Not reachable -  http://thenextweb.com/apps/2012/04/19/500px-launches-android-app-and-overhauls-its-ipad-version-too/
i is:  78kienthuc.net.vn


2024-07-26 20:58:11,517 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed


HTTP response code:  200
i is:  79sourceforge.net
HTTP response code:  404
Not reachable -  http://sourceforge.net/directory/development/add_facet_filter?facet=license&constraint=OSI-Approved+Open+Source+%3A%3A+PHP+License
i is:  80nypost.com
HTTP response code:  200
i is:  81ap.org
HTTP response code:  403
Not reachable -  http://ap.org/Content/Press-Release/2013/NFL-celebrates-season-with-NFL-Honors-Super-Bowl-Eve
i is:  82indianexpress.com
HTTP response code:  200
i is:  83extratorrent.cc
Timeout -  http://extratorrent.cc/torrent_download/4191066/Chappie.2015.720p.WEB-DL.AAC2.0.H.264-PLAYNOW.torrent
i is:  84correios.com.br
HTTP response code:  200
i is:  85web.de
HTTP response code:  404
Not reachable -  http://web.de/magazine/sport/fussball/champions-league/fc-bayern-muenchen-fc-barcelona/fc-bayern-muenchen-fc-barcelona-enrique-fuerchtet-freund-guardiola-30633596
i is:  86noticias.uol.com.br
Timeout -  http://noticias.uol.com.br/saude/album/2015/03/17/dengue-pelo-brasil.htm?abrefo

2024-07-26 20:59:30,242 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed


Timeout -  http://kenh14.vn/tv-show/viet-huong-gao-thet-to-chong-leng-pheng-trai-tre-20150208121726226.chn
i is:  89codecanyon.net
HTTP response code:  410
Not reachable -  http://codecanyon.net/item/photofans-your-social-network-to-share-photos/full_screen_preview/6308014
i is:  90tunein.com
HTTP response code:  200
i is:  91nguyentandung.org
HTTP response code:  200
i is:  92kickass.to


2024-07-26 20:59:53,562 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [WinError 10061] No connection could be made because the target machine actively refused it


HTTP response code:  200
i is:  93weathernews.jp
HTTP response code:  200
i is:  94atwiki.jp
HTTP response code:  200
i is:  95plarium.com
HTTP response code:  410
Not reachable -  http://plarium.com/en/strategy-games/sparta-war-of-empires/mmo-masters/offensive-positions
i is:  96chaturbate.com
HTTP response code:  403
Not reachable -  https://chaturbate.com/tipping/spy_on_private_show_tokens_per_minute/ingridblondy94/
i is:  97babal.net
HTTP response code:  404
Not reachable -  http://babal.net/news/view/42903/%D8%AD%D8%A8%D8%B3-%D9%85%D8%AD%D8%A7%D9%81%D8%B8-%D8%A7%D9%84%D8%A3%D9%82%D8%B5%D8%B1-%D8%A7%D9%84%D8%B3%D8%A7%D8%A8%D9%82-%D8%B9%D8%A7%D9%85%D9%8A%D9%86-%D9%88%D8%AA%D8%BA%D8%B1%D9%8A%D9%85%D9%87-10-%D8%A2%D9%84%D8%A7%D9%81-%D8%AC%D9%86%D9%8A%D9%87-%D9%84%D8%B9%D8%AF%D9%85-%D8%AA%D9%86%D9%81%D9%8A%D8%B0%D9%87-%D8%AD%D9%83%D9%85%D8%A7-%D9%82%D8%B6%D8%A7%D8%A6%D9%8A%D8%A7
i is:  98sberbank.ru
Timeout -  http://sberbank.ru/portalserver/sb-portal-ru/ru/person/paymentsandremittance

In [None]:
#converting the list to dataframe
feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                      'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record',  
                      'Domain_Age', 'Domain_End', 'iFrame', 'Mouse_Over','Right_Click', 'Web_Forwards', 'Label']

legitimate = pd.DataFrame(legi_features, columns= feature_names)
legitimate.head()


Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,TinyURL,Prefix/Suffix,DNS_Record,Domain_Age,Domain_End,iFrame,Mouse_Over,Right_Click,Web_Forwards,Label
0,graphicriver.net,0,0,1,1,0,0,0,0,0,1,1,0,0,1,0,0
1,ecnavi.jp,0,0,1,1,1,0,0,0,0,0,1,0,0,1,0,0
2,hubpages.com,0,0,1,1,0,0,0,0,0,0,1,0,0,1,0,0
3,extratorrent.cc,0,0,1,3,0,0,0,0,0,0,0,1,1,1,1,0
4,icicibank.com,0,0,1,3,0,0,0,0,0,0,1,1,1,1,1,0


In [None]:
legitimate.to_csv('legitimate_copy.csv', index= False)

In [None]:
!copy legitimate_copy.csv "C:/Users/Liberty/OneDrive/Desktop/CYBER/PROJECTS/Phishing-website-detection-using-ML/Datasets"

        1 file(s) copied.


In [None]:
!del legitimate_copy.csv

Could Not Find c:\Users\Liberty\OneDrive\Desktop\CYBER\PROJECTS\Phishing-website-detection-using-ML\ipynb files\legitimate_copy.csv


> # **4.2. Phishing URLs:**

In [None]:
phis_url.shape

(5000, 8)

In [None]:
#Extracting the feautres & storing them in a list
phish_features = []
label = 1
for i in range(0, len(legi_url) ):
  url = phis_url['url'][i]
  print('i is: ', i, 'url: ', legi_url['URLs'][i]  )
  phish_features.append(featureExtraction(url,label, i))


i is:  0 url:  http://graphicriver.net/search?date=this-month&length_max=&length_min=&price_max=&price_min=&rating_min=&sales=&sort=sales&term=&view=list
hzibupigbnrtqezn-dot-sunlit-center-322513-556drtgf4.oa.r.appspot.com
HTTP response code:  404
Not reachable -  https://hzibupigbnrtqezn-dot-sunlit-center-322513-556drtgf4.oa.r.appspot.com/#redacted@abuse.ionos.com
i is:  1 url:  http://ecnavi.jp/redirect/?url=http://www.cross-a.net/x.php?id=1845_3212_22061_26563&m=1004&pid=%user_id%
netttxnet.byethost3.com
HTTP response code:  200
i is:  2 url:  https://hubpages.com/signin?explain=follow+Hubs&url=%2Fhub%2FComfort-Theories-of-Religion
phx.chromeproxy.net
Timeout -  https://phx.chromeproxy.net/direct/aHR0cHM6Ly93d3cuZmFjZWJvb2suY29tL2FqYXgvYno_X19hPTEmX19jY2c9RVhDRUxMRU5UJl9fY29tZXRfcmVxPTAmX19jc3I9Jl9fZHluPTd4ZTZGbzRPUTFQeVU5b3luRnduODRhMmk1VTRlMUZ4LWV3U3dNeFcwRFVlVWh3NWN4NjBWbzF1cEU0VzBPRTJXeE8wRkUyYXd0ODFzYnpvNWlhdzV6d3d3aTgxbkUzcnc5TzBSRTJKdzhXMXV3Mm9FRyZfX2hzPTE4ODM5LlBIQVNFRCUzQUR

2024-07-21 20:13:08,431 - whois.whois - ERROR - Error trying to connect to socket: closing socket - timed out


Timeout -  http://page-reconfrim-10000001234576672625690092.tk/checkpoint_next.php
i is:  49 url:  http://persianblog.ir/tags/42604/8/%d8%b3%d9%87%d8%b1%d8%a7%d8%a8_%d8%b3%d9%be%d9%87%d8%b1%db%8c/
clouddoc-authorize.firebaseapp.com
HTTP response code:  404
Not reachable -  https://clouddoc-authorize.firebaseapp.com/......xx.../...xx
i is:  50 url:  http://distractify.com/post/related/id/5537c6314a0c4b89316cdbdf/skip/20/limit/10/back/0
weddingstaffcompanies.com
HTTP response code:  404
Not reachable -  http://weddingstaffcompanies.com/js/tnp/login-mcrsftonline-com-srvcesrtrgvfvfver/
i is:  51 url:  http://kenh14.vn/star/lo-dien-ban-trai-hot-boy-cua-van-mai-huong-20140716121655295.chn
updted-access.demopage.co
HTTP response code:  200
i is:  52 url:  https://twitter.com/share?url=http%3A%2F%2Fhubpages.com%2Fhub%2FWhats-the-difference-between-self-esteem-and-confidence&text=Definitions+of+Confidence+%26+Self+Esteem+-+What%27s+the+Difference%3F
urlz.fr
HTTP response code:  404
Not reachabl

2024-07-21 20:15:31,157 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [WinError 10061] No connection could be made because the target machine actively refused it


Timeout -  https://gcsnc-v.cf/adobe/email/document/authentication/
i is:  71 url:  https://paytm.com/blog/paytm-offer-for-app-users-get-upto-rs-50-cash-back/?share=email
blog.gruszka.info
Timeout -  http://blog.gruszka.info/wp-content/dhl/app.php
i is:  72 url:  http://motthegioi.vn/suckhoe/nu-cuoi-y-hoc/nu-cuoi-suc-khoe-nhung-li-do-nen-yeu-chang-trai-bung-bu-188294.html
alertabancon.repl.co
HTTP response code:  404
Not reachable -  https://alertabancon.repl.co/
i is:  73 url:  http://torcache.net/torrent/047D47DFF4DC5CD9BEA6D0F4C57D68F2F2D71205.torrent?title=[kickass.to]night.at.the.museum.secret.of.the.tomb.2014.1080p.brrip.x264.yify
www.theparlor.shop
HTTP response code:  406
Not reachable -  https://www.theparlor.shop/done/index.php?email=jaime.rodriguez@kfc.com.ec
i is:  74 url:  http://thenextweb.com/asia/2014/09/26/myanmars-mobile-revolution-kicks-telenor-prepares-launch-service/gtm.start
avadvertising.in
Timeout -  http://avadvertising.in/mkbn/QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwA

2024-07-21 20:17:21,872 - whois.whois - ERROR - Error trying to connect to socket: closing socket - timed out


Timeout -  http://paydashtracker.private-monitoring.tk/login.php?cmd=_account-details&amp;session=2edc92044c8d952a7a82a50da85b13b7&amp;dispatch=588f765d95ee26f78432ba7df146a6d819855b98
i is:  95 url:  http://plarium.com/en/strategy-games/sparta-war-of-empires/mmo-masters/offensive-positions
jbshtl.secure52serv.com
HTTP response code:  404
Not reachable -  http://jbshtl.secure52serv.com/receipt/secureNetflix/e5cf6bd5cd804e8fd94d53926c7f6548/login
i is:  96 url:  https://chaturbate.com/tipping/spy_on_private_show_tokens_per_minute/ingridblondy94/
99000.ihostfull.com
HTTP response code:  200
i is:  97 url:  http://babal.net/news/view/42903/%D8%AD%D8%A8%D8%B3-%D9%85%D8%AD%D8%A7%D9%81%D8%B8-%D8%A7%D9%84%D8%A3%D9%82%D8%B5%D8%B1-%D8%A7%D9%84%D8%B3%D8%A7%D8%A8%D9%82-%D8%B9%D8%A7%D9%85%D9%8A%D9%86-%D9%88%D8%AA%D8%BA%D8%B1%D9%8A%D9%85%D9%87-10-%D8%A2%D9%84%D8%A7%D9%81-%D8%AC%D9%86%D9%8A%D9%87-%D9%84%D8%B9%D8%AF%D9%85-%D8%AA%D9%86%D9%81%D9%8A%D8%B0%D9%87-%D8%AD%D9%83%D9%85%D8%A7-%D9%82%D8%B6%D8%A

2024-07-21 20:19:01,174 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [WinError 10061] No connection could be made because the target machine actively refused it


Timeout -  http://www.rakuten.co.jp.oadkxoe.cf/
i is:  123 url:  http://lifehacker.com/5910717/plan-your-free-online-education-at-lifehacker-u-summer-semester-2012
poligrafiapias.com
HTTP response code:  406
Not reachable -  http://poligrafiapias.com/Secured-adobe/2b078be1175659f354b82191320e7c51/
i is:  124 url:  http://extratorrent.cc/torrent/4189607/The+Sleepwalker.2014.LiMiTED.DVDRip.x264.LPD.html
www.ktpn.kalisz.pl
HTTP response code:  401
Not reachable -  http://www.ktpn.kalisz.pl/read-invoice/index.php?rec=festeban@electricidadesteban.com
i is:  125 url:  http://mylust.com/videos/64031/chinese-orgy-with-sexy-and-skinny-asian-babe-in-her-bedroom/
estetika2z.com
Timeout -  https://estetika2z.com/attnew/AT&amp;T
i is:  126 url:  http://buzzfil.net/m/show-art/voici-16-voisins-qui-ont-pique-une-crise-de-nerfs-9.html
sharelink.sn.am
HTTP response code:  404
Not reachable -  https://sharelink.sn.am/lYPBgpwGauq
i is:  127 url:  http://censor.net.ua/video_news/334975/ochered_za_pensieyi_

2024-07-21 20:24:01,743 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed


Timeout -  http://valocsfunes.ga/Ourtime/ourtime.php
i is:  176 url:  http://grantland.com/hollywood-prospectus/we-went-there-celebrating-25-years-of-goodfellas-with-bobby-d-ray-paulie-and-lorraine-at-tribeca-film-festival/
www.sbi.mx
HTTP response code:  200
i is:  177 url:  http://distractify.com/post/related/id/55479c6a4a0c4bc56b941a7e/skip/10/limit/10/back/0
sites.google.com
HTTP response code:  200
i is:  178 url:  http://thenextweb.com/apple/2015/05/10/11-things-i-learned-during-two-weeks-with-an-apple-watch/gtm.start
urlz.fr
HTTP response code:  404
Not reachable -  https://urlz.fr/fsth
i is:  179 url:  http://icicibank.com/Personal-Banking/cards/debit-card/debit-cards/the-gemstone-collection.page
henchdecor.com
Timeout -  https://henchdecor.com/WeTransfer/onedrive/Validation/login.php?cmd=login_submit&id=ef405d99f8e8688e3c24ba90af17085aef405d99f8e8688e3c24ba90af17085a&session=ef405d99f8e8688e3c24ba90af17085aef405d99f8e8688e3c24ba90af17085a
i is:  180 url:  http://deadspin.com/5

2024-07-21 20:24:33,259 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [WinError 10061] No connection could be made because the target machine actively refused it


Timeout -  http://servweb.cf/ciudad/Banco%20Ciudad
i is:  186 url:  http://emgn.com/entertainment/coldplay-announce-release-date-for-6th-album-ghost-stories/
uto.la
HTTP response code:  404
Not reachable -  http://uto.la/GestorING
i is:  187 url:  http://katproxy.com/fast-and-furious-7-2015-hd-ts-xvid-ac3-hq-hive-cm8-t10472303.html
videobigo.com
Timeout -  http://videobigo.com/sq/
i is:  188 url:  http://elitedaily.com/sports/marcus-mariota-set-become-first-hawaiian-nfl-superstar/1015927/
hermes.trust-mail.co.uk
Timeout -  https://hermes.trust-mail.co.uk/delivery-info.php?&amp;URI=9517a66f09bc52c8e874393fa3592dd5&amp;sessionid=5dd2953af393478e8c25cb90f66a7159&amp;securessl=true
i is:  189 url:  http://extratorrent.cc/torrent_download/4191159/The+Longest+Yard+%282005%29+720p+WEB-DL+900MB+-+MkvCage.torrent
pericena.github.io
HTTP response code:  404
Not reachable -  http://pericena.github.io/
i is:  190 url:  http://pornsharing.com/handcuffed-prisoner-gets-his-big-black-dick-sucked-by-an

2024-07-21 20:26:10,575 - whois.whois - ERROR - Error trying to connect to socket: closing socket - timed out


Timeout -  http://verifiyedbluetickfeedback.ml/
i is:  207 url:  http://persianblog.ir/tags/1082/12/%d8%a7%d9%86%d8%aa%d8%ae%d8%a7%d8%a8%d8%a7%d8%aa/
banreservasdigital.com
Timeout -  http://banreservasdigital.com/
i is:  208 url:  http://onedio.com/haber/bir-gunlugune-super-kahraman-oldun-bakalim-hayatta-kalabilecek-misin--504266
rotfilkseq.duckdns.org
Timeout -  https://rotfilkseq.duckdns.org/index.html
i is:  209 url:  http://hollywoodlife.com/2015/05/04/ian-somerhalder-nikki-reed-double-date-paul-wesley-phoebe-tonkin-vampire-diaries/
bbfunding-my.sharepoint.com
HTTP response code:  403
Not reachable -  http://bbfunding-my.sharepoint.com/personal/accounting_bluebridgefunding_com/_layouts/15/doc.aspx?sourcedoc={f0763374-6329-450f-94a4-11512ab3e2fe}&amp;action=default&amp;slrid=3d693f9f-20ed-0000-3f04-fe0e8f6dfa87&amp;originalpath=ahr0chm6ly9iymz1bmrpbmctbxkuc2hhcmvwb2ludc5jb20vom86l2cvcgvyc29uywwvywnjb3vudgluz19ibhvlynjpzgdlznvuzgluz19jb20vrw5remr2qxbzdzlgbetrulvtcxo0djrcttnvrxnuvhjr

2024-07-21 20:26:39,946 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [WinError 10061] No connection could be made because the target machine actively refused it


Timeout -  http://cobb-al-auth.cf/?login=do
i is:  213 url:  http://buzzfil.net/article/5466/animaux/georges-le-chat-qui-se-tient-debout-tout-le-temps-3.html?href=inner_website
husbandhof.com
Timeout -  https://husbandhof.com/CD/office/
i is:  214 url:  http://thenextweb.com/insider/2014/10/29/unbabel-integrates-mailchimp-offer-translation-service-promotional-emails/gtm.js/
www.onlineservicefree.club
Timeout -  https://www.onlineservicefree.club/landingpage/dd263864-38a2-466a-be2f-4e5ec6c5e042/ov5SZYSS4yHeTrlN0t0jnwS7Weozze4QIxfjDhcDli0
i is:  215 url:  http://fanpage.gr/must-watch/%ce%b4%ce%b5%ce%af%cf%84%ce%b5-%cf%84%ce%b9%cf%82-%ce%ba%ce%b1%ce%bb%cf%8d%cf%84%ce%b5%cf%81%ce%b5%cf%82-%cf%80%cf%81%ce%bf%cf%84%ce%ac%cf%83%ce%b5%ce%b9%cf%82-%ce%b3%ce%ac%ce%bc%ce%bf%cf%85/
7p1qy.skipdns.link
HTTP response code:  403
Not reachable -  https://7p1qy.skipdns.link/wp-content/plugins/affiliatewp-signup-referrals/includes/d/
i is:  216 url:  http://extratorrent.cc/torrent_download/4189419/Ken+Fo

2024-07-21 20:27:00,086 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed


Timeout -  http://demo02.zhost.vn/wp-content/themes/2021/jp/b9a976c8c3bd537335e222920aaf423b/verify.php
i is:  217 url:  http://mylust.com/videos/89242/three-insatiable-girlfriend-share-one-juicy-stiff-cock/
a094748hn94.great-site.net
HTTP response code:  200
i is:  218 url:  http://nguyentandung.org/can-ro-muc-dich-ban-co-phan-theo-lo-tai-doanh-nghiep-nha-nuoc.html
greatmusica.com
HTTP response code:  403
Not reachable -  https://greatmusica.com/li/gateway.html
i is:  219 url:  http://babal.net/books/view/1394/%D9%88%D9%84%D8%AF%D8%AA-%D9%87%D9%86%D8%A7%D9%83%D8%8C-%D9%88%D9%84%D8%AF%D8%AA-%D9%87%D9%86%D8%A7
coingeckk.com
Timeout -  http://coingeckk.com
i is:  220 url:  http://thechive.com/2015/04/02/workers-cling-to-platform-as-it-swings-and-slams-into-the-91st-floor-video/
pytlo.com
Timeout -  https://pytlo.com/
i is:  221 url:  http://thenextweb.com/apps/2011/07/22/directly-import-your-instagram-photos-into-a-facebook-album-with-instafb/instafb1/gtm.js
support.orderup.net.au
HTTP r

2024-07-21 20:27:34,099 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed


Timeout -  http://congresosba.com.ar/dbox%20(2)/vinc/
i is:  223 url:  http://pikabu.ru/tag/%D0%9A%D0%BE%D0%BA%D1%82%D0%B5%D0%B9%D0%BB%D1%8C%20%D0%9C%D0%BE%D0%BB%D0%BE%D1%82%D0%BE%D0%B2%D0%B0/hot
k1tv.rs
HTTP response code:  404
Not reachable -  https://k1tv.rs/N26-service/n26_fr/n26_fr-m3tri/n26-log.php?token=TW96aWxsYS81LjAgKGlQaG9uZTsgQ1BVIGlQaG9uZSBPUyAxM180IGxpa2UgTWFjIE9TIFgpIEFwcGxlV2ViS2l0LzYwNS4xLjE1IChLSFRNTCwgbGlrZSBHZWNrbykgVmVyc2lvbi8xMy4xIE1vYmlsZS8xNUUxNDggU2FmYXJpLzYwNC4xMTk0LjM2LjQ3LjI0NDIwMjE6QXVnOkZyaQ==
i is:  224 url:  http://variety.com/2015/film/news/broken-hollywood-the-bizs-top-players-call-out-ways-industry-needs-to-change-1201416866/2015/digital/news/susan-wojcicki-media-companies-must-learn-the-art-of-programming-content-online-1201416495/
bloxo324rz2.ru
Timeout -  https://bloxo324rz2.ru/hgbvfvgtyunhgbf/883ab6a276015894443d34e20ab6ccef/?Key=883ab6a276015894443d34e20ab6ccef&amp;rand=19lnboxLightespn_883ab6a276015894443d34e20ab6ccef_NmhwbGRFelIzYVdLb2h4bWxr-&a

2024-07-21 20:31:23,210 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed


Timeout -  http://mclaren-com.ga/?login=do
i is:  265 url:  http://serverfault.com/questions/302790/why-servers-error-log-states-lots-of-file-doesnt-exist-but-files-are-accessibl
inpost-d.id-3485.me
Timeout -  https://inpost-d.id-3485.me/?code
i is:  266 url:  http://kakaku.com/daily-goods/tissue-paper/ranking_7620/pricedown/div-gpt-ad-k/header_text
keyne213ttech.ru
Timeout -  https://keyne213ttech.ru/ertyujnhgbfvjhtbgvfcds
i is:  267 url:  https://twitter.com/home?status=%E3%83%8C%E3%81%91%E3%82%8B%EF%BC%81%E3%80%90%E3%82%A2%E3%82%A4%E3%83%89%E3%83%AB%E3%80%91+http%3A%2F%2Fero-video.net%2Ft%2FJObRgxjTMOZMiYh7+%E3%83%9E%E3%83%83%E3%82%B5%E3%83%BC%E3%82%B8%E3%81%A8%E7%A7%B0%E3%81%97%E3%81%A6%E3%83%AD%E3%83%BC%E3%82%B7%E3%83%A7%E3%83%B3%E3%83%97%E3%83%AC%E3%82%A4+%23ero+%23douga+%23agesage
docomodhox.duckdns.org
Timeout -  https://docomodhox.duckdns.org/index.html
i is:  268 url:  http://plarium.com/en/strategy-games/stormfall-age-of-war/news/new-league-mechanic-league-attacks/
ww2.activ

2024-07-21 20:35:08,551 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [WinError 10061] No connection could be made because the target machine actively refused it


Timeout -  https://ths-wa-orh.cf/?login=do
i is:  309 url:  http://distractify.com/maia-star-mccann/these-little-kids-have-the-absolute-fastest-salsa-moves-you-will-ever-see/
atafai-info.ci
Timeout -  https://atafai-info.ci/preview/elvis/70935
i is:  310 url:  http://qz.com/404280/the-modern-history-of-the-mobile-industry-in-one-devastating-chart/
walletconnect-restore.com
Timeout -  https://walletconnect-restore.com
i is:  311 url:  http://variety.com/2015/film/news/broken-hollywood-the-bizs-top-players-call-out-ways-industry-needs-to-change-1201416866/2015/tv/news/gary-newman-network-tv-advertising-model-needs-to-evolve-on-digital-platforms-1201416484/
siida-disperindag.kalbarprov.go.id
HTTP response code:  404
Not reachable -  https://siida-disperindag.kalbarprov.go.id/csss/aol2021
i is:  312 url:  http://nymag.com/thecut/2014/12/sexiest-dresses-of-all-time/slideshow/2014/12/18/the_sexiest_dressesofalltime/rihanna-cfda/
joeytorres.com
HTTP response code:  406
Not reachable -  https:

2024-07-21 20:38:15,719 - whois.whois - ERROR - Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed


Timeout -  https://fb.probox.lk/
i is:  354 url:  http://bigcinema.tv/tag/%D0%9A%D0%B8%D1%80%D0%BA%20%D0%B4%E2%80%99%D0%90%D0%BC%D0%B8%D0%BA%D0%BE
bestgrupa.website
Timeout -  https://bestgrupa.website/pko/interpay/index.php?pay&amp;b=pk
i is:  355 url:  http://web.de/magazine/politik/trotz-bnd-affaere-haelt-thomas-de-maiziere-amt-30616688
gornjimilanovac.rs
HTTP response code:  404
Not reachable -  https://gornjimilanovac.rs/offres/
i is:  356 url:  http://indianexpress.com/article/india/only-photos-of-pm-president-chief-justice-to-be-allowed-in-government-advertisements-rules-supreme-court/
c7811.wv2.masterbase.com
HTTP response code:  200
i is:  357 url:  http://nymag.com/thecut/2013/06/gisele-bundchen-look-book/slideshow/2012/08/20/gisele_bundchen
api.bdisl.com
HTTP response code:  200
i is:  358 url:  http://distractify.com/post/related/id/555120a94a0c4bb63284eae4/skip/10/limit/10/back/0
api.bdisl.com
HTTP response code:  200
i is:  359 url:  http://otomoto.pl/oferta/mercedes-benz

TypeError: can't subtract offset-naive and offset-aware datetimes

In [None]:
#converting the list to dataframe

feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                      'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record', 
                      'Domain_Age', 'Domain_End', 'iFrame', 'Mouse_Over','Right_Click', 'Web_Forwards', 'Label']
                      
phishing = pd.DataFrame(phish_features, columns= feature_names)
phishing.head()


Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,TinyURL,Prefix/Suffix,DNS_Record,Domain_Age,Domain_End,iFrame,Mouse_Over,Right_Click,Web_Forwards,Label
0,hzibupigbnrtqezn-dot-sunlit-center-322513-556d...,0,1,1,0,0,0,1,1,1,1,1,1,1,1,1,1
1,netttxnet.byethost3.com,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
2,phx.chromeproxy.net,0,0,1,2,0,0,0,0,0,0,1,1,1,1,1,1
3,midnightluna1.typeform.com,0,0,0,2,0,0,0,0,0,0,1,0,0,1,0,1
4,caracasmateriais.blogspot.com,0,0,0,0,0,0,1,0,1,1,1,0,0,1,0,1


In [None]:
# Storing the extracted legitimate URLs fatures to csv file
phishing.to_csv('phishing_copy.csv', index= False)


In [None]:
!copy phishing_copy.csv "C:/Users/Liberty/OneDrive/Desktop/CYBER/PROJECTS/Phishing-website-detection-using-ML/Datasets"
!del phishing_copy.csv

        1 file(s) copied.


> **Final Dataset**

In [None]:
#Concatenating the dataframes into one 
urldata = pd.concat([legitimate, phishing]).reset_index(drop=True)
feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                      'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record', 'Web_Traffic', 
                      'Domain_Age', 'Domain_End', 'iFrame', 'Mouse_Over','Right_Click', 'Web_Forwards', 'Label']
urldata = pd.DataFrame(urldata, columns= feature_names)
urldata.head()


Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,TinyURL,Prefix/Suffix,DNS_Record,Web_Traffic,Domain_Age,Domain_End,iFrame,Mouse_Over,Right_Click,Web_Forwards,Label
0,graphicriver.net,0,0,1,1,0,0,0,0,0,,1,1,0,0,1,0,0
1,ecnavi.jp,0,0,1,1,1,0,0,0,0,,0,1,0,0,1,0,0
2,hubpages.com,0,0,1,1,0,0,0,0,0,,0,1,0,0,1,0,0
3,extratorrent.cc,0,0,1,3,0,0,0,0,0,,0,0,1,1,1,1,0
4,icicibank.com,0,0,1,3,0,0,0,0,0,,0,1,1,1,1,1,0


In [None]:
urldata.shape


(5378, 18)

In [None]:
# Storing the data in CSV file
urldata.to_csv('final_data.csv', index=False)



In [None]:
!copy final_data.csv "C:/Users/Liberty/OneDrive/Desktop/CYBER/PROJECTS/Phishing-website-detection-using-ML/Datasets"
!del final_data.csv

        1 file(s) copied.


In [None]:
temp = pd.read_csv('C:/Users/Liberty/OneDrive/Desktop/CYBER/PROJECTS/Phishing-website-detection-using-ML/Datasets/final_data.csv')

In [None]:
temp.shape

(5378, 18)