# Detecting Phishing URLs with Feature Extraction

In this notebook, we develop a set of features for detecting phishing URLs. We extract characteristics from each URL, such as its domain, its length, and its redirection patterns, along with the presence of specific HTML or JavaScript elements in the page it serves. These features are then used to classify URLs as either legitimate or malicious. The goal is a feature extraction process that can be reused for phishing detection.


In [None]:
import pandas as pd
import re
import tldextract
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from bs4 import BeautifulSoup
import whois
import urllib
import urllib.request
from datetime import datetime
import requests
from urllib.parse import urlparse,urlencode
import ipaddress

In [None]:
# Load Dataset
phishing_url = pd.read_csv("data/phishing_site_urls.csv")

In [4]:
phishing_url.sample(5)

Unnamed: 0,URL,Label
9135,pastehtml.com/view/beblf12rw.html,bad
128812,kf25zx.com/images/?http://us.battle.net/login/en/,bad
185198,eventful.com/montreal/events/greater-montreal-...,good
518826,mcyengineering.com/modules/webstat/mail_rechnu...,bad
33677,'www.cida-auto.com/images/?us.battle.net/login...,bad


We will randomly select 5,000 URLs from the `phishing_url` dataset, using `random_state=12` so that the sample is reproducible. The sample contains both "good" and "bad" URLs. We then reset the index so that rows are numbered sequentially and the old index is dropped.


In [5]:
# Randomly sampling 5,000 URLs (both labels)
phishing_url = phishing_url.sample(n = 5000, random_state = 12).copy()
phishing_url = phishing_url.reset_index(drop=True)
phishing_url.shape

(5000, 2)

## Domain

We will first check if the URL starts with "http://" or "https://". If it doesn’t, we will add "http://" to ensure proper parsing. Then, we will extract the domain using `urlparse(url).netloc`. If the domain starts with "www.", we will remove it to keep a clean format. 

We are doing this because phishing URLs often use deceptive domain structures to trick users. By extracting and standardizing the domain, we can analyze patterns, compare them with known phishing domains, and improve detection accuracy.


In [None]:
# 1. Domain of the URL (Domain)
def getDomain(url):
    # urlparse only fills in netloc when the URL has a scheme
    if not url.startswith(('http://', 'https://')):
        url = 'http://' + url
    domain = urlparse(url).netloc
    # Strip only a leading "www." (str.replace would remove every occurrence)
    if domain.startswith("www."):
        domain = domain[4:]
    return domain
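
As a complement to the steps above, here is a minimal sketch of a variant that reads `hostname` instead of `netloc`, so that a port, user information, and letter case do not end up in the feature. The name `extract_host` and the sample URLs are hypothetical and are not part of the notebook's pipeline.

```python
from urllib.parse import urlparse

def extract_host(url: str) -> str:
    """Return the lowercase host of a URL without port, userinfo, or a leading 'www.'."""
    # Same normalization as getDomain: urlparse needs a scheme to find the host
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    # hostname is None when no host can be found, so fall back to an empty string
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host

print(extract_host("www.example.com:8080/login"))
print(extract_host("https://user@WWW.Example.com/path"))
```

Both calls print `example.com`, because `hostname` drops the port and the user information and lowercases the result.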

## IP Address

We will check if the given URL is an IP address instead of a regular domain name. Using `ipaddress.ip_address(url)`, we attempt to parse the entire URL string as an IPv4 or IPv6 address. If this succeeds, we set `ip = 1`. Otherwise, a `ValueError` is raised and we set `ip = 0`. Because the whole string is passed in, any URL that carries a scheme, port, or path alongside the IP address does not parse and yields `0`.

We are doing this because phishing websites often use raw IP addresses instead of domain names to avoid detection. Legitimate websites typically use registered domain names, so detecting IP-based URLs can help identify suspicious links.


In [8]:
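# 2. Checks whether the URL is an IP address (Have_IP)
def havingIP(url):
  try:
    # Parses the whole string; raises ValueError unless it is a bare IPv4/IPv6 address
    ipaddress.ip_address(url)
    ip = 1
  except ValueError:
    ip = 0
  return ip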
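
Since `havingIP` only recognizes a string that is nothing but an IP address, the following sketch shows one way to test the host part alone. It assumes that adding `http://` to scheme-less URLs is acceptable; `host_is_ip` is a hypothetical helper, and the addresses come from the ranges reserved for documentation.

```python
import ipaddress
from urllib.parse import urlparse

def host_is_ip(url: str) -> int:
    """Return 1 if the host part of the URL is an IPv4 or IPv6 address, else 0."""
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    try:
        # hostname excludes the port and the brackets around IPv6 literals
        host = urlparse(url).hostname or ""
        ipaddress.ip_address(host)
        return 1
    except ValueError:
        # Raised for ordinary domain names and for malformed hosts
        return 0

print(host_is_ip("192.0.2.10/index.php"))
print(host_is_ip("example.com/192.0.2.10"))
print(host_is_ip("http://[2001:db8::1]:8080/"))
```

The first and third URLs give `1`. The second gives `0`, because the address appears only in the path.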

## Presence of @ in the URL

We will check if the URL contains the "@" symbol. If "@" is present, we set `at = 1`. Otherwise, we set `at = 0`.

We are doing this because browsers treat everything between the scheme and an "@" as user information and connect to the host that follows it. A phishing URL can therefore place a trusted-looking name before the "@" and the real destination after it, so detecting this character helps identify potentially harmful URLs.


In [9]:
# 3.Checks the presence of @ in URL (Have_At)
def haveAtSign(url):
  if "@" in url:
    at = 1    
  else:
    at = 0    
  return at
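
To make the effect of "@" concrete, the sketch below parses a made-up URL with the standard library. The host names are hypothetical placeholders.

```python
from urllib.parse import urlparse

# Hypothetical URL: a trusted-looking name before "@", the real host after it
url = "http://trusted-bank.example@phish.example/login"
parts = urlparse(url)

print(parts.username)  # the text before "@", treated as user information
print(parts.hostname)  # the host that would actually be contacted
```

Here `username` is `trusted-bank.example` and `hostname` is `phish.example`, which is why a reader who only glances at the start of the URL can be misled.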

## URL Length

We will check the length of the URL. If the URL is shorter than 54 characters, we set `length = 0`. If it is 54 characters or longer, we set `length = 1`.

We are doing this because phishing URLs tend to be longer and more complex in order to hide malicious content or redirect users to harmful sites. Shorter URLs are more likely to be legitimate, so categorizing the URL based on its length helps in distinguishing potentially phishing URLs from safe ones.

In [10]:
# 4.Finding the length of URL and categorizing (URL_Length)
def getLength(url):
  if len(url) < 54:
    length = 0            
  else:
    length = 1            
  return length

## URL Depth

We will calculate the depth of the URL as the number of non-empty segments in its path. First, we use `urlparse(url).path` to extract the path from the URL and split it on "/". Then, we count the segments that are not empty. For a URL without a scheme, `urlparse` places the whole string up to any "?" or "#" in `path`, so the host name is counted as one segment.

We are doing this because phishing URLs often have deep structures with multiple subdirectories or misleading paths to confuse users. By analyzing the depth of the URL, we can identify suspiciously complex or unusual URLs that may indicate phishing attempts.


In [11]:
# 5. Counts the non-empty path segments of the URL (URL_Depth)
def getDepth(url):
  segments = urlparse(url).path.split('/')
  # Empty strings come from leading, trailing, or doubled slashes
  return sum(1 for segment in segments if segment)
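
The short sketch below illustrates how the presence of a scheme changes the depth. `path_depth` is a hypothetical stand-in that repeats the counting logic so the example runs on its own.

```python
from urllib.parse import urlparse

def path_depth(url: str) -> int:
    """Count the non-empty segments of the parsed path."""
    return sum(1 for segment in urlparse(url).path.split("/") if segment)

# Without a scheme the host is part of the path and adds one to the depth
print(path_depth("example.com/a/b/"))         # 3
print(path_depth("http://example.com/a/b/"))  # 2
```

Since the URLs in the sample above have no scheme, the depth values in this notebook include the host as one segment.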

## Redirection

We will check for a "//" that appears after the initial protocol separator. In "http://" the "//" starts at index 5, and in "https://" it starts at index 6. We use `rfind('//')` to find the last occurrence of "//" in the URL. If its index is greater than 7, it cannot belong to the protocol, so we return `1`. Otherwise, we return `0`.

We are doing this because a "//" later in the URL usually means that a second URL is embedded in the path or query, as in `kf25zx.com/images/?http://us.battle.net/login/en/` from the sample above. Phishing URLs use this either to redirect users or to display a trusted address inside an untrusted one.


In [12]:
# 6. Checking for redirection '//' in the URL (Redirection)
def redirection(url):
  # Index of the last '//': 5 for "http://", 6 for "https://", -1 if absent
  pos = url.rfind('//')
  return 1 if pos > 7 else 0
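
To see where the threshold comes from, the sketch below prints the index returned by `rfind('//')` for a few made-up URLs. `has_embedded_double_slash` is a hypothetical name for the same check.

```python
def has_embedded_double_slash(url: str) -> int:
    """Return 1 if the last '//' lies beyond the protocol separator."""
    return 1 if url.rfind("//") > 7 else 0

samples = [
    "http://example.com/a",                          # '//' only in the protocol
    "https://example.com/a",                         # '//' only in the protocol
    "https://example.com/r?u=http://other.example",  # second URL in the query
    "example.com/?http://other.example",             # no scheme, embedded URL
]
for u in samples:
    print(u.rfind("//"), has_embedded_double_slash(u), u)
```

The first two URLs report indices 5 and 6 and are not flagged. The last two contain a later "//" and are flagged.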

## HTTPS Token

We will check if the "https" token exists in the domain part of the URL using `urlparse(url).netloc`. If the domain contains "https", we return `1`. Otherwise, we return `0`. This check looks at the host name, not the scheme, so it does not tell us whether the URL is served over HTTPS.

We are doing this because phishers sometimes add "https" to a domain name so that the address looks secure at a glance. A legitimate site has little reason to put the protocol name in its domain, so the token is a sign of a deceptive URL.


In [13]:
# 7.Existence of “HTTPS” Token in the Domain Part of the URL (https_Domain)
def httpDomain(url):
  domain = urlparse(url).netloc
  if 'https' in domain:
    return 1
  else:
    return 0
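
One caveat is that `urlparse` returns an empty `netloc` when the URL has no scheme, so a check on `netloc` sees nothing for scheme-less URLs. The sketch below shows this and one possible adjustment, assuming it is acceptable to prepend `http://` first. `https_token_in_host` and the sample host are hypothetical.

```python
from urllib.parse import urlparse

def https_token_in_host(url: str) -> int:
    """Return 1 if 'https' appears in the host part, adding a scheme when missing."""
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    return 1 if "https" in urlparse(url).netloc.lower() else 0

# Without a scheme, the host lands in .path and .netloc stays empty
print(repr(urlparse("https-secure.example/login").netloc))
print(https_token_in_host("https-secure.example/login"))
print(https_token_in_host("https://example.com/"))
```

The first line prints an empty string. With the scheme added, the made-up host is flagged, while a plain HTTPS URL is not.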

## Shortening Services

We will create a regular expression pattern that matches common URL shortening services. This pattern includes popular shortening domains like "bit.ly", "goo.gl", "tinyurl.com", "is.gd", and others. 

We are doing this because phishing URLs often use URL shortening services to disguise malicious links. By identifying and flagging URLs that match these shortening services, we can detect potentially harmful or deceptive URLs that are trying to hide their true destination.


In [14]:
#listing shortening services
shortening_services = r"bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|" \
                      r"yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|" \
                      r"short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|" \
                      r"doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|db\.tt|" \
                      r"qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|q\.gs|is\.gd|" \
                      r"po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|x\.co|" \
                      r"prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|" \
                      r"tr\.im|link\.zip\.net"

We will check if the URL matches any of the shortening services listed in the `shortening_services` regular expression pattern using `re.search()`. If a match is found, we return `1`, indicating that the URL is a shortened one. Otherwise, we return `0`. Because the search is not anchored, the pattern can match anywhere in the URL, including inside a longer host name or the path.


In [15]:
# 8. Checking for Shortening Services in URL (Tiny_URL)
def tinyURL(url):
    match=re.search(shortening_services,url)
    if match:
        return 1
    else:
        return 0
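
An unanchored search can flag URLs that merely contain a shortener's name. The sketch below contrasts it with an exact comparison of the host, using a small subset of the services from the pattern above. `SHORTENERS` and `is_shortener` are hypothetical names, and this is only one possible way to tighten the check.

```python
import re
from urllib.parse import urlparse

# Small subset of the services listed in the pattern above
SHORTENERS = {"bit.ly", "goo.gl", "tinyurl.com", "is.gd", "t.co"}

def is_shortener(url: str) -> int:
    """Return 1 only if the host is exactly one of the known shortening domains."""
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return 1 if host in SHORTENERS else 0

# "t\.co" also matches inside "microsoft.com"
print(bool(re.search(r"t\.co", "microsoft.com/en-us")))
print(is_shortener("microsoft.com/en-us"))
print(is_shortener("bit.ly/abc123"))
```

The substring search reports a match for `microsoft.com`, while the exact host comparison flags only the `bit.ly` link.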

## Prefix - Suffix

We will check if the domain part of the URL (extracted using `urlparse(url).netloc`) contains a hyphen ("-"). If it does, we return `1`, flagging the URL as a possible phishing attempt. Otherwise, we return `0`.

We are doing this because phishing URLs often use hyphens in the domain to mimic legitimate websites by adding extra parts or creating look-alike domains. Identifying such patterns helps in detecting suspicious URLs that may be trying to deceive users by appearing similar to trusted sites.


In [16]:
# 9.Checking for Prefix or Suffix Separated by (-) in the Domain (Prefix/Suffix)
def prefixSuffix(url):
    if '-' in urlparse(url).netloc:
        return 1            # phishing
    else:
        return 0            # legitimate

## IFrame Redirection

We will check if the response contains an `<iframe>` tag or a `frameBorder` attribute using a regular expression search (`re.findall()`). If either is found, we return `0`. If neither is found, or if no response was received, we return `1`.

We are doing this because phishing websites often use `<iframe>` tags to load hidden or deceptive content from other sites. This can be a method of redirecting users to malicious pages without their knowledge. By detecting the presence of iframes, we can identify pages that might be attempting to silently redirect users.


In [None]:
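# IFrame Redirection (iFrame)
def iframe(response):
  if response == "":
      return 1
  else:
      # Alternation rather than a character class: match an <iframe> tag
      # or a frameBorder attribute, in any letter case
      if re.findall(r"<iframe|frameborder", response.text, re.IGNORECASE):
          return 0
      else:
          return 1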
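
When writing a pattern for this check, it matters whether the alternatives sit inside square brackets. A bracketed pattern is a character class that matches any single listed character, so it matches almost every HTML page. The sketch below compares the two forms on made-up HTML strings.

```python
import re

# Character class: matches ONE character out of < i f r a m e > | B o d
char_class = re.compile(r"[<iframe>|<frameBorder>]")
# Alternation: matches the literal text "<iframe" or "frameborder"
alternation = re.compile(r"<iframe|frameborder", re.IGNORECASE)

plain_html = "<html><body><p>Hello</p></body></html>"
framed_html = '<html><body><iframe src="x" frameborder="0"></iframe></body></html>'

print(bool(char_class.search(plain_html)))    # True, although no iframe is present
print(bool(alternation.search(plain_html)))   # False
print(bool(alternation.search(framed_html)))  # True
```

Only the alternation separates the page with an iframe from the page without one.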

## Mouse Over

We will check if the response contains a script that reacts to mouseover events by searching for `<script>.+onmouseover.+</script>` using a regular expression (`re.findall()`). If such a script is found, or if no response was received, we return `1`, indicating potential suspicious activity. Otherwise, we return `0`.

We are doing this because phishing websites often use mouseover events to execute hidden actions, such as redirecting users or displaying deceptive information in the status bar. By detecting these scripts, we can flag potentially malicious sites that use this technique to trick users into interacting with harmful content.


In [None]:
# Checks the effect of mouse over on status bar (Mouse_Over)
def mouseOver(response): 
  if response == "" :
    return 1
  else:
    if re.findall("<script>.+onmouseover.+</script>", response.text):
      return 1
    else:
      return 0
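
One detail of this pattern is that `.` does not match a newline by default, so a script that spans several lines is not found. The sketch below shows the difference that `re.DOTALL` makes on a made-up snippet. Whether to enable the flag in the feature itself is a design choice that this notebook does not make.

```python
import re

# Made-up page fragment with a script that spans several lines
html = "<script>\nlink.onmouseover = function () { window.status = 'safe'; };\n</script>"

pattern = r"<script>.+onmouseover.+</script>"
single_line = re.findall(pattern, html)           # '.' stops at newlines
multi_line = re.findall(pattern, html, re.DOTALL)  # '.' also matches newlines

print(len(single_line), len(multi_line))
```

The default search finds nothing, while the `re.DOTALL` search finds the script.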

## Right Click

We will check if the response contains JavaScript code that disables the right-click functionality by searching for `event.button == 2`, which corresponds to a right-click event. If such code is found, we return `0`, indicating that right-click has been disabled. If it is not found, or if no response was received, we return `1`.

We are doing this because phishing websites often disable right-click to prevent users from copying the URL or viewing the page source, which can help them hide malicious activities. By detecting this behavior, we can identify suspicious websites that may be trying to prevent users from analyzing their content.


In [None]:
# Checks the status of the right click attribute (Right_Click)
def rightClick(response):
  if response == "":
    return 1
  else:
    if re.findall(r"event\.button ?== ?2", response.text):
      return 0
    else:
      return 1

## Web Forwarding

We will check the number of redirects or forwardings in the response by examining `response.history`. If the URL has been redirected more than twice (i.e., the length of `response.history` is greater than 2), or if no response was received, we return `1`, indicating potential phishing activity. If there are two or fewer redirects, we return `0`.

We are doing this because phishing websites often use multiple redirects to obscure the final destination or to trick users into visiting a harmful site. By identifying excessive forwarding, we can flag suspicious URLs that may be using this technique to hide their true intent.


In [None]:
# Checks the number of forwardings (Web_Forwards)    
def forwarding(response):
  if response == "":
    return 1
  else:
    if len(response.history) <= 2:
      return 0
    else:
      return 1

## Feature Extraction

We will define a function `featureExtractions()` that extracts various features from a given URL and its corresponding label. The function performs the following steps:

1. **Address Bar Based Features**:
   - It extracts the domain from the URL.
   - Checks if the URL contains an IP address.
   - Verifies if the URL has an "@" symbol.
   - Analyzes the length of the URL.
   - Determines the depth of the URL.
   - Detects if there are any redirections in the URL.
   - Checks if the domain contains "https".
   - Identifies if the domain has hyphens, which may indicate suspicious URLs.
   - Checks if the URL is a shortened URL.

2. **HTML & JavaScript Based Features**:
   - Sends a request to the URL and checks the response for iframe elements, mouseover scripts, right-click blocking, and forwarding behavior. If the request fails, the response is set to an empty string and all four features default to `1`.

Finally, the function returns all the collected features, including the label, as a list.

We are doing this to gather multiple characteristics of the URL, which can help in distinguishing phishing URLs from legitimate ones by analyzing various behaviors and attributes associated with the URL.


In [44]:
def featureExtractions(url, label):

  features = []
  #Address bar based features (9)
  features.append(getDomain(url))
  print(getDomain(url))
  features.append(havingIP(url))
  features.append(haveAtSign(url))
  features.append(getLength(url))
  features.append(getDepth(url))
  features.append(redirection(url))
  features.append(httpDomain(url))
  features.append(prefixSuffix(url))
  features.append(tinyURL(url))
  
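  # HTML & Javascript based features (4)
  try:
    response = requests.get(url, timeout=30)
  except Exception:
    # Any request failure, including a URL with no scheme, leaves the
    # four page-based features at their default value of 1
    response = ""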
  features.append(iframe(response))
  features.append(mouseOver(response))
  features.append(rightClick(response))
  features.append(forwarding(response))
  features.append(label)
  
  return features
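
`requests.get` raises `MissingSchema` for a URL that has no scheme, and the URLs in the sample above have none, so a request can only be attempted after a scheme is added. The sketch below shows one possible normalization step, assuming `http://` is an acceptable default. `ensure_scheme` is a hypothetical helper and is not called anywhere in this notebook.

```python
def ensure_scheme(url: str, default: str = "http") -> str:
    """Return the URL with a scheme, after removing stray quotes and whitespace."""
    # Some rows in the dataset start with a stray quote character
    url = url.strip().strip("'\"")
    if url.startswith(("http://", "https://")):
        return url
    return f"{default}://{url}"

print(ensure_scheme("pastehtml.com/view/beblf12rw.html"))
print(ensure_scheme("'kandeepan.com"))
print(ensure_scheme("https://example.com/"))
```

A value prepared this way could be passed to `requests.get`, although many phishing pages are short-lived and the request may still fail.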

In [55]:
# Extracting the features & storing them in a list
# (despite its name, benign_features holds rows for both 'good' and 'bad' URLs)
benign_features = []

for i in range(len(phishing_url)):
  url = phishing_url['URL'][i]
  label = phishing_url['Label'][i]
  benign_features.append(featureExtractions(url, label))

imdb.com
tools.ietf.org
hotel-sirius.com
speedstream.tv
7den70enano.com
pececitos.com
sen.parl.gc.ca
openpublicservices.cabinetoffice.gov.uk
lubbockonline.com
photos.lucywho.com
emp3world.com
ca.finance.yahoo.com
youtube.com
wired.com
chocogaterie.eu
gearsofwar.wikia.com
quadraphonicquad.com
wilx.com
'kandeepan.com
joe.cpbph2.com
0apogee.elephantfish.co.uk
tomahawknation.com
latimesblogs.latimes.com
103.230.226.59
mantlesbydesign.ca
singer-songwriter.com
reformation.org
sanjose.com
musicpopstars.com
allaboutjazz.com
en.wikipedia.org
skrmc.com.au
jjdstorage.com
g200.qdesign.vn
ginfovalidationrequest.com
scuolainfanziazucchi.it
meetup.com
reunion.com
southflorida.blockshopper.com
astoria.womf.com
fr.linkedin.com
usmilitary.about.com
woodfieldproperties.com
stevengraphs.com
out.ipsyc.com.ar
ivs.alislampedia.com
people.forbes.com
misaho.com.ar
calbears.com
amazon.com
fkaadbsykmsrbt.click
uspolitics.einnews.com
mylife.com
ledyury.info
en.wikipedia.org
mylife.com
lariver.org
midwest2011.org


In [31]:
label_counts = pd.DataFrame(phishing_url.Label.value_counts())
label_counts

Label,count
good,3543
bad,1457


In [35]:
benign_features

[['', 0, 0, 0, 3, 0, 0, 0, 0, 1, 1, 1, 1, 'good'],
 ['', 0, 0, 0, 3, 0, 0, 0, 0, 1, 1, 1, 1, 'good'],
 ['', 0, 1, 1, 8, 0, 0, 0, 0, 1, 1, 1, 1, 'bad'],
 ['', 0, 0, 0, 4, 0, 0, 0, 0, 1, 1, 1, 1, 'bad'],
 ['', 0, 0, 1, 5, 0, 0, 0, 0, 1, 1, 1, 1, 'bad'],
 ['', 0, 0, 0, 2, 0, 0, 0, 0, 1, 1, 1, 1, 'bad'],
 ['', 0, 0, 0, 4, 0, 0, 0, 0, 1, 1, 1, 1, 'good'],
 ['', 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 'good'],
 ['', 0, 0, 1, 5, 0, 0, 0, 0, 1, 1, 1, 1, 'good'],
 ['', 0, 0, 0, 2, 0, 0, 0, 0, 1, 1, 1, 1, 'good'],
 ['', 0, 0, 0, 5, 0, 0, 0, 0, 1, 1, 1, 1, 'good'],
 ['', 0, 0, 0, 2, 0, 0, 0, 0, 1, 1, 1, 1, 'good'],
 ['', 0, 0, 0, 2, 0, 0, 0, 0, 1, 1, 1, 1, 'good'],
 ['', 0, 0, 0, 5, 0, 0, 0, 0, 1, 1, 1, 1, 'good'],
 ['', 0, 0, 0, 2, 0, 0, 0, 0, 1, 1, 1, 1, 'bad'],
 ['', 0, 0, 0, 3, 0, 0, 0, 0, 1, 1, 1, 1, 'good'],
 ['', 0, 0, 1, 3, 0, 0, 0, 0, 1, 1, 1, 1, 'good'],
 ['', 0, 0, 0, 4, 0, 0, 0, 1, 1, 1, 1, 1, 'good'],
 ['', 0, 0, 1, 3, 0, 0, 0, 0, 1, 1, 1, 1, 'bad'],
 ['', 0, 0, 0, 2, 0, 0, 0, 0, 1, 1, 1

In [48]:
#converting the list to dataframe
feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                'https_Domain', 'Prefix/Suffix', 'TinyURL',
                 'iFrame', 'Mouse_Over','Right_Click', 'Web_Forwards', 'Label']

legitimate = pd.DataFrame(benign_features, columns= feature_names)
legitimate.head()

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,Prefix/Suffix,TinyURL,iFrame,Mouse_Over,Right_Click,Web_Forwards,Label
0,imdb.com,0,0,0,3,0,0,0,0,1,1,1,1,good
1,tools.ietf.org,0,0,0,3,0,0,0,0,1,1,1,1,good
2,hotel-sirius.com,0,1,1,8,0,0,0,0,1,1,1,1,bad
3,speedstream.tv,0,0,0,4,0,0,0,0,1,1,1,1,bad
4,7den70enano.com,0,0,1,5,0,0,0,0,1,1,1,1,bad


In [49]:
legitimate['Label'] = legitimate['Label'].apply(lambda x: 0 if x == 'good' else 1)

In [50]:
legitimate.head()

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,Prefix/Suffix,TinyURL,iFrame,Mouse_Over,Right_Click,Web_Forwards,Label
0,imdb.com,0,0,0,3,0,0,0,0,1,1,1,1,0
1,tools.ietf.org,0,0,0,3,0,0,0,0,1,1,1,1,0
2,hotel-sirius.com,0,1,1,8,0,0,0,0,1,1,1,1,1
3,speedstream.tv,0,0,0,4,0,0,0,0,1,1,1,1,1
4,7den70enano.com,0,0,1,5,0,0,0,0,1,1,1,1,1


In [54]:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

y = legitimate['Label']

encoder = LabelEncoder()
legitimate['domain_encoded'] = encoder.fit_transform(legitimate['Domain']) 
X = legitimate.drop(columns=['Label', 'Domain'])  

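# Training the model on an 80/20 train/test split
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr = LogisticRegression()
lr.fit(X_train,y_train)
# Note: classification_report expects (y_true, y_pred). The predictions are passed
# first here, so in the report below precision and recall are swapped and the
# support column counts predictions rather than true labels.
report = classification_report(lr.predict(X_test), y_test,
                            target_names =["Benign", "Phishing"])
print(report)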

              precision    recall  f1-score   support

      Benign       1.00      0.68      0.81       991
    Phishing       0.02      0.89      0.05         9

    accuracy                           0.68      1000
   macro avg       0.51      0.78      0.43      1000
weighted avg       0.99      0.68      0.80      1000



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
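
When reading the report above, keep in mind that `classification_report` takes the true labels first and the predictions second, while the call in this notebook passes `lr.predict(X_test)` first. The pure-Python sketch below uses made-up labels to show how the argument order affects precision and recall. `precision_recall` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall of the positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: four phishing URLs, only one of them detected
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0]

print(precision_recall(y_true, y_pred))  # (1.0, 0.25)
print(precision_recall(y_pred, y_true))  # (0.25, 1.0)
```

Swapping the arguments exchanges the two metrics. Read with that in mind, the report above suggests that the model labels very few URLs as phishing: only 9 of the 1,000 test rows fall in the "Phishing" row, which counts predictions.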
