# **Phishing Domain Detection (Data Collection & Feature Extraction)**

### **Objective: Collect data and extract the features needed to train machine learning models**

# **1.0] Data Collection**

In [1]:
import pandas as pd

In [2]:
f_path = "E:\\University\\Year 3\\Methods for detecting cyber attacks\\Project\\datasets\\FinalDataset.csv"
df = pd.read_csv(f_path,names=["url","label"], header=None)

df.head(10)

Unnamed: 0,url,label
0,https://www.google.com,0
1,https://www.youtube.com,0
2,https://www.facebook.com,0
3,https://www.baidu.com,0
4,https://www.wikipedia.org,0
5,https://www.reddit.com,0
6,https://www.yahoo.com,0
7,https://www.google.co.in,0
8,https://www.qq.com,0
9,https://www.amazon.com,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 326750 entries, 0 to 326749
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   url     326750 non-null  object
 1   label   326750 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 5.0+ MB


In [4]:
df.shape

(326750, 2)

In [5]:
# Number of legitimate (label 0) and phishing (label 1) URLs
df["label"].value_counts()

1    164649
0    162101
Name: label, dtype: int64

# **2.0] Feature Extraction**

In this step, features are extracted from the URL dataset. In total, 19 features are extracted for each URL, which together with `url` and `label` gives the 21 columns of the final dataset.

The extracted features are categorized into:
1. Length-based features
2. Count-based features
3. Binary features


## **2.1] Length Features**

The following length features are considered for classification.
- Length of URL
- Length of hostname
- Length of path
- Length of first directory
- Length of top-level domain (listed here, but not computed in the cells below)

In [6]:
#Importing dependencies
from urllib.parse import urlparse
import os.path
import ipaddress

# Alias the DataFrame under a new name (this is not a copy, so new columns are also added to df)
urldata = df

In [7]:
#Length of URL (Phishers can use long URL to hide the doubtful part in the address bar)
urldata['url_length'] = urldata['url'].apply(lambda i: len(str(i)))

#Hostname Length
urldata['hostname_length'] = urldata['url'].apply(lambda i: len(urlparse(i).netloc))

#Path Length
urldata['path_length'] = urldata['url'].apply(lambda i: len(urlparse(i).path))

In [8]:
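# First directory length: length of the first path segment after the hostname
def fd_length(url):
    urlpath = urlparse(url).path
    try:
        return len(urlpath.split('/')[1])
    except IndexError:
        # The path is empty, so there is no first directory
        return 0

urldata['fd_length'] = urldata['url'].apply(fd_length)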

In [9]:
# printing first few rows
urldata.head(10)

Unnamed: 0,url,label,url_length,hostname_length,path_length,fd_length
0,https://www.google.com,0,22,14,0,0
1,https://www.youtube.com,0,23,15,0,0
2,https://www.facebook.com,0,24,16,0,0
3,https://www.baidu.com,0,21,13,0,0
4,https://www.wikipedia.org,0,25,17,0,0
5,https://www.reddit.com,0,22,14,0,0
6,https://www.yahoo.com,0,21,13,0,0
7,https://www.google.co.in,0,24,16,0,0
8,https://www.qq.com,0,18,10,0,0
9,https://www.amazon.com,0,22,14,0,0
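
The list at the start of this section mentions the length of the top-level domain, which none of the cells above compute. The following is a minimal sketch of how it could be derived, assuming the top-level domain can be approximated by the last dot-separated label of the hostname. The helper name `tld_length` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlparse

def tld_length(url):
    """Length of the last dot-separated label of the hostname (a naive TLD)."""
    # .hostname is None when the URL has no scheme/netloc part
    host = urlparse(url).hostname or ""
    if "." not in host:
        return 0
    return len(host.rsplit(".", 1)[1])

print(tld_length("https://www.google.com"))    # 3 ("com")
print(tld_length("https://www.google.co.in"))  # 2 ("in")
```

This approximation treats `co.in` as the TLD `in` only, and for a host that is an IPv4 address it returns the length of the last octet, so a public-suffix-aware library would be needed for exact results.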


## **2.2] Count Features**

The following features will be extracted from the URL for classification.
- Count of '@'
- Count of '?'
- Count of '%'
- Count of '.'
- Count of '='
- Count of 'http'
- Count of 'https'
- Count of 'www'
- Count of digits
- Count of letters
- Count of directories ('/' in the path)
- Redirection ('//' after the scheme)

In [10]:
# Count how many times each special character or token appears in the URL

urldata['count@'] = urldata['url'].apply(lambda i: i.count('@'))

urldata['count?'] = urldata['url'].apply(lambda i: i.count('?'))

urldata['count%'] = urldata['url'].apply(lambda i: i.count('%'))

urldata['count.'] = urldata['url'].apply(lambda i: i.count('.'))

urldata['count='] = urldata['url'].apply(lambda i: i.count('='))

# Note: every 'https' also contains 'http', so count-http includes the https occurrences
urldata['count-http'] = urldata['url'].apply(lambda i : i.count('http'))

urldata['count-https'] = urldata['url'].apply(lambda i : i.count('https'))

urldata['count-www'] = urldata['url'].apply(lambda i: i.count('www'))


In [11]:
def digit_count(url):
    digits = 0
    for i in url:
        if i.isnumeric():
            digits = digits + 1
    return digits
urldata['count-digits']= urldata['url'].apply(lambda i: digit_count(i))

In [12]:
def letter_count(url):
    letters = 0
    for i in url:
        if i.isalpha():
            letters = letters + 1
    return letters
urldata['count-letters']= urldata['url'].apply(lambda i: letter_count(i))
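
The two character-counting loops above follow the same pattern: walk over the URL and count the characters that satisfy a test. As a small sketch of that idea, the helper below (the name `count_matching` is hypothetical) takes the test as a parameter, using the same `str.isnumeric` and `str.isalpha` checks as the cells above.

```python
def count_matching(url, predicate):
    """Count the characters of url for which predicate(char) is true."""
    return sum(1 for ch in url if predicate(ch))

url = "http://example123.com/a1"
digits = count_matching(url, str.isnumeric)   # 1, 2, 3, 1
letters = count_matching(url, str.isalpha)    # http + example + com + a

print(digits)   # 4
print(letters)  # 15
```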

In [13]:
def no_of_dir(url):
    urldir = urlparse(url).path
    return urldir.count('/')
urldata['count_dir'] = urldata['url'].apply(lambda i: no_of_dir(i))

In [14]:
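def redirection(url):
    # Position of the last '//' in the URL. In 'http://' it sits at index 5 and in
    # 'https://' at index 6, so a position above 7 means that a second '//'
    # appears later in the URL.
    pos = url.rfind('//')
    return 1 if pos > 7 else 0

urldata['count_redirection'] = urldata['url'].apply(redirection)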
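
To make the threshold of 7 in the redirection check concrete, the short sketch below prints where the last `//` falls for a few URLs. The example URLs and the helper name `has_embedded_redirect` are illustrative assumptions, not taken from the dataset.

```python
def has_embedded_redirect(url):
    """Return 1 if a '//' occurs after the scheme separator, else 0."""
    return 1 if url.rfind("//") > 7 else 0

# The scheme separator itself is the last '//' in an ordinary URL
print("http://example.com".rfind("//"))    # 5
print("https://example.com".rfind("//"))   # 6

# A second '//' further along pushes the position past the threshold
suspicious = "https://example.com//evil.test"
print(suspicious.rfind("//"))              # 19
print(has_embedded_redirect(suspicious))   # 1
```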

In [15]:
# printing first few rows
urldata.head(10)

Unnamed: 0,url,label,url_length,hostname_length,path_length,fd_length,count@,count?,count%,count.,count=,count-http,count-https,count-www,count-digits,count-letters,count_dir,count_redirection
0,https://www.google.com,0,22,14,0,0,0,0,0,2,0,1,1,1,0,17,0,0
1,https://www.youtube.com,0,23,15,0,0,0,0,0,2,0,1,1,1,0,18,0,0
2,https://www.facebook.com,0,24,16,0,0,0,0,0,2,0,1,1,1,0,19,0,0
3,https://www.baidu.com,0,21,13,0,0,0,0,0,2,0,1,1,1,0,16,0,0
4,https://www.wikipedia.org,0,25,17,0,0,0,0,0,2,0,1,1,1,0,20,0,0
5,https://www.reddit.com,0,22,14,0,0,0,0,0,2,0,1,1,1,0,17,0,0
6,https://www.yahoo.com,0,21,13,0,0,0,0,0,2,0,1,1,1,0,16,0,0
7,https://www.google.co.in,0,24,16,0,0,0,0,0,3,0,1,1,1,0,18,0,0
8,https://www.qq.com,0,18,10,0,0,0,0,0,2,0,1,1,1,0,13,0,0
9,https://www.amazon.com,0,22,14,0,0,0,0,0,2,0,1,1,1,0,17,0,0


## **2.3] Binary Features**

The following features will be extracted from the URL for classification.
- Use of IP or not
- Use of Shortening URL or not
- Use of '-' in the domain or not

#### **2.3.1] IP Address in the URL**

Checks for the presence of an IP address in the URL. A URL may contain an IP address instead of a domain name. Legitimate sites are rarely addressed this way, so an IP address in place of the domain name is a strong indicator, though not proof, that the URL is being used to steal personal information.

In [16]:
import re

# Use of an IP address in the URL: returns -1 if one is found, 1 otherwise
def having_ip_address(url):
    # The three alternatives are separated by '|' so that any one of them can match
    match = re.search(
        r'(([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.'
        r'([01]?\d\d?|2[0-4]\d|25[0-5])\/)|'  # IPv4
        r'((0x[0-9a-fA-F]{1,2})\.(0x[0-9a-fA-F]{1,2})\.(0x[0-9a-fA-F]{1,2})\.(0x[0-9a-fA-F]{1,2})\/)|'  # IPv4 in hexadecimal
        r'(?:[a-fA-F0-9]{1,4}:){7}[a-fA-F0-9]{1,4}', url)  # IPv6
    if match:
        return -1
    else:
        return 1
urldata['use_of_ip'] = urldata['url'].apply(having_ip_address)
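
As an alternative to a regular expression, the check can be sketched with the standard-library `ipaddress` module, which is imported earlier in this notebook but not used. This is only a sketch under the assumption that the IP address appears as the hostname of the URL. The helper name `host_is_ip` is hypothetical, and unlike the regular expression above it does not recognise the hexadecimal `0x..` notation.

```python
import ipaddress
from urllib.parse import urlparse

def host_is_ip(url):
    """Return True if the hostname of url parses as an IPv4 or IPv6 address."""
    try:
        host = urlparse(url).hostname
        if host is None:
            return False
        ipaddress.ip_address(host)
        return True
    except ValueError:
        # Raised for hostnames that are not IP addresses (and for malformed URLs)
        return False

print(host_is_ip("http://192.168.0.1/login"))   # True
print(host_is_ip("http://[2001:db8::1]/login")) # True
print(host_is_ip("https://www.google.com"))     # False
```

The function returns a boolean; mapping it to the -1/1 encoding used by `having_ip_address` would be a separate step.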

#### **2.3.2] Using URL Shortening Services “TinyURL”**

URL shortening is a method on the “World Wide Web” in which a URL may be made considerably smaller in length and still lead to the required webpage. This is accomplished by means of an “HTTP Redirect” on a domain name that is short, which links to the webpage that has a long URL.

In [17]:
# Use of a URL shortening service: returns -1 if a known shortener pattern is found, 1 otherwise.
# The pattern is searched anywhere in the URL, so it can also match inside a longer
# hostname (for example, 't\.co' matches within 'reddit.com').
def shortening_service(url):
    match = re.search(r'bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
                      r'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
                      r'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
                      r'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
                      r'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
                      r'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
                      r'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|'
                      r'tr\.im|link\.zip\.net',
                      url)
    if match:
        return -1
    else:
        return 1
urldata['short_url'] = urldata['url'].apply(shortening_service)
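
Because the pattern above is searched anywhere in the URL, an ordinary domain such as `reddit.com` is flagged (it contains `t.co`). One possible way to avoid this, sketched here under the assumption that only the hostname should be compared, is an exact lookup against a set of known shortener domains. The names `SHORTENER_HOSTS` and `is_shortener` are hypothetical, and the set holds only a small subset of the services listed above.

```python
from urllib.parse import urlparse

# Illustrative subset of the shortening services listed above
SHORTENER_HOSTS = {"bit.ly", "goo.gl", "t.co", "tinyurl.com", "ow.ly", "is.gd"}

def is_shortener(url):
    """Return True if the hostname of url is exactly a known shortener domain."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host in SHORTENER_HOSTS

print(is_shortener("https://t.co/abc123"))     # True
print(is_shortener("http://bit.ly/xyz"))       # True
print(is_shortener("https://www.reddit.com"))  # False
```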

#### **2.3.3] Prefix or Suffix "-" in Domain**

In [18]:
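# Use of '-' in the domain: returns 1 if the hostname contains a hyphen, 0 otherwise.
# Note that this 1/0 encoding differs from the -1/1 encoding of use_of_ip and short_url.
def prefixSuffix(url):
    if '-' in urlparse(url).netloc:
        return 1            # phishing
    else:
        return 0

urldata['prefix_Suffix'] = urldata['url'].apply(prefixSuffix)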

In [19]:
# printing first few rows
urldata.head(10)

Unnamed: 0,url,label,url_length,hostname_length,path_length,fd_length,count@,count?,count%,count.,...,count-http,count-https,count-www,count-digits,count-letters,count_dir,count_redirection,use_of_ip,short_url,prefix_Suffix
0,https://www.google.com,0,22,14,0,0,0,0,0,2,...,1,1,1,0,17,0,0,1,1,0
1,https://www.youtube.com,0,23,15,0,0,0,0,0,2,...,1,1,1,0,18,0,0,1,1,0
2,https://www.facebook.com,0,24,16,0,0,0,0,0,2,...,1,1,1,0,19,0,0,1,1,0
3,https://www.baidu.com,0,21,13,0,0,0,0,0,2,...,1,1,1,0,16,0,0,1,1,0
4,https://www.wikipedia.org,0,25,17,0,0,0,0,0,2,...,1,1,1,0,20,0,0,1,1,0
5,https://www.reddit.com,0,22,14,0,0,0,0,0,2,...,1,1,1,0,17,0,0,1,-1,0
6,https://www.yahoo.com,0,21,13,0,0,0,0,0,2,...,1,1,1,0,16,0,0,1,1,0
7,https://www.google.co.in,0,24,16,0,0,0,0,0,3,...,1,1,1,0,18,0,0,1,1,0
8,https://www.qq.com,0,18,10,0,0,0,0,0,2,...,1,1,1,0,13,0,0,1,1,0
9,https://www.amazon.com,0,22,14,0,0,0,0,0,2,...,1,1,1,0,17,0,0,1,1,0


In [20]:
# printing info about current dataset
urldata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 326750 entries, 0 to 326749
Data columns (total 21 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   url                326750 non-null  object
 1   label              326750 non-null  int64 
 2   url_length         326750 non-null  int64 
 3   hostname_length    326750 non-null  int64 
 4   path_length        326750 non-null  int64 
 5   fd_length          326750 non-null  int64 
 6   count@             326750 non-null  int64 
 7   count?             326750 non-null  int64 
 8   count%             326750 non-null  int64 
 9   count.             326750 non-null  int64 
 10  count=             326750 non-null  int64 
 11  count-http         326750 non-null  int64 
 12  count-https        326750 non-null  int64 
 13  count-www          326750 non-null  int64 
 14  count-digits       326750 non-null  int64 
 15  count-letters      326750 non-null  int64 
 16  count_dir          3

### **Saving the dataset as .csv file**


In [21]:
urldata.to_csv("Url_Processed.csv")
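
By default, `to_csv` also writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `read_csv`. The sketch below illustrates the difference on a one-row frame, using an in-memory buffer instead of a file. The helper name `round_trip` is hypothetical.

```python
import io
import pandas as pd

frame = pd.DataFrame({"url": ["https://www.google.com"], "label": [0]})

def round_trip(df, **to_csv_kwargs):
    """Write df to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)

with_index = round_trip(frame)                  # index written as first column
without_index = round_trip(frame, index=False)  # index omitted

print(list(with_index.columns))     # ['Unnamed: 0', 'url', 'label']
print(list(without_index.columns))  # ['url', 'label']
```

Whether to pass `index=False` depends on how the file is consumed later; if the index column is kept, `pd.read_csv(..., index_col=0)` restores it as the index.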