# **Phishing Domain Detection (Data Collection and Extraction)**

### The purpose of this notebook is to extract URL-based features from the malicious and benign URLs Kaggle dataset
https://www.kaggle.com/siddharthkumar25/malicious-and-benign-urls.

Research credits go to https://github.com/deepeshdm/Phishing-Attack-Domain-Detection

In [48]:
# Check if GPU is being used
import tensorflow as tf

gpu_device = tf.test.gpu_device_name()
if gpu_device:
    print(f"GPU found: {gpu_device}")
else:
    print("No GPU found, using CPU.")


No GPU found, using CPU.


In [49]:
# Import necessary libraries
import pandas as pd

# Load the dataset; stop with a clear message if the file is missing
try:
    df = pd.read_csv("./urldata.csv")
except FileNotFoundError:
    print("Error: File not found. Ensure the file path is correct.")
    raise  # re-raise instead of calling exit(), which would shut down the notebook kernel

# Drop the unnamed index column, if present
df.drop(columns=["Unnamed: 0"], inplace=True, errors='ignore')

# Display basic info and first few rows for inspection
df.info()
df.head(5)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450176 entries, 0 to 450175
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   url     450176 non-null  object
 1   label   450176 non-null  object
 2   result  450176 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 10.3+ MB


Unnamed: 0,url,label,result
0,https://www.google.com,benign,0
1,https://www.youtube.com,benign,0
2,https://www.facebook.com,benign,0
3,https://www.baidu.com,benign,0
4,https://www.wikipedia.org,benign,0


In [50]:
# Count the benign and malicious URLs
df["label"].value_counts()

label
benign       345738
malicious    104438
Name: count, dtype: int64

## **Extracting Length Features**
#### The following length features can be extracted from each URL
- Length of URL
- Length of Hostname
- Length of Path
- Length of First Directory
- Length of Top-Level Domain
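The list above includes the length of the top-level domain, but the extraction cell in this notebook does not compute it. A minimal sketch of one way to add it, assuming the top-level domain is simply the last dot-separated label of the hostname. The helper name `tld_length` is hypothetical and not part of the original notebook, and multi-part suffixes such as `co.uk` are counted by their last label only.

```python
from urllib.parse import urlparse

def tld_length(url):
    """Length of the last dot-separated hostname label, or 0 if there is none."""
    try:
        host = urlparse(url).hostname  # lowercased host without port, or None
    except ValueError:
        return 0  # malformed URL
    if not host or "." not in host:
        return 0
    # Assumption: the TLD is the text after the final dot.
    # For an IP host such as 192.168.0.1 this returns the length of the last octet.
    return len(host.rsplit(".", 1)[1])

print(tld_length("https://www.wikipedia.org/wiki/URL"))
```

Applied to the DataFrame, this could be written as `df['tld_length'] = df['url'].apply(tld_length)`.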

In [51]:
from urllib.parse import urlparse

# Function to handle invalid or empty URLs
def safe_urlparse(url):
    try:
        parsed_url = urlparse(url)
        if parsed_url.netloc:
            return parsed_url
        else:
            return None  # Return None if no valid netloc is found
    except ValueError:
        return None  # Return None for invalid URLs

# Length of the full URL
df['url_length'] = df['url'].str.len()

# Parse each URL once; invalid URLs or URLs without a netloc become None and get length 0
parsed_urls = [safe_urlparse(u) for u in df['url']]
df['hostname_length'] = [len(p.netloc) if p else 0 for p in parsed_urls]
df['path_length'] = [len(p.path) if p else 0 for p in parsed_urls]

# First directory length: the path segment right after the leading '/'
def fd_length(url):
    parsed_url = safe_urlparse(url)
    if parsed_url and parsed_url.path:
        parts = parsed_url.path.split('/')
        # split() always returns at least one element, so check for a second one explicitly
        return len(parts[1]) if len(parts) > 1 else 0
    return 0

df['fd_length'] = df['url'].apply(fd_length)


In [52]:
import re

# Characters and substrings to count; note that 'http' also matches inside 'https'
special_tokens = ['-', '@', '?', '%', '.', '=', 'http', 'https', 'www']

# Count the occurrences of each token in the URL
for token in special_tokens:
    # str.count takes a regex, so escape tokens such as '.' and '?' to match them literally
    df[f'count_{token}'] = df['url'].str.count(re.escape(token))


## **Occurrence Count Features**
The number of times certain characters and substrings occur in a URL can help distinguish malicious URLs from benign ones
- Count of '-'
- Count of '@'
- Count of '?'
- Count of '%'
- Count of '.'
- Count of '='
- Count of 'http'
- Count of 'https'
- Count of 'www'
- Count of Digits
- Count of Letters
- Count of Directories
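To make the counting concrete, here is a small sketch on two made-up URLs (the sample URLs and the reduced token list are assumptions for illustration only). It uses the same `str.count` with `re.escape` approach as the notebook's counting cell and shows that the `http` count also includes occurrences inside `https`.

```python
import re
import pandas as pd

# Two illustrative URLs, not taken from the dataset
urls = pd.Series([
    "https://www.example.com/a-b?x=1",
    "http://example.com/http",
])

tokens = ['-', '?', '.', 'http', 'https', 'www']

# One count column per token; re.escape makes '.' and '?' match literally
counts = pd.DataFrame(
    {f"count_{t}": urls.str.count(re.escape(t)) for t in tokens}
)
print(counts)
```

In the first URL both `count_http` and `count_https` are 1, because the substring `http` is contained in `https`. If a pure `http://` count is needed, subtracting `count_https` from `count_http` is one option.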

In [53]:
# Count digits and letters in the URL
df['count_digits'] = df['url'].str.count(r'\d')
df['count_letters'] = df['url'].str.count(r'[a-zA-Z]')


In [54]:
from urllib.parse import urlparse

# Function to count directories in the URL path with error handling
def count_directories(url):
    if isinstance(url, str):
        try:
            # Try parsing the URL
            parsed_url = urlparse(url)
            # Check if the parsed URL contains both scheme and netloc (valid URL)
            if parsed_url.scheme and parsed_url.netloc:
                return parsed_url.path.count('/')
            else:
                return 0  # Return 0 if it doesn't seem to be a valid URL
        except ValueError:
            return 0  # Return 0 if the URL causes a parsing error
    return 0  # Return 0 for non-string or invalid URLs

# Apply the function to count directories
df['count_dir'] = df['url'].apply(count_directories)


In [55]:
import re

# Matches a dotted-quad IPv4 address or a full (uncompressed) IPv6 address anywhere in the URL
IP_PATTERN = re.compile(
    r'((([01]?\d\d?|2[0-4]\d|25[0-5])\.){3}([01]?\d\d?|2[0-4]\d|25[0-5]))'  # IPv4
    r'|(([0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4})'  # IPv6
)

# Returns -1 if the URL contains an IP address, 1 otherwise
def having_ip_address(url):
    return -1 if IP_PATTERN.search(url) else 1

df['use_of_ip'] = df['url'].apply(having_ip_address)


In [56]:
import re

# Function to detect URL shortening services
def shortening_service(url):
    shortening_pattern = re.compile(r'bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|'
                                    r'cli\.gs|snipurl\.com|short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|'
                                    r'snipr\.com|fic\.kr|loopt\.us|doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|'
                                    r'to\.ly|bit\.do|t\.co|lnkd\.in|db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|'
                                    r'tinyurl\.com|ow\.ly|bit\.ly|ity\.im|q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|'
                                    r'j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org')
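    # search() matches anywhere in the URL, so short entries can also hit longer names
    # (for example, 't\.co' matches inside 'microsoft.com'); returns -1 on a match, 1 otherwise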
    return -1 if shortening_pattern.search(url) else 1

# Apply the function to the DataFrame
df['short_url'] = df['url'].apply(shortening_service)


In [57]:
# Save the processed dataset
df.to_csv("Url_Processed.csv", index=False)
print("Data saved to 'Url_Processed.csv'.")

Data saved to 'Url_Processed.csv'.


## **Binary Features**

The following binary features are also extracted from the dataset. Each is encoded as -1 when the property is present and 1 otherwise
- Use of IP or not
- Use of Shortening URL or not

#### **IP Address in the URL**

Checks for the presence of an IP address in the URL. A URL may use an IP address in place of a domain name. When it does, this notebook treats it as a strong indicator, though not proof, that the URL may be an attempt to steal personal information.
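The notebook's regular expression searches the whole URL, so a dotted number sequence in the path can also count as an IP address. A minimal alternative sketch, assuming only the host part should be tested: it parses the URL and validates the hostname with the standard-library `ipaddress` module. The helper name `host_is_ip` is hypothetical, and the return values follow the notebook's -1/1 convention. Obfuscated forms such as hexadecimal or integer-encoded addresses are not detected.

```python
import ipaddress
from urllib.parse import urlparse

def host_is_ip(url):
    """Return -1 if the URL's host is a literal IPv4/IPv6 address, else 1."""
    try:
        host = urlparse(url).hostname  # brackets around IPv6 hosts are stripped
    except ValueError:
        return 1  # malformed URL: treat as "no IP host"
    if not host:
        return 1
    try:
        ipaddress.ip_address(host)  # raises ValueError for anything that is not an IP
        return -1
    except ValueError:
        return 1

print(host_is_ip("http://192.168.0.1/login"))
print(host_is_ip("https://example.com/v1.2.3.4/"))
```

The second call has a version-like string in the path but a normal domain as host, so it is not flagged.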

#### **Using URL Shortening Services “TinyURL”**

URL shortening is a technique on the World Wide Web that makes a URL considerably shorter while still leading to the intended webpage. It works through an HTTP redirect: a short domain name forwards the request to the webpage with the long URL. Because the short link hides the final destination, the use of a shortening service is recorded as a feature.
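The notebook's pattern is matched anywhere in the URL, which means an entry such as `t\.co` also matches inside `microsoft.com`. A minimal sketch of a stricter check, assuming the feature should fire only when the host itself is a known shortener. The set `SHORTENERS` below is a small illustrative subset of the notebook's list, and the helper name `uses_shortener` is hypothetical.

```python
from urllib.parse import urlparse

# Illustrative subset of shortening services; extend as needed
SHORTENERS = {"bit.ly", "goo.gl", "t.co", "tinyurl.com", "ow.ly", "is.gd"}

def uses_shortener(url):
    """Return -1 if the URL's host is a known shortening service, else 1."""
    try:
        host = urlparse(url).hostname  # lowercased host, or None
    except ValueError:
        return 1  # malformed URL
    if not host:
        return 1
    if host.startswith("www."):
        host = host[4:]  # compare without a leading 'www.'
    return -1 if host in SHORTENERS else 1

print(uses_shortener("https://t.co/abc"))
print(uses_shortener("https://www.microsoft.com"))
```

Exact host matching trades recall for precision: it avoids the substring false positives but misses any service that is not in the set.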

### **Saving the dataset as a .csv file**