# Feature extraction

Turn raw data into features.

## Extract URL features

The following features are extracted from the URL:

- **URLLength**: Number of characters in the URL.
- **Domain**: Domain name extracted from the URL.
- **DomainLength**: Number of characters in the domain name.
- **IsDomainIP**: Indicates if the domain name is an IP address.
- **TLD**: Top-level domain, the last part of the domain name, such as .com or .edu.
- **URLSimilarityIndex**:
- **CharContinuationRate**:
- **TLDLegitimateProb**:
- **URLCharProb**:
- **TLDLength**: Number of characters in the TLD.
- **NoOfSubDomain**: Number of subdomains in the URL.
- **HasObfuscation**: Indicates if the URL contains obfuscated (percent-encoded) characters such as %20 or %4D.
- **NoOfObfuscatedChar**: Number of obfuscated characters in the URL.
- **ObfuscationRatio**: Ratio of obfuscated characters to the URL length.
- **NoOfLettersInURL**: Number of letters in the URL.
- **LetterRatioInURL**: Ratio of letters to the URL length.
- **NoOfDegitsInURL**: Number of digits in the URL.
- **DegitRatioInURL**: Ratio of digits to the URL length.
- **NoOfEqualsInURL**: Number of equal signs (=) in the URL.
- **NoOfQMarkInURL**: Number of question marks (?) in the URL.
- **NoOfAmpersandInURL**: Number of ampersands (&) in the URL.
- **NoOfOtherSpecialCharsInURL**: Number of other special characters in the URL.
- **SpacialCharRatioInURL**: Ratio of special characters to the URL length.
- **IsHTTPS**: Indicates if the webpage is served over secured HTTPS rather than unsecured HTTP (Hypertext Transfer Protocol).

**Notes**:

- We need to distinguish between the URL and the domain. For example, in the URL `https://www.google.com/search?q=python`, the hostname is `www.google.com`. The term "domain" often refers to the base domain or root domain (`google.com`). In our case, however, "domain" refers to the full hostname, which is ultimately not incorrect.
- Boolean features are converted to numerical values (0=False; 1=True).
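The distinction between URL and domain in the notes above can be sketched with `urlparse`. This is a minimal illustration, not the notebook's final implementation: the helper names `get_domain` and `get_domain_length` are hypothetical, and returning an empty string when no hostname can be parsed is an assumption.

```python
from urllib.parse import urlparse

def get_domain(url):
    # "Domain" here means the full hostname, as defined in the notes.
    # hostname is None when the URL has no network location
    # (for example, when the scheme is missing), so fall back to ''.
    return urlparse(url).hostname or ''

def get_domain_length(url):
    return len(get_domain(url))

example_url = 'https://www.google.com/search?q=python'
print(get_domain(example_url))         # www.google.com
print(get_domain_length(example_url))  # 14
print(len(example_url))                # 38 (length of the whole URL)
```

Note that `urlparse` only recognizes the hostname when the URL includes a scheme, so a bare string such as `www.google.com` yields an empty domain in this sketch.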

### Import libraries

In [1]:
from urllib.parse import urlparse
import re

### Extract features

In [2]:
# For testing purposes.
# Even though most URLs in the dataset do not contain query params,
# this kind of URL is more realistic.
test_url = 'https://www.google.com/search?q=alan+turing'

In [20]:
def get_tld(url):
    hostname = urlparse(url).hostname
    return hostname.split('.')[-1]

get_tld(test_url)

'com'

In [5]:
def get_length(url):
    return len(url)

get_length(test_url)

43

In [23]:
def is_domain_ip(url):
    hostname = urlparse(url).hostname or ''
    # Match the whole hostname against four groups of 1 to 3 digits separated by dots.
    # fullmatch avoids false positives on hostnames that merely contain such a sequence.
    # This is a simple check, not strict validation: it still accepts 999.999.999.999,
    # which is not a valid IP address.
    ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
    is_ip = bool(re.fullmatch(ip_pattern, hostname))
    return int(is_ip)

is_domain_ip(test_url)

0
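Since the regex above does not strictly validate IP addresses, a stricter check is possible with the standard library's `ipaddress` module. The following is a sketch of an alternative, not the approach used in this notebook; the function name `is_domain_ip_strict` is hypothetical.

```python
import ipaddress
from urllib.parse import urlparse

def is_domain_ip_strict(url):
    hostname = urlparse(url).hostname
    if hostname is None:
        return 0
    try:
        # Raises ValueError for anything that is not a valid IPv4 or IPv6 address.
        ipaddress.ip_address(hostname)
    except ValueError:
        return 0
    return 1

print(is_domain_ip_strict('http://192.168.0.1/login'))   # 1
print(is_domain_ip_strict('http://999.999.999.999/'))    # 0
print(is_domain_ip_strict('https://www.google.com/'))    # 0
```

Unlike the regex, this variant also recognizes IPv6 hosts such as `http://[::1]:8080/`.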

In [24]:
def get_subdomain_count(url):
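    # IP addresses are not domain names, so they have no subdomains.
    # Subdomains are part of the DNS hierarchy and only exist in domain names.
    if is_domain_ip(url):
        return 0

    hostname = urlparse(url).hostname
    labels = hostname.split('.')
    # Subtract 2 to account for the TLD and the root domain. This assumes a
    # single-label TLD, so multi-label suffixes such as .co.uk are overcounted.
    # max() keeps single-label hostnames such as localhost from going negative.
    return max(len(labels) - 2, 0)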

get_subdomain_count(test_url)

1

In [18]:
def get_obfuscated_char_count(url):
    # Match percent-encoded characters such as %20 or %4D
    # (a percent sign followed by two hexadecimal digits).
    obfuscated_char_pattern = r'%[0-9a-fA-F]{2}'
    obfuscated_chars = re.findall(obfuscated_char_pattern, url)
    return len(obfuscated_chars)

get_obfuscated_char_count('https://facebook.com@%61%62%63.%43%4F%4D')

6
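The related **HasObfuscation** and **ObfuscationRatio** features can be derived from a count of obfuscated characters. The sketch below assumes that an obfuscated character is a percent-encoded escape (`%` followed by two hexadecimal digits) and that the ratio is taken relative to the URL length. Both are assumptions, and the helper names are hypothetical.

```python
import re

# Assumption: an obfuscated character is a percent-encoded escape such as %20 or %4D.
PERCENT_ENCODED = re.compile(r'%[0-9a-fA-F]{2}')

def count_percent_encoded(url):
    return len(PERCENT_ENCODED.findall(url))

def has_obfuscation(url):
    return int(count_percent_encoded(url) > 0)

def obfuscation_ratio(url):
    # Assumption: the ratio is relative to the total number of characters in the URL.
    if not url:
        return 0.0
    return count_percent_encoded(url) / len(url)

obfuscated_url = 'https://facebook.com@%61%62%63.%43%4F%4D'
print(has_obfuscation(obfuscated_url))    # 1
print(obfuscation_ratio(obfuscated_url))  # 0.15 (6 escapes in 40 characters)
```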

In [25]:
def is_https(url):
    protocol = urlparse(url).scheme
    return int(protocol == 'https')

is_https(test_url)

1

In [None]:
def get_digit_symbols_count(url):
    # TODO
    return 0
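The character-count features from the list at the top (letters, digits, `=`, `?`, `&`, other special characters, and their ratios) are still to be implemented. One possible sketch follows. It assumes ASCII letters and digits only, treats every remaining character that is not `=`, `?`, or `&` as an "other special character", and computes ratios relative to the URL length. These definitions and the function name `count_url_characters` are assumptions, not the dataset's confirmed definitions.

```python
import string

def count_url_characters(url):
    letters = sum(c in string.ascii_letters for c in url)
    digits = sum(c in string.digits for c in url)
    equals = url.count('=')
    qmarks = url.count('?')
    ampersands = url.count('&')
    # Assumption: everything that is neither alphanumeric nor one of = ? &
    # counts as an "other" special character (including : / . and +).
    others = len(url) - letters - digits - equals - qmarks - ampersands
    length = len(url) or 1  # avoid division by zero on an empty string
    return {
        'NoOfLettersInURL': letters,
        'LetterRatioInURL': letters / length,
        'NoOfDegitsInURL': digits,
        'DegitRatioInURL': digits / length,
        'NoOfEqualsInURL': equals,
        'NoOfQMarkInURL': qmarks,
        'NoOfAmpersandInURL': ampersands,
        'NoOfOtherSpecialCharsInURL': others,
    }

counts = count_url_characters('https://www.google.com/search?q=alan+turing')
print(counts['NoOfLettersInURL'])            # 34
print(counts['NoOfOtherSpecialCharsInURL'])  # 7
```

The keys reuse the feature names from the list above, including their original spelling, so the result can be merged into a feature row directly.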

## Extract HTML features

The following features are extracted from the HTML content of the page the URL points to:

- **LineOfCode**:
- **LargestLineLength**:
- **HasTitle**:
- **Title**:
- **DomainTitleMatchScore**:
- **URLTitleMatchScore**:
- **HasFavicon**:
- **Robots**:
- **IsResponsive**:
- **NoOfURLRedirect**:
- **NoOfSelfRedirect**:
- **HasDescription**:
- **NoOfPopup**:
- **NoOfiFrame**:
- **HasExternalFormSubmit**:
- **HasSocialNet**:
- **HasSubmitButton**:
- **HasHiddenFields**:
- **HasPasswordField**:
- **Bank**:
- **Pay**:
- **Crypto**:
- **HasCopyrightInfo**:
- **NoOfImage**:
- **NoOfCSS**:
- **NoOfJS**:
- **NoOfSelfRef**:
- **NoOfEmptyRef**:
- **NoOfExternalRef**:

We will use [Playwright](https://playwright.dev/python/) instead of plain HTTP requests to download the page source. HTTP clients cannot execute JavaScript, and many websites today are built with JavaScript frameworks such as React, Angular, or Vue. Playwright opens an isolated browser and executes JavaScript as a real user's browser would, so we get the fully rendered HTML content of the page.
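A minimal sketch of how the rendered HTML could be downloaded with Playwright's async API, which is the variant that works inside a notebook's running event loop. The function names, the `networkidle` wait strategy, and the timeout are assumptions for illustration. The sketch also assumes Playwright and a Chromium build are installed (`pip install playwright`, then `playwright install chromium`). The `get_line_stats` helper shows one plausible reading of **LineOfCode** and **LargestLineLength**, which the list above does not define.

```python
async def fetch_rendered_html(url, timeout_ms=30_000):
    # Imported lazily so the rest of the notebook runs without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            # Wait until network activity settles so client-side rendering can finish.
            await page.goto(url, wait_until='networkidle', timeout=timeout_ms)
            return await page.content()
        finally:
            await browser.close()

def get_line_stats(html):
    # Assumption: LineOfCode is the number of lines in the HTML source and
    # LargestLineLength is the length of its longest line.
    lines = html.splitlines()
    return {
        'LineOfCode': len(lines),
        'LargestLineLength': max((len(line) for line in lines), default=0),
    }
```

In a notebook cell, the download can be awaited directly, for example `html = await fetch_rendered_html(test_url)`, and the result passed to `get_line_stats(html)`.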