# Feature extraction

## Extract URL features

The following features are extracted from the URL:

- **URLLength**: Number of characters in the URL.
- **Domain**: Domain name extracted from the URL.
- **DomainLength**: Number of characters in the domain name.
- **IsDomainIP**: Indicates if the domain name is an IP address.
- **TLD**: TLD (Top Level Domain) is the last part of the domain name, such as .com or .edu.
- **URLSimilarityIndex**:
- **CharContinuationRate**: Sum of the lengths of the longest runs of consecutive letters, digits, and special characters, divided by the length of the string.
- **TLDLegitimateProb**:
- **URLCharProb**:
- **TLDLength**: Number of characters in the TLD.
- **NoOfSubDomain**: Number of subdomains in the URL.
- **HasObfuscation**: Indicates if the URL has obfuscated characters like %20, %4D, etc.
- **NoOfObfuscatedChar**: Number of obfuscated characters in the URL.
- **ObfuscationRatio**: Ratio of obfuscated characters in the URL.
- **NoOfLettersInURL**: Number of letters in the URL.
- **LetterRatioInURL**: Ratio of letters in the URL.
- **NoOfDegitsInURL**: Number of digits in the URL.
- **DegitRatioInURL**: Ratio of digits in the URL.
- **NoOfEqualsInURL**: Number of equal signs (=) in the URL.
- **NoOfQMarkInURL**: Number of question marks (?) in the URL.
- **NoOfAmpersandInURL**: Number of ampersands (&) in the URL.
- **NoOfOtherSpecialCharsInURL**: Number of special characters other than equals, question marks, and ampersands in the URL.
- **SpacialCharRatioInURL**: Ratio of all special characters in the URL. A special character is any character that is not a letter or a digit.
- **IsHTTPS**: Indicates whether the URL uses secured HTTPS (1) or unsecured HTTP (hypertext transfer protocol) (0).

**Notes**:

- We need to differentiate between the URL and the domain. For example, in the URL `https://www.google.com/search?q=python`, the hostname is `www.google.com`. The term "domain" often refers to the base (or root) domain, here `google.com`, but in this notebook we use "domain" to mean the full hostname, which is ultimately not incorrect.
- Boolean features are converted to numerical values (0=False; 1=True).
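
To make the URL/hostname distinction concrete, here is a small sketch using the standard library's `urlparse`. The `apex_guess` variable is a hypothetical name and only a naive illustration: it assumes a single-label TLD and is not used elsewhere in this notebook.

```python
from urllib.parse import urlparse

url = 'https://www.google.com/search?q=python'
parsed = urlparse(url)

# The hostname is what this notebook calls the "domain".
hostname = parsed.hostname

# Naive guess at the base (root) domain: keep the last two labels.
# Assumption: a single-label TLD; this breaks for suffixes such as 'co.uk'.
apex_guess = '.'.join(hostname.split('.')[-2:])

print(hostname)    # www.google.com
print(apex_guess)  # google.com
```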

### Import libraries

In [None]:
import re
from urllib.parse import unquote, urlparse

### Extract features

In [None]:
def IsDomainIP(domain):
    # This regex will match any sequence of four numbers separated by dots. This is a simple way to check if a string is an IP address.
    # However, it doesn't strictly validate IP addresses. For example, it will match 999.999.999.999, which is not a valid IP address.
    ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
    is_ip = bool(re.search(ip_pattern, domain))
    return int(is_ip)


print(IsDomainIP('www.google.com'))  # 0
print(IsDomainIP('192.168.1.1'))  # 1
print(IsDomainIP('999.999.999.999'))  # 1 (True, but it's not a valid IP address)

0
1
1
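
As the comments in the cell above note, the regex also accepts strings such as `999.999.999.999`. If strict validation matters, one possible alternative is the standard library's `ipaddress` module. The sketch below is not what the dataset authors used, and `is_valid_ip` is a hypothetical helper name.

```python
import ipaddress


def is_valid_ip(host):
    """Return 1 if host parses as an IPv4 or IPv6 address, else 0."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return 0
    return 1


print(is_valid_ip('www.google.com'))   # 0
print(is_valid_ip('192.168.1.1'))      # 1
print(is_valid_ip('999.999.999.999'))  # 0
```

Unlike the regex, this also recognises IPv6 addresses such as `::1`.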


In [None]:
def NoOfSubDomain(domain):
    # IP addresses are not domain names, thus they don't have subdomains.
    # Subdomains are part of the DNS hierarchy and are only used in domain names.
    if IsDomainIP(domain):
        return 0

    domains = domain.split('.')
    # Subtract apex domain and TLD; clamp at 0 for single-label hosts such as 'localhost'
    return max(len(domains) - 2, 0)


print(NoOfSubDomain('docs.python.org'))  # 1
print(NoOfSubDomain('google.com'))  # 0
print(NoOfSubDomain('192.168.1.1'))  # 0

1
0
0
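
Counting labels and subtracting two assumes a single-label TLD, so a hostname such as `www.bbc.co.uk` would be reported as having two subdomains. The sketch below shows one way this could be handled. `MULTI_LABEL_SUFFIXES` is a hypothetical, deliberately tiny sample, and `count_subdomains` is an illustrative name; a real implementation would need the full Public Suffix List. IP addresses are not handled here.

```python
# Hypothetical, deliberately tiny sample of multi-label public suffixes.
MULTI_LABEL_SUFFIXES = {'co.uk', 'com.au', 'co.jp'}


def count_subdomains(hostname):
    """Count the labels to the left of the registrable domain."""
    labels = hostname.lower().split('.')
    # The suffix spans two labels if the last two labels form a known suffix.
    suffix_len = 2 if '.'.join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 1
    # Remove the suffix and the registrable label; never go below zero.
    return max(len(labels) - suffix_len - 1, 0)


print(count_subdomains('docs.python.org'))  # 1
print(count_subdomains('www.bbc.co.uk'))    # 1
print(count_subdomains('google.com'))       # 0
```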


In [None]:
# https://github.com/arvindbitm/PhiUSIIL/blob/main/CharConRate.ipynb
# They are stripping 'www' and the TLD from the domain name before calculating the character continuation rate.
# Why only 'www'? What about other subdomains?

def CharConRate(url):
    rate = 0  # This variable is not even used 🤷‍♂️
    ln = len(url)
    chC, nmC, spC = 0, 0, 0
    maxCh, maxNm, MaxSp = 0, 0, 0
    for i in range(0, ln):
        ch = url[i]
        if ch.isalpha():
            chC = chC + 1
            if (nmC > 0):
                if (maxNm < nmC):
                    maxNm = nmC
                    nmC = 0
            elif (spC > 0):
                if (MaxSp < spC):
                    MaxSp = spC
                    spC = 0
            nmC, spC = 0, 0

        elif ch.isdigit():
            nmC = nmC + 1
            if (chC > 0):
                if (maxCh < chC):
                    maxCh = chC
                    chC = 0
            elif (spC > 0):
                if (MaxSp < spC):
                    MaxSp = spC
                    spC = 0
            chC, spC = 0, 0
        else:
            spC = spC + 1
            if (nmC > 0):
                if (maxNm < nmC):
                    maxNm = nmC
                    nmC = 0
            elif (chC > 0):
                if (maxCh < chC):
                    maxCh = chC
                    chC = 0
            nmC, chC = 0, 0

    if (maxCh < chC):
        maxCh = chC
    if (maxNm < nmC):
        maxNm = nmC
    if (MaxSp < spC):
        MaxSp = spC
    return (maxCh + maxNm + MaxSp) / ln
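
The function above tracks the longest run of letters, of digits, and of other characters, then divides the sum of those three lengths by the string length. A more compact sketch of the same idea, using `itertools.groupby`, might look like this. The names `char_class` and `char_continuation_rate` are hypothetical, and the guard for empty input is an addition that the original code does not have.

```python
from itertools import groupby


def char_class(ch):
    """Classify a character with the same tests as the original loop."""
    if ch.isalpha():
        return 'letter'
    if ch.isdigit():
        return 'digit'
    return 'special'


def char_continuation_rate(text):
    if not text:
        return 0.0  # assumption: avoid dividing by zero on empty input
    longest = {'letter': 0, 'digit': 0, 'special': 0}
    # groupby yields one group per run of consecutive same-class characters.
    for cls, run in groupby(text, key=char_class):
        longest[cls] = max(longest[cls], len(list(run)))
    return sum(longest.values()) / len(text)


print(char_continuation_rate('google'))      # 1.0
print(char_continuation_rate('abc123--de'))  # 0.8
```

In the second example the longest runs are `abc` (3), `123` (3), and `--` (2), giving 8 / 10.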

In [None]:
def HasObfuscation(url):
    # `url` replaces the original parameter name `str`, which shadowed the built-in type
    decoded_url = unquote(url)
    return int(decoded_url != url)


print(HasObfuscation('https://facebook.com'))
print(HasObfuscation('https://facebook.com@%61%62%63.%43%4F%4D'))

0
1


In [None]:
def NoOfObfuscatedChar(url):
    # Regular expression to find percent-encoded characters.
    # A percent-encoded character is a character that is represented by a percent sign followed by two hexadecimal digits.
    encoded_char_pattern = r'%[0-9A-Fa-f]{2}'
    # Find all matches of the pattern in the URL
    encoded_chars = re.findall(encoded_char_pattern, url)
    return len(encoded_chars)


print(NoOfObfuscatedChar('https://facebook.com'))
print(NoOfObfuscatedChar('https://facebook.com@%61%62%63.%43%4F%4D'))

0
6
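
To see what the percent-encoding in the test URL hides, the sketch below decodes it with `unquote` and computes the ratio of encoded sequences to URL length. It works on the whole URL purely for illustration.

```python
import re
from urllib.parse import unquote

url = 'https://facebook.com@%61%62%63.%43%4F%4D'

# Decoding reveals the real host hidden behind the '@'.
decoded = unquote(url)

# Each %XX sequence counts as one obfuscated character.
encoded_count = len(re.findall(r'%[0-9A-Fa-f]{2}', url))
ratio = encoded_count / len(url)

print(decoded)        # https://facebook.com@abc.COM
print(encoded_count)  # 6
print(ratio)          # 0.15
```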


In [None]:
def NoOfInURL(text, pattern):
    """
    Counts the number of occurrences of a given pattern in a string.

    Parameters:
    - text (str): The input string to search for matches.
    - pattern (str): The regular expression to search for in the input string.

    Returns:
    - int: The number of matches found.

    Example:
    >>> NoOfInURL("https://facebook.com?param=value", r"=")
    1
    """
    # Find all matches of the pattern in the URL
    matches = re.findall(pattern, text)
    return len(matches)


print(NoOfInURL('https://facebook.com', r'\?'))
print(NoOfInURL('https://facebook.com?param=value', r'\?'))

0
1


In [None]:
# The feature names refer to 'URL', but the dataset seems to compute them on the domain.
# The same applies to all the ...InURL calculations.
def SpecialCharRatioInURL(url):
    # Negated character class: matches any character that is not a letter or a digit
    special_char_pattern = r'[^A-Za-z0-9]'
    special_chars = re.findall(special_char_pattern, url)
    return len(special_chars) / len(url)


print(SpecialCharRatioInURL('facebook.com'))
print(SpecialCharRatioInURL('facebook.com?param=value'))

0.08333333333333333
0.125


In [None]:
# Even though most URLs in the dataset do not contain query params, this kind of URL is more realistic.
test_url = 'https://www.google.com/search?q=alan+turing'


def URLFeatures(url):
    parsed_url = urlparse(url)
    domain = parsed_url.hostname
    tld = domain.split('.')[-1]

    return {
        'URLLength': len(url),
        'Domain': domain,
        'DomainLength': len(domain),
        'IsDomainIP': IsDomainIP(domain),
        'TLD': tld,
        'TLDLength': len(tld),
        'TLDLegitimateProb': None,
        'URLSimilarityIndex': None,
        # For consistency, we strip the first subdomain (not just 'www') and the TLD
        'CharContinuationRate': CharConRate(domain.split('.')[1 if NoOfSubDomain(domain) > 0 else 0]),
        'URLCharProb': None,
        'NoOfSubDomain': NoOfSubDomain(domain),
        'HasObfuscation': HasObfuscation(domain),
        'NoOfObfuscatedChar': NoOfObfuscatedChar(domain),
        'ObfuscationRatio': NoOfObfuscatedChar(domain) / len(domain),
        'NoOfLettersInURL': NoOfInURL(url, r'[a-zA-Z]'),
        'LetterRatioInURL': NoOfInURL(url, r'[a-zA-Z]') / len(url),
        'NoOfDegitsInURL': NoOfInURL(url, r'\d'),
        'DegitRatioInURL': NoOfInURL(url, r'\d') / len(url),
        'NoOfEqualsInURL': NoOfInURL(url, r'='),
        'NoOfQMarkInURL': NoOfInURL(url, r'\?'),
        'NoOfAmpersandInURL': NoOfInURL(url, r'&'),
        'NoOfOtherSpecialCharsInURL': NoOfInURL(url, r'[^a-zA-Z\d=&\?]'),
        'SpacialCharRatioInURL': SpecialCharRatioInURL(url),
        # I guess this is a typo and it should be 'SpecialCharRatioInURL'
        'IsHTTPS': int(parsed_url.scheme == 'https'),
    }


URLFeatures(test_url)

{'URLLength': 43,
 'Domain': 'www.google.com',
 'DomainLength': 14,
 'IsDomainIP': 0,
 'TLD': 'com',
 'TLDLength': 3,
 'TLDLegitimateProb': None,
 'URLSimilarityIndex': None,
 'CharContinuationRate': 1.0,
 'URLCharProb': None,
 'NoOfSubDomain': 1,
 'HasObfuscation': 0,
 'NoOfObfuscatedChar': 0,
 'ObfuscationRatio': 0.0,
 'NoOfLettersInURL': 34,
 'LetterRatioInURL': 0.7906976744186046,
 'NoOfDegitsInURL': 0,
 'DegitRatioInURL': 0.0,
 'NoOfEqualsInURL': 1,
 'NoOfQMarkInURL': 1,
 'NoOfAmpersandInURL': 0,
 'NoOfOtherSpecialCharsInURL': 7,
 'SpacialCharRatioInURL': 0.20930232558139536,
 'IsHTTPS': 1}

## Extract HTML features

The following features are extracted from the web page HTML:

- **LineOfCode**: Number of lines of code in the HTML.
- **LargestLineLength**: Length of the largest line of code in the HTML. This is used to detect obfuscated code.
- **HasTitle**: Whether the HTML has a title tag.
- **Title**: The title of the page.
- **DomainTitleMatchScore**: The score of the page title matching the domain name. Out of 100.
- **URLTitleMatchScore**: The score of the page title matching the URL. Out of 100.
- **HasFavicon**: Whether the page has a favicon.
- **Robots**: Whether the website has a robots.txt file or a robots meta tag.
- **IsResponsive**: Whether the website is responsive.
- **NoOfURLRedirect**: Number of URL redirects.
- **NoOfSelfRedirect**: Number of redirects to the same domain.
- **HasDescription**: Whether the page has a meta description.
- **NoOfPopup**: Number of popups.
- **NoOfiFrame**: Number of iframes.
- **HasExternalFormSubmit**: Whether the page has an external form submit.
- **HasSocialNet**: Whether the page has social network links.
- **HasSubmitButton**: Whether the page has a submit button.
- **HasHiddenFields**: Whether the page has hidden fields.
- **HasPasswordField**: Whether the page has password fields.
- **Bank**: Whether the page is a bank page.
- **Pay**: Whether the page is a payment page.
- **Crypto**: Whether the page is a cryptocurrency page.
- **HasCopyrightInfo**: Whether the page has copyright information.
- **NoOfImage**: Number of images.
- **NoOfCSS**: Number of CSS files.
- **NoOfJS**: Number of JS files.
- **NoOfSelfRef**: Number of links to the same domain.
- **NoOfEmptyRef**: Number of empty links.
- **NoOfExternalRef**: Number of links to external domains.

### Import libraries

In [None]:
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

### Extract features

In [None]:
def LineOfCode(html):
    # Counts newline characters, so a final line without a trailing newline is not counted
    return html.count('\n')

In [None]:
def LargestLineLength(html):
    # Use the built-in max() instead of shadowing it with a local variable
    return max(len(line) for line in html.split('\n'))

In [None]:
def HasFavicon(url: str, soup: BeautifulSoup):
    favicon = soup.find('link', rel='icon')
    if favicon is not None:
        return int(True)

    favicon_url = urlparse(url)._replace(path='/favicon.ico').geturl()
    response = requests.get(favicon_url)
    if response.status_code == 200:
        return int(True)

    return int(False)

In [None]:
def HasRobots(url: str, soup: BeautifulSoup):
    # Check if meta robots tag exists before making a request
    if soup:
        meta = soup.find('meta', attrs={'name': 'robots'})
        if meta:
            return int(True)  # for readability

    # If no meta tag, make a request to the robots.txt file
    if url:
        robots_url = urlparse(url)._replace(path='/robots.txt').geturl()
        response = requests.get(robots_url)
        if response.status_code == 200:
            return int(True)

    return int(False)

In [None]:
def IsResponsive(soup: BeautifulSoup):
    # Check if viewport meta tag exists
    meta = soup.find('meta', attrs={'name': 'viewport'})
    if meta:
        return int(True)

    # Check for conditionally loaded stylesheets
    stylesheet = soup.find('link', attrs={'rel': 'stylesheet', 'media': 'screen'})
    if stylesheet:
        return int(True)

    # Check if inline style contains media queries
    style = soup.find('style', string=re.compile('@media'))
    if style:
        return int(True)

    # Checking if the page is responsive is not a trivial task
    # This function may return false negatives
    # For example, a page may be responsive without using media queries.
    # Above checks don't cover all possible cases.

    return int(False)

In [None]:
def NoOfPopup(soup: BeautifulSoup):
    count = 0

    # Check for new dialog element
    popups = soup.find_all('dialog')
    count += len(popups)

    # Count script tags that contain a window.open() call (the dot is escaped so it only matches a literal '.')
    scripts = soup.find_all('script', string=re.compile(r'window\.open'))
    count += len(scripts)

    return count

In [None]:
def HasExternalFormSubmit(soup: BeautifulSoup):
    forms = soup.find_all('form')
    for form in forms:
        action = form.get('action')
        if action and not action.startswith('/'):
            return int(True)

    return int(False)
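
The prefix check above treats relative actions such as `login.php` as external and protocol-relative ones (`//host/path`) as internal. One possible refinement is to resolve the action against the page URL and compare hostnames. The sketch below works on plain strings and uses only the standard library; `is_external_action` and its `page_url` parameter are hypothetical and would require passing the page URL alongside the parsed HTML.

```python
from urllib.parse import urljoin, urlparse


def is_external_action(page_url, action):
    """Return 1 if a form action resolves to a different host than the page."""
    if not action:
        return 0  # an empty or missing action submits to the page itself
    # urljoin resolves relative and protocol-relative actions against the page URL.
    target_host = urlparse(urljoin(page_url, action)).hostname
    page_host = urlparse(page_url).hostname
    return int(target_host is not None and target_host != page_host)


page = 'https://example.com/login'
print(is_external_action(page, '/submit'))              # 0
print(is_external_action(page, 'submit.php'))           # 0
print(is_external_action(page, '//evil.test/collect'))  # 1
print(is_external_action(page, 'https://evil.test/x'))  # 1
```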

In [None]:
def HasSocialNet(soup: BeautifulSoup):
    social_media = [
        'facebook', 'twitter', 'x.com', 'linkedin', 'instagram', 'youtube',
        'pinterest', 'tumblr', 'snapchat', 'reddit', 'tiktok', 'whatsapp',
        'wechat', 'qq', 'telegram', 'viber', 'line', 'vk', 'odnoklassniki',
        'myspace', 'flickr', 'meetup', 'mix', 'deviantart', 'livejournal',
        'badoo', 'stumbleupon', 'digg', 'friendster', 'classmates', 'xing',
        'renren', 'douban', 'vkontakte', 'qzone', 'baidu', 'weibo', 'kakao',
        'naver', 'skype', 'discord', 'slack', 'signal', 'mastodon', 'parler',
        'gab', 'clubhouse', 'ello', 'peach', 'plurk', 'mewe', 'minds', 'diaspora'
    ]

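    # Escape each name so that characters such as the dot in 'x.com' are matched literally
    social_media_regex = re.compile('|'.join(map(re.escape, social_media)), re.IGNORECASE)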

    # Check if any social media link exists (no need to check all)
    social_media_link = soup.find('a', href=social_media_regex)

    if social_media_link:
        return int(True)

    return int(False)
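
Substring matching on short names such as `line`, `mix`, or `qq` can also match unrelated URLs (for example, any link containing `online`). A stricter variant could compare the link's hostname against a set of known domains. The sketch below is only an illustration: `SOCIAL_HOSTS` is a hypothetical, incomplete sample and `is_social_link` is an assumed helper name.

```python
from urllib.parse import urlparse

# Hypothetical, incomplete sample of social network domains.
SOCIAL_HOSTS = {
    'facebook.com', 'twitter.com', 'x.com',
    'linkedin.com', 'instagram.com', 'youtube.com',
}


def is_social_link(href):
    """Return 1 if the href's hostname is a listed domain or one of its subdomains."""
    host = urlparse(href).hostname or ''
    return int(any(host == d or host.endswith('.' + d) for d in SOCIAL_HOSTS))


print(is_social_link('https://www.facebook.com/page'))  # 1
print(is_social_link('https://online-shop.test/mix'))   # 0
print(is_social_link('https://box.com/'))               # 0
```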

In [None]:
def HasCopyrightInfo(soup: BeautifulSoup):
    copyright_variants = ['©', '(c)', 'copyright', 'all rights reserved']
    # Escape each variant: unescaped, '(c)' is a regex group that matches any 'c'
    copyright_regex = re.compile('|'.join(map(re.escape, copyright_variants)), re.IGNORECASE)

    return int(soup.find(string=copyright_regex) is not None)

In [None]:
# Count self-referencing links
def NoOfSelfRef(soup: BeautifulSoup):
    count = 0

    links = soup.find_all('a')
    for link in links:
        href = link.get('href')
        if href is not None and (href.startswith('/') or href.startswith('#')):
            count += 1

    return count


# Count empty links
def NoOfEmptyRef(soup: BeautifulSoup):
    count = 0

    links = soup.find_all('a')
    for link in links:
        href = link.get('href')
        if href is None or href == '':
            count += 1

    return count


# Count external links
def NoOfExternalRef(url: str, soup: BeautifulSoup):
    count = 0
    netloc = urlparse(url).netloc

    links = soup.find_all('a')
    for link in links:
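        href = link.get('href')
        # Relative hrefs have an empty netloc and point to the same site, so skip them
        href_netloc = urlparse(href).netloc if href else ''
        if href_netloc and href_netloc != netloc:
            count += 1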

    return count
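
Resolving each `href` against the page URL before comparing hosts avoids having to special-case relative, protocol-relative, and absolute links. The sketch below classifies plain `href` strings with the standard library only. `classify_href` and the `'other'` category (for schemes without a host, such as `mailto:`) are assumptions for illustration, not part of the dataset definition.

```python
from collections import Counter
from urllib.parse import urljoin, urlparse


def classify_href(page_url, href):
    """Classify a link as 'empty', 'self', 'external', or 'other'."""
    if href is None or href.strip() == '':
        return 'empty'
    # Resolve relative and protocol-relative links against the page URL.
    target_host = urlparse(urljoin(page_url, href)).hostname
    if target_host is None:
        return 'other'  # e.g. mailto: links, which have no host
    if target_host == urlparse(page_url).hostname:
        return 'self'
    return 'external'


page = 'https://www.google.com/search?q=alan+turing'
hrefs = ['/maps', '#top', 'https://www.google.com/about',
         'https://example.org/', '//example.org/x', '', None]
counts = Counter(classify_href(page, h) for h in hrefs)
print(counts['self'], counts['external'], counts['empty'])  # 3 2 2
```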

In [None]:
test_url = 'https://www.google.com/search?q=alan+turing'  # Link with robots.txt


# test_url = 'https://shorturl.at/qzDIE' # Link with redirects
# test_url = 'https://example.com'

def HTMLFeatures(url):
    response = requests.get(url, allow_redirects=True)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')

    return {
        'LineOfCode': LineOfCode(html),
        'LargestLineLength': LargestLineLength(html),
        'HasTitle': int(soup.title is not None),
        'Title': soup.title.string if soup.title else '',
        'DomainTitleMatchScore': None,
        'URLTitleMatchScore': None,
        'HasFavicon': HasFavicon(url, soup),
        'Robots': HasRobots(url, soup),
        'IsResponsive': IsResponsive(soup),
        'NoOfURLRedirect': len(response.history),
        'NoOfSelfRedirect': len([redirect for redirect in response.history[1:] if
                                 urlparse(redirect.url).hostname == urlparse(url).hostname]),
        'HasDescription': int(soup.find('meta', attrs={'name': 'description'}) is not None),
        'NoOfPopup': NoOfPopup(soup),
        'NoOfiFrame': len(soup.find_all('iframe')),
        'HasExternalFormSubmit': HasExternalFormSubmit(soup),
        'HasSocialNet': HasSocialNet(soup),
        'HasSubmitButton': int(soup.find('input', type='submit') is not None),
        'HasHiddenFields': int(soup.find('input', type='hidden') is not None),
        'HasPasswordField': int(soup.find('input', type='password') is not None),
        'Bank': None,
        'Pay': None,
        'Crypto': None,
        'HasCopyrightInfo': HasCopyrightInfo(soup),
        'NoOfImage': len(soup.find_all('img')),
        'NoOfCSS': len(soup.find_all('link', rel='stylesheet')),
        'NoOfJS': len(soup.find_all('script')),
        'NoOfSelfRef': NoOfSelfRef(soup),
        'NoOfEmptyRef': NoOfEmptyRef(soup),
        'NoOfExternalRef': NoOfExternalRef(url, soup),
    }


HTMLFeatures(test_url)

{'LineOfCode': 30,
 'LargestLineLength': 45695,
 'HasTitle': 1,
 'Title': 'alan turing - Recherche Google',
 'DomainTitleMatchScore': None,
 'URLTitleMatchScore': None,
 'HasFavicon': 1,
 'Robots': 1,
 'IsResponsive': 0,
 'NoOfURLRedirect': 0,
 'NoOfSelfRedirect': 0,
 'HasDescription': 0,
 'NoOfPopup': 0,
 'NoOfiFrame': 0,
 'HasExternalFormSubmit': 0,
 'HasSocialNet': 1,
 'HasSubmitButton': 0,
 'HasHiddenFields': 1,
 'HasPasswordField': 0,
 'Bank': None,
 'Pay': None,
 'Crypto': None,
 'HasCopyrightInfo': 1,
 'NoOfImage': 9,
 'NoOfCSS': 0,
 'NoOfJS': 9,
 'NoOfSelfRef': 54,
 'NoOfEmptyRef': 0,
 'NoOfExternalRef': 56}