# Phishing Dataset Description

## URL description

![URL](../imgs/url_parts.webp)

## Description of the Dataset

This dataset of phishing websites contains 87 features across 11,430 URLs, with half labeled as "phishing" and half as "legitimate," making it a balanced dataset. It includes extracted features from three main categories, which are valuable for training machine learning models to detect phishing sites. Here’s a summary and an interpretation of what each column likely represents:


### URL COMPONENTS - Features

#### 1. URL-based features (56 columns)

- **`url`**: Full URL of the webpage
    * Data type: object
    * Nº of unique values: 11429
- **`length_url`**: Length of the full URL
    * Data type: int
    * Nº of unique values: 324
- **`length_hostname`**: Length of the hostname part of the URL
    * Data type: int
    * Nº of unique values: 83
- **`ip`**: Indicates if an IP address appears in the hostname.
    * Data type: int
    * Nº of unique values: 2 -> (0, 1)
- **`nb_dots`, `nb_hyphens`, `nb_at`, `nb_qm`, `nb_and`, `nb_or`, `nb_eq`, `nb_underscore`, `nb_tilde`, `nb_percent`, `nb_slash`, `nb_star`, `nb_colon`, `nb_comma`, `nb_semicolumn`, `nb_dollar`, `nb_space`, `nb_www`, `nb_com`, `nb_dslash`**: Counts of special characters and tokens in the URL. The symbol behind each column, with its data type and value range, is listed under "Character count features and their symbols".
-  Having multiple http or https in url path, **`http_in_path`**: This feature counts the number of times the substring **`http`** (or https) appears in the **URL**.
    * Data type: int
    * Nº of unique values: 5 (0, 1, 2, 3, 4)
- Uses https protocol, **`https_token`**: This feature indicates whether the **`https`** protocol is used in the URL. If the URL uses **`https`**, the value is **0**; if it uses **`http`**, the value is **1**.
- **Ratio of digits in URL, `ratio_digits_url`**: Ratio of digits to total characters in the URL; a high ratio may indicate suspicion.
    * Data type: float
    * Nº of unique values: 1414
- **Ratio of digits in hostname, `ratio_digits_host`**: Ratios of digits to total characters in hostname; high ratios may indicate suspicion.
    * Data type: float
    * Nº of unique values: 241
- **`punycode`**: Indicates if the domain uses Punycode, which is often abused for phishing. It checks if the URL starts with *http://xn--* or *https://xn--*
    * Data type: int
    * Nº of unique values: 2 (0, 1)
- Check port presence in domain, **`port`**: Presence of a port number in domain
    * Data type: int
    * Nº of unique values: 2 (0, 1)
- **`tld_in_path`, `tld_in_subdomain`**: Checks if TLD (Top-Level Domain) appears in the path or subdomain (`.com`, `.org`, or `.net`).
    * Data type: int
    * Nº of unique values: 2 (0, 1)
- Abnormal subdomain starting with *`wwww-`*, *`wwNN`*, **`abnormal_subdomain`**: Checks whether the subdomain follows one of these abnormal patterns
    * Data type: int
    * Nº of unique values: 2 (0, 1)
- **`nb_subdomains`**: Number of subdomains in each URL, expressed as a category. The category is determined by counting the dot separators (`.`) in the URL, which gives insight into the URL's structural complexity and can sometimes hint at legitimacy or suspiciousness.
    * Data type: int
    * Nº of unique values: 3 (1, 2, 3)
    1. **Category 1**: URLs with a single dot, meaning no subdomains are present.
        - Example: `http://example.com`
    2. **Category 2**: URLs with two dots, indicating one subdomain.
        - Example: `http://sub.example.com`
    3. **Category 3**: URLs with either zero dots (just the main domain) or three or more dots (multiple or deeply nested subdomains).
        - Examples:
            - No subdomain: `http://localhost`
            - Multiple subdomains: `http://sub1.sub2.example.com`
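
The three categories above can be sketched in a few lines of Python. This is a minimal illustration that assumes the dots are counted over the whole URL string, as the description states; the function name `subdomain_category` is hypothetical and is not taken from the dataset's extraction code.

```python
def subdomain_category(url: str) -> int:
    """Map a URL to category 1, 2, or 3 based on how many dots it contains."""
    dots = url.count(".")
    if dots == 1:
        return 1  # single dot: no subdomain
    if dots == 2:
        return 2  # two dots: one subdomain
    return 3      # zero dots, or three or more dots


for example in ("http://example.com", "http://sub.example.com",
                "http://localhost", "http://sub1.sub2.example.com"):
    print(example, subdomain_category(example))
```

Because the count covers the whole string, a dot in the path (for example a file extension) also raises the category under this sketch.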

- **`prefix_suffix`**: Hyphens (`-`) between words in the domain, often used in phishing
    * Data type: int
    * Nº of unique values: 2 (0, 1)
- Is the registered domain created with random characters (Sahingoz2019), **`random_domain`**:
    * Data type: int
    * Nº of unique values: 2 (0, 1)
- **`shortening_service`**: Checks if a URL shortening service is used. A URL shortener is an app that converts a long URL into a short URL. The idea is to minimize the web page address into something that's easier to remember and track. Typically, it shortens the website's address and adds a random combination of letters and numbers.
    * Data type: int
    * Nº of unique values: 2 (0, 1)
- **`path_extension`**: Checks for unusual path extensions in the URL
    * Data type: int
    * Nº of unique values: 2 (0, 1)

- **Internal redirections (Kumar Jain'18), `nb_redirection`**: A redirection forwards a user from one URL to another. When a URL has multiple redirects (as counted by `nb_redirection`), the user passes through one or more intermediate pages before reaching the destination. Phishing sites may chain many redirects to mask their origin.
    * Data type: int
    * Nº of unique values: 7 (0, 1, 2, 3, 4, 5, 6)
- **External redirections (Kumar Jain'18),`nb_external_redirection`**: It flags the presence of any external redirection within the `nb_redirection` chain, i.e., a site that is not within the same domain as the starting URL.
    * Data type: int
    * Nº of unique values: 2 (0, 1)
- **`length_words_raw`**: Length of raw word list (Sahingoz2019)
    * Data type: int
    * Nº of unique values: 54 (0, 106)
- **Consecutive Character Repeat (Sahingoz2019), `char_repeat`**: Counts the number of occurrences of consecutive characters in each URL that repeat 2 or more times. It is based on identifying substrings where characters are repeated multiple times (e.g., "aaa", "bb", etc.).
    * Data type: int
    * Nº of unique values: 55 (0, 146)
- **`shortest_words_raw`, `shortest_word_host`, `shortest_word_path`**: Length of the shortest word in the raw word list (Sahingoz2019), in the host, and in the path
    * `shortest_words_raw`
        * Data type: int
        * Nº of unique values: 25 (0, 27)
    * `shortest_word_host`
        * Data type: int
        * Nº of unique values: 34 (0, 39)
    * `shortest_word_path`
        * Data type: int
        * Nº of unique values: 33 (0, 40)
- **`longest_words_raw`, `longest_word_host`, `longest_word_path`**: Length of the longest word in the raw word list (Sahingoz2019), in the host, and in the path
    * `longest_words_raw`
        * Data type: int
        * Nº of unique values: 119 (0, 27)
    * `longest_word_host`
        * Data type: int
        * Nº of unique values: 49 (0, 39)
    * `longest_word_path`
        * Data type: int
        * Nº of unique values: 120 (0, 829)
- **`avg_words_raw`, `avg_word_host`, `avg_word_path`**: Average word length in the raw word list (Sahingoz2019), in the host, and in the path
    * `avg_words_raw`
        * Data type: float
        * Nº of unique values: 896
    * `avg_word_host`
        * Data type: float
        * Nº of unique values: 174
    * `avg_word_path`
        * Data type: float
        * Nº of unique values: 757
- Number of phish-hints in URL path, **`phish_hints`**: Counts common phishing indicators that appear in the URL path.
    * Hints: `wp`, `login`, `includes`, `admin`, `content`, `site`, `images`, `js`, `alibaba`, `css`, `myaccount`, `dropbox`, `themes`, `plugins`, `signin`, `view`
    * Data type: int
    * Nº of unique values: 9 (0, 10)
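
A count like this can be sketched as a sum of substring occurrences. The snippet below is an illustration only: it assumes case-insensitive substring counting over the path, and the function name `count_phish_hints` is hypothetical. The hint list is the one given above.

```python
HINTS = ["wp", "login", "includes", "admin", "content", "site", "images", "js",
         "alibaba", "css", "myaccount", "dropbox", "themes", "plugins",
         "signin", "view"]


def count_phish_hints(url_path: str) -> int:
    """Sum the (non-overlapping) occurrences of every hint in the lowercased path."""
    path = url_path.lower()
    return sum(path.count(hint) for hint in HINTS)


print(count_phish_hints("/wp-content/themes/login.php"))  # wp, content, themes, login
print(count_phish_hints("/about"))
```

Substring matching is deliberately loose, so short hints such as `js` or `wp` can also fire inside unrelated words.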

- **`domain_in_brand`, `brand_in_subdomain`, `brand_in_path`**: Checks if brand list names are in the domain, subdomain, or path
    * Data type: int
    * Nº of unique values: 2 (0, 1)
- **`suspecious_tld`**: Identifies URLs with suspicious or potentially malicious top-level domains (TLDs). Certain TLDs have been associated with spam, phishing, or other forms of malicious activity, and this function flags URLs that use such TLDs.
    - **`1`**: The URL has a suspicious TLD, indicating a higher likelihood of being malicious.
    - **`0`**: The TLD is not suspicious, meaning it is considered safe in this context.
    - Predefined list of known suspicious TLDs, grouped by source:
        - Spamhaus: `fit`, `tk`, `gp`, `ga`, `work`, `ml`, `date`, `wang`, `men`, `icu`, `online`, `click`
        - Shady Top-Level Domains: `country`, `stream`, `download`, `xin`, `racing`, `jetzt`, `ren`, `mom`, `party`, `review`, `trade`, `accountants`, `science`, `work`, `ninja`, `xyz`, `faith`, `zip`, `cricket`, `win`, `accountant`, `realtor`, `top`, `christmas`, `gdn`
        - Blue Coat Systems: `link`
        - Statistics: `asia`, `club`, `la`, `ae`, `exposed`, `pe`, `go.id`, `rs`, `k12.pa.us`, `or.kr`, `ce.ke`, `audio`, `gob.pe`, `gov.az`, `website`, `bj`, `mx`, `media`, `sa.gov.au`

- **`statistical_report`**: Detects potentially suspicious URLs based on known patterns within the URL structure or by resolving the domain's IP address and comparing it with IP addresses associated with phishing or spam. Values:
    - **`1`**: The URL is flagged as suspicious due to either matching a risky pattern or an associated IP address.
    - **`0`**: The URL does not match any suspicious patterns or IP addresses.
    - **`2`**: An exception occurred (e.g., if the IP resolution fails), indicating an unresolved case.

#### 2. Content-Based Features (24 columns)

- **Number of hyperlinks present in a website (Kumar Jain'18), `nb_hyperlinks`**: This feature counts the number of `href` and `src` attributes found in the DOM (Document Object Model) of a webpage.
    * Data type: int
    * Nº of unique values: 691

- **`ratio_intHyperlinks`, `ratio_extHyperlinks`, `ratio_nullHyperlinks`**: Ratios of internal, external, and null hyperlinks among all hyperlinks on the page; phishing pages often use more external links
    * `ratio_intHyperlinks`
        * Data type: float
        * Nº of unique values: 3131
    * `ratio_extHyperlinks`
        * Data type: float
        * Nº of unique values: 3131
    * `ratio_nullHyperlinks`
        * Data type: int
        * Nº of unique values: 1 (0)


- **External CSS (Kumar Jain'18), `nb_extCSS`**: The count of external CSS files linked or referenced by a webpage. It measures how many stylesheets are loaded from external sources rather than being embedded directly within the HTML document.
    * Data type: int
    * Nº of unique values: 33

- **`ratio_intRedirection`**: Ratios of internal redirections; phishing may disrupt navigation with errors
    * Data type: int
    * Nº of unique values: 1 (0)


- **`ratio_extRedirection`**: Ratios of external redirections; phishing may disrupt navigation with errors

    * Data type: float
    * Nº of unique values: 894

- **Ratio of internal errors (Kumar Jain'18), `ratio_intErrors`**: This feature represents the proportion of internal links on a webpage that result in errors (HTTP status codes ≥ 400). The calculation is based on the ratio of **failed internal requests** to **total internal links**.
   - **Calculation**:  
     `ratio_intErrors = count of internal errors / total number of internal links`
   - **Data Type**: Integer (`int`)
   - **Number of Unique Values**: 1 (0) — This suggests that there are no internal errors (i.e., no links returned errors across all samples)


- **Ratio of external errors (Kumar Jain'18),`ratio_extErrors`**: This feature calculates the ratio of external links that result in errors (HTTP status codes ≥ 400). It's derived from the proportion of **failed external requests** to **total external links** on the webpage.
   - **Calculation**:  
     `ratio_extErrors = count of external errors / total number of external links`
   - **Data Type**: Float (`float`)
   - **Number of Unique Values**: 8625 — This means there are 8625 unique values for the ratio, indicating variability in error rates across different samples or domains
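
Both error ratios follow the same formula, which can be sketched as below. This is a hedged illustration, not the dataset's extraction code: it assumes the HTTP status code of every link has already been collected into a list, and the function name `error_ratio` is hypothetical.

```python
from typing import Iterable


def error_ratio(status_codes: Iterable[int]) -> float:
    """Share of links whose HTTP status code is 400 or higher."""
    codes = list(status_codes)
    if not codes:
        return 0.0  # no links at all: avoid dividing by zero
    errors = sum(1 for code in codes if code >= 400)
    return errors / len(codes)


print(error_ratio([200, 404, 500, 301]))  # two failures out of four links
print(error_ratio([]))
```

The same function would serve `ratio_intErrors` and `ratio_extErrors`, depending on whether the internal or the external links are passed in.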

- **Having login form link (Kumar Jain'18), `login_form`**: This feature checks whether a login form is present on a webpage. It identifies the form by looking for internal or external form links that contain patterns typically associated with login pages, such as PHP-based login forms.
    - **Calculation**: If the `Form` contains any external or null links, or if any internal or external form link matches the regular expression for PHP-based login pages (`([a-zA-Z0-9\_])+.php`), the function returns `1`, indicating a login form is present. Otherwise, it returns `0`.
    - **Unique Values**:
        * `1` if a login form is present
        * `0` if no login form is present


- **`external_favicon`**: This feature checks if a webpage includes an external favicon. It examines the list of external favicon links and returns `1` if any external favicons are found, and `0` if none are found.

- **`links_in_tags`**: Number of hyperlinks embedded in tags
    * Data type: float
    * Nº of unique values: 473

- **`submit_email`**: This feature detects if the webpage contains any forms (internal or external) that have "mailto:" or "mail()" in their links, indicating email submission functionality. It returns `1` if such a link is found, and `0` if none are found.

- **Ratio of internal media, `ratio_intMedia`**: This feature represents the percentage of internal media links compared to the total media links (both internal and external) found on a webpage. A higher value indicates that the webpage relies more on internal media (hosted within the website's own domain).
    * Data type: float
    * Nº of unique values: 490 (0- 100)

- **Ratio of external media, `ratio_extMedia`**: This feature represents the percentage of external media links compared to the total media links (both internal and external) found on a webpage. A higher value indicates that the webpage relies more on external media (hosted outside the website's domain).
    * Data type: float
    * Nº of unique values: 490 (0- 100)


##### 2.1 Additional content-based features
- **Server Form Handler: sfh in Zaini'2019, `sfh`**: Checks whether a Server Form Handler (SFH) is present on a website based on its form elements. The SFH is the mechanism by which forms are processed on the server side, usually via a script or URL specified in the form's `action` attribute.
    - `0` is the only value of this feature in the dataset, indicating that no form has a null action (i.e., the form is handled properly)
- **IFrame Redirection, `iframe`**: This feature checks whether an invisible iframe is present on the webpage, which could be a potential indicator of redirection or malicious activity 
    - `0`: No invisible iframe detected
    - `1`: Invisible iframe detected
- **Pop up window, `popup_window`**: This feature detects if the `prompt()` function is used in the JavaScript code, which could be used to create pop-up windows that request user input
    - `0`: No pop-up window function detected
    - `1`: Pop-up window function detected
- **Percentile of safe anchor : URL_of_Anchor in Zaini'2019 (Kumar Jain'18), `safe_anchor`**: This feature calculates the percentage of unsafe anchor links in the webpage
    - `0`: No unsafe anchors or no anchors at all
    - `Percentage`: A float representing the percentage of unsafe anchors, ranging from **0 to 100**
- **Onmouse action, `onmouseover`**: This feature identifies the presence of the `onmouseover` JavaScript event, which could potentially be used to trigger actions when the mouse hovers over an element on the page
    - **0**: No `onmouseover` event detected
    - **1**: `onmouseover` event detected
- **Right_clic action, `right_clic`**: This feature checks if JavaScript is preventing the default right-click action (e.g., using `event.button == 2`), which could be an indication of attempts to block right-click functionality on the page
    - `0`: Right-click is not disabled
    - `1`: Right-click is disabled

- **Empty title, `empty_title`**: This feature checks whether the title (*HTML title tag*) of a webpage is empty or not. It returns:
    - `0` if the title is present (non-empty)
    - `1` if the title is empty or missing

- **Domain in page title (Shirazi'18), `domain_in_title`**: Checks whether the domain name appears in the page title
    - `0` if the domain appears in the page title
    - `1` if not

- **Domain after copyright logo (Shirazi'18), `domain_with_copyright`**: checks whether the domain name appears after a copyright symbol or other trademark symbols in the webpage content. 
    - `0`: The domain is found near a copyright, trademark, or registered symbol
    - `1`: The domain is not found near a copyright, trademark, or registered symbol

#### 3. Third-party-based features

- **`domain_registration_length`**: The difference in days between the domain's expiration date and the current date, i.e., how much longer the domain remains registered.
    - If the domain has an expiration date, the function calculates the number of days from the current date.
    - If the expiration date is missing, the function returns 0.
    - If there is an error during the Whois lookup, it returns -1.
    * Nº of unique values: 1659

-  Domain registration age , **`domain_age`**: The function extracts the domain name from the URL and sends a request to an external API (https://input.payapi.io/v1/api/fraud/domain/age/) to fetch the domain's age.

    - `-2`: When the domain age is not available
    - `-1`: When the API request fails
    - `Integer`: The actual domain age, if successfully retrieved

    * Nº of unique values: 4430

-  Domain recognized by WHOIS, **`whois_registered_domain`** checks if a domain is registered by performing a Whois lookup and comparing the domain name with the provided domain.
    - Returns `0` if the domain is registered (matches the Whois data).
    - Returns `1` if the domain is not registered or if there is an error (e.g., no Whois information available or an invalid domain).

- **Page Rank from OPR, `page_rank`**: Indicates the PageRank score of a given domain from the Open PageRank API
    - `0`: No PageRank score (either domain is unranked or there's no valid score)
    - `Positive integer`: PageRank score (a measure of the importance or relevance of the domain)
    - `-1`: Error occurred (e.g., invalid domain or request issue)

- **DNSRecord expiration length, `dns_record`**: Indicates whether the domain has DNS records, used to gauge legitimacy
    - `0`: Domain has DNS records (valid name servers)
    - `1`: Domain does not have DNS records (no valid name servers or an error occurred)


- **Google index, `google_index`**: attempts to check whether a domain or URL is indexed by Google
    - `-1`: Unusual traffic detected (blocked by Google)
    - `0`: URL is indexed by Google.
    - `1`: URL is not indexed by Google.

- **Unable to get web traffic (Page Rank), `web_traffic`**: It fetches the "REACH" rank from Alexa's API using a short URL, which provides the traffic rank of the website. In this dataset:

    - High Rank: Indicates high web traffic
    - Low Rank: Indicates low web traffic
    - `0` as Output: Indicates an error or inability to fetch the data
    * Nº of unique values: 4744
    - Note: As this feature can have a wide range of values depending on website traffic, it should be treated as a continuous variable, and 0 should be handled as missing data or an outlier indicating failure to retrieve the traffic rank.

#### 4. Target Column
- **`status`**: Label indicating if the URL is phishing (`1`) or legitimate (`0`), used for training classification models


### Character count features and their symbols

- **`nb_dots`**: `.`
    * Data type: int
    * Nº of unique values: 19 (1-22)
- **`nb_hyphens`**: `-`
    * Data type: int
    * Nº of unique values: 27 (0-43)
- **`nb_at`**: `@`
    * Data type: int
    * Nº of unique values: 5 (0, 1, 2, 3, 4)
- **`nb_qm`**: `?`
    * Data type: int
    * Nº of unique values: 4 (0, 1, 2, 3)
- **`nb_and`**: `&`
    * Data type: int
    * Nº of unique values: 15 (0-18)
- **`nb_or`**: `|`
    * Data type: int
    * Nº of unique values: 1 (0)
- **`nb_eq`**: `=`
    * Data type: int
    * Nº of unique values: 16 (0, 18)
- **`nb_underscore`**: `_`
    * Data type: int
    * Nº of unique values: 17 (0-18)
- **`nb_tilde`**: `~`
    * Data type: int
    * Nº of unique values: 2 (0, 1)
- **`nb_percent`**: `%`
    * Data type: int
    * Nº of unique values: 25 (0, 96)
- **`nb_slash`**: `/`
    * Data type: int
    * Nº of unique values: 22 (2, 33)
- **`nb_star`**: `*`
    * Data type: int
    * Nº of unique values: 2 (0, 1)
- **`nb_colon`**: `:`
    * Data type: int
    * Nº of unique values: 6 (1, 2, 3, 4, 5, 7)
- **`nb_comma`**: `,`
    * Data type: int
    * Nº of unique values: 5 (0, 1, 2, 3, 4)
- **`nb_semicolumn`**: `;`
    * Data type: int
    * Nº of unique values: 15 (0, 20)
- **`nb_dollar`**: `$`
    * Data type: int
    * Nº of unique values: 6 (0, 1, 2, 3, 6)
- **`nb_space`**: (space)
    * Data type: int
    * Nº of unique values: 9 (0-7, 18)
- count www in url words (Sahingoz2019), **`nb_www`**: `www`
    * Data type: int
    * Nº of unique values: 3 (0, 1, 2)
- count com in url words (Sahingoz2019), **`nb_com`**: `com`
    * Data type: int
    * Nº of unique values: 7 (0, 1, 2, 3, 4, 5, 6)
- **`nb_dslash`**: `//`
    * Data type: int
    * Nº of unique values: 2 (0, 1)
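
The single-character counts in this list can be reproduced with `str.count`. The sketch below is an assumption about how such columns could be computed, not the dataset's own extraction code. The names `SYMBOLS` and `count_symbols` are hypothetical. `nb_www`, `nb_com`, and `nb_dslash` are left out because they are not single-character counts.

```python
# Column name -> symbol, taken from the list above (single characters only).
SYMBOLS = {
    "nb_dots": ".", "nb_hyphens": "-", "nb_at": "@", "nb_qm": "?",
    "nb_and": "&", "nb_or": "|", "nb_eq": "=", "nb_underscore": "_",
    "nb_tilde": "~", "nb_percent": "%", "nb_slash": "/", "nb_star": "*",
    "nb_colon": ":", "nb_comma": ",", "nb_semicolumn": ";",
    "nb_dollar": "$", "nb_space": " ",
}


def count_symbols(url: str) -> dict:
    """Return one count per column for the given URL string."""
    return {name: url.count(symbol) for name, symbol in SYMBOLS.items()}


counts = count_symbols("http://www.example.com/a-b?x=1&y=2")
print(counts["nb_dots"], counts["nb_slash"], counts["nb_eq"])
```

Every URL with a scheme contains `://`, which is consistent with the minimum of 2 reported for `nb_slash` and the minimum of 1 for `nb_colon`.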

### Punycode
Punycode is a way of encoding internationalized domain names (IDNs) that contain non-ASCII characters, such as accented letters or characters from non-Latin alphabets (e.g., Chinese, Arabic). It converts these characters into a standardized ASCII format so they can be processed by the Domain Name System (DNS), which traditionally only supports ASCII characters.

**For example**: The domain name "münchen.de" (Munich in German) would be converted to "xn--mnchen-3ya.de" in Punycode.

In phishing detection, Punycode is relevant because attackers may use lookalike characters from different languages (e.g., replacing "o" with "ο" from the Greek alphabet) to create URLs that appear legitimate to users but actually lead to malicious sites. This technique is called homograph spoofing and is a common tactic in phishing.
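
The conversion can be reproduced with Python's built-in `idna` codec. The helper names `to_ascii_hostname` and `has_punycode_label` are hypothetical, and the label-based check is only an illustration: as described earlier, the dataset's `punycode` feature tests the URL prefix rather than each label.

```python
def to_ascii_hostname(hostname: str) -> str:
    """Encode an internationalized hostname into its ASCII (Punycode) form."""
    return hostname.encode("idna").decode("ascii")


def has_punycode_label(hostname: str) -> bool:
    """True if any dot-separated label carries the 'xn--' Punycode prefix."""
    return any(label.startswith("xn--") for label in hostname.lower().split("."))


encoded = to_ascii_hostname("münchen.de")
print(encoded)                        # xn--mnchen-3ya.de
print(has_punycode_label(encoded))    # True
print(has_punycode_label("example.com"))  # False
```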

### `shortening_service`

The `shortening_service` feature indicates whether the URL uses a link-shortening service, like bit.ly, tinyurl.com, or goo.gl. URL shorteners create short, obfuscated links that redirect to a longer URL.

In phishing detection, this feature is essential because shortened URLs can hide the true destination of a link, making it easier for attackers to disguise malicious websites. Phishing schemes frequently use shortened URLs to deceive users into clicking on links that appear harmless but lead to fraudulent sites.
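
A check of this kind can be sketched as a hostname lookup. The snippet is a minimal illustration under stated assumptions: the set `SHORTENERS` contains only the three services named above (a real list would be much longer), and the function name `uses_shortening_service` is hypothetical.

```python
from urllib.parse import urlparse

# Only the services mentioned in the text; an assumption, not the dataset's list.
SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl"}


def uses_shortening_service(url: str) -> int:
    """Return 1 if the URL's hostname is a known shortener, else 0."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return 1 if host in SHORTENERS else 0


print(uses_shortening_service("https://bit.ly/3abcd"))
print(uses_shortening_service("https://example.com/bit.ly"))
```

Matching on the parsed hostname, rather than searching the whole URL string, avoids false positives when a shortener's name merely appears in the path.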



### Prefix suffix

Regular Expression Pattern: `r"https?://[^\-]+-[^\-]+/"`

This regular expression pattern is designed to identify URLs that follow a specific structure. Here’s a breakdown of each part:

- `https?`: Matches either `http` or `https` at the beginning of the URL.
- `://`: Matches the literal `://` that follows `http` or `https`.
- `[^\-]+`: Matches one or more characters that are **not** a hyphen (`-`).
- `-`: Matches a hyphen character within the URL.
- `[^\-]+`: Again matches one or more characters that are **not** a hyphen (`-`).
- `/`: Matches a forward slash at the end of the specified segment.

#### Purpose
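This pattern matches URLs that start with `http://` or `https://`, followed by a run of non-hyphen characters, one hyphen, a second run of non-hyphen characters, and a forward slash. Because `[^\-]` also matches `.` and `/`, the hyphen can sit in the hostname (e.g., `my-site.com/`) or in a path segment (e.g., `abc-xyz/`).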

#### Example Matches
- `http://example.com/abc-xyz/` will match.
- `https://domain.com/a-b/` will match.
- `http://site.com/abc/` will **not match** because it contains no hyphen.

This regular expression is useful for flagging URLs that contain a hyphen-separated name before a slash, whether in the hostname or in a directory segment.
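
The pattern can be tried directly with Python's `re` module. The sketch below assumes the pattern is applied with `re.search` and mapped to a 0/1 flag; the wrapper name `prefix_suffix_flag` is hypothetical.

```python
import re

# The pattern quoted above, compiled once.
PREFIX_SUFFIX = re.compile(r"https?://[^\-]+-[^\-]+/")


def prefix_suffix_flag(url: str) -> int:
    """Return 1 if the pattern is found in the URL, else 0."""
    return 1 if PREFIX_SUFFIX.search(url) else 0


for url in ("http://example.com/abc-xyz/", "https://domain.com/a-b/",
            "http://site.com/abc/", "http://my-site.com/"):
    print(url, prefix_suffix_flag(url))
```

The last URL shows that a hyphen in the hostname is enough to trigger the pattern, since `[^\-]` places no restriction on dots or slashes.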


### Statistical Report

This feature is designed to detect potentially suspicious URLs based on known patterns within the URL structure or by resolving specific IP addresses associated with phishing or spam. It operates through two primary checks:

1. **URL Pattern Matching**: 
   - The function verifies if the URL contains any known suspicious domain patterns, such as `at.ua`, `usa.cc`, or `baltazarpresentes.com.br`. 
   - If any of these patterns are found within the URL, it flags the entry as potentially risky.

2. **IP Address Pattern Matching**:
   - The function attempts to resolve the domain's IP address using `socket.gethostbyname()`.
   - It then checks if this IP address matches any from a predefined list of known malicious IP addresses. 
   - If the IP address matches one of these, the URL is flagged as suspicious.

Values:
- **`1`**: The URL is flagged as suspicious due to either matching a risky pattern or an associated IP address.
- **`0`**: The URL does not match any suspicious patterns or IP addresses.
- **`2`**: An exception occurred (e.g., if the IP resolution fails), indicating an unresolved case.
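
The two checks and the three return values can be sketched as follows. This is an illustration under stated assumptions, not the dataset's code: the URL patterns are the three named above, `SUSPICIOUS_IPS` holds a placeholder address from a documentation range rather than a real blocklist, and the `resolve` parameter is a hypothetical hook that lets the DNS lookup be replaced in tests.

```python
import socket
from urllib.parse import urlparse

SUSPICIOUS_PATTERNS = ("at.ua", "usa.cc", "baltazarpresentes.com.br")
SUSPICIOUS_IPS = {"192.0.2.1"}  # placeholder only, not a real blocklist


def statistical_report(url: str, resolve=socket.gethostbyname) -> int:
    """Return 1 (suspicious), 0 (clean), or 2 (IP resolution failed)."""
    url_match = any(pattern in url for pattern in SUSPICIOUS_PATTERNS)
    host = urlparse(url).hostname
    if host is None:
        return 2  # nothing to resolve
    try:
        ip_address = resolve(host)
    except (OSError, UnicodeError):
        return 2  # unresolved case
    return 1 if url_match or ip_address in SUSPICIOUS_IPS else 0


# Offline usage with a stub resolver instead of a real DNS lookup.
print(statistical_report("http://shop.at.ua/login", resolve=lambda h: "198.51.100.7"))
print(statistical_report("http://example.com/", resolve=lambda h: "198.51.100.7"))
```

In this sketch a failed lookup yields `2` even when the URL pattern matches, which is one possible reading of the "unresolved case" described above.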


### `domain_registration_age`

The `domain_registration_age` feature calculates the number of days remaining until the domain's expiration date.

#### Explanation:
- This feature queries the domain's WHOIS data to retrieve its expiration date.
- It then compares this expiration date with the current date and returns the number of days left before the domain expires.
- If no expiration date is available, the function returns `0`, indicating that the domain's registration length could not be determined.
- If there is an error fetching the information, the function returns `-1`.
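
The date arithmetic behind this feature can be sketched without performing a WHOIS lookup. The snippet assumes the expiration date has already been retrieved (or is `None` when missing); the function name `days_until_expiration` and the explicit `today` parameter are hypothetical and exist only to keep the example deterministic.

```python
from datetime import datetime
from typing import Optional


def days_until_expiration(expiration_date: Optional[datetime],
                          today: datetime) -> int:
    """Days left before the domain expires; 0 when no expiration date is known."""
    if expiration_date is None:
        return 0
    return (expiration_date - today).days


print(days_until_expiration(datetime(2025, 1, 31), datetime(2025, 1, 1)))
print(days_until_expiration(None, datetime(2025, 1, 1)))
```

A caller wrapping a real WHOIS query would return `-1` when the lookup itself raises an error, as described above.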

---

### `domain_registration_length`

The `domain_registration_length` feature performs two checks: one for domain name validation and another for domain expiration.

#### Explanation:
- The feature first verifies whether the domain name matches the WHOIS-recorded domain name. If there is a mismatch, it assigns `1` (indicating potential inconsistency).
- Then, it retrieves the expiration date from WHOIS records and calculates the number of days left until the domain expires, similar to the first feature.
- If no expiration date is found, it returns `0`.
- The function returns two values: 
  - `v1`: A flag indicating if the domain name matches the WHOIS-recorded domain name (`0` for match, `1` for mismatch).
  - `v2`: The number of days left until the domain expiration or `-1` if no expiration date is found.

This feature adds a layer of verification for domain name consistency along with the expiration date calculation.


### Domain after copyright logo (Shirazi'18), `domain_with_copyright`

This feature checks whether the domain name appears near a copyright (©), trademark (™), or registered (®) symbol in the content of the webpage. 

#### Explanation:
- The function searches for copyright (©), trademark (™), or registered (®) symbols in the webpage content.
- It then checks if the domain name appears within a range of 50 characters before or after the symbol.
- If the domain appears near one of these symbols, it returns **0** (indicating no issue).
- If the domain does not appear near the symbol, it returns **1** (indicating a possible issue).
- If no copyright, trademark, or registered symbol is found, it defaults to returning **0**.
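
The rules above can be sketched with a regular expression and a character window. This is a hedged illustration: it assumes only the first symbol found is inspected and that the comparison is case-insensitive. The function name and the `window` parameter are hypothetical.

```python
import re

# Copyright, trademark, and registered symbols.
SYMBOL = re.compile("[\u00a9\u2122\u00ae]")


def domain_with_copyright(domain: str, content: str, window: int = 50) -> int:
    """Return 0 if the domain appears near the first symbol (or no symbol exists), else 1."""
    match = SYMBOL.search(content)
    if match is None:
        return 0  # no symbol found: default value
    start = max(0, match.start() - window)
    nearby = content[start:match.end() + window]
    return 0 if domain.lower() in nearby.lower() else 1


print(domain_with_copyright("example", "Footer \u00a9 2024 Example Inc."))
print(domain_with_copyright("example", "\u00a9 2024 Other Corp"))
print(domain_with_copyright("example", "no symbols here"))
```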


### Percentile of Safe Anchor: URL_of_Anchor in Zaini'2019 (Kumar Jain'18)

This feature calculates the percentage of unsafe anchor links in the webpage.

In web development, anchors refer to HTML elements used to create hyperlinks. An anchor (`<a>` tag) allows users to click on a link and navigate to another location, which could be another webpage, a different section within the same page, a downloadable file, or an external URL.

Example: `<a href="https://example.com">Click here</a>`

#### Explanation:
- The function calculates the total number of anchor links classified as "safe" and "unsafe."
- It then computes the percentage of "unsafe" anchors relative to the total number of anchors.
- If there are no anchors or all anchors are classified as safe, the function returns **0**.
- The result is the percentage of unsafe anchors compared to the total anchors on the webpage.

#### Unique Values:
- **0**: No unsafe anchors or no anchors at all.
- **Percentage**: A float representing the percentage of unsafe anchors, ranging from **0 to 100**.
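
The calculation reduces to a guarded percentage. The sketch below assumes the anchors have already been classified and only their counts are passed in; the function name `unsafe_anchor_percentage` is hypothetical.

```python
def unsafe_anchor_percentage(n_safe: int, n_unsafe: int) -> float:
    """Percentage of unsafe anchors among all anchors; 0 when there are none."""
    total = n_safe + n_unsafe
    if total == 0:
        return 0.0  # no anchors on the page
    return n_unsafe / total * 100


print(unsafe_anchor_percentage(3, 1))  # one unsafe anchor out of four
print(unsafe_anchor_percentage(0, 0))
```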


