Skip to content
A PHP module for converting plain text to HTML, and any URLs in the text into HTML links.
PHP
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
README.rst
UrlLinker-example.php
UrlLinker.php

README.rst

UrlLinker

UrlLinker is a PHP module for converting plain text snippets to HTML, and any web addresses in the text into HTML hyperlinks.

Usage:

print(htmlEscapeAndLinkUrls($text));

For a longer example, see UrlLinker-example.php.

UrlLinker assumes plain text input, and returns HTML. If your input is already HTML, but it contains URLs that have not been marked up, UrlLinker can handle that as well:

print(linkUrlsInTrustedHtml($html));

Warning: The latter function must only be used on trusted input, as rendering HTML provided by a malicious user can lead to system compromise through cross-site scripting. The htmlEscapeAndLinkUrls function, on the other hand, can safely be used on untrusted input. (You can remove existing tags from untrusted input via PHP's strip_tags function.)

Recognized addresses

  • Web addresses
    • Recognized URL schemes: "http" and "https"
      • The http:// prefix is optional.
      • Support for additional schemes, e.g. "ftp", can easily be added by tweaking $rexScheme.
      • The scheme must be written in lower case. This requirement can be lifted by adding an i (the PCRE_CASELESS modifier) to $rexUrlLinker.
    • Hosts may be specified using domain names or IPv4 addresses.
      • IPv6 addresses are not supported.
    • Port numbers are allowed.
    • Internationalized Resource Identifiers (IRIs) are allowed. Note that the job of converting IRIs to URIs is left to the user's browser.
    • To reduce false positives, UrlLinker verifies that the top-level domain is on the official IANA list of valid TLDs.
      • UrlLinker is updated from time to time as the TLD list is expanded.
      • In the future, this approach may collapse under ICANN's ill-advised new policy of selling arbitrary TLDs for large amounts of cash, but for now it is an effective method of rejecting invalid URLs.
      • Internationalized top-level domain names must be written in Punycode in order to be recognized.
      • If you need to support unqualified domain names, such as localhost, you may disable the TLD check by 1) replacing + with * in the $rexDomain value and 2) replacing the if statement line beneath the "Check that the TLD is valid" comment with if (true). This is obviously a quick-and-dirty hack, and may cause false positives.
  • Email addresses
    • Supports the full range of commonly used address formats, including "plus addresses" (as popularized by Gmail).
    • Does not recognized the more obscure address variants that are allowed by the RFCs but never seen in practice.
    • Simplistic spam protection: The at-sign is converted to a HTML entity, foiling naive email address harvesters.
  • Addresses are recognized correctly in normal sentence contexts. For instance, in "Visit stackoverflow.com.", the final period is not part of the URL.
  • User input is properly sanitized to prevent cross-site scripting (XSS), and ampersands in URLs are correctly escaped as & (this does not apply to the linkUrlsInTrustedHtml function, which assumes its input to be valid HTML).

Background

A Stackoverflow.com question prompted me to consider the difficulty of this task. Initially, it seemed easy, but like an itch you just have to scratch, I kept coming back to it, to fix just one more little thing.

Feel free to upvote my answer if you find this code useful.

There's also a C# implementation by Antoine Sottiau.

Public Domain Dedication

To the extent possible under law, the author has waived all copyright and related or neighboring rights to UrlLinker.

For more information see: http://creativecommons.org/publicdomain/zero/1.0/

You can’t perform that action at this time.