Skip to content

Conversation

dunglas
Copy link
Member

@dunglas dunglas commented Sep 16, 2014

Introduce a new FILTER_VALIDATE_DOMAIN filter to validate domain name and label lengths according to RFCs. It does not check characters.
A FILTER_FLAG_HOSTNAME is also available to specifically validate hostnames (they must start with an alphanumeric character and contains only [a-z-]). This flag is used in FILTER_VALIDATE_URL.

  • According to RFC 1034, a domain name label cannot exceed 63 characters
  • According to RFC 1035 the total length of a hostname cannot exceed 253 characters (optional final dot not included)
  • According to RFC 952 and RFC 1123 a domain name label cannot start nor end with a non alphanumeric char (the current implementation only check the first character of the hostname)
  • According to RFC 2732 an IPv6 address enclosed with square brackets is a valid host
  • According to RFC 2181 and RFC 1123 a domain can contain underscores but a hostname can't

See bug #68039

@dunglas dunglas force-pushed the validate_url_length branch 2 times, most recently from 5c721f6 to 475238e Compare September 17, 2014 20:41
@dunglas dunglas changed the title FILTER_VALIDATE_URL: check max length of domain name label Enhance RFC conformance of FILTER_VALIDATE_URL Sep 17, 2014
@dunglas dunglas changed the title Enhance RFC conformance of FILTER_VALIDATE_URL Better RFC conformance for FILTER_VALIDATE_URL Sep 17, 2014
@dunglas dunglas changed the title Better RFC conformance for FILTER_VALIDATE_URL Better RFC conformance for FILTER_VALIDATE_URL (bug #68039) Sep 17, 2014
@datibbaw
Copy link
Contributor

It's kind of hard to tell what you have changed because you moved the function.

@dunglas
Copy link
Member Author

dunglas commented Sep 18, 2014

@datibbaw only in the last commit (IPv6 handling). It is necessary to move either php_filter_validate_url or _php_filter_validate_ipv6 because the last one is static.

@dunglas
Copy link
Member Author

dunglas commented Sep 18, 2014

@datibbaw do you want I put a diff of the function only in comments?

@datibbaw
Copy link
Contributor

You can add a forward declaration to keep it in place :)

@dunglas
Copy link
Member Author

dunglas commented Sep 18, 2014

@datibbaw Isn't it too much just for using a diff UI?

@datibbaw
Copy link
Contributor

It would make it easier for somebody reviewing your code changes (i.e. yours truly); no changes are required if you don't care about that.

@dunglas
Copy link
Member Author

dunglas commented Sep 18, 2014

I'll add the definition for reviewing and move the function before the merge if you agree with that.

@dunglas dunglas force-pushed the validate_url_length branch 2 times, most recently from 3594cde to 6df900c Compare September 18, 2014 19:10
@dunglas
Copy link
Member Author

dunglas commented Sep 18, 2014

@datibbaw done.

@dunglas dunglas force-pushed the validate_url_length branch from 6df900c to 81ac5b9 Compare September 18, 2014 20:28
@dunglas
Copy link
Member Author

dunglas commented Sep 24, 2014

What should be done to get this PR merged?
I'll use those commits to fix #68049 and to add an IDN support proposal (see internals mailing list).

@DaveRandom
Copy link
Contributor

According to RFC 1035 the total length of a hostname cannot exceed 253 characters (optional final dot not included)

This applies specifically to a DNS name. A URI may contain a registered name that is not a DNS name. To quote the relevant passage of RFC 3986 sec 3.2.2 (emphasis mine):

A host identified by a registered name is a sequence of characters usually intended for lookup within a locally defined host or service name registry, though the URI's scheme-specific semantics may require that a specific registry (or fixed name table) be used instead. The most common name registry mechanism is the Domain Name System (DNS).

Note "most common" and not "only".

This also applies to the other DNS-specific changes here. FILTER_VALIDATE_URL is validating the URL, not content of the components.

The IPv6 change is valid, though.

@dunglas
Copy link
Member Author

dunglas commented Oct 14, 2014

@DaveRandom according to the WHATWG URL Living Standard and RFC 2396 this is not true: an URI can have any valid host but an URL host must be a valid IP address (v4 or v6) or domain (DNS).

Relevant quotes:

https://url.spec.whatwg.org/#valid-domain

A domain is a valid domain if these steps return success:
Let result be the result of running Unicode ToASCII with domain_name set to domain, UseSTD3ASCIIRules set to true, processing_option set to Nontransitional_Processing, and VerifyDnsLength set to true.

http://www.unicode.org/reports/tr46/tr46-12.html#ToASCII

If VerifyDnsLength flag is true, then verify DNS length restrictions. This may record an error. For more information, see [STD13] and [STD3].
The length of the domain name, excluding the root label and its dot, is from 1 to 253.
The length of each label is from 1 to 63.

RFC 2396 (RFC 1034 and RFC1123 specifies DNS)

Hostnames take the form described in Section 3 of [RFC1034] and
Section 2.1 of [RFC1123]: a sequence of domain labels separated by
".", each domain label starting and ending with an alphanumeric
character and possibly also containing "-" characters. The rightmost
domain label of a fully qualified domain name will never start with a
digit, thus syntactically distinguishing domain names from IPv4
addresses, and may be followed by a single "." if it is necessary to
distinguish between the complete domain name and any local domain.
To actually be "Uniform" as a resource locator, a URL hostname should
be a fully qualified domain name. In practice, however, the host
component may be a local domain literal.

Indeed, URI may contain other host types but this filter is called FILTER_VALIDATE_URL not URI. If it is intended to validate URIs, it's confusing.
The current implementation returns false for a lot of legal URIs (but invalids URLs) such as:

var_dump(filter_var("urn:ietf:rfc:2141", FILTER_VALIDATE_URL));
// bool(false)

I think it make sense that FILTER_VALIDATE_URL validate only URL (in the WHATWG / RFC sense and as most web developers intend). Maybe can we introduce a new URI validator for more general URIs?

@dunglas
Copy link
Member Author

dunglas commented Oct 14, 2014

And from the same RFC you quoted:

This specification defines the generic URI syntax and a process for
resolving URI references that might be in relative form, along with
guidelines and security considerations for the use of URIs on the
Internet. The URI syntax defines a grammar that is a superset of all
valid URIs, allowing an implementation to parse the common components
of a URI reference without knowing the scheme-specific requirements
of every possible identifier. This specification does not define a
generative grammar for URIs; that task is performed by the individual
specifications of each URI scheme.

In my patch, DNS validity is checked only for HTTP and HTTPS schemes (it should be also applicable to FTP btw).

@DaveRandom
Copy link
Contributor

@dunglas Indeed, I re-checked the PHP docs for what this filter actually does and I see it is validating specifically URLs and not generic URIs.

In order to make this check valid and fill in a couple of blanks in the filter API, IMHO you should do the following things as part of this PR:

  • Create FILTER_VALIDATE_DNS_NAME, which performs full name validation.
  • Add a flag for FILTER_VALIDATE_IP which, when (and only when) used with FILTER_FLAG_IPV6 will allow the enclosing [] pair.
  • Refactor the host validation logic to make use of these filters (as it already partially does).

This would plug a couple of holes in the userland API - I know I've written code to work around the square bracket validation with values extracted via parse_url() before, and a DNS name validator would have had value for me in the past as well. It would also improve encapsulation and DRYness.

Thoughts?

@dunglas
Copy link
Member Author

dunglas commented Oct 14, 2014

+1 for the new FILTER_VALIDATE_DNS_NAME but -1 checking for enclosing [] in FILTER_VALIDATE_IP. It's only relevant for URLs, not standalone IPv6 addresses: https://www.ietf.org/rfc/rfc2732.txt

IMO FILTER_VALIDATE_DNS_NAME should also allow IDNa hostnames. But it will require making ICU a dependency of ext/filter (as discussed on the mailing list). Of course IDNa support can be done after the merge or this PR. What do you think about that?

@DaveRandom
Copy link
Contributor

-1 to check for enclosing [] in FILTER_VALIDATE_IP. It's only relevant for URLs, not standalone I v6 addresses: https://www.ietf.org/rfc/rfc2732.txt

While this is absolutely true and unarguable, all I'm suggesting is a convenience flag to make this userland code a bit tidier (slightly modified from real code, only error handling and class refs removed):

$parts = parse_url($url);
if ($host = filter_var($parts['host'], FILTER_VALIDATE_IP, FILTER_FLAG_IPV4)) {
    $mode = IPv4_ADDR;
} else if (($host = filter_var($parts['host'], FILTER_VALIDATE_IP, FILTER_FLAG_IPV6)) || ($parts['host'][0] == '[' && substr($parts['host'], -1) == ']' && $host = filter_var(substr($parts['host'], 1, -1), FILTER_VALIDATE_IP, FILTER_FLAG_IPV6))) {
    $host = '[' . $host . ']';
    $mode = IPv6_ADDR;
} else if (/* userland DNS name validation routine */) {
    $mode = DNS_NAME;
} else {
    // invalid
}

As you can imagine, I would much rather have written something like:

$parts = parse_url($url);
if ($host = filter_var($parts['host'], FILTER_VALIDATE_IP, FILTER_FLAG_IPV4)) {
    $mode = IPv4_ADDR;
} else if ($host = filter_var($parts['host'], FILTER_VALIDATE_IP, FILTER_FLAG_IPV6 | FILTER_FLAG_IPV6_FROM_URL)) {
    $mode = IPv6_ADDR;
} else if ($host = filter_var($parts['host'], FILTER_VALIDATE_DNS_NAME)) {
    $mode = DNS_NAME;
} else {
    // invalid
}

All I'm suggesting is an optional, non-default flag to allow the square brackets. This code also shows where I would have found the DNS name validation useful.

IMO FILTER_VALIDATE_DNS_NAME should also allow IDNa hostnames. But it will require making ICU a dependency of ext/filter (as discussed on the mailing list). Of course IDNa support can be done after the merge or this PR. What do you think about that?

Indeed. We have spoken about this before on-list, and as I mentioned at the time I have two major thoughts about this:

  1. I would not want filters to implicitly allow IDN unless the streams API also supported it.
  2. I have already done some PoC work to make this happen (the streams side of it). Unfortunately it's not as simple as just handling the host at IP resolution time, because it must also be supported by SSL.

I know that when we initially spoke about this I said I would tidy my work up and make it public. Owing to various issues IRL I haven't yet had time to do this, I will either get to this in the next few days or hand it over to @rdlowrey or @datibbaw to do because I OWN THEM1.

1 Actually I just like to annoy them.

@dunglas
Copy link
Member Author

dunglas commented Oct 14, 2014

An hostname validator can be useful in a lot of case. However all URL hostnames (domain in the WHATWG standard) are valid DNS domains (or IPs) but all DNS domains are not valid hostnames.
For instance, a DNS domain can contain underscores but URL hostnames can't: http://stackoverflow.com/a/2183140/1352334
The new FILTER_VALIDATE_DNS_NAME must have a FILTER_FLAG_URL_DOMAIN too.

For this new flag (for the new filter), I suggest implementing this URL Living Standard algorithm: https://url.spec.whatwg.org/#host-parsing (for that too we can use ICU if we make it a core dependency of PHP: http://icu-project.org/apiref/icu4c432/uidna_8h.html#aaf3bec2415dd99b4221eeebb723eb082).

About FILTER_VALIDATE_IP, instead of "crowding" it with a new flag for a very specific use case, I suggest enhancing parse_url adding a new component called host_type returning the host type (domain for IPv4 and domains and IPv6 for IPv6 - currently in the spec IPv4 are treated like domains, see https://www.w3.org/Bugs/Public/show_bug.cgi?id=26431).

It can be used internaly by FILTER_VALIDATE_URL (it already use parse_url) and it's even better (IMHO) for the use case you described:

$parts = parse_url($url);

switch ($parts['host_type']) {
    case PHP_URL_HOST_IPV6:
        $mode = IPv6_ADDR;
        // $ipv6 = substr($parts['host'], 1, -1);
       break;

    default:
        if (filter_var($parts['host'], FILTER_VALIDATE_IP, FILTER_FLAG_IPV4)) {
            $mode = IPv4_ADDR;
        } else {
            $mode = DNS_NAME;
        }
}

If everyone agree on this proposal, the work to do on PHP is:

  1. Make ICU a dependency for ext/filter
  2. Create a new FILTER_VALIDATE_DOMAIN with an optional flag FILTER_FLAG_URL_DOMAIN
  3. Make a new component host_type available with parse_url
  4. Update FILTER_VALIDATE_URL using those new features

Should we formalize that with a RFC or, as it's mostly fixing the current behavior, this issue is enough?

@dunglas dunglas changed the title Better RFC conformance for FILTER_VALIDATE_URL (bug #68039) New FILTER_VALIDATE_DOMAIN and better RFC conformance for FILTER_VALIDATE_URL Nov 5, 2014
@dunglas
Copy link
Member Author

dunglas commented Nov 5, 2014

  • Added a new FILTER_VALIDATE_DOMAIN filter validating domain names
  • Added a FILTER_FLAG_HOSTNAME flag to allow checking hostnames (_ are forbidden in hostname but not in domains)
  • Changed FILTER_VALIDATE_URL to use this new validator

When #890 will be merge, it will be easy to add IDN support to this new domain validator. What should be done next to have this code merged?
Travis build fail because of the forward declaration wanted by @datibbaw (it compiles locally). I'll remove it (and move the code) when the code will be reviewed and validated.

@dunglas dunglas force-pushed the validate_url_length branch from 7e410d3 to 728945d Compare November 14, 2014 09:02
@dunglas
Copy link
Member Author

dunglas commented Nov 14, 2014

  • Validate characters only when FILTER_FLAG_HOSTNAME is set.
  • Fixed memory leak.
  • Rebased.

@php-pulls php-pulls merged commit 54f06d2 into php:master Nov 27, 2014
@whatthejeff
Copy link
Contributor

@dunglas Should there be a flag for FILTER_VALIDATE_DOMAIN that validates FQDNs?

@dunglas
Copy link
Member Author

dunglas commented Dec 16, 2014

@whatthejeff Why not. Can be useful!

@Emdek
Copy link

Emdek commented Mar 16, 2016

It's very nice addition but it is missing documentation in manual...

@frederikbosch
Copy link
Contributor

Where can we find the documentation of this page: http://php.net/manual/en/filter.filters.validate.php. I was looking for it so I could contribute the missing ones based on this PR.

@jaisato
Copy link

jaisato commented Aug 23, 2017

Hi,
I used the filter FILTER_VALIDATE_DOMAIN with values starting and ending by special characters and it didn't work (in PHP version 7.1.7). For example, the value "$1271.com#" is not valid but the filter returns it as valid domain. It must return false.

I don't recommend using that filter for now.

@dunglas dunglas deleted the validate_url_length branch December 29, 2017 10:19
salathe pushed a commit to salathe/phpdoc-en that referenced this pull request Apr 9, 2018
…but never documented. This attempts to document it according to the notes on the original PR 826: php/php-src#826

Fixes #72013

-- 
Provided by anonymous 90461 (kevin.boyd@gmail.com)

git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@344641 c90b9560-bf6c-de11-be94-00142212c4b1
LawnGnome pushed a commit to LawnGnome/phpdoc-en that referenced this pull request Apr 9, 2018
…but never documented. This attempts to document it according to the notes on the original PR 826: php/php-src#826

Fixes #72013

-- 
Provided by anonymous 90461 (kevin.boyd@gmail.com)

git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@344641 c90b9560-bf6c-de11-be94-00142212c4b1
svn2github pushed a commit to svn2github/phpdoc_en that referenced this pull request Apr 10, 2018
…but never documented. This attempts to document it according to the notes on the original PR 826: php/php-src#826

Fixes #72013

-- 
Provided by anonymous 90461 (kevin.boyd@gmail.com)

git-svn-id: http://svn.php.net/repository/phpdoc/en@344641 c90b9560-bf6c-de11-be94-00142212c4b1
petski pushed a commit to petski/PHPMailer that referenced this pull request Dec 3, 2019
- PHP way of doing so (php/php-src#826)
- Improved and shortened regexp a bit
petski pushed a commit to petski/PHPMailer that referenced this pull request Dec 3, 2019
- PHP way of doing so (php/php-src#826)
- Improved and shortened regexp a bit
petski pushed a commit to petski/PHPMailer that referenced this pull request Dec 3, 2019
- PHP way of doing so (php/php-src#826)
- Improved and shortened regexp a bit
petski pushed a commit to petski/PHPMailer that referenced this pull request Dec 3, 2019
- PHP way of doing so (php/php-src#826)
- Improved and shortened regexp a bit
petski pushed a commit to petski/PHPMailer that referenced this pull request Dec 3, 2019
- PHP way of doing so (php/php-src#826)
- Improved and shortened regexp a bit
petski pushed a commit to petski/PHPMailer that referenced this pull request Dec 3, 2019
- PHP way of doing so (php/php-src#826)
- Improved and shortened regexp a bit
petski pushed a commit to petski/PHPMailer that referenced this pull request Dec 3, 2019
- PHP way of doing so (php/php-src#826)
- Improved and shortened regexp a bit
petski pushed a commit to petski/PHPMailer that referenced this pull request Dec 3, 2019
- PHP way of doing so (php/php-src#826)
- Improved and shortened regexp a bit
petski pushed a commit to petski/PHPMailer that referenced this pull request Dec 5, 2019
- PHP way of doing so (php/php-src#826)
- Improved and shortened regexp a bit
Synchro pushed a commit to PHPMailer/PHPMailer that referenced this pull request Dec 10, 2019
* Let isValidHost() determine validness of host

- PHP way of doing so (php/php-src#826)
- Improved and shortened regexp a bit

* Give more specific debug message in case host(entry) is invalid

* Rewrite filter_var checks

* Host [[<ipv6>]] is not valid
salathe pushed a commit to salathe/phpdoc-en that referenced this pull request Sep 3, 2020
…but never documented. This attempts to document it according to the notes on the original PR 826: php/php-src#826

Fixes #72013

-- 
Provided by anonymous 90461 (kevin.boyd@gmail.com)

git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@344641 c90b9560-bf6c-de11-be94-00142212c4b1
@dunglas
Copy link
Member Author

dunglas commented Dec 30, 2022

@jaisato sorry for the very old reply but it's intended and stated in the PR description. Domains can contain special characters. Hostnames can't. In your case, you should use the FILTER_FLAG_HOSTNAME flag: https://3v4l.org/114Sf

SuperButterfly added a commit to SuperButterfly/PHPMailer that referenced this pull request Aug 11, 2023
* Let isValidHost() determine validness of host

- PHP way of doing so (php/php-src#826)
- Improved and shortened regexp a bit

* Give more specific debug message in case host(entry) is invalid

* Rewrite filter_var checks

* Host [[<ipv6>]] is not valid
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants