Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New FILTER_VALIDATE_DOMAIN and better RFC conformance for FILTER_VALIDATE_URL #826

Merged
merged 4 commits into from Nov 27, 2014

Conversation

9 participants
@dunglas
Copy link
Contributor

commented Sep 16, 2014

Introduce a new FILTER_VALIDATE_DOMAIN filter to validate domain name and label lengths according to RFCs. It does not check characters.
A FILTER_FLAG_HOSTNAME is also available to specifically validate hostnames (they must start with an alphanumeric character and contains only [a-z-]). This flag is used in FILTER_VALIDATE_URL.

  • According to RFC 1034, a domain name label cannot exceed 63 characters
  • According to RFC 1035 the total length of a hostname cannot exceed 253 characters (optional final dot not included)
  • According to RFC 952 and RFC 1123 a domain name label cannot start nor end with a non alphanumeric char (the current implementation only check the first character of the hostname)
  • According to RFC 2732 an IPv6 address enclosed with square brackets is a valid host
  • According to RFC 2181 and RFC 1123 a domain can contain underscores but a hostname can't

See bug #68039

@dunglas dunglas force-pushed the dunglas:validate_url_length branch 2 times, most recently from 5c721f6 to 475238e Sep 17, 2014

@dunglas dunglas changed the title FILTER_VALIDATE_URL: check max length of domain name label Enhance RFC conformance of FILTER_VALIDATE_URL Sep 17, 2014

@dunglas dunglas changed the title Enhance RFC conformance of FILTER_VALIDATE_URL Better RFC conformance for FILTER_VALIDATE_URL Sep 17, 2014

@dunglas dunglas changed the title Better RFC conformance for FILTER_VALIDATE_URL Better RFC conformance for FILTER_VALIDATE_URL (bug #68039) Sep 17, 2014

@datibbaw

This comment has been minimized.

Copy link
Contributor

commented Sep 18, 2014

It's kind of hard to tell what you have changed because you moved the function.

@dunglas

This comment has been minimized.

Copy link
Contributor Author

commented Sep 18, 2014

@datibbaw only in the last commit (IPv6 handling). It is necessary to move either php_filter_validate_url or _php_filter_validate_ipv6 because the last one is static.

@dunglas

This comment has been minimized.

Copy link
Contributor Author

commented Sep 18, 2014

@datibbaw do you want I put a diff of the function only in comments?

@datibbaw

This comment has been minimized.

Copy link
Contributor

commented Sep 18, 2014

You can add a forward declaration to keep it in place :)

@dunglas

This comment has been minimized.

Copy link
Contributor Author

commented Sep 18, 2014

@datibbaw Isn't it too much just for using a diff UI?

@datibbaw

This comment has been minimized.

Copy link
Contributor

commented Sep 18, 2014

It would make it easier for somebody reviewing your code changes (i.e. yours truly); no changes are required if you don't care about that.

@dunglas

This comment has been minimized.

Copy link
Contributor Author

commented Sep 18, 2014

I'll add the definition for reviewing and move the function before the merge if you agree with that.

@dunglas dunglas force-pushed the dunglas:validate_url_length branch 2 times, most recently from 3594cde to 6df900c Sep 18, 2014

@dunglas

This comment has been minimized.

Copy link
Contributor Author

commented Sep 18, 2014

@datibbaw done.

@dunglas dunglas force-pushed the dunglas:validate_url_length branch from 6df900c to 81ac5b9 Sep 18, 2014

@dunglas

This comment has been minimized.

Copy link
Contributor Author

commented Sep 24, 2014

What should be done to get this PR merged?
I'll use those commits to fix #68049 and to add an IDN support proposal (see internals mailing list).

@DaveRandom

This comment has been minimized.

Copy link
Contributor

commented Oct 14, 2014

According to RFC 1035 the total length of a hostname cannot exceed 253 characters (optional final dot not included)

This applies specifically to a DNS name. A URI may contain a registered name that is not a DNS name. To quote the relevant passage of RFC 3986 sec 3.2.2 (emphasis mine):

A host identified by a registered name is a sequence of characters usually intended for lookup within a locally defined host or service name registry, though the URI's scheme-specific semantics may require that a specific registry (or fixed name table) be used instead. The most common name registry mechanism is the Domain Name System (DNS).

Note "most common" and not "only".

This also applies to the other DNS-specific changes here. FILTER_VALIDATE_URL is validating the URL, not content of the components.

The IPv6 change is valid, though.

@dunglas

This comment has been minimized.

Copy link
Contributor Author

commented Oct 14, 2014

@DaveRandom according to the WHATWG URL Living Standard and RFC 2396 this is not true: an URI can have any valid host but an URL host must be a valid IP address (v4 or v6) or domain (DNS).

Relevant quotes:

https://url.spec.whatwg.org/#valid-domain

A domain is a valid domain if these steps return success:
Let result be the result of running Unicode ToASCII with domain_name set to domain, UseSTD3ASCIIRules set to true, processing_option set to Nontransitional_Processing, and VerifyDnsLength set to true.

http://www.unicode.org/reports/tr46/tr46-12.html#ToASCII

If VerifyDnsLength flag is true, then verify DNS length restrictions. This may record an error. For more information, see [STD13] and [STD3].
The length of the domain name, excluding the root label and its dot, is from 1 to 253.
The length of each label is from 1 to 63.

RFC 2396 (RFC 1034 and RFC1123 specifies DNS)

Hostnames take the form described in Section 3 of [RFC1034] and
Section 2.1 of [RFC1123]: a sequence of domain labels separated by
".", each domain label starting and ending with an alphanumeric
character and possibly also containing "-" characters. The rightmost
domain label of a fully qualified domain name will never start with a
digit, thus syntactically distinguishing domain names from IPv4
addresses, and may be followed by a single "." if it is necessary to
distinguish between the complete domain name and any local domain.
To actually be "Uniform" as a resource locator, a URL hostname should
be a fully qualified domain name. In practice, however, the host
component may be a local domain literal.

Indeed, URI may contain other host types but this filter is called FILTER_VALIDATE_URL not URI. If it is intended to validate URIs, it's confusing.
The current implementation returns false for a lot of legal URIs (but invalids URLs) such as:

var_dump(filter_var("urn:ietf:rfc:2141", FILTER_VALIDATE_URL));
// bool(false)

I think it make sense that FILTER_VALIDATE_URL validate only URL (in the WHATWG / RFC sense and as most web developers intend). Maybe can we introduce a new URI validator for more general URIs?

@dunglas

This comment has been minimized.

Copy link
Contributor Author

commented Oct 14, 2014

And from the same RFC you quoted:

This specification defines the generic URI syntax and a process for
resolving URI references that might be in relative form, along with
guidelines and security considerations for the use of URIs on the
Internet. The URI syntax defines a grammar that is a superset of all
valid URIs, allowing an implementation to parse the common components
of a URI reference without knowing the scheme-specific requirements
of every possible identifier. This specification does not define a
generative grammar for URIs; that task is performed by the individual
specifications of each URI scheme.

In my patch, DNS validity is checked only for HTTP and HTTPS schemes (it should be also applicable to FTP btw).

@DaveRandom

This comment has been minimized.

Copy link
Contributor

commented Oct 14, 2014

@dunglas Indeed, I re-checked the PHP docs for what this filter actually does and I see it is validating specifically URLs and not generic URIs.

In order to make this check valid and fill in a couple of blanks in the filter API, IMHO you should do the following things as part of this PR:

  • Create FILTER_VALIDATE_DNS_NAME, which performs full name validation.
  • Add a flag for FILTER_VALIDATE_IP which, when (and only when) used with FILTER_FLAG_IPV6 will allow the enclosing [] pair.
  • Refactor the host validation logic to make use of these filters (as it already partially does).

This would plug a couple of holes in the userland API - I know I've written code to work around the square bracket validation with values extracted via parse_url() before, and a DNS name validator would have had value for me in the past as well. It would also improve encapsulation and DRYness.

Thoughts?

@dunglas

This comment has been minimized.

Copy link
Contributor Author

commented Oct 14, 2014

+1 for the new FILTER_VALIDATE_DNS_NAME but -1 checking for enclosing [] in FILTER_VALIDATE_IP. It's only relevant for URLs, not standalone IPv6 addresses: https://www.ietf.org/rfc/rfc2732.txt

IMO FILTER_VALIDATE_DNS_NAME should also allow IDNa hostnames. But it will require making ICU a dependency of ext/filter (as discussed on the mailing list). Of course IDNa support can be done after the merge or this PR. What do you think about that?

@DaveRandom

This comment has been minimized.

Copy link
Contributor

commented Oct 14, 2014

-1 to check for enclosing [] in FILTER_VALIDATE_IP. It's only relevant for URLs, not standalone I v6 addresses: https://www.ietf.org/rfc/rfc2732.txt

While this is absolutely true and unarguable, all I'm suggesting is a convenience flag to make this userland code a bit tidier (slightly modified from real code, only error handling and class refs removed):

$parts = parse_url($url);
if ($host = filter_var($parts['host'], FILTER_VALIDATE_IP, FILTER_FLAG_IPV4)) {
    $mode = IPv4_ADDR;
} else if (($host = filter_var($parts['host'], FILTER_VALIDATE_IP, FILTER_FLAG_IPV6)) || ($parts['host'][0] == '[' && substr($parts['host'], -1) == ']' && $host = filter_var(substr($parts['host'], 1, -1), FILTER_VALIDATE_IP, FILTER_FLAG_IPV6))) {
    $host = '[' . $host . ']';
    $mode = IPv6_ADDR;
} else if (/* userland DNS name validation routine */) {
    $mode = DNS_NAME;
} else {
    // invalid
}

As you can imagine, I would much rather have written something like:

$parts = parse_url($url);
if ($host = filter_var($parts['host'], FILTER_VALIDATE_IP, FILTER_FLAG_IPV4)) {
    $mode = IPv4_ADDR;
} else if ($host = filter_var($parts['host'], FILTER_VALIDATE_IP, FILTER_FLAG_IPV6 | FILTER_FLAG_IPV6_FROM_URL)) {
    $mode = IPv6_ADDR;
} else if ($host = filter_var($parts['host'], FILTER_VALIDATE_DNS_NAME)) {
    $mode = DNS_NAME;
} else {
    // invalid
}

All I'm suggesting is an optional, non-default flag to allow the square brackets. This code also shows where I would have found the DNS name validation useful.

IMO FILTER_VALIDATE_DNS_NAME should also allow IDNa hostnames. But it will require making ICU a dependency of ext/filter (as discussed on the mailing list). Of course IDNa support can be done after the merge or this PR. What do you think about that?

Indeed. We have spoken about this before on-list, and as I mentioned at the time I have two major thoughts about this:

  1. I would not want filters to implicitly allow IDN unless the streams API also supported it.
  2. I have already done some PoC work to make this happen (the streams side of it). Unfortunately it's not as simple as just handling the host at IP resolution time, because it must also be supported by SSL.

I know that when we initially spoke about this I said I would tidy my work up and make it public. Owing to various issues IRL I haven't yet had time to do this, I will either get to this in the next few days or hand it over to @rdlowrey or @datibbaw to do because I OWN THEM1.

1 Actually I just like to annoy them.

@dunglas

This comment has been minimized.

Copy link
Contributor Author

commented Oct 14, 2014

An hostname validator can be useful in a lot of case. However all URL hostnames (domain in the WHATWG standard) are valid DNS domains (or IPs) but all DNS domains are not valid hostnames.
For instance, a DNS domain can contain underscores but URL hostnames can't: http://stackoverflow.com/a/2183140/1352334
The new FILTER_VALIDATE_DNS_NAME must have a FILTER_FLAG_URL_DOMAIN too.

For this new flag (for the new filter), I suggest implementing this URL Living Standard algorithm: https://url.spec.whatwg.org/#host-parsing (for that too we can use ICU if we make it a core dependency of PHP: http://icu-project.org/apiref/icu4c432/uidna_8h.html#aaf3bec2415dd99b4221eeebb723eb082).

About FILTER_VALIDATE_IP, instead of "crowding" it with a new flag for a very specific use case, I suggest enhancing parse_url adding a new component called host_type returning the host type (domain for IPv4 and domains and IPv6 for IPv6 - currently in the spec IPv4 are treated like domains, see https://www.w3.org/Bugs/Public/show_bug.cgi?id=26431).

It can be used internaly by FILTER_VALIDATE_URL (it already use parse_url) and it's even better (IMHO) for the use case you described:

$parts = parse_url($url);

switch ($parts['host_type']) {
    case PHP_URL_HOST_IPV6:
        $mode = IPv6_ADDR;
        // $ipv6 = substr($parts['host'], 1, -1);
       break;

    default:
        if (filter_var($parts['host'], FILTER_VALIDATE_IP, FILTER_FLAG_IPV4)) {
            $mode = IPv4_ADDR;
        } else {
            $mode = DNS_NAME;
        }
}

If everyone agree on this proposal, the work to do on PHP is:

  1. Make ICU a dependency for ext/filter
  2. Create a new FILTER_VALIDATE_DOMAIN with an optional flag FILTER_FLAG_URL_DOMAIN
  3. Make a new component host_type available with parse_url
  4. Update FILTER_VALIDATE_URL using those new features

Should we formalize that with a RFC or, as it's mostly fixing the current behavior, this issue is enough?

@dunglas dunglas changed the title Better RFC conformance for FILTER_VALIDATE_URL (bug #68039) New FILTER_VALIDATE_DOMAIN and better RFC conformance for FILTER_VALIDATE_URL Nov 5, 2014

@dunglas

This comment has been minimized.

Copy link
Contributor Author

commented Nov 5, 2014

  • Added a new FILTER_VALIDATE_DOMAIN filter validating domain names
  • Added a FILTER_FLAG_HOSTNAME flag to allow checking hostnames (_ are forbidden in hostname but not in domains)
  • Changed FILTER_VALIDATE_URL to use this new validator

When #890 will be merge, it will be easy to add IDN support to this new domain validator. What should be done next to have this code merged?
Travis build fail because of the forward declaration wanted by @datibbaw (it compiles locally). I'll remove it (and move the code) when the code will be reviewed and validated.

@dunglas dunglas force-pushed the dunglas:validate_url_length branch from 7e410d3 to 728945d Nov 14, 2014

@dunglas

This comment has been minimized.

Copy link
Contributor Author

commented Nov 14, 2014

  • Validate characters only when FILTER_FLAG_HOSTNAME is set.
  • Fixed memory leak.
  • Rebased.

dunglas added some commits Nov 14, 2014

char *e, *s, *t;
size_t l;
int hostname = flags & FILTER_FLAG_HOSTNAME;
unsigned char i = 1;

This comment has been minimized.

Copy link
@datibbaw

datibbaw Nov 17, 2014

Contributor

Why start the label length at 1 if you haven't read any characters from the string yet?

This comment has been minimized.

Copy link
@dunglas

dunglas Nov 17, 2014

Author Contributor

Because the length incrementation occurs after the length check (see https://github.com/dunglas/php-src/blob/validate_url_length/ext/filter/logical_filters.c#L497-L501).

This comment has been minimized.

Copy link
@datibbaw

datibbaw Nov 17, 2014

Contributor

That's exactly my point, your invariant choice for i is just unusual; it would be more logical to define it as the number of characters since the last period ... I know, nitpicking :P

This comment has been minimized.

Copy link
@dunglas

dunglas Nov 17, 2014

Author Contributor

@datibbaw I'm ok to initialize i as 0 and move the increment operation before the check (it's your point right?) if it's already done like that in other part of the code base. Anyway, doing it like I've done avoids one unnecessary increment operation when the label is > 63 characters.

Number of ++ operations when label is 64 chars:
current implementation: 63
your suggestion: 64

It's really micro-optimization with no impact in real uses case but... :-)

@php-pulls php-pulls merged commit 54f06d2 into php:master Nov 27, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Details
@whatthejeff

This comment has been minimized.

Copy link
Contributor

commented Dec 15, 2014

@dunglas Should there be a flag for FILTER_VALIDATE_DOMAIN that validates FQDNs?

@dunglas

This comment has been minimized.

Copy link
Contributor Author

commented Dec 16, 2014

@whatthejeff Why not. Can be useful!

@Emdek

This comment has been minimized.

Copy link

commented Mar 16, 2016

It's very nice addition but it is missing documentation in manual...

@frederikbosch

This comment has been minimized.

Copy link
Contributor

commented Aug 8, 2017

Where can we find the documentation of this page: http://php.net/manual/en/filter.filters.validate.php. I was looking for it so I could contribute the missing ones based on this PR.

@jaisato

This comment has been minimized.

Copy link

commented Aug 23, 2017

Hi,
I used the filter FILTER_VALIDATE_DOMAIN with values starting and ending by special characters and it didn't work (in PHP version 7.1.7). For example, the value "$1271.com#" is not valid but the filter returns it as valid domain. It must return false.

I don't recommend using that filter for now.

@ParthKolekar ParthKolekar referenced this pull request Oct 29, 2017

Merged

Re-add tests from #7572 #7579

1 of 1 task complete

@dunglas dunglas deleted the dunglas:validate_url_length branch Dec 29, 2017

salathe pushed a commit to salathe/phpdoc-en that referenced this pull request Apr 9, 2018

Pieter Hordijk
A parameter (FILTER_VALIDATE_DOMAIN) was added to filter_var in 2014 …
…but never documented. This attempts to document it according to the notes on the original PR 826: php/php-src#826

Fixes #72013

-- 
Provided by anonymous 90461 (kevin.boyd@gmail.com)

git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@344641 c90b9560-bf6c-de11-be94-00142212c4b1

LawnGnome pushed a commit to LawnGnome/phpdoc-en that referenced this pull request Apr 9, 2018

peehaa
A parameter (FILTER_VALIDATE_DOMAIN) was added to filter_var in 2014 …
…but never documented. This attempts to document it according to the notes on the original PR 826: php/php-src#826

Fixes #72013

-- 
Provided by anonymous 90461 (kevin.boyd@gmail.com)

git-svn-id: https://svn.php.net/repository/phpdoc/en/trunk@344641 c90b9560-bf6c-de11-be94-00142212c4b1

svn2github pushed a commit to svn2github/phpdoc_en that referenced this pull request Apr 10, 2018

peehaa
A parameter (FILTER_VALIDATE_DOMAIN) was added to filter_var in 2014 …
…but never documented. This attempts to document it according to the notes on the original PR 826: php/php-src#826

Fixes #72013

-- 
Provided by anonymous 90461 (kevin.boyd@gmail.com)

git-svn-id: http://svn.php.net/repository/phpdoc/en@344641 c90b9560-bf6c-de11-be94-00142212c4b1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.