New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strip some control characters from links #34

Merged
merged 8 commits into from Feb 8, 2019

Conversation

Projects
None yet
2 participants
@oalders
Copy link
Owner

oalders commented Feb 7, 2019

Fixes #30

@oalders oalders force-pushed the control-characters branch from 2f119d9 to cc40250 Feb 7, 2019


if ($link) {
$link =~ s/[\001-\010]/ /g; # decimal 1..8
$link =~ s/[\016-\037]/ /g; # decimal 14..31

This comment has been minimized.

@oschwald

oschwald Feb 7, 2019

Is there a reason why you picked this set of characters? Does the null byte or a vertical tab not have the same issues? What about Unicode control characters? I wonder if something like [[:cntrl:]] or [^[:print:]] would be preferable.

This comment has been minimized.

@oalders

oalders Feb 7, 2019

Author Owner

So, the null byte doesn't suffer from this problem. (At least not in Chrome). Assuming <a href="&#0;javascript:alert(1);">0</a><br> is the correct way to test it.

However, I did some more reading at https://infra.spec.whatwg.org/#c0-control and it looks like "U+0000 NULL to U+001F INFORMATION SEPARATOR ONE" inclusive are not valid in a URL, so there's no reason not to include the whole set. I'll make some changes.

There is no mention of C1 control codes at the spec I've been working from: https://url.spec.whatwg.org/

This comment has been minimized.

@oschwald

oschwald Feb 7, 2019

It doesn't work for the null byte, but it does work for the vertical tab and other control characters that are currently being excluded.

This comment has been minimized.

@oalders

oalders Feb 8, 2019

Author Owner

Interesting. I tested them on Mac Chrome, Firefox and Safari and none of the excluded characters created a dialog box.

This comment has been minimized.

@oalders

oalders Feb 8, 2019

Author Owner

I've reworked the PR to remove the whole set.

This comment has been minimized.

@oschwald

oschwald Feb 8, 2019

Weird. I tested<a href="&#0011;javascript:alert(1);">0</a> and <a href="&#xB;javascript:alert(1);">0</a> on both Firefox and Chrome on Ubuntu and saw the dialog with both.

@oalders oalders force-pushed the control-characters branch 2 times, most recently from 64e2c73 to 8fc1c67 Feb 7, 2019

@oalders oalders force-pushed the control-characters branch from 8fc1c67 to 6b3253c Feb 8, 2019

@oschwald
Copy link

oschwald left a comment

Looks good. I had a couple of small comments.

my $url = URI->new($link)->as_string;

# The above regex doesn't strip the null byte
$url =~ s{&#0;}{}g;

This comment has been minimized.

@oschwald

oschwald Feb 8, 2019

I am not clear on why this is necessary. Also, it doesn't seem particularly effective given that &#0000, &x0, etc, are all equivalent. Maybe it isn't even worth worrying about if browsers don't seem to have the behavior on null bytes.

This comment has been minimized.

@oalders

oalders Feb 8, 2019

Author Owner

Good point. For some reason the regex above isn't catching it. It doesn't seem to cause a problem for browsers, but the spec does say that it should be removed.

my $link = $attr->{$source_type};

# Remove unprintable ASCII control characters, which
# are 1..8 and 14..31. These characters are not valid

This comment has been minimized.

@oschwald

oschwald Feb 8, 2019

This might need to be updated.

uri_schemes => [undef],
);

for my $i ( 0 .. 31 ) {

This comment has been minimized.

@oschwald

oschwald Feb 8, 2019

Maybe it would be worth testing the &#x... form?

my $url = URI->new($link)->as_string;

## The above regex doesn't strip the null byte
$url =~ s{&#x?0+;}{}g;

This comment has been minimized.

@oschwald

oschwald Feb 8, 2019

The issue here seems to be that HTML::Parser doesn't decode &#0; and friends. I think this is a better solution than what you have:

diff --git a/lib/HTML/Restrict.pm b/lib/HTML/Restrict.pm
index 650dd8a..6040d7b 100644
--- a/lib/HTML/Restrict.pm
+++ b/lib/HTML/Restrict.pm
@@ -171,14 +171,9 @@ sub _build_parser {
 
                             # C0 control chars (decimal 0..31)
                             # sort of like $link =~ s/[[:^print:]]//g
-                            $link =~ s/[\00-\037]/ /g;
+                            $link =~ s/[\00-\037]|&#x?0+;/ /g;
 
-                            my $url = URI->new($link)->as_string;
-
-                            ## The above regex doesn't strip the null byte
-                            $url =~ s{&#x?0+;}{}g;
-
-                            $url = URI->new($url);
+                            my $url = URI->new($link);
                             if ( defined $url->scheme ) {
                                 delete $attr->{$source_type}
                                     if none { $_ eq $url->scheme }

This comment has been minimized.

@oalders

oalders Feb 8, 2019

Author Owner

Yep, that's better.

@oalders oalders merged commit 382a431 into master Feb 8, 2019

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@oalders oalders deleted the control-characters branch Feb 8, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment