Do not shellescape URLs since it borks params #312

cdwort · 2015-07-07T01:06:55Z

Fixes #166

Current Functionality (master):

>> PDFKit.new('https://www.google.com/search?q=pdfkit').command
wkhtmltopdf ... https://www.google.com/search\\?q\\=pdfkit

New Functionality (this PR):

>> PDFKit.new('https://www.google.com/search?q=pdfkit').command
wkhtmltopdf ... https://www.google.com/search?q=pdfkit

To accomplish this, this PR walks back some security measures implemented in #164.

We no longer escape URLs
We no longer escape HTML coming through middleware.

I don't know enough about bash exploits, so forgive the poor examples below.

This would impact those using PDFKit who allow raw user input in the URL (e.g. www.mysite.com/echo /etc/passwd) or in the HTML (e.g. "<p>Your PDF Content: echo /etc/passwd</p>" ).

On the other hand, I think we need to support query parameters. So, unless someone sees a better this solution, I'd say let's fix this.

cdwort · 2015-07-07T01:08:09Z

@sigmavirus24 Got a second to see you agree with me that this has minimal security implications?

cdwort · 2015-07-07T01:18:07Z

Ugg. Our wkhtmltopdf source is 404ing. I'll fix that in another commit, but likely not tonight. This will wait on that.

sigmavirus24 · 2015-07-07T01:32:50Z

So... I'm trying to decide.

I think we could mitigate this, but no matter what this will have security implications. Since we interpolate this into the command, we're going to have something like this

PDFKit.new('https://google.com/search?q=pdfkit; do_something #')

Which would be interpolated into:

pdfkit https://google.com/search?q=pdfkit; do_something # (rest of command arguments

Which is what @devn was trying to avoid by using shellescape. We can take solace in a few things.

For one, we can probably use URI tools to normalize the URI we're provided. That said, I'm not sure if the uri library in Ruby does RFC 3986 normalization. I would guess that Addressable does. One of us would have to look into this though.

sigmavirus24 · 2015-07-07T01:36:27Z

Looks like the uri library would warn us of problematic URLs

~ ❯❯❯ irb
1.9.3-p448 :001 > require 'uri'
 => true
1.9.3-p448 :002 > google = 'https://google.com/search?q=pdfkit; do_something # --args'
 => "https://google.com/search?q=pdfkit; do_something # --args"
1.9.3-p448 :003 > u = URI(google)
URI::InvalidURIError: bad URI(is not URI?): https://google.com/search?q=pdfkit; do_something # --args
    from /usr/local/rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/uri/common.rb:176:in `split'
    from /usr/local/rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/uri/common.rb:211:in `parse'
    from /usr/local/rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/uri/common.rb:747:in `parse'
    from /usr/local/rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/uri/common.rb:994:in `URI'
    from (irb):3
    from /usr/local/rvm/rubies/ruby-1.9.3-p448/bin/irb:16:in `<main>'

That says to me that if that barfs, so can we.

sigmavirus24 · 2015-07-07T01:38:05Z

Spaces are always encoded in a URI as either %20 or + so that gives us some assurances that we won't be subject to shell injection attacks. That said, I'd like @wuest's opinion too since she has a far more keen mind for this stuff than I do.

cdwort · 2015-07-07T01:39:39Z

@sigmavirus24

Interestingly, we do not consider that URL to be HTML, which means this change wouldn't affect that URL.

[1] pry(main)> require 'pdfkit'
=> true
[2] pry(main)> kit = PDFKit.new('https://google.com/search?q=pdfkit; do_something # --args')
=> #<PDFKit:0x007fbe2e849da8
 @options=
  {"--quiet"=>nil, "--page-size"=>"Letter", "--margin-top"=>"0.75in", "--margin-right"=>"0.75in", "--margin-bottom"=>"0.75in", "--margin-left"=>"0.75in", "--encoding"=>"UTF-8"},
 @source=#<PDFKit::Source:0x007fbe2e849d58 @source="https://google.com/search?q=pdfkit; do_something # --args">,
 @stylesheets=[]>
[3] pry(main)> kit.source.html?
=> false

cdwort · 2015-07-07T01:40:34Z

Hahaha, can I take back that moment of insanity? Totally wrong. License to code revoked!

cdwort · 2015-07-07T01:42:32Z

Okay, so, I'm cool with bubbling up an error on initialize if the source is not HTML and the URL is not parseable by URI.

sigmavirus24 · 2015-07-07T01:48:27Z

So even attempting to get the URL would be a problem. We should parse the URI first. Then we should bubble up an error immediately. Even attempting to get that URL would result in a shell injection vulnerability and a new CVE-ID =/.

sigmavirus24 · 2015-07-07T01:51:09Z

Can I just take this moment to complain about how Python's standard library doesn't barf on a URL like that? =(

cdwort · 2015-07-07T01:55:17Z

Well, if we're on the subject of complaints, can I take a moment to complain that the implementation of Date.parse in JS (now and ES6) is browser dependent?

cdwort · 2015-07-07T01:56:04Z

In any case, fix should be ^^

cdwort · 2015-07-07T01:57:11Z

Happy path URL and non-URL sources are implicitly tested in the rest of spec/source_spec.rb.

sigmavirus24 · 2015-07-07T02:02:55Z

Well, if we're on the subject of complaints, can I take a moment to complain that the implementation of Date.parse in JS (now and ES6) is browser dependent?

Have I said lately how glad I am not to have to deal with JS in the browser anymore?

I'd still like @wuest to review this if she has the chance. I'm also unsure if we should maybe raise our own error as a result. It might be a better user experience than having to do

require 'pdfkit'
require 'uri'

begin
  PDFKit.new(...)
rescue URI::InvalidURIError
  # do something else
end

Whereas someone would just have to do

require 'pdfkit'

begin
  PDFKit.new('..')
rescue PDFKit::InvalidURIError
  # ...
end

cdwort · 2015-07-07T02:10:18Z

Yeah, definitely, a second opinion would be great. I wonder if @wuest and I met at Ruby on Ales at a mexican restaurant... but I have a bad memory.

I'm fine with a new PDFKit::InvalidURIError if you think that's a better user experience. I don't know enough about our users to say. I chose to leave it as URI::InvalidURIError because as a consumer I know what tends to cause those errors and would lead me to look at the URI documentation. A PDFKit::InvalidURIError would require me do a google search, find no results and then grep the code base.

Maybe an argument for better documentation?

sigmavirus24 · 2015-07-07T02:16:23Z

Maybe an argument for better documentation?

I've been revamping documentation on three projects for a while now. Finished toolbelt.readthedocs.org, working on rfc3986.readthedocs.org and betamax.readthedocs.org and will be continuing work on github3.readthedocs.org too. Docs are important and we should probably try harder to have some.

cdwort · 2015-07-07T02:42:36Z

Yeah... Not sure that's a time commitment I'm able to make right now. :-/ But good to keep in mind.

wuest · 2015-07-07T02:43:33Z

Just catching up, sorry for taking so long!

Looking at the new test expected to fail, why is 'https://google.com/search?q=pdfkit; do_something # --args' not a valid URL? I'm not aware of anything in this URL which can't be encountered in the real world. My (admittedly unfamiliar) thought is generally that the URL should be URIEncoded, and a fairly limited number of characters could be preserved, since URLs should be interpreted correctly. The default of URI::UNSAFE is pretty permissive, but a more restrictive regexp such might be better:

2.2.2 :001 > require 'uri'
 => true
2.2.2 :002 > URI::escape('https://google.com/search?q=pdfkit; do_something # --args&do_other_thing', /[^\w\d\-_+\/:\.?=&%]/)
 => "https://google.com/search?q=pdfkit%3B%20do_something%20%23%20--args&do_other_thing"
2.2.2 :003 > URI(URI::escape('https://google.com/search?q=pdfkit; do_something # --args&do_other_thing', /[^\w\d\-_+\/:\.?=&%]/))
 => #<URI::HTTPS https://google.com/search?q=pdfkit%3B%20do_something%20%23%20--args&do_other_thing>

The problem that still remains is that & cannot be URIEncoded since it changes the semantics of the URL. Because of this, shellescaping isn't going to be something which you can avoid.

@cdwort I wasn't at Ruby on Ales, but perhaps we met at Madison+ Ruby last year?

sigmavirus24 · 2015-07-07T03:02:13Z

So # is the delimiter for the start of a URL fragment. That means that --args&do_other_things would be part of the fragment. That said, we could try to avoid this with (single-)quoting the URL on the command-line.

That said, URLs aren't technically allowed to have s in them. They need to be escaped. We /could/ escape them for the user, but some servers expect+instead of%20 and barf otherwise. I don't know if it's unreasonable to expect the user to pass a correct URL.

cdwort · 2015-07-07T03:07:47Z

Yeah, I guess I don't know enough about URLs. To my eye

https://google.com/search?q=pdfkit; do_something_evil --with-my-flag # --comment_out_the_other_args

does not look like a valid URL.

That said, I'm happy not policing the URL - generally that is something that is under the user's control. Given that we support the middleware, however, and allow arbitrary URLs to be sent in, it seems worthwhile to have some protection.

wuest · 2015-07-07T03:09:29Z

Here's a totally valid URL which doesn't TECHNICALLY require escaping in order to be considered "correct" by URI: https://www.google.com/search?q=unkind&cat</dev/zero>/dev/null

Do any modern servers actually have problems with + vs %20? I haven't heard of that being an issue in ages (and encodeURI() in JS on all modern browsers produces %20 afaik).

sigmavirus24 · 2015-07-07T03:16:11Z

Yeah, you're right, that URI needs some kind of escaping. My first instinct is to place the URL in 's but that allows the attacker to do:

https://www.google.com/search?q='cat</dev/zero>/dev/null'

Which would translate to

pdfkit 'https://www.google.com/search?q='cat</dev/zero>/dev/null'' ...

=(

sigmavirus24 · 2015-07-07T13:23:30Z

So it would seem that the best way forward would be:

URI::escape, maybe with a custom regular expression. By default, URI::escape works pretty well actually.

I say "maybe" because of the following examples:

URI::encode("https://www.google.com/search?q=unkind&cat</dev/zero>/dev/null")
# => "https://www.google.com/search?q=unkind&cat%3C/dev/zero%3E/dev/null"
URI::encode("https://www.google.com/search?q=unkined&cat </dev/zero >/dev/null")
# => "https://www.google.com/search?q=unkined&cat%20%3C/dev/zero%20%3E/dev/null"
URI::encode("https://www.google.com/search?q=unkined&cat '</dev/zero' >/dev/null")
# => "https://www.google.com/search?q=unkined&cat%20'%3C/dev/zero'%20%3E/dev/null"
URI::encode("https://www.google.com/search?q=unkined&cat \"</dev/zero\" >/dev/null")
# => "https://www.google.com/search?q=unkined&cat%20%22%3C/dev/zero%22%20%3E/dev/null"

Note that it doesn't encode 's (which is to be expected) but it will encode "s. I think this means that we could do something akin to quoting the URL when it's passed to the command-line. (In other words, step 2.) Double-quoting will work fine and negate the effect of the & in the URL, so will ' but that isn't escaped by default and we'd have to use a custom regexp for that.

Do something akin to:

command_string = "#{executable_path} \"#{URI::escape(url)}\" #{args}"

I'm sure there are even worse URLs that we could run into, but I think this will give us the flexibility that we need to fix the issue and some small bit of comfort in the fact that we're trying to protect the user's system. Do you agree @wuest?

wuest · 2015-07-07T13:35:02Z

Looks good at a glance!

cdwort · 2015-07-07T22:14:09Z

Added URI::escape. Ran tests.

cdwort · 2015-07-07T22:48:26Z

Alright, so now passing, but here's a question:

This is a breaking change for our users who ARE currently escaping their URLs:

[1] pry(main)> URI::escape("https://www.google.com/search?q='cat%3Cdev/zero%3E/dev/null'")
=> "https://www.google.com/search?q='cat%253Cdev/zero%253E/dev/null'"

Is that something we want to do? I'm guessing not. The only way I can think of to fix that is to check whether the decoded URL is the same as the URL passed in. That's a little ugly.

cdwort · 2015-07-07T23:05:29Z

Also, now that we are not shellescaping ampersands, are we going to run into issues like #207, where the ampersand sends the command to the background?

cdwort · 2015-07-08T00:23:04Z

Interestingly enough, WickedPDF doesn't bother with escaping any aspects of the shell command.

sigmavirus24 · 2015-07-08T00:50:06Z

@cdwort that logic would be convenient, but:

URI::decode("https://www.google.com/search?q='cat<dev/zero>/dev/null'")
# => "https://www.google.com/search?q='cat<dev/zero>/dev/null'"
URI("https://www.google.com/search?q='cat<dev/zero>/dev/null'")
URI::InvalidURIError: bad URI(is not URI?): https://www.google.com/search?q='cat<dev/zero>/dev/null'
        from /usr/local/rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/uri/common.rb:176:in `split'
        ...

That said, we can probably do

def url_needs_escaping?
  begin
    URI(@source)
  rescue URI::InvalidURIError
    return false
  end
  URI::decode(@source) == @source
end

Since url_needs_escaping? I don't think we need to return early if it isn't a URL since we'll never reach that point.

Also, now that we are not shellescaping ampersands, are we going to run into issues like #207, where the ampersand sends the command to the background?

This is why I suggested:

command_string = "#{executable_path} \"#{URI::escape(url)}\" #{args}"

As step 2 of fixing this. Although now it would be "#{executable_path} \"#{shell_safe_url}\" #{args}" (pseudo-code).

wuest · 2015-07-08T01:56:09Z

f971a77 fixes the issue you ran into (this is the solution to which @sigmavirus24 was referring).

cdwort · 2015-07-08T02:07:38Z

@wuest Okay, I had that solution and it gave me a junked up URL, but I think that was just outdated code. Thanks for setting that straight.

cdwort · 2015-07-08T02:36:55Z

Removed the ready label as I would like to cut a release before this goes out (whenever we think we're done with it).

cdwort · 2015-07-08T18:18:43Z

Confirmed that we do not support multiple URLs, which is a shame, but keeps this from being a breaking change.

Release has also gone out, so no blockers at the moment.

sigmavirus24 · 2015-07-09T02:31:58Z

This needs to be rebased

cdwort · 2015-07-11T03:12:00Z

Alright, so after a long fought battle I have a working-ish Windows environment in which to test this. It works in smoke tests!

Obviously a quick smoke test isn't the best, but at least it's better than nothing.

In any case, I no longer have concerns about windows compatibility for this or #316. However, this will wait for #316 because those refactors will make merge conflicts much easier to handle.

I'm traveling this weekend so I may not get to finishing that up until Sunday or early next week. Sorry for the delay on this.

Any remaining concerns @sigmavirus24 besides need for a rebase?

cdwort · 2015-07-12T15:41:45Z

Rebased, passing and ready to go @sigmavirus24.

cdwort · 2015-07-13T19:01:18Z

I'll merge this tonight unless there are concerns.

sigmavirus24 · 2015-07-14T13:06:24Z

lib/pdfkit/source.rb

+    end
+
+    def url_needs_escaping?
+      URI::decode(@source) == @source


So if we're quoting the URI, this still won't work.

1.9.3-p448 :001 > url = 'https://google.com/search?q="foobarbogus"' => "https://google.com/search?q=\"foobarbogus\"" 1.9.3-p448 :002 > require 'uri' => true 1.9.3-p448 :003 > URI::decode(url) => "https://google.com/search?q=\"foobarbogus\""

That's a simple example that would allow someone to use something like

1.9.3-p448 :001 > url = 'https://google.com/search?q="useradd -p secrete hacked"' => "https://google.com/search?q=\"useradd -p secrete hacked\"" 1.9.3-p448 :002 > require 'uri' => true 1.9.3-p448 :003 > URI::decode(url) => "https://google.com/search?q=\"useradd -p secrete hacked\""

We really need to first parse the URI to make sure it's valid before checking if the decoded value is the same as the source. If it's not valid then the decoded URI being equal to the original is just a worthless check.

I'm not opposed to that, and I'm out of my element here, so apologies if I sound like a broken record repeating this question again:

I thought we determined that we wanted to support urls like:

https://www.google.com/search?q='cat<dev/zero>/dev/null'

Maybe I misread that in the discussion? It's what this PR uses as the example of something that needs escaping, but that we want to allow. I'm just not sure how to do that while not supporting:

"https://google.com/search?q=\"useradd -p secrete hacked\""

For example:

[14] pry(main)> url = "https://www.google.com/search?q='cat<dev/zero>/dev/null'" => "https://www.google.com/search?q='cat<dev/zero>/dev/null'" [15] pry(main)> bad_url = "https://google.com/search?q=\"useradd -p secrete hacked\"" => "https://google.com/search?q=\"useradd -p secrete hacked\"" [16] pry(main)> URI::parse(url) URI::InvalidURIError: bad URI(is not URI?): https://www.google.com/search?q='cat<dev/zero>/dev/null' from /Users/aunger/.rvm/rubies/ruby-2.0.0-p598/lib/ruby/2.0.0/uri/common.rb:176:in `split' [17] pry(main)> URI::parse(bad_url) URI::InvalidURIError: bad URI(is not URI?): https://google.com/search?q="useradd -p secrete hacked" from /Users/aunger/.rvm/rubies/ruby-2.0.0-p598/lib/ruby/2.0.0/uri/common.rb:176:in `split' [18] pry(main)> URI::parse(URI::escape(url)) => #<URI::HTTPS:0x007ffbc1baf530 URL:https://www.google.com/search?q='cat%3Cdev/zero%3E/dev/null'> [19] pry(main)> URI::parse(URI::escape(bad_url)) => #<URI::HTTPS:0x007ffbc1b76fc8 URL:https://google.com/search?q=%22useradd%20-p%20secrete%20hacked%22>

I've been reading this logic backwards. Ignore me. This looks fine. 😞

cdwort · 2015-07-16T02:30:20Z

@sigmavirus24 Thanks for taking a look - I know you're busy. I'm very grateful for the peer review on this!

Do not shellescape URLs since it borks params

cdwort · 2015-07-16T02:32:48Z

For record keeping, this fixes #284

sigmavirus24 added the ready label Jul 7, 2015

cdwort added the escaping label Jul 8, 2015

cdwort removed the ready label Jul 8, 2015

cdwort mentioned this pull request Jul 8, 2015

Added escaping of ampersands in source url #207

Closed

cdwort modified the milestone: 0.8.1 Jul 8, 2015

cdwort added the ready label Jul 8, 2015

cdwort mentioned this pull request Jul 8, 2015

Quick refactor for PR #261 #316

Merged

4 tasks

cdwort mentioned this pull request Jul 12, 2015

Use updated wkhtmltopdf download URL #320

Merged

cdwort added 6 commits July 12, 2015 10:29

Do not shellescape URLs since it borks params (pdfkit#166)

30d7b36

Raise error if given unsafe URL

33777c6

URI escape urls

dca087b

Do not escape URLs that are already escaped

e4cf966

Add quotes around URLs to allow ampersands

80c7ed0

Remove unnecessary paranoia

57b1da6

cdwort force-pushed the do_not_shellescape_source branch from b8334f1 to 57b1da6 Compare July 12, 2015 15:31

sigmavirus24 reviewed Jul 14, 2015
View reviewed changes

cdwort added a commit that referenced this pull request Jul 16, 2015

Merge pull request #312 from cdwort/do_not_shellescape_source

5c82aa5

Do not shellescape URLs since it borks params

cdwort merged commit 5c82aa5 into pdfkit:master Jul 16, 2015

sigmavirus24 removed the ready label Jul 16, 2015

ryancyq mentioned this pull request Sep 26, 2019

Footer Html options with query params #453

Closed

Do not shellescape URLs since it borks params #312

Do not shellescape URLs since it borks params #312

Conversation

cdwort commented Jul 7, 2015

cdwort commented Jul 7, 2015

cdwort commented Jul 7, 2015

sigmavirus24 commented Jul 7, 2015

sigmavirus24 commented Jul 7, 2015

sigmavirus24 commented Jul 7, 2015

cdwort commented Jul 7, 2015

cdwort commented Jul 7, 2015

cdwort commented Jul 7, 2015

sigmavirus24 commented Jul 7, 2015

sigmavirus24 commented Jul 7, 2015

cdwort commented Jul 7, 2015

cdwort commented Jul 7, 2015

cdwort commented Jul 7, 2015

sigmavirus24 commented Jul 7, 2015

cdwort commented Jul 7, 2015

sigmavirus24 commented Jul 7, 2015

cdwort commented Jul 7, 2015

wuest commented Jul 7, 2015

sigmavirus24 commented Jul 7, 2015

cdwort commented Jul 7, 2015

wuest commented Jul 7, 2015

sigmavirus24 commented Jul 7, 2015

sigmavirus24 commented Jul 7, 2015

wuest commented Jul 7, 2015

cdwort commented Jul 7, 2015

cdwort commented Jul 7, 2015

cdwort commented Jul 7, 2015

cdwort commented Jul 8, 2015

sigmavirus24 commented Jul 8, 2015

wuest commented Jul 8, 2015

cdwort commented Jul 8, 2015

cdwort commented Jul 8, 2015

cdwort commented Jul 8, 2015

sigmavirus24 commented Jul 9, 2015

cdwort commented Jul 11, 2015

cdwort commented Jul 12, 2015

cdwort commented Jul 13, 2015

sigmavirus24 Jul 14, 2015

Choose a reason for hiding this comment

cdwort Jul 16, 2015

Choose a reason for hiding this comment

sigmavirus24 Jul 16, 2015

Choose a reason for hiding this comment

cdwort commented Jul 16, 2015

cdwort commented Jul 16, 2015