Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not shellescape URLs since it borks params #312

Merged
merged 6 commits into from
Jul 16, 2015

Conversation

cdwort
Copy link
Contributor

@cdwort cdwort commented Jul 7, 2015

Fixes #166

Current Functionality (master):

>> PDFKit.new('https://www.google.com/search?q=pdfkit').command
wkhtmltopdf ... https://www.google.com/search\\?q\\=pdfkit

New Functionality (this PR):

>> PDFKit.new('https://www.google.com/search?q=pdfkit').command
wkhtmltopdf ... https://www.google.com/search?q=pdfkit

To accomplish this, this PR walks back some security measures implemented in #164.

  • We no longer escape URLs
  • We no longer escape HTML coming through middleware.

I don't know enough about bash exploits, so forgive the poor examples below.

This would impact those using PDFKit who allow raw user input in the URL (e.g. www.mysite.com/echo /etc/passwd) or in the HTML (e.g. "<p>Your PDF Content: echo /etc/passwd</p>" ).

On the other hand, I think we need to support query parameters. So, unless someone sees a better this solution, I'd say let's fix this.

@cdwort
Copy link
Contributor Author

cdwort commented Jul 7, 2015

@sigmavirus24 Got a second to see you agree with me that this has minimal security implications?

@cdwort
Copy link
Contributor Author

cdwort commented Jul 7, 2015

Ugg. Our wkhtmltopdf source is 404ing. I'll fix that in another commit, but likely not tonight. This will wait on that.

@sigmavirus24
Copy link
Contributor

So... I'm trying to decide.

I think we could mitigate this, but no matter what this will have security implications. Since we interpolate this into the command, we're going to have something like this

PDFKit.new('https://google.com/search?q=pdfkit; do_something #')

Which would be interpolated into:

pdfkit https://google.com/search?q=pdfkit; do_something # (rest of command arguments

Which is what @devn was trying to avoid by using shellescape. We can take solace in a few things.

For one, we can probably use URI tools to normalize the URI we're provided. That said, I'm not sure if the uri library in Ruby does RFC 3986 normalization. I would guess that Addressable does. One of us would have to look into this though.

@sigmavirus24
Copy link
Contributor

Looks like the uri library would warn us of problematic URLs

~ ❯❯❯ irb
1.9.3-p448 :001 > require 'uri'
 => true
1.9.3-p448 :002 > google = 'https://google.com/search?q=pdfkit; do_something # --args'
 => "https://google.com/search?q=pdfkit; do_something # --args"
1.9.3-p448 :003 > u = URI(google)
URI::InvalidURIError: bad URI(is not URI?): https://google.com/search?q=pdfkit; do_something # --args
    from /usr/local/rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/uri/common.rb:176:in `split'
    from /usr/local/rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/uri/common.rb:211:in `parse'
    from /usr/local/rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/uri/common.rb:747:in `parse'
    from /usr/local/rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/uri/common.rb:994:in `URI'
    from (irb):3
    from /usr/local/rvm/rubies/ruby-1.9.3-p448/bin/irb:16:in `<main>'

That says to me that if that barfs, so can we.

@sigmavirus24
Copy link
Contributor

Spaces are always encoded in a URI as either %20 or + so that gives us some assurances that we won't be subject to shell injection attacks. That said, I'd like @wuest's opinion too since she has a far more keen mind for this stuff than I do.

@cdwort
Copy link
Contributor Author

cdwort commented Jul 7, 2015

@sigmavirus24

Interestingly, we do not consider that URL to be HTML, which means this change wouldn't affect that URL.

[1] pry(main)> require 'pdfkit'
=> true
[2] pry(main)> kit = PDFKit.new('https://google.com/search?q=pdfkit; do_something # --args')
=> #<PDFKit:0x007fbe2e849da8
 @options=
  {"--quiet"=>nil, "--page-size"=>"Letter", "--margin-top"=>"0.75in", "--margin-right"=>"0.75in", "--margin-bottom"=>"0.75in", "--margin-left"=>"0.75in", "--encoding"=>"UTF-8"},
 @source=#<PDFKit::Source:0x007fbe2e849d58 @source="https://google.com/search?q=pdfkit; do_something # --args">,
 @stylesheets=[]>
[3] pry(main)> kit.source.html?
=> false

@cdwort
Copy link
Contributor Author

cdwort commented Jul 7, 2015

Hahaha, can I take back that moment of insanity? Totally wrong. License to code revoked!

@cdwort
Copy link
Contributor Author

cdwort commented Jul 7, 2015

Okay, so, I'm cool with bubbling up an error on initialize if the source is not HTML and the URL is not parseable by URI.

@sigmavirus24
Copy link
Contributor

So even attempting to get the URL would be a problem. We should parse the URI first. Then we should bubble up an error immediately. Even attempting to get that URL would result in a shell injection vulnerability and a new CVE-ID =/.

@sigmavirus24
Copy link
Contributor

Can I just take this moment to complain about how Python's standard library doesn't barf on a URL like that? =(

@cdwort
Copy link
Contributor Author

cdwort commented Jul 7, 2015

Well, if we're on the subject of complaints, can I take a moment to complain that the implementation of Date.parse in JS (now and ES6) is browser dependent?

@cdwort
Copy link
Contributor Author

cdwort commented Jul 7, 2015

In any case, fix should be ^^

@cdwort
Copy link
Contributor Author

cdwort commented Jul 7, 2015

Happy path URL and non-URL sources are implicitly tested in the rest of spec/source_spec.rb.

@sigmavirus24
Copy link
Contributor

Well, if we're on the subject of complaints, can I take a moment to complain that the implementation of Date.parse in JS (now and ES6) is browser dependent?

Have I said lately how glad I am not to have to deal with JS in the browser anymore?


I'd still like @wuest to review this if she has the chance. I'm also unsure if we should maybe raise our own error as a result. It might be a better user experience than having to do

require 'pdfkit'
require 'uri'

begin
  PDFKit.new(...)
rescue URI::InvalidURIError
  # do something else
end

Whereas someone would just have to do

require 'pdfkit'

begin
  PDFKit.new('..')
rescue PDFKit::InvalidURIError
  # ...
end

@cdwort
Copy link
Contributor Author

cdwort commented Jul 7, 2015

Yeah, definitely, a second opinion would be great. I wonder if @wuest and I met at Ruby on Ales at a mexican restaurant... but I have a bad memory.

I'm fine with a new PDFKit::InvalidURIError if you think that's a better user experience. I don't know enough about our users to say. I chose to leave it as URI::InvalidURIError because as a consumer I know what tends to cause those errors and would lead me to look at the URI documentation. A PDFKit::InvalidURIError would require me do a google search, find no results and then grep the code base.

Maybe an argument for better documentation?

@sigmavirus24
Copy link
Contributor

Maybe an argument for better documentation?

I've been revamping documentation on three projects for a while now. Finished toolbelt.readthedocs.org, working on rfc3986.readthedocs.org and betamax.readthedocs.org and will be continuing work on github3.readthedocs.org too. Docs are important and we should probably try harder to have some.

@cdwort
Copy link
Contributor Author

cdwort commented Jul 7, 2015

Yeah... Not sure that's a time commitment I'm able to make right now. :-/ But good to keep in mind.

@wuest
Copy link

wuest commented Jul 7, 2015

Just catching up, sorry for taking so long!

Looking at the new test expected to fail, why is 'https://google.com/search?q=pdfkit; do_something # --args' not a valid URL? I'm not aware of anything in this URL which can't be encountered in the real world. My (admittedly unfamiliar) thought is generally that the URL should be URIEncoded, and a fairly limited number of characters could be preserved, since URLs should be interpreted correctly. The default of URI::UNSAFE is pretty permissive, but a more restrictive regexp such might be better:

2.2.2 :001 > require 'uri'
 => true
2.2.2 :002 > URI::escape('https://google.com/search?q=pdfkit; do_something # --args&do_other_thing', /[^\w\d\-_+\/:\.?=&%]/)
 => "https://google.com/search?q=pdfkit%3B%20do_something%20%23%20--args&do_other_thing"
2.2.2 :003 > URI(URI::escape('https://google.com/search?q=pdfkit; do_something # --args&do_other_thing', /[^\w\d\-_+\/:\.?=&%]/))
 => #<URI::HTTPS https://google.com/search?q=pdfkit%3B%20do_something%20%23%20--args&do_other_thing>

The problem that still remains is that & cannot be URIEncoded since it changes the semantics of the URL. Because of this, shellescaping isn't going to be something which you can avoid.

@cdwort I wasn't at Ruby on Ales, but perhaps we met at Madison+ Ruby last year?

@sigmavirus24
Copy link
Contributor

So # is the delimiter for the start of a URL fragment. That means that --args&do_other_things would be part of the fragment. That said, we could try to avoid this with (single-)quoting the URL on the command-line.

That said, URLs aren't technically allowed to have s in them. They need to be escaped. We /could/ escape them for the user, but some servers expect+instead of%20 and barf otherwise. I don't know if it's unreasonable to expect the user to pass a correct URL.

@cdwort
Copy link
Contributor Author

cdwort commented Jul 7, 2015

Yeah, I guess I don't know enough about URLs. To my eye

https://google.com/search?q=pdfkit; do_something_evil --with-my-flag # --comment_out_the_other_args

does not look like a valid URL.

That said, I'm happy not policing the URL - generally that is something that is under the user's control. Given that we support the middleware, however, and allow arbitrary URLs to be sent in, it seems worthwhile to have some protection.

@wuest
Copy link

wuest commented Jul 7, 2015

Here's a totally valid URL which doesn't TECHNICALLY require escaping in order to be considered "correct" by URI: https://www.google.com/search?q=unkind&cat</dev/zero>/dev/null

Do any modern servers actually have problems with + vs %20? I haven't heard of that being an issue in ages (and encodeURI() in JS on all modern browsers produces %20 afaik).

@sigmavirus24
Copy link
Contributor

Yeah, you're right, that URI needs some kind of escaping. My first instinct is to place the URL in 's but that allows the attacker to do:

https://www.google.com/search?q='cat</dev/zero>/dev/null'

Which would translate to

pdfkit 'https://www.google.com/search?q='cat</dev/zero>/dev/null'' ...

=(

@sigmavirus24
Copy link
Contributor

So it would seem that the best way forward would be:

  1. URI::escape, maybe with a custom regular expression. By default, URI::escape works pretty well actually.

    I say "maybe" because of the following examples:

    URI::encode("https://www.google.com/search?q=unkind&cat</dev/zero>/dev/null")
    # => "https://www.google.com/search?q=unkind&cat%3C/dev/zero%3E/dev/null"
    URI::encode("https://www.google.com/search?q=unkined&cat </dev/zero >/dev/null")
    # => "https://www.google.com/search?q=unkined&cat%20%3C/dev/zero%20%3E/dev/null"
    URI::encode("https://www.google.com/search?q=unkined&cat '</dev/zero' >/dev/null")
    # => "https://www.google.com/search?q=unkined&cat%20'%3C/dev/zero'%20%3E/dev/null"
    URI::encode("https://www.google.com/search?q=unkined&cat \"</dev/zero\" >/dev/null")
    # => "https://www.google.com/search?q=unkined&cat%20%22%3C/dev/zero%22%20%3E/dev/null"

    Note that it doesn't encode 's (which is to be expected) but it will encode "s. I think this means that we could do something akin to quoting the URL when it's passed to the command-line. (In other words, step 2.) Double-quoting will work fine and negate the effect of the & in the URL, so will ' but that isn't escaped by default and we'd have to use a custom regexp for that.

  2. Do something akin to:

    command_string = "#{executable_path} \"#{URI::escape(url)}\" #{args}"

I'm sure there are even worse URLs that we could run into, but I think this will give us the flexibility that we need to fix the issue and some small bit of comfort in the fact that we're trying to protect the user's system. Do you agree @wuest?

@wuest
Copy link

wuest commented Jul 7, 2015

Looks good at a glance!

@cdwort
Copy link
Contributor Author

cdwort commented Jul 7, 2015

Added URI::escape. Ran tests.

screen shot 2015-07-07 at 5 13 10 pm

@cdwort
Copy link
Contributor Author

cdwort commented Jul 7, 2015

Alright, so now passing, but here's a question:

This is a breaking change for our users who ARE currently escaping their URLs:

[1] pry(main)> URI::escape("https://www.google.com/search?q='cat%3Cdev/zero%3E/dev/null'")
=> "https://www.google.com/search?q='cat%253Cdev/zero%253E/dev/null'"

Is that something we want to do? I'm guessing not. The only way I can think of to fix that is to check whether the decoded URL is the same as the URL passed in. That's a little ugly.

@cdwort
Copy link
Contributor Author

cdwort commented Jul 7, 2015

Also, now that we are not shellescaping ampersands, are we going to run into issues like #207, where the ampersand sends the command to the background?

@cdwort
Copy link
Contributor Author

cdwort commented Jul 8, 2015

Interestingly enough, WickedPDF doesn't bother with escaping any aspects of the shell command.

@sigmavirus24
Copy link
Contributor

@cdwort that logic would be convenient, but:

URI::decode("https://www.google.com/search?q='cat<dev/zero>/dev/null'")
# => "https://www.google.com/search?q='cat<dev/zero>/dev/null'"
URI("https://www.google.com/search?q='cat<dev/zero>/dev/null'")
URI::InvalidURIError: bad URI(is not URI?): https://www.google.com/search?q='cat<dev/zero>/dev/null'
        from /usr/local/rvm/rubies/ruby-1.9.3-p448/lib/ruby/1.9.1/uri/common.rb:176:in `split'
        ...

That said, we can probably do

def url_needs_escaping?
  begin
    URI(@source)
  rescue URI::InvalidURIError
    return false
  end
  URI::decode(@source) == @source
end

Since url_needs_escaping? I don't think we need to return early if it isn't a URL since we'll never reach that point.

Also, now that we are not shellescaping ampersands, are we going to run into issues like #207, where the ampersand sends the command to the background?

This is why I suggested:

command_string = "#{executable_path} \"#{URI::escape(url)}\" #{args}"

As step 2 of fixing this. Although now it would be "#{executable_path} \"#{shell_safe_url}\" #{args}" (pseudo-code).

@cdwort cdwort added the escaping label Jul 8, 2015
@wuest
Copy link

wuest commented Jul 8, 2015

f971a77 fixes the issue you ran into (this is the solution to which @sigmavirus24 was referring).

@cdwort
Copy link
Contributor Author

cdwort commented Jul 8, 2015

@wuest Okay, I had that solution and it gave me a junked up URL, but I think that was just outdated code. Thanks for setting that straight.

@cdwort
Copy link
Contributor Author

cdwort commented Jul 8, 2015

Removed the ready label as I would like to cut a release before this goes out (whenever we think we're done with it).

@cdwort
Copy link
Contributor Author

cdwort commented Jul 8, 2015

Confirmed that we do not support multiple URLs, which is a shame, but keeps this from being a breaking change.

Release has also gone out, so no blockers at the moment.

@cdwort cdwort modified the milestone: 0.8.1 Jul 8, 2015
@cdwort cdwort added the ready label Jul 8, 2015
@cdwort cdwort mentioned this pull request Jul 8, 2015
4 tasks
@sigmavirus24
Copy link
Contributor

This needs to be rebased

@cdwort
Copy link
Contributor Author

cdwort commented Jul 11, 2015

Alright, so after a long fought battle I have a working-ish Windows environment in which to test this. It works in smoke tests!

Obviously a quick smoke test isn't the best, but at least it's better than nothing.

In any case, I no longer have concerns about windows compatibility for this or #316. However, this will wait for #316 because those refactors will make merge conflicts much easier to handle.

I'm traveling this weekend so I may not get to finishing that up until Sunday or early next week. Sorry for the delay on this.

Any remaining concerns @sigmavirus24 besides need for a rebase?

@cdwort cdwort force-pushed the do_not_shellescape_source branch from b8334f1 to 57b1da6 Compare July 12, 2015 15:31
@cdwort
Copy link
Contributor Author

cdwort commented Jul 12, 2015

Rebased, passing and ready to go @sigmavirus24.

@cdwort
Copy link
Contributor Author

cdwort commented Jul 13, 2015

I'll merge this tonight unless there are concerns.

end

def url_needs_escaping?
URI::decode(@source) == @source
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we're quoting the URI, this still won't work.

1.9.3-p448 :001 > url = 'https://google.com/search?q="foobarbogus"'
 => "https://google.com/search?q=\"foobarbogus\""
1.9.3-p448 :002 > require 'uri'
 => true
1.9.3-p448 :003 > URI::decode(url)
 => "https://google.com/search?q=\"foobarbogus\""

That's a simple example that would allow someone to use something like

1.9.3-p448 :001 > url = 'https://google.com/search?q="useradd -p secrete hacked"'
 => "https://google.com/search?q=\"useradd -p secrete hacked\""
1.9.3-p448 :002 > require 'uri'
 => true
1.9.3-p448 :003 > URI::decode(url)
 => "https://google.com/search?q=\"useradd -p secrete hacked\""

We really need to first parse the URI to make sure it's valid before checking if the decoded value is the same as the source. If it's not valid then the decoded URI being equal to the original is just a worthless check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to that, and I'm out of my element here, so apologies if I sound like a broken record repeating this question again:

I thought we determined that we wanted to support urls like:

https://www.google.com/search?q='cat<dev/zero>/dev/null'

Maybe I misread that in the discussion? It's what this PR uses as the example of something that needs escaping, but that we want to allow. I'm just not sure how to do that while not supporting:

"https://google.com/search?q=\"useradd -p secrete hacked\""

For example:

[14] pry(main)> url = "https://www.google.com/search?q='cat<dev/zero>/dev/null'"
=> "https://www.google.com/search?q='cat<dev/zero>/dev/null'"
[15] pry(main)> bad_url = "https://google.com/search?q=\"useradd -p secrete hacked\""
=> "https://google.com/search?q=\"useradd -p secrete hacked\""
[16] pry(main)> URI::parse(url)
URI::InvalidURIError: bad URI(is not URI?): https://www.google.com/search?q='cat<dev/zero>/dev/null'
from /Users/aunger/.rvm/rubies/ruby-2.0.0-p598/lib/ruby/2.0.0/uri/common.rb:176:in `split'
[17] pry(main)> URI::parse(bad_url)
URI::InvalidURIError: bad URI(is not URI?): https://google.com/search?q="useradd -p secrete hacked"
from /Users/aunger/.rvm/rubies/ruby-2.0.0-p598/lib/ruby/2.0.0/uri/common.rb:176:in `split'
[18] pry(main)> URI::parse(URI::escape(url))
=> #<URI::HTTPS:0x007ffbc1baf530 URL:https://www.google.com/search?q='cat%3Cdev/zero%3E/dev/null'>
[19] pry(main)> URI::parse(URI::escape(bad_url))
=> #<URI::HTTPS:0x007ffbc1b76fc8 URL:https://google.com/search?q=%22useradd%20-p%20secrete%20hacked%22>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been reading this logic backwards. Ignore me. This looks fine. 😞

@cdwort
Copy link
Contributor Author

cdwort commented Jul 16, 2015

@sigmavirus24 Thanks for taking a look - I know you're busy. I'm very grateful for the peer review on this!

cdwort added a commit that referenced this pull request Jul 16, 2015
Do not shellescape URLs since it borks params
@cdwort cdwort merged commit 5c82aa5 into pdfkit:master Jul 16, 2015
@cdwort
Copy link
Contributor Author

cdwort commented Jul 16, 2015

For record keeping, this fixes #284

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

URLs being escaped after upgrade to 0.5.3
3 participants