Transparent support for HTTP Content-Encoding #1493

Open
nnposter opened this Issue Feb 23, 2019 · 2 comments

nnposter commented Feb 23, 2019

At present, the NSE HTTP library provides no way to obtain a response body that has been decoded according to the Content-Encoding header. Scripts cannot handle compressed responses unless they decode them on their own, which none of them currently does.

The attached patch implements transparent support for HTTP Content-Encoding as follows:

  • The body member of the HTTP response object now contains the processed (decoded) body.
  • The original body, as received from the server, is preserved in a new member, rawbody.
  • New response member decoded contains a list of content encodings that were successfully processed.
  • New response member undecoded contains a list of encodings that could not be processed, either because they are not currently supported or the body is corrupt. In other words, a body was successfully decoded if this list is empty (or nil, if no encodings were used in the first place).
  • Returned content-encoding and content-length headers are adjusted to remain consistent with body. (If all encodings got processed then the content-encoding header is removed altogether.)
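To illustrate how a script might consume these members, here is a minimal sketch in plain Lua. The helper `fully_decoded` and the mock response tables are hypothetical and not part of the patch; a real script would obtain the response from the HTTP library rather than build it by hand.

```lua
-- Hypothetical helper (not part of the patch): returns true when the body
-- needs no further decoding, i.e. undecoded is nil or an empty list.
local function fully_decoded(response)
  return response.undecoded == nil or #response.undecoded == 0
end

-- Mock responses shaped like the members described above (assumed values).
local plain = { body = "Hello, World!", rawbody = "Hello, World!" }
local gzipped = {
  body = "Hello, World!",
  rawbody = "<compressed bytes>",
  decoded = { "gzip" },
  undecoded = {},
}
local unknown = {
  body = "<still encoded>",
  rawbody = "<still encoded>",
  decoded = {},
  undecoded = { "mystery" },
}

-- Report, for each mock response, whether its body can be used as-is.
for _, item in ipairs({ { "plain", plain }, { "gzipped", gzipped }, { "unknown", unknown } }) do
  print(item[1], fully_decoded(item[2]))
end
```

A script that only cares about decoded content would check this condition before parsing `body`, and could fall back to `rawbody` if it wants to attempt its own decoding.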
* Implements transparent processing of HTTP Content-Encoding header
--- a/nselib/http.lua
+++ b/nselib/http.lua
@@ -21,8 +21,11 @@
 -- * <code>header</code> - An associative array representing the header. Keys are all lowercase, and standard headers, such as 'date', 'content-length', etc. will typically be present.
 -- * <code>rawheader</code> - A numbered array of the headers, exactly as the server sent them. While header['content-type'] might be 'text/html', rawheader[3] might be 'Content-type: text/html'.
 -- * <code>cookies</code> - A numbered array of the cookies the server sent. Each cookie is a table with the expected keys, such as <code>name</code>, <code>value</code>, <code>path</code>, <code>domain</code>, and <code>expires</code>. This table can be sent to the server in subsequent responses in the <code>options</code> table to any function (see below).
--- * <code>body</code> - The full body, as returned by the server. Chunked encoding is handled transparently.
+-- * <code>rawbody</code> - The full body, as returned by the server. Chunked transfer encoding is handled transparently.
+-- * <code>body</code> - The full body, after processing the Content-Encoding header, if any. The Content-Encoding and Content-Length headers are adjusted to maintain consistency.
 -- * <code>fragment</code> - Partially received body (if any), in case of an error.
+-- * <code>decoded</code> - A list of processed named content encodings (like "identity" or "gzip")
+-- * <code>undecoded</code> - A list of named content encodings that could not be processed (due to lack of support or the body being corrupted for a given encoding). A body has been successfully decoded if this list is empty (or <code>nil</code>, if no encodings were used in the first place).
 -- * <code>location</code> - A numbered array of the locations of redirects that were followed.
 --
 -- Many of the functions optionally allow an "options" input table, which can
@@ -127,6 +130,7 @@
 local url = require "url"
 local smbauth = require "smbauth"
 local unicode = require "unicode"
+local zlib = require "zlib"
 
 _ENV = stdnse.module("http", stdnse.seeall)
 
@@ -800,6 +804,48 @@
   return cookie
 end
 
+--- Attempt to repeatedly decode HTTP response body according to a given list
+-- of named encodings.
+--
+-- @param body A string representing the raw, undecoded response body.
+-- @param encodings A list of encodings (string or table)
+-- @return A decoded body
+-- @return A list of encodings that were successfully applied
+-- @return A list of encodings that remain to be applied to decode the body completely.
+local decode_body = function (body, encodings)
+  if not encodings then return body end
+
+  if type(encodings) == "string" then
+    encodings = stringaux.strsplit("%W+", encodings)
+  end
+  assert(type(encodings) == "table", "Invalid encoding specification")
+
+  local decoded = {}
+  local undecoded = tableaux.tcopy(encodings)
+  while #undecoded > 0 do
+    local enc = undecoded[1]:lower()
+    if enc == "identity" then
+      -- do nothing
+      table.insert(decoded, table.remove(undecoded, 1))
+    elseif enc == "gzip" then
+      local stream = zlib.inflate(body)
+      local status, newbody = pcall(stream.read, stream, "*a")
+      stream:close()
+      if not status then
+        stdnse.debug1("Corrupted Content-Encoding: %s", enc)
+        break
+      end
+      body = newbody
+      table.insert(decoded, table.remove(undecoded, 1))
+    else
+      stdnse.debug1("Unrecognized Content-Encoding: %s", enc)
+      break
+    end
+  end
+
+  return body, decoded, undecoded
+end
+
 -- Read one response from the socket <code>s</code> and return it after
 -- parsing.
 --
@@ -847,6 +893,20 @@
   if not body then
     return nil, partial, fragment
   end
+
+  response.rawbody = body
+
+  if response.header["content-encoding"] then
+    local dcd, undcd
+    body, dcd, undcd = decode_body(body, response.header["content-encoding"])
+    response.decoded = dcd
+    response.undecoded = undcd
+    response.header["content-encoding"] = #undcd > 0 and table.concat(undcd, ", ") or nil
+    if response.header["content-length"] then
+      response.header["content-length"] = tostring(#body)
+    end
+  end
+
   response.body = body
 
   return response, partial
@@ -2972,6 +3032,64 @@
     end
   end
 
+  local content_encoding_tests = {
+    { name = "gzip encoding",
+      encoding = "gzip",
+      source = stdnse.fromhex("1f8b0800000000000000f348cdc9c9d75108cf2fca49510400d0c34aec0d000000"),
+      target = "Hello, World!",
+      decoded = {"gzip"},
+      undecoded = {}
+    },
+    { name = "corrupted gzip encoding",
+      encoding = "gzip",
+      source = stdnse.fromhex("2f8b0800000000000000f348cdc9c9d75108cf2fca49510400d0c34aec0d000000"),
+      target = stdnse.fromhex("2f8b0800000000000000f348cdc9c9d75108cf2fca49510400d0c34aec0d000000"),
+      decoded = {},
+      undecoded = {"gzip"}
+    },
+    { name = "identity encoding",
+      encoding = "identity",
+      source = "SomePlaintextBody",
+      target = "SomePlaintextBody",
+      decoded = {"identity"},
+      undecoded = {}
+    },
+    { name = "no encoding",
+      encoding = {},
+      source = "SomePlaintextBody",
+      target = "SomePlaintextBody",
+      decoded = {},
+      undecoded = {}
+    },
+    { name = "stacked encoding",
+      encoding = "identity, gzip, identity",
+      source = stdnse.fromhex("1f8b0800000000000000f348cdc9c9d75108cf2fca49510400d0c34aec0d000000"),
+      target = "Hello, World!",
+      decoded = {"identity", "gzip", "identity"},
+      undecoded = {}
+    },
+    { name = "unknown encoding",
+      encoding = "identity, mystery, gzip",
+      source = stdnse.fromhex("1f8b0800000000000000f348cdc9c9d75108cf2fca49510400d0c34aec0d000000"),
+      target = stdnse.fromhex("1f8b0800000000000000f348cdc9c9d75108cf2fca49510400d0c34aec0d000000"),
+      decoded = {"identity"},
+      undecoded = {"mystery", "gzip"}
+    },
+    { name = "nil encoding list",
+      encoding = nil,
+      source = "SomePlaintextBody",
+      target = "SomePlaintextBody",
+      decoded = nil,
+      undecoded = nil
+    },
+  }
+  for _, test in ipairs(content_encoding_tests) do
+    local body, dcd, undcd = decode_body(test.source, test.encoding)
+    test_suite:add_test(unittest.equal(body, test.target), test.name .. " (body)")
+    test_suite:add_test(unittest.identical(dcd, test.decoded), test.name .. " (decoded)")
+    test_suite:add_test(unittest.identical(undcd, test.undecoded), test.name .. " (undecoded)")
+  end
+
 end
 
 return _ENV;

@nnposter nnposter self-assigned this Feb 23, 2019

@nnposter nnposter added the have code label Feb 23, 2019

dmiller-nmap commented Feb 28, 2019

This is awesome! I'll look over the code later, but I'm totally on board with this kind of support, which was a big reason for including zlib bindings in the first place.

nnposter commented Mar 1, 2019

Some design items to consider:

  • Do we need to support encodings other than gzip and identity at this point? (They can always be added later.)
  • Should we proceed with the newly implemented response members decoded and undecoded or are they bloating the response object too much? (These members could be useful to some scripts but probably just very few and the information can be technically reconstructed by comparing header and rawheader.)
  • Should we return nil body if the decoding fails due to corruption? (As of now there is no simple test to differentiate between an unknown/unsupported encoding and a corrupted body. In both cases the last good body is returned and undecoded[1] contains the encoding that failed. On the other hand, returning nil is problematic because currently the response object always returns a body string, even if empty, so scripts might start failing if they were coded under this assumption.)
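One possible way for a script to tell the two failure cases apart is sketched below in plain Lua. It assumes the library supports exactly the encodings handled in the patch (identity and gzip). Because decoding stops at the first encoding that fails, a supported name in `undecoded[1]` would point to a corrupted body, while any other name would point to a lack of support. The table `SUPPORTED` and the function `decode_failure` are hypothetical names, not part of the patch.

```lua
-- Assumption: mirrors the encodings handled by decode_body in the patch.
local SUPPORTED = { identity = true, gzip = true }

-- Hypothetical helper: classify why decoding stopped.
-- Returns nil when nothing is left to decode,
-- otherwise the string "corrupted" or "unsupported".
local function decode_failure(response)
  local pending = response.undecoded
  if not pending or #pending == 0 then
    return nil
  end
  local enc = pending[1]:lower()
  return SUPPORTED[enc] and "corrupted" or "unsupported"
end

-- Mock responses (assumed values) covering the three outcomes.
print(decode_failure({ undecoded = { "gzip" } }))
print(decode_failure({ undecoded = { "mystery", "gzip" } }))
print(decode_failure({ undecoded = {} }))
```

The drawback of this approach is that each script would need to keep its own copy of the supported list in sync with the library, which is an argument for exposing the distinction in the response object instead.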