
httpspider redesign to only spider each site once #82

Open
dmiller-nmap opened this issue Mar 14, 2015 · 5 comments

@dmiller-nmap

This has been brought up before, e.g. on the Script Ideas wiki page. I thought about it a little more and sketched a theoretical framework in some very basic pseudo-Lua. It uses a multi-phase approach:

  • portrule: scripts decide whether to run (same as current design)
  • portaction: each script registers callbacks which the spider uses to determine how deep to spider, what resources to retrieve, etc. It also registers a "report" callback that actually produces output (in the same way that an NSE script would return output).
  • hostrule: returns true if the engine has any callbacks registered for this SCRIPT_NAME+host+port combination
  • hostaction: Handles launching the spider if necessary, or waiting for it to be done; then it runs the "report" callback.

Example pseudocode for the httpspider library:

hostrule = function(host)
  -- this table would be filled out during registration of callbacks
  return registered[host] and registered[host][stdnse.getid()]
end

hostaction = function(host)
  local report = {}
  local ports = host.registry.spider_these_ports -- or something like this
  -- loop over the registered ports; each iteration could also be its own coroutine
  for port in pairs(ports) do
    local spider_mutex = nmap.mutex(port)
    local results_condvar = nmap.condvar(port)
    -- make sure nobody else is currently spidering
    if spider_mutex("trylock") then
      -- we have a lock.
      -- make sure nobody else already spidered
      if not ports[port].done then
        -- do the hard work for everyone else
        spider(host, port)
        -- Let them know we're done.
        ports[port].done = true
      end
      -- unlock the mutex (even if someone else had already spidered)
      spider_mutex "done"
      -- wake everyone up ("signal" would only wake a single waiter)
      results_condvar "broadcast"
    else
      -- we couldn't lock, so someone else is spidering. Wait for them to be done.
      while not ports[port].done do
        results_condvar "wait"
      end
    end
    report[port] = get_report_callback(host, port, stdnse.getid())()
  end

  return report
end

-- returns a hostrule plus an action function that dispatches on SCRIPT_TYPE
wrap = function(register)
  local actions = {
    portrule = register,
    hostrule = hostaction
  }
  return hostrule, function(...) return actions[SCRIPT_TYPE](...) end
end
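
The elided spider(host, port) step is where the registered callbacks would actually be driven. As a rough sketch (not part of the proposal itself), it could reuse the existing httpspider.Crawler loop, feeding every fetched response to every callback registered for that host and port. The registered_callbacks accessor below is hypothetical, and a real version would have to merge the options from all registrations (e.g. take the largest maxdepth):

local httpspider = require "httpspider"

-- Sketch only: crawl once per host+port, dispatching each response to all
-- registered callbacks. registered_callbacks() is a hypothetical accessor.
local function spider(host, port)
  -- the options table would be merged from every registration
  local crawler = httpspider.Crawler:new(host, port, "/", {maxdepth = 3})
  while true do
    local status, r = crawler:crawl()
    if not status then break end -- queue exhausted or an error occurred
    for _, callback in ipairs(registered_callbacks(host, port)) do
      callback(r.response)
    end
  end
end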

Example of a script using this API:

    portrule = shortport.http

    local function portaction(host, port)
      -- register callbacks, etc.
      -- These will be closures over tables local to this function
      -- which can be used to store and retrieve results.
      -- One of the callbacks will be a "report" callback, which retrieves all results
      -- for this host+port+script combo and formats them.
      -- Theoretical example:
      local results = {}
      local function find_cc_numbers(response)
        for num in string.gmatch(response.body, string.rep("%d", 16)) do
          table.insert(results, num)
        end
      end
      httpspider.register_callback(find_cc_numbers, {maxdepth=5, filetypes={"html", "htm"}})
      local function format_results()
        return results
      end
      httpspider.results_function(format_results)
    end

    hostrule, action = httpspider.wrap(portaction)

Note that there are lots of things left to be defined, though I don't expect any of them to be particularly hard:

  • How are callbacks stored and retrieved? (one possible scheme is sketched after this list)
  • How can the spider engine know how deep to spider and what things to retrieve? (This may already be solved in the current version, but would need to be adapted to the new design.)
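
One possible storage scheme, purely as an illustration (the host, port, and script id would presumably be captured implicitly when a script calls register_callback): nested tables keyed by target and script id, kept local to the httpspider library or in nmap.registry.

-- Illustration only: one slot per host+port+script id, holding the data
-- callbacks and the "report" callback for that combination.
local registered = {}

local function slot(host, port, id)
  local key = host.ip .. ":" .. port.number
  registered[key] = registered[key] or {}
  registered[key][id] = registered[key][id] or {callbacks = {}}
  return registered[key][id]
end

function register_callback(host, port, id, callback, options)
  table.insert(slot(host, port, id).callbacks, {func = callback, options = options})
end

function results_function(host, port, id, report)
  slot(host, port, id).report = report
end

function get_report_callback(host, port, id)
  return slot(host, port, id).report
end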

EDITED: I realized my initial proposal code had some errors regarding the port, which is only passed to the portaction, not the hostaction. Not a problem, since wherever the callbacks are stored, they'll be retrieved by port. The code is just a mockup, but it should demonstrate basic feasibility.

@batrick

batrick commented Mar 15, 2015

Dan, I haven't fully reviewed your proposal yet, but I have one question. In the past we found this a non-trivial problem to solve, because scripts' hostrules/portrules are no longer evaluated in one phase with the action functions run in the next phase. Now that we have a pipeline architecture, a rule is evaluated and then its action function runs immediately (the rule function has become a sort of useless historical artifact). So the problem we have is that two scripts which want to spider a site may never run concurrently. Said differently, the second script may not get the chance to register its callback before the first script finishes. Does your proposal address this?

@dmiller-nmap
Author

@batrick Yes, because the actual work is done in a completely separate phase from the callback registration. Callbacks are registered in the action of the portrule phase, and the spidering occurs in the action of the hostrule phase. This has the downside of requiring scripts to output results in the host script section, not next to the port they refer to, but I think it could be made to work. Alternatively (and this is getting really wild, with big invasive changes to NSE), we could expose an API for hostrule scripts to add output to the ports directly, or for any script to add output to other ports; this could also be used for things like rpcinfo, mdns-service-discovery, etc., which obtain information about multiple ports on the same host.

@dmiller-nmap
Author

After further discussion with @batrick on IRC, we determined that the current design in nse_main.lua is to launch hostrule and portrule scripts in the same scan phase (NSE_SCAN), so there is no guarantee that the hostaction will be executed after the portaction (in fact, it appears to be the opposite).

One alternative that preserves the same sort of idea would be to register the callbacks in the pre-scanning phase (the action for "prerule"), then execute them during the portrule phase. This simplifies the spidering, since the port would be a parameter of the portaction. However, it means that the data store which accumulates the results for each port cannot be a table that the callbacks close over; instead, it must be a parameter passed to the httpspider call. This would probably look something like this (in a script):

local function preaction()
  -- SCRIPT_NAME here and below may not be necessary if there's some other way to tag
  -- these callbacks to this script.
  httpspider.register_callback(SCRIPT_NAME, some_function, {maxdepth=2, other_stuff="whatever"})
end

local function portaction(host, port)
  local results = {}
  -- this function would contain the logic from "hostaction" in the earlier proposal
  httpspider.run_spider(SCRIPT_NAME, host, port, results)
  return results
end

-- again, this is just taking the work out of making a dispatch table for the appropriate action.
-- The prerule would probably just be true every time.
prerule, action = httpspider.wrap(preaction, portaction)

portrule = shortport.http -- or something more complicated. Doesn't require wrapping.
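
To make the library side of this concrete, a purely illustrative sketch follows: callbacks registered in the pre-scanning phase, stashed in nmap.registry keyed by script name, then run during the portrule phase. The storage location and the crawl_results helper are assumptions for illustration, not part of the proposal.

local nmap = require "nmap"

-- Hypothetical httpspider internals for the second proposal. Sketch only.
local function store()
  nmap.registry.httpspider_callbacks = nmap.registry.httpspider_callbacks or {}
  return nmap.registry.httpspider_callbacks
end

function register_callback(scriptname, func, options)
  local s = store()
  s[scriptname] = s[scriptname] or {}
  table.insert(s[scriptname], {func = func, options = options})
end

function run_spider(scriptname, host, port, results)
  -- The mutex/condvar dance from the earlier hostaction mockup would go here,
  -- so the crawl itself happens only once per host+port. Afterwards, each
  -- cached response is handed to this script's callbacks, which accumulate
  -- their findings into the caller-supplied 'results' table.
  for _, cb in ipairs(store()[scriptname] or {}) do
    for _, response in ipairs(crawl_results(host, port)) do -- hypothetical cache
      cb.func(response, results)
    end
  end
end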

@batrick

batrick commented Mar 16, 2015

This is one of those instances where I wish we had added a "_SCRIPT" table or similar that all threads of a script share. That could be used instead of SCRIPT_NAME.
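
(In the meantime, a per-script shared table can be approximated via the registry; a minimal sketch:)

-- All threads of a script share one registry slot keyed by SCRIPT_NAME.
local function script_shared()
  nmap.registry[SCRIPT_NAME] = nmap.registry[SCRIPT_NAME] or {}
  return nmap.registry[SCRIPT_NAME]
end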

Anyways, what you've described makes sense to me, Dan.

@cldrn
Member

cldrn commented Mar 22, 2015

I attempted to solve this problem with registry variables that kept track of the crawlers' status and of whether different scripts were requesting the same URL. I can't remember why it didn't work, but the code shows the idea behind it:
https://github.com/cldrn/nmap-nse-scripts/blob/master/nselib/httpspider.lua

However, I like this approach much better. The callbacks give script authors more flexibility.
