Scraping pages behind logins #112

Closed
jonlong opened this Issue Jul 23, 2012 · 2 comments

@jonlong

Hey Chris,

I would love to get your recommendation on the best method to scrape pages behind login screens using node.io. I've been experimenting with basic authentication via the setHeader method, but my returned pages are still unauthenticated, and I'm afraid I'm headed into unfamiliar territory.

Any chance you could point me in the right direction?

Thanks!

@chriso

Hi Jon,

To scrape a page behind a login you need to a) emulate the login request by doing exactly what the login form does, and then b) send the resulting login/auth cookie with subsequent requests. That cookie is what lets you access the protected page.

In pseudo-code, your run() method would look something like this:

If we have a login cookie stored in some variable
    Set the login cookie using `this.setCookie(name, value)`
    *Scrape the protected page as you normally would
Else, simulate a login
    Send a post request to the same location as the login form
        Once complete, grab the login cookie from the response headers
        Save the cookie to an outer variable for subsequent calls to run()
        *Scrape the protected page as you normally would
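
The "grab the login cookie from the response headers" step can be sketched in plain Node.js as below. This is only an illustration under assumptions: `extractCookies` and `loginCookies` are hypothetical names, not part of node.io, and the headers object is assumed to have the shape Node's `http` module produces (lower-cased keys, with `set-cookie` as an array).

```javascript
// Hypothetical helper, not a node.io API.
// Pulls name/value pairs out of a response's Set-Cookie headers.
function extractCookies(headers) {
  var raw = headers['set-cookie'] || [];
  if (!Array.isArray(raw)) raw = [raw];
  var cookies = {};
  raw.forEach(function (line) {
    // The first segment is the cookie itself; the rest are attributes
    // such as Path, Expires or HttpOnly, which the client does not send back.
    var pair = line.split(';')[0];
    var eq = pair.indexOf('=');
    if (eq > 0) {
      cookies[pair.slice(0, eq).trim()] = pair.slice(eq + 1).trim();
    }
  });
  return cookies;
}

// Outer variable, so the cookies survive between calls to run().
var loginCookies = null;
```

Each entry in the returned object could then be handed to `this.setCookie(name, value)` on later calls to run().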

I'd usually just use Firebug or the built-in Chrome developer tools to have a look at the request/response flow when logging in. Open one up, log in as you normally would, and then you're after 3 things:

  1. The login field names, usually just some derivative of username and password. More complicated forms might bundle a nonce or some other security mechanism. You can get around these by actually scraping the login page before emulating a login.
  2. The URL the form POSTs to.
  3. The cookies that are set in the login response.
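
For forms that bundle a nonce, the POST body might be assembled roughly as follows. This is a sketch under assumptions: `hiddenFields` and `buildLoginBody` are hypothetical helpers, the field names `csrf_token`, `username` and `password` in the usage note are made up, and the regex-based parsing only copes with simple, quoted attributes (a real scraper would use a proper DOM parser).

```javascript
var querystring = require('querystring');

// Hypothetical helper: collect <input type="hidden"> name/value pairs
// from the login page HTML. Only handles quoted name/value attributes.
function hiddenFields(html) {
  var fields = {};
  var inputs = html.match(/<input\b[^>]*>/gi) || [];
  inputs.forEach(function (tag) {
    if (!/type\s*=\s*["']?hidden/i.test(tag)) return;
    var name = /name\s*=\s*["']([^"']*)["']/i.exec(tag);
    var value = /value\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (name) fields[name[1]] = value ? value[1] : '';
  });
  return fields;
}

// Hypothetical helper: merge the hidden fields with the credentials and
// encode the result the way a browser encodes a standard form POST.
function buildLoginBody(html, credentials) {
  var fields = hiddenFields(html);
  Object.keys(credentials).forEach(function (key) {
    fields[key] = credentials[key];
  });
  return querystring.stringify(fields);
}
```

Calling `buildLoginBody(loginPageHtml, { username: 'jon', password: 's3cret' })` would give a string to send as the body of the POST to the form's target URL, with a `Content-Type` of `application/x-www-form-urlencoded`.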

Good luck!

@chriso closed this Oct 14, 2012

@davidhooey

What about basic authentication, where forms are not used, such as http://myusername:mypassword@mysite.com?
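
For reference, HTTP Basic authentication is normally sent as an `Authorization` header containing the base64 of `username:password`, which is what credentials embedded in a URL translate to on the wire. A minimal sketch, assuming a hypothetical `basicAuthHeader` helper (not a node.io function); how node.io expects the header to be passed is not shown in this thread:

```javascript
// Hypothetical helper, not a node.io API.
// Builds the value of an HTTP Basic "Authorization" header.
function basicAuthHeader(username, password) {
  var token = Buffer.from(username + ':' + password, 'utf8').toString('base64');
  return 'Basic ' + token;
}
```

The returned string is the header value only; it would still need to be attached to the request under the header name `Authorization`.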
