Scraping pages behind logins #112

jonlong opened this Issue Jul 23, 2012 · 2 comments




Hey Chris,

I would love to get your recommendation on the best method to scrape pages behind login screens. I've been experimenting with basic authentication via the setHeader method, but my returned pages are still unauthenticated, and I'm afraid I'm headed into unfamiliar territory.

Any chance you could point me in the right direction?



Hi Jon,

To scrape a page behind a login you need to a) emulate a login request by doing exactly what the login form does, and then b) send the resulting login/auth cookie with subsequent requests. That cookie is what lets you access the protected page.

In pseudo-code your run() method would look something like this:

If we have a login cookie stored in some variable
    Set the login cookie using `this.setCookie(name, value)`
    *Scrape the protected page as you normally would
Else, simulate a login
    Send a post request to the same location as the login form
        Once complete, grab the login cookie from the response headers
        Save the cookie to an outer variable for subsequent calls to run()
        *Scrape the protected page as you normally would
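
For anyone who wants something closer to real code, here is a minimal sketch of the same flow in plain JavaScript. It is not this library's API: it assumes a `fetch`-style function is passed in (for example the global `fetch` in Node 18+), and `scrapeProtected`, `loginUrl`, `protectedUrl` and `credentials` are hypothetical names made up for illustration. Inside a real run() method you would set the cookie with `this.setCookie(name, value)` as described above instead of building the header by hand.

```javascript
// Outer variable: survives between calls, like the "login cookie stored
// in some variable" in the pseudo-code.
let loginCookie = null;

// httpFetch is any fetch-compatible function (assumption: Node 18+ global fetch).
// opts = { loginUrl, protectedUrl, credentials } are hypothetical names.
async function scrapeProtected(httpFetch, opts) {
  if (!loginCookie) {
    // Simulate the login form: POST the same fields to the same location.
    const res = await httpFetch(opts.loginUrl, {
      method: 'POST',
      headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
      body: new URLSearchParams(opts.credentials).toString(),
      // Do not follow redirects, so the Set-Cookie on the login
      // response itself stays visible.
      redirect: 'manual',
    });

    // Grab the login cookie from the response headers.
    const setCookie = res.headers.get('set-cookie');
    if (!setCookie) {
      throw new Error('Login response did not set a cookie');
    }

    // Keep only "name=value"; attributes such as Path or HttpOnly are
    // not sent back to the server. Assumes a single login cookie.
    loginCookie = setCookie.split(';')[0].trim();
  }

  // Scrape the protected page as usual, sending the stored cookie.
  const page = await httpFetch(opts.protectedUrl, {
    headers: { Cookie: loginCookie },
  });
  return page.text();
}
```

Calling it would look roughly like `scrapeProtected(fetch, { loginUrl: 'https://example.com/login', protectedUrl: 'https://example.com/account', credentials: { username: 'jon', password: 'secret' } })`, where the URLs and field names are placeholders for whatever the real login form uses. A site that sets several cookies on login would need all of them collected, not just the first.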

I'd usually just use Firebug or the built-in Chrome developer tools to have a look at the request/response flow when logging in. Open one up, log in as you normally would, and then you're after 3 things:

  1. The login field names, usually just some derivative of username and password. More complicated forms might bundle a nonce or some other security mechanism. You can get around these by scraping the login page first and submitting the values it contains when you emulate the login.
  2. Where the form POSTs to.
  3. Which cookies are set in the login response.
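
If you would rather pull those three things out in code than read them off the dev tools, the rough sketch below shows one way. Both helpers (`extractLoginForm` and `toCookieHeader`) are hypothetical, not part of any library. The regex-based parsing is brittle: it assumes double-quoted attributes and a single form on the page, so a proper HTML parser is the safer choice for anything non-trivial.

```javascript
// Find where the login form POSTs to and collect its hidden inputs
// (this is where a nonce or CSRF token usually lives).
// Assumption: attributes are double-quoted and the page has one form.
function extractLoginForm(html) {
  const formMatch = /<form\b[^>]*\baction="([^"]*)"/i.exec(html);
  const fields = {};
  const inputRe = /<input\b[^>]*>/gi;
  let m;
  while ((m = inputRe.exec(html)) !== null) {
    const tag = m[0];
    // Visible fields (username, password) are filled in by you;
    // only hidden ones need to be copied from the page.
    if (!/\btype="hidden"/i.test(tag)) continue;
    const name = /\bname="([^"]*)"/i.exec(tag);
    const value = /\bvalue="([^"]*)"/i.exec(tag);
    if (name) fields[name[1]] = value ? value[1] : '';
  }
  return { action: formMatch ? formMatch[1] : null, fields };
}

// Turn the Set-Cookie values from the login response into a single
// Cookie request header, dropping attributes like Path and HttpOnly.
function toCookieHeader(setCookieValues) {
  return setCookieValues
    .map((value) => value.split(';')[0].trim())
    .join('; ');
}
```

The hidden fields returned by `extractLoginForm` get merged with your username and password before POSTing to `action`, and the string from `toCookieHeader` is what you send back on later requests.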

Good luck!

@chriso chriso closed this Oct 14, 2012

What about basic authentication where forms are not used? Such as
