SoupScraper runs on Google AppEngine. This one is a remake using Java + TagSoup + Rhino + EnvJS + SizzleJS.

SoupScraper

…is a web service for scraping data from HTML web pages.

It uses…

  • Google AppEngine
  • TagSoup – Java library to parse invalid XML (e.g. nasty HTML)
  • Rhino – JavaScript engine for Java
  • EnvJS – JavaScript library that simulates a browser environment
  • JSON – JSON parser in JavaScript
  • SizzleJS – Best CSS Selector Engine you can find… ;-)

There are currently two versions; this one is version 2.

Schema

Simple Example

URL: http://sizzlejs.com/

{ "download_link" : ["p.download.link a","@href"] }

…will output…

{
   "download_link": "http://github.com/jeresig/sizzle/zipball/master"
}
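
The example above suggests that each schema entry is a pair of a CSS selector and an extractor. The sketch below shows one way such an entry could be classified. It is not SoupScraper's actual code: `parseRule` is a hypothetical helper, and the meaning of `@attr`, `TEXT`, nested objects, and the `[]` key suffix is inferred only from the examples in this README.

```javascript
// Hypothetical helper (not part of SoupScraper): classify one schema entry,
// e.g. "download_link": ["p.download.link a", "@href"].
// Assumed conventions, inferred from the README examples:
//   key ending in "[]"  -> collect every match into an array
//   "TEXT"              -> use the text of the matched element
//   "@name"             -> use the attribute called "name"
//   object              -> nested schema, evaluated relative to the match
function parseRule(key, rule) {
  const isList = key.endsWith("[]");
  const name = isList ? key.slice(0, -2) : key;
  const [selector, extractor] = rule;

  let extract;
  if (extractor === "TEXT") {
    extract = { kind: "text" };
  } else if (typeof extractor === "string" && extractor.startsWith("@")) {
    extract = { kind: "attribute", attribute: extractor.slice(1) };
  } else if (extractor !== null && typeof extractor === "object") {
    extract = { kind: "nested", schema: extractor };
  } else {
    throw new Error("Unrecognised extractor for key " + key);
  }
  return { name, isList, selector, extract };
}

// Classify the entry from the simple example.
console.log(parseRule("download_link", ["p.download.link a", "@href"]));
```

Separating classification from evaluation keeps the selector engine out of the picture, so the schema conventions can be checked without a DOM.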

A more complex example:

URL: http://en.wikipedia.org/wiki/Quake

{
  "game" : ["#content",{
    "name" : ["table.infobox .summary i", "TEXT"],
    "developers[]" : ["table.infobox tr:has(td a[href=/wiki/Video_game_developer]) td+td a", "TEXT"],
    "external_links[]" : ["h2:contains(links)+ul li", {
      "title": ["a.external","TEXT"],
      "href": ["a.external","@href"]
    }]
  }]
}

…will output…

 
{
   "game": {
      "name": "Quake",
      "developers": [
         "id Software",
         "Midway Games",
         "N64",
         "Lobotomy Software",
         "SS"
      ],
      "external_links": [
         {
            "title": "id Software: Quake",
            "href": "http://www.idsoftware.com/games/quake/quake/"
         },
         {
            "title": "Quake",
            "href": "http://www.dmoz.org/Games/Video_Games/Shooter/Q/Quake_Series/Quake/"
         }
      ]
   }
}
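
The nested output above implies that a schema is evaluated recursively: each nested schema is applied relative to the element its selector matched. Below is a minimal sketch of that idea, not the project's implementation. The names `scrape`, `extractValue`, and `lookupSelect` are hypothetical, the node objects are a mock rather than a real DOM, and two behaviours are assumptions: keys without `[]` take the first match, and no match yields `null`. In SoupScraper the role of `select` would presumably be played by SizzleJS.

```javascript
// Sketch of recursive schema evaluation (assumed semantics, see lead-in).
// `select(context, selector)` is injected and must return an array of nodes.
function scrape(schema, context, select) {
  const result = {};
  for (const [key, [selector, extractor]] of Object.entries(schema)) {
    const isList = key.endsWith("[]");
    const name = isList ? key.slice(0, -2) : key;
    const values = select(context, selector)
      .map((node) => extractValue(node, extractor, select));
    // Assumption: non-list keys keep only the first match, or null if none.
    result[name] = isList ? values : (values.length > 0 ? values[0] : null);
  }
  return result;
}

function extractValue(node, extractor, select) {
  if (extractor === "TEXT") return node.text;
  if (typeof extractor === "string" && extractor.startsWith("@")) {
    return node.attrs[extractor.slice(1)];
  }
  // Anything else is treated as a nested schema, scoped to this node.
  return scrape(extractor, node, select);
}

// Mock "DOM": each node lists its matches per selector in `children`.
// A real setup would run a CSS selector engine here instead.
const lookupSelect = (context, selector) => context.children[selector] || [];

const page = {
  text: "", attrs: {},
  children: {
    "#content": [{
      text: "", attrs: {},
      children: {
        "i": [{ text: "Quake", attrs: {}, children: {} }],
        "a": [
          { text: "id Software", attrs: {}, children: {} },
          { text: "Midway Games", attrs: {}, children: {} },
        ],
      },
    }],
  },
};

const schema = {
  "game": ["#content", {
    "name": ["i", "TEXT"],
    "developers[]": ["a", "TEXT"],
  }],
};

console.log(JSON.stringify(scrape(schema, page, lookupSelect), null, 2));
```

Passing `select` as a parameter keeps the recursion independent of any particular selector engine, which also makes it easy to test with a stub like `lookupSelect`.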

Author

© 2008-2009 by mathias leppich aka muhqu
