Use node.js to parse Wikileaks cablegate HTML pages and output cable content as JSON

Cablescrape

WARNING: this is a work in progress.

Requirements:

  • node.js (tested with node 0.2.0 and v0.3.2-pre)
  • htmlparser
  • optimist

Included Dependencies:

The scrape.js Node.js script reads downloaded Wikileaks cablegate web pages and uses the jsdom project's jQueryify function to extract content from them. The content is then output as JSON.
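As a rough illustration of that HTML-to-JSON flow, here is a minimal sketch. It is not the code in scrape.js: it assumes, hypothetically, that a cable page keeps its header and body in `<pre>` blocks, it swaps the jsdom/jQuery DOM queries for a plain regular expression so it runs without dependencies, and the names `extractPreBlocks` and `cableToJson` and the output fields are made up for the example.

```javascript
// Sketch only: the page markup and field names are assumptions, and a regex
// stands in for the jsdom/jQuery DOM queries that scrape.js relies on.

// Collect the text content of every <pre> block in an HTML string.
function extractPreBlocks(html) {
  var blocks = [];
  var pattern = /<pre[^>]*>([\s\S]*?)<\/pre>/gi;
  var match;
  while ((match = pattern.exec(html)) !== null) {
    // Drop any nested tags and trim surrounding whitespace.
    blocks.push(match[1].replace(/<[^>]+>/g, '').trim());
  }
  return blocks;
}

// Build a plain object for one cable page and serialize it as JSON.
// Assumption: the first block is the header, the rest form the body.
function cableToJson(id, html) {
  var blocks = extractPreBlocks(html);
  return JSON.stringify({
    id: id,
    header: blocks[0] || '',
    body: blocks.slice(1).join('\n')
  });
}
```

Calling `cableToJson(someId, pageHtml)` for each downloaded page and printing the result would give one JSON object per cable. A real DOM parser is more robust than a regex on messy HTML, which is presumably why the script uses jsdom.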

To get the cablegate web pages, use the Bash script "refresh_cables.sh" to download a fresh copy of the Wikileaks site and put its "cable" directory into the same directory as "scrape.js".

Once you have cables to scrape, enter the following command to output the JSON:

./scrape.js

Note that older versions of jsdom and jQuery are used (and included) as I couldn't get things to work otherwise. The html_entity_decode function from the php.js project is used to decode HTML entities.
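To show what entity decoding involves, here is a minimal sketch. It is not the php.js html_entity_decode implementation: the function name `decodeEntities` and the small entity table are assumptions for the example, and only a handful of named entities plus numeric references are handled.

```javascript
// Sketch only: covers a small subset of what php.js's html_entity_decode handles.
var NAMED_ENTITIES = {
  amp: '&',
  lt: '<',
  gt: '>',
  quot: '"',
  apos: "'",
  nbsp: '\u00a0'
};

// Replace named (&amp;), decimal (&#65;) and hex (&#x42;) entities in one pass.
// Unknown named entities are left untouched.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, function (whole, name) {
    if (name.charAt(0) === '#') {
      var isHex = name.charAt(1).toLowerCase() === 'x';
      var code = parseInt(name.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // String.fromCharCode only covers code points up to 0xFFFF.
      return String.fromCharCode(code);
    }
    return NAMED_ENTITIES.hasOwnProperty(name) ? NAMED_ENTITIES[name] : whole;
  });
}
```

Because the replacement runs in a single pass, an input such as `&amp;lt;` decodes to `&lt;` rather than being decoded twice.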
