pismo - Web page content analysis and metadata extraction
Pismo extracts machine-usable metadata from unstructured (or poorly structured) English-language HTML documents. Pismo can extract titles, feed URLs, ledes, body text, image URLs, dates, and keywords. Pismo is used heavily in production on http://coder.io/ to extract data from Web pages.
All tests pass on Ruby 1.8.7 (MRI) and Ruby 1.9.1-p378 (MRI).
A basic example of extracting metadata from a Web page:

    require 'pismo'

    # Load a Web page (you could pass an IO object or a string with existing HTML data along, as you prefer)
    doc = Pismo::Document.new('http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html')

    doc.title     # => "Cramp: Asychronous Event-Driven Ruby Web App Framework"
    doc.author    # => "Peter Cooper"
    doc.lede      # => "Cramp (GitHub repo) is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
    doc.keywords  # => [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2], ..., ... ]
There's also a shorter "convenience" method which might be handy in IRB - it does the same as Pismo::Document.new:
    Pismo['http://www.rubyflow.com/items/4082'].title   # => "Install Ruby as a non-root User"
The current metadata methods are:
These methods are not fully documented here yet - you'll just need to try them out. The plural methods like #titles, #authors, and #feeds will return multiple matches in an array, if present. This is so you can use your own techniques to choose a "best" result in ambiguous cases.
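As an illustration of choosing a "best" result yourself, here is a minimal sketch. The helper best_candidate and its max_length parameter are hypothetical and not part of Pismo, and the heuristic (longest candidate that still fits a length limit) is only one possible choice. The sample array stands in for what a plural method such as #titles might return.

```ruby
# Hypothetical helper (not part of Pismo): pick a "best" candidate from an
# array of strings, such as the one returned by a plural method like #titles.
def best_candidate(candidates, max_length: 80)
  # Drop nils, trim whitespace, and remove blanks and duplicates.
  cleaned = candidates.compact.map(&:strip).reject(&:empty?).uniq
  # Prefer the longest candidate that fits within max_length;
  # fall back to the shortest one if every candidate is too long.
  fitting = cleaned.select { |c| c.length <= max_length }
  fitting.max_by(&:length) || cleaned.min_by(&:length)
end

# Stand-in data; with Pismo loaded this could be doc.titles instead.
titles = ["Cramp", "Cramp: Asychronous Event-Driven Ruby Web App Framework", " Cramp "]
puts best_candidate(titles)
```

A different rule (for example, preferring the candidate that appears most often) can be swapped in without changing how the array is obtained.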
The html_body and body methods will be of particular interest. They return the "body" of the page as determined by Pismo's "Reader" algorithm (similar to Arc90's Readability or Safari Reader). #body returns it as plain text, while #html_body maintains some basic HTML styling.
CAVEATS AND SHORTCOMINGS:
There are some shortcomings or problems that I'm aware of and am going to pursue:
- I do not know how Pismo fares on JRuby, Rubinius, or others yet.
- The "Reader" content extraction algorithm is not perfect. It can sometimes return crap and can barf on certain types of characters for sentence extraction.
- The author name extraction is quite poor.
- The image extraction only handles images with absolute URLs.
- The stopword list leaves a bit to be desired. It errs on the side of being too long rather than too short, though (1024 words long!)
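Until the image extraction handles relative URLs, one possible workaround is to resolve image paths against the page URL yourself. The sketch below uses only Ruby's standard uri library. The helper absolutize_image_urls and the sample URLs are made up for illustration, and the src values are assumed to have been collected by some other means (for example, an HTML parser).

```ruby
require 'uri'

# Hypothetical helper (not part of Pismo): resolve image src values against
# the URL of the page they came from, skipping values that cannot be parsed.
def absolutize_image_urls(page_url, srcs)
  srcs.map do |src|
    begin
      URI.join(page_url, src).to_s
    rescue URI::Error
      nil # unparseable src values are dropped by compact below
    end
  end.compact
end

page = 'http://example.com/articles/post.html'
srcs = ['/images/a.png', 'b.jpg', '../c.gif', 'http://cdn.example.com/d.png']
puts absolutize_image_urls(page, srcs)
```

Absolute URLs pass through unchanged, so the helper can be applied to a mixed list.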
OTHER GROOVY STUFF:
Command Line Tool
A command line tool called "pismo" is included so that you can get metadata about a page from the command line. This is great for testing, or perhaps calling it from a non-Ruby script. The output is currently in YAML.

    ./bin/pismo http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html title lede author datetime

    ---
    :url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
    :title: "Cramp: Asychronous Event-Driven Ruby Web App Framework"
    :lede: Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals
    :author: Peter Cooper
    :datetime: 2010-01-07 12:00:00 +00:00
If you call pismo without any arguments (except a URL), it starts an IRB session so you can directly work in Ruby. The URL provided is loaded and assigned to both the constant 'P' and the variable @p.
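Because the command line output is YAML, another script can parse it with a standard YAML library. The following is a rough sketch in Ruby using the standard yaml library. The sample text is a shortened copy of the output shown above rather than a live capture, and running the tool (for example, with backticks) is left out.

```ruby
require 'yaml'

# Sample text shaped like the tool's output; in practice this would be
# captured from the pismo command instead of being written inline.
output = <<~YAML
  ---
  :url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
  :title: "Cramp: Asychronous Event-Driven Ruby Web App Framework"
  :author: Peter Cooper
YAML

# The keys are Ruby symbols, so Symbol has to be permitted for safe_load.
# Time is permitted as well in case a datetime field is present.
meta = YAML.safe_load(output, permitted_classes: [Symbol, Time])

puts meta[:title]
puts meta[:author]
```

The permitted_classes keyword needs a reasonably recent Ruby. On the older Rubies this README targets, plain YAML.load is the closest equivalent.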
You can access Pismo's stopword list directly:
    Pismo.stopwords   # => [.., .., ..]
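To show what a stopword list is typically used for, here is a small sketch that counts words while ignoring stopwords and returns [word, count] pairs in the same shape as the keywords example earlier. This is not Pismo's actual keyword algorithm. The STOPWORDS constant is a tiny stand-in list and keyword_counts is a hypothetical helper; with Pismo loaded, Pismo.stopwords could be passed in instead.

```ruby
# Tiny stand-in stopword list (an assumption for this sketch); with Pismo
# loaded, Pismo.stopwords could be used instead.
STOPWORDS = %w[a an and is of the to].freeze

# Hypothetical helper: count non-stopword terms and return [word, count]
# pairs, most frequent first (ties broken alphabetically).
def keyword_counts(text, stopwords: STOPWORDS, limit: 5)
  words = text.downcase.scan(/[a-z]+/)
  counts = Hash.new(0)
  words.each { |w| counts[w] += 1 unless stopwords.include?(w) }
  counts.sort_by { |word, count| [-count, word] }.first(limit)
end

p keyword_counts("Cramp is a framework. The Cramp framework is evented, and Cramp is new.")
# => [["cramp", 3], ["framework", 2], ["evented", 1], ["new", 1]]
```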
Note on Patches/Pull Requests
- Fork the project.
- Make your feature addition or bug fix.
- Add tests for it. This is important so I don't break it in a future version unintentionally.
- Commit, but do not mess with the Rakefile, version, or history, as these are handled by Jeweler (which is awesome, btw).
- Send me a pull request. I may or may not accept it (sorry, practicality rules.. but message me and we can talk!)
COPYRIGHT AND LICENSE
Apache 2.0 License - See LICENSE for details. Copyright (c) 2009, 2010 Peter Cooper
In short, you can use Pismo for whatever you like, commercial or not, but please include a brief credit (as in the NOTICE file, as per the Apache 2.0 License) somewhere deep in your license file or similar. If you're nice and have the time, let me know if you're using it and/or share any significant changes or improvements you make.