Welcome to the goose wiki!
Try it out online! http://jimplush.com/blog/goose
You can follow the latest developments from my twitter account: jimplush http://twitter.com/#!/jimplush
Links of importance
- Sites currently Unit Tested with Goose
- Configuration - Configuration options for Goose - how to set your own paths / options
Projects actively using the Goose library
Using Goose from the command line
You can now use goose from the command line to batch process extractions or just do a quick one for test purposes.
Download the goose source
cd into the goose directory
MAVEN_OPTS="-Xms256m -Xmx2000m" mvn exec:java -Dexec.mainClass=com.gravity.goose.TalkToMeGoose -Dexec.args="http://techcrunch.com/2011/05/13/native-apps-or-web-apps-particle-code-wants-you-to-do-both/" -e -q > ~/Desktop/gooseresult.txt
That will put the results of the extraction into the gooseresult.txt file on your desktop.
Project Goose is an article extractor written in Scala, using Maven for its dependencies. It's an open source project born from Gravity Labs (http://gravity.com). Its goal is to take a webpage, perform calculations, and extract the main text of the article, as well as recommend which image on the page is likely to be the most relevant. Goose aims to be an easy-to-use, scalable extractor that can plug into any application that needs to extract structure from unstructured web pages.
Goose was born for a project where we needed to take any article page, extract the pure text of the content, and pick what we thought was the most important image on that page. It's geared more toward NLP-type processing, where you just care about the raw text of the article, but I coded it so that we can add new OutputFormatter classes that override that behavior and give you more of a Flipboard-type extraction where the content is all inline. Goose has performed tens of millions of extractions, and we wanted to give back what we learned about extraction.
Project Goose was in fact named after the Top Gun character whose call sign is "Goose". We were on a major Top Gun kick one week, and that's what happens: projects get weird names. It is based on Arc90's readability code but has definitely moved away from their initial implementation and added image extraction. To see how it works, check out some of the unit tests: https://github.com/jiminoc/goose/blob/master/src/test/scala/com/gravity/goose/GoldSitesTestIT.scala
No article extractor will ever be 100% on every single site, so when you come across articles Goose did not properly extract, please log an issue and I'll get it looked at.
- Scala (min 2.8)
- ImageMagick (for image extraction)
- ON OSX: sudo port install ImageMagick
- ON UBUNTU: sudo apt-get install imagemagick
Here is what it would look like to use goose from Java
String url = "http://www.cnn.com/2010/POLITICS/08/13/democrats.social.security/index.html";
Goose goose = new Goose(new Configuration());
Article article = goose.extractContent(url);
System.out.println(article.cleanedArticleText());
And from Scala
// url and rawHTML are Strings you supply: the page address and its already-fetched HTML
val goose = new Goose(new Configuration)
val article = goose.extractContent(url, rawHTML)
println(article.cleanedArticleText)
You'll receive back an Article object that has features from the article. To see all the features currently extracted you can look at: https://github.com/jiminoc/goose/blob/master/src/main/scala/com/gravity/goose/Article.scala
- Continue to add unit tests for Gold List Sites
- Show code examples of overriding the Configuration object
- Explain the Configuration Object
- Add an online app that given a URL will show you the extracted text and image (DONE)
- Add additional output formatters
- Add ability for users to define custom ids and classes for known sites to help with extraction (by domain?)
- Be able to follow multiple pages of articles
Goose goes through 3 phases during execution
- Document cleaning
- Content / Image extraction
- Document cleanup
When you pass a URL to Goose, the first thing we do is clean up the document to make it easier to parse. We go through and remove comments and common social network sharing divs, convert em and other tags to text nodes, try to convert divs used as text nodes to paragraphs, and do general doc cleanup.
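To make the cleaning pass above more concrete, here is a minimal sketch in Java. It is not Goose's code: the class name `DocumentCleanerSketch`, the regular expressions, and the `share`/`social` class-name markers are all assumptions made for illustration. It also works on raw strings with regexes, which only handles simple, non-nested markup; a real cleaner would operate on a parsed DOM.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Goose: a string-level sketch of a cleaning pass.
final class DocumentCleanerSketch {
    // HTML comments, including ones that span several lines.
    private static final Pattern COMMENTS =
        Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Assumed markers for sharing widgets. Matches only non-nested divs.
    private static final Pattern SHARE_DIVS = Pattern.compile(
        "<div[^>]*class=\"[^\"]*(share|social)[^\"]*\"[^>]*>.*?</div>",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Opening and closing <em> tags; their inner text is kept.
    private static final Pattern EM_TAGS =
        Pattern.compile("</?em>", Pattern.CASE_INSENSITIVE);

    static String clean(String html) {
        String out = COMMENTS.matcher(html).replaceAll("");
        out = SHARE_DIVS.matcher(out).replaceAll("");
        out = EM_TAGS.matcher(out).replaceAll("");
        return out;
    }

    public static void main(String[] args) {
        String html = "<p>Some <em>important</em> text</p><!-- ad slot -->"
            + "<div class=\"social-share\">Tweet this</div>";
        System.out.println(clean(html));
    }
}
```

Running the steps in this order means a sharing div hidden inside a comment is already gone before the div pattern runs.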
Content / Images Extraction
When dealing with random article links you're bound to come across the craziest of HTML files. Some sites even like to include 2 HTML files per site. We use a scoring system based on clustering of English stop words, plus other factors that you can find in the code. Scoring is also descending: the further down the page a node sits, the lower its score becomes. The goal is to find the strongest grouping of text nodes inside a parent container and assume that's your content, as long as it's high enough up on the page.
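As a rough illustration of stop-word scoring with a descending position weight, here is a small Java sketch. It is not Goose's scorer. The stop-word list, the linear weight that falls from 1.0 to 0.5, and the class name `StopWordScoreSketch` are assumptions. It also scores a flat list of paragraphs, whereas the description above groups text nodes under a parent container.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not Goose's actual scorer.
final class StopWordScoreSketch {
    // Tiny illustrative list; a real stop-word list is much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "the", "of", "to", "in", "is", "it",
        "that", "was", "for", "on", "with"));

    // Counts how many words of the text are stop words (case-insensitive).
    static int countStopWords(String text) {
        int count = 0;
        for (String word : text.toLowerCase().split("[^a-z']+")) {
            if (STOP_WORDS.contains(word)) {
                count++;
            }
        }
        return count;
    }

    // Assumed weighting: 1.0 for the first paragraph, falling linearly to 0.5 for the last.
    static double score(String text, int index, int total) {
        double positionWeight = 1.0 - 0.5 * index / Math.max(1, total - 1);
        return countStopWords(text) * positionWeight;
    }

    // Returns the index of the highest-scoring paragraph, or -1 if none contains a stop word.
    static int bestParagraph(List<String> paragraphs) {
        int best = -1;
        double bestScore = 0.0;
        for (int i = 0; i < paragraphs.size(); i++) {
            double s = score(paragraphs.get(i), i, paragraphs.size());
            if (s > bestScore) {
                best = i;
                bestScore = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> paragraphs = Arrays.asList(
            "Home | News | Sports | Login",
            "The senator said that the bill was a step in the right direction for the country.",
            "Copyright 2011 and all of the rights to it");
        System.out.println("Best paragraph index: " + bestParagraph(paragraphs));
    }
}
```

Navigation bars and link lists tend to contain few stop words, so they score low even when they sit at the top of the page, while the position weight penalizes footer text that happens to read like prose.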
Image extraction is the part that took the longest. Finding the most important image on a page proved to be challenging, and it required downloading all the images and manually inspecting them using ImageMagick. Java's image functions were just too unreliable and inaccurate, while ImageMagick is well documented, well tested, fast, and accurate. The image search starts at the top node where we found the content, then runs recursively outwards, looking for good images that aren't ads, banners, or author logos.
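One way to picture the "not an ad, banner, or logo" check is a filter on image dimensions. The Java sketch below is purely illustrative and is not Goose's heuristic: the `Candidate` type, the 100-pixel minimum side, and the 3:1 aspect-ratio limit are assumptions, and the width and height are assumed to have been measured beforehand (for example with ImageMagick).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: filters image candidates by size and shape.
final class ImageFilterSketch {
    static final class Candidate {
        final String src;
        final int width;
        final int height;

        Candidate(String src, int width, int height) {
            this.src = src;
            this.width = width;
            this.height = height;
        }
    }

    // Assumed thresholds, chosen only for this example.
    static final int MIN_SIDE_PX = 100;
    static final double MAX_ASPECT_RATIO = 3.0;

    // Rejects tiny images (icons, logos) and very elongated ones (banners).
    static boolean looksLikeContent(Candidate c) {
        if (c.width < MIN_SIDE_PX || c.height < MIN_SIDE_PX) {
            return false;
        }
        double ratio = (double) Math.max(c.width, c.height) / Math.min(c.width, c.height);
        return ratio <= MAX_ASPECT_RATIO;
    }

    // Returns the largest-area candidate that passes the filter, or null if none does.
    static Candidate pickBest(List<Candidate> candidates) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (!looksLikeContent(c)) {
                continue;
            }
            if (best == null
                    || (long) c.width * c.height > (long) best.width * best.height) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
            new Candidate("banner.png", 728, 90),
            new Candidate("author.jpg", 48, 48),
            new Candidate("lead-photo.jpg", 600, 400));
        Candidate best = pickBest(candidates);
        System.out.println(best == null ? "no image" : best.src);
    }
}
```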
Once we have the top node where we think the content is, we format the content of that node for our application. For example, for NLP-type applications my output formatter will just suck out all the text and ignore everything else; other extractors will be built to offer a more Flipboard-like experience.
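The idea of swappable output formatters can be sketched as a small interface with two implementations. This is a hypothetical shape, not Goose's OutputFormatter API: the `Formatter` interface, its `format` method, and both implementations are assumptions, and the tag stripping is regex-based for brevity.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of pluggable formatters; names do not mirror Goose's classes.
final class OutputFormatterSketch {
    // Turns the HTML of the chosen top node into the output an application wants.
    interface Formatter {
        String format(String topNodeHtml);
    }

    private static final Pattern CLOSING_P = Pattern.compile("</p>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");
    private static final Pattern SPACES = Pattern.compile("[ \\t]+");

    // NLP-style output: paragraphs separated by blank lines, all markup dropped.
    static final Formatter PLAIN_TEXT = html -> {
        String withBreaks = CLOSING_P.matcher(html).replaceAll("\n\n");
        String noTags = TAGS.matcher(withBreaks).replaceAll("");
        return SPACES.matcher(noTags).replaceAll(" ").trim();
    };

    // Inline-style output: keeps the markup so images and links stay in place.
    static final Formatter INLINE_HTML = html -> html.trim();

    public static void main(String[] args) {
        String topNode = "<p>Hello <b>world</b></p><p>Second  one</p>";
        System.out.println(PLAIN_TEXT.format(topNode));
        System.out.println(INLINE_HTML.format(topNode));
    }
}
```

Keeping the formatter behind an interface means the extraction step does not need to know whether its caller wants raw text or inline content.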