MetaInspector

MetaInspector is a gem for web scraping purposes. You give it a URL, and it returns metadata scraped from it.

Dependencies

MetaInspector uses the nokogiri gem to parse HTML. You can install it from GitHub.

Run the following if you haven't already:

gem sources -a http://gems.github.com

Then install the gem:

sudo gem install tenderlove-nokogiri

If you're on Ubuntu, you might need to install these packages before installing nokogiri:

sudo aptitude install libxslt-dev libxml2 libxml2-dev

Installation

Run the following if you haven't already:

gem sources -a http://gems.github.com

Then install the gem:

sudo gem install jaimeiniesta-metainspector

Usage

Initialize a MetaInspector instance with a URL like this:

page = MetaInspector.new('http://pagerankalert.com')

Then you can tell it to fetch and scrape the URL:

page.scrape!

Once scraped, you can see the returned data like this:

page.address       # URL of the page
page.title         # title of the page, as string
page.description   # meta description, as string
page.keywords      # meta keywords, as string
page.links         # array of strings, with every link found on the page

You can check whether the scraping process went OK by looking at what page.scrape! returns (true or false), or by calling the page.scraped? method, which returns false if no successful scraping has finished since the last address change. You can also change the address of the page to be scraped using the address= setter, like this:

page.address="http://jaimeiniesta.com"

Doing so resets the MetaInspector instance to its initial state (not scraped yet, with the stored metadata cleared). You can then scrape the new address by calling the page.scrape! method again.
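
Putting that together, the whole flow looks like this (a minimal sketch; the URLs are just the placeholders used above):

page = MetaInspector.new('http://pagerankalert.com')

if page.scrape!
  puts page.title           # scraping succeeded
else
  puts 'scraping failed'    # scrape! returned false
end

page.address = 'http://jaimeiniesta.com'  # resets the instance
page.scraped?                             # => false until scraped again
page.scrape!                              # fetch and scrape the new address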

The full fetched document and the scraped doc are accessible from:

page.full_doc    # points to the temp file where the fetched doc is stored
page.scraped_doc # Nokogiri doc that you can use to get any element from the page
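
Since the scraped doc is a Nokogiri document, you can also run your own queries against it. A minimal sketch (assuming the page actually contains h1 elements):

# Extract every h1 heading from the already-scraped page
page.scraped_doc.css('h1').each do |heading|
  puts heading.text
end

# Or grab a single node; at() returns the first match, or nil
title_node = page.scraped_doc.at('head/title')
puts title_node.text if title_node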

Examples

You can find some sample scripts in the samples folder, including a basic scraper and a spider that follows external links using a queue. What follows is an example of use from irb:

$ irb
>> require 'metainspector'
=> true

>> page = MetaInspector.new('http://pagerankalert.com')
=> #<MetaInspector:0x5fc594 @full_doc=nil, @scraped=false, @description=nil, @links=nil,
   @address="http://pagerankalert.com", @keywords=nil, @scraped_doc=nil, @title=nil>

>> page.scrape!
=> true

>> page.title
=> "PageRankAlert.com :: Track your pagerank changes"

>> page.description
=> "Track your PageRank(TM) changes and receive alert by email"

>> page.keywords
=> "pagerank, seo, optimization, google"

>> page.links.size
=> 31

>> page.links[30]
=> "http://www.nuvio.cz/"

>> page.full_doc
=> #<File:/var/folders/X8/X8TBsDiWGYuMKzrB3bhWTU+++TI/-Tmp-/open-uri.6656.0>

>> page.scraped_doc.class
=> Nokogiri::HTML::Document

>> page.scraped?
=> true

>> page.address="http://jaimeiniesta.com"
=> "http://jaimeiniesta.com"

>> page.scraped?
=> false

>> page.scrape!
=> true

>> page.scraped?
=> true

>> page.title
=> "ruby on rails freelance developer &#8212; Jaime Iniesta"

To Do

  • Get page.base_dir from the address

  • Distinguish between external and internal links: page.links would return all of them as found, while page.external_links and page.internal_links would be converted to absolute URLs

  • Return array of images in page as absolute URLs

  • Return contents of meta robots tag

  • Be able to set a timeout in seconds

  • Recover from Timeout exception

  • Recover from Errno::ECONNREFUSED (for these three items, a workaround sketch follows this list)

  • If keywords seem to be separated by blank spaces, replace them with commas

  • Mocks

  • Check content type, process only HTML pages. For example:

    • Don't try to scrape ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2

    • Don't try to scrape isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/
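
Until timeouts and error recovery are built in, callers can guard a scrape with Ruby's standard library. A minimal sketch (the 5-second limit is an arbitrary choice):

require 'timeout'

begin
  Timeout.timeout(5) { page.scrape! }  # give up after 5 seconds
rescue Timeout::Error
  puts 'scraping took too long'
rescue Errno::ECONNREFUSED
  puts 'connection refused by the server'
end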

Copyright © 2009 Jaime Iniesta, released under the MIT license
