Skip to content

Latest commit

 

History

History
99 lines (63 loc) · 3.58 KB

README.rdoc

File metadata and controls

99 lines (63 loc) · 3.58 KB

MetaInspector

MetaInspector is a gem for web scraping purposes. You give it an URL, and it lets you easily get its title, links, and meta tags.

Installation

Install the gem from RubyGems:

gem install metainspector

Usage

Initialize a scraper instance for an URL, like this:

page = MetaInspector::Scraper.new('http://pagerankalert.com')

or, for short, a convenience alias is also available:

page = MetaInspector.new('http://pagerankalert.com')

If you don’t include the scheme on the URL, http:// will be used by defaul:

page = MetaInspector.new('pagerankalert.com')

Then you can see the scraped data like this:

page.url                # URL of the page
page.title              # title of the page, as string
page.links              # array of strings, with every link found on the page
page.meta_description   # meta description, as string
page.meta_keywords      # meta keywords, as string
page.image              # Most relevant image, if defined with og:image

MetaInspector uses dynamic methods for meta_tag discovery, so all these will work, and will be converted to a search of a meta tag by the corresponding name, and return its content attribute

page.meta_description       # <meta name="description" content="..." />
page.meta_keywords          # <meta name="keywords" content="..." />
page.meta_robots            # <meta name="robots" content="..." />
page.meta_generator         # <meta name="generator" content="..." />

It will also work for the meta tags of the form <meta http-equiv=“name” … />, like the following:

page.meta_content_language  # <meta http-equiv="content-language" content="..." />
page.meta_Content_Type      # <meta http-equiv="Content-Type" content="..." />

Please notice that MetaInspector is case sensitive, so page.meta_Content_Type is not the same as page.meta_content_type

The full scraped document if accessible from:

page.document # Nokogiri doc that you can use it to get any element from the page

Examples

You can find some sample scripts on the samples folder, including a basic scraping and a spider that will follow external links using a queue. What follows is an example of use from irb:

$ irb
>> require 'metainspector'
=> true

>> page = MetaInspector.new('http://pagerankalert.com')
=> #<MetaInspector:0x11330c0 @url="http://pagerankalert.com">

>> page.title
=> "PageRankAlert.com :: Track your PageRank changes"

>> page.meta_description
=> "Track your PageRank(TM) changes and receive alerts by email"

>> page.meta_keywords
=> "pagerank, seo, optimization, google"

>> page.links.size
=> 8

>> page.links[5]
=> "http://pagerankalert.posterous.com"

>> page.document.class
=> String

>> page.parsed_document.class
=> Nokogiri::HTML::Document

ZOMG Fork! Thank you!

You’re welcome to fork this project and send pull requests. I want to thank Ryan Romanchuk for his help github.com/rromanchuk

To Do

Copyright © 2009-2011 Jaime Iniesta, released under the MIT license