MetaInspector is a gem for web scraping purposes. You give it an URL, and it lets you easily get its title, links, and meta tags.
Install the gem from RubyGems:
gem install metainspector
Initialize a scraper instance for an URL, like this:
page = MetaInspector::Scraper.new('http://pagerankalert.com')
or, for short, a convenience alias is also available:
page = MetaInspector.new('http://pagerankalert.com')
Then you can see the scraped data like this:
page.address # URL of the page page.title # title of the page, as string page.links # array of strings, with every link found on the page page.meta_description # meta description, as string page.meta_keywords # meta keywords, as string
MetaInspector uses dynamic methods for meta_tag discovery, so all these will work, and will be converted to a search of a meta tag by the corresponding name, and return its content attribute
page.meta_description # <meta name="description" content="..." /> page.meta_keywords # <meta name="keywords" content="..." /> page.meta_robots # <meta name="robots" content="..." /> page.meta_generator # <meta name="generator" content="..." />
It will also work for the meta tags of the form <meta http-equiv=“name” … />, like the following:
page.meta_content_language # <meta http-equiv="content-language" content="..." /> page.meta_Content_Type # <meta http-equiv="Content-Type" content="..." />
Please notice that MetaInspector is case sensitive, so page.meta_Content_Type is not the same as page.meta_content_type
The full scraped document if accessible from:
page.document # Nokogiri doc that you can use it to get any element from the page
You can find some sample scripts on the samples folder, including a basic scraping and a spider that will follow external links using a queue. What follows is an example of use from irb:
$ irb >> require 'metainspector' => true >> page = MetaInspector.new('http://pagerankalert.com') => #<MetaInspector:0x11330c0 @document=nil, @links=nil, @address="http://pagerankalert.com", @description=nil, @keywords=nil, @title=nil> >> page.title => "PageRankAlert.com :: Track your PageRank changes" >> page.meta_description => "Track your PageRank(TM) changes and receive alerts by email" >> page.meta_keywords => "pagerank, seo, optimization, google" >> page.links.size => 8 >> page.links[5] => "http://pagerankalert.posterous.com" >> page.document.class => String >> page.parsed_document.class => Nokogiri::HTML::Document
-
Get page.base_dir from the address
-
Distinguish between external and internal links, returning page.links for all of them as found, page.external_links and page.internal_links converted to absolute URLs
-
Return array of images in page as absolute URLs
-
Be able to set a timeout in seconds
-
If keywords seem to be separated by blank spaces, replace them with commas
-
Mocks
-
Check content type, process only HTML pages, don’t try to scrape TAR files like ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2 or video files like isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/
-
Get most important image querying Facebook
Copyright © 2009-2011 Jaime Iniesta, released under the MIT license