Ruby gem for web scraping. It scrapes a given URL and returns its title, meta description, meta keywords, an array with all the links, all the images in it, etc.


MetaInspector

MetaInspector is a gem for web scraping. You give it a URL, and it lets you easily get its title, links, images, charset, description, keywords, meta tags…

See it in action!

You can try MetaInspector live at this little demo: metainspectordemo.herokuapp.com

Installation

Install the gem from RubyGems:

gem install metainspector

This gem is tested on Ruby versions 1.8.7, 1.9.2 and 1.9.3.

Usage

Initialize a scraper instance for a URL, like this:

page = MetaInspector::Scraper.new('http://markupvalidator.com')

or, for short, a convenience alias is also available:

page = MetaInspector.new('http://markupvalidator.com')

If you don't include the scheme in the URL, http:// will be used by default:

page = MetaInspector.new('markupvalidator.com')
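
As a rough illustration of that defaulting rule, here is a minimal sketch using only Ruby's standard library. The helper name with_default_scheme is hypothetical and not part of MetaInspector's API; the gem may normalize URLs differently.

```ruby
require 'uri'

# Hypothetical helper, not part of MetaInspector's API:
# prepend http:// when the given URL has no scheme.
def with_default_scheme(url)
  url =~ %r{\A[a-z][a-z0-9+.-]*://}i ? url : "http://#{url}"
end

normalized = with_default_scheme('markupvalidator.com')
uri = URI.parse(normalized)

puts normalized   # http://markupvalidator.com
puts uri.scheme   # http
puts uri.host     # markupvalidator.com
```

A URL that already carries a scheme, such as https://example.com, is returned unchanged.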

By default, MetaInspector times out after 20 seconds of waiting for a page to respond. You can set a different timeout with a second parameter, like this:

page = MetaInspector.new('markupvalidator.com', :timeout => 5) # times out after waiting 5 seconds
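
To show what a timeout of this kind means in practice, here is a small sketch built on Ruby's standard Timeout module. The helpers slow_fetch and fetch_with_timeout are hypothetical stand-ins that never touch the network; that MetaInspector wraps its request in a similar way is an assumption.

```ruby
require 'timeout'

# Hypothetical stand-in for a slow page fetch; it does not touch the network.
def slow_fetch(seconds)
  sleep(seconds)
  '<html><title>ok</title></html>'
end

# Run the given block, giving up after `limit` seconds.
# Returns the block's result, or nil when the limit is exceeded.
def fetch_with_timeout(limit)
  Timeout.timeout(limit) { yield }
rescue Timeout::Error
  nil
end

fast = fetch_with_timeout(1)    { slow_fetch(0.01) }  # finishes in time
slow = fetch_with_timeout(0.05) { slow_fetch(1) }     # exceeds the limit

puts fast.inspect
puts slow.inspect
```

Returning nil on timeout is just one possible design; recording an error message for later inspection is another.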

MetaInspector will try to parse all URLs by default. If you want to parse only those URLs that have text/html as content-type you can specify it like this:

page = MetaInspector.new('markupvalidator.com', :html_content_only => true)
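
A check of this kind could look roughly like the sketch below. The helper html_content? is hypothetical and only illustrates comparing a Content-Type header value against text/html; it is not MetaInspector's implementation.

```ruby
# Hypothetical helper: true when a Content-Type header value denotes HTML.
# Parameters such as "; charset=utf-8" are ignored.
def html_content?(content_type)
  media_type = content_type.to_s.split(';').first.to_s.strip.downcase
  media_type == 'text/html'
end

puts html_content?('text/html; charset=utf-8')
puts html_content?('image/png')
```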

Then you can see the scraped data like this:

page.url                # URL of the page
page.scheme             # Scheme of the page (http, https)
page.host               # Hostname of the page (like, markupvalidator.com, without the scheme)
page.root_url           # Root url (scheme + host, like http://markupvalidator.com/)
page.title              # title of the page, as string
page.links              # array of strings, with every link found on the page as an absolute URL
page.internal_links     # array of strings, with every internal link found on the page as an absolute URL
page.external_links     # array of strings, with every external link found on the page as an absolute URL
page.meta_description   # meta description, as string
page.description        # returns the meta description, or the first long paragraph if no meta description is found
page.meta_keywords      # meta keywords, as string
page.image              # Most relevant image, if defined with og:image
page.images             # array of strings, with every img found on the page as an absolute URL
page.feed               # Get rss or atom links in meta data fields as array
page.meta_og_title      # opengraph title
page.meta_og_image      # opengraph image
page.charset            # charset of the page, like UTF-8
page.content_type       # content-type returned by the server when the url was requested
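
To make the difference between links, internal_links and external_links concrete, here is a minimal sketch using only the standard uri library. The helpers absolutify and partition_links are hypothetical, and treating "internal" as "same host as the page" is an assumption made for this example.

```ruby
require 'uri'

# Hypothetical helper: resolve raw href values against the page URL.
def absolutify(base_url, hrefs)
  hrefs.map { |href| URI.join(base_url, href).to_s }
end

# Hypothetical helper: split absolute links into [internal, external],
# where internal means "same host as the base URL" (an assumption).
def partition_links(base_url, links)
  host = URI.parse(base_url).host
  links.partition { |link| URI.parse(link).host == host }
end

base  = 'http://markupvalidator.com/'
hrefs = ['/plans-and-pricing', 'faqs', 'http://example.com/about']

links = absolutify(base, hrefs)
internal, external = partition_links(base, links)

puts links.inspect
puts internal.inspect
puts external.inspect
```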

MetaInspector uses dynamic methods for meta tag discovery, so all of the following work. Each call is converted into a search for a meta tag with the corresponding name and returns its content attribute:

page.meta_description       # <meta name="description" content="..." />
page.meta_keywords          # <meta name="keywords" content="..." />
page.meta_robots            # <meta name="robots" content="..." />
page.meta_generator         # <meta name="generator" content="..." />

It also works for meta tags of the form <meta http-equiv="name" … />, like the following:

page.meta_content_language  # <meta http-equiv="content-language" content="..." />
page.meta_Content_Type      # <meta http-equiv="Content-Type" content="..." />

Please note that MetaInspector is case sensitive, so page.meta_Content_Type is not the same as page.meta_content_type.
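
The dynamic lookup described above can be sketched with Ruby's method_missing. MetaTagReader is a hypothetical stand-in that reads from a plain hash instead of a parsed page, and the underscore-to-hyphen fallback is an assumption made to mirror the examples. It is not MetaInspector's actual code.

```ruby
# Hypothetical stand-in class, not MetaInspector's implementation:
# any call to meta_<name> is looked up in a hash of meta tag contents.
class MetaTagReader
  def initialize(tags)
    @tags = tags   # e.g. { "description" => "...", "Content-Type" => "..." }
  end

  def method_missing(name, *args, &block)
    key = meta_key(name)
    return super unless key
    # Try the name as written, then with underscores turned into hyphens.
    @tags[key] || @tags[key.tr('_', '-')]
  end

  def respond_to_missing?(name, include_private = false)
    !meta_key(name).nil? || super
  end

  private

  # "meta_content_language" -> "content_language"; nil for other methods.
  def meta_key(name)
    match = name.to_s.match(/\Ameta_(.+)\z/)
    match && match[1]
  end
end

reader = MetaTagReader.new(
  'description'  => 'Site-wide markup validation tool.',
  'Content-Type' => 'text/html; charset=utf-8'
)

puts reader.meta_description
puts reader.meta_Content_Type
puts reader.meta_content_type.inspect   # nil, because the lookup is case sensitive
```

Methods that do not start with meta_ still raise NoMethodError, since they are passed on to super.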

You can also access most of the scraped data as a hash:

page.to_hash               # { "url"=>"http://markupvalidator.com", "title" => "MarkupValidator :: site-wide markup validation tool", ... }

The full scraped document is accessible from:

page.document # Nokogiri doc that you can use to get any element from the page

Error handling

You can check if the page has been successfully parsed with:

page.parsed?                # Will return true if everything looks OK

In case there have been any errors, you can check them with:

page.errors                 # Will return an array with the error messages
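
The idea of collecting error messages instead of raising exceptions can be sketched as follows. SafeUrl is a hypothetical class that only validates a URL with the standard uri library; it illustrates the parsed? / errors pattern and is not how MetaInspector is implemented.

```ruby
require 'uri'

# Hypothetical stand-in class illustrating the "collect errors instead of
# raising" pattern behind parsed? and errors.
class SafeUrl
  attr_reader :errors, :uri

  def initialize(url)
    @errors = []
    @uri = begin
      URI.parse(url)
    rescue URI::InvalidURIError => e
      # Record the problem and carry on instead of raising.
      @errors << "Invalid URL: #{e.message}"
      nil
    end
  end

  # True when a URI was produced and no errors were recorded.
  def parsed?
    !@uri.nil? && @errors.empty?
  end
end

good = SafeUrl.new('http://markupvalidator.com')
bad  = SafeUrl.new('http://bad host')   # the space makes this URL invalid

[good, bad].each do |result|
  if result.parsed?
    puts "OK: #{result.uri}"
  else
    puts "Failed: #{result.errors.inspect}"
  end
end
```

With the gem itself, the same check would presumably be a call to page.parsed? followed by a look at page.errors when it returns false.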

Examples

You can find some sample scripts in the samples folder, including a basic scraping script and a spider that follows external links using a queue. What follows is an example of use from irb:

$ irb
>> require 'metainspector'
=> true

>> page = MetaInspector.new('http://markupvalidator.com')
=> #<MetaInspector:0x11330c0 @url="http://markupvalidator.com">

>> page.title
=> "MarkupValidator :: site-wide markup validation tool"

>> page.meta_description
=> "Site-wide markup validation tool. Validate the markup of your whole site with just one click."

>> page.meta_keywords
=> "html, markup, validation, validator, tool, w3c, development, standards, free"

>> page.links.size
=> 15

>> page.links[4]
=> "/plans-and-pricing"

>> page.document.class
=> String

>> page.parsed_document.class
=> Nokogiri::HTML::Document
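
The queue-based spider mentioned above can be sketched as a breadth-first crawl. This is not the script from the samples folder: the LINKS hash and the crawl method are hypothetical, and an in-memory link graph stands in for real pages so that the sketch runs without network access.

```ruby
require 'set'

# Hypothetical in-memory link graph standing in for real pages.
LINKS = {
  'http://a.example/' => ['http://b.example/', 'http://c.example/'],
  'http://b.example/' => ['http://c.example/', 'http://a.example/'],
  'http://c.example/' => []
}

# Breadth-first crawl: visit each URL once, in the order it was queued.
def crawl(start_url, link_source)
  queue   = [start_url]
  visited = Set.new
  order   = []

  until queue.empty?
    url = queue.shift
    next if visited.include?(url)
    visited << url
    order << url
    # Queue every link of the current page that has not been visited yet.
    link_source.fetch(url, []).each do |link|
      queue.push(link) unless visited.include?(link)
    end
  end

  order
end

puts crawl('http://a.example/', LINKS).inspect
```

In a real spider, the hash lookup would be replaced by something like MetaInspector.new(url).external_links, and the visited set keeps the crawl from looping between pages that link to each other.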

ZOMG Fork! Thank you!

You're welcome to fork this project and send pull requests. Just remember to include specs.

Thanks to all the contributors:

github.com/jaimeiniesta/metainspector/graphs/contributors

To Do

  • Get page.base_dir from the URL

  • If keywords seem to be separated by blank spaces, replace them with commas

  • Autodiscover all available meta tags

Copyright © 2009-2012 Jaime Iniesta, released under the MIT license
