Reorganized file structure

commit b499a6cee0ef936a2d71f79052534221df9435a8 (1 parent: e454325)
Authored by @jaimeiniesta
5 .gitignore
@@ -0,0 +1,5 @@
+*.gem
+.bundle
+.rvmrc
+Gemfile.lock
+pkg/*
4 Gemfile
@@ -0,0 +1,4 @@
+source "http://rubygems.org"
+
+# Specify your gem's dependencies in MetaInspector.gemspec
+gemspec
47 README.rdoc
@@ -1,52 +1,28 @@
= MetaInspector
-MetaInspector is a gem for web scraping purposes. You give it an URL, and it returns you metadata from it.
-
-= Dependencies
-
-MetaInspector uses the nokogiri gem to parse HTML. You can install it from github.
-
-Run the following if you haven't already:
-
- gem sources -a http://gems.github.com
-
-Then install the gem:
-
- sudo gem install tenderlove-nokogiri
-
-If you're on Ubuntu, you might need to install these packages before installing nokogiri:
-
- sudo aptitude install libxslt-dev libxml2 libxml2-dev
-
-Please note that you should use libxml2 version 2.7.4 or later, as there is a bug in earlier versions:
-
-* http://groups.google.com/group/nokogiri-talk/browse_thread/thread/3274c25e394fde68
-
-It also uses the charguess ruby gem, you can install it with:
-
- sudo gem install charguess
+MetaInspector is a gem for web scraping purposes. You give it a URL, and it lets you easily get its title, links, and meta tags.
= Installation
-Run the following if you haven't already:
+Install the gem from RubyGems:
- gem sources -a http://gems.github.com
+ gem install metainspector
-Then install the gem:
+= Usage
- sudo gem install jaimeiniesta-metainspector
+Initialize a scraper instance for a URL, like this:
-= Usage
+ page = MetaInspector::Scraper.new('http://pagerankalert.com')
-Initialize a MetaInspector instance with an URL like this:
+or, for short, a convenience alias is also available:
page = MetaInspector.new('http://pagerankalert.com')
-Once scraped, you can see the scraped data like this:
+Then you can see the scraped data like this:
- page.address # URL of the page
- page.title # title of the page, as string
- page.links # array of strings, with every link found on the page
+ page.address # URL of the page
+ page.title # title of the page, as string
+ page.links # array of strings, with every link found on the page
page.meta_description # meta description, as string
page.meta_keywords # meta keywords, as string
@@ -103,7 +79,6 @@ You can find some sample scripts on the samples folder, including a basic scrapi
* Distinguish between external and internal links, returning page.links for all of them as found, page.external_links and page.internal_links converted to absolute URLs
* Return array of images in page as absolute URLs
* Be able to set a timeout in seconds
-* Detect charset
* If keywords seem to be separated by blank spaces, replace them with commas
* Mocks
* Check content type, process only HTML pages, don't try to scrape TAR files like http://ftp.ruby-lang.org/pub/ruby/ruby-1.9.1-p129.tar.bz2 or video files like http://isabel.dit.upm.es/component/option,com_docman/task,doc_download/gid,831/Itemid,74/
2  Rakefile
@@ -0,0 +1,2 @@
+require 'bundler'
+Bundler::GemHelper.install_tasks
12 lib/meta_inspector.rb
@@ -0,0 +1,12 @@
+# -*- encoding: utf-8 -*-
+
+require_relative 'meta_inspector/scraper'
+
+module MetaInspector
+ extend self
+
+ # Sugar method to be able to create a scraper in a shorter way
+ def new(url)
+ Scraper.new(url)
+ end
+end
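The module-level `new` above is just sugar: `extend self` promotes the instance method to a module method, so `MetaInspector.new(url)` delegates to `MetaInspector::Scraper.new(url)`. A minimal self-contained sketch of the same pattern (the `DemoInspector`/`Scraper` names here are hypothetical stand-ins, not the gem itself):

```ruby
# Sketch of the "module-level new delegates to an inner class" pattern.
# DemoInspector is a hypothetical stand-in for the MetaInspector module.
module DemoInspector
  extend self # make instance methods callable as DemoInspector.xxx

  class Scraper
    attr_reader :address

    def initialize(address)
      @address = address
    end
  end

  # Sugar method: DemoInspector.new(url) builds a DemoInspector::Scraper
  def new(url)
    Scraper.new(url)
  end
end

page = DemoInspector.new('http://pagerankalert.com')
puts page.class   # DemoInspector::Scraper
puts page.address # http://pagerankalert.com
```

This works because plain modules (unlike classes) do not define `new` themselves, so the delegating method never shadows anything.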
81 lib/meta_inspector/scraper.rb
@@ -0,0 +1,81 @@
+# -*- encoding: utf-8 -*-
+
+require 'open-uri'
+require 'rubygems'
+require 'nokogiri'
+require 'charguess'
+require 'iconv'
+
+# MetaInspector provides an easy way to scrape web pages and get their elements
+module MetaInspector
+ class Scraper
+ attr_reader :address
+
+ # Initializes a new instance of MetaInspector, setting the URL address to the one given
+    # TODO: validate address as http URL, don't initialize it if wrong format
+ def initialize(address)
+ @address = address
+
+ @document = @title = @description = @keywords = @links = nil
+ end
+
+ # Returns the parsed document title, from the content of the <title> tag.
+    # This is not the same as the meta_title tag
+ def title
+ @title ||= parsed_document.css('title').inner_html rescue nil
+ end
+
+ # Returns the parsed document links
+ def links
+ @links ||= parsed_document.search("//a").map {|link| link.attributes["href"].to_s.strip} rescue nil
+ end
+
+ # Returns the charset
+ # TODO: We should trust the charset expressed on the Content-Type meta tag
+ # and only guess it if none given
+ def charset
+ @charset ||= CharGuess.guess(document).downcase
+ end
+
+ # Returns the whole parsed document
+ def parsed_document
+ @parsed_document ||= Nokogiri::HTML(document)
+
+ rescue
+ warn 'An exception occurred while trying to scrape the page!'
+ end
+
+ # Returns the original, unparsed document
+ def document
+ @document ||= open(@address).read
+
+ rescue SocketError
+ warn 'MetaInspector exception: The url provided does not exist or is temporarily unavailable (socket error)'
+ @scraped = false
+ rescue TimeoutError
+ warn 'Timeout!!!'
+ rescue
+ warn 'An exception occurred while trying to fetch the page!'
+ end
+
+ # Scrapers for all meta_tags in the form of "meta_name" are automatically defined. This has been tested for
+ # meta name: keywords, description, robots, generator
+ # meta http-equiv: content-language, Content-Type
+ #
+ # It will first try with meta name="..." and if nothing found,
+ # with meta http-equiv="...", substituting "_" by "-"
+    # TODO: this should be case-insensitive, so meta_robots gets the results from the HTML for robots, Robots, ROBOTS...
+ # TODO: cache results on instance variables, using ||=
+ # TODO: define respond_to? to return true on the meta_name methods
+ def method_missing(method_name)
+ if method_name.to_s =~ /^meta_(.*)/
+ content = parsed_document.css("meta[@name='#{$1}']").first['content'] rescue nil
+ content = parsed_document.css("meta[@http-equiv='#{$1.gsub("_", "-")}']").first['content'] rescue nil if content.nil?
+
+ content
+ else
+ super
+ end
+ end
+ end
+end
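The `method_missing` dispatch above turns any `meta_*` call into a lookup against the parsed document: first `meta name="..."`, then `meta http-equiv="..."` with underscores swapped for hyphens. A self-contained sketch of the same dispatch pattern, using plain hashes in place of the Nokogiri document (the `MetaDemo` class is hypothetical, for illustration only):

```ruby
# Sketch of the meta_* dynamic accessor pattern from Scraper#method_missing,
# with Hashes standing in for the parsed Nokogiri document.
class MetaDemo
  def initialize(meta_name, meta_http_equiv)
    @meta_name       = meta_name        # e.g. { 'keywords' => '...' }
    @meta_http_equiv = meta_http_equiv  # e.g. { 'content-type' => '...' }
  end

  # meta_keywords     -> name lookup for "keywords"
  # meta_content_type -> http-equiv lookup, "_" substituted by "-"
  def method_missing(method_name, *args)
    if method_name.to_s =~ /^meta_(.*)/
      key = $1
      @meta_name[key] || @meta_http_equiv[key.gsub('_', '-')]
    else
      super
    end
  end

  # Pairs with method_missing so respond_to? answers truthfully
  # (addressing one of the TODOs noted in the commit)
  def respond_to_missing?(method_name, include_private = false)
    method_name.to_s.start_with?('meta_') || super
  end
end

page = MetaDemo.new({ 'keywords' => 'ruby, scraping' },
                    { 'content-type' => 'text/html; charset=utf-8' })
puts page.meta_keywords      # => ruby, scraping
puts page.meta_content_type  # => text/html; charset=utf-8
```

Defining `respond_to_missing?` alongside `method_missing` is the idiomatic way to keep `respond_to?` consistent with the dynamic methods.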
5 lib/meta_inspector/version.rb
@@ -0,0 +1,5 @@
+# -*- encoding: utf-8 -*-
+
+module MetaInspector
+ VERSION = "1.2.0"
+end
96 lib/metainspector.rb
@@ -1,96 +0,0 @@
-require 'open-uri'
-require 'rubygems'
-require 'nokogiri'
-require 'charguess'
-require 'iconv'
-
-# MetaInspector provides an easy way to scrape web pages and get its elements
-class MetaInspector
- VERSION = '1.1.6'
-
- attr_reader :address
-
- # Initializes a new instance of MetaInspector, setting the URL address to the one given
- # TODO: validate address as http URL, dont initialize it if wrong format
- def initialize(address)
- @address = address
-
- @document = @title = @description = @keywords = @links = nil
- end
-
- # Returns the parsed document title, from the content of the <title> tag.
- # This is not the same as the meta_tite tag
- def title
- @title ||= parsed_document.css('title').inner_html rescue nil
- end
-
- # Returns the parsed document links
- def links
- @links ||= parsed_document.search("//a").map {|link| link.attributes["href"].to_s.strip} rescue nil
- end
-
- # Returns the charset
- # TODO: We should trust the charset expressed on the Content-Type meta tag
- # and only guess it if none given
- def charset
- @charset ||= CharGuess.guess(document).downcase
- end
-
- # Returns the whole parsed document
- def parsed_document
- @parsed_document ||= Nokogiri::HTML(document)
-
- rescue
- warn 'An exception occurred while trying to scrape the page!'
- end
-
- # Returns the original, unparsed document
- def document
- @document ||= open(@address).read
-
- rescue SocketError
- warn 'MetaInspector exception: The url provided does not exist or is temporarily unavailable (socket error)'
- @scraped = false
- rescue TimeoutError
- warn 'Timeout!!!'
- rescue
- warn 'An exception occurred while trying to fetch the page!'
- end
-
- # Scrapers for all meta_tags in the form of "meta_name" are automatically defined. This has been tested for
- # meta name: keywords, description, robots, generator
- # meta http-equiv: content-language, Content-Type
- #
- # It will first try with meta name="..." and if nothing found,
- # with meta http-equiv="...", substituting "_" by "-"
- # TODO: this should be case unsensitive, so meta_robots gets the results from the HTML for robots, Robots, ROBOTS...
- # TODO: cache results on instance variables, using ||=
- # TODO: define respond_to? to return true on the meta_name methods
- def method_missing(method_name)
- if method_name.to_s =~ /^meta_(.*)/
- content = parsed_document.css("meta[@name='#{$1}']").first['content'] rescue nil
- content = parsed_document.css("meta[@http-equiv='#{$1.gsub("_", "-")}']").first['content'] rescue nil if content.nil?
-
- content
- else
- super
- end
- end
-
- #########################################################################################################
- # DEPRECATED METHODS
- # These methods are deprecated and will disappear soonish.
-
- # DEPRECATED: Returns the parsed document meta description
- def description
- warn "DEPRECATION WARNING: description method is deprecated since 1.1.6 and will be removed on 1.2.0, use meta_description instead"
- @description ||= meta_description rescue nil
- end
-
- # DEPRECATED: Returns the parsed document meta keywords
- def keywords
- warn "DEPRECATION WARNING: keywords method is deprecated since 1.1.6 and will be removed on 1.2.0, use meta_keywords instead"
- @keywords ||= meta_keywords rescue nil
- end
-
-end
26 meta_inspector.gemspec
@@ -0,0 +1,26 @@
+# -*- encoding: utf-8 -*-
+$:.push File.expand_path("../lib", __FILE__)
+require "meta_inspector/version"
+
+Gem::Specification.new do |s|
+ s.name = "metainspector"
+ s.version = MetaInspector::VERSION
+ s.platform = Gem::Platform::RUBY
+ s.authors = ["Jaime Iniesta"]
+ s.email = ["jaimeiniesta@gmail.com"]
+ s.homepage = "https://rubygems.org/gems/metainspector"
+ s.summary = %q{MetaInspector is a ruby gem for web scraping purposes, that returns a hash with metadata from a given URL}
+ s.description = %q{MetaInspector lets you scrape a web page and get its title, charset, link and meta tags}
+
+ s.rubyforge_project = "MetaInspector"
+
+ s.files = `git ls-files`.split("\n")
+ s.test_files = `git ls-files -- {test,spec,features}/*`.split("\n")
+ s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
+ s.require_paths = ["lib"]
+
+ s.add_dependency 'nokogiri', '1.4.4'
+ s.add_dependency 'charguess', '1.3.20110226181011'
+
+ s.add_development_dependency 'rspec', '2.5.0'
+end
24 metainspector.gemspec
@@ -1,24 +0,0 @@
-Gem::Specification.new do |s|
- s.name = "metainspector"
- s.version = "1.1.6"
- s.date = "2009-09-20"
- s.summary = "Ruby gem for web scraping"
- s.email = "jaimeiniesta@gmail.com"
- s.homepage = "http://github.com/jaimeiniesta/metainspector"
- s.description = "MetaInspector is a ruby gem for web scraping purposes, that returns a hash with metadata from a given URL"
- s.has_rdoc = false
- s.authors = ["Jaime Iniesta"]
- s.files = [
- "README.rdoc",
- "CHANGELOG.rdoc",
- "MIT-LICENSE",
- "metainspector.gemspec",
- "lib/metainspector.rb",
- "samples/basic_scraping.rb",
- "samples/spider.rb"]
- s.test_files = ["spec/metainspector_spec.rb", "spec/spec_helper.rb"]
- s.rdoc_options = []
- s.extra_rdoc_files = []
- s.add_dependency("nokogiri", ["> 1.3.3"])
- s.add_dependency("chardet", [">= 0.9"])
-end
2  samples/basic_scraping.rb
@@ -1,6 +1,6 @@
# Some basic MetaInspector samples
-require '../lib/metainspector.rb'
+require_relative '../lib/meta_inspector.rb'
puts "Enter a valid http address to scrape it"
address = gets.strip
2  samples/spider.rb
@@ -1,5 +1,5 @@
# A basic spider that will follow links on an infinite loop
-require '../lib/metainspector.rb'
+require_relative '../lib/meta_inspector.rb'
q = Queue.new
visited_links=[]
16 spec/metainspector_spec.rb
@@ -1,3 +1,5 @@
+# -*- encoding: utf-8 -*-
+
require File.join(File.dirname(__FILE__), "/spec_helper")
describe MetaInspector do
@@ -72,18 +74,4 @@
@m.charset.should == "utf-8"
end
end
-
- context 'Deprecated methods still work' do
- before(:each) do
- @m = MetaInspector.new('http://pagerankalert.com')
- end
-
- it "should get the description as the meta_description" do
- @m.description.should == @m.meta_description
- end
-
- it "should get the keywords as the meta_keywords" do
- @m.keywords.should == @m.meta_keywords
- end
- end
end
4 spec/spec_helper.rb
@@ -1,2 +1,4 @@
+# -*- encoding: utf-8 -*-
+
$: << File.join(File.dirname(__FILE__), "/../lib")
-require 'metainspector'
+require 'meta_inspector'