
merged

1 parent cae2c62 commit 63b47894732f677be8b087350a3a309179328934 @mattetti committed
Showing with 168 additions and 3 deletions.
  1. +4 −0 Gemfile
  2. +28 −0 Gemfile.lock
  3. +40 −3 README.md
  4. +9 −0 runner.rb
  5. +69 −0 scrapers/pluzz_francetv_fr.rb
  6. +18 −0 scrapers/utils.rb
4 Gemfile
@@ -0,0 +1,4 @@
+# A sample Gemfile
+source "https://rubygems.org"
+
+gem "mechanize"
28 Gemfile.lock
@@ -0,0 +1,28 @@
+GEM
+  remote: https://rubygems.org/
+  specs:
+    domain_name (0.5.6)
+      unf (~> 0.0.3)
+    mechanize (2.5.1)
+      domain_name (~> 0.5, >= 0.5.1)
+      mime-types (~> 1.17, >= 1.17.2)
+      net-http-digest_auth (~> 1.1, >= 1.1.1)
+      net-http-persistent (~> 2.5, >= 2.5.2)
+      nokogiri (~> 1.4)
+      ntlm-http (~> 0.1, >= 0.1.1)
+      webrobots (~> 0.0, >= 0.0.9)
+    mime-types (1.19)
+    net-http-digest_auth (1.2.1)
+    net-http-persistent (2.8)
+    nokogiri (1.5.6)
+    ntlm-http (0.1.1)
+    unf (0.0.5)
+      unf_ext
+    unf_ext (0.0.5)
+    webrobots (0.0.13)
+
+PLATFORMS
+  ruby
+
+DEPENDENCIES
+  mechanize
43 README.md
@@ -1,4 +1,41 @@
-scrapbook
-=========
-Experiment designed to reflect on the organization of web scraping/data processing.
+# Scrapbook
+Experiment designed to reflect on the organization of the three important
+elements of web scraping/data processing:
+
+* scheduling
+* scraping
+* processing
+
+Monitoring can optionally be added to each step.
+
+## Scheduling
+
+Scheduling covers the configuration and tracking of when scraping and
+processing should take place.
+
+## Scraping
+
+Scraping is organized by target: each "scraper" responds to
+`#run`, which returns an array of objects that each respond to
+`#to_json`. Scrapers shouldn't halt execution; in other words, they
+shouldn't raise exceptions but should instead handle errors internally
+and expose the issues via data structures.
+The term "scraping" isn't quite right in the context of this experiment,
+since a scraper could, for instance, simply be an API client consuming
+a web API.
+
+## Processing
+
+Once the data is scraped, the fetched information can be sent to one or
+many processors until it reaches its final form. If the data is meant to
+be persisted, the persistence layer should be implemented as a processor.
+Examples of processors are data extractors, event triggers, persistence
+layers, etc.
+
+### Design
+
+Each unit should be autonomous, easy to test and chainable. Raised
+exceptions should really be exceptional and always mean that human
+intervention is needed. Ideally, each unit should also be designed to
+run concurrently.
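
The contract described in this README is easy to sketch. The following is a minimal illustration, not part of this commit; `WeatherScraper`, `Result`, and `JsonFilePersistence` are hypothetical names:

```ruby
require 'json'

# A scraper only has to respond to #run and return objects that
# respond to #to_json; errors are recorded, never raised.
module WeatherScraper
  Result = Struct.new(:city, :temperature, :failures) do
    def to_json(*args)
      { city: city, temperature: temperature }.to_json(*args)
    end

    def failed?
      !failures.empty?
    end
  end

  def self.run
    # A real scraper would fetch remote data here and capture any
    # errors in the result instead of raising.
    [Result.new("Paris", 21, []), Result.new("Lyon", nil, ["no data"])]
  end
end

# A processor takes scraped objects and returns them for the next
# processor in the chain; persistence is just another processor.
class JsonFilePersistence
  def initialize(path)
    @path = path
  end

  def call(records)
    File.open(@path, "a") { |f| records.each { |r| f.puts(r.to_json) } }
    records
  end
end

records = WeatherScraper.run.reject(&:failed?)
JsonFilePersistence.new("weather.json").call(records)
```

Because persistence sits at the end of the chain as just another processor, each unit stays autonomous, testable, and chainable, as the Design section above prescribes.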
9 runner.rb
@@ -0,0 +1,9 @@
+require 'bundler'
+Bundler.require
+
+# Require all the scrapers
+Dir.glob("./scrapers/*.rb") { |file| require file }
+
+# TODO: use a scheduler and send to processors
+episodes = FranceTVJeunesse.run
+puts episodes.map(&:to_json)
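
One possible shape for the scheduler mentioned in the TODO: a minimal sketch that assumes a `PROCESSORS` list and a fixed one-hour interval, neither of which exists in this commit:

```ruby
# A naive fixed-interval scheduler; SCRAPERS, PROCESSORS, and the
# one-hour interval are assumptions, not part of this commit.
require 'bundler'
Bundler.require

Dir.glob("./scrapers/*.rb") { |file| require file }

SCRAPERS   = [FranceTVJeunesse]
PROCESSORS = [->(episodes) { puts episodes.map(&:to_json); episodes }]

loop do
  SCRAPERS.each do |scraper|
    results = scraper.run
    # Each processor receives the previous processor's output,
    # matching the chaining described in the README.
    PROCESSORS.each { |processor| results = processor.call(results) }
  end
  sleep 60 * 60
end
```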
69 scrapers/pluzz_francetv_fr.rb
@@ -0,0 +1,69 @@
+require_relative 'utils'
+require 'json'
+
+module FranceTVJeunesse
+
+  def self.run
+    agent = Mechanize.new
+    url = "http://pluzz.francetv.fr/ajax/launchsearch/rubrique/jeunesse/datedebut/#{Time.now.strftime("%Y-%m-%dT00:00")}/datefin/#{Time.now.strftime("%Y-%m-%dT23:59")}/type/lesplusrecents/nb/100/"
+    page = agent.get(url)
+    episodes = fetch_episodes(page)
+    puts "success" unless episodes.find { |e| e.failed? }
+    episodes
+  end
+
+  def self.fetch_episodes(page)
+    elements = page.search("article.rs-cell")
+    episodes = elements.map do |e|
+      episode = Episode.new
+      episode.url       = episode.fetch(e, "h3 > a", ->(el) { el.first.attributes["href"].value })
+      episode.show_name = episode.fetch(e, "h3 > a", ->(el) { el.first.text.strip })
+      episode.show_ref  = episode.fetch(e, "span[data-prog]", ->(el) { el.first.attributes['data-prog'].value })
+      episode.title     = episode.fetch(e, "div.rs-cell-details", ->(el) { el.first.search("a.ss-titre").text.strip })
+      episode.notes     = episode.fetch(e, "div.rs-cell-details", ->(el) { el.first.search("a.rs-ep-ss").text.strip })
+      episode
+    end
+    episodes
+  end
+
+  class Episode
+    include Scrapbook::Utils::Fetcher
+
+    ATTRIBUTES = [:show_name, :show_ref,
+                  :title, :url, :image_url, :broadcast_date, :notes]
+
+    attr_accessor(*ATTRIBUTES)
+    attr_reader :failures
+
+    def initialize(opts=nil)
+      @failures = []
+      if opts.respond_to?(:keys) && opts.respond_to?(:each)
+        opts.each do |k, v|
+          self.send("#{k}=", v)
+        end
+      end
+      self
+    end
+
+    def to_s
+      "show: #{show_name} - show ref: #{show_ref} - title: #{title} - url: #{url} - notes: #{notes} - failures: #{self.failures.join("\n")}"
+    end
+
+    def to_json
+      hash = {}
+      ATTRIBUTES.each do |att|
+        hash[att] = self.send(att)
+      end
+      hash.to_json
+    end
+
+    def failed?
+      if self.url.nil? || self.failures.size > 3
+        true
+      else
+        false
+      end
+    end
+
+  end
+end
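
Consuming this scraper might look like the following; the JSON in the comment is illustrative, not captured from a real run:

```ruby
# Hypothetical usage; requires the repo's Gemfile dependencies.
require 'bundler'
Bundler.require
require_relative 'scrapers/pluzz_francetv_fr'

episodes = FranceTVJeunesse.run
episodes.reject(&:failed?).each do |episode|
  puts episode.to_json  # e.g. {"show_name":"...","title":"...","url":"..."}
end
episodes.select(&:failed?).each do |episode|
  warn episode.failures.join("\n")  # inspect the accumulated fetch errors
end
```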
18 scrapers/utils.rb
@@ -0,0 +1,18 @@
+module Scrapbook
+  module Utils
+
+    module Fetcher
+      def fetch(element, selector, modifier)
+        begin
+          modifier.call element.search(selector)
+        rescue Exception => e
+          failure = "#{e.inspect} while fetching '#{selector}' on \n#{element}\n-backtrace: #{caller[0..4].join("\n")}\n---\n"
+          @failures ||= []
+          @failures << failure
+          nil
+        end
+      end
+    end
+
+  end
+end
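
A minimal usage sketch for the `Fetcher` mixin, assuming a hypothetical `Page` host class: `fetch` returns the modifier's result on success and `nil` on failure, recording the error in `@failures`:

```ruby
require 'nokogiri'
require_relative 'scrapers/utils'

# Hypothetical host class; it only needs a @failures array and a
# reader, exactly as Episode provides in this commit.
class Page
  include Scrapbook::Utils::Fetcher
  attr_reader :failures

  def initialize
    @failures = []
  end
end

doc  = Nokogiri::HTML("<h1>Hello</h1>")
page = Page.new

page.fetch(doc, "h1", ->(el) { el.first.text })  # => "Hello"
page.fetch(doc, "h2", ->(el) { el.first.text })  # => nil; NoMethodError is recorded
puts page.failures.size                          # => 1
```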
