Scrapod

A Capybara driver for a remote headless scraping cluster (aka remote Capybara Webkit).

Introduction

There are many browser automation tools, mostly built on top of PhantomJS. In my opinion, Capybara is still the best. Unfortunately, most Capybara drivers are not well suited to web scraping, for the following reasons:

They run on the same server as your worker, which can be inefficient
They do not take care of terminating the headless browser process, which can cause memory leaks
They start the headless browser process on demand, which can be slow
They can consume all available RAM and freeze the server

This happens because Capybara is intended primarily for testing, not for web scraping. Its authors do not want to support such use cases, so you, as the developer of the final product, have to solve these problems yourself. This leads to primitive, makeshift solutions that work well until you have to run more than a few dozen tasks per hour.

Scrapod tries to solve all or most of the problems listed above.

Architecture

Scrapod consists of two parts: a client and a server.

Client

The client is a driver for Capybara. It connects to the server when you create a session, sends Capybara API calls over the connection, and converts the responses to Ruby data structures. This is what you use in your application. This document describes the client completely.

Server

The server is a process that can run on the same machine as the client or on a different one. Server configuration can be complex, but it is not difficult; it is described in the server repository. For testing purposes it is enough to install the gem and run scrapod-server --debug. It will start listening on local port 20885.
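Before pointing the client at a server, it can help to confirm that something is actually listening on the port. The following is a minimal sketch using only Ruby's standard library; the helper name scrapod_server_reachable? is hypothetical and not part of Scrapod, and the check only tests that a TCP connection can be opened, not that the listener is really a Scrapod server.

```ruby
require 'socket'

# Hypothetical helper (not part of Scrapod): returns true if a TCP
# connection to host:port can be opened within `timeout` seconds.
# The defaults match the local port mentioned above.
def scrapod_server_reachable?(host = '127.0.0.1', port = 20885, timeout: 2)
  # The block form closes the socket automatically and returns the block value.
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  # Connection refused, timed out, host not resolvable, and similar failures.
  false
end

puts scrapod_server_reachable? ? 'server is listening' : 'server is not reachable'
```

A check like this could run at application boot to fail fast when the server is down, rather than failing on the first scraping task.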

Installation

Add the gem to your Gemfile (from the git source, because I do not push new experimental gems to RubyGems):

gem 'scrapod', git: 'https://github.com/krowpu/scrapod.git'

This registers a Capybara driver named :scrapod which connects to local port 20885. To connect to a remote host, register the driver yourself. Assuming you use Sidekiq with Ruby on Rails, create the file config/initializers/scrapod.rb with the following content:

Capybara.register_driver :scrapod do |app|
  Scrapod::Driver.new app, Scrapod::Configuration::DEFAULT.merge(
    host: ENV['SCRAPOD_HOST']       || '127.0.0.1',
    port: ENV['SCRAPOD_PORT']&.to_i || 20885,
  )
end
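The fallback logic in the initializer can be tried in isolation. The sketch below mirrors it in a standalone method that takes a plain hash in place of ENV; the method name scrapod_connection_options and the sample host name are hypothetical and used only for illustration.

```ruby
# Hypothetical helper (not part of Scrapod): mirrors the host/port
# fallback from the initializer, with a hash standing in for ENV.
def scrapod_connection_options(env = ENV)
  {
    host: env['SCRAPOD_HOST'] || '127.0.0.1',
    # &. skips to_i when the variable is unset, so || can supply the default.
    port: env['SCRAPOD_PORT']&.to_i || 20885
  }
end

# Nothing set: falls back to the local defaults.
p scrapod_connection_options({})

# Both variables set: the port string is converted to an integer.
p scrapod_connection_options(
  'SCRAPOD_HOST' => 'scraper.internal',
  'SCRAPOD_PORT' => '3000'
)
```

One caveat with this pattern: if SCRAPOD_PORT is set to an empty string, to_i returns 0, and since 0 is truthy in Ruby the default is not applied.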

Usage

Just create a Capybara session with the :scrapod driver and use it as usual:

session = Capybara::Session.new :scrapod
session.visit 'https://google.com'
session.title #=> "Google"

session.fill_in 'q', with: 'Capybara'
session.all('input')[1].trigger 'click'
session.title #=> "Capybara - Google Search"
