Commit

Lazy configure adapter for parsing HTML
* Configure HTML parser adapter only if it wasn't set
* Add Travis build matrix for different adapters
* Remove Evil-Proxy patch (PR already merged)
nbulaj committed Dec 4, 2017
1 parent 85cb6bf commit 49ed0a5
Showing 13 changed files with 99 additions and 74 deletions.
14 changes: 10 additions & 4 deletions .travis.yml
@@ -2,19 +2,25 @@ language: ruby
before_install: gem install bundler
bundler_args: --without yard guard benchmarks
script: "rake spec"
gemfile:
  - gemfiles/oga.gemfile
  - gemfiles/nokogiri.gemfile
env:
  global:
    - JRUBY_OPTS="$JRUBY_OPTS --debug"
  matrix:
    - ADAPTER=oga
    - ADAPTER=nokogiri
rvm:
  - 2.0
  - 2.1
  - 2.2.4
  - 2.3.3
  - 2.4.2
  - ruby-head
  - jruby-9.1.6.0
matrix:
  allow_failures:
    - rvm: ruby-head
  include:

  exclude:
    - rvm: 2.0
      gemfile: gemfiles/nokogiri.gemfile
      env: ADAPTER=nokogiri # Nokogiri doesn't support Ruby 2.0
2 changes: 1 addition & 1 deletion Gemfile
@@ -7,5 +7,5 @@ gem 'oga', '~> 2.0'

group :test do
  gem 'coveralls', require: false
  gem 'evil-proxy'
  gem 'evil-proxy', '~> 0.2'
end
55 changes: 45 additions & 10 deletions README.md
@@ -5,11 +5,12 @@
[![Code Climate](https://codeclimate.com/github/nbulaj/proxy_fetcher/badges/gpa.svg)](https://codeclimate.com/github/nbulaj/proxy_fetcher)
[![License](http://img.shields.io/badge/license-MIT-brightgreen.svg)](#license)

This gem can help your Ruby application to make HTTP(S) requests from proxy by fetching and validating actual
This gem can help your Ruby application to make HTTP(S) requests using a proxy by fetching and validating actual
proxy lists from multiple providers.

It gives you a `Manager` class that can load proxy lists, validate them and return random or specific proxies. Take a look
at the documentation below to find all the gem features.
It gives you a special `Manager` class that can load proxy lists, validate them and return random or specific proxies.
It also has a `Client` class that encapsulates all the logic for sending HTTP requests using proxies.
Take a look at the documentation below to find all the gem features.

Also this gem can be used with any other programming language (Go / Python / etc) as standalone solution for downloading and
validating proxy lists from the different providers. [Checkout examples](#standalone) of usage below.
@@ -33,7 +34,7 @@ validating proxy lists from the different providers. [Checkout examples](#standa
If using bundler, first add 'proxy_fetcher' to your Gemfile:

```ruby
gem 'proxy_fetcher', '~> 0.5'
gem 'proxy_fetcher', '~> 0.6'
```

or if you want to use the latest version (from `master` branch), then:
@@ -234,7 +235,25 @@ Btw, if you need support of JavaScript or some other features, you need to imple

## Configuration

To change open/read timeout for `cleanup!` and `connectable?` methods you need to change `ProxyFetcher.config`:
ProxyFetcher is a very flexible gem. You can configure the most important parts of the library and use your own solutions.

Default configuration looks as follows:

```ruby
ProxyFetcher.configure do |config|
  config.user_agent = ProxyFetcher::Configuration::DEFAULT_USER_AGENT
  config.pool_size = 10
  config.timeout = 3
  config.http_client = ProxyFetcher::HTTPClient
  config.proxy_validator = ProxyFetcher::ProxyValidator
  config.providers = ProxyFetcher::Configuration.registered_providers
  config.adapter = ProxyFetcher::Configuration::DEFAULT_ADAPTER # :nokogiri by default
end
```

You can change any of the options above. Let's look at this deeper.

To change the open/read timeout for the `cleanup!` and `connectable?` methods you need to change the `timeout` option:

```ruby
ProxyFetcher.configure do |config|
@@ -245,18 +264,19 @@ manager = ProxyFetcher::Manager.new
manager.cleanup!
```

Also you can set your custom User-Agent:
Also you can set your custom User-Agent string:

```ruby
ProxyFetcher.configure do |config|
  config.user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
end
```

ProxyFetcher uses simple Ruby solution for dealing with HTTP(S) requests - `net/http` library from the stdlib. If you wanna add, for example, your custom provider that
was developed as a Single Page Application (SPA) with some JavaScript, then you will need something like [selenium-webdriver](https://github.com/SeleniumHQ/selenium/tree/master/rb)
to properly load the content of the website. For those and other cases you can write your own class for fetching HTML content by the URL and setup it
in the ProxyFetcher config:
ProxyFetcher uses the standard Ruby solution for dealing with HTTP(S) requests: the `net/http` library from the Ruby standard library.
If you want to add, for example, a custom provider that was developed as a Single Page Application (SPA) with some JavaScript,
then you will need something like [selenium-webdriver](https://github.com/SeleniumHQ/selenium/tree/master/rb) to properly
load the content of the website. For those and other cases you can write your own class for fetching HTML content by
URL and set it up in the ProxyFetcher config:

```ruby
class MyHTTPClient
@@ -300,6 +320,21 @@ manager.validate!
#=> [ ... ]
```

By default, the ProxyFetcher gem uses [Nokogiri](https://github.com/sparklemotion/nokogiri) for parsing HTML. If you want
to use [Oga](https://gitlab.com/yorickpeterse/oga) instead, then you need to add `gem 'oga'` to your Gemfile and configure
ProxyFetcher as follows:

```ruby
ProxyFetcher.config.adapter = :oga
```

You can also write your own HTML parser implementation and use it; take a look at the [abstract class and implementations](lib/proxy_fetcher/document).
Configure it as:

```ruby
ProxyFetcher.config.adapter = MyHTMLParserClass
```
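
As a rough illustration of what such a custom parser class might look like, the sketch below defines a stand-in base class modeled on the abstract adapter touched by this commit, plus a hypothetical `TagTextAdapter`. The names `AbstractAdapterSketch` and `TagTextAdapter`, and the `parse` class method, are assumptions made for this example and not the gem's confirmed API; check `lib/proxy_fetcher/document` for the real interface.

```ruby
# Stand-in base class mirroring the abstract adapter shown in this commit.
# In a real application you would inherit from the gem's own class instead.
class AbstractAdapterSketch
  attr_reader :document

  def initialize(document)
    @document = document
  end

  def css(selector)
    document.css(selector)
  end

  # Loads whatever the adapter needs and reports load failures by adapter name.
  def self.setup!(*args)
    install_requirements!(*args)
  rescue LoadError => error
    raise "can't setup '#{name}': #{error.message}"
  end
end

# Hypothetical adapter with no external dependencies: it treats the
# "selector" as a plain tag name and extracts inner text with a regexp.
class TagTextAdapter < AbstractAdapterSketch
  def self.install_requirements!
    # Nothing to load here; a real adapter would `require` its parser gem.
  end

  # Assumed entry point that wraps raw HTML in an adapter instance.
  def self.parse(html)
    new(html)
  end

  def css(tag)
    escaped = Regexp.escape(tag)
    document.scan(%r{<#{escaped}[^>]*>(.*?)</#{escaped}>}m).flatten
  end
end

TagTextAdapter.setup!
doc = TagTextAdapter.parse('<tr><td>127.0.0.1</td><td>8080</td></tr>')
p doc.css('td') # prints ["127.0.0.1", "8080"]
```

A regexp is used only to keep the sketch dependency-free; a production adapter would delegate to a real HTML parser.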

### Proxy validation speed

There are some tricks to increase proxy list validation performance.
9 changes: 0 additions & 9 deletions gemfiles/nokogiri.gemfile

This file was deleted.

9 changes: 0 additions & 9 deletions gemfiles/oga.gemfile

This file was deleted.

8 changes: 8 additions & 0 deletions lib/proxy_fetcher.rb
@@ -40,5 +40,13 @@ def config
    def configure
      yield config
    end

    private

    def configure_adapter!
      config.adapter = Configuration::DEFAULT_ADAPTER if config.adapter.nil?
    end
  end

  configure_adapter!
end
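
The lazy-default idea in this hunk can be sketched in isolation as follows. `LazyAdapterDemo` and its nested `Configuration` are hypothetical stand-ins, not the gem's classes, and `configure_adapter!` is left public here only so the example can call it from outside the module.

```ruby
# Hypothetical stand-in for the library, used only to illustrate the pattern.
module LazyAdapterDemo
  DEFAULT_ADAPTER = :nokogiri

  class Configuration
    attr_accessor :adapter
  end

  class << self
    def config
      @config ||= Configuration.new
    end

    def configure
      yield config
    end

    # Apply the default only when no adapter has been chosen yet,
    # so an explicit user choice is never overwritten.
    def configure_adapter!
      config.adapter = DEFAULT_ADAPTER if config.adapter.nil?
    end
  end
end

LazyAdapterDemo.configure { |c| c.adapter = :oga }
LazyAdapterDemo.configure_adapter!
puts LazyAdapterDemo.config.adapter # prints "oga"
```

Because the default is assigned only when the value is `nil`, a project that configures Oga never triggers the Nokogiri setup path.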
5 changes: 2 additions & 3 deletions lib/proxy_fetcher/configuration.rb
@@ -6,6 +6,8 @@ class Configuration
    # rubocop:disable Metrics/LineLength
    DEFAULT_USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112 Safari/537.36'.freeze

    DEFAULT_ADAPTER = :nokogiri

    class << self
      def providers_registry
        @registry ||= ProvidersRegistry.new
@@ -33,14 +35,11 @@ def reset!
      @proxy_validator = ProxyValidator

      self.providers = self.class.registered_providers
      self.adapter = :nokogiri
    end

    def adapter=(name_or_class)
      @adapter = ProxyFetcher::Document::Adapters.lookup(name_or_class)
      @adapter.setup!

      @adapter
    end

    def providers=(value)
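
The `adapter=` setter in this hunk delegates to `ProxyFetcher::Document::Adapters.lookup`, whose body is not part of this diff. A minimal sketch of how a symbol-or-class lookup of this kind could work, assuming a `<Name>Adapter` naming convention (the module and class names below are hypothetical):

```ruby
# Hypothetical registry; the real lookup lives in the gem and may differ.
module AdapterLookupSketch
  class NokogiriAdapter; end
  class OgaAdapter; end

  # Accept a Class as-is; otherwise resolve "<Name>Adapter" inside this module.
  def self.lookup(name_or_class)
    return name_or_class if name_or_class.is_a?(Class)

    # `false` restricts the search to constants defined in this module.
    const_get("#{name_or_class.to_s.capitalize}Adapter", false)
  rescue NameError
    raise ArgumentError, "unknown adapter: #{name_or_class.inspect}"
  end
end

p AdapterLookupSketch.lookup(:oga)       # symbol name
p AdapterLookupSketch.lookup('nokogiri') # string name, e.g. read from ENV
p AdapterLookupSketch.lookup(String)     # any class is passed through
```

Accepting strings as well as symbols matters for the spec helper in this commit, which passes `ENV['ADAPTER']` straight into the configuration.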
12 changes: 6 additions & 6 deletions lib/proxy_fetcher/document/adapters/abstract_adapter.rb
@@ -1,20 +1,20 @@
module ProxyFetcher
  class Document
    class AbstractAdapter
      attr_reader :doc
      attr_reader :document

      def initialize(doc)
        @doc = doc
      def initialize(document)
        @document = document
      end

      # You can override this method in your own adapter class
      def xpath(selector)
        doc.xpath(selector)
        document.xpath(selector)
      end

      # You can override this method in your own adapter class
      def css(selector)
        doc.css(selector)
        document.css(selector)
      end

      def proxy_node
@@ -24,7 +24,7 @@ def proxy_node
      def self.setup!(*args)
        install_requirements!(*args)
      rescue LoadError => error
        raise Exceptions::AdapterSetupError, error.message
        raise Exceptions::AdapterSetupError.new(name, error.message)
      end
    end
  end
end
19 changes: 17 additions & 2 deletions lib/proxy_fetcher/exceptions.rb
@@ -40,8 +40,23 @@ def initialize(name)
      end

    class AdapterSetupError < Error
      def initialize(reason)
        super("can't setup adapter during the following error:\n\t#{reason}'")
      def initialize(adapter_name, reason)
        adapter = demodulize(adapter_name.gsub('Adapter', ''))

        super("can't setup '#{adapter}' adapter during the following error:\n\t#{reason}'")
      end

      private

      def demodulize(path)
        path = path.to_s
        index = path.rindex('::')

        if index
          path[(index + 2)..-1]
        else
          path
        end
      end
    end
  end
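
To see what the `demodulize` helper in this hunk produces, here is a standalone copy with a sample input. The class name `ProxyFetcher::Document::NokogiriAdapter` is a hypothetical example chosen for illustration, and the suffix is stripped with plain `String#sub` so the sketch runs without ActiveSupport.

```ruby
# Standalone copy of the helper for illustration; not part of a public API.
def demodulize(path)
  path = path.to_s
  index = path.rindex('::')

  # Keep only what follows the last namespace separator, if there is one.
  index ? path[(index + 2)..-1] : path
end

# Hypothetical fully qualified adapter class name.
full_name = 'ProxyFetcher::Document::NokogiriAdapter'

short_name = demodulize(full_name).sub(/Adapter\z/, '')
puts short_name            # prints "Nokogiri"
puts demodulize('Plain')   # prints "Plain"
```

The short name is what ends up in the error message, which keeps it readable when an adapter's dependency fails to load.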
2 changes: 0 additions & 2 deletions lib/proxy_fetcher/providers/base.rb
@@ -1,5 +1,3 @@
require 'forwardable'

module ProxyFetcher
  module Providers
    class Base
4 changes: 2 additions & 2 deletions lib/proxy_fetcher/version.rb
@@ -7,9 +7,9 @@ module VERSION
    # Major version number
    MAJOR = 0
    # Minor version number
    MINOR = 5
    MINOR = 6
    # Smallest version number
    TINY = 1
    TINY = 0

    # Full version number
    STRING = [MAJOR, MINOR, TINY].compact.join('.')
8 changes: 8 additions & 0 deletions spec/spec_helper.rb
@@ -15,6 +15,14 @@

Dir['./spec/support/**/*.rb'].sort.each { |f| require f }

adapter = ENV['ADAPTER']

if adapter
  ProxyFetcher.configure do |config|
    config.adapter = adapter
  end
end

RSpec.configure do |config|
  config.order = 'random'
end
26 changes: 0 additions & 26 deletions spec/support/evil_proxy_patch.rb

This file was deleted.
