A proxy-proxy server.
Python JavaScript HTML CSS
Latest commit ad80894 Aug 23, 2015 @omarish Improve control flow.


waldo

Waldo is a proxy server that routes web traffic through other proxy servers. It's basically a meta-proxy server that tries to best route your traffic so that you do not get blocked.

Motivation

Large-scale web crawling can be difficult if you're crawling a single website. Most sites will block you before long, so you'll have to write some logic to pull a list of available proxy servers, handle connection pooling across those proxies, and keep track of which proxies are still alive and which are no longer responding to your requests.

I found myself constantly rewriting this code in various projects to manage outbound proxying. This process, while necessary, got tedious, so I decided to factor out the proxying logic into a separate proxy server to handle the load balancing.

How it works

Advantages

Concurrency

Waldo is written with Tornado, a Python web framework and asynchronous networking library that scales well. I've been able to handle ~1,000 concurrent connections with Waldo, and I suspect it can handle significantly more than that.

Coordination

With a sufficiently large proxy list, keeping track of proxies becomes difficult. Proxies often die, or need to be put in a "cool off" box so that they don't get burnt out from too much traffic. Waldo handles all of this for you.
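To make the "cool off" idea concrete, here is a minimal sketch of one way such bookkeeping could look. It is not Waldo's actual implementation: the class name `ProxyPool`, the `cool_off_seconds` and `max_failures` parameters, and the in-memory dictionaries are all assumptions made for illustration.

```python
import time


class ProxyPool:
    """Hypothetical in-memory proxy tracker (illustrative, not Waldo's real code)."""

    def __init__(self, proxies, cool_off_seconds=60.0, max_failures=3,
                 clock=time.monotonic):
        self.cool_off_seconds = cool_off_seconds
        self.max_failures = max_failures
        self.clock = clock  # injectable so the pool can be tested without sleeping
        # proxy -> earliest time at which it may be handed out again
        self.available_at = {p: float("-inf") for p in proxies}
        # proxy -> number of consecutive failures seen so far
        self.failures = {p: 0 for p in proxies}

    def acquire(self):
        """Return a proxy that is out of cool-off, or None if all are resting."""
        now = self.clock()
        ready = [p for p, t in self.available_at.items() if t <= now]
        if not ready:
            return None
        # Prefer the proxy that has been rested the longest.
        proxy = min(ready, key=lambda p: self.available_at[p])
        # Put it in the cool-off box until the interval has passed.
        self.available_at[proxy] = now + self.cool_off_seconds
        return proxy

    def report_success(self, proxy):
        """Reset the failure count after a successful request."""
        if proxy in self.failures:
            self.failures[proxy] = 0

    def report_failure(self, proxy):
        """Count a failure and drop the proxy once it looks dead."""
        if proxy not in self.failures:
            return
        self.failures[proxy] += 1
        if self.failures[proxy] >= self.max_failures:
            del self.failures[proxy]
            del self.available_at[proxy]
```

A caller would `acquire()` a proxy before each outbound request and report the outcome afterwards. Dead proxies then drop out of rotation and busy ones get a rest between uses.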

Simplicity

Waldo implements the standard HTTP Proxy spec, so just connect to it like you would any other proxy server, and it'll handle the rest for you.

Diverse Proxies

When crawling a large website, you'll often find yourself stitching together various proxy server lists. Waldo has the concept of a Finder, which is basically a class that pulls in a list of proxy servers for you.
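As a rough illustration of the Finder concept, the sketch below shows what such a class might look like. Waldo's real Finder interface is not documented in this README, so the `Finder` base class, its `find()` method, the `StaticListFinder` subclass, and the `collect_proxies` helper are all hypothetical names chosen for this example.

```python
class Finder:
    """Hypothetical base class: a Finder returns proxies as 'host:port' strings."""

    def find(self):
        raise NotImplementedError


class StaticListFinder(Finder):
    """Reads proxies from a block of text, skipping blank lines and '#' comments."""

    def __init__(self, text):
        self.text = text

    def find(self):
        proxies = []
        for line in self.text.splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                proxies.append(line)
        return proxies


def collect_proxies(finders):
    """Merge the results of several finders, dropping duplicates but keeping order."""
    seen = set()
    merged = []
    for finder in finders:
        for proxy in finder.find():
            if proxy not in seen:
                seen.add(proxy)
                merged.append(proxy)
    return merged
```

Under this assumed design, each proxy source (a text file, a paid API, a scraped page) gets its own small class, and stitching lists together becomes a single call to `collect_proxies`.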

Setup

First, make sure Redis is installed. Then, install the Python dependencies:

$ pip install -r requirements.txt

Run

To run the server:

$ python server.py --port=1234

By default, Waldo listens on port 1234 on all network interfaces.

To make sure it's working, try this:

$ curl -XGET http://omarish.com -x http://localhost:1234
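The same check can be done from Python. The sketch below uses only the standard library and assumes Waldo is running locally on port 1234 as in the curl command above; the helper name `build_proxy_opener` is made up for this example.

```python
import urllib.request


def build_proxy_opener(proxy_url="http://localhost:1234"):
    """Return a urllib opener that sends plain-HTTP requests through the given proxy.

    Example, with Waldo running locally:
        opener = build_proxy_opener()
        with opener.open("http://omarish.com", timeout=10) as response:
            print(response.status)
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)
```

Only the `http` scheme is mapped here, matching the curl example. Whether Waldo also tunnels HTTPS traffic is not stated in this README.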

Stats Monitor

To view the accompanying monitoring page, start the monitoring server:

$ python monitor.py

The monitoring page by default listens on port 1235.

Testing and Benchmarking

I've been using a benchmarking utility in benchmark.py to simulate heavy request loads. Additionally, Apache Bench and Siege have been very helpful.

To run the benchmarking script:

$ python benchmark.py
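For readers who want to roll their own load test, here is a minimal sketch of a concurrent benchmark loop. It does not reproduce the contents of benchmark.py, which are not shown in this README; the function name `run_benchmark`, its parameters, and the shape of the returned summary are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_benchmark(fetch, urls, concurrency=50):
    """Call fetch(url) for every URL from a thread pool and summarize the results.

    `fetch` is any callable that raises an exception on failure, for example
    a function that requests the URL through the proxy.
    """

    def timed(url):
        start = time.perf_counter()
        try:
            fetch(url)
            ok = True
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, urls))

    # Only successful requests contribute to the latency figure.
    latencies = [elapsed for ok, elapsed in results if ok]
    return {
        "requests": len(results),
        "succeeded": len(latencies),
        "failed": len(results) - len(latencies),
        "mean_latency": sum(latencies) / len(latencies) if latencies else None,
    }
```

Passing the fetch function in as an argument keeps the timing logic independent of any particular HTTP client, so the same loop can be pointed at Waldo or at a stub during testing.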