This application helps you keep an up-to-date list of proxy servers. It contains everything you need to run periodic checks that verify the properties of each proxy, and it can also periodically collect new proxy servers from the Internet and remove broken or slow ones.
django-proxylist-for-grab can be easily installed using pip:
$ pip install django-proxylist-for-grab
After that you need to add the proxylist application to the INSTALLED_APPS list in your django settings file.
INSTALLED_APPS = (
...
'proxylist',
...
)
Add the django-proxylist-for-grab URLs to your urls.py:
urlpatterns = patterns(
...
url(r'', include('proxylist.urls')),
...
)
django-proxylist-for-grab has a list of variables that you can configure through django's settings file. You can see the entire list at Advanced Configuration.
You have two choices here:
We recommend using south for your database migrations. If you already use it, you can migrate django-proxylist-for-grab:
$ python manage.py migrate proxylist
If you don't want to use south, you can run a plain syncdb:
$ python manage.py syncdb
First, add a mirror. For the mirror to work, you need to install the app on a server with an external IP address, so that the correctness of the data can be verified through the proxy server. After adding a mirror, you can add and test your proxies.
django-proxylist-for-grab is configured for non-async checking by default.
You can change this behavior. Insert into your django settings
PROXY_LIST_USE_CELERY
and set it to True.
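In settings.py this could look like the following sketch; the broker settings shown are typical djcelery/rabbitmq defaults and may differ in your setup:

```python
# settings.py
import djcelery

djcelery.setup_loader()

# Run proxy checks asynchronously through celery
PROXY_LIST_USE_CELERY = True

# Broker URL for a local rabbitmq instance (adjust credentials as needed)
BROKER_URL = 'amqp://guest:guest@localhost:5672//'
```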
After that you need to install and configure django-celery and rabbitmq.
Package installation
$ sudo pip install django-celery
$ sudo port install rabbitmq-server
Add the 'djcelery' application to 'INSTALLED_APPS' in settings
INSTALLED_APPS = (
...
'djcelery',
...
)
Sync database
$ ./manage.py syncdb
Run rabbitmq and celery
$ sudo rabbitmq-server -detached
$ nohup python manage.py celery worker >& /dev/null &
Add new proxies from a file.
$ python manage.py update_proxies <file1> [file2 ...]
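The input file format is not documented here; a common convention (assumed below, not confirmed by the app) is one proxy per line as host:port. A minimal sketch for parsing such a file:

```python
def parse_proxy_lines(lines):
    """Parse 'host:port' lines, skipping blanks and '#' comments.

    The one-proxy-per-line 'host:port' format is an assumption, not the
    documented update_proxies input format -- check your proxy source.
    """
    proxies = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        host, _, port = line.rpartition(':')
        if host and port.isdigit():
            proxies.append((host, int(port)))
    return proxies
```

For example, `parse_proxy_lines(['127.0.0.1:8080', '# comment'])` yields `[('127.0.0.1', 8080)]`; malformed lines are silently dropped.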
Check proxies availability and anonymity.
$ python manage.py check_proxies
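Conceptually, an anonymity check compares what the mirror sees with your real IP and the forwarding headers the proxy adds. The classifier below is a simplified illustration of that idea, not proxylist's actual check logic:

```python
def classify_anonymity(real_ip, headers):
    """Rough anonymity levels based on headers seen by the mirror.

    `headers` maps lowercase header names to values. Illustrative
    sketch only -- not the app's real implementation.
    """
    seen = ' '.join(headers.values())
    if real_ip in seen:
        return 'transparent'  # proxy leaks your real IP
    if 'via' in headers or 'x-forwarded-for' in headers:
        return 'anonymous'    # proxy admits being a proxy
    return 'elite'            # no trace of proxying
```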
Search for new proxies on the Internet
$ python manage.py grab_proxies
Remove broken proxies
$ python manage.py clean_proxies
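To automate the periodic checks mentioned above, these commands can be scheduled with cron; the project path and intervals below are placeholder examples, not values prescribed by the app:

```shell
# crontab entries (m h dom mon dow command) -- adjust path and timing
0 */6 * * *  cd /path/to/project && python manage.py grab_proxies
30 * * * *   cd /path/to/project && python manage.py check_proxies
0 4 * * *    cd /path/to/project && python manage.py clean_proxies
```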
from proxylist import grabber

grab = grabber.Grab()

# Get your IP (you can do this a few times to see how the proxy changes)
grab.go('http://ifconfig.me/ip')
if grab.response.code == 200:
    print grab.response.body.strip()

# Count the script tags on the ya.ru page
grab.go('http://www.ya.ru/')
if grab.response.code == 200:
    print grab.doc.select('//script').number()
# filename: apps/app/management/commands/spider.py
# usage: python manage.py spider
from django.core.management.base import BaseCommand
from grab.spider.base import Task
from proxylist.grabber import Spider


class SimpleSpider(Spider):
    initial_urls = ['http://www.lib.ru/']

    def task_initial(self, grab, task):
        grab.set_input('Search', 'linux')
        grab.submit(make_request=False)
        yield Task('search', grab=grab)

    def task_search(self, grab, task):
        if grab.doc.select('//b/a/font/b').exists():
            for elem in grab.doc.select('//b/a/font/b/text()'):
                print elem.text()


class Command(BaseCommand):
    help = 'Simple Spider'

    def handle(self, *args, **options):
        bot = SimpleSpider()
        bot.run()
        print bot.render_stats()