cabret

Various utilities for making bots.

Web scraping

A basic wrapper around requests with a default user agent and support for auto-discovered proxies. It can return raw HTML, a BeautifulSoup object, or parsed JSON.

>>> from cabret import web
>>> web.json('http://echo.jsontest.com/key/value/one/two')
{'one': 'two', 'key': 'value'}
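
As a rough illustration of what such a wrapper involves, the sketch below builds a request with a default user agent and decodes a response body as raw text or JSON. It uses only the standard library instead of requests so that it stays self-contained, and the names `DEFAULT_USER_AGENT`, `build_request` and `decode_body` are hypothetical, not part of cabret.

```python
import json
import urllib.request

# Hypothetical default; cabret's actual user-agent string is not shown here.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; example-bot)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Return a Request carrying a default User-Agent header (nothing is sent)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def decode_body(body, as_json=False, encoding="utf-8"):
    """Decode raw response bytes to text, or parse them as JSON."""
    text = body.decode(encoding)
    return json.loads(text) if as_json else text
```

Passing the result of `build_request` to `urllib.request.urlopen` and reading the response would give the bytes that `decode_body` expects.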

Proxy discovery

We provide simple functions to get proxies from several websites, including https://free-proxy-list.net/, pubproxy and gimmeproxy.

These are great websites, and you are encouraged to get a premium account. If you have an API key for pubproxy or gimmeproxy, you can put it in your configuration file.

>>> from cabret import proxy
>>> proxy.get_https_proxy()
Proxy(ip='191.251.165.63', port=8080, country='BR', anonymity='elite proxy', https=True)
>>> from cabret import web
>>> web.urlopen('https://ifconfig.co/ip', use_proxy=True)
'2a03:4000:21:435::dead:beef\n'

By default, https://free-proxy-list.net/ is used. You can change that in config.ini.
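
To show how a discovered proxy could be handed to requests, here is a minimal sketch. The `Proxy` record only mirrors the field names in the output above (the real class may differ), and `to_requests_proxies` is a hypothetical helper, not a cabret function.

```python
from collections import namedtuple

# Field names mirror the Proxy repr shown above; the real class may differ.
Proxy = namedtuple("Proxy", "ip port country anonymity https")


def to_requests_proxies(proxy):
    """Build the `proxies` mapping that requests accepts from a Proxy record."""
    address = f"http://{proxy.ip}:{proxy.port}"
    return {"http": address, "https": address}


example = Proxy("191.251.165.63", 8080, "BR", "elite proxy", True)
proxies = to_requests_proxies(example)
```

The resulting dictionary can be passed as `requests.get(url, proxies=proxies)`.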

Get validation emails

For the moment, we support only https://mailsac.com/.

>>> from cabret.email import mailsac as ms
>>> ms.full_address('address')
EmailAddress(address@mailsac.com)
>>> msg = ms.message_list('address')[0]
>>> ms.get_message_text(msg)
'cabret is awesome!'
>>> msg['from']
[{'name': '', 'address': 'example@example.com'}]

Mailsac is really great; consider getting a premium account to support them (although you can do almost everything for free)!

You can put your API key in config.ini.

Scrape some websites

Reddit

>>> from cabret import web
>>> from cabret.scrapers import reddit
>>> posts = next(reddit.gen_posts('https://www.reddit.com/r/copypasta'))
>>> len(posts) # posts on the first page
26
>>> post = posts[0]
>>> post
Post(title='Today r/copypasta is joining Operation: #OneMoreVote to save net neutrality', domain='self.copypasta', url='https://www.reddit.com/r/copypasta/comments/80pfhp/today_rcopypasta_is_joining_operation_onemorevote/', comments='https://www.reddit.com/r/copypasta/comments/80pfhp/today_rcopypasta_is_joining_operation_onemorevote/', author='SMUT_ADDICT', subreddit='copypasta')
>>> reddit.get_post_text(post)
'This past December, the FCC voted to kill net neutrality, letting internet providers like Verizon and Comcast impose new fees, throttle bandwidth, and censor online content. If this happens, subreddits like this one might not exist.\nWe can still block the repeal using the Congressional Review Act (CRA), and we’re just one vote away from winning in the Senate and taking the fight to the House. That’s why today we’re joining Operation: #OneMoreVote, an Internet-wide day of action.\nThis affects every redditor as well as every Internet user, and we only have a 60 legislative days left to stop it. Please, take a moment of your time to join the protest by contacting your lawmakers.\n'

The module is also a command-line utility. To scrape the text from the posts on the first 3 pages of /r/copypasta/ with parallel downloading:

python3 -m cabret.scrapers.reddit 3 /r/copypasta/
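
Parallel downloading of this kind can be sketched with a thread pool. This is only one possible approach and not necessarily how cabret does it; `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real download so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch, items, max_workers=8):
    """Apply `fetch` to every item in parallel threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, items))


def fake_fetch(url):
    # Stand-in for a real download so the sketch runs offline.
    return f"text of {url}"


texts = fetch_all(fake_fetch, ["post-1", "post-2", "post-3"])
```

Threads suit this workload because each download spends most of its time waiting on the network.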

Configuration

You can add a configuration file named config.ini, following the instructions in example-config.ini.
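
For readers unfamiliar with INI files, the sketch below reads one with the standard configparser module. The section and key names are purely illustrative assumptions; example-config.ini remains the authoritative reference for the real layout.

```python
import configparser

# Hypothetical layout: section and key names are illustrative only;
# example-config.ini is the authoritative reference.
SAMPLE = """
[proxy]
provider = free-proxy-list

[mailsac]
api_key = YOUR_KEY_HERE
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)  # use config.read("config.ini") for a real file

# `fallback` keeps missing sections or keys from raising an exception.
provider = config.get("proxy", "provider", fallback="free-proxy-list")
api_key = config.get("mailsac", "api_key", fallback=None)
```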

louisabraham/cabret
