A basic scraper written in Python with BeautifulSoup and Tor support. It can:
- Scrape hidden services (crawling functionality is not included)
- Save the output as HTML (some pre-processing is needed before it can be used as input to a machine learning algorithm)
Before you run the scraper, make sure the following steps are done:
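- Run the Tor service:
    sudo service tor start
  or, for macOS users:
    brew services start tor
- Set a password for Tor:
    tor --hash-password "my_password"
  The command prints a hash of the password. Tor accepts the password on its control port only once that hash is set as HashedControlPassword in torrc. Do not forget to include the password itself in your Python scripts.
- Modify the password value in scrapper.py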
- Go to /etc/tor/torrc and uncomment the line ControlPort 9051. Open the file with sudo so that you can save it:
    sudo nano /etc/tor/torrc
- Install Privoxy:
    sudo apt-get install privoxy
  or, for macOS users:
    brew install privoxy
- Change your Privoxy config to get access to the Tor network. Go to the Privoxy config directory:
    cd /etc/privoxy
- Open your config file (prefix the command with sudo if the file is not writable):
    nano config
- Uncomment the following line (the trailing dot is part of the line and means "no HTTP parent proxy"):
    forward-socks5 / localhost:9050 .
- Restart Privoxy to load the changes:
    sudo /etc/init.d/privoxy restart
  or, for macOS users:
    brew services restart privoxy
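The setup above gives a script two things to talk to: Tor's control port (9051, protected by the password) and Privoxy as an HTTP proxy in front of Tor. The following is a minimal sketch of how a script might use both, written with the standard library only. It is not the code in scrapper.py. The function names are hypothetical, and the Privoxy address assumes its default listen-address of 127.0.0.1:8118, which your config may override.

```python
import socket
import urllib.request

TOR_CONTROL = ("127.0.0.1", 9051)      # ControlPort uncommented in torrc
PRIVOXY_URL = "http://127.0.0.1:8118"  # assumption: Privoxy's default listen-address


def quote(value):
    """Quote a string for the Tor control protocol (escape backslash and double quote)."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def control_commands(password):
    """Commands that authenticate, ask Tor for clean circuits, and end the session."""
    return ["AUTHENTICATE " + quote(password), "SIGNAL NEWNYM", "QUIT"]


def request_new_identity(password, address=TOR_CONTROL, timeout=10.0):
    """Send the commands to the control port and return Tor's reply lines."""
    replies = []
    with socket.create_connection(address, timeout=timeout) as sock:
        reader = sock.makefile("r", encoding="utf-8", newline="\r\n")
        for command in control_commands(password):
            sock.sendall((command + "\r\n").encode("utf-8"))
            reply = reader.readline().strip()
            replies.append(reply)
            if not reply.startswith("250"):  # any code other than 250 means failure
                break
    return replies


def privoxy_opener(proxy_url=PRIVOXY_URL):
    """Build a urllib opener that sends HTTP and HTTPS requests through Privoxy."""
    proxy = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(proxy)
```

With Tor and Privoxy running, `request_new_identity("my_password")` should return replies that start with 250 when the password matches the hash in torrc, and `privoxy_opener().open(url, timeout=60)` fetches a page through the proxy chain.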
- Create your virtualenv and install the requirements by running the following commands:
    pip install virtualenv
    virtualenv yourenv
    source yourenv/bin/activate
    pip install -r requirements.txt
- Copy all the onion and normal links you want to scrape into onions.txt. You can find onion hidden services by, for example, subscribing to the Hunchly newsletter.
    [nano]/[vim]/[gedit]/[your choice of editor] onions.txt
- Run the scraper using python3:
    python3 scrapper.py
  Choose option number 2 if you want to run the scraping engine.
- Check the scraped outputs in the output folder
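Since the saved HTML needs pre-processing before it is useful for machine learning, here is one possible first step: reducing each page to its visible text. This is a sketch only. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra dependencies, and it assumes the scraped pages are stored as .html files in a folder named output. The names TextExtractor, html_to_text, and folder_to_texts are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse every run of whitespace to one space
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML document as a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def folder_to_texts(folder="output"):
    """Map each .html file name in the folder to its extracted text."""
    # Assumption: files are UTF-8; undecodable bytes are replaced, not fatal
    return {
        path.name: html_to_text(path.read_text(encoding="utf-8", errors="replace"))
        for path in sorted(Path(folder).glob("*.html"))
    }
```

Calling `folder_to_texts()` from the project directory would give a dictionary of file name to plain text, which can then be tokenized or vectorized as the downstream model requires.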
MIT LICENSE