harvem

Harvests email addresses from websites. Another email scraper.

The scrapy_splash package must be installed even if you are not going to use Splash.
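Installation is not spelled out in this README; assuming a standard pip-based setup, something like:

pip install scrapy scrapy-splash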

Example command lines follow; in each pair, the first uses a plain Scrapy request and the second uses Scrapy-Splash.

To collect email addresses from a specific website:

.../harvem$ scrapy crawl site -a url=http://www.autoszovgyongyos.hu/ -o o.csv
.../harvem$ scrapy crawl sitesplash -a url=http://www.autoszovgyongyos.hu/ -o o.csv
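As a rough sketch of how such a spider can work (class and field names and the exact email pattern are illustrative, not harvem's actual code): scan each page for address-shaped strings, then follow in-page links.

import re

import scrapy

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

class SiteSpider(scrapy.Spider):
    name = "site"

    def start_requests(self):
        # Scrapy passes "-a url=..." in as the spider attribute self.url.
        yield scrapy.Request(self.url)

    def parse(self, response):
        # Yield every distinct address-shaped string found in the page source.
        for email in sorted(set(EMAIL_RE.findall(response.text))):
            yield {"url": response.url, "email": email}
        # Follow links to keep harvesting; a real spider would also restrict
        # the crawl to the start domain (e.g. via allowed_domains).
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)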

To collect email addresses from multiple sites listed in a CSV file:

.../harvem$ scrapy crawl csv -a file=teszt/tesztin.csv -o o2.csv
.../harvem$ scrapy crawl csvsplash -a file=teszt/tesztin.csv -o o2.csv
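A minimal sketch of the CSV variant, assuming the input file has a url column (the real header names may differ); email extraction would then proceed as in the site-spider sketch above.

import csv

import scrapy

class CsvSpider(scrapy.Spider):
    name = "csv"

    def start_requests(self):
        # Scrapy passes "-a file=..." in as the spider attribute self.file.
        with open(self.file, newline="") as f:
            for row in csv.DictReader(f):
                yield scrapy.Request(row["url"], callback=self.parse)

    def parse(self, response):
        # Extract emails exactly as in the site-spider sketch above.
        ...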

To collect email addresses from sites in a plain list file:

.../harvem$ scrapy crawl list -a file=teszt/tesztin.lll -o o3.csv
.../harvem$ scrapy crawl listsplash -a file=teszt/tesztin.lll -o o3.csv

With a proxy:

.../harvem$ scrapy crawl site -s PROXY=http://localhost:8118 -a url=http://www.autoszovgyongyos.hu/ -o o.csv
.../harvem$ scrapy crawl sitesplash -s PROXY=http://localhost:8118 -a url=http://www.autoszovgyongyos.hu/ -o o.csv
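PROXY arrives as a Scrapy setting via -s. One standard way to honor it, and a plausible reading of what harvem does, is to attach it to each request's meta, which Scrapy's built-in HttpProxyMiddleware picks up:

def start_requests(self):
    request = scrapy.Request(self.url)
    proxy = self.settings.get("PROXY")
    if proxy:
        # HttpProxyMiddleware routes any request carrying meta["proxy"].
        request.meta["proxy"] = proxy
    yield request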

Set the LEA setting to control which pages are crawled. The following example crawls all pages:

.../harvem$ scrapy crawl site -s LEA= -s DOWNLOAD_DELAY=2 -a url=https://golyankerekpar.hu -o golyan.csv
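The README does not say what LEA stands for; one plausible reading is a regex fed to a LinkExtractor's allow filter, where an empty value allows every link. A speculative sketch under that assumption:

from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    # Speculative: "-s LEA=<regex>" limits which links are followed; an empty
    # LEA becomes an allow-everything filter, so the whole site is crawled.
    allow = self.settings.get("LEA") or ()
    for link in LinkExtractor(allow=allow).extract_links(response):
        yield response.follow(link.url, callback=self.parse)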

You can specify the identifier field of the input CSV like this:

.../harvem$ scrapy crawl csv -s ID=placeid -a file=../mailwalk/teszt/pizza.tatabanya2.csv -o tatab.csv
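A sketch of how the ID setting could plug into the CSV sketch above: the named column (placeid in the command above) is read from each input row and carried through to the output rows. Column names other than the ID are assumptions:

def start_requests(self):
    id_field = self.settings.get("ID")  # e.g. "placeid" in the command above
    with open(self.file, newline="") as f:
        for row in csv.DictReader(f):
            yield scrapy.Request(row["url"], callback=self.parse,
                                 cb_kwargs={"row_id": row.get(id_field)})

def parse(self, response, row_id):
    # Emit the identifier alongside whatever is harvested from the page.
    yield {"id": row_id, "url": response.url}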

If you are using Dockerized Splash and the proxy runs on localhost, check the end of the settings.py file. More info: https://stackoverflow.com/questions/48546124/what-is-linux-equivalent-of-host-docker-internal/61001152
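The gist of that note and the linked answer, illustrated (addresses are environment-specific, not harvem's literal settings):

# Inside the Splash container, "localhost" is the container itself, not the
# machine the proxy listens on, so the proxy must be addressed by a host IP.
SPLASH_URL = "http://localhost:8050"
# On Linux the docker0 bridge is typically 172.17.0.1, or use
# host.docker.internal when running the container with
# --add-host=host.docker.internal:host-gateway (Docker 20.10+).
# PROXY = "http://172.17.0.1:8118"   # instead of http://localhost:8118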

Tested with Scrapy 2.0.1