Downloads images from books.toscrape.com using the Images pipeline. Each downloaded image is named with a serial number that corresponds to its position in the catalog: 0001.jpg, 0002.jpg, and so on. You can change where the download starts, and thereby how many images are downloaded, by editing the start_urls variable. See: toscrape/spiders/img.py
To run, enter:
$ scrapy crawl img
The download location is the downloads/files folder, as configured in settings.py.
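A minimal sketch of what such a pipeline can look like, assuming the spider attaches an integer serial number to each item; the class and field names are illustrative, not necessarily the code in toscrape/spiders/img.py:

```python
# Illustrative Images pipeline that names files by serial number
# (0001.jpg, 0002.jpg, ...); assumes each item carries a "serial" field.
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class SerialImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Pass the item's serial number along with each image request.
        for url in item.get("image_urls", []):
            yield scrapy.Request(url, meta={"serial": item["serial"]})

    def file_path(self, request, response=None, info=None, *, item=None):
        # e.g. serial 1 -> "0001.jpg"
        return f"{request.meta['serial']:04d}.jpg"
```

In settings.py the pipeline would be enabled via ITEM_PIPELINES, and IMAGES_STORE = 'downloads/files' selects the download folder mentioned above.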
The middleware controls the crawl and pauses it whenever the "CONTROL_XPATH" condition evaluates to false. Once you have done what is needed in the browser - in this case, following a link - the crawl continues after you press Enter. It uses Selenium with Firefox.
$ scrapy crawl control -o books.csv
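A sketch of how such a downloader middleware could be written; CONTROL_XPATH comes from the Scrapy settings, while the class name and pause logic are illustrative:

```python
# Illustrative middleware: fetch pages with Selenium/Firefox and pause
# for operator input whenever the CONTROL_XPATH condition fails.
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By


class ControlMiddleware:
    def __init__(self, control_xpath):
        self.control_xpath = control_xpath
        self.driver = webdriver.Firefox()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get("CONTROL_XPATH"))

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Condition false -> hand control to the human operator.
        if not self.driver.find_elements(By.XPATH, self.control_xpath):
            input("CONTROL_XPATH failed - act in the browser, then press Enter...")
        # Feed whatever the browser now shows back into Scrapy.
        return HtmlResponse(
            self.driver.current_url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )
```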
Crawls and scrapes quotes from quotes.toscrape.com/login using a programmed login.
$ scrapy crawl login
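A minimal sketch of the programmed login, using FormRequest.from_response so the hidden form fields (including the CSRF token) are carried over automatically; the demo site accepts any username/password:

```python
import scrapy


class LoginSpider(scrapy.Spider):
    name = "login"
    start_urls = ["http://quotes.toscrape.com/login"]

    def parse(self, response):
        # Submit the login form; hidden fields are filled in automatically.
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "user", "password": "pass"},
            callback=self.after_login,
        )

    def after_login(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```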
Scrapes a page that loads its content with infinite scroll: http://quotes.toscrape.com/scroll . Instead of simulating scrolling, the underlying API has to be found and called directly. In this case it returns JSON, so extraction is straightforward.
$ scrapy crawl scroll -o quotes.json
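A sketch of the idea, assuming the scroll page fetches http://quotes.toscrape.com/api/quotes?page=N in the background and that the JSON payload carries quotes, page and has_next fields:

```python
import scrapy


class ScrollSpider(scrapy.Spider):
    name = "scroll"
    api = "http://quotes.toscrape.com/api/quotes?page={}"

    def start_requests(self):
        yield scrapy.Request(self.api.format(1))

    def parse(self, response):
        data = response.json()
        for quote in data["quotes"]:
            yield {
                "text": quote["text"],
                "author": quote["author"]["name"],
                "tags": quote["tags"],
            }
        # Keep paging until the API reports there is no next page.
        if data.get("has_next"):
            yield scrapy.Request(self.api.format(data["page"] + 1))
```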
Scrapes all random quotes from http://quotes.toscrape.com/random , keeping only the unique ones. The site serves 100 quotes in total, and it takes about five hundred requests before every quote has been seen. The crawl can be safely interrupted with Ctrl-C; press it only once so Scrapy can shut down gracefully, and the contents of the output file are preserved.
$ scrapy crawl random -o egy.json
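A sketch of the approach: request the same URL over and over with dont_filter=True (so the duplicate-request filter does not drop it) and track the quotes already seen in a set. The roughly five hundred requests agree with the coupon-collector expectation, 100 * H(100) ≈ 519 draws to see all 100 quotes:

```python
import scrapy


class RandomSpider(scrapy.Spider):
    name = "random"
    url = "http://quotes.toscrape.com/random"
    total = 100  # the site serves 100 distinct quotes

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen = set()

    def start_requests(self):
        yield scrapy.Request(self.url, dont_filter=True)

    def parse(self, response):
        text = response.css("span.text::text").get()
        if text not in self.seen:
            self.seen.add(text)
            yield {
                "text": text,
                "author": response.css("small.author::text").get(),
            }
        # Re-request the same page until all quotes have turned up.
        if len(self.seen) < self.total:
            yield scrapy.Request(self.url, dont_filter=True)
```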
Scrapes JS-generated content: the information is extracted directly from the JavaScript code embedded in the page. It handles both http://quotes.toscrape.com/js and http://quotes.toscrape.com/js-delayed .
$ scrapy crawl js -o quotes.csv
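A sketch of the extraction, assuming (as on these demo pages) that the quotes sit in a "var data = [...]" array inside a script tag, which is present in the HTML source of both pages even though the delayed page renders it later:

```python
import json
import re

import scrapy


class JsSpider(scrapy.Spider):
    name = "js"
    start_urls = [
        "http://quotes.toscrape.com/js",
        "http://quotes.toscrape.com/js-delayed",
    ]

    def parse(self, response):
        # Pull the JSON array straight out of the JavaScript source.
        script = response.xpath("//script[contains(., 'var data')]/text()").get()
        payload = re.search(r"var data = (\[.*?\]);", script, re.S).group(1)
        for quote in json.loads(payload):
            yield {
                "text": quote["text"],
                "author": quote["author"]["name"],
                "tags": quote["tags"],
            }
```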
Collects URLs from the books website into an .lll list file, which is nothing more than a .csv file without a header row.
$ scrapy crawl collect -o books.lll
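A sketch of the collecting side; the selectors follow the standard books.toscrape.com markup, and the single url field is what makes the output a one-column file:

```python
import scrapy


class CollectSpider(scrapy.Spider):
    name = "collect"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        # One single-field item per book URL.
        for href in response.css("article.product_pod h3 a::attr(href)").getall():
            yield {"url": response.urljoin(href)}
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Since Scrapy infers the feed format from the file extension, the .lll extension presumably has to be registered in FEED_EXPORTERS and mapped to a header-free CSV exporter (for example a CsvItemExporter subclass created with include_headers_line=False); the exact wiring lives in the project's settings.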
Scrapes data from the web pages listed in an .lll file, which must be passed as a parameter. The list file is generated by the collect spider.
$ scrapy crawl books -a lll='toscrape/10.lll'
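A sketch of the consuming side, assuming the list file holds one URL per line (the headerless CSV written by collect); the selectors for the book pages are illustrative:

```python
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"

    def __init__(self, lll=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = []
        if lll:
            # Each non-empty line of the list file is a start URL.
            with open(lll, encoding="utf-8") as f:
                self.start_urls = [line.strip() for line in f if line.strip()]

    def parse(self, response):
        yield {
            "title": response.css("div.product_main h1::text").get(),
            "price": response.css("p.price_color::text").get(),
        }
```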
The project also includes a simple, optional Tor middleware. After each GET it switches to a new IP address. The following Scrapy settings can be used: TORCTRL (control port), TORPWD (password), and TORPROXIES (proxy settings passed to requests.get). If you start Tor with its default settings, it is enough to set TORPWD; the hashed form of this password must be entered in torrc. Requests are issued with the requests library instead of Scrapy/Twisted requests because requests supports SOCKS proxies, so Privoxy is not needed either.
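A sketch of what the middleware could look like, using the stem package to talk to the control port and requests[socks] for the SOCKS proxy; the class name, defaults and timeout are illustrative:

```python
# Fetch each request through Tor's SOCKS proxy with the requests library,
# then ask the control port for a new circuit (NEWNYM) so the exit IP
# changes after every GET. Requires: pip install stem requests[socks]
import requests
from scrapy.http import HtmlResponse
from stem import Signal
from stem.control import Controller

DEFAULT_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}


class TorMiddleware:
    def __init__(self, ctrl_port, password, proxies):
        self.ctrl_port = ctrl_port
        self.password = password
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        return cls(
            s.getint("TORCTRL", 9051),
            s.get("TORPWD"),
            s.getdict("TORPROXIES") or DEFAULT_PROXIES,
        )

    def _new_identity(self):
        # The hashed form of TORPWD must be configured in torrc.
        with Controller.from_port(port=self.ctrl_port) as ctrl:
            ctrl.authenticate(password=self.password)
            ctrl.signal(Signal.NEWNYM)

    def process_request(self, request, spider):
        resp = requests.get(request.url, proxies=self.proxies, timeout=60)
        self._new_identity()  # change IP after each GET
        return HtmlResponse(
            url=request.url,
            body=resp.content,
            encoding="utf-8",
            request=request,
        )
```

Being optional, it would be switched on in settings.py via DOWNLOADER_MIDDLEWARES.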