toscrape

img crawler

Downloads images from books.toscrape.com using the Images pipeline. Each downloaded image is named with a serial number that corresponds to its position in the catalog, e.g. 0001.jpg, 0002.jpg, ... You can change where the download starts, and therefore how many images are downloaded, by editing the start_urls variable. See: toscrape/spiders/img.py

To run, enter:

$ scrapy crawl img

Downloaded images are saved to the downloads/files folder (configured in settings.py).
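
A minimal sketch of how such serial-number naming could be wired up with a custom ImagesPipeline; the class name, the hypothetical `serial` item field and the settings shown below are illustrative assumptions, not necessarily the repo's exact code.

```python
# pipelines.py (sketch)
from scrapy.pipelines.images import ImagesPipeline


class SerialImagesPipeline(ImagesPipeline):
    """Names downloaded images 0001.jpg, 0002.jpg, ... by catalog position."""

    def file_path(self, request, response=None, info=None, *, item=None):
        # 'serial' is a hypothetical item field holding the position in the catalog.
        return f"{item['serial']:04d}.jpg"


# settings.py (sketch)
# ITEM_PIPELINES = {"toscrape.pipelines.SerialImagesPipeline": 1}
# IMAGES_STORE = "downloads/files"   # the download location mentioned above
```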

control crawler

A middleware supervises the crawl and pauses it whenever the "CONTROL_XPATH" condition evaluates to false. Once you have done what is needed in the browser - in this case, following a link - press Enter and the crawl continues. The browser is driven by Selenium with Firefox.

$ scrapy crawl control -o books.csv
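
A rough sketch of what such a Selenium/Firefox control middleware can look like; only the CONTROL_XPATH setting name comes from the description above, the class and the exact checking logic are assumptions.

```python
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By


class ControlMiddleware:
    """Pauses the crawl when the CONTROL_XPATH condition is not met."""

    def __init__(self, control_xpath):
        self.control_xpath = control_xpath
        self.driver = webdriver.Firefox()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get("CONTROL_XPATH"))

    def process_request(self, request, spider):
        self.driver.get(request.url)
        if not self.driver.find_elements(By.XPATH, self.control_xpath):
            # Condition is false: let the user act in the Firefox window,
            # then resume the crawl once Enter is pressed.
            input("CONTROL_XPATH not satisfied - act in the browser, then press Enter...")
        return HtmlResponse(self.driver.current_url,
                            body=self.driver.page_source,
                            encoding="utf-8",
                            request=request)
```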

login crawler

Crawls and scrapes quotes from quotes.toscrape.com/login using a programmatic login.

$ scrapy crawl login
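
For reference, a programmatic login against quotes.toscrape.com/login usually looks roughly like this (the demo site accepts arbitrary credentials; the selectors below are assumptions, not necessarily the repo's own):

```python
import scrapy


class LoginSpider(scrapy.Spider):
    name = "login"
    start_urls = ["http://quotes.toscrape.com/login"]

    def parse(self, response):
        # Submit the login form; from_response picks up the hidden CSRF token.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"username": "user", "password": "pass"},
            callback=self.after_login,
        )

    def after_login(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```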

scroll crawler

Scrapes a page that works with infinite scroll: http://quotes.toscrape.com/scroll . The API that the page calls while scrolling has to be discovered first (for example with the browser's network tools); in this case it returns JSON, which makes the scraping quite simple.

$ scrapy crawl scroll -o quotes.json
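
A sketch of paging through the discovered JSON API; the /api/quotes?page=N endpoint and its has_next/page fields are what the scroll page itself uses, but treat the exact field names here as assumptions.

```python
import json

import scrapy


class ScrollSpider(scrapy.Spider):
    name = "scroll"
    start_urls = ["http://quotes.toscrape.com/api/quotes?page=1"]

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data["quotes"]:
            yield {"text": quote["text"], "author": quote["author"]["name"]}
        # Follow the paging information returned by the API.
        if data.get("has_next"):
            next_page = data["page"] + 1
            yield response.follow(f"/api/quotes?page={next_page}", callback=self.parse)
```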

random crawler

Scrapes random quotes from http://quotes.toscrape.com/random and keeps only the unique ones. The site serves 100 quotes in total, so it takes roughly five hundred requests before all of them have been collected.


The crawl can be safely interrupted with Ctrl-C (press it only once); the contents of the output file are not lost in that case.

$ scrapy crawl random -o egy.json
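
A sketch of the dedupe-until-100 idea, under the assumption that the quote text identifies a quote uniquely; class and field names are illustrative only.

```python
import scrapy


class RandomSpider(scrapy.Spider):
    name = "random"
    start_urls = ["http://quotes.toscrape.com/random"]
    target = 100  # the site serves 100 distinct quotes

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen = set()

    def parse(self, response):
        text = response.css("span.text::text").get()
        if text not in self.seen:
            self.seen.add(text)
            yield {"text": text,
                   "author": response.css("small.author::text").get()}
        if len(self.seen) < self.target:
            # The URL is always the same, so bypass Scrapy's duplicate filter.
            yield response.follow(response.url, callback=self.parse, dont_filter=True)
```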

js crawler

Scrapes JavaScript-generated content; the data can be extracted directly from the JavaScript code embedded in the page.

It scrapes both http://quotes.toscrape.com/js and http://quotes.toscrape.com/js-delayed .

$ scrapy crawl js -o quotes.csv
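
A sketch of pulling the quotes out of the embedded JavaScript, assuming the data is assigned to a `var data = [...]` array that can be cut out with a regular expression and parsed as JSON; the exact extraction approach in the repo may differ.

```python
import json
import re

import scrapy


class JsSpider(scrapy.Spider):
    name = "js"
    start_urls = [
        "http://quotes.toscrape.com/js",
        "http://quotes.toscrape.com/js-delayed",
    ]

    def parse(self, response):
        script = response.xpath("//script[contains(., 'var data')]/text()").get()
        # Grab the JSON array assigned to `data` in the page's JavaScript.
        match = re.search(r"var data = (\[.*?\]);", script or "", re.DOTALL)
        if match:
            for quote in json.loads(match.group(1)):
                yield {"text": quote["text"], "author": quote["author"]["name"]}
```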

collect crawler

It collects URLs from the books website into an .lll list file, which is nothing more than a .csv file without a header row.

$ scrapy crawl collect -o books.lll
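
One way such a headerless list file can be produced is to register a custom exporter for the .lll extension; the exporter class, its module path and the selectors below are assumptions for illustration.

```python
# exporters.py (sketch) - a .lll file is a plain CSV with the header row suppressed
from scrapy.exporters import CsvItemExporter


class LllItemExporter(CsvItemExporter):
    def __init__(self, file, **kwargs):
        kwargs["include_headers_line"] = False
        super().__init__(file, **kwargs)


# settings.py (sketch) - lets "-o books.lll" pick up the exporter above
# FEED_EXPORTERS = {"lll": "toscrape.exporters.LllItemExporter"}


# spiders/collect.py (sketch) - walk the catalog and yield one URL per book page
import scrapy


class CollectSpider(scrapy.Spider):
    name = "collect"
    start_urls = ["http://books.toscrape.com/"]

    def parse(self, response):
        for href in response.css("article.product_pod h3 a::attr(href)").getall():
            yield {"url": response.urljoin(href)}
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```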

books crawler

Scrapes data from the web pages listed in an .lll list file. The list file must be passed as a parameter and is the one generated by the collect crawler.

$ scrapy crawl books -a lll='toscrape/10.lll' 
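
A sketch of how the -a lll=... argument can feed the start URLs, assuming the first column of each line in the list file is the page URL; the selectors are illustrative.

```python
import csv

import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"

    def __init__(self, lll=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        with open(lll, newline="") as f:
            # The .lll file is a headerless CSV; the first column is the URL.
            self.start_urls = [row[0] for row in csv.reader(f) if row]

    def parse(self, response):
        yield {
            "title": response.css("div.product_main h1::text").get(),
            "price": response.css("p.price_color::text").get(),
        }
```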

The books crawler also includes a simple, optional Tor middleware.

After each GET, it switches to a new IP address.

The following Scrapy settings can be used: TORCTRL = control port, TORPWD = password, TORPROXIES = proxy settings for requests.get. If Tor is started with its default settings, it is enough to set TORPWD; the hashed password must be configured in torrc. The middleware uses requests instead of the Scrapy/Twisted request machinery because requests can speak SOCKS, so there is no need for Privoxy either.
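
A sketch of such a middleware using the stem library for the control-port signalling; only the TORCTRL/TORPWD/TORPROXIES setting names come from the description above, everything else (stem, the default ports, the response handling) is an assumption.

```python
import requests  # needs requests[socks] so the SOCKS proxy works
from scrapy.http import HtmlResponse
from stem import Signal
from stem.control import Controller


class TorMiddleware:
    """Fetches each page through Tor, then requests a new circuit (new IP)."""

    def __init__(self, ctrl_port, password, proxies):
        self.ctrl_port = ctrl_port
        self.password = password
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        return cls(
            s.getint("TORCTRL", 9051),
            s.get("TORPWD"),
            s.getdict("TORPROXIES",
                      {"http": "socks5h://127.0.0.1:9050",
                       "https": "socks5h://127.0.0.1:9050"}),
        )

    def process_request(self, request, spider):
        # Fetch with requests through the SOCKS proxy (no Privoxy needed).
        resp = requests.get(request.url, proxies=self.proxies)
        # Ask Tor for a new identity so the next GET comes from a new IP.
        with Controller.from_port(port=self.ctrl_port) as ctrl:
            ctrl.authenticate(password=self.password)
            ctrl.signal(Signal.NEWNYM)
        return HtmlResponse(request.url, body=resp.content,
                            encoding="utf-8", request=request)
```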
