Skip to content
Branch: master
Find file History
Type Name Latest commit message Commit time
Failed to load latest commit information.
icons Add scrapydweb/icons Apr 3, 2019
screenshots Update readme for ScrapydWeb Mar 16, 2019 Add docs for scrapyd-basic-auth May 9, 2019

🔤 English | 🀄️ 简体中文 | 🔙 ScrapydWeb

How to efficiently manage your distributed web scraping projects


Installation and setup

  1. Make sure that Scrapyd has been installed and started on all of your hosts. Note that if you want to visit your Scrapyd server remotely, you have to manually set the bind_address to bind_address = and restart Scrapyd to make it visible externally.
  2. Install ScrapydWeb on one of your hosts via the pip install scrapydweb command.
  3. Start ScrapydWeb via command scrapydweb. (a config file would be generated for customizing settings on the first startup.)
  4. Enable HTTP basic auth (optional).
USERNAME = 'username'
PASSWORD = 'password'
  1. Add your Scrapyd servers, both formats of string and tuple are supported, you can attach basic auth for accessing the Scrapyd server, as well as a string for grouping or labeling.
    # 'username:password@localhost:6801#group',
    ('username', 'password', 'localhost', '6801', 'group'),
  1. Restart ScrapydWeb via command scrapydweb.

Visit web UI

Visit, and log in with the USERNAME/PASSWORD above.

  • The Servers page would automatically display the working status of all of your Scrapyd servers.
  • You can select any number of Scrapyd servers by grouping and filtering, and then invoke the HTTP JSON API of Scrapyd on the cluster with just a few clicks.


  • Integrated with LogParser, the Jobs page could automatically display the pages and the items info of your scraping jobs.
  • By default, ScrapydWeb would periodically create a snapshot of your scraping jobs and save it in the database to avoid losing the jobs info if the Scrapyd server is restarted. (issue 12)


Deploy project

  • After setting up the SCRAPY_PROJECTS_DIR option to the path containing your Scrapy projects, ScrapydWeb would list all projects in that directory, with the latest modified project selected. Simply select one project and press the button to submit, and auto packaging is performed behind the scene.
  • What if you are developing Scrapy projects locally while ScrapydWeb runs on a remote server? In addition to uploading an egg file, you can also compress your project folder into an archive file with an archiving software or via the tar command tar -czvf projectname.tar.gz /home/username/myprojects/projectname and then upload it.
  • You can select any number of your Scrapyd servers to deploy projects.


Run spider

  • In turn, select a project, a version, and a spider from the drop-down box.
  • Be free to pass in any Scrapy setting or spider argument.
  • Support creating timer tasks based on APScheduler. (Don't forget to adjust the max-proc option of Scrapyd if you want to launch a number of spiders concurrently)
  • You can select any number of your Scrapyd servers to run spiders.


Log analysis & visualization

  • It's recommended to set up the SCRAPYD_LOGS_DIR and the ENABLE_LOGPARSER option if you are running Scrapyd and ScrapydWeb on the same machine so that LogParser would be launched automatically along with ScrapydWeb. The LogParser subprocess would parse Scrapy log files in the specified directory periodically and incrementally, thus speeding up the loading of the Stats page without consuming RAM and bandwidth due to requesting the original log file.
  • Similarly, install and run LogParser on the rest of your hosts.
  • LogParser would be able to collect Crawler.stats and Crawler.engine via the built-in telnet console of Scrapy if you are using Scrapy 1.5.1 and earlier version.


Timer tasks

  • Check out the parameters of a task, as well as its execution results.
  • Be free to pause, resume, fire, stop, edit, and delete a task.


Email notice

By visiting the Stats page periodically in the poll subprocess, ScrapydWeb is able to get you notified when specific conditions are satisfied, sending an email containing stats of the current job.

  1. Set up your email account:

EMAIL_USERNAME = ''  # defaults to FROM_ADDR
EMAIL_PASSWORD = 'password'
  1. Set up email working time and the basic triggers:
EMAIL_WORKING_DAYS = [1, 2, 3, 4, 5]

The settings above mean that an email would be sent hourly when current time meets the condition that it's 9, 12 and 17 o'clock of the weekday. This works for the case when a job is finished as well.

  1. In addition to the basic triggers above, ScrapydWeb provides multi-triggers for dealing with specific kind of log, including 'CRITICAL', 'ERROR', 'WARNING', 'REDIRECT', 'RETRY', and 'IGNORE'.
# ...

The settings above mean that ScrapydWeb would stop the current scraping job automatically if three or more critical logs are found in the Scrapy log. And an email would be sent if the condition of email working time is met.

Mobile UI


Jump to top

🔙 ScrapydWeb

You can’t perform that action at this time.