Role Name

This Ansible role runs Skyscraper inside a Docker container.

To pull the role with ansible-galaxy, add the following entry to your requirements.yml:

- name: skyscraper-docker
  src: git+https://github.com/molescrape/ansible-skyscraper-docker.git

Then install it with ansible-galaxy install -r requirements.yml.

Requirements

None

Role Variables

  • skyscraper_useragent: The user agent that should be used for crawling
  • skyscraper_max_parallel: Maximum number of Skyscraper containers that should run in parallel
  • skyscraper_postgres_connstring: Connection string for PostgreSQL connection
  • skyscraper_aws_access_key: AWS access key if AWS components are used
  • skyscraper_aws_secret_access_key: AWS secret if AWS components are used
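
As a rough sketch, the general variables could be set in a group_vars file like the one below. The file path and all values are placeholders. The connection string format is an assumption, since the role does not document which format it expects. The vault_* names are hypothetical variables that would live in an Ansible Vault encrypted file, so that secrets stay out of plain-text files.

```yaml
# group_vars/servers.yml (hypothetical path; every value here is a placeholder)
skyscraper_useragent: "example-crawler/1.0 (+https://example.com/bot)"
skyscraper_max_parallel: 2

# Assumption: a key/value style connection string; check which format the role expects
skyscraper_postgres_connstring: "host=db.example.com dbname=skyscraper user=skyscraper password={{ vault_skyscraper_db_password }}"

# Only needed if AWS components (S3, DynamoDB) are used.
# The vault_* variables are hypothetical and would be defined in a vaulted file.
skyscraper_aws_access_key: "{{ vault_skyscraper_aws_access_key }}"
skyscraper_aws_secret_access_key: "{{ vault_skyscraper_aws_secret_access_key }}"
```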

Pipelines:

  • skyscraper_pipeline_use_duplicatesfilter_dynamodb: Whether to use DynamoDB based duplicates filtering
  • skyscraper_pipeline_use_output_s3: Whether to write data to S3
  • skyscraper_pipeline_use_output_postgres: Whether to write data to PostgreSQL
  • skyscraper_pipeline_use_output_folder: Whether to write data to a folder
  • skyscraper_pipeline_use_itemcount_postgres: Whether to store scraped item counts in PostgreSQL
  • skyscraper_storage_folder_path: Folder in which data is stored if skyscraper_pipeline_use_output_folder is active
  • skyscraper_s3_data_bucket: Bucket in which data is stored if skyscraper_pipeline_use_output_s3 is active
  • skyscraper_dynamodb_crawling_index: DynamoDB index used to store duplicate IDs if the DynamoDB duplicates filter is enabled
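
The pipeline switches and their companion settings listed above could be combined roughly as follows. This sketch assumes the switches are booleans and shows a setup that writes to a local folder only. The folder path is a placeholder, and the role's actual defaults are not documented here.

```yaml
# Assumption: the skyscraper_pipeline_use_* switches take boolean values
skyscraper_pipeline_use_output_folder: true
skyscraper_storage_folder_path: /var/lib/skyscraper/data  # placeholder path

# Everything that depends on AWS or PostgreSQL stays disabled in this sketch
skyscraper_pipeline_use_output_s3: false
skyscraper_pipeline_use_output_postgres: false
skyscraper_pipeline_use_itemcount_postgres: false
skyscraper_pipeline_use_duplicatesfilter_dynamodb: false
```

If S3 output or the DynamoDB duplicates filter were enabled instead, skyscraper_s3_data_bucket or skyscraper_dynamodb_crawling_index would presumably need to be set along with the AWS credentials.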

Spider Loaders:

  • skyscraper_spider_loader_class: Either skyscraper.spiderloader.FolderSpiderLoader or skyscraper.spiderloader.PostgresSpiderLoader
  • skyscraper_spiders_folder: Folder where spiders are stored if folder spider loader is used

Scheduler:

  • skyscraper_scheduler: Either empty or skyscraper.scheduler.PostgresScheduler
  • skyscraper_scheduler_postgres_batch_size: Number of requests to keep in memory if PostgreSQL is used as a scheduler
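
A PostgreSQL-backed setup for the spider loader and scheduler might look like the sketch below. The class names come from the lists above. The batch size is an arbitrary placeholder, and it is an assumption that both components read their connection from skyscraper_postgres_connstring.

```yaml
# Load spiders from PostgreSQL instead of a folder
skyscraper_spider_loader_class: skyscraper.spiderloader.PostgresSpiderLoader

# Use the PostgreSQL scheduler (leave skyscraper_scheduler empty to not use it)
skyscraper_scheduler: skyscraper.scheduler.PostgresScheduler
skyscraper_scheduler_postgres_batch_size: 100  # placeholder value

# Assumption: both classes use this connection string; the value is a placeholder
skyscraper_postgres_connstring: "host=db.example.com dbname=skyscraper user=skyscraper"
```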

Anonymization:

  • skyscraper_tor_enabled

Mail:

  • skyscraper_mail_server
  • skyscraper_mail_user
  • skyscraper_mail_password
  • skyscraper_mail_from

Logging and Monitoring:

  • skyscraper_loglevel
  • skyscraper_stats_class: Either empty or skyscraper.statscollectors.PostgresStatsCollector
  • skyscraper_pidar_url: Either empty or the base URL of the PiDAR monitoring service

Dependencies

None

Example Playbook

- hosts: servers
  roles:
     - skyscraper-docker
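
Role variables can also be passed directly in the playbook. The following is a minimal sketch that assumes the role was installed under the name skyscraper-docker, as in the requirements.yml above. All values are placeholders, and the boolean type of the pipeline switch is an assumption.

```yaml
- hosts: servers
  roles:
    - role: skyscraper-docker
      vars:
        # Placeholder values for illustration only
        skyscraper_useragent: "example-crawler/1.0 (+https://example.com/bot)"
        skyscraper_max_parallel: 2
        skyscraper_spider_loader_class: skyscraper.spiderloader.FolderSpiderLoader
        skyscraper_spiders_folder: /opt/skyscraper/spiders  # placeholder path
        skyscraper_pipeline_use_output_folder: true         # assumed to be a boolean
        skyscraper_storage_folder_path: /var/lib/skyscraper/data  # placeholder path
```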

License

MIT
