GitHub - jschnurr/scrapyscript: Run a Scrapy spider programmatically from a script or a Celery task

Embed Scrapy jobs directly in your code

What is Scrapyscript?

Scrapyscript is a Python library for running Scrapy spiders directly from your code. Scrapy is a great framework for scraping projects, but sometimes you don't need the whole framework and just want to run a small spider from a script or a Celery job. That's where Scrapyscript comes in.

With Scrapyscript, you can:

wrap regular Scrapy Spiders in a Job
load the Job(s) in a Processor
call processor.run() to execute them

... returning all results when the last job completes.

Let's see an example.

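import scrapy
from scrapyscript import Job, Processor

# settings=None: no custom Scrapy settings are supplied
processor = Processor(settings=None)

class PythonSpider(scrapy.spiders.Spider):
    name = "myspider"

    def start_requests(self):
        # self.url comes from the url keyword argument passed to Job below
        yield scrapy.Request(self.url)

    def parse(self, response):
        # Extract the text of the page's <title> element
        data = response.xpath("//title/text()").extract_first()
        return {'title': data}

# Wrap the spider in a Job, run it, and collect the results
job = Job(PythonSpider, url="http://www.python.org")
results = processor.run(job)

print(results)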

[{ "title": "Welcome to Python.org" }]

See the examples directory for more, including a complete Celery example.

Install

pip install scrapyscript

Requirements

Linux or MacOS
Python 3.8+
Scrapy 2.5+

API

Job (spider, *args, **kwargs)

A single request to call a spider, optionally passing in *args or **kwargs, which will be passed through to the spider constructor at runtime.

# url will be available as self.url inside MySpider at runtime
myjob = Job(MySpider, url='http://www.github.com')
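The pass-through can be illustrated with a minimal stand-in that runs without Scrapy or Scrapyscript installed. `FakeSpider`, `FakeJob`, and `build_spider` are hypothetical names made up for this sketch, not the library's classes; the sketch assumes that keyword arguments end up as instance attributes on the spider, which is what makes `self.url` available.

```python
class FakeSpider:
    """Hypothetical stand-in for a spider class (not scrapy.Spider itself)."""

    def __init__(self, *args, **kwargs):
        # Assumption: keyword arguments become instance attributes,
        # so url=... is later readable as self.url.
        self.args = args
        self.__dict__.update(kwargs)


class FakeJob:
    """Hypothetical stand-in for Job: stores the spider class and its arguments."""

    def __init__(self, spider, *args, **kwargs):
        self.spider = spider
        self.args = args
        self.kwargs = kwargs


def build_spider(job):
    # The stored arguments are applied only when the spider is instantiated.
    return job.spider(*job.args, **job.kwargs)


job = FakeJob(FakeSpider, url="http://www.github.com")
spider = build_spider(job)
print(spider.url)
```

The point of the sketch is only the deferral: the job records what to build, and the arguments reach the constructor later, at run time.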

Processor (settings=None)

Create a multiprocessing reactor for running spiders. Optionally provide a scrapy.settings.Settings object to configure the Scrapy runtime.

settings = scrapy.settings.Settings(values={'LOG_LEVEL': 'WARNING'})
processor = Processor(settings=settings)

Processor.run(jobs)

Start the Scrapy engine, and execute one or more jobs. Blocks and returns consolidated results in a single list. jobs can be a single instance of Job, or a list.

results = processor.run(myjob)

or

results = processor.run([myjob1, myjob2, ...])
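Because results from all jobs come back in one consolidated list, it can help to have each spider include an identifying field in the items it yields. The sketch below groups such a list after the fact; the `source` field, the `group_by_source` helper, and the sample data are assumptions for illustration and are not part of Scrapyscript.

```python
from collections import defaultdict


def group_by_source(results, key="source"):
    """Group a flat list of result dicts by the value stored under `key`."""
    grouped = defaultdict(list)
    for item in results:
        # Items lacking the key are collected under None.
        grouped[item.get(key)].append(item)
    return dict(grouped)


# Hypothetical data shaped like the list returned by processor.run([...])
sample_results = [
    {"source": "python.org", "title": "Welcome to Python.org"},
    {"source": "github.com", "title": "GitHub"},
    {"source": "python.org", "title": "Downloads"},
]

grouped = group_by_source(sample_results)
print(sorted(grouped))
```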

A word about Spider outputs

As per the Scrapy docs, a Spider must return an iterable of Request and/or dict or Item objects.

Requests will be consumed by Scrapy inside the Job. dict or scrapy.Item objects will be queued and output together when all spiders are finished.

Due to the way billiard handles communication between processes, each dict or Item must be pickle-able using pickle protocol 0. It's generally best to output dict objects from your Spider.
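One way to catch problem items early is to try pickling them with protocol 0 before yielding them from a spider. This is a minimal sketch using only the standard library; `is_protocol0_picklable` is a hypothetical helper name, not something Scrapyscript provides.

```python
import pickle


def is_protocol0_picklable(item):
    """Return True if `item` can be serialized with pickle protocol 0."""
    try:
        pickle.dumps(item, protocol=0)
    except Exception:
        # Pickling failures surface as different exception types
        # depending on the object, so catch broadly here.
        return False
    return True


# A plain dict of strings pickles without trouble.
print(is_protocol0_picklable({"title": "Welcome to Python.org"}))

# A dict holding a lambda does not, since lambdas cannot be pickled.
print(is_protocol0_picklable({"callback": lambda response: response}))
```

A check like this could run in a test suite against sample spider output rather than in the crawl itself.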

Contributing

Updates, additional features or bug fixes are always welcome.

Setup

Install Poetry
git clone git@github.com:jschnurr/scrapyscript.git
poetry install

Tests

make test or make tox

Version History

See CHANGELOG.md

License

The MIT License (MIT). See the LICENCE file for details.

