How to use Eliot with scrapy? #439

Closed
laurentS opened this issue Nov 28, 2019 · 3 comments

@laurentS

First of all, thanks for the great library. I haven't gotten it to work yet, but I'm already impressed :)

I am writing a fairly complex scraper with scrapy that involves fetching files in a tree-like way:
an initial index file yields a number of other files to fetch, each of which has several sections to process independently. The processing is complex enough that I am struggling to track errors, and eliot looks like a very promising solution (my first attempt was exciting to look at, although it doesn't quite work yet).

In short, scrapy is built on top of twisted, but I obviously don't want to modify scrapy's code as described in the docs. To make things worse, the initial scrapy process uses generators everywhere, so keeping track of the current action is tricky.

scrapy uses the concept of an item that it passes around between callbacks to transfer data. There is an initial scraping process which yields requests to which this item is attached. Then scrapy passes this item to a series of pipelines, each of which gets the item, modifies it and returns it.

To keep the context in Eliot, I tried serializing the action with the_id = action.serialize_task_id() and then picking up the context with Action.continue_task(task_id=the_id). It works partially: the first time I call continue_task, the logs look OK, but if I try to do it more than once, they look like this:

83dc01af-a089-4e2e-8b70-8a690363712c
└── parse_report/1 ⇒ started 2019-11-28 14:51:52 ⧖ 0.146s
    ├── eliot:stdlib/2 2019-11-28 14:51:52
    │   ├── log_level: INFO
    │   ├── logger: spider
    │   └── message: Processing XXXX
    ├── HI/3 2019-11-28 14:51:52
    ├── eliot:remote_task/4/1 ⇒ started 2019-11-28 14:51:52 ⧖ 3.789s
    │   ├── parse_item/4/2/1 ⇒ started 2019-11-28 14:51:52 ⧖ 3.787s
    │   │   ├── eliot:stdlib/4/2/2 2019-11-28 14:51:52
    │   │   │   ├── log_level: INFO
    │   │   │   ├── logger: pipeline
    │   │   │   └── message: Parsing data for item1
    │   │   ├── eliot:stdlib/4/2/3 2019-11-28 14:51:53
    │   │   │   ├── log_level: INFO
    │   │   │   ├── logger: pipeline
    │   │   │   └── message: Error parsing item1
    │   │   ├── eliot:stdlib/4/2/4 2019-11-28 14:51:53
    │   │   │   ├── log_level: INFO
    │   │   │   ├── logger: pipeline
    │   │   │   └── message: Parsing data for item2
    │   │   ├── eliot:stdlib/4/2/5 2019-11-28 14:51:56
    │   │   │   ├── log_level: INFO
    │   │   │   ├── logger: pipeline
    │   │   │   └── message: Parser completed item2
    │   │   └── parse_item/4/2/6 ⇒ succeeded 2019-11-28 14:51:56
    │   └── eliot:remote_task/4/3 ⇒ succeeded 2019-11-28 14:51:56
    └── parse_report/5 ⇒ succeeded 2019-11-28 14:51:52

...more stuff...

83dc01af-a089-4e2e-8b70-8a690363712c   <--- this ID matches the one above
└── <unnamed>   <-- but from here, the context seems to be lost
    └── eliot:remote_task/4/1 ⇒ started 2019-11-28 14:51:56 ⧖ 0.038s
        ├── eliot:stdlib/4/2 2019-11-28 14:51:56
        │   ├── log_level: INFO
        │   ├── logger: pipeline
        │   └── message: Saving item1 to DB
        ├── eliot:stdlib/4/3 2019-11-28 14:51:56
        │   ├── log_level: INFO
        │   ├── logger: pipeline
        │   └── message: Saving item2 to DB
        └── eliot:remote_task/4/4 ⇒ succeeded 2019-11-28 14:51:56

The code looks like this (these are the standard scrapy callbacks):

from eliot import Action, start_action

def parse_report(self, response):
    with start_action(action_type="parse_report") as action:
        item = do_stuff()
        item['task_id'] = action.serialize_task_id()
        return item

class Pipeline1:
    def process_item(self, item, spider):
        with Action.continue_task(task_id=item['task_id']):
            do_the_parsing_stuff()
        return item

class Pipeline2:
    # essentially the same as Pipeline1, but saving to DB
    ...
Is this kind of pipeline logic supported by continue_task, or am I trying to use the wrong solution here?
To be clear, each pipeline's process_item() is called once per item, and I then loop over the subitems inside each pipeline. Ideally, I want the logs to reflect that tree structure, to make tracing errors easier.
Any ideas would be great!

@itamarst
Owner

itamarst commented Nov 28, 2019

Unfortunately you can't add action context from inside individual callbacks... You need to start actions outside, but then, as you say, DeferredContext is needed to track things.

Once Twisted has native support for contextvars (twisted/twisted#1192), the use of eliot.twisted.DeferredContext will in many cases become unnecessary. At that point any wrapped Twisted code should work better, and you won't need to mess with the internals.
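
For reference, the DeferredContext pattern described in the Eliot docs looks roughly like this (a sketch; fetch_page and handle_response are just placeholders here, not scrapy APIs):

from eliot import start_action
from eliot.twisted import DeferredContext

def scrape_index(fetch_page, handle_response):
    # Start an action and make the Deferred-returning call inside its context:
    action = start_action(action_type="scrape_index")
    with action.context():
        d = DeferredContext(fetch_page())   # wrap the Deferred to carry the action along
        d.addCallback(handle_response)      # callbacks now log inside "scrape_index"
        d.addActionFinish()                 # finish the action when the Deferred fires
        return d.result                     # hand back the underlying Deferred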

As a stopgap measure you can:

  1. Use DeferredContext on the entry point to a scrapy Deferred-based API. This will at least give you some action context.
  2. Redirect scrapy's logs to Eliot (https://eliot.readthedocs.io/en/stable/generating/twisted.html if they use Twisted logging, https://eliot.readthedocs.io/en/stable/generating/migrating.html if they use stdlib's logging package); see the sketch below. Looks like you're already doing this.
  3. Just have normal actions inside your callbacks.

Not ideal, but something.
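
For option 2, a minimal sketch, assuming the logs come through the stdlib logging package (which the eliot:stdlib entries in your output suggest):

import logging
from eliot.stdlib import EliotHandler

# Forward stdlib logging records (e.g. from scrapy's loggers) into Eliot,
# where they show up as eliot:stdlib messages under the current action:
logging.getLogger().addHandler(EliotHandler())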

@itamarst
Owner

Looking at what you're doing in more detail, that might also be a reasonable stopgap. Just one caveat: you can only call continue_task once with a given ID, because serialized task IDs are single use.

@laurentS
Author

Thanks for the fast reply! This was super helpful. I've tried a few different options, and so far, my understanding is:

  • serialize_task_id() is single use, but it seems I can call it again from inside a continue_task block. Right now, I'm calling it at the end of each callback, so that the next one logs in sequence.
  • the order of Message.log() and serialize_task_id() matters. Looking at the code, they both call _nextTaskLevel() to determine where the message appears in the log tree. I had not realised this at first.

So it's not perfect: I still get weirdly non-chronological output in some places, and it requires thinking carefully about where the context manager and serialization calls happen, but the result is already way more readable than what I had only 3 hours ago 🎉
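
Roughly, the pattern I mean looks like this (a sketch, not my exact code):

from eliot import Action

class Pipeline1:
    def process_item(self, item, spider):
        # Continue the task started in the spider callback...
        with Action.continue_task(task_id=item['task_id']) as action:
            do_the_parsing_stuff()
            # ...and serialize a fresh ID before leaving, so the next
            # pipeline can continue the same task in sequence:
            item['task_id'] = action.serialize_task_id()
        return item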
Thanks for your help! (and the cool logging library)

@itamarst itamarst closed this as completed Jan 4, 2020