Scraping different items from the same spider, each with different pipeline requirements #20
@Ziinc I will think about the best way of approaching the problem. I was working on the Scrapy core team some time in the past, and we were using one Item structure per project. However, there we used a concept of required/non-required fields. I did not really want to define the item as we did it in Scrapy, as it seems to be overkill to have it defined separately. Again, I like the idea of separating fetcher and parser even more. Let me think about how to plan the changes in the future. |
@oltarasenko I understand what you mean and agree. It seems like additional boilerplate for not much benefit. Perhaps, looking at it from a different angle, the whole definition of an "item" in the config would be unnecessary. From my understanding, the defined item is only used for the config:

```elixir
config :crawly,
  ...
  pipelines: [
    ...
    {Crawly.Pipelines.Validate, item: [:title, :url]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title}
    ...
  ]
```

With the tuple-format definitions, we can do something like this for configuring different pipeline requirements:

```elixir
pipelines: [
  ...
  {Crawly.Pipelines.Validate, item: [:title, :url]},
  {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
  {Crawly.Pipelines.IfCondition,
   condition: fn x -> Keyword.has_key?(x, :a) end,
   pipelines: [
     {Crawly.Pipelines.Validate, item: [:header_count]},
     ...
   ]}
]
```

And this could allow for multi-item-type logic within the pipeline:

```elixir
pipelines: [
  ...
  MyCommon.Pipeline,
  {Crawly.Pipelines.Logic.IfCondition,
   condition: fn x -> Keyword.has_key?(x, :a) end,
   pipelines: [
     {Crawly.Pipelines.Validate, item: [:hello, :world]},
     {Crawly.Pipelines.DuplicatesFilter, item_id: :world},
     ...
   ]},
  {Crawly.Pipelines.Logic.IfCondition,
   condition: fn x -> Keyword.has_key?(x, :b) end,
   pipelines: [
     {Crawly.Pipelines.Validate, item: [:title, :url]},
     {Crawly.Pipelines.DuplicatesFilter, item_id: :title}
   ]},
  MyCommon.Pipeline2
]
```

This takes inspiration from the Factorio game (haha), where you can define filters/logical triggers for moving resources. In essence, this approach allows splitting off items that match a logical condition into a separate sub-pipeline, assuming an IfCondition-style pipeline module. To accomplish something like this, it would require allowing a tuple definition. |
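A minimal sketch of what such an IfCondition-style module could look like, assuming the engine passes the declared keyword options into a three-argument run and that returning false drops an item; the module name, option handling, and nested dispatch below are illustrative, not an existing Crawly API:

```elixir
defmodule MyApp.Pipelines.IfCondition do
  # Hypothetical conditional wrapper: runs the nested pipelines only when
  # condition.(item) returns true; otherwise the item passes through untouched.
  def run(item, state, opts \\ []) do
    condition = Keyword.fetch!(opts, :condition)
    pipelines = Keyword.get(opts, :pipelines, [])

    if condition.(item) do
      run_pipelines(pipelines, item, state)
    else
      {item, state}
    end
  end

  # Reduce the item through the nested pipeline declarations, halting early
  # if one of them drops the item (signalled here by returning false).
  defp run_pipelines(pipelines, item, state) do
    Enum.reduce_while(pipelines, {item, state}, fn
      _pipeline, {false, state} -> {:halt, {false, state}}
      {module, opts}, {item, state} -> {:cont, module.run(item, state, opts)}
      module, {item, state} -> {:cont, module.run(item, state)}
    end)
  end
end
```

With this shape, the tuple entries sketched above would work as long as the engine hands each module its declared options.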
Well, another problem is being able to output CSV. E.g. one way or another we would have to extend the item definition (as in Scrapy), as otherwise we would not be able to output the CSV headers :(. |
With parameterized pipeline definitions, one could pass DataStorage modules (#19) within the pipelines declaration, together with their parameters. So for the CSV DataStorage, it could be:

```elixir
pipelines: [
  ...
  MyCommon.Pipeline,
  {Crawly.DataStorage.FileStorageBackend,
   headers: [:id, :test],
   folder: "/tmp",
   include_headers: true,
   extension: "csv"}
]
```

instead of global configs like in #19.

Having global configs prevents having multiple pipeline modules within the same pipelines declaration. I liken this to how Elixir's piping works: the pipeline module either transforms the scraped data or performs a side effect (e.g. stores it) and returns it for downstream pipelines. |
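A rough sketch of a storage pipeline consuming such tuple options, to illustrate the side-effect-and-return idea; the module name, option names, and three-argument run are assumptions, and the CSV writing is deliberately naive:

```elixir
defmodule MyApp.Pipelines.CSVStorage do
  # Illustrative side-effect pipeline: writes the item as a CSV row and
  # returns it unchanged for downstream pipelines.
  def run(item, state, opts \\ []) do
    headers = Keyword.fetch!(opts, :headers)
    folder = Keyword.get(opts, :folder, "/tmp")
    path = Path.join(folder, "items.csv")

    # Write the header row once, when the file does not exist yet.
    unless File.exists?(path) do
      File.write!(path, Enum.join(headers, ",") <> "\n")
    end

    row = headers |> Enum.map(&to_string(Map.get(item, &1, ""))) |> Enum.join(",")
    File.write!(path, row <> "\n", [:append])

    {item, state}
  end
end
```

The point is that the header knowledge lives in the pipeline declaration itself rather than in a global item definition.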
I was syncing up with some people from the Scrapy team. They are saying the following:
|
As explained in the parent post body, my use case is about handling different types of scraped items, which may or may not use different spiders. Different types of scraped items would have different processing/validation requirements, hence the need for different pipelines. |
If it helps make it clearer, I use (or at least try to use) a single instance of Crawly to manage all scraping needs of the main application, which involves many different web sources and feeds. The scraped data then gets cleaned and stored in the relevant database tables, or undergoes further processing steps. This last part is why some ability to separate scraped items would be great, as they currently all end up in the same pipeline declaration. |
Right now my solution for this consists of pipeline-level pattern matching, with a specific key for each scraped item type:

```elixir
def run(%{my_item: item}, state) do
  # do things
end

def run(item, state), do: {item, state}
```

Should this be the standard way of approaching this problem? |
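For reference, a fuller version of that per-key approach as a self-contained pipeline module, matching the two-argument run used above; the :my_item key, the required fields, and the drop-by-returning-false convention are assumptions for illustration:

```elixir
defmodule MyApp.Pipelines.ValidateMyItem do
  # Only items wrapped under the :my_item key are handled by this pipeline.
  def run(%{my_item: item} = wrapped, state) do
    if Enum.all?([:title, :url], &Map.has_key?(item, &1)) do
      {wrapped, state}
    else
      # Drop the item (assumed convention: returning false removes it
      # from the rest of the pipeline).
      {false, state}
    end
  end

  # Any other item type passes through untouched.
  def run(item, state), do: {item, state}
end
```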
This approach operates on a per-key basis, meaning that all items are going to consist of a single-key map |
I am about to introduce spider-level settings. E.g. it will be possible to do it inside the init function. So the idea is that specifying spider-level settings would allow you to override the global config. I have made the required preparations here: e05b512. So one of the points is that the spider would be able to set a list of processors for every given request/item. |
Looking through the commit, it seems the middlewares and item pipelines are going to be set to the default value from the config, then the spider overrides them? |
Yes. The idea I am playing with now is something like the following API:
So at the end of the day, the worker will get a complete request with the middlewares already attached. I am fixing tests there now. Hope to show the code soon! |
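A hypothetical sketch of what init-based, spider-level overrides might look like; the override keys (middlewares:, pipelines:) are guesses, and the actual API in the referenced commit may differ:

```elixir
defmodule MySpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init() do
    [
      start_urls: ["https://example.com/news"],
      # Hypothetical spider-level overrides of the global config:
      middlewares: [Crawly.Middlewares.UniqueRequest, Crawly.Middlewares.UserAgent],
      pipelines: [
        {Crawly.Pipelines.Validate, item: [:title, :url]},
        {Crawly.Pipelines.DuplicatesFilter, item_id: :title}
      ]
    ]
  end

  @impl Crawly.Spider
  def parse_item(_response), do: %Crawly.ParsedItem{items: [], requests: []}
end
```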
This would allow middlewares to be set on a per-request basis, but there isn't a way to specify pipelines on a per-scraped-item basis, as there isn't a standard struct for each scraped item. Also, it seems simpler to do pattern matching on scraped items within the pipelines than to check and specify pipelines within the spider, since the latter causes the spider to be unnecessarily fat. |
Yeah. I don't have a good answer for that :(. E.g. you're right we can't do the same thing with items.
Maybe. The spider should not be fat. The idea, for now, is to allow setting middlewares/pipelines in init... |
What do you think of:
|
I'm gonna brain-dump my thoughts on what Crawly's selling points are to me, and my ideal-scenario type of architecture (bear with me, it might be long):

Crawly's selling point, to me, is the simplicity of how everything is just a series of pipelines. It is quite easy to visualize and understand how a request flows through the entire system from start to finish into a scraped item. Once I understood how an item pipeline was called, I immediately understood how a middleware worked as well, due to the reuse of the pipeline concept. I think that reusing this concept maximizes both ease of understanding of the overall system and of how to write your own custom pipelines (which, for any advanced user, would eventually happen). It is also why I think a Crawly.ScrapedItem struct (or FetchedItem, as long as the name is less ambiguous about what it contains) would fit naturally into this design.

With this being said, I do note that there is no pipeline lifecycle for handling what happens to a response after it is fetched, which is also where the retries from #39 would ideally be handled. There is also no pipeline lifecycle for handling what happens to a request between when it is fetched from the request storage and when it is passed to the fetcher.

Request-Response-ScrapedItem Flow (diagram omitted; it leaves out the part between the request storage and the fetcher.)

Each phase between storage, fetcher, and spider needs to have some degree of control and customizability. For example, for the portion between the Fetcher and the Spider, there needs to be a way to control, handle, and customize the response received by the Fetcher. Since all the fetcher does is transform the request into the response, it should not be responsible for doing so. A separate pluggable pipeline module can be introduced between these two points (Fetcher, Spider) to handle issues like backoffs, retries, etc. This response-handler section would ideally solve the retries issue, as it handles the Response lifecycle. If it reuses the pipeline concept, it will not be any additional difficulty to understand how the response moves between pipeline modules.

This makes Crawly a hyper-customizable data fetching framework, which would be extremely attractive for any serious web scraper or data person. It is also simple and flexible enough to customize the request lifecycle (through middlewares) and the scraped item lifecycle (through item pipelines). The only thing lacking now, I believe, is the lifecycle between when the request is fetched and when the response is passed to the spider. These are essentially the only parts that cannot be customized.

The 3 levels of configuration: ideally, in my mind, there should be 3 different levels of configuration. A global default configuration (for pipelines, middlewares), a spider-level configuration (meaning declared in the spider itself), and a request/item-level configuration. In this case, the most recent PR (#39) will allow for request-level middleware overrides. |
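As an illustration of that missing phase, a response-phase pipeline reusing the same run contract could look roughly like this; the module, its placement between fetcher and spider, and the drop behaviour are hypothetical:

```elixir
defmodule MyApp.ResponsePipelines.RetryOnServerError do
  # Hypothetical response-phase pipeline, sitting between the fetcher and the
  # spider: 5xx responses never reach parse_item/1.
  def run(%{status_code: code} = _response, state) when code >= 500 do
    # In a real implementation the original request would be re-queued here
    # (e.g. pushed back into the request storage with a retry counter).
    {false, state}
  end

  # Anything else flows on to the spider unchanged.
  def run(response, state), do: {response, state}
end
```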
To add to the 3-level overrides point: regarding the ScrapedItem struct, even with all the above implemented, the spider will still have to specify what pipeline set to use, which will tend towards a fatter spider (as it declares the pipelines for each type of scraped item, and this would be duplicated for each different spider). Pattern matching based on item type does have its benefits, since the pipeline will know what the item structure looks like. |
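The intended precedence of the three levels could be sketched like this; a minimal illustration, where the function and the way overrides are collected are not an existing Crawly API:

```elixir
defmodule MyApp.Settings do
  # Request-level overrides win over spider-level ones, which win over the
  # global defaults from config.exs.
  def effective(key, spider_overrides \\ [], request_overrides \\ []) do
    request_overrides[key] ||
      spider_overrides[key] ||
      Application.get_env(:crawly, key)
  end
end
```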
I think to summarize my points:
|
I think it's a useful conversation. Some of the flows above can already go into the Crawly documentation, as they explain the flow of things, which will allow people to understand the internals. I would appreciate it if you could summarize them in relevant PRs.
Yeah. And we should keep it as simple as possible. In this regard, I am still a bit unsure whether pipelines/middlewares should really live within Request/Item. Also, maybe we should not really separate them; the idea is that these are just pre/post processors. I am still a bit unsure. However, to suggest something:
And finally:
One of the things I don't like regarding OOP is something like unexpected features. E.g. I would want to avoid the cases when you're using a ...

Regarding your last points: |
Well, at the last minute I decided to keep the middlewares in the request as they are for now. (Stripping the middlewares out of the request would raise the question of identifying retries, which has been solved.) |
Yes, we could do it as you're suggesting. I was speaking with people from the Scrapy core team; what they are saying is that you can re-define the pipeline in a spider there, however this feature is almost never used. I kind of like the approach with custom middlewares more... E.g. it seems to be way simpler... |
I think let's keep the pipeline behaviour as-is for now; it seems like additional changes for minimal benefit. Spider overrides in the init would be easier to implement. I'll open PRs for the documentation soon:
|
PRs opened, to be closed on merge |
Problem:
Crawly only allows single-item-type scraping. However, what if I am crawling two different sites with vastly different items?
For example, web page A (e.g. a blog) will have:
while web page B (e.g. a weather site) will have:
In the current setup, the only way to work around this is to lump all these logically different items into one large item, such that the end item declaration in config will be:
the issue is that settings like the duplicates filter's :item_id are not item-specific: since the item type from the weather site has no title, I can't specify an item-type-specific field.
I have some idea of how this could be implemented, taking inspiration from Scrapy.
We could define item structs, and sort the items to their appropriate pipelines according to struct.
Using the tutorial as an example:
using this ideal scenario config:
with the spider implemented like so:
The returned items then can get sorted into their specified pipelines.
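A hypothetical reconstruction of what such a struct-keyed config and spider output might look like; all application module names and fields below are made up for illustration:

```elixir
# Hypothetical item structs
defmodule MyApp.BlogPost do
  defstruct [:title, :url, :author]
end

defmodule MyApp.WeatherReading do
  defstruct [:city, :temperature]
end

# In config/config.exs (hypothetical struct-keyed pipeline declaration):
config :crawly,
  pipelines: %{
    MyApp.BlogPost => [
      {Crawly.Pipelines.Validate, item: [:title, :url]},
      {Crawly.Pipelines.DuplicatesFilter, item_id: :title}
    ],
    MyApp.WeatherReading => [
      {Crawly.Pipelines.Validate, item: [:city, :temperature]},
      {Crawly.Pipelines.DuplicatesFilter, item_id: :city}
    ]
  }
```

The spider's parse_item/1 would then return these structs inside %Crawly.ParsedItem{items: [...]}, and the engine would look up the pipeline list keyed by each item's struct.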
This configuration method proposes the following: pipeline modules can take parameters via tuple definitions, e.g. {MyPipelineModule, validate_more_than: 5}.
To consider backwards compatibility, a single-item pipeline could still be declared; the struct-keyed declaration would only be needed for a multi-item pipeline.
Do let me know what you think @oltarasenko