
Scraping different items from the same spider, each with different pipeline requirements #20

Closed
Ziinc opened this issue Nov 25, 2019 · 24 comments


Ziinc commented Nov 25, 2019

Problem:
Crawly only allows scraping a single item type. However, what if I am crawling two different sites with vastly different items?

For example, web page A (e.g. a blog) will have:

  • comments
  • article content
  • related links
  • title

while web page B (e.g. a weather site) will have:

  • temperature
  • country

In the current setup, the only way to work around this is to lump all these logically different items into one large item, such that the item declaration in the config becomes:

item: [:title, :comments, :article_content, :related_links, :temperature, :country]

The issues are:

  1. Both item types share the same pipeline. When scraping the weather data, the blog-related fields will be blank, and vice versa when scraping the blog. This affects pipeline validations, since the pipeline is shared.
  2. Both item types end up in the same output.
  3. Duplication checks (such as :item_id) are not item-specific. Since the item type from the weather site has no title, I can't specify an item-type-specific field.

I have some ideas on how this could be implemented, taking inspiration from Scrapy.
We could define item structs and sort the items into their appropriate pipelines according to their struct.

Using the tutorial as an example, with this ideal-scenario config:

config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  follow_redirects: true,
  closespider_itemcount: 1000,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    {MyItemStruct, [
        Crawly.Pipelines.Validate,
        {Crawly.Pipelines.DuplicatesFilter, item_id: :title }, # similar to how supervisor trees are declared
        Crawly.Pipelines.CSVEncoder
    ]},
    {MyOtherItemStruct, [
        Crawly.Pipelines.Validate,
        Crawly.Pipelines.CleanMyData,
        {Crawly.Pipelines.DuplicatesFilter, item_id: :name }, # similar to how supervisor trees are declared
        Crawly.Pipelines.CSVEncoder
    ]}
  ]

with the spider implemented as so:

  @impl Crawly.Spider
  def parse_item(response) do
    hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")

    requests =
      Utils.build_absolute_urls(hrefs, base_url())
      |> Utils.requests_from_urls()

    title = response.body |> Floki.find("article.blog_post h1") |> Floki.text()
    name = response.body |> Floki.find("article.blog_post h2") |> Floki.text()

    %{
         :requests => requests,
         :items => [
            %MyItemStruct{title: title, url: response.request_url},
            %MyOtherItemStruct{name: name, url: response.request_url}
         ]
      }
  end

The returned items then can get sorted into their specified pipelines.
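As a rough sketch of that sorting step (MyApp.ItemRouter and the pipelines_by_struct map are hypothetical, not existing Crawly code; Crawly.Utils.pipe/2 is assumed to thread an item through a list of pipeline modules):

defmodule MyApp.ItemRouter do
  # Hypothetical sketch: route each returned item to the pipeline list
  # configured for its struct (the config above keys pipelines by struct module).
  require Logger

  def route(item, pipelines_by_struct) do
    case Map.get(pipelines_by_struct, item.__struct__) do
      nil ->
        Logger.warn("No pipelines configured for #{inspect(item.__struct__)}")
        item

      pipelines ->
        Crawly.Utils.pipe(pipelines, item)
    end
  end
end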

This configuration method proposes the following:

  • allow declaration of item-specific pipelines for multiple items
  • allow passing of arguments to a pipeline implementation e.g. {MyPipelineModule, validate_more_than: 5}

For backwards compatibility, a single-item pipeline could still be declared as a plain list; the struct-keyed declaration would only be needed for multi-item pipelines.

Do let me know what you think @oltarasenko

@oltarasenko

@Ziinc I will think about the best way of approaching the problem. I was working on the Scrapy core team some time in the past, and we were using one Item structure per project. However, there we used a concept of required/non-required fields.

I did not really want to define the item as we did in Scrapy, as it seems to be overkill to have it defined separately.
Also, Crawly defines the item as a set of required fields (not the full set of fields, as it might seem: https://oltarasenko.github.io/crawly/#/?id=item-atom). So for now I would suggest removing the non-overlapping fields from the item config.

Again, I like the idea of separating the fetcher and the parser even more. Let me think about how to plan the changes in the future.


Ziinc commented Nov 26, 2019

@oltarasenko I understand what you mean and agree. It seems like additional boilerplate for not much benefit.

Perhaps, looking at it from a different angle, the whole definition of an "item" in the config would be unnecessary. From my understanding, the defined item is only used by the Crawly.Pipelines.Validate pipeline module, and the item_id is only used by the Crawly.Pipelines.DuplicatesFilter pipeline module. With a tuple-format definition, these two parameters could be localized to the pipeline module definition instead of being global config parameters, like so:

config :crawly,
  ...
  pipelines: [
    ....
    {Crawly.Pipelines.Validate, item: [:title, :url] },
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title }
    ....
  ]

With the tuple-format definitions, we can do something like this for configuring different pipeline requirements:

pipelines: [
  ....
  {Crawly.Pipelines.Validate, item: [:title, :url] },
  {Crawly.Pipelines.DuplicatesFilter, item_id: :title },
  {Crawly.Pipelines.IfCondition, condition: fn x -> Keyword.has_key?(x, :a) end, pipelines: [
      {Crawly.Pipelines.Validate, item: [:header_count]},
      ...
  ]}
]

And this could allow for multi-item-type logic within the pipeline:

pipelines: [
  ....
  MyCommon.Pipeline,
  {Crawly.Pipelines.Logic.IfCondition, condition: fn x -> Keyword.has_key?(x, :a) end, pipelines: [
      {Crawly.Pipelines.Validate, item: [:hello, :world]},
      {Crawly.Pipelines.DuplicatesFilter, item_id: :world},
      ...
  ] },
  {Crawly.Pipelines.Logic.IfCondition, condition: fn x -> Keyword.has_key?(x, :b) end, pipelines: [
      {Crawly.Pipelines.Validate, item: [:title, :url] },
      {Crawly.Pipelines.DuplicatesFilter, item_id: :title },
  ] },
  MyCommon.Pipeline2
]

This takes inspiration from the Factorio game (haha), where you can define filters/logical triggers for moving resources. In essence, this approach allows splitting items that match a logical condition off into a separate pipeline, assuming an IfCondition module. Other logical modules might be feasible, such as a Case module or an IfElse module.

Accomplishing something like this would require allowing a tuple definition. The Validate and DuplicatesFilter modules would benefit from a parameterized definition too.
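For illustration only, a sketch of how the pipeline runner could support such tuple definitions (the run/3 arity that receives options is an assumed extension, not the current Crawly.Pipeline behaviour):

defmodule MyApp.PipeSketch do
  # Hypothetical: normalize Module and {Module, opts} entries, then run each stage.
  def pipe(pipelines, item) do
    Enum.reduce_while(pipelines, {item, %{}}, fn entry, {acc, state} ->
      {module, opts} =
        case entry do
          {mod, opts} when is_list(opts) -> {mod, opts}
          mod -> {mod, []}
        end

      case module.run(acc, state, opts) do
        # Returning false drops the item and halts the pipeline
        {false, new_state} -> {:halt, {false, new_state}}
        {new_item, new_state} -> {:cont, {new_item, new_state}}
      end
    end)
  end
end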

@oltarasenko

Well, another problem is being able to output CSV. One way or another we would have to extend the item definition (as in Scrapy), as otherwise we would not be able to output the CSV headers :(.


Ziinc commented Nov 26, 2019

With parameterized pipeline definitions, one could place DataStorage modules (#19) within the pipeline, with their parameters inline. So for the CSV DataStorage, it could be:

pipelines: [
  ....
  MyCommon.Pipeline,
  {Crawly.DataStorage.FileStorageBackend,
       headers: [:id, :test],
       folder: "/tmp",
       include_headers: true,
       extension: "csv"}
]

instead of global configs like in #19:

config :crawly, Crawly.DataStorage.FileStorageBackend,
       folder: "/tmp",
       include_headers: false,
       extension: "jl" 

Having global configs prevents configuring the same pipeline module multiple times within the same pipelines declaration. I liken this to how Elixir's piping works: each pipeline module either transforms the scraped data or performs a side effect (e.g. stores it) and returns it for downstream pipelines.
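For example (the folder and extension values here are purely illustrative), the same storage module could then appear twice with different options, which a single global config entry cannot express:

pipelines: [
  ....
  {Crawly.DataStorage.FileStorageBackend, folder: "/tmp/blog", include_headers: true, extension: "csv"},
  {Crawly.DataStorage.FileStorageBackend, folder: "/tmp/weather", include_headers: false, extension: "jl"}
]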

@oltarasenko

I was syncing up with some people from the Scrapy team. They are saying the following:

  1. Scrapy actually allows using different pipelines for different items. However, the configuration is not nice either: https://stackoverflow.com/questions/8372703/how-can-i-use-different-pipelines-for-different-spiders-in-a-single-scrapy-proje. Honestly, I would not want to follow their way here.

  2. It's quite a rare case to have multiple different items per project. In the example above, they would prefer to have an Article definition with comments inside it, so it would be the responsibility of the spider to validate comments.

  3. Another option is to disable the generic pipeline (ValidateItem in our case) and to put two custom ones in instead, e.g. ValidateBlogComments/ValidateArticles.


Ziinc commented Dec 1, 2019

As explained in the parent post body, my use case is about handling different types of scraped items, which may or may not come from different spiders. Different types of scraped items have different processing/validation requirements, hence the need for different pipelines.


Ziinc commented Dec 1, 2019

If it helps make it clearer: I use (or at least am trying to use) a single instance of Crawly to manage all the scraping needs of the main application, which pulls from many different web sources and feeds. The scraped data then gets cleaned and stored in the relevant database tables, or undergoes further processing steps. This last part is why some ability to separate scraped items would be great, as they currently all end up in the same pipeline declaration.

@oltarasenko oltarasenko added this to the 0.8.0 milestone Dec 28, 2019

Ziinc commented Dec 30, 2019

Right now my solution for this consists of pipeline-level pattern matching, with each key corresponding to a specific scraped item type.

def run(%{my_item: item}, state) do
    # do things 
end

def run(item, state), do: {item, state} 
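For context (the field names here are just illustrative), the spider then wraps each scraped item under its type key so that the clause above can match on it:

# Inside the spider's parse_item/1 (hypothetical fields)
%{
  :requests => requests,
  :items => [
    %{my_item: %{title: title, url: response.request_url}},
    %{my_other_item: %{name: name, url: response.request_url}}
  ]
}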

Should this be the standard way of approaching this problem?


Ziinc commented Dec 30, 2019

This approach operates on a per-key basis, meaning that every item ends up as a single-key map.

@oltarasenko

I am about to introduce spider-level settings. E.g. it will be possible to do it inside the init function. So the idea is that specifying spider-level settings would allow you to override the global config. I have made the required preparations here: e05b512

So one of the points is that the spider would be able to set a list of processors for every given request/item.


Ziinc commented Dec 30, 2019

Looking through the commit, it seems the middlewares and item pipelines are going to be set to the default values from the config, and then the spider overrides them?

@oltarasenko

Yes. The idea I am playing with now is something like the following API:
(in Crawly.Request)

  @spec new(url, headers, options) :: request
        when url: binary(),
             headers: [term()],
             options: [term()],
             request: Crawly.Request.t()

  def new(url, headers \\ [], options \\ []) do
    # Define a list of middlewares which are used by default to process
    # incoming requests
    default_middlewares = [
      Crawly.Middlewares.DomainFilter,
      Crawly.Middlewares.UniqueRequest,
      Crawly.Middlewares.RobotsTxt
    ]

    middlewares =
      Application.get_env(:crawly, :middlewares, default_middlewares)

    new(url, headers, options, middlewares)
  end

  @doc """
  Same as Crawly.Request.new/3, but allows specifying middlewares as the 4th
  parameter.
  """
  @spec new(url, headers, options, middlewares) :: request
        when url: binary(),
             headers: [term()],
             options: [term()],
             middlewares: [term()], # TODO: improve typespec here
             request: Crawly.Request.t()
  def new(url, headers, options, middlewares) do
    %Crawly.Request{
      url: url,
      headers: headers,
      options: options,
      middlewares: middlewares
    }
  end
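As a usage sketch of the proposed new/4 above (the URL, header, and middleware values here are purely illustrative):

request =
  Crawly.Request.new(
    "https://www.example.com/blog",
    [{"User-Agent", "MyBot/1.0"}],
    [],
    [Crawly.Middlewares.DomainFilter, Crawly.Middlewares.UniqueRequest]
  )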

So at the end of the day, the worker will get a complete request with processors set. However, it may read the spider settings to see if an override is required.

I am fixing tests there now. Hope to show the code soon!


Ziinc commented Dec 30, 2019

This would allow middlewares to be set on a per-request basis, but there isn't a way to specify pipelines on a per-scraped-item basis, as there isn't a standard struct for each scraped item.

Also, it seems simpler to do pattern matching on scraped items within the pipelines than to check and specify pipelines within the spider, since the latter causes the spider to be unnecessarily fat.

@oltarasenko

This would allow middlewares to be set on a per-request basis, but there isn't a way to specify pipelines on a per-scraped-item basis, as there isn't a standard struct for each scraped item.

Yeah, I don't have a good answer for that :(. You're right, we can't do the same thing with items.
Let me think some more.

Also, it seems simpler to do pattern matching on scraped items within the pipelines than to check and specify pipelines within the spider, since the latter causes the spider to be unnecessarily fat.

Maybe. The spider should not be fat. The idea, for now, is to allow setting middlewares/pipelines in init...
Also, I am trying to avoid pattern matching in some cases. E.g. I don't really feel comfortable if some entity does pattern matching on complex structures defined elsewhere.

@oltarasenko

What do you think of:

  1. Moving middlewares outside of the request?
  2. Should we define Crawly.Item and ask people to define items using the use Crawly.Item macro, where the base item would be extended (and that base item defines the pipelines API)?


Ziinc commented Dec 30, 2019

I'm gonna brain-dump my thoughts on what Crawly's selling points are to me, and my ideal-scenario type of architecture (bear with me, it might be long):

Crawly's selling points, to me, are the simplicity in how everything is just a series of pipelines. It is quite easy to visualize and understand how a request flows through the entire system, from start to finish, into a scraped item. This is because once I understood how an item pipeline was called, I immediately understood how a middleware worked as well, due to the reuse of the pipeline concept. I think that reusing this concept maximizes both the ease of understanding the overall system and the ease of writing your own custom pipelines (which, for any advanced user, will eventually happen).

It is also why I think that the Crawly.ScrapedItem (or FetchedItem, as long as it is less ambiguous about what Item refers to) idea is a good one, as it standardizes what a scraped item is and how it should be handled through its lifecycle.

With this being said, I do note that there is no pipeline lifecycle for handling what happens to a response after it is fetched, which is also where the retries from #39 would ideally be handled.

There is also no pipeline lifecycle for handling what happens to a request between when it is fetched from the request storage and when it is passed to the fetcher.

Request-Response-ScrapedItem Flow

[diagram: Request-Response-ScrapedItem flow]

This leaves out the part between the request storage and the fetcher.

Each phase between storage, fetcher, and spider needs some degree of control and customizability. For example, for the portion between the Fetcher and the Spider, there needs to be a way to control, handle, and customize the response received by the Fetcher. Since all the fetcher does is transform the request into a response, it should not be responsible for that handling. A separate pluggable pipeline can be introduced between these two points (Fetcher, Spider) to handle issues like backoffs, retries, etc.

This Response handler section would ideally solve the retries issue, as it handles the Response lifecycle. If it reuses the pipeline concept, there will be no additional difficulty in understanding how the response moves between pipeline modules.

This makes Crawly a hyper-customizable data-fetching framework, which would be extremely attractive for any serious web scraper or data person. It is already simple and flexible enough to customize the request lifecycle (through middlewares) and the scraped item lifecycle (through item pipelines).

The only thing lacking now, I believe, is the lifecycle between when the request is fetched and when the response is passed to the spider. This is essentially the only part that cannot be customized.

The 3 levels of configuration

Ideally, in my mind, there should be 3 different levels of configuration: a global default configuration (for pipelines and middlewares), a spider-level configuration (declared in init, applying to all requests from the spider and overriding the defaults), and a request-level middleware override (attached to the request).

In this case, the most recent PR (#39) will allow for request-level middleware overrides.
I am supportive of the idea of having middlewares declared in the Request struct, response handlers declared in the Response struct, and item pipelines declared in the ScrapedItem struct. This allows for maximum customizability and flexibility, which is always a plus.
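As a sketch of what the spider-level layer could look like (the :middlewares and :pipelines keys inside init/0 are assumptions about the proposed API, not released behaviour):

def init() do
  [
    start_urls: ["https://www.example.com/blog"],
    # Hypothetical spider-level overrides of the global defaults
    middlewares: [Crawly.Middlewares.DomainFilter, Crawly.Middlewares.UserAgent],
    pipelines: [
      {Crawly.Pipelines.Validate, item: [:title, :url]},
      {Crawly.Pipelines.DuplicatesFilter, item_id: :title}
    ]
  ]
end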


Ziinc commented Dec 30, 2019

To add to the 3-levels-of-configuration point: request-level overrides will be used very rarely, and only in very unique cases.

As for the ScrapedItem struct: even with all of the above implemented, the spider would still have to specify which pipeline set to use, which tends towards a fatter spider (as it declares the pipelines for each type of scraped item, and this would be duplicated across different spiders). Pattern matching based on item type does have its benefits, since the pipeline will know what the item structure looks like.


Ziinc commented Dec 31, 2019

I think to summarize my points:

  1. keeping middlewares in the requests is a good idea
  2. a use Crawly.Item or use Crawly.ScrapedItem macro would help, but would make the system more opaque. There needs to be a way to selectively apply pipelines to a scraped item based on its inherent data type.

@oltarasenko

I think this is a useful conversation. Some of the flows above can already go into the Crawly documentation, as they explain the flow of things, which will allow people to understand the internals. I would appreciate it if you could summarize it in the relevant PRs.

Crawly's selling points, to me, are the simplicity in how everything is just a series of pipelines. It is quite easy to visualize and understand how a request flows through the entire system, from start to finish, into a scraped item. This is because once I understood how an item pipeline was called, I immediately understood how a middleware worked as well, due to the reuse of the pipeline concept. I think that reusing this concept maximizes both the ease of understanding the overall system and the ease of writing your own custom pipelines (which, for any advanced user, will eventually happen).

Yeah. And we should keep it as simple as possible. In this regard, I am still a bit unsure whether pipelines/middlewares should really live within Request/Item. Also, maybe we should not really separate them; the idea is that these are just pre/post processors. I am still a bit unsure. However, to suggest something:

  1. Let me strip the Request changes out into a separate PR. We should really discuss them outside the retries scope.
  2. The retries, for now, will use a static config to understand what needs to be skipped.

And finally:

a use Crawly.Item or use Crawly.ScrapedItem macro would help, but would make the system more opaque. There needs to be a way to selectively apply pipelines to a scraped item based on its inherent data type.

One of the things I don't like about OOP is unexpected features. E.g. I would want to avoid cases where, if you're using a use macro, something magically injects functions into your map. From my experience, this makes it very hard to develop and debug such systems. It will make everything very complicated for users. I kind of like these slim items... (E.g. when I was using Scrapy, it always felt quite strange that you had to define an item, like in Django, but all fields were actually just text fields.)

Regarding your last points

@oltarasenko

Well, at the last minute I decided to keep the middlewares in the request as they are for now. (Stripping the middlewares out of the request would reopen the question of identifying retries, which has already been solved.)


Ziinc commented Jan 3, 2020

I'll open up a PR for the documentation updates on the request-response-item flow.

Going back to the main crux of this issue: it is about how to determine the pipeline for a scraped item, as visualized in this diagram:
[diagram: determining the pipeline for a scraped item]

If Requests already contain their middlewares, it does make sense, for consistency, for an Item to contain its required pipelines. However, if they are set by the spider when the item is scraped, the logic for checking fields moves into the spider instead of remaining in the pipeline lifecycle. This makes the spider fatter and less extraction-focused.

As you mentioned, pre/post processing should be left outside of the spider. I agree with this, hence we should not set pipelines in the spider.

What I can think of, which could be implemented right now with just a pipeline module, would be a pipeline that uses Utils.pipe/2 and a function that returns a set of pipelines.

# config.exs
pipelines: [
  ....
  MyCommon.Pipeline,
  {Crawly.Pipelines.Function, func: fn
    %MyStruct{} -> [..., My.Pipeline2, ...]
    %{my_field: _} -> [..., My.Pipeline1, ...]
    %{other_field: _} -> [..., My.Pipeline2, ...]
  end}
]

However, this example makes the config very verbose, which is also not ideal.

From my own experimentation, using struct-based pattern matching (in a custom pipeline) does work. For example, I can populate the fields of an Ecto schema struct in the spider and then pattern match on that particular struct in the pipelines, but it does require some intermediate conversion back to a map before it is inserted into the table with Ecto.

Example:

# MyCustomPipeline.ex
def run(%MyStruct{}, state) do
    # do things 
    # Maybe insert into ecto
end

def run(item, state), do: {item, state} 
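A minimal sketch of that conversion step (MyApp.Repo, MyApp.Weather, and the field list are assumed names for illustration, not part of Crawly):

defmodule MyApp.Pipelines.StoreWeather do
  # Hypothetical pipeline: matches one Ecto schema struct, passes other items through.
  def run(%MyApp.Weather{} = item, state) do
    # Convert the schema struct back to a plain map before building a changeset
    attrs =
      item
      |> Map.from_struct()
      |> Map.drop([:__meta__])

    %MyApp.Weather{}
    |> Ecto.Changeset.cast(attrs, [:temperature, :country])
    |> MyApp.Repo.insert()

    {item, state}
  end

  def run(item, state), do: {item, state}
end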

Perhaps we could just let advanced users figure this out themselves, and maybe just bless the struct-based pattern matching method as the best practice for adding pipeline logic?

@oltarasenko

Yes, we could do it as you're suggesting. I was speaking with people from the Scrapy core team; what they are saying is that you can re-define the pipeline in a spider there, however this feature is almost never used. I kind of like the approach with custom middlewares more... it seems way simpler...


Ziinc commented Jan 4, 2020

I think let's keep the pipeline behaviour as-is for now. It seems like additional changes for minimal benefit. Spider overrides in the init would be easier to implement in worker.exs too.

I'll open PRs for the documentation soon:

  • request-response-item flow
  • advanced pipeline techniques (pattern matching, structs, ecto schemas)


Ziinc commented Jan 10, 2020

PRs opened, to be closed on merge

@Ziinc Ziinc closed this as completed Jan 13, 2020