Allow storing extra XPATHs / add another pagination option #15
Comments
How are you using this EXTRA field at the moment? In the pipeline processing method? Or have you connected it somehow to the pagination provided by DDS? I'm having trouble visualizing how this is being used and what the workflow looks like.
Here's an issue I encountered. I needed to scrape items from a page every hour. The page is paginated, and as items get older they move deeper into the pages. I needed the spider to scrape items until it encounters one that has already been scraped, and then close. The items already in the pipeline still need to be processed. If I raise a
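To make the workflow concrete, here is a minimal sketch of the "stop at the first already-scraped item" logic in plain Python. It is not DDS or Scrapy code: `scrape_until_seen`, `pages`, and `already_scraped` are hypothetical names, and the item ids are made up. It assumes pages list the newest items first.

```python
from typing import Iterable, Iterator, Set


def scrape_until_seen(pages: Iterable[Iterable[str]], seen_ids: Set[str]) -> Iterator[str]:
    """Yield new item ids page by page and stop at the first known id."""
    for page in pages:
        for item_id in page:
            if item_id in seen_ids:
                # Assumption: items are ordered newest first, so everything
                # after a known item was handled by an earlier run.
                return
            seen_ids.add(item_id)
            yield item_id


# Hypothetical data: three result pages, newest items first.
pages = [["item-9", "item-8"], ["item-7", "item-6"], ["item-5", "item-4"]]
already_scraped = {"item-6", "item-5", "item-4"}

new_items = list(scrape_until_seen(pages, already_scraped))
```

Items yielded before the stop are still handed on for processing, and if `pages` is a lazy iterable, later pages are never requested once a known item shows up.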
Another idea is to add another pagination option called "Next Page Link" or something like that. The user could then store an XPATH that points to the next-page link, so the next page URL can be obtained dynamically. Here's an example of scraping a paginated site. This is probably the easiest to do: just add another pagination option.
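A rough sketch of what a "Next Page Link" option could do with a stored XPATH, assuming the page can be parsed as well-formed markup. It uses only the standard library so it runs on its own; `xml.etree.ElementTree` supports just a limited XPath subset, so real HTML would need a proper selector library. `next_page_url`, the sample markup, and the URLs are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urljoin


def next_page_url(page_source: str, page_url: str, next_link_xpath: str) -> Optional[str]:
    """Return the absolute URL of the next page, or None on the last page."""
    root = ET.fromstring(page_source)
    link = root.find(next_link_xpath)
    if link is None or not link.get("href"):
        return None
    # Relative links are resolved against the URL of the current page.
    return urljoin(page_url, link.get("href"))


# Hypothetical page markup and stored XPATH.
PAGE = (
    "<html><body><ul><li>Item A</li></ul>"
    '<a class="next" href="/items?page=2">Next</a></body></html>'
)
LAST_PAGE = "<html><body><ul><li>Item Z</li></ul></body></html>"
NEXT_LINK_XPATH = ".//a[@class='next']"

url = next_page_url(PAGE, "http://example.com/items", NEXT_LINK_XPATH)
end = next_page_url(LAST_PAGE, "http://example.com/items?page=9", NEXT_LINK_XPATH)
```

The spider would keep following the returned URL until the function returns `None`, so no page count or URL pattern has to be configured up front.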
That sounds easier to me, I'll think about that.
You mentioned an example earlier but didn't post anything. Could you give me a couple of example links showing how the pagination is built up?
Currently only 5 XPATH types are stored: STANDARD, STANDARD_UPDATE, DETAIL, BASE and IMAGE. It would be good to have another section called EXTRA.
I often need to access an XPATH value that is not necessarily mapped to a model field. In my case, I need an additional XPATH for finding the next pagination link, and I have had to resort to using one of the other fields as a hack.