
Allow storing extra XPATHs / add another pagination option #15

Closed
mridang opened this issue Oct 17, 2012 · 5 comments

Comments

@mridang

mridang commented Oct 17, 2012

Currently only 5 XPATH types are stored — STANDARD, STANDARD_UPDATE, DETAIL, BASE and IMAGE. It would be good to have another section called EXTRA.

Quite often I need to access an XPATH value that is not necessarily mapped to a model field. In my case, I need an additional XPATH for finding the next pagination link and have had to resort to using one of the other fields as a hack.

@holgerd77
Owner

How are you using this EXTRA field at the moment? In the pipeline processing method? Or do you have this somehow connected to the pagination provided by DDS? I'm having some trouble visualizing how this is being used / how the workflow goes.

@mridang
Author

mridang commented Oct 24, 2012

Here's an issue I encountered. I needed to scrape items from a page every hour. The page is paginated, and as items get older they move deeper into the pages. I needed to scrape items until the spider encounters an item that has already been scraped, at which point it should close — but the items already in the pipeline still need to be processed. If I raise a CloseSpider exception from my spider, the spider stops and so does the pipeline. Get my point?
I need to store an extra XPATH that points to the next-page link on the page. I can't use the pagination mechanism you've provided, because that would put that many URLs into my spider's start URLs. If I provide a pagination range of 0,50,1 and a URL like http://mysite.com/{page}/, it would put 50 URLs into my start_urls and my spider would uselessly crawl all of them.
I've tried my best to explain this, but if you're still lost, I'll be glad to elaborate even more. Thanks.
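The crawl pattern described above can be sketched in plain Python. This is an illustrative stand-in, not DDS or Scrapy code: pages are modeled as a dict mapping a URL to its items and its "next page" URL, and the crawl follows the next-page link dynamically, stopping once a previously scraped item is seen while still keeping the new items found on that final page.

```python
# Hypothetical sketch of the "follow next-page link until a known item
# appears" workflow. `pages`, `crawl`, and the URLs are all made up for
# illustration; a real spider would fetch and parse live pages instead.
def crawl(pages, start, seen):
    """pages: {url: (items, next_url)} -- a stand-in for fetching pages."""
    new_items = []
    url = start
    while url:
        items, next_url = pages[url]
        stop = False
        for item in items:
            if item in seen:
                stop = True  # known item reached: finish this page, then stop
            else:
                new_items.append(item)
        if stop:
            break
        url = next_url
    return new_items

pages = {
    "/1/": (["c", "b"], "/2/"),
    "/2/": (["a", "old1"], "/3/"),  # "old1" was scraped on a previous run
    "/3/": (["old2"], None),        # never reached
}
print(crawl(pages, "/1/", {"old1", "old2"}))  # ['c', 'b', 'a']
```

Because the loop finishes processing the current page before stopping, the new items on the boundary page still reach the pipeline, unlike an abrupt CloseSpider.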

@mridang
Author

mridang commented Oct 24, 2012

Another idea is to add another pagination option called "Next Page Link" or something like that. The user can then store an XPATH that points to the next-page link, so the next page URL can be resolved dynamically. Here's an example of scraping a paginated site. This is probably the easiest to implement: just add another pagination option.
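The proposed option boils down to evaluating a stored XPATH against the page to pull out the next-page URL. A minimal sketch using only the standard library's ElementTree (which supports a small XPath subset) on a well-formed fragment; the class name, XPATH string, and URL are assumptions for illustration, not anything DDS defines:

```python
# Illustrative only: resolve a "Next Page Link" from a stored XPATH.
# A real scraper would use Scrapy selectors on live HTML instead.
import xml.etree.ElementTree as ET

PAGE = """<html><body>
  <div class="items"><span>item 1</span><span>item 2</span></div>
  <a class="next" href="http://mysite.com/2/">Next</a>
</body></html>"""

# The XPATH the user would store alongside STANDARD, DETAIL, etc.
NEXT_PAGE_XPATH = './/a[@class="next"]'

def next_page_url(page_source):
    root = ET.fromstring(page_source)
    link = root.find(NEXT_PAGE_XPATH)
    return link.get("href") if link is not None else None

print(next_page_url(PAGE))  # http://mysite.com/2/
```

With such an option the spider would request whatever URL this returns, and stop naturally when the expression matches nothing, instead of pre-generating a fixed range of start URLs.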

@holgerd77
Owner

That sounds easier to me; I'll think about it.

@holgerd77
Owner

You mentioned an example earlier but didn't post anything. Could you give me a couple of example links showing how the pagination is built up?
