
Allow storing extra XPATHs / add another pagination option #15

Closed
mridang opened this issue Oct 17, 2012 · 5 comments

Comments

@mridang

mridang commented Oct 17, 2012

Currently only 5 XPATH types are stored — STANDARD, STANDARD_UPDATE, DETAIL, BASE and IMAGE. It would be good to have another section called EXTRA.

Quite often I need to access an XPATH value that is not necessarily mapped to a model field. In my case, I need an additional XPATH for finding the next pagination link and have had to resort to using one of the other fields as a hack.

@holgerd77
Owner

How are you using this EXTRA field at the moment? In the pipeline processing method? Or do you have this somehow connected to the pagination provided by DDS? I'm having some trouble visualizing how this is being used / how the workflow goes.

@mridang
Author

mridang commented Oct 24, 2012

Here's an issue I encountered. I needed to scrape items from a page every hour. The page is paginated, and as items get older they move deeper into the pages. I needed to scrape items until the spider encounters an item that has already been scraped, at which point it should close — but the items already in the pipeline still need to be processed. If I raise a CloseSpider exception from my spider, the spider stops and so does the pipeline. Get my point?
I need to store an extra XPATH that points to the next-page link on the page. I can't use the pagination mechanism you've provided, because that would put that many URLs into my spider's start URLs. If I provide a pagination range of 0,50,1 and a URL like http://mysite.com/{page}/, it would put 50 URLs into my start_urls and my spider would uselessly crawl all of them.
I've tried my best to explain this, but if you're still lost, I'll be glad to elaborate even more. Thanks.
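The crawl pattern described above can be sketched in plain Python. This is an illustrative stand-in, not DDS or Scrapy code: pages are modeled as a dict mapping a URL to its items and its "next page" URL, and the crawl follows the next-page link dynamically, stopping once a previously scraped item is seen while still keeping the new items found on that final page.

```python
# Hypothetical sketch of the "follow next-page link until a known item
# appears" workflow. `pages`, `crawl`, and the URLs are all made up for
# illustration; a real spider would fetch and parse live pages instead.
def crawl(pages, start, seen):
    """pages: {url: (items, next_url)} -- a stand-in for fetching pages."""
    new_items = []
    url = start
    while url:
        items, next_url = pages[url]
        stop = False
        for item in items:
            if item in seen:
                stop = True  # known item reached: finish this page, then stop
            else:
                new_items.append(item)
        if stop:
            break
        url = next_url
    return new_items

pages = {
    "/1/": (["c", "b"], "/2/"),
    "/2/": (["a", "old1"], "/3/"),  # "old1" was scraped on a previous run
    "/3/": (["old2"], None),        # never reached
}
print(crawl(pages, "/1/", {"old1", "old2"}))  # ['c', 'b', 'a']
```

Because the loop finishes processing the current page before stopping, the new items on the boundary page still reach the pipeline, unlike an abrupt CloseSpider.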

@mridang
Author

mridang commented Oct 24, 2012

Another idea is to add another pagination option called "Next Page Link" or something like that. The user can then store an XPATH that points to the next-page link, so the next page URL can be resolved dynamically. Here's an example of scraping a paginated site. This is probably the easiest to implement: just add another pagination option.
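The proposed option boils down to evaluating a stored XPATH against the page to pull out the next-page URL. A minimal sketch using only the standard library's ElementTree (which supports a small XPath subset) on a well-formed fragment; the class name, XPATH string, and URL are assumptions for illustration, not anything DDS defines:

```python
# Illustrative only: resolve a "Next Page Link" from a stored XPATH.
# A real scraper would use Scrapy selectors on live HTML instead.
import xml.etree.ElementTree as ET

PAGE = """<html><body>
  <div class="items"><span>item 1</span><span>item 2</span></div>
  <a class="next" href="http://mysite.com/2/">Next</a>
</body></html>"""

# The XPATH the user would store alongside STANDARD, DETAIL, etc.
NEXT_PAGE_XPATH = './/a[@class="next"]'

def next_page_url(page_source):
    root = ET.fromstring(page_source)
    link = root.find(NEXT_PAGE_XPATH)
    return link.get("href") if link is not None else None

print(next_page_url(PAGE))  # http://mysite.com/2/
```

With such an option the spider would request whatever URL this returns, and stop naturally when the expression matches nothing, instead of pre-generating a fixed range of start URLs.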

@holgerd77
Owner

That sounds easier to me; I'll think about it.

@holgerd77
Owner

You mentioned an example earlier but didn't post anything. Could you give me a couple of example links showing how the pagination is built up?
