
accidental work of grab.spider #298

Closed
EnzoRondo opened this issue Mar 14, 2018 · 15 comments

@EnzoRondo commented Mar 14, 2018

Hey there (@lorien), thanks a lot for the great library 😄

I am learning your library and I'm seeing unexpected behavior. Here is my code sample, based on an example from the documentation:

import csv
import logging
import re

from grab.spider import Spider, Task


class ExampleSpider(Spider):
    def create_grab_instance(self, **kwargs):
        g = super(ExampleSpider, self).create_grab_instance(**kwargs)
        g.setup(proxy='127.0.0.1:8090', proxy_type='socks5', timeout=60, connect_timeout=15)
        return g

    def task_generator(self):
        # Only the first listing page is requested in this example
        for i in range(1, 1 + 1):
            page_url = "{}{}/".format("https://www.mourjan.com/properties/", i)
            # print("page url: {}".format(page_url))
            yield Task('stage_two', url=page_url)

    def prepare(self):
        # Prepare the file handler to save results.
        # The method `prepare` is called one time before the
        # spider has started working
        self.result_file = csv.writer(open('result.txt', 'w'))

        # This counter will be used to enumerate saved results
        self.result_counter = 0

    def task_stage_two(self, grab, task):
        # Each listing keeps its target URL inside an onclick handler
        # of the form wo('...'); extract the URL part with a regex
        for elem in grab.doc.select("//li[@itemprop='itemListElement']//p")[0:4]:
            part = elem.attr("onclick")
            url_part = re.search(r"(?<=wo\(\').*(?=\'\))", part).group()
            end_url = grab.make_url_absolute(url_part)
            yield Task('stage_three', url=end_url)

    def task_stage_three(self, grab, task):
        # First, save URL and title into dictionary
        post = {
            'url': task.url,
            'title': grab.doc.xpath_text("//title/text()"),
        }
        self.result_file.writerow([
            post['url'],
            post['title'],
        ])
        # Increment the result counter
        self.result_counter += 1


if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    # Let's start spider with two network concurrent streams
    bot = ExampleSpider(thread_number=2)
    bot.run()

First run:

DEBUG:grab.spider.base:Using memory backend for task queue
DEBUG:grab.network:[01] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[02] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[03] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[04] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[05] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.stat:RPS: 7.35 [error:multi-added-already=5, network-count-rejected=1]
DEBUG:grab.spider.parser_pipeline:Started shutdown of parser process: Thread-1
DEBUG:grab.spider.parser_pipeline:Finished joining parser process: Thread-1
DEBUG:grab.spider.base:Main process [pid=4064]: work done

😕

Then I ran the code again, ~20 more attempts with the same failure every time, but the 21st attempt succeeded and I saw what I expected:

DEBUG:grab.spider.base:Using memory backend for task queue
DEBUG:grab.network:[01] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.stat:RPS: 0.52 []
DEBUG:grab.network:[02] GET https://www.mourjan.com/kw/kuwait/warehouses/rental/10854564/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[03] GET https://www.mourjan.com/ae/abu-dhabi/apartments/rental/11047384/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[04] GET https://www.mourjan.com/kw/kuwait/villas-and-houses/rental/11041455/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[05] GET https://www.mourjan.com/ae/abu-dhabi/apartments/rental/11009663/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.stat:RPS: 2.36 []
DEBUG:grab.stat:RPS: 1.28 []
DEBUG:grab.spider.parser_pipeline:Started shutdown of parser process: Thread-1
DEBUG:grab.spider.parser_pipeline:Finished joining parser process: Thread-1
DEBUG:grab.spider.base:Main process [pid=4860]: work done

Why does this happen?

@EnzoRondo (Author)

It is probably related to this part of the code:

    def create_grab_instance(self, **kwargs):
        g = super(ExampleSpider, self).create_grab_instance(**kwargs)
        g.setup(proxy='127.0.0.1:8090', proxy_type='socks5', timeout=60, connect_timeout=15)
        return g

I copied this snippet from the documentation; my goal is to set a proxy and other settings for Grab. Without this part of the code everything works fine. It seems I need to use an alternative here.
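
For what it's worth, one alternative that avoids create_grab_instance entirely is the spider's proxy list support. This is only a sketch: it assumes a hypothetical proxies.txt file containing the single line 127.0.0.1:8090, and the load_proxylist API as shown in the documentation:

import logging

from grab.spider import Spider


class ExampleSpider(Spider):
    # task_generator, prepare and the handlers stay as in the first post
    pass


if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    bot = ExampleSpider(thread_number=2)
    # proxies.txt is a hypothetical file with one proxy per line
    bot.load_proxylist('proxies.txt', 'text_file', proxy_type='socks5')
    bot.run()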

lorien (Owner) commented Apr 13, 2018

@EnzoRondo I do not understand the question.

Why does this happen?

What exactly happens?

You've provided two log outputs, and I do not see a big difference between them. From these logs I cannot tell what your spider did incorrectly (or failed to do correctly).

lorien added the bug label Apr 14, 2018
lorien (Owner) commented Apr 22, 2018

Closing this issue until @EnzoRondo provides additional details about what he meant.

lorien closed this as completed Apr 22, 2018
EnzoRondo (Author) commented Apr 23, 2018

What exactly happens?

Look at that:

DEBUG:grab.network:[01] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[02] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[03] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[04] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[05] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5

Grab tries to fetch the same URL five times instead of the right one.

oiwn (Contributor) commented Apr 28, 2018

Does it work without the socks5 proxy? As far as I remember, pycurl has an issue with socks5.
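
A quick way to check is a sketch of the first post's create_grab_instance with the proxy arguments removed (timeouts kept as they were):

    def create_grab_instance(self, **kwargs):
        g = super(ExampleSpider, self).create_grab_instance(**kwargs)
        # Same timeouts as before, but no socks5 proxy
        g.setup(timeout=60, connect_timeout=15)
        return g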

@EnzoRondo (Author)

@istinspring it works without the proxy

lorien (Owner) commented Apr 28, 2018

@EnzoRondo Spider does not work correctly with socks5 in multicurl mode.
If you want to use Spider with a socks proxy, then you HAVE TO use the urllib3 transport and the threaded network service:

bot = SomeSpider(network_service='threaded', grab_transport='urllib3')
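
Applied to the ExampleSpider from the first post, the entry point would look like this (a sketch only; the spider class itself stays unchanged):

if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    # Threaded network service + urllib3 transport: the combination
    # that handles socks5 proxies correctly
    bot = ExampleSpider(thread_number=2,
                        network_service='threaded',
                        grab_transport='urllib3')
    bot.run()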

@EnzoRondo (Author)

I translated this post and understood that it is possible to use socks5 with the threaded transport, but now you are saying something different, so which is correct?

Tested: bot = SomeSpider(network_service='threaded', grab_transport='urllib3') works perfectly, thanks.

Will that bug be fixed in the future? I am using grab.spider in a different project, and the create_grab_instance function is the one place where I have problems with it.

lorien (Owner) commented May 2, 2018

Works with socks5:

  • threaded network service & pycurl grab transport
  • threaded network service & urllib3 grab transport

Does not work with socks5:

  • multicurl network service & pycurl grab transport (this is not a bug in Grab, it is a bug in the pycurl library)
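
In constructor terms, the combinations above look like this (a sketch, reusing the SomeSpider name from the earlier comment):

bot = SomeSpider(network_service='threaded', grab_transport='pycurl')   # works with socks5
bot = SomeSpider(network_service='threaded', grab_transport='urllib3')  # works with socks5

bot = SomeSpider(grab_transport='pycurl')  # default multicurl network service: broken with socks5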

EnzoRondo (Author) commented May 3, 2018

threaded network service & urllib3 grab transport

works perfectly

threaded network service & pycurl grab transport

works, but the bug from the first post is still there

lorien (Owner) commented May 4, 2018

works, but the bug from the first post is still there

The code from the first post does NOT use the threaded network service.

@EnzoRondo (Author)

Yep, but I have tried:

bot = ExampleSpider(thread_number=2, network_service='threaded', grab_transport='pycurl')

and the bug is still there

lorien (Owner) commented May 5, 2018

So what do you want from me? I do not know what you have and have not tried.
Please provide exact and detailed information about what you did (minimal working source code), what you expected to get, and what you got instead.

@EnzoRondo (Author)

So what do you want from me?

To fix the bug, but I can't reproduce it now on the latest dev build 😕. Probably one of your commits fixed this issue. Thanks a lot, friend! The spider now works more stably 👍

@EnzoRondo (Author)

I am very happy to have no problems here. Thanks a lot again, I appreciate it 😎
