
accidental work of grab.spider #298

Closed
EnzoRondo opened this issue Mar 14, 2018 · 15 comments

@EnzoRondo commented Mar 14, 2018

Hey there (@lorien), thanks a lot for the great library 😄

I am learning your library and I'm seeing unexpected behavior. Here is my code sample, based on an example from the documentation:

import csv
import logging
import re

from grab.spider import Spider, Task


class ExampleSpider(Spider):
    def create_grab_instance(self, **kwargs):
        g = super(ExampleSpider, self).create_grab_instance(**kwargs)
        g.setup(proxy='127.0.0.1:8090', proxy_type='socks5', timeout=60, connect_timeout=15)
        return g

    def task_generator(self):
        # Only the first listing page is requested in this example
        for i in range(1, 1 + 1):
            page_url = "{}{}/".format("https://www.mourjan.com/properties/", i)
            # print("page url: {}".format(page_url))
            yield Task('stage_two', url=page_url)

    def prepare(self):
        # Prepare the file handler to save results.
        # The method `prepare` is called one time before the
        # spider has started working
        self.result_file = csv.writer(open('result.txt', 'w'))

        # This counter will be used to enumerate saved results
        self.result_counter = 0

    def task_stage_two(self, grab, task):
        # Each listing keeps its target URL inside an onclick handler
        # of the form wo('...'); extract the URL part with a regex
        for elem in grab.doc.select("//li[@itemprop='itemListElement']//p")[0:4]:
            part = elem.attr("onclick")
            url_part = re.search(r"(?<=wo\(\').*(?=\'\))", part).group()
            end_url = grab.make_url_absolute(url_part)
            yield Task('stage_three', url=end_url)

    def task_stage_three(self, grab, task):
        # First, save URL and title into dictionary
        post = {
            'url': task.url,
            'title': grab.doc.xpath_text("//title/text()"),
        }
        self.result_file.writerow([
            post['url'],
            post['title'],
        ])
        # Increment the result counter
        self.result_counter += 1


if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    # Let's start spider with two network concurrent streams
    bot = ExampleSpider(thread_number=2)
    bot.run()

First run:

DEBUG:grab.spider.base:Using memory backend for task queue
DEBUG:grab.network:[01] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[02] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[03] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[04] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[05] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.stat:RPS: 7.35 [error:multi-added-already=5, network-count-rejected=1]
DEBUG:grab.spider.parser_pipeline:Started shutdown of parser process: Thread-1
DEBUG:grab.spider.parser_pipeline:Finished joining parser process: Thread-1
DEBUG:grab.spider.base:Main process [pid=4064]: work done

😕

Then I ran the code again, ~20 more attempts with the same failure every time, but the 21st attempt succeeded and I saw what I expected:

DEBUG:grab.spider.base:Using memory backend for task queue
DEBUG:grab.network:[01] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.stat:RPS: 0.52 []
DEBUG:grab.network:[02] GET https://www.mourjan.com/kw/kuwait/warehouses/rental/10854564/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[03] GET https://www.mourjan.com/ae/abu-dhabi/apartments/rental/11047384/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[04] GET https://www.mourjan.com/kw/kuwait/villas-and-houses/rental/11041455/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[05] GET https://www.mourjan.com/ae/abu-dhabi/apartments/rental/11009663/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.stat:RPS: 2.36 []
DEBUG:grab.stat:RPS: 1.28 []
DEBUG:grab.spider.parser_pipeline:Started shutdown of parser process: Thread-1
DEBUG:grab.spider.parser_pipeline:Finished joining parser process: Thread-1
DEBUG:grab.spider.base:Main process [pid=4860]: work done

Why does this happen?

@EnzoRondo (Author)

It is probably related to this part of the code:

    def create_grab_instance(self, **kwargs):
        g = super(ExampleSpider, self).create_grab_instance(**kwargs)
        g.setup(proxy='127.0.0.1:8090', proxy_type='socks5', timeout=60, connect_timeout=15)
        return g

I copied this snippet from the documentation; my goal is to set a proxy and other settings for Grab. Without this part of the code everything works fine. It seems I need to use an alternative here.
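
For what it's worth, one alternative that avoids create_grab_instance entirely is the spider's proxy list support. This is only a sketch: it assumes a hypothetical proxies.txt file containing the single line 127.0.0.1:8090, and the load_proxylist API as shown in the documentation:

import logging

from grab.spider import Spider


class ExampleSpider(Spider):
    # task_generator, prepare and the handlers stay as in the first post
    pass


if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    bot = ExampleSpider(thread_number=2)
    # proxies.txt is a hypothetical file with one proxy per line
    bot.load_proxylist('proxies.txt', 'text_file', proxy_type='socks5')
    bot.run()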

lorien (Owner) commented Apr 13, 2018

@EnzoRondo I do not understand the question.

Why does this happen?

What exactly happens?

You've provided two log outputs, and I do not see a big difference between them. From these logs I cannot tell what your spider did incorrectly (or failed to do correctly).

lorien added the bug label Apr 14, 2018
lorien (Owner) commented Apr 22, 2018

Closing this issue until @EnzoRondo provides additional details about what he meant.

lorien closed this as completed Apr 22, 2018
EnzoRondo (Author) commented Apr 23, 2018

What exactly happens?

Look at that:

DEBUG:grab.network:[01] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[02] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[03] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[04] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[05] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5

Grab tries to fetch the same URL five times instead of the right one.

oiwn (Contributor) commented Apr 28, 2018

Does it work without the socks5 proxy? As far as I remember, pycurl has an issue with socks5.
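
A quick way to check is a sketch of the first post's create_grab_instance with the proxy arguments removed (timeouts kept as they were):

    def create_grab_instance(self, **kwargs):
        g = super(ExampleSpider, self).create_grab_instance(**kwargs)
        # Same timeouts as before, but no socks5 proxy
        g.setup(timeout=60, connect_timeout=15)
        return g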

@EnzoRondo (Author)

@istinspring it works without the proxy

lorien (Owner) commented Apr 28, 2018

@EnzoRondo Spider does not work correctly with socks5 in multicurl mode.
If you want to use Spider with a socks proxy, then you HAVE TO use the urllib3 transport and the threaded network service:

bot = SomeSpider(network_service='threaded', grab_transport='urllib3')
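
Applied to the ExampleSpider from the first post, the entry point would look like this (a sketch only; the spider class itself stays unchanged):

if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    # Threaded network service + urllib3 transport: the combination
    # that handles socks5 proxies correctly
    bot = ExampleSpider(thread_number=2,
                        network_service='threaded',
                        grab_transport='urllib3')
    bot.run()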

@EnzoRondo (Author)

I translated this post and understood that it is possible to use socks5 with the threaded transport, but now you are saying something different, so which is correct?

Tested: bot = SomeSpider(network_service='threaded', grab_transport='urllib3') works perfectly, thanks.

Will that bug be fixed in the future? I am using grab.spider in a different project, and the create_grab_instance function is the one place where I have problems with it.

lorien (Owner) commented May 2, 2018

Works with socks5:

  • threaded network service & pycurl grab transport
  • threaded network service & urllib3 grab transport

Does not work with socks5:

  • multicurl network service & pycurl grab transport (this is not a bug in Grab, it is a bug in the pycurl library)
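
In constructor terms, the combinations above look like this (a sketch, reusing the SomeSpider name from the earlier comment):

bot = SomeSpider(network_service='threaded', grab_transport='pycurl')   # works with socks5
bot = SomeSpider(network_service='threaded', grab_transport='urllib3')  # works with socks5

bot = SomeSpider(grab_transport='pycurl')  # default multicurl network service: broken with socks5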

EnzoRondo (Author) commented May 3, 2018

threaded network service & urllib3 grab transport

works perfectly

threaded network service & pycurl grab transport

works, but the bug from the first post is still there

lorien (Owner) commented May 4, 2018

works, but the bug from the first post is still there

The code from the first post does NOT use the threaded network service.

@EnzoRondo (Author)

Yep, but I have tried:

bot = ExampleSpider(thread_number=2, network_service='threaded', grab_transport='pycurl')

and the bug is still there

lorien (Owner) commented May 5, 2018

So what do you want from me? I do not know what you have and have not tried.
Please provide exact and detailed information about what you did (minimal working source code), what you expected to get, and what you got instead.

@EnzoRondo (Author)

So what do you want from me?

To fix the bug, but I can't reproduce it now on the latest dev build 😕. Probably one of your commits fixed this issue. Thanks a lot, friend! The spider now works more stably 👍

@EnzoRondo (Author)

I am very happy to have no problems here. Thanks a lot again, I appreciate it 😎
