I'm using Sidekiq on my web crawler (beautiful project, this Sidekiq). I have two kinds of workers, a BatchWorker and a SiteWorker. The BatchWorker is in charge of parsing a CSV with 20,000 URLs and creating SiteWorkers, which perform the actual crawl. On my development PC everything runs fine and I see 20-ish busy workers on average, but on my server I only see 8 busy on average at the same time. What actually triggers a new worker to start? I mean, why isn't it running all 25 at the same time?
Did you resolve this? I need more info about your configuration.
I upgraded the server specs and now I can start with 40 concurrent workers, which is fine for my purposes. My issue is that over time I'm losing concurrency, and jobs take many times longer (20x sometimes) to complete their work.
In my app the user loads a txt file with 100k URLs in it. I create two kinds of workers for it: one worker (MainWorker) that creates another worker (SiteWorker) for each URL in the txt file. The SiteWorkers do the actual job of scraping the websites. My problem is that even though everything starts very well, with almost 40 concurrent workers and each job done in less than a second, by the time the number of processed jobs reaches about 80k or so it has become very slow:
2012-10-29T16:27:14Z 6073 TID-ow6jxn7b0 SiteWorker MSG-ow6ircvm8 INFO: done: 29.916 sec
and I'm only getting 5 busy workers at a time.
Where might the issue be? What can I do to debug this?
My system is an Amazon EC2 High-CPU Medium instance: 2 cores of an Intel Xeon E5410 @ 2.33GHz.
That sounds like a memory or resource leak, possibly in one of the gems you are using. I think some people have reported memory issues with Nokogiri if you are using that for scraping.
I'm using Mechanize, which in fact uses Nokogiri for scraping. Did the people having issues mention what they used instead?