Issues with Parallel pages html content download #1995
How about using cluster?
Standard networking applies and the browser needs to delegate and prioritize resource downloads. You're most likely simply saturating the network.
@nammaianh The issue I am referring to is about how many parallel operations can be done asynchronously on a single node, irrespective of whether it is part of a cluster or not.
@Garbee All resources, including CPU, RAM, and network, are stable. The instance type I am using has a network bandwidth of 2000 Mbps. I have performed a similar exercise with a batch of 5 domains and got the same results, so the network is surely not the culprit here. What I observe is that the 'goto' operations for all URLs start simultaneously, but their completion and the 'download' operation occur sequentially. Some blocking operation is happening that results in this sequential completion. Sample output for a batch size of 5 with their respective completion timestamps:
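The pattern described above (all navigations start together, completions trickle in) can be separated from Puppeteer itself. A self-contained sketch, using simulated tasks with made-up durations in place of real `page.goto` calls, shows what truly parallel async work looks like: start timestamps cluster near zero, and completion order depends only on each task's own duration. If real navigations deviate from this, the contention is in the browser or host, not in `Promise.all`.

```javascript
// Stand-in for a page navigation; durations are hypothetical.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function timeBatch(durations) {
  const t0 = Date.now();
  const events = [];
  await Promise.all(durations.map(async (ms, i) => {
    events.push({ task: i, phase: 'start', at: Date.now() - t0 });
    await sleep(ms); // stands in for: await page.goto(url)
    events.push({ task: i, phase: 'done', at: Date.now() - t0 });
  }));
  return events;
}

timeBatch([300, 100, 200]).then((events) => {
  const starts = events.filter((e) => e.phase === 'start').map((e) => e.at);
  // All starts happen before the first await yields, so the spread is ~0 ms.
  console.log('start spread (ms):', Math.max(...starts) - Math.min(...starts));
  const doneOrder = events.filter((e) => e.phase === 'done').map((e) => e.task);
  console.log('completion order:', doneOrder); // [1, 2, 0]: shortest finishes first
});
```

Logging start and done timestamps around each real `goto` the same way makes it easy to see whether completions are merely slow or genuinely serialized.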
@engineerkushal I'm having a similar issue. Did you find a solution?
@kushaljshah Puppeteer v1.10.0 has an unpleasant warning when navigating more than 10 pages simultaneously; that is addressed in #3560. Other than that, the following works fine for me (but takes some time).

const puppeteer = require('puppeteer');
(async() => {
const browser = await puppeteer.launch();
const urls = [
'https://example.com',
'https://google.com',
'https://yahoo.com',
'https://cnn.com',
'https://bbc.com',
'https://reddit.com',
'https://twitter.com',
'https://facebook.com',
'https://apple.com',
'https://en.wikipedia.org',
'https://github.com',
'https://sr.ht',
];
await Promise.all(urls.map(async (url, index) => {
console.log(`${index}: started`);
const page = await browser.newPage();
console.log(`${index}: navigating...`);
await page.goto(url, {timeout: 0});
console.log(`${index}: done.`);
await page.close();
}));
await browser.close();
})();

I don't think there's any artificial limitation on the number of parallel pages, but they do fight each other for the host's resources. Hope this helps.
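Since the pages fight each other for host resources, one common mitigation is to cap how many navigate at once instead of firing all of them through a single `Promise.all`. Below is a minimal concurrency-limiter sketch (not from the thread; `runWithConcurrency` is a hypothetical helper): a fixed number of "lanes" each pull the next item as soon as they finish, so a slow page delays only its own lane rather than the whole batch.

```javascript
// Run `worker` over `items`, at most `limit` at a time.
async function runWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // shared cursor; safe because JS is single-threaded
  const lanes = Array.from({ length: Math.min(limit, items.length) }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  });
  await Promise.all(lanes);
  return results;
}

// Dummy usage with simulated work. With Puppeteer, `worker` would instead
// open a page, `goto` the URL, and close the page (an assumption, not the
// thread author's code).
const sleep = (ms) => new Promise((r) => setTimeout(r, ms));
runWithConcurrency([50, 10, 30, 20], 2, async (ms, i) => {
  await sleep(ms);
  return i;
}).then((out) => console.log(out)); // [0, 1, 2, 3]
```

Results come back indexed by input position regardless of completion order, which keeps the batch output stable while still letting fast pages finish early.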
@aslushnikov I don't think the problem was ever with navigating multiple pages. It's just that the amount of time taken was really long, while all the system resources, including CPU, memory, load average, and the operation queue, were well under threshold values. Since the expected behavior is that the I/O involved in each page navigation happens in parallel, the time taken by the batch is not justified compared to the time taken when the pages are navigated individually. I will get some fresh numbers around this and post them here.
It does not seem that any parallel download is happening when I download a batch of URLs. Each URL's download completes much later than it does when downloaded individually.
I am using a 2-core EC2 machine with 4 GB RAM. All system resources are quite stable with a batch size of 30 URLs. The connection timeout used here is 30 seconds. There are URLs which complete as late as 15, 20, or 23 seconds, yet complete in about 3-7 seconds when run individually. Isn't parallel I/O supposed to happen here? What am I missing?