
Issues with Parallel pages html content download #1995

Closed
kushaljshah opened this issue Feb 8, 2018 · 6 comments

@kushaljshah

Parallel downloads do not seem to be happening when I download a batch of URLs. URLs in a batch complete much later than they do when downloaded individually.

I am using a 2-core EC2 machine with 4 GB of RAM. All system resources stay stable with a batch size of 30 URLs. The connection timeout used here is 30 seconds. There are URLs that complete as late as 15, 20, or 23 seconds into the batch but finish in about 3-7 seconds when run individually. Isn't parallel I/O supposed to happen here? What am I missing?

const puppeteer = require('puppeteer');

// urlArr is defined elsewhere; this snippet runs inside an async function.
const connectionTimeOut = 30000; // 30 seconds
const startDate = new Date();

await puppeteer.launch().then(async browser => {
  const promises = [];
  urlArr.forEach(function (url) {
    promises.push(browser.newPage().then(async page => {
      try {
        await page.goto(url.startsWith('http') ? url : `http://${url}`, {
          waitUntil: ['load'], timeout: connectionTimeOut
        });

        const html = await page.content();
        const finishDate = new Date();
        console.log("html download complete for " + url + " - " + finishDate.getTime() + " - " + (finishDate.getTime() - startDate.getTime()) / 1000);
      } catch (err) {
        console.log(err);
      }
    }));
  });
  await Promise.all(promises);
  await browser.close();
});
@nammaianh

How about using cluster?
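
For example, a rough sketch of that idea using Node's built-in cluster module, with one Chromium instance per worker process (the URL list, slicing logic, and timeout here are placeholders, not a tested recipe):

const cluster = require('cluster');
const os = require('os');
const puppeteer = require('puppeteer');

const urlArr = ['http://example.com', 'http://example.org']; // placeholder URLs

if (cluster.isMaster) {
  // Fork one worker per CPU (or fewer if there are fewer URLs) and hand
  // each worker its slice of the URL list via an environment variable.
  const workers = Math.min(os.cpus().length, urlArr.length);
  for (let i = 0; i < workers; i++) {
    const slice = urlArr.filter((_, j) => j % workers === i);
    cluster.fork({ URLS: JSON.stringify(slice) });
  }
} else {
  (async () => {
    const browser = await puppeteer.launch();
    for (const url of JSON.parse(process.env.URLS)) {
      const page = await browser.newPage();
      await page.goto(url, { timeout: 30000 });
      const html = await page.content();
      console.log(`worker ${process.pid}: ${url} - ${html.length} bytes`);
      await page.close();
    }
    await browser.close();
    process.exit(0);
  })();
}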

@Garbee
Contributor

Garbee commented Feb 12, 2018

Standard networking applies and the browser needs to delegate and prioritize resource downloads. You're most likely simply saturating the network.
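
If network saturation is the suspicion, one way to test that theory is to block heavy resource types during navigation so pages compete less for bandwidth. A sketch using Puppeteer's request interception (the blocked types are an arbitrary choice, and this assumes a scope where browser is available):

// Run once per page, before page.goto().
const page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', request => {
  const type = request.resourceType();
  // Skip bandwidth-heavy resources; let documents, scripts, and XHR through.
  if (type === 'image' || type === 'stylesheet' || type === 'font' || type === 'media') {
    request.abort();
  } else {
    request.continue();
  }
});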

@kushaljshah
Author

kushaljshah commented Feb 12, 2018

@nammaianh The issue I am referring to is about how many parallel operations can be performed asynchronously on a single node, irrespective of whether it is part of a cluster.

@Garbee All resources, including CPU, RAM, and network, are stable. The instance type I am using has a network bandwidth of 2000 Mbps. I have performed the same exercise with a batch of 5 domains and got the same results, so the network is surely not the culprit here.

What I observe is that the 'goto' operations for all URLs start simultaneously, but their completion and the subsequent 'download' operations occur sequentially. Some blocking operation is causing this sequential completion.

Sample output for batch size of 5 and their respective completion time stamps:
Start goto : url1 - 1518433045085
Start goto : url2 - 1518433045099
Start goto : url3 - 1518433045107
Start goto : url4 - 1518433045108
Start goto : url5 - 1518433045109
End goto : url5 - 1518433045356 - 0.593
Start content : url5 - 1518433045357
End goto : url3 - 1518433045363 - 0.6
Start content : url3 - 1518433045364
End content : url3 - 1518433045371 - 0.608
End content : url5 - 1518433045372 - 0.609
End goto : url4 - 1518433048844 - 4.081
Start content : url4 - 1518433048844
End content : url4 - 1518433048854 - 4.091
End goto : url1 - 1518433051797 - 7.034
Start content : url1 - 1518433051797
End content : url1 - 1518433052060 - 7.297
Total time : 9.938
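
For reference, these timestamps come from logging roughly like this (a sketch reconstructed from the log format above; the stamp and elapsed helpers are illustrative shorthand, and startDate, connectionTimeOut, promises, and url are from the earlier snippet):

const stamp = () => Date.now();
const elapsed = () => (Date.now() - startDate.getTime()) / 1000;

promises.push(browser.newPage().then(async page => {
  console.log('Start goto : ' + url + ' - ' + stamp());
  await page.goto(url, { waitUntil: ['load'], timeout: connectionTimeOut });
  console.log('End goto : ' + url + ' - ' + stamp() + ' - ' + elapsed());
  console.log('Start content : ' + url + ' - ' + stamp());
  const html = await page.content();
  console.log('End content : ' + url + ' - ' + stamp() + ' - ' + elapsed());
}));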

@adamgotterer

@engineerkushal I'm having a similar issue. Did you find a solution?

@aslushnikov
Contributor

@kushaljshah Puppeteer v1.10.0 emits an unpleasant warning when navigating more than 10 pages simultaneously; that is addressed in #3560.

Other than that, the following works fine for me (though it takes some time).

const puppeteer = require('puppeteer');

(async() => {
  const browser = await puppeteer.launch();
  const urls = [
    'https://example.com',
    'https://google.com',
    'https://yahoo.com',
    'https://cnn.com',
    'https://bbc.com',
    'https://reddit.com',
    'https://twitter.com',
    'https://facebook.com',
    'https://apple.com',
    'https://en.wikipedia.org',
    'https://github.com',
    'https://sr.ht',
  ];
  await Promise.all(urls.map(async (url, index) => {
    console.log(`${index}: started`);
    const page = await browser.newPage();
    console.log(`${index}: navigating...`);
    await page.goto(url, {timeout: 0});
    console.log(`${index}: done.`);
    await page.close();
  }));
  await browser.close();
})();

I don't think there's any artificial limitation on the number of parallel pages, but they do fight each other for the host resources.
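
If that contention is the cause, capping concurrency instead of opening every page at once usually evens things out. A minimal worker-pool sketch (the pool size of 4 is an arbitrary assumption; tune it for the host):

async function withPool(urls, poolSize, worker) {
  // Shared queue; each runner pulls the next URL until the queue drains.
  const queue = urls.slice();
  const runners = Array.from({ length: poolSize }, async () => {
    while (queue.length > 0) {
      await worker(queue.shift());
    }
  });
  await Promise.all(runners);
}

// Usage inside the async IIFE above, with the same per-page logic:
await withPool(urls, 4, async url => {
  const page = await browser.newPage();
  await page.goto(url, { timeout: 0 });
  await page.close();
});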

Hope this helps.

@kushaljshah
Author

@aslushnikov I don't think the problem was ever in navigating multiple pages. It's just that the amount of time taken was really long, even though all the system resources, including CPU, memory, load average, and the operation queue, were well under their threshold values.

And since the expected behavior is that the I/O involved in each page navigation happens in parallel, the time taken by the batch is not justified compared to the time taken when the pages are navigated individually.

I will gather some fresh numbers on this and post them here.
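
A sketch of how such numbers could be collected, timing the same URL list sequentially and in parallel (navigate is a hypothetical stand-in for the per-page goto/content logic above):

async function timeRun(label, fn) {
  const start = Date.now();
  await fn();
  console.log(label + ': ' + (Date.now() - start) / 1000 + 's');
}

await timeRun('sequential', async () => {
  for (const url of urlArr) await navigate(url); // one page at a time
});
await timeRun('parallel', () => Promise.all(urlArr.map(navigate))); // all at once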
