
Issues with Parallel pages html content download #1995

Closed
kushaljshah opened this issue Feb 8, 2018 · 6 comments

@kushaljshah

Parallel downloads do not seem to be happening when I download a batch of URLs. URLs in a batch complete much later than they do when downloaded individually.

I am using a 2-core EC2 machine with 4 GB of RAM. All system resources stay stable with a batch size of 30 URLs. The connection timeout used here is 30 seconds. There are URLs that complete as late as 15, 20, or 23 seconds into the batch but finish in about 3-7 seconds when run individually. Isn't parallel I/O supposed to happen here? What am I missing?

const puppeteer = require('puppeteer');

// urlArr is defined elsewhere; this snippet runs inside an async function.
const connectionTimeOut = 30000; // 30 seconds
const startDate = new Date();

await puppeteer.launch().then(async browser => {
  const promises = [];
  urlArr.forEach(function (url) {
    promises.push(browser.newPage().then(async page => {
      try {
        await page.goto(url.startsWith('http') ? url : `http://${url}`, {
          waitUntil: ['load'], timeout: connectionTimeOut
        });

        const html = await page.content();
        const finishDate = new Date();
        console.log("html download complete for " + url + " - " + finishDate.getTime() + " - " + (finishDate.getTime() - startDate.getTime()) / 1000);
      } catch (err) {
        console.log(err);
      }
    }));
  });
  await Promise.all(promises);
  await browser.close();
});
@nammaianh

How about using cluster?
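
For example, a rough sketch of that idea using Node's built-in cluster module, with one Chromium instance per worker process (the URL list, slicing logic, and timeout here are placeholders, not a tested recipe):

const cluster = require('cluster');
const os = require('os');
const puppeteer = require('puppeteer');

const urlArr = ['http://example.com', 'http://example.org']; // placeholder URLs

if (cluster.isMaster) {
  // Fork one worker per CPU (or fewer if there are fewer URLs) and hand
  // each worker its slice of the URL list via an environment variable.
  const workers = Math.min(os.cpus().length, urlArr.length);
  for (let i = 0; i < workers; i++) {
    const slice = urlArr.filter((_, j) => j % workers === i);
    cluster.fork({ URLS: JSON.stringify(slice) });
  }
} else {
  (async () => {
    const browser = await puppeteer.launch();
    for (const url of JSON.parse(process.env.URLS)) {
      const page = await browser.newPage();
      await page.goto(url, { timeout: 30000 });
      const html = await page.content();
      console.log(`worker ${process.pid}: ${url} - ${html.length} bytes`);
      await page.close();
    }
    await browser.close();
    process.exit(0);
  })();
}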

@Garbee
Contributor

Garbee commented Feb 12, 2018

Standard networking applies and the browser needs to delegate and prioritize resource downloads. You're most likely simply saturating the network.
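
If network saturation is the suspicion, one way to test that theory is to block heavy resource types during navigation so pages compete less for bandwidth. A sketch using Puppeteer's request interception (the blocked types are an arbitrary choice, and this assumes a scope where browser is available):

// Run once per page, before page.goto().
const page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', request => {
  const type = request.resourceType();
  // Skip bandwidth-heavy resources; let documents, scripts, and XHR through.
  if (type === 'image' || type === 'stylesheet' || type === 'font' || type === 'media') {
    request.abort();
  } else {
    request.continue();
  }
});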

@kushaljshah
Author

kushaljshah commented Feb 12, 2018

@nammaianh The issue I am referring to is about how many parallel operations can be performed asynchronously on a single node, irrespective of whether it is part of a cluster.

@Garbee All resources, including CPU, RAM, and network, are stable. The instance type I am using has a network bandwidth of 2000 Mbps. I have performed the same exercise with a batch of 5 domains and got the same results, so the network is surely not the culprit here.

What I observe is that the 'goto' operations for all URLs start simultaneously, but their completion and the subsequent 'download' operations occur sequentially. Some blocking operation is causing this sequential completion.

Sample output for batch size of 5 and their respective completion time stamps:
Start goto : url1 - 1518433045085
Start goto : url2 - 1518433045099
Start goto : url3 - 1518433045107
Start goto : url4 - 1518433045108
Start goto : url5 - 1518433045109
End goto : url5 - 1518433045356 - 0.593
Start content : url5 - 1518433045357
End goto : url3 - 1518433045363 - 0.6
Start content : url3 - 1518433045364
End content : url3 - 1518433045371 - 0.608
End content : url5 - 1518433045372 - 0.609
End goto : url4 - 1518433048844 - 4.081
Start content : url4 - 1518433048844
End content : url4 - 1518433048854 - 4.091
End goto : url1 - 1518433051797 - 7.034
Start content : url1 - 1518433051797
End content : url1 - 1518433052060 - 7.297
Total time : 9.938
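
For reference, these timestamps come from logging roughly like this (a sketch reconstructed from the log format above; the stamp and elapsed helpers are illustrative shorthand, and startDate, connectionTimeOut, promises, and url are from the earlier snippet):

const stamp = () => Date.now();
const elapsed = () => (Date.now() - startDate.getTime()) / 1000;

promises.push(browser.newPage().then(async page => {
  console.log('Start goto : ' + url + ' - ' + stamp());
  await page.goto(url, { waitUntil: ['load'], timeout: connectionTimeOut });
  console.log('End goto : ' + url + ' - ' + stamp() + ' - ' + elapsed());
  console.log('Start content : ' + url + ' - ' + stamp());
  const html = await page.content();
  console.log('End content : ' + url + ' - ' + stamp() + ' - ' + elapsed());
}));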

@adamgotterer

@engineerkushal I'm having a similar issue. Did you find a solution?

@aslushnikov
Contributor

@kushaljshah Puppeteer v1.10.0 emits an unpleasant warning when navigating more than 10 pages simultaneously; that is addressed in #3560.

Other than that, the following works fine for me (though it takes some time).

const puppeteer = require('puppeteer');

(async() => {
  const browser = await puppeteer.launch();
  const urls = [
    'https://example.com',
    'https://google.com',
    'https://yahoo.com',
    'https://cnn.com',
    'https://bbc.com',
    'https://reddit.com',
    'https://twitter.com',
    'https://facebook.com',
    'https://apple.com',
    'https://en.wikipedia.org',
    'https://github.com',
    'https://sr.ht',
  ];
  await Promise.all(urls.map(async (url, index) => {
    console.log(`${index}: started`);
    const page = await browser.newPage();
    console.log(`${index}: navigating...`);
    await page.goto(url, {timeout: 0});
    console.log(`${index}: done.`);
    await page.close();
  }));
  await browser.close();
})();

I don't think there's any artificial limitation on the number of parallel pages, but they do fight each other for the host resources.
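
If that contention is the cause, capping concurrency instead of opening every page at once usually evens things out. A minimal worker-pool sketch (the pool size of 4 is an arbitrary assumption; tune it for the host):

async function withPool(urls, poolSize, worker) {
  // Shared queue; each runner pulls the next URL until the queue drains.
  const queue = urls.slice();
  const runners = Array.from({ length: poolSize }, async () => {
    while (queue.length > 0) {
      await worker(queue.shift());
    }
  });
  await Promise.all(runners);
}

// Usage inside the async IIFE above, with the same per-page logic:
await withPool(urls, 4, async url => {
  const page = await browser.newPage();
  await page.goto(url, { timeout: 0 });
  await page.close();
});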

Hope this helps.

@kushaljshah
Author

@aslushnikov I don't think the problem was ever in navigating multiple pages. It's just that the amount of time taken was really long, even though all the system resources, including CPU, memory, load average, and the operation queue, were well under their threshold values.

And since the expected behavior is that the I/O involved in each page navigation happens in parallel, the time taken by the batch is not justified compared to the time taken when the pages are navigated individually.

I will gather some fresh numbers on this and post them here.
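
A sketch of how such numbers could be collected, timing the same URL list sequentially and in parallel (navigate is a hypothetical stand-in for the per-page goto/content logic above):

async function timeRun(label, fn) {
  const start = Date.now();
  await fn();
  console.log(label + ': ' + (Date.now() - start) / 1000 + 's');
}

await timeRun('sequential', async () => {
  for (const url of urlArr) await navigate(url); // one page at a time
});
await timeRun('parallel', () => Promise.all(urlArr.map(navigate))); // all at once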
