Unable to get accurate HTML code for a website. #7435

Closed
Rajat-Sharma-2710 opened this issue Jul 21, 2021 · 7 comments

Comments

@Rajat-Sharma-2710

Steps to reproduce

Tell us about your environment:

What steps will reproduce the problem?

  1. Start browser
  2. Visit https://intersight.com/help/
  3. Save the HTML using page.content()
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://intersight.com/help/');
  const result = await page.content();
  console.log(result);
  await browser.close();
})();

What is the expected result?
The HTML should contain the hrefs that are visible when the page is opened in a browser window, as shown in the image below.
image

What happens instead?
Puppeteer returns incomplete HTML: the hrefs shown in the image are not present in the output.

@Rajat-Sharma-2710
Author

@jschfflr any comments?

@jschfflr
Contributor

Hi @Rajat-Sharma-2710
By default, page.goto only waits for the load event. Because the website uses Angular, the content might not have been rendered yet at that point. You could add await page.waitForSelector('.helplet-item'); before page.content() to make sure the links have been rendered.
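
The suggestion above can be sketched as a small helper. This is a minimal sketch, not the maintainers' code: `renderedContent` is a hypothetical name, and it takes an already-created Puppeteer `page` so the waiting logic stays separate from browser setup. It relies only on `page.goto`, `page.waitForSelector`, and `page.content`. Note that a plain CSS selector matches elements in the main document tree only.

```javascript
// Hypothetical helper: navigate, wait for a client-rendered element, then
// return the serialized HTML. `page` is a Puppeteer Page created elsewhere.
async function renderedContent(page, url, selector, timeoutMs = 30000) {
  // By default, goto resolves once the load event has fired.
  await page.goto(url);
  // Rejects with a TimeoutError if the selector never matches in time.
  await page.waitForSelector(selector, { timeout: timeoutMs });
  return page.content();
}
```

A call might look like `const html = await renderedContent(page, 'https://intersight.com/help/', '.helplet-item');`.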

@Rajat-Sharma-2710
Author

@jschfflr I have updated the code to use waitUntil: 'networkidle2' and added a waitForSelector call as you suggested.

const puppeteer = require('puppeteer');
const fs = require('fs');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://intersight.com/help/', { waitUntil: 'networkidle2' });
  await page.waitForSelector('.helplet-links');
  const result = await page.content();
  fs.writeFile('./intersight.html', result, (err) => {
    if (err) console.log(err);
    else console.log('done!!');
  });
  // console.log(result)
  await browser.close();
})();

On execution I am getting a timeout error:

node index.js 
(node:22388) UnhandledPromiseRejectionWarning: TimeoutError: waiting for selector `.helplet-links` failed: timeout 30000ms exceeded

Am I missing something?
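
The UnhandledPromiseRejectionWarning in the log above appears because the rejected `waitForSelector` promise is never caught. A minimal sketch of catching it follows. `waitOrReport` is a hypothetical helper, and checking `err.name === 'TimeoutError'` assumes the error carries that name, as the log suggests. It reports the timeout but does not explain why the selector never matched.

```javascript
// Hypothetical helper: wait for a selector and turn a timeout into a
// boolean result instead of an unhandled rejection.
async function waitOrReport(page, selector, timeoutMs = 30000) {
  try {
    await page.waitForSelector(selector, { timeout: timeoutMs });
    return true;
  } catch (err) {
    // Assumption: timeouts are identified by the error's name.
    if (err && err.name === 'TimeoutError') {
      console.error(`Selector ${selector} did not appear within ${timeoutMs} ms`);
      return false;
    }
    // Anything else is unexpected, so let it propagate.
    throw err;
  }
}
```

Calling `browser.close()` in a `finally` block around the whole script would also keep the browser process from lingering when a step fails.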

@ggorlen
Contributor

ggorlen commented Jul 27, 2021

This page uses shadow roots. Please see #858
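
Elements inside a shadow root are not reachable with an ordinary `document.querySelector` call, which would be consistent with the selector wait timing out. A minimal sketch of collecting link targets across open shadow roots follows. `collectHrefsDeep` is a hypothetical helper, not part of Puppeteer, and it only reaches open shadow roots, since closed ones are not exposed through `shadowRoot`.

```javascript
// Hypothetical helper: gather the href of every <a> element under `root`,
// descending into any open shadow roots found along the way.
function collectHrefsDeep(root) {
  const hrefs = [];
  for (const el of root.querySelectorAll('*')) {
    if (el.tagName === 'A' && el.href) {
      hrefs.push(el.href);
    }
    // shadowRoot is null for elements without an open shadow root.
    if (el.shadowRoot) {
      hrefs.push(...collectHrefsDeep(el.shadowRoot));
    }
  }
  return hrefs;
}
```

One way to run it inside the page is to pass its source as a string expression, for example ``const hrefs = await page.evaluate(`(${collectHrefsDeep})(document)`);``. The result is an array of strings, so it serializes back to Node without needing element handles.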

@Rajat-Sharma-2710
Author

@ggorlen I tried using the approach from the link:

await page.evaluateHandle(`document.querySelector("#app > an-hulk").shadowRoot.querySelector("#content").shadowRoot.querySelector("#main > div > div > div > an-hulk-home")`);

The query was copied using "Copy JS path" in DevTools.
I am getting this error:

node index.js 
(node:16540) UnhandledPromiseRejectionWarning: Error: Evaluation failed: TypeError: Cannot read property 'shadowRoot' of null
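
The `TypeError` means one of the `querySelector` calls in the chain returned `null`. The thread does not establish why; one possibility (an assumption) is that the expression ran before the components had rendered. A sketch that walks the same kind of path one step at a time and returns `null` instead of throwing follows. `resolveShadowPath` is a hypothetical helper, and each selector after the first is looked up inside the previous element's shadow root.

```javascript
// Hypothetical helper: resolve a chain of selectors where every step after
// the first is searched inside the shadow root of the previous match.
// Returns the final element, or null if any step cannot be resolved.
function resolveShadowPath(root, selectors) {
  let scope = root;
  let el = null;
  for (const selector of selectors) {
    if (!scope) return null; // previous match had no open shadow root
    el = scope.querySelector(selector);
    if (!el) return null; // this step is not (yet) in the DOM
    scope = el.shadowRoot;
  }
  return el;
}
```

Because it never throws on a missing step, it can be polled. Assuming `path` holds the selector strings, something like ``await page.waitForFunction(`(${resolveShadowPath})(document, ${JSON.stringify(path)}) !== null`);`` would wait until the whole chain resolves before the element is queried.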

@ggorlen
Contributor

ggorlen commented Jul 27, 2021

@Rajat-Sharma-2710 That code works for me. Please show a complete, reproducible example, and let's take the discussion and code back to the Stack Overflow thread if you don't mind, so everything's in one place -- this isn't really a Puppeteer issue. I left the note above mainly for future visitors that might stumble on the thread. Thanks.

@jschfflr
Contributor

@ggorlen Good catch with the shadow roots, thanks! I totally missed that!
