Possible bug? - Axios not interpreting large comments correctly #5450

Closed
dtemlak opened this issue Jan 7, 2023 · 4 comments

Comments

dtemlak commented Jan 7, 2023

Describe the issue

Using Node.js, I am trying to scrape a particular webpage (https://www.pro-football-reference.com/hof/). I am specifically trying to scrape all three sections (Players, Coaches and Contributors). I can scrape the first section (Players) with no issues. However, the other two are not being read correctly. It looks like it has something to do with large comment blocks that are placed before both the Coaches and Contributors sections. When I look at the axios.get response, the large comments are mostly removed, but the DOM returned (for Coaches) looks like this:

When I send the response to Cheerio, it doesn't see anything past this DOM. I think this is an Axios issue and not a Cheerio issue, but I am not 100% sure.

Can anyone help? Thanks in advance! - DLT

Example Code

// Imports the function relies on (CommonJS style assumed)
const fs = require("fs");
const axios = require("axios");
const cheerio = require("cheerio");

// Fetches the provided URL and returns the page data,
// optionally loaded into Cheerio for parsing.
async function getWebsiteContent(url, with_Cheerio, debug_mode) {
  try {
    url = url.replaceAll(" ", "_");
    const response = await axios.get(url);

    // Provide some debug info if requested
    if (debug_mode == true) {
      //await delay(2000);
      console.log(url);
      fs.writeFile("debug_info.txt", response.data, function (err) {
        if (err) {
          return console.log("THERE WAS AN ERROR SAVING THE LOG:", err);
        }
        console.log("The file was saved!");
      });
    }

    // Return the correct type of data depending on whether Cheerio is needed
    if (with_Cheerio == "Yes") {
      return cheerio.load(response.data);
    } else {
      return response.data;
    }
  } catch (error) {
    // The error is returned rather than thrown, so the caller receives it as the result
    return error;
  }
}

Expected behavior

This is what I see for players:

Players

<table class="sortable stats_table" id="hof_players" data-cols-to-freeze="1,2">
<caption>Players Table</caption>

...


This is the response I see for coaches:

Coaches

DigitalBrainJS (Collaborator) commented Jan 7, 2023

That is dynamically generated content on the client side, so it doesn't actually exist in the initial HTML code. Use Puppeteer or Playwright instead of Axios. You can disable JS in Chrome Dev Tools or use Postman to see what the original content looks like.
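
One way to check what the original content looks like without a browser is to search the raw response body for the table you expect. The sketch below is only an illustration: `locateMarkup` is a hypothetical helper, the `hof_coaches` and `hof_contributors` ids are assumed by analogy with `hof_players` and are not confirmed in this thread, the sample string is made up, and the regular expression is a rough comment matcher rather than an HTML parser.

```javascript
// Hypothetical helper: reports whether `marker` occurs in raw HTML as live
// markup, only inside an HTML comment (<!-- ... -->), or not at all.
function locateMarkup(html, marker) {
  const commentPattern = /<!--([\s\S]*?)-->/g;

  // Collect the bodies of comments that contain the marker.
  const commentBodies = [];
  for (const match of html.matchAll(commentPattern)) {
    if (match[1].includes(marker)) {
      commentBodies.push(match[1].trim());
    }
  }

  // Strip all comments to see whether the marker also exists as live markup.
  const live = html.replace(commentPattern, "").includes(marker);

  if (live) return { status: "live", commentBodies };
  if (commentBodies.length > 0) return { status: "commented", commentBodies };
  return { status: "absent", commentBodies: [] };
}

// Made-up sample standing in for response.data
const sample = [
  '<table id="hof_players"><caption>Players Table</caption></table>',
  '<div class="wrapper"><!-- <table id="hof_coaches"></table> --></div>',
].join("\n");

console.log(locateMarkup(sample, 'id="hof_players"').status);      // live
console.log(locateMarkup(sample, 'id="hof_coaches"').status);      // commented
console.log(locateMarkup(sample, 'id="hof_contributors"').status); // absent
```

A status of `absent` would mean the server never sent that markup. A status of `commented` would mean it arrived but stays inert until something on the client activates it; in either case a plain HTTP client will not see it as part of the DOM.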

dtemlak closed this as completed Jan 7, 2023

dtemlak (Author) commented Jan 7, 2023

> That is dynamically generated content on the client side, so it doesn't actually exist in the initial HTML code. Use Puppeteer or Playwright instead of Axios. You can disable JS in Chrome Dev Tools or use Postman to see what the original content looks like.

I disabled JS in Chrome Dev Tools. I still see the comments. You suggested using Puppeteer. Does that mean there is a bug in Axios? I would prefer not to use another tool unless I have to. I am just not sure why I can see the first grouping (Players) but Axios/Cheerio cannot see either of the next two groupings (Coaches or Contributors). Are you able to parse those somehow?

dtemlak reopened this Jan 7, 2023

DigitalBrainJS (Collaborator) commented Jan 7, 2023

This means that you are trying to use the wrong tool: you need to execute JS on the page in order to get a dynamically generated DOM. You cannot get something from the server response that is not there.
The first block is generated on the server side (server-side rendering), while the other blocks are generated on the client side and then inserted into the DOM via JS.
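
The approach described above can be sketched roughly as follows, assuming a headless browser library such as Puppeteer is installed. `getRenderedHtml` and its injected `launch` parameter are hypothetical names used for illustration; the calls made on the browser and page objects (`newPage`, `goto` with `waitUntil`, `content`, `close`) are Puppeteer's. Taking the launcher as an argument keeps the function free of a hard dependency, so it can be tried with a stub first.

```javascript
// Hypothetical helper: loads `url` in a browser so that client-side JS runs,
// then returns the resulting DOM serialized as an HTML string.
// `launch` is any async function that resolves to a Puppeteer-style browser.
async function getRenderedHtml(launch, url) {
  const browser = await launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so scripts have had time to run.
    await page.goto(url, { waitUntil: "networkidle0" });
    // page.content() serializes the DOM as it exists after the scripts ran.
    return await page.content();
  } finally {
    // Always release the browser, even if navigation failed.
    await browser.close();
  }
}
```

With Puppeteer this could be called as `getRenderedHtml(() => require("puppeteer").launch(), url)`, and the returned string passed to `cheerio.load` in place of `response.data`.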

DigitalBrainJS (Collaborator) commented

Going to close since this issue is not related to Axios.

DigitalBrainJS closed this as not planned Jan 27, 2023