Node.js version extracts incorrect content compared to browser version #764

@twang3

Expected Behavior

When parsing the same HTML content, the Node.js version should extract only the main article content, excluding metadata, subtitles, and descriptions that appear outside the article body.

Current Behavior

The Node.js version incorrectly includes extra content that should be excluded.

Steps to Reproduce

  1. Parse the same HTML content using both the Node.js and browser versions
  2. Compare the extracted content
  3. Observe that the Node.js version selects incorrect content (e.g., subtitles, descriptions)

Detailed Description

URL tested: https://www.theverge.com/news/766311/anthropic-class-action-ai-piracy-authors-settlement

I'm attaching three HTML files for comparison:

  1. original.html - The original HTML from the source
  2. node-processed.html - HTML after processing by Node.js version
  3. browser-processed.html - HTML after processing by browser version

The Node.js version includes text like:

"The Amazon-backed startup won't have to go to trial over claims it trained AI models on 'millions' of pirated works."

This is a subtitle/description that appears in the page metadata and should not be part of the main article content.

Possible Solution

The convertToParagraphs function converts <div> elements that contain block-level elements (such as <main>, <section>, or <article>) into <p> tags. This violates the HTML5 specification: a <p> element may contain only phrasing content, so it cannot contain block-level elements.
Example of invalid HTML generated:

<p class="container">
  <main>
    <div>Content</div>
  </main>
</p>

Browser behavior: Browsers automatically fix this invalid structure by closing the <p> tag before the <main> element:

<p class="container"></p>
<main>
  <div>Content</div>
</main>

This leaves the <p> element empty, which changes how it is scored.

Node.js (Cheerio) behavior: Cheerio keeps the invalid structure as-is, so the content stays inside the <p> tag. The <p> is then scored as if it held that content, which leads to incorrect content selection.
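
To make the difference concrete, here is a minimal sketch of the two resulting tree shapes. It uses a hypothetical plain-object tree (the `node` and `textOf` helpers are illustrative only, not part of the parser or of Cheerio) and uses text length as a stand-in for whatever the real scoring measures:

```javascript
// Hypothetical tree model: { tag, children, text }. Not the real DOM API.
function node(tag, children = [], text = '') {
  return { tag, children, text };
}

// Concatenate the text of a node and all of its descendants.
function textOf(n) {
  return n.text + n.children.map(textOf).join('');
}

// Shape kept by a lenient parser: the content stays nested inside <p>.
const lenientP = node('p', [node('main', [node('div', [], 'Content')])]);

// Shape after a browser repairs the markup: <p> is closed early and
// <main> becomes its sibling.
const browserP = node('p');
const browserMain = node('main', [node('div', [], 'Content')]);

console.log(textOf(lenientP).length); // 7
console.log(textOf(browserP).length); // 0
console.log(textOf(browserMain));     // Content
```

In this sketch, the same input gives the <p> seven characters of text in one environment and none in the other, so any score based on paragraph text would differ between them.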

Update the DIV_TO_P_BLOCK_TAGS constant to include the HTML5 block-level sectioning elements it was missing (marked "// Added"):
File: src/extractors/generic/content/scoring/constants.js

export const DIV_TO_P_BLOCK_TAGS = [
  'a',
  'article',      // Added
  'aside',        // Added
  'blockquote',
  'dl',
  'div',
  'footer',       // Added
  'header',       // Added
  'img',
  'main',         // Added
  'nav',          // Added
  'p',
  'pre',
  'section',      // Added
  'table',
].join(',');

Also update: src/utils/dom/convert-to-paragraphs.js
Change children() to find() so that the check covers all descendants, not just direct children:

function convertDivs($) {
  $('div').each((index, div) => {
    const $div = $(div);
    // Use find() instead of children() to check all descendants
    const convertible = $div.find(DIV_TO_P_BLOCK_TAGS).length === 0;

    if (convertible) {
      convertNodeTo($div, $, 'p');
    }
  });

  return $;
}
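
The difference between a direct-children check and a descendant check can be sketched without a DOM library. The example below assumes a hypothetical plain-object tree; `el`, `hasBlockChild`, and `hasBlockDescendant` are illustrative names that only mimic what children() and find() would match, and are not part of the parser:

```javascript
// Tag list mirroring the proposed DIV_TO_P_BLOCK_TAGS constant.
const BLOCK_TAGS = new Set([
  'a', 'article', 'aside', 'blockquote', 'dl', 'div', 'footer', 'header',
  'img', 'main', 'nav', 'p', 'pre', 'section', 'table',
]);

// Hypothetical element model: a tag name plus child elements.
function el(tag, children = []) {
  return { tag, children };
}

// children()-style check: looks only one level down.
function hasBlockChild(node) {
  return node.children.some((child) => BLOCK_TAGS.has(child.tag));
}

// find()-style check: walks every descendant.
function hasBlockDescendant(node) {
  return node.children.some(
    (child) => BLOCK_TAGS.has(child.tag) || hasBlockDescendant(child)
  );
}

// <div><span><section><span></span></section></span></div>
const wrapper = el('div', [el('span', [el('section', [el('span')])])]);

console.log(hasBlockChild(wrapper));      // false
console.log(hasBlockDescendant(wrapper)); // true
```

With the one-level check, this wrapper looks safe to convert even though a <section> is nested inside it. The descendant check catches that case and leaves the <div> alone.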
