Description
Expected Behavior
When parsing the same HTML content, the Node.js version should extract only the main article content, excluding metadata, subtitles, and descriptions that appear outside the article body.
Current Behavior
The Node.js version incorrectly includes extra content that should be excluded.
Steps to Reproduce
- Parse the same HTML content using both the Node.js and browser versions
- Compare the extracted content
- Observe that the Node.js version selects incorrect content (e.g., subtitles, descriptions)
Detailed Description
URL tested: https://www.theverge.com/news/766311/anthropic-class-action-ai-piracy-authors-settlement
I'm attaching three HTML files for comparison:
- original.html - The original HTML from the source
- node-processed.html - HTML after processing by Node.js version
- browser-processed.html - HTML after processing by browser version
The Node.js version includes text like:
"The Amazon-backed startup won't have to go to trial over claims it trained AI models on 'millions' of pirated works."
This is a subtitle/description that appears in the page metadata and should not be part of the main article content.
Possible Solution
The convertToParagraphs function was converting <div> elements that contain block-level elements (like <main>, <section>, <article>) into <p> tags. This produces invalid HTML: the HTML specification only allows phrasing content inside <p>, so a <p> cannot contain block-level elements.
Example of invalid HTML generated:
<p class="container">
  <main>
    <div>Content</div>
  </main>
</p>
Browser behavior: Browsers automatically fix this invalid structure by closing the <p> tag before the <main> element:
<p class="container"></p>
<main>
  <div>Content</div>
</main>
This causes the <p> tag to become empty, affecting scoring.
Node.js Cheerio behavior: Cheerio allows this invalid structure, preserving the content inside the <p> tag, leading to incorrect scoring and content selection.
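The effect on scoring can be illustrated with a small dependency-free sketch. The plain-object trees and the textLength helper below are hypothetical stand-ins, not the parser's actual DOM or scoring code; they only show how much text each environment would attribute to the <p>.

```javascript
// Hypothetical minimal node shape: { tag, children } for elements, { text } for text.
const el = (tag, ...children) => ({ tag, children });
const text = (value) => ({ text: value });

// Total number of text characters under a node (a stand-in for a real score).
function textLength(node) {
  if (node.text !== undefined) return node.text.length;
  return node.children.reduce((sum, child) => sum + textLength(child), 0);
}

// Cheerio-like result: the invalid nesting is preserved inside the <p>.
const cheerioP = el('p', el('main', el('div', text('Content'))));

// Browser-like result: the <p> was closed before <main>, so it is empty.
const browserP = el('p');

console.log(textLength(cheerioP)); // 7
console.log(textLength(browserP)); // 0
```

Any score that depends on the text inside the <p> would therefore differ between the two environments for the same input.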
Update the DIV_TO_P_BLOCK_TAGS constant to include the HTML5 block-level elements it was missing (marked "Added" below):
File: src/extractors/generic/content/scoring/constants.js
export const DIV_TO_P_BLOCK_TAGS = [
  'a',
  'article', // Added
  'aside', // Added
  'blockquote',
  'dl',
  'div',
  'footer', // Added
  'header', // Added
  'img',
  'main', // Added
  'nav', // Added
  'p',
  'pre',
  'section', // Added
  'table',
].join(',');
Also update: src/utils/dom/convert-to-paragraphs.js
Change from children() to find() to check all descendants, not just direct children:
function convertDivs($) {
  $('div').each((index, div) => {
    const $div = $(div);
    // Use find() instead of children() to check all descendants
    const convertible = $div.find(DIV_TO_P_BLOCK_TAGS).length === 0;
    if (convertible) {
      convertNodeTo($div, $, 'p');
    }
  });
  return $;
}
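The difference between the two checks can be seen without Cheerio. The sketch below assumes a hypothetical plain-object tree and two helpers, hasBlockChild and hasBlockDescendant, that mimic children() and find() against the proposed tag list; it is an illustration, not the library's code.

```javascript
// Tags from the proposed DIV_TO_P_BLOCK_TAGS list, kept in a Set for lookup.
const BLOCK_TAGS = new Set([
  'a', 'article', 'aside', 'blockquote', 'dl', 'div', 'footer', 'header',
  'img', 'main', 'nav', 'p', 'pre', 'section', 'table',
]);

// Hypothetical minimal element shape: { tag, children }.
const el = (tag, ...children) => ({ tag, children });

// Mimics $div.children(selector): inspects direct children only.
function hasBlockChild(node) {
  return node.children.some((child) => BLOCK_TAGS.has(child.tag));
}

// Mimics $div.find(selector): inspects every descendant.
function hasBlockDescendant(node) {
  return node.children.some(
    (child) => BLOCK_TAGS.has(child.tag) || hasBlockDescendant(child)
  );
}

// <div><span><section></section></span></div>
const outer = el('div', el('span', el('section')));

console.log(hasBlockChild(outer)); // false
console.log(hasBlockDescendant(outer)); // true
```

With the direct-child check, the outer <div> here would be treated as convertible even though a <section> is nested inside it; the descendant check rejects it.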