Node.js version extracts incorrect content compared to browser version #764

## Expected Behavior

When parsing the same HTML content, the Node.js version should extract the same content as the browser version: only the main article body, excluding metadata, subtitles, and descriptions that appear outside it.

## Current Behavior

The Node.js version includes extra content, such as the page subtitle/description, that the browser version correctly excludes.

## Steps to Reproduce

1. Parse the same HTML content using both the Node.js and browser versions
2. Compare the extracted content
3. Observe that the Node.js version selects incorrect content (e.g., subtitles, descriptions)

## Detailed Description

URL tested: https://www.theverge.com/news/766311/anthropic-class-action-ai-piracy-authors-settlement

I'm attaching three HTML files for comparison:
1. [original.html](https://github.com/user-attachments/files/23112193/original.html) - The original HTML from the source
2. [node-processed.html](https://github.com/user-attachments/files/23112202/node-processed.html) - HTML after processing by Node.js version
3. [browser-processed.html](https://github.com/user-attachments/files/23112204/browser-processed.html) - HTML after processing by browser version

The Node.js version includes text like:
> "The Amazon-backed startup won't have to go to trial over claims it trained AI models on 'millions' of pirated works."

This is a subtitle/description that appears in the page metadata and should not be part of the main article content.

## Possible Solution

The `convertToParagraphs` function converts `<div>` elements into `<p>` tags even when they contain block-level elements (like `<main>`, `<section>`, `<article>`). The result is invalid HTML: per the HTML specification, a `<p>` element may contain only phrasing content, not block-level elements.

Example of the invalid HTML generated:
```html
<p class="container">
  <main>
    <div>Content</div>
  </main>
</p>
```
**Browser behavior:** Browsers automatically fix this invalid structure by closing the `<p>` tag before the `<main>` element:
```html
<p class="container"></p>
<main>
  <div>Content</div>
</main>
```
This leaves the `<p>` element empty, which changes how the content is scored.

**Node.js Cheerio behavior:** Cheerio accepts this invalid structure and keeps the content inside the `<p>` tag, so the two versions score different trees and the Node.js version selects the wrong content.
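
To make the difference concrete, here is a minimal, dependency-free sketch of the two behaviors described above. It is not the code of any browser or of Cheerio: `buildTree`, `textOf`, the token format, and the `CLOSES_P` subset are all hypothetical names invented for this illustration. With `autoCloseP: true` it mimics the browser rule of closing an open `<p>` before a block start tag; with `autoCloseP: false` it keeps the nesting as written.

```javascript
// Illustrative subset of start tags that implicitly close an open <p>.
// The real HTML parsing algorithm has a longer list.
const CLOSES_P = new Set([
  'article', 'aside', 'blockquote', 'div', 'dl', 'footer', 'header',
  'main', 'nav', 'ol', 'p', 'pre', 'section', 'table', 'ul',
]);

// Index of the innermost open element with the given tag, or -1.
const lastOpen = (stack, tag) => stack.map(n => n.tag).lastIndexOf(tag);

// tokens: { type: 'open' | 'close', tag } or { type: 'text', value }
function buildTree(tokens, { autoCloseP }) {
  const root = { tag: '#root', children: [] };
  const stack = [root];
  for (const token of tokens) {
    if (token.type === 'text') {
      stack[stack.length - 1].children.push(token.value);
    } else if (token.type === 'open') {
      if (autoCloseP && CLOSES_P.has(token.tag)) {
        const pIndex = lastOpen(stack, 'p');
        if (pIndex !== -1) stack.length = pIndex; // pop the <p> and anything above it
      }
      const el = { tag: token.tag, children: [] };
      stack[stack.length - 1].children.push(el);
      stack.push(el);
    } else {
      // Simplification: an unmatched end tag is ignored. A real browser
      // would additionally insert an empty <p> for a stray </p>.
      const index = lastOpen(stack, token.tag);
      if (index > 0) stack.length = index;
    }
  }
  return root;
}

// Concatenated text of a node and all of its descendants.
function textOf(node) {
  return typeof node === 'string' ? node : node.children.map(textOf).join('');
}

// <p><main><div>Content</div></main></p>
const tokens = [
  { type: 'open', tag: 'p' },
  { type: 'open', tag: 'main' },
  { type: 'open', tag: 'div' },
  { type: 'text', value: 'Content' },
  { type: 'close', tag: 'div' },
  { type: 'close', tag: 'main' },
  { type: 'close', tag: 'p' },
];

const lenient = buildTree(tokens, { autoCloseP: false });
const strict = buildTree(tokens, { autoCloseP: true });

console.log(JSON.stringify(textOf(lenient.children[0]))); // "Content"
console.log(JSON.stringify(textOf(strict.children[0])));  // ""
console.log(strict.children[1].tag);                      // main
```

In the lenient tree the `<p>` owns all of the text, while in the strict tree it owns none, which is why a scorer that looks at paragraph text can pick different candidates in the two environments.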

Update the `DIV_TO_P_BLOCK_TAGS` constant to include the HTML5 structural elements it is missing (`article`, `aside`, `footer`, `header`, `main`, `nav`, `section`).

File: `src/extractors/generic/content/scoring/constants.js`
```javascript
export const DIV_TO_P_BLOCK_TAGS = [
  'a',
  'article',      // Added
  'aside',        // Added
  'blockquote',
  'dl',
  'div',
  'footer',       // Added
  'header',       // Added
  'img',
  'main',         // Added
  'nav',          // Added
  'p',
  'pre',
  'section',      // Added
  'table',
].join(',');
```
Also update `src/utils/dom/convert-to-paragraphs.js`: change `children()` to `find()` so that the check covers all descendants, not just direct children:
```javascript
function convertDivs($) {
  $('div').each((index, div) => {
    const $div = $(div);
    // Use find() instead of children() to check all descendants
    const convertible = $div.find(DIV_TO_P_BLOCK_TAGS).length === 0;

    if (convertible) {
      convertNodeTo($div, $, 'p');
    }
  });

  return $;
}
```
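
Why the `children()` to `find()` change matters can be shown without Cheerio. The sketch below uses plain objects as stand-ins for DOM nodes; `node`, `hasBlockChild`, and `hasBlockDescendant` are hypothetical helpers that only approximate what `$div.children(DIV_TO_P_BLOCK_TAGS)` and `$div.find(DIV_TO_P_BLOCK_TAGS)` would match.

```javascript
// Same tags as the proposed DIV_TO_P_BLOCK_TAGS, as a Set for lookups.
const BLOCK_TAGS = new Set([
  'a', 'article', 'aside', 'blockquote', 'dl', 'div', 'footer', 'header',
  'img', 'main', 'nav', 'p', 'pre', 'section', 'table',
]);

// Hypothetical stand-in for a DOM element.
const node = (tag, ...children) => ({ tag, children });

// Approximates children(): looks at direct children only.
function hasBlockChild(el) {
  return el.children.some(child => BLOCK_TAGS.has(child.tag));
}

// Approximates find(): looks at descendants at any depth.
function hasBlockDescendant(el) {
  return el.children.some(
    child => BLOCK_TAGS.has(child.tag) || hasBlockDescendant(child)
  );
}

// <div><ul><li><p>text</p></li></ul></div>
const listWrapper = node('div', node('ul', node('li', node('p'))));
// <div><span></span><em></em></div>
const inlineOnly = node('div', node('span'), node('em'));

// Direct-children check misses the nested <p>, so the <div> would be converted.
console.log(hasBlockChild(listWrapper));      // false
// Descendant check sees the nested <p>, so the <div> is left alone.
console.log(hasBlockDescendant(listWrapper)); // true
// A <div> with only inline content is still convertible under both checks.
console.log(hasBlockDescendant(inlineOnly));  // false
```

The trade-off is that the descendant check is stricter: fewer `<div>` elements get converted to `<p>`, which may shift scores on pages that previously relied on the looser behavior.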
