Scraper options selector skip doesnt work #297

Closed
cbsa100 opened this issue Jul 15, 2023 · 2 comments
@cbsa100

cbsa100 commented Jul 15, 2023

Options

const options = {
  wordwrap: null,
  selectors: [
    { selector: 'a', options: { ignoreHref: true } },
    { selector: 'img', format: 'skip' },
    { selector: 'nav', format: 'skip' },
    { selector: 'header', format: 'skip' },
    { selector: 'footer', format: 'skip' },
    { selector: '*[data-elementor-type=footer]', format: 'skip' },
    { selector: '*[data-elementor-type=header]', format: 'skip' },
  ],
};



**Version information**

    "html-to-text": "^9.0.5",
    "next": "^13.4.8",

----

When scraping a webpage, I try to remove the header, footer, images, navs, and links to get only the text.
However, for some reason, I still get the footer text in the result.
I tried this on both Elementor sites (with and without the data attributes) and non-Elementor sites (with the footer tag), and also with and without the asterisk before the data attribute selector.
@KillyMXI
Member

const { htmlToText } = require('html-to-text');

const html = `
<header>header</header>
<div data-elementor-type="header">elementor type header</div>
<p>paragraph</p>
<div data-elementor-type="footer">elementor type footer</div>
<footer>footer</footer>`;

const options = {
  wordwrap: null,
  selectors: [
    { selector: 'a', options: { ignoreHref: true } },
    { selector: 'img', format: 'skip' },
    { selector: 'nav', format: 'skip' },
    { selector: 'header', format: 'skip' },
    { selector: 'footer', format: 'skip' },
    { selector: '*[data-elementor-type=footer]', format: 'skip' },
    { selector: '*[data-elementor-type=header]', format: 'skip' },
  ],
};
const text = htmlToText(html, options);
console.log(text);

Outputs only

paragraph

Start reducing your issue to a minimal example to find out what might be wrong in your case.
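
One way to begin that reduction is to check what the input HTML actually contains before blaming the selectors. The sketch below is a rough, dependency-free check, not part of html-to-text: `countOpeningTags` and `sampleHtml` are hypothetical names invented for this illustration, and the regex only approximates real HTML parsing (it would also count tags inside comments or scripts). It assumes the page marks up its footer as a plain `<div>`, which is one plausible reason a `footer` selector would match nothing.

```javascript
// Hypothetical helper (not part of html-to-text): counts opening tags by name
// so you can see whether the input really contains <footer>, <header>, etc.
function countOpeningTags(html, tagNames) {
  const counts = {};
  for (const name of tagNames) {
    // Require whitespace, "/" or ">" after the name so "header" does not match "<headers>".
    const pattern = new RegExp(`<${name}(?=[\\s/>])`, 'gi');
    counts[name] = (html.match(pattern) || []).length;
  }
  return counts;
}

// Assumed example input: the footer is a plain <div>, not a <footer> element.
const sampleHtml = `
<header>header</header>
<p>paragraph</p>
<div class="site-footer">footer text</div>`;

const tagCounts = countOpeningTags(sampleHtml, ['header', 'footer', 'nav']);
console.log(tagCounts); // { header: 1, footer: 0, nav: 0 }
```

A count of zero for a tag you are trying to skip suggests the selector has nothing to match, and that the text comes from a differently marked-up element that needs its own selector.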

@KillyMXI
Member

With no follow-up, I consider this resolved.

Most likely cause: unexpected input HTML, and insufficient attention to what the input HTML actually contains and which options are actually used.
