feat: Add a custom extractor for www.ndtv.com. #554
When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:
* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.
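The three changes above could be expressed as a `transforms` map in a custom extractor's `content` section. This is only a sketch, not the code from this commit: it assumes each transform receives a cheerio-style node exposing `removeAttr` and `attr`, and that returning a string from a transform renames the matched element to that tag. The standalone `transforms` constant is a hypothetical name used for illustration.

```javascript
// Hypothetical sketch of a custom extractor's `content.transforms` map.
// Assumption: each key is a selector and each function receives a
// cheerio-style node; returning a string renames the element to that tag.
const transforms = {
  h1: $node => {
    // Drop the "id" attribute so it no longer lowers the element's weight.
    $node.removeAttr('id');
  },
  h2: $node => {
    $node.removeAttr('id');
    // "h1" elements are demoted to "h2" by the parser, so move "h2" down to "h3".
    return 'h3';
  },
  ul: $node => {
    // Mark the list as content so it is not removed during cleaning.
    $node.attr('class', 'entry-content-asset');
  },
};
```

Keeping each rule keyed by its selector means the heading and list fixes stay independent of one another, so one can be dropped later without touching the others.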
…e would send contentOnly: true on subsequent pages (page 2). removed failover: true from preview.
Feature arstechnica extractor
src/extractors/collect-all-pages.js
Outdated
    html,
    $,
    metaCache,
    contentOnly: true,
What is the logic behind removing this value?
Sorry, I do not recall at this point. Obviously, if I thought removing that value was a good change, I should have included comments around it. But I did not, and I did this over a year ago now.
FYI, it looks like there's some context for this change in #553
No description provided.