@jbrayton
Contributor

This is an extractor for arstechnica.com. A few notes:

  • I removed the contentOnly: true option from extractorOpts in collect-all-pages.js because it resulted in next_page_url always being null on the second page of an article.

  • Articles from this site are often paginated, but I was unable to write a CSS selector that reliably finds the next page. On the last page, the link that such a selector would match points to the previous page instead. The parser still appears to find the next page on its own, without help from this extractor, as long as the fallback option is left at its default value of true.
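
The effect described in the first note can be illustrated with a toy model. This is only a sketch, not Mercury Parser's actual code: `site`, `extractPage`, and `collectAllPages` are hypothetical stand-ins, and the assumption is that a content-only extraction returns the page body without a next-page link.

```javascript
// Toy model only: these names are illustrative, not Mercury Parser's real API.
const site = {
  '/article/': { content: 'page 1', next: '/article/2/' },
  '/article/2/': { content: 'page 2', next: '/article/3/' },
  '/article/3/': { content: 'page 3', next: null },
};

// Hypothetical stand-in for the per-page extraction step.
function extractPage(url, { contentOnly = false } = {}) {
  const page = site[url];
  if (contentOnly) {
    // Assumption: only the body is returned, so no next-page link is reported.
    return { content: page.content };
  }
  return { content: page.content, next_page_url: page.next };
}

// Hypothetical stand-in for the loop that follows next_page_url.
function collectAllPages(firstUrl, subsequentOpts) {
  const first = extractPage(firstUrl);
  const contents = [first.content];
  let nextUrl = first.next_page_url || null;
  while (nextUrl) {
    const result = extractPage(nextUrl, subsequentOpts);
    contents.push(result.content);
    nextUrl = result.next_page_url || null;
  }
  return contents;
}

const withContentOnly = collectAllPages('/article/', { contentOnly: true });
const withoutContentOnly = collectAllPages('/article/', {});
console.log(withContentOnly.length, withoutContentOnly.length); // 2 3
```

In this model, passing `contentOnly: true` for subsequent pages stops collection after page 2 of a three-page article, which matches the symptom of `next_page_url` being null on the second page.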

jbrayton added 10 commits April 23, 2020 17:34
When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:

* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.
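
The three steps above might be expressed as content transforms along these lines. This is a minimal sketch, assuming transform functions receive a cheerio-wrapped node and may return a new tag name; it is not necessarily the code in the commit. `fakeNode` is a hypothetical stand-in for such a node so the example runs without dependencies.

```javascript
// Sketch of the transforms described in the commit message (assumed shape).
const transforms = {
  // Drop the id so the heading is not given a low weight; the parser
  // itself is said to demote h1 to h2, so no rename happens here.
  h1: $node => {
    $node.removeAttr('id');
  },
  // Drop the id and demote h2 to h3 to stay below the demoted h1.
  h2: $node => {
    $node.removeAttr('id');
    return 'h3';
  },
  // Mark lists so they are not removed during cleaning.
  ul: $node => {
    $node.attr('class', 'entry-content-asset');
  },
};

// Hypothetical minimal stand-in for a cheerio-wrapped element.
function fakeNode(attrs) {
  return {
    attrs: { ...attrs },
    attr(name, value) {
      this.attrs[name] = value;
      return this;
    },
    removeAttr(name) {
      delete this.attrs[name];
      return this;
    },
  };
}

const heading = fakeNode({ id: 'news' });
const newTag = transforms.h2(heading);
console.log(newTag, 'id' in heading.attrs); // h3 false
```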
…e would send contentOnly: true on subsequent pages (page 2).

removed failover: true from preview.
ezequiel454 added a commit to RecastLLC/mercury-parser that referenced this pull request Aug 25, 2021
…xtractor

added ma.ttias and engadget fixture
Contributor

@johnholdun johnholdun left a comment

Looks good! That change to the page fetching function makes sense to me.

@johnholdun johnholdun merged commit 143631b into postlight:master Aug 10, 2022
2 participants