@jbrayton

When parsing cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser removed headings and unordered lists that were part of the content. It also demoted h1 elements to h2, which made the original h2 elements appear to sit at the same level of the document hierarchy. This PR resolves these issues as follows:

  • Remove id attributes from h1 and h2 elements. Those attributes would result in the elements having a low weight.
  • Since Mercury Parser demotes h1 elements to h2, demote h2 elements to h3.
  • Add a class="entry-content-asset" attribute to ul elements to avoid them being removed.
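The three changes above could be expressed as content transforms in a custom extractor. The following is a minimal sketch rather than the merged code: the `.content` selector is a hypothetical placeholder, and the sketch relies on Mercury Parser's documented behavior that a transform function returning a string converts the node to that tag.

```javascript
// Sketch of a Mercury Parser custom extractor for ma.ttias.be.
// Assumption: '.content' is a placeholder selector, not taken from this PR.
const MaTtiasBeExtractor = {
  domain: 'ma.ttias.be',

  content: {
    selectors: ['.content'],

    transforms: {
      // Drop the id so the heading is not given a low weight.
      // No rename here: Mercury Parser already demotes h1 to h2.
      h1: $node => {
        $node.removeAttr('id');
      },
      // Drop the id, then return a tag name to request h2 -> h3 conversion.
      h2: $node => {
        $node.removeAttr('id');
        return 'h3';
      },
      // Mark lists with this class so they are not removed during cleaning.
      ul: $node => {
        $node.attr('class', 'entry-content-asset');
      },
    },

    // No extra selectors to strip from the extracted content.
    clean: [],
  },
};
```

In an extractor module this object would be declared with `export const` so that it can be registered alongside the other custom extractors.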

The site does not have deks or lead images, so those are not in the extractor.

@jbrayton changed the title from "Feat ma ttias be extractor" to "feat: ma.ttias.be extractor" on Apr 24, 2020
@jbrayton added a commit to jbrayton/mercury-parser that referenced this pull request on Apr 27, 2020
@johnholdun merged commit e217648 into postlight:master on May 9, 2022