@jbrayton
Contributor

Add a custom extractor for www.engadget.com.

Engadget articles have dates, but I was unable to find one in a format I could parse. There are strings like "2h ago" and tags with blank values such as this:

<meta class="swiftype" name="published_at" data-type="date" value="">

So the extractor always returns a null date.
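For illustration only, here is a minimal sketch of how a relative string such as "2h ago" could be turned into an approximate timestamp. It is not part of this extractor. The function name `parseRelativeAge` and the supported units (`m`, `h`, `d`) are assumptions, and the result is only as accurate as the time at which the page was fetched.

```javascript
// Hypothetical helper: convert strings like "2h ago" into an approximate Date.
// Assumes the format "<number><unit> ago" with unit m (minutes), h (hours), or d (days).
function parseRelativeAge(text, now = new Date()) {
  const match = /^(\d+)\s*([mhd])\s+ago$/i.exec(text.trim());
  if (!match) {
    return null; // Unrecognized format: keep returning a null date.
  }
  const msPerUnit = { m: 60 * 1000, h: 60 * 60 * 1000, d: 24 * 60 * 60 * 1000 };
  const amount = Number(match[1]);
  const unit = match[2].toLowerCase();
  return new Date(now.getTime() - amount * msPerUnit[unit]);
}

// Usage: pass the fetch time explicitly so the result is reproducible.
const fetchedAt = new Date('2020-04-27T12:00:00Z');
const approxPublished = parseRelativeAge('2h ago', fetchedAt);
```

Because the value depends on when the page was fetched, a result like this would be an estimate rather than the article's real publication date.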

Engadget articles also have lead images, but I was unable to extract the image URL. For example, the fixture has:

<meta value="https://o.aolcdn.com/images/dims?resize=1200%2C630&amp;crop=1200%2C630%2C0%2C0&amp;quality=80&#x2111;uri=https%3A%2F%2Fs.yimg.com%2Fos%2Fcreatr-images%2F2020-04%2F7e5e3a50-8658-11ea-befb-f52e76d9e7b2&amp;client=amp-blogside-v2&amp;signature=193a0258fa9a401d2f1cdfc41909ac01e4db3147" name="og:image">

If I put a simpler URL in that value, I could select the image, so I suspect the &#x2111; sequence in the URL is what breaks the selection. As a fallback, I incorporated the lead images into the HTML content.
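To make the suspicion concrete, the sketch below decodes the character references in an abbreviated copy of the fixture URL and checks what comes out. The `decodeEntities` helper is hypothetical and handles only the two kinds of reference seen here. It shows that `&#x2111;` becomes a non-ASCII character (U+2111) inside the query string; it does not confirm that this is what makes the selection fail.

```javascript
// Hypothetical helper: decode hexadecimal character references and &amp; only.
// Numeric references are decoded first so that "&amp;#x..." is not decoded twice.
function decodeEntities(text) {
  return text
    .replace(/&#x([0-9a-fA-F]+);/g, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
    .replace(/&amp;/g, '&');
}

// Abbreviated version of the og:image value from the fixture.
const rawImageUrl =
  'https://o.aolcdn.com/images/dims?quality=80&#x2111;uri=https%3A%2F%2Fs.yimg.com&amp;client=amp-blogside-v2';

const decodedImageUrl = decodeEntities(rawImageUrl);

// True when the decoded URL contains any character outside the ASCII range.
const hasNonAscii = /[^\x00-\x7F]/.test(decodedImageUrl);
```

A URL check that accepts only ASCII characters would reject the decoded value, which would be consistent with a simpler URL being selectable.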

If a reviewer knows a good way to address either of these issues, I am happy to implement it.

When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:

* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.
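The three steps above can be sketched as a transforms map in the shape Mercury Parser's custom extractors use, where a function receives a cheerio-wrapped node and a returned string renames the tag. This is a sketch, not the code from the commit: the name `cronWeeklyTransforms` is an assumption, and `fakeNode` is a stand-in so the example runs without cheerio.

```javascript
// Hypothetical transforms map mirroring the three steps in the list above.
const cronWeeklyTransforms = {
  // Drop the id so the heading is not given a low weight.
  h1: ($node) => {
    $node.removeAttr('id');
  },
  // Drop the id and demote to h3, since h1 elements are demoted to h2.
  h2: ($node) => {
    $node.removeAttr('id');
    return 'h3';
  },
  // Mark lists so they are not removed.
  ul: ($node) => {
    $node.attr('class', 'entry-content-asset');
  },
};

// Minimal stand-in for a cheerio node, used only to exercise the sketch.
function fakeNode(attrs) {
  return {
    attrs: { ...attrs },
    attr(name, value) {
      this.attrs[name] = value;
      return this;
    },
    removeAttr(name) {
      delete this.attrs[name];
      return this;
    },
  };
}

const headingNode = fakeNode({ id: 'news' });
const renamedTag = cronWeeklyTransforms.h2(headingNode);

const listNode = fakeNode({});
cronWeeklyTransforms.ul(listNode);
```

In a real extractor the map would sit under the content section's `transforms` key and receive actual cheerio nodes.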
jbrayton added a commit to jbrayton/mercury-parser that referenced this pull request Apr 27, 2020
ezequiel454 added a commit to RecastLLC/mercury-parser that referenced this pull request Aug 25, 2021
Contributor

@johnholdun left a comment

Looks good! I don't have a suggestion for getting timestamps, but it's a problem I've seen pop up elsewhere so I'm thinking about a way to solve it. If we come up with something, we'll make sure this extractor gets updated.

@johnholdun merged commit 3c5c0bd into postlight:master Aug 10, 2022