feat: arstechnica.com extractor #553

jbrayton · 2020-04-27T12:24:14Z

This is an extractor for arstechnica.com. A few notes:

I removed the contentOnly: true option from extractorOpts in collect-all-pages.js because it resulted in next_page_url always being null on the second page of an article.
Articles from this site are often paginated, but I was unable to write a CSS selector that reliably finds the next page: on the last page, the link that such a selector would match points back to the previous page. However, the parser appears to find the next page on its own, without help from this extractor, as long as the fallback option is left at its default value of true.
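
To illustrate why a null next_page_url on the second page matters, here is a minimal sketch of a page-collecting loop. It is not the code in collect-all-pages.js: collectPages, parsePage, maxPages, and the sample URLs are all hypothetical, and parsePage is synchronous here for brevity even though the real parser returns a promise.

```javascript
// Hypothetical sketch; not the code in collect-all-pages.js.
// parsePage(url) stands in for a single-page parse and must return
// an object shaped like { content, next_page_url }.
function collectPages(firstUrl, parsePage, maxPages = 25) {
  const contents = [];
  const seen = new Set(); // guards against a "next" link that points backwards
  let url = firstUrl;
  while (url && !seen.has(url) && contents.length < maxPages) {
    seen.add(url);
    const page = parsePage(url);
    contents.push(page.content);
    // If this is null on page 2, the loop ends and later pages are lost.
    url = page.next_page_url;
  }
  return contents;
}

// Made-up three-page article used only to exercise the loop.
const site = {
  '/a': { content: 'page 1', next_page_url: '/a/2' },
  '/a/2': { content: 'page 2', next_page_url: '/a/3' },
  '/a/3': { content: 'page 3', next_page_url: null },
};
const allPages = collectPages('/a', url => site[url]);
```

The seen set also covers the case described above, where a link on the last page leads back to a page that was already collected.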

When parsing content for cron.weekly issues, such as the one at https://ma.ttias.be/cronweekly/issue-130/, Mercury Parser would remove headings and ordered lists that were part of the content. This resolves that as follows:

* Remove "id" attributes from "h1" and "h2" elements. Those attributes would result in the elements having a low weight.
* Since Mercury Parser demotes "h1" elements to "h2", demote "h2" elements to "h3".
* Add class="entry-content-asset" to "ul" elements to avoid them being removed.
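
The three steps above could be expressed as content transforms in a custom extractor, roughly as follows. This is a sketch, not the extractor from the PR: the constant name and the content selector are placeholders, and it assumes that a transform function receives a cheerio-style node and that returning a string converts the node to that tag.

```javascript
// Sketch only. The selector below is an assumed placeholder.
const MaTtiasBeExtractorSketch = {
  domain: 'ma.ttias.be',
  content: {
    selectors: ['.content'], // placeholder, not taken from the PR
    transforms: {
      // Drop the "id" attribute so the heading is not given a low weight.
      h1: $node => {
        $node.attr('id', null);
      },
      // "h1" elements are demoted to "h2", so demote "h2" to "h3".
      h2: $node => {
        $node.attr('id', null);
        return 'h3';
      },
      // Mark lists as content assets so they are not removed.
      ul: $node => {
        $node.attr('class', 'entry-content-asset');
      },
    },
  },
};
```

With cheerio, calling attr with a null value removes the attribute, which is why no separate removal call is used here.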

johnholdun

Looks good! That change to the page fetching function makes sense to me.

jbrayton added 10 commits April 23, 2020 17:34

removed redundant comment.

921d9d4

feat: Add a custom extractor for engadget.com.

0d63d8e

Works, but I need to figure how to make pagination work correctly.

d6966bd

fixed pagination - would only retrieve first or second page because w…

58612cd

…e would send contentOnly: true on subsequent pages (page 2). removed failover: true from preview.

rolled back { fallback: false } option removal

3df6604

Clarified comments.

9c93f9e

Merge pull request #1 from jbrayton/feat-ma-ttias-be-extractor

677b61f

Identical to postlight#551

Merge branch 'master' into feat-engadget-parser

2cfa36b

Merge pull request #2 from jbrayton/feat-engadget-parser

3efb2a9

Identical to postlight#552

jbrayton mentioned this pull request Apr 27, 2020

Feature arstechnica extractor jbrayton/mercury-parser#3

Merged

Merge branch 'master' into feature-arstechnica-extractor

3768f2e

ezequiel454 added a commit to RecastLLC/mercury-parser that referenced this pull request Aug 25, 2021

Merge pull request #4 from RecastLLC/postlight#553-arstechinica.com-e…

0a09618

…xtractor added ma.ttias and engadget fixture

johnholdun mentioned this pull request Aug 10, 2022

feat: Add a custom extractor for www.ndtv.com. #554

Merged

Merge remote-tracking branch 'origin/master' into feature-arstechnica…

ee21c6b

…-extractor

johnholdun approved these changes Aug 10, 2022

johnholdun merged commit 143631b into postlight:master Aug 10, 2022
