scraper follow the only link #1290

qjebbs · 2021-12-08T08:52:19Z

In some cases, the scraper fetches only a landing page. With this PR, users can use a scraper rule (e.g. .post.detail .title a) to locate the link on the landing page and follow it to the real content.

How to use:

For feeds that give you only a landing page, find the selector of the redirect link (e.g. .post.detail .title a).
Set up the scraper rule in the feed settings.
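
The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: onlyLinkTarget is a hypothetical helper, the example.org URLs are placeholders, and it assumes "the only link" means the scraper rule matched exactly one anchor whose href has already been extracted. Selector matching itself is left out to keep the sketch standard-library only.

```go
package main

import (
	"fmt"
	"net/url"
)

// onlyLinkTarget is a hypothetical helper (not Miniflux's actual code).
// hrefs holds the href values of the elements matched by the scraper rule.
// It returns the absolute URL to follow only when exactly one link matched;
// zero or several matches mean the page is treated as regular content.
func onlyLinkTarget(pageURL string, hrefs []string) (string, bool) {
	if len(hrefs) != 1 {
		return "", false
	}
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", false
	}
	ref, err := url.Parse(hrefs[0])
	if err != nil {
		return "", false
	}
	// Resolve relative links against the landing page URL.
	return base.ResolveReference(ref).String(), true
}

func main() {
	target, ok := onlyLinkTarget("https://example.org/posts/1", []string{"/k/abc"})
	fmt.Println(target, ok)
}
```

Checking the number of matches before indexing into the slice also avoids an out-of-range access when the rule matches nothing.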

It also fixes the wrong scraper rule being applied when the server redirects to another host.
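
To illustrate the redirect issue: if rules are keyed by host, the lookup should use the URL the request actually ended on, not the one originally requested. The sketch below is an assumption-laden illustration; ruleForFinalURL, the rulesByHost map, and the feed.example / article.example hosts are all hypothetical and do not reflect how Miniflux stores its rules.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// ruleForFinalURL is a hypothetical helper (not Miniflux's actual code).
// It looks up a scraper rule by the host of the given URL, ignoring a
// leading "www." prefix. An empty string means no rule is known.
func ruleForFinalURL(finalURL string, rulesByHost map[string]string) string {
	u, err := url.Parse(finalURL)
	if err != nil {
		return ""
	}
	host := strings.TrimPrefix(u.Hostname(), "www.")
	return rulesByHost[host]
}

func main() {
	// Illustrative rules only.
	rules := map[string]string{
		"feed.example":    ".post.detail .title a",
		"article.example": "article.content",
	}
	requested := "https://feed.example/posts/1"
	final := "https://www.article.example/story" // where the server redirected to

	fmt.Println(ruleForFinalURL(requested, rules)) // rule for the original host
	fmt.Println(ruleForFinalURL(final, rules))     // rule for the host actually served
}
```

Using the requested URL after a cross-host redirect would select the first rule, which does not describe the page that was actually downloaded.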

Do you follow the guidelines?

I have tested my changes
I read this document: https://miniflux.app/faq.html#pull-request

* In some cases, what the scraper gets is only a landing page; users can use scraper rules to extract the link from the landing page and follow it.
* It also fixes the wrong scraper rule being applied when the server redirects to another host.

fguillot · 2022-01-05T04:16:45Z

Can you provide some examples of website to test this change?

qjebbs · 2022-01-05T04:21:19Z

Sure, the settings below work perfectly for that feed.

feed: https://toutiao.io/daily.xml
scrape rule: .post.detail .title a
rewrite rule: add_dynamic_image

thiagowfx · 2022-01-21T20:40:22Z

Another example for testing: https://www.bloomberg.com/authors/ARbTQlRLRjE/matthew-s-levine.rss

fguillot · 2022-10-30T17:50:11Z

> feed: https://toutiao.io/daily.xml

This website doesn't seem to exist anymore.

qjebbs · 2022-10-31T01:11:50Z

> feed: https://toutiao.io/daily.xml
>
> This website doesn't seem to exist anymore.

It has been announced as being under maintenance until 31-10-2022.

scraper follow the only link

4ad9a2c

* In some cases, what the scraper gets is only a landing page; users can use scraper rules to extract the link from the landing page and follow it.
* It also fixes the wrong scraper rule being applied when the server redirects to another host.

fguillot merged commit 1020796 into miniflux:main Nov 1, 2022

fguillot added a commit that referenced this pull request Nov 15, 2022

Add missing check in followTheOnlyLink() that leads to a panic

af72635

Bug introduced in PR #1290. Fixes #1631.

fguillot mentioned this pull request Nov 15, 2022

Add missing check in followTheOnlyLink() that leads to a panic #1632

Merged

2 tasks

fguillot added a commit that referenced this pull request Nov 15, 2022

Add missing check in followTheOnlyLink() that leads to a panic

de1a06e

Bug introduced in PR #1290. Fixes #1631.

fguillot mentioned this pull request Nov 15, 2022

Revert "scraper follow the only link" #1633

Merged

2 tasks

frobiac mentioned this pull request Dec 12, 2022

miniflux: update to 2.0.41 void-linux/void-packages#41039

Merged

qjebbs deleted the scraper_follow_link branch January 30, 2023 02:34
