scraper follow the only link #1290

Merged (1 commit) on Nov 1, 2022


qjebbs (Contributor) commented Dec 8, 2021

In some cases, the scraper retrieves only a landing page. With this PR, users can use scraper rules (e.g. .post.detail .title a) to locate the link on the landing page and have the scraper follow it to the real content.

How to use:

  1. For feeds that give you only a landing page, find the selector of the link that leads to the real content (e.g. .post.detail .title a).
  2. Set that selector as the scraper rule in the feed settings.
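
The two steps above can be illustrated with a small sketch. It is written in Python with only the standard library and is not the code in this PR (Miniflux is written in Go). The names `LinkFinder` and `follow_target`, the sample HTML, and the `example.org` URL are hypothetical. The matcher supports only tag and class selectors joined by descendant combinators, and it assumes that a link is followed only when the rule matches exactly one link.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def parse_selector(selector):
    """Split '.post.detail .title a' into [(tag_or_None, {classes}), ...]."""
    steps = []
    for part in selector.split():
        tag, *classes = part.split(".")
        steps.append((tag or None, set(classes)))
    return steps


def step_matches(step, element):
    tag, classes = step
    el_tag, el_classes = element
    return (tag is None or tag == el_tag) and classes <= el_classes


class LinkFinder(HTMLParser):
    """Collects href values of elements matched by a descendant selector."""

    def __init__(self, selector):
        super().__init__()
        self.steps = parse_selector(selector)
        self.stack = []   # currently open elements as (tag, classes)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        element = (tag, set((attrs.get("class") or "").split()))
        if self._matches(element) and attrs.get("href"):
            self.hrefs.append(attrs["href"])
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Pop back to the nearest open tag of this name (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def _matches(self, element):
        if not self.steps:
            return False
        # The last step must match the element itself; the earlier steps
        # must match its ancestors in order, nearest ancestor first.
        *ancestor_steps, last = self.steps
        if not step_matches(last, element):
            return False
        remaining = list(ancestor_steps)
        for ancestor in reversed(self.stack):
            if remaining and step_matches(remaining[-1], ancestor):
                remaining.pop()
        return not remaining


def follow_target(page_url, html, selector):
    """Return the absolute URL to follow, or None unless exactly one link matched."""
    finder = LinkFinder(selector)
    finder.feed(html)
    if len(finder.hrefs) != 1:
        return None
    return urljoin(page_url, finder.hrefs[0])


# Hypothetical landing page: only the title link should be followed.
landing = """
<div class="post detail">
  <h3 class="title"><a href="/k/abc123">Real article</a></h3>
  <p class="meta"><a href="/users/1">author</a></p>
</div>
"""
print(follow_target("https://example.org/posts/1", landing, ".post.detail .title a"))
```

The author link is ignored because its ancestors do not include an element with the title class, and the relative href is resolved against the landing page URL before it is followed.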

It also fixes the wrong scraper rule being applied when the server redirects to another host.
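
One plausible reading of the redirect fix, sketched in Python rather than Miniflux's Go: rules are chosen for the URL that was actually fetched, and a rule written for the original host is not applied to a different host. The function `pick_rules`, the `PREDEFINED_RULES` table, and its entries are hypothetical. An empty result is assumed to mean "fall back to generic content extraction".

```python
from urllib.parse import urlsplit

# Hypothetical rule table keyed by host; not Miniflux's real predefined rules.
PREDEFINED_RULES = {
    "example.org": "article.main",
    "example.net": "div.story",
}


def host_of(url):
    """Return the host of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def pick_rules(requested_url, effective_url, user_rules=""):
    """Choose scraper rules for the page that was actually fetched.

    requested_url is the entry URL from the feed; effective_url is where
    the server ended up after redirects.
    """
    same_host = host_of(requested_url) == host_of(effective_url)
    if user_rules and same_host:
        # The user's rule was written for this host, so it still applies.
        return user_rules
    # After a cross-host redirect (or with no user rule), look up rules
    # for the final host instead of the original one.
    return PREDEFINED_RULES.get(host_of(effective_url), "")


print(pick_rules("https://example.org/a", "https://example.net/b", ".post a"))
```

Keying the lookup on the effective URL is the design point: a selector that fits the original site says nothing about the markup of the site the request was redirected to.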

Do you follow the guidelines?

* In some cases, the scraper retrieves only a landing page; users can use scraper rules to extract the link from the landing page and follow it.
* It also fixes the wrong scraper rule being applied when the server redirects to another host.

fguillot (Member) commented Jan 5, 2022

Can you provide some example websites to test this change?

qjebbs (Contributor, Author) commented Jan 5, 2022

Sure, the settings below work perfectly for this feed.

feed: https://toutiao.io/daily.xml
scrape rule: .post.detail .title a
rewrite rule: add_dynamic_image

thiagowfx (Contributor) commented

Another example for testing: https://www.bloomberg.com/authors/ARbTQlRLRjE/matthew-s-levine.rss

fguillot (Member) commented

> feed: https://toutiao.io/daily.xml

This website doesn't seem to exist anymore.

qjebbs (Contributor, Author) commented Oct 31, 2022

> > feed: https://toutiao.io/daily.xml
>
> This website doesn't seem to exist anymore.

The site has announced that it is under maintenance until 31-10-2022.

@fguillot fguillot merged commit 1020796 into miniflux:main Nov 1, 2022
fguillot added a commit that referenced this pull request Nov 15, 2022
fguillot added a commit that referenced this pull request Nov 15, 2022
@qjebbs qjebbs deleted the scraper_follow_link branch January 30, 2023 02:34