Deduplicate Feature #797
Comments
Some sites publish original news posts while others copy them, so when I subscribed to these feeds I kept seeing duplicate articles across several feeds. I would like to deduplicate similar entries across multiple or all feeds. When a new entry is added, Miniflux would check recent existing entries and calculate similarity; if an existing entry reaches the configured threshold, the new entry is marked as removed or read. For the similarity calculation, we could first split the text into words and use cosine similarity, or simply use equality. Users could configure how similarity is calculated: on the title or on the content.
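For illustration, a minimal sketch of that check on titles (everything here is hypothetical and not part of Miniflux; the threshold and function names are made up): split the title into words, build word-frequency vectors, and compare a new entry against recent ones with cosine similarity.

```python
# Hypothetical sketch of the proposed similarity check, not Miniflux code.
from collections import Counter
import math

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-frequency vectors of two strings."""
    vec_a, vec_b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

SIMILARITY_THRESHOLD = 0.8  # hypothetical user-configurable threshold

def is_duplicate(new_title: str, recent_titles: list[str]) -> bool:
    """True if the new title is close enough to any recently seen title."""
    return any(cosine_similarity(new_title, old) >= SIMILARITY_THRESHOLD
               for old in recent_titles)
```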
Hopefully the "Mark as Read" option is available. That's what I manually do anyway.
Since Miniflux relies on PostgreSQL, maybe something like the pg_trgm extension would be useful? https://www.postgresql.org/docs/current/pgtrgm.html
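To make that concrete, here is a rough sketch of how a trigram similarity query could be run from Python. The psycopg2 connection string, the "entries" table, and the "id", "title" and "user_id" columns are assumptions about the schema rather than something Miniflux documents, and creating the extension typically needs superuser rights; adjust to the real setup.

```python
# Rough sketch only: find pairs of entries with similar titles using pg_trgm.
import psycopg2

conn = psycopg2.connect("dbname=miniflux2 user=miniflux")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
    cur.execute(
        """
        SELECT a.id, b.id, similarity(a.title, b.title) AS score
        FROM entries a
        JOIN entries b ON a.user_id = b.user_id AND a.id < b.id
        WHERE similarity(a.title, b.title) > %s
        """,
        (0.7,),  # hypothetical similarity threshold
    )
    for id_a, id_b, score in cur.fetchall():
        print(f"entries {id_a} and {id_b} look like duplicates (score {score:.2f})")
```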
This would be an awesome feature. A lot of times a writer writes for his own blog and then reposts somewhere else, but the post lands in the same Miniflux category with the same title. If it were to remove either entry (preferably keeping the first), that would be awesome.
Came here with a slightly different (but related) problem: some of my feeds, largely big newspapers, re-publish the same articles over time. This particularly applies to essays; I think they want to push them a number of times so their website appears "more active", without adding any new information. But it is frustrating to see the same posts popping up again and again; it wastes my time. I was wondering whether a deduplication feature could also include some temporal comparison, such as "the same article heading was published 1 month ago, 2 years ago, etc.", so that the entry gets hidden from the standard view. Functionality-wise it would be pretty similar: one needs a persistent table with headings (and timestamps) in Postgres to check against.
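A rough user-side sketch of that idea, using a JSON file as the persistent store of headings and first-seen timestamps instead of a Postgres table. The instance URL, API key, and file name are placeholders, and this simply marks re-published headings as read rather than hiding them.

```python
# Hypothetical sketch: remember each heading with the date it was first seen
# and mark later entries with the same heading as read.
import json
import pathlib
from datetime import datetime, timezone
import miniflux

SEEN_FILE = pathlib.Path("seen_titles.json")
client = miniflux.Client("https://rss.example.com", api_key="API_KEY")

seen = json.loads(SEEN_FILE.read_text()) if SEEN_FILE.exists() else {}
repeats = []
for entry in client.get_entries(status="unread")["entries"]:
    heading = entry["title"].strip().lower()
    if heading in seen:
        repeats.append(entry["id"])  # same heading was already published earlier
    else:
        seen[heading] = datetime.now(timezone.utc).isoformat()

if repeats:
    client.update_entries(repeats, status="read")
SEEN_FILE.write_text(json.dumps(seen, indent=2))
```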
I don't know how it is developed, but ttrss has such a deduplicate feature. Maybe it can help to develop such a feature for Miniflux too!
As a workaround, I created a Python script using the API to check for repeated URLs and remove duplicates, borrowing from a similar solution; it can be run as a cronjob on your Miniflux server. I might look into triggering it with webhooks in the future, so if anyone works that out please comment the details. Or if you tidy up the script I'd be interested to see it, as I don't use Python often. Note if using the … Also, if you want to also check titles, just create another set to aggregate them with the …

```python
# Licensed under MIT license
# Link to discussion: https://github.com/miniflux/v2/issues/797
# Steps to use:
# - Install the Miniflux python client: https://miniflux.app/docs/api.html#python-client
# - Replace rss.example.com with your Miniflux instance url
# - Replace API_KEY with your Miniflux api key which you can get here https://rss.example.com/keys
import re
import miniflux
# Behaviour: Keep first instance based on url, mark subsequent as removed
# Drawbacks:
# - Newer versions of article may have updates,
# but keeping these would lose read status and other attributes
# - Repeat article could be so old as to have already been removed,
# if it's been so long I assume changes might be worth reading
# - The oldest one might not've been the one read
# Removal process: https://miniflux.app/faq.html#entries-suppression
def remove_duplicates(feed_ids):
    dupe_ids = []
    for feed_id in feed_ids:
        entries = client.get_feed_entries(feed_id=feed_id, order="id",
                                          direction="asc", status=["read", "unread"])
        seen_urls = set()
        for entry in entries["entries"]:
            if entry["url"] in seen_urls:
                dupe_ids.append(entry["id"])
                # print("Duplicate found " + entry["title"])
            else:
                seen_urls.add(entry["url"])
    if dupe_ids:  # Repeats found
        client.update_entries(dupe_ids, status="removed")

# Get a list of feeds to check for duplicates based on blocklist text.
# To check a feed add "RemoveDuplicates" to the Blocklist_rules box
def get_feeds_w_dupes(all_feeds):
    feeds_ids = []
    for feed in all_feeds:
        if "RemoveDuplicates" in feed["blocklist_rules"]:
            feeds_ids.append(feed["id"])
            # print("Feed " + str(feed["id"]) + " has rule RemoveDuplicates")
    return feeds_ids

client = miniflux.Client("https://rss.example.com", api_key="API_KEY")
all_feeds = client.get_feeds()
feeds_w_dupes = get_feeds_w_dupes(all_feeds)
remove_duplicates(feeds_w_dupes)
```
Great! But this only works on a single-URL basis. If the URL changed but the text remained the same (with e.g. a slightly changed title), this would not be detected (I am not complaining, this is much better than nothing. Thank you so much!). There are a lot of feeds from newspapers that re-publish entries under a slightly changed title every x days. Some kind of semantic similarity would be needed to catch these.
@Sieboldianus Happy to help! Here's an edit of the main function to pick up on identical titles. I'd also suggest checking whether the articles keep some other attribute that can be detected, like "published_at", which will be a string like "2024-06-10T15:53:17+01:00", so it should be fairly unique.

```python
def remove_duplicates(feed_ids):
    dupe_ids = []
    for feed_id in feed_ids:
        entries = client.get_feed_entries(feed_id=feed_id, order="id",
                                          direction="asc", status=["read", "unread"])
        seen_urls = set()
        seen_titles = set()
        for entry in entries["entries"]:
            if (entry["url"] in seen_urls) or (entry["title"] in seen_titles):
                dupe_ids.append(entry["id"])
                # print("Duplicate found " + entry["title"])
            else:
                seen_urls.add(entry["url"])
                seen_titles.add(entry["title"])
    if dupe_ids:  # Repeats found
        client.update_entries(dupe_ids, status="removed")
```

For fuzzy matching I wrote the function below. I've checked that the matching works, but I didn't have any feeds that rename titles, so I haven't validated it against real data; if you could post example feeds, that would be helpful, thanks.

```python
import re
import miniflux
from fuzzywuzzy import fuzz # pip install fuzzywuzzy
# Compare strings with https://en.wikipedia.org/wiki/Levenshtein_distance
import nltk # pip install nltk
from nltk.corpus import stopwords # list of words to ignore
nltk.download('stopwords') # Download stopwords (only needed on first run)
def read_duplicates(feed_ids, sensitivity=85):
    """
    Mark articles as read if they are a duplicate of a previously seen
    article, based on similarity of the title.

    Similarity is computed as follows:
    Take the title, remove all punctuation and stop words (words like 'a',
    'and', 'the', etc.) and set it to lower case.
    This produces a string of the important words in the title.
    If the article is read, store this processed title for comparison.
    If it is unread, check whether the processed title has already been seen;
    if so, mark the entry as read.
    If not, check for similarity based on Levenshtein distance (LD);
    words are sorted into alphabetical order before computing LD.

    Args:
        feed_ids: The ids of feeds to check, like ["1", "2"]
        sensitivity: How similar two strings must be to match (0-100)
    """
    dupe_ids = []
    stop_words = set(stopwords.words('english'))  # words to ignore
    for feed_id in feed_ids:
        seen_titles = set()
        entries = client.get_feed_entries(feed_id=feed_id, order="id",
                                          direction="asc", status=["read", "unread"])
        # Initially I assumed we could collect all read titles first and then
        # check the unread ones, but with titles like
        # '1. bad thing happens', '2. things bad happens',
        # once 2. is marked read for being similar to 1., the next run of the
        # program first sees 2. among the read entries, then sees 1. which is
        # similar, so it marks 1. as read, even though neither was ever read.
        # So we have to go through by id and then check the read status.
        for entry in entries["entries"]:
            # processed title is the title without joining words, in lower case
            processed_title = re.sub(r'\W+', ' ', entry["title"])  # replace non-alphanumeric chars with space
            processed_title = ' '.join([word for word in processed_title.lower().split()
                                        if word not in stop_words])
            if entry["status"] == "read":
                seen_titles.add(processed_title)
            else:  # only check unread entries
                if processed_title in seen_titles:
                    print("Duplicate found " + processed_title)
                    dupe_ids.append(entry["id"])
                else:
                    for title in seen_titles:
                        print("checking: '" + processed_title + "' against '" + title + "'")
                        # 1-100, higher = closer matching
                        # token_sort will sort words into order before comparing titles
                        if fuzz.token_sort_ratio(processed_title, title) > sensitivity:
                            dupe_ids.append(entry["id"])
                            seen_titles.add(processed_title)
                            # Keep matches too, so one article may have multiple seen
                            # titles and we can match between them, e.g.
                            # The white whale > Big white whale > Big whale,
                            # where the initial title may not match the later title,
                            # but by keeping the intermediate we recognise the final version.
                            break  # match found, stop checking
                    seen_titles.add(processed_title)
    if dupe_ids:  # Repeats found
        client.update_entries(dupe_ids, status="read")
```
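For completeness, the fuzzy variant can be wired up the same way as the URL-based script above; this assumes get_feeds_w_dupes and the client setup from that script are in the same file, and the sensitivity value is just a starting point.

```python
# Assumed wiring, mirroring the URL-based script: reuse the "RemoveDuplicates"
# blocklist-rule opt-in to pick feeds, then run the fuzzy title check.
client = miniflux.Client("https://rss.example.com", api_key="API_KEY")
feeds_w_dupes = get_feeds_w_dupes(client.get_feeds())
read_duplicates(feeds_w_dupes, sensitivity=85)
```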
Nice, thank you! I will definitely test it.
Thanks for your script. I did notice that I had to add a …
One big change would be to turn the 1-to-1 relation from entry to feed into a 1-to-n relation, so that the same entry can be part of multiple feeds (deduplication). As Miniflux is not using an ORM, this requires changes around the whole code base. Update: Miniflux already supports storing an entry's hash based on the GUID, or otherwise uses the entry URL (see v2/internal/reader/rss/adapter.go, lines 103 to 108 at c326d55).
So it should be possible to implement deduplication on this criterion efficiently. The entry's content should still be updated when it changes. There is already deduplication within a single feed based on this hash (line 219 at c326d55).
So one probably "just" has to remove the feed filter (and add a user filter instead). The most work would be around changing the schema relation and all its usages.
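As a user-side approximation of that hash-based criterion (the internally stored hash is not exposed through the API, so this sketch just hashes the entry URL itself; instance URL and API key are placeholders):

```python
# Illustration only: derive a key from each entry URL and mark later
# cross-feed occurrences of the same URL as read.
import hashlib
import miniflux

client = miniflux.Client("https://rss.example.com", api_key="API_KEY")
seen_hashes = set()
dupe_ids = []
for entry in client.get_entries(order="id", direction="asc",
                                status=["read", "unread"])["entries"]:
    key = hashlib.sha256(entry["url"].encode("utf-8")).hexdigest()
    if key in seen_hashes:
        dupe_ids.append(entry["id"])
    else:
        seen_hashes.add(key)
if dupe_ids:
    client.update_entries(dupe_ids, status="read")
```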
Occasionally, I get duplicate entries of the same feed due to reading a feed at the source as well as via an aggregator like Hacker News. I would love to be able to dedupe based on the following fields:
(edit for spelling)