Some feeds have duplicate entries #79

lemon24 · 2018-07-08T17:49:47Z

Some entries have duplicate entries, or their ids change (resulting in an entry being stored twice).

E.g., a feed that had the entry id format updated:

$ sqlite3 db.sqlite 'select feed, id, updated, title from entries where title like "RE: xkcd%"' -line
   feed = http://sealedabstract.com/feed/
     id = http://sealedabstract.com/?p=2494
updated = 2014-09-09 08:30:25
  title = RE: xkcd #1357 free speech

   feed = http://sealedabstract.com/feed/
     id = /?p=2494
updated = 2014-09-09 07:30:25
  title = RE: xkcd #1357 free speech

If possible, only one should be shown (similar to #78).

The text was updated successfully, but these errors were encountered:

lemon24 · 2018-07-11T17:21:35Z

A way to solve this is to have the user delete and re-add the feed (obviously, any feed and entry metadata along with the entries no longer in the feed are lost).

lemon24 · 2018-07-15T21:00:07Z

Another example:

$ sqlite3 db.sqlite 'select feed, id, updated, title from entries where title like "Basics of Futexes"' -line
   feed = https://eli.thegreenplace.net/feeds/all.atom.xml
     id = tag:eli.thegreenplace.net,2018-07-13:2018/basics-of-futexes/
updated = 2018-07-13 12:53:00
  title = Basics of Futexes

   feed = https://eli.thegreenplace.net/feeds/all.atom.xml
     id = tag:eli.thegreenplace.net,2018-07-13:/2018/basics-of-futexes/
updated = 2018-07-13 12:53:00
  title = Basics of Futexes

lemon24 · 2018-12-15T23:56:08Z

Another related (and useful) case is the same entry being re-posted in a related feed:

$ sqlite3 db.sqlite "select feed, id, updated, title from entries where title like '%The Accidental Room%'" -line
   feed = http://feeds.99percentinvisible.org/99percentinvisible
     id = prx_96_030a4709-780b-415d-be97-887a5485ba5e
updated = 2018-12-12 01:30:55
  title = 332- The Accidental Room

   feed = http://feeds.feedburner.com/99pi
     id = https://99percentinvisible.org/?post_type=episode&p=27621
updated = 2018-12-12 00:45:23
  title = The Accidental Room [EPISODE]

In this case, the first feed has enclosures and the second doesn't, and we'd like to only keep the first entry.

Another one:

$ sqlite3 db.sqlite "select feed, id, updated, title from entries where title like '%Stunt Peanut%'" -line
   feed = http://www.cgpgrey.com/blog?format=rss
     id = 5005afd7e4b0a6953320bf3f:50aa71a6e4b01bb8df1073d2:5c00520b758d46bd3c8a6646
updated = 2018-11-29 21:02:45
  title = H.I. #114: Stunt Peanut

   feed = http://www.hellointernet.fm/podcast?format=rss
     id = 52d66949e4b0a8cec3bcdd46:52d67282e4b0cca8969714fa:5c00515803ce6416b8bfb992
updated = 2018-11-29 21:02:33
  title = H.I. #114: Stunt Peanut

lemon24 · 2018-12-16T11:41:02Z

So, a possible implementation for entry deduplication:

a plugin (called after update) or a script (that runs regularly) and
- checks each existing entry against each new one (can limit this to a configurable subset of feeds)
  - identifies similar entry pairs (e.g. using the title and a configurable regex, or magic via some kind of string matching, maybe applying the algorithm to strings of words instead of strings of characters)
  - decides which entry is the original and which is the duplicate (e.g. for two feeds, one of them can be configured as the original; within a single feed, the original is the oldest one; prefer the one with enclosures)
- takes some action on the duplicate (mark it as read, mark as duplicate for later action, hide from normal get_entries() calls)

Some questions:

do we want to have the detection and the action decoupled? if yes, we need to store metadata about which entries are duplicates of which original

Update: I did some digging, and the approach described above could be implemented like this:

(assuming we use the entry title, can be applied to something else) clean up words common to most entries in a feed (e.g. "H.I." in the last example), split it into word bigrams, and use the Jaccard index with the bigram sets of two entries to determine similarity; while this would be relatively simple to implement, I don't like that we're pulling all the entries from the database; I also expect tweaking might be needed to find out what "words common to most entries" means, and what Jaccard index means "similar enough"
I think we could avoid getting all the entries every time by using some sort of locality-sensitive hashing; however, I understand how this works only in theory and at a very high level

Both seem a bit too "heavy" for the problem I'm trying to solve.

lemon24 · 2018-12-19T20:56:02Z

We can also give up on a single magic solution, and acknowledge that there are probably at least two different problems:

The entry id format changes for all the entries in a feed; this results in the entry being stored twice.
Some entries are reposted on a different feed.

For both cases, the duplicates:

Have the same title (the first two examples above):
1. For the same feed, we can check on insert.
2. For different feeds, we may be able to check on insert, depending on the insert order; we can also check later (to minimize the number of entries checked, we can mark an entry as "just added", and invalidate this on the next update). We can improve performance of both by being able to get entries by title.
Do not have the same title:
1. Have a pattern identifying them (the third and fourth examples above). This can be done on insert, with no additional queries.
2. Do not have a pattern identifying them. Don't know how to solve this cheaply; you need to be able to assume something is the same. Can be done with the similarity stuff described in the previous comment (which I don't wanna do now).

lemon24 · 2018-12-20T21:49:25Z

For now, I will implement 2.i. (different title, identifying pattern) as I am currently affected by it.

For #79.

lemon24 · 2018-12-22T16:32:47Z

The only other sub-case that actually happened is 1.i. (same title, same feed); it happened about 2-4 times in 1 year.

For #79.

lemon24 · 2018-12-22T22:32:24Z

Ran reader update with both feed_entry_dedupe and regex_mark_as_read on a database with ~60 feeds / ~3000 entries; there seemed to be no significant run time degradation.

lemon24 · 2018-12-22T22:32:52Z

Covered all known cases, resolving.

lemon24 added API web app labels Jul 8, 2018

lemon24 changed the title ~~Some feeds have duplicated entries~~ Some feeds have duplicate entries Jul 8, 2018

lemon24 mentioned this issue Jul 24, 2018

Plug-in API #80

Closed

lemon24 added a commit that referenced this issue Dec 20, 2018

Add Reader._post_entry_add_plugins.

b6b1caf

For #79.

lemon24 added a commit that referenced this issue Dec 20, 2018

Add regex_mark_as_read plugin.

191312e

For #79.

lemon24 added a commit that referenced this issue Dec 22, 2018

Add regex_mark_as_read test.

6227871

For #79.

lemon24 added a commit that referenced this issue Dec 22, 2018

Add feed_entry_dedupe plugin.

9bb25cd

For #79.

lemon24 added a commit that referenced this issue Dec 22, 2018

Comment fixes.

b083ec8

For #79.

lemon24 added a commit that referenced this issue Dec 22, 2018

Mention duplicate entry plugins in the changelog.

b649475

For #79.

lemon24 closed this as completed Dec 22, 2018

lemon24 mentioned this issue Sep 21, 2019

Allow entry_dedupe to delete the old entries #140

Closed

lemon24 mentioned this issue Jan 28, 2020

Feed URL changed #149

Closed

lemon24 mentioned this issue Mar 1, 2020

Cannot search through entries #122

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some feeds have duplicate entries #79

Some feeds have duplicate entries #79

lemon24 commented Jul 8, 2018 •

edited

Loading

lemon24 commented Jul 11, 2018

lemon24 commented Jul 15, 2018

lemon24 commented Dec 15, 2018

lemon24 commented Dec 16, 2018 •

edited

Loading

lemon24 commented Dec 19, 2018 •

edited

Loading

lemon24 commented Dec 20, 2018

lemon24 commented Dec 22, 2018

lemon24 commented Dec 22, 2018

lemon24 commented Dec 22, 2018

Some feeds have duplicate entries #79

Some feeds have duplicate entries #79

Comments

lemon24 commented Jul 8, 2018 • edited Loading

lemon24 commented Jul 11, 2018

lemon24 commented Jul 15, 2018

lemon24 commented Dec 15, 2018

lemon24 commented Dec 16, 2018 • edited Loading

lemon24 commented Dec 19, 2018 • edited Loading

lemon24 commented Dec 20, 2018

lemon24 commented Dec 22, 2018

lemon24 commented Dec 22, 2018

lemon24 commented Dec 22, 2018

lemon24 commented Jul 8, 2018 •

edited

Loading

lemon24 commented Dec 16, 2018 •

edited

Loading

lemon24 commented Dec 19, 2018 •

edited

Loading