Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some feeds have duplicate entries #79

Closed
lemon24 opened this issue Jul 8, 2018 · 9 comments
Closed

Some feeds have duplicate entries #79

lemon24 opened this issue Jul 8, 2018 · 9 comments

Comments

@lemon24
Copy link
Owner

lemon24 commented Jul 8, 2018

Some entries have duplicate entries, or their ids change (resulting in an entry being stored twice).

E.g., a feed that had the entry id format updated:

$ sqlite3 db.sqlite 'select feed, id, updated, title from entries where title like "RE: xkcd%"' -line
   feed = http://sealedabstract.com/feed/
     id = http://sealedabstract.com/?p=2494
updated = 2014-09-09 08:30:25
  title = RE: xkcd #1357 free speech

   feed = http://sealedabstract.com/feed/
     id = /?p=2494
updated = 2014-09-09 07:30:25
  title = RE: xkcd #1357 free speech

If possible, only one should be shown (similar to #78).

@lemon24 lemon24 changed the title Some feeds have duplicated entries Some feeds have duplicate entries Jul 8, 2018
@lemon24
Copy link
Owner Author

lemon24 commented Jul 11, 2018

A way to solve this is to have the user delete and re-add the feed (obviously, any feed and entry metadata along with the entries no longer in the feed are lost).

@lemon24
Copy link
Owner Author

lemon24 commented Jul 15, 2018

Another example:

$ sqlite3 db.sqlite 'select feed, id, updated, title from entries where title like "Basics of Futexes"' -line
   feed = https://eli.thegreenplace.net/feeds/all.atom.xml
     id = tag:eli.thegreenplace.net,2018-07-13:2018/basics-of-futexes/
updated = 2018-07-13 12:53:00
  title = Basics of Futexes

   feed = https://eli.thegreenplace.net/feeds/all.atom.xml
     id = tag:eli.thegreenplace.net,2018-07-13:/2018/basics-of-futexes/
updated = 2018-07-13 12:53:00
  title = Basics of Futexes

@lemon24 lemon24 mentioned this issue Jul 24, 2018
@lemon24
Copy link
Owner Author

lemon24 commented Dec 15, 2018

Another related (and useful) case is the same entry being re-posted in a related feed:

$ sqlite3 db.sqlite "select feed, id, updated, title from entries where title like '%The Accidental Room%'" -line
   feed = http://feeds.99percentinvisible.org/99percentinvisible
     id = prx_96_030a4709-780b-415d-be97-887a5485ba5e
updated = 2018-12-12 01:30:55
  title = 332- The Accidental Room

   feed = http://feeds.feedburner.com/99pi
     id = https://99percentinvisible.org/?post_type=episode&p=27621
updated = 2018-12-12 00:45:23
  title = The Accidental Room [EPISODE]

In this case, the first feed has enclosures and the second doesn't, and we'd like to only keep the first entry.

Another one:

$ sqlite3 db.sqlite "select feed, id, updated, title from entries where title like '%Stunt Peanut%'" -line
   feed = http://www.cgpgrey.com/blog?format=rss
     id = 5005afd7e4b0a6953320bf3f:50aa71a6e4b01bb8df1073d2:5c00520b758d46bd3c8a6646
updated = 2018-11-29 21:02:45
  title = H.I. #114: Stunt Peanut

   feed = http://www.hellointernet.fm/podcast?format=rss
     id = 52d66949e4b0a8cec3bcdd46:52d67282e4b0cca8969714fa:5c00515803ce6416b8bfb992
updated = 2018-11-29 21:02:33
  title = H.I. #114: Stunt Peanut

@lemon24
Copy link
Owner Author

lemon24 commented Dec 16, 2018

So, a possible implementation for entry deduplication:

  • a plugin (called after update) or a script (that runs regularly) and
    • checks each existing entry against each new one (can limit this to a configurable subset of feeds)
      • identifies similar entry pairs (e.g. using the title and a configurable regex, or magic via some kind of string matching, maybe applying the algorithm to strings of words instead of strings of characters)
      • decides which entry is the original and which is the duplicate (e.g. for two feeds, one of them can be configured as the original; within a single feed, the original is the oldest one; prefer the one with enclosures)
    • takes some action on the duplicate (mark it as read, mark as duplicate for later action, hide from normal get_entries() calls)

Some questions:

  • do we want to have the detection and the action decoupled? if yes, we need to store metadata about which entries are duplicates of which original

Update: I did some digging, and the approach described above could be implemented like this:

  • (assuming we use the entry title, can be applied to something else) clean up words common to most entries in a feed (e.g. "H.I." in the last example), split it into word bigrams, and use the Jaccard index with the bigram sets of two entries to determine similarity; while this would be relatively simple to implement, I don't like that we're pulling all the entries from the database; I also expect tweaking might be needed to find out what "words common to most entries" means, and what Jaccard index means "similar enough"
  • I think we could avoid getting all the entries every time by using some sort of locality-sensitive hashing; however, I understand how this works only in theory and at a very high level

Both seem a bit too "heavy" for the problem I'm trying to solve.

@lemon24
Copy link
Owner Author

lemon24 commented Dec 19, 2018

We can also give up on a single magic solution, and acknowledge that there are probably at least two different problems:

  1. The entry id format changes for all the entries in a feed; this results in the entry being stored twice.
  2. Some entries are reposted on a different feed.

For both cases, the duplicates:

  1. Have the same title (the first two examples above):
    1. For the same feed, we can check on insert.
    2. For different feeds, we may be able to check on insert, depending on the insert order; we can also check later (to minimize the number of entries checked, we can mark an entry as "just added", and invalidate this on the next update). We can improve performance of both by being able to get entries by title.
  2. Do not have the same title:
    1. Have a pattern identifying them (the third and fourth examples above). This can be done on insert, with no additional queries.
    2. Do not have a pattern identifying them. Don't know how to solve this cheaply; you need to be able to assume something is the same. Can be done with the similarity stuff described in the previous comment (which I don't wanna do now).

@lemon24
Copy link
Owner Author

lemon24 commented Dec 20, 2018

For now, I will implement 2.i. (different title, identifying pattern) as I am currently affected by it.

lemon24 added a commit that referenced this issue Dec 20, 2018
lemon24 added a commit that referenced this issue Dec 20, 2018
lemon24 added a commit that referenced this issue Dec 22, 2018
@lemon24
Copy link
Owner Author

lemon24 commented Dec 22, 2018

The only other sub-case that actually happened is 1.i. (same title, same feed); it happened about 2-4 times in 1 year.

lemon24 added a commit that referenced this issue Dec 22, 2018
lemon24 added a commit that referenced this issue Dec 22, 2018
lemon24 added a commit that referenced this issue Dec 22, 2018
@lemon24
Copy link
Owner Author

lemon24 commented Dec 22, 2018

Ran reader update with both feed_entry_dedupe and regex_mark_as_read on a database with ~60 feeds / ~3000 entries; there seemed to be no significant run time degradation.

@lemon24
Copy link
Owner Author

lemon24 commented Dec 22, 2018

Covered all known cases, resolving.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant