Automatic .dedupe.once.title, sometimes #322

Open
lemon24 opened this issue Sep 29, 2023 · 2 comments
Owner

lemon24 commented Sep 29, 2023

I got a feed with duplicate entries because the ids for all the entries changed; content dedupe didn't work for (most of?) them, likely because the content formatting/suffixes changed (todo: check).

I fixed it with .dedupe.once.title, checking beforehand that:

  • old entries don't have duplicate titles
  • most of the new entries have titles identical to the old ones (in this case, it was all except the newest one)

There's no reason the plugin can't do these checks in code.
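
As a rough illustration, the two manual checks above could be expressed over plain lists of titles. This is a minimal sketch, not the plugin's code: the function names and the `min_match_ratio` threshold (one possible reading of "most") are assumptions made for the example.

```python
from collections import Counter


def has_duplicate_titles(titles):
    """Return True if any title appears more than once."""
    return any(count > 1 for count in Counter(titles).values())


def title_dedupe_looks_safe(old_titles, new_titles, min_match_ratio=0.8):
    """Apply the two manual checks to plain lists of titles.

    min_match_ratio is a hypothetical threshold standing in for "most".
    """
    # Check 1: old entries must not share titles among themselves.
    if not new_titles or has_duplicate_titles(old_titles):
        return False
    # Check 2: most new entries must have an old entry with the same title.
    old = set(old_titles)
    matched = sum(1 for title in new_titles if title in old)
    return matched / len(new_titles) >= min_match_ratio
```

In the case described above (all new entries match except the newest one), the ratio would sit just below 1, so any threshold for "most" would accept it.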


davidag commented Oct 16, 2023

Hello @lemon24! 👋🏼

I'd like to help with this issue if possible. I could use a bit of help though :)

Taking a look at the problematic feed, I don't see content/summary fields, but you mentioned they probably had changed. Maybe they are gone now? Am I missing something?

Beyond that, I'm thinking about what the solution would look like:

  1. On the after_feed_update hook, if there are no dedupe-specific tags, check whether every old entry has a duplicate among the new ones (comparing titles only).
    1. Get all entries and split them into old and new, using entry.added == entry.last_updated to identify new entries.
    2. Check that every entry in the old set has a counterpart with the same title in the new set.
  2. If the check in step 1 passes, run the code for .dedupe.once.title already present in the aforementioned hook.
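
The outline above can be sketched roughly as follows. The `Entry` dataclass is a stand-in that carries only the fields the outline mentions, and both helper names are hypothetical; treating `added == last_updated` as "new" is the assumption proposed in step 1.1, not established plugin behavior.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Entry:
    """Stand-in for a reader entry; only the fields the outline uses."""
    id: str
    title: str
    added: datetime
    last_updated: datetime


def split_old_new(entries):
    """Step 1.1: treat entries whose added == last_updated as new."""
    old, new = [], []
    for entry in entries:
        (new if entry.added == entry.last_updated else old).append(entry)
    return old, new


def all_old_have_new_twin(old, new):
    """Step 1.2: every old title also appears among the new entries."""
    new_titles = {entry.title for entry in new}
    return bool(old) and all(entry.title in new_titles for entry in old)
```

Note that under this heuristic an entry that was added earlier and never updated since would also count as new, so it is only a first approximation.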

What do you think?

Thanks 🙏🏼 and great project 💯

Owner Author

lemon24 commented Oct 17, 2023

Hi @davidag, thank you for your interest!

Taking a look at the problematic feed [...]

I checked a backup, and the old entries didn't have content/summary either. The pairs were not deduped because the body of these for loops never got a chance to run (and wouldn't have, unless both entries in a pair had content).

This is partly by design: the current code tries very hard not to delete data – "when in doubt, keep both".

I'm thinking about what the solution would look like:

Indeed, most of the logic should happen in after_feed_update (the stuff in after_entry_update should probably have been there from the start).

Here's what I believe the complete logic may look like; it matches your outline (with one difference noted below):

def after_entry_update_hook:
    tag new entries with '.dedupe._new'

def after_feed_update_hook:
    # optimization, not possible at the moment;
    # would require the hook to receive the UpdatedFeed,
    # or get_entries(tags='.dedupe._new') (filtering by entry tags)
    if there are no new entries:
        return
        
    collect all entry ids and titles
    group collected entries by title
    exclude groups with no more than 1 entry
    if feed does not have any '.dedupe.once*' tag:
        exclude groups that do not have new entries
    
    # optimization
    if there are no groups:
        clear '.dedupe._new' tag from entries
        return
  
    # select how strict we are about what we consider duplicates
    if feed has '.dedupe.once.title' tag:
        # user said so
        is_duplicate = _is_duplicate_title
    elif (
        none of the old entries have duplicate titles
        and none of the new entries have duplicate titles
        and most new entries have old entries with the same title
    ):
        # reasonably safe to dedupe by title alone
        is_duplicate = _is_duplicate_title
    else:
        # similarity dedupe
        is_duplicate = _is_duplicate_full
        
    run _dedupe_entries for each group (original logic)
    clear '.dedupe._new' tag from entries
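
The grouping and filtering steps in the middle of the pseudocode above might look like the sketch below. It works on plain `(entry_id, title)` pairs rather than real entries; the function name and the `new_ids` / `has_once_tag` parameters are assumptions made for illustration.

```python
from collections import defaultdict


def candidate_groups(entries, new_ids, has_once_tag):
    """Group (entry_id, title) pairs by title and keep likely duplicates.

    entries: iterable of (entry_id, title) pairs
    new_ids: ids of entries carrying the temporary "new" marker
    has_once_tag: whether the feed has a '.dedupe.once*' tag
    """
    groups = defaultdict(list)
    for entry_id, title in entries:
        groups[title].append(entry_id)

    result = {}
    for title, ids in groups.items():
        if len(ids) <= 1:
            continue  # a single entry has nothing to be a duplicate of
        if not has_once_tag and not any(i in new_ids for i in ids):
            continue  # old-only group; skip unless a one-off run was requested
        result[title] = ids
    return result
```

An empty result corresponds to the "there are no groups" early return in the pseudocode.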

Some notes:

  • We rely on a temporary .dedupe._new entry tag to tell new entries apart. This way, if the plugin fails for some reason, we can pick new entries up on some future run.
  • I am unsure about the "most new entries have old entries with the same title" check; it may not actually help. When people move blogging platforms, the maximum number of entries in the feed file often changes (and if it goes up, this check may fail).
  • Feel free to ignore the optimizations, I added them to paint the full picture.
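
The recovery behavior described in the first note (leave the temporary tag in place if processing fails) can be illustrated with an in-memory stand-in. This is only a sketch: the `entry_tags` dict and the `process_new_entries` function are hypothetical and do not use the real tag storage.

```python
NEW_TAG = ".dedupe._new"


def process_new_entries(entry_tags, dedupe):
    """Run dedupe over marked entries; clear the marker only on success.

    entry_tags: dict mapping entry id -> set of tag names (in-memory stand-in)
    dedupe: callable taking the list of new entry ids; may raise
    """
    new_ids = sorted(i for i, tags in entry_tags.items() if NEW_TAG in tags)
    if not new_ids:
        return []
    dedupe(new_ids)  # if this raises, the markers stay for a future run
    for entry_id in new_ids:
        entry_tags[entry_id].discard(NEW_TAG)
    return new_ids
```

Because the marker is removed only after `dedupe` returns, a failed run leaves the same entries marked, and a later run picks them up again.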

Once again, thank you, and don't hesitate to ask any follow-up questions if needed.
