Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different feed update frequencies #332

Open
lemon24 opened this issue Mar 18, 2024 · 4 comments
Open

Different feed update frequencies #332

lemon24 opened this issue Mar 18, 2024 · 4 comments

Comments

@lemon24
Copy link
Owner

lemon24 commented Mar 18, 2024

#307 needs a way of changing the feed update frequency. Just like #96 (comment), this should be a configurable global/per-feed strategy.

Use cases:

  • different recurring frequencies (1h, 6h, 1d); a naive Handle 429 Too Many Requests gracefully #307 implementation could bump the feed up to the next less frequent value
  • no earlier than X; can be used by Handle 429 Too Many Requests gracefully #307 to react to Retry-After
  • only new feeds; special case that already exists for update_feeds(new=True)
  • never/infinity; special case that already exists for update_feeds(updates_enabled=True)
    • although, updates_enabled should be independent (e.g. if I temporarily disable updates, the configured frequency should not be deleted)
  • maybe adding some kind of jitter to regular updates would be nice / polite

A unified way of updating feeds would be nice as well, e.g. "just call this every minute". Currently, I'm running the following in a cron:

  • every hour: update
  • every minute update --new-only; search update
@lemon24
Copy link
Owner Author

lemon24 commented Mar 31, 2024

Logic is easy enough (turning update_after etc. to/from seconds not shown):

import random
from dataclasses import dataclass

@dataclass
class Feed:
    url: str
    update_period: int
    update_after: int = 0
    # last_updated is only set when the feed is actually updated
    # (not when it's not modified, not when there was an exception)
    # https://github.com/lemon24/reader/blob/3.12/src/reader/_update.py#L276
    # https://github.com/lemon24/reader/blob/3.12/src/reader/_update.py#L445
    last_retrieved: int = 0

def get_feeds_for_update(feeds, now):
    return [f for f in feeds if f.update_after <= now]

def next_period(feed, now, jitter_ratio=0):
    jitter = random.random() * jitter_ratio
    current_period_no = now // feed.update_period
    return (current_period_no + 1 + jitter) * feed.update_period

def update_feeds(feeds, now, get_update_after=next_period):
    to_update = get_feeds_for_update(feeds, now)
    for feed in to_update:
        feed.last_retrieved = now
        feed.update_after = get_update_after(feed, now)
    return to_update

def set_update_period(feed, update_period):
    feed.update_period = update_period
    feed.update_after = next_period(feed, feed.last_retrieved)
Tests:
from collections import Counter
from functools import partial
import pytest


@pytest.mark.parametrize('old_after, new_after, now', [
    (0, 10, 0),
    (0, 10, 1),
    (0, 10, 9.999),
    (0, 20, 10),
    (5, 10, 5),
    (10, 20, 10),
    (105, 110, 109),
    (105, 120, 110),
    (105, 200, 199.999),
    (105, 210, 200),
    
])
def test_update(old_after, new_after, now):
    feeds = [Feed('one', 10, old_after)]
    assert len(update_feeds(feeds, now)) == 1
    assert feeds == [Feed('one', 10, new_after, now)]    


@pytest.mark.parametrize('old_after, now', [
    (5, 4),
    (10, 9.999),
    (20, 19),
])
def test_no_update(old_after, now):
    feeds = [Feed('one', 10, old_after)]
    assert len(update_feeds(feeds, now)) == 0
    assert feeds == [Feed('one', 10, old_after)]    


@pytest.mark.parametrize('get_update_after', [
    next_period, 
    # jitter ratio less than 10-1, to account for time step
    partial(next_period, jitter_ratio=.9),
])
def test_sweep(get_update_after):
    feeds = [Feed('one', 10), Feed('two', 20), Feed('three', 100)]
    
    counts = Counter()
    for now in range(100):
        for feed in update_feeds(feeds, now, get_update_after):
            counts[feed.url] += 1
            
    assert counts == {'one': 10, 'two': 5, 'three': 1}


def test_set_period_up():
    feeds = [Feed('one', 10)]
    update_feeds(feeds, 5)
    
    set_update_period(feeds[0], 20)
    
    # no update needed, already updated in this period
    assert len(update_feeds(feeds, 15)) == 0


def test_set_period_down():
    feeds = [Feed('one', 20)]
    update_feeds(feeds, 5)
    
    set_update_period(feeds[0], 10)
    
    # update needed, if taking new period into account
    assert len(update_feeds(feeds, 15)) == 1
    

On to the API!

Update: We can get rid of last_retrieve and rely on the current time in set_update_period(); all tests pass with minimal changes.

@lemon24
Copy link
Owner Author

lemon24 commented Apr 8, 2024

API

Add update_feeds(scheduled: bool | None = None) argument that filters feeds to update:

  • if none: if global tag .reader.update is set (regardless of value), assume true, else assume false
  • if true: update only update_after < now
  • if false: update everything (no update_after filtering)

In reader 4.0 (#291), we can make scheduled default to True (or just change the default behavior).


To configure the update interval, we can use a .reader.update tag:

  • can be set globally or per feed, feed overrides global overrides default
  • format: interval: int (seconds), jitter: float|bool (in [0,1])
  • default: {interval: 3600, jitter: 0.25}

Using tags is good because it allows configuring stuff without additional UI.

Possible cons (WIP):

  • We need to find a way to reset update_after when the update interval changes (a dedicated set_feed_update_period() method can take care of this out of the box). One solution may be being able to register hooks on tag updates.
  • We want to validate the config. How do we handle validation errors? Ignore / use default, or raise to the user? The latter may be accomplished with a hook, although a standard validation mechanism would be better.

In the (low level) storage API:

  • add update_after and last_retrieved to FeedUpdateIntent and FeedForUpdate
  • get_feeds_for_update(after: datetime | None) (also returns nulls)
    • new uses last_updated, it should use last_retrieved
  • set_feed_update_after() (used when setting the interval)

@lemon24
Copy link
Owner Author

lemon24 commented May 4, 2024

To do (minimal):

  • add update_after and last_retrieved on Feed, FeedUpdateIntent, and FeedForUpdate
  • get_feeds_for_update(after)
    • (new should use last_retrieved)
  • update_feeds(scheduled) / get_feeds(scheduled) / get_feed_counts(scheduled)
  • update logic (just ignore invalid tag values and use default)
  • expose scheduled in the CLI
  • tests
  • documentation
    • docstrings
    • user guide
    • changelog, dev notes

Later:

  • reset update_after after interval changes / set_feed_update_after()
  • scheduled=None (default to false initially)
  • config validation

@lemon24
Copy link
Owner Author

lemon24 commented May 5, 2024

update_after and last_retrieved should go on FeedUpdateIntent, and in turn FeedUpdateIntent should have union of FeedData-with-extra-stuff or exception or None, but I'm having trouble finding a good name for FeedData-with-extra-stuff.

For reference, here's all the feed-related data classes and how they're used:

        .-- FeedForUpdate ---.
        v                    |
     parser                  |
        |                    | 
    ParsedFeed            storage -. 
  (has FeedData)             ^     |
        v                    |     |
     updater                 |     |
        |                    |     |
        |- FeedUpdateIntent -'    Feed
        | (has ??? (has FeedData)  |
        |   or ExceptionInfo       |
        |   or None)               |
        |                          v
        '---- UpdateResult -----> user
            (has UpdatedFeed)

lemon24 added a commit that referenced this issue May 9, 2024
Also, use last_retrieved instead of last_updated for the "new" feed filter.
lemon24 added a commit that referenced this issue May 11, 2024
(Pipeline will need the name scheme and get_tag() too.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

When branches are created from issues, their pull requests are automatically linked.

1 participant