I have to go through all the entries to find out how many there are #185

Closed
lemon24 opened this issue Aug 10, 2020 · 5 comments

lemon24 commented Aug 10, 2020

I have to go through all the entries to find out how many there are (total, read, important, per feed etc.).

lemon24 commented Sep 8, 2020

replaced by #185 (comment), since it's less rambly

lemon24 commented Nov 22, 2020

Use cases

There are a lot of ways to aggregate the things to be counted. Allowing everything would complicate the API a lot, so it makes sense to start from some known use cases, in order of decreasing usefulness:

  • entry counts (total, read, important, maybe "has enclosures")
    • global
    • for whatever entry / search result list is being shown in the entries view
    • for each feed in the feeds view
    • for each tag in the tags view
  • feed counts (total, broken, inactive (Feed decommissioned #187))
    • global
    • for whatever feed list is being shown in the entries view
    • for each tag in the tags view
  • tag count (total)
    • global
    • for each feed in the feeds view
  • metadata counts (total)
    • global (maybe)
    • for each feed in the feeds view

Performance

All of the above can be done by having new _counts methods corresponding to the ones that return iterables.

The "for each" use cases would require calling them many times, which may get inefficient (n+1 query problem). Initially, we can just accept this, since for the SQLite storage it should be OK. If we ever get a storage implemention where it matters more, we can address it in #191 ("with_counts").

I initially expected some of the "facets" (e.g. read) would be relatively slow, since without indexes they'll have to do a full table scan. While this is true, they're not much slower than a plain count(*), and adding multiple counts still results in just one scan.

Here are some timings on a database with 100+ feeds and 11k entries, for various "select feed, <one or more counts> from entries group by feed" queries:
sqlite> .timer on
sqlite> .output /dev/null
sqlite> select feed from entries group by feed;
Run Time: real 0.007 user 0.006225 sys 0.000445
sqlite> select feed, count(*) from entries group by feed;
Run Time: real 0.007 user 0.006173 sys 0.000307
sqlite> select feed, sum(read) from entries group by feed;
Run Time: real 0.020 user 0.012347 sys 0.007328
sqlite> select feed, sum(important) from entries group by feed;
Run Time: real 0.020 user 0.012372 sys 0.007328
sqlite> select feed, sum(read), sum(important) from entries group by feed;
Run Time: real 0.020 user 0.013036 sys 0.007446
sqlite> select feed, sum((json_array_length(entries.enclosures) IS NULL OR json_array_length(entries.enclosures) = 0)) from entries group by feed;
Run Time: real 0.026 user 0.017904 sys 0.008618
sqlite> select feed, sum(read), sum(important), sum((json_array_length(entries.enclosures) IS NULL OR json_array_length(entries.enclosures) = 0)) from entries group by feed;
Run Time: real 0.028 user 0.018934 sys 0.008452
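
The same one-scan idea, for the global counts, expressed with Python's sqlite3 (just a sketch using the columns from the queries above; db.sqlite stands in for the reader database file, and the actual storage builds its queries differently):

import sqlite3

conn = sqlite3.connect('db.sqlite')
# one full scan computes all the facets at once
total, read, important = conn.execute("""
    select
        count(*),
        coalesce(sum(read), 0),
        coalesce(sum(important), 0)
    from entries
""").fetchone()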

Even so, it makes sense to be able to get counts for only some of the facets, but we can add this later, if needed.

API

As mentioned before, it makes sense to add new _counts methods corresponding to the ones that return iterables.

Batch ("for each") counts will be handled in #191, so we won't discuss it here much. The attribute will have the same type as the return value of the corresponding _counts method. Because get_feed_tags() returns a list of strings, for this to work we'd either have to change the return type to a TagInfo type, or have a new get_feed_tag_info() method that returns TagInfo.

Obviously, when new "facets" are added, it should be possible to add them in a backwards-compatible way.

The return values can be one of:

  • a dataclass, with one attribute per facet
  • a namedtuple, with one attribute per facet
  • a dict[str, int]; we can annotate it as a TypedDict
  • a list[str], matching the order of the string sequence passed as the "which facets" argument (second option below); this is probably the worst option

To make some facets optional, we can have an argument of:

  • the result dataclass/namedtuple with bool attributes; the symmetry is nice, but seems a bit forced / gimmicky
  • a sequence of strings or enum values corresponding to the dataclass / namedtuple / typed dict attributes / keys; we can annotate this as Sequence[Literal['first', 'second']]

For the dataclass/namedtuple return type, having optional facets implies the attribute type is Optional.

The dict return type better reflects that values can come and go. The dataclass/namedtuple type allows adding properties like unread (total - read). Per the discussion in #191 (comment), using generics to link the input and output types isn't really an option, so having an object and not a dict wouldn't help there.
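
For concreteness, here's a rough sketch of the first option (dataclass return type) combined with the Sequence[Literal[...]] facet argument; names and defaults are illustrative, not a committed API:

from dataclasses import dataclass
from typing import Literal, Optional, Sequence

EntryCountFacet = Literal['total', 'read', 'important']

@dataclass(frozen=True)
class EntryCounts:
    total: Optional[int] = None
    read: Optional[int] = None
    important: Optional[int] = None

    @property
    def unread(self) -> Optional[int]:
        # derived count; only available if both inputs were requested
        if self.total is None or self.read is None:
            return None
        return self.total - self.read

def get_entry_counts(
    counts: Sequence[EntryCountFacet] = ('total', 'read', 'important'),
) -> EntryCounts:
    ...  # the storage does a single scan, summing only the requested facets

With this shape, unread only works when both total and read were requested, which matches the Optional attribute caveat above.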

Here's what various invocations would look like:
>>> get_entry_counts(counts=['total', 'read'])
EntryCounts(total=11752, read=8977, important=None)

>>> get_entry_counts(counts=['total', 'read'])
{'total': 11752, 'read': 8977}

>>> get_entry_counts(counts=EntryCounts(total=True, read=True, important=False))
EntryCounts(total=11752, read=8977, important=None)

The first option seems the best to me.

Naming might be a bit difficult for some of the methods:

  • get_feeds() -> get_feed_counts()
  • get_entries() -> get_entry_counts()
  • search_entries() -> ???
  • iter_feed_metadata() -> ???
    • get_feed_metadata() is taken for something else (it returns a single metadata value); if we rename the method to get_all_feed_metadata(), we can have get_all_feed_metadata_counts()
    • 2021-05 update: how about this instead?
      get_feed_metadata_value(feed, key) -> value
      get_feed_metadata(feed) -> [(key, value), ...]
      get_feed_metadata_counts(feed) -> counts
  • get_feed_tags() -> get_feed_tag_counts()

@lemon24
Copy link
Owner Author

lemon24 commented Nov 26, 2020

To do (I'll only do the minimum of things, we can add stuff later):

  • core
    • types
      • FeedCounts(total, broken, updates_enabled)
      • EntryCounts(total, read, important, has_enclosures)
      • EntrySearchCounts(total, read, important, has_enclosures); here we could add some counts of what matched, but YAGNI for now
    • methods
      • get_feed_counts(), get_entry_counts(), search_entry_counts() (see the usage sketch after this list); will likely require refactoring some of the query builder bits for reuse
    • tests; new tests are likely the way to go; one big fixture with 1 read, 2 important, 4 has_enclosures, ...
  • web app
    • entries: show entry counts for result list
    • feeds: show feed counts for result list, show entry counts for each feed
    • tags: show feed counts and entry counts for each tag
    • maybe make all of the above optional
    • run some benchmarks
  • docs
    • types
    • methods
    • user guide
    • changelog
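
Rough usage sketch of the planned methods, assuming they take the same filter arguments as get_feeds() / get_entries() / search_entries() (exact signatures and reprs may differ; values depend on the database):

>>> reader.get_feed_counts()
FeedCounts(total=..., broken=..., updates_enabled=...)
>>> reader.get_entry_counts(feed='https://example.com/feed.xml')
EntryCounts(total=..., read=..., important=..., has_enclosures=...)
>>> reader.search_entry_counts('python')
EntrySearchCounts(total=..., read=..., important=..., has_enclosures=...)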

lemon24 commented Nov 28, 2020

Some manual ("page generated in about ... seconds") web app benchmarks, using my "production" database; I picked the smallest time of 5-10 refreshes:

On my laptop:

path         ?counts=no   ?counts=yes
/            2.7          2.7
/?limit=64   .187         .206
/feeds       .254         1.9
/tags        .004         .186

On a t3a.nano instance:

path         ?counts=no   ?counts=yes
/?limit=64   .182         .188
/feeds       .220         1.460
/tags        .005         .313

As expected, getting counts for each feed does make /feeds slower (6-8 times).

lemon24 closed this as completed Nov 28, 2020

lemon24 commented Nov 28, 2020

BTW, this took about 14h 😞

  • 3h design
  • 5.5h core implementation (including docstrings and tests); search_entry_counts() took about 2.5x as long as the other methods because of the finickier tests and queries (even though I reused a lot of the get_entry_counts() tests)
  • .5h user guide
  • 4.5h web app (lots of "UX" and CSS futzing)
