get_entries(has_enclosures=...) should be a plugin(?) #327

lemon24 · 2023-11-18T23:58:14Z

The has_enclosures filter predates entry tags, and was meant as a proxy for "is a podcast item" (which works fine, at least with the feeds I'm subscribed to).

The same functionality can be obtained with a plugin that sets a tag, then used as a filter with get_entries(tags=['.has-enclosures']).

Some arguments for this:

Makes the core simpler (and lowers the burden on storage implementations).
Allows for more specific logic, e.g. .has-audio-enclosures.
Provides a reason to implement filtering by entry tags. Update: done in Filter entries by entry tags #328.
- Update: ...and a reason for get_entry_counts() to be able to count tags! (kinda need to think how this would work)

Removing the has_enclosures argument is a compatibility break, so it needs to be done in 4.0, #291.

lemon24 · 2023-12-12T18:36:20Z

What about get_feeds(broken=..., updates_enabled=..., new=...)?

Where do we draw the line? Is this turning into #253? (DynamoDB has rotted my brain.)

lemon24 · 2023-12-12T19:25:45Z

Related: http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/, vaguely reminiscent of https://en.wikipedia.org/wiki/Star_schema; also see https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model

What would reader look like if you could only filter and sort by tags?

Both feed (user) title and entry recent sort are derived values, so we wouldn't have an issue here.
Ensuring consistency would be left to reader code (e.g. required "tag" attributes, updating computed values, modeling tristate attributes like important).
Idem for a lot of migrations (which may be better for databases that do not support schema transactions).
What would indices look like? (use cases: filter by [tag1, tag2, ...], sort by tag1 value, count by tag1)

lemon24 · 2023-12-17T11:42:25Z

So, based on various SQLite forum threads, the general conclusion seems to be "don't bother – design your schema as you normally would, and add indexes as needed later on"; in fairness, this is something I already knew, but as I said, DynamoDB has rotted my brain.

I also tentatively removed has_enclosures, and it didn't remove all that much code.

So:

has_enclosure may become a tag, but that would likely lock in a performance penalty (right now, we don't have indexes on it, but if we make it a tag it won't be possible to add one)
- it might be interesting to see what query performance looks like with tags, though (update: slightly worse, see get_entries(has_enclosures=...) should be a plugin(?) #327 (comment))
- the more specific enclosures filtering (e.g. .has-audio-enclosures) can still be achieved via the plugin
read and important are integral to the data model / filtering, so we still want them as regular columns
- related: Use entry tags for read, important #253
feed filtering attributes do not matter all that much, since feeds are both much fewer and smaller than entries (a feeds full table scan is likely negligible)
- on one hand, this is an argument for "do nothing", since the code is already there
- on the other hand, it may be an argument for "use tags" (since we can afford the performance penalty)
  - we may do this once we can set tags in a transaction, and get tags in a single query

lemon24 · 2023-12-18T21:50:12Z

Ran some benchmarks, here's a summary:

With only the has-enclosures entry tag, there seems to be almost no difference between using has_enclosures or the tag.
Adding a 1-2 more tags to each entry seems to make using tags only a bit worse.
Adding 20 more tags to each entry seems to make using tags ~1.5x worse.

Single entry tag results.

Given a has-enclosures entry tag set like this:

$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
for e in reader.get_entries(has_enclosures=True):
    reader.set_tag(e, "has-enclosures")
print(reader.get_entry_counts())
'
EntryCounts(total=21609, read=15614, important=222, has_enclosures=3978, averages=(0.0, 6.868131868131868, 10.117808219178082))

...and this benchmark script:

export BENCH_TIME_STAT='avg min'
lines='for _ in reader.get_entries(has_enclosures=True): pass
for _ in reader.get_entries(tags=["has-enclosures"]): pass
for _ in reader.get_entries(has_enclosures=True, limit=100): pass
for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
for _ in reader.search_entries("python", has_enclosures=True): pass
for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
for _ in reader.search_entries("python", has_enclosures=True, limit=20): pass
for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass'
while IFS= read -r line; do
    echo "# $line"
    sync && sudo purge
    python scripts/bench.py time snippet -r10 --snippet "$line"
done <<< "$lines"

The output is:

# for _ in reader.get_entries(has_enclosures=True): pass
stat number repeat snippet
 avg      1     10   0.702
 min      1     10   0.374
# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.571
 min      1     10   0.393
# for _ in reader.get_entries(has_enclosures=True, limit=100): pass
stat number repeat snippet
 avg      1     10   0.022
 min      1     10   0.010
# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat number repeat snippet
 avg      1     10   0.020
 min      1     10   0.010
# for _ in reader.search_entries("python", has_enclosures=True): pass
stat number repeat snippet
 avg      1     10   0.538
 min      1     10   0.384
# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.514
 min      1     10   0.395
# for _ in reader.search_entries("python", has_enclosures=True, limit=20): pass
stat number repeat snippet
 avg      1     10   0.250
 min      1     10   0.110
# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat number repeat snippet
 avg      1     10   0.226
 min      1     10   0.112

1-2 entry tags results.

Extra tags were set for read and (un)important like so:

$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
for e in reader.get_entries():
    if e.read:
        reader.set_tag(e, "read")
    if e.important is True:
        reader.set_tag(e, "important")
    if e.important is False:
        reader.set_tag(e, "unimportant")
'

Output (same script as before, but only for the tags snippets):

# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.592
 min      1     10   0.408
# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat number repeat snippet
 avg      1     10   0.022
 min      1     10   0.011
# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.536
 min      1     10   0.408
# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat number repeat snippet
 avg      1     10   0.245
 min      1     10   0.115

20+ entry tags results.

Extra tags were set for read and (un)important like so:

$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
tags = "one two three four five six seven eight nine ten eleven twelve thirteen fourteen sixteen seventeen eighteen nineteen twenty".split()
for e in reader.get_entries():
    for tag in tags:
        reader.set_tag(e, tag)
'

Output (same script as before, but only for the tags snippets):

# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   1.170
 min      1     10   0.613
# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat number repeat snippet
 avg      1     10   0.042
 min      1     10   0.016
# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.789
 min      1     10   0.548
# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat number repeat snippet
 avg      1     10   0.342
 min      1     10   0.174

lemon24 added API plugin labels Nov 18, 2023

lemon24 mentioned this issue Nov 25, 2023

Filter entries by entry tags #328

Closed

5 tasks

lemon24 added the tags label Nov 25, 2023

lemon24 changed the title ~~get_entries(has_enclosures=...) should be a plugin~~ get_entries(has_enclosures=...) should be a plugin(?) Dec 12, 2023

lemon24 added a commit that referenced this issue Dec 18, 2023

Add snippet timing. #327

89278fb

lemon24 added a commit that referenced this issue Dec 23, 2023

Dev notes for #327, #330, and #325.

01a3629

lemon24 closed this as completed Dec 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

get_entries(has_enclosures=...) should be a plugin(?) #327

get_entries(has_enclosures=...) should be a plugin(?) #327

lemon24 commented Nov 18, 2023 •

edited

lemon24 commented Dec 12, 2023 •

edited

lemon24 commented Dec 12, 2023 •

edited

lemon24 commented Dec 17, 2023 •

edited

lemon24 commented Dec 18, 2023

get_entries(has_enclosures=...) should be a plugin(?) #327

get_entries(has_enclosures=...) should be a plugin(?) #327

Comments

lemon24 commented Nov 18, 2023 • edited

lemon24 commented Dec 12, 2023 • edited

lemon24 commented Dec 12, 2023 • edited

lemon24 commented Dec 17, 2023 • edited

lemon24 commented Dec 18, 2023

lemon24 commented Nov 18, 2023 •

edited

lemon24 commented Dec 12, 2023 •

edited

lemon24 commented Dec 12, 2023 •

edited

lemon24 commented Dec 17, 2023 •

edited