Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_entries(has_enclosures=...) should be a plugin(?) #327

Closed
lemon24 opened this issue Nov 18, 2023 · 4 comments
Closed

get_entries(has_enclosures=...) should be a plugin(?) #327

lemon24 opened this issue Nov 18, 2023 · 4 comments

Comments

@lemon24
Copy link
Owner

lemon24 commented Nov 18, 2023

The has_enclosures filter predates entry tags, and was meant as a proxy for "is a podcast item" (which works fine, at least with the feeds I'm subscribed to).

The same functionality can be obtained with a plugin that sets a tag, then used as a filter with get_entries(tags=['.has-enclosures']).

Some arguments for this:

  • Makes the core simpler (and lowers the burden on storage implementations).
  • Allows for more specific logic, e.g. .has-audio-enclosures.
  • Provides a reason to implement filtering by entry tags. Update: done in Filter entries by entry tags #328.
    • Update: ...and a reason for get_entry_counts() to be able to count tags! (kinda need to think how this would work)

Removing the has_enclosures argument is a compatibility break, so it needs to be done in 4.0, #291.

@lemon24
Copy link
Owner Author

lemon24 commented Dec 12, 2023

What about get_feeds(broken=..., updates_enabled=..., new=...)?

Where do we draw the line? Is this turning into #253? (DynamoDB has rotted my brain.)

@lemon24
Copy link
Owner Author

lemon24 commented Dec 12, 2023

Related: http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/, vaguely reminiscent of https://en.wikipedia.org/wiki/Star_schema; also see https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model

What would reader look like if you could only filter and sort by tags?

  • Both feed (user) title and entry recent sort are derived values, so we wouldn't have an issue here.
  • Ensuring consistency would be left to reader code (e.g. required "tag" attributes, updating computed values, modeling tristate attributes like important).
  • Idem for a lot of migrations (which may be better for databases that do not support schema transactions).
  • What would indices look like? (use cases: filter by [tag1, tag2, ...], sort by tag1 value, count by tag1)

@lemon24 lemon24 changed the title get_entries(has_enclosures=...) should be a plugin get_entries(has_enclosures=...) should be a plugin(?) Dec 12, 2023
@lemon24
Copy link
Owner Author

lemon24 commented Dec 17, 2023

So, based on various SQLite forum threads, the general conclusion seems to be "don't bother – design your schema as you normally would, and add indexes as needed later on"; in fairness, this is something I already knew, but as I said, DynamoDB has rotted my brain.

I also tentatively removed has_enclosures, and it didn't remove all that much code.

So:

  • has_enclosure may become a tag, but that would likely lock in a performance penalty (right now, we don't have indexes on it, but if we make it a tag it won't be possible to add one)
  • read and important are integral to the data model / filtering, so we still want them as regular columns
  • feed filtering attributes do not matter all that much, since feeds are both much fewer and smaller than entries (a feeds full table scan is likely negligible)
    • on one hand, this is an argument for "do nothing", since the code is already there
    • on the other hand, it may be an argument for "use tags" (since we can afford the performance penalty)
      • we may do this once we can set tags in a transaction, and get tags in a single query

lemon24 added a commit that referenced this issue Dec 18, 2023
@lemon24
Copy link
Owner Author

lemon24 commented Dec 18, 2023

Ran some benchmarks, here's a summary:

  • With only the has-enclosures entry tag, there seems to be almost no difference between using has_enclosures or the tag.
  • Adding a 1-2 more tags to each entry seems to make using tags only a bit worse.
  • Adding 20 more tags to each entry seems to make using tags ~1.5x worse.
Single entry tag results.

Given a has-enclosures entry tag set like this:

$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
for e in reader.get_entries(has_enclosures=True):
    reader.set_tag(e, "has-enclosures")
print(reader.get_entry_counts())
'
EntryCounts(total=21609, read=15614, important=222, has_enclosures=3978, averages=(0.0, 6.868131868131868, 10.117808219178082))

...and this benchmark script:

export BENCH_TIME_STAT='avg min'
lines='for _ in reader.get_entries(has_enclosures=True): pass
for _ in reader.get_entries(tags=["has-enclosures"]): pass
for _ in reader.get_entries(has_enclosures=True, limit=100): pass
for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
for _ in reader.search_entries("python", has_enclosures=True): pass
for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
for _ in reader.search_entries("python", has_enclosures=True, limit=20): pass
for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass'
while IFS= read -r line; do
    echo "# $line"
    sync && sudo purge
    python scripts/bench.py time snippet -r10 --snippet "$line"
done <<< "$lines"

The output is:

# for _ in reader.get_entries(has_enclosures=True): pass
stat number repeat snippet
 avg      1     10   0.702
 min      1     10   0.374
# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.571
 min      1     10   0.393
# for _ in reader.get_entries(has_enclosures=True, limit=100): pass
stat number repeat snippet
 avg      1     10   0.022
 min      1     10   0.010
# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat number repeat snippet
 avg      1     10   0.020
 min      1     10   0.010
# for _ in reader.search_entries("python", has_enclosures=True): pass
stat number repeat snippet
 avg      1     10   0.538
 min      1     10   0.384
# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.514
 min      1     10   0.395
# for _ in reader.search_entries("python", has_enclosures=True, limit=20): pass
stat number repeat snippet
 avg      1     10   0.250
 min      1     10   0.110
# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat number repeat snippet
 avg      1     10   0.226
 min      1     10   0.112
1-2 entry tags results.

Extra tags were set for read and (un)important like so:

$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
for e in reader.get_entries():
    if e.read:
        reader.set_tag(e, "read")
    if e.important is True:
        reader.set_tag(e, "important")
    if e.important is False:
        reader.set_tag(e, "unimportant")
'

Output (same script as before, but only for the tags snippets):

# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.592
 min      1     10   0.408
# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat number repeat snippet
 avg      1     10   0.022
 min      1     10   0.011
# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.536
 min      1     10   0.408
# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat number repeat snippet
 avg      1     10   0.245
 min      1     10   0.115
20+ entry tags results.

Extra tags were set for read and (un)important like so:

$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
tags = "one two three four five six seven eight nine ten eleven twelve thirteen fourteen sixteen seventeen eighteen nineteen twenty".split()
for e in reader.get_entries():
    for tag in tags:
        reader.set_tag(e, tag)
'

Output (same script as before, but only for the tags snippets):

# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   1.170
 min      1     10   0.613
# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat number repeat snippet
 avg      1     10   0.042
 min      1     10   0.016
# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.789
 min      1     10   0.548
# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat number repeat snippet
 avg      1     10   0.342
 min      1     10   0.174

lemon24 added a commit that referenced this issue Dec 23, 2023
@lemon24 lemon24 closed this as completed Dec 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant