
Speed up RssIgnores::matches #345

Open
Minoru opened this issue Oct 28, 2018 · 2 comments
Labels
good first issue (Working on this issue is an easy way to start with Newsboat development), refactoring (This issue describes a way in which some particular part of the code could be improved)

Comments

Minoru (Member) commented Oct 28, 2018

I have 7 ignore-article commands and set ignore-mode to "display", and I noticed that if I comment them out, startup time goes down by 10% (when cache.db is already in the disk cache). GNU gprof shows that quite a bit of time is spent in RssIgnores::matches.

That method takes an RssItem, loops through all ignore-article rules looking for the ones that match the item's feed, and checks whether their associated regexes match. There are two inefficiencies here:

  1. ignore-article rules are stored in a vector<pair<string, regex>>, which is basically a std::map<string, regex> in disguise. If we switch to an actual map, the lookup time will become near-constant and won't grow with the number of ignore-article rules. std::unordered_multimap seems the most fitting;
  2. as far as I can see, that method is only used in RssFeed::update_items, where it is called on every item of each feed. In that scenario, we can get the feed's URL once, look up the associated ignore-article rules, and use that "shortlist" when checking individual items.

RssIgnores::matches lacks any tests. They need to be written before doing any of the aforementioned optimizations.

Evaluation

Since we don't have benchmarks, I have to describe how I'll evaluate the results of these optimizations.

I have a large cache file: over 400 feeds, almost 1 gigabyte of data. I'll put it on tmpfs to make sure I/O doesn't skew the results.

I'll run the following command five times in a row and take the smallest result:

$ echo q | time ./newsboat --cache-file=/dev/shm/cache.db

Newsboat will be compiled in release mode (i.e. just make newsboat).

My config file will contain one ignore-mode "display" entry and 0 to 20 ignore-article entries. I will be looking at two things:

  1. how startup time depends on presence of ignore-article entries; and
  2. how startup time depends on the number of ignore-article entries.

I will be comparing the results against then-current master. The goal is to improve on master.

@Minoru Minoru added good first issue Working on this issue is an easy way to start with Newsboat development refactoring This issue describes a way in which some particular part of the code could be improved labels Oct 28, 2018
@Minoru Minoru added the Hacktoberfest Issues nominated for the participants of https://hacktoberfest.digitalocean.com/ label Oct 1, 2020
@Minoru Minoru removed the Hacktoberfest Issues nominated for the participants of https://hacktoberfest.digitalocean.com/ label Nov 1, 2020
Minoru (Member, Author) commented Mar 17, 2024

@danieloh0714 asked me to elaborate on this:

as far as I can see, that method is only used in RssFeed::update_items, where it is called on all items of each feed. In that scenario, we can get feed's URL once, lookup the associated ignore-article rules, and use that "shortlist" when checking individual items.

First of all, that idea is wrong about where RssIgnores::matches() is used. Currently the call sites are:

  1. RssParser::add_item_to_feed() -- used when parsing an RSS file we fetched from the network. If ignore-mode is download, ignored articles are dropped here, so they never make it to the database at all;
  2. Cache::internalize_rssfeed() -- used when reading a feed from the database. If ignore-mode is display, we have to re-apply the ignore rules every time we load an article, because the rules might have changed since the last time we applied them;
  3. Cache::search_for_items() -- when searching, the database returns everything it finds, and we drop ignored articles in this method.

All three methods just loop through the items and call RssIgnores::matches() on each one. However, the first two methods operate on a single feed, i.e. they know the feed's URL before they even start looping. That gives them the ability to find all applicable rules once and then apply them to each item, saving us the repeated lookup of the same feed URL.

This optimization has the most potential for regex rules, because "looking them up" involves compiling a regex and applying it to the feed URL. If we do it just once per feed, we could save some time.

danieloh0714 (Contributor) commented:

Ok, I think I understand now. Thanks for elaborating. I'll handle this in a separate PR from #2706.
