
Speed up RssIgnores::matches #345

Open
Minoru opened this issue Oct 28, 2018 · 2 comments
Labels
good first issue (Working on this issue is an easy way to start with Newsboat development), refactoring (This issue describes a way in which some particular part of the code could be improved)

Comments

Minoru (Member) commented Oct 28, 2018

I have 7 ignore-article commands and set ignore-mode to "display", and I noticed that if I comment them out, startup time goes down by 10% (when cache.db is already in the disk cache). GNU gprof shows that quite a bit of time is spent in RssIgnores::matches.

That method takes an RssItem, loops through all ignore-article rules looking for the ones that match the item's feed, and checks whether their associated regexes match. There are two inefficiencies here:

  1. ignore-article rules are stored in a vector<pair<string, regex>>, which is basically a std::map<string, regex> in disguise. If we switch to an actual map, the lookup time will become near-constant and won't grow with the number of ignore-article rules. std::unordered_multimap seems the most fitting;
  2. as far as I can see, that method is only used in RssFeed::update_items, where it is called on every item of each feed. In that scenario, we can get the feed's URL once, look up the associated ignore-article rules, and use that "shortlist" when checking individual items.

RssIgnores::matches lacks any tests. They need to be written before doing any of the aforementioned optimizations.

Evaluation

Since we don't have benchmarks, I have to describe how I'll evaluate the results of these optimizations.

I have a large cache file: over 400 feeds, almost 1 gigabyte of data. I'll put it on tmpfs to make sure I/O doesn't skew the results.

I'll run the following command five times in a row and take the smallest result:

$ echo q | time ./newsboat --cache-file=/dev/shm/cache.db

Newsboat will be compiled in release mode (i.e. just make newsboat).

My config file will contain one ignore-mode "display" entry and 0 to 20 ignore-article entries. I will be looking at two things:

  1. how startup time depends on presence of ignore-article entries; and
  2. how startup time depends on the number of ignore-article entries.

I will be comparing the results against then-current master. The goal is to improve on master.

@Minoru Minoru added good first issue Working on this issue is an easy way to start with Newsboat development refactoring This issue describes a way in which some particular part of the code could be improved labels Oct 28, 2018
@Minoru Minoru added the Hacktoberfest Issues nominated for the participants of https://hacktoberfest.digitalocean.com/ label Oct 1, 2020
@Minoru Minoru removed the Hacktoberfest Issues nominated for the participants of https://hacktoberfest.digitalocean.com/ label Nov 1, 2020
Minoru (Member, Author) commented Mar 17, 2024

@danieloh0714 asked me to elaborate on this:

as far as I can see, that method is only used in RssFeed::update_items, where it is called on all items of each feed. In that scenario, we can get feed's URL once, lookup the associated ignore-article rules, and use that "shortlist" when checking individual items.

First of all, that idea is wrong about where RssIgnores::matches() is used. Currently the call sites are:

  1. RssParser::add_item_to_feed() -- used when parsing an RSS file we fetched from the network. If ignore-mode is download, ignored articles are dropped here, so they never make it to the database at all;
  2. Cache::internalize_rssfeed() -- used when reading a feed from the database. If ignore-mode is display, we have to re-apply the ignore rules every time we load an article, because the rules might have changed since the last time we applied them;
  3. Cache::search_for_items() -- when searching, the database returns everything it finds, and we drop ignored articles in this method.

All three methods just loop through the items and call RssIgnores::matches() on each one. However, the first two methods operate on a single feed, i.e. they know the feed's URL before they even start looping. That gives them the ability to find all applicable rules once and then apply them to each item, saving us the repeated lookup of the same feed URL.

This optimization has the most potential for regex rules, because "looking them up" involves compiling a regex and applying it to the feed URL. If we do it just once per feed, we could save some time.

danieloh0714 (Contributor) commented:

Ok, I think I understand now. Thanks for elaborating. I'll handle this in a separate PR from #2706.
