[BUG]: Filtering articles into multiple labels causes huge slowdown #952

Closed
Mitecon opened this issue May 9, 2023 · 45 comments
Labels: Component-DB, Status-Fixed (Ticket is resolved.), Type-Defect (This is BUG!!!)

Comments

@Mitecon

Mitecon commented May 9, 2023

Brief description of the issue

I make heavy use of filters and labels. Quite often I'm not able to go through everything for a while and so the articles can sit there unattended for some time until I'm able to look at them. Over time, the number of articles builds up and because many of them can be filtered into more than one label, an enormous amount of lag is introduced when doing simple things like opening the filter dialogue, switching labels or even sorting columns alphabetically or by date.

How to reproduce the bug?

I've been doing some testing and I have created a test database which has only one feed and, for simplicity's sake, I created filters to apply the relevant labels to every article containing a vowel:

vowels-database.txt

(I can't upload anything with a .db extension so this needs a rename.)

Load up the database, right-click on the feed and 'Fetch selected'. It would be useful to have your task manager open to see how threads become fully locked up during this process. If you have your task manager open to the right you will be able to see the articles arriving into the labels on the left. This will take some time as each article is filtered.

Once the feed has updated and the articles have the labels applied, try clicking through the UI. For instance, open the filter dialogue and time how long it takes. Then try clicking on each filter one by one and see how long that takes. Close this dialogue and try clicking on each label and time how long that takes.

Depending on your system, it may actually not be too slow for you. So I created another test database which has a more extreme set of filters that pick up a few extra common words:

extreme-database.txt

Obviously, this is a stupid set of filters designed for testing and reproducing this problem that nobody would use day-to-day. However...

The feed I've found only has around 1800 articles in it. Now, imagine dozens of feeds where some of them have articles added regularly throughout each day. Quite quickly this can (and on mine, does) build up to thousands of articles. Now imagine there are filters that sort many of these articles into multiple labels and you can see how this lag will become exponentially longer.

In fact, if you do any of the above with either test database and go into each label, select all articles and remove the labels from them, RSS Guard becomes a little bit faster each time. If you remove all the labels from all articles this way, then navigating the UI becomes almost instantaneous.

Go into the filter dialogue and apply all the filters again and RSS Guard slows down once again.

So the problem comes from having multiple articles having multiple labels applied. This, for me, is how I work. I do hate having to mention this but my workflow was like this for years with QuiteRSS which has never exhibited this issue even with a many years-old database.

Looking through the RSS Guard database, I can see that articles have separate entries for each label that is applied - meaning the articles are duplicated, sometimes multiple times depending on how many labels are applied, in the database:

[Image: rssguard-db-labels]

In QuiteRSS, the articles have the corresponding label ID applied to each article in a single column - meaning each article is a single unique entry in the database with no duplication:

[Image: quiterss-db-labels]

I obviously understand that to change RSS Guard's database would entail a rather significant rewrite and probably require everyone to wipe out their databases since I imagine a new format would be incompatible.

I also understand that, from what I've read in the Issues, no-one else has reported this problem. So this might just be me who has encountered this and everyone else is sailing by without really making heavy use of labels. But then again, there might be a whole bunch of people who tried RSS Guard, hit this problem, decided it was 'too slow' and went back to their old RSS reader without ever reporting the issue here and so no-one will ever know.

What was the expected result?

RSS Guard should not have massive lag during simple operations such as navigating the UI.

What actually happened?

For context, with my current database, it takes more than 3 minutes to open the filter dialogue. The same again to switch to a different label. Or anything, really. This labels problem makes RSS Guard unwieldy for me and I would like for it to be... wieldy.

I realise if I am the only one with this problem then it won't be a high priority. But it's a bug report, so... I wish I was a programmer so I could do something about this. Unfortunately I'm not.

Debug log

A log is not really useful here. It shows nothing out of the ordinary. Besides, using the databases I've attached above will allow you to test yourself on your own system(s).

Operating system and version

System 1:
Operating System: Manjaro Linux
KDE Plasma Version: 5.27.4
KDE Frameworks Version: 5.105.0
Qt Version: 5.15.9
Kernel Version: 6.1.26-1-MANJARO (64-bit)
Graphics Platform: X11
Processors: 8 × AMD Ryzen 7 3700X 8-Core Processor
Memory: 15.6 GiB of RAM
Graphics Processor: NVIDIA GeForce GT 1030/PCIe/SSE2
Manufacturer: QEMU
Product Name: Standard PC (Q35 + ICH9, 2009)
System Version: pc-q35-4.2

RSS Guard: v4.3.4 Flatpak

System 2:
Operating System: Manjaro Linux
KDE Plasma Version: 5.27.4
KDE Frameworks Version: 5.105.0
Qt Version: 5.15.9
Kernel Version: 6.1.26-1-MANJARO (64-bit)
Graphics Platform: X11
Processors: 8 × Intel® Core™ i7 CPU Q 720 @ 1.60GHz
Memory: 11.6 GiB of RAM
Graphics Processor: AMD REDWOOD

RSS Guard: v4.3.4 Appimage

System 3:
Device name Laptop
Processor Intel(R) Core(TM) i7 CPU Q 720 @ 1.60GHz 1.60 GHz
Installed RAM 12.0 GB
System type 64-bit operating system, x64-based processor
Pen and touch No pen or touch input is available for this display
Edition Windows 10 Home
Version 22H2
Installed on ‎16/‎04/‎2023
OS build 19045.2846
Experience Windows Feature Experience Pack 120.2212.4190.0

RSS Guard: v4.3.4 ( rssguard-4.3.4-fb2c439b-win10.exe)

System 2 and 3 are the same laptop with two separate hard drives, dual booting.

@Mitecon Mitecon added the Type-Defect This is BUG!!! label May 9, 2023
@martinrotter
Owner

Hi.

Thank you for the very detailed report.

First, articles ARE NOT duplicated in the RSS Guard database. There is a single table "Messages" which holds all articles, and each article is a SINGLE row in that table.

Labels assigned to articles are stored in a separate 1:N table.

I will test your database. There might be some performance killer bug I am not aware of.

@Mitecon
Author

Mitecon commented May 11, 2023

I see. So when I open the database in DB Browser, this is just how it displays the data? In a sense - the table LabelsInMessages doesn't display raw data but the result of a query on the database.

Or something. As I said - I'm not a programmer. I appear to know enough to be dangerous. I did not mean to cause offence. I gave you a wall of text in an effort to be as detailed as possible and my attempt at debugging this caused me to throw ideas out there based on what I could, rightly or wrongly, deduce. Again, I meant no offence.

Out of curiosity - do you test new versions against a test database or a copy of a 'live' database?

I could imagine that you might want to use, for example, the default feeds that RSS Guard suggests during a new installation so that you have a control group of sorts. At the same time, a database like that would not contain much data. Also, it would depend on your filter setup, amongst other things.

If you have a decently large database with many thousands of articles going back a fair amount of time, I might suggest copying that to a new install. Then add the filters I created in these test databases and run these filters against the copy of your own personal database.

For context, my current database has 55k articles and it takes at least the three minutes I mentioned originally to navigate a single element of the UI.

@ghost

ghost commented May 14, 2023

Not sure this is related to labels. I've also got an incredible lag when opening the filter dialog (currently 2 minutes 20 seconds for first opening, roughly 45 seconds for each clicking on a filter in the list and waiting for it to become displayed). My impression is rather that automatically filling the "Existing articles" tab with tens of thousands of articles may not be a good idea.

@Mitecon
Author

Mitecon commented May 14, 2023

On mine if I remove all labels I can, indeed, load up tens of thousands of articles in a fraction of a second. I currently have 59k of those articles. However, once labels are reapplied, everything slows down again.

You don't say whether you do or don't use labels but if you do, there is an easy test to see if this is the same cause of your slowdown:

  1. Back up your database
  2. Remove all labels (easiest way is to right-click labels and 'Delete Selected Item' - this won't delete articles, only the label itself)
  3. See if you can navigate the UI at a decent speed.

My Issue sounds pretty much the same as yours except, by following the three steps above, I've narrowed it down to knowing it's to do with labels. It's an easy way to test if you want to try it.

@ghost

ghost commented May 14, 2023

> You don't say whether you do or don't use labels but if you do, there is an easy test to see if this is the same cause of your slowdown:

Ouch. You are right. Deleting the labels gets me back to normal speed, without any lag.

@Mitecon
Author

Mitecon commented May 14, 2023

What is your operating system, specs and RSS Guard version/type?

@ghost

ghost commented May 14, 2023

> What is your operating system, specs and RSS Guard version/type?

Linux, RSS Guard 4.3.4. Not sure what you mean by specs, but an Intel Core i5-10210U with 32GB of RAM available should be sufficient, at least htop says I'm nowhere getting near any kind of stress working with RSS Guard.

@Mitecon
Author

Mitecon commented May 14, 2023

Specs just means processor, RAM, etc. I would say yours is more than good enough. Are you using an SSD?

Which Linux distro and which DE?

Are you using the full version of RSS Guard or nowebengine?

Also, are you using a version from your repo, the appimage or flatpak?

@ghost

ghost commented May 14, 2023

> Specs just means processor, RAM, etc. I would say yours is more than good enough.

  • SSD
  • Manjaro Linux (kernel 6.1.26)
  • KDE
  • RSS Guard full version
  • flatpak

@Mitecon
Author

Mitecon commented May 14, 2023

This is basically the same as mine except different computers.

So we have three instances of Manjaro and one Windows 10 across three separate computers. The common link across everything is the labels.

I wonder if anyone else (Mac users, other Linux distros) also has this problem.

At least Martin has more information now we know for sure it's not an isolated incident.

Out of curiosity, can you open System Monitor (EDIT: might be KSysGuard) and screenshot the Process Table tab (put 'rssguard' in the Quick Search box). Make sure you have the Memory and Relative Start Time columns visible in the screenshot.

I had a previous Issue to do with memory usage that Martin made a fix for that kind of, sort of fixed it somewhat. However, I let it go because this labels problem is far more of an issue for me. I'd be curious to see how much RAM your RSS Guard takes up after some time running. Especially since you have the Full (i.e. not nowebengine) version and so I presume you'll be loading up full web pages within RSS Guard.

@ghost

ghost commented May 14, 2023

> Out of curiosity, can you open System Monitor (EDIT: might be KSysGuard) and screenshot the Process Table tab (put 'rssguard' in the Quick Search box). Make sure you have the Memory and Relative Start Time columns visible in the screenshot.

https://pasteboard.co/HX2d6ySRgkYz.jpg

@Mitecon
Author

Mitecon commented May 14, 2023

So your instance of RSS Guard has been running (hmm.. timezones... are you in Germany?) for less than two hours and it's using 1.3GB RAM? That's quite a lot for that amount of time. Also, you might not have noticed this before with 32GB RAM but cache pressure would have been an issue for you on, say, 8GB RAM.

This is mine right now:

[Image: rssguard]

However, mine sits there basically unused because of the labels/UI slowdown. When I do use it to read articles, the memory usage rockets sky high. I did think it might be the Qt library contributing to this as that is what the webengine uses. This one is fixed by restarting RSS Guard, though. You could report this in my other post to not pollute this Issue with a different problem.

@ghost

ghost commented May 14, 2023

> So your instance of RSS Guard has been running (hmm.. timezones... are you in Germany?) for less than two hours and it's using 1.3GB RAM? That's quite a lot for that amount of time. Also, you might not have noticed this before with 32GB RAM but cache pressure would have been an issue for you on, say, 8GB RAM.

At the time I took the screenshot, it had been running for about 6 hours (now almost 8 hours). Mem value as of now keeps jumping between ~930MB and 1,3GB. It's running on my workstation where I keep doing my daily work in parallel, did not notice issues so far.

@martinrotter
Owner

OK guys.

I did some preliminary testing and debugging. All problems we see stem from the DB layer and its design. The SQL layer is designed to be very simple for the programmer and follows all basic DB design principles. All associations (1:M mostly) are made via separate tables, and no data is duplicated. The design itself is very clean.

Sadly, when data is separated into multiple tables, some queries turn out to take a HUGE amount of time when there is a big number of articles AND a big number of labels assigned to those articles. This did not come up when I was designing the DB some years ago, simply because I did not test it with a big number of labels.

Here is sample SQL query which simply lists all articles:

ANALYZE;
SELECT Messages.id,
        Messages.id + 0 as sss,
       Messages.is_read,
       Messages.is_important,
       Messages.is_deleted,
       Messages.is_pdeleted,
       Messages.feed,
       Messages.title,
       Messages.url,
       Messages.author,
       Messages.date_created,
       Messages.contents,
       Messages.enclosures,
       Messages.score,
       Messages.account_id,
       Messages.custom_id,
       Messages.custom_hash,
       Feeds.title,
       CASE WHEN length(Messages.enclosures) > 10 THEN 'true' ELSE 'false' END AS has_enclosures,
       (
           SELECT GROUP_CONCAT(Labels.name) 
             FROM Labels
            WHERE Labels.custom_id IN (
                      SELECT LabelsInMessages.label
                        FROM LabelsInMessages
                       WHERE LabelsInMessages.account_id = Messages.account_id AND 
                             LabelsInMessages.message = Messages.custom_id
                  )
       )
       AS msg_labels
  FROM Messages
       LEFT JOIN
       Feeds ON Messages.feed = Feeds.custom_id AND 
                Messages.account_id = Feeds.account_id
 WHERE Messages.is_deleted = 0 AND 
       Messages.is_pdeleted = 0 AND 
       Messages.account_id = 1
 ORDER BY Messages.is_read ASC;

The problem is the column msg_labels, which involves string operations with the GROUP_CONCAT function. This is executed for every single row and takes a huge amount of time. This specific column is responsible for showing the titles of labels assigned to an article in the article list.
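To make the cost concrete, here is a rough JavaScript model of what a per-row subquery like msg_labels amounts to in the worst case, when no index helps: for every article, the whole LabelsInMessages list is scanned. This is not RSS Guard's code. The table and column names follow the query above, while the sample data and the `labelsPerArticleNaive` function are made up for illustration, and SQLite's query planner may well do better than this.

```javascript
// Toy model of the msg_labels subquery: one scan of LabelsInMessages per article.
// Sample data is invented; field names mirror the tables in the query above.
function labelsPerArticleNaive(messages, labelsInMessages, labels) {
  let comparisons = 0;
  const labelNames = new Map(labels.map((l) => [l.custom_id, l.name]));
  const result = messages.map((msg) => {
    const names = [];
    for (const row of labelsInMessages) {
      comparisons++; // every join row is inspected for every article
      if (row.account_id === msg.account_id && row.message === msg.custom_id) {
        names.push(labelNames.get(row.label));
      }
    }
    return { id: msg.id, msg_labels: names.join(",") };
  });
  return { result, comparisons };
}

const labels = [
  { custom_id: "a", name: "A" },
  { custom_id: "e", name: "E" },
];
const messages = [
  { id: 1, account_id: 1, custom_id: "m1" },
  { id: 2, account_id: 1, custom_id: "m2" },
  { id: 3, account_id: 1, custom_id: "m3" },
];
const labelsInMessages = [
  { account_id: 1, message: "m1", label: "a" },
  { account_id: 1, message: "m1", label: "e" },
  { account_id: 1, message: "m2", label: "e" },
];

const { result, comparisons } = labelsPerArticleNaive(messages, labelsInMessages, labels);
console.log(result);
console.log(comparisons); // articles x join rows = 3 x 3 = 9
```

In this model the work is the number of articles multiplied by the number of label assignments, which would explain why removing labels speeds things up even when the article count stays the same.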

There are several similar performance problems caused by essentially the same SQL constructs, all connected to labels. In effect, this makes some label-related operations ultra slow.

I will have to check and fix it all, and I hope it is all fixable with non-backwards-compatible SQL fixes. If those fixes turn out to be bigger, then a new major RSS Guard release will have to be made.

QuiteRSS stores the IDs of assigned labels directly in its "article" table, which is a big "no" from a design POV, but I have to say that performance-wise the method is quite good.

@martinrotter
Owner

I will temporarily disable the problematic SQL code for the label titles column.

@rfkat @Mitecon

Please test daf0ca4 after it finishes building via development build and report back, thanks.

@martinrotter
Owner

Pushed now, it should be compiled in 15 minutes.

@Mitecon
Author

Mitecon commented May 15, 2023

I'm using: rssguard-devbuild-0716fd476-linux64.AppImage

From preliminary testing I can see:

Changing from label to label = 01:16, 01:18, 01:16, 01:17 - so a minute and a quarter, give or take
Opening filter dialogue = basically instant
Moving from label to feed = basically instant
Moving from feed to feed = basically instant

All of the above except from 'moving from feed to feed' would take minutes previously, so this is dramatically improved. Before, moving from feed to feed would not be 'instant' but there would be a little delay but manageable. Now it's noticeably faster.

I tested opening the filter dialogue, selecting a filter and clicking on 'Test'.

In this case I used 'Reject duplicate articles' - the built-in filter:

function filterMessage() {
  if (msg.isAlreadyInDatabase(MessageObject.SameTitle | MessageObject.AllFeedsSameAccount)) {
    return MessageObject.Ignore;
  }
  else {
    return MessageObject.Accept;
  }
}

Then I selected a category containing 158/21602 (%unread/%all) articles and hit the 'Test' button:

It took 05:26. Needless to say, I only ran that test once. It used to be something like three minutes but that was me looking at the second hand of my wall clock and roughly counting the time. For these tests, I've been using the timer on my phone and five and a half minutes to test filters is more excessive than before by a long way.

However, I wonder if it's because between then and now I simply have a lot more articles for the filters to run against so I never really noticed the increase in time taken. In the beginning I would hit a button and watch my wall clock. Then I began to hit some button in the UI, switch away from RSS Guard to do something else and then come back to it after some time.

All-in-all, this has made a dramatic difference. Though, the main one for me was to be able to navigate between labels and that's the part that still takes some time.

Navigating between labels, along with anything that takes any amount of time, has always locked up a whole CPU thread for the duration of the operation:

[Image: cpu]

I'm obviously available for more testing whenever you need it.

@ghost

ghost commented May 15, 2023

> Please test daf0ca4 after it finishes building via development build and report back, thanks.

Tested using rssguard-devbuild-0716fd476-linux64.AppImage, Browsing the article filter list is as responsive as expected (instant response). Switching between labels takes around 8 seconds from clicking on the label until the articles show up.

@martinrotter
Owner

martinrotter commented May 17, 2023

> Changing from label to label

So what remains are the problems when changing between labels and when processing/testing filters in the filters window. Please report all slowness you see. I will first identify all problematic places and then address them somehow.

EDIT: Also, can you provide a filled database file which shows exactly this problem, where switching from label to label takes one minute? On my "extreme" database it takes some seconds, not minutes.

@Mitecon
Author

Mitecon commented May 17, 2023

I knew I'd be taking a while to run these tests so I copied my main database to my laptop so I wouldn't have a big gap where my main one would miss a whole bunch of new articles:

Operating System: Manjaro Linux
KDE Plasma Version: 5.27.4
KDE Frameworks Version: 5.105.0
Qt Version: 5.15.9
Kernel Version: 6.1.26-1-MANJARO (64-bit)
Graphics Platform: X11
Processors: 8 × Intel® Core™ i7 CPU Q 720 @ 1.60GHz
Memory: 11.6 GiB of RAM
Graphics Processor: AMD REDWOOD

Then I deleted all labels (which took forever) and deleted most of my feeds (because of privacy). I've left two feeds which have months of articles and lots of them - a total of 37,260.

Then I created three simple filters to label a decent amount of articles. Because of the way these filters are set up, no articles should have more than one label applied - you'll see why when you look at them.

Actually running the filters took way longer than I expected. I didn't time the first one but the second and third took around ten minutes each to run (wall clock precision).

On to the testing...

I'm using:  rssguard-devbuild-0716fd476-linux64.AppImage

These operations were performed by closing down RSS Guard (using File->Quit), watching KSysGuard to make sure all processes were ended and disappeared, then running RSS Guard again and clicking on nothing but what is listed below. So I ran RSS Guard for a total of three fresh starts. These are all stopwatch precision.

Time to select first label: 01:39
Time to select second label: 01:41
Time to select third label: 01:39

So the time taken is reliably consistent from a fresh start to select any label. These labels all have different amounts of articles in them so their size does not affect the time taken. That 01:39 time seems to be important for this database on this system. Any slight discrepancy I assume is due to the system doing something in the background. This laptop is set up as an 'emergency' computer so I always have a spare if something happens to my main one. This means that very little is actually happening on it which should make for very consistent testing overall. On my main computer, this kind of thing reliably takes about a minute and a quarter. Due to it being one thread, I'm guessing that clock speed matters more than thread count.

Without exiting RSS Guard, and keeping the third label selected from above:

Keeping the third label selected from above, time to select the second label: 01:40
Keeping the second label selected from above, time to select the first label: 01:39

So it's remarkably consistent no matter if there's a fresh start or not.

Then I realised the articles were sorted alphabetically by Title. So I wondered if that may have a bearing on how long it took. For completeness I ran some more tests. These were fresh starts each time.

With articles sorted by 'Date' - newest at top:

Time to select first label: 01:39
Time to select second label: 01:39
Time to select third label: 01:40

With articles sorted by 'Date' - oldest at top:

Time to select first label: 01:39
Time to select second label: 01:39
Time to select third label: 01:41

While I was doing this, I also noticed something else. The time it takes to change column order depends on which view you currently have selected - i.e. which part is displaying the articles - feed or label.

While a label is selected, time to change from oldest at top to sort by Title: 01:37
While a feed is selected, time to change from oldest at top to sort by Title: practically instant

In fact while a feed is selected, I can change column order one after the other with minimal time delay. If I do this exact same thing while a Label is selected it takes that one minute and thirty-plus seconds.

I've also never used Scores in filters. I wonder if that would affect things.

The time it takes to create labels is also a problem:

From a fresh start, right-clicking the root of 'Labels' to create a new label took 01:39.

It's that magic number again. However, it's also loading up articles in all labels in order to do this.

One oddity I've noticed is, when running from a fresh start:

Time to run Tools->Cleanup database: 01:35

This database is 11.7MB so it shouldn't have that much to do. I haven't noticed this before on my main database but then I haven't particularly been watching for anything. When I run Cleanup database, I have these settings:

Remove all articles from Recycle bin -> Checked
Optimise database file -> Checked

All other checkboxes are unchecked.

It gets stuck on 'Purging recycle bin' at 16% for practically all of the time it takes before, at the last second, the progress bar flashes along and it completes:

[Image: cleanup-database]

Opening the Filter dialogue used to take about the same time as it would to switch labels but with this dev version it's only a slight momentary delay. Opening the filter dialogue is not a problem in this dev version.

QUESTIONS:

  1. Am I right in thinking that RSS Guard only runs filters against incoming new articles? Then if you want to run a filter on everything in the database, you'd have to go to the filter dialogue and click on 'Process checked feeds'? To only filter new incoming articles, I think, is the better way. It's also one of the reasons I've been so desperate to get away from QuiteRSS as I'm fairly certain that it runs all filters against all data in the database every single time it refreshes the feeds. So you can be sitting there reading something and if it refreshes feeds before you finish reading, you're stuck there, unable to switch to a different feed or label until it's completed.

So with QuiteRSS if your database grows over time, as mine has, more filters + more articles = more time spent sitting doing nothing but waiting. Everyone is looking for more speed and minimal disruption.

  2. Would it be possible to create labels dynamically from filters alone? For example, I'd like it so that if I were to create a filter that sends anything it finds to a specified label, that RSS Guard would then create that label itself without me manually having to switch out of the filter dialogue to do it.

I've found myself creating filters and then running them only to realise that any matching articles have nowhere to go unless I remember to create the label first. This means creating the filter, closing the filter dialogue, creating the label, opening the filter dialogue again, then running the filter. With the problem of how long it takes to do some of these things, I can find myself having to wait for no reason quite often because I don't always remember to stick to this process. If I forget altogether, then I can go away and come back the next day expecting something to be there waiting in a label for me but then I have to run the filter again manually.

  3. In this database I set up some simple filters based on video resolution. How would I filter TV shows? For example, how can I filter anything in the format 's01e01' or 's10e12', or whatever combination, to a label? I've had a go at this before a few times and run things through Javascript testers online but even though they say everything's fine and can find matches in sample data, I've never got it to work in RSS Guard. Some examples:
//var hdtv = [
//    '(.*?)\\.S?(\\d{1,2})E?(\\d{2})\\.(.*)',
//    '(\\S[0-9][0-9]\\E[0-9][0-9])',
//    '*S\d\dE\d\d',
//    '(S\d{2}E\d{2})',
//    '.*(?=s\d{1,2}e\d{1,2})(s\d{1,2}e\d{1,2}).*(\..*)',
//    `[0-9][0-9]E[0-9][0-9]`,
//    '\\d\\dE\\d\\d',
//];

This is not even an exhaustive list and looking at this, seeing what I've tried, makes my poor non-programmer head hurt. You can see by how many are commented out that I've left them there so I can see what I've already tried instead of researching more and trying the same things over and over and expecting different results. Obviously, a sign of madness.
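For what it's worth, some of the attempts above have problems that are independent of RSS Guard, assuming the strings were passed to `new RegExp(...)`: inside a quoted JavaScript string, `'\d'` loses its backslash and becomes a plain `d`; a pattern cannot start with `*`; `\\S` means "any non-whitespace character" rather than a literal S; and without the `i` flag an uppercase `S`/`E` will not match `s01e01`. Below is a minimal sketch of a pattern that avoids these issues by using a regex literal. The `isEpisodeTitle` helper and the sample titles are made up, and how the result is then used to assign a label inside `filterMessage()` depends on the RSS Guard filter API, which is not shown here.

```javascript
// Matches season/episode markers such as s01e01, S10E12 or s1e2.
// \b keeps the match from starting or ending inside a longer word;
// the i flag makes the match case-insensitive.
const EPISODE_PATTERN = /\bs(\d{1,2})e(\d{1,2})\b/i;

// Hypothetical helper: returns true when a title carries an episode marker.
function isEpisodeTitle(title) {
  return EPISODE_PATTERN.test(title);
}

// Made-up sample titles for a quick check.
const samples = [
  "Some.Show.s01e01.720p",
  "Another Show S10E12 1080p",
  "Documentary.2023.1080p",
];
for (const title of samples) {
  console.log(title, isEpisodeTitle(title));
}
```

One caveat: `\b` treats an underscore as a word character, so a title like `Show_s01e01` would need the boundary checks relaxed.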

  4. Finally, maybe something that's way out of scope for this Issue, but maybe has a simple solution. The internal web browser is essentially Chromium. However, no matter what I put into Settings->Network & web & tools->WebEngine, I cannot get dark mode on web pages. I know this can be done in Chromium (and Chrome) flags but it doesn't work here. I've tried:
--force-dark-mode
--blink-settings=darkMode=4,darkModeImagePolicy=2

Both of these I've read as having one or the other working, I think depending on the Qt version. If there's a quick and easy flag I can put in there - great. If not, it's not the biggest problem.

Database:

For the database, can you email 2gc11b47@duck.com and I'll reply with it attached.

I think that's it for now. Yes, I went nuts with code blocks.

@martinrotter
Owner

martinrotter commented Jun 1, 2023

Working on this. It is a big change, but the DB migration will be done on-the-fly and no major RSS Guard version change is needed.

Separate branch: https://github.com/martinrotter/rssguard/tree/solve-labels-performance

@martinrotter
Owner

OKAY. Some code is in the branch. I reworked the database structure a bit. Now each article has a column "labels" which keeps the list of labels assigned to that article. This approach speeds up basically all operations related to labels. The majority of the code is currently untested; I will test it over the next days. Note that the database schema is now upgraded to version "5", so back up your database files prior to testing.
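As a rough illustration of why a per-article "labels" column helps, the sketch below folds join-table rows into one string per article in a single pass, after which showing or filtering an article's labels needs no further lookup. This is a toy JavaScript model, not the actual migration script; the `.`-delimited storage format, the field names, and the `buildLabelsColumn` function are assumptions made for the example.

```javascript
// Toy denormalization: one pass over the join rows builds a "labels" value per article.
// The "." delimiter and the field names are assumptions for illustration only.
function buildLabelsColumn(messages, labelsInMessages) {
  const byMessage = new Map();
  for (const row of labelsInMessages) {
    const key = `${row.account_id}/${row.message}`;
    if (!byMessage.has(key)) byMessage.set(key, []);
    byMessage.get(key).push(row.label);
  }
  return messages.map((msg) => {
    const ids = byMessage.get(`${msg.account_id}/${msg.custom_id}`) || [];
    // Leading and trailing delimiters let ".7." be matched without also hitting ".17."
    const labelsValue = ids.length ? `.${ids.join(".")}.` : "";
    return { ...msg, labels: labelsValue };
  });
}

const migrated = buildLabelsColumn(
  [
    { account_id: 1, custom_id: "m1" },
    { account_id: 1, custom_id: "m2" },
  ],
  [
    { account_id: 1, message: "m1", label: "7" },
    { account_id: 1, message: "m1", label: "17" },
  ]
);
console.log(migrated);

// Filtering by label becomes a substring test on the article row itself.
const withLabel7 = migrated.filter((m) => m.labels.includes(".7."));
console.log(withLabel7.length);
```

The trade-off is the one already mentioned in this thread: the data is no longer normalized, but reading labels no longer costs a lookup per article.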

https://github.com/martinrotter/rssguard/blob/solve-labels-performance/resources/sql/db_update_sqlite_4_5.sql

The DB schema upgrade is seamless and users will not even notice it happening. Report back any findings. Some label operations might be temporarily broken, but I will test everything and fix it.

@Mitecon
Author

Mitecon commented Jun 2, 2023

NOTE: This is based off running rssguard-devbuild-1d19f49a4-linux64.AppImage. I wrote the following last night and was going to post this today not knowing there'd be a newer version available. I was too tired to post it last night without re-reading with a fresh pair of eyes. I'm now running rssguard-devbuild-ff7e52739-linux64.AppImage but the following was based off rssguard-devbuild-1d19f49a4-linux64.AppImage:

I can immediately see the difference. I imagine you'll be fine tuning it but right now, switching from feeds to labels and back again is about as fast as I'd expect. Opening the Filter dialogue is also fast enough, as is switching between filters.

I'm just running on my main database, not the laptop I was using for the major testing above, so I haven't been able to run the same tests with the exact same database I used then. Unfortunately, I'm a bit busy with my life at the moment to run through all those tests. However, I'm running it on a larger database than before and the performance is at a point where I could say this is more than good enough for me. I wonder how much the database could handle now - as in how many feeds, articles and, indeed, labels.

One note is that it took a few minutes for the database to convert on first startup. This is fine since it's a one-time operation. The upgrade took my database from 101.0 MB to 101.9 MB with more than 76k articles in it. It might not even be noticeable for people with fewer feeds, articles and labels.

The biggest slowdown for me now is filtering. In my previous reply I had a question about 'Am I right in thinking that RSS Guard only runs filters against incoming new articles?'.

The reason I asked is because filtering is where I'll be testing next since heavy filtering of articles into many labels was why this whole thing started. In the time since my last reply I've made a whole bunch of filters (basically converted all of my QuiteRSS filters) that are all now queued up and ready to be enabled. RSS Guard's Javascript filters really are so much more powerful and QuiteRSS has its bugs regarding filters anyway. However, with real life things going on for me right now it may take a few days before I can see if there's anything glaring to report. It's one thing to manually run filters on existing articles and another for them to be automatically filtered as feeds are updated. So that'll take time to see how it all works. Right now I have many feeds that are unfiltered and I'm going to have to create all the labels (which was taking forever before with the older stable v4.3.4 so I haven't done that yet) and then run each filter one by one until everything's done.

At this point, everything will be where I wanted it - pushing the filtering and use of labels. If performance remains as good as it is now, then I would call this Issue closed.

Unrelated question: The database and all config is in a directory called '/.config/RSS Guard 4/'. What happens when you decide to release a version 5? Why not just call it '/.config/RSS Guard/'?

@martinrotter
Owner

RSS Guard runs filters against all articles downloaded from feed.

  1. Feed file is downloaded.
  2. Articles from feed file are extracted.
  3. Articles are run through the filters, if any are activated.
  4. "accepted" articles are saved/overwritten in the DB.
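
The four steps above can be sketched roughly as follows. This is only an illustration in Python, not RSS Guard's actual code: the `Article` class, the `process_feed` function, the dictionary standing in for the DB and the two example filters are all hypothetical names, and real filters in RSS Guard are written in JavaScript.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    guid: str
    title: str
    labels: set = field(default_factory=set)


def process_feed(raw_items, filters, db):
    """Hypothetical sketch: extract articles, run filters, store accepted ones.

    raw_items stands in for the already downloaded feed file (step 1).
    Returns the number of articles written to db.
    """
    # Step 2: extract articles from the feed data.
    articles = [Article(guid=i["guid"], title=i["title"]) for i in raw_items]
    stored = 0
    for article in articles:
        # Step 3: run the active filters in order; the first rejection stops
        # the chain for this article (all() short-circuits).
        if all(f(article) for f in filters):
            # Step 4: accepted articles are saved or overwritten.
            db[article.guid] = article
            stored += 1
    return stored


def reject_empty_title(article):
    # Example filter: reject articles without a usable title.
    return bool(article.title.strip())


def label_if_contains_a(article):
    # Example filter: assign a label, then accept the article.
    if "a" in article.title.lower():
        article.labels.add("a")
    return True
```

The point of the sketch is that filters only ever see what step 2 produced from the current download, so an article that is not part of a fetch never passes through them.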

As for the "RSS Guard 4" folder: yes, in RSS Guard 3 and earlier the folder was just "RSS Guard", but RSS Guard 4 brought a major overhaul of the database and it was not possible to automatically migrate old database files. I therefore decided to use a completely new folder for RSS Guard user data.

@martinrotter
Owner

OK, closing this. Feel free to report any subsequent related problems etc.

@martinrotter martinrotter added the Status-Fixed Ticket is resolved. label Jun 5, 2023
@martinrotter martinrotter added this to the 4.3.5 milestone Jun 5, 2023
@Mitecon
Author

Mitecon commented Jun 5, 2023

I may have spoken too soon about this being resolved. This is what happens when I reply before fully testing everything. Now I've had a bit more time, I have some things to report. I think, though, that this is not labels themselves. It feels like it's to do with how RSS Guard handles opening and displaying items in the articles pane.

I'm noticing that some basic operations are now much slower than they used to be. These are:

Selecting articles
Right-clicking articles

I tested in a small feed with a total of 3687 articles.

With one article:

Selecting any one article takes 16 seconds.
Right-clicking any one article takes 8 seconds for the menu to show.
Right-clicking any one article which is not already selected takes 16 seconds.

With two articles:

Selecting an additional article (CTRL+select) is practically instant.
Right-clicking any of the two selected articles takes 8 seconds for the menu to show.

However, big changes happen when more than two articles are selected, in this case three:

Selecting an additional article (CTRL+select) is practically instant.
Right-clicking three selected articles takes 24 seconds.

With four articles selected:

Selecting any amount of additional articles (Shift+select) is practically instant.
Right-clicking four selected articles takes 31 seconds.

With ten articles selected:

Selecting any amount of additional articles (Shift+select) is practically instant.
Right-clicking ten selected articles takes 1m 18s.

With twenty articles selected:

Selecting any amount of additional articles (Shift+select) is practically instant.
Right-clicking twenty selected articles takes 2m 37s.
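
As a rough sanity check on the right-click figures above, here is the time per selected article. It assumes the last two measurements refer to ten and twenty selected articles, as their headings say.

```python
# (articles selected, seconds until the context menu appeared), as reported above
timings = [(1, 8), (2, 8), (3, 24), (4, 31), (10, 78), (20, 157)]

per_article = [seconds / count for count, seconds in timings]

for (count, seconds), rate in zip(timings, per_article):
    print(f"{count:>2} selected: {seconds:>3} s total, {rate:.2f} s per article")
```

Apart from the two-article case, this works out to a little under 8 seconds per selected article, so the delay looks roughly linear in the number of selected articles.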

This, however, is right-clicking. If I wanted to perform some action like deleting articles, then I can use the main menu (Articles->Delete articles) and that has no delay whatsoever. It seems to be just the right-click menu that has the delay.

It could be that the items in the right-click menu have to calculate certain things before it's displayed? Things like: which labels are applied, so it can show a tick or something in this menu beside the relevant label. I notice in the main menu 'Articles' there is no entry for 'Labels' and opening this menu is instant. Maybe that's it? If there's one thing I definitely have now that I didn't have before I noticed this slowdown - it's lots of labels. However, it's important to note that the articles and feed I was testing on above have no labels applied to them at all.

I'm wondering now if this has something to do with why articles are now taking a while to be selected: Because of that list of dots down the left-hand side of the article pane that might be calculating all the names of the labels and their respective colours first before it can display the article.

Also, I've gone to View->Show/hide->Message viewer toolbars to disable that and it made no difference.

In fact, I have a request to tweak that popout menu for labels: #971

For my use case, all my labelling is handled by filters alone. I don't even need to see these menus. Can there be an option to disable them or move them into their own separate menu or something?

Regarding your comment about filters:

Yes, I see now that filters only apply to incoming articles. I set up a bunch of filters, let them be enabled overnight, and saw that they hadn't applied labels to most things in my feeds. So I had to go and apply them manually. I should have tested this first.

Regarding the directory names:

I wonder if you could have used the naming structure:

/.config/RSS Guard/database v3/
/.config/RSS Guard/database v4/
/.config/RSS Guard/database v5/
etc.

Keeping the root directory the same but changing the database directory name as required.

Or even

/.config/RSS Guard/database/database-v3.db
/.config/RSS Guard/database/database-v4.db
/.config/RSS Guard/database/database-v5.db
etc.

This way, when you upgrade a database, the older format could be left as a backup (instead of database.db.bak).

Perhaps me and my OCD need to be stopped...

@martinrotter martinrotter reopened this Jun 6, 2023
@martinrotter martinrotter added Status-Partially-Fixed Part of bug/feature is fixed/implemented. and removed Status-Fixed Ticket is resolved. labels Jun 6, 2023
@martinrotter
Owner

I tested selecting articles with your "extreme" database and it is instant. Can you provide a sample database (possibly the whole "RSS Guard" folder) which exhibits the issue?

@martinrotter
Owner

Also, I added a fix so that SQLite database file backups now carry the version in their name, like this: database.db-v4.bak

@martinrotter
Owner

Yes, I see now that filters only apply to incoming articles. I set up a bunch of filters, let them be enabled overnight, and saw that they hadn't applied labels to most things in my feeds. So I had to go and apply them manually. I should have tested this first.

Filters are applied to all incoming articles, but if a particular article already exists in the DB and is not "updated", then the new "filtered" clone is not stored and the original article stays in the DB. If the article is "updated" (meaning its title, creation date etc. changed), then it is overwritten in the DB.

If you make changes to filters which you need to apply to existing articles, then you have to use the "Process checked feeds" button in the "Article filters" dialog.
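
The store-or-keep rule described above can be sketched like this. It is a simplified illustration, not RSS Guard's code: `Article`, `is_updated` and `store_incoming` are hypothetical names, a dictionary stands in for the DB, and "updated" is assumed to mean only that the title or creation date differs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    guid: str
    title: str
    created: str
    labels: frozenset = frozenset()


def is_updated(existing, incoming):
    # Assumption: an article counts as "updated" when title or creation date differ.
    return (existing.title, existing.created) != (incoming.title, incoming.created)


def store_incoming(db, incoming):
    """Store an already filtered article; return 'inserted', 'overwritten' or 'kept'."""
    existing = db.get(incoming.guid)
    if existing is None:
        db[incoming.guid] = incoming
        return "inserted"
    if is_updated(existing, incoming):
        db[incoming.guid] = incoming
        return "overwritten"
    # Unchanged article: the filtered clone is discarded, so any labels that
    # a newly enabled filter assigned to it never reach the DB.
    return "kept"
```

Under these assumptions, enabling a new filter and re-fetching a feed changes nothing for articles that are already stored and unchanged, which matches what was observed overnight.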

@Mitecon
Author

Mitecon commented Jun 6, 2023

The 'extreme' database was fairly empty and was not set up in the way I have things set up now. So it's probably not a good one to continue testing with.

Instead, I've been working on getting you a good sample size with various things set up to help demonstrate these problems I'm discovering.

I wonder what kind of database you use personally. How many feeds, articles, filters and labels? Maybe you don't really have a complex setup, whereas I do. So I'm hitting these walls pretty quickly. I'm finding it's not so much to do with having lots of feeds and articles - you can have lots of these and not really have any issues at all. However, I'm seeing that it's to do with not only having lots of feeds and articles but also lots of filters and labels. Then it becomes how they interact with each other.

For example:

This database has 28k articles:

database.pre-populated.general.news.feeds.zip

When you go to select any article from any feed, it's basically instant.

However, with the exact same database but with 30 labels:

database.pre-populated.general.news.feeds.30.labels.zip

You begin to see the same operation take longer.

Then with the exact same database again but with 100 labels:

database.pre-populated.general.news.feeds.100.labels.zip

You can see that doing the same operation takes even longer.

So this is absolutely to do with labels - specifically the amount of them.

The more labels there are, the longer it takes.

It's like some sort of odd see-saw going on here. It used to be that reading articles was fast but labels were really slow. Now it's the other way around. There's some strange interaction between articles and labels and how they affect each other.

@martinrotter
Owner

Investigated. Yes, the problem depends linearly on the number of labels. When any UNREAD message is selected, the counts of unread articles in each label might change, so they are all recalculated. I have to come up with a more efficient strategy.
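
To illustrate why recalculating counts for every label scales with the number of labels, here is a small sketch using Python's sqlite3 module. The schema (`messages`, `labels`, `labels_in_messages`) and the function names are invented for this example and are not taken from RSS Guard. The grouped query is just one possible alternative, not necessarily the strategy that ended up in the fix.

```python
import sqlite3


def make_db():
    # Hypothetical schema, for illustration only.
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE messages (id INTEGER PRIMARY KEY, is_read INTEGER NOT NULL);
        CREATE TABLE labels (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE labels_in_messages (label_id INTEGER, message_id INTEGER);
    """)
    return con


def unread_counts_per_label_loop(con):
    # One COUNT query per label: the work grows linearly with the number of labels.
    counts = {}
    for (label_id,) in con.execute("SELECT id FROM labels").fetchall():
        (n,) = con.execute(
            "SELECT COUNT(*) FROM labels_in_messages lm "
            "JOIN messages m ON m.id = lm.message_id "
            "WHERE lm.label_id = ? AND m.is_read = 0",
            (label_id,),
        ).fetchone()
        counts[label_id] = n
    return counts


def unread_counts_grouped(con):
    # A single aggregate query; labels without unread messages are reported as 0
    # because COUNT(m.id) ignores the NULLs produced by the LEFT JOINs.
    rows = con.execute(
        "SELECT l.id, COUNT(m.id) FROM labels l "
        "LEFT JOIN labels_in_messages lm ON lm.label_id = l.id "
        "LEFT JOIN messages m ON m.id = lm.message_id AND m.is_read = 0 "
        "GROUP BY l.id"
    ).fetchall()
    return dict(rows)
```

Both functions return the same mapping of label id to unread count. The difference is that the first issues one query per label every time an unread article is selected.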

@martinrotter
Owner

OK, I made some more enhancements (except for right-clicking messages, which I will solve soon).

30a471a

Please re-test.

@Mitecon
Author

Mitecon commented Jun 9, 2023

It's noticeably faster now but not instant. I went through a category that contained articles from different feeds, clicking on each or using the arrow keys one-by-one. When counting in my head, it seemed to take each article about three seconds to load up. This is much faster than before but, as I said, not instant. However, it's much more liveable.

When a single feed is selected:

When articles are read, the time to move through articles is practically instant
When articles are unread, there is a delay of around three seconds

I added articles from three feeds into a test label:

When articles are read, the time to move through articles is practically instant
When articles are unread, there is a delay of around three seconds

In the test label, it does not matter which feed the article is from, the above remains the same.

Also, adding a test label led me to discover: #978

However, the one time when you'd want to navigate through articles is when you haven't already read them - so the articles would have to be unread, leading to the problem above.

The articles I tested with above were always the same ones. That is, I was marking them read and unread several times to make sure I was testing on the exact same articles each time.

@martinrotter
Owner

4784baa - much faster right click on several articles -> label's context menu was the culprit

Now I will make it work much faster when browsing unread articles.

@Mitecon
Author

Mitecon commented Jun 12, 2023

Yes - right-click is instant now. Nice!

I wonder why I can right-click on and see any article and immediately see it displayed in the bottom pane, yet if I left-click on any article it takes about three seconds to show.

Especially since a left-click has nothing to do with showing a context menu.

@martinrotter
Owner

The delay should be only with UNread articles (the marking as read is what takes the time now).

@Mitecon
Author

Mitecon commented Jun 12, 2023

I see now:

Right-clicking on articles previews them but does not mark them read.
Left-clicking on articles both previews and marks them read.

Interesting.

Left-clicking, or navigating with the arrow keys, on unread articles is probably what most people would be wanting to do, though. So it feels like it's just this final little bit and this thing could all be over.

@aleksejrs

aleksejrs commented Jun 13, 2023

4.3.4 (supposedly; I am not sure I haven't built anything later) often fails to mark the article read from reading it. I believe I mostly use left-click.

@martinrotter
Owner

4.3.4 (supposedly; I am not sure I haven't built anything later) often fails to mark the article read from reading it. I believe I mostly use left-click.

Report as separate issue, but first test with latest development release.

@martinrotter
Owner

So it feels like it's just this final little bit and this thing could all be over.

Yes, that is the work I do now, should be ready soon.

@martinrotter
Owner

OK, just committed.

d866378

Please test when compiled and report back. Performance when browsing the article list (and marking as read in the process) should now be considerably faster, though it still depends on the number of labels assigned to the affected articles.

We are now almost bumping into the limits of SQLite full-text search, which is now used to detect the labels assigned to each article. This could be sped up HUGELY by using the special FTS5 functions of SQLite, but the thing is that those are not present in MariaDB, which is supported too.
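
As a rough illustration of detecting assigned labels through text matching, the sketch below assumes that each article row stores its label IDs as one delimited string in a text column (for example `.3.7.`). That storage format, the table layout and the `unread_with_label` helper are assumptions made for this example, not something confirmed in this thread. Plain `LIKE` is used rather than FTS5, since it exists in both SQLite and MariaDB.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical layout: label IDs stored as a dot-delimited string per article.
con.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, is_read INTEGER, labels TEXT)"
)
con.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        (1, 0, ".3."),     # unread, label 3
        (2, 0, ".13.7."),  # unread, labels 13 and 7
        (3, 1, ".3.7."),   # read, labels 3 and 7
        (4, 0, "."),       # unread, no labels
    ],
)


def unread_with_label(con, label_id):
    # Delimiters on both sides keep label 3 from matching label 13.
    pattern = f"%.{label_id}.%"
    rows = con.execute(
        "SELECT id FROM messages WHERE is_read = 0 AND labels LIKE ? ORDER BY id",
        (pattern,),
    ).fetchall()
    return [row[0] for row in rows]
```

A pattern with a leading wildcard cannot use an ordinary index, so every such lookup has to scan the rows. That is one reason this kind of matching gets slower as articles and labels grow.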

Anyway, test and let me know if the performance is now satisfactory. It is fair to say that the big majority of users will not have hundreds of labels, but maybe 10-20 at most.

@aleksejrs

aleksejrs commented Jun 14, 2023

d866378

an unread article: immediate! And the article is marked read.
a read article: ≈3 seconds.
a category with 1 million unread articles: ≈20 seconds.

The font is small (probably AppImage-specific).

@Mitecon
Author

Mitecon commented Jun 14, 2023

Yes, this is fixed (for me)! Selecting articles, either with the mouse or arrow up/down will mark articles read immediately.

Right-clicking on articles does not mark them read but to be honest this doesn't even matter. In fact, there might be a time when you would want to keep the article unread and just copy the URL or something. Or maybe keep it unread and apply a label to it so you can file it away to read later. So I think I would class right-clicking leaving the read state as a feature - not a bug. I'd like to keep it like this. After all - if a user wants to read something, they will left-click. If they want to do anything else - they'll right-click.

This was a long road but we got there in the end!

Question: I wonder if it might be worth running a poll asking how many people use SQLite vs. MariaDB.

Looking at this, one of the responses was:

64% of users use Windows version of RSS Guard, 46% of users run Linux port, some minor numbers use other ports.

So most users are using Windows. I know you can run MariaDB on Windows but really - how many people are likely to? So of those above, you'd more likely be looking at Linux/other users who may be more likely to be running MariaDB.

You might even be surprised at the results. After all, with another question, you said:

62% of users use DARK look in RSS Guard (this is a big surprise for me)

So imagine not having to support something hardly anyone uses which then frees you up to implement massive gains and allows you to concentrate your efforts.

Plus, at least for me, data portability is a big bonus. All those databases I've sent you with various configurations - I'm not even sure I could, or would, have done that if I'd used MariaDB. A quick look at this makes me squirm. I'm not afraid of using the terminal but it's not like an SQLite database with a quick copy/paste. And I obviously use backup software which meant, at least when I first started using RSS Guard, after I accidentally wiped out the database with badly written filters - I could then simply restore it and start again.

I'm guessing you want to make the best RSS reader available. If you continue supporting a feature that maybe hardly anyone uses then you'll be artificially limiting yourself for no real reason.

For context, how many people here have said they're coming to RSS Guard from QuiteRSS? QuiteRSS does not support MariaDB - only SQLite - and nobody ever complained there.

In fact, I just searched in the QuiteRSS forum and there are no results for MariaDB at all. Not even anyone asking for it as an alternative to SQLite:

quiterss-forum-search-for-mariadb

Maybe you could even release a new stable version and include a one-time popup with something like:

There is a poll for RSS Guard users at [LINK TO POLL] to help shape the future of the program.

Or maybe even some way of a popup listing various (anonymous) data that you will show to the user that contains fundamental information like:

OS
Database type
WebEngine/NoWebEngine
Language

Just the basics. If you show the exact data so users can see what the implications are regarding privacy, then maybe they'll click that 'Send once' button rather than 'Ignore'. I wonder how many users there are who don't even visit GitHub, never mind reporting Issues. It does put people off when they have to create yet another new account just to file a bug report or ask a simple question.

I'm even thinking - well, this is an RSS reader with built-in functionality to actually rewrite article titles. So you could even create a new category that you can inject an article into that loads up a page for users to simply click whatever they need to if they want to take part.

So you could auto create with every new update to RSS Guard something like:

RSS Guard [CATEGORY]
    RSS Guard Stable Release Notes [Feed, into which you inject an article with the changelog for this new version]
    RSS Guard DevBuilds Release Notes [Feed, into which you inject an article with the changelog for this new version]
    RSS Guard User Poll [Feed, into which you inject an article with some info about and a link to a poll somewhere]

These don't even need to be downloaded - you can generate them from within RSS Guard itself by injecting them into the list whenever a new version is run. If people see them and use them - great. If they don't really care - they'll mark as read and they'll become hidden so they don't see them anymore. Plus, because they'll be generated by RSS Guard itself, there is nothing to download and no privacy implications. Users can either take part or ignore as they see fit. The main point being that they will easily be given the option to see, rather than to simply 'know about' or go looking for certain things. It would be good visibility for something that would be able to help you.

Really a lot of possibilities here and I should probably stop. Once I get ideas, they tend to roll on for a while.

@aleksejrs

aleksejrs commented Jun 15, 2023

an unread article: immediate! And the article is marked read.

They are not always marked read. It is not related to the delay caused by some stages of fetching blocking the UI completely.

@martinrotter
Owner

OK guys. I am closing this as solved. There are some more boosts which might be done, but MariaDB support would have to be dropped first which is not feasible at this point.

For 99 % of users, 1-100 labels with some thousands of articles is super fine and with these amounts it now works okay.

Report any problems in separate tickets.

@martinrotter martinrotter added Status-Fixed Ticket is resolved. and removed Status-Partially-Fixed Part of bug/feature is fixed/implemented. labels Jun 22, 2023
Labels
Component-DB Status-Fixed Ticket is resolved. Type-Defect This is BUG!!!

3 participants