
High CPU usage in "Update all Episode Subtitles from disk" task #1916

Closed
ngosang opened this issue Aug 1, 2022 · 17 comments

@ngosang
Contributor

ngosang commented Aug 1, 2022

Describe the bug
High CPU usage in "Update all Episode Subtitles from disk" task

To Reproduce
The task "Update all Episode Subtitles from disk" is launched every day at 4:00.
As you can see in the screenshot, it takes 14 minutes and uses 100% of the CPU. The system is a Raspberry Pi, but that is a lot of CPU...
I have 1,556 episodes, so roughly 5,000 subtitles.

Expected behavior
Reduce the CPU usage and the scan time.

Screenshots
[screenshot]

Logs

01/08/2022 01:20:49|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 01:28:35|INFO    |root                            |Throttling subscene for 10 minutes, until 22/08/01 01:38, because of: ProviderError. Exception info: 'Something went wrong when trying to log in #1'|
01/08/2022 01:29:32|INFO    |root                            |BAZARR Finished searching for missing Series Subtitles. Check History for more information.|
01/08/2022 03:15:51|INFO    |root                            |Using subscene again after 10 minutes, (disabled because: ProviderError)|
01/08/2022 04:13:07|INFO    |root                            |BAZARR All existing episode subtitles indexed from disk.|
01/08/2022 04:20:25|INFO    |root                            |Throttling subscene for 10 minutes, until 22/08/01 04:30, because of: ProviderError. Exception info: 'Something went wrong when trying to log in #1'|
01/08/2022 04:20:58|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 04:22:03|INFO    |root                            |BAZARR Finished searching for missing Series Subtitles. Check History for more information.|
01/08/2022 04:32:30|INFO    |root                            |Using subscene again after 10 minutes, (disabled because: ProviderError)|
01/08/2022 04:32:31|INFO    |root                            |Throttling subscene for 10 minutes, until 22/08/01 04:42, because of: ProviderError. Exception info: 'Something went wrong when trying to log in #1'|
01/08/2022 07:20:24|INFO    |root                            |Using subscene again after 10 minutes, (disabled because: ProviderError)|
01/08/2022 07:20:25|INFO    |root                            |Throttling subscene for 10 minutes, until 22/08/01 07:30, because of: ProviderError. Exception info: 'Something went wrong when trying to log in #1'|
01/08/2022 07:20:55|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 07:22:01|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 07:22:14|INFO    |root                            |BAZARR Finished searching for missing Series Subtitles. Check History for more information.|
01/08/2022 08:27:45|INFO    |root                            |Using subscene again after 10 minutes, (disabled because: ProviderError)|
01/08/2022 08:29:43|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 08:32:11|INFO    |root                            |BAZARR Finished searching for Subtitles to upgrade. Check History for more information.|
01/08/2022 10:20:45|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 10:25:28|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 10:26:25|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 10:30:56|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 10:35:06|INFO    |root                            |BAZARR Finished searching for missing Series Subtitles. Check History for more information.|

Software (please complete the following information):

  • Bazarr: 1.1.0
  • Radarr version: n/a
  • Sonarr version: 3.0.8.1507
  • OS: Docker (Linux)

Additional context
None.

@morpheus65535
Owner

What kind of hardware are we talking about here? I don't see a bug here, only the perception that CPU usage is too high.

@ngosang
Contributor Author

ngosang commented Aug 1, 2022

Raspberry Pi 4. I know the hardware has its limitations, but you can see a comparison with the other containers I'm running on the same node. That includes Sonarr, Radarr, three Jackett instances, Jellyfin...
[screenshot]

@ngosang
Contributor Author

ngosang commented Aug 1, 2022

Zoom into related services
[screenshot]

@ngosang
Contributor Author

ngosang commented Aug 1, 2022

One week
[screenshot]

@morpheus65535
Owner

Does one of these open each and every media file and index the properties and embedded subtitles? The answer is no.

It takes 1m30s to scan 3,500 episodes on my setup. You can run this job less often if it bothers you, but there's nothing I can do about this.

@ngosang
Contributor Author

ngosang commented Aug 1, 2022

Please be polite. I am spending my time trying to improve your software.
I think Radarr, Sonarr and Lidarr have a task to check/reindex the collection and the metadata from time to time.
As an improvement, you don't need to read the files. Just check the last modification date of the file, or compute the SHA-256 hash of the whole file. Compare it with the value stored in your database and, if it matches, you can skip that file. Computing the hash of 5,000 files should take less than 10 s, just the time spent on disk I/O. Checking the last modification date should take less than 1 s.
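
A minimal sketch of that check in Python (hypothetical helper names, not Bazarr's actual code), assuming the fingerprint from the previous scan is stored in the database:

```python
import hashlib
import os

def file_fingerprint(path: str, use_hash: bool = False) -> str:
    """Cheap change detector: mtime + size by default, SHA-256 on request."""
    st = os.stat(path)
    if not use_hash:
        return f"{st.st_mtime_ns}:{st.st_size}"
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def needs_reindex(path: str, stored_fingerprint: str | None) -> bool:
    # Skip the file entirely when the stored fingerprint still matches.
    return stored_fingerprint != file_fingerprint(path)
```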

@morpheus65535
Owner

Please be polite. I am spending my time trying to improve your software.

Sorry, I didn't mean to be impolite.

I think Radarr, Sonarr and Lidarr have a task to check/reindex the collection and the metadata from time to time. As an improvement, you don't need to read the files. Just check the last modification date of the file, or compute the SHA-256 hash of the whole file. Compare it with the value stored in your database and, if it matches, you can skip that file. Computing the hash of 5,000 files should take less than 10 s, just the time spent on disk I/O. Checking the last modification date should take less than 1 s.

This is exactly what we do by caching the ffprobe output until the episode id from Sonarr or the file size changes.

I'll look at the code again in the upcoming days to see if something can be improved.
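
A minimal sketch of that caching scheme (hypothetical names; the in-memory dict stands in for the database column Bazarr actually uses):

```python
import json
import os
import subprocess

# Cache keyed by (Sonarr episode id, file size): a change to either
# invalidates the entry and forces a fresh ffprobe run.
_probe_cache: dict[tuple[int, int], dict] = {}

def cached_probe(path: str, episode_id: int) -> dict:
    key = (episode_id, os.path.getsize(path))
    if key not in _probe_cache:
        out = subprocess.run(
            ["ffprobe", "-v", "quiet", "-print_format", "json",
             "-show_format", "-show_streams", path],
            capture_output=True, text=True, check=True,
        )
        _probe_cache[key] = json.loads(out.stdout)
    return _probe_cache[key]
```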

@morpheus65535 morpheus65535 self-assigned this Aug 1, 2022
@ngosang
Contributor Author

ngosang commented Aug 1, 2022

Thank you. File size is OK, but it's not as reliable as the modification date or the hash. I'm pretty sure you will find you are doing more than comparing the file size. I don't know how you are handling the migration when the user updates the Bazarr version. In that case you should invalidate the cache (by setting a wrong hash on all subtitles in the database), which will force a complete scan, just in case a newer version of ffmpeg provides different info.
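
A hypothetical sketch of that migration step (the table and column names here are assumptions, not Bazarr's actual schema):

```python
import sqlite3

def invalidate_ffprobe_cache(db_path: str, old_version: str, new_version: str) -> None:
    if old_version == new_version:
        return  # same version: the cached probe results stay valid
    with sqlite3.connect(db_path) as conn:  # commits on success
        # Blanking the cache forces a one-time full rescan after the upgrade.
        conn.execute("UPDATE table_episodes SET ffprobe_cache = NULL")
```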

@soenneker
Contributor

I'm running into this too - it's a problem. A 36-thread CPU at 5 GHz being consumed across all threads at 90-100% for 5 minutes isn't acceptable, even if it's doing legitimate work.

Possible solutions I see are:

  • Throttling the process from within (lowering its priority)
  • Setting internal delays between CPU-heavy units of work (so at least other processes can get clock time)
  • Limiting the total number of threads available

From reading the above, it seems the task is not critical, so it wouldn't be a problem for it to take extra time. (A sketch of the first two options follows below.)
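
A minimal sketch of those two options (hypothetical helper, not Bazarr's actual code):

```python
import os
import time
from collections.abc import Callable, Iterable

def run_throttled(items: Iterable, work: Callable, delay: float = 0.05) -> None:
    """Lower the process priority and yield between CPU-heavy units of work."""
    os.nice(10)            # drop scheduling priority (Unix only)
    for item in items:
        work(item)         # the expensive per-item step
        time.sleep(delay)  # give other processes a chance at the CPU
```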

@ngosang
Contributor Author

ngosang commented Aug 5, 2022

@morpheus65535 I looked at the code.

  • You are caching the ffprobe output for embedded subtitles. That is working as expected and it's fast enough.
  • You are reading all external subtitles and using a CPU-intensive library to guess the language.

How to reproduce:

  1. Go to one series in Bazarr and get its Sonarr id.
  2. Clear the ffprobe_cache column for that series in the database.
  3. Go to the series in the Bazarr UI and click "Scan Disk".
  4. In my case, a series with just 10 episodes with internal and external subtitles takes 19 seconds (running ffprobe).
  5. Go to the series in the Bazarr UI and click "Scan Disk" again.
  6. With the ffprobe cache it takes 12 seconds due to the external subtitles. It may not look like much, but I have 5,000 subtitles...

Solution:

  • You should save the size of the external subtitle in the database. E.g., instead of saving this:
[
    ['en', None], 
    ['es:hi', '/media/1883/Season 01/1883 (2021) S01E01 WEBDL-1080p.es.srt']
]

save

[
    ['en', None, None], 
    ['es:hi', '/media/1883/Season 01/1883 (2021) S01E01 WEBDL-1080p.es.srt', 1234567]
]
  • Get the name and size of all the files in the folder.
  • Search in the list by path and compare the size
  • If the path + size matches you can skip that subtitle

Expected result:
It should take less than 1 second instead of 12. That will be noticeable not just for the "Update all Episode Subtitles from disk" task, but for all tasks that scan subtitles (a sketch of the check follows below).
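
A minimal sketch of the proposed check (hypothetical helper; db_rows uses the [language, path, size] layout shown above):

```python
import os

def subtitles_to_index(folder: str, db_rows: list[list]) -> list[str]:
    """Return only the external subtitles that actually need language detection."""
    known = {(path, size) for _lang, path, size in db_rows if path}
    pending = []
    for name in os.listdir(folder):
        if not name.endswith(".srt"):
            continue
        path = os.path.join(folder, name)
        if (path, os.path.getsize(path)) in known:
            continue  # path + size match: reuse the stored language, skip detection
        pending.append(path)  # new or modified: run language detection
    return pending
```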

@morpheus65535
Owner

I'm running into this too - it's a problem. A 36-thread CPU at 5 GHz being consumed across all threads at 90-100% for 5 minutes isn't acceptable, even if it's doing legitimate work.

I don't know what's causing all those threads, but the indexing task is sequential: it runs one episode after another in a single thread. If required, it starts a subprocess for ffprobe and waits for it to return before moving on to the next one. The worst-case scenario is one thread for the indexing process plus a subprocess for ffprobe.

I'm not saying you don't have an issue, but it's not related to the indexing process.
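
A minimal sketch of the sequential flow described above (not Bazarr's actual code): the loop blocks on each ffprobe child before starting the next one, so at most one indexing thread and one ffprobe subprocess exist at a time.

```python
import subprocess

def index_sequentially(paths: list[str]) -> None:
    for path in paths:
        proc = subprocess.Popen(
            ["ffprobe", "-v", "quiet", "-print_format", "json",
             "-show_format", "-show_streams", path],
            stdout=subprocess.PIPE,
        )
        output, _ = proc.communicate()  # wait for ffprobe to finish
        # ...process `output` for this episode before moving to the next one
```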

@morpheus65535
Owner

  • Get the name and size of all the files in the folder.
  • Search in the list by path and compare the size
  • If the path + size matches you can skip that subtitle

Sounds like a good idea. I'll see how I can implement this because, currently, we don't care about the previously indexed subtitles: we simply reindex all of them. I'll try to work on this in the upcoming week.

Thanks for the investigation!

@morpheus65535
Owner

@ngosang I've implemented your suggestion. I don't know if you'll really get some benefit out of that. You can try the upcoming beta and let me know.

@ngosang
Contributor Author

ngosang commented Aug 15, 2022

@morpheus65535 thank you. It's working in most cases, but there is a bug. I opened a PR; it's tested by me, but feel free to improve the code. => #1926

First iteration:

  1. Subtitles found on disk => {'The Shadow Line (2011) S01E01 SDTV XviD MP3 BiA.srt': None, 'The Shadow Line (2011) S01E01 SDTV XviD MP3 BiA.en.srt': <Language [en]>}
  2. Subtitles saved in the Bazarr DB => [['es:hi', '/media/hdd_raid/MEDIA/TV_EN/The Shadow Line VOSE/Season 01/The Shadow Line (2011) S01E01 SDTV XviD MP3 BiA.srt', 54054], ['en', '/media/hdd_raid/MEDIA/TV_EN/The Shadow Line VOSE/Season 01/The Shadow Line (2011) S01E01 SDTV XviD MP3 BiA.en.srt', 48352]]

Second iteration:

  1. Subtitles found on disk => {'The Shadow Line (2011) S01E01 SDTV XviD MP3 BiA.srt': None, 'The Shadow Line (2011) S01E01 SDTV XviD MP3 BiA.en.srt': <Language [en]>}
  2. Subtitles saved in the Bazarr DB => [['en', '/media/hdd_raid/MEDIA/TV_EN/The Shadow Line VOSE/Season 01/The Shadow Line (2011) S01E01 SDTV XviD MP3 BiA.en.srt', 48352]]

The first subtitle is missing because you "continue" when the detected language is None, dropping the subtitle from the result. In the third iteration the subtitle is discovered again, using CPU... and the process repeats.
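
A minimal sketch of the pattern the fix needs (hypothetical helpers, not the actual PR code): a cached row whose path and size still match must be kept even when its stored language is None, otherwise it is re-detected on every run.

```python
import os

def detect_language(path: str) -> str | None:
    return None  # stand-in for the CPU-heavy language-guessing step

def merge_subtitles(disk_paths: list[str], cached: dict[str, list]) -> list[list]:
    """cached maps path -> [language, path, size]; language may be None."""
    result = []
    for path in disk_paths:
        row = cached.get(path)
        if row is not None and row[2] == os.path.getsize(path):
            result.append(row)  # unchanged on disk: keep the row, even if language is None
        else:
            result.append([detect_language(path), path, os.path.getsize(path)])
    return result
```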

@morpheus65535
Owner

@ngosang thanks for this one! I've worked on that for too many nights and I've lost track of what was done and what was missing.

The PR seems to be good. I don't see any improvement on my side, but all of my subtitles include a language code, so that's to be expected.

I'll merge your PR in a couple of minutes. I'll let you close this issue if it's good enough for you.

Thanks again!

@ngosang
Contributor Author

ngosang commented Aug 31, 2022

I just tested release 1.1.1 and it looks much better: from 15 min down to 3 min, with less CPU usage. The values are approximate since my monitoring solution does not have enough resolution.

[screenshot]

@morpheus65535
Owner

Good improvement!
