
High CPU usage in "Update all Episode Subtitles from disk" task #1916

Closed
ngosang opened this issue Aug 1, 2022 · 17 comments

@ngosang
Contributor

ngosang commented Aug 1, 2022

Describe the bug
High CPU usage in "Update all Episode Subtitles from disk" task

To Reproduce
The task "Update all Episode Subtitles from disk" is launched every day at 4:00.
As you can see in the screenshot, it takes 14 minutes and uses 100% of the CPU. The system is a Raspberry Pi, but that is a lot of CPU...
I have 1,556 episodes, so roughly 5,000 subtitles.

Expected behavior
Reduce the CPU usage and the scan time.

Screenshots
[screenshot]

Logs

01/08/2022 01:20:49|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 01:28:35|INFO    |root                            |Throttling subscene for 10 minutes, until 22/08/01 01:38, because of: ProviderError. Exception info: 'Something went wrong when trying to log in #1'|
01/08/2022 01:29:32|INFO    |root                            |BAZARR Finished searching for missing Series Subtitles. Check History for more information.|
01/08/2022 03:15:51|INFO    |root                            |Using subscene again after 10 minutes, (disabled because: ProviderError)|
01/08/2022 04:13:07|INFO    |root                            |BAZARR All existing episode subtitles indexed from disk.|
01/08/2022 04:20:25|INFO    |root                            |Throttling subscene for 10 minutes, until 22/08/01 04:30, because of: ProviderError. Exception info: 'Something went wrong when trying to log in #1'|
01/08/2022 04:20:58|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 04:22:03|INFO    |root                            |BAZARR Finished searching for missing Series Subtitles. Check History for more information.|
01/08/2022 04:32:30|INFO    |root                            |Using subscene again after 10 minutes, (disabled because: ProviderError)|
01/08/2022 04:32:31|INFO    |root                            |Throttling subscene for 10 minutes, until 22/08/01 04:42, because of: ProviderError. Exception info: 'Something went wrong when trying to log in #1'|
01/08/2022 07:20:24|INFO    |root                            |Using subscene again after 10 minutes, (disabled because: ProviderError)|
01/08/2022 07:20:25|INFO    |root                            |Throttling subscene for 10 minutes, until 22/08/01 07:30, because of: ProviderError. Exception info: 'Something went wrong when trying to log in #1'|
01/08/2022 07:20:55|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 07:22:01|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 07:22:14|INFO    |root                            |BAZARR Finished searching for missing Series Subtitles. Check History for more information.|
01/08/2022 08:27:45|INFO    |root                            |Using subscene again after 10 minutes, (disabled because: ProviderError)|
01/08/2022 08:29:43|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 08:32:11|INFO    |root                            |BAZARR Finished searching for Subtitles to upgrade. Check History for more information.|
01/08/2022 10:20:45|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 10:25:28|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 10:26:25|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 10:30:56|WARNING |retry.api                       |HTTPSConnectionPool(host='www.podnapisi.net', port=443): Read timed out. (read timeout=10), retrying in 5 seconds...|
01/08/2022 10:35:06|INFO    |root                            |BAZARR Finished searching for missing Series Subtitles. Check History for more information.|

Software (please complete the following information):

  • Bazarr: 1.1.0
  • Radarr version: n/a
  • Sonarr version: 3.0.8.1507
  • OS: Docker (Linux)

Additional context
None.

@morpheus65535
Owner

What kind of hardware are we talking about here? I don't see a bug here, only the perception that CPU usage is too high.

@ngosang
Contributor Author

ngosang commented Aug 1, 2022

Raspberry Pi 4. I know the hardware has its limitations, but you can see a comparison with the other containers I'm running on the same node. That includes Sonarr, Radarr, three Jackett instances, Jellyfin...
[screenshot]

@ngosang
Contributor Author

ngosang commented Aug 1, 2022

Zoom into related services
[screenshot]

@ngosang
Contributor Author

ngosang commented Aug 1, 2022

One week
[screenshot]

@morpheus65535
Owner

Does one of these open each and every media file and index the properties and embedded subtitles? The answer is no.

It takes 1m30s to scan 3,500 episodes on my setup. You can run this job less often if it bothers you, but there's nothing I can do about this.

@ngosang
Contributor Author

ngosang commented Aug 1, 2022

Please be polite. I am spending my time trying to improve your software.
I think Radarr, Sonarr and Lidarr have a task to check/reindex the collection and the metadata from time to time.
As an improvement, you don't need to read the files. Just check the last modification date of the file, or compute the SHA-256 hash of the whole file. Compare it with the value stored in your database and, if it matches, you can skip that file. Computing the hash of 5,000 files should take less than 10 s, just the time spent on disk I/O. Checking the last modification date should take less than 1 s.
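
A minimal sketch of that check in Python (hypothetical helper names, not Bazarr's actual code), assuming the fingerprint from the previous scan is stored in the database:

```python
import hashlib
import os

def file_fingerprint(path: str, use_hash: bool = False) -> str:
    """Cheap change detector: mtime + size by default, SHA-256 on request."""
    st = os.stat(path)
    if not use_hash:
        return f"{st.st_mtime_ns}:{st.st_size}"
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def needs_reindex(path: str, stored_fingerprint: str | None) -> bool:
    # Skip the file entirely when the stored fingerprint still matches.
    return stored_fingerprint != file_fingerprint(path)
```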

@morpheus65535
Owner

Please be polite. I am spending my time trying to improve your software.

Sorry, I didn't mean to be impolite.

I think Radarr, Sonarr and Lidarr have a task to check/reindex the collection and the metadata from time to time. As an improvement, you don't need to read the files. Just check the last modification date of the file, or compute the SHA-256 hash of the whole file. Compare it with the value stored in your database and, if it matches, you can skip that file. Computing the hash of 5,000 files should take less than 10 s, just the time spent on disk I/O. Checking the last modification date should take less than 1 s.

This is exactly what we do by caching the ffprobe output until the episode id from Sonarr or the file size changes.

I'll look at the code again in the upcoming days to see if something can be improved.
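
A minimal sketch of that caching scheme (hypothetical names; the in-memory dict stands in for the database column Bazarr actually uses):

```python
import json
import os
import subprocess

# Cache keyed by (Sonarr episode id, file size): a change to either
# invalidates the entry and forces a fresh ffprobe run.
_probe_cache: dict[tuple[int, int], dict] = {}

def cached_probe(path: str, episode_id: int) -> dict:
    key = (episode_id, os.path.getsize(path))
    if key not in _probe_cache:
        out = subprocess.run(
            ["ffprobe", "-v", "quiet", "-print_format", "json",
             "-show_format", "-show_streams", path],
            capture_output=True, text=True, check=True,
        )
        _probe_cache[key] = json.loads(out.stdout)
    return _probe_cache[key]
```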

@morpheus65535 morpheus65535 self-assigned this Aug 1, 2022
@ngosang
Contributor Author

ngosang commented Aug 1, 2022

Thank you. File size is OK, but it's not as reliable as the modification date or the hash. I'm pretty sure you will find you are doing more than comparing the file size. I don't know how you are handling the migration when the user updates the Bazarr version. In that case you should invalidate the cache (by setting a wrong hash on all subtitles in the database), which will force a complete scan, just in case a newer version of ffmpeg provides different info.
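
A hypothetical sketch of that migration step (the table and column names here are assumptions, not Bazarr's actual schema):

```python
import sqlite3

def invalidate_ffprobe_cache(db_path: str, old_version: str, new_version: str) -> None:
    if old_version == new_version:
        return  # same version: the cached probe results stay valid
    with sqlite3.connect(db_path) as conn:  # commits on success
        # Blanking the cache forces a one-time full rescan after the upgrade.
        conn.execute("UPDATE table_episodes SET ffprobe_cache = NULL")
```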

@soenneker
Contributor

I'm running into this too - it's a problem. A 36-thread CPU at 5 GHz being consumed across all threads at 90-100% for 5 minutes isn't acceptable, even if it's doing legitimate work.

Possible solutions I see are:

  • Throttling the process from within (lowering its priority)
  • Setting internal delays between CPU-heavy units of work (so at least other processes can get clock time)
  • Limiting the total number of threads available

From reading the above, it seems the task is not critical, so it wouldn't be a problem for it to take extra time. (A sketch of the first two options follows below.)
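
A minimal sketch of those two options (hypothetical helper, not Bazarr's actual code):

```python
import os
import time
from collections.abc import Callable, Iterable

def run_throttled(items: Iterable, work: Callable, delay: float = 0.05) -> None:
    """Lower the process priority and yield between CPU-heavy units of work."""
    os.nice(10)            # drop scheduling priority (Unix only)
    for item in items:
        work(item)         # the expensive per-item step
        time.sleep(delay)  # give other processes a chance at the CPU
```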

@ngosang
Contributor Author

ngosang commented Aug 5, 2022

@morpheus65535 I looked at the code.

  • You are caching the ffprobe output for embedded subtitles. That is working as expected and it's fast enough.
  • You are reading all external subtitles and using a CPU-intensive library to guess the language.

How to reproduce:

  1. Go to one series in Bazarr and get its Sonarr id.
  2. Clear the ffprobe_cache column for that series in the database.
  3. Go to the series in the Bazarr UI and click "Scan Disk".
  4. In my case, a series with just 10 episodes with internal and external subtitles takes 19 seconds (running ffprobe).
  5. Go to the series in the Bazarr UI and click "Scan Disk" again.
  6. With the ffprobe cache it takes 12 seconds due to the external subtitles. It may not look like much, but I have 5,000 subtitles...

Solution:

  • You should save the size of the external subtitle in the database. E.g., instead of saving this:
[
    ['en', None], 
    ['es:hi', '/media/1883/Season 01/1883 (2021) S01E01 WEBDL-1080p.es.srt']
]

save

[
    ['en', None, None], 
    ['es:hi', '/media/1883/Season 01/1883 (2021) S01E01 WEBDL-1080p.es.srt', 1234567]
]
  • Get the name and size of all the files in the folder.
  • Search in the list by path and compare the size
  • If the path + size matches you can skip that subtitle

Expected result:
It should take less than 1 second instead of 12. That will be noticeable not just for the "Update all Episode Subtitles from disk" task, but for all tasks that scan subtitles (a sketch of the check follows below).
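
A minimal sketch of the proposed check (hypothetical helper; db_rows uses the [language, path, size] layout shown above):

```python
import os

def subtitles_to_index(folder: str, db_rows: list[list]) -> list[str]:
    """Return only the external subtitles that actually need language detection."""
    known = {(path, size) for _lang, path, size in db_rows if path}
    pending = []
    for name in os.listdir(folder):
        if not name.endswith(".srt"):
            continue
        path = os.path.join(folder, name)
        if (path, os.path.getsize(path)) in known:
            continue  # path + size match: reuse the stored language, skip detection
        pending.append(path)  # new or modified: run language detection
    return pending
```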

@morpheus65535
Owner

I'm running into this too - it's a problem. A 36-thread CPU at 5 GHz being consumed across all threads at 90-100% for 5 minutes isn't acceptable, even if it's doing legitimate work.

I don't know what's causing all those threads, but the indexing task is sequential: it runs one episode after another in a single thread. If required, it starts a subprocess for ffprobe and waits for it to return before moving on to the next one. The worst-case scenario is one thread for the indexing process plus a subprocess for ffprobe.

I'm not saying you don't have an issue, but it's not related to the indexing process.
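
A minimal sketch of the sequential flow described above (not Bazarr's actual code): the loop blocks on each ffprobe child before starting the next one, so at most one indexing thread and one ffprobe subprocess exist at a time.

```python
import subprocess

def index_sequentially(paths: list[str]) -> None:
    for path in paths:
        proc = subprocess.Popen(
            ["ffprobe", "-v", "quiet", "-print_format", "json",
             "-show_format", "-show_streams", path],
            stdout=subprocess.PIPE,
        )
        output, _ = proc.communicate()  # wait for ffprobe to finish
        # ...process `output` for this episode before moving to the next one
```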

@morpheus65535
Owner

  • Get the name and size of all the files in the folder.
  • Search in the list by path and compare the size
  • If the path + size matches you can skip that subtitle

Sounds like a good idea. I'll see how I can implement this because, currently, we don't care about the previously indexed subtitles: we simply reindex all of them. I'll try to work on this in the upcoming week.

Thanks for the investigation!

@morpheus65535
Owner

@ngosang I've implemented your suggestion. I don't know if you'll really get some benefit out of that. You can try the upcoming beta and let me know.

@ngosang
Contributor Author

ngosang commented Aug 15, 2022

@morpheus65535 thank you. It's working in most cases, but there is a bug. I opened a PR; it's tested by me, but feel free to improve the code. => #1926

First iteration:

  1. Subtitles found on disk => {'The Shadow Line (2011) S01E01 SDTV XviD MP3 BiA.srt': None, 'The Shadow Line (2011) S01E01 SDTV XviD MP3 BiA.en.srt': <Language [en]>}
  2. Subtitles saved in the Bazarr DB => [['es:hi', '/media/hdd_raid/MEDIA/TV_EN/The Shadow Line VOSE/Season 01/The Shadow Line (2011) S01E01 SDTV XviD MP3 BiA.srt', 54054], ['en', '/media/hdd_raid/MEDIA/TV_EN/The Shadow Line VOSE/Season 01/The Shadow Line (2011) S01E01 SDTV XviD MP3 BiA.en.srt', 48352]]

Second iteration:

  1. Subtitles found on disk => {'The Shadow Line (2011) S01E01 SDTV XviD MP3 BiA.srt': None, 'The Shadow Line (2011) S01E01 SDTV XviD MP3 BiA.en.srt': <Language [en]>}
  2. Subtitles saved in the Bazarr DB => [['en', '/media/hdd_raid/MEDIA/TV_EN/The Shadow Line VOSE/Season 01/The Shadow Line (2011) S01E01 SDTV XviD MP3 BiA.en.srt', 48352]]

The first subtitle is missing because you "continue" when the detected language is None, dropping the subtitle from the result. In the third iteration the subtitle is discovered again, using CPU... and the process repeats.
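
A minimal sketch of the pattern the fix needs (hypothetical helpers, not the actual PR code): a cached row whose path and size still match must be kept even when its stored language is None, otherwise it is re-detected on every run.

```python
import os

def detect_language(path: str) -> str | None:
    return None  # stand-in for the CPU-heavy language-guessing step

def merge_subtitles(disk_paths: list[str], cached: dict[str, list]) -> list[list]:
    """cached maps path -> [language, path, size]; language may be None."""
    result = []
    for path in disk_paths:
        row = cached.get(path)
        if row is not None and row[2] == os.path.getsize(path):
            result.append(row)  # unchanged on disk: keep the row, even if language is None
        else:
            result.append([detect_language(path), path, os.path.getsize(path)])
    return result
```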

@morpheus65535
Owner

@ngosang thanks for this one! I've worked on that for too many nights and I've lost track of what was done and what was missing.

The PR seems to be good. I don't see any improvement on my side, but all of my subtitles include a language code, so that's to be expected.

I'll merge your PR in a couple of minutes. I'll let you close this issue if it's good enough for you.

Thanks again!

@ngosang
Contributor Author

ngosang commented Aug 31, 2022

I just tested release 1.1.1 and it looks much better: from 15 min down to 3 min, with less CPU usage. The values are approximate since my monitoring solution does not have enough resolution.

[screenshot]

@morpheus65535
Owner

Good improvement!
