CRON job blocked indefinitely with 100% CPU load #505

Closed
nRaecheR opened this issue Nov 19, 2022 · 26 comments
Labels
bug, v3.x

Comments

@nRaecheR

nRaecheR commented Nov 19, 2022

I'm running "Recognize" 3.2.x on my Nextcloud 25.0.1 installation with a 12-Core AMD Ryzen Processor and 64 GB RAM. Nextcloud is running in the official Docker container and for background jobs Cron is configured with the recommended 5 minutes frequency. There is no limitation for Recognize in regard to the number of cores to use (but I've tried to reduce it to 6 or 3 without any changes to the issue). WASM mode is disabled.

The Nextcloud instance serves nearly 50,000 picture and video files; the queue is about the same size after updating from 2.x.

The problem:

As soon as any of the recognition jobs is enabled, the Nextcloud CRON process blocks at 100% CPU load on one core. Every 5 minutes a new PHP cron job gets started and also blocks at 100% on another core. This eventually leads to a DoS of the server once memory is exhausted and the OOM killer starts killing server processes.

Disabling all recognition background jobs restores the normal behavior.

It seems the background job handling of Recognize is faulty and blocks the CRON job from finishing. I can't see any log entries related to "Recognize" in the Nextcloud log. Lowering the number of files to process doesn't seem to fix the problem.
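
For reference, this is roughly how I watch the stuck jobs pile up (a sketch; paths and process names may differ on your setup):

```sh
# On the Docker host: list cron.php processes with their elapsed time and CPU usage.
# Each 5-minute run adds another entry that never finishes.
ps -eo pid,etime,pcpu,cmd | grep '[c]ron.php'
```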

@marcelklehr
Member

@nRaecheR I've been getting reports of this for a while but couldn't reproduce it. Would you be able to debug this situation with xdebug so that we can find out what's blocking these processes from finishing?

@nRaecheR
Author

Thanks for your answer. I want to keep the downtime of the instance as low as possible; modifying it and enabling debug code would be a last resort if nothing else helps.

But I've got more information. It seems to be related to face detection only. I enabled video, face, and object recognition one at a time, and only face detection hangs the cron job. The face recognition queue stays at ~300 items, the video queue is down to 0, and object recognition is still processing.

@MrSSvard

MrSSvard commented Nov 20, 2022

I've got the same problem with the same setup (Docker image on 25.0.1). In my case it's 100k+ images, both in users' folders and on an external storage. I tried emptying the queue and clearing all the tags, and the behaviour comes back, though not immediately; it takes a while.
The actual recognition looks like it goes through without issue. Maybe it's the scanning for files that breaks? There are a lot of sleeping DB connections when it happens. I'll see if I can get some xdebug info when I've got time.


@rhatguy

rhatguy commented Nov 21, 2022

I have the same problem but slightly different symptoms. I'm running the official Nextcloud Docker container inside Unraid on dual Intel 6132s (28 real cores) with no limits on core utilization for the app. The difference on my instance is that the background job never completes (or takes forever) and stays stuck on a single core at 100% utilization. I don't get multiple jobs backing up like @nRaecheR does; perhaps Unraid is blocking the next cron run from executing until the previous one finishes. Right now Nextcloud is reporting that it hasn't executed a cron job in the last 6 hours. I do have lots of faces showing up in the "people" tab of the Memories app. The people tab of the Photos app is either taking too long to load or never finishing; I'm not sure which.

I'm able to do some debugging (downtime acceptable), but I'm not entirely sure how best to start. I'm very Linux savvy, I just haven't yet investigated how best to debug this particular problem.

@marcelklehr
Member

I'm able to do some debugging (downtime acceptable), but I'm not entirely sure how best to start. I'm very Linux savvy, I just haven't yet investigated how best to debug this particular problem.

@rhatguy That's great! Likely the best way forward is to enable Xdebug for your PHP installation (possibly tricky within Docker) and try to profile the stuck processes or attach a debugger to them.
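
Roughly something like the following should work inside the official php-based image (an untested sketch; adjust paths to your setup):

```sh
# Inside the Nextcloud container: build and enable the Xdebug extension.
pecl install xdebug
docker-php-ext-enable xdebug

# Switch Xdebug into profiling mode and point it at a writable directory.
cat >> /usr/local/etc/php/conf.d/docker-php-ext-xdebug.ini <<'EOF'
xdebug.mode=profile
xdebug.output_dir=/tmp
xdebug.start_with_request=yes
EOF
```

CLI runs of cron.php pick up the new ini on their next invocation; the profiles land as cachegrind.out.* files in /tmp and can be opened with KCachegrind/QCachegrind.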

@sbglt

sbglt commented Nov 24, 2022

I have the same problem and I'll enable xdebug.

@Forza-tng

Forza-tng commented Nov 27, 2022

Hi. I also face the same issue. Disabling people detection solves it.

NC 25.0.1.1
Recognize 3.2.3
Linux kernel 6.0.9
PHP 8.1.12

I also tested using system nodejs 18.12.1 but it did not help.

@marcelklehr I could help debug, but would need some guidance on how to do this.

@marcelklehr
Member

I believe I've found the cause of the load: the clustering algorithm ties up more and more resources the more images you have. v3.3.4 now has an incremental clustering algorithm that should be much faster on successive runs.

@MrSSvard

Nice! Unfortunately I'm still having the same issue. :/ I turned on face detection overnight after updating to 3.3.4, and this morning I've got 9 running cron.php processes using about 0.4 cores each, each with a sleeping MySQL connection. No I/O showing in iotop either.

I was able to install xdebug in the container, but I don't know enough about PHP to debug it ^^'

@ljm625

ljm625 commented Dec 30, 2022

Well, I think this issue might be caused by setting too many files per batch in the Recognize config. Try setting it to less than 50 and see: if the CPU drops below 100% before the next cron job triggers, then it's a good value to use.

Anyway, the Nextcloud cron should detect a previously running Recognize job and wait for it instead of spawning a new one every 5 minutes (a generic workaround is sketched below).
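
Until then, one generic workaround (not specific to Recognize; adjust the paths to your setup) is to wrap the cron entry in flock so overlapping runs are skipped:

```sh
# Example crontab entry for the www-data user: flock holds a lock file and,
# with -n (non-blocking), simply skips the run if the previous cron.php is still alive.
*/5 * * * * flock -n /tmp/nextcloud-cron.lock php -f /var/www/html/cron.php
```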

@marcelklehr
Member

This would then be a duplicate of #335

@johnnyborg

@ljm625

Well, I believe this issue is caused by setting too many files per batch in the Recognize config. Try setting it to less than 50 and see: if the CPU drops below 100% before the next cron job triggers, then it's a good value to use.

Anyway, the Nextcloud cron should detect a previously running Recognize job and wait for it instead of spawning a new one every 5 minutes.

I think you're drawing the wrong conclusion here; a bug in the clustering algorithm shouldn't affect any of the other cron processes anyway. Even setting it to 2 photos at a time causes trouble on my installation (500,000 photos). This behavior appeared somewhere along the updates (I think). With the last update it already improved a lot, but it is not fully resolved: one core sits at full utilization without any noticeable stress on the database or file system.

As a side note, this is running on a G4560 with 8 GB of RAM.

@ljm625

ljm625 commented Dec 31, 2022

Yeah, I saw the latest edit in the main thread. If lowering the file count to maybe 1 still causes issues, it's a different story; you might need an xdebug trace to see where it gets stuck.

@marcelklehr
Member

@johnnyborg Clustering does take more time the more images you have. If you believe you've found a different bug, please open a new issue; it's always easier to deal with one problem at a time than to jumble different problems from multiple people into one thread.

@MrSSvard

MrSSvard commented Jan 1, 2023

I don't think this is an issue with classification either; I also get the problem with 1 image per batch. The image classification processes show up separately from the cron.php process. When I turn on people recognition for a while, I get a growing number of cron.php processes with sleeping connections to the DB, no I/O that I can see, and no classification processes.

@marcelklehr
Member

When I turn on people recognition for a while, I get a growing number of cron.php processes with sleeping connections to the DB

What do you see in the oc_jobs table when this happens? Any jobs with three non-zero time values?
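
For example, something like this should list the relevant rows (an untested sketch; substitute your own database name and credentials). The three time columns I mean are last_run, last_checked and reserved_at:

```sh
# Run against the Nextcloud database: show Recognize background jobs and their time columns.
mysql -u nextcloud -p nextcloud -e \
  "SELECT id, class, last_run, last_checked, reserved_at
     FROM oc_jobs
    WHERE class LIKE '%Recognize%';"
```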

@marcelklehr marcelklehr reopened this Jan 1, 2023
@marcelklehr marcelklehr changed the title from "[3.2.2] Background scan blocks Nextcloud CRON php job indefinitely with 100% CPU load" to "CRON job blocked indefinitely with 100% CPU load" Jan 1, 2023
@MrSSvard

MrSSvard commented Jan 2, 2023

When I turn on people recognition for a while, I get a growing number of cron.php processes with sleeping connections to the DB

What do you see in the oc_jobs table when this happens? Any jobs with three non-zero time values?

Here are the Recognize rows of my oc_jobs. At the time, 6 cron.php processes were running.
[Screenshot: Recognize rows in oc_jobs, 2023-01-02]

@marcelklehr
Member

Hm, from the perspective of the jobs table, nothing Recognize-related seems to be running.

@nRaecheR
Author

I've updated to 3.3.6 and tried face recognition again; it still has the same problem with the CRON jobs getting blocked at 100% CPU load. So unfortunately, the changes in the latest version don't fix the issue.

@Xyrodileas

Xyrodileas commented Jan 26, 2023

Still the same 100% CPU load issue with Recognize 3.3.6, with only people recognition enabled at 20 photos at a time.

@marcelklehr
Member

This might be fixed with v3.5.0. I don't have high hopes, but we did introduce a new clustering algorithm.

@nRaecheR
Author

Good news: after updating to 3.5.0 and re-enabling face recognition, it works as expected; no more excessive CPU load on one core.

Side note: it seems that the face recognition quality has decreased; faces that were recognized correctly with the older versions now get assigned to other people. But it finishes without any DoS of the server =)

@marcelklehr
Member

It seems that the face recognition quality has decreased

mmh, that was certainly not the intention :/

Good news: after updating to 3.5.0 and re-enabling face recognition, it works as expected; no more excessive CPU load on one core.

Yay, I'm closing this then.
