Memory leak / Large images kill classifier process OOM #111
I am not sure, but from my point of view it is the tagging that kills the app.
I have disabled the preview generation for the category "Movie"; let's see how it does.
So now it simply fails with a:
Within the Nextcloud log there are no entries indicating why it fails.
This may be due to your CPU architecture. You can try running the classifier script directly to check:
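For reference, such a manual test run could look roughly like this; the script location and name are assumptions based on the v1.x layout of the repo, so adjust the paths to your installation:

```sh
# Hypothetical example: run the ImageNet classifier on a single test image.
# Paths are assumptions; check apps/recognize/src/ for the actual script names.
cd /var/www/nextcloud/apps/recognize
node src/classifier_imagenet.js /tmp/test.jpg
```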
I do not think this is a CPU architecture issue, as I pass the host CPU model through to my VM, and it also has the mentioned AVX flag:
Also, a test run worked fine:
Nice! That's good. Then we can try increasing the log verbosity to debug, to see what's happening when the app does the same thing.
Do you mean the Nextcloud log itself, or just the app? I already run it with
The Nextcloud log, yeah. The command doesn't print any logs at the moment (which I'm aware isn't ideal).
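As a side note, the Nextcloud log level is plain system configuration and can be raised via occ (0 is debug; the default is 2, warning):

```sh
# Raise the Nextcloud log verbosity to debug (remember to set it back to 2 later)
sudo -u www-data php occ config:system:set loglevel --value=0 --type=integer
```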
The problem with setting the general log level to debug is that I get countless "deprecated" entries and the log is rotating a lot:
So I left it running at debug level and it crashed again, again with an SQL exception.
The debug log of Nextcloud does not show anything useful.
Do you still get errors in Postgres? That may be the reason the connection is terminated.
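For checking that, something along these lines works; the log location varies by distribution, so the path here is an assumption:

```sh
# Look for connection-related errors in the PostgreSQL server log
grep -iE 'error|fatal|terminat' /var/log/postgresql/postgresql-*.log | tail -n 50
```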
Next thing to try: bypassing my DB load balancer and connecting directly to the master DB.
I am not sure if this is a Postgres error or a HAProxy one.
How is recognize communicating with Postgres?
There are two points of communication with the DB: during a classification run each file is tagged separately, but there is also a sort of batch job that adds the unrecognized tag to all files that have been processed but yielded no tags.
Hmmm, it might be possible that it runs into a database timeout, since my users have tons of pictures? On HAProxy the regular timeout is set to 30 minutes.
I spoke too soon, it appears I already refactored the latter function to tag files individually: https://github.com/marcelklehr/recognize/blob/master/lib/Service/TagManager.php#L62
Is this also part of the current version of your app from the Nextcloud store?
Ah, good point :D Let's see.
But why did I see so many "tag already existing" database entries?
If you still see those, that's definitely something we should investigate. Possibly that's a Nextcloud bug.
Yup, it's still appearing in the DB log:
Could this come from my testing during spring/summer when I tried to get my GPU working with recognize?
I doubt it.
Hmmm... not sure what is happening or why, then. In addition: after I changed to direct DB connections it keeps running and running, no timeout so far. So maybe it does some asynchronous stuff in the background?
So no timeout, but the "generic" issue again:
Maybe you're running out of memory?
Nope, my VM has 24 GB of dedicated RAM. I have also checked the PHP-FPM and Apache logs; nothing indicates that I am running out of memory.
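For reference, one way to verify that the kernel's OOM killer is not involved is to check the kernel log; this is generic Linux tooling, not specific to recognize:

```sh
# The OOM killer logs every process it kills to the kernel ring buffer
dmesg -T | grep -i 'out of memory\|killed process'
```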
Is there any way of resetting recognize? Even manually, like dropping tables in the DB and removing files from the server? Maybe it would make sense to have a clean starting point.
In the settings there's a button that resets the tags.
Alright, then I'll have to find the memory leak. Which of the three classifier scripts is the culprit?
How can I determine this?
If you use htop, for example, it should show the process command line, e.g.:
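The concrete example was lost from this thread; as a hedged illustration, the process can also be found with ps (the script path shown is an assumption):

```sh
# The [c] trick keeps grep from matching its own process entry
ps aux | grep '[c]lassifier'
# htop would show a command line along the lines of:
#   node /var/www/nextcloud/apps/recognize/src/classifier_imagenet.js ...
```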
Thanks! It's exactly that script which is running and consuming RAM and CPU:
In addition, I have set the log rotation size to 10 GB and turned debug logging on, so I should be able to determine which picture is the issue.
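The rotation size is likewise a standard Nextcloud setting (in bytes); a sketch of setting it to 10 GiB:

```sh
# Default log_rotate_size is 100 MB; 10737418240 bytes = 10 GiB
sudo -u www-data php occ config:system:set log_rotate_size --value=10737418240 --type=integer
```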
Unfortunately, there is no really useful information in the debug log.
There is likely no single picture causing this. It's a software bug that causes memory to accumulate without being freed.
Ahhh... okay.
It should only process pictures in users' files.
Nice find, thanks!
@marcelklehr Should I re-enable the app and test whether it is fixed?
@coalwater I'd appreciate that, thank you :)
@marcelklehr I will also start the manual process.
https://github.com/marcelklehr/recognize/releases/tag/v1.8.0 now comes with CLI debugging output again.
Yep, ideally we should resize images that are too large before loading them directly into TensorFlow.
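A minimal sketch of that idea with @tensorflow/tfjs-node; the MAX_SIDE cap and the function are hypothetical, not recognize's actual code, but tf.tidy() and dispose() are the standard tfjs tools for keeping a long-running process from accumulating tensors:

```js
// Sketch: downscale overly large images before inference so the
// decoded tensor fits comfortably in memory.
const fs = require('fs')
const tf = require('@tensorflow/tfjs-node')

const MAX_SIDE = 1024 // hypothetical cap on the longer image side

function loadResized(path) {
  const buffer = fs.readFileSync(path)
  // tf.tidy() disposes every intermediate tensor created inside it,
  // which is what prevents leaks in a long-running classifier loop.
  return tf.tidy(() => {
    const image = tf.node.decodeImage(buffer, 3)
    const [height, width] = image.shape
    const scale = MAX_SIDE / Math.max(height, width)
    if (scale >= 1) {
      return image.clone() // already small enough, keep as-is
    }
    return tf.image.resizeBilinear(image, [
      Math.round(height * scale),
      Math.round(width * scale),
    ])
  })
}

// The caller still has to free the returned tensor after inference:
//   const input = loadResized('/tmp/huge-photo.jpg')
//   ...run the model...
//   input.dispose()
```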
Any idea what to do or not to do? I added another 16 GB of RAM and set PHP_MEMORY_LIMIT to different values, and I still have the same issue :/
@Lipown Which issue is that? Out of memory errors?
Should be fixed with #365
Describe the bug
After some time recognize throws an error and stops working (see additional context).
To Reproduce
Steps to reproduce the behavior:
Recognize (please complete the following information):
Server (please complete the following information):
Additional context
Error message from bash:
Nextcloud log:
And directly after such an error:
PostgreSQL log (literally millions of entries, log = 1.4 GB):
In the PostgreSQL logs I cannot find anything that indicates the connection being closed.