Memory leak / Large images kill classifier process OOM #111

Closed
derritter88 opened this issue Nov 3, 2021 · 61 comments

@derritter88

Describe the bug
After some time, recognize throws an error and stops working (see Additional context).

To Reproduce
Steps to reproduce the behavior:

  1. Start the manual classification process.
  2. Wait.
  3. Check for errors.

Recognize (please complete the following information):

  • JS-only mode: No
  • Enabled modes: All

Server (please complete the following information):

  • Nextcloud: 22
  • OS: Ubuntu 20.04
  • RAM: 24 GB
  • Processor Architecture: x86_64
  • Database: PostgreSQL 13

Additional context

Error message from bash:

Classifying photos of user 4479F707-255F-465B-9468-9D92669CDB71
Failed to classify images
An exception occurred while executing a query: SQLSTATE[HY000]: General error: 7 server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

Nextcloud log:

[core] Error: Movie preview generation failed Output: ["ffmpeg version 4.2.4-1ubuntu0.1 Copyright (c) 2000-2020 the FFmpeg developers","  built with gcc 9 (Ubuntu 9.3.0-10ubuntu2)","  configuration: --prefix=/usr --extra-version=1ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264 --enable-shared","  libavutil      56. 31.100 / 56. 31.100","  libavcodec     58. 54.100 / 58. 54.100","  libavformat    58. 29.100 / 58. 29.100","  libavdevice    58.  8.100 / 58.  8.100","  libavfilter     7. 57.100 /  7. 57.100","  libavresample   4.  0.  0 /  4.  0.  0","  libswscale      5.  5.100 /  5.  5.100","  libswresample   3.  5.100 /  3.  5.100","  libpostproc    55.  5.100 / 55.  5.100","[mov,mp4,m4a,3gp,3g2,mj2 @ 0x563a3443a700] moov atom not found","/tmp/oc_tmp_c1YWtl: Invalid data found when processing input"]

GET /core/preview?fileId=13377535&x=256&y=256&a=true&v=616ef18336c79
from 192.168.10.2 by 3A60C52D-9415-4F28-A2B7-71A8CBD7A9E3 at 2021-11-02T22:34:12+01:00

And directly after such an error:

[index] Error: An exception occurred while executing a query: SQLSTATE[HY000]: General error: 7 server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

GET /core/preview?fileId=13377748&x=256&y=256&a=true&v=616f1e70cccac
from 192.168.10.2 by 3A60C52D-9415-4F28-A2B7-71A8CBD7A9E3 at 2021-11-02T22:34:26+01:00

PostgreSQL log (literally millions of entries; the log is 1.4 GB):

2021-11-02 15:34:44.619 UTC [3225304] nextcloud@nextcloud STATEMENT:  INSERT INTO "oc_systemtag_object_mapping" ("objectid", "objecttype", "systemtagid") VALUES($1, $2, $3)
2021-11-02 15:34:44.620 UTC [3225701] nextcloud@nextcloud ERROR:  duplicate key value violates unique constraint "oc_systemtag_object_mapping_pkey"
2021-11-02 15:34:44.620 UTC [3225701] nextcloud@nextcloud DETAIL:  Key (objecttype, objectid, systemtagid)=(files, 13400109, 316) already exists.

Within the PostgreSQL logs I cannot find anything that indicates the connection being closed.

@derritter88 (Author)

I am not sure, but from my point of view the tagging triggers something in the Preview Generator app that makes it crash.

And recognize writes everything multiple times to the database.

@derritter88 (Author)

I have disabled the preview generation for the category "Movie" - let's see how it does.

@derritter88 (Author)

So now it simply fails with:

root@wwwubuntu:/var/www/cloud# sudo -u www-data php ./occ recognize:classify -vvv
Failed to classify images

Within the Nextcloud log there are no entries which indicate why it fails.

@marcelklehr (Member)

marcelklehr commented Nov 3, 2021

This may be due to your CPU architecture. You can try running the classifier script directly to check:

$ apps/recognize/bin/node apps/recognize/src/classifier_imagenet.js path/to/a/photo.jpg

@derritter88 (Author)

I do not think this is a CPU architecture issue, as I pass the host CPU model through to my VM, and it also has the mentioned AVX flag:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              8
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7272 12-Core Processor
Stepping:                        0
CPU MHz:                         2894.363
BogoMIPS:                        5788.72
Virtualization:                  AMD-V
L1d cache:                       512 KiB
L1i cache:                       512 KiB
L2 cache:                        4 MiB
L3 cache:                        16 MiB
NUMA node0 CPU(s):               0-7
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip rdpid arch_capabilities

Also a test run worked fine:

sudo -u www-data apps/recognize/bin/node apps/recognize/src/classifier_imagenet.js /mnt/cloud/nextcloud/3A60C52D-9415-4F28-A2B7-71A8CBD7A9E3/files/IMG_20211102_222424.jpg
2021-11-03 10:54:07.618633: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-03 10:54:09.609194: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 36000000 exceeds 10% of free system memory.
2021-11-03 10:54:09.665774: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 144000000 exceeds 10% of free system memory.
2021-11-03 10:54:09.696105: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 144000000 exceeds 10% of free system memory.
2021-11-03 10:54:09.719646: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 144000000 exceeds 10% of free system memory.
2021-11-03 10:54:09.742251: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 144000000 exceeds 10% of free system memory.
{
  className: 'caldron',
  probability: 0.557533860206604,
  rule: { label: 'cooking', threshold: 0.1 }
}
{
  className: 'cup',
  probability: 0.08837056905031204,
  rule: { label: 'cup', threshold: 0.1 }
}
{
  className: 'coffee mug',
  probability: 0.043676361441612244,
  rule: { label: 'coffee', threshold: 0.2, categories: [ 'beverage' ] }
}
{
  className: 'tiger cat',
  probability: 0.01587754860520363,
  rule: {
    label: 'cat',
    priority: 5,
    threshold: 0.1,
    categories: [ 'animal' ]
  }
}
{
  className: 'egyptian cat',
  probability: 0.009242596104741096,
  rule: {
    label: 'cat',
    priority: 5,
    threshold: 0.1,
    categories: [ 'animal' ]
  }
}
{
  className: 'pitcher',
  probability: 0.008557061664760113,
  rule: { label: 'beverage', threshold: 0.1 }
}
{
  className: 'teapot',
  probability: 0.006404006388038397,
  rule: { threshold: 0.1, categories: [ 'beverage' ] }
}
["cooking"]

@marcelklehr (Member)

Nice! That's good. Then we can try to increase the log verbosity to debug, to see what's happening when the app does the same thing.

@derritter88 (Author)

You mean the Nextcloud log itself or just the app? I already run the command with -vvv.

@marcelklehr (Member)

The Nextcloud log, yeah; the command doesn't print any logs at the moment (which I'm aware isn't ideal).

@derritter88 (Author)

The problem with setting the general log level to debug is that I get countless "deprecated" messages and the log rotates a lot:

{"reqId":"YYJe_HfHlXE-HTCFlBOEDAAAVBE","level":0,"time":"2021-11-03T11:05:48+01:00","remoteAddr":"192.168.10.2","user":"3A60C52D-9415-4F28-A2B7-71A8CBD7A9E3","app":"video_converter","method":"GET","url":"/apps/logreader/poll?lastReqId=YYJe-3fHlXE-HTCFlBOECwAAVxE","message":"/appinfo/app.php is deprecated, use \\OCP\\AppFramework\\Bootstrap\\IBootstrap on the application class instead.","userAgent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0","version":"22.2.0.2"}

@derritter88 (Author)

So I left it running at debug level and it crashed again, again with an SQL exception.

Classifying photos of user 4479F707-255F-465B-9468-9D92669CDB71
Failed to classify images
An exception occurred while executing a query: SQLSTATE[HY000]: General error: 7 server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

The debug log of Nextcloud does not show anything useful.

@marcelklehr (Member)

marcelklehr commented Nov 3, 2021

Do you still get errors in Postgres? That may be the reason the connection is terminated.

@derritter88 (Author)

Next step to try: bypass my DB load balancer and connect directly to the master DB.

@derritter88 (Author)

Do you still get errors in Postgres? That may be the reason the connection is terminated.

I am not sure whether this is a Postgres error or a HAProxy one.

@derritter88 (Author)

How does recognize communicate with Postgres?
Is it doing some kind of batch job, i.e. recognizing everything before writing to the DB, or does it write synchronously as it goes?

@marcelklehr (Member)

There are two points of communication with the DB. During a classification run, each file is tagged separately, but there is a sort of batch job that adds the "unrecognized" tag to all files that have been processed but yielded no tags.

@derritter88 (Author)

Hmm, it might be possible that, since my users have tons of pictures, it runs into a database timeout?
I will keep an eye on the direct Postgres communication to verify whether or not there is another issue.

On HAProxy the regular timeout is set to 30 minutes.

@marcelklehr (Member)

I spoke too soon, it appears I already refactored the latter function to tag files individually: https://github.com/marcelklehr/recognize/blob/master/lib/Service/TagManager.php#L62

@derritter88 (Author)

Is this also part of the current version of your app from the Nextcloud store?

@marcelklehr (Member)

Ah. Good point :D let's see

@marcelklehr (Member)

Yep: https://github.com/marcelklehr/recognize/blob/v1.6.10/lib/Service/TagManager.php#L62

@derritter88 (Author)

But why did I see so many "tag already exists" database entries?

@marcelklehr (Member)

If you still see those, that's definitely something we should investigate. Possibly that's a nextcloud bug.

@derritter88 (Author)

Yup, it's still appearing in the DB log:

2021-11-03 11:38:59.521 UTC [3509011] nextcloud@nextcloud STATEMENT:  INSERT INTO "oc_systemtag_object_mapping" ("objectid", "objecttype", "systemtagid") VALUES($1, $2, $3)
2021-11-03 11:38:59.522 UTC [3509011] nextcloud@nextcloud ERROR:  duplicate key value violates unique constraint "oc_systemtag_object_mapping_pkey"
2021-11-03 11:38:59.522 UTC [3509011] nextcloud@nextcloud DETAIL:  Key (objecttype, objectid, systemtagid)=(files, 9710849, 1) already exists.

Could this come from my testing during spring/summer when I tried to have my GPU working with recognize?

@marcelklehr (Member)

Could this come from my testing during spring/summer when I tried to have my GPU working with recognize?

I doubt it.

@derritter88 (Author)

Hmm... not sure what is happening here or why, then.

Additionally: after I changed to direct DB connections it keeps running and running - no timeout so far.

So maybe it does some asynchronous stuff in the background?

@derritter88 (Author)

So no timeout but the "generic" issue again:

Classifying photos of user 4479F707-255F-465B-9468-9D92669CDB71
Failed to classify images
Classifier process error

@marcelklehr (Member)

Maybe you're running out of memory?

@derritter88 (Author)

Nope - my VM has 24 GB of dedicated RAM.
Yesterday, when recognize was able to run, my VM consumed around 14 to 15 GB of RAM, so it should have enough headroom.

I have also checked the PHP-FPM and Apache logs - nothing indicates that I am running out of memory.

@derritter88 (Author)

Is there any way of resetting recognize?

Even manually, like dropping tables in the DB and removing files from the server?

Maybe it would make sense to have a clean start.

@marcelklehr (Member)

In the settings there's a button that resets the tags

@marcelklehr (Member)

From my point of view it looks like the script consumes more and more RAM until the OS says "no more".

Alright, then I'll have to find the memory leak. Which of the three classifier scripts is the culprit?

@derritter88 (Author)

Which of the three classifier scripts is the culprit?

How can I determine this?

@marcelklehr (Member)

If you use htop, for example, it should show the process command line, e.g. recognize/bin/node recognize/src/classifier_imagenet.js -

@derritter88 (Author)

Thanks! It's exactly that script which is running and consuming RAM & CPU:
/var/www/cloud/recognize/bin//node /var/www/cloud/apps/recognize/src/classifier_imagenet.js -

@derritter88 (Author)

Additionally, I have set the log rotation size to 10 GB and turned debug logging on, so I should be able to determine which picture is the issue.

@derritter88 (Author)

Unfortunately there is no really useful information in the debug log.
What would an error message look like in the log?

@marcelklehr (Member)

So I should be able to determine which picture is the issue.

There is likely no single picture causing this. It's a software bug that causes stale memory to accumulate without being freed.
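
For anyone digging into this, here is a minimal sketch of the kind of fix such a leak typically needs in @tensorflow/tfjs-node code, assuming the classifier decodes and classifies images in a loop; the names classifyOne and model are placeholders for illustration, not recognize's actual code:

const fs = require('fs');
const tf = require('@tensorflow/tfjs-node');

// Hypothetical per-image classification step. Every tensor created inside
// tf.tidy() is released when the callback returns, so per-image tensors
// cannot pile up across the loop; only the plain array copied out with
// dataSync() survives.
function classifyOne(model, filePath) {
  const buffer = fs.readFileSync(filePath);
  return tf.tidy(() => {
    const image = tf.node.decodeImage(buffer, 3).toFloat(); // [h, w, 3] float32
    const input = tf.image
      .resizeBilinear(image, [224, 224])                    // assumed model input size
      .div(255)                                             // scale to [0, 1]
      .expandDims(0);                                       // add batch dimension
    const scores = model.predict(input);
    return scores.dataSync();                               // Float32Array, safe to return
  });
}

Any tensor created outside tf.tidy() (the model itself, cached buffers, and so on) has to be released explicitly with dispose(), otherwise native memory grows with every image until the OS kills the process.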

@derritter88 (Author)

Ahhh... okay.
Just a question: does recognize only process users' pictures, or also the resized preview images generated by the preview app?

@marcelklehr (Member)

It should only process pictures in users' files

marcelklehr changed the title from "Database errors" to "Memory leak" on Nov 5, 2021
@coalwater

I'm also suffering from the same issue.
I'm using an Ubuntu VM inside Proxmox, with 6 GB total RAM and 8 cores.

The system runs out of RAM to the point that it starves the rest of the system; the Prometheus exporters die and don't respond until the memory is released, and you can see the resulting gaps in the graph below.
[Screenshot from 2021-11-11 22-08-44]

I'll also disable the app until the issue is resolved.

One interesting note: the number of remaining images to be processed has been the same since the issue started to happen, so it seems there is a certain file that the app can't process; it could be a big file or something like that.

@marcelklehr (Member)

One interesting note: the number of remaining images to be processed has been the same since the issue started to happen, so it seems there is a certain file that the app can't process; it could be a big file or something like that.

Nice find, thanks!

marcelklehr added a commit that referenced this issue Dec 13, 2021
@coalwater

@marcelklehr Should I re-enable the app and test whether it is fixed?
As far as I can tell this is more of a workaround than a fix, right?

@marcelklehr (Member)

Should I re-enable the app and test whether it is fixed?

@coalwater I'd appreciate that, thank you :)

@derritter88 (Author)

@marcelklehr I will also start the manual process.

@coalwater

coalwater commented Dec 15, 2021

So the process no longer makes the system run out of memory to the point where it's non-responsive, but it still doesn't handle the files that it couldn't handle before.
Which makes sense to me given the description of the PR (setting a max file limit; a rough sketch of such a guard follows this comment).

I will post a screenshot of my Grafana soon.
[screenshot]

A side question: how do I increase verbosity on the command line? I would like to know which of my files is blocking this process.

Update: Added the screenshot
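
For reference, a rough sketch of what the max-file-limit workaround mentioned above could look like on the Node side; the function name, the 10 MB default, and the idea of checking the byte size before classification are illustrative assumptions, not necessarily how recognize implements it:

const fs = require('fs');

// Skip any image whose on-disk size exceeds a configurable limit instead of
// handing it to TensorFlow. The 10 MB default here is arbitrary.
function eligibleForClassification(filePath, maxBytes = 10 * 1024 * 1024) {
  return fs.statSync(filePath).size <= maxBytes;
}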


@marcelklehr (Member)

How do I increase verbosity on the command line? I would like to know which of my files is blocking this process.

https://github.com/marcelklehr/recognize/releases/tag/v1.8.0 now comes with CLI debugging output again.


@marcelklehr (Member)

it still doesn't handle the files that it couldn't handle before

Yep, ideally we should resize images that are too large to be loaded directly into TensorFlow.
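
One possible approach, sketched under the assumption that a native image library such as sharp would be an acceptable dependency (recognize may well end up solving this differently): downscale anything beyond a pixel budget before the full-resolution tensor is ever created, since it is the decoded float32 tensor of the original image that blows up memory. The maxSide value is illustrative.

const sharp = require('sharp'); // assumed dependency, for illustration only

// Downscale to at most maxSide pixels on the longest edge before handing the
// bytes to the classifier; smaller images pass through unchanged.
async function loadDownscaled(filePath, maxSide = 4096) {
  return sharp(filePath)
    .resize(maxSide, maxSide, { fit: 'inside', withoutEnlargement: true })
    .toBuffer();
}

Classification quality shouldn't suffer, since ImageNet-style models only look at a few hundred pixels per side anyway.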

marcelklehr changed the title from "Memory leak" to "Memory leak / Large images kill classifier process OOM" on Dec 31, 2021
marcelklehr added this to "In progress" in Recognize on Jan 7, 2022
@Lipown

Lipown commented Jun 10, 2022

Any idea what to do or not to do? I bought an additional 16 GB of RAM, set PHP_MEMORY_LIMIT to different values, and still have the same issue :/

@marcelklehr (Member)

@Lipown Which issue is that? Out of memory errors?

@marcelklehr (Member)

Should be fixed with #365

Recognize automation moved this from In progress to Done Oct 21, 2022