Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Face recognition blocked for 1 month: Large batch size causes job to be stopped during preview generation #967

Closed
EVOTk opened this issue Aug 30, 2023 · 68 comments
Labels
Projects

Comments

@EVOTk
Copy link

EVOTk commented Aug 30, 2023

Which version of recognize are you using?

4.3.2

Enabled Modes

Face recognition

TensorFlow mode

WASM mode

Downstream App

Memories App

Which Nextcloud version do you have installed?

27.0.2 27.1.0

Which Operating system do you have installed?

Debian 12

Which database are you running Nextcloud on?

MariaDB 10.11.4 10.11.5

Which Docker container are you using to run Nextcloud? (if applicable)

Linuxserver 27.0.2 27.1.0
PHP 8.2.10

How much RAM does your server have?

32 GB

What processor Architecture does your CPU have?

x86_64

Describe the Bug

For the past 1 month, I've had no new face detection on the photos I've added.

the process appears to be "blocked"

image
"Last classification: 06/08/2023"

occ recognize:recrawl or occ recognize:classify not resolve issue

I've also tried "clear the classifier queues and clear all background jobs" and relaunching a complete classification, but it always comes back to the same thing.

I didn't run occ recognize:reset-face-clusters and occ recognize:reset-faces because I don't want to lose the current classification.

Thx for your help

Expected Behavior

Face detection on new photos

To Reproduce

  1. Upload new photo on Nextcloud
  2. Faces are never detected

Debug log

No response

@EVOTk EVOTk added the bug Something isn't working label Aug 30, 2023
@github-actions github-actions bot added this to Backlog in Recognize Aug 30, 2023
@EVOTk
Copy link
Author

EVOTk commented Aug 31, 2023

image

I've added a few photos. But the process remains at the same stage, even though the jobs in the background are running.

@marcelklehr
Copy link
Member

Do you have more than 120 new faces? You can check your database with the following query SELECT cluster_id, COUNT(id) from oc_recognize_face_detections GROUP_BY cluster_id;

@EVOTk
Copy link
Author

EVOTk commented Sep 2, 2023

Thank you for your reply.
I use adminer to manage my database.

The command returns an error, should I execute it differently?

image

Reading other issues, I have seen other commands that can help diagnosis, but I do not know what they correspond to:

image

image

image

Thx

@EVOTk
Copy link
Author

EVOTk commented Sep 5, 2023

Hello,
I'm add ~100 photos. Same issue.

image


Values for of commands have not changed

SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id IS NULL

COUNT(*)
4122

SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id = -1

COUNT(*)
6196

SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id > 0

COUNT(*)
19202

@craiq
Copy link

craiq commented Sep 7, 2023

Hello
I've the same problem with different architecture
Ubuntu lxc container on proxmox

ok 200 OK 27.0.2.1 yes yes \OC\Memcache\APCu \OC\Memcache\Redis yes \OC\Memcache\Redis no 3634111709184 0.36474609375 0.4296875 0.42724609375 6291456 3680256 6291456 6027264 57 0 6 2356869 23 1 8 14 21 3 0 18 0 0 17 0 0 1 7 2 2 2 7 nginx/1.25.2 8.2.10 1073741824 3600 10737418240 1 136150120 132285336 0 0 67108864 22731392 44377472 92898 2470 4724 130987 79025987 1693939639 0 0 0 0 2519 0 0 99.996812542553 1 1 5 5 6 268435440 267399360 4099 0 12801907 7802 11663 1009 0 1693939639 478816 mmap 1 33554312 33006456 Core date libxml openssl pcre zlib filter hash json random Reflection SPL session standard sodium cgi-fcgi mysqlnd PDO xml apcu bcmath bz2 calendar ctype curl dom mbstring FFI fileinfo ftp gd gettext gmp iconv igbinary imagick intl exif mysqli pdo_mysql Phar posix readline redis shmop SimpleXML sockets sysvmsg sysvsem sysvshm tokenizer xmlreader xmlwriter xsl zip Zend OPcache mysql 10.6.14 1499897856 2 2 4
This XML file does not appear to have any style information associated with it. The document tree is shown below.
<ocs>
<meta>
<status>ok</status>
<statuscode>200</statuscode>
<message>OK</message>
...
</meta>
<data>
<nextcloud>
<system>
<version>27.0.2.1</version>
<theme/>
<enable_avatars>yes</enable_avatars>
<enable_previews>yes</enable_previews>
<memcache.local>\OC\Memcache\APCu</memcache.local>
<memcache.distributed>\OC\Memcache\Redis</memcache.distributed>
<filelocking.enabled>yes</filelocking.enabled>
<memcache.locking>\OC\Memcache\Redis</memcache.locking>
<debug>no</debug>
<freespace>3634111709184</freespace>
<cpuload>
<element>0.36474609375</element>
<element>0.4296875</element>
<element>0.42724609375</element>
...
</cpuload>
<mem_total>6291456</mem_total>
<mem_free>3680256</mem_free>
<swap_total>6291456</swap_total>
<swap_free>6027264</swap_free>
<apps>
<num_installed>57</num_installed>
<num_updates_available>0</num_updates_available>
<app_updates/>
...
</apps>
...
</system>
<storage>
<num_users>6</num_users>
<num_files>2356869</num_files>
<num_storages>23</num_storages>
<num_storages_local>1</num_storages_local>
<num_storages_home>8</num_storages_home>
<num_storages_other>14</num_storages_other>
...
</storage>
<shares>
<num_shares>21</num_shares>
<num_shares_user>3</num_shares_user>
<num_shares_groups>0</num_shares_groups>
<num_shares_link>18</num_shares_link>
<num_shares_mail>0</num_shares_mail>
<num_shares_room>0</num_shares_room>
<num_shares_link_no_password>17</num_shares_link_no_password>
<num_fed_shares_sent>0</num_fed_shares_sent>
<num_fed_shares_received>0</num_fed_shares_received>
<permissions_0_15>1</permissions_0_15>
<permissions_3_17>7</permissions_3_17>
<permissions_3_19>2</permissions_3_19>
<permissions_3_21>2</permissions_3_21>
<permissions_0_31>2</permissions_0_31>
<permissions_3_31>7</permissions_3_31>
...
</shares>
...
</nextcloud>
<server>
<webserver>nginx/1.25.2</webserver>
<php>
<version>8.2.10</version>
<memory_limit>1073741824</memory_limit>
<max_execution_time>3600</max_execution_time>
<upload_max_filesize>10737418240</upload_max_filesize>
<opcache>
<opcache_enabled>1</opcache_enabled>
<cache_full/>
<restart_pending/>
<restart_in_progress/>
<memory_usage>
<used_memory>136150120</used_memory>
<free_memory>132285336</free_memory>
<wasted_memory>0</wasted_memory>
<current_wasted_percentage>0</current_wasted_percentage>
...
</memory_usage>
<interned_strings_usage>
<buffer_size>67108864</buffer_size>
<used_memory>22731392</used_memory>
<free_memory>44377472</free_memory>
<number_of_strings>92898</number_of_strings>
...
</interned_strings_usage>
<opcache_statistics>
<num_cached_scripts>2470</num_cached_scripts>
<num_cached_keys>4724</num_cached_keys>
<max_cached_keys>130987</max_cached_keys>
<hits>79025987</hits>
<start_time>1693939639</start_time>
<last_restart_time>0</last_restart_time>
<oom_restarts>0</oom_restarts>
<hash_restarts>0</hash_restarts>
<manual_restarts>0</manual_restarts>
<misses>2519</misses>
<blacklist_misses>0</blacklist_misses>
<blacklist_miss_ratio>0</blacklist_miss_ratio>
<opcache_hit_rate>99.996812542553</opcache_hit_rate>
...
</opcache_statistics>
<jit>
<enabled>1</enabled>
<on>1</on>
<kind>5</kind>
<opt_level>5</opt_level>
<opt_flags>6</opt_flags>
<buffer_size>268435440</buffer_size>
<buffer_free>267399360</buffer_free>
...
</jit>
...
</opcache>
<apcu>
<cache>
<num_slots>4099</num_slots>
<ttl>0</ttl>
<num_hits>12801907</num_hits>
<num_misses>7802</num_misses>
<num_inserts>11663</num_inserts>
<num_entries>1009</num_entries>
<expunges>0</expunges>
<start_time>1693939639</start_time>
<mem_size>478816</mem_size>
<memory_type>mmap</memory_type>
...
</cache>
<sma>
<num_seg>1</num_seg>
<seg_size>33554312</seg_size>
<avail_mem>33006456</avail_mem>
...
</sma>
...
</apcu>
<extensions>
<element>Core</element>
<element>date</element>
<element>libxml</element>
<element>openssl</element>
<element>pcre</element>
<element>zlib</element>
<element>filter</element>
<element>hash</element>
<element>json</element>
<element>random</element>
<element>Reflection</element>
<element>SPL</element>
<element>session</element>
<element>standard</element>
<element>sodium</element>
<element>cgi-fcgi</element>
<element>mysqlnd</element>
<element>PDO</element>
<element>xml</element>
<element>apcu</element>
<element>bcmath</element>
<element>bz2</element>
<element>calendar</element>
<element>ctype</element>
<element>curl</element>
<element>dom</element>
<element>mbstring</element>
<element>FFI</element>
<element>fileinfo</element>
<element>ftp</element>
<element>gd</element>
<element>gettext</element>
<element>gmp</element>
<element>iconv</element>
<element>igbinary</element>
<element>imagick</element>
<element>intl</element>
<element>exif</element>
<element>mysqli</element>
<element>pdo_mysql</element>
<element>Phar</element>
<element>posix</element>
<element>readline</element>
<element>redis</element>
<element>shmop</element>
<element>SimpleXML</element>
<element>sockets</element>
<element>sysvmsg</element>
<element>sysvsem</element>
<element>sysvshm</element>
<element>tokenizer</element>
<element>xmlreader</element>
<element>xmlwriter</element>
<element>xsl</element>
<element>zip</element>
<element>Zend OPcache</element>
...
</extensions>
...
</php>
<database>
<type>mysql</type>
<version>10.6.14</version>
<size>1499897856</size>
...
</database>
...
</server>
<activeUsers>
<last5minutes>2</last5minutes>
<last1hour>2</last1hour>
<last24hours>4</last24hours>
...
</activeUsers>
...
</data>
...
</ocs>

I did the complete face reset and there be sure that new faces are recognized

Screenshot_20230907_144703_Chrome

The newly detected faces are not shown in the memories app only in Fotos.

@illnesse

This comment was marked as off-topic.

@EVOTk

This comment was marked as off-topic.

@illnesse

This comment was marked as off-topic.

@EVOTk
Copy link
Author

EVOTk commented Sep 16, 2023

image

SQL command values are always identical to the last time. #967 (comment)

@marcelklehr
Copy link
Member

So, the issue is that face recognition is stuck. Can you check if there are jobs in the oc_jobs table for recognize and maybe post the list? (something like SELECT * from oc_jobs where app = 'recognize';)

@EVOTk
Copy link
Author

EVOTk commented Sep 17, 2023

So, the issue is that face recognition is stuck. Can you check if there are jobs in the oc_jobs table for recognize and maybe post the list? (something like SELECT * from oc_jobs where app = 'recognize';)

Thx for your reply

image
image
image

@marcelklehr
Copy link
Member

Ah, sorry about the wrong query. It looks like the job is intact and should be running, last_run indicates 'Sun Sep 17 04:58:21 2023 UTC'. Can you find any errors in the nextcloud log related to recognize (try something like cat data/nextcloud.log | grep recognize)?

@marcelklehr marcelklehr moved this from Backlog to Triaging in Recognize Sep 17, 2023
@EVOTk
Copy link
Author

EVOTk commented Sep 17, 2023

Here are the attached logs, thank you

log_recognize.log


I see errors related to "Welcome App" . An update of Welcome is available since 1h ( 1.0.10 ) , I have just made the update.
https://apps.nextcloud.com/apps/welcome
https://github.com/nextcloud/welcome/blob/main/CHANGELOG.md
image
Are you thinking of a conflict with Welcome App?

@marcelklehr
Copy link
Member

Are you thinking of a conflict with Welcome App?

Nah, I think that's unrelated. It's also just warnings. We can ignore those.

I think the reason for this is that we limit the run time of the job to 5 minutes, but by the time the 5minutes are over it still hasn't finished preparing the previews for classification. :/

marcelklehr added a commit that referenced this issue Sep 17, 2023
…Time to 0

see #967

Signed-off-by: Marcel Klehr <mklehr@gmx.net>
@EVOTk
Copy link
Author

EVOTk commented Sep 17, 2023

okay, do you think I should decrease the value of "The number of files to process per job run"? I'm currently at 50, because I'm in WASM mode.

@marcelklehr
Copy link
Member

If anything, try reducing it, but I will publish a new release soon with a proper fix that will hopefully solve this for you

@EVOTk
Copy link
Author

EVOTk commented Sep 17, 2023

okay !, I've just reduced it to "5" to see if there's any change there.

Thanks, I'll be waiting with impatience for this new version :)

@marcelklehr
Copy link
Member

Do let me know if the reduced value changes things, to confirm that I've indentified the bug 🙏

@EVOTk
Copy link
Author

EVOTk commented Sep 17, 2023

Yes yes, of course, I'll come back here to tell you if it works or not.

@craig0990
Copy link

craig0990 commented Sep 20, 2023

Chiming in to say that I've had the same issue, and this seems to have fixed it 🎉 (I think?)

Lowered (roughly halving the defaults) the number of files to process and it appears to be working well over the past 24h with "Schedule jobs" now having values some of the time and "Last background job execution" showing values for some 🙃

The oc_jobs table had several entries, but the last_run was 0 for all but the top one, and last_checked for the same were "old" timestamps (was just poking around, didn't keep screenshots/logs, might be mis-remembering the exact column).

Whereas now select * from oc_jobs where class like '%Recog%'; seems to show turnover with the jobs clearing themselves and setting up new ones?

I ended up with a large backlog (don't have original numbers, but "Object recognition" has 22,740 queued files right now) because I disabled the app for a few months while I worked out better hardware.

Cheers 🙏🏻

(Edit: Had to drop object/landmark to ~25 since that seems to re-occur once the face recognition queue clears; and this is in native Tensorflow mode, not WASM mode like the parent)

@marcelklehr
Copy link
Member

Great! Thanks for letting me know. I'll soon release v5 of recognize which will fix this permanently, so you don't have to reduce the batch size anymore

@marcelklehr marcelklehr changed the title Face recognition blocked for 1 month Face recognition blocked for 1 month: Large batch size causes job to be stopped during preview generation Sep 22, 2023
@EVOTk
Copy link
Author

EVOTk commented Sep 23, 2023

image

Reducing the number of "files to process per job run" solved my problem. New faces are identified! Thanks @marcelklehr

@EVOTk
Copy link
Author

EVOTk commented Sep 30, 2023

Hello @marcelklehr

After lowering the value, and doing a full indexation. New faces have appeared.

however, it's blocked again :( I lowered the value again, but it doesn't change anything.

image

Recognize automation moved this from Triaging to Done Oct 24, 2023
@EVOTk
Copy link
Author

EVOTk commented Oct 24, 2023

For my side, I've restarted indexing, but the system is still blocked.

@marcelklehr
Copy link
Member

@EVOTk Which version are you on now?

@marcelklehr marcelklehr reopened this Oct 25, 2023
@EVOTk
Copy link
Author

EVOTk commented Oct 25, 2023

recognize.txt
Hello,
NC 27.1.2
Recognize 5.0.3

I cleaned up the jobs several days ago, and restarted a complete indexing ( over 40k files ). It takes a long time.
Now it seems to be blocked again, unfortunately.

image

--
SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id IS NULL

COUNT(*)
4134

--
SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id = -1

COUNT(*)
6009

--
SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id > 0

COUNT(*)
19389

@marcelklehr
Copy link
Member

Can you up the number of files to process per run?

@EVOTk
Copy link
Author

EVOTk commented Oct 25, 2023

Would you like set to 50 as originally proposed?

@marcelklehr
Copy link
Member

We can do 500 as recommended in the settings, now

@EVOTk
Copy link
Author

EVOTk commented Oct 25, 2023

Okay, i'm use WASM mode

@marcelklehr
Copy link
Member

Ah, right. then 50

@EVOTk
Copy link
Author

EVOTk commented Oct 28, 2023

image

I'm still stuck :( What can I do to help me?
Thx

@marcelklehr
Copy link
Member

@EVOTk Can you post the nextcloud log again, now that you're on v5.x

@marcelklehr
Copy link
Member

Also, could you check if some of your images have 0 bytes?

@EVOTk
Copy link
Author

EVOTk commented Nov 5, 2023

Hello !
Mmmmm, my nextcloud.log is spamming by :
{"reqId":"8RS4L0YWzWPiDxisNxjw","level":0,"time":"2023-11-05T20:20:00+00:00","remoteAddr":"","user":"--","app":"cron","method":"","url":"--","message":"Skipping OCA\\Recognize\\BackgroundJobs\\ClassifyFacesJob job with ID 134868 because another job with the same class is already running","userAgent":"--","version":"27.1.3.2","data":{"app":"cron"}}

There are a lot of photo files, but I haven't seen any empty files
nextcloud-grep-recognize.log
.

@marcelklehr
Copy link
Member

Interesting! Could you post the oc_jobs table again?

@marcelklehr
Copy link
Member

And then, if you're comfortable with that, could you try applying the following patch to your nextcloud? nextcloud/server#41295

@EVOTk
Copy link
Author

EVOTk commented Nov 6, 2023

Hello,

Yesterday, I did a "clean queue and background job".
image

I applied the change manually in the file, and restarted.

Here is the oc_jobs currently :
image

@marcelklehr
Copy link
Member

Alright, then let's wait if it trips up again with the change applied. Thank you for bearing with me :)

@EVOTk
Copy link
Author

EVOTk commented Nov 8, 2023

Alright, then let's wait if it trips up again with the change applied. Thank you for bearing with me :)

Sounds good. Indexing isn't finished yet, but new detections are visible in "People", even though there hasn't been anything new since August 31st.

image

@EVOTk
Copy link
Author

EVOTk commented Nov 11, 2023

Hello @marcelklehr

Sounds good. I have new detection present in Memories.

I'll now see when I've added photos, if the recognition works properly.

Thx !

image

@marcelklehr
Copy link
Member

How is it looking? @EVOTk

@EVOTk
Copy link
Author

EVOTk commented Nov 19, 2023

How is it looking? @EVOTk

Hello !
I had detection news until November 12. Currently I have no new detections, but I haven't added a many new photos, so maybe that's the cause.

image

@marcelklehr
Copy link
Member

marcelklehr commented Nov 19, 2023

Did you update your nextcloud in the meantime? Maybe my patch was reverted on your instance because of that.

19 Queued files, 1 scheduled job and last classification 12/11/2023 with last execution 5min ago sounds like it's stuck again... Could you post some logs again? 🙏

@EVOTk
Copy link
Author

EVOTk commented Nov 19, 2023

Hello,
no, i'm use NC v27.1.3

However, the container doesn't keep the patch, so if I re-create the container, I have to re-apply the patch manually.

image


SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id IS NULL

COUNT(*)
4192

SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id = -1

COUNT(*)
6166

SELECT COUNT(*) from oc_recognize_face_detections WHERE cluster_id > 0

COUNT(*)
19908

recognize.log

@phirestalker
Copy link

mine is also stalled now it seems. However, I actually found an error in the log.

{"reqId":"GhMbQVzl6TfSxTuxXOOY","level":0,"time":"2023-11-24T14:15:15+00:00","remoteAddr":"","user":"--","app":"recognize","method":"","url":"--","message":"Stalling job OCA\\Recognize\\BackgroundJobs\\ClassifyImagenetJob with argument array (\n  'storageId' => 2,\n  'rootId' => 5,\n) because other classifiers are already reserved","userAgent":"--","version":"27.1.3.2","data":{"app":"recognize"}}

Is this the same problem or a different one?
Also, for mine it had jobs keep going on with no change in the queue. The last run changed many times, but still the same amount of pictures in the queue.

@marcelklehr
Copy link
Member

Is this the same problem or a different one?

Yep, this is the same problem.

Also, for mine it had jobs keep going on with no change in the queue. The last run changed many times, but still the same amount of pictures in the queue.

And this is the symptom.

@marcelklehr
Copy link
Member

@EVOTk Ah, the concurrent execution limits us to one job from all classifiers, so you can either have the imagenet classifier running or the face recognition classifier. If you want to run as many as possible at the same time, you can check Enable unlimited concurrency of classifier processes in the settings.

@marcelklehr
Copy link
Member

marcelklehr commented Nov 28, 2023

Nextcloud 27.1.4 should fix the last bug in the concurrency detection, I think. So one recognize classifier job should be running at all times (as much as possible via cron, at least), but it may not be the face recognition one, so some stalling is normal, unless Enable unlimited concurrency of classifier processes is checked, or unless no classifiers are progressing. I'm closing this for now.

Do let me know here if you have a case where

  • Enable unlimited concurrency of classifier processes is not checked and
  • all classifiers are not progressing

or

  • Enable unlimited concurrency of classifier processes is checked and
  • one or all classifiers are not progressing

@EVOTk
Copy link
Author

EVOTk commented Nov 28, 2023

@EVOTk Ah, the concurrent execution limits us to one job from all classifiers, so you can either have the imagenet classifier running or the face recognition classifier. If you want to run as many as possible at the same time, you can check Enable unlimited concurrency of classifier processes in the settings.

I will activate this. Thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

No branches or pull requests

8 participants