Background processing stops very often #412

Closed
Demian98 opened this issue Oct 21, 2022 · 31 comments
Labels
bug (Something isn't working), v3.x

Comments

@Demian98

Describe the issue
I installed Recognize with NC 25 on Oct 19, and it has been working on my images since then. The CPU usage of my Nextcloud VM shows that the background process only runs for a few minutes and then stops:
[Screenshot: CPU usage of the Nextcloud VM]

In the settings of Recognize it displays the same:
[Screenshot: Recognize settings]

It always just processes a small amount of images, and then it stops for a while.
My cron jobs run every 15 minutes. It looks like image processing starts with every cron run and then continues for a few minutes.

Currently, there are 11k images in the queue, and the number is still increasing. At this rate, the process would take multiple weeks to finish.
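
For a rough sense of scale, the time to drain the queue can be estimated from the queue length, the cron interval, and the number of images handled per cron run. The sketch below is only back-of-the-envelope arithmetic: the 11,000 images and the 15-minute interval come from the report above, while the batch size of 20 is a made-up figure for illustration, not a value taken from Recognize. It also ignores that the queue is still growing.

```python
import math


def estimate_drain_hours(queued: int, batch_size: int, interval_minutes: float) -> float:
    """Hours until the queue is empty if each cron run handles exactly one batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    runs = math.ceil(queued / batch_size)  # number of cron runs needed
    return runs * interval_minutes / 60


# 11,000 queued images and a 15-minute interval are from the report;
# batch_size=20 is a hypothetical value chosen for illustration.
hours = estimate_drain_hours(11_000, 20, 15)
print(f"{hours:.1f} hours (~{hours / 24:.1f} days)")  # 137.5 hours (~5.7 days)
```

Under these assumptions, a smaller batch per run or a longer pause between runs stretches the total proportionally.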

Expected behavior
I would expect the image recognition to run the whole time in the background, without stopping that often.

Recognize (please complete the following information):

  • JS-only mode: Yes (WASM is not enabled)
  • Enabled modes: Only face recognition (for now)

Server (please complete the following information):

  • Nextcloud: 25.0.0
  • OS: Ubuntu
  • RAM: 4GB
  • Processor Architecture: x64 (3.40GHz (8 cores))
@dhzl84

dhzl84 commented Oct 21, 2022

Same here, and if I issue occ recognize:recrawl I get this in return:

DivisionByZeroError: Division by zero in /var/www/nextcloud/custom_apps/recognize/lib/Service/FaceClusterAnalyzer.php:190
Stack trace:
#0 [internal function]: OCA\Recognize\Service\FaceClusterAnalyzer::OCA\Recognize\Service\{closure}()
#1 /var/www/nextcloud/custom_apps/recognize/lib/Service/FaceClusterAnalyzer.php(189): array_map()
#2 /var/www/nextcloud/custom_apps/recognize/lib/Service/FaceClusterAnalyzer.php(90): OCA\Recognize\Service\FaceClusterAnalyzer::calculateCentroidOfDetections()
#3 /var/www/nextcloud/custom_apps/recognize/lib/BackgroundJobs/ClusterFacesJob.php(28): OCA\Recognize\Service\FaceClusterAnalyzer->calculateClusters()
#4 /var/www/nextcloud/lib/public/BackgroundJob/Job.php(78): OCA\Recognize\BackgroundJobs\ClusterFacesJob->run()
#5 /var/www/nextcloud/lib/public/BackgroundJob/QueuedJob.php(58): OCP\BackgroundJob\Job->start()
#6 /var/www/nextcloud/lib/public/BackgroundJob/QueuedJob.php(48): OCP\BackgroundJob\QueuedJob->start()
#7 /var/www/nextcloud/cron.php(152): OCP\BackgroundJob\QueuedJob->execute()
#8 {main}

So this is probably related to, and fixed by, #398?
Once the initial recognition is done, cron-based processing will probably keep up with the amount of new media added.

  • Nextcloud: 25.0.0
  • Recognize 3.1.0
  • OS: Debian 10 LXC in Proxmox
  • RAM: 8 GB
  • Processor Architecture: x64 (AMD Ryzen 5 PRO 4650G)
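
The trace above ends in a closure inside `calculateCentroidOfDetections()`. The thread does not show line 190 itself, so the following is only a guess at the general failure mode: a centroid is a mean, and a mean over zero items divides by zero. A minimal Python sketch of a guarded centroid, assuming the divisor is the number of detections (the function name and return convention here are hypothetical, not the Recognize code):

```python
from typing import Optional, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> Optional[list[float]]:
    """Component-wise mean of equal-length vectors.

    Returns None for an empty input instead of dividing by zero.
    """
    if not vectors:
        return None
    dims = len(vectors[0])
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(dims)]


print(centroid([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
print(centroid([]))                        # None
```

Whether the actual fix takes this form is not stated in the thread.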

@illnesse

Same. It would be nice if I could actually use the 16 cores dedicated to this container to speed things up.

@marcelklehr added the bug and v3.x labels Oct 21, 2022
@euro2

euro2 commented Oct 22, 2022

Same. 6 cores set, but I only get 30% CPU utilization, in intervals.

@tkiesel

tkiesel commented Oct 26, 2022

Same here with 4 cores and 8 GiB RAM: ~30% utilization for 2-5 minutes at 10-minute intervals.

@ollioddi

ollioddi commented Nov 2, 2022

I am also having issues. There is sometimes half an hour to a full hour between classifications, and I have a queue of 52360 faces.

@SuperSandro2000
Contributor

SuperSandro2000 commented Nov 2, 2022

Try manually updating to 3.1.1. Immediately after the update my server constantly used one core to classify files.

The only bug I have found so far is that appinfo.xml is out of date and still displays 3.1.0.

@derekakelly

I too am having this issue.

Face recognition: 198618 Queued files, Last classification: 5 minutes ago

[cron] Error: DivisionByZeroError: Division by zero at <>

  1. <>
    OCA\Recognize\Service\FaceClusterAnalyzer::OCA\Recognize\Service\{closure}("*** sensitive parameters replaced ***")
  2. /var/www/nextcloud/apps/recognize/lib/Service/FaceClusterAnalyzer.php line 189
    array_map()
  3. /var/www/nextcloud/apps/recognize/lib/Service/FaceClusterAnalyzer.php line 90
    OCA\Recognize\Service\FaceClusterAnalyzer::calculateCentroidOfDetections()
  4. /var/www/nextcloud/apps/recognize/lib/BackgroundJobs/ClusterFacesJob.php line 28
    OCA\Recognize\Service\FaceClusterAnalyzer->calculateClusters()
  5. /var/www/nextcloud/lib/public/BackgroundJob/Job.php line 78
    OCA\Recognize\BackgroundJobs\ClusterFacesJob->run()
  6. /var/www/nextcloud/lib/public/BackgroundJob/QueuedJob.php line 58
    OCP\BackgroundJob\Job->start()
  7. /var/www/nextcloud/lib/public/BackgroundJob/QueuedJob.php line 48
    OCP\BackgroundJob\QueuedJob->start()
  8. /var/www/nextcloud/cron.php line 152
    OCP\BackgroundJob\QueuedJob->execute()

at 2022-11-03T06:25:03+00:00

@marcelklehr
Member

v3.1.2 should fix all of this. Let me know if it does. 🙏

@ollioddi

ollioddi commented Nov 3, 2022 via email

@derekakelly

Doesn't seem to have fixed it for me. However, there is no more divide by zero error.

I updated everything a few hours ago, and this is what I'm seeing now

image

image

@marcelklehr marcelklehr reopened this Nov 3, 2022
@marcelklehr
Member

I guess a setting is in order here, to allow everyone to tweak it to their liking.

@ollioddi

ollioddi commented Nov 3, 2022

@marcelklehr is the interval hard-coded? I understand why, but when the queue is so huge, I'd rather have it use more resources and finish early.

The pattern @derekakelly has looks very similar to what I experience.

Docker stats show an idle usage of around 5%, spiking towards 600-700% when identifying.

@tanpro260196

In my case, the app seems to take a 5-minute break every 10 or so items.
I have about 15k pictures, 4k songs, and ~1.5k videos. I may die of old age at the current processing rate.

@derekakelly

The "Number of CPU cores" setting was empty at first, so I set it to 0 hoping that would help, but no luck. To me, this setting implies that it will push as hard as it can 24/7. There shouldn't be any hard-coded interval, in my opinion. The CPU cores option is enough.

image

@marcelklehr
Member

There shouldn't be any hard-coded interval, in my opinion. The CPU cores option is enough.

The CPU cores option only applies to a single job, though, and we don't want a single job doing all the work, because this needs to scale to huge instances with multiple worker nodes as well. It's not entirely trivial to make something scale up and down seamlessly, but we're trying to accommodate everyone :)
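
To illustrate the idea of not letting one job do all the work, here is a toy Python sketch, not Recognize's implementation: each job takes at most one batch from a shared queue and then returns, so any number of workers can pick up the remaining batches. The worker names, the batch size of 3, and the sequential round-robin loop are all assumptions made for the example; a real multi-node setup would need proper locking around the queue.

```python
from collections import deque


def run_job(queue: deque, batch_size: int) -> list:
    """One background job: take at most batch_size items, then return."""
    batch = []
    while queue and len(batch) < batch_size:
        batch.append(queue.popleft())
    return batch


# Hypothetical setup: 10 queued files, two workers taking turns.
queue = deque(range(10))
processed = {"worker-a": [], "worker-b": []}
workers = list(processed)

turn = 0
while queue:
    worker = workers[turn % len(workers)]
    processed[worker].extend(run_job(queue, batch_size=3))
    turn += 1

print(processed)
```

With this design, total throughput depends on how often jobs are started and how large each batch is, which is why the CPU cores option alone does not determine how fast the queue drains.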

@marcelklehr
Member

Another issue that contributed to this bug was that Recognize always assumed people were using WASM mode and erroneously scaled down the batch size accordingly, which causes a lot of pauses on fast machines.

@ollioddi

@marcelklehr would it assume WASM mode even when toggled in the UI? This could perhaps explain my experience.

I am running on an i5 11400 with 12 threads so it would categorize it as a "faster" machine

@marcelklehr
Member

Yes, even when WASM mode is off the latest release uses a smaller batch size currently.

@marcelklehr
Member

v3.2.0 is out now which should fix the always-assume-WASM bug and allows setting custom batch sizes for all classifiers. Let me know about your experience with the latest version :)
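
As a rough illustration of the behaviour described here, the sketch below picks a batch size from the active mode unless a custom value has been set. It is a hypothetical Python model, not the Recognize code: the function name, the parameter names, and the default sizes of 100 and 20 are invented for the example, since the thread does not give the real values.

```python
from typing import Optional

# Hypothetical defaults; the real values are not given in this thread.
DEFAULT_BATCH_SIZE = {"native": 100, "wasm": 20}


def effective_batch_size(wasm_enabled: bool, custom: Optional[int] = None) -> int:
    """A positive custom size wins; otherwise use the default for the active mode."""
    if custom is not None and custom > 0:
        return custom
    return DEFAULT_BATCH_SIZE["wasm" if wasm_enabled else "native"]


print(effective_batch_size(wasm_enabled=False))             # 100
print(effective_batch_size(wasm_enabled=True))              # 20
print(effective_batch_size(wasm_enabled=True, custom=500))  # 500
```

In this model, the always-assume-WASM bug corresponds to calling the function with `wasm_enabled=True` regardless of the actual setting, which would hand fast machines the small batch.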

@leegarrett

v3.2.0 is out now which should fix the always-assume-WASM bug and allows setting custom batch sizes for all classifiers. Let me know about your experience with the latest version :)

Hi, thanks for the update. How are you supposed to change the batch size? If I type a different number in the field in the settings menu, it will fall back to the old value after reloading the page. There doesn't seem to be a button to save/apply the changes.

@Cebrain

Cebrain commented Nov 11, 2022

v3.2.0 is out now which should fix the always-assume-WASM bug and allows setting custom batch sizes for all classifiers. Let me know about your experience with the latest version :)

Sadly, I now get an error in Recognize and in the logs.
I have 3.2 installed on the latest Nextcloud Docker version.

Recognize:
An error occurred during face recognition, please check the Nextcloud logs.

`[PHP] Error: Error: strpos(): Passing null to parameter #1 ($haystack) of type string is deprecated at /var/www/html/lib/private/Files/Cache/Scanner.php#508 at <<closure>>

 0. <<closure>>
    OC\Log\ErrorHandler::onError(8192, "strpos(): Passi ... d", "/var/www/html/l ... p", 508)
 1. /var/www/html/lib/private/Files/Cache/Scanner.php line 508
    strpos(null, ".part/")
 2. /var/www/html/lib/private/Files/View.php line 1384
    OC\Files\Cache\Scanner::isPartialFile(null)
 3. /var/www/html/lib/private/Files/Node/HookConnector.php line 227
    OC\Files\View->getFileInfo(null)
 4. /var/www/html/lib/private/Files/Node/HookConnector.php line 113
    OC\Files\Node\HookConnector->getNodeForPath(null)
 5. /var/www/html/lib/private/legacy/OC_Hook.php line 106
    OC\Files\Node\HookConnector->postWrite([null])
 6. /var/www/html/apps/dav/lib/Connector/Sabre/File.php line 471
    OC_Hook::emit("OC_Filesystem", "post_write", [null])
 7. /var/www/html/apps/dav/lib/Connector/Sabre/File.php line 398
    OCA\DAV\Connector\Sabre\File->emitPostHooks(false)
 8. /var/www/html/apps/dav/lib/Connector/Sabre/Directory.php line 151
    OCA\DAV\Connector\Sabre\File->put(null)
 9. /var/www/html/apps/dav/lib/Upload/UploadFolder.php line 45
    OCA\DAV\Connector\Sabre\Directory->createFile("0000000000000000-0000000004310956", null)
10. /var/www/html/3rdparty/sabre/dav/lib/DAV/Server.php line 1098
    OCA\DAV\Upload\UploadFolder->createFile("0000000000000000-0000000004310956", null)
11. /var/www/html/3rdparty/sabre/dav/lib/DAV/CorePlugin.php line 504
    Sabre\DAV\Server->createFile("uploads/hidden/ ... 6", null, null)
12. /var/www/html/3rdparty/sabre/event/lib/WildcardEmitterTrait.php line 89
    Sabre\DAV\CorePlugin->httpPut(Sabre\HTTP\Request {}, Sabre\HTTP\Response {})
13. /var/www/html/3rdparty/sabre/dav/lib/DAV/Server.php line 472
    Sabre\DAV\Server->emit("method:PUT", [Sabre\HTTP\Requ ... }])
14. /var/www/html/3rdparty/sabre/dav/lib/DAV/Server.php line 253
    Sabre\DAV\Server->invokeMethod(Sabre\HTTP\Request {}, Sabre\HTTP\Response {})
15. /var/www/html/3rdparty/sabre/dav/lib/DAV/Server.php line 321
    Sabre\DAV\Server->start()
16. /var/www/html/apps/dav/lib/Server.php line 360
    Sabre\DAV\Server->exec()
17. /var/www/html/apps/dav/appinfo/v2/remote.php line 35
    OCA\DAV\Server->exec()
18. /var/www/html/remote.php line 171
    require_once("/var/www/html/a ... p")

PUT /remote.php/dav/uploads/hidden/55b42946b1ec3e0c8be692fc61ac9d54/0000000000000000-0000000004310956
`
`[PHP] Error: Error: pathinfo(): Passing null to parameter #1 ($path) of type string is deprecated at /var/www/html/lib/private/Files/Cache/Scanner.php#505 at <<closure>>

 0. <<closure>>
    OC\Log\ErrorHandler::onError(8192, "pathinfo(): Pas ... d", "/var/www/html/l ... p", 505)
 1. /var/www/html/lib/private/Files/Cache/Scanner.php line 505
    pathinfo(null, 4)
 2. /var/www/html/lib/private/Files/View.php line 1384
    OC\Files\Cache\Scanner::isPartialFile(null)
 3. /var/www/html/lib/private/Files/Node/HookConnector.php line 227
    OC\Files\View->getFileInfo(null)
 4. /var/www/html/lib/private/Files/Node/HookConnector.php line 113
    OC\Files\Node\HookConnector->getNodeForPath(null)
 5. /var/www/html/lib/private/legacy/OC_Hook.php line 106
    OC\Files\Node\HookConnector->postWrite([null])
 6. /var/www/html/apps/dav/lib/Connector/Sabre/File.php line 471
    OC_Hook::emit("OC_Filesystem", "post_write", [null])
 7. /var/www/html/apps/dav/lib/Connector/Sabre/File.php line 398
    OCA\DAV\Connector\Sabre\File->emitPostHooks(false)
 8. /var/www/html/apps/dav/lib/Connector/Sabre/Directory.php line 151
    OCA\DAV\Connector\Sabre\File->put(null)
 9. /var/www/html/apps/dav/lib/Upload/UploadFolder.php line 45
    OCA\DAV\Connector\Sabre\Directory->createFile("0000000000000000-0000000004310956", null)
10. /var/www/html/3rdparty/sabre/dav/lib/DAV/Server.php line 1098
    OCA\DAV\Upload\UploadFolder->createFile("0000000000000000-0000000004310956", null)
11. /var/www/html/3rdparty/sabre/dav/lib/DAV/CorePlugin.php line 504
    Sabre\DAV\Server->createFile("uploads/hidden/ ... 6", null, null)
12. /var/www/html/3rdparty/sabre/event/lib/WildcardEmitterTrait.php line 89
    Sabre\DAV\CorePlugin->httpPut(Sabre\HTTP\Request {}, Sabre\HTTP\Response {})
13. /var/www/html/3rdparty/sabre/dav/lib/DAV/Server.php line 472
    Sabre\DAV\Server->emit("method:PUT", [Sabre\HTTP\Requ ... }])
14. /var/www/html/3rdparty/sabre/dav/lib/DAV/Server.php line 253
    Sabre\DAV\Server->invokeMethod(Sabre\HTTP\Request {}, Sabre\HTTP\Response {})
15. /var/www/html/3rdparty/sabre/dav/lib/DAV/Server.php line 321
    Sabre\DAV\Server->start()
16. /var/www/html/apps/dav/lib/Server.php line 360
    Sabre\DAV\Server->exec()
17. /var/www/html/apps/dav/appinfo/v2/remote.php line 35
    OCA\DAV\Server->exec()
18. /var/www/html/remote.php line 171
    require_once("/var/www/html/a ... p")

PUT /remote.php/dav/uploads/hidden/55b42946b1ec3e0c8be692fc61ac9d54/0000000000000000-0000000004310956
`

@marcelklehr
Member

The error stacks you posted are unrelated to Recognize, I'm afraid.

@marcelklehr
Member

How are you supposed to change the batch size? If I type a different number in the field in the settings menu, it will fall back to the old value after reloading the page. There doesn't seem to be a button to save/apply the changes.

That would be a bug :/ v3.2.1 should fix this. Thanks for the feedback!

@rhatguy

This comment was marked as off-topic.

@marcelklehr

This comment was marked as off-topic.

@rhatguy

This comment was marked as off-topic.

@marcelklehr

This comment was marked as abuse.

@rhatguy

This comment was marked as off-topic.

@marcelklehr

This comment was marked as off-topic.

@rhatguy

This comment was marked as off-topic.

@marcelklehr

This comment was marked as off-topic.
