compareWithCurrentIndex can take 9-12 hours per user for 100k PDFs from files_external #408

Closed
truelai opened this issue Nov 6, 2018 · 36 comments

@truelai

commented Nov 6, 2018

NC 14.0.3
Full text search 1.1.0
Full text search - Elasticsearch Platform 1.0.2
Full text search - Files 1.1.1
Full text search - Files - Tesseract OCR 1.0.0
Elasticsearch 6.4.2
Tesseract 3.04.01
PHP 7.0.32-0ubuntu0.16.04.1

=================================================

After resetting the index, I re-index. There are ~105K PDFs in an external mount. While indexing, compareWithCurrentIndex takes many hours for each user that has access to this mount. The mount is the only thing that I'm indexing and OCRing (though I often see other files, which aren't in the mount, in the INFO section).

Why is this taking so long? I'm way over-spec'd on hardware.

Also, if there are overlaps in content providers, how should we deal with that? For example, files_external is all PDFs. If I'm trying to OCR all of the PDFs in the external mount, do I also need to have files_pdf as a content provider? Or do pdf, office, image, and audio only apply to files that aren't external?

Files 1.1.1
{
    "files_local": "0",
    "files_external": "1",
    "files_group_folders": "0",
    "files_encrypted": "0",
    "files_federated": "0",
    "files_size": "100",
    "files_pdf": "1",
    "files_office": "1",
    "files_image": "0",
    "files_audio": "0"
}

[Screenshot: Kibana]

@truelai truelai changed the title Compare with currrent index can take 9-12 hours for 100k PDFs compareWithCurrentIndex can take 9-12 hours per user for 100k PDFs from files_external Nov 6, 2018

@daita

Member

commented Dec 19, 2018

During the compareWithCurrentIndex process, there is no request to Elasticsearch. We're just comparing 2 lists:

  • all of the user's documents and their last modified date (retrieved from the previous process, generateIndexFiles)
  • the full list of any previous index (stored in the database)

Can you see an issue that would explain the slow process?
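
To make the comparison step concrete, here is a minimal Python sketch of comparing two such lists. It is not the app's code (the app is written in PHP); the function name, the argument names, and the data shapes are assumptions made for illustration. With both lists keyed on document id, each document costs one dictionary lookup rather than a scan of the whole previous-index list.

```python
def compare_with_current_index(current_files, previous_index):
    """Return ids of documents that are new or modified since the last index.

    current_files:  dict of document id -> last modified timestamp
    previous_index: dict of document id -> timestamp stored at the last index
    Both shapes are assumptions made for this sketch.
    """
    to_index = []
    for doc_id, modified in current_files.items():
        indexed_at = previous_index.get(doc_id)  # constant-time lookup
        if indexed_at is None or modified > indexed_at:
            to_index.append(doc_id)
    return to_index


current = {"files:1": 100, "files:2": 250, "files:3": 300}
previous = {"files:1": 100, "files:2": 200}
print(compare_with_current_index(current, previous))  # ['files:2', 'files:3']
```

If the real comparison instead issues one database query or one remote file access per document, the cost per document would be much higher than in this in-memory sketch.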

@truelai

Author

commented Jan 15, 2019

> During the compareWithCurrentIndex process, there is no request to elasticsearch. We're just comparing 2 lists:
>
> * all user's documents and their last modified date (retrieved from the previous process _generateIndexFiles_)
> * full list of any previous index (stored in the database)
>
> Could you see an issue that would explain the slow process ?

Yes. Multiple previous indexes could be the culprit here. Regarding the "full list of any previous index", how can I purge these? Is this purged with fulltextsearch:reset?

Also, I still have the following question:
> ... if there are overlaps in content providers, how should we deal with that? For example, file_external is all PDFs. If I'm trying to OCR all of the PDFs in the external mount, do I also need to have files_PDF as a content provider? Or do pdf, office, image, and audio only apply to files that aren't external?

@daita

Member

commented Jan 29, 2019

Type and source are not related. If you select PDF and/or Office, all PDF and/or Office files on your Nextcloud are indexed, provided they are in one of the selected filesystems (local, external, ...).

OCR of PDFs is not available right now; files_fulltextsearch_tesseract only OCRs image files.

To purge the previous index: ./occ fulltextsearch:reset
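
The statement that type and source are independent can be sketched as two separate checks that must both pass. This is an illustration only, not the app's logic: the helper `is_selected` is hypothetical, the keys are taken from the configuration dump earlier in the thread, and treating the value "1" as "enabled" is an assumption.

```python
# Keys copied from the configuration posted in the thread.
CONFIG = {
    "files_local": "0",
    "files_external": "1",
    "files_pdf": "1",
    "files_office": "1",
    "files_image": "0",
    "files_audio": "0",
}


def is_selected(source, file_type, config):
    """Hypothetical helper: a file qualifies only if both its source
    (local, external, ...) and its type (pdf, office, ...) are enabled."""
    source_on = config.get("files_" + source, "0") == "1"
    type_on = config.get("files_" + file_type, "0") == "1"
    return source_on and type_on


print(is_selected("external", "pdf", CONFIG))    # True
print(is_selected("local", "pdf", CONFIG))       # False
print(is_selected("external", "image", CONFIG))  # False
```

Under this reading, a PDF in the external mount is only picked up when files_external and files_pdf are both enabled.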

@daita

Member

commented Jan 31, 2019

With NC16, you might see some improvement on big setups like yours.

@truelai

Author

commented Feb 4, 2019

> Now, the OCR of PDF is not available right now....

Are you saying the OCR of PDF text is not working (and I would infer that it hasn't been working before)? When is this supposed to be implemented? Is there a roadmap you can refer me to?

> the files_fulltextsearch_tesseract only OCR Image files.

PDF is an image file?

> with NC16, you might see some improvement on big setup like yours

Can you let us in on the improvements?

Also - thank you for your work on this project.

@daita

Member

commented Feb 4, 2019

PDFs with a text layer are indexed.
If the text layer is missing (which is usually the case when the source is a scanned document), the pages of the PDF need to be OCRed before indexing. The issue is that you cannot OCR a PDF directly; you need to convert each page to an image first.

This second way of indexing PDFs is not available yet.

@truelai

Author

commented Feb 11, 2019

So if I convert all PDFs to TIFF and feed those, OCR and indexing will both work?

Also, any timeline on OCRing image-only PDFs?

@daita

Member

commented Feb 12, 2019

This would mean a lot of work to convert them, and the results of your search would point to the TIFF image, not the PDF.

@daita

Member

commented Feb 12, 2019

@truelai: can you have a look at daita/files_fulltextsearch_tesseract#8

You cannot search within the indexed content right now, as it is not stored in the content field but in parts.ocr. However, if you have access to your Elasticsearch, you should be able to check the indexed content of a file at http://localhost:9200/IndexName/standard/files:FileId, replacing IndexName and FileId (for example, http://localhost:9200/nc15/standard/files:819). (fixed)
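
As a small illustration of the URL pattern above, the following Python sketch builds the document URL and pulls the OCR text out of a response body. The helper names are hypothetical, and the sample response is made up for the example: only the `_source` wrapper is standard for an Elasticsearch document lookup, while the `parts.ocr` path is taken from the description in this comment.

```python
from urllib.parse import quote


def document_url(index_name, file_id, host="http://localhost:9200"):
    """Build the lookup URL for one indexed file (hypothetical helper)."""
    index_part = quote(index_name, safe="")
    doc_part = quote("files:" + str(file_id), safe=":")
    return host + "/" + index_part + "/standard/" + doc_part


def extract_ocr_text(response):
    """Return the OCR text from a decoded JSON response, or None if absent."""
    return response.get("_source", {}).get("parts", {}).get("ocr")


print(document_url("nc15", 819))
# http://localhost:9200/nc15/standard/files:819

# Made-up response body, for illustration only.
sample = {"_id": "files:819", "_source": {"content": "", "parts": {"ocr": "scanned text"}}}
print(extract_ocr_text(sample))  # scanned text
```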

@daita

Member

commented Feb 12, 2019

After a bunch of tests, please note that enabling the OCR of PDF takes a lot more resources.

@truelai

Author

commented Feb 12, 2019

> This would means a lot of work to convert them, and the result on your search will point to the TIFF image, not the PDF

So the implementation of OCR is not really to convert the file? From what I'm interpreting, if I feed it a directory of TIF files:

  1. Tesseract generates a .txt file of the TIF
  2. ES indexes the txt files and has a pointer to the TIF

Is this a correct understanding of the current implementation?

> After a bunch of tests, please note that enabling the OCR of PDF takes a lot more resources.
> The issue is that you cannot OCR a PDF directly, you need to convert each page to an image first.

I'm having trouble reconciling these statements. Are you saying there's already some FTS function that converts a PDF to a friendlier image format and then converts that to a (searchable) PDF?

The fact is, I have a ton of scanned PDFs (image only). I'm looking to:

  1. convert them to searchable PDFs
  2. have ES index the searchable PDF
  3. be able to search for those PDFs via NC GUI

If I plan on using the FTS apps to achieve this, do my short and long term strategies differ? Is this currently possible? Will it be possible in the future? What do you recommend?

@daita

Member

commented Feb 12, 2019

> This would means a lot of work to convert them, and the result on your search will point to the TIFF image, not the PDF
>
> So the implementation of OCR is not really to convert the file? From what I'm interpreting, if I feed it a directory of TIF files:
>
> 1. Tesseract generates a .txt file of the TIF
> 2. ES indexes the txt files and has a pointer to the TIF

When indexing an image file, defined as an IndexDocument, the app just gets the result from tesseract and saves it as the content of the IndexDocument.

> After a bunch of tests, please note that enabling the OCR of PDF takes a lot more resources.
> The issue is that you cannot OCR a PDF directly, you need to convert each page to an image first.
>
> I'm having trouble reconciling these statements. Are you saying there's already some FTS function that converts PDF to friendlier image format and the converts that to PDF(searchable)?

I pushed a new branch earlier today (nextcloud/files_fulltextsearch_tesseract#8) that would:

  • convert each page of your PDF to a jpeg (as a temporary file)
  • send the path of the temporary file to tesseract
  • save the result as a partial content of the IndexDocument.

I added some requirements in the PR.
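
The three steps listed above can be sketched in Python as follows. This is not the code from the PR: the `IndexDocument` class here is a stand-in, and `render_page` and `run_ocr` are hypothetical callables representing the page-to-JPEG conversion and the tesseract call, so the sketch runs without either tool installed.

```python
from dataclasses import dataclass, field


@dataclass
class IndexDocument:
    """Stand-in for the app's document model (fields are assumptions)."""
    doc_id: str
    content: str = ""
    parts: dict = field(default_factory=dict)


def ocr_pdf_into_document(doc, page_count, render_page, run_ocr):
    """Render each page to an image, OCR it, and store the text as a part."""
    texts = []
    for page in range(1, page_count + 1):
        image_path = render_page(page)     # step 1: page -> temporary image
        texts.append(run_ocr(image_path))  # step 2: image path -> text
    doc.parts["ocr"] = "\n".join(texts)    # step 3: save as partial content
    return doc


# Fake converters so the example is self-contained.
def fake_render(page):
    return "/tmp/page-" + str(page) + ".jpg"


def fake_ocr(path):
    return "text from " + path


doc = ocr_pdf_into_document(IndexDocument("files:819"), 2, fake_render, fake_ocr)
print(doc.parts["ocr"])
```

Because every page goes through an image conversion and an OCR pass, the cost grows with the page count, which is consistent with the remark that OCR of PDFs is heavy on resources.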

> The fact is, I have a ton of scanned PDFs (image only). I'm looking to:
>
> 1. convert them to searchable PDFs
> 2. have ES index the searchable PDF
> 3. be able to search for those PDFs via NC GUI
>
> If I plan on using the FTS apps to achieve this, do my short and long term strategies differ? Is this currently possible? Will it be possible in the future? What do you recommend?

Well, it seems to work but it's heavy on resources.
Please test the PR and confirm you were able to make it work.

@it25fg


commented May 15, 2019

Regarding the slow indexing process, I have some questions. Versions nearly match the ones of the opening question:
Nextcloud: 14.0.9
Fulltextsearch: 1.1.1
Fulltextsearch_Elasticsearch: 1.0.3

I'm facing the situation that occ fulltextsearch:index for ~2500 users (~11 000 000 filecache entries) has been running for two months and has so far completed ~1000 users. The entries in the oc_fulltextsearch_ticks table show that 95% of the overall time is spent in 'compareWithCurrentIndex', and I can see that 'compareWithCurrentIndex' processes no more than three files per second, which feels very slow given your explanation that it only loops over things in memory without using external resources.

Is there something I can tune to speed this up? Or is it simply the overall number of files that slows everything down? Any hints are greatly appreciated.
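
A quick back-of-the-envelope calculation shows why three files per second is a problem at this scale. The sketch below assumes, for illustration only, that every one of the ~11 000 000 filecache entries passes through the comparison step once, at the best observed rate; the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def estimated_days(entries, files_per_second):
    """Days needed to process `entries` items at a constant rate."""
    return entries / files_per_second / SECONDS_PER_DAY


print(f"{estimated_days(11_000_000, 3):.1f} days")  # 42.4 days
```

Since three files per second is the upper bound reported above, a full pass would take at least this long under these assumptions.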

@daita

Member

commented May 15, 2019

@it25fg I am impressed you found this (future) fulltextsearch_ticks feature :-)

Now, in Nextcloud 16, this is greatly improved, but I will see if some of the improvements to getIndex() can be ported to your version of fulltextsearch.

@it25fg


commented May 17, 2019

Glad to hear there's something in the works. As soon as new versions of FTS come in, I'll give feedback on the impact. Keep up the good work!

@daita

Member

commented May 20, 2019

Please keep me updated with the latest version of fulltextsearch.

@it25fg


commented May 27, 2019

@daita Thanks for the quick changes; I've noticed the new version. I will get back with results when the next maintenance window allows updating the app.

@truelai

Author

commented May 29, 2019

Full text search 1.3.2
Elasticsearch 1.3.0
Deck 0.6.1 (patched)
Files 1.3.0

Compare with current index takes ~2 days per user now, which is much longer. I do have OCR enabled (as I did before), and I understand that it's supposed to be working now, though I don't understand how or why that would slow down the comparison of indices.

@daita

Member

commented May 29, 2019

Could it be that your files are on a remote computer?

@truelai

Author

commented May 29, 2019

> could it be that your files are on a remote computer ?

@daita Yes, they are in directories that are mounted with SMB, as before.

@daita

Member

commented May 29, 2019

Can you test your index without enabling the external files (only local files) in the Admin Settings page and see if it still takes so long? Do not hesitate to send a small video clip to maxence@nextcloud.com if you still experience slow indexing.

@truelai

Author

commented May 30, 2019

@daita I'll give that a try but I want to let this current user finish, first. It's now day three. I'll shoot over a vid.

@truelai

Author

commented May 31, 2019

I applied the latest updates:

Full text search 1.3.3
Full text search - Elasticsearch Platform 1.3.1
Full text search - Files 1.3.2
Full text search - Files - Tesseract OCR 1.3.0

I set, in the admin:

External Files: Do not index path nor content

Compare with index is still slow. I will email a short video.

Providers still show "2" for external, BTW.

{
    "files_local": "1",
    "files_external": "2",
    "files_group_folders": "0",
    "files_encrypted": "0",
    "files_federated": "0",
    "files_size": "70",
    "files_pdf": "1",
    "files_office": "1",
    "files_image": "0",
    "files_audio": "0"
}

@it25fg


commented Jun 7, 2019

Many thanks @daita! After upgrading Fulltextsearch to 1.1.2, the indexing of the whole document base completed in two days.
One last question: I now have a lot of entries in oc_fulltextsearch_indexes with a nonempty message (exceptions from Elasticsearch, based on document contents). What would be the proper way to force those documents to reindex once Elasticsearch has been updated?

@truelai

Author

commented Jun 7, 2019

HUGE IMPROVEMENT

Updating everyone else as I've already notified @daita and provided more details. My versions are:

NC 16
Full text search 1.3.3
Full text search - Elasticsearch Platform 1.3.1
Full text search - Files 1.3.2
Full text search - Files - Tesseract OCR 1.3.0
Deck (patched) 0.6.1

There is a new method available for testing using:

./occ fulltextsearch:index "{\"test_request\": true}"

Previous versions actually made indexing slower than when I originally authored this issue. I got to the point where a user was taking multiple days to index. With this new method, I was able to index a user in less than an hour on my installation which has ~500,000 documents (pages).

*Note that I had turned off OCR for this test.

This is a HUGE improvement. Kudos to @daita. May his beer always be cold and his whiskey always be old.

@daita

Member

commented Jun 8, 2019

@it25fg please try: ./occ fulltextsearch:index "{\"errors\": \"reset\"}"
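
The occ commands in this thread pass their options as a JSON string, and escaping the inner quotes by hand is easy to get wrong. As a convenience, the string can be generated instead. The Python sketch below uses only the two options that appear in this thread; the helper name `occ_index_command` is hypothetical.

```python
import json
import shlex


def occ_index_command(options):
    """Build the shell command line with the options encoded as JSON
    and quoted safely for a POSIX shell."""
    return "./occ fulltextsearch:index " + shlex.quote(json.dumps(options))


print(occ_index_command({"errors": "reset"}))
# ./occ fulltextsearch:index '{"errors": "reset"}'
print(occ_index_command({"test_request": True}))
# ./occ fulltextsearch:index '{"test_request": true}'
```

The single-quoted form hands the shell the same JSON text as the backslash-escaped double-quoted form used above.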

@truelai

Author

commented Jun 11, 2019

@daita, I got an exception on the last run using that command. It looks like it's due to LDAP, though. I checked the admin page and found a red "Configuration incorrect" notice (though I've changed nothing there). When I test all the settings, everything goes back to "Configuration OK" until I reload that setting, and then it's immediately back to "Configuration incorrect". I'm guessing this has something to do with it. I can provide a video if needed.

note: a username was replaced with "redacted-for-github"

TypeError: Argument 1 passed to OCA\User_LDAP\Group_LDAP::walkNestedGroups() must be of the type string, null given, called in /var/www/nextcloud/apps/user_ldap/lib/Group_LDAP.php on line 796 and defined in /var/www/nextcloud/apps/user_ldap/lib/Group_LDAP.php:284
Stack trace:
#0 /var/www/nextcloud/apps/user_ldap/lib/Group_LDAP.php(796): OCA\User_LDAP\Group_LDAP->walkNestedGroups(NULL, Object(Closure), Array)
#1 /var/www/nextcloud/apps/user_ldap/lib/Group_LDAP.php(752): OCA\User_LDAP\Group_LDAP->getGroupsByMember(NULL)
#2 /var/www/nextcloud/apps/user_ldap/lib/Group_Proxy.php(123): OCA\User_LDAP\Group_LDAP->getUserGroups(NULL)
#3 /var/www/nextcloud/lib/private/Group/Manager.php(280): OCA\User_LDAP\Group_Proxy->getUserGroups('redacted-for-github')
#4 /var/www/nextcloud/lib/private/Group/Manager.php(267): OC\Group\Manager->getUserIdGroups('redacted-for-github')
#5 /var/www/nextcloud/lib/private/Group/Manager.php(328): OC\Group\Manager->getUserGroups(Object(OC\User\User))
#6 /var/www/nextcloud/apps/files_external/lib/Service/UserGlobalStoragesService.php(77): OC\Group\Manager->getUserGroupIds(Object(OC\User\User))
#7 /var/www/nextcloud/apps/files_external/lib/Service/StoragesService.php(128): OCA\Files_External\Service\UserGlobalStoragesService->readDBConfig()
#8 /var/www/nextcloud/apps/files_external/lib/Service/StoragesService.php(178): OCA\Files_External\Service\StoragesService->readConfig()
#9 /var/www/nextcloud/apps/files_external/lib/Service/StoragesService.php(187): OCA\Files_External\Service\StoragesService->getAllStorages()
#10 /var/www/nextcloud/apps/files_external/lib/config.php(105): OCA\Files_External\Service\StoragesService->getStorages()
#11 /var/www/nextcloud/apps/files_fulltextsearch/lib/Service/ExternalFilesService.php(261): OC_Mount_Config::getAbsoluteMountPoints('redacted-for-github')
#12 /var/www/nextcloud/apps/files_fulltextsearch/lib/Service/ExternalFilesService.php(137): OCA\Files_FullTextSearch\Service\ExternalFilesService->getMountPoints('redacted-for-github')
#13 /var/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php(341): OCA\Files_FullTextSearch\Service\ExternalFilesService->initExternalFilesForUser('redacted-for-github')
#14 /var/www/nextcloud/apps/files_fulltextsearch/lib/Service/FilesService.php(196): OCA\Files_FullTextSearch\Service\FilesService->initFileSystems('redacted-for-github')
#15 /var/www/nextcloud/apps/files_fulltextsearch/lib/Provider/FilesProvider.php(230): OCA\Files_FullTextSearch\Service\FilesService->getChunksFromUser('redacted-for-github', Object(OCA\FullTextSearch\Model\IndexOptions))
#16 /var/www/nextcloud/apps/fulltextsearch/lib/Service/IndexService.php(183): OCA\Files_FullTextSearch\Provider\FilesProvider->generateChunks('redacted-for-github')
#17 /var/www/nextcloud/apps/fulltextsearch/lib/Command/Index.php(409): OCA\FullTextSearch\Service\IndexService->indexProviderContentFromUser(Object(OCA\FullTextSearch_ElasticSearch\Platform\ElasticSearchPlatform), Object(OCA\Files_FullTextSearch\Provider\FilesProvider), 'redacted-for-github', Object(OCA\FullTextSearch\Model\IndexOptions))
#18 /var/www/nextcloud/apps/fulltextsearch/lib/Command/Index.php(273): OCA\FullTextSearch\Command\Index->indexProvider(Object(OCA\Files_FullTextSearch\Provider\FilesProvider), Object(OCA\FullTextSearch\Model\IndexOptions))
#19 /var/www/nextcloud/3rdparty/symfony/console/Command/Command.php(255): OCA\FullTextSearch\Command\Index->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#20 /var/www/nextcloud/core/Command/Base.php(166): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#21 /var/www/nextcloud/3rdparty/symfony/console/Application.php(901): OC\Core\Command\Base->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#22 /var/www/nextcloud/3rdparty/symfony/console/Application.php(262): Symfony\Component\Console\Application->doRunCommand(Object(OCA\FullTextSearch\Command\Index), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#23 /var/www/nextcloud/3rdparty/symfony/console/Application.php(145): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#24 /var/www/nextcloud/lib/private/Console/Application.php(213): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#25 /var/www/nextcloud/console.php(97): OC\Console\Application->run()
#26 /var/www/nextcloud/occ(11): require_once('/var/www/nextcl...')
#27 {main}

@it25fg


commented Jun 17, 2019

@daita it seems your backport has introduced a regression :-( On a nonexistent (or empty) index, occ fulltextsearch:test throws

In ProviderIndexes.php line 59:
                                                              
  [OCA\FullTextSearch\Exceptions\IndexDoesNotExistException]  

which is exactly the place of your change.

Could it be that the environment in NC 14.x is not prepared to catch this exception? In FTS 1.1.1, the 'not found' case returns null instead.
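
The kind of regression suspected here can be illustrated with a minimal Python sketch. It is not the FTS code (which is PHP), and every name in it is hypothetical. It shows a caller written for a lookup that returns null (None) on "not found", and how that caller breaks once the lookup starts raising an exception instead.

```python
class IndexDoesNotExistException(Exception):
    """Stand-in for the exception named in the error output above."""


def get_index_old(indexes, doc_id):
    # Older behaviour: "not found" is signalled by returning None.
    return indexes.get(doc_id)


def get_index_new(indexes, doc_id):
    # Newer behaviour: "not found" is signalled by raising.
    if doc_id not in indexes:
        raise IndexDoesNotExistException(doc_id)
    return indexes[doc_id]


def caller_for_old(get_index, indexes, doc_id):
    # Written against the old contract: only checks for None.
    return "create" if get_index(indexes, doc_id) is None else "update"


def caller_for_new(get_index, indexes, doc_id):
    # One possible adaptation: treat the exception as "not found".
    try:
        get_index(indexes, doc_id)
    except IndexDoesNotExistException:
        return "create"
    return "update"


print(caller_for_old(get_index_old, {}, "files:1"))  # create
print(caller_for_new(get_index_new, {}, "files:1"))  # create
try:
    caller_for_old(get_index_new, {}, "files:1")
except IndexDoesNotExistException:
    print("old caller does not handle the new exception")
```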

@daita

Member

commented Jun 17, 2019

@daita

Member

commented Jun 17, 2019

@it25fg thanks for your ... catch :]

#519

@truelai

Author

commented Jun 18, 2019

> @truelai could you try: nextcloud/files_fulltextsearch#70 ?

@daita: Happy to. FYI, except for that one exception caused by the LDAP issue, I haven't been getting exceptions in quite a bit of time.

Running the index now. I'll update tomorrow.

@truelai

Author

commented Jun 19, 2019

@daita Patched with /nextcloud/files_fulltextsearch/pull/70 and ran index of all users without issue.

@it25fg


commented Jun 27, 2019

@daita I patched manually using #519, and now the first index runs without problems. The patch is now superseded by FTS 1.1.3, so the integrity check doesn't yell at me anymore ;-) Many thanks.

@daita

Member

commented Jun 27, 2019

@it25fg but 1.1.3 already included the patch, right?

@it25fg


commented Jun 27, 2019

> @it25fg but 1.1.3 already included the patch, right ?

Exactly. I was not patient enough to wait for it to arrive.

@daita

Member

commented Jun 27, 2019

There is nothing wrong with testing a patch!

Let's (finally) close this ticket. Thanks to everyone involved <3

@daita daita closed this Jun 27, 2019
