Indexing of long text documents are tricky #127

Closed
tommykoctur opened this issue Jul 19, 2022 · 6 comments
Comments

@tommykoctur

Hello,

My use case is search in long text documents.
Documents are split into chunks (let's say sentences), and each chunk has its own embedding. The root document has no embedding.
I am not able to index these documents with the annlite indexer because the root document has no embedding; only the chunks can be indexed.
If I store the root documents directly in lmdb via self._index.doc_store(0).insert(root_docs), then loading the query flow throws this error:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (10,) + inhomogeneous part.

The 10 is 5 root docs plus 5 chunks (dummy data).

Can you please help me?
Thanks
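
A minimal sketch of the situation described above, in plain Python. The `Doc` class is a hypothetical stand-in, not the DocArray `Document`, and the helper names are made up for illustration. It only shows why a flat list of root docs and chunks cannot be stacked into one rectangular embedding array, which is one plausible source of the "inhomogeneous shape" error; it does not reproduce annlite's internals.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Document; not the DocArray class."""
    id: str
    text: str
    embedding: Optional[List[float]] = None
    chunks: List["Doc"] = field(default_factory=list)


def flatten(root_docs):
    """Return root docs and their chunks in one flat list."""
    flat = []
    for root in root_docs:
        flat.append(root)
        flat.extend(root.chunks)
    return flat


def embedding_dims(docs):
    """Collect the distinct embedding lengths; None marks a missing embedding."""
    return {None if d.embedding is None else len(d.embedding) for d in docs}


# Dummy data: 5 root docs without embeddings, each with one embedded chunk.
roots = [
    Doc(
        id=f"root{i}",
        text="long text",
        chunks=[Doc(id=f"root{i}-c0", text="sentence", embedding=[0.1, 0.2, 0.3])],
    )
    for i in range(5)
]

# Mixing both levels gives 10 entries with two different "shapes".
mixed = flatten(roots)
print(len(mixed), embedding_dims(mixed))

# Keeping only docs that carry an embedding gives a uniform set.
indexable = [d for d in mixed if d.embedding is not None]
print(len(indexable), embedding_dims(indexable))
```

Filtering on the presence of an embedding before indexing keeps the vector side uniform; the root docs then need to live somewhere else.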

@JoanFM
Member

JoanFM commented Jul 19, 2022

We are working on a feature that will allow the user to have multiple indices and sub_indices around the same DocArray API. I think this could be useful for you.

@tommykoctur
Author

> We are working on a feature that will allow the user to have multiple indices and sub_indices around the same DocArray API, I think this could be useful for you?

I don't know yet what it will look like.
But the Document's nested structure (chunks are sentences from the long text) is suitable for this case; the annlite indexer just doesn't allow indexing (or just storing) documents without embeddings.

@JoanFM
Member

JoanFM commented Jul 19, 2022

In this case you would need your own version of AnnLiteIndexer that indexes different parts in different DocArrays. But yes, the current implementation does not work for this.
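
A rough sketch of the "different parts in different stores" idea, in plain Python rather than against the AnnLiteIndexer or DocArray APIs. The class `TwoStoreIndex` and its methods are hypothetical, and a brute-force cosine search stands in for the approximate nearest-neighbour index. Only chunks carry vectors; root docs are stored by id and looked up from the matching chunk's parent id.

```python
import math
from typing import Dict, List, Tuple


def _cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoStoreIndex:
    """Hypothetical sketch: chunk vectors are searched, root docs are only stored."""

    def __init__(self) -> None:
        self.root_store: Dict[str, str] = {}  # root id -> root text
        self.chunk_index: List[Tuple[str, List[float]]] = []  # (parent id, embedding)

    def add_root(self, root_id: str, text: str) -> None:
        self.root_store[root_id] = text

    def add_chunk(self, parent_id: str, embedding: List[float]) -> None:
        self.chunk_index.append((parent_id, embedding))

    def search(self, query: List[float], limit: int = 3) -> List[str]:
        """Rank chunks by similarity, then return the distinct parent texts."""
        ranked = sorted(self.chunk_index, key=lambda item: -_cosine(query, item[1]))
        seen, results = set(), []
        for parent_id, _ in ranked:
            if parent_id not in seen:
                seen.add(parent_id)
                results.append(self.root_store[parent_id])
            if len(results) == limit:
                break
        return results


index = TwoStoreIndex()
index.add_root("doc-a", "A long text about cats.")
index.add_root("doc-b", "A long text about ships.")
index.add_chunk("doc-a", [1.0, 0.0])
index.add_chunk("doc-a", [0.9, 0.1])
index.add_chunk("doc-b", [0.0, 1.0])

top = index.search([1.0, 0.1], limit=1)
print(top)
```

Because several chunks can share one parent, the search deduplicates on the parent id before returning root docs.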

@tommykoctur
Author

> in this case you would need to have your own version of AnnLiteIndexer indexing different parts in different DocArrays, but yes current implementation does not work

Could you please explain how sub_indices would work?
When do you plan to implement it?

Thanks

@numb3r3
Member

numb3r3 commented Aug 23, 2022

@tommykoctur The subindex has been released. https://docarray.jina.ai/fundamentals/documentarray/subindex/

@tommykoctur
Author

Thank you, but I don't think that this would help me. I would probably add another LMDB to store root doc information to save some space.
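
A small sketch of a separate key-value store for root doc information, keyed by root doc id. It uses Python's standard-library `dbm.dumb` as a stand-in for LMDB so that it runs without extra packages; the function names and the stored fields (`title`, `n_chunks`) are assumptions for illustration only.

```python
import dbm.dumb
import json
import os
import tempfile


def save_roots(path, roots):
    """Write each root doc's metadata as JSON under its id."""
    with dbm.dumb.open(path, "c") as db:
        for root_id, meta in roots.items():
            db[root_id.encode("utf-8")] = json.dumps(meta).encode("utf-8")


def load_root(path, root_id):
    """Fetch one root doc's metadata by id, or None if it is not stored."""
    with dbm.dumb.open(path, "r") as db:
        raw = db.get(root_id.encode("utf-8"))
    return None if raw is None else json.loads(raw.decode("utf-8"))


# Store only the root-level fields; chunk embeddings stay in the vector index.
path = os.path.join(tempfile.mkdtemp(), "root_docs")
save_roots(path, {"doc-a": {"title": "Cats", "n_chunks": 2}})

meta = load_root(path, "doc-a")
missing = load_root(path, "doc-z")
print(meta, missing)
```

Storing only the root-level fields here, and leaving chunk embeddings in the vector index, avoids duplicating data between the two stores.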

@JoanFM JoanFM closed this as completed Oct 4, 2022