Indexing of long text documents is tricky #127

tommykoctur · 2022-07-19T12:17:48Z

Hello,

My use case is search in long text documents.
Documents are split into chunks (let's say sentences), and each chunk has its own embedding. The root document has no embedding.
I am not able to index the documents with the annlite indexer because the root document has no embedding; only chunks can be indexed.
If I store the documents directly in LMDB via self._index.doc_store(0).insert(root_docs), then loading the query flow throws this error:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (10,) + inhomogeneous part.

The 10 in the detected shape is 5 root docs plus 5 chunks (dummy data).

Can you please help me?
Thanks
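
The message above is the one NumPy raises when it is asked to build a regular array from rows of unequal shape. One plausible cause, which is an assumption and not confirmed anywhere in this thread, is that root documents without an embedding end up in the same batch as chunk embeddings. The sketch below uses a hypothetical `Doc` dataclass as a stand-in for a DocArray Document (it is not the DocArray or AnnLite API) and shows a pre-flight split that separates indexable chunks from store-only roots, using the same 5 + 5 dummy layout as the report.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a DocArray Document (not the real class)."""
    id: str
    text: str = ""
    embedding: Optional[List[float]] = None
    chunks: List["Doc"] = field(default_factory=list)


def split_indexable(docs: List[Doc]) -> Tuple[List[Doc], List[Doc]]:
    """Separate docs that carry an embedding from docs that do not."""
    with_emb = [d for d in docs if d.embedding is not None]
    without_emb = [d for d in docs if d.embedding is None]
    return with_emb, without_emb


def embedding_dims(docs: List[Doc]) -> Set[int]:
    """Distinct embedding lengths; more than one value means a ragged batch."""
    return {len(d.embedding) for d in docs if d.embedding is not None}


# Dummy data mirroring the report: 5 root docs, each with one embedded chunk.
roots = [
    Doc(
        id=f"root-{i}",
        text="long text",
        chunks=[Doc(id=f"root-{i}-chunk-0", text="sentence",
                    embedding=[0.1 * i, 0.2, 0.3])],
    )
    for i in range(5)
]

# Flattening roots and chunks together gives the 10 items seen in the error.
flat = roots + [c for r in roots for c in r.chunks]

# Only `indexable` would go to a vector index; `store_only` needs a plain store.
indexable, store_only = split_indexable(flat)
```

With this split, only the chunk batch has a uniform embedding length, so it is the only part that could be stacked into a single matrix.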


JoanFM · 2022-07-19T12:35:33Z

We are working on a feature that will let users have multiple indices and sub_indices behind the same DocArray API. I think this could be useful for you.

tommykoctur · 2022-07-19T12:44:10Z

> We are working on a feature that will let users have multiple indices and sub_indices behind the same DocArray API. I think this could be useful for you.

I don't know yet what it will look like.
But the Document's nested structure (chunks are sentences from the long text) is suitable for this case; it is only that the annlite indexer does not allow indexing (that is, just storing) documents without embeddings.

JoanFM · 2022-07-19T12:49:52Z

In this case you would need your own version of AnnLiteIndexer that indexes the different parts in different DocArrays, but yes, the current implementation does not support this.

tommykoctur · 2022-07-19T12:53:53Z

> In this case you would need your own version of AnnLiteIndexer that indexes the different parts in different DocArrays, but yes, the current implementation does not support this.

Could you please explain how sub_indices would work?
When do you plan to implement it?

Thanks

numb3r3 · 2022-08-23T10:44:03Z

@tommykoctur The subindex has been released. https://docarray.jina.ai/fundamentals/documentarray/subindex/

tommykoctur · 2022-08-23T11:13:28Z

Thank you, but I don't think that this would help me. I would probably add another LMDB to store root doc information to save some space.
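
The idea of keeping root documents in a separate store can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the AnnLite or DocArray API: `RootStore` is a hypothetical class that uses a plain dict where a second LMDB would sit, `ChunkIndex` is a hypothetical brute-force dot-product search standing in for the ANN index, and `search_roots` is a made-up helper name. Each chunk keeps its parent's id, so chunk matches can be mapped back to root documents after the search.

```python
from typing import Dict, List, Tuple

Vector = List[float]


class RootStore:
    """Hypothetical key-value store for root docs (a dict in place of LMDB)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, root_id: str, text: str) -> None:
        self._data[root_id] = text

    def get(self, root_id: str) -> str:
        return self._data[root_id]


class ChunkIndex:
    """Hypothetical brute-force index over chunk embeddings only."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, str, Vector]] = []

    def add(self, chunk_id: str, parent_id: str, embedding: Vector) -> None:
        self._rows.append((chunk_id, parent_id, embedding))

    def search(self, query: Vector, limit: int) -> List[Tuple[str, str]]:
        # Rank chunks by dot product with the query, highest first.
        def score(row: Tuple[str, str, Vector]) -> float:
            return sum(q * e for q, e in zip(query, row[2]))

        ranked = sorted(self._rows, key=score, reverse=True)
        return [(chunk_id, parent_id) for chunk_id, parent_id, _ in ranked[:limit]]


def search_roots(query: Vector, index: ChunkIndex, store: RootStore,
                 limit: int = 3) -> List[Tuple[str, str]]:
    """Search chunks, then resolve each match to its root doc, deduplicated."""
    seen: List[str] = []
    for _chunk_id, parent_id in index.search(query, limit):
        if parent_id not in seen:
            seen.append(parent_id)
    return [(root_id, store.get(root_id)) for root_id in seen]


store = RootStore()
index = ChunkIndex()

# Root docs are stored once, without embeddings.
store.put("doc-a", "First long text.")
store.put("doc-b", "Second long text.")

# Only chunks carry embeddings and enter the index.
index.add("doc-a-0", "doc-a", [1.0, 0.0])
index.add("doc-a-1", "doc-a", [0.9, 0.1])
index.add("doc-b-0", "doc-b", [0.0, 1.0])

hits = search_roots([1.0, 0.0], index, store, limit=2)
```

The root text is written once in the side store and is not duplicated per chunk, which is where the space saving would come from in this layout.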

JoanFM closed this as completed Oct 4, 2022
