
The design of _traverse_apply makes all BaseRecursiveDrivers inefficient #1932

Closed
hanxiao opened this issue Feb 11, 2021 · 22 comments
Labels
area/core This issue/PR affects the core codebase priority/critical Urgent: Security, critical bugs, blocking issues. drop everything until this issue is addressed. status/done This issue is complete. type/bug Something isn't working

Comments

@hanxiao
Member

hanxiao commented Feb 11, 2021

Describe the bug

Highest priority, must solve now.

The current combination of _apply_all and _traverse_apply in BaseRecursiveDriver results in an inefficient recursion, in particular when each document has only a small number of chunks. _apply_all is invoked separately on each of those small groups of documents, leaving the CPU/GPU extremely data-hungry.

This is inefficient. The only reason this problem has not received enough attention is that it only becomes obvious when you have multiple granularities. A single granularity, i.e. working directly at root level, won't reveal it.

This inefficient design also forces further hacky implementations in the drivers. One example is the cache in EncodeDriver, which would be unnecessary if apply_all could work on all defined granularities in one shot. Note that this only temporarily solves the data-hungry problem for EncodeDriver, while all other drivers face the same problem.

Describe how you solve it

  • redesign _traverse_apply, move it to Document
  • _apply_all iterates over the concatenation of iterators in one shot
  • remove hacky implementation in EncodeDriver
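The bullets above can be sketched roughly as follows. This is a minimal illustration of the intended direction only, not Jina's actual API: `Document`, `traverse`, and `apply_all` here are simplified stand-ins.

```python
from typing import Iterable, Iterator, List


class Document:
    """Toy stand-in for a Jina Document with chunks."""
    def __init__(self, content: str, chunks: List['Document'] = None):
        self.content = content
        self.chunks = chunks or []


def traverse(docs: Iterable[Document], path: str) -> Iterator[Document]:
    """Yield all documents on one traversal path ('r' = root, 'c' = chunks)."""
    for doc in docs:
        if path == 'r':
            yield doc
        elif path == 'c':
            yield from doc.chunks


def apply_all(docs: Iterator[Document]) -> int:
    """One-shot apply over the flat iterator: one big batch instead of
    one tiny _apply_all call per document's chunk list."""
    batch = list(docs)
    for d in batch:
        d.content = d.content.upper()   # placeholder for encode/index/...
    return len(batch)


roots = [Document('a', chunks=[Document('a1'), Document('a2')]),
         Document('b', chunks=[Document('b1')])]
n = apply_all(traverse(roots, 'c'))     # all 3 chunks in ONE call
```

The key point is that `traverse` returns a single flat iterator, so `apply_all` sees every chunk of every root document in one batch.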

Environment

Screenshots

@hanxiao hanxiao added type/bug Something isn't working priority/critical Urgent: Security, critical bugs, blocking issues. drop everything until this issue is addressed. labels Feb 11, 2021
@JoanFM
Member

JoanFM commented Feb 11, 2021

Do you mean creating a DocumentSet in advance with all the Documents the Driver should apply to?

Or would promoting the CacheDocumentSet of the Encoder also solve the problem?

@hanxiao
Member Author

hanxiao commented Feb 12, 2021

> Or would promoting the CacheDocumentSet of the Encoder also solve the problem?

No. As I said, this cache in the EncodeDriver is a result of the bad design of traverse_apply; it does not solve the problem the other drivers are facing.

Once this is all done, I will delete this logic from the EncodeDriver, reverting it to the simple EncodeDriver as before. We will rely on the @batching decorator at the executor level, as it should be.

> Do you mean creating a DocumentSet in advance with all the Documents the Driver should apply to?

The following modifications are expected:

- traverse_apply -> move to DocSet, rename to traverse, return one iterator of Document
- apply_all -> decouple from traverse, loop over the iterator
- apply_root -> delete

FYI, this is strongly related to the previous idea I mentioned here: #1326; unfortunately it was never implemented.
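A minimal sketch of these three modifications, assuming simplified stand-in types (`Doc` and `DocSet` are illustrative, not Jina's real classes): `traverse` lives on the set type and returns one iterator over all traversal paths, and the driver's apply step just loops over it.

```python
from itertools import chain
from typing import Iterator, List


class Doc:
    """Toy stand-in for a Jina Document with chunks."""
    def __init__(self, text: str, chunks: List['Doc'] = None):
        self.text = text
        self.chunks = chunks or []


class DocSet(list):
    """Toy stand-in for the proposed DocSet."""
    def traverse(self, paths: List[str]) -> Iterator[Doc]:
        """Return ONE iterator of Documents covering all traversal paths."""
        def _walk(path: str) -> Iterator[Doc]:
            for d in self:
                if path == 'r':        # root level
                    yield d
                elif path == 'c':      # chunk level
                    yield from d.chunks
        return chain.from_iterable(_walk(p) for p in paths)


# apply is now decoupled: the driver just loops over the single iterator
ds = DocSet([Doc('x', [Doc('x1'), Doc('x2')])])
texts = [d.text for d in ds.traverse(['r', 'c'])]   # ['x', 'x1', 'x2']
```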

@hanxiao
Member Author

hanxiao commented Feb 12, 2021

Algorithm-wise and design-wise I don't worry at all.

I worry about the huge number of unit tests that are deeply entangled with the traverse_apply function; those tests will have to be retired and refactored one way or another. Those tests shouldn't restrict our ideas, though.

@JoanFM
Member

JoanFM commented Feb 12, 2021

> > Or would promoting the CacheDocumentSet of the Encoder also solve the problem?
>
> No. As I said, this cache in the EncodeDriver is a result of the bad design of traverse_apply; it does not solve the problem the other drivers are facing.
>
> Once this is all done, I will delete this logic from the EncodeDriver, reverting it to the simple EncodeDriver as before. We will rely on the @batching decorator at the executor level, as it should be.
>
> > Do you mean creating a DocumentSet in advance with all the Documents the Driver should apply to?
>
> The following modifications are expected:
>
> - traverse_apply -> move to DocSet, rename to traverse, return one iterator of Document
> - apply_all -> decouple from traverse, loop over the iterator
> - apply_root -> delete
>
> FYI, this is strongly related to the previous idea I mentioned here: #1326; unfortunately it was never implemented.

Yes, I remember.

This was actually implemented: Document and DocumentSet do have traverse and traverse_rec logic, but they currently take a callback to keep applying.

From what I understand, the signature of DocumentSet.traverse() should change to return another DocumentSet. Then the driver could simply call apply_all with the returned DocumentSet.

@hanxiao
Member Author

hanxiao commented Feb 12, 2021

> From what I understand, the signature of DocumentSet.traverse() should change to return another DocumentSet. Then the driver could simply call apply_all with the returned DocumentSet.

Yes, but returning a DocumentSet is a second step, an interface detail. The very essence of the problem is:

- No matter how we want to traverse, the "traverse" function must return one iterator, not multiple iterators. This will serve as the foundation.
- The "apply" function can then work on this iterator in one shot. This will enable batching and all kinds of fancy stuff.

So the real problem is not about caching, batching, etc. It is that traverse_apply and apply have to be refactored.

Traversing and applying are two different things, and they have to be separated.

The reason for this legacy traverse_apply is that the traversal logic was born before Jina types existed.
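With a single flat iterator as the foundation, batching reduces to grouping that iterator. A hypothetical helper (not Jina code) to illustrate the second bullet:

```python
from itertools import islice
from typing import Iterator, List, TypeVar

T = TypeVar('T')


def batched(it: Iterator[T], size: int) -> Iterator[List[T]]:
    """Group a single flat iterator into fixed-size batches."""
    it = iter(it)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


docs = iter(range(7))                        # stands in for traversed Documents
sizes = [len(b) for b in batched(docs, 3)]   # [3, 3, 1]
```

Because traversal yields one iterator, the batch size is decoupled from how many chunks any single document happens to have.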

@JoanFM
Member

JoanFM commented Feb 12, 2021

> > From what I understand, the signature of DocumentSet.traverse() should change to return another DocumentSet. Then the driver could simply call apply_all with the returned DocumentSet.
>
> Yes, but returning a DocumentSet is a second step, an interface detail. The very essence of the problem is:
>
> - No matter how we want to traverse, the "traverse" function must return one iterator, not multiple iterators. This will serve as the foundation.
> - The "apply" function can then work on this iterator in one shot. This will enable batching and all kinds of fancy stuff.
>
> So the real problem is not about caching, batching, etc. It is that traverse_apply and apply have to be refactored.
>
> Traversing and applying are two different things, and they have to be separated.
>
> The reason for this legacy traverse_apply is that the traversal logic was born before Jina types existed.

Yes, I see.

@JoanFM
Member

JoanFM commented Feb 12, 2021

One problem will appear with Rankers, especially Chunk2DocRankDriver, which needs to keep track of the context_doc it refers to.

@JoanFM
Member

JoanFM commented Feb 12, 2021

If we can have a DocumentSet, or maybe a new type that is a composite of DocumentSet, ChunkSet and MatchSet, we can easily pass all the Documents to the executor while having the Driver keep the knowledge of the ref_document.

@hanxiao
Member Author

hanxiao commented Feb 12, 2021

I would first forget everything about the current design, including context_doc, next_edge, field. They are really legacy code from before Jina types.

Reworking the current implementation may be hard. Let's think about whether context_doc can be obtained another way, for example via ChunkSet.parent_doc or MatchSet.reference_doc. I remember clearly that I added these two properties precisely to replicate context_doc.

Maybe unifying the property names ChunkSet.parent_doc and MatchSet.reference_doc into a single ChunkSet.context_doc and MatchSet.context_doc could be a first step.

@JoanFM
Member

JoanFM commented Feb 12, 2021

Thinking it through, right now Rankers are the only Drivers needing this. And if they receive a DocumentSet, these Drivers can already be responsible for grouping the Documents by context_docs.

Another potential design is to have DocumentSet be a set of Documents at root level (maybe even rename it to RootDocumentSet), with everything else conformed by MatchSet and ChunkSet.

And have a DocumentSet or a new DocumentSuperSet.

It could look something like this:

```python
class DocumentSet:
    """Abstract composite class formed by Iterable[DocumentSet]
    (RootDocumentSet, MatchSet, ChunkSet)"""

    @property
    def docs(self) -> Iterable[Document]:
        """Property used by all Drivers except Rankers"""

    @property
    def docsets(self) -> Iterable['DocumentSet']:
        """Property used by RankerDriver; for each DocumentSet it will
        call the `docs` property with access to a common `ref_doc`"""


class RootDocumentSet(DocumentSet):
    """Set of Documents at root level, extracted directly from the Request"""


class MatchSet(DocumentSet):
    """Set of Documents that are matches of some common ref Document"""


class ChunkSet(DocumentSet):
    """Set of Documents that are chunks of some common ref Document"""
```

If traverse returns a set of DocumentSets, the Driver can very easily access the context_doc (or ref_doc). Does it seem like over-engineering?

@hanxiao
Member Author

hanxiao commented Feb 13, 2021

To clarify the size of the problem, I'm making a table to summarize the refactoring effort required by the new traverse.

The limitation of the new traverse is that it cannot modify the collection (individual-level modification is fine); therefore ops such as del, reverse, sort won't work.
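A small Python illustration of why structural mutation breaks iterator-based traversal (the `traverse` function here is a toy stand-in, not Jina code):

```python
def traverse(docs):
    """Toy iterator-based traversal over the live collection."""
    for d in docs:
        yield d


docs = [1, 2, 3, 4]
seen = []
for d in traverse(docs):
    seen.append(d)
    if d == 1:
        del docs[0]   # structural modification mid-traversal
# element 2 is silently skipped: seen == [1, 3, 4]
```

Because the generator walks the live list by index, deleting an element shifts the remaining items and the traversal silently skips one. Individual-level edits (e.g. setting a field on a yielded document) don't have this problem.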

| Driver | Refactor needed? |
| --- | --- |
| jina.drivers.BaseExecutableDriver | |
| jina.drivers.cache.BaseCacheDriver | No |
| jina.drivers.cache.TaggingCacheDriver | No |
| jina.drivers.convert.Blob2PngURI | No |
| jina.drivers.convert.Buffer2URI | No |
| jina.drivers.convert.BufferImage2Blob | No |
| jina.drivers.convert.ConvertDriver | No |
| jina.drivers.convert.Text2URI | No |
| jina.drivers.convert.URI2Blob | No |
| jina.drivers.convert.URI2Buffer | No |
| jina.drivers.convert.URI2DataURI | No |
| jina.drivers.convert.URI2Text | No |
| jina.drivers.craft.CraftDriver | No |
| jina.drivers.delete.DeleteDriver | No |
| jina.drivers.encode.BaseEncodeDriver | No |
| jina.drivers.encode.EncodeDriver | Done in #1938 |
| jina.drivers.evaluate.BaseEvaluateDriver | No |
| jina.drivers.evaluate.FieldEvaluateDriver | No |
| jina.drivers.evaluate.LoadGroundTruthDriver | Yes, because it uses `del self.docs[j]` |
| jina.drivers.evaluate.NDArrayEvaluateDriver | No |
| jina.drivers.evaluate.RankEvaluateDriver | No |
| jina.drivers.evaluate.TextEvaluateDriver | No |
| jina.drivers.index.BaseIndexDriver | No |
| jina.drivers.index.KVIndexDriver | No |
| jina.drivers.index.VectorIndexDriver | No |
| jina.drivers.multimodal.MultiModalDriver | No |
| jina.drivers.predict.BaseLabelPredictDriver | No |
| jina.drivers.predict.BasePredictDriver | No |
| jina.drivers.predict.BinaryPredictDriver | No |
| jina.drivers.predict.MultiLabelPredictDriver | No |
| jina.drivers.predict.OneHotPredictDriver | No |
| jina.drivers.predict.Prediction2DocBlobDriver | No |
| jina.drivers.querylang.filter.FilterQL | Yes, because of `del docs[j]` |
| jina.drivers.querylang.reverse.ReverseQL | Yes, because of `docs.reverse()` |
| jina.drivers.querylang.select.ExcludeQL | No |
| jina.drivers.querylang.select.ExcludeReqQL | No |
| jina.drivers.querylang.select.SelectQL | No |
| jina.drivers.querylang.select.SelectReqQL | No |
| jina.drivers.querylang.slice.SliceQL | No |
| jina.drivers.querylang.sort.SortQL | Yes, because of `del docs[int(self.end):]` |
| jina.drivers.rank.BaseRankDriver | No |
| jina.drivers.rank.Matches2DocRankDriver | Yes, because of the usage of context_doc |
| jina.drivers.rank.aggregate.AggregateMatches2DocRankDriver | Yes, because of the usage of context_doc |
| jina.drivers.rank.aggregate.BaseAggregateMatchesRanker | Yes, because of the usage of context_doc |
| jina.drivers.rank.aggregate.Chunk2DocRankDriver | Yes, because of the usage of context_doc |
| jina.drivers.reduce.CollectEvaluationDriver | Yes, because of the usage of context_doc |
| jina.drivers.reduce.ConcatEmbedDriver | Yes, because of the usage of context_doc |
| jina.drivers.reduce.ReduceAllDriver | Yes, because of the usage of context_doc |
| jina.drivers.search.BaseSearchDriver | No |
| jina.drivers.search.KVSearchDriver | Yes, because of `del docs[j]` |
| jina.drivers.search.VectorFillDriver | No |
| jina.drivers.search.VectorSearchDriver | No |
| jina.drivers.segment.SegmentDriver | No |

@JoanFM
Member

JoanFM commented Feb 13, 2021

Could a solution similar to the one I proposed above be helpful? Having a DocumentSet aware of the subsets it contains, and thus being able to refer properly to a context_doc, or to delete and reverse only specific subsets of the set?

@hanxiao
Member Author

hanxiao commented Feb 13, 2021

In #1938 I decided to use FastRecursiveMixin and only add this mixin to the drivers that really need it, e.g. EncodeDriver, IndexDriver, PredictDriver. After all, they are the ones that need this fast operation, as they call docs.all_embedding, docs.all_content.

Not a universal solution for sure, but good for the time being, allowing us to move on with cross-/multi-modality search.

@JoanFM
Member

JoanFM commented Feb 13, 2021

> In #1938 I decided to use FastRecursiveMixin and only add this mixin to the drivers that really need it, e.g. EncodeDriver, IndexDriver, PredictDriver. After all, they are the ones that need this fast operation, as they call docs.all_embedding, docs.all_content.

But for the sake of completeness, wouldn't it be cleaner to use the same concept in other Drivers such as CraftDriver, so that we can avoid having the traversal logic duplicated in drivers and types?

@hanxiao
Member Author

hanxiao commented Feb 13, 2021

true, but I will leave this to the team.

@nan-wang
Member

> > In #1938 I decided to use FastRecursiveMixin and only add this mixin to the drivers that really need it, e.g. EncodeDriver, IndexDriver, PredictDriver. After all, they are the ones that need this fast operation, as they call docs.all_embedding, docs.all_content.
>
> But for the sake of completeness, wouldn't it be cleaner to use the same concept in other Drivers such as CraftDriver, so that we can avoid having the traversal logic duplicated in drivers and types?

Basically, after this refactoring, _traverse_apply can be removed from BaseRecursiveDriver.

@nan-wang
Member

BaseEvaluateDriver needs refactoring as well, right? Because it is using _traverse_apply().

@JoanFM
Member

JoanFM commented Feb 14, 2021

> BaseEvaluateDriver needs refactoring as well, right? Because it is using _traverse_apply().

I opened a PR doing some refactoring in #1939.

@maximilianwerk
Member

maximilianwerk commented Feb 15, 2021

I don't believe we need the parent_doc functionality. We could give the rank driver just the parent Document, and the driver would reach through to the child elements.

When we use the Chunk2DocRankDriver, we could just change the former traversal_paths: ['c'] to ['r'] and remove the context_doc. In the end, a driver only needs to know where the "anchor" of the task-to-be-performed is. We used c for historical reasons, but apparently r as an anchor makes way more sense: it would define the Document towards which results should be propagated.

As a consequence, we would directly get a better name for the Chunk2DocRankDriver: something along the lines of ChunkAggregationRankDriver.

@maximilianwerk
Member

I cannot yet exactly justify why, but I would love to get one iterator per traversal_path, such that only documents on one defined path are batched together. For an encoder there is no difference, but for a ranker it feels dangerous to mix Documents from different levels into one batch.
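The per-path variant suggested here could look like the following sketch (the function name and dict-based documents are illustrative assumptions, not Jina's API): one iterator per traversal path, so a batch never mixes levels.

```python
from typing import Dict, Iterator, List


def traverse_per_path(docs: List[dict], paths: List[str]) -> Dict[str, Iterator[dict]]:
    """Return one iterator PER traversal path, so batches never mix levels."""
    def _walk(path: str) -> Iterator[dict]:
        for d in docs:
            if path == 'r':
                yield d
            elif path == 'c':
                yield from d.get('chunks', [])
    return {p: _walk(p) for p in paths}


docs = [{'id': 'a', 'chunks': [{'id': 'a1'}]}]
per_path = traverse_per_path(docs, ['r', 'c'])
root_ids = [d['id'] for d in per_path['r']]    # roots only
chunk_ids = [d['id'] for d in per_path['c']]   # chunks only
```

An encoder could still chain these iterators into one batch, while a ranker would consume each path separately.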

@FionnD FionnD added status/in-progess This issue is activity being worked on. area/core This issue/PR affects the core codebase labels Feb 26, 2021
@maximilianwerk maximilianwerk self-assigned this Mar 1, 2021
@maximilianwerk
Member

Initial work is finished. Further steps are taken here: #2097

@FionnD
Contributor

FionnD commented Mar 10, 2021

Since further steps are in #2097, can I close this issue @hanxiao?

@hanxiao hanxiao closed this as completed Mar 10, 2021
@FionnD FionnD added status/done This issue is complete. and removed status/in-progess This issue is activity being worked on. labels Mar 12, 2021