
The design of _traverse_apply makes all BaseRecursiveDrivers inefficient #1932

Closed
hanxiao opened this issue Feb 11, 2021 · 22 comments
Labels
area/core This issue/PR affects the core codebase priority/critical Urgent: Security, critical bugs, blocking issues. drop everything until this issue is addressed. status/done This issue is complete. type/bug Something isn't working

Comments

@hanxiao
Member

hanxiao commented Feb 11, 2021

Describe the bug

Highest priority, must solve now.

The current combination of _apply_all and _traverse_apply in BaseRecursiveDriver results in an inefficient recursion, in particular when each document has only a small number of chunks. _apply_all is invoked separately on each of those small groups of documents, leaving the CPU/GPU extremely data-hungry.

This is inefficient. The only reason this problem has not received enough attention is that it only becomes obvious when you have multiple granularities. A single granularity, i.e. working directly at root level, won't reveal it.

This inefficient design also forces further hacky implementations in the drivers. One example is the cache in EncodeDriver, which would be unnecessary if apply_all could work on all defined granularities in one shot. Note that this only temporarily solves the data-hungry problem for EncodeDriver, while all other drivers face the same problem.

Describe how you solve it

  • redesign _traverse_apply, move it to Document
  • _apply_all iterates over the concatenation of iterators in one shot
  • remove hacky implementation in EncodeDriver
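The bullets above can be sketched roughly as follows. This is a minimal illustration of the intended direction only, not Jina's actual API: `Document`, `traverse`, and `apply_all` here are simplified stand-ins.

```python
from typing import Iterable, Iterator, List


class Document:
    """Toy stand-in for a Jina Document with chunks."""
    def __init__(self, content: str, chunks: List['Document'] = None):
        self.content = content
        self.chunks = chunks or []


def traverse(docs: Iterable[Document], path: str) -> Iterator[Document]:
    """Yield all documents on one traversal path ('r' = root, 'c' = chunks)."""
    for doc in docs:
        if path == 'r':
            yield doc
        elif path == 'c':
            yield from doc.chunks


def apply_all(docs: Iterator[Document]) -> int:
    """One-shot apply over the flat iterator: one big batch instead of
    one tiny _apply_all call per document's chunk list."""
    batch = list(docs)
    for d in batch:
        d.content = d.content.upper()   # placeholder for encode/index/...
    return len(batch)


roots = [Document('a', chunks=[Document('a1'), Document('a2')]),
         Document('b', chunks=[Document('b1')])]
n = apply_all(traverse(roots, 'c'))     # all 3 chunks in ONE call
```

The key point is that `traverse` returns a single flat iterator, so `apply_all` sees every chunk of every root document in one batch.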

Environment

Screenshots

@hanxiao hanxiao added type/bug Something isn't working priority/critical Urgent: Security, critical bugs, blocking issues. drop everything until this issue is addressed. labels Feb 11, 2021
@JoanFM
Member

JoanFM commented Feb 11, 2021

Do you mean creating a DocumentSet in advance with all the Documents the Driver should apply to?

Or would promoting the CacheDocumentSet of the Encoder also solve the problem?

@hanxiao
Member Author

hanxiao commented Feb 12, 2021

> Or would promoting the CacheDocumentSet of the Encoder also solve the problem?

No. As I said, this cache in the EncodeDriver is a result of the bad design of traverse_apply; it does not solve the problem the other drivers are facing.

Once this is all done, I will delete this logic from the EncodeDriver, reverting it to the simple EncodeDriver as before. We will rely on the @batching decorator at the executor level, as it should be.

> Do you mean creating a DocumentSet in advance with all the Documents the Driver should apply to?

The following modifications are expected:

- traverse_apply -> move to DocSet, rename to traverse, return one iterator of Document
- apply_all -> decouple from traverse, loop over the iterator
- apply_root -> delete

FYI, this is strongly related to the previous idea I mentioned here: #1326; unfortunately it was never implemented.
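A minimal sketch of these three modifications, assuming simplified stand-in types (`Doc` and `DocSet` are illustrative, not Jina's real classes): `traverse` lives on the set type and returns one iterator over all traversal paths, and the driver's apply step just loops over it.

```python
from itertools import chain
from typing import Iterator, List


class Doc:
    """Toy stand-in for a Jina Document with chunks."""
    def __init__(self, text: str, chunks: List['Doc'] = None):
        self.text = text
        self.chunks = chunks or []


class DocSet(list):
    """Toy stand-in for the proposed DocSet."""
    def traverse(self, paths: List[str]) -> Iterator[Doc]:
        """Return ONE iterator of Documents covering all traversal paths."""
        def _walk(path: str) -> Iterator[Doc]:
            for d in self:
                if path == 'r':        # root level
                    yield d
                elif path == 'c':      # chunk level
                    yield from d.chunks
        return chain.from_iterable(_walk(p) for p in paths)


# apply is now decoupled: the driver just loops over the single iterator
ds = DocSet([Doc('x', [Doc('x1'), Doc('x2')])])
texts = [d.text for d in ds.traverse(['r', 'c'])]   # ['x', 'x1', 'x2']
```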

@hanxiao
Member Author

hanxiao commented Feb 12, 2021

Algorithm-wise and design-wise I don't worry at all.

I worry about the huge number of unit tests that are deeply entangled with the traverse_apply function; those tests will have to be retired and refactored one way or another. Those tests shouldn't restrict our ideas, though.

@JoanFM
Member

JoanFM commented Feb 12, 2021

> > Or would promoting the CacheDocumentSet of the Encoder also solve the problem?
>
> No. As I said, this cache in the EncodeDriver is a result of the bad design of traverse_apply; it does not solve the problem the other drivers are facing.
>
> Once this is all done, I will delete this logic from the EncodeDriver, reverting it to the simple EncodeDriver as before. We will rely on the @batching decorator at the executor level, as it should be.
>
> > Do you mean creating a DocumentSet in advance with all the Documents the Driver should apply to?
>
> The following modifications are expected:
>
> - traverse_apply -> move to DocSet, rename to traverse, return one iterator of Document
> - apply_all -> decouple from traverse, loop over the iterator
> - apply_root -> delete
>
> FYI, this is strongly related to the previous idea I mentioned here: #1326; unfortunately it was never implemented.

Yes, I remember.

This was actually implemented: Document and DocumentSet do have traverse and traverse_rec logic, but they currently take a callback to keep applying.

From what I understand, the signature of DocumentSet.traverse() should change to return another DocumentSet. Then the driver could simply call apply_all with the returned DocumentSet.

@hanxiao
Member Author

hanxiao commented Feb 12, 2021

> From what I understand, the signature of DocumentSet.traverse() should change to return another DocumentSet. Then the driver could simply call apply_all with the returned DocumentSet.

Yes, but returning a DocumentSet is a second step, an interface detail. The very essence of the problem is:

- No matter how we want to traverse, the "traverse" function must return one iterator, not multiple iterators. This will serve as the foundation.
- The "apply" function can then work on this iterator in one shot. This will enable batching and all kinds of fancy stuff.

So the real problem is not about caching, batching, etc. It is that traverse_apply and apply have to be refactored.

Traversing and applying are two different things, and they have to be separated.

The reason for this legacy traverse_apply is that the traversal logic was born before Jina types existed.
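With a single flat iterator as the foundation, batching reduces to grouping that iterator. A hypothetical helper (not Jina code) to illustrate the second bullet:

```python
from itertools import islice
from typing import Iterator, List, TypeVar

T = TypeVar('T')


def batched(it: Iterator[T], size: int) -> Iterator[List[T]]:
    """Group a single flat iterator into fixed-size batches."""
    it = iter(it)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


docs = iter(range(7))                        # stands in for traversed Documents
sizes = [len(b) for b in batched(docs, 3)]   # [3, 3, 1]
```

Because traversal yields one iterator, the batch size is decoupled from how many chunks any single document happens to have.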

@JoanFM
Member

JoanFM commented Feb 12, 2021

> > From what I understand, the signature of DocumentSet.traverse() should change to return another DocumentSet. Then the driver could simply call apply_all with the returned DocumentSet.
>
> Yes, but returning a DocumentSet is a second step, an interface detail. The very essence of the problem is:
>
> - No matter how we want to traverse, the "traverse" function must return one iterator, not multiple iterators. This will serve as the foundation.
> - The "apply" function can then work on this iterator in one shot. This will enable batching and all kinds of fancy stuff.
>
> So the real problem is not about caching, batching, etc. It is that traverse_apply and apply have to be refactored.
>
> Traversing and applying are two different things, and they have to be separated.
>
> The reason for this legacy traverse_apply is that the traversal logic was born before Jina types existed.

Yes, I see.

@JoanFM
Member

JoanFM commented Feb 12, 2021

One problem will appear with Rankers, especially Chunk2DocRankDriver, which needs to keep track of the context_doc it refers to.

@JoanFM
Member

JoanFM commented Feb 12, 2021

If we can have a DocumentSet, or maybe a new type that is a composite of DocumentSet, ChunkSet and MatchSet, we can easily pass all the Documents to the executor while having the Driver keep the knowledge of the ref_document.

@hanxiao
Member Author

hanxiao commented Feb 12, 2021

I would first forget everything about the current design, including context_doc, next_edge, field. They are really legacy code from before Jina types.

Reworking the current implementation may be hard. Let's think about whether context_doc can be obtained another way, for example via ChunkSet.parent_doc or MatchSet.reference_doc. I remember clearly that I added these two properties precisely to replicate context_doc.

Maybe unifying the property names ChunkSet.parent_doc and MatchSet.reference_doc into a single ChunkSet.context_doc and MatchSet.context_doc could be a first step.

@JoanFM
Member

JoanFM commented Feb 12, 2021

Thinking it through, right now Rankers are the only Drivers needing this. And if they receive a DocumentSet, these Drivers can already be responsible for grouping the Documents by context_docs.

Another potential design is to have DocumentSet be a set of Documents at root level (maybe even rename it to RootDocumentSet), with everything else conformed by MatchSet and ChunkSet.

And have a DocumentSet or a new DocumentSuperSet.

It could look something like this:

```python
class DocumentSet:
    """Abstract composite class formed by Iterable[DocumentSet]
    (RootDocumentSet, MatchSet, ChunkSet)"""

    @property
    def docs(self) -> Iterable[Document]:
        """Property used by all Drivers except Rankers"""

    @property
    def docsets(self) -> Iterable['DocumentSet']:
        """Property used by RankerDriver; for each DocumentSet it will
        call the `docs` property with access to a common `ref_doc`"""


class RootDocumentSet(DocumentSet):
    """Set of Documents at root level, extracted directly from the Request"""


class MatchSet(DocumentSet):
    """Set of Documents that are matches of some common ref Document"""


class ChunkSet(DocumentSet):
    """Set of Documents that are chunks of some common ref Document"""
```

If traverse returns a set of DocumentSets, the Driver can very easily access the context_doc (or ref_doc). Does it seem like over-engineering?

@hanxiao
Member Author

hanxiao commented Feb 13, 2021

To clarify the size of the problem, I'm making a table to summarize the refactoring effort required by the new traverse.

The limitation of the new traverse is that it cannot modify the collection (individual-level modification is fine); therefore ops such as del, reverse, sort won't work.
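A small Python illustration of why structural mutation breaks iterator-based traversal (the `traverse` function here is a toy stand-in, not Jina code):

```python
def traverse(docs):
    """Toy iterator-based traversal over the live collection."""
    for d in docs:
        yield d


docs = [1, 2, 3, 4]
seen = []
for d in traverse(docs):
    seen.append(d)
    if d == 1:
        del docs[0]   # structural modification mid-traversal
# element 2 is silently skipped: seen == [1, 3, 4]
```

Because the generator walks the live list by index, deleting an element shifts the remaining items and the traversal silently skips one. Individual-level edits (e.g. setting a field on a yielded document) don't have this problem.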

| Driver | Refactor needed? |
| --- | --- |
| jina.drivers.BaseExecutableDriver | |
| jina.drivers.cache.BaseCacheDriver | No |
| jina.drivers.cache.TaggingCacheDriver | No |
| jina.drivers.convert.Blob2PngURI | No |
| jina.drivers.convert.Buffer2URI | No |
| jina.drivers.convert.BufferImage2Blob | No |
| jina.drivers.convert.ConvertDriver | No |
| jina.drivers.convert.Text2URI | No |
| jina.drivers.convert.URI2Blob | No |
| jina.drivers.convert.URI2Buffer | No |
| jina.drivers.convert.URI2DataURI | No |
| jina.drivers.convert.URI2Text | No |
| jina.drivers.craft.CraftDriver | No |
| jina.drivers.delete.DeleteDriver | No |
| jina.drivers.encode.BaseEncodeDriver | No |
| jina.drivers.encode.EncodeDriver | Done in #1938 |
| jina.drivers.evaluate.BaseEvaluateDriver | No |
| jina.drivers.evaluate.FieldEvaluateDriver | No |
| jina.drivers.evaluate.LoadGroundTruthDriver | Yes, because it uses `del self.docs[j]` |
| jina.drivers.evaluate.NDArrayEvaluateDriver | No |
| jina.drivers.evaluate.RankEvaluateDriver | No |
| jina.drivers.evaluate.TextEvaluateDriver | No |
| jina.drivers.index.BaseIndexDriver | No |
| jina.drivers.index.KVIndexDriver | No |
| jina.drivers.index.VectorIndexDriver | No |
| jina.drivers.multimodal.MultiModalDriver | No |
| jina.drivers.predict.BaseLabelPredictDriver | No |
| jina.drivers.predict.BasePredictDriver | No |
| jina.drivers.predict.BinaryPredictDriver | No |
| jina.drivers.predict.MultiLabelPredictDriver | No |
| jina.drivers.predict.OneHotPredictDriver | No |
| jina.drivers.predict.Prediction2DocBlobDriver | No |
| jina.drivers.querylang.filter.FilterQL | Yes, because of `del docs[j]` |
| jina.drivers.querylang.reverse.ReverseQL | Yes, because of `docs.reverse()` |
| jina.drivers.querylang.select.ExcludeQL | No |
| jina.drivers.querylang.select.ExcludeReqQL | No |
| jina.drivers.querylang.select.SelectQL | No |
| jina.drivers.querylang.select.SelectReqQL | No |
| jina.drivers.querylang.slice.SliceQL | No |
| jina.drivers.querylang.sort.SortQL | Yes, because of `del docs[int(self.end):]` |
| jina.drivers.rank.BaseRankDriver | No |
| jina.drivers.rank.Matches2DocRankDriver | Yes, because of the usage of context_doc |
| jina.drivers.rank.aggregate.AggregateMatches2DocRankDriver | Yes, because of the usage of context_doc |
| jina.drivers.rank.aggregate.BaseAggregateMatchesRanker | Yes, because of the usage of context_doc |
| jina.drivers.rank.aggregate.Chunk2DocRankDriver | Yes, because of the usage of context_doc |
| jina.drivers.reduce.CollectEvaluationDriver | Yes, because of the usage of context_doc |
| jina.drivers.reduce.ConcatEmbedDriver | Yes, because of the usage of context_doc |
| jina.drivers.reduce.ReduceAllDriver | Yes, because of the usage of context_doc |
| jina.drivers.search.BaseSearchDriver | No |
| jina.drivers.search.KVSearchDriver | Yes, because of `del docs[j]` |
| jina.drivers.search.VectorFillDriver | No |
| jina.drivers.search.VectorSearchDriver | No |
| jina.drivers.segment.SegmentDriver | No |

@JoanFM
Member

JoanFM commented Feb 13, 2021

Could a solution similar to the one I proposed above be helpful? Having a DocumentSet aware of the subsets it contains, and thus being able to refer properly to a context_doc, or to delete and reverse only specific subsets of the set?

@hanxiao
Member Author

hanxiao commented Feb 13, 2021

In #1938 I decided to use FastRecursiveMixin and only add this mixin to the drivers that really need it, e.g. EncodeDriver, IndexDriver, PredictDriver. After all, they are the ones that need this fast operation, as they call docs.all_embedding, docs.all_content.

Not a universal solution for sure, but good for the time being, allowing us to move on with cross-/multi-modality search.

@JoanFM
Member

JoanFM commented Feb 13, 2021

> In #1938 I decided to use FastRecursiveMixin and only add this mixin to the drivers that really need it, e.g. EncodeDriver, IndexDriver, PredictDriver. After all, they are the ones that need this fast operation, as they call docs.all_embedding, docs.all_content.

But for the sake of completeness, wouldn't it be cleaner to use the same concept in other Drivers such as CraftDriver, so that we can avoid having the traversal logic duplicated in drivers and types?

@hanxiao
Member Author

hanxiao commented Feb 13, 2021

true, but I will leave this to the team.

@nan-wang
Member

> > In #1938 I decided to use FastRecursiveMixin and only add this mixin to the drivers that really need it, e.g. EncodeDriver, IndexDriver, PredictDriver. After all, they are the ones that need this fast operation, as they call docs.all_embedding, docs.all_content.
>
> But for the sake of completeness, wouldn't it be cleaner to use the same concept in other Drivers such as CraftDriver, so that we can avoid having the traversal logic duplicated in drivers and types?

Basically, after this refactoring, _traverse_apply can be removed from BaseRecursiveDriver.

@nan-wang
Member

BaseEvaluateDriver needs refactoring as well, right? Because it is using _traverse_apply().

@JoanFM
Member

JoanFM commented Feb 14, 2021

> BaseEvaluateDriver needs refactoring as well, right? Because it is using _traverse_apply().

I opened a PR doing some refactoring in #1939.

@maximilianwerk
Member

maximilianwerk commented Feb 15, 2021

I don't believe we need the parent_doc functionality. We could give the rank driver just the parent Document, and the driver would reach through to the child elements.

When we use the Chunk2DocRankDriver, we could just change the former traversal_paths: ['c'] to ['r'] and remove the context_doc. In the end, a driver only needs to know where the "anchor" of the task-to-be-performed is. We used c for historical reasons, but apparently r as an anchor makes way more sense: it would define the Document towards which results should be propagated.

As a consequence, we would directly get a better name for the Chunk2DocRankDriver: something along the lines of ChunkAggregationRankDriver.

@maximilianwerk
Member

I cannot yet exactly justify why, but I would love to get one iterator per traversal_path, such that only documents on one defined path are batched together. For an encoder there is no difference, but for a ranker it feels dangerous to mix Documents from different levels into one batch.
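The per-path variant suggested here could look like the following sketch (the function name and dict-based documents are illustrative assumptions, not Jina's API): one iterator per traversal path, so a batch never mixes levels.

```python
from typing import Dict, Iterator, List


def traverse_per_path(docs: List[dict], paths: List[str]) -> Dict[str, Iterator[dict]]:
    """Return one iterator PER traversal path, so batches never mix levels."""
    def _walk(path: str) -> Iterator[dict]:
        for d in docs:
            if path == 'r':
                yield d
            elif path == 'c':
                yield from d.get('chunks', [])
    return {p: _walk(p) for p in paths}


docs = [{'id': 'a', 'chunks': [{'id': 'a1'}]}]
per_path = traverse_per_path(docs, ['r', 'c'])
root_ids = [d['id'] for d in per_path['r']]    # roots only
chunk_ids = [d['id'] for d in per_path['c']]   # chunks only
```

An encoder could still chain these iterators into one batch, while a ranker would consume each path separately.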

@FionnD FionnD added status/in-progess This issue is activity being worked on. area/core This issue/PR affects the core codebase labels Feb 26, 2021
@maximilianwerk maximilianwerk self-assigned this Mar 1, 2021
@maximilianwerk
Member

Initial work is finished. Further steps are taken here: #2097

@FionnD
Contributor

FionnD commented Mar 10, 2021

Since further steps are in #2097, can I close this issue @hanxiao?

@hanxiao hanxiao closed this as completed Mar 10, 2021
@FionnD FionnD added status/done This issue is complete. and removed status/in-progess This issue is activity being worked on. labels Mar 12, 2021