the design of _traverse_apply makes all BaseRecursiveDriver low-efficient #1932
Comments
Do you mean creating, in advance, a DocumentSet with all the Documents the Driver should apply to? Or would promoting the CacheDocumentSet of the Encoder also solve the problem?
no, as I said this cacheencodedriver is a result of the bad design of After all done, i will delete this logic from the EncodeDriver, reverting EncodeDriver to the simple EncodeDriver as before. We will rely on
The following modification is expected:
FYI, this is strongly related to the idea I mentioned earlier in #1326. Unfortunately, it was never implemented.
algorithm logic-wise & design-wise i do not worry at all. I worry about the huge number of unit tests that deeply tangled with the |
Yes, I remember. This was actually implemented, For what I understand, the signature of |
Yes, but returning a DocumentSet is the second step, an interface matter. The very essence of this problem is that:
So the real problem is not about caching, batching, etc. It is about the Traversing and applying are two different things and they have to be separated. The reason of this legacy |
Yes, I see.
One problem will appear with Rankers, especially Chunk2DocRankDriver, which does need to keep track of the context_doc it refers to.
If we can have a DocumentSet, or maybe a new type that is a composite of DocumentSet, ChunkSet and MatchSets, we can easily pass all the Documents to the executor while the Driver keeps the knowledge about the ref_document.
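A minimal sketch of that idea, assuming a traversal that yields (parent, chunk) pairs so the chunks can be flattened for the executor while the parent reference is kept for ranking. The Doc class, its score field and traverse_with_context are hypothetical names used for illustration only; they are not Jina's API.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a document; not Jina's Document class."""
    doc_id: str
    score: float = 0.0
    chunks: List["Doc"] = field(default_factory=list)


def traverse_with_context(docs: List[Doc]) -> Iterator[Tuple[Doc, Doc]]:
    """Yield each chunk together with the parent document it belongs to."""
    for parent in docs:
        for chunk in parent.chunks:
            yield parent, chunk


docs = [
    Doc("a", chunks=[Doc("a0", 0.2), Doc("a1", 0.9)]),
    Doc("b", chunks=[Doc("b0", 0.5)]),
]
pairs = list(traverse_with_context(docs))

# One flat batch that could be handed to an executor in a single call.
flat_chunks = [chunk for _, chunk in pairs]

# The parent reference survives, so chunk scores can be aggregated per document.
best: Dict[str, float] = defaultdict(float)
for parent, chunk in pairs:
    best[parent.doc_id] = max(best[parent.doc_id], chunk.score)

print(dict(best))  # {'a': 0.9, 'b': 0.5}
```

Taking the maximum chunk score per parent is only one possible aggregation; it is used here to show that the grouping is still recoverable after flattening.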
I would first forget everything about the current design, including the Regressing the current implementation may be hard. Let's think about if Maybe unifying the name of the properties |
Thinking it through, right now Another potential design is to have And have a It could look as something like this:
If the traverse obtains a set of DocumentSet, the Driver can very easily access the |
To clarify the size of the problem, I'm making a table to summarize the effort of big refactoring due to the new The limitation of the new
Could a solution similar to the one I proposed above be helpful? Having a DocumentSet aware of the subsets it contains? Thus being able to refer properly to a |
In #1938 I decided to use Not a universal solution for sure, but good for the time being. It allows one to move on with the cross/multi-modality search.
But for the sake of completion, it would be cleaner to use the same concept in other Drivers as |
True, but I will leave this to the team.
Basically, after this refactoring, |
I opened a PR doing some refactoring in #1939.
I don't believe, we need the When we use the As a consequence, we would directly get a better naming for the |
I cannot yet exactly justify why, but I would love to get one iterator per |
Initial work is finished. Further steps are taken here: #2097
Describe the bug
Highest priority, must solve now.
The current design combination of _apply_all and _traverse_apply in BaseRecursiveDriver results in inefficient recursion, in particular when each document has only a small number of chunks. _apply_all is then applied only to those small documents, one at a time, leaving the CPU/GPU extremely data-hungry. The only reason this problem has not received enough attention is that it only becomes obvious when you have multiple granularities. A single granularity, i.e. working directly on the root level, won't reveal it.
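To illustrate the problem, here is a minimal sketch of a traversal that is coupled to the apply step. The Doc class and the traverse_apply function are simplified, hypothetical stand-ins and not the actual BaseRecursiveDriver code; they only show how the batch handed to the apply step shrinks to the number of chunks per document.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document; not Jina's Document class."""
    text: str
    chunks: List["Doc"] = field(default_factory=list)


def traverse_apply(docs: List[Doc],
                   apply_all: Callable[[List[Doc]], None],
                   depth: int) -> None:
    """Coupled traversal: apply_all runs once per parent at the target depth."""
    if depth == 0:
        apply_all(docs)
        return
    for doc in docs:
        traverse_apply(doc.chunks, apply_all, depth - 1)


# Four root documents with two chunks each.
roots = [Doc(f"d{i}", [Doc(f"d{i}c{j}") for j in range(2)]) for i in range(4)]

# Record how many documents each apply_all call receives.
batch_sizes: List[int] = []
traverse_apply(roots, lambda batch: batch_sizes.append(len(batch)), depth=1)

print(batch_sizes)  # [2, 2, 2, 2]
```

With eight chunks in total, the apply step is invoked four times with two chunks each, instead of once with all eight.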
This inefficient design also leads to further hacky driver implementations. One example is the Cache in EncodeDriver, which is unnecessary if _apply_all can work on all defined granularities in one shot. Note that the cache temporarily solves the data-hungry problem for EncodeDriver, but all other drivers still face the same problem.
Describe how you solve it
_traverse_apply: move it to Document
_apply_all iterates over the concatenation of iterators in one shot
EncodeDriver
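The steps above could look roughly like the following sketch, assuming the traversal lives on the document as an iterator and the apply step consumes the concatenation of all per-document iterators in a single call. Doc, Doc.traverse and apply_all are hypothetical names for illustration, not the implementation that was eventually merged.

```python
from dataclasses import dataclass, field
from itertools import chain
from typing import Iterator, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document; not Jina's Document class."""
    text: str
    chunks: List["Doc"] = field(default_factory=list)

    def traverse(self, depth: int) -> Iterator["Doc"]:
        """Yield every descendant exactly `depth` levels below this document."""
        if depth == 0:
            yield self
        else:
            for chunk in self.chunks:
                yield from chunk.traverse(depth - 1)


def apply_all(batch: List[Doc]) -> List[str]:
    """Hypothetical executor call that sees the whole batch at once."""
    return [doc.text.upper() for doc in batch]


roots = [Doc(f"d{i}", [Doc(f"d{i}c{j}") for j in range(2)]) for i in range(4)]

# Traversing only collects documents; applying happens once, afterwards.
flat = list(chain.from_iterable(root.traverse(1) for root in roots))
result = apply_all(flat)

print(len(flat))  # 8
```

Here traversing and applying are separate steps, so the batch size no longer depends on how many chunks a single document has.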
Environment
Screenshots