-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(driver): add the extract-apply-update pattern #2313
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2313 +/- ##
=========================================
Coverage ? 90.91%
=========================================
Files ? 222
Lines ? 11792
Branches ? 0
=========================================
Hits ? 10721
Misses ? 1071
Partials ? 0
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Latency summaryCurrent PR yields:
Breakdown
Backed by latency-tracking. Further commits will update this comment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had a chat with @maximilianwerk and he came with a proposal that may solve many of these aspects.
He would like to present it soon and it may make this change not needed. If possible, let's see if we can discuss that before merging this.
are you talking about a proposal or a working PR? I don't see any ongoing work on this direction. If we are talking about a rough idea without concrete implementation then I'm afraid I can't wait. I will merge it after tmr fireside gathering, to save time i will only answer question you raise as i think the PR description is concrete enough |
an idea proposal. |
It is more then just a rough idea. Anyhow, it does not conflict this PR but rather a further advancement. Let me show concrete code to you in tomorrows fireside. Apart from that: I like the binding of the function signature to the extracted data a lot. Great idea! |
@@ -397,6 +399,162 @@ def _apply_all( | |||
""" | |||
|
|||
|
|||
class DocsExtractUpdateMixin: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes a lot of parameters of the drivers, maybe it should be directly a Driver itself? And then they can directly inherit from it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it should be directly a Driver itself?
i thought about that, but the side-effect on making it default behavior at BaseExecutorDriver
is not clear to me and maybe an overshoot for this PR.
well to me the point would be more about to have less changes and refactors |
Note that this PR only requires simple follow-up changes on hub executors: i.e. changing their |
I meant changes for people with custom executors, or is it backcompatible? |
It will be a breaking change for Hub & executor developers, but it is a good & necessary breaking change that leads executor to better readability and maintainability. The old function signature is not usable anymore. There is already a detailed error message to guide people to migrate. |
While is handy, I am not a big fan of coupling the If I |
Breaking Change #2281
Background
As presented on April 7 2021 Engineering Show & Tell, and private discussions with @bwanglzu @JoanFM and @florian-hoenicke we aim to solve the usability problems when writing a new
Executor
. To recap, the usability problem can be specified as three subproblems when a user tries to create a newExecutor
:Executor
to inherit.Scope of this PR
This PR solves Problem 3, Problem 2 is solved for free (we have it supported before this PR). It also sheds a light on Problem 1. While reviewing this PR, please focus on Problem 3, as it is my major & original intention.
This PR presents a new Driver mixin class
DocsExtractUpdateMixin
that summarizes the pattern of extracting -> applying -> updating. Specifically,exec_fn
's arguments;Document
types;exec_fn
;Document
with the the result returned fromexec_fn
.Understanding Problem 3
Take
BaseEncoder
as the example, the "core" logic functionencode
has the following specification:It is impossible to tell what
data
argument is by just looking at this signature. The same problem also appears inBaseClassifier
,BaseIndexer
, etc.It is impossible for users to know what
data
,key
,values
,keys
actually mean.Reason behind Problem 3
The reason behind is that:
Driver
describes what those arguments in the executor means. The extraction logic is written inside theDriver
.Driver
has much less visibility thanExecutor
to users.Executor
perspective, one can hardly guess the argument without reading the code of theDriver
.The following figure describes this problem.
General Idea of the Solution
As presented on April 7 2021 Engineering Show & Tell, and private discussions with @bwanglzu @JoanFM and @florian-hoenicke, the solution is to implement a contract between the
Document
attributes andExecutor
's logic function; and removing the customized extraction logic from the Driver. Simply put, the executor's logic function must name arguments as validDocument
attributes. (At the current version, this means the name of the arguments must come from{'parent_id', 'content_type', 'siblings', 'mime_type', 'chunks', 'granularity', 'buffer', 'tags', 'level_name', 'proto', 'offset', 'modality', 'text', 'uri', 'matches', 'evaluations', 'content_hash', 'blob', 'weight', 'id', 'adjacency', 'location', 'embedding', 'score', 'non_empty_fields', 'content'}
. Note that this list is auto-derived at runtime, not hardcoded.)Comparing to previous where the driver gives the executor whatever data structure built inside the driver, now the driver gives what executor's
exec_fn
explicitly asks for. From the Executor's developer aspect, he now gets a feeling of "controlling" what he want.There are some smart logics implemented inside
DocsExtractUpdateMixin
to improve usability. In general, the principles on the new user experience of Executor's logic function are:Document
type once, write all;Affects & Aftermath
This PR currently covers the following drivers and executors:
EncodeDriver
<->Encoder
CraftDriver
<->Crafter
SegmentDriver
<->Segmenter
PredictDriver
<->Classifier
IndexDriver
&SearchDriver
&Indexer
are in the scope as they fall into the same pattern. But refactoring them maybe overshot in this PR.Aftermath on Hub Executors
Note that this PR only requires simple follow-up changes on hub executors: i.e. changing their
exec_fn
signature to comply with Document types. As a hub developer, I would very much happy for this change as it greatly improves the readability of the executor.A temp patch is added to the chatbot demo, this should be reverted after Hub executors got updated.
Code Examples
The encoder's
encode
now can be written as:content
is an attribute defined inDocument
type.Accept arbitrary
Document
attributesencode
function now can also accepts multiple & arbitrary document types, e.g.A new document property
raw
is added, hence the executor can directly access the Document.Type annotation matters
Type hint an argument with
np.ndarray
orndarray
will tell the driver to stack data into Numpy Ndarray. For example,gives:
gives:
gives:
Problem 2
Though not the major focus of this PR, it is obvious that Problem 2 can be solved by setting
method
args for the driver. For example,To bind
foo
withEncodeDriver
, one only need to setmethod='foo'
, e.g. in YAML viaOr in Python (often used in tests),
Relation to Problem 1
After Problem 2 & 3 are solved, the foundation of solving Problem 1 is there.
To recap, the emerging of
DocsExtractUpdateMixin
divides the allExecutorDriver
classes into two categories:Encode
,Craft
,Segment
,Index
,Search
Rank
,Evaluate
.The ones that do not fit the
DocsExtractUpdateMixin
pattern are often due to the complicated extraction and extra data preprocessing before callingexec_fn
. In other way, ifDocument
class can provide powerful attributes anddoc.complex_attribute
can be used as a property, then they can also fit withDocsExtractUpdateMixin
.That is almost to say: there is no need to have polymorphism on the
ExecutorDriver
, which translates to the following YAML specification:The only thing that makes multiple
GenericExecutorDriver
different is the method that it binds to.Code Example
For the sake of clarity, the code snippet is presented first, explanation on it comes after. This PR enables the following code example:
One can also put all logic functions in one class and use
@requests(on=)
to route them.