refactor(driver): add the extract-apply-update pattern #2313

Merged
merged 23 commits into from
Apr 19, 2021

Conversation

hanxiao
Member

@hanxiao hanxiao commented Apr 18, 2021

Breaking Change #2281

Background

As presented at the April 7, 2021 Engineering Show & Tell and in private discussions with @bwanglzu, @JoanFM and @florian-hoenicke, we aim to solve the usability problems of writing a new Executor. To recap, the usability problem breaks down into three subproblems that a user faces when creating a new Executor:

  1. User has to find out which Executor to inherit.
  2. User has to know which method to override.
  3. User has to know what the arguments of the overridden method mean.

Scope of this PR

This PR solves Problem 3. Problem 2 comes for free, as it was already supported before this PR. The PR also sheds light on Problem 1. While reviewing this PR, please focus on Problem 3, as it is my main & original intention.

This PR presents a new Driver mixin class, DocsExtractUpdateMixin, that captures the pattern of extracting -> applying -> updating. Specifically,

  • It leverages Python reflection & type hints to inspect the arguments of the executor's exec_fn;
  • It extracts the attributes with the same names from the Documents;
  • It feeds the extracted values to exec_fn;
  • It updates each Document with the result returned from exec_fn.
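
The four steps above can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not the code in this PR: ToyDocument and extract_apply_update are hypothetical names, and the real mixin handles many more cases (type hints, batching, error reporting).

```python
import inspect
from typing import Callable, List


class ToyDocument:
    """Hypothetical stand-in for jina's Document, for illustration only."""

    def __init__(self, id: str, text: str):
        self.id = id
        self.text = text
        self.embedding = None


def extract_apply_update(exec_fn: Callable, docs: List[ToyDocument]) -> None:
    # 1. Inspect: read the argument names that exec_fn declares.
    arg_names = [n for n in inspect.signature(exec_fn).parameters if n != 'self']
    # 2. Extract: collect the same-name attribute from every document.
    extracted = {name: [getattr(d, name) for d in docs] for name in arg_names}
    # 3. Apply: call exec_fn with exactly what it asked for.
    results = exec_fn(**extracted)
    # 4. Update: write each returned dict back onto its document.
    for doc, update in zip(docs, results):
        for key, value in update.items():
            setattr(doc, key, value)


def encode(text):
    # Toy "encoder": the embedding is just the length of the text.
    return [{'embedding': [len(t)]} for t in text]


docs = [ToyDocument('0', 'hello'), ToyDocument('1', 'hi')]
extract_apply_update(encode, docs)
print([d.embedding for d in docs])
```

Note that the function never mentions a driver-specific data structure: the argument name text alone decides what is extracted.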

Understanding Problem 3

Take BaseEncoder as an example: its "core" logic function encode has the following signature:

    def encode(self, data: Any, *args, **kwargs) -> 'EncodingType':

It is impossible to tell what the data argument is just by looking at this signature. The same problem also appears in BaseClassifier, BaseIndexer, etc.

def predict(self, data: 'np.ndarray', *args, **kwargs) -> 'np.ndarray':
def add(
        self, keys: Iterable[str], values: Iterable[bytes], *args, **kwargs
    ) -> None:
def query(self, key: str, *args, **kwargs)

It is impossible for users to know what data, key, values, keys actually mean.

Reason behind Problem 3

The reasons are:

  • The Driver defines what the arguments of the executor mean: the extraction logic is written inside the Driver.
  • The Driver is much less visible to users than the Executor.
  • Hence, from the Executor's perspective, one can hardly guess what an argument is without reading the code of the Driver.

The following figure describes this problem.

image

General Idea of the Solution

As presented at the April 7, 2021 Engineering Show & Tell and in private discussions with @bwanglzu, @JoanFM and @florian-hoenicke, the solution is to implement a contract between the Document attributes and the Executor's logic function, and to remove the customized extraction logic from the Driver. Simply put, the executor's logic function must name its arguments after valid Document attributes. (In the current version, this means the argument names must come from {'parent_id', 'content_type', 'siblings', 'mime_type', 'chunks', 'granularity', 'buffer', 'tags', 'level_name', 'proto', 'offset', 'modality', 'text', 'uri', 'matches', 'evaluations', 'content_hash', 'blob', 'weight', 'id', 'adjacency', 'location', 'embedding', 'score', 'non_empty_fields', 'content'}. Note that this list is auto-derived at runtime, not hardcoded.)

Previously, the driver handed the executor whatever data structure was built inside the driver; now the driver gives exactly what the executor's exec_fn explicitly asks for. From the Executor developer's perspective, they now get a feeling of "controlling" what they want.

image

Some smart logic is implemented inside DocsExtractUpdateMixin to improve usability. In general, the principles behind the new user experience of the Executor's logic function are:

  • Learn Document type once, write all;
  • Readability matters;
  • Throw helpful error messages.
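
To make the contract and the "helpful error message" principle concrete, here is a minimal sketch of how argument names could be checked against a set of attribute names derived at runtime. ToyDocument, valid_attributes and check_signature are hypothetical names chosen for illustration; the actual check and the wording of the error in this PR may differ.

```python
import inspect


class ToyDocument:
    """Hypothetical stand-in for jina's Document."""

    def __init__(self):
        self._text = ''
        self._uri = ''

    @property
    def text(self):
        return self._text

    @property
    def uri(self):
        return self._uri


def valid_attributes(doc_cls) -> set:
    # Derive the allowed names at runtime from the class's properties,
    # rather than hardcoding a list.
    return {name for name, member in inspect.getmembers(doc_cls)
            if isinstance(member, property)}


def check_signature(exec_fn, doc_cls=ToyDocument) -> list:
    allowed = valid_attributes(doc_cls)
    # Ignore self, *args and **kwargs; only named arguments are part of the contract.
    names = [n for n, p in inspect.signature(exec_fn).parameters.items()
             if n != 'self' and p.kind not in (p.VAR_POSITIONAL, p.VAR_KEYWORD)]
    unknown = [n for n in names if n not in allowed]
    if unknown:
        raise AttributeError(
            f'{exec_fn.__name__}() asks for {unknown}, which are not Document '
            f'attributes; valid names are {sorted(allowed)}')
    return names


def encode(self, text, uri):
    pass


def old_encode(self, data, *args, **kwargs):
    pass


print(check_signature(encode))
try:
    check_signature(old_encode)
except AttributeError as ex:
    print(ex)
```

The old-style signature fails early, and the message lists the names the author could have used instead.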

Effects & Aftermath

This PR currently covers the following drivers and executors:

  • EncodeDriver <-> Encoder
  • CraftDriver <-> Crafter
  • SegmentDriver <-> Segmenter
  • PredictDriver <-> Classifier

IndexDriver, SearchDriver & Indexer are in scope as they fall into the same pattern, but refactoring them may be overshooting for this PR.

Aftermath on Hub Executors

Note that this PR only requires simple follow-up changes on hub executors, i.e. changing their exec_fn signature to comply with Document types. As a hub developer, I would be very happy with this change, as it greatly improves the readability of the executor.

A temporary patch is added to the chatbot demo; it should be reverted after the Hub executors are updated.

Code Examples

The encoder's encode now can be written as:

class MyEncoder(BaseEncoder):
  def encode(self, content: 'np.ndarray'):
    pass

content is an attribute defined in Document type.

Accept arbitrary Document attributes

The encode function can now also accept multiple, arbitrary Document attributes, e.g.

class MyEncoder(BaseEncoder):
  def encode(self, content: 'np.ndarray', uri: str, mime_type: str, id: str):
    pass

A new document property raw is added, hence the executor can directly access the Document.

class MyEncoder(BaseEncoder):
  def encode(self, raw: 'Document'):
    pass

Type annotation matters

Type-hinting an argument with np.ndarray or ndarray tells the driver to stack the extracted data into a NumPy ndarray; without the hint, the argument is passed as a plain list. For example,

class MyExecutor(BaseEncoder):
    def encode(self, id, embedding):
        pass

gives:

[['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], [array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ,
       0.64589411, 0.43758721, 0.891773  , 0.96366276, 0.38344152]), array([0.15896958, 0.11037514, 0.65632959, 0.13818295, 0.19658236,
       0.36872517, 0.82099323, 0.09710128, 0.83794491, 0.09609841]),...]]

class MyExecutor(BaseEncoder):
    def encode(self, id: 'ndarray', embedding):
        pass

gives:

([array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype='<U1'), [array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ,
       0.64589411, 0.43758721, 0.891773  , 0.96366276, 0.38344152]), array([0.15896958, 0.11037514, 0.65632959, 0.13818295, 0.19658236,
       0.36872517, 0.82099323, 0.09710128, 0.83794491, 0.09609841]),...]]

class MyExecutor(BaseEncoder):
    def encode(self, id: 'ndarray', embedding: 'ndarray'):
        pass

gives:

([array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype='<U1'), array([[0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ,
        0.64589411, 0.43758721, 0.891773  , 0.96366276, 0.38344152],
       [0.15896958, 0.11037514, 0.65632959, 0.13818295, 0.19658236,
        0.36872517, 0.82099323, 0.09710128, 0.83794491, 0.09609841], ...]
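
The stacking behavior shown above could be driven by the annotation roughly as follows. This is only a sketch under assumptions: extract_for is a hypothetical helper, Documents are modeled as plain dicts, and the actual logic inside DocsExtractUpdateMixin may differ.

```python
import inspect

import numpy as np


def extract_for(exec_fn, docs):
    """Build the keyword arguments for exec_fn; docs is a list of plain dicts."""
    args = {}
    for name, param in inspect.signature(exec_fn).parameters.items():
        if name == 'self':
            continue
        values = [d[name] for d in docs]
        # An annotation may be the class itself or a string such as 'np.ndarray'.
        ann = param.annotation
        ann_name = ann if isinstance(ann, str) else getattr(ann, '__name__', '')
        if ann_name in ('ndarray', 'np.ndarray'):
            # Hinted as ndarray: stack the per-document values into one array.
            values = np.stack(values)
        args[name] = values
    return args


def encode(self, id, embedding: 'np.ndarray'):
    pass


docs = [{'id': str(i), 'embedding': np.random.random(4)} for i in range(3)]
args = extract_for(encode, docs)
print(type(args['id']).__name__, args['embedding'].shape)
```

Here id stays a list because it carries no hint, while embedding becomes a single two-dimensional array with one row per document.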

Problem 2

Though not the major focus of this PR, Problem 2 can be solved by setting the method argument of the driver. For example,

class MyEncoder(BaseEncoder):
  def foo(self, content: 'np.ndarray'):
     pass

To bind foo to EncodeDriver, one only needs to set method='foo', e.g. in YAML via

jtype: MyEncoder
requests:
  on:
    IndexRequest:
      - jtype: EncodeDriver
        with:
          method: foo

Or in Python (often used in tests),

exec = MyEncoder()
bd = EncodeDriver(method='foo')
bd.attach(exec, runtime=None)
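
Conceptually, binding by method name amounts to a getattr lookup at attach time. The sketch below illustrates that idea only; ToyDriver is a hypothetical class and does not reproduce the real EncodeDriver.attach signature or behavior.

```python
class ToyDriver:
    """Hypothetical driver: only illustrates name-based method binding."""

    def __init__(self, method: str = 'encode'):
        self._method_name = method
        self.exec_fn = None

    def attach(self, executor) -> None:
        # Look the method up by name on the attached executor.
        fn = getattr(executor, self._method_name, None)
        if not callable(fn):
            raise AttributeError(
                f'{type(executor).__name__} has no callable method '
                f'{self._method_name!r}')
        self.exec_fn = fn


class MyEncoder:
    def foo(self, content):
        return [len(c) for c in content]


driver = ToyDriver(method='foo')
driver.attach(MyEncoder())
print(driver.exec_fn(['ab', 'cde']))
```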

Relation to Problem 1

After Problems 2 & 3 are solved, the foundation for solving Problem 1 is there.

To recap, the emergence of DocsExtractUpdateMixin divides all ExecutorDriver classes into two categories:

  • The ones that fit the "extract -> apply -> update" pattern, e.g. Encode, Craft, Segment, Index, Search;
  • The ones that do not, e.g. Rank, Evaluate.

The ones that do not fit the DocsExtractUpdateMixin pattern usually need complicated extraction and extra data preprocessing before calling exec_fn. Put another way, if the Document class provided more powerful attributes, so that doc.complex_attribute could be used as a property, then they could also fit DocsExtractUpdateMixin.

That is almost to say: there is no need for polymorphism on the ExecutorDriver, which translates to the following YAML specification:

...
requests:
  on:
    IndexRequest:
      - jtype: GenericExecutorDriver
        with:
          ...
    SearchRequest:
      - jtype: GenericExecutorDriver
        with:
          ...

The only thing that distinguishes one GenericExecutorDriver from another is the method it binds to.

Code Example

For the sake of clarity, the code snippet is presented first and its explanation comes after. This PR enables the following code example:

import numpy as np

from jina import Flow, Document, Executor
from jina.executors.decorators import requests


class MyExecutor(Executor):

    @requests
    def foo(self, id, embedding):
        print(id)
        print(embedding)


f = Flow().add(uses=MyExecutor)

with f:
    f.index(Document(embedding=np.array([1, 2, 3])))

One can also put all logic functions in one class and use @requests(on=) to route them.

class MyExecutor(Executor):
    @requests
    def foo(self, id):
        return [{'embedding': np.array([1, 2, 3])}] * len(id)

    @requests(on='SearchRequest')
    def bar(self, id):
        return [{'embedding': np.array([4, 5, 6])}] * len(id)

    @requests(on='UpdateRequest')
    def bar2(self, id):
        return [{'embedding': np.array([10, 11, 12])}] * len(id)


# test code
def validate_index_resp(req):
    np.testing.assert_equal(req.docs[0].embedding, np.array([1, 2, 3]))

def validate_search_resp(req):
    np.testing.assert_equal(req.docs[0].embedding, np.array([4, 5, 6]))

def validate_update_resp(req):
    np.testing.assert_equal(req.docs[0].embedding, np.array([10, 11, 12]))

f = Flow().add(uses=MyExecutor)

with f:
    f.index(Document(), on_done=validate_index_resp)
    f.search(Document(), on_done=validate_search_resp)
    f.update(Document(), on_done=validate_update_resp)
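
One way such routing could work is for the decorator to tag each method with the request type it serves, so that a generic driver can pick the right one. The sketch below is an assumption-laden illustration, not the implementation in jina.executors.decorators: the toy requests decorator, the 'default' tag and the route helper are all hypothetical.

```python
import functools


def requests(func=None, *, on='default'):
    """Toy decorator: tags a method with the request type it serves."""
    if func is None:
        # Called with arguments, e.g. @requests(on='SearchRequest').
        return functools.partial(requests, on=on)
    func._request_type = on
    return func


class ToyExecutor:
    @requests
    def foo(self, id):
        return 'foo'

    @requests(on='SearchRequest')
    def bar(self, id):
        return 'bar'


def route(executor, request_type):
    """Pick the method bound to request_type, else the default one."""
    handlers = {}
    for name in dir(executor):
        fn = getattr(executor, name)
        tag = getattr(fn, '_request_type', None)
        if tag is not None:
            handlers[tag] = fn
    return handlers.get(request_type, handlers.get('default'))


ex = ToyExecutor()
print(route(ex, 'SearchRequest')(['0']))
print(route(ex, 'IndexRequest')(['0']))
```

In this sketch a bare @requests acts as the fallback for any request type without its own handler, which matches how foo serves the index request in the example above.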

@codecov

codecov bot commented Apr 18, 2021

Codecov Report

❗ No coverage uploaded for pull request base (master@4375be5).
The diff coverage is 89.88%.

@@            Coverage Diff            @@
##             master    #2313   +/-   ##
=========================================
  Coverage          ?   90.91%           
=========================================
  Files             ?      222           
  Lines             ?    11792           
  Branches          ?        0           
=========================================
  Hits              ?    10721           
  Misses            ?     1071           
  Partials          ?        0           
Flag     Coverage Δ
daemon   51.05% <34.52%> (?)
jina     91.08% <89.88%> (?)

Flags with carried forward coverage won't be shown.

Impacted Files                          Coverage Δ
jina/executors/compound.py              83.16% <ø> (ø)
jina/executors/crafters/__init__.py     100.00% <ø> (ø)
jina/executors/segmenters/__init__.py   100.00% <ø> (ø)
jina/types/mixin.py                     93.10% <66.66%> (ø)
jina/drivers/generic.py                 80.00% <80.00%> (ø)
jina/drivers/__init__.py                90.74% <82.85%> (ø)
jina/executors/decorators.py            87.26% <90.00%> (ø)
jina/types/document/__init__.py         96.22% <90.90%> (ø)
jina/helper.py                          83.72% <95.23%> (ø)
jina/__init__.py                        74.41% <100.00%> (ø)
... and 11 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4375be5...8d50739.

@github-actions

github-actions bot commented Apr 18, 2021

Latency summary

Current PR yields:

  • 😶 index QPS at 931, delta to last 3 avg.: +0%
  • 😶 query QPS at 13, delta to last 3 avg.: -4%

Breakdown

Version   Index QPS   Query QPS
current   931         13
1.1.6     917         13
1.1.5     949         13

Backed by latency-tracking. Further commits will update this comment.

Member

@JoanFM JoanFM left a comment

We had a chat with @maximilianwerk and he came up with a proposal that may solve many of these aspects.

He would like to present it soon, and it may make this change unnecessary. If possible, let's discuss that before merging this.

@hanxiao
Member Author

hanxiao commented Apr 18, 2021

are you talking about a proposal or a working PR? I don't see any ongoing work on this direction. If we are talking about a rough idea without concrete implementation then I'm afraid I can't wait.

I will merge it after tmr fireside gathering, to save time i will only answer question you raise as i think the PR description is concrete enough

@JoanFM
Member

JoanFM commented Apr 18, 2021

are you talking about a proposal or a working PR? I don't see any ongoing work on this direction. If we are talking about a rough idea without concrete implementation then I'm afraid I can't wait

an idea proposal.

@maximilianwerk
Member

are you talking about a proposal or a working PR? I don't see any ongoing work on this direction. If we are talking about a rough idea without concrete implementation then I'm afraid I can't wait.

I will merge it after tmr fireside gathering, to save time i will only answer question you raise as i think the PR description is concrete enough

It is more than just a rough idea. Anyhow, it does not conflict with this PR but is rather a further advancement. Let me show you concrete code at tomorrow's fireside.

Apart from that: I like the binding of the function signature to the extracted data a lot. Great idea!

@@ -397,6 +399,162 @@ def _apply_all(
"""


class DocsExtractUpdateMixin:
Member

This assumes a lot of parameters of the drivers, maybe it should be directly a Driver itself? And then they can directly inherit from it.

Member Author

@hanxiao hanxiao Apr 18, 2021

maybe it should be directly a Driver itself?

I thought about that, but the side effect of making it the default behavior in BaseExecutorDriver is not clear to me, and it may be overshooting for this PR.

@JoanFM
Member

JoanFM commented Apr 18, 2021

are you talking about a proposal or a working PR? I don't see any ongoing work on this direction. If we are talking about a rough idea without concrete implementation then I'm afraid I can't wait.
I will merge it after tmr fireside gathering, to save time i will only answer question you raise as i think the PR description is concrete enough

It is more than just a rough idea. Anyhow, it does not conflict with this PR but is rather a further advancement. Let me show you concrete code at tomorrow's fireside.

Apart from that: I like the binding of the function signature to the extracted data a lot. Great idea!

well to me the point would be more about to have less changes and refactors

@hanxiao
Member Author

hanxiao commented Apr 18, 2021

have less changes and refactors

Note that this PR only requires simple follow-up changes on hub executors, i.e. changing their exec_fn signature to comply with Document types. As a hub developer, I would be very happy with this change, as it greatly improves the readability of the executor.

@hanxiao hanxiao marked this pull request as ready for review April 18, 2021 16:00
@hanxiao hanxiao requested a review from a team as a code owner April 18, 2021 16:00
@JoanFM
Member

JoanFM commented Apr 18, 2021

have less changes and refactors

Note that this PR only requires simple follow-up changes on hub executors, i.e. changing their exec_fn signature to comply with Document types. As a hub developer, I would be very happy with this change, as it greatly improves the readability of the executor.

I meant changes for people with custom executors, or is it backcompatible?

@jina-bot jina-bot added size/XL and removed size/L labels Apr 18, 2021
@hanxiao
Member Author

hanxiao commented Apr 18, 2021

I meant changes for people with custom executors, or is it backcompatible?

It will be a breaking change for Hub & executor developers, but it is a good & necessary breaking change that gives executors better readability and maintainability.

The old function signature is no longer usable. There is already a detailed error message to guide people through the migration.

@jina-bot jina-bot added the area/helper This issue/PR affects the helper functionality label Apr 19, 2021
@JoanFM
Member

JoanFM commented Apr 19, 2021

While it is handy, I am not a big fan of coupling the data to be extracted with the class instead of the instance.

If I craft text into something different, why should I not be able to craft other fields of a Document?

@jina-bot jina-bot added the area/entrypoint This issue/PR affects the entrypoint codebase label Apr 19, 2021
@hanxiao hanxiao merged commit 0e02d22 into master Apr 19, 2021
@hanxiao hanxiao deleted the refactor-driver-extract-update branch April 19, 2021 12:08
hanxiao added a commit that referenced this pull request Apr 20, 2021
@nan-wang nan-wang added this to the v1.2 Breaking Changes milestone Apr 22, 2021
Labels
area/core This issue/PR affects the core codebase area/entrypoint This issue/PR affects the entrypoint codebase area/helper This issue/PR affects the helper functionality area/testing This issue/PR affects testing component/driver component/executor component/resource component/type executor/crafter executor/encoder executor/meta size/XL
5 participants