refactor(driver): add the extract-apply-update pattern #2313

hanxiao · 2021-04-18T13:10:32Z

Breaking Change #2281

Background

As presented at the April 7, 2021 Engineering Show & Tell and in private discussions with @bwanglzu, @JoanFM and @florian-hoenicke, we aim to solve the usability problems of writing a new Executor. To recap, the usability problem breaks down into three subproblems that a user faces when creating a new Executor:

Problem 1: The user has to find out which Executor to inherit from.
Problem 2: The user has to know which method to override.
Problem 3: The user has to know what the arguments of the overridden method mean.

Scope of this PR

This PR solves Problem 3. Problem 2 is solved for free, as it was already supported before this PR. The PR also sheds light on Problem 1. While reviewing this PR, please focus on Problem 3, as it is my major and original intention.

This PR presents a new Driver mixin class, DocsExtractUpdateMixin, that captures the pattern of extracting -> applying -> updating. Specifically,

It leverages Python reflection and type hints to inspect the arguments of the executor's exec_fn;
It extracts the attributes with the same names from the Document type;
It feeds the extracted values to exec_fn;
It updates the Document with the result returned from exec_fn.
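The four steps above can be sketched roughly as follows. This is a minimal illustration, not the actual DocsExtractUpdateMixin code: ToyDocument and extract_apply_update are hypothetical names, and the convention of returning one dict per document is assumed from the examples later in this description.

```python
import inspect
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ToyDocument:
    """Hypothetical stand-in for jina's Document with three attributes."""
    id: str
    text: str = ''
    embedding: Optional[List[float]] = None


def extract_apply_update(exec_fn: Callable, docs: List[ToyDocument]) -> None:
    # 1. Inspect: read the argument names of exec_fn via reflection.
    arg_names = [name for name in inspect.signature(exec_fn).parameters
                 if name != 'self']
    # 2. Extract: collect the same-named attribute from every document.
    extracted = {name: [getattr(d, name) for d in docs] for name in arg_names}
    # 3. Apply: feed the extraction to exec_fn as keyword arguments.
    results = exec_fn(**extracted)
    # 4. Update: write each returned dict back onto its document.
    for doc, update in zip(docs, results or []):
        for key, value in update.items():
            setattr(doc, key, value)


def encode(id: List[str], text: List[str]) -> List[Dict[str, Any]]:
    # One result dict per input document; the keys are Document attributes.
    return [{'embedding': [float(len(t))]} for t in text]


docs = [ToyDocument(id='0', text='hello'), ToyDocument(id='1', text='hi')]
extract_apply_update(encode, docs)
print([d.embedding for d in docs])  # [[5.0], [2.0]]
```

The point of the sketch is that the logic function never sees a driver-specific data structure: its own signature decides what is extracted.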

Understanding Problem 3

Take BaseEncoder as an example. Its "core" logic function, encode, has the following specification:

    def encode(self, data: Any, *args, **kwargs) -> 'EncodingType':

It is impossible to tell what the data argument is just by looking at this signature. The same problem also appears in BaseClassifier, BaseIndexer, etc.

def predict(self, data: 'np.ndarray', *args, **kwargs) -> 'np.ndarray':

def add(
        self, keys: Iterable[str], values: Iterable[bytes], *args, **kwargs
    ) -> None:

def query(self, key: str, *args, **kwargs)

It is impossible for users to know what data, key, values, keys actually mean.

Reason behind Problem 3

The reasons behind this are:

The Driver defines what the arguments of the executor mean. The extraction logic is written inside the Driver.
The Driver is much less visible to users than the Executor.
Hence, from the Executor's perspective, one can hardly guess what an argument means without reading the code of the Driver.

The following figure describes this problem.

General Idea of the Solution

As presented at the April 7, 2021 Engineering Show & Tell and in private discussions with @bwanglzu, @JoanFM and @florian-hoenicke, the solution is to implement a contract between the Document attributes and the Executor's logic function, and to remove the customized extraction logic from the Driver. Simply put, the executor's logic function must name its arguments after valid Document attributes. (In the current version, this means the argument names must come from {'parent_id', 'content_type', 'siblings', 'mime_type', 'chunks', 'granularity', 'buffer', 'tags', 'level_name', 'proto', 'offset', 'modality', 'text', 'uri', 'matches', 'evaluations', 'content_hash', 'blob', 'weight', 'id', 'adjacency', 'location', 'embedding', 'score', 'non_empty_fields', 'content'}. Note that this list is auto-derived at runtime, not hardcoded.)

Compared to the previous design, where the driver hands the executor whatever data structure it builds internally, the driver now provides exactly what the executor's exec_fn explicitly asks for. From the Executor developer's perspective, this gives a feeling of being in control of what they receive.

Some extra logic is implemented inside DocsExtractUpdateMixin to improve usability. In general, the principles of the new user experience for an Executor's logic function are:

Learn the Document type once, write all;
Readability matters;
Throw helpful error messages.
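As an illustration of the contract and of the "helpful error message" principle, the sketch below derives the allowed argument names from a class at runtime and rejects a signature that uses any other name. It is a simplified assumption of how such a check could look; ToyDocument, valid_attributes and check_signature are hypothetical names, not jina APIs.

```python
import inspect
from typing import Callable, Set


class ToyDocument:
    """Hypothetical stand-in for jina's Document."""

    def __init__(self):
        self._id, self._text = '', ''

    @property
    def id(self) -> str:
        return self._id

    @property
    def text(self) -> str:
        return self._text

    @property
    def content(self) -> str:
        return self._text


def valid_attributes(doc_cls: type) -> Set[str]:
    # Derive the allowed names at runtime from the class's properties.
    return {name for name, member in inspect.getmembers(doc_cls)
            if isinstance(member, property)}


def check_signature(exec_fn: Callable, doc_cls: type) -> None:
    allowed = valid_attributes(doc_cls)
    params = inspect.signature(exec_fn).parameters.values()
    # Only named arguments are checked; *args and **kwargs are ignored here.
    unknown = [p.name for p in params
               if p.name != 'self'
               and p.kind in (p.POSITIONAL_OR_KEYWORD, p.KEYWORD_ONLY)
               and p.name not in allowed]
    if unknown:
        raise TypeError(
            f'{exec_fn.__name__}() has argument(s) {unknown} that are not '
            f'attributes of {doc_cls.__name__}; valid names are {sorted(allowed)}'
        )


def old_encode(self, data, *args, **kwargs):
    pass


def new_encode(self, content):
    pass


check_signature(new_encode, ToyDocument)  # passes silently
try:
    check_signature(old_encode, ToyDocument)
except TypeError as ex:
    print(ex)  # names the offending argument and lists the valid ones
```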

Affects & Aftermath

This PR currently covers the following drivers and executors:

EncodeDriver <-> Encoder
CraftDriver <-> Crafter
SegmentDriver <-> Segmenter
PredictDriver <-> Classifier

IndexDriver, SearchDriver and Indexer are also in scope, as they fall into the same pattern. However, refactoring them may be an overshoot for this PR.

Aftermath on Hub Executors

Note that this PR only requires simple follow-up changes to hub executors, i.e. changing their exec_fn signature to comply with the Document type. As a hub developer, I would be very happy with this change, as it greatly improves the readability of the executor.

A temporary patch is added to the chatbot demo. It should be reverted after the Hub executors are updated.

Code Examples

The encoder's encode can now be written as:

class MyEncoder(BaseEncoder):
    def encode(self, content: 'np.ndarray'):
        pass

content is an attribute defined in Document type.

Accept arbitrary `Document` attributes

The encode function can now also accept multiple, arbitrary Document attributes, e.g.

class MyEncoder(BaseEncoder):
    def encode(self, content: 'np.ndarray', uri: str, mime_type: str, id: str):
        pass

A new Document property, raw, is added so that the executor can access the Document directly.

class MyEncoder(BaseEncoder):
    def encode(self, raw: 'Document'):
        pass

Type annotation matters

Type-hinting an argument with np.ndarray or ndarray tells the driver to stack the extracted data into a NumPy ndarray. For example,

class MyExecutor(BaseEncoder):
    def encode(self, id, embedding):
        pass

gives:

[['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], [array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ,
       0.64589411, 0.43758721, 0.891773  , 0.96366276, 0.38344152]), array([0.15896958, 0.11037514, 0.65632959, 0.13818295, 0.19658236,
       0.36872517, 0.82099323, 0.09710128, 0.83794491, 0.09609841]),...]]

class MyExecutor(BaseEncoder):
    def encode(self, id: 'ndarray', embedding):
        pass

gives:

([array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype='<U1'), [array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ,
       0.64589411, 0.43758721, 0.891773  , 0.96366276, 0.38344152]), array([0.15896958, 0.11037514, 0.65632959, 0.13818295, 0.19658236,
       0.36872517, 0.82099323, 0.09710128, 0.83794491, 0.09609841]),...]]

class MyExecutor(BaseEncoder):
    def encode(self, id: 'ndarray', embedding: 'ndarray'):
        pass

gives:

([array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype='<U1'), array([[0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ,
        0.64589411, 0.43758721, 0.891773  , 0.96366276, 0.38344152],
       [0.15896958, 0.11037514, 0.65632959, 0.13818295, 0.19658236,
        0.36872517, 0.82099323, 0.09710128, 0.83794491, 0.09609841], ...]
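How a driver might decide which arguments to stack can be sketched with the standard inspect module. This is an assumption about the mechanism, not the code in this PR; args_to_stack and NDARRAY_HINTS are hypothetical names, and the actual stacking (for example with numpy) is left out so the snippet has no dependencies.

```python
import inspect
from typing import Callable, List

# The two hint spellings mentioned above.
NDARRAY_HINTS = {'ndarray', 'np.ndarray'}


def args_to_stack(exec_fn: Callable) -> List[str]:
    """Return the argument names whose type hint asks for an ndarray."""
    names = []
    for param in inspect.signature(exec_fn).parameters.values():
        hint = param.annotation
        if hint is inspect.Parameter.empty:
            continue  # no hint: the extracted values stay a plain list
        # A hint may be a string ('ndarray') or a real class such as numpy.ndarray.
        hint_name = hint if isinstance(hint, str) else getattr(hint, '__name__', '')
        if hint_name in NDARRAY_HINTS:
            names.append(param.name)
    return names


class MyExecutor:
    def encode(self, id: 'ndarray', embedding):
        pass


print(args_to_stack(MyExecutor.encode))  # ['id']
```

Only the arguments returned here would be stacked, which matches the second output above: id becomes an array while embedding stays a list of arrays.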

Problem 2

Though not the major focus of this PR, Problem 2 can clearly be solved by setting the method argument of the driver. For example,

class MyEncoder(BaseEncoder):
    def foo(self, content: 'np.ndarray'):
        pass

To bind foo to EncodeDriver, one only needs to set method='foo', e.g. in YAML via

jtype: MyEncoder
requests:
  on:
    IndexRequest:
      - jtype: EncodeDriver
        with:
          method: foo

Or in Python (often used in tests),

executor = MyEncoder()  # avoid shadowing the built-in exec
bd = EncodeDriver(method='foo')
bd.attach(executor, runtime=None)
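The binding itself amounts to looking up the logic function by name when the driver is attached. The sketch below assumes this and uses hypothetical ToyDriver and ToyEncoder classes; the real EncodeDriver does more and its attach also takes a runtime.

```python
class ToyEncoder:
    """Hypothetical executor with a non-default logic function."""

    def encode(self, content):
        return 'encode called'

    def foo(self, content):
        return 'foo called'


class ToyDriver:
    """Hypothetical driver; only the name-based binding is sketched."""

    def __init__(self, method: str = 'encode'):
        self._method_name = method
        self.exec_fn = None

    def attach(self, executor) -> None:
        # Resolve the logic function by name and fail early with a clear message.
        fn = getattr(executor, self._method_name, None)
        if not callable(fn):
            raise AttributeError(
                f'{type(executor).__name__} has no method {self._method_name!r}')
        self.exec_fn = fn


driver = ToyDriver(method='foo')
driver.attach(ToyEncoder())
print(driver.exec_fn(content=None))  # foo called
```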

Relation to Problem 1

After Problem 2 & 3 are solved, the foundation of solving Problem 1 is there.

To recap, the introduction of DocsExtractUpdateMixin divides all ExecutorDriver classes into two categories:

The ones that fit the "extract -> apply -> update" pattern, e.g. Encode, Craft, Segment, Index, Search.
The ones that do not, e.g. Rank, Evaluate.

The ones that do not fit the DocsExtractUpdateMixin pattern usually require complicated extraction and extra data preprocessing before calling exec_fn. Put another way, if the Document class provides sufficiently powerful attributes, so that doc.complex_attribute can be used as a property, then these drivers can also fit DocsExtractUpdateMixin.
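To make the doc.complex_attribute idea concrete, here is a small sketch in which preprocessing that would otherwise live in a custom driver is exposed as a Document property. ToyDocument and chunk_texts are hypothetical names chosen for illustration; they are not part of jina.

```python
from typing import List, Optional


class ToyDocument:
    """Hypothetical stand-in for jina's Document."""

    def __init__(self, text: str, chunks: Optional[List['ToyDocument']] = None):
        self.text = text
        self.chunks = chunks or []

    @property
    def chunk_texts(self) -> List[str]:
        # Hypothetical "complex attribute": data a custom driver would
        # otherwise have to assemble before calling exec_fn.
        return [c.text for c in self.chunks]


doc = ToyDocument('parent', chunks=[ToyDocument('a'), ToyDocument('b')])
# A generic extractor only needs the argument name to reach the derived data.
print(getattr(doc, 'chunk_texts'))  # ['a', 'b']
```

With such a property in place, a logic function could simply declare an argument named chunk_texts and rely on the same generic extraction as every other executor.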

That is almost to say that there is no need for polymorphism in ExecutorDriver, which translates to the following YAML specification:

...
requests:
  on:
    IndexRequest:
      - jtype: GenericExecutorDriver
        with:
          ...
    SearchRequest:
      - jtype: GenericExecutorDriver
        with:
          ...

The only thing that distinguishes one GenericExecutorDriver from another is the method it binds to.

Code Example

For the sake of clarity, the code snippet is presented first and the explanation comes after. This PR enables the following code example:

import numpy as np

from jina import Flow, Document, Executor
from jina.executors.decorators import requests


class MyExecutor(Executor):

    @requests
    def foo(self, id, embedding):
        print(id)
        print(embedding)


f = Flow().add(uses=MyExecutor)

with f:
    f.index(Document(embedding=np.array([1, 2, 3])))

One can also put all logic functions in one class and use @requests(on=) to route them.

class MyExecutor(Executor):
    @requests
    def foo(self, id):
        return [{'embedding': np.array([1, 2, 3])}] * len(id)

    @requests(on='SearchRequest')
    def bar(self, id):
        return [{'embedding': np.array([4, 5, 6])}] * len(id)

    @requests(on='UpdateRequest')
    def bar2(self, id):
        return [{'embedding': np.array([10, 11, 12])}] * len(id)


# test code
def validate_index_resp(req):
    np.testing.assert_equal(req.docs[0].embedding, np.array([1, 2, 3]))

def validate_search_resp(req):
    np.testing.assert_equal(req.docs[0].embedding, np.array([4, 5, 6]))

def validate_update_resp(req):
    np.testing.assert_equal(req.docs[0].embedding, np.array([10, 11, 12]))

f = Flow().add(uses=MyExecutor)

with f:
    f.index(Document(), on_done=validate_index_resp)
    f.search(Document(), on_done=validate_search_resp)
    f.update(Document(), on_done=validate_update_resp)
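The routing behind @requests and @requests(on=...) can be approximated as below. This is a toy sketch under assumptions: the decorator here only tags functions, route is a hypothetical helper, and the fallback to the bare @requests function for unlisted request types is inferred from the example above rather than taken from the implementation.

```python
from typing import Callable, Optional


def requests(func: Optional[Callable] = None, *, on: str = 'default'):
    """Toy decorator usable both as @requests and as @requests(on=...)."""

    def mark(fn: Callable) -> Callable:
        fn.request_type = on  # remember which request type this function serves
        return fn

    # Bare @requests passes the function directly; @requests(on=...) does not.
    return mark(func) if func is not None else mark


class ToyExecutor:
    @requests
    def foo(self, id):
        return 'foo'

    @requests(on='SearchRequest')
    def bar(self, id):
        return 'bar'


def route(executor, request_type: str) -> Callable:
    """Pick the method bound to request_type, else the default one."""
    table = {}
    for name in dir(executor):
        method = getattr(executor, name)
        if callable(method) and hasattr(method, 'request_type'):
            table[method.request_type] = method
    return table.get(request_type, table['default'])


ex = ToyExecutor()
print(route(ex, 'SearchRequest')(id=['0']))  # bar
print(route(ex, 'IndexRequest')(id=['0']))   # foo
```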

codecov · 2021-04-18T13:14:41Z

Codecov Report

❗ No coverage uploaded for pull request base (master@4375be5).
The diff coverage is 89.88%.

@@            Coverage Diff            @@
##             master    #2313   +/-   ##
=========================================
  Coverage          ?   90.91%           
=========================================
  Files             ?      222           
  Lines             ?    11792           
  Branches          ?        0           
=========================================
  Hits              ?    10721           
  Misses            ?     1071           
  Partials          ?        0

Flag	Coverage Δ
daemon	`51.05% <34.52%> (?)`
jina	`91.08% <89.88%> (?)`

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
jina/executors/compound.py	`83.16% <ø> (ø)`
jina/executors/crafters/__init__.py	`100.00% <ø> (ø)`
jina/executors/segmenters/__init__.py	`100.00% <ø> (ø)`
jina/types/mixin.py	`93.10% <66.66%> (ø)`
jina/drivers/generic.py	`80.00% <80.00%> (ø)`
jina/drivers/__init__.py	`90.74% <82.85%> (ø)`
jina/executors/decorators.py	`87.26% <90.00%> (ø)`
jina/types/document/__init__.py	`96.22% <90.90%> (ø)`
jina/helper.py	`83.72% <95.23%> (ø)`
jina/__init__.py	`74.41% <100.00%> (ø)`
... and 11 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4375be5...8d50739.

github-actions · 2021-04-18T13:28:37Z

Latency summary

Current PR yields:

😶 index QPS at 931, delta to last 3 avg.: +0%
😶 query QPS at 13, delta to last 3 avg.: -4%

Breakdown

Version	Index QPS	Query QPS
current	931	13
`1.1.6`	917	13
`1.1.5`	949	13

Backed by latency-tracking. Further commits will update this comment.

JoanFM

We had a chat with @maximilianwerk and he came up with a proposal that may solve many of these aspects.

He would like to present it soon, and it may make this change unnecessary. If possible, let's discuss that before merging this.

hanxiao · 2021-04-18T14:49:25Z

are you talking about a proposal or a working PR? I don't see any ongoing work on this direction. If we are talking about a rough idea without concrete implementation then I'm afraid I can't wait.

I will merge it after tmr fireside gathering, to save time i will only answer question you raise as i think the PR description is concrete enough

JoanFM · 2021-04-18T14:55:38Z

> are you talking about a proposal or a working PR? I don't see any ongoing work on this direction. If we are talking about a rough idea without concrete implementation then I'm afraid I can't wait

an idea proposal.

maximilianwerk · 2021-04-18T15:50:45Z

> are you talking about a proposal or a working PR? I don't see any ongoing work on this direction. If we are talking about a rough idea without concrete implementation then I'm afraid I can't wait.
>
> I will merge it after tmr fireside gathering, to save time i will only answer question you raise as i think the PR description is concrete enough

It is more than just a rough idea. Anyhow, it does not conflict with this PR but is rather a further advancement. Let me show you concrete code in tomorrow's fireside.

Apart from that: I like the binding of the function signature to the extracted data a lot. Great idea!

JoanFM · 2021-04-18T15:52:35Z

jina/drivers/__init__.py

@@ -397,6 +399,162 @@ def _apply_all(
        """


+class DocsExtractUpdateMixin:


This assumes a lot about the parameters of the drivers. Maybe it should be a Driver itself, so that drivers can inherit from it directly?

hanxiao · 2021-04-18

> maybe it should be directly a Driver itself?

I thought about that, but the side effect of making it the default behavior in BaseExecutorDriver is not clear to me, and it may be an overshoot for this PR.

JoanFM · 2021-04-18T15:54:22Z

> > are you talking about a proposal or a working PR? I don't see any ongoing work on this direction. If we are talking about a rough idea without concrete implementation then I'm afraid I can't wait.
> > I will merge it after tmr fireside gathering, to save time i will only answer question you raise as i think the PR description is concrete enough
>
> It is more than just a rough idea. Anyhow, it does not conflict with this PR but is rather a further advancement. Let me show you concrete code in tomorrow's fireside.
>
> Apart from that: I like the binding of the function signature to the extracted data a lot. Great idea!

Well, to me the point is more about having fewer changes and refactors.

hanxiao · 2021-04-18T15:57:34Z

> have fewer changes and refactors

Note that this PR only requires simple follow-up changes to hub executors, i.e. changing their exec_fn signature to comply with the Document type. As a hub developer, I would be very happy with this change, as it greatly improves the readability of the executor.

JoanFM · 2021-04-18T16:10:52Z

> > have fewer changes and refactors
>
> Note that this PR only requires simple follow-up changes to hub executors, i.e. changing their exec_fn signature to comply with the Document type. As a hub developer, I would be very happy with this change, as it greatly improves the readability of the executor.

I meant changes for people with custom executors. Or is it backward compatible?

hanxiao · 2021-04-18T16:24:54Z

> I meant changes for people with custom executors, or is it backcompatible?

It will be a breaking change for Hub and executor developers, but it is a good and necessary breaking change that leads to better readability and maintainability of executors.

The old function signature is no longer usable. There is already a detailed error message to guide people through the migration.

JoanFM · 2021-04-19T07:13:37Z

While it is handy, I am not a big fan of coupling the data to be extracted with the class instead of the instance.

If I craft text into something different, why should I not be able to craft other fields of a Document?

hanxiao added 7 commits April 18, 2021 09:35

refactor(driver): move dbms to index module

a5b1ad5

refactor(driver): remove post_init takes too long hint

c326be8

refactor(driver): move dbms to index module

6e8b634

refactor(driver): move dbms to index module

7dba2be

Merge branch 'master' into refactor-driver-extract-update

e02d5e5

refactor(driver): improve driver extract and update logic

b51f742

refactor(driver): improve driver extract and update logic

2e5f380

jina-bot added size/L area/core This issue/PR affects the core codebase area/testing This issue/PR affects testing component/driver component/executor component/resource component/type executor/crafter executor/encoder labels Apr 18, 2021

refactor(driver): improve driver extract and update logic

cce4a83

JoanFM reviewed Apr 18, 2021

refactor(driver): improve driver extract and update logic

af1dfc4

hanxiao added 3 commits April 18, 2021 23:24

refactor(driver): improve driver extract and update logic

bc3f68b

refactor(driver): improve driver extract and update logic

015a850

refactor(driver): improve driver extract and update logic

611dcce

jina-bot added the executor/meta label Apr 18, 2021

JoanFM reviewed Apr 18, 2021

hanxiao marked this pull request as ready for review April 18, 2021 16:00

hanxiao requested a review from a team as a code owner April 18, 2021 16:00

hanxiao requested review from cristianmtr and alanthssss April 18, 2021 16:00

refactor(driver): improve driver extract and update logic

24a0b79

jina-bot added size/XL and removed size/L labels Apr 18, 2021

refactor(driver): improve driver extract and update logic

a4be775

jina-bot added the area/helper This issue/PR affects the helper functionality label Apr 19, 2021

hanxiao added 5 commits April 19, 2021 16:11

refactor(driver): improve driver extract and update logic

0447151

refactor(driver): improve driver extract and update logic

d2aa2b2

refactor(driver): improve driver extract and update logic

23a5102

refactor(driver): improve driver extract and update logic

7efcaed

refactor(driver): improve driver extract and update logic

5424e8e

jina-bot added the area/entrypoint This issue/PR affects the entrypoint codebase label Apr 19, 2021

hanxiao added 4 commits April 19, 2021 17:53

refactor(driver): improve driver extract and update logic

cb603d6

refactor(driver): improve driver extract and update logic

082a5bb

refactor(driver): improve driver extract and update logic

d47d35b

refactor(driver): improve driver extract and update logic

8d50739

hanxiao merged commit 0e02d22 into master Apr 19, 2021

hanxiao deleted the refactor-driver-extract-update branch April 19, 2021 12:08

JoanFM pushed a commit that referenced this pull request Apr 19, 2021

refactor(driver): add the extract-apply-update pattern (#2313)

4f207b0

hanxiao added a commit that referenced this pull request Apr 20, 2021

refactor(indexer): rename arg to comply with #2313

03a1423

nan-wang added this to the v1.2 Breaking Changes milestone Apr 22, 2021

nan-wang mentioned this pull request Apr 29, 2021

🚨 BREAKING CHANGES 1.1 -> 1.2 #2281

Closed
