refactor(multimodal): refactor multimodal document and set #1368

bwanglzu · 2020-11-28T13:20:14Z

Blocked by #1372 when adding chunk level test cases.

jina/drivers/__init__.py

JoanFM · 2020-11-28T13:26:15Z

jina/types/document/__init__.py

+        if path:
+            next_edge = path[0]
+            if next_edge == 'm':
+                self._traverse_rec(


Just a comment: this call does not match the signature, right?

yes, I'm working on that

JoanFM · 2020-11-28T13:28:07Z

jina/types/document/__init__.py

@@ -534,3 +534,24 @@ def MergeFrom(self, doc: 'Document'):

    def CopyFrom(self, doc: 'Document'):
        self._document.CopyFrom(doc.as_pb_object)
+
+    def traverse_apply(self, traversal_paths: Tuple[str], apply_func: Callable, *args, **kwargs) -> None:


Not sure, but traverse_apply feels like it may belong in the DocumentSet class.

I have a better idea: Document handles traverse and DocumentSet handles traverse_all. I'll update the code, one sec.

Seems like a good idea! But anyway good to iterate on this, and to break things to learn the insights of how the feature works

JoanFM · 2020-11-28T21:37:49Z

jina/drivers/__init__.py

-        else:
-            self._apply_all(docs, parent_doc, parent_edge_type, *args, **kwargs)
+        for doc in self.docs:
+            doc.traverse_apply(self._traversal_paths, self._apply_all)


I think this does not comply with the current behavior. This function should belong to DocumentSet; otherwise you will not be able to apply batching.
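To illustrate the batching concern, here is a minimal sketch. `Doc`, `apply_per_document`, and `apply_across_set` are hypothetical names used only for illustration; they are not Jina's classes or the code in this PR. The assumption is that a driver's apply function works best when it receives all documents at a given level in one call.

```python
from typing import Callable, List, Optional


class Doc:
    """Hypothetical stand-in for a document with chunks (not Jina's Document)."""

    def __init__(self, doc_id: str, chunks: Optional[List['Doc']] = None) -> None:
        self.id = doc_id
        self.chunks = chunks or []


def apply_per_document(docs: List[Doc], apply_all: Callable[[List[Doc]], None]) -> None:
    # Traversal lives on each document: apply_all only ever sees one
    # document's chunks at a time, so batches stay small.
    for doc in docs:
        apply_all(doc.chunks)


def apply_across_set(docs: List[Doc], apply_all: Callable[[List[Doc]], None]) -> None:
    # Traversal lives on the set: chunks of all documents are gathered
    # first, so apply_all receives a single, larger batch.
    apply_all([chunk for doc in docs for chunk in doc.chunks])


docs = [
    Doc('d0', chunks=[Doc('d0c0'), Doc('d0c1')]),
    Doc('d1', chunks=[Doc('d1c0')]),
]

per_doc_batches: List[int] = []
apply_per_document(docs, lambda batch: per_doc_batches.append(len(batch)))

set_batches: List[int] = []
apply_across_set(docs, lambda batch: set_batches.append(len(batch)))
```

With per-document traversal the apply function is called once per document; with set-level traversal it is called once for the whole set, which is what makes batching possible.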

hanxiao · 2020-11-30T16:47:27Z

@bwanglzu @JoanFM I want to highlight the importance of having traversal() as a method in the Document class. (The DocumentSet's traversal can be based on that, but no need to make them fully separated).

All drivers adoption aside, I would like to have the following interface for Document and DocumentSet:

class Document:
   def traverse(self, traverse_path: str, callback_fn: Callable[['Document', None], None]) -> None:

class DocumentSet:
   def traverse(self, traverse_path: str, callback_fn: Callable[['Document', None], None]) -> None:

Usecase 1:

I'm trying to visualize the recursive doc structure in a Jupyter Notebook. I would like to implement my doc.plot() function on top of the traverse function by giving the following callback:

plot_str = ''

def callback(parent: Document, current: Document, siblings: DocumentSet, ...):
   plot_str += f'{parent.id} -> {current.id}'

plot(doc.traverse(callback))

Usecase 2:

After the result is returned from Jina, I would like to validate all embeddings on every document recursively.

def callback(parent: Document, current: Document, siblings: DocumentSet, ...):
   np.testing.assert_equal(current.embedding, np.array([1, 2]))

def is_all_equal():
   doc.traverse(callback)

In general, the current, very powerful traversal is implemented at the Driver level, which prevents it from being used anywhere else. Moving it to the Document data type will lift this restriction and enable many possibilities, in particular on the client side.
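A minimal sketch of what a callback-based traversal on the document itself could look like. The `Doc` class, the `(parent, current)` callback shape, and the `'c'` edge for chunks are assumptions made for illustration; only the `'m'` edge appears in the diff above, and none of this is the implementation merged in this PR.

```python
from typing import Callable, List, Optional


class Doc:
    """Hypothetical recursive document; not Jina's Document class."""

    def __init__(self, doc_id: str,
                 chunks: Optional[List['Doc']] = None,
                 matches: Optional[List['Doc']] = None) -> None:
        self.id = doc_id
        self.chunks = chunks or []
        self.matches = matches or []

    def traverse(self, path: str,
                 callback: Callable[[Optional['Doc'], 'Doc'], None],
                 parent: Optional['Doc'] = None) -> None:
        """Follow `path` one edge at a time and call `callback(parent, current)`
        on every document reached once the path is exhausted."""
        if not path:
            callback(parent, self)
            return
        edge, rest = path[0], path[1:]
        if edge == 'm':
            children = self.matches
        elif edge == 'c':  # assumed edge name for chunks
            children = self.chunks
        else:
            raise ValueError(f'unknown edge {edge!r}')
        for child in children:
            child.traverse(rest, callback, parent=self)


root = Doc('root', chunks=[Doc('c1', matches=[Doc('m1')]), Doc('c2')])

# Collect edges in a list: appending to a list from inside the callback
# avoids rebinding an outer string variable with `+=`.
edges: List[str] = []


def record_edge(parent: Optional[Doc], current: Doc) -> None:
    edges.append(f'{parent.id} -> {current.id}')


root.traverse('c', record_edge)   # chunks of root
root.traverse('cm', record_edge)  # matches of each chunk
```

Because the traversal only needs the document and a callback, the same method could serve plotting, validation, or any other client-side use without going through a driver.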

JoanFM · 2020-11-30T16:52:10Z

Good idea. It is important to know that, under the current working pattern, the interface actually used is the one in DocumentSet.

bwanglzu · 2020-11-30T16:56:32Z

clear to me, thanks for the clarification.

hanxiao · 2020-12-02T07:38:31Z

This PR has been open for quite some time; can we get it done ASAP? @bwanglzu what is the estimate here?

bwanglzu · 2020-12-02T08:48:05Z

@hanxiao Sorry, I was busy with other PRs yesterday. I'll make it ready for review today.

jina/drivers/__init__.py

hanxiao · 2020-12-03T13:11:34Z

tests/unit/types/test_documentset.py

+    # assert len(documentset[0].matches[0].chunks) == 1
+
+
+# def test_only_matches():


Are they temporarily disabled, or are these tests deprecated?

@hanxiao I disabled these while debugging

But are they new? Should I uncomment them to validate the PR?

These seem to be copy-pasted from the driver tests.

hanxiao · 2020-12-03T13:13:08Z

This PR is taking too long; I'm taking it over now.

hanxiao · 2020-12-03T13:40:44Z

🔴

bwanglzu · 2020-12-03T18:02:49Z

tests/unit/drivers/test_cache_driver.py


        # new docs
        docs = list(random_docs(10))
-        driver._traverse_apply(docs)
+        driver.__call__()


I think driver() would be sufficient.

Yes, but this part is still broken; we need a way to change driver.docs.

codecov · 2020-12-03T21:40:45Z

Codecov Report

Merging #1368 (16c8890) into master (878740c) will decrease coverage by 0.26%.
The diff coverage is 76.56%.

@@            Coverage Diff             @@
##           master    #1368      +/-   ##
==========================================
- Coverage   83.67%   83.41%   -0.27%     
==========================================
  Files         104      104              
  Lines        6861     6860       -1     
==========================================
- Hits         5741     5722      -19     
- Misses       1120     1138      +18

Impacted Files	Coverage Δ
jina/types/document/__init__.py	`92.56% <25.00%> (-4.81%)`	⬇️
jina/types/sets/document_set.py	`96.34% <60.00%> (-2.51%)`	⬇️
jina/types/document/multimodal.py	`97.91% <96.77%> (-2.09%)`	⬇️
jina/drivers/multimodal.py	`91.66% <100.00%> (+0.23%)`	⬆️
jina/executors/decorators.py	`88.30% <100.00%> (+0.06%)`	⬆️
jina/executors/indexers/vector.py	`87.65% <100.00%> (+0.77%)`	⬆️
jina/types/sets/match_set.py	`100.00% <100.00%> (ø)`
jina/peapods/grpc_asyncio.py	`76.53% <0.00%> (-4.09%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b5c2878...6f2442b. Read the comment docs.

nan-wang · 2020-12-04T07:22:49Z

jina/types/document/multimodal.py

+        # Joan: https://github.com/jina-ai/jina/pull/1335#discussion_r533905780
+        # If chunk.granularity is 0. (This means a user without caring for granularity wants
+        #   to merge N documents into a multimodal document, therefore we do what
+        #   u have here of increasing their granularity inside this set) Well documented please
+        # If the chunk comes with granularity > 0, then it means that someone has cared to chunk already
+        #   the document or that we have some driver that generates multimodal documents in the future.
+        #   Then, have document.granularity = chunk.granularity - 1.


Maybe we need to clean this up

I gave this a good deal of thought, and I truly believe it is the best trade-off between coherence with granularity and ease of use from both the Driver and the Client side.

@bwanglzu did we add good unit tests for this?

@JoanFM yes we did
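The granularity rule described in the code comment above can be sketched as follows. `resolve_granularity` is a hypothetical helper, not a function from the PR, and returning `(0, 1)` for the zero-granularity case is an assumption about what "increasing their granularity inside this set" means.

```python
from typing import Tuple


def resolve_granularity(chunk_granularity: int) -> Tuple[int, int]:
    """Return (document_granularity, chunk_granularity) for a multimodal
    document built from a chunk that arrives with the given granularity."""
    if chunk_granularity == 0:
        # The user did not care about granularity: keep the parent document
        # at 0 and raise the chunk one level (assumed to mean 0 -> 1).
        return 0, 1
    # The chunk was already chunked by someone: the parent document sits
    # exactly one level above it.
    return chunk_granularity - 1, chunk_granularity
```

In both branches the chunk ends up exactly one level below its parent document, which is the coherence property the comment argues for.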

feat: move traverse from driver to document

cc04881

jina-bot added the size/S label Nov 28, 2020

bwanglzu linked an issue Nov 28, 2020 that may be closed by this pull request

refactoring BaseRecursiveDriver's recursion logic to Document data type #1326

Closed

jina-bot added area/core This issue/PR affects the core codebase component/driver component/type labels Nov 28, 2020

feat: move traverse from driver to document

4079804

JoanFM requested changes Nov 28, 2020

JoanFM reviewed Nov 28, 2020

bwanglzu self-assigned this Nov 28, 2020

JoanFM reviewed Nov 28, 2020

bwanglzu added 6 commits November 28, 2020 14:41

feat: move traverse from driver to document

5635195

feat: move traverse from driver to document

f09febb

feat: move traverse from driver to document

8d5303a

feat: move traverse from driver to document

f9a806e

feat: move traverse from driver to document

61b3e1e

feat: move traverse from driver to document

a9df71e

JoanFM requested changes Nov 28, 2020

feat: move traverse from driver to document

de22ed7

bwanglzu mentioned this pull request Nov 29, 2020

refactor chunkset and matchset append method #1372

Merged

jina-bot added size/M area/testing This issue/PR affects testing and removed size/S labels Dec 2, 2020

bwanglzu added 2 commits December 2, 2020 21:57

Merge branch 'master' into feat-refactor-recursion

58c0573

feat: change callback fn name

3401d17

hanxiao reviewed Dec 3, 2020

fix: udate jina/drivers/__init__.py

cf2678e

hanxiao reviewed Dec 3, 2020

Merge branch 'master' into feat-refactor-recursion

50f5026

hanxiao added 2 commits December 3, 2020 16:50

refactor(driver): move traverse to docs

bd86c5d

refactor(driver): move traverse to docs

6d4ca78

jina-bot added the area/helper This issue/PR affects the helper functionality label Dec 3, 2020

bwanglzu commented Dec 3, 2020

refactor(proto): jina primitive types

3a6180e

jina-bot added size/L and removed size/M labels Dec 3, 2020

refactor(types): move append and extend into QueryLangSet

2bd85f7

jina-bot added size/M and removed size/L labels Dec 3, 2020

hanxiao added 5 commits December 3, 2020 20:55

refactor(types): move append and extend into QueryLangSet

69822ce

refactor(types): move append and extend into QueryLangSet

e5e5470

refactor(types): move append and extend into QueryLangSet

66220b2

refactor(types): move append and extend into QueryLangSet

f35d403

refactor(types): move append and extend into QueryLangSet

6f2442b

hanxiao changed the title ~~feat: move traverse from driver to document~~ refactor(multimodal): refactor multimodal document and set Dec 3, 2020

hanxiao marked this pull request as ready for review December 3, 2020 21:30

hanxiao requested a review from a team as a code owner December 3, 2020 21:30

hanxiao requested review from maximilianwerk and florian-hoenicke December 3, 2020 21:30

hanxiao merged commit 576fbb7 into master Dec 3, 2020

hanxiao deleted the feat-refactor-recursion branch December 3, 2020 21:46

nan-wang reviewed Dec 4, 2020
