refactor(multimodal): refactor multimodal document and set #1368

Merged
merged 25 commits into from
Dec 3, 2020

Conversation

bwanglzu
Member

@bwanglzu bwanglzu commented Nov 28, 2020

Blocked by #1372 when adding chunk level test cases.

@jina-bot added the area/core, component/driver, and component/type labels on Nov 28, 2020
jina/drivers/__init__.py (outdated)
if path:
    next_edge = path[0]
    if next_edge == 'm':
        self._traverse_rec(
Member

Just a comment: this call does not match the signature, right?

Member Author

yes, I'm working on that

@bwanglzu bwanglzu self-assigned this Nov 28, 2020
@@ -534,3 +534,24 @@ def MergeFrom(self, doc: 'Document'):

    def CopyFrom(self, doc: 'Document'):
        self._document.CopyFrom(doc.as_pb_object)

    def traverse_apply(self, traversal_paths: Tuple[str], apply_func: Callable, *args, **kwargs) -> None:
Member

Not sure, but traverse_apply feels like it may belong in the DocumentSet class instead.

Member Author

I have a better idea: the Document handles traverse and the DocumentSet handles traverse_all. I'll update the code, one sec.

Member

Seems like a good idea! Either way, it is good to iterate on this and to break things in order to learn how the feature works.

else:
self._apply_all(docs, parent_doc, parent_edge_type, *args, **kwargs)
for doc in self.docs:
doc.traverse_apply(self._traversal_paths, self._apply_all)
Member

I think this does not match the current behavior. This function should belong to the DocumentSet; otherwise you will not be able to apply batching.
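The batching concern above can be illustrated with a minimal sketch. The `Document` and `DocumentSet` classes below are toy stand-ins, and `traverse_chunks` and `make_doc` are hypothetical names chosen for illustration; none of this is Jina's actual implementation. The sketch only shows that a per-document traversal hands the function one document's chunks at a time, while a set-level traversal can gather the chunks of all documents and pass them on in a single call.

```python
from typing import Callable, List


class Document:
    """Toy document with an id and child chunks; not Jina's actual class."""

    def __init__(self, doc_id: str):
        self.id = doc_id
        self.chunks: List['Document'] = []


class DocumentSet:
    """Toy set of documents; hypothetical, for illustration only."""

    def __init__(self, docs: List[Document]):
        self.docs = docs

    def traverse_chunks(self, apply_fn: Callable[[List[Document]], None]) -> None:
        # Gather the chunks of every document first, then hand them to
        # apply_fn in a single call so it can process them as one batch.
        batch = [chunk for doc in self.docs for chunk in doc.chunks]
        apply_fn(batch)


def make_doc(doc_id: str, n_chunks: int) -> Document:
    """Hypothetical helper that builds a document with n_chunks chunks."""
    doc = Document(doc_id)
    doc.chunks = [Document(f'{doc_id}-c{i}') for i in range(n_chunks)]
    return doc


docs = [make_doc('a', 2), make_doc('b', 3)]

# Per-document traversal: each call only sees one document's chunks.
per_doc_calls = [len(doc.chunks) for doc in docs]

# Set-level traversal: one call sees the chunks of all documents.
set_calls = []
DocumentSet(docs).traverse_chunks(lambda batch: set_calls.append(len(batch)))
```

Here the per-document route produces two small calls (2 and 3 chunks), while the set-level route produces one call with all 5 chunks, which is what a batched executor would typically want.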

@hanxiao
Member

hanxiao commented Nov 30, 2020

@bwanglzu @JoanFM I want to highlight the importance of having traversal() as a method in the Document class. (The DocumentSet's traversal can be based on that, but no need to make them fully separated).

Leaving the adoption by all drivers aside, I would like to have the following interface for Document and DocumentSet:

class Document:
   def traverse(self, traverse_path: str, callback_fn: Callable[['Document', None], None]) -> None:
      ...

class DocumentSet:
   def traverse(self, traverse_path: str, callback_fn: Callable[['Document', None], None]) -> None:
      ...

Use case 1:

  • I want to visualize the recursive document structure in a Jupyter Notebook, so I would like to implement my doc.plot() function on top of traverse by passing the following callback:
plot_str = ''

def callback(parent: Document, current: Document, siblings: DocumentSet, *args):
   global plot_str  # required to append to the module-level string
   plot_str += f'{parent.id} -> {current.id}\n'

doc.traverse(callback)
plot(plot_str)

Use case 2:

  • After the result is returned from Jina, I would like to validate the embeddings of every document recursively.
def callback(parent: Document, current: Document, siblings: DocumentSet, *args):
   # np.array builds the expected values; np.ndarray([1, 2]) would allocate an uninitialized 1x2 array
   np.testing.assert_equal(current.embedding, np.array([1, 2]))

def is_all_equal():
   doc.traverse(callback)

In general, the current, very powerful traversal is implemented at the Driver level, which prevents it from being used at any other level. Moving it to the Document data type removes this restriction and enables many possibilities, particularly on the client side.
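A minimal sketch of what a Document-level traversal with a callback could look like follows. It is an illustration under stated assumptions, not the code merged in this PR: the `Document` class is a toy stand-in, the extra `parent` parameter is hypothetical, and the path letters are assumed to mean `'c'` for chunks and `'m'` for matches (only `'m'` appears in the diff above).

```python
from typing import Callable, List, Optional


class Document:
    """Toy stand-in for a recursive document; not Jina's actual class."""

    def __init__(self, doc_id: str):
        self.id = doc_id
        self.chunks: List['Document'] = []
        self.matches: List['Document'] = []

    def traverse(self, traverse_path: str,
                 callback_fn: Callable[[Optional['Document'], 'Document'], None],
                 parent: Optional['Document'] = None) -> None:
        """Follow traverse_path edge by edge and call
        callback_fn(parent, current) on each document at the end of the path."""
        if not traverse_path:
            callback_fn(parent, self)
            return
        edge, rest = traverse_path[0], traverse_path[1:]
        if edge == 'c':      # assumed: 'c' selects chunks
            children = self.chunks
        elif edge == 'm':    # 'm' selects matches
            children = self.matches
        else:
            raise ValueError(f'unknown edge {edge!r}')
        for child in children:
            child.traverse(rest, callback_fn, parent=self)


# Build a small tree: root has two chunks, and the first chunk has one match.
root = Document('root')
c1, c2, m1 = Document('c1'), Document('c2'), Document('m1')
root.chunks.extend([c1, c2])
c1.matches.append(m1)

edges = []


def record_edge(parent: Document, current: Document) -> None:
    edges.append(f'{parent.id} -> {current.id}')


root.traverse('c', record_edge)   # visits the chunks of root
root.traverse('cm', record_edge)  # visits the matches of each chunk
```

Because the traversal lives on the document itself, the same method can back a plotting helper or a client-side validation routine without involving any driver.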

@JoanFM
Member

JoanFM commented Nov 30, 2020

Good idea. It is important to note that, with the current working pattern, the interface that will actually be used is the one in DocumentSet.

@bwanglzu
Member Author

clear to me, thanks for the clarification.

@hanxiao
Member

hanxiao commented Dec 2, 2020

This PR has been open for quite some time. Can we get it done ASAP? @bwanglzu what is your estimate?

@bwanglzu
Member Author

bwanglzu commented Dec 2, 2020

@hanxiao Sorry, I was busy with other PRs yesterday. I'll make it ready for review today.

@jina-bot added the size/M and area/testing labels and removed the size/S label on Dec 2, 2020
jina/drivers/__init__.py (outdated)
# assert len(documentset[0].matches[0].chunks) == 1


# def test_only_matches():
Member

Are they temporarily disabled, or are these tests deprecated?

Member Author

@hanxiao I disabled these while debugging

Member

But are they new? Should I uncomment them to validate the PR?

Member

These seem to be copy pasted from driver.

@hanxiao
Member

hanxiao commented Dec 3, 2020

This PR is taking too long; I'm taking it over now.

@hanxiao
Member

hanxiao commented Dec 3, 2020

🔴

@jina-bot added the area/helper label on Dec 3, 2020

# new docs
docs = list(random_docs(10))
driver._traverse_apply(docs)
driver.__call__()
Member Author

I think driver() would be sufficient.

Member

Yes, but this part is still broken; we need a way to change driver.docs.

@jina-bot jina-bot added size/L and removed size/M labels Dec 3, 2020
@jina-bot jina-bot added size/M and removed size/L labels Dec 3, 2020
@hanxiao hanxiao changed the title from "feat: move traverse from driver to document" to "refactor(multimodal): refactor multimodal document and set" Dec 3, 2020
@hanxiao hanxiao marked this pull request as ready for review December 3, 2020 21:30
@hanxiao hanxiao requested a review from a team as a code owner December 3, 2020 21:30
@codecov

codecov bot commented Dec 3, 2020

Codecov Report

Merging #1368 (16c8890) into master (878740c) will decrease coverage by 0.26%.
The diff coverage is 76.56%.


@@            Coverage Diff             @@
##           master    #1368      +/-   ##
==========================================
- Coverage   83.67%   83.41%   -0.27%     
==========================================
  Files         104      104              
  Lines        6861     6860       -1     
==========================================
- Hits         5741     5722      -19     
- Misses       1120     1138      +18     
Impacted Files                       Coverage   Δ
jina/types/document/__init__.py      92.56%     <25.00%> (-4.81%) ⬇️
jina/types/sets/document_set.py      96.34%     <60.00%> (-2.51%) ⬇️
jina/types/document/multimodal.py    97.91%     <96.77%> (-2.09%) ⬇️
jina/drivers/multimodal.py           91.66%     <100.00%> (+0.23%) ⬆️
jina/executors/decorators.py         88.30%     <100.00%> (+0.06%) ⬆️
jina/executors/indexers/vector.py    87.65%     <100.00%> (+0.77%) ⬆️
jina/types/sets/match_set.py         100.00%    <100.00%> (ø)
jina/peapods/grpc_asyncio.py         76.53%     <0.00%> (-4.09%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@hanxiao hanxiao merged commit 576fbb7 into master Dec 3, 2020
@hanxiao hanxiao deleted the feat-refactor-recursion branch December 3, 2020 21:46
Comment on lines +72 to +78
# Joan: https://github.com/jina-ai/jina/pull/1335#discussion_r533905780
# If chunk.granularity is 0, a user who does not care about granularity wants
# to merge N documents into a multimodal document, so we do what
# you have here and increase their granularity inside this set. Please document this well.
# If the chunk comes with granularity > 0, then someone has already chunked
# the document, or we have some driver that generates multimodal documents in the future.
# Then, set document.granularity = chunk.granularity - 1.
Member

Maybe we need to clean this up

Member

I gave this a good deal of thought, and I truly believe it is the best trade-off between staying coherent with granularity and ease of use from both the Driver and the Client side.

Member

@bwanglzu do we have good unit tests for this?

Member Author

@JoanFM yes we did
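
The granularity rule described in the code comment of this thread can be sketched as follows. This is a simplified illustration, not the merged implementation: `Chunk` is a toy class, `resolve_granularity` is a hypothetical helper, and the sketch assumes that all chunks share the same granularity and that a parent built from granularity-0 chunks stays at granularity 0.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Toy chunk carrying only the field the rule needs; not Jina's class."""
    granularity: int = 0


def resolve_granularity(chunks: List[Chunk]) -> int:
    """Return the parent document's granularity, adjusting chunks in place.

    Assumes every chunk in the list has the same granularity.
    """
    if not chunks:
        return 0
    first = chunks[0].granularity
    if first == 0:
        # The caller did not care about granularity: keep the parent at 0
        # (assumption) and move every chunk one level below it.
        for chunk in chunks:
            chunk.granularity = 1
        return 0
    # The caller already chunked the document: the parent sits one level above.
    return first - 1


plain = [Chunk(), Chunk()]
plain_parent = resolve_granularity(plain)

pre_chunked = [Chunk(granularity=2), Chunk(granularity=2)]
pre_chunked_parent = resolve_granularity(pre_chunked)
```

In both branches the parent ends up exactly one level above its chunks, which is the coherency property the thread refers to.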

Labels
area/core (affects the core codebase), area/helper (affects the helper functionality), area/testing (affects testing), component/driver, component/type, size/M
Projects
None yet
Development

Successfully merging this pull request may close these issues.

refactoring BaseRecursiveDriver's recursion logic to Document data type
5 participants