-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(multimodal): refactor multimodal document and set #1368
Conversation
jina/types/document/__init__.py
Outdated
if path: | ||
next_edge = path[0] | ||
if next_edge == 'm': | ||
self._traverse_rec( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a comment, This call does not match signature right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I'm working on that
jina/types/document/__init__.py
Outdated
@@ -534,3 +534,24 @@ def MergeFrom(self, doc: 'Document'): | |||
|
|||
def CopyFrom(self, doc: 'Document'): | |||
self._document.CopyFrom(doc.as_pb_object) | |||
|
|||
def traverse_apply(self, traversal_paths: Tuple[str], apply_func: Callable, *args, **kwargs) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, but traverse_apply
feels like it may belong into DocumentSet
class maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a better idea, the document handle traverse
and document set handles traverse_all, I'll update the code, one sec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a good idea! But anyway good to iterate on this, and to break things to learn the insights of how the feature works
jina/drivers/__init__.py
Outdated
else: | ||
self._apply_all(docs, parent_doc, parent_edge_type, *args, **kwargs) | ||
for doc in self.docs: | ||
doc.traverse_apply(self._traversal_paths, self._apply_all) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this does not comply with the current behavior. This function should belong to document set. Otherwise you will not be able to apply batching
@bwanglzu @JoanFM I want to highlight the importance of having All drivers adoption aside, I would like to have the following interface for class Document:
def traverse(self, traverse_path: str, callback_fn: Callable[['Document', None], None]) -> None:
class DocumentSet:
def traverse(self, traverse_path: str, callback_fn: Callable[['Document', None], None]) -> None: Usecase 1:
plot_str = ''
def callback(parent: Document, current: Document, siblings: DocumentSet, ...):
plot_str += f'{parent.id} -> {current.id}'
plot(doc.traverse(callback)) Usecase 2:
def callback(parent: Document, current: Document, siblings: DocumentSet, ...):
np.testing.assert_equal(current.embedding, np.ndarray([1,2]))
def is_all_equal():
doc.traverse(callback) In general, the current very powerful traversal is implemented at the Driver level, which restricts its usage on any other level. Moving it to |
Good idea, It is important to know that from the |
clear to me, thanks for the clarification. |
this PR has been here for quite sometime, can we get it done asap? @bwanglzu what is the estimation here? |
@hanxiao Sorry I've been busy with other PRs yesterday, I'll make it ready to review today |
tests/unit/types/test_documentset.py
Outdated
# assert len(documentset[0].matches[0].chunks) == 1 | ||
|
||
|
||
# def test_only_matches(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are they temporally disabled or these tests are depreciated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hanxiao I disabled these while debugging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but are they new? should I comment it back to valid the PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seem to be copy pasted from driver.
this PR takes too long, i'm taking it now. |
🔴 |
|
||
# new docs | ||
docs = list(random_docs(10)) | ||
driver._traverse_apply(docs) | ||
driver.__call__() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think driver()
would be sufficiently enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but this part is still broken, need a way to change driver.docs
Codecov Report
@@ Coverage Diff @@
## master #1368 +/- ##
==========================================
- Coverage 83.67% 83.41% -0.27%
==========================================
Files 104 104
Lines 6861 6860 -1
==========================================
- Hits 5741 5722 -19
- Misses 1120 1138 +18
Continue to review full report at Codecov.
|
# Joan: https://github.com/jina-ai/jina/pull/1335#discussion_r533905780 | ||
# If chunk.granularity is 0. (This means a user without caring for granularity wants | ||
# to merge N documents into a multimodal document, therefore we do what | ||
# u have here of increasing their granularity inside this set) Well documented please | ||
# If the chunk comes with granularity > 0, then it means that someone has cared to chunk already | ||
# the document or that we have some driver that generates muktimodal documents in the future. | ||
# Then, have document.granularity = chunk.granularity - 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we need to clean this up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put some good thought on it and I truly believe it is the best trade off between coherency with granularity and ease of use from both Driver and Client side
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bwanglzu did we put good unittest on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JoanFM yes we did
Blocked by #1372 when adding chunk level test cases.