feat: document random_id #1440
I made the first PR refactoring the places where the And to still be able to have the hash of the document there is a new |
Looks like a good direction to go for me.
Hey people, as an update, there's a |
I think that file explains how it works: https://github.com/jina-ai/jina/blob/master/jina/proto/build-proto.sh So it should be |
Yeah it was like that, I didn't have to do anything else |
jina/types/document/__init__.py
Outdated
@@ -135,6 +137,12 @@ def __init__(self, document: Optional[DocumentSourceType] = None,
                                  f'if you are trying to set the content '
                                  f'you may use "Document(content=your_content)"') from ex

+        if custom_id is None:
+            import random
+            custom_id = random.randint(0, 9)
So is the id a string (from the function signature) or an int?
It does not matter. Good opportunity to dig a little deeper: try to find out why it does not matter by looking at the implementation of self.id.
29538da to 5849ab3
Just rebased. Please review once again @JoanFM @theUnkownName |
2bda5bf to 9a31132
Latency summary
Current PR yields:
Breakdown
Backed by latency-tracking. Further commits will update this comment. |
@@ -18,12 +18,10 @@
                         Tuple[DocumentSourceType, DocumentSourceType]]]


-def _build_doc(data, data_type: DataInputType, override_doc_id, **kwargs) -> Tuple['Document', 'DataInputType']:
+def _build_doc(data, data_type: DataInputType, **kwargs) -> Tuple['Document', 'DataInputType']:
Why don't we keep override_doc_id, so that the user can choose whether we generate a random id for them? Have we checked whether any example uses this?
I don't get the question. Either the user gives an id in the data, or the Document gets a random id. In either case, the user can decide to overwrite the id later via doc.id = 1337.
OK, good: using this _build_doc will generate random ids. If the user wants custom ids, they have to use their own generator. Fair enough.
jina/types/document/__init__.py
Outdated
@@ -135,6 +137,10 @@ def __init__(self, document: Optional[DocumentSourceType] = None,
                                  f'if you are trying to set the content '
                                  f'you may use "Document(content=your_content)"') from ex

+        if self._document.id is None or self._document.id == '':
self._document.id.is_empty() may work better? Not sure if that works in Python.
Well apparently mine is the best practice in python as far as I know.
create a constant on top. MAX_DOCUMENT_ID
jina/types/document/__init__.py
Outdated
@@ -135,6 +137,10 @@ def __init__(self, document: Optional[DocumentSourceType] = None,
                                  f'if you are trying to set the content '
                                  f'you may use "Document(content=your_content)"') from ex

+        if self._document.id is None or self._document.id == '':
+            import random
+            self.id = random.randint(0, 9223372036854775807)
isn't there some Python built-in Integer.Max or something?
This is 2**64/2 - 1 (that is, 2**63 - 1), which should fit well into 64 bits. Apparently, we could directly use 2**64. Haven't thought too much about it. I'll give it another thought tomorrow.
yes but python must have some default value for this.
Nope, Python does not. int is unbounded in Python. Anyhow, numpy has np.iinfo(np.int64).max, which is 9223372036854775807. Since we do int(value).to_bytes(_digest_size, sys.byteorder, signed=True) to convert it to bytes, this should also be the right number. If I understand correctly, we could also have negative ids, but that feels weird. So I restrict it to the numpy max.
Yes, I am not saying the number is wrong. At least have a well-documented CONSTANT rather than a magic number.
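A minimal sketch of what this thread converges on: a documented constant instead of the magic number, plus the signed byte conversion quoted above. The helper names and the 8-byte value of `_DIGEST_SIZE` are assumptions made for illustration; the thread only names `_digest_size` and `MAX_DOCUMENT_ID`.

```python
import random
import sys

# Largest value of a signed 64-bit integer (2**63 - 1).
# Same number as np.iinfo(np.int64).max mentioned in the thread.
MAX_DOCUMENT_ID = 2 ** 63 - 1

# Assumption: 8 bytes. The thread names `_digest_size` but not its value.
_DIGEST_SIZE = 8


def random_document_id() -> int:
    """Hypothetical helper: a non-negative id that fits a signed 64-bit integer."""
    return random.randint(0, MAX_DOCUMENT_ID)


def id_to_bytes(value: int) -> bytes:
    """Same conversion as quoted in the thread, with the assumed digest size."""
    return int(value).to_bytes(_DIGEST_SIZE, sys.byteorder, signed=True)


def bytes_to_id(value: bytes) -> int:
    """Inverse of id_to_bytes."""
    return int.from_bytes(value, sys.byteorder, signed=True)


doc_id = random_document_id()
round_tripped = bytes_to_id(id_to_bytes(doc_id))
```

With `signed=True` and 8 bytes, any value above `MAX_DOCUMENT_ID` raises `OverflowError` in `to_bytes`, which is why the upper bound is 2**63 - 1 rather than 2**64.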
@@ -402,7 +401,7 @@ def __enter__(self):
         return self

     def __exit__(self, exc_type, exc_val, exc_tb):
Eventually we will remove the context manager concept from Document, right? Or should we at least do some check, like whether id is not None and not empty?
It cannot be None or empty at that point. It will always be populated. This is just for backward compatibility and for "niceness".
And all of a sudden, it is not pass anymore. Reasoning: it makes sense to update the content_hash after the Document is completed. Otherwise, the user would have to call update_content_hash manually.
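The idea described here can be sketched roughly as follows. `SketchDocument` is a hypothetical stand-in, not the jina `Document` class, and the choice of `hashlib.blake2b` over a single `content` field is an assumption made only to keep the example self-contained.

```python
import hashlib


class SketchDocument:
    """Hypothetical stand-in for Document; only illustrates the __exit__ hook."""

    def __init__(self, content: str = ''):
        self.content = content
        self.content_hash = ''

    def update_content_hash(self) -> None:
        # In this sketch the hash is computed over the content field only.
        digest = hashlib.blake2b(self.content.encode('utf-8'), digest_size=8)
        self.content_hash = digest.hexdigest()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Refresh the hash once the `with` block has finished filling the document,
        # so the caller does not have to call update_content_hash() manually.
        self.update_content_hash()


with SketchDocument() as doc:
    doc.content = 'hello'

with SketchDocument() as other:
    other.content = 'hello'
```

After both `with` blocks, `doc.content_hash` and `other.content_hash` are populated and equal, because they were built from the same content.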
@@ -64,21 +64,6 @@ def test_np_int():
     assert hash2bytes(np.int64(a)) == hash2bytes(a)


-def test_hash():
Have these functions been removed? I have not seen their removal in this PR, right?
Haha, so predictable. When I removed this test, I knew you would write that. new_doc_id was renamed to get_content_hash and new_doc_hash was removed. So there is no point in testing the forward-backward conversion. I thought about a test like
hash_ = get_content_hash(d)
assert hash_ == hash2id(id2hash(hash_))
But this already shows that the semantics of id and hash are not precise anymore (beware, it is NOT id2hash(hash2id(hash_))!). I will give this another thought tomorrow with a fresh mind.
88ed15a to 9b4741b
I added the hash calculation, which omits the |
 class KVIndexDriver(BaseIndexDriver):
     """Serialize the documents/chunks in the request to key-value JSON pairs and write it using the executor
     """

     def _apply_all(self, docs: 'DocumentSet', *args, **kwargs) -> None:
-        keys = [hash(doc.id) for doc in docs]
+        keys = [int(doc.id) for doc in docs]
does this have to be done in the VectorIndexDriver as well?
It is. See line 30.
ah okay, I searched for VectorIndexDriver and did not find a hit. 😳
Was approved before. Tests now succeed.
UPDATE [2020-21-16 21:21 CET by @maximilianwerk]
This PR now consistently renames hash back to id and also changes the semantics back to a simple id. In case no id is provided for a Document, a random one will be generated.
Furthermore, some artifacts of the old hash still exist, since they are still used in the Hub. Once I have refactored all code in the Hub to the new id, I will remove the remaining old hash artifacts.
Finally, a new content_hash is "introduced". Alternative formulation: the hash semantics of the old id/hash now live inside the content_hash