Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: document random_id #1440

Merged
merged 24 commits into from
Dec 18, 2020
Merged

feat: document random_id #1440

merged 24 commits into from
Dec 18, 2020

Conversation

CatStark
Copy link
Member

@CatStark CatStark commented Dec 11, 2020

UPDATE [2020-21-16 21:21 CET by @maximilianwerk]
This PR now consistently renames hash back to id and also changes the semantic back to a simple id. In case no id is provided for a Document, a random one will be generated.

Furthermore, some artifacts of the old hash are still existing, since they are still used in the Hub. Once I refactored all code in the Hub to the new id, I will remove the missing old hash artifacts.

Finally, a new content_hash is "introduced". Alternative formulation: The hash-semantic of the old id/hash is now inside the content_hash

@jina-bot jina-bot added size/M area/core This issue/PR affects the core codebase area/network This issue/PR affects network functionality area/testing This issue/PR affects testing component/client component/proto component/type labels Dec 11, 2020
@CatStark
Copy link
Member Author

I made the first PR refactoring the places where the update_id was called.
Now the id will either be set by the client or if none was set, it will be a random one.

And to still be able to have the hash of the document there is a new get_document_hash function. If this is ok I will then refactor all the tests that were using update_id

Copy link
Member

@JoanFM JoanFM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a good direction to go for me.

jina/clients/python/request.py Outdated Show resolved Hide resolved
jina/types/document/__init__.py Outdated Show resolved Hide resolved
jina/types/document/uid.py Outdated Show resolved Hide resolved
@CatStark
Copy link
Member Author

Hey people, as an update, there's a content_hash on the DocumentProto now, Max told me that this needs another step to transform the jina.proto into python file, anyone knows how to proceed on this?

@maximilianwerk
Copy link
Member

maximilianwerk commented Dec 14, 2020

Hey people, as an update, there's a content_hash on the DocumentProto now, Max told me that this needs another step to transform the jina.proto into python file, anyone knows how to proceed on this?

I think that file explains how it works: https://github.com/jina-ai/jina/blob/master/jina/proto/build-proto.sh

So it should be docker run -v $(pwd)/jina/proto:/jina/proto jinaai/protogen (perhaps you have to build the docker image beforehand)

@CatStark
Copy link
Member Author

Hey people, as an update, there's a content_hash on the DocumentProto now, Max told me that this needs another step to transform the jina.proto into python file, anyone knows how to proceed on this?

I think that file explains how it works: https://github.com/jina-ai/jina/blob/master/jina/proto/build-proto.sh

So it should be docker run -v $(pwd)/jina/proto:/jina/proto jinaai/protogen (perhaps you have to build the docker image beforehand)

Yeah it was like that, I didn't have to do anything else

@@ -135,6 +137,12 @@ def __init__(self, document: Optional[DocumentSourceType] = None,
f'if you are trying to set the content '
f'you may use "Document(content=your_content)"') from ex

if custom_id is None:
import random
custom_id = random.randint(0, 9)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the id a string (from the function signature) or an int?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not matter. Good opportunity to dig a little deeper: Try to find out why is does not matter by looking at the implementation of self.id.

@maximilianwerk
Copy link
Member

Just rebased. Please review once again @JoanFM @theUnkownName

@maximilianwerk maximilianwerk force-pushed the feat-document-random-id branch 2 times, most recently from 2bda5bf to 9a31132 Compare December 15, 2020 19:22
@github-actions
Copy link

github-actions bot commented Dec 15, 2020

Latency summary

Current PR yields:

  • 😶 index QPS at 1557, delta to last 3 avg.: -4%
  • 😶 query QPS at 26, delta to last 3 avg.: -2%

Breakdown

Version Index QPS Query QPS
current 1557 26
0.8.8 1634 26
0.8.7 1659 26
0.8.6 1611 26

Backed by latency-tracking. Further commits will update this comment.

@@ -18,12 +18,10 @@
Tuple[DocumentSourceType, DocumentSourceType]]]


def _build_doc(data, data_type: DataInputType, override_doc_id, **kwargs) -> Tuple['Document', 'DataInputType']:
def _build_doc(data, data_type: DataInputType, **kwargs) -> Tuple['Document', 'DataInputType']:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we allow override_doc_id so that the user can choose that we give for them a `random id? Have we checked if some example uses this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get the question. Either, the user gives and id in the data or the Document gets a random id. In either case, the user can decide to overwrite the id later via doc.id = 1337.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok good, using this _build_doc will generate random ids, if the user wants custom ids, they have to use their own generator. Fair enough

@@ -135,6 +137,10 @@ def __init__(self, document: Optional[DocumentSourceType] = None,
f'if you are trying to set the content '
f'you may use "Document(content=your_content)"') from ex

if self._document.id is None or self._document.id == '':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self._document.id.is_empty() may work better? not sure if works in python

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well apparently mine is the best practice in python as far as I know.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create a constant on top. MAX_DOCUMENT_ID

@@ -135,6 +137,10 @@ def __init__(self, document: Optional[DocumentSourceType] = None,
f'if you are trying to set the content '
f'you may use "Document(content=your_content)"') from ex

if self._document.id is None or self._document.id == '':
import random
self.id = random.randint(0, 9223372036854775807)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't there some Python built-in Integer.Max or something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is 2**64/2, which is what should well fit into 64 bit. Apparently, we could directly use 2**64. Haven't thought too much about it. I'll spend another thought tomorrow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but python must have some default value for this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, python does not. int is unbounded in python. Anyhow, numpy has: np.iinfo(np.int64).max, which is 9223372036854775807. Since we do int(value).to_bytes(_digest_size, sys.byteorder, signed=True) to convert it to bytes, this should also be the right number. If I understand correctly, we could also have negative ids, but that feels weird. So I restrict for the numpy max.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I am not saying number is wrong, at least have a CONSTANT well documented rather than a magic number.

@@ -402,7 +401,7 @@ def __enter__(self):
return self

def __exit__(self, exc_type, exc_val, exc_tb):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eventually we will remove the context manager concept from document right? or should we at least do some check like wether id is not None or not empty?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It cannot be None or empty at the point. It will always be populated. This is just for backward compatibility and for "niceness".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And out of a sudden, it is not pass anymore. Reasoning: It makes sense to update the content_hash after Document is completed. Otherwise, the user would have to call the update_content_hash manually.

@@ -64,21 +64,6 @@ def test_np_int():
assert hash2bytes(np.int64(a)) == hash2bytes(a)


def test_hash():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have these functions been removed? I have not seen the removal of these functions in this PR right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, so predictable. When I removed this test, I knew you would write that.

new_doc_id was renamed to get_content_hash and new_doc_hash was removed. So there is no point in testing forward-backward converting. I thought about a test like

hash_ = get_content_hash(d)
assert hash_ == hash2id(id2hash(hash_))

But this already shows, that the semantic of id and hash are not precise anymore (Beware, it is NOT id2hash(hash2id(hash_))!). I will have another thought about this tomorrow with a fresh mind.

@maximilianwerk
Copy link
Member

Shouldn't the content hash not depend on the id?

I added the hash calculation, which omits the id. This should pretty much be the same as the old hash calculation.



class KVIndexDriver(BaseIndexDriver):
"""Serialize the documents/chunks in the request to key-value JSON pairs and write it using the executor
"""

def _apply_all(self, docs: 'DocumentSet', *args, **kwargs) -> None:
keys = [hash(doc.id) for doc in docs]
keys = [int(doc.id) for doc in docs]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this have to be done in the VectorIndexDriver as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is. See line 30.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah okay, I searched for VectorIndexDriver and did not find a hit. 😳

Copy link
Member

@maximilianwerk maximilianwerk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was approved before. Tests now succeed.

@hanxiao hanxiao merged commit 2569eba into master Dec 18, 2020
@hanxiao hanxiao deleted the feat-document-random-id branch December 18, 2020 12:10
@cristianmtr cristianmtr restored the feat-document-random-id branch December 18, 2020 12:30
@JoanFM JoanFM deleted the feat-document-random-id branch November 30, 2021 16:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants