feat: document random_id #1440

CatStark · 2020-12-11T12:04:29Z

UPDATE [2020-12-16 21:21 CET by @maximilianwerk]
This PR now consistently renames hash back to id and also changes the semantics back to a simple id. If no id is provided for a Document, a random one will be generated.

Furthermore, some artifacts of the old hash still exist, since they are still used in the Hub. Once I have refactored all code in the Hub to the new id, I will remove the remaining old hash artifacts.

Finally, a new content_hash is "introduced". Put differently: the hash semantics of the old id/hash now live in the content_hash.

CatStark · 2020-12-11T12:08:08Z

I made the first PR refactoring the places where update_id was called.
Now the id will either be set by the client or, if none was set, it will be a random one.

To still have access to the hash of the document, there is a new get_document_hash function. If this is OK, I will then refactor all the tests that were using update_id.

JoanFM

Looks like a good direction to go for me.

jina/clients/python/request.py

jina/types/document/__init__.py

jina/types/document/uid.py

CatStark · 2020-12-14T09:22:26Z

Hey people, as an update: there's a content_hash on the DocumentProto now. Max told me that this needs another step to transform jina.proto into a Python file. Does anyone know how to proceed on this?

maximilianwerk · 2020-12-14T09:51:15Z

I think that file explains how it works: https://github.com/jina-ai/jina/blob/master/jina/proto/build-proto.sh

So it should be `docker run -v $(pwd)/jina/proto:/jina/proto jinaai/protogen` (perhaps you have to build the Docker image beforehand).

CatStark · 2020-12-14T12:20:17Z

Yeah, it was like that. I didn't have to do anything else.

jina/types/document/__init__.py

cristianmtr · 2020-12-14T13:30:04Z

jina/types/document/__init__.py

@@ -135,6 +137,12 @@ def __init__(self, document: Optional[DocumentSourceType] = None,
                             f'if you are trying to set the content '
                             f'you may use "Document(content=your_content)"') from ex

+        if custom_id is None:
+            import random
+            custom_id = random.randint(0, 9)


So is the id a string (from the function signature) or an int?

It does not matter. Good opportunity to dig a little deeper: try to find out why it does not matter by looking at the implementation of self.id.
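One way such a property could make the input type irrelevant is sketched below. This is not the actual implementation of self.id; `SketchDocument` and the normalization rule in its setter are assumptions made only to illustrate the idea that ints and numeric strings can end up in the same stored form.

```python
class SketchDocument:
    """Toy stand-in for Document; the real self.id implementation may differ."""

    def __init__(self) -> None:
        self._id = ''

    @property
    def id(self) -> str:
        return self._id

    @id.setter
    def id(self, value) -> None:
        # Assumption: ints and numeric strings are normalized to one canonical string.
        self._id = str(int(value))


a, b = SketchDocument(), SketchDocument()
a.id = 7      # set from an int
b.id = '7'    # set from a string
```

With a setter like this, both documents hold the same id regardless of how the caller passed it in.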

jina/types/sets/chunk.py

tests/unit/types/document/test_document.py

maximilianwerk · 2020-12-15T19:20:49Z

Just rebased. Please review once again @JoanFM @theUnkownName

github-actions · 2020-12-15T19:28:38Z

Latency summary

Current PR yields:

😶 index QPS at 1557, delta to last 3 avg.: -4%
😶 query QPS at 26, delta to last 3 avg.: -2%

Breakdown

Version	Index QPS	Query QPS
current	1557	26
`0.8.8`	1634	26
`0.8.7`	1659	26
`0.8.6`	1611	26

Backed by latency-tracking. Further commits will update this comment.

JoanFM · 2020-12-15T19:25:23Z

jina/clients/request.py

@@ -18,12 +18,10 @@
                                     Tuple[DocumentSourceType, DocumentSourceType]]]


-def _build_doc(data, data_type: DataInputType, override_doc_id, **kwargs) -> Tuple['Document', 'DataInputType']:
+def _build_doc(data, data_type: DataInputType, **kwargs) -> Tuple['Document', 'DataInputType']:


Why don't we keep override_doc_id so that the user can choose to have us assign a random id for them? Have we checked whether any example uses this?

I don't get the question. Either the user gives an id in the data, or the Document gets a random id. In either case, the user can decide to overwrite the id later via doc.id = 1337.

OK, good. Using this _build_doc will generate random ids; if the user wants custom ids, they have to use their own generator. Fair enough.
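The behaviour agreed on here can be sketched roughly as follows. `build_docs` and the dict-based documents are hypothetical stand-ins and not the real _build_doc; the upper bound is the value used elsewhere in this PR.

```python
import random
from typing import Iterable, Iterator, Optional, Tuple


def build_docs(items: Iterable[Tuple[Optional[int], str]]) -> Iterator[dict]:
    """Yield toy documents; keep a user-supplied id, otherwise draw a random one."""
    for doc_id, text in items:
        if doc_id is None:
            # Upper bound is 2**63 - 1, i.e. 9223372036854775807.
            doc_id = random.randint(0, 2 ** 63 - 1)
        yield {'id': doc_id, 'text': text}


docs = list(build_docs([(None, 'first'), (1337, 'second')]))

# The id can still be overwritten afterwards, as with doc.id = 1337.
docs[0]['id'] = 42
```

A user who wants full control over ids passes them in with the data, which is the "own generator" case mentioned above.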

JoanFM · 2020-12-15T19:26:29Z

jina/types/document/__init__.py

@@ -135,6 +137,10 @@ def __init__(self, document: Optional[DocumentSourceType] = None,
                             f'if you are trying to set the content '
                             f'you may use "Document(content=your_content)"') from ex

+        if self._document.id is None or self._document.id == '':


Would self._document.id.is_empty() work better? Not sure if that works in Python.

Well apparently mine is the best practice in python as far as I know.

create a constant on top. MAX_DOCUMENT_ID

JoanFM · 2020-12-15T19:27:04Z

jina/types/document/__init__.py

@@ -135,6 +137,10 @@ def __init__(self, document: Optional[DocumentSourceType] = None,
                             f'if you are trying to set the content '
                             f'you may use "Document(content=your_content)"') from ex

+        if self._document.id is None or self._document.id == '':
+            import random
+            self.id = random.randint(0, 9223372036854775807)


isn't there some Python built-in Integer.Max or something?

This is 2**64/2 - 1, which should fit well into 64 bits. Apparently, we could directly use 2**64. I haven't thought too much about it; I'll give it another thought tomorrow.

Yes, but Python must have some default value for this.

Nope, Python does not. int is unbounded in Python. Anyhow, numpy has np.iinfo(np.int64).max, which is 9223372036854775807. Since we do int(value).to_bytes(_digest_size, sys.byteorder, signed=True) to convert it to bytes, this should also be the right number. If I understand correctly, we could also have negative ids, but that feels weird. So I restrict it to the numpy max.

Yes, I am not saying the number is wrong. At least have a well-documented CONSTANT rather than a magic number.
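A minimal sketch of the constant requested here, together with the byte conversion quoted above. The name MAX_DOCUMENT_ID comes from the review; the value 8 for the digest size and the helper names are assumptions, since the thread mentions _digest_size without stating its value.

```python
import random
import sys

# Assumption: ids are serialized into 8 bytes.
_DIGEST_SIZE = 8

# Largest signed 64-bit integer, equal to np.iinfo(np.int64).max.
MAX_DOCUMENT_ID = 2 ** (8 * _DIGEST_SIZE - 1) - 1


def random_document_id() -> int:
    """Draw a non-negative id that still fits into a signed 64-bit integer."""
    return random.randint(0, MAX_DOCUMENT_ID)


def id_to_bytes(value) -> bytes:
    """Convert an id to bytes the way the comment above describes."""
    return int(value).to_bytes(_DIGEST_SIZE, sys.byteorder, signed=True)
```

Anything larger than MAX_DOCUMENT_ID makes to_bytes raise OverflowError under these assumptions, which is why the bound matters.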

JoanFM · 2020-12-15T19:28:19Z

jina/types/document/__init__.py

@@ -402,7 +401,7 @@ def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):


Eventually we will remove the context manager concept from Document, right? Or should we at least do some check, like whether id is not None or not empty?

It cannot be None or empty at this point. It will always be populated. This is just for backward compatibility and for "niceness".

And all of a sudden, it is not pass anymore. Reasoning: it makes sense to update the content_hash after the Document is completed. Otherwise, the user would have to call update_content_hash manually.
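The reasoning above can be illustrated with a toy class. `SketchDocument` is hypothetical, and hashing only a text field with blake2b is an assumption; the real Document covers more fields and may use a different digest.

```python
import hashlib


class SketchDocument:
    """Toy stand-in for Document; not the real jina class."""

    def __init__(self, text: str = '') -> None:
        self.text = text
        self.content_hash = ''

    def update_content_hash(self) -> None:
        # Assumption: only the text is hashed in this sketch.
        digest = hashlib.blake2b(self.text.encode('utf-8'), digest_size=8)
        self.content_hash = digest.hexdigest()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Refresh the hash once the caller has finished filling the document.
        self.update_content_hash()


with SketchDocument() as doc:
    doc.text = 'hello'
```

After the with block, the hash reflects the final content without an explicit call by the user.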

JoanFM · 2020-12-15T19:30:16Z

tests/unit/test_helper.py

@@ -64,21 +64,6 @@ def test_np_int():
    assert hash2bytes(np.int64(a)) == hash2bytes(a)


-def test_hash():


have these functions been removed? I have not seen the removal of these functions in this PR right?

Haha, so predictable. When I removed this test, I knew you would write that.

new_doc_id was renamed to get_content_hash and new_doc_hash was removed, so there is no point in testing the forward-backward conversion. I thought about a test like

    hash_ = get_content_hash(d)
    assert hash_ == hash2id(id2hash(hash_))

But this already shows that the semantics of id and hash are not precise anymore (beware, it is NOT id2hash(hash2id(hash_))!). I will think about this again tomorrow with a fresh mind.

maximilianwerk · 2020-12-17T20:15:28Z

Shouldn't the content hash not depend on the id?

I added the hash calculation, which omits the id. This should pretty much be the same as the old hash calculation.
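A rough sketch of a hash that omits the id. The function below works on plain dicts and is not the calculation added in this PR; the field names and the use of JSON plus blake2b are assumptions for illustration.

```python
import hashlib
import json


def content_hash(fields: dict) -> str:
    """Hash every field except the id and any previously stored hash."""
    payload = {k: v for k, v in fields.items() if k not in ('id', 'content_hash')}
    # sort_keys makes the serialization independent of insertion order.
    encoded = json.dumps(payload, sort_keys=True).encode('utf-8')
    return hashlib.blake2b(encoded, digest_size=8).hexdigest()


a = {'id': '1', 'text': 'hello'}
b = {'id': '2', 'text': 'hello'}
c = {'id': '1', 'text': 'world'}
```

Here a and b hash to the same value because only their ids differ, while c differs because its content does.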

florian-hoenicke · 2020-12-18T09:10:39Z

jina/drivers/index.py



 class KVIndexDriver(BaseIndexDriver):
    """Serialize the documents/chunks in the request to key-value JSON pairs and write it using the executor
    """

    def _apply_all(self, docs: 'DocumentSet', *args, **kwargs) -> None:
-        keys = [hash(doc.id) for doc in docs]
+        keys = [int(doc.id) for doc in docs]


does this have to be done in the VectorIndexDriver as well?

It is. See line 30.

ah okay, I searched for VectorIndexDriver and did not find a hit. 😳
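One possible reading of the hash(doc.id) to int(doc.id) change, sketched under the assumption that ids arrive as numeric strings: Python's built-in hash() of a str is salted per process unless PYTHONHASHSEED is fixed, whereas int() yields the same key in every process.

```python
# Assumption: ids are numeric strings, as int(doc.id) in the diff suggests.
doc_ids = ['1', '42', '9223372036854775807']

# int() is deterministic, so keys written at index time match those computed at query time.
keys = [int(doc_id) for doc_id in doc_ids]
```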

maximilianwerk

Was approved before. Tests now succeed.

CatStark requested a review from maximilianwerk December 11, 2020 12:04

jina-bot added size/M, area/core, area/network, area/testing, component/client, component/proto, component/type labels Dec 11, 2020

JoanFM reviewed Dec 11, 2020

jina/clients/python/request.py

jina/types/document/__init__.py

jina/types/document/uid.py

jina-bot added size/S and removed size/M labels Dec 11, 2020

JoanFM mentioned this pull request Dec 13, 2020

update Document id at __enter__ rather than __exit__ #1451

Closed

cristianmtr reviewed Dec 14, 2020

jina/types/document/__init__.py

cristianmtr reviewed Dec 14, 2020

jina/types/sets/chunk.py

cristianmtr reviewed Dec 14, 2020

tests/unit/types/document/test_document.py

maximilianwerk force-pushed the feat-document-random-id branch from 29538da to 5849ab3 Compare December 15, 2020 19:15

jina-bot added size/M and removed size/S labels Dec 15, 2020

maximilianwerk force-pushed the feat-document-random-id branch 2 times, most recently from 2bda5bf to 9a31132 Compare December 15, 2020 19:22

JoanFM reviewed Dec 15, 2020

JoanFM mentioned this pull request Dec 16, 2020

feat: add named score type #1430

Merged

maximilianwerk mentioned this pull request Dec 16, 2020

feat(driver): add docidcache for content #1466

Closed

CatStark and others added 18 commits December 17, 2020 20:52

fix: remove random_id and set custom_id

be67ece

feat: add random id

5c73b87

fix: add test documentset

09b6302

fix: removed context manager from Document

392fd57

fix: remove random_id and set custom_id

3a872fb

fix: get context manager back

56de400

fix: removed duplicate code and disabled id updating

e4fa5a4

feat: added hash interface

1250dd6

feat: remove update_id and add content_hash

bcd5bb0

feat: remove update_id and add content_hash

58991b3

test: refactor tests

ec33e0b

fix: add hash functions

8f0c21d

refactor: update_id is now update_content_hash

2ea9c2b

refactor: refactor last tests to use update_has_h_content

12408d7

feat: finalize refactoring

0067cf9

feat: hash becomes id again

eef15a9

fix: documentset unittest running again

10bf377

feat: added content hash calculation with context manager again

9b4741b

maximilianwerk dismissed JoanFM’s stale review via 9b4741b December 17, 2020 19:53

maximilianwerk force-pushed the feat-document-random-id branch from 88ed15a to 9b4741b Compare December 17, 2020 19:53

maximilianwerk added 2 commits December 17, 2020 21:03

feat: added last missing hash to id field

9a91732

feat: document hash without id

7606b85

florian-hoenicke reviewed Dec 18, 2020

test: disabled flaky test

b80d539

maximilianwerk approved these changes Dec 18, 2020

hanxiao merged commit 2569eba into master Dec 18, 2020

hanxiao deleted the feat-document-random-id branch December 18, 2020 12:10

cristianmtr restored the feat-document-random-id branch December 18, 2020 12:30

JoanFM deleted the feat-document-random-id branch November 30, 2021 16:47
