This repository has been archived by the owner on Feb 22, 2020. It is now read-only.

Add a notice when indexing flow text is only accept Bytes. Not str #355

Closed
ilham-bintang opened this issue Oct 24, 2019 · 2 comments


@ilham-bintang

Hi. I found an issue: when using Flow to index text passed as type str, it fails with an unhelpful error message:

E:CLIClient:[cli:sta: 53]:<_Rendezvous of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "Exception iterating requests!"
	debug_error_string = "None"
>

This is caused by passing a str instead of bytes.

To solve it, you need to convert (encode) your str to bytes.
I will make a PR for this soon.
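
For example, here is a minimal sketch of the conversion, assuming the documents are plain Python strings and UTF-8 is the desired encoding. The variable names are illustrative only and are not part of the GNES API:

```python
# Hypothetical list of text documents (not taken from the GNES codebase).
docs = ['hello world', 'héllo wörld']

# Encode each str to bytes before handing it to the flow.
bytes_docs = [d.encode('utf-8') for d in docs]

# Decoding with the same codec restores the original text.
restored = [b.decode('utf-8') for b in bytes_docs]
```

Whatever reads the bytes later has to decode with the same codec, so it is worth picking one encoding and using it everywhere.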

@hanxiao
Collaborator

hanxiao commented Oct 26, 2019

Hi, using bytes to represent documents may seem counter-intuitive, so let me explain why.

In the very early versions of GNES, we did send text as vanilla str/List[str], images as ndarray, etc. However, we soon realized that there are two problems:

  • These data types are Python-only, not universal. In principle, we don't want to restrict the end-user to a Python client. They can use a Java, Go or JavaScript client to communicate with GNES. If we insisted on such a Python-oriented design, a serializer would have to exist on both the client and the server side to convert one type to another. That is a bad experience for developers, and error-prone.
  • These data types are too specific, not generic. GNES is not for NLP only; image, video and audio retrieval are also in its scope. So how do you represent content of every modality using a single, generic representation? Bytes is the only option.

One follow-up question you may have is: if all incoming data is in bytes, how can GNES know what is what, and how to deserialize these bytes into the correct modality?

The answer is the Preprocessor. It deserializes the bytes into the correct modality and Python data type. Note how the class attribute doc_type is inherited by, and therefore affects, every preprocessor subclass.

class BaseTextPreprocessor(BasePreprocessor):
    doc_type = gnes_pb2.Document.TEXT

class BaseAudioPreprocessor(BasePreprocessor):
    doc_type = gnes_pb2.Document.AUDIO

class BaseImagePreprocessor(BasePreprocessor):
    doc_type = gnes_pb2.Document.IMAGE

class BaseVideoPreprocessor(BasePreprocessor):
    doc_type = gnes_pb2.Document.VIDEO

For example, a SentSplitPreprocessor (inherited from BaseTextPreprocessor) converts bytes into str, while a WeightedSlidingPreprocessor (inherited from BaseImagePreprocessor) converts bytes into ndarray.
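
To illustrate the idea only, here is a simplified sketch of how the preprocessor class could determine what the same raw bytes turn into. Everything in it is a stand-in: DocType replaces the gnes_pb2.Document enum, the decode method is hypothetical, and the "image" is just a list of pixel values rather than an ndarray. It is not the actual GNES implementation.

```python
from enum import Enum


class DocType(Enum):
    # Stand-in for gnes_pb2.Document.TEXT / IMAGE (assumption, not the real enum).
    TEXT = 1
    IMAGE = 2


class BasePreprocessor:
    doc_type = None

    def decode(self, raw_bytes: bytes):
        # Hypothetical hook; real GNES preprocessors are structured differently.
        raise NotImplementedError


class ToyTextPreprocessor(BasePreprocessor):
    doc_type = DocType.TEXT

    def decode(self, raw_bytes: bytes) -> str:
        # Text modality: bytes become a Python str.
        return raw_bytes.decode('utf-8')


class ToyImagePreprocessor(BasePreprocessor):
    doc_type = DocType.IMAGE

    def decode(self, raw_bytes: bytes) -> list:
        # Image modality (toy version): one grayscale pixel value per byte.
        return list(raw_bytes)


text = ToyTextPreprocessor().decode(b'hello')
pixels = ToyImagePreprocessor().decode(bytes([0, 128, 255]))
```

The client sends bytes in both cases; only the preprocessor loaded on the server side decides which Python type comes out.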

To summarize, here is the whole procedure again.

  1. The client (e.g. CLIClient) converts everything into bytes and fills in the docs.raw_bytes field defined in our Protobuf. As Protobuf is universal, one can use whatever language they like to perform this task.
  2. The message is sent to the GNES frontend, and the first service in the stack, i.e. the preprocessor, takes over the message.
  3. The preprocessor service processes the message and deserializes it according to the module loaded; that is, docs and, most importantly, chunks are enriched based on raw_bytes and the deserialization logic. You can of course write a custom preprocessor and use it in the way described in GNES Hub.
  4. Follow-up services take over the message and use the preprocessed chunks or docs.
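
The four steps can be sketched roughly as below. This is a toy model under stated assumptions: the Document dataclass stands in for the Protobuf message, and client_send, preprocess and downstream are hypothetical names, not GNES functions. No network call or frontend is involved.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    # Stand-in for the Protobuf document message (assumption).
    raw_bytes: bytes
    chunks: List[str] = field(default_factory=list)


def client_send(text: str) -> Document:
    # Step 1: the client converts the content to bytes and fills raw_bytes.
    return Document(raw_bytes=text.encode('utf-8'))


def preprocess(doc: Document) -> Document:
    # Steps 2-3: the preprocessor deserializes raw_bytes and enriches chunks.
    text = doc.raw_bytes.decode('utf-8')
    doc.chunks = [s.strip() for s in text.split('.') if s.strip()]
    return doc


def downstream(doc: Document) -> int:
    # Step 4: a follow-up service consumes the preprocessed chunks.
    return len(doc.chunks)


doc = preprocess(client_send('GNES is generic. It uses bytes.'))
n_chunks = downstream(doc)
```

Skipping preprocess here leaves chunks empty, so the follow-up step has nothing to work with.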

In short, a GNES flow/stack without a Preprocessor service is useless, as it won't know how to handle a message correctly.

If you feel this idea needs to be better known, you are welcome to make a PR about it, either by adding Python docstrings or by improving the logging info. ❤️

@ilham-bintang
Author

Hi @hanxiao ,

I see the problem with using Python data types to represent the various types of document (text, image, video, audio).

Thanks
