Add a notice when indexing flow text is only accept Bytes. Not str #355

ilham-bintang · 2019-10-24T04:11:57Z

Hi. I found an issue: when using Flow to index text passed as str, it fails with this unhelpful message:

E:CLIClient:[cli:sta: 53]:<_Rendezvous of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "Exception iterating requests!"
	debug_error_string = "None"
>

This is caused by passing a str instead of bytes.

To solve it, you need to encode your str to bytes.
I will make a PR for this soon.
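
The workaround above can be sketched as follows. This is a minimal illustration, not GNES code: `to_bytes` is a hypothetical helper name, and UTF-8 is an assumed encoding. It normalises a mixed iterable of str/bytes into bytes before anything is handed to the client, and raises a readable error for any other type.

```python
from typing import Iterable, Iterator, Union


def to_bytes(docs: Iterable[Union[str, bytes]], encoding: str = "utf-8") -> Iterator[bytes]:
    """Hypothetical helper (not part of GNES): yield every document as bytes."""
    for doc in docs:
        if isinstance(doc, str):
            # str must be encoded explicitly; UTF-8 is an assumption here
            yield doc.encode(encoding)
        elif isinstance(doc, bytes):
            # already in the expected form, pass through unchanged
            yield doc
        else:
            # fail early with a clear message instead of an opaque RPC error
            raise TypeError(f"expected str or bytes, got {type(doc).__name__}")


lines = ["hello world", b"already bytes", "naïve café"]
payload = list(to_bytes(lines))
```

The resulting `payload` is what you would pass on as the bytes input for indexing.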

hanxiao · 2019-10-26T09:47:17Z

Hi, using bytes to represent documents may seem counter-intuitive, so let me explain why we do it.

In the very early versions of GNES, we did send text as plain str/List[str], images as ndarray, etc. However, we soon realized that there are two problems:

These data types are Python-only, not universal. In principle, we don't want to restrict the end user to a Python client. They can use a Java, Go or JavaScript client to communicate with GNES. If we insisted on such a Python-oriented design, a serializer would have to exist on both the client and the server side to convert one type to another. That is a bad experience for developers and error-prone.
These data types are too specific, not generic. GNES is not NLP-only; image/video/audio retrieval is also in the scope of GNES. So how do you represent content of every modality using a single, generic representation? Bytes is the only option.

One follow-up question you may have: if all incoming data is in bytes, how can GNES know what is what, and how to deserialize these bytes into the correct modality?

The answer is the Preprocessor. It deserializes the bytes into the correct modality and Python data type. Note how the class attribute doc_type affects all preprocessor classes that inherit from these bases.

gnes/gnes/preprocessor/base.py

Lines 41 to 54 in b4d2c8c

    
    class BaseTextPreprocessor(BasePreprocessor):
        doc_type = gnes_pb2.Document.TEXT

    class BaseAudioPreprocessor(BasePreprocessor):
        doc_type = gnes_pb2.Document.AUDIO

    class BaseImagePreprocessor(BasePreprocessor):
        doc_type = gnes_pb2.Document.IMAGE

    class BaseVideoPreprocessor(BasePreprocessor):
        doc_type = gnes_pb2.Document.VIDEO

For example, a SentSplitPreprocessor (inherited from BaseTextPreprocessor) converts bytes into str, while a WeightedSlidingPreprocessor (inherited from BaseImagePreprocessor) converts bytes into ndarray.
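
To make the inheritance point concrete, here is a toy sketch. Everything in it is a stand-in, not GNES code: `DocType` mimics the `gnes_pb2.Document` constants, the class names are hypothetical, and a list of pixel rows stands in for an ndarray. It only shows how a subclass inherits `doc_type` from its base and how each modality decodes the same kind of input (bytes) into a different Python type.

```python
from enum import Enum
from typing import List


class DocType(Enum):
    """Stand-in for gnes_pb2.Document.TEXT / IMAGE (assumption, not the real enum)."""
    TEXT = 1
    IMAGE = 2


class ToyBasePreprocessor:
    doc_type = None  # set by modality-specific base classes

    def decode(self, raw_bytes: bytes):
        raise NotImplementedError


class ToyTextPreprocessor(ToyBasePreprocessor):
    doc_type = DocType.TEXT

    def decode(self, raw_bytes: bytes) -> str:
        # text modality: bytes -> str (UTF-8 assumed)
        return raw_bytes.decode("utf-8")


class ToySentenceSplitter(ToyTextPreprocessor):
    # doc_type is inherited from ToyTextPreprocessor, nothing to redeclare
    def split(self, raw_bytes: bytes) -> List[str]:
        text = self.decode(raw_bytes)
        return [s.strip() for s in text.split(".") if s.strip()]


class ToyGrayImagePreprocessor(ToyBasePreprocessor):
    doc_type = DocType.IMAGE

    def __init__(self, width: int):
        self.width = width

    def decode(self, raw_bytes: bytes) -> List[List[int]]:
        # image modality: bytes -> rows of pixel values (stand-in for ndarray)
        return [list(raw_bytes[i:i + self.width])
                for i in range(0, len(raw_bytes), self.width)]


sentences = ToySentenceSplitter().split(b"Hi there. Bye.")
pixels = ToyGrayImagePreprocessor(width=2).decode(bytes([0, 1, 2, 3]))
```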

To summarize, let me walk through the whole procedure once more.

The client (e.g. CLIClient) converts everything into bytes and fills in the docs.raw_bytes field defined in our Protobuf. As Protobuf is universal, one can use whatever language they like to perform this task.
The message is sent to the GNES frontend, and the first service in the stack, i.e. the preprocessor, takes over the message.
The preprocessor service processes the message and deserializes it according to the module loaded, i.e. docs and, most importantly, chunks are enriched based on raw_bytes and the deserialization logic. You can of course customize a preprocessor and use it in the way described in GNES Hub.
Follow-up services take over the message and use the preprocessed chunks or docs.
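
The four steps above can be sketched end to end as follows. This is a simplified model under stated assumptions, not the actual GNES implementation: `Document` and `Chunk` are plain dataclasses standing in for the Protobuf messages (only the `raw_bytes` and `chunks` field names come from the description above), and the three functions are hypothetical placeholders for the client, the preprocessor service and a follow-up service.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chunk:
    text: str


@dataclass
class Document:
    # stand-in for the Protobuf document; real messages carry more fields
    raw_bytes: bytes
    chunks: List[Chunk] = field(default_factory=list)


def client_build(text: str) -> Document:
    # step 1: the client only fills raw_bytes, nothing Python-specific is sent
    return Document(raw_bytes=text.encode("utf-8"))


def preprocess(doc: Document) -> Document:
    # steps 2-3: the first service deserializes raw_bytes and enriches chunks
    text = doc.raw_bytes.decode("utf-8")
    doc.chunks = [Chunk(s.strip()) for s in text.split(".") if s.strip()]
    return doc


def follow_up(doc: Document) -> List[int]:
    # step 4: later services work on chunks, never on raw_bytes directly
    return [len(c.text) for c in doc.chunks]


doc = preprocess(client_build("GNES is generic. Bytes go in."))
lengths = follow_up(doc)
```

Skipping `preprocess` here leaves `doc.chunks` empty, so `follow_up` has nothing to work with, which mirrors why a stack needs a preprocessor.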

In short, a GNES flow/stack without a Preprocessor service is useless, as it won't know how to handle a message correctly.

If you feel this idea should be better known to others, you are welcome to make a PR about it, either via Python docstrings or by improving the logging info. ❤️

ilham-bintang · 2019-10-27T08:19:43Z

Hi @hanxiao,

I see the problem with using Python data types to represent the various types of document (text, image, video, audio).

Thanks

ilham-bintang closed this as completed Oct 24, 2019

ilham-bintang reopened this Oct 24, 2019

ilham-bintang closed this as completed Oct 27, 2019

ilham-bintang mentioned this issue Oct 27, 2019

How to access gRPCFrontend?? #343

Open
