indexing docid with non-ascii characters causes 503 error #7

clamprecht opened this Issue Feb 5, 2012 · 3 comments

Trying to index a doc whose docid contains a "high ascii" or Unicode character above 127 causes the following exception in restapi:

17669 05/02-00.50.12      RPC:ERRO Unexpected failure to run send_batch, reconnecting once
Traceback (most recent call last):
  File "../api/", line 77, in wrap
    return att(*args, **kwargs)
  File "../gen-py/flaptor/indextank/rpc/", line 39, in send_batch
  File "../gen-py/flaptor/indextank/rpc/", line 46, in send_send_batch
  File "../gen-py/flaptor/indextank/rpc/", line 139, in write
  File "../gen-py/flaptor/indextank/rpc/", line 1679, in write
  File "../gen-py/flaptor/indextank/rpc/", line 1441, in write
  File "../api/thrift/protocol/", line 123, in writeString
  File "../api/thrift/transport/", line 164, in write
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 0: ordinal not in range(128)

To reproduce, run this in Python:

from indextank.client import ApiClient
c = ApiClient('<YOUR_API_URL>')
idx = c.create_index('testascii')
idx.add_document("â", {"text": "a"})

I think it's ok to reject docids with non-latin1 or non-ASCII characters, but in that case the API should return an HTTP 400 instead of a 503 "service unavailable". (Or maybe docids are supposed to accept non-ASCII characters?)
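To make the 400-vs-503 point concrete, here is a minimal sketch of the kind of check I mean, assuming docids were meant to be ASCII-only. The function name `docid_status` and its return shape are hypothetical, not taken from the IndexTank code, and it is written for Python 3. If docids are supposed to accept Unicode, this check would not apply.

```python
def docid_status(docid):
    """Return a (http_status, message) pair for a candidate docid.

    Hypothetical helper: it rejects bad input up front with a 400
    instead of letting an encoding error surface later as a 503.
    """
    if not isinstance(docid, str) or not docid:
        return 400, "docid must be a non-empty string"
    try:
        # Assumption for this sketch: only ASCII docids are allowed.
        docid.encode("ascii")
    except UnicodeEncodeError as exc:
        return 400, "docid has a non-ASCII character at position %d" % exc.start
    return 200, "ok"


status_ascii = docid_status("doc1")
status_non_ascii = docid_status("â")
```

Running the check before the RPC call means a bad docid is reported as a client error rather than as a backend failure.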

Also, this may be related, though I'm not sure yet: when this happened during batch indexing, it seemed to cause a problem with the LogWriter, which logged the following stack trace:

ERROR [pool-1-thread-32] org.apache.thrift.server.TThreadPoolServer - [Error occurred during processing of message.] 2012-02-04 10:27:15,724
java.lang.IllegalStateException: Can't insert records to the live log without defining the index code
        at com.flaptor.indextank.rpc.LogWriter$Processor$send_batch.process(
        at com.flaptor.indextank.rpc.LogWriter$Processor.process(
        at org.apache.thrift.server.TThreadPoolServer$
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(
        at java.util.concurrent.ThreadPoolExecutor$

Finally, after this all happened, the LogWriter (slave) was taking all the CPU when no docs were being written, like it was in a spin loop. I did a kill -3 to get a thread stack dump, and one or two threads were RUNNABLE at this line:

at org.apache.thrift.protocol.TProtocolUtil.skip(
at org.apache.thrift.protocol.TProtocolUtil.skip(

I can create a separate issue for the LogWriter stuff if you want. But I'm not sure exactly what reproduces it yet.

Let me know if I can provide any more details.

We should support unicode docids. Actually, __validate_docid on api/

Check the code at

So it seems the code sending the update to the LogStorage does not support non-ASCII docids.


I'm not a Python expert, but I dug around and noticed that thrift uses StringIO. I found this in the Python docs:

The StringIO object can accept either Unicode or 8-bit strings, but mixing the two may take some care. If both are used, 8-bit strings that cannot be interpreted as 7-bit ASCII (that use the 8th bit) will cause a UnicodeError to be raised when getvalue() is called.


Maybe this is what's happening (and why it's happening in the middle of a thrift call)?
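A minimal sketch of the suspected failure mode and one possible way around it, written for Python 3 (the traceback above comes from Python 2, where the same ASCII conversion happens implicitly when a unicode object is written into a byte buffer). The helper `write_string` is hypothetical and only imitates a length-prefixed string write; it is not the actual thrift code.

```python
import io

# Encoding the docid as ASCII fails with the same message as in the traceback.
try:
    "â".encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)


def write_string(transport, value):
    """Write a length-prefixed string, encoding text to UTF-8 first.

    Hypothetical stand-in for a protocol-level string write: the
    transport only ever receives bytes, so no implicit ASCII
    conversion can happen.
    """
    if isinstance(value, str):
        value = value.encode("utf-8")
    transport.write(len(value).to_bytes(4, "big"))  # 4-byte big-endian length
    transport.write(value)


buf = io.BytesIO()
write_string(buf, "â")
payload = buf.getvalue()
```

If this is indeed the cause, encoding the docid explicitly before it reaches the transport would avoid the exception, at the cost of the receiving side having to decode UTF-8.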


It also seems that when batch indexing, a single document causing this issue in the batch can cause the whole batch to fail and return a 503.
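One way to keep a single bad document from failing the whole batch would be to check each document separately and report per-document results. This is only a sketch under assumptions: the `split_batch` helper, the `{"docid": ..., "fields": ...}` document shape, and the result format are all hypothetical, and the validity rule is passed in because it is not settled yet whether non-ASCII docids should be accepted.

```python
def split_batch(documents, is_valid):
    """Split a batch into documents to send and per-document results.

    `is_valid` is a caller-supplied predicate on the docid, so the
    acceptance rule stays separate from the batching logic.
    """
    accepted = []
    results = []
    for doc in documents:
        if is_valid(doc.get("docid")):
            accepted.append(doc)
            results.append({"added": True})
        else:
            results.append({"added": False, "error": "invalid docid"})
    return accepted, results


def ascii_only(docid):
    # Example rule for this sketch only: non-empty ASCII strings.
    return isinstance(docid, str) and docid != "" and docid.isascii()


batch = [
    {"docid": "a", "fields": {"text": "a"}},
    {"docid": "â", "fields": {"text": "a"}},
]
accepted, results = split_batch(batch, ascii_only)
```

Only `accepted` would be forwarded to the backend, while `results` keeps one entry per input document in the original order.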

