
Exiting WorkerSinkTask due to unrecoverable exception #36

Open
ramyogi opened this issue Dec 20, 2019 · 17 comments

@ramyogi
Contributor

ramyogi commented Dec 20, 2019

When there is a problem with a message and Solr cannot index it and returns an exception, we end up in the state below. How do we resolve this, skip the message, and move forward? Even if I restart, the task stays stuck on the same message.
My connector configuration already contains the following, and it is still stuck:
behavior.on.malformed.documents=warn
solr.commit.within=100
errors.tolerance=all
errors.log.enable=true

org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:560)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:321)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:224)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:192)

@ramyogi
Contributor Author

ramyogi commented Dec 20, 2019

Here is the exception thrown by the Solr client. Ideally this should not occur, but when it does we need a way to proceed. Please suggest an approach, or I can take a look at your code and provide a pull request.

Caused by: org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from server at http://localhost/solr/realtimeindexing_shard6_replica_n7: Exception writing document id 12345678 to the index; possible analysis error: Document contains at least one immense term in field="abc" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[49, 48, 46, 49, 49, 48, 51, 47, 80, 104, 121, 115, 82, 101, 118, 76, 101, 116, 116, 46, 57, 51, 46, 49, 51, 48, 54, 48, 51, 80]...', original message: bytes can be at most 32766 in length; got 38490. Perhaps the document has an indexed string field (solr.StrField) which is too large
at org.apache.solr.client.solrj.impl.CloudSolrClient.getRouteException(CloudSolrClient.java:125)
at org.apache.solr.client.solrj.impl.CloudSolrClient.getRouteException(CloudSolrClient.java:46)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.directUpdate(BaseCloudSolrClient.java:549)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1037)

@ramyogi
Contributor Author

ramyogi commented Dec 20, 2019

The reason is that only the exceptions below are caught:

catch (SolrServerException | IOException ex) {
  throw new RetriableException(ex);
}

The SolrCloud exception above (CloudSolrClient$RouteException) is a different type, so it is not handled.
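A minimal sketch of what a broader catch could look like. The dropInvalidDocuments flag and the write() wrapper are hypothetical, not part of this connector, and it assumes the failure surfaces as a SolrException (the parent class of CloudSolrClient$RouteException):

import java.io.IOException;
import java.util.Collection;

import org.apache.kafka.connect.errors.RetriableException;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SolrWriteSketch {
  private static final Logger log = LoggerFactory.getLogger(SolrWriteSketch.class);

  private final SolrClient client;
  private final boolean dropInvalidDocuments; // hypothetical config flag

  public SolrWriteSketch(SolrClient client, boolean dropInvalidDocuments) {
    this.client = client;
    this.dropInvalidDocuments = dropInvalidDocuments;
  }

  void write(String collection, Collection<SolrInputDocument> batch) {
    try {
      client.add(collection, batch);
    } catch (SolrServerException | IOException ex) {
      // Transient / infrastructure problems: let Connect retry the batch.
      throw new RetriableException(ex);
    } catch (SolrException ex) {
      // Document-level failures such as the "immense term" error land here.
      if (dropInvalidDocuments) {
        log.error("Dropping batch of {} documents rejected by Solr: {}", batch.size(), ex.getMessage());
      } else {
        throw ex;
      }
    }
  }
}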

@jcustenborder
Owner

Isn't this fatal though?

Caused by: org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from server at http://localhost/solr/realtimeindexing_shard6_replica_n7: Exception writing document id 12345678 to the index; possible analysis error: Document contains at least one immense term in field="abc" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[49, 48, 46, 49, 49, 48, 51, 47, 80, 104, 121, 115, 82, 101, 118, 76, 101, 116, 116, 46, 57, 51, 46, 49, 51, 48, 54, 48, 51, 80]...', original message: bytes can be at most 32766 in length; got 38490. Perhaps the document has an indexed string field (solr.StrField) which is too large

@ramyogi
Contributor Author

ramyogi commented Dec 21, 2019

Correct, that is the cause. But this should be retried and then routed to a dead letter queue so that subsequent messages can be processed. In the current situation the task is stuck and not moving at all.

@jcustenborder
Owner

That's not how the dead letter topic works unfortunately. It's only for deserialization errors.

@ramyogi
Contributor Author

ramyogi commented Dec 24, 2019

How can this be resolved when a message contains an unrecoverable error? Ideally we would just ignore that message and proceed. Because I am running Solr with a managed schema, some messages arrive with unexpected data. Any suggestion you could provide would be very helpful.

@ramyogi
Contributor Author

ramyogi commented Dec 30, 2019

The Elasticsearch sink plugin provides an option to drop invalid messages and keep going, so we could probably add a similar flag here and catch this kind of exception to move forward. Right now the task is stuck and the partition offset is not advancing, so we need some way of handling this.

catch (ConnectException convertException) {
  if (dropInvalidMessage) {
    log.error(
        "Can't convert record from topic {} with partition {} and offset {}. "
            + "Error message: {}",
        sinkRecord.topic(),
        sinkRecord.kafkaPartition(),
        sinkRecord.kafkaOffset(),
        convertException.getMessage()
    );
  } else {
    throw convertException;
  }
}

@hartmut-co-uk

+1
This is a valid situation. I'd also welcome an option to 'unblock', e.g. log and skip (the proposed behavior.on.malformed.documents=warn).

@hartmut-co-uk

@ramyogi did you find a solution, settle on an alternative, or fork the connector?

@jcustenborder
Owner

We could potentially add support for something like this. The concern I have is that most, if not all, of the examples were problems caused by failing infrastructure, meaning we'd fail on the next message anyway.

@hartmut-co-uk

hartmut-co-uk commented Oct 21, 2020

Hmm, valid concern!
In the case of failing infrastructure, timeouts, network problems, or other temporary issues, it certainly wouldn't make sense to skip messages or move them to a dead letter queue.
Would something more fine-grained be feasible?
TBH I'm not familiar with the Solr Java client library, so I don't know about its exception/error behaviour.

@jcustenborder
Owner

I think it boils down to a limitation of the Solr API. This connector specifically uses add(Collection<SolrInputDocument> docs) to index documents, which is the fastest way to write data to Solr: each batch of records delivered by a poll is converted and sent in one request. Do you all know of a way for me to figure out which document failed? That would be immensely helpful in this use case. The alternative is to use add(SolrInputDocument doc), which comes with the warning: "Adds a single document. Many SolrClient implementations have drastically slower indexing performance when documents are added individually. Document batching generally leads to better indexing performance and should be used whenever possible." Without being able to figure out which document(s) failed, I'd have to report the entire batch as failed.
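One possible approach (a sketch, not the connector's current behaviour): send the batch first, and only fall back to adding documents one at a time when the batch request fails, in order to isolate the offending document(s). It assumes the failure surfaces as a SolrException, which CloudSolrClient$RouteException extends:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;

public class BatchWithFallback {
  /** Returns the documents Solr rejected; everything else ends up indexed. */
  public static List<SolrInputDocument> addWithFallback(
      SolrClient client, String collection, Collection<SolrInputDocument> batch)
      throws SolrServerException, IOException {
    try {
      client.add(collection, batch);           // fast path: one request for the whole batch
      return Collections.emptyList();
    } catch (SolrException batchFailure) {
      List<SolrInputDocument> rejected = new ArrayList<>();
      for (SolrInputDocument doc : batch) {    // slow path: isolate the bad document(s)
        try {
          client.add(collection, doc);
        } catch (SolrException docFailure) {
          rejected.add(doc);                   // caller decides: log the id, DLQ, or fail
        }
      }
      return rejected;
    }
  }
}

The slow path is only paid when a batch actually fails, so normal indexing keeps the batched add(Collection) performance.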

@ramyogi7283

Yes Jeremy, we should be able to log the document's unique id when this situation occurs. Right now this kind of error leaves the task completely stuck and the offset does not move at all; we have to manually delete the record and resume the Kafka Connect process. I am looking forward to your suggestion, then I can prepare a PR.

@jcustenborder
Owner

The issue is that it's not going to be a single document, it's going to be a batch of documents. If we send 50 documents to Solr and one is bad, which one is it?

@hartmut-co-uk

Is there a config setting to change the batch size?
For a recovery process (still involving manual work) one could:

  1. stop/delete the connect task
  2. restart with batch size = 1
  3. process until the task is again stuck at the exact record causing the failure
  4. restart with 'skip on error' enabled
  5. stop and set the config back to its initial state (batch size > 1, 'skipOnError' = false)

Or, as an alternative, move the entire failing batch to a DLQ topic for further manual analysis / manual 'replay'.
It depends on the use case, but it might be preferable to having the entire sink task freeze and block.

@jcustenborder
Owner

Yes Jeremy, we should be able to log the document's unique id when this situation occurs. Right now this kind of error leaves the task completely stuck and the offset does not move at all; we have to manually delete the record and resume the Kafka Connect process. I am looking forward to your suggestion, then I can prepare a PR.

The problem is that I'm unclear on how to determine which document is the offending one. If we dumped something to the log, we would have to dump the whole batch, which could be 500 documents depending on your batch size. The part I really don't like is throwing away a batch over a single document or two.

Is there a config setting to change the batch size?
For a recovery process (still involving manual work) one could:

  1. stop/delete the connect task
  2. restart with batch size = 1
  3. process until the task is again stuck at the exact record causing the failure
  4. restart with 'skip on error' enabled
  5. stop and set the config back to its initial state (batch size > 1, 'skipOnError' = false)

You can do this with max.poll.records. It's a standard Kafka consumer setting. You might need to set it at the worker level depending on the Kafka Connect version.
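For example (the per-connector override assumes Kafka Connect 2.3+ with the override policy enabled on the worker):

# worker.properties: applies to all connectors on this worker
consumer.max.poll.records=1
# or per connector (requires connector.client.config.override.policy=All on the worker,
# available since Kafka Connect 2.3 / KIP-458), in the connector config:
consumer.override.max.poll.records=1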

Or, as an alternative, move the entire failing batch to a DLQ topic for further manual analysis / manual 'replay'.
It depends on the use case, but it might be preferable to having the entire sink task freeze and block.

Unfortunately DLQ functionality only covers serialization errors at the moment. There is no API to produce messages to a DLQ from a connector; I'd literally have to create a producer to do so, and that would mean pulling in all of the producer settings.

@hartmut-co-uk

Unfortunately DLQ functionality only covers serialization errors at the moment. There is no API to produce messages to a DLQ from a connector; I'd literally have to create a producer to do so, and that would mean pulling in all of the producer settings.

Note: Kafka 2.6 shipped KIP-610, which brings exactly this capability.
Though with batching and the given error scenario the limitation is the same: the entire batch would have to be 'skipped' and sent to the DLQ.
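For reference, a sketch of what using the KIP-610 reporter could look like in a sink task on a Kafka Connect 2.6+ worker. The index() method here is a hypothetical per-record Solr write, not this connector's code:

import java.util.Collection;
import java.util.Map;

import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.sink.ErrantRecordReporter;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public abstract class Kip610Sketch extends SinkTask {
  private ErrantRecordReporter reporter;

  @Override
  public void start(Map<String, String> props) {
    try {
      // May be null if the connector has no DLQ / error reporting configured.
      reporter = context.errantRecordReporter();
    } catch (NoSuchMethodError | NoClassDefFoundError e) {
      reporter = null; // running against a pre-2.6 Connect runtime
    }
  }

  @Override
  public void put(Collection<SinkRecord> records) {
    for (SinkRecord record : records) {
      try {
        index(record); // hypothetical per-record write to Solr
      } catch (Exception e) {
        if (reporter != null) {
          reporter.report(record, e); // send the bad record to the DLQ and keep going
        } else {
          throw new ConnectException(e); // old behaviour: fail the task
        }
      }
    }
  }

  protected abstract void index(SinkRecord record); // placeholder for the actual Solr write
}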

The part I really don't like is throwing away a batch over a single document or two.

Yes, that would probably apply to most use cases.
