
Some suggestions for production use #26

Closed
jsding opened this issue Dec 11, 2019 · 7 comments

Comments

@jsding

jsding commented Dec 11, 2019

We would like to run clickhouse_sinker as a system-level service: one that stays resident, automatically adapts to changes in the ClickHouse and Kafka connection state, and never needs to be stopped manually.

The current retryTimes setting, if used, causes large amounts of data to be lost while the ClickHouse cluster is under maintenance, so it is not suitable for production.

I suggest changing LoopWrite to drop the retryTimes setting:

// LoopWrite retries a failed batch with exponentially increasing sleeps,
// capped at 3600 seconds, until the write eventually succeeds.
func (c *ClickHouse) LoopWrite(metrics []model.Metric) {
	err := c.Write(metrics)
	times := 1
	for err != nil {
		log.Error("saving msg error", err.Error(), "will loop to write the data")
		// Double the wait on every failure, but never sleep longer than one hour.
		waitTime := time.Duration(math.Min(math.Pow(2, float64(times)), 3600))
		time.Sleep(waitTime * time.Second)
		err = c.Write(metrics)
		times++
	}
}

When a write fails, the sleep time is doubled automatically until the write succeeds. This way there is no need to shut down clickhouse_sinker first whenever ClickHouse goes down for maintenance: clickhouse_sinker can stay resident and will resume sinking on its own within at most one hour.

@sundy-li
Member

sundy-li commented Dec 12, 2019

That's a good idea.

the sleep time is doubled automatically until the write succeeds

It can't just keep doubling: the longer the outage lasts, the larger the doubled interval becomes, so once the cluster is back the sinker's write latency may be that much higher.

I suggest changing LoopWrite to drop the retryTimes setting

The first version did loop forever writing to ClickHouse until it succeeded. But database failures are unavoidable, and when the failure is unrecoverable an infinite write loop has to be killed with kill -9, so data is lost anyway.
Also, setting retryTimes to a large value is effectively the same as looping, so it can already cover your current need.

After the sinker was open-sourced, the business-specific handling was never polished; I recommend that users adapt the base code to their own needs.

Our internal version has the following features:

  1. clickhouse_manager handles task configuration and task distribution
  2. sinker periodically fetches the task configuration from clickhouse_manager and manages the task life cycle (start, stop)
  3. Consume from Kafka into ClickHouse without loss or duplication (after a successful flush, manually commit the largest offset of the batch to Kafka); a sketch follows at the end of this comment
  4. Support for other parsers, e.g. the protobuf protocol and internal reporting protocols
  5. Monitoring via exporters

The parts of this code that are not tightly tied to our business may continue to be open-sourced later on.
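
As a rough illustration of point 3, here is a minimal sketch of that fetch → insert → commit loop. It assumes the segmentio/kafka-go client and a hypothetical writeBatch helper that flushes one batch into ClickHouse; it is not our internal implementation.

package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

// writeBatch stands in for building and executing one INSERT into ClickHouse.
func writeBatch(msgs []kafka.Message) error {
	// ... convert msgs and run the batch INSERT here ...
	return nil
}

func main() {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "clickhouse_sinker",
		Topic:   "metrics",
	})
	defer r.Close()

	ctx := context.Background()
	batch := make([]kafka.Message, 0, 1000)
	for {
		// FetchMessage does not commit offsets; commits only happen after a flush.
		m, err := r.FetchMessage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		batch = append(batch, m)
		if len(batch) < cap(batch) {
			continue
		}
		if err := writeBatch(batch); err != nil {
			log.Fatal(err) // in practice: retry, as discussed above
		}
		// Commit only after ClickHouse accepted the batch; CommitMessages
		// commits the highest offset per partition among the given messages.
		if err := r.CommitMessages(ctx, batch...); err != nil {
			log.Fatal(err)
		}
		batch = batch[:0]
	}
}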

@jsding
Author

jsding commented Dec 12, 2019

Looking forward to seeing these open-sourced.

  1. When the Kafka or ClickHouse connection is unhealthy, the system goes to sleep. The sleep time doubles, but the maximum interval can be configured, e.g. one hour, so after at most one hour the system wakes up and checks whether things are back to normal: if they are, it resumes normal work; if not, it keeps sleeping. (A sketch follows at the end of this comment.)

  2. Consume from Kafka into ClickHouse without loss or duplication (after a successful flush, manually commit the largest offset of the batch to Kafka)

With 1 in place, you basically no longer have to keep an eye on the service once clickhouse_sinker is started. The clickhouse_sinker process can stay running on a machine, with no need to worry about it piling up errors and retries while Kafka or ClickHouse is under maintenance.
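
A minimal sketch of such a capped backoff, assuming a configurable maxWait and a hypothetical check function that probes Kafka/ClickHouse connectivity (both names are illustrative only):

package main

import (
	"errors"
	"log"
	"time"
)

// waitUntilHealthy sleeps with doubling intervals, capped at maxWait
// (e.g. one hour), until check() reports that Kafka and ClickHouse are
// reachable again.
func waitUntilHealthy(check func() error, maxWait time.Duration) {
	wait := time.Second
	for {
		err := check()
		if err == nil {
			return // connections are healthy, resume normal work
		}
		log.Println("connection still down:", err, "- sleeping", wait)
		time.Sleep(wait)
		wait *= 2
		if wait > maxWait {
			wait = maxWait // never sleep longer than the configured maximum
		}
	}
}

func main() {
	// Dummy probe that "recovers" after five seconds, standing in for a real
	// Kafka/ClickHouse connectivity check.
	deadline := time.Now().Add(5 * time.Second)
	check := func() error {
		if time.Now().Before(deadline) {
			return errors.New("not reachable yet")
		}
		return nil
	}
	waitUntilHealthy(check, time.Hour)
	log.Println("healthy again, resuming the sink loop")
}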

@ns-gzhang

  1. Consume from Kafka into ClickHouse without loss or duplication (after a successful flush, manually commit the largest offset of the batch to Kafka)

I'm curious how this can achieve exactly-once ingestion, if there are multiple Kafka partitions, and a consumer in a group may get messages from multiple partitions or even changing partitions when Kafka rebalancing happens. Keeping track of batch high offsets would work for a single partition in a batch in case of clickhouse_sinker crashes. But if the ClickHouse server crashes before you get positive response to the last insert, you won't be able to tell if the last batch is successful or not, right? In order to deal with that with ClickHouse's batch idempotency (exactly same batches will be deduped), we need to send in exactly the same batches for the unacked batches in case of ClickHouse server crash, which means we need to keep track of batch low offsets and high offsets (from a single partition, or every partition involved in the batch). Right? And we cannot use consumer group when rebalancing can happen? Thanks in advance for sharing your insights on this.
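
To make the idea concrete, here is a minimal sketch (the names offsetRange, batchRanges and track are hypothetical) of recording both the low and high offset per partition for a batch, which is what re-assembling exactly the same batch for ClickHouse's batch deduplication would require:

package main

import "fmt"

// offsetRange records the first and last offset a batch consumed from one
// partition, so the identical batch could be re-fetched and replayed when
// the outcome of the last insert is unknown.
type offsetRange struct {
	Low  int64 // first offset included in the batch
	High int64 // last offset included in the batch
}

// batchRanges maps topic -> partition -> offset range for one batch.
type batchRanges map[string]map[int32]*offsetRange

func (b batchRanges) track(topic string, partition int32, offset int64) {
	if b[topic] == nil {
		b[topic] = make(map[int32]*offsetRange)
	}
	r, ok := b[topic][partition]
	if !ok {
		b[topic][partition] = &offsetRange{Low: offset, High: offset}
		return
	}
	if offset < r.Low {
		r.Low = offset
	}
	if offset > r.High {
		r.High = offset
	}
}

func main() {
	ranges := batchRanges{}
	ranges.track("metrics", 0, 100)
	ranges.track("metrics", 0, 150)
	ranges.track("metrics", 2, 7)
	for topic, parts := range ranges {
		for p, r := range parts {
			fmt.Printf("%s[%d]: offsets %d..%d\n", topic, p, r.Low, r.High)
		}
	}
}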

@sundy-li
Member

sundy-li commented Feb 19, 2020

But if the ClickHouse server crashes before you get positive response to the last insert, you won't be able to tell if the last batch is successful or not, right?

Yes, so we should ensure each insert succeeds. We use LoopWrite to retry the failed inserts; users could set the retry count to a large number or have it send alarm messages.

Keeping track of batch high offsets would work for a single partition in a batch in case of clickhouse_sinker crashes

For each batch insert we keep track of the largest offset of every partition involved; once the batch insert succeeds, we commit those per-partition offsets.
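
A minimal sketch of that bookkeeping, with hypothetical names (batchOffsets, track), keeping only the highest offset per topic/partition and committing it once the insert has succeeded:

package main

import "fmt"

// batchOffsets remembers the highest offset seen per topic/partition
// for the batch currently being assembled.
type batchOffsets map[string]map[int32]int64

// track records a consumed message's offset, keeping only the maximum.
func (b batchOffsets) track(topic string, partition int32, offset int64) {
	if b[topic] == nil {
		b[topic] = make(map[int32]int64)
	}
	if cur, ok := b[topic][partition]; !ok || offset > cur {
		b[topic][partition] = offset
	}
}

func main() {
	offsets := batchOffsets{}
	offsets.track("metrics", 0, 41)
	offsets.track("metrics", 1, 7)
	offsets.track("metrics", 0, 42)

	// After the batch insert into ClickHouse succeeds, commit offset+1
	// (the next offset to consume) for every partition in the batch.
	for topic, parts := range offsets {
		for partition, offset := range parts {
			fmt.Printf("commit %s[%d] -> %d\n", topic, partition, offset+1)
		}
	}
}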

@ns-gzhang

Thanks Sundy for sharing more insights.

Yes, so we should ensure each insert succeeds. We use LoopWrite to retry the failed inserts; users could set the retry count to a large number or have it send alarm messages.

So you are saying LoopWrite should never give up until it succeeds. What if the sinker or the server/pod it runs on also crashes during retry?

For each batch insert we keep track of the largest offset of every partition involved; once the batch insert succeeds, we commit those per-partition offsets.

That works only if you never need to fetch from Kafka again for a pending batch (i.e. batch insertions are always successful - the assumption above), right? If I ever need to re-assemble a batch, I have to be able to control the mix of data from all the partitions involved to generate exactly the same batch (to deal with imaginary crash case above).

@sundy-li
Member

sundy-li commented Feb 19, 2020

What if the sinker or the server/pod it runs on also crashes during retry?

If it crashes, the offsets are not committed either, so that case is fine.
Messages may still be duplicated, though: if the insert into ClickHouse succeeds but the sinker crashes before it commits the offsets, those messages are consumed again, and therefore duplicated, the next time the sinker starts.

That works only if you never need to fetch from Kafka again for a pending batch (i.e. batch insertions are always successful - the assumption above), right?

Yes

@ns-gzhang

Thanks again. That's what I'd like to confirm.
