
Zookeeper lock #111

Merged
Merged 15 commits into mosn:main on Jul 13, 2021

Conversation

ZLBer
Member

@ZLBer ZLBer commented Jul 3, 2021

What this PR does:

add zookeeper lock

Which issue(s) this PR fixes:

Fixes #104

Special notes for your reviewer:

@seeflood

Does this PR introduce a user-facing change?:


@mosn-community-bot

Hi @ZLBer, welcome to the mosn community! Please sign the Contributor License Agreement.

After you sign the CLA, we will automatically sync the status of this pull request within 3 minutes.

@zhenjunMa
Contributor

@ZLBer thanks for your contribution!

Could you please add a quickstart for this component, e.g. a zk_lock.json config and some client code to lock/unlock?

@seeflood
Member

seeflood commented Jul 4, 2021

Yep. The current code only implements the component itself; some supporting work is still needed: registering the component in Layotto's main, a demo (client and config.json), and documentation. It isn't much work, but it is necessary.

Hold on a bit, I'm writing a "Component Development Guide" that describes what this supporting work is.

@ZLBer
Member Author

ZLBer commented Jul 4, 2021

@zhenjunMa @seeflood Fine, I will add a quickstart after the guide is finished.

@seeflood
Member

seeflood commented Jul 5, 2021

@ZLBer Hi, I just added a developer guide. See https://github.com/mosn/layotto/blob/main/docs/zh/development/developing-component.md

Please improve this PR according to the guide.
Thanks for your passion!

@seeflood
Member

seeflood commented Jul 6, 2021

@ZLBer I added the mocking approach for unit tests to the Component Development Guide, for your reference.

@seeflood
Member

seeflood commented Jul 6, 2021

A few questions to throw out for discussion:

  1. On writing unit tests for zk: the zk SDK's zk.Conn is a struct. The SDK is not well designed in this respect, so we can wrap it in an interface.
     Dapr's zookeeper component takes the same approach; see https://github.com/dapr/components-contrib/tree/master/state/zookeeper (a sketch of such a wrapper follows this comment).

  2. Currently every TryLock establishes a new connection to zk. My understanding is that establishing one connection at initialization is enough and that connection can be reused, right? (Again, see the link above.)

  3. Before changing more code, we should probably settle a design question to avoid rework later: how do we handle this corner case?
     The app calls tryLock with a 10-second expiration and the call succeeds.
     The network between the sidecar and the zk server jitters or fails, so the ephemeral node is automatically deleted.
     Since the ephemeral node is gone, sidecar2 now competes for the same lock and acquires it.

So at the same moment two apps hold the same lock, which is a correctness problem.
Can this happen? If it can, can we avoid it, and how?
Let me think about it; I'll first look at how the zk client implements locks.
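
A minimal sketch of the interface idea in point 1, assuming the go-zookeeper client (github.com/go-zookeeper/zk); the names ZKConnection, ConnectionFactory and realFactory are illustrative, not necessarily what this PR uses:

```go
package zookeeper

import (
	"time"

	"github.com/go-zookeeper/zk"
)

// ZKConnection is a hypothetical interface that exposes only the calls the lock
// component needs, so *zk.Conn can be swapped for a mock in unit tests.
type ZKConnection interface {
	Create(path string, data []byte, flags int32, acl []zk.ACL) (string, error)
	Delete(path string, version int32) error
	Close()
}

// ConnectionFactory abstracts connection creation so tests can inject a fake connection.
type ConnectionFactory interface {
	NewConnection(sessionTimeout time.Duration) (ZKConnection, error)
}

// realFactory dials a real ZooKeeper ensemble; *zk.Conn already satisfies ZKConnection.
type realFactory struct {
	servers []string
}

func (f *realFactory) NewConnection(sessionTimeout time.Duration) (ZKConnection, error) {
	conn, _, err := zk.Connect(f.servers, sessionTimeout)
	if err != nil {
		return nil, err
	}
	return conn, nil
}
```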

@ZLBer
Member Author

ZLBer commented Jul 6, 2021

@seeflood
1. I thought about that too; to add mocks, the only option is to add an interface wrapper layer.

2. Why did I not reuse the connection? My original thinking was: to guarantee the lock can always be released, I relied on the ephemeral node's lifetime matching the connection's lifetime, so each request needed its own connection, and when expireTime elapses the connection is actively closed. But it seems reusing the connection would not actually cause incorrect behavior either?

3. I thought about it: we can set sessionTimeout = expireTime to avoid that scenario. "The provided session timeout sets the amount of time for which a session is considered valid after losing connection to a server." In other words, within expireTime the server will not delete the ephemeral node even if the network jitters, so the lock stays exclusive for expireTime. But this also means connections cannot be reused (each new connection has to be created with its own sessionTimeout); see the sketch after this comment.
   The locks people build on zk are mostly blocking locks based on sequential nodes, which does not match the semantics here.
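
A minimal sketch of point 3 (sessionTimeout = expireTime), assuming the go-zookeeper client; the helper name and lock-path handling are illustrative only:

```go
package zookeeper

import (
	"time"

	"github.com/go-zookeeper/zk"
)

// tryLockWithSessionTimeout creates a dedicated connection whose session timeout
// equals the lock's expire time, so the ephemeral node cannot disappear before
// expire even if the network between sidecar and zk jitters.
func tryLockWithSessionTimeout(servers []string, lockPath, owner string, expire time.Duration) (*zk.Conn, bool, error) {
	conn, _, err := zk.Connect(servers, expire) // session timeout == expire
	if err != nil {
		return nil, false, err
	}
	_, err = conn.Create(lockPath, []byte(owner), zk.FlagEphemeral, zk.WorldACL(zk.PermAll))
	if err == zk.ErrNodeExists {
		conn.Close() // someone else holds the lock
		return nil, false, nil
	}
	if err != nil {
		conn.Close()
		return nil, false, err
	}
	// The caller closes conn on unlock or after expire; closing the session deletes the ephemeral node.
	return conn, true, nil
}
```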

@seeflood
Member

seeflood commented Jul 7, 2021

> I thought about it: we can set sessionTimeout = expireTime to avoid that scenario ... But this also means connections cannot be reused ... The locks people build on zk are mostly blocking locks based on sequential nodes, which does not match the semantics here.

Right. Blocking or non-blocking, the "holder fails while holding the lock" problem exists either way. My understanding is that zk solves it with the session timeout, but that value is fixed per connection, so we cannot both reuse connections and change the timeout per request.

The essence of the problem: the API is defined to accept an arbitrary expiration time, but zk (without the TTL-node feature enabled) cannot honor an arbitrary value.

Reasoning from there, we have the following options:

(screenshot of the options table omitted)

Life is hard, and defining an API is harder!

@seeflood
Member

seeflood commented Jul 7, 2021

So if we want to reuse connections, there are a few ways forward (a sketch of option c follows this comment):

Option c: reuse the connection, and after tryLock start a timer task that deletes the node.

The effective ttl = session_timeout or the timer task's ttl (the ttl the user passed), depending on whether the timer task fails.
That solves the problem. It isn't perfect, but a distributed lock can never guarantee 100% correctness anyway.

For comparison:
the native zk client's effective ttl = session_timeout
option c's effective ttl = session_timeout or the user-provided ttl

So option c's correctness guarantee is no worse than the native approach.

However, when coding this we have to handle the case where "the timer task fails, but a moment later the heartbeat recovers and the session is not broken", which means the timer task needs retries.

Without retries there is a problem. A concrete corner case:
The app tryLocks successfully with a ttl of 10 seconds.

The network jitters, the app's unlock fails, and the app ignores the failure without retrying; meanwhile the heartbeat between the sidecar and the zk server stays healthy.

10 seconds later the sidecar's timer task fires and tries to delete the node, but just then the network between the sidecar and the zk server jitters and the delete fails.

After the failure we expect the session to time out and the node to be deleted, but unfortunately the heartbeat recovers shortly after (the zk SDK has its own failover strategy: when a connection drops it fails over to another zk server and tries to keep the previous session alive via heartbeats).

So the timer task is never retried, the node is never deleted, and we have a deadlock.

Option d: change the API and stop passing a ttl at all.

Option e: implement this with ttl nodes; zk deployments that don't enable ttl would not be supported.
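
A minimal sketch of option c's timer task, assuming the go-zookeeper client and a shared, reused connection; the retry loop is the part the comment above says must not be skipped, and all names here are illustrative:

```go
package zookeeper

import (
	"time"

	"github.com/go-zookeeper/zk"
)

// scheduleDelete implements option c's timer task: after ttl it deletes the lock
// node over the shared connection, retrying on failure so that "delete failed but
// the session survived" cannot turn into a permanent deadlock.
func scheduleDelete(conn *zk.Conn, lockPath string, ttl time.Duration) {
	time.AfterFunc(ttl, func() {
		backoff := 100 * time.Millisecond
		for i := 0; i < 10; i++ {
			err := conn.Delete(lockPath, -1) // -1 matches any node version
			if err == nil || err == zk.ErrNoNode {
				return // deleted, or already gone (e.g. the session expired)
			}
			time.Sleep(backoff)
			backoff *= 2
		}
		// Still failing after retries: a real component should log this and hand the
		// path to a background reaper instead of silently giving up.
	})
}
```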

@seeflood
Member

seeflood commented Jul 7, 2021

Why is this API simple to implement on Redis but so complicated on zk? Where does the complexity come from?

Essentially: TTL is a wonderful thing. If the server supports TTL, the other nodes need no state synchronization or failure detection; when something fails, the server deletes the lock automatically, the app and the sidecar don't have to do anything, and they are just two stateless nodes. Lazy and happy.

If the server does not support TTL, we have to do state synchronization and reliable failure detection among several stateful nodes (app, sidecar, server), which is where all the problems above come from.

(The lock in zk's native SDK essentially implements this state synchronization and failure detection for the user.)

Two patterns for preventing deadlock with a distributed lock:

a. a stateless architecture based on server-side TTL

b. a stateful architecture based on state synchronization and failure detection between nodes

How the two patterns depend on an SDK:

If we choose a, the app needs no SDK and can call the gRPC API directly.

If we choose b, there must be logic on the app side, so an SDK has to be built; expecting the app to skip the SDK and call the gRPC API directly is unrealistic (unless it reimplements the SDK logic itself).

Direction for future API evolution:

When we add the blocking Lock() later, we could actually offer two APIs, LockWithTTL(key, expire) and Lock(key); see the sketch after this comment.

Those who don't want to depend on an SDK call LockWithTTL; those willing to depend on an SDK call Lock.
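
A rough sketch of what the two shapes above could look like; these Go signatures are illustrative only, not the actual Layotto API:

```go
package lock

import "time"

// DistributedLock sketches the two API styles discussed above.
type DistributedLock interface {
	// LockWithTTL needs no SDK logic on the app side: the server (or sidecar)
	// releases the lock automatically once expire elapses.
	LockWithTTL(key string, expire time.Duration) (bool, error)

	// Lock takes no ttl; an app-side SDK must keep the lease alive
	// (renewal, failure detection) until Unlock is called.
	Lock(key string) (bool, error)

	Unlock(key string) error
}
```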

@ZLBer
Member Author

ZLBer commented Jul 8, 2021

@seeflood
I read Curator's docs on its lock implementation and found that they don't actively solve problem 3 either; they suggest users handle it themselves.
https://curator.apache.org/curator-recipes/shared-reentrant-lock.html
Perhaps with option a we could first add some local limits in the sidecar to reduce how many connections get created.

So which option do we adopt at this stage?

@seeflood
Member

seeflood commented Jul 8, 2021

@ZLBer As I understand it, the only difference between option a and option c is connection reuse. Option a also has to consider how to retry if the timer task's delete fails.
(I'll look at the zk Go SDK this afternoon to see whether it retries for us when conn.Delete fails.)

So personally I lean towards option c, plus allowing an implementation based on ttl nodes.
Take a look at whether option c is easy to write. If it isn't, no problem: you can implement ttl nodes first, or option a first, and optimize later.

If we can get there in one step, let's do option c + ttl node; if that's too much trouble, we take it step by step. How does that sound? Throwing it out for discussion.

@seeflood
Member

seeflood commented Jul 8, 2021

One idea for API evolution, for discussion:
all future gRPC APIs carry a ttl, so a user calling gRPC directly just passes a ttl and doesn't have to worry about the tricky corner cases;
on the app side we can wrap an SDK, so the user doesn't pass a ttl and the SDK handles lease renewal and similar chores automatically (see the sketch after this comment).

(screenshot omitted)
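
A hypothetical sketch of that app-side SDK: the gRPC API always takes a ttl, and the SDK hides it by renewing the lease in the background. The rawLockAPI interface, its Renew call, and all field names are made up for illustration:

```go
package sdk

import "time"

// rawLockAPI stands in for the ttl-carrying gRPC API; the real Layotto API may differ.
type rawLockAPI interface {
	TryLock(key, owner string, ttlSeconds int32) (bool, error)
	Renew(key, owner string, ttlSeconds int32) error // hypothetical renewal call
	Unlock(key, owner string) error
}

// AutoRenewLock lets callers lock without passing a ttl: it acquires with a short
// ttl and keeps renewing it until Unlock is called.
type AutoRenewLock struct {
	api  rawLockAPI
	stop chan struct{}
}

func (l *AutoRenewLock) Lock(key, owner string) (bool, error) {
	const ttl = int32(10) // seconds; renewed well before it elapses
	ok, err := l.api.TryLock(key, owner, ttl)
	if err != nil || !ok {
		return ok, err
	}
	l.stop = make(chan struct{})
	go func() {
		ticker := time.NewTicker(time.Duration(ttl) * time.Second / 3)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				_ = l.api.Renew(key, owner, ttl) // best effort; errors would be logged in real code
			case <-l.stop:
				return
			}
		}
	}()
	return true, nil
}

func (l *AutoRenewLock) Unlock(key, owner string) error {
	close(l.stop)
	return l.api.Unlock(key, owner)
}
```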

@ZLBer
Member Author

ZLBer commented Jul 8, 2021

@seeflood
I do feel option c and option a differ, and it's a correctness issue, i.e. problem 3 again. For example, node A's network jitters and its connection drops: option a closes the connection after expire, while option c's session ends after session_timeout. If session_timeout is much smaller than expire, another node B will grab the lock even though the expire promised to A has not elapsed and A has not released the lock. A's lock is gone, A is still running, B is running too, and things go wrong.
Do you mean we should just ignore this case for now?
My thought is to guarantee correctness first, and efficiency (such as connection reuse) second.

@seeflood
Member

seeflood commented Jul 8, 2021

> My thought is to guarantee correctness first, and efficiency (such as connection reuse) second.

Agreed!

@ZLBer
I thought it over and you are right: with a coarse-grained lock (the task runs slowly and expire is set much longer than session_timeout), option c can let another machine grab the lock once session_timeout elapses, while option a avoids that (because option a sets session_timeout to expire).

I originally wanted to just ignore this case because the zk docs say the server has configuration limiting the session_timeout range, so it can't be set arbitrarily large. But option a really does give a better correctness guarantee for coarse-grained locks, so let's go with option a + ttl node; the performance cost can be optimized later with a connection pool.

@seeflood
Member

seeflood commented Jul 8, 2021

To add some thoughts on option a's performance: later we can optimize it into a connection pool (no need to deal with it now); see the sketch after this comment.
For example, the pool holds connections with timeouts of 5s, 10s, and 15s.
When a request comes in with expire = 4s, we take out the 5s-timeout connection.
In that case the timer task started 4s later calls delete to remove the node rather than close to close the connection (because connections are reused concurrently and must be returned to the pool after use).
Then, when the timer task deletes the node, we must make sure a failed delete is retried (otherwise we hit the problem above: the timer task fails to delete, the heartbeat is not broken, deadlock).
That means checking the zk SDK to see whether delete is retried for us.

Of course, all of this can wait until we do performance optimization.
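
A rough sketch of that connection pool, assuming the go-zookeeper client; the bucket layout and names are illustrative:

```go
package zookeeper

import (
	"sync"
	"time"

	"github.com/go-zookeeper/zk"
)

// sessionPool keeps one shared connection per session-timeout bucket (e.g. 5s/10s/15s),
// so a request with expire=4s borrows the 5s-timeout connection instead of dialing zk.
type sessionPool struct {
	mu      sync.Mutex
	servers []string
	buckets []time.Duration            // assumed non-empty and sorted ascending, e.g. 5s, 10s, 15s
	conns   map[time.Duration]*zk.Conn // keyed by session-timeout bucket
}

// get returns the shared connection whose session timeout is the smallest bucket >= expire
// (falling back to the largest bucket), dialing it lazily on first use.
func (p *sessionPool) get(expire time.Duration) (*zk.Conn, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	bucket := p.buckets[len(p.buckets)-1]
	for _, b := range p.buckets {
		if b >= expire {
			bucket = b
			break
		}
	}
	if c, ok := p.conns[bucket]; ok {
		return c, nil
	}
	conn, _, err := zk.Connect(p.servers, bucket)
	if err != nil {
		return nil, err
	}
	p.conns[bucket] = conn
	return conn, nil
}
```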

@ZLBer
Member Author

ZLBer commented Jul 8, 2021

@seeflood
I've pushed a version of the code following option a, adding an interface layer and mock tests.

(14 review threads on components/lock/zookeeper/zookeeper_lock.go, components/lock/zookeeper/zookeeper_lock_test.go, and docs/zh/component_specs/lock/zookeeper.md — all resolved)
(merge commits resolving conflicts in cmd/layotto/main.go, components/go.mod, components/go.sum, go.mod, go.sum)
@seeflood
Member

@ZLBer The formatting check in the CI pipeline failed.
In the project directory, run:

go fmt ./...

then commit whatever it changed.

(2 more review threads on components/lock/zookeeper/zookeeper_lock.go — resolved)
@seeflood
Member

seeflood commented Jul 13, 2021

One more thing: the docs sidebar needs an entry for this.
I'll add it once it's merged into this PR; my fault for not putting this in the contribution guide.

@ZLBer
Member Author

ZLBer commented Jul 13, 2021

@seeflood
I've updated the sidebar as well. Thanks for your effort!

@seeflood
Member

This PR is impressive: upgrading the golang/mock package uncovered pre-existing unit test errors around the configuration API.
Let me take a look.

@codecov-commenter

Codecov Report

Merging #111 (bbb32be) into main (34e8ff0) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##             main     #111   +/-   ##
=======================================
  Coverage   46.95%   46.95%           
=======================================
  Files          41       41           
  Lines        1640     1640           
=======================================
  Hits          770      770           
  Misses        820      820           
  Partials       50       50           


Member

@seeflood seeflood left a comment


LGTM

@seeflood seeflood merged commit d6289c0 into mosn:main Jul 13, 2021
@seeflood
Member

Thanks again for your contribution!
