
[SPVR-67] add "yavirt" as a new runtime #73

Merged: 7 commits from feature/filter into projecteru2:master on Sep 27, 2021

Conversation

DuodenumL (Contributor) commented Sep 14, 2021

Besides the Yavirt work, this PR also fixes some long-standing bugs and some new ones introduced by this refactor:

Improvement: the agent no longer exits as a whole when the runtime daemon exits
Fix: Bufpipe's Read could block forever in certain edge cases
Fix: the retry mechanism in logs/Writer could cause the goroutine count to explode
Fix: with a container restarting endlessly, backoff retry could cause the goroutine count to explode
Fix: with a container restarting endlessly, logBroadcaster could block


I just realized my earlier use of getFilter was a bit off.

In Docker, a filter takes this form: {Key: "label", Value: "ERU=1"}
Yavirt has to implement this part itself by parsing the labels out of ETCD, so the format Yavirt needs is: {Key: "ERU", Value: "1"}

So the manager layer should provide a getBaseFilter that returns plain key-value pairs; if a runtime wants to treat labels specially, it does the conversion itself, as sketched below.

Also, the pre-refactor getFilter had an extend parameter that never seemed useful, so I removed it.

types.KV has been replaced by map[string]string everywhere; it was originally introduced to handle Docker's repeated keys (they were all "label").
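A rough sketch of that split, assuming the Docker filters package (github.com/docker/docker/api/types/filters) and fmt; the names getBaseFilter and toDockerFilters are illustrative, not necessarily the PR's actual code:

// getBaseFilter returns runtime-agnostic key-value pairs, e.g. {"ERU": "1"}.
func getBaseFilter(extra map[string]string) map[string]string {
	base := map[string]string{"ERU": "1"}
	for k, v := range extra {
		base[k] = v
	}
	return base
}

// The Docker runtime folds the pairs into the repeated-"label" form it
// expects, i.e. {Key: "label", Value: "ERU=1"}.
func toDockerFilters(base map[string]string) filters.Args {
	args := filters.NewArgs()
	for k, v := range base {
		args.Add("label", fmt.Sprintf("%s=%s", k, v))
	}
	return args
}

// Yavirt parses labels out of ETCD itself, so it can consume the base
// filter's {Key: "ERU", Value: "1"} pairs directly.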

DuodenumL (Contributor, Author) commented:

Don't merge this yet; as the Yavirt migration proceeds there will probably be more agent-side details to change, so let's keep them in one PR...

For example, I just noticed that the all parameter of ListWorkloadIDs(ctx context.Context, all bool, filter map[string]string) is no longer used; every caller passes all=true. (The agent used to pass all=false when loading at startup, but the last refactor changed that; see the sketch below.)

Also, some details in Yavirt differ from Docker. For example, Docker can parse app / entrypoint / ident out of a container name, but a Yavirt VM only has an ID, so there is nothing to parse. Luckily only Attach uses that, and Yavirt's Attach probably doesn't need to be implemented for now.
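The resulting signature change would just drop the dead flag:

// before: every caller passes all=true, so the flag is dead
ListWorkloadIDs(ctx context.Context, all bool, filter map[string]string) ([]string, error)

// after
ListWorkloadIDs(ctx context.Context, filter map[string]string) ([]string, error)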

DuodenumL (Contributor, Author) commented:

libyavirt hasn't been merged yet, so go.mod currently uses a replace pointing at my own fork... so this PR shouldn't be merged before libyavirt is...

Also, should the monitor code in Yavirt be removed first, or coexist with the agent for a while?

DuodenumL changed the title make "getFilter" more universal → [WIP] add "yavirt" as a new runtime on Sep 20, 2021
DuodenumL changed the title [WIP] add "yavirt" as a new runtime → [WIP] [SPVR-67] add "yavirt" as a new runtime on Sep 20, 2021
jschwinger233 (Member) left a comment:

The manager and agent exit logic looks off: as written, a single event-watch error shuts the whole process down, but I don't think it should. Event-watch errors are perfectly normal (a docker / yavirt restart will trigger one); the agent should just wait and reconnect.
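The expected shape would be something like this sketch (Event, watchFunc, and the rest are illustrative placeholders, not the PR's code; logging via logrus and the standard context and time packages assumed):

// Event stands in for the runtime's event type.
type Event struct{ ID, Action string }

// watchFunc abstracts the runtime's event API; both the Docker and Yavirt
// clients return an event channel plus an error channel.
type watchFunc func(ctx context.Context) (<-chan Event, <-chan error)

// watchEvents re-subscribes after a watch error instead of killing the
// process: docker / yavirt restarts are expected, so back off and retry.
func watchEvents(ctx context.Context, watch watchFunc, handle func(Event)) {
	for {
		eventChan, errChan := watch(ctx)
	recv:
		for {
			select {
			case ev, ok := <-eventChan:
				if !ok {
					break recv
				}
				handle(ev)
			case err := <-errChan:
				log.Errorf("[watchEvents] watch error: %v, reconnecting", err)
				time.Sleep(time.Second) // wait for the daemon to come back
				break recv
			case <-ctx.Done():
				return
			}
		}
	}
}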

// ListWorkloadIDs lists workload IDs filtered by given condition
func (y *Yavirt) ListWorkloadIDs(ctx context.Context, filters map[string]string) ([]string, error) {
var ids []string
var err error
Member:
You can move these variable declarations into the function's return values:

func (y *Yavirt) ListWorkloadIDs(ctx context.Context, filters map[string]string) (ids []string, err error) {

var yaEventChan <-chan yavirttypes.EventMessage
var yaErrChan <-chan error

yaEventChan, yaErrChan = y.client.Events(ctx, filters)
Member:

These two variables don't seem to need separate type declarations; just write yaEventChan, yaErrChan :=
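i.e.:

yaEventChan, yaErrChan := y.client.Events(ctx, filters)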

HTTPCode int
}

type healthCheckBridge struct {
Member:
Why this name?

Contributor (Author):

Because that's what it's called in yavirt...

if err != nil {
return err
}

wg := &sync.WaitGroup{}
for _, wid := range workloadIDs {
log.Debugf("[load] detect workload %s", coreutils.ShortID(wid))
log.Debugf("[load] detect workload %s", wid)
wg.Add(1)
go func(ID string) {
defer wg.Done()
Member:

The log.Errorf further down in this function (line 36) doesn't carry enough information: if it actually fires, the log won't even tell you which instance failed to report its status. It should at least include the ID.
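For example (a hypothetical message, just to show the shape):

log.Errorf("[load] failed to set status of workload %s, err: %v", ID, err)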

@@ -4,21 +4,19 @@ import (
"context"
"sync"

coreutils "github.com/projecteru2/core/utils"

log "github.com/sirupsen/logrus"
)

func (m *Manager) load(ctx context.Context) error {
Member:

This name isn't great either; "load workload" sounds like an operation with no side effects.

jschwinger233 (Member) commented Sep 20, 2021

I didn't look closely at the manager implementation last time; this time I spotted some problems.

The Manager.Run(Context) function in manager/node/manager.go:

// Run runs a node manager
func (m *Manager) Run(ctx context.Context) error {
	log.Infof("[NodeManager] start node status heartbeat")
	go m.heartbeat(ctx)

	// wait for signal
	<-ctx.Done()
	log.Info("[NodeManager] exiting")
	log.Infof("[NodeManager] mark node %s as down", m.config.HostName)

	// ctx is now canceled. use a new context.
	var err error
	utils.WithTimeout(context.TODO(), m.config.GlobalConnectionTimeout, func(ctx context.Context) {
		err = m.store.SetNode(ctx, m.config.HostName, false)
	})
	if err != nil {
		log.Errorf("[NodeManager] failed to mark the node %s as down, err: %s", m.config.HostName, err)
		return err
	}
	return nil
}

Since the upper layer also selects on <-ctx.Done(), when the signal arrives the upper layer returns right away and the process exits, so the set-node-down here never gets a chance to run.

You need a synchronization mechanism for graceful termination: receive the signal, wait for the cleanup to finish, then exit.
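One possible shape for that synchronization (a sketch with illustrative names like Runner and serve, not the PR's actual code; uses os/signal and syscall from the standard library): the caller blocks until Run has finished its cleanup instead of racing it on ctx.Done().

// Runner stands in for the node manager; Run is expected to do its cleanup
// (e.g. mark the node down) after ctx is canceled, then return.
type Runner interface {
	Run(ctx context.Context) error
}

func serve(nodeManager Runner) {
	// cancel ctx on SIGINT / SIGTERM
	ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer cancel()

	done := make(chan struct{})
	go func() {
		defer close(done)
		_ = nodeManager.Run(ctx)
	}()

	// Block until Run has finished marking the node down, instead of
	// selecting on <-ctx.Done() here and exiting immediately.
	<-done
}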

jschwinger233 (Member) commented:

So when you self-test, make sure to also cover the scenarios where dockerd / yavirtd dies.

DuodenumL force-pushed the feature/filter branch 5 times, most recently from 23b6eb3 to 8fbe7d0 on September 21, 2021 07:27
jschwinger233 (Member) left a comment:

Have you actually run this?

agent.go Outdated
defer cancel()
if err := workloadManager.Run(ctx); err != nil {
log.Errorf("[agent] workload manager err: %v, exiting", err)
cancel()
Member:

Isn't this one cancel too many? There's the defer as well.
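i.e. the deferred cancel already covers the error path:

defer cancel()
if err := workloadManager.Run(ctx); err != nil {
	log.Errorf("[agent] workload manager err: %v, exiting", err)
	// the deferred cancel() fires on return; no explicit call needed
}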

jschwinger233 (Member) commented:

Have you tested all the scenarios carefully: dockerd dying, dockerd hanging, dockerd restarting, a container auto-restarting endlessly, log forwarding, log attach? Also worth turning on pprof after churning through lots of operations to check for goroutine leaks.

jschwinger233 (Member) commented:

After all, this affects business failover: it has to be responsive and accurate, and when something does go wrong it has to be traceable. Is the logging complete enough? For example, does the log record the instance ID when a health check can't connect over TCP? If TCP hangs, will tickers pile up? Do the setstatus failure logs reveal the actual problem? Deliberately make each of the important IOs fail and see what happens.

jschwinger233 (Member) commented:

Also test HA with multiple core instances: two core servers on two ports, kill -9 them in turn, and watch how the agent behaves.

jschwinger233 (Member) commented:

Test graceful termination too: send signal 15 and see whether the node gets marked down. And will the containers be marked unhealthy?

DuodenumL force-pushed the feature/filter branch 3 times, most recently from a81fa9f to 1fd2687 on September 21, 2021 10:47
DuodenumL force-pushed the feature/filter branch 4 times, most recently from b8bd43a to 829e3a9 on September 22, 2021 07:46
DuodenumL force-pushed the feature/filter branch 4 times, most recently from eb7cbc7 to 84ec68f on September 23, 2021 03:44
DuodenumL force-pushed the feature/filter branch 2 times, most recently from 54411c5 to 757f14f on September 23, 2021 10:31
DuodenumL changed the title [WIP] [SPVR-67] add "yavirt" as a new runtime → [SPVR-67] add "yavirt" as a new runtime on Sep 23, 2021

logs/writer.go Outdated
// Close .
func (w *Writer) Close() error {
// leave some time for the pending writing
time.Sleep(CloseWaitInterval)
Contributor:

So Close isn't a blocking operation anymore?

return
}
defer func() {
go writer.Close()
Contributor:

... huh, it really is.

Contributor (Author):

= = What if I move the go inside Close instead? Calling Close asynchronously feels a bit odd.
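That variant would look roughly like this (a sketch of the idea being floated, not the final code):

// Close returns immediately; the grace period for pending writes runs in a
// goroutine, so callers no longer wrap Close in `go` themselves.
func (w *Writer) Close() error {
	go func() {
		// leave some time for the pending writing
		time.Sleep(CloseWaitInterval)
		// ... then release the underlying connection
	}()
	return nil
}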

@@ -96,14 +115,22 @@ func (l *logBroadcaster) broadcast(log *types.Log) {
// use wait group to make sure the logs are ordered
wg := &sync.WaitGroup{}
wg.Add(len(subscribers))
-for _, sub := range subscribers {
-go func(sub *subscriber) {
+for ID, sub := range subscribers {
Contributor:

... I really can't stand this ID; is it supposed to be exported?

Contributor:

Ignore me = =

}()

return outr, errr, nil
}

// checkHostname check if ERU_NODE_NAME env in container is the hostname of this agent
// TODO should be removed in the future, should always use label to filter
Contributor:

I feel like that day is never coming...

agent.go Outdated (resolved)
logs/writer.go Outdated
@@ -17,16 +17,25 @@ import (
const Discard = "__discard__"

// ErrConnecting means writer is in connecting status, waiting to be connected
var ErrConnecting = errors.New("Connecting")
var ErrConnecting = errors.New("connecting")
Contributor:

Oh, so there's an error.go now...

if err != common.ErrNotImplemented {
log.Errorf("[attach] failed to get workload name, id: %v, err: %v", ID, err)
} else {
log.Debugf("[attach] should ignore this workload")
Contributor:

log.Debug, not format
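i.e.:

log.Debug("[attach] should ignore this workload")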

manager/workload/event.go Outdated (resolved)
CMGS (Contributor) commented Sep 27, 2021

LGTM...maybe

CMGS merged commit a56aaed into projecteru2:master on Sep 27, 2021