[Bug]: TXN leak causes the CN to crash. #11966

Closed
1 task done
sukki37 opened this issue Sep 30, 2023 · 27 comments
Labels: attention/further-obervation; kind/bug (Something isn't working); severity/s0 (Extreme impact: Cause the application to break down and seriously affect the use)

Comments

@sukki37
Contributor

sukki37 commented Sep 30, 2023

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Environment

- Version or commit-id (e.g. v0.1.0 or 8b23a93):
- Hardware parameters:
- OS type:
- Others:

Actual Behavior



2023-09-30 19:23:40 | 2023/09/30 11:23:40.974660 +0000 FATAL cn-service found leak txn {"uuid": "", "txn-id": "d0285677f3014c73ad051a0b0bcf1750", "create-at": "2023/09/30 10:46:45.758886 +0000", "create-by": "frontend-session-0xc04a5b4e00"}
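For a sense of how long the reported transaction had been open, the `create-at` value can be subtracted from the timestamp of the FATAL entry. The following is a minimal sketch: `txnAge` and `logTimeLayout` are names made up for this illustration and are not part of MatrixOne.

```go
package main

import (
	"fmt"
	"time"
)

// logTimeLayout matches timestamps such as "2023/09/30 11:23:40.974660 +0000".
// (Hypothetical constant, introduced for this example only.)
const logTimeLayout = "2006/01/02 15:04:05.000000 -0700"

// txnAge returns how long a transaction had been open when it was reported.
func txnAge(createAt, reportedAt string) (time.Duration, error) {
	created, err := time.Parse(logTimeLayout, createAt)
	if err != nil {
		return 0, fmt.Errorf("parse create-at: %w", err)
	}
	reported, err := time.Parse(logTimeLayout, reportedAt)
	if err != nil {
		return 0, fmt.Errorf("parse report time: %w", err)
	}
	return reported.Sub(created), nil
}

func main() {
	// Both values are copied from the log entry in this issue.
	age, err := txnAge("2023/09/30 10:46:45.758886 +0000", "2023/09/30 11:23:40.974660 +0000")
	if err != nil {
		panic(err)
	}
	fmt.Println(age) // 36m55.215774s
}
```

So the transaction had been open for roughly 37 minutes when the check fired.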


image

Expected Behavior

No response

Steps to Reproduce

No response

Additional information

No response

@sukki37 sukki37 added kind/bug Something isn't working severity/s-1 labels Sep 30, 2023
@sukki37 sukki37 added this to the 1.0.0 milestone Sep 30, 2023
@qingxinhome qingxinhome assigned qingxinhome and unassigned daviszhen Oct 1, 2023
@qingxinhome
Contributor

Processing

@daviszhen daviszhen assigned daviszhen and unassigned qingxinhome Oct 7, 2023
@daviszhen
Contributor

Investigating the root cause.

@daviszhen daviszhen mentioned this issue Oct 8, 2023
7 tasks
@xzxiong
Contributor

xzxiong commented Oct 9, 2023

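On September 21, the default-cn on freetier-02 was carrying a "write monitoring" workload.
It is generated by the mo-agent component, which writes the metrics collected by Prometheus into MO for storage.

To bring this up locally, you need two components:

  1. Prometheus
  2. mo-agent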

@daviszhen
Contributor

Will retest after Zhang Xu's PR is merged.

@daviszhen
Contributor

The freetier-02 upgrade ran into problems; the DN is being repaired.

@sukki37
Contributor Author

sukki37 commented Oct 12, 2023

After the upgrade, the issue hasn't reproduced, so we've downgraded it for further observation.

@sukki37 sukki37 added severity/s1 High impact: Logical errors or data errors that must occur and removed severity/s-1 labels Oct 12, 2023
@sukki37 sukki37 assigned sukki37 and unassigned daviszhen Oct 12, 2023
@sukki37 sukki37 assigned daviszhen and unassigned sukki37 Oct 13, 2023
@sukki37 sukki37 added severity/s-1 and removed severity/s1 High impact: Logical errors or data errors that must occur labels Oct 13, 2023
@daviszhen daviszhen mentioned this issue Oct 26, 2023
7 tasks
@sukki37 sukki37 added severity/s0 Extreme impact: Cause the application to break down and seriously affect the use and removed severity/s-1 labels Oct 27, 2023
@sukki37
Contributor Author

sukki37 commented Nov 3, 2023

The issue still reproduces.

@nnsgmsone
Contributor

The PR has been merged; it fixes some problems that may have been causing this. Let's see whether the issue still occurs.

nnsgmsone added a commit that referenced this issue Nov 7, 2023
## What type of PR is this?

- [ ] API-change
- [X] BUG
- [ ] Improvement
- [ ] Documentation
- [ ] Feature
- [ ] Test and CI
- [ ] Code Refactoring

## Which issue(s) this PR fixes:
#11966,
#12385

issue #

## What this PR does / why we need it:

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
@nnsgmsone
Contributor

Observing for one more day.

sukki37 pushed a commit that referenced this issue Nov 9, 2023
issue #12385, #11966
Fix the error handling of channel when recovering
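The commit message does not show the change itself. As a general illustration only, one common pitfall in this area is a goroutine that panics before it sends its result, leaving the receiver blocked forever (and anything the receiver owns, such as an open transaction, never cleaned up). The sketch below shows a pattern that always delivers exactly one result; `runTask` is a hypothetical helper and is not taken from the MatrixOne code base.

```go
package main

import (
	"errors"
	"fmt"
)

// runTask runs fn in a goroutine and always delivers exactly one result on the
// returned channel, even if fn panics. (Hypothetical helper for illustration.)
func runTask(fn func() error) <-chan error {
	done := make(chan error, 1) // buffered, so the single send never blocks
	go func() {
		defer func() {
			// If fn panicked, the normal send below never ran; report the
			// panic as an error so the receiver is not left waiting.
			if r := recover(); r != nil {
				done <- fmt.Errorf("task panicked: %v", r)
			}
		}()
		done <- fn()
	}()
	return done
}

func main() {
	err := <-runTask(func() error { panic("boom") })
	fmt.Println(err) // task panicked: boom

	err = <-runTask(func() error { return errors.New("plain failure") })
	fmt.Println(err) // plain failure
}
```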
@nnsgmsone nnsgmsone assigned sukki37 and unassigned nnsgmsone Nov 10, 2023
@heni02
Contributor

heni02 commented Nov 10, 2023

nightly-7244406e 1.0-dev still has this problem @nnsgmsone
image (WeCom screenshot 8c5f0f4f-0686-4d5d-90d4-2b62e75adcf4)

cn crash log: cncrash_moc.log
{"level":"INFO","time":"2023/11/10 07:15:50.638991 +0000","caller":"motrace/syncer.go:89","msg":"Wait signal done."}
panic: found leak txn

goroutine 265 [running]:
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x5?, 0x5?, {0x0?, 0x0?, 0xc0b6d9d8a0?})
/go/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:198 +0x65
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc02cec7ee0, {0xc017c0af00, 0x5, 0x5})
/go/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:264 +0x3ec
github.com/matrixorigin/matrixone/pkg/common/log.(*MOLogger).Log(0xc0b6d9d6a0, {0x2f5533e, 0xe}, {{0x0, 0x0}, 0x5, {0x0, 0x0, 0x0}, 0x0, ...}, ...)
/go/src/github.com/matrixorigin/matrixone/pkg/common/log/logger.go:169 +0x44a
github.com/matrixorigin/matrixone/pkg/common/log.(*MOLogger).Fatal(0x2f4b123?, {0x2f5533e?, 0xc14b91b5189e716b?}, {0xc017c0af00?, 0x4d2e8e0?, 0x64?})
/go/src/github.com/matrixorigin/matrixone/pkg/common/log/logger.go:137 +0x4e
github.com/matrixorigin/matrixone/pkg/cnservice.(*service).getTxnClient.func1.1({0xc046fbe450, 0x10, 0xc01bf4d000?}, {0x0?, 0x0?, 0x4d2e8e0?}, {0xc046fb4300, 0xfd}, {0xc045f3a940, 0x39}, ...)
/go/src/github.com/matrixorigin/matrixone/pkg/cnservice/server.go:576 +0x56b
github.com/matrixorigin/matrixone/pkg/txn/client.(*leakChecker).doCheck(0xc0005690e0)
/go/src/github.com/matrixorigin/matrixone/pkg/txn/client/leak_checker.go:114 +0x2b2
github.com/matrixorigin/matrixone/pkg/txn/client.(*leakChecker).check(0xc0005690e0, {0x367e918, 0xc0017e18c0})
/go/src/github.com/matrixorigin/matrixone/pkg/txn/client/leak_checker.go:98 +0x36
github.com/matrixorigin/matrixone/pkg/common/stopper.(*Stopper).doRunCancelableTask.func1()
/go/src/github.com/matrixorigin/matrixone/pkg/common/stopper/stopper.go:259 +0x79
created by github.com/matrixorigin/matrixone/pkg/common/stopper.(*Stopper).doRunCancelableTask
/go/src/github.com/matrixorigin/matrixone/pkg/common/stopper/stopper.go:254 +0xe5
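
The trace shows a periodic task (`leakChecker.check` calling `doCheck`) invoking a callback in `cnservice` that logs at FATAL level. The real implementation is not shown in this thread; the following is only a minimal sketch of the general idea, assuming a checker that tracks open transactions and reports any that stay open longer than a configured age. All type and function names here are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// activeTxn is a hypothetical record of an open transaction.
type activeTxn struct {
	id       string
	createAt time.Time
	createBy string
}

// leakChecker tracks open transactions and reports those open longer than maxAge.
type leakChecker struct {
	mu     sync.Mutex
	maxAge time.Duration
	txns   map[string]activeTxn
}

func newLeakChecker(maxAge time.Duration) *leakChecker {
	return &leakChecker{maxAge: maxAge, txns: make(map[string]activeTxn)}
}

// opened registers a transaction when it is created.
func (c *leakChecker) opened(id, createBy string, at time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.txns[id] = activeTxn{id: id, createAt: at, createBy: createBy}
}

// closed removes a transaction once it commits or rolls back.
func (c *leakChecker) closed(id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.txns, id)
}

// leaked returns every transaction that has been open for longer than maxAge at time now.
func (c *leakChecker) leaked(now time.Time) []activeTxn {
	c.mu.Lock()
	defer c.mu.Unlock()
	var out []activeTxn
	for _, t := range c.txns {
		if now.Sub(t.createAt) > c.maxAge {
			out = append(out, t)
		}
	}
	return out
}

// demoLeaks opens two transactions, closes one, and returns the IDs that a
// check 15 minutes later would report with a 10-minute limit.
func demoLeaks() []string {
	start := time.Date(2023, 11, 10, 7, 0, 0, 0, time.UTC)
	c := newLeakChecker(10 * time.Minute)
	c.opened("txn-a", "session-1", start)
	c.opened("txn-b", "session-2", start)
	c.closed("txn-a") // finished in time, so it is no longer tracked
	var ids []string
	for _, t := range c.leaked(start.Add(15 * time.Minute)) {
		ids = append(ids, t.id)
	}
	return ids
}

func main() {
	for _, id := range demoLeaks() {
		fmt.Println("found leak txn:", id) // found leak txn: txn-b
	}
}
```

In this sketch the caller decides what to do with a reported leak. In the crash above, the callback chose to log at FATAL level, which is why a single leaked transaction takes the whole CN down.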

@sukki37 sukki37 modified the milestones: 1.0.0, 1.1.0 Nov 15, 2023
@daviszhen
Contributor

daviszhen commented Nov 15, 2023

image (WeCom screenshot 22648f27-679f-4410-a2dc-cee34c782108)

image

image

http://47.97.80.230/explore?orgId=1&panes=%7B%22aG4%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22freetier-01%5C%22%7D%20%7C%3D%20%60goroutine%20160946%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%22now-12h%22,%22to%22:%22now%22%7D%7D,%22aLR%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22expr%22:%22%7Bapp%3D%5C%22default-cn%5C%22,apps_kruise_io_cloneset_instance_id%3D%5C%22kjp5z%5C%22,container%3D%5C%22main%5C%22,controller_revision_hash%3D%5C%22default-cn-7997dc9b66%5C%22,filename%3D%5C%22%2Fvar%2Flog%2Fpods%2Ffreetier-01_default-cn-kjp5z_bd212baf-58d8-4bc3-a4a9-271aacd6d27f%2Fmain%2F9.log%5C%22,job%3D%5C%22freetier-01%2Fdefault-cn%5C%22,lifecycle_apps_kruise_io_state%3D%5C%22Normal%5C%22,matrixone_cloud_cluster%3D%5C%22freetier-01%5C%22,matrixone_cloud_component%3D%5C%22cn%5C%22,matrixone_cloud_main_cluster%3D%5C%22freetier-01%5C%22,matrixone_cloud_profile%3D%5C%22cn.standard%5C%22,matrixorigin_io_component%3D%5C%22CNSet%5C%22,matrixorigin_io_instance%3D%5C%22default%5C%22,matrixorigin_io_namespace%3D%5C%22freetier-01%5C%22,namespace%3D%5C%22freetier-01%5C%22,node_name%3D%5C%22cn-hangzhou.10.3.137.132%5C%22,pod%3D%5C%22default-cn-kjp5z%5C%22,pod_template_hash%3D%5C%227997dc9b66%5C%22,stream%3D%5C%22stderr%5C%22%7D%22,%22queryType%22:%22range%22,%22refId%22:%22log-row-context-query-_0.9749565064064918%22,%22maxLines%22:1000,%22direction%22:%22backward%22,%22datasource%22:%7B%22uid%22:%22loki%22,%22type%22:%22loki%22%7D%7D%5D,%22range%22:%7B%22from%22:%221700018176532%22,%22to%22:%221700018177532%22%7D%7D%7D&schemaVersion=1

@daviszhen
Contributor

The CN failed to start. The error reads: wait TN ready timeout: context deadline exceeded
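
The `context deadline exceeded` suffix is the standard text of Go's `context.DeadlineExceeded` error, which suggests the wait was bounded by a context with a timeout. How the CN actually waits for the TN is not shown here; the sketch below only illustrates how such a message can arise, using a hypothetical `waitReady` helper that polls a readiness function until the context expires.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitReady polls ready() until it returns true or ctx expires.
// (Hypothetical helper; not the MatrixOne implementation.)
func waitReady(ctx context.Context, interval time.Duration, ready func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if ready() {
			return nil
		}
		select {
		case <-ctx.Done():
			// Wrap the context error so callers can still match it with errors.Is.
			return fmt.Errorf("wait TN ready timeout: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// demoTimeout waits for a dependency that never becomes ready.
func demoTimeout() error {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	return waitReady(ctx, 10*time.Millisecond, func() bool { return false })
}

func main() {
	fmt.Println(demoTimeout()) // wait TN ready timeout: context deadline exceeded
}
```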

image (WeCom screenshot e58c702a-a6b1-47eb-9222-2343c52af4cc) image (WeCom screenshot 4281c3a2-ce20-4365-8534-80fc607dc33b)

http://47.97.80.230/explore?orgId=1&panes=%7B%22aG4%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22freetier-01%5C%22%7D%20%7C%3D%20%60panic%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%22now-12h%22,%22to%22:%22now%22%7D%7D,%228O6%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22expr%22:%22%7Bapp%3D%5C%22default-cn%5C%22,%20apps_kruise_io_cloneset_instance_id%3D%5C%22k4tx7%5C%22,%20container%3D%5C%22main%5C%22,%20controller_revision_hash%3D%5C%22default-cn-7997dc9b66%5C%22,%20job%3D%5C%22freetier-01%2Fdefault-cn%5C%22,%20lifecycle_apps_kruise_io_state%3D%5C%22Normal%5C%22,%20matrixone_cloud_cluster%3D%5C%22freetier-01%5C%22,%20matrixone_cloud_component%3D%5C%22cn%5C%22,%20matrixone_cloud_main_cluster%3D%5C%22freetier-01%5C%22,%20matrixone_cloud_profile%3D%5C%22cn.standard%5C%22,%20matrixorigin_io_component%3D%5C%22CNSet%5C%22,%20matrixorigin_io_namespace%3D%5C%22freetier-01%5C%22,%20namespace%3D%5C%22freetier-01%5C%22,%20pod%3D%5C%22default-cn-k4tx7%5C%22%7D%22,%22queryType%22:%22range%22,%22refId%22:%22log-row-context-query-_0.4009359423871903%22,%22maxLines%22:1000,%22direction%22:%22backward%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221700108402000%22,%22to%22:%221700108404000%22%7D%7D%7D&schemaVersion=1

image (WeCom screenshot 02e37c55-6fa5-4880-bcd5-dc5db884eb1a)

@daviszhen
Contributor

image (WeCom screenshot f844ff85-1eed-414d-873a-cabb63c3a73d)
