Consul consumes tens of GB of RAM for no reason #16290

Open
AndreiPashkin opened this issue Feb 16, 2023 · 4 comments

AndreiPashkin commented Feb 16, 2023

Overview of the Issue

We use Consul in single-node mode for distributed locks and for service discovery in our app. Service discovery is used to connect our application environments to our monitoring. The Consul agents connect with each other over WAN.
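
For reference, the usage pattern is roughly the following sketch using the official Go client (github.com/hashicorp/consul/api); the key, service, and datacenter names below are placeholders, not our real configuration:

package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local single-node agent (default 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Distributed lock: LockKey creates a session-backed lock that contends
	// on a KV key. The key name here is a placeholder.
	lock, err := client.LockKey("locks/example-job")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := lock.Lock(nil); err != nil {
		log.Fatal(err)
	}
	defer lock.Unlock()

	// Service discovery: look up healthy instances of a service, optionally
	// in a remote datacenter. Service and datacenter names are placeholders.
	entries, _, err := client.Health().Service("monitoring", "", true,
		&api.QueryOptions{Datacenter: "dc-example"})
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		log.Printf("found %s at %s:%d", e.Service.Service, e.Service.Address, e.Service.Port)
	}
}

The cross-datacenter lookup relies on the datacenters being joined over WAN, as described above.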

After startup it starts consuming memory very quickly; memory usage goes over 30 GB and quickly overwhelms our server. What I've found is that repeated calls to consul info show that the number of goroutines increases rapidly along with the memory usage.
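
For what it's worth, the growth can be charted with something along these lines, polling the agent's telemetry endpoint for the goroutine gauge (a sketch that assumes the default local agent address and the standard consul.runtime.num_goroutines gauge):

package main

import (
	"fmt"
	"log"
	"strings"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talk to the local agent (default 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Print the goroutine gauge every 30 seconds, similar to repeatedly
	// running "consul info" and watching runtime.goroutines.
	for {
		info, err := client.Agent().Metrics()
		if err != nil {
			log.Println("metrics:", err)
		} else {
			for _, g := range info.Gauges {
				if strings.HasSuffix(g.Name, "runtime.num_goroutines") {
					fmt.Printf("%s goroutines=%.0f\n",
						time.Now().Format(time.RFC3339), g.Value)
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}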

I'm also attaching logs.

I can provide consul debug output at the request of the maintainers.

Possibly related issues: #12564, #9076, #12288, #3111

Reproduction Steps

So far I haven't figured out how to reproduce it in an isolated environment.

Consul info for both Client and Server

Server info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 1
build:
        prerelease = 
        revision = 0e046bbb
        version = 1.13.2
        version_metadata = 
consul:
        acl = disabled
        bootstrap = true
        known_datacenters = 4
        leader = true
        leader_addr = 172.22.0.2:8300
        server = true
raft:
        applied_index = 14
        commit_index = 14
        fsm_pending = 0
        last_contact = 0
        last_log_index = 14
        last_log_term = 2
        last_snapshot_index = 0
        last_snapshot_term = 0
        latest_configuration = [{Suffrage:Voter ID:6a5bc7e2-7c9b-bcdd-23a8-a4f0d35dc3d2 Address:172.22.0.2:8300}]
        latest_configuration_index = 0
        num_peers = 0
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Leader
        term = 2
runtime:
        arch = amd64
        cpu_count = 8
        goroutines = 642723
        max_procs = 8
        os = linux
        version = go1.18.1
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 1
        event_time = 2
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 1
        members = 1
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 1
        health_score = 4
        intent_queue = 0
        left = 0
        member_time = 182
        members = 5
        query_queue = 0
        query_time = 1
/ # consul info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 1
build:
        prerelease = 
        revision = 0e046bbb
        version = 1.13.2
        version_metadata = 
consul:
        acl = disabled
        bootstrap = true
        known_datacenters = 4
        leader = true
        leader_addr = 172.22.0.2:8300
        server = true
raft:
        applied_index = 15
        commit_index = 15
        fsm_pending = 0
        last_contact = 0
        last_log_index = 15
        last_log_term = 2
        last_snapshot_index = 0
        last_snapshot_term = 0
        latest_configuration = [{Suffrage:Voter ID:6a5bc7e2-7c9b-bcdd-23a8-a4f0d35dc3d2 Address:172.22.0.2:8300}]
        latest_configuration_index = 0
        num_peers = 0
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Leader
        term = 2
runtime:
        arch = amd64
        cpu_count = 8
        goroutines = 818620
        max_procs = 8
        os = linux
        version = go1.18.1
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 1
        event_time = 2
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 1
        members = 1
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 1
        health_score = 5
        intent_queue = 0
        left = 0
        member_time = 182
        members = 5
        query_queue = 0
        query_time = 1
/ # consul info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 1
build:
        prerelease = 
        revision = 0e046bbb
        version = 1.13.2
        version_metadata = 
consul:
        acl = disabled
        bootstrap = true
        known_datacenters = 4
        leader = true
        leader_addr = 172.22.0.2:8300
        server = true
raft:
        applied_index = 16
        commit_index = 16
        fsm_pending = 0
        last_contact = 0
        last_log_index = 16
        last_log_term = 2
        last_snapshot_index = 0
        last_snapshot_term = 0
        latest_configuration = [{Suffrage:Voter ID:6a5bc7e2-7c9b-bcdd-23a8-a4f0d35dc3d2 Address:172.22.0.2:8300}]
        latest_configuration_index = 0
        num_peers = 0
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Leader
        term = 2
runtime:
        arch = amd64
        cpu_count = 8
        goroutines = 1147773
        max_procs = 8
        os = linux
        version = go1.18.1
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 1
        event_time = 2
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 1
        members = 1
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 2
        health_score = 4
        intent_queue = 0
        left = 0
        member_time = 182
        members = 5
        query_queue = 0
        query_time = 1

Operating system and Environment details

Ubuntu 20.04

Log Fragments

https://gist.github.com/AndreiPashkin/0a95cdcb5e349c881ff4ee94af5f7b15

Version

# consul version
Consul v1.13.2
Revision 0e046bbb
Build Date 2022-09-20T20:30:07Z
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
AndreiPashkin changed the title from "Strange OOMs" to "Consul consumes tens of GB of RAM for no reason" on Feb 17, 2023
huikang (Collaborator) commented Mar 2, 2023

@AndreiPashkin, could you provide more info about how the distributed locks and service-discovery queries are used in the cluster, to help reproduce the issue? Thanks.

@PavelYadrov

Hello, we've run into a similar problem: the Consul servers gradually consume all of their dedicated RAM. After a reboot they work normally for 3-4 days.
We have raised the resources for Consul three times over the last month.

Consul has been working fine for the last six days. The total load decreased and Consul stopped consuming all of the dedicated RAM.

I've tried to analyze it with consul-snapshot-tool, but there was nothing special, same as in the related issue #5327 (comment).

@PavelYadrov

Hello, we've gathered some metrics; hope they will help with the analysis:
consul-agent-metrics.txt

AndreiPashkin commented Mar 13, 2023

@AndreiPashkin, could you provide more info about how the distributed locks and service-discovery queries are used in the cluster, to help reproduce the issue? Thanks.

@huikang, I've captured logs using consul debug while the issue was happening, and I can post them. I also think the issue is still reproducible, so I can collect additional info, but I need to know what specifically you need.
