Eye severe memory leak #204

Open · innovia opened this issue Nov 29, 2017 · 19 comments

innovia commented Nov 29, 2017

Hi Konstantin,

I have Eye 0.9.2 installed on Ubuntu 14.04.

The monitored processes are:

reDash
  celery_scheduler ................ up  (Nov26, 0%, 74Mb, <1290>)
    child-1303 .................... up  (Nov26, 0%, 67Mb, <1303>)
    child-25635 ................... up  (15:33, 0%, 87Mb, <25635>)
    child-25953 ................... up  (15:37, 0%, 80Mb, <25953>)
  celery_worker ................... up  (Nov26, 1%, 72Mb, <1127>)
    child-25875 ................... up  (15:36, 0%, 85Mb, <25875>)
    child-25946 ................... up  (15:37, 0%, 85Mb, <25946>)
    child-25947 ................... up  (15:37, 0%, 85Mb, <25947>)
    child-25948 ................... up  (15:37, 0%, 80Mb, <25948>)
  flower .......................... up  (Nov26, 0%, 197Mb, <1132>)
  gunicorn ........................ up  (Nov26, 0%, 14Mb, <1304>)
    child-10761 ................... up  (12:30, 0%, 80Mb, <10761>)
    child-12652 ................... up  (12:54, 0%, 85Mb, <12652>)
    child-13920 ................... up  (13:10, 0%, 81Mb, <13920>)
    child-15800 ................... up  (13:34, 0%, 81Mb, <15800>)
    child-23143 ................... up  (15:08, 0%, 79Mb, <23143>)
    child-27819 ................... up  (09:14, 0%, 96Mb, <27819>)
    child-28854 ................... up  (09:28, 0%, 83Mb, <28854>)
    child-29043 ................... up  (09:30, 0%, 81Mb, <29043>)
    child-29362 ................... up  (09:34, 0%, 81Mb, <29362>)
  nginx ........................... up  (Nov26, 0%, 4Mb, <1372>)
    child-1374 .................... up  (Nov26, 0%, 4Mb, <1374>)
    child-1375 .................... up  (Nov26, 0%, 4Mb, <1375>)
    child-1376 .................... up  (Nov26, 0%, 4Mb, <1376>)
    child-1377 .................... up  (Nov26, 0%, 4Mb, <1377>)
  postgresql ...................... up  (15:28, 0%, 188Mb, <25069>)
    child-25071 ................... up  (15:28, 0%, 53Mb, <25071>)
    child-25072 ................... up  (15:28, 0%, 50Mb, <25072>)
    child-25073 ................... up  (15:28, 0%, 1Mb, <25073>)
    child-25074 ................... up  (15:28, 0%, 2Mb, <25074>)
    child-25075 ................... up  (15:28, 0%, 1Mb, <25075>)
    child-25181 ................... up  (15:29, 0%, 9Mb, <25181>)
    child-25200 ................... up  (15:29, 0%, 9Mb, <25200>)
    child-25233 ................... up  (15:30, 0%, 9Mb, <25233>)
    child-25268 ................... up  (15:30, 0%, 11Mb, <25268>)
    child-25371 ................... up  (15:30, 0%, 9Mb, <25371>)
    child-25377 ................... up  (15:30, 0%, 9Mb, <25377>)
    child-25540 ................... up  (15:32, 0%, 9Mb, <25540>)
    child-25541 ................... up  (15:32, 0%, 9Mb, <25541>)
    child-25624 ................... up  (15:33, 0%, 9Mb, <25624>)
    child-25639 ................... up  (15:33, 0%, 14Mb, <25639>)
    child-25876 ................... up  (15:36, 0%, 13Mb, <25876>)
    child-25949 ................... up  (15:37, 0%, 13Mb, <25949>)
    child-25950 ................... up  (15:37, 0%, 13Mb, <25950>)
  redis ........................... up  (Nov26, 128%, 10Mb, <1159>)

They add up to ~1500 MB in total,

but Eye itself is taking 18 GB!

root 1087 0.5 54.3 25626424 17915500 ? Sl Nov26 26:40 eye monitoring v0.9.2 [reDash] (in /home/ubuntu)

You can see in this screenshot that the memory is leaking:

(screenshot: Eye memory usage climbing steadily, taken 2017-11-29, 6:16 PM)

Any idea how to track down this memory leak?

kostya commented Nov 29, 2017

Can you show the eye x output when it is at 18 GB?

innovia commented Nov 29, 2017

OK, I had to restart it; it will happen again.

Here's the eye x info now, after the restart:

about:     Eye v0.9.2 (c) 2012-2016 @kostya
resources: 16:19, 0%, 115Mb, <18658>
ruby:      ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
gems:      ["Celluloid=0.17.3", "Celluloid::IO=0.17.3", "StateMachines=0.5.0", "NIO=2.1.0", "Timers=4.1.2", "Sigar=1.7.0.0"]
logger:    /opt/redash/logs/eye.log
home:      /root
dir:       /var/run/eye
pid_path:  /var/run/eye/pid
sock_path: /var/run/eye/sock
actors:    [["Eye::ChildProcess", 33], ["Eye::Process", 7], ["Eye::Group", 1], ["Eye::Controller", 1], ["Eye::SystemResources::Cache", 1], ["Eye::Server", 1], ["Celluloid::Supervision::Service::Public", 1], ["Celluloid::IncidentReporter", 1], ["Celluloid::Notifications::Fanout", 1], ["Celluloid::Supervision::Service::Root", 1]]

kostya commented Nov 29, 2017

By the way, I don't think you need to monitor children for postgres or nginx; that's quite a strange use case.

innovia commented Nov 29, 2017

I just wanted to see how much memory each of the processes is taking, but now it seems like Eye itself is the cause of the memory leak.

Once we stabilize it I'll remove those children.

innovia commented Nov 30, 2017

Here it is after a day: 6 GB!

root@reDash:~# eye x
about:     Eye v0.9.2 (c) 2012-2016 @kostya
resources: Nov29, 0%, 6093Mb, <18658>
ruby:      ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
gems:      ["Celluloid=0.17.3", "Celluloid::IO=0.17.3", "StateMachines=0.5.0", "NIO=2.1.0", "Timers=4.1.2", "Sigar=1.7.0.0"]
logger:    /opt/redash/logs/eye.log
home:      /root
dir:       /var/run/eye
pid_path:  /var/run/eye/pid
sock_path: /var/run/eye/sock
actors:    [["Eye::ChildProcess", 32], ["Eye::Process", 7], ["Eye::Group", 1], ["Eye::Controller", 1], ["Eye::SystemResources::Cache", 1], ["Eye::Server", 1], ["Celluloid::Supervision::Service::Public", 1], ["Celluloid::IncidentReporter", 1], ["Celluloid::Notifications::Fanout", 1], ["Celluloid::Supervision::Service::Root", 1]]

kostya commented Nov 30, 2017

This is really strange; I don't see any leak in the actors,
so it's hard to say where it could be.
Can you show your full config? Maybe you have some global variables that never get cleaned up (a sketch of that kind of pattern follows below).
You have only 7 processes; I have an eye process that has been running for half a year with 156 processes.
There is also a difference in the Ruby version, and in the StateMachines, NIO, and Timers gems (which could potentially leak).

about:     Eye v0.9.1 (c) 2012-2016 @kostya
resources: Apr26, 0%, 106Mb, <32291>
ruby:      ruby 1.9.3p484 (2013-11-22) [x86_64-linux]
gems:      ["Celluloid=0.17.3", "Celluloid::IO=0.17.1", "StateMachines=0.4.0", "NIO=1.1.1", "Timers=4.1.1", "Sigar=1.7.0.0"]
logger:    /projects/client/log/eye.log
home:      /home/deploy
dir:       /home/deploy/.eye
pid_path:  /home/deploy/.eye/pid
sock_path: /home/deploy/.eye/sock
actors:    [["Eye::Process", 156], ["Eye::Group", 29], ["Eye::Server", 1], ["Eye::SystemResources::Cache", 1], ["Eye::Controller", 1], ["Celluloid::Supervision::Service::Public", 1], ["Celluloid::IncidentReporter", 1], ["Celluloid::Notifications::Fanout", 1], ["Celluloid::Supervision::Service::Root", 1]]
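
To illustrate the "global variables" point above, here is a minimal, purely hypothetical sketch; none of these names appear in the config in this issue. Config files loaded with Eye.load are evaluated inside the eye daemon, so any top-level object that keeps growing inflates the daemon's own memory rather than the monitored processes'.

# Hypothetical illustration only; $sent_alerts and record_alert are made-up
# names, not taken from the real config. Top-level state in an eye config
# lives for the whole lifetime of the daemon process.
$sent_alerts = []                  # global array; never garbage-collected

def record_alert(message)
  $sent_alerts << message          # appended to on every call, never trimmed,
end                                # so the daemon's RSS climbs over time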

innovia commented Nov 30, 2017

I'm trying to reinstall the server on Ubuntu 16.04; I'll let you know.

kostya commented Nov 30, 2017

I re-ran my eye with:

ruby:      ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]
gems:      ["Celluloid=0.17.3", "Celluloid::IO=0.17.3", "StateMachines=0.5.0", "NIO=2.1.0", "Timers=4.1.2", "Sigar=1.7.0.0"]

so we'll see whether there is a leak or not.

kostya commented Dec 1, 2017

I don't see any leaks with the new gems and Ruby 2.2.2, so maybe the problem is in your config.

innovia commented Dec 4, 2017

Here's my config:

#!/usr/bin/env ruby
Eye.load '/etc/eye/mailer.rb' # mailer set params (like variables)
Eye.load '/etc/eye/cloudwatch.rb'
Eye.load '/etc/eye/config.rb' # config assign params values

Eye.application :reDash do
  working_dir "/opt/redash/current"

  load_env "/root/.env"

  trigger :flapping, times: 3, within: 1.minute, retry_in: 5.minutes
  notify :by_email, :info
  notify :cloudwatch, :info

  process(:postgresql) do
    pid_file "/var/run/postgresql/9.3-main.pid"
    stdall "/opt/redash/logs/postgresql"
    start_command "service postgresql start"
    stop_command "service postgresql stop"
    restart_command "service postgresql restart"
  end

  process(:nginx) do
    depend_on :gunicorn
    pid_file "/var/run/nginx.pid"
    stdall "/opt/redash/logs/nginx.log"
    start_command "/usr/sbin/nginx"
    stop_signals [:QUIT, 30.seconds, :TERM, 15.seconds, :KILL]
    restart_command "kill -HUP {PID}"
    daemonize true
  end

  process(:redis) do
    pid_file "/var/run/redis.pid"
    stdall "/opt/redash/logs/redis.log"
    start_command "/usr/local/bin/redis-server /etc/redis/6379.conf"
    stop_signals [:TERM, 30.seconds, :QUIT]
    restart_command  "kill -HUP {{PID}}"
    daemonize true
  end

  process(:gunicorn) do
    uid 'redash'
    gid 'nogroup'
    depend_on :redis
    pid_file "/var/run/gunicorn/gunicorn.pid"
    stdall "/opt/redash/logs/gunicorn.log"
    start_command "gunicorn -b unix:///var/run/gunicorn/gunicorn.sock --name redash -w 9 --max-requests 1000 redash.wsgi:app"
    stop_signals [:TERM, 30.seconds, :QUIT]
    restart_command  "kill -HUP {{PID}}"
    daemonize true
    monitor_children do
      stop_command "kill -TERM {PID}"
      check :cpu, :every => 30, :below => 80, :times => 3
      check :memory, :every => 30, :below => 350.megabytes, :times => [3,5]
    end
  end

  process(:flower) do
    uid 'redash'
    gid 'nogroup'
    pid_file "/var/run/celery/flower.pid"
    stdall "/opt/redash/logs/flower.log"
    start_command "celery flower -A redash.worker --logging=debug --broker=redis://127.0.0.1:6379/0 --broker_api=redis://127.0.0.1:6379/0 --address=0.0.0.0 --port=5555 --persistent"
    stop_signals [:TERM, 30.seconds, :QUIT]
    restart_command  "kill -HUP {{PID}}"
    check :cpu, :every => 30, :below => 80, :times => 3
    check :memory, :every => 30, :below => 250.megabytes, :times => [3,5]
    daemonize true
  end

  process(:celery_scheduler) do
    uid 'redash'
    gid 'nogroup'
    pid_file "/var/run/celery/celery_schedule_worker.pid"
    stdall "/opt/redash/logs/celery_schedule_worker.log"
    start_command "celery worker --app=redash.worker --beat --concurrency=4 --queues=queries,celery --maxtasksperchild=100 --events -Ofair --autoscale=10,4 -n redash_celery_scheduled@%h"
    stop_signals [:TERM, 30.seconds, :QUIT]
    restart_command  "kill -HUP {{PID}}"
    daemonize true
    monitor_children do
      stop_command "kill -TERM {PID}"
      check :cpu, :every => 30, :below => 80, :times => 3
      check :memory, :every => 30, :below => 512.megabytes, :times => [3,5]
    end
  end

  process(:celery_worker) do
    uid 'redash'
    gid 'nogroup'
    pid_file "/var/run/celery/celery_worker.pid"
    stdall "/opt/redash/logs/celery_worker.log"
    start_command "celery worker --app=redash.worker --concurrency=4 --queues=scheduled_queries --maxtasksperchild=100 --events -Ofair  --autoscale=10,4 -n redash_celery_worker@%h"
    stop_signals [:TERM, 30.seconds, :QUIT]
    restart_command  "kill -HUP {{PID}}"
    daemonize true
    monitor_children do
      stop_command "kill -TERM {PID}"
      check :cpu, :every => 30, :below => 80, :times => 3
      check :memory, :every => 30, :below => 400.megabytes, :times => [3,5]
    end
  end
end

kostya commented Dec 4, 2017

I don't see anything bad in the config. I think we should try 3 things:

  1. I'll try Ruby 2.4.1 (since I only tried 2.2.2).
  2. You try removing monitor_children and see if the leak is still there.
  3. You try removing depend_on and see if the leak is still there.

By the way, using restart_command together with daemonize true is quite strange. For example, unicorn is also driven with restart_command, but it runs with daemonize false, because the combination can potentially cause problems with restarts; a sketch of that pattern follows.
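
A minimal sketch of the pattern described above, with assumed paths, app name, and signals (this is not the reDash config): the process daemonizes itself and writes its own pid file, eye supervises it with daemonize false, and restart_command only sends a graceful-reload signal.

# Sketch only; every path and command here is an assumption, not from this issue.
Eye.application :example_app do
  process(:unicorn) do
    working_dir "/var/www/example_app"
    pid_file "/var/www/example_app/tmp/pids/unicorn.pid"
    start_command "bundle exec unicorn -D -c config/unicorn.rb"
    stop_signals [:QUIT, 30.seconds, :TERM, 10.seconds, :KILL]
    restart_command "kill -USR2 {PID}"   # graceful reload handled by unicorn itself
    daemonize false                      # unicorn daemonizes itself via -D
  end
end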

innovia commented Dec 4, 2017

I'm trying to downgrade to 2.2.2.

I'll do what you suggested.

innovia commented Dec 6, 2017

Downgrading to 2.2.2 and disabling all of these dependencies and children stabilized it at 100 MB.

I'll try adding back the children monitoring.

kostya commented Dec 7, 2017

How is it going?

innovia commented Dec 7, 2017

It's growing.

It's now at ~700 MB:

eye x
about: Eye v0.9.2 (c) 2012-2016 @kostya
resources: Dec04, 0%, 776Mb, <1089>
ruby: ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]
gems: ["Celluloid=0.17.3", "Celluloid::IO=0.17.3", "StateMachines=0.5.0", "NIO=2.1.0", "Timers=4.1.2", "Sigar=1.7.0.0"]
logger: /opt/redash/logs/eye.log
home: /home/ubuntu
dir: /var/run/eye
pid_path: /var/run/eye/pid
sock_path: /var/run/eye/sock
actors: [["Eye::ChildProcess", 18], ["Eye::Process", 7], ["Eye::Notify::Mail", 1], ["Eye::Group", 1], ["Eye::Controller", 1], ["Eye::SystemResources::Cache", 1], ["Eye::Server", 1], ["Celluloid::Supervision::Service::Public", 1], ["Celluloid::IncidentReporter", 1], ["Celluloid::Notifications::Fanout", 1], ["Celluloid::Supervision::Service::Root", 1]]

kostya commented Dec 7, 2017

OK. Can you test monitor_children and depend_on separately, to see where the bug is?

innovia commented Dec 7, 2017

I didn't add depend_on at all.

kostya commented Dec 7, 2017

So the bug is in monitor_children; remove all of it (see the example below) and it just works without leaks.
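
For reference, this is the gunicorn entry from the config posted above with the monitor_children block removed (and depend_on left out, as innovia had already done); everything else is unchanged:

  process(:gunicorn) do
    uid 'redash'
    gid 'nogroup'
    pid_file "/var/run/gunicorn/gunicorn.pid"
    stdall "/opt/redash/logs/gunicorn.log"
    start_command "gunicorn -b unix:///var/run/gunicorn/gunicorn.sock --name redash -w 9 --max-requests 1000 redash.wsgi:app"
    stop_signals [:TERM, 30.seconds, :QUIT]
    restart_command  "kill -HUP {{PID}}"
    daemonize true
    # monitor_children block removed; if limits are still wanted, a check can be
    # attached to the parent process instead, as the flower entry already does:
    # check :memory, :every => 30, :below => 350.megabytes, :times => [3, 5]
  end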

kostya commented Dec 8, 2017

I confirm that the leak is in monitor_children; I'll try to fix it.
