
Memory issues (leak?) investigation #5817

Closed · jywarren opened this issue Jun 1, 2019 · 81 comments

Labels: bug, help wanted, high-priority

Comments

@jywarren
Member

jywarren commented Jun 1, 2019

We're seeing a persistent memory issue since one week ago on Saturday, and I'm compiling information about it here to investigate.

Wondering if it's related to this controller method for the dashboard.

https://www.skylight.io/app/applications/GZDPChmcfm1Q/1559320320/1d/endpoints/HomeController%23dashboard?responseType=html

@jywarren added the bug and help wanted labels Jun 1, 2019
@jywarren
Member Author

jywarren commented Jun 1, 2019

Noting @icarito's comment:

I wonder jywarren because I had edited docker-compose-production.yml to use fewer processes (didn't make a PR for it). So it could be we just made it fit that way.

And this graph:

(graph image)

@jywarren
Member Author

jywarren commented Jun 1, 2019

We're seeing a lot of SMTP test errors too:

Link: https://intelligence.rackspace.com/cloud/entities/en45StuOyk/checks/chXoX9GHhF/alarm/alycd3HZyu

@icarito
Member

icarito commented Jun 2, 2019

Yes, load is very high too. From htop, and especially iotop, it appears mailman is quite active. It's the culprit for sure! Prior to May 22nd we ran it a few times a day - if we can run it every few minutes or so (not every second!) - it would be fine!

I, [2019-05-07T23:56:44.702410 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-08T21:33:03.762360 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-09T07:47:27.518491 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-09T08:18:47.825703 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-10T08:14:53.010705 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-10T21:45:50.739207 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-11T17:38:51.647335 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-13T03:33:15.682877 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-14T05:51:40.603184 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-14T05:53:20.857041 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-14T05:55:00.356772 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-14T05:56:40.487219 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-15T01:43:42.908744 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-16T10:13:45.703985 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-18T12:57:16.194957 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:49:27.019569 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:49:55.827419 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:50:18.722700 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:50:41.709075 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:51:00.124271 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:51:17.146210 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:51:33.745494 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:51:51.387282 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:52:09.145006 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:52:31.266559 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:53:03.176998 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:53:26.991989 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:53:54.074275 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:54:13.905343 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:54:37.736641 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:54:57.357057 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:55:15.522535 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:55:34.343241 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:55:51.964241 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:56:10.016964 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:56:42.822692 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:56:59.826809 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:57:16.178517 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:57:35.871196 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:57:59.731422 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:58:16.353160 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:58:33.608591 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:58:50.037296 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:59:06.912680 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:59:32.287362 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T08:59:59.201948 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T09:00:18.739067 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T09:00:42.144910 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T09:01:03.495556 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T09:01:20.493712 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T09:01:37.089192 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T09:01:53.921571 #1]  INFO -- : Mailman v0.7.0 started 
I, [2019-05-22T09:02:14.509227 #1]  INFO -- : Mailman v0.7.0 started

image

@icarito
Member

icarito commented Jun 2, 2019

The log is filled with cycles of these, no error:

I, [2019-06-02T02:35:26.270644 #1]  INFO -- : Mailman v0.7.0 started
I, [2019-06-02T02:35:26.270851 #1]  INFO -- : Rails root found in ., requiring environment...
I, [2019-06-02T02:35:56.930267 #1]  INFO -- : POP3 receiver enabled (notifications@publiclab.org@pop.gmail.com).
I, [2019-06-02T02:35:56.938850 #1]  INFO -- : Polling enabled. Checking every 5 seconds.
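
For context, that 5-second figure comes from the Mailman gem's poll_interval setting. A minimal sketch of a typical mailman server setup is below - the values and handler are illustrative, not necessarily what plots2's script/mailman_server actually contains:

require "mailman"

# Illustrative configuration only; the real script/mailman_server may differ.
Mailman.config.poll_interval = 5          # "Polling enabled. Checking every 5 seconds."
Mailman.config.pop3 = {
  server:   "pop.gmail.com",
  port:     995,
  ssl:      true,
  username: ENV["NOTIFICATION_EMAIL"],    # e.g. notifications@publiclab.org
  password: ENV["NOTIFICATION_PASSWORD"]
}

Mailman::Application.run do
  default do
    # plots2 hands the message to the comment model's receive_mail
    # (see the stack trace later in this thread); placeholder only.
    Comment.receive_mail(message)
  end
end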

@icarito
Member

icarito commented Jun 2, 2019

Looks like mailman is crashing and being immediately respawned!

icarito@rs-plots2:/srv/plots_container/plots2$ docker ps
CONTAINER ID        IMAGE                COMMAND                  CREATED             STATUS              PORTS                      NAMES
8d13c675568e        containers_mailman   "script/mailman_serv…"   4 days ago          Up 14 seconds                                  containers_mailman_1
f423dec91ebe        containers_web       "/bin/bash -c 'sleep…"   4 days ago          Up 4 days           127.0.0.1:4001->4001/tcp   containers_web_1
24f7b43efebc        containers_sidekiq   "bundle exec sidekiq…"   4 days ago          Up 4 days                                      containers_sidekiq_1
070511ab43d1        redis:latest         "docker-entrypoint.s…"   4 days ago          Up 4 days           6379/tcp                   containers_redis_1
6ea8f0498b2c        mariadb:10.2         "docker-entrypoint.s…"   4 days ago          Up 3 days           3306/tcp                   containers_db_1

I've decided to stop this container for tonight in order to monitor the effect on performance.

@jywarren
Member Author

jywarren commented Jun 3, 2019

I think we should also look at which gem updates were merged in the days leading up to this code publication. Thanks!

@jywarren
Member Author

jywarren commented Jun 3, 2019

That's so weird about mailman, I will look at the config but I don't remember any changes to the rate.

@jywarren
Member Author

jywarren commented Jun 3, 2019

Oh you know what? We set it to retry 3 times. Maybe these are overlapping now? It could at least have increased the rate of attempts since it retries 3 times for every scheduled run.

retry if (retries += 1) < 3
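
For reference, that line sits in a begin/rescue wrapped around the mailman server's run loop, roughly like the sketch below (simplified; the exception class and exact wrapping are assumptions, not the literal plots2 code):

retries = 0
begin
  Mailman::Application.run do
    default do
      Comment.receive_mail(message)
    end
  end
rescue StandardError
  # On failure (e.g. MySQL not reachable yet, or a message that raises),
  # re-run the whole application, up to 3 attempts in quick succession.
  retry if (retries += 1) < 3
end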

@jywarren
Member Author

jywarren commented Jun 3, 2019

OK, modified it to 20 seconds, which should mean at most a retry every 5 seconds --

a40ea56

That'll be the same rate as before when we added retries.

@jywarren
Member Author

jywarren commented Jun 3, 2019

OK, now working on analysis after a few hours:

https://oss.skylight.io/app/applications/GZDPChmcfm1Q/1559574420/6h/endpoints

Screen Shot 2019-06-03 at 4 36 39 PM

Overall looks good. But, on closer look, it's ramping up in load time:

Screen Shot 2019-06-03 at 4 37 03 PM

Comparing the latter portion where it's starting to go back up:

Screen Shot 2019-06-03 at 4 37 41 PM

to the earlier just after the reboot:

Screen Shot 2019-06-03 at 4 37 51 PM

And then to this from a couple weeks ago before all our trouble:

Screen Shot 2019-06-03 at 4 38 42 PM

Then finally just after we started seeing issues on the 22-23rd of May:

Screen Shot 2019-06-03 at 4 39 15 PM

Overall it's not conclusive.

Resources:

One of the tough things about this is that it's right around where these two commits happened:

  1. disabling caching on profiles (which we later reverted): 794df37
  2. the container build process change: 794df37

I'd like to think it relates to the addition of the retry 3 times code in 2bc7b49, which I tried tweaking today. But actually load times are still slowly growing.

This could mean that a) something else is driving it, or b) the "rescue/retry" cycle itself could be causing memory leak buildup?

@jywarren
Member Author

jywarren commented Jun 3, 2019

Shall I comment out the rescue/retry code entirely?

Maybe the hanging while waiting for MySQL to come up is actually tying up threads?

I'll try this. Site is almost unresponsive.

I removed the retry here: faa5a12

Deploying... it'll take a while.

@jywarren
Member Author

jywarren commented Jun 4, 2019

@jywarren
Member Author

jywarren commented Jun 4, 2019

OK, I wonder if the container setup affected the mailman container at all? Because at this point we've reverted all the likely stuff from the mailman script.

@jywarren
Member Author

jywarren commented Jun 4, 2019

OK, overnight it peaked and went back down a bit. But our problematic ones are still quite high, with peaks at about 20 seconds:

image

@jywarren
Member Author

jywarren commented Jun 4, 2019

The stats range calls are taking up to 40+ seconds!

@jywarren
Member Author

jywarren commented Jun 4, 2019

They're also taking forever on cache generation:

image

Could we be seeing an issue with the cache read/write?

@jywarren
Member Author

jywarren commented Jun 4, 2019

@icarito could there be an issue with read/write I/O or something during cache generation? I'm just not sure why it would take this long to pack all the data into the cache.

@jywarren
Member Author

jywarren commented Jun 4, 2019

Leaky gems -- check off if we're OK

  • celluloid > 0.16.0, < 0.17.2
  • csspool < 4.0.3
  • grape < 0.2.5
  • oj < 2.12.4
  • redcarpet < 3.3.3
  • redis = 3.3.0
  • sidekiq < 3.5.1
  • sidekiq-statistic
  • therubyracer < 0.12.2
  • zipruby <= 0.3.6

Non-leaky but memory issues in any case:

  • activeadmin
  • axlsx
  • delayed_job >= 4.06
  • libxml-ruby < 2.9.0 with Nokogiri RC3
  • newrelic_rpm >= 3.9.4, <= 3.9.7
  • sequel >= 2.12.0
  • stomp <= 1.3.5

@jywarren
Member Author

jywarren commented Jun 5, 2019

I'm still seeing this massive cache generation time for stats_controller#range and wondering if we need to tweak where the cache is stored. It looks like the default is file storage (and I checked, we have cache files in /plots2/tmp/cache/). Would we be helped at all by switching to in-memory caching or memcached, both of which are apparently pretty simple changes?

https://guides.rubyonrails.org/v3.2/caching_with_rails.html#activesupport-cache-memorystore

image
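
For reference, the cache store is a one-line change in the environment config. A minimal sketch with illustrative values (the server address, namespace, and sizes are assumptions, not our production settings):

# config/environments/production.rb -- illustrative values only.

# Default file store, which is what we appear to be using now:
#   config.cache_store = :file_store, "tmp/cache"

# Option A: per-process in-memory store (simple, but not shared across containers):
# config.cache_store = :memory_store, { size: 64.megabytes }

# Option B: memcached via the dalli gem (shared, avoids disk I/O on cache writes):
config.cache_store = :mem_cache_store, "memcached:11211",
                     { namespace: "plots2", compress: true, expires_in: 1.day }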

@jywarren
Member Author

jywarren commented Jun 5, 2019

Also looking at https://www.skylight.io/support/performance-tips

@jywarren
Member Author

jywarren commented Jun 5, 2019

I'll look at the email configuration now but if it doesn't yield anything I'll merge this, turning off the begin/rescue loop: #5840

@jywarren
Member Author

jywarren commented Jun 5, 2019

OK our next step for #5841 is to develop a monitoring strategy for if mailman goes down.

@jywarren
Member Author

jywarren commented Jun 5, 2019

Deploying with the new email credentials, AND the begin/rescue removal. However, I think it's worth redeploying with the begin/rescue re-instated if the memory leak is solved, because it could have been the email credential issues.

@jywarren
Member Author

jywarren commented Jun 5, 2019

Latest error:

mailman_1  | /app/app/models/comment.rb:265:in `add_comment': undefined method `body' for nil:NilClass (NoMethodError)
mailman_1  |   from /app/app/models/comment.rb:218:in `receive_mail'
mailman_1  |   from script/mailman_server:31:in `block (2 levels) in <main>'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/router.rb:66:in `instance_exec'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/router.rb:66:in `route'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/message_processor.rb:23:in `block in process'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/middleware.rb:33:in `block in run'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/middleware.rb:38:in `run'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/message_processor.rb:22:in `process'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/receiver/pop3.rb:43:in `block in get_messages'
mailman_1  |   from /usr/local/lib/ruby/2.4.0/net/pop.rb:666:in `each'
mailman_1  |   from /usr/local/lib/ruby/2.4.0/net/pop.rb:666:in `each_mail'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/receiver/pop3.rb:42:in `get_messages'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/application.rb:133:in `block in polling_loop'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/application.rb:130:in `loop'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/application.rb:130:in `polling_loop'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/application.rb:83:in `run'
mailman_1  |   from /usr/local/bundle/gems/mailman-0.7.0/lib/mailman/application.rb:11:in `run'
mailman_1  |   from script/mailman_server:22:in `<main>'

That's here:

mail_doc = Nokogiri::HTML(mail.html_part.body.decoded) # To parse the mail to extract comment content and reply content
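
That error means the message had no HTML part (a plain-text reply), so mail.html_part is nil. A guard along these lines would avoid it - a sketch of the idea only, not necessarily the exact change that went out in c02b560:

# Fall back to the text part, or the whole message, when there is no HTML part.
part = mail.html_part || mail.text_part || mail
mail_doc = Nokogiri::HTML(part.body.decoded) # parse the mail to extract comment and reply content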

@jywarren
Member Author

jywarren commented Jun 5, 2019

Fixed that! c02b560

@jywarren
Member Author

jywarren commented Jun 5, 2019

Ugh, finally really publishing the comment.rb fix....

@icarito
Member

icarito commented Jun 18, 2019

unstable is still deleting...

Doing operations on the live production database:

  • MySQL dump of the rsessions table with WHERE updated_at > DATE_SUB(NOW(), INTERVAL 7 DAY)
    • time: 48 s
    • dump size: 143Mb
  • Rename rsessions table
    • time: 11 s
    • home page was down for 15 minutes at ~6am UTC
  • Load clean data from dumped table to new rsessions table
    • creates new rsessions table file (220Mb)
    • restores homepage
  • Drop old rsessions table:
    • MariaDB [plots]> drop table rsessions_prob; Query OK, 0 rows affected (43.39 sec)
    • Freed ~29 GB.

Tested https://publiclab.org - session was retained!
🎉

@icarito
Member

icarito commented Jun 18, 2019

mitigation done! Hopefully this will free us!

@icarito
Member

icarito commented Jun 18, 2019

i'll leave it for tonight, site looks speedy to me... 😝 hopefully this is it!

@icarito
Member

icarito commented Jun 18, 2019

OK, so mitigations list would be:

  • reduce process pool

  • move db to google cloud db solution

  • reduce rsessions

  • switch to memcached

@jywarren
Member Author

Hmm, it was very fast this morning, but overall I don't see a huge difference! 😞

image

@skilfullycurled
Contributor

Nooooooooooooo! Well, there's only one other explanation and that's ghosts. I'll open up another issue and look into finding an exorcist or ghostbusters gem.

@icarito
Member

icarito commented Jun 19, 2019

I think there's actually been an improvement in I/O use, because using a 30GB table is heavy - if you look closely, the peaks seem related to StatsController... maybe we could do the stats work on staging? I could make it copy the production database regularly, say weekly?

@skilfullycurled
Contributor

skilfullycurled commented Jun 19, 2019

Hey @icarito, I was wondering if you could answer some "educational" questions for me:

if you look closely the peaks seem related to Statscontroller...

Why would this be? Due to the caching? I can only think of three people who would be using it and I'm one of them and I haven't been.

maybe we could do the stats work on staging?

I've been hearing...er...seeing you use the word "staging" a lot lately. What is that and how does it play into the site/workflow? If it's a part of the docs, let me know which one and I'll take a crack at understanding it first.

I can make it copy production database regularly say weekly?

I think that'd be good. It's not so much that the freshest data are important, but between the Q&A system being changed and the recent tags migration, I suppose weekly is a good idea since it will catch any structural changes as they come in. @cesswairimu, what do you think?

@cesswairimu
Collaborator

This was a really awesome thread to read. Yeah, it's a great idea having the stats on staging, and copying weekly is fine too 👍
I have had this thought of, in the future, turning the stats queries into a script that creates a SQL view which is dropped and recreated daily or weekly by a job - maybe this could live on staging too. I'd like to hear your thoughts on this and whether it could help with the memory leaks in any way.
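
A rough, hypothetical sketch of that idea as a scheduled rake task (the view name and query are made up for illustration and are not actual plots2 stats queries):

# lib/tasks/stats_view.rake -- hypothetical sketch only.
namespace :stats do
  desc "Drop and recreate a precomputed stats view (intended to run weekly via cron)"
  task recreate_view: :environment do
    ActiveRecord::Base.connection.execute(<<~SQL)
      CREATE OR REPLACE VIEW notes_per_week AS
      SELECT YEARWEEK(FROM_UNIXTIME(created)) AS week, COUNT(*) AS notes
      FROM node
      WHERE type = 'note'
      GROUP BY week
    SQL
  end
end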

@grvsachdeva
Member

Hey @icarito, can we increase the RAM of the server? Maybe that'll help in speeding up the website until we improve our query response rate?

Thanks!

@icarito
Member

icarito commented Jun 19, 2019

Thanks for your replies! I am thankful for the work that you are doing and for replying to this issue and reading thru our efforts! I don't want to sound accusing or anything! I'm just looking at the data and trying to improve our site's reliability.
For instance we got a peak this morning: https://www.skylight.io/app/applications/GZDPChmcfm1Q/1560920940/5m/endpoints
image
We also see peaks every night (6 AM UTC) during backups, lasting a couple of hours.

Regarding staging and production, currently we have three instances:

Instance   | URL                             | Build log                                                                | Workspace
unstable   | https://unstable.publiclab.org/ | https://jenkins.laboratoriopublico.org/view/Staging/job/Plots-Unstable/ | https://jenkins.laboratoriopublico.org/view/Staging/job/Plots-Unstable/ws/
stable     | https://stable.publiclab.org/   | https://jenkins.laboratoriopublico.org/view/Staging/job/Plots-Stable/   | https://jenkins.laboratoriopublico.org/view/Staging/job/Plots-Stable/ws/
production | https://publiclab.org/          | n/a                                                                      | n/a

You are right that, documentation-wise, we should do a better job describing this process. Currently I found some docs here: https://github.com/publiclab/plots2/blob/master/doc/TESTING.md#testing-branches but it's not at all clear that these instances rebuild when we push to those branches.

The database is currently updated manually every so often but it should be simple to automate it now that we have daily database dumps. I will set it up and ping you!

This doesn't mean we shouldn't implement more solutions; next, I think a threaded web server (Puma) could help!
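
For what it's worth, the threaded setup would be a small config/puma.rb along these lines - the worker and thread counts here are placeholders to be tuned against available RAM, not recommendations:

# config/puma.rb -- illustrative numbers only.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))         # forked worker processes
max_threads = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads max_threads, max_threads                         # threads per worker
preload_app!                                             # load the app once, share memory via copy-on-write

port ENV.fetch("PORT", 4001)
environment ENV.fetch("RAILS_ENV", "production")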

@icarito
Member

icarito commented Jun 19, 2019 via email

@jywarren
Member Author

jywarren commented Jun 19, 2019 via email

@skilfullycurled
Contributor

Oh, @icarito, no, no, I didn't sense any accusation, not at all. I read, "this is what's happening" and I was just saying "that's odd, why would it be doing that if no one was on it...?" Along the same lines, I didn't mean to imply the documentation was poor. Only that you didn't have to explain it if there was any.

And hey, it's not an entirely unfounded accusation : ) although I am having a bit of fun pretending that I've been framed and I've gone underground and have to prove my innocence but that's a whole other screenplay that I'm working on.

Thankfully these lurid and baseless accusations ; ) on both our parts have been cleared up and we can get back to the business at hand.

Related question: Why would the stats controller be active if no one was using it or is that the mystery?

Regarding the staging, thanks for the explanation. To make sure I've got it, is saying...

I'll try this in stable staging instance.

...interchangeable with saying, "I'll try this on stable.publiclab.org"?

@jywarren
Member Author

jywarren commented Jun 19, 2019 via email

@skilfullycurled
Contributor

@jywarren, yup! Got it now. Thank you!

@icarito
Member

icarito commented Jun 20, 2019

Thanks for the clarification, @skilfullycurled!
It is indeed a mystery why StatsController is so active.

A few moments ago we had another peak that knocked us down for a few minutes:
image

The trigger in this case was actually the Full Text Search.
But one can see that even in this brief timeslice (3 min), StatsController was called 21 times.

I think this may be significantly affecting our baseline performance. If this usage is not known, then perhaps crawlers are hitting these endpoints? Maybe a robots.txt or some access control would fix it?
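
The robots.txt route would just be a couple of Disallow lines for the stats paths; the access-control route could be as small as a before_action in the controller. A hypothetical sketch of the latter (the controller name and user-agent pattern here are assumptions, not an actual plots2 change):

# app/controllers/stats_controller.rb -- hypothetical sketch of "some access control".
class StatsController < ApplicationController
  before_action :block_crawlers

  private

  def block_crawlers
    # Skip the expensive stats queries entirely for obvious crawlers.
    head :forbidden if request.user_agent.to_s.match?(/bot|crawl|spider/i)
  end
end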

@jywarren thanks for the clarification, I'll look into doing it asap then.

@icarito
Member

icarito commented Jun 20, 2019

Actually, here are the StatsController details for the previous timeslice:
image

@jywarren
Member Author

jywarren commented Jun 20, 2019 via email

@jywarren
Member Author

jywarren commented Jun 20, 2019 via email

@skilfullycurled
Contributor

So you don't think it's the caching?

@jywarren
Member Author

jywarren commented Jun 20, 2019 via email

@jywarren
Member Author

StatsController is called 5.5 times per minute

via @icarito - so on tonight's update we can see if robots.txt changes help this.

@grvsachdeva
Member

Hey @jywarren, I saw that the robots.txt update commit was pushed to stable some days ago. Have you noticed any improvement?

@skilfullycurled
Contributor

Yes, would love an update! Not sure I grabbed the correct data, but here are some images from Skylight from before the commit, after the commit, and the last ~24 hours. The red line indicates when the commit was made. On the surface it looks like the answer is yes, but it may not be significant, or I might be interpreting the data incorrectly.

robots_txt

@jywarren
Member Author

jywarren commented Jun 25, 2019 via email

@jywarren
Member Author

Closing this now!
