
Redis Connections on constant increase #754

Closed
cchatham opened this issue Sep 21, 2015 · 18 comments

@cchatham
Contributor

Hey guys!

We are seeing a constant increase in connections to Redis that eventually causes our app servers to fail (in AWS). I wanted to see if anyone else had seen this issue.

We are looking into a fix but any help would be appreciated. Thanks!

@brianhyder
Member

Thanks for reporting. It is possible to create your own cache connections; however, the way it is set up, the connection should be shared: https://github.com/pencilblue/pencilblue/blob/0.5.0/include/dao/cache.js#L47

I would look to see where the "getInstance" and "createInstance" functions are being called. The code that uses them should be isolated to the instances of cache_entity_service, session storage, and the server registry. That all depends on your configuration, but I think for you guys that is probably close to accurate.
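
For reference, the pattern there is a get-or-create singleton around a single client, roughly this shape (a simplified sketch, not the actual cache.js source):

```js
// Simplified sketch of the shared-client pattern (not the actual cache.js source).
// Repeated calls to getInstance() should reuse the same connection; only code that
// calls createInstance() directly ends up opening additional connections.
var redis = require('redis');

var CLIENT = null;

function createInstance(config) {
    // Always opens a brand new connection
    return redis.createClient(config.port, config.host, config.options);
}

function getInstance(config) {
    // Lazily create the client once, then hand back the cached instance
    if (CLIENT === null) {
        CLIENT = createInstance(config);
    }
    return CLIENT;
}

module.exports = {
    getInstance: getInstance,
    createInstance: createInstance
};
```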

It also appears that the Redis driver has been updated to a stable "1.0.0" version. As a plan of attack, I would suggest:

- Check for obvious places where connections are being created instead of reused.
- Look for a configuration option on the driver to ensure that connections auto-reconnect (see the sketch below).
- Look for stack traces in the logs to see if connections hang around after workers die off.
- Update the driver and see if the behavior is the same.
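
For the auto-reconnect piece, these are the knobs and lifecycle events I would start with (a sketch; the option names are from the node_redis 1.x/2.x docs, so double-check them against the version you land on, and the host is a placeholder):

```js
// Sketch: make reconnect behavior explicit and log connection lifecycle events
// so stray connects/disconnects show up in the application logs.
var redis = require('redis');

var client = redis.createClient(6379, 'my-redis-host', {  // placeholder host
    connect_timeout: 15000,  // give up on a single connect attempt after 15s
    retry_max_delay: 5000    // cap the backoff between reconnect attempts
});

client.on('connect',      function ()     { console.log('redis: connected'); });
client.on('reconnecting', function (info) { console.log('redis: reconnecting', info); });
client.on('end',          function ()     { console.log('redis: connection closed'); });
client.on('error',        function (err)  { console.error('redis: error', err); });
```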

I can also help take a look at this tonight.

@brianhyder brianhyder self-assigned this Sep 21, 2015
@cchatham
Contributor Author

Hey Brian!

As far as we know, we only leverage Redis for sessions, and we call it by using pb.cache in our custom code.

We managed to get a list of all the clients connected to Redis at a moment in time. I removed the IPs for security reasons, but if you look at the idle seconds, it's very curious. I am also including our average connections over the past day to show the increase we are seeing.

Redis: [screenshot: Redis CLIENT LIST output]

Connections: [graph: average connection count over the past day]
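
For anyone else chasing this, the same client list can be pulled from Node and filtered by idle time, something like this sketch (the host is a placeholder):

```js
// Sketch: dump CLIENT LIST and flag connections that have been idle for a long time.
// Helps correlate leaked connections with app instances that no longer exist.
var redis = require('redis');

var client = redis.createClient(6379, 'my-redis-host');  // placeholder host

client.send_command('client', ['list'], function (err, reply) {
    if (err) { throw err; }

    reply.split('\n').forEach(function (line) {
        if (!line) { return; }

        // Each line looks like: "addr=10.0.0.1:53422 ... idle=3600 ... cmd=subscribe"
        var fields = {};
        line.trim().split(' ').forEach(function (pair) {
            var parts = pair.split('=');
            fields[parts[0]] = parts[1];
        });

        if (parseInt(fields.idle, 10) > 600) {
            console.log('idle > 10 min:', fields.addr, 'cmd=' + fields.cmd);
        }
    });

    client.quit();
});
```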

@brianhyder
Member

Thanks for the additional info. Can you confirm what command broker implementation you are using? I'd also like to get the time frame for the log snippet you sent. I wouldn't expect that many "subscribes" unless you have that many workers. The subscribe command is used to listen for commands and jobs from other members of the cluster.
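
For context on the subscribes: a Redis client that has issued SUBSCRIBE cannot run other commands, so each worker's broker needs its own dedicated connection alongside the cache/session client. A minimal sketch (the channel name and host are placeholders, not the broker's actual values):

```js
// Sketch: why each worker holds a long-lived "subscribe" connection.
// A subscribed client is locked into pub/sub mode, so the command broker
// uses a dedicated connection separate from the one used for cache/sessions.
var redis = require('redis');

var subscriber = redis.createClient(6379, 'my-redis-host');  // placeholder host
var publisher  = redis.createClient(6379, 'my-redis-host');

subscriber.on('message', function (channel, message) {
    console.log('command received on', channel, ':', message);
});

subscriber.on('subscribe', function (channel, count) {
    // Once subscribed, other members of the cluster can reach this worker
    publisher.publish(channel, JSON.stringify({ type: 'ping' }));
});

subscriber.subscribe('cluster-commands');  // placeholder channel name
```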

@cchatham
Contributor Author

We are definitely using the default RedisCommandBroker. We have 2 workers running according to our global home view. The time frame for the client list was around 11:30am this morning EST.

@brianhyder
Member

It is weird that it would be spawning so many new connections. Coincidentally enough, the node redis package was updated today; v2 was released. I'll update it tonight and play with it to see if I can get the connection count to rise. There are also a few options we can tweak to optimize the connection.

@cchatham
Contributor Author

So it looks like the connections survive when AWS Beanstalk spins down servers. I ran the client list command again and it returned a lot of IPs that no longer belong to any running server. So either quit is not getting called when it is supposed to, or Beanstalk isn't letting us clean up...

@brianhyder
Member

Interesting. There should be log statements (not sure what log level has to be active) that say XYZ shutting down. It could either be that the instance isn't signaling properly or that PB isn't catching signals appropriately. I'll try and double check that tonight. Now that we know it isn't the driver I'll hold off on the upgrade so we don't introduce another variable.
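
If it turns out PB isn't catching the signals, the shape of the fix would be roughly this sketch (host is a placeholder):

```js
// Sketch: close Redis connections when the instance is told to stop.
// Beanstalk sends SIGTERM when it spins an instance down; if the process never
// catches it, quit() is never called and the server-side connection lingers.
var redis = require('redis');

var client = redis.createClient(6379, 'my-redis-host');  // placeholder host

function shutdown(signal) {
    console.log('received ' + signal + ', closing redis connection');
    // quit() sends QUIT and closes the socket cleanly once pending replies finish
    client.quit(function () {
        process.exit(0);
    });
}

process.on('SIGTERM', function () { shutdown('SIGTERM'); });
process.on('SIGINT',  function () { shutdown('SIGINT');  });
```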

@brianhyder
Member

Thanks again for additional information. It is extremely helpful.

@cchatham
Contributor Author

Still not 100% sure it isn't the driver...

Let me know if you want more information. We are racking our brains over here. The code looks solid. We may need a timeout on the redis server side? Lots of possibilities!!!

@brianhyder
Member

OK, I took a look at the code tonight. There was an issue with the platform responding appropriately to process signals (#755). That has been resolved and merged into 0.5.0. Hopefully, at a minimum, it will eliminate one variable.

Are y'all using ElastiCache or true redis instances? I've seen instances where TCP connections are kept alive in an ELB but are actually dead. It just seems weird that the redis server is the one holding onto the connection. My thought is that the connection would be dropped unless a heartbeat was received.
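
On the server-side angle: Redis only drops idle clients on its own if the `timeout` config is non-zero, and `tcp-keepalive` governs dead-peer detection. On ElastiCache those are set through the parameter group (the CONFIG command is restricted there), but against a plain Redis you can check them with something like this sketch (host is a placeholder):

```js
// Sketch: check whether the server will ever drop idle clients on its own.
// "timeout" is the number of seconds an idle client is kept before being closed
// (0 = never); "tcp-keepalive" controls dead-peer detection.
var redis = require('redis');

var client = redis.createClient(6379, 'my-redis-host');  // placeholder host

client.send_command('config', ['get', 'timeout'], function (err, reply) {
    if (err) { throw err; }
    console.log(reply);  // e.g. [ 'timeout', '0' ] -> idle clients are never dropped

    client.send_command('config', ['get', 'tcp-keepalive'], function (err2, reply2) {
        if (err2) { throw err2; }
        console.log(reply2);
        client.quit();
    });
});
```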

@cchatham
Contributor Author

Yeah, we are using Redis via ElastiCache. We will update our fork and see if that helps. The next step we want to try is updating the driver, because we are running out of ideas and I don't want to put a timeout on our connections.

@brianhyder
Member

Sounds like a plan. The release notes for the latest version of the driver are pretty good. The maintainer outlines the breaking changes and other modifications to defaults. One of which is connection timeout iirc.

@btidwell
Contributor

FYI, I just merged the latest from 0.5.0 and noticed that all of the localizations specific to site management and global plugins are missing.

@brianhyder
Member

Yup, you are correct. I botched the merge. I'll get that corrected this evening. My apologies.

@brianhyder
Member

@btidwell Just to confirm, y'all only added en-US translations for multi-site correct? I have added the missing translations for English back into the 0.5.0 branch.

@brianhyder
Member

@cchatham I updated the driver on a local branch to v2.0.0. It appeared to work as desired. I will test a bit more tomorrow but it would most likely be safe to update the driver and test to see if that fixes the connection issue.

@cchatham
Contributor Author

cchatham commented Oct 5, 2015

@brianhyder We updated our driver and pulled in your bug fix. It seems to have helped the severity of the incline, but connections have still been steadily increasing over the past 2 weeks. We will continue to keep an eye on it.

[graph: AWS Redis connection count, Oct 5]

@cchatham
Contributor Author

Since we upgraded we haven't had any further issues. We could probably close out this issue.
