Too many connections #37
I've never seen this before, but it's still possible if you exceed the total number of concurrent connections. How many ACUs were running and what was your concurrency?
We've always experienced Aurora Serverless like this; it's one of the reasons we stopped using it.

EDIT: I probably should be more in-depth rather than snarky. The lag between hitting the connection limit and scaling to the next ACU level is substantial, so your application needs to be able to wait until more connections are available. We had to add try/catch loops everywhere that would catch exactly this situation and just keep slamming the database until it decided to actually scale up (this can take up to 5-10 minutes for each level).

Another thing you need to make sure you're considering is the database automatically deciding to scale down on you even though you don't want it to. You will suddenly get disconnected on half your jobs, and they need to be able to recover from that.
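The catch-and-retry loop described above can be sketched roughly like this (a minimal illustration, not the commenter's actual code; the error-message match, timings, and backoff strategy are assumptions):

```python
import random
import time

def run_with_retry(do_query, max_wait_s=600, base_delay_s=1.0):
    """Retry a query until Aurora finishes scaling and a connection
    slot frees up. MySQL reports the limit as error 1040
    ("Too many connections"); here we just match on the message."""
    deadline = time.monotonic() + max_wait_s
    delay = base_delay_s
    while True:
        try:
            return do_query()
        except Exception as exc:  # narrow this to your driver's error class
            if "Too many connections" not in str(exc):
                raise
            if time.monotonic() > deadline:
                raise
            # Exponential backoff with jitter so retries don't stampede
            # the cluster while it is still adding capacity.
            time.sleep(delay + random.uniform(0, delay))
            delay = min(delay * 2, 30)
```

The cap on total wait matters: as noted above, a scale-up can take several minutes per level, so an unbounded loop can hold a Lambda open for its whole timeout.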
We've hit this too, at a much higher capacity limit - around 64 ACUs - and our graphed CloudWatch connections weren't anywhere near the 2k+ limit. I don't understand why this happens with the Data API, though; isn't it supposed to be "connectionless"?
@jrgilman, that's really interesting. Did you reach out to the AWS RDS team? I think this would be incredibly useful feedback for them.
@AndrewBarba, I know it's not "connectionless", as it does proxy connections to your Aurora Serverless cluster. But there seem to be two separate issues here:

1. The lag before the cluster scales to the next capacity level.
2. How the Data API manages its connections to the cluster.

For 1, I can't stress enough how important it is to check the magical "Force scaling the capacity" button. We run our largest application DB on Aurora Serverless, so we see the scaling events a lot, and after checking that (we got burned terribly before having it on) we see scaling events almost instantly.

For 2, you would think that since it's an HTTP API it would do something much closer to the new RDS proxies instead of actually translating every HTTP request into a connection; that would totally defeat the purpose if it did that. My guess is you're right and it's something in between - probably not translating every request to a connection, but also not managing the pool correctly.
We used to use the mysql js client directly.
@jrgilman even using the Data API?
@AndrewBarba I've enabled it now, fingers crossed
AFAIK this has been an ongoing "growing pain" for Aurora Serverless. Your current connection count to your cluster will be limited by your current ACU, meaning if you've already hit max_connections for that ACU, you're stuck waiting until additional capacity is allocated. To be clear, as far as I'm aware, the proxy fleet will simply pass the connection to the cluster, hit the "too many connections" error, and return it back to the client.

If you reached out to premium support you'd likely be met with a response telling you Serverless isn't great for bursty workloads (they may also point out that you could essentially pre-warm, or increase your ACU minimum). If you ARE aware that you're about to have a large increase in connections, you can obviously just force scale: https://docs.aws.amazon.com/cli/latest/reference/rds/modify-current-db-cluster-capacity.html

Another note here: if you didn't see an increase in traffic during your increase in connection count, it's worth checking your slow query log (assuming you have it enabled). See if several queries are piling up behind a blocking query, and see if you're hitting your configured wait_timeout - if you are, it may be your retries piling up. SHOW ENGINE INNODB STATUS can be useful here too; see if you have an overall increase in spin waits leading to OS waits during that period.

In any case, any large shift in connections can result in this behavior if it takes a while to add the additional ACU, but in your case you may have an underlying cause if there was no increase in traffic (generally a blocking query is what I'd expect here). This doesn't address the "too many connections" error you're hitting during your ACU changes, but investigating the cause of the connection pileup will help prevent the spike in connection count that leads to the error. I'm happy to help if you have specific questions - just reach out to me.
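The force-scale call linked above can also be made from code. A minimal boto3 sketch (the cluster identifier and timeout are placeholder values, and the client is injectable so the wrapper can be exercised with a stub instead of real AWS credentials):

```python
def force_scale(cluster_id, capacity, client=None):
    """Ask Aurora Serverless v1 for an immediate capacity change.
    ForceApplyCapacityChange means that if the scaling point isn't
    reached within the timeout, connections blocking the change are
    dropped rather than the change being rolled back."""
    if client is None:
        import boto3  # assumption: boto3 installed, credentials configured
        client = boto3.client("rds")
    return client.modify_current_db_cluster_capacity(
        DBClusterIdentifier=cluster_id,
        Capacity=capacity,
        SecondsBeforeTimeout=300,
        TimeoutAction="ForceApplyCapacityChange",
    )
```

Calling this just before a known burst (a batch job kickoff, a scheduled import) is the "pre-warm" idea mentioned above.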
Force scaling is still insanely slow - not good for bursty workloads, as @kelbyenevoldLA mentioned.

@jeremydaly we got the exact response that @kelbyenevoldLA mentioned. We did our best, but in the end writing our own autoscaler code was the easiest solution rather than trying to make any of Amazon's offerings work for our use case.

@guidev I would imagine you would run into the same problem if the DB scales down and suddenly your connection limit is too low, but I am unfamiliar with the Data API (we just used a standard DB connection).
PM for Data API here. Thanks for raising this @jeremydaly and others. We are looking into it. |
Thanks @nitesmeh! |
Unfortunately, enabling force scaling didn't solve the issue, we have the same error rate as before... |
@kelbyenevoldLA Thanks for your suggestions... We haven't been able to find any underlying problem with our queries; obviously, that doesn't mean there isn't one. We're thinking about switching to serverless-mysql to confirm whether the issue really is the Data API...
@nitesmeh Am I right to assume that the data-api should pool requests more efficiently than traditional connections? If not - is there any reason to use the data api if we do not need to access from outside a VPC? |
@guidev Have you fixed the problem? |
My apologies for the delay in responding. @AndrewBarba, that is correct: the Data API pools connections, so you don't need to worry about connection management.
UPDATE: Actually, never mind. I think my benchmarks were misattributing CPU load to the connections when it was really my queries. I've redone my benchmarks to focus on connections, and the Data API seems to be performing well for now.

What kind of load is the Data API supposed to be able to handle? I'm running some load tests, and I am very disappointed at what I'm seeing. Using Aurora Serverless with the Data API, I'm very easily generating many connections and high CPU usage. My load tests are doing on average between 5-50 HTTP requests per second, pulling around 2000 records for each request. Is this just too much load for the Data API? I was hoping the Data API would truly solve my serverless RDS connection issues, but it seems not so far.

I'm also testing direct connections to my Aurora Serverless database versus the Data API, and the Data API might be using more CPU and more connections. I'm not sure what's going on.
First off, thanks SO MUCH for this @jeremydaly. It's awesome! Secondly, @nitesmeh, can you confirm that the Data API should not consume connections? We have a very "uneven" load, with PostgreSQL Aurora Serverless set to ACU 4-64 and force scaling enabled. Scaling is triggered, but way too late, so hundreds of transactions fail (which are backed out to SQS).
@QAnders, I'm glad you like it. The Data API does consume connections, but it's supposed to act a bit like RDS Proxy in that you shouldn't have to worry about it, since it's using a connection pool on the backend. There seem to be a lot of people who have had this scaling issue, though.
@QAnders @jeremydaly as far as my experience indicates, each concurrent transaction via the Data API appears to need its own connection to the database. If you have a workload that uses a lot of concurrent transactions, it'll easily blow through the number of connections available. If you can eliminate, consolidate, or reduce the number of transactions, that appears to help immensely.
@jonathannen Thanks for the update on the transactions. I'm currently using TypeORM with the typeorm-aurora-data-api-driver, and I believe it's creating a new transaction for every query it runs. I haven't tried yet, but I was thinking about creating a single transaction for every Lambda invocation. Do you know if there are any limits on the number of queries you can have in a transaction? Or do you see any other pitfalls with that approach?
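The single-transaction-per-invocation idea could look roughly like this with boto3's rds-data client (a sketch, not the TypeORM driver's behavior; the ARNs and database name are placeholders, and the client is injectable so the flow can be tested with a stub). Note that Data API transactions that sit idle are rolled back by the service after a few minutes, so the transaction should span one invocation at most:

```python
def run_in_transaction(statements, resource_arn, secret_arn, database, client=None):
    """Execute several SQL statements in one Data API transaction
    instead of letting each statement open its own implicit one."""
    if client is None:
        import boto3  # assumption: boto3 installed, credentials configured
        client = boto3.client("rds-data")
    tx = client.begin_transaction(
        resourceArn=resource_arn, secretArn=secret_arn, database=database
    )["transactionId"]
    try:
        results = [
            client.execute_statement(
                resourceArn=resource_arn,
                secretArn=secret_arn,
                sql=sql,
                transactionId=tx,
            )
            for sql in statements
        ]
    except Exception:
        # Roll back explicitly so the connection slot is freed promptly.
        client.rollback_transaction(
            resourceArn=resource_arn, secretArn=secret_arn, transactionId=tx
        )
        raise
    client.commit_transaction(
        resourceArn=resource_arn, secretArn=secret_arn, transactionId=tx
    )
    return results
```

Per the observation above that each concurrent transaction seems to hold its own connection, collapsing N per-query transactions into one per invocation should roughly divide the connection pressure by N.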
Here to resurrect this issue - we're regularly seeing this. After investigation I see that even under a small load, the number of open connections to the Aurora Serverless cluster explodes, and then the connections are kept alive.

What makes our setup different is we're also using AppSync RDS resolvers for our GraphQL API, separately from our Lambdas, which handle webhooks and use the Data API. @nitesmeh and anyone else - could it be that the Data API's connection pooling mechanism is using up all our connections and not leaving enough for the AppSync RDS resolvers, and this is what's causing the issues? Any insight?
@ortonomy We have that exact same issue after increasing our use of AppSync! We have one Express-based server (Elastic Beanstalk) which is still using a native DB connection to PG (the Aurora Serverless cluster), but it is limited to a max of 10 connections.

I've had a few chats with AWS support (they are helpful and try), but no real solution as of yet. They (kind of) agree with my assumption that a lot of updates/inserts prevents the cluster from scaling, and only when the updates/inserts are all committed does it scale (but then it's too late).

This is very annoying, and I hope that v2 of Serverless fixes this (although our AWS contact and the support are very tight-lipped about it, even about when it's going to be released for Postgres...)
@QAnders and anyone else that cares -- I got a reply from AWS premium support and they have been able to repro the issue: will update when I get a response! |
Hi @ortonomy, |
Thanks for this issue writeup, @guidev, @jeremydaly and others. We've recently moved to Aurora Serverless (Postgres) and have seen the same issues. We had an ingestion task doing parallel inserts via Lambdas and the Data API, and it was hitting connection limits. We've put an SQS FIFO queue in place now to throttle it down, which has helped; however, the performance we're seeing is still pretty poor versus native clients. It certainly seems not to be using connections optimally, and the CPU utilisation is higher than I would expect.

We've also recently started using it behind AppSync, so we'll keep an eye out for the issue that @ortonomy raised as well.

@nitesmeh, do you have any updates on this Data API connections issue? If there's an imminent Data API being released for Aurora Serverless v2 that'd be great too - it's the main reason we've not moved to v2 yet (bonus marks if it solves the 1MB data limit!). Thanks for your efforts - the Data API has huge potential when some of these issues are resolved.
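The SQS FIFO throttling mentioned above works because messages sharing a MessageGroupId are delivered in order, one after another, which serializes the downstream inserts instead of letting every Lambda hit the database at once. A rough sketch of the producer side (queue URL and group name are placeholders, and the SQS client is passed in so a stub can stand in for boto3):

```python
import hashlib
import json

def enqueue_row(sqs, queue_url, row, group="ingest"):
    """Send one row to a FIFO queue. Using a single MessageGroupId
    forces strictly ordered, one-at-a-time delivery within the group,
    which caps how fast consumers open database connections."""
    body = json.dumps(row, sort_keys=True)
    return sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        MessageGroupId=group,
        # Hash-based dedup id, so content-based deduplication does not
        # have to be enabled on the queue itself.
        MessageDeduplicationId=hashlib.sha256(body.encode()).hexdigest(),
    )
```

Using a handful of group ids instead of one trades some ordering for parallelism, letting you tune the insert concurrency against the cluster's current connection headroom.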
I've had similar issues with Serverless RDS and the Data API. Our solution was to properly close transactions, as we were using a lot of them.

@aschafs I don't think they will have a Data API for Aurora Serverless v2, as they have the newer RDS Proxy, which is already supported in v2. So my opinion is the Data API will be deprecated with v1, and RDS Proxy is the solution for the 1MB data limit. I am preparing to move to Aurora Serverless v2 with RDS Proxy in the near future.
Well, this is a farce from AWS... We found the same thing.

We were eagerly awaiting Serverless v2 (and were part of the beta); I tested it thoroughly and it was scaling nicely even with a lot of open transactions. At launch (GA) we got the information that the Data API won't be added in v2...

We've moved a lot of our backend workloads over to AppSync for the sole purpose of being able to query DBs directly using VTL, and because AppSync uses the Data API to "smooth out" the peaks. It's not working! Several times a week we hit "too many connections", and as @deel77 found, it's a no-go, as 500 is the max and it's not possible to raise it. None of these limitations were mentioned in "architectural meetings" with AWS experts prior to moving to the solution...

So, for us, the only viable solution is moving to Serverless v2 (or standard RDS), but as we have a very uneven load, Serverless would make more sense. However, v2 is quite costly, and of course has no Data API... AWS really dropped the ball on this one!
What did your errors look like for Aurora Serverless v2 PostgreSQL, if you don't mind my asking? I experienced similar connection errors myself, despite there only being a handful of active processes in pg_stat_activity.
@obataku From the discussions I've had with AWS, and what I can "reveal", it would seem that Aurora Serverless v2 is a complete rebuild of Aurora (non-Serverless) where they've added the "serverless" part from v1 - so, kind of what Aurora Serverless should have been from the beginning... The Data API might of course share code with RDS Proxy, but the Data API is completely ingrained into v1 as I have understood it, and it won't be possible to "port" it over to v2.

This is a disaster for those of us who have moved to AppSync and have a micro-service architecture with a lot of databases, where the combination of AppSync and VTL with a direct DB connection was a superb solution!

So, what do we do now...? We have a crappy DB solution that won't scale, and we are capped at 500 connections, which is hampering us and forces us to build work-arounds and sweat each day that we hit the max... Or we have a shiny new (and expensive) DB that is doing its job, but we can't continue using AppSync... or, rather, we'd have to rebuild a lot, throw all our VTLs out the window, and add Lambdas (and in that case, why use AppSync at all, as we are not a "GraphQL" shop).
Thanks @deel77 and @QAnders for the feedback. Deprecating the Data API would certainly be disappointing for us too. I'd love to get an official AWS view on it as it does strike me as a step backwards. Like yourselves, we were very excited about coupling AppSync with the Data API though now we may hold off moving in that direction. We might look at RDS Proxy though the appeal of Data API was its REST-based API. If @nitesmeh or anyone from AWS has any updates that would be great. We'd love to use this pattern and Aurora Serverless generally but it doesn't seem fit-for-purpose for our use cases at the moment. |
Hello,
I get this error on about 1% of executions...
Looks like there are no more available connections, but isn't Aurora Serverless supposed to scale automatically?