-
Notifications
You must be signed in to change notification settings - Fork 62
Black Listing To Aggressive #51
Comments
Please mock up a change and submit a pull request this is probably a valuable addition to have. |
I am currently working on implementing the Circuit Breaker pattern along with unit tests. Once it is implemented and tested, we will not need to blacklist a server. I believe the Circuit Breaker pattern is relevant to this issue. Please feel free to comment if you have considered the pattern and have any insight on why it should not be used. Current plan is to be able to configure the number of attempts that trip the breaker. Also, confugrable min (millisec) and max (millsec) params that allows the retries (to see if Breaker in Open state can be Closed) progrssively between min and max. Once max is reached, the Circuit Breaker for the node will publish an event that can be subsribed to indicating that the node require special attention/intervention. |
I think this will be a great addition. I never really liked the blacklist, so I am happy to hear somebody is trying to get rid of it. |
Update: Some of the design principles:
|
Sorry, accidently hit submit...
.. and there are probably more. We've got something staged up, here: https://github.com/W3i/fluentcassandra/tree/circuit_breaker Let us know how you would like to proceed. Thank you. |
Is this ready for prime time? |
I've implemented something similar that I am currently using in production but also takes it a step further and classify's various exception types as retry/clienthealthy -- so while you might get a connection that's good from CassandraSession.GetClient(), it might go down between the time it was opened successfully and when a command is executed against it. If we have more than one server in the list try to move to another one up to a max retry count. |
eplowe: do you have the code you can submit for it? I;d like to take a look and possibly use it in our scenario as well. |
Sure -- When I get home I'll commit into my fork. My retry implementation On Tue, Nov 13, 2012 at 4:27 PM, ilushka85 notifications@github.com wrote:
|
If any of you guys want to send a pull request, I would be happy to take it. I know this is an issue that we need to get resolved soon. Eventually I would like to make the a ServiceManager that is ring aware so that the RoundRobin can be retired. |
I was also thinking of the circuit breaker pattern. I'll try out @dlbromen's patch |
eplowe did you get a chance to upload your changes? |
@eplowe ^ |
No sorry been kinda crazy with the day job. Let me get it by today.
|
https://github.com/eplowe/fluentcassandra/tree/blacklisting-enhancements Let me know what you think. So far from my testing, it appears to be done what I expect. With that said, there may be some edge cases not accounted for. |
Should we really be using RoundRobin anymore? It was created at a time when there were no mechanisms to describe the nodes in a cluster. But now, but we now have that ability with the describe_ring function. So I guess my question is should we keep putting lipstick on this pig, or move to something that is better suited for Cassandra moving forward? |
Agreed 100% --- but for now either dlbromens patch, or my patch can act as a stop-gap until this new mechanism is in place. |
With that said, I'd be more than happy to work on an auto discovery mechanism. |
What does autodiscover have todo with round robin? Wouldn't you use autodiscover to get the list of servers to round robin? Also auto discover means you need to be sure you don't connect across DCs. Some clients like Astynax try to connect to the one of the replicas so you cut down on coordinators hops. But even then you would still need some kind of round robin across replicas. We've been running with @dlbromen's patch and so far so good. |
You'd need something similar for sure. The idea is that as long as your Hector does this and uses policies to determine which machine to connect to
|
If you don't define a connection lifetime I am seeing the connections left On Thu, Nov 15, 2012 at 11:48 AM, ilushka85 notifications@github.comwrote:
|
@eplowe well the connections should be left open... the question is if they get returned properly. This is an issue we have had that I have commented on and had a slight but improper fix for this issue. |
@ilushka85 Understood. And by returned properly, you mean from say an On Thu, Nov 15, 2012 at 12:11 PM, ilushka85 notifications@github.comwrote:
|
heres a simple test if you debug this and say after the tenth time thru the loop look at the connection pool you will see ten connections are in use and not returned to freed. var builder = new ConnectionBuilder(ConfigurationManager.AppSettings["ResultlyDataConnectionString"]);
|
You're right -- I did a similar test against a my dev cassandra server. I On Thu, Nov 15, 2012 at 12:32 PM, ilushka85 notifications@github.comwrote:
|
@eplowe i posted a hack that fixes this in my other message: My crude attempt at fixing this was inserting the following snippets in each of the operation function classes var temp = output.GetEnumerator(); Any better ideas on how to do the above? it has to do with how .net doesnt really execute code till later. IN addition i noticed connections having an exception do not get returned to pool or closed in original code. |
I haven't had a chance to catch up with the recent activity here, but can you please take the Connection releasing discussion to issue #29. Thank you. |
also current round robin method and connection pool means that second and subsequent queries you execute are not hit on next server in round robin but on the same one open connection ... so if your code only requires the use of one connection and you are using a pool all executions will happen on that one server regardless of how many you have defined. |
I have a general idea in my head that I am churning over while at lunch
|
@eplowe Ok. Waiting for your input there. |
@eplowe any thoughts on my comment above about your round robin replacement changes not taking into account using other servers when using a pool? |
I have a change that I am still testing/not happy with that does auto
|
@eplowe using your blacklisting enhancements with enough timeout exceptions using the pool the connections still do not get reset and all subsequent connections fail as pool is exhausted. |
Please provide an example.
|
Don't have an example per say but that's what we see happening in our app Sent from my iPhone On Nov 17, 2012, at 4:10 PM, Eric Plowe notifications@github.com wrote: Please provide an example.
— |
I've simulated throwing random timeout executions and from what I am seeing "No connection could be made because all servers have failed." If so, that's because I am marking the Client as not healthy for a On Sat, Nov 17, 2012 at 5:11 PM, ilushka85 notifications@github.com wrote:
|
So for a brief moment all servers will be blacklisted. When the timer runs On Sat, Nov 17, 2012 at 5:28 PM, Eric Plowe eric.plowe@gmail.com wrote:
|
@eplowe I have not been able to reproduce since we corrected what was causing the timeout issue in the first place but will continue to test it. We have found a bug in cassandra with secondary indexes just so you are aware that we reported to datastax who is working on a solution. Basically timeouts occur if you rely on secondary indexes and perform deletes on the column with the secondary index. If you delete enough of them a timeout will occur during querys against the secondary index while it skips over tombstones as they still exist in the secondary index. this is what lead to our timeouts. In any case.... Were you able to make any more progress on your changes with auto discovery etc? |
Any chance we can get a patch? Was hoping to wrap up the next release before Thanksgiving. |
I'll push the fix to mark TimedOutException::IsClientHealthy as true vs
|
@eplowe thanks for the TimedOutException RE: auto discovery, this functionality should be moved to another connection manager, since for the time being we still need to support the current methodology of server lists that are provided. Right now I am favoring if in the connection string the server is just a |
Agreed. You still do need to provide at least one server because you need
|
You are right, I like your method better. I think auto discover should be turned on by default, what is your thoughts? |
That's the route I am going. Also, to give you a heads up I am removing the
|
The change for marking TimedOutException as IsClientHealthy = true instead On Tue, Nov 20, 2012 at 3:03 PM, Eric Plowe eric.plowe@gmail.com wrote:
|
Did you send a pull request? |
There you go. Also let me know what you decide to do with the connection On Tue, Nov 20, 2012 at 4:03 PM, Nick Berardi notifications@github.comwrote:
|
Enumerator/lazy should be switched to lists since it is an RPC call. Nick Berardi On Nov 20, 2012, at 4:07 PM, Eric Plowe notifications@github.com wrote: There you go. Also let me know what you decide to do with the connection On Tue, Nov 20, 2012 at 4:03 PM, Nick Berardi notifications@github.comwrote:
— |
Just looked at the refactoring -- You should stop the timer when you try to recover, and the recovery time IMO is too long. You could have a situation where you're having some network hiccups and given the amount of traffic to your application you blacklist all your servers and now you have 30 seconds until we attempt to check if they are back up or not. 30 seconds when you're getting getting hammered with traffic feels like years. |
You do realize you weren't actually starting or stopping anything? You were A better one to use for servers according to MSDN is Timers.Timer. But I http://msdn.microsoft.com/en-us/library/system.timers.timer.aspx Nick Berardi On Nov 20, 2012, at 6:41 PM, Eric Plowe notifications@github.com wrote: Just looked at the refactoring -- You should stop the timer when you try to — |
http://msdn.microsoft.com/en-us/library/yz1c7148%28v=vs.80%29.aspx Timer.Change Method (Int32, Int32) dueTime
period
Also, there is a slight bug -- the reason I was passing the two states on the exception was to know what to do based on the exception. But you are doing things like blacklisting a connection and retrying for TimedOutExceptions -- which is wrong, a timeout doesn't necessarily indicate an unhealthy client. Just that the particular command timed out. You're also not taking into account IOException and NotFoundException -- which are important. That is a pretty easy fix. |
Also, I just noticed this as well, and when we do remove a server from the blacklist we need to add it back to _serverQueue. |
And sorry if I diverged from your style I'll make sure future changes
|
I know what they say, but how do you think it actually work under the covers? It is essentially this. while(true) {
if (TickCount == CallbackTickCount)
Callback();
} I do realize this is an over simplification, but essentially this is how all timers work. Also on MSDN they do make this note on http://msdn.microsoft.com/en-us/library/system.threading.timer.aspx
Also here is the difference between System.Timers.Timer and System.Threading.Timer http://stackoverflow.com/questions/1416803/system-timers-timer-vs-system-threading-timer RE: IOException and NotFoundException I didn't actually see you implement them anywhere, so I assumed it was a mistake. Also it has been good to work with you, I just like to maintain some sort of order so the codebase doesn't get overrun with personal one-off changes that only benefit one entity. |
Oh I know. I actually believe they sleep to avoid spinning -- my memory
|
In the RoundRobinServerManager, if a connection cannot be made to a server, it is added to the blacklist. Once it's in the blacklist, it is there for the life time of RoundRobinServerManager, which is a problem if a connection pool is used, and the CassandraContext is held open for a long time.
The text was updated successfully, but these errors were encountered: