cpp-client lockup #150
Comments
Thank you for the detailed report! I'll have a look tomorrow.
I can provide code for a test client if that would be helpful.
I have tracked this down to ConnectionPool::invalidateObject. Here is the existing code:

Here is what I am testing with now, which seems to fix the problem (but someone who knows this code needs to validate this fix):
Hi, please give it a try with this commit applied and let me know if you are still experiencing the issue:
Hi,

Also, the following earlier commit (which I did not have) fixes the lockup problem I was seeing:

I still occasionally get one of the following exceptions from a put() when one of my services is asynchronously stopped:

What is the proper way to handle an exception here? Thanks
Retrying should be the proper way of handling cache operation exceptions. The transport is invalidated and recreated if necessary. Can you give me more details on the scenario which triggers the invalid messageid/magic number issue? Is the operation done in a multithreaded env with the cache shared between threads? |
Hi,

```cpp
for (;;) {
    try {
        std::cerr << "At start\n";
        manager = new RemoteCacheManager(builder.build(), true);
        StringCache cache = manager->getCache(args[3], true);
        key = "12345";
        value = "a value";
        for (;;) {
            usleep(100000);
            try {
                pprev = cache.put(key, value);
                pprev = cache.put(key, value);
                pprev = cache.put(key, value);
            } catch (const std::exception &e) {
                std::cerr << "Failed with error: " << e.what() << "\n";
                std::cout << "Errors: " << errors++ << "\n";
                if ((strstr(e.what(), "org.infinispan.remoting.transport.jgroups.SuspectException")) != 0) {
                    std::cout << "Get new cache object\n";
                    manager->stop();
                    delete manager;
                    break;
                }
                if ((strstr(e.what(), "Exception encountered, retry")) != 0) {
                    std::cout << "Get new cache object\n";
                    manager->stop();
                    delete manager;
                    break;
                }
                std::cout << "Keep cache object\n";
                continue;
            }
            std::cout << "Loop: " << loop++ << ", Errors so far: " << errors << "\n";
        }
    } catch (const std::exception &f) {
        std::cerr << "Outer Failed with error: " << f.what() << "\n";
    }
    std::cout << "Outer Loop: " << outerloop++ << "\n";
    sleep(1);
}
```

The way to induce errors is to run this code on each client, then
OK. I'll give it a try. Could not reproduce the problem so far. |
One note though: if you recreate the manager you lose a bit of the fault-tolerance capability the client offers. If you have 4 nodes and the one to which the client is currently connected fails, the client will, based on the topology information, reconnect to one of the other 3. If you instead recreate the manager, it will drop the existing topology info, try to connect to the node hardcoded on the command line, and remain disconnected until that node is restarted.
Hi, I'm beginning to believe that what I'm seeing is a cluster problem rather than a cpp library problem. To review, I am tasked with testing to see how a client recovers from server failures...
The clients are now reporting the following:

ERROR [RetryOnFailureOperation.h:68] Exception encountered, retry 23 of 24: Request for message id[43336] returned org.infinispan.remoting.transport.jgroups.SuspectException: One or more nodes have left the cluster while replicating command SingleRpcCommand{cacheName='cacheTest', command=PutKeyValueCommand{key=[B0x3132333435, value=[B@15686e07, flags=null, putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=NumericVersion{version=6192458077705685}}, successful=true}}

Failed with error: Request for message id[43336] returned org.infinispan.remoting.transport.jgroups.SuspectException: One or more nodes have left the cluster while replicating command SingleRpcCommand{cacheName='cacheTest', command=PutKeyValueCommand{key=[B0x3132333435, value=[B@15686e07, flags=null, putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=NumericVersion{version=6192458077705685}}, successful=true}}

Now, if I even try restarting any of my client application programs, they all fail with the above message.

Bottom line: I think the main issues I was seeing with the hang and descriptor leaks are now fixed in the current code for the cpp library.
I was not able to reproduce the issue locally (all 4 nodes were running on the same machine, though).
Hi,
I am experiencing a lockup problem with the Infinispan cpp-client library.
It seems like there should be a blog/forum where I can report this; please forgive me if submitting this as an "issue" is inappropriate. If so, where should I be reporting this instead?
I have been tasked with testing Infinispan to see if it meets our performance and high-availability needs.
With this in mind, I have created a test environment: an Infinispan cluster on four virtual machines (RHEL 5.10, kernel 2.6.18-371.9.1.el5). The cluster is configured to communicate using TCP:
```xml
cluster="flexnet" stack="tcp" />
default-stack="${jboss.default.jgroups.stack:tcp}">
```
The machines are statically configured to know about each other:
infinispan-test1a[7800],infinispan-test2a[7800],infinispan-test3a[7800],infinispan-test4a[7800],
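For context, a static list like this typically appears as the `initial_hosts` property of the JGroups TCPPING discovery protocol. The fragment below is a hedged sketch only: the element and attribute names are an assumption based on a standard JGroups subsystem configuration, and only the host list comes from this report.

```xml
<!-- Hypothetical sketch; only the initial_hosts value is from the report. -->
<protocol type="TCPPING">
    <property name="initial_hosts">
        infinispan-test1a[7800],infinispan-test2a[7800],infinispan-test3a[7800],infinispan-test4a[7800]
    </property>
</protocol>
```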
On each server VM, I have a simple client which uses the cpp-client library. A simplified version of the client code follows:
```cpp
std::string key;
std::string value;
std::string *pprev = NULL;
key = "12345";
value = "a value";
for (;;) {
    try {
        pprev = cache.put(key, value);
    } catch (const std::exception &e) {
        std::cerr << "Failed with error: " << e.what() << "\n";
        continue;
    }
    if (pprev != NULL) std::cout << "Value: " << *pprev << "\n";
    delete pprev;
    usleep(100000);
}
```
All four clients use the same code (same key, same value).
Each client is started with the IP address and TCP port of its local server and the cache name:
```cpp
ConfigurationBuilder builder;
builder.addServer().host(args[1]).port(atoi(args[2]));
StringCache cache = manager.getCache<std::string, std::string>(args[3], true);
```
For example:
```
./iftest1 192.168.136.131 11222 cacheTest
```
Where cacheTest is defined as:
I start each of the clients and then monitor network traffic on each of the servers with tcpdump…
This test configuration will run forever with no reported errors if I leave it alone…
High-availability testing:
We are under the belief that if any one of the Infinispan services in the cluster is stopped, the cluster should detect this and automatically recover. With this in mind, the first test is to gracefully stop the Infinispan service on one of the hosts. In my configuration, one way of doing that is just:
```
sudo /etc/init.d/infinispan stop
```
Again, my assumption is that the cluster should recover; if I subsequently start the service back up, it should rejoin the cluster and resume processing requests.
By watching the tcpdump output I can determine which servers have a copy of the data. If I stop the Infinispan service on one host using the above command, ALL of my clients on ALL of my hosts hang.
Attaching gdb to the hung process and doing a stack trace results in the following:
```
#0  0x0000003b6ca0b019 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000434553 in infinispan::hotrod::sys::Condition::wait(infinispan::hotrod::sys::Mutex&) ()
#2  0x0000000000434e5d in infinispan::hotrod::sys::BlockingQueue<infinispan::hotrod::transport::TcpTransport*>::pop() ()
#3  0x0000000000433c55 in infinispan::hotrod::transport::ConnectionPool::borrowObject(infinispan::hotrod::transport::InetSocketAddress const&) ()
#4  0x0000000000424b7d in infinispan::hotrod::transport::TcpTransportFactory::borrowTransportFromPool(infinispan::hotrod::transport::InetSocketAddress const&) ()
#5  0x00000000004245b7 in infinispan::hotrod::transport::TcpTransportFactory::getTransport(infinispan::hotrod::hrbytes const&) ()
#6  0x0000000000417317 in infinispan::hotrod::operations::AbstractKeyOperation<infinispan::hotrod::hrbytes>::getTransport(int) ()
#7  0x000000000041700f in infinispan::hotrod::operations::RetryOnFailureOperation<infinispan::hotrod::hrbytes>::execute() ()
#8  0x000000000040f9ac in infinispan::hotrod::RemoteCacheImpl::put(infinispan::hotrod::RemoteCacheBase&, void const*, void const*, unsigned long, unsigned long, void*) ()
#9  0x00000000004073e4 in infinispan::hotrod::RemoteCacheBase::base_put(void const*, void const*, long, long, void*) ()
#10 0x0000000000405fdd in infinispan::hotrod::RemoteCache<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >::put(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, infinispan::hotrod::TimeUnit, unsigned long, infinispan::hotrod::TimeUnit) ()
#11 0x000000000040609b in infinispan::hotrod::RemoteCache<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >::put(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) ()
#12 0x0000000000403c7b in main ()
```
Any ideas?
More info:
Switching the configuration to the UDP stack:

```xml
cluster="flexnet" stack="udp" />
default-stack="${jboss.default.jgroups.stack:udp}">
```

still eventually results in the clients hanging; it just takes a few more steps:
```
[Server1]# service infinispan stop
Wait 30 seconds
[Server1]# service infinispan start
Wait 30 seconds
[Server2]# service infinispan stop
Wait 30 seconds
[Server2]# service infinispan start
Wait 30 seconds
[Server3]# service infinispan stop
Wait 30 seconds
[Server3]# service infinispan start
Wait 30 seconds
[Server4]# service infinispan stop
Wait 30 seconds
[Server4]# service infinispan start
```
Repeat the above until the hang in the hotrod client occurs…
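The manual stop/start sequence above could be scripted. This is a hedged sketch: it assumes passwordless ssh to each server and the same init script used in the report; set `DRY_RUN=1` to print the commands instead of executing them, so the sequence can be checked without a cluster.

```shell
#!/bin/sh
# Rolling-restart sketch for the reproduction steps above.
SERVERS="infinispan-test1a infinispan-test2a infinispan-test3a infinispan-test4a"
PAUSE=30

# With DRY_RUN=1, echo each command instead of running it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

for host in $SERVERS; do
    run ssh "$host" sudo service infinispan stop
    run sleep "$PAUSE"
    run ssh "$host" sudo service infinispan start
    run sleep "$PAUSE"
done
```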
Thanks,
Kert