cpp-client lockup #150

Closed

kjans opened this issue Jul 30, 2014 · 11 comments

@kjans

kjans commented Jul 30, 2014

Hi,
I am experiencing a lockup problem with the Infinispan cpp-client library.

It seems like there should be a blog/forum where I can report this; please forgive me if submitting this as an “issue” is inappropriate…if so, where should I be reporting this to?

I have been tasked with testing Infinispan to see if it meets our performance and high-availability needs.
With this in mind, I have created a test environment consisting of an Infinispan cluster on four virtual machines (RHEL 5.10, 2.6.18-371.9.1.el5). The cluster is configured to communicate using TCP:
cluster="flexnet" stack="tcp" />
default-stack="${jboss.default.jgroups.stack:tcp}">

The machines are statically configured to know about each other:
infinispan-test1a[7800],infinispan-test2a[7800],infinispan-test3a[7800],infinispan-test4a[7800],

On each server VM, I have a simple client which uses the cpp-client library. A simplified version of the client code follows:
    std::string key;
    std::string value;
    std::string *pprev = NULL;
    key = "12345";
    value = "a value";
    for (;;) {
        try {
            pprev = cache.put(key, value);
        } catch (const std::exception &e) {
            std::cerr << "Failed with error: " << e.what() << "\n";
            continue;
        }
        if (pprev != NULL) std::cout << "Value: " << *pprev << "\n";
        delete pprev;
        usleep(100000);
    }
All four clients use the same code (same key, same value).
Each client is started with the IP address and TCP port of its local server and the cache name:

    ConfigurationBuilder builder;
    builder.addServer().host(args[1]).port(atoi(args[2]));
    RemoteCacheManager manager(builder.build(), true);  // create the manager from the builder configuration
    StringCache cache = manager.getCache<std::string, std::string>(args[3], true);

For example:
./iftest1 192.168.136.131 11222 cacheTest

Where cacheTest is defined as:

I start each of the clients and then monitor network traffic on each of the servers with tcpdump…
This test configuration will run forever with no reported errors if I leave it alone…

High-availability testing:
We are under the belief that if any one of the Infinispan services in the cluster is stopped, the cluster should detect this and automatically recover. With this in mind, the first test is to gracefully stop the Infinispan service on one of the hosts. In my configuration, one way of doing that is just:

sudo /etc/init.d/infinispan stop

Again, it is my assumption that the cluster should recover…if I subsequently start the service back up again, it should join and resume processing requests.
By watching the tcpdump output I can determine which servers have a copy of the data. If I stop the Infinispan service on one server using the above command, ALL my clients on ALL of my hosts hang.

Attaching gdb to the hung process and doing a stack trace results in the following:
    #0  0x0000003b6ca0b019 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
    #1  0x0000000000434553 in infinispan::hotrod::sys::Condition::wait(infinispan::hotrod::sys::Mutex&) ()
    #2  0x0000000000434e5d in infinispan::hotrod::sys::BlockingQueue<infinispan::hotrod::transport::TcpTransport*>::pop() ()
    #3  0x0000000000433c55 in infinispan::hotrod::transport::ConnectionPool::borrowObject(infinispan::hotrod::transport::InetSocketAddress const&) ()
    #4  0x0000000000424b7d in infinispan::hotrod::transport::TcpTransportFactory::borrowTransportFromPool(infinispan::hotrod::transport::InetSocketAddress const&) ()
    #5  0x00000000004245b7 in infinispan::hotrod::transport::TcpTransportFactory::getTransport(infinispan::hotrod::hrbytes const&) ()
    #6  0x0000000000417317 in infinispan::hotrod::operations::AbstractKeyOperation<infinispan::hotrod::hrbytes>::getTransport(int) ()
    #7  0x000000000041700f in infinispan::hotrod::operations::RetryOnFailureOperation<infinispan::hotrod::hrbytes>::execute() ()
    #8  0x000000000040f9ac in infinispan::hotrod::RemoteCacheImpl::put(infinispan::hotrod::RemoteCacheBase&, void const*, void const*, unsigned long, unsigned long, void*) ()
    #9  0x00000000004073e4 in infinispan::hotrod::RemoteCacheBase::base_put(void const*, void const*, long, long, void*) ()
    #10 0x0000000000405fdd in infinispan::hotrod::RemoteCache<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >::put(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, infinispan::hotrod::TimeUnit, unsigned long, infinispan::hotrod::TimeUnit) ()
    #11 0x000000000040609b in infinispan::hotrod::RemoteCache<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> > >::put(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) ()
    #12 0x0000000000403c7b in main ()

Any ideas?
More info:

  • Changing the client's usleep to one second seems to make the problem harder to reproduce.
  • Changing the configuration of the cluster to use UDP and restarting, e.g.:
    cluster="flexnet" stack="udp" />
    default-stack="${jboss.default.jgroups.stack:udp}">

still eventually results in the clients hanging…it just takes a few more steps.
[Server1]# service infinispan stop
Wait 30 seconds
[Server1]# service infinispan start
Wait 30 seconds
[Server2]# service infinispan stop
Wait 30 seconds
[Server2]# service infinispan start
Wait 30 seconds
[Server3]# service infinispan stop
Wait 30 seconds
[Server3]# service infinispan start
Wait 30 seconds
[Server4]# service infinispan stop
Wait 30 seconds
[Server4]# service infinispan start
Repeat the above until the hang in the hotrod client occurs…

Thanks,
Kert

@ghost

ghost commented Jul 30, 2014

Thank you for the detailed report! I'll have a look tomorrow.
You can use the Infinispan forum for support: https://community.jboss.org/en/infinispan

@kjans kjans closed this as completed Jul 31, 2014
@kjans
Author

kjans commented Jul 31, 2014

I can provide code for a test client if that would be helpful.

@kjans kjans reopened this Jul 31, 2014
@kjans
Author

kjans commented Aug 1, 2014

I have tracked this down to: ConnectionPool::invalidateObject
This method destroys the object but does not remove it from the busy queue.
Eventually the busy queue becomes full and then subsequent invocations block forever waiting for an object to become available (which will never happen).

Here is the existing code:
    void ConnectionPool::invalidateObject(const InetSocketAddress& key, TcpTransport* val) {
        sys::ScopedLock<sys::Mutex> l(lock);
        if (val != NULL) {
            factory->destroyObject(key, *val);
        }
    }

Here is what I am testing with now... which seems to fix the problem (but someone who knows this code needs to validate this fix):
    void ConnectionPool::invalidateObject(const InetSocketAddress& key, TcpTransport* val) {
        sys::ScopedLock<sys::Mutex> l(lock);
        if (val != NULL) {
            if (busy.count(key)) {
                busy[key]->remove(val);
            }
            factory->passivateObject(key, *val);
            factory->destroyObject(key, *val);
        }
    }
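
For context, here is a tiny standalone sketch of the failure mode (the class and method names below are invented for illustration and are not the actual cpp-client internals): a bounded pool whose invalidate path forgets to release the busy slot eventually looks permanently full, which is the blocked borrowObject() at the top of the gdb backtrace above.

    // pool_sketch.cpp -- illustration only; names are invented, not the real cpp-client classes.
    #include <cstddef>
    #include <iostream>
    #include <set>
    #include <vector>

    struct Connection { int id; };

    class BoundedPool {
    public:
        explicit BoundedPool(std::size_t capacity) : capacity_(capacity), next_id_(0) {}

        // The real client blocks on a condition variable when the pool is
        // exhausted; here we just return NULL so the effect is visible.
        Connection* borrow() {
            if (!idle_.empty()) {
                Connection* c = idle_.back();
                idle_.pop_back();
                busy_.insert(c);
                return c;
            }
            if (busy_.size() < capacity_) {
                Connection* c = new Connection();
                c->id = next_id_++;
                busy_.insert(c);
                return c;
            }
            return NULL; // every slot is still accounted for in busy_
        }

        // Buggy: destroys the connection but never erases it from busy_, leaking the slot.
        void invalidateBuggy(Connection* c) { delete c; }

        // Fixed: release the slot before destroying the connection.
        void invalidateFixed(Connection* c) { busy_.erase(c); delete c; }

        std::size_t busyCount() const { return busy_.size(); }

    private:
        std::size_t capacity_;
        int next_id_;
        std::vector<Connection*> idle_;
        std::set<Connection*> busy_;
    };

    int main() {
        BoundedPool pool(2);
        Connection* a = pool.borrow();
        Connection* b = pool.borrow();
        pool.invalidateBuggy(a);
        pool.invalidateBuggy(b);
        // Both connections are gone, yet the pool still thinks two are busy,
        // so the next borrow finds no free slot -- in the real client this is
        // the pthread_cond_wait seen in the backtrace.
        std::cout << "busy after invalidate: " << pool.busyCount() << "\n"; // prints 2
        std::cout << "borrow: " << (pool.borrow() ? "got one" : "pool exhausted") << "\n";
        return 0;
    }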

@ghost

ghost commented Aug 4, 2014

Hi, please give it a try with this commit applied and let me know if you are still experiencing the issue:
https://github.com/isavin/cpp-client/commit/1d340c8eed87d0f3ca92ad900539b12e26fa5511

@kjans
Author

kjans commented Aug 4, 2014

Hi,
Commit isavin@1d340c8 fixes the file descriptor leak, thanks!

Also, the following earlier commit (which I did not have) fixes the lockup problem I was seeing:
commit 84a6fd3
Author: rvansa <rvansa@redhat.com>
Date: Mon Jan 13 18:02:33 2014 +0100

I still occasionally get one of the following exceptions from a put() when one of my services is asynchronously stopped:
Invalid message id. Expected 19989 and received 18291
Invalid magic number. Expected � and received
Failed to connect (host: 192.168.136.129 port: 11222) Operation now in progress
Exception encountered, retry 7 of 8: Failed to connect (host: 192.168.136.129 port: 11222) Operation now in progress

What is the proper way to handle an exception here?
    try {
        pprev = cache.put(key, value);
    } catch (const std::exception &e) {
        std::cerr << "Failed with error: " << e.what() << "\n";
    }
Is the underlying socket still valid on an exception?
Is it okay to ignore the exception and just retry?
Or should the existing RemoteCacheManager be recreated?

Thanks

@ghost

ghost commented Aug 6, 2014

Retrying should be the proper way of handling cache operation exceptions. The transport is invalidated and recreated if necessary.
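
For example, a minimal retry loop along those lines, reusing the variables from the earlier snippet (the attempt count and back-off are arbitrary values for illustration, not cpp-client defaults):

    // Sketch: keep the same RemoteCacheManager and cache, and simply retry the put.
    const int MAX_ATTEMPTS = 5;            // arbitrary for illustration
    std::string *pprev = NULL;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; ++attempt) {
        try {
            pprev = cache.put(key, value); // the transport is re-acquired internally
            break;                         // success
        } catch (const std::exception &e) {
            std::cerr << "put failed (attempt " << attempt << "): " << e.what() << "\n";
            if (attempt == MAX_ATTEMPTS) throw; // give up after a few tries
            usleep(500000);                     // brief back-off before retrying
        }
    }
    delete pprev;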

Can you give me more details on the scenario that triggers the invalid message id/magic number issue? Is the operation done in a multithreaded environment with the cache shared between threads?

@kjans
Author

kjans commented Aug 7, 2014

Hi,
A lot of the errors seem to go away based upon selective handling of the exception.
For some exceptions, I am able to reuse the cache object...
for others it seems to need to be recreated.
The following code isn't pretty...it's just for test purposes:

  for(;;)
  {
    try {
        std::cerr << "At start\n";
        manager = new RemoteCacheManager(builder.build(), true);
        StringCache cache = manager->getCache(args[3], true);
        key = "12345";
        value = "a value";
        for (;;) {
          usleep(100000);
          try {
            pprev = cache.put(key, value);
            pprev = cache.put(key, value);
            pprev = cache.put(key, value);
          } catch (const std::exception &e) {
            std::cerr << "Failed with error: " << e.what() << "\n";
            std::cout << "Errors: " << errors++ << "\n";
            if((strstr(e.what(), "org.infinispan.remoting.transport.jgroups.SuspectException")) != 0)
            {
              std::cout << "Get new cache object\n";
              manager->stop();
              delete manager;
              break;
            }
            if((strstr(e.what(), "Exception encountered, retry")) != 0)
            {
              std::cout << "Get new cache object\n";
              manager->stop();
              delete manager;
              break;
            }
            std::cout << "Keep cache object\n";
            continue;
          }
          std::cout << "Loop: " << loop++ << ", Errors so far: " << errors << "\n";
        }
    } catch (const std::exception &f) {
      std::cerr << "Outer Failed with error: " << f.what() << "\n";
    }
    std::cout << "Outer Loop: " << outerloop++ << "\n";
    sleep(1);
  }

The way to induce errors is to run this code on each client, then stop the Infinispan service on (only) one node, wait 10 seconds, restart it, and move on to the next node...restart...repeat.
Eventually this results in exceptions in the client.

@ghost

ghost commented Aug 7, 2014

OK. I'll give it a try. Could not reproduce the problem so far.

@ghost

ghost commented Aug 8, 2014

One note though: if you recreate the manager you lose a bit of the fault-tolerance capability the client offers.

If you have 4 nodes and the one the client is currently connected to fails, the client will use the topology information to reconnect to one of the other 3. If you instead recreate the manager, it drops the existing topology info, tries to connect to the node hardcoded on the command line, and remains disconnected until that node is restarted.
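
One way to reduce that dependence on a single hardcoded node is to seed the builder with every known server up front. A sketch using the hostnames and HotRod port from the original report; I am assuming addServer() can be called once per node, mirroring the Java client:

    // List every known node so the client can bootstrap even if the "local" server is down.
    ConfigurationBuilder builder;
    builder.addServer().host("infinispan-test1a").port(11222);
    builder.addServer().host("infinispan-test2a").port(11222);
    builder.addServer().host("infinispan-test3a").port(11222);
    builder.addServer().host("infinispan-test4a").port(11222);
    RemoteCacheManager manager(builder.build(), true);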

@kjans
Author

kjans commented Aug 8, 2014

Hi,
In most cases, I can use the existing manager object and recover...that is the goal.
There are some failures from which continuing to use that object never seems to recover; from others it does.

But I'm beginning to believe that what I'm seeing is a cluster problem rather than a cpp library problem.

To review, I am tasked with testing to see how a client recovers from server failures...
When I don't stop/start the servers, my clients appear to run forever...
It's only when I try to break the cluster by stopping (then restarting) one node at a time that I see problems.
For example, I just experienced a failure under the following conditions:

  • all four servers working
  • all four clients working
  • on node 1: stopped infinispan

The clients are now reporting the following:

ERROR [RetryOnFailureOperation.h:68] Exception encountered, retry 23 of 24: Request for message id[43336] returned �org.infinispan.remoting.transport.jgroups.SuspectException: One or more nodes have left the cluster while replicating command SingleRpcCommand{cacheName='cacheTest', command=PutKeyValueCommand{key=[B0x3132333435, value=[B@15686e07, flags=null, putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=NumericVersion{version=6192458077705685}}, successful=true}}
Failed with error: Request for message id[43336] returned �org.infinispan.remoting.transport.jgroups.SuspectException: One or more nodes have left the cluster while replicating command SingleRpcCommand{cacheName='cacheTest', command=PutKeyValueCommand{key=[B0x3132333435, value=[B@15686e07, flags=null, putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=NumericVersion{version=6192458077705685}}, successful=true}}

Now, if I even try restarting any of my client application programs, they all fail with the above message.
So it appears that I am able to get my cluster into a bad state...the problem is not the library.
Restarting the cluster fixes the problem...clients are happy again.

Bottom line: I think the main issues I was seeing (the hang and the descriptor leak) are now fixed in the current cpp-client code.
I now need to understand why my entire cluster gets in a bad state and is only recoverable after restarting all nodes.
Thanks for your help.

@kjans kjans closed this as completed Aug 8, 2014
@ghost

ghost commented Aug 8, 2014

I was not able to reproduce the issue locally (all 4 nodes were running on the same machine, though).
Please post the issue on the forum and include the server version you are using and setup details.
