Multiple strange issues with long-running scripts #18
The discussion of my issues would likely be better suited to a mailing list or forum environment, but I can't seem to find any that are associated with the project. If they exist and I've just missed them somehow, please point me in their direction.
I've been converting a server monitoring application to utilize Cassandra, and I've been using the CPCL to bridge the gap between PHP and Cassandra. For the most part, it's been pretty painless, but I've run into some really strange issues that I haven't been able to isolate or correct. I'm still fairly new to Cassandra, so I'm not really certain whether I'm finding bugs somewhere, or whether I'm just "Doing It Wrong (tm)".
The entire application is running in a development environment on a collection of CentOS 5.8 64-bit Xen instances with the php53 RPMs provided by CentOS (specifically php53-5.3.3-7.el5_8). The Cassandra cluster started as 8 VMs running on a single hardware node with 2 VCPUs and 2GB RAM each, and has grown to 12 VMs running on three hardware nodes, with 2 VCPUs and 4GB of RAM each. Cassandra is installed via the DataStax RPMs (apache-cassandra1-1.0.9-1, to be specific). I've done no tweaking to the configuration other than setting the initial tokens and cluster names on each server and configuring the listen/RPC addresses. I'm using a self-compiled version of the Thrift binary library built from the code provided in the CPCL. The CPCL code I'm using corresponds to commit 766dc14 from the git repo (retrieved via kallaspriit-Cassandra-PHP-Client-Library-766dc14.zip on 2012/06/04).
My app accesses Cassandra both through short-lived HTTP-based PHP scripts and through long-lived PHP scripts run from the command line. The problems exist in the latter set of scripts, seemingly after a script has performed a good number of large-ish operations against Cassandra. By "large-ish", I mean a get or set of a single key with 10,000-30,000 columns or so. These issues all occur within scripts that repeatedly retrieve a bunch of data from Cassandra, process it in some way, then store the processed data back into Cassandra. So far, I'm seeing multiple distinct issues.
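For context, the long-running scripts follow roughly this read/process/write shape. This is a minimal sketch: the $client object and its get()/set() methods are hypothetical stand-ins, not the actual CPCL API, and the in-memory ArrayClient exists only so the sketch runs without a cluster.

```php
<?php
// Minimal sketch of the read/process/write loop described above. The $client
// object and its get()/set() methods are hypothetical stand-ins for the real
// CPCL calls, which in the real workload move rows of 10,000-30,000 columns.
function processKeys($client, array $keys)
{
    foreach ($keys as $key) {
        // Fetch the row's columns (timestamp => value for time-series data).
        $columns = $client->get($key);

        // Application-specific processing; doubling each value is a placeholder.
        $processed = array();
        foreach ($columns as $timestamp => $value) {
            $processed[$timestamp] = $value * 2;
        }

        // Store the processed data back under a derived key.
        $client->set($key . ':processed', $processed);
    }
}

// In-memory stand-in used here so the sketch is runnable without Cassandra.
class ArrayClient
{
    public $store = array();
    public function get($key) { return $this->store[$key]; }
    public function set($key, $columns) { $this->store[$key] = $columns; }
}
```

Each pass through the loop repeats this cycle for many keys, which is why any state that accumulates inside the client between calls gets exercised thousands of times per run.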
Since I'm not seeing any issues whatsoever with the short-lived scripts that run via HTTP requests, my guess is that data objects are becoming "cluttered" over time as the long-running scripts do their thing, and that clutter is somehow causing the issues I'm seeing above. I've tried digging into the various classes involved, but being completely unfamiliar with the Cassandra/thrift binary protocols, I can only dig so far.
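If accumulated client state really is the culprit, one common mitigation for long-running PHP workers (not specific to CPCL) is to periodically tear down and rebuild the client so that anything accumulated in the connection or its buffers gets discarded. A sketch, where the $factory callback and the get()/set() methods are hypothetical stand-ins for however the real client is constructed and used:

```php
<?php
// Sketch: recycle the underlying client every $maxOps operations so that
// state accumulated across many large gets/sets is thrown away. The $factory
// callback and the get()/set() methods are hypothetical stand-ins; the real
// construction call depends on the library in use.
class RecyclingClient
{
    private $factory;
    private $maxOps;
    private $ops = 0;
    private $client;

    public function __construct($factory, $maxOps = 1000)
    {
        $this->factory = $factory;
        $this->maxOps = $maxOps;
        $this->client = call_user_func($factory);
    }

    private function maybeRecycle()
    {
        if (++$this->ops > $this->maxOps) {
            // Drop the old client (and any internal state it carries) entirely.
            $this->client = call_user_func($this->factory);
            $this->ops = 1;
        }
    }

    public function get($key)
    {
        $this->maybeRecycle();
        return $this->client->get($key);
    }

    public function set($key, $columns)
    {
        $this->maybeRecycle();
        $this->client->set($key, $columns);
    }
}
```

This doesn't fix whatever is going wrong, but if the symptoms disappear when the client is recycled every N operations, that would support the "clutter over time" theory.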
I realize I'm laying out some rather ambiguous and ill-defined issues here, so if you need specifics, please let me know what you need. Being fairly new to Cassandra, I wouldn't be at all surprised if I'm just missing something. Any insight is welcome.
I'm not sure how to debug this sort of thing. The first issue could stem from the nature of a distributed Cassandra database: some updates might not have propagated to all replicas yet.
Do the Cassandra instances produce any logs? Do the second and third issues manifest themselves under low-load conditions?
Unfortunately, I gave up on the CPCL a while back because of the issues above, so I can't provide much more than my recollections on the issues at this point. :\
For the first issue, I don't believe this is an issue of nodes being out of sync, or some other distributed writing behavior. The data coming back from the get request was valid and correct data, but for the wrong key. Example: In one iteration through a loop, the script gets a range of data from "Key12345". In the next iteration, it grabs data from "Key23456". In the second iteration, valid and correct data from "Key12345" was showing up in the results for "Key23456". I did every possible thing I could think of to sanitize the variable those results were being stored into, and nothing eliminated the problem. It was not consistently happening with every get (only a small percentage), and only seemed to happen when the script had been running for some time (many minutes to hours), but happened enough that the data I was processing (time series monitoring data) was visibly and obviously wrong. That's why I suspected some sort of buffer corruption somewhere in the call stack.
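Since the contamination only hit a small percentage of gets, one diagnostic workaround would be a read-verification wrapper: fetch the same key twice and only accept the result when both reads agree. This is a sketch, not a fix; $client->get() is a hypothetical stand-in for the real library call, and it assumes the corruption is transient rather than sticky across consecutive reads.

```php
<?php
// Sketch of a read-verification wrapper for the intermittent wrong-key reads
// described above: fetch the same key twice and only accept the result when
// both reads agree, retrying otherwise. $client->get() is a stand-in for the
// real library call. This is a diagnostic workaround, not a fix.
function getVerified($client, $key, $maxAttempts = 5)
{
    for ($i = 0; $i < $maxAttempts; $i++) {
        $first = $client->get($key);
        $second = $client->get($key);
        if ($first === $second) {
            return $first;
        }
        // Mismatch between back-to-back reads of the same key: one of them
        // was contaminated, so discard both and try again.
    }
    throw new RuntimeException("Reads for $key never agreed; possible buffer corruption");
}
```

If verified reads stopped the bad data from showing up in the processed output, that would point at the client-side call stack rather than the cluster.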
I have no earthly idea what would cause issue #2. If I remember correctly, I tested a number of different concurrency scenarios to see if scripts were somehow interfering with each other. It seemed that even when the overall Cassandra cluster load was low and there was only one copy of my processing script running, issues #1 and #2 would still occur.
I wouldn't be at all surprised if #3 was caused by my testing environment. Perhaps drew can elaborate on which issues he's seeing to shed some more light on the situation?