kamailio crashes when attempting to query offline database #1821
I am testing how kamailio reacts to various database conditions. One such condition is if the database engine is simply shut down (that is, database server process no longer running, tcp listening socket closed, etc...)
I am utilizing the db_unixodbc module to connect to an Informix database engine.
I am currently running on Kamailio version 5.0.1.
I have a test query that executes against the database engine every 10 seconds.
Here is what I have noticed if I shut down the database engine at some point after starting Kamailio.
The first test query that attempts to run against the db engine fails; it tries to reconnect and fails.
The second test query (10 seconds after the first) results in a SIGCHLD, and the entire Kamailio process shuts down.
I communicated this info to the mailing list and was asked to open an issue regarding this and to also test the more recent version 5 releases.
Here is the update on said tests:
I have tested the master branch (5.3.0-dev2) and 5.2.1, and neither resolves the issue.
However I did notice in the master branch that there is new code that is related to this issue.
In issue 1681 there is code that allows Kamailio to start even if a database connection can not be established. Queries attempting to run against the offline database fail gracefully. And once the database is back online, a connection is established and queries against it are successful.
However, if I shut down the database at some later point, we are back to the issue I originally reported: Kamailio crashes with the same output as listed before. The only difference is timing. On the master branch, the first query attempted against the offline db causes the crash; on 5.0.1, the first query fails (along with its reconnect attempt) and the second query causes the crash. Either way, the output is more or less the same and Kamailio is down.
I suspect the behavior might be the same even without an ODBC driver, but maybe not.
Start Kamailio; kill the database engine; run a test query from Kamailio against said database engine; Kamailio crashes. This can be replicated using the db_unixodbc module; I am not sure whether it is the same for other types of database drivers.
Thank you for the report. Can you try to get a core dump file and attach the backtrace here?
Have a look at this page for more information: https://kamailio.org/wiki/tutorials/troubleshooting/coredumpfile
I am going to take a step back here; it might be best to address the following issue that I have found (which is very much related to the one at hand) before proceeding to the issue in this ticket.
Regarding the following statements I made in this ticket: "In issue 1681 there is code that allows Kamailio to start even if a database connection can not be established. Queries attempting to run against the offline database fail gracefully. And once the database is back online, a connection is established and queries against it are successful.".
Those statements are indeed true. However, I have noticed that if I leave the database offline and an unrelated query executes via sqlops on a different, unrelated database handle, the program crashes. That other handle belongs to a database that is online and was connected to at startup. Yet Kamailio's logs and the gdb output suggest that the code considers this online database to be offline and attempts a reconnect, at which point the program crashes. So we have the following scenario: one database offline, another online; a test query to the offline database is gracefully rejected, but a query to the online database crashes Kamailio.
So, the setup is this:
Here is the output of gdb for the issue where the database remains offline from start to end.
Thank you for your reply. I have spent the last few weeks reviewing our system and have noticed that we have a few outdated shared libraries in use. unixODBC seems a bit dated, as does the ODBC client SDK for the Informix database engine. I have updated the unixODBC libraries and have noticed that the segmentation fault appears to occur within the Informix CSDK libraries, specifically at a call to SQLFreeHandle. A few online searches have shown that there is indeed a memory violation within that function call when a protocol issue is encountered. No further detail about the protocol issue is given, but it fits the issue at hand very closely (a TCP disconnect indicating a protocol issue, followed by a seg fault). The fix is in an updated Informix CSDK library, which I am in the process of installing on my system. I am hoping this resolves the issue. Thank you again for taking a look into this; I will let you all know the results one way or another.
Hello Daniel and Henning,
I have confirmed that it is indeed the outdated Informix CSDK libraries that were at fault. I have updated said libraries to the latest version (clientsdk.4.10.FC9DE) and the crash no longer occurs. Kamailio is able to detect that a database connection is severed, handles it gracefully, and reconnects to the offline database once it is back online. I appreciate your responses to this item. I will go ahead and close this ticket now.
Well, it appears that over the course of the past several weeks I have managed to confuse myself about the issues at hand. Upon re-reading the ticket, there are two issues that I had uncovered. The first is where a database engine goes offline at some point during normal call processing, causing a Kamailio crash. This issue has been addressed by the updated Informix CSDK library, as I recently confirmed. The second is where a database is offline at Kamailio startup and Kamailio crashes, based on the steps below:
The gdb output is exactly the same as previously pasted even with the updated Informix CSDK libraries. I was hoping the Informix CSDK update would solve this issue also, but it didn't.
Here are the results to your inquiry about gdb output.
Thank you for looking into this.
The root cause of the crash lies in the sqlops/sql_api.c file within the function sql_connect. I pasted that function below so we can reference it when reviewing my notes below it:
int sql_connect(int mode)
Notice the if(mode) clause. It looks like the statements within it need to be reversed. That is, if mode, then keep trying to connect to the other database instances. If not mode, then return false immediately.
The crash is set up when the code encounters a database instance it cannot connect to and returns false while more database instances remain in the sql_con_t linked list.
If at a later time one of those remaining database instances (the ones we never attempted to connect to because of the premature return) has an SQL statement submitted to it, the sql_reconnect function gets called, because the connection structure has been initialized for that database instance. Unfortunately, because sql_connect never made an actual connection attempt for it, the sc->dbf member is null. Basically, this piece of code within the sql_connect function never gets executed for the remaining database instances in the linked list:
sc->dbf remains null, and accessing it via sql_reconnect causes the segmentation fault.
This is clearly seen in the gdb output.
I have tested the code with reversing the logic in the if(mode) statement and all works well.
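To make the mechanism concrete, here is a minimal standalone sketch of the pattern described above. It is not the Kamailio source: the struct layouts, `fake_init`, `connect_all`, `reconnect`, and the `stop_on_failure` flag are hypothetical stand-ins, and the "driver" is a stub that refuses any URL starting with `down:`. It only illustrates how a premature return from the connect loop leaves later list entries without a bound driver, and how continuing the loop avoids that.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the sqlops structures (not the real definitions). */
typedef struct db_funcs {
    int (*init)(const char *url);   /* stays NULL until the driver is bound */
} db_funcs_t;

typedef struct sql_con {
    const char *url;
    db_funcs_t dbf;
    int connected;
    struct sql_con *next;
} sql_con_t;

/* Stub driver (assumption): URLs starting with "down:" refuse connections. */
int fake_init(const char *url)
{
    return strncmp(url, "down:", 5) == 0 ? -1 : 0;
}

/* Bind the driver and try to connect every entry in the list.
 * With stop_on_failure != 0 the loop returns at the first failed
 * connection, so every later entry keeps dbf.init == NULL. */
int connect_all(sql_con_t *root, int stop_on_failure)
{
    int rc = 0;
    for (sql_con_t *sc = root; sc != NULL; sc = sc->next) {
        sc->dbf.init = fake_init;                 /* the "bind" step */
        sc->connected = (sc->dbf.init(sc->url) == 0);
        if (!sc->connected) {
            rc = -1;
            if (stop_on_failure)
                return rc;                        /* premature return */
        }
    }
    return rc;
}

/* Reconnect one entry. The NULL check is a guard added for this sketch;
 * calling through a NULL dbf.init is the kind of access that would
 * otherwise end in a segmentation fault. */
int reconnect(sql_con_t *sc)
{
    if (sc->dbf.init == NULL)
        return -2;                                /* driver never bound */
    sc->connected = (sc->dbf.init(sc->url) == 0);
    return sc->connected ? 0 : -1;
}
```

With a two-entry list where the first URL is `down:db1` and the second is `up:db2`, `connect_all(&first, 1)` leaves the second entry unbound and `reconnect` on it reports `-2`, whereas `connect_all(&first, 0)` binds and connects the second entry even though the first one failed.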
If you agree with my analysis, please let me know how we should proceed here.
Either I can make the change or you can. I am fine with either.