
kamailio crashes when attempting to query offline database #1821

Closed
srinikart opened this issue Jan 23, 2019 · 13 comments

@srinikart commented Jan 23, 2019

Description

I am testing how Kamailio reacts to various database conditions. One such condition is the database engine simply being shut down (that is, the database server process is no longer running, its TCP listening socket is closed, and so on).

I am utilizing the db_unixodbc module to connect to an Informix database engine.

I am currently running on Kamailio version 5.0.1.

I have a test query that executes against the database engine every 10 seconds.

Here is what I have noticed when I shut down the database engine at some point after starting Kamailio.

The first test query that attempts to run against the db engine fails; it tries to reconnect and fails.

The second test query (10 seconds after the first) causes a segmentation fault in the worker process, and the main process then shuts down the entire Kamailio instance on SIGCHLD.

I communicated this info to the mailing list and was asked to open an issue regarding this and to also test the more recent version 5 releases.

Here is the update on said tests:

I have tested the master branch (5.3.0-dev2) and 5.2.1, and neither branch resolves the issue.

However I did notice in the master branch that there is new code that is related to this issue.

#1681

In issue #1681 there is code that allows Kamailio to start even if a database connection cannot be established. Queries attempting to run against the offline database fail gracefully. And once the database is back online, a connection is established and queries against it are successful.

However, if at some later point I shut down the database, we're back to the original issue I reported: Kamailio crashes with the same output as listed before. The only difference is that in master the first query attempted against the offline db causes the crash, whereas in 5.0.1 the first attempt fails, retries, and fails, and it is the second attempt that causes the crash. Regardless, the output is more or less the same and Kamailio is down.

I suspect the behavior might be the same even without an ODBC driver, but maybe not.

Troubleshooting

Reproduction

Start Kamailio; kill the database engine; run a test query from Kamailio to said database engine; Kamailio crashes. This can be replicated using the db_unixodbc module; I am not sure whether the same holds for other types of database drivers.

Log Messages

Jan 17 20:07:25 [29297]: INFO: (s)  SQL query: FIRST TEST QUERY 
Jan 17 20:07:25 [29297]: ERROR: db_unixodbc [dbase.c:135]: db_unixodbc_submit_query(): rv=-1. Query= FIRST TEST QUERY
Jan 17 20:07:25 [29297]: ERROR: db_unixodbc [connection.c:220]: db_unixodbc_extract_error(): unixodbc:SQLExecDirect=08S01:1:-11020:[Informix][Informix ODBC Driver]Communication link failure.
Jan 17 20:07:25 [29297]: ERROR: db_unixodbc [dbase.c:59]: reconnect(): Attempting DB reconnect
Jan 17 20:07:25 [29297]: ERROR: db_unixodbc [dbase.c:74]: reconnect(): failed to connect
Jan 17 20:07:25 [29297]: ERROR: db_unixodbc [connection.c:220]: db_unixodbc_extract_error(): unixodbc:SQLDriverConnect=08002:1:0:[unixODBC][Driver Manager]Connection name in use
Jan 17 20:07:25 [29297]: ERROR: db_unixodbc [connection.c:220]: db_unixodbc_extract_error(): unixodbc:SQLDriverConnect=HY010:2:-11067:[Informix][Informix ODBC Driver]Function sequence error.
Jan 17 20:07:25 [29297]: ERROR: <core> [db_query.c:181]: db_do_raw_query(): error while submitting query
Jan 17 20:07:25 [29297]: ERROR: sqlops [sql_api.c:265]: sql_do_query(): cannot do the query FIRST TEST QUERY
Jan 17 20:07:25 [29297]: INFO: (s) [123] SQL ret: fail (-1)
Jan 17 20:07:25 [29297]: INFO: (s) [123] SQL res: no rows
Jan 17 20:07:35 [29297]: INFO: (s) [123] SQL query: 10 seconds later the SECOND TEST QUERY (it's the same query as the first one)
Jan 17 20:07:35 [29301]: CRITICAL: <core> [core/pass_fd.c:277]: receive_fd(): EOF on 28
Jan 17 20:07:35 [29283]: ALERT: <core> [main.c:744]: handle_sigs(): child process 29297 exited by a signal 11
Jan 17 20:07:35 [29283]: ALERT: <core> [main.c:747]: handle_sigs(): core was not generated
Jan 17 20:07:35 [29283]: INFO: <core> [main.c:759]: handle_sigs(): terminating due to SIGCHLD
Jan 17 20:07:35 [29301]: INFO: <core> [main.c:814]: sig_usr(): signal 15 received
Jan 17 20:07:35 [29295]: INFO: <core> [main.c:814]: sig_usr(): signal 15 received
Jan 17 20:07:35 [29291]: INFO: <core> [main.c:814]: sig_usr(): signal 15 received
Jan 17 20:07:35 [29288]: INFO: <core> [main.c:814]: sig_usr(): signal 15 received
Jan 17 20:07:35 [29300]: INFO: <core> [main.c:814]: sig_usr(): signal 15 received
Jan 17 20:07:35 [29284]: INFO: <core> [main.c:814]: sig_usr(): signal 15 received
Jan 17 20:07:35 [29286]: INFO: <core> [main.c:814]: sig_usr(): signal 15 received
Jan 17 20:07:35 [29293]: INFO: <core> [main.c:814]: sig_usr(): signal 15 received
Jan 17 20:07:35 [29289]: INFO: <core> [main.c:814]: sig_usr(): signal 15 received
Jan 17 20:07:35 [29287]: INFO: <core> [main.c:814]: sig_usr(): signal 15 received
Jan 17 20:07:35 [29292]: INFO: <core> [main.c:814]: sig_usr(): signal 15 received
Jan 17 20:07:35 [29296]: INFO: <core> [main.c:814]: sig_usr(): signal 15 received
Jan 17 20:07:35 [29298]: INFO: <core> [main.c:814]: sig_usr(): signal 15 received
Jan 17 20:07:35 [29299]: INFO: <core> [main.c:814]: sig_usr(): signal 15 received
Jan 17 20:07:35 [29285]: INFO: <core> [main.c:814]: sig_usr(): signal 15 received
Jan 17 20:07:35 [29294]: INFO: <core> [main.c:814]: sig_usr(): signal 15 received
Jan 17 20:07:35 [29290]: INFO: <core> [main.c:814]: sig_usr(): signal 15 received
Jan 17 20:07:35 [29283]: INFO: <core> [core/sctp_core.c:53]: sctp_core_destroy(): SCTP API not initialized

Additional Information

version: kamailio 5.0.1 (x86_64/linux)
flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: unknown
compiled on 11:25:09 May  2 2018 with gcc 4.1.2

version: kamailio 5.3.0-dev2 (x86_64/linux) 30857d-dirty
flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: 30857d -dirty
compiled on 14:30:54 Jan 22 2019 with gcc 4.1.2
  • Operating System:
Red Hat Enterprise Linux Server release 5.11 (Tikanga)
Linux 2.6.18-404.el5 #1 SMP Sat Mar 7 04:14:13 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
@henningw (Contributor) commented Jan 26, 2019

Thank you for the report. Can you try to get a core dump file and attach the backtrace here?

Have a look at this page for more information: https://kamailio.org/wiki/tutorials/troubleshooting/coredumpfile

@srinikart (Author) commented Feb 4, 2019

Hello,

I am going to take a step back here; it might be best to address the following issue that I have found (which is very much related to the one at hand) before proceeding with the issue in this ticket.

Regarding the following statements I made in this ticket: "In issue 1681 there is code that allows Kamailio to start even if a database connection can not be established. Queries attempting to run against the offline database fail gracefully. And once the database is back online, a connection is established and queries against it are successful."

Those statements are indeed true. However, I have noticed that if I leave the database offline and another, unrelated query executes through an unrelated database handle via sqlops, the program crashes. This other handle points to a database that is connected at startup and is online, yet Kamailio's logs and the gdb output show that the code treats this online database as offline and attempts a reconnect, at which point the program crashes. So we have the following scenario: one database offline; another online; a test query to the offline database is gracefully rejected; but a query to the online database crashes Kamailio.

So, the setup is this:

  • leave a database offline (shut down)
  • start up kamailio
  • kamailio starts even though the database is offline, which is good
  • my test query begins to execute against the offline database and kamailio reacts gracefully to not being able to submit the queries to the offline database. also good.
  • now, if another entirely unrelated query is executed (a query against an online database, for example), the program crashes.
I figure it might be best to tackle this issue first before addressing the one originally referenced in this ticket, where the program crashes when a database engine is shut down during normal call processing.
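The setup above can be sketched as a hypothetical sqlops configuration. The connection names, hosts, and credentials here are invented for illustration; only the sqlops "sqlcon" modparam and sql_query() usage mirror the module's documented interface:

```
# loadmodule lines assumed present: db_unixodbc.so, sqlops.so
# "con_off" points at the engine that is shut down before startup,
# "con_on" at a reachable engine.
modparam("sqlops", "sqlcon", "con_off=>unixodbc://user:pass@downhost/voip")
modparam("sqlops", "sqlcon", "con_on=>unixodbc://user:pass@localhost/voip")

request_route {
    # query to the offline database fails gracefully...
    sql_query("con_off", "select 1 from systables", "ra");
    # ...but this query to the online database crashes the worker
    sql_query("con_on", "select 1 from systables", "ra");
}
```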

Here is the output of gdb for the issue where the database remains offline from start to end.

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00002ba7077959fb in sql_reconnect (sc=0x2ba5e34aac10) at sql_api.c:187
#2  0x00002ba7077a3448 in sql_check_connection (dbl=0x2ba5e34aac10) at sqlops.c:234
#3  0x00002ba7077a3765 in sql_query (msg=0x2ba5e3b59178, dbl=0x2ba5e34aac10 "(\253J\343\245+", query=0x2ba5e3b79c10 "\320\335c\343\245+",
    res=0x2ba5e35f87d0 "\257\242\376h") at sqlops.c:247
#4  0x0000000000433f19 in do_action (h=0x7fff05edb820, a=0x2ba5e363e800, msg=0x2ba5e3b59178) at core/action.c:1085
#5  0x00000000004414df in run_actions (h=0x7fff05edb820, a=0x2ba5e363e800, msg=0x2ba5e3b59178) at core/action.c:1564
#6  0x0000000000441bd6 in run_actions_safe (h=0x7fff05ede060, a=0x2ba5e363e800, msg=0x2ba5e3b59178) at core/action.c:1625
#7  0x00000000004dd859 in lval_pvar_assign (h=0x7fff05ede060, msg=0x2ba5e3b59178, lv=0x2ba5e363e688, rv=0x2ba5e363f450) at core/lvalue.c:284
#8  0x00000000004dbf19 in lval_assign (h=0x7fff05ede060, msg=0x2ba5e3b59178, lv=0x2ba5e363e688, rve=0x2ba5e363f448) at core/lvalue.c:400
#9  0x000000000043f14d in do_action (h=0x7fff05ede060, a=0x2ba5e363de48, msg=0x2ba5e3b59178) at core/action.c:1443
#10 0x00000000004414df in run_actions (h=0x7fff05ede060, a=0x2ba5e363d3f8, msg=0x2ba5e3b59178) at core/action.c:1564
#11 0x00000000004305a6 in do_action (h=0x7fff05ede060, a=0x2ba5e390da70, msg=0x2ba5e3b59178) at core/action.c:691
#12 0x00000000004414df in run_actions (h=0x7fff05ede060, a=0x2ba5e3907958, msg=0x2ba5e3b59178) at core/action.c:1564
#13 0x00000000004305a6 in do_action (h=0x7fff05ede060, a=0x2ba5e3989260, msg=0x2ba5e3b59178) at core/action.c:691
#14 0x00000000004414df in run_actions (h=0x7fff05ede060, a=0x2ba5e3989260, msg=0x2ba5e3b59178) at core/action.c:1564
#15 0x0000000000441bd6 in run_actions_safe (h=0x7fff05ee1920, a=0x2ba5e3989260, msg=0x2ba5e3b59178) at core/action.c:1625
#16 0x00000000005843e0 in rval_get_int (h=0x7fff05ee1920, msg=0x2ba5e3b59178, i=0x7fff05ede768, rv=0x2ba5e39893b8, cache=0x0) at core/rvalue.c:915
#17 0x0000000000586d00 in rval_expr_eval_int (h=0x7fff05ee1920, msg=0x2ba5e3b59178, res=0x7fff05ede768, rve=0x2ba5e39893b0) at core/rvalue.c:1913
#18 0x0000000000587133 in rval_expr_eval_int (h=0x7fff05ee1920, msg=0x2ba5e3b59178, res=0x7fff05edf0e0, rve=0x2ba5e3989ae0) at core/rvalue.c:1921
#19 0x0000000000433810 in do_action (h=0x7fff05ee1920, a=0x2ba5e398a448, msg=0x2ba5e3b59178) at core/action.c:1043
#20 0x00000000004414df in run_actions (h=0x7fff05ee1920, a=0x2ba5e398a448, msg=0x2ba5e3b59178) at core/action.c:1564
#21 0x00000000004305a6 in do_action (h=0x7fff05ee1920, a=0x2ba5e350f4f0, msg=0x2ba5e3b59178) at core/action.c:691
#22 0x00000000004414df in run_actions (h=0x7fff05ee1920, a=0x2ba5e350f4f0, msg=0x2ba5e3b59178) at core/action.c:1564
#23 0x0000000000433ccc in do_action (h=0x7fff05ee1920, a=0x2ba5e3513e28, msg=0x2ba5e3b59178) at core/action.c:1058
#24 0x00000000004414df in run_actions (h=0x7fff05ee1920, a=0x2ba5e3503118, msg=0x2ba5e3b59178) at core/action.c:1564
#25 0x0000000000433ccc in do_action (h=0x7fff05ee1920, a=0x2ba5e35141b0, msg=0x2ba5e3b59178) at core/action.c:1058
#26 0x00000000004414df in run_actions (h=0x7fff05ee1920, a=0x2ba5e34ccb70, msg=0x2ba5e3b59178) at core/action.c:1564
#27 0x0000000000441cad in run_top_route (a=0x2ba5e34ccb70, msg=0x2ba5e3b59178, c=0x0) at core/action.c:1646
#28 0x0000000000545339 in receive_msg (
    buf=0x595f530 "REGISTER ... removed for security reasons
#29 0x0000000000640719 in receive_tcp_msg (
    tcpbuf=0x2ba6070bb638 "REGISTER... removed for security reasons
#30 0x00000000006426b0 in tcp_read_req (con=0x2ba6070bb340, bytes_read=0x7fff05ee2444, read_flags=0x7fff05ee243c) at core/tcp_read.c:1631
#31 0x000000000064c49d in handle_io (fm=0x2ba5e3ba3d50, events=1, idx=-1) at core/tcp_read.c:1862
#32 0x00000000006505d8 in io_wait_loop_epoll (h=0xad0bc0, t=2, repeat=0) at core/io_wait.h:1061
#33 0x0000000000645f12 in tcp_receive_loop (unix_sock=37) at core/tcp_read.c:1974
#34 0x000000000063347a in tcp_init_children () at core/tcp_main.c:5083
#35 0x0000000000425978 in main_loop () at main.c:1728
#36 0x000000000042bd72 in main (argc=13, argv=0x7fff05ee2de8) at main.c:2666

(gdb) frame 1
#1  0x00002ba7077959fb in sql_reconnect (sc=0x2ba5e34aac10) at sql_api.c:187
187             sc->dbh = sc->dbf.init(&sc->db_url);
(gdb) list
182             }
183             if (sc->dbh!=NULL) {
184                     /* already connected */
185                     return 0;
186             }
187             sc->dbh = sc->dbf.init(&sc->db_url);
188             if (sc->dbh==NULL) {
189                     LM_ERR("failed to connect to the database [%.*s]\n",
190                                             sc->name.len, sc->name.s);
191                     return -1;

(gdb) frame 2
#2  0x00002ba7077a3448 in sql_check_connection (dbl=0x2ba5e34aac10) at sqlops.c:234
234             if(sql_reconnect(dbl)<0) {
(gdb) list
229                     LM_CRIT("no database handle with reconnect disabled\n");
230                     return -1;
231             }
232
233         LM_DBG("try to establish SQL connection\n");
234             if(sql_reconnect(dbl)<0) {
235                     LM_ERR("failed to connect to database\n");
236                     return -1;
237             }
238             return 0;

@miconda (Member) commented Feb 28, 2019

Can you get from gdb the output of the following commands:

frame 1
p sc
p *sc
@srinikart (Author) commented Mar 1, 2019

Hello Daniel,

Thank you for your reply. I have spent the last few weeks reviewing our system and have noticed that we have a few outdated shared libraries in use. unixODBC seems a bit dated, as does the ODBC Client SDK for the Informix database engine. I've updated the unixODBC libraries and have noticed that the segmentation fault appears to occur within the Informix CSDK libraries, in particular at a call to SQLFreeHandle. A few online searches have shown that there is indeed a memory violation within that function call when a protocol issue is encountered. No further detail regarding the protocol is given, but it fits the issue at hand very closely (a TCP disconnect indicating a protocol issue, then a segfault). The fix is an updated Informix CSDK library, which I am in the process of installing on my system. I am hoping this resolves the issue. Thank you again for taking a look into this; I will let you all know the results either way.

@srinikart (Author) commented Mar 7, 2019

Hello Daniel and Henning,

I have confirmed that it was indeed the outdated Informix CSDK libraries that were at fault. I have updated said libraries to the latest version (clientsdk.4.10.FC9DE) and the crash no longer occurs. Kamailio is able to detect that a database connection is severed, handles it gracefully, and reconnects to the offline database once it is back online. I appreciate your responses on this item. I will go ahead and close this ticket now.

Thank you,

Karthik

srinikart closed this Mar 7, 2019

@henningw (Contributor) commented Mar 8, 2019

Thank you for the confirmation.

@srinikart (Author) commented Mar 8, 2019

Well, it appears that over the course of the past several weeks I have managed to confuse myself about the issues at hand. Upon re-reading the ticket, there are two issues that I had uncovered. The first is a database engine going offline at some point during normal call processing and crashing Kamailio; this has been addressed by the updated Informix CSDK library, as I recently confirmed. The second is a database being offline at Kamailio startup, which crashes Kamailio per the steps below:

  • leave a database offline (shut down)
  • start up kamailio
  • kamailio starts even though the database is offline which is good
  • my test query begins to execute against the offline database and kamailio reacts gracefully to not
    being able to submit the queries to the offline database. also good.
  • now, if another entirely unrelated query is executed (a query against another database server for example) the program crashes.

The gdb output is exactly the same as previously pasted even with the updated Informix CSDK libraries. I was hoping the Informix CSDK update would solve this issue also, but it didn't.

Here are the results to your inquiry about gdb output.

(gdb) frame 1
#1  0x00002aec870c59fb in sql_reconnect (sc=0x2aeb63048b98) at sql_api.c:187
187             sc->dbh = sc->dbf.init(&sc->db_url);
(gdb) p sc
$4 = (sql_con_t *) 0x2aeb63048b98
(gdb) p *sc
$5 = {name = {s = 0x2aeb63048ab0 "voip=>unixodbc://xxxxxxx:yyyyyyy@localhost/voip", len = 4}, conid = 2025152179, db_url = {
    s = 0x2aeb63048ab6 "unixodbc://xxxxxxx:yyyyyyy@localhost/voip", len = 43}, dbh = 0x0, dbf = {cap = 0, use_table = 0, init = 0, init2 = 0, close = 0, query = 0,
    fetch_result = 0, raw_query = 0, free_result = 0, insert = 0, delete = 0, update = 0, replace = 0, last_inserted_id = 0, insert_update = 0, insert_delayed = 0,
    insert_async = 0, affected_rows = 0, start_transaction = 0, end_transaction = 0, abort_transaction = 0, query_lock = 0, raw_query_async = 0}, next = 0x0}

Thank you for looking into this.

srinikart reopened this Mar 8, 2019

@srinikart (Author) commented Apr 3, 2019

I have determined the root cause of the crash and will post an explanation here in the next day or so. If you are currently reviewing this issue, please set it aside for now. Thanks.

Karthik

@srinikart (Author) commented Apr 4, 2019

Daniel/Henning,

The root cause of the crash lies in sqlops/sql_api.c, in the function sql_connect. I have pasted that function below so we can reference it in the notes that follow:

int sql_connect(int mode)
{
	sql_con_t *sc;
	sc = _sql_con_root;
	while(sc)
	{
		if (db_bind_mod(&sc->db_url, &sc->dbf))
		{
			LM_DBG("database module not found for [%.*s]\n",
					sc->name.len, sc->name.s);
			return -1;
		}
		if (!DB_CAPABILITY(sc->dbf, DB_CAP_RAW_QUERY))
		{
			LM_ERR("database module does not have DB_CAP_ALL [%.*s]\n",
					sc->name.len, sc->name.s);
			return -1;
		}
		sc->dbh = sc->dbf.init(&sc->db_url);
		if (sc->dbh==NULL)
		{
			if(mode) {
				LM_ERR("failed to connect to the database [%.*s]\n",
						sc->name.len, sc->name.s);
				return -1;
			} else {
				LM_INFO("failed to connect to the database [%.*s] - trying next\n",
						sc->name.len, sc->name.s);
			}
		}
		sc = sc->next;
	}
	return 0;
}

Notice the if(mode) clause. It looks like the statements within it need to be reversed: if mode is set, continue trying to connect to the remaining database instances; if not, return an error immediately.

The setup for the crash begins to manifest when there are further database instances to connect to in the sql_con_t linked list and the code encounters an instance it cannot connect to, returning -1 prematurely.

If at a later time a query is submitted to one of those remaining instances (the ones we never attempted to connect to because of the premature return), sql_reconnect gets called, because the connection structure for that instance has been initialized. But since sql_connect never reached that instance, its sc->dbf member was never bound. Basically, this piece of code never executes for the remaining entries in the linked list:
if (db_bind_mod(&sc->db_url, &sc->dbf))

sc->dbf remains zeroed, and calling through it in sql_reconnect (sc->dbf.init is a null function pointer) produces the segmentation fault. This is clearly visible in the gdb output above, where every member of dbf is 0.

I have tested the code with the logic in the if(mode) statement reversed, and everything works well.

If you agree with my analysis, please let me know how we should proceed. Either I can make the change or you can; I am fine with either.

Thanks,

Karthik

@henningw (Contributor) commented Apr 5, 2019

Thank you for the analysis. I will do a quick test by myself and then probably fix it directly in the code.

henningw added a commit that referenced this issue Apr 5, 2019

@henningw (Contributor) commented Apr 5, 2019

Should be fixed with the referenced commit, I will also backport it.

henningw closed this Apr 5, 2019

@henningw (Contributor) commented Apr 5, 2019

Small addition: a backport is not necessary, as this is new functionality in git master.

@srinikart (Author) commented Apr 5, 2019

Perfect! Thank you for updating the code.

Karthik
