At times Segfault during deconstruction after upgrade from 1.76 to 1.80 #111

Closed
nvinzens opened this issue Oct 28, 2019 · 66 comments

@nvinzens

Via the changelog we found a possibly relevant change:
https://metacpan.org/diff/file?target=MJEVANS/DBD-Oracle-1.80/&source=ZARQUON%2FDBD-Oracle-1.76#dbdimp.c

We use it in a rather complex internal tool and the segfault sometimes happens at the very end. Still we are able to consistently reproduce it.

How could we support you in finding the root cause?

@djzort
Collaborator

djzort commented Dec 6, 2019

Can you provide a script that causes the segfault?

@kbucheli

kbucheli commented Dec 6, 2019

Unfortunately not really. The script is huge (~70'000 lines) and does a lot. It only fails on certain data, but the difference is not easy to pin down, as it runs a lot of queries and only segfaults on the teardown of the driver.
Is there an easy option to log just the SQL interaction, to figure out the differences between working and non-working runs?

@djzort
Collaborator

djzort commented Dec 6, 2019

You can enable tracing? https://metacpan.org/pod/DBI#trace
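
As a minimal sketch of the tracing suggested above: DBI's `trace` method takes a trace setting and an optional file name, so the output can be kept out of STDERR and diffed between a working and a failing run. The file name `dbi_trace.log` is a placeholder, and the level chosen here is only a starting point, not a recommendation from the maintainers.

```perl
use strict;
use warnings;
use DBI;

# Placeholder file name (an assumption for this sketch).
my $trace_file = 'dbi_trace.log';

# Set the global default trace level and append all trace output to the
# file. Higher levels (up to 15) are more verbose. The setting may also
# be a string of flag names such as 'SQL', but whether a flag produces
# any output depends on the driver.
DBI->trace(2, $trace_file);

# ... connect and run the workload as usual ...

# Turn tracing off again once the interesting part has run.
DBI->trace(0);
```

The same setting can be applied without touching the script through the `DBI_TRACE` environment variable, for example `DBI_TRACE=2=dbi_trace.log perl your_test_script.pl`.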

@nvinzens
Author

I created 2 trace files, one from a run of the script that segfaults and one from a run where it doesn't.
The trace flag used was 'DBD'.
traces.zip

@abraxxa

abraxxa commented Jan 2, 2020

We're having the same issue with each test case that uses DBIx::Class, for example via Test::WWW::Mechanize::Catalyst.
Our workaround is to call $schema->storage->disconnect before done_testing.

@CarstenGrohmann

I created 2 trace files, one from a run of the script that segfaults and one from a run where it doesn't.
The trace flag used was 'DBD'.
traces.zip

The trace files are quite big. Each contains around 2 million lines. Can you repeat your test with DBI_TRACE=5 perl your_test_script.pl and share the part with the DESTROY lines (as shown in #65 (comment)) plus the last 100 lines before the DESTROY lines?

Does anybody have a stack trace or a core dump?

@CarstenGrohmann

We're having the same issue with each test case that uses DBIx::Class, for example via Test::WWW::Mechanize::Catalyst.
Our workaround is to call $schema->storage->disconnect before done_testing.

Yes, doing an explicit disconnect e.g. with $dbh->disconnect prevents the segfault. That's an experience with #65.
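
For scripts where the handles are not easy to reach at exit, the explicit-disconnect workaround could be sketched as an END block that walks DBI's own handle tree. This assumes only documented DBI attributes (`installed_drivers`, `ChildHandles`, `Active`); the helper name `disconnect_all_handles` is made up for this example, and it is a workaround sketch, not a fix for the underlying driver problem.

```perl
use strict;
use warnings;
use DBI;

# Hypothetical helper: disconnect every database handle that is still
# active and return how many were disconnected.
sub disconnect_all_handles {
    my $count = 0;
    my %drivers = DBI->installed_drivers;    # driver name => driver handle
    for my $drh (values %drivers) {
        # ChildHandles holds weak references, so entries can be undef
        # once a handle has already been destroyed.
        for my $dbh (grep { defined } @{ $drh->{ChildHandles} || [] }) {
            next unless $dbh->{Active};
            $dbh->disconnect;
            $count++;
        }
    }
    return $count;
}

# END blocks run before Perl's global destruction phase, so the driver
# is still intact when the handles are disconnected here.
END { disconnect_all_handles() }
```

Because the block runs for every driver loaded in the process, it also closes non-Oracle connections; restrict the loop to `$drivers{Oracle}` if that is not wanted.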

@nvinzens
Author

nvinzens commented Jan 6, 2020

Made new traces with trace level 5.
traces.zip

@mrdvt92

mrdvt92 commented Mar 27, 2020

/bin/sh: line 1: 90272 Segmentation fault (core dumped)

I have this random issue too but I'm not sure if it was the upgrade from oracle-instantclient12.2 to oracle-instantclient19.6 or the upgrade from perl-DBD-Oracle-1.74-12.2.0.1.0 to perl-DBD-Oracle-1.80-19.6.0.0.0.

I'll try to downgrade perl-DBD-Oracle to see if we still get the random segfaults, but it is weird.

@djzort
Collaborator

djzort commented May 20, 2020

@mrdvt92 it would almost certainly be 1.80 of DBD::Oracle

@whindsx

whindsx commented May 28, 2020

It happens for me as well in 1.791.

In my case I'm able to recreate it in situations where there are multiple connections, at least one of them lives outside the main script, and no disconnect is called.

Ex.

connect.pl

use DBI;
use DBD::Oracle;

$dbh = DBI->connect("dbi:Oracle:$DATABASE", $USER, $PASSWORD);

require("connect.inc");

#$dbh->disconnect;

connect.inc

$dbh2 = DBI->connect("dbi:Oracle:$DATABASE", $USER, $PASSWORD);

Uncommenting $dbh->disconnect does fix the Seg Fault in this example. Setting local scope for $dbh2 also fixes it.

Perl 5.30.0 (with threads)
DBD::Oracle 1.791
InstantClient 12.2.0.1.0

@demianriccardi

I am also observing this issue for a module with multiple Oracle connections. Using installs from BackPAN, I was able to zero in on a change between versions 1.75_2 (no segmentation fault) and 1.77_1 (segmentation fault). [There were no versions available in between.]

I also see the segmentation fault clear out if there is an explicit disconnect for 1.77_1 and beyond.

Perl 5.30.2 (no threads) [also observed for 5.30.1 with threads]
InstantClient 12_2

@shild

shild commented Jun 20, 2020

We just ran into this issue by upgrading to the 19c client. Here is what I sent to the dbi-users list:

More info, this error does not occur with DBD::Oracle 1.76.

DBD::Oracle 1.80 => works with 18c client, but fails with 19c.
DBD::Oracle 1.76 => works with all client versions.

On 6/19/20 5:48 PM, Scott wrote:

We have run into an issue when we upgraded to Oracle client 19c. Some of the users' processes are segfaulting on exit.

#0 0x00007f82ee84ccc0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1 0x00007f82e6444f43 in kputxabt () from /u01/app/oracle/product/19.3.0.0/lib/libclntsh.so.19.1
#2 0x00007f82e926e6c3 in ora_db_rollback () from /usr/local/perl-5.22.0-thr/lib/site_perl/5.22.0/x86_64-linux-thread-multi/auto/DBD/Oracle/Oracle.so
#3 0x00007f82e9266b11 in XS_DBD__Oracle__db_DESTROY () from /usr/local/perl-5.22.0-thr/lib/site_perl/5.22.0/x86_64-linux-thread-multi/auto/DBD/Oracle/Oracle.so
#4 0x00007f82ed10291d in XS_DBI_dispatch () from /usr/local/perl-5.22.0-thr/lib/site_perl/5.22.0/x86_64-linux-thread-multi/auto/DBI/DBI.so

I tested the same process on a server still using the 18c client and the core dump does not happen.

I'm assuming this is the change causing the segfault with the 19c client:

Destroy envhp with last dbh (GH#93, GH#89, Dean Hamstead, CarstenGrohmann)

@dfskoll

dfskoll commented Aug 6, 2020

This appears to have something to do with global destruction. The following code segfaults:

use DBI;
use DBD::Oracle;

{
    $dbh = DBI->connect("dbi:Oracle:XEPDB1", 'db', 'password');
    $dbh2 = DBI->connect("dbi:Oracle:XEPDB1", 'db', 'password');
    print "dbh = $dbh\n";
    print "dbh2 = $dbh2\n";
}

whereas the following code does not:

use DBI;
use DBD::Oracle;

{
    my $dbh = DBI->connect("dbi:Oracle:XEPDB1", 'db', 'password');
    my $dbh2 = DBI->connect("dbi:Oracle:XEPDB1", 'db', 'password');
    print "dbh = $dbh\n";
    print "dbh2 = $dbh2\n";
}

So there must be some object that's being destroyed in the wrong order when global destruction happens. (Tested on Perl 5.16.3, CentOS 7.8, DBD::Oracle 1.80, Oracle 18c)
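
The difference between the two variants can be illustrated without Oracle at all. The sketch below uses a made-up `Tracker` class as a stand-in for a database handle. It shows only the general Perl behaviour: a `my` variable is destroyed when its scope ends, while an object held in a package variable survives until global destruction. It says nothing about what DBD::Oracle does internally.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a database handle; it only records when it dies.
package Tracker;

our @LOG;

sub new {
    my ($class, $name) = @_;
    return bless { name => $name }, $class;
}

sub DESTROY {
    my ($self) = @_;
    # ${^GLOBAL_PHASE} is 'RUN' during normal execution and 'DESTRUCT'
    # during global destruction (available since Perl 5.14).
    my $line = "$self->{name} destroyed during ${^GLOBAL_PHASE}";
    push @LOG, $line;
    print "$line\n";
}

package main;

# Package variable: lives until the interpreter shuts down.
our $global = Tracker->new('package variable');

{
    my $lexical = Tracker->new('lexical');
}   # $lexical goes out of scope here and is destroyed immediately

print "end of script\n";
```

The lexical object reports the `RUN` phase before "end of script" is printed, and the package variable reports `DESTRUCT` afterwards. Perl does not guarantee the order in which objects are destroyed during that last phase, which fits the observation that only the variant without `my` crashes.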

@dfskoll

dfskoll commented Aug 6, 2020

I added some debugging code. The one that does not segfault (with the my variables) prints this:

In destructor: Calling dbd_db_disconnect
SessionEnd: 0x18c0a30 0x1890608 0x18906f0 0x188f918
In destructor: Back from dbd_db_disconnect
In destructor: Calling dbd_db_disconnect
SessionEnd: 0x176cba0 0x1879fb8 0x187a0a0 0x1882f40
In destructor: Back from dbd_db_disconnect
In dbd_dr_destroy

The one that does segfault prints this:

In destructor: Calling dbd_db_disconnect
SessionEnd: 0xa27c10 0x9f77e8 0x9f78d0 0x9f6af8
In destructor: Back from dbd_db_disconnect
In dbd_dr_destroy
In destructor: Calling dbd_db_disconnect
SessionEnd: 0x8d3d10 0x9e1198 0x9e1280 0x9ea120
Segmentation fault

Notice how in the one that segfaults, dbd_dr_destroy is called before the second $dbh destructor is called. The global destructor is destroying objects in the wrong order.

@dfskoll

dfskoll commented Aug 6, 2020

The attached patch fixes the problem for me. I would not say I'm particularly happy with this patch; I see it more as a workaround than a proper fix, but I'm attaching it for anyone who wants to try it out.

dbd-oracle.patch.txt

@whindsx

whindsx commented Aug 7, 2020

The attached patch fixes the problem for me. I would not say I'm particularly happy with this patch; I see it more as a workaround than a proper fix, but I'm attaching it for anyone who wants to try it out.

dbd-oracle.patch.txt

Unfortunately the patch did not work for my case. I still got the same seg fault.

It would be nice to have a proper fix for this because as it is now my $work is locked in at v1.76.

@djzort
Collaborator

djzort commented Aug 9, 2020

What perl and oracle versions did people try with this patch?

@dfskoll

dfskoll commented Aug 10, 2020 via email

@whindsx

whindsx commented Aug 10, 2020

What perl and oracle versions did people try with this patch?

CentOS 7.8.2003
Perl 5.30.0
DBI 1.642
InstantClient 12.2
Oracle Database 19c

CentOS 8.2.2004
Perl 5.32.0
DBI 1.643
InstantClient 19.8
Oracle Database 19c

I tried with both my case and the @dfskoll case. I'm not using Oracle XE btw.

I applied the patch correctly. I don't know what I'm doing wrong, it seems like it should work. Maybe someone else can give it a go.

@djzort
Collaborator

djzort commented Aug 11, 2020

The attached patch fixes the problem for me. I would not say I'm particularly happy with this patch; I see it more as a workaround than a proper fix, but I'm attaching it for anyone who wants to try it out.

dbd-oracle.patch.txt

Can you flip this over to a pull request? That will have it run through Travis

@dfskoll

dfskoll commented Aug 11, 2020 via email

@cjbj

cjbj commented Aug 12, 2020 via email

@djzort
Collaborator

djzort commented Aug 12, 2020

@mjegh are you around? It's looking like time for a release.

@mjegh
Member

mjegh commented Aug 12, 2020

I'm not sure I can. I retired and don't have access to Oracle now, so I cannot even run the test suite. Also, the Linux machine I did the build on was at work. I might be able to get access for a while at the weekend. Can you point me at the Dist::Zilla instructions you gave me before, as I can't find them? I'll try to work out what has been changed, as I've not been keeping up.

dfskoll added a commit to dfskoll/DBD-Oracle that referenced this issue Aug 12, 2020
This patch fixes perl5-dbi#111

During global destruction, the function dbd_dr_destroy is sometimes called
before all handles are destroyed.  It frees resources used in the handle
DESTROY function, causing a segfault when the handle DESTROY function tries
to disconnect the handle.

This patch simply sets a flag in dbd_dr_destroy which makes per-handle
DESTROY functions skip trying to disconnect the handle.
@dfskoll

dfskoll commented Aug 12, 2020 via email

@avorop

avorop commented Feb 12, 2022

I've tried to rewrite the login6 function to support more concise caching of OCIEnv*. The attached patch fixes 2 problems:

  1. Segfault at time of cleanup.
  2. Problem with multiple charset in multiple connections to Oracle.

The rewrite is relatively large. I had to add refcounting to cached environments and removed the use of global variables for storing information about charsets. Additionally, there are a few fixes that silence warnings (which really were small errors).

I've added 2 tests. One reproduces the segfault problem, and the other the problem with different charsets.

Well, everything is relative. The tests work in Cygwin. Also, I didn't have a chance to test DRCP and shared connections, though I suspect that support for shared connections is broken. At least it looks very suspicious.

It would be good, if someone tries to run it in other environments, since even copying of code could introduce some unexpected side-effects.

patch.txt

@djzort
Collaborator

djzort commented Feb 16, 2022

@avorop wow awesome effort!

@djzort
Collaborator

djzort commented Feb 16, 2022

@avorop are you able to submit this as a pull request?

@djzort
Collaborator

djzort commented Feb 16, 2022

I've pulled the patch into a commit, with one very minor tweak to get it to apply. Here is the commit on a branch for people to test:

https://github.com/perl5-dbi/DBD-Oracle/tree/gh111

@djzort
Collaborator

djzort commented Feb 16, 2022

it looks like there are 64bit windows accommodations in that diff too

@whindsx

whindsx commented Feb 18, 2022

I can't comment on the charset issue or DRCP. But the segfault issue was ever present in my $work environment. So far I'm happy to say I'm unable to reproduce the segfault after quite a bit of testing.

CentOS 7.9.2009 3.10.0-1160.53.1.el7.x86_64
Perl 5.30.0 (threads)
DBI 1.642
Oracle Instant Client 19.9 and 21.4

I am seeing this though:

$ prove -bv t/14threads.t 
t/14threads.t .. 
ok 1 - session 0 created
ok 2 - session 1 matches previous session
ok 3 - session 2 matches previous session
ok 4 - session 3 matches previous session
ok 5 - session 4 matches previous session
ok 6 - one imp_data in pool
ok 7 - thread gets two separate sessions
ok 8 - get same session after free
ok 9 - two imp_data in pool
Attempt to free unreferenced scalar: SV 0x1ac1530, Perl interpreter: 0x603010 at t/lib/DBDOracleTestLib.pm line 196.
Attempt to free unreferenced scalar: SV 0x7fa0c8001598, Perl interpreter: 0x603010 at t/lib/DBDOracleTestLib.pm line 196.
ok 10 - thread 0, loop 1 created session
ok 11 - thread 0, loop 2 matches previous session
ok 12 - thread 0, loop 3 matches previous session
ok 13 - thread 1, loop 1 matches previous session
ok 14 - thread 1, loop 2 matches previous session
ok 15 - thread 1, loop 3 matches previous session
ok 16 - thread 2, loop 1 matches previous session
ok 17 - thread 2, loop 2 matches previous session
ok 18 - thread 2, loop 3 matches previous session
ok 19 - pool empty
1..19
ok
All tests successful.
Files=1, Tests=19,  1 wallclock secs ( 0.02 usr  0.00 sys +  0.33 cusr  0.08 csys =  0.43 CPU)
Result: PASS

And this rare one that is probably related:

t/12impdata.t ............. ok 
t/14threads.t ............. All 9 subtests passed 
t/15nls.t ................. ok 

Test Summary Report
-------------------
t/14threads.t           (Wstat: 11 Tests: 9 Failed: 0)
  Non-zero wait status: 11
  Parse errors: No plan found in TAP output
Files=41, Tests=2208, 18 wallclock secs ( 0.35 usr  0.06 sys +  6.19 cusr  1.54 csys =  8.14 CPU)
Result: FAIL
Failed 1/41 test programs. 0/2208 subtests failed.
make: *** [test_dynamic] Error 255

So far it seems pretty good. I'll keep running this patch for a while and see if anything else comes up. Hopefully others can test as well.

Thanks @avorop.

@djzort
Collaborator

djzort commented Feb 24, 2022

t/14threads.t

if you can just run perl -Ilib t/14threads.t you should get a full error output.

@avorop

avorop commented Feb 24, 2022 via email

@avorop

avorop commented Feb 25, 2022

It appears I was wrong. My perl does have ithreads, at least in Cygwin, where I spent some time trying to understand where the issue comes from. It is connected to my changes. t/14threads.t uses imp_data to copy connection information between threads. I create an SV to hold envhp, and a pointer to it is stored in imp_dbh. When imp_data is captured, this pointer is copied as is, and the pointers to the allocated envhp and other handles are copied in the same way. This data is then passed to another thread, which uses it to access the data pointed to. Everything works, EXCEPT that all the SVs suddenly get funny reference counts. Here is a trace for one such SV:

init_drh 8011e0b30 new envhp 4f48d0 in SV 8013d5588
envhp from SV 8013d5588
Using env SV 8013d5588 for connection (refs 2)
dbd_db_login6 skip connect (impset). Env in SV 8013d5588 (refs 0)
dbd_db_login6 skip connect (impset). Env in SV 8013d5588 (refs 36)
dbd_db_login6 skip connect (impset). Env in SV 8013d5588 (refs 36)
dbd_db_login6 skip connect (impset). Env in SV 8013d5588 (refs 0)
dbd_db_login6 skip connect (impset). Env in SV 8013d5588 (refs 0)
dbd_db_login6 skip connect (impset). Env in SV 8013d5588 (refs 0)
dbd_db_login6 skip connect (impset). Env in SV 8013d5588 (refs 0)
dbd_db_login6 skip connect (impset). Env in SV 8013d5588 (refs 0)
dbd_db_login6 skip connect (impset). Env in SV 8013d5588 (refs 0)
dbd_db_login6 skip connect (impset). Env in SV 8013d5588 (refs 0)
dbd_db_login6 skip connect (impset). Env in SV 8013d5588 (refs 0)
dbd_db_login6 skip connect (impset). Env in SV 8013d5588 (refs 0)
dbd_db_login6 skip connect (impset). Env in SV 8013d5588 (refs 0)
dbd_db_login6 skip connect (impset). Env in SV 8013d5588 (refs 0)
dbd_db_login6 skip connect (impset). Env in SV 8013d5588 (refs 0)
dbd_db_login6 skip connect (impset). Env in SV 8013d5588 (refs 0)
Decrement envhp SV 8013d5588
Attempt to free unreferenced scalar: SV 0x8013d5588, Perl interpreter: 0x800000560 at t/lib/DBDOracleTestLib.pm line 196.

The SV is not visible to Perl, so perl should not mess with its refcount. And that is the case, unless imp_data is captured. But in the trace above the reference count changes, and there are even really funny values like 36. Then I came across this passage in perldoc perlapi:

        On some platforms, Windows for example, all allocated memory
        owned by a thread is deallocated when that thread ends. So if
        you need that not to happen, you need to use the shared memory
        functions, such as "savesharedpvn".

That implies that on Windows such blind copying of pointers between threads is dangerous, and if the memory gets overwritten, it would result in a crash. I don't know whether handles allocated by Oracle suffer from the same problem or whether it uses some memory outside of the current thread. Since there are not so many crashes, the latter must be true.

Anyway, below is patch that protects against "Freeing unreferenced SV". Though since I've spent so much time on this already, I shall try to rewrite current support of multi-threading to fix memory leaking when threads are in use. At least it appears to be possible. I'll let you know how it goes.

patch.txt

@andynmaas

Our system uses DBD::Oracle 1.74 and Oracle client 12.1.0.2. As soon as I pointed it to the Oracle 19.3.0.0 client, we got a lot of core dumps caused by segmentation faults. I then compiled DBD::Oracle 1.83 against Oracle client 19.3.0.0, and it got worse, with more segmentation faults. Downgrading to 1.76 gave fewer segfaults. Downgrading to 1.74 was much better, but we still got segfaults. After adding ulimit -u unlimited it is now stable, with no more segmentation faults so far. I hope that is the resolution.

@djzort
Collaborator

djzort commented Mar 15, 2022

Our system uses DBD::Oracle 1.74 and Oracle client 12.1.0.2. As soon as I pointed it to the Oracle 19.3.0.0 client, we got a lot of core dumps caused by segmentation faults. I then compiled DBD::Oracle 1.83 against Oracle client 19.3.0.0, and it got worse, with more segmentation faults. Downgrading to 1.76 gave fewer segfaults. Downgrading to 1.74 was much better, but we still got segfaults. After adding ulimit -u unlimited it is now stable, with no more segmentation faults so far. I hope that is the resolution.

Please try this branch #147

@djzort
Collaborator

djzort commented Mar 15, 2022

@avorop i have applied your patch on #147

everyone should please try it

@djzort
Collaborator

djzort commented Mar 24, 2022

@sunnavy

sunnavy commented Apr 4, 2022

I tested with 19c and 1.90_1 works for me, thanks!

-sunnavy

bestpractical-mirror pushed a commit to bestpractical/rt that referenced this issue Apr 5, 2022
@djzort
Collaborator

djzort commented Apr 21, 2022

1.90_3 is now tagged, watch for it on metacpan soon

@djzort djzort added this to the v1.90 milestone May 7, 2022
@djzort
Collaborator

djzort commented May 7, 2022

I have created a v1.90 milestone and attached this issue to it

@andynmaas

andynmaas commented May 7, 2022 via email

@djzort
Collaborator

djzort commented Aug 11, 2022

@andynmaas can you provide more details as to how you are consistently creating a segfault?

@andynmaas

andynmaas commented Aug 11, 2022 via email

@djzort
Collaborator

djzort commented Aug 16, 2022

Would you be able to make some tiny script that replicates it?

@djzort
Collaborator

djzort commented Aug 31, 2022

@andynmaas

andynmaas commented Oct 11, 2022 via email

@whindsx

whindsx commented Oct 17, 2022

Hi, I have tested using Oracle client 19.3.0.0. It failed test t25, three of 83 tests. It failed intermittently after around 4000 runs with a segmentation fault. It also failed on DBD::Oracle 1.74, 1.76, and 1.84. Using Oracle 18.5.0.0 or 12.1.0.2, make test passes when connecting to the database and never fails with a segmentation fault.

I would say it is an issue with Oracle Client 19.3 in particular. I ran 25plsql.t 4000 times without issue against 19.6 and cand-v1.90 (on a docker). The current client version available from Oracle is 19.16 so I would use that.

I don't see this as an issue with DBD::Oracle cand-v1.90.

@djzort djzort closed this as completed Mar 22, 2023
@andynmaas

andynmaas commented Nov 21, 2023 via email

@avorop

avorop commented Nov 21, 2023 via email

@damil

damil commented Mar 7, 2024

For info : I also observed segfaults with DBD::Oracle 1.83, ocli client 21.8.0.0 and server 19.0.0.0.0.
This only occurs when the same process connects to two databases, using connect_cached(). With regular connect() no segfault happens. Observed both on Windows and Linux.

The workaround was to add an END block in our module responsible for opening database connections:

END {
  if (my %drivers = DBI->installed_drivers) {
    if (my $ora_driver = $drivers{Oracle}) {
      my $CachedKids_hashref = $ora_driver->{CachedKids};
      # warn "cleaning DBI cache while exiting Local::Ctx::Plugin::Db\n" if $CachedKids_hashref;
      %$CachedKids_hashref = () if $CachedKids_hashref;
    }
  }
}
