Segfault in version 1.0.1, read_parquet after creating a clickhouse odbc connection #31981

mvcalder-xbk · 2020-02-14T17:28:30Z

We started getting segfaults after upgrading pandas and tracked them down to an interaction with creating ODBC connections. The ODBC driver in question is the ClickHouse one, built in December 2019 from:

https://github.com/ClickHouse/clickhouse-odbc
commit 08b252b93fc771fab607fb973443543479b3d972

The core dump is reproducible with the example below:

Code Sample, a copy-pastable example if possible

import pyodbc
import pandas as pd

# Open and immediately close a connection; the connection itself is never used.
con_str = "Driver=libclickhouseodbc.so;url=http://clickhouse/query;timeout=600"
with pyodbc.connect(con_str, autocommit=True) as con:
    pass

# Writing the parquet file works.
df = pd.DataFrame({'A': [1, 1, 1], 'B': ['a', 'b', 'c']})
df.to_parquet('/tmp/foo.pq')
# This line core dumps:
pd.read_parquet('/tmp/foo.pq')

Problem description

In the code above, merely creating the ClickHouse ODBC connection (without using it) leads to a crash when reading the parquet file. The process aborts on an uncaught C++ exception:

terminate called after throwing an instance of 'std::bad_cast'
  what():  std::bad_cast
Aborted (core dumped)

This does not happen in pandas 0.25.3 but does in pandas 1.0.1.
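
Because the failure kills the whole interpreter, it can help to run the reproduction in a child process and report how it exited. The following is a minimal sketch, not part of the original report: the helper name `run_isolated` is hypothetical, it assumes a POSIX system (where a negative return code means the child was killed by a signal), and it uses only the standard library.

```python
import signal
import subprocess
import sys


def run_isolated(snippet, timeout=60.0):
    """Run `snippet` in a fresh interpreter and describe how it exited."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code is the number of the fatal signal.
        return "killed by " + signal.Signals(-proc.returncode).name
    return "exit code {}".format(proc.returncode)


# Illustrative usage: pass the reproduction script as a string, e.g.
#   run_isolated(open("repro.py").read())
# where repro.py is a hypothetical file holding the code sample above.
```

An abort caused by an uncaught C++ exception would be expected to show up as a fatal signal rather than as a normal exit code, which makes the result easy to compare across pandas versions.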

Expected Output

No core dump.

Odbc driver dependencies

        linux-vdso.so.1 (0x00007ffe02bee000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f246c269000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f246c061000)
        libltdl.so.7 => /usr/lib/x86_64-linux-gnu/libltdl.so.7 (0x00007f246be57000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f246ba66000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f246cd89000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f246b862000)
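
The listing above contains no libstdc++ entry, even though the abort message comes from the C++ runtime. One way to investigate this is to compare the shared objects mapped into the Python process before and after the ODBC connection is made. The sketch below is an illustration only: the function names are hypothetical, it is Linux-only because it reads `/proc/self/maps`, and whether the driver bundles its own C++ runtime is an assumption that this report does not confirm.

```python
import os


def shared_objects(maps_text):
    """Return the sorted, de-duplicated .so paths found in /proc/<pid>/maps text."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, and (optionally) pathname.
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if ".so" in os.path.basename(path):
                paths.add(path)
    return sorted(paths)


def loaded_shared_objects():
    """Linux-only: shared objects currently mapped into this process."""
    with open("/proc/self/maps") as fh:
        return shared_objects(fh.read())
```

Calling `loaded_shared_objects()` once before and once after `pyodbc.connect(...)` and taking the set difference would show which libraries the driver pulls into the process.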

Output of `pd.show_versions()`

pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1062.4.3.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.0
pip : 9.0.1
setuptools : 45.2.0
Cython : 0.29.14
pytest : 5.3.2
hypothesis : 5.1.2
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 0.999999999
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : 3.0.2
pandas_gbq : None
pyarrow : 0.15.1.dev539+g8cf0c8e0a
pytables : None
pytest : 5.3.2
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.11
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : None

WillAyd · 2020-02-14T23:22:05Z

Do you have the same issue with a stable version of pyarrow? We don't have a lot of C++ internally, so I think the root cause might be elsewhere.

jreback · 2020-02-15T00:30:12Z

this is from pyarrow

pls open an issue there

jorisvandenbossche · 2020-02-15T08:56:37Z

> This does not happen in pandas 0.25.3 but does in pandas 1.0.1.

Are you using the same pyarrow version in both environments?
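
A quick way to answer this is to print the relevant module versions in each environment and compare them. This is a small sketch with a hypothetical helper name, `module_versions`. It relies only on the `__version__` attribute that pandas and pyarrow expose, and it returns `None` for modules that are missing or have no such attribute.

```python
import importlib


def module_versions(names):
    """Map each module name to its __version__, or None if unavailable."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", None)
    return versions


# Illustrative usage, run once per environment:
#   print(module_versions(["pandas", "pyarrow", "pyodbc"]))
```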

mvcalder-xbk · 2020-02-18T12:57:34Z

Yes, this is the same pyarrow version in both cases. I'll open an issue there. Thank you.

Arrow JIRA ticket.

Clickhouse ODBC Issue.

Stacktrace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff7a24801 in __GI_abort () at abort.c:79
#2  0x00007ffff63c1957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff63c7ab6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff63c7af1 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff63c7d24 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff63c6a52 in __cxa_bad_cast () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff64131ec in std::__cxx11::collate<char> const& std::use_facet<std::__cxx11::collate<char> >(std::locale const&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007fffbe4b8279 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > std::__cxx11::regex_traits<char>::transform_primary<char const*>(char const*, char const*) const () from /usr/local/lib/libparquet.so.100
#9  0x00007fffbe4bd71c in std::__detail::_BracketMatcher<std::__cxx11::regex_traits<char>, false, false>::_M_ready() () from /usr/local/lib/libparquet.so.100
#10 0x00007fffbe4bda9e in void std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_insert_character_class_matcher<false, false>() () from /usr/local/lib/libparquet.so.100
#11 0x00007fffbe4c0569 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_atom() () from /usr/local/lib/libparquet.so.100
#12 0x00007fffbe4c0ad8 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative() () from /usr/local/lib/libparquet.so.100
#13 0x00007fffbe4c0a43 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative() () from /usr/local/lib/libparquet.so.100
#14 0x00007fffbe4c0d1c in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_disjunction() () from /usr/local/lib/libparquet.so.100
#15 0x00007fffbe4c1469 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_Compiler(char const*, char const*, std::locale const&, std::regex_constants::syntax_option_type) () from /usr/local/lib/libparquet.so.100
#16 0x00007fffbe4a93d1 in parquet::ApplicationVersion::ApplicationVersion(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/local/lib/libparquet.so.100
#17 0x00007fffbe4c1c03 in parquet::FileMetaData::FileMetaDataImpl::FileMetaDataImpl(void const*, unsigned int*, std::shared_ptr<parquet::Decryptor> const&) () from /usr/local/lib/libparquet.so.100
#18 0x00007fffbe4a9e62 in parquet::FileMetaData::FileMetaData(void const*, unsigned int*, std::shared_ptr<parquet::Decryptor> const&) () from /usr/local/lib/libparquet.so.100
#19 0x00007fffbe4a9ec2 in parquet::FileMetaData::Make(void const*, unsigned int*, std::shared_ptr<parquet::Decryptor> const&) () from /usr/local/lib/libparquet.so.100
#20 0x00007fffbe48acaf in parquet::SerializedFile::ParseUnencryptedFileMetadata(std::shared_ptr<arrow::Buffer> const&, long, long, std::shared_ptr<arrow::Buffer>*, unsigned int*, unsigned int*) () from /usr/local/lib/libparquet.so.100
#21 0x00007fffbe492d75 in parquet::SerializedFile::ParseMetaData() () from /usr/local/lib/libparquet.so.100
#22 0x00007fffbe48d8f8 in parquet::ParquetFileReader::Contents::Open(std::shared_ptr<arrow::io::RandomAccessFile>, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData>) () from /usr/local/lib/libparquet.so.100
#23 0x00007fffbe48e598 in parquet::ParquetFileReader::Open(std::shared_ptr<arrow::io::RandomAccessFile>, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData>) () from /usr/local/lib/libparquet.so.100
#24 0x00007fffbe3a89bd in parquet::arrow::FileReaderBuilder::Open(std::shared_ptr<arrow::io::RandomAccessFile>, parquet::ReaderProperties const&, std::shared_ptr<parquet::FileMetaData>) () from /usr/local/lib/libparquet.so.100
#25 0x00007fffbe7dc348 in __pyx_pf_7pyarrow_8_parquet_13ParquetReader_2open(__pyx_obj_7pyarrow_8_parquet_ParquetReader*, _object*, int, _object*, __pyx_obj_7pyarrow_8_parquet_FileMetaData*, int) ()
   from /usr/local/lib/python3.6/dist-packages/pyarrow-0.15.1.dev539+g8cf0c8e0a-py3.6-linux-x86_64.egg/pyarrow/_parquet.cpython-36m-x86_64-linux-gnu.so
#26 0x00007fffbe7dcbc9 in __pyx_pw_7pyarrow_8_parquet_13ParquetReader_3open(_object*, _object*, _object*) () from /usr/local/lib/python3.6/dist-packages/pyarrow-0.15.1.dev539+g8cf0c8e0a-py3.6-linux-x86_64.egg/pyarrow/_parquet.cpython-36m-x86_64-linux-gnu.so
#27 0x000000000050ac25 in _PyCFunction_FastCallDict (kwargs=<optimized out>, nargs=<optimized out>, args=<optimized out>, func_obj=<built-in method open of pyarrow._parquet.ParquetReader object at remote 0x7fffbfc6b938>) at ../Objects/methodobject.c:231
#28 _PyCFunction_FastCallKeywords (kwnames=<optimized out>, nargs=<optimized out>, stack=<optimized out>, func=<optimized out>) at ../Objects/methodobject.c:294
#29 call_function.lto_priv () at ../Python/ceval.c:4851
#30 0x000000000050d390 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3351
#31 0x0000000000508245 in PyEval_EvalFrameEx (throwflag=0, f=
    Frame 0x142a818, for file /usr/local/lib/python3.6/dist-packages/pyarrow-0.15.1.dev539+g8cf0c8e0a-py3.6-linux-x86_64.egg/pyarrow/parquet.py, line 137, in __init__ (self=<ParquetFile(reader=<pyarrow._parquet.ParquetReader at remote 0x7fffbfc6b938>) at remote 0x7fffc4b68cc0>, source='/tmp/foo.pq', metadata=None, common_metadata=None, read_dictionary=None, memory_map=False, buffer_size=0)) at ../Python/ceval.c:754
#32 _PyEval_EvalCodeWithName.lto_priv.1836 () at ../Python/ceval.c:4166
#33 0x0000000000509642 in _PyFunction_FastCallDict () at ../Python/ceval.c:5075
#34 0x0000000000595311 in _PyObject_FastCallDict (kwargs={'metadata': None, 'memory_map': False, 'read_dictionary': None, 'common_metadata': None, 'buffer_size': 0}, nargs=2, args=0x7fffffffc430, func=<function at remote 0x7fffbfc5e378>)
    at ../Objects/abstract.c:2310
#35 _PyObject_Call_Prepend (kwargs={'metadata': None, 'memory_map': False, 'read_dictionary': None, 'common_metadata': None, 'buffer_size': 0}, args=<optimized out>, obj=<optimized out>, func=<function at remote 0x7fffbfc5e378>) at ../Objects/abstract.c:2373
#36 method_call.lto_priv () at ../Objects/classobject.c:314
#37 0x000000000054a6ff in PyObject_Call (kwargs={'metadata': None, 'memory_map': False, 'read_dictionary': None, 'common_metadata': None, 'buffer_size': 0}, args=('/tmp/foo.pq',), func=<method at remote 0x7ffff7f67fc8>) at ../Objects/abstract.c:2261
#38 slot_tp_init () at ../Objects/typeobject.c:6420
#39 0x0000000000551b81 in type_call.lto_priv () at ../Objects/typeobject.c:915
---Type <return> to continue, or q <return> to quit---
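
In the visible frames, the abort is raised in libstdc++.so.6 (frames #2 to #7) while libparquet.so.100 compiles a regular expression (frames #8 to #24), called from pyarrow's `_parquet` extension. None of the visible native frames belong to pandas. To make that kind of summary repeatable for long gdb backtraces, the frames can be grouped by the library named after `from`. The sketch below uses hypothetical helper names and assumes the backtrace layout shown above, including frames whose `from ...` part wraps onto the next line.

```python
import re
from collections import Counter

_FRAME_START = re.compile(r"#\d+\s")
_FROM_LIBRARY = re.compile(r"\sfrom (/\S+)$")


def split_frames(backtrace):
    """Join wrapped continuation lines so that each frame is one string."""
    frames = []
    for line in backtrace.splitlines():
        if _FRAME_START.match(line):
            frames.append(line.strip())
        elif frames and line.strip():
            frames[-1] += " " + line.strip()
    return frames


def frames_per_library(backtrace):
    """Count frames per shared library; frames without 'from <path>' are skipped."""
    counts = Counter()
    for frame in split_frames(backtrace):
        match = _FROM_LIBRARY.search(frame)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Frames that gdb resolves to source files (the CPython interpreter frames here) carry no `from` clause and are deliberately left out of the count.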

jorisvandenbossche · 2020-02-20T13:38:04Z

For reference, the arrow issue is here: https://issues.apache.org/jira/browse/ARROW-7873

mvcalder-xbk mentioned this issue Feb 14, 2020

Segfault in pandas version 1.0.1, read_parquet after creating a clickhouse odbc connection ClickHouse/clickhouse-odbc#259

Closed

WillAyd added the "Needs Info" label (clarification about behavior needed to assess issue) Feb 14, 2020

WillAyd closed this as completed Feb 20, 2020
