
BUG: avoid segfaults when casting from boost_python types #20507

Closed · wants to merge 1 commit

Conversation


@dwpaley dwpaley commented Dec 3, 2021

For calls to PyArray_GetCastingImpl where `from` is the dtype of a
boost_python class, it is not safe to assume that the pointer
(NPY_DType_Slots *)(from)->dt_slots is non-NULL. This causes
segfaults when resolving overloaded calls to boost_python methods, so
we check the pointer before dereferencing it.

This would resolve boostorg/python#376 and several more issues in downstream projects. For example cctbx_project has been pinned to numpy 1.20 for a while: cctbx/cctbx_project#627

Testing this is a little tricky because the problem only occurs when building boost_python modules in an environment with NumPy. If a regression test is needed I'm happy to try to figure something out, but I would appreciate any examples of C++ code that gets built for testing, and how to add it to the build system.
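The shape of the proposed guard can be sketched with a toy Python analogue (hypothetical names; the real change is a NULL check in NumPy's C internals, not this code):

```python
def get_casting_impl(dt_slots, to):
    # Toy stand-in for the C-level lookup: dt_slots plays the role of
    # (NPY_DType_Slots *)(from)->dt_slots, which can be NULL/None for a
    # dtype-like object that was never registered with NumPy (the
    # boost_python case), so bail out instead of dereferencing it.
    if dt_slots is None:
        return None
    return dt_slots.get(to)

print(get_casting_impl(None, "float64"))                      # unregistered: no crash
print(get_casting_impl({"float64": "cast-impl"}, "float64"))  # normal lookup
```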

@seberg
Member

seberg commented Dec 3, 2021

Could you explain how it is possible that the dtype is not fully initialized? Is boost::python intentionally getting away with creating a NumPy PyArray_Descr object without ever registering it with NumPy?

NumPy should fill all of this in once you call PyArray_RegisterDataType. I am a bit shocked that it is even possible to get very far without doing that (although I suppose e.g. type-numbers are not used that much for user dtypes in the old system).

@dwpaley
Author

dwpaley commented Dec 3, 2021

As demonstrated by @jawsc-alan on the boost issue, this occurs when Py_Initialize() or np::initialize() are called inside a BOOST_PYTHON_MODULE definition. Does that give any clues? Does it short-circuit type registration for the module that's in the middle of being initialized?

I have debug builds of all this stuff but I'm not super familiar with it. Willing to experiment with it if you have any suggestions.

If it's helpful, I will include the defective boost code before and after macro expansion:

minipy.cpp
#include <boost/python.hpp>
#include <boost/python/extract.hpp>
#include <boost/python/numpy.hpp>
#include <string>
#include <sstream>
#include <vector>

using namespace std;
using namespace boost::python;
namespace np = boost::python::numpy;

class A {
public:
  A():sum(0.0) {}

  void add(double d) {sum += d;}
  void add(const A& s) {sum += s.sum;}

  void mul(double d) {sum *= d;}
  double val() const { return sum;}
private:
  double sum;
};

BOOST_PYTHON_MODULE(minipy)
{
  Py_Initialize();
  if(0) np::initialize(false);
  else np::initialize();

  class_<A>("A", init<>())
  .def("add", static_cast<void (A::*) (const A&)>(&A::add))
  .def("add", static_cast<void (A::*) (double)>(&A::add))
  .def("mul", &A::mul)
  .def("__call__", &A::val)
  ;
};
minipy.cpp expanded
class A {
public:
  A():sum(0.0) {}

  void add(double d) {sum += d;}
  void add(const A& s) {sum += s.sum;}

  void mul(double d) {sum *= d;}
  double val() const { return sum;}
private:
  double sum;
};

void init_module_minipy();

extern "C" __attribute__((__visibility__("default"))) PyObject* PyInit_minipy() {
  static PyModuleDef_Base initial_m_base = { { 1,
# 25 "src/minipy.cpp" 3 4
__null
# 25 "src/minipy.cpp"
}, 0, 0, 0 };

static PyMethodDef initial_methods[] = { { 0, 0, 0, 0 } };

static struct PyModuleDef moduledef = { initial_m_base, "minipy", 0, -1, initial_methods, 0, 0, 0, 0, };

return boost::python::detail::init_module( moduledef, init_module_minipy );
}

void init_module_minipy()
{
  Py_Initialize();
  if(0) np::initialize(false);
  else np::initialize();

  class_<A>("A", init<>())
  .def("add", static_cast<void (A::*) (const A&)>(&A::add))
  .def("add", static_cast<void (A::*) (double)>(&A::add))
  .def("mul", &A::mul)
  .def("__call__", &A::val)
  ;
};

@seberg
Member

seberg commented Dec 3, 2021

Thanks. My problem right now is that I spent an hour yesterday trying to get either a test setup running or to understand what boost::python generates here, and I didn't make fast progress on either :/.

For a setup: Is there a very simple setup.py that will work with this example? That would be familiar ground for me to work with, so I don't have to find the include paths for boost, etc.

Other questions that should help me get started (if you have pointers):

  • One of the backtraces showed PyArray_EquivTypes being called. But I am unsure on what exactly, or why/from where?
    • Even more interesting: what exactly does get passed into it?
  • The examples seem to use scalar values; I am still confused why this would go into code paths designed for NumPy types at all.
  • The dtypes generated in dtype.cpp in boost::python are confusing: Is that code to auto-generate new dtypes/descriptors for users, or is it actually only trying to get the existing NumPy ones!?

To clarify my problem: As much as I hate that this broke boost::python, the proposed fix here looks like a very small band-aid for a potentially enormous wound. A good fix seems far more likely in boost::python but right now I am not even quite sure what exactly is the issue.

@dwpaley
Author

dwpaley commented Dec 3, 2021

I don't have a setup.py, but could you use a Docker image? I put it on Dockerhub as dwpaley/numpy_boostpy_segfaut (note typo in the name). The Dockerfile is on the Boost issue and reproduced here:

Dockerfile
FROM fedora:35

RUN dnf install -y dnf-plugins-core which git findutils gdb valgrind gcc g++
RUN dnf install -y glibc-devel libffi-devel libstdc++-devel openssl-devel \
  zlib-devel
RUN dnf install -y python3-devel cython
RUN dnf debuginfo-install -y python3 glibc libffi libgcc libstdc++ openssl zlib

RUN dnf install -y bzip2
RUN mkdir /boost && \
  cd /boost && \
  curl -LO https://boostorg.jfrog.io/artifactory/main/release/1.77.0/source/boost_1_77_0.tar.bz2 && \
  bunzip2 boost_1_77_0.tar.bz2 && \
  tar -xf boost_1_77_0.tar

RUN ln -s `which python3` /usr/bin/python
RUN pip3 install numpy
RUN cd /boost/boost_1_77_0/tools/build && \
  ./bootstrap.sh && \
  ./b2 install --prefix=/boost/b2 && \
  cd /boost/boost_1_77_0 && \
  /boost/b2/bin/b2 --with-python --prefix=/boost_build -d0 variant=debug install


RUN dnf debuginfo-install -y cython
RUN mkdir /numpy && \
  cd /numpy && \
  git clone https://github.com/numpy/numpy && \
  cd numpy && \
  git submodule update --init && \
  CFLAGS="-g" CPPFLAGS="-g" python setup.py build_ext -i

RUN mkdir /work && \
  cd /work && \
  mkdir lib objs src

COPY minipy.cpp /work/src
COPY repro.py /work/lib

RUN cd /work && \
  g++ -g -I/boost_build/include/ -I/usr/include/python3.10 -fPIC -std=c++11  -Iinclude   -MMD -c -o objs/minipy.o src/minipy.cpp && \
  g++ -g -Wl,-rpath,/boost_build/lib -shared -o lib/minipy.so objs/minipy.o -L/boost_build/lib -Llib -ldl -lboost_python310 -lboost_numpy310


RUN mv /usr/local/lib64/python3.10/site-packages/numpy /usr/local/lib64/python3.10/site-packages/numpy.nerf && \
  ln -s /numpy/numpy/numpy /usr/local/lib64/python3.10/site-packages/numpy

To run the code: $ cd /work/lib; gdb --args python repro.py. And to rebuild the Python module:

$ cd /work && \
  g++ -g -I/boost_build/include/ -I/usr/include/python3.10 -fPIC -std=c++11  -Iinclude   -MMD -c -o objs/minipy.o src/minipy.cpp && \
  g++ -g -Wl,-rpath,/boost_build/lib -shared -o lib/minipy.so objs/minipy.o -L/boost_build/lib -Llib -ldl -lboost_python310 -lboost_numpy310

When breaking on PyArray_EquivTypes at numpy/core/src/multiarray/multiarraymodule.c:1477, there are 54 calls while importing minipy and 1 more call when calling the overloaded method right before the segfault. For the first 54 calls the types are like PyDoubleArrType_Type, PyFloatArrType_Type etc. On the final call, these are type1 and type2:

Breakpoint 1, PyArray_EquivTypes (type1=0x555b63727250, type2=0x7f6181f54060 <DOUBLE_Descr>) at numpy/core/src/multiarray/multiarraymodule.c:1477
1477	{
(gdb) p *type1
$2 = {ob_base = {ob_refcnt = 6,
    ob_type = 0x7f618237dd20 <boost::python::class_metatype_object>},
  typeobj = 0x0, kind = 96 '`', type = 19 '\023', byteorder = 66 'B',
  flags = -126 '\202', type_num = 32609, elsize = 48, alignment = 0,
  subarray = 0x1, fields = <unknown at remote 0x7f6182a305e0>, names = 0x0,
  f = 0x0, metadata = 0x0, c_metadata = 0x555b637273e8, hash = 140056781150864}
(gdb) p *type2
$3 = {ob_base = {ob_refcnt = 14, ob_type = 0x555b637b7b70},
  typeobj = 0x7f6181f6bd40 <PyDoubleArrType_Type>, kind = 102 'f', type = 100 'd',
  byteorder = 61 '=', flags = 0 '\000', type_num = 12, elsize = 8, alignment = 8,
  subarray = 0x0, fields = None, names = 0x0,
  f = 0x7f6181f540c0 <_PyDouble_ArrFuncs>, metadata = 0x0, c_metadata = 0x0,
  hash = 4565598238590070520}

If the calls to Py_Initialize() and boost::python::numpy::initialize() are removed, then this isn't hit at all.
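For contrast with the garbage in *type1, the genuine DOUBLE_Descr fields shown in the *type2 dump are also visible from Python, so it's easy to confirm that type2 is just NumPy's ordinary double descriptor:

```python
import numpy as np

# These match the gdb dump of *type2: kind 'f', type/char 'd',
# type_num 12, elsize 8, alignment 8.
d = np.dtype(np.double)
print(d.kind, d.char, d.num, d.itemsize, d.alignment)  # f d 12 8 8
```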

Sorry that I don't immediately have answers to your other questions. I do note that the call to dtype::register_scalar_converters() (with the function body in boost_1_77_0/libs/python/src/numpy/dtype.cpp) registers about 15 standard C++ and NumPy types:

void dtype::register_scalar_converters()
{
  array_scalar_converter<bool>::declare();
  array_scalar_converter<npy_uint8>::declare();
  array_scalar_converter<npy_int8>::declare();
  array_scalar_converter<npy_uint16>::declare();
etc etc

where declare involves a call to pyconv::registry::push_back. So the class appears to be for handling existing NumPy types. A comment in the hpp file says: 'A boost.python "object manager" (subclass of object) for numpy.dtype.'

Fully appreciate your point about this band-aid fix, but I have the impression boost_python is not super actively developed anymore, so anything involving a fundamental repair there might be hard to achieve. Thanks very much for your attention to this! As it stands now, cctbx_project and others depending on it will be in a real tricky situation when NumPy 1.20 is no longer supported.

@seberg
Member

seberg commented Dec 3, 2021

Thanks a lot. Yeah, sorry that I am a bit daft about docker :/. I had run that container yesterday, but I'm not very familiar with it, so I need a bit longer to get it. (The Fedora one failed for some reason during setup; the CentOS one worked, but after docker run -i -t <hash_that_seemed_right> bash I didn't quite find my way around. Maybe I just missed something, and that is probably the wrong approach anyway.)

Thanks a lot for that debug info! *type1 does not make any sense at all, so it seems to point to random data. There seem to be two possibilities here:

  1. This never made any sense, but previously it would always just return False silently.
  2. The issue is much earlier.

I am strongly expecting 1 here, to be honest. Could you check that it is the same kind of nonsense even on the NumPy versions where this succeeds happily? On the up-side, if it is 1, maybe we can argue to just add a "sanity check" in PyArray_EquivTypes and/or fix boost::python by just deleting the code entirely, which is easy ;).

@seberg
Member

seberg commented Dec 3, 2021

OK, thanks, I got your docker running now, so I can dig a bit myself.

OK, *type1 does not point to random data, but it looks like it points at something that is not a NumPy dtype to begin with (maybe/probably the original boost::python created object, I am not sure).

@dwpaley
Author

dwpaley commented Dec 3, 2021

OK, *type1 does not point to random data

Yup, I'm sure ob_type = 0x7f618237dd20 <boost::python::class_metatype_object> is a parent class or metaclass for the boost wrapped class, but it does look like there are a lot of uninitialized data members in there.

I have to step away from this for a bit, but I will run the same code with v1.20.x and see if the calls to PyArray_EquivTypes look the same.

@seberg
Member

seberg commented Dec 3, 2021

Odd problem running gdb:

Program terminated with signal SIGTRAP, Trace/breakpoint trap.
The program no longer exists.

and that's it. It seems to happen on a NumPy import (no breakpoints or anything); just running python seems OK.

@dwpaley
Author

dwpaley commented Dec 3, 2021

Odd problem running gdb:

Interesting, I don't know what that's about. An alternative is like this:

  • Start the Docker container with $ docker run -it --cap-add=SYS_PTRACE dwpaley/numpy_boostpy_segfaut
  • In your python script, add the line: import os; print(os.getpid()); import pdb; pdb.set_trace()
  • In a second terminal, do $ docker ps to get the id of the running container. Then $ docker exec -it <container id> /bin/bash
  • In one terminal, start the python script and get the pid.
  • In the other terminal, do $ gdb -p <pid>, then set breakpoints and continue, then go back to python and continue.

This might skip some of the "Detaching after vfork from child process" stuff that I see when running directly from gdb. Not sure if there's a connection, but this way has always been reliable for me, if a little unwieldy.

@seberg
Member

seberg commented Dec 3, 2021

Frustratingly, I never managed to get it to work. My best current guess is some wonky interaction with the host kernel...

In any case, I guess the magic line dtype dt(python::detail::borrowed_reference(obj->ob_type)); is no magic at all: It just casts the pointer to something boost::python thinks it can understand.

But that is utter nonsense: it calls the NumPy API with invalid values. What we end up with is that *type1 is just type(<object>), with the <object> from a.add(<object>). Of course that can't possibly be an actual PyArray_Descr * (as the NumPy API requires).

Now, that whole chunk of code in boost::python seems to be designed to achieve roughly the following (as well as some potential user dtype support, but that should not really matter):

After changing the .add overload from double to long in minipy.cpp, what you would expect is:

import minipy as m
import numpy as np

a = m.A()
a.add(np.int_(3))  # works, `int_` is long
a.add(np.longlong(3))  # also works, because of that extra path.

(The above assumes that np.dtype(np.int_) == np.dtype(np.longlong) which is true on 64bit linux systems.)
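That assumption is quick to check directly; on a 64-bit Linux build of NumPy, both descriptors are 8 bytes wide and compare equal even though the scalar types int_ and longlong are distinct:

```python
import numpy as np

# On LP64 systems (64-bit Linux), C long and long long have the same
# width, so the *descriptors* are equivalent even though the scalar
# *types* differ.
print(np.dtype(np.int_) == np.dtype(np.longlong))  # True on 64-bit Linux
print(np.dtype(np.int_).itemsize, np.dtype(np.longlong).itemsize)
```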

I would be exceedingly surprised if that ever worked, i.e. the simple thing to do would be to just delete the whole second branch and cut your losses. Even the Windows-only path in equivalent seems just as wrong to me (it can never return True).

Of course it should not be hard to fix it up to hard-code the simple equivalences (long == long long or long == int depending on the platform). If you want to do that, I am happy to give a rough draft of it at the very least.

As for band-aid fixing NumPy... not too happy, but I suppose we could do it, probably there isn't even a point in covering all the possible things that can go wrong, but just the obvious nonsense we get here...

@seberg added the "triage review" label (Issue/PR to be discussed at the next triage meeting) on Dec 4, 2021
@seberg
Member

seberg commented Dec 4, 2021

Marking for triage review; we need to make a call on whether to add a "do not break boost::python" hack (and maybe it doesn't even matter if boost::python is fixed).

My guess is, we should probably just add the check, but with a note that it was added for boost::python and probably should just be removed in 1-2 years:

if (Py_TYPE(type1) == &PyType_Type)

or more complete, but theoretically a slight slowdown for paths that are heavy on EquivTypes:

if (!PyArray_DescrCheck(type1))

(giving a DeprecationWarning may be good, but could be a bit too spammy to be useful considering that boost::python is hopefully the only problem.)
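A Python-level analogue of that sanity check (a sketch with a hypothetical helper name, not the eventual C change) would be:

```python
import numpy as np

def equiv_types_checked(a, b):
    # Python-level analogue of the proposed C guard: anything that is
    # not actually a np.dtype instance cannot be equivalent to one, so
    # answer False instead of poking at fields that may not exist.
    if not isinstance(a, np.dtype) or not isinstance(b, np.dtype):
        return False
    return a == b

print(equiv_types_checked(int, np.dtype(np.double)))            # not a dtype: False
print(equiv_types_checked(np.dtype("i8"), np.dtype(np.int64)))  # True
```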

@dwpaley
Author

dwpaley commented Dec 6, 2021

If I understand correctly, you're saying the problem is in dtype.cpp, method equivalent(dtype const & a, dtype const & b), where reinterpret_cast<PyArray_Descr*>(a.ptr()) is nonsense if a is a type that NumPy doesn't know about, like a custom boost::python class.

I think it's very likely correct that this was silently returning False in the past when one of the types was a custom one, but I also suspect there are places where a and b are types that NumPy knows about, and this should work for those cases. Is there a way to check the validity of reinterpret_cast<PyArray_Descr*>(a.ptr())? Is this what you're suggesting with the two if-tests in your last message? That seems like a good fix for boost::python, by avoiding the cast to PyArray_Descr when it is nonsense, but again I'm not sure what would be involved in getting a PR merged over there.

So in summary I'd really appreciate this band-aid fix along with clarifying your thoughts a bit on what could make a longer-term boost::python fix. Do I have the right idea here?

Please let me know if I can do anything else to help with this. I can try other stuff with my debug build here if needed.

@seberg
Member

seberg commented Dec 6, 2021

Yeah, you understand me correctly, except that I think the following line:

https://github.com/boostorg/python/blob/f5d14ef15e98838b3cf18692d46c481287ed50d0/src/numpy/dtype.cpp#L161-L162

is shady. And as you rightly say the later reinterpret_cast<PyArray_Descr*>(a.ptr()) is problematic, because a here is type(obj) which is not a dtype (even though the function signature of equivalent claims it is!).

Looking at it again, the signature of equivalent is incorrect (it works with the type/class of obj and never on a dtype). But I now see that the windows special case actually works, because it uses dtype::get_builtin<int>(), which returns reinterpret_cast<PyArray_Descr*>(dtype::get_builtin<T>().ptr())->typeobj (which is the scalar type, and it makes sense to match it against type(obj)!)

So the correct way to fix it is to ditch all calls to PyArray_EquivTypes (they are all buggy) and expand the windows special case:
https://github.com/boostorg/python/blob/f5d14ef15e98838b3cf18692d46c481287ed50d0/src/numpy/dtype.cpp#L111-L117

so that you also have a special case for sizeof(long) == sizeof(long long), which will cover 64-bit Linux. I suppose long long is so rare that nobody ever realized that this does not work.

@seberg
Member

seberg commented Dec 6, 2021

Wait, no... the windows path doesn't use the ->typeobj? So it should also be broken? But it would be easy to add ->typeobj to it I guess.

@seberg
Member

seberg commented Dec 17, 2021

@dwpaley are you OK with your "fix" here? (i.e. if I translate it to an equivalent hack? This exact version may not be quite safe and is not quite in the right place.)

The reason I ask is that I now think that, by chance, this may have worked somewhat reliably before... It probably was comparing two NumPy scalar types (rather than dtypes), but comparing some random stuff in the type might find a "match" based on them sharing some code as an implementation detail. So some cases could change in behaviour (some conversions that used to work fine start failing), although it may also reject unlikely buggy cases.

seberg added a commit to seberg/numpy that referenced this pull request Dec 17, 2021
This adds an almost random "sanity check" to `PyArray_EquivTypes`
for the sole purpose of allowing boost::python compiled libs to
_not crash_.
boost::python is buggy, it needs to be fixed.  This may break them
(I am unsure), because some conversions which used to work may fail
here (if they worked, they only worked because random type data may
have matched up correctly for our scalar types).

We could error, or warn or... but I hope boost::python will just fix
this soon enough and future us can just delete the whole branch.

Replaces numpygh-20507
@seberg
Member

seberg commented Dec 17, 2021

Closing, because gh-20616 is (I think) the proper way to achieve the same thing. I still suspect it used to "work", but considering that this is a big bug in boost::python, I am willing to say: OK, we help you not crash, but if you want the right thing you have to fix boost::python.

@seberg seberg closed this Dec 17, 2021
charris pushed a commit to charris/numpy that referenced this pull request Dec 17, 2021
charris pushed a commit to charris/numpy that referenced this pull request Dec 17, 2021
dwpaley added a commit to cctbx/cctbx_project that referenced this pull request Jan 4, 2022
This reflects the resolution of #627 as discussed in several other issues
and PRs:

- boostorg/python#376
- numpy/numpy#20507
- numpy/numpy#20616

Leaving this "bibliography" here because the fix in numpy PR 20616 is
considered temporary; thus someday we may have to revisit this to fix the
underlying bug in boost::python.

Co-authored-by: Billy Poon <bkpoon@lbl.gov>
dwpaley added a commit to cctbx/cctbx_project that referenced this pull request Jan 5, 2022
Labels: 00 - Bug · triage review (Issue/PR to be discussed at the next triage meeting)
Successfully merging this pull request may close these issues.

boost::python::numpy::initialize() incompatibilities with numpy>=1.21