Check why shallow_copy_from is called on a wrong object #2

nunoplopes · 2021-06-29T18:28:40Z

It required this patch:

diff --git a/c10/core/TensorImpl.cpp b/c10/core/TensorImpl.cpp
index a0c7673641..3c027a4a17 100644
--- a/c10/core/TensorImpl.cpp
+++ b/c10/core/TensorImpl.cpp
@@ -480,9 +480,12 @@ void TensorImpl::copy_tensor_metadata_except_version_counter(
     const TensorImpl* src_impl,
     TensorImpl* dest_impl,
     bool allow_tensor_metadata_change) {
-  dest_impl->storage_ = src_impl->storage_;
-  dest_impl->sizes_and_strides_ = src_impl->sizes_and_strides_;
-  dest_impl->storage_offset_ = src_impl->storage_offset_;
+  dest_impl->storage_ = src_impl->storage();
+  dest_impl->sizes_and_strides_.set_sizes(src_impl->sizes());
+  auto strides = src_impl->strides();
+  memcpy(dest_impl->sizes_and_strides_.strides_data(), strides.begin(),
+         sizeof(int64_t) * strides.size());
+  dest_impl->storage_offset_ = src_impl->storage_offset();
   dest_impl->data_type_ = src_impl->data_type_;
   dest_impl->device_opt_ = src_impl->device_opt_;
   dest_impl->key_set_ = src_impl->key_set_;

Ideally, though, the patch wouldn't be needed, since shallow_copy_from would be called only between Torchy objects. So why did pickle use different objects?
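To make the effect of the patch concrete, here is a toy model in plain Python rather than real PyTorch or Torchy code. It assumes that a lazy impl keeps placeholder metadata in its fields until an accessor forces materialization. The classes EagerImpl and LazyImpl and both copy helpers are hypothetical names invented for this illustration.

```python
class EagerImpl:
    """Hypothetical stand-in for a plain tensor impl: fields are always valid."""

    def __init__(self, sizes):
        self._sizes = list(sizes)

    def sizes(self):
        return self._sizes


class LazyImpl(EagerImpl):
    """Hypothetical stand-in for a lazy impl: fields are placeholders
    until the tensor is materialized."""

    def __init__(self, compute_sizes):
        super().__init__([])  # placeholder metadata
        self._compute_sizes = compute_sizes
        self.materialized = False

    def sizes(self):
        # The accessor forces materialization before answering.
        if not self.materialized:
            self._sizes = list(self._compute_sizes())
            self.materialized = True
        return self._sizes


def copy_fields(src, dest):
    # Analogous to the pre-patch code: reads the raw field.
    dest._sizes = list(src._sizes)


def copy_via_accessors(src, dest):
    # Analogous to the patched code: goes through the accessor.
    dest._sizes = list(src.sizes())


stale = EagerImpl([0])
copy_fields(LazyImpl(lambda: [2, 3]), stale)

fresh = EagerImpl([0])
copy_via_accessors(LazyImpl(lambda: [2, 3]), fresh)

print(stale.sizes(), fresh.sizes())  # [] [2, 3]
```

In this sketch the field copy picks up the placeholder, while the accessor copy sees the materialized sizes. Whether Torchy's accessors behave exactly this way is an assumption here, not something the patch itself states.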

This is the backtrace, while executing a TorchVision model:

(gdb) bt
#0  c10::TensorImpl::shallow_copy_from (this=0x555558e79400, impl=...)
    at ../c10/core/TensorImpl.h:1270
#1  0x00007fffbc5610d6 in torch::autograd::VariableHooks::set_data (
    this=<optimized out>, self=..., new_data=...)
    at ../torch/csrc/autograd/variable.cpp:440
#2  0x00007fffc2eeac63 in THPVariable_set_data (self=0x7fff7d5b7840,
    data=0x7fff7d5b4980, unused=<optimized out>)
    at ../torch/csrc/autograd/python_variable.cpp:316
#3  0x00005555556cf597 in _PyObject_GenericSetAttrWithDict ()
    at /tmp/build/80754af9/python_1599203911753/work/Objects/object.c:1366
#4  0x00005555556cf687 in PyObject_GenericSetAttr (value=0x7fff7d5b4980,
    name=<optimized out>, obj=0x7fff7d5b7840)
    at /tmp/build/80754af9/python_1599203911753/work/Objects/object.c:1416
#5  PyObject_SetAttr ()
    at /tmp/build/80754af9/python_1599203911753/work/Objects/object.c:1045
#6  0x00005555557156b7 in _PyEval_EvalFrameDefault ()
    at /tmp/build/80754af9/python_1599203911753/work/Python/ceval.c:2372
#7  0x00005555556df86b in function_code_fastcall (globals=<optimized out>,
    nargs=2, args=<optimized out>, co=<optimized out>)
    at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:283
#8  _PyFunction_Vectorcall.localalias.355 ()
    at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:410
#9  0x00005555556dfe79 in _PyObject_Vectorcall (kwnames=0x0, nargsf=2,
    args=0x7fffffffb610, callable=0x7fff848710d0)
    at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:127
#10 method_vectorcall ()
    at /tmp/build/80754af9/python_1599203911753/work/Objects/classobject.c:89
#11 0x00005555555d22d6 in _PyObject_Vectorcall (kwnames=0x0, nargsf=1,
    args=0x7fffffffb6b0, callable=0x7fff7f353a40)
    at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:127
#12 _PyObject_FastCall ()
    at /tmp/build/80754af9/python_1599203911753/work/Include/cpython/abstract.h:147
#13 object_vacall (base=<optimized out>, callable=0x7fff7f353a40,
    vargs=<optimized out>)
    at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:1186
#14 0x0000555555691e1e in PyObject_CallFunctionObjArgs (
    callable=<optimized out>)
    at /tmp/build/80754af9/python_1599203911753/work/Objects/call.c:1259
#15 0x00007fff84762615 in _Pickle_FastCall (obj=0x7fff7d5b0770,
    func=0x7fff7f353a40)
    at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:362
#16 load_build.isra.38 ()
    at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:6707
#17 load () at /usr/local/src/conda/python-3.8.5/Modules/_pickle.c:6961
#18 0x00005555556c3e6a in method_vectorcall_NOARGS ()


nunoplopes · 2021-09-08T16:58:32Z

I've investigated this issue, and there's no way around it without changing PyTorch itself.
A Python program may do non_torchy_tensor.set_(torchy_tensor). If the Torchy tensor isn't materialized, we won't store non_torchy_tensor as a shared tensor on the trace. So when the trace is flushed, this tensor's metadata isn't updated.
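The scenario can be sketched with a toy model in plain Python. This is not Torchy's implementation: Impl, Trace, new_lazy, and shallow_copy are hypothetical names, and the model assumes only what is described above, namely that a flush updates the metadata of the tensors registered as shared on the trace and nothing else.

```python
class Impl:
    """Hypothetical tensor impl; sizes is None while the value is unknown."""

    def __init__(self, sizes=None):
        self.sizes = sizes


class Trace:
    """Hypothetical trace holding one pending op and the impls that share its result."""

    def __init__(self):
        self.shared = []
        self.pending = None

    def new_lazy(self, compute_sizes):
        # Create an unmaterialized tensor and register it on the trace.
        impl = Impl()
        self.shared.append(impl)
        self.pending = compute_sizes
        return impl

    def flush(self):
        # Run the pending op and update only the registered impls.
        sizes = self.pending()
        for impl in self.shared:
            impl.sizes = sizes


def shallow_copy(dest, src):
    # Models non_torchy_tensor.set_(torchy_tensor): metadata is copied
    # as it is right now, and the trace is never told about dest.
    dest.sizes = src.sizes


trace = Trace()
torchy_tensor = trace.new_lazy(lambda: (2, 3))
non_torchy_tensor = Impl((1,))

shallow_copy(non_torchy_tensor, torchy_tensor)
trace.flush()

print(torchy_tensor.sizes, non_torchy_tensor.sizes)  # (2, 3) None
```

After the flush, the registered tensor has its real sizes, but the plain tensor still holds the placeholder it copied earlier, which matches the stale-metadata problem described in the comment.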
