Jaliyae/dataloader api #2

jaliyae · 2018-11-15T10:04:03Z

This combine all changes to the dataloader/sampler and an API for chunk data set. Let's review to see if these are enough to cover all our scenarios.

The major change is not copying the dataset for each thread inside the dataloader and instead reuse a single dataset object. It also introduces the chunk dataset and the new API read_chunk.

xzhu1900 · 2018-11-15T20:35:09Z

torch/csrc/api/include/torch/data/dataloader.h

      });
    }
-    if (options_.workers == 0) {
+    if (options_.workers == 0 || options_.chunk_loading) {


Thinking about this, I think there are two problems here:

If the main thread just keep a copy of the data set, the reset() method won't affect the data set in the working thread, i.e. the sampler in the working thread is still broken.

Even we managed to have one copy of the data set and have everyone hold a reference to it, how do we keep the main thread and the worker thread in sync? For example, make the worker thread halt when we reset the data set in main thread?

We make sure that when creating the chunk data loader using make_chunk_dataloader() we always wrap the Dataset that the user provided with the SharedDataSet. That points to a single reference no matter how many copies we make of the outside wrapper.

I just look look at the SharedDataSet in pytorch#13800

It seems that data set is a shared_ptr to the real thing. If we are adopting that implementation, do we need to wrap it again with a unique pointer here? I am not sure if it's a good idea to wrap a shared pointer with a unique pointer.

the whole point of doing that is we don't know about the wrapping at the SharedDataSet, so we cannot get that shared_ptr out. Since we are creating a unique pointer for the main_thread we cannot create it again. So leaving it as it is.

xzhu1900 · 2018-11-15T20:47:26Z

torch/csrc/api/include/torch/data/datasets/chunk.h

+
+  /// Read an entire chunk. A derived class needs to override this method.
+  /// This is the only API, other than the constructor that 
+  virtual Batch read_chunk(size_t chunk_index) = 0;


Do we need a virtual function to get the chunk count here? This information is needed to reset the chunk sampler right?

good point. the Dataset already has the virtual method size() we need to implement it here.

The dataset does have a size() method, but we don't necessary need to implement it here. We can defer it to the developer of ChunkDataSet. That developer needs to implement both size() and get_chunk_size()

yes, added a new method.

xzhu1900 · 2018-11-15T20:52:25Z

torch/csrc/api/include/torch/data/datasets/chunk.h

+    typename ExampleSampler = samplers::RandomSampler>
+class ChunkDataSet : public BatchDataset<Self, Batch, BatchRequest> {
+ public:
+  ~ChunkDataSet() {


Let's make this a virtual destructor for future inheritance.

xzhu1900 · 2018-11-15T21:24:49Z

torch/csrc/api/include/torch/data/datasets/chunk.h

+/// keeps an internal state. `ChunkSampler` selects, which chunk to load next,
+/// while the `ExampleSampler` determins the order of Examples that are returned
+/// in each `get_batch` call. The hierarchical sampling approach used here is
+/// inspired by this paper http://martin.zinkevich.org/publications/nips2010.pdf


Just curious, is it okey to refer to a paper in an open-source project?

Summary: hopefully this one doesn't break master. Pull Request resolved: pytorch#14053 Differential Revision: D13093406 Pulled By: suo fbshipit-source-id: 8fed44f1a3d463748726cb14acac2ea53dedf29b

…r_api

…#16852) Summary: Add test/.hypothesis/ to .gitignore to pass git status --porcelain check in CI build Pull Request resolved: pytorch#16852 Differential Revision: D14000206 Pulled By: soumith fbshipit-source-id: 5da99a4bb242c12aa35776f7254f6399a7fa6d8c

…ytorch#17337) Summary: Attempt #2 (attempt 1 is pytorch#16705 and got reverted because of CI failures) Fixes pytorch#14805 Pull Request resolved: pytorch#17337 Differential Revision: D14175626 Pulled By: soumith fbshipit-source-id: 66f2e10e219a1bf88ed342ec5c89da6f2994d8eb

Summary: Currently there is a mismatch in naming between Python BatchNorm `running_var` and C++ BatchNorm `running_variance`, which causes JIT model parameters loading to fail (pytorch/vision#728 (comment)): ``` terminate called after throwing an instance of 'c10::Error' what(): No such serialized tensor 'running_variance' (read at /home/shahriar/Build/pytorch/torch/csrc/api/src/serialize/input-archive.cpp:27) frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x85 (0x7f2d92d32f95 in /usr/local/lib/libc10.so) frame #1: torch::serialize::InputArchive::read(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, at::Tensor&, bool) + 0xdeb (0x7f2d938551ab in /usr/local/lib/libtorch.so.1) frame #2: torch::nn::Module::load(torch::serialize::InputArchive&) + 0x98 (0x7f2d9381cd08 in /usr/local/lib/libtorch.so.1) frame #3: torch::nn::Module::load(torch::serialize::InputArchive&) + 0xf9 (0x7f2d9381cd69 in /usr/local/lib/libtorch.so.1) frame #4: torch::nn::Module::load(torch::serialize::InputArchive&) + 0xf9 (0x7f2d9381cd69 in /usr/local/lib/libtorch.so.1) frame #5: torch::nn::operator>>(torch::serialize::InputArchive&, std::shared_ptr<torch::nn::Module> const&) + 0x32 (0x7f2d9381c7b2 in /usr/local/lib/libtorch.so.1) frame pytorch#6: <unknown function> + 0x2b16c (0x5645f4d1916c in /home/shahriar/Projects/CXX/build-TorchVisionTest-Desktop_Qt_5_12_1_GCC_64bit-Debug/TorchVisionTest) frame pytorch#7: <unknown function> + 0x27a3c (0x5645f4d15a3c in /home/shahriar/Projects/CXX/build-TorchVisionTest-Desktop_Qt_5_12_1_GCC_64bit-Debug/TorchVisionTest) frame pytorch#8: <unknown function> + 0x2165c (0x5645f4d0f65c in /home/shahriar/Projects/CXX/build-TorchVisionTest-Desktop_Qt_5_12_1_GCC_64bit-Debug/TorchVisionTest) frame pytorch#9: <unknown function> + 0x1540b (0x5645f4d0340b in /home/shahriar/Projects/CXX/build-TorchVisionTest-Desktop_Qt_5_12_1_GCC_64bit-Debug/TorchVisionTest) frame pytorch#10: __libc_start_main + 0xf3 (0x7f2d051dd223 in /usr/lib/libc.so.6) frame pytorch#11: <unknown function> + 0x1381e (0x5645f4d0181e in /home/shahriar/Projects/CXX/build-TorchVisionTest-Desktop_Qt_5_12_1_GCC_64bit-Debug/TorchVisionTest) ``` Renaming C++ BatchNorm `running_variance` to `running_var` should fix this problem. This is a BC-breaking change, but it should be easy for end user to rename `running_variance` to `running_var` in their call sites. Pull Request resolved: pytorch#17371 Reviewed By: goldsborough Differential Revision: D14172775 Pulled By: yf225 fbshipit-source-id: b9d3729ec79272a8084269756f28a8f7c4dd16b6

…fc3ff6 (pytorch#18028) Summary: Pull Request resolved: pytorch#18028 Previous import was 520e8e135f1ad75959bf9b5bd15c361b8caeb8d6 Included changes: - **[d1f45b1](houseroad/foxi@d1f45b1)**: update the gitignore (pytorch#6) <Lu Fang> - **[398135c](houseroad/foxi@398135c)**: Remove static variable in header (#3) <Lu Fang> - **[f817be1](houseroad/foxi@f817be1)**: sync to ONNX cb544d07cc022e3fe83622fda9b2b1fa00b75b89 (#2) <Lu Fang> Reviewed By: zrphercule Differential Revision: D14464213 fbshipit-source-id: b5d166f05f7fd503dec11d676e219cc6c6a373f9

Summary: Tracing models which attempts to return this in-place value doesn't turn out well. I haven't run any tests to confirm the results to be honest, but regardless of the outcome, the operation happens in-place, so it should work as before. Sample output from traced model attempting to set `max_norm` on `Embedding`: ``` a leaf Variable that requires grad has been used in an in-place operation. (check_inplace at /pytorch/torch/csrc/autograd/VariableTypeUtils.h:49) frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f0ecc5cc021 in /usr/local/lib/python3.7/site-packages/torch/lib/libc10.so) frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f0ecc5cb8ea in /usr/local/lib/python3.7/site-packages/torch/lib/libc10.so) frame #2: <unknown function> + 0x38ab2f (0x7f0ecb55ab2f in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch.so.1) frame #3: torch::autograd::VariableType::embedding_renorm_(at::Tensor&, at::Tensor const&, double, double) const + 0x76 (0x7f0ecb5b5966 in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch.so.1) frame #4: <unknown function> + 0x56c958 (0x7f0ecb73c958 in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch.so.1) frame #5: <unknown function> + 0x672286 (0x7f0ecb842286 in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch.so.1) frame pytorch#6: torch::jit::InterpreterState::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) + 0x22 (0x7f0ecb83d842 in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch.so.1) frame pytorch#7: <unknown function> + 0x65c6ac (0x7f0ecb82c6ac in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch.so.1) frame pytorch#8: <unknown function> + 0x3c8ab4 (0x7f0f06bc0ab4 in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_python.so) frame pytorch#9: <unknown function> + 0x3ad2c3 (0x7f0f06ba52c3 in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_python.so) frame pytorch#10: <unknown function> + 0x11663e (0x7f0f0690e63e in /usr/local/lib/python3.7/site-packages/torch/lib/libtorch_python.so) <omitting python frames> frame pytorch#39: python_call + 0x11 (0x5563c3c521c1 in uwsgi) frame pytorch#40: uwsgi_request_wsgi + 0x100 (0x5563c3c54410 in uwsgi) frame pytorch#41: wsgi_req_recv + 0xac (0x5563c3becabc in uwsgi) frame pytorch#42: simple_loop_run + 0xc4 (0x5563c3c35be4 in uwsgi) frame pytorch#43: simple_loop + 0x10 (0x5563c3c35a00 in uwsgi) frame pytorch#44: uwsgi_ignition + 0x241 (0x5563c3c3a3a1 in uwsgi) frame pytorch#45: uwsgi_worker_run + 0x275 (0x5563c3c3ec35 in uwsgi) frame pytorch#46: <unknown function> + 0x8f22c (0x5563c3c3f22c in uwsgi) frame pytorch#47: <unknown function> + 0x3c13e (0x5563c3bec13e in uwsgi) frame pytorch#48: __libc_start_main + 0xf1 (0x7f0f138922e1 in /lib/x86_64-linux-gnu/libc.so.6) frame pytorch#49: _start + 0x2a (0x5563c3bec16a in uwsgi) : operation failed in interpreter: op_version_set = 0 def forward(self, input_1: Tensor) -> Tensor: _0 = torch.norm(self.item_embedding.weight, 2, 1, True) _1 = torch.div(self.item_embedding.weight, _0) m_weight = torch.t(_1) input_2 = torch.contiguous(input_1) weight_1 = torch.embedding_renorm_(self.item_embedding.weight, input_2, 1., 2.) ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE x = torch.embedding(weight_1, input_2, -1, False, False) input_3 = torch.div(x, torch.norm(x, 2, 2, True)) max_batch_size = ops.prim.NumToTensor(torch.size(input_3, 0)) hx = torch.zeros([2, int(max_batch_size), 70], dtype=6, layout=0, device=torch.device("cpu")) _2 = [self.lstm_layer.weight_ih_l0, self.lstm_layer.weight_hh_l0, self.lstm_layer.weight_ih_l1, self.lstm_layer.weight_hh_l1] input_4, _3, _4 = torch.lstm(input_3, [hx, hx], _2, False, 2, 0.10000000000000001, False, False, True) input = torch.matmul(input_4, torch.t(self.rnn2item.weight)) tastevec = torch.div(input, torch.norm(input, 2, 2, True)) outputs = torch.matmul(tastevec, m_weight) ``` Pull Request resolved: pytorch#18684 Differential Revision: D14782041 Pulled By: ezyang fbshipit-source-id: 7b2fc19b7d5b6600263644498bb728319a19f39d

Summary: We have encountered `std::bad_cast` error when running PyTorch binary built with cxx11 abi on CentOS7, stack trace: ``` #0 0x00007fec10160207 in raise () from /lib64/libc.so.6 #1 0x00007fec101618f8 in abort () from /lib64/libc.so.6 #2 0x00007fec015767d5 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6 #3 0x00007fec01574746 in ?? () from /lib64/libstdc++.so.6 #4 0x00007fec01574773 in std::terminate() () from /lib64/libstdc++.so.6 #5 0x00007fec01574993 in __cxa_throw () from /lib64/libstdc++.so.6 pytorch#6 0x00007fec015c94d2 in std::__throw_bad_cast() () from /lib64/libstdc++.so.6 pytorch#7 0x00007feb2ab3c2d7 in std::__cxx11::numpunct<char> const& std::use_facet<std::__cxx11::numpunct<char> >(std::locale const&) () from /root/.local/lib/python2.7/site-packages/torch/lib/libcaffe2.so pytorch#8 0x00007feb28643d62 in torch::jit::script::strtod_c(char const*, char**) () from /root/.local/lib/python2.7/site-packages/torch/lib/libcaffe2.so ``` We are suspecting this line will get compiled to gcc abi dependent symbol: ``` char decimal_point = std::use_facet<std::numpunct<char>>(std::locale()).decimal_point(); ``` Pull Request resolved: pytorch#21293 Differential Revision: D15609910 Pulled By: bddppq fbshipit-source-id: e247059729863868e4b36d6fec4fcbc36fbc4bb1

Summary: Turing GPUs (compute capability 7.5) require CUDA10 to work properly. We've seen some issues for these GPUs using PyTorch binaries with CUDA9 or older: [Discussion Board #1](https://discuss.pytorch.org/t/cudnn-status-execution-failed-error/38575) [Discussion Board #2](https://discuss.pytorch.org/t/cublas-runtime-error-on-gpu-running-but-works-on-cpu/46545/6) Tested on using CUDA9 with an RTX 2080Ti. Pull Request resolved: pytorch#21468 Differential Revision: D15696170 Pulled By: ezyang fbshipit-source-id: ed43f4e4948d3f97ec8e7d7952110cbbfeafef2a

jaliyae added 11 commits November 5, 2018 11:51

Modifications to the data loader for enabling chunk support.

0795b33

The major change is not copying the dataset for each thread inside the dataloader and instead reuse a single dataset object. It also introduces the chunk dataset and the new API read_chunk.

fix a typo

74da57f

fixed review comments and rebase to latest pytorch

880aeeb

Making the Sampler::reset to accept a new size

2231d8e

add default nullopt to existing samplers' reset method

aedc16b

Merge remote-tracking branch 'upstream/master' into jaliyae/samplers

9b5d23a

fix review comments

b15f676

Merge branch 'master' into jaliyaek/dataloader

41208e4

api changes for the chunk dataset

8849983

merging with the sampler change

64b18c3

api changes for dataloader

213ef94

xzhu1900 reviewed Nov 15, 2018

View reviewed changes

fix review comments

ea1a6ef

jaliyae added 4 commits November 21, 2018 13:55

Merge branch 'master' into jaliyae/dataloader_api

8764e24

updating the API change based on the SharedBatchDataset change

7b51fa6

fix test failures

e5e7ab8

Merge remote-tracking branch 'upstream/master' into jaliyae/dataloade…

818937c

…r_api

jaliyae closed this Feb 19, 2019

jaliyae deleted the jaliyae/dataloader_api branch February 19, 2019 18:49

Jaliyae/dataloader api #2

Jaliyae/dataloader api #2

Uh oh!

Conversation

jaliyae commented Nov 15, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants