objects.transfer: minor refactoring, move lazier taskset inside custom executor #6591

skshetry · 2021-09-10T11:47:13Z

No description provided.

efiop · 2021-09-10T11:50:35Z

dvc/utils/threadpool.py

+class ThreadPoolExecutor(futures.ThreadPoolExecutor):
+    @property
+    def max_workers(self) -> int:
+        return self._max_workers
+
+    def imap_unordered(
+        self, fn: Callable[..., _T], *iterables: Iterable[Any]
+    ) -> Iterator[_T]:
+        """Lazier version of map that does not preserve ordering of results.
+
+        It does not create all the futures at once to reduce memory usage.
+        """
+
+        def create_taskset(n: int) -> Set[futures.Future]:
+            return {self.submit(fn, *args) for args in islice(it, n)}
+
+        it = zip(*iterables)
+        tasks = create_taskset(self.max_workers * 5)
+        while tasks:
+            done, tasks = futures.wait(
+                tasks, return_when=futures.FIRST_COMPLETED
+            )
+            for fut in done:
+                yield fut.result()
+            tasks.update(create_taskset(len(done)))
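
For readers who want to try the method from the hunk above outside of DVC, here is a self-contained sketch. The executor follows the diff. The `square` and `numbers` helpers and the `pulled` list are illustrative assumptions, added only to show that the input iterable is consumed in a bounded window rather than all at once.

```python
from concurrent import futures
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List, Set, TypeVar

_T = TypeVar("_T")


class ThreadPoolExecutor(futures.ThreadPoolExecutor):
    @property
    def max_workers(self) -> int:
        # _max_workers is a private attribute of the stdlib executor.
        return self._max_workers

    def imap_unordered(
        self, fn: Callable[..., _T], *iterables: Iterable[Any]
    ) -> Iterator[_T]:
        it = zip(*iterables)

        def create_taskset(n: int) -> Set[futures.Future]:
            # Pull at most n argument tuples from the shared iterator.
            return {self.submit(fn, *args) for args in islice(it, n)}

        tasks = create_taskset(self.max_workers * 5)
        while tasks:
            done, tasks = futures.wait(
                tasks, return_when=futures.FIRST_COMPLETED
            )
            for fut in done:
                yield fut.result()
            # Top up with as many new tasks as have just finished.
            tasks.update(create_taskset(len(done)))


# Illustrative helpers (not part of the PR).
pulled: List[int] = []


def numbers(n: int) -> Iterator[int]:
    for i in range(n):
        pulled.append(i)  # record how far the source has been consumed
        yield i


def square(x: int) -> int:
    return x * x


with ThreadPoolExecutor(max_workers=2) as executor:
    gen = executor.imap_unordered(square, numbers(100))
    first = next(gen)
    # Only the initial window (max_workers * 5) has been pulled so far.
    pulled_after_first = len(pulled)
    rest = list(gen)

results = sorted([first, *rest])
```

Results arrive in completion order, so callers that need a stable order have to sort or key the results themselves, as the last line does.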


Btw, there was a plan to try using async here, e.g. with fsspec's put/get methods. Maybe it would be reasonable to do that now instead?

Could you please elaborate? I am not sure I understand.

Right now we use a thread pool executor to parallelize uploading/downloading of objects. But fsspec provides put/get methods that accept batches, and async filesystems implement them with async, effectively managing the coroutines themselves instead of us manually using threads.

So what you are suggesting is to have something like TransferManager that handles multithreaded and async transfers as well, right?

@skshetry Ah, no, I meant that we possibly won't need a transfer manager at all, and could just call fs.put(obj_list) and let it do its thing. So we would effectively be delegating transfer management to the particular filesystem's put/get methods.
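
To make the delegation idea concrete, here is a rough sketch of a batch call that manages its own coroutines so the caller needs no thread pool. It is not fsspec code: `put_one`, `put_many`, the `batch_size` parameter and the dict-backed "remote" are all hypothetical stand-ins, and the sleep only imitates network I/O.

```python
import asyncio
from typing import Dict, List, Tuple


async def put_one(src: str, dest: str, store: Dict[str, str]) -> str:
    # Hypothetical single-object upload; the sleep stands in for network I/O.
    await asyncio.sleep(0.01)
    store[dest] = src
    return dest


async def put_many(
    pairs: List[Tuple[str, str]], store: Dict[str, str], batch_size: int = 4
) -> List[str]:
    # Bound the number of in-flight uploads with a semaphore.
    sem = asyncio.Semaphore(batch_size)

    async def bounded(src: str, dest: str) -> str:
        async with sem:
            return await put_one(src, dest, store)

    # gather() returns results in the order of its arguments.
    return await asyncio.gather(*(bounded(s, d) for s, d in pairs))


pairs = [(f"cache/{i:02d}", f"remote/{i:02d}") for i in range(10)]
remote: Dict[str, str] = {}
uploaded = asyncio.run(put_many(pairs, remote))
```

The caller hands over the whole list in one call, and concurrency limits live inside the batch method, which is the division of responsibility being proposed here.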

@skshetry Btw, happy to jump on a call to talk about it a bit more.

I think we should wait for the fsspec changes to be done. This will likely require us to have colored APIs too (odb.add_async, fs.utils.transfer_async, etc.). There's also the problem of running fsspec's sync FS in async mode.
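
As an illustration of what "colored" APIs mean in practice, the sketch below pairs a blocking function with an async twin that offloads the blocking call to a thread. The names `transfer` and `transfer_async` here are hypothetical toy functions, not DVC's or fsspec's, and offloading to the default executor is just one possible way to run sync code from async code.

```python
import asyncio
import time
from typing import List


def transfer(name: str) -> str:
    # Hypothetical blocking transfer; the sleep stands in for file I/O.
    time.sleep(0.01)
    return f"done:{name}"


async def transfer_async(name: str) -> str:
    # Async "color" of the same operation: run the blocking call in the
    # event loop's default thread pool so the loop itself is not blocked.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, transfer, name)


async def main() -> List[str]:
    return await asyncio.gather(*(transfer_async(n) for n in ["a", "b", "c"]))


results = asyncio.run(main())
```

Every layer that calls `transfer_async` has to be async as well, which is why the sync and async variants tend to multiply across the API.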

As this change is a very minor restructuring, I'd prefer to merge this as-is.


skshetry · 2021-09-10T12:38:10Z

dvc/objects/transfer.py

-                    break
-                except (FileNotFoundError, ObjectFormatError):
-                    pass
+        dir_obj = find_tree_by_obj_id([cache_odb, src], dir_hash)
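
The body of `find_tree_by_obj_id` is not shown in this hunk. Judging only from the removed `break` / `except (FileNotFoundError, ObjectFormatError): pass` lines, it presumably tries each object database in turn and skips the ones that cannot provide the tree. The following is a guess at that shape, with `FakeODB`, `load_tree` and the local `ObjectFormatError` as hypothetical stand-ins rather than DVC's real classes.

```python
from typing import Dict, Iterable, List, Optional


class ObjectFormatError(Exception):
    """Stand-in for the exception named in the diff."""


class FakeODB:
    """Hypothetical in-memory object database used only for this sketch."""

    def __init__(self, objects: Dict[str, object]) -> None:
        self._objects = objects

    def load_tree(self, obj_id: str) -> List[str]:
        if obj_id not in self._objects:
            raise FileNotFoundError(obj_id)
        obj = self._objects[obj_id]
        if not isinstance(obj, list):
            raise ObjectFormatError(obj_id)
        return obj


def find_tree_by_obj_id(
    odbs: Iterable[FakeODB], obj_id: str
) -> Optional[List[str]]:
    # Return the tree from the first odb that has a valid copy of it.
    for odb in odbs:
        try:
            return odb.load_tree(obj_id)
        except (FileNotFoundError, ObjectFormatError):
            continue
    return None


cache_odb = FakeODB({"abc.dir": "corrupted"})
src = FakeODB({"abc.dir": ["file1", "file2"]})
tree = find_tree_by_obj_id([cache_odb, src], "abc.dir")
missing = find_tree_by_obj_id([cache_odb, src], "nope.dir")
```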


Why do we call it dir_obj instead of tree?

It's a terminology mess (+ .dir objects as we call them, and not .tree). Also driving me insane for 3.0 🙂

.dir is a physical representation detail, similar to how it's JSON kept in some structure, so it's fine until 3.0. Though I'd say we should s/dir_obj/tree in all of these abstractions right away.

It's a lot of leftovers from when remote was migrated to odb, and yeah, it would be good to just drop the old terminology entirely wherever possible.

skshetry added the refactoring (Factoring and re-factoring) and A: object-storage (Related to the object/content-addressable storage) labels Sep 10, 2021

skshetry requested review from pmrowla and isidentical September 10, 2021 11:47

skshetry self-assigned this Sep 10, 2021

skshetry requested a review from a team as a code owner September 10, 2021 11:47

skshetry added this to In progress in DVC 07 Sep - 21 Sep 2021 via automation Sep 10, 2021

efiop reviewed Sep 10, 2021


objects.transfer: minor refactoring, move lazy taskset inside custom …

d4207c0

…executor

skshetry force-pushed the transfer-minor-refactoring branch from a0eb98f to d4207c0 Compare September 10, 2021 12:03

skshetry commented Sep 10, 2021


skshetry merged commit d2d6c73 into iterative:master Sep 13, 2021

DVC 07 Sep - 21 Sep 2021 automation moved this from In progress to Done Sep 13, 2021

skshetry deleted the transfer-minor-refactoring branch September 13, 2021 05:49
