collect_cuts_in_buckets_ slowness? #678
Open
danpovey opened this issue Apr 18, 2022 · 3 comments

@danpovey (Collaborator) commented Apr 18, 2022

From looking at the output of py-spy dump --pid=<training-process>, I notice that it sometimes seems to be spending time in _collect_cuts_in_buckets. I had anticipated that this would happen in a separate data-loader thread, not in the main thread. Note, this is a run that was started with:

    python3 ./pruned_transducer_stateless3/train.py --world-size 8 --num-epochs 40 --start-epoch 20 --full-libri 1 --exp-dir pruned_transducer_stateless3/exp --max-duration 300 --use-fp16 1 --lr-epochs 4 --num-workers 5 --giga-prob 0.5

This doesn't dominate the training; I see it only maybe 10% or 20% of the time, but it probably makes an appreciable difference.
It would probably be better to do this work in a more incremental way.

Process 2690310: /ceph-fj/fangjun/software/py38/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=6, pipe_handle=8) --multiprocessing-fork
Python v3.8.0 (/ceph-fj/fangjun/software/py38/bin/python3.8)

Thread 2690310 (active+gil): "MainThread"
    _collect_cuts_in_buckets (lhotse/dataset/sampling/dynamic_bucketing.py:349)
        Arguments:
            self: <DynamicBucketer at 0x7f4d2f514b50>
            n_cuts: 217
        Locals:
            _: 133
            cuts: (<MonoCut at 0x7f4d3ef4d460>)
            duration: 5.75
            bucket_idx: 12
    __iter__ (lhotse/dataset/sampling/dynamic_bucketing.py:334)
        Locals:
            ready_buckets: [<collections.deque at 0x7f4f89df0dc0>, <collections.deque at 0x7f4ce7d94820>, <collections.deque at 0x7f4ce5e21e80>, ...]
            sampling_bucket: <collections.deque at 0x7f4f89df0dc0>
            batcher: <DurationBatcher at 0x7f4d3bb12970>
            batch: <CutSet at 0x7f4d4c3010a0>
            batch_size: 217
            _: 216
    _next_batch (lhotse/dataset/sampling/dynamic_bucketing.py:185)
        Arguments:
            self: <DynamicBucketingSampler at 0x7f4d2b838070>
    __next__ (lhotse/dataset/sampling/base.py:256)
        Arguments:
            self: <DynamicBucketingSampler at 0x7f4d2b838070>
        Locals:
            batches: [<CutSet at 0x7f4d4c3010a0>]
            _: 1
    _next_index (torch/utils/data/dataloader.py:512)
        Arguments:
            self: <_MultiProcessingDataLoaderIter at 0x7f4cfba11df0>
    _try_put_index (torch/utils/data/dataloader.py:1209)
        Arguments:
            self: <_MultiProcessingDataLoaderIter at 0x7f4cfba11df0>
    _process_data (torch/utils/data/dataloader.py:1227)
        Arguments:
            self: <_MultiProcessingDataLoaderIter at 0x7f4cfba11df0>
            data: {"inputs": <Tensor at 0x7f4d4e2717c0>, "supervisions": {"text": [...], "sequence_idx": <Tensor at 0x7f4d4e2710e0>, ...}}
    _next_data (torch/utils/data/dataloader.py:1183)
        Arguments:
            self: <_MultiProcessingDataLoaderIter at 0x7f4cfba11df0>
        Locals:
            info: (1, {"inputs": <Tensor at 0x7f4d4e2717c0>, ...})
            worker_id: 1
            data: {"inputs": <Tensor at 0x7f4d4e2717c0>, "supervisions": {"text": [...], "sequence_idx": <Tensor at 0x7f4d4e2710e0>, ...}}
    __next__ (torch/utils/data/dataloader.py:521)
        Arguments:
            self: <_MultiProcessingDataLoaderIter at 0x7f4cfba11df0>
    train_one_epoch (train.py:726)
        Arguments:
            params: {"best_train_loss": 0.1834969507367963, "best_valid_loss": 0.14670906683585203, "best_train_epoch": 20, ...}
            model: <DistributedDataParallel at 0x7f4f265d3df0>
            optimizer: <Eve at 0x7f4f265e8dc0>
            scheduler: <Eden at 0x7f4f26b7cb80>
            sp: <SentencePieceProcessor at 0x7f4f26b7c7c0>
            train_dl: <DataLoader at 0x7f4d2bd92400>
            giga_train_dl: <DataLoader at 0x7f4d2b838fd0>
            valid_dl: <DataLoader at 0x7f4cfba09e80>
            rng: <Random at 0x522b720>
            scaler: <GradScaler at 0x7f4d2b48d190>
            tb_writer: <SummaryWriter at 0x7f4eb487b8e0>
            world_size: 8
            rank: 0
        Locals:
            libri_tot_loss: {"frames": 1424368.0284411646, "loss": 236555.3318336773, "simple_loss": 363824.98212052934, ...}
            giga_tot_loss: {"frames": 1478261.3725208302, "loss": 282613.99806341354, "simple_loss": 415192.64430341555, ...}
            tot_loss: {"frames": 1451002.3512052558, "loss": 259233.39561773292, "simple_loss": 389768.41394849814, ...}
            dl_weights: [0.5, 0.5]
            iter_libri: <_MultiProcessingDataLoaderIter at 0x7f4ce63fed90>
            iter_giga: <_MultiProcessingDataLoaderIter at 0x7f4cfba11df0>
            batch_idx: 8107
            idx: 1
            dl: <_MultiProcessingDataLoaderIter at 0x7f4cfba11df0>
            batch: {"inputs": <Tensor at 0x7f4cecade3b0>, "supervisions": {"text": ["BUT THAT WALLET ISN'T A PHYSICAL THING ", ...], ...}}
            batch_size: 127
            libri: False
            loss: <Tensor at 0x7f4d32c86cc0>
            loss_info: {"frames": 7407, "loss": 1514.5911865234375, "simple_loss": 2210.47119140625, "pruned_loss": 409.3556213378906}
            prefix: "giga"
            cur_lr: 0.00020478380486557086
            valid_info: {"frames": 698248, "loss": 102681.625, "simple_loss": 173539.96875, "pruned_loss": 15911.64453125}
    run (train.py:1023)
        Arguments:
            rank: 0
            world_size: 8
            args: <Namespace at 0x7f4f89e0b250>
        Locals:
            params: {"best_train_loss": 0.1834969507367963, "best_valid_loss": 0.14670906683585203, "best_train_epoch": 20, ...}
            rng: <Random at 0x522b720>
            tb_writer: <SummaryWriter at 0x7f4eb487b8e0>
            device: <torch.device at 0x7f4f26b914d0>
            sp: <SentencePieceProcessor at 0x7f4f26b7c7c0>
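
To put a rough number on the "maybe 10% or 20%" above, one option is a small timing wrapper around the DataLoader iterator. This is only a sketch, not part of lhotse or train.py; timed is a name introduced here, and it measures the total time next() blocks the main thread, of which the sampler's _collect_cuts_in_buckets is the part that shows up in py-spy:

    import time

    def timed(loader, name="dl"):
        """Yield batches from `loader`, printing how long each next() call blocks
        the main thread (sampler work plus waiting for worker results)."""
        it = iter(loader)
        while True:
            t0 = time.perf_counter()
            try:
                batch = next(it)
            except StopIteration:
                return
            print(f"{name}: next() blocked for {time.perf_counter() - t0:.3f}s")
            yield batch

    # Hypothetical usage in train_one_epoch:
    #   for batch in timed(train_dl, "libri"):
    #       ...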

@danpovey (Collaborator, Author)

... one reason this has more of an effect on speed than you might expect is that different workers reach this point at different times, so with 8 training processes you spend 8x as long waiting for this to complete.
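
A toy illustration of that effect (my own sketch, not from the run above): if each of the 8 ranks hits its sampler stall at a different step, and every gradient sync waits for the slowest rank, the total stall time is roughly 8x what a single rank's stalls would cost on their own.

    # Toy simulation: staggered per-rank stalls vs. coinciding stalls.
    num_ranks, num_steps, stall_every, stall_time, step_time = 8, 80, 10, 1.0, 0.1

    total = 0.0
    for step in range(num_steps):
        per_rank = []
        for rank in range(num_ranks):
            t = step_time
            # each rank hits its sampler stall at a different (staggered) step
            if (step - rank) % stall_every == 0:
                t += stall_time
            per_rank.append(t)
        total += max(per_rank)  # the gradient all-reduce waits for the slowest rank

    coincided = num_steps * step_time + (num_steps // stall_every) * stall_time
    print(f"staggered stalls: {total:.1f}s, coinciding stalls: {coincided:.1f}s")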

@pzelasko (Collaborator)

With map-style datasets, the sampler is placed in the same process as the main training loop. I'll think about whether there's an easy way to offload this to a background thread.
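
For what it's worth, a minimal sketch of the background-thread idea (my own illustration, assuming the sampler is just an iterable of batches; ThreadedSampler is not an lhotse class): a small prefetcher thread pulls from the sampler and queues results, so the main loop rarely blocks on _collect_cuts_in_buckets.

    import queue
    import threading

    class ThreadedSampler:
        """Iterate `sampler` in a daemon thread, keeping up to `prefetch` results queued.
        Intended to be iterated exactly once."""

        def __init__(self, sampler, prefetch=4):
            self.q = queue.Queue(maxsize=prefetch)
            self._sentinel = object()
            self._thread = threading.Thread(target=self._fill, args=(sampler,), daemon=True)
            self._thread.start()

        def _fill(self, sampler):
            for item in sampler:
                self.q.put(item)
            self.q.put(self._sentinel)

        def __iter__(self):
            while True:
                item = self.q.get()
                if item is self._sentinel:
                    return
                yield item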

The other option is to use IterableDatasetWrapper(dataset, sampler) to move the sampler into the dataloading workers, but then you'd need to de-duplicate the data across GPU nodes and dataloading workers. Lhotse did that in the early days of the project, but because of dynamically sized batches it resulted in an uneven number of batches on each GPU node. Newer PyTorch versions can handle that with the Join([model]) context manager (link: https://pytorch.org/docs/stable/distributed.algorithms.join.html?highlight=join#torch.distributed.algorithms.Join).
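
For reference, the Join-based training loop looks roughly like this (a sketch assuming model is wrapped in DistributedDataParallel and each rank may run out of batches at a different step; compute_loss is a placeholder, not a real function from the script):

    from torch.distributed.algorithms.join import Join

    # Ranks that exhaust their dataloader early keep shadowing the collective
    # communications of the remaining ranks instead of hanging them.
    with Join([model]):
        for batch in train_dl:
            optimizer.zero_grad()
            loss = compute_loss(model, batch)  # placeholder for the real loss computation
            loss.backward()
            optimizer.step()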

@danpovey (Collaborator, Author)

BTW, if threads become a problem in the future, I would have thought it would be possible to simply spread the sampler's work across multiple minibatches by wrapping it in a generator and stepping it on every minibatch. But admittedly this is a little ugly and feels like a hack to do the work of a thread.
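
A rough sketch of that generator idea (my own illustration, not lhotse code; find_bucket and the chunk size are hypothetical):

    def collect_incrementally(cuts_iter, buckets, find_bucket, chunk_size=32):
        """Generator that files up to `chunk_size` cuts into duration buckets each
        time it is advanced, spreading the collection work across minibatches."""
        while True:
            for _ in range(chunk_size):
                cut = next(cuts_iter, None)
                if cut is None:
                    return
                buckets[find_bucket(cut.duration)].append(cut)
            yield  # hand control back to the training loop until the next step

    # Hypothetical usage: advance the collection work by one chunk per minibatch.
    #   collector = collect_incrementally(iter(cuts), buckets, find_bucket)
    #   for batch in train_dl:
    #       next(collector, None)
    #       ...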
