collect_cuts_in_buckets_ slowness? #678
Open
danpovey opened this issue Apr 18, 2022 · 3 comments

@danpovey (Collaborator) commented Apr 18, 2022

From looking at the output of py-spy dump --pid=<training-process>, I notice that it sometimes seems to be spending time in _collect_cuts_in_buckets. I had anticipated that this would happen in a separate data-loader thread, not in the main thread. Note, this is a run that was started with:

    python3 ./pruned_transducer_stateless3/train.py --world-size 8 --num-epochs 40 --start-epoch 20 --full-libri 1 --exp-dir pruned_transducer_stateless3/exp --max-duration 300 --use-fp16 1 --lr-epochs 4 --num-workers 5 --giga-prob 0.5

This doesn't dominate the training; I see it only maybe 10% or 20% of the time, but it probably makes an appreciable difference.
It would probably be better to do this work in a more incremental way.

Process 2690310: /ceph-fj/fangjun/software/py38/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=6, pipe_handle=8) --multiprocessing-fork
Python v3.8.0 (/ceph-fj/fangjun/software/py38/bin/python3.8)

Thread 2690310 (active+gil): "MainThread"
    _collect_cuts_in_buckets (lhotse/dataset/sampling/dynamic_bucketing.py:349)
        Arguments:
            self: <DynamicBucketer at 0x7f4d2f514b50>
            n_cuts: 217
        Locals:
            _: 133
            cuts: (<MonoCut at 0x7f4d3ef4d460>)
            duration: 5.75
            bucket_idx: 12
    __iter__ (lhotse/dataset/sampling/dynamic_bucketing.py:334)
        Locals:
            ready_buckets: [<collections.deque at 0x7f4f89df0dc0>, <collections.deque at 0x7f4ce7d94820>, <collections.deque at 0x7f4ce5e21e80>, ...]
            sampling_bucket: <collections.deque at 0x7f4f89df0dc0>
            batcher: <DurationBatcher at 0x7f4d3bb12970>
            batch: <CutSet at 0x7f4d4c3010a0>
            batch_size: 217
            _: 216
    _next_batch (lhotse/dataset/sampling/dynamic_bucketing.py:185)
        Arguments:
            self: <DynamicBucketingSampler at 0x7f4d2b838070>
    __next__ (lhotse/dataset/sampling/base.py:256)
        Arguments:
            self: <DynamicBucketingSampler at 0x7f4d2b838070>
        Locals:
            batches: [<CutSet at 0x7f4d4c3010a0>]
            _: 1
    _next_index (torch/utils/data/dataloader.py:512)
        Arguments:
            self: <_MultiProcessingDataLoaderIter at 0x7f4cfba11df0>
    _try_put_index (torch/utils/data/dataloader.py:1209)
        Arguments:
            self: <_MultiProcessingDataLoaderIter at 0x7f4cfba11df0>
    _process_data (torch/utils/data/dataloader.py:1227)
        Arguments:
            self: <_MultiProcessingDataLoaderIter at 0x7f4cfba11df0>
            data: {"inputs": <Tensor at 0x7f4d4e2717c0>, "supervisions": {"text": [...], "sequence_idx": <Tensor at 0x7f4d4e2710e0>, ...}}
    _next_data (torch/utils/data/dataloader.py:1183)
        Arguments:
            self: <_MultiProcessingDataLoaderIter at 0x7f4cfba11df0>
        Locals:
            info: (1, {"inputs": <Tensor at 0x7f4d4e2717c0>, ...})
            worker_id: 1
            data: {"inputs": <Tensor at 0x7f4d4e2717c0>, "supervisions": {"text": [...], "sequence_idx": <Tensor at 0x7f4d4e2710e0>, ...}}
    __next__ (torch/utils/data/dataloader.py:521)
        Arguments:
            self: <_MultiProcessingDataLoaderIter at 0x7f4cfba11df0>
    train_one_epoch (train.py:726)
        Arguments:
            params: {"best_train_loss": 0.1834969507367963, "best_valid_loss": 0.14670906683585203, "best_train_epoch": 20, ...}
            model: <DistributedDataParallel at 0x7f4f265d3df0>
            optimizer: <Eve at 0x7f4f265e8dc0>
            scheduler: <Eden at 0x7f4f26b7cb80>
            sp: <SentencePieceProcessor at 0x7f4f26b7c7c0>
            train_dl: <DataLoader at 0x7f4d2bd92400>
            giga_train_dl: <DataLoader at 0x7f4d2b838fd0>
            valid_dl: <DataLoader at 0x7f4cfba09e80>
            rng: <Random at 0x522b720>
            scaler: <GradScaler at 0x7f4d2b48d190>
            tb_writer: <SummaryWriter at 0x7f4eb487b8e0>
            world_size: 8
            rank: 0
        Locals:
            libri_tot_loss: {"frames": 1424368.0284411646, "loss": 236555.3318336773, "simple_loss": 363824.98212052934, ...}
            giga_tot_loss: {"frames": 1478261.3725208302, "loss": 282613.99806341354, "simple_loss": 415192.64430341555, ...}
            tot_loss: {"frames": 1451002.3512052558, "loss": 259233.39561773292, "simple_loss": 389768.41394849814, ...}
            dl_weights: [0.5, 0.5]
            iter_libri: <_MultiProcessingDataLoaderIter at 0x7f4ce63fed90>
            iter_giga: <_MultiProcessingDataLoaderIter at 0x7f4cfba11df0>
            batch_idx: 8107
            idx: 1
            dl: <_MultiProcessingDataLoaderIter at 0x7f4cfba11df0>
            batch: {"inputs": <Tensor at 0x7f4cecade3b0>, "supervisions": {"text": ["BUT THAT WALLET ISN'T A PHYSICAL THING ", ...], ...}}
            batch_size: 127
            libri: False
            loss: <Tensor at 0x7f4d32c86cc0>
            loss_info: {"frames": 7407, "loss": 1514.5911865234375, "simple_loss": 2210.47119140625, "pruned_loss": 409.3556213378906}
            prefix: "giga"
            cur_lr: 0.00020478380486557086
            valid_info: {"frames": 698248, "loss": 102681.625, "simple_loss": 173539.96875, "pruned_loss": 15911.64453125}
    run (train.py:1023)
        Arguments:
            rank: 0
            world_size: 8
            args: <Namespace at 0x7f4f89e0b250>
        Locals:
            params: {"best_train_loss": 0.1834969507367963, "best_valid_loss": 0.14670906683585203, "best_train_epoch": 20, ...}
            rng: <Random at 0x522b720>
            tb_writer: <SummaryWriter at 0x7f4eb487b8e0>
            device: <torch.device at 0x7f4f26b914d0>
            sp: <SentencePieceProcessor at 0x7f4f26b7c7c0>
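
To put a rough number on the "maybe 10% or 20%" above, one option is a small timing wrapper around the DataLoader iterator. This is only a sketch, not part of lhotse or train.py; timed is a name introduced here, and it measures the total time next() blocks the main thread, of which the sampler's _collect_cuts_in_buckets is the part that shows up in py-spy:

    import time

    def timed(loader, name="dl"):
        """Yield batches from `loader`, printing how long each next() call blocks
        the main thread (sampler work plus waiting for worker results)."""
        it = iter(loader)
        while True:
            t0 = time.perf_counter()
            try:
                batch = next(it)
            except StopIteration:
                return
            print(f"{name}: next() blocked for {time.perf_counter() - t0:.3f}s")
            yield batch

    # Hypothetical usage in train_one_epoch:
    #   for batch in timed(train_dl, "libri"):
    #       ...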

@danpovey (Collaborator, Author)

... one reason this has more of an effect on speed than you might expect is that different workers reach this point at different times, so with 8 training processes you spend 8x as long waiting for this to complete.
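
A toy illustration of that effect (my own sketch, not from the run above): if each of the 8 ranks hits its sampler stall at a different step, and every gradient sync waits for the slowest rank, the total stall time is roughly 8x what a single rank's stalls would cost on their own.

    # Toy simulation: staggered per-rank stalls vs. coinciding stalls.
    num_ranks, num_steps, stall_every, stall_time, step_time = 8, 80, 10, 1.0, 0.1

    total = 0.0
    for step in range(num_steps):
        per_rank = []
        for rank in range(num_ranks):
            t = step_time
            # each rank hits its sampler stall at a different (staggered) step
            if (step - rank) % stall_every == 0:
                t += stall_time
            per_rank.append(t)
        total += max(per_rank)  # the gradient all-reduce waits for the slowest rank

    coincided = num_steps * step_time + (num_steps // stall_every) * stall_time
    print(f"staggered stalls: {total:.1f}s, coinciding stalls: {coincided:.1f}s")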

@pzelasko (Collaborator)

With map-style datasets, the sampler is placed in the same process as the main training loop. I'll think about whether there's an easy way to offload this to a background thread.
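
For what it's worth, a minimal sketch of the background-thread idea (my own illustration, assuming the sampler is just an iterable of batches; ThreadedSampler is not an lhotse class): a small prefetcher thread pulls from the sampler and queues results, so the main loop rarely blocks on _collect_cuts_in_buckets.

    import queue
    import threading

    class ThreadedSampler:
        """Iterate `sampler` in a daemon thread, keeping up to `prefetch` results queued.
        Intended to be iterated exactly once."""

        def __init__(self, sampler, prefetch=4):
            self.q = queue.Queue(maxsize=prefetch)
            self._sentinel = object()
            self._thread = threading.Thread(target=self._fill, args=(sampler,), daemon=True)
            self._thread.start()

        def _fill(self, sampler):
            for item in sampler:
                self.q.put(item)
            self.q.put(self._sentinel)

        def __iter__(self):
            while True:
                item = self.q.get()
                if item is self._sentinel:
                    return
                yield item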

The other option is to use IterableDatasetWrapper(dataset, sampler) to move the sampler into the dataloading workers, but then you'd need to de-duplicate the data across GPU nodes and dataloading workers. Lhotse did that in the early days of the project, but because of dynamically sized batches it resulted in an uneven number of batches on each GPU node. Newer PyTorch versions can handle that with the Join([model]) context manager (link: https://pytorch.org/docs/stable/distributed.algorithms.join.html?highlight=join#torch.distributed.algorithms.Join).
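
For reference, the Join-based training loop looks roughly like this (a sketch assuming model is wrapped in DistributedDataParallel and each rank may run out of batches at a different step; compute_loss is a placeholder, not a real function from the script):

    from torch.distributed.algorithms.join import Join

    # Ranks that exhaust their dataloader early keep shadowing the collective
    # communications of the remaining ranks instead of hanging them.
    with Join([model]):
        for batch in train_dl:
            optimizer.zero_grad()
            loss = compute_loss(model, batch)  # placeholder for the real loss computation
            loss.backward()
            optimizer.step()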

@danpovey (Collaborator, Author)

BTW, if threads become a problem in the future, I would have thought it would be possible to simply spread the sampler's work across multiple minibatches by wrapping it in a generator and stepping it on every minibatch. But admittedly this is a little ugly and feels like a hack to do the work of a thread.
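
A rough sketch of that generator idea (my own illustration, not lhotse code; find_bucket and the chunk size are hypothetical):

    def collect_incrementally(cuts_iter, buckets, find_bucket, chunk_size=32):
        """Generator that files up to `chunk_size` cuts into duration buckets each
        time it is advanced, spreading the collection work across minibatches."""
        while True:
            for _ in range(chunk_size):
                cut = next(cuts_iter, None)
                if cut is None:
                    return
                buckets[find_bucket(cut.duration)].append(cut)
            yield  # hand control back to the training loop until the next step

    # Hypothetical usage: advance the collection work by one chunk per minibatch.
    #   collector = collect_incrementally(iter(cuts), buckets, find_bucket)
    #   for batch in train_dl:
    #       next(collector, None)
    #       ...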
