Refactor `CutSet.describe` to enable parallel statistics computation #1168

pzelasko · 2023-09-29T16:53:46Z

Resolves #1167. Below is a toy example showing how to use this with multiple cut sets (for simplicity, I'm replicating an existing one here).

from lhotse import CutSet
from lhotse.cut.describe import CutSetStatistics
from concurrent.futures import ProcessPoolExecutor


def work(cs):
    # Runs in a worker process: accumulate statistics for a single CutSet.
    return CutSetStatistics().accumulate(cs)

if __name__ == "__main__":
    cuts = CutSet.from_file("libri-train-5.jsonl.gz")

    print("Sequential")
    cuts.repeat(100).describe()

    print("Parallel")
    with ProcessPoolExecutor(8) as ex:
        stats = list(ex.map(
            work,
            [cuts] * 100,
        ))
    stats = stats[0].combine(*stats[1:])
    stats.describe()
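
For readers unfamiliar with the accumulate/combine pattern used above, here is a minimal standalone sketch of the idea. `DurationStats` and its fields are hypothetical and are not part of Lhotse; the real `CutSetStatistics` tracks far more than this, and its internals may differ. The sketch uses threads instead of processes only so that it runs without a `__main__` guard.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DurationStats:
    """Hypothetical mergeable accumulator (illustration only, not Lhotse API)."""

    count: int = 0
    total: float = 0.0
    longest: float = 0.0

    def accumulate(self, durations: Iterable[float]) -> "DurationStats":
        # Update the running statistics in place and return self for chaining.
        for d in durations:
            self.count += 1
            self.total += d
            self.longest = max(self.longest, d)
        return self

    def combine(self, *others: "DurationStats") -> "DurationStats":
        # Merge partial results into a new object; the inputs are left untouched.
        merged = DurationStats(self.count, self.total, self.longest)
        for o in others:
            merged.count += o.count
            merged.total += o.total
            merged.longest = max(merged.longest, o.longest)
        return merged

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def work(shard):
    # Each worker builds its own accumulator, so no state is shared.
    return DurationStats().accumulate(shard)


# Toy "shards" of durations in seconds, standing in for separate cut sets.
shards = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

with ThreadPoolExecutor(max_workers=3) as ex:
    partial = list(ex.map(work, shards))

parallel = partial[0].combine(*partial[1:])
sequential = DurationStats().accumulate(d for shard in shards for d in shard)
```

Because every tracked quantity here (count, sum, max) can be merged associatively, the combined partial results match a single sequential pass over the same data.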

Refactor CutSet.describe to enable parallel statistics computation

2aa176b

pzelasko added this to the v1.17 milestone Sep 29, 2023

pzelasko merged commit 81e5c4b into master Sep 29, 2023
10 checks passed

pzelasko deleted the feature/modular-describe branch September 29, 2023 17:20

flyingleafe pushed a commit to flyingleafe/lhotse that referenced this pull request Oct 11, 2023

Refactor CutSet.describe to enable parallel statistics computation (l…

bffffda

…hotse-speech#1168)
