Release v0.9.0 · huggingface/datatrove

What's Changed

Fix CI test hangs in inference pipeline by @JoelNiklaus in #420
Remove 'file_path' from document metadata when checkpointing by @lewtun in #412
Speed up Benchmark Submission by @JoelNiklaus in #422
Standardize Paths by @JoelNiklaus in #423
Add benchmark mode by @JoelNiklaus in #424
Truncate context by @JoelNiklaus in #425
Mention optimized-parquet in readme by @lhoestq in #421
Fix Dataset name in test by @JoelNiklaus in #427
Readme and Dependencies by @JoelNiklaus in #428
Fix vLLM cache corruption on shared filesystems by @JoelNiklaus in #426
Miscellaneous changes to Inference Benchmarking by @JoelNiklaus in #429
Speed up Benchmark Submission by @JoelNiklaus in #430
Benchmark analysis improvements by @JoelNiklaus in #431
Fix ModuleNotFoundError when unpickling pipeline in SLURM jobs by @JoelNiklaus in #432
Extend Benchmarking Framework by @JoelNiklaus in #433
Simplify analysis by @JoelNiklaus in #434
Add vLLM server metrics to benchmark analysis by @JoelNiklaus in #435
Improve Benchmark Reliability and Add Features by @JoelNiklaus in #436
Optimize Benchmarking Infrastructure by @JoelNiklaus in #437
Fix memory unit mismatch in distributed Ray helpers by @JoelNiklaus in #439
Cleanup dependencies by @JoelNiklaus in #440
Benchmark quality of life by @JoelNiklaus in #441
Track inference time by @JoelNiklaus in #442
Add token estimation script for large HF datasets by @JoelNiklaus in #444
Finalize benchmark by @JoelNiklaus in #445
Add smol_data example for 100B dataset workflows by @JoelNiklaus in #449
Add skip_bad_requests option to InferenceRunner by @JoelNiklaus in #450
Make inference server startup policy configurable by @JoelNiklaus in #451
Handle HF Hub commit-race retries in DiskWriter by @JoelNiklaus in #448
Stabilize tokenizer and manager teardown for CI test shutdown by @JoelNiklaus in #452
Apply max_examples globally across parallel tasks by @JoelNiklaus in #453
Add hyperlinks to model and dataset in dataset card template by @JoelNiklaus in #454
Add lfs-verify to retryable HF Hub upload errors by @JoelNiklaus in #455
Glob all parquet files by @lewtun in #456
Fix SLURM CPU binding error in inference jobs by @JoelNiklaus in #457
Use public sharding API for streaming datasets by @JoelNiklaus in #458
Pin huggingface-hub to transformers-compatible range by @JoelNiklaus in #459
Feat/support dataset configs by @JoelNiklaus in #447
Refactor inference dataset card update flow by @JoelNiklaus in #460
Split full and changed-file style targets by @JoelNiklaus in #461
Add standalone FinePhrase inference example by @JoelNiklaus in #462
Isolate Xet cache per Slurm task by @JoelNiklaus in #465
Preserve checkpoint progress for skipped bad requests by @JoelNiklaus in #464
Harden HF writer retry behavior by @JoelNiklaus in #463
bump version and adapt authors by @JoelNiklaus in #466

New Contributors

@lewtun made their first contribution in #412

Full Changelog: v0.8.0...v0.9.0
