HPLT - High Performance Language Technologies
A space that combines petabytes of natural language data with large-scale model training
Repositories
Showing 10 of 17 repositories
- warc2text-runner Public
Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.
- monolingual-multilingual-instruction-tuning Public
Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca
- OpusCleaner Public
OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.